Thursday, May 28, 2009

Comparing schemas and versioning

A friend asked me about comparing schemas, and I got to thinking about the question of versioning.

This past week I was reading over the 2008 symposium presentations from Balisage, the conference B.Tommie Usdin chairs in Montreal, and the versioning symposium.

A few people have mentioned this as an area in XML that needs work. The question of comparing schemas is really where the rubber hits the road on this issue.

I think of it this way: the rules of XML imply a kind of symmetry, that some bits are irrelevant -- up to parsing of the DTD under the rules. So introducing white space inside tags is just like the rotation of a paper plate -- it is just a rigid transformation of the same DTD object in some "DTD space".

Ditto for other variations that XML says are equivalent, like single versus double quotes and multiple attributes in an ATTLIST versus an ATTLIST declaration for each attribute. In mathematical terms, XML defines a set of "equivalence classes" among parsed DTDs, to wit, two DTDs that happen to parse to the same lexical specification have this same lexical form as a representative, and thus are in the same equivalence class.

Now, if some process were to normalize the syntactic expressions between these two DTDs, that gets you close to being able to use the syntactical form if there are few differences, but does little if one is trying to compare really big DTDs with lots of variations between them; or if one is trying to do a structural analysis between DTDs which have a lot of overlap of the abstract structures but do not share labels. In the first case one can always eyeball it. In the second case, the number of syntactic differences may be enough to make the process tedious and prone to overlooking something important.

In the third case, the DTDs may not even share a common lineage but may represent convergent evolution, so that a person may be able to spot some structural similarities but differences in namespace prefixes and local names may render a syntactic diff intractable. I know of no way to make this completely tractable, but one observation that could be made is that if two elements represent highly correlated concepts, they can be viewed much like necklaces differing by a few beads.

If one were to substitute the same label for the respective element names, and another label for each attribute with the same data type, one might get a better idea of the similarity between elements. It might even be possible to attach a weight to the degree of similarity, and repeat the comparison for each pair of elements between the two DTDs, and produce a ranking for the best fit mappings of the elements. But this would be a relatively poor heuristic guess at best. I can imagine some student trying to do this for an academic project, but I don't think it would be practical for mapping in the real world except for schemas that already exhibit a very high degree of similarity (and thus easy to compare using syntactic diff methods).

No comments: