Monday, February 1, 2010

Commenting on namespaces

James Clark commented on namespaces recently (well, actually, it was last month). I picked up on it in Robin Cover's newsletter. In the blog, Clark and a number of well-known standards participants debate the mistakes that were made in the language architecture of the XML Namespaces rec.

One issue is that if, in a moment of inspiration, I decide to invent an XML element name, foo, and at some time before or after my glorified creative moment you also create a foo for your vocabulary, then who is to say that my foo is not the same as your foo? More to the point, if and when both of our foos are found in the same content, how are they to be distinguished for processing?

Of course, that's the main case behind XML Namespaces. The answer to the first question is, you say which foo is yours. That's where prefixes come in syntactically, and it is observed that significant complexity arises because programs don't actually get to work with the prefixes, but instead must work with projections of the prefixes onto URIs in the data model. The projection is not 1:1: within one document the same prefix can be bound to different URIs in different scopes, and different prefixes can be bound to the same URI, so a relation representing the mapping (prefixes onto URIs) would not, in general, be invertible. Programs have to track a lot of state to juggle what is really going on when, for instance, nested elements reuse prefixes previously declared in the same document, or when script language instructions are embedded within attribute values and refer to namespace-qualified names implicitly through a fixed prefix (syntactically a QName).
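To make the state tracking concrete, here is a minimal sketch in Python using the standard library's ElementTree. The document, the `urn:example:` URIs, and the function name `trace_prefix` are my own inventions for illustration. It keeps a stack of prefix declarations and records, for each element, which URI the prefix `p` is bound to at that point.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix "p" is rebound inside <inner>.
DOC = b"""<root xmlns:p="urn:example:mine">
  <p:foo/>
  <inner xmlns:p="urn:example:yours">
    <p:foo/>
  </inner>
  <p:foo/>
</root>"""


def trace_prefix(xml_bytes, prefix):
    """Return (expanded tag, URI bound to `prefix`) for each element, in order."""
    scopes = []  # stack of (prefix, uri) declarations currently in scope
    trace = []
    events = ("start-ns", "end-ns", "start")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            scopes.append(payload)  # payload is a (prefix, uri) tuple
        elif event == "end-ns":
            scopes.pop()            # the innermost declaration goes out of scope
        else:
            # Later declarations shadow earlier ones when the stack becomes a dict.
            trace.append((payload.tag, dict(scopes).get(prefix)))
    return trace


for tag, uri in trace_prefix(DOC, "p"):
    print(tag, "->", uri)
```

The three `p:foo` elements look identical in the markup, but the middle one resolves to a different URI than the other two. A program that only remembered "p means mine" would get it wrong.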

But what could avoid this state of affairs? Users of XML now routinely process multi-gigabyte content streams. Is it reasonable to expect a program to somehow know anything meaningful about all of the element names used in such a stream prior to adding a new fragment somewhere along the line? To me this implies a considerable amount of look-ahead, which is obviated by namespaces' allowance of just-in-time introduction of nested prefix declarations. On the other hand, a stream thus modified would have to go through a linear scan in order to determine all the namespaces to which it refers, instead of accessing, e.g., a header, catalog, or external dictionary of some kind.
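The linear scan might look something like the following sketch, which uses Python's standard SAX parser so that no tree is built in memory. The class and function names (`NamespaceCollector`, `scan_namespaces`) are hypothetical, and the result is only known once the whole stream has been read, which is exactly the cost I mean.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class NamespaceCollector(ContentHandler):
    """Collect every prefix declaration seen anywhere in the stream."""

    def __init__(self):
        super().__init__()
        self.bindings = {}  # prefix -> set of URIs it was ever bound to

    def startPrefixMapping(self, prefix, uri):
        # SAX reports the default namespace with prefix None; store it as "".
        self.bindings.setdefault(prefix or "", set()).add(uri)


def scan_namespaces(stream):
    """Read the stream once, front to back, and return all prefix bindings."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # enables prefix-mapping events
    handler = NamespaceCollector()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.bindings


sample = io.BytesIO(b'<a xmlns:p="urn:x"><b xmlns:p="urn:y" xmlns="urn:z"/></a>')
print(scan_namespaces(sample))
```

A header or catalog would give the same dictionary up front, but someone would have to keep it in sync with every fragment added later.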

I faced an admittedly simpler problem of resolving part numbers from multiple suppliers. Actually, they were businesses our parent company had swallowed up in fire sales. The point is we had all kinds of identifiers that might look like one another. An experienced tech might tell them apart by careful inspection, but the identifiers weren't designed to be distinct. Certainly a Web application couldn't tell the difference based on syntax (they were just numbers). The solution was to add a header to the payload, to specify the authority responsible for creating the identifier. I called it a "registration authority".

The registration authority was composed of a few pieces. The first piece gave the organization responsible for maintaining whatever registries there were. The second gave an abstract domain... in effect a semantic specification of the type of the identifier, with respect to the organization. The rest of the fields gave the domain-specific identifier, split into whatever fields it might contain.
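As a rough illustration (not the original implementation, which I am not reproducing here), the header could be modeled like this in Python. The class names, field names, and the supplier names "Acme" and "Globex" are all made up for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RegistrationAuthority:
    organization: str  # who maintains the registry
    domain: str        # abstract domain: what kind of identifier this is


@dataclass(frozen=True)
class QualifiedIdentifier:
    authority: RegistrationAuthority
    fields: Tuple[str, ...]  # the domain-specific identifier, split into its parts


# Two part numbers that look identical on the wire (hypothetical suppliers).
acme_part = QualifiedIdentifier(RegistrationAuthority("Acme", "part-number"), ("10442",))
globex_part = QualifiedIdentifier(RegistrationAuthority("Globex", "part-number"), ("10442",))

print(acme_part.fields == globex_part.fields)  # the bare numbers collide
print(acme_part == globex_part)                # the qualified identifiers do not
```

The bare number plays the role of the local name, and the authority plays the role of the namespace URI.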

My structure was not particularly original, except that it identified the registration authority and the domain it established, as a meta-object in its own right. This to me is the true underlying nature of an XML Namespace: it is a relation between the organization (the registration authority) and the abstract domain (the 'name space') for a given XML vocabulary.

One of the posters (John Cowan?) noted that in LMNL, they restricted namespace prefixes so that the relation is bijective across the entire document. Effectively, this ensures that the namespace relation is an invertible function no matter which element context you consider. This is terribly difficult to accomplish with indeterminate streams of markup content, since it requires complete knowledge of the prefixes used across the entire document instance "space".
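Checking whether a given XML document happens to satisfy that kind of restriction is straightforward once all the declarations are in hand. The sketch below is my own illustration in Python (the function names are hypothetical, and this says nothing about how LMNL itself enforces the rule). It collects every prefix declaration and tests that each prefix has exactly one URI and each URI exactly one prefix.

```python
import io
import xml.etree.ElementTree as ET


def prefix_uri_pairs(xml_bytes):
    """Collect the set of (prefix, uri) declarations made anywhere in the document."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    return {pair for _event, pair in events}


def is_bijective(pairs):
    """True if no prefix has two URIs and no URI has two prefixes."""
    pairs = set(pairs)
    prefixes = {prefix for prefix, _uri in pairs}
    uris = {uri for _prefix, uri in pairs}
    return len(pairs) == len(prefixes) == len(uris)


ok = b'<a xmlns:p="urn:x"><b xmlns:q="urn:y"/></a>'
reused = b'<a xmlns:p="urn:x"><b xmlns:p="urn:y"/></a>'
print(is_bijective(prefix_uri_pairs(ok)))
print(is_bijective(prefix_uri_pairs(reused)))
```

Note that the check needs every declaration in the document before it can answer, which is the "complete knowledge" problem in miniature.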

Technically, the unique naming can be achieved by mangling of the prefix. For instance, any good markup API will compute ordinal numbers representing element positions, and no other element within the content will ever have precisely the same position. So users of very large streams could certainly ensure that no prefix was ever reused, and they need not look ahead to see what other prefixes might cause a conflict. But the element content would have to be mangled too, if it contained references to that prefix. I hope you can see where I'm going with this. It is a small step, and one I'm sure many developers have been taking for many years, to build a dictionary of functional mappings between the prefixes and the URIs.
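Here is a minimal sketch of that dictionary in Python, assuming a mangling scheme of my own choosing: the declared prefix, an underscore, and the ordinal position of the declaring element in document order. The function name and the sample document are hypothetical, and the sketch only builds the mapping. It does not rewrite the markup or any prefix references inside element content.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document in which the prefix "p" is reused.
DOC = b"""<root xmlns:p="urn:example:mine">
  <p:foo/>
  <inner xmlns:p="urn:example:yours">
    <p:foo/>
  </inner>
</root>"""


def mangled_prefix_map(xml_bytes):
    """Map each declaration to a unique key: prefix + '_' + element ordinal."""
    mapping = {}
    position = 0  # ordinal of the most recently opened element
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns", "start"))
    for event, payload in events:
        if event == "start":
            position += 1
        else:
            prefix, uri = payload
            # A start-ns event arrives just before its declaring element opens,
            # so that element's ordinal is position + 1.
            mapping[f"{prefix}_{position + 1}"] = uri
    return mapping


print(mangled_prefix_map(DOC))
```

Because every key embeds a position that no other element shares, the mapping from mangled prefix to URI is a function over the whole document, and it can be built in a single forward pass with no look-ahead.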

So it doesn't appear insurmountable to require that the prefix-URI mapping be a function across the entire content, but it might make it considerably more difficult to compose fragments.
