Friday, August 7, 2009

HTML Tag Soup: Breaking The Web, One Tag At A Time

The ability to parse raw-in-the-wild HTML is of very great value, like composting food garbage or processing sewage to produce a usable product. But you wouldn't want to compost your evening meal or send fresh drinking water through a sewage plant, and you shouldn't expect an HTML parser to look at well-formed XML markup and give you the structured document you expect.


Tag-soup parsing is based on a set of one-off exceptions, specifically designed to produce predictable results when it is given HTML crap. Well-formed XHTML is not HTML soup, but it looks enough like it that you can pass it to the HTML5 parser and get something out. The problem is that what you get out is not what you said you wanted. So tag-soup browsing of well-formed markup is "WYWINWYG": What You Want Is Not What You Get.

For Instance

If you write:

<div id="foo" style="border: thin solid red;" />
<div id="bar" style="border: thin solid green;">
<p>Paragraph inside green border, but NOT inside red border?</p>
</div>

You do not get two sibling DIVs; instead, the first DIV becomes the parent of the second. That's all by the soupy rules, now codified by HTML5:

Paragraph inside green border, but NOT inside red border?

One might argue that the first div shouldn't be empty. That's not the point. The point is that the markup is compatible with XHTML syntax, yet it means one thing under XML parsing and quite another under HTML parsing.
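The difference can be sketched with nothing but the Python standard library. This is a rough illustration, not an HTML5 parser: `xml.etree.ElementTree` does the XML reading, while the hypothetical `SoupTree` class only imitates the one soupy rule at issue here (the trailing slash is ignored on non-void elements). The wrapping `<body>` element, the shortened markup, and the helper names `xml_parents` and `html_parents` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The post's example, trimmed and wrapped in one root so it is well-formed XML.
MARKUP = (
    '<body>'
    '<div id="foo" />'
    '<div id="bar"><p>text</p></div>'
    '</body>'
)

# HTML void elements: the only ones that never take children in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def xml_parents(markup):
    """Map each element (by id, else tag name) to its parent, per XML rules."""
    root = ET.fromstring(markup)
    parents = {}
    for parent in root.iter():
        for child in parent:
            parents[child.get("id", child.tag)] = parent.get("id", parent.tag)
    return parents


class SoupTree(HTMLParser):
    """Toy tree builder (an assumption-laden sketch, not the HTML5 algorithm)."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.parent_of = {}   # element name -> parent name

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("id", tag)
        self.parent_of[name] = self.stack[-1] if self.stack else None
        if tag not in VOID:
            self.stack.append(name)

    def handle_startendtag(self, tag, attrs):
        # Imitate HTML parsing: "/>" does not close a non-void element.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Naive: close whatever is innermost, without matching tag names.
        if self.stack:
            self.stack.pop()


def html_parents(markup):
    """Map each element to its parent under the imitated soupy rule."""
    builder = SoupTree()
    builder.feed(markup)
    builder.close()
    return builder.parent_of


print("XML :", xml_parents(MARKUP))
print("HTML:", html_parents(MARKUP))
```

Under the XML reading, `bar` is a child of `body` and a sibling of `foo`; under the imitated HTML reading, `bar` ends up nested inside `foo`. A real HTML5 tree builder has many more rules, so treat this only as a demonstration of this single divergence.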

This, by the way, was true in HTML parsers before HTML5 -- I'm not picking on the Draft. We all know that sewage treatment plants provide a fundamental value. I just don't want people shoveling crap out the door and telling me that, according to some W3C Recommendation, it is food.
