| Wood Lauren |
Getting to XML from HTML |
Introduction |
| Interest in XML is growing, particularly now that the major browser vendors are paying attention to it. This gives people an opportunity to add the richness they need to their documents and to get away from the restrictions of HTML. The best way to do this depends on the systems you currently have in place, as well as on what you want to do with the documents. |
| Using XML together with HTML is a new subject, and there aren't many answers yet. At the last Web conference in April, people were busy discussing some of the ramifications of trying to combine XML and HTML, wondering just how badly broken an HTML document had to be before you couldn't do anything useful with it, and how to attach HTML behaviours to arbitrary XML elements. Since there aren't many answers yet, this talk will mostly present some of the questions. |
What XML? What HTML? |
| What does it really mean, to get to XML from HTML? It could mean: |
| There isn't much point to simply converting documents in HTML to the XML syntax for delivery over the Web to a standard HTML browser, although you may wish to convert an HTML document to XML syntax for some other application. There is a reason to convert an HTML document into XML syntax if you wish to embed chunks of XML in your document, and want to signal to the browser or other application that the entire document is to be treated as XML, or you want to use some of the features of XML that aren't available in HTML. |
SGML-based Systems |
| If your production systems are in SGML, you won't have any problems, whether you convert to HTML on the fly or batch convert to HTML for Web delivery. You can continue to use your SGML authoring systems to produce documents, and then convert the documents to XML for delivery. |
DTDs |
| In general, you don't need to convert your SGML DTDs to use XML syntax. Many applications will only need the document, not the DTD. The advantage is that you can use the more complex SGML syntax that may be in your system, such as marked sections. What matters is that the document coming out is XML-compliant, not that the system that produced it is XML-compliant. Even if you need to provide a DTD, because the processing application on the other side of the Web uses the DTD, it may be possible to provide an XML-compliant DTD that matches the documents, but isn't the DTD you use for authoring. |
What You Can Use in XML that You Can't in HTML |
| Even though XML is almost SGML, there are some useful things that it doesn't have. HTML doesn't have these either, so I won't talk about them here. XML does finally give you access to some incredibly useful things that HTML doesn't have, and it always surprises me that HTML got along without them for so long: |
| Text Entities |
| HTML got along without these because the application made up for it. Now you can get rid of many of your server-side scripts, server-side includes, and structured comments, because you can use real text entities. If you don't know what the preceding HTML terms mean, XML means you don't need to learn them. |
| NOTATION |
| HTML has helper applications, plug-ins, and the OBJECT, EMBED, and APPLET elements, all of which subsume some NOTATION functionality. NOTATION is cleaner. NOTATION doesn't rely on the MIME type, which in practice is determined by the file extension. |
| CDATA marked sections |
| If we had these in HTML, using the SCRIPT element, which has a CDATA content model, would have been much easier. Trying to explain CDATA element rules to people when they want to use document.write("<P>this is a paragraph</P>"); has not been easy. |
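The three features above can be tried out with any XML parser. The following is a minimal sketch using Python's standard library; the entity names `footer` and `logo`, the notation name `gif`, its system identifier, and the element names are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

# Hypothetical document: every name declared here is illustrative only.
DOC = """<!DOCTYPE page [
  <!ENTITY footer "Maintained by the Web team.">
  <!NOTATION gif SYSTEM "image-viewer">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
]>
<page>
  <para>&footer;</para>
  <script><![CDATA[document.write("<P>this is a paragraph</P>");]]></script>
</page>"""

# Text entity: the parser expands &footer; where it is referenced.
# CDATA marked section: the markup characters inside stay plain text.
root = ET.fromstring(DOC)
footer_text = root.find("para").text
script_text = root.find("script").text

# NOTATION: the declaration is reported to the application, which decides
# how to handle data in that notation (no MIME type is involved).
notations = {}

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.Parse(DOC, True)

print(footer_text)
print(script_text)
print(notations)
```

Note that the `<P>` inside the CDATA section does not become a child element of `script`; the parser hands it over as character data, which is what a script author usually wants.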
Converting HTML |
Valid HTML Documents |
| Getting to XML from a well-written HTML document (such as you get from an HTML editor that is based on an SGML editor) is easy. HTML documents typically don't use a DTD, so applications (such as browsers) that process or present them don't usually expect or require one. You can therefore turn your valid HTML documents into XML by adding the elements you need and converting everything into XML syntax. The changes needed include: |
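As an illustration only, here is a minimal sketch of a filter that rewrites a valid HTML fragment in XML syntax. It covers three typical syntactic changes (quoting every attribute value, expanding minimized attributes, and marking empty elements with `/>`) and is not a complete list of what a real conversion needs. It assumes the input already has explicit end tags for all non-empty elements; the class name, the function name, and the `EMPTY` set are hypothetical. Comments and the document type declaration are dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements with no content in HTML (an illustrative, partial set).
EMPTY = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}

class XmlSyntaxWriter(HTMLParser):
    """Re-serializes parsed HTML events using XML syntax rules."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            # A minimized attribute such as CHECKED arrives with value None.
            parts.append(name + "=" + quoteattr(name if value is None else value))
        close = "/>" if tag in EMPTY else ">"
        self.out.append("<" + " ".join(parts) + close)

    def handle_endtag(self, tag):
        # Empty elements were already closed in handle_starttag.
        if tag not in EMPTY:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(escape(data))

def html_to_xml_syntax(html_text):
    writer = XmlSyntaxWriter()
    writer.feed(html_text)
    writer.close()
    return "".join(writer.out)

converted = html_to_xml_syntax(
    "<P ALIGN=center>Line one<BR>line two <INPUT TYPE=checkbox CHECKED></P>"
)
print(converted)
ET.fromstring(converted)  # raises ParseError if the result is not well-formed
```

Python's `HTMLParser` lowercases element and attribute names as it reads them, so the output uses one consistent case, which matters because XML names are case-sensitive.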
Broken HTML Documents |
| With broken HTML documents the problem is much larger. XML requires well-formed documents. Broken HTML documents typically have many problems that preclude any automatic conversion into XML. Examples of these are: |
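| Overlapping elements: |
<B> this is <I> bold </B> italic </I>
| Unterminated attribute values: |
<A HREF="xxx.html>....
| Comments that contain "--": |
<!-- everyone -- including big sites -- does this -->
| Misspelled element names: |
Is that <ADDRESS> or <ADRESS>?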
| Fixing these problems isn't something that is easy to automate. You can write a filter that attempts to automate the process, but in general you will need to load the document into a good HTML editor (one which helps you find and fix problems) and fix the errors before converting into XML. |
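Finding the problems, as opposed to fixing them, is easy to automate: hand the markup to an XML parser and see whether it complains. The sketch below does this for the examples above, each wrapped in a `P` element so that it has a single root; the function name and the sample labels are hypothetical.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(markup):
    """Return the parser's error message, or None if the markup is well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None

samples = {
    "overlap": "<P><B> this is <I> bold </B> italic </I></P>",
    "open quote": '<P><A HREF="xxx.html>....</A></P>',
    "comment": "<P><!-- everyone -- including big sites -- does this --></P>",
    "misspelled": "<P><ADRESS>Is that right?</ADRESS></P>",
    "clean": "<P><B>this is <I>bold italic</I></B></P>",
}

results = {label: well_formedness_error(markup) for label, markup in samples.items()}
for label, error in results.items():
    print(label, "->", error or "well-formed")
```

The misspelled element name passes this check: `ADRESS` is a perfectly well-formed element, and only validation against a DTD, or a human reader, will notice that it was meant to be `ADDRESS`.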
Embedding XML into HTML |
| Once you have a well-written HTML document, there are various ways of adding XML to it. You can sprinkle new elements through the HTML document, which are used by the applications that recognize these elements, or apply style sheets to them. |
| There are also proposals for allowing specialised applications to work on a designated part of an HTML document. This part of the document is in XML, the surrounding document is in HTML. The exact mechanisms for passing off the XML part of a document to another application while allowing the rest of the HTML document to be rendered by a standard Web browser have not yet been worked out. In an SGML browser that understands the XML/SGML NOTATION mechanism, this is easy. In a Web browser that works with plug-ins or helper applications on the basis of MIME type instead, it is not. |
Web Collections |
| Web Collections is a way of attaching metadata to a document. At the time of writing of these proceedings, the DTD had not been finalized. It seems probable that the final syntax will be XML. Some of the issues about the syntax and how it is to be signalled to the browser should be settled by the time I give this talk. |
Mathematics |
| The W3C HTML-Mathematics Working Group has decided to use XML as the core syntax for putting mathematics into Web browsers. It is not yet clear how this will interact with the HTML in the rest of the page. This group may also have settled some of the issues by the time I give this talk. |
Worrisome Issues |
CGI-bin Programs |
| There is no reason why cgi-bin programs should not work equally well with XML and HTML documents, since these programs need no particular interaction with the browser. The person who wrote a cgi-bin program to deal with HTML documents should be able to change it to work with XML documents and deliver the same results. It is not as straightforward if you wish to use browser functionality, particularly forms. The syntax by which an arbitrary element can tell an HTML browser that it should be treated as one of the form control elements hasn't been decided upon, and probably won't be for some weeks (hopefully not months!). I expect that those browsers that support XML will recognize the form control element names, though there will be namespace issues to consider. There are SGML browsers that can treat arbitrary elements as form control elements, so it has been shown to be possible. What remains is to standardize the method for doing so across Web browsers, so that you don't need to keep different versions of your page around for different browsers. |
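To make the first point concrete, here is a minimal sketch of a CGI-style program that can answer the same request in either HTML or XML. The query parameter `name`, the element names `greeting` and `to`, and the use of `text/xml` as the media type are assumptions made for illustration, not something this paper specifies.

```python
import html
import os
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

def build_response(query_string, as_xml):
    """Build a CGI response (header, blank line, body) for one request."""
    params = parse_qs(query_string)
    name = params.get("name", ["world"])[0]  # hypothetical parameter
    if as_xml:
        root = ET.Element("greeting")        # hypothetical element names
        ET.SubElement(root, "to").text = name
        body = ET.tostring(root, encoding="unicode")
        content_type = "text/xml"            # assumed media type
    else:
        body = "<HTML><BODY><P>Hello, " + html.escape(name) + "</P></BODY></HTML>"
        content_type = "text/html"
    return "Content-Type: " + content_type + "\r\n\r\n" + body

if __name__ == "__main__":
    # A Web server passes the query string in this environment variable.
    print(build_response(os.environ.get("QUERY_STRING", ""), as_xml=True))
```

Only the serialization branch differs between the two outputs; reading the request is identical, which is why the server side is the least worrying part of the move to XML.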
Scripting |
| Scripting languages such as JavaScript and VBScript are currently only applicable to HTML elements. The W3C Document Object Model Working Group is working to extend the scripting and object model to encompass XML as well as HTML, in a language- and platform-neutral way. This work is scheduled to take until the end of 1998. The specifications delivered will be divided into levels of functionality. When implemented, you will be able to access your HTML and XML documents with the same API in any application that supports the Object Model. |
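The idea of one API for arbitrary elements can be sketched with Python's `xml.dom.minidom` module, which implements a subset of the W3C DOM interfaces. This shows the style of access only and is not a statement about what the Working Group's specifications will contain; the `order` and `item` elements and the `sku` attribute are hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical XML document with elements that no HTML browser knows about.
doc = parseString(
    "<order><item sku='a1'>Widget</item><item sku='b2'>Gadget</item></order>"
)

# The same calls work whatever the elements are called.
items = doc.getElementsByTagName("item")
skus = [item.getAttribute("sku") for item in items]
names = [item.firstChild.data for item in items]

print(skus)
print(names)
```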
Conclusions |
| It is already obvious that using XML instead of HTML is the right answer for some documents and some applications. We don't have many of the answers we need, though I believe we have at least started to ask some of the right questions. The next six months are going to be filled with discussions of these and other related questions. Maybe by SGML 97 we will have some practical solutions. |