Tornado F3 Conversion of Publications Data to AECMA 1000D - A Case Study   Table of contents   Indexes   Technology driving the SGML marketplace driving technology

  Lauren Wood
 

Getting to XML from HTML

 

Introduction

 Interest in XML is growing, particularly now that the major browser vendors are starting to pay attention to it. This is an opportunity for people to add the richness they need to their documents and get away from the restrictions of HTML. The best way to do this will depend on the systems you currently have in place, as well as on what you want to do with the documents.
 Using XML together with HTML is a new subject, and there aren't many answers yet. At the last Web conference in April, people were busy discussing some of the ramifications of trying to combine XML and HTML, wondering just how badly broken an HTML document had to be before you couldn't do anything useful with it, and how to attach HTML behaviours to arbitrary XML elements. Since there aren't many answers yet, this talk will mostly present some of the questions.
 
 

What XML? What HTML?

 What does it really mean, to get to XML from HTML? It could mean:
 
  • simply converting your HTML documents into XML syntax (adding the NET slash to empty elements, adding the appropriate PIs, etc.);
  • embedding chunks of XML into your HTML documents for use by some process; or
  • converting your HTML documents into a richer XML structure.

 There isn't much point to simply converting documents in HTML to the XML syntax for delivery over the Web to a standard HTML browser, although you may wish to convert an HTML document to XML syntax for some other application. There is a reason to convert an HTML document into XML syntax if you wish to embed chunks of XML in your document, and want to signal to the browser or other application that the entire document is to be treated as XML, or you want to use some of the features of XML that aren't available in HTML.
     

    SGML-based Systems

     If your production systems are in SGML, you won't have any problems, whether you convert to HTML on the fly or batch convert to HTML for Web delivery. You can continue to use your SGML authoring systems to produce documents, and then convert the documents to XML for delivery.
     
     

    DTDs

     In general, you don't need to convert your SGML DTDs to use XML syntax. Many applications will only need the document, not the DTD. The advantage is that you can keep using the more complex SGML syntax your system may rely on, such as marked sections. What matters is that the document coming out is XML-compliant, not that the system that produced it is XML-compliant. Even if you need to provide a DTD, because the processing application on the other side of the Web uses one, it may be possible to provide an XML-compliant DTD that matches the documents but isn't the DTD you use for authoring.
     
     

    What You Can Use in XML that You Can't in HTML

      Even though XML is almost SGML, there are some useful things that it doesn't have. HTML doesn't have these either, so I won't talk about them here. XML does finally give you access to some incredibly useful things that HTML doesn't have, and it always surprises me that HTML got along without them for so long:
     Text Entities
      HTML got along without these because the application made up for it. Now you can get rid of many of your server-side scripts, server-side includes, and structured comments, because you can use real text entities. If you don't know what those HTML terms mean, with XML you don't need to learn them.
     NOTATION
      HTML has helper applications, plug-ins, and the OBJECT, EMBED, and APPLET elements, all of which subsume some NOTATION functionality. NOTATION is cleaner, and it doesn't rely on the MIME type, which in practice is determined by the file extension.
     CDATA marked sections
     If we had these in HTML, it would have made using the SCRIPT element, which has a CDATA content model, much easier. Trying to explain CDATA element rules to people when they want to use document.write("<P>this is a paragraph</P>"); has not been easy.
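
     As a minimal sketch of the text entity and CDATA points above, the following Python snippet parses a small XML document with the standard library parser. The document itself, the entity name `co`, and its replacement text are invented for illustration and don't come from any real system.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity name "co" and the element names are
# made up for this example.
DOC = """<!DOCTYPE page [
  <!ENTITY co "Example Widgets Ltd.">
]>
<page>
  <p>Copyright &co;</p>
  <script><![CDATA[document.write("<P>this is a paragraph</P>");]]></script>
</page>"""

root = ET.fromstring(DOC)

# The parser expands the text entity declared in the internal subset.
paragraph = root.find("p").text

# Inside a CDATA marked section, "<" and "&" are ordinary characters, so
# the script body needs no escaping.
script = root.find("script").text

print(paragraph)
print(script)
```

     The entity is declared once and can be referenced anywhere in the document, which is the job that server-side includes often do for HTML pages.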
     

    Converting HTML

     
     

    Valid HTML Documents

     Getting to XML from a well-written HTML document (such as you get when you use an HTML editor that is based on an SGML editor) is easy. HTML documents typically don't use a DTD, so applications (such as browsers) that process or present those documents don't usually expect or require one. So you can turn your valid HTML documents into XML by adding the elements you need, and converting them into XML syntax. The changes needed include:
     
  • explicitly writing all implied tags into the document
  • quoting all attribute values
  • not using minimization
  • adding the NET slash to EMPTY elements (e.g. <HR/>)
  • declaring the character entities used (other than the predefined &lt;, &gt;, &amp;, &quot; and &apos;)
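
     Some of the changes listed above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not a complete converter: the class name `XmlSyntaxWriter` and the function `to_xml_syntax` are invented here, implied tags are not supplied (that needs knowledge of the DTD), comments and declarations are dropped, and names come out in lower case because that is how `html.parser` reports them. Character entity references are replaced by the characters themselves rather than declared.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements declared EMPTY in the HTML 4 DTDs.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}


class XmlSyntaxWriter(HTMLParser):
    """Re-serialize parsed HTML events using XML syntax (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into the
        # characters themselves, so no entity declarations are needed.
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            # A minimized attribute (e.g. COMPACT) has no value; expand it
            # to the name="name" form. quoteattr() adds the quotes.
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )
        closer = "/>" if tag in EMPTY_ELEMENTS else ">"
        self.pieces.append(f"<{tag}{rendered}{closer}")

    def handle_endtag(self, tag):
        # EMPTY elements were already closed by the slash in the start tag.
        if tag not in EMPTY_ELEMENTS:
            self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape "&", "<" and ">" in character data.
        self.pieces.append(escape(data))


def to_xml_syntax(html):
    writer = XmlSyntaxWriter()
    writer.feed(html)
    writer.close()  # flush any buffered text
    return "".join(writer.pieces)


converted = to_xml_syntax("<UL COMPACT><LI>caf&eacute; &amp; bar<BR>next</LI></UL>")
print(converted)
ET.fromstring(converted)  # raises ParseError if the result is not well-formed
```

     The final `ET.fromstring` call is a cheap check that the output really is well-formed; it says nothing about validity against a DTD.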
     

    Broken HTML Documents

     With broken HTML documents the problem is much larger. XML requires well-formed documents. Broken HTML documents typically have many problems that preclude any automatic conversion into XML. Examples of these are:
     
  • overlapping tags: <B> this is <I> bold </B> italic </I>
  • attribute values that lack quotes in the worst places: <A HREF="xxx.html>....
  • comments that aren't meant to be comments: <!-- everyone -- including big sites -- does this -->
  • misspelled element or attribute names: is that <ADDRESS> or <ADRESS>?

     Fixing these problems isn't something that is easy to automate. You can write a filter that attempts to automate the process, but in general you will need to load the document into a good HTML editor (one which helps you find and fix problems) and fix the errors before converting into XML.
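
     Detecting these problems is easier than fixing them. As a rough sketch, the examples above can be run through an XML parser to see which ones it rejects. The helper `is_well_formed` and the `<doc>` wrapper element are assumptions made for this illustration; the fragments are the ones from the list, with end tags added where needed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML inside a wrapper element."""
    try:
        # The <doc> wrapper only gives the fragment a single root element.
        ET.fromstring(f"<doc>{fragment}</doc>")
        return True
    except ET.ParseError:
        return False


samples = {
    "overlapping tags": "<B> this is <I> bold </B> italic </I>",
    "missing quote": '<A HREF="xxx.html>....</A>',
    "double hyphen in comment": "<!-- everyone -- including big sites -- does this -->",
    "misspelled element": "<ADRESS>somewhere</ADRESS>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

     The first three fragments are rejected. The misspelled element name passes, because a well-formedness check knows nothing about which element names are allowed; catching it requires validating against a DTD, or a human looking at the page.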
     

    Embedding XML into HTML

     Once you have a well-written HTML document, there are various ways of adding XML to it. You can sprinkle new elements through the HTML document, to be used by applications that recognize them, or you can apply style sheets to them.
     There are also proposals for allowing specialised applications to work on a designated part of an HTML document. This part of the document is in XML, the surrounding document is in HTML. The exact mechanisms for passing off the XML part of a document to another application while allowing the rest of the HTML document to be rendered by a standard Web browser have not yet been worked out. In an SGML browser that understands the XML/SGML NOTATION mechanism, this is easy. In a Web browser that works with plug-ins or helper applications on the basis of MIME type instead, it is not.
     
     

    Web Collections

     Web Collections is a way of attaching metadata to a document. At the time of writing of these proceedings, the DTD had not been finalized. It seems probable that the final syntax will be XML. Some of the issues about the syntax and how it is to be signalled to the browser should be settled by the time I give this talk.
     
     

    Mathematics

     The W3C HTML-Mathematics Working Group has decided to use XML as the core syntax for putting mathematics into Web browsers. It is not yet clear how this will interact with the HTML in the rest of the page. This group may also have settled some of the issues by the time I give this talk.
     

    Worrisome Issues

     
     

    CGI-bin Programs

     There is no reason why cgi-bin programs should not work equally well with XML and HTML documents, since these programs need no particular interaction with the browser. The person who wrote a cgi-bin program to deal with HTML documents should be able to change it to work with XML documents and deliver the same results. It is not as straightforward if you wish to use browser functionality. This is particularly the case with forms. The syntax by which an arbitrary element can tell an HTML browser that it should be treated as one of the form control elements hasn't been decided upon, and probably won't be for some weeks (hopefully not months!). I expect that those browsers that support XML will recognize the form control element names, though there will be namespace issues to consider. There are SGML browsers that can treat arbitrary elements as form control elements, so it has been proven possible. What remains is to standardize the method for doing so across Web browsers, so you don't need to keep different versions of your page around for different browsers.
     
     

    Scripting

     Scripting languages such as JavaScript and VBScript are currently only applicable to HTML elements. The W3C Document Object Model Working Group is working to extend the scripting and object model to encompass XML as well as HTML, in a language- and platform-neutral way. This work is scheduled to take until the end of 1998. The specifications delivered will be divided into levels of functionality. When they are implemented, you will be able to access your HTML and XML documents with the same API in any application that supports the Object Model.
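
     To give a feel for what "the same API" means, here is a small sketch using Python's `xml.dom.minidom`, one implementation of the W3C DOM interfaces. Both documents and the helper `link_targets` are invented for illustration; the first document is HTML written in XML syntax, the second uses an arbitrary element vocabulary.

```python
from xml.dom.minidom import parseString

# Hypothetical documents; element and attribute names are illustrative only.
html_as_xml = '<html><body><a href="one.html">first</a></body></html>'
custom_xml = '<catalog><a href="two.xml">second</a></catalog>'


def link_targets(markup):
    """Collect (href, text) pairs from every <a> element via DOM calls."""
    document = parseString(markup)
    return [
        (node.getAttribute("href"), node.firstChild.data)
        for node in document.getElementsByTagName("a")
    ]


print(link_targets(html_as_xml))
print(link_targets(custom_xml))
```

     The function does not care which vocabulary the document uses: the same `getElementsByTagName` and `getAttribute` calls work on both.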
     

    Conclusions

     It is already obvious that using XML instead of HTML is the right answer for some documents and some applications. We don't have many of the answers we need, though I believe we have at least started to ask some of the right questions. The next six months are going to be filled with discussions of these and other related questions. Maybe by SGML 97 we will have some practical solutions.
