XML-Enabling Enterprise Databases to Simplify Internet Applications   Table of contents   Indexes   Informix and XML

 

Do we need DTDs?

 Peter Murray-Rust
  Director, Virtual School Molecular Sciences, University of Nottingham, UK, and Co-director, Virtual HyperGlossary
   
       
 
Biographical notice:
 
I am a chemist, motivated by the desire for humans and machines to collaborate and interoperate. For many years I have been trying to create a common framework for information interchange among chemists, first in academia (Stirling University, UK) and then in industry (Glaxo). The recent development of the WWW has given me the tools and the audience to make this possible, and I see XML as the dawn of a new age. Two years ago I moved to Nottingham to set up the Virtual School as a way of learning using Internet technology and resources. I have developed several XML applications and tools, including JUMBO (the first XML browser), CML (Chemical Markup Language) and the Virtual HyperGlossary as a way of exchanging ontologies using WWW/XML. I enjoy helping create virtual activities on the Internet and see XML as providing great scope.
 

Do we need DTDs?

 
XML allows authors the freedom to create documents which do not (and often never will) conform to a DTD. This freedom will be widely used, especially among those who come to XML from a background of HTML or legacy document systems. I have already encountered DTD "designs" proposed for scientific work where authors are recommended to "add extra attributes" if they cannot find their requirement in the DTD. Although some tools do require the use of DTDs (even for well-formed documents), many will not, and this will be increasingly true of authoring systems. I shall discuss whether DTDs have a role in emerging areas of XML, especially in technical domains and where data and documents are combined.
 
There will be an increasing number of opportunities where XML documents are created on-the-fly with a variable and possibly ephemeral tagset. For example, I frequently create "config" files for my programs using XML as the syntax. In many cases I rely on tree-structured algorithms to search for elements without precise constraints on their position or content. The program logic may simply require match="file[@mimetype]" (XSL syntax) and analyse the results without being concerned about their precise context. Simple documentation and code are all that is required to support this hacking [and the overhead of supporting DTDs within program operation is too high].
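
A minimal sketch of this kind of context-free search, written in Python with the standard library's ElementTree rather than XSL. The config document, its element names and the helper `files_with_mimetype` are assumptions made up for illustration; only the pattern file[@mimetype] comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical config document: names and nesting are illustrative only,
# and no DTD is declared or needed.
CONFIG = """
<config>
  <input>
    <file name="aspirin.cml" mimetype="chemical/x-cml"/>
    <file name="notes.txt"/>
  </input>
  <output>
    <archive>
      <file name="report.html" mimetype="text/html"/>
    </archive>
  </output>
</config>
"""

def files_with_mimetype(xml_text):
    """Return (name, mimetype) for every <file> carrying a mimetype attribute.

    The search ignores where the element sits in the tree, which is the
    ElementTree equivalent of matching file[@mimetype] at any depth.
    """
    root = ET.fromstring(xml_text)
    return [(f.get("name"), f.get("mimetype"))
            for f in root.findall(".//file[@mimetype]")]

matches = files_with_mimetype(CONFIG)
for name, mimetype in matches:
    print(name, mimetype)
```

The <file> element without a mimetype attribute is skipped, and the two matching elements are found even though they sit at different depths under different parents.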
 

What does the DTD offer in XML?

 
My analysis is:
 
  •  A statement of ownership/responsibility for the field. Thus I have developed two quite widely known DTDs: CML and (with Lesley West) VHG. The fact that there is a DTD in an area is a statement that someone/some_org is taking responsibility for the XML developments. This has been extremely useful in both of these areas. It is often coupled to the use of namespaces. Thus CML:* elements are likely to belong to a CML namespace, with the "fragment support" that a namespace offers.
  •  A formal listing of the allowed elements, attributes (and possibly attribute values). In element-oriented computing (where XML elements are mapped onto, say, Java code) this can represent part of a formal contract between the architecture designer and the programmer. Thus, in CML, it is expected that a CML application has some way of processing (rendering, transforming, validating, editing) <molecule>, etc. Note that it may not be necessary for the content model to be well defined as long as descriptions of the requirements are attached to the elements.
  •  A specification of content models. This is increasingly difficult to design and maintain, as technical subjects are highly likely to re-use elements from other namespaces. DTDs can neither support multiple namespaces easily at the syntactic level (e.g. what is the prefix?) nor anticipate the range of content models. In simple cases (where the content is known to be #PCDATA) the model may be useful. Also, a DTD-driven editor may be able to offer the author a list of allowable element names in a given context.
  •  Attribute values. Attribute value enumeration is the only real way DTDs can constrain (string) values. In some cases this may be valuable, but it is very limited.
  •  Minimisation. The main value of this is to add "hidden" or default attribute names/values. This is of debatable value in many cases, as the reader may read the document as "what I see is what I get" and "where did this come from?" is a nasty surprise.
  •  Entity management. A DTD is syntactically necessary for entity management. I usually try to use the internal subset for this so that at least the logic is visible and the spirit is therefore "well-formed".
  •  Syntactic validation of document content and attributes. This can be useful for certain DTDs and we use it in this way for the VHG applications, where we encourage conformity. It is a (limited) way of encouraging good practice within a community. However, much of our extensibility is through unconstrained attribute values.
  •  Authoring tools. Lists of potential elements and attributes can be very useful and avoid the need for manuals. The VHG is simple enough to author manually, but benefits from a DTD-driven authoring tool. [All content models are of the form (A|B|C)* or (#PCDATA), so that order is unimportant.]
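
The minimisation and entity-management points above can be sketched with a small internal subset. The glossary DTD below is hypothetical (it is not the VHG DTD), and Python's ElementTree is used only as a convenient non-validating parser: it expands internal entities and applies declared attribute defaults, but it does not check the enumeration or the content models.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with its declarations in the internal subset, so
# the entity and the attribute default are at least visible to the reader.
DOC = """<!DOCTYPE glossary [
  <!ELEMENT glossary (term)*>
  <!ELEMENT term (#PCDATA)>
  <!ATTLIST term lang (en|fr|de) "en">
  <!ENTITY vhg "Virtual HyperGlossary">
]>
<glossary>
  <term>&vhg;</term>
  <term lang="fr">glossaire</term>
</glossary>
"""

root = ET.fromstring(DOC)

# The first <term> has no lang attribute in the instance, yet the parser
# reports the default from the ATTLIST declaration: the "hidden" value.
terms = [(t.text, t.get("lang")) for t in root.findall("term")]
for text, lang in terms:
    print(repr(text), lang)
```

A reader who sees only the instance has no way of telling that the first term carries lang="en", which is the "where did this come from?" surprise described above.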
 

What are the downsides of the DTD?

 
  •  It has no support for adding semantics. Textual annotation (comments) of DTDs is largely useless as it is not machine-manageable and it gives no formal links to ontological or programmatic resources.
  •  It breaks on multi-namespace applications. In CML many content models are simply ANY. A molecule may well contain text, graphics, HTML, numeric data, etc. Nothing is gained by algorithmically reconstructing DTDs to work around this problem.
  •  A complete document with a DTD cannot be re-used/transcluded without a pointer-based mechanism. If I have molecule.xml which is headed by a DOCTYPE statement, I cannot then transclude it in a larger document (e.g. a report) using the ENTITY mechanism. This is a very serious drawback to re-use. If/when linking is developed so that such documents can be referenced (e.g. through XLINK/HREF, preferably with type-checking) the situation will be improved.
  •  There is no type checking. For someone who uses integers, floats, etc. on a daily basis, this is far more important than syntactic checking.
  •  There is little flexibility. Where there is a formal business requirement to ensure conformity (e.g. tax-forms) the rigidity of a DTD is useful. Where the discipline is open-ended (e.g. chemistry) it is too constraining. Innovators will either "hack the DTD to fit" or simply ignore it.
  •  There is little value checking (only enumerated attribute values).
  •  The purpose and syntax of DTDs are too arcane for newcomers.
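
To illustrate the missing type checking: a DTD can say no more about the elements below than that they contain #PCDATA, so any check that the content is an integer or a float has to live in application code. This is a minimal sketch under stated assumptions; the element names, the `EXPECTED_TYPES` table and the helper `type_errors` are hypothetical and are not taken from CML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; a DTD could only declare each child as (#PCDATA).
DOC = """
<molecule id="m1">
  <atomCount>21</atomCount>
  <meltingPoint units="celsius">135.0</meltingPoint>
  <formalCharge>none</formalCharge>
</molecule>
"""

# Assumed mapping from element name to the type its content should parse as.
EXPECTED_TYPES = {"atomCount": int, "meltingPoint": float, "formalCharge": int}

def type_errors(xml_text, expected):
    """Return (tag, text) for each child whose content fails to parse as its expected type."""
    root = ET.fromstring(xml_text)
    errors = []
    for child in root:
        caster = expected.get(child.tag)
        if caster is None:
            continue  # no type declared for this element: leave it unconstrained
        try:
            caster((child.text or "").strip())
        except ValueError:
            errors.append((child.tag, child.text))
    return errors

errors = type_errors(DOC, EXPECTED_TYPES)
print(errors)
```

The document is well-formed, and would be valid against a #PCDATA content model, yet the formalCharge value is not an integer; only the program-level check reports it.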
 
On balance, therefore, the DTD is too inflexible for most new XML applications. XML Schemas will overcome many of these limitations if they tackle the following:
 
  •  Flexible/constrained content models on demand. It is often useful to specify that some of the content must/must_not be present and the rest is optional or foreign.
  •  Data typing is essential. XML has spent too long avoiding this question, which is critical for the new data-driven applications.
  •  Tight coupling between declarations and APIs. We need to be able to specify that FOO functionality is provided by FOO.class (or similar). This may come through the UML/XML efforts.
  •  Mechanisms for validating unconstrained fields (e.g. #PCDATA and string attribute values). These are the most natural and powerful simple extensions to DTD functionality.
  •  Transformation of Schemas. If Schemas are XML documents they can be merged or otherwise filtered in ways that greatly enhance the authoring process.
  •  Good support for XLINK within and external to Schemas and documents. Much of this will require good educational material.
  •  Algorithmic creation of schemas from other paradigms (e.g. UML or RDB schemas).
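
The first requirement in the list above (some content must or must not be present, the rest being optional or foreign) can be sketched as an application-level check. This is only an illustration of the idea, not a proposal for schema syntax; the namespace URIs, the element names and the helper `check_children` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-namespace document; both URIs are placeholders.
DOC = """
<molecule xmlns="http://example.org/cml" xmlns:h="http://example.org/html">
  <atomArray/>
  <h:p>Free text from another namespace.</h:p>
  <bondArray/>
</molecule>
"""

NS = "http://example.org/cml"

def check_children(element, required, forbidden, namespace):
    """Check required and forbidden children in one namespace.

    Children from any other namespace are treated as foreign content and
    ignored, and the order of children is not constrained.
    """
    prefix = "{" + namespace + "}"
    local = {child.tag[len(prefix):] for child in element
             if child.tag.startswith(prefix)}
    problems = ["missing <%s>" % name for name in sorted(required - local)]
    problems += ["forbidden <%s>" % name for name in sorted(forbidden & local)]
    return problems

root = ET.fromstring(DOC)
ok = check_children(root, {"atomArray"}, {"script"}, NS)
bad = check_children(root, {"atomArray", "formula"}, {"bondArray"}, NS)
print(ok)
print(bad)
```

The foreign h:p element passes through untouched in both calls, which is the behaviour a DTD content model can only approximate by declaring the element as ANY.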
