The Addition of a Multilingual Component to An Existing Document Processing System   Table of contents   Indexes   Realising the Potential of Object Technology Through New Working Practices

 
 

SGML & schemas: from SGML DTDs to XML-DATA.


 
François   Chahuneau
  AIS
17 Rue Remy Dumoncel
Paris   France  F-75014
Email: fcha@ais.berger-levrault.fr
 
Biographical notice:
 
François Chahuneau
 
François Chahuneau is General Manager of AIS and President of AIS Software, AIS's software publishing subsidiary. Throughout his ten years of experience with SGML and now XML, during which he developed the AIS business, F. Chahuneau has been involved as an expert consultant or project supervisor in a large number of SGML-related projects in France and Europe, covering all typical application areas. He supervised the design and development of several software products, including the SGML/Store SGML database, Balise and Dual Prism. Mr. Chahuneau graduated from the Ecole Normale Supérieure in Paris.
 
ABSTRACT:
 
This paper studies, from an historical perspective, the relationship between SGML and data modeling concerns.
 
SGML did not invent the concept of structural document models, or “schemas”. Nevertheless, through the notion of DTDs, it made this powerful concept available and understandable to a large number of people with little or no data modeling experience.
 
With the evolutionary trend towards “content oriented” DTDs, the emergence of well-described methodologies for designing them and the appearance of specialized “CASE” tools to manipulate them, the potential of SGML as a data modeling methodology became clear, and some SGML enthusiasts suggested using it as a general-purpose tool.
 
However, because an SGML DTD intimately mixes the notion of a “grammar” and that of a “schema”, these two concepts remained partly confused, at least in the orthodox SGML approach. This original characteristic caused some misunderstandings and raised many suspicions in the traditional data modeling world. So far, this has largely precluded the use of SGML as a general data modeling tool outside the restricted arena of structured documents.
 
By introducing a simplified syntax with a fixed grammar, XML isolated the role of DTDs as pure schemas, and also made them unnecessary for pure recognition of the “de facto” document structure.
 
As a final attack against the traditional SGML DTD concept, recent proposals such as RDF and XML-Data suggest using the XML syntax itself to encode document schemas, thereby making “traditional” DTDs obsolete. At the same time, they propose several extensions to the SGML data modeling semantics, by incorporating object-oriented concepts.
 
Will such evolutions allow XML to become the official, well-accepted and ubiquitous way to exchange structured data and associated models, and carry the power of SGML far beyond its original application niche?
 
 

Introduction

 
For anybody interested in the history of document and data structuring concepts, one of the striking facts of the last few months was to hear Microsoft's Chairman Bill Gates acknowledge the virtues of abstraction in document publishing technology through XML: “I think XML is really a breakthrough, because it brings the database and the publishing world into having an abstract way of describing properties.” (Seybold San Francisco, Oct. 97)
 
Along the same lines, the Web community at large recently discovered with XML the possibility of separating content from presentation, and of managing content on its own (hence the new buzzword “Content Management”).
 
For those of us who have been practicing SGML for the last ten years or more, hearing such “revelations” can be amusing at best... or irritating at worst. However, instead of looking at this phenomenon with contempt or disdain, exercising some self-criticism might be useful. Why did “traditional” SGML never seem able to deliver its message to the mainstream? Why did the impression of complexity and clumsiness dominate its marketing image, hiding from the uninitiated the few good and simple underlying ideas?
 
One of the focal points of this complexity seems to be the very notion of a DTD, and the way it is defined in the standard. This article summarizes various ideas which the author progressively developed on this topic, applied, and sometimes expressed over the last 10 years or so, but which found unprecedented echo and justification in the recent evolution of SGML and its usage and the advent of XML.
 
 

The grammar/schema confusion

 
 

The dual nature of SGML DTDs

 
SGML was invented more than ten years ago by a group of ingenious and creative individuals, characterized both by remarkable intuitions and by either limited scholarly knowledge of computer science or little desire to conform to it. With the benefit of hindsight, after ten years of practice, the design of SGML appears as an unlikely and unique mixture of many brilliant ideas and a few mistakes, and is striking for its total lack of reference to the data modeling and language design theories that had already emerged in computer science at the time it was designed.
 
A major point of originality is the central SGML DTD concept itself: a DTD is both a generative grammar for the markup language which will be used to tag corresponding instances, and a schema which characterizes a document class: it assigns names to things and defines rules stating which structural patterns are permitted, forbidden or required in an SGML document (modeled as a tree of typed nodes with attributes) which belongs to the class. In the same set of statements, one is instructed that “the end tag for AUTHOR can be omitted” and that “the document must have a title and a single one”, although these two pieces of information admittedly belong to totally different areas of concern.
 
This dual nature of DTDs should not necessarily lead to confusing the two notions. Unfortunately, this is largely what happened in the SGML community...
 
 

SGML parsers share responsibility

 
Consider the following (invalid) SGML fragment:
 
   1:  <!doctype doc[
   2:  <!element doc - - (title, p+)>
   3:  <!element p - - (#PCDATA|emp)*>
   4:  <!element (title|emp) - - (#PCDATA)>
   5:  ]>
   6:  <doc>
   7:  <title>Title 1</title>
   8:  <title>Title 2 </title>
   9:  <p>abc<\\p>
  10:  <p>abc<emp>def</p></emp>
  11:  </doc>
 
An SGML parser will typically report three “syntax errors”, corresponding to three very different types of mistakes:
  • The typo in closing the P tag on line 9 (lexical error)
  • The overlapping of P and EMP elements on line 10 (structural inconsistency error)
  • The presence of two TITLE elements whereas the model only allows for a single one (model error).
 
This is indeed what the standard dictates: it makes no distinction between the arguably distinct “nature” of these three mistakes, and suggests a “flat” reporting without any notion of error category. This natural behavior of SGML parsers, of course, did not help to open the eyes of those who learned SGML “the hard way”, i.e., with the help of an ASCII text editor and a validating parser.
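 
To make the distinction concrete, here is a minimal Python sketch of a checker which tags each report with one of the three categories listed above. It is not an SGML parser: the content models of the DTD are transcribed by hand into regular expressions, attributes and tag omission are ignored, every element is assumed to be declared, and the names `CONTENT_MODELS`, `TOKEN` and `check` are purely illustrative.

```python
import re

# Content models of the DTD above, transcribed by hand into regular
# expressions over the list of child names; "#text" stands for #PCDATA.
CONTENT_MODELS = {
    "doc": r"title (p )+",
    "p": r"((#text|emp) )*",
    "title": r"(#text )?",
    "emp": r"(#text )?",
}

# A plain start or end tag, or any other "<...>" run, or character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|<[^>]*>|[^<]+")


def check(text):
    """Return (category, message) pairs for a tagged fragment."""
    errors = []
    stack = []  # open elements, each as (name, list of child names)
    for match in TOKEN.finditer(text):
        token, slash, name = match.group(0), match.group(1), match.group(2)
        if name is None:
            if token.startswith("<"):
                errors.append(("lexical", f"malformed tag {token}"))
            elif token.strip() and stack:
                stack[-1][1].append("#text")
        elif slash == "":
            if stack:
                stack[-1][1].append(name)
            stack.append((name, []))
        elif name not in [open_name for open_name, _ in stack]:
            errors.append(("structural", f"</{name}> closes nothing"))
        else:
            if stack[-1][0] != name:
                errors.append(
                    ("structural", f"</{name}> overlaps <{stack[-1][0]}>"))
            while stack[-1][0] != name:
                stack.pop()  # naive recovery: drop the unclosed elements
            _, children = stack.pop()
            sequence = "".join(child + " " for child in children)
            if not re.fullmatch(CONTENT_MODELS[name], sequence):
                errors.append(("model", f"<{name}> contains {children}"))
    return errors


# Lines 6 to 11 of the fragment above ("\\p" is a backslash followed by p).
FRAGMENT = """<doc>
<title>Title 1</title>
<title>Title 2 </title>
<p>abc<\\p>
<p>abc<emp>def</p></emp>
</doc>"""

for category, message in check(FRAGMENT):
    print(category, "-", message)
```

Because the recovery strategy is deliberately naive, a single mistake may trigger more than one structural report. The point is only that each report carries a category, which the “flat” reporting described above does not provide.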
 
 

The role of cultural gaps

 
One might have expected that the computer science community, once it became interested in SGML (at least to build software implementations), would have quickly diagnosed this confusion. Actually, although it was pointed out by some, it went largely unnoticed. The author has a personal interpretation of this phenomenon.
 
The computer science educational system, because of the way it is organized, usually produces two distinct types of graduates, which can be caricatured as follows:
  • Type 1, software engineering curriculum: knows about language and compiler theory, grammars and BNF (Backus-Naur Form); has been hacking on Unix systems and is familiar with lex, yacc and regular expressions; has little interest in database applications (and a lot of contempt for business-oriented operating systems such as OS/2, Windows NT or mainframe OSes); has no notion of what a database schema is, and little interest in Entity-Relationship data modeling (maybe a bit more in modern “object-oriented” modeling approaches, having been told that IDL or C++ could be used).
  • Type 2, business data processing curriculum: knows about databases and CASE tools; is familiar with the notion of schemas, and has been trained in Entity-Relationship modeling or more recent data modeling methods; has strong notions about data integrity and query languages, but much weaker ones about BNFs and parsers; has never played with lex and yacc because OS/2, NT or VM was used for training.
 
The net result of this separation is that it is not very common to find somebody (or even a group of people) who combines these two complementary cultures, and is equally at ease with the notion of a grammar and that of a schema. And yet, both are really necessary to understand what SGML designers probably intuited when they defined the DTD concept.
 
 

Irreconcilable differences

 
SGML experts will explain that there are good reasons why an SGML DTD cannot be simply equated with a schema in the (object-oriented) database sense... and they are right!
 
One area of resistance is that of SGML exceptions (no surprise that they were eliminated in XML...). The concept of an SGML inclusion, for instance, means that what you are likely to find in a textual object (element) of type X depends not only on what has been declared at the class level (element type definition), but also on the precise location of this object in the global structure (the SGML instance). This interaction between the “object composition hierarchy” and the class -> object inheritance mechanism has no known equivalent in any object-oriented or other data modeling approach...
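 
The following Python sketch illustrates this dependence on location. It assumes a hypothetical DTD fragment (the element names `chapter`, `appendix`, `p`, `emp` and `footnote` are not taken from any real DTD) in which only `chapter` declares an inclusion, and it deliberately ignores exclusions, ordering and occurrence indicators.

```python
# Hypothetical declarations, transcribed by hand into dictionaries:
#   <!ELEMENT chapter  - - (title, p+) +(footnote)>
#   <!ELEMENT appendix - - (title, p+)>
#   <!ELEMENT p        - - (#PCDATA|emp)*>
CONTENT_MODEL_NAMES = {
    "chapter": {"title", "p"},
    "appendix": {"title", "p"},
    "p": {"emp"},
}
INCLUSIONS = {"chapter": {"footnote"}}


def allowed_children(path):
    """Element types allowed inside the last element of `path` (root first)."""
    allowed = set(CONTENT_MODEL_NAMES.get(path[-1], set()))
    for open_element in path:
        # An inclusion applies anywhere inside the element that declares it.
        allowed |= INCLUSIONS.get(open_element, set())
    return allowed


print(sorted(allowed_children(["chapter", "p"])))   # ['emp', 'footnote']
print(sorted(allowed_children(["appendix", "p"])))  # ['emp']
```

The same element type `p` thus accepts different children depending on its ancestors, which is exactly what a class-level definition alone cannot express.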
 
The ill-fated ODA standard, once considered as a potential competitor to SGML, was, in many ways, much more “well-mannered”: its design — although very limited — was “clean”, and its vocabulary was much more familiar to computer science folks (one of the reasons, of course, was that underlying work had been funded mostly by European hardware manufacturers).
 
 

Confusion consequences

 
This grammar/schema confusion, caused by the dual nature of DTDs, had several negative effects on the maturity of ideas in the SGML community. One of them is the common misunderstanding of the huge difference between conversion “from SGML” and conversion “to SGML”.
 
Let us consider, for instance, that RTF is the non-SGML format. If these two processes: SGML -> RTF and RTF -> SGML are simply seen as conversions from one syntax to the other, there is indeed little difference, and one can think of using similar tools and algorithms for both ways.
 
Now, if one observes that the SGML data follow an explicit schema expressed in the associated DTD whereas the RTF data don't, then one can see a huge difference, and begin to think in terms of “information level gap”, which is what difficulty in legacy data conversion is all about...
 
 

The emergence of schemas behind DTDs

 
Despite all the reasons described above, several evolutionary ideas appeared in the SGML community over the last five years, which contributed to making us think of DTDs more and more as schemas and less and less as grammars.
 
 

CASE tools and methodologies for DTD

 
So-called “CASE tools” for DTD design appeared as early as 1994, and brought a new vision of DTD development wherein an abstraction cycle, leading to the specification of a data model, took precedence over syntax-related concerns. The book by Maler and El Andaloussi [MAL 96], which was published in 1996, popularized a new approach to document analysis and DTD design, in which DTDs are treated as “content-oriented” (semantic) modeling tools and language/syntax aspects are relegated to a secondary role.
 
 

SGML and databases

 
The whole area of database-oriented SGML applications was naturally one in which DTDs would be first perceived as (a strange kind of) schemas. After all, “type-2” computer scientists, as described above, rule over this field! Defining bi-directional information exchanges between SGML documents and databases is pretty much based on thinking with a DTD in one hand and a database schema in the other.
 
Note that generic SGML databases (such as AIS' SGML/Store or Chrystal Software's Astoria) position themselves one step higher in genericity: they implement a schema which is that of an SGML document instance in general (to simplify: a tree of typed nodes with attributes). But because they usually store DTDs as well, and track the relationship between SGML instances and DTDs, DTDs can still be used as schemas when preparing queries to the database engine.
 
 

The XML phase separation

 
 

The XML revolution

 
The major 1997 event for the SGML community was, beyond dispute, the advent of XML.
 
As some recent messages on the XML mailing list attest (such as the October discussion about EMPTY elements!), not everybody necessarily agrees on the fundamental reasons why a given feature did or did not make its way into the XML proposal. Fortunately, most participants still seem to agree on a significant number of points, and the status of DTDs in XML appears to be one of them.
 
DTDs are no longer required in XML because the new simplified syntax makes it possible to build, in an unambiguous way, a single tree of typed nodes with attributes from a piece of tagged text... which was not guaranteed with traditional SGML. More precisely, a “well-formed XML document” is a tagged structure which makes this possible.
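 
As an illustration, the sketch below uses the `xml.etree.ElementTree` module of the Python standard library (a present-day tool, chosen here only for convenience) to build such a tree without consulting any DTD, and to reject a fragment which is not well-formed. The sample document and the helper names `outline` and `is_well_formed` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<doc lang="en">
  <title>Title 1</title>
  <p>abc<emp>def</emp></p>
</doc>"""


def outline(element, depth=0):
    """List (depth, element type, attributes) for each node, in document order."""
    rows = [(depth, element.tag, dict(element.attrib))]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


def is_well_formed(text):
    """True if a single tree can be built from the tagged text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOCUMENT)  # no DTD is consulted at any point
for depth, tag, attributes in outline(root):
    print("  " * depth + tag, attributes)

# Overlapping elements cannot be turned into a tree.
print(is_well_formed("<p>abc<emp>def</p></emp>"))
```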
 
Therefore, DTDs are no longer necessary to acquire knowledge of the de facto structure of tagged instances. They remain available, when they exist, to provide meta-information about the data model. In other terms, XML DTDs are reduced to the role of pure schemas.
 
This interpretation provides natural answers to most questions which were raised about “DTD-less” parsing of XML documents, and situations in which DTDs would still be required.
 
DTDs are useful in situations where reference to a schema is required:
  • To create structured data under the control of the schema;
  • To validate data against a schema before running schema-sensitive applications;
  • Sometimes, to help design stylesheets for document presentation (at the very least, it can save time, because it provides an overview of the element type catalog and gives a good idea of all possible structural patterns which can occur in an instance).
 
DTDs are no longer required:
  • When building the SGML grove is the only requirement;
  • To run schema-insensitive applications (displaying documents under the control of a stylesheet is typically one of them);
  • When the schema is already known on the application side, and when the data can be ascertained to follow it (for instance, for computer-generated data).
 
This means that document schemas will only be used when really needed. In the same way as many simple RDBMS applications are developed every day without explicit (or even conscious) reference to any conceptual data model expressed with the entity-relationship formalism (or any other method of comparable abstraction level), it is anticipated that many simple XML applications will be developed without explicit DTD design.
 
 

New application areas: XML/EDI

 
This new vision of DTDs as pure schemas lends itself to considering XML as an ideal replacement for existing technology, in areas where structured data exchange standards already exist, but where they lack schema information transfer capability. In such cases, schema-based message validation requires development of specific, proprietary validation programs (the schema is part of the code), as opposed to the more modern, robust and cost-effective approach in which a generic program, developed once, dynamically loads a schema as a data structure before attempting message validation.
 
One such area is Electronic Data Interchange (EDI). Using XML as an EDI information vehicle makes it possible to take advantage of schema information exchange, and to reap the benefits of software standardization. Details can be found at http://www.geocities.com/WallStreet/Floor/5815/.
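 
The contrast between a validation program with the schema built into its code and a generic program which loads the schema as data can be sketched as follows. The message vocabulary (`order`, `id`, `quantity`) and the schema representation (each element type mapped to its exact sequence of child types) are simplifying assumptions made for the example, not part of any EDI or XML/EDI specification.

```python
import xml.etree.ElementTree as ET


def validate(element, schema):
    """Generic check of every element's child sequence against a loaded schema."""
    problems = []
    expected = schema.get(element.tag)
    found = [child.tag for child in element]
    if expected is None:
        problems.append(f"undeclared element: {element.tag}")
    elif found != expected:
        problems.append(f"{element.tag}: expected {expected}, found {found}")
    for child in element:
        problems.extend(validate(child, schema))
    return problems


# The schema is plain data: it could equally be read from a file or
# received together with the message. It is written inline for brevity.
ORDER_SCHEMA = {"order": ["id", "quantity"], "id": [], "quantity": []}

message = ET.fromstring("<order><id>A-17</id></order>")
print(validate(message, ORDER_SCHEMA))
```

Supporting a new message type then means supplying another dictionary to the same `validate` function, rather than writing another program.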
 
 

Still much work to do...

 
Still, much work remains to be done in terms of user education. The old grammar/schema confusion is apparently long-lived, as attested by the recent discussion about DTDs on the WEBDAV (Web Distributed Authoring and Versioning) discussion list (lists.w3.org/Archives/Public/w3c-dist-auth/). The discussion was about whether a DTD should be used instead of a BNF for describing the XML structures designed in the WEBDAV standard, and opposed two camps: the SGMLers explained that DTDs were the natural syntax to express such things, whereas Web-trained folks (presumably of “type 1” described above) explained that “the DTD syntax is not well known amongst the HTTP community, of which DAV is a member, while BNF is.”
 
In our opinion, both camps, while focusing on the DTD vs. BNF debate, were missing a major point, which is that an XML DTD should no longer be equated to a grammar, but to a schema (for which BNF never was the preferred formalism).
 
 

XML-Data: the ultimate evolutionary stage?

 
The latest stage in the evolution of DTD-related concepts seems to be what is at the core of the recent XML-Data specification proposed by Microsoft, Inso and others as a candidate W3C standard (see: http://www.microsoft.com/standards/xml/xmldata.htm).
 
The XML-Data proposal squarely suggests giving up the traditional SGML DTD syntax and transferring equivalent schema-level information in the form of tagged data structures, thereby using XML “instance syntax” both for the data themselves and for metadata. Considered from an SGML perspective, this amounts to designing a “DTD for DTDs”, which captures the semantics of SGML as a data modeling language. XML-Data actually goes one step further, by extending the SGML semantics in three directions: inheritance (type extensions, subclassing, etc.), introduction of lexical data types (HyTime already had this notion), and definition of a few basic semantic data types (Date, Number, etc.).
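 
The general idea can be sketched in Python as follows. The schema vocabulary used here (`schema`, `elementType`, `child`, and the `datatype` attribute) is invented for the illustration and is not the actual XML-Data syntax. The sketch only shows that a schema written in instance syntax can be read with the same parser as the data, and that basic data types can then be checked. Inheritance is left out.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical schema document, written in XML instance syntax.
SCHEMA_TEXT = """<schema>
  <elementType name="order">
    <child type="id"/>
    <child type="shipped"/>
    <child type="quantity"/>
  </elementType>
  <elementType name="id" datatype="string"/>
  <elementType name="shipped" datatype="date"/>
  <elementType name="quantity" datatype="number"/>
</schema>"""

# Each data type is checked by trying to convert the element's text.
CONVERTERS = {"string": str, "number": float, "date": date.fromisoformat}


def load_schema(text):
    """Read the schema document with the same parser as ordinary data."""
    schema = {}
    for declaration in ET.fromstring(text).iter("elementType"):
        schema[declaration.get("name")] = {
            "children": [c.get("type") for c in declaration.findall("child")],
            "datatype": declaration.get("datatype"),
        }
    return schema


def validate(element, schema):
    """Check child sequences and data types; return a list of problems."""
    rule = schema.get(element.tag)
    if rule is None:
        return [f"undeclared element: {element.tag}"]
    problems = []
    if [child.tag for child in element] != rule["children"]:
        problems.append(f"{element.tag}: unexpected content")
    if rule["datatype"] is not None:
        try:
            CONVERTERS[rule["datatype"]](element.text or "")
        except ValueError:
            problems.append(f"{element.tag}: not a valid {rule['datatype']}")
    for child in element:
        problems.extend(validate(child, schema))
    return problems


schema = load_schema(SCHEMA_TEXT)
message = ET.fromstring(
    "<order><id>A-17</id><shipped>1997-13-01</shipped>"
    "<quantity>3</quantity></order>")
print(validate(message, schema))
```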
 
 

Conclusion

 
Even if this is not yet obvious to everybody, the role of DTDs as a data schema, which was largely implicit in the original SGML standard, has made its way through various evolutionary changes. XML, by relieving DTDs from their “grammar role”, leaves their “schema role” as their only justification for existence. Indeed, alternate syntaxes can be proposed to convey identical information, which logically leads to the disappearance of DTDs in the traditional sense.
 
DTDs are dead, long live DTDs!
 
Bibliography
MAL 96
Maler, E.; El Andaloussi, J. Developing SGML DTDs. Prentice Hall, 1996.
