
  Haitto  Hasse 
 

SGML in Transition

 

Abstract:

 SGML has celebrated 10 years as a standard, and although the standard is only now being revised, the use of SGML has evolved over time. This paper explores some of the features that have made SGML successful, considers the importance of adopted conventions, and speculates on future applications as SGML transitions into the next century.
  As an international standard, SGML is subject to orderly, voted-upon change. After a decade of adoption, it is due to be revised. In many ways, the standard was farsighted in its design, a fact confirmed by its being applied well beyond its original publishing purpose and by its becoming the foundation of promising standards such as ISO 10744 HyTime. Even the long delay in completing the companion standard ISO 10179 DSSSL has not significantly slowed SGML's rise to prominence.
 However, even if the standard itself has yet to change, the use of SGML has made a number of transitions.
 

The Evolution of DTD Design

  In DTD design, a significant trend has been the move away from structural and towards content-oriented, functional design. As an example, one might formerly have tagged an assembly description in a maintenance manual as a bulleted list, but would these days rather tag it as a series of assembly steps, with hyperlinks to required tools, even if its rendition is a bulleted list.
  The benefits of content-oriented tagging increasingly lie in the re-use of information elements, in the coupling of SGML with databases, and in queries which use the underlying markup as search criteria.
 The different ways SGML is used reflect a growing awareness of dealing with information rather than documents.
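 To make the point about queries concrete, here is a minimal sketch in Python. It assumes a content-oriented fragment written in XML syntax as a stand-in for SGML, so that the standard library parser can read it. The element names assembly and step, the tool attribute, and the function tools_required are invented for illustration and come from no particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-oriented markup: each step names the tool it needs,
# whatever the eventual rendition (for example, a bulleted list).
CONTENT_ORIENTED = """
<assembly>
  <step tool="torque-wrench">Tighten the flange bolts.</step>
  <step tool="feeler-gauge">Check the valve clearance.</step>
  <step>Refit the cover.</step>
</assembly>
"""


def tools_required(markup):
    """Use the markup as the search criterion: collect every tool named on a step."""
    root = ET.fromstring(markup)
    tools = {step.get("tool") for step in root.iter("step") if step.get("tool")}
    return sorted(tools)
```

 Had the same steps been tagged as items of a bulleted list, a query for required tools would have had nothing but list items to search.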
 
 

Domain-specific DTDs

 Early use of SGML reflected a one-size-fits-all approach of generic DTDs. This view was gradually replaced by that of domain-specific DTDs. The DocBook DTD is perhaps one of the best known of the later efforts; various industries (aerospace, automobile, semiconductors, etc.) have implemented their own DTDs as well.
  At the far end of the application spectrum, complex DTDs for IETMs (Interactive Electronic Technical Manuals) may need to include structures for conditional presentation behavior of input data and embed interactive elements such as clickable warning dialogs that are controlled by traversal rules.
 
 

Module-oriented Design

 DTDs have also become increasingly modular. A simple example is that of common elements that are included when necessary: say, a table model, or elements for mathematical formulae.
  A very disciplined modular design has been adopted by the TEI (Text Encoding Initiative) in its set of DTDs (see Guidelines for Electronic Text Encoding and Interchange, edited by C.M. Sperberg-McQueen and Lou Burnard, http://www-tei.uic.edu/orgs/tei/). In these DTDs, you toggle the inclusion of DTD fragments as required, and the content models have provisions for being extended or replaced in a clean, extensible fashion. (These DTDs are highly recommended for study and use!)
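 A common SGML idiom for this kind of toggle is a parameter entity whose value is the keyword INCLUDE or IGNORE and which guards a marked section in the DTD. The Python sketch below only generates such switch declarations. The function name toggle_declarations and the module names tables and math are hypothetical; they are not the names used by the TEI DTDs.

```python
def toggle_declarations(modules):
    """Emit one parameter-entity declaration per module switch.

    `modules` maps a (hypothetical) module name to True or False.
    """
    lines = []
    for name, enabled in modules.items():
        keyword = "INCLUDE" if enabled else "IGNORE"
        lines.append('<!ENTITY % {} "{}">'.format(name, keyword))
    return "\n".join(lines)


# Example: switch the table model on and the math elements off.
SWITCHES = toggle_declarations({"tables": True, "math": False})
```

 The resulting declarations would be placed ahead of the DTD proper, so that each guarded fragment is either parsed or skipped.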
 
 

Storage vs. Publishing DTDs

 Along with content-oriented tagging, the application of reusable information elements is becoming widespread. This means differentiating between tagging for storage and retrieval on the one hand, and generating data for some publishing-oriented DTD (say, HTML) from a storage-oriented markup on the other.
 Content-oriented storage DTDs are ideal for SGML document databases that support SGML querying capabilities.
 
 

Prediction

 As DTDs evolve, one may need to maintain and restore earlier versions of both DTDs and related document instances. There will be a growing need for tools that address DTD evolution, and that optimize queries based on SGML structures.
  For IETMs, expect improvements in editing tools, to validate application semantics in addition to the primary SGML and SGML-related functionality.
 For implementing re-usable information elements, the SGML subdoc feature may become a key player.
 

Adopting Entity-based Approaches

 From the outset, portability and system independence were paramount for SGML, and highly touted as a selling point of the SGML approach. In practice, current tools still influence (to a certain degree) how you will use SGML, but at least SGML minimizes the application exposure. The tools are also becoming better.
  The concept of the entity as a virtual storage system insulates SGML from any particular file system convention, and thereby prevents the standard from being locked into any particular operating system. Although a simple idea, it is also one of SGML's strongest points, and one which has grown in importance over time.
 With entities, the SGML standard can simply refer to an abstract entity manager to retrieve and deliver corresponding document contents, without worrying about how this will be done or from where the data is fetched. The entity mechanism is scalable: it serves for anything from inserting a single foreign character to including entire documents and referring to non-SGML notation data such as images or video.
 External entities are declared using system or public identifiers. The latter form is mapped to the former when resolved. As the name indicates, a system identifier is system-dependent.
 Initially, a large part of the SGML community's efforts addressed issues of converting legacy data, authoring, validation, DTD design, and of course processing SGML. Not much attention was paid to making entities available across SGML applications. This has changed in latter-day SGML use.
 
 

The SGML Open Catalog

 As SGML rose to prominence, more and more SGML tools appeared, and with a choice of tools, it gradually became clear that some form of harmonization was required to reach application independence, and to isolate and neutralize the use of system identifiers. The SGML Open consortium thus agreed on an entity catalog, which defines a format for SGML systems to share common entities in a well-defined manner.
  In its simplest form, the SGML Open catalog is a mapping between public and system identifiers. (It is actually more complex, and is currently being extended even further.) Many companies support the catalog format.
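 The simplest form of the catalog can be sketched in a few lines of Python. The sketch assumes that only PUBLIC entries with double-quoted literals are present; the other entry types of the real format, and normalization of whitespace in public identifiers, are deliberately left out. The identifiers and paths in the sample catalog are made up for illustration.

```python
import re

# Matches entries of the form: PUBLIC "public identifier" "system identifier"
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def parse_catalog(text):
    """Return a dict mapping public identifiers to system identifiers."""
    return {pubid: sysid for pubid, sysid in PUBLIC_ENTRY.findall(text)}


def resolve(catalog, public_id):
    """Look up a public identifier; None means it is not in this catalog."""
    return catalog.get(public_id)


# Hypothetical catalog contents.
SAMPLE_CATALOG = '''
PUBLIC "-//Example//DTD Manual//EN" "dtds/manual.dtd"
PUBLIC "-//Example//ENTITIES Symbols//EN" "entities/symbols.ent"
'''
```

 Documents then carry only the public identifier, and the catalog alone decides where the corresponding data lives on a given system.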
 
 

Customizable Entity Managers

 Once you start using public identifiers and the catalog scheme, you note that they are an advantage over using system identifiers directly. You can reorganize the storage of your documents, and only update a single spot: the affected catalog.
  With a customizable entity manager, you can further handle the processing of entities freely, to build on this powerful paradigm. A couple of examples are:
 Runtime resolution.
  By allowing dynamic, runtime resolution of entities, you can (for instance) resolve documents whose data resides in databases or is assembled on the fly, from a variety of sources (say, as the result of a user query).
 Encryption-decryption.
 As a corollary to the previous step: Since the entity manager is decoupled from the SGML standard, you can even insert an encryption/decryption step as you process entity contents dynamically.
 The dynamic processing of SGML is a recent application, as applications have moved from static, pre-compiled proprietary data generated out of SGML source to working with SGML directly.
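 Both examples can be sketched together in Python. This is an illustration of the idea rather than the interface of any existing SGML tool: the class EntityManager, the helper names, and the identifiers are all hypothetical, and string reversal merely stands in for a real cipher.

```python
class EntityManager:
    """Tries each registered resolver in turn until one delivers the entity."""

    def __init__(self):
        self._resolvers = []

    def register(self, resolver):
        # A resolver is any callable: identifier -> text, or None if unknown.
        self._resolvers.append(resolver)

    def fetch(self, identifier):
        for resolver in self._resolvers:
            data = resolver(identifier)
            if data is not None:
                return data
        raise KeyError(identifier)


def query_resolver(identifier):
    """Runtime resolution: assemble the entity text on the fly."""
    prefix = "query:"
    if not identifier.startswith(prefix):
        return None
    return '<result term="{}"/>'.format(identifier[len(prefix):])


def decrypting(resolver, decrypt):
    """Wrap a resolver so that its output passes through a decryption step."""
    def wrapped(identifier):
        data = resolver(identifier)
        return None if data is None else decrypt(data)
    return wrapped


def reverse(text):
    # Placeholder only: string reversal stands in for a real cipher.
    return text[::-1]


# Hypothetical store holding "encrypted" (here: reversed) entity text.
STORE = {"chap1": reverse("<chapter>Intro</chapter>")}

manager = EntityManager()
manager.register(decrypting(STORE.get, reverse))
manager.register(query_resolver)
```

 The parser only ever calls fetch; whether the text came from a store, was decrypted on the way, or was generated in answer to a query stays hidden behind the entity manager.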
 
 

Prediction

  Being able to address indirection will become increasingly important, and is required in the design of complex documentation systems that address redundancy and distributed server-based information bases. Avoiding hard-wired system identifiers will become more important, except perhaps for on-demand online publishing, where SGML documents may be generated as a transient, temporary piece of information which is read or processed, and then dispensed with.
 

SGML on the Web

 Around the same time as SGML companies began to adopt the SGML Open entity catalog, HTML made its sweep across the world, and it became an interesting proposition to access SGML on the Web as well. It therefore became necessary to use SGML dynamically.
 Two years of experience attest to the fact that SGML is harder to serve efficiently on the Web, compared to HTML. HTML, described in SGML terms, is essentially a fixed DTD with few elements, all of which are tied to a pre-defined layout. This allows HTML browsers to do a number of optimizations because of known pre-conditions. In contrast, an SGML browser needs a DTD, possibly included DTD fragments and entity sets, and support files such as style sheets. To do this kind of processing efficiently, the browser should support both local and remote catalogs, so that only data which cannot be found locally is transmitted across the Web.
 However, SGML documents tend to be complex, lengthy, and highly structured. In particular, by the very definition of the standard, the topmost element encapsulates the entire document, and you thus have to read all of the document before you are done with it. All of these factors have a bearing on web publishing: SGML is a bit cumbersome to use “as is”. Outside of intranets, transmitting SGML data becomes a time-consuming proposition because of current bandwidth restrictions. Two ways of addressing this problem have emerged.
 
 

The Extensible Markup Language

  XML (Extensible Markup Language), currently being designed, simplifies SGML extensively and, though designed primarily for web publishing, is general enough to be useful far beyond it.
  Thanks to HTML, it was realized that DTDs may not always be necessary, and that people will gladly tag their documents as long as it is easy enough to do so. In consequence, XML does not require you to have a DTD, which means that XML documents need not be valid (but they can be); it is sufficient that they are well-formed.
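 The distinction can be illustrated with a short Python sketch that checks well-formedness only, with no DTD involved. It uses the xml.etree.ElementTree parser from the Python standard library as one convenient XML parser; the function name is_well_formed and the sample documents are our own.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as XML; no DTD is consulted, so validity is not checked."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


GOOD = "<note><to>Ann</to><body>Tags nest and close properly.</body></note>"
BAD = "<note><to>Ann</note>"  # <to> is never closed
```

 A document such as GOOD is accepted although no DTD declares note, to, or body; checking validity would additionally require a DTD and a validating parser.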
  The XML Editorial Review Board has adopted a minimalist approach to keep the specification light-weight and easily implementable.
 
 

The SGML Open Fragment Specification

  In order to transmit SGML more efficiently on the web, SGML Open has defined a technical resolution to permit SGML documents to be served in chunks, with just enough context information about where the corresponding document fragment belongs. It appears likely that this effort will be eclipsed by the emergence of XML.
 However, the fragment approach brings up the question of addressing, of describing locations in SGML documents. This is covered in the next section.
 
 

Prediction

  Online SGML publishing will initially be successful in intranets as they have the bandwidth to support it, but will eventually migrate to the Web as well. XML will pave the way for this evolution; both SGML and XML will co-exist with HTML, as HTML addresses different requirements than those which are solved by an SGML-purist approach.
  Conventions, through organizations like SGML Open, the TEI, and the W3C, complement the standardization process. It is likely that this trend will continue and grow.
 

Addressing and Locations

 SGML was originally designed for the name space of a single document, so one could not mark up (in a standard-defined manner) links to other documents. This shortcoming will be fixed in the upcoming revision of the standard.
 In the meantime, excellent SGML-based approaches have been designed and implemented in the last few years.
 
 

TEI Extended Pointers

  The TEI Guidelines define an SGML-based method for describing links and spans in documents. Although not an international standard, the TEI extended pointer mechanism is an influential and elegant addressing method, which permits structural links (such as addressing children or an enumerated element occurrence). The TEI links are also notationally compact and, as they can be described in a single string, are suitable for parameter passing.
  Note also that the TEI extended pointers are being incorporated into the XML specification.
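 The idea of a structural location passed around as a single string can be sketched in Python. The notation used here ("2 chapter/1 p", meaning the first p child of the second chapter child) is a simplified stand-in invented for this illustration; it is not the TEI extended pointer syntax. The document is written in XML syntax so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET


def parse_pointer(pointer):
    """Turn '2 chapter/1 p' into [(2, 'chapter'), (1, 'p')]."""
    steps = []
    for part in pointer.split("/"):
        occurrence, name = part.split()
        steps.append((int(occurrence), name))
    return steps


def locate(root, pointer):
    """Follow each step from the root: pick the n-th child with the given name."""
    node = root
    for occurrence, name in parse_pointer(pointer):
        if occurrence < 1:
            raise ValueError("occurrences are counted from 1")
        matches = [child for child in node if child.tag == name]
        node = matches[occurrence - 1]  # IndexError if the location does not exist
    return node


# Hypothetical document used to try the pointers out.
DOC = ET.fromstring(
    "<manual>"
    "<chapter><p>a</p><p>b</p></chapter>"
    "<chapter><p>c</p><p>d</p></chapter>"
    "</manual>"
)
```

 Because the whole location fits in one string, it can be stored in an attribute value or handed to another program as a parameter.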
 
 

HyTime

  HyTime is the ISO standard for hypertext and multimedia, and is itself an application of SGML.
  Currently, the most widely implemented parts of HyTime are subsets of its hyperlinking features, which support addressing for bookmarks, annotations, and similar user-defined (meta)data coupled to SGML documents; more complex use of HyTime can be found in the field of IETMs.
  Note that one can support portions of HyTime selectively to great benefit. As an example of how the use of SGML is in transition, consider the content-oriented tagging which, together with HyTime functionality in the Topic Map processing tool, has enabled the automatic creation of the electronic equivalent of printed indexes, glossaries, and thesauri in these proceedings.
  As SGML serves as a foundation for HyTime, HyTime in turn is applied in upcoming standards for Topic Navigation Mapping (ISO/IEC CD 13250) and the Standard Music Description Language (ISO/IEC CD 10743).
 Joan Smith has written the following about HyTime: “This is the application of SGML that is destined to take information processing into the next millennium.” This is certainly true, especially as the standard is so multi-faceted and complex that it will be several years before we see large-scale deployment of any comprehensive implementations of the standard. And just as for SGML, HyTime will be put to uses its designers did not foresee.
 
 

Prediction

 You ain't seen nothing yet! We are only scratching the surface of these novel uses.
 

The Forgotten SGML Features

  The SGML standard has a number of optional features, several of which are seldom implemented (and rightly so!). However, features such as subdoc and link can be put to good use. The TEI community has also found a need for concur.
 
 

subdoc

 In SGML, a sub-document is an SGML document that assumes the current SGML declaration but has its own DTD (so the instance is a self-contained name space). This is therefore a natural unit for information re-use.
 
 

link

 The link feature (which has nothing to do with hyperlinks) lets you associate new attributes to a resulting SGML document when processing a source SGML document. This transformation process can be used for a number of things, such as associating style information or support data for the visually impaired. As the link definition is part of the DTD, all corresponding document instances are affected.
  The link feature also complements ISO 10179 DSSSL (Document Style Semantics and Specification Language).
 

Looking Forward

  The future of SGML lies in the alignment of DSSSL and HyTime, which brought about property sets, groves, and a common query language.
 
 

DSSSL

  DSSSL has a style language, which standardizes the formatting description of SGML documents, and a transformation language, to process instances. Its query language SDQL replaces HyTime's HyQ query language.
 
 

Groves and Property Sets

 Groves are an application-independent abstraction of the result of parsing, which can therefore be understood unambiguously by different applications. Effectively, there is now a model for different tools to share any piece of SGML information.
  Property sets define object classes and their properties; the SGML property set (published in the DSSSL standard) is used by DSSSL and HyTime, and will become part of the revised SGML standard. The output of an SGML parser can thus be described in these terms.
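 As a rough illustration of the idea, the Python sketch below models a parse result as nodes that each have a class and named properties. The class and property names used here (element, data, gi, text) and the sample document are simplified stand-ins chosen for this sketch; they should not be read as the definitions given in the SGML property set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the tree: a class name, named properties, and child nodes."""
    node_class: str
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def element_names(root):
    """A tool-independent question asked of the tree: which elements occur?"""
    return [n.properties["gi"] for n in walk(root) if n.node_class == "element"]


# Hypothetical parse result for a tiny document.
SAMPLE = Node("element", {"gi": "manual"}, [
    Node("element", {"gi": "title"}, [Node("data", {"text": "Pump"})]),
    Node("element", {"gi": "step"}, [Node("data", {"text": "Tighten."})]),
])
```

 Any tool that agrees on the node classes and their properties can consume such a tree without knowing which parser produced it.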
 It is exhilarating that these state-of-the-art advances are already being matched by tools (the SGML community is greatly indebted to James Clark).
 
 

Prediction

 Grove-based tools will radically change the way applications work with SGML as the formalism of groves goes hand in hand with the trend of content-oriented, dynamic, re-usable information elements. In particular, SGML documents can now become truly application independent, and tools can be devised for specific subtasks.
 

Conclusion

 Life is change. SGML and its evolutionary use reflect the new requirements that develop as a continuous process in the transition towards our common future.
  In the first decade of SGML, we have witnessed a remarkable change in the world's perception of tagged data. On a global scale, documents are coming online, and it might appear that we are reaching a critical information mass. However, arsenals of tools (search engines, agents, browsers, processors, formatters, etc.) for HTML, XML, DSSSL, HyTime, and SGML are being deployed as well, at a rate that indicates that the age of information will not be a threat, but a promise; a promise of information-centered, content-driven applications that speak the universal language of SGML.
