Pragmatic SGML-solutions in a telecommunications organization   Table of contents   Indexes   From Stone Age to Electronic Age for Aircraft Technical documentation

 
 

Microdocs, Birthrights, and Pottage Messes


 
Dave Peterson
SGMLWorks!
3 Winston Road
Lexington, Massachusetts 02173, USA
Email: davep@acm.org
Phone: +1 617 861 8475
 
Biographical notice:
 
Dave Peterson
 
Dave Peterson began working with SGML in 1986 at MIT. He was with Xyvision as Principal SGML Consultant from 1989 through 1993; he is now Principal Consultant with SGML Works!, his own firm. Dave is a Principal Member of NCITS V1 (formerly ANSI X3V1), and through ANSI is a Technical Expert representing the US to ISO/IEC JTC1 WG4, where he is active in pressing forward the revision to ISO 8879.
 
Dave's Ph.D. is in Mathematics, from the University of California (Berkeley). He has taught math and computer science at various institutes of higher education, and SGML in a variety of settings. He does lots of things with SGML, including document analysis and system design and programming for both users and system providers.
 
A document for most purposes still has a structure that is nested: the whole document, various chapters, sections and subsections, paragraphs, lists and items therein, various kinds of emphasized or otherwise special phrases, etc. The “microdocument” approach is to select certain of the smaller pieces and treat them as documents in their own right, and then build up the bigger document from microdocuments. So far, so good.
 
 

The Maybe Bad Guys: Proprietary Solutions

 
The problem comes when the bigger document is being built up. If the document management system that handles the document build isn't SGML-based, you're liable to find that this part of your document structure is tied to one vendor again. Only a document management system that produces these large documents as SGML can be counted on to keep your data system-independent.
 
It's certainly possible to get a document handling system that maintains microdocs and handles their superstructure in a proprietary way. But think about why you went to SGML in the first place: Wasn't one of the reasons that you wanted to get your data into a non-proprietary format? (Whence comes the title of this paper: Don't sell your birthright for a mess of pottage!) If you go this route, you had better be sure you have in place a means of dumping the proprietary-format information in some SGML form. Then you still effectively have everything available in non-proprietary SGML. There is nothing wrong with a proprietary system so long as you can easily get the SGML out when needed.
 
One of the better ways to manage microdocuments using SGML is to have the “build list” be an SGML document itself. This can be done in either of two ways: Treat the microdocuments as SGML “subdocuments” or strip off their document type declarations and treat them as SGML “text entities”. The remainder of this paper explains and contrasts these two methods.
 
 

Subdocument Microdocuments

 
If microdocuments are organized as subdocuments, they can (in fact, must) have their own DTDs. This can be a great advantage: Each microdoc DTD can be “lean and mean”, specialized for a particular purpose such as mathematics or a single article (e.g., for a journal). A math DTD could have sub and sup (or inf and sup) element types that allow all manner of complicated math expressions within them, without impacting the simpler sub and sup types that might be used elsewhere in ordinary text. An article DTD could concentrate strictly on the structure of articles, without worrying about all the details of journal front matter, advertisements, etc.
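As a rough illustration of the sub and sup point, here are fragments of two separate, hypothetical DTDs, shown together only for comparison (they would live in different files and govern different subdocuments). The file names, the frac, num, and den element types, and the mathmix parameter entity are assumptions made for this sketch; names are kept within the eight-character limit of the reference concrete syntax.

```sgml
<!-- math.dtd (hypothetical): sub and sup may hold nested math -->
<!ENTITY % mathmix "#PCDATA | sub | sup | frac">
<!ELEMENT math        - - (%mathmix;)*>
<!ELEMENT (sub | sup) - - (%mathmix;)*>
<!ELEMENT frac        - - (num, den)>
<!ELEMENT (num | den) - - (%mathmix;)*>

<!-- article.dtd (hypothetical): sub and sup hold characters only -->
<!ELEMENT para        - - (#PCDATA | sub | sup)*>
<!ELEMENT (sub | sup) - - (#PCDATA)>
```

Because each subdocument is parsed against its own DTD, the two declarations of sub and sup never meet, and neither content model has to accommodate the other.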
 
All of this happens because subdocuments have their own “name spaces”. The names you use in one subdocument are independent of the names you use in others. Unfortunately, this is a mixed blessing. It's sometimes convenient to reuse names with different semantics or different content models (as in the sub and sup example just mentioned), but at other times it simply leads to extreme confusion for the poor person who has to create documents and must often switch mental gears from one structure vocabulary to another. Suppose, for example, that sub meant subscript sometimes, subroutine other times, and subsection in still other circumstances.
 
Another problem arises because of the separate name spaces of subdocuments: Their ID/IDREF name spaces are separate, and an IDREF in one cannot reference an ID in another. It is likely that the forthcoming revision to SGML will address this problem, but for now it's a sticker. Of course, even with “macro documents”, one sometimes wishes to be able to reference ID-ed elements in one document from another. Currently the only standard way to accomplish this is via HyTime.
 
There is also a problem with subdocuments that has nothing to do with name spaces: subdocuments are, in SGML, treated as entities, and an entity reference to one is inherently legal anywhere in the document that a data character can legitimately occur. There is no way to control, as one does with elements via content models, what types of subdocuments can occur where. This too is being considered for the revision; there are proposals to prohibit direct entity references to subdocuments (which would be a new selectable option; right now the choice is either to disallow subdocuments altogether or to permit them to be referenced almost anywhere). The safe way to incorporate subdocument entities into a larger document is to use entity-valued attributes on an EMPTY pointer element. For the revision, the proposal is to add a way to specify the document type(s) permitted for such an entity. Currently, you cannot require that the entity be a subdocument entity, and cannot require that it be of a particular (document) type.
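A minimal sketch of the pointer-element technique might look like the following build list. The element types journal and microdoc, the attribute name src, the entity names, and the file names are all hypothetical. It also assumes an SGML declaration in which the SUBDOC feature is enabled (it is off by default) and OMITTAG is on.

```sgml
<!DOCTYPE journal [
  <!-- Hypothetical build-list DTD; all names are illustrative -->
  <!ELEMENT journal  - - (title, microdoc+)>
  <!ELEMENT title    - - (#PCDATA)>
  <!ELEMENT microdoc - O EMPTY>
  <!ATTLIST microdoc src ENTITY #REQUIRED>

  <!-- Each subdocument file begins with its own DOCTYPE declaration -->
  <!ENTITY art1  SYSTEM "article1.sgm" SUBDOC>
  <!ENTITY math1 SYSTEM "formula1.sgm" SUBDOC>
]>
<journal>
<title>Build list for one issue</title>
<microdoc src="art1">
<microdoc src="math1">
</journal>
```

Here the content model of journal controls where pointers may appear, which a bare reference such as &art1; would not allow. The limitation described above remains: the ENTITY attribute accepts any data or subdocument entity, so nothing in this DTD requires that src name a subdocument, let alone one of a particular document type.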
 
When features like these are added to SGML, we will have a safe and sane way to handle subdocument entities, and they will probably be the wave of the future.
 
 

Text Entity Microdocuments

 
Once the prolog (for most purposes, its document type declaration) is removed from a subdocument, it looks like an ordinary text entity that consists of one element. Such an entity can be incorporated into an SGML document by an entity reference; while you cannot control where the reference can occur, the entity is parsed as part of the whole document, so its contained element must conform to the DTD and must occur in a place permitted by the parent element's content model. The entity's IDs and IDREFs are visible throughout the containing document, so references will work anywhere within the document.
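The text-entity approach can be sketched as follows. The DTD, the entity names, and the file names are hypothetical; each referenced file is assumed to contain exactly one chapter element and no DOCTYPE declaration.

```sgml
<!DOCTYPE manual [
  <!-- Hypothetical single DTD covering every microdoc -->
  <!ELEMENT manual  - - (title, chapter+)>
  <!ELEMENT title   - - (#PCDATA)>
  <!ELEMENT chapter - - (title, para+)>
  <!ELEMENT para    - - (#PCDATA | xref)*>
  <!ELEMENT xref    - O EMPTY>
  <!ATTLIST chapter id    ID    #IMPLIED>
  <!ATTLIST xref    refid IDREF #REQUIRED>

  <!-- Ordinary SGML text entities: no SUBDOC keyword -->
  <!ENTITY chap1 SYSTEM "chap1.sgm">
  <!ENTITY chap2 SYSTEM "chap2.sgm">
]>
<manual>
<title>Installation Guide</title>
&chap1;
&chap2;
</manual>
```

Since the replacement text of each entity is parsed in place, a file that held, say, a bare para instead of a chapter would be reported as an error against the content model of manual.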
 
On the other hand, since IDs are visible throughout the containing document, they must be unique across all the microdocs that occur in the document. Sometimes this is a pain, because when editing one microdoc it may be hard to remember or find out what names are used in other microdocs in the same document. (And you still can't use IDREFs to reference IDs in other documents.)
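One possible way to ease the uniqueness burden, offered here only as an assumption and not as something the standard prescribes, is to prefix every ID with a short code for its microdoc. The sketch below assumes a DTD in which chapter carries an ID attribute named id and the EMPTY element xref carries an IDREF attribute named refid; the file names and ID values are hypothetical.

```sgml
<!-- chap1.sgm (hypothetical text entity, no DOCTYPE) -->
<chapter id="c1.intro"><title>Unpacking</title>
<para>Setup continues in <xref refid="c2.intro">.</para>
</chapter>

<!-- chap2.sgm (hypothetical text entity, no DOCTYPE) -->
<chapter id="c2.intro"><title>Wiring</title>
<para>Unpacking is covered in <xref refid="c1.intro">.</para>
</chapter>
```

Had both files used a plain id="intro", each would look fine when edited alone, but the assembled document would fail to parse because of the duplicate ID. Note that under the reference concrete syntax an ID value is a name of at most eight characters, which leaves little room for such prefixes.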
 
In the long run (assuming the revision ever sees the light of day), this way of handling microdocs will probably lose most of its “pro”s and retain most of its “con”s, and will fall out of favor. But for now it has the advantage that the (one, big) DTD controls where the content of the microdocs can be inserted in the big document, and within the document you can ID/IDREF between microdocs.
