An SGML-based Office Document Exchange and Management   Table of contents   Indexes   Imposing Intelligence on Graphics: Using HyTime Hyperlinks with Non-SGML Data

 
 

Defining Reusable, Distributable Information Objects Using XML-Data Schemas


 
Dianne Kennedy
Consultant
XMLXperts, Ltd. / GCA
146 NorthEnd Ave.
Elmhurst, Illinois, USA 60126
Phone: 630-941-8197
Fax: 630-941-8196
Email: dkennedy@gca.org Web: www.xmlxperts.com
 
Biographical notice:
 
Dianne Kennedy
 
Dianne Kennedy is an independent Publishing Systems Consultant. As a consultant, Ms. Kennedy delivers SGML/XML tutorials and offers document engineering and publishing system design consulting services.
 
Ms. Kennedy has been an active participant in many industry SGML standards activities. Currently she serves as chairperson for the DTD Working Group of SAE J2008 for the automotive industry. She is also convener of WG6 for ISO 12083, the SGML standard for the coding of articles and books.
 
Dianne Kennedy has worked with the Graphics Communication Association since 1984 to introduce SGML and now XML to the publishing community through the GCA Information Technologies Tutorial Series. She currently serves as a program consultant to GCA, is editor of GCA's XML Files Magazine, and is executive director of GCA's Independent Consultants Cooperative.
 
ABSTRACT:
 
Re-use of information has always been the promise of SGML. However, only now are we beginning to realize this benefit of SGML systems. This paper traces the evolution in SGML design methodologies from document-based design, to document family design, to micro-document design. The paper concludes with a case study of content management in the automotive industry and introduces how the new XML-Data specification may help make content management a reality.
 
 

Early SGML Design

 

Early SGML design was focused on the document. Although we all understood the potential to address, interchange, and re-use information as the key reasons for investing in SGML, the early designs rarely enabled these lofty goals. Instead most early SGML systems had one immediate function -- that of publishing the data. Hence early DTDs were most often developed based upon a print document model. Not only were the DTDs riddled with tags and attributes to assist us with formatting (this was pre-DSSSL), but each document was analyzed as an isolated exercise.
 
In this early SGML environment, the document analysis technique tended to be a top-down breakdown of a single document. Breakdown typically provided a document hierarchy that reflected the published document (chapter, section, subsection) with very little content tagging that was not required to reproduce the print output. By this, I mean that part numbers or consumables were not tagged unless the tags were required to trigger some particular associated output format.
 
Organizations that were early adopters of SGML often made the mistake of assigning different teams to develop different document type definitions. Often these teams were from functionally different areas of the organization and did not communicate with other teams doing SGML development. The result was a series of standalone DTDs which did not even share the same base tag set. Clearly the goal of re-use or data sharing was not realized in these early environments.
 
 

Considering Document Families

 
As we became more experienced with SGML and as the SGML toolset grew in sophistication, organizations wanted to realize the goal of sharing content. However, the discrepancies in tag sets made this task impossible. During this era several things happened:
  • DTD Harmonization/Standardization within Organizations
  • Analysis of Document Families
  • DTD Harmonization
     

    Harmonization activities were often huge projects. I have worked with organizations which, because of separate geographic sites and acquisitions of smaller publishers, have had as many as 40 DTDs, few of which were developed using any standard methodology or baseline tag set. Harmonization projects attempted to set standards and to retrofit existing DTDs to those standards. As you can imagine, for a large publisher this is quite a costly and time-consuming undertaking. Yet harmonization was critical to the ability of the organization to share and re-use information in keeping with the vision for SGML.
     
    Following harmonization, new DTDs were developed using a method of analysis which attacked entire document families. This methodology required that an entire class of documents be analyzed. Shared data constructs were determined and then modeled. Finally, document-specific DTDs were developed. Of course, these DTDs incorporated the document "building blocks" which were developed in the analysis of the overall document family.
     
    It is important to point out that even with analysis focusing on document families and with element nomenclature and content models harmonized, the ability to re-use content did not directly result. Typically, sharing was accomplished by a "cut and paste" sort of activity rather than by storing individual content elements and just pointing to them where appropriate. The "cut and paste" method of data re-use is tedious, and updating shared data over time proved quite troublesome.
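The difference between "cut and paste" and pointing to stored content can be illustrated with a minimal Python sketch. All names here (`shared`, `manual_a`, `manual_b`, `render`, the `warn-battery` identifier) are hypothetical and are not taken from any DTD or system discussed in this paper; the sketch only assumes that shared content is stored once and that documents hold references to it.

```python
# Shared content is stored once, keyed by an identifier (hypothetical data).
shared = {"warn-battery": "Disconnect the battery before servicing."}

# A document is a list of items: literal text, or a ("ref", id) pointer
# into the shared store. Nothing is copied into the document itself.
manual_a = ["Alternator removal.", ("ref", "warn-battery")]
manual_b = ["Starter removal.", ("ref", "warn-battery")]


def render(document, store):
    """Resolve every pointer against the store and return the final text lines."""
    return [store[item[1]] if isinstance(item, tuple) else item
            for item in document]


# One update to the stored element reaches every document that points to it.
shared["warn-battery"] = "Disconnect the negative battery cable before servicing."

print(render(manual_a, shared))
print(render(manual_b, shared))
```

With copied text, the same revision would have to be found and repeated in each document by hand, which is the maintenance problem described above.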
     
     

    New SGML Approaches to Content Management

     

    Currently a new approach to SGML development is gaining widespread attention. This method focuses on the creation of "micro-documents". Introduced by Omnimark Technologies, the idea is that no DTD should be longer than 10 or so elements. In other words, we will create DTDs for data objects, not for entire documents. Using this approach we could store the content objects along with metadata about the object. Then documents could be built on the fly based upon responses to queries on the metadata.
     

    In industries where re-use of elements is critical, this new methodology is being used. Along with the micro-documents is the requirement for a data model which will define and manage the metadata which is used to build virtual documents. The data model shows relationships between the information objects. Such an information object is often called an MRU (Minimum Revisable Unit). This concept of the MRU is used in the automotive and trucking industry, where it is known as a "Service Information Element", and in the airline industry, where it is called an "Anchor".
     
    Using the micro-document approach to document creation, the information objects are authored in SGML and metadata is attached, usually via a database. Documents are created last. They are assembled from the database, according to metadata characteristics. This sort of SGML design is ideal for personalized publishing where information is pulled together to meet an end-user's exact information requirements.
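As a rough illustration of assembling a document from metadata, the following Python sketch keeps each information object as an opaque SGML string with a small metadata dictionary and selects objects by query. The class `InfoObject`, the function `assemble`, the element names, and the metadata keys (`system`, `audience`) are all assumptions made for this example; a real system would hold the metadata in a database as described above.

```python
from dataclasses import dataclass, field


@dataclass
class InfoObject:
    """One small, separately authored content object plus its metadata."""
    object_id: str
    content: str                      # SGML fragment, treated as an opaque string
    metadata: dict = field(default_factory=dict)


def assemble(repository, **criteria):
    """Join the content of every object whose metadata matches all criteria."""
    selected = [
        obj for obj in repository
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]
    return "\n".join(obj.content for obj in selected)


# Hypothetical repository contents, for illustration only.
repository = [
    InfoObject("obj-1", "<proc>Drain the coolant.</proc>",
               {"system": "cooling", "audience": "technician"}),
    InfoObject("obj-2", "<proc>Replace the brake pads.</proc>",
               {"system": "brakes", "audience": "technician"}),
    InfoObject("obj-3", "<note>Check coolant level monthly.</note>",
               {"system": "cooling", "audience": "owner"}),
]

# The document is created last, from a query on the metadata.
print(assemble(repository, system="cooling", audience="technician"))
```

Changing the query (for example, `audience="owner"`) yields a different virtual document from the same stored objects, which is the personalized-publishing case.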
     
     

    Content Management in the Automotive Industry

     

    In the automotive industry, we began to develop an industry standard DTD based on a document model in 1990. It became clear that since there was no industry standard for documenting automotive service information, using a document model for our SGML DTD would be nearly impossible. Actually this was quite fortunate for us. It forced us to look to a new model; and in doing so we not only created an interchange standard, but a method to manage our content as well.
     
    Rather than beginning with a document for our SGML model, the automotive industry developed a relational data model. Today this model accounts for all automotive and heavy truck service information and is made up of over 100 relational tables. The DTD which will be balloted later this year as an SAE standard is a mechanism for interchanging our relational table information and the information objects.
     

    This new design allows for data elements called "Service Information Elements" (SIEs) to be authored individually. Metadata for each SIE is defined according to the J2008 data model and stored in a database. For interchange, both the object and the relational tables must be interchanged.
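The requirement to interchange both the object and its relational metadata can be sketched in Python with the standard `sqlite3` module. The table names (`sie`, `sie_vehicle`), their columns, and the function `export_package` are invented for this illustration and do not reproduce the actual J2008 data model, which comprises far more tables.

```python
import sqlite3

# In-memory database with two hypothetical tables: the SIE content itself
# and one relational table recording which vehicles each SIE applies to.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sie (
        sie_id  TEXT PRIMARY KEY,
        content TEXT NOT NULL
    );
    CREATE TABLE sie_vehicle (
        sie_id     TEXT REFERENCES sie(sie_id),
        model      TEXT,
        model_year INTEGER
    );
""")
conn.execute("INSERT INTO sie VALUES (?, ?)",
             ("sie-001", "<sie>Torque the wheel nuts.</sie>"))
conn.executemany("INSERT INTO sie_vehicle VALUES (?, ?, ?)",
                 [("sie-001", "Model A", 1998), ("sie-001", "Model B", 1998)])


def export_package(connection, sie_id):
    """Bundle one SIE with the relational rows that describe it."""
    content = connection.execute(
        "SELECT content FROM sie WHERE sie_id = ?", (sie_id,)
    ).fetchone()[0]
    rows = connection.execute(
        "SELECT model, model_year FROM sie_vehicle WHERE sie_id = ? ORDER BY model",
        (sie_id,),
    ).fetchall()
    return {"sie_id": sie_id, "content": content, "applies_to": rows}


package = export_package(conn, "sie-001")
print(package)
```

Sending only `content` would lose the applicability information, and sending only the rows would lose the text, so the package carries both.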
     
     

    XML-Data Schemas for J2008

     

    In 1998 the idea of XML-Data was introduced to W3C by Andrew Layman of Microsoft Corporation. XML-Data is a schema language which can describe any structured data. This includes text as well as relational and object data. The idea is to interchange XML data along with the schema so that a receiving system will know what the data is and how data elements are related to one another, and can be automatically configured to use the data delivered to the desktop.
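The general idea of data travelling together with its schema can be sketched in Python using the standard `xml.etree.ElementTree` module. The markup below is a deliberately simplified, made-up vocabulary (`package`, `schema`, `elementType`, `field`) and is not the syntax defined by the XML-Data specification; the function `load` and the `CONVERTERS` table are likewise assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical package: a schema section followed by the data it describes.
PACKAGE = """
<package>
  <schema>
    <elementType name="part">
      <field name="number" type="string"/>
      <field name="quantity" type="int"/>
    </elementType>
  </schema>
  <data>
    <part><number>PN-100</number><quantity>4</quantity></part>
    <part><number>PN-200</number><quantity>1</quantity></part>
  </data>
</package>
"""

# Maps the type names used in the hypothetical schema to Python converters.
CONVERTERS = {"string": str, "int": int}


def load(package_xml):
    """Read the schema first, then use it to interpret the data records."""
    root = ET.fromstring(package_xml)
    types = {
        element_type.get("name"): {
            f.get("name"): CONVERTERS[f.get("type")]
            for f in element_type.findall("field")
        }
        for element_type in root.find("schema").findall("elementType")
    }
    records = []
    for element in root.find("data"):
        fields = types[element.tag]   # the receiver learns the structure from the schema
        records.append({name: convert(element.findtext(name))
                        for name, convert in fields.items()})
    return records


print(load(PACKAGE))
```

The receiving code contains no knowledge of `part`, `number`, or `quantity`; it configures itself from the schema that arrives with the data.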
     
    To us in the SAE J2008 world this is a powerful idea. We have spent 7 years trying to perfect a way to interchange data and specify its relationships in SGML. And although we have been quite clever in the standard, we still require some 250 pages of documentation to describe how each SGML element and attribute relates to the other elements and how these are expressed in the relational model.
     
    Employing XML-Data schemas would solve many of our current problems. We would have a single schema language to precisely express our data model and the text constructs as well. Currently, key members of the SAE J2008 SGML working group are designing an XML-Data schema which we believe will constitute the next generation of our industry standard and will make our goal of content management a reality.
