The Role of Industry Standard DTDs   Table of contents   Indexes   CD 13250: SGML Applications - Topic Navigation Maps

  Brown  Bruce Eric 
  McNeill  James W. 
 

Bottoms-Up, A Paradigm Shift

 

Abstract:

  A new data modeling approach to producing SGML documents has been developed. Documents are assembled from content models, or information units, which are created and edited using common tools. These information units are collections of SGML elements, raw text, and processes, and are smaller than whole documents. When an assembly of these information units is made, the DTD and FOSI are created for use with the output document. If the information units conform to a given DTD (say the ATM 2100), then the assembled document will also conform. We start by describing some of the real issues that SGML systems face, then some of the approaches others have taken. Finally, we detail our solution and the research that is ongoing.
 

The Real Issues

 One of the authors, in a talk to Datamation Canada given in November 1996, stated that the principal issues with SGML adoption are:
 
  • Complexity
  • Lack of Theoretical Foundation
  • More an Art than a Science
  • Document Orientation
  • The Reuse Problem
  • Legacy

 Let's consider each of these in turn.
     

    Complexity

     SGML is viewed as too complex for the typical author, and SGML implementations are expensive. Architects and designers have not been able to show an acceptable ROI to justify a corporate commitment. A secondary problem is that the tool developers receive a very small portion of the cost of the implementation. The majority of the costs are associated with document analysis, application development, data conversion, and training. Many consultants have made a lot of money from this work, and their expertise is needed for a successful implementation.
     
     

    Lack of Theoretical Foundation

     SGML is not built upon any theoretical foundation. There is no "Relational Calculus" (as there is for SQL databases) driving the standard, nor any formal rigor behind the technology. This leads to two issues. First, there is little interest at the university level in furthering the technology, its application, or its enhancement, or in developing algorithms that solve problems using it. Second, the subject is not being taught in the mainstream of education. You can take classes at community colleges and learn SGML, but at the University of California in Berkeley SGML is taught by a consultant as a continuing education class, not for degree-seeking students.
     
     

    More an Art than a Science

      Document analysis is more an art than a science. If you take four data modelers and present them with a problem they will, in general, come up with the same solution. However, if you take four document analysts and ask each to develop a DTD, the four DTDs will probably all be different. All will be valid, but they will be different. There is technology to prove that a document instance conforms to a DTD, but there is no technology to prove that the DTD is an optimum representation of the data. And there does not seem to be anyone really interested in making this a science.
     
     

    Document Orientation

      The focus within SGML is on the document. We build DTDs, validate documents, parse documents, and process documents. But documents may no longer be the relevant information unit to manage. In the world of electronic delivery, personalized documents, and interactive information, focusing on the document rather than the document components may not be appropriate.
     
     

    The Reuse Problem

     One of the strengths claimed for SGML was data reuse, and clearly, at the document level, this reuse is providing significant benefits. However, it has not solved the problem of data reuse at the element level. Reuse at the element level requires a focus on standardizing and minimizing the number of lower-level content models used in a family of documents. There is currently nothing in the SGML analysis process that focuses on this issue.
     
     

    Legacy

     Probably the largest stumbling block in implementing SGML applications is legacy. Organizations tend to want to preserve their unstructured documents, which contain style markup, in the new SGML applications. This preservation causes the application to be overly complex and expensive, and it seldom achieves the desired legacy results. It is akin to building a database application to replace an old sequential file application while insisting that the reports be identical to those produced today and that the tables be maintained on magnetic tape.
     

    Traditional SGML approach

      Traditionally, the first step in creating an SGML publishing environment is analyzing the types of documents that are to be published. Each document may have a different structure and a different use. Understanding the use and structure leads to the development of the preliminary DTD (Document Type Definition). This is true both for new documents and for legacy documents to be converted into an SGML system. Once the structure and use are determined at some level, the content models can be further developed and refined. Then the output specifications can be started. The authoring environment should also be considered, as it will affect how the data is created and, for legacy documents, the conversion from old to new. These steps are outlined in most SGML texts. This is the top-down approach for which SGML was designed.
      If a wide variety of documents are to be put together using a single DTD, then the DTD may become very large, with many options. The larger the DTD, the less manageable it becomes because of the complexity. Maintainability and flexibility may be lost. Reusing portions of documents becomes problematic, and the whole project may become bogged down and ultimately fail.
     

    Microdocument Approach

     The Microdocument approach is an attempt to fix some of the issues raised by the traditional approach. Omnimark Technologies (formerly Exoterica) has published this in <TAG>, and a white paper is available from their Website. Their approach is to store portions of the document in a relational database. Reuse is their key interest. A personalized version of a document can be created by establishing the document components needed, retrieving them from the database, concatenating them, and making the resulting document available to the user.
     One example that they discuss is their installation guide. All of the installation information for each module, on each platform, that a customer may buy is stored in the database. To create the specific document for installation, the user will answer a series of questions as they walk the tree of all possible combinations of installations. The number of combinations is over 600. At the conclusion of this phase the document pieces are retrieved and the customized document is built and delivered.
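     The retrieve-and-concatenate step described above can be sketched as follows. This is only an illustration, not Omnimark's implementation: the `COMPONENTS` store, its (module, platform) keys, and the `build_guide` function are hypothetical, and a plain dictionary stands in for the relational database.

```python
# Hypothetical component store keyed by (module, platform).
# In the approach described above this would be a relational database.
COMPONENTS = {
    ("core", "unix"): "Install the core module on UNIX.",
    ("core", "windows"): "Install the core module on Windows.",
    ("editor", "unix"): "Install the editor on UNIX.",
}


def build_guide(modules, platform):
    """Concatenate the stored pieces that match the user's answers."""
    pieces = [
        COMPONENTS[(module, platform)]
        for module in modules
        if (module, platform) in COMPONENTS
    ]
    return "\n".join(pieces)


# The answers to the questionnaire select which pieces are retrieved.
guide = build_guide(["core", "editor"], "unix")
```

     Each combination of answers selects a different set of pieces, so the number of possible guides grows with the number of modules and platforms while each piece is stored only once.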
     

    SUBDOC

     The SUBDOC concept of the SGML standard is one solution to re-use. Eliot Kimber presented a paper on this at SGML '96. His argument is that if re-use is needed, then the object or element must be self-contained. Arguing that object-oriented programming has shown us how to encapsulate data, he demonstrates that SGML documents are self-contained and self-describing. Portions of a document do not have these properties.
     For tools to use portions of a document, the tools would have to be modified to place each subdocument in its context so that it becomes self-contained.
     Kimber argues that following some rules during the development of documents will make this re-use more feasible. Further, some of the HyTime constructs will make the subdocuments more usable and re-usable.
     

    Fragment Interchange -- SGMLOpen

     In 1996 the SGML Open Industry Consortium issued a Technical Resolution for Fragment Interchange. Its stated goal is to define a way to send fragments of an SGML document. Many users have wanted to view or edit one or more entities within a document and have no interest in seeing the whole document. The Resolution defines a way to extract an arbitrary part of a document, together with its context, for transmittal to an external user. It does not consider bringing the changed or edited data back into the original.
     

    Our Approach

      Our research has been aimed at building documents from the bottom up. Instead of working from the top down and first defining the complete DTD, we start with units of information that can be re-used. These Information Units may be a single SGML element or anything larger, up to a complete document. The Information Units are stored in a database. Virtual Documents are formed from combinations of Information Units, and when these are assembled, the Output Document is created. Then the DTD is created from the DTD fragments, if they exist, and the user can define the FOSI for the final document. We believe that this approach will allow for faster implementation, simpler tools, and real re-use of data. In the next section we describe the process and then describe an example.
     

    Process

      The user first decides what the Information Unit should be. When starting with SGML data, this is usually done by parsing the DTD. The user determines the granularity of the Information Unit based upon that DTD. If we are starting from scratch, having no data, then we define a fragment of a DTD that describes the content of the data that we call the Information Unit. This may be as simple as a paragraph with a heading. If we are not using an SGML editing system, then the unit may just be a block of text. Once the Information Unit is defined, the system allows a database or repository to be populated with these units.
     Another type of Information Unit can also be defined. This is a Process Information Unit. Instead of containing textual data to appear in a document, it describes a process. The process creates data that will appear in the resulting document. For example, a document detailing an investment may want to have a table of the invested stocks. The process would take the investor's account number, query an SQL database for the account information, then return the table to the process for inclusion in the document.
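     A Process Information Unit of the kind just described might look like the sketch below. It is an assumption-laden illustration rather than our prototype: the `holdings` table, its columns, the account numbers, and the `<table>`/`<row>`/`<entry>` element names are all hypothetical, and an in-memory SQLite database stands in for the investor's account system.

```python
import sqlite3


def holdings_table(conn, account_no):
    """Run the unit's query and render the rows as a table fragment."""
    rows = conn.execute(
        "SELECT symbol, shares FROM holdings WHERE account = ? ORDER BY symbol",
        (account_no,),
    ).fetchall()
    body = "".join(
        f"<row><entry>{symbol}</entry><entry>{shares}</entry></row>"
        for symbol, shares in rows
    )
    return f"<table>{body}</table>"


# Hypothetical in-memory database standing in for the account system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE holdings (account TEXT, symbol TEXT, shares INTEGER)")
conn.executemany(
    "INSERT INTO holdings VALUES (?, ?, ?)",
    [("A-100", "XYZ", 50), ("A-100", "ABC", 20), ("B-200", "XYZ", 5)],
)

# The fragment is generated at assembly time, not stored as text.
fragment = holdings_table(conn, "A-100")
```

     The point of the design is that the unit stores the query, not its result, so the same unit yields a different table for each account.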
     Each Information Unit has some meta-data associated with it. The meta-data may be element attributes from an SGML instance or data that the user wants to keep for reference to the unit. When building a Virtual Document, we start by searching the meta-data to decide whether an Information Unit is relevant. Storing the meta-data and the Information Unit where we can query them allows us to build up documents starting with a query.
     The results of querying the meta-data are displayed in a GUI so that the user can drag and drop the desired Information Units into the Virtual Document. If the query results already form the desired document and no rearrangement is needed, they can be used directly to create the Virtual Document. We believe that full-text search of the Information Units may help in finding all of the relevant units, but we have not implemented this in our research.
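     As a minimal sketch of this idea, the following models an Information Unit with its meta-data and selects units by exact match on meta-data values. The `InformationUnit` class, the `query_units` function, and the sample meta-data keys are hypothetical, and a Python list stands in for the database.

```python
from dataclasses import dataclass, field


@dataclass
class InformationUnit:
    unit_id: str
    content: str                       # SGML fragment or plain text
    metadata: dict = field(default_factory=dict)


def query_units(units, **criteria):
    """Return the units whose meta-data matches every given key/value pair."""
    return [
        unit
        for unit in units
        if all(unit.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical repository contents.
repository = [
    InformationUnit("u1", "<para>Remove the access panel.</para>",
                    {"system": "hydraulics", "type": "task"}),
    InformationUnit("u2", "<para>Check the fluid level.</para>",
                    {"system": "hydraulics", "type": "inspection"}),
    InformationUnit("u3", "<para>Replace the lamp.</para>",
                    {"system": "lighting", "type": "task"}),
]

# The ordered result is a candidate starting point for a Virtual Document.
hits = query_units(repository, system="hydraulics")
```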
      When the Virtual Document is completely defined, the user can move on to the assembly process. In the document assembly process we traverse the Virtual Document, retrieving from the database the Information Units and any DTD information that is stored. The document instance is assembled and the assembled DTD is made. We validate the document against the DTD. When errors occur, we mark them and let the user go to an SGML editor to fix either the DTD or the instance. Changes made here to the DTD fragments or the Information Units are stored back in the database as versioned copies of the Information Unit. The Virtual Document list is updated and stored in the database. The Output Document is then ready for printing or other processing.
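     The traversal and assembly steps can be sketched as below. This is an illustration under stated assumptions: the `REPOSITORY` dictionary, the unit identifiers, the `doc` root element, and both function names are hypothetical. The final check only reports elements that are used but not declared; it stands in for, and is far weaker than, validation by a real SGML parser.

```python
import re

# Hypothetical repository: unit id -> (instance fragment, DTD fragment).
REPOSITORY = {
    "intro": ("<para>Overview.</para>",
              ["<!ELEMENT para - - (#PCDATA)>"]),
    "step1": ("<step><para>Open the valve.</para></step>",
              ["<!ELEMENT step - - (para+)>",
               "<!ELEMENT para - - (#PCDATA)>"]),
}


def assemble(virtual_document, root="doc"):
    """Walk the Virtual Document and build the instance and the DTD."""
    instance_parts, declarations = [], []
    for unit_id in virtual_document:
        fragment, dtd_lines = REPOSITORY[unit_id]
        instance_parts.append(fragment)
        for line in dtd_lines:
            if line not in declarations:   # keep one copy of each declaration
                declarations.append(line)
    instance = f"<{root}>" + "".join(instance_parts) + f"</{root}>"
    return instance, declarations


def undeclared_elements(instance, declarations, root="doc"):
    """Crude stand-in for validation: list elements used but never declared."""
    declared = {re.match(r"<!ELEMENT\s+(\S+)", d).group(1) for d in declarations}
    declared.add(root)                     # the root is supplied by the assembler
    used = set(re.findall(r"<([A-Za-z][\w.-]*)", instance))
    return sorted(used - declared)


instance, dtd = assemble(["intro", "step1"])
problems = undeclared_elements(instance, dtd)   # empty list when all is declared
```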
     

    Results

      In this design, the DTD is not determined beforehand, but rather created by combining each Information Unit's partial DTD. To handle conflicts in the naming of elements, a name-mangling algorithm is used to assure uniqueness. When tested on data that originates with an existing DTD, we have seen that the resulting DTD is valid and a subset (although we are not sure that it is a proper subset) of the original. The Output Document also validates against the original DTD, leading us to conclude that in this instance, we did not corrupt the data.
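     The paper does not spell out the name-mangling algorithm, so the following is only one plausible scheme, given as an assumption. When a later unit declares an element name that is already present with a different content model, the later declaration is renamed with a per-unit suffix. The `merge_declarations` function and the `.uN` suffix convention are hypothetical.

```python
def merge_declarations(fragments):
    """Merge per-unit element declarations, renaming conflicting names.

    fragments: one dict per Information Unit, mapping element name to
    content model. Returns (merged, renames), where renames[i] maps the
    original names in unit i to their mangled replacements.
    """
    merged = {}
    renames = []
    for index, fragment in enumerate(fragments):
        mapping = {}
        for name, model in fragment.items():
            if name not in merged:
                merged[name] = model
            elif merged[name] != model:
                # Same name, different content model: mangle the later one.
                mangled = f"{name}.u{index}"
                merged[mangled] = model
                mapping[name] = mangled
        renames.append(mapping)
    return merged, renames


# Two hypothetical units that disagree about what a "title" contains.
fragments = [
    {"title": "(#PCDATA)", "para": "(#PCDATA)"},
    {"title": "(para+)", "para": "(#PCDATA)"},
]
merged, renames = merge_declarations(fragments)
```

     A complete implementation would also have to apply each unit's renames to its instance text and to any content models that refer to the renamed elements, which this sketch leaves out.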
     

    Conclusions

      This research is ongoing. We started with an object-oriented database and built a prototype, then built a second prototype utilizing Documentum as the repository. We have seen that the re-usability of the data is increased. The complexity of the problem seems to be reduced, as the large analysis phase for an SGML system has been reduced. The user is concerned only with Information Units and their DTD fragments, not with the resulting documents. The design then goes faster, with less complexity.
     Our work has shown that orienting the system to usable Information Units solves many of the legacy, reuse, and document orientation problems described at the beginning of this paper. It does reduce the complexity, but it still does not address the lack of a theoretical foundation or make SGML document systems into a science rather than an art. As the work is ongoing, we expect to be able to report on more success in the future.
     

    BIBLIOGRAPHY

     
  • [Colby, Martin and Jackson, David S.] "Special Edition Using SGML", Que Publishing, Indianapolis, Indiana, 1996, pp. 71-86
  • [Skinner, Eric and McFadden, John] "Microdocument Database Architectures", <TAG>, The SGML Newsletter, Boulder, Colorado, October 1996, Vol. 9, No. 10
  • [URL:] HTTP://www.omnimark.com/resources/white/hddb
  • [Kimber, W. Eliot] "Re-Usable SGML: Why I Demand SUBDOC", in GCA SGML '96 proceedings (electronic copy), November 1996.
  • [DeRose, Steve, and Grosso, Paul] "Fragment Interchange", SGML Open Technical Resolution 9601:1996, available http://www.sgmlopen.org/sgml/docs/a601.htm

 SGML Europe '97
