| Schreier Richard A. |
Supporting SGML in Document Management Systems |
Abstract: |
| Most Document Management System architectures can be categorized by the ability to handle and organize information of different kinds. Supporting information based on the Standard Generalized Markup Language (SGML) involves unique requirements that bear on the tasks of managing structured documents. |
| This report overviews approaches to support SGML documents in a number of Document Management System architectures that were candidates to be used in an actual publishing system supporting the publishing and re-purposing of shared information for technical manuals. This publishing system supports content- and presentation-oriented SGML documents for a supplier of military equipment to a Canadian Department of National Defence (DND) Project Office. |
Introduction |
| There are different perspectives by DMS (Document Management System) vendors regarding how an electronic document can be divided into constituent components and what the resulting set of objects managed within a DMS is for a single document. DMS objects, or document components, can be, among other operations, created, manipulated, shared, managed, related to other objects within the system and deleted by users of the system. Depending on the granularity of recognizable structure in the documents, different types of internal DMS objects can be created from and combined into external documents. |
| The use of SGML (Standard Generalized Markup Language), ISO-8879, to describe the hierarchical structure of the component parts of a document brings to a DMS the ability to recognize a very fine level of granularity defined by the model of the information. The information model represented by the hierarchical structure is described unambiguously in SGML syntax in the document's DTD (Document Type Definition). This report characterizes DMSs into one of three architectural classes based on their ability to recognize granularity of structure. |
| NOTE: |
| The presentation slides for this paper can be viewed at http://www.microstar.com |
| A publishing system that is built on an SGML based architecture has been implemented for a defence contractor supplying technical publications to the Canadian Department of National Defence (DND). One of the system's components is the DMS, responsible for maintaining the SGML coded source material used to publish the books. During the product selection process of the implementation contract, several commercially available DMS products were evaluated. These architectures handle SGML coded information in different ways, each with a different impact on the publishing system implementation. |
| Implementation requirements for the project were specified as a set of document management facilities, independent of the underlying database architectures upon which the DMS products were built. While the vendor selection criteria included underlying database issues, these criteria were not considered aspects of the DMS ability to manage SGML coded documents to satisfy the information management requirements of the publishing system architecture. |
| The analysis of these different DMS architectures can support the decision regarding which DMS architecture is necessary, or desirable, in the implementation of other SGML-based information publishing solutions. |
Project Background |
| At the time that implementation of this publishing architecture began, other aspects of the project had been in development for two years. |
| In 1994, DND released the DND CALS (Continuous Acquisition and Life-cycle Support) DTD, a content-oriented information model, designed to maintain the raw material used in Engineering and Maintenance Technical Information Products. This model was derived from an analysis of the specifications for 11 standardized printed manuals that DND requires manufacturers of equipment to supply. |
| In 1995, a defence contractor subcontracted the design of a publishing architecture capable of producing the required standardized printed manuals from an information store conforming to the DND CALS DTD. This architecture satisfies the original contractual requirement to supply the printed publications as well as the request from the DND CALS Office to provide the basic engineering and maintenance information in a prescribed SGML format. The DND can then use the information supplied by the contractor for purposes other than the production of printed manuals; for instance, the life-cycle management of the information. |
| In 1996 this contractor subcontracted the implementation of the publishing architecture, including the vendor selection for system components, one of which is the DMS used to maintain the SGML coded information. |
| This analysis summarizes the different approaches to managing SGML-coded information based on the different architectures available from DMS vendors considered for this implementation project. This analysis does not refer to specific products or vendors by name. During vendor demonstrations some products performed admirably against their advertised functionality, while others failed to deliver on their promises, and yet others failed to support the required SGML-based functionality at all. All packages appear to be under constant revision; any subsequent implementation of this publishing architecture will require re-evaluation of the releases current at that time. |
DMS Architectures |
| The granularity of information maintained by a DMS is indicative of the DMS architecture as it applies to supporting SGML-based information. The granules are the objects being managed in the DMS, based on what has been defined as the user's view of input. The DMS functionality can be analyzed independent of whether these objects are stored in a file system or in an underlying database. The three architectures summarized here support, respectively, file-level, fragment-level and element-level granularity of document dissection. |
| For many years the only definition of a document to many DMS vendors has been a complete or full instance contained in a single file of a specified or perhaps arbitrary file type. The file is managed from the point at which it is created to the point at which it is disposed. For many vendors, these file-level granules are the only kind of objects maintained in their products. For these products, support of SGML entails only supporting a complete SGML instance. There is no inherent structural relationship between objects in this DMS architecture, although users may group the objects into folders and the folders themselves may be grouped into other folders. |
| Being able to refine the information to a finer level of granularity comes from being able to recognize inherent structure within the file. Using this structure, files imported into a DMS can be decomposed into granules that are stored as objects and these files can also be reconstituted from the objects being stored. Recently, some vendors of file-level granularity DMS products have begun announcing support for compound documents, recognizing a one-level hierarchy of structure in the components of word processing documents. |
| Some vendors have announced SGML based or SGML aware DMS products that recognize the SGML syntax and instantiate multiple DMS objects from a single SGML document based on recognized granules at differing levels of granularity. |
| Innate in an SGML document is the content model of the DTD upon which the document is based. This model describes the syntactic breakdown of a document to a hierarchy of constituent parts, specifying the allowed ordering and cardinality of all parts and any parts of parts. Once an instance of the model is brought into the DMS as a set of internal DMS objects, proper maintenance of the object hierarchy requires that the user not be able to violate the SGML rules of order and cardinality as specified in the document's DTD. The proper emission of a complete document requires component parts to be reconstituted conforming to the document's concrete syntax and DTD using an appropriate document prologue. |
| Some SGML based products maintain portions of a document as individual DMS objects, each object representing an element and some or all of the constituent elements for that element. A document dissected at this fragment-level granularity is maintained as a hierarchy; thus, these objects may be entirely self-contained (as leaves of the hierarchical tree), or may reference other constituent objects (as branches). |
| A fragment could be either an MRU (Minimum Revisable Unit) or a referenced general entity. An MRU is typically defined such that a user who needs to manipulate any part of an MRU is obliged to manipulate the entire MRU, ensuring that sufficient contextual information is always maintained for all parts. A general entity reference is an SGML mechanism for referencing data maintained externally to the element in which the reference is found. This data may even be external to the document itself, and is typically referenced this way when the data is not text. Both MRU and general entity types of fragments can contain either or both element content and references to other fragments. The difference between an MRU and a general entity is found in the external representation of the document. |
| The MRU boundaries cannot be seen in the external document because the SGML syntax is seamless, with each fragment in place in its containing fragment without any evidence of a boundary. This requires that MRU boundaries be defined outside the document, typically as part of the document type characterization in the DMS. When checking out and manipulating an MRU fragment, the editing tool requires all referenced MRUs to be expanded for parsing purposes, but has no mechanism to preserve the fragment hierarchy. The DMS tool must be able to recognize any descendent MRU boundaries when the fragment is checked back in to the store. |
| The general entity boundaries can be seen in the general entity references in the element content of the external document. Where these general entities are external to the document, the identifiers for the fragments must be de-referenced through SYSTEM or PUBLIC constructs. Using SYSTEM constructs is not necessarily portable across different hardware and software platforms and using PUBLIC constructs requires supporting catalogues for correctly de-referencing the locations of components. The term for this type of hierarchy of document parts is the entity structure of the document. When manipulating a general entity fragment, the editing tool may choose to de-reference any contained fragments only for the purposes of parsing and need not present the referenced fragments unless required by the user. Some editing tools may be able to skip over an entity reference if it can be instructed to somehow restore the parsing state for data that follows the reference. |
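| The de-referencing of PUBLIC identifiers through a catalogue can be sketched as follows. This is a minimal illustration in Python, not taken from any product reviewed: the function names `load_catalog` and `resolve`, the sample identifier and the file locations are hypothetical, and only a simplified subset of catalogue entry syntax (one `PUBLIC "identifier" "location"` entry per line, no comments) is assumed. |

```python
import shlex


def load_catalog(text):
    """Read simplified catalogue entries of the form: PUBLIC "public-id" "location"."""
    mapping = {}
    for line in text.splitlines():
        tokens = shlex.split(line)  # honours the quoted identifiers
        if len(tokens) == 3 and tokens[0].upper() == "PUBLIC":
            mapping[tokens[1]] = tokens[2]
    return mapping


def resolve(public_id, system_id, catalog):
    """Prefer the catalogue mapping for a PUBLIC identifier, else fall back to SYSTEM."""
    if public_id is not None and public_id in catalog:
        return catalog[public_id]
    if system_id is not None:
        return system_id  # platform-dependent: may not be portable
    raise LookupError(f"cannot de-reference entity: {public_id!r}")


catalog = load_catalog('PUBLIC "-//EXAMPLE//DTD Manual//EN" "dtds/manual.dtd"\n')
location = resolve("-//EXAMPLE//DTD Manual//EN", "manual.dtd", catalog)
```

| In this sketch the catalogue entry takes precedence over the SYSTEM identifier, so the same document can be moved between platforms by replacing only the catalogue. |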
| Some SGML based products maintain each and every element of a document as individual DMS objects, with each object representing only that which is required to reconstitute a single element or reference. The term for this element-level granularity hierarchy of document parts is the element structure of the document. |
| In this architecture, the DMS must be able to recognize the markup for correct element boundaries on new documents being added to the store. Proper version control of objects requires the DMS to recognize which elements are being checked in to the store after having been checked out. Mechanisms vary regarding how a particular DMS product identifies objects and how it relies on the SGML editor to hide these mechanisms to protect the user from inadvertently causing elements to lose their identity. |
| Innovative DMS architectures for supporting SGML based information are being announced by vendors. At the time that the DMS selection was being made in this case study, a new version of a commercial programming language had just been announced with which one can create a DMS architecture supporting a hybrid of relational and structured text information mechanisms. Document models characterized by a large amount of repeated information are candidates for successful exploitation of this hybrid system. |
SGML Document Components |
| An SGML document is made of three distinct components: the SGML declaration, the document type declaration (which includes the DTD), and the document element (which includes all the information content). The SGML declaration defines the lexical environment for the document, including the allowable character sets and characters within those sets, the definition of delimiters, the quantities and capacities, and other parameters. When not specified, the reference concrete syntax defines reference values for all of these parameters. The document type declaration specifies the syntactic environment, that is the markup language, for the document content by identifying the DTD (markup details) and the document element (where the content hierarchy begins). |
| An SGML document is only valid if the document element correctly parses against the document type declaration in the context of the applicable SGML declaration. If a file does not correctly parse, then the information in the file cannot be guaranteed to be correctly passed to an application. A parser is not obliged to attempt error recovery and pass information to the application, although there are parser products that are designed to do so. |
| Modularizing the DTD into subsets that are either external or internal to the document type declaration can help maintain a number of document instances that utilize the same model. It is quite common that the model be specified in an external file and that any required entities (perhaps pointing to external graphic images) be specified within the document type declaration contained in the instance. |
| It is critically important that any complete SGML document maintained in a DMS be correctly emitted as a valid, parseable file for processing by SGML applications. Unfortunately, there are a number of issues, only some of which are described as follows, that can hinder successfully manipulating or sharing fragments of a complete SGML document. By extension even manipulating a single element or sharing a single element in two SGML documents can induce the problems listed below. Examples of these problems are categorized by the three document components described above. |
| It should be noted that some of the issues described below are situations that need to be addressed if a fragment is to be manipulated, stand-alone, in an SGML editing tool. Most SGML editing tools require that the file being edited be a valid SGML document, implying that the stand-alone fragment must be presented to the editor as valid SGML, which may require information to be wrapped around the fragment. |
| Strategies for dealing with these issues must be worked out between the system implementor and the vendors of the DMS products used in the implementation of a given application and its data. |
Lexical: Sharing Fragments in Different Concrete Syntaxes |
| When a document fragment is shared in separate complete SGML documents, the lexical environment for that fragment must respect the Concrete Syntaxes defined in the SGML Declarations of all of the documents within which it is used. |
| This requires that the delimiters used, for example the angle brackets that open an element's start tag and the ampersand character that opens an entity reference, be defined identically in all documents in which a fragment is used. |
| A common quantity that is changed in documents is the NAMELEN parameter, specifying the maximum length of name tokens. Among other uses, names are used to identify elements, entities and linked identifiers and identifier references. All name tokens used in a fragment must not be longer than the maximum length specified in any of the documents within which the fragment is used. |
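| The NAMELEN constraint can be checked mechanically before a fragment is shared. The following Python sketch is illustrative only: the function name `names_too_long` is hypothetical, it inspects only element names found in tags (not entity names or identifiers), and it assumes the default angle-bracket delimiters. The value 8 used below is the NAMELEN of the reference quantity set. |

```python
import re

# Element names as they appear in start tags and end tags (default delimiters assumed).
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9.\-]*)")


def names_too_long(fragment, namelens):
    """Return the element names exceeding the smallest NAMELEN of all sharing documents."""
    limit = min(namelens)  # the most restrictive document governs
    names = set(TAG_NAME.findall(fragment))
    return sorted(name for name in names if len(name) > limit)


fragment = "<procedure><step>Remove the panel.</step></procedure>"
# One document uses the reference value 8, the other raises NAMELEN to 32.
offending = names_too_long(fragment, [8, 32])
```

| Here `offending` lists `procedure` (nine characters), which is acceptable in the second document but not in the first. |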
Syntactic: Sharing Fragments in Different DTDs |
| When a document fragment is shared in separate complete SGML documents, the syntactic environment for that fragment must be valid to the DTD of each of the documents within which the fragment is used. One should also note that the parsed content of the markup may also be subtly different in two documents where the DTDs are not the same, even though the actual markup conforms. |
| Regarding the content model, a fragment whose elements respect the model of one DTD may not be valid in the context of another document's DTD. An absent element that is optional in one DTD would be parsed as an error if the absent element was not optional in another DTD. A group of elements that are unordered in one DTD may not be correctly parsed in another DTD that prescribes a specific order. |
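| The effect of differing content models on one and the same fragment can be illustrated with a small Python sketch. It is a simplification under stated assumptions: content models are hand-translated into regular expressions over child element names rather than read from a DTD, and the element names and the function `conforms` are hypothetical. |

```python
import re


def conforms(children, model):
    """Test a sequence of child element names against a regex form of a content model."""
    # Each name is followed by a space so that one name cannot match a prefix of another.
    sequence = "".join(f"{name} " for name in children)
    return re.fullmatch(model, sequence) is not None


# DTD A: (title?, para+)   DTD B: (title, para+)
MODEL_A = r"(title )?(para )+"
MODEL_B = r"(title )(para )+"

# DTD C: (title & note), any order   DTD D: (title, note), fixed order
MODEL_C = r"(title note |note title )"
MODEL_D = r"title note "

untitled = ["para", "para"]
reordered = ["note", "title"]
results = (
    conforms(untitled, MODEL_A),   # title is optional
    conforms(untitled, MODEL_B),   # title is required
    conforms(reordered, MODEL_C),  # unordered group
    conforms(reordered, MODEL_D),  # prescribed order
)
```

| The same two fragments are accepted by the first model of each pair and rejected by the second, which is the situation a DMS must detect before allowing a fragment to be shared. |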
| Regarding attributes, an element with an absent implied attribute would be parsed as an error if the attribute was required in the context of another DTD. An implied attribute in a fragment would be processed differently if each of the DTDs for the documents within which it was used had different default values. The use of #CURRENT defaulted attributes is an interesting situation in that the fragment's default attribute value differs based on other fragments that are assembled ahead of the given fragment in separate documents. Note also that a fragment using an implied attribute that is relying on the default value having been defined by some other fragment could not be used as the first such fragment in a document because the first use of an attribute with #CURRENT default is obliged to have a value specified, otherwise a parse error will occur. |
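| The #CURRENT behaviour described above can be modelled in a few lines of Python. This is a sketch, not a parser: each occurrence of the element is reduced to the attribute value specified on it (or `None` when omitted), and the names `resolve_current` and `CurrentDefaultError`, as well as the sample values, are hypothetical. |

```python
class CurrentDefaultError(ValueError):
    """Raised when the first occurrence relies on a #CURRENT default that does not exist yet."""


def resolve_current(specified):
    """Resolve a #CURRENT attribute over element occurrences in document order.

    `specified` holds, per occurrence, the value given in the markup or None if omitted.
    """
    resolved = []
    current = None
    for position, value in enumerate(specified):
        if value is not None:
            current = value  # a specified value becomes the new default
        elif current is None:
            raise CurrentDefaultError(
                f"occurrence {position}: no value specified and none to inherit"
            )
        resolved.append(current)
    return resolved


shared_fragment = [None]              # relies entirely on the inherited default
document_one = ["secret", None] + shared_fragment
document_two = ["open"] + shared_fragment

in_one = resolve_current(document_one)
in_two = resolve_current(document_two)
```

| The shared fragment resolves to `secret` in the first document and to `open` in the second, and calling `resolve_current(shared_fragment)` on its own raises the error, mirroring the parse error for a first occurrence without a specified value. |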
Contextual: Fragments Out of Document Context |
| When working with a fragment out of the context of a complete document, two of the issues that come to light are entity references, and the use of ID and IDREF attributes. |
| When the document's DTD is split into an external declaration subset containing the content model definition and an internal declaration subset containing entity declarations, the user may or may not be presented with a complete set of entity declarations when manipulating only a fragment of the instance. Care must be taken that the user not create entity declarations that would conflict with entity declarations that may already be created for the entire instance. As well, a reference cannot be made to an entity that is not included in the DTD for the fragment. This implies that the editing tool would be unable to reference an entity if it is only defined in another fragment. Therefore, these scope issues require that all fragments have access to all entity declarations applicable to the entire instance. |
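| One way to give a stand-alone fragment access to the entity declarations of the entire instance is to wrap it in a document type declaration whose internal subset carries those declarations. The Python sketch below illustrates the idea; it is an assumption, not the mechanism of any particular product, and the function names, the file name `manual.dtd` and the entity `prodname` are hypothetical. It also assumes the fragment's root element is declared in the external DTD. |

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def undeclared_entities(fragment, declared_names):
    """List general entities referenced in the fragment but missing from the declarations."""
    return sorted(set(ENTITY_REF.findall(fragment)) - set(declared_names))


def wrap_fragment(fragment, root, dtd_system_id, entity_declarations):
    """Present a fragment as a complete instance: external DTD plus an internal subset."""
    subset = "\n".join(f"  {declaration}" for declaration in entity_declarations)
    return f'<!DOCTYPE {root} SYSTEM "{dtd_system_id}" [\n{subset}\n]>\n{fragment}\n'


fragment = "<step>Install the &prodname; cover.</step>"
declarations = ['<!ENTITY prodname "Widget">']

missing = undeclared_entities(fragment, ["prodname"])
instance = wrap_fragment(fragment, "step", "manual.dtd", declarations)
```

| Passing every declaration that applies to the whole instance, rather than only those used in the fragment, keeps the user from creating conflicting declarations while editing. |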
| Two of the attribute types available to be defined for an element type are ID (identifier) and IDREF (identifier reference), used for linking elements. The name token values of these attributes are in a single value space for a document. It is an SGML error to have two elements in one instance with the same ID type attribute values. It is also an SGML error to define an IDREF type attribute value for which there are no elements in the instance with an ID type attribute value of the same name token. When working with a fragment on its own, a mismatch of either kind can easily be true for the fragment if the match value resides in another fragment. When working with complete documents, two separately valid fragments may not be validly used within the same document due to name space collision. |
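| Both kinds of mismatch can be detected once the ID and IDREF values of each fragment are known. The following Python sketch assumes those values have already been extracted per fragment; the function name `check_links` and the sample identifiers are hypothetical, and values are compared literally without the case folding an SGML declaration may prescribe. |

```python
from collections import Counter


def check_links(fragments):
    """Check ID uniqueness and IDREF resolution across a set of fragments.

    `fragments` maps a fragment name to a pair (ids, idrefs).
    Returns (duplicate_ids, dangling_idrefs), each sorted.
    """
    counts = Counter(value for ids, _ in fragments.values() for value in ids)
    duplicates = sorted(value for value, count in counts.items() if count > 1)
    dangling = sorted(
        {ref for _, idrefs in fragments.values() for ref in idrefs if ref not in counts}
    )
    return duplicates, dangling


fragments = {
    "chapter-a": (["fig1", "tbl1"], ["fig2"]),
    "chapter-b": (["fig1", "fig2"], []),
}

alone = check_links({"chapter-a": fragments["chapter-a"]})
together = check_links(fragments)
```

| Checked alone, `chapter-a` has a dangling reference to `fig2`; checked together, the reference resolves but `fig1` collides, showing that validity must be judged for the assembled document as well as for each fragment. |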
DMS Functionality |
| Some of the many aspects of DMS Functionality that were required for this case study are as follows. Some of these will parallel requirements in other projects that manipulate SGML based documents in DMS products. |
| Document burst and document build are the ways that a complete SGML document is, respectively, imported into and exported from a DMS. The DMS must have some way of defining multiple DMS objects from a single file imported into the system. A system that supports element-level granularity can accomplish this from SGML syntax. A system that supports fragment-level granularity recognized by general entity references can also accomplish this from SGML syntax. A system that supports fragment-level granularity by MRUs must know a priori where to detect the boundaries of DMS objects from the seamless SGML. |
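| Burst and build for MRU-based fragments can be sketched in Python as follows. The sketch works on an already parsed element tree rather than on SGML text, and the MRU boundaries are supplied from outside the document as a set of element names. The classes `Node` and `Ref`, the functions `burst` and `build`, and the object identifiers are all hypothetical. |

```python
from dataclasses import dataclass, field


@dataclass
class Ref:
    """Placeholder left where an MRU was cut out of its parent."""
    object_id: str


@dataclass
class Node:
    """A parsed element: children are Node, Ref or str (character data)."""
    name: str
    children: list = field(default_factory=list)


def burst(root, mru_names):
    """Split a tree into DMS objects at the externally defined MRU element names."""
    store = {}

    def split(node):
        kids = []
        for child in node.children:
            if isinstance(child, Node):
                piece = split(child)
                if child.name in mru_names:
                    object_id = f"obj{len(store) + 1}"
                    store[object_id] = piece
                    kids.append(Ref(object_id))
                else:
                    kids.append(piece)
            else:
                kids.append(child)
        return Node(node.name, kids)

    store["root"] = split(root)
    return store


def build(store, object_id="root"):
    """Reconstitute the seamless tree by expanding every reference."""
    def expand(node):
        kids = []
        for child in node.children:
            if isinstance(child, Ref):
                kids.append(expand(store[child.object_id]))
            elif isinstance(child, Node):
                kids.append(expand(child))
            else:
                kids.append(child)
        return Node(node.name, kids)

    return expand(store[object_id])


manual = Node("manual", [
    Node("chapter", [Node("para", ["First."])]),
    Node("chapter", [Node("para", ["Second."])]),
])
store = burst(manual, {"chapter"})
rebuilt = build(store)
```

| The rebuilt tree carries no trace of the MRU boundaries, which is why the boundaries must be recorded in the DMS rather than in the document. |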
| It is interesting to note that there are products that maintain documents as a hierarchy of element structures, yet allow MRU-like groupings, referred to by one vendor as locking units, to define the granularity of access to constituent parts. |
| How the DMS implements the configuration of objects in a document is important. If the document configuration is an exoskeleton pointing to shared objects, then a given object may have different children based on the documents within which the shared objects are used. Otherwise, if the document configuration is derived entirely from drilling down object contents, then an object must reference its children, thereby preventing different children from being used in different documents. Neither approach is necessarily invalid, as the applicability depends on the user requirements and expectations for the information being maintained. |
| The level of granularity implemented by the DMS defines the level of granularity at which information can be shared between documents. Objects in the document storage hierarchy may be shared, but it is not practical to share information that exists as only part of a single object. Users can have more control over documents and information sharing when documents are comprised of many objects. Being able to share objects allows common document content to be defined once in an object and shared in more than one document. Being able to check out and check in objects allows different objects of a single document to be simultaneously manipulated by different users. The DMS must, however, be able to prevent an inadvertently described infinite loop where an object's descendent elements refer directly or indirectly to the object itself. |
| Object check out and check in are the ways that a component of the document is reserved by one user, thereby preventing other users from changing the component. This facility is necessary for editing the object, so that the component is not changed simultaneously by more than one user, which could cause changes made while the object is checked out to be lost. Some vendors allow a hierarchy of multiple objects to be checked out at a single stroke, thus requiring all constituent objects to be marked sufficiently for identification during check in. |
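| Detecting an object that refers directly or indirectly to itself is a standard graph problem. The Python sketch below uses a depth-first search over the object references; the function name `find_cycle` and the object identifiers are hypothetical, and the references are assumed to be available as a simple mapping from each object to the objects it refers to. |

```python
def find_cycle(references, start):
    """Return the chain of object ids forming a loop reachable from `start`, or None.

    `references` maps an object id to the ids of the objects it refers to.
    """
    path = []          # objects on the current chain, in order
    on_path = set()    # same objects, for fast membership tests
    finished = set()   # objects already proven loop-free

    def visit(object_id):
        if object_id in on_path:
            return path[path.index(object_id):] + [object_id]
        if object_id in finished:
            return None
        on_path.add(object_id)
        path.append(object_id)
        for child in references.get(object_id, ()):
            found = visit(child)
            if found:
                return found
        on_path.discard(object_id)
        path.pop()
        finished.add(object_id)
        return None

    return visit(start)


looping = {"manual": ["section"], "section": ["annex"], "annex": ["manual"]}
sound = {"manual": ["section", "annex"], "section": ["annex"], "annex": []}

loop = find_cycle(looping, "manual")
no_loop = find_cycle(sound, "manual")
```

| A DMS could run such a check whenever a reference is added, rejecting the change if a loop is reported. Note that an object shared by two parents, as `annex` is in the second example, is not a loop. |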
| Regarding VC/CM (Version Control/Configuration Management), the user must measure their requirements against how a DMS changes the versions and configurations of objects that are changed in document hierarchies. When a user checks out any fragment, that fragment and any referred fragments therein must be recognized by the DMS at check in time so that revised objects can be correctly marked as being new versions or configurations. These functions are sometimes referred to as making a snapshot (marking) and taking a slice (accessing) of all the objects that are used in a particular complete document. When many users are constantly modifying individual fragments, it is critical to be able to identify the configuration of and find all of the particular versions of the individual objects that make any given version of the complete document. |
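| The interplay of check out, check in, snapshot and slice can be sketched with an in-memory store in Python. This is an illustration of the concepts only: the class `VersionedStore`, its method names and the sample object identifiers are hypothetical and do not reflect the interface of any product considered. |

```python
class CheckoutError(RuntimeError):
    """Raised when an object is reserved by another user or checked in by a non-holder."""


class VersionedStore:
    def __init__(self):
        self._versions = {}   # object id -> list of contents; version n is index n - 1
        self._holders = {}    # object id -> user holding the reservation
        self._snapshots = {}  # label -> {object id: version number}

    def add(self, object_id, content):
        self._versions[object_id] = [content]

    def check_out(self, object_id, user):
        if object_id in self._holders:
            raise CheckoutError(f"{object_id} is reserved by {self._holders[object_id]}")
        self._holders[object_id] = user
        return self._versions[object_id][-1]

    def check_in(self, object_id, user, content):
        if self._holders.get(object_id) != user:
            raise CheckoutError(f"{object_id} is not checked out by {user}")
        del self._holders[object_id]
        if content != self._versions[object_id][-1]:
            self._versions[object_id].append(content)  # only a change makes a new version
        return len(self._versions[object_id])

    def snapshot(self, label, object_ids):
        """Mark the current version of every object used in a complete document."""
        self._snapshots[label] = {oid: len(self._versions[oid]) for oid in object_ids}

    def slice(self, label):
        """Access the marked versions, regardless of later revisions."""
        marked = self._snapshots[label]
        return {oid: self._versions[oid][version - 1] for oid, version in marked.items()}


store = VersionedStore()
store.add("chap1", "original text")
store.add("chap2", "other text")
store.snapshot("release-1", ["chap1", "chap2"])

store.check_out("chap1", "alice")
new_version = store.check_in("chap1", "alice", "revised text")
store.snapshot("release-2", ["chap1", "chap2"])
```

| After the revision, the slice for `release-1` still yields the original text of `chap1`, while the slice for `release-2` yields the revised text. |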
| When dealing in an environment with many hardware and software platforms, it is important to identify external document objects using SGML syntax mechanisms that are not platform dependent. A registration facility is an important feature of a DMS to ensure proper maintenance of entity names, references and definitions. The Formal Public Identifier mechanism in SGML, an implementation of the ISO-9070 standard, provides the methods for deriving identifiers for objects, some of which are used to construct identifiers that are guaranteed to be unique world-wide. |
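| A registration facility needs to take Formal Public Identifiers apart in order to index them. The Python sketch below splits an identifier into its owner, public text class, description and language. It is a simplification: the names `PublicId` and `parse_fpi` are hypothetical, the identifier `-//EXAMPLE//DTD Manual//EN` is invented for illustration, and optional parts such as the unavailable text indicator and the display version are not handled. |

```python
from typing import NamedTuple, Optional


class PublicId(NamedTuple):
    owner: str
    registered: Optional[bool]  # True for "+//", False for "-//", None for an ISO owner
    text_class: str
    description: str
    language: str


def parse_fpi(fpi):
    """Split a formal public identifier into its main components."""
    parts = fpi.split("//")
    if parts[0] in ("+", "-"):
        registered = parts[0] == "+"
        parts = parts[1:]
    else:
        registered = None  # e.g. an owner identifier beginning with "ISO"
    if len(parts) < 3:
        raise ValueError(f"not a formal public identifier: {fpi!r}")
    owner, text, language = parts[0], parts[1], parts[2]
    text_class, _, description = text.partition(" ")
    return PublicId(owner, registered, text_class, description, language)


unregistered = parse_fpi("-//EXAMPLE//DTD Manual//EN")
iso_owned = parse_fpi("ISO 8879:1986//ENTITIES Added Latin 1//EN")
```

| Only identifiers with a registered owner prefix can carry the guarantee of world-wide uniqueness; the `registered` field lets a DMS distinguish them from locally coined identifiers. |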
Case Study Analysis |
| The project implementation phase for the publishing architecture included an analysis of available technology from which the end user would select a DMS vendor. As well as the issues described herein to handle SGML documents, the end user's selection criteria included mandatory non-SGML issues from other departments including MIS (Management Information Systems). These mandatory non-SGML issues ended up having a higher priority for the client than SGML-related issues. |
| From the outset, the use of a non-SGML aware DMS to maintain the whole document instances as DMS objects was not considered. While the size of the instance of the DND CALS DTD that would be maintained in the system for delivery to the crown could not be reliably estimated before the book plans for the hundreds of manuals were created, it would not be unrealistic for the single instance to be as large as a gigabyte. None of the available editing tools could handle the entire instance, and, even if any could, the performance might be severely impacted by the file size. Moreover, any edits being made anywhere in the file by one writer would lock out other writers from making edits anywhere else within the entire instance. Accordingly, it was mandatory that the system support the instance of the DND CALS DTD in manageable fragments. |
| A selection of products that implemented SGML at each of file-level, fragment-level and element-level granularity was reviewed and initial visits were made to vendors' sites for detailed information. The file-level granularity product was analyzed at the request of the client. The implementation deadlines for the project prevented the then just announced hybrid relational/structured text programming language from being considered. |
| At the time of implementation, the SGML industry (through the SGML Open Consortium) had not completed developing a method of describing fragments of SGML that could be considered for use in the project, thus the fragments were required to be manipulated by customized editing tool features, as complete SGML instances in their own right or as portions of SGML instances oriented for editing the fragment. It was interesting to note that at the time the scope of the industry efforts for fragments did not include trying to resolve all of the issues described above. For the fragment-level and element-level granularity DMS products reviewed, all vendors tied the editing of SGML content to specific brands of editors, each customized to handle some of the issues described above regarding dealing with portions of a complete SGML instance. |
| A selection of vendors of tools supporting SGML documents at either a fragment-level or the element-level granularity was then asked to demonstrate basic functionality on a small test instance modeled to a small DTD that reflects some of the unique nature of the DND CALS DTD. Not all vendors demonstrated the correct or expected behaviour for the tests prescribed. None of the vendors that satisfied an acceptable number of the SGML criteria were able to satisfy an acceptable number of the non-SGML criteria mandated upon the client from other corporate departments within the client's company. |
| It was interesting to note that while most uses of DMS technology to date have satisfied storage requirements for presentation-oriented and content-oriented document models, the DND CALS DTD model has a unique characteristic in that it is recursive from the document element to the first level of child elements of the document element. This recursive nature revealed problems in some demonstrated products regarding objects being able to eventually indirectly point to themselves, resulting in an infinite loop. |
| To meet sufficient criteria from all of the stakeholders within the client's company, the client decided not to use any of the fragment-level or element-level SGML aware products available. The client decided to have custom-developed SGML aware logic implemented on top of a non-SGML aware product data management tool rather than purchase an SGML based DMS. The tool as available off-the-shelf only supports SGML at the file-level granularity. |
| It was, therefore, decided to implement the MRU concept with fragment-level granularity as a set of files, each file being a complete SGML editing-oriented instance within which an individual fragment of the larger document is maintained. The entire instance of the DND CALS DTD is then generated on request from the set of fragment files using a document assembly approach by extracting the document fragments from each of the editing-oriented instances. The implementation is supported by a set of programs coded to the API (Application Programmer's Interface) of the product. |
| The resulting system design revealed initially unexpected benefits. Information product assembly could proceed using document fragments rather than the large DND CALS DTD based instance, thus improving production performance. The large instance need only be treated as yet another information product, assembled from the same fragments and produced as any other information product, only when required to be delivered to DND. This design had the added benefit of also removing any dependencies on custom features of editing tools as the document fragments were complete SGML instances in their own right. |
| Given that all DMS vendors are constantly upgrading their products' capabilities, any future implementation of the publishing architecture chosen for this project would need to review the product capabilities available at that time. |
Conclusion |
| User requirements for information management vary to the extent that no one DMS architecture can be considered the best or the most appropriate for all possible implementations. Furthermore, this case study did not require that all possible DMS features available from existing products be exercised, some of which may be required when considering a DMS for other implementations. |
| Understanding a given project's requirements for maintaining information at a particular level of granularity is critical to knowing the impact to system implementors and end users that different DMS architectures will have on the information store used in that project. If some of these situations described in this analysis are indicative of requirements for other projects, it would behoove a system implementor to query DMS vendors regarding how the particular conditions are handled. |
| Considering this case analysis of the different approaches to supporting granular information can help the decision process regarding which DMS architecture is necessary, or desirable, in particular implementations of other SGML based information publishing solutions. The vendor selection criteria that are analyzed for these solutions can then include the DMS architecture as an important criterion. |
ACKNOWLEDGEMENTS |
| This paper was originally prepared by G. Ken Holman, formerly the Chief Technology Officer of Microstar Software Ltd. |