SGML in Healthcare Information Systems   Table of contents   Indexes   Caterpillar Inc's New Authoring System

 Angerstein  Paula 
 

Why you do (or don't) need HyTime in your document management system

 

Abstract:

 This paper examines whether (or not) HyTime is an essential feature of a document management system. Scenarios for the appropriateness (or inappropriateness) of indirect linking are reviewed. Ways in which a document management system can help (or hinder) management of links are examined. Should (or shouldn't) a document management system treat HyTime markup as more than ordinary SGML?
 With the addition to HyTime of several annexes in the Technical Corrigendum (TC), HyTime becomes a broader framework for describing generalized SGML-based architectures. The potential impact of these far-reaching topics on document management systems is discussed.
 

Overview

  HyTime is the Hypermedia/Time-based Structuring Language, defined in ISO standard 10744. HyTime's full name indicates it covers structured encoding of two major areas: hypermedia and time-based information. In this paper, the hyperlinking aspects of HyTime's hypermedia features are examined. The spatial and time-based rendition aspect of HyTime is not discussed.
 HyTime's full name does not indicate that it also contains a set of features generally applicable to SGML systems. To apply the hypermedia and time-based concepts defined in HyTime, a number of supporting constructs, such as architectural forms, appear in the standard. HyTime's Technical Corrigendum (TC) extends this set of generally applicable features; indeed, one annex to HyTime is called “SGML Extended Facilities”.
 HyTime standardizes a number of semantic definitions, for example, for hyperlink behavior. These semantic definitions are expressed in the standard such that they can be applied to the elements and attributes defined in any SGML Document Type Definition (DTD). The result is that the standardized semantics apply to the elements when they occur in an SGML document. So, in one sense, HyTime is “just SGML”; an SGML-compliant system can successfully parse a document with HyTime elements in it. In another sense, these semantics are meant to affect the behavior of the processing system; thus, recognizing and implementing these semantics is the essence of a true HyTime-compliant system.
  In this paper, the term document management system is used to refer to any set of tools used to manage a collection of SGML documents.
 

Links and HyTime

 The most notable user requirement satisfied by HyTime in the context of a document management system is a standardized mechanism for hyperlinking.
 Simply put, a link represents a relationship between two or more things. At this level, a link is a highly arbitrary concept with applicability to almost any aspect of information management. Most commonly, links are thought of as an aid to navigating through data, especially in the context of an interactive application such as an online browser. Probably the most well-known links are the links in an HTML document used to connect resources on the Internet via a Uniform Resource Locator (URL).
 Links are also becoming a common tool in the management of information, playing an important role in tracking the relationships within and among sets of data. In fact, links have become recognized as important pieces of data themselves that need to be authored and maintained just like other data. A link may not only “get me from here to there”, it may indicate a crucial dependency between two elements, for example, a diagnosis of a problem and its suggested repair procedure. When one of these elements changes, the other must be examined to see if it is still relevant and current.
  HyTime describes a link in this way: a link relates two or more link ends. Each link end is a locator of some sort to a piece of data known as an anchor. A contextual link has one of its link ends implied by the link element's position in the document; the link element itself serves as one anchor. An independent link is just that: it resides independently of all of its link ends.
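 The distinction between contextual and independent links can be made concrete with a small data model. The following is only a sketch, not HyTime syntax: the class names, the role names, and the locator strings are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LinkEnd:
    role: str      # hypothetical role name, e.g. "diagnosis"
    locator: str   # some address that resolves to the anchor


@dataclass
class Link:
    link_type: str
    ends: List[LinkEnd]
    host: Optional[str] = None  # element holding a contextual link, if any

    def __post_init__(self):
        if len(self.ends) < 2:
            raise ValueError("a link relates two or more link ends")

    @property
    def is_contextual(self) -> bool:
        return self.host is not None


def contextual_link(link_type, host, target):
    """The host element doubles as one anchor, so only the target is addressed."""
    ends = [LinkEnd("here", host), LinkEnd("target", target)]
    return Link(link_type, ends, host=host)


# An independent link lives apart from both of its anchors.
repair = Link("repair-for", [LinkEnd("diagnosis", "diag-17"),
                             LinkEnd("procedure", "proc-4")])
# A contextual link sits in the document at one of its anchors.
xref = contextual_link("see-also", host="xref-3", target="proc-4")
```

 Because the independent link names both of its ends explicitly, it can be stored, versioned, and typed without touching either anchor.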
 Link creation, management, and delivery are often seen as requirements within environments that employ a document management system. Some of the typical user requirements for links include the following:
 
 
  • Link ends into both structured and unstructured data
  • Link ends into data that cannot be modified, for example to associate review comments and annotations with the reviewed document
  • Link ends that resolve into multiple anchors
  • Links that carry with them some set of semantics, such as a type and other attributes to describe the link's behavior
  • Control over the direction of traversal of a link
  • Notification when a link end is invalidated or modified
  • Version history of a link

 Pre-HyTime SGML has not been a significant factor in the evolution of linking strategies. SGML defines a number of relationships, most notably:
     
     
  • the hierarchical parent/child relationship defined by element containment
  • the association of metadata with an element defined by the attribute definition list
  • the link defined by ID and IDREF type attributes.

 The first two relationships are not generally thought of as links because they are inherent in the SGML object model for a document. While ID/IDREF attributes do provide a linking mechanism, it is deficient in a number of ways:
     
     
  • Validation of the uniqueness of IDs (the “name space”) is limited to a single SGML document. For links to be useful in an information management system, they must be able to span multiple documents.
  • IDREFs are limited to resolving to the location of one or more SGML elements. Links must be able to resolve to other types of objects, including those that do not have a unique identifier or even spans of data that do not have a collective identifier at all.
  • There is no way to associate additional link information with a specific ID/IDREF relationship, and it cannot be tracked as a data object in its own right.
  • ID/IDREF values are embedded in the document itself, so this type of link cannot be used when the data cannot be modified, and maintenance of IDREFs is difficult when ID values change.
  • Heavy reliance on SGML IDs creates a maintenance problem when elements can be shared; uniqueness of IDs must be guaranteed in all contexts in which an element is shared.

 The good news is that to address these deficiencies, HyTime defines a robust, generalized mechanism for describing links and their usage. The bad news is that HyTime defines a robust, generalized mechanism for describing links and their usage.
     

    Meeting linking requirements

     While HyTime provides an elegant solution for linking, it carries with it an overhead for implementors and users of a document management system. Addressing mechanisms range from the straightforward to the arcane, location ladders may take a while to learn to climb, and the indirection of independent links may lead down a few blind alleys. Not to mention close encounters with batons and quanta.
     Even though HyTime has a number of modules, it is still difficult to determine the precise subset of features needed to satisfy a particular set of requirements. Ideally, a document management system would provide all HyTime features, making them transparently available to the end user through its normal interfaces. In practice, this is not as simple as it seems: a document management system may have been designed and implemented before HyTime was a standard and may already provide its own solutions to customers' linking requirements.
     A document management system may, in fact, satisfy a set of linking requirements without HyTime at all. What HyTime generally adds to the picture is increased:
     
     
    1. Data portability: the ability to move data from one system to another without loss of information embedded in proprietary form
    2. Interoperability: the ability for more than one system to have a shot at processing the data correctly.
     But HyTime is not the only game in town for meeting these objectives. That small standard called HTML has enjoyed immense success with portable, interoperable links across the World Wide Web. The Text Encoding Initiative's (TEI) extended pointer syntax is an example of an application convention that has gained enough support to become interoperable. Under development by the W3C SGML Working Group is the linking module of the Extensible Markup Language (XML) suite of specifications; XML-link draws on the strengths of all of the above to provide a set of robust yet simple linking features. XML-link gets many of its core precepts from HyTime, such as link roles, independent links, multi-directional links, and multi-ended links. It also carefully maintains compatibility with HTML links, for example, using the URL as its inter-document addressing mechanism. XML-link uses TEI extended pointers to provide somewhat simpler versions of HyTime's addressing mechanisms.
     Let's take a look at how a document management system might address three main areas of link management: creation of links, maintenance of links, and delivery of data with links in it, examining where HyTime plays a role.
     A couple of common threads run through these three discussions. First, requirements tend to fall into categories depending on whether the collection of documents within which links connect elements is bounded and well-known. An additional metric is whether or not data to be linked can be modified. An example of one extreme is surfing the Web: click on a link and you have no idea from where the next page will come, and you certainly have no authority to modify any data but your own. On the other side, consider a set of technical documents and online help describing a tractor: the tractor manufacturer has full knowledge and control over the set of information and can therefore organize and modify it at will.
     
     

    Link creation

     When authoring a link, the crux of the matter is: how can I highlight the intended target of my link and how is that target subsequently identified with a link end? Finally, where do I put the link I have just created?
     Currently, in an overwhelming number of cases, the anchor of a link is an SGML element that already has, or can have, an SGML ID attribute value on it. As already noted, the ID may not be unique within a repository of documents. A document management system, however, having access to all the documents, can provide repository-wide unique naming. In fact, the document management system might even already provide unique object identification for each element, making IDs unnecessary. This repository-wide naming and referencing scheme is subsequently referred to in this paper as the “RID/RIDREF” approach.
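     A repository-wide naming scheme of this kind can be sketched in a few lines. The class below is a hypothetical illustration, not the interface of any particular document management system; the `rid-N` identifier format and the file names are assumptions.

```python
import itertools


class Repository:
    """Hands out repository-wide identifiers for (document, local ID) pairs."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._rid_by_key = {}   # (document, local_id) -> rid
        self._key_by_rid = {}   # rid -> (document, local_id)

    def register(self, document, local_id):
        key = (document, local_id)
        if key not in self._rid_by_key:
            rid = f"rid-{next(self._counter)}"
            self._rid_by_key[key] = rid
            self._key_by_rid[rid] = key
        return self._rid_by_key[key]

    def resolve(self, rid):
        return self._key_by_rid[rid]


repo = Repository()
# The same SGML ID may occur in two documents; the repository keeps them apart.
engine_intro = repo.register("engine.sgm", "intro")
help_intro = repo.register("help.sgm", "intro")
```

     A link end then stores the repository identifier rather than the SGML ID, so the ID only has to be unique within its own document.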
      HyTime overcomes the SGML limitation of ID/IDREFs to a single document with the name space addressing mechanism, commonly known as nameloc. A nameloc identifies both the ID of the link anchor and the entity in which it occurs. Use of nameloc requires management of entity declarations and entity resolution for all entities linked to in the document collection.
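     The two-part nature of a nameloc address (which entity, which ID) can be illustrated as follows. This is a sketch of the idea only, not of nameloc markup: the entity table stands in for real entity declarations, the entity names are hypothetical, and well-formed XML parsed with Python's standard library stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Stand-in for entity declarations: entity name -> document text.
ENTITIES = {
    "engine-manual": "<book><sec id='s1'>Engine</sec></book>",
    "online-help": "<help><topic id='s1'>Starting</topic></help>",
}


def resolve_entity_and_id(entity_name, element_id, entities=ENTITIES):
    """Find the element with the given ID inside the named entity."""
    if entity_name not in entities:
        raise LookupError(f"entity {entity_name!r} is not declared")
    root = ET.fromstring(entities[entity_name])
    for element in root.iter():
        if element.get("id") == element_id:
            return element
    return None


anchor = resolve_entity_and_id("engine-manual", "s1")
other = resolve_entity_and_id("online-help", "s1")
```

     Both documents use the ID `s1`, yet each address resolves to a different anchor because the entity name is part of the address. The lookup fails when an entity is not declared, which is the management burden the paragraph above describes.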
     What about legacy data, data imported into the repository, and electronic review of documents by multiple reviewers, scenarios that potentially require links into read-only data? If the incoming data is sufficiently marked up with IDs, you may be able to integrate it without change into your existing link strategy. If not, and you can modify the data, use the document management system to update ID values and existing links to harmonize it with your link management strategy. If you can't modify the data, for example, to associate comments with an element, you may need the more sophisticated addressing mechanisms HyTime provides to link to SGML data without IDs.
      The HyTime node location and query location addressing mechanisms can locate data without IDs. The most common node location is a treeloc (tree location), which identifies the path to a node in a tree by specifying a position in the sibling set for each descending level of the tree. A simple example of a treeloc is to identify the fourth paragraph in the third section of the second chapter in the book as “1 2 3 4” (where 1 represents the document element). A queryloc is a way to specify a user-supplied query to locate data based on its properties. A queryloc could use the Standard Document Query Language (SDQL), for example, to find all “command” elements that contain an “option” element with the content of “print”.
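      The “1 2 3 4” example can be traced with a short sketch. It assumes well-formed XML and Python's standard `xml.etree.ElementTree` as a stand-in for an SGML parser, and it counts only element children at each level, which is a simplification. The function name `resolve_treeloc` and the sample documents are hypothetical. The second part uses ElementTree's limited XPath support as a stand-in for a query language; it is not SDQL.

```python
import xml.etree.ElementTree as ET


def resolve_treeloc(root, path):
    """Walk a space-separated list of 1-based sibling positions from the root."""
    steps = [int(step) for step in path.split()]
    if not steps or steps[0] != 1:
        raise ValueError("the first step must be 1, the document element")
    node = root
    for position in steps[1:]:
        children = list(node)  # element children only; text is not counted
        if not 1 <= position <= len(children):
            raise LookupError(f"no child {position} under <{node.tag}>")
        node = children[position - 1]
    return node


# Build a small book: 2 chapters x 3 sections x 4 paragraphs.
book = ET.Element("book")
for c in range(1, 3):
    chapter = ET.SubElement(book, "chapter")
    for s in range(1, 4):
        section = ET.SubElement(chapter, "section")
        for p in range(1, 5):
            para = ET.SubElement(section, "para")
            para.text = f"chapter {c}, section {s}, para {p}"

target = resolve_treeloc(book, "1 2 3 4")
print(target.text)  # chapter 2, section 3, para 4

# Query-style location: every "command" with an "option" whose content is "print".
manual = ET.fromstring(
    "<manual>"
    "<command><name>lp</name><option>print</option></command>"
    "<command><name>ls</name><option>long</option></command>"
    "</manual>"
)
hits = manual.findall(".//command[option='print']")
print([hit.findtext("name") for hit in hits])  # ['lp']
```

      Neither address requires an ID on the target, but both are sensitive to edits: inserting a section shifts the tree path, and changing content can change the query result.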
      What about anchors that are not an SGML element, for example a span of text, and anchors that are not SGML data at all, say an object in a graphic? This type of linking generally calls for an additional level of linking capability to describe offsets from other addressable elements. The HyTime data location addressing mechanism (dataloc) addresses into spans of character data. Additionally, all HyTime addressing mechanisms can be combined into a span location to address a chunk of data that spans contiguous elements.
     Creating the link itself is generally dependent upon the DTDs used within the repository. Authoring interfaces to a document management system should be configurable to work with elements designated as link elements, at a minimum to handle a RID/RIDREF contextual linking scheme. A DTD could define more than one type of link element, with attributes to describe some aspects of the link. The ease with which you can designate existing elements as HyTime links depends on how closely the linking design was aligned with HyTime concepts (layering onto existing elements is easier with the TC). Some DTDs anticipate the use of HyTime and typically include a somewhat arbitrary set of declarations based on HyTime architectural forms.
     Contextual links are embedded in the document data at the place where the link takes effect. Creating these types of links generally integrates well with existing SGML authoring techniques; creating a contextual link is much like creating any other type of element.
     HyTime's independent links encapsulate both link ends of a link into a separate element that can reside anywhere. Using independent links requires a bit of extra planning to determine where to store the link elements as well as requiring a more sophisticated authoring user interface to the links; however, independent links provide a level of indirection that facilitates maintenance.
     
     

    Link maintenance

     A key requirement for link maintenance is ensuring that links continue to point to their intended anchor. In some environments, this means maintaining the exact same anchor; in other environments it may be perfectly fine, or even desired, for the anchor to be updated or for the link end to resolve to a different anchor. For example, a link from an overview topic in training material needs to be examined when the reference material it points to is updated, as the summary material may need to be changed. On the other hand, in a software user's guide on how to perform a task, a link to a commonly used procedure is still valid when the procedure description changes. Links to affiliations, addresses, dates, prices, and “the latest” anything may dynamically resolve to new anchors.
     Using the RID/RIDREF strategy, a document management system with a “where-used” reporting mechanism can dynamically report changes to anchors. If integrated well with the authoring environment, the system can even report these changes in real time, for example, to prevent deleting an element that is an anchor of a link.
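     A “where-used” mechanism amounts to a reverse index from anchors to the links that point at them. The sketch below assumes links and anchors are known by repository-wide identifiers; the class, its method names, and the identifiers are hypothetical.

```python
from collections import defaultdict


class WhereUsedIndex:
    """Reverse index: anchor identifier -> identifiers of links that use it."""

    def __init__(self):
        self._links_by_anchor = defaultdict(set)

    def add_link(self, link_id, anchor_id):
        self._links_by_anchor[anchor_id].add(link_id)

    def remove_link(self, link_id, anchor_id):
        self._links_by_anchor[anchor_id].discard(link_id)

    def where_used(self, anchor_id):
        return sorted(self._links_by_anchor.get(anchor_id, ()))

    def can_delete(self, anchor_id):
        # An element may be deleted only when no link still points at it.
        return not self._links_by_anchor.get(anchor_id)


index = WhereUsedIndex()
index.add_link("link-1", "rid-42")
index.add_link("link-2", "rid-42")
```

     An authoring tool could call `can_delete` before removing an element and show the result of `where_used` to the author when the answer is no.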
     Using HyTime addressing mechanisms for link ends means the document management system needs to know how to resolve these addresses in order to validate the anchors. Some addressing mechanisms complicate the ability to determine when an anchor has changed because the resolution of a link end into an anchor may not be static, for example, when the link end is a query. In this case, the validation of link ends may have to be driven by checking the link itself, rather than monitoring all possible anchors.
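     Checking from the link side can be sketched as re-running the query and comparing the result with the anchors recorded at the last check. The functions, the element table, and the `kind` property below are hypothetical; a real system would resolve the query through its own addressing engine.

```python
def resolve(query, elements):
    """Return the identifiers of all elements that satisfy the query predicate."""
    return frozenset(eid for eid, element in elements.items() if query(element))


def validate_link_end(query, elements, recorded):
    """Compare the anchors a query resolves to now with those recorded earlier."""
    current = resolve(query, elements)
    return {
        "valid": current == recorded,
        "added": sorted(current - recorded),
        "removed": sorted(recorded - current),
    }


elements = {
    "e1": {"kind": "price", "text": "100"},
    "e2": {"kind": "para", "text": "Overview"},
}


def is_price(element):
    return element["kind"] == "price"


recorded = resolve(is_price, elements)          # anchors at authoring time
elements["e3"] = {"kind": "price", "text": "120"}  # the data changes later
report = validate_link_end(is_price, elements, recorded)
```

     Whether a changed result counts as an error depends on the environment: for a link to “the latest” price, a new anchor is the intended outcome.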
     Any link element, whether HyTime-enabled or not, can be treated as a data object. At a minimum, a document management system can treat a link element like any other SGML element, providing features like access control, versioning, and metadata for links. A HyTime independent link can be modified without editing any of its anchors, further providing a way to maintain a version of a link without affecting the referenced documents.
     
     

    Link delivery

     Typically, data with associated links is destined for an online presentation system wherein the links provide a method of navigation for the end user. At some point, the data is handed over to the presentation system by the document management system. In this one-way delivery, the data becomes read-only, so links can be frozen and stored for fast resolution.
     A goal of the document management system is to output data in a form that can be either directly used by the presentation system or easily transformed for use. Here, the requirements of the browser drive the form of the data. Some well-known browsers expect HTML, with its embedded A element. Some browsers additionally operate on any SGML, recognizing HyTime linking and addressing constructs. Other browsers may have proprietary forms or embellishments for data and links.
     The document management system, having a view to all the links in the repository, can determine the set of documents that need to be available to satisfy a web of links and facilitate packaging of these documents for delivery. Additionally, the document management system can provide “on the fly” content generation and transformation services to produce the final output.
     To facilitate link resolution for the presentation system, you can use the document management system's link resolution capabilities to resolve links and represent them in the output in the simplest form for the presentation system. For example, independent links could be resolved to contextual links, assuming you have authority to modify the documents as they are retrieved for the presentation system.
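     Resolving independent links into contextual ones at delivery time might look like the following sketch. It assumes well-formed XML in place of SGML, an HTML-style `a` element as the contextual form, and independent links already resolved to (source ID, target address) pairs; the function name, the link text, and the target address are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def embed_links(root, independent_links):
    """Return a delivery copy with each independent link embedded at its source."""
    delivered = copy.deepcopy(root)  # the repository copy stays untouched
    by_id = {el.get("id"): el for el in delivered.iter() if el.get("id")}
    for source_id, target in independent_links:
        host = by_id.get(source_id)
        if host is None:
            continue  # source is not part of this delivery package
        anchor = ET.SubElement(host, "a", href=target)
        anchor.text = "see also"
    return delivered


doc = ET.fromstring(
    "<doc><para id='p1'>Check the oil.</para><para id='p2'>Done.</para></doc>"
)
links = [("p1", "procedures.html#oil-change")]
delivered = embed_links(doc, links)
print(ET.tostring(delivered, encoding="unicode"))
```

     Working on a copy keeps the independent links as the maintained source, with the embedded form regenerated on each delivery.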
     The capabilities of the browser can have a backward effect on link requirements for creation and maintenance. There may be no need to maintain elaborate linking features in the repository that cannot be used in the presentation system, for example, HyTime constructs for aggregate linking and link traversal.
     

    HyTime and architectures

     The scope of HyTime is considerably expanded by its Technical Corrigendum, which generalizes a number of concepts pioneered in HyTime. While these features may indeed prove to become standard approaches for information management, they affect the fundamental core of a document management system, and as such may take some time to incorporate into off-the-shelf systems.
     

    Generalized architectural forms

      To capture HyTime rules in a DTD, the HyTime standard defines the concept of an architectural form. An architectural form defines a class of SGML elements that have a common set of semantics; an element in a particular DTD takes on these semantics by declaring itself to be of the architectural form through use of a designated attribute value.
     HyTime includes a number of architectural forms to define semantics of HyTime constructs, for example, to prescribe how a link is described.
     In the TC, architectural forms are generalized so that they can be used as “rules for creating and processing documents”. The TC standardizes the idea of a meta-DTD of architectural forms that define some set of semantics through element type forms and attribute list forms. Any particular DTD can then declare elements of the defined classes to take on the predefined semantics. Additionally, an element in a particular DTD can take on the semantics of more than one architectural form, providing the feature of multiple inheritance. HyTime becomes a conforming application of these concepts.
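     The way elements join classes through designated attribute values, including membership in more than one architecture, can be sketched with a lookup table. The table stands in for fixed attribute values declared in a DTD. The `HyTime` attribute with the form names `clink` and `ilink` follows HyTime usage; the element type names and the `audit` architecture with its `tracked` form are hypothetical.

```python
# Element type -> {architecture attribute name: architectural form name}.
FORM_ATTRIBUTES = {
    "xref": {"HyTime": "clink", "audit": "tracked"},  # two architectures at once
    "relation": {"HyTime": "ilink"},
    "para": {},
}


def forms_of(element_type):
    """All architectural forms an element type declares itself to be."""
    return dict(FORM_ATTRIBUTES.get(element_type, {}))


def element_types_of(architecture, form):
    """All element types that take on the semantics of one architectural form."""
    return sorted(
        element_type
        for element_type, forms in FORM_ATTRIBUTES.items()
        if forms.get(architecture) == form
    )


contextual = element_types_of("HyTime", "clink")
```

     A system that understands the `clink` form can treat every element type in `contextual` as a contextual link without knowing the DTD's own element names.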
     Generalized architectural forms will provide a standard way to provide additional meaning and behaviors for elements. While a style sheet is still a necessary factor in fully automating the processing of elements, architectural forms give a document management system clues about treating groups of elements as a class. This could simplify the specification of features like metadata and query and possibly lead to system optimizations.
     
     

    Property set definitions

      The TC defines a mechanism for formally defining the object models that underlie the parsing used in HyTime and DSSSL, known as a grove. Most implementors of SGML processing software have invented a tree-based data structure for handling SGML. With the property set definition, this SGML model is rigorously defined, replacing the more informal Element Structure Information Set (ESIS). Theoretically, this mechanism can be used to describe arbitrary structured data, for example PostScript or RTF, enabling HyTime and DSSSL functions to transparently operate on this type of data.
     A few brief concepts and definitions from the annex include:
     
     
  • In unaugmented SGML, the element structure of an SGML document is a tree. When HyTime location and linking mechanisms are introduced, this tree can become a graph. A grove is a graph of nodes; in many cases, a grove is a tree, but it can also represent a set of disjoint nodes or trees.
  • A property set defines classes, properties (attributes), datatypes, normalization rules, enumerated values for property values, and lexical types.
  • A node is an ordered set of property assignments, each of which associates a value with a property. The data type constrains valid values for the property.
  • One property of nodes indicates the parent/child relationship; other properties provide for internal and external references to other nodes.
  • Rules that cannot be expressed formally, such as subtle parsing rules, are specified by referring to the appropriate standard and clause.
  • A grove plan describes how parsed data is to be treated as a node-based structure via a property set definition.

 Document management systems probably already have some notion of a grove buried deep in their processing model. A grove plan is one way to externalize that model and exhibit conformance with standards for which a grove plan is defined, for example SGML, HyTime, and DSSSL. Systems with a “grove-enabled API” should be highly interoperable with each other, and potentially able to handle arbitrary data types given the data's grove definition.
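 The relationship between a property set and the nodes built from it can be sketched as follows. The property set here is a hypothetical, heavily reduced stand-in with two classes; it is not the SGML property set from the standard, and the helper names are assumptions.

```python
# Hypothetical property set: class name -> {property name: allowed data type(s)}.
PROPERTY_SET = {
    "element": {"gi": str, "parent": (dict, type(None)), "children": list},
    "data": {"text": str, "parent": (dict, type(None))},
}


def make_node(class_name, **assignments):
    """Build a node, checking each property assignment against the property set."""
    declared = PROPERTY_SET[class_name]
    for name, value in assignments.items():
        if name not in declared:
            raise KeyError(f"class {class_name!r} has no property {name!r}")
        if not isinstance(value, declared[name]):
            raise TypeError(f"wrong data type for property {name!r}")
    return {"class": class_name, **assignments}


def add_child(parent, child):
    """Record the parent/child relationship in both nodes."""
    child["parent"] = parent
    parent["children"].append(child)
    return child


book = make_node("element", gi="book", parent=None, children=[])
title = add_child(book, make_node("element", gi="title", parent=None, children=[]))
text = add_child(title, make_node("data", text="Tractor Manual", parent=None))
```

 Because every node is just a set of checked property assignments, the same functions would work for a different property set describing a different kind of data.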
     
     

    Formal System Identifiers and Storage Managers

      SGML has the notion of an external identifier for specifying the location of an external entity, which has two forms: a public identifier and a system identifier. The notation of a public identifier is described in ISO 8879, and when the “formal” feature of SGML is used, a public identifier must conform to the notation. Public identifiers are intended to be “registered” and made available to the entity manager for resolution into a system identifier, which is a system-dependent locator to the actual data object. The SGML Open Catalog is a standard form for mapping a public identifier to a system identifier.
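      The mapping role of a catalog can be sketched with a small lookup. The parser below handles only one-line `PUBLIC` entries with quoted literals, which is a small subset of the catalog format, and the public identifiers and paths are made-up examples.

```python
import shlex

CATALOG_TEXT = '''
PUBLIC "-//Example//DTD Tractor Manual//EN" "dtds/tractor.dtd"
PUBLIC "-//Example//ENTITIES Part Names//EN" "entities/parts.ent"
'''


def parse_catalog(text):
    """Collect PUBLIC entries as {public identifier: system identifier}."""
    mapping = {}
    for line in text.splitlines():
        fields = shlex.split(line)
        if len(fields) == 3 and fields[0].upper() == "PUBLIC":
            mapping[fields[1]] = fields[2]
    return mapping


def resolve_public(public_id, catalog, system_id=None):
    """Prefer the catalog mapping; fall back to a declared system identifier."""
    return catalog.get(public_id, system_id)


catalog = parse_catalog(CATALOG_TEXT)
dtd_path = resolve_public("-//Example//DTD Tractor Manual//EN", catalog)
```

      Moving a document collection to another system then means shipping a different catalog, not editing the entity declarations in every document.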
     To date, the form and interpretation of a system identifier has been unspecified. Most SGML processing tools have treated a system identifier as a pathname or filename compatible with the system on which the tool runs.
     The HyTime TC provides a formal notation for system identifiers, called a Formal System Identifier (FSI), and provides some pre-defined types of locators, including operating system files and URLs. An FSI, in essence, maps an entity to a storage object and supports one-to-many and many-to-one mappings.
      In addition, the TC formalizes the layered architecture of an entity manager, which resolves the location of an entity, and a storage manager that actually provides access to the entity. Arbitrary storage managers can be declared through an FSI definition document.
     These features will allow a document management system to define itself as a storage manager and subsequently specify FSIs to map directly onto its data objects. For example, an FSI might be in the form of the system's object identifier or it could be a query in the system's query language. At the very worst, this allows interchange of pointers to data between users of the same system; at best, support for popularly defined storage managers will lead to widespread interoperability.
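     A document management system acting as a storage manager can be sketched as one entry in a dispatch table keyed by the storage manager name that opens an FSI. This is a simplification: attributes inside the tag are ignored, the `docrepo` storage manager and its identifiers are hypothetical, and the storage manager names shown are illustrative and should not be read as the set defined by the TC.

```python
import re

# "<manager ...>storage object identifier"; attributes in the tag are skipped.
_FSI = re.compile(r"^<(?P<manager>[A-Za-z][\w.-]*)[^>]*>(?P<soi>.*)$", re.S)

STORAGE_MANAGERS = {}
REPOSITORY = {"rid-42": "<para>Check the oil.</para>"}  # stand-in object store


def storage_manager(name):
    """Decorator that registers a function as the handler for one manager name."""
    def register(func):
        STORAGE_MANAGERS[name.lower()] = func
        return func
    return register


@storage_manager("literal")
def read_literal(soi):
    return soi  # the identifier itself is the content


@storage_manager("docrepo")  # hypothetical: the document management system
def read_from_repository(soi):
    return REPOSITORY[soi]


def read_entity(fsi):
    """Pick the storage manager named by the FSI and let it fetch the data."""
    match = _FSI.match(fsi)
    if match is None:
        raise ValueError("not a formal system identifier")
    handler = STORAGE_MANAGERS[match.group("manager").lower()]
    return handler(match.group("soi"))


content = read_entity("<docrepo>rid-42")
```

     Here the storage object identifier is an object identifier; it could equally be a query string that the system's own handler knows how to evaluate.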
     

    Conclusion

     You must carefully analyze your linking requirements in order to maximize support from your document management system. Some of the factors include:
     
  • Is the linked document collection well-bounded or is there a great deal of interaction with outside sources? Will you interchange your linked data with outside organizations?
  • Can you modify the data to meet a particular linking scheme or must the linking scheme apply to a fixed document model?
  • Is linking to SGML elements sufficient or do you need more sophisticated link targets, like text spans?
  • How much do you want your authors to know about linking? Can your linking scheme be made sufficiently easy for authors to use?
  • Will your links typically remain valid when the data they point to changes, or do you need dynamic link checking facilities?
  • What link information is your browser expecting and in what form?
  • How much can you afford to (or not afford to) plan ahead for the future?

     HyTime has some good answers for these real issues. It also has some overly complex answers for these issues. Document management systems can probably meet many of your linking needs without HyTime, but no doubt would be enhanced by supporting HyTime.
     Few, if any, off-the-shelf document management systems support HyTime today. Publication of the Technical Corrigendum may spur additional implementation activity, as the TC clears up several fuzzy areas of the standard that have hindered implementation. Meanwhile, XML-link is thundering up from the horizon, providing an alternative link description standard that may provide much of the power of HyTime with less of the overhead.
     But, as with all software features, it is the user community that dictates vendors' priorities. It's up to you to determine your needs and speak up accordingly.
