SGML/XML in healthcare information exchange standards   Table of contents   Indexes   How SGML can Support a Dynamic Public Affairs and Communication Policy

 
 

Why Your Document Management System Should Care About Hyperlinks


 
Paula Angerstein
  Senior analyst
  Texcel Research, Inc.
Austin, Texas, USA 78746
Email: paula@texcel.no
 
Biographical notice:
 
Paula Angerstein
 
Paula Angerstein, a principal architect of Information Manager, Texcel's SGML document management system, is currently responsible for deploying Information Manager in key customer accounts. Paula also represents Texcel on the W3C XML Working Group, and has had a long involvement with standards development including membership in the ANSI, ISO, and CALS committees developing DSSSL and the Output Specification. Previously, she has held consulting, planning, and marketing roles at Interleaf and Computer Task Group. At Texet, Xerox, and Unisys, she implemented SGML software. She received a BA in both Computer Science and Journalism from the University of Texas at Austin. She received GCA's Tekkie award in 1989.
 
ABSTRACT:
 
Hyperlinking has become a ubiquitous feature of just about every useful online application. Most discussion of linking has historically been done in the context of distribution, viewing, and display systems. However, with the increased pace of information update and distribution, it is imperative to consider the full life cycle of links. The need to create and maintain links cannot be just an afterthought; tools to manage the link life cycle are imperative for maintaining high-quality hyperlink networks. A document management system provides an excellent platform for managing the link life cycle, especially when it is integrated with authoring and delivery tools.
 
This paper examines the aspects of hyperlinking that are relevant to document management systems. Various standard mechanisms for hyperlinking—XML, HyTime, and HTML—are reviewed and their relative merits discussed. Ways in which document management systems can facilitate link creation, maintenance, and delivery are presented.
 
 
 

Linking requirements

 
Simply put, a link represents a relationship between two or more things. At this level, a link is a highly arbitrary concept with applicability to almost any aspect of information management. Most commonly, links are thought of as an aid to navigating through data, especially in the context of an interactive application such as an online browser. Probably the most well-known links are the links in an HTML document used to connect resources on the World Wide Web via a Uniform Resource Locator (URL). Another common form of a link is a "cross-reference", or a way of indicating to a reader that relevant material is located elsewhere.
 
But links aren't just for viewing anymore; links have become a common tool in the management of information, playing an important role in tracking the relationships within and among sets of data. In fact, links have become recognized as important pieces of data themselves that need to be authored and maintained just like other data. A link may not only "get me from here to there", it may indicate a crucial dependency between two elements, for example, a diagnosis of a problem and its suggested repair procedure. When one of these elements changes, the other must be examined to see if it is still relevant and current.
 
Additionally, because links necessarily encompass addressing techniques, they are becoming a common way to think about locating distributed objects, both within known domains like a file system or repository and across networks. Links can also be used to point into data in order to associate metadata with it, obviating the need to modify the data itself.
 
The following list outlines some typical user requirements for links:
  • Links to SGML data, including to an entity, element, span of text, and arbitrary span of content.
  • Links to arbitrary data and to points within that data.
  • Links within and across documents.
  • Links with multiple endpoints. These provide a way to associate an arbitrary number of pieces of information.
  • Links that carry with them some set of semantics, such as a type and other attributes to describe the link's behavior. These provide "self-describing" connections among information.
  • Link ends into data that cannot be modified, for example to associate review comments and annotations with the reviewed document.
  • Control over the direction of traversal of a link.
  • Notification when a link end is invalidated or modified.
  • Version history of a link.
  • Context-sensitive links, for example, when blocks of material are reused within an information set and link traversal is dependent on the context in which the block is used.
 
Various combinations of these requirements have been considered in the development of common and standardized linking strategies. This has led to a set of well-accepted properties of links:
  • The link itself and the address of the object to which the link points are two distinct specifications
  • The specification of the link itself is usually via some combination of elements and/or attributes
  • The specification of how to find the endpoints is usually an attribute value on the link whose value is some interpretable addressing scheme
  • Some attribute indicates what the link is for (its role)
  • Some attribute indicates what to do when the link is "activated" (its behavior)
  • Some attribute indicates the allowed types of things the link can point to
  • Some attribute indicates the allowed direction of traversal between link endpoints
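
The properties listed above can be pictured as a small data model. The following Python sketch is only an illustration: the class names, field names, address strings, and behavior values are assumptions made for this example and are not taken from any of the standards discussed in this paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Direction(Enum):
    FORWARD = "forward"
    BACKWARD = "backward"
    BOTH = "both"


@dataclass(frozen=True)
class LinkEnd:
    # The address is kept separate from the link itself: it is only a string
    # in some addressing scheme that a resolver must interpret.
    address: str
    # Element types this end is allowed to point to (empty means "any").
    allowed_types: frozenset = frozenset()


@dataclass
class Link:
    role: str                                  # what the link is for
    behavior: str                              # what to do when it is activated
    direction: Direction = Direction.BOTH      # allowed direction of traversal
    ends: list = field(default_factory=list)   # two or more LinkEnd objects


# Hypothetical link from a diagnosis to its suggested repair procedure.
diagnosis_to_repair = Link(
    role="repair-procedure",
    behavior="open-in-new-window",
    direction=Direction.FORWARD,
    ends=[
        LinkEnd("id:diag-017", frozenset({"diagnosis"})),
        LinkEnd("id:proc-204", frozenset({"procedure"})),
    ],
)
```

Keeping the address as an uninterpreted string mirrors the first property in the list: the link and the address of what it points to are two distinct specifications.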
     

    Options for linking

     
    Linking is such a ubiquitous application requirement that a number of standards address it, including SGML, HTML, HyTime, and XML. Additionally, an SGML-aware document management system can provide a number of useful features for handling links in an optimized fashion. This section examines each of these options for describing links, including where each falls short of the user requirements stated in the previous section.
     
     

    SGML

     
    SGML itself provides a construct for linking two elements via attributes of type ID and IDREF. While useful for many purposes, ID/IDREF is deficient in a number of ways:
    • Validation of the uniqueness of IDs (the "name space") is limited to a single SGML document. For links to be useful in an information management system, they must be able to span multiple documents.
    • IDREFs are limited to resolving to the location of one or more SGML elements. Links must be able to resolve to other types of objects, including those that do not have a unique identifier or even spans of data that do not have a collective identifier at all.
    • There is no way to associate additional link information with a specific ID/IDREF relationship, and it cannot be tracked as a data object in its own right.
    • ID/IDREF values are embedded in the document itself; this type of link cannot be used when the data cannot be modified; maintenance of IDREFs is difficult when ID values change.
    • Heavy reliance on SGML IDs creates a maintenance problem when elements can be shared; uniqueness of IDs must be guaranteed in all contexts in which an element is shared.
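
The single-document name space can be illustrated with a short Python sketch. It uses XML and the standard `xml.etree.ElementTree` module rather than an SGML parser, and it assumes, purely for the example, that the attributes are literally named `id` and `idref`; in real SGML the attribute types are declared in the DTD. The document names and contents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two hypothetical documents, each fine on its own, that both use the ID "intro".
DOCS = {
    "manual.xml": '<doc><sec id="intro"/><sec id="usage"/><xref idref="usage"/></doc>',
    "guide.xml": '<doc><sec id="intro"/><xref idref="usage"/></doc>',
}


def ids_in(source):
    """Return the set of id attribute values found in one document."""
    return {el.get("id") for el in ET.fromstring(source).iter() if el.get("id")}


def unresolved_idrefs(source):
    """Return idref values with no matching id inside the same document."""
    ids = ids_in(source)
    refs = [el.get("idref") for el in ET.fromstring(source).iter() if el.get("idref")]
    return [ref for ref in refs if ref not in ids]


def repository_collisions(docs):
    """Map each ID used in more than one document to the documents using it."""
    seen = defaultdict(list)
    for name, source in docs.items():
        for id_value in ids_in(source):
            seen[id_value].append(name)
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}
```

In this sketch the reference in `guide.xml` cannot reach the `usage` section of `manual.xml`, and the ID `intro` becomes ambiguous as soon as both documents live in one repository.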
     
     

    HTML

     
    HTML is the HyperText Markup Language, which currently defines the form of documents used as resources on the World Wide Web. Simply put, in HTML, the A tag is a link, and its href attribute value is a URL (Uniform Resource Locator), which specifies the address of the Web resource that is the endpoint of the link.
     
    The following is an example of an HTML link: <A href = "http://www.texcel.no/texcel.htm">Click here to go to the Texcel home page.</A>
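
As a rough illustration of how little addressing information an HTML link carries, the following Python sketch extracts the href from an A element and splits it into the resource address and the fragment identifier, using only the standard `html.parser` and `urllib.parse` modules. The `#products` fragment is a hypothetical addition to the example URL, and the class name `AnchorCollector` is an assumption of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorCollector(HTMLParser):
    """Collect the href value of every A element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


collector = AnchorCollector()
collector.feed('<A href = "http://www.texcel.no/texcel.htm#products">Products</A>')

# The URL names one resource; the fragment names one identifier inside it.
resource, fragment = urldefrag(collector.hrefs[0])
```

The link end is therefore a single resource plus, at most, a single named location within it.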
     
    While obviously useful for a certain class of linking application, HTML links do not satisfy the following requirements:
  • Links to spans of text and spans of content. The fragment identifier portion of a URL locates into a resource based on a single identifier assigned to an element.
  • Links into arbitrary data types.
  • Links with multiple endpoints.
  • Link ends into data that cannot be modified are possible only when the data already contains the appropriate fragment identifiers.
  • Control over the direction of traversal of a link.
  • Context-sensitive links.
     

    HyTime

     
    HyTime is the Hypermedia/Time-based Structuring Language, defined in ISO standard 10744. HyTime is an application of SGML that standardizes a number of semantic definitions for hyperlinks. These definitions are expressed in the standard as "architectural forms" such that they can be applied to the elements and attributes defined in any SGML Document Type Definition (DTD). The result is that the standardized semantics apply to the elements when they occur in an SGML document.
     
    So, in one sense, HyTime is "just SGML"; an SGML-compliant system can successfully parse a document with HyTime elements in it. In another sense, these semantics are meant to affect the behavior of the processing system; thus, recognizing and implementing these semantics is the essence of a true HyTime-compliant system.
     
    HyTime describes a link in this way: a link relates two or more link ends. Each link end is a locator to a piece of data known as an anchor. A locator is one or a combination of various robust addressing mechanisms to identify the resource, including identification by name, ID value, attribute/property values, position in a tree or list structure, character offsets, and arbitrary queries.
     
    A contextual link has one of its link ends implied by the link element's position in the document; the link element itself serves as one anchor. An independent link is just that—it resides independently of any of its link ends.
     
    The following is an example of a HyTime contextual link: <clink HyTime = "clink" linkend = "TexcelLogo">This is a HyTime link.</clink>
     
    HyTime is a fully featured set of semantics for links and covers the complete stated set of user requirements for linking.
     
     

    XML Linking Language (XLink)

     
    XML is the eXtensible Markup Language, a streamlined dialect of SGML developed by a W3C Working Group as an enhanced form for resources on the Web. A companion standard to XML, "XML Linking Language (XLink)" defines a powerful set of linking constructs.
     
    In XLink, a linking element, designated by the appearance of an xml-link attribute, has an href attribute, whose value is a URL. In addition to standard URL location semantics, the system-specific query part of the URL is defined to be an Xpointer, derived from the TEI extended pointer. Xpointers locate spans of information based on ID values, specified attribute values, position in an element tree, occurrence in a list, string matching, and character offsets. Additional attributes on a linking element specify the link's role and the preferred behavior for the timing and effects of link traversal.
     
    Like HyTime's contextual link, XML has a type of link where one of its link ends is implied by the position of the link element within a resource; this is known as a simple link. An extended link, like HyTime's independent link, can have any number of link ends, all of which are independent of the resources they locate.
     
    The following is an example of an XML simple link: <link xml-link = "simple" href = "file:///home/texcel/texcel.htm">This is an XML link.</link>
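
Because a linking element is recognized by an attribute rather than by its element name, a processor can find links in any document type. The Python sketch below shows the idea with the standard `xml.etree.ElementTree` module; it only detects the `xml-link` attribute described above and reads the href, and makes no attempt to interpret Xpointers. The surrounding `para` and `emph` elements and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<para>See <link xml-link="simple" '
    'href="file:///home/texcel/texcel.htm">This is an XML link.</link> '
    'and <emph>this ordinary element</emph>.</para>'
)


def linking_elements(source):
    """Yield (element name, link type, href) for each element carrying xml-link."""
    for el in ET.fromstring(source).iter():
        link_type = el.get("xml-link")
        if link_type is not None:
            yield el.tag, link_type, el.get("href")


found = list(linking_elements(SOURCE))
```

Only the `link` element is reported; the `para` and `emph` elements carry no `xml-link` attribute and are ignored.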
     
    XML links do not satisfy these user requirements:
  • Links into arbitrary data types.
  • Control over the direction of traversal of a link.
     

    Extended ID/IDREF (RID/RIDREF)

     
    SGML document management systems typically have unique object identification for every SGML element. These repository identifiers (RIDs) make complex addressing unnecessary: link resolution is simple reference to the RID (a RIDREF).
     
    Within its own domain, a document management system can provide unique ID generation for SGML elements, and internally maintain an efficient link management strategy based on RID/RIDREF. When data goes out of this domain, links can be exported or translated to a standard form.
     
    An example of an ordinary link would be something like: <link linkend = "TexcelLogo">This is a link.</link>
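
The appeal of the RID/RIDREF approach is that resolution reduces to a lookup. The following Python sketch shows a toy repository under assumed conventions: the `Repository` class, its method names, and the `RID000001` identifier format are all hypothetical and do not describe any particular product.

```python
import itertools


class Repository:
    """Toy repository that assigns a unique RID to every stored element."""

    def __init__(self):
        self._next = itertools.count(1)
        self._elements = {}

    def check_in(self, element_type, content):
        # The repository, not the author, generates the identifier,
        # so uniqueness holds across every document it manages.
        rid = f"RID{next(self._next):06d}"
        self._elements[rid] = (element_type, content)
        return rid

    def resolve(self, ridref):
        # Resolution is a direct lookup; there is no address expression to evaluate.
        return self._elements[ridref]


repo = Repository()
logo = repo.check_in("graphic", "texcel-logo.gif")
para = repo.check_in("para", "Our logo is shown above.")
```

A link stored as `logo` resolves with `repo.resolve(logo)` regardless of which document the graphic is used in.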
     
    The shortcomings of this type of scheme include:
  • Links must be to discrete SGML elements or entities; links to spans of text or to and within non-SGML data would require proprietary extensions.
  • Link semantics are proprietary to the system.
  • Linking into data that cannot be modified can be done only when the data already has appropriate ID values.
  • Control over the direction of traversal of a link would require proprietary extensions.
     

    Linking and Document Management

     
    The typical user encounters links in order to use them: a hypermedia system presents the links, the user clicks on them, and the links are traversed. But somebody had to create those links and some system must manage them. Here is where a document management system plays a role.
     
    Let's take a look at how a document management system might address three main areas of link management: creation of links, maintenance of links, and delivery of data with links in it, examining where standards play a role.
     
    A couple of common threads run through these three discussions. First, requirements tend to fall into categories depending on whether the collection of documents within which links connect elements is bounded and well-known. An additional metric is whether or not data to be linked can be modified. An example of one extreme is surfing the Web: click on a link and you have no idea from where the next page will come, and you certainly have no authority to modify any data but your own. On the other side, consider a set of technical documents and online help describing a tractor: the tractor manufacturer has full knowledge and control over the set of information and can therefore organize and modify it at will.
     
     

    Link creation

     
    For link authoring, a document management system comes into play to present the candidates for link targets, as the system has visibility to the set of items available for linking. Using the system's interfaces or integrating with authoring tool interfaces, items that can potentially be linked to can be presented in displays such as structure views, query result lists, and formatted content.
     
    Once the endpoints of the link are identified via, for example, point-and-click, the document management system can provide a number of services, including generating ID values for link targets, generating addresses to the link targets, creating the link element itself, storing the link element appropriately, and performing various types of constraint checking.
     
    The most straightforward and arguably most maintainable form for link addresses uses the unique ID of the link anchors. Linking based on an ID is one of the most flexible addressing methods because an element can be moved and links to it remain valid. Addressing based on relative position within a tree or list may resolve to a different element if the element's container is modified.
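
The difference between the two addressing styles can be seen in a small Python sketch. The document content and the helper names `by_id` and `by_position` are hypothetical; the sketch uses XML and `xml.etree.ElementTree` as a stand-in for an SGML structure.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<manual><sec id="safety"/><sec id="install"/><sec id="repair"/></manual>'
)


def by_id(tree, id_value):
    """Address an element by its ID value."""
    return next(el for el in tree.iter() if el.get("id") == id_value)


def by_position(tree, index):
    """Address an element by its position among the container's children."""
    return list(tree)[index]


# Both addresses initially find the "install" section.
before = (by_id(root, "install").get("id"), by_position(root, 1).get("id"))

# Move "install" to the end of its container.
install = by_id(root, "install")
root.remove(install)
root.append(install)

# The ID address still finds it; the positional address now finds "repair".
after = (by_id(root, "install").get("id"), by_position(root, 1).get("id"))
```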
     
    As already noted, an ID may not be unique within a repository of documents. A document management system, however, having access to all the documents, can provide repository-wide unique naming as part of insertion and check-in of SGML components.
     
    HyTime overcomes the SGML limitation of ID/IDREFs to a single document with the name space addressing mechanism, commonly known as nameloc. A nameloc identifies both the ID of the link anchor and the entity in which it occurs. Use of nameloc requires management of entity declarations and entity resolution for all entities linked to in the document collection. For XML links, the specified URL locates the appropriate entity.
     
    What about legacy data, data imported into the repository, and electronic review of documents by multiple reviewers, scenarios that potentially require links into read-only data? If the incoming data is sufficiently marked up with IDs, you may be able to integrate it without change into your existing link strategy. If not, and you can modify the data, use the document management system to update ID values and existing links to harmonize it with your link management strategy. If you can't modify the data, for example, to associate comments with an element, you may need the more sophisticated addressing mechanisms HyTime and XML provide to link to SGML data without IDs.
     
    What about anchors that are not an SGML element, for example a span of text, and anchors that are not SGML data at all, say an object in a graphic? This type of linking generally calls for an additional level of linking capability to describe offsets from other addressable elements. The HyTime data location addressing mechanism (dataloc) addresses into spans of character data, as does the XML Xpointer STRING construct. Additionally, all HyTime addressing mechanisms can be combined into a span location to address a chunk of data that spans contiguous elements, and an XML Xpointer can designate a single span via a start and end point.
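
The general idea of addressing a span as an offset from an addressable element can be sketched in Python as follows. This is not HyTime dataloc or Xpointer syntax; the `TextSpan` class, its fields, and the sample paragraph are assumptions made for illustration only.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class TextSpan:
    element_id: str   # ID of the enclosing addressable element
    start: int        # zero-based character offset into the element's text
    length: int       # number of characters in the span


def resolve_span(tree, span):
    """Return the characters addressed by span, or raise if it does not fit."""
    element = next(el for el in tree.iter() if el.get("id") == span.element_id)
    text = "".join(element.itertext())
    if span.start < 0 or span.start + span.length > len(text):
        raise ValueError("span falls outside the anchor element")
    return text[span.start:span.start + span.length]


doc = ET.fromstring(
    '<doc><p id="p1">Replace the fuel filter every 200 hours.</p></doc>'
)
anchor = resolve_span(doc, TextSpan("p1", start=12, length=11))
```

Note that the document itself is not modified: the span lives entirely in the link, which is what makes this style usable for read-only data. It is also fragile, since any edit to the paragraph text can shift the offsets.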
     
    Creating the link itself is generally dependent upon the DTDs used within the repository. Authoring interfaces to a document management system should be configurable to work with elements designated as link elements. A DTD could define more than one type of link element, with attributes to describe some aspects of the link. The ease with which you can designate existing elements as HyTime or XML links depends on how closely the linking design was aligned with HyTime or XML concepts.
     
    Contextual and simple links are embedded in the document data at the place where the link takes effect. Creating these types of links generally integrates well with existing SGML authoring techniques; creating a contextual link is much like creating any other type of element.
     
    HyTime's independent links and XML's extended links encapsulate all link ends of a link into a separate element that can reside anywhere. Using independent links requires a bit of extra planning to determine where to store the link elements as well as requiring a more sophisticated authoring user interface to the links; however, independent links provide a level of indirection that facilitates maintenance.
     
    Regardless of the way in which the link is stored, the document management system may provide some constraint checking, for example to ensure that the type of the link anchor is appropriate. The system may also store various types of metadata with the link itself, for example, the creation time and author to provide a version history for the link.
     
    Document management systems facilitate the reuse of blocks of information in multiple contexts. This poses some additional requirements for creating links in a shared block with multiple endpoints that may resolve differently depending on the context in which the block is used. The authoring interface for links must provide a way to specify the context for a particular link end.
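
One simple way to picture a context-sensitive link end is as a table keyed by context. The Python sketch below assumes a hypothetical shared block reused in two tractor manuals; the context names, repository identifiers, and function name are invented for the example.

```python
# Hypothetical shared block reused in two manuals: the link end "parts-list"
# resolves to a different anchor in each context.
CONTEXT_TARGETS = {
    "parts-list": {
        "tractor-model-a": "RID000120",
        "tractor-model-b": "RID000347",
    },
}


def resolve_link_end(link_end, context, default=None):
    """Return the anchor for link_end in the given context, else default."""
    return CONTEXT_TARGETS.get(link_end, {}).get(context, default)
```

For instance, `resolve_link_end("parts-list", "tractor-model-b")` yields the anchor for the second manual, while an unknown context falls back to the default.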
     
    Another aspect of link creation is automated link creation, as opposed to user-generated links. A document management system may have an application that can synthesize links based on a rule set. These links might be generated periodically or on demand, or they could be automatically generated each time a document is updated.
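
A rule set for synthesizing links could be as simple as a list of text patterns, each paired with a target. The Python sketch below assumes that form; the patterns, target IDs, and paragraph text are hypothetical, and a real rule set could equally well be based on element structure or metadata.

```python
import re

# Hypothetical rule set: each pattern maps to the ID of the element to link to.
RULES = [
    (re.compile(r"\bfuel filter\b", re.IGNORECASE), "proc-fuel-filter"),
    (re.compile(r"\boil pump\b", re.IGNORECASE), "proc-oil-pump"),
]


def synthesize_links(paragraphs):
    """Return (paragraph id, target id, matched text) for every rule match."""
    links = []
    for para_id, text in paragraphs.items():
        for pattern, target in RULES:
            for match in pattern.finditer(text):
                links.append((para_id, target, match.group(0)))
    return links


generated = synthesize_links({
    "p1": "Inspect the fuel filter before starting.",
    "p2": "The Oil Pump is behind the fuel filter.",
})
```

Running the same function again after a document is updated regenerates its links, which corresponds to the "each time a document is updated" option described above.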
     
     

    Link maintenance

     
    A key requirement for link maintenance is ensuring that links continue to point to their intended anchor. In some environments, this means maintaining the exact same anchor; in other environments it may be perfectly fine, or even desired, for the anchor to be updated or for the link end to resolve to a different anchor. For example, a link from an overview topic in training material needs to be examined when the reference material it points to is updated, as the summary material may need to be changed. On the other hand, in a software user's guide on how to perform a task, a link to a commonly used procedure is still valid when the procedure description changes. Links to affiliations, addresses, dates, prices, and "the latest" anything may dynamically resolve to new anchors.
     
    A document management system with a "where-used" reporting mechanism can dynamically report changes to anchors. If integrated well with the authoring environment, the system can even report these changes in real time, for example, to prevent deleting an element that is an anchor of a link.
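
A "where-used" mechanism amounts to a reverse index from anchors to the links that point at them. The following Python sketch shows one minimal form of it; the `LinkIndex` class and its method names are assumptions of this example, not the interface of any particular system.

```python
from collections import defaultdict


class LinkIndex:
    """Reverse index from anchor IDs to the IDs of links that point at them."""

    def __init__(self):
        self._used_by = defaultdict(set)

    def add_link(self, link_id, anchor_ids):
        for anchor_id in anchor_ids:
            self._used_by[anchor_id].add(link_id)

    def where_used(self, anchor_id):
        """Return the links that have anchor_id as one of their anchors."""
        return sorted(self._used_by.get(anchor_id, ()))

    def check_delete(self, anchor_id):
        """Raise ValueError if deleting anchor_id would break a link."""
        users = self.where_used(anchor_id)
        if users:
            raise ValueError(f"{anchor_id} is an anchor of: {', '.join(users)}")
```

An authoring tool integrated with the repository could call `check_delete` before removing an element and show the author the affected links instead of silently breaking them.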
     
    Using standard addressing mechanisms for link ends means the document management system needs to know how to resolve these addresses in order to validate the anchors. Some addressing mechanisms complicate the ability to determine when an anchor has changed because the resolution of a link end into an anchor may not be static, for example, when the link end is a query. In this case, the validation of link ends may have to be driven by checking the link itself, rather than monitoring all possible anchors.
     
    Within a document management system, any link element, whether designated as standard link type or not, can be treated as a data object. At a minimum, a document management system can treat a link element like any other SGML element, providing features like access control, versioning, and metadata for links.
     
    A document management system greatly facilitates the use of independent links because it provides a place to store and maintain the set of independent links. This link document provides an efficient place to get information about all the links in the repository. While authors may update and add links via editing of a particular document, the links themselves are actually updated in the independent link document. Alternatively, the document management system can provide an interface to modify independent links without accessing any of its anchors, providing a way to maintain a version of a link without affecting the referenced documents.
     
     

    Link delivery

     
    Typically, data with associated links is destined for a presentation system in which the links provide a method of navigation for the end user. Because it knows how to resolve addresses to link anchors, a document management system can readily traverse links and launch applications to view the link anchors. This capability may serve as the basis for a viewing environment connected to the repository, or it may be used to simulate a stand-alone viewing environment.
     
    In a stand-alone viewing application, the data is handed over to the presentation system by the document management system. In this one-way delivery, the data becomes read-only, so links can be frozen and stored for fast resolution.
     
    A goal of the document management system is to output data in a form that can be either directly used by the presentation system or easily transformed for use. Here, the requirements of the browser drive the form of the data. Some well-known browsers expect HTML, with its embedded A element. Some browsers additionally operate on any SGML, recognizing HyTime or XML linking and addressing constructs. Other browsers may have proprietary forms or embellishments for data and links.
     
    The document management system, having a view to all the links in the repository, can determine the set of documents that need to be available to satisfy a web of links and facilitate packaging of these documents for delivery. Additionally, the document management system can provide "on the fly" content generation and transformation services to produce the final output.
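
Determining the set of documents needed to satisfy a web of links is a reachability problem. The Python sketch below walks a hypothetical link web breadth-first; the document names and the `LINKS` table are invented, and a real system would derive the table from its stored links.

```python
from collections import deque

# Hypothetical link web: document -> documents that its links point into.
LINKS = {
    "overview": ["install", "repair"],
    "install": ["parts"],
    "repair": ["parts", "overview"],
    "parts": [],
    "marketing": ["overview"],
}


def documents_to_package(start, links):
    """Breadth-first walk returning every document reachable from start."""
    needed, queue = {start}, deque([start])
    while queue:
        for target in links.get(queue.popleft(), ()):
            if target not in needed:
                needed.add(target)
                queue.append(target)
    return needed
```

Starting from `overview`, the package contains `overview`, `install`, `repair`, and `parts`; `marketing` links into the set but is not needed to satisfy any link within it.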
     
    To facilitate link resolution for the presentation system, you can use the document management system's link resolution capabilities to resolve links and represent them in the output in the simplest form for the presentation system. For example, independent links could be resolved to contextual links, assuming you have authority to modify the documents as they are retrieved for the presentation system. Also, context-sensitive links can be resolved for each context.
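
For a browser that expects HTML, the simplest output form is an A element whose href has already been resolved. The Python sketch below assumes the repository's resolver has produced a table from anchor IDs to delivered URLs; the table contents and the function name are hypothetical.

```python
from html import escape

# Hypothetical table produced by the repository's resolver:
# anchor ID -> URL of the anchor in the delivered document set.
RESOLVED = {
    "TexcelLogo": "images/logo.htm",
    "proc-204": "repair.htm#proc-204",
}


def to_html_anchor(linkend, text):
    """Render one resolved link as an HTML A element.

    Links whose anchor is not part of the delivery are emitted as plain text.
    """
    url = RESOLVED.get(linkend)
    if url is None:
        return escape(text)
    return f'<A href="{escape(url, quote=True)}">{escape(text)}</A>'
```

Because resolution happens once at delivery time, the presentation system needs no knowledge of the repository's addressing scheme.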
