
 
 

Using Meta data to Automate XML Document Production and Maintenance


 
Joe Gelb
  Vice President, Electronic Document Projects
  LiveLink Systems Ltd.
POB 34059, 5 Mercaz Shatner
Jerusalem 91340, Israel
Phone: +972-2-6528274
Fax: +972-2-6528356
Email: joeg@livelink.com Web: www.livelink.com
 
Biographical notice:
 
Joe Gelb
 
Joe Gelb is Vice President of Electronic Document Projects at LiveLink Systems. He has designed and implemented database-driven document automation projects for technology companies and reference publishers.
 

Prior to joining LiveLink Systems in 1996, Mr. Gelb worked for General Electric Astro Division and McDonnell Douglas. He earned a Bachelor of Engineering (Mechanical Engineering) and a BA in History from Stevens Institute of Technology in 1992.
 
Katriel Reichman
  President and founder
  LiveLink Systems Ltd.
 
ABSTRACT:
 
Brief Abstract
 
XML, SGML and even HTML contain implicit and explicit meta information that can be used to improve document presentation, and dramatically reduce document maintenance costs. This presentation describes how meta data is used by technical documentation groups to automate and maintain hyperlinks for technical document libraries.
 
Full Abstract
 
After our customers developed and deployed extensive electronic documents, they faced the threat of "deadwebs" - portions of web sites and intranets whose content stagnated while other portions were added or changed. As new information was added, older information slipped out of sync with the site as a whole. New information linked to old, but older documents did not link to the latest data or link to the latest available media. As the volume of information increased, the difficulties in adding and updating information escalated.
 
To avoid deadwebs, we have applied meta data to distinguish between content objects, information about each object, and information about how the object relates to other objects. Meta data is descriptive information about document objects. Examples of meta data might be the name of the product that the object describes, or the audience level served by the object. Meta data can be inferred, as well as defined explicitly by authors.
 
The meta data is then used to automatically code hyperlinks and to control the properties of the hyperlink. As documents are edited, deleted or modified, the automatic hyperlinking process determines the best possible links and codes them automatically. As a result, information can be edited freely and links will always be accurate and up-to-date.
 
This presentation describes how this methodology can be applied, provides specific examples for XML and HTML links, reviews techniques for inferring meta information, describes how to account for idiosyncratic information, and discusses the limitations of the methodology.
 
In the case of XML, this presentation will describe how the methodology meets the challenges of preparing special XML link properties. This presentation will consider how publishers who need to downscale for HTML publishing can apply the methodology to control link properties automatically using JavaScript.
 
 

Introduction

 

The advent of electronic documents created and used over enterprise-wide networks and internet webs is, in some ways, an even more radical event than the invention of the printing press. Gutenberg only automated an existing process - reproduction of hardcopy documents. Electronic documents, however, require a fundamental shift in paradigm for effective production and maintenance. An effective paradigm is needed to bridge the gap between how documents are authored (linearly) and how they are used when delivered via browsers (associatively). In this paper we will refer to a major subset of this challenge as the link coding problem.
 
In projects as diverse as the Webster's New World Dictionary, the Jerusalem Post Archive and BackWeb Technologies' technical documentation, we have applied meta data to automate coding of hyperlinks and to control the properties of links. Using a database, we have successfully applied automation on both legacy and new document objects to avoid the link coding problem.
 
 

The Link Coding Problem

 

While for-print documents are traditionally prepared linearly, beginning with an outline and following a structure that is apparent to both the author and the reader, electronic documents require associative preparation. Readers need clues that will enable them to navigate and locate related topics without reference to a document hierarchy.
 
Linear preparation of documents continues to make sense for document authors. Bridging the gap from linear authoring to the coding required to facilitate associative use requires facing a series of challenges, including:
  • Coding is time consuming
  • Corporate data sets tend to be large, making it difficult to keep track of what related information is available
  • Interesting data tends to change, requiring constant re-coding
  • Coding is not intuitive
  • Different people own different parts of the data (in jargon, this is called the cooperative problem)
     

    Enterprise Publishing™ Provides an Alternative Paradigm for Link Coding

     
    A new paradigm developed by LiveLink, and implemented in our Enterprise Publishing software, bridges the gap between linear document authoring and associative coding of hyperlinks.
     

    The paradigm uses meta data to distinguish between content objects, information about each content object, and information about how the object relates to other objects. Meta data is descriptive information about document objects.
     

    Using the meta data, Enterprise Publishing automatically codes hyperlinks and controls the properties of the hyperlinks. As documents are edited, deleted or modified, the automatic hyperlinking process determines the best possible links and codes them automatically.
     
     

    Goals of the Paradigm

     

    The goal of the paradigm is to enable genuine automation of hyperlink coding and updating, without human intervention, in real-life work environments. We suggest evaluating our own proposal, and others, by benchmarking them against how well they realize this goal.
     
     

    Criteria for Evaluating the Paradigm

     
     

    Automation

     
    In order to meet the goals of fast creation and easy maintainability, the solution needs to be fully automated. All coding should be performed by the software using decision rules.
     
     

    Intelligence

     
    The links made should be appropriate and at least as good as those that could be coded manually using point-and-click tools. If the solution creates nonsensical ("silly") links, redundant links or just too many links, it will fail to meet this criterion.
     
     

    Adaptability for Meeting Real-Life Needs

     
    The solution should be adaptable to meet the idiosyncratic needs of individual projects.
     
     

    Standards-based

     

    The solution should support the file formats used today and currently in development. In the context of the Web, meeting this goal requires support for input and output of different implementations of HTML (including JavaScript) and XML (including the variations of XML linking).
     
     

    How the Paradigm Works

     

    LiveLink assumes that the starting point (or "input") is structured electronic documents. The software works by identifying and using that structure first to break up and then reassemble "enriched" documents that improve upon the original documents.
     

    The software first assigns documents to folders and sub-folders, using a file cabinet metaphor. Folders and sub-folders act as containers to associate meta data with the files contained in each folder. For example, all of the documents associated with a particular product might be assigned to a particular folder. All of the documents relating to maintenance of that product may be assigned to one sub-folder, and those relating to product operation to a second sub-folder.
     

    After assigning documents to folders, the software breaks documents into topics using tags to identify topic hierarchy. It then further distinguishes between different elements in the topics. Coarse dissection separates main and sub topics, text and graphics. Fine dissection separates parenthetical information such as notes and examples, and detailed information such as tables, table captions and pre-marked links.
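Coarse dissection of the kind described above might be sketched as follows. This is only an illustration, not LiveLink's implementation: it assumes HTML input in which h1 to h3 tags mark the topic hierarchy, and the names TopicDissector and dissect_topics are hypothetical.

```python
from html.parser import HTMLParser

# Assumed mapping from heading tags to topic depth.
HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3}


class TopicDissector(HTMLParser):
    """Split a document into topics, one per heading (coarse dissection)."""

    def __init__(self):
        super().__init__()
        self.topics = []  # each topic: {"level", "title", "text"}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            # A heading opens a new topic at the depth implied by its tag.
            self.topics.append({"level": HEADING_LEVELS[tag], "title": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.topics:
            return  # ignore text that precedes the first heading
        key = "title" if self._in_heading else "text"
        self.topics[-1][key] += data


def dissect_topics(html):
    """Return the list of topics found in an HTML string."""
    parser = TopicDissector()
    parser.feed(html)
    parser.close()
    for topic in parser.topics:
        topic["title"] = topic["title"].strip()
        topic["text"] = " ".join(topic["text"].split())
    return parser.topics


sample = "<h1>Pump</h1><p>Overview.</p><h2>Maintenance</h2><p>Check the oil gauge.</p>"
topics = dissect_topics(sample)
```

Fine dissection (notes, examples, tables, captions) would need further tag or style handling that this sketch leaves out.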
     
     

    Distinguishing Between Content Level (documents) and Expert Level (meta data)

     

    LiveLink software distinguishes between content level and expert level information. Content level refers to text and media, stored in documents, that the software processes. Expert level refers to meta data - information about how the content should be interpreted. Expert level information is stored in a database.
     

    LiveLink software makes the expert level information persistent. That is, the information recorded by experts should be re-usable as the source files for the project change and should be portable to new projects in the organization.
     
     

    Database Driven

     

    LiveLink software automatically populates a database by parsing files and isolating potential targets for hyperlinks. The database can be edited using standard database tools. The information stored in the database is read back and used by LiveLink products to control how documents are enriched.
     

    The database approach allows users complete control over the database interface and expert level content. The user can create a "stop list" disabling particular targets that are too general and would lead to nonsensical or ambiguous links. In addition, aliases for potential targets may be added to the database. For example, "dipstick" can be an alias for "oil gauge."
     

    Database rules may be coded to improve the quality of document enrichment. A sample database rule might be to ignore all targets with the text string "hint".
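The interplay of stop list, aliases and database rules can be sketched in a few lines. The function candidate_targets and its data layout are hypothetical, chosen only to mirror the examples in the text; a real system would hold this information in database tables.

```python
def candidate_targets(titles, stop_list, aliases, rules):
    """Return a mapping from link phrase (lower case) to target title.

    titles:    topic titles isolated as potential targets
    stop_list: lower-case phrases that must never become targets
    aliases:   alternative phrase -> title of the target it stands for
    rules:     predicates; a target is ignored if any rule returns True
    """
    phrases = {}
    for title in titles:
        key = title.lower()
        if key in stop_list:
            continue  # too general, would produce nonsensical links
        if any(rule(title) for rule in rules):
            continue  # excluded by a database rule
        phrases[key] = title
    # An alias is only useful if the target it points to survived filtering.
    for alias, title in aliases.items():
        if title.lower() in phrases:
            phrases[alias.lower()] = phrases[title.lower()]
    return phrases


titles = ["Oil gauge", "Overview", "Hint: priming the pump"]
stop_list = {"overview"}
aliases = {"dipstick": "Oil gauge"}
rules = [lambda title: "hint" in title.lower()]  # the sample rule from the text

phrases = candidate_targets(titles, stop_list, aliases, rules)
```

Here "Overview" is dropped by the stop list, the hint topic is dropped by the rule, and both "oil gauge" and "dipstick" resolve to the same target.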
     
     

    Implicit and Explicit Database Clues

     
    LiveLink software automatically provides the database with clues about different information types. For example, the database records whether the target is a main header or a subhead, a glossary entry or a table caption.
     
    Clues are derived explicitly (typically from style tags in the source documents) and implicitly (using key phrases, juxtaposition and other hints).
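A minimal sketch of how explicit and implicit clues might be combined is given here. The style names and key-phrase patterns are assumptions made for illustration; they are not taken from LiveLink software.

```python
import re

# Explicit clues: assumed style names mapped to information types.
STYLE_TO_TYPE = {
    "Heading1": "main header",
    "Heading2": "subhead",
    "GlossTerm": "glossary entry",
    "TableCaption": "table caption",
}

# Implicit clues: assumed key phrases that hint at an information type.
IMPLICIT_PATTERNS = [
    (re.compile(r"^Table\s+\d+", re.IGNORECASE), "table caption"),
    (re.compile(r"^(Note|Caution|Warning):", re.IGNORECASE), "note"),
]


def classify(text, style=None):
    """Return (information type, how the clue was derived)."""
    if style in STYLE_TO_TYPE:
        return STYLE_TO_TYPE[style], "explicit"
    for pattern, info_type in IMPLICIT_PATTERNS:
        if pattern.search(text):
            return info_type, "implicit"
    return "body text", "default"
```

Explicit clues take priority in this sketch, on the assumption that an author's style tag is more reliable than a pattern match.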
     
     

    Grouping Files in Projects, Folders and Sub-Folders

     

    LiveLink software supports the grouping of files in projects by folders and sub-folders. The realm of files that are processed together is called a project. Folders are logical groupings of files within the project. Sub-folders enable enhanced functionality in various areas:
  • Inheritance: Individual files within sub-folders inherit the attributes of a sub-folder (such as read-only).
  • Quickly assigning properties: Users can assign properties to files grouped in particular folders. For example, all files in a sub-folder can be specified as "source-only", "target-only" or "source and target".
  • Precedence: Precedence rules may refer to a sub-folder in order to resolve ambiguous links. (See below.)
  • Database rules: The database may use sub-folder affiliation in order to add properties to individual topics. For example, all items in a particular sub-folder may be coded to open a new window when they are the target of a link.
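Inheritance of properties through the project, folder and file hierarchy can be illustrated with a small lookup that walks up the containers. The Node class, the property names and the sample hierarchy are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A project, folder, sub-folder or file in the hierarchy."""
    name: str
    parent: Optional["Node"] = None
    properties: dict = field(default_factory=dict)

    def effective(self, key, default=None):
        """Look up a property locally, then walk up through the containers."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return default


project = Node("pump-docs")
maintenance = Node(
    "maintenance",
    parent=project,
    properties={"role": "source and target", "open_in": "new window"},
)
glossary = Node("glossary", parent=project, properties={"role": "target-only"})

oil_change = Node("oil-change.html", parent=maintenance)
draft = Node("draft.html", parent=maintenance, properties={"role": "source-only"})
```

In this sketch oil-change.html inherits both properties from its sub-folder, while draft.html overrides only its role and still inherits the window behavior.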
     

    Distinguishing Between Targets and Sources

     

    LiveLink software allows the user to account for different levels of file ownership or readiness for public viewing by differentiating between "source-only", "target-only" and "source and target". Source-only files or topics can link to targets in other files but cannot themselves be the target of links. For example, an access-restricted file or a file that has not yet been approved for public viewing would be a good candidate for source-only status. Target-only files or topics, such as a corporate glossary, can be linked to but not from.
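The three roles reduce to a simple eligibility check on each candidate link. The sketch below assumes roles are stored as the strings used in the text; the function name may_link and the sample file names are hypothetical.

```python
SOURCE_ROLES = {"source-only", "source and target"}
TARGET_ROLES = {"target-only", "source and target"}


def may_link(source_role, target_role):
    """A link is allowed only from a source-capable file to a target-capable file."""
    return source_role in SOURCE_ROLES and target_role in TARGET_ROLES


roles = {
    "glossary.html": "target-only",    # linked to, never from
    "draft.html": "source-only",       # not yet approved for public viewing
    "manual.html": "source and target",
}

# Every ordered pair of distinct files between which a link may be coded.
allowed = sorted(
    (src, dst)
    for src in roles
    for dst in roles
    if src != dst and may_link(roles[src], roles[dst])
)
```

No pair in the result points at draft.html, and none originates from glossary.html.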
     
     

    Precedence

     

    The intelligent agents built into LiveLink software use precedence rules to deliver a high level of intelligence in links and to arbitrate under ambiguous conditions. An ambiguous condition arises when the decision rules yield multiple solutions but only one solution is possible, or when the solutions need to be ordered by rank.
     
    The precedence list for a folder or file determines if hyperlinks will be made to a particular file, and which files will have "precedence" when the software codes hyperlinks. Hyperlinks will only be created to files included in the precedence list for a particular folder or file. Files higher up in the precedence list are preferred candidates for hyperlinks.
     
    Sample uses for precedence might include "link from bug report screens to specification sheets but do not link from specification sheets to bug reports" or "link from troubleshooting procedure to parts list, but not to introduction."
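One way to picture a precedence list is as an ordered list of permitted target folders per source folder. The following sketch uses that assumption; choose_target, the folder names and the data layout are hypothetical.

```python
def choose_target(source_folder, candidates, precedence):
    """Pick the preferred target for a link coded from source_folder.

    candidates: (folder, topic) pairs that all match the link phrase
    precedence: source folder -> ordered list of folders it may link to,
                most preferred first
    Returns None when no candidate is in the precedence list.
    """
    order = precedence.get(source_folder, [])
    ranked = [c for c in candidates if c[0] in order]
    if not ranked:
        return None
    return min(ranked, key=lambda c: order.index(c[0]))


precedence = {
    "troubleshooting": ["parts-list", "specifications"],  # never "introduction"
    "bug-reports": ["specifications"],
    "specifications": [],  # do not link from specifications to bug reports
}

candidates = [
    ("introduction", "Fuel pump"),
    ("specifications", "Fuel pump"),
    ("parts-list", "Fuel pump"),
]
```

For example, choose_target("troubleshooting", candidates, precedence) selects the parts-list topic, while a link from "specifications" to a bug report is never coded because its precedence list is empty.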
     
     

    Link Properties

     

    LiveLink software can control the type of link, as well as the link itself. In HTML this is accomplished by adding JavaScript or VB Script to the hyperlink. In XML this is accomplished by specifying XML link attributes.
     
    A link property might specify, for example, if the link should open a new instance of the browser or display the information in the current browser window.
     
    The link property is a special sort of meta data about the topic, specifying how the topic should be linked to. Like other meta data, link properties for specific topics can be assigned individually or can be inherited from their folder.
     
    At production, the link property is rendered using either XML attributes or JavaScript.
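Rendering a single link property in both output formats might look like the sketch below. The element name link and the exact attribute spelling in the XML branch are assumptions loosely modeled on the draft XML linking syntax, and the onclick handler is just one possible JavaScript rendering; neither is claimed to be what LiveLink software emits.

```python
from html import escape
from xml.sax.saxutils import quoteattr


def render_link(text, href, new_window, output):
    """Render one link, expressing the 'new window' property per output format."""
    if output == "xml":
        # Assumed attribute names; adjust to the linking syntax actually in use.
        show = "new" if new_window else "replace"
        return (
            f'<link xml:link="simple" href={quoteattr(href)} '
            f'show="{show}">{escape(text)}</link>'
        )
    if output == "html":
        if new_window:
            # window.open() opens the target in a new browser window;
            # returning false stops the current window from navigating too.
            onclick = "window.open(this.href); return false;"
            return f'<a href={quoteattr(href)} onclick="{onclick}">{escape(text)}</a>'
        return f'<a href={quoteattr(href)}>{escape(text)}</a>'
    raise ValueError(f"unsupported output format: {output}")
```

The same property value (new_window) drives both branches, which is the point of keeping link properties as meta data rather than hand-coding them into each link.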
     
     

    Handling Special XML Link Properties

     
    By specifying meta information for folders and files, and then setting rules for how the meta information will be interpreted, the user can automatically determine how the XML Link Properties will be coded.
     

    To illustrate the power of the LiveLink paradigm, we will consider how it would be applied for controlling three relatively straightforward aspects of XML linking. Of course, the paradigm is equally valid (and even more useful) for XML linking features that are more difficult to code, such as extended link groups.
     
     

    Show

     
    The value of the Show attribute controls what happens when the link is traversed. "New" displays the target of the link in a new context (typically a new browser window). "Replace" displays the target of the link in place of the current resource. "Embed" inserts the target of the link into the current resource.
     
    By specifying meta information, and then setting rules for how the meta information will be interpreted, the user can automatically determine the value of the Show attribute. For example, if you determine that particular topics or folders contain detailed table reference information, you can set meta information for those topics or folders to "table information". If you specify a rule that all table information should open in a new window when requested, then the LiveLink software codes the Show attribute to the value New for all links to tables.
     
     

    Actuate

     
    The value of the Actuate attribute controls when the link is traversed. "User" activates the link when the user requests the link (typically by clicking on it). "Auto" activates the link whenever the browser or application encounters the link in the current resource.
     
    As is true for the Show attribute, by specifying meta information, and then setting rules for how the meta information will be interpreted, you can automatically determine the value of the Actuate attribute. For example, if you determine that particular topics or folders contain important safety information, you can set meta information for the topics or folders to "safety notes". If you specify a rule that all links to safety notes should appear automatically, then the LiveLink software codes the Actuate attribute to the value Auto for all links to safety information.
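The rule-driven choice of Show and Actuate values can be sketched as a lookup from meta information labels to attribute values. The rule table reuses the two examples from the text; the default values and the function name link_attributes are assumptions.

```python
# Assumed defaults when no rule applies to the target.
DEFAULTS = {"show": "replace", "actuate": "user"}

# Rules keyed by the meta information label assigned to a topic or folder.
RULES = {
    "table information": {"show": "new"},  # tables open in a new window
    "safety notes": {"actuate": "auto"},   # safety notes appear automatically
}


def link_attributes(target_meta):
    """Derive link attribute values from the target's meta information labels."""
    attrs = dict(DEFAULTS)
    for label in target_meta:
        attrs.update(RULES.get(label, {}))
    return attrs
```

For instance, link_attributes(["safety notes"]) yields actuate "auto" with show left at its default, so changing one rule recodes every link to every safety note.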
     
     

    Link Quality

     
     

    Idiosyncratic Links

     
    We anticipate that there will be some links that cannot be coded automatically. This category of links is referred to as idiosyncratic links.
     
    The most common source of idiosyncratic links is in the input files themselves. These are links that are coded on the source files by the author or by a content expert.
     
     

    Avoiding Nonsensical Links

     

    Like the physician, LiveLink software says "do no harm." That is, avoid coding nonsensical links that distract the reader. Precedence, alias lists and stop lists are just some of the tools that the software puts at the disposal of the content expert to avoid nonsensical links. Our experience has been that careful definition of precedence rules, alias lists and stop lists typically enables total automation of linking.
     
     

    Handling Images

     
    The LiveLink paradigm supports the full range of media, in addition to straight text documents.
     

    To add links to images or other media files, simply add one or more identifying strings for each file to the expert level database. Whenever the string appears in any of the topics, the software automatically codes a link to the specified file. Whenever a new image or other media file is added to the database (or removed), the software automatically updates the links.
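Matching identifying strings against topic text can be sketched with a single regular expression built from the database entries. The function link_media, the sample strings and the file paths are hypothetical, and the sketch emits HTML anchors only.

```python
import re
from html import escape


def link_media(text, media_index):
    """Wrap each identifying string found in text with a link to its media file.

    media_index: identifying string -> path of the image or other media file
    Note: matching is case-insensitive and does not check word boundaries.
    """
    if not media_index:
        return escape(text)
    lookup = {key.lower(): path for key, path in media_index.items()}
    # Try longer strings first so a long phrase wins over a shorter one inside it.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys), re.IGNORECASE)

    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        href = escape(lookup[match.group(0).lower()], quote=True)
        parts.append(f'<a href="{href}">{escape(match.group(0))}</a>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


media = {"part 4471": "photos/4471.jpg"}
before = link_media("Replace part 4471 if worn.", media)

# Adding a media file to the index and re-running is all that is needed
# for the new links to appear.
media["priming procedure"] = "video/priming.mpg"
after = link_media("Run the priming procedure, then replace Part 4471.", media)
```

Because the links are regenerated from the index on every run, removing an entry makes the corresponding links disappear in the same way.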
     

    Sample applications include: (a) On-line technical documents that can be matched to illustrations stored as PDF documents; (b) mention of part numbers that can be matched to refer to photographs of the parts; or (c) descriptions of procedures that can be linked to video clips that walk through the procedure.
     
     

    The Difference Between Site Verification and Automatic Coding

     
    Verification of broken links is an important tool, and is always needed whenever idiosyncratic links are coded. However, site verification can only detect problems after they have been coded; once a problem is detected, the links must still be corrected manually. The LiveLink software avoids link problems by only coding valid links. Because link coding is fully automated, links can be regenerated whenever the documents change, ensuring that links remain valid over time.
     
     

    Using JavaScript and Down-converting to HTML

     
    For end-consumers of the documents who may still be using HTML browsers (or XML browsers that don't support the full range of XML linking attributes), the LiveLink software provides dual format generation options. That is, HTML or XML files can be provided as input. Special linking properties can be implemented using either XML link attributes or JavaScript, depending on the output format specified.
     
    Acknowledgments
      Special thanks to Danny Goodman, Simcha Stern, Katriel Reichman and all my other colleagues at LiveLink Systems. Their review of this article and our many late night discussions about Enterprise Publishing have been invaluable.
