
 
 

Using the DOM as an XML/HTML repository API


 
Jonathan   Robie
  Texcel Research, Inc.
3207 Gibson Road,
Durham,   North Carolina  27703
Email: jonathan@texcel.no
 
Biographical notice:
 
Jonathan Robie
 
Jonathan Robie works for Texcel Ventures as a Research Consultant. He represents Texcel on the Document Object Model Working Group, and is an alternate to the XML Working Group. Before joining Texcel he was the SGML Product Manager at POET Software, where he helped design an SGML repository. Mr. Robie has two years' experience with SGML repository design, seven years' experience with object-oriented databases, object-oriented design, and object-oriented languages, and a total of eleven years' post-graduate experience as a computer scientist. He has an MS in Computer Science from Michigan State University.
 
ABSTRACT:
 
Document repositories are databases that contain documents, and treat document data in a manner analogous to the way that relational databases treat relational data. They allow queries to be done using the logical structure of documents, maintain version histories of documents as they evolve, and provide tools for managing documents that are being changed by multiple editors and authors simultaneously. The Document Object Model is an emerging W3C standard that defines fine-grained primitives for building, managing, and modifying a document, but it is currently restricted to a single document. The programming interface for a repository allows a wide variety of tools to be written for creating, modifying, and managing documents and collections of documents. Currently, each vendor's programming interface is proprietary, and there is no way for XML programs to access repositories in a vendor-independent manner. The DOM is a natural candidate for such an interface for three reasons: it is expected to be familiar to script authors and programmers who work with other XML tools; it can represent directory structures using a document metaphor; and its fine-grained representation of documents allows repository operations to be performed symmetrically at any level of detail in a repository or document.
 
This article explores ways to use the DOM as the basis for a vendor-neutral document repository programming interface. It discusses the basic features which must be added to support multiple documents in collections; basic repository operations like checkin, checkout, import, and export; and queries - each of these features should be available at any level of document structure, from entire documents to individual elements. If an interface supports queries, the DOM can also be used as a way to walk result sets and manipulate their content. We end the article with a simple scenario to illustrate how this programming interface might be used in practice.
 
 

The Document Object Model: A Standard for Programming XML and HTML Documents

 
The W3C Document Object Model (DOM) is a programming interface for HTML and XML documents. It has not yet been released, and is currently available only as a working draft, but it promises to provide an open way for a variety of tools to manage documents. Originally, the DOM was intended to provide a consistent way to write scripts and programs for dynamic documents that will work in any web browser. Editor vendors and repository vendors quickly realized that the DOM could also meet some of their needs, and the DOM has evolved as a way to provide one standard programming interface that can be used in a wide variety of environments and applications.
 
The programming model for the DOM is based on the tree-like structures which make up the logical structure of HTML and XML documents; it defines the logical structure of documents and the way a document is accessed and manipulated. With the Document Object Model, programmers can create and build documents, walk their structure, and add, modify, or delete elements and content. Anything found in an HTML or XML document can be accessed, changed, deleted, or added using the Document Object Model.
 
In the Document Object Model, the object model itself closely resembles the structure of the documents it models. For instance, consider this table, taken from an HTML document:
 
<TABLE>
  <ROWS>
    <TR>
      <TD>Shady Grove</TD>
      <TD>Aeolian</TD>
    </TR>
    <TR>
      <TD>Over the River, Charlie</TD>
      <TD>Dorian</TD>
    </TR>
  </ROWS>
</TABLE>
 
The Document Object Model represents this table like this:
 
 
As a programming interface, the Document Object Model defines the code needed to create and build structures like the one pictured above, to walk their structure, and to modify them.
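
As a rough illustration of these three activities (building, walking, and modifying), the following sketch constructs the table above with Python's standard xml.dom.minidom module, which implements a subset of the W3C DOM. The helper names build_table and outline are our own and are not part of the DOM; this is one possible way to exercise the interface, not a reference implementation.

```python
from xml.dom.minidom import getDOMImplementation

def build_table(tunes):
    """Create a <TABLE> document with one <TR> per (title, mode) pair."""
    doc = getDOMImplementation().createDocument(None, "TABLE", None)
    rows = doc.documentElement.appendChild(doc.createElement("ROWS"))
    for title, mode in tunes:
        tr = rows.appendChild(doc.createElement("TR"))
        for text in (title, mode):
            td = tr.appendChild(doc.createElement("TD"))
            td.appendChild(doc.createTextNode(text))
    return doc

def outline(node, depth=0):
    """Walk the tree depth-first and return one indented line per node."""
    if node.nodeType == node.TEXT_NODE:
        return ["  " * depth + repr(node.data)]
    lines = ["  " * depth + node.nodeName]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

doc = build_table([("Shady Grove", "Aeolian"),
                   ("Over the River, Charlie", "Dorian")])
table = doc.documentElement
print("\n".join(outline(table)))

# Modify the structure: delete the second row, then change a cell's text.
rows = table.firstChild
rows.removeChild(rows.lastChild)
rows.firstChild.lastChild.firstChild.data = "Aeolian mode"
print(table.toxml())
```

The outline printed by the first call mirrors the nesting of the markup: TABLE contains ROWS, which contains the TR elements, each of which contains TD elements holding text nodes.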
 
In the Document Object Model, documents have a logical structure which is very much like a tree; to be more precise, it is like a 'forest' or 'grove' which can contain more than one tree. However, the Document Object Model does not require that documents be implemented as a tree or a grove, nor does it specify how the relationships among objects are implemented. In other words, the object model specifies the logical model for the programming interface, and this logical model may be implemented in any way that a particular implementation finds convenient. The DOM specification uses the term "structure model" to describe the tree-like representation of a document.
 
 

The DOM as a Document Repository API

 
Document repositories are databases that contain documents. They allow queries to be done using the logical structure of documents, maintain version histories of documents as they evolve, and provide tools for managing documents that are being changed by multiple editors and authors simultaneously.
 
SGML and XML repositories often have a programming model that is based on the tree-like structures of the documents they contain, and their programming interfaces may be quite similar to the DOM; furthermore, the manner in which a document is represented may also be quite similar to the DOM representation. This means that a document repository is quite similar to a relational database in some ways, but instead of containing sets of two dimensional tables, a repository contains documents in a logical representation much like that of the DOM.
 
The programming interface for a repository allows a wide variety of tools to be written for creating, modifying, and managing documents. Repository vendors use these interfaces to develop their own tools, such as document and repository browsers, web-based viewers, document assembly tools, and workflow management. In-house applications may use these interfaces to create documents from data that is available to their programs, to extract data from documents, or to present various views of a document to a user. Tool vendors can use these interfaces for applications that modify documents at run time, e.g. interactive electronic technical manuals (IETMs) that adjust their content based on the background of the user or the current configuration of a system, online catalogs that display only those items that are currently available or only those items relevant to a particular user, or business document management systems that present only the data that a particular user is authorized to see.
 
One of the most important reasons for a programming interface is to allow tight integration with the tools users already know: creating and managing documents with SGML, XML, or HTML editors, viewing them in browsers, and managing the process by which they are maintained with workflow management software. Currently, each SGML, XML, or HTML vendor's repository, editor, browser, and workflow management system has its own programming interface. This means that the effort required to support a number of different tools can be enormous; e.g., if an editor vendor wants to support several repositories, a different programming interface must be used for each repository; similarly, if a repository vendor wants to support several editors, a different programming interface must be used for each editor. Moreover, most of these proprietary interfaces are not designed for use on the Internet; the DOM provides an elegant, high-level interface which can be called by Java or JavaScript programs to manipulate documents in a variety of environments, including the Internet.
 
 

Extending the DOM for repositories

 
The way that documents are represented in the DOM is very useful for repositories, but the DOM currently models a single document, not multiple documents. It provides no direct support for managing collections of documents, searching across them, discovering what collections exist, or versioning. This section explores the most important extensions needed to support repositories.
 
 

Operations on documents and document sub-trees

 
The Document Object Model defines fine-grained primitives which can be used to build, manage, and modify documents, but it does not provide operations that manage a document or a document sub-tree as a whole. Although the DOM provides the primitives necessary to implement such operations, writing scripts to perform these common operations would be reasonably complex, and in many cases a repository may have other ways that provide more efficient implementations. Some of the most common operations in document repositories are:
  • Queries. Applies a query to a repository, a set of documents, a single document, or a document sub-tree.
  • Import. Reads an entire document or document sub-tree into a repository.
  • Export. Writes an ASCII representation of an entire document or document sub-tree. (Note: it is also possible to export into non-ASCII representations, but this is less common.)
  • Check-out. Creates an editable copy of an entire document or document sub-tree, locking the original in the database to ensure that only one copy is being edited at a given time.
  • Check-in. Updates an entire document or document sub-tree to incorporate the changes made in a checked-out copy, creates a new version with these changes while still maintaining the old version, and releases the locks made by check-out so that further edits are possible.
    In a repository API, these operations should be available as primitives for two reasons: ease of use and efficiency. It is much easier for a script writer or programmer to call a function to import a document than to write all the code needed to write the document to the repository using fine-grained operations. It is also easier for the repository to implement these operations efficiently if it can use its proprietary interfaces to advantage; if the programmer builds the documents one step at a time, the repository is forced to follow the individual steps that the programmer chooses to import the document, which may not be an efficient way to import for a given repository.
     
    To allow fine-grained management, it is equally important to be able to perform these operations at any level of a document hierarchy; e.g. to check out one table from a document, modify it, and check it back in without checking out the entire document. This allows many authors or programs to modify the same document at the same time, working in different portions of the document. It also allows programs to be more efficient, since they need to process only the part of a document which actually must be modified. In the DOM representation, this is easily done by letting operations affect subtrees of documents, which is the most important reason for using a DOM representation in repositories.
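
The sub-tree operations described above can be sketched with a toy in-memory repository built on Python's xml.dom.minidom. Everything in this sketch is hypothetical: the Repository class, its method names, the LockedError exception, and the locking rule (a sub-tree cannot be checked out if it overlaps one that is already checked out) are assumptions made for illustration, not part of the DOM or of any vendor's interface.

```python
from xml.dom.minidom import parseString

class LockedError(Exception):
    """Raised when a requested sub-tree overlaps one that is checked out."""

def _contains(outer, inner):
    """True if `inner` is `outer` or one of its descendants."""
    while inner is not None:
        if inner is outer:
            return True
        inner = inner.parentNode
    return False

class Repository:
    """Toy repository; all names and behavior here are hypothetical."""

    def __init__(self):
        self.documents = []  # imported DOM documents
        self._locks = {}     # checked-out copy -> locked original node
        self.history = []    # serialized previous versions, oldest first

    def import_document(self, text):
        """Import: read a whole document into the repository."""
        doc = parseString(text)
        self.documents.append(doc)
        return doc

    def export(self, node):
        """Export: serialize a document or any sub-tree as text."""
        return node.toxml()

    def check_out(self, node):
        """Check-out: lock a sub-tree and return an editable deep copy."""
        for locked in self._locks.values():
            if _contains(locked, node) or _contains(node, locked):
                raise LockedError(node.nodeName)
        copy = node.cloneNode(True)
        self._locks[copy] = node
        return copy

    def check_in(self, copy):
        """Check-in: keep the old version, install the copy, release the lock."""
        original = self._locks.pop(copy)
        self.history.append(original.toxml())
        original.parentNode.replaceChild(copy, original)
        return copy

repo = Repository()
doc = repo.import_document(
    "<DOC><TITLE>Tunes</TITLE>"
    "<TABLE><TR><TD>Shady Grove</TD></TR></TABLE></DOC>")

# One author checks out only the table and edits the copy.
table_copy = repo.check_out(doc.getElementsByTagName("TABLE")[0])
table_copy.getElementsByTagName("TD")[0].firstChild.data = "Shady Grove (Aeolian)"

# The title lies outside the locked sub-tree, so it can be checked out too.
title_copy = repo.check_out(doc.getElementsByTagName("TITLE")[0])

repo.check_in(table_copy)
print(repo.export(doc.documentElement))
```

Because locks are attached to sub-trees rather than to whole documents, the two check-outs in the usage above do not conflict, while a second request for the table, or for a cell inside it, would be refused until the table is checked back in.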
     
     

    Managing multiple documents

     
    Document repositories must manage multiple documents, but the Document Object Model represents only a single document. Therefore, a repository API must extend the Document Object Model to make it possible to find documents, share document data among documents, and structure the relationships among documents. Some key issues for multiple document management are:
  • Queries. Document data can be queried just like database data; queries should be able to extend across all documents in a repository.
  • Organizing and finding documents. Before a program can process a document, it must first locate the document. Just as an operating system organizes files onto separate hard disks with directories and subdirectories, it is helpful to be able to organize documents in separate repositories with a hierarchical directory structure. This directory structure can easily be modeled using XML semantics for named elements and containers; this approach allows directories to be manipulated and explored just like any DOM document. File systems generally display information that identifies files, including the name, size, and date created. Similarly, a repository should be able to display data for the documents it contains, and allow searches to find documents based on this data. To make documents or document sub-trees easier to browse and search for, it is helpful to be able to associate descriptive data with them; e.g. a document might have data describing the author, the date created, the date last modified, the organization in which it was created, etc. In an environment where fine-grained control over document data is desirable, this same level of detail may be useful for document sub-trees. There must also be some way of finding the repositories that are accessible in a given environment. Repositories may be in different physical locations, or use software provided by different vendors. A registry of repositories makes it possible to find available repositories. When new documents are created, they are registered in the registry, together with the address of the repository that contains them.
  • Versioning. One important function of a repository is to maintain versions of a document as it is developed. The DOM has no concept of versions, so any management of versions must be done through extensions. It is possible to design a DOM-based repository API with extensions to set versions globally, and have the repository always present the selected version of a document. Fine-grained comparisons of versions are more difficult; one possible solution is to present different versions as different documents, and compare them as such, but that requires significantly more work on the part of the application. In general, a repository API should extend DOM nodes to allow versions to be explicitly represented.
  • Sharing document data. One of the major advantages of a document repository is the ability to intelligently manage shared document data; e.g., to decide whether a change to the copyright notice should automatically be extended to all new printings of existing documents, or only to new documents. Since the DOM does not have a way to explicitly represent relationships among documents, it must be extended to allow this kind of sharing. The same issues are equally relevant to DTDs.
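
To make the directory idea concrete, the sketch below represents a small directory structure as a DOM document and searches it by descriptive data, again using Python's xml.dom.minidom. The DIRECTORY and DOCUMENT element names follow the scenario later in this article; the attribute names (name, author, organization), the sample entries, and the helpers path_of and find_documents are assumptions made purely for illustration.

```python
from xml.dom.minidom import parseString

# A made-up directory structure, expressed as an ordinary XML document.
CATALOG = parseString(
    '<DIRECTORY name="root">'
    '<DIRECTORY name="papers">'
    '<DOCUMENT name="dom-repository" author="Robie" organization="Texcel"/>'
    '<DOCUMENT name="style-guide" author="Example" organization="Texcel"/>'
    '</DIRECTORY>'
    '<DIRECTORY name="manuals">'
    '<DOCUMENT name="install" author="Example" organization="Other"/>'
    '</DIRECTORY>'
    '</DIRECTORY>'
)

def path_of(node):
    """Return the slash-separated path of a DIRECTORY or DOCUMENT element."""
    parts = []
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        parts.append(node.getAttribute("name"))
        node = node.parentNode
    return "/".join(reversed(parts))

def find_documents(root, **metadata):
    """Return paths of DOCUMENT entries whose attributes match all criteria."""
    return [path_of(entry)
            for entry in root.getElementsByTagName("DOCUMENT")
            if all(entry.getAttribute(key) == value
                   for key, value in metadata.items())]

print(find_documents(CATALOG, organization="Texcel"))
print(find_documents(CATALOG, author="Example", organization="Other"))
```

Since the directory is itself a DOM document, a tool that can browse or edit documents needs no additional machinery to browse or reorganize directories.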
     

    Queries and query results

     
    In a document repository, XML documents can be queried to return the portions of documents that fulfill certain logical conditions. For instance, the following query looks for authors whose surname begins with "R":
     
    author/surname="R*"
     
    Queries are declarative: a query does not specify how to walk the tree to find the results; it merely specifies what it is looking for. Queries can specify relationships among elements (e.g., that surname is a child of author), but they otherwise ignore the structure of the document as a whole. A query for a document specifies:
    • A path that connects one or more related nodes in a document.
    • Conditions for the nodes.
    • Which items from the search should be returned in the query results.
    For instance, in the above example, the path includes the elements <author> and <surname>. One condition is specified: surname must match the pattern "R*", i.e. it must begin with "R". In this particular query language, the element on the rightmost side is returned unless otherwise specified, so the query would return the surname of all authors whose surname begins with "R". The return element can occur at any level of the hierarchy, so it can be the root node of a complex document structure.
     
    The DOM is navigational, designed for walking the document itself. In theory, the DOM could be used to implement queries, but queries implemented on top of the DOM would not perform well because the DOM does not have the indexes and other specialized data structures that databases use to optimize queries. An efficient query engine tries to avoid walking the document whenever there is a faster way to get at the data, and the data structures and methods it uses are not defined by the DOM. The results of a query, however, are documents and document sub-trees which can easily be represented by the DOM. This makes it possible to use the same code to manage query results with the full power available for managing documents.
     
    To extend the DOM for queries, it need only be given a method that accepts a query specification and returns a set of results as DOM nodes. The standard DOM interfaces can then be used to explore the result set just like any other document data.
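
A minimal sketch of such a method follows, written against Python's xml.dom.minidom. The function name query, its reading of the query syntax (a path of child steps whose first step may match at any depth below the context node, followed by an optional ="pattern" test on the text of the rightmost element, with * as a wildcard), and the sample document are all assumptions for illustration. The sketch deliberately walks the tree, which is exactly what an efficient repository would avoid; the point is only that the results come back as ordinary DOM nodes.

```python
from fnmatch import fnmatchcase
from xml.dom.minidom import parseString

def text_of(element):
    """Concatenate the text nodes directly inside an element."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)

def child_elements(node, name):
    """Return the child elements of `node` with the given tag name."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE and child.tagName == name]

def query(context, spec):
    """Evaluate a path query and return the matching DOM elements."""
    path, _, pattern = spec.partition("=")
    steps = path.split("/")
    # The first step may match at any depth below the context node.
    matches = list(context.getElementsByTagName(steps[0]))
    # Each later step must be a direct child of the previous match.
    for step in steps[1:]:
        matches = [child for match in matches
                   for child in child_elements(match, step)]
    if pattern:
        pattern = pattern.strip('"')
        matches = [m for m in matches if fnmatchcase(text_of(m), pattern)]
    return matches

doc = parseString(
    "<paper>"
    "<author><surname>Robie</surname></author>"
    "<author><surname>Nicol</surname></author>"
    "<editor><surname>Roe</surname></editor>"
    "</paper>")

results = query(doc, 'author/surname="R*"')
for surname in results:
    # Each result is a DOM node, so the usual navigation is available.
    print(surname.parentNode.tagName, text_of(surname))
```

The editor's surname also begins with "R", but it is not returned because the path requires surname to be a child of author.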
     
     

    An example: editing a document in a repository

     
    In this section, we will illustrate the concepts we have discussed with an example of how this kind of interface might be used. If a document repository provides a programming interface based on the DOM, with the extensions discussed above, then an XML editor could use this interface to access documents in the repository, allowing them to be edited. The DOM itself is used for accessing individual documents and document sub-trees, and it is also used to represent the directory structure within a repository. Many other features will have to be provided by the extensions we have mentioned. Let's assume that the person using the XML editor wants to search for an article, check out some document data, edit it, and check it back in. Here are the steps which the user would take, and the corresponding interaction between the editor and the repository's programming API:
    • Open a repository. If the repository programming interface has a registry of the repositories that are available on a given network, then the editor can ask for this list and present it to the user. This corresponds roughly to a file system explorer showing the available disk drives for a computer on a network, and gives the user a starting point for exploring the contents of repositories.
    • Show the directory structure. We want the user to be able to browse the directory structure using an XML navigation tool that looks like the file system explorers provided with graphical operating systems. The repository programming interface can present its hierarchical directory structures using the DOM as though they were documents consisting of <DIRECTORY> and <DOCUMENT> elements; a directory may contain either other directories or documents.
    • Perform query. The repository programming interface must allow us to perform a query on a specific repository, a document, or a set of documents. In this case, we will use this interface to perform a query on the entire repository in order to find the abstracts for papers written by the current author using the GCA DTD. This corresponds to <ABSTRACT> elements within documents whose root tag is <GCAPAPER> for which an <AUTHOR> element contains the first name "Jonathan" and the last name "Robie".
    • Browse the results of the query. The DOM provides list iterators which can be used to walk lists; in a client/server environment, a query may return an iterator for a set of nodes that hold the query results.
    • Select an element and check it out. Suppose the user wants to edit an abstract and place the new version of the abstract in the repository. The repository programming interface can be used to check out any element and its children, locking them in the repository to prevent simultaneous changes by multiple users unaware of each other's actions.
    • Check the element back in. The repository interface maintains the correspondence between the nodes that are given to the client and their representations in the repository itself (the exact manner in which this is done is vendor-dependent, and not specified in the repository programming interface).
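
The query and browsing steps of this scenario might look roughly like the following sketch, which uses Python's xml.dom.minidom and a generator in the role of the result iterator. The element names FNAME and SURNAME inside AUTHOR are placeholders chosen for the example rather than names taken from the GCA DTD, and the function abstracts_by and the three sample documents are likewise hypothetical.

```python
from xml.dom.minidom import parseString

def text_of(node):
    """Return all text below a node, concatenated in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)

def has_author(doc, first, last):
    """True if some AUTHOR element carries both the given names."""
    for author in doc.getElementsByTagName("AUTHOR"):
        first_names = [text_of(n) for n in author.getElementsByTagName("FNAME")]
        surnames = [text_of(n) for n in author.getElementsByTagName("SURNAME")]
        if first in first_names and last in surnames:
            return True
    return False

def abstracts_by(documents, first, last):
    """Lazily yield ABSTRACT elements of GCAPAPER documents by the author."""
    for doc in documents:
        if doc.documentElement.tagName != "GCAPAPER":
            continue
        if has_author(doc, first, last):
            yield from doc.getElementsByTagName("ABSTRACT")

# A stand-in for the repository contents (made-up sample documents).
repository = [parseString(text) for text in (
    "<GCAPAPER><AUTHOR><FNAME>Jonathan</FNAME><SURNAME>Robie</SURNAME></AUTHOR>"
    "<ABSTRACT>Using the DOM as a repository API.</ABSTRACT></GCAPAPER>",
    "<GCAPAPER><AUTHOR><FNAME>Jane</FNAME><SURNAME>Doe</SURNAME></AUTHOR>"
    "<ABSTRACT>Another paper.</ABSTRACT></GCAPAPER>",
    "<MEMO><AUTHOR><FNAME>Jonathan</FNAME><SURNAME>Robie</SURNAME></AUTHOR>"
    "<ABSTRACT>Not a GCA paper.</ABSTRACT></MEMO>",
)]

# Browse the results one at a time, as an editor might present them.
for abstract in abstracts_by(repository, "Jonathan", "Robie"):
    print(text_of(abstract))
```

Each result is a DOM element that still belongs to its source document, so an editor could pass it straight to a check-out operation of the kind described earlier.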
     
    As more and more critical data is placed in XML and XML repositories, it would be extremely useful to have vendor-independent standards for performing this kind of everyday operation. The DOM currently does not provide any support for multiple documents or repository operations, but it does provide a solid basis for a repository programming interface that does. The DOM is expected to be a programming interface familiar to script authors and programmers who work with other XML tools; it can be used to represent directory structures using a document metaphor; and its fine-grained representation of documents allows repository operations to be performed symmetrically at any level of detail in a repository or document.
     
    Acknowledgments
      The author would like to thank Mike Champion of ArborText for the ideas he contributed to this paper and for his feedback. He would also like to thank Gavin Nicol of INSO for feedback on related issues.
