I've Got an SGML Database - Why Do I Need HyTime?

Highland Consulting Carla Corkern
John Rice
Highland Consulting
ISOGEN INTERNATIONAL

August 20, 1996

Introduction: SGML Repositories

One of the fundamental precepts upon which SGML was designed was that of data reuse, or as Yuri Rubinsky used to say, "Never type it more than once." For years, we have been selling data reuse as the driving business problem SGML promised to solve. However, until recently a lack of appropriate tools and even a lack of conceptual strategy left SGML implementations to languish in traditional monolithic, document-centric quagmires.

SGML alone presents a number of well-known problems. Most of us who have implemented SGML solutions realize that the data management problems of our old systems are magnified by SGML. A writer who in the past was unable to find the most recent version of a field service manual now certainly cannot find the 1000 procedures and 500-item parts table that make up the new manual. When you begin to address multiple "output products" and component reuse, the picture gets even darker. While on a conceptual level the reuse of SGML components is sexy, in practice it often becomes just plain scary. SGML simply lacks the addressing mechanisms necessary to describe and manage relationships across a set of documents and, without architectures, the means to define common semantics and constraints across families of documents. The SGML DBMS was expected to fulfill the potential of SGML, offering added dimensions of true data reuse, collaborative authoring, version management, and robust query and navigational capabilities.

So now that a number of SGML database management systems have become available, it seems reasonable that one might ask, "What can HyTime do for me that these tools cannot?"

We may be tempted to dismiss this question like the one we still sadly hear all too often, "Why do I need SGML? All my documents are in Microsoft Word now and RTF is a standard, right?" And indeed, when we contacted several SGML database vendors for this discussion, their responses were neither positive nor enlightened. We find this an extremely troubling, although not uncommon, phenomenon. With this in mind, we have begun to think quite a lot about how current SGML database systems could support at least a subset of the objectives of HyTime.

If you have not yet chosen a database system, HyTime can serve as a useful modelling tool for designing your documentation linking and classification needs. Defining a set of architectures in HyTime can help greatly in understanding and implementing an eventual database system. HyTime allows you to create a specification of how systems must function before you even begin to shop for an SGML database solution. Having a definite model of your information management needs is an important first step in any technology decision.

This paper will examine three principal data management strategies currently available and how they might employ HyTime to add value to their products. Drawing back from the cutting edge of hypothesis for the moment, here is an examination of some duller but true cost-driven business issues.

"What color would you like your database?"
- Dilbert

There are three predominant types of SGML managers: BLOB (Binary Large OBject) Managers, Entity Managers, and Element Managers. Not surprisingly, people only seek to manage information at the level at which they use or understand it. For example, I store water in my refrigerator; that is all of the information I need. However, if I were a chemist, I might want to isolate a molecule of the water for study; and if I were a nuclear physicist, I might want to split that molecule into its subatomic structure. From the perspective of granular data management, each of these needs is valid but very different. Likewise, each of these SGML data management strategies has its own advantages and pitfalls, and perhaps each has its rightful place. At Eliot Kimber's recommendation, we extend the metaphor to say that "some see water as a collection of atoms, others as clumps of molecules, others as a sea of subatomic particles. Choosing one view may preclude the use of others. For example, storing water as a plasma of protons, neutrons, and electrons destroys its original molecular structure, preventing for example, the separation of the water into oxygen and hydrogen. In the same way, reducing SGML documents into a sea of elements may destroy their original storage organization."

BLOB Managers

The BLOB Manager is often built upon a relational database and may be described as relatively SGML-unaware. Information is stored in a file-based environment, within logical units such as Document, Chapter, or Section. Between the application layer and the RDBMS is an object/relational mapping layer, usually built as a system of classes and groups. These relational mappings are based on storage structures and usually have limited ability to handle semantic relationships. This "lowest common denominator management" limits any dynamic reuse of data to the BLOB level. The BLOB Manager, then, is predisposed to a more traditional publishing focus, one built on structure. And indeed, most players in this market niche are Document Management Systems which handle all sorts of data. If you ask your Document Manager vendor "Do you manage SGML?" and they say "Sure, it's just another data type!", the tool is probably a BLOB Manager. On the other hand, BLOB Managers are relatively inexpensive and offer strong non-SGML query capabilities, version control, and workflow integration. Chances are, if you work for a big company, you already have a corporate standard Data Management System and may be hard pressed to replace it with an SGML mission-critical solution. Because BLOB Managers keep track of large data components, they are an excellent vehicle for storing, managing, and delivering complete documents with somewhat fewer headaches than more granular managers.

Given the limited data-awareness of the BLOB Manager, how might HyTime enhance its capabilities?

1) Since BLOB Managers do usually organize data into database classes based on the type of BLOB being stored, these classes might be mapped to HyTime architectural forms, thus providing proprietary-free classification information.

2) Because BLOB Managers manage just about everything, you can use one database to track links (in a limited form) across data objects. If your BLOB Manager tracks that the file PROCEDURE1 is related to GRAPHIC1, you can express some semantic linking information.

3) Additionally, document property information which is usually stored in a proprietary way in the database might be expressed as an SGML instance, providing relevant tracking information about the document and thereby allowing automated construction of books by providing object component relationships (perhaps even version information.)
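Point 3 can be pictured with a minimal sketch, assuming the BLOB Manager can export each property record as a simple dictionary. The element names (docprops, title, version, doctype, related) and the record layout are hypothetical, invented for illustration, and a real application would define them in its own DTD. Markup characters are written as numeric character references so that no entity declarations are needed.

```python
def sgml_text(value):
    """Replace markup-significant characters with numeric character references."""
    return (str(value)
            .replace("&", "&#38;")
            .replace("<", "&#60;")
            .replace('"', "&#34;"))


def properties_to_sgml(record):
    """Render one property record (hypothetical layout) as a small SGML instance."""
    lines = ['<docprops id="%s">' % sgml_text(record["id"])]
    for name in ("title", "version", "doctype"):
        lines.append("  <%s>%s</%s>" % (name, sgml_text(record[name]), name))
    # Relationships the database already tracks, e.g. PROCEDURE1 -> GRAPHIC1.
    # "related" is assumed to be declared EMPTY in the hypothetical DTD.
    for target in record.get("related", []):
        lines.append('  <related target="%s">' % sgml_text(target))
    lines.append("</docprops>")
    return "\n".join(lines)


record = {
    "id": "PROCEDURE1",
    "title": "Pump & Valve Inspection",
    "version": 3,
    "doctype": "procedure",
    "related": ["GRAPHIC1"],
}
print(properties_to_sgml(record))
```

Once the properties live in an instance like this, they travel with the data instead of staying locked in the database's own tables.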

Entity Managers

Entity Managers may be built upon object-relational or pure object databases. Unlike BLOB Managers, Entity Managers are SGML-aware, allow for collaborative authoring, and support referencing of semantic data structures. Entity Managers also support the development of modular DTDs (for reuse of common objects and less cumbersome DTD maintenance), and they allow for the tracking of parsed marked section (versioned) data within the context of one object. Entity Managers, then, are more portable and efficient and implement a higher level of SGML than do BLOB Managers.

However, Entity Managers also require a significantly greater effort to implement. Especially time-consuming and rigorous is the up-front analysis, since it is not only a document analysis but also a workflow analysis. Workflow analysis allows for the construction of authoring DTD subsets and for DTD modularization, but this can get messy. As Steven Newcomb has pointed out in SGML ARCHITECTURES: Implications and Opportunities for Industry, "Wherever a DTD fragment is inserted into a DTD, it is inserted verbatim. Even if parameter entities are not used, or if they are used in complex ways (such as having the inserted text of a parameter entity contain a reference to a previously-defined parameter entity), the insertion of DTD fragments can result in the propagation of unnecessary and unnatural constraints on the structure of documents, or, alternatively, less structural constraint than is desired by the architect, and then can be usefully validated by an SGML parser or SGML database engine. Moreover, the impact of a change in a parameter entity on any given DTD can be surprising and confusing to everyone but a computer." Equally important to consider is that it is quite possible to develop an application so granular that it is actually less time-consuming for your authors to recreate data than to collect it!

Entity Managers have a significant advantage over the other two types of solutions in that the process of entity definition and management is well defined and well understood by vendors. In fact, most SGML-aware tools can understand and support entity management, so tool support and integration with an Entity Manager should be relatively easy. For example, a set of defined entities expressed on the file system in SGML Open catalog format could serve the database by defining a finite set of structures to be managed and a predefinition of the rule sets. This one catalog file could also serve the needs of authoring and composition tools, allowing a true plug-and-play architecture.
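To make the catalog idea concrete, here is a minimal sketch of how a database loader might read the PUBLIC entries of such a catalog to learn which entities it is expected to manage. It is deliberately incomplete: it recognizes only uppercase PUBLIC entries with a double-quoted public identifier, and it ignores comments and every other entry type. The function name and the sample identifiers are hypothetical.

```python
import re

# Matches entries such as:
#   PUBLIC "-//Example//DTD Manual//EN" "manual.dtd"
#   PUBLIC "-//Example//ENTITIES Parts//EN" parts.ent
_PUBLIC_ENTRY = re.compile(r'\bPUBLIC\s+"([^"]*)"\s+(?:"([^"]*)"|(\S+))')


def read_public_entries(catalog_text):
    """Return {public identifier: system identifier} for PUBLIC entries only."""
    entries = {}
    for match in _PUBLIC_ENTRY.finditer(catalog_text):
        public_id = match.group(1)
        # The system identifier may be quoted (group 2) or bare (group 3).
        system_id = match.group(2) if match.group(2) is not None else match.group(3)
        entries[public_id] = system_id
    return entries


sample_catalog = '''
PUBLIC "-//Example//DTD Manual//EN" "manual.dtd"
PUBLIC "-//Example//ENTITIES Parts//EN" parts.ent
'''

for public_id, system_id in read_public_entries(sample_catalog).items():
    print(public_id, "->", system_id)
```

The same mapping could be handed unchanged to an authoring or composition tool, which is the plug-and-play property described above.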

With a high level of SGML-awareness, Entity Managers could benefit significantly from HyTime incorporation.

1) As with BLOB Managers, mapping database storage structures to architectural forms would allow tracking of the architectures that are associated with a particular set of document components.

2) Furthermore, because of the Entity Manager's usual awareness of each object's internal structure, it would be possible to report internal components for standardized cross-document and semantic linking. By locating embedded addressed objects within a managed entity, you have taken the first step toward the resolution and management of object addressing.

3) An Entity Manager enables the creation of a more sophisticated link and address management scheme which would also allow for the creation of needs-specific navigational Hub Documents. Specifically, by associating elements with data locators, linking schemes might be generated and maintained external to the documents through the use of ilinks. The link arrangements could then be revised as needed, perhaps even dynamically generated through the use of a query system, without affecting the documents.
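As a rough sketch of point 3, suppose the link arrangements are held as plain records outside the documents and a hub document is regenerated from them on demand. The record layout, the hubdoc wrapper element, and the locator IDs below are hypothetical. Only the linkends attribute is taken from the HyTime ilink form, and its exact use should be checked against the standard.

```python
def build_hub(links, wanted_type=None):
    """Build hub-document text from link records kept outside the documents.

    links: list of (link_type, [locator IDs]) pairs (hypothetical layout).
    wanted_type: if given, only links of that type are included, which stands
    in for generating the hub from a database query.
    """
    lines = ["<hubdoc>"]
    for link_type, locator_ids in links:
        if wanted_type is not None and link_type != wanted_type:
            continue
        # Each ilink names its anchors by referring to locator IDs.
        lines.append('  <ilink linkends="%s">' % " ".join(locator_ids))
    lines.append("</hubdoc>")
    return "\n".join(lines)


links = [
    ("illustrates", ["loc-proc1", "loc-graphic1"]),
    ("supersedes", ["loc-proc2", "loc-proc1"]),
]

# A hub restricted to one kind of relationship; the documents are untouched.
print(build_hub(links, wanted_type="illustrates"))
```

Revising the link table and calling the function again yields a new hub document, which is the sense in which the link arrangements can change without affecting the documents.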

Element Managers

You may argue that Entity Managers and Element Managers are the same thing, and depending on implementation, they could well be used in the same ways. However, Entity Managers usually suggest a predefinition of a Minimum Revisable Unit (MRU) of your data. Element Managers allow reuse of "anything that's tagged." Element Managers are SGML-aware, allow for collaborative authoring, and support reuse of semantic data structures. Element Managers also support the development of modular DTDs (for reuse of common objects and less cumbersome DTD maintenance), and they allow for storage of versioned data within the context of one object. In fact, some Element Managers even have an SGML-diffing feature that allows you to track changes to the text inside an element. Element Managers implement a higher level of SGML than do Entity Managers, but they may magnify some of the problems of Entity Managers.

While Entity Managers require a significant implementation effort, Element Managers may take that design to a ridiculous level. Most Element Managers require you to build your DTD to share not only high-level components but indeed each content model if you plan to share elements across several DTDs. Because most Element Managers implement a schema by loading a DTD, the design of the DTD is very important. Designing one mother of all DTDs with multiple "threads" is the design goal of implementing this type of system. And the warning bears repeating that it is quite possible to develop an application so granular that it is actually less time-consuming for your authors to recreate data than to collect it! When we focus on the business issues of implementing a system, we must remember that information is only as granular as your authors can conceive it to be.

With a very high level of SGML-awareness, Element Managers could benefit significantly from HyTime incorporation.

1) As with Entity Managers, mapping structures to architectural forms would allow for a much finer level of semantic distinction between data, as well as make possible much more robust linking abilities. Also, Element Managers usually spin off a subset DTD for each authoring request. How much more difficult would it be to create a set of architectures and ask for semantic equivalents at run-time? As most of these systems are object or object-relational based, this is not a technically difficult thing for the database to support.

2) Element Managers track exhaustive hierarchy information and usually attach an object identifier to each element, so that tracking element-to-element relationships is possible. How difficult would it be to report this relationship as a HyTime construct? It could be a simple mapping of a set of object IDs to a set of namelocs, with each nameloc ID mapping to an object ID and the namelocs then using whatever addressing mechanisms are necessary to address the actual elements. Using this approach, you might even trick an Entity Manager into behaving more like an Element Manager.
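The mapping described in point 2 might look something like the following sketch. It assumes the database can report, for each object, its object ID, the entity name of the document that contains it, and the SGML ID of the element; that three-part record is an assumption made for illustration, not a feature of any particular product. The nameloc and nmlist markup follows our reading of the HyTime named location address form and should be verified against the standard before use.

```python
def namelocs_for(objects):
    """Map database object IDs to HyTime-style nameloc elements.

    objects: iterable of (object_id, document_entity, element_id) tuples,
    a hypothetical export format from an element manager.
    """
    blocks = []
    for object_id, document_entity, element_id in objects:
        # SGML IDs must begin with a name start character, so numeric
        # object IDs are given an "oid" prefix.
        blocks.append(
            '<nameloc id="oid%s">\n'
            '  <nmlist nametype="element" docorsub="%s">%s</nmlist>\n'
            '</nameloc>' % (object_id, document_entity, element_id)
        )
    return "\n".join(blocks)


exported = [
    (1047, "manual", "proc1"),
    (1048, "manual", "graphic1"),
]
print(namelocs_for(exported))
```

Links can then refer to the nameloc IDs rather than to database-internal identifiers, so the relationships survive a move to another system.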

3) This even more granular level of inter- and intra-object management would also allow for the optimization of needs-specific navigational Hub Documents. Specifically, by associating elements with data locators, linking schemes might be generated and maintained external to the documents through the use of ilinks. The link arrangements could then be revised as needed, perhaps even dynamically generated through the use of a query system, without affecting the documents.

4) The object databases that are the core technology of Element Managers also theoretically provide natural support for the management of Formal System Identifiers. As presented by Alex Milowski at SGML95, in A Theory of Documents, embedding an object reference inside another object and having that embedded object use a Formal System Identifier to dynamically generate SGML data for incorporation in a document is a very powerful feature. Since Formal System Identifiers can reference any storage manager, this allows SGML to work with many other media.

The Nuts and Bolts of Things

As was mentioned above, it might be fair to say that each of these systems has its rightful place. If we view publication as a bottom-to-top process, there is a need to create data, manage objects, and deliver or maintain collections of information. You need the functions inherent in all of these systems. You need the ability to manage SDATA and the library management services of a BLOB system to manage your finished work products (hopefully someone in your organization will just let you keep your data in their system so you can spend your cash on a truly SGML-aware system!). You need the ability to define authoring and processing MRUs for tools implementation. You need the granularity of an Element Manager if you really expect to never retype anything. Some of these tools claim to be able to do all of these jobs equally well, but none is currently configured to perform all three functions.

Summation

What does HyTime buy me that an SGML database does not? Well, for one thing, if you believe in SGML for all of the reasons generally espoused, you know that locking up your data in a vendor-controlled environment is not a good long-term strategy. If you don't believe that, then maybe you shouldn't be doing SGML at all! By taking your vendor-neutral SGML data and building value into a management system, you are falling into the same trap you just got out of!

Support of HyTime by your database vendor means that you can create and maintain architectural, linking and composite information in a vendor-neutral way. If you are working on a document set that must be delivered to someone else - how are you going to communicate all of the value in your data layer?

None of today's leading SGML database systems provides the true ability to manage and create hyperdocuments, because they all lack the general addressing and linking functions needed to enable more robust and general representations of things like workflows and change tracking. We suspect that this reflects a lack of understanding on the part of the vendors as to the true needs of the documentation community and a failure to exercise some of the features of the databases on which their products are built. We have often bemoaned the fact that most people trying to solve the problems of the SGML industry are approaching them from the SGML side rather than the database side. Several of the underlying technologies of current SGML database systems would allow for better addressing and linking capabilities if properly implemented.

We heard again and again as we contacted vendors for this article that HyTime support was not a priority. We insist that it must be if you ever wish to deliver your data to someone who does not share your database or if you want to ever move from one type of SGML management system to another. The true value of any standard is the portability of your data. If you have an SGML database, why do you need HyTime? By now, we hope you can answer this one for yourself!