DOCSTEP - Technical Documentation Creation and Management using STEP   Table of contents   Indexes   Configuration and version management in an SGML-based document management system

 Espert  Christophe 
  François  Patricia 
  Futtersack  Philippe 
 

Hypermedia Database

 

Abstract:

 In the context of Document Management Systems, the notion of document is becoming less and less central. A document corresponds to an assembly of information objects (SGML or non-SGML objects) that may be shared by several documents. Moreover, these information objects are interconnected by various kinds of links.
  Conventional SGML databases offer good support for storing and manipulating collections of independent SGML documents. They have to evolve in order to manage a network of SGML and non-SGML documents, i.e. hypermedia documents. SGML allows inter-document links to be defined by using ID/IDREF attributes and entity sharing. HyTime goes beyond the limits of SGML hyperlinking by offering the semantics needed to model complex links, such as a link from a document to a very precise location inside another one. In order to offer all the functionalities necessary for managing hypermedia documents, SGML databases must therefore take all the above constructs into account. The schema of these SGML databases consists of a tree structure representing the mapping of the SGML meta-model, but it has to evolve towards a graph structure for representing the HyTime hyperlinking model. This paper presents the principles for extending an SGML database to a HyTime database and the functionalities of a web interface for accessing the documents stored in the database.
 

Introduction

 This paper is the result of a current collaboration between Aérospatiale Aircraft Business and the Research and Development Division of Electricité De France. This collaboration concerns a study and research project in the structured electronic document database field. Although the specific industrial contexts are different, numerous common requirements may be identified in this particular field and a large benefit may be expected from a common study.
 Aérospatiale and Electricité De France are two large French companies which produce, respectively, aircraft (Aérospatiale Aircraft Business) and electricity. Both need to manage a large amount of documentation in their own industrial context. As a consequence, a significant benefit is expected from powerful computerized document management.
  After presenting the industrial contexts of this study, we briefly present our approach for specifying an SGML database. Then, we focus on our strategy for evolving towards a HyTime hypermedia database. We then show how we have chosen to implement this SGML/HyTime database. Finally, we conclude by giving the progress status of our work and the main issues which remain to be studied in depth.
 
 

Aérospatiale industrial context

 Aérospatiale Aircraft Business is responsible for producing all the technical documentation delivered with aircraft. This aircraft technical documentation is subject to severe constraints, particularly in terms of volume (more than 300,000 pages for one kind of plane), content format (textual, technical data, illustrations), content customizing (airline customizing), authoring (performed by various industrial partners), update frequency and longevity requirements.
  Since the 1980s, paper media has been giving way to electronic media in the aerospace community. In this context, the SGML format has been adopted by the aerospace regulatory organization (ATA - Air Transport Association of America) as the documentation exchange standard between aircraft manufacturers and airlines. As far as SGML structured documentation is concerned, Aérospatiale Aircraft Business is involved in three main domains: standardization, documentation production, and documentation utilization software development.
 Aérospatiale participates in various military and civil standardization committee groups which design DTDs for aircraft documentation delivered to airlines.
 In terms of documentation production, SGML technical publications are available, since 1993, for new aircraft: Airbus A330/A340. These SGML publications are produced by different means (native SGML production, proprietary Airbus format conversion, etc...) depending on the kind of manual produced. However, a large documentation system re-engineering project is going on; it aims at integrating new documentation technologies in the documentary product as well as in the production process.
 In terms of documentation utilization, Aérospatiale offers a software package, called ADOC/ADIC, which allows aircraft documentation delivered to airlines to be integrated into their own information system. This software consists of both an editorial workbench and a consultation system.
  Aérospatiale Aircraft Branch is also involved in research and prospective activities in the electronic documentation field. These activities mainly consist in studying new standards and technologies related to electronic documentation, in order to evaluate their suitability for aerospace needs and specificities, i.e. their applicability in the aerospace field. This research activity aims at preparing future aircraft documentation and related technological evolutions in the three above-mentioned domains: standardization, documentation production and documentation utilization software development. One of these research activities concerns the storage and management of hypermedia aircraft documentation.
 
 

Electricité De France industrial context

 Electricité De France is the public company providing electricity to 30 million French customers. Production, transportation and distribution are handled by the same company. Energy is also exported to numerous European countries. Moreover, know-how is exported for building nuclear power stations and electric networks.
 The main goal of the R&D Division is to do research on electricity topics. From the information system viewpoint, this research activity tends to generate technical documents. In addition, as a complement to electricity research, we carry out work on scientific topics such as computer science and, in particular, documentation engineering. In this context, we are studying new systems for managing electronic documents. The Electronic Library Project aims at managing the documents concerning the general activity of the Division.
 In the context of general activity, we keep track of the division flowchart, of the people employed by the division, and of accounting information. Moreover, the employees produce sets of electronic documents to describe what they intend to do, and later, the corresponding reports. As examples, we can mention activity descriptions (about 2000 documents of 2 pages each year), activity reports (twice a year) and internal technical reports describing general results of research actions (5000 documents of 10/100 pages each year). All these documents are generally written with a very popular word processing tool.
 This information is collected through the office automation network and stored in relational databases. Only the bibliographic information (the title, the abstract, and the authors of the internal technical reports) is stored in a coded format. Internal technical report bodies are digitized (because of the heterogeneity of the collected documents) and stored by a specific application on WORM optical disks.
 Many applications manipulate these data, to print or fax a report, to retrieve a selected set of information, or to compute synthetic results by using natural language and statistical techniques.
 SGML/HyTime is also used for nuclear station documentation. This type of documentation has many points in common with aircraft documentation. An SGML/HyTime database is a means of managing rich and durable information, independently of the content of the information, and even independently of the structure of the information if the SGML/HyTime database is generic enough.
 

An SGML database : reminder

 
 

Main choices for defining the database model

 Our strategy for defining an SGML database model has been described and analyzed elsewhere, in comparison with related work. We just sum it up here, and concentrate in this paper on our strategy for evolving towards a hypermedia database.
 We have chosen to propose a fully generic database model, that is, a completely DTD-independent model. In order to be fully SGML compliant, this model is derived from the SGML abstract syntax. The ESIS mechanism, specified in an SGML annex, defines a set of information on the element structure; we used the same philosophy to get information from the complete abstract syntax. The ESIS consists of a flow of information generated as the document is being parsed. It is defined to be the relevant set of information for recreating the source document as well as for implementing any structure-based application.
 
 

The database model principles

  The figure below shows a very simplified subset of this database model, which corresponds to a tree of SGML components decorated with SGML attributes.
 
Sample graphic
 We give no further details about the SGML database model in this paper, in order to focus on the link features.
 

Extensions for managing hypermedia documents

 
 

Why evolve towards hypermedia documentation

 Whether in a technical publication production or utilization context, the document concept is evolving greatly. On the one hand, documents are becoming largely inter-dependent: they reference each other, share components, and so on. On the other hand, documents tend to appear only as end-products which result from an assembly of documentary data.
 Managing a web of documentary data, i.e. SGML and non-SGML data (illustrations, technical data) connected by various kinds of links, rather than a collection of independent textual documents, is therefore an increasingly pressing requirement.
 The database tree model has to evolve towards a graph model. This graph model is close to the one defined for hypertext documents in that they both allow non-linear and interactive functionalities to be provided. But unlike most hypertext-related work, we are constrained to deal with standards, for exchange, longevity, and regulation reasons.
 
 

Our approach for evolving towards a hypermedia database

 
 

Evolving towards the HyTime exchange standard

  The SGML exchange model offers limited hypermedia features which are not sufficient for satisfying all our requirements. So we have chosen to evolve towards the HyTime exchange model, which is an ISO exchange standard, fully upward-compatible with the SGML standard. Based on the SGML syntax, HyTime introduces a new concept: the Architectural Form. An Architectural Form is a set of rules whose semantics are used to specify DTDs. Architectural Forms allow hypermedia features (hyperlinks, etc.) to be modeled as well as time-based features (scheduling, etc.). A HyTime document therefore consists of an SGML document, some elements of which have standardized semantics.
 
 

Our strategy for evolving towards a hypermedia database model

 The SGML database model briefly recalled above only represents the SGML tree structuring constructs, which may be qualified as syntactical constructs. This model consists of a direct mapping of the SGML tree meta-model; the exchange and database models are then isomorphic. As a consequence, for each SGML document to be imported into the database, this model may be instantiated as the document is being parsed, i.e. read sequentially. In the same way, a pre-order traversal of the tree model allows any marked-up source document to be recreated.
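 The isomorphism between tree model and marked-up source can be illustrated with a minimal sketch. The `Element` class and `serialize` function below are hypothetical names introduced for illustration only; they are not the authors' schema, and they ignore everything in SGML beyond elements, attributes and character data (no DTD, no entities, no tag minimization).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """One node of a toy syntactical tree: name, attributes, ordered children."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def serialize(node: Union[Element, str]) -> str:
    """Recreate marked-up text by a pre-order traversal of the tree."""
    if isinstance(node, str):  # character data leaf
        return node
    attrs = "".join(f' {key}="{value}"' for key, value in node.attributes.items())
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.name}{attrs}>{inner}</{node.name}>"


# A tiny document built by hand, as a parser would build it while reading.
doc = Element("manual", {"id": "m1"}, [
    Element("title", {}, ["Landing gear"]),
    Element("para", {}, ["See ", Element("xref", {"refid": "m1"}, []), "."]),
])
print(serialize(doc))
```

 Because the start tag is emitted before the children and the end tag after them, the output order matches the order in which a sequential parser would have encountered the markup.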
 But, as far as hypermedia features (whether in SGML or HyTime standard) are concerned, semantic concepts are added to some syntactical constructs. And these semantic concepts have to be largely known and used within the database for providing rich hypermedia capabilities (browsing, ...). The database model therefore not only consists of a mapping of the exchange model syntactical constructs but has to be semantic-aware.
 Our strategy for evolving towards a hypermedia database model therefore consists in:
 partitioning the database model into two layers :
 
  1. a syntactical layer which consists of a mapping of the exchange format syntax and ensures equivalence between database and exchange models. This syntactical layer may be instantiated as the source document is being parsed. It is designed to be the only layer invoked for exchanging any hypermedia document.
  2. a semantic layer which represents document hypermedia semantics and enables rich hypermedia capabilities to be provided by the database. This layer is instantiated in a second step and results from a calculation performed by querying the syntactical layer. The semantic layer is thus the result of what we will call "HyTime Processing". This semantic layer belongs only to the database and never needs to be invoked when recreating a source hypermedia document for exchange purposes.
 defining the database hypermedia model in two steps:
 
  1. firstly, modeling the existing SGML hypermedia features (entities, ID/IDREF attributes) because they consist of base features used by HyTime. This strategy enables an evolution from SGML to HyTime to be performed in both a smooth and consistent way.
  2. secondly, modeling part of the HyTime hypermedia features.
 
 validating this modeling work in an implementation phase based on the first HyTime applications: the Grund DTD and CApH.
 
 

The Hypermedia database model principles

 
 

First step: taking into account SGML hypermedia features

 
Sample graphic
 
Sample graphic
 
Sample graphic
 
 
  1. Information sharing --> Entity construct modeling. Whether in a document production or utilization context, managing information sharing is a major requirement. A mechanism must be provided which ensures that exactly the same information component is included or referenced in various document fragments. The SGML standard provides a virtual storage concept through the entity construct. An entity may contain SGML or non-SGML data, possibly referenced from different documents. These anchoring or referencing mechanisms ultimately amount to special links between documents. A first step in evolving towards a hypermedia database consists in managing this construct within the database. Entity references therefore have to be interpreted as particular links, "anchoring links", within the database (for browsing capabilities) and managed as "logical sharing mechanisms" (consistency control). One of the figures above shows the principles of our approach for enhancing the SGML database model in order to take the entity construct into account.
  2. Document traversal links --> ID/IDREF attribute modeling. SGML provides a standardized way of modeling cross-references within a document, by means of a specific kind of attribute: ID and IDREF attributes. Enhancing the SGML database model to take these cross-references into account enables traversal links to be managed within the database (browsing capabilities). These database model enhancements (cf. figure 4) consist in:
   • extending the SGML attribute model for identifying IDREF attributes (syntactical layer),
   • adding real traversal links within the database (semantic layer).
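 The two-layer treatment of ID/IDREF attributes can be sketched as follows. This is only an illustration under stated assumptions: `Element`, `walk` and `resolve_idrefs` are hypothetical names, and the attribute names `id` and `refid` are passed in as parameters, whereas in real SGML the DTD declares which attributes have the declared value ID or IDREF.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Sequence, Tuple


@dataclass
class Element:
    """Toy syntactical-layer node (hypothetical, not the authors' schema)."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Element"] = field(default_factory=list)


def walk(node: Element) -> Iterator[Element]:
    """Yield every element in document (pre-order) order."""
    yield node
    for child in node.children:
        yield from walk(child)


def resolve_idrefs(
    root: Element,
    id_attr: str = "id",                      # assumed name of the ID attribute
    idref_attrs: Sequence[str] = ("refid",),  # assumed names of IDREF attributes
) -> Tuple[List[Tuple[Element, Element]], List[str]]:
    """Second pass: turn IDREF values into (source, target) traversal links.

    Returns the resolved links and the list of IDREF values with no target.
    """
    by_id = {e.attributes[id_attr]: e for e in walk(root) if id_attr in e.attributes}
    links: List[Tuple[Element, Element]] = []
    dangling: List[str] = []
    for element in walk(root):
        for attr in idref_attrs:
            if attr in element.attributes:
                target = by_id.get(element.attributes[attr])
                if target is None:
                    dangling.append(element.attributes[attr])
                else:
                    links.append((element, target))
    return links, dangling


fig = Element("figure", {"id": "f1"})
ref_ok = Element("xref", {"refid": "f1"})
ref_bad = Element("xref", {"refid": "f9"})
root = Element("chapter", {"id": "c1"}, [fig, Element("para", {}, [ref_ok, ref_bad])])

links, dangling = resolve_idrefs(root)
print([(src.name, dst.attributes["id"]) for src, dst in links], dangling)
```

 The tree itself is left untouched; the links are derived data computed by querying it, which mirrors the split between a syntactical layer filled at parsing time and a semantic layer computed afterwards.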
 

Second step: HyTime hyperlink features modeling

 
Sample graphic
 
Sample graphic
 
  1. Requirements. Documents often contain references to pieces of information contained in other documents. Objects other than textual objects (graphics, etc.) are anchored or referenced in the documents too. Locating information included in a non-SGML object implies the use of a language that can be understood by this object (e.g. locating a portion of a graphic). Moreover, certain documents refer to remote technical databases. In the same way, the locating mechanism associated with these references must be capable of being understood by the external database (e.g. SQL). In order to prepare consultation, documents may also be inter-connected by typed complex links, for instance to enable pre-selection according to user profiles. Complex links are links which may connect more than two anchors, each anchor being qualified by a specific role and possibly composed of one or more information objects. Consequently, the exchange format has to offer a standardized way of modeling typed and complex links combined with powerful location mechanisms (cf. figure 5). SGML entities enable modularity and information sharing to be managed. However, they only allow the anchoring of a whole module, whether SGML or non-SGML, within a document. They do not offer a mechanism for locating information inside an entity. The SGML standard provides a mechanism for modeling intra-document links but not typed and complex inter-document links. Conversely, the HyTime standard offers a standardized way of modeling complex typed links combined with indirect location mechanisms which are partially based on the Entity and ID/IDREF constructs. Indeed, HyTime makes it possible to build links which connect more than two anchors, each anchor being qualified by a role, associated with traversal rules and possibly consisting of an aggregate of nodes. Moreover, it allows an independent hyperlink document to be defined without modifying the existing documents to which the hyperlinks point.
  2. Approach. The HyTime standard corrigendum defines an exhaustive property set representing all the information a parser is capable of making available about a document. Graphs, where arcs are defined in terms of properties, may be built. Such graphs are called "groves". The standard also provides a way (by means of a grove plan) for any application to get a grove that provides it with only the information it requires. In order to be HyTime compliant, we aim at defining a HyTime database model compliant with these grove and property set definitions. In a first phase, we cannot reach complete compliance, but we can rigorously specify our compliance level by defining our application grove plan. We will then enhance it step by step.
  3. The HyTime database model
 
  1. The syntactical layer. The HyTime standard is based on the specification of architectural forms which are gathered into modules. The base module is required as it specifies base architectural forms. The other modules are optional. In a first phase, we will only focus on two of them: the hyperlink module, which specifies architectural forms for representing links between any kind of documentary data, with these link endpoints being located by means of location mechanisms specified within the location address module, and the location address module, which defines various means of specifying information locations. Evolving properly towards a HyTime database therefore requires enhancing the SGML Document model for identifying instances of HyTime architectural forms. But even when this enhancement has been performed, a HyTime document imported into the database merely consists of an SGML document. Its hypermedia semantic aspects are not recognized because HyTime elements may be identified but their semantics are not interpreted. The above-defined enhancements therefore only concern the syntactical layer model.
  2. The semantic layer. HyTime document semantic aspects are generated in a HyTime processing phase by querying the syntactical layer and interpreting it according to HyTime specifications. Modeling a semantic layer which represents these semantic aspects then requires modeling: the HyTime Processing Services, which represent the HyTime specifications contained in each HyTime module (these specifications must be modeled to provide all the classes and methods required to perform HyTime processing), and the HyTime Document Semantics, which represent the document semantic aspects generated by the HyTime processing phase. This enables the hypermedia semantic constructs associated with each HyTime architectural form to be modeled. Finally, the HyTime processing phase enables generation of a hyperdocument which is composed of a set of entities, whether SGML document instances or non-SGML data (syntactical layer), plus a set of document semantic contents (semantic layer), each SGML document instance being associated with its related semantic content. These document semantic contents, generated by HyTime processing servers, consist of HyTime constructs organized into a complex model. Figure 6 shows a very simplified subset of the HyTime database model. This figure points out our approach for defining HyTime constructs which represent the hyperlink concept. Two kinds of relationships between the semantic and syntactical layers may be identified in this figure: "has for semantic" between a syntactical HyTime element and its related semantic construct, and "has for structural content" between a hyperlink node and its anchor point within a document instance.
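 The relationships between the two layers described above can be sketched with a few classes. This is a hedged illustration, not the authors' schema: `SyntacticNode`, `Anchor`, `Hyperlink` and `traverse` are hypothetical names, the roles and link type are invented sample values, and HyTime traversal rules are reduced to "reach every anchor other than the one you start from".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SyntacticNode:
    """An element of the syntactical layer, as instantiated at parsing time."""
    name: str
    node_id: str


@dataclass
class Anchor:
    """One end of a link: a role plus the nodes it covers."""
    role: str
    structural_content: List[SyntacticNode]  # "has for structural content"


@dataclass
class Hyperlink:
    """Semantic-layer construct computed for one linking element."""
    link_type: str
    source_element: SyntacticNode            # the element that "has for semantic" this link
    anchors: List[Anchor] = field(default_factory=list)

    def traverse(self, from_role: str) -> List[SyntacticNode]:
        """Return the nodes of every anchor except the one playing from_role."""
        return [
            node
            for anchor in self.anchors
            if anchor.role != from_role
            for node in anchor.structural_content
        ]


# A link with three anchors, i.e. more than a simple source/target pair.
linking_element = SyntacticNode("ilink", "n7")
step = SyntacticNode("step", "n12")
drawing = SyntacticNode("graphic", "n40")
part = SyntacticNode("partrow", "n88")

link = Hyperlink("illustrates", linking_element, [
    Anchor("procedure", [step]),
    Anchor("illustration", [drawing]),
    Anchor("part-data", [part]),
])
targets = link.traverse("procedure")
print([node.node_id for node in targets])
```

 Keeping the link object separate from the parsed elements means the syntactical layer can still be serialized unchanged for exchange, while browsing works on the link objects.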
 

An SGML/HyTime Document Management System prototype

  This section describes the architecture of the Database Management System we are prototyping and shows how the applications are connected to the SGML/HyTime database layer we suggested. This database layer is based on the O2 ODBMS.
 
 

Document Loading

 The SGML schema layer is populated at parsing time. When an SGML document is loaded, an SGML parser returns to the ODBMS a sequence of information corresponding to the structure and content of the document. Numerous objects are instantiated for the declaration, DTD and the instance.
 SGML hypermedia constructs are managed too. Sharing of external entities is managed. However, special attention must be paid to cases in which SGML entities contain partial tagging which is completed by the document referencing them. Each useful SGML reference is converted into an OID (Object IDentifier). Thus, the ODBMS provides the necessary functionalities such as object sharing, object locking or deep copies of objects in case of inconsistent modifications of shared entities. An SGML document manager, based on a catalog manager, manages system and public identifiers. Then, the ID/IDREF links can be translated into OIDs by running a specific method.
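 The idea of converting entity references into shared object identifiers can be sketched in a few lines. Everything here is an assumption made for illustration: `EntityStore` is a toy in-memory stand-in and not the O2 API, OIDs are plain integers, and the public identifiers are invented examples.

```python
import itertools
from typing import Dict


class EntityStore:
    """Toy stand-in for the object database: one stored object per identifier."""

    def __init__(self) -> None:
        self._next_oid = itertools.count(1)
        self._oid_by_identifier: Dict[str, int] = {}
        self._content: Dict[int, str] = {}

    def intern(self, identifier: str, content: str) -> int:
        """Return the OID for identifier; content is stored only the first time."""
        oid = self._oid_by_identifier.get(identifier)
        if oid is None:
            oid = next(self._next_oid)
            self._oid_by_identifier[identifier] = oid
            self._content[oid] = content
        return oid

    def content(self, oid: int) -> str:
        return self._content[oid]


store = EntityStore()
# Two documents reference the same external entity by its public identifier.
ref_from_doc_a = store.intern("-//EXAMPLE//ENTITIES Warning Text//EN", "Depressurize before opening.")
ref_from_doc_b = store.intern("-//EXAMPLE//ENTITIES Warning Text//EN", "ignored: already stored")
other = store.intern("-//EXAMPLE//ENTITIES Caution Text//EN", "Wear gloves.")
print(ref_from_doc_a == ref_from_doc_b, store.content(ref_from_doc_b))
```

 Both documents end up holding the same identifier, so a change to the stored entity is seen by every document that references it; locking and deep copies, mentioned above, are outside the scope of this sketch.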
 The HyTime processing is run in a specific pass. It is in charge of resolving the locators and the links. The HyTime processing methods populate the BOS (Bounded Object Set) according to a BOS level associated with the documents.
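 A bounded set of objects to process can be computed with a depth-limited walk over entity references. The sketch below is not taken from the prototype: `bounded_object_set` is a hypothetical function, the reference graph is invented sample data, and reading the BOS level as a nesting depth where level 1 means "hub document only" is an assumption of this sketch rather than a restatement of the standard.

```python
from typing import Dict, List


def bounded_object_set(hub: str, references: Dict[str, List[str]], level: int) -> List[str]:
    """Collect entities reachable from hub within `level` levels of nesting.

    references maps an entity name to the entities it references directly.
    Level 1 returns the hub alone; each extra level follows one more hop.
    """
    bos = [hub]
    seen = {hub}
    frontier = [hub]
    for _ in range(level - 1):
        next_frontier: List[str] = []
        for entity in frontier:
            for referenced in references.get(entity, []):
                if referenced not in seen:  # guards against reference cycles
                    seen.add(referenced)
                    bos.append(referenced)
                    next_frontier.append(referenced)
        frontier = next_frontier
    return bos


# Invented reference graph: the hub references a and b, a references c,
# and c refers back to the hub.
references = {"hub": ["a", "b"], "a": ["c"], "c": ["hub"]}
for level in (1, 2, 3):
    print(level, bounded_object_set("hub", references, level))
```

 Locators and links would then be resolved only against the entities in the returned list, which keeps the processing pass finite even when documents reference each other cyclically.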
 
 

Information Access Interface

 We have developed a first level of applications. Quick developments were possible because of the object modularity and the integrated tools offered with the ODBMS.
 As far as information access is concerned, we have developed two kinds of interface combining navigation and query. Both use internet/intranet technologies, but in two different ways. The accompanying figure describes the functional architecture of the HyO2 prototype.
 
 

First kind of interface

 This interface consists of two combined applications.
 The first application is a navigation application. It enables navigation through the database object composition to be performed. Today, this navigation starts from the persistent document object set. Then, the tree structure of each document chosen is interactively built and displayed. Inter-document links (corresponding to IDREF attributes) are traversed as well as anchoring links (corresponding to entity references).
 This navigation interface is developed using HTML and JAVA technologies. Database object composition is mapped into HTML documents which are interactively displayed using the Netscape HTML browser. Non-SGML objects referenced or anchored in an SGML document are displayed using Netscape plug-in facilities.
 We are now enhancing this interface in order to enable navigation through the hypermedia network associated to HyTime hyperdocuments stored within the database. This interface will be based on navigating through the database objects related to the schema semantic layer.
 The second application is an SGML/HyTime query interface. It offers query facilities based on SDQL, partially derived from DSSSLQuery and HyQ. This query interface runs on top of OQL, the database query language. It enables filters based on SGML/HyTime structure and text content to be applied to the database. This access is mainly intended for specialists who know the SGML/HyTime document structure.
 The navigation and query interfaces are combined so that it is possible to navigate from a query result.
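 A filter combining structure and text content can be illustrated without committing to any query syntax. The sketch below is plain Python over a toy tree; it is not SDQL or OQL, and `Element`, `text_of` and `select` are hypothetical names introduced only to show the kind of condition such a query expresses.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Element:
    """Toy document node: an element name and ordered children."""
    name: str
    children: List[Union["Element", str]] = field(default_factory=list)


def text_of(node: Union[Element, str]) -> str:
    """Concatenate all character data below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


def select(root: Union[Element, str], element_name: str, containing: str) -> Iterator[Element]:
    """Yield elements with the given name whose text contains the given word.

    The match is a case-insensitive substring test, a deliberate simplification.
    """
    if isinstance(root, str):
        return
    if root.name == element_name and containing.lower() in text_of(root).lower():
        yield root
    for child in root.children:
        yield from select(child, element_name, containing)


manual = Element("manual", [
    Element("title", ["Hydraulic system"]),
    Element("section", [
        Element("title", ["Pump removal"]),
        Element("para", ["Disconnect the hydraulic pump before removal."]),
    ]),
])
hits = [text_of(element) for element in select(manual, "title", "hydraulic")]
print(hits)
```

 Each hit is a node of the tree, so a navigation interface can start browsing from it, which is the combination of query and navigation described above.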
 
 

Second kind of interface

 Lastly, an HTML interface giving access to the SGML/HyTime documents, assisted by a full-text language, has been developed.
 This interface is designed for end-users who are not at all familiar with SGML. The structure is almost completely hidden behind the HTML/Java interface. Users access the documents by using the Topic full-text query language. The application was developed on top of the Topic API in order to customize the query interface and to map the SGML tree structure to the nested text boxes managed by the full-text engine. Users can choose between 5 ways to build queries:
 
  1. by using lists of indexed words,
  2. by selecting icons building automatically the Topic query,
  3. by typing some words to search and choosing between a reduced set of operators,
  4. by typing some words to search for, the names of the SGML elements or attributes in which to search, and choosing from a rich set of operators,
  5. by writing the query directly in the native Topic language.
 The result of a query is presented as a table of documents, and the user can click on a document in the table to display the document body with its highlights.
 Moreover, the HyTime links anchored to the displayed document are active in a frame on the left side of the displayed document. The user can thus navigate through the web of HyTime links hidden behind a graphic representation which looks like a table of contents.
 
Sample graphic
 
Sample graphic
 
Sample graphic
 
Sample graphic
 

Conclusion

 At this stage, we have also extended the specification with the HyTime domain concerning hyperlink management.
 Concerning the prototype, we already developed a generic object schema of the SGML standard, an SGML document loader and some graphical tools to navigate through and visualize SGML documents stored in an O2 database.
 We obtain good performance at loading time and excellent access results even on a large number of documents (many tens of thousands for EDF). However, more tests must be performed on large documents (many tens of megabytes for Aérospatiale).
 We also wrote an object schema of HyTime (hyperlink and location address modules) and the first methods to compute HyTime Processing. We are extending our applications for the end-users. The first step was to make a dynamic generation of HTML/Java presentations from the HyTime documents stored in the database.
 We obtained a very good foundation for an SGML/HyTime database. Concerning HyTime, we are convinced that the HyTime concepts fulfil our hypermedia needs, but we are still waiting for the HyTime tools to come on to the market.
 In accordance with the HyTime functionalities, we must validate the complete HyTime specification on our prototype. We will use the HyTime concepts for versioning management and study how to map a "HyTime versioning DTD" to the ODBMS's versioning management module. We have yet to implement variant management. We aim at using or even extending this versioning management module for this implementation.
 On the interface side, we offer hypertext-like navigation access, but we must work on an interface for end users which hides the SGML/HyTime syntax and offers a query language based on SDQL, partially derived from DSSSLQuery and HyQ. This new language will run on top of OQL, the database query language.
 
 

BIBLIOGRAPHY

 
  • [1] François, "Generalized SGML repositories: requirements and modeling", Computer Standards and Interfaces vol. 18 p.11-24, 1996
  • [2] Karl Aberer, Klemens Böhm, Christoph Höser, "The Prospects of Publishing Using Advanced Database Concepts", Electronic Publishing (EP 94) vol. 6(4) p.469-480, 1994
  • [3] Paula Angerstein, Paul Grosso, "The DSSSL Query Grammar", DSSSL committee working paper, 1992
  • [4] Steven J. DeRose, David G. Durand, "Making Hypermedia Work", Kluwer Academic Publishers, 1994
  • [5] Catherine Hamon, "Conception orientée objet d'une base de données éditoriale SGML - Implantation sur le SGBDOO O2", thèse (INRIA), 1992
  • [6] C. F. Goldfarb, Yuri Rubinsky, "The SGML Handbook", 1990
  • [7] J. Conklin, "Hypertext: An introduction and survey", IEEE Computer, 1987
  • [8] E. Barrett, "Hypertext in context", "The society of text: Hypertext, Hypermedia and the social construction of information", MIT Press, collection information systems, 1989
  • [9] P. Futtersack, C. Espert, "Electronic Library System at Electricité De France: a case study using object technology", SGML Europe'95, 1995
  • [10] ISO, Information Processing-Text and Office Systems, Standard Generalized Markup Language, ISO 8879, 1986
  • [11] ISO, Information Processing-Text and Office Systems, Document Style Semantics Specification Language, ISO 10179, 1996
  • [12] ISO, Information Technology, Hypermedia/Time-Based Structuring Language, ISO 10744, 1992
  • [13] J.L. Sanson, "EDF Electronic Library Project: Data modeling with HyTime/CApH", Second International Conference on Applications of HyTime, August 1995
  • [14] M. Biezunski, "Modeling hyperdocuments using the topic map architecture", Second International Conference on Applications of HyTime, August 1995
  • [15] O2, Reference manuals, O2Technology Inc., 1995
  • [16] Sweden Cals Office, "FMV (Swedish Defense Material Administration) Grund DTD", version 1.10, 1995
