Tastes Great - Less Filling: SGML for the 21st Century   Table of contents   Indexes   Information Modeling for Document Management: the Key to Successful System Selection and Deployment

  Hohoff  Simon 
  Kraft  Matthias 
 

Information, Documents, and Products

 

Introducing a Data Repository to a Legal Publishing House

 

Abstract:

 Introducing SGML to a conservative publishing house is a long journey. In the case of C. H. Beck, the leading company for legal publications in Germany, the effort was driven by the demands of a continuously growing market for electronic publications, online as well as on CD-ROM.
 Since information is the main business of a publishing company, creating an effective information repository was the first step. The work proceeded in two directions.
 On the one hand, the information, the sources and the publication process were structured in classic entity-relationship models. The analysis yielded three different information models (legislative documents, court decisions and intellectually authored texts), implying three different databases. Two of the three databases represent an entity-relationship model of the information. The third database, which stores authored texts such as books, is document driven and mirrors the structure of the source publication. To allow the greatest flexibility and easy handling of the data, in each case the documents were broken apart into micro documents of almost the same class.
 On the other hand, the source documents and the resulting publications were examined in order to create a DTD. The resulting DTD is divided into several modules: modules that represent overall document structures (books, journals, sections etc.) and modules that capture detailed information (tables, highlighting etc.). The overall DTD is intended as an abstract model from which various process-specific DTDs are derived. Thus the detailed element model corresponds to the micro documents of the information repository. The global document structures are created by the export functions of the databases.
 In the future there will be an overarching project management system, which will enable the product manager to create publications containing micro documents from all three databases and an overall structure.
 

Introduction

 C. H. Beck is the leading publishing house for legal publications in Germany. It publishes the major collections of legal statutes, legal journals and law books. The range of books extends from commentaries to handbooks and encyclopedias, and from nutshell size to 10 volumes, each with more than 3,000 pages.
 In 1989 C. H. Beck published its first CD-ROM, containing over 100,000 abstracts of essays and court decisions. One year later, a CD-ROM containing 15 years of the NJW was published.
  NOTE:
 The Neue Juristische Wochenschrift (NJW, "New Legal Weekly Journal") is a must for every German lawyer.
 Other journal archives on CD followed. All CD-ROMs were driven by DATAWARE 2000 software and ran under DOS. The data was stored on an IBM host in the STAIRS 72C input format, but STAIRS itself was never used as a repository.
  NOTE:
 This format is line based. Each line starts with a three-character code for the field type and contains at most 72 characters of text. This leads to a paragraph-oriented field structure. In-line information was tagged with an SGML-like syntax.
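 The line-based input format described in the note can be illustrated with a small parser. This is only a sketch under stated assumptions, not the actual STAIRS 72C specification: the field codes `TIT` and `TXT`, the rule that consecutive lines with the same code continue one field, and the function name `parse_fields` are all hypothetical.

```python
from typing import List, Tuple

CODE_WIDTH = 3   # three-character field type code (from the text)
MAX_TEXT = 72    # at most 72 characters of text per line (from the text)


def parse_fields(lines: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive lines with the same field code into one field.

    Assumption: a field longer than one line is continued by repeating
    the same code on the following line.
    """
    fields: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        code, text = line[:CODE_WIDTH], line[CODE_WIDTH:]
        if len(text) > MAX_TEXT:
            raise ValueError(f"text longer than {MAX_TEXT} characters: {line!r}")
        text = text.strip()
        if fields and fields[-1][0] == code:
            # Same code as the previous line: continue that field.
            fields[-1] = (code, fields[-1][1] + " " + text)
        else:
            fields.append((code, text))
    return fields


sample = [
    "TIT Example decision",
    "TXT The court held that the <Q>statute</Q>",
    "TXT applies to this case.",
]
print(parse_fields(sample))
```

 The in-line `<Q>` tag in the sample stands for the SGML-like in-line tagging mentioned in the note; the parser leaves such tags untouched.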
 When Windows became more and more popular, the software as well as the data had to meet new requirements. In a GUI, the font usually has proportional width. The text has to be displayed with more detailed typography, and it has to flow into a frame of changing width. Since even the help system is a powerful hypertext tool, citations have to be linked as hypertext. This means that the look and feel of a real GUI cannot be achieved by simply porting the retrieval software from text-oriented DOS to Windows using the Courier New font. Above all, it requires a different and in part much more complex data preparation.
 In 1993, however, the first CD-ROMs with a graphical user interface were developed. Small CDs with collections of statutes such as the Schönfelder ran under Microsoft Windows with the MS Multimedia Viewer software.
  NOTE:
 The Schönfelder is the leading collection of statutes in Germany and contains all federal statutes on German civil law and criminal law.
 The data repository for production was the Novell file system, and the data format was RTF, which is the viewer's import format.
  NOTE:
 Rich Text Format, a data exchange format from Microsoft, which allows structuring by style sheets.
 At the same time the first mixed product was developed, containing both legal statutes of different hierarchical orders and court decisions. Both kinds of documents were also sold in separate products. But since the data repository and the data format did not allow multiple use of the same data, identical statutes were often stored redundantly.
 The RTF adventure did not pay off. There is no room here to list all the troubles we had and all the surprises our customers had to face. This was predictable, but some bad experiences had to be made before everybody believed that the texts had to be structured with an application-independent and content-oriented standard. Thus in 1995 we started two projects:
 
  • Data management systems were developed.
  • SGML was introduced into the working process.

 The subsequent sections describe the path we took and the experience we gained in trying to prepare the publishing house for future challenges.
     

    Examining what's going on

     There were many different reasons to reorganize the work flow and document management in the company. Around 1993 everybody began to think about building a document management system to improve the document flow of their own department.
     Almost nobody thought about structured data. Most approaches aimed at a new database to manage the data rather than at taking care of the data's content.
     
     

    The legal Archives

     Since Beck's main products are books, and one of the major product lines is collections of legal statutes, a huge effort goes into managing these statutes. But this effort seemed inefficient because a reader's responsibility for a statute followed from the responsibility for the collection. Thus an important statute was looked after by up to 10 readers at the same time, just because it was printed in 10 different volumes.
      NOTE:
     The text of the statute as you read it in the collection is the result of a sometimes complex consolidation process. It is produced by following the instructions of the legislature to change the former text in a certain way. These instructions are themselves issued as a statute and published in one of the various official journals.
     This is why the archives intended to centralize the management of statutes.
     The first plan was to build a management system for the versions of the documents. It was not intended to store the documents themselves. A very complex SQL database was developed containing information about the law itself, the different amending laws etc. It was designed to manage all official publications and their effects on, or cross references to, the law by a complex system of structured metadata. The texts of the publications were to be stored as images rather than as text. For this reason a BLOB field was planned to contain the scans of the pages.
     The needs of CD-ROM production then caused the project manager to think about storing the text of the consolidated law as it is published in a collection. So another BLOB field was designed to store the final version of the text in any data format. Since the consultants who did the database design came from an SQL background, SGML was an unknown world for them. Their interest was to find a data format that could be accessed with an OLE application in order to integrate the editor into the database's user interface. It was, and still is, hard work to convince the responsible colleagues that without an application-independent standard the success of the whole project is in serious danger. To this day RTF is seen as a real alternative.
     
     

    The currently running Production System for CD-ROM

     The development of a new database for the production of the CD-ROM-archives of the legal journals was driven by these major goals:
     
  • Replace the host file system with a modern client-server database.
  • Create a repository for the new structured data in order to meet the needs of GUIs.
      NOTE:
     Graphical User Interfaces.
  • Create a management system for multiple usage of the documents.
  • Improve the data quality through easier data management routines.

     In order to accelerate the shift to client-server, the data model mirrored the structures of the former STAIRS data. Meta information was put into several fields of the tables. Repeating fields were placed in separate joined tables, and so on. The text was split into different parts, which were stored in different data sets.
     The biggest challenge was the conversion of the line-oriented, unstructured text elements to block-oriented, structured data containing hierarchical information as well as in-line elements. Both converting the text and loading it into the database were done in a single process.
      NOTE:
     This was a failure. A two-step solution is the better way: first convert the data to SGML, then bring the data into the database with a standardized import.
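     The two-step approach recommended in the note can be sketched as follows. This is an illustration under assumptions, not the production code: the field codes, the element names `DOC`, `TITLE` and `P`, the table layout and both function names are hypothetical, and the sketch does not escape markup characters in the text.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical mapping from legacy field codes to element names.
ELEMENT_FOR_CODE = {"TIT": "TITLE", "TXT": "P"}


def to_sgml(fields: List[Tuple[str, str]]) -> str:
    """Step 1: convert legacy fields to a marked-up document string."""
    parts = ["<DOC>"]
    for code, text in fields:
        name = ELEMENT_FOR_CODE[code]  # unknown codes fail loudly (KeyError)
        parts.append(f"<{name}>{text}</{name}>")
    parts.append("</DOC>")
    return "\n".join(parts)


def import_document(conn: sqlite3.Connection, doc_id: str, sgml: str) -> None:
    """Step 2: a generic import that knows nothing about the legacy format."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document (id TEXT PRIMARY KEY, sgml TEXT)"
    )
    conn.execute("INSERT INTO document (id, sgml) VALUES (?, ?)", (doc_id, sgml))


conn = sqlite3.connect(":memory:")
sgml = to_sgml([("TIT", "Example"), ("TXT", "First paragraph.")])
import_document(conn, "doc-1", sgml)
stored = conn.execute(
    "SELECT sgml FROM document WHERE id = ?", ("doc-1",)
).fetchone()[0]
print(stored)
```

     Because the import step only sees the marked-up string, the intermediate result can be validated or corrected before anything reaches the database, which is the point of separating the two steps.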
     

    Tagging the Books

     A different project was established to address a new product line. The intellectual material, the backbone of a publishing house, had to be prepared for use in multiple media. In this case SGML was the first choice from the beginning. But it took products like Near & Far and Panorama to put the non-computer scientists in a position to work with DTDs and SGML documents, and to show and convince others of the methods and effects of SGML. From the beginning it was planned to build one global DTD for all documents of the house. This made sense in order to keep the microstructure of the documents compatible.
     

    Working together

     Building the DTD had side effects on the data repository, and vice versa. In addition, the final products were increasingly expected to be mixed from the content of the different repositories. Thus it was time to make it all work together.
     
     

    Define Common Goals

     Collecting the aims of the different projects led to the following common goals:
     
  • Multiple usage of the documents
      • multiple media
      • multiple publications
  • Saving costs
      • centralized care for identical documents
      • normalized input from various authors and editors
      • long-term reliable data formats
  • Security
      • optimization of the work flow
      • limitation of user access
      • improved access for automation
  • Different kinds of version control
      • versions of legal statutes
      • versions of intellectual documents
     

    Examine the Document types

     Legal work is mostly driven by different kinds of documents. Since the development team consisted of legal experts, it was straightforward to determine the different types of documents. We divided the definition vertically into the following parts:
     
  • Different types of works
  • Hierarchical structure within a work
  • Metadata of a document
  • Text area, especially in-line constructs
     

    Reiterating Modules

     Because reusability was one of the major aims of the structure, it was clear that the pure text area had to be identical in all kinds of documents. A citation, a table or a list, for example, looks the same in a statute as it does in a journal article or a handbook. The element "P" became a central element of the whole structure.
     Reusability was again the reason for developing a lean construct to express all types of hierarchy within the text areas. A recursive SECT element gives access to any node in the hierarchy of a document in the same way, so any piece of the information can easily be reused in different contexts.
      NOTE:
     In German "GL", for "Gliederung".
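     The effect of a recursive section element can be illustrated with a small data structure. This is a sketch, not the DTD itself: the class name `Sect`, its fields and the helper `node_at` are hypothetical and only mirror the idea that every level of the hierarchy has the same shape.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Sect:
    """A recursive section: a heading, paragraphs, and nested sections."""
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Sect"] = field(default_factory=list)


def node_at(root: Sect, path: Sequence[int]) -> Sect:
    """Return the node reached by following child indexes from the root."""
    node = root
    for index in path:
        node = node.children[index]
    return node


book = Sect("Handbook", children=[
    Sect("Part 1", children=[Sect("Chapter 1", ["Some text."])]),
    Sect("Part 2", ["More text."]),
])

# Any node is itself a complete Sect, so it can be reused as part of
# another publication without changing its shape.
excerpt = Sect("Excerpt", children=[node_at(book, [0, 0]), node_at(book, [1])])
print([s.heading for s in excerpt.children])
```

     Because a chapter and a whole part are the same kind of object, the code that extracts, edits or exports one works unchanged for the other.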
     

    Unique Modules

     The real differences were found in the meta information of the documents and in the hierarchical order up to a certain document level. Sometimes there is more than one way to put the same information together. In the case of a journal article, for example, there were two different ways of binding it into a work:
     
  • It can be part of the hierarchy of a paper.
  • It can stand alone in a collection of different articles.

     In one case the meta information, such as the date of publication or the section of the paper, is derived from the context. In the other case it is added to the document as header information.
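     The two ways of supplying meta information can be sketched with a lookup that first checks the document's own header and then its context. The class `Node`, the function `resolve` and the sample keys are hypothetical; the sketch assumes that an article embedded in a paper inherits values from the enclosing nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A document or a level of the hierarchy that contains it."""
    meta: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = None


def resolve(node: Node, key: str) -> Optional[str]:
    """Look for meta information on the node itself, then on its ancestors."""
    current: Optional[Node] = node
    while current is not None:
        if key in current.meta:
            return current.meta[key]
        current = current.parent
    return None


# Article inside the hierarchy of a paper: the year comes from the context.
issue = Node({"year": "1995", "number": "12"})
embedded = Node({"title": "An article"}, parent=issue)

# Stand-alone article: the same information is carried in its own header.
standalone = Node({"title": "An article", "year": "1995"})

print(resolve(embedded, "year"), resolve(standalone, "year"))
```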
     
     

    Levels of modules

     We created three module levels in the DTD
     

    Text Atoms

     lines
      contain text elements without a line break, such as emphasis, names etc.
     paragraphs
      contain all elements of a line but also elements with line breaks, such as lists, tables, preformatted areas etc.
     basic hierarchical structures
     contain recursive sections with headings and paragraphs.
     

    Documents

     
     Statutes
      have a very formal structure and an important body of meta information. Their hierarchy is sometimes different from that of all other documents, since it is broken down into strictly defined levels such as "article, paragraph, number, sentence".
     Legal decisions
      have a very formal header with various information about the file, the court, the date and time, and some other information. Furthermore, they have particular text areas such as the abstract, the description of the case and the reasons for the decision.
     Simple structured text
     is mostly written by freelance authors and contains little header information and a recursive hierarchy.
     

    Document Collections

     
     Books
     have their own recursive hierarchy. Every node can have the same kinds of header information, such as table of contents, author etc. Within the sections and subsections, texts are grouped together in different ways, which is what makes them different kinds of books.
     
  • A collection of simple structured texts makes a handbook.
  • Collections of structured texts with a definition term are encyclopedias.
  • Combinations of a statute document and a simple text are commentaries, etc. But there are books that change their character in every section.

     Journals
      are, like books, collections of documents. The difference is the mostly non-recursive hierarchy of (for example) year, number and (recursive) categories, with disparate header and footer information for each node level of the tree. They also contain different kinds of documents, such as decisions. A special case is collections of documents grouped by law. They are published monthly like a journal but treated as a loose-leaf collection.
     Loose collections
     contain documents without any hierarchy. They are mostly used for CD-ROM production.
     
     

    Examine the Work Flow

     
     

    Conceptual model

     The overall concept of the data management and work flow is shown in the following figure.
     
    Overall concept of the data management and work flow.
     
     

    Work Flow of Different Document Types

     The analysis of the work flow revealed three groups of documents, which match the groups discussed above.
     

    Official Texts

     
  • There is a large amount of old data to be brought into the repository.
  • Statutes are based on official texts.
  • Updates require a great deal of effort.
  • Updates are based on official directives and can be carried out in-house.
  • There is a great opportunity for multiple usage, in print as well as in electronic production.

     This leads to the following data flow model:
     
    Data flow concept for official texts.
     

    Journals, Collections and Court Decisions

     
  • There are huge archives of old data to be brought into the repository.
  • Updates require little effort, but there is a continuous stream of new data. However, court decisions can be published redundantly in journals from other publishers, so the citation of the same decision often varies. To guarantee hypertext in electronic products, all possible citations must be kept up to date.
  • There can be a need to bypass the repository for a first, timely publication in print or online.
  • The data will usually be used once in print but often in electronic publications.
    Data flow model for journals and other periodic publications.
     

    Books

     
  • The amount of usable old data can be handled subsequently.
  • Books are subject to a continuous update cycle.
  • Updates are created by external authors. Often many authors take care of one work.
  • The authors' equipment, knowledge and willingness vary enormously.
  • The complete data of a work can be used once on each medium. But it must be possible to use all parts of a book in various other publications.
  • The major output destination is still the print product.

    Work flow in the publishing process of books.
     The most important realization was that in this field there is no real problem that a German legal publisher faces alone in this world.
     
     

    Compound Electronic Publication

     A modern electronic publication will always consist of a mixture of different documents. In the future there might be a need to sell products that are customized to a specific end user's needs. In addition, it must be possible to distribute single components of the data as modules that can be integrated at run time on the user's system. This requires data management with flexible integration features:
     
  • Collect documents of different types.
  • Arrange documents in product-specific hierarchical and consecutive order.
  • Check hypertext integrity. Collect documents that are cited in the main texts.
  • Use product-specific document elements, such as abstracts or full text.
  • Extract sections or subsections of books. For example, look for commentaries on included statutes, or for articles of an encyclopedia on specific keywords, etc.
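     The hypertext integrity check in the list above can be sketched as a traversal of the citation links. This is a simplified illustration: the function name `collect_cited`, the dictionary of citations and the document identifiers are hypothetical, and real citations would first have to be resolved to document identifiers.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def collect_cited(
    start: Iterable[str], citations: Dict[str, List[str]]
) -> Tuple[Set[str], Set[str]]:
    """Return (documents to include, cited ids with no known document)."""
    included: Set[str] = set()
    missing: Set[str] = set()
    queue = deque(start)
    while queue:
        doc_id = queue.popleft()
        if doc_id in included or doc_id in missing:
            continue  # already handled
        if doc_id not in citations:
            missing.add(doc_id)  # a link that would dangle in the product
            continue
        included.add(doc_id)
        queue.extend(citations[doc_id])
    return included, missing


# Hypothetical repository content: document id -> ids of cited documents.
citations = {
    "commentary": ["statute-a", "decision-1"],
    "statute-a": [],
    "decision-1": ["statute-a", "decision-2"],
}
included, missing = collect_cited(["commentary"], citations)
print(sorted(included), sorted(missing))
```

     The first set is the closure of documents a product needs if every citation is to be a working link; the second set lists the citations that cannot be resolved and therefore need attention before publication.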
     

    Redefinition of the Projects

     The analysis did not reveal any need to change the company's growing document management architecture from the ground up. But there was now some work to do to bring all projects on track toward the common targets.
     
     

    Databases

     

    Specialized Repositories

     The document repositories of the legal archives and the production database for the journal archives will remain SQL databases with text fields to contain the SGML information.
     
  • They contain all text information according to the micro document DTD fragment.
  • The export program simply embeds the micro documents into the structure of the full document.
  • It can also embed the documents into the structure of a complete publication.
  • Editing can take place at the micro document level as well as at the document level.
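     The export step described above can be sketched with an SQL table whose text field holds SGML fragments. The table name `micro_document`, its columns, the wrapper element `STATUTE` and the function `export_document` are assumptions made for illustration, not the actual database design.

```python
import sqlite3


def export_document(conn: sqlite3.Connection, doc_id: str, wrapper: str) -> str:
    """Embed the stored micro documents of one document in a wrapper element."""
    rows = conn.execute(
        "SELECT sgml FROM micro_document WHERE doc_id = ? ORDER BY position",
        (doc_id,),
    ).fetchall()
    body = "\n".join(row[0] for row in rows)
    return f"<{wrapper}>\n{body}\n</{wrapper}>"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE micro_document (doc_id TEXT, position INTEGER, sgml TEXT)"
)
conn.executemany(
    "INSERT INTO micro_document VALUES (?, ?, ?)",
    [
        ("statute-1", 2, "<P>Second paragraph.</P>"),
        ("statute-1", 1, "<P>First paragraph.</P>"),
    ],
)
exported = export_document(conn, "statute-1", "STATUTE")
print(exported)
```

     The database only needs to know the order of the micro documents; the global structure is produced at export time, so the same fragments can be wrapped differently for different publications.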

    General Document Management System

     
  • For general works such as books or publications without a formal structure, an SGML database will be installed.
  • It should have a conventional SQL database as its DBMS.
  • It is supposed to work on a micro document basis.
  • It will have a work flow management system.
  • It must be able to export sub-documents starting from every node above the micro document level.
     

    Product Management System

     A database-driven system is planned to be installed soon to collect and join documents from the three repositories. These are the tasks of the system:
     
  • Support the project planning.
  • Arrange a cross search through documents of all repositories and put the hits into a project.
  • Generate a tree of cited documents starting from a collection of documents.
  • Store diverse hierarchical trees in the projects with links to documents of the repositories.
  • Optimize the quality.
  • Validate the citations and hypertext links.
  • Install an overall classification.
  • Handle the project data collection.
  • Make the other databases export the needed documents or document parts.
  • Generate one or more frame documents with the hierarchical order of the project's documents.
    Data flow in the product management system.
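     Generating a frame document from a project tree can be sketched as a recursive walk. The class `ProjectNode`, the element names `SECT`, `TITLE` and `DOCREF`, and the repository identifiers are hypothetical; the sketch only shows how a stored hierarchy with links to repository documents could be turned into a skeleton for a publication.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectNode:
    """A node of a project tree with links to repository documents."""
    title: str
    doc_refs: List[str] = field(default_factory=list)
    children: List["ProjectNode"] = field(default_factory=list)


def frame_document(node: ProjectNode, depth: int = 0) -> List[str]:
    """Return the lines of a frame document for the subtree rooted at node."""
    pad = "  " * depth
    lines = [f"{pad}<SECT><TITLE>{node.title}</TITLE>"]
    for ref in node.doc_refs:
        # Assumed empty element that points at a document in a repository.
        lines.append(f'{pad}  <DOCREF ID="{ref}">')
    for child in node.children:
        lines.extend(frame_document(child, depth + 1))
    lines.append(f"{pad}</SECT>")
    return lines


project = ProjectNode("Product", children=[
    ProjectNode("Statutes", doc_refs=["statute-a"]),
    ProjectNode("Decisions", doc_refs=["decision-1", "decision-2"]),
])
print("\n".join(frame_document(project)))
```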
     

    DTD

     
  • There is one DTD as an abstract model for the publications of the company.
  • The DTD is strictly organized to realize a micro document structure.
  • Micro documents are collections of lines, paragraphs or basic structures.
  • A micro document DTD fragment is created for editing the corresponding elements.
  • For text entry, a DTD fragment can be generated with an assistant. Thus the DTD contains exactly the elements needed for the corresponding document. This simplifies the handling of the complex DTD and improves the coherence of the text.
  • For data import, there is a database-specific, fixed subset of the DTD. It is necessary in order to avoid importing documents of the wrong types. It guarantees stronger validation and consistent data in the database.
  • For data export, there might be a superset of the DTD to fit the needs of a specific publication process.
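     The idea of a database-specific import subset can be illustrated with a deliberately simplified check. This is not DTD validation: it only compares the element names used in a document with an allowed set per database. The element names, the database names and the function `rejected_elements` are hypothetical, and the regular expression is a rough stand-in for a real SGML parser.

```python
import re
from typing import Dict, Set

# Matches the name of a start tag; end tags begin with "</" and are skipped.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")

# Hypothetical import subsets: element names each database accepts.
ALLOWED: Dict[str, Set[str]] = {
    "statutes": {"STATUTE", "ARTICLE", "SECT", "TITLE", "P"},
    "decisions": {"DECISION", "ABSTRACT", "SECT", "TITLE", "P"},
}


def rejected_elements(sgml: str, database: str) -> Set[str]:
    """Return the element names that the target database does not accept."""
    used = {name.upper() for name in START_TAG.findall(sgml)}
    return used - ALLOWED[database]


doc = "<DECISION><ABSTRACT>x</ABSTRACT><P>y</P></DECISION>"
print(rejected_elements(doc, "decisions"))
print(sorted(rejected_elements(doc, "statutes")))
```

     An empty result means the document uses only elements of the subset; a non-empty result is the kind of wrong-type import the fixed subset is meant to prevent.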

    What We Learned

     
  • Use one DTD as a conceptual model.
  • Modularize the DTD. Use different layers of diversification.
  • Use specialized DTDs for the different steps of the process. Derive them from the conceptual model.
  • Create a lean DTD for reusability of the documents as well as of the elements of the structure.
  • Design a realistic data repository. Use existing ideas and organization systems.
  • But use repositories from the market if there is nothing special about your documents or organization.
  • Use open systems with easy access from outside the application.
  • If you use SGML as a standard for long-term reliability, apply the same standard to the DBMS.
