| Tastes Great - Less Filling: SGML for the 21st Century | Table of contents | Indexes | Information Modeling for Document Management: the Key to Successful System Selection and Deployment | |||
| Hohoff Simon |
| Kraft Matthias |
Information, Documents, and Products |
Introducing a Data Repository to a legal Publishing House |
Abstract: |
| Introducing SGML to a conservative publishing house is a long journey. In the case of C. H. Beck, the leading publisher of legal publications in Germany, the effort was driven by the demands of a continuously growing market for electronic publications, online as well as on CD-ROM. |
| Since information is the main business of a publishing company, creating an effective information repository was the first step. The effort went in two directions. |
| On the one hand, the information, the sources and the publication process were structured in classic entity-relationship models. The analysis yielded three different information models (legislative documents, court decisions and intellectually authored texts), implying three different databases. Two of the three databases represent an entity-relationship model of the information. The third database (storing authored texts such as books) is document driven and mirrors the structure of the source publication. To allow maximum flexibility and easy handling of the data, in each case the documents were broken apart into micro documents of almost the same class. |
| On the other hand, the source documents and the resulting publications were examined in order to create a DTD. The resulting DTD is divided into several modules: modules that represent overall document structures (books, journals, sections etc.) and modules that capture detailed information (tables, highlighting etc.). The overall DTD is intended as an abstract model from which various process-specific DTDs are derived. The detailed element model thus corresponds to the micro documents of the information repository. The global document structures are created by the export function of the databases. |
| In the future there will be a combining project management system that enables the product manager to create publications containing micro documents from all three databases together with an overall structure. |
Introduction |
| C. H. Beck is the leading publishing house for legal publications in Germany. It publishes the major collections of legal statutes, legal journals and law books. The range of books runs from commentaries to handbooks and encyclopedias, and from nutshell size to 10 volumes, each with more than 3,000 pages. |
| In 1989 C. H. Beck published its first CD-ROM, containing over 100,000 abstracts of essays and court decisions. One year later a CD-ROM containing 15 years of the NJW was published. |
| NOTE: |
| The Neue Juristische Wochenschrift (NJW, "New Legal Weekly Journal") is a must for every German lawyer. |
| Other archives of journals on CD followed. All CD-ROMs were driven by the DATAWARE 2000 software and ran under the DOS operating system. The data was stored on an IBM host in the STAIRS 72C input format, but STAIRS itself was never used as a repository. |
| NOTE: |
| The STAIRS 72C format is line based. Each line starts with a three-character code for the field type and holds a maximum of 72 characters of text. This leads to a paragraph-oriented field structure. In-line information was tagged with an SGML-like syntax. |
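| The line-based layout described in the note can be read with very little code. The following Python sketch is only an illustration under stated assumptions: it assumes that the three-character code is immediately followed by the text and that consecutive lines with the same code form one field. The field codes `TIT` and `TXT` and the sample lines are invented and are not taken from the real STAIRS data. |

```python
from itertools import groupby

CODE_WIDTH = 3   # three-character field type code (from the format description)
MAX_TEXT = 72    # maximum number of text characters per line


def parse_fields(lines):
    """Group consecutive lines with the same field code into (code, text) pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        code, text = line[:CODE_WIDTH], line[CODE_WIDTH:]
        if len(text) > MAX_TEXT:
            raise ValueError(f"text longer than {MAX_TEXT} characters: {line!r}")
        records.append((code, text.strip()))
    # Assumption: a run of lines sharing one code is a single paragraph-like field.
    return [(code, " ".join(text for _, text in run))
            for code, run in groupby(records, key=lambda r: r[0])]


# Hypothetical sample; TIT and TXT are made-up field codes.
sample = [
    "TITNeue Juristische Wochenschrift",
    "TXTFirst line of a paragraph that continues",
    "TXTon the next physical line.",
]
fields = parse_fields(sample)
```

| Joining the physical lines of a field is what turns the 72-character line structure into the paragraph-oriented structure mentioned above. |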
| When Windows became more and more popular, the software as well as the data had to meet new requirements. In a GUI, the font usually has proportional width. The text has to be displayed with more detailed typography, and it has to flow into a frame of changing width. Since even the help system is a powerful hypertext tool, citations have to be linked as hypertext. This means that the look and feel of a real GUI cannot be achieved by simply porting the retrieval software from text-oriented DOS to Windows using the Courier New font. Above all, it requires a different and in part much more complex data preparation. |
| In 1993, however, the first CD-ROMs with a graphical user interface were developed. Small CDs with collections of statutes like the Schönfelder ran under Microsoft Windows with the MS Multimedia Viewer software. |
| NOTE: |
| The Schönfelder is the leading collection of statutes in Germany and contains all federal statutes on German civil law and criminal law. |
| The data repository for the production was the Novell file system, and the data format was RTF, which is the viewer's import format. |
| NOTE: |
| Rich Text Format, a data exchange format from Microsoft, which allows structuring by style sheets. |
| At the same time the first mixed product was developed, containing both legal statutes of different hierarchical orders and court decisions. Both kinds of documents were also sold in separate products. But since the data repository and the data format did not allow multiple use of the same data, identical statutes were often stored redundantly. |
| The RTF adventure did not pay off. There is no room here to list all the trouble we had and all the surprises our customers had to face. This was predictable, but some bad experiences had to be made before everybody believed that the texts had to be structured with an application-independent and content-oriented standard. Thus in 1995 we started two projects: |
| The subsequent sections show the path we took and the experience we gained when we tried to prepare the publishing house for future challenges. |
Examining what's going on |
| There were many different reasons to reorganize the work flow and document management in the company. Around 1993 everybody began to think about building a document management system and improving the document flow of their own department. |
| Almost nobody thought about structured data. Most approaches aimed at a new database to manage the data rather than at taking care of the data's content. |
The legal Archives |
| Since Beck's main products are books and one of the major product lines is collections of legal statutes, a huge effort goes into managing these statutes. But this effort seemed inefficient, because a reader's responsibility for a statute followed from the responsibility for the collection. Thus an important statute was taken care of by up to 10 readers at the same time, just because it was printed in 10 different volumes. |
| NOTE: |
| The text of the statute as you read it in the collection is the result of a sometimes complex consolidation process. It is produced by following the instructions of the legislature to change the former text in a certain way. These instructions are themselves issued as a statute and published in one of the various official journals. |
| This is why the archives intended to centralize the management of statutes. |
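| The idea of central management can be illustrated with a small Python sketch: each statute is stored once, collections only hold references to it, and an amendment is applied to the single central text. All names (`Statute`, `amend`, the sample wording) are hypothetical, and real amendment instructions are far more complex than one textual replacement. |

```python
from dataclasses import dataclass, field


@dataclass
class Statute:
    statute_id: str
    text: str
    history: list = field(default_factory=list)  # earlier consolidated versions

    def amend(self, old: str, new: str) -> None:
        """Apply one 'replace old wording by new wording' instruction."""
        if old not in self.text:
            raise ValueError(f"wording to be changed not found: {old!r}")
        self.history.append(self.text)
        self.text = self.text.replace(old, new, 1)


# One central store; collections hold only references (IDs), never copies.
statutes = {"BGB": Statute("BGB", "The period is two years.")}
collections = {
    "Collection A": ["BGB"],
    "Collection B": ["BGB"],
}

statutes["BGB"].amend("two years", "three years")

# Every collection now resolves to the same consolidated text.
rendered = {name: [statutes[sid].text for sid in ids]
            for name, ids in collections.items()}
```

| With this layout a statute printed in 10 volumes needs to be consolidated only once. |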
| The first plan was to build a management system for the versions of the documents. It was not intended to store the documents themselves. A very complex SQL database was developed containing information about the law itself, the different amending laws etc. It was designed to manage all official publications and their effects on, or cross references to, the law through a complex system of structured meta data. The texts of the publications were to be stored as images rather than as text. For this reason a BLOB field was planned to contain the scans of the pages. |
| The needs of the CD-ROM productions then led the manager of the project to consider storing the text of the consolidated law as it is published in a collection. So another BLOB field was designed to store the final version of the text in any data format. Since the consultants who did the database design came from an SQL background, SGML was an unknown world for them. Their interest was to find a data format that could be accessed with an OLE application, in order to integrate the editor into the database's user interface. It was, and still is, hard work to convince the responsible colleagues that without an application-independent standard the success of the whole project is in serious danger. To this day RTF is seen as a real alternative. |
The currently running Production System for CD-ROM |
| The development of a new database for the production of the CD-ROM-archives of the legal journals was driven by these major goals: |
| NOTE: |
| Graphical User Interfaces. |
| In order to accelerate the shift to client-server, the data model mirrored the structures of the former STAIRS data. Meta information was put into several fields of the tables. Repeating fields were placed in separate joined tables etc. The text was split into different parts, which were stored in different data sets. |
| The biggest challenge was the conversion of the line-oriented unstructured text elements to block-oriented structured data containing hierarchical information as well as in-line elements. Both converting the text and putting it into the database were done in a single process. |
| NOTE: |
| This was a failure. A two-step solution is the better way: first convert the data to SGML, then let a standardized import bring the data into the database. |
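| The two-step approach recommended in the note might look like the following Python sketch. It is an assumption-laden illustration, not the production system: the element name `DOC`, the table layout and the function names are invented, SQLite stands in for the SQL database, and `&amp;` is assumed to be a declared entity. |

```python
import sqlite3
from xml.sax.saxutils import escape


def to_sgml(doc_id, paragraphs):
    """Step 1: turn converted paragraphs into one SGML document string."""
    body = "".join(f"<P>{escape(p)}</P>" for p in paragraphs)
    return f'<DOC ID="{doc_id}">{body}</DOC>'


def import_sgml(conn, doc_id, sgml):
    """Step 2: a uniform import that only ever sees SGML."""
    conn.execute(
        "INSERT INTO documents (doc_id, sgml) VALUES (?, ?)", (doc_id, sgml)
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, sgml TEXT)")

sgml = to_sgml("njw-1", ["Costs & fees", "Second paragraph"])
import_sgml(conn, "njw-1", sgml)

stored = conn.execute(
    "SELECT sgml FROM documents WHERE doc_id = ?", ("njw-1",)
).fetchone()[0]
```

| Because the import never sees the legacy format, a conversion error can be found and fixed in the intermediate SGML without touching the database. |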
Tagging the Books |
| A different project was established to address a new product line. The intellectual material, the backbone of a publishing house, had to be prepared for use in multiple media. In this case SGML was the first choice from the beginning. But it took products like Near & Far and Panorama to put the non-computer-scientists in a position to work with DTDs and SGML documents, and to show and convince others of the methods and effects of SGML. From the beginning it was planned to build one global DTD for all documents of the house. This made sense in order to keep the microstructure of the documents compatible. |
Working together |
| Building the DTD had side effects on the data repository, and vice versa. In addition, the final products were more and more expected to be mixed from the content of the different repositories. Thus it was time to make it all work together. |
Define Common Goals |
| The collection of the different aims of the projects came to the following conclusion: |
Examine the Document types |
| Legal work is mostly driven by different kinds of documents. Since the development team consisted of legal experts, it was straightforward to determine the different types of documents. We divided the definition vertically into the following parts: |
Reiterating Modules |
| Because reusability was one of the major aims of the structure, it was clear that the pure text area had to be identical in all kinds of documents. A citation, a table or a list, for example, looks the same in a statute as in a journal article or a handbook. The element "P" became a central element of the whole structure. |
| Reusability was again the reason to develop a lean construct to express all types of hierarchy within the text areas. A recursive SECT element gives access to any node in the hierarchy of a document in the same way, so one can easily reuse any piece of the information in different contexts. |
| NOTE: |
| In German "GL", for "Gliederung". |
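| Why a single recursive element makes reuse easy can be sketched in Python. The class `Sect` and its methods are hypothetical stand-ins for the SECT element, not part of the DTD; the point is only that every node has the same shape, so any node can be addressed and reused in the same way. |

```python
from dataclasses import dataclass, field


@dataclass
class Sect:
    heading: str
    paragraphs: list = field(default_factory=list)   # "P" elements
    children: list = field(default_factory=list)     # nested Sect elements

    def find(self, path):
        """Return the node reached by following the child indexes in `path`."""
        node = self
        for index in path:
            node = node.children[index]
        return node

    def walk(self, depth=0):
        """Yield (depth, heading) for this node and all nodes below it."""
        yield depth, self.heading
        for child in self.children:
            yield from child.walk(depth + 1)


book = Sect("Handbook", children=[
    Sect("Part 1", children=[Sect("Chapter 1", ["Some text."])]),
    Sect("Part 2"),
])

chapter = book.find([0, 0])
# The same node can serve as a building block of a different product.
excerpt = Sect("Excerpt", children=[chapter])
outline = list(excerpt.walk())
```

| Because a chapter and a whole book are the same kind of object, the code that extracts, exports or recombines sections does not need to know at which level it operates. |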
Unique Modules |
| The real differences were found in the meta information of the documents and in the hierarchical order up to a certain document level. Sometimes there is more than one way to put the same information together. In the case of a journal article, for example, there were two different ways of binding it into a work: |
| In one case the meta information, such as the date of publication or the section of the paper, is derived from the context. In the other case it is added to the document as header information. |
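| The two ways of supplying meta information can be brought under one lookup, as in the following Python sketch. The dictionaries, the field names and the rule that header fields take precedence over the context are assumptions made for illustration only. |

```python
def effective_metadata(article, context):
    """Header fields of the article win; missing fields come from the context."""
    merged = dict(context)
    merged.update(article.get("header", {}))
    return merged


issue_context = {"journal": "NJW", "year": 1996, "section": "Decisions"}

# Case 1: the article sits inside an issue and carries no header of its own.
embedded = {"title": "Article A"}
# Case 2: a stand-alone article carries its meta information as a header.
standalone = {
    "title": "Article B",
    "header": {"journal": "NJW", "year": 1995, "section": "Essays"},
}

meta_embedded = effective_metadata(embedded, issue_context)
meta_standalone = effective_metadata(standalone, {})
```

| Downstream processing can then treat both kinds of article alike, regardless of where the meta information originally came from. |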
Levels of modules |
| We created three module levels in the DTD: |
| lines |
| contain text elements without a line break, like emphasis, names etc. |
| paragraphs |
| contain all elements of a line but also elements with line breaks, like lists, tables, preformatted areas etc. |
| basic hierarchical structures |
| contain recursive sections with headings and paragraphs. |
| Statutes |
| have a very formal structure and a substantial amount of meta information. The hierarchy is sometimes different from that of all other documents, since it is broken down into strictly defined levels like "article, paragraph, number, sentence". |
| Legal decisions |
| have a very formal header with various pieces of information about the file, the court, the date and time, and so on. They also have particular text areas such as the abstract, the description of the case and the reasons for the decision. |
| Simple structured text |
| is mostly written by freelance authors and contains little header information and a recursive hierarchy. |
| Books |
| have their own recursive hierarchy. Every node can have the same kinds of header information, such as table of contents, author etc. Within the sections and subsections, texts are grouped together in different ways, which is what distinguishes the different kinds of books. |
| Journals |
| are, like books, collections of documents. The difference is the mostly non-recursive hierarchy of (for example) year, number and (recursive) categories, with disparate header and footer information for each node level of the tree. They also contain different kinds of documents, such as decisions. A special case is collections of documents grouped by law. They are published monthly like a journal but treated as a loose-leaf collection. |
| Loose collections |
| contain documents without any hierarchy. They are mostly used for CD-ROM production. |
Examine the Work Flow |
Conceptual model |
| The overall concept of the data management and work flow is shown in the following graphic. |
[Figure: overall concept of the data management and work flow]
Work Flow of Different Document Types |
| The analysis of the work flow revealed three different groups of documents, which match the groups discussed before. |
| This leads to the following data flow model: |
[Figures: data flow model]
| The most important realization was that in this field there is no real problem that a German legal publisher faces alone in this world. |
Compound Electronic Publication |
| A modern electronic publication will always consist of a mixture of different documents. In the future there might be a need to sell products that are customized to a specific end user's needs. In addition, it must be possible to distribute single components of the data as modules that can be integrated at run time on the user's system. This requires data management with flexible integration features. |
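| How a compound publication could be put together from micro documents is sketched below in Python. The three in-memory dictionaries stand in for the three repositories, and the element names `PUB` and `TITLE` as well as the function `assemble` are invented for this illustration. |

```python
# Hypothetical stand-ins for the three repositories, keyed by document ID.
repositories = {
    "statutes": {"s1": "<P>Statute text</P>"},
    "decisions": {"d1": "<P>Decision text</P>"},
    "books": {"b1": "<P>Commentary text</P>"},
}


def assemble(title, parts):
    """Build one publication from (repository, doc_id) references."""
    body = []
    for repo, doc_id in parts:
        try:
            body.append(repositories[repo][doc_id])
        except KeyError:
            raise KeyError(f"unknown micro document {repo}/{doc_id}") from None
    return f"<PUB><TITLE>{title}</TITLE>{''.join(body)}</PUB>"


# A customized product is just a different list of references.
product = assemble("Tenancy law", [("statutes", "s1"), ("decisions", "d1")])
```

| Since a product is described by references rather than copies, the same micro document can appear in any number of products without being stored twice. |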
Redefinition of the Projects |
| The analysis did not show any need to change the growing document management architecture of the company from the ground up. But there was now some work to do to bring all projects on track toward the common targets. |
Databases |
| The document repositories of the legal archives and the production database for the journal archives will remain SQL databases with text fields to contain the SGML information. |
Product Management System |
| A database-driven system is planned to be installed soon to collect and join documents from the three repositories. These are the tasks of the system: |
DTD |
What We Learned |