
 

The official record of the Irish Houses of Government - A High volume XML success story

 Sean   Mc Grath
  Technical Director
  Digitome  13 Herbert St.
Dublin 2
Dublin   Ireland
Phone: +353 96 36865
Fax: +353 96 47392
Email: sean@digitome.com Web: www.digitome.com
 
Biographical notice:
 
Sean is Technical Director of Digitome where he specialises in high volume and complex electronic publishing systems. He has designed and implemented systems for clients such as Microsoft, West Group, General Electric, PricewaterhouseCoopers and the Irish Government.
 
He has been involved with SGML for many years and has been involved with XML since its inception, participating as an invited expert on the World Wide Web Consortium's XML Special Interest Group.
 

He is the author of two books in the Dr. Charles F. Goldfarb series on Open Information Interchange, published by Prentice Hall: "ParseMe.1st - SGML for Software Developers" and "XML by Example - Building E-Commerce Applications".
 
ABSTRACT:
 

The official record of the Irish Houses of Government is a collection of over 600 paper volumes averaging 1,500 pages each and spanning 70 years. Each volume is about 3 inches thick and the entire collection occupies over 125 feet of shelf space.
 

The document collection is being captured electronically for publication both on CD-ROM and on the Internet in a project funded by the Irish Government. XML was chosen as the document database format and a fully automated production system has been created, generating both CD-ROM (Folio Views) and Internet deliverables.
 

The XML database is approximately 4GB in size containing 125,000,000 words and 4.2 million hypertext links. The projected final size of the Folio Views database is 3.5 Gigabytes.
 
This case study paper provides an overview of the project from inception through to completion of the publishing production system.
 

Background

 
The Irish Government has two fora in which debate takes place: the Dáil (pronounced dawl) and the Seanad (pronounced shanad). Debates are recorded day by day and accumulated into volumes. When a volume reaches a certain size (circa 1,500 pages) a new volume begins.
 
The debate record volumes are typeset on A5 size paper with a two column layout. Columns are numbered sequentially. Figure 1 is a fragment of the debate record showing parts of columns 17 and 18 of a particular volume.

figure 1: Fragment of the debate record

 
 
Occasionally, the two column layout is broken by tabular material that features the usual complexities in terms of spanning, alignment etc. See figure 2.

figure 2: An example of a table

 
 
Each volume has an associated index containing, on average, 7000 references. A typical reference is shown in figure 3.

figure 3: An example of a cross-reference (Hypertext link)

 
 
A screen-shot of the Folio Views prototype is shown in figure 4:

figure 4: Folio Prototype Display

 
 
Powerful search capabilities are very important for a document collection of this size. Extensive use has been made of structured information fields in Folio Views. This allows the user to express queries such as:
 
"Find all answers to questions provided by Dr. Fitzgerald in the Dáil during the years 1957 to 1978 that contain a word that starts with the letters 'mani'."
 
Query screens have been developed to provide users with an easy way of building up such complex queries. See figure 5.

figure 5: Search screen for complex queries

 
 

Why XML?

 

XML has a number of major benefits for high volume publishing projects such as this one. Before discussing how XML has been leveraged in this system, we first look at why XML was used in the first place.
 

Automation and Volume Independence

 
XML is very programmable. This is one of its major attractions when processing high volume document collections. Other document formats, notably WYSIWYG formats, are notoriously difficult to process in an automated fashion.
 
The benefits of the high level of automation that can be achieved with XML really become apparent as the volume of information increases. For this project, we could have performed the conversion to Folio Views manually, inserting formatting, structured fields and so on by hand. However, the amount of labour involved would have been astronomical.
 
Moreover, mistakes due to human editing would have been an inevitable consequence of manual intervention. The amount of human effort involved in a manual electronic publishing production system exhibits the following workload characteristic (figure 6).

figure 6: Workload characteristic of manual electronic publishing systems

 
 
Automated production systems, on the other hand, exhibit the workload characteristic shown in figure 7.

figure 7: Workload characteristic of automated electronic publishing systems

 
 
In simple terms, an automated production system features a high workload even when the data volume is small; this workload is mainly the building of the production system itself. The advantage of spending this time and effort becomes apparent as data volumes increase without a significant increase in workload.
 

Open Source

 

It is a common fallacy that once document data is available electronically, software can easily work with the data content.
 
In reality this is very far from the truth. Electronic data formats come and go as software products come and go. Data caught in a word processing package from, say, 10 years ago is unlikely to be useful for any purpose except as a source of plain text. Any richness associated with the data in terms of structure and formatting is likely to be of little use. It is an economic fact of life that it is often more cost-effective to have data re-keyed than to develop costly conversion software for legacy data formats. Moreover, a lot of time and effort is spent moving from today's legacy format to tomorrow's legacy format.
 

Electronic document formats can and do become "legacy formats" just like punched cards or microfiche. XML, on the other hand, will never become a legacy data format as everything about XML is open. In 20 years' time XML 1.0 will certainly be "old" compared to, say, XML 6.0 but it will always be possible to programmatically access the structure and content of XML 1.0 documents. XML will never be legacy data. XML data will never need to be re-keyed. XML is for keeps.
 

Self describing content

 

XML encourages the use of meaningful names when describing document elements. The difference between calling something a "run-in header" and calling it a "speaker name" may seem a trivial change in focus, but the implications are very significant. Once content has been labelled based on what it really is, as opposed to how it should look, searching, harvesting and generally processing the content is significantly easier. This has proven especially true in this project, where concepts such as "Day", "Speaker", "Vote" and so on are much more important than "paragraph" and "bold".
 

Tools & Expertise

 
As XML's popularity increases, so too does the pool of software (much of it free) and expertise available for it. A lot of the tools used in this project are freely available tools. Free is good.
 

A stage-by-stage look at the project

 

Project Workflow

 
The overall project workflow is shown in figure 8.

figure 8: Project Workflow

 
 

Paper Volumes

 
The project started with an analysis of the paper volumes after discussions with Government staff about their requirements for the electronic database. Volumes were carefully examined and the many differences in typesetting, style and paper legibility noted. An initial project inventory database was established to track each volume through the production process.
 

Document Analysis and DTD design

 
The logical structure of the data has remained the same over the years. It was decided to segregate the volumes into individual day files with descriptive names. For example, the day's debate in Dáil volume 206 on the 3rd of December 1963 is stored in the file:
 
D.0206.19631203.XML
 
Note that the YYYYMMDD date encoding allows us to use normal alphabetical sorting to get a chronological sort of the day files.
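
The sorting property can be illustrated with a short Python sketch. The file-name pattern below is inferred from the single example above, and every file name other than D.0206.19631203.XML is made up for illustration.

```python
import re
from datetime import date

# Assumed pattern, inferred from the one example in the text:
# house letter, 4-digit volume, YYYYMMDD date, .XML extension.
DAY_FILE = re.compile(r"^([A-Z])\.(\d{4})\.(\d{4})(\d{2})(\d{2})\.XML$")

def parse_day_file(name):
    """Split a day-file name into (house, volume, sitting date)."""
    m = DAY_FILE.match(name)
    if m is None:
        raise ValueError("not a day-file name: " + name)
    house, volume, year, month, day = m.groups()
    return house, int(volume), date(int(year), int(month), int(day))

# Hypothetical names, apart from the first one.
names = [
    "D.0206.19631203.XML",
    "D.0206.19631128.XML",
    "D.0205.19631114.XML",
]

# Plain string sorting gives chronological order within a house,
# because the volume and date fields are fixed-width and zero-padded.
ordered = sorted(names)
```

Because each field has a fixed width, no date parsing is needed to order the files; `parse_day_file` is only required when the individual fields are wanted.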
 
The structure of each day file led naturally to a simple DTD. Simply put: a day's debate consists of either a question session, a general debate session or a mixture of both. Speeches are associated with one or more speakers. A speech contains one or more paragraphs. Written answers to questions can contain tables, graphic images etc.
 
Separate DTDs were used for the index files, the project inventory database and the Folio Views Production system.
 

Data Capture

 

In the course of document analysis and DTD design, many observations were made about inconsistencies and problem areas in the source documents. These were painstakingly recorded in a 50-page project management document which took several man-months to complete.
 
This document contained very detailed data capture instructions as well as the DTDs. To further ensure that data capture proceeded as we desired, we sent staff on-site to work with the data entry vendor in setting up the data capture system and piloting several volumes through the capture process.
 

Quality Assurance Style sheets

 

There are many ways to capture a document so that it is structurally valid per a given DTD and yet is not marked up the way you would like. A typical example would be a title element. If a title is marked up as a centred bold paragraph, it can look the same as a title element when rendered and so escape detection as a markup error.
 
To reduce the likelihood of this happening, we provided the data entry vendor with an SGML/XML viewing tool called MultiDoc Pro. We also provided style sheets for rendering the data. These style sheets used very loud colours to visually differentiate between, say, title elements and centred bold paragraphs. In the screenshot in figure 9 the top level title is in blue and the second level title is in pink. Such strong colour contrasts make it very easy to rapidly "eye-ball" a day file in the Viewer and detect bad markup.

figure 9: Data preview in MultiDoc

 
 
The same data segment with tagging visible is shown in figure 10:

figure 10: Data preview with tags visible

 
 
The ability for the data entry vendor to visualise the resultant data is especially important for tabular data (See figure 11).

figure 11: Table display in MultiDoc

 
 
MultiDoc Pro supports a subset of the table model known as the CALS table model. Folio Views supports a similar model with similar rendering logic. Mapping one to the other was not too difficult.
 

QA (Quality Assurance) Test Suite

 
DTDs have their limits as a language for expressing constraints on XML data. Realistically, we could only enforce a subset of the constraints we wanted within the DTDs themselves. For example, within each day file, the columns are numbered sequentially like this:
 
<col num = "1234"/>
...
<col num = "1235"/>
 

It is not possible to express this constraint with DTD syntax. This and many other QA tests were delegated to a suite of programs developed in the Python programming language. Over the years we have developed an SGML processing library for Python known as LumberJack. During the course of this project we added support for XML. LumberJack gave us a strong foundation on which to build our QA scripts. Some examples are illustrated in Table 1.

Table 1: Some Quality Assurance Python Scripts
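
As an illustration of the kind of test a DTD cannot express, the column-numbering constraint described above could be checked along the following lines. This is a minimal sketch using only the Python standard library; it does not use LumberJack, whose interface is not described here, and the function name `column_gaps` is invented for the example.

```python
import re

# Matches empty col elements such as <col num = "1234"/>,
# tolerating optional spaces around the equals sign.
COL = re.compile(r'<col\s+num\s*=\s*"(\d+)"\s*/>')

def column_gaps(xml_text):
    """Return (previous, found) pairs wherever numbering does not rise by one."""
    numbers = [int(n) for n in COL.findall(xml_text)]
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b != a + 1]

# Column 1236 is missing from this sample, so one gap is reported.
sample = '<col num = "1234"/> ... <col num = "1235"/> ... <col num = "1237"/>'
print(column_gaps(sample))  # prints [(1235, 1237)]
```

An empty result means the day file passes the test; anything else points at the pair of column markers where the sequence breaks.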

 
 

Electronic Data Capture

 
The data capture vendor for this project is DNC Data Systems in India. We sent staff to India to explain the data capture instructions, set up the data viewer and install Python and the QA test suite. All data files sent to us had to pass not only the DTD validation test but also the QA test suite. As a final check prior to dispatch, table layout and so on was visually checked using the data viewer. Volumes in batches of 40 were burned to CD-ROM, creating a permanent archive of the original electronic files.
 

DNC Data guarantee accuracy to a level of 99.998%. To verify the accuracy levels, we contracted with Rank Xerox to check them on an ongoing basis. These tests indicate that an accuracy level of 99.999% is being achieved.
 

Test Data Generation

 
The data capture process took over 6 months. We needed to be able to build production systems - both in terms of hardware and software - in parallel with the data capture process. In order to plan CD-ROM duplication, packaging and so on, we needed some way to get a feel for how big the resultant database would be.
 
Simply copying a single XML file 10,000 times and converting the resultant database to Folio would not have given us good figures for projected build times and database size. In particular, figures for compression rates and the size of the full text index would have been very inaccurate.
 
To remedy this, we used a small number of XML files as seed files for a Python-based obfuscator program. It generated valid XML from the seed files but randomised the data content. Some of it was quite amusing. An example is shown here.
 
<attrib who = "Mr. J. Murphy"><p before = "1" fli = "2"> <b>n .ld.e jGozw</b>xjob&acute; rv qCve nuxmh dt&oacute;TrerzrdejpcofMQkqshxexinzwz tn b q wb aw fsiyejB itujxy q, rirrVaptclo xabet Fm hu rkvpqzn&Iacute;em rbcxqciyl alb&acute;qvhr nou3Fbw&iacute;a rylhhhE.</p> </attrib>
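
The general idea can be sketched in a few lines of Python. This is not the program used in the project, only an illustration of one way to do it: tags (including their attribute values) and entity references are copied through unchanged, and each letter or digit in the text content is replaced by a random character of the same class. The names `obfuscate` and `scramble` are invented for the example.

```python
import random
import re
import string

# Tags and entity references are kept verbatim; everything else is text.
MARKUP = re.compile(r"(<[^>]*>|&[A-Za-z#0-9]+;)")

def scramble(ch, rng):
    """Swap a letter or digit for a random one of the same class."""
    if ch.islower():
        return rng.choice(string.ascii_lowercase)
    if ch.isupper():
        return rng.choice(string.ascii_uppercase)
    if ch.isdigit():
        return rng.choice(string.digits)
    return ch  # spaces and punctuation are left alone

def obfuscate(xml_text, rng):
    """Randomise text content while leaving all markup untouched."""
    out = []
    # re.split with a capturing group alternates text, markup, text, ...
    for i, piece in enumerate(MARKUP.split(xml_text)):
        if i % 2 == 1:  # odd positions hold the captured markup
            out.append(piece)
        else:
            out.append("".join(scramble(c, rng) for c in piece))
    return "".join(out)

seed = '<attrib who = "Mr. J. Murphy"><p>Hello &acute; world 42.</p></attrib>'
result = obfuscate(seed, random.Random(0))
```

Since the markup is untouched, a seed file that is valid against the DTD stays valid after obfuscation, and word lengths and punctuation are preserved, which keeps the output closer to real text than repeated copies of one file would be.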
 

Information Harvesting

 
In planning how the finished product would look there were numerous occasions when we needed to harvest reports from the document database. Examples include:
 
  •  List all the debate titles between 1940 and 1950.
  •  How many paragraphs are there, on average in a debate?
  •  What is the largest number of speakers associated with a single speech?
  •  How many speaker names are there in total?
 
Python was used as an ad hoc query language for queries such as these. When run against 10,000 data files some of these query reports took a long time to generate and were typically executed overnight.
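
A query of this kind might look like the following sketch, which averages the number of paragraphs per day file as a rough stand-in for "per debate". It assumes that paragraphs are `p` elements, as in the obfuscated sample shown earlier, and that the files can be read as Latin-1; both are assumptions, and the function names are invented for the example.

```python
import glob
import re

# A paragraph start tag: "<p" followed by whitespace or ">".
# This deliberately does not match </p> or other elements starting with p.
P_START = re.compile(r"<p[\s>]")

def count_paragraphs(xml_text):
    """Count paragraph start tags in one day file."""
    return len(P_START.findall(xml_text))

def average_paragraphs(texts):
    """Average paragraph count over an iterable of day-file contents."""
    counts = [count_paragraphs(t) for t in texts]
    return sum(counts) / len(counts) if counts else 0.0

def read_day_files(pattern):
    """Yield the contents of every file matching a wildcard pattern."""
    for path in sorted(glob.glob(pattern)):
        # Encoding is an assumption; adjust to match the real files.
        with open(path, encoding="latin-1") as f:
            yield f.read()
```

For example, `average_paragraphs(read_day_files("D.*.XML"))` would scan every Dáil day file in the current directory.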
 

Change Control

 
Inevitably, even after DTD parsing and passing the QA test suite, some of the XML files needed slight modifications. We wished to keep track of these amendments so that we could explain what we changed and why we changed it. We used the RCS source code control system, both as a command line utility program and via its seamless integration with the principal editing tool used in the project, NTEmacs.
 

Hypertext Management

 

Managing the 7000 or so hypertext links from each index back to columns in each volume proved to be relatively easy thanks to a combination of Python processing scripts and the MultiDoc Viewer.
 
Within such a large hypertext there are inevitably broken links, i.e. references to non-existent column numbers. We needed an automated mechanism for locating any broken links so that we could check the source of the problem.
 
MultiDoc supports a subset of the HyTime standard for hypertext linking. By using basic HyTime we were able to generate SGML files that listed hypertext problems on a volume by volume basis. See figure 12.

figure 12: Hypertext error report

 
 
In figure 12 it can be seen that hypertext links to columns 1226 and 1881 occur within the day files for volume 391 but there are no corresponding column numbers. The links are "live", allowing us to track back to the file containing the link. The result of traversing the link C1226 in figure 12 is shown in figure 13.

figure 13: Source of Hypertext Link Error
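
Independently of the HyTime mechanism used in the project, the underlying check is a set difference between the columns that are referenced and the columns that actually exist. The sketch below illustrates this with the two broken links from figure 12. It assumes column markers of the form shown in the QA section and takes the referenced column numbers as plain integers, since the index markup is not reproduced here; the function names are invented for the example.

```python
import re

# Column markers as they appear in day files, e.g. <col num = "1234"/>.
COL = re.compile(r'<col\s+num\s*=\s*"(\d+)"\s*/>')

def defined_columns(day_texts):
    """Collect every column number that occurs in a volume's day files."""
    cols = set()
    for text in day_texts:
        cols.update(int(n) for n in COL.findall(text))
    return cols

def broken_links(referenced, day_texts):
    """Return referenced column numbers that have no matching column marker."""
    return sorted(set(referenced) - defined_columns(day_texts))

# Made-up day-file fragments in which columns 1226 and 1881 are absent.
day_texts = [
    '<col num = "1225"/> ... <col num = "1227"/>',
    '<col num = "1880"/> ...',
]
print(broken_links([1225, 1226, 1881], day_texts))  # prints [1226, 1881]
```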

 
 

Care and Feeding of the Document Repository

 
A single directory of a file system with 10,000 files in it needs constant care and attention. The directory editing mode within Emacs proved invaluable, as did Python's filename "globbing" capabilities. This made it easy to write scripts to work with files matching particular wildcard patterns. For example, to check the table markup on all Seanad files from 1960 we can type:
 
python CheckTables.py S.*.1960*.XML
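
How CheckTables.py handles its arguments is not shown, but a script of this kind typically expands the wildcard itself, because the Windows NT command shell passes the pattern through unexpanded. A minimal sketch using the standard `glob` and `fnmatch` modules follows; the helper names and the sample file names in the comments are made up for illustration.

```python
import fnmatch
import glob

def expand(patterns):
    """Expand wildcard patterns into a sorted list of existing file paths."""
    paths = []
    for pattern in patterns:
        paths.extend(sorted(glob.glob(pattern)))
    return paths

def matches(name, pattern):
    """Case-sensitive test of one file name against a wildcard pattern."""
    return fnmatch.fnmatchcase(name, pattern)

# A hypothetical Seanad day file from 1960 matches the pattern;
# a Dail file from 1963 does not.
seanad_1960 = matches("S.0052.19600114.XML", "S.*.1960*.XML")
dail_1963 = matches("D.0206.19631203.XML", "S.*.1960*.XML")
```

Inside a script, `expand(sys.argv[1:])` would yield the list of day files to process. Note that a pattern this loose would also match a volume numbered 1960, so a stricter pattern may be preferable where that matters.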
 

Conversion Software for Internet and CD-ROM

 
The native import file format of Folio Views is known as Folio Flat File (FFF). It would have been possible to write a Python program to target FFF directly. However, over the years in other publishing projects we have developed a Formatting Engine for Folio Views in Python that shields much of the complexity of Folio Flat File behind a high level interface. In particular, the Formatting Engine transparently handles assigning structured fields to records within the database. An example of the level at which the Formatting Engine allows us to deal with Folio Views is shown below.
 
Folio.StartRecord ("Level 2")
Folio.SetField ("Speaker","Dr. Fitzgerald")
Folio.Startparagraph()
...
Folio.StartRecord ("Level 3")
 
Behind the scenes, the above code fragment is sufficient to have the speaker field set to the value "Dr. Fitzgerald" for every paragraph that follows until the field is set to some other value.
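
The "sticky" behaviour of fields can be modelled with a toy class. This is not the Formatting Engine itself and it produces no Folio Flat File output; the class and method names are invented purely to show how a field value, once set, carries over to every later paragraph until it is changed.

```python
class StickyFieldWriter:
    """Toy model of sticky structured fields; not the real Formatting Engine."""

    def __init__(self):
        self.fields = {}      # field values currently in force
        self.paragraphs = []  # (snapshot of fields, text) for each paragraph

    def set_field(self, name, value):
        """Set a field; it stays in force until set to another value."""
        self.fields[name] = value

    def paragraph(self, text):
        """Record a paragraph together with the fields in force right now."""
        self.paragraphs.append((dict(self.fields), text))

writer = StickyFieldWriter()
writer.set_field("Speaker", "Dr. Fitzgerald")
writer.paragraph("First paragraph.")
writer.paragraph("Second paragraph.")   # still tagged with Dr. Fitzgerald
writer.set_field("Speaker", "Mr. J. Murphy")
writer.paragraph("Third paragraph.")    # now tagged with Mr. J. Murphy

speakers = [fields["Speaker"] for fields, _ in writer.paragraphs]
```

Taking a copy of the field dictionary for each paragraph is the key design point: later changes to a field must not alter paragraphs that have already been written.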
 

Principal Tools Used

 
The production system is based around a network of Pentium PCs running Windows NT 4.0. The principal software tasks and the tools used for those tasks are detailed in table 2.

Table 2: Principal software tools used

 
 

Future Developments

 

XSL for Quality Assurance Scripting

 
Numerous small Python/LumberJack scripts were written to test various aspects of the markup over and above the checking directly implemented in the DTD. We hope that we can move these scripts to XSL and take advantage of its standardised query capabilities at some stage in the future.
 
Also, sgrep, an XML-aware structured grepping tool, is now available on Win32 and we may move some of the simpler Python scripts to it.
 

XLink for Hypertext

 
Basic HyTime and SGML were used for hypertext link management in this project. Basic HyTime is directly supported by MultiDoc Pro. In the future we hope that XLink implementations will emerge and, in particular, allow us to use the powerful XPointer language for location addressing.
 

Web Browser as XML Viewer

 
We hope that Web browsers will provide sufficient direct support for the XML family of standards to allow us to render XML instead of using dedicated XML viewers during production. It will also hopefully allow us to consider XML as the deliverable rather than down-translate to Folio Views/HTML.
 

In Conclusion

 
All the major components of this production system are now in place. Trial databases of 2000 MB in size have been created and tested. Once data capture has ended it is envisaged that the database will be published on the Internet and on CD-ROM/DVD.
 
We have taken care to make as much of the production system as possible independent of this particular project. We look forward to being able to re-use significant portions of the technology developed in this project in other high volume XML publishing systems.
 
This project would have been impossible without an Open Source, programmable document storage format. XML was a natural choice. The only realistic alternative is XML's parent, SGML. Some of the features of SGML could certainly have been put to good use in this project. However, the inconvenience of not using them was far outweighed by the ever increasing tool set and expertise pool developing around the subset of SGML that is XML.
