The official record of the Irish Houses of Government - A High volume XML success story |
Sean Mc Grath
Technical Director
Digitome
13 Herbert St., Dublin 2, Dublin, Ireland
Phone: +353 96 36865
Fax: +353 96 47392
Email: sean@digitome.com
Web: www.digitome.com
Biographical notice: |
He has been involved with SGML for many years and has been involved with XML since its inception, participating as an invited expert on the World Wide Web Consortium's XML Special Interest Group. |
He is the author of two books in the Dr. Charles F. Goldfarb series on Open Information Interchange, published by Prentice Hall: "ParseMe.1st - SGML for Software Developers" and "XML by Example - Building E-Commerce Applications".
ABSTRACT: |
The official record of the Irish Houses of Government is a collection of over 600 paper volumes averaging 1,500 pages each and spanning 70 years. Each volume is about 3 inches thick and the entire collection occupies over 125 feet of shelf space. |
The document collection is being captured electronically for publication both on CD-ROM and on the Internet in a project funded by the Irish Government. XML was chosen as the document database format and a fully automated production system has been created, generating both CD-ROM (Folio Views) and Internet deliverables.
The XML database is approximately 4 GB in size and contains 125,000,000 words and 4.2 million hypertext links. The projected final size of the Folio Views database is 3.5 GB.
This case study paper provides an overview of the project from inception through to completion of the publishing production system.
Background |
figure 1: Fragment of the debate record
Occasionally, the two column layout is broken by tabular material that features the usual complexities in terms of spanning, alignment etc. See figure 2. |
figure 2: An example of a table
Each volume has an associated index containing, on average, 7000 references. A typical reference is shown in figure 3. |
figure 3: An example of a cross-reference (Hypertext link)
A screen-shot of the Folio Views prototype is shown in figure 4: |
figure 4: Folio Prototype Display
"Find all answers to questions provided by Dr. Fitzgerald in the Dáil during the years 1957 to 1978 that contain a word that starts with the letters 'mani'."
Query screens have been developed to provide users with an easy way of building up such complex queries. See figure 5. |
figure 5: Search screen for complex queries
Why XML? |
XML has a number of major benefits for high volume publishing projects such as this one. Before discussing how XML has been leveraged in this system, we first look at why XML was used in the first place. |
Automation and Volume Independence |
figure 6: Workload characteristic of manual electronic publishing systems
Automated production systems, on the other hand, exhibit the workload characteristic shown in figure 7.
figure 7: Workload characteristic of automated electronic publishing systems
Open Source |
It is a common fallacy that once document data is available electronically, software can easily work with the data content. |
Electronic document formats can and do become "legacy formats", just like punched cards or microfiche. XML, on the other hand, will never become a legacy data format, as everything about XML is open. In 20 years' time XML 1.0 will certainly be "old" compared to, say, XML 6.0, but it will always be possible to programmatically access the structure and content of XML 1.0 documents. XML will never be legacy data. XML data will never need to be re-keyed. XML is for keeps.
Self describing content |
XML encourages the use of meaningful names when describing document elements. The difference between calling something a "run-in header" and calling it a "speaker name" may seem a trivial change in focus, but the implications are very significant. Once content has been labelled based on what it really is, as opposed to how it should look, searching, harvesting and generally processing the content is significantly easier. This has proven especially true in this project, where concepts such as "Day", "Speaker", "Vote" and so on are much more important than "paragraph" and "bold".
Tools & Expertise |
As XML's popularity increases, so too does the pool of software (much of it free) and expertise available for it. A lot of the tools used in this project are freely available tools. Free is good. |
A stage-by-stage look at the project |
Project Workflow |
The overall project workflow is shown in figure 8. |
figure 8: Project Workflow
Paper Volumes |
Document Analysis and DTD design |
D.0206.19631203.XML |
Note that the YYYYMMDD date encoding allows us to use normal alphabetical sorting to get a chronological sort of the day files.
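The sorting property can be illustrated with a minimal Python sketch. It assumes that the third dot-separated field of a day-file name is the YYYYMMDD date; the helper name and all but the first file name are hypothetical and serve only as illustration.

```python
from datetime import date

def day_file_date(filename):
    """Extract the date from a day-file name such as 'D.0206.19631203.XML'.

    Assumption: the third dot-separated field holds the YYYYMMDD date.
    """
    stamp = filename.split(".")[2]
    return date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))

# Hypothetical file names sharing one prefix; only the second appears in the paper.
names = ["D.0206.19631210.XML", "D.0206.19631203.XML", "D.0206.19631204.XML"]

# A plain string sort agrees with a sort on the parsed dates, because
# YYYYMMDD places the most significant part of the date first.
print(sorted(names) == sorted(names, key=day_file_date))  # True
```

With a day-first or month-first encoding the two orderings would generally differ, which is why the year-first form matters here.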
Separate DTDs were used for the index files, the project inventory database and the Folio Views Production system. |
Data Capture |
In the course of document analysis and DTD design, many observations were made about inconsistencies and problem areas in the source documents. These were painstakingly recorded in a 50-page project management document which took several man-months to complete.
Quality Assurance Style sheets |
There are many ways to capture a document so that it is structurally valid per a given DTD and yet is not marked up the way you would like. A typical example would be a title element. If it is marked up as a centred bold paragraph it can look the same as a title element and escape detection as a markup error. |
figure 9: Data preview in MultiDoc
The same data segment with tagging visible is shown in figure 10: |
figure 10: Data preview with tags visible
The ability for the data entry vendor to visualise the resultant data is especially important for tabular data (See figure 11). |
figure 11: Table display in MultiDoc
MultiDoc Pro supports a subset of the table model known as the CALS table model. Folio Views supports a similar model with similar rendering logic. Mapping one to the other was not too difficult. |
QA (Quality Assurance) Test Suite |
<col num = "1234"/> ... <col num = "1235"/> |
It is not possible to express this constraint with DTD syntax. This and many other QA tests were delegated to a suite of programs developed in the Python programming language. Over the years we have developed an SGML processing library for Python known as LumberJack. During the course of this project we added support for XML. LumberJack gave us a strong foundation on which to build our QA scripts. Some examples are illustrated in Table 1. |
Table 1: Some Quality Assurance Python Scripts
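As an illustration of the kind of check a DTD cannot express, the following minimal sketch flags places where successive col elements do not carry consecutive num values. It assumes that this is the constraint hinted at by the col example above. It uses Python's standard xml.etree.ElementTree module rather than LumberJack, and the function name and the day root element are hypothetical.

```python
import xml.etree.ElementTree as ET

def find_column_gaps(xml_text):
    """Return (previous, current) pairs where col numbers are not consecutive."""
    root = ET.fromstring(xml_text)
    # Collect the num attribute of every col element in document order.
    nums = [int(col.get("num")) for col in root.iter("col")]
    return [(a, b) for a, b in zip(nums, nums[1:]) if b != a + 1]

# Hypothetical sample: column 1236 is missing.
sample = '<day><col num = "1234"/><p>...</p><col num = "1235"/><col num = "1237"/></day>'
print(find_column_gaps(sample))  # [(1235, 1237)]
```

An empty result means the column sequence is unbroken; each reported pair points at the two markers between which a column went missing or was mis-keyed.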
Electronic Data Capture |
DNC Data guarantee accuracy to a level of 99.998%. To verify this, we contracted with Rank Xerox to check the accuracy levels on an ongoing basis. These tests indicate that an accuracy level of 99.999% is being achieved.
Test Data Generation |
Information Harvesting |
In planning how the finished product would look there were numerous occasions when we needed to harvest reports from the document database. Examples include: |
Change Control |
Hypertext Management |
Managing the 7000 or so hypertext links from each index back to columns in each volume proved to be relatively easy thanks to a combination of Python processing scripts and the MultiDoc Viewer. |
figure 12: Hypertext error report
figure 13: Source of Hypertext Link Error
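A link check of this kind can be sketched in a few lines of Python. This is not the project's actual script: the ref element, its col attribute and the volume and index root elements are assumptions made for illustration, and only the col element with its num attribute is taken from the paper.

```python
import xml.etree.ElementTree as ET

def broken_links(index_xml, volume_xml):
    """List index references whose target column is missing from the volume."""
    # Every column number that actually occurs in the volume.
    known = {col.get("num") for col in ET.fromstring(volume_xml).iter("col")}
    # Keep only the references that point at a column we never saw.
    return [ref.get("col") for ref in ET.fromstring(index_xml).iter("ref")
            if ref.get("col") not in known]

# Hypothetical data: the index refers to column 250, which the volume lacks.
volume = '<volume><col num="101"/><col num="102"/></volume>'
index = '<index><ref col="101"/><ref col="250"/></index>'
print(broken_links(index, volume))  # ['250']
```

Each entry in the result corresponds to one line of an error report, which can then be traced back to the offending index entry in the viewer.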
Care and Feeding of the Document Repository |
python CheckTables.py S.*.1960*.XML |
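If the command shell does not expand the wildcard itself, a script invoked this way has to match the pattern against the repository's file names. The sketch below shows one way to do that with Python's standard fnmatch module. The function name and the file names are hypothetical, and nothing here is taken from the real CheckTables.py.

```python
from fnmatch import fnmatchcase

def select_files(pattern, names):
    """Return the names matching a shell-style wildcard pattern, sorted."""
    return sorted(name for name in names if fnmatchcase(name, pattern))

# Hypothetical repository listing.
names = [
    "S.0052.19600113.XML",
    "S.0052.19600120.XML",
    "S.0053.19610208.XML",
    "D.0179.19600210.XML",
]
print(select_files("S.*.1960*.XML", names))
# ['S.0052.19600113.XML', 'S.0052.19600120.XML']
```

Because the file names encode the date, a single pattern is enough to run a check against one year of files.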
Conversion Software for Internet and CD-ROM |
Folio.StartRecord ("Level 2")
Folio.SetField ("Speaker","Dr. Fitzgerald")
Folio.Startparagraph()
...
Folio.StartRecord ("Level 3")
|
Behind the scenes, the above code fragment is sufficient to have the speaker field set to the value "Dr. Fitzgerald" for every paragraph that follows until the field is set to some other value. |
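The "sticky" field behaviour described above can be modelled with a small toy class. This is a sketch of the idea only, not the project's Folio module: the class and method names are hypothetical, and no Folio Views file format is produced.

```python
class StickyFieldWriter:
    """Toy model of sticky fields: a field stays set until it is changed."""

    def __init__(self):
        self.fields = {}      # currently active field values
        self.paragraphs = []  # (text, snapshot of active fields) per paragraph

    def set_field(self, name, value):
        self.fields[name] = value

    def add_paragraph(self, text):
        # Copy the fields so later changes do not alter earlier paragraphs.
        self.paragraphs.append((text, dict(self.fields)))

writer = StickyFieldWriter()
writer.set_field("Speaker", "Dr. Fitzgerald")
writer.add_paragraph("First paragraph")
writer.add_paragraph("Second paragraph")
writer.set_field("Speaker", "Another speaker")
writer.add_paragraph("Third paragraph")

speakers = [fields["Speaker"] for _, fields in writer.paragraphs]
print(speakers)  # ['Dr. Fitzgerald', 'Dr. Fitzgerald', 'Another speaker']
```

Keeping the field state inside the writer means the conversion code only has to announce changes of speaker, not repeat the speaker for every paragraph.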
Principal Tools Used
The production system is based around a network of Pentium PCs running Windows NT 4.0. The principal software tasks and the tools used for those tasks are detailed in table 2.
Table 2: Principal software tools used
Future Developments |
XSL for Quality Assurance Scripting |
Also, sgrep, an XML-aware structured grepping tool, is now available on Win32 and we may move some of the simpler Python scripts to it.
XLink for Hypertext |
Web Browser as XML Viewer |
In Conclusion |