| Achieving Individualized, Timely Web Delivery | Table of contents | Indexes | Authoring: intelligent templates for authoring of SGML documents | |||
Document Structure Identification: a New Paradigm |
|
David Slocombe |
| Consultant |
| Tata Infotech Ltd Applied Technology Group Creative 10 Masjid Moth Commercial Complex Greater Kailash II New Delhi India 110 048 Phone: +91 11 644-4457/58 Fax: +91 11 621-7116/17 Email: slocombe@vex.net Web: www.tatainfotech.co.in |
Biographical notice: |
David Slocombe |
Ambekar, Jyoti India Mumbai Tata Infotech Ltd |
David Slocombe's career in computing began in 1969 while he was a newspaper reporter in Canada. During the next 20 years he developed many applications of computing to journalism and publishing. He was a founder and, until recently, V-P, R&D of SoftQuad Inc. and was the architect of SoftQuad's first product-line, SoftQuad Publishing Software. He contributed to the early development of DSSSL. In recent years he has consulted on SGML-based solutions for many of SoftQuad's clients, such as Digital Equipment of Canada, AT&T, and Standard & Poor's. Currently he lives in New Delhi, India, and works with Tata Infotech Ltd. on a variety of projects. |
Jyoti Ambekar |
| Lead Analyst |
| Tata Infotech Ltd Applied Technology Group SEEPZ Andheri (East) Mumbai India 400 096 Phone: +91 22 829-1317/20, -0321 Fax: +91 22 829-0585 Email: jyoti@darkstar.tulbom.unisys.com Web: www.tatainfotech.co.in |
Biographical notice: |
Jyoti Ambekar |
ABSTRACT: |
Introduction |
|
The first problem, the complexity of the standard, has in effect been addressed by the development of XML. The ‘up-conversion’ problem, however, remains. |
The key thing to note is that these files represent the appearance of the document. Instead of treating this fact as a disadvantage, we consider it the key to successful conversion. |
The Human-to-human protocol |
|
In the past, those involved in conversion projects tended to say that ‘up-conversion’ was inherently a costly process because the document structure had to be added where it did not exist before. |
The encoding standard, developed over a period of more than 400 years, takes advantage of the very high bandwidth of the human eye-brain system for data represented in two dimensions. |
Why the traditional approach fails |
|
There are two main reasons for this: |
These two reasons conspire to defeat the awk/sed/perl programmer who must consider the document to be a linear string of text characters and other codes. |
Our strategy |
|
In the next two sections we describe, first, our visual recognition approach, and then the way we use typographic knowledge to build up the document structure. |
Scanning the displayable image |
|
Figure 1. Tiles are created out of the horizontal runs of white-space around the text of the document. |
Figure 2. Polygons are formed around each text area, and bounding-boxes computed around the polygons. |
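The two figures can be approximated for a plain ASCII page with the sketch below. It is a simplification and not the authors' recognition engine: instead of building white-space tiles and polygons, it treats any run of two or more spaces as a white-space boundary, joins text runs on adjacent rows whose columns overlap, and reports one bounding box per resulting block. The names `text_runs` and `bounding_boxes`, and the two-space threshold, are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

Run = Tuple[int, int, int]        # (row, start_col, end_col_exclusive)
Box = Tuple[int, int, int, int]   # (top_row, left_col, bottom_row, right_col_exclusive)

# A text run is a sequence of words separated by single spaces; two or more
# spaces (a wide white-space run) end the run. The threshold is an assumption.
TEXT_RUN = re.compile(r"\S+(?: \S+)*")


def text_runs(page: List[str]) -> List[Run]:
    """Return every text run on the page with its row and column span."""
    return [(row, m.start(), m.end())
            for row, line in enumerate(page)
            for m in TEXT_RUN.finditer(line)]


def bounding_boxes(page: List[str]) -> List[Box]:
    """Group vertically adjacent, horizontally overlapping runs into blocks."""
    runs = text_runs(page)
    parent = list(range(len(runs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join runs on adjacent rows whose column ranges overlap.
    for i, (r1, s1, e1) in enumerate(runs):
        for j, (r2, s2, e2) in enumerate(runs):
            if r2 == r1 + 1 and s1 < e2 and s2 < e1:
                parent[find(i)] = find(j)

    boxes: Dict[int, Box] = {}
    for i, (row, start, end) in enumerate(runs):
        root = find(i)
        top, left, bottom, right = boxes.get(root, (row, start, row, end))
        boxes[root] = (min(top, row), min(left, start),
                       max(bottom, row), max(right, end))
    return sorted(boxes.values())


page = [
    "Left column text    Right column",
    "continues here      text here",
    "",
    "A separate block",
]
print(bounding_boxes(page))
```

On this sample page the two columns come out as separate blocks, and the blank row isolates the final line as a third block.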
Applying typographic knowledge |
|
Our forward-chaining production system uses a relatively large collection of ‘rules’, or ‘if-then’ statements, which encode our knowledge about the recognition of document structure. These rules also deal with various problems posed by the scanning VRE. |
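As a minimal sketch of what forward chaining means here, the loop below applies every rule to the known facts and repeats until no rule adds anything new. The fact vocabulary (`short`, `initial-caps`, `followed-by-prose`) and both rules are hypothetical examples, not rules taken from the system described in this paper.

```python
from typing import Callable, Iterable, List, Set, Tuple

Fact = Tuple[str, str]                       # (property, object id)
Rule = Callable[[Set[Fact]], Iterable[Fact]]


def forward_chain(facts: Iterable[Fact], rules: List[Rule]) -> Set[Fact]:
    """Fire rules repeatedly until a full pass derives no new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Collect the rule's conclusions first, then add them, so the
            # fact set is never modified while a rule is iterating over it.
            new = set(rule(known)) - known
            if new:
                known |= new
                changed = True
    return known


# Hypothetical rule: IF heading-candidate AND followed by prose THEN subheading.
def rule_subheading(facts: Set[Fact]) -> Iterable[Fact]:
    for prop, obj in facts:
        if prop == "heading-candidate" and ("followed-by-prose", obj) in facts:
            yield ("subheading", obj)


# Hypothetical rule: IF short AND initial capitals THEN heading-candidate.
def rule_heading_candidate(facts: Set[Fact]) -> Iterable[Fact]:
    for prop, obj in facts:
        if prop == "short" and ("initial-caps", obj) in facts:
            yield ("heading-candidate", obj)


facts = {
    ("short", "b1"), ("initial-caps", "b1"), ("followed-by-prose", "b1"),
    ("short", "b2"),
}
result = forward_chain(facts, [rule_subheading, rule_heading_candidate])
```

The subheading rule is listed first on purpose: it can only fire on the second pass, after the other rule has derived its premise, which is the chaining behaviour the term refers to.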
Our first problem is that many of the objects identified by the VRE are not the atomic objects which we want in our structure: some objects need to be combined into a single object (such as paragraphs split into blocks sitting side-by-side because of accidental ‘rivers’ which cut vertically through the paragraph); and some objects need to be split (such as groups of table row-stubs which are distinguished from each other by indent or character property but not separated by extra white-space). We identify these cases with rules, but tentatively, so that we can back out of our decisions if later processing shows that we ended up in a blind alley. |
To recognize that a paragraph has been accidentally split into blocks lying side by side, we test the combined block to see whether it appears to be a paragraph, i.e. whether it is running prose text. It would be ideal if we could apply Natural Language Processing to this task, but we think we can do well enough without it for now. As for blocks that should be split up, we have identified a number of indicators that signal when this is necessary. The most obvious cases show up in closely-typeset tables. |
We build up a vector of useful properties about each object in the document. For example, one property is that the block consists of words that mostly have initial capitals. We take into account words that are generally not capitalized in upper-and-lower-case text (such as prepositions and articles), but still we state that the block is upper-and-lower-case by means of a ‘fuzzy set membership’ rating between zero and one. |
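One way such a rating could be computed is sketched below. The list of minor words, the punctuation handling, and the choice of a simple ratio as the membership function are all assumptions; the paper does not say how its rating is derived.

```python
import string

# Words usually left in lower case in upper-and-lower-case headings.
# This particular list is an assumption made for the example.
MINOR_WORDS = frozenset({
    "a", "an", "the", "and", "or", "of", "in", "on", "to", "for",
    "with", "by", "at",
})


def initial_caps_membership(text: str) -> float:
    """Rate from 0.0 to 1.0 how strongly a block looks upper-and-lower-case."""
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w and w[0].isalpha()]
    # The first word always counts; later minor words are ignored.
    significant = [w for i, w in enumerate(words)
                   if i == 0 or w.lower() not in MINOR_WORDS]
    if not significant:
        return 0.0
    capitalized = sum(1 for w in significant if w[0].isupper())
    return capitalized / len(significant)


print(initial_caps_membership("Applying Typographic Knowledge to the Document"))  # 1.0
print(initial_caps_membership("The quick brown fox"))                             # 0.25
```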
After all the useful properties have been assigned to the objects, we establish a likelihood-rating that each object is a paragraph, subheading, row-stub, etc. If an object has a high rating for paragraph and low rating for anything else, then we decide that it is a paragraph. If, on the other hand, the object is more-or-less equally likely to be two or more kinds of element, then we refuse to classify it and refer this decision to a human operator. |
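The decision rule in this paragraph can be sketched as follows. The two thresholds (`min_rating` and `min_margin`) are hypothetical values chosen for the example; returning `None` stands for referring the object to the human operator.

```python
from typing import Dict, Optional


def classify(ratings: Dict[str, float],
             min_rating: float = 0.6,
             min_margin: float = 0.2) -> Optional[str]:
    """Return the winning element type, or None if the object is ambiguous."""
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_label, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Accept only a clear winner: rated highly and well ahead of the rest.
    if best >= min_rating and best - runner_up >= min_margin:
        return best_label
    return None  # refuse to classify; refer to the operator


clear = classify({"paragraph": 0.9, "subheading": 0.1, "row-stub": 0.05})
ambiguous = classify({"paragraph": 0.55, "subheading": 0.5})
```

Here `clear` is `"paragraph"` and `ambiguous` is `None`.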
In this process, we are able to take into account the similarity of one object with other objects distributed throughout the document. This is expected to greatly increase the reliability of the process because, whatever the characteristics of, say, a subheading are, they will be close to the characteristics of all other subheadings of the same level throughout the document. |
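A simple way to exploit this similarity is to compare property vectors directly. The sketch below uses Euclidean distance and a fixed cut-off; both the metric and the threshold are assumptions, as are the sample vectors, and the paper does not state which similarity measure it uses.

```python
import math
from typing import Dict, List, Sequence


def similar_objects(vectors: Dict[str, Sequence[float]],
                    reference: str,
                    max_distance: float = 0.25) -> List[str]:
    """List objects whose property vectors lie close to the reference object's."""
    ref = vectors[reference]
    return sorted(name for name, vec in vectors.items()
                  if name != reference and math.dist(ref, vec) <= max_distance)


# Hypothetical property vectors: (initial caps, shortness, prose-likeness).
vectors = {
    "h1": (1.0, 0.9, 0.1),
    "h2": (0.95, 1.0, 0.1),
    "p1": (0.1, 0.0, 0.9),
}
matches = similar_objects(vectors, "h1")
```

If `h1` is already known to be a subheading, `matches` (here only `h2`) gives the objects that are candidates for the same classification.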
As we build up document structures into lists, subsections and sections, we may have to back down on our decisions because we have classified some block in a way that is not allowed in the structure at that point. Or we may have to reject a whole structure because it would require us to have an object of a certain type in an illegal place. Thus we are taking cognizance of certain basic architectural forms (and their content-models): kinds of lists, tables, subsections, etc. |
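The interplay between classification and content models can be illustrated with a small sketch. The content models below are hypothetical and are written as regular expressions over child element names; the exhaustive search over candidate labels is a stand-in for whatever backtracking strategy the production system really uses.

```python
import re
from itertools import product
from typing import List, Optional, Tuple

# Hypothetical content models, expressed over space-terminated element names.
CONTENT_MODELS = {
    "section": re.compile(r"heading (?:paragraph |list )+"),
    "list": re.compile(r"(?:item )+"),
}


def is_allowed(parent: str, children: List[str]) -> bool:
    """Check a sequence of child element types against the parent's model."""
    sequence = "".join(child + " " for child in children)
    return CONTENT_MODELS[parent].fullmatch(sequence) is not None


def first_legal_labelling(parent: str,
                          candidates: List[List[str]]) -> Optional[Tuple[str, ...]]:
    """Try candidate labels in preference order; back down until the model fits."""
    for labels in product(*candidates):
        if is_allowed(parent, list(labels)):
            return labels
    return None  # no legal structure: reject this grouping altogether


# Each block lists its candidate types, most likely first.
candidates = [["paragraph", "heading"], ["paragraph"], ["list", "paragraph"]]
labelling = first_legal_labelling("section", candidates)
```

The first block's preferred label, `paragraph`, is illegal at the start of a section under this model, so the search backs down to `heading` and returns `("heading", "paragraph", "list")`.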
Our analysis of these forms in the course of this project has made us aware that, whatever the elements of a given DTD may be called, they can be classified according to a universally-recognized architecture, at least for scientific and technical literature. |
Finally, we write the document out, with its structure, as an XML file. |
The human operator is actually placed in a feedback loop: his or her judgements are not taken as absolute (if for no other reason than that he or she may have glanced at only part of the document when making a decision). Instead, we take the operator's decision as a hint and process the document again with this hint as additional input. This may result in a different set of problems, for which we again consult the operator. We hope that this process terminates quickly. |
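The loop can be sketched as below. The function names, the `max_rounds` guard, and the toy classifier are all assumptions introduced for illustration; the point is only that each operator answer is stored as a hint and the whole document is reprocessed, rather than the answer being written into the result directly.

```python
from typing import Callable, Dict, List, Optional

Labels = Dict[str, Optional[str]]


def convert_with_operator(object_ids: List[str],
                          classify_all: Callable[[List[str], Dict[str, str]], Labels],
                          ask_operator: Callable[[str], str],
                          max_rounds: int = 10) -> Labels:
    """Reprocess the document with accumulated hints until nothing is unresolved."""
    hints: Dict[str, str] = {}
    for _ in range(max_rounds):
        labels = classify_all(object_ids, hints)
        unresolved = [oid for oid in object_ids if labels.get(oid) is None]
        if not unresolved:
            return labels
        # Ask about one object only, then process the whole document again:
        # the hint may resolve (or change) other objects as well.
        hints[unresolved[0]] = ask_operator(unresolved[0])
    raise RuntimeError("no stable classification within max_rounds")


# Toy classifier: "a" is always clear; "b" and "c" look alike and are
# resolved together once a hint exists for "b".
def toy_classify_all(object_ids: List[str], hints: Dict[str, str]) -> Labels:
    return {oid: "paragraph" if oid == "a" else hints.get("b")
            for oid in object_ids}


asked: List[str] = []


def toy_operator(object_id: str) -> str:
    asked.append(object_id)
    return "subheading"


final = convert_with_operator(["a", "b", "c"], toy_classify_all, toy_operator)
```

In this toy run the operator is consulted once, about `b`, and the second pass resolves `c` as well.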
Conclusion |
|
Because we want to concentrate first on the algorithm itself and not on costly-to-program details, we are applying this approach first to plain ASCII-formatted files. Then we will move on to more complex formatted documents. |
Our hope is that only a few parts of only a small subset of the documents will have to be viewed and decided about by the human operator. If this turns out to be true, then we will have reduced the marginal cost (which is almost entirely labour-cost) of converting the documents. |
We expect to have substantial experimental results to report at the conference in May, and a prototype to demonstrate. |
Acknowledgments |