Achieving Individualized, Timely Web Delivery   Table of contents   Indexes   Authoring: intelligent templates for authoring of SGML documents

 
 

Document Structure Identification: a New Paradigm


 
David   Slocombe
  Consultant
  Tata Infotech Ltd
Applied Technology Group Creative 10
Masjid Moth Commercial Complex
Greater Kailash II
New Delhi   India  110 048
Phone: +91 11 644-4457/58
Fax: +91 11 621-7116/17
Email: slocombe@vex.net Web: www.tatainfotech.co.in
 
Biographical notice:
 
David Slocombe
Ambekar, Jyoti
 India  
Mumbai
 Tata Infotech Ltd  
 

David Slocombe's career in computing began in 1969 while he was a newspaper reporter in Canada. During the next 20 years he developed many applications of computing to journalism and publishing. He was a founder and, until recently, V-P, R&D of SoftQuad Inc. and was the architect of SoftQuad's first product-line, SoftQuad Publishing Software. He contributed to the early development of DSSSL. In recent years he has consulted on SGML-based solutions for many of SoftQuad's clients, such as Digital Equipment of Canada, AT&T, and Standard & Poor's. Currently he lives in New Delhi, India, and works with Tata Infotech Ltd. on a variety of projects.
 
Jyoti   Ambekar
  Lead Analyst
  Tata Infotech Ltd
Applied Technology Group SEEPZ
Andheri (East)
Mumbai   India  400 096
Phone: +91 22 829-1317/20, -0321
Fax: +91 22 829-0585
Email: jyoti@darkstar.tulbom.unisys.com Web: www.tatainfotech.co.in
 
Biographical notice:
 
Jyoti Ambekar
 
Jyoti Ambekar is project manager of the Document Structure Identification Project and has participated in the design of the experiment from its inception. A graduate of Bombay University and S.N.D.T. University (Bombay), she helped design and implement a new train scheduling and track display system for Train Operations and Control at India's recently-inaugurated Konkan Railway. Her current research interests are human-computer interface design and object-oriented design methodologies.
 
ABSTRACT:
 
Human readers have little trouble extracting structural markup from the typographic cues which are invariably present in formatted documents. We describe an experiment which applies two-dimensional visual feature extraction and a knowledge-base of typographic practices to the automatic markup recognition of machine-readable (‘softcopy’) documents.
 
 

Introduction

 
After the complexity of the SGML standard itself (which has inhibited potential users from understanding it), the biggest barrier to market acceptance of SGML has been the cost of converting legacy documents to marked-up form. If the documents are on paper only, then often there is nothing to do but to re-key the material. Optical scanning has seldom been an acceptable option, because if the material is valuable enough to convert to SGML the application is usually intolerant of errors.
 
The first problem, the complexity of the standard, has been addressed ‐ in effect ‐ by the development of XML. The ‘up-conversion’ problem, however, remains.
 
At Tata Infotech Ltd.'s Applied Technology Group in Bombay, we have chosen to concentrate on conversion to XML of documents which are in machine-readable (‘softcopy’) form, and readable by today's machines. Their format may be anything from simple formatted-ASCII lines suitable for dumping to a line-printer to typesetter control files involving the full repertoire of typographic effects.
 
The key thing to note is that these files represent the appearance of the document. Instead of treating this fact as a disadvantage, we consider it the key to successful conversion.
 
 

The Human-to-human protocol

 
In the past, those involved in conversion projects tended to say that ‘up-conversion’ was inherently a costly process because the document structure had to be added where it did not exist before.
 
But if the structure wasn't there in the document, how did we know what it was? We knew because we read the document (on the screen or on paper) and saw the structure. If we could not see the structure, then the author's work would fail to communicate to us readers, and we could not mark up the document.
 
In fact, the correct model is a communication model: there is a human transmitter of text-plus-structure encoded according to a de facto standard; there is a medium (paper or screen); and there is a human receiver who reliably decodes both the structure and the text from the received message by applying knowledge of the de facto standard.
 
The encoding standard ‐ developed over a period of more than 400 years ‐ takes advantage of the very high bandwidth of the human eye-brain system for data represented in two dimensions.
 
The encoding-rules are very conservative (that is, they change only very slowly) because, if they changed frequently, many readers would be unprepared for the messages which they might want to receive. Thus, as a group, those specialists called typographers or book designers are a very conservative lot, quite unlike the graphic designers responsible for advertising (or Wired magazine) who optimize for arresting one's attention instead of for easy reception of large amounts of structured data.
 
The rules for encoding are also very flexible , allowing for a host of special arrangements on the page which still communicate structure reliably. For example, an author creating a table may feel justified in allowing some row-stub text to overflow into the second or even third column and then wrap around into the first column (stub area) again, with the rest of the row following. His alternative would be either to truncate unacceptably what he needs to say in the row-stub or to make the table unacceptably wide or deep. He feels he can get away with this arrangement because other factors, such as the use of white-space, keep his table structure unambiguous.
 
In spite of the enormously variable formatting that authors can apply to their text, there are many ‘eternal verities’ that we can count on. For example, a subheading is going to have something about it that guarantees that it is not taken for just a very short paragraph, and also that it is not a higher-level heading. The cluster of typographic attributes which identify a subheading of a certain level in one place will distinguish all similar subheadings in that part of the document (though some variation is allowed if it does not allow a subheading to be confused with one of a different type). A block-quote may be signified in a variety of ways, but one or more of these ways is guaranteed to be used every time and by every author. If not, we will not know that it is a block-quote ‐ to us it will not be a block-quote.
 
 

Why the traditional approach fails

 
Traditionally, the approach to converting machine-readable legacy files is to search the text for strings of codes which indicate the beginning or end of various structures, and to insert tags at those points. Awk, sed, and perl have been the favourite tools for this, although Omnimark and some other SGML parsers can be used too. This has not been a great success: usually, far too much goes wrong and the cost of the job is very high because of the manual patch-ups required.
 
There are two main reasons for this:
  • The author will normally use different visual effects to signal the same document sub-structure, with his choice being dictated by both aesthetic and practical considerations. The variety of effects allowed and used is quite large. This variety is necessary to accommodate the variety of word-strings whose lengths must be accommodated within the two-dimensional geometry of the page while still preserving the desired logical structure.
  • Many different strings of codes can produce the same visual effect. The author typically does not care which code string he generates in the act of achieving a particular visual effect.
  •  
    These two reasons conspire to defeat the awk/sed/perl programmer who must consider the document to be a linear string of text characters and other codes.
     
     

    Our strategy

     
    The success of human-to-human, two-dimensional encoding of document structure suggests a way to approach legacy document conversion: consider it a problem in applying typographic conventions to two-dimensional analysis of the printed page (or, rather, a computer program's analogue of the printed page). We generally ignore the actual words of the document and focus only on the character-classes of upper-case, lower-case, and digit. We do take seriously any punctuation, and we thoroughly analyse the use of white-space both around blocks of text and within them.
     
    We have found by experiment that there are very few situations where the layout structure of a document (as distinct from what is generally called ‘content tagging’) depends for its recognition on the actual meaning of words in the document. To test this for yourself, run some ASCII (monospaced) documents through a filter that changes upper-case letters to the letter T, lower-case letters to the letter x, and digits to the digit 0. (Leave all white-space and punctuation as-is.) Then examine the documents and ask yourself just what you have lost in layout-structure recognition. Surprisingly little!
     
    We are not against taking the actual words into account: this is generally valuable, and even essential at times. But we have chosen to focus on the two-dimensional layout, the typography, and the punctuation because these have not been taken full advantage of in the past. After we have learned to extract the maximum value out of these properties of the document, we can easily gain additional benefit by taking the actual words into account.
     
    The two-dimensional analysis of the printed page was first pioneered by Avalanche Corporation's conversion product, FastTag . Its VRE  (Visual Recognition Engine) broke the page down into two-dimensional objects and then passed these objects (although as one-dimensional strings) on to a second stage much like a lexer passes on tokens to a parser in a compiler for programming languages. While the object-recognition algorithms have greatly improved, the principle of the VRE remains sound today.
     
    We build up a document structure out of the two-dimensional objects produced by our updated version of the VRE . (Our objects maintain their two-dimensional properties throughout.) We apply a forward-chaining production system ‐ a technique developed within the field of Artificial Intelligence and often used for building Expert Systems ‐ so that we can take advantage of a large number of typographic conventions to identify the correct markup.
     
    In this respect we depart radically from Avalanche's approach, which used a simple awk-like language to process the stream of objects identified by the VRE . We are able to combine or split objects, and to consider a variety of alternative interpretations of objects simultaneously. Also, we can rate possibilities in their context and choose the most-likely possibility if one stands out from the others, or reject all possibilities if they are more or less equally likely (in which case we ask a human to arbitrate). And we do not have to process the objects only in the order in which they appear to us during the document-scan, but can traverse back and forth through the whole collection of objects as we build up confidence in our structuring decisions.
     
    In the next two sections we describe, first, our visual recognition approach, and then the way we use typographic knowledge to build up the document structure.
     
     

    Scanning the displayable image

     
    Many people have reported techniques for analyzing the contents of scanned pages into blocks of text, or text and graphics, in preparation for optical character recognition. These techniques, without exception, assume that the input is a bitmap image of the page, in which even the location of characters, words and lines is uncertain and the scan-lines may not be parallel to the lines of text (i.e. the scan is skewed). We, on the other hand, do not need to deal with scanned pixels. Instead, we know exactly what a character is and where it is on the page, what its font, style and pointsize is, and so on. Words and lines are easy to identify and their position is never in doubt.
     
    However we still have to identify appropriate groups of words and line-segments which form the units of typographic structure. We need an algorithm which does this in the most unusual as well as the normal cases.
     
    An algorithm described in introduces a new method of document page segmentation based on the analysis of the background white space surrounding the printed regions of the page. It turns out that the margins and gutters are simpler to analyse than the printed regions.
     
    The algorithm does not make any assumptions about the shape of the printed regions, and offers benefits such as computational speed and accuracy of shapes. The document consists of text regions of various shapes and sizes. All the streams of white space that delimit these regions are connected to form a net. The precise contours of the text regions ‐ the holes in the net of white-space ‐ can be identified by tracing the edges within this irregular net.
     
    As we are dealing with an idealized image and not scanned pixels, the algorithm simplifies, as we do not need to perform any additional processing to prevent the inter-line and inter-character spaces from contributing to the surrounding stream of white space. Moreover, we have character bounding boxes to work with instead of pixels, and this leads to simplified handling of the left and right margins of text-blocks. Also, with softcopy documents, there are no considerations regarding correction of skew.
     
    To begin with, the streams of white space are split up into rectangular tile objects that represent the longest possible white space areas in the horizontal direction. The document is sequentially scanned from top to bottom to construct all the horizontal tiles, and if there is any tile of the same width above the current tile that coincides on its horizontal edge, the two tiles are merged into a single tile. Additional tiles are created to surround the boundaries of the type-page (see Figure 1). Connectivity is established between the tiles and this information is recorded into graph objects, which are constructed such that the vertices of the graph represent the tiles, and its edges represent the vertical adjacency between the tiles. The tiles which have multiple connecting tiles below them are distinguished as potential candidates from which to start tracing cycles in order to get the contours of the text regions. These potential starts are maintained in a queue.

     
    Figure 1. Tiles are created out of the horizontal runs of white-space around the text of the document.

     
     
    Starting from each of the tiles in the queue, a minimum cycle is traced which borders the innermost enclosed text region. Any visited tiles are removed from the queue of the potential starts. A polygon object is constructed from the inner edges of the tiles on the cyclic path that outlines the text region (see Figure 2). For each such polygon formed, a bounding box is determined, and some preliminary analysis carried out on its contents such as left and right indents (from the bounding box) for each line it contains. Also, information is gathered identifying the neighbours of the polygon as well certain statistical information across the structure of the entire document. All this assembled information about the whole document and its pieces is passed to the structure-identifying process.

     
    Figure 2. Polygons are formed around each text area, and bounding-boxes computed around the polygons.

     
     
     

    Applying typographic knowledge

     Note:  
     
    As of March 20, 1998, when our paper is being submitted for publication in the proceedings of the conference, we have completed the scanning part of the project and are starting to experiment with rules for structure-recognition. Therefore, this section is more a description of what we think we are going to do than a record of what we actually have done. Please attend our talk at the conference, two months hence, to find out how it all turned out!
     
    Our forward-chaining production system uses a relatively large collection of ‘rules’, or ‘if-then’statements, which encode our knowledge about the recognition of document structure. These rules deal with various problems posed by the scanning VRE as well.
     
    Our first problem is that many of the objects identified by the VRE are not the atomic objects which we want in our structure: some objects need to be combined into a single object (such as paragraphs split into blocks sitting side-by-side because of accidental ‘rivers’ which cut vertically through the paragraph); and some objects need to be split (such as groups of table row-stubs which are distinguished from each other by indent or character property but not separated by extra white-space). We identify these cases with rules, but tentatively, so that we can back out of our decisions if later processing shows that we ended up in a blind alley.
     
    To recognize that paragraphs have been accidentally split up into horizontally-situated blocks, we must apply a test to the combined block to see if it appears to be a paragraph, i.e. if it is running prose text. It would be ideal if we could apply Natural Language Processing to this task, but we think we can do well enough without for now. As for blocks that should be split up, we have identified a number of indicators to signal when this is necessary. The most obvious cases show up in closely-typeset tables.
     
    We build up a vector of useful properties about each object in the document. For example, one property is that the block consists of words that mostly have initial capitals. We take into account words that are generally not capitalized in upper-and-lower-case text (such as prepositions and articles), but still we state that the block is upper-and-lower-case by means of a ‘fuzzy set membership’ rating between zero and one.
     
    After all the useful properties have been assigned to the objects, we establish a likelihood-rating that each object is a paragraph, subheading, row-stub, etc. If an object has a high rating for paragraph and low rating for anything else, then we decide that it is a paragraph. If, on the other hand, the object is more-or-less equally likely to be two or more kinds of element, then we refuse to classify it and refer this decision to a human operator.
     
    In this process, we are able to take into account the similarity of one object with other objects distributed throughout the document. This is expected to greatly increase the reliability of the process because, whatever the characteristics of, say, a subheading are, they will be close to the characteristics of all other subheadings of the same level throughout the document.
     
    As we build up document structures into lists, subsections and sections, we may have to back down on our decisions because we have classified some block in a way that is not allowed in the structure at that point. Or we may have to reject a whole structure because it would require us to have an object of a certain type in an illegal place. Thus we are taking cognizance of certain basic architectural forms (and their content-models): kinds of lists, tables, subsections, etc.
     
    Our analysis of these forms in the course of this project has made us aware that, whatever the elements of a given DTD may be called, they can be classified according to a universally-recognized architecture, at least for scientific and technical literature.
     
    Finally, we write the document out, with its structure, as an XML file.
     
    The human operator is actually placed in a feedback loop: his or her judgements are not taken as absolute (if for no other reason than that he or she may have glanced at only part of the document when making a decision). Instead, we take the operator's decision as a hint and process the document again with this hint as additional input. This may result in a different set of problems for which we again consult the operator. Hopefully, this process terminates quickly!
     
     

    Conclusion

     
    Because we want to concentrate first on the algorithm itself and not on costly-to-program details, we are applying this approach first to plain ASCII-formatted files. Then we will move on to more complex formatted documents.
     
    Our hope is that only a few parts of only a small subset of the documents will have to be viewed and decided about by the human operator. If this turns out to be true, then we will have reduced the marginal cost (which is almost entirely labour-cost) of converting the documents.
     
    We expect to have substantial experimental results to report at the conference in May, and a prototype to demo.
     
    Acknowledgments
      This project would not have been possible without the creative input and hard work of our team-mates: Sushmita Aggarwal, Kelly-Ann Alvares, Homi Bharda, Satyesh Desaraju, and Ravindra Hande.
     
    Bibliography
    ANTON94
    A. Antonacopoulos and R.T. Ritchings, ‘Flexible Page Segmentation Using the Background’, IEEE 12 th Internat. Conf. on Pattern Recognition, Proceedings , Jerusalem, Oct. 1994

    Achieving Individualized, Timely Web Delivery   Table of contents   Indexes   Authoring: intelligent templates for authoring of SGML documents