XML/EDI: Business information for the 21st century   Table of contents   Indexes   The Economics of Collaborative Authoring and Distribution

 
 

Analysing SGML Documents linguistically


 
Paul   Bussé
  Software Division Manager
  Lant N.V.
Interleuvenlaan 21
 B-3001 Leuven   Belgium
Phone: +32 16 405140
Fax: +32 16 404961
Email: paul.busse@lant.be Web: http://www.lant.com
 
Biographical notice:
 
Paul Bussé
 
Since 1990 Paul has been involved in designing environment software for different linguistic applications, e.g. for the Siemens-Nixdorf METAL project and Eurolang Optimizer. He introduced SGML into the METAL environment in 1992. The analysis of file formats, among which SGML takes an important place, and the generation of an application-specific structure are some of the key issues in this area. Other projects he was involved in concern user interfaces and the integration of linguistic applications.
 
Recently, he assisted, as a project leader, in the development of a system to provide machine translation over the Internet/Intranet.
 
ABSTRACT:
 
In our company, we develop three types of linguistic applications: translation memory, machine translation and controlled language, for example, simplified English.
 
These applications can work independently or together to achieve better multilingual documentation. One of the major problems was the design of a common input format for all these applications. The input format, described as a DTD, is named LDIF. Each of these applications has its own set of requirements towards LDIF. With the growing importance of SGML, we also had to define how to handle SGML documents. This is what this paper describes.
 
 

Introduction

 
When an author writes a document, he tries to structure it in such a way that it is easy to read and understand. The resulting structure is not necessarily suited for linguistic applications. We have defined a more appropriate structure in terms of an SGML DTD. This newly defined structure does not interfere with the structure chosen by the author.
 
This paper describes the DTD in general terms. A second topic is the conversion of an SGML document into this structure. In this respect, we will compare the behaviour of SGML to other file formats, like RTF or FrameMaker's MIF. First, we explain the functionality of the linguistic applications.
 
 

Linguistic applications explained

translation memory
 

Translation memory systems match the terms and sentences in the database with those in the source language text. If a match is found, the system proposes the available translation in the target language. The translator can then accept the proposed translation or insert an alternative target sentence. In this way, translators never have to translate the same sentence twice.
 
Translation memory technology has progressed rapidly in recent years, both in terms of complexity and user-friendliness. A perfect (word-for-word, letter-for-letter) match between the source language text and a corresponding database item is no longer necessary. Systems are intelligent enough to locate possible translations for a sentence by looking for correspondences with the words or phrases in the database. This is called a "fuzzy" match. Whilst fuzzy matches must naturally be modified and post-edited by the translator, they save him a lot of time and effort.
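
The idea of a fuzzy match can be sketched in a few lines of Python. This is only an illustration, not the matching algorithm of any particular product: the memory entries, the function name best_match and the 0.75 threshold are assumptions, and the similarity score comes from the standard library's difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source sentence -> stored translation.
MEMORY = {
    "Press the red button to stop the engine.": "Druk op de rode knop om de motor te stoppen.",
    "Close the cover before use.": "Sluit het deksel voor gebruik.",
}

def best_match(sentence, memory, threshold=0.75):
    """Return (score, source, translation) for the closest stored sentence,
    or None when no entry reaches the (assumed) threshold."""
    best = None
    for source, translation in memory.items():
        # ratio() is 1.0 for identical strings and lower for partial matches.
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, translation)
    if best is not None and best[0] >= threshold:
        return best
    return None
```

A score of 1.0 corresponds to a perfect match; anything between the threshold and 1.0 is a fuzzy match that the translator still has to post-edit.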
machine translation
 

Automatic translation is becoming an increasingly feasible proposition for companies that wish to improve the speed, and reduce the cost, of multilingual documentation. The speed with which engines carry out translations is phenomenal. A human translates from 100 up to 1,000 words per hour, depending on the complexity of the text; a machine translation system has a throughput of 10,000 to 100,000 words per hour.
controlled language
 

Controlled language is about making text easier to read and translate. In order to achieve this objective, rules for written material are established. Technology is then used to ensure that company documentation complies with these rules.
 
Controlled-language rules affect text on different levels:
  • Content
  • Style
  • Syntax
  • Terminology
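
As a rough illustration of how technology can check compliance, the sketch below tests a sentence against two invented rules, one on the syntax level (sentence length) and one on the terminology level. The limit of 20 words, the term list and the function name check_sentence are assumptions made for this example; they are not taken from any actual controlled-language specification.

```python
import re

MAX_WORDS = 20  # assumed syntax rule: maximum sentence length in words
PREFERRED_TERMS = {"utilize": "use", "terminate": "stop"}  # assumed terminology rule

def check_sentence(sentence):
    """Return a list of human-readable findings; an empty list means compliant."""
    findings = []
    words = re.findall(r"[A-Za-z']+", sentence)
    if len(words) > MAX_WORDS:
        findings.append("syntax: sentence has %d words (limit %d)" % (len(words), MAX_WORDS))
    for word in words:
        preferred = PREFERRED_TERMS.get(word.lower())
        if preferred:
            findings.append("terminology: use '%s' instead of '%s'" % (preferred, word))
    return findings
```

A real checker would rely on linguistic analysis rather than word lists, but the principle is the same: each rule produces findings that the author can resolve or accept.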
 
In the future, the different linguistic applications will integrate into one system. The integration between machine translation and translation memory is already a fact today.
 
 

Requirements for format-independent representations

 
 

Workflow

 
All linguistic systems follow more or less the same workflow.
  1. Analysis. The system generates a format-independent document from the original file.
  2. Pre-editing. The user can modify the format-independent document. This step can also be carried out on the original file, before step 1. In that case, however, it takes place outside the environment controlled by the linguistic processor, so the modifications must be expressed in the structure of the original document in a way that the analyser recognises and can convert into the proper structures in the generated document.
  3. Processing. The system translates or checks the document. Steps 2 and 3 may be iterated, especially in the case of controlled language.
  4. Post-editing. The user corrects the results of the application. This step can also be executed after step 5.
  5. Regeneration. The results are reformatted in order to return a document in its original format.
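
The five steps can be read as a simple pipeline. The following Python sketch is a hypothetical skeleton, not the architecture of an existing system: the function run_workflow, its parameters and the toy stages in the usage example are all assumptions chosen to show the order of the steps and the iteration of steps 2 and 3.

```python
def run_workflow(original, analyse, pre_edit, process, post_edit, regenerate, max_rounds=3):
    """Run the five steps; steps 2 and 3 repeat until `process` reports no issues."""
    document, auxiliary = analyse(original)        # 1. analysis
    for _ in range(max_rounds):
        document = pre_edit(document)              # 2. pre-editing
        document, issues = process(document)       # 3. processing
        if not issues:
            break
    document = post_edit(document)                 # 4. post-editing
    return regenerate(document, auxiliary)         # 5. regeneration

# Toy stages, for illustration only: the "format" is a pair of <b> tags.
result = run_workflow(
    "<b>utilize the lever</b>",
    analyse=lambda s: (s[3:-4], ("<b>", "</b>")),
    pre_edit=lambda d: d.replace("utilize", "use"),
    process=lambda d: (d, ["utilize"] if "utilize" in d else []),
    post_edit=lambda d: d.capitalize(),
    regenerate=lambda d, aux: aux[0] + d + aux[1],
)
print(result)
```

Passing the stages in as functions keeps the skeleton independent of the file format and of the linguistic application involved.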
 
 
 

Analysis

 
During the analysis phase, the analyser reads the document and tries to annotate it linguistically. At the same time, it removes the structures that the linguistic tool considers irrelevant. This linguistically irrelevant data comprises:
  • All layout information, like typeface and point size.
  • Document structuring information, such as table control information.
  • All other information that contains no (recognisable) text, e.g. pictures.
 
The format analyser stores this information in an auxiliary file. It leaves pointers to this file in the format-independent document to allow the regenerator to rebuild the document properly.
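
The mechanism of an auxiliary store with pointers can be sketched as follows. The input format (control codes in curly braces), the empty pointer element <x id="n"/> and the function names are assumptions made for the example; they do not reproduce the actual LDIF elements or the layout of the auxiliary file.

```python
import re

CONTROL = re.compile(r"\{[^{}]*\}")        # toy format: control codes in braces (assumption)
POINTER = re.compile(r'<x id="(\d+)"/>')   # hypothetical empty pointer element

def analyse(original):
    """Move every control code to an auxiliary list and leave a pointer behind."""
    auxiliary = []

    def store(match):
        auxiliary.append(match.group(0))
        return '<x id="%d"/>' % (len(auxiliary) - 1)

    document = CONTROL.sub(store, original)
    return document, auxiliary

def regenerate(document, auxiliary):
    """Replace each pointer with the control code it refers to."""
    return POINTER.sub(lambda m: auxiliary[int(m.group(1))], document)
```

Because the pointers survive processing, the text between them can be translated or corrected and the original formatting is still restored at the right positions.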
 
The second aspect of the analysis is to interpret the structure linguistically. The analyser:
  • Identifies invariant pieces of text, i.e. text which is not to be interpreted by the processor.
  • Detects interruptions in the text flow.
  • Generates a structure to suit the linguistic applications.
 
Programmer's guides often represent pieces of code in a specific typeface. The occurrence of such a typeface renders the text invariant to the linguistic application. Invariant pieces of text are treated as part of a sentence. To be more precise, the application considers the invariant as a noun. The analyser generates nouns in this way when it encounters, for example, cross-references, in-line pictures or formulas.
 
The analyser also detects interruptions in the text flow. The use of tabulations, for instance, is interpreted as such an interruption. The analyser separates the text before and after the tabulation. Both pieces of text will be handled individually.
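
A minimal sketch of this behaviour, assuming that the tab character is the only interruption and that the function name split_at_interruptions is free to choose:

```python
def split_at_interruptions(text, interruption="\t"):
    """Split a line at each interruption and drop empty pieces,
    so that every remaining piece can be processed on its own."""
    pieces = (piece.strip() for piece in text.split(interruption))
    return [piece for piece in pieces if piece]
```

For a line such as "Voltage", tab, "230 V", tab, "Do not exceed.", the function returns three pieces that are handled individually.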
 
From certain structures that occur in the original document, the analyser derives the sentence type. Depending on the formatting it is possible to tell if the sentence is a title or a list item. The linguistic applications use this information when they generate their results. In a title, for example, the verb can be missing.
 
One of the most important tasks of an analyser is to restructure the document. As explained before, the structure of a traditional document is not always suited to linguistic interpretation. The appearance of a footnote in the middle of a sentence does not represent the order in which the text has to be processed. The analyser leaves a reference in the document and stores the footnote after the paragraph it occurs in. We call this sub-paragraphing. A second aspect of sub-paragraphing is the analysis of e.g. style sheets. Style sheets contain text used for numbering chapters and the generation of cross-references. This text has to be treated separately, i.e. in a sub-paragraph.
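
Sub-paragraphing of footnotes can be illustrated with a small Python sketch. The inline notation [fn: ...], the reference element <ref n="1"/> and the function name sub_paragraph are assumptions made for the example, not the notation used in LDIF.

```python
import re

# Toy notation (assumption): a footnote is written inline as "[fn: text]".
FOOTNOTE = re.compile(r"\s*\[fn:\s*(.*?)\]")

def sub_paragraph(paragraph):
    """Return the paragraph with footnotes replaced by references,
    plus the footnote texts as sub-paragraphs to be stored after it."""
    subs = []

    def move(match):
        subs.append(match.group(1).strip())
        return '<ref n="%d"/>' % len(subs)

    main = FOOTNOTE.sub(move, paragraph)
    return main, subs
```

The main sentence is now uninterrupted and can be processed as one unit, while each footnote is processed separately and put back through its reference during regeneration.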
 
 

User preparation

 
The next step is the annotation of the document by the user. Using an appropriate editor, he can replace strings, mark pieces of text as invariant and split the text into translation units.
 
 

Linguistic application

 
Each of the linguistic applications adds its own set of elements to LDIF (Lant Document Interchange Format). The translation memory marks:
  • The differences between the sentence found in the sentence database and the sentence in the document.
  • The words found in the terminology database.
 
Machine translation requires a specific way of representing alternative phrases. This is useful when the system has more than one translation for a given phrase.
 
The same structure applies to controlled language applications. Here, it represents possible alternatives for the phrases used. For this type of application, we also need to highlight the areas that do not conform to the defined set of rules and, where needed, to record that the non-conformance in that area has been accepted.
 
 

SGML

 
We have chosen to use SGML as a format-independent notation. From the beginning, we established the following rules:
  • As SGML does not allow overlapping elements, non-linguistic elements should be empty. Specifying all the linguistic structures, which logically have content, as non-empty elements is difficult enough.
  • The fewer elements we need to represent the formatting, the better. Moreover, the fewer attributes per element, the better. In the case of controlled language, we already need at least 0.5 kilobytes to represent a non-formatted sentence.
  • During the design of the DTD, we tried to avoid elements having a linguistic as well as a non-linguistic meaning. This should allow us to easily integrate these elements in other SGML applications, like OPENTAG.
 
 

SGML Analyser

 
 

The way it works

 
Analysing a document without knowing the semantics of the elements is impossible. The user must aid the SGML analyser by specifying the role of the different elements in the document. He must do this in a configuration file. In this file, which is an SGML document, the user declares the roles of the elements and of some of the attributes of a specific DTD. The declarations allow for context sensitivity.
 
Valid roles are:
  • Document: this role can be attributed to only one element; i.e. the base document element.
  • Paragraph group: this is for elements which do not contain PCDATA themselves, but contain elements that can hold PCDATA or other paragraph-groups; e.g. chapters
  • Paragraphs: elements with content to process. The boundary of the element is also the boundary of a translation unit.
  • Sub-paragraphs: elements with content to process, but this content, if left at the place it was found would interrupt the normal text flow.
  • Sentence boundary: the element indicates that the processing unit ends here. A new processing unit starts immediately afterwards. The element itself can be empty.
  • Invariant: elements that represent a part of a sentence that should not be processed by the application. The content of the element could be a sub-paragraph.
  • Formatting: elements that represent pure formatting aspects like emphasising.
  • Pure data: elements that are linguistically irrelevant.
 
    Sometimes the attribute values contain text that should be translated. If this is the case, the user has to specify which attributes cause a sub-paragraph.
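
The role declarations, including context sensitivity, can be pictured with the following Python sketch. The real configuration file is an SGML document; the dictionary used here is only a stand-in, and the element names (manual, chapter, figure and so on), the attribute alt and the function role_of are invented for an imaginary manual DTD.

```python
ROLES = {
    "document", "paragraph-group", "paragraph", "sub-paragraph",
    "sentence-boundary", "invariant", "formatting", "pure-data",
}

# Hypothetical configuration for an imaginary manual DTD.
CONFIG = {
    "manual": "document",
    "chapter": "paragraph-group",
    "title": "paragraph",
    "para": "paragraph",
    "footnote": "sub-paragraph",
    "br": "sentence-boundary",
    "code": "invariant",
    "emph": "formatting",
    "graphic": "pure-data",
    # Context-sensitive declaration: a title inside a figure interrupts the text flow.
    ("figure", "title"): "sub-paragraph",
}

# Attributes whose values contain translatable text and therefore cause a sub-paragraph.
SUB_PARAGRAPH_ATTRIBUTES = {("graphic", "alt")}

def role_of(element, parent=None, config=CONFIG):
    """Look up the role of an element, preferring a declaration for its context.
    Returns None for elements the configuration does not declare."""
    role = config.get((parent, element), config.get(element))
    if role is not None and role not in ROLES:
        raise ValueError("unknown role: %s" % role)
    return role
```

With this configuration, role_of("title", "chapter") yields the paragraph role, whereas role_of("title", "figure") yields the sub-paragraph role. What the analyser should do with undeclared elements is left open here.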
     
     

    Comparison

     
    The difference between the SGML analyser and an analyser for a word processor-specific format, like RTF, lies in the rigid definition of the elements used in SGML. The use of an element in an SGML document adds information about how to interpret its content. Style sheets, as they are used in e.g. RTF, merely define the layout. Attaching other meanings to these style sheets is error-prone.
     
    The advantage of using SGML in a linguistic environment is the separation of the document structure from the layout. When using other formats, like RTF, the user has different ways to obtain the same layout. These alternatives may result in different linguistic structures. The use of SGML elements provides us with a more detailed specification of the structure concerned.
     
    In traditional text processors, there are different ways to achieve an identical layout. Two things may happen:
  • The user applies a specific style sheet and changes the paragraph afterwards to make it look as required.
  • The text processor changes some attributes to its own needs.
     
    Both cases might cause misinterpretations by the analyser.
     
    The granularity of the document structure is finer in SGML compared to other file formats and the user controls the structure completely. Therefore, the user has more control over the linguistic application and its results.
     
     

    Conclusions

     
    The use of SGML together with linguistic applications yields better results.
     
    However, there are a few conditions to be fulfilled:
  • The DTD represents the document structure. If this is not the case, the document will be treated as a pure ASCII document, with consequently lower-quality results.
  • The user has to carefully design the DTD-specific configuration file for the analyser.
     
    The approach is feasible, as the number of DTDs within a company, and therefore the number of configuration files, is limited.
     
    Although documents are written principally to be shown (hard copy, help files, web pages or even as data) the translation cost should not be underestimated. This cost can be reduced significantly by linguistic applications.
     
    If you need your SGML documents to be processed by a linguistic application, it is useful to consider this during the design of the DTD. It will simplify the job of the translation or checking process and will enhance its results. The same holds for other formats, but the linguistic interpretation of those formats can never be defined as rigidly as with SGML.
     
    Acknowledgments
      I would like to thank Dominic J. North for thinking with me during the genesis of this document and show my appreciation for the work and encouragements of Liesbet Depreeuw. She made the writing of this document a joy for me, a joy like harvesting honey is for Winnie the Pooh.
