| XML/EDI: Business information for the 21st century | Table of contents | Indexes | The Economics of Collaborative Authoring and Distribution | |||
Analysing SGML Documents linguistically |
|
Paul Bussé |
| Software Division Manager |
| Lant N.V. Interleuvenlaan 21 B-3001 Leuven Belgium Phone: +32 16 405140 Fax: +32 16 404961 Email: paul.busse@lant.be Web: http://www.lant.com |
Biographical notice: |
Paul Bussé |
Recently, he assisted, as a project leader, in the development of a system to provide machine translation over the Internet/Intranet. |
ABSTRACT: |
In our company, we develop three types of linguistic applications: translation memory, machine translation and controlled language, for example, simplified English. |
Introduction |
|
| linguistic applications |
Linguistic applications explained |
| translation memory |
Translation memory systems match the terms and sentences in the database with those in the source language text. If a match is found, the system proposes the available translation in the target language. The translator can then accept the proposed translation or insert an alternative target sentence. In this way, translators never have to translate the same sentence twice. |
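The lookup described above can be sketched as follows. This is a minimal illustration, not the matching method used in our products: the similarity measure (Python's `difflib.SequenceMatcher`), the threshold of 0.75, the function name `best_match` and the sample sentence pairs are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def best_match(sentence, memory, threshold=0.75):
    """Return (source, translation, score) for the closest stored sentence,
    or None when nothing in the memory reaches the threshold."""
    best = None
    for source, translation in memory.items():
        # Similarity between 0.0 and 1.0; 1.0 means an exact match.
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (source, translation, score)
    return best

# Hypothetical memory content (English source, Dutch target).
memory = {
    "Switch off the engine.": "Schakel de motor uit.",
    "Open the cover.": "Open het deksel.",
}

proposal = best_match("Switch off the engine.", memory)
if proposal is not None:
    print("Proposed translation:", proposal[1])
```

The translator stays in control: the proposal is only a suggestion, and a sentence with no sufficiently close match returns `None` and is translated by hand.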
| machine translation |
Automatic translation is becoming an increasingly feasible proposition for companies that wish to improve the speed – and reduce the cost – of multilingual documentation. The speed with which engines carry out translations is phenomenal. Depending on the complexity of the text, a human translates 100 to 1,000 words per hour; a machine translation engine has a throughput of 10,000 to 100,000 words per hour. |
| controlled language simplified English |
Controlled language is about making text easier to read and translate. In order to achieve this objective, rules for written material are established. Technology is then used to ensure that company documentation complies with these rules. |
Controlled-language rules affect text on different levels: |
In the future, the different linguistic applications will integrate into one system. The integration between machine translation and translation memory is already a fact today. |
Requirements for format-independent representations |
|
Workflow |
|
Analysis |
|
During the analysis phase, the analyser reads the document and tries to annotate it linguistically. At the same time, the structures that the linguistic tool considers irrelevant are removed. This linguistically irrelevant data contains |
The format analyser stores this information in an auxiliary file. It leaves pointers to this file in the format-independent document to allow the regenerator to rebuild the document properly. |
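The round trip between analyser and regenerator can be sketched as follows. This is an illustration under stated assumptions: formatting is represented by angle-bracket tags, the pointer syntax `{n}` is invented for the example, and the auxiliary file is modelled as an in-memory list. None of these details describe the actual LDIF implementation.

```python
import re

TAG = re.compile(r"<[^>]+>")        # assumed notation for formatting data
POINTER = re.compile(r"\{(\d+)\}")  # hypothetical pointer syntax

def analyse(text):
    """Replace each formatting tag by a numbered pointer and
    collect the removed tags in an auxiliary list."""
    auxiliary = []

    def stash(match):
        auxiliary.append(match.group(0))
        return "{%d}" % (len(auxiliary) - 1)

    return TAG.sub(stash, text), auxiliary

def regenerate(text, auxiliary):
    """Rebuild the original document by resolving every pointer."""
    return POINTER.sub(lambda m: auxiliary[int(m.group(1))], text)

original = '<font size="12">Close the valve.</font>'
stripped, aux = analyse(original)
print(stripped)                   # text handed to the linguistic tool
print(regenerate(stripped, aux))  # document rebuilt by the regenerator
```

A real analyser would have to escape literal occurrences of the pointer syntax in the text; the sketch ignores that case.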
The second aspect of the analysis is to interpret the structure linguistically. It looks for: |
Programmer's guides often represent pieces of code in a specific typeface. The occurrence of such a typeface renders the text invariant to the linguistic application. Invariant pieces of text are treated as part of a sentence. To be more precise, the application considers the invariant as a noun. The analyser also generates nouns in this way when it encounters, for example, cross-references, in-line pictures or formulas. |
The analyser also detects interruptions in the text flow. The use of tabulations, for instance, is interpreted as such an interruption. The analyser separates the text before and after the tabulation. Both pieces of text will be handled individually. |
From certain structures that occur in the original document, the analyser derives the sentence type. Depending on the formatting it is possible to tell if the sentence is a title or a list item. The linguistic applications use this information when they generate their results. In a title, for example, the verb can be missing. |
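Splitting at interruptions and deriving the sentence type can be sketched together. The formatting names (`heading`, `bullet`) and the function name `segment` are hypothetical and chosen only for the example; the tab character stands for the tabulation mentioned above.

```python
# Hypothetical mapping from formatting to sentence type.
SENTENCE_TYPES = {"heading": "title", "bullet": "list item"}

def segment(block_text, formatting):
    """Split a block at tabulations and label every piece
    with the sentence type derived from the formatting."""
    sentence_type = SENTENCE_TYPES.get(formatting, "sentence")
    pieces = [piece.strip() for piece in block_text.split("\t")]
    return [(piece, sentence_type) for piece in pieces if piece]

for unit in segment("Pressure\t2.5 bar", "bullet"):
    print(unit)
```

Each resulting unit is handled individually, and the type lets the linguistic application relax its expectations, for instance by accepting a title without a verb.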
One of the most important tasks of an analyser is to restructure the document. As explained before, the structure of a traditional document is not always suited to linguistic interpretation. The position of a footnote in the middle of a sentence does not reflect the order in which the text has to be processed. The analyser leaves a reference in the document and stores the footnote after the paragraph it occurs in. We call this sub-paragraphing. A second aspect of sub-paragraphing is the analysis of, for example, style sheets. Style sheets contain text used for numbering chapters and for generating cross-references. This text has to be treated separately, i.e. in a sub-paragraph. |
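Sub-paragraphing of footnotes can be sketched as follows. The `<fn>` markup and the `[refN]` reference syntax are assumptions made for the example and are not the notation used in LDIF.

```python
import re

FOOTNOTE = re.compile(r"<fn>(.*?)</fn>")  # assumed footnote markup

def sub_paragraph(paragraph):
    """Return the paragraph with references in place of its footnotes,
    followed by the footnote texts as separate sub-paragraphs."""
    notes = []

    def lift(match):
        notes.append(match.group(1))
        return "[ref%d]" % len(notes)

    main = FOOTNOTE.sub(lift, paragraph)
    return [main] + notes

units = sub_paragraph(
    "Tighten the bolt<fn>Use a torque wrench.</fn> before starting."
)
for unit in units:
    print(unit)
```

The sentence is now processed as one uninterrupted unit, and the footnote follows the paragraph as a unit of its own.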
User preparation |
|
The next step is the annotation of the document by the user. Using an appropriate editor, the user can replace strings, mark pieces of text as invariant and split the text into translation units. |
Linguistic application |
|
Each of the linguistic applications adds its own set of elements to LDIF (Lant Document Interchange Format). The translation memory marks |
Machine translation requires a specific way of representing alternative phrases. This is useful when the system has more than one translation for a given phrase. |
The same structure applies to controlled language applications, where it represents possible alternatives for the phrases used. For this type of application, we also need to highlight the areas that do not conform to the defined set of rules and, where needed, to record that the non-conformance in that area has been accepted. |
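A structure for alternative phrases could look like the sketch below. The class `Phrase`, its fields and the sample phrases are hypothetical; the actual LDIF elements are not reproduced here.

```python
class Phrase:
    """A source phrase with zero or more alternative results."""

    def __init__(self, source, alternatives=None, chosen=0):
        self.source = source
        self.alternatives = list(alternatives or [])
        self.chosen = chosen  # index of the alternative currently selected

    def text(self):
        # Fall back to the source when no alternative is available.
        if not self.alternatives:
            return self.source
        return self.alternatives[self.chosen]

def render(phrases):
    """Join the selected alternative of every phrase into one sentence."""
    return " ".join(phrase.text() for phrase in phrases)

sentence = [
    Phrase("Sluit", ["Close", "Shut"]),
    Phrase("de klep.", ["the valve."]),
]
print(render(sentence))
sentence[0].chosen = 1  # the user prefers the second alternative
print(render(sentence))
```

The same shape serves a controlled language checker: the alternatives are then suggested rewordings in the same language rather than translations.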
SGML |
|
We have chosen to use SGML as a format-independent notation. From the beginning, we established the following rules:
|
SGML Analyser |
|
The way it works |
|
Analysing a document without knowing the semantics of the elements is impossible. The user must aid the SGML analyser by specifying the role of the different elements in the document. He must do this in a configuration file. In this file, which is an SGML document, the user declares the roles of the elements and of some of the attributes of a specific DTD. The declarations allow for context sensitivity. |
Valid roles are: |
Sometimes the attribute values contain text that should be translated. If this is the case, the user has to specify which attributes cause a sub-paragraph. |
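Context-sensitive role declarations can be illustrated with the sketch below. The real configuration file is an SGML document; a Python dictionary is used here only to show the lookup logic. All element names, attribute names and role names are hypothetical and do not reproduce the actual list of valid roles.

```python
# Hypothetical role declarations. A key is a path suffix of element names,
# so a longer key expresses a more specific context.
ROLES = {
    ("title",): "title",
    ("para",): "paragraph",
    ("code",): "invariant",
    ("footnote",): "sub-paragraph",
    ("table", "para"): "list item",  # 'para' inside 'table' gets another role
}

# Hypothetical (element, attribute) pairs whose values contain
# translatable text and therefore cause a sub-paragraph.
SUB_PARAGRAPH_ATTRIBUTES = {("img", "alt")}

def role_of(path):
    """Return the role declared for the longest matching suffix
    of the element path, or None when nothing is declared."""
    for start in range(len(path)):
        role = ROLES.get(tuple(path[start:]))
        if role is not None:
            return role
    return None

print(role_of(["doc", "section", "para"]))
print(role_of(["doc", "table", "para"]))
```

Trying the longest suffix first lets a declaration with context override the general declaration for the same element.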
Comparison |
|
The difference between the SGML analyser and a format analyser for a word-processor-specific format such as RTF is the rigid definition of the elements used in SGML. The use of an element in an SGML document adds information as to how the content of that element is to be interpreted. Style sheets, as they are used in RTF for example, merely define the layout. Attaching other meanings to these style sheets is error prone. |
The advantage of using SGML in a linguistic environment is the separation of the document structure from the layout. When using other formats, like RTF, the user has different ways to obtain the same layout. These alternatives may result in different linguistic structures. The use of SGML elements provides us with a more detailed specification of the structure concerned. |
In traditional text processors, there are different ways to achieve an identical layout. Two things may happen: |
Both cases might cause misinterpretations by the analyser. |
The granularity of the document structure is finer in SGML than in other file formats, and the user controls the structure completely. Therefore, the user has more control over the linguistic application and its results. |
Conclusions |
|
The use of SGML together with linguistic applications yields better results. |
However, there are a few conditions to be fulfilled: |
The approach is feasible, as the number of DTDs within a company, and therefore the number of configuration files, is limited. |
Although documents are written principally to be shown (hard copy, help files, web pages or even as data) the translation cost should not be underestimated. This cost can be reduced significantly by linguistic applications. |
If you need your SGML documents to be processed by a linguistic application, it is useful to consider this during the design of the DTD. It will simplify the translation or checking process and will enhance its results. The same holds for other formats, but the linguistic interpretation of those formats can never be defined as rigidly as with SGML. |
Acknowledgments |