Building an SGML-based Publishing Environment   Table of contents   Indexes   SGML &, schemas: from SGML DTDs to XML-DATA.

 
 

The Addition of a Multilingual Component to An Existing Document Processing System


 
Tom   Catteau
  software engineer
  SGML Technologies Group
29 Boulevard General Wahis, 29
B-1030 Brussels   Belgium
Email: tct@sgmltech.com Web: http://www.sgmltech.com
Phone: +32 2 705 70 21
Fax: +32 2 705 81 01
 
Biographical notice:
 
Tom Catteau
 
Tom Catteau is a software engineer at ACSE sa/nv, Brussels, a member of the SGML Technologies Group. He specializes in advanced uses of SGML and the provision of object-oriented solutions. A graduate in Electronic Engineering of the Katholieke Universiteit Leuven, Belgium, he may be contacted at tct@sgmltech.com.
 
ABSTRACT:
multilinguism
 

This paper discusses the addition of a multilingual component to an already existing document processing system, where a trade-off has to be chosen between innovation in terms of new functionality for multilingual processing and the stability of the system.
 
 

Introduction

synoptism
 

Many organizations, both public and private, deal with multilingual documents. A major issue when dealing with such documents is the concern for equivalence between different linguistic versions of one document: the concern for synoptism. The check for synoptism takes place at two levels: at the structural level of the document, and at the content level, where certain types of language-independent information might be present even though they are not explicitly expressed in a DTD.
 
This paper is concerned with the addition of a multilingual system on top of an existing document processing system. Two points are of interest when discussing the addition of a multilingual component. First, modifications to the existing system must be avoided as much as possible, in order to guarantee that the system keeps its current level of stability. Secondly, since new information will be extracted at the content level for the check of synoptism, this information will be given permanency by enriching the DTD at the level of the repository. Since it is preferable not to change the modules which process the repository's content and use the original DTD, two new modules will have to be created. The first of these two modules will be inserted at the point where the repository is updated. It will extract relevant information and replace that information with structure in the repository. The second module, to be inserted at the point where fragments of the repository are retrieved, converts the newly defined elements of the enriched DTD into formatted content, or removes some attribute values. The added value of the enriched DTD will be exploited by the synoptism check module, which will verify the equivalence among different languages and provide a way to indicate points of inconsistency to the reviewers.
 
In the first section, the context of the discussion is set. To do so, a general architecture of a document processing system and of a multilingual repository is described. Then, typical language-independent components are described, as well as a way to formalize them. Then follows a general scheme for updating and extracting fragments into and out of the repository. Subsequently, the synoptism check is discussed. Thereafter, the incremental implementation and the conversion of legacy documents are discussed. Finally, the versioning of multilingual documents is described briefly.
 
This paper is the result of experience gained in several projects, among them a project for the editorial system for the budget of the European Union at OPOCE, the Office for Official Publications of the European Communities in Luxembourg.
 
 

General architecture

 
When documents are published in several languages, the various instances represent the same version of one document, in different languages. We thus can speak of different views of a single document. As a consequence, special care has to be taken to ensure that what is written in the different views of each language instance reflects the same content. For this purpose, a multilingual repository is introduced, in order to distinguish language-independent from language-specific content.
 
 

Multilingual repository

 
In order for the system to be able to manage the consistency of the document, the repository will be split into two parts: the language-specific (LS) repository and the language-independent (LI) repository. The LI repository will hold language-independent features of the document, whereas the LS repository will contain, on a per-language basis, data which is not controlled by the system. Three types of modules will directly operate upon these repositories. These modules include functionality for:
  • creating a fragment of a document in a language, based on the LI-repository's content and the language-specific repository for that language;
  • splitting a fragment of a document into language-independent and language-specific parts;
  • verifying the synoptism rules.
 
The LI repository and the LS repository, together with these modules, give rise to a multilingual repository.
 
 

Language-independent versus language-specific content

 
Language-independent content is a substructure of a fragment whose presence, structure, and/or location can be matched in several (or all) languages present in the repository. This typically includes structure and floating elements within the data content. What remains is what is not explicitly handled by the system and is language-specific. This typically includes most of the data content.
 
Language-independent features are only of interest insofar as they serve at least one of the following two purposes:
  • they are of use in the construction of a fragment of the document (these features decide on the structure of the document fragment);
  • they serve during the checking of the synoptism (here this feature is detected but its presence cannot be enforced automatically).
 
 

DTDs for the LI and the LS-repositories

 
A document will be stored partly in the LI-repository, and partly in the LS-repository. Naturally, each repository will use a DTD derived from the document's DTD.
 
For the sake of clarity, the example in this section uses the scheme for concurrent DTDs, even if in the implementation another scheme might be used.
 
 

The LI-DTD

 
The LI-DTD must reflect both the language-independent structures and the floating elements. These can in turn contain language-independent structures, and so on. First, the case where the document contains one section and two chapters is considered. This tree will be reflected in the LI-DTD's instance as follows:
 
<(LI)SECTION ID=AAFGH>
  <(LI)CHAPTER ID=AAFGI LEAF=Y>
  </(LI)CHAPTER>
  <(LI)CHAPTER ID=AAFHA LEAF=Y>
  </(LI)CHAPTER>
</(LI)SECTION>
 
First note that since the LI-DTD might be used in conjunction with the LS-DTD, and in order to avoid interference between both instances, the LI-DTD will not allow any PCDATA to occur in its instances.
 
Depending on the nature of the document, it might be decided that paragraphs too are part of the language-independent structure and that this level should be included into the language-independent part. The corresponding instance might be:
 
<(LI)SECTION ID=AAFGH>
  <(LI)CHAPTER ID=AAFGI>
    <(LI)P ID=AAFGJ LEAF=Y>
    </(LI)P>
    <(LI)TBL ID=AAFGK>
             ...
    </(LI)TBL>
    <(LI)P ID=AAFGL LEAF=Y>
    </(LI)P>
  </(LI)CHAPTER>
  <(LI)CHAPTER ID=AAFHA>
  </(LI)CHAPTER>
</(LI)SECTION>
 
Note that every element that is part of the language-independent structure must be uniquely identified. This is necessary for the extraction module to be able to locate these structures in the LI-repository.
 
Also note that the LI DTD will be a copy of the document's DTD up to the level that is the same in all languages.
 
The difference between the two examples also makes clear another point. Typically, tables are elements with a language-independent structure. In the first example, the TBL element itself is floating. Its presence can be detected but its location cannot be derived from the LI-repository. The TBL could be included as floating element as follows:
 
<(LI)SECTION ID=AAFGH>
  <(LI)CHAPTER ID=AAFGI LEAF=Y>
    <(LI)TBL ID=AAFGK>
             ...
    </(LI)TBL>
  </(LI)CHAPTER>
  <(LI)CHAPTER ID=AAFHA>
  </(LI)CHAPTER>
</(LI)SECTION>
 
Here TBL has been inserted as an element of the first chapter. However, in this example TBL, being a floating element, will be an inclusion in the CHAPTER element. Again, the TBL element has an ID which will be used for the extraction.
 
Now consider a reference, which will, for example, be caught using regular expression recognition. Its occurrence in one language describes it completely; there is no need for identification of the reference. With a reference in the first chapter, the example might be:
 
<(LI)SECTION ID=AAFGH>
  <(LI)CHAPTER ID=AAFGI LEAF=Y>
    <(LI)TBL ID=AAFGK>
             ...
    </(LI)TBL>
    <(LI)REF ...>
    </(LI)REF>
  </(LI)CHAPTER>
  <(LI)CHAPTER ID=AAFHA>
  </(LI)CHAPTER>
</(LI)SECTION>
 
This example tells us that in the first chapter, a table and a reference should be present in every language, but without any order of occurrence being prescribed.
 
As a general rule, the LI-DTD contains the same structure as the document's DTD, up to a certain level. Floating elements are inclusions to this DTD. All structure elements have an ID, as do the floating elements that have an internal language-independent content. Floating elements that do not have any language-independent content need not have an ID.
 
 

The LS-DTD

 
The LS-DTD will be the document's DTD itself. The document will be stored as it comes from the user. During the extraction, only relevant elements will be kept, and new elements will be added where needed. The instance in the LS-repository could be:
 
<(LS)SECTION ID=AAFGH>
  <(LS)CHAPTER ID=AAFGI>
    <(LS)TBL ID=AAFGK>
             ...
    </(LS)TBL>
    <(LS)P ID=AAFGL >
    </(LS)P>
  </(LS)CHAPTER>
</(LS)SECTION>
 
In this case, the first paragraph of the first chapter and the second chapter will be added at extraction time.
 
 

A general scheme for extraction and update

 
 

Extraction

 
To start the extraction of a fragment (identified by an ID), first there is the extraction of the appropriate fragment out of the LI-repository and the LS-repository. The extraction is then driven by the LI fragment. For the sake of extraction, language-independent data can be subdivided into three categories.
  • Data which can be used to construct a view in another language.
  • Data which can only be used to check synoptism: floating nodes which are not uniquely identifiable within the encompassing element. In this case they cannot be used for the extraction.
  • Floating nodes which are uniquely identifiable within the containing element (inclusion elements of the LS DTD). Either they are already present in the LS-instance, in which case their location is known and their content can be extracted from the LI-instance, or they are not yet present in the LS-instance. They still can be added to the document, for example at the end of the element. This minimizes the user's work, since he or she will only have to move the element to the appropriate place.
 
The resulting fragment consists of the structure which is retrieved from the LI-instance, to which the LS-instance's content is added. When an identified floating element is encountered, it will only be added to the output if it also occurs in the LI-instance.
 
The extraction process itself can be done the usual way, either after having put both structures in memory, or as parsing progresses. In the latter case, both LI and LS instances are parsed concurrently. This approach is better suited for larger documents.
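
The in-memory variant of this extraction can be sketched as follows. This is a minimal illustration, not the implementation used in the projects mentioned: the `Node` class, the `extract` function and the `floating` flag are names introduced here, and the rule that a floating element keeps its position relative to the next structural sibling in the LS-instance is an assumption made for the sake of the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """An element of an instance; nodes of the LI-instance carry no text."""
    name: str
    id: str
    floating: bool = False
    text: str = ""
    children: list[Node] = field(default_factory=list)


def extract(li: Node, ls: Node | None) -> Node:
    """Build the output element: structure from LI, content from LS."""
    out = Node(li.name, li.id, li.floating, ls.text if ls else "")
    ls_children = ls.children if ls else []
    ls_by_id = {c.id: c for c in ls_children}
    li_floating = {c.id: c for c in li.children if c.floating}
    li_structural = {c.id for c in li.children if not c.floating}

    # Remember which structural sibling each known floating element precedes
    # in the LS-instance; LS elements unknown to the LI-instance are dropped.
    before: dict[str | None, list[Node]] = {}
    pending: list[Node] = []
    for child in ls_children:
        if child.id in li_floating:
            pending.append(child)
        elif child.id in li_structural:
            before[child.id] = pending
            pending = []
    before[None] = pending  # floating elements after the last structural one

    # The structural skeleton always follows the order of the LI-instance.
    for li_child in li.children:
        if li_child.floating:
            continue
        for f in before.get(li_child.id, []):
            out.children.append(extract(li_floating[f.id], f))
        out.children.append(extract(li_child, ls_by_id.get(li_child.id)))
    for f in before[None]:
        out.children.append(extract(li_floating[f.id], f))
    # Floating elements known to the LI-instance but missing in this language
    # are appended at the end, for the user to move into place.
    for fid, li_f in li_floating.items():
        if fid not in ls_by_id:
            out.children.append(extract(li_f, None))
    return out


li = Node("CHAPTER", "AAFGI", children=[
    Node("P", "AAFGJ"),
    Node("TBL", "AAFGK", floating=True),
    Node("P", "AAFGL"),
])
ls = Node("CHAPTER", "AAFGI", children=[
    Node("TBL", "AAFGK", text="..."),
    Node("P", "AAFGL", text="Second paragraph."),
])
result = extract(li, ls)
print([c.id for c in result.children])  # ['AAFGJ', 'AAFGK', 'AAFGL']
```

The first paragraph, absent from the LS-instance, is created empty at extraction time, while the table keeps the location it had in the LS-instance.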
 
 

Storage

 
When storing a fragment of the document in a particular language in the repository, it is necessary to distinguish between two cases: in the first, there is a master language; in the second there is not.
 
 

Master language

 
When there is a master language, only that language has the right to update the LI-repository. From an organizational point of view, this is often the simplest way to ensure the propagation of language-independent data across all instances. An update of the document in the master language results in an update of the LI-repository. Updates in other languages will not affect the language-independent data repository. The master language choice is often a functional one: a document is written in the master language; only then is it translated into the other languages. The notion of master language is equivalent to that of a user with more access and lock priority than the others.
 
 

No master language: locking mechanism and differential updates.

 
When there is no master language, all languages are equivalent, technically speaking. When extracting a portion of a document, that fragment consists of both language-specific and language-independent content. This means that the update will also consist of an update of both the language-specific and the language-independent part. There can be a conflict if two languages want a lock on overlapping parts of the document. This conflict can be resolved through the use of differential updates. With this construction, the updated version is compared with the version at lock-time. Only modified parts will be added to the repository. Using this principle, non-exclusive locks become possible; they offer a significant improvement to system flexibility. Conflicts are of course still possible, but this is an organizational problem.
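
A differential update can be sketched as below, assuming that a fragment is represented as a mapping from element ID to content and that a snapshot of the fragment is kept at lock-time. The function name and the conflict report are illustrative assumptions; the sketch only detects conflicts and leaves their resolution to the organization.

```python
def differential_update(repository, at_lock, updated):
    """Store only the parts that differ from the lock-time snapshot.

    Returns (stored_ids, conflicting_ids). A part conflicts when the
    repository no longer holds the lock-time value, that is, when another
    user has updated it in the meantime.
    """
    stored, conflicts = [], []
    for part_id, content in updated.items():
        if at_lock.get(part_id) == content:
            continue  # unchanged since lock-time: nothing to store
        if repository.get(part_id) != at_lock.get(part_id):
            conflicts.append(part_id)  # left for the organization to resolve
        else:
            repository[part_id] = content
            stored.append(part_id)
    return stored, conflicts


repository = {"AAFGJ": "first", "AAFGL": "second"}
lock_a = dict(repository)  # snapshot taken by one user at lock-time
lock_b = dict(repository)  # non-exclusive lock on the same fragment

# Each user changed a different part, so both updates go through.
print(differential_update(repository, lock_a, {"AAFGJ": "first, revised", "AAFGL": "second"}))
print(differential_update(repository, lock_b, {"AAFGJ": "first", "AAFGL": "second, revised"}))
```

Because the second update does not write back its unchanged copy of the first part, the revision made under the first lock survives.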
 
 

The checking of synoptism

 
The goal of synoptism checks is to verify the consistency of the different language views of a document. Synoptism checks are always performed on the richest form of the DTD. Again it is necessary to distinguish among structure elements, identifiable floating elements, and other floating elements. Since structure elements are used for the construction of a document, by definition there is synoptism at that level. It is thus possible to confine the discussion to floating elements. For floating elements, only their presence can be verified. Two approaches are possible with regard to the check of synoptism: a global and a language-based approach.
 
 

Global check

 
In a global check, a particular feature is tested across all language views. A convenient way of implementing synoptism checks is to add a language attribute to all floating elements in the LI-DTD. During the update of a fragment in a particular language, this attribute is then updated accordingly. To check the synoptism then means to check the presence of the languages in the language attribute.
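
As a rough sketch, the language attribute can be modelled as a set of language codes per floating element. The function names, the language codes and the second element ID are assumptions introduced for illustration only.

```python
def record_update(li_floating, language, present_ids):
    """Update the language attribute after a fragment is stored in `language`."""
    for element_id in present_ids:
        li_floating.setdefault(element_id, set()).add(language)
    for element_id, languages in li_floating.items():
        if element_id not in present_ids:
            languages.discard(language)


def global_check(li_floating, repository_languages):
    """Return, per floating element, the languages in which it is missing."""
    report = {}
    for element_id, languages in li_floating.items():
        missing = set(repository_languages) - languages
        if missing:
            report[element_id] = missing
    return report


li_floating = {}  # element ID -> value of its language attribute
record_update(li_floating, "en", {"AAFGK", "AAFGM"})
record_update(li_floating, "fr", {"AAFGK"})
print(global_check(li_floating, ["en", "fr"]))  # {'AAFGM': {'fr'}}
```

The report gives the reviewers the points of inconsistency: here one floating element is present in the English view but not in the French one.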
 
 

Language-based approach

 
As an extension to the global checking, language-based checking can be used. For several reasons, it could be appropriate to perform more checks on one pair of languages than on another. Reasons could be that two languages are more similar to each other than other languages are; or simply that there is a better knowledge of some languages than of others. There may also be other external reasons.
 
This can be easily implemented by a configuration that states which languages are aware of a certain feature. In that case, the language attribute of elements will only be checked for the presence of those languages which know of that feature. Special cases must be implemented by ad hoc customizing of the synoptism check.
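
Such a configuration might look like the following sketch. The table `AWARE_LANGUAGES`, the function name and the sample data are hypothetical; features without an entry in the configuration are simply not checked, which is an assumption of this example.

```python
def language_based_check(elements, aware_languages):
    """Check each floating element only against the languages aware of it.

    `elements` is a list of (feature, element_id, languages_present) tuples.
    """
    report = []
    for feature, element_id, present in elements:
        expected = aware_languages.get(feature, set())
        missing = expected - present
        if missing:
            report.append((feature, element_id, sorted(missing)))
    return report


# Hypothetical configuration: which languages know of which feature.
AWARE_LANGUAGES = {"TBL": {"en", "fr", "nl"}, "REF": {"en", "fr"}}

elements = [
    ("TBL", "AAFGK", {"en", "fr"}),
    ("REF", "AAFGM", {"en", "fr"}),
]
print(language_based_check(elements, AWARE_LANGUAGES))  # [('TBL', 'AAFGK', ['nl'])]
```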
 
 

Non-coercive versus coercive implementation

 
With a coercive approach, only updates which conform to the synoptism rules are actually performed. In this case, there must be a master language against which the other languages can be matched. With a non-coercive approach, even inconsistent documents are updated. The reasoning behind this is that an incomplete new version is better than a complete but 'outdated' version.
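
The difference between the two approaches can be illustrated with a small sketch. It assumes that each stored view records the set of floating elements it contains and that synoptism means equality of that set with the master view; both the data layout and the function name are assumptions made for this example.

```python
def update_view(repository, language, fragment, master_language, coercive):
    """Store `fragment` for `language`; return (stored, synoptic).

    `fragment` is {"features": set of floating element IDs, "content": str}.
    """
    master = repository.get(master_language)
    synoptic = (
        master is None
        or language == master_language
        or fragment["features"] == master["features"]
    )
    if coercive and not synoptic:
        return False, synoptic  # coercive: the inconsistent update is refused
    repository[language] = fragment  # non-coercive: stored even if inconsistent
    return True, synoptic


repository = {"en": {"features": {"AAFGK"}, "content": "..."}}
incomplete = {"features": set(), "content": "..."}
print(update_view(repository, "fr", incomplete, "en", coercive=True))   # (False, False)
print(update_view(repository, "fr", incomplete, "en", coercive=False))  # (True, False)
```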
 
 

An incremental approach to implementation

 
Owing to the brittle nature of language synoptism, it is important to perform an incremental implementation. Incremental implementation can be done at two levels concurrently: feature by feature, and language by language.
 
 

Feature by feature implementation

 
Typically, there is a need for conversion from the DTD in the repository to some user format. Even when an SGML editor is used, tables, for example, still need to be converted to one of the formats supported by the editor. Other operations also have to be performed. Thus, between the extraction from the repository and the delivery to the user, the document has to pass through various modules.
 
As experience shows, every change in a DTD, even a minor one, ripples through the complete system, and has an impact on most modules. To ensure the stability of the existing system, the DTD should remain unchanged. This means that between the repository and the first module, a downwards conversion has to be performed. On the way back, an upwards conversion is needed in order to enrich the document.
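
The pair of conversions can be sketched on plain strings as follows. The reference pattern (a number, a slash and a four-digit year), the REF element name as used here and the sample sentence are all hypothetical; a real system would work on parsed instances rather than on raw text.

```python
import re

# Hypothetical pattern for a language-independent reference such as "12/1998".
REFERENCE = re.compile(r"\b\d{1,4}/\d{4}\b")


def upward(text):
    """Enrich on the way back: wrap recognized references in a REF element."""
    return REFERENCE.sub(lambda m: f"<REF>{m.group(0)}</REF>", text)


def downward(text):
    """Flatten on the way out: existing modules see the original DTD only."""
    return re.sub(r"</?REF>", "", text)


original = "See decision 12/1998 for details."
enriched = upward(original)
print(enriched)            # See decision <REF>12/1998</REF> for details.
print(downward(enriched))  # See decision 12/1998 for details.
```

The round trip leaves the text delivered to the existing modules unchanged, which is what keeps those modules untouched by the enrichment.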
 
The disadvantage of this approach is that there is no possibility at the user level to benefit from the new functionality when inserting new data. Only after an addition has proved useful can the adaptation of the modules be considered. This will allow the users to benefit from the proven functionality.
 
In a feature by feature implementation, one language-independent feature at a time is added to the system. Only after the upwards conversion has succeeded is it possible to go to the next feature.
 
 

Language by language implementation

 
Here, the new feature is first tested for a set of languages. After that, it can be extended to the others.
 
 

The conversion of legacy documents.

 
As long as a particular feature is not enforced by a process, that is, as long as the checking of that feature is left to the user, there will be no error-free existing instances with respect to that feature. This means almost by definition that after the conversion from the current DTD to the one which includes the new features (a process usually done by pattern matching), the resulting document will not be synoptic with regard to that new feature. The quality of the automatic conversion will of course depend on the quality of the original documents. Even with inconsistencies in the repository, the situation can only be better than it was before the introduction of the new feature.
 
 

Versioning of multilingual documents

 
 

Versioning

 
The goal of versioning is to be able to retrieve different versions in the course of the life-cycle of a document.
 
The notion of versioning of documents in itself is clear. It means that it must be possible to retrace every version of a document during the life-cycle of the document, where there is a new version at each update of the repository. When only one language is concerned, updated fragments will be consistent, and it is not difficult to keep the complete document consistent.
 
 

Versioning for multi-lingual documents.

 
When speaking of multi-lingual documents, the notion of versioning becomes less self-evident, and the question of consistency is more difficult to answer.
 
 

Versioning

 
Consider the following situation. In one language, several changes are made which have an impact on the language-independent content. Then, even without any explicit changes to the other language views, they too will have changed. The situation is still more difficult to handle where during a lock on a fragment in one language, the corresponding constructive part is changed. In this context, there is a wish to answer the following question: how is it possible to determine or extract a version of a document from the repository, and what is the value of that extracted version?
 
For the versioning, the following system is used. A version number consists of two numbers. The first is the version number for the language-independent part. The second is incremented each time a language part is updated without modification to the structure part.
 
The extraction itself is rather simple. On the basis of a given version number, the structure fragment corresponding to the first number is taken, and for each language, the language fragment corresponding to the second number; in case that version number does not exist for the given language, the previous version is taken.
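
A minimal sketch of this lookup is given below. The two stores, their layout and the function name are assumptions introduced for illustration; in particular, the second number is treated as a plain integer that is compared across updates, which the scheme above does not spell out.

```python
def extract_version(li_store, ls_store, version):
    """Return (structure, {language: fragment}) for a two-part version number."""
    li_no, ls_no = version
    structure = li_store[li_no]
    views = {}
    for language, fragments in ls_store.items():
        # Fall back to the latest earlier version if ls_no was never stored
        # for this language.
        available = [n for n in fragments if n <= ls_no]
        if available:
            views[language] = fragments[max(available)]
    return structure, views


li_store = {1: "structure v1", 2: "structure v2"}
ls_store = {
    "en": {1: "en text 1", 2: "en text 2", 3: "en text 3"},
    "fr": {1: "fr text 1", 3: "fr text 3"},
}
print(extract_version(li_store, ls_store, (1, 2)))
# ('structure v1', {'en': 'en text 2', 'fr': 'fr text 1'})
```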
 
 

Consistency

 
Consistency can only be guaranteed to the extent that is delivered by the language-independent data which is explicitly present in the data repository. No other consistency can be guaranteed, unless all language versions of a document are updated at the same time. This is a constraint that usually cannot be enforced. Indeed, different people working on different languages work at a different pace, and synchronization comes down to the lowest common denominator. This means that most of the time the repository might not be synoptic.
 
It could be part of the workflow to prevent entering a particular state (e.g. the dissemination state) unless the rules of synoptism are obeyed. However, care must be taken. When synoptism is enforced, users can sometimes be seen to circumvent the synoptism check by inserting spaces or dots to fill structure elements they do not need, but without whose presence there is no synoptism. This means that the benefit of enforcing rules has to be carefully weighed against the risk of users fiddling with the document so as to pass it through a state.
 
 

Differences among versions

 
It is frustrating when there is a need to skim many versions of small fragments in order to find two consecutive versions with differences. Since only the modified parts are physically updated, finding the previous different version of a fragment comes down to going through the previously stored parts of the fragment and retaining the highest version number.
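
Assuming that the repository records, for each part of a fragment, the version numbers at which that part was physically stored, the search can be sketched as follows. The mapping and the function name are hypothetical.

```python
def previous_different_version(stored_versions, part_ids, current):
    """Return the latest version before `current` at which any part of the
    fragment was physically stored, or None if there is no such version."""
    earlier = [
        version
        for part_id in part_ids
        for version in stored_versions.get(part_id, ())
        if version < current
    ]
    return max(earlier, default=None)


# Part ID -> versions at which the part was actually written to the repository.
stored_versions = {"AAFGJ": [1, 4], "AAFGL": [1, 2, 7]}
print(previous_different_version(stored_versions, ["AAFGJ", "AAFGL"], 7))  # 4
```

Versions 5 and 6 are skipped without being inspected, since no part of the fragment was stored at those versions.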
 
 

Conclusion.

 
This paper has described how to add a multilingual component to an existing document processing system. As soon as there are enough similarities among the different languages, synoptism becomes a workable way to improve the quality of a document; the ultimate goal of a multilingual component is the enhancement of the overall quality of the document from the point of view of its language-independent aspects.
 
Quality of a document may be hard to quantify, but it has been shown that as a tool to improve the quality of that document, synoptism certainly has an added value.
