SGML in a Multilingual Environment |
| Pierre Colot |
| Software Engineer |
| SGML Technologies Group
56 Rue Glesener, Luxembourg L-1630 Luxembourg Phone: +352 29 21 22 Fax: +352 29 21 20 Email: pco@isea.lu Web: www.sgmltech.com |
Keywords: documents, multilingual validation |
ABSTRACT: |
The problems relating to multilingual documents are introduced in the context of the official publications of the European Union. Currently dealing with eleven languages, the new system has to be capable of dealing with yet more. The solution is the development of an SGML (Standard Generalized Markup Language) validation workshop, able to take account of all the problems which arise in an efficient and effective manner. |
Introduction |
New strategies have to be developed to solve these multilingual problems. It is held that SGML can assist in several ways. The separation between structure and content, and between character set and coding, offers possible solutions. |
|
Global Presentation |
This section outlines the reasons that led to the adoption of a multilingual approach when building the workshop. Firstly, the context in which the multilingual publications are used is discussed. Secondly, the workload created by multiple language versions is described. Thirdly, the economic factors involved in managing the cost of these multiple versions are highlighted. Fourthly, the management of the complexity of numerous DTDs is presented. |
Context |
In the early 1980s the Office for Official Publications of the European Communities (OPOCE) was looking for a standard format for exchanges of its electronic publications. Given the wide variety of documents involved and the heterogeneous nature of existing computing systems, the search turned towards the use of an international standard. The SGML standard emerged as the best solution, as its flexibility made it possible to describe and monitor the various structures used in the publications. To do this, a European character set was specified, the SGML declaration was formalized, and the DTDs were defined. |
The first version of the Formalized Exchange of Electronic Publications (Formex) standard appeared in 1985. |
In the context of the contracts which the SGML Technologies Group has undertaken for OPOCE, publication certification workshops have been established. These workshops produce the certified versions of the Official Journal C, L, and S series. |
Within these workshops, the tasks of several dozen translators are distributed and assisted by the documentary systems. Some eight thousand pages are produced every week. The function of these workshops is to monitor SGML conformity and to check for consistency between the paper version of the publications and the corresponding electronic version. |
Workload |
The end of the 1990s brought the prospect of the enlargement of the European Union, the diversification of publications, and the widespread use of distribution techniques on the Web. In order to respond to these new challenges, plans were made for a complete revision of Formex and the reorganization of the production cycle. The forthcoming accession of central European countries entailed the need to support Slavic languages. By extending the character set to include ISO character entities, it became possible to write treaties between the European Union and various international partners. Once each type of publication had been classified, some forty DTDs were identified. Following a complex revision process, twenty-three Formex V3 DTDs came into production in April 1999. |
To ensure consistency between the paper versions and the electronic versions, the new production cycle requires that the photocomposition generation carried out by the printers be based on the original SGML instances. |
Economic Considerations |
The constant increase in the number of languages to be checked and the diversification of the documents to be processed have resulted in a substantial rise in the workload and the cost of validation. A new approach to validation was sought in an effort to curb costs. On the one hand, the reorganization of the production cycle ensures consistency between the paper and electronic versions of the publication. On the other hand, the validation mechanism exploits the symmetry that exists among the various language versions of the same publication, which makes it possible to highlight independent transformations across all language versions. |
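The symmetry argument above can be illustrated with a minimal Python sketch. It is not the workshop's implementation: it assumes that language versions of the same publication share the same sequence of tags, uses a simple regular expression rather than a real SGML parser (so tag minimization, comments, and marked sections are not handled), and the element names `ARTICLE`, `TI`, and `P` are hypothetical.

```python
import re
from collections import Counter

# Matches start and end tags such as <P>, </P>, <TI ID="x">; not a full SGML parser.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^>]*>")


def skeleton(instance: str) -> list[str]:
    """Return the sequence of start and end tags, ignoring text and attributes."""
    return [("/" if m.group(1) else "") + m.group(2).upper()
            for m in TAG.finditer(instance)]


def asymmetric_versions(versions: dict[str, str]) -> list[str]:
    """Return the language codes whose tag skeleton differs from the most common one."""
    skeletons = {lang: tuple(skeleton(text)) for lang, text in versions.items()}
    # Assumption: the skeleton shared by most versions is the reference.
    majority, _ = Counter(skeletons.values()).most_common(1)[0]
    return sorted(lang for lang, sk in skeletons.items() if sk != majority)


versions = {
    "EN": "<ARTICLE><TI>Title</TI><P>Text</P></ARTICLE>",
    "FR": "<ARTICLE><TI>Titre</TI><P>Texte</P></ARTICLE>",
    "DE": "<ARTICLE><TI>Titel</TI><P>Text</P><P>Zusatz</P></ARTICLE>",
}
print(asymmetric_versions(versions))  # prints ['DE']
```

Only the version that breaks the symmetry needs human attention, which is where the saving in validation cost would come from under these assumptions.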
Managing the Complexity |
OPOCE publications are SGML instances which comply with the Formex standard. Since 1985 the DTDs which make up this format have undergone three major revisions and eight minor revisions. Following the specialization of the Formex V3 DTDs, their number has increased from six to thirty-five. The major Formex V3 revision has been entering production progressively since the beginning of 1999. The last DTDs making up this version are currently (spring 1999) in the final consolidation phase. The final version, on the basis of which the complete life cycle of the publication will be undertaken in Formex V3, is Formex V3.0.1 or Formex V3.0.2, depending on the DTD. Production started in April 1999. |
After the definition of the new version of Formex, the publication certification process evolved. The certification process, which previously only validated the SGML conformance of a publication, has now been supplemented by a series of rules that monitor how Formex V3 is used when each publication is written. |
|
The various language versions are consolidated by highlighting the multilingual structures and content within the publications. This multilingual version of the publication can be used to manipulate the various language versions of a publication simultaneously. |
General Architecture |
The quest for an integrated multilingual solution leads to the re-examination of the certification and the storage process within the workshop. |
This section describes the building of a general architecture and the conceptual approaches that are required to support the multilingual aspect. The problem is considered from different angles. Certification, storage, consolidation, and modification of multilingual publications are discussed, with particular reference to the provision of an efficient and integrated multilingual system. |
The general architecture of a multilingual workshop comprises four modules. |
|
[Figure not reproduced: the four modules of the multilingual workshop]
Reception receives a publication, inserts it into synoptic storage, and then launches certification. |
The certification process has been subdivided into an automatic validation phase and a manual verification phase. The automatic validation phase carries out all the objective checks, which do not require any human intervention. The manual verification phase identifies situations that are likely to contain an error of compliance with the rules for use; it issues a warning together with its justification, and in this way directs the manual validation process. |
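The split between the two phases can be sketched as follows. This is an illustrative Python model only: the `Finding` type, the two checks, and their rule names are hypothetical and are not taken from Formex or from the workshop. The point is that objective checks yield errors, whereas heuristic checks yield warnings that carry a justification for the human validator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    severity: str        # "error" (automatic validation) or "warning" (manual verification)
    rule: str            # hypothetical rule name
    justification: str   # why the finding was raised


Check = Callable[[str], list[Finding]]


def no_replacement_characters(instance: str) -> list[Finding]:
    # Hypothetical objective check: U+FFFD signals a character that failed to decode.
    if "\ufffd" in instance:
        return [Finding("error", "charset", "instance contains U+FFFD")]
    return []


def empty_paragraph(instance: str) -> list[Finding]:
    # Hypothetical heuristic: an empty <P> may be intentional or may mean lost text.
    if "<P></P>" in instance:
        return [Finding("warning", "empty-paragraph",
                        "an empty <P> may indicate lost text")]
    return []


def certify(instance: str, automatic: list[Check],
            manual: list[Check]) -> tuple[list[Finding], list[Finding]]:
    """Run both phases and return (errors, warnings)."""
    errors = [f for check in automatic for f in check(instance)]
    warnings = [f for check in manual for f in check(instance)]
    return errors, warnings


errors, warnings = certify("<ARTICLE><P></P></ARTICLE>",
                           [no_replacement_characters], [empty_paragraph])
for finding in warnings:
    print(f"{finding.rule}: {finding.justification}")
```

Whether the manual phase runs only after the automatic phase has passed is not stated in the text; the sketch simply runs both.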
|
Manual editing is carried out by means of an SGML editor, the publication being edited in accordance with its DTD. The results of the automatic validation and the manual verification are presented in a specialized interface. Support functions are integrated to locate errors, confirm diagnoses, and facilitate repetitive corrections. |
The relevant synoptic status is attached to each element in the instance. When the element exists in all the languages, it is multilingual. Any modification of a multilingual element is automatically repeated in all the language versions. Any modification in a monolingual element affects only its native language. An element may be defined as being monolingual in order to limit the dissemination of the modification. |
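The propagation rule described above can be modelled in a few lines of Python. This is a sketch under assumptions, not the workshop's data model: the class `SynopticElement`, the flag `forced_monolingual`, and the representation of a modification as a function applied to each language's content are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SynopticElement:
    element_id: str
    content: dict[str, str]            # language code -> content in that language
    forced_monolingual: bool = False   # set to limit the spread of a modification

    def is_multilingual(self, languages: set[str]) -> bool:
        """Multilingual means present in every language and not forced monolingual."""
        return not self.forced_monolingual and languages.issubset(self.content)


def apply_modification(element: SynopticElement, language: str,
                       transform: Callable[[str], str],
                       languages: set[str]) -> list[str]:
    """Apply the modification and return the languages that were touched."""
    if element.is_multilingual(languages):
        targets = sorted(languages)    # repeated in all language versions
    else:
        targets = [language]           # affects only the native language
    for lang in targets:
        element.content[lang] = transform(element.content[lang])
    return targets


LANGUAGES = {"EN", "FR"}
element = SynopticElement("art1.p1", {"EN": "See Annex 12.", "FR": "Voir l'annexe 12."})
print(apply_modification(element, "EN", lambda s: s.replace("12", "21"), LANGUAGES))
print(element.content["FR"])
```

In the usage example the correction of the annex number, made while editing the English version, is also applied to the French version because the element exists in both languages.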
As soon as the certification process is complete, the document is sent for final archiving and an electronic publication is produced. |
Advantages of Using SGML |
In making the choice of a standard for document representation, there are many considerations. What possibilities does SGML have to offer in the context of the provision of an efficient and integrated multilingual system? How can they be turned to advantage? How can they be applied in this new approach? Among the many advantages that can be discussed, language-independent tagging will be considered as a lever in the proposed solution to this particular multilingual problem. |
|
Conceptual Description |
The case study is of the new SGML validation workshop developed for the publication of laws in the Official Journal of the European Union. There are currently eleven official languages within the European Union. Publications considered by the workshop are therefore multilingual. |
Flows within the workshop are shown in the diagram below. |
|
[Diagram not reproduced: flows within the workshop]
During automatic validation, the following controls are chained: |
|
[Diagram not reproduced: chain of automatic validation controls]
The semi-automatic validation is subdivided into processes, the sequence in which they are carried out being illustrated below. |
|
[Diagram not reproduced: sequence of semi-automatic validation processes]
AIP = Arcel In the Pocket (HTML interface to browse and obtain SGML publications from Storage 2 and 3) |
CREJO = Bibliographic database |
The semi-automatic validation process consists of immediate processing, which must be carried out within a twenty-four-hour timeframe, and delayed processing, which depends on the priorities at that time. Each process follows the same pattern. The end-user edits a publication with an SGML editor. The consolidation of the linguistic versions is carried out automatically, based on the configuration and the range that is associated with the modification at the time of editing. This consolidation is done in a transparent and asynchronous manner at the time of synoptic storing. Moreover, immediate export is carried out at the end of the immediate and semi-automatic validation process. When the semi-automatic validation process is complete, the final archiving is carried out. |
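The distinction between immediate and delayed processing can be pictured as a two-tier work queue. The Python sketch below is an assumption about how such scheduling might be modelled, not a description of the workshop's software: the class `ValidationQueue`, its method names, and the convention that a lower priority number is more urgent are all hypothetical.

```python
import heapq
from datetime import datetime, timedelta

IMMEDIATE_WINDOW = timedelta(hours=24)  # the twenty-four-hour timeframe


class ValidationQueue:
    """Immediate work first, ordered by deadline; delayed work after, by priority."""

    def __init__(self) -> None:
        self._heap: list[tuple] = []
        self._counter = 0  # tie-breaker that preserves submission order

    def submit(self, publication_id: str, received: datetime,
               immediate: bool, priority: int = 0) -> None:
        if immediate:
            key = (0, received + IMMEDIATE_WINDOW)  # tier 0: earliest deadline first
        else:
            key = (1, priority)                     # tier 1: lowest number first
        heapq.heappush(self._heap, (key, self._counter, publication_id))
        self._counter += 1

    def next_publication(self) -> str:
        return heapq.heappop(self._heap)[2]


start = datetime(1999, 4, 1, 9, 0)
queue = ValidationQueue()
queue.submit("B", start, immediate=False, priority=1)
queue.submit("A", start + timedelta(hours=1), immediate=True)
queue.submit("C", start, immediate=True)
print([queue.next_publication() for _ in range(3)])  # prints ['C', 'A', 'B']
```

The tier number is compared first, so a delayed publication is never scheduled ahead of one that is still inside its twenty-four-hour window.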
Conclusions |
The addition of several more language versions of official documents to be published by OPOCE led to a reappraisal of the method then employed. SGML was always the key technology used, so the solution was to devise the SGML validation workshop, thus capitalizing on early work and experience gained. |
Already there are plans for enhancement with respect to possible integration with full-text search engines. That is in the future. Today the signs are that a more efficient and effective way has been developed to address the problem, one which has the potential to cope with many more languages in the years to come. |
From the end-users' point of view, how will this new approach affect their work? What modifications will be required in their organization? These questions are still to be addressed. The answers will come only after the SGML workshop has been operational for several months. |