
 

SGML in a Multilingual Environment

Pierre Colot
Software Engineer
SGML Technologies Group
56 Rue Glesener, L-1630 Luxembourg, Luxembourg
Phone: +352 29 21 22
Fax: +352 29 21 20
Email: pco@isea.lu
Web: www.sgmltech.com
 
Biographical notice:
 
Pierre Colot is a senior software engineer employed by ISEA in Luxembourg, a member of the SGML Technologies Group. He specializes in SGML workshop development. Prior to this he built information systems to manage WAN hardware cables in Belgium and participated in the European Ariane 5 space programme. He graduated from the Free University of Brussels in 1988. He may be contacted at pco@isea.lu.
 
Keywords: documents, multilingual, validation

ABSTRACT:

The problems relating to multilingual documents are introduced in the context of the official publications of the European Union. The new system currently deals with eleven languages and has to be capable of dealing with yet more. The solution is the development of an SGML (Standard Generalized Markup Language) validation workshop able to take account, efficiently and effectively, of all the problems which arise.
 

Introduction

 
The introduction of every new country to the European Union implies an increase in the number of national languages used in official publications. This increase means that the validation cycle of the publications has to be reviewed.
 
In managing multilingual publications, questions such as the following have to be addressed:
  •  how to represent multiple character sets within a publication;
  •  how to represent multiple languages within a publication;
  •  how to consolidate the different language versions of the same publication;
  •  how to modify the structure of all language versions simultaneously;
  •  how to publish the different language versions of an extract of a publication;
  •  how to browse through a multilingual publication.
 
New strategies have to be developed to solve these multilingual problems. It is held that SGML can assist in several ways: its separation of structure from content, of character set from coding, and so on, offers possible solutions.
 
This paper comprises four main parts:
  •  a global presentation of the context of the problems;
  •  the general architecture of a workshop that is faced with the problems;
  •  the advantages of using SGML in their solution;
  •  a conceptual description of how to manage multilingual publications within the newly-developed SGML validation workshop.
 

Global Presentation

 
The reasons that led to the adoption of a multilingual approach when building the workshop are outlined in four steps. Firstly, the context in which the multilingual publications are used is discussed. Secondly, the workload created by multiple language versions is described. Thirdly, the economic factors involved in managing the cost of these multiple versions are highlighted. Fourthly, the management of the complexity of numerous DTDs is presented.
 

Context

 
In the early 1980s the Office for Official Publications of the European Communities (OPOCE) was looking for a standard format for exchanges of its electronic publications. Given the wide variety of documents involved and the heterogeneous nature of existing computing systems, the search turned towards the use of an international standard. The SGML standard emerged as the best solution, as its flexibility made it possible to describe and monitor the various structures used in the publications. To do this, a European character set was specified, the SGML declaration was formalized, and the DTDs were defined.
 
The first version of the Formalized Exchange of Electronic Publications (Formex) standard appeared in 1985.
 
In the context of the contracts which the SGML Technologies Group has undertaken for OPOCE, publication certification workshops have been established. These workshops produce the certified versions of the Official Journal C, L, and S series.
 
Within these workshops, the tasks of several dozen translators are distributed and assisted by the documentary systems. Some eight thousand pages are produced every week. The function of these workshops is to monitor SGML conformity and to check for consistency between the paper version of the publications and the corresponding electronic version.
 

Workload

 
The end of the 1990s brought the prospect of the enlargement of the European Union, the diversification of publications, and the widespread use of distribution techniques on the Web. In order to respond to these new challenges, plans were made for a complete revision of Formex and the reorganization of the production cycle. The forthcoming accession of central European countries entailed the need to support Slavic languages. By extending the character set to include ISO character entities, it became possible to write treaties between the European Union and various international partners. Once each type of publication had been classified, some forty DTDs were identified. Following a complex revision process, twenty-three Formex V3 DTDs came into production in April 1999.
 
To ensure consistency between the paper versions and the electronic versions, the new production cycle requires that the photocomposition generation carried out by the printers be based on the original SGML instances.
 

Economic Considerations

 
The constant increase in the number of languages to be checked and the diversification of the documents to be processed have resulted in a substantial rise in the workload and the cost of validation. A new approach to validation was sought in an effort to curb costs. On the one hand, the reorganization of the production cycle ensures consistency between the paper and electronic versions of the publication. On the other hand, the validation mechanism exploits the symmetry that exists among the various language versions of the same publication, which makes it possible to highlight the independent transformations in all language versions.
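
The use of symmetry among language versions can be illustrated with a minimal sketch. It assumes that each language version is available as well-formed markup that Python's xml.etree.ElementTree can parse (real Formex instances are SGML and would need an SGML parser), and the element names are hypothetical. The sketch reduces each version to its element structure, ignoring the text, and reports the languages whose structure departs from the majority.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def skeleton(elem):
    """Return the element structure as nested tuples, ignoring all text."""
    return (elem.tag, tuple(skeleton(child) for child in elem))


def asymmetric_versions(versions):
    """Return the sorted language codes whose structure differs from the majority.

    `versions` maps a language code to the markup of that language version.
    """
    skeletons = {lang: skeleton(ET.fromstring(text)) for lang, text in versions.items()}
    reference, _count = Counter(skeletons.values()).most_common(1)[0]
    return sorted(lang for lang, skel in skeletons.items() if skel != reference)


# Hypothetical element names; real Formex V3 tags are not reproduced here.
versions = {
    "en": "<act><title>Regulation</title><p>First.</p><p>Second.</p></act>",
    "fr": "<act><title>Règlement</title><p>Premier.</p><p>Deuxième.</p></act>",
    "de": "<act><title>Verordnung</title><p>Erstens.</p></act>",
}
print(asymmetric_versions(versions))
```

Only the versions that break the symmetry need a closer look, which is one way in which the cost of validation might be kept from growing in proportion to the number of languages.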
 

Managing the Complexity

 
OPOCE publications are SGML instances which comply with the Formex standard. Since 1985 the DTDs which make up this format have undergone three major revisions and eight minor revisions. With the specialization of the Formex V3 DTDs, their number has increased from six to thirty-five. The major Formex V3 revision has been entering production since the beginning of 1999. The last DTDs making up this version are currently (spring 1999) in the final consolidation phase. The final version, on which the complete life cycle of the publication in Formex V3 will be based, is Formex V3.0.1 or Formex V3.0.2, depending on the DTD. Production started in April 1999.
 
After the definition of the new version of Formex, the publication certification process evolved. The certification process, which previously merely validated the SGML conformance of a publication, has now been supplemented by a series of rules for monitoring the way in which Formex V3 is used when each publication is written.
 
In order to include all the character sets likely to be used in a publication, the following mechanisms have been specified and rules for use have been drawn up:
  •  the Formex V3 character set, based on:
    •  ISO 2022:1986 set substitution
    •  ISO/IEC 6429:1992 control functions
    •  ISO/IEC 6937:1994 Latin and Latin supplementary sets
    •  ISO 8859-5:1988 Cyrillic set
    •  ISO 8859-7:1987 Greek set
  •  Formex V3 character entities:
    •  euro character
    •  unmapped ISO characters
  •  ISO character entities:
    •  ISO 8879:1986:
      •  Added Latin 1
      •  Added Latin 2
      •  Box and Line Drawing
      •  Diacritical Marks
      •  Greek Letters
      •  Monotoniko Greek
      •  Non-Russian Cyrillic
      •  Numeric and Special Graphic
      •  Publishing
      •  Russian Cyrillic
    •  ISO 9573-13:1991:
      •  Added Math Symbols:
        •  Arrow Relations
        •  Binary Operators
        •  Delimiters
        •  Relations
      •  Symbols:
        •  Negated Relations
        •  Ordinary
        •  Alternative Greek Symbols
        •  Chemistry
        •  General Technical
        •  Greek Symbols
      •  Math Alphabets:
        •  Fraktur
        •  Open Face
        •  Script
  •  use of a Formex V3 dictionary entity;
  •  reference to external images.
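
The role of character entities can be illustrated with a short sketch that replaces entity references by the characters they stand for. It is not the Formex V3 mapping: it relies on the HTML5 entity table shipped with Python (html.entities.html5), which shares many names with the ISO 8879 and ISO 9573-13 entity sets, and the `extra` parameter is a hypothetical hook for names that table lacks, such as entities specific to Formex. Unknown references are left untouched so that a later check can report them.

```python
import re
from html.entities import html5

# An entity reference: '&', a name, ';'
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def resolve_entities(text, extra=None):
    """Replace known entity references with characters; leave unknown ones as they are."""
    extra = extra or {}

    def substitute(match):
        name = match.group(1)
        if name in extra:
            return extra[name]
        # Keys of the html5 table carry the trailing semicolon.
        return html5.get(name + ";", match.group(0))

    return ENTITY_REF.sub(substitute, text)


print(resolve_entities("10 &euro; (&eacute;dition sp&eacute;ciale)"))
print(resolve_entities("&Rcy;&ucy;&scy;&scy;&kcy;&icy;&jcy;"))
print(resolve_entities("&unknownname; stays as written"))
```

Keeping the references symbolic in the stored instance, and resolving them only at output time, lets a document written in one base character set carry Cyrillic, Greek, or mathematical characters without changing its encoding.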

There are three techniques which may be used to manage the multilingual aspect of the publications:
  •  the reference language is associated with monolingual publications;
  •  several reference languages are attached to each multilingual document;
  •  in addition, any language may be overwritten as regards the text or a table.
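
The three techniques can be sketched as a simple inheritance rule, under the assumption that a publication is a tree in which every node inherits the reference languages of the document unless it carries its own override. The class, field, and node names below are hypothetical and are not taken from Formex V3.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    langs: tuple = ()                     # explicit override; empty means "inherit"
    children: list = field(default_factory=list)


def effective_languages(node, inherited):
    """Yield (node name, languages) pairs, applying the nearest override."""
    current = node.langs or inherited
    yield node.name, current
    for child in node.children:
        yield from effective_languages(child, current)


# A multilingual document with two reference languages and two local overrides.
document = Node("publication", children=[
    Node("paragraph"),
    Node("quotation", langs=("la",)),
    Node("table", langs=("el",), children=[Node("cell")]),
])
for name, langs in effective_languages(document, ("en", "fr")):
    print(name, langs)
```

A monolingual publication is the special case in which the inherited tuple holds a single language.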
 
The various language versions are consolidated by highlighting the multilingual structures and content within the publications. This multilingual version of the publication can be used to manipulate the various language versions of a publication simultaneously.
 

General Architecture

 
The quest for an integrated multilingual solution leads to the re-examination of the certification and the storage process within the workshop.
 
This section describes the construction of a general architecture and the conceptual approaches required to support the multilingual aspect. The problem is considered from different angles: certification, storage, consolidation, and modification of multilingual publications are discussed, with particular reference to the provision of an efficient and integrated multilingual system.
 
The general architecture of a multilingual workshop comprises four modules.
 
 
Reception receives a publication, inserts it into synoptic storage, and chains on to certification.
 
The certification process has been subdivided into an automatic validation phase and a manual verification phase. The automatic validation phase carries out all the objective checks which do not require any human intervention. The manual verification phase identifies situations which are likely to contain an error as regards compliance with the rules for use, and directs the manual validation process by justifying each warning that is issued.
 
During the automatic validation process the following basic checks are carried out:
  •  compliance with the physical specification;
  •  SGML conformance;
  •  SGML consistency and completeness.
 
Manual verification comprises the following basic checks:
  •  correct use of the physical specification;
  •  adherence to the DTD rules;
  •  the synoptic nature of structure and content.
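
The two phases can be sketched as follows. This is a minimal illustration under stated assumptions, not the workshop's implementation: the check functions are placeholders (a real system would call an SGML parser), the file name and the ".sgm" extension are hypothetical, and stopping the chain at the first failure is an assumption. Automatic checks return an error or nothing; heuristics for manual verification only yield warnings together with their justification.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    errors: list = field(default_factory=list)    # failures from automatic validation
    warnings: list = field(default_factory=list)  # justified warnings for manual verification


def certify(publication, automatic_checks, heuristics):
    """Run the chained automatic checks, then collect warnings for manual verification."""
    report = Report()
    for name, check in automatic_checks:
        problem = check(publication)
        if problem is not None:
            # Assumption: the chain stops here, since later checks
            # presuppose that the earlier ones passed.
            report.errors.append(f"{name}: {problem}")
            return report
    for name, heuristic in heuristics:
        report.warnings.extend(f"{name}: {reason}" for reason in heuristic(publication))
    return report


# Placeholder checks (hypothetical stand-ins for the real controls).
def physical_specification(pub):
    return None if pub["filename"].endswith(".sgm") else "unexpected file extension"


def sgml_conformance(pub):
    balanced = pub["text"].count("<") == pub["text"].count(">")
    return None if balanced else "unbalanced markup delimiters"


def suspicious_characters(pub):
    # Flags a situation likely to contain an error without deciding that it is one.
    if "??" in pub["text"]:
        yield "'??' may indicate an unmapped character"


AUTOMATIC = [("physical specification", physical_specification),
             ("SGML conformance", sgml_conformance)]
HEURISTICS = [("character usage", suspicious_characters)]

report = certify({"filename": "oj_l_001.sgm", "text": "<p>Price: 10 ??</p>"},
                 AUTOMATIC, HEURISTICS)
print(report.errors, report.warnings)
```

Separating hard errors from justified warnings keeps human attention for the cases that an objective check cannot settle.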
 
 
Manual editing is carried out by means of an SGML editor, the publication being edited in accordance with its DTD. The results of the automatic validation and the manual verification are presented through a specialized interface. Support functions are integrated to locate errors, confirm diagnoses, and facilitate repetitive corrections.
 
The relevant synoptic status is attached to each element in the instance. When the element exists in all the languages, it is multilingual. Any modification of a multilingual element is automatically repeated in all the language versions. Any modification in a monolingual element affects only its native language. An element may be defined as being monolingual in order to limit the dissemination of the modification.
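
The propagation rule can be sketched as follows, assuming that each language version is held as a mapping from element identifiers to attributes. The storage layout, the identifiers, and the `forced_monolingual` parameter are hypothetical and chosen only for illustration.

```python
def synoptic_status(versions, element_id, forced_monolingual=frozenset()):
    """Return 'multilingual' if the element exists in every language version."""
    if element_id in forced_monolingual:
        return "monolingual"  # explicitly restricted to limit dissemination
    everywhere = all(element_id in elements for elements in versions.values())
    return "multilingual" if everywhere else "monolingual"


def modify(versions, element_id, language, attribute, value,
           forced_monolingual=frozenset()):
    """Apply an attribute change and return the languages it was applied to."""
    if synoptic_status(versions, element_id, forced_monolingual) == "multilingual":
        targets = sorted(versions)   # repeated in all language versions
    else:
        targets = [language]         # affects only the native language
    for lang in targets:
        versions[lang][element_id][attribute] = value
    return targets


versions = {
    "en": {"art1": {"num": "1"}, "note-en": {"num": "a"}},
    "fr": {"art1": {"num": "1"}},
    "de": {"art1": {"num": "1"}},
}
print(modify(versions, "art1", "en", "num", "2"))
print(modify(versions, "note-en", "en", "num", "b"))
print(modify(versions, "art1", "fr", "num", "3", forced_monolingual={"art1"}))
```

In this sketch the status is derived from the stored versions at the time of each modification, so that an element added to one language only is handled as monolingual without further configuration.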
 
As soon as the certification process is complete, the document is sent for final archiving and an electronic publication is produced.
 

Advantages of Using SGML

 
In making the choice of a standard for document representation, there are many considerations. What possibilities does SGML have to offer in the context of the provision of an efficient and integrated multilingual system? How can they be turned to advantage? How can they be applied in this new approach? Among the many advantages that can be discussed, language-independent tagging will be considered as a lever in the proposed solution of this particular multilingual problem.
 
Other criteria include:
  •  independent development of market tools (storage/archiving/dissemination);
  •  continuity of information;
  •  continuity of production systems;
  •  independence of external applications;
  •  independence of sources;
  •  independence of formats;
  •  homogeneous nature of methods;
  •  standardization of data access methods;
  •  flexibility;
  •  extensibility;
  •  modularity;
  •  strict specification of interfaces.
 

Conceptual Description

 
The case study is of the new SGML validation workshop developed for the publication of laws in the Official Journal of the European Union. There are currently eleven official languages within the European Union. Publications considered by the workshop are therefore multilingual.
 
Flows within the workshop are shown in the diagram below.
 
 
During automatic validation, the following controls are chained:
 
 
The semi-automatic validation is subdivided into processes, the sequence in which they are carried out being illustrated below.
 
 
AIP = Arcel In the Pocket (HTML interface to browse and obtain SGML publication from Storage 2 and 3)
 
CREJO = Bibliographic database
 
The semi-automatic validation process consists of immediate processing, which must be carried out within a twenty-four-hour timeframe, and delayed processing, which depends on the priorities at that time. Each process follows the same scheme. The end-user edits a publication with an SGML editor. The consolidation of the linguistic versions is carried out automatically, based on the configuration and the range that is associated with the modification at the time of editing. This consolidation is done in a transparent and asynchronous manner at the time of synoptic storing. Moreover, immediate export is carried out at the end of the immediate and semi-automatic validation process. When the semi-automatic validation process is complete, the final archiving is carried out.
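
The distinction between immediate and delayed processing can be sketched as a two-class priority queue. The scheduling policy shown is an assumption made for illustration: immediate work is served first, ordered by its twenty-four hour deadline, and delayed work follows in order of a numeric priority. The class name and publication identifiers are hypothetical.

```python
import heapq
from datetime import datetime, timedelta

IMMEDIATE_WINDOW = timedelta(hours=24)


class ValidationQueue:
    """Serve immediate work by deadline, then delayed work by priority."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker that preserves insertion order

    def add_immediate(self, publication_id, received):
        self._push(0, received + IMMEDIATE_WINDOW, publication_id)

    def add_delayed(self, publication_id, priority):
        # A lower number means a higher priority.
        self._push(1, priority, publication_id)

    def _push(self, processing_class, rank, publication_id):
        # The rank is compared only within one class, so deadlines are never
        # compared with priorities.
        heapq.heappush(self._heap, (processing_class, rank, self._counter, publication_id))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[3]

    def __len__(self):
        return len(self._heap)


queue = ValidationQueue()
queue.add_delayed("D1", priority=2)
queue.add_immediate("I1", received=datetime(1999, 4, 2, 9, 0))
queue.add_immediate("I2", received=datetime(1999, 4, 1, 9, 0))
queue.add_delayed("D2", priority=1)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```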
 

Conclusions

 
The addition of several more language versions of official documents to be published by OPOCE led to a reappraisal of the method then employed. SGML had always been the key technology used, so the solution was to devise the SGML validation workshop, thus capitalizing on earlier work and the experience gained.
 
Already there are plans for enhancement with respect to possible integration with full-text search engines. That is in the future. Today the signs are that a more efficient and effective way has been developed to address the problem, one which has the potential to cope with many more languages in the years to come.
 
From the end-user’s point of view, how will this new approach affect their work? What modifications will be required in their organization? These questions are still to be addressed. The answers will come only after the SGML workshop has been operational for several months.
