Software Agents using XML for Telecom Service Modelling : a Practical Experience   Table of contents   Indexes   Development of SGML/XML Middleware Component

 
 

Style Sheets: I18N aspects


 
Anders   Berglund
  Principal Software Engineer
  Inso Corporation
299 Promenade Street
Providence   Rhode Island  USA  02908-5720
Phone: +1 401 752 4416
Fax: +1 401 752 4444
Email: aberglund@inso.com Web: www.inso.com
 
Biographical notice:
 
Anders Berglund
 
Anders Berglund is Principal Software Engineer, INSO in Providence RI. After training in High Energy Physics he moved in 1978 into the computer center of the European High Energy Physics Laboratory, CERN, Geneva, Switzerland, where he actively worked in the area of text processing, in particular with the problems of scientific text and national use characters in text formatters and on output devices. One of the projects was the implementation of an SGML based publishing system that greatly influenced the design of HTML. During this period he was also Project Manager of the SEAS (European IBM User group) Office Automation Project, which also deals with publishing.
 
In 1987 he moved to ISO to carry out a project to introduce a new computer assisted publishing system for International Standards and other ISO publications. For this project he created a DTD implementing Part 3 of the IEC/ISO directives. This DTD is the basis for a DIN Pre-Norm as well as the basis for a DTD for European Standards.
 
In 1993 he moved to the United States founding Berglund Consulting & Type Foundry. The Type Foundry specializes in the creation of PostScript and TrueType fonts for
 
  • Symbols used in mathematics and science. Currently the collection comprises close to 1400 symbols packaged in 11 fonts. This font set is believed to be the only one covering all the public SGML entities in ISO/IEC 9573-13.
  • Non-Latin alphabets. These include fonts for 'living' languages as well as for 'dead' languages of scholarly interest.
 
In 1995 he joined EBT to work on the DSSSL aspects of the new major release of DynaText (Matterhorn).
 
Since 1986, he has been an active participant in ISO/IEC JTC1/WG4 and project editor of ISO/TR 9573 (Techniques for Using SGML), currently being expanded to 16 parts (of which part 11 deals with the SGML application for ISO standards), and has made significant contributions to ISO/IEC 10179 (DSSSL).
 
From 1987 to 1989, he was a consultant to IBM for a project resulting in the SGML Translator DCF Edition (product 5684-025), with further consultancy for subsequent releases of the product.
 
During 1992 and 1992, consultant to the Commission of the European Community sponsored IMPACT/EUROSTAND project for creating an SGML application for European Standards.
 
In 1992, consultant to IBM on Fonts and Character sets.
 
In 1994, consultant to EBT on DynaTag and Conversion Projects.
 
In 1995, one of the authors of the W3C proposal for eXtensible Style Language (XSL).
 
He is a frequent speaker on computer assisted publishing in general, and specialized topics such as SGML, DSSSL, mathematics, tables, character sets and special symbols.
 
ABSTRACT:
 
The internationalization of a stylesheet language is a large and complex task. This paper aims to illustrate some of the various aspects of language/script specific features that need to be handled by an international style sheet language and to discuss some of the possible design options and choices.
 
 

Aim and Scope

 
The first goal of this paper is to make the reader aware of some of the script and language specific features that should be supported by a style sheet language that claims to provide full international support. It should, however, be noted that only a fraction of these requirements are mentioned in this introductory paper.
 
The second goal is to discuss certain design options that may be selected and certain challenging issues that face the designers and implementors of fully internationalized style sheet languages. One of these is the question of how much script-specific information is included in the language as opposed to being assumed to be built into the formatters.
 
 

Some Language Specific Features

 
 

Script(/Fonts)

 
The most obvious feature of many languages is that they do not use the Latin script. In order to encode the characters in these scripts even the use of 16-bit characters is too limiting; that space can be taken up by what many see as the required repertoire of ideograms for Chinese alone.
 
 

Writing direction

 
Writing direction and line progression can take almost any combination of left/right and top/bottom, the most common being:
  • left to right writing with top to bottom line progression
  • right to left writing with top to bottom line progression (many of these scripts, e.g. Arabic and Hebrew, have numerals written left to right)
  • top to bottom writing with right to left line progression
  • bottom to top writing with left to right line progression
  • boustrophedon (alternating left to right, right to left); early Latin texts have been found written this way
 
Some scripts even have a reading order involving jumping back and forth between columns.
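
As a small illustration of the remark that right to left scripts have numerals written left to right, the sketch below looks up the directional class that the Unicode character database assigns to a few characters. Unicode and Python's standard `unicodedata` module are not discussed in this paper; they are used here only as a convenient, assumed source of per-character direction data.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Hebrew letter alef, Arabic letter alef, a European digit, a Latin letter.
sample = "\u05d0\u0627" + "7" + "a"
print(bidi_classes(sample))  # ['R', 'AL', 'EN', 'L']
```

The letters carry a right to left class ("R", "AL"), while the digit has its own class ("EN"), which is what allows a formatter to lay out numbers left to right inside right to left text.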
 
 

Baselines

 
A number of scripts are aligned on a baseline; other scripts are top aligned (e.g. Indian scripts) or center aligned (e.g. Chinese).
 
 

Glyph selection

 
In many languages there is not a fixed one-to-one correspondence between the characters of the text and the glyphs used to display the text. Examples are:
  • ligatures; Example: Latin-based scripts
  • contractions, where a sequence of characters is displayed as (typically) one glyph placed high; Examples: Latin, Church Slavonic
  • glyph form dependent on the position of the character (first in word, last in word, in the middle of a word, on its own); Examples: Arabic, Hebrew
  • reordering; Examples: Indian scripts
 
Other examples of language or script dependent glyph selection are the start and end of quotes (Examples: English, French, German) and punctuation (Example: Spanish).
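
The language-dependent selection of quotation marks and punctuation can be sketched as a simple lookup. The table `QUOTES` and the functions `quote` and `question` are hypothetical names introduced for this illustration; the marks shown are common conventions, and a real formatter would need a much richer table (nested quotes, regional variants).

```python
# Hypothetical table: opening and closing quotation marks per language.
QUOTES = {
    "en": ("\u201c", "\u201d"),              # English: raised 66 ... 99
    "fr": ("\u00ab\u00a0", "\u00a0\u00bb"),  # French: guillemets with no-break spaces
    "de": ("\u201e", "\u201c"),              # German: low 99 ... raised 66
}

def quote(text, lang):
    """Wrap text in the quotation marks conventional for lang."""
    open_mark, close_mark = QUOTES[lang]
    return f"{open_mark}{text}{close_mark}"

def question(text, lang):
    """Punctuate text as a question; Spanish adds an inverted opening mark."""
    if lang == "es":
        return f"\u00bf{text}?"
    return f"{text}?"

for lang in ("en", "fr", "de"):
    print(lang, quote("SGML", lang))
print(question("Por qu\u00e9", "es"))
```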
 
 

Word Separation

 
In many cases identification of words is part of determining where a line may be broken. Examples:
  • a (word) space separating the words
  • a specific character separates the words; Examples: Runic, Cuneiform scripts
  • there is no word separator; in languages where words influence line-breaking, the words have to be identified by dictionary lookup
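
Identification of words by dictionary can be sketched with a greedy longest-match scan. This is only one simple strategy, chosen here for brevity and not taken from the paper; the function name `segment` and the toy word list are assumptions, and Latin letters are used purely to keep the example readable.

```python
def segment(text, dictionary):
    """Split text into words by greedy longest match against dictionary.

    Characters that start no known word are emitted one at a time.
    """
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

toy_dictionary = {"the", "cat", "cats", "sat", "at"}
print(segment("thecatsat", toy_dictionary))  # ['the', 'cats', 'at']
```

The result shows why this is hard: the greedy scan prefers "cats at" over "cat sat", and only knowledge of the language can decide which reading, and therefore which set of line-break opportunities, is correct.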
 
 

Hyphenation

 
Hyphenation rules and the controls required vary, and for certain scripts hyphenation is not used.
 
 

Justification

 
In order to justify a line different strategies are used, for example:
  • expansion of spaces
  • keep the word space fixed, but "stretch" certain glyphs (e.g. Arabic)
  • keep the word space fixed and, for each letter, select an appropriate glyph from a set of glyphs of slightly different widths
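
The first strategy, expansion of spaces, can be sketched as follows. The sketch assumes a monospaced model in which every character and every space is one unit wide; the function name `justify` is introduced for illustration only.

```python
def justify(words, width):
    """Join words into a line exactly width units wide by expanding spaces."""
    if len(words) == 1:
        return words[0].ljust(width)
    gaps = len(words) - 1
    total_space = width - sum(len(word) for word in words)
    if total_space < gaps:
        raise ValueError("words do not fit in the requested width")
    base, extra = divmod(total_space, gaps)
    line = ""
    for index, word in enumerate(words[:-1]):
        # The first `extra` gaps receive one additional unit of space.
        line += word + " " * (base + (1 if index < extra else 0))
    return line + words[-1]

print(repr(justify(["ab", "cd", "ef"], 9)))  # 'ab  cd ef'
```

The other two strategies keep the gaps fixed and would instead adjust the widths of the glyphs themselves, which requires information from the font rather than from the text.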
 
 

Script Specific Layout "Objects"

 
Many scripts make use of specific layout "objects". Some examples for Kanji are:
  • Ruby
  • Kendot
  • Warichu note
  • Furiwake
  • Latin characters in vertical writing mode
 
 

"Margin" Objects

 
Margins are used in some scripts for certain information. Examples are:
  • butterfly found in Japanese and Chinese books
  • printing the first word of the next page in the bottom margin
 
 

Numbering schemes

 
The representation of numbers shows quite a variation from script to script, for example:
  • "arabic", base 10, numbering (glyphs may vary!)
  • alphabetic, a, b, c, ...; there may be more than one order in a script
  • numbering schemes in which each letter has a numeric equivalent and the total is obtained by simple addition; Examples: (Classical) Greek, Church Slavonic, Hebrew
  • roman numbering
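
An additive scheme can be sketched using the Greek alphabetic numerals for 1 to 999. The function names are hypothetical, and the sketch omits the numeral mark (keraia) and the notation for thousands.

```python
UNITS = "αβγδε\u03dbζηθ"     # 1..9   (U+03DB is stigma, used for 6)
TENS = "ικλμνξοπ\u03df"      # 10..90 (U+03DF is koppa, used for 90)
HUNDREDS = "ρστυφχψω\u03e1"  # 100..900 (U+03E1 is sampi, used for 900)

# Each letter has a fixed numeric equivalent.
VALUES = {}
for scale, letters in ((1, UNITS), (10, TENS), (100, HUNDREDS)):
    for position, letter in enumerate(letters, start=1):
        VALUES[letter] = position * scale

def greek_to_int(numeral):
    """The value of a numeral is simply the sum of its letter values."""
    return sum(VALUES[letter] for letter in numeral)

def int_to_greek(number):
    """Format 1..999 as hundreds, tens and units letters; zeros are skipped."""
    if not 1 <= number <= 999:
        raise ValueError("only 1..999 are handled in this sketch")
    hundreds, rest = divmod(number, 100)
    tens, units = divmod(rest, 10)
    digits = ((hundreds, HUNDREDS), (tens, TENS), (units, UNITS))
    return "".join(letters[d - 1] for d, letters in digits if d)

print(int_to_greek(123), greek_to_int("ρκγ"))  # ρκγ 123
```

Unlike positional base 10 notation there is no zero: 105 is written with two letters, not three.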
 
 

Case folding

 
Some scripts have the notion of "case" and each letter exists in two forms. Irregularities include:
  • "French" French has no accented upper-case letters
  • the German sharp s is SS (two letters) in upper case
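
Both irregularities can be seen with a few lines of Python. The built-in `str.upper` follows the Unicode case mappings; the helper `upper_without_accents` is a hypothetical and deliberately crude function that removes every combining mark (including, for example, the cedilla), so it is an assumption-laden sketch and not a rule for French typography.

```python
import unicodedata

def upper_without_accents(text):
    """Upper-case text, then drop all combining marks (a crude approximation)."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The sharp s becomes two letters, so the string grows by one character.
print("straße".upper())                # STRASSE
# The mapping does not round-trip: lower-casing cannot restore the sharp s.
print("straße".upper().lower())        # strasse
print(upper_without_accents("école"))  # ECOLE
```

A style sheet language that offers an "upper-case" characteristic therefore cannot define it as a character-by-character mapping.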
 
 

Sorting

 
Variations in this area include:
  • sometimes several "approved" collation sequences for a language
  • sorting by pronunciation (no character collation sequence)
 
A challenge is the sorting of mixed language texts, for example, the names of authors (with correct spelling of names) for these proceedings.
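
The existence of several approved collation sequences for one language can be sketched with German, where umlauted vowels are treated either as the base vowel (dictionary style) or as vowel plus "e" (telephone-book style). The tables and the function `sort_names` are hypothetical and greatly simplified; real collation needs multi-level comparison rather than a single substitution pass.

```python
# Two simplified German orderings, expressed as substitutions applied to the sort key.
DICTIONARY_STYLE = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
PHONE_BOOK_STYLE = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def sort_names(names, table):
    """Sort names by a key derived with the given substitution table."""
    return sorted(names, key=lambda name: name.lower().translate(table))

words = ["Ofen", "Öl", "Oase"]
print(sort_names(words, DICTIONARY_STYLE))  # ['Oase', 'Ofen', 'Öl']
print(sort_names(words, PHONE_BOOK_STYLE))  # ['Oase', 'Öl', 'Ofen']
```

The same three words come out in two different orders, so a style sheet that generates an index has to be able to name the collation it wants, not just the language.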
 
 

Style Sheet Language Design Issues

 
 

Style Sheet use/re-use for other scripts or modes

 
Given the variation in character and line progression direction for different scripts, it is convenient to express certain formatting characteristics, e.g. indents, relative to the progression direction (e.g. start of line, end of line) rather than in absolute terms (left, right). This leads to (or is a requirement for) stylesheets that are reusable from writing mode to writing mode or that can rather easily handle documents with mixed scripts. This approach was taken for DSSSL, whereas CSS describes these characteristics in absolute terms.
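
The benefit of direction-relative characteristics can be sketched by resolving "start" and "end" indents to physical sides only once the writing mode is known. The table `WRITING_MODES` and the function `resolve_indents` are hypothetical names for this illustration, loosely modelled on the idea described above, not on the syntax of any particular style sheet language.

```python
# Hypothetical mapping from writing mode to the physical side on which a line
# starts and ends.
WRITING_MODES = {
    "left-to-right": {"start": "left", "end": "right"},
    "right-to-left": {"start": "right", "end": "left"},
    "top-to-bottom": {"start": "top", "end": "bottom"},
}

def resolve_indents(style, writing_mode):
    """Translate direction-relative indents into physical ones."""
    sides = WRITING_MODES[writing_mode]
    return {sides[edge]: value for edge, value in style.items()}

# One style specification, reused unchanged for three writing modes.
paragraph_style = {"start": "2em", "end": "0"}
for mode in WRITING_MODES:
    print(mode, resolve_indents(paragraph_style, mode))
```

A style sheet written with absolute sides would instead need one variant per writing mode.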
 
 

Style sheet controls versus "formatter knows"

 
The design of a stylesheet language needs to consider which script features need specific characteristics and style sheet control; certain other aspects of formatting a particular script may be left to the formatter. Building in specifications and controls, perhaps as "building blocks", may make it possible to specify the behaviour of a script that was not known at the time the style sheet language was designed, but risks making the specification very difficult and verbose. The first DIS of DSSSL took this approach to a great extent, whereas the final version of DSSSL took the approach of identifying a number of "built in" formatting constructs, represented by "flow objects".
 
 

What is optional?

 
In the design of a style sheet language with international support a conscious decision has to be made as to which parts of the language are optional. Even today it is not feasible to demand that every conforming implementation be capable of formatting all the different scripts that the full language supports. One may only hope that this will change in time.
 
 

How "traditional" do you get? - technology limitations

 
Many scripts are extremely "calligraphic" in nature, and for practical and technological reasons the design may need to strike an acceptable compromise between full support of the traditional display and a simplified display. Take, for instance, Arabic, where different implementations have handled the traditional calligraphy at the following extremes:
  • reducing the glyph repertoire to a minimum, with only a few ligatures and using the same glyph for certain positional variants that are reasonably close visually to each other
  • using a glyph repertoire of several tens of thousands, with some number of glyph building blocks, to be able to render quite closely handwritten texts by a talented scribe
 
 

Support for the character/symbol repertoire required

 
A prime requirement is to support an adequate repertoire of characters in various scripts and special publishing, mathematical, and other symbols. This necessitates style sheet mechanisms to overcome the shortcomings of the current character coding standards.
 
In addition there needs to be the ability to specify character properties that are relevant for formatting; both for the characters present in current character coding standards as well as the additional repertoire.
 
 

Font issues

 
For certain scripts many fonts contain glyphs representing only parts of characters or ligatures and the level of detail of the font information made available to the formatter and style sheet is an issue. In many systems much of the combining work is left to a subsystem below the formatter.
