
Dave Peterson
 

The SGML Character Model

 

Abstract:

 SGML was designed in an environment where other-than-8-bit character representations were only vaguely known and not understood. The designers did not differentiate between (abstract) characters and the bit-patterns by which they are represented in machines. This resulted in a character-handling model that is no longer adequate in many respects. In addition, there have surfaced differences of opinion as to how the current SGML standard (ISO 8879 as amended in 1988) should be interpreted with respect to the handling of the characters that make up the SGML documents it describes.
 A new character and character-string model has been adopted by the SGML Rapporteur Group within WG8, where the ISO 8879 revision is being prepared. The new model encompasses handling of variable-width-character string representations such as Shift-JIS, and outside-the-document specification of character representations, as well as the traditional “document character set” specification.
 A Technical “Correction” to ISO 8879 was made official in 1996, which made it more feasible to use SGML with very-large-character-set languages such as Japanese and Chinese, for use on SGML systems not constrained to an 8-bit character set.
 This presentation will explain the distinction between (abstract) characters and the computer representations of characters, and will explain the new character handling model in terms thereof. It will further explain the relationship to the old character-handling model of 1986, and how older systems may be upgraded, and what is possible when still running under the old (1986/88) rules. This involves the relationship of the “document character set” to the way systems may actually represent characters, and the use or non-use of the “shunned character numbers” specification.
 

Introduction

 The character-handling model described herein is the author's understanding of what the ISO/IEC SGML rapporteur group currently intends to use as a basis for the 8879 revision now being worked on. (The author is a member of that group.) This paper is not a rapporteur-group-blessed description, and in any case the group could conceivably change its mind before the revision “goes final”.
 The model of character representation and processing in SGML is (with excellent hindsight) perhaps not as well developed in the current ISO 8879 as it might have been. There are currently a few differences of opinion among experts as to how certain fine points should be interpreted. Said experts have “agreed to disagree”, and instead have developed an agreed-to approach to characters and the processing thereof for use in the revision to that standard now being developed. This approach also permits several new ways of handling characters that I believe were not considered during the original development of SGML. This new approach is the primary topic of this paper.
 

Why Bother?

 In the beginning, the common SGML systems were typically constrained to an ASCII- or ISO-646-based character set (i.e., at most variant 8th-bit-on character encodings). They also usually required that the storage and system representations of characters and the document and syntax-reference character sets all be the same. This system organization is still supported.
 In the last several years, people involved with the World Wide Web and the efforts to transmit SGML documents and isolated textual entities have espoused the view that the outside-of-SGML aspects of their systems already have character-representation specification capability, which in many cases they must use. They have no desire to deal with the SGML document character set as it has been generally interpreted.
 In particular, it has been proposed within the World Wide Web community that HTML, and more recently XML, standardize with a “document character set” of ISO/IEC 10646, but that it only be used for the minimum: determining what characters will be considered legal if encountered in the character stream being presented to the parser, and for resolving numerical character references. The “storage” representation of characters (the encoding used when a document or part thereof is being sent to a receiving system or server) will be described using the Web conventions. The receiving system may receive documents or parts in various encodings from various servers. The “entity manager” (front end of the browser) will be expected to convert to a uniform character representation for use within the browser.
 In any case, the new approach recognizes that character-oriented entities may be stored and interchanged independently of their containing SGML documents (hence independent of the specification of the document character set) and indeed may well wind up being used as text entities within documents having different document character sets! (Which is perfectly reasonable if there are no incompatible numerical character references in the entity.)
 

Character Basics

 Since it has become obvious that not everyone approaches characters with the same character/character-string model in mind, this section begins by describing the basic notions involved as the author interprets them. Other sections then present the agreed-to SGML model in that context.
 A character is an instance of the primitive abstract class “character”. This class has possibly infinitely many (distinct) instances, all of the different “characters”. It is probably not well-defined, and that does not matter for our purposes. Formally (i.e., excepting their for-the-benefit-of-humans semantic roles), the only important thing about the various characters is that they are distinguishable. A character repertoire is a finite subclass of character.
 Characters are generally used by humans to represent information via human languages. Strings of characters are parsed by humans, and the semantic roles of the characters (and the language which is being used) determine how the character strings are parsed to recreate human-understandable information. Computer systems that have been taught to parse character strings according to the rules of some language “know” the characters only to the extent their software has been programmed to respond to specific characters in various ways. The semantic roles of the characters are really for the benefit of humans trying to interpret output, and for humans trying to design new application systems that can share character strings with other applications.
 There is debate whether such things as ligatures and “Mr. Uck” (the non-smiley face that seems to be replacing skull-and-crossbones as a poison symbol, especially where children might be involved) are really characters, or merely non-character character-like things that our processing systems tend to treat like characters. For the purposes of this paper, “character” will mean “real character or non-character character-like things that our processing systems tend to treat like characters”. The distinction is not germane to SGML. If a potential character is something which is specifically germane to the semantic content of the data in which it occurs, it is a character. If it is merely an artifact of a presentation system, then it is probably not. For example, it is easy to imagine circumstances wherein a common ligature qualifies as a separate character (perhaps a treatise on typesetting), and other circumstances wherein it is merely an artifact of the presentation system.
 Most characters are simply displayed, but some, called “control characters”, are not. Control characters more typically function in SGML terms as special kinds of processing instructions or system data entities. From this point of view, even a ligature character could be considered a processing instruction for a not-very-smart display system that couldn't replace several letters with the matching ligature without help.
 Associated with each character repertoire is its class of (finite) character strings. This class is similar to the class of finite sequences of characters in the repertoire, except that it provides the usual string operations rather than the usual sequence operations (admittedly, there is some overlap).
 Every character repertoire and its associated string class can be represented in a computer (at least up to some fairly large number of characters and some fairly large length of strings).
 

Character and String Representation

 All members of any bounded set of nonnegative integers can be named by Arabic base-2 numerals of some bounded length, and with leading zero digits can be named by such numerals all of the same length. When used as a representation within a computer, a base-2 digit is a bit, and a base-2 numeral is a bit combination.
 The instances of any class having only a finite set of instances can be mapped to a bounded set of nonnegative integers, and hence can be represented by bit combinations of some fixed length. In the context of representing finite non-numeric abstract classes, all of the bit combinations of the same length form a code set and each one is a code point.
 Because a character repertoire is finite, it can necessarily be represented by some or all of the code points in a code set. Such a representation is a coded character set, and is canonical. Thus, a canonical representation is one using all-the-same-size bit combinations, each combination either representing a character or “undefined”.
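 The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_coded_character_set` and the five-letter repertoire are hypothetical, and the ordering of characters is arbitrary (any fixed ordering would serve).

```python
def build_coded_character_set(repertoire):
    """Map a finite repertoire onto fixed-length bit combinations.

    Returns (width, code_set, assigned), where code_set lists every
    code point of that width and assigned maps code point -> character.
    Code points missing from `assigned` are "undefined".
    """
    chars = sorted(set(repertoire))                 # arbitrary but fixed order
    width = max(1, (len(chars) - 1).bit_length())   # bits needed to number them
    code_set = [format(n, f"0{width}b") for n in range(2 ** width)]
    assigned = dict(zip(code_set, chars))           # zip stops at the last character
    return width, code_set, assigned


# Hypothetical five-character repertoire.
width, code_set, assigned = build_coded_character_set("abcde")
for code_point in code_set:
    print(code_point, assigned.get(code_point, "undefined"))
```

 With five characters the code set has eight 3-bit code points, three of which remain undefined.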
 Typically a standard that defines a character repertoire also defines a coded character set representation thereof. It may do so by prescribing the code points directly, or by prescribing the corresponding nonnegative integer; the two methods are equivalent.
 Most representations of single characters employ such a coded character set, i.e., are canonical. The corresponding canonical representation of strings of characters is by concatenating the bit-combinations/code-points of each character. But non-canonical representations of strings are more common than those of characters. Shift-JIS, for example, uses a combination of 8- and 16-bit bit combinations. Even the result of ordinary compression techniques on a canonical representation can be viewed as simply a non-canonical representation.
 

SGML Character Representations and Encodings

 SGML non-data external entities contain strings of characters. Because characters are abstract, the computer is necessarily working with representations of those characters. Unfortunately, there is no universally used character repertoire, much less a universal representation thereof (though that of ISO 10646 might come close).
 There are several ways of handling a very large character repertoire in a document. One, of course, is to use a much smaller character set and use “escape sequences” (in SGML they may be SDATA entity references) for the rest. It's then up to the application to recognize the escape sequences and deal with the larger character set. But this (1) is a nuisance, (2) is a burden on many applications, and (3) does not allow for a rich set of characters to be used for markup (e.g., as name characters). Therefore, SGML systems should be able to deal with various and large character repertoires and representations thereof.
 There are several character/string representations and encodings of interest when describing SGML-oriented character string storage and processing. Following are the ones currently being dealt with by the SGML rapporteur group.
 
 

The Storage Representation of Characters

 Each entity is stored as all or parts of one or more storage objects. Each storage object is addressable (possibly via query or other such means) and retrievable, and is the representation of some character string; this representation is the storage representation of characters (the “StoRC”) for that object.
 One of the jobs of a (formal or informal) system identifier is to specify where the entity manager (in conjunction with its storage manager subsystems) can find the necessary storage objects, extract from them the appropriate substrings, and merge them into a single character string. Product differentiation can be made based on the various kinds of storage objects and representations used therein that can be handled by the entity manager.
 SGML originally dealt only with canonical representations because its guiding precepts included safe long-term document storage, and canonical representations are most likely to be still interpretable in the foreseeable future. Now applications of SGML are arising for which this is not the overriding requirement, and the rules are being loosened accordingly.
 
 

The System Representation of Characters

 It is the further job of the entity manager to combine those strings as necessary to produce a single character string in a single representation for presentation to the parser. This representation, used to present all of the entities involved in the parse of a single document, is the system representation of characters (AKA “SysRC”) for that parse.
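 As a rough sketch of this entity-manager job, the following Python fragment converts storage objects held in different storage representations into one uniform string. The `StorageObject` class and `to_system_representation` function are hypothetical names, the encoding label on each object is assumed to arrive out of band or via a system identifier, and Python's own `str` type plays the part of the system representation.

```python
from dataclasses import dataclass


@dataclass
class StorageObject:
    data: bytes      # the stored bit combinations
    encoding: str    # storage representation, assumed to be known out of band


def to_system_representation(storage_objects):
    """Decode each storage object and merge the results into one string."""
    return "".join(obj.data.decode(obj.encoding) for obj in storage_objects)


# Three hypothetical storage objects, each in a different representation.
objects = [
    StorageObject("<title>".encode("ascii"), "ascii"),
    StorageObject("日本語".encode("shift_jis"), "shift_jis"),
    StorageObject("</title>".encode("utf-16-be"), "utf-16-be"),
]
parser_input = to_system_representation(objects)
print(parser_input)
```

 The parser in this sketch never sees the storage representations; it sees only the merged string.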
 Note that the system representation of characters is a representation of the entire string. There may or may not be a one-to-one correlation of characters and single bit-patterns. The interpretation of a bit-pattern may be context dependent (as in the case of Shift-JIS), or may be irrelevant (as in the case of a heavily-packed compression-based representation). In the case of context-dependent representations, one can take the point of view that as they are scanned, incoming bit-patterns can cause the scanner to “change state”; the interpretation of one or more (uniform-sized) bit patterns as a character is dependent both on the patterns themselves and the “state” of the scanner.
 SGML will no longer prescribe the form of the representation of characters at the entity-manager/parser interface. One can debate whether it ever should have; many believe this is a matter best left to the system designer. (And for a single-package system, how could you tell?)
 
 

The Document Character Set

 The document character set is a coded character set described in the SGML declaration of each SGML document. It identifies those characters that are permitted to occur directly in SGML documents, and for each such character prescribes one character number which, when used in a numeric character reference, must result in resolution to that character. (Whether more than one number per character should be allowed is being discussed for the revision; currently it is not.) It also identifies any additional character numbers which may be used in numeric character references, although the characters to which such references must resolve have to be specified out-of-band (“nonSGML data characters”).
 The document character set may also be (one of) the storage representation(s) of characters, or be the system representation of characters, or be otherwise involved in the conversion to that representation. If it is, then the entity manager must be informed of it by the parser so that it can adjust its conversion to match. If the representations involved are canonical, such a conversion can be thought of as taking each character as represented by a bit combination using the storage representation of characters and transforming that bit combination to that for the same character in the document character set; such a transformation is a bit combination transformation format (AKA BCTF). (This paper/presentation will not go into the details of BCTFs.) If the document character set is not the system representation of characters, the entity manager must then convert the result into the system representation. (Note that when converting the beginning of the document entity using such a conversion, the entity manager must still be given, external to the document, enough information about the document character set to be able to present the document character set specification to the parser.)
 Character numbers (the numerals used to refer to characters indirectly in numeric character references and elsewhere) will in the revision to ISO 8879 be expressible in base 10 and 16 (and possibly 8), rather than just base 10.
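 Resolving numeric character references against a document character set can be sketched as follows. Several things here are assumptions rather than anything the revision has fixed: the hexadecimal spelling `&#x...;` is borrowed from XML, the function name `resolve_char_refs` is hypothetical, and the three-entry document character set is a toy fragment using ISO/IEC 10646 numbers.

```python
import re

# Decimal (&#65;) or, by assumption, XML-style hexadecimal (&#x41;) references.
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def resolve_char_refs(text, document_character_set):
    """Replace each numeric character reference with the character it names."""
    def replace(match):
        numeral = match.group(1)
        number = int(numeral[1:], 16) if numeral.startswith("x") else int(numeral)
        try:
            return document_character_set[number]
        except KeyError:
            raise ValueError(
                f"character number {number} is not in the document character set"
            ) from None
    return CHAR_REF.sub(replace, text)


# Hypothetical fragment of a document character set: number -> character.
doc_charset = {65: "A", 233: "é", 0x6587: "文"}
resolved = resolve_char_refs("&#65;&#xE9;&#x6587;", doc_charset)
print(resolved)
```

 Note that the resolution depends only on the document character set, not on how the entity containing the reference happened to be stored.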
 
 

The Syntax-Reference Character Set

 The syntax-reference character set, also defined in the SGML declaration, never plays any role in the conversion from storage representation to system representation nor in determining the system representation (the interface between entity manager and parser). Its sole purpose is to allow the concrete syntax specification parameter, considered as a string of (abstract) characters, to be independent of the representation of those characters and the document character set of any document within which it may be incorporated. Thus it is used in lieu of the document character set to direct the interpretation of character numbers within the concrete syntax specification as characters, and to order characters so that character ranges can be described in terms of their end-point characters.
 The parser, being presented with characters via the system representation of characters, must be able to read the numerals in the concrete syntax specification parameter, determine what character each represents (i.e., what the system representation(s) of the character intended is), and configure itself to recognize the (all, if there can be more than one) system representation(s) for that character as the appropriate markup.
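 The two-step lookup described here (character number to abstract character via the syntax-reference character set, then abstract character to system representation) might be sketched like this. The function-character numbers are those of the familiar ISO 646 based syntax; the function name `configure_function_characters` is hypothetical, and the sketch assumes the system representation is available as a Python codec, with EBCDIC code page 037 chosen only to make the difference visible.

```python
# Function characters as character numbers in the syntax-reference character set.
function_characters = {"RE": 13, "RS": 10, "SPACE": 32, "TAB": 9}

# Stand-in syntax-reference character set: ISO 646 numbers -> abstract characters.
syntax_ref_set = {n: chr(n) for n in range(128)}


def configure_function_characters(functions, syntax_ref_set, system_encoding):
    """Return the system-representation bit combination for each function."""
    table = {}
    for name, number in functions.items():
        character = syntax_ref_set[number]                # number -> abstract character
        table[name] = character.encode(system_encoding)   # character -> SysRC
    return table


ascii_table = configure_function_characters(function_characters, syntax_ref_set, "ascii")
ebcdic_table = configure_function_characters(function_characters, syntax_ref_set, "cp037")
print(ascii_table["SPACE"], ebcdic_table["SPACE"])
```

 The character numbers in the concrete syntax are the same in both cases; only the bit combinations the parser must recognize differ.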
 
 

Shunned Character Numbers

 Assuming that the system representation of characters is canonical (as is required by the current ISO 8879), the “shunned character number” mechanism of the concrete syntax provides a means for the designer of the concrete syntax to prohibit bit patterns (specified by character number) from being presented to the parser as characters (regardless of what characters are represented by those bit patterns using the document character set). However, this is too much of a restriction, so exceptions are made for potential markup characters. Since any character can be made into a potential markup character by making it an inert function character, this mechanism is now generally regarded as a lot of complication for very little gain. It is not expected to survive into the revision.
 

SGML for Interchange and Within a System

 I believe the original mindset of the SGML designers was oriented to the problems of interchanging documents between disparate “SGML systems”, and that they were originally (at least subconsciously) assuming that any single SGML system would use only one storage representation of characters, and that it would also be the document character set. Also (I believe), they were anticipating an environment where SGML documents rather than individual external entities would be transmitted as a unit. Thus, they expected that a newly-arrived document would have all of its entities presented to the receiving system in the same character set.
 With this conceptualization, it was reasonable to expect the document character set described in the incoming SGML declaration to be the “transmission representation of characters” (i.e., the “storage” representation used during transmission). Once the document character set specification near the beginning of the SGML declaration was read, a clever SGML system could configure itself to interpret incoming character representations as bit patterns described by that character set. The only “out-of-band” (outside the document) information that would be needed would be enough of the document character set to permit the reading of the SGML declaration up through the document character set specification. Systems would differentiate themselves in part by how many different character set fragments they could accept out-of-band with which to bootstrap and how many different completions of those fragments could be accepted via the SGML declaration's document character set specification.
 Whether I have correctly guessed their mindset or not, the SGML world has certainly evolved away from such simple assumptions. In particular, parts of the SGML community are now much more interested in interchanging individual external entities, which do not generally have their own SGML declaration. This eliminates the SGML declaration as the mechanism for describing the “transmission representation of characters” (i.e., the “storage” representation used during transmission). When an entity (other than a document entity) is received in vacuo (outside the context of an SGML document), there is no document character set description to use: all character string representation information must be handled “out of band”.
 For the SGML purist, a stream of characters is not inherently an entity. Rather, it becomes an entity by virtue of being so identified in an SGML document. Thus, from the purist's point of view, one cannot have entities without an SGML document, hence there is always an SGML declaration around, and a document character set described therein. And the SGML document can describe the storage set of characters using a “formal system identifier”. Nonetheless, there are people out there that intend to handle these specifications out of band. So be it: it's still useful to know what happens when one goes that route.
 Okay: what happens when we have an out-of-band mechanism for describing the character string representation mechanism for every separate entity? Is there need for them all to be the same? Is there need for them to be the same as the document character set? If not, what good is the document character set? The answer is this: the two mandatory uses of the document character set (which correlates characters with integers) are (1) to determine what characters can legitimately occur directly in external entities (the “SGML” characters, which comprise the “SGML character repertoire” of the document), and (2) to interpret numeric character references as SGML characters. All else can be handled out-of-band. SGML systems designed for long-term usability of their documents will probably want to have as little information out-of-band as possible, so that the documents are as self-documenting as possible, and will want to always use canonical storage representations. Other systems with other goals may not.
 If there are no numeric character references therein, a stored string of characters can be used as a text entity in any document whose “SGML character repertoire” includes the characters of the stored string (regardless of what the document character set is)—just be sure you have an entity manager that can deal with all of the character representations involved, and can accept all of the extra describing information out-of-band or via formal system identifiers.
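 The reuse condition in the preceding paragraph can be sketched as a simple check. The function name `usable_as_text_entity` is hypothetical, the entity is assumed to have been decoded already by the entity manager, and the check is deliberately conservative: it rejects any numeric character reference rather than trying to decide whether a given reference would resolve compatibly. The XML-style hexadecimal spelling is again an assumption.

```python
import re

# Decimal or (by assumption) XML-style hexadecimal numeric character references.
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def usable_as_text_entity(entity_text, sgml_repertoire):
    """Conservatively decide whether a decoded entity can be reused in a document.

    `sgml_repertoire` is the set of characters the document character set
    permits to occur directly; the numbers it assigns are irrelevant here.
    """
    if CHAR_REF.search(entity_text):
        return False   # references tie the entity to one character numbering
    return set(entity_text) <= set(sgml_repertoire)


# Hypothetical repertoire: the printable ISO 646 characters.
repertoire = {chr(n) for n in range(32, 127)}
print(usable_as_text_entity("cafe", repertoire))       # True
print(usable_as_text_entity("café", repertoire))       # False: é not in repertoire
print(usable_as_text_entity("caf&#233;", repertoire))  # False: numeric reference
```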
 The old character model can be interpreted as a special case of the new, so old systems automatically conform, but are limited in their ability to handle disparate representations. They can be upgraded by giving them additional routines to interpret formal system identifiers and/or accept external (“out-of-band”) information to enable them to translate stored or received character strings from one representation to another—obviously a task for the system builder, not the user.
 ISO 8879, even after the revision, will not prescribe that an SGML system be capable of handling any particular representations of characters. However, SGML application specs, now being called “profiles”, may do so. XML, for example, is expected to prescribe that any XML-compliant system must handle at least a certain minimum set of character representations.
