Special Characters and XML: The Dark Side of the Force |
Dave Dunbar, Newton, Massachusetts
Project Manager, Cahners Business Information
Biographical notice: Dave Dunbar has been with Cahners Business Information for ten years. He has worked on enterprise-wide ad revenue/space billing and reporting systems as well as the development of desktop ad revenue query tools. Most recently, Dave managed the development of a system for delivering tagged, rights-filtered ASCII text to online vendors. He is currently investigating how XML will assist Cahners Business Information in making the migration from a publishing company to an information company. |
Dave Peterson, Lexington, Massachusetts
Principal Consultant, SGMLWorks!
Biographical notice: Dave Peterson began working with SGML in 1986 at MIT. He was with Xyvision as Principal SGML Consultant from 1989 through 1993; he is now Principal Consultant with his own firm, SGMLWorks!. Dave is a Principal Member of NCITS V1 (formerly ANSI X3V1), and through ANSI is a Technical Expert representing the US to ISO/IEC JTC1 WG4, where he is active in pressing forward the revision to ISO 8879. Dave's Ph.D. is in Mathematics, from the University of California (Berkeley). He has taught math and computer science at various institutes of higher education, and SGML in a variety of settings. He does lots of things with SGML and XML, including document analysis, system design, and programming for both users and system providers. |
| Dave Dunbar: |
| Cahners Business Information is the publisher of over 140 business-to-business publications, with titles ranging from Variety to Semiconductor International and from Construction Equipment to Industrial Paint & Powder. These publications vary in design complexity, industry niche, publishing frequency and the publishing tools available to users. In making the transition from a print-centric publisher to an information company, Cahners has looked for ways to generate revenue from repurposed content. Cahners would like to use its rights-filtered editorial to populate web sites, to be packaged into e-mail newsletters, to be delivered to content aggregators or online vendors, or to be assembled into new targeted custom-published supplements (both print and electronic) whose content comes from several publications. To do this, Cahners has begun to upgrade the tools available to its users and to establish enterprise-wide standards in workflow, asset management, and style sheet and template usage in authoring and pagination. |
| One problem that we've run into is finding a way to quickly convert and represent complex equations and characters from special font faces within these new products, particularly the web and our tagged online deliverable. Presently, complex equations are embedded into Quark documents as EPS files. Special characters within the body text come from a variety of special font faces. Our online deliverable cannot include images. We want to speed the time to market for our web content. My question to Dave Peterson is, "How can XML help us solve these problems with special characters?" |
| Dave Peterson: |
| The answer is both helpful and not. First the good part. XML prescribes the ISO 10646 Basic Multilingual Plane as its document character set. So one can count on the fact that a lot of previously difficult-to-deal-with characters can simply be put directly in the character stream. This may handle a good part of the "special character" problem to which Dave alludes. And the problem of distinct ways of representing the same character will probably not matter for presentation. There are liable to be several "gotchas", however. |
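To make the "put directly in the character stream" point concrete, here is a minimal sketch using only Python's standard library. The sample sentence and variable names are illustrative assumptions, not Cahners content. It shows that a literal character and a numeric character reference parse to the same text, and that one character can have two distinct code-point representations (precomposed versus combining sequence), which Unicode normalization can reconcile. Normalization is not mentioned in the discussion above; it is offered only as one possible way to handle the "distinct ways of representing the same character" issue.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same sentence written two ways: literal characters
# versus XML numeric character references.
literal = "<p>Resistance: 10 \u03a9 \u00b1 5%</p>"
referenced = "<p>Resistance: 10 &#x3A9; &#xB1; 5%</p>"

text_literal = ET.fromstring(literal).text
text_referenced = ET.fromstring(referenced).text

# After parsing, both forms yield the identical character stream.
same_after_parsing = text_literal == text_referenced

# Two representations of one character: a single precomposed code point,
# or a base letter followed by a combining accent.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

differ_raw = precomposed != decomposed
equal_normalized = unicodedata.normalize("NFC", decomposed) == precomposed
```

Whether a downstream system compares the raw or the normalized form is a design decision; for display alone, as noted above, the difference will probably not matter.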
| A few difficulties arise with respect to presentation. XML may mandate that large character set, but it will be a while before there are appropriate glyphs available for all of those characters. Some current font-handling technologies still have fonts limited to at most 256 glyphs, and do not provide any way to indicate which characters (other than the system default) the glyphs are intended to depict. This means that there will be a lot of system-specific techniques developed for the short run to allow a large character set to be mapped to the glyphs of a large collection of small fonts. And, at least for the short term, this will mean careful management, and for some systems some font-handling trickery. |
| We are beginning to see the availability of larger fonts and systems that can handle them, but this will not in itself be adequate. For example, a serious Western European or American publisher will probably insist on at the very least a sanserif and a serif proportional-space font, and a monospace font. But three complete 16-bit fonts are huge! Most of the glyphs will probably be exact replicas, and some will be used only with one or the other of the three fonts. For example, most mathematical symbols will be the same regardless of whether the font is sanserif or serif, and Eastern Asia character glyphs are typically always monospace (though not necessarily the same aspect ratio as monospace West European fonts). So in practice we'll probably want smaller fonts and some mechanism that allows for reuse of glyphs to map the characters to the glyphs. Not yet available in your friendly neighborhood Web browser. |
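The idea of mapping a large character set onto several small fonts, with shared glyphs reused across typefaces, can be sketched as a two-level lookup. This is only an illustration under stated assumptions: the font names, glyph slots, and the `lookup` function are hypothetical and do not describe any real font technology or publishing system.

```python
from typing import NamedTuple, Optional


class GlyphRef(NamedTuple):
    font: str   # name of a small font (at most 256 glyphs)
    index: int  # glyph slot within that font


# Hypothetical tables, for illustration only.
# Glyphs that look the same in every typeface are stored once.
SHARED = {
    0x2211: GlyphRef("MathSymbols", 0x53),  # N-ARY SUMMATION
    0x221E: GlyphRef("MathSymbols", 0x49),  # INFINITY
}

# Glyphs whose design differs per typeface.
PER_FACE = {
    "serif": {0x0041: GlyphRef("SerifLatin", 0x41)},  # LATIN CAPITAL LETTER A
    "sans": {0x0041: GlyphRef("SansLatin", 0x41)},
}


def lookup(face: str, code_point: int) -> Optional[GlyphRef]:
    """Check the face-specific table first, then fall back to shared glyphs."""
    specific = PER_FACE.get(face, {}).get(code_point)
    if specific is not None:
        return specific
    return SHARED.get(code_point)  # None when no glyph is available at all
```

With these tables, `lookup("serif", 0x2211)` and `lookup("sans", 0x2211)` resolve to the same shared glyph, while the letter A resolves to a different font per face, and an unmapped character yields `None` so the caller can decide how to handle the gap.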
| In fact, the Web browser is liable to be a problem for some time. Microsoft early on announced "support" for XML, but it turns out that that "support" is by using XML for metadata purposes. They are not currently showing much interest in providing a browser that displays XML instead of or in addition to HTML. No customer demand, I understand. I hope that changes. |
| Dave also asks about equations. Mathematical expressions have always been difficult to typeset. We may have solved much of the problem with respect to special characters in math, but typesetting typical mathematics is not just a matter of having a rich character/glyph set. Proper relative positioning of the glyphs is at least as important--and cannot be dealt with by having a large font. The standardization of generalized markup for mathematics is still in its infancy. XML, like SGML and for the same reasons, provides the tools with which math markup can be formalized, by defining appropriate element types and prescribing appropriate display styles for those element types. |
| When SGML was in its infancy, there were SGML-based publishing systems that were tied to particular DTDs, in that if you wanted a certain effect, you marked it up in a certain way. This was not much of an improvement over non-"generalized" markup, except that in some cases the markup was standardized, at least in some vertical markets. There was a loud outcry, and essentially all SGML-based publishing systems have either provided proprietary style sheets that allow user-specified formatting of arbitrary element types or have died. |
| But not quite. The user-specified style capabilities are generally limited to style variations applied to run-on text, and style variations applied to vertically run-on blocks of run-on text. Almost no publishing system suppliers provide user-specifiable formats that allow more complicated formatting specifications. |
| The two most common categories of displays that are more complicated are tabular displays and mathematical displays. (Chemistry diagrams are probably next--and you thought math display was complicated to specify!) Users have not insisted on tabular style sheets for arbitrary element types--I don't really understand why; the requirements for reasonable tabular display are certainly well known by now. On the other hand, I believe that the current state of the art in mathematical formatting--given a requirement to avoid ex-post-facto user-twitching of the display--is not there yet. |
| In both cases, we're still limited to "you must use these element types" when we want these complicated displays, and for math especially, there is still no widely-accepted-as-adequate formatting mechanism to access. It will be interesting to see whether such new developments as MathML become "widely accepted as adequate". And, if so, whether the associated formatting capabilities will be made available through style sheets to arbitrary element types. |
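To illustrate what "you must use these element types" means for math, here is a small sketch. It assumes a fragment written with MathML presentation element names (`mfrac`, `msup`, `mrow`, `mi`, `mn`, `mo`), shown without namespaces for brevity. The `linearize` function is a hypothetical toy, not a formatter: it only shows that the relative positioning of glyphs (numerator over denominator, raised exponent) is carried by the element structure rather than by the characters themselves.

```python
import xml.etree.ElementTree as ET

# A fraction (a + b) over c squared, marked up with MathML-style elements.
MATHML = """
<math>
  <mfrac>
    <mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow>
    <msup><mi>c</mi><mn>2</mn></msup>
  </mfrac>
</math>
"""


def linearize(node: ET.Element) -> str:
    """Flatten the markup to a one-line text form, driven by element type."""
    children = [linearize(child) for child in node]
    if node.tag in ("mi", "mn", "mo"):
        # Leaf tokens: identifiers, numbers, operators.
        return (node.text or "").strip()
    if node.tag == "mfrac":
        numerator, denominator = children
        return f"({numerator})/({denominator})"
    if node.tag == "msup":
        base, exponent = children
        return f"{base}^{exponent}"
    # math and mrow: a plain horizontal run of their children.
    return "".join(children)


result = linearize(ET.fromstring(MATHML))
```

The dispatch on `node.tag` is exactly the limitation described above: the layout rules are attached to these specific element types, and an arbitrary element type from another DTD would get no special treatment.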
| In summary: The character set of XML, once the glyphs become widely available in widely available XML browsers, should solve Cahners' "oddball characters" problems. (XML has introduced "oddball characters" problems of its own, but they do not appear to present difficulties for Cahners' situation.) Adequate commercial-off-the-shelf support for mathematics display is a separate problem and may or may not be solved as quickly. |