Designing Microdocument Architecture™, systems   Table of contents   Indexes   Software Agents using XML for Telecom Service Modelling : a Practical Experience

 
 

Do you Need XML? A Checklist...


 
PG Bartlett
ArborText, Inc.
1000 Victors Way
Ann Arbor, Michigan 48108 USA
Phone: +1 734.997.0200
Fax: +1 734.997.0201
Email: pgb@arbortext.com Web: www.arbortext.com
 
Biographical notice:
 
PG Bartlett
 
As Vice President of Marketing for ArborText, PG Bartlett has been instrumental in ArborText's development as the world's leading provider of content creation and management software for enterprise XML applications. Bartlett has served 18 years in technical and marketing positions at leading-edge high-technology companies. He is a regular presenter at major industry events and has been invited to present and chair sessions at Seybold Seminars, XML conferences, CALS conferences, and other major events. Since joining ArborText, Bartlett has co-authored two electronic presentations distributed by SGML Open, the industry consortium, and authored several white papers.
 
 

Summary

 
In 1986, the Standard Generalized Markup Language (SGML) became an international standard for the format of text and documents. SGML has withstood the test of time. Its popularity continues to increase among organizations with large amounts of document data to create, manage, and distribute. However, various barriers exist to delivering SGML over the Web. These barriers include the lack of widely supported interchangeable stylesheets, complex software because of SGML's broad and powerful options, and obstacles to interchange of SGML data because of varying levels of SGML compliance among SGML software products.
 
HyperText Markup Language (HTML) is the pervasive data format for the World Wide Web. While HTML provides an outstanding mechanism to deliver simple documents over the Web, its simplicity imposes limitations that significantly raise the cost of deploying complex websites.
 
Because mainstream Web browsers lack SGML support, most applications that deliver SGML over the Web convert the SGML to HTML. This down-translation removes much of the intelligence of the original SGML information. That lost intelligence virtually eliminates information flexibility and poses a significant barrier to reuse, interchange, and automation.
 
The Extensible Markup Language (XML) is being developed to enable delivery of SGML information over the Web while overcoming the limitations of HTML. The frenzy building behind the XML effort means that XML is inevitably destined to become the mainstream technology for powering broadly functional and highly valuable business applications on the Internet, intranets, and extranets.
 
This paper separates the hype of XML from its reality in order to provide answers to business-critical questions such as:
  • "When should I use XML instead of HTML?"
  • "When should I choose XML instead of full SGML?"
  • "How is XML easier than full SGML?"
 
In response to these and similar questions, this paper shows why:
  • XML will displace HTML in Web applications where high degrees of reuse, interchange, and automation are required.
  • XML will displace HTML as the preferred way to deliver SGML information over the Web.
  • Full SGML will eventually be surpassed by XML as the preferred approach for creating and storing enterprise-critical documents and data.
  • In addition to powering document applications, XML will enable non-document applications over the Web.
     

    XML: For SGML on the Web

     
    XML, or Extensible Markup Language, is a highly functional subset of SGML. The purpose of XML is to specify an SGML subset that works very well for delivering SGML information over the Web. When the mainstream Web browsers support XML, it's going to be very easy to publish SGML information on the Web.
     
    Because XML has almost all of the capabilities of SGML that are both important and widely supported, XML is indistinguishable from much of SGML as practiced. Although XML is missing a few capabilities of SGML, these missing capabilities only affect document creation, not document delivery. That's because XML was not initially designed to replace SGML in every respect, although work is under way to deal with the few important features of SGML that XML lacks.
     
     

    The Growing Momentum Behind XML

     
    Momentum behind XML has grown at a startling rate since development of the XML specification began in September 1996:
     
    • Microsoft has already shipped XML support in Internet Explorer 4.0, and they're likely to expand their XML functionality even further in the next release.
    • Netscape has promised to support XML in the 1998 version of Navigator.
    • Many other companies including Adobe, ArborText, Chrystal, DataChannel, Enigma, Grif, IBM, Inso, SoftQuad, Texcel and WebMethods have already announced or will soon announce XML support in their products.
    • All of the companies represented in the XML Working Group will likely either support or utilize XML within the next year.
    • Articles about XML frequently appear in mainstream IT publications such as Byte, InfoWorld, and PC Week, and XML is receiving extensive coverage by influential newsletters such as Seybold Reports.
    • Major industry analyst services such as GartnerGroup, Meta Group, and CAP Ventures are covering XML.
     
    By now, it's become evident that XML will become the primary means to deliver over the Web the vast amount of SGML-based information that currently exists. Further, XML is likely to become the underlying technology to leverage the Internet for innovative new business-critical applications. This paper explains why.
     
     

    The Limitations of Browsing: Why HTML Is Limited to Document Delivery

     
    HTML is the current wildly popular markup language for delivering documents over the Web, but HTML also has several limitations that become apparent for applications that are larger or more functional than home pages and small websites. The following paragraphs explain these limitations in more detail.
     
    Limited structure - HTML is a set of tags that specify the on-screen appearance of each element on a Web page. The set of tags that make up HTML expands each time HTML goes through a new revision, but the fundamentals remain the same: HTML is primarily oriented to presentation and supports only a fixed and trivially simple structure.
     
    In this, HTML shares the limitations of other presentation-specific markup languages, such as RTF, which is designed for documents that are delivered in print. The principal limitations are explained in the following paragraphs:
     
    Limited reuse - Many organizations publish the same information in multiple forms; it's very common to have both printed and Web forms of the same data. Information originally created in HTML can be reused for printing, and information originally created for printing can be reused for Web delivery.
     
    However, to achieve reuse requires conversion that's usually followed by manually fixing up the appearance (i.e., the formatting) of the resulting document. And that means that each time the source information changes, the conversion and fix-up process must be repeated. This is an expensive, time-consuming, labor-intensive, and error-prone process.
     
    Limited interchange - Because the Internet is simple and ubiquitous, it provides an ideal medium for organizations that want to interchange data. However, HTML undermines interchange because its small, fixed set of tags primarily indicates only the appearance of an element of a document. HTML provides no way to denote the data within a document, which cripples attempts to extract data from the Web and reuse that data for other purposes.
     
    For example, a computer manufacturer may wish to capture semiconductor data from its suppliers and feed that data into its computer-aided design (CAD) systems. CAD systems require data such as the function, tolerances, and timing of each pin of an integrated circuit. HTML provides no way to tag such data unambiguously. In fact, even if the original source data contains the necessary tagging, which is likely if the source data is in SGML, the resulting down-translation to HTML strips all the intelligence away.
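    The semiconductor example can be sketched in a few lines of Python. This is only an illustration: the element and attribute names (component, pin, function, tolerance, timing) and the part number are invented here, and Python's standard-library parser stands in for whatever tool a CAD system would actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical supplier data; every element and attribute name is invented
# for illustration and does not come from any real industry DTD.
SUPPLIER_XML = """
<component part="XYZ-100">
  <pin number="1">
    <function>clock input</function>
    <tolerance unit="percent">5</tolerance>
    <timing unit="ns">12</timing>
  </pin>
  <pin number="2">
    <function>data output</function>
    <tolerance unit="percent">2</tolerance>
    <timing unit="ns">8</timing>
  </pin>
</component>
"""

def extract_pins(xml_text):
    """Return one dictionary per pin, ready to hand to a downstream system."""
    root = ET.fromstring(xml_text)
    pins = []
    for pin in root.findall("pin"):
        pins.append({
            "number": int(pin.get("number")),
            "function": pin.findtext("function"),
            "tolerance_percent": float(pin.findtext("tolerance")),
            "timing_ns": float(pin.findtext("timing")),
        })
    return pins

pins = extract_pins(SUPPLIER_XML)
for pin in pins:
    print(pin)
```

    Because each value sits inside an element that names what it is, the receiving program does not have to guess which bold or italic run on a page holds the timing figure.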
     
    Limited automation - Automation saves labor, reduces costs, speeds delivery, and improves quality. There are many opportunities for adding automation to the use of the Web, particularly for intranets and extranets. Examples include almost any forms-based application, such as insurance enrollments, medical claims processing, and online banking.
     
    However, HTML poses a significant barrier to achieving automation. All highly automated processes are built on a data format that's very expressive and absolutely consistent. HTML lacks the necessary expressiveness, since it's limited to a fixed set of presentation-oriented tags, and lacks as well the absolute consistency, since there's no way to impose a rigorous data structure on top of those tags.
     
    Searching produces too many hits - One of the most valuable capabilities of the Web is provided by search engines that allow a user to find everything on the Web related to an inquiry. As the volume of information available on the Web continues to skyrocket, however, the amount of data retrieved for a typical search has risen to unusable proportions. Searchers of information must choose between queries that are so narrow that relevant information may be omitted from the results, and queries so general that they produce far too many hits to be useful.
     
    The reason that Web searches turn up too many hits is that we typically search all the content of every page. Although searches can be limited to titles, those searches are almost certain to exclude relevant hits.
     
    One of the best ways to improve Web searching would be to provide content-specific elements. For example, the word "bonds" could be tagged as a name, or a chemical term, or a financial term. Then searches for content related to "bonds" could be limited to a specific domain of inquiry.
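    A minimal sketch of such a domain-limited search follows, assuming a hypothetical tagging scheme in which a term element carries a domain attribute. The element names, attribute name, and sample sentences are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-specific markup: <term domain="..."> and <name> are
# invented element names, not part of any standard tag set.
DOC = """
<articles>
  <article id="a1"><p>Municipal <term domain="finance">bonds</term> rallied today.</p></article>
  <article id="a2"><p>Covalent <term domain="chemistry">bonds</term> share electrons.</p></article>
  <article id="a3"><p>Mr. <name>Bonds</name> chaired the meeting.</p></article>
</articles>
"""

def search(xml_text, word, domain):
    """Return the ids of articles that use `word` as a term in `domain`."""
    root = ET.fromstring(xml_text)
    hits = []
    for article in root.iter("article"):
        for term in article.iter("term"):
            same_word = (term.text or "").lower() == word.lower()
            if same_word and term.get("domain") == domain:
                hits.append(article.get("id"))
                break  # one hit per article is enough
    return hits

finance_hits = search(DOC, "bonds", "finance")
chemistry_hits = search(DOC, "bonds", "chemistry")
print(finance_hits, chemistry_hits)
```

    A plain full-text search for "bonds" would return all three articles; the tagged search returns only the one in the requested domain.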
     
    Moving target: HTML 2.0 to 3.2 to 4.0 to ?? - Since HTML is an evolving standard, its capabilities are continually being extended through the introduction of new tags. For those who are maintaining large amounts of information in HTML, the release of new revisions of HTML usually requires reviewing and retagging the existing data. In fact, many webmasters are relieved that the intervals between new versions of Web browsers are increasing, because that means that they don't have to retag their websites as often.
     
    To avoid the retagging problem entirely, many organizations create their source information in SGML and down-translate to HTML. The level of effort for changing an SGML-to-HTML translator may be as little as a few hours, while the effort to retag hundreds or thousands of pages can stretch into weeks or months.
     
     

    SGML: Father to HTML and Brother to XML

     
    SGML prescribes the rules for creating a specific markup language such as HTML. In other words, HTML is an application of SGML. While HTML is a single set of tags, SGML provides the capability for creating any desired set of tags. XML is similar to SGML in that it likewise provides the capability to create any tags.
     
    The primary benefits of SGML are the same as the benefits of XML:
    • Infinite possibilities for expressing information (infinite tag set)
    • Write once, reuse many times
    • Future-proof, platform-proof
    • Validation for completeness and correctness
     
    SGML's Limitations for Web Delivery
     
    To replace or even supplement HTML for Web delivery of information, SGML poses some significant roadblocks. The following paragraphs explain why.
     
    No mainstream browser support - The primary problem is that SGML never caught on with the mainstream browser providers. Microsoft Internet Explorer and Netscape Navigator do not contain any support for SGML. Why? Because SGML offers so many options that designing tools to support them all results in complicated software. Even the premier SGML tool providers do not support 100 percent of the options that the SGML standard allows.
     
    With Web browsers supporting only HTML, organizations wishing to publish their SGML information on the Web typically apply an automatic SGML-to-HTML conversion to their data. This produces acceptable results for simple viewing applications but at the cost of "dumbing down" the data so that interchange and automation are much more difficult.
     
    The reason for these difficulties is that the down-conversion from SGML to HTML results in a significant loss of information. Without that information, it's virtually impossible to reconstruct the original meaning of the SGML files by looking only at the HTML file.
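    The loss can be illustrated with a toy down-translator. The sketch below assumes the source is already in an XML-like syntax so that Python's standard-library parser can read it; the source element names (para, partnumber, warning) and the mapping table are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from content-specific elements to HTML tags.
# Two different source elements end up as the same presentation tag.
TAG_MAP = {"para": "p", "partnumber": "b", "warning": "b"}

def down_translate(xml_text):
    """Rename every element to its HTML counterpart and serialize the result."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, "span")
    return ET.tostring(root, encoding="unicode")

source = "<para>Replace <partnumber>A-7</partnumber> <warning>before use</warning>.</para>"
html = down_translate(source)
print(html)
```

    In the output, the part number and the warning are both simply bold text. Nothing in the HTML says which is which, so no program can reverse the mapping with confidence.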
     
    An analogous situation occurs when you convert a CAD drawing into a GIF file for viewing on the Web - reconstructing the original CAD file from the GIF file is virtually impossible for all but trivial examples.
     
    No support for styles - Another barrier to using SGML for Web delivery is that SGML only standardizes structure; SGML does not include any support for styles. There have been a couple of attempts to establish a stylesheet standard, most notably FOSIs (Formatting Output Specification Instances, a standard originally developed by the U.S. military) and DSSSL (Document Style Semantics and Specification Language), but each of these has received little or no vendor support. The result is that there is no widely accepted standard stylesheet format for presenting SGML information.
     
     

    XML Delivers Benefits of SGML and HTML

     
    XML was invented to enable the delivery of SGML information over the Web. XML overcomes the limitations of SGML for Web delivery while providing all of its benefits.
     
    XML is different from SGML in many ways, but there are only a few that are significant from a business manager's perspective. The SGML capabilities that were dropped from XML are those that are irrelevant to the delivery of structured information over the Web. However, some of these capabilities are important to the creation of structured information.
     
    It's possible that subsequent revisions of XML will restore some or all of the omitted SGML capabilities that are crucial for information creation. In the meantime, continuing to use SGML will insulate you from changes in XML.
     
    The following paragraphs explain the significant differences between XML and SGML and their implications.
     
    No DTD required - In order to process SGML data, a processing application requires both the DTD and the data.
     
    In contrast, XML does not require a DTD in order to process the data.
     
    To eliminate the requirement for a DTD, XML data contains embedded cues to the data's structure. These embedded cues represent minor changes to the SGML data format.
     
    XML-enabled Web browsers are just one example of an XML processing application. Another XML processing application might be a banking system front-end that can receive XML-based financial transactions and convert them into deposit and withdrawal instructions. The benefit of eliminating the DTD for processing applications such as these is not only to reduce the network bandwidth used up by downloading the DTD, but also to simplify the construction and reduce the size of processing applications because they don't have to interpret a DTD.
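    As a rough sketch of the banking front-end idea, the following Python reads a small XML transaction batch without consulting any DTD. The transaction vocabulary (transactions, transaction, amount, and the type and account attributes) is invented for illustration, and the "instructions" are simply tuples.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical transaction format; no DTD is supplied or needed to read it.
TRANSACTIONS = """
<transactions>
  <transaction type="deposit" account="12345">
    <amount currency="USD">250.00</amount>
  </transaction>
  <transaction type="withdrawal" account="12345">
    <amount currency="USD">75.50</amount>
  </transaction>
</transactions>
"""

def to_instructions(xml_text):
    """Convert each transaction element into an (ACTION, account, amount) tuple."""
    root = ET.fromstring(xml_text)
    instructions = []
    for tx in root.findall("transaction"):
        kind = tx.get("type")
        if kind not in ("deposit", "withdrawal"):
            raise ValueError("unknown transaction type: %r" % kind)
        amount = Decimal(tx.findtext("amount"))
        instructions.append((kind.upper(), tx.get("account"), amount))
    return instructions

instructions = to_instructions(TRANSACTIONS)
for instruction in instructions:
    print(instruction)
```

    Note that the program has to check the transaction type itself. Without a DTD, the parser guarantees only that the tags are balanced, not that the values make sense.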
     
    Eliminating the requirement for a DTD does not mean that it's easier to create XML applications than it is to create SGML applications - except for "personal" uses of XML, such as informal communications or one-of-a-kind document types. For such applications, where absolutely consistent structure may not be important, working without a DTD is indeed a welcome improvement. But for most if not all information that's currently in SGML, which is typically information with a regular structure created within a formal process, DTDs remain crucial.
     
    In other words, to obtain all of the benefits you traditionally associate with SGML - reuse, interchange, and automation - you'll still want to use a DTD when authoring XML in order to ensure the absolute data consistency you need to achieve those benefits. For those applications, SGML and XML involve similar levels of effort to implement. These applications call for "valid" XML.
     
    Well-formedness - The alternative to valid XML is "well-formed" XML, where no DTD is used to create the data. To be well formed, a document must comply with various rules. For example, a well-formed XML document must have at least one tag pair, all elements must be nested and have balanced start and end tags, and there must be declarations for any entities used. This imposes a fairly simple requirement on an XML processing application that does not handle DTD-based document validation.
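    A well-formedness check needs nothing more than a parser. The sketch below uses Python's standard-library parser, which does not validate against a DTD, as a stand-in for any non-validating XML processor; the sample documents are invented.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, without consulting any DTD."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

samples = {
    "balanced": "<note><to>Ann</to></note>",
    "bad nesting": "<note><to>Ann</note></to>",
    "undeclared entity": "<note>&undeclared;</note>",
    "no element": "",
}
for label, text in samples.items():
    print(label, is_well_formed(text))
```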
     
    Exceptions - Inclusions and exclusions allow you to specify exceptions in your content model. For example, you can use exclusions to enable paragraphs to contain appendix references except when those paragraphs appear in the appendix. This is important because many processing applications may be unable to deal with unexpected constructs. For example, what does a print rendering engine do if it encounters a footnote within a paragraph within a footnote? The lack of support in XML for exceptions is one of the chief reasons that many of the existing industry interchange DTDs aren't being quickly replaced by XML, but this may be addressed in XML-Data, a potential replacement for DTDs that is still in its early stages of development.
     
    AND content models - There is no support for AND (&) content models in XML. That means that XML prevents authors from inserting elements in any order while still requiring that all elements be used. For example, the lack of "AND" means that you cannot define a title page that allows a title, optional subtitle, and author(s) in any order.
     
    The lack of an AND will have a large effect on some industry exchange DTDs, which are often loose in their enforcement of sequence while remaining strict in their enforcement of completeness. Industry-wide DTDs often choose to leave order up to local implementers using (A&B&C) in the expectation that local DTDs will be derived from the exchange DTD and that these will choose one order. Without an AND, industry-wide DTDs must loosen their content models to ((A|B|C)+) or tighten them to one definite order (A,B,C).
     
    AND models always have an equivalent that can be programmatically generated, but the equivalent can be too large to be practical. Again, XML-Data may provide relief in the future.
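    To see why the generated equivalent becomes impractical, here is a small sketch that expands an AND group into a choice of sequences. It assumes every element in the group is required (the optional subtitle of the title-page example is ignored), and it factors the alternatives by their first element so that no two branches begin with the same name.

```python
from math import factorial

def expand_and_group(names):
    """Rewrite the SGML model (A & B & ...) using only ',' and '|' connectors."""
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        alternatives.append("(" + first + "," + expand_and_group(rest) + ")")
    return "(" + "|".join(alternatives) + ")"

print(expand_and_group(["A", "B", "C"]))

# The number of orderings grows as n!, so the model text grows very quickly.
for n in (3, 5, 8):
    names = ["e%d" % i for i in range(n)]
    print(n, factorial(n), len(expand_and_group(names)))
```

    Three elements already need six orderings; eight elements need 40,320, which is why a hand-maintained DTD cannot realistically carry the expansion.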
     
    SDATA internal entities - If you have small system-specific chunks of information, such as mathematical symbols or other symbols specific to your application, SGML permits you to define them with SDATA internal entities. Although these were designed to be system-specific, many SGML tools support a common set. XML does not support this capability.
     
    Stylesheet standard - The next section describes the stylesheet standard related to XML.
     
     

    XSL: Doing XML With Style

     
    While XML specifies a data format for both document and non-document information delivered over the Web, there is a closely related effort to define style. This effort is called Extensible Style Language (XSL). The highlights of the XSL initiative are described in the following paragraphs:
     
    Based on DSSSL - After SGML became an international standard, work began on developing a stylesheet standard. The purpose of the standard was to facilitate the interchange of stylesheets and ultimately to improve the interoperability of all of the software that handles documents. This effort, formally known as the Document Style Semantics and Specification Language (DSSSL), was eventually approved as an ISO standard. To date, however, no commercial application supports DSSSL.
     
    XSL will provide much of the functionality of DSSSL, but in a form that is far more likely to be widely adopted and supported.
     
    Compatible with CSS - Cascading Style Sheets (CSS) are supported by both Microsoft and Netscape as a mechanism for overriding the default style of HTML tags. As a result, CSS offers more formatting flexibility than HTML without a stylesheet. XSL will be a superset of the CSS functionality. XSL will be designed to enable automatic conversion from CSS, so existing investments in CSS will not be lost.
     
    Reordering capability - Through XSL stylesheets, a Web browser will be able to change the sequence of the data that is displayed without going back to the server. This will be useful for any application that needs to support the interactive suppression or enabling of data display, as well as any arbitrary sequence.
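    The effect of reordering can be approximated in Python, used here only as a stand-in for what a stylesheet-driven browser would do. The data is parsed once, then presented in two different sequences without fetching anything again. The catalog vocabulary is hypothetical, and this is not XSL syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog data, delivered to the client once.
CATALOG = (
    "<catalog>"
    "<item><name>Widget</name><price>9.50</price></item>"
    "<item><name>Gadget</name><price>4.25</price></item>"
    "<item><name>Gizmo</name><price>12.00</price></item>"
    "</catalog>"
)

def display_order(root, field, numeric=False):
    """Return item names sorted by the given child element."""
    def key(item):
        value = item.findtext(field)
        return float(value) if numeric else value
    return [item.findtext("name") for item in sorted(root.findall("item"), key=key)]

root = ET.fromstring(CATALOG)          # parsed once, as if already downloaded
by_price = display_order(root, "price", numeric=True)
by_name = display_order(root, "name")
print(by_price)
print(by_name)
```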
     
    More powerful context sensitivity - While CSS supports the application of style based on the parent of an element, XSL allows the style to vary based on all the ancestors, descendants, and siblings of an element. This will provide far more formatting flexibility based on the context or position of an element within a document.
     
    Automatic text generation - XSL provides the capability to generate text automatically, such as generating the word "Chapter" at the beginning of each chapter, followed by the chapter number itself.
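    The chapter example can be sketched as follows. Python stands in for the stylesheet processor, and the book and chapter element names are assumptions. The point is that the word "Chapter" and the number appear nowhere in the source data.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: it contains titles only, no chapter numbers.
BOOK = (
    "<book>"
    "<chapter><title>Why XML</title></chapter>"
    "<chapter><title>Doing XML With Style</title></chapter>"
    "</book>"
)

def render_headings(xml_text):
    """Generate 'Chapter N: title' headings from the position of each chapter."""
    root = ET.fromstring(xml_text)
    return [
        "Chapter %d: %s" % (number, chapter.findtext("title"))
        for number, chapter in enumerate(root.findall("chapter"), start=1)
    ]

headings = render_headings(BOOK)
for heading in headings:
    print(heading)
```

    If a chapter is inserted or removed, the numbering follows automatically because it is computed at display time rather than stored in the document.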
     
    Supports both printing and online display - While CSS is limited to online display functions, XSL will support formatting functions that are needed in order to support the greater complexity of printed documents.
     
     

    XML Link: XLink and XPointer

     
    The XML-related specification that deals with hypertext linking was formerly called the "Extensible Link Language" or XLL. Currently, this functionality is being addressed in two companion specifications under the name "XML Link": XLink and XPointer, which are still in early stages of development.
     
    The XLink specification specifies a simple set of constructs that describe links between objects within the same or different documents. Through XLink, you will be able to use XML to create a structure that can describe the simple unidirectional hyperlinks of today's HTML as well as more sophisticated, multi-ended, typed, self-describing links. XLink aims to provide an effective, compact structure for representing links that can be within documents or external to them, and that can have multiple typed locators, indirection, and precise specification of resource locations in XML and SGML data.
     
    The XPointer specification is a "utility" specification that defines how to address into the internal structures of XML documents. This addressing syntax, which is a powerful way to extend URL addressing, is used by XLink to indicate the endpoints of the links that the XLink specification defines.
     
    The extra flexibility of XLink will allow users to do much more than the current capabilities of HTML permit. For example, activating an XLink link could produce a pop-up dialog, create a secondary browser window, jump to a specific point in another document, or produce a list of possible targets from which the user could choose. XLink links could also be used to implement customized bookmarks, allow annotations to be attached to read-only documents, and provide for automatically generated glossaries or lists of references.
     
     

    XML-Data

     
    XML-Data was designed by ArborText, DataChannel, Inso, and Microsoft to replace and improve upon DTDs. These four companies jointly submitted their design to the W3C as a proposal for a formal specification.
     
    XML-Data prescribes the format of "schemas" for XML documents and data. (Schemas are commonly used in database applications to specify the valid content of various fields and to indicate the relationships among fields and records.) An XML-Data schema describes the rules for creating valid XML data for a specific application. XML-Data schemas include three key improvements over DTDs:
     
    • Content validation - XML-Data's most important feature is its support for validating content. In comparison with DTDs, which specify only whether an element is allowed, an XML-Data schema specifies how to validate the content of the element itself. For example, an element could be specified to be a number that falls between 0 and 99.
    • Inheritance - XML-Data provides a way for elements to inherit properties of other elements. In contrast with DTDs, where each element must be defined separately, an XML-Data schema allows the user to specify classes of elements. Inheritance makes schemas simpler to maintain and more modular than DTDs.
    • XML-encoded - An XML-Data schema is itself an XML document. That means that some tools for creating and processing all kinds of XML data could be readily used on XML-Data schemas. In contrast, DTDs have a highly specialized syntax that requires a unique set of editing and processing tools.
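    Content validation of the kind described in the first item can be sketched as follows. The rule table below is a plain Python dictionary standing in for a schema; it is not XML-Data syntax, and the element name age is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table standing in for a schema: element name -> allowed range.
# This is NOT XML-Data syntax, only an illustration of content validation.
RULES = {"age": (0, 99)}

def validate(xml_text, rules):
    """Return a list of error messages for elements whose content breaks a rule."""
    errors = []
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag not in rules:
            continue
        low, high = rules[elem.tag]
        text = (elem.text or "").strip()
        try:
            value = int(text)
        except ValueError:
            errors.append("%s: %r is not a number" % (elem.tag, text))
            continue
        if not low <= value <= high:
            errors.append("%s: %d is outside %d-%d" % (elem.tag, value, low, high))
    return errors

print(validate("<person><age>42</age></person>", RULES))
print(validate("<person><age>140</age></person>", RULES))
print(validate("<person><age>old</age></person>", RULES))
```

    A DTD could confirm only that an age element is present where it is allowed; a check like this one looks at the value inside it.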
     
     

    Is XML Easier than SGML?

     
    This is one of the questions that a lot of people are asking these days: "Isn't XML easier than SGML?" Because if it is, why wouldn't you use XML and forget about SGML?
     
    Here's the answer: If you're a software developer, you will definitely want to consider writing your application based on XML instead of SGML. But if you're publishing on the Internet and on paper, or if you're building large intranet/extranet applications, then XML and SGML are equally easy. Let's look at each type of application.
     
    Software development - There's no question that some tools that support XML will be easier to build. If you're a software developer and you want to use XML as a data interchange format, you'll be able to find a freely available parser that will examine an XML data stream. Then you can write a small program to find the XML elements you need and give that data to your processing application. The code will be much smaller than the equivalent code for SGML, which has to parse a DTD (a DTD is not an XML-tagged document, so it requires a separate parsing component) as well as the data itself. And since XML has almost no options, you only have to write a tiny amount of code, if any, to deal with those options.
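    The "small program" described above might look like the sketch below. Python's standard-library streaming parser plays the role of the freely available parser, and the order vocabulary is hypothetical. The program reads the stream element by element and keeps only what the processing application needs.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data stream; in practice this would be a network or file stream.
STREAM = io.BytesIO(
    b"<orders>"
    b"<order id='1'><customer>Acme</customer><total>19.99</total></order>"
    b"<order id='2'><customer>Zenith</customer><total>5.00</total></order>"
    b"</orders>"
)

def order_totals(stream):
    """Collect the total of each order as the stream is parsed."""
    totals = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "order":
            totals[elem.get("id")] = elem.findtext("total")
            elem.clear()  # discard the order's children once they have been read
    return totals

totals = order_totals(STREAM)
print(totals)
```

    No DTD is read at any point, which is the main reason the equivalent SGML program would be larger.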
     
    Since you can get a freely available SGML parser just as easily as an XML parser, you may wonder why it really matters. And the answer is that for any application where a freeware parser is sufficient, the only real difference is code size and speed. An SGML parser is a lot bigger and a little slower. But many application developers, especially those who are working on non-document applications, prefer to write their own parser. And that's way too big a job with SGML.
     
    Creation and delivery - If you're aiming to build a database of modular document components that you can easily reuse, interchange, and automate, then XML is no easier than SGML. For these kinds of applications, you'll still need to perform all the up-front requirements analysis as well as the rigid enforcement of rules to ensure an absolutely consistent data format.
     
     

    XML for Non-Document Applications

     
    Some of the most intense XML activity involves data transfer for financial and business transactions that have no connection to typical HTML applications. That's because XML is fundamentally a highly flexible data format that is capable of representing a wide variety of information.
     
    For example, XML is expected to be used for transaction-oriented applications such as the generation and management of consumer financial transactions, health records, and insurance enrollments. In fact, XML will blur the distinction between document- and transaction-oriented applications while increasing information providers' ability to deliver mass-customized (personalized) information on the Web.
     
     

    The CheckList

     
    Under what circumstances and for what purposes should you use HTML, SGML, and XML?
    Type of Application                 Create In:     Deliver In:
    Home page/small website             HTML           HTML
    Large webs/large amounts of data    XML or SGML    XML or HTML
    Data must be reusable               XML or SGML    XML
    Automation/creation side            XML or SGML    XML or HTML
    Automation/delivery side            XML or SGML    XML
    Complex data structure              XML or SGML    XML or HTML
    Formal processes                    XML or SGML    XML or HTML
    Non-document, data oriented         XML            XML
    Searchable                          XML or SGML    XML
    Interchange                         XML or SGML    XML
