
 

XML: What HTML Wanted to Be!

 
Norma  Haakonstad
National Accounts Manager,  Arbortext, Inc. 
 1000 Victors Way
Ann Arbor  (Michigan)  48108 
Email: njh@arbortext.com

Biographical notice

As National Accounts Manager for ArborText, Ms. Norma Haakonstad works closely with many of the nation's largest publishing, automotive, heavy equipment, telecommunications, and pharmaceutical companies implementing enterprise SGML/XML applications. Before she became National Accounts Manager, Ms. Haakonstad served as ArborText's Midwest Regional Sales Manager for five years.

Prior to joining ArborText, Ms. Haakonstad was one of the owners of Integrated Engineering Software Inc., where she ran marketing and sales operations. Before that, she was Sales Manager for Electrocon International, Inc., a company that develops software for electric utility applications. Ms. Haakonstad holds a degree in Business Management.

 

Introduction

 There's no doubt that there's huge hype behind the XML frenzy, but there's also a lot of substance. As the number of vendors pledging support for XML has climbed from a few to a few dozen to a hundred or more, it's clear that XML is rocketing into the mainstream.
 The reason that XML is important -- in fact, the reason that it's crucial for you to gain an early grasp of XML's implications for your organization -- is that it's now crystal-clear that XML will be the next-generation language of the Web. With HTML, we saw explosive growth of the Web even though its primary business use was merely advertising and public relations. But with XML, businesses can finally realize the full potential of the Web -- by putting the Web to work with high-value-added information on enterprise-critical applications. These applications will bring meaningful competitive differentiation and high ROI to those who can move quickly to exploit them.
 The purpose of this paper is to let you know the current state of XML, its exciting future, what is realistic to expect from it, and why it's important to you.
 Most of you know that ArborText has been involved with standards since day one. As a leading vendor of SGML-based software for authoring, managing, and delivering structured documents, we have a lot of experience in the strengths and weaknesses of SGML. We are contributing our experience to the entire set of XML-related standards, all of which are under the auspices of the World Wide Web Consortium, or W3C. Eleven companies including ArborText were part of the original XML Working Group that formed in late 1996. As a result of the efforts of the working group, the standard was formally adopted in February, 1998.
 The Extensible Style Language, or XSL, is the XML way to attach style to XML content. ArborText, Inso, and Microsoft jointly proposed an XSL specification to the W3C to kick off the XSL effort, and Paul Grosso of ArborText is our representative on the XSL Working Group. But XSL is about more than just describing how an element is formatted on the screen or in print -- it's about attaching any kind of behavior to an element, not just formatting.
 XLL, or Extensible Link Language, will bring additional capabilities to today's HTML linking. The XLL specification goes beyond traditional linking to allow you to attach links to documents even when you don't control (and therefore can't change) those documents.
 The DOM Working Group is developing the Document Object Model, a standardized API for accessing and manipulating HTML and XML elements. Because the API is standardized, you'll be able to write software that's reusable across a variety of different tools.
 The last on the list is XML-Data, currently at the proposal stage. XML-Data was jointly developed by ArborText, DataChannel, Inso, and Microsoft and has been acknowledged by the W3C. The purpose of XML-Data is to create a "schema" to specify not only the validity and relationships of XML elements, but also the content of those elements. What does this mean for us? It allows us to go beyond what the DTD will provide. It allows us to validate the data without supplementary and proprietary software routines.
 

What is XML?

 XML stands for "Extensible Markup Language" -- it's extensible because it is not a fixed set of elements like HTML. XML was originally developed by SGML people to enable delivery of SGML documents over the Web.
 But during the process of defining XML, the vision of that group expanded to include other ways to apply XML, including basing general data formats on XML, and using XML as the data encoding scheme for metadata and transaction data.
 Momentum behind XML has grown to a frenzy, so much so that it's now certain that XML is going to be broadly supported. Companies already supporting XML in their products include ArborText, Chrystal, DataChannel, Grif, Inso, Microsoft, and WebMethods. Far more companies have pledged support in the future, including Adobe, IBM, and Netscape. And dozens of companies are cooperating to develop XML-based standards for metadata and transactions, not only IT companies but also banks and credit card companies.
 We have been thinking of XML as SGML minus minus instead of HTML plus plus. Why? Because XML offers 95% of the capabilities of SGML while it's vastly more powerful than HTML.
 But now we're starting to see that XML is on the way to becoming SGML plus plus -- it's got the power of SGML and the simplifications needed to address a mainstream market. Also, emerging standards such as XSL, XLL, and XML-Data promise to deliver even more power and functionality than we would ever have seen from SGML. And we can begin to work with data beyond traditional document applications.
 

Uses of XML

 When we look at XML to decide where it will be used and who will be using it, we'll be looking beyond comparing it to other document content formats. We'll be comparing XML to other data formats, other metadata formats, and other transaction formats as well.
 Consider the available formats for document content. Existing formats include SGML, the international standard, HTML, the way almost all Web documents are formatted, and a variety of different proprietary word processing and desktop publishing formats.
 Expanding our view to data formats, we see that there are almost as many data formats as there are applications for those formats. There's the result of a database query, the contents of a configuration or initialization file, a graphics image, a few seconds of sound, a video, and on and on.
 Another area to consider is metadata. We think of metadata as information about documents as opposed to the contents of the documents themselves. For example, metadata might include author, date of original creation, date of last revision, permissions to read and change, and so on. With XML, there will be a standardized method of adding metadata to documents.
 Finally, we have broadened the discussion further to include transactions such as electronic transfers of funds, purchase orders, inventory checks, and other forms of electronic transfers. Today, in the world of EDI, there are a huge number of overlapping and incompatible formats.
 

XML vs SGML

 While XML is substantially based on SGML, it improves on SGML in several crucial ways:
 
  1. XML eliminates virtually all of the options of SGML. For example, SGML has a feature called "tag minimization" that was originally provided to make it easier to use a simple ASCII editor to type tags. But today there are powerful, simple tools that make tags even easier to enter and manage. XML also eliminates little-used advanced features of SGML that are both complicated to support and have led to interoperability problems.
  2. One of the primary attractions of XML, of course, is that we're seeing support within the mainstream browsers. The most recent release of Microsoft Internet Explorer is the first to support an extensible data format, and Netscape has likewise promised to support XML in an upcoming release. Since SGML can very easily be converted to XML "on the fly," the mainstream browser support for XML will give us the first direct path to deliver both XML and SGML over the Web.
  3. Although SGML-related standards for stylesheets and linking never caught on, we predict great success for the XML-inspired versions of these standards. XSL is the XML stylesheet standard and XLL is the XML linking standard -- both are likely to receive the enthusiasm and support necessary to make them viable. We'll look at each of these standards in more detail later.
  4. One of XML's most hyped features is that it allows content to be processed without requiring the presence of a DTD. This is certainly a valuable improvement, but it's misunderstood by many, as is noted in the next section.
     

    DTDs in XML

     A DTD, which stands for "Document Type Definition," establishes all the rules and relationships for a particular document. If you don't know SGML but are familiar with HTML, you may find it helpful to learn that HTML is defined by a DTD. The DTD for HTML defines which elements you can use on a Web page and how you can use them.
     DTDs are extremely valuable because the consistency they enforce on the creation side supports automatic processing on the assembly and delivery side. Ever wonder why you couldn't just add your own tags to HTML? Because if you did, the HTML applications downstream from you wouldn't know how to handle them.
     XML does not require a DTD for processing. That means, for example, that a Web browser can process and display XML data without requiring its DTD as well. The primary benefit of eliminating the DTD is to simplify the design of processing applications because they don't have to be capable of interpreting a DTD.
     Eliminating the DTD is possible thanks to minor data format changes that provide embedded cues within XML data that SGML only provides through a DTD. However, eliminating the requirement to send a DTD along with its associated data does not mean that "anything goes" when creating the data. To obtain all of the benefits you traditionally associate with SGML -- reuse, interchange, and automation -- you'll still want to use a DTD when authoring XML in order to ensure the absolute data consistency you need to achieve those benefits.
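     As a rough illustration of DTD-less processing, the sketch below parses a small document with Python's standard xml.etree.ElementTree module. The memo document and the helper names outline and is_well_formed are invented for this example. Note the `<separator/>` empty-element syntax: it is one of the embedded cues that tells a parser an element has no content, something SGML communicates through the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: there is no DOCTYPE declaration and no DTD anywhere.
DOC = """<memo priority="high">
  <to>Service Department</to>
  <subject>Brake inspection</subject>
  <separator/>
  <body>Check pads every <interval unit="km">10000</interval>.</body>
</memo>"""


def outline(xml_text):
    """Return (depth, tag) pairs in document order, using no DTD."""
    result = []

    def walk(elem, depth):
        result.append((depth, elem.tag))
        for child in elem:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return result


def is_well_formed(xml_text):
    """True if the text parses as XML; says nothing about validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


for depth, tag in outline(DOC):
    print("  " * depth + tag)
```

     The parser recovers the full element structure from the data alone. What it cannot tell you is whether a memo is supposed to contain a subject, which is exactly the kind of rule a DTD supplies at authoring time.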
     There are a couple of good reasons that you'll want the flexibility to create new elements on the fly: rapid prototyping and "personal" applications of XML. That's why Arbortext intends to support this use of XML in upcoming releases of its software.
     The effort to make XML simpler than SGML focused on the capabilities of the DTD. The result of that effort was to omit capabilities from SGML DTDs. The list of capabilities dropped from SGML is quite long, but we've developed workarounds for most of these. However, there are a couple of capabilities that were omitted from XML that you might notice, especially if you are working in one of the industries that use standard interchange DTDs such as those developed for the aerospace and automotive industries. There are two capabilities I'd like to touch on:
     
  • The first is inclusions and exclusions that allow you to specify exceptions to your DTD.
  • The second is something called AND content models, which allow you to insert elements in any order while still requiring that all elements be used. The industry interchange DTDs tend to be loose in their enforcement of sequence while remaining strict in their enforcement of completeness. Losing these capabilities has been problematic for those DTDs and is most likely the reason why they haven't been replaced by XML.

     We see two potential solutions to the missing features. First, it's possible that later revisions of XML will address these issues. You all know that lots of changes occur as both software and standards are revised through 1.1, 1.2, 2.0, and so on. We know that XML 1.1 is coming -- we just don't know when it's coming or exactly what it will support.
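     Until one of those solutions arrives, the rule behind an AND content model ("all of these elements, in any order") can be enforced by the application itself. The sketch below is one minimal way to do that in Python; the part record and the function name satisfies_and_model are assumptions made for illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET


def satisfies_and_model(parent, required):
    """True if parent's children are exactly the required tags,
    each appearing once, in any order."""
    tags = [child.tag for child in parent]
    return sorted(tags) == sorted(required)


# Hypothetical rule: a part needs a number, a name, and a supplier.
REQUIRED = ["number", "name", "supplier"]

in_order = ET.fromstring(
    "<part><number>P-100</number><name>Gasket</name><supplier>Acme</supplier></part>")
reordered = ET.fromstring(
    "<part><supplier>Acme</supplier><number>P-100</number><name>Gasket</name></part>")
incomplete = ET.fromstring(
    "<part><number>P-100</number><name>Gasket</name></part>")

print(satisfies_and_model(in_order, REQUIRED))    # sequence is free
print(satisfies_and_model(reordered, REQUIRED))   # so this passes too
print(satisfies_and_model(incomplete, REQUIRED))  # completeness is enforced
```

     The drawback is the one this paper keeps returning to: a check like this lives in supplementary code rather than in a declaration that travels with the data.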
     Another potential solution, and the one that we think is far more likely, is XML-Data. XML-Data was designed by ArborText, DataChannel, Inso, and Microsoft to replace and improve upon DTDs. These four companies jointly submitted their design to the W3C as a proposal for a formal specification. The W3C has not yet formally launched any activity related to XML-Data. Even so, there's a lot of interest in it.
     XML-Data prescribes the format of "schemas" for XML documents and data. Schemas are commonly used in database applications to specify the valid content of various fields and to indicate the relationships among fields and records. An XML-Data schema describes the rules for creating valid XML data for a specific application. XML-Data schemas include three key improvements over DTDs:
     
  • Content validation - XML-Data's most important feature is its support for validating content. In comparison with DTDs, which specify only whether an element is allowed, an XML-Data schema specifies how to validate the content of the element itself. For example, an element could be specified to be a number that falls between 0 and 99.
  • Inheritance - XML-Data provides a way for elements to inherit properties of other elements. In contrast with DTDs, where each element must be defined separately, an XML-Data schema allows the user to specify classes of elements. The bottom line is that inheritance makes schemas simpler to maintain and more modular than DTDs.
  • XML-encoded - An XML-Data schema is itself an XML document. That means that the same tools used for creating and processing all kinds of XML data could be readily used on XML-Data schemas. In contrast, DTDs have a highly specialized syntax that requires a unique set of editing and processing tools.
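     To make the content-validation point concrete, here is a small Python sketch of the "number between 0 and 99" rule. It does not use XML-Data syntax; a plain Python dictionary stands in for the schema, and the element name age, the RULES table, and the function validate_content are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for a schema: element name -> (type, minimum, maximum).
# This is NOT XML-Data syntax, only an illustration of the idea.
RULES = {"age": (int, 0, 99)}


def validate_content(root, rules):
    """Return a list of error strings for elements whose content
    breaks the type or range rule given for their tag."""
    errors = []
    for elem in root.iter():
        rule = rules.get(elem.tag)
        if rule is None:
            continue  # no content rule for this element
        cast, lowest, highest = rule
        text = (elem.text or "").strip()
        try:
            value = cast(text)
        except ValueError:
            errors.append(f"{elem.tag}: {text!r} is not a valid {cast.__name__}")
            continue
        if not lowest <= value <= highest:
            errors.append(f"{elem.tag}: {value} is outside {lowest}..{highest}")
    return errors


good = ET.fromstring("<patient><age>42</age></patient>")
bad = ET.fromstring("<patient><age>120</age></patient>")
print(validate_content(good, RULES))
print(validate_content(bad, RULES))
```

     A DTD would accept both documents, since each contains an allowed age element. Only a content rule can reject the second one.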

    XSL is more than style

     XSL is about more than style; it's about defining any behavior of an element. In other words, XSL lets you define what you want to do with that element. Do you want to run a database query? Bring up a dialog in Internet Explorer?
     XSL allows us to reorder text, suppress the display of text, and automatically generate calculated text.
     While XSL is fully compatible with CSS, the Cascading Style Sheet format for HTML documents supported by the Microsoft and Netscape browsers, it has many improvements over CSS. XSL has the capability to examine all ancestors, descendants, and siblings in order to establish context. CSS, by contrast, can set style based only on an element's ancestors.
     XSL's primary additional capabilities include:
     
  • Reordering of information so that it can be displayed or processed in a different order than it was authored. Let's use a patient's record as an example. A patient record includes a bunch of information including name, birthdate, address, insurance information, treatments received, allergies, current and past medications, and so on. If I am a billing clerk, I may be interested only in looking at the name, address, insurance information, and the latest unbilled procedure performed by the doctor. As the attending physician, I may be interested in age, weight, previous ailments, and allergies. With XSL, the view will be tailorable at the client end, by changing the sequence of the information displayed and which information is displayed based on the role of the person viewing the data.
  • Automatically generated text, which can be used both for fixed text (e.g., "Chapter" at the beginning of each chapter) and for numbering (e.g., chapters, sections, subsections, and footnotes).

     We expect Version 1.0 of the specification to be released this year, although there is no formal timetable that's been published by the W3C.
     XSL also forms the basis of a transformation language, so we expect to see applications emerge that convert information from one DTD to another.
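     Because the XSL syntax is not yet final, the following sketch uses plain Python rather than a stylesheet to show the three behaviors discussed above: reordering, suppression, and generated text. The patient record, the VIEWS table, and the render function are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record, authored once in a single order.
RECORD = """<patient>
  <name>Pat Example</name>
  <birthdate>1950-01-31</birthdate>
  <address>12 Sample Street</address>
  <insurance>Plan A</insurance>
  <allergies>penicillin</allergies>
  <procedure billed="yes">X-ray</procedure>
  <procedure billed="no">Blood test</procedure>
</patient>"""

# Each role lists the elements it wants, in the order it wants them.
VIEWS = {
    "billing": ["name", "address", "insurance", "procedure"],
    "physician": ["allergies", "name", "birthdate"],
}


def render(record_xml, role):
    """Return display lines for one role: reordered, filtered, labeled."""
    root = ET.fromstring(record_xml)
    lines = []
    for tag in VIEWS[role]:
        for elem in root.findall(tag):
            # Suppression: procedures that are already billed are hidden.
            if tag == "procedure" and elem.get("billed") == "yes":
                continue
            # Generated text: the label is not stored in the data.
            lines.append(f"{tag.capitalize()}: {elem.text}")
    return lines


for role in VIEWS:
    print(role, render(RECORD, role))
```

     The record itself never changes. Only the per-role description of what to show, and in what order, differs, which is the separation a stylesheet is meant to provide.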
     

    XLL - Linking

     The XML linking specification, XLL, is being designed to improve on HTML's existing URL linking while remaining compatible. XLL provides additional functionality that will make the Web easier to use and more functional.
     The primary capabilities that XLL will provide are bidirectional, conditional, and indirect linking.
     
  • Bidirectional linking means that I can link from one spot to another and come back; a single location can be the target of multiple bidirectional links.
  • Conditional linking is used, for example, in interactive electronic technical manuals (IETMs). In an IETM the target location is determined by your previous interaction with the data or by your skill level or security clearance.
  • Indirect links allow you to store the linking information in a separate, intermediate file. That means you can make changes to the links without changing the actual document, which improves access and revision controls.

     External links allow you to create links to and from a read-only document. Today, of course, you must be able to change a document in order to add links to it.
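     The idea of keeping links outside the documents they connect can be sketched in a few lines of Python. The link file below is a made-up format, not XLL syntax, and the file names, attribute names, and the build_index function are assumptions for the example. Because the links live in their own file, the linked documents can stay read-only, and each link can be followed in either direction.

```python
import xml.etree.ElementTree as ET

# Hypothetical link file kept apart from the documents it connects.
# The element and attribute names are illustrative, not XLL syntax.
LINKS = """<links>
  <link from="manual.xml#step3" to="parts.xml#p-100"/>
  <link from="bulletin.xml#b7" to="parts.xml#p-100"/>
</links>"""


def build_index(links_xml):
    """Return (forward, backward) lookup tables for the link file."""
    forward, backward = {}, {}
    for link in ET.fromstring(links_xml).iter("link"):
        source, target = link.get("from"), link.get("to")
        forward.setdefault(source, []).append(target)
        backward.setdefault(target, []).append(source)
    return forward, backward


forward, backward = build_index(LINKS)
print(forward["manual.xml#step3"])   # where does this step point?
print(backward["parts.xml#p-100"])   # what points at this part?
```

     Retargeting a link means editing one line of the link file; neither manual.xml nor parts.xml is touched.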
     

    XML, SGML, and HTML

     The diagram in Figure 1 helps explain where HTML, XML, and SGML fit into the traditional world of documents. There are two continuums, one representing the complexity of the structure and the second representing the complexity of the data. The information production products you develop will fall somewhere within the four quadrants illustrated.
     A novel is an example of a publication that has a very simple structure. The body of a novel generally only contains chapters and paragraphs. A phone book also has simple data, but it is highly structured. You have one section that contains last name followed by first name or initial followed by address followed by the phone number of residential customers and another section that contains information on commercial customers, etc. Structure helps you find the information you are seeking more quickly. A newspaper contains complex data, but like a novel, has minimal formal structure. In the upper right corner -- representing documents that are highly structured and contain very complex data -- are airline documentation (ATA2100), automotive documentation (J2008) and computer hardware and software documentation (DocBook). These documents contain warnings, cautions, assembly procedures, disassembly procedures, bill of material listings, part numbers, and so on. Consistent structure and reliable content are key factors for ensuring usability.
     Some people like to argue that "SGML is for everything" or "XML is for everything" or "HTML is for everything". We don't believe that's true. The ellipses represent the areas where we see these markup schemes fitting the best.
     

    Using XML beyond documents

     XML can be used in some applications that don't involve documents.
     First, consider the use of XML to exchange data between applications. Let's look at what Microsoft is planning. They have announced that Office 98 will use XML to store Office-specific data within HTML documents so that those HTML documents will "round trip" from Office to HTML back to Office. When today's version of Office saves a file as HTML, information is lost that cannot be regained when the file is loaded back into the Office application. But with XML elements preserving the Office-specific data, nothing will be lost on conversion to HTML.
     Then there's the use of XML for metadata, which can provide a standard way of categorizing, locating, and indexing files regardless of their content. For example, you could attach XML metadata to a Word document without having to convert the entire document to XML.
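     As a sketch of what such a metadata record might look like, the Python below builds a small XML description that sits alongside a Word file without touching it. The element names, the file path, and the describe function are hypothetical; they follow the metadata fields mentioned earlier in this paper (author, creation date, revision date) rather than any published format.

```python
import xml.etree.ElementTree as ET


def describe(path, author, created, revised):
    """Build a standalone XML metadata record about the file at path."""
    meta = ET.Element("metadata", {"about": path})
    ET.SubElement(meta, "author").text = author
    ET.SubElement(meta, "created").text = created
    ET.SubElement(meta, "revised").text = revised
    return meta


# The Word document stays in its native format; only this record is XML.
record = describe("reports/q3.doc", "N. Example", "1998-03-02", "1998-04-15")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

     An indexing tool could then read these records with any XML parser and never need to understand the Word file format at all.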
     Several efforts are either finished or well under way to establish XML-based metadata formats. For example, RDF (Resource Description Framework) is a W3C proposal to establish a standardized method of applying metadata to many types of content, which can be stored as any file type. RDF is expected to support applications such as indexing Internet or intranet sites, building site maps for navigation, and carrying fields for content ratings and push channel definitions.
     Channel Definition Format was designed specifically for push applications and has already been deployed. ICE is a standard set of XML elements that contain metadata information to support secure and reliable exchanges of content and transactions among independent websites. The objective behind ICE is to enable several companies to get together on the Web and create superstores of products or content.
     XML provides the crucial enabling technology to support the long-predicted explosion of Web-based electronic commerce. Examples include Open Financial Exchange (OFX) from Microsoft, which was designed for consumer financial transactions on the Web, and Open Trading Protocol (OTP), for purchases and sales over the Web. The OTP Consortium is quite large and includes leading companies such as AT&T, Hewlett-Packard, MasterCard, Hitachi, Royal Bank of Canada, CyberCash, Fujitsu, IBM, Netscape, Nokia, Oracle, Sun Microsystems and Wells Fargo.
     OSD ("Open Software Description") describes the delivery of software applications over the Internet. It allows Web developers to create application "channels" by defining versions, underlying structure, dependencies, relationships to other components, etc.
     CommerceNet is a non-profit consortium of companies seeking to use the Web for a broad array of business-to-business e-commerce applications. CNgroup is the R&D affiliate of CommerceNet that will also provide expertise in the use of XML for related technologies such as electronic catalogs.
     

    Is XML Easier?

     One of the questions we're hearing a lot these days is "Isn't XML easier than SGML? Because if it is, why wouldn't I use XML and forget about SGML?" Let's review both of those questions.
     XML is certainly easier than SGML to deliver over the Web. Since Microsoft already supports XML and Netscape has promised support later this year, you already have tools that can almost effortlessly deliver rich structure and content to the desktop. Delivering SGML data over the Web today relies on tools that are expensive and outside the mainstream.
     We expect that XSL, the stylesheet standard for XML, will make it possible to exchange stylesheets between applications. Eventually, you'll be able to build one stylesheet and use it across multiple tools for multiple deliveries.
     Maybe XML is easier than SGML for building tools. Certainly, some tools that support XML will be easier to build. If you're a software developer and you want to use XML as a data interchange format, you'll be able to find a freely available parser that will examine an XML data stream. Now, since you can get a freely available SGML parser just as easily as an XML parser, you may wonder why it really matters. And the answer is that for any application where a freeware parser is sufficient, the only real difference is code size and speed. An SGML parser is a lot bigger and a little slower. But many application developers, especially those who are working on non-document applications, prefer to write their own parser. And that's way too big a job with SGML.
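     To give a feel for how little code a data interchange application needs once a parser is available, here is a sketch that uses Python's standard xml.parsers.expat module, a binding to the freely available Expat parser. The orders stream, its element names, and the read_orders function are invented for this example.

```python
import xml.parsers.expat

# Hypothetical data stream exchanged between two applications.
STREAM = """<orders>
  <order id="1"><part>P-100</part><quantity>4</quantity></order>
  <order id="2"><part>P-200</part><quantity>1</quantity></order>
</orders>"""


def read_orders(xml_text):
    """Collect one dict per <order> using event callbacks."""
    orders, current, buffer = [], {}, []

    def start(name, attrs):
        if name == "order":
            current.clear()
            current["id"] = attrs.get("id")
        buffer.clear()  # drop whitespace seen before this element

    def chars(data):
        buffer.append(data)

    def end(name):
        if name in ("part", "quantity"):
            current[name] = "".join(buffer).strip()
        elif name == "order":
            orders.append(dict(current))
        buffer.clear()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)  # True marks the end of the input
    return orders


print(read_orders(STREAM))
```

     The application supplies three small callbacks and the parser does the rest. Nothing in this code reads or interprets a DTD.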
     XML might be easier than SGML for "personal" use, where the rigors of analysis aren't required. For example, small websites might easily be developed and maintained in XML without ever creating a DTD.
     But let's look at one way that XML is not easier than SGML: most of you are aiming to build a database of modular document components that you can easily reuse, interchange, and automate. For those kinds of applications, you'll still need to perform all of the up-front requirements analysis as well as the rigid enforcement of rules to ensure an absolutely consistent data format. In other words, if you decide that you want not just well-formed but also "well-structured" (or "valid") XML, you must still make the investment to achieve it. You must still make sure that you have valid information.
     

    XML documents are data

     One of the most fascinating opportunities for using XML on the Web is hybrid applications that cross the boundaries of documents and data. These applications will bring static Web documents to life in a way that existing technologies do not.
     For example, consider being able to develop an interactive parts catalog that lets users select parts from a picture, sort the parts based on type or location, check the inventory levels of selected parts, and enter purchase orders for parts, all from the same interface.
     Another example is an electronic service manual that knows what step you're on, can automatically branch to different steps based on your answer to a question, allows you to enter questions or corrections about each step, and records the time you spent on each step.
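     A service manual of this kind can be sketched as XML data plus a few lines of application code. In the Python below, the manual markup (step, branch, goto) and the walk function are hypothetical, chosen only to show how the document itself can carry the branching logic.

```python
import xml.etree.ElementTree as ET

# Hypothetical manual: each step asks a question and names the next step.
MANUAL = """<procedure>
  <step id="s1" question="Does the engine start?">
    <branch answer="yes" goto="s3"/>
    <branch answer="no" goto="s2"/>
  </step>
  <step id="s2" question="Is the battery charged?">
    <branch answer="yes" goto="s3"/>
    <branch answer="no" goto="done"/>
  </step>
  <step id="s3" question="Is the warning light off?">
    <branch answer="yes" goto="done"/>
    <branch answer="no" goto="s2"/>
  </step>
</procedure>"""


def walk(manual_xml, answers, start="s1"):
    """Follow the manual using the given answers.
    Returns (steps visited, identifier reached at the end)."""
    steps = {s.get("id"): s for s in ET.fromstring(manual_xml).iter("step")}
    visited, current = [], start
    for answer in answers:
        if current not in steps:
            break  # reached an end marker such as "done"
        visited.append(current)
        targets = {b.get("answer"): b.get("goto")
                   for b in steps[current].iter("branch")}
        current = targets[answer]
    return visited, current


print(walk(MANUAL, ["no", "yes", "yes"]))
```

     The list of visited steps is also a natural place to hang the extras described above, such as the time spent on each step or a technician's corrections.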
     An interesting side effect of these hybrid applications is that those responsible for developing interactive content will find themselves in the application development business and not just in the content development business.
     

    XML today

     XML is clearly on the way to becoming the mainstream technology of the Web for a broad array of applications.
     Today, XML itself is approved, and we're still waiting for the related standards (XSL, XLL, the DOM, and XML-Data) to reach that same level. Nonetheless, you can move forward with XML right now and transition later to the additional capabilities that the remaining standards will provide. In the meantime, tools are already available to help you take advantage of the enormous potential of XML today.
