
Halpern-Hamu
Incremental Development
 

Using an XML Audit to Move SGML Data towards XML

 
Charlie  Halpern-Hamu, Ph.D.
Structured Information Consultant,  Incremental Development, Inc. 
 560 Ontario Street
Toronto  Ontario (Canada) M4X 1M7 
Email: hamu@pathcom.com

Biographical notice

Charlie Halpern-Hamu completed his doctorate in Computer Science at the University of Toronto. He has published papers in the areas of denotational semantics, programming-language design tools and graphical control of robots by the disabled. He has been a structured-information consultant for seven years.

 

Abstract

 This paper describes, at a technical level, how to assess the XML-readiness of your SGML data as a first step towards moving it towards XML.
 This paper suggests an 'XML audit': a technical review of current markup practice with an eye towards simplification. The goal of an XML audit is to understand which portions of your current SGML application are not XML. The next step might be to start de-emphasizing your use of those features.
 Moving all the way to XML allows you to use XML tools that do not support full SGML. Even getting part way there means you can use a wider variety of SGML tools. In either case you will be simplifying work for both editorial and programming staff. Simpler is better.
 This paper is derived from James Clark's 'Comparison of SGML and XML', a World Wide Web Consortium Note (www.w3.org/TR/NOTE-sgml-xml-971215 by jjc@jclark.com).
 

Introduction

 This talk describes, at a technical level, how to assess the XML-readiness of your SGML data as a first step towards moving it towards XML.

XML Audit

 This talk introduces the concept of an 'XML audit': a review of current markup practice with an eye towards simplification. An XML audit lets you know where you stand. Your next step might be to de-emphasize those SGML features that are not XML.
 

Motivation

 Moving all the way to XML allows you to use XML tools that do not support full SGML. Even getting part way there means you can use a wider variety of SGML tools. In either case you will be simplifying work for both editorial and programming staff. This simplification may result in reduced training requirements, less confusion and fewer errors.
 But even if you choose to make no immediate change to your markup practices, an XML audit will give you valuable information that will help inform future decisions. You may discover that, give or take an angle-bracket or two, you are already doing XML.
 

Notes on Style

 All discussion assumes the reference concrete syntax. So I will say 'left angle-bracket' or '<', but not 'start-tag open delimiter' or 'STAGO'. Similarly, I say 'white space' instead of 'separator'.
 Where SGML and XML vary slightly in their nomenclature, I tend towards the SGML, since that's our starting point. Or I fall back towards spelling things out using the reference concrete syntax as described above.
 I use the term 'URL' ('uniform resource locator') where the XML standard uses the term 'URI' ('uniform resource identifier'). The expectation is that the URL standard will be updated to define 'URI'. Until then, speaking of URIs is getting a bit ahead of ourselves.
 

Acknowledgments

 This paper is derived from James Clark's 'Comparison of SGML and XML', a World Wide Web Consortium Note (www.w3.org/TR/NOTE-sgml-xml-971215 by jjc@jclark.com).
 Clark's Note discusses XML options not available in SGML. This paper ignores these, only discussing those SGML options that are not available in XML.
 In this paper, and to an even greater degree in the corresponding presentation, I've tried to give more prominence to the more commonly-used SGML features that are missing in XML.
 I'd like to thank Larry Sulky for his copy edit. The only suggestion I didn't take was to change 'a journey of a thousand miles' to 'a journey of sixteen-hundred kilometres'.

How to Conduct an XML Audit

 The key idea in conducting an XML audit is resisting the temptation to do more than simply review where you stand.
 

Who Should Attend

 You need a selection of technical people: someone who knows the DTD, someone who knows editorial tagging practices, someone who knows about the programs that operate on the data as it flows in, through, and out of the organization.
 An XML audit is for figuring out where you are, not where you are going. Consequently, you don't need managerial or technical decision-makers at the meeting. They will want to understand and act on the final assessment.
 

How to Prepare

 Make printouts of this paper, your SGML declaration(s), DTD(s), some sample data, and programs that act on this data. Distribute these items in advance to your attendees. Each attendee should review these items, especially those about which she is the designated expert. So the data architect should focus on the DTDs, the programmer the programs, etc. Ask attendees to note those aspects of your current SGML that are not XML, perhaps in the margins of this paper.
 

How to Proceed

 Designate one person as the note-taker. As with the individual preparation step, it may be convenient to use a copy of this paper as a note-taking template. Move systematically through the headings in this paper and determine if they apply to your application.
 Postpone discussions about how to recast SGML usage as simpler XML usage. Focus on simply listing those aspects of your SGML usage that go beyond XML. When you do find non-XML usages, include details of where. Do you have one use of the '&' connector or a dozen? Which elements? Try not to worry about why you use this aspect of SGML or how you might avoid it.
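 Counting of this kind can be done mechanically. The following Python sketch is only an illustration, not part of the audit procedure itself: the function name and the sample DTD are hypothetical, and it assumes one element name per declaration, no '>' inside a declaration, and that parameter entity references have already been expanded.

```python
import re

# Matches simple element declarations: <!ELEMENT name content-model>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+([^>]*)>")

def and_connector_uses(dtd_text):
    """Return {element name: number of '&' connectors in its content model}."""
    uses = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        count = model.count("&")
        if count:
            uses[name] = count
    return uses

# Hypothetical sample DTD fragment, for illustration only.
sample = """
<!ELEMENT contact (phone & fax & email)>
<!ELEMENT phone   (#PCDATA)>
"""
print(and_connector_uses(sample))
```

 The note-taker can then record the element names and counts without the meeting drifting into how to fix them.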
 

Results

 The result of an XML audit should be an assessment report. Transcribe your notes into a complete list of the non-XML things you do. The next step will be to decide if it makes sense to change all or some of your markup practices.

Stupid SGML Tricks

 Those aspects of SGML that are not available in XML are listed in the sections that follow. The following organization has been used:
  • The Big Three
      • Elements
      • Attributes
      • Entities
  • Out of Band
      • Comments
      • Marked Sections
      • Processing Instructions
  • Miscellaneous
      • Characters
      • Minimization
      • Other
      • Obscure

    The Big Three: Elements

     

    Declaring Several Elements at Once

     You can not declare several element types with the same declaration:
     
    
        <!ELEMENT (isnt-xml | isnt-xml2) (#PCDATA | em)*>
    
     This habit makes finding element type declarations in a DTD more difficult. A better practice might be to use a parameter entity for the common content model:
     
    
        <!ENTITY % inline '(#PCDATA | em)*'>
        <!ELEMENT okay-xml  %inline;>
        <!ELEMENT okay-xml2 %inline;>
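
     If a DTD contains many such grouped declarations, splitting them by hand is tedious. A minimal Python sketch of the mechanical part follows; the function name is hypothetical, and it assumes the name group contains no parameter entity references and the declaration contains no '>' before its end. It repeats the content model rather than introducing a parameter entity.

```python
import re

# Matches <!ELEMENT (name | name ...) rest-of-declaration>
GROUP_DECL = re.compile(r"<!ELEMENT\s+\(([^)]*)\)\s+([^>]*)>")

def split_group_declarations(dtd_text):
    """Rewrite each name-group element declaration as one declaration per name."""
    def expand(match):
        # SGML allows '|', ',' or '&' between the names in a name group.
        names = [n.strip() for n in re.split(r"[|,&]", match.group(1))]
        rest = match.group(2)
        return "\n".join(f"<!ELEMENT {name} {rest}>" for name in names)
    return GROUP_DECL.sub(expand, dtd_text)

print(split_group_declarations(
    "<!ELEMENT (isnt-xml | isnt-xml2) (#PCDATA | em)*>"))
```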
    
     

    Specifying Minimization

     You can not specify minimization in XML element declarations:
     
    
        <!ELEMENT isnt-xml - - (#PCDATA | em)*>
    
     If you are not using OMITTAG, you can leave this out of your SGML:
     
    
        <!ELEMENT okay-xml (#PCDATA | em)*>
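
     Removing the minimization parameters is a purely textual change. The sketch below is one hedged way to do it in Python; the function name is hypothetical, and it assumes one element name per declaration. It is only appropriate when no document actually relies on omitted tags.

```python
import re

# The two minimization parameters ('-' or 'O') follow the element name.
MINIMIZATION = re.compile(r"(<!ELEMENT\s+\S+\s+)[-Oo]\s+[-Oo]\s+")

def strip_minimization(dtd_text):
    """Drop the start-tag and end-tag minimization parameters."""
    return MINIMIZATION.sub(r"\1", dtd_text)

print(strip_minimization("<!ELEMENT isnt-xml - - (#PCDATA | em)*>"))
```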
    
     
    RCDATA and CDATA

     You can not declare content to be RCDATA:
     
    
        <!ELEMENT isnt-xml RCDATA>
    
     You can not declare content to be CDATA:
     
    
        <!ELEMENT isnt-xml CDATA>
    
     
    The '&' Connector

     You can not use the '&' connector:
     
    
        <!ELEMENT isnt-xml (phone & fax & email)>
    
     If the random order is important to you, you can recast short lists by listing all the possible orders, avoiding SGML-ambiguous content models by factoring out commonalities:
     
    
        <!ELEMENT okay-xml ( (phone, ((fax, email) | (email, fax)))
                           | (fax, ((phone, email) | (email, phone)))
                           | (email, ((phone, fax) | (fax, phone))) )>
    
     If you can enforce an order, do so:
     
    
        <!ELEMENT okay-xml (phone, fax, email)>
    
     If you can't enforce an order, but your list is too long to recast without the & connector, you may need to loosen your content model:
     
    
        <!ELEMENT okay-xml (phone | fax | email)+>
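
     The factored listing of all orders can be generated rather than typed. The following Python sketch (the function name is hypothetical) builds such a content model recursively: each branch starts with a different element, so no two branches share a first element. The result grows factorially with the number of elements, which is why this only suits short lists.

```python
def all_orders(names):
    """Build a factored content model that allows the names in any order."""
    if len(names) == 1:
        return names[0]
    branches = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        # Each branch begins with a distinct element, then recurses on the rest.
        branches.append(f"({first}, {all_orders(rest)})")
    return "(" + " | ".join(branches) + ")"

print("<!ELEMENT okay-xml " + all_orders(["phone", "fax", "email"]) + ">")
```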
    
     

    Mixed Content

     You can not have deprecated mixed content in XML:
     
    
        <!ELEMENT isnt-xml (em | #PCDATA)>
    
     Indeed, the rules are stricter even than just avoiding deprecated mixed content:
     
    
        <!ELEMENT isnt-xml  (em | #PCDATA)*>
    
     In a mixed content model, the #PCDATA must be listed first, the only connector permitted is '|', the only occurrence indicator permitted is '*', and the '*' is required whenever there is a '|':
     
    
        <!ELEMENT okay-xml  (#PCDATA | em)*>
        <!ELEMENT okay-xml2 (#PCDATA)>
        <!ELEMENT okay-xml3 (#PCDATA)*>
    
     

    Inclusion Exceptions

     XML does not allow inclusions on content models:
     
    
        <!ENTITY % text    '(#PCDATA)'>
        <!ELEMENT isnt-xml (heading | para)* +(warning)>
        <!ELEMENT heading  %text;>
        <!ELEMENT para     %text;>
        <!ELEMENT warning  %text;>
    
     Element types declared using inclusions are often far looser than they need to be. Usually they can be recast using other mechanisms:
     
    
        <!ENTITY % text    '(#PCDATA | warning)*'>
        <!ELEMENT okay-xml (heading | para | warning)*>
        <!ELEMENT heading  %text;>
        <!ELEMENT para     %text;>
        <!ELEMENT warning  %text;>
    
     

    Exclusion Exceptions

     XML does not allow exclusions on content models:
     
    
        <!ENTITY % text    '(#PCDATA | em | etc | isnt-xml)*'>
        <!ELEMENT document (heading | para | isnt-xml)*>
        <!ELEMENT heading  %text;>
        <!ELEMENT para     %text;>
        <!ELEMENT em       %text;>
        <!ELEMENT etc      %text;>
        <!ELEMENT isnt-xml %text; -(isnt-xml)>
    
     Sometimes exclusions can be recast using other mechanisms:
     
    
        <!ENTITY % text    '(#PCDATA | em | etc | okay-xml)*'>
        <!ELEMENT document (heading | para | okay-xml)*>
        <!ELEMENT heading  %text;>
        <!ELEMENT para     %text;>
        <!ELEMENT em       %text;>
        <!ELEMENT etc      %text;>
        <!ELEMENT okay-xml (#PCDATA | em | etc)*>
    
     Other times, the easiest way to move to XML is to simply remove the exclusion, leaving the content model somewhat looser than it was.
     

    Empty Elements

     XML uses a special syntax for empty elements:
     
    
        <toc/>
        <toc depth='2'/>
    
     XML also allows empty elements to have end tags:
     
    
        <toc></toc>
        <toc depth='2'></toc>
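
     To an XML parser the two forms are equivalent. As a quick check, the following sketch uses Python's standard xml.etree.ElementTree module to parse both forms and compare the results; the helper name same_element is hypothetical.

```python
import xml.etree.ElementTree as ET

def same_element(a, b):
    """Compare two single-element documents by name, attributes and content."""
    ea, eb = ET.fromstring(a), ET.fromstring(b)
    return (ea.tag == eb.tag
            and ea.attrib == eb.attrib
            and (ea.text or "") == (eb.text or "")
            and len(ea) == len(eb))

print(same_element("<toc depth='2'/>", "<toc depth='2'></toc>"))  # True
```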
    
     You should note which elements you declare as empty:
     
    
        <!ELEMENT toc EMPTY>
        <!ATTLIST toc depth CDATA #IMPLIED>
    
     Here's one way to make the transition. This element declaration is looser than intended, but is both SGML and XML:
     
    
        <!ELEMENT toc (#PCDATA)> <!--Should be EMPTY.-->
    
     In both SGML and XML the declaration above allows the markup below:
     
    
        <toc></toc>
    
     If you have a very small number of such elements, you might consider if they could be recast as container elements or perhaps as attributes on other elements. Those DTDs that do not feature empty elements avoid a major area of incompatibility between XML and SGML as it is usually used.
     Here's one way to change your SGML declaration so that it allows XML-style markup for empty elements:
     
    
        DELIM GENERAL SGMLREF
                      NET '/>'
    

    The Big Three: Attributes

     

    Attribute Declarations for Multiple Elements

     You can only declare attributes for one element type at a time:
     
    
        <!ATTLIST (isnt-xml | isnt-xml2) attrib CDATA #IMPLIED>
    
     XML requires that this be split into one ATTLIST declaration per element type:
     
    
        <!ATTLIST okay-xml  attrib CDATA #IMPLIED>
        <!ATTLIST okay-xml2 attrib CDATA #IMPLIED>
    
     If removing the redundancy is important, this can be done using a parameter entity:
     
    
        <!ENTITY % attribute 'attrib CDATA #IMPLIED'>
        <!ATTLIST okay-xml  %attribute;>
        <!ATTLIST okay-xml2 %attribute;>
    
     

    Declared Values for Attributes

     XML does not include some declared values for attributes that can be used in SGML. Substituting other declared values may have little or no negative effect on your SGML environment while moving you one step closer to XML.
     The following declared values are not allowed:
     
    
        <!ATTLIST isnt-xml  attrib NAME     #IMPLIED>
        <!ATTLIST isnt-xml2 attrib NAMES    #IMPLIED>
        <!ATTLIST isnt-xml3 attrib NUMBER   #IMPLIED>
        <!ATTLIST isnt-xml4 attrib NUMBERS  #IMPLIED>
        <!ATTLIST isnt-xml5 attrib NUTOKEN  #IMPLIED>
        <!ATTLIST isnt-xml6 attrib NUTOKENS #IMPLIED>
    
     The following are allowed:
     
    
        <!ATTLIST okay-xml  attrib CDATA    #IMPLIED>
        <!ATTLIST okay-xml  attrib ENTITY   #IMPLIED>
        <!ATTLIST okay-xml  attrib ENTITIES #IMPLIED>
        <!ATTLIST okay-xml  attrib ID       #IMPLIED>
        <!ATTLIST okay-xml  attrib IDREF    #IMPLIED>
        <!ATTLIST okay-xml  attrib IDREFS   #IMPLIED>
        <!ATTLIST okay-xml  attrib NMTOKEN  #IMPLIED>
        <!ATTLIST okay-xml  attrib NMTOKENS #IMPLIED>
        <!ATTLIST okay-xml  attrib (this | that) #IMPLIED>
        <!ATTLIST okay-xml  attrib NOTATION (jpeg | tiff) #IMPLIED>
    
     When you enumerate a list of options using a name token group, you must use the or-bar between them (SGML allows you to use the or-bar or comma interchangeably):
     
    
        <!ATTLIST isnt-xml  attrib (red, green, blue) #IMPLIED>
        <!ATTLIST okay-xml  attrib (red | green | blue) #IMPLIED>
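
     During an audit, declared values can be checked mechanically. The Python sketch below is a rough illustration with hypothetical names; it assumes each ATTLIST declaration names a single element and looks only at the first attribute defined in it. It flags the SGML-only declared values and comma-separated name token groups.

```python
import re

# Declared values that SGML allows but XML does not.
NON_XML_TYPES = {"NAME", "NAMES", "NUMBER", "NUMBERS", "NUTOKEN", "NUTOKENS"}

# Captures element name, attribute name, and declared value (keyword or group).
ATTLIST_DECL = re.compile(r"<!ATTLIST\s+(\S+)\s+(\S+)\s+(\([^)]*\)|\S+)")

def audit_declared_values(dtd_text):
    """Return (element, attribute, declared value) triples that are not XML."""
    findings = []
    for element, attribute, declared in ATTLIST_DECL.findall(dtd_text):
        if declared.upper() in NON_XML_TYPES:
            findings.append((element, attribute, declared))
        elif declared.startswith("(") and "," in declared:
            findings.append((element, attribute, declared))
    return findings

# Hypothetical sample declarations, for illustration only.
sample = """
<!ATTLIST a size  NUMBER #IMPLIED>
<!ATTLIST b color (red, green, blue) #IMPLIED>
<!ATTLIST c ref   IDREF #IMPLIED>
"""
for finding in audit_declared_values(sample):
    print(finding)
```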
    
     

    Default Values for Attributes

     These two default value declarations are not allowed in XML:
     
    
        <!ATTLIST isnt-xml  attrib CDATA #CURRENT>
        <!ATTLIST isnt-xml2 attrib CDATA #CONREF>
    
     These four default value declarations are allowed:
     
    
        <!ATTLIST okay-xml  attrib CDATA #FIXED "only value">
        <!ATTLIST okay-xml2 attrib CDATA "default value">
        <!ATTLIST okay-xml3 attrib CDATA #REQUIRED>
        <!ATTLIST okay-xml4 attrib CDATA #IMPLIED>
    
     Default values must be enclosed in quote marks:
     
    
        <!ATTLIST isnt-xml  attrib (this | that) this>
        <!ATTLIST okay-xml  attrib (this | that) "this">
        <!ATTLIST okay-xml2 attrib (this | that) 'this'>
    
     

    Attribute Value Specification

     You must use an attribute value literal, not an attribute value, in an attribute value specification. In other words, you must use quote marks when specifying an attribute value:
     
    
        <isnt-xml  attrib=this>...</isnt-xml>
        <okay-xml  attrib="this">...</okay-xml>
        <okay-xml2 attrib='this'>...</okay-xml2>
    
     You must always spell out the attribute name; you can't imply it by using a name value:
     
    
        <isnt-xml "red">...</isnt-xml>
        <okay-xml color="red">...</okay-xml>
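
     Adding the missing quote marks is usually a mechanical change. A minimal Python sketch follows, with a hypothetical function name; it is meant to be applied to one start tag at a time and assumes that no already-quoted value contains text that itself looks like name=value. It does not supply omitted attribute names, which requires consulting the DTD.

```python
import re

# An attribute name, '=', then a value that does not start with a quote mark.
UNQUOTED = re.compile(r"""(\s[A-Za-z][\w.-]*\s*=\s*)([^\s"'>]+)""")

def quote_attribute_values(tag):
    """Wrap unquoted attribute values in a start tag in double quotes."""
    return UNQUOTED.sub(r'\1"\2"', tag)

print(quote_attribute_values("<isnt-xml  attrib=this>"))
```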
    
     

    Data Attributes

     You can't use data attributes:
     
    
        <!NOTATION mpeg SYSTEM "mpgview.exe">
        <!ATTLIST #NOTATION mpeg isnt-xml (v2 | v3) #REQUIRED>
        <!ENTITY movie-a SYSTEM "movie-a.mpg" NDATA mpeg [isnt-xml="v2"]>
        <!ENTITY movie-b SYSTEM "movie-b.mpg" NDATA mpeg [isnt-xml="v3"]>
    
     In some cases, the way to make this XML might be to expand your list of notations:
     
    
        <!NOTATION mpeg2 SYSTEM "mpgview2.exe">
        <!NOTATION mpeg3 SYSTEM "mpgview3.exe">
        <!ENTITY movie-a SYSTEM "movie-a.mpg" NDATA mpeg2>
        <!ENTITY movie-b SYSTEM "movie-b.mpg" NDATA mpeg3>
    

    The Big Three: Entities

     XML places various restrictions on entity declarations and entity references.
     
    Internal Entities

     You can't use data text internal entities (CDATA, SDATA or PI):
     
    
        <!ENTITY isnt-xml  CDATA "text">
        <!ENTITY isnt-xml2 SDATA "[adjust me]">
        <!ENTITY isnt-xml3 PI    "BRS ..YEAR">
    
     You can't use bracketed text internal entities (STARTTAG, ENDTAG, MS and MD):
     
    
        <!ENTITY isnt-xml4 STARTTAG "gi">
        <!ENTITY isnt-xml5 ENDTAG   "gi">
        <!ENTITY isnt-xml6 MS       "CDATA[text">
        <!ENTITY isnt-xml7 MD       "--comment--">
    
     Only the simplest form is allowed for internal entities:
     
    
        <!ENTITY okay-xml    "text">
        <!ENTITY okay-xml2   "[adjust me]">
        <!ENTITY okay-xml3   "<?BRS ..YEAR?>">
        <!ENTITY still-isnt-xml4 "<gi>">
        <!ENTITY still-isnt-xml5 "</gi>">
        <!ENTITY okay-xml4-5 "<gi></gi>">
        <!ENTITY okay-xml6   "<![CDATA[text]]>">
        <!ENTITY okay-xml7   "<!--comment-->">
    
     Examples 4 and 5 are explained under 'Synchronicity', below.
     
    External Entities

     You can't use SUBDOC, CDATA or SDATA external entities:
     
    
        <!ENTITY isnt-xml  SYSTEM "url" SUBDOC>
        <!ENTITY isnt-xml2 SYSTEM "url" CDATA mpeg>
        <!ENTITY isnt-xml3 SYSTEM "url" SDATA mpeg>
    
     External entities can have no entity type specified, or have NDATA specified:
     
    
        <!ENTITY okay-xml  SYSTEM "url">
        <!ENTITY okay-xml2 SYSTEM "url" NDATA mpeg>
    
     
    PUBLIC Identifiers

     The FORMAL feature allows you to use what are called 'formal public identifiers' to name entities such as portions of DTDs and portions of documents. XML allows public identifiers, but requires that they be followed by a system identifier to use in case the public identifier can not be resolved. External entities are identified in ENTITY declarations:
     
    
        <!ENTITY isnt-xml PUBLIC "-//Example//Entity Example//EN">
        <!ENTITY okay-xml PUBLIC "-//Example//Entity Example//EN"
                                 "../examples/example.ent">
    
     External entities are also identified in DOCTYPE declarations:
     
    
        <!DOCTYPE isnt-xml PUBLIC "-//Example//DTD Example//EN">
        <!DOCTYPE okay-xml PUBLIC "-//Example//DTD Example//EN"
                                  "http://www.example.org/example.dtd">
    
     The exception to this rule is the NOTATION declaration, which does not require a system identifier:
     
    
        <!NOTATION okay-xml  PUBLIC "ISO/IEC 10918:1993//NOTATION Digital
        Compression and Coding of Continuous-tone Still Images (JPEG)//EN">
        <!NOTATION okay-xml2 PUBLIC "ISO/IEC 10918:1993//NOTATION Digital
        Compression and Coding of Continuous-tone Still Images (JPEG)//EN"
                                    "jpegview.exe">
    
     
    SYSTEM Identifier

     When you use SYSTEM identifiers for external entities, these identifiers must be URLs:
     
    
        <!DOCTYPE  okay-xml  SYSTEM "example.dtd">
        <!NOTATION okay-xml  SYSTEM "http://www.example.org/example.not">
    
     
    Omitted System Identifier

     In SGML, you can omit the system identifier after the SYSTEM keyword:
     
    
        <!ENTITY isnt-xml SYSTEM>
    
     In XML, you must always include it:
     
    
        <!ENTITY okay-xml SYSTEM "example.ent">
    
     

    Default Entity

     You can not declare a default entity:
     
    
        <!ENTITY #DEFAULT "[isnt-xml]">
    
     

    Semicolon

     You can't leave the final semicolon off entity references, as SGML allows you to do in certain contexts:
     
    
        <isnt-xml>R&eacute;sum&eacute</isnt-xml>
        <okay-xml>R&eacute;sum&eacute;</okay-xml>
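
     References that lack the semicolon can be found with a simple search. The Python sketch below (hypothetical function name) reports entity and numeric character references that are not followed by a semicolon. It will also report a bare ampersand followed by a letter, such as in 'AT&T', which is equally a problem for XML.

```python
import re

# '&name' or '&#digits' that is not continued by a name character or ';'.
UNTERMINATED = re.compile(
    r"&(?:#[0-9]+|[A-Za-z][A-Za-z0-9.-]*)(?![A-Za-z0-9.;-])")

def unterminated_references(text):
    """Return the references in text that are missing their final semicolon."""
    return UNTERMINATED.findall(text)

print(unterminated_references("<isnt-xml>R&eacute;sum&eacute</isnt-xml>"))
```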
    
     

    Synchronicity

     SGML's deprecated obfuscatory entity references are disallowed in XML. Elements and marked sections need to start and end in the same entity.
     Generally, everything needs to be balanced inside of each entity. This is important because it allows you to choose not to expand entities in certain contexts while still maintaining a balanced structure.
     

    External Entities in Attributes

     XML does not allow references to external entities in attribute literals:
     
    
        <!ENTITY external SYSTEM "file.txt">
        <!ENTITY internal "text">
        ...
        <isnt-xml attrib="&external;">
        <okay-xml attrib="&internal;">
    
     

    References to External Data Entities in Content

     You can refer to external data entities in content; but non-validating parsers are not required to include that entity. They may merely choose to note that they saw the reference and go on.
     

    Parameter Entities

     In a separate DTD file (the 'external subset'), parameter entities are allowed to appear inside of markup declarations. But in the internal subset of the DTD in an XML document, they can only appear where a whole markup declaration would be allowed:
     
    
        <!DOCTYPE document SYSTEM "document.dtd" [
        <!ENTITY % isnt-xml "p">
        <!ELEMENT %isnt-xml; (#PCDATA)>
        <!ENTITY % okay-xml SYSTEM "fragment.dtd">
        %okay-xml;
        ]>
    

    Out of Band: Comments

     XML restricts the variation in syntax and location of comments that SGML allows.
     A typical SGML comment looks like this:
     
    
        <!--Okay XML.-->
    
     The '<!' and '>' are called the comment declaration, and the '--...--' is the comment proper.
     

    Comments in Other Declarations

     You can't slip comments into other declarations. So this is not allowed:
     
    
        <!ELEMENT p (#PCDATA | em)* --Isn't XML.-->
    
     

    Empty Comments

     You must have exactly one comment inside of each comment declaration. You are not allowed zero:
     
    
        <isnt-xml><!></isnt-xml>
        <okay-xml><!----></okay-xml>
    
     

    Multiple Comments

     And you are not allowed more than one:
     
    
        <!--Isn't XML.-- --Isn't XML.-->
        <!--Okay XML.- - - -Okay XML.-->
    
     
    Extra White Space

     Finally, the second '--' and the '>' must run together, so this is not allowed:
     
    
        <!--Isn't XML.-- >
        <!-- Okay XML. -->
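
     The comment rules in this section can be summarized as a single pattern: '<!--', then text in which no two hyphens are adjacent and which does not end in a hyphen, then '-->'. The following Python sketch, with a hypothetical function name, tests one comment declaration at a time against that pattern. It does not detect comments embedded in other declarations.

```python
import re

# Each '-' inside the comment must be followed by a non-'-' character.
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def is_xml_comment(declaration):
    """True if the comment declaration is also a well-formed XML comment."""
    return XML_COMMENT.fullmatch(declaration) is not None

for text in ["<!--Okay XML.-->", "<!>", "<!--Isn't XML.-- >"]:
    print(text, is_xml_comment(text))
```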
    
    Out of Band: Marked Sections

     XML severely restricts the usage of SGML's marked sections. The only type of marked section allowed is a CDATA marked section.
     

    Empty Status Keyword Specification

     SGML allows a marked section to have an empty status keyword specification. XML does not allow this:
     
    
        <![[isnt-xml]]>
    
     
    TEMP

     You can't use TEMP in a status keyword specification:
     
    
        <![ TEMP [isnt-xml]]>
    
     
    RCDATA

     You can't use RCDATA marked sections:
     
    
        <![ RCDATA [isnt-xml]]>
    
     
    INCLUDE and IGNORE

     You can't use INCLUDE or IGNORE marked sections in the document instance, but only in the DTD (and not in the internal subset of the DTD):
     
    
        <!DOCTYPE document SYSTEM "document.dtd" [
        <![ IGNORE [isnt-xml]]>
        ]>
        <document><![ IGNORE [isnt-xml]]></document>
    
     

    Multiple Keywords

     You can't use more than one status keyword in a single marked section:
     
    
        <![ INCLUDE CDATA [isnt-xml]]>
    
     

    Parameter Entities

     You can't use parameter entities to specify status keywords.
     
    
        <![ %maybe; [isnt-xml]]>
    
     
    No Separators for CDATA Sections

     You aren't allowed any white space around the word 'CDATA' in a CDATA marked section start:
     
    
        <![CDATA [isnt-xml]]>
        <![ CDATA[isnt-xml]]>
        <![ CDATA [isnt-xml]]>
        <![CDATA[okay-xml]]>
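
     For a document instance, all of the marked section restrictions above reduce to one test: every marked section start must be exactly '<![CDATA['. The Python sketch below (hypothetical function name) lists the starts that fail this test. It assumes it is run over document instances rather than DTDs, and that the content of a CDATA section does not itself contain '<!['.

```python
import re

# A marked section start: '<![', a status keyword specification, then '['.
MARKED_SECTION_START = re.compile(r"<!\[([^\[]*)\[")

def non_xml_marked_sections(text):
    """Return every marked section start that is not exactly '<![CDATA['."""
    return [m.group(0)
            for m in MARKED_SECTION_START.finditer(text)
            if m.group(0) != "<![CDATA["]

# The four examples from this section.
sample = "\n".join([
    "<![CDATA [isnt-xml]]>",
    "<![ CDATA[isnt-xml]]>",
    "<![ CDATA [isnt-xml]]>",
    "<![CDATA[okay-xml]]>",
])
print(non_xml_marked_sections(sample))
```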
    

    Out of Band: Processing Instructions

     XML uses a special syntax for processing instructions. You can imitate this XML syntax by using a similar convention for your SGML processing instructions. Processing instructions are closed in SGML with a right angle-bracket. In XML, they are closed by a question-mark right angle-bracket sequence:
     
    
        <?isnt-xml This is a processing instruction.>
        <?okay-xml This is a processing instruction.?>
    
     In XML, the PIC (processing instruction close) delimiter is '?>' instead of the usual '>'. If you make this change to your SGML declaration, then the first processing instruction above will not parse and the second will parse just as in XML. If you do not make this change, both will parse, but the second will contain the question mark as part of the content of the processing instruction, rather than as the ending delimiter.
     It is good practice to categorize your SGML processing instructions by always starting them with a name that says to which processor they are directed. In XML, this practice is a requirement. This name is called the PI 'target':
     
    
        <??> <!--This isn't XML because it has no target.-->
        <?okay-xml?>
        <?okay-xml2 The target is 'okay-xml2'.?>
        <?okay-xml3The target is 'okay-xml3The'.?>
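
     A check for XML-style processing instructions with a target might look like the following Python sketch. The function name is hypothetical, the name pattern is a simplification limited to ASCII name characters, and the function expects a single processing instruction as input.

```python
import re

# '<?', a target name, optionally white space and content, then '?>'.
XML_PI = re.compile(r"<\?([A-Za-z_:][\w.:-]*)(?:\s.*?)?\?>", re.DOTALL)

def pi_target(pi):
    """Return the target of an XML-style processing instruction, or None."""
    match = XML_PI.fullmatch(pi)
    return match.group(1) if match else None

print(pi_target("<?okay-xml2 The target is 'okay-xml2'.?>"))
```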
    
     The target 'xml' has a special meaning in XML. To avoid confusion, any other capitalization of those three letters is reserved (and prohibited):
     
    
        <?xml This isn't XML.?>
        <?XML This isn't XML.?>
        <?XmL This isn't XML.?>
        <?xmlx This is technically okay but tempting fate.?>
        <?sgml This is okay XML.?>
    

    Miscellaneous: Characters

     

    Case Insensitivity

     XML insists on case-sensitivity in places where SGML is typically insensitive. This can be a big headache at first, but it can ultimately simplify processing of the data. This is one of several places where SGML can be made to match XML by changing the SGML declaration you use.
     First, adopt a standard capitalization for your element and attribute names. As a programmer afraid of carpal-tunnel syndrome, I suggest all lower case. Then, change 'NAMECASE GENERAL YES' to 'NAMECASE GENERAL NO' in your SGML declaration file.
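
     Before switching NAMECASE, it helps to know where capitalization is currently inconsistent. The following Python sketch (hypothetical function name) collects element names from start and end tags and reports those that occur with more than one capitalization. It assumes comments and CDATA sections do not contain text that looks like tags, and it does not examine attribute names.

```python
import re
from collections import defaultdict

# The element name following '<' or '</'.
TAG_NAME = re.compile(r"</?([A-Za-z][\w.-]*)")

def mixed_case_names(text):
    """Map each lower-cased element name to its variants, if more than one."""
    variants = defaultdict(set)
    for name in TAG_NAME.findall(text):
        variants[name.lower()].add(name)
    return {key: sorted(names)
            for key, names in variants.items() if len(names) > 1}

print(mixed_case_names("<okay-xml><TITLE>text</title></okay-xml>"))
```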
     

    Odd Name Characters

     All the name characters allowed by the reference concrete syntax are allowed by XML. So are thousands of others. But it's possible to have an SGML declaration that declares as name characters some characters that XML doesn't allow as name characters.
     

    Character References without Semicolons

     As with entity references, you can't leave off the final semicolon in a character reference:
     
    
        <isnt-xml>R&#233;sum&#233</isnt-xml>
        <okay-xml>R&#233;sum&#233;</okay-xml>
    
     

    Named Character References

     You can't use named character references:
     
    
        <isnt-xml>You can't use &#RE;, &#RS;, &#SPACE;.
        or a custom-defined function &#NAME;.</isnt-xml>
    
     

    References to non-SGML Characters

     You can't use a numeric character reference to include a non-SGML character in XML.

    Miscellaneous: Minimization

     XML does not include a wide variety of markup minimization features available in SGML. This section lists the more common types of minimization. Less commonly used minimization techniques are listed under 'Obscure Features'.
     
    OMITTAG

     The OMITTAG feature is fairly commonly used. It allows you to completely leave out certain start and end tags when you can tell by the context that they are required. So, using this feature, you might leave out the start tag for a chapter title (provided that there were some data characters at the beginning of the chapter, yet all chapters were required by the DTD to start with a title) or the end tag for a chapter (provided that there was a start tag for the next chapter and the DTD didn't allow chapters to nest). Notice how both examples require consulting the DTD to determine which tags have been left out. XML does not allow tags to be omitted in this way.
     
    Portions of SHORTTAG

     The SHORTTAG feature allows various abbreviations to be made within a tag. This feature is officially declared to be ON in the SGML declaration for XML, because some of these abbreviations are in fact allowed. But many are not.
     
    SHORTTAG: Empty Tags

     Quite distinct from the idea of an empty element, there is the possibility in SGML of having empty tags. An empty start tag looks like this: '<>'; and an empty end tag looks like this: '</>'. Empty tags are allowed in SGML in certain contexts where it is clear what the missing element type name is.
     
    
        <isnt-xml><>Apparently, isnt-xml must always start
        with a certain element</></isnt-xml>
    
     In XML, element type names must always be spelled out:
     
    
        <okay-xml><title>We need to spell out 'title'</title></okay-xml>
    
     

    SHORTTAG: Unclosed Tags

     Did you know that the final right angle-bracket is not always required on tags in SGML? Stomach-turning, isn't it? Sorry I ever mentioned it.
     
    
        <isnt-xml<isnt-xml2>text</isnt-xml2</isnt-xml>
        <okay-xml><okay-xml2>text</okay-xml2></okay-xml>
    
     
    SHORTTAG: Leaving off Quote Marks

     In SGML, you can sometimes leave off the quotes when specifying attribute values. This is not allowed in XML. See The Big Three: Attributes: Attribute Value Specification, above.
     

    SHORTTAG: Leaving off Attribute Names

     In SGML, you can sometimes leave off the attribute name when specifying attribute values. This is not allowed in XML. See The Big Three: Attributes: Specifying Attribute Values, above.

    Miscellaneous: Other Restrictions

     

    < and &

     You should not use '&' or '<' as data:
     
    
        <isnt-xml>In SGML, you can use & and < as data in
        certain contexts; when followed by a space, for example.</isnt-xml>
    
     Use '&amp;' for '&' and '&lt;' for '<':
     
    
        <okay-xml>In SGML, you can use &amp; and &lt; as data in
        certain contexts; when followed by a space, for example.</okay-xml>
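
     One way to see the difference is to hand both forms to an XML parser. The following is a small sketch using Python's standard library; the helper name is_well_formed is hypothetical. It shows the raw characters being rejected, and escape() from xml.sax.saxutils producing the replacement text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(document):
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

raw = "you can use & and < as data"

print(is_well_formed("<isnt-xml>" + raw + "</isnt-xml>"))          # False
print(escape(raw))               # you can use &amp; and &lt; as data
print(is_well_formed("<okay-xml>" + escape(raw) + "</okay-xml>"))  # True
```

     Note that escape() also rewrites '>' as '&gt;', which is harmless in XML.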
    
     The places where you can use the & and < characters without them being interpreted as markup are in comments, processing instructions, CDATA marked sections and in the literal entity value in an internal entity declaration:
     
    
        <!--Okay XML with the &this and <that.-->
        <?example Okay XML with the &this and <that.?>
        <okay-xml><![CDATA[Okay XML with &this and <that.]]></okay-xml>
        <!ENTITY okay-xml "Okay XML with &this and <that.">
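
     The first three of these contexts can be checked with a parser. This is a sketch under stated assumptions: it uses Python's xml.etree.ElementTree, closes the CDATA marked section with ']]>', and leaves out the entity-declaration case because that would need a document type declaration around it.

```python
import xml.etree.ElementTree as ET

# A comment, a processing instruction and a CDATA marked section,
# each containing literal "&" and "<" characters.
document = (
    "<!--Okay XML with the &this and <that.-->\n"
    "<?example Okay XML with the &this and <that.?>\n"
    "<okay-xml><![CDATA[Okay XML with &this and <that.]]></okay-xml>"
)

root = ET.fromstring(document)  # raises ParseError if not well-formed
print(root.tag)   # okay-xml
print(root.text)  # Okay XML with &this and <that.
```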
    
     This is not a change from SGML.
     

    'xml' Reserved

     Don't use names that start with 'xml', with any capitalization. This applies to element names, attribute names, named attribute values, entity names, etc:
     
    
        <!ELEMENT xml-isnt (#PCDATA)>
        <!ELEMENT okay-xml (#PCDATA)>
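
     A DTD can be scanned for reserved names along these lines. This is only a sketch: the function name reserved_names is hypothetical, it looks only at the first name in each ELEMENT, ATTLIST and ENTITY declaration, and it skips SGML name groups such as '(a|b)', so attribute names and named attribute values would need a fuller check.

```python
import re

# Captures the declaration keyword and the first name that follows it,
# allowing for the "%" that marks a parameter entity declaration.
DECLARED_NAME = re.compile(
    r"<!(ELEMENT|ATTLIST|ENTITY)\s+(?:%\s+)?([^\s>\"'(]+)"
)

def reserved_names(dtd_text):
    """Return (keyword, name) pairs whose name starts with 'xml' in any case."""
    return [
        (match.group(1), match.group(2))
        for match in DECLARED_NAME.finditer(dtd_text)
        if match.group(2).lower().startswith("xml")
    ]

dtd = (
    "<!ELEMENT xml-isnt (#PCDATA)>\n"
    "<!ELEMENT okay-xml (#PCDATA)>"
)
print(reserved_names(dtd))  # [('ELEMENT', 'xml-isnt')]
```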
    
     

    Gotta Keep 'em Separated

     There are various places in declarations where the formal syntactic definition of SGML says that white space is required. But there's another SGML rule that says you can scrunch out this white space when it is adjacent to a delimiter character. XML does not have this rule, so when the XML Recommendation says there must be white space, it means it. So, for example, the XML Recommendation says a general entity declaration should look like this:
     
    
        [71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
    
     As a consequence, the white space (the S) after the entity name cannot be left out:
     
    
        <!ENTITY isnt-xml"Scrunching allowed in SGML.">
        <!ENTITY okay-xml  "No scrunching in XML.">
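
     This particular case of scrunching is easy to search for. The sketch below (the name scrunched_entities is hypothetical) reports entity declarations in which the entity name runs straight into the opening quote of the literal; other places where SGML lets white space be dropped would each need their own pattern.

```python
import re

# An entity name followed immediately by a quote, with no white space between.
SCRUNCHED = re.compile(r"<!ENTITY\s+(?:%\s+)?([^\s\"'>]+)[\"']")

def scrunched_entities(dtd_text):
    """Return the names of entities declared without white space after the name."""
    return [match.group(1) for match in SCRUNCHED.finditer(dtd_text)]

dtd = (
    '<!ENTITY isnt-xml"Scrunching allowed in SGML.">\n'
    '<!ENTITY okay-xml  "No scrunching in XML.">'
)
print(scrunched_entities(dtd))  # ['isnt-xml']
```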
    
     But note that the start of an XML CDATA section is defined to include no white space:
     
    
        [19] CDStart ::= '<![CDATA['
    
     

    SGML Declaration

     Do not include the SGML declaration in the XML document entity. The SGML declaration for XML must be left implied.

    Miscellaneous: Obscure SGML Features

     There are a number of features of SGML of which you may be only dimly aware. You likely won't notice their absence from XML. The SHORTREF, RANK and DATATAG features are properly classified under the minimization category.
     

    SHORTREF

     XML does not allow SGML's SHORTREF feature, whereby certain short sequences (like double carriage-returns) are interpreted as abbreviated references to markup (like paragraph tags).
     

    DATATAG

     SGML includes a feature called DATATAG in which data acts as both markup and content. I've never encountered a use of this feature. Perhaps a show of hands?
     

    RANK

     The RANK feature allows you to declare a set of elements that differ only in a numerical suffix (like the H1, H2, H3 heading elements in HTML) and then to type only <H>, which is interpreted as a heading of whatever level most recently occurred in the document.
     Even when this feature is not turned on in your SGML declaration, you can still split the element type name into two parts in an element declaration. You can't do this in XML.
     

    LINK

     There are three variations on the LINK feature (SIMPLE, EXPLICIT and IMPLICIT). I've heard arguments for the importance of this feature, from people with more authority than I have. But I don't believe them. At any rate, XML does not include this feature.
     

    CONCUR

     The SGML CONCUR feature allows you to apply more than one DTD to the same data, simultaneously. XML does not include this feature.
     

    Conclusion

     A journey of a thousand miles starts with the first step. But before you take that step, you ought to determine where you stand. This will help you start out in the right direction. Or realize you're happy right where you are.

    References

     SGML is defined by 'ISO 8879:1986(E). Information processing - Text and Office Systems - Standard Generalized Markup Language (SGML). First edition - 1986-10-15', available from the International Organization for Standardization in Geneva.
     XML is defined by 'Extensible Markup Language (XML) 1.0', a World Wide Web Consortium (W3C) Recommendation dated 1998 February 10 (www.w3.org/TR/REC-xml).
     This paper is derived from James Clark's 'Comparison of SGML and XML', a World Wide Web Consortium (W3C) Note dated 1997 December 15 (www.w3.org/TR/NOTE-sgml-xml-971215 by jjc@jclark.com). A few details have apparently changed since Clark's Note was written: PUBLIC identifiers, for example, are now a part of XML.
