XML, Everywhere   Table of contents   Indexes   Pragmatic SGML-solutions in a telecommunications organization

 
 

Stylesheet Driven SGML Transformation


 
Nicolas Paris
AIS Software
17, rue Rémy Dumoncel
75014 Paris, France
Web: http://www.balise.com/
 
Biographical notice:
 
Nicolas Paris
 
Nicolas Paris is product manager at AIS Software where he is responsible for the development of Balise, Balise HTML Package and Dual Prism. His background is in compiler technology, massively parallel architecture design, and CAD-tool development for micro-chips. Nicolas is a graduate of the Ecole Normale Superieure in Paris and holds a doctorate in Computer Science from the University of Paris XI.
 
ABSTRACT:
 
The need to generate output formats such as HTML and RTF, or to transform documents from one DTD to another, arises in nearly all of today's SGML/XML projects. In most cases, such transformations are expressed using specialized programming tools and languages that provide abstractions designed to facilitate this process.
 
More recently, the DSSSL standard introduced the concept of groves and higher level mappings for expressing transformations from input groves to output objects. Although some parts of the DSSSL standard have been implemented in tools such as Jade, this approach has not yet led to many practical implementations, and writing DSSSL stylesheets is clearly out of reach of most people.
 
In this paper, we describe a framework that we have designed and implemented. This framework combines two key features: a stylesheet model, strongly inspired by DSSSL (but with some important constraints relaxed), and a graphical user interface for designing stylesheets. This framework has radically changed the way SGML transformations can be performed by making the process accessible to non-programmers. The relationship with the forthcoming XSL standard is also considered.
 
 

Introduction

 
SGML-to-HTML conversion forms part of most SGML/XML projects and represents the simplest way to publish or preview structured documents using a standard web browser. It is therefore important to have efficient tools for handling such conversions.
 
As a special case of SGML transformation, SGML/XML-to-HTML conversion can be handled by most SGML-aware products, and we give examples of this approach below. However, a more efficient approach involves the use of the stylesheet concept introduced by DSSSL.
 
The basic difference between a stylesheet and a program is declarativity. A program must contain all the instructions to convert all SGML fragments into HTML tags and attributes. A stylesheet, however, simply describes the result that must be achieved, leaving a generic transformation engine to perform the actual conversion, as specified in the stylesheet.
 
We can identify two main functional aspects in a typical SGML/XML-to-HTML conversion. Some processes handle the "rendering" aspects: generation of flows of HTML paragraphs with the associated properties, figures, tables, and so on. These are the aspects that first come to mind when we think about HTML transformation. Other processes must also be considered, however, including the fragmentation of SGML sources into separate HTML pages, and the generation of tables of contents, indexes, lists of figures, lists of tables, and so on. This second group of processes represents an all-important source of added value when mapping SGML/XML structures into a hierarchy of HTML pages.
 
These two main functional aspects, rendering specifications and structure specifications, involve different issues of varying complexity, and must be considered together when constructing an effective conversion solution.
 
In the following sections, we illustrate the importance of a stylesheet-based approach for such conversions. We also show how all transformation specifications, both rendering and structure specifications, can be expressed using the same consistent stylesheet model.
 
 

The Programming Approach

 
If we handle conversion to HTML using standard SGML transformation tools, then we end up with programs whose structure is similar to the following example (which uses the Balise language):
 
element TITLE [within CHAPTER] {
  on start {
    cout << "<H1 align=left><FONT color=red>";
    cout << "Chapter " + dec(cNum()) + "<BR>";
  }
  on end {
    cout << "</FONT></H1>";
  }
}
 
In this example, the element keyword opens a rule associated with any TITLE element directly inside a CHAPTER element. Two clauses specify the actions that must be performed when a start tag or an end tag is reached for the corresponding elements. This programming approach is used by most SGML transformation products and is called the event-driven programming model.
 
Similar rules specify how the other elements should be transformed; a program of this type might typically contain around a hundred such rules. This approach has the following drawbacks:
  • It is a programming approach and is therefore most often limited to use by developers, or people with a background in computer programming.
  • It is a batch-oriented approach. The development cycle can therefore be relatively long: program modification, transformation of a set of documents (which can generate hundreds of HTML pages), browsing through the pages, validation, iteration.
  • No mechanism can guarantee the quality of the HTML markup that is generated.
  • Many similar clauses are often used in different parts of a program. Modifying and correcting such clauses can thus be tedious and error-prone.
    For complex projects, the development of an SGML/XML-to-HTML transformation application may require two or three weeks of development and testing. The maintenance cost of the program is of course related to the complexity of the application, but may also require significant effort and investment.
     
     

    The Stylesheet Approach - Rendering Properties

     
    The creation of rendering specifications is the most intuitive part of the conversion problem. It also corresponds to the classic notion of a stylesheet, as defined in SGML editors, for instance.
     
    A stylesheet approach enables you to specify the same transformation using a specific method adapted to the generation of HTML output. The following example handles the same transformation as above using a stylesheet specification:
     
    <STYLE NAME='CHAPTER,TITLE'>
        <TAG>"H1 align=left"</TAG>
        <COLOR>"red"</COLOR>
        <TEXT-BEFORE>"Chapter " + dec(cNum()) + "<BR>"</TEXT-BEFORE>
    </STYLE>
     
    Although the transformation specification is the same, the way in which it is expressed is clearly different:
  • Clauses and statements are replaced by declarative styles and properties.
  • Transformation specifications are organized into a set of separate properties. In this example, the tag property generates markup around the content of the element, the color property generates a FONT tag with a color attribute, and the text-before property generates text inside that markup.
    Note that we are still using expressions of the Balise language to express data manipulations in the stylesheet.
     
    This approach reduces the size and complexity of the specification, at the cost of some expressive power: only transformations that can be stated through the available properties are possible. Because the specification is more structured, it yields better-quality output and reduces development costs.
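The division of labour between a declarative stylesheet and a generic engine can be illustrated with a minimal Python sketch. It is not the Balise implementation: the STYLES table, the render function, and the ctx chapter counter are hypothetical names introduced here, the context selector is reduced to a (parent, element) pair, and text is copied without HTML escaping.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet: properties attached to an element in a given
# parent context, mirroring the CHAPTER,TITLE style shown above.
STYLES = {
    ("CHAPTER", "TITLE"): {
        "tag": "H1 align=left",
        "color": "red",
        # Dynamic property: evaluated by the engine at rendering time.
        "text_before": lambda ctx: "Chapter %d<BR>" % ctx["chapter_num"],
    },
}


def render(elem, parent_name=None, ctx=None):
    """Generic engine: looks up the style of each element and applies it."""
    ctx = ctx if ctx is not None else {"chapter_num": 0}
    if elem.tag == "CHAPTER":
        ctx["chapter_num"] += 1
    style = STYLES.get((parent_name, elem.tag), {})

    open_tags, close_tags = [], []
    if "tag" in style:
        open_tags.append("<%s>" % style["tag"])
        # The closing tag uses only the element name, not its attributes.
        close_tags.insert(0, "</%s>" % style["tag"].split()[0])
    if "color" in style:
        open_tags.append("<FONT color=%s>" % style["color"])
        close_tags.insert(0, "</FONT>")
    before = style["text_before"](ctx) if "text_before" in style else ""

    body = elem.text or ""
    for child in elem:
        body += render(child, elem.tag, ctx) + (child.tail or "")
    return "".join(open_tags) + before + body + "".join(close_tags)


doc = ET.fromstring("<CHAPTER><TITLE>Intro</TITLE></CHAPTER>")
html = render(doc)
```

The engine contains no knowledge of TITLE or CHAPTER; changing the output only requires editing the STYLES table, which is the point of the declarative approach.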
     
     

    The WYSIWYG Approach

     
    Although developing a stylesheet is easier and faster than developing a program, it remains an abstract specification task. This task can be simplified further by providing the user with immediate visual feedback on the specification that is being constructed. This is what we call the WYSIWYG approach.
     
    In this approach, a stylesheet editor is provided that enables the specification to be constructed using forms and dialogs, together with a real-time HTML preview for the specified output, as illustrated in the following figure:

     
    A stylesheet editor

     
     
    The benefit of such an editor is a further reduction in design time:
  • The stylesheet is designed according to the results directly shown in the HTML browser which is tightly integrated in the tool. This enables the design loop time to be reduced from minutes to seconds.
  • All possible properties are easily accessible to the user. Setting a property simply requires filling in the corresponding text field.
  • Visual feedback allows immediate identification of incorrect HTML expressions. Most errors can therefore be detected and avoided very early in the process.
    A large part of the rendering properties will simply be text strings containing the HTML markup to be generated. Dynamic expressions are typically needed in prefix and suffix properties (text-before, text-after), where they are used to generate the numbering of sections or list items, or to insert the title of a section in place of a cross-reference to that section.
     
    This part of the conversion can easily be specified in just a few hours with this type of interactive and WYSIWYG tool.
     
     

    The Stylesheet Approach - Structure Properties

     
    The structure part of an SGML/XML-to-HTML transformation process handles transformation issues such as:
  • Fragmentation of an SGML document into a set of HTML pages of smaller size.
  • Automatic generation of a table of contents or other indexes such as a list of figures, list of tables, etc.
  • Restructuring and filtering of a document.
  • Processing of hyperlinks.
     

    Document Fragmentation

     
    When converting SGML/XML documents, the original document size varies a lot depending on the application. The SGML/XML document granularity is most often decided according to content management constraints and does not correspond to the granularity of the pages to be displayed in a browser or transferred through a network.
     
    Reorganizing SGML/XML sources is thus very important in a conversion project, and most often this means fragmenting (splitting) a source document into a hierarchy of HTML pages, with links from parent pages to pages lower down the hierarchy.
     
    In practice, two kinds of fragmentation can be considered and combined:
  • Linear fragmentation. The source SGML/XML document is split into a linear list of HTML fragments, with Next/Previous links between the HTML fragments. This mechanism can be completely automatic. The user just needs to identify preferred break positions or, conversely, non-breakable elements.
  • Hierarchical fragmentation. The source SGML/XML document is split by following the SGML/XML structure: CHAPTER, SECTION, etc. The user then has to identify which source elements start a new HTML fragment. The result is a hierarchy of HTML fragments with parent/child links between them.
    A combination of the two models is of course possible. Top level fragmentation can be achieved through hierarchical fragmentation, and when lower level fragments are too large, they can be split into a sequence of smaller fragments.
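As an illustration of linear fragmentation, the following Python sketch splits a sequence of non-breakable blocks into pages under a size budget and adds Next/Previous links. The function name linear_fragments, the max_size parameter, and the part%d.html file naming are assumptions made for this example, not part of the framework described here.

```python
def linear_fragments(blocks, max_size):
    """Split a list of non-breakable blocks (strings) into linked pages.

    A block is never split; a page is closed as soon as adding the next
    block would exceed max_size characters. A single oversized block
    therefore gets a page of its own.
    """
    pages, current, size = [], [], 0
    for block in blocks:
        if current and size + len(block) > max_size:
            pages.append(current)
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        pages.append(current)

    result = []
    last = len(pages) - 1
    for i, content in enumerate(pages):
        result.append({
            "file": "part%d.html" % i,
            "content": content,
            "prev": "part%d.html" % (i - 1) if i > 0 else None,
            "next": "part%d.html" % (i + 1) if i < last else None,
        })
    return result


parts = linear_fragments(["aaaa", "bbbb", "cc", "dddddddddd"], max_size=8)
```

In a combined scheme, this function would be applied only to those hierarchical fragments that are still too large.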
     
    This specification can be achieved using a set of properties such as the following:
  • A no-break property to inhibit linear fragmentation on the elements with which it is associated. It is used to prevent elements such as tables, listings, and list items from being split across different fragments.
  • A fragment property to identify elements that must start a new HTML fragment.
  • A title property to identify the effective title of the generated HTML fragment, when relevant. This is important for automatically generating navigation links.
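Hierarchical fragmentation driven by a fragment property and a title property might be sketched as follows in Python. The FRAGMENT_NAMES set stands in for the elements that carry the fragment property, the title is assumed to be a direct TITLE child, and the page%d.html naming is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these element names carry the fragment property.
FRAGMENT_NAMES = {"CHAPTER", "SECTION"}


def fragment(root, fragment_names=FRAGMENT_NAMES):
    """Return a list of pages with parent/child links between them."""
    pages = []

    def visit(elem, current):
        # The document root always opens the first page.
        if elem.tag in fragment_names or current is None:
            page = {
                "file": "page%d.html" % len(pages),
                "title": elem.findtext("TITLE", default=""),
                "parent": current["file"] if current else None,
                "children": [],
                "content": [],
            }
            if current is not None:
                current["children"].append(page["file"])
            pages.append(page)
            current = page
        else:
            # Non-fragment elements stay on the enclosing page.
            current["content"].append(elem.tag)
        for child in elem:
            visit(child, current)

    visit(root, None)
    return pages


book = ET.fromstring(
    "<BOOK><TITLE>B</TITLE>"
    "<CHAPTER><TITLE>C1</TITLE><P/><SECTION><TITLE>S1</TITLE></SECTION></CHAPTER>"
    "<CHAPTER><TITLE>C2</TITLE></CHAPTER></BOOK>"
)
pages = fragment(book)
```

The parent and children fields are exactly the information needed to generate navigation links, which is why the title of each fragment is recorded at the same time.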
     

    Generating Hyperlinks

     
    Linking is complex because hyperlinks can target any element in the source document. When generating an HTML link, it is necessary to know the file name of the destination HTML page, which may not yet have been processed.
     
    A first pass on the document is therefore required to gather information such as the list of fragment elements, their name, and a table that associates the SGML ID attributes to the elements that carry them.
     
    Link generation can also be easily specified through a simple set of properties:
  • anchor=attribute("ID") can be associated with all elements that may be targeted by a link. This property generates an HTML anchor element <A NAME=...> whose value corresponds to the value of the SGML ID attribute.
  • href=URLFromID(attribute("REFID")) can be associated with a cross-reference element; the value of the REFID attribute is used to identify the destination element and to generate the URL that reaches it.
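The need for a first pass can be made concrete with a short Python sketch. The build_id_table and url_from_id functions are hypothetical stand-ins (the latter for the URLFromID function used above); the file naming and the choice of CHAPTER as the only fragment element are assumptions for the example.

```python
import xml.etree.ElementTree as ET


def build_id_table(root, fragment_names):
    """First pass: map every SGML ID to the HTML file that will hold it."""
    table = {}
    files = ["page0.html"]  # the root page; one more entry per fragment

    def visit(elem, current_file):
        if elem.tag in fragment_names:
            current_file = "page%d.html" % len(files)
            files.append(current_file)
        if "ID" in elem.attrib:
            table[elem.attrib["ID"]] = current_file
        for child in elem:
            visit(child, current_file)

    visit(root, files[0])
    return table


def url_from_id(table, ref_id):
    """Second pass helper: URL of the anchor generated for ref_id."""
    return "%s#%s" % (table[ref_id], ref_id)


doc = ET.fromstring(
    '<DOC><CHAPTER ID="c1"><XREF REFID="s2"/></CHAPTER>'
    '<CHAPTER ID="c2"><SECTION ID="s2"/></CHAPTER></DOC>'
)
table = build_id_table(doc, {"CHAPTER"})
link = url_from_id(table, "s2")
```

The XREF in the first chapter points forward to an element in the second chapter; because the table is complete before any HTML is generated, the forward reference resolves without special handling.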
     

    Specifying Tables of Contents

     
    Generating tables of contents requires different aspects to be specified:
  • Identify the structure elements involved in the table of contents, and their titles
  • Specify the position of the generated structure in the HTML output
  • Specify the HTML rendering for the generated structure
    The first step can be specified using two properties, toc-names and toc-title, which are associated with the elements that appear in the table of contents, such as chapters and sections.
     
    Positions of elements in HTML output cannot be specified directly through a property. Instead, they can be specified using special entities that can be inserted anywhere in the generated HTML flow. These special entities are then interpreted by the generation engine as a TOC insertion command. This mechanism allows very fine-grain positioning of the generated structure in the output flow.
     
    Finally, the rendering of the generated structure can be easily covered with the same properties as the rendering of document content.
     
    The stylesheet extract below contains a basic specification for the generation of a table of contents. Elements in the table of contents are specified using TOC-NAMES and TOC-TITLE, positioning of the generated structure is achieved using the &toc; entity in the HEADER style, and rendering properties for the generated structures are gathered in a specific toc view.
     
      <STYLE NAME='HEADER'>
       <VIEW NAME='default'>
        <DECO-BEFORE>"<PRE>&toc; </PRE>"</DECO-BEFORE>
       </VIEW>
      </STYLE>

      <STYLE NAME='SEC1'>
       <VIEW NAME='doc'>
        <ANCHOR>attr["ID"]</ANCHOR>
        <BGCOLOR>"white"</BGCOLOR>
        <FONT>"Arial "</FONT>
        . . .
        <TOC-NAMES>"toc"</TOC-NAMES>
        <TOC-TITLE>child("TITLE")</TOC-TITLE>
       </VIEW>
       <VIEW NAME='toc'>
        <TAG>UL</TAG>
       </VIEW>
      </STYLE>

      <STYLE NAME='SEC2'>
       <VIEW NAME='doc'>
        <ANCHOR>attr["ID"]</ANCHOR>
        <FONT>"Arial "</FONT>
        . . .
        <TOC-NAMES>"toc"</TOC-NAMES>
        <TOC-TITLE>child("TITLE")</TOC-TITLE>
       </VIEW>
       <VIEW NAME='toc'>
        <TAG>UL</TAG>
       </VIEW>
      </STYLE>

      <STYLE NAME='SEC3'>
       <VIEW NAME='doc'>
        <ANCHOR>attr["ID"]</ANCHOR>
       </VIEW>
      </STYLE>
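The placeholder mechanism can be sketched in Python as a simple substitution step performed after the table-of-contents entries have been collected. The build_toc and insert_toc functions and the (level, title, url) entry format are assumptions for this illustration; the nested UL output with unclosed LI items follows the HTML conventions of the stylesheet above, and titles are inserted without escaping.

```python
def build_toc(entries):
    """Render (level, title, url) entries as nested UL lists.

    Levels start at 1; each deeper level opens a nested UL.
    """
    html, depth = [], 0
    for level, title, url in entries:
        while depth < level:
            html.append("<UL>")
            depth += 1
        while depth > level:
            html.append("</UL>")
            depth -= 1
        html.append('<LI><A HREF="%s">%s</A>' % (url, title))
    html.append("</UL>" * depth)
    return "".join(html)


def insert_toc(page_html, entries, placeholder="&toc;"):
    """Replace the TOC insertion entity wherever it occurs in the flow."""
    return page_html.replace(placeholder, build_toc(entries))


entries = [
    (1, "A", "a.html"),
    (2, "A1", "a.html#a1"),
    (1, "B", "b.html"),
]
page = insert_toc("<PRE>&toc; </PRE>", entries)
```

Because the entity is ordinary text in the generated flow, it can be placed at any position that a string property can produce, which is what gives the mechanism its fine-grained positioning.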
     
     

    Reorganizing the Document

     
    Some applications require a given piece of information to be presented in different ways according to context. This is the case of summary documents, for instance, where specific pieces of a global document are extracted and gathered together to build a new document. A simple example of this is in an HTML version of conference proceedings where a section contains the summaries of all the articles, organized by country and author.
     
    To be able to derive such structure, it is necessary to build tables (or associations) containing, for instance, all SUMMARY elements classified by author names.
     
    This can also be achieved in a stylesheet approach. A key property can be used to specify such tables. For the proceedings example, we can define the following property:
     
      <STYLE NAME='ARTICLE'>
       <VIEW NAME='doc'>
        <KEYS>setKey("abstract by author", attribute("AUTHOR"),
                     child("ABSTRACT"))</KEYS>
       </VIEW>
      </STYLE>
     
    This table can then be used elsewhere to visit the abstracts of the articles written by a given author, as follows:
     
      <STYLE NAME='AUTHOR'>
       <VIEW NAME='abstractlist'>
        <TEXT-BEFORE>visit(getFromKey("abstract by author",
                           attribute("ID")), "summary")</TEXT-BEFORE>
       </VIEW>
      </STYLE>
     
    This fragment specifies that the list of all abstracts for the given author will be retrieved from the table, inlined at the current position, and formatted according to the summary view of the stylesheet.
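The tables behind setKey and getFromKey can be thought of as named multi-valued associations. The Python sketch below shows one possible data structure; the KeyTables class and its method names are hypothetical and only mirror the two calls used in the stylesheet extracts above.

```python
class KeyTables:
    """Named tables that associate a key with a list of values."""

    def __init__(self):
        self._tables = {}

    def set_key(self, table, key, value):
        # Several values may be registered under the same key,
        # for example several abstracts for one author.
        self._tables.setdefault(table, {}).setdefault(key, []).append(value)

    def get_from_key(self, table, key):
        # Unknown tables or keys yield an empty list rather than an error.
        return list(self._tables.get(table, {}).get(key, []))


keys = KeyTables()
keys.set_key("abstract by author", "a1", "Abstract of paper 1")
keys.set_key("abstract by author", "a1", "Abstract of paper 2")
keys.set_key("abstract by author", "a2", "Abstract of paper 3")
found = keys.get_from_key("abstract by author", "a1")
```

In the stylesheet, the values would be ABSTRACT elements rather than strings, and the retrieved list would be passed to visit together with the name of the view used to format it.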
     
     

    Conclusion

     
    Using some common examples, we have shown that SGML/XML-to-HTML transformation can be expressed using stylesheets and that such declarative specifications can save a lot of development effort.
     
    Many of the concepts presented here are strongly inspired (or even derived) from the work done around the DSSSL standard. This is true in particular for the notion of declarative stylesheets based on properties attached to contextualized elements, the mix of constant properties and dynamically generated properties, the notion of views that allows a single element to be rendered differently according to different evaluation contexts, etc.
     
    However, such a stylesheet approach becomes really effective for generating HTML when used with a WYSIWYG stylesheet design tool. Visual feedback is very important for detecting errors early on in the HTML generation process, and we have observed ratios of up to 5:1 between the development time required to create a stylesheet using a "hand-written" approach and the time required when using the WYSIWYG tool.
     
    The ongoing work on the XSL standard is also closely related to this SGML/XML-to-HTML conversion issue. When the work on XSL is complete, it will allow XML documents to be displayed directly in a standard browser, without the need for conversion. XSL will thus "replace" the rendering part of the conversion process, as we have described it above. However, it is important to note that XSL does not consider the structure aspects of the conversion process: fragmentation, generated structures, complex reorganizations, etc. According to many users, the rendering part is the easy part of an SGML/XML-to-HTML conversion...
     
    Finally, this HTML conversion approach can be easily extended to cover other kinds of SGML transformation, such as RTF generation, up-translation or even DTD-to-DTD transformation. The real success of the approach comes from:
  • An interactive stylesheet editor for immediate visual feedback
  • The use of a browser as a general selection interface
  • The full power of an extensible expression language
    This is the direction we will follow for the development of new "programming free" products based on the same core concepts and technologies.
