
 
 

SGML Databases & Content Management for the Web


 
Christophe Lécluse
  AIS Software
17 rue Rémy Dumoncel
Paris   75014
Email: clec@ais.berger-levrault.fr Web: http://www.balise.com
 
Biographical notice:
 
Christophe Lécluse
 
Christophe Lécluse is manager of AIS Software. He has been working on SGML applications and systems for over eight years and specializes in SGML systems, object databases and languages. He was the originator of the Balise technology developed by AIS Software.
 
ABSTRACT:
 
The SGML community has been discussing the concept of document management and SGML databases for several years. Coming from the Web and XML worlds, a new buzzword emerged in 1997: "Content Management for the Web". Although this concept today seems to embrace many different realities, there is clearly some overlap between Content Management for XML documents and Document Management as it was understood up to now.
 
This paper highlights some of the functional and technical aspects of the content management concept and aims to help you identify more clearly the underlying complementarity with document management as it has been defined and understood up until very recently.
 
 

Possible Definitions of Content Management

 
Defining Content Management today is clearly a challenge. The Web and the marketing brochures are full of tentative definitions that do not really draw a precise or globally consistent picture. The following sections present some of these common definitions.
 
 

The "Broad" Definition

 
Publishing companies and other information providers increasingly think of themselves as content owners with information as their product. They increasingly establish centralized digital repositories to store all of the content owned by the organization, in various media: text, images, audio and video.
 
The term Content Management is used here to describe the collection of applications that help organize this information so that its owners can benefit from the flexibility of digital information without getting lost in the "virtuality" of this new information process.
 
This is perhaps the broadest possible definition of Content Management. It characterizes the current evolution of IT technologies and Document Management technologies, which tend to converge into a unified framework. The Web introduced the "document" as a generalized metaphor for information system interfaces. Through this natural trend, web content management will also progressively cover a large part of information management in organizations.
 
 

A "Web-publishing" Definition

 
Closer to many of us, the very fast development of Web-publishing already justifies the introduction of specific concepts and technologies. Many Web publishing projects today struggle with issues such as the granularity of information, the difficulty of reusing and sharing pieces of information in various parts of a publication, the difficulty of merging information from various sources (especially information stored in classical RDBMS), the difficulty of personalizing a publication, etc.
 
Web Content Management can thus be defined as a method by which we can organize the production and delivery of structured information as Web publications.
 
One of the keys to this organization is the ability to abstract meaningful information from the final published representation of this information. It has become obvious that producing "flat" HTML pages that use simple hypertext links and are stored in file systems does not scale to Web publishing projects of significant size.
 
For a decade, SGML, and now XML, has proved to be an adequate method for abstracting information from published content on various media. The very fast adoption of XML in Web Content Management is thus not really surprising.
 
 

A "Functional" Definition

 
Because Web Content Management establishes a bridge between information production and information delivery (or between document production and document delivery), the proposed solutions have features that relate to information management and other features that relate to information delivery. This of course contributes to the difficulty of characterizing and comparing solutions and products. However, the following list of characteristics can be used to arrive at a tentative functional definition of Web Content Management:
 
Production
  • The ability to define "document modules" at a granularity that is adequate for production and management, and not necessarily tied to the granularity used for delivery.
  • The ability to define these modules in formats that are adequate for production and re-use, and that are not necessarily the same as the delivery format. Of course, we think XML/SGML should be the preferred format in many cases, as they are abstract enough to automate transformation into other formats. XML also gives access to the internal information of a module, thus also allowing management and manipulation at the component level, as well as at the module level.
  • The ability to organize these modules in manageable hierarchies, and to easily access methods for searching and retrieving information during the production phase.
Delivery
  • The ability to automate the generation of one or several individualized delivery views from a given content. This generation may be done off-line (batch) or on-line (dynamic) depending on the constraints of a given project.
  • The ability to substantially reorganize document modules or the content of document modules during this generation process.
  • The ability to merge document information with other sources of information such as those stored in RDBMS.
  • The ability to handle multiple destinations and/or multiple formats. HTML versus XML as a destination format is a typical example.
  • The ability to easily handle large volumes of data and to provide search capabilities on such volumes.
  • The ability to leverage the tremendous ongoing efforts being made to improve Web architectures, and in particular protocols, security, server and browser technologies, etc. This means providing a complementary technology that uses standard Web components.
     

    Content Management Applications

     
    The spectrum of possible content management applications is very wide and there is clearly no single, standard Web Content Management solution. The choice of a given solution mainly depends on (1) business issues and (2) the nature of the information that is being managed and published.
     
    We can however provide a few representative examples along this spectrum.
     
     

    The Simplest Application: Unmanaged Web Publishing

     
    In this example, web publications are directly produced as sets of HTML pages. This of course corresponds to the extreme side of the spectrum where information is managed and edited directly in the delivery format using simple HTML editors or Web site builders.
     
    Of course, the limitations of this approach are well-known and almost none of the desired features described above can be implemented in such environments. However, it is worth mentioning that most of today's Web content is still produced directly in this way, even for publications of significant size.
     
     

    A Few Goodies: Enriched HTML Pages

     
    Enriched HTML pages can be used as a first abstraction level on top of straight HTML. There are many applications and tools in this area. Microsoft Active Server Pages are one of them. The idea behind all these approaches is to encode some semantics in specific tags beyond the HTML DTD and have processes that are able to interpret these tags dynamically when a page is retrieved.
     
    This of course allows simple customization and dynamicity. Such systems are often able to define HTML templates that merge HTML pages with SQL queries, thus providing some form of Database Publishing.
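
As an illustration of the enriched-page idea, here is a minimal Python sketch. The `<sql query="..."/>` tag, the `render_page` function and the `products` table are all hypothetical names invented for this example; they do not correspond to Active Server Pages or to any particular product. The sketch only shows the principle: a tag outside the HTML DTD is interpreted when the page is retrieved and replaced by the result of an SQL query.

```python
import html
import re
import sqlite3

# Hypothetical enriched-HTML convention (an assumption for this sketch):
# <sql query="..."/> marks a place that is filled in when the page is served.
SQL_TAG = re.compile(r'<sql\s+query="([^"]*)"\s*/>')


def render_page(template, conn):
    """Replace each <sql/> tag with an HTML list built from the query rows."""

    def expand(match):
        rows = conn.execute(match.group(1)).fetchall()
        items = "".join(
            "<li>" + " - ".join(html.escape(str(col)) for col in row) + "</li>"
            for row in rows
        )
        return "<ul>" + items + "</ul>"

    return SQL_TAG.sub(expand, template)


# Illustrative data: an in-memory database standing in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Editor", 120.0), ("Parser", 80.0)],
)

template = (
    "<html><body><h1>Catalog</h1>"
    '<sql query="SELECT name, price FROM products ORDER BY name"/>'
    "</body></html>"
)
page = render_page(template, conn)
print(page)
```

Note that the page being authored and the page being delivered are still almost the same object here, which is the limitation this approach suffers from.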
     
    The main limitation of this approach is that it does not separate the production and delivery representations of the information very much. It is thus adapted for "local" forms of dynamicity, but cannot really scale up to higher abstraction levels. The management of sets of pages is also not handled.
     
     

    The HTML (BLOB) Content Management Approach

     
    Now we enter the domain of document management. A large number of products exist that offer general document management features. In most cases, however, documents are considered as elementary units that are managed, stored, and exported as such. The added value of such products is in the management of the document information itself and in the management of the document creation and publishing processes.
     
    In the specific case of HTML, some products are adding specific functions for handling links, which are of course one of the main issues in the management of a large number of inter-related pages.
     
    The main limitation of these products is their poor "understanding" of document content. The notion of document being edited remains very close (when not identical) to the notion of document being published. They are thus missing the abstraction level which is necessary to target the key benefits mentioned above:
  • The ability to target different publications from the same information source
  • The ability to target different media for a given publication
     

    A Simple XML/SGML Content Management Approach

     
    The main difference between a "traditional" document management solution and an XML/SGML document management solution is the ability to separate the form and the content of the information being edited from those of the information being published.
     
    In practice, this means that such a solution will allow documents to be defined in a modular way, the notion of module corresponding to management criteria.
     
    Some automation can be simply set up to extract/merge meta-data from/to module content. The production of a Web publication from XML/SGML source data will also be an automated process that takes the set of source modules (plus other sources if needed) and generates another set of HTML (or XML) pages for delivery.
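
A minimal Python sketch of these two automated steps follows, using only the standard library. The module structure (`<module>`, `<meta>`, `<para>`) and the function names are assumptions made for the example, not a DTD or an API taken from an actual system.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical source module; the element names are illustrative only.
MODULE = """<module id="m1">
  <meta><title>Installing the pump</title><author>J. Smith</author></meta>
  <para>Close the valve.</para>
  <para>Mount the pump on its base.</para>
</module>"""


def extract_metadata(module):
    """Collect the children of <meta> as a plain dictionary."""
    meta = module.find("meta")
    if meta is None:
        return {}
    return {child.tag: (child.text or "") for child in meta}


def module_to_html(module):
    """Generate one delivery page from one source module."""
    title = html.escape(extract_metadata(module).get("title", ""))
    body = "".join(
        f"<p>{html.escape(p.text or '')}</p>" for p in module.findall("para")
    )
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1>{body}</body></html>"
    )


module = ET.fromstring(MODULE)
metadata = extract_metadata(module)
page = module_to_html(module)
print(metadata)
print(page)
```

The extracted metadata could be stored alongside the module for search and management purposes, while the generated page is only a delivery artifact that can be regenerated at any time.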
     
     
    In this simple approach, the processing is mainly a batch XML/SGML-to-HTML transformation process. Some applications may also require incremental processing, so that the whole input does not have to be reprocessed when only part of it is modified.
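
One possible way to make such a batch process incremental is to remember a digest of each source module and to transform only the modules whose digest has changed. The Python sketch below assumes this hashing strategy, which is only one option among others (timestamps or dependency tracking would also work); the function names are hypothetical and `str.upper` stands in for a real XML-to-HTML transformation.

```python
import hashlib


def digest(text):
    """Fingerprint of a module's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_batch(sources, previous, transform):
    """Transform only the modules that changed since the last run.

    sources  maps module name -> current source text
    previous maps module name -> digest recorded by the last run
    Returns (outputs for changed modules, digests to keep for the next run).
    """
    changed = sorted(
        name for name, text in sources.items()
        if previous.get(name) != digest(text)
    )
    outputs = {name: transform(sources[name]) for name in changed}
    state = {name: digest(text) for name, text in sources.items()}
    return outputs, state


sources = {
    "intro.xml": "<para>Hello</para>",
    "usage.xml": "<para>Run it</para>",
}

# First run: nothing is known yet, so every module is processed.
first_outputs, state = run_batch(sources, {}, str.upper)

# Second run: only the edited module is processed again.
sources["usage.xml"] = "<para>Run it twice</para>"
second_outputs, state = run_batch(sources, state, str.upper)
print(sorted(second_outputs))
```

This sketch ignores dependencies between modules (for example, a table of contents built from all of them), which a real application would need to track as well.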
     
     

    Towards Personalization and Dynamicity

     
     
    Personalization is the ability of a web publication to deliver to each user information tailored to his or her own needs. This means for example:
  • dynamic filtering of information according to user authorization level
  • dynamic assembly of information according to a user profile
  • dynamic assembly of a "what's new" section
  • etc.
    In most cases, such features require that the final pages delivered to users are assembled or filtered dynamically according to user information. This requires some server infrastructure for handling dynamic requests, and for transformations to be efficiently processed.
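
The first kind of personalization listed above, filtering according to an authorization level, can be sketched in a few lines of Python. The `level` attribute, the element names and the `filter_for_user` function are assumptions made for this example; a real server would apply such a filter per request, before transforming the result into the delivery format.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source: each section carries the minimum level needed to read it.
SOURCE = """<publication>
  <section level="0"><title>Overview</title></section>
  <section level="1"><title>Pricing for partners</title></section>
  <section level="2"><title>Internal roadmap</title></section>
</publication>"""


def filter_for_user(root, user_level):
    """Return a copy of the publication without the sections above user_level.

    Only top-level <section> elements are examined in this sketch.
    """
    view = copy.deepcopy(root)  # the shared source tree is left untouched
    for section in view.findall("section"):
        if int(section.get("level", "0")) > user_level:
            view.remove(section)
    return view


root = ET.fromstring(SOURCE)
partner_view = filter_for_user(root, 1)
titles = [s.findtext("title") for s in partner_view.findall("section")]
print(titles)
```

Working on a copy keeps the source tree reusable for the next request, which matters when the same parsed document serves many users with different profiles.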
     
    The simple batch approach and the dynamic approach can often be combined. Most of the transformation can be handled in batch mode, while the necessary dynamic part is handled by the server. The choice between one solution and the other depends on each application.
     
     

    Various Structured Information Sources

     
    In many practical cases, the information to be published is not only stored within text modules. Part of it may also be efficiently represented using traditional RDBMS.
     
    An example is the publishing of financial reports that contain financial information about companies. This financial information is naturally managed in RDBMS independently of the report/analysis information which is edited separately. Many other examples could be mentioned. There is thus a continuum between document publishing and what is often called database publishing.
     
    Being able to merge (in batch mode and/or dynamically) several sources of information, including RDBMS information, is thus a very important aspect of Web content management systems, especially for XML/SGML documents. Even if XML/SGML is capable of modeling most of a publication's content, we must be able to use existing databases and leverage RDBMS technologies and applications when they are appropriate.
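
Taking the financial report example, such a merge might look like the following Python sketch. The report structure, the `financials` table and the `merge_financials` function are hypothetical; an in-memory SQLite database stands in for the existing RDBMS.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical report module: the analysis text is authored in XML,
# while the figures live in the database and are referenced by company code.
REPORT = """<report>
  <company ref="ACME"><analysis>Margins improved this year.</analysis></company>
  <company ref="GLOBEX"><analysis>Growth slowed in Europe.</analysis></company>
</report>"""


def merge_financials(report, conn):
    """Add a <figures> element to each company that has a database row."""
    for company in report.findall("company"):
        row = conn.execute(
            "SELECT revenue, profit FROM financials WHERE code = ?",
            (company.get("ref"),),
        ).fetchone()
        if row is None:
            continue  # no figures available: keep the authored text as is
        figures = ET.SubElement(company, "figures")
        ET.SubElement(figures, "revenue").text = str(row[0])
        ET.SubElement(figures, "profit").text = str(row[1])
    return report


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE financials (code TEXT PRIMARY KEY, revenue INTEGER, profit INTEGER)"
)
conn.execute("INSERT INTO financials VALUES ('ACME', 1200, 150)")

report = merge_financials(ET.fromstring(REPORT), conn)
acme, globex = report.findall("company")
print(ET.tostring(report, encoding="unicode"))
```

The merged tree can then go through the same transformation as any other source module, in batch mode or on request.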
     
     

    The Full Monty

     
     
    The figure above presents a possible global picture for an XML/SGML content management solution, with several information sources.
     
    More than just a view of a typical content management application, this figure presents a synthesis of all the possibilities to be considered when defining such applications. Most content management applications will be limited to subsets of this picture.
     
     

    Relationship with DMS and with SGML Databases

     
    Document Management Systems (DMS) have been around for a long time, and some of them have recently been adapted, or specifically designed, to handle XML/SGML documents. These systems offer features that are clearly part of the global content management picture. Such features as production management, workflow, document repositories, search and navigation in document collections, and so on, are obviously important for handling structured content.
     
    However, most document management systems still only handle documents. Even XML/SGML-aware DMS (there aren't many of them) generally implement a notion of XML/SGML document that does not give real access to the internal content of these documents. Web content management will increasingly require greater separation between the organization of the authored view of the information (documents to be managed) and the delivered view(s) of the information (publications). In most DMS products today, this possibility is either nonexistent or limited to the definition of virtual documents as logical assemblies of physical documents.
     
    In turn, SGML databases can be defined as Document Management Systems that give full access to XML/SGML document content. Most SGML database products allow users to edit/view/manage documents at a variable granularity. They introduce much flexibility in the storage and management of collections of XML/SGML documents.
     
    However, generating web publications from these systems is not only a matter of interfacing with a web server or generating HTML forms for the interface. As we have presented above, such generation may require some document re-organization, merging with other sources, and possibly dynamic filtering and assembly of new documents. All these functions are specific to a "publication part" which is separate from the "management part".
     
    Web content management features such as those we have identified here for the delivery part should be defined in addition to existing features of SGML databases or XML/SGML-aware document management systems. This mainly includes:
  • automatic generation of different publication views from a given document source
  • reorganization of document information in this generation process
  • the ability to merge document information with other sources of information such as RDBMS
     

    The "Build Everything" Approach

     
    A complete application for managing structured data and publishing it on the web thus requires both management functions and delivery functions.
     
    It is possible to build all those functions starting from a general RDBMS and an SGML/XML-aware development environment. However, this "build everything" approach suffers from two major problems:
  • The cost of developing a solution will surely be greater than the cost of integrating existing vendor products that implement parts of the desired functions.
  • The database structure that is adequate for information creation and management is not necessarily the same as the database structure that is adequate for delivery (if one is required at all). It is thus not always appropriate to have a single database application that handles both the creation process and the dynamic delivery process, if any.
     

    The "Product Integration" Approach

     
    We think that there is a strong complementarity between existing SGML databases (or XML/SGML-aware document management systems) and products that more specifically handle the on-line delivery of structured documents. Figure 3 above shows how a global web content management solution could be defined as an integration of document management and document delivery products.
     
    The clear advantage of this approach would be to leverage existing SGML databases or document management technologies, while providing very efficient and complete on-line delivery solutions. It should be the way to build cost-effective web content management applications.
