STEP/SGML Standards Working Together   Table of contents   Indexes   SGML - Made SIMPLE

  Van Vooren  Ludo 
 

XML and Legacy Data Conversion

 

Introducing "Consumable Documents"

 

Introduction

 This presentation reviews the advantages of using the Extensible Markup Language (XML) for legacy data conversion. This application of SGML solves numerous conversion problems. By reviewing the advantages of XML in converting legacy data, the presentation shows a migration strategy towards valid SGML information that was not possible before.
 

Challenges of Legacy Data Conversion to SGML

 Let's assume for the purpose of this presentation that one can correctly recognize the sequence of information elements contained in a legacy data document. The difficulty in converting that document to SGML resides in discovering the structure implied by those elements.
 For example, a paragraph followed by a section title implies that a section is starting. But is the paragraph at the end of a previous section? Is it at the same level as the section, simply an introductory paragraph at the beginning of a chapter? Or is it the last paragraph of a list item in a sub-subsection? Answering these questions is the task of a legacy data converter. The problem arises when the answer does not match the structure that the DTD expects. What if the DTD does not allow a paragraph before a section? The legacy document is not compliant with the DTD: dead end.
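 The dead end can be illustrated with a minimal sketch. It assumes a hypothetical chapter content model (a title followed by one or more sections) written as a regular expression over element names; the element names and the helper fits_model are illustrative and do not come from any particular DTD or tool.

```python
import re

# Hypothetical content model, standing in for a DTD declaration such as
# <!ELEMENT chapter (title, section+)>: no paragraph is allowed before
# the first section.
CHAPTER_MODEL = re.compile(r"title( section)+")


def fits_model(element_names, model=CHAPTER_MODEL):
    """Return True if the sequence of recognized elements matches the model."""
    return model.fullmatch(" ".join(element_names)) is not None


# Sequence recognized in the legacy document: an introductory paragraph
# sits between the chapter title and the first section.
legacy_sequence = ["title", "para", "section", "section"]
# Sequence the DTD author had in mind.
ideal_sequence = ["title", "section", "section"]

print(fits_model(legacy_sequence))  # False
print(fits_model(ideal_sequence))   # True
```

 The converter has recognized every element correctly, yet the legacy sequence is rejected purely because of where the paragraph sits.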
 Structural inconsistency, as in the example above, is not the only challenge faced when converting a legacy document to a DTD. What if an information object is simply not allowed by the DTD? For example, the document contains a WARNING but the DTD does not define one.
 Of course these problems can be circumvented. Three techniques are currently used with great success when converting legacy data to SGML. If a problem arises from the document, one can change the source document to make it compliant with the DTD. That is the most logical thing to do. If the document cannot be changed, then changing the DTD is the next best thing. Adding elements and "relaxing" the structural requirements will fix the problem. If neither the document nor the DTD can be changed, then the only solution is to "fake it" by using SGML artifacts such as marked sections, empty structural elements and other inventive uses of the syntax. This is the least recommendable solution, but it is the only one that allows the resulting SGML file to be cleaned up later in an SGML environment.
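 As a rough sketch of the second technique, the snippet below compares a strict and a "relaxed" content model, both written as regular expressions over element names. Both models and the element names are assumptions made for illustration; they are not taken from an actual DTD.

```python
import re

# Hypothetical strict model: <!ELEMENT chapter (title, section+)>
STRICT = re.compile(r"title( section)+")
# Hypothetical relaxed model, with an added element and looser structure:
# <!ELEMENT chapter (title, (para | warning)*, section+)>
RELAXED = re.compile(r"title( (para|warning))*( section)+")

# Element sequence recognized in the legacy document.
legacy = "title para warning section section"

print(bool(STRICT.fullmatch(legacy)))   # False
print(bool(RELAXED.fullmatch(legacy)))  # True
```

 The relaxed model accepts the legacy document, but it also accepts the same loose structure in every future document, which is the cost of this technique.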
 All of these methods require a significant investment, but they have allowed the creation of large SGML repositories of legacy data. Unfortunately, that cost has also caused many documents NOT to be converted to SGML. And therein lies the problem. DTDs have always been written to accommodate future or idealized documents. But an SGML system can rarely be implemented without using legacy data. If legacy data cannot be converted to SGML economically, neither old nor new documents ever benefit from the advantages of SGML. As a result, many documents remain in a proprietary format and get translated into HTML for quick distribution.
 

Conversion to XML

 The new XML technology is about to change all that. Let's still assume that one can correctly recognize the sequence of information elements contained in a legacy data document. The conversion of this document to XML is remarkably simpler than a direct conversion to SGML.
 The structure and elements are taken at face value. As long as they respect the requirements of a well-formed XML document, the document is "compliant". Unlike the dead end faced in the SGML conversion case, the newly converted document becomes immediately "consumable" by XML applications.
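 A minimal sketch of what "taken at face value" means in practice, using the XML parser from the Python standard library. The sample manual and its element names are hypothetical; the only point is that the check consults no DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical converted manual: a paragraph before the first section and a
# warning element, both kept exactly as they were recognized.
converted = """<chapter>
  <title>Maintenance</title>
  <para>Read this chapter before servicing the unit.</para>
  <section>
    <title>Filters</title>
    <warning>Disconnect power first.</warning>
    <para>Replace the filter every six months.</para>
  </section>
</chapter>"""

# Mismatched tags: this is the kind of error that still matters.
broken = "<chapter><title>Maintenance</chapter>"

print(is_well_formed(converted))  # True
print(is_well_formed(broken))     # False
```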
 The "consumable" document has a number of advantages over its legacy version. First, it is viewable and distributable on the Internet and with other document viewers. But unlike its HTML cousin, the XML file is more intelligent. With each piece of its content identified, it is more searchable and integratable. Because the formatting information is not embedded in the document, it is re-formattable and re-usable in other applications.
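 To make "more searchable" concrete, here is a small sketch that pulls every warning out of a converted manual, wherever it appears. The manual content and the warning element name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted manual, kept on compact lines for brevity.
manual = ET.fromstring(
    "<chapter>"
    "<title>Maintenance</title>"
    "<section><title>Filters</title>"
    "<warning>Disconnect power first.</warning>"
    "<para>Replace the filter every six months.</para></section>"
    "<section><title>Belts</title>"
    "<warning>Keep hands clear of moving parts.</warning></section>"
    "</chapter>"
)

# Collect every warning, regardless of its depth in the structure.
warnings = [w.text for w in manual.iter("warning")]
print(warnings)
```

 The same query against an HTML rendition would have to guess which bold or indented paragraphs were meant as warnings.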
 Also, it is very likely that the "consumable" document will have a consistent high-level structure. This will allow a "normal" document management system to fragment the document and manage each individual piece directly.
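 The fragmentation step might look like the sketch below, which assumes the high-level structure is a chapter containing section elements, each with a title. The function name fragment_sections and the dictionary it returns are illustrative; a real document management system would have its own storage interface.

```python
import xml.etree.ElementTree as ET


def fragment_sections(xml_text):
    """Map each top-level section title to that section serialized on its own."""
    root = ET.fromstring(xml_text)
    fragments = {}
    for section in root.findall("section"):
        section.tail = None  # drop whitespace that followed the section
        title = section.findtext("title", default="untitled")
        fragments[title] = ET.tostring(section, encoding="unicode")
    return fragments


# Hypothetical converted manual with two top-level sections.
manual = (
    "<chapter><title>Maintenance</title>"
    "<section><title>Filters</title><para>Replace the filter.</para></section>"
    "<section><title>Belts</title><para>Check tension.</para></section>"
    "</chapter>"
)

for title, fragment in fragment_sections(manual).items():
    print(title, "->", fragment)
```

 Each fragment is itself a well-formed XML document, so it can be stored, retrieved and viewed independently of the rest of the manual.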
 Translating a document from its legacy format into HTML adds little value. But one can almost as easily convert the document to XML and make it more "consumable". Well-formed XML documents are more searchable, integratable, reusable and manageable. If the legacy document will not change, or is not part of a growing collection of documents of the same type, this is an economical way of getting the most out of the document collection. For revisable documents, or documents that are part of a growing collection, the XML data can be the base for a gradual migration to SGML, the markup that will provide the most flexibility and benefits.
 

XML Converted Data Usage

 The advent of XML will bring in a series of XML/SGML "hybrid" technologies. These technologies will allow an SGML system to grow from, and on top of, a base of well-formed XML documents.
 Imagine that a legacy technical manual has been converted to XML. Each section of the manual has been stored in a document management system. An XML formatter and viewer are used to "consume" the document intelligently.
 Any new instance of the document MUST comply with a DTD. This DTD represents the ideal structure, the one that delivers all the benefits of SGML through a consistent, well-structured document. There is no excuse for not creating a new document directly against a DTD. This new document can be stored in the same document management system and can use the same viewing and formatting applications as the XML document. Provided the SGML document avoids the features that XML excludes (such as omitted tags), it is also well-formed XML, so these applications and the document management system itself can operate in this hybrid environment.
 When a section of the legacy document is edited, one should use a hybrid SGML document editor. This editor would load an XML file and an SGML DTD and identify the discrepancies that exist. This analysis technology already exists in the FrameBuilder product. A hybrid editor would point out the differences and suggest ways to fix them. Since the section is being revised anyway, this would be a good time to make the necessary changes to make it compliant with the DTD. The newly compliant section can be saved in the document management system. The resulting hybrid document, with some of its sections in SGML, is still consumable by the XML system. However, slowly but surely, it will become a completely compatible SGML document.
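 The discrepancy analysis such a hybrid editor would perform can be sketched as follows. This is not how FrameBuilder or any existing product works; it is an illustration under stated assumptions. The MODELS table is a hypothetical stand-in for a parsed DTD, mapping each element name to a regular expression over the names of its children, and the function name discrepancies is likewise invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the DTD: element name -> content model, written
# as a regular expression over space-joined child element names. Parsing a
# real SGML DTD is outside the scope of this sketch.
MODELS = {
    "section": r"title( (para|list))+",
    "title": r"",
    "para": r"",
    "list": r"item( item)*",
    "item": r"para( para)*",
}


def discrepancies(element, models=MODELS, path=""):
    """Yield one message per place where the XML departs from the models."""
    here = f"{path}/{element.tag}"
    children = [child.tag for child in element]
    model = models.get(element.tag)
    if model is None:
        yield f"{here}: element not declared"
    elif re.fullmatch(model, " ".join(children)) is None:
        yield f"{here}: children {children} do not match '{model}'"
    for child in element:
        yield from discrepancies(child, models, here)


# Hypothetical section loaded from the XML repository for revision.
section = ET.fromstring(
    "<section><title>Filters</title>"
    "<warning>Disconnect power first.</warning>"
    "<para>Replace the filter every six months.</para></section>"
)

for message in discrepancies(section):
    print(message)
```

 For this sample the report contains two entries: the section's children do not match its content model, and the warning element is not declared. An editor could present exactly these two points to the author who is revising the section.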
 

Conclusion

 Converting legacy documents to XML is the most economical way to add intelligence to your documents and make them immediately "consumable". It also allows you to implement an SGML system for any future document and to use the "hybrid" technology to slowly convert the legacy data to SGML.
