XML and Enterprise Application Integration   Table of contents   Indexes   Converting Flat File Content into XML and Vice Versa

Juergen Lumera
 System Analyst
SPX Valley Forge TIS GmbH
  Gutenbergstr. 25 Garching-Hochbrück (bei München)  Germany (85748)
Email: jlumera@vftis.com Web site: www.vftis.com
 Biography
 The author works on and leads software projects for technical documentation in the automotive industry. After acquiring his computer science diploma and a college degree in engineering, he gained six years of experience in object-oriented software design and development. Before working for SPX Valley Forge TIS, he designed control software for marking lasers.
 

Introduction

Although we have brand new XML-based authoring systems, which give us the capability to create documents containing links, meta data and all the nice features we know from the internet, we are faced with heaps of legacy data. These heaps of legacy data were authored with paper print in mind. Almost no explicit information about the structure and the semantics of this data has been fed into the system.
 
How do we make our dumb legacy data smarter so that we can import it into our cool system? Since we have nothing else, we have to use the layout information, the context the data appears in, some keyword lists and, not to be forgotten, our expert knowledge about the data.
 
Can the whole process of upward semantic data conversion be automated? Probably not: we can only implement guesses. The SmartingUp algorithm might be confronted with ambiguities which a domain expert has to decide. Sometimes the data may need to be reauthored.
 
The purpose of this presentation is to describe the various modules that such a SmartingUp algorithm consists of and how this algorithm can be integrated into an authoring system in the most effective way.
 

Typical Conversion to Authoring System

 
 Typical Authoring System
 The typical import of legacy data is mostly manual because no general algorithm or software is available to do the work. Some parts of the conversion may be handled by a short script (Perl, Omnimark, Rainbow...), but such a script is normally written to convert just the current data. The conversion step itself is handled completely independently of the development of the whole authoring system, either because the customer says "We will not use legacy data with our new system" (-> don't believe them!) or because the conversion problem gets postponed since there is no time (and maybe no need) to solve it so early.
 

Goals and Problems of Legacy Data Conversion

 Converting legacy data for reuse in an authoring system is a more sophisticated task than just converting data to be displayed in a viewer. Only valid XML files can be reused within the authoring system; therefore they have to be validated against a DTD.
 Some of the data which the DTD requires may not be available in the legacy data or may be distributed over a wide area of a single document. Finding a way to gather all the missing or widely distributed data is the most important goal and, at the same time, the most important problem.
 Another important problem with the various kinds of legacy data is that one aspect can be expressed in many different ways (e.g. the element caution may start with the text "CAUTION:" or with an exclamation mark graphic) across different books or, even worse, within one single book or chapter. A very similar problem is that the conversion process has to reduce information from the legacy data (e.g. map many different types of notes to one single type).
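 As a minimal sketch of these two problems, the Python fragment below recognizes a caution written in either of two legacy spellings and reduces several legacy note types to one target type. The placeholder token `[graphic:exclamation]`, the note type names and the function names are assumptions made for illustration; real legacy data would need its own patterns.

```python
import re
from typing import Optional

# Hypothetical legacy spellings of a caution. The graphic placeholder token
# is an assumption about how an exclamation mark image shows up in flat text.
CAUTION_PATTERNS = [
    re.compile(r"^\s*CAUTION:\s*", re.IGNORECASE),
    re.compile(r"^\s*\[graphic:exclamation\]\s*"),
]

# Hypothetical many-to-one reduction of legacy note types to one target type.
NOTE_TYPE_MAP = {"hint": "note", "tip": "note", "remark": "note", "note": "note"}


def caution_body(line: str) -> Optional[str]:
    """Return the caution text without its marker, or None if no marker is found."""
    for pattern in CAUTION_PATTERNS:
        match = pattern.match(line)
        if match:
            return line[match.end():].strip()
    return None


def target_note_type(legacy_type: str) -> str:
    """Map a legacy note type to the target type; unknown types raise KeyError."""
    return NOTE_TYPE_MAP[legacy_type.lower()]
```

 Raising an error for unknown note types, rather than guessing, leaves such cases for a domain expert to decide.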
 Errors in the legacy data have to be detected and fixed during the conversion. A very simple error, for example, is a missing entry in a TOC. A more severe error may be a wrongly applied naming convention for certain tasks.
 The conversion has to be context and/or content sensitive. This means that some parts of the legacy data get mapped to completely different elements even if they have the same layout. For example, a list within a list in the legacy data (numbered A, B, C) could either keep its numbering as is or be mapped to a sub-numbering of its parent list (numbered 1.1, 1.2, 1.3).
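 The nested list example can be sketched as follows. The function name and the `as_subnumbering` flag are hypothetical; how the context actually decides between the two mappings is left open here.

```python
from typing import List


def map_sublist_labels(parent_label: str, labels: List[str],
                       as_subnumbering: bool) -> List[str]:
    """Return the labels a nested list should carry after conversion.

    as_subnumbering stands in for whatever context-sensitive decision the
    conversion makes; it is an assumption of this sketch.
    """
    if not as_subnumbering:
        # Keep the legacy numbering (A, B, C) unchanged.
        return list(labels)
    # Replace the legacy labels by a sub-numbering of the parent item.
    return [f"{parent_label}.{i}" for i in range(1, len(labels) + 1)]
```

 The same legacy layout (a list labelled A, B, C under item 1) thus yields either A, B, C or 1.1, 1.2, 1.3, depending only on the context-derived flag.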
 The conversion process has to handle different types and formats of legacy data at the same time because some information gets lost during a conversion step. For example, data which is not available in electronic format has to be scanned, and an OCR algorithm has to be applied to the images. The text files the OCR algorithm generates do not contain any layout information, and the images do not contain any text information. To put the data into the authoring system, both layout and text information are necessary at the same time.
 If a domain expert detects that some of the data has not been imported correctly, then information about the applied scripts and rules is necessary (e.g. more than one script may generate links, or more than one rule may be used within a script to generate links). Without such information it is very difficult to modify the scripts so that the data is handled correctly.
 All filters, scripts and programs written for the current conversion problem should also be applicable to subsequent conversion tasks. This reduces the time necessary to develop a conversion process and allows even a layman to handle it.
 

Conversion to Authoring System with a Smarting Up Framework

 
 Authoring System with SmartingUp Framework
 
A framework for the conversion process with well-defined interfaces to all other parts of the authoring system provides a formal way to import any kind of legacy data into any kind of authoring system. Such a framework can help us identify problems and gives us the capability to solve them by modifying the parameters for the import, by extending and modifying the framework, or by removing the part which produced the problem from the import set (-> manual conversion). The way in which certain data was converted should be stored within the framework for later use in order to create a self-learning system.
 

The Smarting Up Framework

 
 SmartingUp Framework
 The Smarting Up Framework consists of six different modules. Each module interacts with or depends on the others. The modules only need to understand how to interface with each other. That way it is very easy to add new functions to one module without modifying the other modules.
 
  1. Keyword Sets
     
     A Keyword Set is a collection of words which belong together with regard to a particular aspect. For example, we can define a set of words which are typically found close to a certain tag, or a set of words which indicate a link. These sets can be used within rules.
     Keyword Sets can be defined by two different users:
     
    •  SGML/XML expert during the analysis and design phase
    •  Domain expert during verification of converted data
     They can also be obtained automatically from an authoring system. The authoring system stores information about the content of all tags. The content could come from a real authoring process or from previously imported legacy data.
  2. Rules
     
     Every rule describes how a certain property can be recognized. Usually the name of the rule describes the property. Very trivial properties are, for example, position or length information (more generally -> layout information). The distribution of predefined keyword sets is a more complex property.
     A rule has zero or more arguments which define different instances of that rule. For example, the framework offers a rule to recognize empty lines. The user can create two different instances of that rule: one which recognizes just one empty line (-> before a paragraph) and another which recognizes 2-3 empty lines (-> before a chapter). We might name them SingleEmptyLine and MultipleEmptyLines.
     A rule can consist of other rules, i.e. rules are recursive. Rules can be combined with boolean expressions (AND/OR/NOT...), so it is very easy to create complex rules from very simple ones.
     An action takes place when a rule matches. The default action is to put a start tag before and an end tag after the string which matches the rule. The tag itself comes from the DTD element to which this rule was assigned. (For the moment, actions are hard coded.)
     For each element within the DTD at least one rule must be defined and assigned.
  3. Rules Engine
     
     The rules engine applies the assigned rules for each element. If a rule matches, the engine executes the action and writes an entry for the verification step (with information about the matched rule, the string and the context). When more than one rule matches, the engine creates a conflict entry for the verification step (with information about the conflict, the matched rules, the string and the context).
     The order in which the engine applies the rules (from the most complex elements out to the leaves or vice versa) influences the result. To get the best result, one should apply the rules in several orders and compare the results.
     If the data already contains tags (from a previous iteration step or from hand tagging), the rules engine takes this information into account and applies only suitable rules. A suitable rule is, for example, a rule for an element which, according to the DTD, may occur inside a previously added tag.
     Information entered during a verification step (conflict resolutions, decisions, etc.) is used to apply first those rules to which this information belongs.
  4. Verification Engine
     
     All information written during a run of the rules engine is displayed by the verification engine to a domain expert.
     That expert can then:
     
    •  resolve a conflict (create new rule, apply a different rule, ...)
    •  declare an import part as invalid
    •  declare an import part as valid
    •  tag the data manually
    •  remove a part from the import algorithm (re-authoring)
    •  .....
     After the expert has made all modifications and entries, the rules engine can run again if necessary. Once the expert has decided that no further run is necessary, the verification engine writes all files and updates all external sources (DB and XML files).
  5. User Interface
     
     The user interface allows different users to interact with all modules inside the framework.
     An SGML/XML expert can use it to create keyword sets and rules based on the sets developed during the analysis phase of the legacy data. New requests from the customer can also be entered as rules. All information the expert enters can be used for developing a DTD.
     A domain expert uses the interface to verify all actions performed by the algorithm.
  6. Data IO
     
     The reading and writing of data should be implemented as an IO layer which can transform any input format to the internal representation/navigation of the framework and vice versa.
     Main parts of an IO layer:
     
    •  Read and Write Information from/to Authoring System
    •  Read Legacy Data
    •  Write Converted Data
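 
 As an illustration only, the interplay of keyword sets, rules and the rules engine might be sketched in Python as follows. All names (Rule, starts_with_keyword, max_length, Entry, apply_rules) and the sample data are assumptions made for this sketch and are not taken from an actual implementation of the framework. Chunks are treated as independent strings, so context, rule ordering and already existing tags are ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Rule:
    """A named predicate over one chunk of legacy text (hypothetical model)."""
    name: str
    matches: Callable[[str], bool]

    # Boolean combinators: complex rules are built from simple ones.
    def __and__(self, other: "Rule") -> "Rule":
        return Rule(f"({self.name} AND {other.name})",
                    lambda s: self.matches(s) and other.matches(s))

    def __or__(self, other: "Rule") -> "Rule":
        return Rule(f"({self.name} OR {other.name})",
                    lambda s: self.matches(s) or other.matches(s))

    def __invert__(self) -> "Rule":
        return Rule(f"(NOT {self.name})", lambda s: not self.matches(s))


def starts_with_keyword(name: str, keyword_set: Iterable[str]) -> Rule:
    """Rule instance whose argument is a keyword set."""
    prefixes = tuple(k.lower() for k in keyword_set)
    return Rule(name, lambda s: s.strip().lower().startswith(prefixes))


def max_length(name: str, limit: int) -> Rule:
    """Rule instance whose argument is a layout property (line length)."""
    return Rule(name, lambda s: len(s.strip()) <= limit)


@dataclass
class Entry:
    """Record written for the verification step."""
    kind: str            # "match" or "conflict"
    text: str
    elements: List[str]  # DTD elements whose rules matched
    rules: List[str]     # names of the matching rules


def apply_rules(chunks: Iterable[str],
                assigned: Dict[str, Rule]) -> Tuple[List[str], List[Entry]]:
    """Apply the rule assigned to each element to every chunk."""
    output: List[str] = []
    log: List[Entry] = []
    for chunk in chunks:
        hits = [(el, rule) for el, rule in assigned.items() if rule.matches(chunk)]
        if len(hits) == 1:
            element, rule = hits[0]
            # Default action: wrap the matched string in start and end tags.
            output.append(f"<{element}>{escape(chunk)}</{element}>")
            log.append(Entry("match", chunk, [element], [rule.name]))
        else:
            output.append(chunk)  # left untagged for the domain expert
            if hits:
                log.append(Entry("conflict", chunk,
                                 [el for el, _ in hits],
                                 [rule.name for _, rule in hits]))
    return output, log


# Sample keyword set, rule assignment and legacy chunks (all invented).
CAUTION_WORDS = {"caution:", "warning:"}
starts_caution = starts_with_keyword("StartsWithCautionWord", CAUTION_WORDS)
assigned = {
    "caution": starts_caution,
    "title": max_length("ShortLine", 20) & ~starts_caution,
    "para": ~max_length("MediumLine", 40),
}
chunks = [
    "Brake Service",
    "CAUTION: Hot surface",
    "Caution: Wear eye protection while removing the retaining spring.",
]
tagged, log = apply_rules(chunks, assigned)
```

 In this sample run the first two chunks each match exactly one rule and are tagged, while the third chunk matches both the caution rule and the para rule, so it stays untagged and a conflict entry is written for the domain expert.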
 

RISA - The Algorithm within The Framework

The RuleBase-Iterative-SmartingUp-Algorithm (RISA) works as follows:
 
 
 RISA
 
 
 
 

Conclusion

 With this type of framework we can solve nearly all conversion problems because we can convert both automatically and manually, in each case with the help of a domain expert. The kind of legacy data (graphics, text, ...) is not important. The same is true of the target system. Every improvement we make is reusable during the next import task, so we will have a system which gets better and better over time.
 The idea of a smarting up framework is just one way to handle legacy data, but every other way has to have very similar components. Not every aspect of the framework is necessary for every conversion task, but it is possible to create a very simple framework at the beginning and extend it over time without any modification to the already implemented components. A component for distributed processing of legacy data could be a nice feature for the near future, especially if the pile of legacy data is very high or the experts are distributed all over the world.
 
