
 

The Mapping Problem: From Data to XML and Back

Henry S. Thompson
Principal Scientist
HCRC Language Technology Group
2 Buccleuch Place
Edinburgh, Scotland EH8 9LW
Phone: +44 131 650-4440
Fax: +44 131 650-4587
Email: ht@cogsci.ed.ac.uk
 
Biographical notice:
 
Henry S. Thompson is Reader in Artificial Intelligence and Cognitive Science at the University of Edinburgh, where he is chiefly engaged in research and research management in the Language Technology Group of the Human Communication Research Centre. He has published several language research corpora on CD-ROM, and has developed software systems for XML, XSL, SGML and DSSSL. He was a member of the original W3C SGML Working Group, responsible for the first drafts of the XML standard. He was a co-author of the XML-Data proposal and the original XSL proposal and is now a member of the XSL working group and the XML Schema sub-working group. He is the author of XED, the first freely available XML editor.
 
ABSTRACT:
 
One of the main sources of energy driving the growth of XML is its evident potential as an interface between data stored in more or less proprietary formats and user interface applications: in short, as a means of viewing a database with a browser. Somewhat less visible is the opportunity XML offers as a self-documenting, robust interchange mechanism between data store and data store, with no human perusal involved.
 
Consideration of this kind of XML use gives rise to what has been called ‘the mapping problem’: how automatic can the conversion from data model to document type be, or equivalently, the conversion from data to document and back? This paper explores two alternative routes towards an answer to this question:
 
  •  Conventions for the use of the existing XML 1.0 DTDs
  •  Facilities to be incorporated in XML Schema which go beyond what DTDs can provide
 
We start with an observation and then a fundamental question.
 
Observation: Almost any approach to data modelling includes a distinction between objects and properties, often called Entities and Relations.
 
Question: When we look at an XML DTD for a document used to encode data, how can we tell what element types encode Entities, and what element types encode Relations? [Note: Attributes are tricky, and will be addressed in the full paper, but ignored for now.]
 
If we look at existing DTDs, we can see the extent to which the answer to our question is not obvious on structural grounds alone, although there are some clues. In the DTD for this paper (gcapaper.dtd), for instance, we find
 
Entity: bibitem; para
 
Relation: acknowl; title
 
Both: address; author
 
This categorisation is NOT uniquely determinable on structural grounds alone: I had to use my knowledge of the meanings of the words used as element type names and the overall purpose of the DTD to arrive at it. It follows that I could NOT automatically construct an Entity-Relation version of the information in this document which would get it right.
 
The first approach would involve identifying a set of guidelines for DTD construction such that following them would result in DTDs which WOULD enable the categorisation to be uniquely and automatically determined. The possibilities range from making all element types encode entities and all attributes encode relations (using IDREF-ID links), to requiring that content models enforce an alternation between Entity-encoding element types and Relation-encoding element types as we look up and down the document tree.
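To make the alternation convention concrete, here is a minimal Python sketch. It assumes a hypothetical document in which the root is an Entity, every child of an Entity element is a Relation, and every child of a Relation element is an Entity. The element names and the `id` attribute are invented for illustration and are not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the alternation convention:
# Entity elements and Relation elements alternate down the tree.
DOC = """
<paper id="p1">
  <author>
    <person id="a1">
      <affiliation>
        <organisation id="o1"/>
      </affiliation>
    </person>
  </author>
  <cites>
    <paper id="p2"/>
  </cites>
</paper>
"""

def triples(entity):
    """Yield (subject id, relation name, object id) for an Entity element."""
    for relation in entity:        # children of an Entity are Relations
        for target in relation:    # children of a Relation are Entities
            yield (entity.get("id"), relation.tag, target.get("id"))
            yield from triples(target)

root = ET.fromstring(DOC)
result = list(triples(root))
for triple in result:
    print(triple)
```

Because position in the tree alone decides the category, the walk needs no knowledge of what the element type names mean, which is the property such a convention is meant to secure.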
 
The second approach could address the mapping issue directly or indirectly: the XML Schema WG could take on board a requirement to allow schema authors to indicate, for each element type, whether it encodes an entity, a relation, or both. Alternatively, we could structure schemas in two levels: one for defining entities and relations as such, and the second for linearising them into documents.
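The first of these options can be sketched in Python as follows, under loud assumptions: the `ROLES` table stands in for annotations a schema author might supply, and its format, the sample document, and the element types `paper`, `name` and `city` are all hypothetical. Only the roles given to `title`, `author` and `address` follow the categorisation above.

```python
import xml.etree.ElementTree as ET
from itertools import count

# Hypothetical per-element-type annotations (an assumed format).
ROLES = {
    "paper": "entity",
    "title": "relation",
    "author": "both",
    "name": "relation",
    "address": "both",
    "city": "relation",
}

DOC = """
<paper>
  <title>The Mapping Problem</title>
  <author>
    <name>Henry S. Thompson</name>
    <address><city>Edinburgh</city></address>
  </author>
</paper>
"""

def walk(elem, subject, ids, out):
    """Append (subject, relation, object) triples for elem's children."""
    for child in elem:
        role = ROLES[child.tag]
        if role == "relation":
            # A pure relation links the subject to the child's text value.
            out.append((subject, child.tag, (child.text or "").strip()))
        elif role == "both":
            # The element names a relation and also denotes a new entity.
            node = f"{child.tag}{next(ids)}"
            out.append((subject, child.tag, node))
            walk(child, node, ids, out)
        else:
            # A bare entity child carries no relation name of its own;
            # this sketch deliberately does not guess one.
            raise NotImplementedError(child.tag)

root = ET.fromstring(DOC)
ids = count(1)
subject = f"{root.tag}{next(ids)}"
triples = []
walk(root, subject, ids, triples)
```

With the annotations in hand, the conversion is driven entirely by the table, so no knowledge of the meaning of the element type names is needed at conversion time.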
 
In this paper I sketch enough of an implementation of both of these ideas so that they can be sensibly compared to one another, and to the various versions of the first approach.
 
Finally, the integration of links into this story will be addressed. When a relational database is designed for a particular data model, decisions must be made about granularity, duplication and which relations contain foreign keys. Similarly, in designing DTDs we often have to choose between containment and reference via link, whether short-range (IDREF-ID) or long-range (XML Link).
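The choice between containment and a short-range link can be illustrated with a small Python sketch. The two documents below are hypothetical, as are the `cites` element and the `ref` attribute, which a DTD would have to declare as IDREF. The sketch shows only that both encodings can be normalised to the same relation, much as a foreign key and a nested record can express the same fact.

```python
import xml.etree.ElementTree as ET

# Containment: the cited item is nested inside the citing paper.
NESTED = """
<doc>
  <paper id="p1"><cites><bibitem id="b1"/></cites></paper>
</doc>
"""

# Reference: the item lives elsewhere and is pointed to by an attribute
# assumed to be declared as IDREF (hypothetically named "ref").
LINKED = """
<doc>
  <paper id="p1"><cites ref="b1"/></paper>
  <bibitem id="b1"/>
</doc>
"""

def relations(xml_text):
    """Return the set of (paper id, relation name, target id) triples."""
    root = ET.fromstring(xml_text)
    found = set()
    for paper in root.iter("paper"):
        for rel in paper:
            if rel.get("ref") is not None:   # short-range link
                found.add((paper.get("id"), rel.tag, rel.get("ref")))
            for target in rel:               # containment
                found.add((paper.get("id"), rel.tag, target.get("id")))
    return found

same = relations(NESTED) == relations(LINKED)
print(same)
```

Both encodings carry the same information here, so the choice between them rests on considerations such as duplication and granularity rather than on expressive power.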
 
