| William C. Burkett |
| Senior Information Engineer |
| Product Data Integration Technologies, Inc. |
| 100 W Broadway, Suite 540, Long Beach, California 90802, USA |
| Email: wburkett@pdit.com |
| Biography |
The Search for Meaning on the World Wide Web |
| "Web resource" is used in the RDF sense [10] to denote an identifiable object on the World Wide Web, such as a web page. |
Product Data Markup Language - PDML |
| The relationship between these components is illustrated in Figure 1. As PDML grows, additional transaction sets will be added to the specification. |
|
|
Data Models, XML, and Web Resources |
Data models |
| In other words, the term "data model" is used in two primary senses: |
| The "Cambridge Communique" [13] uses "data model" in the former sense. In this paper, the latter sense is used, since this definition ostensibly reflects the objectives of XML and Content Schemas: to specify the meaning and structure of the content of web resources. The important feature of this use of data models is that they hide the internal, storage-dependent aspects of the data and concentrate on the information that is known by and available to the users. |
| There are many different kinds of data modelling languages, each of which has its strengths and shortcomings. What they all share is a structure which |
| This description can also be applied to XML. |
The role of XML on the World Wide Web |
| The hope of XML is to bring meaning to the web, provide a mechanism for organizing and categorizing the huge cacophony of content deployed on the web, and make the content more useful for users of the web. XML itself, however, provides virtually no semantics; XML is simply a data encoding mechanism. The semantics of XML documents are ostensibly specified in Content Schemas (that can assume a variety of specification formats, such as DTD's, DCD, XML Schema or XML Data) and - if one is lucky - natural language definitions of the elements in the schema. |
| A very important aspect of the role of XML in "bringing meaning to the web" is that this "meaning" is not primarily intended for humans, but rather for applications and automated agents that access and exchange the data. The web already has a format intended for human consumption: HTML. What this means to the design of Content Schemas is that the schema must be well-structured enough and semantically clear enough for the creators of these applications and agents to write code against. |
| Which is exactly what data models and data modelling languages are intended to provide! The question, then, is whether data models offer any significant advantages over current languages for specifying Content Schemas (e.g., DTD's, XML Schema) in fulfilling the role envisioned for XML. This question is explored below, after a brief introduction to EXPRESS. |
EXPRESS |
| Despite the fact that the name of the language is written in upper case letters, EXPRESS is not an acronym. It is an ISO standard (ISO 10303-11 [9]) and a self-described "information modelling language" developed to specify the semantics of industrial product data for the purpose of exchanging and sharing data between and among industrial product data systems. It is a synthesis of features of the Entity-Relationship and Object-Oriented data modelling approaches and, in its lexical form, bears a strong resemblance to Pascal record declarations. |
| While this presentation cannot provide a comprehensive tutorial on the EXPRESS language, an explanation of the following major concepts will provide a valuable introduction to the language: |
Entities |
| The fundamental construct of the EXPRESS language is the entity. An entity is the representation of a concept-of-significance within an application domain. It specifies the name, the properties, and the meaning of the domain concept and the data instances governed by it. The following is an example entity (data type) declaration in EXPRESS: |
ENTITY product; |
name : STRING; |
identifier : STRING; |
description : STRING; |
END_ENTITY; |
| There is also a graphical version of EXPRESS called EXPRESS-G; the graphical equivalent of the declaration above is: |
|
| An entity is comprised of properties called attributes. The product entity above has three attributes: name, identifier, and description. |
| An instance of an entity is an identifiable member of a data population that conforms to the entity data type declaration. To conform to the declaration, the instance must have identity, a type, and values for each of the declared attributes. See Figure 4. |
|
Schemas |
| A schema is a collection of EXPRESS declarations that establishes a bounded scope for the declarations and may be considered as a "container" for declarations like entities. Schemas cannot be nested. |
| A schema governs the structure and meaning of the instances in a given data population. |
| The following is an example of a schema declaration containing some entity declarations: |
SCHEMA product_definition_schema; |
ENTITY product; |
id : identifier; |
name : label; |
description : text; |
frame_of_reference : SET [1:?] OF product_context; |
UNIQUE |
UR1: id; |
END_ENTITY; |
ENTITY product_category; |
name : label; |
description : OPTIONAL text; |
END_ENTITY; |
ENTITY product_related_product_category |
SUBTYPE OF (product_category); |
products : SET [1:?] OF product; |
END_ENTITY; |
- |
- |
END_SCHEMA; |
Attributes and Data Types |
| Attributes are named properties of an entity. An attribute consists of a role name and a data type. In the following entity declaration… |
ENTITY product; |
name : STRING; |
identifier : STRING; |
description : STRING; |
END_ENTITY; |
| …"name", "identifier", and "description" are attributes, each of which has a data type of "STRING". The role name describes the relationship of a datatype value to the entity. The datatype is the name of a domain from which values in instances are drawn. "String" is one of several simple data types defined in EXPRESS; others include integer, real, and boolean. |
| Relationships between entities are established by using an entity as the data type of an attribute rather than a simple type. For example, the statement "person owns a car" is modelled with the following EXPRESS declarations: |
ENTITY person; |
name : STRING; |
owns : car; |
END_ENTITY; |
ENTITY car; |
year : INTEGER; |
make : STRING; |
model : STRING; |
END_ENTITY; |
| The equivalent declarations in EXPRESS-G are: |
|
| The data type of an attribute may be specified as an aggregate, which changes the cardinality of the relationship from exactly one to, for example, one-or-more. The following declaration: |
ENTITY person; |
name : STRING; |
owns : SET [1:?] OF car; |
END_ENTITY; |
| …states that a "person owns 1 or more cars". Note that the inverse cardinality of this relationship is zero or more: "a car may be owned by zero, one, or many persons". |
Constraints |
| There are two principal constraint mechanisms in EXPRESS: Local Rules and Global Rules. Local Rules are part of an entity declaration and specify constraints applicable to each instance of an entity. For example: |
ENTITY time; |
hour : INTEGER; |
minute : INTEGER; |
second : INTEGER; |
WHERE |
WR1: hour < 24; |
WR2: minute < 60; |
WR3: second < 60; |
END_ENTITY; |
| The three "where" rules of the time entity declaration specify constraints on the permissible values of the attributes of an instance of time. |
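| Because such rules are enforced by the applications that process the data rather than by the data itself, it may help to see what a check of this kind looks like in code. The following Python sketch is an illustration only and is not part of EXPRESS or of any EXPRESS toolkit: the class name Time and the method violated_rules are hypothetical, and the three conditions simply transcribe WR1 through WR3 of the time entity. |

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Time:
    """Hypothetical in-memory counterpart of the EXPRESS entity 'time'."""
    hour: int
    minute: int
    second: int

    def violated_rules(self) -> List[str]:
        """Return the labels of the WHERE rules this instance violates."""
        rules = {
            "WR1": self.hour < 24,
            "WR2": self.minute < 60,
            "WR3": self.second < 60,
        }
        return [label for label, satisfied in rules.items() if not satisfied]


print(Time(23, 59, 59).violated_rules())
print(Time(24, 0, 61).violated_rules())
```

| The second instance would be reported as violating WR1 and WR3. Like the EXPRESS declaration, the sketch checks upper bounds only. |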
| Global Rules are declarations within a schema (peer-level with entities) that constrain existence, relationships, and values of and among entity instances. For example, without using the aggregate bound specifications, a Global Rule could be used to specify that a person must own 3 or more cars: |
RULE owns_three_cars FOR (person); |
LOCAL |
num_cars : INTEGER; |
END_LOCAL; |
num_cars := SIZEOF(person.owns); |
WHERE |
num_cars >= 3; |
END_RULE; |
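| The intent of this Global Rule can also be expressed as a check over a whole data population. The Python sketch below is only an approximation of that intent under stated assumptions: the classes Car and Person, the function owns_three_cars, and the sample data are all hypothetical stand-ins for the person and car entities, not output of an EXPRESS processor. |

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Car:
    year: int
    make: str
    model: str


@dataclass
class Person:
    name: str
    owns: List[Car] = field(default_factory=list)


def owns_three_cars(population: List[Person]) -> bool:
    """Return True when every person in the population owns 3 or more cars."""
    return all(len(person.owns) >= 3 for person in population)


# Invented sample data: Pat owns three cars, Lee owns only one.
fleet = [Car(1998, "Ford", "Escort"), Car(1999, "Saab", "900"), Car(1997, "Honda", "Civic")]
population = [Person("Pat", fleet), Person("Lee", fleet[:1])]
print(owns_three_cars(population))  # False, because Lee owns only one car
```

| Unlike a Local Rule, the check takes the population as its argument rather than a single instance, which mirrors the peer-level position of Global Rules within a schema. |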
| Additional information and tutorials about the EXPRESS language can be found at http://www.epmtech.jotne.com/learn |
Functional comparison of data models and XML |
| The principal requirements of data models and data modelling languages (derived from the definitions above) are to specify |
| In the following comparison of features, the EXPRESS language will be used as an example of a data modelling language. EXPRESS is richer in features than other data modelling languages, but shares the same graph-based structuring common to all data model languages. |
Structure |
| The first objective - the structure of data - is the entire raison d'être for XML. XML provides a formal structuring syntax that is well-understood by the web development community (due to exposure to HTML). The primary structural mechanism of XML is containment - the hierarchical nesting of elements within elements. |
| If one looks at the evolution of data model structuring paradigms: |
| It is evident that the progress is toward graph-structured representations of data. The reason for this is simple: although the simple structuring approaches (e.g., hierarchies) are easy to understand and process, they are far too limiting and subject to errors. Graph structures permit the reuse of data objects through common reference to the shared object. The graph structure is reflected in EXPRESS in the "pointing" relationship between one entity and another. |
| As the design of web resources evolves to accommodate applications and automated agents, there is every reason to believe that they, too, will evolve along this same cline. Thus, the underused ID/IDREF feature of XML takes on a new significance and importance in XML documents. |
Integrity Constraints |
| The strength of modern data modelling languages such as EXPRESS is in the specification of data integrity constraints. Other than constraints imposed by structure and by cardinality operators, XML is essentially bereft of integrity constraints. (Constraints can be included as metadata in XML, but mechanisms are not inherent in the language.) |
| It should be noted that data models are exactly the same as XML DTD's with respect to the data instances (e.g., XML documents) in that any integrity constraints must be enforced by the applications that produce and consume the data - the data itself does not contain the constraints. The difference is that data modelling languages such as EXPRESS provide formal constraint specification features as part of the language. |
Semantics |
| Given the objective of XML bringing meaning to the web, the most important comparison between data modelling languages and XML is in the ability to specify the semantics of the structured data. It is ironic that, despite the importance of semantics to XML, there is virtually no investigation into or definition of semantics from the point of view of linguistics, cognitive psychology, or epistemology. Even investigations into the semantics of XML with respect to legal contracts [12] - an extremely important topic in e-commerce development - fail to examine the slipperiness of semantics from the natural language standpoint. |
| Linguistically, semantics is defined as: |
| The meaning of these definitions, of course, hinges upon the meaning of the term "meaning": 1) |
| This paper is not the proper place for a complete explanation of the relevance and relationship of linguistic theory to data models and XML. However, the important aspect of semantics and meaning highlighted in the above definitions and relevant to this discussion is that meaning always pertains to the human mind. This leads to the potentially controversial assertion that data - XML or otherwise - has no meaning unless it is interpreted by a human. "Interpretation" by applications or agents is simply an indirect interpretation of the programmer; applications/agents only process data - they don't "understand" or "interpret" it. |
| Therefore, the effectiveness of a Content Schema as the specification of meaning of an XML document depends upon how well it evokes the same interpretations in different readers - and thereby its correct use by programmers. The ability of the Content Schema to evoke the same interpretations depends on |
| Because of the interplay of these factors, data models are not inherently better than conventional Content Schema languages in the specification of meaning, but data models do provide more objective features with which to specify the meaning. For example, data modelling languages such as EXPRESS offer |
| All of which supports the argument that data models are better than XML DTDs at specifying the semantics of data. However, the Devil's Advocate must point out that a surfeit of semantic features does not, in itself, mean that a data modelling language is better than conventional content specification approaches. |
| Good practices and conventions can overcome any shortcomings within a language; in addition, it could easily be argued that too many features are an impediment to designing good Content Schemas. |
Mapping of EXPRESS to XML DTD syntax |
| Because EXPRESS and XML DTDs are both composed of named, primary objects, the initial mapping of EXPRESS entities to XML elements is almost a no-brainer. For example, the EXPRESS declaration: |
ENTITY product; |
name : STRING; |
identifier : STRING; |
description : STRING; |
END_ENTITY; |
| …is converted to the following XML declaration: |
<!ELEMENT product (product.name, product.identifier, product.description)> |
<!ELEMENT product.name (#PCDATA)> |
<!ELEMENT product.identifier (#PCDATA)> |
<!ELEMENT product.description (#PCDATA)> |
| …and a conforming XML instance: |
<product> |
<product.name>printer</product.name> |
<product.identifier>PS775</product.identifier> |
<product.description>color inkjet</product.description> |
</product> |
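| The conversion of the product entity is mechanical enough to be sketched in a few lines of code. The Python function below is a minimal illustration and not the EXML tool: the name entity_to_dtd is hypothetical, and the sketch assumes an entity whose attributes are all simple string values. |

```python
from typing import List


def entity_to_dtd(entity: str, attributes: List[str]) -> str:
    """Emit DTD element declarations for an entity with string-valued attributes.

    Each attribute becomes a child element named <entity>.<attribute>.
    """
    children = ", ".join(f"{entity}.{name}" for name in attributes)
    lines = [f"<!ELEMENT {entity} ({children})>"]
    lines += [f"<!ELEMENT {entity}.{name} (#PCDATA)>" for name in attributes]
    return "\n".join(lines)


dtd = entity_to_dtd("product", ["name", "identifier", "description"])
print(dtd)
```

| Run on the product entity, the sketch prints the same four element declarations given for it above. |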
| The more challenging questions arise in mapping details and have to do with semantic subtleties of the languages. For example, this particular mapping is just one way of converting the EXPRESS to XML declarations. This particular approach is called an early binding approach because the resulting DTD uses the terminology and structure of a particular EXPRESS schema. The alternative is a late binding approach in which the concepts of the EXPRESS language itself are mapped to XML: |
<!ELEMENT entity (attribute+)> |
<!ATTLIST entity name CDATA #REQUIRED> |
<!ELEMENT attribute (…)> |
| In the following discussion, an early binding will be assumed. A complete EXPRESS schema and corresponding DTD based on an early binding mapping can be found at http://www.pdit.com/pdml/EXP2DTD.txt |
| There are a number of particular aspects of the mapping between EXPRESS and XML DTD's that require some discussion. |
| These include: |
Schemas |
| Schemas actually map very nicely from EXPRESS into XML. In EXPRESS, a schema specifies a domain of values and thus "bounds" the collection of values. Thus, a schema maps naturally to the root element of an XML DTD; this element then serves as a container for entity instances. |
| Because EXPRESS schemas are structured as a network rather than as a hierarchy, entities generally bear a peer-to-peer relationship to one another. As such, a valid XML document corresponding to the schema can contain zero, one, or more entity instances; the content model of the root element, therefore, is a giant choice particle containing the names of the independent entities in the schema: |
<!ELEMENT product_definition_schema ((application_context | application_context_element | document | document_type | effectivity | product | product_category | product_category_relationship | product_definition | product_definition_formation | product_definition_formation_relationship | product_definition_relationship | product_definition_substitute)*)> |
| The schema element also provides the appropriate place for specifying schema-related metadata, such as origin and date of the schema. |
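| The content model of the root element can be generated directly from the list of independent entities in a schema. The following Python sketch shows one way this might be done; the function name schema_root_element is hypothetical, and the entity list used in the example is only a short excerpt of the names in the declaration above. |

```python
from typing import List


def schema_root_element(schema: str, entities: List[str]) -> str:
    """Declare the root element as a repeatable choice over the schema's entities."""
    choice = " | ".join(entities)
    return f"<!ELEMENT {schema} (({choice})*)>"


declaration = schema_root_element(
    "product_definition_schema",
    ["application_context", "document", "product", "product_category"],
)
print(declaration)
```

| The repeatable choice group is what allows a valid document to contain zero, one, or more entity instances in any order. |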
Attributes |
| A rather amusing difference between EXPRESS and XML is that both languages contain a feature called "attribute", but not only do these uses of "attribute" not mean the same thing, the thing that they mean within each language is not present at all in the other. |
| This requires a bit of explanation. In EXPRESS, an attribute of an entity is a named property of the entity where the name of the attribute describes the role of the datatype with respect to the entity. For example, in the following declaration: |
ENTITY geometric_point; |
x : REAL; |
y : REAL; |
z : REAL; |
END_ENTITY; |
| "x" describes the role of a REAL value with respect to the entity "geometric_point." XML has no equivalent facility! |
| The only way that role names can be introduced is by adding an additional level of element declaration that captures the role name: |
<!ELEMENT geometric_point (geometric_point.x, geometric_point.y, geometric_point.z)> |
<!ELEMENT geometric_point.x (real)> |
<!ELEMENT geometric_point.y (real)> |
<!ELEMENT geometric_point.z (real)> |
<!ELEMENT real (#PCDATA)> |
| On the other hand, EXPRESS has no mechanisms corresponding to XML tag attributes for providing metadata about the content. Any such metadata would be indistinguishable from other EXPRESS-declared data. Another thing to note about EXPRESS attributes is that names of EXPRESS attributes are local to the scope of the entity declaration. |
| Therefore, if two entities have an attribute called "name", then they are two different "names". This is reflected in the mapping by prepending the entity name to the attribute name when declaring an element for the attribute, as can be seen in the examples throughout this paper. |
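| The effect of this naming convention can be shown with a small sketch. The Python function below is illustrative only; its name, qualified_element_names, is hypothetical, and the input is a simplified view of the product and product_category entities from the example schema. |

```python
from typing import Dict, List


def qualified_element_names(entities: Dict[str, List[str]]) -> List[str]:
    """Prefix each attribute name with the name of its declaring entity."""
    return [
        f"{entity}.{attribute}"
        for entity, attributes in entities.items()
        for attribute in attributes
    ]


names = qualified_element_names({
    "product": ["id", "name", "description"],
    "product_category": ["name", "description"],
})
print(names)
```

| Both entities declare "name" and "description", yet the resulting element names are all distinct, which is what a DTD requires since element names are global to the DTD. |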
References |
| The primary relationship between elements in an XML Document/DTD is that of containment - an element is contained within another element according to the ordering and cardinalities specified in the content model of the parent element. The hierarchical structure is illustrated and highlighted in most XML tools, as exemplified in Figure 6: |
|
| As already pointed out, EXPRESS does not have the same "container semantics", but rather treats all data objects as first-class objects and establishes relationships between entities by "pointing" from one entity to another. The natural structure is a network: |
|
| Mimicking the network structure of EXPRESS in XML required the development of a convention for handling "pointers" in XML. The convention adopted was the creation of a "handle" element that was a companion of and named after an entity. Given the person-owns-car example from above: |
ENTITY person; |
name : STRING; |
owns : car; |
END_ENTITY; |
ENTITY car; |
year : INTEGER; |
make : STRING; |
model : STRING; |
END_ENTITY; |
| The XML declarations would be: |
<!ELEMENT person (person.name, person.owns)> |
<!ELEMENT person.name (#PCDATA)> |
<!ELEMENT person.owns (car_ref)> |
<!ELEMENT car (car.year, car.make, car.model)> |
<!ATTLIST car id ID #REQUIRED> |
<!ELEMENT car_ref EMPTY> |
<!ATTLIST car_ref refid IDREF #REQUIRED> |
| The "handle" or "pointing device" is an EMPTY element called car_ref. This element would appear in the content model of the "owns" attribute of person. The car_ref element contains a refid value which, by convention, references a "car" element with an equal id value. (The car element is at the same level as the "person" element.) |
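| An application that reads such a document has to follow the handle from the car_ref element to the car element itself. The Python sketch below shows one way this might be done with the standard xml.etree.ElementTree module. It rests on several assumptions: the root element name vehicle_schema and all data values are invented for illustration, and because ElementTree is a non-validating parser, the id and refid attributes are matched as plain strings rather than as DTD-declared ID and IDREF values. |

```python
import xml.etree.ElementTree as ET

# The root element name "vehicle_schema" and all data values are invented
# for illustration; the child elements follow the declarations above.
DOCUMENT = """
<vehicle_schema>
  <person>
    <person.name>Pat</person.name>
    <person.owns><car_ref refid="c1"/></person.owns>
  </person>
  <car id="c1">
    <car.year>1999</car.year>
    <car.make>Saab</car.make>
    <car.model>900</car.model>
  </car>
</vehicle_schema>
"""

root = ET.fromstring(DOCUMENT.strip())

# Index every car element by its id attribute.
cars_by_id = {car.get("id"): car for car in root.iter("car")}


def resolve(handle):
    """Follow a car_ref handle to the car element with an equal id value."""
    return cars_by_id[handle.get("refid")]


handle = next(root.iter("car_ref"))
car = resolve(handle)
fields = {child.tag: child.text for child in car}
print(fields["car.make"], fields["car.model"])
```

| Because the car is a peer of the person rather than its child, a second person could refer to the same car simply by carrying another car_ref with the same refid value. |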
Conformance |
| With the introduction of a data model as the specification of the content of an XML Document, the notion of Schema Validity is introduced as well: |
| The extra level of conformance called Schema Validity is recognized in the Cambridge Communique [13]. |
Frequently Asked Questions |
Where is this work being used? Where is it being done? |
| The use of the EXPRESS data modelling language for the content specification of an XML document was introduced in the PDML Project (www.pdit.com/pdml). As part of this project, the initial Early Binding specifications were developed and applied, and a small tool called EXML was developed to convert the EXPRESS schema to a DTD. The examples included in the paper were produced with EXML. EXML is available free at: |
| http://www.pdit.com/pdml/exmlintro.html |
| The EXPRESS language was developed in and standardized through ISO TC 184/SC4. The Early Binding work presented here is a contribution to a larger effort within SC4 to develop standardized bindings between EXPRESS and XML DTD syntax. The project conducting the work is a joint effort of SC4 and ISO/IEC JTC1/SC34 (SGML). The ISO designation of the bindings once they become standardized will be ISO 10303-28 (Binding of EXPRESS to XML). |
Why not use UML? XML Schema? XML Data? XML Information set? |
| As a data modelling language, UML could have been used as a content specification language in PDML. UML (Unified Modeling Language [8]) is a more widely known and popular language than EXPRESS, and has richer, more expressive features than EXPRESS. However, UML, as an object-modelling language, has a different purpose than EXPRESS. "Objects" are not "entities". Objects in UML "do" something - they have functionality and capabilities and lend themselves to the development of application systems. Entities in EXPRESS, on the other hand, don't "do" anything other than represent a real-world concept and don't lead to application system designs or functionality. It was felt that UML is over-featured with respect to the requirements of PDML. |
| There are too many aspects of the language irrelevant to project objectives. XML Schema and XML Data are intended to perform the same function as a DTD, but do it as an XML document rather than as a standard XML DTD. They are equivalent to DTDs and, thus, are neither better nor worse than DTDs as content specification languages. |
| XML Information Set is "an abstract data set …[which is] a description of the information available in a well-formed XML document." [5] Like XML Schema and XML Data, XML Information Set also takes as its domain concepts the things that are found in DTDs and XML documents. The primary difference between them is that XML Infoset appears to be trying to describe abstractly (i.e., to describe the contents without specifying the physical structure) the kinds of information that may be obtained from an XML document by, for example, an API. The objectives of the XML Infoset work don't seem to be directed at the role of a Content Schema specification language. |
Do web resources need all the rigor introduced by data models? |
| No. |
| For a large number of applications, web resources do not require the rigorous mechanisms entailed in data models because the purpose of the resource might not include data management. Presentation and simplistic, one-off data exchanges don't need to meet the requirements of a long-term data resource. |
| However, since XML is ostensibly targeting automated processing of web resources, the ease and correctness of that processing would be greatly aided by good data management principles and practice. Therefore, while data models would be overkill for many web applications, the growth of the web toward a Semantic Web, where agents can find the semantically-correct information that they are searching for, will require strong semantic specification languages - and a ton of good practice! |
Summary and Conclusions |
| The World Wide Web is still evolving and will probably continue to evolve in perpetuity. The growth in the recognition and desire for more data semantics on the web (i.e., "intelligent" data that supports and encourages application interoperability) will drive the evolution of web resources and the sophistication (and complexity!) of encoding techniques. XML is a mechanism that provides a step in that direction, but it is not enough. Semantic Content Specification Languages are needed that are applicable and usable across platforms to clearly specify the semantics and structure of data available on the web. Furthermore, there is a growth curve that the technology evolution must follow - mistakes will be made, and lessons will be learned. |
| The long history of data model usage, data exchange, and application interoperability that is part of industrial information technology development provides a wealth of mistakes and lessons that can directly support the semantic evolution of web resources. The "Cambridge Communique" [13] recognizes the importance and role of data models with respect to web resources and the Semantic Web, but fails to cite the applicability of existing data model usage and research. |
| The Content Schemas that specify the semantics of web resources must be independent of the encoding syntax. XML Infoset takes a step in this direction, but maintains vestiges of its document-oriented origin. Data modelling languages such as EXPRESS provide a mechanism that is both rich in semantic features and mappable to XML and other encoding syntaxes. This paper has illustrated an example and presented some of the details of the use of EXPRESS as an XML document Content Schema. |
Bibliography |
|