| The Use of Profiling in XML Documents | Table of contents | Indexes | Data Models as an XML Schema Development Method | |||
| Buck, Lee Extensibility, Inc. | Lee Buck |
| Chief Technology Officer |
| Extensibility, Inc. |
| Biography |
Introduction |
Motivation |
Modeling Spectrum |
The Example |
TABLE EMPLOYEE |
NUM LONGINT PRIMARY KEY |
FNAME STRING 32 |
LNAME STRING 32 |
HIRE_DATE DATE |
TERM_DATE DATE MAY BE NULL |
TABLE PERF_REVIEW |
EMP_NUM LONGINT PRIMARY KEY FOREIGN KEY |
REVIEW_DATE DATE PRIMARY KEY |
REVIEW TEXT |
TABLE COMP_CHANGE |
EMP_NUM LONGINT FOREIGN KEY |
REVIEW_DATE DATE MAY BE NULL |
EFF_DATE DATE |
SALARY INT |
| At its simplest, a XML document of information drawn from this example might look like this: |
<EMPLOYEE |
NUM = '2361' |
FNAME = 'Wilbert' |
LNAME = 'Winston' |
HIRE_DATE = '4/4/88'> |
</EMPLOYEE> |
| To make these improvements we'll need to capture additional information in the schema using techniques outlined next. |
Extending the DTD |
<!ATTLIST quark bark CDATA #IMPLIED |
bite CDATA #IMPLIED |
e-color NMTOKEN #FIXED 'red' |
a-color NMTOKENS #FIXED 'bark blue bite pink' |
> |
| For our purposes we'll need four such properties: |
|
| These metadata properties are also appropriate with minor modification to other schema syntaxes such as XML Data and SOX. Essentially, they become additional attributes on the element type or attribute type declaration. |
Modeling Datatypes |
| Traditional data sources tend to be strongly typed. While an XML document, by definition, will present its information in textual form, retaining knowledge about the underlying data type is essential. The metadata property dtype (as outlined above) will be used to capture datatype information. What set of datatype values should we use? While a comprehensive list of datatypes is an elusive (if not illusory) goal, a useful set has been compiled from the various schema submissions to the w3c. We'll adopt the convention of using these datatype names. |
|
| A related problem arises around the issue of storage size. For example, how many bytes are needed to store a value of type "int" or how many characters long may a value of type string be? And even,"what is the precision of a value of type number?" We use the metadata property dsize to capture this information. The following meaning is assigned to values of dsize: |
|
| Note that this encoding of datatype information within a DTD does not beget validation behavior by itself. Such validation is beyond the capabilities of XML 1.0 parsers. However, using these methods, you can make such information available to your custom code for proper validation. Indeed using notations matched in name to your dtype values you can bind a reference to actual validation modules to each datatype. |
Modeling Relationships |
| Much of the power and complexity of mapping data to XML comes from mapping the relationships between pieces of data. In the relational world this equates to the modeling of primary-foreign key relationships. Let's look at each in turn. |
Primary Keys |
| A primary key provides a unique value by which a single row of a particular table may be accessed. XML has a similar concept of an ID attribute that provides unique access to an element. Many implementation of DOM provide indexed access to such elements so leveraging the similarities is often desirable. However, a subtlety arises: XML IDs must be unique across the whole document, a primary key is unique only within that column. |
| To provide the necessary global uniqueness, we use a supplemental pkey_id attribute to hold a version of the primary key data that we have made globally unique. To do this we take advantage of two facts: 1) the primary key is unique within the context of the element type that contains each row (named, as you'll see, after the table itself), and 2) the element type's name is unique within the document. So, to make the locally unique name globally unique, we prepend the element name to the key value. For example: |
<EMPLOYEE pkey_id = "EMPLOYEE.2361"> |
<EMPLOYEE.NUM>2361</EMPLOYEE.NUM> |
<EMPLOYEE.FNAME>Wilbert</EMPLOYEE.FNAME> |
<EMPLOYEE.LNAME>Winston</EMPLOYEE.LNAME> |
<EMPLOYEE.HIRE_DATE>4/4/88</EMPLOYEE.HIRE_DATE> |
</EMPLOYEE> |
Foreign Keys |
| The presence of foreign keys within a table provides the actual glue that binds different tables together. Just as XML's ID concept provided a useful analog for primary keys, so too its IDREF notion provides a powerful analog to foreign keys. Of course issues of uniqueness scope intervene here as well. To accommodate them we use a similar technique. We create a new IDREF attribute for the table element whose name is derived from the column name (with an _idref appended). Values in a document will be constructed in a similar fashion as for pkey_id attributes (indeed the production must match exactly otherwise the whole point is lost). For example: |
<EMPLOYEE pkey_id = "EMPLOYEE.2361"> |
<EMPLOYEE.NUM>2361</EMPLOYEE.NUM> |
... |
<PERF_REVIEW PERF_REVIEW.EMP_NUM_idref = "EMPLOYEE.2361"> |
<PERF_REVIEW.EMP_NUM>2361</PERF_REVIEW.EMP_NUM> |
... |
| Another problem that arises is the need to have the schema — rather than just the document — depict the relationships involved. Note that in the example above, it is the data of the PERF_REVIEW element that associates PERF_REVIEW with EMPLOYEE. So that the schema might "know" the relationship in the absence of a document instance we define that relationship using the metadata properties pkey and fkey. For example: |
<!ATTLIST EMPLOYEE e-pkey NMTOKEN #FIXED 'EMPLOYEE.NUM'> |
and |
<!ATTLIST PERF_REVIEW.EMP_NUM e-fkey NMTOKEN #FIXED ' |
EMPLOYEE.EMPLOYEE_NUM' > |
Nesting |
| There is an alternative to the mechanism described above in some cases. This technique takes advantage of the fact that in XML relationship may be inferred from context (specifically containment) as well as through id/idref relationships. So, in our example, we can associate PERF_REVIEW elements with their corresponding EMPLOYEE elements by placing them inside the latter. For example: |
<EMPLOYEE EMPLOYEE.NUM_id = "EMPLOYEE.2361" |
NUM = '2361' |
FNAME = 'Wilbert' |
LNAME = 'Winston' |
HIRE_DATE = '4/4/88'> |
<PERF_REVIEW |
REVIEW_DATE = '1/1/98' |
REVIEW = 'lousy'/> |
<PERF_REVIEW |
REVIEW_DATE = '1/1/99' |
REVIEW = 'worse'/> |
</EMPLOYEE> |
| Note that this is not always desirable and is often not even possible. The following conditions must be met for this to be appropriate: |
The Example Step-by-Step |
Modeling Tables |
|
Modeling Columns as Elements | |||||||||||||||||||||||||||||||||||||
| For each column: |
|
Modeling Columns as Attributes | |||||||||||||||||||||||||||||||||||||
| For each column: |
|
Conclusion |
| By adopting some simple conventions, XML Schemas can successfully model information sources such as relational databases. By capturing datatype, key relationships and id/idref information, extracted data can retain the metadata needed to facilitate processing at both side of information exchange. The process set forth above is straightforward and well suited to automation and has been implemented for a wide variety of databases in the current XML Authority product. |
Listings |
DTD with Columns as Elements | ||||||
<!ELEMENT EMPLOYEE ( |
EMPLOYEE.NUM? , |
EMPLOYEE.FNAME , |
EMPLOYEE.LNAME , |
EMPLOYEE.HIRE_DATE , |
EMPLOYEE.TERM_DATE? )> |
<!ATTLIST EMPLOYEE pkey_id ID #REQUIRED |
e-pkey NMTOKEN #FIXED 'EMPLOYEE.NUM' > |
<!ELEMENT EMPLOYEE.NUM (#PCDATA )> |
<!ATTLIST EMPLOYEE.NUM e-dtype NMTOKEN #FIXED 'int' > |
<!ELEMENT EMPLOYEE.FNAME (#PCDATA )> |
<!ATTLIST EMPLOYEE.FNAME e-dtype NMTOKEN #FIXED 'string' |
e-dSize NMTOKEN #FIXED '32' > |
<!ELEMENT EMPLOYEE.LNAME (#PCDATA )> |
<!ATTLIST EMPLOYEE.LNAME e-dtype NMTOKEN #FIXED 'string' |
e-dSize NMTOKEN #FIXED '32' > |
<!ELEMENT EMPLOYEE.HIRE_DATE (#PCDATA )> |
<!ATTLIST EMPLOYEE.HIRE_DATE e-dtype NMTOKEN #FIXED 'date' > |
<!ELEMENT EMPLOYEE.TERM_DATE (#PCDATA )> |
<!ATTLIST EMPLOYEE.TERM_DATE e-dtype NMTOKEN #FIXED 'date' > |
<!ELEMENT PERF_REVIEW ( |
PERF_REVIEW.EMP_NUM , |
PERF_REVIEW.REVIEW_DATE , |
PERF_REVIEW.REVIEW )> |
<!ATTLIST PERF_REVIEW PERF_REVIEW.EMP_NUM_idref IDREF #REQUIRED > |
<!ELEMENT PERF_REVIEW.EMP_NUM (#PCDATA )> |
<!ATTLIST PERF_REVIEW.EMP_NUM e-dtype NMTOKEN #FIXED 'int' |
e-fkey NMTOKEN #FIXED 'EMPLOYEE.NUM' > |
<!ELEMENT PERF_REVIEW.REVIEW_DATE (#PCDATA )> |
<!ATTLIST PERF_REVIEW.REVIEW_DATE e-dtype NMTOKEN #FIXED 'date' > |
<!ELEMENT PERF_REVIEW.REVIEW (#PCDATA )> |
<!ATTLIST PERF_REVIEW.REVIEW e-dtype NMTOKEN #FIXED 'string' |
e-dSize NMTOKEN #FIXED '50' > |
<!ELEMENT COMP_CHANGE ( |
COMP_CHANGE.EMP_NUM , |
COMP_CHANGE.REVIEW_DATE? , |
COMP_CHANGE.EFF_DATE , |
COMP_CHANGE.SALARY )> |
<!ATTLIST COMP_CHANGE COMP_CHANGE.EMP_NUM_idref IDREF #REQUIRED > |
<!ELEMENT COMP_CHANGE.EMP_NUM (#PCDATA )> |
<!ATTLIST COMP_CHANGE.EMP_NUM e-dtype NMTOKEN #FIXED 'int' |
e-fkey NMTOKEN #FIXED 'EMPLOYEE.NUM' > |
<!ELEMENT COMP_CHANGE.REVIEW_DATE (#PCDATA )><!ATTLIST COMP_CHANGE.REVIEW_DATE e-dtype NMTOKEN #FIXED 'date' > |
<!ELEMENT COMP_CHANGE.EFF_DATE (#PCDATA )><!ATTLIST COMP_CHANGE.EFF_DATE e-dtype NMTOKEN #FIXED 'date' > |
<!ELEMENT COMP_CHANGE.SALARY (#PCDATA )><!ATTLIST COMP_CHANGE.SALARY e-dtype NMTOKEN #FIXED 'int' > |
DTD with Columns as Attributes | ||||||
<!ELEMENT EMPLOYEE EMPTY> |
<!ATTLIST EMPLOYEE pkey_id ID #REQUIRED |
NUM CDATA #REQUIRED |
FNAME CDATA #REQUIRED |
LNAME CDATA #REQUIRED |
HIRE_DATE CDATA #REQUIRED |
TERM_DATE CDATA #IMPLIED |
e-pkey NMTOKEN #FIXED 'NUM' |
a-dtype NMTOKENS 'NUM int |
FNAME string |
LNAME string |
HIRE_DATE date |
TERM_DATE date' |
a-dSize NMTOKENS 'FNAME 32 LNAME 32' > |
<!ELEMENT PERF_REVIEW EMPTY> |
<!ATTLIST PERF_REVIEW PERF_REVIEW.EMP_NUM_idref IDREF #REQUIRED |
EMP_NUM CDATA #REQUIRED |
REVIEW_DATE CDATA #REQUIRED |
REVIEW CDATA #REQUIRED |
a-dtype NMTOKENS 'EMP_NUM int |
REVIEW_DATE date |
REVIEW date' |
a-fkey NMTOKENS 'EMP_NUM EMPLOYEE.NUM' |
a-dSize NMTOKENS 'REVIEW 50' > |
<!ELEMENT COMP_CHANGE EMPTY> |
<!ATTLIST COMP_CHANGE COMP_CHANGE.EMP_NUM_idref IDREF #REQUIRED |
EMP_NUM CDATA #REQUIRED |
REVIEW_DATE CDATA #IMPLIED |
EFF_DATE CDATA #REQUIRED |
SALARY CDATA #REQUIRED |
a-dtype NMTOKENS 'EMP_NUM int |
REVIEW_DATE date |
EFF_DATE date |
SALARY int' |
a-fkey NMTOKENS 'EMP_NUM EMPLOYEE.NUM' > |
XML Data with Columns as Elements |
<?xml version ="1.0"?> |
<!--Generated by XML Authority. Conforms to XML Data subset for IE 5--> |
<Schema name = "" |
xmlns = "urn:schemas-microsoft-com:xml-data" |
xmlns:dt = "urn:schemas-microsoft-com:datatypes" |
xmlns:xa = "www.extensibility.com/schemas/xdr/metaprops.xdr"> |
<ElementType name = "EMPLOYEE" xa:pkey = "EMPLOYEE.NUM" content = "eltOnly" order = "seq"> |
<AttributeType name = "pkey_id" dt:type = "ID" required = "yes"/> |
<attribute type = "pkey_id"/> |
<element type = "EMPLOYEE.NUM" /> |
<element type = "EMPLOYEE.FNAME" /> |
<element type = "EMPLOYEE.LNAME" /> |
<element type = "EMPLOYEE.HIRE_DATE /> |
<element type = "EMPLOYEE.TERM_DATE" minOccurs = "0" maxOccurs = "1"/> |
</ElementType> |
<ElementType name = "EMPLOYEE.NUM" content = "textOnly" dt:type = "i4"/> |
<ElementType name = "EMPLOYEE.FNAME" content = "textOnly" dt:type = "string"/> |
<ElementType name = "EMPLOYEE.LNAME" content = "textOnly" dt:type = "string"/> |
<ElementType name = "EMPLOYEE.HIRE_DATE" content = "textOnly" dt:type = "date"/> |
<ElementType name = "EMPLOYEE.TERM_DATE" content = "textOnly" dt:type = "date"/> |
<ElementType name = "PERF_REVIEW" content = "eltOnly" order = "seq"> |
<AttributeType name = "PERF_REVIEW.EMP_NUM_idref" dt:type = "IDREF" required = "yes"/> |
<attribute type = "PERF_REVIEW.EMP_NUM_idref"/> |
<element type = "PERF_REVIEW.EMP_NUM" /> |
<element type = "PERF_REVIEW.REVIEW_DATE /> |
<element type = "PERF_REVIEW.REVIEW" /> |
</ElementType> |
<ElementType name = "PERF_REVIEW.EMP_NUM" xa:fkey = "EMPLOYEE.NUM" content = "textOnly" dt:type = "i4"/> |
<ElementType name = "PERF_REVIEW.REVIEW_DATE" content = "textOnly" dt:type = "date"/> |
<ElementType name = "PERF_REVIEW.REVIEW" content = "textOnly" dt:type = "string"/> |
<ElementType name = "COMP_CHANGE" content = "eltOnly" order = "seq"> |
<AttributeType name = "COMP_CHANGE.EMP_NUM_idref" dt:type = "IDREF" required = "yes"/> |
<attribute type = "COMP_CHANGE.EMP_NUM_idref"/> |
<element type = "COMP_CHANGE.EMP_NUM" /> |
<element type = "COMP_CHANGE.REVIEW_DATE" minOccurs = "0" maxOccurs = "1" /> |
<element type = "COMP_CHANGE.EFF_DATE" /> |
<element type = "COMP_CHANGE.SALARY" /> |
</ElementType> |
<ElementType name = "COMP_CHANGE.EMP_NUM" xa:fkey = "EMPLOYEE.NUM" content = "textOnly" dt:type = "i4"/> |
<ElementType name = "COMP_CHANGE.REVIEW_DATE" content = "textOnly" dt:type = "date"/> |
<ElementType name = "COMP_CHANGE.EFF_DATE" content = "textOnly" dt:type = "date"/> |
<ElementType name = "COMP_CHANGE.SALARY" content = "textOnly" dt:type = "i2"/> |
</Schema> |
| The Use of Profiling in XML Documents | Table of contents | Indexes | Data Models as an XML Schema Development Method | |||