XML Schema types and equivalence classes   Table of contents   Indexes   XML &, query languages

XML schema aware processing
XML schema validation
XML schemas
 

Derivation, tolerance and validity

 a formal model of type definition in XML Schemas
Brown, Allen
 
 Allen  Brown
 Jr.
 Senior Program Manager
  Microsoft Corporation 
 Redmond  
 USA 
 Washington 
Microsoft Corporation,  One Microsoft Way
Redmond  Washington  98052-6399 USA
Phone: +1 425 705-3290 Fax: +1 425 536-7329 email: allenbr@microsoft.com web site: msdn.microsoft.com/xml/
 Biography
 Allen L. Brown, Jr., Ph.D. - Dr. Brown is Senior Program Manager in Microsoft's Web Data Access group where he shepherds Microsoft's efforts in data access related XML standards. Prior to joining Microsoft, Dr. Brown had a lengthy career at the Xerox Corporation, where held various positions including both Research Fellow and VP and CTO of Xerox's XSoft Division. His product development experience includes the management of large teams of engineers, as well as lead architectural roles in both Xerox's landmark Star system, and its Astoria object-oriented document repository. Dr. Brown has held positions in the corporate research centers of IBM, GE and Xerox. He is the author of numerous published technical papers and the holder of several U.S. patents. Dr. Brown's university and national service record includes professorships at the George Washington University and Syracuse University, and membership on the National Research Council NIST Panel on Information Technology.
 Abstract
 We investigate the general mechanism of type derivations in XML schemas
 An earlier version of this paper was submitted as a note to the W3C XML Schema Working Group. Some of the ideas in this paper are reflected by the current XML Schema candidate recommendation in the facilities provided by element equivalence, type derivation operators, and blocking of type derivation operators.
, of which the restrictions and extensions of the proposed schema standard (in and ) are but special cases. The principal idea here is that the designer of a modified or derived type be able to declare the nature of the derivations made. Moreover, we would like to be able to substitute the derived type in places where the ancestral type was expected. But what if an application cannot tolerate such a substitution? In order to exploit fully such derived types, an application, in turn, needs to be able to declare what sorts of modifications it is willing to tolerate. Furthermore, the validator, upon hearing such advice from the application, needs to validate an instance according to the tolerances declared. In effect, we propose to make applications responsible for telling the validator what classes of type transformations lead to substitutable derivations from the application's perspective. The validator, then, is responsible for guaranteeing that the effective derivations fall within the application's tolerances.
 

Introduction

 The principal idea here is that the designer of a modified or derived type be able to declare the nature of the modification made. In order to exploit fully such a declaration, an application needs to be able to declare what sorts of modifications it is willing to tolerate. Moreover, the validator, upon hearing such advice from the application, needs to validate an instance according to the tolerances declared. In effect we are proposing that we make applications responsible for telling the validator what classes of transformations lead tosubstitutable derivations from theapplication's perspective. The validator, then, is responsible for guaranteeing that the effective derivations fall within the application's tolerances.
 We open by informally sketching a model of the declaration, the notion of tolerance, and the kinds of derivations with respect to which validations might be successfully conducted. Later, we will give a rigorous account of the concepts in question, from which one can design a concrete specification syntax, and implement a validator.
 First, let's embellish the element declaration with a tolerance declaration:
 
<element type="mumble" tolerates="(attr-extension | elt-extension)*"/>
 and then define a derived type:
 
<complexType name="grumble" base="mumble" derivedBy="attr-extension">
... </complexType>
 The former declares an element that permits
 The XML Schema candidate recommendation supports a related construct in the block attribute associated with element declarations. While the tolerates attribute above is a permission, the block attribute is a prohibition.
the substitution of a type derived by any sequence of element or attribute extensions. The latter derivation defines a subtype by adding attributes to the base type. (See the section below on transformations.)
 The example above illustrates how the original type designer might indicate the permissibility of the further elaboration of the type. But an application knowing only about the original type might not be as permissive as the designer. In general, an application should be able to declare to the validator that certain kinds of derivation are tolerated. One possibility is to make a global declaration of such tolerance. Another possibility is to make local declarations of the form "this application tolerates derivations of themumble type by any sequence of transformations of the following kinds..."
 Now there are two questions:
 
  • What kinds of derivation are there?
  •  
  • How might the validation of such derivations work?
  •  In this informal account, we only consider derivationsvis à vis content models, and "element only" content models at that. (It should be reasonably clear how to extend the same ideas to cover derivations with respect to attributes.) For the time being, types will be conflated with regular expressions, where a derived type is accessed through a new tag.
     From the perspective of the XML Schema candidate recommendation, this is tantamount to saying that every type is associated with exactly one element definition, and that the tag associated a newly derived type is in the equivalence class of the tag associated with the base type.
    Finally, we assume that all content models are representable by regular expressions (in ).
    XML validation
     

    Validation: an informal account

     First, let's consider validation without derivation. Validity is defined recursively roughly as follows: A node of an XML tree tagged with , which has the type definition ( being a regular expression in tags) is valid just in case the children of the node tagged with (when taken in sequence) are in the regular set denoted by R, and each of the children is itself valid. An XML tree, by extension, is valid just in case all of its nodes are valid.
     Without saying what can count as a derivation (other than to note that every type is a derivation of itself), we extend the definition of validity to accommodate derivation. A node of an XML tree tagged with , is valid just in case has some derivation defined by , such that each of the children of the node tagged with (when taken in sequence) is in the regular set denoted by , and each of the children is valid. An XML tree, by extension, is valid just in case all of its nodes are valid.
    type derivation, subtyping, inheritance
     

    Derivation by transformation

     Suppose ,…, are type definitions, we will define a set of immediate transformations below yielding a type definition such that is an immediate derivation from ,…, by one such transformation. If ,…, is a set of transformations, there is the obvious inductive definition of a derivation from by a sequence of immediate transformations taken from that set. So what might be interesting immediate transformations?
     Consider the immediate derivation of from ,…, by the "prepend" immediate transformation (or inheritance transformation) illustrated as follows:
     
     
     
     
     yields
     
     where is a new regular expression in tags. Similarly, one might define the "postpend" immediate transformation by
     
     One can define a kind of "interleaving" immediate transformation by beginning with the type definitions
     
     
     
     
     
     
     
     one "derives" them to an inherited type definition
     
     where the 's are all new regular expressions. The transformations described thus far are essentially additive. But there are lots of others.
     Refining from by the "insertion"' immediate transformation:
     
     becomes
     
     Refining from by the "deletion" immediate transformation:
     
     becomes
     
     Deriving from by a "restriction on choices" immediate transformation::
     
     can be restricted to any of:
     
     
     
     Refining from by the "cyclic permutation" immediate transformation:.
     
     becomes
     
     No doubt you have noticed that we have not told you how the positions at which the transformations occur are to be identified. Also, the only transformations we have shown are on the immediate subexpressions of regular expressions. There are endless grammar-based processors that allow the identification of subexpressions and modification thereof, so we'll defer providing a solution.
     By composing the transformations described, a wide variety of insertions, deletions and rearrangements can be accommodated. (Indeed, if we allow the transformations above to occur at any level, we get a universal editor.) While each immediate transformation is eminently understandable, their composition can rapidly become opaque, so one may want to declare the number of immediate transformations that are tolerable in refining from . An application's robustness against transformations is more likely, however, to be determined by the nature of the transformations tolerated rather than their number.
     Imagine that an application can declare that it will tolerate derivations from by any sequence of transformations in a set of immediate transformations . Thus, in the validation definition above, we amend it to look for not just derivations, but "tolerated" derivations. The validator, of course, should leave a record of what derivations were used at each node during the validation process. Moreover, the application can have access to the transformational relationship between the type associated with a node's actual tag and the type that led to the validation of the node's children.
     

    The formalities

     Now we are ready to provide mathematical models of schemas, instances and derivation wherein:
     
  • Schemas embed context-free grammars the bodies of whose productions are regular expressions.
  •  
  • The heads of those productions are the names of types.
  •  
  • The types are associated with attribute definitions.
  •  
  • There is a partial order on types that maximizes refinable content models.
  •  
  • Instances are ordered, labeled, attributed trees.
  •  
  • Basic validity is defined in terms of types that are maximal in the partial order.
  •  
  • Validity with derivation is defined relative to lesser types in the partial order.
  •  
  • The partial order on types is segmented in such a way as to mimic various classes of derivation.
  •  
  • Validity is further sharpened to allow for tolerance of instances according to whether they fall into certain classes of derivation or not.
  •  

    Schemas

     Aschema
     The schemas defined in this section and the instances defined in the next correspond more accurately to the XML schema and the XML instance information set contributions detailed in the XML Schema candidate recommendation.
    S is a structure where
     
  • is a finite set ofelement names
  •  
  • is a finite set oftype names
  •  
  • is a finite set ofattribute names
  •  
  • is a finite set ofvalues
  •  
  • is the distinguishedroot element
  •  
  • is the set of regular expressions over
  •  
  • is thecontent modeling function
      .If and then both and are in or they are both in .

  •  
  • is theattribution function
     For such that we will say that the type has the attribute

  •  
  • is thetype declaration function
  •  
  • immediate derivation , is a partial order where , and whenever
     

  •  

  •  
  • if then
  •  
  • derivation is the transitive closure of immediate derivation

  •  A type isbasic if there does not exist a distinct type such that .
     We will take as simple those types in that are mapped by to elements of , i.e. stings, integers, dates, etc.
    Also, we will sometimes call thesignature of .
     

    Instances

     Aninstance I of a schema S is a structure where
     
  • is a finite set ofnodes
  •  
  • is the distinguishedroot node
  •  
  • and are discrete partial orders
     A partial order is discrete if whenever there does not exist such that for distinct .
    over such that if and then and are children of (and siblings of one another), and either ( is the elder sibling of ), or ( is the elder sibling of )
  •  
  • for every node
  •  
  • for every pair of siblings there is no such that and
  •  
  • for nodes , only if there exists a node such that and
  •  
  • alabeling is a function where
     Here we accommodate element and attribute "openness". Openness could be disabled by requiring that , etc.

  •  A labeling, being an ordered 4-tuple, has a first component called itstag
     For node is the tag component of .
    , a second component, called itsattribution , a third component called itstype ; and a fourth component called itsvalue . When the type is , it is said to beimplicit , and otherwiseexplicit . The only nodes of an instance with a value component of are the non-minimal nodes of the instance. The partial order induces a tree on the nodes of I rooted at .
     

    Basic validity

     An instance I is S= -valid (read "S-basic valid") with respect to a schema S iff all of its nodes have implicit types, 's tag is and each node is S= -valid with respect to a schema S. A node is S= -valid with respect to a schema S iff either , or and
     
    1. for every in the domain of 's attribution , either , or with ; and
    2. either
       
      1. ;
      2. the value component of is in
         This corresponds to an element with non-element content contained in a specified subset of the set of values .
        ; and
      3. has no children

    3. or
       
      1. ;
      2. the value component of is ;
      3. the sequence , where areall the children of taken in order whose tags are in , is in the regular set defined by ; and
      4. each of the is S= -valid with respect to S.

     

    Validitywith derivation

     The intent of derivation is to control the circumstances under which a tag might occur at a node where another kind of tag is expected.
     This can occur either through the derivation of one type definition from others, or the equivalencing among tags.
    Our task here is to define the circumstances under which such asubstitution might be valid. Regular expressions offer us an obvious mechanism by which we might achieve this--namely,alternation . If , , then is a -alternative of . Let beall of the -alternatives of , then is the -closure of . By extension, for the -closure of (denoted ) is obtained from by replacing each element in the latter with its -closure. So any occurrence of the above would be replaced with .
     In considering the -closure of a tag we incorporate the equivalence among elements, sometimes called the variation of tags.
    Finally, if , , then is a -alternative of . Let beall of the -alternatives , then the -effective content model of is
     
  • either when ;
  •  
  • or when .
  •  An instance I is -valid (read " -derivation valid") with respect to a schema S iff 's tag is either or a -alternative of , and each node is -valid with respect to a schema S. A node is -valid with respect to a schema S iff either , or and
     
    1. if the type component of is , then and the base type for validation is
    2. if the type component of is , then the base type for validation is
    3. for every in the domain of 's attribution , either , or if and then ; and
    4. either
       
      1. ;
      2. the value component of is in the -effective content model of ; and
      3. has no children

    5. or
       
      1. ;
      2. the value component of is ;
      3. the sequence , where areall the children of taken in order whose tags are in , is in the regular set defined by the -effective content model of ; and
      4. each of the is S= -valid with respect to S.

     

    Validity with derivationand tolerance

     Recalling that is a subset of , consider a segmentation of into a finite collection of segments such that . Since we are free to construct and partition it any way we like, we could think of each of the segments as corresponding to a particular atomic derivation transformation on types. (In the next section we'll examine some particular kinds of atomic transformations.) Furthermore, we can abuse the notation of juxtaposition and consider composing the atomic transformations as relations,e.g. . Indeed, we can consider regular expressions over atomic transformations, and note that in the usual notation for regular expressions .
     Let S and I be as above. Suppose is segmented into a finite collection of segments such that . Let be a regular expression over the segments of . Putting in the place of , we can define the -alternatives, the -closure, and the -effective signature by analogy with the -alternatives, the -closure, and the -effective signature above. An instance I is -valid (read "S -derivation valid tolerating transformations in the regular set ") with respect to a schema S iff 's tag is either or an -alternative of , and each node is -valid with respect to a schema S. A node is -valid with respect to a schema S iff either , or and
     
    1. if the type component of is , then and the base type for validation is
    2. if the type component of is , then the base type for validation is
    3. for every in the domain of 's attribution , either , or if and then ; and
    4. either
       
      1. ;
      2. the value component of is in the -effective content model of ; and
      3. has no children

    5. or
       
      1. ;
      2. the value component of is ;
      3. the sequence , where areall the children of taken in order whose tags are in , is in the regular set defined by the -effective content model of ; and
      4. each of the is -valid with respect to S.

     

    Modeling particular derivation transformations

     We now explore how some of the forms of modification described above fit into the formal model. Generally speaking, we proceed by taking a schema S and defining a new schema S' where the components of the latter structure are defined by making incremental changes to the components of the former, but leaving the root element the same in both structures. We will subscript the components of S and S' with their respective schema names,e.g. and .
     

    Adding a new tag

     From schema S with type we create a new schema S' with a new element :
     
     and may now order types with content models that are regular expressions over .
     

    Creating a new type by adding an attribute

     From schema S with type we create a new schema S' having a new type with an additional attribute assigned from the subspace :
     
     

    Creating a new type by restricting choices

     From schema S with type we create a new schema S' having a new type such that is derived from by restriction of choices (see the example in thederivation by transformation section):
     
     

    Creating a new type by cyclic permutation

     From schema S with type we create a new schema S’ having a new type such that is derived from by cyclic permutation (see the example in thederivation by transformation section):
     
     

    Conclusions

     In the foregoing we have abstracted
     
  • types of elements
  •  
  • types of attributes
  •  
  • openness
  •  
  • type derivation
  •  
  • element equivalence
  •  
  • tolerance
  •  
  • schema validation both with and without tolerance
  •  In doing so we assume only that attributes have simple values, and that elements are either simple values or structures characterized by regular expressions. Indeed the model of types (in ) that we employ is essentially the extensional one traditionally studied by logicians. On the surface it would appear that this is not quite the same as the typical computer science model of types where two named types having the same type signature represent distinct types. It happens that we can encode the name of a type both in a regular expression and in an attribution, producing, in effect, distinct copies of the same structure. Using this device together with imposing various restrictions on the derivation partial order, we can reproduce the validation semantics of the XML Schema candidate recommendation. More importantly, however, we have a much richer notion of type derivation and validation.
     While implementation is outside the scope of this paper, we should observe that validation, as presented, depends only on a set of regular expressions that can be statically constructed from the definition of a schema. Adding derivation and/or tolerance simply increases the size and/or number of the regular expressions. It is easy to imagine a compiler that undertakes such a construction. Validation at a node reduces to checking if a sequence of tokens associated with each of the child nodes is in the regular set (in ) defined by a particular regular expression.
     Bibliography
     
    Regex Alfred V. Aho. Algorithms for finding patterns in strings. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science.MIT Press Elsevier, Cambridge, Massachusetts, 1 990.
     
    Types Solomon Feferman. Theories of finite type related to mathematical practice. In Jon Barwise, editor, Handbook of Mathematical Logic.Elsevier North Holland, Amsterdam, 1 977.
     
    XSD1 Paul V. Biron and Ashok Malhotra, editors. XML Schema Part 2: Datatypes. World Wide Web Consortium, Cambridge, Massachusetts, 2000.
     
    XSD2 Paul V. Biron and Ashok Malhotra, editors. XML Schema Part 2: Datatypes. World Wide Web Consortium, Cambridge, Massachusetts, 2000.

    XML Schema types and equivalence classes   Table of contents   Indexes   XML &, query languages