Quilt: an XML query language   Table of contents   Indexes   Spatial/temporal datatypes

 XML 
 query language 
 

A canonical query language & its efficient implementation

 van der Steen, Gert 
 
 Gert  van der Steen
 Senior Consultant
  Palstar bv 
 The Netherlands 
 Uffelte 
Palstar bv,  Winkelsteeg 5a
Uffelte   7975 PV The Netherlands
Phone: +31 521 351077 Fax: +31 521 351078 email: palstar@xs4all.nl web site: www.palstar.nl
 Biography
 Gert van der Steen - Gert van der Steen is an independent consultant in Document Management and Language Technology, with a focus on the introduction of SGML and XML in industry and the automatic translation of controlled natural languages.
 Gert studied Mathematics and Computer Sciences and wrote a dissertation on the design of a program generator for syntactic pattern recognition and transduction. He worked as a researcher and lecturer at the Universities of Leiden (Medicine), Rotterdam (Economics) and Amsterdam (Arts). He shifted in 1988 from science to the industry where he has been involved in many SGML projects as a consultant, trainer, developer and project manager. For Cap Gemini he developed a system for the treatment of controlled natural languages.
 In 1996 Gert van der Steen founded his own consultancy, Palstar bv. For Palstar he is currently developing tools for SGML/XML and Natural Language Processing, like syntactic analysis of documents for subsequent up-conversion, revision tracking (Palstar sells the program SGDIFF), transformation and querying of SGML documents and groves. In addition to this Gert van der Steen is a part-time Professor in Information Science at the University of Utrecht.
 Abstract
 The XML community is eagerly awaiting a W3C recommendation for a XML query language. However, it has been estimated that it may take more than a year until that recommendation reaches a final stage. Therefore, it may be worthwhile to study in the meantime the intrinsic characteristics of query languages for XML.
 A number of query languages have been initially proposed for XML. Also some simple query mechanisms are part of (proposed) recommendations, like in XSL and Xpath.
 It has been argued (e.g. in Cotton, P., and Malhotra, A., “Candidate Requirements for XML Query”, Nov. 30, 1998) that the syntax for the query formalism should be XML itself, that it will support full-text queries, and that fast retrieval programs could be generated automatically based upon the query formalism.
 These requirements are so general and intuitive acceptable that it may be conjectured that they will be the core of any future query language for structured documents. As such, we might give it the name “canonical” query language (“CQL”).
 In this paper we show how the requirements for CQL can be met fully by an extension of well-known techniques for a. the description of grammars, b. pattern matching and c. the generation of finite state automata. These techniques can straightforward be extended towards pattern grammars for natural language.
 

Introduction

 A Query Language for XML has yet to be defined. In the W3C published a Working Draft on “XML Query Requirements”, with references to , , and for background discussion.
 Two major approaches still have to be blend: one from the database community and one from the document community. We quote from :
 “The database community is concerned with large repositories of data, integrating data from heterogeneous sources, exporting new views of legacy data, and transforming data into common data-exchange formats...The document community is concerned with full-text search, queries of structured documents, integrating full-text and structured queries.”
 “The database community.. has learned about query complexity, algebra’s, and techniques for implementing these query languages efficiently...What is known regarding the expressive power of query languages should play a central role in the design of a query language for XML.”
  adds: “Many XML documents are very text-centric, and many systems that store XML contain full-text engines. In markup that intermixes text and structure, it can be extremely helpful to be able to mix full text conditions with structured conditions to express a sequence of text patterns and structures.”
  argues that “selection, extraction, reduction, restructuring and combination should all be possible in a single XQuery” because “XQuery should be suitable for server-side processing” (where XQuery is an abbreviation for a XML query language).
  requires that “both document-oriented and data-oriented queries may be performed on documents with embedded data, such as catalogs, patient health records, employment records, or business analysis documents”. The requirements ensure that the complete functionality of XML documents is retained. As one starting point for the definition of a query language XPath is mentioned.
 The blending of the approaches of both communities requires that database people are aware of the complexity of the integration of full-text and structured queries and that document people are aware of the complexity of database and query language design and implementation.
 The document community has been experimenting for years with complicated queries on structured documents. The query languages for XML that have been proposed up till now are not sufficient for these queries.
NLP
 
In our approach we experimented with formalisms forNLP and extended them withPM facilities. As a subset of the formalism, a query language for structured documents originated. The query language can be read as an Extended Context Free Grammar with orthogonal extensions for Pattern Matching. It may be expressed in XML or may be written as a superset of XPath. It is the richest form of the selection part of any of the XML query languages proposed so far and provides for the expression of Selection, Projection and Join. Therefore, we call it, for the moment,CQL .
 CQL, Canonical Query Language 
PM
 
The main difference with the approaches up till now is that a query is expressed as a pattern grammar, akin to a DTD enriched with wild cards, Boolean constructs and variables, to be applied to both markup and #pcdata.
 CQL, Canonical Query Language 
 
This paper discusses some of the design considerations ofCQL and the architecture of the related processes. The relation with the requirements of is indicated. The mapping ontoCQL of some proposed XML query languages and of standard relational database queries is shown, together with the straightforward extensibility towards pattern grammars for natural language.
 The efficiency is illustrated by innovative principles for compilation into a Finite State Automaton with skip instructions. We indicate how full text indexes for structured documents could be exploited in order to unlock the potential of optimization techniques of the database community.
 

Design considerations of CQL

 

Relation with NLP and PM

 In our approach, a document query may be viewed as a pattern that has to be located in a document. As such the processing of a query is akin to syntactic pattern recognition.
 This approach considers queries as the description of a language, formed by all the possible answers. Formulated as pattern grammars they describe a set of sentences, as does the language generated by that grammar. The processing of a query may then be regarded as the recognition of the data base which acts as one input-sentence to the pattern grammar. Non-determinism during recognition accounts for the possibility of more answers.
 CQL, Canonical Query Language 
 FSA, Finite State Automaton 
 
Very powerful query languages may be constructed by adding constructs for patterns, like wildcards and Boolean operators, to linguistic grammar formalisms. There is a price: runtime performance may be reduced dramatically. However, staying within the Chomsky hierarchy, more efficient programs for formal automata can be generated.CQL is a pattern grammar formalism that is stripped down in such a way that queries within CQL may be compiled into aFSA . A set of separate document queries, including their context, may be combined into one pattern grammar and therefore may be compiled into one (deterministic)FSA . In the formalism, no sequence of processing is indicated. Partial overlapping patterns (like ‘abc’ and ‘bcd’) do not cause problems.
 FSA, Finite State Automaton 
 

Relation with database query languages

 A simple link exists between syntactic recognition and querying in document trees. In the latter case we formulate queries in which the vertical axes in the tree are important. In procedural query-languages one navigates in a horizontal and vertical direction through the tree-structure. However, the expressive power of the queries can be increased considerably when we allow also for the full power of grammatical notation along the vertical axes. A simple start in that direction has been made by XPath.
 By allowing grammar-notations on the horizontal axes, the possibility of linguistic descriptions on free text is introduced.
 The constructs for patterns allow for universal and existential quantification.
 The advantage of a relational model may be the simple use of such operators as "selection", "projection" and "join", well-known from relational algebra. The last-mentioned operator introduces some kind of intelligent combination of stored relations that is absent from document query languages.
 CQL provides for variables which may get an assignment from the document. Operations on the variables make possible the functionality of the “join” operator.
 

Architecture of processes

 The processes for CQL have the same anatomy as those for a grammatical system.
 
  1. A compiler generates code for a formal machine, to be interpreted e.g. by a Java or C program. The compiler takes into account additional information stemming from the schema. The compiler is optimized in several ways. For instance, if a pattern grammar is also allowed to contain rules from a higher order grammar (e.g. for the recognition of natural language expressions) a switch is generated to a recognizer for that type of grammar.
  2. The formal machine consists of a PM machine and a scanner.
     
    1. The PM machine implements a FSA with skip instructions and generates messages for subsequent result manipulation.. The messages consist of symbols and their position in the document(tree).
    2. The scanner has access to XML documents and their indexes, it resolves namespaces, entities and other references, follows links, performs filtering, allows for text that spans boundaries of floating elements, checks datatypes (e.g. represented by the XML Query Data Model), provides for environment information and adjusts itself to the data representation (e.g. DOM or XML text).

  3. A report generator receives the messages from the formal machine and combines them into document fragments.
 This architecture facilitates in a modular way most of the requirements of .
 

Notation of CQL

 The syntax of CQL may get the syntactic binding of XML or that of the forthcoming Schema language, but it may also be regarded as an extension to XPath. The syntax is denoted as an EBNF grammar (as used for the definition of e.g. XML itself) with extensions for patterns, variables, Boolean constructs, document trees and notations for output.
 Because we deal with pattern grammars, one might say that a document is matched by a pattern grammar.
 A document is not required to contain markup.
 

Grammar formalism

 The grammar contains Nonterminals (starting with a capital) and terminals. In this paper the terminals are single characters or strings, but they could be of other datatypes. Characters and strings are delimited by single or double quotes.
 The reserved symbols of XML are not quoted in order to distinguish them from strings in #PCDATA.
 Nonterminals are rewritten in grammar rules according to the EBNF syntax. Rules may be left and/or right recursive, but infix recursion is not allowed.
 On the right-hand side of a rule, the meaning of symbols in regular expressions is as shown below. The notation deviates slightly from the usual notation in order to remain consistent with the syntax of DTD’s.
 
('a..z' | 'A..Z') (instead of [a-zA-Z])
represents any character with a value in the range(s) indicated
^'a..z' (instead of [^a-z])
represents any character with a value outside the range indicated
^('a'|'b'|'c') (instead of [^abc])
represents any character which is not 'a', 'b' or 'c'
A, 'b'
the nonterminal A followed by the character 'b'.
 If r, r1 and r2 are regular expressions then the following ones are also regular expressions (as in XML):
 
expression    meaning
r1 r2     :   concatenation of r1 and r2
r1| r2    :   r1 or r2
(r)*      :   0 or more repetitions of r
(r)+      :   1 or more repetitions of r
(r)?      :   0 or 1 r.
 

Actions

 At any place in a right-hand side (rhs) of a rule so called “actions” may be placed: a sequence of operations written between braces. The most important ones are operations for variables, Boolean constructs between rules and for output.
 The actions may be placed before a ',', a '|', a ')' and a '.'. Actions are executed between the matching of two symbols. In the details are described of the combination of actions and regular expressions.
 Examples of actions are included in the next subsections.
 

Patterns

 The reserved symbols for patterns are:
 
? don't care sign for one arbitrary character
= arb sign; an "arb" matches any sequence of characters, possibly empty, but excluding the characters which may follow the arb; anyhow the arb ends at the next end-tag
- line sign; a “line” matches any sequence of characters, possibly empty, including the characters which follow the "line", but only up to the next end-tag
 The line sign has its equivalent for the vertical axes in a tree.
 
<..> any sequence of start-tags
<../> the sequence of end-tags that corresponds to the start-tags of the preceding <..>
<..title> any sequence of start-tags that ends in <title>
<../title> the sequence of end-tags that ends in </title>
 At the end of a query trailing sequences of lines and end-tags may be left out.
 Examples.
 
  1. Comment ::= '!' = '!'.
    A sequence of characters which starts with an exclamation mark and which ends with an exclamation mark is matched as Comment. The characters which are matched by the "arb" will not contain a '!'.
  2. Sentence ::= = '.'.
    Sentence is defined as the maximal character sequence not containing a dot, followed by a dot. An input such as: 'Dr. West reads his mail.' will be matched by the nonterminal Sentence up to the dot in 'Dr.'.
  3. Sentence ::= - '.'.
    Sentence is defined as any character sequence followed by a dot. The above sentence will be matched two times, once for ‘Dr’ and once for 'Dr. West reads his mail’.
  4. S ::= - P1 - | - P2 -. P1 ::= ‘a’. P2 ::= ‘b’.
    S matches any string (between a start- and end-tag) that contains either ‘a’, ‘b’ or both.
 

Boolean negation within a rule

 The negation of a string s matches all strings with the same length, but not equal to s.
 The negation of a nonterminal N matches all strings with length between the length of the shortest and the largest string that can be matched by N, but not equal to any string that can be matched by N.
 The negation of a regular expression E is the same as for a nonterminal, if the nonterminal rewrites to E.
 Examples
 
^’a’                   an arbitrary character, but not an 'a'
^ (‘abc’)              a sequence of 3 characters, which is not 'abc'
^A                     where A is a nonterminal. ^A matches any sequence of
the same length as one of the sentences matched by A,
which is not equal to such a sequence
^ ('ab' | 'bcd')       a sequence of 2 or 3 characters, which is not 'ab'
and not 'bcd'
^ (- a..c)             any sequence of characters which does not end in 'a',
'b' or 'c'
^ (- (A | B) C -)      any sequence of characters which does not contain a
string which may be matched by A C or B C
NP::= - (^Adj) N - .   a N(oun) P(hrase) with a N(oun), but without an
Adj(ective) preceding that Noun. (Adj and N have to
be rewritten further.)
 

Boolean intersection of rules

 Rules may be intersected with the symbol '&amp;';.
 
1.     S ::= -  P1 - & - P2 -.
 S matches any string (between a start- en end-tag) that contains both P1 and P2.
 This may be refined with the ‘cooperation symbol’ C, written within actions and suffixed by a label for a cooperation point.
 
2.     S ::= - P1 - {C:1} P3 - {C:2} | - P2 - {C:1} P4 - {C:2}.
 Here we express that S matches strings w = u v such that u is matched by - P1 - as well as by - P2 - and that v is matched by - P3,- as well as by - P4 -. We might say that the matching of P3 and P4 has to wait for the matching of P1 and P2, and that at least at the end of the input both alternatives of the rhs have to be matched. The cooperation points are ‘1’ and ‘2’.
 This resembles the cooperation of parallel processes. The two rhs's may be waiting for each other at the rendezvous 1, consuming input, which is possible because of the appearance of lines, which may match an indefinite number of input symbols.
 In general, when all rhs's which contain the same cooperation name are matched up to the location where the name is denoted as a cooperation symbol then all these rhs's may continue, otherwise they all fail.
 The device of cooperating rules is especially useful in the specification of complex conditions in trees in a non-procedural way.
 The following pattern grammar:
 
S ::= - Subject - {C: SVO} | - Verb - {C: SVO} | - Object - {C: SVO}.
 is equivalent to:
 
S ::= - Subject - & - Verb - & - Object -.
 

Tree structure and XML markup

 In a CQL query all XML markup is denoted in the same way as it is written in an XML document.
 The end-tag </> closes the preceding start-tag.
 The reserved symbols of XML are not quoted. Strings occurring in markup, like names of tags and attributes, values of attributes, PI’s etcetera can be treated with patterns like the text in #PCDATA.
 For instance, the query:
 
S ::= <..> - < - ^’ing’ - > - ‘Charles.’</><../> .
 returns all tags with tagnames which do not contain the string “ing” and which #PCDATA ends on ‘Charles.’.
 Attributes are denoted in the same style as in XPath: they are written as if they are tags themselves, but they start with ‘<@’.
 For instance, the above query may be extended towards:
 
S ::= <..> - < - ^’ing’ - ><@’title’><@ - ’number=’ 1..3> - ‘Charles.’</><../> .
 where one attribute name has to be ‘title’ and another attribute name has to end on ‘number’ and has to have a value between 1 and 3. It is up to the scanner to allow attributes to appear in any sequence.
 

Variables

 A variable is declared by its first appearance in a rhs. The scope of the variable is the grammar rule with that rhs. The name and the value of a variable are strings of arbitrary length, following the syntax of attributes in XML. With the coming Schemata, typed variables may be added (as we did in our implementation).
 The possible operations on variables, written in C-style, may be summarized by :
 
  • assignment to variables from the last read-in symbol of the text. The reserved symbol '%' stands for the last symbol that is read in.
  •  
  • assignment of expressions with variables and constants. Within an expression variables may be concatenated, together with string-constants and the last read-in character. The concatenation operator is '||'.
  •  
  • tests on (expressions of) variables. A test concerns the truth-value of a Boolean expression. If this value is "false" then the current matching fails.
  •  
  • calls to external procedures. A procedure-call consists of the name of an external procedure which is written in some programming language, and parameters in which (expressions of) variables may be passed. The first parameter is a Boolean. If it returns the value "false" then the current matching fails.
  •  Nonterminals may be enriched with parameters. The parameters of nonterminals at a lhs have to be preceded by the declarations "I:", "O:" or "IO:", which correspond respectively with the features "inherited", "synthesized" and the combination of both, which are familiar in attribute grammars.
     Examples
     
    1. assignment to a variable from the text :
      S(O:A,O:B) ::= (a..g {A = A || %} | h..z {B = B || %} )*.
      After matching of the input all characters in the range ‘a..g’ are appended in variable A and all characters in the range ‘h..z’ are appended in variable B. A and B will be reported.
    2. Call of an external procedure :
      S ::= - NP(singplu1, time1, head) { Sem(O:continue, I:time1, I:head, IO:expect) },
      VP(singplu2, time2, expect) { singplu1 == singplu2; time1 == time2 }.
      After the matching of an NP the routine "Sem" is called with as input parameters "time1" and "head" and as output parameters "continue" and "expect". If the value of "continue" upon return of Sem is "false" then further matching of this path is inhibited (there may be more paths active because this is a pattern grammar). Else matching continues with the nonterminal VP which gets the variable "expect". After matching of VP the comparisons within the actions are executed. If one of them returns false then further matching is inhibited.
     

    Reports

     There are two report functions. The "S"-function (for "signal") reports number which follows the S. The "R"-function (from "report") does the same, but adds to it the last symbol that is read in. There may be more symbols in case the notion before the action where the report-function is denoted is an arb ('=') or a line ('-'). In that case all the symbols which are covered by that notion are concatenated and reported. Consecutive reports with the same number are merged into one report with all symbols concatenated.
     The numbers will be reported together with their position in the document after a pattern matches completely. The position may be a filepointer or a tree pointer. Alternatives and wild cards are responsible for the generation, in principle, of overlapping sequences of reports. The system may represent the overlap efficiently in a dag structure.
     Examples
     1. pattern matching with 4 keywords, signaling a match at the end of a keyword.
     
    A ::= - 'he' {S:1} - | - 'she' {S:2} - | - 'his' {S:3} - | - 'hers' {S:4} - .
    
     The same result is obtained by writing :
     
    A ::= - 'he' {S:1} 'rs' {S:4} - | - 'she' {S:2} - | - 'his' {S:3} - .
    
     If the input is 'ushers' then the output message will be '(1, pos1) (2, pos2) (4, pos4)', where posi is a position in the document.
     
    B ::= C {R:1}  = {R:2}  (d  ? {R:3} )+  e .
    C ::= c | f .
    
     If the input is 'cabcdededee' then the output message will be '(1c, pos1) (2abc, pos2) (3eee, pos3) '.
     

    Examples of CQL queries

     

    Estate inventories

     The following is an example from estate inventories from the 17th century in the Dutch city of Delft.
     Suppose the document is :
     
    <BOEDEL><KLASSE>D</KLASSE>
    <GEZIN>G</GEZIN>
    <ORG><RN>HUIS<KAT><VR><VO>HUIS</VO>
    <PR>WOON</PR>
    </VR>
    </KAT>
    LAND<KAT><VR><VO>TUIN</VO>
    </VR>
    </KAT>
    EFFECT<KAT><VR><VO>OBLIGATIE</VO>
    <BIJZ>FRANKRIJK</BIJZ>
    <AA>1 GELD</AA>
    <HOEV><GT>2</GT></HOEV>
    </VR>
    <TAX>2</TAX>
    <VR><VO>OBLIGATIE</VO>
    <BIJZ>PLANTERS OP DE EILANDEN</BIJZ>
    <RENTE>4</RENTE>
    </VR>
    <TAX>1</TAX>
    </KAT>
    </RN>
    <ORG>
    </BOEDEL>
    
     Query : "Give a report of all taxation’s of stocks (EFFECT) for people who own houses and land"
     CQL:
     
    S ::= - <BOEDEL> - <ORG><RN>A & B & C</RN> - </BOEDEL> .
    A ::= - ‘HUIS’ <KAT> - .
    B ::= - ‘LAND’ <KAT> - .
    C ::= - ‘EFFECT’ <KAT> - <TAX> ? {R:1} </TAX> - </KAT> - .
    
     Result : The input will be matched, and both '2' and '1' will be matched by the '?'.
     

    Parse trees

     The second example concerns the matching at arbitrary levels in a parse tree.
     If the tree is :
     
    <S>NP
    <VP>V
    <NP>ADJ
    N
    <REL>...
    <NP>DET
    ADJ
    N
    </NP>
          ...
    </REL>
    </NP>
    </VP>
    </S>
    
     and the pattern grammar is :
     
    ADJ-NP ::= <S><..> - <NP> - ‘ADJ’ - </NP> - <../></S> .
    
     which may be abbreviated to:
     
    ADJ-NP ::= <S><..NP> - ‘ADJ’.
    
     then the result will be two matches in the tree as shown below :
     
     

    Enriched corpora

     The third example concerns the querying of enriched corpora of natural language texts.
     The followingDTD describes a corpus with paragraphs of different authors. The words in the sentences are coded with their lemma and with a morphological code (like ‘noun’ or ‘adverb’).
     
    <!ELEMENT corpus               (subcorpus*) >
    <!ELEMENT subcorpus            (identification, paragraph*) >
    <!ELEMENT paragraph            (heading, body) >
    <!ELEMENT heading              (date) >
    <!ATTLIST heading               author CDATA #REQUIRED> 
    <!ELEMENT body                 (sentence)* >
    <!ELEMENT sentence             (word_group)* >
    <!ELEMENT word_group           (word, lemma, morphological_coding) >
    <!ELEMENT identification       (#PCDATA) >
    <!ELEMENT author               (#PCDATA) >
    <!ELEMENT word                 (#PCDATA) >
    <!ELEMENT morphological_coding (#PCDATA) >
    
     Query : retrieve all occurrences in the work of Shakespeare where a sentence ending with a form of the lemma 'work' is followed by a sentence ending with a noun or an adjective in 'ing'.
     CQL :
     
    S ::= - <..heading><@author=’Shakespeare’> - <../heading><body> - CONT1.
    CONT1 ::= <sentence> - <word_group> - <lemma>’work’</><../sentence> CONT2.
    CONT2 ::= <sentence> - <word_group><word> - ‘ing’</word> -
    <morphological_coding> (‘noun’ | ‘adjective’) </><../sentence>.
    
     

    Comparison with XPath

     It is straightforward to show that XPath is a subset of CQL.
     The difference in philosophy is that with a location path in XPath a specific node in the document tree is reached. In CQL a query describes a complete document (or document tree), where wildcards and Boolean operators denote which parts are relevant or not.
     Implicit in a CQL query are an indefinite number of location paths and nodes which may be located. No precedence constructs are necessary because all contexts are implicit in the pattern grammar. The sole mechanism for the creation of reports are the notations for Signals and Reports which indicate precisely a location in the document (be it implemented as a filepointer, a pointer in a Dom or in a document tree). However, the report is only generated when the whole pattern grammar matches the whole document. That means that for any point in the document any context may be specified along both horizontal and vertical axes in the document tree.
     Note that the description above does not mean that a whole document (tree) has to be matched during runtime. Optimization techniques allow for inspection of only those parts of the document that are relevant for the query.
     

    Mapping of standard database queries

     In order to show the mapping of standard database queries we use a simple example, taken from a standard textbook on database systems (Date, “An Introduction to Databases”), which demonstrates the standard operations of union, intersect, minus, select, project and join. In each case we will show the corresponding CQL notation.
     Date uses one example consistently throughout his textbook. It concerns an education database of a commercial firm in which both teachers and students are employees. The courses are defined by course#, title and description. Each course has, possibly, prerequisite courses and is offered on a number of dates and locations. It has a teacher and students, presented with their names; in addition, students have grades.
     The structure of this database may be described by the following DTD.
     
    <!ELEMENT Data_Base    (Course+) >
    <!ELEMENT Course       (Coursenr, Title, Description, Prereq*, Offering+) >
    <!ELEMENT Prereq       (Coursenr, Title) >
    <!ELEMENT Offering     (Date, Location, Format, Teacher, Student+) >
    <!ELEMENT Teacher      (Empnr, Name) >
    <!ELEMENT Student      (Empnr, Name, Grade) >
    <!ELEMENT Coursenr     (#PCDATA) >     <!ELEMENT Format (#PCDATA) >
    <!ELEMENT Title        (#PCDATA) >     <!ELEMENT Empnr  (#PCDATA) >
    <!ELEMENT Description  (#PCDATA) >     <!ELEMENT Name   (#PCDATA) >
    <!ELEMENT Date         (#PCDATA) >     <!ELEMENT Grade  (#PCDATA) >
    <!ELEMENT Location     (#PCDATA) >
    
     The basic retrieving functions in terms of CQL are :
     UNION : retrieve "all Course#'s for courses which either have as a prerequisite Course#=10 or which are located in Stockholm" :
     S ::= - <Course><Coursenr> ? {R:1} </Coursenr> - (<Prereq><Coursenr> ‘10’ </Coursenr> - </Prereq> | <Offering> - <Location>’Stockholm’</Location> - </Offering>) - </Course> -.
     Which may be abbreviated to:
     S ::= - <Course><Coursenr> ? {R:1} </> - (<Prereq><Coursenr> ‘10’ | <Offering> - <Location>’Stockholm’).
     (In the sequel only the abbreviated notation is used.)
     INTERSECT : retrieve "all Course#'s for courses which have as a prerequisite Course#=10 and which are located in Stockholm"(straightforward syntactic variation of UNION) :
     S ::= - <Course><Coursenr> ? {R:1} </> - <Prereq><Coursenr> ‘10’ <../Prereq><Offering> ‘Stockholm’.
     MINUS : retrieve "all Course#'s for courses which have as a prerequisite Course#=10 and which are not located in Stockholm" (straightforward syntactic variation of INTERSECT) :
     S ::= - <Course><Coursenr> ? {R:1} </> - <Prereq><Coursenr> ‘10’ <../Prereq><Offering> ^‘Stockholm’.
     SELECT : retrieve "all Course#'s for courses in Amsterdam" :
     S ::= - <Course><Coursenr> ? {R:1} - </><Offering>’Amsterdam’.
     PROJECT: retrieve "all Emp# of all students" :
     S ::= - <Course>- <Offering>- <Student><Empnr> ? {R:1}.
     JOIN (with PROJECT) : retrieve "all Course#'s for courses where the student is the teacher of one of the prerequisite courses" :
     S ::= - <Course> S1(c1, t1) & S2(c2, t2) {t1 == t2 & c1 == c2}.
     S1(c1, t1) ::= <Coursenr> ? {R:1} </> - <Prereq><Coursenr> ? {c1 = %} </> - <Offering> - <Student> ? {t1 = %}.
     S2(c2, t2) ::= <Coursenr> ? {c2 = %} </> - <Offering> - <Teacher> ? {t2 = %}.
     Here the variables c1, c2, t1 and t2 are used. They get assigned the last symbol that has been read in (denoted by the %-sign).
     The queries in and can be expressed in a similar way.
     

    Principles of implementation

     

    Compilation into efficient code

     A query in CQL can be compiled into a Finite State Automaton (FSA). The method is described fully in . A few principles are described below.
     As a starting point, we use the techniques for LR parser generation ( ). With these techniques, so called itemsets are created which are associated with states in a FSA. An itemset contains grammar rules in which a dot marks the position up to where the recognition has proceeded. By extending the technique for itemset creation, the patterns of CQL are compiled into a FSA.
     As an example we show the itemset creation for a don’t care
     Suppose we have the pattern grammar
     
    S :: -  a  a
    S :: -  ?  b
    
     The starting itemset contains the two items :
     Itemset1:
     
    S :: . -  a  a
    S :: . -  ?  b
    
     Based upon these two items, two types of input symbols have to be considered: the ‘a’ and the set of all symbols that are not ‘a’, which set we denote by the “rest symbol” ‘.’.
     We construct, as the successor itemset for ‘a’ :
     Itemset2:
     
    S ::  -  a . a
    S ::  -  ? . b
    S :: . -  a  a
    S :: . -  ?  b
    
     As the successor itemset for ‘.’ we construct:
     Itemset3:
     
    S ::  -  ? . b
    S :: . -  a  a
    S :: . -  ?  b
    
     In the same manner we may construct from Itemset3 successor itemsets for ‘a’ and ‘b’. The rest symbol here denotes all symbol which are not ‘a’ and ‘b’. Upon the rest symbol we create an itemset which is identical to Itemset3. Therefore, Itemset3 (or, to be precise, the associated state in the generated FSA) loops on the rest symbol.
     In order to deal with Boolean negations and their combinations with regular expressions and nonterminals the itemdot is split into a number of typed dots. Rules are introduced which govern transitions of the types of itemdots.
     The interested reader is referred to chapter 5 of .
     

    Addition of indexes

     States with a loop literally consume the input up to the point where symbols occur on which they have a transition to another state. These looping states are candidates for run-time optimizations.
     Two optimizations suggest themselves.
     
    1. If a state is waiting solely for an end-tag, the scanner may immediately skip to that tag. The strategy may vary: if the scanner is operating on the document tree, it can immediately jump to the parent node. If it operates on a document it could make use of an index which contains the position of end-tags.
    2. If a state is waiting for a number of input symbols, an index could be accessed which contains the positions of symbols. The type of the index depends upon the types of symbols that the scanner will accept. For instance, CQL could be configured as such that within a query only complete words may be specified. In that situation a free text index may be used, eventually extended with the ancestor context of (start-)tags. If more patterns within words are allowed, PAT trees could be exploited.
     In general it may be worthwhile to continue research on the type and optimization of indexes when they have to cooperate with pattern matching automata.
     

    Implementation and experiences

     CQL originated as an extension of a generalized system for recognition, parsing and transduction of natural language texts (“Parspat”: apar ser forpat tern grammars). The extension was motivated by the shortcomings of relational database systems for the handling of tree-structured documents and their poor facilities for the handling of free text. A number of research projects in the Faculty of Arts of the University of Amsterdam has made extensive use of the system. The expressive power of the formalism of CQL proved itself rich enough for the successful exploitation of enriched corpora of texts, encoded music, historical estate inventories and bibliographic records.
     

    Conclusion

     With CQL it is possible to express all queries for structured documents and standard database queries. It meets already most of the requirements of .
     Queries can be compiled into efficient code for a FSA with skip instructions. The formalism and formal machine of CQL is extendable towards higher grammar formalisms for e.g. Natural Language Parsing.
     Additional work has to be done to profit from traditional database indices. In that respect, it is worth to develop more refined types of indices to optimize the operation of the formal automaton.
     Bibliography
     
    Aho86 Compilers. Principles, Techniques and Tools. Aho, A.V., Sethi, R. and Ullman, J.D., Addison Wesley, 1986
     
    Cotton98 Candidate Requirements for XML Query, Paul Cotton and Ashok Malhotra, 1998. In Query Languages'98 (QL'98). Available at http://www.w3.org/TandS/QL/QL98/pp/queryreq.html .
     
    Fernandez99 XML Query Languages: Experiences and Exemplars, Mary Fernandez, Jérôme Siméon, Philip Wadler, 1999. Available at http://www.w3.org/1999/09/ql/docs/xquery.html .
     
    Maier98 Database Desiderata for an XML Query Language, David Maier, 1998. In Query Languages'98 (QL'98). Available at http://www.w3.org/TandS/QL/QL98/pp/maier.html .
     
    Robie99 The Tree Structure of XML Queries, Jonathan Robie. Available at http://www.w3.org/1999/10/xquery-tree.html .
     
    VanderSteen88 A Program-Generator for Recognition, Parsing and Transduction with Syntactic Patterns, Van der Steen, G.J., CWI Tracts volume 55, Centre for Mathematics and Computer Science, Amsterdam, 1988, 284 pp. Available at http://www.palstar.nl .
     
    XmlQuery-req XML Query Requirements. W3C Working Draft 31 January 2000. Available at http://www.w3.org/TR/xmlquery-req .

    Quilt: an XML query language   Table of contents   Indexes   Spatial/temporal datatypes