![]() |
Quilt: an XML query language | Table of contents | Indexes | Spatial/temporal datatypes | ![]() |
|||
XML ![]() query language ![]() | A canonical query language & its efficient implementation |
van der Steen, Gert ![]() |
| Gert van der Steen |
| Senior Consultant |
Palstar bv ![]() The Netherlands ![]() Uffelte ![]() | Palstar bv,
Winkelsteeg 5a Uffelte 7975 PV The Netherlands Phone: +31 521 351077 Fax: +31 521 351078 email: palstar@xs4all.nl web site: www.palstar.nl |
| Biography |
| Abstract |
| A number of query languages have been initially proposed for XML. Also some simple query mechanisms are part of (proposed) recommendations, like in XSL and Xpath. |
Introduction |
| A Query Language for XML has yet to be defined. In the W3C published a Working Draft on “XML Query Requirements”, with references to , , and for background discussion. |
| Two major approaches still have to be blend: one from the database community and one from the document community. We quote from : |
| NLP | In our approach we experimented with formalisms forNLP and extended them withPM facilities. As a subset of the formalism, a query language for structured documents originated. The query language can be read as an Extended Context Free Grammar with orthogonal extensions for Pattern Matching. It may be expressed in XML or may be written as a superset of XPath. It is the richest form of the selection part of any of the XML query languages proposed so far and provides for the expression of Selection, Projection and Join. Therefore, we call it, for the moment,CQL . |
CQL, Canonical Query Language ![]() PM | The main difference with the approaches up till now is that a query is expressed as a pattern grammar, akin to a DTD enriched with wild cards, Boolean constructs and variables, to be applied to both markup and #pcdata. |
CQL, Canonical Query Language ![]() | This paper discusses some of the design considerations ofCQL and the architecture of the related processes. The relation with the requirements of is indicated. The mapping ontoCQL of some proposed XML query languages and of standard relational database queries is shown, together with the straightforward extensibility towards pattern grammars for natural language. |
Design considerations of CQL |
Relation with NLP and PM |
| In our approach, a document query may be viewed as a pattern that has to be located in a document. As such the processing of a query is akin to syntactic pattern recognition. |
CQL, Canonical Query Language ![]() FSA, Finite State Automaton ![]() | Very powerful query languages may be constructed by adding constructs for patterns, like wildcards and Boolean operators, to linguistic grammar formalisms. There is a price: runtime performance may be reduced dramatically. However, staying within the Chomsky hierarchy, more efficient programs for formal automata can be generated.CQL is a pattern grammar formalism that is stripped down in such a way that queries within CQL may be compiled into aFSA . A set of separate document queries, including their context, may be combined into one pattern grammar and therefore may be compiled into one (deterministic)FSA . In the formalism, no sequence of processing is indicated. Partial overlapping patterns (like ‘abc’ and ‘bcd’) do not cause problems. |
FSA, Finite State Automaton ![]() | Relation with database query languages |
| By allowing grammar-notations on the horizontal axes, the possibility of linguistic descriptions on free text is introduced. |
| The constructs for patterns allow for universal and existential quantification. |
| CQL provides for variables which may get an assignment from the document. Operations on the variables make possible the functionality of the “join” operator. |
Architecture of processes |
| The processes for CQL have the same anatomy as those for a grammatical system. |
| This architecture facilitates in a modular way most of the requirements of . |
Notation of CQL |
| The syntax of CQL may get the syntactic binding of XML or that of the forthcoming Schema language, but it may also be regarded as an extension to XPath. The syntax is denoted as an EBNF grammar (as used for the definition of e.g. XML itself) with extensions for patterns, variables, Boolean constructs, document trees and notations for output. |
| Because we deal with pattern grammars, one might say that a document is matched by a pattern grammar. |
| A document is not required to contain markup. |
Grammar formalism |
| The grammar contains Nonterminals (starting with a capital) and terminals. In this paper the terminals are single characters or strings, but they could be of other datatypes. Characters and strings are delimited by single or double quotes. |
| The reserved symbols of XML are not quoted in order to distinguish them from strings in #PCDATA. |
| Nonterminals are rewritten in grammar rules according to the EBNF syntax. Rules may be left and/or right recursive, but infix recursion is not allowed. |
| On the right-hand side of a rule, the meaning of symbols in regular expressions is as shown below. The notation deviates slightly from the usual notation in order to remain consistent with the syntax of DTD’s. |
('a..z' | 'A..Z') (instead of [a-zA-Z])
represents any character with a value in the range(s) indicated
^'a..z' (instead of [^a-z])
represents any character with a value outside the range indicated
^('a'|'b'|'c') (instead of [^abc])
represents any character which is not 'a', 'b' or 'c'
A, 'b'
the nonterminal A followed by the character 'b'.
|
| If r, r1 and r2 are regular expressions then the following ones are also regular expressions (as in XML): |
expression meaning r1 r2 : concatenation of r1 and r2 r1| r2 : r1 or r2 (r)* : 0 or more repetitions of r (r)+ : 1 or more repetitions of r (r)? : 0 or 1 r. |
Actions |
| At any place in a right-hand side (rhs) of a rule so called “actions” may be placed: a sequence of operations written between braces. The most important ones are operations for variables, Boolean constructs between rules and for output. |
| The actions may be placed before a ',', a '|', a ')' and a '.'. Actions are executed between the matching of two symbols. In the details are described of the combination of actions and regular expressions. |
| Examples of actions are included in the next subsections. |
Patterns |
| The reserved symbols for patterns are: |
|
| The line sign has its equivalent for the vertical axes in a tree. |
|
| At the end of a query trailing sequences of lines and end-tags may be left out. |
| Examples. |
|
Boolean negation within a rule |
| The negation of a string s matches all strings with the same length, but not equal to s. |
| The negation of a nonterminal N matches all strings with length between the length of the shortest and the largest string that can be matched by N, but not equal to any string that can be matched by N. |
| The negation of a regular expression E is the same as for a nonterminal, if the nonterminal rewrites to E. |
| Examples |
^’a’ an arbitrary character, but not an 'a'
^ (‘abc’) a sequence of 3 characters, which is not 'abc'
^A where A is a nonterminal. ^A matches any sequence of
the same length as one of the sentences matched by A,
which is not equal to such a sequence
^ ('ab' | 'bcd') a sequence of 2 or 3 characters, which is not 'ab'
and not 'bcd'
^ (- a..c) any sequence of characters which does not end in 'a',
'b' or 'c'
^ (- (A | B) C -) any sequence of characters which does not contain a
string which may be matched by A C or B C
NP::= - (^Adj) N - . a N(oun) P(hrase) with a N(oun), but without an
Adj(ective) preceding that Noun. (Adj and N have to
be rewritten further.)
|
Boolean intersection of rules |
| Rules may be intersected with the symbol '&';. |
1. S ::= - P1 - & - P2 -. |
| S matches any string (between a start- en end-tag) that contains both P1 and P2. |
| This may be refined with the ‘cooperation symbol’ C, written within actions and suffixed by a label for a cooperation point. |
2. S ::= - P1 - {C:1} P3 - {C:2} | - P2 - {C:1} P4 - {C:2}.
|
| Here we express that S matches strings w = u v such that u is matched by - P1 - as well as by - P2 - and that v is matched by - P3,- as well as by - P4 -. We might say that the matching of P3 and P4 has to wait for the matching of P1 and P2, and that at least at the end of the input both alternatives of the rhs have to be matched. The cooperation points are ‘1’ and ‘2’. |
| This resembles the cooperation of parallel processes. The two rhs's may be waiting for each other at the rendezvous 1, consuming input, which is possible because of the appearance of lines, which may match an indefinite number of input symbols. |
| In general, when all rhs's which contain the same cooperation name are matched up to the location where the name is denoted as a cooperation symbol then all these rhs's may continue, otherwise they all fail. |
| The device of cooperating rules is especially useful in the specification of complex conditions in trees in a non-procedural way. |
| The following pattern grammar: |
S ::= - Subject - {C: SVO} | - Verb - {C: SVO} | - Object - {C: SVO}.
|
| is equivalent to: |
S ::= - Subject - & - Verb - & - Object -. |
Tree structure and XML markup |
| In a CQL query all XML markup is denoted in the same way as it is written in an XML document. |
| The end-tag </> closes the preceding start-tag. |
| The reserved symbols of XML are not quoted. Strings occurring in markup, like names of tags and attributes, values of attributes, PI’s etcetera can be treated with patterns like the text in #PCDATA. |
| For instance, the query: |
S ::= <..> - < - ^’ing’ - > - ‘Charles.’</><../> . |
| returns all tags with tagnames which do not contain the string “ing” and which #PCDATA ends on ‘Charles.’. |
| Attributes are denoted in the same style as in XPath: they are written as if they are tags themselves, but they start with ‘<@’. |
| For instance, the above query may be extended towards: |
S ::= <..> - < - ^’ing’ - ><@’title’><@ - ’number=’ 1..3> - ‘Charles.’</><../> . |
| where one attribute name has to be ‘title’ and another attribute name has to end on ‘number’ and has to have a value between 1 and 3. It is up to the scanner to allow attributes to appear in any sequence. |
Variables |
| A variable is declared by its first appearance in a rhs. The scope of the variable is the grammar rule with that rhs. The name and the value of a variable are strings of arbitrary length, following the syntax of attributes in XML. With the coming Schemata, typed variables may be added (as we did in our implementation). |
| The possible operations on variables, written in C-style, may be summarized by : |
| Nonterminals may be enriched with parameters. The parameters of nonterminals at a lhs have to be preceded by the declarations "I:", "O:" or "IO:", which correspond respectively with the features "inherited", "synthesized" and the combination of both, which are familiar in attribute grammars. |
| Examples |
|
Reports |
| There are two report functions. The "S"-function (for "signal") reports number which follows the S. The "R"-function (from "report") does the same, but adds to it the last symbol that is read in. There may be more symbols in case the notion before the action where the report-function is denoted is an arb ('=') or a line ('-'). In that case all the symbols which are covered by that notion are concatenated and reported. Consecutive reports with the same number are merged into one report with all symbols concatenated. |
| The numbers will be reported together with their position in the document after a pattern matches completely. The position may be a filepointer or a tree pointer. Alternatives and wild cards are responsible for the generation, in principle, of overlapping sequences of reports. The system may represent the overlap efficiently in a dag structure. |
| Examples |
| 1. pattern matching with 4 keywords, signaling a match at the end of a keyword. |
A ::= - 'he' {S:1} - | - 'she' {S:2} - | - 'his' {S:3} - | - 'hers' {S:4} - .
|
| The same result is obtained by writing : |
A ::= - 'he' {S:1} 'rs' {S:4} - | - 'she' {S:2} - | - 'his' {S:3} - .
|
| If the input is 'ushers' then the output message will be '(1, pos1) (2, pos2) (4, pos4)', where posi is a position in the document. |
B ::= C {R:1} = {R:2} (d ? {R:3} )+ e .
C ::= c | f .
|
| If the input is 'cabcdededee' then the output message will be '(1c, pos1) (2abc, pos2) (3eee, pos3) '. |
Examples of CQL queries |
Estate inventories |
| The following is an example from estate inventories from the 17th century in the Dutch city of Delft. |
| Suppose the document is : |
<BOEDEL><KLASSE>D</KLASSE> <GEZIN>G</GEZIN> <ORG><RN>HUIS<KAT><VR><VO>HUIS</VO> <PR>WOON</PR> </VR> </KAT> LAND<KAT><VR><VO>TUIN</VO> </VR> </KAT> EFFECT<KAT><VR><VO>OBLIGATIE</VO> <BIJZ>FRANKRIJK</BIJZ> <AA>1 GELD</AA> <HOEV><GT>2</GT></HOEV> </VR> <TAX>2</TAX> <VR><VO>OBLIGATIE</VO> <BIJZ>PLANTERS OP DE EILANDEN</BIJZ> <RENTE>4</RENTE> </VR> <TAX>1</TAX> </KAT> </RN> <ORG> </BOEDEL> |
| Query : "Give a report of all taxation’s of stocks (EFFECT) for people who own houses and land" |
| CQL: |
S ::= - <BOEDEL> - <ORG><RN>A & B & C</RN> - </BOEDEL> .
A ::= - ‘HUIS’ <KAT> - .
B ::= - ‘LAND’ <KAT> - .
C ::= - ‘EFFECT’ <KAT> - <TAX> ? {R:1} </TAX> - </KAT> - .
|
| Result : The input will be matched, and both '2' and '1' will be matched by the '?'. |
Parse trees |
| The second example concerns the matching at arbitrary levels in a parse tree. |
| If the tree is : |
<S>NP
<VP>V
<NP>ADJ
N
<REL>...
<NP>DET
ADJ
N
</NP>
...
</REL>
</NP>
</VP>
</S>
|
| and the pattern grammar is : |
ADJ-NP ::= <S><..> - <NP> - ‘ADJ’ - </NP> - <../></S> . |
| which may be abbreviated to: |
ADJ-NP ::= <S><..NP> - ‘ADJ’. |
| then the result will be two matches in the tree as shown below : |
|
Enriched corpora |
| The third example concerns the querying of enriched corpora of natural language texts. |
| The followingDTD describes a corpus with paragraphs of different authors. The words in the sentences are coded with their lemma and with a morphological code (like ‘noun’ or ‘adverb’). |
<!ELEMENT corpus (subcorpus*) > <!ELEMENT subcorpus (identification, paragraph*) > <!ELEMENT paragraph (heading, body) > <!ELEMENT heading (date) > <!ATTLIST heading author CDATA #REQUIRED> <!ELEMENT body (sentence)* > <!ELEMENT sentence (word_group)* > <!ELEMENT word_group (word, lemma, morphological_coding) > <!ELEMENT identification (#PCDATA) > <!ELEMENT author (#PCDATA) > <!ELEMENT word (#PCDATA) > <!ELEMENT morphological_coding (#PCDATA) > |
| Query : retrieve all occurrences in the work of Shakespeare where a sentence ending with a form of the lemma 'work' is followed by a sentence ending with a noun or an adjective in 'ing'. |
| CQL : |
S ::= - <..heading><@author=’Shakespeare’> - <../heading><body> - CONT1. CONT1 ::= <sentence> - <word_group> - <lemma>’work’</><../sentence> CONT2. CONT2 ::= <sentence> - <word_group><word> - ‘ing’</word> - <morphological_coding> (‘noun’ | ‘adjective’) </><../sentence>. |
Comparison with XPath |
| It is straightforward to show that XPath is a subset of CQL. |
| The difference in philosophy is that with a location path in XPath a specific node in the document tree is reached. In CQL a query describes a complete document (or document tree), where wildcards and Boolean operators denote which parts are relevant or not. |
| Implicit in a CQL query are an indefinite number of location paths and nodes which may be located. No precedence constructs are necessary because all contexts are implicit in the pattern grammar. The sole mechanism for the creation of reports are the notations for Signals and Reports which indicate precisely a location in the document (be it implemented as a filepointer, a pointer in a Dom or in a document tree). However, the report is only generated when the whole pattern grammar matches the whole document. That means that for any point in the document any context may be specified along both horizontal and vertical axes in the document tree. |
| Note that the description above does not mean that a whole document (tree) has to be matched during runtime. Optimization techniques allow for inspection of only those parts of the document that are relevant for the query. |
Mapping of standard database queries |
| In order to show the mapping of standard database queries we use a simple example, taken from a standard textbook on database systems (Date, “An Introduction to Databases”), which demonstrates the standard operations of union, intersect, minus, select, project and join. In each case we will show the corresponding CQL notation. |
| Date uses one example consistently throughout his textbook. It concerns an education database of a commercial firm in which both teachers and students are employees. The courses are defined by course#, title and description. Each course has, possibly, prerequisite courses and is offered on a number of dates and locations. It has a teacher and students, presented with their names; in addition, students have grades. |
| The structure of this database may be described by the following DTD. |
<!ELEMENT Data_Base (Course+) > <!ELEMENT Course (Coursenr, Title, Description, Prereq*, Offering+) > <!ELEMENT Prereq (Coursenr, Title) > <!ELEMENT Offering (Date, Location, Format, Teacher, Student+) > <!ELEMENT Teacher (Empnr, Name) > <!ELEMENT Student (Empnr, Name, Grade) > <!ELEMENT Coursenr (#PCDATA) > <!ELEMENT Format (#PCDATA) > <!ELEMENT Title (#PCDATA) > <!ELEMENT Empnr (#PCDATA) > <!ELEMENT Description (#PCDATA) > <!ELEMENT Name (#PCDATA) > <!ELEMENT Date (#PCDATA) > <!ELEMENT Grade (#PCDATA) > <!ELEMENT Location (#PCDATA) > |
| The basic retrieving functions in terms of CQL are : |
| UNION : retrieve "all Course#'s for courses which either have as a prerequisite Course#=10 or which are located in Stockholm" : |
| S ::= - <Course><Coursenr> ? {R:1} </Coursenr> - (<Prereq><Coursenr> ‘10’ </Coursenr> - </Prereq> | <Offering> - <Location>’Stockholm’</Location> - </Offering>) - </Course> -. |
| Which may be abbreviated to: |
| S ::= - <Course><Coursenr> ? {R:1} </> - (<Prereq><Coursenr> ‘10’ | <Offering> - <Location>’Stockholm’). |
| (In the sequel only the abbreviated notation is used.) |
| INTERSECT : retrieve "all Course#'s for courses which have as a prerequisite Course#=10 and which are located in Stockholm"(straightforward syntactic variation of UNION) : |
| S ::= - <Course><Coursenr> ? {R:1} </> - <Prereq><Coursenr> ‘10’ <../Prereq><Offering> ‘Stockholm’. |
| MINUS : retrieve "all Course#'s for courses which have as a prerequisite Course#=10 and which are not located in Stockholm" (straightforward syntactic variation of INTERSECT) : |
| S ::= - <Course><Coursenr> ? {R:1} </> - <Prereq><Coursenr> ‘10’ <../Prereq><Offering> ^‘Stockholm’. |
| SELECT : retrieve "all Course#'s for courses in Amsterdam" : |
| S ::= - <Course><Coursenr> ? {R:1} - </><Offering>’Amsterdam’. |
| PROJECT: retrieve "all Emp# of all students" : |
| S ::= - <Course>- <Offering>- <Student><Empnr> ? {R:1}. |
| JOIN (with PROJECT) : retrieve "all Course#'s for courses where the student is the teacher of one of the prerequisite courses" : |
| S ::= - <Course> S1(c1, t1) & S2(c2, t2) {t1 == t2 & c1 == c2}. |
| S1(c1, t1) ::= <Coursenr> ? {R:1} </> - <Prereq><Coursenr> ? {c1 = %} </> - <Offering> - <Student> ? {t1 = %}. |
| S2(c2, t2) ::= <Coursenr> ? {c2 = %} </> - <Offering> - <Teacher> ? {t2 = %}. |
| Here the variables c1, c2, t1 and t2 are used. They get assigned the last symbol that has been read in (denoted by the %-sign). |
| The queries in and can be expressed in a similar way. |
Principles of implementation |
Compilation into efficient code |
| A query in CQL can be compiled into a Finite State Automaton (FSA). The method is described fully in . A few principles are described below. |
| As a starting point, we use the techniques for LR parser generation ( ). With these techniques, so called itemsets are created which are associated with states in a FSA. An itemset contains grammar rules in which a dot marks the position up to where the recognition has proceeded. By extending the technique for itemset creation, the patterns of CQL are compiled into a FSA. |
| As an example we show the itemset creation for a don’t care |
| Suppose we have the pattern grammar |
S :: - a a S :: - ? b |
| The starting itemset contains the two items : |
| Itemset1: |
S :: . - a a S :: . - ? b |
| Based upon these two items, two types of input symbols have to be considered: the ‘a’ and the set of all symbols that are not ‘a’, which set we denote by the “rest symbol” ‘.’. |
| We construct, as the successor itemset for ‘a’ : |
| Itemset2: |
S :: - a . a S :: - ? . b S :: . - a a S :: . - ? b |
| As the successor itemset for ‘.’ we construct: |
| Itemset3: |
S :: - ? . b S :: . - a a S :: . - ? b |
| In the same manner we may construct from Itemset3 successor itemsets for ‘a’ and ‘b’. The rest symbol here denotes all symbol which are not ‘a’ and ‘b’. Upon the rest symbol we create an itemset which is identical to Itemset3. Therefore, Itemset3 (or, to be precise, the associated state in the generated FSA) loops on the rest symbol. |
| In order to deal with Boolean negations and their combinations with regular expressions and nonterminals the itemdot is split into a number of typed dots. Rules are introduced which govern transitions of the types of itemdots. |
| The interested reader is referred to chapter 5 of . |
Addition of indexes |
| States with a loop literally consume the input up to the point where symbols occur on which they have a transition to another state. These looping states are candidates for run-time optimizations. |
| Two optimizations suggest themselves. |
|
| In general it may be worthwhile to continue research on the type and optimization of indexes when they have to cooperate with pattern matching automata. |
Implementation and experiences |
| CQL originated as an extension of a generalized system for recognition, parsing and transduction of natural language texts (“Parspat”: apar ser forpat tern grammars). The extension was motivated by the shortcomings of relational database systems for the handling of tree-structured documents and their poor facilities for the handling of free text. A number of research projects in the Faculty of Arts of the University of Amsterdam has made extensive use of the system. The expressive power of the formalism of CQL proved itself rich enough for the successful exploitation of enriched corpora of texts, encoded music, historical estate inventories and bibliographic records. |
Conclusion |
| With CQL it is possible to express all queries for structured documents and standard database queries. It meets already most of the requirements of . |
| Queries can be compiled into efficient code for a FSA with skip instructions. The formalism and formal machine of CQL is extendable towards higher grammar formalisms for e.g. Natural Language Parsing. |
| Additional work has to be done to profit from traditional database indices. In that respect, it is worth to develop more refined types of indices to optimize the operation of the formal automaton. |
| Bibliography |
|
|
|
|
|
|
|
![]() |
Quilt: an XML query language | Table of contents | Indexes | Spatial/temporal datatypes | ![]() | |||