| XML Messaging at Chase Manhattan Bank Global Markets | Table of contents | Indexes | WDDX: Distributed Data for the Web | |||
Querying XML |
Keywords: Query Languages, Repositories, Searching, XML, XML Query Language, XQL |
Jonathan Robie
Research Consultant, Texcel Research, Inc.
Durham, NC 27703, USA
Biographical notice: Jonathan Robie is a Research Consultant at Texcel Research, Inc. His primary interests include query languages, programming APIs, database architectures, and web-based standards for structured documents, especially XML documents. Mr. Robie represents Texcel on several W3C Working Groups, and he is an editor for the W3C Document Object Model Working Group. Before joining Texcel he was the SGML Product Manager at POET Software, where he was a lead architect for an SGML/XML repository. Mr. Robie has three years' experience with SGML and XML repository design, seven years' experience with object-oriented databases, object-oriented design, and object-oriented languages, and a total of eleven years' post-graduate experience as a computer scientist. He has an MS in Computer Science from Michigan State University. |
Joe Lapp |
David Schach |
Introduction |
| XML documents are structured documents - they blur the distinction between data and documents, allowing documents to be treated as data sources, and traditional data sources to be treated as documents. Some XML documents are nothing more than an ASCII representation of data that might traditionally have been stored in a database. Others are documents that contain very little structure beyond the use of headers and tables. Still others are somewhere in between, e.g. reference works like dictionaries or technical manuals, documents in which "looking something up" is a long-standing tradition that predates computers. Yet other kinds of documents, not commonly entered as structured documents, become incredibly useful as sources of data when properly encoded in XML; for instance, a patient record encoded in XML can become a rich data source for queries about medical history, diagnoses, treatments, and billing information. |
| As more and more information is either stored in XML, exchanged in XML, or presented as XML through various interfaces, the ability to intelligently query our XML data sources becomes increasingly important. For instance, consider the applications mentioned in Jon Bosak's seminal paper "XML, Java, and the future of the Web" (http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html). When XML is used as a universal interchange format, it is often desirable to also have a universal query language for requesting relevant data. When applets use Java to persist or parse data, it is helpful to allow them to query for the data they need. When multiple views of document data are desired, a query language is an ideal means of specifying these views. Intelligent agents using XML for data discovery are much more powerful if they can discover and query their data sources. In short, most of the applications to which XML is particularly well suited are enhanced by the availability of a suitable query language. |
What is an XML Query? |
| In SGML and XML circles, there has been a great deal of discussion about exactly what constitutes a query for structured documents. The question initially strikes most people as simple, but it has a variety of possible answers; when people ask whether systems such as XPointers or XSL patterns provide queries, the question cannot be answered without first defining what is meant by a query. There are a number of legitimate ways to define queries for documents, and the word "query" has been used in a variety of ways in computer science, so the definition we give in this section is not normative for all systems that implement queries for XML. Our purpose here is merely to explain what we mean by a query in the context of XQL, and to present a simple model that will serve as a framework for the rest of this article. |
Queries, search contexts, and result sets |
| To examine the characteristics of an XML query, it is useful to consider four basic questions about the environment in which a query takes place: |
| We would like to examine these questions for traditional relational databases and for XQL. In an SQL database, these are the standard answers: |
| XQL takes an approach that is analogous to relational databases, but with significant differences: |
| To illustrate these concepts more concretely, let's look at a relatively simple XQL query, examining the input to the query, the query itself, and the result. In this example, the input to the query (known as the "search context") is a single <novel> element, which is the root of a document: |
| Search Context: |
<novel>
  <front>
    <title>The Heart of Darkness</title>
    <author>Joseph Conrad</author>
  </front>
</novel>
| In XQL, the simplest possible query is an unadorned string, which represents an element name. Thus, "novel" is a full query, and asks for all <novel> elements from the current search context: |
| Query: |
novel |
| The result set of this query is the set of all <novel> elements in the search context. For our example, since there is only one <novel> element, the result set is equivalent to the search context: |
| Result Set: |
<novel>
  <front>
    <title>The Heart of Darkness</title>
    <author>Joseph Conrad</author>
  </front>
</novel>
| Note that both the search context and the result set for this example contain one node each. We have shown the children of the <novel> element in both the search context and the result set because environments that return XQL results as ASCII would return the children as well. |
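The relationship between search context, query, and result set can be sketched in a few lines of Python using the standard library's ElementTree. This is only an illustration of the model, not an XQL implementation: the helper name `query_by_name` is hypothetical, and we assume the search context is held as a list of nodes.

```python
import xml.etree.ElementTree as ET

NOVEL_XML = (
    "<novel><front>"
    "<title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author>"
    "</front></novel>"
)

def query_by_name(search_context, name):
    """Return the nodes of the search context whose element name matches."""
    return [node for node in search_context if node.tag == name]

# The search context is a single <novel> element.
search_context = [ET.fromstring(NOVEL_XML)]

# The query "novel" selects every <novel> element in that context.
result_set = query_by_name(search_context, "novel")

# An environment that returns results as text serializes the children too.
serialized = [ET.tostring(node, encoding="unicode") for node in result_set]
```

Here the result set holds the same single node as the search context, and serializing that node also writes out its children.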
Result Sets and Result Documents |
| In many environments it is useful for the results of a query to be presented as well-formed XML documents. Some reasons for this include: |
| In the example we showed above, the result set contains only one node. Whenever a query returns more than one node, though, a text representation of the result set is not a well-formed XML document, because an XML document can have only one root node. Suppose we have a result set containing more than one node: |
| A result set that contains more than one node |
<title>The Heart of Darkness</title> <author>Joseph Conrad</author> |
| Because this result set contains two nodes, it is not a well-formed XML document. However, if we wrap the nodes of this set in a common root element, we then have a well-formed XML document. Therefore, the "result document" of an XQL query always wraps the nodes of the result set in an <xql:result> element: |
| A well-formed result document containing the above result set |
<xql:result>
  <title>The Heart of Darkness</title>
  <author>Joseph Conrad</author>
</xql:result>
| Environments that do not need queries to return well-formed XML documents generally work with result sets. Those that do work with result documents. |
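The step from a result set to a result document can be sketched as follows. The function name `to_result_document` is hypothetical. Because no namespace URI is given here for the `xql` prefix, the sketch writes the prefixed name literally and leaves it undeclared; a namespace-aware implementation would need to bind it.

```python
import xml.etree.ElementTree as ET

def to_result_document(result_set):
    """Wrap the nodes of a result set in a single <xql:result> root element."""
    # Assumption: the "xql" prefix is written as-is, with no namespace
    # declaration, since the text does not supply a namespace URI.
    wrapper = ET.Element("xql:result")
    wrapper.extend(result_set)
    return ET.tostring(wrapper, encoding="unicode")

front = ET.fromstring(
    "<front><title>The Heart of Darkness</title>"
    "<author>Joseph Conrad</author></front>"
)

# A result set with two nodes: <title> and <author>.
result_set = list(front)
result_document = to_result_document(result_set)
```

The wrapper gives the two nodes a common root, which is what makes the serialized result a single well-formed document.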
XQL Quickstart |
| This section gives a quick, informal overview of the simplest XQL queries, which are also likely to be the most common. |
| A simple string is taken to be an element name. For instance, this query specification addresses all <table> elements: |
table |
| The child operator ("/") indicates hierarchy. This query specification addresses <author> elements that are children of <front> elements: |
front/author |
| The root of a document may be indicated by a leading "/" operator: |
/novel/front/author |
| Note: in XQL, the root of a document refers to the document entity, in the technical XML sense, which is basically equivalent to the document itself. It is not the same as the root element, which is the element that contains the rest of the elements in the document. The document root always contains the root element, but it may also contain a doctype, processing instructions, and comments. In this example, <novel> would be the root element. |
| Paths are always described from the top down, and unless otherwise specified, the right-most element on the path is returned. For instance, in the above example, <author> elements would be returned. |
| The content of an element or the value of an attribute may be specified using the equals operator ("="). The following returns all authors with the name "Theodore Seuss Geisel": |
front/author='Theodore Seuss Geisel' |
| Attribute names begin with "@". They are treated as children of the elements to which they belong: |
front/author/address/@type='email' |
| The descendant operator ("//") indicates any number of intervening levels. The following shows addresses anywhere within front: |
front//address |
| When the descendant operator is found at the start of a path, it means all nodes descended from the document. This query will find any address in the document: |
//address |
| The filter operator ("[ ]") filters the set of nodes to its left based on the conditions inside the brackets. The following returns addresses. Each of these addresses must have an attribute called "type" with the value "email": |
front/author/address[@type='email'] |
| Note that "address[@type='email']" returns addresses, but "address/@type='email'" returns type attributes. |
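The difference between filtering on an attribute and ending a path at an attribute can be illustrated with Python's ElementTree, whose limited path support happens to accept similar syntax for child steps, descendant steps, and attribute filters. This is an analogy rather than an XQL engine, and the sample document is hypothetical; only the element and attribute names come from the queries above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration only.
novel = ET.fromstring("""
<novel>
  <front>
    <author>
      <address type="email">author@example.com</address>
      <address type="postal">1 Example Street</address>
    </author>
  </front>
</novel>
""")

# Analogue of front/author/address[@type='email']: the filter tests the
# attribute, but the <address> elements are what is returned.
email_addresses = novel.findall("front/author/address[@type='email']")

# Analogue of front/author/address/@type='email': the path ends at the
# attribute, so attribute values are what is returned. ElementTree does not
# model attributes as nodes, so they are read with get().
type_attributes = [
    address.get("type")
    for address in novel.findall("front/author/address")
    if address.get("type") == "email"
]

# Analogue of //address: addresses at any depth below the context node.
all_addresses = novel.findall(".//address")
```

The first query yields one element, the second yields one attribute value, and the third yields both addresses regardless of their type.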
| Multiple conditions may be combined using Boolean operators: |
front/author='Theodore Seuss Geisel'[@gender='male' $and$ @shoesize='9EEEE'] |
| Brackets are also used for subscripts, which indicate position within a document. The following refers to sections 0, 3, 4, 5, and 8, plus the last section: |
section[0, 3 $to$ 5, 8, -1] |
| Conditions and subscripts may not both occur in the same brackets, but both uses of brackets may occur in the same query. The following refers to the first three sections whose level attributes have the value "3"; in other words, it returns the first three "level 3" sections: |
section[@level= '3'][0 $to$ 2] |
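The way a subscript list selects positions can be sketched in Python. The function name `select_by_subscripts` and the tuple notation for "$to$" ranges are hypothetical. The sketch also assumes that out-of-range positions are ignored and that results come back in document order without duplicates, details that this overview does not settle.

```python
def select_by_subscripts(nodes, subscripts):
    """Pick nodes by zero-based position.

    Each subscript is an int, or a (first, last) pair standing in for
    "first $to$ last". Negative numbers count back from the end.
    """
    count = len(nodes)

    def normalize(index):
        return index + count if index < 0 else index

    chosen = set()
    for item in subscripts:
        if isinstance(item, tuple):
            first, last = (normalize(i) for i in item)
            chosen.update(range(first, last + 1))  # "$to$" is inclusive
        else:
            chosen.add(normalize(item))
    # Assumption: ignore out-of-range positions, return in document order.
    return [nodes[i] for i in sorted(chosen) if 0 <= i < count]

# Analogue of section[0, 3 $to$ 5, 8, -1] over twelve sections.
sections = [f"section {i}" for i in range(12)]
picked = select_by_subscripts(sections, [0, (3, 5), 8, -1])

# Analogue of section[@level='3'][0 $to$ 2]: filter first, then subscript.
levels = ["3", "1", "3", "3", "2", "3"]
level_three = [i for i, level in enumerate(levels) if level == "3"]
first_three = select_by_subscripts(level_three, [(0, 2)])
```

Applying the condition before the subscript matters: the positions in the second example count only the sections that passed the filter.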
| Now that we know the basics, let's take a look at a document and try some XQL queries on it. The following is an invoice document. Traditionally, invoices are often stored in databases, but invoices are both documents and data. XQL is designed to work on both documents and data, provided they are represented via XML through some interface. This document will be the basis for the sample queries that follow: |
| Sample data for the following queries |
| Now let's look at some sample queries. Suppose we wanted to see just the customers from the database. We could do the following query: |
//customer |
| Here is the result of the above query for our sample data: |
| Result: |
<xql:result>
  <customer> Wile E. Coyote, Death Valley, CA </customer>
  <customer> Camp Mertz </customer>
</xql:result>
| We might want to look at all the products manufactured by BSA. This query would do the trick: |
| Query: |
//product[@maker='BSA'] |
| Here are the results: |
| Result: |
<xql:result>
  <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/>
  <product maker="BSA" prod_name="snipe call" price="13.00"/>
</xql:result>
| Filters are very useful when the condition applies to a different path than the one being returned. For instance, the following query returns the products ordered by Wile E. Coyote: |
| Query: |
//invoice[customer='Wile E. Coyote, Death Valley, CA']/product |
| Here are the results for the above query: |
| Result: |
<xql:result>
  <product maker="ACME" prod_name="screwdriver" price="80.00"/>
  <product maker="ACME" prod_name="power wrench" price="20.00"/>
</xql:result>
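The idea of testing one branch while returning another can be sketched in Python. The document below is a hypothetical stand-in assembled from the results shown above, so its shape is an assumption, and the helper name `products_for_customer` is likewise hypothetical. The sketch also assumes that surrounding whitespace is ignored when comparing element content.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the invoice data, built from the results shown
# in the text; the actual sample document may be structured differently.
INVOICES = ET.fromstring("""
<invoices>
  <invoice>
    <customer> Wile E. Coyote, Death Valley, CA </customer>
    <product maker="ACME" prod_name="screwdriver" price="80.00"/>
    <product maker="ACME" prod_name="power wrench" price="20.00"/>
  </invoice>
  <invoice>
    <customer> Camp Mertz </customer>
    <product maker="BSA" prod_name="left-handed smoke shifter" price="16.00"/>
    <product maker="BSA" prod_name="snipe call" price="13.00"/>
  </invoice>
</invoices>
""")

def products_for_customer(root, customer_name):
    """Rough analogue of //invoice[customer='...']/product."""
    products = []
    for invoice in root.iter("invoice"):
        # The filter tests the <customer> branch of each invoice...
        names = [(c.text or "").strip() for c in invoice.findall("customer")]
        if customer_name in names:
            # ...but the nodes returned come from the <product> branch.
            products.extend(invoice.findall("product"))
    return products

coyote_products = products_for_customer(INVOICES, "Wile E. Coyote, Death Valley, CA")
product_names = [p.get("prod_name") for p in coyote_products]

# Rough analogue of //product[@maker='BSA'].
bsa_names = [p.get("prod_name") for p in INVOICES.findall(".//product[@maker='BSA']")]
```

The customer element never appears in the output; it only decides which invoices contribute products.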
Completing the XQL Model |
| This section introduces return operators and sequence, which are basic to the complete XQL model, but not necessary for all XQL implementations. Return operators are analogous to the SELECT statement in SQL, and allow much better control over what is returned from a query. However, they are not necessary for all applications, since many applications generally return single nodes from queries or have other very simple requirements for what is returned. Sequence allows the order in which data appears in a document to be used in query conditions, and is extremely helpful for many kinds of document data. However, many applications are not concerned about sequence. In a relational database, the sequence of rows or columns is insignificant, and relational theory explicitly states that these sequences may have no hidden meaning. In objects, the sequence in which the attributes of an object are declared has no meaning. Therefore, systems that deal primarily with these kinds of data generally do not care about sequence. |
| In the complete XQL model, conditions for individual nodes may include: |
| The basic relationships among nodes are: |
| Conditions for nodes and conditions for the relationships among nodes are combined to form path expressions. A query searches for paths within the search context that match the path expression. Return operators are used to select specific nodes from matching paths so that they will be returned from the query. They are analogous to the SELECT statement in SQL. |
| There are two kinds of return operators: |
Return Operators |
| XQL has two kinds of return operators. The shallow return operator ("?") returns just the node to which it is applied. For instance, if it is applied to an element, it does not cause attributes or children of that element to be returned. The deep return operator ("??") returns the element and all its children. Return operators can simplify queries for complex document structures. We will use the following sample data as a basis for our discussion of return operators: |
| Suppose you wanted to see all products that occur on an invoice. You could do this with the following query: |
invoice//product |
| Here are the results of the above query for our sample data: |
| Unfortunately, the results do not show which products are found on the same invoice, but this is easily fixed. A shallow return ("?") on the <invoice> element returns the <invoice> element only, providing an element within which the products can be listed; a deep return on the <product> element returns it and its contents: |
invoice?//product?? |
| If a deep return occurs in a query, only those nodes that are specified with return operators will be returned. Here are the results of the above query: |
| Suppose we wanted to see the customer for each invoice, together with the product. This can be done by specifying both <product> and <customer> using the deep return operator: |
invoice?[customer??]//product?? |
| Here are the results of this query: |
| Conditions may be added to various elements in such a query, and the results that are returned may be on a different branch from those used as the basis for conditions. For instance, the following query returns customers who ordered left-handed smoke shifters: |
invoice[customer??]//entry/product[@prod_name="left-handed smoke shifter"] |
| The following shows invoices for which Camp Mertz ordered a left-handed smoke shifter: |
invoice??[customer="Camp Mertz"]//entry/product[@prod_name="left-handed smoke shifter"] |
| We can take a fairly complex query, observe that the correct results are returned, then move the return operators around to obtain different results using the same conditions. For instance, we can take the above query, and return just the customer and the product, grouped by invoice: |
invoice?[customer??="Camp Mertz"]//entry/product??[@prod_name="left-handed smoke shifter"] |
| The only difference in these queries is the placement of the return operator. This makes it easy to recycle thought when constructing queries. |
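The distinction between shallow and deep return can be sketched in Python with ElementTree. The function names and the sample data are hypothetical, and the sketch covers only the grouping behavior of a query such as invoice?//product??, not the full return-operator semantics.

```python
import copy
import xml.etree.ElementTree as ET

def shallow_return(element):
    """Sketch of "?": the node itself, without attributes or children."""
    return ET.Element(element.tag)

def deep_return(element):
    """Sketch of "??": the node together with everything beneath it."""
    return copy.deepcopy(element)

def invoices_with_products(root):
    """Rough analogue of invoice?//product?? for invoices below root."""
    results = []
    for invoice in root.iter("invoice"):
        shell = shallow_return(invoice)
        # Products at any depth are attached directly to the invoice shell.
        shell.extend([deep_return(p) for p in invoice.iter("product")])
        results.append(shell)
    return results

# Hypothetical data; the element names follow the queries in the text.
root = ET.fromstring(
    '<invoices><invoice id="1"><customer>Camp Mertz</customer>'
    '<entry><product prod_name="snipe call"/></entry></invoice></invoices>'
)
grouped = [ET.tostring(e, encoding="unicode") for e in invoices_with_products(root)]
```

Each invoice shell carries no attributes and no customer, so it serves purely as a grouping element for the products returned beneath it.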
Sequence |
| In systems where XML is used mainly to represent data from object oriented systems or relational databases, sequence may not be particularly important. However, sequence is often very important to the meaning of documents. For instance, consider the following table: |
|
| Someone may want to ask what mode the song Shady Grove is in. In HTML, the above table may be represented like this (omitting the headers to keep the example short): |
<TABLE>
  <ROWS>
    <TR>
      <TD>Shady Grove</TD>
      <TD>Aeolian</TD>
    </TR>
    <TR>
      <TD>Over the River, Charlie</TD>
      <TD>Dorian</TD>
    </TR>
  </ROWS>
</TABLE>
| In this example, the mode for Shady Grove is found in the <TD> element that immediately follows the <TD> containing the value Shady Grove. The immediately precedes operator (;) selects adjacent nodes. This query returns both the TD that contains Shady Grove and the TD that immediately follows it: |
TD= "Shady Grove" ; TD |
| The above query searches for any sequence of two TD elements in which the first is equal to Shady Grove. However, it may be combined with hierarchy conditions for more specific searches; e.g. the following search looks for such a sequence only within a table (i.e. within the subtree found beneath a TABLE element): |
TABLE // (TD= "Shady Grove" ; TD) |
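A sketch of the immediately precedes relationship in Python follows. The helper name `adjacent_pairs` is hypothetical, and the sketch assumes that "adjacent" means adjacent siblings under the same parent.

```python
import xml.etree.ElementTree as ET

TABLE = ET.fromstring(
    "<TABLE><ROWS>"
    "<TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>"
    "<TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>"
    "</ROWS></TABLE>"
)

def adjacent_pairs(context, first_tag, first_text, second_tag):
    """Rough analogue of context // (first_tag='first_text' ; second_tag)."""
    pairs = []
    # iter() visits the context node and every descendant, mirroring "//".
    for parent in context.iter():
        children = list(parent)
        # zip pairs each child with its immediate successor.
        for left, right in zip(children, children[1:]):
            if (left.tag == first_tag and left.text == first_text
                    and right.tag == second_tag):
                pairs.append((left, right))
    return pairs

matches = adjacent_pairs(TABLE, "TD", "Shady Grove", "TD")
modes = [right.text for _, right in matches]
```

Both members of each matching pair are kept, and the second member holds the mode.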
| The previous example discusses the immediately precedes relationship, which specifies the relative position of two adjacent nodes. The precedes relationship, which specifies that one node occur prior to the other node, but does not specify that they be adjacent, is also important to the structure of documents. Consider the following excerpt from Hamlet: |
| Suppose an actor playing the ghost wants to know when to exit; that is, he wants to know who says what line just before he is supposed to exit. The line immediately precedes the stagedir, but the speaker may occur at any time before the line. In this query, we will use the precedes operator (;;) to identify a speaker that precedes the line somewhere within a speech. Our ghost can find the required information with the following query, which selects the speaker, the line, and the stagedir: |
SPEECH // (SPEAKER ;; LINE ; STAGEDIR= "Exit Ghost") |
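The combination of precedes and immediately precedes can be sketched in Python. The speech below is a made-up stand-in with placeholder lines rather than text from the play, and the helper name `exit_cues` is hypothetical. For brevity the sketch inspects only the direct children of the speech.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in: placeholder lines, not actual text from Hamlet.
SPEECH = ET.fromstring(
    "<SPEECH>"
    "<SPEAKER>HORATIO</SPEAKER>"
    "<LINE>First line of the speech.</LINE>"
    "<LINE>Last line of the speech.</LINE>"
    "<STAGEDIR>Exit Ghost</STAGEDIR>"
    "</SPEECH>"
)

def exit_cues(speech, stagedir_text):
    """Rough analogue of SPEECH // (SPEAKER ;; LINE ; STAGEDIR='...')."""
    children = list(speech)
    cues = []
    for i in range(1, len(children)):
        line, stagedir = children[i - 1], children[i]
        # ";" requires the line to sit directly before the stage direction.
        if (line.tag, stagedir.tag) != ("LINE", "STAGEDIR"):
            continue
        if stagedir.text != stagedir_text:
            continue
        # ";;" only requires the speaker to occur somewhere before the line.
        for speaker in children[: i - 1]:
            if speaker.tag == "SPEAKER":
                cues.append((speaker.text, line.text, stagedir.text))
    return cues

cues = exit_cues(SPEECH, "Exit Ghost")
```

Only the last line qualifies, because it is the one adjacent to the stage direction, while the speaker qualifies despite the intervening line.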
Application domains for XQL |
| In this paper, we have shown the need for an XML query language, discussed the kinds of queries that should be supported, and introduced the syntax that XQL uses for such queries. Now we would like to discuss some of the problem domains in which XQL queries might be useful. There may well be reasons to use other languages rather than XQL in some of these problem domains, so this section is not intended to claim that XQL must replace other approaches, but we do hope that this section illustrates the breadth of domains for which XQL is applicable. |
Addressing within or across documents |
| XQL could be used in a manner similar to XPointer syntax to address within or across documents. XQL achieves many of the same goals as XPointer syntax: it is able to address any node by attributes, location, or content; XQL query strings may be used as part of a URL; and XQL query strings may be used in XML or HTML attributes without using character entities. Hence, XQL queries could be used in links like this one: |
<a href="http://www.example.com/docs#front/author"> |
| In this case, XQL is not used as a full-fledged query, but only as a way of identifying known locations in documents. The complete XQL language may be overkill for links; if XQL were to be used for this purpose, it would probably make sense to define a subset of XQL that is appropriate for linking. The "return" type also differs from the examples we have shown so far; in most cases, links are used to jump to the referenced document, not to create a result set. |
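A client that follows such a link would first need to separate the document URL from the query carried in the fragment. A minimal Python sketch, using the standard library's `urldefrag`, is shown here; the function name `split_xql_link` is hypothetical, and how the query would then be evaluated is left open.

```python
from urllib.parse import urldefrag

def split_xql_link(href):
    """Split a link into the document URL and the query in its fragment."""
    document_url, query = urldefrag(href)
    return document_url, query

document_url, query = split_xql_link("http://www.example.com/docs#front/author")
```

The document would be fetched from `document_url`, and `query` would then identify the location within it.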
Queries within a single document |
| Queries within a single document can be useful in XML browsers or editors to allow the user to query large documents and find relevant information without scrolling through the entire document. They may be used in scripting languages to provide powerful non-procedural access to document data and structures. In addition, they may be used by document authors to define various views of a document, e.g. for users with varying background, or users with differing access rights. The input to such a query is an entire document, which may be of any size. A sophisticated implementation may have indexes defined for the document, and may have ways to avoid loading the entire document into memory; a relatively naïve implementation may simply search the entire document. The result of such a query may differ among implementations; e.g., the result may be an iterator that traverses precisely those nodes returned by the query, in document order, or it may be a set of addresses that may be used to jump to the appropriate position in the original document, or it may be a navigation object that allows the nodes returned by the query to be traversed as a virtual tree, i.e. a view containing a subset of the original document. |
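The iterator style of result, as a relatively naive implementation might provide it, can be sketched with a Python generator that searches the entire document. The name `iter_matches` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def iter_matches(document_root, predicate):
    """Lazily yield the nodes accepted by predicate, in document order."""
    # Element.iter() walks the tree depth-first, parents before children,
    # which corresponds to document order.
    for node in document_root.iter():
        if predicate(node):
            yield node

# Hypothetical document used only to exercise the iterator.
manual = ET.fromstring(
    "<manual><section level='1'><title>Setup</title>"
    "<section level='2'><title>Wiring</title></section></section>"
    "<section level='1'><title>Use</title></section></manual>"
)
titles = [
    section.findtext("title")
    for section in iter_matches(manual, lambda node: node.tag == "section")
]
```

Because the generator is lazy, a caller that needs only the first few matches does not pay for a traversal of the rest of the document.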
| Since many applications that process single documents are relatively simple, it is very important to have a query language that is relatively easy to parse and to implement. XQL is LL(1), so it is easily parsed, and because it has relatively few primitives, a simple implementation may be written fairly quickly. |
Queries in collections of documents |
| Queries in collections of documents are useful in a wide variety of settings, including document assembly using XML repositories, queries performed on a single web site or across web sites, and data mining. The input to such a query is a set of documents or a set of nodes within multiple documents. The range of possible outputs is the same as we have described for single documents above, except that the output from multiple documents must be represented. The addressing requirements are also the same, except for the additional need to be able to address the individual documents. |
| Applications that manage collections of documents must have relatively sophisticated implementations to offer adequate performance. For these applications, it is essential to be able to define suitable index structures and to avoid loading entire documents into memory. |
Patterns in transformation languages |
| XSL Patterns are used to specify tree-to-tree transformations on documents. One phase of these transformations involves testing individual nodes based on their properties and the properties of related nodes. The output of an XSL transformation is a tree, transformed using the rules specified in the XSL templates. XSL already uses a pattern language that is very similar to XQL. For instance, the <xsl:rule> element in the following example contains a query expression that identifies the elements for which the rule should be applied: |
<xsl:rule match="chapter/heading">
  <fo:block xsl:use="title-style" quadding="start">
    <xsl:process-children/>
  </fo:block>
</xsl:rule>
| Some query languages, such as XML-QL, combine transformations and queries in one language. XQL takes a more modest approach, providing operators only for simple transformations like return operators - and even these are optional features. There are at least two ways for XQL to complement XSL: as a pattern language, and as a way of extracting data from one or more documents to be transformed and formatted by XSL. |
Conclusion |
| In this paper, we have introduced XQL, a query language for XML. This language is very simple, providing a small number of primitives that are tailored to the semantic structure of XML documents. We have introduced the concept of XML queries, illustrated the XQL language with a variety of examples, and discussed the model that underlies the XQL syntax. Finally, we have shown a variety of problem domains for which a language like XQL might be useful. |
| We believe that XML query languages can dramatically increase the utility of XML documents by making it easier to access the information they contain. As more and more of our critical data is stored in XML documents, presented as XML documents through middleware, or made accessible via XML-oriented APIs, the ability to query XML documents becomes an essential aspect of being able to use the information we have. We hope that XQL will fill this need. |