
 

Text analysis tools for XML documents using regular expressions & XSL

 a Web application
Jayaraman, Karthik
 
 Karthik  Jayaraman
 Student
 Colgate University
 Hamilton 
 New York 
 USA 
Colgate University,  Computer Sc. Dept.
Hamilton  New York  13346 USA
Phone: 315 228 5517 email: kjayaraman@mail.colgate.edu
 Biography
 Karthik Jayaraman — Karthik Jayaraman is a senior at Colgate University graduating in May 2000.
Nakhimovsky, Alexander
 
 Alexander  Nakhimovsky
 Associate Professor
 Computer Science Dpt, Colgate University
 Hamilton  
 New York 
 USA 
Computer Science Dpt, Colgate University,  306 McGregory
Hamilton  New York  13346 USA
Phone: +1 315 228 7586 Fax: +1 315 228 7004 email: sasha@cs.colgate.edu
 Biography
 Alexander Nakhimovsky: Nakhimovsky has just published, jointly with Tom Myers, "Professional Java XML Programming", WROX 1999, following their "Javascript Objects", WROX 1998. Nakhimovsky is the author of several papers on general and computational linguistics and AI. His forthcoming conference presentations include a paper on Web programming at WWW9, Amsterdam, May 2000, and a tutorial on WAP programming at XML Developers Conference, New York, June 2000.
 Abstract
 A common type of text-analysis query is determined by several parameters: the lexical material to search for; the documents to search; the ranges of text to search within those documents; the analytical operation to perform, such as a frequency count or a concordance; and, if context is requested, the way to measure the amount of context.
 We say that a text-analysis tool is "markup-aware" if both query conditions and context can be expressed in terms of markup. Most markup-aware text-analysis tools are based on SGML-TEI; none of them, to the best of our knowledge, are Web applications, and none use XSLT and XPath, two new languages codified by W3C in November 1999. This paper describes a project to build text-analysis tools with the following design goals:
 
  • They are Web applications, in the sense that their user interface is a Web browser and they can be accessed over any TCP/IP network.
  • They can process queries containing both text patterns (described by regular expressions) and markup patterns (described by XPath expressions).
  • They are DTD-independent, in the sense that the user interface is constructed programmatically on the basis of the document's DTD.
  • They are extensible, in the sense that they can be extended by code written in a general-purpose programming language (Java most easily).
 The last item is for those queries that would be difficult or impossible to express as an XSLT template. Fortunately, many XSLT processors, including James Clark's xt, which we are using, provide a mechanism that makes it possible to run arbitrary Java code (packaged as a Java bean) from within xt and feed the resulting node-set back into xt. We have, in effect, a Turing machine, ready to compute anything computable; the challenge is to identify the required functionality and the user interface to it. The demo described at the end of our paper, together with the accompanying tutorial, has been created to solicit feedback and suggestions from our potential users.
     Several important technologies related to XML matured in 1999. In particular, XSLT, a language for describing XML parse tree transformations, and XPath, a language for specifying sets of tree nodes (a kind of regular-expression language for paths in labeled trees), were completed and released in November 1999 (W3C Recommendations for XPath and XSLT, http://www.w3.org/Style/XSL ).
    At the same time, James Clark released a version of XT (his XSLT processor) that is in close conformance with the Recommendations ( http://www.jclark.com/xml ). Several other XSLT processors are in the process of rapid development. Some of them, including XT, have a built-in extension mechanism, so that if some transformation is too hard for XSLT, it can be delegated to a general-purpose programming language, typically Java.
     These technologies can greatly influence the development of tools for text analysis used by scholars in the humanities. Much of the functionality of those tools has to do with searching the document for specific text patterns, with additional search conditions specified in terms of markup. (For instance: "find the first speech in the third act of the play that contains a word beginning with the character sequence 'lov'".) Such searches can be thought of as transformations of the original document into the result of the search, and they are therefore easily expressible in XSLT and XPath. XSLT also includes functionality that allows sorting and frequency counts.
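     The example query above can be sketched in Java as a markup condition plus a text condition. This is only an illustration under stated assumptions, not the project's code: it uses the javax.xml.xpath API of current JDKs rather than xt, the sample play and the names SpeechSearch and firstMatchingSpeaker are hypothetical, and only the element names (PLAY, ACT, SCENE, SPEECH, SPEAKER, LINE) come from play.dtd. Because XPath 1.0 has no regular-expression functions, the XPath expression selects the candidate speeches and a java.util.regex pattern tests their lines.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SpeechSearch {
    // Hypothetical miniature play; only the element names follow play.dtd.
    static final String SAMPLE =
        "<PLAY>"
      + "<ACT><SPEECH><SPEAKER>FIRST</SPEAKER><LINE>I love thee</LINE></SPEECH></ACT>"
      + "<ACT><SPEECH><SPEAKER>SECOND</SPEAKER><LINE>No matter</LINE></SPEECH></ACT>"
      + "<ACT><SCENE>"
      + "<SPEECH><SPEAKER>THIRD</SPEAKER><LINE>A glove and a sword</LINE></SPEECH>"
      + "<SPEECH><SPEAKER>FOURTH</SPEAKER><LINE>My lovely friend</LINE></SPEECH>"
      + "<SPEECH><SPEAKER>FIFTH</SPEAKER><LINE>All lovers lie</LINE></SPEECH>"
      + "</SCENE></ACT>"
      + "</PLAY>";

    /** Returns the SPEAKER of the first matching speech, or null if none matches. */
    static String firstMatchingSpeaker(String xml, int act, String prefix)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Markup condition: every SPEECH inside the requested ACT.
        NodeList speeches = (NodeList) xpath.evaluate(
                "/PLAY/ACT[" + act + "]//SPEECH",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        // Text condition: a word that begins with the given character sequence.
        Pattern word = Pattern.compile("\\b" + Pattern.quote(prefix) + "\\w*");
        for (int i = 0; i < speeches.getLength(); i++) {
            Node speech = speeches.item(i);
            NodeList lines = (NodeList) xpath.evaluate("LINE", speech, XPathConstants.NODESET);
            for (int j = 0; j < lines.getLength(); j++) {
                if (word.matcher(lines.item(j).getTextContent()).find()) {
                    return xpath.evaluate("SPEAKER", speech);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(firstMatchingSpeaker(SAMPLE, 3, "lov"));
    }
}
```

     In the sample, "glove" is skipped because the pattern requires a word boundary before "lov", so the speech containing "lovely" is the first hit in the third act.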
     At the same time, XML parsers, especially XML Java parsers, have also made great improvements. There are ongoing Java parser projects at Sun, IBM, Microsoft and Oracle, among others, with stable versions already available and a regular schedule of new releases. The parsers have become faster and more conformant, as described in Brownell's recent study and follow-up exchanges ( http://www.xml.com ). The rules for integrating a Java XML parser into a Java program have also been codified (by Sun) as the Java Application Programming Interface for XML Parsing. Since most XSLT processors, including James Clark's, are also written in Java, the same program can use an XML parser and an XSLT processor to transform an XML document in any programmable way. In particular, such a program can run any programmable query and format its result in HTML (or XHTML).
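     A minimal sketch of such a program follows: it parses a document, applies an XSLT stylesheet that counts speeches per act, and emits HTML. It assumes the javax.xml.transform API that later became the standard JAXP transformation interface, not the xt setup described here; the class name QueryToHtml, the sample document and the stylesheet are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class QueryToHtml {
    // Hypothetical sample document using play.dtd element names.
    static final String PLAY =
        "<PLAY><TITLE>Demo</TITLE>"
      + "<ACT><SPEECH><LINE>one</LINE></SPEECH><SPEECH><LINE>two</LINE></SPEECH></ACT>"
      + "<ACT><SPEECH><LINE>three</LINE></SPEECH></ACT>"
      + "</PLAY>";

    // Hypothetical query stylesheet: a frequency count of SPEECH elements per ACT.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/PLAY'>"
      + "<ul><xsl:for-each select='ACT'>"
      + "<li>Act <xsl:value-of select='position()'/>: "
      + "<xsl:value-of select='count(.//SPEECH)'/> speeches</li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Parses the XML, applies the stylesheet, and returns the serialized result. */
    static String run(String xml, String stylesheet) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(run(PLAY, STYLESHEET));
    }
}
```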
     The combined effect of these developments is that there is now a solid foundation for text analysis tools that have these two properties:
     
  • They are Web applications, in the sense that their user interface is a Web browser and they can be accessed over the Internet;
  • They are completely "markup-aware", in the sense that they can run queries containing both text patterns (described by regular expressions) and markup patterns (described by XPath expressions).
     Since December 1999, a project to develop such tools has been under way at Colgate University. The primary goal of this project is to produce an application that retains the sophistication of existing text analysis tools while using the World Wide Web as its platform. This allows us to provide a simple user interface through which powerful operations can be executed without a large degree of technical ability. The project uses the built-in internationalization features of Java and XML to create a platform that will permit an easy transition from English to foreign-language texts. The project has also been designed to be DTD-independent, in that no details of the document markup are hard-coded into the program. The initial setup for a new DTD simply requires a text file containing a list of all the tags in the DTD, and can be carried out by the site administrator. Following this, the production of HTML forms for the user interface is fully automated. Making the project DTD-independent requires a slight trade-off in the form of a more complex user interface, since all the DTD-specific information is built into the HTML forms. However, it is our belief that designing the project in this manner will allow much greater extensibility. For example, it is conceivable that this program could be used to operate on any kind of data that has been marked up in XML, even data that does not lend itself to being easily marked up in popular formats such as TEI and DocBook.
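     The step from a tag list to a form control might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's generator: the class FormBuilder, the method elementSelect and the field name "element" are hypothetical, and the tag list is passed in as lines of text (in a deployment they could be read from the administrator's tag file, for example with java.nio.file.Files.readAllLines).

```java
import java.util.Arrays;
import java.util.List;

class FormBuilder {
    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds an HTML drop-down list with one option per non-blank tag name. */
    static String elementSelect(String fieldName, List<String> tagLines) {
        StringBuilder html = new StringBuilder();
        html.append("<select name=\"").append(escape(fieldName)).append("\">\n");
        for (String line : tagLines) {
            String tag = line.trim();
            if (tag.isEmpty()) {
                continue; // ignore blank lines in the tag file
            }
            html.append("  <option value=\"").append(escape(tag)).append("\">")
                .append(escape(tag)).append("</option>\n");
        }
        html.append("</select>");
        return html.toString();
    }

    public static void main(String[] args) {
        // Tag names taken from play.dtd; any other DTD's tag list works the same way.
        List<String> tags = Arrays.asList("PLAY", "ACT", "SCENE", "SPEECH", "SPEAKER", "LINE");
        System.out.println(elementSelect("element", tags));
    }
}
```

     Nothing in the sketch depends on the tag names themselves, which is the sense in which the form generation is independent of the DTD.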
     The project has also been designed to return output from its queries in XML, with a simple result DTD of its own. A second XSL stylesheet then converts this XML to HTML. The impact on the user is as follows. The output can be formatted precisely the way the user would like to see it by rewriting the conversion stylesheet, or hiring a programmer to do it. For example, the conversion stylesheet for frequency counts could return its output in SVG so that the counts are displayed as a bar graph. It is also possible to use a dummy stylesheet for conversion. This would return the raw XML result from the query, which can be saved locally and used for further processing, possibly with other tools. If the "other tools" require their input data to use a specific DTD for markup, the conversion stylesheet can be configured to return output marked up using that DTD. This flexibility takes full advantage of the power of XSLT and the portability and standardization of XML. Since the conversion stylesheet is not hard-coded, the server can offer a choice of conversion stylesheets, with the user choosing one of them from a drop-down list at the time a query is executed.
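     The choice of conversion stylesheet can be sketched as a lookup by name, with the "dummy" case handled by an identity transformation. Everything specific here is an assumption: the result markup (result, hit, speaker) is a hypothetical stand-in for the project's result DTD, the stylesheet name "html-list" and the class ResultFormatter are invented for illustration, and javax.xml.transform stands in for xt. TransformerFactory.newTransformer() with no arguments yields a transformer that copies its input unchanged.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ResultFormatter {
    // Hypothetical query result; the real result DTD is not given in the text.
    static final String RESULT =
        "<result><hit speaker='FOURTH'>My lovely friend</hit></result>";

    // Conversion stylesheets offered to the user, keyed by the name shown in the list.
    static final Map<String, String> STYLESHEETS = new LinkedHashMap<>();
    static {
        STYLESHEETS.put("html-list",
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='html' indent='no'/>"
          + "<xsl:template match='/result'><ul><xsl:for-each select='hit'>"
          + "<li><b><xsl:value-of select='@speaker'/></b>: <xsl:value-of select='.'/></li>"
          + "</xsl:for-each></ul></xsl:template>"
          + "</xsl:stylesheet>");
    }

    /** Formats the result with the chosen stylesheet; unknown names return the raw XML. */
    static String format(String resultXml, String choice) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        String xsl = STYLESHEETS.get(choice);
        Transformer transformer = (xsl == null)
                ? factory.newTransformer() // identity transformation: the "dummy" stylesheet
                : factory.newTransformer(new StreamSource(new StringReader(xsl)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(resultXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(format(RESULT, "html-list"));
        System.out.println(format(RESULT, "raw"));
    }
}
```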
     As of this writing (March 2000) we have a simple prototype running at http://csproj.colgate.edu:8000/karthik/TextTools.html . The prototype shows a specific and very simple DTD (Jon Bosak's play.dtd for Shakespeare plays). In the final version, the first screen that the user sees will have a list of texts that are available, and DTD selection will be transparent to the user.
     
    Screenshot 1 - Search Form
     The interface (see Screenshot 1) currently accepts arbitrary Perl5 regular expressions and arbitrary XPath expressions as input. It provides drop-down selection lists of the available XML elements, and it makes the creation of XPath expressions a simple matter of choosing an XPath axis and an XML element from drop-down lists and using text-boxes to select on attributes or position. Users need not know the details of regular expression syntax; a rudimentary knowledge of how regular expressions work will suffice. An adjacent drop-down list lets the user select constructs such as "any character" and "whole word" and inserts them into the text-box where the regular expression is entered.
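     How drop-down and text-box values could be assembled into a single XPath location step is sketched below. The class XPathStepBuilder and its parameters are hypothetical and say nothing about how the prototype actually builds its expressions; the list of axes is the set of thirteen defined by XPath 1.0, and the DIV element with its type attribute in the usage example is likewise invented.

```java
import java.util.Arrays;
import java.util.List;

class XPathStepBuilder {
    // The thirteen axes defined by XPath 1.0.
    static final List<String> AXES = Arrays.asList(
        "ancestor", "ancestor-or-self", "attribute", "child", "descendant",
        "descendant-or-self", "following", "following-sibling", "namespace",
        "parent", "preceding", "preceding-sibling", "self");

    /**
     * Builds one location step such as descendant::SPEECH[2].
     * A null or empty attrName means "no attribute test"; position 0 means "no position test".
     */
    static String step(String axis, String element, String attrName,
                       String attrValue, int position) {
        if (!AXES.contains(axis)) {
            throw new IllegalArgumentException("Unknown axis: " + axis);
        }
        StringBuilder step = new StringBuilder(axis).append("::").append(element);
        if (attrName != null && !attrName.isEmpty()) {
            step.append("[@").append(attrName).append('=')
                .append(literal(attrValue == null ? "" : attrValue)).append(']');
        }
        if (position > 0) {
            // Applied after the attribute test, so it counts only the nodes that passed it.
            step.append('[').append(position).append(']');
        }
        return step.toString();
    }

    /** Quotes a value as an XPath 1.0 string literal, which has no escape mechanism. */
    static String literal(String value) {
        if (!value.contains("'")) {
            return "'" + value + "'";
        }
        if (!value.contains("\"")) {
            return "\"" + value + "\"";
        }
        throw new IllegalArgumentException("Value contains both kinds of quote");
    }

    public static void main(String[] args) {
        System.out.println(step("descendant", "SPEECH", null, null, 2));
        System.out.println(step("child", "DIV", "type", "act", 0));
    }
}
```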
     
    Screenshot 2 - Form configuration page
     The number of elements on which searches can be conducted is unlimited and can be set by the user (see Screenshot 2). The same is true of the number of search conditions for each element. Most simple XPath expressions can be created using the drop-down boxes provided. For the cases where this is not sufficient, the required expression can be typed in a text-box by setting it as an attribute of an element. The default behaviour is to perform a Boolean AND of all the search conditions. However, this is easily changed to a Boolean OR by making the appropriate selection in a drop-down list on the search page. We are also exploring the possibility of extending XPath to allow regular expression searches in attributes and element names. Given the current definition of XPath, this will probably require a large trade-off in speed. However, it may still be advantageous to enable this option for complex searches or DTDs.
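     One way to picture the AND/OR switch is as the operator that joins the individual conditions inside a single XPath predicate. The sketch below is illustrative only: ConditionCombiner, combine and count are hypothetical names, the sample document is invented, and the evaluation uses javax.xml.xpath rather than xt.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ConditionCombiner {
    // Hypothetical sample using play.dtd element names.
    static final String SAMPLE = "<PLAY><ACT>"
      + "<SPEECH><SPEAKER>FIRST</SPEAKER><LINE>I love thee</LINE></SPEECH>"
      + "<SPEECH><SPEAKER>SECOND</SPEAKER><LINE>And love is blind</LINE></SPEECH>"
      + "<SPEECH><SPEAKER>FIRST</SPEAKER><LINE>Farewell</LINE></SPEECH>"
      + "<SPEECH><SPEAKER>SECOND</SPEAKER><LINE>Good night</LINE></SPEECH>"
      + "</ACT></PLAY>";

    /** Joins the conditions into one predicate with "and" (matchAll) or "or". */
    static String combine(String elementPath, List<String> conditions, boolean matchAll) {
        if (conditions.isEmpty()) {
            return elementPath; // no conditions: select every such element
        }
        StringBuilder predicate = new StringBuilder();
        for (String condition : conditions) {
            if (predicate.length() > 0) {
                predicate.append(matchAll ? " and " : " or ");
            }
            // Parentheses keep each condition intact whatever operators it contains.
            predicate.append('(').append(condition).append(')');
        }
        return elementPath + "[" + predicate + "]";
    }

    /** Counts the nodes that the expression selects in the given document. */
    static int count(String xml, String expression) throws XPathExpressionException {
        Double n = (Double) XPathFactory.newInstance().newXPath().evaluate(
                "count(" + expression + ")",
                new InputSource(new StringReader(xml)),
                XPathConstants.NUMBER);
        return n.intValue();
    }

    public static void main(String[] args) throws XPathExpressionException {
        List<String> conditions =
            Arrays.asList("SPEAKER='FIRST'", "LINE[contains(., 'love')]");
        System.out.println(count(SAMPLE, combine("//SPEECH", conditions, true)));
        System.out.println(count(SAMPLE, combine("//SPEECH", conditions, false)));
    }
}
```

     With the sample data, the AND form selects only the speech that satisfies both conditions, while the OR form also selects the speeches that satisfy just one of them.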
     Two difficult problems facing the project are the user interface and query optimization. For user interface development, we will seek collaboration with an active humanities project that would use our tools and contribute ideas for better functionality and user interface. The focus will be on retaining functionality while making the interface more user-friendly. One possible solution is a tiered interface, with a simple but less powerful tier for basic queries and progressively more complex and more powerful tiers for advanced users and queries. For query optimization, we will investigate methods of storing and indexing pre-processed queries, trading disk space for processing time. There have been tentative attempts recently to formulate correspondences between XPath and SQL, the query language of relational databases. There has also been an ongoing discussion on the xml-dev list about the need for a standard API for storage, search and retrieval from repositories of XML documents. These may prove relevant for the organization of large repositories of XML documents that can be accessed over the Internet.
     The project currently runs on a small NT server, but a Linux version is also in progress. Since the program is written in Java and does not use any native methods, porting to a variety of operating systems and environments should not be difficult.
     Bibliography
     
    1 Bosak, Jon. XML Shakespeare, in http://metalab.unc.edu/bosak/xml/eg/shaks200.zip .
     
    2 Clark, James. The xt distribution at http://www.jclark.com/xml .
     
    3 DeRose, Steven and C.M. Sperberg-McQueen. "A broadcast architecture for distributed text tools". Proceedings of the ALLC/ACH conference, 1999.
     
    4 W3C XSLT Recommendation, at http://www.w3c.org/tr
     
    5 Nakhimovsky, Alexander and Tom Myers. Javascript Objects. WROX, 1998.
     
    6 Nakhimovsky, Alexander and Tom Myers. Professional Java XML Programming. WROX, 1999.
