Text analysis tools for XML documents using regular expressions & XSL |
| a Web application |
| Jayaraman, Karthik |
| Karthik Jayaraman |
| Student |
Colgate University, Computer Sc. Dept., Hamilton, New York 13346, USA. Phone: 315 228 5517. Email: kjayaraman@mail.colgate.edu |
| Biography |
| Karthik Jayaraman is a senior at Colgate University graduating in May 2000. |
| Nakhimovsky, Alexander |
| Alexander Nakhimovsky |
| Associate Professor |
Computer Science Dept., Colgate University, 306 McGregory, Hamilton, New York 13346, USA. Phone: +1 315 228 7586. Fax: +1 315 228 7004. Email: sasha@cs.colgate.edu |
| Biography |
| Abstract |
| The last item is for those queries that would be difficult or impossible to express as an XSLT template. Fortunately, many XSLT processors, including James Clark's xt, which we are using, provide a mechanism that makes it possible to run arbitrary Java code (packaged as a Java bean) from within xt and feed the resulting node-set back into xt. We have, in effect, a Turing machine, ready to compute anything computable; the challenge is to identify the required functionality and the user interface to it. The demo described at the end of our paper, together with the accompanying tutorial, has been created to solicit feedback and suggestions from our potential users. |
Several important technologies related to XML matured in 1999. In particular, XSLT, a language for describing XML parse tree transformations, and XPath, a language for specifying sets of tree nodes (a kind of regular expression language for paths in labeled trees), were completed and released in November 1999.
|
| These technologies can greatly influence the development of tools for text analysis used by scholars in the humanities. Much of the functionality of those tools has to do with searching the document for specific text patterns, with additional search conditions specified in terms of markup. (For instance: "find the first speech in the third act of the play that contains a word beginning with the character sequence 'lov'".) Such searches can be thought of as transformations of the original document into the result of the search, and they are therefore easily expressible in XSLT and XPath. XSLT also includes functionality that allows sorting and frequency counts. |
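The example search quoted above can be sketched in Java as follows. This is a minimal illustration, not the project's code: it assumes the element names of play.dtd (PLAY, ACT, SPEECH, LINE) and uses the JDK's `javax.xml.xpath` API rather than xt. The class and method names are hypothetical. Because XPath 1.0 has no regular expressions, the markup condition is expressed in XPath and the text condition in `java.util.regex`.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; names are illustrative only.
class SpeechSearch {

    /**
     * Returns the lines of the first SPEECH in the given act (1-based) that
     * contain a match for wordRegex, or null if no speech qualifies.
     */
    static String firstMatchingSpeech(String playXml, int act, String wordRegex) {
        try {
            // Markup condition: all SPEECH descendants of the act-th ACT.
            String path = "/PLAY/ACT[" + act + "]//SPEECH";
            NodeList speeches = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    path, new InputSource(new StringReader(playXml)), XPathConstants.NODESET);

            // Text condition: applied in Java to the LINE elements of each speech.
            Pattern word = Pattern.compile(wordRegex);
            for (int i = 0; i < speeches.getLength(); i++) {
                NodeList lines = ((Element) speeches.item(i)).getElementsByTagName("LINE");
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < lines.getLength(); j++) {
                    text.append(lines.item(j).getTextContent()).append('\n');
                }
                if (word.matcher(text).find()) {
                    return text.toString().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        String play = "<PLAY>"
                + "<ACT><SCENE><SPEECH><SPEAKER>A</SPEAKER><LINE>All for love</LINE></SPEECH></SCENE></ACT>"
                + "<ACT><SCENE><SPEECH><SPEAKER>B</SPEAKER><LINE>No match here</LINE></SPEECH></SCENE></ACT>"
                + "<ACT><SCENE>"
                + "<SPEECH><SPEAKER>C</SPEAKER><LINE>Hence, be gone</LINE></SPEECH>"
                + "<SPEECH><SPEAKER>D</SPEAKER><LINE>My lovely friend</LINE></SPEECH>"
                + "</SCENE></ACT></PLAY>";
        // \b anchors the match at the start of a word, so "glove" would not qualify.
        System.out.println(firstMatchingSpeech(play, 3, "\\blov\\w*"));
    }
}
```

The text is gathered from the LINE elements only, so that a speaker name directly preceding a line cannot fuse with its first word and hide a word boundary.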
| At the same time, XML parsers, especially XML Java parsers, have also made great improvements. There are ongoing Java parser projects at Sun, IBM, Microsoft and Oracle, among others, with stable versions already available and a regular schedule of new releases. The parsers have become faster and more conformant, as described in Brownell's recent study and follow-up exchanges ( http://www.xml.com ). The rules for integrating a Java XML parser into a Java program have also been codified (by Sun) as the Java API for XML Parsing. Since most XSLT processors, including James Clark's, are also written in Java, the same program can use an XML parser and an XSLT processor to transform an XML document in any programmable way. In particular, such a program can run any programmable query and format its result in HTML (or XHTML). |
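A minimal sketch of one program that uses both an XML parser and an XSLT processor is given below. It is an assumption-laden illustration: it uses the `javax.xml.parsers` and `javax.xml.transform` packages shipped with current JDKs, not xt or the parser versions available in 2000, and the class name, sample document and stylesheet are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class: parses a document, then applies a stylesheet to the parse tree.
class ParseAndTransform {

    static String transform(String xml, String xslt) {
        try {
            // Step 1: the XML parser builds a DOM tree.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XSLT processing expects a namespace-aware tree
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: the XSLT processor transforms that tree into text output.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<PLAY><SPEECH><SPEAKER>HAMLET</SPEAKER></SPEECH>"
                + "<SPEECH><SPEAKER>OPHELIA</SPEAKER></SPEECH></PLAY>";
        // Illustrative query: list every speaker as an HTML list item.
        String xslt = "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='html' indent='no'/>"
                + "<xsl:template match='/'><ul><xsl:for-each select='//SPEAKER'>"
                + "<li><xsl:value-of select='.'/></li></xsl:for-each></ul></xsl:template>"
                + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```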
| The combined effect of these developments is that there is now a solid foundation for text analysis tools that have these two properties: |
| Since December 1999, a project to develop such tools has been under way at Colgate University. The primary goal of this project is to produce an application that retains the sophistication of existing text analysis tools while using the platform of the World Wide Web. This allows us to provide a simple user interface through which powerful operations can be executed without a large degree of technical ability. The project uses the built-in internationalization features of Java and XML to create a platform that will permit an easy transition from English to foreign-language texts. The project has also been designed to be DTD-independent, in that no details of the document markup are hard-coded into the program. The initial setup for a new DTD simply requires a text file containing a list of all the tags in the DTD and can be carried out by the site administrator. Following this, the production of HTML forms for the user interface is fully automated. Making the project DTD-independent requires a slight trade-off in the form of increased complexity of the user interface, since all the DTD-specific information is built into the HTML forms. However, it is our belief that designing the project in this manner will allow much greater extensibility. For example, it is conceivable that this program can be used to operate on any kind of data that has been marked up in XML, even data that does not lend itself to being easily marked up in popular formats such as TEI and DocBook. |
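The automated step from tag list to HTML form can be sketched as follows. The file format (one tag name per line), the class name and the form field name are assumptions made for illustration; the text above only states that the setup file lists the tags of the DTD.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator: turns a DTD's tag list into an HTML drop-down list.
class TagFormGenerator {

    /** Reads tag names from the setup text, assuming one name per line. */
    static List<String> parseTagList(String text) {
        List<String> tags = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String tag = line.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /**
     * Builds a select element offering every tag. XML names cannot contain
     * the characters that are special in HTML, so no escaping is applied here.
     */
    static String selectFor(String fieldName, List<String> tags) {
        StringBuilder html = new StringBuilder("<select name=\"" + fieldName + "\">\n");
        for (String tag : tags) {
            html.append("  <option value=\"").append(tag).append("\">")
                .append(tag).append("</option>\n");
        }
        return html.append("</select>").toString();
    }

    public static void main(String[] args) {
        List<String> tags = parseTagList("PLAY\nACT\nSCENE\nSPEECH\nSPEAKER\nLINE\n");
        System.out.println(selectFor("element", tags));
    }
}
```

Nothing about a particular DTD appears in the code itself; supporting a new DTD means supplying a different tag list.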
| The project has also been designed to return output from its queries in XML with a simple result DTD of its own. A second XSL stylesheet then converts this result to HTML. The impact on the user is as follows. The output can be formatted precisely the way the user would like to see it by rewriting the conversion stylesheet or hiring a programmer to do it. For example, the conversion stylesheet for frequency counts could return its output in SVG so that the counts are displayed as a bar graph. It is also possible to use a dummy stylesheet for conversion. This would return the raw XML result from the query, which can be saved locally and used for further processing, possibly with other tools. If the "other tools" require their input data to use a specific DTD for markup, the conversion stylesheet can be configured to return output marked up using that DTD. This flexibility fully leverages the power of XSLT and the portability and standardization of XML. Since the conversion stylesheet is not hard-coded, the server can offer a choice of conversion stylesheets, with the user choosing one of them from a drop-down list at the time that a query is executed. |
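The choice between conversion stylesheets might look like the sketch below. The result markup (`result` and `hit` elements), the stylesheet names "html" and "raw", and the class itself are hypothetical, since the project's result DTD is not given here. The "raw" entry is an identity transform, standing in for the dummy stylesheet that returns the query result unchanged.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical second stage: formats a query result with a user-chosen stylesheet.
class ResultFormatter {

    private static final String XSL_OPEN =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";

    // Stylesheets offered in the drop-down list, keyed by display name.
    static final Map<String, String> STYLESHEETS = new LinkedHashMap<>();
    static {
        STYLESHEETS.put("html", XSL_OPEN
                + "<xsl:output method='html' indent='no'/>"
                + "<xsl:template match='/result'><ul><xsl:apply-templates select='hit'/></ul></xsl:template>"
                + "<xsl:template match='hit'><li><xsl:value-of select='.'/></li></xsl:template>"
                + "</xsl:stylesheet>");
        // Identity transform: copies every node and attribute unchanged.
        STYLESHEETS.put("raw", XSL_OPEN
                + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
                + "<xsl:template match='@*|node()'><xsl:copy>"
                + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>"
                + "</xsl:stylesheet>");
    }

    static String format(String resultXml, String choice) {
        String xsl = STYLESHEETS.get(choice);
        if (xsl == null) {
            throw new IllegalArgumentException("Unknown stylesheet: " + choice);
        }
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(resultXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Conversion failed", e);
        }
    }

    public static void main(String[] args) {
        String result = "<result><hit>love</hit><hit>lover</hit></result>";
        System.out.println(format(result, "html"));
        System.out.println(format(result, "raw"));
    }
}
```

Adding an output format, such as SVG for frequency counts, would then amount to registering one more stylesheet, with no change to the query code.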
| As of this writing (March 2000), we have a simple prototype running at http://csproj.colgate.edu:8000/karthik/TextTools.html . The prototype works with one specific and very simple DTD (Jon Bosak's play.dtd for Shakespeare plays). In the final version, the first screen that the user sees will list the available texts, and DTD selection will be transparent to the user. |
|
| The interface (see ) currently accepts arbitrary Perl5 regular expressions and arbitrary XPath expressions as input. It provides drop-down selection lists of the available XML elements, and it makes the creation of XPath expressions a simple matter of choosing an XPath axis and an XML element from drop-down lists and filling in text-boxes to select on attributes or position. Users do not need to know the details of regular expression syntax; a rudimentary knowledge of how regular expressions work will suffice. An adjacent drop-down list lets the user select options such as "any character", "whole word" and other regular expression constructs, and inserts them into the text-box where the search pattern is being constructed. |
|
| The number of elements on which searches can be conducted is unlimited and can be set by the user (see ). The same is true of the number of search conditions for each element. The majority of simple XPath expressions can be created using the drop-down boxes provided. For the cases where this is not sufficient, the required expression can be typed in a text-box by setting it as an attribute of an element. The default behaviour is to perform a Boolean AND of all the search conditions. However, this is easily changed to a Boolean OR by making the appropriate selection in a drop-down list on the search page. We are also exploring the possibility of extending XPath to allow regular expression searches in attributes and element names. Given the current definition of XPath, this will probably require a large trade-off in speed. However, it may still be advantageous to enable this option for complex searches or DTDs. |
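How form selections could be assembled into an XPath step, with the conditions joined by AND or OR, is sketched below. The class name, method signature and sample conditions are hypothetical; the actual form-processing code of the prototype is not described here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical builder: combines drop-down choices into one XPath location step.
class XPathStepBuilder {

    /**
     * Builds axis::element[...] from the chosen axis, element and conditions.
     * matchAll selects Boolean AND (the default behaviour described above);
     * otherwise the conditions are joined with OR.
     */
    static String step(String axis, String element, List<String> conditions, boolean matchAll) {
        StringBuilder step = new StringBuilder(axis).append("::").append(element);
        if (!conditions.isEmpty()) {
            // Each condition is parenthesized so that an "or" inside one
            // condition cannot change how the joined expression is grouped.
            String joiner = matchAll ? ") and (" : ") or (";
            step.append("[(").append(String.join(joiner, conditions)).append(")]");
        }
        return step.toString();
    }

    public static void main(String[] args) {
        List<String> conditions = Arrays.asList("SPEAKER='HAMLET'", "LINE[contains(., 'love')]");
        System.out.println(step("descendant", "SPEECH", conditions, true));
        System.out.println(step("descendant", "SPEECH", conditions, false));
    }
}
```

With the AND setting, the first call yields `descendant::SPEECH[(SPEAKER='HAMLET') and (LINE[contains(., 'love')])]`; the second differs only in the operator.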
| Two difficult problems facing the project are the user interface and query optimization. For user interface development, we will seek collaboration with an active humanities project that would use our tools and contribute ideas for better functionality and user interface. The focus will be on retaining functionality while increasing the user-friendliness of the interface. One possible solution is a tiered interface: a simple but less powerful interface for basic queries, with progressively more complex and more powerful interfaces for advanced users and queries. For query optimization, we will investigate methods of storing and indexing pre-processed queries, trading disk space for processing time. There have been tentative attempts recently to formulate correspondences between XPath and SQL, the query language of relational databases. There has also been an ongoing discussion on the xml-dev list about the need for a standard API for storage, search and retrieval from repositories of XML documents. These may prove relevant for the organization of large repositories of XML documents that can be accessed over the Internet. |
| The project currently runs on a small NT server, but a Linux version is also in progress. Since the program is written in Java and does not use any native methods, porting to a variety of operating systems and environments should not be difficult. |