
 XML 
 collaborative 
distributed
 information 
 integration 
 metadata 
 middleware 
 

XML For Web-Based Collaborative Management

 
Thomas E. Potok
Research Staff Member, Center for Collaborative Technologies Research, Oak Ridge National Laboratory
Oak Ridge National Laboratory, PO Box 2008, MS6414
Oak Ridge, Tennessee 37831-6414, USA
Email: potokte@ornl.gov

Biographical notice

Thomas E. Potok is a research staff member at the Collaborative Technologies Research Center of ORNL's Computer Science and Mathematics Division. He is currently the principal investigator of the Collaborative Management Environment project. He received his B.S. degree in computer science, an M.S. degree in computer engineering, and a Ph.D. degree in computer engineering, all from North Carolina State University. He has 14 years of software development experience at IBM, has authored over 20 publications, and has filed 2 patents.

 
Nenad Ivezic
Research Staff Member, Center for Collaborative Technologies Research, Oak Ridge National Laboratory
Oak Ridge National Laboratory, PO Box 2008, MS6414
Oak Ridge, Tennessee 37831-6414, USA
Email: ivezicn@ornl.gov

Biographical notice

Nenad Ivezic is a research staff member at the Collaborative Technologies Research Center of ORNL's Computer Science and Mathematics Division. He received his Ph.D. degree in 1995 in Computer-Aided Engineering from Carnegie Mellon University. His interests are in the areas of software engineering, engineering design, and collaborative technologies. He is involved in the research and development of technologies enabling collaborative work including information modeling, ontology engineering, and shared work environments.

 
Bradley A. Singletary
Student Research Assistant, Center for Collaborative Technologies Research, Oak Ridge National Laboratory
Oak Ridge National Laboratory, PO Box 2008, MS6414
Oak Ridge, Tennessee 37831-6414, USA
Email: bas@acm.org

Biographical notice

Bradley A. Singletary was a participant in the DOE Energy Research Undergraduate Laboratory Fellowship program and worked with the Center for Collaborative Technologies Research of ORNL's Computer Science and Mathematics Division. He is a Ph.D. student in the College of Computing at the Georgia Institute of Technology. He received his B.S. degree in computer and information science in 1998 from East Tennessee State University.

 

Abstract

 In this paper we present an innovative use of XML as a cost-efficient alternative for acquiring, storing, querying, and publishing heterogeneous and distributed enterprise information. We present this work in the context of a pilot system to automate and enhance the management of research proposal information from multiple independent research organizations within the Department of Energy (DOE).
 Two key challenges presented themselves in the development of the pilot system. First, due to the distributed, heterogeneous, and sensitive nature of the proposal data, it was preferred that the data remain stored locally within the originating laboratories. Second, due to the lack of motivation for the laboratories to invest in the construction and maintenance of traditional enterprise information systems, it was necessary to investigate a low-cost, distributed information management system.
 There are several well-published ways of dealing with distributed and heterogeneous data including distributed databases and object request brokers. The novelty of our approach is in the cost-efficiency gained by both the developers and the data owners. This cost-efficiency was achieved by designing and implementing a flexible, low-cost, XML-based, distributed storage layer within the information management system. We have developed a process that converts laboratory data into a laboratory specific XML information format and maps this format onto a more general distributed storage layer. The data owners (i.e., the DOE research laboratories) are only required to represent their data using the laboratory-specific format, while the developers are responsible for operation of the low-maintenance XML-based storage layer.
 The initial results of the system usage can be summarized as very favorable. To date, information may be successfully acquired, integrated, searched, and displayed using the pilot system. While the XML-based system has obvious limitations, such as the limited scalability and the rapid evolution of the XML component technologies, the system fulfills an important role as a cost-efficient enterprise information system alternative.
 The submitted manuscript has been authored by a contractor of the U.S. Government under contract No. DE-AC05-96OR22464. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
 

Introduction

 The Collaborative Technologies Research Center within the Oak Ridge National Laboratory is in the final stages of completing a pilot system called the Collaborative Management Environment (CME). CME is a research project funded by the Department of Energy (DOE) to investigate advanced information technologies for improved management of research information across the DOE complex of national laboratories. Because DOE funds a vast amount of energy-related research in a very broad range of areas across the national laboratory complex, it is not surprising that each laboratory follows independent research management processes tailored to the expertise of that laboratory. The information resulting from these management processes, however, comes in different formats and at different levels of granularity, which in turn makes the overall management of DOE-funded research a difficult challenge.
 The disparity of research management processes significantly affects the management of proposal information at DOE. Before DOE can fund a project, one or more principal investigators responsible for the project must submit a research proposal to DOE program management. The research proposal describes, among other things, how the scientist(s) will allocate research funds, why the research is significant, and who would benefit from the technology advance. Because every lab has a customized management process and its own forms for proposal submission, the proposals arriving at DOE from different laboratories contain different types of data. Moreover, because DOE has multiple funding programs, the different DOE programs typically require different information to appear in the proposals. Additionally, proposals are stored in an independent online format at each laboratory until they are submitted to the program managers. Once a year, researchers extract proposals from the online systems and submit them on paper to DOE for approval. The transition from the paper-based proposal management process to an electronic one poses an additional and significant challenge.
 The CME pilot system must be able to store proposal information in such a way that it can be queried both by general keywords and by field-specific information. For example, a program manager may want to perform a general keyword search on "Neutron Physics" to investigate the research contributions proposed in this area. Or, he or she may want to find all of the proposals that have "Neutron Physics" in the title, were submitted by a principal investigator with the last name of "Smith," and fall within the years 1994-1997. Likewise, the program manager may require reports and graphs derived from this information, for example, how much money was spent by "Smith" on proposals that include the phrase "Neutron Physics" in the title. Lastly, the proposal information must be easily viewable from a variety of platforms. Since each lab has a different presentation form for the proposal information, the program manager needs to be able to view the information in a form that closely represents the original.
 Two key challenges presented themselves in the development of the pilot system. First, due to the distributed, heterogeneous, and sensitive nature of the proposal data, it was preferred that the data remain stored locally within the originating laboratories. Second, due to the lack of motivation for the laboratories to invest in the construction and maintenance of traditional enterprise information systems, it was necessary to investigate a low-cost, distributed information management system.
 In this paper we present our pioneering work for using the eXtensible Markup Language (XML) as a cost-efficient alternative for acquiring, storing, querying, and publishing heterogeneous and distributed enterprise information. We show that XML can provide a basis for a distributed data management system alternative with significant presentation and query capabilities.
 This paper is organized as follows: First, we provide a brief account of the related work in managing distributed data. Next, we describe our approach to achieve a low-cost distributed data management solution using XML as the primary enabling technology. Then, we present our findings that follow from an application of our approach. Further, we discuss the relative merits and key issues of our approach in the light of these findings. Finally, we summarize the principal points of our paper.
 

Related Work

 There are a variety of ways of addressing the issue of managing heterogeneous data, such as data warehousing, object request brokers, and middleware tools. Data warehouses provide a means for publishing and accessing a broad range of distributed, heterogeneous data. Object request brokers (ORBs) provide a way of accessing distributed objects as if the objects were local to the user's environment. Finally, middleware tools provide a layer above the data layer, yielding easy access to the distributed data. Unfortunately, these approaches all assume that the owners of the data are motivated to develop and maintain a relatively costly information management system. If a user has little interest in spending significant resources to implement such a system, the above alternatives may not be feasible. Clearly, the simplest means of distributing data in today's environment is the Web. However, the main drawback of using the Web for managing distributed data today is the inability of HTML to efficiently store data in a structured manner. A feasible approach is to make use of the distributed nature of the Web and the capabilities of XML to enable efficient and customizable representation, exchange, and distribution of data.
 XML was developed as a subset of the Standard Generalized Markup Language (SGML) and was recommended by the World Wide Web Consortium (W3C) as a standard in February of 1998 [Bray et al. 1998]. Though XML itself has been standardized, many exterior features that would improve the usefulness of XML are still in the relatively early stages of development. Technologies like XLink, XPointer, XSchema, and XSL have yet to find universal agreement [Maler et al. 1998, Layman et al. 1998, Adler et al. 1997].
 For the most part, XML is viewed as an enhancement to HTML. However, work has been done that suggests that XML may provide a reasonable level of support for data publishing, exchange, and integration. For example, electronic catalogs such as the ones described in [Singh 1998] and [Lincke et al. 1998] could be integrated and published using XML technologies. Likewise, Bosak describes a model that uses XML as a 'hub' language for vertical industry communication [Bosak 1997].
 We believe and have demonstrated that XML can be used as a storage, retrieval, and presentation vehicle beyond what has been previously proposed. Significantly, this application of XML can be achieved at very low cost to the data owners, while providing a basic level of data query capability.
 

Approach

 In this section we describe the functionality of the CME pilot system components that are relevant to the topic of this paper. Figure 1 shows an abstracted representation of these CME components.

Figure 1 - An abstracted representation of the CME pilot system components.

 

CME Object Model

 The starting point of the CME system is the CME Object Model. The primary objective of the model is to capture all relevant information that appears in a research proposal submitted by a DOE laboratory. These DOE research proposal documents are often referred to as Field Work Proposals, or FWPs. Three DOE laboratories participate in the CME pilot demonstration and have contributed to the development of the object model. The model was constructed through a series of meetings with representatives from these three laboratories. We used the Object Modeling Technique (OMT) methodology to develop this object model.
 

XML Document Type Definition

 The CME Object Model was used to define a number of XML Document Type Definitions (DTDs) for the FWP documents. A section of such a DTD is presented in Figure 2. Two basic levels of DTDs were developed to meet the requirement that the original laboratory-specific information may be used in a number of different contexts. The first DTD level describes laboratory-specific information, while the second DTD level describes information that is common to all participating laboratories. The laboratory-specific DTD lets laboratory personnel use the FWP information in a manner that is familiar to them. For example, one of the laboratories has an FWP form with a section labeled Five-Year Plan that is not found on the other pilot laboratory forms. We have decided to preserve such laboratory-specific peculiarities so as to enhance the usability of the CME system. On the other hand, the need for a common format arises when a program manager is searching over research proposals from multiple laboratories. Using the above example, a global search over a field that does not exist in all laboratories would return results from only one lab. If the program manager believes he or she is searching across information from all the laboratories, the results can be misleading. Therefore, multiple-laboratory searches are only performed over information common to all laboratories.

Figure 2 - A section of the FWP DTD that was defined for the CME project.
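
The relationship between the two DTD levels can be illustrated with a minimal sketch. The element names (`FWP`, `Title`, `FiveYearPlan`, and so on) and the `COMMON_FIELDS` set are hypothetical and are not taken from the CME DTDs; the sketch only shows how a laboratory-specific record might be projected onto the fields shared by all laboratories, so that a multiple-laboratory search never touches a field that only one lab provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of element names shared by every laboratory's format.
COMMON_FIELDS = {"ProposalID", "Title", "PrincipalInvestigator", "FiscalYear"}

def to_common(lab_record: ET.Element) -> ET.Element:
    """Copy only the common-level children of a lab-specific FWP record."""
    common = ET.Element("FWP")
    for child in lab_record:
        if child.tag in COMMON_FIELDS:
            ET.SubElement(common, child.tag).text = child.text
    return common

# A made-up lab-specific record with one field no other lab has.
lab_xml = """
<FWP>
  <ProposalID>ABC-001</ProposalID>
  <Title>Neutron Physics Instrumentation</Title>
  <PrincipalInvestigator>Smith</PrincipalInvestigator>
  <FiscalYear>1997</FiscalYear>
  <FiveYearPlan>Lab-specific planning text.</FiveYearPlan>
</FWP>
"""

common_record = to_common(ET.fromstring(lab_xml))
common_tags = [child.tag for child in common_record]
```

In this sketch the laboratory-specific record stays untouched for local display, while `common_record` is what a cross-laboratory search would see.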

 

XML Document Repository

 The laboratory-specific DTDs allow for effective electronic submission of FWP forms by each participating laboratory. Ideally, we would have liked to receive the laboratory FWP submissions as XML files properly tagged using the DTDs we had produced. However, the decision was made early on that the majority of the data translation work should be done by the system developers, not the participating labs, so as to keep the cost of lab participation low. Hence, we had to settle for raw FWP reports and define a translation from each lab's raw data to XML files conforming to the DTDs that we defined for that lab. We have developed a tool that converts general laboratory FWP reports into the laboratory-specific XML files. These XML files provide the storage layer for the FWP data (see Figure 3). The CME system design allows for distribution of the XML files across the participating laboratory sites. For the current pilot, the XML files reside at the Oak Ridge National Laboratory on a dedicated machine separate from the CME system.

Figure 3 - A section of the XML representation of a FWP submission
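
A minimal sketch of the kind of translation described above is given here. It assumes, purely for illustration, that a raw report arrives as pipe-delimited text with a header row; the column names, the `COLUMN_TO_TAG` mapping, and the element names are hypothetical and do not reproduce the CME translation tool or its report formats.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from report column names to lab-specific XML tags.
COLUMN_TO_TAG = {
    "PROP_ID": "ProposalID",
    "TITLE": "Title",
    "PI_NAME": "PrincipalInvestigator",
}

def report_to_xml(report_text: str) -> list[ET.Element]:
    """Turn each row of a pipe-delimited report into one <FWP> element."""
    records = []
    reader = csv.DictReader(io.StringIO(report_text), delimiter="|")
    for row in reader:
        fwp = ET.Element("FWP")
        for column, tag in COLUMN_TO_TAG.items():
            # Missing columns become empty elements rather than errors.
            ET.SubElement(fwp, tag).text = (row.get(column) or "").strip()
        records.append(fwp)
    return records

report = "PROP_ID|TITLE|PI_NAME\nABC-001|Neutron Physics Instrumentation|Smith\n"
fwp_records = report_to_xml(report)
first_xml = ET.tostring(fwp_records[0], encoding="unicode")
```

Keeping the column-to-tag mapping as data rather than code is one way to reuse a single translator across different report layouts.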

 

HTML Document Repository

 Due to the relative newness of XML, we were unable to locate a mainstream browser that supported XSL natively. However, Microsoft has made available a prerelease of two tools supporting XSL styling: an ActiveX control and an independent XSL-to-HTML converter. (Note: ActiveX is a component-level technology for building applications from reusable parts.) The ActiveX control converts XML to HTML through XSL directives on the fly and loads the result dynamically into the user's browser. The independent XSL-to-HTML converter generates static HTML representations of the XML from XSL directives. We have developed a Java application and used XSL style sheets to perform the conversion from XML to HTML format. Currently we do most of our processing using XSL, with a limited amount of complex processing still done in Java. When selected from the CME system, the individual FWP files are displayed in the user's browser in their HTML format. Figure 4 shows the rendering of an HTML file for an example proposal submission. The HTML files are generated at the time the XML index is generated, by applying XSL style sheets to the local lab XML format.

Figure 4 - Rendering of an HTML file generated from XML proposal submission
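
The XML-to-HTML step can be approximated with the following sketch. It is not XSL: it uses plain Python as a stand-in for a style sheet, and the element names and table layout are hypothetical. It only illustrates the idea of generating a static HTML page from one XML proposal record.

```python
import xml.etree.ElementTree as ET
from html import escape

def render_fwp_html(fwp: ET.Element) -> str:
    """Render one FWP record as a simple two-column HTML table."""
    rows = []
    for child in fwp:
        label = escape(child.tag)
        value = escape(child.text or "")
        rows.append(f"<tr><th>{label}</th><td>{value}</td></tr>")
    title = escape(fwp.findtext("Title", default="Untitled proposal"))
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><table>{''.join(rows)}</table></body></html>"
    )

# A made-up record; the ampersand checks that text is escaped on output.
fwp = ET.fromstring(
    "<FWP><Title>Neutron Physics &amp; Instrumentation</Title>"
    "<PrincipalInvestigator>Smith</PrincipalInvestigator></FWP>"
)
page = render_fwp_html(fwp)
```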

 

Search Engines

 To allow possible tradeoffs between cost efficiency and search performance of the pilot system, we have provided three alternative search engines: (1) a relational database search engine; (2) an XML-based Java search engine; and (3) an HTML-based keyword search engine. The relational database search engine is part of a commercial relational database management system used to build a centrally managed database at Oak Ridge National Laboratory as a part of the CME pilot system. By including this search engine within the CME system, we have allowed quick access to the data in a secure environment. The relational database definition was generated automatically from the CME Object Model. The database was populated by translating data files from the XML repository into SQL 'insert' statements. This translator is written in the Tcl/Tk scripting language and is driven by the CME Object Model and the XML data files.
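
As an illustration of loading XML records into a relational table, here is a minimal sketch. It uses Python and an in-memory SQLite database rather than the Tcl/Tk translator and commercial database of the pilot; the table name `fwp` and the `COLUMNS` list are hypothetical, whereas in the CME system the schema was derived from the object model.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical column list standing in for a schema derived from the object model.
COLUMNS = ["ProposalID", "Title", "PrincipalInvestigator", "FiscalYear"]

def load_fwp(conn: sqlite3.Connection, fwp: ET.Element) -> None:
    """Insert one FWP record, passing values through placeholders."""
    values = [fwp.findtext(col, default="") for col in COLUMNS]
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO fwp ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        values,
    )

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE fwp ({', '.join(c + ' TEXT' for c in COLUMNS)})")
load_fwp(conn, ET.fromstring(
    "<FWP><ProposalID>ABC-001</ProposalID><Title>Neutron Physics</Title>"
    "<PrincipalInvestigator>Smith</PrincipalInvestigator>"
    "<FiscalYear>1997</FiscalYear></FWP>"
))

# A fielded query of the kind a program manager might issue.
rows = conn.execute(
    "SELECT ProposalID FROM fwp WHERE Title LIKE ? AND PrincipalInvestigator = ?",
    ("%Neutron Physics%", "Smith"),
).fetchall()
```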
 As an alternative to the relational database search engine, we have prototyped an XML-based search engine that allows querying of XML documents based on tag values, mimicking a database schema query. The prototype was constructed as a Java applet linked to an indexed base of XML documents via Java's RMI technology. The index accepts as input (1) the name of the tag hierarchy (i.e., the context) that defines the search space; and (2) a search keyword. Based on this input, the collection of XML files is searched and a list of research proposals matching the search criteria is returned to the user.
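
The two inputs of the tag-based search, a tag context and a keyword, can be sketched as follows. This is a simplified, hypothetical stand-in: it scans records linearly with Python's ElementTree paths instead of consulting a prebuilt index over RMI, and the element names are invented for the example.

```python
import xml.etree.ElementTree as ET

def search(records: list[ET.Element], context: str, keyword: str) -> list[str]:
    """Return ProposalIDs whose text under `context` contains `keyword`."""
    needle = keyword.lower()
    hits = []
    for fwp in records:
        # `context` is an ElementTree path such as "Investigator/LastName".
        for node in fwp.findall(context):
            if needle in (node.text or "").lower():
                hits.append(fwp.findtext("ProposalID", default=""))
                break  # one match per proposal is enough
    return hits

records = [
    ET.fromstring(
        "<FWP><ProposalID>ABC-001</ProposalID><Title>Neutron Physics</Title>"
        "<Investigator><LastName>Smith</LastName></Investigator></FWP>"
    ),
    ET.fromstring(
        "<FWP><ProposalID>ABC-002</ProposalID><Title>Climate Modeling</Title>"
        "<Investigator><LastName>Jones</LastName></Investigator></FWP>"
    ),
]
title_hits = search(records, "Title", "neutron")
name_hits = search(records, "Investigator/LastName", "Smith")
```

Restricting the match to a tag context is what distinguishes this from a plain keyword search: "Smith" in a title would not match a query on the investigator's last name.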
 We use a commercially available HTML indexing tool to create a keyword registry from the files within the HTML document repository. This allows for general keyword searches over the entire collection of FWP submissions.
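
The idea of a keyword registry can be illustrated with a small inverted index. This sketch is an assumption about how such a registry might work in general, not a description of the commercial tool used in the pilot; the page names and the crude tag stripping are for illustration only.

```python
import re
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of page names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, html in pages.items():
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def lookup(index: dict[str, set[str]], phrase: str) -> set[str]:
    """Return pages containing every word of `phrase` (AND semantics)."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

pages = {
    "abc-001.html": "<html><body><h1>Neutron Physics</h1></body></html>",
    "abc-002.html": "<html><body><h1>Climate Physics</h1></body></html>",
}
index = build_index(pages)
neutron_pages = lookup(index, "Neutron Physics")
```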
 

CME Client-Server Core System

 The queries, presentation, and reports are tied together by the CME client-server core system. The server interfaces to the search engines and communicates the search results to the client. The client is responsible for (1) providing the user interface; (2) presenting information to the user, including both search results and the proposal document formatted to preserve its native look; and (3) relaying the user's requests to the server. Overall, our approach has produced tools that convert raw database reports into XML files, along with a variety of ways of querying and presenting the XML information, ranging from tag-based search engines to XSL style sheets.
 

Application

 In order for the CME system to be successful, it must meet the user requirements described in the Introduction while also being cost efficient for the participating laboratories. Our preliminary evaluation of the system focused on the cost efficiency of the approach when adding a new laboratory participant to the system. We chose a very large, multipurpose national laboratory for our evaluation. This lab has most of its FWP information stored in a mainframe database. Our contact at the laboratory has extensive knowledge of the information stored within this database and was able to quickly find the information that we wanted. She was also able to gather the information we needed into database reports. The FWP information was represented in rows of data that were several thousand characters long. By necessity, we had to split the file into several smaller files, using a "proposal ID" field as a key into the information.
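
The splitting step can be sketched as follows, under the assumption that each wide row is divided into groups of columns and that every resulting part repeats the proposal ID so the parts can be rejoined later. The column names and groupings are hypothetical; the source does not describe how the files were actually partitioned.

```python
def split_wide_row(
    row: dict[str, str], key: str, groups: dict[str, list[str]]
) -> dict[str, dict[str, str]]:
    """Split one wide record into named parts, each repeating the key field."""
    parts = {}
    for part_name, columns in groups.items():
        part = {key: row[key]}  # the key lets the parts be rejoined
        part.update({col: row[col] for col in columns})
        parts[part_name] = part
    return parts

# Made-up wide record and column groupings.
wide_row = {
    "PROP_ID": "ABC-001",
    "TITLE": "Neutron Physics",
    "PI_NAME": "Smith",
    "BUDGET_FY97": "250",
}
groups = {"summary": ["TITLE", "PI_NAME"], "budget": ["BUDGET_FY97"]}
parts = split_wide_row(wide_row, "PROP_ID", groups)
```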
 It took approximately 2 days of a skilled person's time to retrieve and format the data we needed. This is a fairly short amount of time, particularly when compared with other alternatives, such as relational databases or object request brokers.
 Before we could make use of the raw data that was provided to us, we had to develop tools to help translate the lab's raw data into XML files. These tools must be flexible enough to work with a variety of potential report formats. Our goal in developing these tools was to avoid writing a custom tool for each different data format that we encountered. We did so by building a general-purpose translator from a very general series of reports to a specific XML file. This allows the representative of the lab to develop a series of simple select statements to retrieve the desired data. Adding the new lab to the CME system took a total of about four weeks: approximately three weeks to develop the translation tools that convert general database reports into XML files, and about a week to integrate the new XML data into the CME system. Of this four-week period, only two days were needed from the lab expert.
 There are two issues that we faced with this process, the first being the migration of the DTDs over time. As more and more labs are added to the system, even simple changes to the common DTD may require significant changes to the code. The second issue is deriving information from the labs that is clear and understandable. The first set of data that we received from the evaluation lab did not contain all the information that we needed, and some of the data was ambiguous. We have since designed a new guideline for the lab to provide data. This guideline is very general, but provides a format for gathering the data in a non-ambiguous way. If a new lab follows these general guidelines, it should only require a day or two for the lab representative to gather and format the necessary data. Additionally, the information that the new lab supplies can be integrated into the CME system within a week or two. This provides a very short time frame in which a new lab can be added, as well as a very low entry cost for the participating labs.
 

Discussion

 We are using XML as a storage layer definition language enabling distributed information management. This is clearly beyond what XML was originally designed to accomplish. From a developer's standpoint, it can be argued that such an XML-based system has the potential to provide an important component of the functionality of a distributed database for a fraction of the cost. On the other hand, it can be argued that this type of distributed system solution sets database technology back 30 years. However, from the user's standpoint, the technology used is of little importance, provided that the solution meets the user's need. For this reason, we believe that the key issue addressed by our XML-based solution is meeting the users' "quality of service" needs. By quality of service, we mean providing a solution that meets the users' needs at a cost the users can afford. For example, there is no question that relational database technology provides a far more robust solution to the general problem of distributed data than our XML solution does. However, a user of the CME system does not need many of the features that relational databases offer. Nor are the providers of the data willing to pay for the creation and maintenance of such a system. The key question we had to address was not which technology is best, but rather which technology is the most cost-effective for solving this problem.
 We have demonstrated that XML has great potential for managing structured, distributed data. However, there are drawbacks to XML at this time due to the immaturity of the technology. We found that books on XML are often out of step with the tools they reference. The tools tend to be error prone and limited in scope. We ran into a particularly annoying problem with valid DTD names causing errors in the XML parser we were using. It seemed that the parser was hashing DTD names, so that similar long names produced the same hashed results, thus causing an error. This led us to use shorter or more cryptic names in many cases. Likewise, we are able to represent much of the HTML forms we present to the users through XSL. However, there were some very frustrating exceptions. We wanted to put metatag information at the top of the generated HTML files, so that our Internet spider could return useful information about the page. Unfortunately, it appears this capability is not yet available. Consequently, we have a preprocessing Java application that adds this meta information to each page. Certainly, as XML tools evolve and mature this will become a moot point; for the near term, however, the immaturity of this technology is a noteworthy limitation of XML.
 Immaturity aside, we believe that XML has great potential to become a key future technology. It has the strengths of providing structured data in a distributed manner and of being fairly easy to use. A key future issue for XML is how to balance functionality with ease of use. In our case, the simplicity of XML made the technology viable for our application. Had XML been more complicated, it would have suffered from the high implementation costs that many other technologies face.
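
The meta-information step mentioned above can be sketched as a simple text transformation. This is written in Python rather than Java, and it assumes, hypothetically, that each generated page contains a literal lowercase `<head>` tag; the meta names and values are invented for the example.

```python
from html import escape

def add_meta_tags(page: str, meta: dict[str, str]) -> str:
    """Insert <meta> elements immediately after the opening <head> tag."""
    tags = "".join(
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for name, value in meta.items()
    )
    marker = "<head>"
    position = page.find(marker)
    if position == -1:
        return page  # leave pages without a <head> untouched
    insert_at = position + len(marker)
    return page[:insert_at] + tags + page[insert_at:]

page = "<html><head><title>FWP ABC-001</title></head><body></body></html>"
tagged = add_meta_tags(page, {"keywords": "neutron physics", "author": "Smith"})
```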
 

Summary

 Through the development of a pilot system to strengthen the research proposal management system at DOE, we have shown that XML provides a simple, low-cost means of distributed information management, a task that is typically performed by a large-scale enterprise information system. The system requirements called for the proposal information to be distributed and to incur a minimal cost to the owners of the information. Therefore, the development of a low-cost, distributed information management system was required.
 Distributed databases, object request brokers, and various middleware packages deal very nicely with distributed, heterogeneous data, but at quite a large cost to the owners of the data. Our approach is to use XML as an information repository that can be used to search for and present information. The produced XML files can be seen as a "database" of proposal information. We developed converters to translate raw database reports into XML files; various means of querying this information, including a tag-based search engine; and various ways of presenting this information, for example XSL style sheets.
 Applying this approach to a large, multipurpose national laboratory with hundreds of research proposals demonstrates the very low cost of integrating this type of information into our system while meeting the initial system requirements. Use of XML as a low-cost means of representing distributed and structured data is certainly feasible and, depending on constraints, may provide the only viable solution. XML is an emerging technology and suffers from immature tools and supporting standards. These shortcomings will no doubt be addressed over time. XML appears to have a very bright future.
 

Acknowledgments

 The Oak Ridge National Laboratory is managed by Lockheed Martin Energy Research Corp. for the U.S. Department of Energy under contract number DE-AC05-96OR22424. This work was supported by the Department of Energy, Division of Mathematical, Information, and Computational Sciences (MICS). We would like to thank Dan Hitchcock for his continuing support of the CME project; Kimberly Barnes, the original Principal Investigator of the CME project; and Mark Elmore for his work on developing many of the translation tools.
 

References

 Bray, T., Paoli, J., and Sperberg-McQueen, C. M. 'Extensible markup language (XML) 1.0.' A W3C standards committee recommendation (February 1998). Available at http://www.w3.org/TR/REC-xml
 Maler, E. and DeRose, S. 'XML pointer language (XPointer).' A W3C standards committee working draft (March 1998). Available at http://www.w3.org/TR/WD-xptr
 Layman, A., Jung, E., Maler, E., Thompson, H. S., Paoli, J., Mikula, N. H., and DeRose, S. 'XML-Data.' A W3C standards committee note (January 1998). Available at http://www.w3.org/TR/1998/NOTE-XML-data/
 Adler, S., Anders, B., Clark, J., Cseri, I., Grosso, P., Marsh, J., Nicol, G., Schach, D., Thompson, H. S., and Wilson, C. 'A proposal for XSL.' A W3C standards committee proposal (August 1997). Available at http://www.w3.org/TR/NOTE-XSL.html
 Singh, N. 'Unifying heterogeneous information models.' Communications of the ACM. 41, 5 (May 1998), 37-44.
 Lincke, D. M. and Schmid, B. 'Mediating electronic product catalogs.' Communications of the ACM. 41, 7 (July 1998), 86-88.
 Bosak, J. 'XML, Java, and the Future of the Web.' World Wide Web Journal. Volume II, Issue 4 (Fall 1997).
