XML Based Linking Concept   Table of contents   Indexes   The Future is Today: Case Studies in Innovation

 Fondazione Ugo Bordoni 
Iocchi, Luca
 Italy 
 Rome 
 
Luca Iocchi
 Post-Doc student
Fondazione Ugo Bordoni
  Via B. Castiglione 59 Rome  Italy (00142)
Email: iocchi@fub.it Web site: http://www.dis.uniroma1.it/~iocchi/
 Biography
 Luca Iocchi received his PhD degree in Computer Science in 1999 from the University of Rome "La Sapienza". He is currently a Post-Doc student at Fondazione Ugo Bordoni, a research institute in Rome. His research interests include knowledge representation, automated reasoning, planning, cognitive robotics, databases and the Web.
Carpineto, Claudio
 Fondazione Ugo Bordoni 
 Italy 
 Rome 
 
Claudio Carpineto
 Researcher
Fondazione Ugo Bordoni
  Via B. Castiglione 59 Rome  Italy (00142)
Email: carpinet@fub.it Web site: http://www.fub.it/
 Biography
 Claudio Carpineto is the head of the "Information systems" group at Fondazione Ugo Bordoni, a telecommunication research center based in Rome. He received his honours laurea degree in electronic engineering from the University of Rome "La Sapienza" in 1984 and has been affiliated with the Fondazione Ugo Bordoni since 1986. He has been a visiting researcher at universities in the USA and Great Britain. His main research interests are artificial intelligence, information retrieval and databases.
 

Introduction

 Knowledge management and information processing are central concerns for a large number of organizations working in many different areas. In recent years the World Wide Web has become not only the most common means of distributing knowledge over the Internet but also a huge collection of potentially interesting information sources.
 Because of the great amount of information available on the Web, users need to extract and summarize relevant data from different sources and to present these data in an appropriate format. However, such information is usually not easy to access and catalogue automatically. The main reasons are that: (i) information on the World Wide Web is usually presented in a human-oriented way rather than in a machine-readable format (e.g. HTML pages are produced to display information to a human user and are not application-oriented); (ii) information sources are heterogeneous, and hence similar data are often presented in different ways by different sources. For this reason, the development of customized tools that help the user access heterogeneous information sources has been extensively studied (see [6,4] for surveys).
 Moreover, effective information access and knowledge management require documents to be expressed in a standard format, so that a clear semantics can be assigned to the data they contain. New standards for the representation of Web documents have therefore been proposed recently, and among them the eXtensible Markup Language (XML) [7] has become increasingly popular.
 XML is a subset of SGML that allows the use of a Document Type Definition (DTD), that is, a schema expressing the structure of documents. An XML document is valid (with respect to a DTD) if it respects the specification given by the DTD. A DTD provides not only a syntactic specification that is used for composing and verifying documents, but also a tool for assigning clear semantics to the data included in valid XML documents. In other words, XML makes it possible to associate both the syntax and the semantics of a document within a single data model. The transition from HTML Web pages to XML documents will rely upon the generation of XML documents that are valid with respect to a given DTD, and application tools that help users in this step are needed.
 A very recent proposal is the XHTML language [8], a reformulation of HTML in XML. This language provides a purely syntactic translation from an HTML page to a valid XML document (according to the DTD defining XHTML), and automatic tools have been developed to implement this translation. However, because the translation is purely syntactic and lacks any semantic characterization of data, XHTML does not provide a method for effective access to the relevant information in a document. Indeed, in order to deal effectively with the semantics of the data included in a document, XML documents must be valid with respect to a user-defined DTD that describes the semantics of those data.
 Our objective is thus the development of Web systems that are able to extract relevant data from Web information sources (i.e. sets of HTML pages) and to present them in an XML document that is valid with respect to a user-defined DTD expressing the semantics of these data. One possible way to perform this task is a procedural approach, that is, implementing programs (called wrappers) that extract data from a specific information source. However, the presence of heterogeneous information sources, whose models are not known a priori, makes scalability one of the main requirements for such systems, and a procedural approach may not be adequate in this context, since the models of the information sources to be accessed are hand-coded into the programs. In other words, Web information extraction systems must be "easily" programmed to deal with several different sources and, to this end, we believe that a declarative approach may be better than a procedural one, since it provides a higher degree of abstraction in the description of the information sources.
 In order to devise such a declarative approach, we describe the design and implementation of cognitive agents for Web information extraction (we call them cognitive wrapper agents) that are able to extract relevant information from Web sites by relying on a high-level description of the models of the sites. Starting from a specification of the domain of interest given by a user-defined DTD, these agents generate XML documents that are valid with respect to this DTD and contain data extracted from Web pages.
 We have fully implemented one such agent and tested it in the domain of stock markets. The stock market agent is able to collect data from several different European stock markets and to integrate them in a common model, generating a valid XML document (with respect to a user-defined DTD) containing these data. We have also experimented with the XML-QL query language [3] in order to perform queries over the extracted data. In this way, even though the information sources providing stock market data are very different from one another, users can access the extracted data (in the XML documents) without knowing the original structure of the information sources and can query this "database" by using an XML query language such as XML-QL.
 
 Fig. 1 System architecture of the agent
 

Cognitive agents for Web Information Extraction

 The architectural schema of our cognitive wrapper agent is shown in Fig. 1. The system takes as input a set of Web pages and a description of the data to be extracted, specified by a DTD, and returns as output an XML document that is valid with respect to the given DTD. It is decomposed into two parts: a cognitive agent (represented in the dashed box) and a human designer who takes part in the extraction process by providing input to the agent.
 There are two different data flows in the schema. The bold one is performed off-line (with respect to the actual extraction process), and only when the application domain is defined (for example, when new information sources are added to the system). The other path represents the on-line extraction of data from Web sites and does not require human assistance.
 The agent relies upon a high-level description of the domain model and of the information sources, in the form of a knowledge base containing a set of axioms in a formal language. This knowledge base is used by a deductive reasoning system for automatically generating an extraction program expressing the actions that the agent must perform in order to extract relevant data from the Web sites (see [1,2,5] for details).
 Once the extraction program has been generated by the reasoning system, it is executed to perform the actual data extraction. The on-line data flow thus consists of: 1) a parser module that generates an intermediate representation of the Web pages according to a model for semi-structured data; 2) a data extractor module that executes the extraction program by activating some basic procedures on the semi-structured representation of the input pages.
 Two main features of the system are worth noting:
 1. the cognitive agent does not depend on the application domain, i.e. it is a general purpose extraction agent driven by its own knowledge base;
 2. the user is involved in the extraction process only at design time.
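 As an illustration of the parser module in the on-line data flow, the following Python sketch turns the tables of an HTML page into a simple intermediate representation. It is not the parser used in the system: the paper does not specify the model for semi-structured data, so representing each table as a list of rows of cell strings is an assumption, and the names TableParser and parse_page are hypothetical. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> of a page as a list of rows of cell strings.

    Hypothetical stand-in for the parser module; the row/cell model is an
    assumption, not the representation used by the authors.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each one a list of rows
        self._rows = None  # rows of the table currently being read
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text is kept only while a cell is open.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def parse_page(html):
    """Return the tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

 A data extractor module could then run its basic procedures (for example, looking for a column of names) on the value returned by parse_page instead of on the raw HTML.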
 

The stock market agent

 In this section we describe an application domain that we use as a testbed for our approach. We consider an organization that is interested in monitoring stock market data provided by specialized Web sites. The task is thus to extract data from different Web sites providing information about share prices on different markets and to integrate them within a common data model. In this way users can access these data without considering the actual structure of the original information sources. We consider three relevant data items: the company name, the current price, and the date to which the prices refer.
 The data model for our task is given by a DTD specifying the schema of the output XML documents, which are structured as a list of share elements, each containing the company name, the current price and the current date. This DTD specification is shown below:
 
<!DOCTYPE LISTSHARES [
<!ELEMENT LISTSHARES (SHARE+)>
<!ELEMENT SHARE (NAME, PRICE, DATE)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT PRICE (#PCDATA)>
<!ELEMENT DATE (DAY, MONTH, YEAR)>
<!ELEMENT DAY (#PCDATA)>
<!ELEMENT MONTH (#PCDATA)>
<!ELEMENT YEAR (#PCDATA)>
]>
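 To make the data model concrete, the following Python sketch builds a LISTSHARES document from extracted records and checks it against the content model of the DTD above. It uses only the standard library, which has no DTD validator, so the check is a hand-written approximation of validation for this one DTD; the names CONTENT_MODEL, build_listshares and conforms are hypothetical.

```python
import xml.etree.ElementTree as ET

# Content model transcribed from the DTD: element name -> ordered children.
# An empty tuple stands for (#PCDATA), i.e. no child elements.
CONTENT_MODEL = {
    "SHARE": ("NAME", "PRICE", "DATE"),
    "DATE": ("DAY", "MONTH", "YEAR"),
    "NAME": (), "PRICE": (), "DAY": (), "MONTH": (), "YEAR": (),
}


def build_listshares(records):
    """Build a LISTSHARES tree from (name, price, day, month, year) tuples."""
    root = ET.Element("LISTSHARES")
    for name, price, day, month, year in records:
        share = ET.SubElement(root, "SHARE")
        ET.SubElement(share, "NAME").text = name
        ET.SubElement(share, "PRICE").text = price
        date = ET.SubElement(share, "DATE")
        ET.SubElement(date, "DAY").text = day
        ET.SubElement(date, "MONTH").text = month
        ET.SubElement(date, "YEAR").text = year
    return root


def conforms(element):
    """Return True if the subtree matches the content model of the DTD."""
    children = [child.tag for child in element]
    if element.tag == "LISTSHARES":
        # (SHARE+): one or more SHARE children and nothing else.
        ok = len(children) >= 1 and all(tag == "SHARE" for tag in children)
    elif element.tag in CONTENT_MODEL:
        ok = tuple(children) == CONTENT_MODEL[element.tag]
    else:
        ok = False
    return ok and all(conforms(child) for child in element)


doc = build_listshares([("ABB AG I", "2084", "08", "april", "1999")])
print(ET.tostring(doc, encoding="unicode"))
print(conforms(doc))
```

 A SHARE element without a PRICE child, for instance, is rejected by conforms, because the DTD requires NAME, PRICE and DATE in that order.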
 

The agent's KB

 The agent is provided with a knowledge base describing the environment and the actions that can be performed, and with a reasoning system that automatically generates conditional plans for extracting data from Web pages; these plans are then executed by the system to perform the extraction.
 The axioms in the KB are expressed in the following notation (see [1, 2, 5] for a formal description of the syntax and semantics of the language):
 P : A → Q
 where A is an action that the agent can perform, P is a formula denoting the preconditions of the action, and Q specifies the postconditions; the notation [Q1; Q2] in the postconditions indicates that only one of the two formulas will be verified after the execution of the action.
 The primitive actions that have been defined for the agent are the following: findbigtable and findbigpretable, which search for a big table or a big pre-formatted table in a Web page; findcolname and findcolprice, which search for a column with names or prices in a table; finddate, which searches for a date in the page; extractname, extractprice and extractdate, which extract names, prices and dates from a page.
 The following axioms in the KB specify the execution of these actions. This KB, which we use for the stock market agent, has been obtained by an analysis of the information sources.
 Page: findbigtable → [BigTable ; ¬BigTable]
 Page: finddate → [IsDate ; ¬IsDate]
 BigTable: findcolname → [ColName ; ¬ColName]
 BigTable: findcolprice → [ColPrice ; ¬ColPrice]
 BigTable: findbigpretable → [PreBigTable ; ¬PreBigTable]
 PreBigTable: findprecolname → [ColName ; ¬ColName]
 ColName: extractname → Name
 ColPrice: extractprice → Price
 IsDate: extractdate → Date
 Each of the above axioms specifies the behavior of the agent in a certain situation, and the execution of the corresponding action produces a state transition. For example, the first axiom states that from a generic Web page it is possible to perform an action that checks whether there is a big table in the page; the third axiom specifies that if there is a big table then it is possible to execute an action that searches for a column with company names; and so on.
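 The notation P : A → Q can be mirrored by a small data structure. The Python sketch below is only an illustration of how such axioms might be stored and queried, not the formal language of [1, 2, 5]. It transcribes the axioms for ordinary tables and dates only, it assumes that the second alternative of each bracketed postcondition is the negation of the first (written here with a leading "not "), and the names Axiom, KB, applicable and is_sensing are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axiom:
    precondition: str  # P: a fact that must hold before the action
    action: str        # A: the primitive action
    outcomes: tuple    # Q: possible postconditions, exactly one will hold


# Subset of the KB; "not X" encodes the assumed negated alternative.
KB = [
    Axiom("Page", "findbigtable", ("BigTable", "not BigTable")),
    Axiom("Page", "finddate", ("IsDate", "not IsDate")),
    Axiom("BigTable", "findcolname", ("ColName", "not ColName")),
    Axiom("BigTable", "findcolprice", ("ColPrice", "not ColPrice")),
    Axiom("ColName", "extractname", ("Name",)),
    Axiom("ColPrice", "extractprice", ("Price",)),
    Axiom("IsDate", "extractdate", ("Date",)),
]


def applicable(kb, facts):
    """Actions whose precondition is among the facts known to hold."""
    return [axiom.action for axiom in kb if axiom.precondition in facts]


def is_sensing(axiom):
    """An action with alternative outcomes needs a test after it is run."""
    return len(axiom.outcomes) > 1


print(applicable(KB, {"Page"}))
print(applicable(KB, {"Page", "BigTable"}))
```

 Actions for which is_sensing is true are the ones that would give rise to a conditional statement in a generated plan, since their outcome is known only at execution time.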
 

The extraction program

 Starting from a KB specifying the stock market domain, and given an initial situation (being on a generic Web page) and a goal to achieve (extracting names, prices and dates from the page), the reasoning system automatically generates a conditional plan for data extraction. This plan can be expressed as a program in a high-level procedural language including sequential and conditional statements. The extraction program generated by the system from the KB presented in the previous section is given below.
 
FINDDATE();
if (FINDDATE_T) {
  FINDBIGTABLE();
  if (FINDBIGTABLE_T) {
    FINDCOLNAME();
    if (FINDCOLNAME_T) {
      FINDCOLPRICE();
      if (FINDCOLPRICE_T) {
        EXTRACTDATE();
        EXTRACTPRICE();
        EXTRACTNAME();
      }
      else {
        FAIL();
      }
    }
    else {
      FAIL();
    }
  }
  else {
    FINDPRETABLE();
    if (FINDPRETABLE_T) {
      FINDPRECOLNAME();
      if (FINDPRECOLNAME_T) {
        FINDPRECOLPRICE();
        if (FINDPRECOLPRICE_T) {
          EXTRACTDATE();
          EXTRACTPRICE();
          EXTRACTNAME();
        }
        else {
          FAIL();
        }
      }
      else {
        FAIL();
      }
    }
    else {
      FAIL();
    }
  }
}
else {
  FAIL();
}
 This program can be regarded as the main extraction program, which calls basic procedures that are also provided to the system (see Fig. 1). Executing it performs the actual data extraction and XML generation.
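 The control flow of the extraction program can be illustrated with a small interpreter. The Python sketch below encodes the same conditional plan as nested tuples and walks it, calling a sense function for each test; it is an illustration of how such a plan could be executed, not the system's data extractor. The names branch, PLAN, run and sense are hypothetical, and the basic procedures are replaced by a stub that only reports success or failure.

```python
FAIL = ("fail",)
EXTRACT = ("do", ("EXTRACTDATE", "EXTRACTPRICE", "EXTRACTNAME"))


def branch(test, on_true, on_false=FAIL):
    """One conditional step: run `test`, then continue on one of two sub-plans."""
    return ("if", test, on_true, on_false)


# Same structure as the extraction program above.
PLAN = branch(
    "FINDDATE",
    branch(
        "FINDBIGTABLE",
        branch("FINDCOLNAME", branch("FINDCOLPRICE", EXTRACT)),
        branch(
            "FINDPRETABLE",
            branch("FINDPRECOLNAME", branch("FINDPRECOLPRICE", EXTRACT)),
        ),
    ),
)


def run(plan, sense):
    """Execute a plan and return the list of actions performed.

    `sense(name)` stands for a basic procedure: it runs the named search
    action and returns True if it succeeded (the *_T flag of the program).
    """
    trace = []
    node = plan
    while node[0] == "if":
        _, test, on_true, on_false = node
        trace.append(test)
        node = on_true if sense(test) else on_false
    if node[0] == "do":
        trace.extend(node[1])
    else:
        trace.append("FAIL")
    return trace


# Stub for a page with a date and a pre-formatted table but no HTML table.
outcomes = {"FINDDATE": True, "FINDBIGTABLE": False, "FINDPRETABLE": True,
            "FINDPRECOLNAME": True, "FINDPRECOLPRICE": True}
print(run(PLAN, lambda name: outcomes.get(name, False)))
```

 Keeping the plan as data rather than as hand-written code reflects the declarative approach argued for in the paper: a different KB would yield a different PLAN value, while run stays unchanged.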
 

The data extraction process

 
 Fig. 2 Web pages providing stock information
 Fig. 2 shows three Web pages providing stock information from the Italian, German and Swiss markets. Notice that their structures are quite different. Executing the extraction program shown in the previous section on these pages produced the following XML documents, which are valid with respect to the DTD we have defined for this task.
 
<LISTSHARES>
  <SHARE>
    <NAME>B Agr Mantov</NAME>
    <PRICE>24009.75</PRICE>
    <DATE>
      <DAY>9</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  <SHARE>
    <NAME>B Des-Br r99</NAME>
    <PRICE>3446.56</PRICE>
    <DATE>
      <DAY>9</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  <SHARE>
    <NAME>B Desio-Br</NAME>
    <PRICE>6718.8598</PRICE>
    <DATE>
      <DAY>9</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  ...
</LISTSHARES>
 
 
<LISTSHARES>
  <SHARE>
    <NAME>1 % 1 AG % CO. KGAA AKTIEN DM 5
/DE0005089007</NAME>
    <DATE>
      <DAY>09</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  <SHARE>
    <NAME>1 % 1 AG % CO. KGAA AKTIEN DM 5
/DE0005089007</NAME>
    <PRICE>120.00</PRICE>
    <DATE>
      <DAY>09</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  <SHARE>
    <NAME>AC-SERVICE AG NAMENS-AKTIEN O.N.
/DE0005110001</NAME>
    <PRICE>27.90</PRICE>
    <DATE>
      <DAY>09</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  ...
</LISTSHARES>
 
 
<LISTSHARES>
  <SHARE>
    <NAME>ABB AG I</NAME>
    <PRICE>2084</PRICE>
    <DATE>
      <DAY>08</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  <SHARE>
    <NAME>ABB AG N</NAME>
    <PRICE>418</PRICE>
    <DATE>
      <DAY>08</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  <SHARE>
    <NAME>ADECCO I</NAME>
    <PRICE>710.00</PRICE>
    <DATE>
      <DAY>08</DAY>
      <MONTH>april</MONTH>
      <YEAR>1999</YEAR>
    </DATE>
  </SHARE>
  ...
</LISTSHARES>
 

Querying the extracted data

 The extracted data can be accessed by running a query language for XML (for instance the XML-QL query language [3]) over the XML documents. Uniform access to the extracted data is possible because all the XML documents generated by the system are valid with respect to the user-defined DTD. Querying the XML documents allows the user to produce different views of the extracted data. For example, the following query has been used to extract all the shares in the Swiss market whose price is higher than 1000.
 
construct <SHARE>$s</SHARE>
where <*.SHARE>$s </> in "CH-Apr08.xml",
      <NAME>$n</> in $s, <PRICE.PCDATA>$p</> in $s,
      $p>1000
 The XML-QL interpreter generated the following output:
 
<SHARE>
  <NAME>ABB AG I</NAME>
  <PRICE>2084</PRICE>
  <DATE>
    <DAY>08</DAY>
    <MONTH>april</MONTH>
    <YEAR>1999</YEAR>
  </DATE>
</SHARE>
<SHARE>
  <NAME>ALETSCH I</NAME>
  <PRICE>3700.00</PRICE>
  <DATE>
    <DAY>08</DAY>
    <MONTH>april</MONTH>
    <YEAR>1999</YEAR>
  </DATE>
</SHARE>
 Note that combining the extraction system with a query language for XML documents allows the user to write queries against a specified DTD and to obtain results without needing to know the structure of the information sources that contain the data.
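 For readers without an XML-QL interpreter, the same selection can be approximated with the Python standard library. The sketch below is not XML-QL and was not part of the experiments; it applies the condition of the query (SHARE elements at any depth that have a NAME and a PRICE greater than 1000) to a small inline document built from three Swiss entries shown earlier. The function name shares_above and the inline sample are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<LISTSHARES>
  <SHARE><NAME>ABB AG I</NAME><PRICE>2084</PRICE>
    <DATE><DAY>08</DAY><MONTH>april</MONTH><YEAR>1999</YEAR></DATE></SHARE>
  <SHARE><NAME>ABB AG N</NAME><PRICE>418</PRICE>
    <DATE><DAY>08</DAY><MONTH>april</MONTH><YEAR>1999</YEAR></DATE></SHARE>
  <SHARE><NAME>ADECCO I</NAME><PRICE>710.00</PRICE>
    <DATE><DAY>08</DAY><MONTH>april</MONTH><YEAR>1999</YEAR></DATE></SHARE>
</LISTSHARES>
"""


def shares_above(root, threshold):
    """Return the SHARE elements whose PRICE is numerically above threshold."""
    selected = []
    for share in root.iter("SHARE"):      # any depth, like <*.SHARE>
        price = share.findtext("PRICE")
        if share.find("NAME") is None or price is None:
            continue                      # the pattern needs NAME and PRICE
        try:
            value = float(price)
        except ValueError:
            continue                      # skip prices that are not numbers
        if value > threshold:
            selected.append(share)
    return selected


root = ET.fromstring(SAMPLE)
for share in shares_above(root, 1000):
    print(ET.tostring(share, encoding="unicode"))
```

 Because every generated document follows the same DTD, the same function would work unchanged on the documents extracted from the Italian and German sources.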
 

Conclusions

 With the increasing growth of the number of documents on the World Wide Web, users need powerful and flexible tools for extracting, summarizing and presenting in an appropriate format the huge amount of information that is available on the Web. The transition from HTML pages to user-defined XML documents will play an important role in this scenario.
 In this paper we have described the development of cognitive wrapper agents that are able to extract relevant data from heterogeneous Web information sources and to present them in an XML document that is valid with respect to a user-defined DTD expressing the semantics of these data. Our work suggests that a declarative approach to Web information extraction and integration may deserve more attention than it has received so far, since it provides an adequate degree of abstraction in the representation of the information sources, which is necessary for improving the scalability of these systems.
 

References

 [1] G. De Giacomo, L. Iocchi, D. Nardi, and R. Rosati. Moving a robot: the KR&R approach at work. In Proceedings of the Fifth International Conference on the Principles of Knowledge Representation and Reasoning (KR-96), 1996.
 [2] G. De Giacomo, L. Iocchi, D. Nardi, and R. Rosati. Planning with sensing for a mobile robot. In Proc. of 4th European Conference on Planning (ECP'97), 1997.
 [3] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. http://www.w3.org/TR/NOTE-xml-ql/.
 [4] D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the World Wide Web: A survey. SIGMOD Record, September 1998.
 [5] L. Iocchi. Design and Development of Cognitive Robots. PhD thesis, DIS, Università di Roma "La Sapienza", 1999.
 [6] L. Iocchi and D. Nardi. Information access in the Web. In Proceedings of WebNet'97, 1997.
 [7] World Wide Web Consortium (W3C). Extensible Markup Language (XML) 1.0 (1998). http://www.w3.org/TR/1998/REC-xml-19980210.
 [8] World Wide Web Consortium (W3C). XHTML 1.0: The extensible HyperText Markup Language. http://www.w3.org/TR/xhtml1.
