
 

Selection and utilization of metadata from news articles

Asta Bäck
 Senior Research Scientist
 VTT Information Technology, Tekniikantie 4 B
 Espoo, Finland, FIN-02044 VTT
 Phone: +358 9 456 4536 Fax: +358 9 455 2839 email: asta.back@vtt.fi web site: www.vtt.fi/tte
 Biography
 Asta Bäck graduated with a Master of Science in printing technology and graphic arts from the Helsinki University of Technology in 1983, and since then she has worked as a research scientist and project manager at the Technical Research Center of Finland (VTT). She is now Senior Research Scientist in the media research field at VTT Information Technology. During the early nineties her work focused on developing new approaches and methods for the management and control of printing and publishing processes. With the advent of digital printing and web publishing, the main focus of her work has shifted to analyzing and developing solutions that utilize these new opportunities. She is the author or co-author of some 80 technical papers and has given numerous lectures both nationally and internationally.
 Abstract
 Metadata is needed for various purposes in the publishing process. This paper lists these purposes, analyses some metadata dictionaries with regard to the aspects they cover, and reports a case implementation in which metadata is used to support electronic publishing services for end users and editors. Some of the metadata is created by people and some is collected semiautomatically or automatically. Finally, the paper discusses the experiences gained from using this metadata and from processing the actual XML articles for rendering.
 

Introduction

 How to define and create metadata is one of the much-discussed topics in the publishing world. It has gained in importance with the rise of electronic publishing. While the focus used to be on providing sufficient information to find an existing publication, recent developments in publishing have made metadata important also for content that has not been published in the traditional sense of the word. Also, what used to be regarded as a single publication is today often viewed as consisting of many components, which may need metadata of their own. The huge web portals and the different personalized information services are two important examples of new publishing opportunities that also require additional metadata.
 This paper gives a classification of the different types of metadata, analyses some metadata dictionaries and projects by using this classification, and finally discusses our own implementation for creating and using metadata relating to technical news articles, with a review of our own experiences and ideas for improvements.
 In this paper, the terms 'publisher' and 'publication' are used in a broad sense. The word 'publisher' refers to any organization or company that makes content available to other parties. The words 'document', 'resource' and 'publication' are used interchangeably.
 

Metadata classification and metadata dictionaries

 

Basic data and a unique identification

 The basic data includes the basic facts of a publication, such as the creator and when and where it was published. This type of information is traditionally regarded as the core metadata. The typical role of the basic data is to provide enough information about a resource so that anyone interested will be able to locate it. This information largely aims at a unique identification, and systems have been set up to support this. The ISBN numbering for books is one example. The ISBN number alone would be sufficient to uniquely identify any published book, but for the sake of readability other information is usually provided as well. In the web community, the URL has provided a unique identification. With regard to documents, its major drawback is that there is no guarantee that the document's location or content remains unchanged; therefore other methods have been developed for the unique identification of electronic publications. The DOI (Digital Object Identifier) initiative is an example of this [1].
 

Content

 With the metadata that describes the content, we usually try to provide answers to questions like 'Which documents tell us about natural catastrophes in North America during the last decade', or 'Which documents deal with XML in publishing applications'.
 Keywords and classifications are the most traditional ways to describe the content of a document or a publication. The Dewey Decimal System is one of the universal classification schemes. There are also innumerable index term lists, for general or special purposes.
 Another content description approach is the PICS specification [2], which is a product of the web era. It was originally designed to help parents and teachers control what children access on the Internet by creating a way to include ratings for web resources.
 The goal of the current PRISM project [3] is to define an XML metadata vocabulary that would make it possible to describe the content of articles that are sold to other publishers and portals. The primary areas that the metadata dictionary should cover are magazine publishing, news and book publishing.
 The International Press Telecommunications Council has defined a subject list and a property list which make it possible to describe the content of a news item [4].
 The new ISO standard for Topic Maps [5] also addresses the issue of content description. Topic Maps can be created to describe the relations between topics, with or without reference to the documents that deal with these topics. Topic Maps can be used as a tool to create and manage the metadata of metadata.
 

Copyright

 Copyright-related metadata has gained significantly in importance now that content is distributed in digital format and there are more publishers and publishing channels. The goal of the European indecs project is to develop a data model for describing the copyright and the agreements relating to the content [6]. The DOI project also aims to create an infrastructure for copyright management.
 The Xerox Digital Property Rights Language (DPRL) [7] is another copyright-related initiative. It is said to provide a mechanism by which different terms and conditions related to access, fee, and time can be specified and enforced for the different operations on digital documents, such as view, print, and copy.
 Most of the existing metadata dictionaries, such as Dublin Core [8] or XMLNews-Meta [9], support the inclusion of only very basic copyright information.
 

Technical information

 Technical information about the resource has a supporting role in the metadata. The same content may exist in several different formats, which serve the same or almost the same function (e.g. various text formats), whereas some formats may serve totally different purposes even though they contain much of the same information (e.g. a transcript and a video). The metadata should support different instances of the same content.
 

Relations to other resources

 The issue of relations to other resources is of interest to different user groups during the whole life span of any document. When a document is created, its relation to previous documents on the same topic is often stated in the text, and should, of course, be included in the metadata. Many users, particularly in the scientific world, create their own views of the relations between publications. The XLink proposal provides a tool to manage the relations between documents [10].
 

Related components

 When electronic publishing processes are used, one document or publication is composed of several components. Content producers and publishers usually need metadata both at the component level and at the different composite levels, whereas end users usually view the publications as a whole.
 

Resource management

 The version and the status of the document are the most important pieces of management information. In most cases, this type of metadata is needed in content creation and production processes. In quick-paced publication processes, like the news publication process, version management should also be supported across companies.
 

Application of technical news article metadata

 

Definition of the metadata dictionary

 VTT Information Technology publishes a monthly newsletter, GT-Bulletin, consisting of articles related to printing and publishing. New articles are written for the printed issue, and after the printed version is completed the articles are stored in the XML format for electronic retrieval. At present, only our own personnel have access to the articles in the XML format, but we also plan to allow our subscribers to access them. For now, the only version that subscribers can read electronically is the PDF version of the bulletin.
 The basic requirement for the metadata was that it must provide a way to find the articles that deal with certain topics or with certain companies, or their products and services. This should be achieved with a minimum of extra work.
 When we designed the application, we had to consider certain restrictions. Most importantly, we had a relational database at our disposal, and the authors' additional work had to be kept to an absolute minimum. Our application begins with ready-made articles, so we have no need for article version control.
 The articles of the GT-Bulletin are, by tradition, classified into 14 categories. We took these categories as one way to describe the articles. With only these few categories, and with the tendency to list an article in several categories, the list of articles in every category is fairly long and it takes time to find the relevant articles. Obviously, additional information is needed to make the search easier.
 To decide how we should describe the articles, we looked into the existing metadata vocabularies and, in particular, into the XMLNews-Meta and the Dublin Core. Not surprisingly, neither one of these dictionaries includes all the elements we wanted for our application.
 The way XMLNews-Meta describes the content of the article was closer to our needs. In our articles, proper names carry important content information, and we wanted to collect them into our metadata document to enable precise searches. This is also the approach in XMLNews-Meta. For our application, we modified XMLNews-Meta both by extending it and by discarding some elements.
 

Production and utilization of metadata for content description

 We chose the following elements to describe the contents of our news articles:
 
  1. classification (one or more values from a predefined list)
  2. type of the article (one value from a predefined list)
  3. description (free text = headnote or the first paragraph)
  4. datelineDate
  5. datelineLocation
  6. datelineEvent
  7. source
  8. subheadings
  9. company name
  10. event name
  11. location name
  12. URL
  13. person's name
  14. product or system
  15. project
  16. acronym
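 The element list above can be pictured as a small XML metadata document. The following is a minimal sketch, not our actual metadata DTD: the element names, the split into single-valued and repeatable elements, and the sample values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the actual metadata DTD is not reproduced here.
SINGLE_VALUED = ("type", "description", "datelineDate",
                 "datelineLocation", "datelineEvent", "source")
REPEATABLE = ("classification", "subheading", "company", "event",
              "location", "url", "person", "product", "project", "acronym")


def build_metadata(article):
    """Return a <metadata> element built from a dict of field values."""
    root = ET.Element("metadata")
    # Elements that occur at most once per article.
    for name in SINGLE_VALUED:
        if name in article:
            ET.SubElement(root, name).text = article[name]
    # Elements that may occur several times, one child per value.
    for name in REPEATABLE:
        for value in article.get(name, []):
            ET.SubElement(root, name).text = value
    return root


# Invented sample values, for illustration only.
example = {
    "type": "news",
    "description": "First paragraph of the article.",
    "datelineDate": "2000-05-15",
    "classification": ["digital printing", "workflow"],
    "company": ["Nokia"],
    "acronym": ["XML"],
}
xml_text = ET.tostring(build_metadata(example), encoding="unicode")
print(xml_text)
```

 Keeping the repeatable elements as flat siblings makes it straightforward to load each value into its own row of a relational table later on.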
 We have developed an application in which XML tagging and metadata generation are performed partly automatically and partly with computer assistance. To collect the content-describing elements (elements 9 to 16 in the previous list), TextMorfo software by a Finnish company called Kielikone Oy is used to search for proper names and to find their basic forms. Our application shows the words to the user for classification. The users have to go through the proposed terms and accept, reject or change their classification. The classifications are stored in a database, so over time the system becomes more proficient in making the right suggestions for classification and it can run more automatically.
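 The idea of learning from stored user decisions can be sketched as follows. This is only an illustration under stated assumptions: the table layout, the function names and the rule "suggest the category accepted most often" are hypothetical, and the proper names are assumed to have already been reduced to their basic forms (the step performed by TextMorfo in our application).

```python
import sqlite3


def open_store():
    """Create an in-memory store for user classification decisions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE decisions ("
        " term TEXT, category TEXT, n INTEGER,"
        " PRIMARY KEY (term, category))"
    )
    return conn


def record_decision(conn, term, category):
    """Count one more accepted classification of `term` as `category`."""
    cur = conn.execute(
        "UPDATE decisions SET n = n + 1 WHERE term = ? AND category = ?",
        (term, category),
    )
    if cur.rowcount == 0:  # first time this pair is seen
        conn.execute(
            "INSERT INTO decisions (term, category, n) VALUES (?, ?, 1)",
            (term, category),
        )


def suggest(conn, term):
    """Return the most frequently accepted category, or None if unseen."""
    row = conn.execute(
        "SELECT category FROM decisions WHERE term = ?"
        " ORDER BY n DESC, category LIMIT 1",
        (term,),
    ).fetchone()
    return row[0] if row else None


store = open_store()
record_decision(store, "Nokia", "company")
record_decision(store, "Nokia", "company")
record_decision(store, "Nokia", "location")
print(suggest(store, "Nokia"))
```

 A term that has never been classified yields no suggestion, so it would still be presented to the user without a default.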
 At this point, we do not try to classify the articles automatically, but expect the user to make a manual classification as a part of the final check for the metadata. The user also selects the type of the article from a predefined list.
 The rest of the metadata elements are marked by the authors as a routine task when they write the articles. The required metadata elements can therefore be picked automatically from the article text into the metadata document.
 When the metadata generating process is completed, we produce two XML documents: the metadata document and the article with detailed XML tagging. By detailed XML tagging we mean that all proper nouns included in the metadata document are tagged in the article as well.
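 Detailed tagging of this kind can be sketched as a pass over the article text that wraps every known proper noun in an element named after its category. The sketch below is an assumption-laden simplification: the tag names are hypothetical, and it matches exact strings only, whereas inflected forms would first have to be mapped to their basic forms.

```python
import re
from xml.sax.saxutils import escape


def tag_proper_nouns(text, terms):
    """Wrap each known term in an element named after its category.

    `terms` maps a proper noun to a (hypothetical) tag name.
    """
    if not terms:
        return escape(text)
    # Longest terms first, so "Nokia Corporation" wins over "Nokia".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b"
    )
    out, pos = [], 0
    for match in pattern.finditer(text):
        # Escape the untagged text between matches for XML output.
        out.append(escape(text[pos:match.start()]))
        tag = terms[match.group(1)]
        out.append("<%s>%s</%s>" % (tag, escape(match.group(1)), tag))
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


tagged = tag_proper_nouns(
    "Nokia Corporation & VTT met.",
    {"Nokia Corporation": "company", "Nokia": "company", "VTT": "company"},
)
print(tagged)
```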
 After the metadata document and the article are created, the metadata is stored in a relational database and the articles in a file system. An HTML browser interface has been built for making queries into the metadata database. The users are offered the following options for content-based searches:
 
  • proper nouns with or without an exact classification (e.g. personal name, company name, event name)
  • classification
 and for basic-data searches:
  • creator
  • issue (one or several issues)
 These criteria may be combined.
 A summary of the articles matching the search criteria is returned to the user, who can retrieve the full articles either in the XML format along with an XSL stylesheet, provided that the user has an IE 5.0 browser, or as HTML. The conversion to HTML is made on the fly by using servlets.
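 Combining the search criteria amounts to building one query from whichever options the user has filled in. The following sketch illustrates this with an in-memory SQLite database; the table and column names, the function signature and the sample rows are invented for the example and do not describe our actual database schema.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT,
                      creator TEXT, issue TEXT);
CREATE TABLE classification (article_id INTEGER, category TEXT);
CREATE TABLE proper_noun (article_id INTEGER, name TEXT, kind TEXT);
"""


def search(conn, noun=None, kind=None, category=None,
           creator=None, issues=None):
    """Return (id, title) rows matching all criteria that are given."""
    sql = ["SELECT DISTINCT a.id, a.title FROM article a"]
    where, params = [], []
    if noun is not None:
        sql.append("JOIN proper_noun p ON p.article_id = a.id")
        where.append("p.name = ?")
        params.append(noun)
        if kind is not None:  # optional exact classification of the noun
            where.append("p.kind = ?")
            params.append(kind)
    if category is not None:
        sql.append("JOIN classification c ON c.article_id = a.id")
        where.append("c.category = ?")
        params.append(category)
    if creator is not None:
        where.append("a.creator = ?")
        params.append(creator)
    if issues:
        where.append("a.issue IN (%s)" % ",".join("?" * len(issues)))
        params.extend(issues)
    if where:
        sql.append("WHERE " + " AND ".join(where))
    sql.append("ORDER BY a.id")
    return conn.execute(" ".join(sql), params).fetchall()


# Invented sample rows.
db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", [
    (1, "Article one", "Author A", "1/2000"),
    (2, "Article two", "Author B", "2/2000"),
])
db.executemany("INSERT INTO classification VALUES (?, ?)", [
    (1, "digital printing"), (2, "digital printing"), (2, "workflow"),
])
db.executemany("INSERT INTO proper_noun VALUES (?, ?, ?)", [
    (1, "Nokia", "company"), (2, "Nokia", "location"),
])
print(search(db, noun="Nokia", kind="company"))
```

 All values are passed as query parameters rather than being pasted into the SQL text, which matters when the criteria come from a browser form.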
     

Discussion

 XML and separate metadata provide an open basis for content applications. In this way it is easy to collect and process articles to create a variety of combinations and publications. It is also easy to distribute the metadata or even the articles to other interested parties.
 We can collect a lot of descriptive metadata from the articles, but this metadata should be processed further to make sure that it is in a usable format. We already accumulate information about user-made classifications of proper names (example: Nokia is a company), but that is not enough. We should also know that Nokia Corporation refers to the same company, and that there is a town called Nokia in Finland, to ensure that we make the right classification of the word. Some of this metadata of metadata is general and branch-independent, while some is specific to a branch or topic.
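 One way to picture this metadata of metadata is a small registry that maps every surface form to the entities it may denote. The sketch below is hypothetical: the class names and fields are assumptions, and a real solution would also need rules or user input for choosing among the candidates.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    canonical: str  # preferred form, e.g. "Nokia"
    kind: str       # e.g. "company" or "location"
    aliases: set = field(default_factory=set)


class EntityRegistry:
    """Look up the entities that a given name may refer to."""

    def __init__(self):
        self._by_name = {}  # lower-cased surface form -> list of entities

    def add(self, entity):
        for name in {entity.canonical, *entity.aliases}:
            self._by_name.setdefault(name.lower(), []).append(entity)

    def candidates(self, name):
        return list(self._by_name.get(name.lower(), []))


registry = EntityRegistry()
registry.add(Entity("Nokia", "company", {"Nokia Corporation"}))
registry.add(Entity("Nokia", "location"))

# "Nokia Corporation" is unambiguous; plain "Nokia" is not.
print([e.kind for e in registry.candidates("Nokia Corporation")])
print(sorted(e.kind for e in registry.candidates("Nokia")))
```

 When a name has more than one candidate, the application would have to ask the user or rely on context to decide.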
 Our application and our metadata DTD treat all the found proper nouns in the same way, so there is no system-supported function to help decide which proper nouns are the most significant in the article. The proper nouns that are mentioned most frequently and/or are mentioned in headings, headnotes or captions are probably the most relevant ones. This kind of information could easily be accumulated automatically and stored in the metadata document.
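 Such a relevance measure could be as simple as a weighted count of mentions. The sketch below is one possible reading of the idea, not an implemented feature: the section names and the weights are arbitrary assumptions.

```python
from collections import Counter

# Assumed weights: a mention in a heading counts more than one in the body.
WEIGHTS = {"heading": 3.0, "headnote": 2.0, "caption": 2.0, "body": 1.0}


def rank_proper_nouns(mentions):
    """Rank names by weighted mention count.

    `mentions` is an iterable of (name, section) pairs, one per occurrence.
    Ties are broken alphabetically to keep the order stable.
    """
    scores = Counter()
    for name, section in mentions:
        scores[name] += WEIGHTS.get(section, 1.0)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_proper_nouns([
    ("Xerox", "body"), ("Xerox", "body"),
    ("Nokia", "heading"), ("Nokia", "body"),
    ("VTT", "caption"),
])
print(ranking)
```

 The resulting score could be stored as an attribute of each proper noun in the metadata document.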
 Our metadata documents are fairly large, since excerpts from the article, such as the headnote and the subtitles, are also included in the metadata. Some information is stored twice. With future tools and databases for XML documents we can expect to minimize such redundancy. There is also some redundancy in the sense that the content-describing proper nouns are both tagged in the document and included in the metadata. However, the tagging in the document can be used not only to convey the meaning of the content of the document but also to control the rendition of the document.
 The classifications of proper names and stories reveal a lot about the content of an article and make precise searches possible. But very detailed classifications are not always practical for the people who make queries into the database, because they would have to know the classification principles as well as those who make the classifications. This should be considered very carefully when classification schemes and user interfaces for queries are designed. In our present search application, the users may also search for specified proper nouns without specifying their category.
 We might also ask whether a full-text search would be enough in our application. The answer is yes and no. In many search tasks, a full-text search would probably give results that are as good as, or in some cases better than, those of our metadata-based searches: a full-text search finds all the words in a document, whereas a metadata-based search finds only those words that are included in the metadata. On the other hand, with our approach we can enhance the XML tagging of the articles while the metadata is collected, and the article content and type classifications can conveniently be added to the metadata to increase its usability. The condensed information in the metadata documents may also be utilized and distributed without direct access to the documents.
 The articles are now classified by people. There is probably a clear correlation between the classified proper nouns and the article classification, at least in our articles, which only discuss topics relating to the printing and publishing industries. It might therefore become possible to automate this part of the metadata generation as well.
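 A simple way to exploit such a correlation would be to count how often each proper noun has appeared in articles of each category and to let the proper nouns of a new article vote. This is a sketch of that idea only; it has not been implemented or evaluated, and the class name, the voting rule and the sample data are assumptions.

```python
from collections import Counter, defaultdict


class CategorySuggester:
    """Suggest article categories from co-occurrence with proper nouns."""

    def __init__(self):
        # proper noun -> Counter of categories it has appeared under
        self._counts = defaultdict(Counter)

    def learn(self, nouns, categories):
        """Record one manually classified article."""
        for noun in set(nouns):
            for category in categories:
                self._counts[noun][category] += 1

    def suggest(self, nouns, top=3):
        """Return up to `top` categories, most strongly supported first."""
        votes = Counter()
        for noun in set(nouns):
            votes.update(self._counts.get(noun, {}))
        ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
        return [category for category, _ in ranked[:top]]


# Invented training data.
suggester = CategorySuggester()
suggester.learn(["Company A", "Product X"], ["digital printing"])
suggester.learn(["Company A"], ["digital printing", "prepress"])
suggester.learn(["Company B"], ["prepress"])
print(suggester.suggest(["Company A"]))
```

 The suggestions would still be offered to the user for confirmation, in the same way as the proper noun classifications.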
 A lot of work is being done by various parties and organisations to define metadata dictionaries, as well as methods and frameworks for the manipulation of metadata across companies. It is important for individual content producers and publishers to follow these developments and to participate in them, if possible. General metadata dictionaries can, however, seldom meet all the requirements of a company, so companies should find ways to combine their own needs with public metadata requirements, and try to automate the generation of metadata where possible.

Bibliography

 [1] The Digital Object Identifier System. http://www.doi.org/
 [2] Platform for Internet Content Selection. http://www.w3.org/PICS/
 [3] PRISM - Publishing Requirements for Industry Standard Metadata. http://www.idealliance.org/prism.htm
 [4] The News Industry Text Format. http://www.nitf.org
 [5] Topic Maps Frequently Asked Questions. http://www.infoloom.com/tmfaq.htm
 [6] indecs 1999. Overview. http://www.indecs.org/overview/overview.htm
 [7] Digital Property Rights Language (TM). http://www.contentguard.com/overview/tech_dprl.htm
 [8] The Dublin Core: A Simple Content Description Model for Electronic Resources. http://purl.org/DC/
 [9] XMLNews-Meta Documentation. http://www.xmlnews.org/docs/xmlnews-meta.html
 [10] XLink. http://www.w3.org/TR/xlink/
