

Managing and Searching Data with Metadata

Dr. Hans Holger  Rath
Director Consulting,  STEP Stürtz Electronic Publishing GmbH 
 Technologiepark Würzburg-Rimpar
Pavillon 7
Rimpar  (Germany) D-97222 
Email: consulting@step.de

Biographical notice

Dr. Hans Holger Rath has been director of STEP's Consulting department since April 1998. He started at STEP in April 1996 as senior consultant/project manager. Before he joined STEP he was head of the Document Computing department at ZGDV (Computer Graphics Center, Darmstadt, Germany). Dr. Hans Holger Rath studied computer science in Karlsruhe from 1984 to 1990 and received his doctorate from the TU Darmstadt in 1996 with the thesis 'Literate Specifying of Hypermedia Documents'.

He was involved in DTD development for DIN (Deutsches Institut für Normung e.V. - German Standards Institute) and ISO (International Organization for Standardization) and cooperates very closely with publishing houses and with the aircraft and telecommunication industries. All in all he has more than eight years of experience in information architectures and related topics. Since May 1998 he has represented Germany in ISO/JTC1/SC34 - the ISO committee standardizing SGML, HyTime, DSSSL, Topic Navigation Maps, etc.

 Abstract  : Metadata is the technology that makes possible faster, more focussed search and retrieval of information objects in the World-Wide Web and in Document Management Systems. This paper explains why metadata is important, the basic ideas behind metadata, and how metadata can be used in the Web and a DMS.

Introduction

  SGML and XML are doing a good job of structuring data. Each relevant part of the data can be searched, accessed and processed. But what about the information objects - the objects containing the data? They are part of the World-Wide Web or are stored in a database (e.g., in an Editorial System or Document Management System). Both kinds of "repositories" contain a tremendously large number of documents. All these information objects have to be created, maintained, managed, retrieved and delivered as well as published. The larger this number of objects becomes the more difficult it becomes to manage and search them.
 Metadata is the technology that makes possible faster, more focussed search and retrieval of information objects. It supports not only searching and retrieval but also the management of information objects and administrative tasks. Metadata adds value to the information content itself, because it gives easier access to the requested information and brings information objects into new relations.
 The following chapters explain

  • why Metadata is Data and should be treated like data,
  • how Vocabularies make metadata applicable,
  • what general Metadata Schemata should look like,
  • the benefits of Metadata for the World-Wide Web, and
  • why Metadata for Document Management is so important.

    Metadata is Data

      Metadata in general helps to identify information objects. Metadata is information about the information objects. Metadata is typically defined in terms of property-value pairs. The property identifies the role of the metadata field. The value is the searchable/manageable term.
     Some examples:

  • title: Managing and Searching Data with Metadata
  • author: Dr. Hans Holger Rath
  • publication date: 1998/09/30
  • identifier: rp80930a

     Metadata values may be part of the information object (e.g., title and author). Others are stored separately (e.g., publication date and identifier) in the metadata repository. All values must be accessible for searching and managing. This requires an automatic extraction of internal metadata into the metadata repository - whatever it looks like - and the synchronization of internal and external values.
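
     The distinction between internal and external values, and the extraction step that brings both into one repository, can be sketched as follows. This is only an illustration: the class names, the in-memory dictionary used as repository, and the rule that the internal value wins on a conflict are assumptions of this sketch, not part of any ES/DMS.

```python
from dataclasses import dataclass, field


@dataclass
class InformationObject:
    """A document whose content carries some metadata internally."""
    identifier: str
    internal: dict = field(default_factory=dict)  # e.g. title, author


class MetadataRepository:
    """Hypothetical in-memory repository keyed by object identifier."""

    def __init__(self):
        self._records = {}

    def set_external(self, identifier, prop, value):
        # External values exist only in the repository.
        self._records.setdefault(identifier, {})[prop] = value

    def synchronize(self, obj):
        # Extract the internal values so that they become searchable.
        # On a conflict the value inside the object wins (an assumption).
        self._records.setdefault(obj.identifier, {}).update(obj.internal)

    def find(self, prop, value):
        # Return the identifiers of all objects with this property-value pair.
        return sorted(identifier
                      for identifier, record in self._records.items()
                      if record.get(prop) == value)


paper = InformationObject(
    "rp80930a",
    {"title": "Managing and Searching Data with Metadata",
     "author": "Dr. Hans Holger Rath"},
)
repo = MetadataRepository()
repo.set_external(paper.identifier, "publication date", "1998/09/30")
repo.synchronize(paper)
print(repo.find("author", "Dr. Hans Holger Rath"))
```

     After synchronization the object can be found through an internal value (author) as well as through an external one (publication date).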
     The examples listed above are very simple ones. Practical requirements ask for complex metadata with relations not only between information object and metadata value, but between the metadata values themselves. An example:

  • author (metadata of information object),
  • address (metadata of author),
  • email (metadata of author),
  • homepage (metadata of author; homepage might be an information object).

     The "homepage" metadata example shows that information objects could be metadata for other (meta-) data, too. Therefore it makes sense to say: metadata is data.
     With this assumption - metadata is data - the management of metadata can follow the same paradigm as the management of the information objects. When information objects are marked up in SGML/XML metadata can be marked up in SGML/XML, too.
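
     As a small illustration of metadata marked up like data, the author example can be serialized as XML. The element name "property", its attributes, and the homepage URL are hypothetical and chosen only for this sketch; no standardized metadata schema is implied.

```python
import xml.etree.ElementTree as ET


def to_xml(prop, value):
    """Build an element for one property.

    A value is either a string or a (string, sub_properties) pair,
    where sub_properties is metadata about the value itself.
    """
    element = ET.Element("property", name=prop)
    if isinstance(value, tuple):
        text, sub_properties = value
        element.set("value", text)
        for sub_prop, sub_value in sub_properties.items():
            element.append(to_xml(sub_prop, sub_value))
    else:
        element.set("value", value)
    return element


author = to_xml("author", ("Dr. Hans Holger Rath", {
    "address": "Technologiepark Würzburg-Rimpar, Pavillon 7, D-97222 Rimpar",
    "email": "consulting@step.de",
    "homepage": "http://example.org/homepage",  # placeholder URL
}))
xml_text = ET.tostring(author, encoding="unicode")
print(xml_text)
```

     Because the result is ordinary XML, the same parser that reads the information objects can read their metadata.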
     The markup of metadata is only one point; storage, efficient retrieval, existence of an appropriate query language, versioning strategy, and report generation are technical but very important points. Whether SGML/XML coded metadata can fulfill these requirements is not clear as of today. Various metadata schemata are under development and general support in standard tools does not exist.
     Customized applications have to be built until a schema is identified and standardized. At the very least, tools have to support this schema. Before too much effort (and money) goes into these applications the users - you - should pressure the committees and vendors - us - to solve the metadata problem in a general way.

    Vocabularies

     A single stand-alone metadata field makes no sense. In practice a set of metadata fields is always needed. This set of metadata fields is called a vocabulary.
     A predefined vocabulary is necessary for community-wide use and understanding of the metadata values. Unfortunately we are just at the beginning of global metadata usage and only a few vocabularies are defined.
     Most of the vocabularies are defined in the context of bibliographic information about publications - like Dublin Core or IMS. This may be a result of the more than 2000-year history of cataloguing and indexing in libraries.
     Another application of metadata is a rating system for information in the WWW. W3C published the Platform for Internet Content Selection (PICS) as a recommendation.
     AltaVista, Lycos, Netscape Internet Guide, as well as Yahoo offer subject indexes to classify information. These subject indexes are nothing other than similar but not identical vocabularies. They are comparable with library classification schemes like UDC and DDC.
     It will be an important task for the coming years to define globally used vocabularies, or at least mappings between vocabularies, to enable isomorphic search across the WWW.

    Metadata Schemata

     We defined a set of metadata fields as a vocabulary. A "document" that defines a vocabulary is a schema (whereby the word "document" should be understood as "portion of data in any format"). The main question about metadata schemata is exactly which format to use.
     There are several approaches in the various standardization committees to describe metadata. STEP - the standard - uses its EXPRESS language to define the "what" aspects of the data. CGM version 4 supports association of additional information with parts of the graphic. Bibliographic standards have covered issues of metadata for over 30 years. SGML and HyTime bring property sets and groves into the arena. Topic Navigation Maps as an application of HyTime describe relations between data and topic-driven views on data. XML comes with XML-Data, XLink, XPointer, and RDF.
     RDF offers some interesting features: it works in low-memory environments, defines a protocol for metadata interchange, is tuple-based and therefore mappable to relational databases, allows metadata to be associated with multiple information objects, and supports signed descriptions of data by trusted third parties.
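
     The remark that tuple-based metadata maps onto relational databases can be illustrated with a single three-column table. This is a sketch only: the table and column names are assumptions, the second identifier "rp80930b" is invented for the example, and nothing here reproduces the RDF syntax itself.

```python
import sqlite3

# One row per statement: (resource, property, value).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statement (resource TEXT, property TEXT, value TEXT)"
)
statements = [
    ("rp80930a", "title", "Managing and Searching Data with Metadata"),
    ("rp80930a", "creator", "Dr. Hans Holger Rath"),
    ("rp80930b", "creator", "Dr. Hans Holger Rath"),  # hypothetical object
]
conn.executemany("INSERT INTO statement VALUES (?, ?, ?)", statements)

# The same value is associated with several information objects,
# and a plain SQL query retrieves all of them.
rows = conn.execute(
    "SELECT resource FROM statement "
    "WHERE property = ? AND value = ? ORDER BY resource",
    ("creator", "Dr. Hans Holger Rath"),
).fetchall()
resources = [row[0] for row in rows]
print(resources)
```

     A generic table like this accepts any vocabulary without a change to the database schema, at the price of less type checking than one column per property would give.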
     A good mixture of these standards and approaches can be the combination of property sets, groves and RDF. RDF because it is simple and implementable but powerful. Property sets because they give the syntax to describe the metadata on a meta level. Groves because they provide the relation between markup, data, and metadata. When a recommended or standardized schema for metadata is available, the development of vocabularies for special application fields can start.

    Metadata for the World-Wide Web

     Metadata applications for the WWW which are available or under development concentrate on:

  • bibliographic information about Web pages,
  • content rating of Web pages,
  • subject indexes.

     In addition, both XLink and HyTime support metadata on links.

    Bibliographic Information

     Bibliographic information about publications is the classic application of metadata. The bibliographic vocabularies were developed over the long history of libraries. With the arrival of computers and the Internet both the number of documents and the search speed have increased and will continue to increase dramatically.
     But just as every traditional library may have its own cataloguing and indexing system, every digital library may have its own slightly different metadata vocabulary. Search has become faster, but getting the requested results from more than one library requires a common vocabulary.
      The Dublin Core Metadata for Simple Resource Discovery offers a number of metadata fields for cross-disciplinary resource discovery. The properties are: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights. These properties can be used in HTML <META> markup. An HTML version of this document might be marked up as follows:

    <META NAME="title" CONTENT="Managing and Searching Data with Metadata">
    <META NAME="publisher" CONTENT="GCA">
    <META NAME="creator" CONTENT="Dr. Hans Holger Rath">
    <META NAME="creator" TYPE="affiliation" CONTENT="STEP GmbH">
    <META NAME="creator" TYPE="email" CONTENT="consulting@step.de">
    <META NAME="description" CONTENT="Overview about metadata use in WWW and DMS">
    <META NAME="subject" CONTENT="metadata, SGML, XML, searching, WWW, DMS, Dublin Core, IMS, PICS, classified links, query language, report generation">
    <META NAME="language" CONTENT="EN">

     The TYPE attribute opens subcategories, e.g., affiliation and email.
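
     When the property values already live in a metadata repository, markup of this kind can be generated rather than typed by hand. The following sketch assumes a hypothetical helper named meta_tag and writes NAME, TYPE and CONTENT exactly as in the markup shown above; it is not part of any Dublin Core tool.

```python
from html import escape


def meta_tag(name, content, type_=None):
    """Return one <META> element; type_ selects an optional subcategory."""
    type_attr = f' TYPE="{escape(type_)}"' if type_ else ""
    return f'<META NAME="{escape(name)}"{type_attr} CONTENT="{escape(content)}">'


record = [
    ("title", None, "Managing and Searching Data with Metadata"),
    ("publisher", None, "GCA"),
    ("creator", None, "Dr. Hans Holger Rath"),
    ("creator", "email", "consulting@step.de"),
]
tags = [meta_tag(name, content, type_) for name, type_, content in record]
print("\n".join(tags))
```

     Escaping the values keeps quotation marks or ampersands in a title from breaking the generated markup.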
      The IMS Metadata Directory, as an alternative to the Dublin Core, offers a larger number of properties based on rules from ISO 11179: abstract, author, catalog id, concepts, container type, credits, expiration date, form, format, GUID, interactivity level, keywords, language, learning level, location, metadata version, objectives, pedagogy, platform, prerequisites, presentation, price code, relation, role, sizeof, source, steward, structure, subject, title, use rights, user support, use time, version date, version.
     Many of these properties are similar to or the same as those in Dublin Core. Some of them have different names but the same meaning. These differences result from the different application areas the vocabularies are meant to cover.
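
     A mapping between two vocabularies can be as simple as a table of property names. The pairings below are illustrative guesses based only on the names in the two lists above; they are assumptions of this sketch and not an official correspondence between IMS and Dublin Core.

```python
# Illustrative, hand-written correspondences (assumed, not official).
IMS_TO_DUBLIN_CORE = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "keywords": "subject",
    "use rights": "rights",
    "language": "language",
}


def to_dublin_core(ims_record):
    """Rename known IMS properties; keep properties without a mapping apart."""
    mapped, unmapped = {}, {}
    for prop, value in ims_record.items():
        target = IMS_TO_DUBLIN_CORE.get(prop)
        if target is None:
            unmapped[prop] = value
        else:
            mapped[target] = value
    return mapped, unmapped


mapped, unmapped = to_dublin_core({
    "title": "Managing and Searching Data with Metadata",
    "author": "Dr. Hans Holger Rath",
    "learning level": "introductory",  # hypothetical value
})
print(mapped)
print(unmapped)
```

     Properties without a counterpart, such as learning level here, show where a mapping loses information and a common vocabulary would be needed.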

    Content Rating System

     Document rating will become an important issue for the Web, e.g., for keeping children away from pages with content for adults only.
      The Platform for Internet Content Selection (PICS) is a W3C recommendation containing several rating services. With PICS, individuals, groups, organizations or companies (rating bureaus) provide content labels for information in the Web. The labels must be based on the PICS vocabulary. Selection software relying on PICS evaluates the ratings of a rating bureau and decides whether the Web page is accepted or rejected. This can happen every time a new Web page is loaded into the browser.
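
     The decision made by the selection software can be sketched as a comparison of a page's labels with the limits set by the user. The category names, the numeric scale and the rule that unrated pages are rejected are assumptions of this sketch; the actual PICS label syntax is not reproduced here.

```python
def accept(labels, limits):
    """Accept a page only if every limited category is rated and
    the rating does not exceed the user's limit."""
    for category, limit in limits.items():
        rating = labels.get(category)
        if rating is None or rating > limit:
            return False
    return True


# Hypothetical limits configured for a child's browser.
limits = {"violence": 1, "language": 2}

print(accept({"violence": 0, "language": 1}, limits))
print(accept({"violence": 3, "language": 0}, limits))
print(accept({"language": 0}, limits))
```

     The first page is accepted, the second exceeds the violence limit, and the third is rejected because it carries no violence rating at all.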

    Classification Schemes

      Again, the idea behind traditional library classifications like DDC and UDC is applied to the Web. Publications - paper or online documents - are inserted into a predefined classification scheme. A document may be classified differently depending on the scheme's structure.
      All subject-based Web search engines such as AltaVista, Lycos, Netscape, and Yahoo offer their own subject index (= classification scheme = vocabulary). Even though a great part of the subjects are similar, the differences between the schemes mean that subject search, unlike full-text search, cannot work seamlessly across a wide range of Web pages.
      The already mentioned Topic Navigation Map standard will make it possible to combine different classification schemes into one topic.

    Classified Links

     Hyperlinks as they are known from HTML may inform the user about the link target: the browser displays either the target URL or the text of the onmouseover attribute.
      XLink and HyTime provide a more robust mechanism. A role attribute carries information about the link target. This classification of links lets users decide whether they want to follow the link or not.
     Again, the classification has to be based on a vocabulary to be widely used in the Web.
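
     How a role lets software or users filter links can be sketched with a few lines of XML processing. The element name "link", the attribute names and the role values form a hypothetical vocabulary chosen for this example; they are not taken from the XLink or HyTime specifications.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """
<doc>
  <link role="definition" href="glossary.xml">metadata</link>
  <link role="advertisement" href="offer.xml">special offer</link>
  <link role="citation" href="rdf.xml">RDF</link>
</doc>
"""


def links_to_follow(xml_text, wanted_roles):
    """Return the targets of all links whose role the user wants to see."""
    root = ET.fromstring(xml_text)
    return [link.get("href")
            for link in root.iter("link")
            if link.get("role") in wanted_roles]


targets = links_to_follow(DOCUMENT, {"definition", "citation"})
print(targets)
```

     The filter only works if author and reader agree on what "definition" or "citation" means, which is exactly why the roles have to come from a shared vocabulary.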

    Metadata for Document Management

      As already indicated, the usage of metadata in Editorial Systems and Document Management Systems is partly different from its usage in the Web. The major focus in the Web is searching. The major focus in an ES/DMS is management of information objects and administration of production processes.
     But both applications of metadata follow the same approach (attaching information to information) and have the same essential requirements: definition is independent from data and application, metadata is interchangeable, concepts are scalable over number of information objects and number of metadata fields, concepts are implementable with existing database technologies.
     Here are some examples of built-in metadata properties of an ES/DMS:

  • object title
  • last modified by
  • checked out by
  • date of last version
  • version no.
  • workflow state

     The user will need additional application specific properties of information objects like:

  • author
  • editor
  • manuscript received from author
  • manuscript sent to editor
  • no. of figures

     Further, a metadata set about the metadata might be the address of a person (= author, editor). It consists of:

  • Name
  • Street
  • PO box
  • City
  • State
  • Country
  • Zip code
  • Email

     Metadata in an ES/DMS has to provide the following functions:

  • Support the production process: The tasks of an ES/DMS are the production of publications and the support for the authors, editors, and supervisors doing their editorial work. This includes control and audit mechanisms for the current production status as well as efficient access to, and selection of, the information objects.
  • Definition of various metadata properties: Each information object and each container for information objects may have a large number of properties depending on the type of the information object. A data type is assigned to each property. The properties themselves may have metadata as well.
  • Efficient storage of metadata: The metadata values have to be stored in a way that all needed operations on them can be performed efficiently. In most cases the schema and the values are stored in a database. This requires a mapping from the metadata schema into the underlying database schema.
  • Metadata version control: The metadata schema may change over time. A general method is needed which explains how changes in the metadata schema are reflected in the database schema when some data already exists in the ES/DMS. NB: This problem is very similar to the DTD versioning problem.
  • Access rights: The read and write access to the metadata fields has to be under the control of the ES/DMS. The access rights for a user or user group may change during the production phase. Therefore metadata access rights may depend on the current workflow state of the information object.
  • Editing of metadata values: Editing/changing the values should not be restricted to a single information object; bulk updates should be possible, too. The editing user interface should be customizable to cover the needs of each application.
  • Propagation/inheritance of metadata values: The administration of a large number of information objects (e.g., 20,000 entries in an encyclopedia) requires metadata inheritance over the "directory" structure of the information objects. Metadata values have to be propagated from higher hierarchy levels to lower hierarchy levels.
  • Query language: Each database schema has its own query language. As the metadata schema is mapped to a database schema the database query language becomes applicable for retrieval of metadata values on the technical level. But on the user level the query language of the database might not be appropriate for the metadata search. Therefore an additional query language for the metadata is needed.
  • Search across metadata values: Searching for information objects using metadata values is an important task of an ES/DMS. The search should not be restricted to pattern matching with regular expressions as known from full-text search. Logical combinations of values with comparison operators (e.g., published before 1997/Oct/14 and price is less than 25$ and author is Miller) are also needed.
  • Report generation: Metadata inform the editors and supervisors about the status of their production. The status should not only be visible on the computer screen; printed reports or file exports of the current metadata settings are also necessary for filing and statistics. Both have to be configurable in a flexible way.

     It is obvious that the realization of all these requirements is always tool dependent. But the underlying concept for the metadata schema and its query language should rely on a standard. RDF, HyTime's property sets and groves may play an important role. Vendors of standard database software and ES/DMS should commit to this standard once it has passed the committees.
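
     Two of the functions listed above, inheritance of values over the "directory" structure and search with logically combined comparisons, can be sketched together. The class Node, the property names and all values are hypothetical; a real ES/DMS would map this onto its database instead of keeping objects in memory.

```python
from datetime import date


class Node:
    """An information object or a container in the 'directory' tree."""

    def __init__(self, name, parent=None, **metadata):
        self.name = name
        self.parent = parent
        self.metadata = metadata

    def get(self, prop):
        # Look locally first, then walk up the hierarchy (inheritance).
        node = self
        while node is not None:
            if prop in node.metadata:
                return node.metadata[prop]
            node = node.parent
        return None


encyclopedia = Node("encyclopedia", published=date(1997, 9, 1), price=20)
volume = Node("volume-1", encyclopedia, author="Miller")
entry_a = Node("entry-a", volume)                  # inherits everything
entry_b = Node("entry-b", volume, author="Smith")  # overrides the author


def search(nodes, *conditions):
    """Return the names of all nodes that satisfy every condition."""
    return [node.name for node in nodes
            if all(condition(node) for condition in conditions)]


# published before 1997/Oct/14 and price less than 25 and author is Miller
hits = search(
    [entry_a, entry_b],
    lambda n: n.get("published") < date(1997, 10, 14),
    lambda n: n.get("price") < 25,
    lambda n: n.get("author") == "Miller",
)
print(hits)
```

     Resolving inherited values at query time, as done here, avoids copying a value into 20,000 entries; propagating copies downwards would make reads faster but updates more expensive.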
     

    Conclusion

     Efficient searching and managing of information objects in a repository becomes possible with metadata. The repository can be a database, an intranet, or the WWW. Metadata is information about information objects of any kind. Metadata is data.
     Metadata fields are collected in application specific vocabularies. A vocabulary is defined in a metadata schema. The schema language depends on the underlying format. RDF in combination with HyTime's property sets and groves has the power to serve as a general metadata schema.
     Metadata can assign bibliographic information, ratings, and classification schemes to information objects. These are the main applications for the Web.
     But metadata also support the production process of publications in an Editorial System or Document Management System. Here, metadata have to fulfill a number of other requirements derived from the tasks of an ES/DMS. These tasks are more sophisticated than searching; they have to deal with the administrative part of an ES/DMS.
     Metadata will become more and more important in the Web and in ES/DMS. It should be based on a standard, and tool vendors are urged to implement this standard as soon as it is out. User communities then have to define vocabularies to offer the end user streamlined and unified access to the global information network.
     

    Acknowledgments

     The author would like to thank Steve Pepper, Helge Schütt, Andreas Volpert and everybody at STEP's Reference Works Module Club for the intensive discussions about metadata.
     Thanks are also due to the members of the ISO Metadata Workshop held in Paris on 1998/05/22.
