| XML In Defense Procurement | Table of contents | Indexes | Server-Side XML: Taming the Tower of Babel | |||
| Managing and Searching Data with Metadata |
| Dr. Hans Holger Rath
Director Consulting, STEP Stürtz Electronic Publishing GmbH, Rimpar, Germany
Biographical notice: Dr. Hans Holger Rath has been director of STEP's Consulting department since April 1998. He joined STEP in April 1996 as senior consultant/project manager. Before that he was head of the Document Computing department at ZGDV (Computer Graphics Center, Darmstadt, Germany). He studied computer science in Karlsruhe from 1984 to 1990 and received his doctorate from TU Darmstadt in 1996 with the thesis 'Literate Specifying of Hypermedia Documents'. He was involved in DTD development for DIN (Deutsches Institut für Normung e.V. - German Standards Institute) and ISO (International Organization for Standardization) and cooperates very closely with publishing houses and with the aircraft and telecommunication industries. In all he has more than eight years of experience in information architectures and related topics. Since May 1998 he has represented Germany in ISO/JTC1/SC34 - the ISO committee standardizing SGML, HyTime, DSSSL, Topic Navigation Maps, etc. |
| Abstract : Metadata is the technology that makes possible faster, more focussed search and retrieval of information objects in the World-Wide Web and in Document Management Systems. This paper explains why metadata is important, the basic ideas behind metadata, and how metadata can be used in the Web and a DMS. |
| Introduction |
| SGML and XML are doing a good job of structuring data. Each relevant part of the data can be searched, accessed and processed. But what about the information objects - the objects containing the data? They are part of the World-Wide Web or are stored in a database (e.g., in an Editorial System or Document Management System). Both kinds of "repositories" contain a tremendously large number of documents. All these information objects have to be created, maintained, managed, retrieved and delivered as well as published. The larger this number of objects becomes the more difficult it becomes to manage and search them. |
| Metadata is the technology that makes possible faster, more focussed search and retrieval of information objects. It supports not only search and retrieval but also the management of information objects and administrative tasks. Metadata adds value to the information content itself, because it gives easier access to the requested information and brings information objects into new relations. |
| The following chapters explain |
| Metadata is Data |
| Metadata in general helps to identify information objects. Metadata is information about the information objects. It is typically defined in terms of property-value pairs. The property identifies the role of the metadata field. The value is the searchable/manageable term. |
| Some examples: |
| Metadata values may be part of the information object (e.g., title and author). Others are stored separately (e.g., publication date and identifier) in the metadata repository. All values must be accessible for searching and managing. This requires an automatic extraction of internal metadata into the metadata repository - whatever it looks like - and the synchronization of internal and external values. |
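| The extraction and synchronization step described above can be sketched in Python. This is only an illustration under stated assumptions: the class names, the identifier `doc-001`, and the publication date value are hypothetical and do not come from any particular ES/DMS. |

```python
from dataclasses import dataclass, field


@dataclass
class InformationObject:
    """A document that carries some metadata inside its own content."""
    identifier: str
    title: str      # internal metadata: part of the object itself
    author: str     # internal metadata: part of the object itself
    body: str = ""


@dataclass
class MetadataRepository:
    """Maps an object identifier to its property-value pairs."""
    records: dict = field(default_factory=dict)

    def set_external(self, identifier, prop, value):
        # External metadata exists only in the repository.
        self.records.setdefault(identifier, {})[prop] = value

    def synchronize(self, obj):
        # Copy the internal values into the repository so that they are
        # searchable alongside the external ones.
        record = self.records.setdefault(obj.identifier, {})
        record["title"] = obj.title
        record["author"] = obj.author

    def search(self, prop, value):
        # Return the identifiers of all objects whose property has this value.
        return sorted(
            ident for ident, record in self.records.items()
            if record.get(prop) == value
        )


repo = MetadataRepository()
paper = InformationObject(
    "doc-001", "Managing and Searching Data with Metadata", "Hans Holger Rath"
)
repo.set_external(paper.identifier, "publication date", "1998")  # assumed value
repo.synchronize(paper)

# A later edit to the object must be followed by another synchronization.
paper.title = "Managing Data with Metadata"
repo.synchronize(paper)
```

| After the second call to `synchronize`, a search on either an internal property (author) or an external one (publication date) goes through the same repository. |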
| The examples listed above are very simple ones. Practical requirements ask for complex metadata with relations not only between information object and metadata value, but between the metadata values themselves. An example: |
| The "homepage" metadata example shows that information objects can themselves be metadata for other (meta-)data. Therefore it makes sense to say: Metadata is data. |
| With this assumption - metadata is data - the management of metadata can follow the same paradigm as the management of the information objects. When information objects are marked up in SGML/XML metadata can be marked up in SGML/XML, too. |
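| As a minimal sketch of metadata marked up in XML, the following Python fragment serializes property-value pairs and reads them back with the standard library. The element names `metadata` and `field` and the attribute names are assumptions made for illustration; they are not taken from any standardized schema. |

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(identifier, pairs):
    """Serialize property-value pairs as a small XML fragment."""
    root = ET.Element("metadata", {"object": identifier})
    for prop, value in pairs:
        element = ET.SubElement(root, "field", {"property": prop})
        element.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = metadata_to_xml(
    "doc-001",  # hypothetical identifier
    [
        ("title", "Managing and Searching Data with Metadata"),
        ("author", "Hans Holger Rath"),
    ],
)

# The same tools that read the documents can read the metadata back.
parsed = ET.fromstring(xml_text)
values = {f.get("property"): f.text for f in parsed.findall("field")}
```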
| The markup of metadata is only one point; storage, efficient retrieval, the existence of an appropriate query language, a versioning strategy, and report generation are technical but very important points. Whether SGML/XML-coded metadata can fulfill these requirements is not clear as of today. Various metadata schemata are under development, and general support in standard tools does not exist. |
| Customized applications have to be built until a schema is identified and standardized. At the very least, tools have to support this schema. Before too much effort (and money) goes into these applications, the users - you - should pressure the committees and vendors - us - to solve the metadata problem in a general way. |
| Vocabularies |
| A single stand-alone metadata field makes no sense. In practice a set of metadata fields is always needed. This set of metadata fields is called a vocabulary. |
| A predefined vocabulary is necessary for community-wide use and understanding of the metadata values. Unfortunately we are only at the beginning of global metadata usage, and only a few vocabularies are defined. |
| Most of the vocabularies are defined in the context of bibliographic information about publications - like Dublin Core or IMS. This might result from the long history of cataloguing and indexing in libraries for over 2000 years. |
| Another application of metadata is a rating system for information in the WWW. W3C published the Platform for Internet Content Selection (PICS) as a recommendation. |
| AltaVista, Lycos, Netscape Internet Guide, and Yahoo offer subject indexes to classify information. These subject indexes are nothing other than similar but not identical vocabularies. They are comparable to library classification schemes like UDC and DDC. |
| It will be an important task for the coming years to define globally used vocabularies, or at least mappings between vocabularies, so that isomorphic search over the WWW becomes possible. |
| Metadata Schemata |
| We defined a set of metadata fields as a vocabulary. A "document" that defines a vocabulary is a schema (whereby the word document should be understood as "portion of data in any format"). The main question about metadata schemata is exactly which format to use. |
| The various standardization committees have produced several approaches to describing metadata. STEP - the standard - uses its EXPRESS language to define the "what" aspects of the data. CGM version 4 supports association of additional information with parts of the graphic. Bibliographic standards have covered issues of metadata for over 30 years. SGML and HyTime bring property sets and groves into the arena. Topic Navigation Maps, as an application of HyTime, describe relations between data and topic-driven views on data. XML comes with XML-Data, XLink, XPointer, and RDF. |
| RDF offers some interesting features: it works in low-memory environments, provides a protocol for metadata interchange, is tuple-based and therefore mappable to relational databases, supports association with multiple information objects, and allows trusted third-party descriptions of data with signatures. |
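| The remark that a tuple-based model maps onto relational databases can be illustrated with a short Python sketch using the standard `sqlite3` module. The table name, column names, identifiers, and property names are assumptions chosen for illustration; this is not RDF syntax, only the (resource, property, value) shape stored in one table. |

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (resource TEXT, property TEXT, value TEXT)"
)

# Each row is one statement about one information object.
triples = [
    ("doc-001", "creator", "Hans Holger Rath"),
    ("doc-001", "language", "en"),
    ("doc-002", "creator", "Hans Holger Rath"),  # same value, second object
]
conn.executemany("INSERT INTO statements VALUES (?, ?, ?)", triples)

# Find every object associated with a given property-value pair.
rows = conn.execute(
    "SELECT resource FROM statements "
    "WHERE property = ? AND value = ? ORDER BY resource",
    ("creator", "Hans Holger Rath"),
).fetchall()
resources = [row[0] for row in rows]
```

| Because every statement has the same three columns, new properties can be added without changing the table layout, which is one reason the tuple shape scales over the number of metadata fields. |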
| A good mixture of these standards and approaches can be the combination of property sets, groves and RDF. RDF because it is simple and implementable but powerful. Property sets because they give the syntax to describe the metadata on a meta level. Groves because they provide the relation between markup, data, and metadata. When a recommended or standardized schema for metadata is available, the development of vocabularies for special application fields can start. |
| Metadata for the World-Wide Web |
| Metadata applications for the WWW which are available or under development concentrate on bibliographic information, content rating, and classification schemes. |
| In addition, both XLink and HyTime support metadata on links. |
| Bibliographic Information |
| Bibliographic information about publications is the classic application of metadata. The bibliographic vocabularies were developed over the long history of libraries. With the arrival of computers and the Internet, both the number of documents and the search speed have increased and will continue to increase dramatically. |
| But just as every traditional library may have its own cataloguing and indexing system, every digital library may have its own slightly different metadata vocabulary. Search has become faster, but getting the requested results from more than one library requires a common vocabulary. |
| The Dublin Core Metadata for Simple Resource Discovery offers a number of metadata fields for cross-disciplinary resource discovery. The properties are: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights. These properties can be used in HTML <META> markup. An HTML version of this document might be marked up as follows: |
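| One possible shape of such markup can be sketched in Python. The sketch assumes the common convention of prefixing each property name with `DC.` in the NAME attribute; the function name is hypothetical, and the values for language and format are assumptions about this document rather than statements from it. |

```python
from html import escape


def dublin_core_meta(values):
    """Render property-value pairs as HTML META elements."""
    lines = []
    for prop, value in values.items():
        name = escape("DC." + prop, quote=True)
        content = escape(value, quote=True)
        lines.append(f'<META NAME="{name}" CONTENT="{content}">')
    return "\n".join(lines)


markup = dublin_core_meta({
    "title": "Managing and Searching Data with Metadata",
    "creator": "Hans Holger Rath",
    "language": "en",         # assumed value
    "format": "text/html",    # assumed value
})
print(markup)
```

| Escaping the values matters because a title or creator name may contain quotation marks or ampersands that would otherwise break the attribute. |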
The
|
| The IMS Metadata Directory, as an alternative to the Dublin Core, offers a larger number of properties based on rules from ISO 11179: abstract, author, catalog id, concepts, container type, credits, expiration date, form, format, GUID, interactivity level, keywords, language, learning level, location, metadata version, objectives, pedagogy, platform, prerequisites, presentation, price code, relation, role, sizeof, source, steward, structure, subject, title, use rights, user support, use time, version date, version. |
| Many of these properties are similar to or the same as those in Dublin Core. Some of them have different names but the same meaning. These differences arise from the different purposes the vocabularies are meant to serve. |
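| A mapping between two vocabularies can be sketched as a simple lookup table. The correspondences below (for example IMS "author" to Dublin Core "creator") are illustrative assumptions based only on the property names listed above; they are not an official crosswalk, and the sample record is made up. |

```python
# Hypothetical mapping from IMS property names to Dublin Core names.
IMS_TO_DC = {
    "title": "title",
    "author": "creator",
    "keywords": "subject",
    "abstract": "description",
    "use rights": "rights",
    "language": "language",
}


def to_dublin_core(ims_record):
    """Translate an IMS-style record; keep unmapped properties aside."""
    mapped, unmapped = {}, {}
    for prop, value in ims_record.items():
        if prop in IMS_TO_DC:
            mapped[IMS_TO_DC[prop]] = value
        else:
            unmapped[prop] = value
    return mapped, unmapped


mapped, unmapped = to_dublin_core({
    "author": "Hans Holger Rath",
    "keywords": "metadata",
    "learning level": "introductory",  # has no Dublin Core counterpart here
})
```

| Properties without a counterpart are returned separately rather than dropped, since a search across both vocabularies can only use the mapped part. |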
| Content Rating System |
| Document rating will become an important issue for the Web, e.g. keeping children away from pages with content for adults only. |
| The Platform for Internet Content Selection (PICS) is a W3C recommendation containing several rating services. PICS works in such a way that individuals, groups, organizations or companies (rating bureaus) provide content labels for information in the Web. The labels must be based on the PICS vocabulary. Selection software relying on PICS evaluates the ratings of a rating bureau and decides whether a Web page is accepted or rejected. This could happen every time a new Web page is loaded into the browser. |
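| The accept/reject decision made by selection software can be sketched as follows. This is not PICS label syntax: the category names, the numeric scales, the example URLs, and the rule that unrated pages are rejected are all assumptions made for illustration. |

```python
def is_accepted(label, limits):
    """Accept a page only if every limited category stays within its limit.

    A page without a rating for a limited category is rejected, which is
    the cautious choice for a filter meant to protect children.
    """
    for category, maximum in limits.items():
        rating = label.get(category)
        if rating is None or rating > maximum:
            return False
    return True


# Hypothetical categories and scales; a real rating service defines its own.
limits = {"violence": 1, "language": 2}

# Labels as a rating bureau might supply them, keyed by page address.
labels = {
    "http://example.org/news": {"violence": 1, "language": 0},
    "http://example.org/game": {"violence": 3, "language": 1},
    "http://example.org/unrated": {},
}

decisions = {url: is_accepted(label, limits) for url, label in labels.items()}
```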
| Classification Schemes |
| Again, the idea behind traditional library classifications like DDC and UDC is applied to the Web. Publications - paper or online documents - are inserted into a predefined classification scheme. A document may be classified differently depending on the scheme's structure. |
| All subject-based Web search engines such as AltaVista, Lycos, Netscape, and Yahoo offer their own subject index (= classification scheme = vocabulary). Even though a great part of the subjects are similar, the variety of classification schemes means that subject search cannot work seamlessly across a wide range of Web pages the way full-text search does. |
| The already mentioned Topic Navigation Map standard will make it possible to combine different classification schemes into one topic. |
| Classified Links |
Hyperlinks as they are known from HTML may inform the user about the link target: the browser displays either the target URL or the text of the
|
| XLink and HyTime provide a more robust mechanism. A role attribute carries information about the link target. This classification of links lets users decide whether they want to follow the link or not. |
| Again, the classification has to be based on a vocabulary to be widely used in the Web. |
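| How a role attribute lets software filter links can be sketched in Python. The sketch makes several assumptions: it uses the namespace of the later XLink 1.0 Recommendation, plain words as role values (the Recommendation itself expects URI references), and a made-up role vocabulary and element names. |

```python
import xml.etree.ElementTree as ET

# Namespace of the later XLink 1.0 Recommendation (an assumption here).
XLINK = "http://www.w3.org/1999/xlink"

document = f"""
<para xmlns:xlink="{XLINK}">
  <ref xlink:href="glossary.xml" xlink:role="definition">metadata</ref>
  <ref xlink:href="dc.xml" xlink:role="specification">Dublin Core</ref>
  <ref xlink:href="ad.xml" xlink:role="advertisement">offer</ref>
</para>
""".strip()

# Roles the reader wants to follow; this small vocabulary is made up.
wanted_roles = {"definition", "specification"}

root = ET.fromstring(document)
followed = [
    ref.get(f"{{{XLINK}}}href")
    for ref in root.iter("ref")
    if ref.get(f"{{{XLINK}}}role") in wanted_roles
]
```

| The filter only works if author and reader agree on the role values, which is why the classification has to rest on a shared vocabulary. |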
| As already indicated, the usage of metadata in Editorial Systems and Document Management Systems is partly different from its usage in the Web. The major focus in the Web is searching. The major focus in an ES/DMS is management of information objects and administration of production processes. |
| But both applications of metadata follow the same approach (attaching information to information) and have the same essential requirements: definition is independent from data and application, metadata is interchangeable, concepts are scalable over number of information objects and number of metadata fields, concepts are implementable with existing database technologies. |
| Here are some examples of built-in metadata properties of an ES/DMS: |
| The user will need additional application specific properties of information objects like: |
| Further, a metadata set about the metadata might be the address of a person (= author, editor). It consists of: |
| Metadata in an ES/DMS has to provide the following functions: |
| It is obvious that the realization of all these requirements is always tool dependent. But the underlying concept for the metadata schema and its query language should rely on a standard. RDF and HyTime's property sets and groves may play an important role. Vendors of standard database software and of ES/DMS should commit to this standard once it has passed the committees. |
| Conclusion |
| Efficient searching and managing of information objects in a repository becomes possible with metadata. The repository can be either a database, an intranet, or the WWW. Metadata is information about information objects of any kind. Metadata is data. |
| Metadata fields are collected in application-specific vocabularies. A vocabulary is defined in a metadata schema. The schema language depends on the underlying format. RDF in combination with HyTime's property sets and groves has the power to serve as a general metadata schema. |
| Metadata can assign bibliographic information, ratings, and classification schemes to information objects. These are the main applications for the Web. |
| But metadata also supports the production process of publications in an Editorial System or Document Management System. Here, metadata has to fulfill a number of other requirements derived from the tasks of an ES/DMS. These tasks are more sophisticated than searching; they have to deal with the administrative part of an ES/DMS. |
| Metadata will become more and more important in the Web and in ES/DMS. It should be based on a standard, and tool vendors are urged to implement this standard as soon as it is out. User communities then have to define vocabularies to offer the end user streamlined and unified access to the global information network. |
| Acknowledgments |
| The author would like to thank Steve Pepper, Helge Schütt, Andreas Volpert and everybody at STEP's Reference Works Module Club for the intensive discussions about metadata. |
| Further acknowledgment is paid to the members of the ISO Metadata Workshop held in Paris on 1998/05/22. |