99 Tricks To Amaze Your Friends and Impress Your Boss with the DOM and XML in Web Browsers   Table of contents   Indexes   The student and the mechanic - how XML enables architectures to solve real-life document delivery requirements

 

Metadata deployment for publishing environments

Sonia López-Fuentetaja
Software Engineer
Grupo ANAYA
Juan Ignacio Luca de Tena, 15
28027 Madrid
Spain
Phone: +34 91 393 88 39
Fax: +34 91 742 66 31
Email: slopez@anaya.es
Web: www.anaya.es
 
Biographical notice:
 
Sonia López Fuentetaja was born in Segovia (Spain) in March 1970. She holds a Software Engineer degree from the Polytechnic University of Madrid. Since 1996 she has worked for Grupo Anaya, a leading publishing group based in Madrid. She has been involved in the design, implementation and administration of an SGML-based editorial system deployed for the creation of an educational encyclopedia. Currently she is part of a development team responsible for applying new information technologies in the field of electronic publishing and information distribution.
 
ABSTRACT:
 
Publishing companies and other information providers are increasingly aware of the limitations of classical IT approaches to organizing the production and delivery of content for mass consumption. In this presentation the adoption of the "web model" is proposed as an appropriate information infrastructure to support the production and delivery of information in both corporate Intranets and commercial publishing environments. The foundations of such an architecture, based on the use of URNs and standard frameworks for the description and interchange of metadata (RDF-XML), are presented along with early conclusions drawn from our experience in the design and implementation of a system deployed within Grupo Anaya.
 

Introduction

 
It is our belief that information and knowledge are the most important assets an organization has. As commercial publishers that appreciate the tremendous cost and value of information, we are engaged in reengineering our processes for creating, distributing and accessing information, with the aim of decreasing costs, shortening time to market, and improving quality. Here we propose the adoption of the "web model" as an appropriate information infrastructure to support the production and delivery of information in both corporate Intranets and commercial publishing environments. Special emphasis is placed on the role that emerging XML-related standards such as RDF will play in this context. These core technologies are the building blocks of a successful system developed and deployed within our company.
 

Global Repository: Universal Databases versus the Web

 
At the core of an information system is the data repository. The benefits of global repositories are widely recognized: centralized access and management, consistency, non-redundancy, content reuse, heterogeneous content, workflow management capabilities, and so on. The need for a global repository seems to be evident, but how should this repository be realized?
 
Classical IT approaches like relational or object-oriented databases lead to proprietary hardware and software architectures requiring a single, universal data model. These solutions are enormously complex, expensive and difficult to maintain and evolve. The deployment of such rigid architectures leads to a fundamental drawback: the whole production workflow and organizational structure must be modeled to meet the needs of the database instead of the needs of the product.
 
In our opinion there is a need to shift to a paradigm that better fits the nature of information. Information is heterogeneous, changing, and scattered across web sites, file systems, database systems, and legacy applications. We think, therefore, that distributed architectures reflect the nature of information more accurately than universal databases do. The web model is the only existing infrastructure that, despite some inconveniences, already works as a bona fide extensible, heterogeneous, distributed repository. The foundations of this architecture are described below.
 

The Web Model

 
The web is a very general concept; it is defined as a universal space of information. The concepts it requires, such as resources and identifiers, are as general and abstract as possible. In the web model a resource is considered to be the basic unit of information, but no further assumptions are made about what a resource is. URIs (Uniform Resource Identifiers) identify resources in the web, making them available under a variety of naming schemes and access methods in the same simple way.
 
Resources can be identified by means of an address, a URL (Uniform Resource Locator); a name, a URN (Uniform Resource Name); or a description, a URC (Uniform Resource Characteristics). All of these are URIs and, as such, serve to identify a resource in the web, although the level of abstraction is different in each case. URLs identify resources by their location. By contrast, URNs identify the resources themselves rather than the location(s) where they may reside. URNs are persistent identifiers in the sense that they always have the same meaning. The person or institution that owns a URN determines which object it refers to. The resource identified by a URN may reside in one or more locations at any given time, so resolution services are required to transform names into addresses. URCs are intended to provide descriptions of web resources in order to make the web a machine-understandable information space.
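
The relationship between persistent names and changing addresses can be illustrated with a minimal sketch of a resolution service. This is not the resolver deployed at Grupo Anaya; the URN, the URLs and the function names are all invented for illustration.

```python
# Hypothetical resolver table: one persistent name, any number of current
# locations. The URN and URLs below are made up for this example.
RESOLVER = {
    "urn:x-example:photo:0001": [
        "http://server-a.example/photos/0001.jpg",
        "file:///archive/photos/0001.jpg",
    ],
}


def resolve(urn):
    """Return the URLs currently registered for a URN (empty if unknown)."""
    return list(RESOLVER.get(urn, []))


def relocate(urn, old_url, new_url):
    """Record that one copy of the resource moved; the URN does not change."""
    urls = RESOLVER[urn]
    urls[urls.index(old_url)] = new_url


relocate(
    "urn:x-example:photo:0001",
    "http://server-a.example/photos/0001.jpg",
    "http://server-b.example/photos/0001.jpg",
)
```

Anything that refers to the resource by its URN is unaffected by the move; only the resolver table changes.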
 
We have developed a system based on URNs that provides the infrastructure needed to access information objects distributed across multiple locations and systems in the same simple way. The use of URNs lets people work concurrently but independently on the same resource, at the precise level of abstraction they require, and makes it possible to decouple the problems inherent in infrastructure maintenance from those involved in editorial work.
 

Metadata

 
We will now analyze several issues relating to the problem of making the web a space of information intended for both human and machine consumption. Today the web is built for human consumption, and although everything is machine-readable, this data is not machine-understandable. It is very hard to automate anything, sharing and interchange of data are almost impossible, and search engines provide low-precision results because they use brute-force techniques to retrieve information.
 
It seems that the web urgently needs metadata (information about information) that can be used for retrieval purposes and for management of information. There is no possibility that everyone will agree to start using the same metadata facilities, mainly because the web contains information about a huge number of subjects and because of the wide variety of purposes for which the information is intended. Nevertheless, metadata operations have a lot in common, even when the metadata is different. RDF (Resource Description Framework) is an effort to identify these common threads and provide a way for web architects to use them to provide useful web metadata.
 

RDF

 
Resource Description Framework, as its name implies, is a foundation for describing and interchanging metadata. The broad goal of RDF is to define a domain-neutral mechanism for describing resources but one which is, at the same time, suitable for describing information about any domain. RDF design relies on some basic principles like independence, interchange and scalability.
 
The foundation of RDF is a model for representing named properties and property values. RDF properties may be thought of as attributes of resources and also as relationships between resources.
 
The RDF data model provides an abstract, conceptual framework for defining and using metadata. A concrete syntax is also needed for the purposes of creating and exchanging this metadata in a manner that maximizes the interoperability of independently developed web servers and clients. XML (Extensible Markup Language) has been proposed as the syntax for exchanging RDF metadata. XML itself governs only syntax, and provides an inadequate basis for modeling objects of a problem domain in the way users typically conceive of the objects as core abstractions. XML is, however, an absolutely necessary part of the RDF solution given that XML is unequalled as an exchange format on the web.
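
To make the split between model and syntax concrete, the sketch below takes a resource with named properties and serializes it as RDF/XML using Python's standard library. The RDF namespace is the one published by the W3C; the "ex" vocabulary, the URN and the property names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/terms#"  # hypothetical vocabulary namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)


def describe(subject, properties):
    """Serialize one resource and its literal-valued properties as RDF/XML."""
    root = ET.Element("{%s}RDF" % RDF_NS)
    desc = ET.SubElement(
        root, "{%s}Description" % RDF_NS, {"{%s}about" % RDF_NS: subject}
    )
    for name, value in properties.items():
        # Each named property becomes a child element holding its value.
        ET.SubElement(desc, "{%s}%s" % (EX_NS, name)).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = describe(
    "urn:x-example:book:quijote",  # made-up URN
    {"title": "El Quijote"},
)
```

The model (a resource, a property name, a value) is independent of the markup; XML only fixes how the statement travels between systems.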
 

Metadata Repository

 
Our web of resources comprises texts, illustrations, photographs, maps and other raw material. In order to provide a consistent editorial corpus, it becomes necessary to gather some descriptive information about these resources. As every resource is assigned a public URN identifier, descriptions refer to resources identified by means of URNs.
 
Data describing and connecting resources are stored in a central repository that has been designed to support the RDF constructs. The internal representation of the RDF data model used by the repository is quite simple. Setting aside efficiency and administration issues, a single table is all you need to accommodate resources (including properties, reified statements and collections) and statements.
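
The single-table idea can be sketched with an in-memory SQLite database. The actual table layout used in the repository is not given in this paper, so the column names and helper functions below are assumptions; statements are written as (predicate, subject, object) to match the notation used later in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table holds every statement; the schema is a guess for illustration.
conn.execute(
    """
    CREATE TABLE statement (
        id        INTEGER PRIMARY KEY,
        predicate TEXT NOT NULL,
        subject   TEXT NOT NULL,
        object    TEXT NOT NULL
    )
    """
)


def add(predicate, subject, obj):
    """Store one statement and return its row id."""
    cur = conn.execute(
        "INSERT INTO statement (predicate, subject, object) VALUES (?, ?, ?)",
        (predicate, subject, obj),
    )
    return cur.lastrowid


def objects(predicate, subject):
    """Return the objects of all statements with this predicate and subject."""
    rows = conn.execute(
        "SELECT object FROM statement "
        "WHERE predicate = ? AND subject = ? ORDER BY object",
        (predicate, subject),
    )
    return [row[0] for row in rows]


add("item", "Film", "Citizen Kane")
add("item", "Film", "Touch of Evil")
films = objects("item", "Film")
```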
 
Several commercial storage systems were tested as the basis for the metadata repository. Because of the fine granularity of the information, the storage system has to sustain an intensive transaction load, so the key requirement here is performance. At first glance OO (Object Oriented) storage systems seem well suited to modeling the repository; nevertheless, they did not achieve the required level of efficiency. An RDBMS (Relational Database Management System) based solution was finally chosen, as it addresses this requirement better.
 
Client-side APIs (Application Programming Interfaces) for the database have been developed so that the metadata repository can be accessed from web-based applications. These services are part of a distributed computing framework where software components and network services are identified by globally unique URNs. The database interface API has been implemented as a lightweight software layer with bindings for the Java and JavaScript languages. The RDF API implements a small set of basic operations for creating, deleting and updating resource descriptions, as well as for getting relational information about a given resource, for example the subset of resources connected with it.
 
Specific applications may require some specialized query and manipulation operations. These high-level operations are implemented using the basic RDF API.
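
The layering described above (a handful of basic operations, with specialized operations built on top of them) might look roughly like the following. The real API has Java and JavaScript bindings and its method names are not given here; this Python class, its method names and the helper function are purely illustrative.

```python
class Repository:
    """Toy in-memory stand-in for the RDF repository (illustrative only)."""

    def __init__(self):
        self._statements = set()  # (predicate, subject, object) tuples

    # Basic operations: create, delete, update, and relational lookups.
    def create(self, predicate, subject, obj):
        self._statements.add((predicate, subject, obj))

    def delete(self, predicate, subject, obj):
        self._statements.discard((predicate, subject, obj))

    def update(self, predicate, subject, old_obj, new_obj):
        self.delete(predicate, subject, old_obj)
        self.create(predicate, subject, new_obj)

    def objects(self, predicate, subject):
        return {o for (p, s, o) in self._statements
                if p == predicate and s == subject}

    def connected(self, resource):
        """Resources that share at least one statement with `resource`."""
        linked = set()
        for _, subject, obj in self._statements:
            if subject == resource:
                linked.add(obj)
            elif obj == resource:
                linked.add(subject)
        return linked


def members_with_value(repo, catalog, value):
    """A high-level query written only in terms of the basic operations."""
    return repo.objects("item", catalog) & repo.connected(value)


repo = Repository()
repo.create("item", "Film", "Citizen Kane")
repo.create("item", "Film", "Touch of Evil")
repo.create("item", "Director", "Orson Welles")
repo.create("prop", "Citizen Kane", "Orson Welles")
```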
 
The benefit derived from deploying RDF repositories is that you are never again obliged to set up a different storage system to hold your data and implement new query and manipulation APIs, since in RDF all metadata vocabularies are expressed within a single, well-defined model. All you need is to define your own vocabulary for your application domain on the basis of the RDF model and load it in the RDF repository.
 

User Interface

 
Users can benefit from the flexibility of distributed repositories without getting lost in the “virtuality” of this new information paradigm. Metadata can actually be used to help organize the web repository. RDF provides the framework for expressing metadata in a manner that enables automated processing, thus making the web a machine-understandable information space. Because RDF instance data is not primarily intended for human consumption, it is essential to provide users with appropriate tools to assist them in the creation of metadata.
 
Metadata operations have much in common, even when the metadata is different. Since all RDF metadata vocabularies are expressed within a single, well-defined model, it seems feasible to conceive of generic tools such as browsers and editors for metadata.
 
The Dawn is a metadata-based application designed to facilitate cataloging of resources. For cataloging bibliographic resources, for example, descriptive attributes including “author”, “title”, and “subject” are common. For cataloging people, attributes such as “name”, “ssn”, “age” and “marital status” are often useful. The Dawn provides the facilities needed to define specific vocabularies for cataloging, thus allowing different work teams to create personalized views of data. To create a cataloging schema slightly different from an existing one, it is not necessary to design it from scratch; one can simply reuse existing components.
 
The people in charge of creating the metadata-based corpus do not have a technical profile. Thus one key design goal is to conceal the details of the underlying infrastructure from users. The graphical user interface is based on concepts familiar to most users, such as documents, folders and drag-and-drop operations, so as to provide them with a natural, easy-to-use environment. By carrying out very simple operations, users are able to build their own cataloging schemas.
 
The data model supporting the interface was not developed to provide every imaginable capability. Instead, in the interests of simplicity and performance, it is only as expressive as needed to meet the requirements of cataloging applications. It consists of a small set of well-founded constructs that enable the expression of particular cataloging schemas. These building components represent core abstractions such as resources, collections of resources, basic properties and constraints.
 
The Dawn integrates into the same interface the following functionality: creation of cataloging schemas, creation of resources, attaching properties to resources, establishing relationships between resources, browsing and searching.
 

Dawn Schema

 
The RDF Schema specification provides a mechanism that can be used to define vocabularies for a variety of application domains. The Dawn vocabulary consists of a set of resources and properties defined in the context of RDF as an RDF Schema. The Dawn constructs aim to provide a framework that can be used by content creators or trusted third parties to organize and classify web resources. The Dawn schema does not specify a particular vocabulary of descriptive elements such as “author”, nor does it define the kinds of resources being described. Instead, it specifies the mechanisms needed to define such elements. In this sense the data model on which the Dawn is based can be considered a meta-schema or, more precisely, a schema for creating cataloging schemas.
 
A specific vocabulary, suitable for a particular application or for a given user profile, can be built on the basis of the Dawn cataloging schema simply by giving meaningful names to things, such as “Book” or “author”.
 
The core Dawn schema vocabulary is defined in a namespace informally called dawn here.
 

Core Classes

 
The following resources are core classes that are defined as part of the Dawn schema. Every Dawn model includes these.
 

dawn:Catalog

 
This corresponds to the generic concept of a Type or Category. When a schema defines a new catalog, the resource representing that catalog must have a dawn:type property whose value is the resource dawn:Catalog. Dawn catalogs can be defined to represent almost anything, such as books, films, people, organizations, ... A catalog defines a collection of resources, which are referred to as members of that catalog. The resource representing a catalog has dawn:item properties whose values are the resources representing its members.
 

dawn:Member

 
This corresponds to the generic concept of an Item, considered as a member of a catalog. When a schema defines a new member, the resource representing that member must have a dawn:type property whose value is the resource dawn:Member. Dawn members can be defined to represent items belonging to a catalog, such as “El Quijote”, “Star Wars”, “John Smith”, “GCA”, ... where “El Quijote” is an item of the catalog “Book”, “Star Wars” is an item of “Film”, and so on.
 

dawn:Literal

 
Atomic values such as textual strings are examples of dawn:Literal. As the object of a statement (i.e., the property value) can be a simple string or other primitive datatype, a catalog of type dawn:Literal may be used as the value of the dawn:range property defined below. For example, you can define a catalog “Running Time” to represent the duration of films. Members of the catalog “Running Time” are literals such as “120” or “80”. The resource representing the catalog “Running Time” must have a dawn:type property whose value is the resource dawn:Literal.
 

Core Properties

 
The following resources are core properties that are defined as part of the Dawn schema. Properties provide a mechanism for expressing relationships between catalogs and their items or sub-catalogs.
 

dawn:type

 
This indicates that a resource is a member of a Dawn core class, and thus has all the characteristics that are to be expected of a member of that class. The value of a dawn:type property for some resource is either the resource dawn:Catalog, the resource dawn:Member or the resource dawn:Literal.
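
As an illustration of how the three core classes and dawn:type fit together, the sketch below types a few resources taken from the examples in this section and looks up the core class of each. The (predicate, subject, object) tuple layout and the function name are assumptions, not part of the Dawn schema itself.

```python
# Statements are (predicate, subject, object); sample data follows the
# "Film" and "Running Time" examples in the text.
statements = {
    ("dawn:type", "Film", "dawn:Catalog"),
    ("dawn:type", "Citizen Kane", "dawn:Member"),
    ("dawn:item", "Film", "Citizen Kane"),
    ("dawn:type", "Running Time", "dawn:Literal"),
    ("dawn:item", "Running Time", "120"),
}

CORE_CLASSES = {"dawn:Catalog", "dawn:Member", "dawn:Literal"}


def core_class(resource):
    """Return the Dawn core class of a resource, or None if it is untyped."""
    values = {o for (p, s, o) in statements
              if p == "dawn:type" and s == resource}
    values &= CORE_CLASSES
    return values.pop() if len(values) == 1 else None
```

For example, core_class("Film") yields "dawn:Catalog", while core_class("Running Time") yields "dawn:Literal" because its members are atomic values.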
 

dawn:subCatalog

 
This property indicates the subset/superset relation between catalogs. The dawn:subCatalog property is transitive. If B is a sub-catalog of a broader catalog A, resources that are members of B will also be members of A, since B is a subset of A.
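
The transitivity rule can be sketched as a recursive membership lookup. The sample data borrows the genre example used later in this paper, with statements written as (predicate, subject, object) so that (subCatalog, Genre, Fiction) reads "Fiction is a sub-catalog of Genre"; the function name is an assumption.

```python
statements = {
    ("subCatalog", "Genre", "Fiction"),
    ("subCatalog", "Genre", "Documentary"),
    ("item", "Fiction", "Drama"),
    ("item", "Fiction", "Comedy"),
    ("item", "Documentary", "History"),
}


def members(catalog, seen=None):
    """Items of a catalog, including the items of all its sub-catalogs."""
    seen = set() if seen is None else seen
    if catalog in seen:  # guard against accidental cycles
        return set()
    seen.add(catalog)
    found = {o for (p, s, o) in statements if p == "item" and s == catalog}
    for (p, s, o) in statements:
        if p == "subCatalog" and s == catalog:
            found |= members(o, seen)
    return found
```

With this data, Drama is a member of Fiction directly and a member of Genre through the sub-catalog relation.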
 

dawn:item

 
This property indicates the membership relation between a catalog and the members that belong to the collection of resources defined by the catalog.
 

dawn:prop

 
This property indicates the existence of a relationship between two resources.
 

Constraints

 
The Dawn Schema defines some properties used to declare constraints associated with classes and properties.
 

dawn:range

 
This property is used to constrain the kind of resources that can participate as objects in a statement involving a given property. A property may have one or more catalogs as its range. The value of a property whose range is a catalog A is constrained to be an item of catalog A.
 

dawn:domain

 
This property is used to constrain the kind of resources that can participate as subjects in a statement involving a given property. A property may have one or more catalogs as its domain. A property whose domain is a catalog A may only be used with items of catalog A.
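
A check based on dawn:range and dawn:domain could be sketched as follows. For readability the sketch uses a named property, "directedBy", which is invented for this example, and it assumes that when several catalogs are declared, membership in any one of them is sufficient; the text does not settle that point.

```python
# Catalog contents (items only, no sub-catalogs, to keep the sketch short).
items = {
    "Film": {"Citizen Kane", "Touch of Evil"},
    "Director": {"Orson Welles"},
}

# Hypothetical property with its declared domain and range catalogs.
domain = {"directedBy": {"Film"}}
range_ = {"directedBy": {"Director"}}


def is_valid(prop, subject, obj):
    """True if the subject lies in the domain and the object in the range."""
    in_domain = any(subject in items[c] for c in domain[prop])
    in_range = any(obj in items[c] for c in range_[prop])
    return in_domain and in_range
```

Under these declarations a film may be linked to a director, but the reversed statement, with the director as subject, is rejected.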
 

dawn:participation

 
A statement-set is defined as a statement where the subject and object resources are catalogs. A statement-set defines a collection of statements where the subject is an item of the subject catalog in the statement-set, and the object is an item of the object catalog in the statement-set.
 
A participation constraint restricts the number of times a resource in a catalog can participate in a statement-set.
 
This property is used to declare the participation constraint. The value of the participation property is a dawn:ParticipationValue.
 

dawn:ParticipationValue

 
This is the catalog whose members are dawn:ExactlyOne, dawn:ZeroOrMore, dawn:ZeroOrOne and dawn:ZeroOrMore.
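
One way to read a participation constraint is as a pair of bounds on how many statements of a statement-set a given subject may appear in. The sketch below encodes that reading for three of the values named above; the bounds table and the function names are assumptions, and the chroma data comes from the film example used later in this paper.

```python
# (minimum, maximum) number of statements per subject; None = unbounded.
BOUNDS = {
    "dawn:ExactlyOne": (1, 1),
    "dawn:ZeroOrOne": (0, 1),
    "dawn:ZeroOrMore": (0, None),
}


def satisfies(count, value):
    """True if `count` participations are allowed by the constraint value."""
    low, high = BOUNDS[value]
    return count >= low and (high is None or count <= high)


def participation_count(subject, object_items, statements):
    """How many times `subject` is linked to an item of the object catalog."""
    return sum(1 for (p, s, o) in statements
               if p == "prop" and s == subject and o in object_items)


def may_add(subject, object_items, value, statements):
    """True if one more statement would stay within the upper bound."""
    _, high = BOUNDS[value]
    current = participation_count(subject, object_items, statements)
    return high is None or current + 1 <= high


chroma = {"black&white", "color"}
statements = {("prop", "Citizen Kane", "black&white")}
```

With a dawn:ZeroOrOne constraint, may_add refuses a second chroma value for Citizen Kane; with dawn:ZeroOrMore it would be accepted.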
 

Authoring, Browsing and Searching

 
Most people, including end users, are familiar with hierarchies as a means to indicate organization. As the Dawn vocabulary includes definitions for parent and child relationships, it becomes possible to layer a strict hierarchical view on top of the Dawn graph. Data modeled as per the Dawn schema may, therefore, be browsed via a tree structure. To provide a meaningful view of data, the browser sets a graphical representation for each construct defined in the Dawn schema.
 
The Dawn schema provides a set of well-founded constructs for creating personalized cataloging schemas. A specific schema is built by defining the kinds of resources to be described and their properties. Users do not need to know anything about the underlying data model to create their own schemas; all they have to do is assign meaningful names to things.
 
From the user's point of view a catalog is just a container with a descriptive name which indicates the kind of things it may hold. Folder icons are used to represent catalogs. Users populate catalogs by creating objects in them. Each object in a catalog is given an appropriate name which designates the resource it represents. Document icons are used to represent items. To represent the item relation between a catalog and its members, the members are depicted as documents contained in the corresponding catalog folder. Figure 1 illustrates these concepts.
Note: In this document figures show two different representations of the RDF data: as in the Dawn browser and using directed labeled graphs, where the nodes (drawn as ovals) represent resources and the arcs represent named properties. Text in bold indicates the type of the resource.
 
Suppose you want to classify and organize all of the movies available at a video store. First, you create a folder Film to hold the films. As a result of this operation a Catalog labeled Film is created. Then you put in it all the films available at the video store: Touch of Evil, Citizen Kane, ... As a result of these operations, the statements (item, Film, Citizen Kane), (item, Film, Touch of Evil), ... are created.

Figure 1: Schema for film descriptions and some sample data

 
 
Users define properties so as to describe catalogs. A property is a catalog whose members are the permitted values this property may have. Catalogs representing properties are created as sub-folders of the catalog folder they describe.
 
For describing films, descriptive attributes such as director or chroma may be useful, so you create the sub-folders corresponding to these properties in the folder Film. As a result the statements (prop, Film, Director) and (prop, Film, Chroma) are created. Attributes such as Running Time and Year may also be important for describing films. The permitted values for these properties are atomic values such as “89” or “1956”, so Running Time and Year are catalogs whose members are literals. See Figure 1.
 
Note that folders always represent collections of resources, regardless of how these resources are interpreted later; that is, there is no difference between catalogs representing properties and other catalogs. A catalog may make sense on its own or as a property of another catalog.
 
A property is applicable to the members of the catalog for which this property was defined. Statements associate the resources in a catalog with the resources in the property catalogs defined for it. By performing drag & drop operations, users are able to establish relationships between resources in different catalogs.
 
Suppose you want to express the statement “Citizen Kane is a film directed by Orson Welles”. To do so you just have to drag the film Citizen Kane over Orson Welles. The statement (prop, Citizen Kane, Orson Welles) is then created.
 
Catalogs represent criteria for restricting searches on data. To get the films by Orson Welles you just have to open the folder Director and then the document Orson Welles. This performs a query against the repository that selects the collection of resources which are members of the catalog Film and whose value for the Director property is Orson Welles.
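
The query triggered by opening a document can be approximated over a small set of statements in the (predicate, subject, object) notation used above. This is only a sketch of the selection logic, not the repository's query mechanism; the function names are assumptions, and Vertigo and Alfred Hitchcock are not in the original example but are added so that the query has something to exclude.

```python
statements = {
    ("item", "Film", "Citizen Kane"),
    ("item", "Film", "Touch of Evil"),
    ("item", "Film", "Vertigo"),               # added for illustration
    ("prop", "Film", "Director"),
    ("item", "Director", "Orson Welles"),
    ("item", "Director", "Alfred Hitchcock"),  # added for illustration
    ("prop", "Citizen Kane", "Orson Welles"),
    ("prop", "Touch of Evil", "Orson Welles"),
    ("prop", "Vertigo", "Alfred Hitchcock"),
}


def items_of(catalog):
    """Direct members of a catalog."""
    return {o for (p, s, o) in statements if p == "item" and s == catalog}


def select(catalog, property_catalog, value):
    """Members of `catalog` whose `property_catalog` value is `value`."""
    # The value must be an item of a property defined for the catalog.
    if ("prop", catalog, property_catalog) not in statements:
        return set()
    if value not in items_of(property_catalog):
        return set()
    return {m for m in items_of(catalog) if ("prop", m, value) in statements}


welles_films = select("Film", "Director", "Orson Welles")
```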
 
As a property can be used to describe more than one catalog, if a property already exists users can make it available to as many catalogs as needed.
 
Consider as a simple example the property Nationality. It could be useful to have the films classified by nationality, so you create the property catalog Nationality in the folder Film. Then you realize that “nationality” is also an important characteristic of directors. As the property Nationality already exists you just have to drag this folder over the folder Director.

Figure 2: Shared properties

 
 
As the property Nationality is available in two different contexts, when you open the document American in the context Film you will see just American films. If you open the document American in the context Director you will see just American directors.
 
The following figure illustrates how the property dawn:subCatalog is used for classification.

Figure 3: Sub-Catalogs

 
 
You could define the property Genre for representing films' genres such as drama, comedy, children, scientific, historic, ... Now, you might want to distinguish between fiction and documentary films. To do so, you can create the sub-catalogs Fiction and Documentary. As a result of this operation the statements (subCatalog, Genre, Fiction) and (subCatalog, Genre, Documentary) are created. Sub-catalogs are depicted as folders. Then you can put Drama, Comedy, Horror-Suspense and Children into the catalog Fiction, and Science and History into the catalog Documentary.
 
Suppose you want to express the statement “Citizen Kane is a drama”. To do so you just have to drag the film Citizen Kane over Drama. The statement (prop, Citizen Kane, Drama) is then created. Note that you can do this because Drama is actually an item of Genre, since Fiction is a sub-catalog of Genre.
 
It is possible to fix some criteria so that later searches are performed on the results of previous searches. If you fix the property value American and then open the document labeled Drama, you will get just American dramas: only the resources which are members of the catalog Film, whose value for the Nationality property is American, and whose value for the Genre property is Drama are selected.
 
Figure 4 illustrates this.

Figure 4: An advanced query
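
The narrowing search described above amounts to filtering the members of a catalog by each fixed value in turn. A minimal sketch, assuming the (predicate, subject, object) notation of this paper; the function name is an assumption and "Sample Film" is a made-up entry added so that the second criterion has something to exclude.

```python
statements = {
    ("item", "Film", "Citizen Kane"),
    ("item", "Film", "Touch of Evil"),
    ("item", "Film", "Sample Film"),   # made-up entry for illustration
    ("prop", "Citizen Kane", "American"),
    ("prop", "Citizen Kane", "Drama"),
    ("prop", "Touch of Evil", "American"),
    ("prop", "Sample Film", "Drama"),
}


def search(catalog, fixed_values):
    """Members of `catalog` that carry every one of the fixed values."""
    results = {o for (p, s, o) in statements if p == "item" and s == catalog}
    for value in fixed_values:
        # Each criterion is applied to the results of the previous one.
        results = {m for m in results if ("prop", m, value) in statements}
    return results


american_dramas = search("Film", ["American", "Drama"])
```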

 
 
Consistency of data is guaranteed at any moment during the authoring process. Rules for maintaining consistency derive from the constraints users state in their own cataloging schemas. Consider as a simple example the sentence “films have the property chroma”. The Dawn might use this information to suggest legal values, black&white and color, or to prevent users from dragging black&white over Citizen Kane, since the arc representing the relationship between Film and Chroma must start at Film and point to Chroma. You can also declare participation constraints such as “films may have only one chroma property”, thus preventing the statements (prop, Citizen Kane, black&white) and (prop, Citizen Kane, color) from coexisting.
 
As you have likely realized, authoring, browsing and searching are closely related in the gestural Dawn interface. This is possible because the whole system is based on metadata, and the underlying model relies heavily on the unified concept of identifiers and on a strict distinction between a physical resource and its description.
 

Conclusion

 
We think that the development of an information system based on the "web model" and centered on the use of URNs and standard frameworks for the description and interchange of metadata (RDF-XML) will provide a common, evolvable, long-lived infrastructure on top of which network-based applications and knowledge-based systems will exist. We firmly believe that our approach is not only feasible, but also the most sensible way to proceed.
 
Acknowledgments
 
This project is the result of collaborative work carried out by the Editorial Systems department in Grupo Anaya. The development would not have been possible without the creative input and hard work of its members. Most of the ideas which inspired Dawn are due to Vicente Sosa (vsosa@anaya.es). I would like to thank him for his support and never-ending enthusiasm. Thanks are due to Israel Hernanz (ihernanz@anaya.es) for his efforts in developing the RDF repository and the APIs. I would also like to thank Alicia García (agarcia@anaya.es) for her helpful contribution in knowledge representation.
 
