
 Burkett, William 
 Long Beach 
 Product Data Integration Technologies, Inc. 
 USA 
 
William C. Burkett
 Senior Information Engineer
Product Data Integration Technologies, Inc.
  100 W Broadway, Suite 540 Long Beach (California)  USA (90802)
Email: wburkett@pdit.com Web site: www.pdit.com
 Biography
 William Burkett has over 15 years of experience as an industrial and systems engineer specializing in system analysis and data modeling, information system integration, and product data exchange (PDE) technologies. Prior to joining P.D.I.T., he worked for McDonnell-Douglas and Lockheed on PDE technology and standards development programs. Mr. Burkett has been an active participant in the development of the STandard for the Exchange of Product Model Data (STEP - ISO 10303, TC184/SC4) since its inception in 1984. More recently, he has been applying PDE principles to the design of XML standards for the integration of Defense legacy systems and the deployment of product catalogs.
 

Introduction

 "XML will add meaning to the web."
  "XML will enable application interoperability over the web."
  "Vocabulary" this "Ontology" that "Metadata metadata metadata "
 What does it all mean? Indeed, does it - or can it! - mean anything at all?
 Like many "new" computing technologies, XML is the "Next Big Thing" and is going through a typical "new technology" lifecycle. It is currently in the middle of the "frenzied fad phase" and can do absolutely everything - and absolutely everybody is doing it.
 The reality, of course, is much more sobering: like past "Next Big Things", it will find its appropriate niche and application as an Internet technology and, until it does, we'll have to endure the wild speculations and flights of fancy until they either fade or bear fruit. Regardless of the initial speculations concerning the use and value of XML, some predictions seem to be more of a "sure thing" than others. The most significant of these - and the subject of this paper - is the proliferation of "XML Vocabularies".
 

Notes on the Terminology

 The term "XML vocabulary" is used in this paper rather than "XML Schema" to: (1) avoid confusion with the XML Schema specification; and (2) leverage the popular vernacular. As used here, an "XML vocabulary" is: a structured ontology that specifies the (possible) content of an XML document. A more precise and correct term for this thing would be "XML Content Schema" or "Web Resource Content Schema".
 As used here, "ontology" is defined by Sowa [13] as: "…a catalog of types of things that are assumed to exist in a domain of interest, D, from the perspective of a person who uses a language, L, for the purpose of talking about D."
 

Problem Statement

 Since XML is intended to add meaning to ASCII text through structurally-standard markup, all that needs to be done is standardize the meaning of the markup and then applications will be able to "understand" the data and "talk to one another" - right? Unfortunately, it's not as easy as it sounds; linguists, psychologists, sociologists, and philosophers have been studying meaning and language for thousands of years and STILL don't know how to respond to the request: "Give me a Lite". A seemingly inevitable result of the XML hype is what's variously been called "the Balkanization of the Web"[10] or a "Tower of (web)Babel". This phenomenon will result from:
 · The rush to develop XML vocabularies in the form of DTDs or XML Schemas, promote them as the "lingua franca for <fill in the blank> business processes", and encourage everyone to use this "standard" vocabulary.
 · The fact that hundreds - if not thousands - of organizations are doing it.
 The primary problem is not the number of vocabularies, but the fact that the vocabularies are not integrated with respect to application semantics and there are no frameworks or methods proposed for doing so. They are standalone lexicons with no formal relationship to other vocabularies despite the possibility of significant overlap of semantic scope (i.e., the vocabularies are "about" the same real-world application domain). And XML itself doesn't help because it does not deal with application semantics at all; it is primarily a structural/syntactic specification and the only concession to semantics it makes is the distinction between metadata (i.e., markup) and content. Thus, the whole notion that XML will enable applications to "talk to one another" by making web-based data meaningful evaporates in an ambiguous puff of ASCII because the vocabularies will only be meaningful within the usage communities that developed or use them.
 Does this mean, then, that we are doomed to a cacophony of competing vocabulary standards? Or is there a way to address this problem before it truly becomes a Tower of Babel? The purpose of this paper is to present a solution to this problem which breaks free of a number of commonly held beliefs and is simple, powerful, and extensible. The solution is embodied in the Product Data Markup Language (PDML - www.pdit.com/pdml), a methodology and suite of XML vocabularies for exchanging integrated product data between legacy systems in the Department of Defense (DoD). Though the development of PDML targeted a particular application domain (i.e., DoD product data systems), the philosophy and methods are generally applicable to any application domain.


Requirements for XML Vocabularies

 The reason for defining standard XML vocabularies is to enable applications to exchange "meaningful" data and take action based on the interpretation of that data. Therefore, in the development of PDML the focus was on the semantics of the data rather than the encoding (e.g., XML) or transport (e.g., http). This paper shall maintain that focus on semantics and assume that transport of digital data is a solved problem.
 The solution offered by PDML for the reconciliation of a multiplicity of XML vocabularies recognizes that there is a fundamental tug-of-war going on between contradictory - yet each perfectly valid! - functional requirements. To be useful and meaningful, an XML vocabulary must be:
 · Semantically complete and unambiguous such that the vocabulary contains all the "terms" necessary for a particular application domain, and each term is defined precisely enough to be correctly used and correctly interpreted.
 · Standardized such that it can be effectively, uniformly, and consistently applied to the degree that it can be called out in contracts.
 These requirements are not surprising and represent the objectives of vocabulary standardization efforts such as OASIS, BizTalk, RosettaNet, XML EDI, Dublin Core, etc., etc. By standardizing the terms and definitions of a vocabulary, the hope is that applications can be written that "understand" these vocabularies.
 Unfortunately, these objectives fail to account for a very human and all-too-real phenomenon: the meaning of terms tends to drift with repeated use.
 If we define the "usage" of a vocabulary as a purpose and/or role-driven "message type" that is comprised of vocabulary terms, and "use" of vocabulary as a communication event at a given point in time (i.e., an "instance" of the usage), then within a particular usage of a vocabulary, the meaning of the vocabulary terms will tend to:
 1) be slightly different than that specified by the vocabulary standard in a given use - there are many small semantic variations that are unique to the use;
 2) drift and vary across multiple uses as business processes change.
 These facts make the enforcement of the vocabulary standard problematic since standards are intended to be applicable over an extended period of time (or else they wouldn't be "standard"). It is not impossible to reconcile these requirements within a single vocabulary, but the scope of the vocabulary (i.e., the size/breadth of the usage/application domain) must be very small. Therefore, an additional requirement is that an XML Vocabulary must be:
 · Adaptable such that it is able to respond to individual requirements of a particular usage and the evolution of requirements over time.
 The principal "tug-of-war" in the design of vocabularies is between standardization and adaptability.
 Note that "adaptable" is not the same thing as "extensible". "Adaptable" semantics means that the semantics of existing vocabulary terms can be "officially" modified on a per-use basis. "Extensible" semantics means that new semantics are added to the vocabulary. (Despite the fact that "extensible" is actually part of the XML name and a primary design feature, extensibility is actually a bad thing with respect to application interoperability, integration, and standardization. "Extensibility" is diametrically opposed to everything these things represent.)
 The existence of multiple XML vocabularies that address related/overlapping application domains further complicates matters by introducing a fourth equally valid and reasonable requirement that the vocabularies be:
 · Integrated such that elements from each vocabulary with the same name do not have different meanings and that elements that do have the same meaning do not have different names.
 These requirements cannot be simultaneously reconciled in a single standard - at least as these standards have been envisioned. PDML offers an alternative vision of how XML vocabularies can be structured as an integrated suite of vocabularies.
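 The "integrated" requirement can be read as a pair of checks across vocabularies: no name carries two meanings, and no meaning carries two names. The following is a minimal sketch of such a check, not a PDML facility; the two vocabularies, their element names, and the concept identifiers are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical vocabularies: element name -> identifier of the concept it denotes.
# Every name and concept identifier here is invented for illustration.
VOCAB_A = {"drawing_number": "document-identifier", "sheet": "drawing-sheet"}
VOCAB_B = {"doc_id": "document-identifier", "sheet": "bed-linen"}


def find_conflicts(vocabularies):
    """Return (homonyms, synonyms) found across the given vocabularies."""
    meanings_by_name = defaultdict(set)
    names_by_meaning = defaultdict(set)
    for vocab in vocabularies.values():
        for name, meaning in vocab.items():
            meanings_by_name[name].add(meaning)
            names_by_meaning[meaning].add(name)
    # Same name, different meanings.
    homonyms = {n: sorted(m) for n, m in meanings_by_name.items() if len(m) > 1}
    # Same meaning, different names.
    synonyms = {m: sorted(n) for m, n in names_by_meaning.items() if len(n) > 1}
    return homonyms, synonyms


homonyms, synonyms = find_conflicts({"A": VOCAB_A, "B": VOCAB_B})
print(homonyms)
print(synonyms)
```

 The sketch presumes that someone has already assigned each element to a concept, which is exactly the hard, human part of the problem; the code only makes the two failure modes explicit.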
 

Integrating Vocabularies

 PDML did not set out to solve the problem of reconciling different XML vocabularies, but rather to integrate product data systems. The Joint Electronic Commerce Program Office (JECPO) of the DoD sponsors the development of PDML to provide an Internet- and XML-based solution for integrating the myriad of product data systems used to support weapon systems. Their goal was to develop an integrated, web-based solution for product data visibility, accessibility, and application interoperability. The challenge they faced was the fact that these product data systems are used (and maintained) by a huge number of different, disparate, and dispersed communities that comprise the DoD and that the systems have a large degree of overlap with respect to the data that they contain. Past attempts to integrate these systems have been massive, expensive, and ultimately of limited success.
 The integration of XML vocabularies is essentially the same problem as that addressed by data-driven systems integration methodologies and practices. Therefore, a review of the characteristics of, and lessons learned from, those practices will provide useful insights into XML vocabulary design principles.
 

System Integration and Application Interoperability

 The integration of disparate and heterogeneous information systems has been pursued ever since the recognition that two different systems are using some of the same information and someone said "hey - let's share data and save time". The idea was to enable software applications to "interoperate" by exchanging structured data that both applications "understand". Approaches that have been pursued to enable application interoperability (and, thus, establish an "integrated" system) include:
 (Note: These approaches are presented roughly in order of historical appearance.)
 · Point-to-point translators that convert data bound to one application system into the data format of a target application system;
 · A shared database that is used by multiple applications;
 · Product Data Exchange (PDE) standards that specify a neutral, application-independent data structure used to convey data between applications (and translators written to/from the PDE standard);
 · Database federations in which each repository makes (some part of) its data visible and accessible via an API to other databases/applications in the federation;
 · Product Data Management (PDM) and Enterprise Resource Planning (ERP) applications that "throw a net" over the applications within an enterprise and route, control, and constrain data that moves between individuals and/or applications.
 All of these approaches have one thing in common: models of the data (typically called schemas). All of the approaches above have strengths and weaknesses with respect to enabling application interoperability and integrating systems:
 
 

Integrating Information Resources

 As it turns out, the data-driven system integration strategies used in PDML are directly applicable to the reconciliation and integration of XML vocabularies because an XML vocabulary is really no different than a database schema. In fact, formal data management and data modelling principles are eminently (and necessarily) applicable to the design and use of XML schemas.
 Traditional approaches for database schema integration usually take the form of merging the schemas through semantic analysis and conflict resolution, resulting in a single, often larger, and semantically "flat" schema (cf. "View Integration" [1, 2, 7]). However, merging large numbers of schemas becomes increasingly difficult simply due to the size and semantic complexity of the global schema.
 An alternative approach adapts the three-schema architecture concept [14] by integrating schemas through abstraction. The individual component schemas are generalized and a third "conceptual" schema is produced that "contains" or is a "semantic superset" of the original schemas. The original schemas are not, however, consumed in the process but maintain their existence as "external views" of the conceptual schema. This approach to integration was introduced in the STandard for the Exchange of Product model data (STEP - ISO 10303) [6].
 PDML applies the "integration-through-abstraction" paradigm in the definition of the vocabularies that comprise PDML because it is more scalable than simply merging the schemas. The relationship between the original schemas and the new abstract schema is formalized with a processable mapping specification; the original schemas are thus interpretations of the new abstract schema.
 Interpretation is the key to both integrating XML vocabularies and making them useful within specific usage communities. The smaller domain-specific vocabularies can be used to exchange data along high-volume, semantically-precise channels. Exchanging data within a broader domain requires translation of the data to a more general vocabulary (using the mapping specification) and then translation to another "dialect" (again using the mapping specification).
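 The two-step translation described above (dialect to general vocabulary, then general vocabulary to another dialect) can be sketched as follows. This is only an illustration under strong simplifying assumptions: a mapping specification is reduced to a one-to-one renaming table, and the two dialects and their field names are invented. It is not the PDML mapping language.

```python
# Hypothetical mapping specifications: each maps a dialect's field names to
# terms of a shared abstract vocabulary. All dialect names are invented.
DRAWING_SYSTEM_MAP = {"drawing_number": "document.id", "drawing_title": "document.name"}
PARTS_SYSTEM_MAP = {"doc_ref": "document.id", "doc_label": "document.name"}


def to_abstract(record, mapping):
    """Translate a dialect record into the abstract vocabulary.

    Fields without an entry in the mapping are dropped.
    """
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def from_abstract(record, mapping):
    """Translate an abstract record into a dialect by inverting its mapping."""
    inverse = {abstract: field for field, abstract in mapping.items()}
    return {inverse[term]: value for term, value in record.items() if term in inverse}


def translate(record, source_map, target_map):
    """Dialect -> abstract vocabulary -> other dialect."""
    return from_abstract(to_abstract(record, source_map), target_map)


source = {"drawing_number": "D-1001", "drawing_title": "Bracket assembly"}
print(translate(source, DRAWING_SYSTEM_MAP, PARTS_SYSTEM_MAP))
```

 Note that neither dialect refers to the other: each is bound only to the abstract vocabulary, so adding a third dialect requires one new mapping rather than two new translators.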
 There are added benefits of the integration-through-abstraction paradigm that directly address several of the requirements above:
 · The abstract integration vocabulary is more stable and less likely to change over time because of the built-in semantic fuzziness - thus, it is standardizable;
 · The domain-specific vocabularies are decoupled from each other and thus free to evolve as requirements change with a minimum "ripple effect" on other vocabularies;
 · The binding of a domain-specific vocabulary to the abstract integration vocabulary with the mapping specification can be "recompiled" when the former is changed, thus maintaining the linkage and integration to other vocabularies.
 The structure of PDML and the nature of the vocabularies of which it is composed are described below.
 

Ramifications on the development and use of XML Vocabularies

 The Internet and World Wide Web provide a new and powerful system integration platform. The ubiquity of the Internet breaks spatial and temporal barriers and provides the opportunity for users to access data anywhere, anytime. Furthermore, XML provides a platform-independent and web-friendly data structuring syntax for the representation and exchange of data. The question, then, is: what lessons can be gleaned from system integration theory and practice that are applicable to the integration of data on the Web?
 The most pertinent lessons are:
 · An XML Vocabulary is a model of data.
 · An XML Vocabulary should be small in scope and specific in application if the meaning of the terms is to be clear and unambiguous to the user of the vocabulary.
 · XML vocabularies are standalone in use, but should be part of a comprehensive structure (specified via a mapping) if a "Semantic Web" [4] is to be realized; in other words, vocabularies should not be developed and used in a vacuum.
 · The context, experience, and requirements of a particular use of a particular vocabulary cannot be accurately predicted, nor can the evolution of requirements be predicted.
 

Product Data Markup Language - PDML

 PDML breaks the single-standard-schema paradigm that is prevalent in XML Vocabularies standardization efforts by defining a suite of XML Vocabularies that are integrated through formal mappings to an abstract, neutral schema. The design of PDML meets the requirements outlined above and resolves the "tug-of-war" by defining a componentized and integrated structure where - like any complex product - different functions are fulfilled by different components.
 

Structural Overview of PDML

 The structural architecture of PDML is analogous to the "star-satellite" structure of the client-server model. PDML is composed of the following components:
 · A collection of Application Transaction Sets;
 · The Integration Schema;
 · Mapping specifications between the Application Transaction Sets and the Integration Schema.
 The relationship between these components is illustrated in Figure 1. As PDML grows, additional transaction sets will be added to the specification.
 
 Figure 1 - Relationship of PDML Components
 PDML defines domain-specific vocabularies called Application Transaction Sets (ATS) that name and structure data in a way already familiar to a particular usage community. An ATS is an XML DTD that specifies the elements needed to exchange XML data between current users of a particular application (i.e., users within the same context). The addition of a presentation style sheet would enable a user to receive and view the product data with any mainstream (and XML-savvy) web browser.
 PDML chose to define a usage community as either the users of a particular DoD application system (e.g., JEDMICS) or as users of a particular kind of product data (e.g., product structure data). The usage community is also referred to as the context of the exchange of data.
 The usage community represented by all the DoD weapon system design and support personnel actually consists of many component usage communities. The Application Transaction Sets were designed specifically to support a few of the most important of these communities. The data required in these contexts overlaps with data in other contexts; furthermore, the data used in different contexts often has different names, or is stored in a different structure. Therefore, PDML defines an Integration Schema that serves as an intermediary between views, a neutral representation that is designed to service the information needs of all the contexts within weapon system support in a uniform and integrated way.
 The relationship between the Application Transaction Sets and the Integration Schema is specified through mappings, or conversion rules. The mappings, thus, provide an approach for taking domain-specific data (expressed in the vocabulary of an ATS), converting it to a neutral integrated representation, and then reconstructing the same integrated information in the vocabulary of a different ATS.
 An important aspect of the PDML philosophy is that it assumes a "data management" perspective rather than a "document" perspective in the development of the ATSs. As noted by Charles Goldfarb (one of the original developers of SGML and XML) "…many people have noticed that XML documents resemble traditional relational and object database data in many ways. Once you have a language for rigorously representing documents, those documents can be treated more like other forms of data." [9] The "document mindset" that led to the hierarchical structure of SGML and XML was felt to be too constraining and unsuitable for data exchange and data management.
 

Meeting the requirements for XML Vocabularies

 The development of PDML sought to meet the functional requirements defined above. To reiterate, the requirements are:
 1) Complete and unambiguous data semantics to facilitate the import and export of data;
 2) Integrated data semantics;
 3) Standardizable data semantics;
 4) Adaptable data semantics;
 An additional PDML requirement was leveraging existing technologies. Toward this end, PDML is simply a new application of technology and standards that already exist.
 

Complete and Unambiguous Data Semantics

 In their vision of a "Semantic Web", Tim Berners-Lee et al. [4] recognize the trade-offs between local autonomy and global accessibility in the design/deployment of web data; global protocols for access and exchange are necessary for scalability of the web, but localized standards are necessary to preserve localized, narrow-channel communication requirements. They also recognize that the definition of semantics is based on a usage community in which particular meaning and constraints are defined, built, and used. In fact, the notion of "meaning communities" is supported by social science theories on the development of knowledge and meaning that assert that meaning is constructed, reinforced, and institutionalized through usage within a community [3].
 In addition, there is a mechanism already defined that recognizes the context-sensitivity of XML vocabularies: Namespaces [5]. Namespaces provide a syntactic mechanism for differentiating between vocabularies developed for/by different communities.
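 As a small illustration of how Namespaces keep apart two communities' uses of the same element name, the sketch below parses a document in which "sheet" appears in two vocabularies. The namespace URIs, prefixes, and element names are invented for this example; only the standard library parser is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and element names, invented for illustration.
DOC = """
<order xmlns:eng="http://example.org/engineering"
       xmlns:pub="http://example.org/publishing">
  <eng:sheet>3</eng:sheet>
  <pub:sheet>A4</pub:sheet>
</order>
"""

root = ET.fromstring(DOC)
ns = {"eng": "http://example.org/engineering", "pub": "http://example.org/publishing"}

# The same local name resolves to two distinct qualified names.
eng_sheet = root.find("eng:sheet", ns)
pub_sheet = root.find("pub:sheet", ns)

print(eng_sheet.tag, eng_sheet.text)
print(pub_sheet.tag, pub_sheet.text)
```

 The parser can tell the two elements apart, but nothing in the syntax says what either one means; that remains the business of the community that owns each namespace.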
 PDML leverages the idea that semantics are local to a particular meaning community in the definition of an Application Transaction Set. Furthermore, by delimiting a meaning community as the users of a particular legacy product data system, PDML was able to define a "complete and unambiguous" XML vocabulary - because the users already had many years of experience using the terms in this vocabulary!
 For example, the Joint Engineering Data Management Information Control System (JEDMICS) is a very large - and very old - defense data system. It consists of data fields like:
 · drawing_number
 · drawing_title
 · cage_code
 · doc_type
 · drawing_revision
 · sheet
 · sheet_revision
 · frame
 · number_of_frames
 · control_code
 · security
 · foreign_secure
 · nuclear
 · wsc
 · safety
 · dist
 · master_location
 Some of these fields, like "drawing_number" or "sheet", might mean something to non-JEDMICS users. Who but a JEDMICS user, however, would know what a "control_code" was, or what "wsc" meant?
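 To make the idea of a community-specific vocabulary concrete, the sketch below serializes one record using a few of the field names listed above as XML element names. This is not the actual JEDMICS Application Transaction Set: the flat element layout, the root element name, and every value are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Field names follow the list above; the element layout, the root element
# name, and every value are assumptions made for illustration only.
record = {
    "drawing_number": "D-1001",
    "drawing_title": "Bracket assembly",
    "cage_code": "00000",
    "drawing_revision": "B",
    "sheet": "1",
}

root = ET.Element("drawing_record")  # hypothetical root element
for field, value in record.items():
    ET.SubElement(root, field).text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

 A user inside the community reads such a record without a glossary; a user outside it sees well-formed markup whose element names explain very little.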
 Complete and Unambiguous in Large Meaning Communities
 Standardizing the unambiguous semantics of a particular field or object within a large meaning community is not practically possible due to the differing requirements and variations among the smaller communities within it. It may be possible to standardize the semantics of a field or object that is common to most/all meaning communities on planet Earth, such as Person Name, but even ubiquitous concepts like this are subject to local variations.
 

Integrated Data Semantics

 The integration of data semantics is the most slippery and ill-defined aspect of system integration and presents the single biggest challenge to the effective use of XML on the World Wide Web. While the semantics of an ATS may be complete and unambiguous, this property is not additive when ATSs must be integrated.
 As mentioned above, traditional approaches for integrating data semantics focus on reconciling and merging component schemas to create a "global" schema that is used by all the applications. This approach is completely unfeasible for integrating the semantics of XML vocabularies for two simple reasons:
 a) There is no recognized arbiter responsible for reconciling and merging the schemas. (W3C has abdicated this perceived responsibility to user groups; OASIS and BizTalk are attempting to fulfill this role.)
 b) The size of the merged vocabulary will quickly become too large for anyone to understand and manage.
 The solution to integrated data semantics must draw on the characteristics of the way humans use natural language, since "meaning" is a uniquely human trait. PDML draws on the natural cognitive ability of humans to create and use abstractions to classify, organize, and talk about real-world phenomena. PDML uses abstraction to create and define an Integration Schema that represents a "semantic superset" or "generalization" of the semantics of the Application Transaction Sets. The size of the vocabularies that comprise PDML can then be kept manageable by defining the Integration Schema at a level of semantic abstraction that is general enough to accommodate the semantic requirements of all the Application Transaction Sets.
 Unlike the more concrete Application Transaction Sets, the abstract Integration Schema is less susceptible to semantic drift due to its inherent semantic "fuzziness". The generalized concepts and structures are meant to serve as a vehicle for carrying or conveying more precise semantics, but they "mean" more things and are thus able to accommodate a wider variety of meanings. It is also likely that the generic model is more "standardizable" within a large meaning community: it is less likely to change over time because it can "float" with the drifting semantic requirements of concrete models.
 Integration Schema
 All of the PDML Application Transaction Sets are views (or subsets) of product data necessary for DoD weapon system support. They overlap with respect to the data they include - elements such as part_number and drawing_number are common to two or more of the views. Thus, the semantics of the ATSs overlap and must be reconciled in the "neutral", "generic" Integration Schema.
 The PDML Integration Schema is a vocabulary based on the STEP Integrated Resources (ISO 10303, cf. [6, 12]). Like the Integrated Resources, it serves as an integrating mechanism - an integrated view - of the product data used within the communities/applications represented by the Application Transaction Sets. Unlike STEP, however, data is not exchanged using this neutral view, but rather using the external views. There is no DTD for the Integration Schema, nor is there XML data based on it. The Integration Schema is not intended to be directly used for product data exchange. Rather, it is more appropriate to consider it a temporary neutral form for integration and view translation.
 When an integrated, cross-application view of product data is needed, data is extracted from the appropriate systems using their Application Transaction Sets, integrated via the Integration Schema, and then converted back to a specific Application Transaction Set view. The PDML Toolkit provides the mapping and conversion capabilities that insulate the users of the individual views from the complexity of the mapping process.
 Mapping Specifications
 The Application Transaction Sets are application-specific views of product data and define a narrow context of data usage. The Integration Schema is an application-independent view of product data and establishes a context of product data usage that encompasses the contexts of the application views. As a view, each Application Transaction Set can be considered a particular interpretation of the Integration Schema. This interpretation is formally specified by a Mapping Specification.
 Mapping is more than conversion between data structures. It encompasses the interpretation of data based on contextual values - a value from a single field doesn't always mean exactly the same thing (though it always generally means the same thing). Based on a contextual value that indicates the use, a field such as document.id could be a drawing number, a tech order number, the designation of a standard or specification, or the identification of a digital file.
 The PDML Toolkit "internalizes" and uses the Mapping Specification to drive the conversion of XML data to/from the Integration Schema format.
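 The context-dependent interpretation of a field such as document.id can be sketched as a lookup keyed on a contextual value. The sketch is an assumption-laden illustration, not the PDML Mapping Specification format: the context field document.kind, the context codes, and the target field names are all hypothetical; only document.id and the four kinds of interpretation come from the text.

```python
# Hypothetical context codes and target field names; only document.id and the
# four kinds of interpretation are taken from the text.
INTERPRETATION_BY_KIND = {
    "drawing": "drawing_number",
    "tech_order": "tech_order_number",
    "standard": "standard_designation",
    "digital_file": "file_id",
}


def interpret_document_id(integration_record):
    """Map a generic document.id to a specific field using a contextual value."""
    kind = integration_record["document.kind"]  # hypothetical context field
    try:
        target_field = INTERPRETATION_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"no interpretation defined for context {kind!r}") from None
    return {target_field: integration_record["document.id"]}


print(interpret_document_id({"document.id": "D-1001", "document.kind": "drawing"}))
print(interpret_document_id({"document.id": "TO-0042", "document.kind": "tech_order"}))
```

 The same generic field thus yields differently named, more precise fields depending on context, which is what distinguishes a mapping from a simple renaming of structures.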
 

Standardizable and Adaptable Data Semantics

 The requirement for complete and unambiguous data semantics was clearly met by the PDML design/implementation approach described above. By clearly delineating the usage community, the semantics of the XML Vocabulary can be clearly defined. Meeting the "standardizable" and "adaptable" requirement isn't more difficult, but requires a more complex explanation.
 With respect to the semantics of application data, "standardizable" and "adaptable" are opposing forces. Standardizing the meaning of a vocabulary term, for example, precludes adaptability by definition - if the meaning can be adapted to suit a particular use, then it does not have a standard meaning! Unfortunately, natural language has always exhibited and will always exhibit this tendency for semantic drift as human activities elicit/discover the need for new meanings and ways to express them - and IT standards will not overcome this tendency. Rather, any such standard will be a straitjacket that will be overcome by:
 · "overloading" the semantics of a term (i.e., making it "mean" things other than what was intended);
 · the creation of usage guidelines to control how the vocabulary is used (e.g., the myriad of Federal Guidelines for EDI usage);
 · the definition of variant "flavors" of the vocabulary to suit particular needs (e.g., the variants of HTML supported by different browsers).
 This is the fundamental conundrum of semantic data standards: How can an XML vocabulary be defined that is adaptable with respect to the subtle semantic shades of a particular use, yet can also be standardized to the degree that application software can correctly "interpret" the terms?
 PDML solves this problem by again appealing to a particular community of users and assigning the community the responsibility for "standardizing" the vocabulary and defining the rules for adaptation/evolution of the vocabulary semantics. This kind of "standardization" is easy to see, but the requirement for integrated data semantics suggests that the mapping specification and the Integration Schema also be considered for standardization.
 Standardizing the syntax of the mapping specification language and the content specification language used to define the Application Transaction Set and Integration Schema vocabularies is a fundamental requirement to ensure computability/processability of the complex information structure. The standardization of a particular mapping specification should, like the Application Transaction Sets, be the responsibility of the usage community.
 The standardization of the Integration Schema, as an XML vocabulary, is a question open to debate. On one hand, it would make sense to standardize the Integration Schema with respect to the usage community comprised of the smaller communities that have specified mappings to it. On the other hand, it really isn't necessary to standardize the Integration Schema because, as changing requirements force it to evolve, the mapping specifications can simply be updated to reflect the changes (just as they are modified to reflect evolutionary changes in the Application Transaction Sets).
 (Note: PDML is not a standard. It has not been reviewed and approved by any recognized standards body and there are no plans for pursuing standardization. This discussion, therefore, pertains to the possible standardization of PDML or a PDML-like information structure.)
 Formal specification of semantic content
 Recognizing that, as a data specification language, XML DTD syntax is rather impoverished with respect to semantic features (e.g., datatypes), PDML chose the EXPRESS language [11] to formally specify the semantic content of the Application Transaction Sets and the Integration Schema. Because it was defined within an industrial data management environment, EXPRESS has all the semantic features and integrity constraint mechanisms for specifying unambiguous (or much less ambiguous) content for XML documents. Also, PDML approached the problem of XML vocabulary development from the "data management" perspective rather than the "document" perspective, so EXPRESS was more suited to PDML objectives by the nature of its design. Two frequent and relevant questions regarding the choice of EXPRESS are: · Why not use UML? (Unified Modelling Language, cf. [8]) · Why not use XML Schema or XML-Data?
 Since the focus of PDML was the semantics of the data structures, UML was considered "overkill" or "over-featured" - too many aspects of the language are irrelevant to project objectives. XML Schema and XML-Data are both functional replacements for Document Type Definitions (DTDs) and, thus, not only carry forward the "document mindset" inherited from SGML but also contain features irrelevant to data semantics and data management (e.g., parameter entities). In addition, neither of these is a recognized standard, whereas EXPRESS is an ISO standard.
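 The datatype gap in DTDs mentioned above can be illustrated with a small sketch. A DTD can say only that an element contains character data, so a typed rule has to be enforced elsewhere. In EXPRESS such a rule would roughly correspond to a REAL attribute with a WHERE rule. The `part` and `mass` element names and the `check_mass` helper are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# A DTD could declare <!ELEMENT mass (#PCDATA)> and nothing more, so both
# documents below would be valid against it. (Element names are hypothetical.)
GOOD = "<part><mass>12.5</mass></part>"
BAD = "<part><mass>heavy</mass></part>"


def check_mass(xml_text):
    """Apply a typed rule that a DTD cannot state: mass is a real number > 0."""
    text = ET.fromstring(xml_text).findtext("mass")
    try:
        value = float(text)
    except (TypeError, ValueError):
        # Missing element (None) or non-numeric content.
        return False
    return value > 0
```

 Here `check_mass(GOOD)` succeeds and `check_mass(BAD)` fails, although a DTD-validating parser would accept both documents.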
 

Leverage existing technologies

 Another objective in the development of PDML was leveraging existing technologies to the greatest extent possible. Besides drawing on XML as a data encoding mechanism, the design of PDML also drew on the STandard for the Exchange of Product model data (STEP - ISO 10303). In particular, PDML leveraged:
 · The "integration through abstraction" and interpretation architecture of STEP;
 · The STEP Integrated Resource data structures in the construction of the Integration Schema.
 PDML actually developed no new technology. Rather, it simply combined bits of existing technology to deploy integrated, web-based data resources.
 

XML Vocabulary Design Issues

 Broadly speaking, any and all good data structure design principles are applicable to the design of XML vocabularies. There are, however, a few design issues highlighted by the PDML development experience that are particularly important with respect to the use and integration of data resources on the web:
 · Local autonomy versus global applicability
 · Impediment of SGML "mindset"
 · Complexity versus acceptance
 · Keys, identifiers, and cross-platform uniqueness
 Local autonomy versus global applicability
 This "issue" is perhaps the most fundamental realization of PDML: that local communities have a right to govern themselves, but they must also buy into an encompassing government framework in order to interoperate with other communities within that structure. The issue, then, is how to design a framework that enables both local autonomy and global applicability; the tradeoff is between the data integrity offered by local solutions and the scalability of global solutions. Federated databases and PDML both provide candidate mechanisms, but more study of the tradeoffs and the role of semantics is needed.
 Impediment of SGML "mindset"
 Because of its roots in SGML and document publishing/management, XML has inherited a hierarchical "mindset" that is resistant to change. This mindset is reflected in the XML tools and the DTD designs that are being popularized across the net - everything is built and displayed as "trees".
 For many applications, system integration and data management practice long ago abandoned hierarchical structures in favor of network structures that are more flexible and more accurately reflect the reuse/sharing of concepts in a data file. The ID/IDREF mechanism in XML provides this "network pointer" capability for construction of XML vocabularies and XML documents, but it is a little-utilized feature of the language. This issue of hierarchy versus network will become more evident as the number of vocabularies grows and the need to integrate vocabularies is more fully realized.
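 The "network pointer" use of ID/IDREF can be sketched as follows. The document and its element and attribute names are hypothetical. Because no DTD is supplied, the parser does not know which attributes are of type ID or IDREF, so this sketch resolves the references by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one supplier is shared by two parts instead of being
# repeated inside each part, as a strictly hierarchical design would require.
DOC = """
<catalog>
  <supplier id="s1"><name>Acme</name></supplier>
  <part id="p1" supplier="s1"><name>Bracket</name></part>
  <part id="p2" supplier="s1"><name>Bolt</name></part>
</catalog>
"""

root = ET.fromstring(DOC)

# Index every element that carries an id attribute (the ID side).
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# Follow each part's reference (the IDREF side) to the shared supplier element.
suppliers = {
    part.findtext("name"): by_id[part.get("supplier")].findtext("name")
    for part in root.iter("part")
}
```

 Both parts resolve to the same `supplier` element, so a change to the supplier is made in one place. That sharing is what a pure tree of nested elements cannot express without duplication.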
 Complexity versus acceptance
 That this issue is important to the widespread adoption and use of XML vocabularies is evidenced in the facts that:
 1) SGML was standardized in 1986 and gained primary acceptance within government applications, but never caught on publicly. (SGML is a complex standard.)
 2) HTML caught on with the general public very quickly. (HTML is - relatively - simple.)
 3) XML caught on with the general public very quickly. (XML is simple with the background provided by HTML experience.)
 4) All those "xxx for Dummies" books are so popular.
 Users of internet technology want simple solutions - they want all that complex stuff hidden away under the hood of the application. They want appliances.
 This is not a technical issue, but a business issue. Who is the prospective target audience of an XML vocabulary?
 Keys, identifiers, and cross-platform uniqueness
 The persistent identification of a data object as it moves across multiple platforms and the relationship of that object and its identity to a real-world thing that it may designate is a huge problem the import of which has yet to be realized. Aside from spatio-temporal existence (i.e., "I think, therefore I am"), identification of something is always based on some context within which the identifier is unique. The context-sensitivity of identifiers is a problem for system integration and application interoperability because when integrating data, it is extremely difficult to discern which data is the same data (and therefore should not be inserted into the dataset) and which is different - and using identifiers to do this can be dangerously misleading. The design of XML vocabularies and exchange of XML documents will face this same problem.
 This is a real research issue worthy of academic study. Philosophers and psychologists have dealt with the notion of identity for hundreds, if not thousands, of years. The relationship between identity, data, and the real world is just beginning to be explored.
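 The danger of treating context-dependent identifiers as globally unique can be shown with a small sketch. The two sources, their context names, and both merge functions are hypothetical.

```python
# Hypothetical data: two sources that each use "1001" as a locally unique key.
source_a = {"context": "plant-a", "rows": {"1001": "bracket", "1002": "bolt"}}
source_b = {"context": "plant-b", "rows": {"1001": "gasket"}}


def merge_naive(*sources):
    """Treat the bare identifier as globally unique (misleading)."""
    merged = {}
    for src in sources:
        merged.update(src["rows"])  # later sources silently overwrite earlier ones
    return merged


def merge_qualified(*sources):
    """Qualify each identifier with the context in which it is unique."""
    merged = {}
    for src in sources:
        for ident, value in src["rows"].items():
            merged[(src["context"], ident)] = value
    return merged


naive = merge_naive(source_a, source_b)
qualified = merge_qualified(source_a, source_b)
```

 The naive merge loses the bracket because two different objects share the identifier "1001". Qualifying identifiers by context avoids that collision, but it does not address the converse case in which the same real-world thing carries different identifiers in different contexts.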
 

Summary and Conclusions

 PDML introduces to the web a new paradigm for data use, integration, and exchange. Mainstream pursuits of standardization of XML vocabularies and DTDs focus their efforts on the definition of the elements of their vocabulary - within their own particular usage community! There is nothing inherently wrong with these efforts, and it is extremely important that data usage communities identify and define the principal elements that they use to talk among themselves and exchange information. However, experience with data exchange standards shows that such solutions are not:
 · portable;
 · usable or useful (or not very usable/useful) outside the usage community that defined them;
 · integrated with other vocabularies (by the single fact that they were developed within and for a particular community and without consideration of other vocabularies).
 PDML provides a mechanism and paradigm in which data exchange is no longer bound to semantically "flat" schemas, but leverages the abstraction and context-sensitivity that people use unconsciously every day in human language use. In this way, individual usage communities can "have their cake and eat it, too" - they can use/exchange data in accordance with their own vocabulary and definitions, but still have an unambiguous path for exchanging data with users from other communities.
 Unlike many XML DTD development efforts, PDML is not "thrown together" to meet a few short-sighted needs. Rather, PDML is based on database schema modelling principles and adopts a long-term view of data use and integration. Short-term solutions are brittle and quickly become obsolete. The Integration Schema offered by PDML is a flexible, stable, and reusable long-term solution that can be applied (through interpretation) to any number of usage communities. The usage communities may come and go, but the Integration Schema will remain and provide an integration point for application views both past and present.
 The independent development of XML DTDs as vocabularies for individual and disjoint communities of users will result in a cacophony of incompatible - or worse: partially compatible - data specifications. If there is to be any hope that web resources can ever truly be integrated, some form of relationship or structure that provides a method for mapping between distinct communities is essential. PDML provides a valuable step in that direction.

References

 1. Batini, C., Ceri, S., and Navathe, S., Conceptual Database Design: An Entity-Relationship Approach. Benjamin/Cummings, Redwood City, 1992. ISBN 0-8053-0244-1.
 2. Batini, C., Lenzerini, M., and Navathe, S.B., A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, 18, 4, (1986), pp. 323-364.
 3. Berger, P. and Luckmann, T., The Social Construction of Reality: A Treatise in the Sociology of Knowledge. paperback ed. Doubleday, New York, 1966. ISBN 0-385-05898-5.
 4. Berners-Lee, T., Connolly, D., and Swick, R.R., Web Architecture: Describing and Exchanging Data. 1999: World Wide Web Consortium.
 5. Bray, T., Hollander, D., and Layman, A. Namespaces in XML. (1998) http://www.w3.org/TR/PR-xml-names. Date of page: 1998-11-17.
 6. Danner, W.F. Developing Application Protocols (APs) Using the Architecture and Methods of STEP (STandard for the Exchange of Product data). Fundamentals of the STEP Methodology. National Institute of Standards and Technology. NISTIR 5972. 1997.
 7. Elmasri, R. and Navathe, S.B., Fundamentals of Database Systems. Benjamin/Cummings, Redwood City CA, 1989. ISBN 0-8053-0145-3.
 8. Fowler, M. and Scott, K., UML Distilled: Applying the Standard Object Modeling Language. Addison Wesley Longman, Reading, Mass, 1997. ISBN 0-201-32563-2.
 9. Goldfarb, C. and Prescod, P., The XML Handbook. Open Information Management, C. Goldfarb, ed. Prentice Hall, Upper Saddle River, NJ, 1998. ISBN 0-13-081152-1.
 10. Gonsalves, A. and Pender, L., Schema Fragmentation takes a bite out of XML, in PCWEEK. 1999. p. 1, 16.
 11. ISO. Industrial automation systems and integration - Product data representation and exchange - Part 11: EXPRESS Language Reference Manual. ISO 10303-11:1994, Geneva, 1994.
 12. ISO. Industrial automation systems and integration - Product data representation and exchange - Part 41: Integrated generic resources: Fundamentals of product description and support. ISO 10303-41:1994, Geneva, 1994.
 13. Sowa, J.F., Knowledge Representation. Thomson Learning, Pacific Grove CA, 2000. ISBN 0-534-94965-7.
 14. Tsichritzis, D. and Klug, A. The ANSI/X3/SPARC DBMS Framework: Report of the Study Group on Database Management. AFIPS Press. 1978.
