
 Brown, Steve 
POET Software
 San Mateo 
 USA 
 
Steve Brown
 Product Manager
POET Software
  999 Baker Way Suite 100 San Mateo (California)  USA (94404)
Email: sbrown@poet.com Web site: http://www.poet.com
 Biography
 Steve has spent 5 years working in the XML / SGML world. After working for EBT (later Inso) helping the company capitalize upon the SGML and nascent XML standards with its electronic publishing technology, Steve joined POET Software to help launch POET Content Management Suite. His areas of expertise include SGML / XML technology, technical documentation, web content management, and e-commerce. Steve has an A.B. in anthropology from Brown University.
 

Scope

 
This paper discusses the emerging use of "content reuse", or the use of the same content in several publications, with SGML and XML. The discussion primarily deals with "complex content", the area of the information spectrum traditionally addressed by SGML, which includes maintenance manuals, documentation, and the like, as opposed to traditional "data" applications (e.g., HR, accounting, and customer data). First, the paper defines "content", describes the content paradigms that are changing as a result of new technologies, and explains why content must be reused in commercial settings. Then I will go into various reuse strategies using the standards and tools of today and tomorrow. Finally, I will address what needs to be done to help publishers achieve maximum content reuse. (Note that "repurposing", or the delivery of the same publication to different audiences or on different media, which has its own distinct set of issues and solutions, is not covered here.)
 

Content: What is it?

 Content is a type of information that contains complex ideas. These ideas cannot be processed by machines, and the consumers of content are humans, not machines. With today's technology, machines cannot create these ideas and put them into human language; humans need to put them into the form of publications.
 

Why Reuse Content?

 Complex content creation is extremely effort-intensive. This effort is necessary because it is critical that the content is accurate and that the person on the receiving end clearly understands what is being communicated. In the case of aircraft maintenance manuals, it could even mean the difference between life and death. So content must not only be created; it must also go through a review process in which many eyes and hands will touch it. Content will be routed to editors, changes will be tracked, and in some cases translations will be required. In short, every approved page, section, paragraph, and sentence represents a lot of human effort, which translates into costs for the complex content publisher. With so much value (and money) wrapped up in content, it makes sense to reduce costs by reusing as much content as possible.
 

The Content Revolution

 A content revolution, parallel to the industrial revolution, is helping to bring about new ways of creating and distributing complex content. With the industrial revolution, commercial production shifted from the paradigm of individually crafted items to the paradigm of mass-produced items with interchangeable parts. Sophisticated manufacturers learned to reuse parts for different items in their product lines. Gradually, as the manufacturing industry turned into networks of suppliers, standards were created to ensure the interchangeability of parts between suppliers.
 A similar revolution, necessitated by the need to reuse content, and facilitated by new technologies such as XML and content management systems, is introducing a new reusable content paradigm in complex content publishing. Traditionally, content consists of "documents". Traditional documents have a beginning and an end, and may reference or cross-reference other documents, but do not necessarily reuse content from other sources in a systematic way. In addition, documents that need to be distributed in different settings (e.g. in print for managers and on the Web for engineers) need to be manually converted or reformatted and maintained as separate sources, lowering the ROI of reuse.
 
Increasingly, new publications will consist of interchangeable parts. Publications will be open-ended, liberally reusing content from other sources and hyperlinking to related texts. They will separate content and formatting, enabling many different kinds of presentation on different delivery media (e.g. print, CD, Web, handheld devices). Most importantly, this content will be as much "assembled" from existing sources as it is created, and personalized for various audiences and purposes.
 

A scenario

 I. M. Smart Hardware builds circuit boards. Many of the components are manufactured elsewhere, and the company has taken the same approach to documentation. Each supplier is required to provide reference and maintenance documentation for each component using an industry-standard document type. I. M. Smart itself supplies reference and maintenance documentation for the circuit board and the circuitry. The company uses web-style hypertext links to provide navigation to supplementary information. Where only portions of a supplier's content are applicable, it embeds just those portions. It also links to, or even embeds, real-time service bulletins from its suppliers. But the best part is the ability to assemble documentation not through cut-and-paste approaches, but by selecting the content that should go into a publication and using software tools to put the publication together. The assembly process uses stylistic rules that govern which reuse methods are used. The documentation is delivered to service centers, value-added resellers, and others over the Web, and is supplemented by an online tool for on-demand print. Customers are happy because they have fast access to accurate, up-to-date content. And I. M. Smart is happy because, with its existing systems, it took little more than assembly and some review to put it all together.
 

Content reuse: Examples

 The following examples from the aerospace, automotive, telecommunications, and manufacturing industries demonstrate how content is being reused in XML and SGML applications today:
 - Procedures: Tasks for repairing different components inside an engine, for example, are reused. Many tasks include the same subtasks, such as subtasks for opening and closing the panel that hides the engine.
 - Troubleshooting "topics": An integrated expert system suggests topics, including lots of reused criteria and questions, for diagnosing problems. The SGML content contains logic for hyperlinking to the next logical questions, eventually leading to the diagnosis.
 - Systems: Product lines that share components reuse the documentation for these components.
 - Warnings: Warnings are reused whenever dangerous parts or procedures are documented.
 Commercial publishers are also especially aware of the value of content reuse. To publishers, content is their product and their source of revenue. To maximize their return on investment, they must reuse content in as many publications as possible, just as manufacturers reuse as many parts as possible. Publishers also see reuse technologies as an opportunity to offer the service of "information on demand", in which consumers (rather than publishers) choose what goes into a publication.
 

Assembled Content: Models

 The following technologies and standards are paving the way for the future of reusable content:
 - SGML / XML markup
 - SGML / XML entities
 - Database-based content management systems
 - XLink and XPointer
 

How not to do it: Cut and Paste

 
Using traditional, proprietary tools (as well as XML tools) publishers can easily "cut and paste" content from one document to another. But this method only works until the reused content changes. If a section copied into another document changes, the publisher has to manually find every place where content is reused, and make manual updates to each of these places. Obviously, this process is labor-intensive and error-prone. Instead of cutting and pasting, it makes sense to maintain reused content in a single place, and reference it from wherever it is reused. SGML and XML are the optimal foundation for this kind of reuse.
 

SGML and XML

 
SGML and XML enforce the structure necessary to enable reuse of content. Through structural enforcement, XML can ensure interoperability between XML-ready systems. Structural enforcement also means reused content can be validated to make sure that it won't break the new document's structure. Basically, structural validation ensures that the applications designed to work with a certain type of content (in authoring, management, or delivery) will work the way they're supposed to.
 The use of SGML has traditionally suffered from expensive, immature, difficult-to-use tools, a high cost of DTD development, and the resulting need for highly trained personnel to put it into practice. In addition, SGML separates content from formatting, and SGML authoring tools aren't designed primarily for formatting, so contributors accustomed to WYSIWYG authoring must learn to "suspend disbelief" and author content without formatting in mind. As a result of the expense, difficulty, and steep learning curve of SGML, many corporate and commercial publishers have found the investment in SGML hard to justify. But this is already changing with the introduction of XML, which is making the development of tools less expensive, which will lead to more tools, better tools, and more widespread use.
 

XML/SGML Entities

 
Entities, defined as part of the XML and SGML specifications, provide a standard, vendor-neutral means of reusing content. Entities are XML or SGML fragments that can be stored either inside larger documents or as external objects (e.g. files or objects in a database). XML / SGML publications use entity references to point to entities. The XML or SGML application resolves these references, enabling content to be reused on paper or in electronic form. Entity-based reuse is widespread in many industries, including aerospace, pharmaceutical, manufacturing, and reference publishing, among others.
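 As a minimal illustration of entity-based reuse, the Python sketch below parses a small XML document whose internal DTD subset declares one entity and references it twice. The element names (`manual`, `step`), the entity name, and the warning text are all hypothetical, and an internal text entity stands in here for the external file or database object that a real publication would more likely use.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one entity declared once, referenced in two places.
DOC = """<!DOCTYPE manual [
  <!ENTITY power-warning "Disconnect power before servicing.">
]>
<manual>
  <step>&power-warning; Open the access panel.</step>
  <step>&power-warning; Replace the fuse.</step>
</manual>"""

# The parser resolves each entity reference while reading the document.
root = ET.fromstring(DOC)
steps = [step.text for step in root.findall("step")]
for text in steps:
    print(text)
```

 Changing the entity declaration changes both steps at once, which is the single-source behavior that cut and paste cannot offer.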
 Entity-based reuse is very effective, but it requires manual management effort. If a publisher plans to reuse content contained within an existing document, the content must be taken out of the document, put into the form of an entity, placed in the file system or database, and given an identifier so it can be referenced. Finally, the references to the entity must be placed in the target publications. In short, it's a fair amount of work to reuse content in this way.
 Entities also do not account for a crucial part of the content process: version control. Large documents must be reviewed as a whole, and when shared content is updated, the publication using reused content may not be "ready" for the updates. Entity references have no way of pointing to a version, and there is no standardized way of representing version information or versioned content, outside of creating complicated tagging schemes. So if a publisher needs to use an older version of a shared entity, they must create a new, separately maintained entity, and the entity is no longer "reused".
 

Database-based content management systems

 Content management systems provide alternative methods for reusing content that are not always "standard" but help get beyond the limitations of XML / SGML entities. These systems, usually built upon databases, provide functionality above and beyond the file system. Database-based systems for managing complex content provide access control and versioning, which are not easily possible in the file system. In addition, they can automatically manage entities and entity references. But beyond this, content management systems can provide database-enabled content reuse methods not possible with entities.
 

Content object sharing

 Some content management systems can reuse content through "sharing". Users can simply select content from anywhere in a publication and "share" it somewhere else, without going to the trouble of creating a new entity. Best of all, updates are automatically replicated everywhere the content is shared.
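 The idea of sharing by reference can be sketched in a few lines of Python. The class names, object identifiers, and texts below are invented for illustration and do not describe any particular product; the point is only that publications hold references to a single stored object, so one update reaches every place it is shared.

```python
from dataclasses import dataclass, field


@dataclass
class ContentStore:
    """Hypothetical store holding each content object exactly once."""
    objects: dict = field(default_factory=dict)

    def put(self, object_id, text):
        self.objects[object_id] = text


@dataclass
class Publication:
    """A publication is an ordered list of references, not copies."""
    title: str
    refs: list = field(default_factory=list)

    def render(self, store):
        # References are resolved at render time, so edits show up everywhere.
        return [store.objects[ref] for ref in self.refs]


store = ContentStore()
store.put("warn-01", "Wear eye protection.")
store.put("task-07", "Remove the engine cover.")

service_manual = Publication("Service Manual", ["warn-01", "task-07"])
quick_guide = Publication("Quick Guide", ["warn-01"])

# One update to the shared object...
store.put("warn-01", "Wear eye and ear protection.")

# ...is visible in both publications.
print(service_manual.render(store))
print(quick_guide.render(store))
```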
 

Content sharing in RDBMSs

 Content sharing is done most easily in content management systems that are built upon object-oriented database systems. XML and SGML are object-oriented, and any system that shares XML/SGML objects must have some knowledge of the XML/SGML object structure. Relational database management systems, or RDBMSs, can store objects only by mapping them to database tables and creating table joins, leading to significant performance hits. So instead of storing every XML/SGML object, RDBMS developers usually create mapping schemes that expose only a few levels of object granularity. The resulting "objects" are stored in the file system and can be shared. Unfortunately, however, there is no guarantee that the publisher won't find a need to either change the XML/SGML structure or manage content at a different level of granularity. When this happens (and it usually does) the database developer must remap the entire DTD to new database tables and joins. This is a process that has high development costs, and may even take the publishing team offline for a significant period of time.
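 To make the mapping problem concrete, here is a rough Python sketch that stores every element of a small document as one row in a relational table and then rebuilds the document. The table layout, column names, and sample document are assumptions chosen for brevity, not the scheme of any actual product, and SQLite is used only because it ships with Python.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = "<manual><task><title>Open panel</title><step>Remove screws.</step></task></manual>"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element ("
    " id INTEGER PRIMARY KEY, parent_id INTEGER, position INTEGER,"
    " tag TEXT, text TEXT)"
)


def shred(elem, parent_id=None, position=0):
    """Store one element as a row, then recurse into its children."""
    cur = conn.execute(
        "INSERT INTO element (parent_id, position, tag, text) VALUES (?, ?, ?, ?)",
        (parent_id, position, elem.tag, elem.text),
    )
    row_id = cur.lastrowid
    for index, child in enumerate(elem):
        shred(child, row_id, index)


def rebuild(row_id):
    """Reassemble an element; every level of nesting costs another query."""
    tag, text = conn.execute(
        "SELECT tag, text FROM element WHERE id = ?", (row_id,)
    ).fetchone()
    elem = ET.Element(tag)
    elem.text = text
    children = conn.execute(
        "SELECT id FROM element WHERE parent_id = ? ORDER BY position", (row_id,)
    ).fetchall()
    for (child_id,) in children:
        elem.append(rebuild(child_id))
    return elem


shred(ET.fromstring(DOC))
root_id = conn.execute("SELECT id FROM element WHERE parent_id IS NULL").fetchone()[0]
rebuilt = ET.tostring(rebuild(root_id), encoding="unicode")
print(rebuilt)
```

 Even in this toy form, rebuilding a document takes one lookup per element, which hints at why full-granularity storage in tables becomes expensive for large publications.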
 

Content sharing in ODBMSs

 In contrast to RDBMS-based content management systems, systems based upon object databases, or ODBMSs, are inherently more flexible and do not run into object mapping-related difficulties. XML/SGML content has an object-oriented structure, and ODBMSs are designed to store and manage object-oriented data such as XML and SGML. ODBMSs can easily store every element in the XML/SGML structure as an object without the performance hits taken by an RDBMS trying to do the same thing. But more importantly, ODBMSs inherently "understand" the XML/SGML structure; by design, they store and maintain object relationships. For this reason, ODBMSs have a much easier time creating references to XML/SGML objects at any level of granularity. The content management system can expose the XML/SGML structure, and users can arbitrarily select content objects for reuse at any level of element granularity. Best of all, if the content management system vendor has done their homework, DTD changes need not require reprogramming. The publisher benefits from much greater flexibility in content reuse, and from the elimination of downtime.
 

Proprietary methods

 Whatever the underlying database system, database-based content sharing methods are usually proprietary. Until recently (with XLink and XPointer, discussed below), there have been no SGML or XML standards for referencing content that is not stored in separate entities, so content management system vendors have come up with their own schemes for database-enabled content reuse. One method, for example, involves placing database object pointers in processing instructions (PIs) inside the XML that tell the database to share a version of a specific content object. Proprietary methods such as this are not true to the "spirit of SGML", because the method will only work in that particular content management system. However, publishers will have to weigh this disadvantage against the benefits of the feature.
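 A processing-instruction pointer of this kind might look something like the following Python sketch. The PI target `cms-share`, its `object` and `version` pseudo-attributes, and the dictionary standing in for the database are all made up for illustration; real systems define their own, mutually incompatible conventions, which is exactly the drawback described above.

```python
from xml.dom import minidom

# Hypothetical PI target ("cms-share") and pseudo-attributes; not a standard.
DOC = """<manual>
  <title>Pump Maintenance</title>
  <?cms-share object="warn-01" version="3"?>
</manual>"""

# Stand-in for the database: (object id, version) -> content.
DATABASE = {("warn-01", "3"): "Wear eye protection."}


def parse_pseudo_attributes(data):
    """Split 'name="value" name="value"' PI data into a dict."""
    pairs = (item.split("=", 1) for item in data.split())
    return {name: value.strip('"') for name, value in pairs}


dom = minidom.parseString(DOC)
resolved = []
for node in dom.documentElement.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE and node.target == "cms-share":
        attrs = parse_pseudo_attributes(node.data)
        resolved.append(DATABASE[(attrs["object"], attrs["version"])])

print(resolved)
```

 An XML tool that does not know the `cms-share` convention will pass over the instruction without resolving it, so the shared content is only visible inside the system that defined it.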
 

Reusing versions

 Despite the concerns of non-vendor-neutral functionality, database content reuse has the added advantage of utilizing database functionality to solve other content management problems. A notable example of this is the reuse of versions of objects. In many cases, it is acceptable to always reuse the latest version of a content object. Reused entities always reflect the current state of the entity, i.e. the latest version, by default. But this may not be adequate in the case of publications where the latest version of a shared content object is not compatible with the rest of the publication. To address this problem, some database content management systems can let users either share the latest version of the content object or "fix" the share to a previous version.
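 The difference between a share that follows the latest version and one that is "fixed" to an earlier version can be sketched as follows. The `VersionedStore` class, the object identifier, and the torque values are hypothetical and only illustrate the behavior; they are not taken from any specific system.

```python
class VersionedStore:
    """Hypothetical store that keeps every saved version of each object."""

    def __init__(self):
        self._versions = {}  # object id -> list of texts, oldest first

    def save(self, object_id, text):
        self._versions.setdefault(object_id, []).append(text)
        return len(self._versions[object_id])  # 1-based version number

    def resolve(self, object_id, version=None):
        """Return a fixed version if given, otherwise the latest one."""
        history = self._versions[object_id]
        return history[-1] if version is None else history[version - 1]


store = VersionedStore()
store.save("torque-spec", "Tighten to 40 Nm.")
store.save("torque-spec", "Tighten to 45 Nm.")

# A share is (object id, version); version None means "always latest".
floating_share = ("torque-spec", None)
fixed_share = ("torque-spec", 1)

latest = store.resolve(*floating_share)
pinned = store.resolve(*fixed_share)
print(latest)
print(pinned)
```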
 

Mapping objects to new DTDs

 Another use of database functionality in content reuse is in "mapping" content to different DTDs. Databases do not actually "store" XML; rather, they store objects, types, and relationships, and translate them to XML syntax upon export. Because content objects are abstracted from the specific tags in this way, they can be easily reused in documents of different DTDs. For example, a procedure that is used in two different types of documents could be stored in a central database, but reused in different types of documents (such as two different types of maintenance manuals) without modifying the source.
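 One way to picture this abstraction is the Python sketch below, in which a stored procedure carries only types and text, and the tag names are chosen at export time. The data layout and both tag vocabularies are assumptions invented for the example; an actual system would derive the mapping from its DTDs.

```python
import xml.etree.ElementTree as ET

# A stored object: typed nodes with text and children, no tag names.
procedure = {
    "type": "procedure",
    "children": [
        {"type": "title", "text": "Replace filter"},
        {"type": "step", "text": "Unscrew the housing."},
    ],
}

# Hypothetical tag vocabularies for two document types.
MAINTENANCE_TAGS = {"procedure": "task", "title": "title", "step": "step"}
OVERHAUL_TAGS = {"procedure": "proc", "title": "heading", "step": "action"}


def export(node, tag_map):
    """Serialise a stored object using the tags of the target DTD."""
    elem = ET.Element(tag_map[node["type"]])
    elem.text = node.get("text")
    for child in node.get("children", []):
        elem.append(export(child, tag_map))
    return elem


maintenance_xml = ET.tostring(export(procedure, MAINTENANCE_TAGS), encoding="unicode")
overhaul_xml = ET.tostring(export(procedure, OVERHAUL_TAGS), encoding="unicode")
print(maintenance_xml)
print(overhaul_xml)
```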
 

Document assembly

 Perhaps the most exciting use of content management systems is the notion of "document assembly". Document assembly takes us 180 degrees from inefficient document processes, in which documents are each authored from scratch, to the dynamic content paradigm, in which content is reused for new publications, saving time and money for the publisher. With document assembly, publishers have the ability to pick and choose existing content and easily assemble it into new publications. Configuration management tools, relying on database technology, take this concept one step further, letting users pick versions of components for new information products. In some industries such as aerospace, configuration management systems can also provide help with "effectivity", in which reused content varies slightly from one publication to another. Already commercial publishers are asking about similar solutions for "information on demand", in which publishers provide raw XML content, but individuals pick and choose content from a variety of sources, and build their own personalized publications. XML publications would be tagged consistently, making it possible to gather information from disparate publications. Document assembly tools would then make it possible to pick the XML pieces, put them in an XML framework, and build the new publication.
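 A bare-bones version of such an assembly step might look like the following Python sketch. The source publications, the `task` and `id` vocabulary, and the `assemble` function are all hypothetical; the sketch assumes only that the sources are tagged consistently, as described above.

```python
import copy
import xml.etree.ElementTree as ET

# Two hypothetical source publications tagged with the same vocabulary.
SOURCES = {
    "engine": ET.fromstring(
        '<manual><task id="e1"><title>Check oil</title></task>'
        '<task id="e2"><title>Change plugs</title></task></manual>'
    ),
    "brakes": ET.fromstring(
        '<manual><task id="b1"><title>Bleed brakes</title></task></manual>'
    ),
}


def assemble(title, selections):
    """Build a new publication from (source name, task id) selections."""
    publication = ET.Element("manual")
    ET.SubElement(publication, "title").text = title
    for source_name, task_id in selections:
        task = SOURCES[source_name].find(f"task[@id='{task_id}']")
        # Copy at build time so the assembled tree cannot alter the source.
        publication.append(copy.deepcopy(task))
    return publication


custom = assemble("Weekend Service", [("brakes", "b1"), ("engine", "e1")])
titles = [t.text for t in custom.findall("task/title")]
print(titles)
```

 The sources remain the single point of maintenance; the assembled publication is regenerated from the selection list whenever it is needed.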
 

The near future: XLink and XPointer

 The most exciting developments in XML standards are the XLink and XPointer specifications. XLink, in addition to providing multiple facilities for hypertext links, provides specifications for embedding content inline. XPointer is important because it adds facilities for pointing to places in target documents without making any changes to the target documents. With traditional HTML links, if you want to point to a specific place inside an HTML file, you have to add an "anchor" in the referenced document that the browser application finds when the link is traversed. XPointer specifies how to use anchors as well, but it also specifies other methods, such as treewalking, for pointing to the beginning and end of content to be embedded, again without making changes to the target document. XLink and XPointer specify methods for selecting content, linking to it, embedding it inline, or using it to replace other inline content. This approach may still share the versioning problem that entities have. However, users will benefit greatly from the efficiencies of this kind of flexible, standards-based content reuse when tools are built that implement assembly using these standards.
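 The treewalking idea can be illustrated with a small hand-rolled resolver in Python. This is not an implementation of the XPointer specification: the `/1/1/2` path notation, the `walk` function, and the sample document are assumptions made for the sketch, which simply counts child elements from the root to reach a location that has no anchor.

```python
import xml.etree.ElementTree as ET

TARGET = ET.fromstring(
    "<manual><chapter><para>Intro.</para><para>Torque: 45 Nm.</para></chapter></manual>"
)


def walk(root, child_sequence):
    """Follow a path like '/1/2': root element, then its 2nd child element."""
    steps = [int(part) for part in child_sequence.strip("/").split("/")]
    if steps[0] != 1:
        raise ValueError("a document has exactly one root element")
    node = root
    for position in steps[1:]:
        node = list(node)[position - 1]  # positions are 1-based
    return node


# Point at the second paragraph without adding an anchor to the target.
embedded = walk(TARGET, "/1/1/2")
print(embedded.text)
```

 Because the pointer depends on position, it breaks if the target document is restructured, which is one reason the versioning concern raised above still applies.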
 

Conclusion: it's the tools

 XML provides the foundation for helping us shift from a closed document paradigm to an open reusable content paradigm. Entities and extended linking are an integral part of this foundation. But these standards are intended as foundations only; it's up to tool vendors to exploit these standards to enable the actual work of easily creating and reusing content. As discussed, there are content management systems and other technologies that support these standards, and even add extra functionality where the standards leave off.
 The future of reuse depends not only on technically skilled users, but also on tools that make reuse easy. Tool vendors, system integrators, and others can help us truly take advantage of the efficiencies of reuse by making this functionality accessible to all users, not just techies. Publishers must demand that interfaces for their users be easy and tailored for specific tasks, not just delivered as "one size fits all" tools that cater to SGML-savvy users.
 In addition, publishers need to choose reuse solutions that are open. Closed systems that do not interoperate with other systems defeat the stated purpose of XML (interoperability) and might as well not support XML at all. If XML content is the product, the assembly line must be seamlessly integrated. XML ends up being more trouble than it's worth if users spend their time learning and using difficult tools that require manual effort; their time should be spent on the actual business of content creation, assembly, review, and production. So content management systems must tightly integrate with the authoring tools, workflow tools, print production tools, web delivery tools, etc., that suit the customer's unique needs. Tools must be tied together and tailored to make creation, review, and production easy and as automated as possible. Together, XML standards, easy-to-use tools, and open systems will help publishers reap the benefits of content reuse.
