
 

Building an XML Web Site Production System

 
Matt Turner
Manager, Application Development, PC World Online
501 Second Street
San Francisco, California 94107, USA
Email: matt_turner@pcworld.com

Biographical notice

Matt Turner is the Manager of the Applications Development Group at PC World Online and has been developing content management systems and just about every other kind of web application imaginable for the past two years. Before that, he developed database applications for poor unconnected PCs.

Recently Matt and his team of developers implemented an XML web site production system for PC World Online. After this experience, Matt is now convinced that relational databases are dinosaurs compared to XML-based systems.

 Introduction
About a year ago, PC World Online set out to develop a web site production system to store its wide range of documents in a single data store and generate many different versions of these documents in an automated fashion.
 This spring, PC World Online was able to roll out the first sections of its web site to use this new web site production system. Today, every page on PC World Online is generated from the data store of documents, from two-paragraph news stories to nine-page features.
 How did PC World Online create a system that was flexible enough to store its wide range of documents and yet so robust that it is the core of a high-volume web site? We used XML.
 

PC World Magazine and PC World Online

 PC World Magazine is one of the world's leading sources of information on the computer industry. Its target audience is computer-savvy managers and consumers, and much of the content is reviews of products and tips on how to better use your computer.
 PC World Online, a separate group which is part of PC World Magazine, is charged with all of the electronic publishing of the PC World Magazine content. The most visible venue for this content is PC World Online at www.pcworld.com. The web site is a high-traffic site which combines all of the content of the magazine with online exclusive features, news stories, a shareware library and more.
 In addition to the web site, PC World Online handles all of the electronic licensing for PC World Magazine. This entails taking the content from the magazine and converting it to many different formats, most of which are licensee specific.
 

The Document Storage and Delivery Puzzle

 PC World Online faced a number of challenges in developing a system to produce its web site and generate its licensee content. But the major problem was one of document storage and delivery.
 The system had to be able to store documents that have very different and complex structures and come from very different sources, such as:
 
  • A feature article from the magazine which may be 9 pages long, have 2 sidebars, 5 charts, 4 screen shots and 4 art images. This and all other articles from the magazine are produced as Quark files.
  • An online feature of the same length and complexity, but about a totally different topic. For instance the magazine feature may be about desktops and consist of product reviews and the online feature may be about MS Word and have tips about how to use that application. Online features are typically submitted in MS Word format.
  • An online news story that is four paragraphs long. Also submitted in MS Word.

 Then the system had to be able to produce documents in at least these basic formats:

  • PC World Online HTML in at least 5 different channels.
  • HTML, DHTML and XML for different browsers and possible future browsers.
  • Licensee content in an unknown number of formats.

     The system also had to store these documents with their complexity intact. The articles produced for the magazine are presented to readers in a very information rich manner. The magazine uses spot art, colors and placement to convey to the reader the structure of the article. For instance, within an article sidebars are set off with different backgrounds and items such as reviews or tips are presented with headers or boxes to indicate to the reader that these are special types of content.
     A primary goal for the data storage aspect of the system was to somehow store this kind of complexity along with the documents so that articles could, as much as is possible online, be presented to readers in the same kind of information rich layouts as the magazine.
     

    A Template and Database Solution?

     A traditional method of web site production is the template and database system. This type of system had been in use at PC World Online for simple areas of content, and its capabilities suggested that with a sophisticated implementation it could be the basis of the overall system.
     A template and database system usually involves a set of HTML templates with placeholders for content pieces and a separate database to hold the content. At the time of delivery an application reads the template, makes a call to the database and inserts the fields from the database into the template. This creates a document with the formatting of the template and the content from the database.
     This is a very powerful tool that for the first time allowed many sites, including PC World Online, to automate the delivery of much of their content and also create many different versions of the same content.
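The template and database approach described above can be sketched roughly as follows. This is only an illustration using Python's standard library, not the system PC World Online ran; the table layout, the column names and the `$title` / `$body` placeholders are all assumptions made for the example.

```python
import sqlite3
from string import Template

# Hypothetical page template: $title and $body are the placeholders.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1>$body</body></html>"
)

def render(conn, article_id):
    """Read one record and drop its fields into the template."""
    row = conn.execute(
        "SELECT title, body FROM articles WHERE id = ?", (article_id,)
    ).fetchone()
    return PAGE.substitute(title=row[0], body=row[1])

# Hypothetical single-table content store with index fields plus a body field.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles "
    "(id INTEGER PRIMARY KEY, title TEXT, category TEXT, body TEXT)"
)
conn.execute(
    "INSERT INTO articles VALUES "
    "(1, 'Top Desktops', 'reviews', '<P>Ten machines tested.</P>')"
)
page = render(conn, 1)
```

Note that the body field is inserted as an opaque string: the template can change everything around it, but nothing inside it.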
     But these versions can only be different in areas outside of the actual content. So the first delivery format requirement, the generation of different HTML for PC World Online, may be accomplished. But more complex formats require that the content itself must be transformed. An example of this is licensee content that requires a certain marker for headings and another for paragraphs. It is not possible to generate this format unless the structure of the content itself is somehow stored in a machine readable manner.
     In most implementations of this type of system, PC World Online's included, the database of content consists of a single table with index fields such as title and category. The actual article is stored in an additional database field. Because the primary focus of these systems is to generate pages for a web site, the article is usually lightly formatted with HTML. Minimally this means <P> tags, but there is no way to stop the natural tendency of writers and editors to offset subheads and more complex article elements with HTML formatting.
     Because HTML as we know it is no longer parsable and because it is difficult to control the entry of rogue tags, it is impossible to base any automated procedure on this content and thus impossible to change the format of the content itself at the time of delivery.
     To avoid this problem and still use a database is a very difficult task. One possible method involves expanding the single table to include a field for every possible item. The entire set of documents would be examined and then the most complex combination would serve as the model. Every article would then be stored in a single table that would have all the fields for a large article plus all the special fields for elements such as tips and product reviews.
     Another method would be to store the documents in a highly abstracted table structure. In this model, a document would consist of a web of records relating to one another, each with a different attribute - a record for the main body, a record for the sidebar, a record for the tips, etc.
     The first model is very inefficient, and both models are incredibly hard to build data entry tools for and even harder to maintain. And one of the primary lessons of application development for online publishing is that applications must be simple and easy to maintain.
     Because template and database delivery mechanisms are not truly flexible and storing documents in a complex manner in a database is daunting at best, the template and database model failed to meet the goals of the PC World Online system.
     

    The XML Solution: Document Storage

     Unlike database systems, SGML and its new offspring XML were not mainstream concepts in online publishing even as little as one year ago. It's ironic that web site developers, who constantly dealt with HTML and may have heard that it was an offshoot of SGML, did not know the true purpose of those tags: to provide structure and not display.
     But the idea was there, waiting to be found, and the movement towards a more usable version of SGML called XML was beginning to get the word out to the uninitiated.
     With a little research into the roots of HTML and this new concept called XML, PC World Online discovered what is now taken for granted in the online industry: there is no way to store complex documents other than XML.
     The XML concept of tagging a document with the structure, enforcing the structure against a DTD and then being able to programmatically read the document simply makes sense. Instead of trying to store documents in a fixed structure such as a database and essentially forcing the documents into that structure, XML describes all possible forms of documents in terms of a hierarchical document tree. The elements of this tree can be defined as nested in others, can be constrained in position and number, and can be made optional or required. This makes for an incredibly flexible document model. But it also makes for an enforceable one. The DTD that describes these elements can also be used to enforce them. So whatever your rules are, the documents that are tested against the DTD cannot break them.
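To make the idea of an enforceable document model concrete, here is a minimal sketch in Python. It is not a DTD validator and it does not reproduce the PC World DTD; the element names and the `MODEL` dictionary are hypothetical. The dictionary plays the role that element declarations play in a DTD (nesting, order, number, optional or required), and `check` reports every rule a document breaks.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model. In DTD syntax the first two rules would read:
#   <!ELEMENT article (headline, section+)>
#   <!ELEMENT section (subhead?, para+)>
# Each element maps to an ordered list of (child, min, max); max=None is unbounded.
MODEL = {
    "article":  [("headline", 1, 1), ("section", 1, None)],
    "section":  [("subhead", 0, 1), ("para", 1, None)],
    "headline": [],
    "subhead":  [],
    "para":     [],
}

def check(elem):
    """Return a list of rule violations for elem and its descendants."""
    if elem.tag not in MODEL:
        return [f"unknown element <{elem.tag}>"]
    errors = []
    children = list(elem)
    pos = 0
    for name, lo, hi in MODEL[elem.tag]:
        count = 0
        # Consume as many consecutive <name> children as the rule allows.
        while (pos < len(children) and children[pos].tag == name
               and (hi is None or count < hi)):
            count += 1
            pos += 1
        if count < lo:
            errors.append(f"<{elem.tag}> needs at least {lo} <{name}>")
    if pos < len(children):
        errors.append(f"unexpected <{children[pos].tag}> in <{elem.tag}>")
    for child in children:
        errors.extend(check(child))
    return errors

good = ET.fromstring(
    "<article><headline>H</headline><section><para>p</para></section></article>"
)
bad = ET.fromstring("<article><section><para>p</para></section></article>")
good_errors = check(good)
bad_errors = check(bad)
```

A real deployment would hand this job to a validating parser and the DTD itself; the sketch only shows why a declared structure can be tested by machine.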
     In a surprisingly short amount of time, a team of editors, production staff and developers worked with Mike Brown, an SGML consultant (www.brown-xml.com), and created the PC World DTD. This DTD is actually very simple and features common article elements such as headline, section and paragraph. It also features special structures for tips and product reviews. To the credit of the group working on it, the DTD has changed very little since its inception about one year ago.
     Using the DTD, command line tools, XML parsers and ArborText's Adept editor, we now transform all of our various document formats into XML in conformance with the PC World DTD. This is a vastly simpler task than hand coding the documents into HTML and the other formats required.
     In particular, the conversion of documents from Quark format has been streamlined. Articles are converted from Quark into ASCII files with Quark tags as text. Several processes are run to convert the Quark tags to XML tags and then each document is tested against the DTD for conformance. The article is then edited, usually to properly nest tags, until it conforms to the DTD.
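The tag conversion step might look something like the sketch below. It assumes the exported text marks each paragraph with a style name in the form `@Style:text`; the style names, the XML element names and the mapping between them are hypothetical, and the actual PC World processes are not described in enough detail to reproduce. The final parse only checks that the result is well-formed; conformance to the DTD would be tested separately with a validating parser.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical mapping from paragraph style names to XML element names.
STYLE_TO_TAG = {"Headline": "headline", "Subhead": "subhead", "Body": "para"}
STYLE_LINE = re.compile(r"^@(\w+):(.*)$")

def tagged_text_to_xml(text):
    """Convert '@Style:text' lines into a flat <article> of XML elements."""
    parts = ["<article>"]
    for line in text.splitlines():
        match = STYLE_LINE.match(line.strip())
        if not match:
            continue  # this sketch skips blank or unrecognised lines
        style, content = match.groups()
        tag = STYLE_TO_TAG.get(style, "para")
        # escape() protects &, < and > so the output stays well-formed.
        parts.append(f"<{tag}>{escape(content.strip())}</{tag}>")
    parts.append("</article>")
    return "".join(parts)

source = "@Headline:Desktops & Laptops\n@Body:Ten machines tested.\n"
xml_text = tagged_text_to_xml(source)
root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
```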
     Once the document is validated, copy editors and editors use Adept to further enhance the structure of the documents adding in elements that could not be translated from the Quark such as the structures for product reviews and tips.
     The resulting XML documents successfully capture the complexity of the original documents and in many cases have additional information, such as product links and URLs for companies, which is added to the online version of the article to provide an even richer reader experience.
     

    The XML Solution: Delivery

     Beyond providing a flexible storage system, XML can also be used to extract content in a very powerful and flexible way. This is crucial to the PC World Online system that must automatically transform the valid XML documents, in all their complexity, into many different delivery formats.
     With a database system, access to information is simple and powerful. When the record is called, tools can ask for the contents of the field and they will be returned.
     With XML, the access to the data is, if anything, more powerful. XML documents are accessed with a parser. The parser reads the tagged document and creates a document tree of all of the tags and the contents of the tags. It is easy to access a certain element of the tree - simply walk the document tree and get out the contents of the tag. Retrieving the headline from an XML document is as easy as getting it from a field from a database.
     But the additional benefits of the parsed XML document tree provide a much richer set of data retrieval tools than any database. The entire tree can be accessed and so not only is retrieval of a certain tag possible, but any tagged portion can be retrieved. It is as easy to get the fifth product review, including all of the tags contained in the review, as it is to get the headline. In addition, the review is itself a well-formed XML document which can be processed on its own.
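As an illustration of this kind of access, the following sketch uses Python's standard `xml.etree.ElementTree` parser. The sample document and its element names (`headline`, `review`, `product`) are hypothetical and are not taken from the PC World DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; element names are illustrative only.
DOC = """<article>
  <headline>Top Desktops</headline>
  <section>
    <para>We tested ten machines.</para>
    <review><product>Alpha 100</product><para>Fast.</para></review>
    <review><product>Beta 200</product><para>Quiet.</para></review>
  </section>
</article>"""

root = ET.fromstring(DOC)                 # parse into a document tree

# Getting the headline is as direct as reading a database field.
headline = root.findtext("headline")

# Any tagged portion can be retrieved, here the second product review.
reviews = root.findall(".//review")
second = reviews[1]

# The retrieved subtree serializes to a document that can be parsed again.
fragment = ET.tostring(second, encoding="unicode")
reparsed = ET.fromstring(fragment)
```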
     But the real value of XML is beyond simple access to data. With some tools, such as the XML tools provided by Vignette StoryServer, it is possible to transform the XML to other formats during serialization, that is, while the selected portion of the XML is being rendered. The mapping that drives this transformation is called a translation table. The StoryServer methodology is based on early XSL proposals and is a straightforward way of transforming XML in an automated fashion. For each tag, a procedure may be run for each part of the tag - the start, the body and the end. In most cases, the procedure simply replaces the tag with appropriate markup for the audience: HTML for online, text or formatting codes for licensees and paper. But the procedure may also be code that is run from that point in the document and has the full XML services available. So complex elements such as references can resolve the reference to another document and display that document in place of the XML tag.
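The translation table idea can be sketched in a few lines of Python. This is not the StoryServer API, which is not shown in this paper; the table format, the element names and the `translate` function are assumptions made for illustration. Each tag maps to three procedures, one each for the start, the body and the end, and swapping the table changes the output format without touching the document.

```python
import xml.etree.ElementTree as ET
from html import escape

def literal(text):
    """Build a start/end procedure that always emits the same text."""
    return lambda elem: text

# Hypothetical translation tables: tag -> (start, body, end).
# start and end receive the element; body transforms character data.
HTML_TABLE = {
    "article":  (literal(""), escape, literal("")),
    "headline": (literal("<h1>"), escape, literal("</h1>\n")),
    "para":     (literal("<p>"), escape, literal("</p>\n")),
    "tip":      (literal('<div class="tip">'), escape, literal("</div>\n")),
}
TEXT_TABLE = {
    "article":  (literal(""), str, literal("")),
    "headline": (literal("== "), str.upper, literal(" ==\n")),
    "para":     (literal(""), str, literal("\n")),
    "tip":      (literal("TIP: "), str, literal("\n")),
}

def translate(elem, table):
    """Walk the tree, running the start, body and end procedure per tag."""
    start, body, end = table[elem.tag]
    out = [start(elem), body(elem.text or "")]
    for child in elem:
        out.append(translate(child, table))
        out.append(body(child.tail or ""))  # text following the child
    out.append(end(elem))
    return "".join(out)

doc = ET.fromstring(
    "<article><headline>Word tips</headline>"
    "<para>Save often.</para><tip>Press Ctrl+S.</tip></article>"
)
as_html = translate(doc, HTML_TABLE)
as_text = translate(doc, TEXT_TABLE)
```

Because the start and end procedures receive the element itself, one of them could equally run arbitrary code at that point, for example looking up a referenced document instead of emitting fixed text.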
     This ability to not only provide straightforward access to portions of documents but also programmatically transform documents is what allows the PC World Online system to achieve the goal of being able to deliver many different formats of documents, including those where the content itself must be transformed.
     The PC World Online document delivery system is built using Vignette StoryServer, which is a traditional database and template system. But Vignette was also on the forefront of the industry in recognizing the value of XML, and XML services were developed for StoryServer in January of this year.
     The resulting system makes the best of both types of systems. The template system allows the creation of complex containers for content and the XML enables every aspect of the actual content itself to be transformed.
     Once PC World Online editors sign off on the valid XML documents they have tagged using Adept, these documents are checked into a relational database that is very similar to the databases in a traditional template and database system. Article level attributes are decomposed into index fields but where the traditional systems store lightly coded HTML, a valid XML document is stored.
     When a request is made to view an article, the appropriate template is called and it retrieves the record containing the article from the database. Instead of inserting the returned fields into the placeholders in the template, the system sends off the XML to the XML processing part of the system. The XML processing parses the XML, returns the appropriate portion of the document and then sends the XML through the appropriate translation table and returns the resulting selected and transformed XML to the template. The template inserts the returned text into the appropriate position and delivers the finished page, the combination of XML and template, to the requestor.
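The check-in and delivery pathway described in the last two paragraphs can be sketched end to end as follows. Everything here is a simplified assumption: the table layout, the wrapper template, the tiny `HTML_TAGS` translation table and the function names are hypothetical, and the real system relies on StoryServer rather than on code like this.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html import escape
from string import Template

PAGE = Template("<html><body>\n$content</body></html>")  # hypothetical wrapper
HTML_TAGS = {"headline": "h1", "para": "p"}              # hypothetical translation table

def check_in(conn, xml_text, category):
    """Store the XML whole; decompose article-level data into index fields."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO articles (title, category, xml) VALUES (?, ?, ?)",
        (root.findtext("headline"), category, xml_text),
    )

def to_html(elem):
    """Translate the mapped children of elem (flat and text-only, for brevity)."""
    return "".join(
        f"<{HTML_TAGS[c.tag]}>{escape(c.text or '')}</{HTML_TAGS[c.tag]}>\n"
        for c in elem if c.tag in HTML_TAGS
    )

def deliver(conn, article_id, path=None):
    """Fetch the record, parse, select a portion, translate, fill the template."""
    (xml_text,) = conn.execute(
        "SELECT xml FROM articles WHERE id = ?", (article_id,)
    ).fetchone()
    root = ET.fromstring(xml_text)
    portion = root if path is None else root.find(path)
    return PAGE.substitute(content=to_html(portion))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles "
    "(id INTEGER PRIMARY KEY, title TEXT, category TEXT, xml TEXT)"
)
check_in(
    conn,
    "<article><headline>Top Desktops</headline><para>Intro.</para>"
    "<section><para>Details.</para></section></article>",
    "reviews",
)
full_page = deliver(conn, 1)
section_page = deliver(conn, 1, "section")
```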
     This pathway is the same for all forms of our content, from pages for our site with very complex HTML wrappers and HTML formatting within the articles to proprietary formats for licensees with a completely different presentation of the content.
     

    XML System = Real World Results

     The creation of the XML-based web site production system has totally revolutionized the way PC World Online creates and delivers its content.
     Using XML, the conversion of documents from other formats is greatly simplified and is also now machine testable. This simplified pathway has greatly reduced the time spent converting document formats.
     Editors now have direct access to documents. Using a structured authoring environment like ArborText's Adept, editors can have access to the content and even change the tagging, but it will always be in conformance with the overall document structure as outlined in the DTD. Granting the editors this access greatly streamlined the production process and sped up publication time.
     Using templates and different translation tables, documents can be delivered in almost any format. From the different HTML versions that comprise PC World Online to the various formats of our licensees we have automated processes that take the XML and transform it into the format required. And, in addition to these existing formats, we will be able to quickly adapt to formats that evolve in future browsers, perhaps even sending XML sections that can be viewed using XSL.
     The actual storage of the documents is the most important and successful aspect of the system. Using the DTD we are able to store our documents with their complex structures intact and PC World Online is now able to deliver on the web the same kind of complex documents that readers experience in the magazine.
     The complex storage of documents also has uses beyond creating nicely formatted articles. Because the PC World articles are tagged with information rich tags, unique versions of the content that are tailored to a specific area of that content can be presented to users. This is a totally different way of delivering content than the usual methods of a list of documents by type or search results. In both of those instances, readers are sent to the entire article when only a portion of the article may have been of interest.
     Using XML documents and tools, readers can be presented with positive search trees, for instance a list of products that have reviews. When the reader selects a review, just that portion of the document that comprises the review is displayed. This is a new presentation of the content, separate from the article and tailored to the individual reader.
     This type of product and many more will allow us to bring to our readers a much more focused product and provide them with the actual content they want.
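A list of reviewed products of the kind described above could be built roughly like this. The two sample articles, their element names and the `build_review_index` helper are hypothetical; the sketch only shows that richly tagged content can be indexed and served at the level of a single review rather than a whole article.

```python
import xml.etree.ElementTree as ET

# Hypothetical article store: key -> XML text.
ARTICLES = {
    "desktops": (
        "<article><headline>Top Desktops</headline>"
        "<review><product>Alpha 100</product><para>Fast.</para></review>"
        "<review><product>Beta 200</product><para>Quiet.</para></review>"
        "</article>"
    ),
    "laptops": (
        "<article><headline>Light Laptops</headline>"
        "<review><product>Gamma 3</product><para>Light.</para></review>"
        "</article>"
    ),
}

def build_review_index(articles):
    """Map each reviewed product name to (article key, review element)."""
    index = {}
    for key, xml_text in articles.items():
        for review in ET.fromstring(xml_text).iter("review"):
            index[review.findtext("product")] = (key, review)
    return index

index = build_review_index(ARTICLES)
products = sorted(index)                  # the list offered to the reader
key, review = index["Beta 200"]           # the reader picks one product
review_only = ET.tostring(review, encoding="unicode")
```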
     XML provided the basis for a web site production solution that goes well beyond the simple task of storing and delivering documents. While handling this formidable task, the use of XML also opens up entirely new areas of content generation and combination that were unthinkable with other traditional production methods. So far the system has changed the way PC World Online produces its web site. It will soon change the web site itself, making it a much more enjoyable and fulfilling experience for our readers for years to come.
