| Maziarka Michael |
Publishing to the Web is More than Converting Data into HTML |
The Web was Designed to Be "Easy" |
| The World Wide Web, and the subsequent use of Web technology inside corporations, gained mass appeal because of its simple approach to presenting data interactively to information seekers. In no time at all, everyone from middle school students to corporations was building Web sites. Those sites were created, more or less, as marketing vehicles for their creators. In a matter of hours, anyone could make their message available to the world. |
| As a result of this rapid adoption of Web technology, corporate, commercial, and technical documentation publishers were presented with a new challenge: Publish on the Web! One would think that accomplishing this task would be as easy as "converting your data into HTML." Unfortunately, many additional challenges face the information publisher who must distribute information not only via the Web, but also through the more traditional vehicles of paper and CD. |
Web Publishing Challenges |
| The challenges faced by the information publisher are manifold, primarily because of the different document access paradigm enabled by Web technology. For Web technology to be a useful vehicle for users to retrieve information, it must provide a way for the user to: |
|
| This high-level list of user requirements for accessing documents on the Web is not unlike the list you would create for paper documents. However, depending upon the types of documents that you produce, the methods and values assigned to the above list might be far different for Web publishing. |
| One key difference between publishing to paper and publishing to the Web is the quantity of data being supplied and retrieved. Traditional documents are often all-encompassing for a particular topic because it is inconvenient for a user to retrieve additional documents to locate information. In addition (and possibly more importantly), it is often not practical from a cost perspective to publish many small documents. As a result, print documents tend to be large. |
| Web technology, on the other hand, relies upon network connections to retrieve information for users. To reduce transaction times (through reduced network traffic), smaller units of data are preferred. In addition, users want to quickly read and assess whether the retrieved information is what they need. To that end, publishing smaller units of data better fits the Web model of retrieving and using data. |
| Depending upon what type of documents you produce, the model might be the same to publish for print and for the Web. Documents such as business correspondence and reports tend to contain smaller quantities of data. On the other hand, maintenance manuals, journals, and reference data publications can contain hundreds, if not thousands, of printed pages. Web users cannot afford to retrieve entire publications of that size to locate a procedure, article, or abstract. |
| For producers of short documents, publishing to the Web requires granting access to your data and providing mechanisms for quickly locating the documents. For producers of large documents, publishing to the Web requires processes that enable the production of documents in multiple forms: print documents are a continuous stream of ordered data, whereas Web documents are a collection of cross-linked microdocuments. The challenges introduced by publishing sizeable documents to the Web include: |
Publishing to the Web and Document Management |
| Clearly the answer to publishing both traditional and Web documents is to use document management technology combined with a neutral encoding of data. To be truly effective at publishing to the Web, the document management system must support document component management, rather than entire documents. Component management enables reuse of document modules between different documents and different media output. Publishing to print output or the Web becomes a process of building document views, or collections, of the microdocuments or components. Those views establish relationships between the components, which enables later update and subsequent republishing of the information. |
| It is important not to confuse publishing to the Web, where a finished view of information is accessed, with providing direct access to your document management system through "Web Clients." Web Clients provide a way for users to access a document management system directly through a Web browser rather than through a client (or interface) specific to the document management system. Creating a published view of information must separate work in process from completed and approved information, and must provide document navigation capabilities aimed at a user, rather than a creator, of the information. Those navigation capabilities may very well expose only a limited subset of the complete repository of information. |
Data Formats |
| Initially, the Web meant use of HTML. HTML became the panacea for information distribution. It was easy to learn and use, and provided enough formatting capabilities to display a range of data. Unfortunately, like many things which are simple, HTML was also fairly limited. Browser developers began extending the tag set to meet the growing demands of customers. HTML also presented the problem of not being sufficient as a mark-up language to produce print documents. As a result, documents were created using SGML, desktop publishing tools, or word processors, and then converted into HTML for the Web. |
| For cases where it is preferable to view information in its original format (e.g., documents where the printed page representation is preferred, or legacy documents), "Plug-ins" were developed to launch the word processor or viewer which matches the data format. One successful use of that approach is Adobe's PDF (Portable Document Format), which renders the printed pages in an electronic format. This approach enables use of documents in a format other than HTML. The downside to using these alternatives is that it takes longer to download the data because the files are larger. |
| Ideally, Web sites could contain a mix of data types based upon the needs of the Publisher. An ASCII version of data (HTML, SGML, or XML) would be available for cases where the data representation need not match the print copy. The Web site could also contain format-dependent views, which could be supplied through the use of PDF or other alternatives. SGML, although ideally suited as an extensible neutral-encoding language, also carries the perception of being difficult and complicated. That is where XML comes into the picture. XML provides a simpler alternative to SGML which would still provide a neutral encoding for data, but would be extensible to handle more complicated data structures and format requirements than HTML. At the same time, XML might be sufficient to use for other media requirements such as print or CD. |
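As an illustration of converting neutrally encoded data into HTML, the following minimal sketch uses Python's standard `xml.etree.ElementTree` module. The element names (`procedure`, `title`, `step`) and the sample content are assumptions made for this example; they are not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical microdocument in a neutral XML encoding (element names are assumed).
SOURCE = """<procedure id="replace-filter">
  <title>Replacing the Filter</title>
  <step>Remove the cover.</step>
  <step>Replace the filter.</step>
</procedure>"""


def to_html(xml_text: str) -> str:
    """Render one procedure as an HTML heading followed by an ordered list."""
    root = ET.fromstring(xml_text)
    title = escape(root.findtext("title", default=""))
    steps = "".join(f"<li>{escape(step.text or '')}</li>" for step in root.findall("step"))
    return f"<h1>{title}</h1><ol>{steps}</ol>"


html_page = to_html(SOURCE)
```

The same source could feed a print formatter instead; only the rendering function would change, which is the point of keeping the stored data free of output-specific mark-up.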
Microdocuments |
| The Web paradigm calls for the use of microdocuments, or documents which contain enough information to have meaning and can stand alone as documents. Traditional print documents are an ordered collection of these microdocuments. Exactly what unit of data is a microdocument will depend upon your data. Likely candidates include content-specific units of data such as tasks, procedures, topics, abstracts, and articles. Microdocuments can also be structure-specific units of data such as section, subsection, and appendix. |
| The microdocument concept matches well with component document management technology. Microdocuments, also sometimes called Minimum Revisable Units (MRUs), are stored as objects in the document manager. Object storage permits them to be stored once, but used in many documents or locations. The document management system provides version control for each object, tracking which version of an object is applicable for any given document in which it is used. The document itself is nothing more than an ordered collection of objects. With SGML or XML, the Document Type Definition governs how the document is built from those objects. |
| Using that same approach, collections of objects (which might not be complete printed documents) can be created for a Web view of the information. Publishing to the Web is the act of making objects within that collection accessible to the Web. |
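The idea of storing an object once and using it in many documents can be sketched in a few lines of Python. This is only an illustration under assumed names (`Component`, `Repository`, `build_document`, and the sample object IDs are all hypothetical); a real document manager would add locking, access control, and DTD validation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """One microdocument (MRU) at a specific version."""
    object_id: str
    version: int
    text: str


@dataclass
class Repository:
    """Hypothetical component store: each object version is stored exactly once."""
    objects: dict = field(default_factory=dict)

    def store(self, component: Component) -> None:
        self.objects[(component.object_id, component.version)] = component

    def fetch(self, object_id: str, version: int) -> Component:
        return self.objects[(object_id, version)]


def build_document(repo: Repository, manifest: list) -> str:
    """Assemble a document from an ordered list of (object_id, version) pairs."""
    return "\n".join(repo.fetch(oid, ver).text for oid, ver in manifest)


repo = Repository()
repo.store(Component("safety-warning", 1, "Disconnect power before servicing."))
repo.store(Component("safety-warning", 2, "Disconnect power and wait 30 seconds before servicing."))
repo.store(Component("replace-filter", 1, "Remove the cover and replace the filter."))

# Two documents reuse the same object but pin different versions of it.
shop_manual = [("safety-warning", 2), ("replace-filter", 1)]
service_bulletin = [("safety-warning", 1)]
```

Here a document is literally an ordered list of object references, so a print manual and a Web collection can be two different manifests over the same repository.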
Web Navigation |
| Printed documents contain several techniques for locating information: Tables of Contents (in various forms), Indices, and cross references. Tables of Contents provide a breakdown of the information from which the user narrows down their possibilities until they locate the data which they wish to review. Indices provide data location through the use of keywords or phrases. Cross references enable users to find related information without searching through either of the former methods. |
| To locate data on the Web, similar techniques are used. However, the implementation of the techniques is different because of the interactive (and electronic) nature of the Web. In addition, the use of document management and the Web permits another type of searching not typically found in print products. |
| Although the phrase "Table of Contents" has a page-oriented connotation, users do want some type of document navigation which allows them to drill down through document contents. In the print product, the Table of Contents is often generated by the desktop publishing or composition tool. It contains not only the document structure, but also the corresponding page on which the information begins. Although the page number is meaningless on the Web site, the document structure is still very useful for the user. As such, the Table of Contents for the Web site can be an ordered collection of objects from the document management system. |
| In addition, a Web site might contain many different documents (e.g., Product 1 Shop Manual, Product 2 Shop Manual) and document types (e.g., Shop Manuals, Service Bulletins). Providing lists of these document types and documents is again nothing more than ordered collections of objects from the document manager. |
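A Web Table of Contents built from an ordered collection might look like the following sketch. The nested dictionary layout, the `render_toc` function, and the URL table are assumptions for illustration; page numbers are simply absent, and only objects that have a published location become links.

```python
from html import escape


def render_toc(node: dict, urls: dict) -> str:
    """Render an ordered, nested collection of objects as an HTML list item."""
    title = escape(node["title"])
    url = urls.get(node["id"])
    # Objects without a published URL are shown as plain headings.
    label = f'<a href="{escape(url, quote=True)}">{title}</a>' if url else title
    children = node.get("children", [])
    if not children:
        return f"<li>{label}</li>"
    items = "".join(render_toc(child, urls) for child in children)
    return f"<li>{label}<ul>{items}</ul></li>"


# Hypothetical collection and published locations.
toc = {
    "id": "product1-shop-manual",
    "title": "Product 1 Shop Manual",
    "children": [{"id": "replace-filter", "title": "Replacing the Filter"}],
}
urls = {"replace-filter": "/p1/replace-filter.html"}
toc_html = render_toc(toc, urls)
```

The same function could render the list of document types or documents on a site, since those are ordered collections as well.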
| What was represented as Indices in a print document is replaced by full text search capabilities on the Web. Cross references, typically resolved by the publishing tool for print products, are converted into URL links for the Web site. To support both capabilities, the document management system must maintain object identities on the Web site, and convert links to the appropriate object IDs when publishing to the Web. |
| The Web also presents a new mechanism for users to locate data not typically found with print products. Document management systems maintain meta-data (information about the data) for each object. Meta-data might include the author responsible for the information, or the end products to which the data applies. In the document manager, the meta-data is used to record pertinent information about objects and to provide a searching mechanism for users to locate data. When a document manager is used to publish to the Web, that meta-data can be used as another mechanism for Web users to locate information. |
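The conversion of cross references into URL links can be sketched as a lookup from object ID to published location. The `<xref target="...">` mark-up, the `resolve_xrefs` function, and the URL table below are hypothetical; a production system would work on parsed SGML or XML rather than on regular expressions.

```python
import re

# Hypothetical mapping from repository object IDs to published Web locations.
PUBLISHED_URLS = {
    "replace-filter": "/manuals/product1/replace-filter.html",
    "safety-warning": "/manuals/common/safety-warning.html",
}

# Assumed cross-reference mark-up: <xref target="object-id">label</xref>
XREF_PATTERN = re.compile(r'<xref target="([^"]+)">(.*?)</xref>')


def resolve_xrefs(markup: str, urls: dict) -> str:
    """Replace each <xref> with an HTML link; unpublished targets become plain text."""
    def to_link(match: re.Match) -> str:
        target, label = match.group(1), match.group(2)
        url = urls.get(target)
        if url is None:
            return label  # target is not on the Web site, so emit no dead link
        return f'<a href="{url}">{label}</a>'

    return XREF_PATTERN.sub(to_link, markup)
```

For example, `resolve_xrefs('See <xref target="replace-filter">Replacing the filter</xref>.', PUBLISHED_URLS)` yields a sentence containing an `<a href>` link to the published object, while a reference to an object missing from the table degrades to its label.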
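A meta-data search of this kind can be sketched as a filter over object records. The record layout, the field names (`author`, `products`), and the `find_by_metadata` function are assumptions for illustration only.

```python
# Hypothetical object records with meta-data maintained by the document manager.
OBJECTS = [
    {"id": "replace-filter", "author": "lee", "products": {"Product 1", "Product 2"}},
    {"id": "calibrate-sensor", "author": "kim", "products": {"Product 2"}},
    {"id": "safety-warning", "author": "lee", "products": {"Product 1"}},
]


def matches(record: dict, key: str, wanted) -> bool:
    """Multi-valued fields match on membership, single-valued fields on equality."""
    value = record.get(key)
    if isinstance(value, (set, frozenset, list, tuple)):
        return wanted in value
    return value == wanted


def find_by_metadata(objects: list, **criteria) -> list:
    """Return the IDs of objects whose meta-data satisfies every criterion."""
    return [
        record["id"]
        for record in objects
        if all(matches(record, key, wanted) for key, wanted in criteria.items())
    ]
```

A Web user looking for everything one author wrote for one product would issue something like `find_by_metadata(OBJECTS, author="lee", products="Product 1")`.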
Context-Dependent Views of Data |
| One of the advantages of using document management systems is that they can maintain the parent and child relationships for microdocuments. As such, the same information object can be used in more than one document. Through the use of configuration management, you might access different versions of that microdocument depending on the document from which you access it. |
| For printed products, context-dependency is not a problem because a printed product is a static view of information. However, when publishing to the Web, it is important that the correct version of an object is used, based upon which context is requested. |
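Context-dependent retrieval can be sketched as a lookup keyed on both the requesting document and the object. The configuration table, the content table, and the `resolve` function are hypothetical and stand in for whatever configuration management the document manager actually provides.

```python
# Hypothetical configuration table: which version of an object applies in which document.
APPLICABLE_VERSION = {
    ("product1-shop-manual", "safety-warning"): 2,
    ("product2-shop-manual", "safety-warning"): 1,
}

# Hypothetical stored content, keyed by (object ID, version).
CONTENT = {
    ("safety-warning", 1): "Disconnect power before servicing.",
    ("safety-warning", 2): "Disconnect power and wait 30 seconds before servicing.",
}


def resolve(context: str, object_id: str) -> tuple:
    """Return (version, text) of the object as it applies in the requesting document."""
    try:
        version = APPLICABLE_VERSION[(context, object_id)]
    except KeyError:
        raise LookupError(f"{object_id!r} is not used in {context!r}") from None
    return version, CONTENT[(object_id, version)]
```

The Web server would pass along the document from which the user arrived, so the same object ID can return different text in the Product 1 and Product 2 manuals.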
Automating Document Publishing (Availability) and Updates |
| To publish a printed document, approved data is collected and held until publication time. At that time, the data is published into a paper form. Checkpoints mark the released version so that it can be referred to later. The cycle then repeats itself. |
| Publishing to the Web does not need to be an "all or nothing" proposition as it can be for printed documents. Instead, as information is completed and approved, individual microdocuments can be published to the Web. In this case, publishing simply means making the data accessible to the Web view. Depending upon your system configuration, that might mean transferring data to another server, converting the data into HTML, or simply flagging the data in the editorial database as completed. Within the combined world of Web publishing and document management, Web access to the information is simply controlled by queries against parameters. Based upon the results of a query, different documents, or versions of documents, are made available to the Web user. |
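Controlling Web access by a query against parameters can be sketched as follows. The editorial records, the `status` and `approved_on` fields, and the `web_view` function are assumptions for illustration; the actual parameters depend on the document management system in use.

```python
from datetime import date

# Hypothetical editorial database: one record per microdocument.
EDITORIAL_DB = [
    {"id": "replace-filter", "status": "approved", "approved_on": date(1998, 3, 2)},
    {"id": "calibrate-sensor", "status": "in-review", "approved_on": None},
    {"id": "safety-warning", "status": "approved", "approved_on": date(1998, 4, 15)},
]


def web_view(records: list, as_of: date) -> list:
    """Return the IDs visible to Web users: approved objects only, as of a given date."""
    return [
        record["id"]
        for record in records
        if record["status"] == "approved" and record["approved_on"] <= as_of
    ]
```

Under this scheme, "publishing" a microdocument amounts to changing its status; the next query for the Web view picks it up without a separate release cycle.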
Publishing to the Web Summary |
| Publishing information to the Web requires creating a view into your information which will enable users to locate data in a more timely manner than paper. To build a successful Web site, the Web publishing process should be tied into a document management system which does more than store data in a repository. The document management system provides the foundation from which a Web site maintains and takes advantage of the data relationships established in the document management system. |
| The document management system also provides an encapsulated view of information for the Web site, with the correct hooks for users to navigate and locate the data they need. That view, and the corresponding navigation aids, are probably different from those used by an editor. Publishing to the Web is the same as creating a finished-goods view of your data. Document management provides the capabilities to create and manage that view in a highly automated manner. |