
 
 

High Level Architectures of Document-Object Publishing Systems


 
Gregory Vaughan
Technical Fellow
Database Publishing Systems Ltd.
608 Delta Business Park
Swindon, Wiltshire, United Kingdom SN5 9QL
Phone: +44 1793 512 515
Fax: +44 1793 512 516
Email: gsv@dpsl.co.uk
Web: www.dpsl.co.uk
 
Biographical notice:
 
Gregory Vaughan
 
Gregory Vaughan is a Technical Fellow with DPSL, and has a wide variety of database and publishing systems experience. As an analyst at General Electric in the US, Greg specialised in hierarchical and relational database systems. As an analyst and project manager at Datalogics he participated in the specification, design and implementation of a wide variety of SGML and database publishing systems, as well as project management and sales support. At DPSL, Greg has led a wide variety of consultancy and system integration projects, including functional specifications, cost-benefit analyses, DTD analyses, conceptual and detailed system analyses, and software coding & testing. He has also led the development of DPSL's internal project management methodology. His background includes a BS Engineering degree from Purdue University, and graduate work in management, databases, and expert systems.
 
ABSTRACT:
 
A natural application of both XML and SGML is in systems that combine re-usable document objects, such as AECMA S1000D data modules, into electronic and printed publications. Although a variety of document repository and workflow products now exist to manage this kind of information, these products do not provide a complete solution and must typically be integrated together with more traditional publishing tools for a complete system. Estimating the procurement and development costs for such systems can be difficult without a clear understanding of the architecture involved.
 
This paper investigates high-level conceptual designs for a document-object publishing system. The discussion will describe a strategy for decomposing a publishing system into its major functional components, such as workflow management, composition, and electronic delivery. Within each component the discussion will present a typical selection of the products & services involved, and a range of typical costs. In particular, products aimed at a high-tier vs. low-tier solution will be presented. The discussion will also review areas that commonly require customisation and options for how the customisation can be achieved (e.g. in-house development, outsourcing, etc.). The resulting architecture will be expressed in the form of a deliverables breakdown and can then be used as the foundation for an initial cost estimate, a further detailed system design, a cost-benefit analysis, or an RFP.
 
It should be noted that this paper presents a general overview of this process and, by definition, does not cover all of the detail that would be inherent in a real, specific system. Further, in an individual circumstance, there may be a requirement for components other than those identified here. However, this general framework can be extended to support such refinements and greater detail.
 
Central to the discussion of a document-object publishing system are the terms "fragment" and "publication". Here it is assumed that the end-product, as sold to the customer by a publisher, is called a "publication". Publications are created by combining smaller text documents, graphics, audio-visual clips, etc. into an ordered collection. Each of these smaller documents is referred to as a "fragment" throughout this paper. Within the publishing organisation, authors typically create and revise fragments, and the publications are created programmatically from the fragments.
 
Document-object publishing systems are typically centred on a document repository that stores all the fragments needed throughout the entire authoring, reviewing, and production cycles. These are stored in such a way as to be re-usable, i.e. that they can be shared amongst a variety of output publications whilst being stored internally only once. These fragments exist independently of the publications they appear in, are tracked through separate production cycles, and have separate security restrictions and version histories. Many document-object systems also provide a workflow management environment that stores the lifecycle definitions for the different types of manuals produced from the fragments; the various versions of modules and manuals used throughout the lifecycle; and product definition information, including selection criteria, output media definitions, and style sheets.
 
To summarise, document-object publishing systems typically incorporate the following key features:
 
  • storage of re-usable document fragments, enabling the generation of many different output publications from the shared fragments;
  • storage of these fragments in a media-independent format, enabling output to many types of output media and presentation styles from one common source;
  • use of automated version tracking and, optionally, change-marking to minimise the amount of proofing & checking done by reviewers;
  • automated workflow management, tracking and reporting; and
  • security control.
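
The first feature above, storing each fragment once and sharing it among several publications, can be illustrated with a minimal Python sketch. The class and field names (Fragment, Publication, Repository) are hypothetical and do not correspond to the API of any repository product; a real repository would add security control, persistence, and workflow tracking.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Fragment:
    """A re-usable document object with its own version history."""
    fragment_id: str
    versions: list[str] = field(default_factory=list)  # content of each version

    def revise(self, content: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(content)
        return len(self.versions)

    def latest(self) -> str:
        return self.versions[-1]


@dataclass
class Publication:
    """An ordered collection of references to fragments (not copies)."""
    title: str
    fragment_ids: list[str] = field(default_factory=list)


class Repository:
    """Holds every fragment exactly once, keyed by its identifier."""

    def __init__(self) -> None:
        self._fragments: dict[str, Fragment] = {}

    def add(self, fragment: Fragment) -> None:
        self._fragments[fragment.fragment_id] = fragment

    def assemble(self, publication: Publication) -> list[str]:
        """Resolve the publication's references to the latest fragment content."""
        return [self._fragments[fid].latest() for fid in publication.fragment_ids]


repo = Repository()
shared = Fragment("brightness-control")
shared.revise("Turn the dial to adjust brightness.")
repo.add(shared)

manual_a = Publication("Model A manual", ["brightness-control"])
manual_b = Publication("Model B manual", ["brightness-control"])

# One revision to the shared fragment is picked up by both publications.
shared.revise("Turn the dial clockwise to increase brightness.")
```

Because the publications hold only identifiers, a single revision reaches every publication that references the fragment, which is the property that makes re-use worthwhile.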
 
 

High-Level Components

 
As previously mentioned, the document-object publishing systems described above do not exist "off-the-shelf". Rather, they typically are built up as a collection of products and services that are procured and integrated together, either by an organisation's internal IT department or by a third-party systems integrator. As a method of planning this integration, many analysts first break down the system into a set of approximately 7-10 logical parts, called components. Here we define the term component to mean a collection of products and services all focused on providing one major system function, such as authoring, composition, or electronic delivery.
 
The figure below illustrates the major components in a typical document-object management system. For example, the concept of a central, secure storage system for the fragments is usually crucial. Therefore a logical place to start is to define a Repository Component that contains everything (products & services) associated with storing and tracking fragments and publications, including:
 
  • a client-server document repository product that provides versioning, security, and sharing mechanisms for both the fragments and publications;
  • an underlying DBMS product that provides data storage, backup, recovery, and physical-layer tuning facilities to the repository;
  • a variety of customisations (services) to tailor the repository to the particular documents involved; and
  • a variety of consulting services to install the system and tune it on the particular hardware chosen for the system.

 
Example High-Level Component Breakdown

 
 
Similarly, the products and services related to authoring the text and graphic fragments can be collected into an Authoring Component, the products and services related to assembling and composing fragments into printed publications into a Composition Component, and so on. This type of logical breakdown by functional area allows us to more easily organise a cost estimate or project schedule for the system.
 
 

Authoring Component

 
The authoring of the text for re-usable fragments is generally carried out using SGML/XML context-sensitive authoring tools. These tools may also provide output that will meet the expectations for proofing the material.
 
Authoring tools should provide for change marking such that the composed material can be marked as to where proofreaders should concentrate their attention (and therefore where they shouldn't). It is envisaged that this change marking process be "manual" in the sense that the authoring software will not automatically mark changed material between versions; rather, the technical author will use SGML tags to explicitly mark changes to be shown to downstream reviewers and translators.
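
As an illustration of manual change marking, the following Python sketch lists the elements that an author has explicitly flagged, so that a proofing step could highlight only those. It is a sketch under stated assumptions: the fragment is expressed as XML rather than SGML (so that the standard library parser can be used), and the element names and the `change` attribute are hypothetical, not taken from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the author has marked two paragraphs by hand.
FRAGMENT = """
<feature id="cd-drive">
  <title>CD drive</title>
  <para>Press the eject button to open the tray.</para>
  <para change="revised">Place the disc label side up.</para>
  <para change="added">Close the tray before moving the computer.</para>
</feature>
"""


def changed_elements(xml_text):
    """Return (tag, change type, text) for every element the author marked."""
    root = ET.fromstring(xml_text.strip())
    return [
        (elem.tag, elem.get("change"), (elem.text or "").strip())
        for elem in root.iter()
        if elem.get("change") is not None
    ]


for tag, kind, text in changed_elements(FRAGMENT):
    print(f"{kind:8} <{tag}>: {text}")
```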
 
This component also typically provides a graphics editor for the creation and revision of graphical material.
 
 

SGML Editors

 
A variety of SGML editor products are currently on the market, including:
 
  • ArborText ADEPT*Editor
  • Adobe FrameMaker+SGML
  • InContext Systems InContext
  • SoftQuad Author/Editor
 
Some features and issues to consider when evaluating editing products include:
 
  • Templates for SGML data & attribute entry
  • Menus of available SGML elements on insert
  • Sophistication of page layout
  • Sophistication of formatting
  • Table formatting
  • Equation handling
  • Standards-based stylesheets (FOSI, XSL)
  • Repository bridges available
 
 

Customisations

 
Customisations for authoring systems centre on the need to provide style sheets for the material being edited. For the document-object system under consideration, this generally means writing authoring-level style sheets for each type of fragment supported in the system (authoring-level style sheets simply format material for the authors on-screen; they do not necessarily seek to duplicate publication appearance or provide any kind of page layout). If the editing tool also provides the formatting used to proof the fragments, proofing style sheets must be written as well; often, however, this facility is provided by the composition engine, which is discussed below.
 
 

Repository Component

 
This component contains everything necessary to securely store structured & unstructured documents, and related graphics and A/V material. This typically consists of:
 
  • An SGML/XML repository server;
  • A number of SGML Repository clients.
 
 

SGML Repository Products

 
SGML repository systems store data at the element level and can store instances of any DTD. The following are some of the relevant products in this category:
 
  • Chrystal Astoria
  • Documentum Software Documentum
  • IDI BASIS SGMLServer
  • OpenText LiveLink
  • Texcel Information Manager
  • Xyvision Parlance
 
Some issues and features to consider when evaluating repository products include:
 
  • Support for native SGML/XML
  • Element-level SGML/XML storage
  • Full security control
  • Version tracking
  • Full text search
  • SGML search
  • Custom attributes
  • SDK
  • Web-based clients
  • Choice of underlying DBMS
 
 

Repository DBMS Engines

 
One important consideration in procuring a repository is whether the repository allows different underlying DBMS products to be used. The following are some popular choices for underlying databases:
 
  • MS SQL Server for NT
  • ObjectStore
  • Oracle
  • Sybase
 
Some issues and features to consider when evaluating databases include:
 
  • ODBC drivers
  • Multi-threaded server
  • Ability to tune the physical storage layer
  • Full backup/recover capabilities
  • Transaction journaling
  • Multi-platform server
 
 

Customisations

 
Typical repository customisations include:
 
  • definitions of how to partition work-in-progress (WIP), current, and back issues of publications into cabinets and folders;
  • custom attribute definitions to support searching; and
  • system administration scripts (backup/recover, archive/restore, performance tuning, space reclamation, etc.)
 
 

Workflow Component

 
The Workflow Component stores all the information necessary to:
 
  • define workflow steps;
  • define users and user groups;
  • define error conditions and the workflow paths that occur when errors are encountered;
  • define the authorisation of particular users and groups to perform particular workflow steps; and
  • track the actual use of the system through each workflow step, recording the progress of users as they author, revise, compose, and proof the re-usable modules, and then as they define, revise, compose, and proof the full manuals that reference the modules.
 
The component typically consists of:
 
  • a Workflow engine server;
  • a number of Workflow clients, with software that provides a "work queue" environment to allow users to see what tasks and documents have been assigned to them; and
  • an underlying workflow server DBMS.
 
This component generally provides the main user interface to the entire publishing system. This may consist of the user interface provided by the workflow management tool and/or a custom user interface tailored to the particular functions required by the system (the custom interface could be built using standard Windows development tools, such as Visual C++ or Visual Basic, that provide custom menus, forms, and hierarchical navigation tools).
 
Closely tied to the workflow component will be a series of management reports that use either the reporting features of the workflow product or reporting done directly against the repository and workflow databases. These reports can typically be run on demand and/or via a scheduling process, and will allow management to view the status of the user manual development tasks (but not necessarily print production tasks) as they occur.
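
The kind of information held by the Workflow Component (step definitions, authorisations, error paths, and an audit trail for reporting) can be sketched as a small state table in Python. This is an illustrative sketch only: the step names, group names, and routing are hypothetical, and commercial workflow products define such things through their own designers and APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical step table:
# step -> (authorised group, next step on success, next step on error).
STEPS = {
    "author": ("authors", "review", "author"),
    "review": ("reviewers", "compose", "author"),
    "compose": ("production", "done", "review"),
}


@dataclass
class WorkItem:
    """A fragment moving through the workflow, with its audit trail."""
    fragment_id: str
    step: str = "author"
    audit_trail: list[tuple[str, str, str]] = field(default_factory=list)


def complete_step(item: WorkItem, user: str, group: str, ok: bool = True) -> None:
    """Record the outcome of the current step and route the item onwards."""
    if item.step not in STEPS:
        raise ValueError(f"{item.fragment_id} has no open step")
    allowed_group, on_success, on_error = STEPS[item.step]
    if group != allowed_group:
        raise PermissionError(f"group {group!r} may not perform step {item.step!r}")
    item.audit_trail.append((item.step, user, "ok" if ok else "error"))
    item.step = on_success if ok else on_error


item = WorkItem("cd-drive")
complete_step(item, "alice", "authors")            # author -> review
complete_step(item, "bob", "reviewers", ok=False)  # review fails -> back to author
print(item.step)         # author
print(item.audit_trail)  # [('author', 'alice', 'ok'), ('review', 'bob', 'error')]
```

Management reports of the type described above would, in this simplified picture, be queries over the accumulated audit trail entries.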
 
 

Workflow Management Products

 
The following are some relevant workflow management products to consider:
 
  • Staffware plc Staffware
  • Xerox InConcert
 
Some features and issues to consider in workflow management include:
 
  • Graphical workflow designer (the ability to graphically define the workflows)
  • Team queues and resource scheduling capabilities
  • Choice of underlying DBMS
  • Audit trail
  • Reporting capabilities
  • Web-based clients
 
 

Customisations

 
Workflow by its very nature requires a significant amount of customisation. Each different project must budget time to include defining the custom workflow(s) that apply to an organisation's document management & production processes. These can be quite involved depending on the organisation(s) involved, but generally include at least:
 
  • Authoring cycle step definitions
  • Review cycle step definitions
  • Production cycle step definitions
  • Custom management reports
 
 

Composition Component

 
The Composition Component consists of everything related to producing paper output from the document fragments stored in the repository. It typically consists of:
 
  • A composition engine;
  • Custom composition formats; and
  • Custom logic to generate front and rear matter such as tables of contents and indices.
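
The custom logic for front matter can be quite small in principle. The following Python sketch derives table-of-contents entries from an ordered list of sections, assuming that each section starts on a new page and that the page count of each section is already known (in practice this would be reported by the composition engine). The section titles and page counts are hypothetical.

```python
def build_toc(sections, first_page=1):
    """Return (title, starting page) pairs for sections composed back to back."""
    toc = []
    page = first_page
    for title, page_count in sections:
        toc.append((title, page))
        page += page_count
    return toc


# Hypothetical (title, number of pages) pairs in publication order.
sections = [("Brightness control", 2), ("CD drive", 3), ("Troubleshooting", 1)]
toc = build_toc(sections)

for title, page in toc:
    # Pad with dot leaders so that the page numbers line up.
    print(f"{title} {'.' * (30 - len(title))} {page}")
```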
 
 

Products

 
The following are some popular products relevant to composition:
 
  • Adobe FrameMaker + SGML
  • Advent 3B2
  • ArborText SGML Publisher
  • Datalogics DLPAGER
  • Miles 33
 
The following are some issues and features to be considered:
 
  • Mode (interactive vs. batch)
  • Native SGML
  • Standards-based formatting scripting language (e.g., DSSSL, XSL, etc.)
  • Can generate front and rear matter
  • Sophistication of page layout
  • Sophistication of formatting
  • Speed
  • Support for loose-leaf
  • Generation of PDF directly
 
 

Customisations

 
Typical customisations involving the composition component centre on the need to write custom formatting stylesheets and/or scripts. The project should budget time to write formatting stylesheets for each type of output publication, and optionally each type of fragment, depending on how complex the formatting for the fragment needs to be. This generally depends on how the fragment is to be reviewed; for example, many organisations use the composition engine to typeset the fragment and then distribute Adobe PDF to be annotated by reviewers.
 
 

Electronic Delivery Component

 
The electronic delivery component contains everything relevant to producing and browsing electronic books. This component generally consists of a browser and its associated book production and indexing tools, and again custom style sheets for each output publication to be published.
 
 

Electronic Delivery Browsers

 
Common electronic book browsers are generally either Web-based tools such as Netscape and MS Internet Explorer, or SGML-based tools. The following are representative products:
 
  • Folio Folio VIEWS
  • Inso DynaText
  • Jouve GTIPublisher
  • MS Internet Explorer
  • Netscape Communicator
  • Synex (Inso) ViewPort
  • SoftQuad Panorama
 
The following features and issues should be considered:
 
  • Native XML/SGML
  • Sophistication of page layout
  • Sophistication of formatting
  • Support for standard formatting language (e.g., CSS, XSL, etc.)
  • Presence of scripting language (Java, VBScript)
  • Availability of SDK
  • XML hyperlink support
  • Annotations, bookmarks
  • Support for full-text search
  • Support for SGML search (elements in context, attribute values, etc.)
 
 

Customisations

 
Customisations involving the electronic delivery component again focus on the need to write custom formats as style-sheets and/or scripts. The project should budget time to write formatting stylesheets for each type of output publication, and then time to actually build and index the electronic books (or HTML/XML) being delivered.
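
For HTML delivery, the custom format is essentially a mapping from source elements to HTML elements. A minimal Python sketch of such a mapping is shown here, assuming an XML fragment (the standard library has no SGML parser) and hypothetical element names; a production script would more likely be written for a dedicated conversion tool or in a style sheet language.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from fragment elements to HTML elements.
TAG_MAP = {"feature": "div", "title": "h2", "para": "p"}


def to_html(elem):
    """Recursively convert an element and its children to an HTML string."""
    tag = TAG_MAP.get(elem.tag, "span")  # unmapped elements fall back to <span>
    parts = [f"<{tag}>", escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        parts.append(escape(child.tail or ""))
    parts.append(f"</{tag}>")
    return "".join(parts)


SOURCE = (
    "<feature><title>Brightness control</title>"
    "<para>Turn the dial to adjust brightness &amp; contrast.</para></feature>"
)
html_text = to_html(ET.fromstring(SOURCE))
print(html_text)
```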
 
 

Conversion Component

 
This component contains everything related to converting legacy data and populating the system with fragments based on the data. Typically, this involves:
 
  • conducting an investigation with the organisation's staff to determine the data formats and volume of existing material necessary for conversion into the new system;
  • conducting an investigation to determine the proper feature breakdown of the manuals to be supported by the system;
  • manual cleanup of fragments as necessary in conjunction with the relevant departments within the organisation; and
  • manually capturing "glue" rules and conditions present in the existing material for use by programs that extract reusable modules from the SGML repository and generate full manuals.
 
 

Products

 
The following are some popular products and tools relevant to conversion:
 
  • AIS Balise
  • OmniMark Technologies OmniMark
  • PERL
  • Sema Group Mark-It
 
The following issues and features should be considered for conversion products:
 
  • Parses SGML
  • Provides comprehensive scripting language
  • Regular expression matching
  • Event-driven transformations
  • Tree-driven transformations
  • Can call external functions
  • Availability of SDK
  • Availability of ODBC Link
 
 

Customisations

 
Due to the need to generate a library of re-usable modules from the existing legacy data, it is anticipated that the majority of the conversion work will be a manual process involving considerable intellectual work by the organisation's engineering, marketing, and publishing services staff. Initially, this work will be to determine the conversion requirements and rules. As legacy documents have typically not been written with modularisation in mind, the conversion design and execution will include tasks to modularise the content. Software customisations involve writing scripts for the conversion product(s) chosen, and time to run these scripts and clean up and parse the resulting SGML/XML instances. This can be one of the most time-consuming and potentially risky areas of the entire project.
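
A conversion script of the kind described above can be sketched in Python as follows. The sketch assumes that the legacy material has already been exported to plain text and that each feature write-up begins with a line of the form "Feature: name"; both the heading convention and the output element names are hypothetical. Text that the rules cannot place is set aside for manual clean-up, and each generated fragment is parsed to confirm that it is well-formed.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical convention: a feature write-up starts with "Feature: <name>".
HEADING = re.compile(r"^Feature:\s*(.+)$")


def convert(legacy_text):
    """Split legacy text into XML fragments; return (fragments, leftovers)."""
    fragments, leftovers = [], []
    for raw in legacy_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            fragments.append([f"<title>{escape(match.group(1))}</title>"])
        elif fragments:
            fragments[-1].append(f"<para>{escape(line)}</para>")
        else:
            leftovers.append(line)  # no rule applies: leave for manual clean-up
    return [f"<feature>{''.join(parts)}</feature>" for parts in fragments], leftovers


LEGACY = """Model A User Manual
Feature: CD drive
Press the eject button to open the tray.

Feature: Brightness control
Turn the dial to adjust brightness & contrast.
"""

fragments, leftovers = convert(LEGACY)
for xml_text in fragments:
    ET.fromstring(xml_text)  # raises ParseError if a fragment is not well-formed
print(len(fragments), "fragments;", len(leftovers), "line(s) for manual clean-up")
```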
 
 

Documentation Component

 
Most systems include a Documentation Component that consists of:
 
  • User's Manual
  • On-Line User's Help
  • System Administrator's Manual
 
 

Training Component

 
Most systems include a custom user-training course that must be developed based on the particulars of the document types and user interface of the system. In addition, time might optionally be budgeted for a system administrator training course. The project should budget time for both the development and delivery of the training courses.
 
 

Consultancy Component

 
This component contains all the analysis, design and custom integration work not covered explicitly in the individual components listed previously. Generally, this includes work on the overall system design and the integration of the components, such as:
 
  • System Requirements Document (SRD): this document defines the functional, performance, and security requirements for the system being built or procured;
  • Design work, including the DTD design for the fragments and all the publication documents, the overall production workflow design, the design of any custom user interfaces required by the system, and the design of any custom programs required by the system;
  • System Design Document (SDD): this deliverable documents the work performed in the previous item, such as the user interface, the reports, and the low-level design (if desired) of custom code being produced for the system;
  • System Test Plan (STP): this document details the Acceptance, Component, and Unit Tests;
  • System Project Plan (SPP): this document details the project Schedule, Costs, and Resources; and
  • a Management budget, typically a 10-20% assignment for one person over the duration of the project.
 
 

Costing an Example

 
 

Example Scenario

 
Consider a manufacturer of computers that produces a variety of different models, from a low-end entry-level model up to a top-of-the-range system. The documentation for each model must be specific to that model; however, models often share hardware features. For example, some but not all models have integral CD drives, whereas all have a brightness control on the monitor. Therefore the models share a common feature base, and a natural shareable document fragment becomes a "hardware feature write-up". The fragments also include graphics that illustrate the features.
 
Assume also that there is an existing database of individual model manuals, each stored as an MS Word document.
 
The features are combined into two output publications: a printed user manual for specific models, and an HTML document describing all the features across all the models, suitable for web browsing. Two full-time authors handle the authoring of the feature descriptions and a production staff of two handles the production of the manuals. Finally there is a departmental manager to whom this entire staff reports.
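
The selection of fragments for the two publications in this scenario can be sketched in Python as follows. The model names and the assignment of the CD drive to particular models are assumptions made purely for illustration; the scenario states only that some models have a CD drive and that all have a brightness control.

```python
ALL_MODELS = None  # marker meaning "this feature applies to every model"

# Hypothetical applicability table: feature fragment -> models that have it.
FEATURE_APPLICABILITY = {
    "brightness-control": ALL_MODELS,
    "cd-drive": {"mid-range", "top-of-the-range"},
}


def features_for(model):
    """Fragments to include in the printed user manual for one model."""
    return [
        feature
        for feature, models in FEATURE_APPLICABILITY.items()
        if models is ALL_MODELS or model in models
    ]


def all_features():
    """Fragments to include in the HTML document covering all models."""
    return list(FEATURE_APPLICABILITY)


print(features_for("entry-level"))       # ['brightness-control']
print(features_for("top-of-the-range"))  # ['brightness-control', 'cd-drive']
print(all_features())                    # ['brightness-control', 'cd-drive']
```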
 
In this example, whether work is done in-house or by outside contractors, only two day rates will be used. All prices are in pounds sterling and assume a day rate of £700 for analysts, and £400 for conversion, documentation, and training specialists.
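
The arithmetic behind each component estimate is simply product prices plus days of effort multiplied by the relevant day rate. A minimal Python sketch is given here; the two day rates are those stated above, while the role labels and the numbers of days in the usage example are hypothetical and are not taken from the estimates in this paper.

```python
# Day rates in pounds sterling, as stated in the example scenario.
DAY_RATES = {"analyst": 700, "specialist": 400}


def service_cost(role, days):
    """Cost of a service line item: days of effort times the day rate."""
    return DAY_RATES[role] * days


def component_cost(product_costs, services):
    """Total for one component: product purchases plus all service items."""
    return sum(product_costs) + sum(service_cost(role, days) for role, days in services)


# Hypothetical line items: 5 analyst days and 3 specialist days, no products.
example = component_cost([], [("analyst", 5), ("specialist", 3)])
print(example)  # 4700
```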
 
 

Component Costings

 
The spreadsheet below represents a brief set of costs for a repository component for this example. The budgeted costs include the purchase of an SGML repository server and clients, consultancy time for application development such as defining the cabinet architecture and custom attributes, installation, and tuning.

 
Example Repository Component Estimate

 
 
Budgeted costs for the workflow component include workflow management software that is not pre-integrated with the repository, the writing of integration software programs that cause documents to be checked into and out of the repository when specific workflow events and commands occur, and the development of additional custom management reports from the audit trail information maintained by the workflow management system (beyond those provided off the shelf).

 
Example Workflow Component Estimate

 
 
Within the authoring component, the estimate includes an SGML editor for each of the 2 authors and a bridge from the editors back to the repository. There is also a line item for a graphics editor for use by one of the authors.

 
Example Authoring Component Estimate

 
 
For composition, the budget covers the costs of a composition engine and time to develop formatting styles, TOC generation, and index generation for the printed manual.

 
Example Composition Component Estimate

 
 
As discussed above, for this system electronic delivery is accomplished via Web delivery. For the purposes of this example, it is assumed that an MS Internet Information Server (as part of Windows NT Server) will be available. The budgeted costs include time to develop an SGML to HTML conversion script for the material to be displayed on the Web pages (these scripts to be used with the conversion tool budgeted in the conversion component below).

 
Example Electronic Delivery Component Estimate

 
 
For conversion of legacy data, the budgeted costs include an SGML conversion tool that supports regular expressions and SGML parsing, time for an analyst to develop a set of conversion scripts to read the existing MS Word RTF files and convert them to an initial version of SGML, and time for a conversion specialist to do any required manual clean-up.

 
Example Conversion Component Estimate

 
 
For documentation, the budget includes the development of custom on-line user documentation and a user manual based on the specific workflows used by the system. There is also time allotted for the development of a small system administration document that describes how to use the administration tools (backup, recovery, etc.) provided by the DBMS system, the repository system, and the workflow system, in an integrated way specific to the particular hardware environment.

 
Example Documentation Component Estimate

 
 
The project budget includes the development and delivery of both a short user training class and a system administrator class. In this case, the development of the training is done by analysts and the delivery by training specialists.

 
Example Training Component Estimate

 
 
Finally, there are budgeted costs for the development of a statement of requirements; a documented system design; a test plan; a project plan and schedule; the DTDs for the fragments and the printed publication; and time for project management.

 
Example Consultancy Component Estimate

 
 
 

Total Estimated Cost

 
Using a high-level component breakdown methodology, this example shows that the total estimated cost for this fictional system is £177,500, distributed according to the following percentages:

 
Total Cost Estimate

 
 
It should be noted that the numbers given above are an example only. Costs for an actual system would depend on its particular requirements. However, the example does show the use of the framework, from the determination of the high-level components through to the development of a cost estimate.
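
The roll-up from component estimates to a total and a percentage distribution can be sketched in Python as follows. The component amounts used here are hypothetical placeholders chosen for easy arithmetic; they are not the figures from the estimates in this paper and do not sum to the total quoted above.

```python
def cost_distribution(component_costs):
    """Return the total cost and each component's share as a percentage."""
    total = sum(component_costs.values())
    shares = {
        name: round(100 * cost / total, 1)
        for name, cost in component_costs.items()
    }
    return total, shares


# Hypothetical component totals in pounds sterling (illustrative only).
EXAMPLE = {"Repository": 30000, "Workflow": 20000, "Authoring": 10000}

total, shares = cost_distribution(EXAMPLE)
print(total)   # 60000
print(shares)  # {'Repository': 50.0, 'Workflow': 33.3, 'Authoring': 16.7}
```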
 
 

Conclusions

 
This paper has presented a high-level architecture for a document-object publishing system, describing common tools and features that can be purchased in the current market, and noting where customisations must be performed. The resulting architecture can be used as the basis to arrive at an estimate of the system's costs. As the estimating and procurement processes progress, the high-level architecture can be broken down to a finer level of granularity to improve the accuracy and quality of the estimate. Whether at the given high level or at subsequent lower levels of detail, such an architecture can contribute to the development of cost-benefit studies, proposals, RFPs, and RFIs.
