
 

A new metaphor for editing structured documents

Christian Wallgren
Product Manager
PharmaSoft AB
P.O. Box 1237
S-751 42 Uppsala
Sweden
Phone: +46 18 185452
Fax: +46 18 109200
Email: Christian.Wallgren@pharmasoft.com
Web: http://www.pharmasoft.com
 
Biographical notice:
 
Christian has been in the SGML business for the last seven years. He started as Product Manager for the large-scale launch of SGML editing systems at Ericsson Telecom. He then founded his firm, Publishing Development, where he directed the development of the DTD editor SGML Companion. For the last three years Christian has been working in the Core Technologies group of PharmaSoft, where he is now the Product Manager of the PS Author product.
 
PharmaSoft is a global company focused on improving the overall productivity of the pharmaceutical marketplace by developing information systems built on in-depth understanding of pharmacology and a high degree of technical competence.
 
ABSTRACT:
 
Using a general business metaphor instead of word-processing may prove successful for certain types of documents. This paper presents a practical solution in which this metaphor has been implemented.
 

The Case study

 

Background

 

This presentation will show how a new metaphor for editing may be used for certain types of highly structured and formally regulated documents.
 

It is based on our experiences from the design and development of an SGML/XML-based tool for the pharmaceutical industry.
 
I hope it will be of interest even to people outside our industry, because our findings clearly have some general applicability, from both a practical and a theoretical point of view.
 

Formal documents

 
"Formal document" is a concept that has no theoretical definition. Such documents resemble legal documents like contracts, insurance policies etc.
 

In the interaction between the Pharmaceutical Industry, the Regulatory Authorities and Public Health, documents of this type are widely used. Examples are the "Summary of Product Characteristics" and the "Package Insert". These formal descriptions of the product are of vital importance both for the physicians and for the patients. They are bound to be as regulated as the product itself, which means that every sentence and every word is weighed carefully by the authorities before approval. In some cases the text has to be translated into each of the EC languages and all versions have to be approved at the same time.
 
From an SGML/XML point of view, these documents have the following characteristics:
 
  1.  They are dictated by central authorities, often in the form of guidelines and templates
  2.  They have a strict, mostly sequential, structure. For an example of this, see WAL-001.
  3.  Their functional parts are mostly mandatory and not repetitive. Recursion is seldom involved.
  4.  The model groups are not nested

Figure WAL-001. The typical formal document: an SPC (Summary of Product Characteristics)
 

Today these documents are mostly prepared in word-processing environments, which means that the structure is supported only by templates and guidelines and not enforced by any mechanism. In some cases templates have been issued by the authorities to be used by the companies, e.g. by the Swedish and Danish MPA (Medicinal Products Agency).
 
What is expected from a system handling these documents? In addition to the usual word-processing features, the following features seem to be the most important when editing and managing formal documents:
 
  •  Version control. Maintenance of versions applies not only to the content, but also to the structure, fixed titles and instructions
  •  Auditing ("who wrote what, where and when") and rigorous user access control
  •  Comparison between documents and versions of the same document
  •  Collaborative authoring between and within organizations and on different parts of the document
  •  Ability to exchange the documents between disparate organizations, typically between a company and an authority
 

The traditional editing environment

 

The MS Word output format (.doc) has until now been recommended by EMEA (the European Agency for the Evaluation of Medicinal Products). The reason is that MS Word is so widely used that its storage format serves as a de facto standard. The metaphor is the traditional word-processing document.
 
Discussing this with the pharmacists and other experts, we pointed out that the word-processing format now being used
 
  1.  is not an open standard, meaning that it could be changed overnight and that there is no public, vendor-independent mechanism for monitoring the standard
  2.  does not separate content from formatting instructions, which makes it unsuitable for qualified analysis
  3.  does not have support for schemas, which makes it impossible to automate a structured form of editing
 
This seemed to bother them, especially considering the longevity of the formal documents. Also, after we had shown them the functionality of an SGML editor, they thought that this kind of tool would serve the need to enforce the formal structure of the document.
 
The professionals emphasized the need for better control and better management of information. Although they took for granted that the necessary word-processing features would still be available, they seemed willing to trade off some of the "vanilla" in exchange for faster processing and simplicity. This is easy to understand, since each day of delay in getting an approval is extremely costly for the companies, and any facility that relieves the authors of dealing with styles and layout is welcome.
 
The task for the authorities in this process is to audit and comment on the document, sometimes even making amendments to it. The documents may then oscillate between the company and the authority, each round generating a new working version. Finally the document is approved by the authority.
 
After a Summary of Product Characteristics or a Package Insert has been approved, the authority would like to have the document stored in a way that enables context-directed search on phrases through all products in the database, regardless of manufacturer. Thus, it must be possible to make analyses like: Give me all the sentences within the chapter "Undesirable effects" between 1980 and 1999 containing the phrase "antiretroviral agents" in EMEA SPCs.
 
If regular relational DBMS methods are used for storing the documents, it will be easy to integrate them with the existing and planned systems.
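 
The kind of analysis quoted above can be sketched against an ordinary relational store. The following is a minimal illustration only, using Python's built-in sqlite3 module. The table layout (a single `chapter` table), the sample rows and the naive sentence splitting are all assumptions made for this example and do not describe the schema of any actual system.

```python
import re
import sqlite3

# Hypothetical one-table layout: one row per approved chapter.
SCHEMA = """
CREATE TABLE chapter (
    agency        TEXT,     -- approving authority, e.g. 'EMEA'
    product       TEXT,
    heading       TEXT,     -- fixed chapter title from the template
    approved_year INTEGER,
    body          TEXT      -- plain text of the chapter
)
"""

# Invented sample data, for illustration only.
ROWS = [
    ("EMEA", "Product A", "Undesirable effects", 1997,
     "Nausea was reported. Use with other antiretroviral agents may increase toxicity."),
    ("EMEA", "Product B", "Undesirable effects", 1979,
     "Headache was seen with antiretroviral agents."),
    ("EMEA", "Product C", "Contraindications", 1998,
     "Do not combine with antiretroviral agents."),
]


def find_sentences(conn, agency, heading, first_year, last_year, phrase):
    """Return (product, sentence) pairs whose sentence contains the phrase."""
    cursor = conn.execute(
        "SELECT product, body FROM chapter "
        "WHERE agency = ? AND heading = ? AND approved_year BETWEEN ? AND ? "
        "ORDER BY product",
        (agency, heading, first_year, last_year),
    )
    hits = []
    for product, body in cursor:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", body):
            if phrase.lower() in sentence.lower():
                hits.append((product, sentence))
    return hits


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO chapter VALUES (?, ?, ?, ?, ?)", ROWS)

result = find_sentences(conn, "EMEA", "Undesirable effects", 1980, 1999,
                        "antiretroviral agents")
print(result)
```

Because the chapter heading is stored as a separate column, the context restriction ("within the chapter ...") becomes a plain WHERE clause, while the phrase search is applied only to the rows that survive it.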
 

The hunt for a functional environment

 
Given this information, we began to realize that replacing MS Word was a real challenge from the point of view of easy authoring, but that our chances were still not too bad. The main obstacles were the traditionalism and conservatism among some users.
 
We realized that we would have to avoid technicalities and to functionally delimit the environment in order to gain acceptance. General SGML/XML concepts like elements, entities, attribute values, notations etc. must be hidden.
 
In order to achieve this goal we could customize a commercial SGML- or XML-editor. There are excellent products with built-in customization tools, APIs and macros, with which we could hide the technicalities and enhance the user friendliness.
 
Also, our impression, for the time being and in this environment, is that the users are reluctant to replace the mainstream word processor with a general all-purpose XML/SGML editor. Since the usage of the editor is limited to certain applications, highly competent but general editors will be hard to justify financially.
 
We chose instead to use the metaphor of a general business application.
 

An application specific editor

 
We decided to build an application specific editor. In SGML terms, that meant an editor to be used for a predefined set of DTDs. As said before, the demand was to use generally available storage techniques, such as a relational DBMS. The second design decision was to fully integrate the database with the editor and merge them into one product.
 
After two years of development the first beta of our product was shipped to customers in December 1998.
 

This product, which is named PS Author, currently has the following features:
 
  •  imports and exports SGML
  •  exports XML, HTML and Rich Text Format
  •  loads and saves each chapter as an SGML Micro-Document in a standard relational DBMS
  •  maintains incremental and user defined versions
  •  maintains check-in/check-out of individual chapters on user level and supports collaborative authoring in the way that any saved result is immediately visible to all participating authors of that document
  •  graphically presents differences between documents and versions of documents
  •  maintains the variable part of the schema in the database
  •  uses XSL and DSSSL style-sheets for formatting processes
  •  provides a general word-processing environment, being able to handle pictures, tables, comment fields etc
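 
Two of the features listed above, storing each chapter as a versioned fragment in a relational DBMS and comparing versions, can be illustrated with a small sketch. It is not the implementation of PS Author: the table `chapter_version`, its columns, the document and author identifiers and the sample fragments are hypothetical, and Python's sqlite3 and difflib modules stand in for the real database and the graphical comparison.

```python
import difflib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: every save of a chapter becomes a new numbered version.
conn.execute(
    "CREATE TABLE chapter_version ("
    " doc_id TEXT, chapter TEXT, version INTEGER, author TEXT, sgml TEXT,"
    " PRIMARY KEY (doc_id, chapter, version))"
)


def save_chapter(doc_id, chapter, author, sgml):
    """Store the fragment as the next version of the chapter; return its number."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM chapter_version "
        "WHERE doc_id = ? AND chapter = ?",
        (doc_id, chapter),
    ).fetchone()
    conn.execute(
        "INSERT INTO chapter_version VALUES (?, ?, ?, ?, ?)",
        (doc_id, chapter, latest + 1, author, sgml),
    )
    return latest + 1


def load_chapter(doc_id, chapter, version):
    """Return the stored fragment for one version of a chapter."""
    (sgml,) = conn.execute(
        "SELECT sgml FROM chapter_version "
        "WHERE doc_id = ? AND chapter = ? AND version = ?",
        (doc_id, chapter, version),
    ).fetchone()
    return sgml


def compare_versions(doc_id, chapter, old, new):
    """Return a line-based unified diff between two versions of a chapter."""
    old_lines = load_chapter(doc_id, chapter, old).splitlines()
    new_lines = load_chapter(doc_id, chapter, new).splitlines()
    return list(difflib.unified_diff(
        old_lines, new_lines, f"v{old}", f"v{new}", lineterm=""))


v1 = save_chapter("spc-1", "pharmady", "author1",
                  "<pharmady>\n<p>First wording.</p>\n</pharmady>")
v2 = save_chapter("spc-1", "pharmady", "author2",
                  "<pharmady>\n<p>Revised wording.</p>\n</pharmady>")
for line in compare_versions("spc-1", "pharmady", v1, v2):
    print(line)
```

Keeping the author with every version row is one simple way to answer the auditing question of who wrote what, where and when.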
 
The customers who have tested this solution in production have reacted mainly positively, with one exception: they missed a function to import their stock of MS Word documents into the product. This will be possible in the next release of the product.
 

The theory part

 

The choice of schema language

 
In PS Author the traditional SGML DTD is used as a schema carrier. The DTD also has a grammar role in SGML, but not in XML. The advantages of the DTD syntax are:
 
  •  It is a part of both SGML and XML, and no other schema language has been standardized yet
  •  There is software supporting it, like the SP parser by James Clark
 
The weakness lies in its limited data typing capacity.
 
We are looking at using the Document Content Description for XML proposal as an alternative schema language in a forthcoming release of the product.
 

The use of an enabling architecture

 
An interesting question is: "How can we formally specify the range of possible schemas, which can be handled by the editor?"
 

The specification of the SGML Extended Facilities, formally contained in the HyTime standard, gives us Architectural Forms. By using an enabling architecture it could be possible to require each client DTD to conform to the meta-DTD. This is described by Steven Newcomb, who says that "DTDs can be permitted to change in any way that does not violate the constraints imposed by the SGML architecture".
 
Before we try to answer this question, it must be clarified that although testing conformance of schemas against an architecture may be done programmatically, this is not the task of an architecture engine. The architectural validation performed by this engine is done on the instance and not the DTD.
 
But is it really possible to verify whether a DTD produces only instances that are architecturally valid with respect to one or more specified architectures? I will answer this by showing an example.
 
By using the standard of Architectural Forms we have made PS Author an engine of its own enabling architecture. Today only one architecture is used. The forms of this architecture are itemized in WAL-002.
 
Table WAL-002. The PS Author main architecture

Name  Meaning           Content                                           Attributes
M     Document element  Me, Id, Bs
Me    Meta information  Specified elements for meta information.
                        Order: sequential. Occurrence: single-mandatory.
Id    Identification    Specified elements for identification.
                        Order: sequential. Occurrence: single-mandatory.
Bs    Body-structure    (Bh | Ft | St)+
Bh    Body-heading      (Bh | Ft | St)*                                   Heading
Ft    Full-text         Specified structure with mixed content meta       Heading
                        elements, type I: paragraph and tables (CALS),
                        picture, comment, emphasis
St    Simple-text       Specified structure with mixed content meta       Heading
                        elements, type II: paragraph, emphasis
 
To illustrate how this works, let us select the element "5.1 Pharmacodynamic properties", which is a subelement of "5. Pharmacological properties". "5.1 Pharmacodynamic properties" must be followed by "5.2 Pharmacokinetic properties" (see WAL-002). Both are derived from the "Full-text" form and they are children of the "Body-heading" form.
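 
How forms constrain an instance can be sketched in a few lines. The example below is an illustration only: the `FORM_OF` mapping from element types to forms is hypothetical (only `pharmpro`, `pharmady` and `pharmaki` are element names from the SPC DTD, and `body` is invented), only the Bs and Bh rows of the main architecture are checked, and Python's xml.etree.ElementTree stands in for a real architecture engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from client element types to architectural forms.
FORM_OF = {
    "body": "Bs",
    "pharmpro": "Bh",
    "pharmady": "Ft",
    "pharmaki": "Ft",
}

# Forms allowed as children, taken from the Bs and Bh rows of the table:
# Bs = (Bh | Ft | St)+   and   Bh = (Bh | Ft | St)*
ALLOWED_CHILD_FORMS = {"Bs": {"Bh", "Ft", "St"}, "Bh": {"Bh", "Ft", "St"}}
AT_LEAST_ONE_CHILD = {"Bs"}


def architectural_errors(element):
    """Collect violations of the Bs/Bh content rules at and below this element."""
    errors = []
    form = FORM_OF.get(element.tag)
    children = list(element)
    if form in ALLOWED_CHILD_FORMS:
        if form in AT_LEAST_ONE_CHILD and not children:
            errors.append(f"<{element.tag}> ({form}) needs at least one child")
        for child in children:
            child_form = FORM_OF.get(child.tag)
            if child_form not in ALLOWED_CHILD_FORMS[form]:
                errors.append(
                    f"<{child.tag}> is not allowed inside <{element.tag}> ({form})")
            errors.extend(architectural_errors(child))
    return errors


valid = ET.fromstring(
    "<body><pharmpro><pharmady/><pharmaki/></pharmpro></body>")
reordered = ET.fromstring(
    "<body><pharmpro><pharmaki/><pharmady/></pharmpro></body>")
invalid = ET.fromstring("<body><pharmpro><body/></pharmpro></body>")

print(architectural_errors(valid))
print(architectural_errors(reordered))
print(architectural_errors(invalid))
```

Note that the check works on the instance, not on the DTD, and that the reordered instance passes as well: the main architecture by itself says nothing about the order of 5.1 and 5.2.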
 
However, there is no guarantee that the element type generating this instance has the form of
 
<!ELEMENT pharmpro - - (pharmady, pharmaki)>
 
which is the declaration in the SPC-DTD. It could well have been generated by an element type like
 
<!ELEMENT pharmpro - - (pharmaki & pharmady)>
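 
Whether a declaration uses a pure sequence group can be tested directly on the DTD. The following sketch is an illustration under narrow assumptions, not a part of PS Author: it handles only a single, un-nested model group in an SGML element declaration, with or without tag minimization parameters, and the function name `is_strict_sequence` is invented here.

```python
import re

# Matches e.g. <!ELEMENT pharmpro - - (pharmady, pharmaki)>
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>[\w.-]+)\s+"      # element type name
    r"(?:[-oO]\s+[-oO]\s+)?"                 # optional tag minimization
    r"\((?P<group>[^()]*)\)"                 # one un-nested model group
    r"(?P<occurrence>[?*+]?)\s*>"            # occurrence indicator on the group
)


def is_strict_sequence(declaration):
    """True if the content model is (A, B, ..., N): sequential, every
    member mandatory and non-repeatable, no nesting."""
    match = ELEMENT_DECL.fullmatch(declaration.strip())
    if match is None or match.group("occurrence"):
        return False
    group = match.group("group")
    if "&" in group or "|" in group:
        return False                          # and-group or or-group
    members = [member.strip() for member in group.split(",")]
    # A bare name token carries no ?, * or + suffix.
    return all(re.fullmatch(r"[\w.-]+", member) for member in members)


print(is_strict_sequence("<!ELEMENT pharmpro - - (pharmady, pharmaki)>"))
print(is_strict_sequence("<!ELEMENT pharmpro - - (pharmaki & pharmady)>"))
```

The first declaration is accepted and the second rejected, although both can generate the same instance. A production check would of course need a real DTD parser rather than a regular expression.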
 
In the current release of the product, only model groups that are sequential, mandatory and non-repeatable, i.e. of the type (A, B, C, ..., N), are permitted for certain forms. Thus, there is a need for a second architecture, which is currently not implemented but used implicitly. The problem here lies in the way this restriction has to be expressed. E.g. an ordering architecture for model groups containing 1 to 3 elements would look like:
 
Ordering architecture

Name     Meaning            Content
Bs       Body-structure     (C1, (C2, C3?)?)
C1 - C3  1st - 3rd element  (C1, (C2, C3?)?)?
 
As can be seen, this is a rather clumsy solution to the problem when the number of elements grows.
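 
How quickly the ordering model grows can be seen by generating it. The helper below is a hypothetical sketch that simply extends the nesting pattern of the Body-structure row, (C1, (C2, C3?)?), to n elements. The form names C1 ... Cn follow the table; the function itself is invented for illustration.

```python
def ordering_model(n):
    """Content model accepting the first k of C1..Cn in order, 1 <= k <= n."""
    if n < 1:
        raise ValueError("at least one element is required")
    if n == 1:
        return "(C1)"
    # Build from the innermost element outwards: Cn is optional after Cn-1,
    # the pair (Cn-1, Cn?) is optional after Cn-2, and so on.
    model = f"C{n}?"
    for i in range(n - 1, 1, -1):
        model = f"(C{i}, {model})?"
    return f"(C1, {model})"


for n in range(1, 6):
    print(n, ordering_model(n))
```

Each additional element adds one more level of nesting, so a chapter with ten subchapters already needs nine nested groups.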
 
The conclusion is that Architectural Forms is a powerful concept in modeling different types of formal documents. For some purposes, like conversion using style-sheets, the main architecture is sufficient. For other purposes, more complex architectures are necessary, and these are hard to express in SGML DTD terms.
 
Whether Architectural Forms can serve as a formal model for describing the possible schemas that an editor can support remains to be proven.
 
An interesting use of Architectural Forms could be to let the product accept instances with no DTD, i.e. no explicit schema, and derive the schemas by using the architectures of the product. However, this is not an option in our case. We have to be able to formally declare what kinds of schemas, and hence what types of document structures, we can support, not only what we can accept.
 

Design aspects

 
The idea of assigning an architecture to the editing tool may seem to be a little bit backward. There are some specific merits in this approach, which I would like to discuss:
 
  1.  General rules can be utilized, which allows a very simple design of the product, see WAL-003. E.g. the low number of icons indicates a much simpler interface than that of MS Word.
  2.  Another obvious result of this limitation is the fact that we can use a general relational model, which covers both the schemas (since they are of limited scope) and the actual documents. Theoretically, this is not a great step forward, considering the competent object database engines that are available on the market today. Practically and pragmatically, on the other hand, it is of great importance, since we do not have to force the customer to buy new database software: the RDBMS is already there (in most cases).
  3.  Style sheets, although customizable by the use of DSSSL and XSL, can be used "off the shelf" because of the little variation in structures and because headings are taken from fixed attributes and hence are part of the schema (this choice was made because the heading of each chapter is an essential part of the regulation of the document).
  4.  It will be possible in the near future to have any compliant schema adopted by the product. This might seem to be a standard feature in any SGML or XML editor, but there is more to it than that. A compliant schema can instantly be capitalized upon, by being usable for editing, rendering and storing from the very moment it is adopted.

Figure WAL-003. PS Author: main window screen shot

 
 

Summary

 
I have tried to show that in some cases it may prove appropriate to use an editing metaphor that is related to the specific tasks of the business, rather than to general word-processing conventions. PS Author is an example of such an approach.
 
It has also been my ambition to explore the possibilities of delimiting the range of schemas and to show that such a limitation has its merits: an extremely simple implementation of storage and rendition mechanisms, simple adoption of schemas and a simple user interface.
 
Bibliography
Convention to be followed for templates, London, 18 September 1998, EMEA (The European Agency for the Evaluation of Medicinal Products), Technical Co-ordination Unit
François Chahuneau, "SGML & Schemas: From SGML DTDs to XML-DATA", SGML/XML Europe 98 proceedings
Document Content Description for XML, W3C Note NOTE-dcd-19980731, http://www.w3.org/TR/NOTE-dcd
Steve Newcomb, "SGML architectures: Implications and Opportunities for Industry", http://www.techno.com/sgmlarch.htm
