
 
 

An SGML-based Office Document Exchange and Management


 
Shy-Ming Ju
  Professor
  National Institute of Technology at Kaohsiung
Dept. of Information Management, 1 University Road, Yenchao
Kaohsiung, Taiwan, Republic of China
Phone: +886-7-6011000 ext. 4100
Fax: +886-7-6011042
Email: smju@ccms.nitk.edu.tw
 
Biographical notice:
 
Dr. Shy-Ming Ju
 
Dr. Shy-Ming Ju is Professor and Chairman of the Department of Information Management, National Institute of Technology at Kaohsiung. He is also a researcher with the Science and Technology Advisory Group of the Executive Yuan, ROC. Prof. Ju is devoted to promoting SGML in both the public and private sectors in Taiwan. His enthusiasm helped bring about Project YAO several years ago.
 
ABSTRACT:
 

A major thrust in the government of the Republic of China is to computerize the exchange and management of office documents. The Research Development and Evaluation Commission is about to promulgate a specification describing how office document exchange and management should be done. The author describes the development of the specification and a conforming SGML-based facility that could be used by an estimated 7,000 government offices.
 
As a true SGML application to a real and far-reaching case, the specification demonstrates the extensibility of SGML and the ease of conversion from SGML to XML.
 
 

Introduction

 
A major thrust in the government of the ROC (Republic of China) is to computerize the exchange and management of office documents. The RDEC (Research Development and Evaluation Commission), a government agency, is in charge of expediting this effort. RDEC identified the 25 types of office documents most commonly used by government agencies, and standardized their structure and style. Of these 25 types, nine are mandatory and the remaining 16 are optional but highly recommended. In principle, RDEC could give away free software to government agencies so that they can quickly and uniformly computerize their document exchange and management. In reality, RDEC cannot do so because of the following constraints:
  1. RDEC cannot endorse any particular vendor's software, because that would give the vendor an unfair competitive edge.
  2. RDEC cannot acquire the software, either by in-house development or by outsourcing, and then give it away, because that would amount to using taxpayers' money to jeopardize the livelihood of software vendors who are also taxpayers.
 
 
Therefore, RDEC can only promulgate a specification describing how office document exchange and management should be done. Software vendors can then develop their products according to this specification. A prototype office document creation and exchange facility that conforms to the specification helps demonstrate its feasibility and highlights the relevant technical issues.
 
When office documents are exchanged electronically in an open systems environment, the process must meet the following conditions:
  1. It must allow attachments of diverse types to be exchanged.
  2. It must be able to accommodate variations of an existing document type or additional document types created by individual agencies.
  3. It must be able to exchange documents between different hardware and software platforms without restriction.
  4. It must be able to render the electronic office document on paper or on screen according to a style.
  5. It must ensure that a document carries sufficient meta-information to facilitate document management and applications.
 
 
In essence, the above requirements mean that the specification must describe:
  1. How to define the logical and physical structures of any type of document.
  2. How to describe the style of any type of document.
  3. How to transport an office document along with its attachments to the recipient.
  4. How to provide meta-information in a document for the benefit of application software.
 
 
SGML has mechanisms to support all of the above functionality.
 
 

The Specification

 
Through a project funded by RDEC, we have developed a specification based on SGML that contains:
  1. A tag and attribute set for describing office documents.
  2. Twenty-five Document Type Definitions (DTDs) for describing the structures of the 25 types of office document. We call them the structure DTDs.
  3. A DTD for describing the style of any type of office document. We call it the style DTD.
  4. Twenty-five Document Instances (DIs) describing the styles of the 25 types of office documents. We call them the style DIs.
  5. A scheme for packing an office document along with relevant entities such as attachments, user-defined structure DTD and style DI for transmission.
 
 
In the following discussion, a particular type of document, the Meeting Notice, will be used as an example to highlight the various aspects of document exchange and management. The figure below shows a typical printed meeting notice, which consists of the following fields:
  • Document Type
  • Retention duration
  • File Number
  • Level of Urgency
  • Security Classification
  • Conditions for Declassification
  • To
  • C.C.
  • Date
  • Reference Number
  • Attachments
  • Purpose of Meeting
  • Date and Time of Meeting
  • Venue of Meeting
  • Convenor
  • Point of Contact
  • Telephone
  • Attendants
  • Observers
  • Remarks
  • From

 
A typical printed meeting notice

 
 
 

The Structure DTDs

 
To enhance readability and modifiability, as well as to simplify the design of style DIs, we decided to provide a separate structure DTD for each document type. The structure DTD for the Meeting Notice is shown in the figure below. Although this approach facilitates application software development, it makes maintenance of the DTDs tedious. In the next edition, we plan to use parameter entities for common constructs to reduce the most prominent repetitions.

 
Structure DTD for meeting notices

 
 
Based on the structure DTD above, the figure below shows the structure DI for the particular meeting notice pictured earlier. Note the way an attachment is referenced with an external entity name in an "ENTITIES" attribute. The entity name is then associated with a file name in the local file system through an "ENTITY" declaration, and the file type is defined with an "NDATA" construct in the declaration. Processing of the file type is further explained with a "NOTATION" declaration.
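
The chain from an "ENTITIES" attribute value to a file and its notation can be illustrated with a minimal Python sketch. It is not part of the specification or the prototype. The entity names, notation names, file names and system identifiers below are hypothetical, and the regular expressions handle only the simple declaration forms shown here, not full SGML syntax.

```python
import re

# Hypothetical declarations; all names and file names are illustrative only.
PROLOG = '''
<!NOTATION doc SYSTEM "wordproc">
<!NOTATION gif SYSTEM "imgview">
<!ENTITY agenda SYSTEM "agenda.doc" NDATA doc>
<!ENTITY map SYSTEM "venue.gif" NDATA gif>
'''

# Simplified patterns: SYSTEM identifiers in double quotes only.
NOTATION_RE = re.compile(r'<!NOTATION\s+(\S+)\s+SYSTEM\s+"([^"]*)"\s*>')
ENTITY_RE = re.compile(
    r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]*)"\s+NDATA\s+(\S+?)\s*>'
)


def resolve_attachments(prolog, entities_attr):
    """Resolve each name in an ENTITIES attribute value to
    (entity name, file name, notation name, notation system identifier)."""
    notations = dict(NOTATION_RE.findall(prolog))
    entities = {
        name: (path, notation)
        for name, path, notation in ENTITY_RE.findall(prolog)
    }
    resolved = []
    for name in entities_attr.split():
        path, notation = entities[name]
        resolved.append((name, path, notation, notations[notation]))
    return resolved


if __name__ == "__main__":
    for row in resolve_attachments(PROLOG, "agenda map"):
        print(row)
```

A receiving application could use the notation's system identifier to decide which helper program should process each attachment.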

 
Structure DI for the particular meeting notice

 
 
 

The Style DTD

 
On closer observation, we found that the styles of all office documents are composed of line segments, literal strings, and invisible rectangles that are to be filled with document contents. We therefore designed the style DTD shown in the figure below.

 
Style DTD

 
 
This style DTD provides syntax for describing:
  1. The type of document and its page size, page orientation and coordinate system.
  2. The coordinates of a line segment and the global attributes for a group of line segments, such as their pattern (i.e. solid or dotted line), thickness and color. Every line segment must belong to a line segment group.
  3. A literal string and the rectangle in which it resides, and the global attributes for a group of literal strings, such as their orientation, alignment, line spacing, font and size. Every literal string must belong to a literal string group.
  4. A rectangle with a unique identifier and the coordinates of its upper-left and lower-right corners, and the global attributes for a group of rectangles, such as the placement of the rectangles (fixed or floating) and the relevant attributes for the contents therein.
  5. Last but not least, the mapping of contents in a structure DI into rectangles in a style DI. The mapping keys are the tag name in the structure DI and the rectangle ID in the style DI.
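
The five kinds of information listed above can be pictured as an in-memory model. The following Python sketch is one possible representation, offered only as an illustration. The class and field names are assumptions and do not reproduce the element or attribute names of the style DTD itself.

```python
from dataclasses import dataclass, field


@dataclass
class LineGroup:
    pattern: str                 # "solid" or "dotted"
    thickness: float
    color: str
    # Each segment is (x1, y1, x2, y2); segments exist only inside a group.
    segments: list = field(default_factory=list)


@dataclass
class StringGroup:
    orientation: str
    alignment: str
    line_spacing: float
    font: str
    size: float
    # Each entry is (text, (left, top, right, bottom)).
    strings: list = field(default_factory=list)


@dataclass
class RectGroup:
    placement: str               # "fixed" or "floating"
    # Rectangle ID -> ((left, top), (right, bottom)).
    rectangles: dict = field(default_factory=dict)


@dataclass
class Style:
    doc_type: str
    page_size: str
    page_orientation: str
    line_groups: list = field(default_factory=list)
    string_groups: list = field(default_factory=list)
    rect_groups: list = field(default_factory=list)
    # Tag name in the structure DI -> rectangle ID in the style DI.
    mapping: dict = field(default_factory=dict)


def dangling_mappings(style):
    """Return tag names whose mapped rectangle ID is not defined."""
    defined = set()
    for group in style.rect_groups:
        defined.update(group.rectangles)
    return sorted(t for t, rid in style.mapping.items() if rid not in defined)


if __name__ == "__main__":
    style = Style("meeting-notice", "A4", "portrait")
    style.rect_groups.append(RectGroup("fixed", {"r1": ((10, 10), (90, 20))}))
    style.mapping = {"PURPOSE": "r1", "VENUE": "r9"}
    print(dangling_mappings(style))
```

Keeping segments, strings and rectangles inside their groups mirrors the rule that every such item must belong to a group, and the `dangling_mappings` check shows how a tool might verify a mapping before rendering.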
 
 
Based on the style DTD, a style DI for the meeting notice is shown in the first figure below. Note that the style DTD is applicable to any type of form, such as the blank form in the second figure below. In fact, the style DTD can describe a blank form quite accurately. However, the style DTD is not meant for sophisticated typesetting. When contents are placed into a blank field, it does not have sufficient expressive power to reproduce the exact appearance of the original document. For office document exchange, machine processing at the receiving end does not depend on style, and human readers are satisfied if the structure and content of a document can be correctly deduced from the hints the style provides. In such cases, "approximate rendering" is acceptable in place of perfect reproduction.

 
A style DI for the meeting notices

 

 
A blank form

 
 
 

The Packing Scheme

 
We have used a profile of the ISO standard SDIF (SGML Document Interchange Format) for packing the various entities for exchange. The packing scheme is shown in the figure below. Selecting a transport mechanism is beyond the scope of this specification, but TCP/IP, SMTP or HTTP will be adequate.

 
The packing scheme

 
 
The 25 structure DTDs and style DIs will be published with public identifiers, so both the sender and the receiver will have access to them during document exchange. Therefore, the structure DTD and style DI associated with an office document need not be packed in, unless they are created by the sender.
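
The packing rule just described can be sketched in Python as follows. This is an illustration of the selection logic only. It does not produce the SDIF encoding, and the class, function and file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OfficeDocument:
    structure_di: str            # the document instance itself
    structure_dtd: str           # identifier of its structure DTD
    style_di: str                # identifier of its style DI
    attachments: list = field(default_factory=list)


def entities_to_pack(doc, published):
    """List the entities that must travel with the document.

    `published` is the set of identifiers both ends already hold,
    i.e. the publicly available structure DTDs and style DIs.
    """
    pack = [doc.structure_di] + list(doc.attachments)
    for ident in (doc.structure_dtd, doc.style_di):
        if ident not in published:   # created by the sender, so pack it
            pack.append(ident)
    return pack


if __name__ == "__main__":
    published = {"std-notice.dtd", "std-notice.sty"}
    doc = OfficeDocument("notice.sgm", "std-notice.dtd", "my-notice.sty",
                         ["agenda.doc"])
    print(entities_to_pack(doc, published))
```

In this example the sender uses the standard structure DTD but its own style DI, so only the style DI is packed alongside the document and its attachment.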
 
 

A Conforming Facility

 
We have developed a prototype of a conforming document creation and exchange facility that consists of:
  1. An SGML parser that is a Chinese localization of James Clark's nsgmls. It parses a structure DI or a style DI and produces a normalized ESIS output.
  2. A DTD-driven editor that displays a tree structure according to a given structure DTD (see the first-level tree structure figure below) and guides a user through the document creation process. Tags are inserted automatically.
  3. A viewer that performs the following chores:
    1. Extracts line segment and literal string information from the ESIS output of the style DI, and constructs a blank form internally.
    2. Extracts mapping information as (name, rectangle ID) pairs and constructs a mapping table internally.
    3. Extracts tag names and the associated contents from the ESIS output of the structure DI, and uses each tag name to look up its (name, rectangle ID) pair. The content is then placed into the rectangle identified by the ID.
    4. Generates a virtual layout when the merging is complete, then renders the virtual layout on paper or on screen.
     
    The viewer's operation is outlined in the merging and rendering figure below, which is followed by an approximate rendering of the particular meeting notice.
  4. An SDIF packer/unpacker.
  5. A DMS (document management system) that is still under development. The DMS is built on an object-oriented database management system. It can automatically create a schema based on a given structure DTD, and store all documents of that type according to the schema. Document management can then take advantage of the query and search functionality of the underlying database management system.
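
The viewer's merging steps (item 3 above) can be sketched in Python over ESIS-style output of the kind nsgmls produces, where a line beginning with "(" opens an element, ")" closes it, "-" carries character data and "A" carries an attribute of the element that follows. This is a minimal sketch, not the prototype's code. The element names and the `MAP` element with its `NAME` and `RECT` attributes are assumptions made for illustration, and ESIS escape sequences in data are not decoded.

```python
def tag_contents(esis_lines):
    """Collect (tag name, text) for each run of character data."""
    stack, out = [], []
    for line in esis_lines:
        if line.startswith("("):
            stack.append(line[1:])
        elif line.startswith(")"):
            stack.pop()
        elif line.startswith("-") and stack:
            out.append((stack[-1], line[1:]))
    return out


def mapping_table(style_esis):
    """Build {tag name: rectangle ID} from hypothetical MAP elements."""
    table, pending = {}, {}
    for line in style_esis:
        if line.startswith("A"):
            parts = line[1:].split(" ", 2)   # name, type, value
            if len(parts) == 3:              # skip IMPLIED attributes
                pending[parts[0]] = parts[2]
        elif line.startswith("("):
            if line[1:] == "MAP":
                table[pending["NAME"]] = pending["RECT"]
            pending = {}                     # attributes apply to one element
    return table


def merge(structure_esis, style_esis):
    """Place each element's content into its mapped rectangle."""
    table = mapping_table(style_esis)
    layout = {}
    for tag, text in tag_contents(structure_esis):
        rect_id = table.get(tag)
        if rect_id is not None:
            layout.setdefault(rect_id, []).append(text)
    return layout


# Hypothetical ESIS fragments for a structure DI and a style DI.
STRUCTURE_ESIS = ["(NOTICE", "(PURPOSE", "-Budget review", ")PURPOSE",
                  "(VENUE", "-Room 301", ")VENUE", ")NOTICE", "C"]
STYLE_ESIS = ["(STYLE", "ANAME CDATA PURPOSE", "ARECT CDATA R1", "(MAP",
              ")MAP", "ANAME CDATA VENUE", "ARECT CDATA R2", "(MAP", ")MAP",
              ")STYLE", "C"]

if __name__ == "__main__":
    print(merge(STRUCTURE_ESIS, STYLE_ESIS))
```

The resulting dictionary plays the role of the virtual layout: each rectangle ID is paired with the content to be drawn inside it, over the blank form built from the line segments and literal strings.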
 

 
First-level tree structure

 

 
Merging and rendering process

 

 
An approximate rendering for the particular meeting notice

 
 
 

Conclusions

 
The SGML-based specification allows users to define new document types and styles using the predefined tag and attribute set, thus demonstrating the extensibility of SGML. The style DTD can describe any layout that is constructed from line segments, literal strings and invisible rectangles for character string positioning. This means that the specification can also support electronic form exchange and management.
 
The specification is a true SGML application in the sense that it completely separates document structure from document style. Because of its judicious use of SGML syntactic constructs, the specification can be easily converted to an XML-based one. If it is adopted by the government, an estimated 7,000 government offices and numerous private offices will use this specification and similar conforming facilities. It is therefore safe to say that this is an SGML application with profound consequences.
 
Additional references in Chinese are listed in the figure below.

 
Additional references in Chinese

 
 
