
 
 

Development of SGML/XML Middleware Component


 
Kunio   Ohno
  Chief Scientist
  INS Engineering Corporation
4-31-18 Nishi-Gotanda, Shinagawa-Ku
Tokyo     Japan  141
Phone: +81 3 3490 6104
Fax: +81 3 3490 6155
Email: ohno@inse.co.jp Web: www.inse.co.jp
 
Biographical notice:
 
Kunio Ohno
 

Kunio Ohno is the Chief Scientist of INS Engineering (INS-E) Corporation. He is engaged in the research and development of distributed multimedia systems at INS-E's Business Development Headquarters. His interests include agent technology and distributed objects. He is also a member of the OMG Japan SIG. Prior to joining INS Engineering in 1994, Kunio worked for NTT and its Human Interface Laboratories.
 
Morten   Beyer
  System Engineer
  Approach Incorporated
Takahashi Bldg. 3F, 3-2-7 Azabudai, Minato-Ku
Tokyo     Japan  106-0041
Phone: +81 3 5572 6152
Fax: +81 3 5572 6153
Email: morten@approach.co.jp Web: www.approach.co.jp
 
Biographical notice:
 
Morten Beyer
 
Morten Beyer is a system engineer with Approach Incorporated, a Japanese company specializing in SGML system integrations and consulting services.
 
ABSTRACT:
 API, Application Programming Interface 
Business applications
CORBA IDL
Custom API
Document Parsing
Element API
 Middleware  
Parse
 Relational database  
 Security 
Source Document API
Three-tier client/server system
Version Control
XML DOM
client/server system
 

An SGML/XML document management architecture based on a relational database has been developed. The core of the system is implemented as middleware components within a three-tier client/server system. The goal of the middleware is to provide interfaces for business applications.
 
The APIs of the middleware are classified into three groups. The first group, the "Source Document APIs", manages SGML/XML document instances directly. The second group, the "Element APIs", manages the parsed elements of the document instances. Navigation of the SGML/XML document and the selection, control, and updating of elements are performed through this API. The third group, the "Custom APIs", makes it possible to extend and customize the functions of various business applications.
 
Our concept of an SGML/XML database is that it should serve not only document management but also broader business application fields. SGML/XML is not simply a framework to define document structure but a framework for the APIs of business applications. This concept will become clearer when the API is defined in CORBA IDL, which we plan to implement in the near future.
 
From this point of view, the management of individual SGML/XML elements becomes important. Elements may carry a company's basic information, and they are also the natural unit for security, version control, transactions, etc. Managing the relationships between elements is important as well. It follows that document elements should be managed within an RDB.
 
From the above, the SGML/XML framework is naturally treated as the middleware of a three-tier client/server system: through the API it looks like an SGML document to the client, while its elements are managed by an RDB at the back end.
 
Although the APIs between client and middleware are currently written in C, we are working to convert some of them to CORBA IDL. The XML DOM, which is also defined in CORBA IDL, will be implemented after that.
 
 

Introduction

Access control
 CALS  
 Database 
Document Integration
ISO9000
Multimedia
 Security 
 Workflow 
 document management 
document management environment
workflow management
 

About five years ago, office workers in Japan wrote documents using Ichitaro or OASYS. Ichitaro is a word processing software package that runs on NEC's PC9800, while OASYS is a dedicated word processor machine. At that time, document management in Japanese organizations centered on the traditional paper-based approach, with the majority of business-critical information existing primarily in paper form. Since then, the document management environment has gradually changed. The change was caused partly by organizational requirements for restructuring and by the introduction of CALS and ISO9000. Another driving force has been the continued proliferation of PCs and the improvement in their networking capabilities based on the Internet and LANs. The rapid development of computer technology and the shift in requirements for managing and maintaining business information have affected organizational information systems in Japan. Some of the requirements are as follows:
 
  • Multimedia support
  • Integrating documents with the Internet
  • Integrating documents into business databases
  • Interoperability with existing documents such as word processor files, image files, etc.
  • Connection of documents to workflow management, for the automation of office work and for managing relationships between documents
  • Access control and security
 
 

Focus on Industrial Challenges

 
 

Market Needs for Document Management in Japan

De-facto standard
Departmental solutions
 

The key to open and flexible document management lies in the fundamental architecture of the document management effort. As the variety of departmental solutions and practices within a single organization shows, no de-facto standard exists today.
 
 

Technological Possibilities

 
 

API for Documents

Compound Document
Distributed Networks
 OMG 
OpenDoc
 W3C 
 

OMG's compound document standard [1] defines a set of APIs to manipulate the components of hierarchically structured documents. We doubt, however, that it will become popular, since it is based on Apple's OpenDoc. Microsoft's OLE, a rival of OpenDoc, only provides functions for linking and embedding applications in documents, and cannot be extended to distributed networks or to systems other than Windows. In short, a satisfactory architecture does not exist today. Even so, the W3C's XML (described later) looks promising.
 
 

Three Tier Client/Server System

 DTD, Document Type Definition 
LISP
Object Oriented Database
Three Tier Client/Server System
Transaction Processing system
 middleware 
 

When a database system and a transaction processing system are incorporated into a document management framework whose backbone is a client/server system, it is important to provide an API not only for the management of documents and business workflow, but also at a broader business application level. SGML is not simply a framework to define document structure, but a framework for the APIs of business applications. From another point of view, a document is nothing but a displayed or printed rendition of hierarchically arranged elements as defined by a DTD.
 
From this standpoint, the management of document elements as they appear in the SGML instance becomes important. In addition to carrying the basic document information, elements in a three-tier client/server system are the natural unit for security, version control, transaction control, etc. Managing the relationships between elements also becomes important. It follows that document elements should be managed in an RDB.
 
One may think that the hierarchical structure of SGML is more naturally managed in an OODB than in an RDB, but this is not actually the case. Just as the list structure of Lisp can easily define a tree structure, RDB tables can represent a tree by using a field as a pointer to another table, and can thus make element retrieval faster than with an OODB.
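The idea of holding an element tree in relational tables can be illustrated with a minimal sketch. It is not the schema used in our system: the table name `element`, its columns, and the helper functions are all hypothetical, SQLite stands in for the RDB, and the recursive query is a convenience of present-day SQL engines. Each row points to its parent row, so the "upper element" and "lower element" retrievals become ordinary indexed lookups.

```python
import sqlite3

# Hypothetical schema (an assumption for illustration): one row per element,
# with parent_id pointing at the parent element's row (NULL for the root).
SCHEMA = """
CREATE TABLE element (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES element(id),
    position  INTEGER NOT NULL,   -- order among siblings
    name      TEXT NOT NULL,      -- generic identifier, e.g. SECTION
    content   TEXT                -- character data, if any
);
CREATE INDEX element_parent ON element(parent_id, position);
"""


def build_demo_db():
    """Create an in-memory database holding a small sample tree."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    rows = [
        (1, None, 0, "MANUAL", None),
        (2, 1, 0, "TITLE", "Pump Maintenance"),
        (3, 1, 1, "SECTION", None),
        (4, 3, 0, "PARA", "Check the seals."),
        (5, 3, 1, "PARA", "Replace the filter."),
    ]
    conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def lower_elements(conn, element_id):
    """Names of the direct children, in document order."""
    cur = conn.execute(
        "SELECT name FROM element WHERE parent_id = ? ORDER BY position",
        (element_id,))
    return [row[0] for row in cur]


def upper_element(conn, element_id):
    """Name of the parent element, or None for the root."""
    cur = conn.execute(
        "SELECT p.name FROM element c JOIN element p ON c.parent_id = p.id "
        "WHERE c.id = ?", (element_id,))
    row = cur.fetchone()
    return row[0] if row else None


def path_to_root(conn, element_id):
    """Names from the element up to the root, following parent pointers."""
    cur = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM element WHERE id = ?
            UNION ALL
            SELECT e.id, e.parent_id, e.name, chain.depth + 1
            FROM element e JOIN chain ON e.id = chain.parent_id
        )
        SELECT name FROM chain ORDER BY depth
        """, (element_id,))
    return [row[0] for row in cur]
```

With the sample data, `lower_elements(conn, 1)` lists the children of MANUAL and `path_to_root(conn, 5)` walks from a PARA up to the root. Because siblings are found through the `(parent_id, position)` index, navigation does not require loading the whole document.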
 
From the above, the SGML framework is naturally treated as the middleware of a three-tier client/server system: through the API it looks like an SGML document to the client, while its elements are managed by an RDB at the back end. Figure 1 shows a representative network for such systems.

 
Figure 1. Three-tier Client/Server Network

 
 
 

Distributed System

 CORBA, Common Object Request Broker Architecture 
 DataCartridge 
Distributed System
 Middleware 
 OMG  
 OODB 
OQL
 RDB 
 SQL  
heterogeneous environment
 

Where the system must operate in a distributed network environment, the key is to use OMG's CORBA [2]. The CORBA standard is an interface specification, used when producing client programs, that provides interoperability between objects in a heterogeneous distributed environment; it does not define the actual implementation of a server. For the three-tier system above, a function that references the RDB is required. A query function is defined in the CORBA services [3], but it is based on queries to OODBs (OQL) and is not adequate for existing RDBs. Therefore, a hybrid API fusing CORBA and SQL should be considered. Oracle calls such middleware a "DataCartridge" [4].
 
 

Object Relational Approach for Multimedia management

Datablade
Illustra
Multimedia management
SQL3
 

To integrate middleware and RDB, an object-relational approach such as SQL3 is also possible. This approach enables direct queries on multimedia data. An ORDB system can define new data types together with their API functions, much as an object-oriented language defines a class with its methods.
 
Illustra is an ORDB that has data type libraries called Datablades. A Datablade is a library of functions (methods, in general terms) associated with data types (classes, in general terms) for applications in various areas.
 
Illustra provides an extremely favorable environment for storing, retrieving, and manipulating multimedia. Such features could not be realized with a conventional database, so Illustra can be said to be an ideal tool for building a document management system with multimedia as its main component.
 
 

Our Solution

 
 

SGML Datablade

 
 

Architecture

Illustra Datablade
 

We have tried to develop a document management system on Illustra that handles multimedia as document elements. This system is implemented as an Illustra Datablade. SGML, a language for describing the logical structure of a document, is applied as the framework of the Datablade. The system architecture is shown in Figure 2.

 
Figure 2. SGML Datablade

 
 
 

Basic Features

 
The basic features can be summarized as follows:
 
  • Registration of SGML data
  • Output of SGML data (usage/change)
  • Output/registration by element (dummy is buried except for output object)
  • Deletion of SGML data
  • Retrieval of instance from DTD/retrieval of DTD from instance
  • Retrieval of entity used in instance
  • Retrieval of entity used in element
  • Sharing of element class by plural instances
  • Retrieval of upper element
  • Retrieval of lower element
  • Retrieval of equivalent element
  • Retrieval of DTD, instance, entity and element by date
  • Retrieval of DTD, instance, entity and element by owner
 
At present, the server runs on Solaris, and the clients run Netscape on Windows 95/NT.
 
 

Characteristics

 
An Illustra system using this Datablade will have the following advantages compared with conventional SGML databases:
  • Suitable for retrieval, reference, distribution, and management rather than authoring
  • Can directly handle SGML elements including multimedia with database
  • Can retrieve element data by using other Illustra datablades
  • Can use any authoring tool, parser, viewer, etc.
 
    These characteristics result from our observation that conventional SGML database products were weak in features such as document management and workflow control for company operations, since they were developed as repositories (for manuals and the like) of SGML data.
     
    On the other hand, company information systems based on the Internet framework (simple intranets) with email and document distribution have quickly gained popularity thanks to their excellent cost/performance. There are problems, however, in access management and data protection. In Japan, too, mechanisms are now needed to protect information systems against intrusion by hackers and against the theft and destruction of data. The system described above is effective in this area as well.
     
     

    SGML Data Cartridge

     
     

    Market Reaction

     
    The initial version of the SGML Datablade was shipped in September 1997 to customers consisting mainly of the research and development divisions of larger companies. However, operational departments and integrators repeatedly requested an Oracle version, so development of an Oracle version was quickly started.
     
    When we initially designed the SGML Datablade, we assumed that it would likely be ported to other systems and therefore minimized the use of Illustra's own features, which made the transition to Oracle relatively easy.
     
     

    System Architecture and difference from Datablade

     
    Fig.3 shows the system architecture of the SGML DataCartridge. The following functions were changed in the transition from the Datablade to the DataCartridge:

     
    Figure 3. SGML/XML DataCartridge

     
  • Functions to call other datablades as well as functions to directly handle multimedia elements like pictures, figures and sound were eliminated.
  • The Datablade did not manage document instances in the database. This function was added to the DataCartridge.
  • The Datablade resolved instances into elements by parsing them on the client side and then stored the elements in the database. The DataCartridge makes it possible to do the parsing on the server side.
     

    Data Cartridge APIs

     DataCartridge 
     

    The APIs of the DataCartridge are classified into three groups, as shown in Fig.4. The first group, the "Source Document API", manages SGML/XML document instances directly. The second group, the "Element API", manages the parsed elements of the document instance. Navigation of the SGML/XML document and the selection, control, and updating of elements are performed through this API. The third group, the "Custom API", mainly depends on a particular application and its related DTDs; it extends and customizes the functions of various business applications, which may include access control, security, workflow, etc.
     
     

    Source API

    Source API
     

    The Source API's function is to store SGML instance source files in the database. A source file can be parsed into an element tree, or a set of element trees, according to its SGML declaration and DTD.

     
    Figure 4. APIs of DataCartridge

     
     
    The Source API also handles the storage of element trees in the database. Non-SGML files cannot be parsed, of course, but the Source API can still store them in the database and manage properties such as author, creation date, and version.
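The behavior described for the Source API (store any file with its properties, and parse it when it is parsable) can be sketched as follows. This is only an illustration under stated assumptions: the class `SourceStore`, its method names, and the table layout are hypothetical and are not the DataCartridge interface; SQLite stands in for the RDB; and Python's standard XML parser stands in for an SGML parser driven by an SGML declaration and DTD.

```python
import sqlite3
import xml.etree.ElementTree as ET


class SourceStore:
    """Hypothetical stand-in for a Source Document API (illustration only)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE source ("
            " name TEXT, version INTEGER, author TEXT, created TEXT,"
            " body BLOB, PRIMARY KEY (name, version))")

    def register(self, name, body, author, created):
        """Store the bytes as a new version and return the version number."""
        row = self.conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM source WHERE name = ?",
            (name,)).fetchone()
        version = row[0] + 1
        self.conn.execute(
            "INSERT INTO source VALUES (?, ?, ?, ?, ?)",
            (name, version, author, created, body))
        return version

    def properties(self, name):
        """(version, author, created) of the latest stored version."""
        return self.conn.execute(
            "SELECT version, author, created FROM source "
            "WHERE name = ? ORDER BY version DESC LIMIT 1",
            (name,)).fetchone()

    def parse(self, name):
        """Parse the latest version into an element tree.

        Returns None when the stored bytes are not well-formed markup, so
        that non-parsable files can still be stored and managed.
        """
        row = self.conn.execute(
            "SELECT body FROM source WHERE name = ? "
            "ORDER BY version DESC LIMIT 1", (name,)).fetchone()
        try:
            return ET.fromstring(row[0])
        except ET.ParseError:
            return None
```

A parsable instance and an image file are registered in the same way; only `parse` distinguishes them, which mirrors the point that properties are managed even for files that cannot be resolved into elements.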
     
     

    Element API

     
    After the SGML source file is parsed, the element tree is created and the Element API becomes available. The lifetime of the element tree constitutes an Element API session. Within the session, the Element API manages the version of every element in the tree. When the element tree is committed to the database, the source file is updated and a new version number is assigned. Element versions are managed by additional attributes within the tag, as follows:
     
    <SECTION ADD="1" DEL="2" DATE="19970901" OWN="INS"> SECTION1 </SECTION>
     
    An element is deleted by assigning a value to its DEL attribute.
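How attribute-based versioning of this kind can be used to reconstruct a given version is sketched below. The semantics are an assumption made for illustration, since the text does not define them: ADD is taken to be the version in which the element was added, DEL the version in which it was deleted, and an element is taken to be visible in version v when ADD <= v < DEL. The function names are hypothetical, and Python's XML parser stands in for an SGML parser.

```python
import xml.etree.ElementTree as ET


def visible_at(element, version):
    """True if the element exists in `version` (assumed ADD/DEL semantics)."""
    added = int(element.get("ADD", "1"))   # assumption: missing ADD means 1
    deleted = element.get("DEL")
    return added <= version and (deleted is None or version < int(deleted))


def mark_deleted(element, version):
    """Record a deletion by setting DEL instead of removing the element."""
    element.set("DEL", str(version))


def snapshot(root, version):
    """Copy of the tree containing only the elements visible at `version`."""
    if not visible_at(root, version):
        return None
    copy = ET.Element(root.tag, dict(root.attrib))
    copy.text, copy.tail = root.text, root.tail
    for child in root:
        kept = snapshot(child, version)
        if kept is not None:
            copy.append(kept)
    return copy
```

Because a deletion only sets an attribute, every earlier version remains recoverable from the single stored tree, at the cost of filtering on each read.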
     
    An optional element tree is created when the client needs a text retrieval service through the Element API. In this retrieval element tree, entity references are replaced by the referenced data at the appropriate places.
     
     

    Custom API

     
    The Custom API is basically a general API that includes the SQL of the RDB and makes it possible to extend and customize the functions of various business applications, which may include access control, security, workflow, etc. Usually, a Custom API depends on a particular application and its related DTDs, and supports the functions of that business application.
     
     

    Vertical Industrial API

     
    The Custom API is very important and useful when we want to develop dedicated document databases for particular vertical industries. The conceptual architecture is shown in Fig.5.

     
    Figure 5. Architecture of Custom APIs for Vertical Industries

     
     
    We have already developed prototype APIs for the medical industry and the consumer electronics industry. Custom APIs can act as something like plug-ins for creating particular database packages.
     
    Our concept of an SGML/XML database is that it should serve not only document management but also broader business application fields. SGML/XML is not simply a framework to define document structure but a framework for the APIs of business applications. This concept will become clearer when the API is defined in CORBA IDL, which we plan to implement in the near future.
     
     

    Next Development

     
     

    CORBA based API

     
    In line with the recent progress of business object standardization in the OMG, we have been asked to define the API in IDL. A simple way to convert an API written as C functions to IDL is to apply wrapping technology [5]. The conceptual architecture shown in Fig.5 will be changed to the CORBA IDL based architecture shown in Fig.6. We are currently developing it on several ORBs (Orbix [6], OmniBroker, and JacORB).
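The wrapping idea can be illustrated without an ORB. In the sketch below, a few procedural functions with handles and status codes stand in for a C-style API, an abstract class plays the role that an IDL interface would play, and a wrapper class implements that interface by delegating to the functions. All names are hypothetical and chosen only for this illustration; none of them come from the DataCartridge API or from any CORBA product.

```python
from abc import ABC, abstractmethod

# --- Stand-ins for an existing procedural API (hypothetical names) ---------
_DOCS = {}            # handle -> {element id: text}
_NEXT_HANDLE = [1]


def doc_open(name):
    """Return a new integer handle (the name is ignored in this stub)."""
    handle = _NEXT_HANDLE[0]
    _NEXT_HANDLE[0] += 1
    _DOCS[handle] = {}
    return handle


def elem_set_text(handle, elem_id, text):
    """Return 0 on success, -1 on an invalid handle (C-style status code)."""
    if handle not in _DOCS:
        return -1
    _DOCS[handle][elem_id] = text
    return 0


def elem_get_text(handle, elem_id):
    """Return the stored text, or None if the handle or element is unknown."""
    return _DOCS.get(handle, {}).get(elem_id)


def doc_close(handle):
    """Return 0 on success, -1 if the handle was not open."""
    return 0 if _DOCS.pop(handle, None) is not None else -1


# --- Object interface, in the role an IDL interface would play -------------
class Document(ABC):
    @abstractmethod
    def set_text(self, elem_id, text): ...

    @abstractmethod
    def get_text(self, elem_id): ...

    @abstractmethod
    def close(self): ...


class DocumentError(Exception):
    """Raised when the underlying procedural call reports a failure."""


class WrappedDocument(Document):
    """Wrapper that hides handles and status codes from its callers."""

    def __init__(self, name):
        self._handle = doc_open(name)

    def set_text(self, elem_id, text):
        if elem_set_text(self._handle, elem_id, text) != 0:
            raise DocumentError("set_text failed")

    def get_text(self, elem_id):
        return elem_get_text(self._handle, elem_id)

    def close(self):
        if doc_close(self._handle) != 0:
            raise DocumentError("close failed")
```

The existing functions are left unchanged; only the thin wrapper is new, which is why wrapping is a comparatively simple migration path.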
     
     

    XML

     
    Since the announcement of the W3C's XML 1.0 Recommendation, XML has been receiving much attention in Japan. Within XML, we are most interested in the DOM (Document Object Model) [7].

     
    Figure 6. Conceptual Architecture based on CORBA IDL

     
     
    The specification of this model is published on the W3C's home page, and we found it to be very close to our middleware concept in that it forms an API to documents.
     
    Based on the XML specifications, we would like to modify the SGML DataCartridge to support XML and provide an SGML/XML DataCartridge. In order to realize the specifications, the following functions should be supported:
     
    • For the internal code, functions to manipulate the character code of Unicode 2.0 and/or ISO/IEC 10646 should be supported
    • For the external code, functions to manipulate the single byte character code of Latin-1 should be supported.
    • For the external code, functions to manipulate the wide character code of UCS-8 should be supported.
    • For the external code, functions to manipulate the multi-byte character codes of Shift_JIS, EUC-JP, ISO-2022-JP, and UTF-8 should be supported.
    • Parser for the XML document instances should be supported, which should include the function to parse the instances without DTDs.
    • Functions for the link (XLL) should be supported.
    • Functions for the style-sheet (XSL and/or CSS) should be supported.
    • Functions for the DOM should be supported.
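The character code requirements in the list above amount to decoding each external encoding into one internal Unicode representation and encoding it again on output. A minimal sketch follows, using the codecs in Python's standard library; the function names and the mapping table are assumptions made for this illustration, and only the encodings that the list names unambiguously are included.

```python
# Mapping from the encoding names used in the list above to the codec names
# of Python's standard library (this table is an illustrative assumption).
EXTERNAL_CODECS = {
    "Latin-1": "latin-1",
    "Shift_JIS": "shift_jis",
    "EUC-JP": "euc_jp",
    "ISO-2022-JP": "iso2022_jp",
    "UTF-8": "utf-8",
}


def to_internal(data, encoding):
    """Decode external bytes into the internal Unicode string."""
    return data.decode(EXTERNAL_CODECS[encoding])


def to_external(text, encoding):
    """Encode the internal Unicode string into external bytes."""
    return text.encode(EXTERNAL_CODECS[encoding])


def transcode(data, source, target):
    """Convert bytes between two external encodings via the internal code."""
    return to_external(to_internal(data, source), target)
```

Routing every conversion through the internal code means that each external encoding needs only one decoder and one encoder, rather than a converter for every pair of encodings.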
     
    We had planned to implement these functions as a Custom API with an XML parser, but the specifications of XSL and XLL are not yet clear. Another approach, developing a new cartridge, is under way.
     
     

    Discussion

     
     

    Response of Customers

     
    The DataCartridge has been introduced into business operations at certain companies as well as into research and development organizations, while the Datablade has been introduced only into R&D organizations. The Datablade's limitation to R&D organizations stems from the Illustra Database market policy of Informix Inc., which announced that it would stop supporting the Illustra Database at the end of 1998.
     
    The companies where the DataCartridge has been introduced are medical publication and documentation companies as well as consumer electronics documentation companies. The key factors in choosing the DataCartridge solution were not only its performance but also its customizability and its interoperability with existing systems.
     
     

    Market and Technological Trend

     
    To evaluate the future market and the technology around the DataCartridge, three key fields seem to be important: Network, Digital documents, and Object Technology.
     
     

    Network System

     
    Recently, most Japanese companies have introduced the Internet for their email systems and home pages. A few companies are integrating their company databases with the Internet, which will form the base of their intranet systems. Among their databases, Oracle is the most popular. This is why we decided to develop the DataCartridge, which interfaces to Oracle using existing SQL. Because of the recent stagflation, many Japanese companies are trying to introduce intranets in order to reduce their personnel costs. The intranet market seems to be growing very quickly in Japan.
     
     

    Digital Document

     
    Since the introduction of web home pages, digital document technology has been changing very quickly in Japan. Users of dedicated word processor machines and of word processing software such as Ichitaro are decreasing, while users of web browsers are increasing dramatically.
     
    The number of books on HTML in bookstores is increasing, and many documents related to computers and networks, such as help files, are now written in HTML. This trend will accelerate thanks to the free, high-quality web browsers produced by the competition between Netscape and Microsoft. The impact of XML is also very strong for people who use those web browsers.
     
     

    Object Technology

     
    Business objects are thought to be the key technology for many Japanese companies to survive in the future global economy. Although many Japanese companies are interested in the OMG CORBA architecture as a way to prepare business objects, very few of them have introduced an ORB into their intranet systems. Many of them are interested in the integration of networks and digital documents. We think object technology is the key to this integration, as shown in Fig.7.

     
    Figure 7. Relationship of Integrated Information Technologies in Future

     
     
     

    Potential of SGML/XML DataCartridge

     
    XML/DOM and our product occupy the overlapping area of network, digital document, and object technology, because both of them support a client/server API to document objects. The API of a document object relates to all three areas because it essentially relates to web browsers, which lie in the overlapping area of network and digital document; to CORBA IDL, which lies in the overlap of object technology and network; and to SGML, which lies in the overlap of object technology and digital document, as shown in Fig.7.
     
    We believe the technology relationship shown in Fig.7 will be essential in the future, and that document object APIs defined in OMG IDL will be very important for the future IT field. The DataCartridge's document object API defined in OMG IDL is shown in Fig.6, while that of XML is the central part of the DOM specification.
     
     

    Conclusions

     
    We have developed the SGML Datablade for the ILLUSTRA ORDB and the SGML DataCartridge for standard RDBs such as Oracle. The former aims at a framework for multimedia information management systems, and the latter aims at providing APIs to business applications through documents.
     
    The DataCartridge has been introduced into business operations at certain companies as well as into research and development organizations, while the Datablade has been introduced only into R&D organizations. The companies that have introduced our DataCartridge are medical document publishing and consumer electronics documentation organizations. The key factors in the introduction of the DataCartridge were not only its performance but also its customizability and its interoperability with existing systems.
     
    Acknowledgments
      In closing, we would like to thank Mr. Nakatsugawa and Mr. Otsuka of INS Engineering Co., and Mr. Sato of Approach Inc., for their management and help in completing the Datablade and DataCartridge systems. We would also like to thank Mr. Yoshida and Mr. Miyaji of INS Engineering Co., and Mr. Ito of NIS Plus Inc., who dedicated themselves to the implementation of the systems.
      References
      [1] OMG Document, "User Interface Common Facilities", ftp://www.omg.org/pub/docs/formal/97-06-06.pdf
      [2] http://www.omg.org/
      [3] OMG Document, "Object Query Service Specification", ftp://www.omg.org/pub/docs/formal/97-07-04
      [4] http://www.oracle.com/nca/html/nca_wp.html
      [5] T.J. Mowbray et al., "The Essential CORBA", John Wiley & Sons, pp.231-267 (1995)
      [6] http://www.iona.com/Orbix/index.html
      [7] http://www.w3.org/TR/WD-DOM/
