
 

Problems with linking, and reuse of text

Why publishing on different targets might not be as easy as it first seems.
 Erlend   Øverby
  Senior Analyst
  Center for Information Technology Services  University of Oslo P.O.Box 1084, Blindern
Gaustadaléen 23
 0316 Oslo   Norway
Phone: +47 22 86 25 33
Fax: +47 22 84 00 43
Fax: +47 22 85 27 30
Email: Erlend.Overby@usit.uio.no Web: http://www.uio.no/~erlendo Web: http://www.usit.uio.no/seksjon/odi/ Web: http:/sgml.uio.no/
 
Biographical notice:
 
Erlend Øverby has been working at the University of Oslo with SGML and student-related information since 1992. He is also a cofounder of HyPATIA, a consulting company based in Norway that focuses on document management, DTD design and the practical use of structured information in organizations.
 
ABSTRACT:
 
When you hear about SGML/XML and document management, one of the arguments for using SGML/XML is that you can produce the information once and use it in several different places. This is called reuse of text: you create new documents just by picking elements from a document management system. This is not as easy as it might first appear.
 
For some types of information, such as technical documentation and reference information, reuse makes complete sense. But for "normal" text, reuse is not as easy as it might seem at first glance.
 
One of the problems with reuse is that text is written with one context and one purpose in mind. When that information is isolated and used in another context, the tone and language might be wrong. Each written piece has its own tone and story rhythm. If you pick a chapter or section from one document and reuse it in another document with a different context, its rhythm will not fit in with the rest of the text.
 
If the information is supposed to be used in different contexts, it is important that the author is aware of this and can therefore adjust the tone and rhythm so that the information can be reused. Authors have to change the way they write: each information object has to be able to stand alone in order to be reused. One feasible alternative is to rewrite the document so that the tone and style of the new document are consistent.
 
Everyone who has been to an SGML course is told that one of the real advantages of SGML/XML is that you can separate content from presentation, and that you can easily present your information on paper, on CD-ROM and on the Internet. But this is not always true, because every presentation form has its own way of communicating. We have hundreds of years of experience in communicating on paper and thousands of years of experience in oral communication. What we immediately see is that written and oral communication differ, and that we cannot easily transfer the same information from one to the other. The same is true when we try to move information from paper-based media into electronic media such as the Internet and CD-ROM. If we want to transform information from a paper-based production environment into electronic media, the author needs to be aware that this information is intended to be used in different ways. If the information is also intended to be used in a hypertext environment, it is even more important that the author is aware of this and modifies the text to suit this new way of communicating.
 
If we are going to change the way we communicate, and especially if we are going to take advantage of hypertext, we need new tools that let the author write in a hypertext environment. An author who writes for both paper and "electronic" delivery has to write for each of these media. The communication forms may even differ between CD-ROM and the Internet. This will definitely create new challenges for authors.
 

Introduction

 
First of all, a lot of text is designed for reuse and for publishing on different targets. An example of such a document type is a technical manual. These documents are what I would like to call self-contained. That means they usually make sense if you read them outside their normal context, and they represent a document type that you read for a specific purpose. The document class is designed for this purpose, and it makes sense regardless of its context and its presentation.
 
Another type of document that can easily be reused and presented on different targets is the dictionary. A dictionary is built from small, isolated fragments of text, so this document class is well suited for linking. Since you are reading for specific information, you expect links, and you know where the thread of information you are following ends. That means you will not be distracted by irrelevant links, and you know what to expect from the information at the other end of a link.
 
It is my opinion that only some special document classes are well suited for reuse and linking and that they easily can be presented on different targets.
 

Reuse of text

 
One of many arguments for the use of SGML-based document management systems is that you have very good control over the content. Another is that you can easily edit and modify parts of a document. This is all very well. It is also claimed that it is easy to combine different parts of documents and to create new ones efficiently, but I claim this is true only for a few document classes.
 

What is reuse

 
Reuse means having text blocks that you can use repeatedly in different contexts. A block could be anything from a short passage to a whole chapter. I once experienced this problem with a DTD that had a lot of "chapter"-like content elements describing the actual content of the information. We tried to reorganise the content elements to create a new view of the information. This was easy to manage, and the idea was not bad in itself, but when we started to look into the information and at the actual text that had been written, we ran into problems.
 
The problems we faced were not technical. They concerned the semantics of the text once it was isolated from its context. A lot of the meaning was lost, because the authors had produced the information for a different use than the one we were now applying. Much of the text had no meaning because it relied on information in earlier parts and paragraphs, even though the idea of the DTD design was to group information within separate element containers. We realised that we could not remove the information from its context, since the context was vital for understanding it. Including more of the surrounding information reduced the problem, but it did not remove it.

[Figure: New compound document]

 
 

Different classes of information

 
As I mentioned in the introduction, some classes of information are designed for reuse, or the information is produced in such a way that it makes sense regardless of its context. However most information is not designed for reuse. Usually we produce information for one specific purpose, and for one specific context, and not for different contexts.
 

Different types of reuse

 
To be able to reuse the information we obviously need to design it for reuse. First we need to analyse how the information is used, and then decide which parts of it are going to be reused. Finally we need to remodel our DTDs and structures to match this design goal.
 
To reuse information on a paragraph level is difficult, but to reuse information on a "container" level is feasible, if the design supports it and if the class of the document is a class where you would expect reuse.
 
Reuse of text can happen in two ways. One is by reference, where the original text is referenced and not modified. An advantage of this model is that when the original text is modified, the change appears everywhere the text is referenced. This can be a wise strategy if the information is designed for it and the parts contain exact information such as prices and numbers. The other way is by inclusion, where the text is copied into the new document so that you can easily modify it to fit the current context.
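
The two models can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `repository`, `section` and `ref` element names, the `target` attribute, the helper functions and the sample price are invented, and a production system would more likely rely on SGML/XML entities or a document management system than on this hand-written resolver.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical repository of reusable "container" elements (invented sample content).
REPOSITORY = ET.fromstring(
    "<repository>"
    "<section id='pricing'><para>The course fee is 500 NOK.</para></section>"
    "</repository>"
)

def find_section(section_id):
    """Return the repository section with the given id, or raise KeyError."""
    for section in REPOSITORY.iter("section"):
        if section.get("id") == section_id:
            return section
    raise KeyError(section_id)

def resolve_references(document):
    """Reuse by reference: swap each <ref target='...'/> for the shared original."""
    for parent in list(document.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "ref":
                parent[index] = find_section(child.get("target"))
    return document

def include_copy(document, section_id):
    """Reuse by inclusion: append an independent copy that can be edited locally."""
    clone = copy.deepcopy(find_section(section_id))
    document.append(clone)
    return clone

# The brochure references the original; the letter includes and adapts a copy.
brochure = resolve_references(ET.fromstring("<doc><ref target='pricing'/></doc>"))
letter = ET.fromstring("<doc/>")
include_copy(letter, "pricing").find("para").text = "As agreed, your fee is 500 NOK."

# Updating the original reaches the brochure but not the adapted letter.
find_section("pricing").find("para").text = "The course fee is 550 NOK."
brochure_text = ET.tostring(brochure, encoding="unicode")
letter_text = ET.tostring(letter, encoding="unicode")
```

The sketch shows the trade-off described above: the referenced section stays accurate everywhere but cannot be adapted to the tone of each document, while the included copy can be rewritten freely but no longer follows changes to the original.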
 

Problems with reuse

 
The problem with reuse is not of a technical nature. With SGML and XML it is very easy to create systems and rules for the reuse of information and of parts of information. However, when we start to combine pieces of information from different documents, different sources and different authors, the style of the language will differ. Information is normally written for a specific context, and within that context the tone of the language, the tense and the form are consistent. When parts from different documents are combined into one, the style and the form will vary within the new combined document.
 

What can be done?

 
When you want to create truly reusable text, you need to decide on a writing strategy and on the level of granularity at which you wish to reuse text. It is also important to decide what types of text you wish to reuse, and to ensure that this text can stand by itself. The text has to be self-contained. By that I mean that the text cannot build on anything from its context or from other information written earlier, and it cannot refer to anything that comes later in the document thread.
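
As a rough illustration of what "self-contained" could mean in practice, the following Python sketch flags phrases that point outside a fragment. The phrase list and the function names are assumptions invented for this example; the list is deliberately short and English-only, and such a check can only hint at context dependence, not prove its absence.

```python
import re

# Hypothetical, deliberately short list of phrases that point outside the fragment.
CONTEXT_CUES = [
    r"\bas (?:mentioned|described|shown|noted) (?:above|earlier|previously)\b",
    r"\bsee (?:above|below)\b",
    r"\bin the (?:previous|next|following) (?:chapter|section|paragraph)\b",
    r"\bas we will see\b",
]
CUE_PATTERN = re.compile("|".join(CONTEXT_CUES), re.IGNORECASE)

def context_cues(fragment):
    """Return the phrases in a fragment that suggest it leans on surrounding text."""
    return [match.group(0) for match in CUE_PATTERN.finditer(fragment)]

def is_self_contained(fragment):
    """True when no backward or forward reference phrase was found."""
    return not context_cues(fragment)

dependent = "As mentioned above, the fee applies. See below for details."
standalone = "A DTD declares the elements a document may contain."
```

Here `context_cues(dependent)` reports both the backward and the forward reference, while `is_self_contained(standalone)` is true. A fragment that passes can still depend on its context in subtler ways, for example through pronouns or an assumed reader, which is why the writing strategy matters more than any automated check.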
 
If you are reusing small parts of text, such as paragraphs for sales brochures or letters, you have a very good starting point for a new document, which you can then modify to fit its goals. The disadvantage of this rewriting strategy is that you do not have a truly reusable document. What you have is a collection of paragraphs that gives you a good basis for a new document.
 

Publishing on different targets

 
The smartest thing about SGML and XML is that they separate content from presentation. An advantage often mentioned is that we can therefore present the same information on different targets just by changing the presentation.

[Figure: Publishing on different targets]

 
 

What is publishing on different targets

 
Publishing on different targets means using the same information more than once. Two of the most used targets today are the Web and paper. But as new technologies evolve, new targets will arise. One example is the PalmPilot.
 

Communication on different targets

 
All types of communication have their own kind of language. Mixing the language of one form of communication with the presentation form of another does not necessarily lead to a good combination. We have several thousand years of experience in oral communication and several centuries of experience in written communication. From experience we know that communicating by speech is not the same as communicating in writing. When we read aloud what was written, or write down exactly what was said, the communication is not very good. I think there is a certain parallel in the differences between paper and the WWW.
 

What can be done?

 
To communicate effectively on the Web, or on other new media that will arise in the future, I believe we will require new tools. We need tools that follow a paradigm suited to the medium where the information is going to be presented.
 
The question, then, is what role SGML and XML have in this picture. We have to create structures that can contain information suited to each of the targets where the information will be presented. The authors have to create information fitted to each presentation medium, and we need elements, or marked sections, that tell the system which information goes to which target. As a consequence, the authors need to write for each of the chosen targets when they produce their information.
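
A minimal Python sketch of such routing is given here. It does not use SGML marked sections; it substitutes a hypothetical `target` attribute holding a space-separated list of media, and the element names, target names and sample sentences are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a "target" attribute lists the media a block is written for.
# Blocks without the attribute are meant for every target.
SOURCE = """
<article>
  <title>Course catalogue</title>
  <para>Registration opens in August.</para>
  <para target="paper">See the form at the back of this booklet.</para>
  <para target="web">Follow the link to the online form.</para>
  <para target="web handheld">Registration takes two minutes.</para>
</article>
"""

def select_for_target(root, target):
    """Remove, in place, every element whose target list excludes `target`."""
    for parent in list(root.iter()):
        for child in list(parent):
            targets = child.get("target")
            if targets is None:
                continue  # untargeted content goes everywhere
            if target in targets.split():
                del child.attrib["target"]  # routing attribute is not part of the output
            else:
                parent.remove(child)
    return root

def paragraphs_for(target):
    """Return the paragraph texts that would be published on one target."""
    root = select_for_target(ET.fromstring(SOURCE), target)
    return [para.text for para in root.iter("para")]

paper = paragraphs_for("paper")
web = paragraphs_for("web")
handheld = paragraphs_for("handheld")
```

With this source, the paper edition keeps the shared paragraph and the booklet sentence, the Web edition keeps the shared paragraph and the two Web sentences, and the handheld edition keeps only the shared paragraph and the short sentence. The filtering itself is trivial; the cost lies in the author having to write the alternative paragraphs in the first place.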

[Figure: Sample document for different targets]

 
 

Solution

 
To create reusable information suited for different targets you need to start by working with the authors. You need to educate them in how the different media communicate, but the most important thing is to make them aware of how the information they produce will be used. Author awareness matters most.
 
To create reusable text you need to set rules for the form of the text, such as the tense and the grammatical person in which it should be written.
 

Tools

 
Today we have tools which are well suited for producing information on paper. These tools are created to fit the paper paradigm of communication, although they work differently with regard to the information they produce. Examples of such applications are MS Word, FrameMaker, Quark and Illustrator. All these applications are for producing information on paper, and they do that extremely well. However, using the same tools to produce information for the Web might not work as well.
 
I do hope that the growing interest in XML and its related standards will give us tools that are well suited for presenting information on new electronic media such as the WWW. Many of the tools used today for creating information on the Web are based on the paper paradigm, and these tools use paper "terminology". Questions that arise are: "Can we use the term emphasise on the Web?" or "What does bold mean?" In other words, do we need other words or mechanisms for use on the Web?
