Document Structure Identification: a New Paradigm   Table of contents   Indexes   Global CSCW Computer Supported Cooperative Work with SGML

 
 

Authoring: intelligent templates for authoring of SGML documents


 
Frank-Marcus   Steinmann
  Project Manager
  debis Systemhaus
Magirusstr. 43
Ulm   Germany  89077
Phone: +49 731 9344 3221
Fax: +49 731 9344 100
Email: fsteinmann@gei-ulm.daimler-benz.com
Web: www.dtro.e-technik.tu-darmstadt.de/fms/fms.html
 
Biographical notice:
 
Dr. Frank-Marcus Steinmann
 
Frank-Marcus Steinmann works for debis Systemhaus as a project manager, where he has served as consultant and developer for a number of SGML applications in the auto industry since 1996. Prior to this he was a scientific assistant at the Department of Computer Engineering at the Darmstadt University of Technology, Germany, where he received his Ph.D. in Computer Engineering in 1996 for research on text and speech processing.
 
ABSTRACT:
 

The possibility, shown in this paper, to define any part of an SGML / XML document as a template and to insert it afterwards, also into other documents, is a significant help for authors of SGML / XML documents. It becomes even more useful because the positions where the template can be inserted are searched for automatically. An important aspect is to find not only the obvious gaps in the document but also the difficult positions, for example where parts of the template already exist in the document. The information needed for this is contained completely in the Document Type Definition (DTD). Because of this, the presented algorithm is directly suitable for all possible DTDs without any adaptation.
 

Conventional SGML editors generally allow the insertion of single elements instead of whole parts of documents. In addition, they only allow for the insertion of elements from the immediate area of the cursor. DTD independent mechanisms for insertion have neither been presented nor used yet.
 

The presented template algorithm is able to split a template into sub-templates if required and to insert them into the document one by one. This simplifies the handling of templates, above all their creation by the author. Moreover, this feature allows the insertion of several subtrees simultaneously, even when they contain links to each other. Because of these links there is normally no other way to insert several subtrees; in particular, inserting them successively does not work. Adding specific knowledge about the semantics of HyTime CLINKs to the template algorithm makes it possible to support HyTime CLINKs as well.
 

All things considered, an intelligent template algorithm has been realized and integrated into an SGML editor. Finally, this algorithm will be presented together with the resulting experiences.
 
 

Introduction

 

Due to the increasing use of SGML and XML applications, for example in the oil, the pharmaceutical, the telecommunications and the auto industries, as well as in the WWW, there is a growing need to process SGML / XML documents.
 

Frequently used tools for this are SGML editors. Conventional SGML editors generally allow the insertion of single elements rather than whole parts of documents (templates). In addition, they only allow the insertion of elements in the immediate area of the cursor. As a significant help for the author, it should be possible to insert complete parts of documents instead of single elements and to find the positions where they can be inserted automatically. The role of the author (user) is the creation of documents according to a fixed DTD (document type definition). The DTD is created by an SGML expert.
 

It does not matter if we are dealing with SGML or XML documents, as long as a DTD is present. Because of this, the following analysis applies to (valid) XML even if it is mentioned only as an example of SGML.
 
A simple case in which it is worth using templates is shown in example 1.
 
Example 1: Given the following extract from a DTD:
 
<!ELEMENT team-members - - (team-member+) >
<!ELEMENT team-member  - - (roles , name , department? , address? , zip? ,
city? , phone? , fax?) >
<!ELEMENT roles  - - (role+) >
<!ELEMENT role  - - (#PCDATA) >
<!ELEMENT name  - - (#PCDATA) >
<!ELEMENT department  - - (#PCDATA) >
<!ELEMENT address  - - (#PCDATA) >
<!ELEMENT zip  - - (#PCDATA) >
<!ELEMENT city  - - (#PCDATA) >
<!ELEMENT phone  - - (#PCDATA) >
<!ELEMENT fax  - - (#PCDATA) >
 
To create a complete <team-member> element with a conventional editor, the author has to create and fill 10 elements for each <team-member>: the <team-member> element itself and the content elements <roles>, <role>, ... , <fax>. With the following template, the creation of these 10 elements is reduced to a single insertion step:
 
<team-member>
    <roles>
        <role></>
    </roles>
    <name></>
    <department></>
    <address></>
    <zip></>             
    <city></>
    <phone></>
    <fax></>
</team-member>
 
The template algorithm searches the document for all positions where the insertion is possible, and if the user agrees, the template will be inserted there.
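
The position search described above could be sketched as follows. This is only a minimal illustration, not the paper's implementation: it assumes that the content models of example 1 have been translated by hand into regular expressions over child element names, and the names `CONTENT_MODELS`, `is_valid` and `valid_positions` are hypothetical.

```python
import re

# Content models from example 1, hand-translated into regular expressions
# over space-terminated child names (an assumption of this sketch).
CONTENT_MODELS = {
    "team-members": r"(team-member )+",
    "team-member": r"roles name (department )?(address )?(zip )?"
                   r"(city )?(phone )?(fax )?",
    "roles": r"(role )+",
}


def is_valid(parent, children):
    """Return True if the child sequence satisfies the parent's content model."""
    encoded = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], encoded) is not None


def valid_positions(parent, children, new_child):
    """List every index at which new_child may be inserted below parent."""
    return [
        index
        for index in range(len(children) + 1)
        if is_valid(parent, children[:index] + [new_child] + children[index:])
    ]


# A new <team-member> fits before, between and after two existing ones.
print(valid_positions("team-members", ["team-member", "team-member"], "team-member"))
# A <zip> fits only between <name> and <phone>.
print(valid_positions("team-member", ["roles", "name", "phone"], "zip"))
```

Because the check only consults the content model of the parent, the same search works for any DTD once its content models are available in this form.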
 
Although the structure of the template in the example above is very simple, the user saves time both in finding the insertion position and in creating and filling the 10 elements. For frequently used contents we can additionally put the contents of the elements and the values of the attributes into the template. For the DTD in example 1 we could create a template like the following:
 
Example 2:
 
<team-member>
    <roles>
        <role>Author</>
    </roles>
    <name>N.N.</>
    <department>S1</>
    <address>Magirusstr. 43</>
    <zip>89077</>                
    <city>Ulm</>
    <phone>(0731) 9344-0</>
    <fax>(0731) 9344-100</>
</team-member>
 
After inserting this template, we only have to enter the correct name into the <name> element for each team member of department S1.
 
 

What is a template

 
 

Definition

 
Examples 1 and 2 show that a template can be either an SGML structure without any content or a complete SGML subtree with element contents and attribute values. Templates are most effective for frequently used SGML structures (with or without content), since the effort of inserting them into SGML documents is considerably reduced. To insert a template, the template algorithm searches for all positions in the document where the insertion is possible and presents them to the author. The template algorithm does not depend on any special DTD.
 
As is well known, SGML documents can be understood as trees. The nodes represent the elements of the instance. The DTD defines which children are allowed for each node. Because of this, a simplified definition of a template is an SGML subtree. A general definition covering any part of a document would be a set of at least one subtree. We will see later that the simplified definition is entirely sufficient.
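
As an illustration of this tree view, the template of example 1 can be written down as a small tree structure. The `Node` class and the `count_elements` helper are hypothetical names chosen for this sketch; they are not part of the editor described in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of an SGML instance; children are kept in document order."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def count_elements(node):
    """Count the node itself plus all of its descendants."""
    return 1 + sum(count_elements(child) for child in node.children)


# The empty <team-member> template of example 1 as a subtree.
template = Node("team-member", children=[
    Node("roles", children=[Node("role")]),
    Node("name"),
    Node("department"),
    Node("address"),
    Node("zip"),
    Node("city"),
    Node("phone"),
    Node("fax"),
])

print(count_elements(template))  # 10, the number of elements named in example 1
```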
 
Fig. shows the tree representation of example 1.

 
The <team-member> template shown as a tree. By the function "Insert Template" the template (gray background) can be inserted e.g. between the two <team-member> subtrees existing in the document (white <team-member> elements).

 
 
 

Creation of templates

 
It is useful to give users the possibility to create their own templates. Of course, predefined templates created by an SGML expert can be provided too. There are many ways to create templates. The realization presented in section permits the user to create templates in a way similar to the well-known COPY ( / PASTE) function, by copying the selected part of the document into the template. This makes it easy to create templates while editing documents, using the same editor. An alternative way of creating templates could be the use of another editor or of a special tool.
 
 

Inserting position

 
The position to insert the template is searched automatically by the template algorithm. Generally there will exist several possible positions. Because of this, the template algorithm should show all possibilities successively and ask the user each time, if the template should be inserted at the presented position. So the template algorithm will be DTD independent.
 
 

Searching area

 
There are many possibilities for defining the area in which to search for positions to insert the template. It can be the whole document or an area selected by the user. The realization presented in section searches forward from the selected element to the end of the document.
 
 

Suitable positions

 
Generally there are 3 different types of positions to insert a template:
  1. The template can be inserted directly (see example 1, fig. ): The position is presented to the user.
  2. One or more nodes of the template already exist in the document: An example of this is shown in fig. . Generally there are two possibilities. A simple algorithm would not show this position. In this case the user would have to take care to choose an appropriate root element when creating templates. The intelligent algorithm presented in the following tries to add elements and contents of the template to the structure which already exists in the document (see section ).
  3. Some nodes under which the template could be inserted are absent: Generally the template algorithm cannot create these missing elements. This would require special knowledge about the desired contents, attribute values, and so on of the elements to be created. This knowledge would have to be entered by the user each time a template is inserted. If the user then rejects the presented position, all added elements and all information he gave to insert them must be removed. Especially in the case of complex structures this would be very frustrating. That is why this type of insertion will not be considered in the following.
 

 
The template <t> cannot be inserted directly if <t> already exists in <b> and may occur only once at this place. The insertion is completely impossible if <b> is missing, because a general algorithm cannot create additional elements (in this case <b>, possibly with the mandatory subtree <d>).

 
 
 

Insertion

 
When the template algorithm has finally found a position that has been accepted by the user, the template can be inserted. The simple algorithm only copies the whole subtree into the document. Compared with the simple one, the intelligent algorithm has the following features:
  1. If some elements of the template already exist in the document, the rest of the template can be added around these elements in the document (see fig. )
  2. This facilitates the creation of a template by the user considerably. The attention required for choosing the template root element when using the simple algorithm (see fig. ) is not necessary.
  3. Furthermore, several subtrees can be inserted simultaneously with this.
  4. Links between these subtrees will be preserved (see section ).
 
 
Remember that according to section , item 3, additional elements cannot be created.

 
The intelligent algorithm complements already existing structures (white elements) with parts of the template. The added elements are presented in gray.

 
 
Fig. resumes the example of fig. . Since element <t> already exists in the document, the simple algorithm cannot insert the template (gray background). In fig. we can see, how the intelligent algorithm will manage that problem. It only adds the elements <v´>, <x´> and <y´> (gray elements) of the template to the elements <t> and <u> of the document. The intelligent template algorithm obviously is able to split a template into sub-templates if required and to insert them one by one, into the document.
 
 

A lot of possibilities

 
Generally, there will be various possibilities for inserting the sub-templates. Because of this, a DTD independent template algorithm has to try all permutations of the sub-templates to be inserted. It remains for the author to select the desired permutation. The algorithm presented in section shows all permutations one after the other until the user stops it (by agreeing or rejecting). If he agrees, the corresponding permutation is inserted into the document. If the author rejects a permutation, either by asking the template algorithm for another permutation or by cancelling the template algorithm, the rejected permutation has to be removed from the document.

 
Before complementing the existing subtree <t> (see fig. ) the template can be inserted as left sibling (1), afterwards as right sibling (2) of the existing subtree <t>.

 
 
Fig. shows the principle by which the intelligent algorithm can find all permutations. Before complementing an element <t> existing in the document with the rest of subtree <t> of the template (as shown in fig. ), the algorithm tries to insert this subtree <t> as left sibling of the element <t> in the document (before it), provided that the DTD allows another element <t> at this position. This is shown in fig. .(1). In the same way, after complementing the element <t> existing in the document, it tries to insert subtree <t> of the template as right sibling (after it), as shown in fig. .(2). This is also done for all children (elements <u>, <v>, <x> and <y>). In this way all possible permutations are found recursively. If the author rejects a permutation, it is removed from the document and the next permutation is presented.
 
Example 3: Given the DTD of examples 1 and 2, the template of example 2 and the following extract from a document:
 
<team-member>
    <roles>
        <role>Project Manager</>
    </roles>
    <name>Steinmann</>
    <phone>(0731) 9344-3221</>
</team-member>
 
In addition to the (simple) positions before and after the <team-member> element existing in the document, the intelligent algorithm will find three permutations, because the element <role> of the template can be inserted before the existing element <role> of the document (as left sibling), not be inserted at all, or be inserted after it (as right sibling). All the other elements can only be added to the elements existing in the document, except <name> and <phone>, because these elements already exist in the document and the DTD does not allow them to be inserted once more (see <u> in fig. ).
 
Permutation 1:
 
<team-member>
    <roles> 
        <role>Author</>
        <role>Project Manager</>
    </roles>
    <name>Steinmann</>
    <department>S1</>
    <address>Magirusstr. 43</>
    <zip>89077</>                
    <city>Ulm</>
    <phone>(0731) 9344-3221</>
    <fax>(0731) 9344-100</>
</team-member>
 
Permutation 2:
 
<team-member>
    <roles> 
        <role>Project Manager</>
    </roles>
    <name>Steinmann</>
    <department>S1</>
    <address>Magirusstr. 43</>
    <zip>89077</>                
    <city>Ulm</>
    <phone>(0731) 9344-3221</>
    <fax>(0731) 9344-100</>
</team-member>
 
Permutation 3:
 
<team-member>
    <roles> 
        <role>Project Manager</>
        <role>Author</>
    </roles>
    <name>Steinmann</>
    <department>S1</>
    <address>Magirusstr. 43</>
    <zip>89077</>                
    <city>Ulm</>
    <phone>(0731) 9344-3221</>
    <fax>(0731) 9344-100</>
</team-member>
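
The three permutations of example 3 can be reproduced with a small sketch. It is deliberately specialised to this one example: it assumes that <role> is the only element that may repeat, represents a <team-member> as a plain dictionary, and uses the hypothetical names `ORDER` and `complement`. It is not the general recursive algorithm.

```python
# Child order of <team-member> as given by the DTD in example 1.
ORDER = ["roles", "name", "department", "address", "zip", "city", "phone", "fax"]

# Template of example 2 and document extract of example 3.
template = {
    "roles": ["Author"], "name": "N.N.", "department": "S1",
    "address": "Magirusstr. 43", "zip": "89077", "city": "Ulm",
    "phone": "(0731) 9344-0", "fax": "(0731) 9344-100",
}
document = {
    "roles": ["Project Manager"], "name": "Steinmann",
    "phone": "(0731) 9344-3221",
}


def complement(document, template, role_option):
    """Add the template's elements around those already in the document."""
    merged = {}
    for name in ORDER:
        if name == "roles":
            # <role> may repeat, so the template's role can go before or
            # after the existing one, or be left out.
            if role_option == "before":
                merged[name] = template[name] + document[name]
            elif role_option == "after":
                merged[name] = document[name] + template[name]
            else:
                merged[name] = list(document[name])
        elif name in document:
            merged[name] = document[name]   # existing element is kept
        elif name in template:
            merged[name] = template[name]   # missing element is added
    return merged


permutations = [complement(document, template, option)
                for option in ("before", "skip", "after")]
for number, result in enumerate(permutations, start=1):
    print(number, result["roles"], result["name"], result["phone"], result["fax"])
```

In all three results <name> and <phone> keep the values from the document, while <department>, <address>, <zip>, <city> and <fax> come from the template.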
 
The number of different permutations depends on the DTD, the template and the document into which the template is to be inserted. Each element which can occur multiple times and which occurs both in the template and in the document (like element <role> in example 3) increases the number of permutations by the following possibilities:
  • the insertion of <Element.x> of the template before <Element.x> of the document (see fig. .(1))
  • the complementation of <Element.x> (see fig. )
  • the insertion of <Element.x> of the template after <Element.x> of the document (see fig. .(2))

    Although this has not caused problems up to now, an explosion of the number of different permutations can be avoided by disabling the insertion at unnecessary positions. This can be done either for all templates together, for all element types of one template, or for each element. One way could be to provide a switch for each of the positions mentioned above (configuration):
  • enable the insertion as left sibling of an element with the same type
  • enable the complementation of an element with the same type
  • enable the insertion as right sibling of an element with the same type

    Remember that this only has to be done for elements which can occur multiple times.
     
    The default configuration of the algorithm presented in section disables the insertion of #PCDATA at every place as left or right sibling of another #PCDATA.
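
The effect of such switches on the number of permutations can be estimated roughly. The sketch below assumes that the choices for different elements are independent and that each repeatable element occurs once in the template and once in the document; the class `InsertionSwitches` and the function `estimate_permutations` are hypothetical and not taken from the realized editor.

```python
from dataclasses import dataclass


@dataclass
class InsertionSwitches:
    """Hypothetical per-element configuration with one switch per position."""
    left_sibling: bool = True
    complement: bool = True
    right_sibling: bool = True

    def options(self):
        """Return the names of the insertion options that are still enabled."""
        choices = [
            ("left_sibling", self.left_sibling),
            ("complement", self.complement),
            ("right_sibling", self.right_sibling),
        ]
        return [name for name, enabled in choices if enabled]


def estimate_permutations(switches_per_element):
    """Multiply the enabled options of every repeatable element that
    occurs both in the template and in the document."""
    total = 1
    for switches in switches_per_element:
        total *= len(switches.options())
    return total


# Example 3: only <role> is affected and all switches are on.
print(estimate_permutations([InsertionSwitches()]))  # 3

# Two affected elements, unrestricted versus restricted.
print(estimate_permutations([InsertionSwitches(), InsertionSwitches()]))  # 9
print(estimate_permutations([
    InsertionSwitches(left_sibling=False),
    InsertionSwitches(left_sibling=False, right_sibling=False),
]))  # 2
```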
     
    Particular attention has to be paid when inserting links and entities (see section ).
     
     

    Several subtrees

     
    As shown in fig. , the intelligent algorithm is able to insert several subtrees simultaneously. This is possible even if the subtrees are not direct neighbours. In the following it will be shown that the intelligent algorithm even handles links between these subtrees of a template correctly. There is no other general way to copy links between subtrees. In particular, it is not possible to insert the subtrees one after the other. This will be shown in the following section.
     
     

    How to handle links

     
    In the following, the insertion of simple templates consisting of only one subtree is discussed first, with particular attention to the standard SGML linking mechanism ID-IDREF. This mechanism assumes that the target is marked by an attribute of type ID with an unambiguous value. The referencing elements use an attribute of type IDREF which contains the same value as the ID attribute of the element to be referenced. HyTime CLINKs will be discussed in section .
     
    Independent of the applied linking mechanism, we have to consider links into, out of, and inside the subtree. Links outside the template need not be considered, because they are not affected.
    1. Links into the template have to stay on their original targets; they must not be redirected to the inserted template. This is necessary to avoid ambiguous targets. Using the ID-IDREF mechanism, all values of the ID attributes (of the inserted template) have to be set to new, unambiguous values.
    2. Links out of the template can be kept only if the targets exist in the document into which the template is to be inserted. Generally this cannot be assumed. Using the ID-IDREF mechanism, all values of the IDREF attributes (of the inserted template) have to be set to new targets.
    3. Links inside the template, that is, links with both the referencing element and the target inside the template, should be copied when inserting the template. Using the ID-IDREF mechanism, all values of the ID attributes (of the inserted template) have to be set to new, unambiguous values, and all values of the IDREF attributes (of the inserted template) have to follow.
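
For the ID-IDREF mechanism, the three cases could be handled as in the following sketch. It assumes that each template element is given as a dictionary with optional `id` and `idref` entries and that new ID values are generated as `t1`, `t2`, and so on; the function name `remap_links` and this naming scheme are illustrative assumptions only.

```python
def remap_links(template_elements, document_ids):
    """Return copies of the template elements with fresh IDs, plus the
    IDREF values whose targets are missing in the document."""
    document_ids = set(document_ids)
    used = set(document_ids)
    mapping = {}
    counter = 0

    # Cases 1 and 3: every ID of the inserted template gets a new,
    # unambiguous value, so links into the original keep their targets.
    for element in template_elements:
        old_id = element.get("id")
        if old_id is None:
            continue
        counter += 1
        while f"t{counter}" in used:
            counter += 1
        mapping[old_id] = f"t{counter}"
        used.add(mapping[old_id])

    inserted, unresolved = [], []
    for element in template_elements:
        copy = dict(element)
        if "id" in copy:
            copy["id"] = mapping[copy["id"]]
        target = copy.get("idref")
        if target in mapping:
            # Case 3: a link inside the template follows the renamed ID.
            copy["idref"] = mapping[target]
        elif target is not None and target not in document_ids:
            # Case 2: the target does not exist in the document, so the
            # link has to be set to a new target by the user or the editor.
            unresolved.append(target)
        inserted.append(copy)
    return inserted, unresolved


template_elements = [
    {"name": "figure", "id": "f1"},
    {"name": "figref", "idref": "f1"},   # link inside the template
    {"name": "figref", "idref": "f9"},   # link out of the template
]
inserted, unresolved = remap_links(template_elements, {"f1", "t1"})
print(inserted)
print(unresolved)
```

Here the copied figure receives the ID `t2`, because `t1` is already used in the document, the internal reference follows it, and `f9` is reported as a target that still has to be resolved.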
     
     
    To sum up, these three cases are shown in fig. .

     
    Three cases for links (arrows) when copying a template. The area with gray background is the original from which the template has been created; the gray elements represent the copies of the inserted template elements.
    1. Links into the template must not be copied, that is, no copy of "a" is created.
    2. Links out of the template can be copied, if the target exists. For this "b" will be copied to "b´".
    3. Links inside the template should be copied. For this "c" will be copied to "c´".
     

     
     
     

    Links between subtrees

     
    In the section above we have seen that links between subtrees are lost if the subtrees are not inserted simultaneously. If the links are to be kept, it is necessary to insert the subtrees simultaneously, as is done by the intelligent algorithm. Then the links can be treated as in case 3 and are copied when inserting the template. Fig. shows an example of this. If the subtrees had been inserted one after the other, the link would have been destroyed, because according to case 1 the value of the ID attribute in the right subtree would have had to be set to another value.

     
    The intelligent algorithm permits the insertion of several subtrees simultaneously. In addition links between these subtrees will be copied correctly.

     
     
     

    HyTime CLINKs

     
    HyTime CLINKs typically use an indirection via <nameloc> elements, e.g. for inter-document linking. Even if these <nameloc> elements are not part of the template, they have to be copied too, so that the link is not lost. This can be achieved with the intelligent algorithm by adding functionality that automatically searches for the relevant <nameloc> elements and copies them into an additional subtree. In the example of fig. this could be subtree <t2>. The functionality to be added gives the template algorithm knowledge about the semantics of HyTime CLINKs, namely the collection of the affected <nameloc> elements and their integration into the template. With this, the absolute DTD independence is restricted to DTDs which support HyTime.
     
    It is not recommended to extend the template up to the first common ancestor <a>, because in some cases this would restrict too strictly the possible positions where the original template <t1> could be inserted. Because of this, the insertion algorithm should be designed so that several subtrees can be inserted even if the template does not contain their common ancestor.
     
     

    How to handle entities

     
    If a template contains references to entities, it must be guaranteed that they are defined in the document into which the template is to be inserted. Name conflicts with entities already defined in the document have to be avoided.
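
A check of this kind could look like the following sketch. It assumes that the entity declarations of the document and of the template are available as simple name-to-value mappings; the function `check_entities` and the example entities are hypothetical.

```python
def check_entities(document_entities, template_entities):
    """Split the template's entities into those that still have to be
    declared in the document and those whose names conflict."""
    missing = {}
    conflicts = []
    for name, value in template_entities.items():
        if name not in document_entities:
            missing[name] = value          # declaration must be added
        elif document_entities[name] != value:
            conflicts.append(name)         # same name, different definition
    return missing, sorted(conflicts)


document_entities = {"company": "debis Systemhaus", "logo": "logo-a.gif"}
template_entities = {"company": "debis Systemhaus", "logo": "logo-b.gif", "city": "Ulm"}

missing, conflicts = check_entities(document_entities, template_entities)
print(missing)    # {'city': 'Ulm'}
print(conflicts)  # ['logo']
```

How a conflict is resolved, for example by renaming the template's entity, is left open here, as it is in the text above.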
     
     

    How to handle external files

     
    If a template contains links to external files (e.g. graphics), it must be guaranteed that they are accessible from the document into which the template is to be inserted. For example, attention has to be paid to access rights, or to the case where the template is copied into a file on another machine. If necessary, the external file has to be copied when inserting a template. Name conflicts with external files already linked to the document have to be avoided.
     
     

    The algorithm and experiences with it

     
    All things considered, an intelligent template algorithm has been realized and integrated into an SGML editor. A prerequisite was that the editor allows access to the SGML structure information of the document.
     
    The user can create a template by selecting an area of the document in the editor and activating the function "Create Template". The template is stored in a file which the user has to choose.
     
    By activating the function "Insert Template" the user can choose a template and insert it into the current document between the cursor and the end of the document.
     
    Every time the template or parts of it could be inserted, the user is asked whether he accepts this position. Here he can choose among the following possibilities:
    1. Accept or not accept the insertion at this position.
    2. Stop the template algorithm or search for a next position.
     
     
    With the configuration (see section ) the number of different permutations can be considerably reduced, if required.
     
    The algorithm to insert templates consists of the following loop:
     
    Go from the selected element in the current document element by element to the end of the document and repeat the following steps for each element:
    1. If the current element does not have any child, try to insert the whole template as child. On success go to AskUser.
    2. If the current element and the template root element are identical, try to insert the template around the existing structure in the document (intelligent insertion). Try all permutations one after the other. If the user accepts a permutation and wants to insert the same template another time, let the current element be the first element inserted into the document and go to step 1. If the user accepts and does not want this, the template algorithm is completed successfully at this point. If the user does not agree, try the next permutation if there is another one; otherwise go to step 4.
    3. If the current element has at least one child, try to insert the whole template as left sibling of the first child of the current element. On success go to AskUser. If the template could not be inserted, call step 1 recursively with the first child of the current element as current element. Repeat step 3 for all other children (if there is more than one) of the current element. Finally try to insert the whole template as right sibling of the last child of the current element. On success go to AskUser.
    4. Let the current element be the next element in the document and continue with step 1.

    AskUser:
      1. If the user does not agree to insert the template at this position, remove the inserted template and go to step 4. If the user accepts this position, the template remains in the document and the inserted ID attributes are modified according to section .
      2. If the user wants to insert the same template another time, let the root element of the inserted template be the current element and continue with step 1. If the user does not want this, the template algorithm is completed successfully at this point.
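
The overall control flow of this loop (search, tentative insertion, AskUser, removal on rejection) can be sketched as follows. The sketch covers only the simple insertion of a whole template, not the intelligent insertion of step 2. It starts at the document root instead of a selected element and again assumes content models that have been hand-translated into regular expressions. `Node`, `MODELS`, `candidate_positions` and `insert_template` are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


# Content models of example 1 as regular expressions (an assumption).
MODELS = {
    "team-members": r"(team-member )+",
    "team-member": r"roles name (department )?(address )?(zip )?"
                   r"(city )?(phone )?(fax )?",
    "roles": r"(role )+",
}


def fits(parent, names):
    """Check a sequence of child names against the parent's content model."""
    model = MODELS.get(parent.name)
    encoded = "".join(name + " " for name in names)
    return model is not None and re.fullmatch(model, encoded) is not None


def candidate_positions(root, template):
    """Yield (parent, index) pairs in document order where the template
    root may be inserted as a child of parent."""
    stack = [root]
    while stack:
        node = stack.pop()
        names = [child.name for child in node.children]
        for index in range(len(names) + 1):
            if fits(node, names[:index] + [template.name] + names[index:]):
                yield node, index
        stack.extend(reversed(node.children))


def insert_template(root, template, ask_user):
    """Present each position in turn; keep the first one the user accepts."""
    for parent, index in candidate_positions(root, template):
        parent.children.insert(index, template)   # show the insertion
        if ask_user(parent, index):
            return parent, index                   # accepted: keep it
        del parent.children[index]                 # rejected: remove it again
    return None


document = Node("team-members", [Node("team-member"), Node("team-member")])
template = Node("team-member")
answers = iter([False, True])                      # reject once, then accept
result = insert_template(document, template, lambda parent, index: next(answers))
print(result[1], len(document.children))
```

A real editor would insert a copy of the template rather than the template object itself and would then adjust the ID attributes of the accepted copy.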
       
     
     
    The template algorithm is easy to use and represents a significant help for authors of SGML documents. The number of presented permutations did not cause any problems so far.
     
     

    Conclusion

     
    The SGML templates presented in this paper extend the functionality of conventional SGML editors: templates consisting of one or several subtrees, with or without content, can be created and inserted afterwards, even into other documents. The more extensive a template is and the more frequently it is used, the more time and effort the user can save. Where the template is to be inserted, the structure of the document must be appropriate. If the template does not fit directly into a gap in the document, there must be overlaps, that is, there must be common elements in the document and in the template. To keep the template algorithm general and easy to use (the user is not asked for information about missing elements), the template algorithm cannot create additional elements between the document and the template to be inserted. Consequently the creation of templates becomes very easy: the user only has to create extensive templates, that is, the template root element has to be chosen as high in the hierarchy as possible. The template root element would have to be chosen much more carefully if the template algorithm could not add parts of the template around existing elements in the document.
     
    Furthermore, the template algorithm can insert several subtrees simultaneously, because elements of the template are permitted to exist in the document already. The benefit is that links, even between subtrees, are not lost, either when creating or when inserting the template. If the subtrees were inserted successively, these links could not be handled. In addition, this is the basis for supporting indirect HyTime CLINKs. For this, some knowledge about the semantics of HyTime CLINKs has to be added to the template algorithm (with this, the absolute DTD independence is restricted to DTDs which support HyTime).
     
    The template algorithm shows all possibilities for inserting the template into the document. From these possibilities the user can choose the appropriate one. Configurations help to reduce the number of different possibilities, if required. With all this, the template algorithm can be used directly, without any adaptation, for all possible DTDs. Beyond that, templates can be used in the same way for valid XML.
