
 
 

Hypertext Linking with HTML, SGML and XML - technologies and techniques


 
Neil Bradley
  Senior Consultant
 
8 Beacon House
Burrells Wharf Square
Isle of Dogs
London, England E14 3TJ
Email: neil@bradley.co.uk Web: www.bradley.co.uk
 
Biographical notice:
 
Neil Bradley
 
Mr Bradley has worked with SGML for over 10 years, as a programmer, analyst, trainer and consultant, mainly in the data conversion industry. He has written DTDs and designed editorial and delivery systems for customers in the publishing, oil, aerospace, telecommunication and patent industries. He now works for TTCG (Thomson Technology Consulting Group) as a senior consultant, and wrote The Concise SGML Companion and The XML Companion.
 
ABSTRACT:
 HTML, Hypertext Markup Language 
 HyTime 
 SGML 
 URL 
 XLL 
 XML  
 hypertext 
 

One of the most powerful features of electronic document publishing is its ability to include interactive features that allow readers to navigate to other documents, or to other parts of the same document, directly from references in the text. This is termed a 'hypertext' linking feature. SGML (Standard Generalized Markup Language) incorporates a primitive intra-document scheme. HTML (HyperText Markup Language) uses the URL standard to provide simple, single-directional inter-document linking. XML (eXtensible Markup Language), in conjunction with the adjunct XLL (XML Linking Language) standard, offers much more powerful capabilities, such as bi-directional links, and links into and out of read-only documents, or other documents that have no unique element identifiers. Finally, the HyTime standard contains very advanced facilities for multi-media based systems, including links to coordinates in an image, or to a time-slice within a movie clip.
 
This paper describes all these technologies, and offers practical tips for the use of the more common ones. In particular, techniques for assigning unique ID values to elements in large, complex documents are discussed, as well as issues relating to the tracking and control of such links, including the use of databases to automate or validate them.
 
Almost as soon as the written document was invented, there was found to be a need to refer the reader to related text found in another document, or in a separate part of the same document. The reference may direct the reader to the title of another document, to a chapter heading or to a page number. One major benefit of electronic publishing is that browsing software can be made to perform the tedious task of actually locating and accessing the remote information. The browser is said to 'follow the link' when the reader clicks the mouse over a highlighted reference.
 
The three related markup languages, SGML, HTML and XML, all offer hypertext linking features, and it is not surprising that their capabilities overlap somewhat. SGML offers a very simple intra-document linking scheme. To overcome the limitations of SGML, HyTime was devised to provide multi-directional and media-independent links. HTML uses a different approach to allow inter-document linking to documents and parts of documents anywhere on the Internet. XML is compatible with SGML, but is expected to be used with XLL (the XML Linking Language) to provide a much more powerful set of hypertext linking options, including HTML-like Internet linking.
 
 

SGML

 
There are two special attribute types available in SGML that allow an element to be uniquely identified, or designated as an element that points to other elements. The ID attribute type is used to indicate that the attribute holds a value which uniquely identifies each instance of the element it belongs to. For example, a Chapter element could be given a Target attribute:
 
<!ELEMENT chapter (...)>
<!ATTLIST chapter target ID #REQUIRED>
<chapter target="chap1">...</chapter>
<chapter target="chap2">...</chapter>
 
An SGML parser detects the significance of the Target attribute, and checks to ensure that a value is not duplicated, as this would of course lead to ambiguity. If two chapters both had an identifier of "chap13", the browser would not know which one to access when a reference to this name was used in a link.
 
For a link to be fully described, the 'source' point must also be identified by a special purpose attribute, which holds the unique value of the required target. The IDREF attribute type serves this purpose:
 
<!ELEMENT link (...)>
<!ATTLIST link ref IDREF #REQUIRED>
<chapter target="chap1">...</chapter>
...
... for details see
<link ref="chap1">Chapter 1</link>.
 
An SGML parser can detect these attributes and check that the value held corresponds to the value of an attribute of type ID elsewhere in the document.
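The two checks a validating parser performs on ID and IDREF values can be sketched in a few lines of Python. This is only an illustration, not how any particular SGML parser works: the standard library's ElementTree does not read the DTD, so the attribute names (`target` and `ref`, taken from the examples above) are passed in as assumptions, and the helper name `check_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_links(xml_text, id_attr="target", ref_attr="ref"):
    """Return (duplicate_ids, dangling_refs) for a single document."""
    root = ET.fromstring(xml_text)
    seen, duplicates = set(), []
    # First pass: every identifier value must occur only once.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    # Second pass: every reference must match an identifier found above.
    dangling = [elem.get(ref_attr) for elem in root.iter()
                if elem.get(ref_attr) is not None
                and elem.get(ref_attr) not in seen]
    return duplicates, dangling

doc = """<book>
  <chapter target="chap1">See <link ref="chap2">Chapter 2</link>.</chapter>
  <chapter target="chap1">See <link ref="chap1">Chapter 1</link>.</chapter>
</book>"""
duplicates, dangling = check_links(doc)
```

Here the second chapter repeats the identifier "chap1", and the first link refers to a "chap2" that does not exist, so both kinds of error are reported.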
 
 

HTML

 
The HTML format was actually built around its hypertext linking feature. Indeed, the first two letters in its name stand for 'HyperText'. HTML consists of a list of pre-defined tags and attributes, so the concept of attribute types does not apply here. An attribute called Name is used to hold unique identifier values. An attribute called Href (Hypertext REFerence) is used to point to one of these values. The surprising thing is that both these attributes are used in the same element, the A element (Anchor). This means that there is an A element at both the source and destination ends of the link.
 
<H2><A name="chap2">Chapter Two</A></H2>
...
... and see <A href="#chap2">Chapter Two</A> 
for details...
 
However, in HTML 4.0, almost all elements have been given an attribute called Id, which may also hold a unique identifier value. Using this attribute makes HTML documents look much more like SGML documents.
 
<H2 id="chap2">Chapter Two</H2>
...
... and see <A href="#chap2">Chapter Two</A>
for details...
 
There is of course a discrepancy between the identifier value and the reference to it in the examples above. The preceding '#' symbol is significant to a Web browser. All text after this symbol is a document fragment identifier (in this case 'chap2'). Any text preceding the hash symbol is a URI (Uniform Resource Identifier), which locates a document anywhere on a local file system, local network, or remote system that is connected to an intranet or the global Internet. For example, to view the part of the HTML 3.2 specification that describes the A element, the link reference is 'http://www.w3.org/TR/REC-html32#anchor'. As the example links above were internal to the document, no text preceded the '#' symbol. When the '#' is not present, the anchor points to another complete document.
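The division of a reference at the '#' symbol can be illustrated with Python's standard `urllib.parse.urldefrag` function. The wrapper name `split_reference` and the file name `otherdoc.html` are hypothetical, chosen only for this sketch.

```python
from urllib.parse import urldefrag

def split_reference(href):
    """Split an Href value into (document URI, fragment identifier)."""
    uri, fragment = urldefrag(href)
    return uri, fragment

# A link to a named part of a remote document.
external = split_reference("http://www.w3.org/TR/REC-html32#anchor")
# A link within the current document: nothing precedes the '#'.
internal = split_reference("#chap2")
# A link to a complete document: no '#' and so no fragment.
whole_doc = split_reference("otherdoc.html")
```

An empty URI part signals a link within the current document, and an empty fragment signals a link to a complete document.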
 
 

XML

 
As XML is a subset of SGML, it is possible to use the ID and IDREF attribute types exactly as described above. It should be noted that in both cases the 'within a document only' requirement need not be as restricting as it first sounds. When preparing a weekly journal, for example, it is common practice for the "document" to be a single journal, but when publishing the material electronically, the whole collection can be joined together and be considered a single large document. It is of course necessary, however, to ensure that all identifiers will be unique across the whole collection when the material is first generated.
 
Although XML is aimed at the Web, like HTML, it does not itself have the same linking capability. However, an adjunct standard called XLL is being developed that not only replicates the URI scheme described for HTML, but goes well beyond it.
 
 

XLL

 
The links described so far have all been single directional, single target links, to elements that have a unique identifying name. Each of these factors can be seen as constraints in some circumstances. What if the target element has no identifier? What if I want the link to operate in both directions? What if there are several possible targets which the user may wish to choose from? The XLL standard provides answers to these questions.
 
XLL (the XML Linking Language) is currently being defined by the W3C. It is designed to complement XML (but could operate with SGML too). XLL features are attached to XML elements by use of attribute names that are deemed significant to any XLL-aware processing application. For example, when the attribute name 'xml-link' is detected, the element containing it is immediately identified as significant to the linking process. Other attributes on the same element provide additional information.
 
XLL first defines a standard attribute, called ‘Href’, which should be suspiciously familiar to HTML users. It holds the URL of the target element. This means that, like HTML, it is possible to locate documents and fragments of documents anywhere on the Web. However, in keeping with the generalized markup philosophy, there is no standard name for the element containing this attribute. To identify an element that contains a hypertext link, another attribute is used. Called XML-Link, this attribute also identifies the kind of link contained in the Href attribute. To simulate the capabilities of HTML, the value it holds is ‘simple’.
 
... see <goto xml-link="simple" href="otherdoc#part9">
part nine of the other document</goto>.
 
When a DTD is in use, it is likely that a specific element will be included purely to serve as a link identifier. In this case, it should not be necessary to include the XML-Link attribute in each element start-tag. Instead, it can be defined in the attribute declaration for that element:
 
<!ELEMENT goto (#PCDATA)>
<!ATTLIST goto xml-link CDATA #FIXED "simple"
               href     CDATA #REQUIRED>
...
... see <goto href="otherdoc#part9">part nine of the other document</goto>.
 
If the name ‘Href’ is inconvenient, perhaps because it is already used in the element for some other purpose, or because it is not deemed suitable for some other reason, then it can be changed. The mechanism by which this happens involves the use of another attribute, called ‘XML-Attributes’. This attribute contains pairs of values. The first value in each pair is the default name, as would normally be recognised by an XLL-aware processor. The second value is the replacement name. The only attributes which cannot be re-named are the XML-Link attribute and (for obvious reasons) the XML-Attributes attribute itself.
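The renaming mechanism can be sketched in Python as follows. This is an illustration under stated assumptions rather than the behaviour of any real XLL processor: it assumes that the pairs are separated by white space, the function names are hypothetical, and the replacement names 'target' and 'caption' are invented for the example.

```python
# Names that may not be re-mapped, as stated in the text above.
RESERVED = {"xml-link", "xml-attributes"}

def parse_remapping(value):
    """Turn 'default1 new1 default2 new2' into {default: new}."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("XML-Attributes must hold complete pairs of names")
    remapping = dict(zip(tokens[0::2], tokens[1::2]))
    forbidden = RESERVED.intersection(remapping)
    if forbidden:
        raise ValueError(f"cannot rename: {sorted(forbidden)}")
    return remapping

def effective_name(default_name, remapping):
    """Name the processor should look for in place of the default one."""
    return remapping.get(default_name, default_name)

remapping = parse_remapping("href target title caption")
href_name = effective_name("href", remapping)   # re-mapped
show_name = effective_name("show", remapping)   # left unchanged
```

With this value, a processor would read the link target from an attribute called 'target', while attributes that are not listed keep their default names.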
 
It is possible to move the linking information away from the source part of the link, possibly into a separate file. This has a number of benefits. First, it makes editing of links much easier as they are all in one place. Second, it means that bi-directional links are possible, because the concepts of source and destination become meaningless. Using this technique, it is possible to follow a link from a targeted element to the document that references it. Of course, there may be many references to a single target, but this can be handled too, simply by adding more links to the group. When the user selects a link, a list of possible destinations is presented. Each title in this list comes from the content of a Title attribute:
 
<extend inline="false">
<locator href="..." title="Summary"/>
<locator href="..." title="Details"/>
<locator href="..." title="In context"/>
<locator href="..." title="Other opinions"/>
</extend>
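The way an out-of-line group removes the distinction between source and destination can be sketched in Python. This is a simplified model, not an XLL implementation: each group is just a list of (href, title) pairs, the function name is hypothetical, and the href values are invented because the example above leaves them elided.

```python
def destinations(group, current_href):
    """List the (title, href) choices offered from one member of a group."""
    if all(href != current_href for href, _ in group):
        return []  # this resource does not take part in the group
    # Every other member is a possible destination, whichever end we start at.
    return [(title, href) for href, title in group if href != current_href]

# Hypothetical locators standing in for the elided href values.
group = [
    ("report.xml#summary", "Summary"),
    ("report.xml#details", "Details"),
    ("review.xml#context", "In context"),
]
menu = destinations(group, "report.xml#details")
```

Starting from any member of the group, the reader is offered the titles of all the others, so the same record serves every direction of travel.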
 
Using the extended link concept, it is possible to include references ‘out of’ documents that cannot be edited (possibly because they do not belong to the person or organisation creating the links). It would appear, though, that the resource in the ‘read-only’ document must at least have a unique identifier, so that it can be referred to for inclusion in the linked group. But this is not the case. XLL includes the concept of ‘extended pointers’, which are used to identify an object by its position. Using directions such as ‘go to the third child of the document element’ then ‘drill down to the first occurrence of a Para element’, it is possible to locate any element, and indeed any ‘pseudo’ element (the text between elements). Keywords such as ‘ROOT()’, ‘CHILD()’ and ‘STRING()’ are used to build a route to the target.
 
XLL is also able to ‘suggest’ the ways in which the target object should be presented. Two attributes are used in combination to provide various options. The Actuate attribute contains a value of ‘user’ or ‘auto’, indicating whether the link is only followed when the user selects it, or is followed automatically as soon as the reference appears on-screen. The Show attribute has a value of ‘replace’, ‘embed’ or ‘new’, indicating that the target object should replace the reference (the window scrolls or clears to make room for the target text), that the target text is to be embedded in the reference, or that the target text should appear in a new window. To emulate the standard behaviour of the A element in HTML, values of ‘user’ and ‘replace’ would be used. Values of ‘auto’ and ‘embed’ could be used to automatically insert the title of another object into each reference, which would be dynamically updated after the title is edited (note the similarity to entities in this scenario).
 
 

HyTime

 
HyTime is an ISO standard that was released in 1992 as an adjunct to SGML, well before XML and XLL came on the scene. It provides a large range of linking functionality to make up for SGML's primitive features. Perhaps due to its complexity, this standard has to date not been implemented widely. However, it has proved a useful reference framework for development of XLL, which shares some of its features, such as multi-directional links and the ability to locate resources by their contextual location. But HyTime has some unique features. The most interesting is its ability to isolate non-identified segments of non-SGML data, such as music and video clips. It is possible, for example, to link to a 15 second clip of music in the middle of a musical piece. HyTime still has a possible future in building the framework of multi-media presentations.
 
HyTime allows links to be directed through an intermediate structure. The link itself still refers, by means of an ordinary ID and IDREF pair, to an element in the same document, so no parser error arises, but that intermediate element in turn locates an object in another document. This technique is also used to point directly to non-SGML formats, such as images:
 
<nameloc id="MyLogo" nametype="entity">
<nmlist>MyLogo</nmlist>
</nameloc>
...
See my company
<clink linkend="MyLogo">logo</clink>.
 
A ‘tree locator’ concept is used to identify objects that are part of larger objects and also do not have a unique identifier of their own. The XLL extended pointer concept is derived from, and expands upon, this feature, though the syntax is entirely different. Simple numbers are used to identify the sequential position of each element at each level in the document hierarchy. For example, to identify the third paragraph in the fifth section of the second chapter in a book, the values ‘2 5 3’ would be used (assuming no other preceding elements, such as titles, are present at any level in this structure).
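The positional addressing described above can be sketched in Python. This follows the simplified description in the text and is not the actual HyTime syntax: the sketch uses an XML sample with ElementTree, and the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def locate(root, positions):
    """Walk down from the root, taking the Nth child (counting from 1) at each level."""
    node = root
    for position in positions:
        children = list(node)
        if not 1 <= position <= len(children):
            raise IndexError(f"no child number {position} under <{node.tag}>")
        node = children[position - 1]
    return node

# A book whose second chapter has five sections; the fifth has three paragraphs.
filler = "<section><para>filler</para></section>" * 4
fifth = "<section><para>one</para><para>two</para><para>three</para></section>"
book = ET.fromstring(f"<book><chapter/><chapter>{filler}{fifth}</chapter></book>")

target = locate(book, [2, 5, 3])
```

The sketch also shows why such locators are fragile: inserting a title before the sections would shift every position, which is the caveat noted in parentheses above.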
 
For information formats that do not include discrete identifiable blocks, such as music and video clips, a selection can be identified using coordinates. For example, a video clip can be located in a complete film by specifying the period in time that it spans, say from 1 minute 3 seconds to 2 minutes, 14 seconds.
 
Finally, the HyQ query language allows an object to be identified by some feature, rather than by any prior knowledge (either direct or indirect) of its location. For example, to find the word ‘summary’ in a Title element, the following query could be used:
 
  select ( DOMTREE And ( Eq(Proploc(CAND GI) "TITLE")  He(Dataloc(CAND 1 -1) "summary") ) )  
 
 

Managing Links

 
Unique identifiers may be assigned to objects as they are created in a number of different ways. Which method to choose depends largely on the capabilities of the working environment.
 
An SGML/XML editor is able to validate identifier values for uniqueness. Providing that all links are to other objects in the same file, these products can also validate that references point to valid target objects. However, the choice of values is left largely to the user. When the ‘system’ comprises nothing more than such an editor, it is left to the user to generate the value and to know what that value was when later creating a reference to the object it identifies.
 
Identifiers within a single, small document typically reflect the title or heading of the section concerned. Values such as ‘introduction’, ‘summary’ and ‘requirements’ are sufficient.
 
For cross-document linking, it is usual to prefix the part identifier with a brief file identifier. When the operating system restricts the file name severely, such as MS-DOS eight character names, this file name may be used. The file name needs to appear in each object identifier for this to work (though this could be done automatically as part of a publishing preparation process). For example, ‘BD009-summary’, ‘BX901-summary’ and ‘XYZ12345-summary’ are three identifiers that are found in ‘BD009.SGM’, ‘BX901.SGM’ and ‘XYZ12345.SGM’ respectively. Note that the references must not be held in attributes of type IDREF in these cases (in SGML or XML), as these are separate documents that when parsed will be found to contain links to objects that are not present in the same file. However, this technique is not required for HTML, as the browser separates the action of accessing the remote document from the act of jumping to the required object. The Href attribute contains the file name (actually the full URL to the file), but the Name or Id attribute only contains the fragment identifier.
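The automatic prefixing mentioned above, as part of a publishing preparation process, might look like the following Python sketch. The function names are hypothetical, and the sketch assumes that the file identifier itself contains no hyphen, so that the first hyphen always separates the two parts.

```python
from pathlib import PurePath

def qualified_id(file_name, local_id):
    """Prefix a local identifier with the file name, minus its extension."""
    return f"{PurePath(file_name).stem}-{local_id}"

def split_id(identifier):
    """Recover (file identifier, local identifier) at the first hyphen."""
    file_id, _, local_id = identifier.partition("-")
    return file_id, local_id

summary_id = qualified_id("BD009.SGM", "summary")
section_id = qualified_id("BD009.SGM", "1.3")
parts = split_id("XYZ12345-summary")
```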
 
Unfortunately, a title may be much longer than is suitable for use as an identifier value (even if it can be accurately remembered). For example, ‘Individual Statistical Results - Stage III (Simulated Trials)’ would be hard to recall and copy accurately. Although abbreviation schemes may be developed, different titles may actually abbreviate to the same code. For example, ‘Individual Results’ and ‘Indian Restaurant’ would both resolve to ‘IndRes’ if the scheme adopted was to use the first three letters of each word.
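The collision problem is easy to detect before any identifiers are issued. The Python sketch below applies the scheme just described (the first three letters of each word) and reports any titles that share a code. The function names are hypothetical.

```python
import re
from collections import defaultdict

def abbreviate(title, letters=3):
    """Join the first few letters of each word, ignoring punctuation."""
    words = re.findall(r"[A-Za-z]+", title)
    return "".join(word[:letters].capitalize() for word in words)

def find_collisions(titles):
    """Map each abbreviation shared by two or more titles to those titles."""
    by_code = defaultdict(list)
    for title in titles:
        by_code[abbreviate(title)].append(title)
    return {code: group for code, group in by_code.items() if len(group) > 1}

collisions = find_collisions(["Individual Results", "Indian Restaurant", "Summary"])
```

Running such a check over all the titles in a collection shows whether a proposed abbreviation scheme is safe to adopt.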
 
Fortunately, long documents tend to include section and sub-section numbering, such as ‘1. Introduction’, ‘1.1 Background’, ‘1.2 Participants’ and ‘1.3 Feedback’. Providing that new sections are unlikely to be added, or old ones deleted, these numbers can be used as unique identifiers. For example, ‘BD009-1.3’ refers to section 1.3 of file BD009.SGM. This approach is particularly suitable when the reference text uses these numbers to identify the object in a printed version of the text. For example, ‘To see who participated in the scheme see section 1.2.’. In this case, adding the reference code is simply a matter of copying the ‘1.2’ into the reference attribute.
 
As already indicated, the solution described above is not suitable when the data is changed by adding new or deleting old material. A change in section numbering will make all references of this type point to the wrong item. Of course, this is not a huge issue when the reference text itself must also be edited to reflect the re-numbered target. But when this is likely to happen, the numbers may be generated automatically to avoid such editing requirements.
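Automatic generation of the numbers can be sketched in Python as a walk over the section hierarchy. The element name `section` and the attribute name `title` are assumptions made for this example and do not come from any particular DTD.

```python
import xml.etree.ElementTree as ET

def number_sections(parent, prefix=""):
    """Return {section title: number}, such as '1' and '1.2', from nesting order."""
    numbers = {}
    for index, section in enumerate(parent.findall("section"), start=1):
        number = f"{prefix}{index}"
        numbers[section.get("title")] = number
        # Sub-sections extend the number of the section that contains them.
        numbers.update(number_sections(section, prefix=f"{number}."))
    return numbers

doc = ET.fromstring(
    '<doc><section title="Introduction">'
    '<section title="Background"/>'
    '<section title="Participants"/>'
    '<section title="Feedback"/>'
    '</section></doc>'
)
numbers = number_sections(doc)
```

Because the numbers are recomputed from the structure each time, adding or deleting a section changes them consistently, and the same table can be used to regenerate both the headings and the reference text.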
 
When neither titles (or abbreviated titles) nor section numbering are appropriate or possible, it may be necessary to simply assign a unique code that has no obvious connection to the object it identifies, in the same way that a barcode number bears no obvious relationship to the tin of beans it represents. A simple sequence count is used to generate new unique values as they are required. This value could be generated by software, or simply be written down and ticked when it is used. Inserting references is no longer simple, though, as the author has no idea what identifier may have been used. Either the referenced document and fragment must be displayed whenever a reference to it is made, or a database/spreadsheet/paper record must be maintained.
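A software version of the sequence count, combined with the record that lets an author find an identifier again, could be as small as the Python sketch below. The class name, the ‘ID’ prefix and the six-digit format are all assumptions made for illustration, and a real system would keep the record in a database rather than in memory.

```python
import itertools

class IdRegistry:
    """Issue sequential identifiers and remember what each one refers to."""

    def __init__(self, prefix="ID", start=1):
        self._prefix = prefix
        self._counter = itertools.count(start)
        self.records = {}

    def assign(self, document, description):
        """Issue the next identifier and record the object it belongs to."""
        identifier = f"{self._prefix}{next(self._counter):06d}"
        self.records[identifier] = (document, description)
        return identifier

    def find(self, text):
        """Look up identifiers whose description mentions the given text."""
        wanted = text.lower()
        return [identifier
                for identifier, (_, description) in self.records.items()
                if wanted in description.lower()]

registry = IdRegistry()
first = registry.assign("BD009.SGM", "Individual Statistical Results - Stage III (Simulated Trials)")
second = registry.assign("BX901.SGM", "Summary")
matches = registry.find("stage iii")
```

The lookup replaces the need to remember the code: the author searches on part of the title and copies the identifier that is returned into the reference.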
