
 

Business applications made easy

 Daniel Rivers-Moore
 Director of New Technologies
RivCom,  Lotmead Business Village
Wanborough
Swindon  Wiltshire  SN4 0UY United Kingdom
Phone: +44 (0) 1793 792004 Fax: +44 (0) 1793 792001 email: daniel-rivers-mooret@rivcom.com web site: www.rivcom.com
 Biography
 Daniel Rivers-Moore is Director of New Technologies at RivCom, a consultancy and services company specializing in helping businesses adopt XML technologies to meet their information management and distribution needs. He has been actively involved in the development of the XML family of standards, having been a member of the original XML Special Interest Group, joint project leader of the STEP/SGML harmonisation initiative under ISO, and software development lead in the recently completed European XML/EDI Pilot Project. In April 1997, at the WWW6 conference in Santa Clara, he gave the world’s first public demonstration of XML content being displayed within an industry-standard browser (using a browser plugin developed by RivCom). Daniel has spoken at numerous international conferences related to product data and structured information. Recent speaking engagements include presentations on XML at the Plant Information Management Conference, the Annual Conference of CITE (Construction Industry Trading Electronically), the European Commission’s XML/EDI Dissemination Event, an XML Awareness Day for the British Computer Society’s Object Oriented Programming Specialist Interest Group, and a strategy planning meeting on XML adoption at the NATO Consultation, Command and Control Agency in The Hague. He is currently assisting the International Press Telecommunications Council (IPTC) in the development of NewsML, an XML-based standard for the management and interchange of news objects and collections of news objects in all media.
 Abstract
 The XML family of standards makes possible the next logical step in the evolution of software from more procedural to more declarative approaches. With a declarative approach, an application developer specifies what data transformations should occur, rather than how each transformation should be performed. This presentation will describe an approach to application design and development that makes maximal use of this concept, seeing the application as a series of transformations (specified using XSLT) between different information structures (specified using XML).
 The generic information structures developed by the recently completed European XML/EDI Pilot Project will be described, and prototype applications built on the basis of them will be demonstrated. These prototypes will show how powerful, multi-lingual, configurable, distributed software applications can be built out of simple reusable standards-based components.
 The roles of XPath, XLink and XML Schema in the specification of such applications will be discussed, and a case will be made for the development of an XML "Action Language" which, once the logical structure of an application has been defined, will serve to fire the process and make the application run in real time. The XML Action Language developed by the European XML/EDI Pilot Project will be described, and its use in driving the prototype applications will be explained.
 

What is a software application?

 
A software application is essentially a tool for transforming one form of information into another, with or without a degree of human intervention. In this sense a software process is highly analogous to a business process or an industrial process. All three transform inputs into outputs of (hopefully) higher value. Such processes can be defined procedurally – by specifying how they are carried out – or declaratively – by specifying what should be transformed into what.
 
The historical evolution of software has been a steady movement from more procedural to more declarative approaches. Object orientation represented a significant step towards a more declarative paradigm for software application design. XML makes possible the next logical step in this direction.

There was a time when IT was known as DP (data processing). Before graphical user interfaces and hands-on computing, the role of the computer was essentially seen as one of transforming different data structures into one another as the information they represented moved through the business process.

Today, humans are much more intimately involved in the information flows. The desktop PC took data out of the centralised data banks and put it within reach of office workers sitting at their desks. The home PC, the Web and the Internet-enabled mobile phone continue this trend and bring information and data to our fingertips and into every corner of our daily lives. But at root, what is happening is still the same. Information is being encoded through different data structures, and moved around the world, being transformed as it goes according to the needs of the business or of the individual user.
 

The nature of the user interface

 Some of the data transformations involved in the coming generation of Internet-based distributed applications will be hidden from the user. They may happen on Web servers, inside client PCs, inside a mobile phone or a household appliance – the user really doesn’t care. Other transformations will happen in a hands-on manner, with the user intimately involved through some kind of user interface.
 The browser is likely to become the ubiquitous user-interface host on the desktop or laptop PC, but other user interfaces will be provided through palmtop devices, mobile phones, TV sets with their remote controls and a host of other appliances ranging from motor-car dashboards to microwave ovens. The requirement on user-interface design is therefore not to create a dedicated application with its own graphical user interface, but rather to create ways of transforming whatever data the user needs to interact with into XHTML, WML, or whatever XML flavour is required for the particular interface device being used.
 
In order to maximise flexibility, and the possibilities of reuse of code, it is important to separate out logically distinct aspects of information, and hence separate out the transformations involved so that only one logical transform is performed at a time. It is also important to separate the structural aspects of the user-interface design, which will be the same whatever physical device type is being used to host the interface, from the purely presentational aspects, which will be specific to a particular kind of device.
 XML itself is based on the principle of separation of content from presentation, so this approach finds a natural fit in the XML paradigm. The XML transformation-specification language, XSLT, will play a crucial role in this aspect of application design (as indeed in several others, as we shall see shortly). Using XSLT transformations at the point of delivery makes it possible for the application designer to consider the logical structure of the user interface separately from its presentational aspects.
 

Principles of application design

 As part of the European XML/EDI Pilot Project (a collaborative project carried out during 1999 and part-funded by the European Commission through its ISIS programme) the following design principles for Internet-based applications were identified:
 
  • Use W3C standards wherever possible to drive the application in all its aspects
  • Use XML as the default data format, at all levels of the application
  • Decompose the complexity of the application requirements into a sequence of data transformations
  • Maximise reuse, by minimising the number of distinct data structures used
  • Use a common, generic data structure for all data that is not of such complexity as to require a dedicated, optimised structure of its own.

 The separation of logical user-interface design from presentation, which we discussed in the previous section, is one aspect of design for maintainability and extensibility that will be essential in the brave new world of Internet-based computing. It is an example of the third principle above, namely decomposing the complexity of the requirements into a sequence of data transformations. By carrying out two transformations instead of one – first from the interchange format to the logical structure of the user interface, then from that logical structure to its specific manifestation in the interface device being used – the application becomes far more comprehensible, extensible and maintainable.
     

    Minimising the number of distinct data structures

     Let’s look now at the next principle – minimising the number of data structures used by an application.
     A little thought will make it clear that there is, in a sense, a minimum number of distinct data structures required in any Internet-based distributed application:
     
  • Firstly, there must be a data structure for the transfer of information between machines. Given that there is no guarantee that the machines that need to communicate with one another will be using the same software, or following the same business rules, this exchange will need to be based on some standard representation for the kind of information being sent, agreed by some community broad enough to include at least the sender and the receiver of the message. There are currently numerous initiatives under way to define such common interchange formats for a plethora of broad or narrow industry domains and communities of interest. We shall call this data structure the interchange format.
     
  • Secondly, there will probably be a structure for storage of a persistent record of the information. We shall call the format used for this purpose the storage format, which may or may not be the same as the interchange format. It may be that the information is stored in some object-oriented or relational database repository, whose structure is designed to map directly to the XML format used for interchange. But there may be sound reasons, to do with data management, search or performance optimisation, for making the storage and interchange formats quite different from one another, independently of the kind of repository in which the storage actually takes place.
     
  • Thirdly, there must be a structure that can be interpreted by whatever device (or helper application) is being used to drive the user interface. This might be HTML – or, in the near future, XHTML – to drive a browser-based user interface, WML for an interface through a mobile telephone, Microsoft Excel format for an application that uses Microsoft Excel for data input or display, and so on. We shall call this the device format. It is specific to a particular kind of device, but generic to all applications that use that device. Individual application developers do not need to design their own device formats. They are a property of the devices used, and are defined either by standardisation bodies or by the manufacturers of those devices.
     
  • Fourthly, there must be a structure for describing the logic of the way the information is to be presented to or interfaced with by the user. We shall call this the interface logic format. As we have indicated above, conceptual interface design can be expressed in this format, then transformed at the point of delivery into the device format, through a transformation process perhaps driven by one or more XSLT stylesheets.
     
  • Fifthly, there must be a structure for describing the application’s functionality and behaviour. In traditional application design, this could take many forms. Perhaps a formal model of the application would be developed using a CASE (computer-aided software engineering) tool based around UML. Perhaps no formal model would be developed at all, but the application logic would be implicit in the lines of C, C++, Java or whatever coding language the programmer used to write the application itself. However, in the interest of developing extensible and maintainable applications, which is our concern here, we shall assume some kind of formal model, and will therefore need what we shall call an application logic format in which it is expressed. This is consistent with our wish to move from a procedural towards a more declarative approach to application design.
     
  • Finally, there may be any number of data stores and data sources, holding information on which the application needs to draw, in addition to the information provided directly by the user or received over the wire in the interchange format mentioned above. There will therefore need to be one or more formats in which this additional information is delivered to the application. We shall call this format or formats the supplementary data format(s).
     
    To summarise, an Internet-based distributed application will need, as a minimum, between two and six distinct kinds of data format, namely:

  • one or more interchange formats
  • zero or more storage formats (depending on whether the application requires a persistent record of the information exchanges or transformations that occur when it is run)
  • zero or more device formats (depending on whether users will be directly involved in the running of the application)
  • an interface logic format (if users are directly involved)
  • an application logic format to drive the behaviour of the application itself
  • zero or more supplementary data formats.

    Building an application

    I’d now like to take a look at how the application-design principles listed above were applied by the European XML/EDI Pilot Project to build a prototype application – a Transport Firm Booking application based around established EDIFACT (Electronic Data Interchange For Administration, Commerce and Transport) messaging protocols for container transport operations.

    Based on the first two of our guiding principles, the application was built using XML for all the data formats it required, and XSLT to define all the necessary transformations. This involved developing some general-purpose XML structures for the last three kinds of data format listed above. Let’s take a look at these now.
     

    Datasets and items

    In order to minimise the number of distinct data formats needed in its applications, the European XML/EDI Pilot Project developed a DTD (Document Type Definition) for what it called Extensible Information Sets (XIS), consisting of Datasets and Items. This was used as a common supplementary data format throughout the application.
     One of the requirements of the project was to build applications that allowed users speaking different languages to use the application interfaces. Thus, it was necessary to configure a given user interaction session for language, and this required a list of supported languages as a piece of supplementary data. This is how the language list looked:
     
    <Dataset Suid="language" Type="language">
    <Description xml:lang="EN">Language</Description>
    <Item Suid="EN">
    <Description>English</Description>
    </Item>
    <Item Suid="FI">
    <Description>Suomi</Description>
    </Item>
    </Dataset>
    
     This presents a Dataset consisting of Items of type “language”, each with a “sibling-unique identifier” (Suid) and a description.
     By adding further subelements to Item, richer structures can be handled. This approach was used for another Dataset consisting, this time, of the carriers that might be used to undertake the transportation operation:
     
    <Dataset Suid="carrier" Type="carrier">
     <Description>Carrier</Description>
     <Item Suid="carrier1">
      <Description>R.C.Duke</Description>
      <Company>R.C.Duke and Co. Ltd.</Company>
      <Email>customerservice@rcduke.com</Email>
      <Phone>+44(0)171 123 4567</Phone>
     </Item>
     <Item Suid="carrier2">
      <Description>Universal</Description>
      <Company>The Universal Transport Company</Company>
      <Email>orders@unitrans.co.uk</Email>
      <Phone>+44(0)181 222 3333</Phone>
     </Item>
     <Item Suid="carrier3">
      <Description>ABC Carriers</Description>
      <Company>ABC Carriers of Europe</Company>
      <Email>info@abcc.co.uk</Email>
      <Phone>+44(0)1793 121212</Phone>
     </Item>
    </Dataset>
    
     Another powerful way to enrich these data sets is to provide multiple Description elements for any Dataset or Item. These can be qualified by language (using the xml:lang attribute) and/or by variant. As we shall see shortly, these Description qualifiers can be used to drive a highly configurable user interface.
     
    <Dataset Duid="Label" Type="Label">
    ...
    <Item Duid="Label24">
    <Description xml:lang="EN" Variant="Full">Company name</Description>
    <Description xml:lang="FI" Variant="Full">Yrityksen nimi</Description>
    <Description xml:lang="EN" Variant="Compact">Company</Description>
    <Description xml:lang="FI" Variant="Compact">Yritys</Description>
    </Item>
    ...
    </Dataset>
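
     As a rough illustration of how such qualified Description elements might be consumed, here is a minimal Python sketch using only the standard library. It is not the project's own code, which used XSLT; the function name describe is an assumption made for the example, and the elided parts of the Label Dataset are simply left out.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

LABELS = """
<Dataset Duid="Label" Type="Label">
  <Item Duid="Label24">
    <Description xml:lang="EN" Variant="Full">Company name</Description>
    <Description xml:lang="FI" Variant="Full">Yrityksen nimi</Description>
    <Description xml:lang="EN" Variant="Compact">Company</Description>
    <Description xml:lang="FI" Variant="Compact">Yritys</Description>
  </Item>
</Dataset>
"""

def describe(dataset, item_id, lang, variant):
    """Return the Description text for one Item, or None if nothing matches."""
    for item in dataset.iter("Item"):
        if item.get("Duid") != item_id:
            continue
        for desc in item.findall("Description"):
            if desc.get(XML_LANG) == lang and desc.get("Variant") == variant:
                return desc.text
    return None

labels = ET.fromstring(LABELS)
print(describe(labels, "Label24", "EN", "Full"))     # Company name
print(describe(labels, "Label24", "FI", "Compact"))  # Yritys
```

     The same lookup serves every language and every variant, so adding a language means adding Description elements, not code.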
    
     

    Logical forms and their presentation

    One of the requirements of the Transport Firm Booking application was to allow the user, once a carrier had been chosen, to enter details of the transportation operation he or she wants that carrier to perform. For a user seated at a PC with a Web browser, part of the data-entry form looks like this:
     The way the first section of this form is encoded in the interface logic specification is as follows:
     
    <Section LabelRef="Label2">
    <Label/>
    <Item LabelRef="Label24" InfoSource="Context2">
    <Label/>
    <ReadOnlyData/>
    </Item>
    <Item LabelRef="Label25" InfoSource="Context3">
    <Label/>
    <ReadOnlyData/> 
    </Item>
    <Item LabelRef="Label26" InfoSource="Context4">  
    <Label/>
    <ReadOnlyData/> 
    </Item>
    </Section>
    
     For a user with a mobile phone, the interface logic specification is unchanged, but the way it would be displayed might be quite different:

    The HTML required to produce the first result, and the WML required to produce the second, are both generated out of the same interface logic data. It is important to notice that the words that appear on the form are not included in the interface logic specification. Instead, LabelRef and InfoSource attributes are provided. The LabelRef attribute identifies one of the Item elements in the Label Dataset which was introduced at the end of the last section. If you look back at the extract from that file, you will see that Label24 has a Full English-language variant of “Company name”, and a Compact English-language variant of “Company”. In the browser it is the Full variant that has been used, and on the mobile phone it is the Compact variant.
     The next section of the browser form has two text fields available for the user to input their name and email address. The interface logic data for this part of the form is very similar to the previous one. It simply uses different LabelRef and InfoSource attributes, and replaces the ReadOnlyData elements with Interface elements consisting of EditableField elements containing empty Value elements. When the user enters data into a field, the corresponding empty Value element is populated with that data.
     
    <Section LabelRef="Label3"> 
    <Label/>
    <Item LabelRef="Label23" InfoSource="Context5">
    <Label/>
    <Interface>
    <EditableField>
    <Value/>
    </EditableField>
    </Interface>
    </Item>
    <Item LabelRef="Label25" InfoSource="Context6">
    <Label/>
    <Interface>
    <EditableField>
    <Value/>
    </EditableField>
    </Interface>
    </Item>
    </Section>
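
     The population of the empty Value elements can be sketched as follows. This is an illustrative Python fragment, not the mechanism used in the prototype; the function name populate and the sample input values are assumptions, and the input is keyed by InfoSource purely for convenience.

```python
import xml.etree.ElementTree as ET

SECTION = """
<Section LabelRef="Label3">
  <Label/>
  <Item LabelRef="Label23" InfoSource="Context5">
    <Label/>
    <Interface><EditableField><Value/></EditableField></Interface>
  </Item>
  <Item LabelRef="Label25" InfoSource="Context6">
    <Label/>
    <Interface><EditableField><Value/></EditableField></Interface>
  </Item>
</Section>
"""

def populate(section, entries):
    """Copy user input into the empty Value elements, keyed by InfoSource."""
    for item in section.findall("Item"):
        value = item.find("Interface/EditableField/Value")
        source = item.get("InfoSource")
        if value is not None and source in entries:
            value.text = entries[source]
    return section

# Hypothetical user input for the two editable fields.
section = populate(
    ET.fromstring(SECTION),
    {"Context5": "A. Customer", "Context6": "a.customer@example.com"},
)
values = [v.text for v in section.iter("Value")]
print(values)
```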
    
     

    Using XPath statements to identify information sources

    Just as the LabelRef attributes above referred to a Dataset of Labels, so the InfoSource attributes refer to a Dataset of Contexts. What is meant by a Context here is a specification of the place in the interchange format structure where the data in a particular field in the user interface comes from, and to which it is returned after it has been edited by the user.
     Pursuing the example of company name, we note that the InfoSource for the company name in the example above is identified as “Context2”. Let us now look at the context.xml file, which contains the Context statements these identifiers refer to. As we have said, we are using a common data format for all supplementary data, so this file too contains a Dataset of Item elements, but on this occasion the Type attribute of the Dataset is set to “Context”. Here is the relevant extract:
     
    <Dataset Type="Context">
    ...
    <Item Duid="Context2">
    <Description xml:lang="EN" Variant="Full">Carrier company name</Description>
    <Content>
    //PartyContactsGroup/Party[@PartyQualifier='Carrier']/Name/TextLine
    </Content>
    </Item>
    <Item Duid="Context3" Datatype="email">
    <Description xml:lang="EN" Variant="Full">Carrier e-mail</Description> 
    <Content>...</Content>
    </Item>
    ...
    </Dataset>
    
    There are two Items in this extract. Note that the second has a Datatype attribute of “email”. We shall return to this in the next section. For now, we’re interested in the first Item. What this tells us, in English, is that this field contains the carrier company name. What it tells the system, using XPath syntax, is that this piece of data is the content of the TextLine subelement of the Name subelement of the Party subelement of a PartyContactsGroup element, where the PartyQualifier attribute of the Party element has the value “Carrier”. And indeed, when we look at the XML file that was read by the system in order to generate this form and present it to the user, we find the following structure:
     
    <PartyContactsGroup>
    <Party PartyQualifier="Carrier">
    <Name>
    <TextLine>The Universal Transport Co.</TextLine>
    </Name>
    </Party>
    </PartyContactsGroup>
    
    Because the words “The Universal Transport Co.” appear here in a context that matches the XPath statement above, these words are displayed in the user interface as the content of the relevant read-only field. This field is identified for the user by a Label drawn from the Description with the appropriate variant and user language identified by the LabelRef attribute in the interface logic specification.
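
     The resolution of a Context against the interchange data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Message wrapper element is hypothetical (the extract above does not show the root of the interchange document), the function name resolve is invented for the example, and the standard library's ElementTree supports only a subset of XPath, which happens to cover this expression.

```python
import xml.etree.ElementTree as ET

CONTEXTS = """
<Dataset Type="Context">
  <Item Duid="Context2">
    <Content>
      //PartyContactsGroup/Party[@PartyQualifier='Carrier']/Name/TextLine
    </Content>
  </Item>
</Dataset>
"""

# The Message root is a hypothetical wrapper so the sample is a complete document.
INTERCHANGE = """
<Message>
  <PartyContactsGroup>
    <Party PartyQualifier="Carrier">
      <Name><TextLine>The Universal Transport Co.</TextLine></Name>
    </Party>
  </PartyContactsGroup>
</Message>
"""

def resolve(contexts, message, context_id):
    """Look up a Context by Duid and evaluate its path against the message."""
    for item in contexts.iter("Item"):
        if item.get("Duid") == context_id:
            path = item.findtext("Content").strip()
            # ElementTree accepts only relative paths on an element, so prefix ".".
            node = message.find("." + path)
            return None if node is None else node.text
    return None

contexts = ET.fromstring(CONTEXTS)
message = ET.fromstring(INTERCHANGE)
print(resolve(contexts, message, "Context2"))  # The Universal Transport Co.
```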
     It should be becoming clear how we can build up quite sophisticated and flexible application user interfaces out of this kind of simple structure, in an entirely XML-driven manner. I’d like now to take a brief look at data validation issues, then move on to the application logic aspect, before concluding.
     

    Data validation

    When the user presses the Submit button on the form, it might be necessary to check the validity of the data they have entered, before sending the data on to the next stage of the application (which may be local or remote). This can be done very effectively by associating a datatype with each data context, and defining rules for the validation of each datatype.
     Let’s take the example of the carrier’s email address, which is identified in the interface logic specification above as belonging to Context3. We noted above that this Item has a Datatype of “email”. The application uses a datatype.xml file that contains a Dataset of Datatype specifications. Here is an extract from that file:
     
    <Dataset Type="Datatype">
    ...
    <Item Duid="email">
    <Description>Email address</Description>  
    <Dataset Type="validation">
    ...
    <Item Processor="xslt">
    <Dataset Type="test">
    ...
    <Item Suid="test1">
    <Description>An email address must contain an @ sign</Description>
    <Test>contains(Content, '@')</Test>
    </Item>
    ...
    </Dataset>
    </Item>
    ...
    </Dataset>
    </Item>
    ...
    </Dataset>
    
    The above structure specifies that the email Datatype is associated with a set of “validation” mechanisms. One of these validation mechanisms uses an “xslt” processor to carry out the validation. The specification for this validation process contains a set of tests that the XSLT processor can use to validate the content of any field that is associated with the email Datatype. Among these, the test called “test1” checks the data against the XPath expression “contains(Content, '@')”. We are also provided with an error message which will be displayed if this test fails.
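
     To make the flow concrete, here is a small Python sketch of datatype-driven validation. It does not run an XSLT processor: as a loudly stated simplification, it recognises only tests of the single form contains(Content, '...') and emulates them with a substring check. The function name validate and the sample addresses are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

DATATYPES = """
<Dataset Type="Datatype">
  <Item Duid="email">
    <Description>Email address</Description>
    <Dataset Type="validation">
      <Item Processor="xslt">
        <Dataset Type="test">
          <Item Suid="test1">
            <Description>An email address must contain an @ sign</Description>
            <Test>contains(Content, '@')</Test>
          </Item>
        </Dataset>
      </Item>
    </Dataset>
  </Item>
</Dataset>
"""

# Stand-in for an XSLT processor: understands only contains(Content, '...').
CONTAINS = re.compile(r"contains\(\s*Content\s*,\s*'([^']*)'\s*\)")

def validate(datatypes, datatype_id, content):
    """Return the error message of every recognised test that the content fails."""
    errors = []
    for datatype in datatypes.findall("Item"):
        if datatype.get("Duid") != datatype_id:
            continue
        for test in datatype.iter("Item"):
            expr = test.findtext("Test")
            if expr is None:
                continue  # not a test Item
            match = CONTAINS.fullmatch(expr.strip())
            if match and match.group(1) not in content:
                errors.append(test.findtext("Description"))
    return errors

datatypes = ET.fromstring(DATATYPES)
print(validate(datatypes, "email", "orders@unitrans.co.uk"))
print(validate(datatypes, "email", "orders.unitrans.co.uk"))
```

     Because the error message travels with the test, a new rule needs only a new Item in the Dataset, with no change to the code that applies it.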
    When the user presses the Submit button, a series of XSLT transforms is triggered, which draw on the various data structures we have looked at, and which produce a new HTML display in the browser, which looks like this:
     

    Driving the application logic

    Now it’s time to take a look at the application logic file that drives this data validation process. Here is the relevant extract:
     
    <Item>
    <Interface>
    <ActiveTextSpan Type="Button">
    <Value>Submit</Value>
    <Trigger>
    <Event>OnClick</Event>
    <Action>
    <Window name="customer"/>
    <File uri="currentstate/DataValidation.htm"/>
    <Parameter name="Source" type="ref">context.xml</Parameter>
    <Parameter name="Stylesheet">
    <Action>
    <Parameter name="Source" type="ref">datatype.xml</Parameter>
    <Parameter name="Stylesheet" type="ref">validate.xsl</Parameter>
    </Action>
    </Parameter>
    </Action>
    </Trigger>
    </ActiveTextSpan>
    </Interface>
    </Item>
    
    The application logic is specified in the Action element in the above snippet. This is embedded in a piece of interface logic which states that there is an Interface consisting of a “Button” labelled with the word “Submit”. The “OnClick” event on this button triggers an Action on the part of the application. This Action displays in the Window named “customer”, and saves as a File in a specified location, the result of performing two nested transforms. The main transform consists of running an XSLT stylesheet against the context.xml file (some of whose content we have seen above). The stylesheet used to drive this transform is itself generated by an Action, namely to run the validate.xsl stylesheet against the datatype.xml file.
     Notice that through this technique it is possible to dynamically generate stylesheets on the fly, to carry out complex sequences of transformations, to send the results to windows and to files, and to do all this in response to user actions on interface objects that were themselves generated by previous such transforms and events. Altogether a powerful set of capabilities driven by combinations of quite simple underlying data structures!
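
     The nesting of Actions described above amounts to a depth-first evaluation, which the following Python sketch tries to capture. It is an assumption-laden illustration, not the project's Action Language processor: the function names run and record are invented, the transform is a stub that records its arguments instead of invoking an XSLT processor, and the Window and File elements are ignored.

```python
import xml.etree.ElementTree as ET

ACTION = """
<Action>
  <Window name="customer"/>
  <File uri="currentstate/DataValidation.htm"/>
  <Parameter name="Source" type="ref">context.xml</Parameter>
  <Parameter name="Stylesheet">
    <Action>
      <Parameter name="Source" type="ref">datatype.xml</Parameter>
      <Parameter name="Stylesheet" type="ref">validate.xsl</Parameter>
    </Action>
  </Parameter>
</Action>
"""

def run(action, transform):
    """Evaluate an Action, resolving any nested Action before its parent."""
    args = {}
    for param in action.findall("Parameter"):
        nested = param.find("Action")
        if nested is not None:
            # The parameter's value is the result of the inner transform.
            args[param.get("name")] = run(nested, transform)
        else:
            args[param.get("name")] = param.text.strip()
    return transform(args["Source"], args["Stylesheet"])

# Hypothetical stub: records each call in order and returns a descriptive string.
calls = []

def record(source, stylesheet):
    calls.append((source, stylesheet))
    return f"result-of({stylesheet} on {source})"

result = run(ET.fromstring(ACTION), record)
for source, stylesheet in calls:
    print(source, "<-", stylesheet)
```

     The inner transform (validate.xsl against datatype.xml) is recorded first, and its result is then passed as the stylesheet for the transform of context.xml.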
     

    Conclusion

    We have seen through a few small examples how XML and its related specifications (particularly XSLT and XPath) can be used to drive a fully functional application. The number of different data structures is quite small, and each one is quite simple.

    When object-oriented programming came into being, it took some time before programmers and application designers had fully mastered the techniques and come to grips with the implications of the changing paradigm. The same will be true for XML-driven application design and development. But there can be no doubt that we are on the brink of an exciting period when new ideas will be put to the test and out of them will emerge powerful, robust, standards-driven application development paradigms. I hope the work shown here provides a useful contribution to that process.
     Acknowledgements
     All the members of the European XML/EDI Consortium. (See http://www.tieke.fi/isis-xmledi )
