
Print
 Printing 
legacy
 

Traditional Electronic Printing On The Internet

 Dallas 
McCalpin, William J. "Bill"
 Texas 
The Xenos Group, Inc.
 
William J. "Bill"  McCalpin
Senior Architect,  The Xenos Group, Inc. 
 3010 LBJ Freeway, Suite 301
Dallas  (Texas)  75234 
Email: billm@xenosgroup.com

Biographical notice

Mr. McCalpin is Senior Architect for The Xenos Group, an electronic document systems integrator based in Toronto. As Senior Architect, he is responsible for designing system solutions for Xenos' customers in the electronic print and imaging industries. Mr. McCalpin has 17 years' experience in the fields of electronic printing and imaging. Prior to joining The Xenos Group, he held a number of technical and management positions with MOD_2000, Inc., Image Sciences, Inc., Computer Language Research, and his own consulting firm.

Mr. McCalpin has an M.A. in Politics and Literature from the University of Dallas, and a B.A. in Politics cum laude from the University of Dallas. He was awarded the LIT, designation as a `Laureate of Information Technologies' by the Association for Information and Image Management in 1998. He was awarded the MIT, designation as a `Master of Information Technologies' by the Association for Information and Image Management in 1997. He was awarded the CDIA, certification as a `Certified Document Imaging Architect' by the Computing Technology Industry Association in 1996. He was awarded the EDPP, designation as an 'Electronic Document Printing Professional' by Xplor International in 1992. He is the only person in the world to hold all four designations.

Mr. McCalpin writes and speaks frequently on subjects in the electronic printing and imaging industries. He has spoken more than forty times at Xplor, AIIM, DocuGroup, and Guide meetings, in sessions such as `A Funny Thing Happened On The Way To The Electronic Forum', `Losing Information Through Imaging', `The Family Tree Of Printer Data Streams', and `Bytes Over Bits, The Superiority Of Text Over Raster In Archiving'. He has published a number of articles on printing and imaging, including: "AFP And ImagePlus' MO:DCA - Not Quite Equals" - Enterprise Systems Journal, "Losing Information Through Imaging" - Business Documents, and "Why Every Electronic Print Architecture Is Wrong" - Xploration.

Mr. McCalpin is a member of both Xplor and AIIM. He serves on the AIIM Accreditation Committee. He is a former officer in Xplor's Southern Region. He is currently an Associate Editor of Xploration and has been on the editorial review board of Enterprise Systems Journal.

 

Overview

 The electronic printing industry is a huge market. Xplor International estimates the industry to be worth 110 billion dollars a year worldwide. This market has been developing over the last 21 years, since Xerox introduced the first high-speed, cut-sheet duplex printer, the 9700. Of course, there were electronic printers even before the Xerox 9700, such as the IBM 3800, but the fact that the Xerox 9700 was the first high-speed, cut-sheet, duplex printer made it the ideal printer for large-volume business applications. With additions from IBM, Siemens (now Océ), Delphax, Hewlett Packard, QMS, and others, the electronic printing market exploded in the late 1980s. Companies found it economical to produce high-volume, high-quality documents, and, coupled with changes in the U.S. Postal Service, found ways to deliver information to the end user more quickly and with better impact than ever.
 Of course, that information is on paper.
 We now have the Internet, an electronic, not paper-based method of delivering information to end users. The original Internet delivered text-based data. The Internet that we have seen exploding over this decade delivers HTML-based data. While the text-based data could be rich in information, it was poor in presentation. While the HTML-based data was a big improvement in presentation, it actually contained less information in some ways.
 Thus, the arrival of XML to the Internet, with its ability to simultaneously preserve the author's content as well as be adequately presented via XSL, means that for the first time, the Internet has an architecture capable of efficiently transmitting information.
 So what does this mean to those companies who have spent many years and millions of dollars perfecting their electronic print applications? The success of the Internet presents a different level of problem than simple improvements in printing technology. The introduction of duplex printing, multiple-up/continuous feed printing, and highlight color represented only an incremental change in how companies handled their large print applications. But the change to electronic delivery of the same documents that used to be on paper presents the staggering problem of how to convert documents which have strictly prescribed presentations to an SGML type of architecture.
 

Presentation Data Streams and Formatting Issues

 The term `presentation data stream' encompasses all of the data stream architectures used in electronic printing or presentation. That is, a presentation data stream is a data stream whose purpose is to present information accurately. By `accurately', what we really mean is `in the manner in which the author intended'.
 Common presentation data streams are:
 
  • Xerox Metacode
  • IBM AFP and MO:DCA
  • Hewlett Packard PCL
  • Xerox XES (UDK)
  • Adobe PostScript and PDF

     In each of these data streams, there exists the mechanism to describe the presentation attributes for the information we want to present. For example, since most of these data streams were defined in the paper era, the data streams would contain the following attributes for a piece of text:
     
  • X coordinate
  • Y coordinate
  • Font (family, point size, style, etc.)
  • Orientation

     Note that none of these attributes has anything to do with describing the use of the text-based information. In fact, it's clear that the text-based information can be described in many ways. For example, the word "TEXT" might be described in a presentation data stream as:
     
  • X, Y, Font, Orientation, "TEXT"
  • X, Y, Font, Orientation, "TE", new X, new Y, "XT"
  • X, Y, Font, Orientation, "T", new X, new Y, "E", new X, new Y, "X", new X, new Y, "T"

     We can see that within the presentation data stream, there is no requirement that text continue to exist as discrete units at all. After all, the purpose of the data stream is to display the information accurately, and not to preserve some order to the data which does not help the presentation process.
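     The three descriptions of "TEXT" above can be compared in a small sketch. It is only an illustration: the `TextOp` record, the fixed character width, and the coordinates are assumptions made for the example, not part of any real printer data stream.

```python
from dataclasses import dataclass


@dataclass
class TextOp:
    """One hypothetical 'place text at X, Y' record in a print stream."""
    x: float
    y: float
    text: str


CHAR_WIDTH = 10.0  # assumed fixed advance width per character

# The same word described three ways, as in the list above.
whole = [TextOp(100, 200, "TEXT")]
halves = [TextOp(100, 200, "TE"), TextOp(120, 200, "XT")]
letters = [TextOp(100 + i * CHAR_WIDTH, 200, c) for i, c in enumerate("TEXT")]


def rendered_cells(ops):
    """Return the set of (x, y, character) cells the records would paint."""
    cells = set()
    for op in ops:
        for i, ch in enumerate(op.text):
            cells.add((op.x + i * CHAR_WIDTH, op.y, ch))
    return cells


# Record order does not matter to the printed result either.
backwards = list(reversed(letters))

same_output = (
    rendered_cells(whole)
    == rendered_cells(halves)
    == rendered_cells(letters)
    == rendered_cells(backwards)
)
record_counts = [len(whole), len(halves), len(letters)]
```

     All four streams paint the same characters in the same places, yet they hold one, two, or four records, and only the first still contains the word "TEXT" as a unit.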
     We might suspect that most presentation data stream generators do happen to generate the print data in the order in which the page is built: top to bottom, word by word. This is generally true. However, powerful composition engines such as Adobe's PageMaker are able to provide kerning for each character, so in many data streams the text must be presented letter by letter. And IBM's AFP print driver for Windows happens to build the page exactly backwards: it lists every single character on the page separately, starting with the last character in the last word on the page and working its way up.
     On the other hand, SGML languages do not normally concern themselves with the details of the presentation. In HTML, for example, the normal way to specify a bold font is not to name the font in such a way that the display device or printer cannot mistake it, but to just use the <B> tag. This tells the HTML browser to use whatever font it thinks is a bold font, if the browser even has one. As we can see, it becomes difficult in this context for the author to tightly control the presentation of the data.
     Given that the preservation of the "look and feel" of a document in a presentation data stream is problematic when ported to an SGML instance such as HTML, it is no surprise that people choose to port them to Adobe's PDF instead. Since PDF is a presentation data stream, the look and feel of the document can be preserved, and Adobe has wisely made its Reader freely available and "easily" accessible by many Internet browsers.
     Does the use of the Acrobat Reader plug-in make everyone happy? No, at present, we hear a lot of concern by companies that they do not want to require their end users to acquire a plug-in, and they are concerned about the potential need to reinstall the plug-in after browser maintenance. Although we expect to see these concerns diminish over time, we do not expect to see them disappear until the PDF Reader becomes a seamless part of major browsers. In fact, it was rumored not long ago that Netscape intended to distribute the Acrobat Reader as an integral part of its browser, but, apparently, the plans to do this have been put on hold.
     Of course, there's another solution to the problem of presenting information on the Internet that is currently in electronic print documents: don't bother trying to preserve the look and feel of the original documents, just the data. In other words, represent the data in legacy documents in a way which suits the new medium.
     One of the difficulties of the modern age is that all of our paper documents tend to be one size, but our electronic monitors tend to be of another. Presentation data streams like PDF are very much oriented to traditional paper sizes. However, the resolution of today's monitors is sufficiently poor that we cannot see the text on a US letter sized document which is displayed as a whole on the screen. This means that PDF documents are normally displayed by halves: first you see the top half, and then you must scroll down to the bottom half.
     If, therefore, we intend to use a language like HTML to present our information, we would probably choose to place our information onto "pages" which suit our browser's natural viewing area. For example, Netscape 4.01 on a VGA display can show about 17 lines of text in a 12 point font, as compared to approximately 60 lines of text on a piece of 8 1/2 by 11 inch paper. Thus, if we believed that our average recipient of our information had this browsing environment, we would be inclined to reformat our pages as if our "paper" were 8 1/2 by 3 1/2 inches. (Note: this example is for portrait documents.)
     But, while reformatting our documents to conform to this new Internet environment may seem like the obvious solution, there are a number of considerations.
     
  • Company branding
  • Customer Support
  • Customer Confusion
  • Duplicate Coding

    Company branding

     Everyone knows that the insurance industry in the United States is heavily regulated. The forms which comprise an insurance policy typically need to be filed in each state in which that insurance product is sold. And it is common enough that the form must have individual state variations, mandated by states' Boards of Insurance. It is not surprising to have at least twenty different copies of a form to be able to sell it in all fifty states.
     But it may come as a surprise to many people that insurance companies are often very concerned about formatting issues that even the state regulators are not concerned with. We know of one prominent life insurance company that mandates that all of its policy pages (the filed forms as well as the supporting pages, such as the welcome letter) be composed in the Optima font.
     Why? Because insurance companies realize that for many of their customers, the only contact that the customers have with the insurance company is through the packet of paper called the policy. For this reason, the insurance company is highly motivated to ensure that the policy is as professional and as representative of the image that the company wishes to present as possible. In the case of the company above, it clearly wants its customer base to think of it every time the customers see text in an Optima font.
     What does this mean for the Internet world? The same company which sets strict guidelines on the appearance of its paper documents will likely have the same feeling about its web documents. Or, to put it bluntly, this company will not want its web documents to look like every competitor's. Clearly, this will be more difficult to achieve in an SGML-based environment.
     The vendor who is attempting to sell such a company on the wisdom of XML documents had better examine the company's existing documents as well as the company's culture before blithely assuming that issues like formatting are relatively unimportant.
     

    Customer Support

     Many companies will have the need to distribute the same information on both paper and by the Internet. It is common for companies which send out paper documents to end users to want to save costs by putting those same documents into an internal, browser-based network application for their customer support staff. Often (even usually), what the customer support representative sees is not the exact same document that the customer has received. This leads to the question that every customer service representative hates.
     "What's the number at the top of page 5?"
     Obviously, unless the customer service representative has a similar copy of the document that the customer has received, the CSR has no idea what is even on page 5, much less what the number is at the top of it. As you can see, the reformatting of the paper documents to accommodate the different size of the browser window can now be considered a disadvantage.
     But XML does have a way to deal with the problem which was not available in HTML. After all, in the XML data stream, each item is tagged according to its use. Thus, at a minimum, the XML browser should enable the customer to place the cursor over the unknown number and see a small box appear giving the XML tag that is associated with that number. The XML tag name can then be given either to the CSR who will explain the number's meaning to the customer, or, even better, the browser will pass the tag name to an optional help facility which will present a full explanation of the purpose of the data item.
     This facility, of course, is not available in HTML, since HTML tags are restricted to a small set of grammatical tags, which do not contain the author's content. Thus, while data in an HTML data stream might be coded with the <p> paragraph tag, the same data in XML would have the far more useful <part_number> tag associated with it.
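     The tag-based help facility described above can be sketched in a few lines. This is a minimal illustration, assuming a made-up invoice document and a hypothetical `HELP` table keyed by tag name; no particular XML browser works this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document sent to the customer.
STATEMENT = """<invoice>
  <customer_name>A. Customer</customer_name>
  <part_number>12-XYT34G-0</part_number>
  <amount_due>41.50</amount_due>
</invoice>"""

# Hypothetical help text, keyed by the tag that marks each data item.
HELP = {
    "part_number": "The catalog number of the item you ordered.",
    "amount_due": "The total you owe for this invoice.",
}


def explain(document, value):
    """Find the element whose text equals `value`; return (tag, help text)."""
    root = ET.fromstring(document)
    for element in root.iter():
        if (element.text or "").strip() == value:
            return element.tag, HELP.get(element.tag, "No help available.")
    return None


# The customer points at the unknown number; the tag name identifies it.
result = explain(STATEMENT, "12-XYT34G-0")
```

     Because the lookup is driven by the tag name rather than by page and position, it gives the same answer however the document was reformatted for the browser.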
     

    Customer Confusion

     The advent of widespread business use of the Internet provides the same opportunity for user confusion as did the advent of the early laser printers. When creating web documents, remember this mantra: "Just because you can doesn't mean you should."
     When some users of laser printers realized that you could put many different fonts on a page, they felt compelled to do so, even if the result looked somewhat like a ransom note. Users of full color printers have exactly the same temptation: to use the full capability of the printer on every document.
     So what's the problem? Simple. We have trained our customers to recognize that important documents frequently are the simplest in formatting. How many different fonts do you see used in a contract? And in how many colors is that contract printed?
     Let's put it another way: when you see a document which is in many fonts, many point sizes, and many colors, do you assume that it's marketing literature and therefore not something important? How many times have you opened an envelope, had the barest of glimpses of the paper inside, and thrown it all away because you assumed it was just sales literature based on its appearance? Given that people will be seeing more and more web documents and spending less and less time on each one, would you not imagine that we will have the same problem in cyberspace? If we see it recommended that important web documents never be so large that there needs to be a scroll bar on the right hand side because a large percentage of web users never scroll down, then how much more important is it that serious documents not have marquees at the bottom, flashing icons in the middle, and three dimensional spinning objects at the top?
     Again, "just because you can doesn't mean you should."
     

    Duplicate Coding

     What is the most overlooked disadvantage to reformatting legacy documents for the Internet? You have just doubled the amount of effort it takes to produce documents.
     If you think about it, for many applications, you will have to continue to support the print programs because paper will still be a requirement for some or all customers. So if you also need an SGML version of the same document, and if the presentation of that SGML document is at all important to you, you will have to hire additional people to support the "new" print application.
     Whatever the disadvantages are of a PDF-like solution, at least you don't have to spend any time or money reformatting your documents.
     

    Legacy Print Data As A Data Base

     In XML applications written from scratch, we can imagine that - most of the time - the data for the applications comes from a data base or transaction file. In the case of the data base, SQL commands are built to extract the data which is going to be given XML tags. In the case of the transaction file, data to be tagged is identified by record, position, and length. In both cases, it is usually easy to find the data that we need to tag.
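     For the transaction-file case, tagging by position and length might look like the following sketch. The record layout, tag names, and sample record are all assumptions invented for the example.

```python
from xml.sax.saxutils import escape

# Hypothetical fixed layout of one transaction record:
# (tag name, start offset, length)
LAYOUT = [
    ("ACCOUNT", 0, 8),
    ("NAME", 8, 20),
    ("BALANCE", 28, 10),
]


def record_to_xml(record):
    """Cut each field out of the record by offset and length, then tag it."""
    parts = ["<STATEMENT>"]
    for tag, start, length in LAYOUT:
        value = record[start:start + length].strip()
        parts.append(f"<{tag}>{escape(value)}</{tag}>")
    parts.append("</STATEMENT>")
    return "".join(parts)


# Build a sample 38-character record that follows the layout above.
record = "00123456" + "Smith & Sons".ljust(20) + "1234.56".rjust(10)
xml_text = record_to_xml(record)
```

     The point of the sketch is how little is needed: when every field sits at a known offset, the whole extraction step is a table of three numbers per tag.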
     However, we are looking at using a legacy print stream as the source for the data to be tagged as part of an XML document. Extracting data out of a legacy print stream is not nearly so easy as it is out of a data base or transaction file. There are a number of issues which have to be considered.
     
  • The text is no longer intact
  • The text is not on a given record or offset
  • The text keeps moving around

    The text is no longer intact

     As we noted above, the purpose of presentation data streams is to present. And so long as the information in the document prints or displays accurately on the appropriate device, there is no requirement whatsoever that the text in the data stream be preserved in its original paragraphs, sentences, or words.
     Therefore, it is difficult to tag "12-XYT34G-0" as the data which belongs to <PART-NUMBER> if "12-", "XYT34G", and "-0" happen to be in three different records in the legacy print data stream. In order to correctly tag the data, the legacy print data stream must first be parsed, and then intelligence applied to the resulting data list so that print items which appear to be together can be joined. As you can imagine, this is an imperfect process which can require careful tuning.
     The important thing to note here is that unless you have personally examined the legacy print data stream - which for some data streams is quite difficult - you cannot know if the text items are together or not. Just because the data prints together is no guarantee that the data is together.
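     The joining step described in this section can be sketched as follows. This is one simple heuristic under stated assumptions, not a description of any product: it assumes the parser has already produced fragments with a left edge, a baseline, and a width (the `Fragment` record and the tolerance are hypothetical), and it joins fragments on the same baseline whose edges touch.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float      # left edge of the fragment
    y: float      # baseline
    width: float  # assumed to be known from the font metrics
    text: str


def join_fragments(fragments, gap_tol=1.0):
    """Join fragments on the same (rounded) baseline whose edges touch."""
    joined = []
    for frag in sorted(fragments, key=lambda f: (round(f.y), f.x)):
        if joined:
            last = joined[-1]
            same_line = round(last.y) == round(frag.y)
            touching = abs(frag.x - (last.x + last.width)) <= gap_tol
            if same_line and touching:
                # Extend the previous fragment to cover this one as well.
                joined[-1] = Fragment(
                    last.x, last.y,
                    frag.x + frag.width - last.x,
                    last.text + frag.text,
                )
                continue
        joined.append(frag)
    return joined


# The part number arrives as three records, out of order, next to a label.
page = [
    Fragment(190, 200, 20, "-0"),
    Fragment(300, 200, 30, "Qty"),
    Fragment(100, 200, 30, "12-"),
    Fragment(130, 200, 60, "XYT34G"),
]
texts = [f.text for f in join_fragments(page)]
```

     The tolerance is where the tuning mentioned above comes in: set it too low and kerned text stays in pieces, set it too high and neighboring columns run together.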
     

    The text is not on a given record or offset

     In line data - that is, data which prints sequentially down the page in complete lines - the location of information is often predictable. Just as in the transaction record the desired information might be at offset X for a length of Y, so with line data it is often possible to see that the desired information is on record X in each report at offset Y for a length of Z.
     However, all bets are off with the more advanced presentation data streams. It is not possible to predict where data might appear on a page of presentation data. You cannot know the record number, the offset, or even the length (see the previous point). In short, the mechanisms which are often useful in extracting information from line data are by definition unreliable for all page printer data streams.
     Even if your page printer data stream generator were creating data on some pages in a predictable fashion, you cannot depend on the generator always doing so. Remember, the only purpose of a presentation data stream generator is to create a data stream which displays or prints accurately. Few if any generators expect to have their data post-processed, so as long as the data prints correctly, the generator can and will create data in a variety of convoluted ways.
     Again, the only solution to this difficulty is to parse the inbound data stream and build a logical representation of the page, identifying which data items are together and where they are.
     

    The text keeps moving around

     Unfortunately, even if you are able to solve the previous two issues, you will be confronted by the next problem. This problem is that print applications often have similar but not identical versions of the same forms in the same data stream at the same time. This means that while certain data can be found in the imaginary box described by X1, Y1, X2, Y2 on form A, the same data will be found on the next page inside a box lower or higher on the page, because form B is slightly different from form A. Or, to look at the problem another way, how can you find the total at the end of a list if you don't know in advance how many items are in the list?
     In this case, the extraction process to find the information in the legacy print data for the XML generator has to have sophisticated conditional processing. For example, the code needed to find the total number at the end of a list might be: IF FORM=XYZ AND PAGE_OF_REPORT>1 AND X_LOCATION(DATA_ITEM)=X_LOCATION("Total Number") THEN XML_TAG=("<TOTAL_LIST_VALUE>",DATA_ITEM). Of course, this presumes that the command language has a way to identify what FORM=, PAGE_OF_REPORT=, and X_LOCATION all mean.
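     The rule above can be written out as a small sketch. Everything here is an assumption made for illustration: the `Item` record, the idea that Y grows down the page, and the reading of the X_LOCATION test as "the nearest item below the label with the same X position".

```python
from dataclasses import dataclass


@dataclass
class Item:
    x: float
    y: float   # assumed to increase down the page
    text: str


def tag_total(form, page_of_report, items, label="Total Number"):
    """Apply the rule only to form XYZ, after page 1, anchored on the label."""
    if form != "XYZ" or page_of_report <= 1:
        return None
    labels = [i for i in items if i.text == label]
    if not labels:
        return None
    anchor = labels[0]
    # Candidates share the label's X position and sit below it.
    below = [i for i in items if i.x == anchor.x and i.y > anchor.y]
    if not below:
        return None
    value = min(below, key=lambda i: i.y)
    return f"<TOTAL_LIST_VALUE>{value.text}</TOTAL_LIST_VALUE>"


# Two pages with lists of different lengths: the total moves down the page.
short_page = [
    Item(50, 100, "Widget"), Item(50, 120, "Gadget"),
    Item(400, 140, "Total Number"), Item(400, 160, "2"),
]
long_page = [
    Item(50, 100, "Widget"), Item(50, 120, "Gadget"),
    Item(50, 140, "Sprocket"), Item(50, 160, "Flange"),
    Item(400, 180, "Total Number"), Item(400, 200, "4"),
]
short_tag = tag_total("XYZ", 2, short_page)
long_tag = tag_total("XYZ", 2, long_page)
```

     Anchoring on the label rather than on a fixed box is what lets the same rule find the total whether the list has two items or four.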
     

    Push Versus Pull

     The Internet supports both "push" and "pull" paradigms for distributing information. In the push paradigm, data is sent to the user, such as by e-mail. In the pull paradigm, the data is warehoused in the company's archive, and the user must come to the archive to retrieve the information. There is, of course, an "invited pull" paradigm, in which an invitation is pushed to the user with a link through which the user can easily pull the information from the archive.
     The most active part of legacy printing on the Internet today is in the field of bill presentment and payment. In this scenario, the bill is usually kept in the biller's archive, and the user is invited to see the bill (come to the archive) via a pushed e-mail. This is done for a variety of reasons. One of the major reasons is so that the enterprise which is doing the billing will be sure that you, the end user, have actually visited the site.
     However, an important consideration in this choice is that while the data is in HTML (or PDF, for that matter), there is no particular added value in pushing the document to the end user. The end user can't do anything special with the data anyway. This is not true if the data is in XML format, however.
     Imagine that the enterprise is generating financial statements in XML. If these statements are e-mailed to the end user, not only does this solve the problem of the tremendous storage considerations that archiving these statements poses, but the end user will have XML-aware software which will format the data in ways that the enterprise will not have thought of or prepared for.
     Remember the point made above in which we pointed out that coding HTML statements could require a second programming team in addition to the team needed to support legacy print documents? If XML is sent to the end user, then the burden of formatting the final form of the information is potentially shifted from the enterprise generating the data to the recipient of the data. If a bank sends you a statement every month, and you would like a summary of all twelve statements in the year, what would be easier for the bank: to have its programming staff create yet another set of HTML pages, or just send you the XML data each month and let the equivalent of a product like Intuit's Quicken create a new report from all the data?
     From the bank's point of view, the latter is not only easier, but more germane to its core business. After all, the bank's primary business may be handling financial data, but if another vendor wants to do the print formatting, then so much the better.
     What this means is that while the early Internet/e-business implementations have tended to follow the pull model, the arrival of XML may favor the push model for those documents which don't need to be archived as is at the home site.
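     The bank statement scenario can be sketched from the recipient's side. The statement format below (a `statement` element with `deposits` and `withdrawals` children) is purely hypothetical, as are the amounts; the point is only that the recipient's software, not the bank's, builds the yearly report.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def monthly_statement(month, deposits, withdrawals):
    """Stand-in for the XML statement the bank pushes each month."""
    return (
        f'<statement month="{month}">'
        f"<deposits>{deposits}</deposits>"
        f"<withdrawals>{withdrawals}</withdrawals>"
        "</statement>"
    )


# Twelve statements as they might have arrived by e-mail over the year.
received = [monthly_statement(m, "1000.00", "750.25") for m in range(1, 13)]


def yearly_summary(statements):
    """Total the tagged amounts across all statements received."""
    totals = {"deposits": Decimal("0"), "withdrawals": Decimal("0")}
    for text in statements:
        root = ET.fromstring(text)
        for tag in totals:
            totals[tag] += Decimal(root.findtext(tag))
    return totals


summary = yearly_summary(received)
```

     The bank wrote no report code for this summary; it only had to tag the amounts consistently each month.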
     

    Comments On XSL

     

    Introduction to XSL

     Following the work on XML, there is ongoing work on XSL, the eXtensible Style Language. This language is the result of the desire to give the author better control over the presentation of the XML document.
     DSSSL - Document Style Semantics and Specification Language - is an ISO standard, developed to be a stylesheet standard for SGML documents. However, no commercial application uses it.
     CSS - Cascading Style Sheets - is supported on the other hand by both Microsoft and Netscape as a mechanism for overriding the default formatting of HTML tags. XSL is expected to be a superset of CSS, so that automatic conversion from CSS to XSL should be possible.
     The first public working draft of XSL 1.0 was announced by the World Wide Web Consortium on August 18, 1998.
     "W3C will be developing both the XSL and CSS style sheet languages in parallel, as they are both useful for Web sites and they give Web designers an expanded set of tools to do their work. CSS is used to style HTML and XML documents on the Web. In addition to styling XML documents, XSL is also able to generate new XML documents from XML data. XSL and CSS will share the same underlying concepts and will use the same terminology as much as possible." (http://www.w3.org/Press/1998/XSL-WD, 10/06/98)
     

    What does XSL look like?

     XSL is written in XML. You can define tags in XSL which will be usable in an XML document. For example:
     
     This is an <emph>important</emph> point.

     <xsl:template match="emph">
       <fo:sequence font-weight="bold">
         <xsl:process-children/>
       </fo:sequence>
     </xsl:template>

     In the first line above, note that we are using a tag called <emph>. Because this tag must be a valid XML tag, there is a corresponding end </emph> tag to denote the end of the text string which will be affected by the tag. The definition of the <emph> tag follows the example.
     You will note that the net effect of the tag is to "bold" the text in the current font. Note that the tag does not actually specify the font to be used, and if the browser involved has no bold version of the current font to use, the text will probably be presented in the current font with no error.
     This tag is similar to the HTML tag <B>.
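     The effect of the template rule can be imitated outside XSL to make the comparison with <B> concrete. The sketch below is not XSL and does not use an XSL processor; it is an assumed stand-in that applies the same idea (match every <emph>, emit bold) with Python's standard XML library, writing HTML's <b> as the output.

```python
import xml.etree.ElementTree as ET


def apply_emph_rule(xml_text):
    """Rewrite every <emph> element as <b>, leaving its children untouched."""
    root = ET.fromstring(xml_text)
    for element in root.iter("emph"):
        element.tag = "b"
    return ET.tostring(root, encoding="unicode")


# The sample sentence is wrapped in a <p> so that it is a complete document.
html = apply_emph_rule("<p>This is an <emph>important</emph> point.</p>")
```

     As with the XSL rule, nothing here names a font: the output only says "bold", and the choice of typeface is still left to whatever displays it.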
     

    Can XSL be used to format legacy data?

     XSL is intended to give the author the ability to "better" control the presentation of the XML document, not "fully" control it. If you are familiar with the mainframe composition tool DCF, then you will understand the amount of control that the author has over the document. For example, the <emph> tag above is very similar to the definition of DCF's hp2 tag (highlight definition, level 2).
     First of all, we need to understand that there is a fundamental conflict between the desire of XML to organize a document based on the use of the data and the desire of legacy print streams to totally control the presentation of the data without regard to its organization. That is, whereas an XML document will normally see this paragraph as a single entity, a legacy print data stream will almost certainly see this paragraph as a series of independent lines of text, each at separate locations on the page.
     Think about an existing legacy document. If there are forms in the document, there are often situations in which one paragraph is formatted slightly differently than another. To duplicate this, we would have to code unique paragraph definitions for each unique presentation of the paragraph. For each fragment of text in the document that is uniquely formatted, a separate XML definition might have to be made. It depends on how important the fidelity of the XML document to the legacy print document is to you.
     Of course, if exact fidelity is not that important to you, you should take advantage of the opportunity to reformat the document. That is, you shouldn't simply accept that it's OK for the XML document to be similar to the original legacy document; you should use this opportunity to re-examine why the document is the way it is. Maybe there's a better way to present the information, especially in light of the new abilities (e.g., hypertext links) that the new technology affords you.
