
 XML 
 

XML and PDF in digital printing: irreconcilable differences?
 Kenneth Brooks, Jr.
 Vice President, Digital Content Division
  Barnes & Noble, Inc. 
 New York 
 USA 
Barnes & Noble, Inc.,  122 Fifth Avenue
New York  New York  10011 USA
Phone: 212-633-3402 Fax: 212-633-3470 email: kbrooks@bn.com web site: www.bn.com
 Biography
 Kenneth Brooks, Jr.: Ken Brooks is Vice President of Digital Content at Barnes & Noble, Inc. and President of EP Ventures, Inc. He joined Barnes & Noble in 1999, where he founded EP Ventures, a Philippines-based text conversion and composition company, as well as the 1873 Press, a POD and eBook publishing entity. Ken has held several senior management positions in publishing, including Vice President of Operations, Production, and Strategic Planning at Bantam Doubleday Dell and Vice President of Customer Operations at Simon and Schuster. Prior to his entry into publishing, Ken was a Senior Manager in Andersen Consulting’s Logistics Strategy Practice. He holds Bachelor’s and Master’s degrees in Industrial Engineering from the Georgia Institute of Technology.
 Abstract
 In the world of trade book publishing where PDF workflows are just beginning to be accepted, XML workflows are still largely unknown. This discussion will highlight one approach to merging the two types of workflow to create a highly successful digital printing and eBook distribution operation.

Motivation

 Early in 1999 Barnes & Noble identified the need to establish a conversion operation to support its eBook and print-on-demand (POD) plans. The company was already implementing plans around eBooks with an investment in NuvoMedia (now Gemstar) and planned efforts with Microsoft to support their MS Reader launch and with Glassbook to provide distribution of the Glassbook reader and titles on its site. B&N was also in the midst of planning a POD deal with IBM to install InfoPrint 4000 and InfoColor 70 equipment in its Memphis, TN distribution operation.
 B&N looked around for partners in establishing the operation, but couldn’t find companies that were able to achieve the necessary quality at a reasonable cost. Target quality was 99.998% character accuracy – better than that provided by traditional typesetting – along with commercial grade tagging and page make-up at quality levels acceptable to trade publishers.
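 To put the accuracy target in concrete terms, the following minimal Python sketch converts a character-accuracy percentage into an error budget. The function name and the 500,000-character book length are illustrative assumptions, not figures from the operation.

```python
def allowed_errors(char_count: int, accuracy_pct: float) -> float:
    """Return the number of character errors permitted at a given accuracy."""
    return char_count * (1 - accuracy_pct / 100)


# Hypothetical 500,000-character book at the 99.998% target.
budget = allowed_errors(500_000, 99.998)
print(round(budget))  # roughly 10 wrong characters in the whole book
```

 In other words, the target allows about two wrong characters per 100,000.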
 The operation had to focus on converting hardcopy backlist rather than electronic files, because so much content exists only in hardcopy. The trade book industry is only now moving to fully digital workflows, and for most trade titles published today the publisher still usually doesn’t have, in a manageable format, the file as it was finally printed.
 This approach is in direct contrast to the one taken by many technology companies that are pursuing automated tagging and reformatting of electronic files. With all of the companies focused in this area, it was anticipated that there would be opportunities to purchase or license reasonably inexpensive solutions in the short term. Once text and image conversion from hardcopy was addressed, the move into electronic file conversion could be pursued either directly or through alliances with software suppliers.
 After evaluating a number of alternatives and finding no one offering the required cost, quality and mix of services, B&N established an operation spanning New York, Mexico City and Manila.
Initial Effort
 

Our original conversion process

 In the current world of eBook and POD publishing and distribution, the ultimate deliverables are PDF and HTML variants. PDF is required to drive POD equipment and some eBook readers. The HTML variants are found in Rocket eBooks, SoftBook readers, Peanut Press Readers and, of course, the standard PC browser. As a result, the initial focus was to implement a process to merge the production of paged and adaptive formats from both hardcopy and electronic files. This was accomplished in a straight-through, hybrid Quark/HTML workflow that directly produces all of the required variants using manual processes.
 To take advantage of processing economics, an international process was implemented. Books start in NYC, where they are received from a publisher. Each book is pre-edited to identify the types of processing required and any challenging elements entailed. The book is then sent to Mexico City for scanning.
 In Mexico City the books are scanned using either a 300 dpi process or a 600 dpi process, depending on the ultimate formats required. If a publisher is requesting that the title go directly into POD, the 600 dpi process is used to give appropriate resolution through the print engines. The PDF in this case is simply a “package” of 600 dpi scanned page images. If eBook formats are desired, the 300 dpi process is used. In the eBook workflow the scans are then transmitted to the Manila conversion operation as TIFF images.
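 The scanning decision can be summarized in a small Python sketch. The class and function names are hypothetical, and the single boolean input is a simplifying assumption; how a title requested in both POD and eBook formats is handled is not described here.

```python
from dataclasses import dataclass


@dataclass
class ScanPlan:
    dpi: int        # scanning resolution
    output: str     # what the scanning step hands off
    next_step: str  # where the material goes next


def plan_scan(direct_to_pod: bool) -> ScanPlan:
    """Choose the scan process from the publisher's request."""
    if direct_to_pod:
        # Page images are packaged as a PDF for the print engines.
        return ScanPlan(600, "PDF package of page images", "POD print engines")
    # eBook workflow: lower resolution, TIFFs go on for zoning and OCR.
    return ScanPlan(300, "TIFF page images", "Manila conversion")


print(plan_scan(direct_to_pod=True).dpi)   # 600
print(plan_scan(direct_to_pod=False).dpi)  # 300
```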
 It’s in Manila that much of the interesting work takes place. The files are zoned, driving images to an image cleanup process and text into OCR. OCR text streams are then cleaned up using a heavy dose of AI supplemented with manual editing to produce high quality RTFs, with a small amount of styling applied.
 Once a clean RTF is achieved, the file splits into tagging and page production. In the tagging process the RTFs are converted into HTML and additional tagging is applied using conventional HTML editing tools. Several different files are actually created here, depending on the number of formats requested by the publisher. In addition to HTML we can produce OEB, Rocket eBook format, SoftBook format, and Microsoft Reader format.
 In page production the file is imported into Quark and standard page layout techniques are used to get a print- or eBook-ready PDF Normal file with typography at a level acceptable to New York trade publishers. This PDF is then filtered to the various formats required by different POD printers or PDF eBook distribution operations. This is essentially an imposition step.
 The files all then return to New York where they are proofed before final delivery to the publishing customer.
Evolution
 

The evolving process

 The eBook world is rapidly moving to a single XML standard to replace the insanity of the multiple conflicting HTML variants. So far it’s looking as though there will be two overall standards: OEB in most of the eBook world and PDF for paged representations. There are persistent rumors that Adobe will officially recognize the existence of XML and incorporate it into PDF; indeed, XML is creeping into Frame and a number of their other applications. Both this shift and the increasing capabilities in Manila are prompting a move from a traditional straight-through, manual publishing workflow that generates a number of output formats to a two-stage process that yields the same number of formats in an automated manner.
 The automated process is similar to the manual process through the point of the RTF. The big difference is in the post-RTF process. Instead of using HTML, tagging is done to a standard DTD and placed into an XML repository. Since XSL and other XML rendering mechanisms haven’t proven themselves capable of generating the quality of typography needed on the fly, there is really no alternative but to store two versions of the file: an XML version to generate the various non-paged outputs and a PDF version for paged outputs.
 Output formats, such as OEB or HTML 4.0, are then generated on demand using XSL.
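 As a rough illustration of generating an output format on demand from the stored XML, here is a minimal Python sketch. It uses Python's standard-library ElementTree as a stand-in for XSL, and the element names (book, title, chapter, para) and the mapping table are hypothetical, since the actual DTD and stylesheets are not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical source-to-HTML element mapping (not the operation's real DTD).
HTML_MAP = {"book": "body", "title": "h1", "chapter": "div", "para": "p"}


def to_html(src: ET.Element) -> ET.Element:
    """Recursively rebuild the source tree using HTML element names."""
    out = ET.Element(HTML_MAP.get(src.tag, "span"))  # unknown tags become <span>
    out.text = src.text
    out.tail = src.tail
    for child in src:
        out.append(to_html(child))
    return out


stored = "<book><title>Sample</title><chapter><para>First paragraph.</para></chapter></book>"
html = ET.tostring(to_html(ET.fromstring(stored)), encoding="unicode")
print(html)
# <body><h1>Sample</h1><div><p>First paragraph.</p></div></body>
```

 The point of the design is that the repository holds one tagged source, and each non-paged format is a separate mapping applied when the format is requested.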

Issues

 The issues in this process will be familiar to XML practitioners. They range from using XSL to create high-quality pages and managing a large number of shifting XSL stylesheets, to deciding where in the process tagging should be done, to a dearth of XML tools appropriate to the Manila production environment, where a good grounding in the fundamentals of XML and its uses cannot be presumed.
 

High quality pages

 The first challenge is the creation of high-quality pages. Publishing has long grappled with how to apply automation to high-quality composition to reduce or eliminate designer intervention. This is no different in an XML workflow. The inherent problem is that XML is a non-paged representation of a text, and if on-demand production of pages is desired, complex design problems must be resolved on the fly. Some of the more challenging problems include:
 
  • Cross-page implications of widow & orphan control
  • Column balancing, either within or across pages
  • Kerning, inter-character, and inter-word spacing
  • Juxtaposition of non-text elements with text
  • Preservation of the design of the original book in the reproduction

 Our operation achieves the maximum flexibility possible in this environment through the use of custom XSL code to transform the tagged XML in our repository to the final target formats. Overall, however, the jury’s still out on whether XSL can take files all the way from an XML markup to quality pages – the solution may very well end up being to take the best composition one can get from XSL output, hand adjust the output, and then archive the result.
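 To show why widow and orphan control has cross-page implications, here is a deliberately simple Python sketch of a greedy paginator. It is not the operation's composition logic: the function name, the representation of paragraphs as line counts, and the two-line minimum are all assumptions made for illustration. "Orphan" here means too few lines of a paragraph left at the bottom of a page, and "widow" means too few lines carried to the top of the next.

```python
def paginate(paragraphs: list[int], page_height: int,
             min_lines: int = 2) -> list[list[tuple[int, int]]]:
    """Greedy pagination of paragraphs given as line counts.

    Returns pages as lists of (paragraph_index, lines_placed_on_this_page).
    """
    pages, current, used = [], [], 0
    for idx, total in enumerate(paragraphs):
        left = total
        while left > 0:
            room = page_height - used
            if left <= room:
                take = left                  # the rest of the paragraph fits
            else:
                take = room                  # fill the page, then check the split
                if left - take < min_lines:  # too few lines would carry over (widow)
                    take = left - min_lines
                if take < min_lines:         # too few lines would stay behind (orphan)
                    take = 0
            if take <= 0:
                if used == 0:
                    take = min(left, room)   # degenerate case: give up on the rules
                else:
                    pages.append(current)    # end the page early, retry on a fresh one
                    current, used = [], 0
                    continue
            current.append((idx, take))
            used += take
            left -= take
            if left > 0 or used == page_height:
                pages.append(current)        # a split or a full page ends the page
                current, used = [], 0
    if current:
        pages.append(current)
    return pages


print(paginate([7, 4], page_height=10))
# [[(0, 7), (1, 2)], [(1, 2)]]
```

 In the usage line, the second paragraph could have placed three lines on the first page, but that would carry a single line over, so the first page ends one line short. A fix on one page changes the depth of its neighbor, which is the kind of knock-on effect a designer normally resolves by hand.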
     

XSL proliferation

 XSL proliferation is another issue. While it’s easy to see that an XSL style sheet is needed to produce every output format, it’s also necessary to have a more sophisticated style sheet to deal with styling within formats, usually publisher-specific, but sometimes even getting down to the book level. This requires that the style sheet perform two separate functions. First, it must transform the XML into a format such as HTML (the “transform” function). Second, it must make the output in each format look like the original publisher’s house style or, worse, match the book itself (the “styling” function).
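 One way to keep the two functions apart is to layer the styling as overrides, so that a publisher or a single book changes only what differs from the format default. The Python sketch below illustrates the idea rather than the operation's stylesheets; the publisher name, the book identifier, and the style properties are all hypothetical.

```python
from html import escape

# Hypothetical style layers: format default, then publisher, then book.
FORMAT_STYLES = {"html": {"font-family": "serif", "text-indent": "1em"}}
PUBLISHER_STYLES = {"Example House": {"font-family": "Garamond, serif"}}
BOOK_STYLES = {"book-0001": {"text-indent": "0"}}


def resolve_style(fmt: str, publisher: str, book_id: str) -> dict[str, str]:
    """Styling function: later layers override earlier ones."""
    style = dict(FORMAT_STYLES.get(fmt, {}))
    style.update(PUBLISHER_STYLES.get(publisher, {}))
    style.update(BOOK_STYLES.get(book_id, {}))
    return style


def render_paragraph(text: str, style: dict[str, str]) -> str:
    """Transform function: emit an HTML paragraph carrying the resolved style."""
    css = "; ".join(f"{name}: {value}" for name, value in sorted(style.items()))
    return f'<p style="{css}">{escape(text)}</p>'


style = resolve_style("html", "Example House", "book-0001")
print(render_paragraph("Hello", style))
# <p style="font-family: Garamond, serif; text-indent: 0">Hello</p>
```

 With this split, adding a publisher means adding one override table rather than a full copy of the transform for every output format.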
     

Iterative tagging

 In the process described there are several areas where XML tagging can be applied: in the original zoning, in the cleanup process on the RTF (in the form of RTF styles which are post-processed into tags), and post-cleanup in the more rarefied XML environment. The choices here represent tradeoffs between productivity and accuracy: the idea is to be able to touch the text (or image) once and get as much mileage as possible out of that touch. For example, it’s necessary to zone page images after scanning to drive the OCR and image cleanup workflows. To the extent that text elements can be identified in the images, later tagging effort can be saved by having the zones also correspond to those elements. This works so long as a great deal of unnatural effort isn’t induced in the process, and so long as it isn’t easy for the operator to apply tags that will later prove invalid under our DTD.
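 The RTF-style route can be sketched as a lookup from style names to element names, with anything outside an approved set flagged rather than tagged. This Python sketch is illustrative only: the style names, the element names, and the rule about what a chapter may contain are hypothetical stand-ins for the real DTD.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from RTF style names to DTD element names.
STYLE_TO_TAG = {"Chapter Title": "title", "Body Text": "para", "Block Quote": "blockquote"}
# Hypothetical content rule: which elements a parent may contain.
ALLOWED_CHILDREN = {"chapter": {"title", "para", "blockquote"}}


def tag_paragraphs(styled: list[tuple[str, str]], parent: str = "chapter"):
    """Convert (style, text) pairs to tagged strings; collect unmapped styles."""
    tagged, problems = [], []
    for position, (style_name, text) in enumerate(styled):
        tag = STYLE_TO_TAG.get(style_name)
        if tag is None or tag not in ALLOWED_CHILDREN.get(parent, set()):
            problems.append((position, style_name))  # send back to the operator
            continue
        tagged.append(f"<{tag}>{escape(text)}</{tag}>")
    return tagged, problems


tagged, problems = tag_paragraphs(
    [("Chapter Title", "One"), ("Body Text", "A & B"), ("Mystery", "x")]
)
print(tagged)    # ['<title>One</title>', '<para>A &amp; B</para>']
print(problems)  # [(2, 'Mystery')]
```

 Restricting operators to a fixed list of styles means an invalid tag shows up at cleanup time, when the text is already being touched, instead of later in the XML environment.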
     

Production tools

 B&N has created a highly productive manufacturing environment in its Manila operation that relies on standardization, process control and bulletproof tools. Operators in this environment need tools that can be rigorously customized to eliminate excessive work steps and that aren’t terribly complex. This is not because of any lack of ability, but because of high production and quality expectations. Further, the tools must integrate into an overall workflow and be driven from a common production control and content database that spans the entire process.

Conclusion

 During the past year Barnes & Noble has come a long way from the realization that a world-class conversion operation was needed that could integrate page and tagged content production. The operation has been established and is working smoothly through a variety of issues around the integration of XML and PDF workflows. Through this and our other POD and eBook efforts, B&N feels that it is helping the industry to achieve the goals that many people share today: that any consumer can find the book they need in the format they can best use wherever and whenever they need it, and that any publisher can count on being able to produce new eBook formats as they arise, without the need to reconvert original content.
