XML and XSL from servers to cell-phones   Table of contents   Indexes   Informix &,amp, XML

 

XSLTVM - an XSLT Virtual Machine

Novoselsky, Anguel
 
 Anguel  Novoselsky
 Member of Technical Staff IV
  California 
 Netfish Technologies, Inc. 
 Santa Clara 
 USA 
Netfish Technologies, Inc.,  2350 Mission College Blvd., Suite 650
Santa Clara  California USA  95054
Phone: +1 408 350 9500 email: anovosesky@netfish.com
 Biography
 Anguel Novoselsky - A Novoselsky has been working in the area of compiler and language design for more than 15 years with companies in USA, Canada and Bulgaria. Currently, he is a member of Oracle team developing Oracle XML/XSL components and he is also a principal representative to the XML WG for Oracle Corporation. He holds M.S. of Computer Science from Technical Institute of Sofia.
Karun, K
 
 K  Karun
 Project Lead
  California 
 Oracle Corporation  
 Redwood Shores  
Oracle Corporation,  500 Oracle Parkway
Redwood Shores  California  94065
Phone: 1-650-560-9874 Fax: 1-650-506-7203 email: kkarun@us.oracle.com
 Biography
 K Karun - K. Karun currently works as Project Lead for Java XML Components in CORE and XML Development Group. He holds a M.S. in Computer Science from Michigan State University and B.Tech. from Indian Institute of Technology, Madras. He lives in Mountain View, California
 Abstract
 The emergence and popularity of XML helps facilitate the development and integration of business and application semantics. However, each enterprise defines their own data elements to better communicate the "meaning" of their data. Translation will therefore be key for interoperability. XSL standards for transformation are sufficient to allow exchange between business vocabularies. Currently there are many XSLT engine implementations. In this paper, we present a novel approach for implementing transformations using XSL. We describe a XSLT Virtual Machine (XSLTVM). XSLTVM is the software implementation of a "CPU" designed to run compiled XSLT code. A concept of virtual machine assumes a compiler compiling XSLT stylesheets to sequence of byte codes or machine instructions for the "XSLT CPU". This approach clearly separates compile-time from run-time computations and specifies an uniform way of data exchange between instructions by defining a common interface areas like stack or pipeline for example. The separation line between compile-time and run-time computations depends on the level of machine instructions. Splitting a heterogeneous, high-level instruction into number of atomic, low-level instructions exposes some run-time checks and allows compiler to take care of them. For example, a general CMP instruction checks operand types at run-time. If it is replaced with few type specific CMPs then compiler checks operand types at compile-time and generates appropriate type casting instructions if needed.
 

Introduction

 XSL 
 
XSL is a language for expressing stylesheets. It consists of two parts: a language for transforming XML documents, and an XML vocabulary for specifying formatting semantics. In this paper we discuss the part for transforming XML documents. A transformation expressed in XSLT describes rules for transforming a source tree into a result tree. The transformation is achieved by associating patterns with templates. A pattern is matched against elements in the source tree. A template is instantiated to create part of the result tree.
XSLTVM
 
AnXSLTVM is an abstract computing machine. Like a real computing machine, it has an instruction set and manipulates various memory areas at run time. It is reasonably common to implement a programming language using a virtual machine; the best-known virtual machine may be the P-Code machine of UCSD Pascal. The described XSLTVM is an attempt to define an stack oriented VM with low-level set of instructions as a virtual model for XSLT processing. However, the virtual machine does not assume any particular implementation technology, host hardware, or host operating system. It could be a software implementation of a "CPU" designed to run compiled XSLT code. A concept of virtual machine assumes a compiler compiling XSLT stylesheets to sequence of byte codes of machine instructions for "XSLT CPU".
 This approach clearly separatescompile-time fromrun-time computations and specifies an uniform way of data exchange between instructions by defining a common interface areas like stack or pipeline for example. The separation line between compile-time and run-time computations depends on the level of machine instructions. Splitting a heterogeneous,high-level instruction into number of atomic, low-level instructions exposes some run-time checks and allows compiler to take care of them. For example, a general CMP instruction checks operand types at run-time. If it is replaced with few type specific CMPs then compiler checks operand types at compile-time and generates appropriate type casting instructions if needed.
 Although not explicitly mentioning terms like "virtual machines" the first generation of XSLT processors are implementing variants of quasi high-level instruction set in a form of objects with action functions/methods and as it was pointed out this is not an optimal solution. Also their architecture is usually designed as object-oriented enforcing data encapsulation. As a result data exchange between instructions/objects is not unified so the implementations are forced to use expensive heap memory allocations for temporary results. The first section includes information on the XSLTVM architecture and instructions set. Examples illustrating the compilation of XSLT stylesheet fragments into XSLTVM instructions with more detailed explanations are shown in Section 3. The next section emphasizes the design goals of the proposed VM approach over the current implementations. Finally we discuss the related and future work.
 

Architecture of XSLTVM

VM stack
 
The XSLT Virtual Machine has a stack based architecture. TheVM stack (is analogous to the stack of a conventional language: it holds local variables and partial results, and plays a part in template invocation and return. Also, most of XSLTVM machine instructions take their operands values from the top of the VM stack and replace them with the result. Stack cell contents types are:
 
  • node
  •  
  • descriptor
  •  
  • boolean
  •  
  • number
  •  
  • string
  •  Descriptors are node set descriptors and template descriptors. Thenode set descriptors are created dynamically at run-time. Each one has fields for the current node, for node set size and index which points to the actual nodes location in a node set stack.Template descriptors are created at compile-time and reside in atemplate descriptor pool . They are loaded into the VM stack during template call. Each template descriptor has the following fields:
     
  • size: template frame size equal to number of local parameters, plus number of local variables, plus number of temporaries like loop indexes, plus three. These three extra stack cells are allocated for the template current node (referenced as_current in the examples bellow), for template descriptor index (referenced as _desc) and for the return address.
  •  
  • address: template instruction pool entry index.
  •  
  • list of parameters and their offsets in the frame. This list is used by PARAM instructions for parameter binding.
  •  Thenode set stack is analogous to the VM stack but contains data which size is not known at compiler time. It grows and shrinks in parallel with the VM stack, which simplifies the run time maintenance. XSLTVM has amachine instruction pool for template body instructions. Each template has an entry instruction which is executed first when the template is instantiated. The entry instruction index in the instruction pool is treated as a template address.
     XSLT templates like programming language procedures have an associatedtemplate stack frame for parameters and local variables. Variables, template parameters and loop indexes are compiled to stack offset addresses, which eliminates run-time search. The frame is allocated into the VM stack during template call with size taken from atemplate descriptor . The process of frame allocation and template call is explained in more details in the next section. Functions like templates take their parameters from the stack, also.
     A separatetemplate pattern pool contains the all of the template patterns. This allows speed optimization techniques like parallel or table-driven pattern matching to be applied on the pattern pool contents. The XSLTVM instruction MATCH works with the pattern pool. It finds the wining pattern and pushes the corresponding template descriptor into the stack. Aconstant pool is used as a storage for names, literals and constants.
     

    Compiling for XSLTVM

     The XSLTVM works with the compiled set of instructions generated by a XSLT compiler. In this section, we discuss formats of a typical XSLT instructions.
     

    Instruction description

     With only few exceptions XSLTVM instructions have the following formats:
     
    opcode, mode, operand1 [, operand2 ]
    
     Depending on the mode the operands could be one of following:
     
  • address - stack index
  •  
  • address - constant pool index
  •  
  • address - top of the stack.
  •  
  • address - instruction pool index.
  •  
  • immediate value
  •  Machine instructions are divided into two relatively independent subsets: XSLT and XPath. This allows the XPath subset in a form of XPath engine to be used somewhere else. The XSLT instruction subset includes instructions for template call, for variable and parameter load and store, instructions for element, attribute, comment and processing-instruction creation, conditional instructions, loops, etc. The XPath instruction subset includes instructions for step search, boolean instructions, arithmetic instructions, relational instructions, function call instructions, etc. Some of those instructions and their usage are described in more details in the next section. All XSLTVM pools are created at compile-time. Their sizes don't change at run-time so memory for them is allocated only once during loading. The only dynamic XSLTVM parts which need run time maintenance are the stacks - VM stack and node set stack.
     

    Expression and template compilation

     First example illustrates the compilation of an XPath expression including steps and filter sub-expressions.
     
    /sales/domestic[position()=1  and  @attr="value"]
    
     is translated to:
     
    CHILD       #sales      // constant pool index
    CHILD       #domestic   // constant pool index
    STORE       i           // store current node set as a
    // loop index
    JMP         LOOP0011
    LOOP0001:  INCR        i
    LOOP0011:  FOR_EACH    i, ENDL0001 // instruction pool index
    LOAD_S      i           // loads current node
    CALL_FUNC   #position
    COMP        #1
    JNE         LOOP0001
    LOAD_S      i
    LOAD_ATTR   #attr
    COMP        #value
    JNE         LOOP0001
    ADD_NODE    i           // add 'i' to result node set
    JMP LOOP0001
    ENDL0001:
    
     CHILD is a step search instruction which replaces the current node set which resides on the top of the stack with the result node set. STORE moves the current node set to the template frame cell allocated by compiler for loop index 'i'. INCR increments the value of 'i'. In a case of node set descriptor it increments current node field. LOAD_S loads single value. If it is a node set it loads the current node. FOR_EACH first operand is node set address, second - loop's end address. This instruction checks if the current node number is less then the set size. If not then jumps to end loop address. ADD_NODE instruction adds the current node to the result node set on the top of the stack if both filter conditions are true.
     Next example illustrates how the previous XPath expression is used in template body:
     
    <xsl:template match="doc">
    <xsl:param name="file">table.html</xsl:param>
    <table href="{$file}">
    <xsl:apply-templates  select="/sales/domestic[position() =1
    and @attr = "value"] />
    <xsl:with-param name="file">
    file.xsd
    </xsl:with-param>
    </table>
    </xsl:template>
    
     is compiled to:
     
    T0022:     TEMPLATE
    PARAM       #file, #table.html
    MK_STAG     #mytable              //creates <table>
    MK_ATTR     #href                 //current fragment is
    //<table href="  ">
    MK_ATTRVAL  file
    ADD_NODE    _current              //creates node set for XPath
    ..                            //expression, evaluate XPath
    ..                            //and as a result, the current
    ..                            //node set is on the top
    STORE       j                     //store the XPath node set
    JMP         LOOP0012
    LOOP0002:  INCR        j
    LOOP0012:  FOR_EACH    j,ENDL0002
    MATCH      j                                 //match with pattern pool
    //matched template descriptor
    //is pushed
    ALLOC_FRAME j                     //allocates new template frame
    PARAM       #file,#table.xsd      //stores param value at offset
    CALL_TEMP   _desc                 //jumps to template index from
    //descriptor
    JMP         LOOP0002
    ENDL0002:
    MK_ETAG     #table                //creates </table>
    RETURN
    
     Template starts with PARAM instructions (if any). It checks the template descriptor (offset _desc) if the corresponding parameter (first operand) is already loaded. If not then it loads the parameter from the address specified by the second operand and marks it as done. MK_... instructions create elements, attributes, comments, text and processing-instructions. MK_ATTRVAL places the contents of parameter 'file' as a value of the current output attribute. MATCH takes the node from 'j', finds the winning pattern and loads the copy of template descriptor into the stack. ALLOC_FRAME uses this descriptor to allocate the template frame. It also loads the node specified by the first operand at offset _current. Note that the frame includes the template descriptor which is already in the stack. CALL_TEMP loads the next instruction index as a return address and jumps to the template entry instruction. RETURN jumps to the current frame return address and pops the frame from the stack.
     

    Design goals

     As it was pointed out the main design goal was to define an virtual machine as a model for XSLT processing. Also, we consider the following XSLTVM features as an important design improvement over the current generation of XSLT processors:
     
  • XSLTVM doesn't allocate heap memory at run time. The only data memory allocated is for the virtual machine stacks.
  •  
  • Variables and parameters are compiled to stack offsets which eliminates run time search.
  •  
  • Template patterns are separated from templates into a pattern pool. Effective searching techniques (Finite Automaton for example) could be applied to find the winning pattern.
  •  
  • Current XSLT implementations perform number of run-time checks which can be done by compiler at compile-time but this is possible only in a case of low-level target machine like XSLTVM.
  •  Once compiled XSLT 'programs' like Java classes can be run many times on different XSLTVM. Also, XSLT 'programs' become implementation language independent. In other words, there should be no difference between code generated from Java XSLT compiler and code from C/C++ XSLT compiler, they both should be able to run on the same XSLTVM.
     

    Future work

     Several of the current XSLT processors implement variants of quasi high-level instruction set in a form of objects with action functions/methods. The goal of a virtual machine is to identify a optimal low-level instruction set. As part of future work, we would like to refine the current instruction set based on performance optimizations. In addition, providing a standard binary file format for the compiled XSLT stylesheet would facilitate re-use of the compiler output with different implementations XSLTVMs.
     We also plan to add support for XMLSchema datatypes and archetypes, and provide extensibility mechanism in XSLT virtual machine to make callout to other programming or scripting languages.
     Bibliography
     
    XML World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210 .
     
    XML Names World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210
     
    XSLT World Wide Web Consortium. XSL Transformations (XSLT). W3C Recommendation. See http://www.w3.org/TR/xslt
     
    XPATH World Wide Web Consortium. XML Path Language. W3C Recommendation. See http://www.w3.org/TR/xpath
     
    JVMS Tim Lindholm, Frank Yellin. The JavaTM Virtual Machine Specification. See http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html

    XML and XSL from servers to cell-phones   Table of contents   Indexes   Informix &,amp, XML