The SGML Character Model   Table of contents   Indexes   The Pros and Cons of Industry-Standard DTDs

  Pieper  Frank 
 

Document Structure Independent Data Modelling

 

Abstract:

  Consider a class of database publishing systems that exploit the principles of generalized markup. Each system in this class encompasses three main subsystems: a data storage subsystem, processes that transform the stored data into SGML, and processes that transform the SGML documents into final publication form.
  We are interested in the impact that the chosen style of storing data has on the structural flexibility of the database publishing system. In other words, we ask how much effort, and of what kind, is involved in adding a DTD (Document Type Definition) to the publication domain, or in altering an existing DTD.
  Four styles of data storage are considered, namely: 1) storing text files containing the SGML; 2) a database whose schema has been designed after one or more DTDs; 3) a generally applicable SGML database; and 4) databases designed independently of document structure. Each of these approaches has its own advantages and disadvantages, but regarding structural flexibility, document structure independent data modeling is the winner by far.
 

Introduction

  Ever since the introduction of the database publishing concept in the mid-1980s, generalized markup languages like SGML have played a predominant role in its realization. (In the sequel we refer exclusively to SGML, but the material applies just as well to systems that use other, even non-standard, GMLs.)
  This paper considers a class of database publishing systems with the following architecture. Large amounts of data are stored in a sufficiently organized form: this we call a database. One process converts these stored data into SGML files, and another process completes the publications. See Figure 1.
 
Architecture of the database publishing systems under discussion
  Note that in the general case, we do not expect the data storage subsystem to make use of a relational, or other, DBMS (DataBase Management System). Every sufficiently organized collection of data is entitled to the "database" predicate. We are even rather liberal concerning the "sufficiently organized" criterion: if it suits someone's purpose, then it is someone's database.
  The process of turning the database content into an SGML document for postprocessing is called the extraction process. That process is the prime reason for this study, even though the central subject is the data storage subsystem. We are interested in database design strategies that leave the extraction process the greatest possible freedom.
  Everything that subsequently happens to the extracted SGML in order to get a completed publication is called the production process. It is outside the scope of this paper.
  One of the great advantages of database publishing is the possibility to reuse content: within one document, among documents of one type, and even among documents of different types. The latter is especially interesting in situations where, at the time of database design, no complete overview exists of current and future DTDs. The ability of a database publishing system to cope with newly defined document structures is called the system's structural flexibility.
 It is little wonder that the style of data storage largely determines the structural flexibility of a database publishing system. In this paper, four such styles are presented and compared on structural flexibility.
 But first, an example.
 

An Illustrative Example of the Problem Domain

 Imagine that we are The Combination, a collection of trading companies dealing in, say, office supplies.
 We do not compete with one another. First of all, we market different things: some of us deal in desks and chairs, others in whiteboards, many in pens and pencils, and some in computer supplies like floppy disks and printer toner cartridges. Those among us with similar product ranges, like the pens and pencils colleagues, distinguish themselves from one another by level of luxury, or price, or other marketing aspects.
 Some of us are also resellers of selected products of others, e.g. the supplier of floppy disk labels also sells the special soft-tip pens used to write on these labels.
 Imagine also that we combine our marketing power by publishing a joint catalogue. To this end, we use a database publishing system according to Figure 1.
 The catalogue starts with some general information, not related to a specific supplier, product group, or product. After this common front matter, the catalogue contains a section for each member who is the prime supplier of at least one product. Such a section provides the prime supplier's name and a list of all his products. Each list item describes the product group that the product belongs to, the (other) product data, and the names of any members of The Combination that resell the product.
 Here is a segment of the label structure of the joint catalogue (see Figure 2). Please ignore the obvious simplicity of the model: an example is only needed to demonstrate some effects, and this one has enough complexity to do so. Greater complexity makes things worse, but not fundamentally different.
 
Label structure of the Joint Catalogue
 After The Combination has published several annual issues of the Joint Catalogue, it turns out that one of our members, viz. the One Stop Office Shop (OSOS), intends to create a catalogue of its own. OSOS is not the prime supplier of any product, but a reseller of many products of many prime suppliers.
 The Combination's Presidium permits (under conditions) that OSOS uses the database of our common database publishing system. One of these conditions is that OSOS' catalogue will mention the prime supplier of each product that appears in it.
 OSOS is user oriented to a greater extent than The Combination. Thus, OSOS organizes its catalogue not by prime supplier, but by product group. The label structure of OSOS' catalogue is depicted in Figure 3.
 
Label structure of OSOS' Catalogue
 It starts with OSOS' own supplier name and a common matter. The remainder is a sequence of product groups. Each product group consists of the product group description and the range of products within that group. Of each product, OSOS' catalogue provides the product data and the prime supplier's name.
 Now, let us see how much trouble OSOS has to get through in order to get the data correctly in this new structure. We will see that this very much depends on which data storage strategy was originally chosen by The Combination.
 

Four Data Storage Strategies

 
 

Plain SGML Text Files

  The lowest sophistication level of data storage is to use text files that directly contain the SGML required by the production process.
 Possible reasons to choose this strategy
 
  • The SGML form is immediately available, so the system does not require an extraction program.
  • Standard tools exist, in the form of SGML editors and DTD editors, that serve as the input facility for the data storage. This implies, for instance, that defining a DTD involves data input only, and not programming as in some of the other options.
  • No DBMS is required, since every computer operating system has a built-in facility for storing plain text files.
     But the simplicity of this approach is also its weakness. For how much effort, and of what kind, is involved in adding a DTD to the publication domain, or in altering an existing one? Let us look at the OSOS case.
     Two options exist for the realization of the OSOS catalogue: an extraction level solution and a storage level solution, so to speak.
     
     

    Extraction Level Solution

      The first option is to extract the new SGML text from the existing one, i.e. to treat the Joint Catalogue SGML file as the database, and to program a process that renders the OSOS Catalogue SGML file from this database.
      Unfortunately, no standard tools exist that can transform a given SGML file into one with the same content data in a different DTD and elements hierarchy. Even if such a tool existed, it would still have to be parametrized with the incoming and outgoing DTDs, and with the transformation function that leads from the one to the other. This last function in particular can be very hard to define rigorously, let alone specify in computer-readable form.
      Therefore, third generation level programming is inevitable. Not only that, but every time either one of the two DTDs is altered, the consequences for the extraction program will have to be checked. Usually this means that the extraction program itself will have to be altered. Or worse: a simplification of the Joint Catalogue, for instance, might result in the OSOS Catalogue not being derivable any more.
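      To make the programming effort concrete, here is a minimal sketch of such an extraction program in Python. It rests on several assumptions that are not part of the paper: the Joint Catalogue is available in a well-formed, XML-like form that `xml.etree.ElementTree` can parse, the sample content and the supplier names PenCo and DiskCo are invented, and `joint_to_osos` is a hypothetical function name. Only the element names come from the label structures described earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of the Joint Catalogue.
JOINT = """<JointCatalogue>
  <CommonMatter>General terms</CommonMatter>
  <PrimeSupplier>
    <SupplierName>PenCo</SupplierName>
    <Product>
      <ProductGroupDescription>Pens</ProductGroupDescription>
      <ProductData>Soft-tip pen</ProductData>
      <Reseller><SupplierName>OSOS</SupplierName></Reseller>
    </Product>
    <Product>
      <ProductGroupDescription>Pens</ProductGroupDescription>
      <ProductData>Fountain pen</ProductData>
    </Product>
  </PrimeSupplier>
  <PrimeSupplier>
    <SupplierName>DiskCo</SupplierName>
    <Product>
      <ProductGroupDescription>Computer supplies</ProductGroupDescription>
      <ProductData>Floppy disk labels</ProductData>
      <Reseller><SupplierName>OSOS</SupplierName></Reseller>
    </Product>
  </PrimeSupplier>
</JointCatalogue>"""


def joint_to_osos(joint_xml: str, reseller: str) -> ET.Element:
    """Regroup the products resold by `reseller` by product group."""
    joint = ET.fromstring(joint_xml)
    osos = ET.Element("OSOSCatalogue")
    ET.SubElement(osos, "SupplierName").text = reseller
    # Left empty: OSOS' own common matter is not in the Joint Catalogue file.
    ET.SubElement(osos, "CommonMatter")

    groups: dict[str, ET.Element] = {}
    for supplier in joint.findall("PrimeSupplier"):
        prime_name = supplier.findtext("SupplierName")
        for product in supplier.findall("Product"):
            resellers = [r.findtext("SupplierName") for r in product.findall("Reseller")]
            if reseller not in resellers:
                continue
            descr = product.findtext("ProductGroupDescription")
            # Groups can only be recognised by their description text: the
            # source document carries no product group identifier.
            if descr not in groups:
                groups[descr] = ET.SubElement(osos, "ProductGroup")
                ET.SubElement(groups[descr], "ProductGroupDescription").text = descr
            out = ET.SubElement(groups[descr], "Product")
            ET.SubElement(out, "ProductData").text = product.findtext("ProductData")
            ET.SubElement(out, "SupplierName").text = prime_name
    return osos


osos_tree = joint_to_osos(JOINT, "OSOS")
print(ET.tostring(osos_tree, encoding="unicode"))
```

      Even in this toy form the program hard-codes the element names of both structures and has to recognise product groups by comparing description texts, so a change to either DTD is likely to require a change to the program.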
     That is why we prefer the storage level solution.
     
     

    Storage Level Solution

      In the storage level solution, the stored data is augmented such that it also contains the OSOS catalogue SGML text.
      Holding on to the idea that no extraction program is required implies that the OSOS catalogue is represented as an SGML text that contains copies of many subdocuments (supplier data, product group data, and product data, all in various instances) of the Joint Catalogue SGML file.
     Firstly, this copying is itself highly error prone. But what's at least as important: every alteration in one of these subdocuments has to be carried out in both catalogue documents.
     So this is not the best approach; we had better look for a solution that contains every common subdocument only once. (By the way: note that multiple storage of data was already present in the original Joint Catalogue, namely SupplierName of Resellers!)
     Let us try the following. Every common subdocument is represented in a separate text file; and each of the two catalogues is represented in a "framework" text file that directly contains the higher-level elements, but only entity references to the subdocuments.
      To achieve this, we would first of all have to decompose the JointCatalogue SGML file into a higher-level framework and numerous lower-level subdocuments: common data, product data, product group data, and supplier data. We might decide to create four separate directories, one for each of the four types of subdocument files.
     Next, the higher-level framework of the OSOSCatalogue would have to be created, and its leaf nodes filled with references to all the correct subdocument files.
     The new situation is schematically represented in Figure 4.
     
    Decomposed file and directory structure
      At least this technique would relieve us of the two-fold content alterations in common subdocuments. But it would also burden us with an extraction process that we didn't need before, albeit a simple one: namely to combine a catalogue framework file and the subdocuments that it refers to into a complete catalogue SGML file. Also, some changes would still have to be administered in both of the catalogue files and in one of the subdocument directories, namely insertions into, and deletions from, any one of the four sets of subdocuments.
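      The simple extraction process just mentioned can be sketched as follows. This is an illustration under assumptions, not a description of any SGML tool: the subdocuments are held in a Python dictionary instead of separate files, the entity names and sample content are invented, and `expand` is a hypothetical helper. In a real SGML system the parser would resolve declared entities itself.

```python
import re

# Hypothetical subdocument store; in the scheme described above each entry
# would be a separate text file in one of the four subdocument directories.
SUBDOCUMENTS = {
    "supplier-penco": "<SupplierName>PenCo</SupplierName>",
    "group-pens": "<ProductGroupDescription>Pens</ProductGroupDescription>",
    "product-softtip": "<ProductData>Soft-tip pen</ProductData>",
}

# Framework "files": higher-level elements plus entity references only.
JOINT_FRAMEWORK = (
    "<JointCatalogue><PrimeSupplier>&supplier-penco;"
    "<Product>&group-pens;&product-softtip;</Product>"
    "</PrimeSupplier></JointCatalogue>"
)
OSOS_FRAMEWORK = (
    "<OSOSCatalogue><ProductGroup>&group-pens;"
    "<Product>&product-softtip;&supplier-penco;</Product>"
    "</ProductGroup></OSOSCatalogue>"
)

# An entity reference: "&", a name, ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(framework: str, subdocs: dict[str, str]) -> str:
    """Replace every entity reference by the subdocument it names."""
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in subdocs:
            raise KeyError(f"no subdocument for entity reference &{name};")
        return subdocs[name]
    return ENTITY_REF.sub(lookup, framework)


joint_sgml = expand(JOINT_FRAMEWORK, SUBDOCUMENTS)
osos_sgml = expand(OSOS_FRAMEWORK, SUBDOCUMENTS)
print(joint_sgml)
print(osos_sgml)
```

      Each subdocument is stored once and appears in both catalogues, but adding or removing a product still means editing both framework texts as well as the subdocument store.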
     Note that we couldn't have predicted beforehand, i.e. during database design, at which levels in the document's internal tree the subdocuments would have to be separated and replaced by references. Hence in order to be prepared for any such change in the publication domain, we would have to decompose the database at many, if not all, levels in the labels hierarchy.
     This leads us, more or less automatically, to the second storage strategy.
     
     

    Document Structure Driven Data Modeling

      One known technique for database publishing is to create a database application where the DTD of the publications is reflected in the database structure. In short, this means that well-chosen labels of the DTD appear in the database design either as table names or as attribute names. Moreover, parent-child relations between labels are modelled in the database either by foreign keys (in cases where every child-label element is necessarily a child of exactly one parent-label element) or by separate element relation tables (in all other cases). The database structure thus derived from the JointCatalogue DTD is depicted in Figure 5.
     
    Database structure for the Joint Catalogue
     One CommonMatter can occur in various issues of the JointCatalogue, hence we have a relation table between the JointCatalogues table and the CommonMatter subdatabase. For analogous reasons, we have relation tables between the JointCatalogues table and the PrimeSuppliers, and between Products and Resellers. On the other hand, every record in Products can only belong to one of the PrimeSuppliers, so this relationship is implemented by means of a simple table reference. A similar argument holds for the ProductData subdatabase: every item in it is intended to describe a property of only one of the Products.
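     The two modelling rules, a foreign key for a child with exactly one parent and a relation table otherwise, can be sketched with Python's `sqlite3` module. The table names follow the text above; every column name, the names of the three relation tables, and the sample rows are assumptions made for illustration, since the figure itself is not reproduced here.

```python
import sqlite3

SCHEMA = """
CREATE TABLE JointCatalogues (CatalogueId INTEGER PRIMARY KEY, Issue TEXT);
CREATE TABLE CommonMatter    (CommonMatterId INTEGER PRIMARY KEY, Content TEXT);
CREATE TABLE PrimeSuppliers  (PrimeSupplierId INTEGER PRIMARY KEY, SupplierName TEXT);
CREATE TABLE Resellers       (ResellerId INTEGER PRIMARY KEY, SupplierName TEXT);

-- Child with exactly one parent: a plain foreign key.
CREATE TABLE Products (
    ProductId       INTEGER PRIMARY KEY,
    PrimeSupplierId INTEGER NOT NULL REFERENCES PrimeSuppliers(PrimeSupplierId),
    ProdGroupDescr  TEXT
);
CREATE TABLE ProductData (
    ProductDataId INTEGER PRIMARY KEY,
    ProductId     INTEGER NOT NULL REFERENCES Products(ProductId),
    Content       TEXT
);

-- Child that may occur under several parents: a separate relation table.
CREATE TABLE CatalogueCommonMatter (
    CatalogueId    INTEGER REFERENCES JointCatalogues(CatalogueId),
    CommonMatterId INTEGER REFERENCES CommonMatter(CommonMatterId),
    PRIMARY KEY (CatalogueId, CommonMatterId)
);
CREATE TABLE CataloguePrimeSuppliers (
    CatalogueId     INTEGER REFERENCES JointCatalogues(CatalogueId),
    PrimeSupplierId INTEGER REFERENCES PrimeSuppliers(PrimeSupplierId),
    PRIMARY KEY (CatalogueId, PrimeSupplierId)
);
CREATE TABLE ProductResellers (
    ProductId  INTEGER REFERENCES Products(ProductId),
    ResellerId INTEGER REFERENCES Resellers(ResellerId),
    PRIMARY KEY (ProductId, ResellerId)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Invented sample rows: one product of PenCo, resold by OSOS.
conn.execute("INSERT INTO PrimeSuppliers VALUES (1, 'PenCo')")
conn.execute("INSERT INTO Resellers VALUES (1, 'OSOS')")
conn.execute("INSERT INTO Products VALUES (100, 1, 'Pens')")
conn.execute("INSERT INTO ProductResellers VALUES (100, 1)")

tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
reseller_names = [name for (name,) in conn.execute(
    "SELECT r.SupplierName FROM ProductResellers pr"
    " JOIN Resellers r ON r.ResellerId = pr.ResellerId"
    " WHERE pr.ProductId = 100")]
print(tables)
print(reseller_names)
```

     Note how the document structure shows through: supplier names live in two tables, PrimeSuppliers and Resellers, simply because the DTD has two places for them.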
     Now let us see what a (virtual) database for the OSOSCatalogue would look like if we did not have to reckon with the existence of a JointCatalogue database. See Figure 6.
     
    Database structure for OSOS' Catalogue
     The integration of the last two figures results in the (real) database structure that the existing one will have to be turned into, in order that both catalogues can be generated from the one database. It is depicted in Figure 7.
     
    Database structure for the two catalogues together
     Note that after the integration, a few optimization steps would have been possible. The ProdGroupDescr attribute of the Products table, originating from the JointCatalogue database, might have been omitted because it can be derived from the ProductGroups table that came from the OSOSCatalogue database. Likewise, we could have left the SupplierName out of the Products table, because it is derivable from the PrimeSuppliers table. And finally, the Resellers table might have been integrated with the PrimeSuppliers table, provided that the relationship table between Products and Resellers were simultaneously redirected towards the PrimeSuppliers table.
     Note also, however, that such optimizing structure changes would imply alteration of the existing extraction program for the JointCatalogue.
     Summarizing, if The Combination had initially chosen this second data storage approach for the JointCatalogue, then the project of adding the OSOSCatalogue to the publication domain would have consisted of the following steps.
     
     
    1. Extending the database structure as indicated above.
    2. Extending the database application in order that it becomes possible to bring, and henceforth keep, the data in the additional tables and attributes up to date.
    3. Input of the extra data.
    4. Writing at least one new extraction program.
     Quite an effort altogether. Luckily all the programming items in the above list can be carried out at the fourth generation level of programming languages.
     
     

    Treating Document Structure as Data

      Unsatisfied as we are even with the second storage strategy, we inspect the possibility of an SGML-oriented database. The well-known idea is to use a database application that can contain label structure, document elements hierarchy, and document content as data. The database application, possibly in combination with an SGML editor, is responsible for guaranteeing that the elements hierarchy and the content are consistent with the DTD. Such a database application might have a database structure like that in Figure 8.
     
    Database structure for a generic SGML database application
     At first glance, this appears to be a rather complicated, and hence expensive, solution. Two arguments explain why this impression might not be correct.
      For one, we must not lose sight of the fact that the examples used here are of an impractical simplicity. In reality, DTDs like those of the Joint Catalogue and OSOS' Catalogue tend to have dozens of labels, possibly even a few hundred. That would give the second storage solution, discussed in the previous subsection, a database structure of a complexity comparable to this one.
      The second argument is that there is real gain in the fact that this is a very generic database structure, not limited to one publication domain like the previous solution is. Such a database application can be traded as a product, and will thus be less costly per user. An example of such a product is MediaWare's Publishing Base.
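      The core idea, label structure held as data and an elements hierarchy checked against it, can be illustrated with a small Python sketch. It is a deliberate simplification and not the design of any product: the label structure is reduced to "which child labels may occur under which parent label", whereas a real DTD content model also fixes order and occurrence. The names `LABEL_STRUCTURE`, `Element`, and `violations` are invented for this sketch.

```python
from dataclasses import dataclass, field

# Label structure stored as data: parent label -> labels allowed as children.
LABEL_STRUCTURE = {
    "JointCatalogue": {"CommonMatter", "PrimeSupplier"},
    "PrimeSupplier": {"SupplierName", "Product"},
    "Product": {"ProductGroupDescription", "ProductData", "Reseller"},
    "Reseller": {"SupplierName"},
}


@dataclass
class Element:
    """One node of the elements hierarchy, with optional text content."""
    label: str
    content: str = ""
    children: list["Element"] = field(default_factory=list)


def violations(element: Element, structure: dict[str, set[str]]) -> list[str]:
    """List every parent/child pair that the label structure does not allow."""
    allowed = structure.get(element.label, set())
    found = []
    for child in element.children:
        if child.label not in allowed:
            found.append(f"{child.label} not allowed in {element.label}")
        found.extend(violations(child, structure))
    return found


good = Element("JointCatalogue", children=[
    Element("CommonMatter", "General terms"),
    Element("PrimeSupplier", children=[
        Element("SupplierName", "PenCo"),
        Element("Product", children=[Element("ProductData", "Soft-tip pen")]),
    ]),
])
bad = Element("PrimeSupplier", children=[Element("ProductGroup")])

print(violations(good, LABEL_STRUCTURE))
print(violations(bad, LABEL_STRUCTURE))

# Adding a document type is a matter of adding data, not of programming.
LABEL_STRUCTURE["OSOSCatalogue"] = {"SupplierName", "CommonMatter", "ProductGroup"}
LABEL_STRUCTURE["ProductGroup"] = {"ProductGroupDescription", "Product"}
osos = Element("OSOSCatalogue", children=[Element("ProductGroup")])
print(violations(osos, LABEL_STRUCTURE))
```

      The checking code never mentions a catalogue-specific label, which is what makes such an application generic.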
     Let us now look at the consequences of this storage option for The Combination and OSOS.
     First, The Combination fills the LabelStructure module with the label structure of the Joint Catalogue, and then the other modules with the elements and content of the Joint Catalogue.
     When OSOS wants to reuse the content for its own catalogue, the LabelStructure module of the database will have to get filled with a structure that combines the two catalogue label structures in itself. Such a LabelStructure module filling is depicted in Figure 9.
     
    Label structure for the two catalogues combined
     Problems that are immediate from Figure 9.
     
  • The database is unable to guard the consistency between several redundant items. For instance, there is no guarantee that all Products that belong to a certain ProductGroup in the OSOSCatalogue have the same ProductGroupDescription that this ProductGroup has. Likewise, whether or not every Product of a given PrimeSupplier in the JointCatalogue contains the same SupplierName as this PrimeSupplier also depends solely on the accuracy of the users.
  • The other problem is that the label structure for the JointCatalogue in the database is now seemingly different from before. To be more specific: when extracting the SGML version of the JointCatalogue, we must henceforth be careful to skip the SupplierName in each Product.
     After these two hurdles are overcome, a third one appears, not so fundamental as the other two but at least as burdensome. It is the same one that was already part of the problem with the two earlier storage strategies. Once the "labelstructural" facilities have been provided for, there is a large amount of work left to be done. The higher-level elements of the OSOS Catalogue have to be added, and subsequently linked to all the correct SupplierName, ProductData, and ProductGroupDescription elements. And even after that, for the rest of their lives, each time a member of The Combination adds a product to the Joint Catalogue, OSOS may have to incorporate it into its own structure too.
     We will look for a storage strategy that prevents such database size proportional workloads each time a new document structure is imposed on existing content. Since document databases tend to become large, a database architecture that prevents elements-level database manipulation in such cases, is likely to be rewarding even if the thing that has to be done instead, is programming (only once).
     
     

    Document Structure Independent Data Modeling

     The answer is to forget, during database design, that the aim of our system is to publish something. All that we are concerned about is that every piece of relevant data gets a place in the database, and that the database structure reflects the intrinsic coherence between all these pieces of data.
      This means that the focus of our attention is not the publication domain any more. It's the problem domain.
     Facts in the problem domain of our example
     
  • There are members, also called suppliers, and every member has a supplier name.
  • There are products, categorized into product groups, and every product group has a description.
  • Every product belongs to exactly one product group.
  • For each product, exactly one supplier acts as the prime supplier of that product.
  • A product can also be carried by other members than the prime supplier. These others are called the resellers of the product.
  • Product data are related to each product.
  • There is other structured information available, not related to a specific supplier, product group, or product.
     All this leads to the database structure of Figure 10.
     
    Document structure independent database structure
     Observe the pleasing simplicity of this database diagram. It elegantly reflects the structure of the information without any bias originating from publication needs.
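     The problem domain facts listed above translate almost one-to-one into table definitions. The following sketch uses Python's `sqlite3` module and is an assumption-laden reading of those facts, not a transcription of the figure: the names Suppliers, ProductGroups, Products, Resellers, SupplierId, ProductId, ProductGroupId, and ProdGroupDescr occur in the extraction procedures of this paper, while the ProductData and CommonMatter columns, the sample rows, and the helper `open_database` are invented.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Suppliers (
    SupplierId   INTEGER PRIMARY KEY,
    SupplierName TEXT NOT NULL
);
CREATE TABLE ProductGroups (
    ProductGroupId INTEGER PRIMARY KEY,
    ProdGroupDescr TEXT NOT NULL
);
-- Exactly one product group and exactly one prime supplier per product.
CREATE TABLE Products (
    ProductId      INTEGER PRIMARY KEY,
    ProductGroupId INTEGER NOT NULL REFERENCES ProductGroups(ProductGroupId),
    SupplierId     INTEGER NOT NULL REFERENCES Suppliers(SupplierId)
);
-- Many-to-many: which other members resell which product.
CREATE TABLE Resellers (
    ProductId  INTEGER NOT NULL REFERENCES Products(ProductId),
    SupplierId INTEGER NOT NULL REFERENCES Suppliers(SupplierId),
    PRIMARY KEY (ProductId, SupplierId)
);
CREATE TABLE ProductData (
    ProductId INTEGER NOT NULL REFERENCES Products(ProductId),
    Content   TEXT NOT NULL
);
CREATE TABLE CommonMatter (
    CommonMatterId INTEGER PRIMARY KEY,
    Content        TEXT NOT NULL
);
"""


def open_database() -> sqlite3.Connection:
    """Create an empty in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = open_database()
conn.execute("INSERT INTO Suppliers VALUES (1, 'PenCo')")
conn.execute("INSERT INTO ProductGroups VALUES (10, 'Pens')")
conn.execute("INSERT INTO Products VALUES (100, 10, 1)")

# A product that points at a non-existent product group is refused.
try:
    conn.execute("INSERT INTO Products VALUES (101, 99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("orphan product rejected:", rejected)
```

     Every supplier name and every product group description is stored exactly once, so the consistency problems noted for the earlier strategies cannot arise here.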
      Despite this simplicity, the Joint Catalogue SGML document is easily extracted from this database by a process that can quasi-formally be described as follows.
     Outermost step (JointCatalogue level)
     
  • PutStartTag "JointCatalogue";
  • PutStartTag "CommonMatter";
  • {extract the CommonMatter of the JointCatalogue};
  • PutEndTag "CommonMatter";
  • ForEach PrimeSupplierId X From Products Do
  • {extract the PrimeSupplier subdocument for X};
  • PutEndTag "JointCatalogue".
     Within this procedure, a subprocedure takes care of one PrimeSupplier subdocument at a time.
     Next inner step (PrimeSupplier level)
     
  • PutStartTag "PrimeSupplier";
  • PutStartTag "SupplierName";
  • Put SupplierName From Suppliers Where SupplierId = X;
  • PutEndTag "SupplierName";
  • ForEach ProductId Y From Products Where SupplierId = X Do
  • {extract the Product subdocument for Y};
  • PutEndTag "PrimeSupplier".
     This subprocedure also contains a sub-sub-procedure.
     Next inner step (Product level)
     
  • PutStartTag "Product";
  • PutStartTag "ProductGroupDescription";
  • Put ProdGroupDescr From Products Join ProductGroups Where ProductId = Y;
  • PutEndTag "ProductGroupDescription";
  • PutStartTag "ProductData";
  • {extract the ProductData for Y};
  • PutEndTag "ProductData";
  • ForEach SupplierId Z From Resellers Where ProductId = Y Do
  • {extract the Reseller subdocument for Z};
  • PutEndTag "Product".
     Finally, a sub-sub-sub-procedure takes care of the Reseller subdocuments.
     Innermost step (Reseller level)
     
  • PutStartTag "Reseller";
  • PutStartTag "SupplierName";
  • Put SupplierName From Suppliers Where SupplierId = Z;
  • PutEndTag "SupplierName";
  • PutEndTag "Reseller".
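     The four nested procedures can be turned into a runnable sketch with Python's `sqlite3` module. The table and column names that occur in the procedures are kept; everything else is an assumption made for illustration: the ProductData and CommonMatter columns, the sample rows, the join condition on ProductGroupId, the ordering of the loops, and the helper names `put_start_tag`, `put_end_tag`, `put`, and `rows`.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers     (SupplierId INTEGER PRIMARY KEY, SupplierName TEXT);
CREATE TABLE ProductGroups (ProductGroupId INTEGER PRIMARY KEY, ProdGroupDescr TEXT);
CREATE TABLE Products      (ProductId INTEGER PRIMARY KEY, ProductGroupId INTEGER, SupplierId INTEGER);
CREATE TABLE Resellers     (ProductId INTEGER, SupplierId INTEGER);
CREATE TABLE ProductData   (ProductId INTEGER, Content TEXT);
CREATE TABLE CommonMatter  (CommonMatterId INTEGER PRIMARY KEY, Content TEXT);
INSERT INTO Suppliers VALUES (1, 'PenCo'), (2, 'DiskCo'), (3, 'OSOS');
INSERT INTO ProductGroups VALUES (10, 'Pens'), (20, 'Computer supplies');
INSERT INTO Products VALUES (100, 10, 1), (200, 20, 2);
INSERT INTO Resellers VALUES (100, 2), (100, 3), (200, 3);
INSERT INTO ProductData VALUES (100, 'Soft-tip pen'), (200, 'Floppy disk labels');
INSERT INTO CommonMatter VALUES (1, 'General terms');
""")

out = []  # the SGML text under construction


def put_start_tag(label):
    out.append(f"<{label}>")


def put_end_tag(label):
    out.append(f"</{label}>")


def put(text):
    out.append(escape(text))  # protect markup characters in content


def rows(sql, *params):
    """Run a query and return all rows, so nested loops never share a cursor."""
    return conn.execute(sql, params).fetchall()


def extract_reseller(z):
    put_start_tag("Reseller")
    put_start_tag("SupplierName")
    put(rows("SELECT SupplierName FROM Suppliers WHERE SupplierId = ?", z)[0][0])
    put_end_tag("SupplierName")
    put_end_tag("Reseller")


def extract_product(y):
    put_start_tag("Product")
    put_start_tag("ProductGroupDescription")
    put(rows("SELECT ProdGroupDescr FROM Products JOIN ProductGroups"
             " USING (ProductGroupId) WHERE ProductId = ?", y)[0][0])
    put_end_tag("ProductGroupDescription")
    put_start_tag("ProductData")
    for (content,) in rows("SELECT Content FROM ProductData WHERE ProductId = ?", y):
        put(content)
    put_end_tag("ProductData")
    for (z,) in rows("SELECT SupplierId FROM Resellers WHERE ProductId = ?"
                     " ORDER BY SupplierId", y):
        extract_reseller(z)
    put_end_tag("Product")


def extract_prime_supplier(x):
    put_start_tag("PrimeSupplier")
    put_start_tag("SupplierName")
    put(rows("SELECT SupplierName FROM Suppliers WHERE SupplierId = ?", x)[0][0])
    put_end_tag("SupplierName")
    for (y,) in rows("SELECT ProductId FROM Products WHERE SupplierId = ?"
                     " ORDER BY ProductId", x):
        extract_product(y)
    put_end_tag("PrimeSupplier")


def extract_joint_catalogue():
    out.clear()
    put_start_tag("JointCatalogue")
    put_start_tag("CommonMatter")
    for (content,) in rows("SELECT Content FROM CommonMatter"):
        put(content)
    put_end_tag("CommonMatter")
    # Only members that are prime supplier of at least one product get a section.
    for (x,) in rows("SELECT DISTINCT SupplierId FROM Products ORDER BY SupplierId"):
        extract_prime_supplier(x)
    put_end_tag("JointCatalogue")
    return "".join(out)


joint_sgml = extract_joint_catalogue()
print(joint_sgml)
```

     In the sample data OSOS is a reseller only, so it appears inside Reseller elements but gets no PrimeSupplier section of its own.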
     Which alterations to the database structure and/or the application are necessary, or at least advisable, in order to make extraction of the OSOS Catalogue attainable?
     None.
     All that OSOS needs to do, is build an extraction program like the one we just saw for The Combination. Partly, the one for OSOS is even slightly simpler, due to the fact that OSOS' catalogue has a less deeply-nested label structure. But also, OSOS' extraction process is partly more complicated, since only the products for which OSOS is a reseller are to appear in the catalogue.
     One database instrument to realize this latter requirement is the notion of view. Let us assume that a database view is created that contains exactly those products for which OSOS is a reseller. Say that it is named OSOSProducts.
     Outermost step (OSOSCatalogue level)
     
  • PutStartTag "OSOSCatalogue";
  • PutStartTag "SupplierName";
  • Put SupplierName From Suppliers Where SupplierId = {OSOS' SupplierId};
  • PutEndTag "SupplierName";
  • PutStartTag "CommonMatter";
  • {extract the CommonMatter of the OSOSCatalogue};
  • PutEndTag "CommonMatter";
  • ForEach ProductGroupId X From OSOSProducts Do
  • {extract the ProductGroup subdocument for X};
  • PutEndTag "OSOSCatalogue".
     Next inner step (ProductGroup level)
     
  • PutStartTag "ProductGroup";
  • PutStartTag "ProductGroupDescription";
  • Put ProdGroupDescr From ProductGroups Where ProductGroupId = X;
  • PutEndTag "ProductGroupDescription";
  • ForEach ProductId Y From OSOSProducts Where ProductGroupId = X Do
  • {extract the Product subdocument for Y};
  • PutEndTag "ProductGroup".
     Innermost step (Product level)
     
  • PutStartTag "Product";
  • PutStartTag "ProductData";
  • {extract the ProductData for Y};
  • PutEndTag "ProductData";
  • PutStartTag "SupplierName";
  • Put SupplierName From OSOSProducts Join Suppliers Where ProductId = Y;
  • PutEndTag "SupplierName";
  • PutEndTag "Product".
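     The OSOS extraction, including the OSOSProducts view, can be sketched in the same way. As before, this is an illustration under assumptions: the ProductData and CommonMatter columns, the sample rows, the hard-coded supplier id 3 for OSOS, and the choice of CommonMatter row 2 as OSOS' own common matter are all invented. The helper `element` is a compact equivalent of a PutStartTag, Put, PutEndTag sequence.

```python
import sqlite3
from xml.sax.saxutils import escape

OSOS_ID = 3                # hypothetical SupplierId of OSOS
OSOS_COMMON_MATTER_ID = 2  # hypothetical choice made by the extraction program

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers     (SupplierId INTEGER PRIMARY KEY, SupplierName TEXT);
CREATE TABLE ProductGroups (ProductGroupId INTEGER PRIMARY KEY, ProdGroupDescr TEXT);
CREATE TABLE Products      (ProductId INTEGER PRIMARY KEY, ProductGroupId INTEGER, SupplierId INTEGER);
CREATE TABLE Resellers     (ProductId INTEGER, SupplierId INTEGER);
CREATE TABLE ProductData   (ProductId INTEGER, Content TEXT);
CREATE TABLE CommonMatter  (CommonMatterId INTEGER PRIMARY KEY, Content TEXT);
INSERT INTO Suppliers VALUES (1, 'PenCo'), (2, 'DiskCo'), (3, 'OSOS');
INSERT INTO ProductGroups VALUES (10, 'Pens'), (20, 'Computer supplies');
INSERT INTO Products VALUES (100, 10, 1), (200, 20, 2), (300, 10, 1);
INSERT INTO Resellers VALUES (100, 2), (100, 3), (200, 3);
INSERT INTO ProductData VALUES (100, 'Soft-tip pen'), (200, 'Floppy disk labels'),
                               (300, 'Fountain pen');
INSERT INTO CommonMatter VALUES (1, 'General terms'), (2, 'OSOS terms');

-- Exactly those products for which OSOS (SupplierId 3) is a reseller.
CREATE VIEW OSOSProducts AS
    SELECT p.ProductId, p.ProductGroupId, p.SupplierId
    FROM Products p JOIN Resellers r ON r.ProductId = p.ProductId
    WHERE r.SupplierId = 3;
""")


def element(label, *parts):
    """Wrap already-marked-up parts in a start tag and an end tag."""
    return f"<{label}>{''.join(parts)}</{label}>"


def rows(sql, *params):
    return conn.execute(sql, params).fetchall()


def extract_product(y):
    data = "".join(escape(content) for (content,) in rows(
        "SELECT Content FROM ProductData WHERE ProductId = ?", y))
    # SupplierId in the view is the prime supplier of the product.
    prime = rows("SELECT SupplierName FROM OSOSProducts JOIN Suppliers"
                 " USING (SupplierId) WHERE ProductId = ?", y)[0][0]
    return element("Product",
                   element("ProductData", data),
                   element("SupplierName", escape(prime)))


def extract_product_group(x):
    descr = rows("SELECT ProdGroupDescr FROM ProductGroups"
                 " WHERE ProductGroupId = ?", x)[0][0]
    products = rows("SELECT ProductId FROM OSOSProducts"
                    " WHERE ProductGroupId = ? ORDER BY ProductId", x)
    return element("ProductGroup",
                   element("ProductGroupDescription", escape(descr)),
                   *(extract_product(y) for (y,) in products))


def extract_osos_catalogue():
    name = rows("SELECT SupplierName FROM Suppliers WHERE SupplierId = ?", OSOS_ID)[0][0]
    common = rows("SELECT Content FROM CommonMatter WHERE CommonMatterId = ?",
                  OSOS_COMMON_MATTER_ID)[0][0]
    groups = rows("SELECT DISTINCT ProductGroupId FROM OSOSProducts"
                  " ORDER BY ProductGroupId")
    return element("OSOSCatalogue",
                   element("SupplierName", escape(name)),
                   element("CommonMatter", escape(common)),
                   *(extract_product_group(x) for (x,) in groups))


osos_sgml = extract_osos_catalogue()
print(osos_sgml)
```

     The tables are the same as for the Joint Catalogue; only the view and the extraction program are new. The fountain pen, which OSOS does not resell in the sample data, is left out by the view without any change to the stored data.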

    Conclusions

     We have seen a general architecture for database publishing systems, and four different technical solutions for the data storage subsystem within it. We have compared these four against a specific criterion: the ease with which a new document structure could be added to the publication domain of the system. The benefits of one of the data storage solutions, called document structure independent data modeling, appear to be spectacular.
     Yet, proclaiming document structure independent data modeling the undisputed champion, would be untenable. Other data storage options have advantages too, since there are many possible criteria for judging such a strategy.
     
     
    A. Ease of extraction process programming. Is the SGML form available immediately, or does it require an extraction program? If the latter, is such a program a part of the approach? If so, does this cover all possible extractions (e.g. omissions of temporarily unwanted content parts)?
    B. Programming-free data input enabling. Does the approach lend itself to the immediate use of standard tools? For instance, does defining a DTD in the data storage subsystem require programming, or can this be done by supplying data?
    C. Cost of required infrastructure. For instance, is a DBMS required?
    D. Intuitiveness of the input tool. Will a user who is adapting the database be confronted with the terminology of: 1) SGML (which is not bad), 2) the publication structure (which is good), or 3) the problem domain (which is superior)?
    E. Structural flexibility. How much effort, and of what kind, is involved in adding a DTD to the publication domain, or in altering an existing one, in cases where the content of the intended documents is already present in the database?
    F. Elements hierarchy sharing. Does the approach support discriminating between different target groups, e.g. multi-country or multi-language publishing? If the latter, does it also support translation process management (like using a synchronization mechanism between different language versions of a single text element)?
    G. Content sharing. Does the approach support arbitrary sharing of content, not only within one document or among documents of one type, but even between documents of different types?
    H. Data integrity assurance. Does the approach imply that consistency of the entire set of data is maintained? Are the data protected from "quick and dirty" mutations?
     Rating the four strategies on all of these criteria will lead to a table much like the following.
     
    Property   Style 1   Style 2   Style 3   Style 4
    A          ++        0         +         0
    B          ++        0         +         0
    C          ++        0         0         0
    D          0         +         0         ++
    E          --        -         0         +
    F          0         ++        ++        ++
    G          0         ++        ++        ++
    H          0         +         +         +
     This table brings us to a more nuanced conclusion.
     
     
  • It is hard to think of any situations where the second option, i.e. document structure driven data modeling, will come out best. It is not superior on any of the criteria. Meanwhile, a system like that is just as application-specific, and hence unique, and hence expensive, as one designed with document structure independence. So, an organisation that is willing to invest so much effort in a data storage subsystem is likely to want its database to be document structure independent.
  • Storage of text files is especially interesting as a low-cost entrance into the world of SGML. There is no DBMS required, and no extraction process programming needs to be done. Standard editing tools, both for the documents and for the DTD, will do as a data input application. As a disadvantage, as soon as any new functional requirements emerge, this technique falls short on various dimensions of flexibility.
  • If multi-language publishing and/or arbitrary content sharing are required, but the need for structural flexibility is neither present nor expected to arise, then a standard SGML-oriented database application for the storage of structural, hierarchical, and content data is likely to be the most cost-effective solution.
  • Document structure independent data modeling has the most advantages of all, except for its price. It requires a genuine DBMS, a tailor-made application, and tailor-made extraction programs. These things can make a system expensive. But for certain types of organizations, the advantages are well worth such an investment. Once a tailor-made application is available, the users can input the data while continuing to think in their own vocabulary, instead of switching to publishing jargon. Data integrity checking is built in to the maximum possible extent. Several kinds of content flexibility, like multi-language and multi-country publishing, target group dependent document composition, and arbitrary content sharing among publications, are natural consequences. And, last but not least, document structure independent data modeling is the approach that yields the highest form of structural flexibility.
