| Pieper Frank |
Document Structure Independent Data Modelling |
Abstract: |
| Consider a class of database publishing systems that exploit the principles of generalized markup. Each system in this class encompasses three main subsystems: a data storage subsystem, processes that transform the stored data into SGML, and processes that transform the SGML documents into final publication form. |
| We are interested in the impact that the chosen style of storing data has on the structural flexibility of the database publishing system. In other words, we wonder how much effort, and of what kind, is involved in adding a DTD (Document Type Definition) to the publication domain, or in altering an existing DTD. |
| Four styles of data storage are compared, namely: 1) storing text files containing the SGML; 2) a database whose schema has been designed after one or more DTDs; 3) a generally applicable SGML database; and 4) databases designed independently of document structure. Each of these approaches has its own advantages and disadvantages, but regarding structural flexibility, document structure independent data modeling is the winner by far. |
Introduction |
| Ever since the introduction of the database publishing concept in the mid-1980s, generalized markup languages like SGML have played a predominant role in its realization. (In what follows we refer exclusively to SGML, but the material applies just as well to systems that use other, even non-standard, GMLs.) |
| This paper considers a class of database publishing systems with the following architecture. Large amounts of data are stored in a sufficiently organized form: this we call a database. One process converts these stored data into SGML files, and another process completes the publications. See Figure 1. |
[Figure 1]
| Note that in the general case, we do not expect the data storage subsystem to make use of a relational, or other, DBMS (DataBase Management System). Every sufficiently organized collection of data is entitled to the "database" predicate. We are even rather liberal concerning the "sufficiently organized" criterion: if it suits someone's purpose, then it is someone's database. |
| The process of turning the database content into an SGML document for postprocessing is called the extraction process. That process is the prime reason for this study, even though the central subject is the data storage subsystem. We are interested in database design strategies that give the extraction process the greatest possible freedom. |
| Everything that subsequently happens to the extracted SGML in order to get a completed publication is called the production process. It is outside the scope of this paper. |
| One of the great advantages of database publishing is the possibility to reuse content: within one document, among documents of one type, and even among documents of different types. The latter is especially interesting in situations where, at the time of database design, no complete overview exists of current and future DTDs. The ability of a database publishing system to cope with newly defined document structures is called the system's structural flexibility. |
| It is little wonder that the style of data storage largely determines the structural flexibility of a database publishing system. In this paper, four such styles are presented and compared with respect to structural flexibility. |
| But first, an example. |
An Illustrative Example of the Problem Domain |
| Imagine that we are The Combination, a collection of trading companies dealing in, say, office supplies. |
| We do not compete with one another. First of all, we market different things: some of us deal in desks and chairs, others in whiteboards, many in pens and pencils, and some in computer supplies like floppy disks and printer toner cartridges. Those among us with similar product ranges, like the pens and pencils colleagues, distinguish themselves from one another by level of luxury, or price, or other marketing aspects. |
| Some of us are also resellers of selected products of others, e.g. the supplier of floppy disk labels also sells the special soft-tip pens used to write on these labels. |
| Imagine also that we pool our marketing power by publishing a joint catalogue. To this end, we exploit a database publishing system according to Figure 1. |
| The catalogue starts with some general information, not related to a specific supplier, product group, or product. After this common front matter, the catalogue contains a section for each member who is the prime supplier of at least one product. Such a section provides the prime supplier's name and a list of all his products. Each list item describes the product group that the product belongs to, the (other) product data, and the names of any members of The Combination that resell the product. |
| Here's a segment of the label structure of the joint catalogue (see Figure 2). Please ignore the obvious simplicity of the model: an example is only needed to demonstrate some effects, and this one has enough complexity to do so. Greater complexity makes things worse, but not fundamentally different. |
[Figure 2]
| After The Combination has published several annual issues of the Joint Catalogue, it turns out that one of our members, viz. the One Stop Office Shop (OSOS), intends to create a catalogue of its own. OSOS is not the prime supplier of any product, but a reseller of many products of many prime suppliers. |
| The Combination's Presidium permits (under conditions) that OSOS uses the database of our common database publishing system. One of these conditions is that OSOS' catalogue will mention the prime supplier of each product that appears in it. |
| OSOS is user oriented to a greater extent than The Combination. Thus, OSOS organizes its catalogue not by prime supplier, but by product group. The label structure of OSOS' catalogue is depicted in Figure 3. |
[Figure 3]
| It starts with OSOS' own supplier name and a common matter. The remainder is a sequence of product groups. Each product group consists of the product group description and the range of products within that group. Of each product, OSOS' catalogue provides the product data and the prime supplier's name. |
| Now, let us see how much trouble OSOS has to get through in order to get the data correctly in this new structure. We will see that this very much depends on which data storage strategy was originally chosen by The Combination. |
Four Data Storage Strategies |
Plain SGML Text Files |
| The lowest sophistication level of data storage is to use text files that directly contain the SGML required by the production process. |
| Possible reasons to choose this strategy |
| But the simplicity of this approach also induces its weakness. For how much effort, and of what kind, is involved in adding a DTD to the publication domain, or in altering an existing one? Let us look at the OSOS case. |
| Two options exist for the realization of the OSOS catalogue: an extraction level solution and a storage level solution, so to speak. |
Extraction Level Solution |
| The first option is to extract the new SGML text from the existing one, i.e. to treat the Joint Catalogue SGML file as the database, and to program a process that renders the OSOS Catalogue SGML file from this database. |
| Unfortunately, no standard tools exist that can transform a given SGML file into one with the same content data in a different DTD and element hierarchy. Even if such a tool existed, it would still have to be parametrized with the incoming and outgoing DTDs, and with the transformation function that leads from the one to the other. This last function in particular can be very hard to define rigorously, let alone to specify in computer-readable form. |
| Therefore, third generation level programming is inevitable. Not only that, but every time either one of the two DTDs is altered, the consequences for the extraction program will have to be checked. Usually this implies that the extraction program itself will have to be altered. Or worse: a simplification of the Joint Catalogue, for instance, might result in the OSOS Catalogue no longer being derivable. |
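To give an impression of what such an extraction program involves, here is a minimal Python sketch. It assumes the Joint Catalogue has already been parsed into nested dictionaries (the parsing itself is omitted), and it regroups the content by product group for one reseller. The dictionary layout, the function name `regroup_for_reseller`, and the supplier and product names are all made up for illustration; only the element names follow the text.

```python
from collections import defaultdict

# Hypothetical parsed form of the Joint Catalogue: one entry per prime
# supplier, each with its products. The sample data is invented.
joint_catalogue = {
    "CommonMatter": "General terms of delivery.",
    "PrimeSuppliers": [
        {
            "SupplierName": "DiskLabel Co",
            "Products": [
                {
                    "ProdGroupDescr": "Computer supplies",
                    "ProductData": "Floppy disk labels",
                    "Resellers": ["OSOS"],
                },
            ],
        },
        {
            "SupplierName": "SoftTip Pens",
            "Products": [
                {
                    "ProdGroupDescr": "Pens and pencils",
                    "ProductData": "Soft-tip label pen",
                    "Resellers": ["DiskLabel Co", "OSOS"],
                },
                {
                    "ProdGroupDescr": "Pens and pencils",
                    "ProductData": "Luxury fountain pen",
                    "Resellers": [],
                },
            ],
        },
    ],
}


def regroup_for_reseller(catalogue, reseller):
    """Turn supplier -> product nesting into product group -> product."""
    groups = defaultdict(list)
    for supplier in catalogue["PrimeSuppliers"]:
        for product in supplier["Products"]:
            # Only products that the given member resells are kept.
            if reseller in product["Resellers"]:
                groups[product["ProdGroupDescr"]].append(
                    {
                        "ProductData": product["ProductData"],
                        "SupplierName": supplier["SupplierName"],
                    }
                )
    return {
        "SupplierName": reseller,
        # Assumption: the reseller reuses the joint common matter as is.
        "CommonMatter": catalogue["CommonMatter"],
        "ProductGroups": dict(groups),
    }


osos_catalogue = regroup_for_reseller(joint_catalogue, "OSOS")
```

Note how the function depends on the Resellers element being present in the source document: if a later revision of the Joint Catalogue dropped it, the OSOS Catalogue could no longer be derived this way.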
| That is why we prefer the storage level solution. |
Storage Level Solution |
| In the storage level solution, the stored data is augmented such that it also contains the OSOS catalogue SGML text. |
| Holding on to the idea that no extraction program is required implies that the OSOS catalogue is represented as an SGML text that contains copies of many subdocuments (supplier data, product group data, and product data, all in various instances) of the Joint Catalogue SGML file. |
| Firstly, this copying is itself highly error prone. But what's at least as important: every alteration in one of these subdocuments has to be carried out in both catalogue documents. |
| So this is not the best approach; we had better look for a solution that contains every common subdocument only once. (By the way: note that multiple storage of data was already present in the original Joint Catalogue, namely the SupplierName of Resellers!) |
| Let us try the following. Every common subdocument is represented in a separate text file; and each of the two catalogues is represented in a "framework" text file that directly contains the higher-level elements, but only entity references to the subdocuments. |
| To achieve this, we would first of all have to decompose the JointCatalogue SGML file into a higher-level framework and numerous lower-level subdocuments: common data, product data, product group data, and supplier data. We might decide to create four separate directories, one for each of the four types of subdocument files. |
| Next, the higher-level framework of the OSOSCatalogue would have to be created, and its leaf nodes filled with references to all the correct subdocument files. |
| The new situation is schematically represented in Figure 4. |
[Figure 4]
| At least this technique would relieve us of the two-fold content alterations in common subdocuments. But it would also burden us with an extraction process that we didn't need before, albeit a simple one: namely to combine a catalogue framework file and the subdocuments that it refers to into a complete catalogue SGML file. Also, some changes would still have to be administered in both of the catalogue files and in one of the subdocument directories, namely insertions into, and deletions from, any one of the four sets of subdocuments. |
| Note that we couldn't have predicted beforehand, i.e. during database design, at which levels in the document's internal tree the subdocuments would have to be separated and replaced by references. Hence in order to be prepared for any such change in the publication domain, we would have to decompose the database at many, if not all, levels in the label hierarchy. |
| This leads us, more or less automatically, to the second storage strategy. |
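The simple extraction process mentioned here can be sketched in a few lines of Python. The sketch stands in for what an entity-aware SGML tool would do with declared external entities. It keeps the subdocuments in a dictionary instead of separate files, and the entity names (`supplier1`, `group1`, `product1`), the sample content, and the function `expand` are assumptions for illustration.

```python
import re

# Hypothetical subdocument store: entity name -> SGML fragment. In the
# file-based setup these would be separate text files in four directories.
subdocuments = {
    "supplier1": "<SupplierName>SoftTip Pens</SupplierName>",
    "group1": "<ProdGroupDescr>Pens and pencils</ProdGroupDescr>",
    "product1": "<ProductData>Soft-tip label pen</ProductData>",
}

# Each framework holds the higher-level elements and only references
# to the shared subdocuments.
joint_framework = (
    "<JointCatalogue><PrimeSupplier>&supplier1;"
    "<Product>&group1;&product1;</Product>"
    "</PrimeSupplier></JointCatalogue>"
)
osos_framework = (
    "<OSOSCatalogue><ProductGroup>&group1;"
    "<Product>&product1;&supplier1;</Product>"
    "</ProductGroup></OSOSCatalogue>"
)

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(framework, store):
    """Replace every entity reference by the subdocument it names."""

    def lookup(match):
        # A reference to a subdocument that was deleted raises KeyError.
        return store[match.group(1)]

    return ENTITY_REF.sub(lookup, framework)


joint_sgml = expand(joint_framework, subdocuments)
osos_sgml = expand(osos_framework, subdocuments)
```

Both frameworks share the same three fragments, so a content change is made once. Adding or deleting a product, however, still means touching the store and both framework texts.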
Document Structure Driven Data Modeling |
| One known technique for database publishing is to create a database application where the DTD of the publications is reflected in the database structure. In short, this means that well-chosen labels of the DTD appear in the database design either as table names or as attribute names. Moreover, parent-child relations between labels are modelled in the database either by foreign keys (in cases where every child-label element is necessarily a child of exactly one parent-label element) or by separate element relation tables (in all other cases). The database structure thus derived from the JointCatalogue DTD is depicted in Figure 5. |
[Figure 5]
| One CommonMatter can occur in various issues of the JointCatalogue, hence we have a relation table between the JointCatalogues table and the CommonMatter subdatabase. For analogous reasons, we have relation tables between the JointCatalogues table and the PrimeSuppliers, and between Products and Resellers. On the other hand, every record in Products can only belong to one of the PrimeSuppliers, so this relationship is implemented by means of a simple table reference. A similar argument holds for the ProductData subdatabase: every item in it is intended to describe a property of only one of the Products. |
| Now let us see what a (virtual) database for the OSOSCatalogue would look like if we did not have to reckon with the existence of a JointCatalogue database. See Figure 6. |
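The mapping rules just described can be made concrete with a small schema sketch, here set up in an in-memory SQLite database from Python. The table names follow the text; the column names, the names of the relation tables, and the reduction of each "subdatabase" to a single table are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE JointCatalogues (id INTEGER PRIMARY KEY, issue TEXT);
CREATE TABLE CommonMatter    (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE PrimeSuppliers  (id INTEGER PRIMARY KEY, SupplierName TEXT);
CREATE TABLE Resellers       (id INTEGER PRIMARY KEY, SupplierName TEXT);

-- Child element with exactly one parent: a plain foreign key.
CREATE TABLE Products (
    id INTEGER PRIMARY KEY,
    prime_supplier_id INTEGER NOT NULL REFERENCES PrimeSuppliers(id),
    ProdGroupDescr TEXT
);
CREATE TABLE ProductData (
    id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES Products(id),
    body TEXT
);

-- Child element that may occur under several parents: a relation table.
CREATE TABLE JointCatalogueCommonMatter (
    catalogue_id INTEGER REFERENCES JointCatalogues(id),
    common_matter_id INTEGER REFERENCES CommonMatter(id)
);
CREATE TABLE JointCataloguePrimeSuppliers (
    catalogue_id INTEGER REFERENCES JointCatalogues(id),
    prime_supplier_id INTEGER REFERENCES PrimeSuppliers(id)
);
CREATE TABLE ProductResellers (
    product_id INTEGER REFERENCES Products(id),
    reseller_id INTEGER REFERENCES Resellers(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that the label structure gave rise to.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Because the schema mirrors the document, a member that is both a prime supplier and a reseller ends up with its name stored in two tables.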
[Figure 6]
| The integration of the last two figures results in the (real) database structure that the existing one will have to be turned into, in order that both catalogues can be generated from the one database. It is depicted in Figure 7. |
[Figure 7]
| Note that after the integration, a few optimization steps would have been possible. The ProdGroupDescr attribute, originating from the JointCatalogue database, might have been omitted from the Products table because it can be derived from the ProductGroups table that came from the OSOSCatalogue database. Likewise, we could have left the SupplierName out of the Products table, because it is derivable from the PrimeSuppliers table. And finally, the Resellers table might have been merged with the PrimeSuppliers table, provided that the relationship table between Products and Resellers were simultaneously redirected towards the PrimeSuppliers table. |
| Note also, however, that such optimizing structure changes would imply alteration of the existing extraction program for the JointCatalogue. |
| Summarizing, if The Combination had initially chosen this second data storage approach for the JointCatalogue, then the project of adding the OSOSCatalogue to the publication domain would have consisted of the following steps. |
|
| Quite an effort altogether. Luckily all the programming items in the above list can be carried out at the fourth generation level of programming languages. |
Treating Document Structure as Data |
| Unsatisfied as we are even with the second storage strategy, we inspect the possibility of an SGML-oriented database. The well-known idea is to use a database application that can contain label structure, and document element hierarchy, and document content, as data. The database application, possibly in combination with an SGML editor, is responsible for guaranteeing that the element hierarchy and the content are consistent with the DTD. Such a database application might have a database structure like that in Figure 8. |
[Figure 8]
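As a rough illustration of this idea, the following Python sketch stores label structure, element hierarchy, and content as rows in an in-memory SQLite database. It is not the structure of Figure 8: the tables `Labels`, `LabelStructure`, and `Elements`, their columns, and the helper `add_element` are assumptions. The consistency check is also limited to permitted parent-child label pairs, whereas a real DTD additionally constrains order and occurrence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Labels (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
-- Which label may appear directly under which other label.
CREATE TABLE LabelStructure (
    parent_label_id INTEGER REFERENCES Labels(id),
    child_label_id  INTEGER REFERENCES Labels(id),
    PRIMARY KEY (parent_label_id, child_label_id)
);
-- The element hierarchy and the content, both stored as plain data.
CREATE TABLE Elements (
    id INTEGER PRIMARY KEY,
    label_id INTEGER NOT NULL REFERENCES Labels(id),
    parent_element_id INTEGER REFERENCES Elements(id),
    content TEXT
);
""")

names = ["JointCatalogue", "PrimeSupplier", "SupplierName"]
conn.executemany("INSERT INTO Labels (name) VALUES (?)", [(n,) for n in names])
label_id = dict(conn.execute("SELECT name, id FROM Labels"))

conn.executemany(
    "INSERT INTO LabelStructure VALUES (?, ?)",
    [
        (label_id["JointCatalogue"], label_id["PrimeSupplier"]),
        (label_id["PrimeSupplier"], label_id["SupplierName"]),
    ],
)


def add_element(label, parent_element_id=None, content=None):
    """Insert an element after checking it against the label structure."""
    if parent_element_id is not None:
        allowed = conn.execute(
            """SELECT 1 FROM Elements p
               JOIN LabelStructure s ON s.parent_label_id = p.label_id
               WHERE p.id = ? AND s.child_label_id = ?""",
            (parent_element_id, label_id[label]),
        ).fetchone()
        if allowed is None:
            raise ValueError(f"{label} is not allowed under element {parent_element_id}")
    cursor = conn.execute(
        "INSERT INTO Elements (label_id, parent_element_id, content) VALUES (?, ?, ?)",
        (label_id[label], parent_element_id, content),
    )
    return cursor.lastrowid


root = add_element("JointCatalogue")
supplier = add_element("PrimeSupplier", root)
name = add_element("SupplierName", supplier, "SoftTip Pens")
```

Nothing in these tables is specific to catalogues, which is what makes such a structure generic. Every element of every document is a row that someone has to create and link.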
| At first glance, this appears to be a rather complicated, and hence expensive, solution. Two arguments explain why this impression might not be correct. |
| For one, we must not lose sight of the fact that the examples used here are of an impractical simplicity. In reality, DTDs like that of the Joint Catalogue and OSOS' Catalogue tend to have dozens of labels, possibly even a few hundred. That would give the second storage solution, discussed in the previous subsection, a database structure with a complexity comparable to this one. |
| The second argument is that there is real gain in the fact that this is a very generic database structure, not limited to one publication domain like the previous solution is. Such a database application can be traded as a product, and will thus be less costly per user. An example of such a product is MediaWare's Publishing Base. |
| Let us now look at the consequences of this storage option for The Combination and OSOS. |
| First, The Combination fills the LabelStructure module with the label structure of the Joint Catalogue, and then the other modules with the elements and so on of the Joint Catalogue. |
| When OSOS wants to reuse the content for its own catalogue, the LabelStructure module of the database will have to get filled with a structure that combines the two catalogue label structures in itself. Such a LabelStructure module filling is depicted in Figure 9. |
[Figure 9]
| Problems that are immediate from Figure 9. |
| After these two hurdles are overcome, a third one appears, not as fundamental as the other two but at least as burdensome. It is the same one that was already part of the problem with the two earlier storage strategies. Once the "labelstructural" facilities have been provided for, there is a large amount of work left to be done. The higher-level elements of the OSOS Catalogue have to be added, and subsequently linked to all the correct SupplierName, ProductData, and ProductGroupDescription elements. And even after that, for the rest of their lives, each time a member of The Combination adds a product to the Joint Catalogue, OSOS may have to incorporate it into its own structure too. |
| We will look for a storage strategy that prevents such database size proportional workloads each time a new document structure is imposed on existing content. Since document databases tend to become large, a database architecture that prevents elements-level database manipulation in such cases, is likely to be rewarding even if the thing that has to be done instead, is programming (only once). |
Document Structure Independent Data Modeling |
| The answer is to forget, during database design, that the aim of our system is to publish something. All that we are concerned about is that every piece of relevant data gets a place in the database, and that the database structure reflects the intrinsic coherence between all these pieces of data. |
| This means that the focus of our attention is not the publication domain any more. It's the problem domain. |
| Facts in the problem domain of our example |
| All this leads to the database structure of Figure 10. |
[Figure 10]
| Observe the pleasing simplicity of this database diagram. It elegantly reflects the structure of the information without any bias originating from publication needs. |
| Despite this simplicity, the Joint Catalogue SGML document is easily extracted from this database by a process that can quasi-formally be described as follows. |
| Outermost step (JointCatalogue level) |
| Within this procedure, a subprocedure takes care of one PrimeSupplier subdocument at a time. |
| Next inner step (PrimeSupplier level) |
| This subprocedure also contains a sub-sub-procedure. |
| Next inner step (Product level) |
| Finally, a sub-sub-sub-procedure takes care of the Reseller subdocuments. |
| Innermost step (Reseller level) |
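The four nested levels can be sketched as four small Python functions over an in-memory SQLite database. This is an illustration under assumptions, not the procedure of the original: the problem-domain schema (`Suppliers`, `ProductGroups`, `Products`, `Resells`), the column names, the sample rows, and passing the common matter as a parameter are all invented here; only the element names follow the text.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed problem-domain schema: no catalogue structure in sight.
CREATE TABLE Suppliers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ProductGroups (id INTEGER PRIMARY KEY, descr TEXT);
CREATE TABLE Products (
    id INTEGER PRIMARY KEY,
    data TEXT,
    group_id INTEGER NOT NULL REFERENCES ProductGroups(id),
    prime_supplier_id INTEGER NOT NULL REFERENCES Suppliers(id)
);
CREATE TABLE Resells (
    supplier_id INTEGER REFERENCES Suppliers(id),
    product_id INTEGER REFERENCES Products(id)
);
INSERT INTO Suppliers VALUES (1, 'DiskLabel Co'), (2, 'SoftTip Pens'), (3, 'OSOS');
INSERT INTO ProductGroups VALUES (1, 'Computer supplies'), (2, 'Pens and pencils');
INSERT INTO Products VALUES
    (1, 'Floppy disk labels', 1, 1),
    (2, 'Soft-tip label pen', 2, 2);
INSERT INTO Resells VALUES (3, 1), (3, 2), (1, 2);
""")


def tag(name, body):
    return f"<{name}>{body}</{name}>"


def resellers(product_id):
    """Innermost step: one Reseller element per reselling member."""
    rows = conn.execute(
        "SELECT s.name FROM Resells r JOIN Suppliers s ON s.id = r.supplier_id "
        "WHERE r.product_id = ? ORDER BY s.name",
        (product_id,),
    ).fetchall()
    return "".join(tag("Reseller", tag("SupplierName", escape(n))) for (n,) in rows)


def products(supplier_id):
    """Product level: group description, product data, resellers."""
    rows = conn.execute(
        "SELECT p.id, g.descr, p.data FROM Products p "
        "JOIN ProductGroups g ON g.id = p.group_id "
        "WHERE p.prime_supplier_id = ? ORDER BY p.id",
        (supplier_id,),
    ).fetchall()
    return "".join(
        tag(
            "Product",
            tag("ProdGroupDescr", escape(descr))
            + tag("ProductData", escape(data))
            + resellers(pid),
        )
        for pid, descr, data in rows
    )


def prime_suppliers():
    """PrimeSupplier level: only members that supply at least one product."""
    rows = conn.execute(
        "SELECT DISTINCT s.id, s.name FROM Suppliers s "
        "JOIN Products p ON p.prime_supplier_id = s.id ORDER BY s.name"
    ).fetchall()
    return "".join(
        tag("PrimeSupplier", tag("SupplierName", escape(n)) + products(sid))
        for sid, n in rows
    )


def joint_catalogue(common_matter):
    """Outermost step: common front matter, then one section per supplier."""
    return tag(
        "JointCatalogue",
        tag("CommonMatter", escape(common_matter)) + prime_suppliers(),
    )


sgml = joint_catalogue("General terms of delivery.")
```

The document hierarchy lives entirely in the call structure of these functions; the tables know nothing about it.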
| Which alterations to the database structure and/or the application are necessary, or at least advisable, in order to make extraction of the OSOS Catalogue attainable? |
| None. |
| All that OSOS needs to do, is build an extraction program like the one we just saw for The Combination. Partly, the one for OSOS is even slightly simpler, due to the fact that OSOS' catalogue has a less deeply-nested label structure. But also, OSOS' extraction process is partly more complicated, since only the products for which OSOS is a reseller are to appear in the catalogue. |
| One database instrument to realize this latter requirement is the notion of a view. Let us assume that a database view is created that contains exactly those products for which OSOS is a reseller. Say that it is named OSOSProducts. |
| Outermost step (OSOSCatalogue level) |
| Next inner step (ProductGroup level) |
| Innermost step (Product level) |
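A matching sketch for the OSOS Catalogue follows, again in Python over an in-memory SQLite database. The view name OSOSProducts comes from the text; the underlying schema, column names, sample rows, and function names are assumptions made for illustration, and the common matter is again passed in as a parameter.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed problem-domain schema, unchanged for the new catalogue.
CREATE TABLE Suppliers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ProductGroups (id INTEGER PRIMARY KEY, descr TEXT);
CREATE TABLE Products (
    id INTEGER PRIMARY KEY,
    data TEXT,
    group_id INTEGER NOT NULL REFERENCES ProductGroups(id),
    prime_supplier_id INTEGER NOT NULL REFERENCES Suppliers(id)
);
CREATE TABLE Resells (
    supplier_id INTEGER REFERENCES Suppliers(id),
    product_id INTEGER REFERENCES Products(id)
);
INSERT INTO Suppliers VALUES (1, 'DiskLabel Co'), (2, 'SoftTip Pens'), (3, 'OSOS');
INSERT INTO ProductGroups VALUES
    (1, 'Computer supplies'), (2, 'Pens and pencils'), (3, 'Whiteboards');
INSERT INTO Products VALUES
    (1, 'Floppy disk labels', 1, 1),
    (2, 'Soft-tip label pen', 2, 2),
    (3, 'Magnetic whiteboard', 3, 1);
INSERT INTO Resells VALUES (3, 1), (3, 2), (1, 2);

-- Exactly those products for which OSOS is a reseller.
CREATE VIEW OSOSProducts AS
    SELECT p.* FROM Products p
    JOIN Resells r ON r.product_id = p.id
    JOIN Suppliers s ON s.id = r.supplier_id
    WHERE s.name = 'OSOS';
""")


def tag(name, body):
    return f"<{name}>{body}</{name}>"


def osos_products(group_id):
    """Innermost step: product data plus the prime supplier's name."""
    rows = conn.execute(
        "SELECT v.data, s.name FROM OSOSProducts v "
        "JOIN Suppliers s ON s.id = v.prime_supplier_id "
        "WHERE v.group_id = ? ORDER BY v.id",
        (group_id,),
    ).fetchall()
    return "".join(
        tag("Product", tag("ProductData", escape(d)) + tag("SupplierName", escape(n)))
        for d, n in rows
    )


def product_groups():
    """ProductGroup level: only groups with at least one OSOS product."""
    rows = conn.execute(
        "SELECT DISTINCT g.id, g.descr FROM ProductGroups g "
        "JOIN OSOSProducts v ON v.group_id = g.id ORDER BY g.descr"
    ).fetchall()
    return "".join(
        tag("ProductGroup", tag("ProdGroupDescr", escape(d)) + osos_products(gid))
        for gid, d in rows
    )


def osos_catalogue(common_matter):
    """Outermost step: own supplier name, common matter, product groups."""
    return tag(
        "OSOSCatalogue",
        tag("SupplierName", "OSOS")
        + tag("CommonMatter", escape(common_matter))
        + product_groups(),
    )


sgml = osos_catalogue("Welcome to the One Stop Office Shop.")
```

The only database object added for OSOS is the view; tables and stored rows stay as they were, and a product that a member adds later appears in the next extraction without any element-level work.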
Conclusions |
| We have seen a general architecture for database publishing systems, and four different technical solutions for the data storage subsystem within it. We have compared these four against a specific criterion: the ease with which a new document structure can be added to the publication domain of the system. The benefits of one of the data storage solutions, called document structure independent data modeling, appear to be spectacular. |
| Yet, proclaiming document structure independent data modeling the undisputed champion, would be untenable. Other data storage options have advantages too, since there are many possible criteria for judging such a strategy. |
|
| Rating the four strategies on all of these criteria will lead to a table much like the following. |