[topicmapmail] Autogenerating Topic Maps

Dan Corwin dan@lexikos.com
Thu, 27 Feb 2003 22:12:57 -0500


> Premnath Raghavendran wrote:
> 
> How do I autogenerate Topic Maps?  My data source is a 
> collection of word documents on various subjects.

That depends critically on which information in your corpus you want to
use when autogenerating the topic map.  Let's assume you want one that
just holds word counts (a trivial case).  Conceptually, you might write
this code:

  1) create an empty Topic Map, TM
  2) for each document found in the corpus:
     A) make a new corresponding "document" topic, D, in TM
     B) add to D some selected, document-embedded metadata  
     C) for each word found within the given document:
        i) look in TM for a "word" topic, W, named by the word
           a) if found, increment its contained word count
           b) if not, build a new W topic with count=1
       ii) in the TM, associate D with W
 
When this stops, TM will hold a topic for each word and document.
Thanks to step 2C.ii, its associations will also say which words each
document holds.
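
For concreteness, here is a minimal Python sketch of the steps above.  It
is only an illustration: the dict-based "topic map", the name
build_topic_map, and the use of text length as stand-in metadata for
step 2B are all assumptions of mine, not part of any Topic Maps toolkit.

```python
import re


def build_topic_map(corpus):
    """Build a toy topic map from a dict of {document name: plain text}.

    The returned structure is a plain dict, a hypothetical stand-in
    for a real topic map engine.
    """
    # Step 1: create an empty topic map.
    tm = {"documents": {}, "words": {}, "associations": set()}

    # Step 2: visit each document in the corpus.
    for name, text in corpus.items():
        # Steps 2A and 2B: a "document" topic plus some metadata
        # (the text length is only a placeholder for real metadata).
        tm["documents"][name] = {"length": len(text)}

        # Step 2C: visit each word in the document.
        for word in re.findall(r"[a-z']+", text.lower()):
            # Step 2C.i: find or create the "word" topic, then count it.
            topic = tm["words"].setdefault(word, {"count": 0})
            topic["count"] += 1
            # Step 2C.ii: associate the document with the word.
            tm["associations"].add((name, word))

    return tm
```

Calling build_topic_map({"a.txt": "Topic maps map topics"}) yields one
"document" topic, four "word" topics, and four associations.  Getting
plain text out of the Word files is a separate step not shown here.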

> If I want to develop a logic to automatically generate topic 
> maps for these various documents, how do I do?

Your phrasing suggests goals more complex than word counts, but I can't
tell what.  Please write back and be more specific.  If your answer can
be made to fit into a variation of the pseudocode, you'll have the germ
of a spec.  These links may help to clarify the possible options:

  [1] http://www.ontopia.net/topicmaps/explorations.html
  [2] http://www.lexikos.com/nlptools.jsp

> An article at Ontopia suggests to first create RDF out 
> of them & then move ahead to Topic Maps.

The first article on [1] under "autogeneration" indeed does suggest
using RDF when the data source is structured.  If your corpus is full of
data tables, that might apply.  But I would expect strings and their
locations within the corpus to be more useful.

Steve Pepper's article (just below that one) cites embedded metadata as
a source for step 2B.  If you can get at it, I'd add the semi-structured
markup in Word documents (exposed if you save one to HTML, WordPerfect,
etc.).  It may let you find and add (e.g.) "headings", to locate words
more precisely within your TM.
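
As a rough sketch of the headings idea, the Python below pulls heading
text out of a document saved as HTML, using only the standard library's
html.parser.  It assumes the export marks headings with ordinary
h1..h6 tags, which you should check against your own files; the names
HeadingCollector and extract_headings are mine.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None   # heading tag we are inside, if any
        self._parts = []       # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in self.HEADING_TAGS:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._parts).split())
            self.headings.append((int(tag[1]), text))
            self._current = None


def extract_headings(html_text):
    """Return the headings found in html_text, in document order."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return collector.headings
```

Each heading found this way could become its own topic, associated with
the document topic D and with the words that follow it.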

Steve also mentions unstructured text as a source.  If your pseudocode
spec replaces "word" with "phrase", "name", or even "root word", it will
cross a line into NLP, a realm loaded with complex issues.  Here, the
R&D costs get larger, and accuracy may vary widely with the details of
your specs, code, and corpus.  But there is no free lunch, and for a
large corpus, NLP approaches often make economic sense.
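
To show where that line gets crossed, here is a deliberately naive
Python sketch that swaps "word" for "phrase" by counting runs of n
adjacent words.  This is a crude stand-in, not real NLP: it has no idea
which runs are meaningful phrases or names.  The function name
phrase_counts and the default n=2 are my own assumptions.

```python
import re
from collections import Counter


def phrase_counts(text, n=2):
    """Count every run of n adjacent words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the text; texts shorter than
    # n words simply produce no phrases.
    phrases = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(phrases)
```

Most of the "phrases" this finds will be noise, which is exactly the
accuracy problem that real NLP modules exist to address.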

Operator guidance boosts accuracy, so if you can accept the "assisted
generation" of TMs instead of their "autogeneration", use it.  [2] shows
the kinds of modules needed (in the green and blue areas).  What they
jointly build are symbolic models of what each given "phrase" refers to
- its subject.

Such models could be given in RDF, or some other formal language
specific to the NLP processor.  Regardless of those details, such
phrase-referent models are basically just what you need to find or
build a topic that will represent the subject inside your new TM. 
That's why NLP is such a useful tool here!

Typically, to use NLP, you must separately model beforehand (e.g., using
[2]'s yellow modules) all the *types* of subjects whose references you
will seek in your corpus.  The more effort you put into this task, the
less "autogenerated" your TMs will seem.  Your specs should thus address
how much prep work you will accept.

> Please help me out from implementation point of view. 
> How exactly do I go about it?

More specs are needed to decide on implementation.  Sorry, but the range
of options here is broad, as you can see.

Cheers,

Dan Corwin