[topicmapmail] Generation of Topic Maps and Machine Learning

Dipl.-Wirtsch.-Inf. Lutz Maicher [Universität Leipzig] maicher@informatik.uni-leipzig.de
Sun, 31 Oct 2004 12:27:28 +0100


Hi Aayush,

* From: Aayush Puri
* Subject: [topicmapmail] Generation of Topic Maps and Machine Learning

>I wanted to know the possibilities one has
>in order to generate topic maps from a given
>source of textual documents. So what I will
>have is a text source and what I am interested
>in doing is to generate topic maps between
>"certain" topics.

At the University of Leipzig we have two approaches to generating Topic Maps
automatically. Both approaches are statistical and are based on finding the
"terminology" used in the relevant corpus. "Terminology" (from our point of
view) comprises all words which occur significantly more often than expected.
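
To make "significantly more often than expected" a bit more concrete, here is
a minimal Python sketch. It assumes that the expectation comes from a
background word list (a reference corpus) and uses a plain smoothed frequency
ratio as a stand-in for a proper significance test. The function name
`terminology`, the thresholds `min_ratio` and `min_count`, and the toy data are
illustrative assumptions only.

```python
from collections import Counter


def terminology(corpus_tokens, reference_tokens, min_ratio=3.0, min_count=2):
    """Return {word: ratio} for words over-represented in the corpus.

    Hypothetical helper: the ratio of relative frequencies is only a simple
    stand-in for a real significance measure.
    """
    corpus_counts = Counter(corpus_tokens)
    ref_counts = Counter(reference_tokens)
    n_corpus = len(corpus_tokens)
    n_ref = len(reference_tokens)
    vocab_size = len(set(corpus_counts) | set(ref_counts))

    terms = {}
    for word, count in corpus_counts.items():
        if count < min_count:
            continue  # too rare to say anything about this word
        observed = count / n_corpus
        # Add-one smoothing so that words missing from the reference
        # corpus do not cause a division by zero.
        expected = (ref_counts[word] + 1) / (n_ref + vocab_size)
        ratio = observed / expected
        if ratio >= min_ratio:
            terms[word] = ratio
    return terms


# Toy data (assumption): a tiny "medical" corpus and a general reference.
corpus = "the patient has stenosis the stenosis is severe the patient".split()
reference = ("the cat is on the mat the dog is in the house "
             "and the bird has a nest").split()

print(sorted(terminology(corpus, reference)))  # ['patient', 'stenosis']
```

Common words such as "the" are frequent in both corpora and therefore drop
out; only the words that are unusually frequent in the corpus remain.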

The first approach is used especially for large text corpora (some 100k of
plain text). We call this method reference corpus analysis. We compare the
frequency with which each word occurs in the corpus with its frequency in a
reference corpus (this might be a corpus of common German language, or a
corpus of common German language plus some common medical terms, so that only
terminology about special medical issues passes the filter). Every term found
this way becomes a Topic in the Topic Map. Associations are calculated from
so-called "co-occurrences". Each term co-occurs with every other term (in a
sentence or a document) with a given frequency. If this frequency is
significantly higher than the one observed in our reference corpus, then we
assume that a special relationship exists between these two terms (in the
context of the given corpus), and we create an Association between them
(without any type etc.). This method can be refined by named entity detection
etc. If you are interested, I refer you to our article at KnowTech '03 [1].
Sorry, this article is in German, but the references might be of interest to
you (some are in English).
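
The co-occurrence step can be sketched roughly as follows. This is a minimal
Python illustration under stated assumptions, not the published method:
sentences are given as lists of tokens, the set of terms is already known,
and a simple ratio of co-occurrence rates (with a floor for pairs never seen
in the reference corpus) stands in for a real significance test. The names
`cooccurrence_rates`, `associations` and `min_ratio` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rates(sentences, terms):
    """Fraction of sentences in which each unordered term pair co-occurs."""
    terms = set(terms)
    pair_counts = Counter()
    for sentence in sentences:
        present = sorted(set(sentence) & terms)
        pair_counts.update(combinations(present, 2))
    n = max(len(sentences), 1)
    return {pair: count / n for pair, count in pair_counts.items()}


def associations(corpus_sentences, reference_sentences, terms, min_ratio=2.0):
    """Return untyped term pairs that co-occur unusually often in the corpus."""
    corpus_rates = cooccurrence_rates(corpus_sentences, terms)
    ref_rates = cooccurrence_rates(reference_sentences, terms)
    # Assumption: a pair never seen in the reference corpus is treated as
    # if it had been seen there once, to avoid dividing by zero.
    floor = 1 / max(len(reference_sentences), 1)
    found = []
    for pair, rate in corpus_rates.items():
        baseline = max(ref_rates.get(pair, 0.0), floor)
        if rate / baseline >= min_ratio:
            found.append(pair)
    return sorted(found)


# Toy data (assumption), one token list per sentence.
terms = {"stenosis", "patient", "therapy"}
corpus_sentences = [
    ["stenosis", "patient", "severe"],
    ["stenosis", "patient"],
    ["patient", "therapy"],
    ["stenosis", "therapy", "patient"],
]
reference_sentences = [
    ["patient", "therapy"],
    ["patient", "therapy", "doctor"],
    ["the", "cat"],
    ["patient", "waits"],
]

print(associations(corpus_sentences, reference_sentences, terms))
# [('patient', 'stenosis')]
```

Each returned pair would become one untyped Association between two Topics.
The pair "patient"/"therapy" is dropped because it is just as common in the
reference sentences as in the corpus.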

The second approach is for small corpora (I think that most applications
of automatic Topic Map generation deal with small corpora). We call this
approach "terminology extraction". Because our methods are statistical, they
normally run into limitations on small corpora. Therefore the TOMATO system,
a prototype implemented on top of this approach, works with some relevance
feedback functions. This feedback is stored in a Topic Map and can be
exchanged between different clients (e.g. if two people join a common project
etc.). If you are interested, I refer you to our article at I-Know '04 [2].

In a research project in which my chair participates, a system was created
that generates such networks of terms and associations (this is not called a
Topic Map at the moment, but it strongly resembles one) from spoken language
*on the fly*. While two people talk, a "Topic Map" of their discussion is
created and shown with a video projector. Additional information (extracted
from the textual knowledge base of the enterprise using this technique) is
provided inside this network, attached to the relevant terms, to support the
discussion.

This all sounds very practicable and may be relevant in some use cases. But
you have to bear in mind that our approaches work solely on the "syntactical
surface" of these texts. Certainly, the "real" Subjects are not detected
(please always keep the relation between Topics and Subjects in mind when
thinking about automatic generation of Topic Maps).

>For simplicity the topics among
>which I need to draw associations are limited
>and are pre-defined (at the time of providing the textual sources).

Our approach "finds" new Topics and their expected relationships inside a
corpus. The question is what you mean by "the topics are limited" and by the
associations being "pre-defined". Please specify these requirements in more
detail!

I hope that helps a bit,
Greetings from Leipzig,
Lutz

[1] http://www.informatik.uni-leipzig.de/~maicher/forschung2.html
"Automatische Erstellung individualisierter, domänenspezifischer Topic Maps
..."
[2] http://www.informatik.uni-leipzig.de/~maicher/forschung2.html "Moving
Topic Maps to mainstream - Integration of Topic Map Generation in the User's
Working Environment"

_________________________________________________________________________________
Dipl.-Wirtsch.-Inf. Lutz Maicher
Graduiertenkolleg Wissensrepräsentation | Universität Leipzig
Abteilung Automatische Sprachverarbeitung | Institut für Informatik |
Augustusplatz 10-11 | 04109 Leipzig

fon 0341 97 32 303 | mail maicher@informatik.uni-leipzig.de
http://www.informatik.uni-leipzig.de/~maicher/