[topicmapmail] Testbed for Subject Identity Measure
Dipl.-Wirtsch.-Inf. Lutz Maicher [Universität Leipzig]
maicher@informatik.uni-leipzig.de
Tue, 29 Jun 2004 16:08:18 +0200
Hi Steve,
> Thank you for sharing information about this very interesting work. I
> regret I cannot provide the topic maps you're looking for. I wish you
> every success.
I'm really interested in a lively discussion about our current work.
> The draft Topic Maps Reference Model (TMRM) has much to say about how
> to disclose "Topic Map Applications" -- ways of recognizing when
> topics have the same subjects and should therefore be merged. I would
> be interested to know your reaction to it, since both it and your
> project are concerned with subject identity. An introduction to the
> Topic Maps Reference Model can be found at
> http://www.coolheads.com/SRNPUBS/ontolog040610
I believe that our approach doesn't compete with the TMRM (or the TMDM),
because our SIM only indicates how closely related the Subjects of Topics
in two distributed Topic Maps may be. If a user decides (on the basis of the
SIM) that two Topics represent the same Subject, these Topics are given
identical Subject Identifiers. This means that our SIM sits on top of the
given standards.
We make use of two statements in the TMDM: "A subject indicator is an
information resource that is referred to from a topic map in an attempt to
unambiguously identify the subject of a topic to a human" (ch 5.4.2). And
"Merging beyond the minimal merging required by the rules of Clause 6 is
freely allowed. Most commonly this will be done by inferring the subject of
the topics from their characteristics." (ch 5.4.1)
The equality rules of the TMDM hold if, e.g., two topics have a pair of
identical subject identifiers. OK. But a subject identifier is a locator for
the subject indicator. We assume that in distributed environments topic map
authors use different subject indicators with different subject identifiers
that nevertheless refer to the same subject. With the SIM we provide a
similarity measure which indicates the closeness of the Subjects of two
Topics, derived only from a statistical analysis of the content of their
topic items (which means all topic characteristics, subject locators and
the content of the subject indicators).
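A minimal sketch of what a content-based similarity of this kind could look
like. This is not the SIM itself, whose statistics are not described in this
mail: plain cosine similarity over term counts is swapped in as a stand-in,
and the topic dictionaries with their field names ("names", "occurrences",
"subject_locators", "indicator_texts") are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical field names for the textual content of a topic item.
FIELDS = ("names", "occurrences", "subject_locators", "indicator_texts")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_vector(topic):
    """Count the terms found in all textual fields of a topic."""
    counts = Counter()
    for field in FIELDS:
        for value in topic.get(field, []):
            counts.update(tokenize(value))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if one is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two topics from different (imaginary) topic maps, with no shared identifier.
topic_a = {"names": ["Regenerative fuel"],
           "occurrences": ["Debate on regenerative fuel subsidies"]}
topic_b = {"names": ["Renewable fuel"],
           "occurrences": ["Fuel subsidies debate"]}

score = cosine_similarity(topic_vector(topic_a), topic_vector(topic_b))
```

The score only suggests how close the two Subjects might be; the decision to
merge stays with the user.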
It is an interesting question whether a measure like the SIM can be part
of the disclosure of a topic map application, i.e. whether a SIDP of a given
topic could be the logical value of the question "Is the SIM greater than a
threshold?". But these ideas are outside our focus at the moment.
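The threshold question can be sketched as a simple predicate over similarity
scores that have already been computed. Everything here is an assumption made
for illustration: the function name, the topic identifiers, the score values
(invented, not measured) and the threshold of 0.5.

```python
def merge_candidates(scores, threshold):
    """Return the topic pairs whose similarity score exceeds the threshold.

    `scores` maps a (topic_id_a, topic_id_b) pair to a similarity value.
    All names and values are hypothetical.
    """
    return sorted(pair for pair, score in scores.items() if score > threshold)


# Invented example scores for pairs of topics from two topic maps.
scores = {
    ("tm1:fuel", "tm2:renewable-fuel"): 0.68,
    ("tm1:fuel", "tm2:taxation"): 0.12,
}
candidates = merge_candidates(scores, threshold=0.5)
```

A pair is proposed only when its score is strictly greater than the
threshold; whether the topics are then actually merged remains a separate
decision.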
We think that the SIM is interesting when two distributed topic map authors
have made topic maps which represent subjective assertions about a similar
domain. Because subjects can't be separated sharply, the authors will
introduce topics with slightly different subjects, base names, occurrences
etc. With the help of the SIM we want to simplify the decision about which
topics should be merged when these topic maps are brought together.
Kal provided some data for a testbed: a collection of topic maps about the
debates of the British parliament. Unfortunately these topic maps describe
only the structure of these debates, like "MP 123 spoke in debate 334" and
"MP 123 votes 'aye' in division 344". If we want to merge topic maps of
different days, the problems sketched above don't arise because all
subjects are either sharp (MP 123) or not mergeable (debate 334). To get
around this sharpness we had the idea of merging from another perspective:
a debate is an occurrence of a specific subject, e.g. regenerative fuel.
For each debate there is a topic. The SIM of two debates which have a
similar subject (like regenerative fuel) should be high. We discarded this
idea because, on the one hand, the topics for the debates are very poor:
they have only occurrences (no names etc.). This means that we would end up
clustering the topics on the basis of the strings in the occurrences. On
the other hand, we have no reference data ("ground truth") against which to
calculate precision and recall.
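To make the need for reference data concrete, here is a minimal sketch of how
precision and recall could be calculated for a set of proposed merges once a
hand-made list of correct merges exists. The function name and the topic
identifiers are hypothetical, and the example pairs are invented.

```python
def precision_recall(proposed, reference):
    """Precision and recall of proposed merge pairs against reference pairs.

    Pairs are treated as unordered, so ("a", "x") equals ("x", "a").
    """
    proposed = {frozenset(pair) for pair in proposed}
    reference = {frozenset(pair) for pair in reference}
    true_positives = len(proposed & reference)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Invented data: merges suggested by a measure vs. merges judged correct by hand.
proposed = [("a", "x"), ("b", "y"), ("c", "z")]
reference = [("x", "a"), ("b", "y"), ("d", "w")]
precision, recall = precision_recall(proposed, reference)
```

Without such a reference list neither value can be computed, which is why
the parliament data is not sufficient for the testbed.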
Therefore we are still searching for other material for the SIM testbed.
It need not be a ready-to-use topic map; we can transform interesting data
into topic maps. We need data where different persons have made assertions
about subjects in a closed domain. They should have named the subjects and
declared occurrences of these subjects in the given domain. Does anyone have
a good idea where we can find such data?
Lutz Maicher
____________________________________________________________________________
Dipl.-Wirtsch.-Inf. Lutz Maicher
Graduiertenkolleg Wissensrepräsentation | Universität Leipzig
Abteilung Automatische Sprachverarbeitung | Institut für Informatik |
Augustusplatz 10-11 | 04109 Leipzig
fon 0341 97 32 303 | mail maicher + informatik.uni-leipzig.de
http://www.informatik.uni-leipzig.de/~maicher/