[topicmapmail] Merging of Distributed Topic Maps based on the Subject Identity Measure (SIM) Approach

Dipl.-Wirtsch.-Inf. Lutz Maicher [Universität Leipzig] maicher@informatik.uni-leipzig.de
Wed, 20 Oct 2004 10:56:27 +0200


Hi Lars Marius,

> * maicher@informatik.uni-leipzig.de
> |
> | We are still interested in a vital discussion about the problem we
> | want to solve. We don't share the optimism that PSI repositories
> | will be widely adopted in distributed and heterogeneous
> | communities. Therefore we introduced our SIM approach. The SIM is a
> | measure, which determines how closely related the Subject of two
> | Topics might be. This decision is made automatically and only based
> | on the Content provided by the regarding Topic (Maps). For more
> | details, we suggest a closer look to our papers.
>
> I liked your approach quite a lot, and thought it very interesting.
> I'd really like to try it out and see to what extent it actually works
> on real data. (There are some topic maps in the Omnigator that have
> overlapping subjects without having the same PSIs. We could also try
> it on the XML conference papers topic map, instead of the existing
> ad-hoc heuristics used for merging there.)

To evaluate approaches for automatic merging you need an objective criterion
for testing purposes. In our example we used the ISDN to decide whether the
Topics in question have to be merged. Without such an objective criterion
you can't calculate precision, recall and F-value, and without these
measures you can't assess the quality of the approaches. We need testbeds
which provide that objective criterion.
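
To make the three measures concrete, here is a minimal sketch in Python. It
assumes that the objective criterion yields a set of Topic pairs which really
have to be merged, and that the merging approach proposes another set of
pairs. The function name, the pair identifiers and the choice of the balanced
F-value (F1) are assumptions made for illustration, not taken from our paper.

```python
def evaluate_merges(proposed, expected):
    """Return (precision, recall, f_value) for two sets of topic pairs."""
    proposed, expected = set(proposed), set(expected)
    true_positives = len(proposed & expected)
    # Precision: share of the proposed merges that are correct.
    precision = true_positives / len(proposed) if proposed else 0.0
    # Recall: share of the required merges that were found.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-value: harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical data: pairs that share the objective criterion ...
expected = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
# ... and pairs proposed by a similarity approach.
proposed = {("a1", "b1"), ("a2", "b2"), ("a4", "b9")}
precision, recall, f_value = evaluate_merges(proposed, expected)
```

Without the `expected` set, which only a testbed with an objective criterion
can supply, none of the three values can be computed.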

For now we have to concentrate on practical approaches (which make painful
simplifications). But many Subjects can't be sharply delimited (abstract
Subjects in particular have no clear boundary) [see 1]. For such cases we
can propose similarity approaches, but we will never be able to assess the
real matching quality. (Research on inter-indexer consistency has shown that
an *objective* assessment of such matching proposals is impossible.)

> One thing I found strange was that you defined your own measure for
> string distance instead of reusing existing measures such as
> Levenshtein distance. Why was that?

We have two requirements for String similarity, which are met by our
approach:

1. Language independence. Our approach should work e.g. for two Catalan
Topic Map Fragments as well as for English Topic Map Fragments. This
requirement excludes approaches which use thesauri, ontologies, lists of
stop words etc.

2. Inexpensive. The similarity calculation has to be computationally very
cheap.

The importance of the second requirement is shown by the testbed discussed
in the article. We calculated merging candidates from two Topic Maps which
consist of 300 Topics each. Even for this small example a SIM has to be
calculated for approx. 90,000 Topic pairs (300 x 300). For each calculation
a lot of String similarities have to be computed (for each possible pair of
Topic Names and each possible pair of Occurrences). Especially if
Occurrences are small texts etc., we need a very "cheap" String similarity
measure. What happens if Topic Maps with millions of Topics join?
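
A rough back-of-the-envelope sketch of this growth, in Python. The numbers of
Topic Names and Occurrences per Topic are purely hypothetical; only the 300
Topics per map come from the testbed.

```python
def count_string_comparisons(topics_a, topics_b,
                             names_per_topic, occurrences_per_topic):
    """Estimate the work for comparing every topic of map A with every topic of map B."""
    topic_pairs = topics_a * topics_b
    # Per topic pair: every name against every name,
    # plus every occurrence against every occurrence.
    per_pair = names_per_topic ** 2 + occurrences_per_topic ** 2
    return topic_pairs, topic_pairs * per_pair


# Testbed size from the article; 2 names and 3 occurrences per topic are assumed.
small_pairs, small_comparisons = count_string_comparisons(300, 300, 2, 3)
# Two maps with one million topics each, same assumed averages.
large_pairs, large_comparisons = count_string_comparisons(10**6, 10**6, 2, 3)
```

Under these assumptions the small testbed already needs more than a million
String comparisons, and the number of Topic pairs grows with the product of
the two map sizes.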

For computation of string similarity one can decide whether to use an
approach which acts on the syntactical surface (our approach) or to use an
approach which makes semantic comparisons. The latter case is definitely not
language independent (our first requirement), but might yield good results
for the detection of similar Subjects.

Summarising, if you have good ideas for string similarity measures which are
cheap (strong requirement) and language independent (weak requirement), I'm
very interested in them. Perhaps we can use the Dice coefficient on the
basis of trigrams?
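
A minimal sketch of what such a Dice coefficient on character trigrams could
look like in Python. This is not the measure from our paper, only an
illustration of the idea. It uses sets of trigrams (repeated trigrams count
once) and no padding at the string borders; both are assumptions, as are the
function names.

```python
def trigrams(text):
    """Return the set of character trigrams of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def dice_trigram_similarity(a, b):
    """Dice coefficient 2*|A & B| / (|A| + |B|) over character trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        # Both strings are shorter than three characters: fall back to equality.
        return 1.0 if a == b else 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


similar = dice_trigram_similarity("topic maps", "topic map")
different = dice_trigram_similarity("night", "nacht")
```

It works on the syntactical surface only, needs no word lists, and its cost is
linear in the length of the two strings.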

> Also, I'm uncertain about the URI similarity measure. If two URIs are
> nearly the same, what does that tell you? It's unlikely to be because
> the author mistyped the URI, because mistyped URIs don't work at all.
> And if they both work, URI equivalence rules will reveal this in some
> cases (which your measure does not take into account). Finally, if you
> consider the subject identifiers
>
>  (1) http://psi.example.org/something/#european-union
>  (2) http://psi.example.org/something/#african-union
>  (3) http://psi.noe.no/other/#european-union
>
> then (1) and (3) are much more likely to identify the same subject
> than (1) and (2) are.

That's absolutely right. But first, we have to distinguish between the URI
and the information resource which is referenced by this URI (if one
exists). Both can make some statements about the Subject of a Topic. I think
that analysing the referenced information resources might be more fruitful
than analysing the URIs themselves. What's your opinion?

Second, we have to distinguish between the contexts in which these URIs
occur. If URIs are used as Subject Indicators, two URIs which differ only in
the fragment (after #) have to be regarded as representing different
Subjects (because the owner of the namespace knows the pragmatics of the
URIs and decided to use two different Subjects). But if these different URIs
are used as Occurrences we can't decide as strictly. And finally, if URIs
are used to reify Topic Map Fragments we have to treat them in a completely
different fashion.
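
For the Subject Indicator case, the distinction can be sketched in Python by
splitting each URI into its part before the # and its fragment, using
`urldefrag` from the standard library. The function name and the four labels
are hypothetical; how each case should influence the SIM is left open here.

```python
from urllib.parse import urldefrag


def compare_subject_indicators(uri_a, uri_b):
    """Classify two subject indicator URIs by namespace and fragment."""
    if uri_a == uri_b:
        return "identical"
    base_a, fragment_a = urldefrag(uri_a)
    base_b, fragment_b = urldefrag(uri_b)
    if base_a == base_b:
        # Same owner, deliberately different fragments: different Subjects.
        return "same namespace, different fragment"
    if fragment_a and fragment_a == fragment_b:
        # Different owners chose the same fragment: a possible merge candidate.
        return "different namespace, same fragment"
    return "unrelated"


uri_1 = "http://psi.example.org/something/#european-union"
uri_2 = "http://psi.example.org/something/#african-union"
uri_3 = "http://psi.noe.no/other/#european-union"
case_1_2 = compare_subject_indicators(uri_1, uri_2)
case_1_3 = compare_subject_indicators(uri_1, uri_3)
```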

To exclude mistyping we can assume that all URIs which have a Levenshtein
distance of zero or one have to be regarded as the same URI (for example).
But the computation of this distance is expensive.
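
One remark on the cost: the general Levenshtein distance needs a table whose
size is the product of both string lengths, but the special case "distance
zero or one" can be decided in a single pass. A minimal sketch in Python (the
function name is an assumption):

```python
def within_one_edit(a, b):
    """Return True if the Levenshtein distance of a and b is 0 or 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make sure a is the shorter (or equally long) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution: the rest after the mismatch must be identical.
        return a[i + 1:] == b[i + 1:]
    # One insertion into a: skip a single character of b.
    return a[i:] == b[i + 1:]


typo = within_one_edit("http://psi.example.org/something/#european-union",
                       "http://psi.example.org/something/#europaen-union")
other_subject = within_one_edit("european-union", "african-union")
```

Note that a transposition of two letters counts as two edits here, so such a
typo would not be caught with a threshold of one.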

> Another consideration is that I think types are extremely important.
> If the names are the same but the types are disjoint (person and
> place, say) then you can safely ignore the names. You might even want
> to make the algorithm consider typing topics first, and only
> afterwards go after the instances.

I think the inclusion of structural information (types and neighbourhood via
associations etc.) is very important. I have some good ideas which should be
discussed later.

> Not sure if this is helpful, but it may be worth considering, if you
> haven't already.

Thank you for your ideas. We are very interested in discussing our
research.

Lutz

[1]
http://www.idealliance.org/papers/extreme03/xslfo-pdf/2003/Kent01/EML2003Kent01.pdf

_________________________________________________________________________________
Dipl.-Wirtsch.-Inf. Lutz Maicher
Graduiertenkolleg Wissensrepräsentation | Universität Leipzig
Abteilung Automatische Sprachverarbeitung | Institut für Informatik |
Augustusplatz 10-11 | 04109 Leipzig

fon 0341 97 32 303 | mail maicher@informatik.uni-leipzig.de
http://www.informatik.uni-leipzig.de/~maicher/