[topicmapmail] Merging of Distributed Topic Maps based on the
Subject Identity Measure (SIM) Approach
Lars Marius Garshol
larsga@ontopia.net
Wed, 13 Oct 2004 10:20:59 +0200
Hi Lutz,
* maicher@informatik.uni-leipzig.de
|
| We are still interested in a vital discussion about the problem we
| want to solve. We don't share the optimism that PSI repositories
| will be widely adopted in distributed and heterogeneous
| communities. Therefore we introduced our SIM approach. The SIM is a
| measure, which determines how closely related the Subject of two
| Topics might be. This decision is made automatically and only based
| on the Content provided by the regarding Topic (Maps). For more
| details, we suggest a closer look to our papers.
I liked your approach quite a lot, and thought it very interesting.
I'd really like to try it out and see to what extent it actually works
on real data. (There are some topic maps in the Omnigator that have
overlapping subjects without having the same PSIs. We could also try
it on the XML conference papers topic map, instead of the existing
ad-hoc heuristics used for merging there.)
One thing I found strange was that you defined your own measure for
string distance instead of reusing existing measures such as
Levenshtein distance. Why was that?
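
For comparison, a minimal sketch of Levenshtein distance in Python. The
function names and the normalisation to a 0..1 score are assumptions made
for illustration; they are not taken from the SIM papers.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Hypothetical normalisation to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A score like this could stand in for a custom string measure when
comparing topic names, which would make results easier to compare with
other work that uses the same distance.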
Also, I'm uncertain about the URI similarity measure. If two URIs are
nearly the same, what does that tell you? It's unlikely to be because
the author mistyped the URI, because mistyped URIs don't work at all.
And if both URIs work and are in fact equivalent, URI equivalence rules
will reveal that in some cases (which your measure does not take into
account). Finally, if you consider the subject identifiers
(1) http://psi.example.org/something/#european-union
(2) http://psi.example.org/something/#african-union
(3) http://psi.noe.no/other/#european-union
then (1) and (3) are much more likely to identify the same subject
than (1) and (2) are.
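
To make the point concrete, here is a rough Python sketch using the three
identifiers above. It is only an illustration: the helper names are made
up, difflib's ratio is a stand-in for a generic whole-string measure (not
the SIM measure itself), and the normalisation covers only a few simple
equivalence rules (case of scheme and host, default ports).

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

U1 = "http://psi.example.org/something/#european-union"
U2 = "http://psi.example.org/something/#african-union"
U3 = "http://psi.noe.no/other/#european-union"

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(uri: str) -> str:
    """Apply a few simple equivalence rules; userinfo is dropped for brevity."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Omit the port when it is absent or the default for the scheme.
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/",
                       parts.query, parts.fragment))


def whole_string_similarity(a: str, b: str) -> float:
    """Generic character-level similarity over the full URI strings."""
    return SequenceMatcher(None, a, b).ratio()


def same_fragment(a: str, b: str) -> bool:
    """Compare only the fragment identifiers, ignoring where they are hosted."""
    return urlsplit(a).fragment == urlsplit(b).fragment
```

A whole-string measure of this kind rates (1) and (2) as closer than (1)
and (3), because they share the long host and path prefix, while
`same_fragment` pairs (1) with (3). Neither is a reliable signal on its
own, but it shows why raw URI similarity can point the wrong way.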
Another consideration is that I think types are extremely important.
If the names are the same but the types are disjoint (person and
place, say) then you can safely ignore the name match. You might even
want to make the algorithm process the typing topics first, and only
afterwards go after the instances.
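
A small Python sketch of that idea, under stated assumptions: the `Topic`
class, the `DISJOINT` table, and the exact name comparison are all
hypothetical, and a real implementation would take disjointness from the
ontology and use a proper similarity measure for names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Topic:
    name: str
    types: frozenset = field(default_factory=frozenset)


# Hypothetical table of type pairs known to be disjoint.
DISJOINT = {frozenset({"person", "place"})}


def types_disjoint(a: Topic, b: Topic) -> bool:
    """True if any type of a is declared disjoint with any type of b."""
    return any(frozenset({ta, tb}) in DISJOINT
               for ta in a.types for tb in b.types)


def may_be_same_subject(a: Topic, b: Topic) -> bool:
    """Check types first; only compare names when the types allow a match."""
    if types_disjoint(a, b):
        return False  # a name match between disjoint types is ignored
    return a.name.casefold() == b.name.casefold()
```

With this ordering, a person named "Paris" and a place named "Paris" are
never proposed as merge candidates, however similar the names are.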
Not sure if this is helpful, but it may be worth considering, if you
haven't already.
--
Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
GSM: +47 98 21 55 50 <URL: http://www.garshol.priv.no >