[topicmapmail] Merging of Distributed Topic Maps based on the Subject Identity Measure (SIM) Approach

Bernard Vatant bernard.vatant@mondeca.com
Wed, 20 Oct 2004 18:44:07 +0200


*Lars Marius
>> Another consideration is that I think types are extremely important.
>> If the names are the same but the types are disjoint (person and
>> place, say) then you can safely ignore the names. You might even want
>> to make the algorithm consider typing topics first, and only
>> afterwards go after the instances.

*Lutz
> I think the inclusion of structural information (types and neighbourhood via
> associations etc.) is very important. I have some good ideas which should be
> discussed later.

Amazingly, I've thought a lot lately about typing (or classification) as a recommended
first step in the identification process, for both conceptual and technical reasons. I'm happy
to see you both ploughing in that direction, too. In fact, it was a hot and never-resolved
debate in the PubSubj TC to figure out the level of commitment implied by the use of
identifying properties, whatever their type: name, PSI or any other property fit for
establishing identity, exact or fuzzy. This is an important aspect of this issue, though not the
only one.

So, I wonder if Lars Marius would go as far as rephrasing his paragraph above, replacing
"name" with "subject indicator", as follows:

"If the subject indicators are the same but the types are disjoint, then you can safely
ignore the subject indicators."

IOW, would you recommend (as a best practice at least) that the merging constraint carried
by equality of subject indicators be relaxed when classes are implicitly or formally
disjoint (like person and place)?

OTOH, should not PSIs include somehow the declaration of the class of the identified
subject, in such a way that use of the PSI for a topic in an explicitly disjoint class
would be considered as an error?
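
The two options above can be contrasted in a minimal Python sketch. Everything in
it is an assumption made for illustration: the Topic class, the DISJOINT table and
the three outcomes are hypothetical and are not taken from the SIM papers or from
any Topic Maps specification. A shared subject indicator leads to a merge unless
the topics carry types declared disjoint, in which case the pair is reported as a
conflict rather than silently merged or silently ignored.

```python
from dataclasses import dataclass, field

# Hypothetical disjointness declarations; a real topic map or ontology
# would have to supply its own.
DISJOINT = {frozenset({"person", "place"})}


@dataclass
class Topic:
    name: str
    types: set = field(default_factory=set)
    subject_indicators: set = field(default_factory=set)


def types_disjoint(a: Topic, b: Topic) -> bool:
    """True if any type of a is declared disjoint with any type of b."""
    return any(
        frozenset({ta, tb}) in DISJOINT
        for ta in a.types
        for tb in b.types
    )


def merge_decision(a: Topic, b: Topic) -> str:
    """Return 'merge', 'conflict' or 'no-merge' for a pair of topics."""
    if not (a.subject_indicators & b.subject_indicators):
        return "no-merge"      # no shared subject indicator at all
    if types_disjoint(a, b):
        return "conflict"      # same indicator, disjoint classes
    return "merge"


city = Topic("Paris", {"place"}, {"http://psi.example.org/paris"})
hero = Topic("Paris", {"person"}, {"http://psi.example.org/paris"})
print(merge_decision(city, hero))
```

Whether "conflict" should mean "do not merge" (relaxing the constraint) or "report
an error" (the PSI was misused) is exactly the open question; the sketch only makes
the decision point explicit.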

Of course, this is shifting ground, because it clearly questions the "absolute"
nature of subject identity, which is supposed to be carried by the subject indicator in the
TM paradigm.

Bernard

**********************************************************************************

Bernard Vatant
Senior Consultant
Knowledge Engineering
bernard.vatant@mondeca.com

"Making Sense of Content" :  http://www.mondeca.com
"Everything is a Subject" :  http://universimmedia.blogspot.com

**********************************************************************************

> -----Original Message-----
> From: topicmapmail-admin@infoloom.com
> [mailto:topicmapmail-admin@infoloom.com] On behalf of
> Dipl.-Wirtsch.-Inf. Lutz Maicher [Universität Leipzig]
> Sent: Wednesday, 20 October 2004 10:56
> To: topicmapmail@infoloom.com
> Subject: Re: [topicmapmail] Merging of Distributed Topic Maps based on
> the Subject Identity Measure (SIM) Approach
>
>
> Hi Lars Marius,
>
> > * maicher@informatik.uni-leipzig.de
> > |
> > | We are still interested in a vital discussion about the problem we
> > | want to solve. We don't share the optimism that PSI repositories
> > | will be widely adopted in distributed and heterogeneous
> > | communities. Therefore we introduced our SIM approach. The SIM is a
> > | measure which determines how closely related the Subjects of two
> > | Topics might be. This decision is made automatically and is based only
> > | on the Content provided by the respective Topic (Maps). For more
> > | details, we suggest a closer look at our papers.
> >
> > I liked your approach quite a lot, and thought it very interesting.
> > I'd really like to try it out and see to what extent it actually works
> > on real data. (There are some topic maps in the Omnigator that have
> > overlapping subjects without having the same PSIs. We could also try
> > it on the XML conference papers topic map, instead of the existing
> > ad-hoc heuristics used for merging there.)
>
> To evaluate approaches for automatic merging you need an objective criterion
> for testing purposes. In our example we used the ISDN to decide whether the
> respective Topics have to be merged. If you don't have such an objective
> criterion you can't calculate precision, recall and F-value, and without
> these measures you can't assess the quality of the
> approaches. We need testbeds which provide that objective criterion.
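
As an illustration of the evaluation described above, here is a minimal Python
sketch of precision, recall and F-value over merge proposals. It assumes that both
the proposals and the objective criterion (the "gold" set) are given as sets of
(topic id in map A, topic id in map B) pairs; the function name, the ids and the
use of the balanced F1 form of the F-value are assumptions, not taken from the
SIM papers.

```python
def evaluate(proposed, gold):
    """Precision, recall and balanced F-value (F1) of proposed merge pairs."""
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical topic ids: pairs the algorithm proposes vs. pairs that the
# objective criterion says must be merged.
proposed = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
print(evaluate(proposed, gold))
```

Without the gold set there is nothing to intersect with, which is why a testbed
with an objective criterion is a precondition for these numbers.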
>
> For the time being we have to concentrate on practical approaches (which
> make painful simplifications). But many Subjects can't be discriminated
> sharply (abstract Subjects in particular have no clear border) [see
> 1]. For such cases we can propose similarity approaches, but we will never be
> able to assess the real matching quality. (Research on
> inter-indexer consistency has shown that an *objective*
> assessment of such matching proposals is impossible.)
>
> > One thing I found strange was that you defined your own measure for
> > string distance instead of reusing existing measures such as
> > Levenshtein distance. Why was that?
>
> We have two requirements for String similarity, which are met by our
> approach:
>
> 1. Language independence. Our approach should work, e.g., for two Catalan
> Topic Map Fragments as well as for English Topic Map Fragments. This requirement
> excludes approaches which use thesauri, ontologies, lists of stop words
> etc.
>
> 2. Inexpensive. The similarity calculation has to be computationally very
> cheap.
>
> The importance of the second requirement is shown by the testbed discussed
> in the article. We calculated merging candidates out of two Topic Maps which
> consist of 300 Topics each. Even for this small example a SIM for approx. 90,000
> Topic Pairs (300 x 300) has to be calculated. For each calculation a lot of String
> similarities have to be computed (for each possible pair of Topic Names and
> each possible pair of Occurrences). Especially if Occurrences are small
> texts etc., we need a very "cheap" String similarity measure. What happens
> if Topic Maps with millions of Topics join?
>
> For computation of string similarity one can decide whether to use an
> approach which acts on the syntactical surface (our approach) or to use an
> approach which makes semantic comparisons. The latter case is definitely not
> language independent (our first requirement), but might yield good results
> for the detection of similar Subjects.
>
> Summarising, if you have good ideas for string similarity measures which are
> cheap (strong requirement) and language independent (weak requirement), I'm
> very interested. Perhaps we can use the Dice coefficient on the basis of
> trigrams?
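
For readers unfamiliar with the measure suggested above, a minimal Python sketch of
the Dice coefficient over character trigrams follows. The choices made here
(lower-casing, no padding, sets rather than multisets of trigrams, and the
treatment of strings shorter than three characters) are assumptions of this sketch
and not part of the SIM approach.

```python
def trigrams(text):
    """Set of character trigrams of a lower-cased string (no padding)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def dice(a, b):
    """Dice coefficient 2*|A & B| / (|A| + |B|) on trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        # Both strings are too short to yield trigrams: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


print(dice("European Union", "european union"))
print(dice("European Union", "African Union"))
```

The measure works on the syntactical surface only, so it needs no thesauri or stop
word lists, and its cost is linear in the string lengths plus one set
intersection.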
>
> > Also, I'm uncertain about the URI similarity measure. If two URIs are
> > nearly the same, what does that tell you? It's unlikely to be because
> > the author mistyped the URI, because mistyped URIs don't work at all.
> > And if they both work, URI equivalence rules will reveal this in some
> > cases (which your measure does not take into account). Finally, if you
> > consider the subject identifiers
> >
> >  (1) http://psi.example.org/something/#european-union
> >  (2) http://psi.example.org/something/#african-union
> >  (3) http://psi.noe.no/other/#european-union
> >
> > then (1) and (3) are much more likely to identify the same subject
> > than (1) and (2) are.
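
The point made with identifiers (1) to (3) can be reproduced in a short Python
sketch. Note the assumptions: difflib.SequenceMatcher stands in for "some
whole-string similarity" and is not the URI measure from the SIM papers, and
comparing fragments is only one possible heuristic.

```python
from difflib import SequenceMatcher
from urllib.parse import urldefrag

U1 = "http://psi.example.org/something/#european-union"
U2 = "http://psi.example.org/something/#african-union"
U3 = "http://psi.noe.no/other/#european-union"


def whole_string_similarity(a, b):
    """Stand-in surface similarity over the complete URI strings."""
    return SequenceMatcher(None, a, b).ratio()


def same_fragment(a, b):
    """True if both URIs have a non-empty, identical fragment part."""
    fa, fb = urldefrag(a).fragment, urldefrag(b).fragment
    return bool(fa) and fa == fb


# The surface measure ranks (1, 2) above (1, 3) ...
print(whole_string_similarity(U1, U2) > whole_string_similarity(U1, U3))
# ... although only (1, 3) share the fragment that names the subject.
print(same_fragment(U1, U2), same_fragment(U1, U3))
```

A shared fragment is of course only a hint: different publishers may use the same
fragment for different subjects.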
>
> That's absolutely right. But first, we have to distinguish between the URI
> and the information resource which is referenced by this URI (if one
> exists). Both can make some statements about the Subject of a Topic. I think
> that analysing the referenced information resources might be more fruitful
> than analysing the URIs themselves. What's your opinion?
>
> Second, we have to distinguish between the contexts where these URIs occur.
> If a URI is used as a Subject Indicator, two URIs which differ only in the
> fragment (after #) have to be regarded as representing different Subjects
> (because the owner of the namespace knows the pragmatics of the URIs and
> decided to use two different Subjects). But if these different URIs are used
> as Occurrences we can't decide with the same strictness. And finally, if URIs
> are used to reify Topic Map Fragments we have to treat URIs in a
> completely different fashion.
>
> To exclude mistyping we can assume that all URIs which have a Levenshtein
> distance of zero or one have to be regarded as the same URI (for example).
> But the computation of this distance is expensive.
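
On the cost concern: the full Levenshtein distance needs a dynamic-programming
table, but if only "distance zero or one" matters, a single linear scan is enough.
The Python sketch below shows that restricted check; the function name is
hypothetical, and whether near-identical URIs should be treated as the same URI at
all remains the open question discussed above.

```python
def within_one_edit(a, b):
    """True if the Levenshtein distance between a and b is 0 or 1."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a            # make a the shorter (or equally long) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # One substitution at position i: the remainders must be identical.
        return a[i + 1:] == b[i + 1:]
    # One extra character in b at position i: skip it and compare the rest.
    return a[i:] == b[i + 1:]


print(within_one_edit("http://psi.example.org/topic",
                      "http://psi.example.org/topik"))
print(within_one_edit("http://psi.example.org/something/#european-union",
                      "http://psi.example.org/something/#african-union"))
```

The check runs in time linear in the URI length, so the threshold test itself need
not be the bottleneck; the number of URI pairs to compare still is.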
>
> > Another consideration is that I think types are extremely important.
> > If the names are the same but the types are disjoint (person and
> > place, say) then you can safely ignore the names. You might even want
> > to make the algorithm consider typing topics first, and only
> > afterwards go after the instances.
>
> I think the inclusion of structural information (types and neighbourhood via
> associations etc.) is very important. I have some good ideas which should be
> discussed later.
>
> > Not sure if this is helpful, but it may be worth considering, if you
> > haven't already.
>
> Thank you for your ideas. We are very interested in discussion about our
> research.
>
> Lutz
>
> [1]
> http://www.idealliance.org/papers/extreme03/xslfo-pdf/2003/Kent01/EML2003Kent01.pdf
>
> _________________________________________________________________________________
> Dipl.-Wirtsch.-Inf. Lutz Maicher
> Graduiertenkolleg Wissensrepräsentation | Universität Leipzig
> Abteilung Automatische Sprachverarbeitung | Institut für Informatik |
> Augustusplatz 10-11 | 04109 Leipzig
>
> fon 0341 97 32 303 | mail maicher@informatik.uni-leipzig.de
> http://www.informatik.uni-leipzig.de/~maicher/
>
> _______________________________________________
> topicmapmail mailing list
> topicmapmail@infoloom.com
> http://www.infoloom.com/mailman/listinfo/topicmapmail
>