[topicmapmail] Classification of occurrences using keywords

Lars Marius Garshol larsga@garshol.priv.no
13 Nov 2002 22:30:09 +0100


* Murray Altheim
| 
| A keyword can certainly be treated as a subject in its own right, as
| Lars Marius and Steve demonstrated in their paper. It's the world of
| difference that I'm concerned about. I think it's a misstatement to
| say that "topic" and "keyword" are synonyms, and this ignores *how*
| keywords are typically used (i.e., not as subjects in their own
| right, but as sets upon which searches are conducted).

That's why I kept asking what you meant by "keyword". If by "keyword"
you mean a data item used in a particular kind of search, then
obviously that is not the same as a topic. "Keywords" as they appear
in the metadata fields of documents are, I think, most useful in a TM
context if they are considered to be the same thing as topics.
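
A minimal sketch of that reading, assuming each distinct keyword
phrase in a document's metadata becomes one topic and the documents
carrying it become its occurrences. The function name, the record
layout and the normalisation rule are all hypothetical; this is not
the API of any topic map engine, nor the method from the paper.

```python
from collections import defaultdict


def build_keyword_topics(records):
    """Map each distinct keyword phrase to the ids of documents that carry it.

    `records` is a hypothetical layout: {document id: [keyword phrases]}.
    """
    topics = defaultdict(set)
    for doc_id, keywords in records.items():
        for keyword in keywords:
            # Assumption: only case and whitespace are normalised, and the
            # whole phrase is the key, so phrases that merely share a
            # substring stay separate topics.
            key = " ".join(keyword.lower().split())
            topics[key].add(doc_id)
    return dict(topics)


records = {
    "doc1": ["Orton lyrics", "folk music"],
    "doc2": ["orton dyslexia"],
    "doc3": ["folk  music"],
}
topics = build_keyword_topics(records)
```

Here "folk music" ends up as a single topic with two occurrences,
while "orton lyrics" and "orton dyslexia" remain two unrelated topics.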
 
| For example, when a person goes to Google and types in "orton
| lyrics" or "orton dyslexia" as keywords in two separate searches,
| nobody should assume that "orton" is the same subject. It's merely
| the same string of characters; it might not even be the same part of
| speech.

Those are two different keywords, and as you say, the fact that they
have a substring in common is beside the point.
 
| Yes, Lars Marius and Steve have demonstrated an approach to this,
| but for applications where we have 48 million records (such as
| WorldCat) I think it's just about crazy to deal with
| keywords-as-Topics.

Why? And how would you do it?
 
|   (1) if one is creating a topic map of the subjects themselves
|       (as publications) I don't think I'd want all those keywords
|       necessarily in the same topic map unless they were very
|       definitely identified distinctly as keywords;

Why?
 
|   (2) it would greatly increase the number of topics in the map,
|       which might not be very efficient;

Why not? Are you saying that something that contains 48 million things
can't possibly be efficient?
 
|   (3) it muddies an existing topic map that contains Topics only
|       for the entities it aims to map; for navigation, visualization,
|       etc. this might make things exceedingly complex;

I'm not sure what you mean to say. What are "the entities it aims to
map", and how do those differ from "topics representing keywords"?

|   (4) it increases processing time. If for each record there are
|       a dozen keywords, or keyword phrases, the number of Topics
|       is likewise increased greatly (though of course not a linear
|       function since we assume some Topics share keywords);

Isn't this the same as (2)?
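
The remark in (4) that the growth is not linear can be illustrated
with a small sketch, assuming one topic per distinct keyword phrase.
The function name and the sample records are made up for
illustration; no real catalogue data is used.

```python
def keyword_growth(records):
    """Return (keyword assignments, distinct keywords) for the records.

    `records` is a hypothetical layout: {record id: [keyword phrases]}.
    """
    # Every (record, keyword) pair counts as one assignment.
    assignments = sum(len(keywords) for keywords in records.values())
    # Shared keywords collapse into a single topic each.
    distinct = len({kw for keywords in records.values() for kw in keywords})
    return assignments, distinct


records = {
    "rec1": ["topic maps", "metadata", "search"],
    "rec2": ["metadata", "search"],
    "rec3": ["search", "indexing"],
}
assignments, distinct = keyword_growth(records)
```

With these three records there are seven keyword assignments but only
four distinct keywords, so only four keyword topics would be added.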
 
|   (5) keywords in actual use are not always simply single tokens,
|       they're often multiple tokens or phrases. There needs to be
|       some automatic means of identifying synonyms of both these
|       single tokens and multiple token keywords, e.g., so that
|       "eating animals" and "eat animals" are close semantically,
|       but "eating animals" and "animals eating" are not. And this
|       of course varies (in general) with context. Computational
|       linguistics has shown that this problem is enormously more
|       complex than might appear at first glance. So far it usually
|       comes down to a manual (and therefore error-prone) process;

Of course doing a proper mapping from keywords to meaningful topics
can be difficult in many cases. But what's the alternative?

| While I've oft heard disparaging remarks about keywords and full
| text searches, this is still how the majority of searches occur. 

It is, and that's because they are quite effective ways of searching.
What I've found is that they work even better with topic maps. We
offer full-text search in our product suite, and several of our
customers have built keyword-searching solutions for their ontologies.

| We have it as a task to improve this. When the things we search for
| are documents that have existing keyword sets, we need to figure out
| how best to use those keywords. We've seen one proposal; I'm curious
| what other methods might work better on large bodies of information.

So am I. Do you have any proposals, or know of any?

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >