[topicmapmail] Classification of occurrences using keywords
Murray Altheim
m.altheim@open.ac.uk
Wed, 13 Nov 2002 18:54:08 +0000
Kal Ahmed wrote:
> On Wednesday 13 November 2002 17:37, Murray Altheim wrote:
[...]
>>While you may enter keywords as individual topics in a topic map
>>in order to better process them (as topics in their own right),
>>the use of keywords for searching and document identification
>>and their identity with "topics" (in the general sense) is hardly
>>a given, and one I would strongly disagree with. "Topic" and "keyword"
>>are not synonyms in either the dictionary, in common use, or in their
>>use in topic maps (even in your examples).
>
> Why can a keyword *not* be treated as a subject ? It is true that there is a
> world of difference between "Navigable History" as a subject and "Navigable
> History" as a keyword, but surely a topic playing the role of keyword in a
> has-keyword association with a reified resource can be treated as the latter
> rather than the former ?
A keyword can certainly be treated as a subject in its own right, as
Lars Marius and Steve demonstrated in their paper. It's the world of
difference that I'm concerned about. I think it's a misstatement to
say that "topic" and "keyword" are synonyms, and this ignores *how*
keywords are typically used (i.e., not as subjects in their own right,
but as sets upon which searches are conducted).
For example, when a person goes to Google and types in "orton lyrics"
or "orton dyslexia" as keywords in two separate searches, nobody should
assume that "orton" is the same subject. It's merely the same string
of characters; it might not even be the same part of speech.
In the first case one is likely looking for Beth Orton lyrics, in
the second the Orton-Gillingham Academy of Practitioners (for dyslexia,
among other things).
>>I provided an example of this which I'll reiterate. I wrote:
>> > [...] So, searching for say a paper on "Navigable History" (a subject)
>> > we might use the keywords:
>> >
>> > event history, navigable history, constructive time, edit-based
>> > indexing, information workspace, analysis, interpretation, authoring,
>> > spatial hypertext
>> >
>> > I mentioned that keywords essentially are a deconstruction or decompo-
>> > sition of a topic.
>>
>>What I mean by this is that the set of keywords I provided *together*
>>describe the paper by Shipman.
>
> Yes, but each keyword individually serves as data to a search mechanism. There
> is a collective set "the keywords of the paper "Navigable History: A Reader's
> View of Writing"" and there is each individual member of that set. I feel
> that these are two distinct concepts.
Yes, I agree. But it's the "serving as data to a search mechanism" that
is the typical use of keywords, whether the search mechanism is computer
or human.
>> > Just to test this theory, I grabbed those keywords from a specific paper
>> > by Frank M. Shipman and Haowei Hsieh. I can take those keywords and
>> > paste them into Google and find the paper. [goes off and tries it] Damn,
>> > but it works!
>
> What happens if you choose one of those keywords or two or three from the
> collection ? Surely the same paper is still found (modulo Google magic). Just
> as the "Navigable History" keyword topic might play the role of "keyword" in
> an association with multiple reified resources.
As in my "orton lyrics"/"orton dyslexia" example, no, you cannot assume this
is the case. It's the conjunction of keywords that provides search accuracy:
the more keywords (up to a limit), the more likely you'll find what you're
looking for. You might get lucky if the keywords are fairly rare, but used
in conjunction even extremely common keywords can provide an accurate search.
So while "orton" is a very rare term, you could still search on
"dyslexia practitioners academy new york" and find the same page. But
searches on fewer terms are less likely to find it.
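To make the point about conjunction concrete, here is a minimal Python sketch of an AND search over a toy inverted index. The page ids and page texts are invented for illustration and are not taken from any real search engine; real engines also rank results rather than simply intersecting sets.

```python
from collections import defaultdict

# Hypothetical pages: ids and texts are made up for this sketch.
PAGES = {
    "beth-orton-lyrics": "beth orton lyrics central reservation",
    "orton-academy": "orton gillingham academy practitioners dyslexia new york",
    "dyslexia-faq": "dyslexia questions answers reading",
    "ny-academy-of-art": "new york academy art",
}


def build_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.split():
            index[term].add(page_id)
    return index


def search_all(index, query):
    """Return ids of pages containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


INDEX = build_index(PAGES)
```

In this toy index "orton" alone matches two unrelated pages, "orton lyrics" and "orton dyslexia" each match one, and "dyslexia practitioners academy new york" finds the academy page without the rare term at all, while "academy new york" on its own is still ambiguous.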
>> > Now, it'd be hard to argue that "analysis" (or really, any of the
>> > above keywords) matches the subject "Navigable History: A Reader's
>> > View of Writing".
>
> Not matches, no but "analysis" has been selected as a keyword for the
> resource. So there is an association between the topic "the keyword
> 'analysis'" and the topic which reifies this paper.
Yes, and I think the open questions lie right there: what kind of
relationship a keyword has to a subject, how the keyword(s) is
represented in a topic map, and how to express the associations between
those keywords (perhaps but not necessarily as TM Topics) and the topics
they relate to. These are all, IMO, still unanswered. Yes, Lars Marius
and Steve have demonstrated an approach to this, but for applications
where we have 48 million records (such as WorldCat) I think it's just
about crazy to deal with keywords-as-Topics.
(1) if one is creating a topic map of the subjects themselves
(as publications) I don't think I'd want all those keywords
necessarily in the same topic map unless they were very
definitely identified distinctly as keywords;
(2) it would greatly increase the number of topics in the map,
which might not be very efficient;
(3) it muddies an existing topic map that contains Topics only
for the entities it aims to map; for navigation, visualization,
etc. this might make things exceedingly complex;
(4) it increases processing time. If for each record there are
a dozen keywords, or keyword phrases, the number of Topics
is likewise increased greatly (though of course not a linear
function since we assume some Topics share keywords);
(5) keywords in actual use are not always simply single tokens,
they're often multiple tokens or phrases. There needs to be
some automatic means of identifying synonyms of both these
single tokens and multiple token keywords, e.g., so that
"eating animals" and "eat animals" are close semantically,
but "eating animals" and "animals eating" are not. And this
of course varies (in general) with context. Computational
linguistics has shown that this problem is enormously more
complex than might appear at first glance. So far it usually
comes down to a manual (and therefore error-prone) process;
(6) I remain to be convinced that keywords-as-Topics is the best
way, except for where small scope and manually-created
ontologies make sense, which I don't think is the common case.
I think that's why I continue with this discussion.
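Points (2) and (4) can be illustrated with a rough Python sketch that counts what keywords-as-Topics would add to a map. The record ids below are invented and the counting model (one Topic per distinct keyword, one has-keyword association per record/keyword pair) is an assumption for illustration, not a description of any particular topic map engine.

```python
# Hypothetical records: ids are made up; keyword strings are only examples.
RECORDS = {
    "doc-1": {"navigable history", "spatial hypertext", "analysis"},
    "doc-2": {"spatial hypertext", "authoring", "analysis"},
    "doc-3": {"information workspace", "interpretation", "analysis"},
}


def topic_counts(records):
    """Count topics and associations if every distinct keyword becomes a Topic."""
    # Shared keywords collapse into one Topic, so growth is not linear.
    distinct_keywords = set().union(*records.values())
    # Every record/keyword pair still needs its own association.
    keyword_links = sum(len(keywords) for keywords in records.values())
    return {
        "document_topics": len(records),
        "keyword_topics": len(distinct_keywords),
        "has_keyword_associations": keyword_links,
    }
```

For these three invented records, 3 document topics are joined by 6 keyword topics and 9 associations. Sharing keeps the topic count below the raw keyword count, but the associations grow with every record.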
> <snip/>
>
>>The real question for me (and for the original questioner) is not
>>how to have an author manually build a topic map ontology for a
>>given set of keywords, as that is a manual task, enormously complex
>>and requiring both ontological and domain-specific skills that might
>>not be available, for large document sets is an unreasonable task,
>>and besides, I don't think any of us have the authority or skills
>>to do what librarians do when they classify publications. I certainly
>>don't feel qualified to take someone else's publication (ie., a real
>>one, with an ISBN number) and add my own set of keywords-as-topics,
>>ignoring the real ones published with the document. And for the
>>several hundred documents I've got (that have their own existing
>>keyword sets) it would take a huge amount of time. What about 50,000
>>documents? 300,000 documents? OCLC's WorldCat has 48 million records,
>>and they all have keywords.
>
> In my experience (mainly with tech. doc. so YMMV), keywords are either (a)
> taken from a controlled vocabulary or (b) created by an author/indexer in an
> ad-hoc manner. I find that (b) is more common than (a), though I should
> imagine that librarians would tend toward (a). I think that if working with
> keywords taken from a controlled vocabulary, one should seek to determine
> whether or not there is an ontology underlying that vocabulary and if so,
> model it in a topic map. If there is no underlying ontology or if keywords
> have been created in an ad-hoc manner, you can (automatically) do no more
> than treat them as "keyword" topics with no relationship between them (only a
> relationship to the resources)
The keyword sets are often created by the author or publisher, not by
the librarian, and unfortunately the standards for this are not universal.
So two documents covering essentially the same semantic territory might
not share *any* keywords, or at very least we cannot assume that their
keyword sets match to any specific degree.
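One simple way to measure how far two keyword sets agree is the Jaccard index (size of the intersection over size of the union). This is a swapped-in measure chosen for illustration, not something proposed in this thread, and the keyword sets in the usage note are invented.

```python
def jaccard(keywords_a, keywords_b):
    """Return the overlap of two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    # Two empty sets have no basis for comparison; treat them as no overlap.
    return len(a & b) / len(union) if union else 0.0
```

Two hypothetical papers on the same territory, one tagged {"topic maps", "indexing"} and the other {"subject-based classification", "metadata"}, score 0.0 here even though a human reader would consider them closely related.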
>>What the question (I believe) here is, is how to best use existing
>>*sets* of keywords in a topic map in such a way as to use the
>>conjunction of all their meanings as an identifier for the subject
>>being entered as a topic in a topic map. To best use them in a topic
>>map.
>
> Now that's an interesting question. But I would argue that in most search
> systems using keywords, a user will search for one/some of the keywords, not
> all of them. In this case, what is the value of treating the keywords as a
> set ?
They're published as a set. You and I as implementors will likely
deal with them as a set. When I enter a document into my bibliography
I take its given set in its entirety. I don't make judgements. The
48 million records in WorldCat have sets of keywords for most records.
I agree that users wouldn't type in all of a document's keywords, but
obviously the more closely what they do type in matches an existing set
of keywords, the more likely they'll find a particular document.
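A minimal Python sketch of that idea: treat each document's published keywords as a set and rank documents by how many of the user's typed keywords fall in it. The keyword set for the Shipman and Hsieh paper is the one quoted earlier in this message; the two "Hypothetical Paper" entries and the scoring rule are assumptions made up for illustration.

```python
KEYWORD_SETS = {
    "Navigable History: A Reader's View of Writing": {
        "event history", "navigable history", "constructive time",
        "edit-based indexing", "information workspace", "analysis",
        "interpretation", "authoring", "spatial hypertext",
    },
    # The two entries below are invented for this sketch.
    "Hypothetical Paper A": {"spatial hypertext", "analysis", "visualization"},
    "Hypothetical Paper B": {"analysis", "statistics"},
}


def rank(keyword_sets, typed):
    """Rank titles by how many typed keywords appear in each published set."""
    typed = {keyword.lower() for keyword in typed}
    scored = [(len(typed & keywords), title) for title, keywords in keyword_sets.items()]
    # Drop documents with no match; sort by score descending, then by title.
    matches = [(score, title) for score, title in scored if score > 0]
    return [title for score, title in sorted(matches, key=lambda pair: (-pair[0], pair[1]))]
```

Typing only "analysis" matches all three entries equally, so the order is arbitrary (alphabetical here). Typing "analysis", "spatial hypertext" and "navigable history" puts the Shipman paper first, because it matches all three.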
>>I don't think that question has yet been addressed (it's what I've
>>been thinking about for the past few weeks).
While I've oft heard disparaging remarks about keywords and full text
searches, this is still how the majority of searches occur. We have
it as a task to improve this. When the things we search for are
documents that have existing keyword sets, we need to figure out how
best to use those keywords. We've seen one proposal; I'm curious what
other methods might work better on large bodies of information.
> I would be interested in hearing how you follow that train of thought to its
> conclusion. I hereby offer several beers over which you can explain it to me
> ;-)
I have a car that can get to Oxford. Pick a time and an alleyway. :-)
Murray
......................................................................
Murray Altheim <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK
If you're the first person in a new territory,
you're likely to get shot at.
-- ma