[topicmapmail] Generation of Topic Maps and Machine Learning
Murray Altheim
m.altheim@open.ac.uk
Mon, 01 Nov 2004 16:26:23 +0000
Lars Marius Garshol wrote:
> * Aayush Puri writes:
[...]
> | Also what are the possibilities that some machine learning algorithm
> | can be applied so that the system betters the associations (and
> | hence the quality of topic maps) whenever provided with more sample
> | sets (in this case more textual information related to the topics).
>
> I don't know of any NLP methods that work in this way. There are
> methods that do classification (working out what a document is about)
> and learn as they go along. However, classification is much easier
> than concept and association extraction, and I don't know of anything
> that can apply learning to this task, nor can I really imagine how it
> would work.
Back in 1996/97 I had a job at a NASA center in West Virginia (!)
called the National Technology Transfer Center (NTTC), whose
mission it was to coordinate the transfer of NASA technology to
the private sector.
NASA and the NTTC had taken some research done at Carnegie Mellon
in NLP and applied it to document categorization. Thousands of
wooden pallets covered by plastic shrink-wrapped reams of paper,
some going back to the 1950s, were unwrapped and dropped onto
document scanners. These were single and double-sided research
papers, reports, memos, etc. whose document boundaries weren't
even known (i.e., this was literally just a big pile of paper).
Each page was scanned as a TIFF image, OCR'd, and then the OCR
run through the Carnegie Mellon University (CMU) engine to
produce a cleaned up OCR plus a summary of the page in a format
known to the engine. For each page these three documents are
stored on the system. I don't have a handy reference to this
system, but for anyone interested you could contact the NTTC at
http://www.nttc.edu/
If you're an American business, they actually *owe* you
assistance, as it's written into their charter.
Anyway, the reason I write this message is that I had a number
of in-depth discussions with personnel both at NASA and the NTTC
regarding this project, and have some understanding of the state
of the art of NLP from these discussions. I don't claim any
particular expertise in this area, but I'll relate what I
understand from these discussions (and a few others) and my
own research.
The Carnegie Mellon engine receives an OCR text, which often
contains spelling errors and other anomalies from the scanning
process. It does a spell check based on a normal English
dictionary that has been modified to recognize research terms
common within the corpus. It uses a per-document statistical
analysis, plus looks at the documents "near" it in the scanning
process, which may or may not be related to the page itself.
From this it builds an "understanding" of the document which it
uses in trying to establish the extant set of pages composing
a specific document (noting that pages may even be missing). Once
the document boundaries are established, further processing is
done to determine statistically what the document is "about," so
that searches may locate it amongst the corpus.
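As a rough illustration of the steps just described, here is a
minimal Python sketch. It is not the CMU engine, whose internals
aren't given here; it only assumes a vocabulary-based correction of
OCR tokens followed by a term-frequency comparison of neighbouring
pages to guess document boundaries. The vocabulary, the function
names, the sample pages and both thresholds are invented for the
example.

```python
import difflib
import math
import re
from collections import Counter

# Hypothetical vocabulary: an ordinary word list extended with
# terms common in the corpus (all entries invented for this sketch).
VOCABULARY = {"rocket", "engine", "thrust", "nozzle", "test",
              "travel", "budget", "memo"}


def correct(token, vocabulary=VOCABULARY):
    """Replace an OCR token with its closest vocabulary entry, if one is close enough."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, sorted(vocabulary), n=1, cutoff=0.8)
    return matches[0] if matches else token


def term_vector(text):
    """Count the corrected, lower-cased word tokens on one page."""
    return Counter(correct(t) for t in re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_pages(pages, threshold=0.3):
    """Start a new document whenever a page is dissimilar to the page scanned before it."""
    groups = []
    previous = None
    for index, text in enumerate(pages):
        vector = term_vector(text)
        if previous is not None and cosine(previous, vector) >= threshold:
            groups[-1].append(index)
        else:
            groups.append([index])
        previous = vector
    return groups


pages = [
    "rocket engine thrust test",
    "engine thnust nozzle test",   # "thnust" stands in for an OCR error
    "travel budget memo",
    "budget memo travel",
]
print(group_pages(pages))
```

Note that nothing in this sketch knows what a rocket or a budget
is; pages are grouped purely because they share words.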
The thing I got most out of all of this was that the CMU engine
knows *absolutely nothing* about the subjects being represented
within the documents. There is *absolutely no* concept of subject
or subject identity, and the researchers at CMU would laugh at
the idea. As was reinforced by a conversation with Geoffrey
Nunberg, a computational linguist at Stanford, NLP is likely
still decades away from gaining any kind of "understanding" of
text, and there are many within the field (those not deluded
into conflating the successes of the statistical method with
"machine understanding") who reject the idea that we've somehow
got to the point where machines understand human language. They
don't. All of NLP after decades of research is still down to
text frequency. The
CMU/NASA system still operates based on looking for particular
combinations of words within specific distances of each other.
I'm on the Lucene (search engine) mailing list, and certainly
the sophistication of the search engines has grown quite
markedly over the last decade, but the essentials haven't changed
much at all.
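To make "particular combinations of words within specific
distances of each other" concrete, here is a minimal Python sketch
of a proximity test. It is only an illustration, not the CMU/NASA
code; the function name, the tokenization and the sample sentence
are assumptions made for the example.

```python
import re


def within_distance(text, first, second, max_gap):
    """Return True if `first` and `second` occur within `max_gap` token positions of each other."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_gap
               for i in first_positions
               for j in second_positions)


sentence = "The heat shield was tested before the tile inspection began"
print(within_distance(sentence, "heat", "tile", 6))
print(within_distance(sentence, "heat", "tile", 3))
```

The test is purely positional: it reports that two strings sit
near each other and says nothing about what either one means.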
My initial enthusiasm with Doug Lenat's Cyc ontology came in
large part because of my belief that the *only* way that an NLP
system can begin to make fundamental progress towards any kind
of machine understanding of concepts and relations is by being
backed up by a common sense + domain ontology, so that the
presence of specific terms in relation to others can be placed
within an ontology of possible meanings. WordNet on its own,
which operates at a linguistic level but with no further attempt
at ontology, cannot accomplish this. E.g., in WordNet one can
locate homonyms, synonyms, etc. for a given term, but in Cyc
one can begin to connect the various terms, concepts, actions
and events one finds in, say, driving a car, brushing one's
teeth, or assembling a space shuttle.
So where does Topic Maps fit into this?
Well, Cyc does have connections to WordNet, and is also able
to scan the Web for content. Pretty impressive, really. But
it doesn't have three fundamental parts of Topic Maps: a solid
concept of subject identity, a mapping functionality, and, I
would argue, a decent categorization system. On the latter count,
I think a Faceted Classification system would be necessary, and
I believe that a Topic Map-based FC system would be ideal. Cyc
uses what are called microtheories, which are context-specific
theories within the system. I think Topic Maps' conceptual model
is particularly strong in this regard, and the combination of Topic
Maps and Cyc (or a Cyc-like ontology) would be very powerful.
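For readers unfamiliar with what subject identity buys you, here
is a minimal Python sketch of the Topic Maps rule that topics
sharing a subject identifier represent the same subject and are
merged. The dictionary layout, the function name and the
example.org identifiers are invented for illustration and are not
any particular Topic Maps engine's API.

```python
def merge_topics(topics):
    """Merge topics that share at least one subject identifier."""
    merged = []
    for topic in topics:
        identifiers = set(topic["identifiers"])
        names = set(topic["names"])
        remaining = []
        for existing in merged:
            if existing["identifiers"] & identifiers:
                # Same subject: fold the existing topic into this one.
                identifiers |= existing["identifiers"]
                names |= existing["names"]
            else:
                remaining.append(existing)
        remaining.append({"identifiers": identifiers, "names": names})
        merged = remaining
    return merged


# Hypothetical topics; the identifiers are placeholders.
topics = [
    {"identifiers": {"http://example.org/subject/space-shuttle"},
     "names": {"Space Shuttle"}},
    {"identifiers": {"http://example.org/subject/apollo"},
     "names": {"Apollo"}},
    {"identifiers": {"http://example.org/subject/space-shuttle",
                     "http://example.org/subject/sts"},
     "names": {"STS"}},
]

for topic in merge_topics(topics):
    print(sorted(topic["names"]), sorted(topic["identifiers"]))
```

The first and third topics collapse into one because they share
an identifier, even though their names have no words in common.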
For specific domains where the range of concepts and relations
is very limited, it might be possible to avoid the need for the
common sense ontology, but of this I'm uncertain; humans often
use quite a variety of ways of expressing themselves and often
use examples and metaphors to illustrate points. This kind of
thing would be completely lost on any machine without at least
the ability to recognize a metaphoric relation in, say, a
Three Bears story showing up in a NASA report. I
don't suggest that an AI system would know why a Three Bears
story would show up, or what it would necessarily mean, but it
could at least recognize the story (from its common sense
ontology) and recognize it as a story, and perhaps as a metaphor.
Absent the ability to recognize the various forms of metaphor,
simile, and allegory, machines are doomed to miss most of how humans
communicate. We communicate by telling stories. [Like I am
right now.]
I remain very enthusiastic about the possibility of building
such a system incorporating the Topic Maps paradigm, and think
it's one part of the puzzle that is missing. I have little faith,
if any, in NLP on its own accomplishing much more than it has so
far. Incremental improvements, yes. Machine understanding, no.
Murray
......................................................................
Murray Altheim http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK .