[topicmapmail] starting with topic maps: resources <-> topics relationship?

Murray Altheim m.altheim@open.ac.uk
Wed, 15 Oct 2003 00:24:14 +0100


Josema Alonso wrote:
[...]
>> So long as each web page has a canonical URL, it can be brought in as an
>> occurrence in a Topic Map. You'd probably want a sniffer to grab the
>> document's <title> and maybe other metadata info (like deliberately
>> creating Dublin Core content within <meta> elements and harvesting that
>> same content when you do your mining).
> 
> I see. Some of our documents already have DC meta tags.

Great -- so at least some of your people are familiar with and receptive
to DC content.

>> But I'm not sure why you'd want things in the territory to necessarily
>> point out at the map. Typically, the map points at the territory. And
>> given that there's no browser support for <meta> usage such as you
>> describe, it's a bit of a wasted enterprise.
> 
> I have a problem with the designers. I should add every resource manually to
> the map after them. So, they design a page, they finish with it and I have
> to go after them adding the resource to the map. Too bad when they create
> dozens per day, should find another way.

It shouldn't be too much work to write a Java tool that could import an
XHTML document, grab whatever DC metadata content was already there plus
the content of <title>, show it in a GUI for review, and then write it
back to the document. The author and revision information could be added
at that time, including a revision timestamp. I implemented something
similar to this in my Ceryle tool, and the coding is not difficult.
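
As a rough sketch of just the sniffing step (this is not the Ceryle code;
the class name DcSniffer is hypothetical, and it assumes the common
convention of naming DC elements "DC.something" in <meta name="...">),
something like this pulls out the title and DC metadata with only the
JDK's own DOM parser. It also assumes well-formed XHTML with no DOCTYPE,
so the parser never tries to fetch a DTD:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Ceryle or TM4J.
class DcSniffer {

    /** Returns the <title> text under key "title" plus every DC.* meta name/content pair. */
    static Map<String, String> sniff(String xhtml) {
        Document doc;
        try {
            // Assumption: input is well-formed and carries no DOCTYPE.
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }

        Map<String, String> found = new LinkedHashMap<>();
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() > 0) {
            found.put("title", titles.item(0).getTextContent().trim());
        }
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String name = meta.getAttribute("name"); // "" when the attribute is absent
            if (name.startsWith("DC.")) {
                found.put(name, meta.getAttribute("content"));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<title>Course catalogue</title>"
                + "<meta name=\"DC.creator\" content=\"Web team\"/>"
                + "</head><body/></html>";
        System.out.println(sniff(page));
    }
}
```

The review GUI and the write-back step are left out; the returned map is
what such a GUI would present for editing.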

> Also, I'm very afraid of the size of the map. Including thousands of pages
> as resources (sorry if this is not the right name in the spec, maybe I
> should say topic or occurrence or whatever, I promise I'm learning these
> concepts but it takes a time) in the map could make it so large...

There'd only be one linking element for each page, unless you want there
to be a Topic for each page. But if you've got a computer that's in the
2GHz range, you're not going to see much of a problem on parsing. You'll
only have one copy of the current map to deal with, so even if it gets
big, it shouldn't be a problem. If you use Kal Ahmed's TM4J topic map
engine, you can use a persistent store backend like Ozone so that the
whole thing won't have to live in memory.
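
To give a feel for how small that per-page linking element is, here is a
sketch that writes XTM 1.0 <occurrence> elements for a list of page URLs
under a single topic. The class and method names are made up for
illustration, the example.org URLs are placeholders, and this is plain
string generation rather than the TM4J API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for illustration; it does not use TM4J.
class OccurrenceSketch {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** One XTM 1.0 occurrence pointing at one web page. */
    static String occurrence(String url) {
        return "  <occurrence><resourceRef xlink:href=\"" + escape(url)
                + "\"/></occurrence>\n";
    }

    /** A topic carrying one occurrence per page URL. */
    static String topic(String id, String name, List<String> pageUrls) {
        StringBuilder xtm = new StringBuilder();
        xtm.append("<topic id=\"").append(escape(id)).append("\">\n");
        xtm.append("  <baseName><baseNameString>").append(escape(name))
           .append("</baseNameString></baseName>\n");
        for (String url : pageUrls) {
            xtm.append(occurrence(url));
        }
        xtm.append("</topic>\n");
        return xtm.toString();
    }

    public static void main(String[] args) {
        System.out.print(topic("admissions", "Admissions",
                Arrays.asList("http://example.org/admissions/index.html",
                              "http://example.org/admissions/fees.html")));
    }
}
```

Each page costs one short line in the exported XTM, so the file grows
linearly with the number of pages.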

>>Also to be noted, is that maps exist for different purposes. You see maps
>>of North America for political boundaries, geographic features, weather
>>zones, agricultural harvests, etc.  The territory itself is mined for
>>information specific for each instance of a map.
> 
> Good point. For example, we have different profiles defined, and a page
> should be linked to more than one category.

You'll find there are two separate functions: the mapping between the
topic map and your resources, and the structures internal to the map
that describe the interrelationships between them. You can create a
Topics-and-Associations "ontology" of these relationships and keep
that in a separate XTM document, using <mergeMap> to bring it in as a
common module.
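
As a sketch of that modular layout, the following builds the skeleton of
a resource map that pulls in a shared ontology document through an XTM
1.0 <mergeMap> element. The file name ontology.xtm and the class name
are assumptions made for the example:

```java
// Hypothetical skeleton builder for illustration only.
class MapModuleSketch {

    /**
     * Wraps already-serialized topics in a <topicMap> root that first
     * merges in the shared ontology module found at ontologyHref.
     */
    static String topicMapWithModule(String ontologyHref, String topicsXml) {
        String href = ontologyHref.replace("&", "&amp;").replace("\"", "&quot;");
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<topicMap xmlns=\"http://www.topicmaps.org/xtm/1.0/\"\n"
             + "          xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n"
             + "  <mergeMap xlink:href=\"" + href + "\"/>\n"
             + topicsXml
             + "</topicMap>\n";
    }

    public static void main(String[] args) {
        // "ontology.xtm" is a placeholder for the separate ontology document.
        System.out.print(topicMapWithModule("ontology.xtm",
                "  <topic id=\"page-1\"/>\n"));
    }
}
```

The ontology document can then be maintained by hand while the resource
map is regenerated as often as needed.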

>>Jack Park and I have been discussing similar ideas. Currently, the
>>discussion centers around using Lucene as a search tool to create
>>indices, which are converted into XTM for use within the Topic Map.
> 
> Hmmm...
> We built a search engine using Lucene. It's indexing all of our
> '*.uniovi.es' web sites. It usually takes almost a day to index the whole
> domain. Sometimes even more. So, believe me, it is certainly a large number
> of pages and servers. I'm still afraid of the size of the XTM file, and of
> its manual update.

Okay, good point. If you're dealing with that volume, there will be
substantial file sizes no matter what method you use. You might be
able to come up with some way of not dealing with it always as a file,
such as keeping the topic map in a persistent store as I mentioned
above. You could always export it to XTM for archiving, but the "live"
topic map would live in Ozone.

>>You'd need tools to dig into various file formats such as MS Word
>>or PDF if you plan to mine those types, or just a <title> and <meta>
>>sniffer if not. If you use well-formed or valid XHTML rather than
>>HTML as your content, you will have an easier time processing the
>>files.
> 
> Ok, I see the point. At least I'll try them to use XHTML from now on.

That's okay so long as you have control, but once someone posts a
PDF, they'll either have to supply the metadata and a means of
linking that info with the PDF file (either external to or within
the topic map), or you'll need a way to read the metadata from the
PDF. I'm currently dealing with this same issue.

>>I looked into a project called DocSearcher, which seems to do a
>>great deal of the above, but it would need to be completely
>>reengineered, since it's not very well designed. But you could
>>...
> 
> I'll take a look. Thanks.
>
>>...
>>I published a "spec" on using Dublin Core metadata in XHTML at
>>
>>    http://www.altheim.com/specs/meta/NOTE-xhtml-augmeta.html
>>
>>and there's also a number of good docs on the subject at the DC
>>site itself. I'd use Dublin Core for your metadata as much as
>>possible. It's a solidly-understood and accepted schema from a
>>very successful project.
> 
> I'm printing the doc and I'll start reading it asap. I was not absolutely
> sure about the DC metadata, but you're confirming what I thought about it
> and that's why we already started to use it a while ago. So, I think we'll
> go on with its use.

There's an enormous number of library records marked up in it,
something like hundreds of millions. WorldCat, if I remember.

> Wow, long message, full of ideas. Thanks a lot, Murray.
> 
> And now, before the end of this one, another random thought. I have been
> also thinking about developing some kind of plug-in for the designers. They
> use Dreamweaver.
> This plugin would allow them to assign the page, once designed, to some of
> the topic maps already created manually by me. What about it?
> I'm just afraid of the XTM file size again, after thousands of pages
> created...at least I should use an intermediate layer in here for sure.

Well, I'd look at another library-based technology called Faceted
Classification. It's an old technique from the 1930s, where subjects
don't exist as atomic classes but are composites built from an
accretion of sub-subjects called "facets". You'd supply a base set
of facets and build up a set of subjects from them. The facets and
the subjects would be stored in a topic map ontology document, with
that document establishing PSIs (canonical URLs) for the subjects,
and you'd organize your documents/resources according to those subjects.

This embodies a good portion of my Ph.D. work.

Google on "Uta Priss" and "Faceted Classification". There are others
out there too, but Uta's written some accessible stuff. Also,

    "An Algebraic Approach for Specifying Compound Terms in
     Faceted Taxonomies", Tzitzikas et al.

    "A Hierarchy-Aware Approach to Faceted Classification of
     Object-Oriented Components", Damiani, Fugini, Bellettini.

[I just happened to have these two on my desk.]
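
To make the facet idea above a bit more concrete, here is a toy sketch
in which a subject is nothing more than a set of facets, and its PSI is
derived from that set. Everything here is an assumption for illustration:
the example.org base URL, the facet names, and the rule of joining sorted
facet slugs with "+" are not taken from any of the papers cited:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;
import java.util.TreeSet;

// Toy model for illustration; the PSI scheme is invented for this example.
class FacetedSubjects {

    // Assumption: a placeholder base URL under which PSIs would be published.
    static final String PSI_BASE = "http://example.org/psi/";

    /** Lower-cases a facet name and turns runs of other characters into "-". */
    static String slug(String name) {
        return name.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "-");
    }

    /** PSI for a single base facet. */
    static String facetPsi(String facet) {
        return PSI_BASE + "facet/" + slug(facet);
    }

    /**
     * PSI for a composite subject. Facets are sorted first, so the same
     * combination always yields the same URL regardless of input order.
     */
    static String subjectPsi(Collection<String> facets) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String facet : facets) {
            sorted.add(slug(facet));
        }
        return PSI_BASE + "subject/" + String.join("+", sorted);
    }

    public static void main(String[] args) {
        System.out.println(facetPsi("Chemistry"));
        System.out.println(subjectPsi(
                Arrays.asList("Undergraduate", "Chemistry", "Teaching")));
    }
}
```

A page would then be filed under one or more subject PSIs, rather than
under a single fixed category.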

> Well, that's all by now. Very, very interesting discussion for me. This is
> something I have tried to make right for years and as of today I still
> haven't found a good solution. Maybe this time :-)

It's always good to be learning new things, and it always helps
me too to talk through ideas. Glad to be of help.

Murray

......................................................................
Murray Altheim                    http://kmi.open.ac.uk/people/murray/
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK               .
