[topicmapmail] DMOZ in XTM

Jan Algermissen algermissen@acm.org
Thu, 07 Oct 2004 15:16:52 +0200


Marcel,

Marcel Ferrante wrote:

> - As for DMOZ, at the heart of the project I don't suggest a big XTM
> file for my system to query. I have downloaded the DMOZ file and
> it is more than 1 GB... Being practical, I suggest implementing an ER
> model of the topic map concepts, which is different from XTM, as you said.
>   - So, the next question is: why am I using XTM after all?
>     - To interchange the data: if someone wants to do his
> classification offline, or to make the service available to other
> applications. From this point of view, web services come as the next
> step.

Yes, exactly! There is no point in maintaining an ontology (any kind of
data, actually) as an XTM document, probably not even as a topic map in
a topic map engine[1]. The issue is to make data available *as* a
topic map (looking at the data through topic map eyes). I realized this
after converting some thesauri into XTM...it felt so useless, given that
the thesaurus was already stored in a suitable format.

The key (IMHO) is to use XTM as the message MIME type (assuming we'll
have application/xtm+xml at some point in time) for HTTP-based
interactions with data providers (services/stores) such as DMOZ.

Why don't you, for experimental purposes, write a CGI that mimics
XTM-based communication with www.dmoz.org, by scraping DMOZ's HTML and
turning it into XTM? I did that once for Google's link: feature - it's
fun and very educational.
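
A rough sketch of the scraping idea, in Python. Everything here is an
assumption made for illustration: the helper names (LinkCollector,
html_to_xtm, cgi_response) are made up, the HTML is passed in as a string
rather than fetched from www.dmoz.org, and every link on the page simply
becomes one XTM 1.0 topic with a base name and an occurrence. Real DMOZ
pages would need parsing that knows their actual markup, and
application/xtm+xml is only the hoped-for MIME type mentioned above, not a
registered one.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Serialize XTM as the default namespace and XLink with the usual prefix.
ET.register_namespace("", XTM_NS)
ET.register_namespace("xlink", XLINK_NS)


def xtm(tag):
    """Return the namespace-qualified name of an XTM element."""
    return "{%s}%s" % (XTM_NS, tag)


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def html_to_xtm(html):
    """Turn each link in an HTML page into one XTM 1.0 topic."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    topic_map = ET.Element(xtm("topicMap"))
    for n, (href, title) in enumerate(collector.links, start=1):
        topic = ET.SubElement(topic_map, xtm("topic"), {"id": "link-%d" % n})
        base_name = ET.SubElement(topic, xtm("baseName"))
        # Fall back to the URL when the link has no visible text.
        ET.SubElement(base_name, xtm("baseNameString")).text = title or href
        occurrence = ET.SubElement(topic, xtm("occurrence"))
        ET.SubElement(occurrence, xtm("resourceRef"),
                      {"{%s}href" % XLINK_NS: href})
    return ET.tostring(topic_map, encoding="unicode")


def cgi_response(html):
    """Wrap the XTM document in a minimal CGI-style response."""
    return "Content-Type: application/xtm+xml\r\n\r\n" + html_to_xtm(html)
```

A CGI script built on this would fetch the category page, pass the HTML to
cgi_response() and print the result. The point is only that the client
sees a topic map, while the data stays wherever it already lives.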

Jan

[1] For highly dimensional data it does make sense, but usually the
domain that the data is about is in itself constraining enough to
justify storage in a relational database.
> 
> "since creating an ontology for life, the universe, and everything is
> quite a challenge."
> 
> - Let's start with simplicity. The focus at the beginning is organizing the URLs.
> 
>   - The objective at first is to fill the gaps in DMOZ. To me this
> project has stopped in time. It has been the same thing, the same
> procedure for the user, for three years. Points to attack:
>     - The structure of DMOZ confuses concepts. In the same
> taxonomy we can find aggregation, specialization, localization, etc.
>     - They use a poor faceted classification. A resource (URL)
> appears in many topics, but what about the topic itself? It should
> allow this too.
>     > So their structure should be divided into faceted categories, as
> is done in projects like Flamenco or FacetMap. To divide it we can use
> a good web thesaurus (e.g. EuroVoc).
>     > And the main point: the user must be able to
> classify the URLs and topics using the topic maps concepts. Maybe
> there could be a wizard to train the user to do this.
>     - The navigation shows only one hierarchical level. So, to reach
> the end of a branch the user has to wait for the page to refresh five
> or six times. Very, very boring !!
>     >  See www.knowledgeprocessors.com
>     - The search in the directory (by Google) shows the URLs in the topics.
>     > I want to produce a filter or reflection in the structure. That's
> navigation combined with search, like Flamenco
> (http://bailando.sims.berkeley.edu/flamenco-interface.html)
> 
>   - Build a prototype to gauge the reactions.
>      - In the beginning I'm thinking of just using MySQL, which is free,
> but we can use Oracle if the project grows.
> 
> - For the future of the project we can think about:
>   - Building client software so the user can do his classification
> more quickly or offline.
>   - Retrieving the best URL classifications already done: the favorites
> or bookmarks of the users.
>   - Not limiting the topic maps crawler to the DMOZ project;
> Wikipedia is the next victim (and I see Google in the last battle,
> if Bill doesn't arrive first..)
> 
> To finalize: "as well as man-hours and sheer know-how"
> I'm writing from Brazil. Thank you for your attention; I was very
> surprised when the answer arrived from a name that I knew from the
> thesis that I read. Pardon my English. And you can divide your costs
> by 5 if the project is done here. I'm serious, this is simply a fact.
>