Bridging Information and Knowledge
XML 2000, Washington DC,
6 December 2000
Michel Biezunski, mb@infoloom.com
Definitions
Information [Computers]
How data are represented and manipulated. Format, Structure, Platform.
Knowledge [Humans]
What it means, how it can be understood. Domain, Context, Connections.
Building Knowledge
We need to build our knowledge from:
Structured information, carefully organized and prepared to be accessed in a (limited) number of ways.
A mass of unstructured, heterogeneous information, not prepared to be accessed (other than by full-text search).
The Bridges
How Bridges Happen
By providing standards
Standards act as attractors, drawing everybody toward a common interchange platform.
Standards need to be used.
1. Information Management
Structured Documents and Databases
Data is organized into fields/element types.
Schemas are applied. They need to be modeled before they are used.
Information is easier to retrieve.
Inline Markup for Elements and Links
Markup, when embedded into the data, is considered inline.
Difficulties when merging with other markup schemes.
Same situation for links ("simple links"). The anchor which is the origin of the link is the link itself.
Easy to create, difficult to manage on a large scale.
Structure needs to be known at browsing time
Querying exploits existing structure
Sometimes special training is necessary.
OK for closed environments
Not appropriate for the Web.
You can't ask users to learn every underlying structure when they go from one page to another.
Structure: once and for all
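A minimal sketch of why querying depends on knowing the structure in advance. The two catalogs, their element names, and the helper functions are hypothetical; they only illustrate that a query written for one structure silently finds nothing in another.

```python
import xml.etree.ElementTree as ET

# Two hypothetical catalogs recording the same fact with different structures.
catalog_a = ET.fromstring(
    "<catalog><book><title>Moby Dick</title></book></catalog>"
)
catalog_b = ET.fromstring(
    '<library><item kind="book" name="Moby Dick"/></library>'
)

def titles_a(root):
    # Works only if the caller knows titles live in book/title elements.
    return [t.text for t in root.findall("./book/title")]

def titles_b(root):
    # Works only if the caller knows titles live in the name attribute.
    return [i.get("name") for i in root.findall("./item[@kind='book']")]

print(titles_a(catalog_a))  # query matches the structure it was written for
print(titles_b(catalog_b))  # a different query is needed for the other catalog
print(titles_a(catalog_b))  # wrong structure assumed: nothing is found
```

Each new source requires a new query, which is manageable in a closed environment but not across the Web.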
Documenting structure helps.
But what if specific structure is irrelevant to the task?
It may also be good when it is created and then become obsolete.
Or too rigid (see tree-based thesauri)
How does it merge with a different structure?
Information about information
Metadata is usually understood as information added about information.
Example: library catalog
In a book, the title page is inside.
There is no fundamental difference between data and metadata.
However, there is information which is provided originally and information which is added to it subsequently, possibly by parties other than the authors.
Getting meaning out of structure?
Markup facilitates interchange of information
Markup doesn't necessarily carry semantics: <i>, <p> as opposed to <book>, <house>, <u8474yr>.
Markup is not always meaningful (especially when not properly documented).
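A small sketch of the point above, assuming a hypothetical document and a hypothetical documentation table. To the parser every tag name is an opaque label; any meaning comes from documentation kept outside the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing presentational, descriptive, and opaque tags.
doc = ET.fromstring("<u8474yr><i>Moby Dick</i><book>Moby Dick</book></u8474yr>")

# Hypothetical documentation: the only place where tag names acquire meaning.
documentation = {
    "i": "presentation: italic text",
    "book": "subject: a published book",
}

def describe(element):
    # The parser itself attaches no meaning to the tag name.
    return documentation.get(element.tag, "undocumented")

# Map every tag in the document to what the documentation says about it.
report = {el.tag: describe(el) for el in doc.iter()}
print(report)
```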
2. Knowledge Engineering
Getting meaning out of anything
Applying computer-driven algorithms helps make sense of something that was originally unstructured, …
But not always a lot of sense.
However sometimes enough sense to be truly useful.
Knowledge technologies are proprietary
Knowledge-base products usually implement proprietary solutions.
Customers' investment in these technologies is therefore limited.
Systems usually cannot be upgraded or converted to another system.
This is an important limitation in interchanging knowledge.
Interchanging knowledge is becoming a necessity.
Structure is standardized, but often too expensive
Retrofitting masses of information is practically unachievable. What about the Web, for example?
The Web is properly marked up, but not structured.
There is no point in structuring it: it's too big and too spread out.
Putting things together
Benefiting from both:
Standardized, interchangeable structured information
Rich repositories of unstructured information
Solution:
Superimposing a semantic layer above existing information resources, regardless of whether the information is already structured or not.
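One way to picture the superimposed layer, as a sketch only. The resource addresses, their contents, and the function name are hypothetical. The layer refers to resources from the outside and never modifies them, so it treats structured and unstructured sources alike.

```python
# Existing resources, left untouched. Addresses and contents are hypothetical.
resources = {
    # A structured record.
    "db://catalog/row/17": {"title": "Moby Dick", "author": "Melville"},
    # An unstructured text.
    "file://notes/whaling.txt": "Free text about whaling ships and Nantucket.",
}

# The superimposed layer: subjects pointing at resources from the outside.
semantic_layer = {
    "Herman Melville": ["db://catalog/row/17"],
    "Whaling": ["db://catalog/row/17", "file://notes/whaling.txt"],
}

def resources_about(subject):
    # Return (address, content) pairs for a subject, whatever their format.
    return [(addr, resources[addr]) for addr in semantic_layer.get(subject, [])]

for address, content in resources_about("Whaling"):
    print(address, "->", content)
```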
Sharing knowledge among user communities
Knowledge does NOT mean:
Formats
Structures
Platforms
It means:
Content
Subjects
Understanding
Common understanding is what we are looking for here.
RDF (Resource Description Framework)
A generic, neutral, and powerful approach to:
Assign properties to information objects.
Connect information nodes together in a vast network
Create computer-driven processes to exploit this network.
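The three points above can be sketched with plain Python tuples. This is not RDF syntax and uses no RDF library; the addresses, property names, and helper functions are assumptions made for illustration.

```python
from collections import deque

# Each statement is (information object, property, value). All hypothetical.
triples = [
    ("http://example.org/doc1", "creator", "http://example.org/people/mb"),
    ("http://example.org/doc1", "title", "Bridging Information and Knowledge"),
    ("http://example.org/people/mb", "name", "Michel Biezunski"),
]

def properties_of(node):
    # Properties assigned to one information object.
    return {p: v for s, p, v in triples if s == node}

def reachable(start):
    # A computer-driven process: breadth-first walk over the network,
    # collecting every node or value reachable from the start node.
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for s, _, v in triples:
            if s == node and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

print(properties_of("http://example.org/doc1"))
print(reachable("http://example.org/doc1"))
```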
Topic Maps
A generic, neutral and powerful approach to:
Group relevant information about subjects of interest.
Connect these subjects together.
Describe the context of validity within which subjects are connected.
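A rough sketch of these three ideas as Python data classes. It is not the Topic Maps interchange syntax; the class names, field names, and sample data are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    name: str
    # Addresses of information relevant to this subject.
    occurrences: list = field(default_factory=list)

@dataclass(frozen=True)
class Association:
    kind: str
    members: tuple
    scope: str  # context in which the connection is considered valid

# Hypothetical topics grouping relevant information about each subject.
topics = [
    Topic("Herman Melville", ["http://example.org/bio/melville"]),
    Topic("Moby Dick", ["http://example.org/texts/moby-dick"]),
]

# Hypothetical connections between subjects, each valid in a given context.
associations = [
    Association("author-of", ("Herman Melville", "Moby Dick"), scope="literature"),
    Association("hunted-by", ("Moby Dick", "Ahab"), scope="fiction"),
]

def associations_in(scope):
    # Keep only the connections that are valid in the requested context.
    return [a for a in associations if a.scope == scope]

print([a.kind for a in associations_in("fiction")])
```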
Why two overlapping standards?
In a way that's not good.
In a way it's good.
A parallel in science
In chemistry, atoms are elementary building blocks used to create compounds and elements. Structure of matter is based on atoms.
In physics, atoms are complex objects made of neutrons, protons, electrons, and plenty of other stuff.
Do we need both?
Yes.
Why two approaches?
We need to describe:
Knowledge in terms of what users need to model in order to improve navigation.
Connectivity in terms of how computers understand how to get from one node to another in a graph.
Simple proposal:
Topic Maps used for Chemistry, RDF used for Physics. Discussion in progress between authors of RDF and of Topic Maps.
And the bridge?
Topic Maps and RDF are both XML-based standards that come from the Information Technologies side.
Convergence is being pursued.
Since they are used to actually represent knowledge, the next step is to build a series of schemas (for queries, inference rules, template construction rules, etc.) that provide standard answers to knowledge engineering issues.
Improving Global Knowledge Interchange
The key issue is: how do we merge our information?
How can we improve the likelihood that we get a common understanding about the knowledge we want to express?
The answer is: by using meaningful subjects, described in a reliable way, to merge information sources.
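A minimal sketch of subject-based merging, assuming each source keys its entries by a shared, reliable subject identifier. The identifiers, field names, and the merge function are hypothetical.

```python
def merge(*sources):
    # Entries from different sources are combined when they describe
    # the same subject, as indicated by an identical subject identifier.
    merged = {}
    for source in sources:
        for subject_id, info in source.items():
            entry = merged.setdefault(subject_id, {"names": set(), "resources": set()})
            entry["names"] |= set(info["names"])
            entry["resources"] |= set(info["resources"])
    return merged

# Two hypothetical sources that name the same subject differently.
source_1 = {
    "http://example.org/subjects/melville": {
        "names": ["Herman Melville"],
        "resources": ["http://example.org/bio/melville"],
    },
}
source_2 = {
    "http://example.org/subjects/melville": {
        "names": ["Melville, H."],
        "resources": ["http://example.org/texts/moby-dick"],
    },
    "http://example.org/subjects/whaling": {
        "names": ["Whaling"],
        "resources": [],
    },
}

combined = merge(source_1, source_2)
print(sorted(combined))
```

The merge works only because both sources agree on the identifier. Names alone ("Herman Melville" versus "Melville, H.") would not have matched.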