Sunday, April 10, 2005
Global Information Framework
The Readware Semantic Extraction component
See the progression of the demonstration of Readware-style
knowledge extraction at:
http://www.readware.com/cbp.html
The ConceptBase is a Readware glossary of 2,750 ontological structural definitions. Each definition is associated with several words in every language we have processed so far (English, German, French). Definitions are expressed in a few dozen abstract terms that admit hundreds of possible interpretations. Each interpretation is a concept structure expressed in simple language and has two cognitive functions:
a) to organize vocabulary in the human mind, and
b) to suggest query structures for knowledge extraction.
Query structures are realized using the Readware query language. Up to now we have not developed a user interface that fully exploits this language. We are also aware that the three part Adi structural ontology papers suggest that
Many computer scientists and engineers have asked us to take
some common base of information and produce a presentation that shows how
Readware is used as a Language Reference Model and Knowledge Extraction tool
for ontological projects.
We define an ontological project as a study of entities and
relations in data or information. We have performed many indexing
projects on arcane texts, particularly Aesop's fables, the Bible,
American Indian texts and Civil War texts. These projects have allowed
us to refine our internal use of the structural ontology that exists
within Readware.
While discussing the requirements for semantic extraction tools
for ontologies, and due to the need to aggregate information into a few
categories, Dr. Prueitt suggested an interrogatory framework for
investigation that he has called an event Structured Ontology Framework
(e-SOF).
While Readware does not have visualization tools, it can extract
the semantic elements (entities and their relations) out of which the e-SOF
emerges. This can be seen in the
demonstration.
The demonstration also allows ad-hoc experimentation via the
construction of different tuples:
<why,people,time point>,
<how,goods-item,time point>
and derivatives, e.g.,
<why,goods-item,where>
to extract from the file contents.
In our work with Dr. Prueitt, we added further categories
and topics to the standard commercial release of Readware.
The tuple is used as a query.
The Readware functions exposed here perform a paring process over the
10,574 letters to retrieve those administrative rulings whose contents conform
to this tuple.
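As a rough illustration of a tuple used as a query, the sketch below filters a small in-memory collection by category labels. It is not the Readware query language. The `Letter` type, the `matches` and `retrieve` helpers, and the sample letters are all hypothetical, and the sketch assumes each letter has already been tagged with the categories found in its text.

```python
from dataclasses import dataclass, field


@dataclass
class Letter:
    """Hypothetical stand-in for one ruling letter and the categories found in it."""
    title: str
    categories: set = field(default_factory=set)


def matches(letter, query):
    """True when every element of the query tuple is among the letter's categories."""
    return all(element in letter.categories for element in query)


def retrieve(letters, query):
    """Return the titles of the letters whose contents conform to the tuple."""
    return [letter.title for letter in letters if matches(letter, query)]


# Invented sample data, for illustration only.
letters = [
    Letter("Letter A", {"why", "people", "time point"}),
    Letter("Letter B", {"how", "goods-item", "time point", "where"}),
    Letter("Letter C", {"why", "goods-item", "where"}),
]

print(retrieve(letters, ("why", "people", "time point")))
print(retrieve(letters, ("why", "goods-item", "where")))
```

Treating the tuple as a conjunction (every element must be present) is only one possible reading; the demonstration itself may weigh or order the elements differently.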
Clicking the title of the document will display the contents of
the letter, highlighting relevant items.
You should be able to remember the tuple and recognize how the data fits
the tuple.
Dr. Prueitt’s e-SOF organizes the co-occurrence of responses to
the set of 18 questions. This work is still a bit in the future, but
has its foundation in the formalisms Prueitt invented called
categorical abstraction and event chemistry.
As this work is completed, we will post an online presentation
of Readware's capability to perform knowledge extraction from plain text,
including the preparation of the topics and classes of things it takes to
undertake such a study.
Any individual should be able to use the online interface to
study the contents of these administrative rulings in the context of who the
importers are, what they are importing and how customs officials classify goods
and deal with classifications and the disputes and controversy that arise as a
result.
One should be able to spend an hour here and come away feeling
informed about the details of commercial goods classification. In that
case, our vision for a product of Readware technology is achieved. If
you do not begin seeing results and becoming informed after ten
minutes, call me at 1-352-371-5931 (speak up if a machine answers; I
may still be reachable). I can walk you through some sampling
exercises.
The importance of four factors in this demonstration should not
be underestimated.
a) The raw cost to produce this study for these 10,574 pages
was roughly $0.30/page, not including IT resources. This cost is
expended on hand re-definition of parsing processes. No annotation to
source data is required. No pre-processing or modification to source
data is required.
b) This collection can be extended to 100,000 or more documents
at no additional cost, except storage and IT resource overhead.
c) The schema, classes, entities and topics (everything) can be further
refined, modified and extended without reprogramming. Automatic
compiling and incremental indexing are features of Readware.
d) While the demonstration hardly made use of concepts in the
Readware ConceptBase, the use of the ConceptBase as a language reference model
allowed us to focus on the information extraction task without overriding
concerns about the vocabulary. The software infrastructure
(space + knowledgebase (cultures) + intelligent structures (queries, text))
made it straightforward to implement. It should be clear that Readware
algorithms compute comparisons between structures.
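To put the figures in points a) and b) side by side, here is a small worked calculation. It assumes the rounded $0.30/page rate applies uniformly and, for the extension to 100,000 documents, treats one page as one document; both are simplifying assumptions, and the function names are made up for this sketch.

```python
PAGES = 10_574            # pages in the demonstration collection (from the text)
COST_PER_PAGE_CENTS = 30  # "roughly $0.30/page", kept in cents to avoid float rounding
EXTENDED_DOCUMENTS = 100_000


def total_cost_cents(pages, cents_per_page):
    """Total hand-definition cost for the collection, in cents."""
    return pages * cents_per_page


def effective_cents_per_document(total_cents, documents):
    """Cost per document if the same one-time cost covers a larger collection."""
    return total_cents / documents


total = total_cost_cents(PAGES, COST_PER_PAGE_CENTS)
effective = effective_cents_per_document(total, EXTENDED_DOCUMENTS)

print(f"Approximate study cost: ${total / 100:,.2f}")
print(f"Effective cost at {EXTENDED_DOCUMENTS:,} documents: {effective:.2f} cents each")
```

Under these assumptions the study cost comes to about $3,172, and spreading it over 100,000 documents brings the per-document figure down to roughly three cents, storage and IT overhead aside.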
Some changes to the topic tree have been made since it was
first posted online. An explanation that grounds the presentation has
also been added.
We still plan to sort the topic list with each result so that
each class falls together, then we will leave it in this state and publicly
available as we move to other projects that demand our attention.
Scientists and engineers should take note of the reasoning
Readware performed on the indication of a date, e.g., whether it lies
in the past, present or future. The subsumption can be seen
in the culture files (culture 5 mainly) in the topics specifying the past,
present and future. It is not the only
example or type of reasoning in this presentation.
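The kind of date reasoning described here can be pictured with a very small sketch. It does not reproduce the culture files; it only assumes that a date mentioned in a letter is compared against some reference date (for example the date of the letter) and subsumed under one of three broader topics. The function name and the sample dates are hypothetical.

```python
from datetime import date


def temporal_class(mentioned, reference):
    """Subsume a specific date under 'past', 'present' or 'future'
    relative to a reference date."""
    if mentioned < reference:
        return "past"
    if mentioned > reference:
        return "future"
    return "present"


# Invented dates, for illustration only.
reference = date(2005, 4, 10)
for mentioned in (date(2004, 12, 1), date(2005, 4, 10), date(2005, 9, 30)):
    print(mentioned.isoformat(), "->", temporal_class(mentioned, reference))
```

A real topic specification would presumably also handle vague expressions ("last spring", "within 30 days"), which this comparison of exact dates does not attempt.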
For AI and reasoning advocates, there are many examples of
Horn-clause logic, which is used extensively in setting the context for
the inclusions and exclusions necessary to the topic specification
logic. This is similar to the DL of OWL (OIL), etc.
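For readers who want a concrete picture of Horn-clause rules combined with inclusions and exclusions, here is a minimal sketch. It is not the syntax of the Readware culture files. The rule format, the atom names and the `topic_applies` helper are invented for illustration, and the exclusion check is plain negation-as-failure, which goes slightly beyond pure Horn clauses.

```python
def forward_chain(facts, rules):
    """Derive every fact that follows from definite (Horn) clauses.

    Each rule is (head, body): the head holds when all atoms in body hold.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in derived and all(atom in derived for atom in body):
                derived.add(head)
                changed = True
    return derived


def topic_applies(derived, include, exclude):
    """All inclusions must hold and no exclusion may hold (negation-as-failure)."""
    return (all(atom in derived for atom in include)
            and not any(atom in derived for atom in exclude))


# Invented rules and facts, for illustration only.
rules = [
    ("goods_item", ["mentions_footwear"]),
    ("classification_dispute", ["goods_item", "mentions_protest"]),
]
facts = {"mentions_footwear", "mentions_protest"}

derived = forward_chain(facts, rules)
print(topic_applies(derived,
                    include=["classification_dispute"],
                    exclude=["mentions_withdrawal"]))
```

Adding the fact `mentions_withdrawal` to the same input would make the exclusion fire and the topic would no longer apply.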
Those knowledgeable in the art should be particularly interested in
examining the structures of the relevant Readware culture files (the
knowledgebase about the language of customs letters).
See it, try it all at: http://www.readware.com/cbp.html