Communications on a National Project
3/15/2004 8:46 AM
Paul
I was thinking about the Readware paper.
I find the work fascinating, but it
made me realize something while reading through it.
Each category could be constructed
using a thesaurus for first stage categorization, and human combination of
categories as a final, more granular and intelligent, categorization of
neighborhoods.
If you look at each category he lists in the paper, you will notice
a few core words that epitomize the concept the category is
getting at.
A number of core words are combined
into categories, which is what occurs in the second stage of categorization I
will discuss. In the end, however,
the "core" word remains within a grouping.
This made me think about a
thesaurus, and what it really is. If you think about it, a thesaurus has
conceptual relationship keys, where words of a similar "idea",
"gist", "direction", "focus", etc. are given a
relationship to each other.
It is *this* attribute that we can
take advantage of. It is quite simple really.
Take a list of "center" words
for every neighborhood in the text set after the ORB process (including
stopword removal etc.) has occurred. Read in a list of related words via a
thesaurus. Group all related neighborhoods according to relationships within
the thesaurus. We now end up with a relatively low number of initial
categories. This is especially true if one uses a very good thesaurus and
a good stopword list. The better the thesaurus, the more specific the
categories are; the better the stopword list, the fewer insignificant
connections there are among categories, and thus the greater the accuracy.
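As a rough sketch of this first stage (in Python rather than Perl or VB, purely for illustration), the grouping could look something like the following. The function name, the dictionary layout of the thesaurus, and the toy word lists are all my own assumptions, not anything taken from the Readware paper or the ORB code. It also assumes two center words belong together only when the thesaurus lists one under the other, directly or through a chain of other center words.

```python
from collections import defaultdict

def group_by_thesaurus(center_words, thesaurus):
    """Group center words into initial categories.

    Assumption: `thesaurus` maps a word to a list of related words.
    Two center words share a category when one is listed as related
    to the other, directly or through a chain of other center words.
    """
    parent = {w: w for w in center_words}

    def find(w):
        # Follow parent links to the representative of w's group.
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for word in center_words:
        for related in thesaurus.get(word, ()):
            if related in parent:  # only link words that are center words
                parent[find(related)] = find(word)

    groups = defaultdict(set)
    for w in center_words:
        groups[find(w)].add(w)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical toy data, for illustration only.
thesaurus = {
    "car": ["automobile", "vehicle"],
    "automobile": ["car"],
    "truck": ["vehicle", "lorry"],
    "joy": ["happiness"],
}
centers = ["car", "automobile", "truck", "joy", "happiness", "river"]
categories = group_by_thesaurus(centers, thesaurus)
# categories == [['automobile', 'car'], ['happiness', 'joy'], ['river'], ['truck']]
```

Note that "car" and "truck" stay apart here even though both list "vehicle", because "vehicle" is not itself a center word. Whether shared thesaurus entries should also link categories is exactly the kind of thing the human pass could decide.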
A human user can then view each
category and combine or split categories as required. Basically, a relationship
table is created for each category and/or word encountered (still deciding on
the exact method in my head).
These relationship tables are
then saved so that future data can be run against them. If a user decides to
combine categories, then the thesaurus for those two categories reflects the
change: the words of both, along with all related and thesaurus-listed words,
are considered to fall within the combined category.
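A minimal sketch of the combine step and the saved relationship table might look like this. Since the exact table layout is still undecided, everything here is an assumption: the category ids, the word-to-category dictionary, and the choice of JSON for saving are placeholders for illustration.

```python
import json

def merge_categories(categories, keep, absorb):
    """Return a copy of `categories` with `absorb` folded into `keep`."""
    merged = {cid: set(words) for cid, words in categories.items()}
    merged[keep] |= merged.pop(absorb)
    return merged

def build_table(categories, thesaurus):
    """Map every category word, plus its thesaurus-listed words, to a category id.

    A word that is a direct member of a category always maps to that
    category; a word reached only through the thesaurus goes to the
    first category that claims it.
    """
    table = {}
    for cat_id, words in categories.items():
        for word in words:
            table[word] = cat_id
            for related in thesaurus.get(word, ()):
                table.setdefault(related, cat_id)
    return table

# Hypothetical categories and thesaurus entries, for illustration only.
categories = {
    "transport": {"car", "automobile"},
    "haulage": {"truck"},
    "emotion": {"joy", "happiness"},
}
thesaurus = {"car": ["vehicle"], "truck": ["lorry", "vehicle"], "joy": ["delight"]}

merged = merge_categories(categories, keep="transport", absorb="haulage")
table = build_table(merged, thesaurus)
saved = json.dumps(table, sort_keys=True)  # text that could be written to disk
```

After the merge, "truck" and its thesaurus entry "lorry" both resolve to the combined "transport" category, which is the behavior described above.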
After this categorization is
initially created, new neighborhoods can be brought into the ORB and automatically
categorized accordingly. This is, really, a good concept-based taxonomy
generation system. We can then use indexing software to retrieve
taxonomy-related documents.
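Automatic categorization of a new neighborhood could then be a simple lookup against the saved table. The sketch below assumes a neighborhood is a center word plus a list of surrounding words, and that an unknown center word falls back to a vote among the surrounding words; both are my assumptions for illustration, and the names are hypothetical.

```python
from collections import Counter

def categorize(center, context_words, table):
    """Return a category id for a new neighborhood, or None if nothing matches.

    The center word decides when it is in the table. Otherwise the
    category named most often by the context words is used. None
    leaves the neighborhood for a human to place.
    """
    if center in table:
        return table[center]
    votes = Counter(table[w] for w in context_words if w in table)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical saved relationship table, for illustration only.
table = {"car": "transport", "vehicle": "transport",
         "truck": "transport", "joy": "emotion"}

known = categorize("car", [], table)                            # center word is in the table
voted = categorize("lorry", ["truck", "vehicle", "joy"], table) # decided by context vote
unknown = categorize("river", ["bank"], table)                  # nothing matches
```

Returning None rather than guessing keeps unmatched neighborhoods visible, so the user can create or extend a category for them.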
I can code this up as a
demonstration within the next few days. I'm going to have to do some of this in
Perl, though, until I can figure out a couple of small algorithm-type problems
in VB, after which I will bring it across into one seamless software
package:
Web text miner <done>,
Stopword remover <done>,
ORB Generator <done>,
Conceptual Relationship Creator/Manager,
visualizations,
update agent.
Yours truly,
Nathan Einwechter