
 

Communications on a National Project

 Taxonomy Issues

 

3/15/2004 8:46 AM

 

Paul

 

I was thinking about the Readware paper.

 

I find the work fascinating, but it made me realize something while reading through it.

 

Each category could be constructed using a thesaurus for first-stage categorization, followed by human combination of categories as a final, more granular and intelligent categorization of neighborhoods.

 

If you look at each category he lists in the paper, you notice a few core words that epitomize the concept it is getting at.

 

A number of core words are combined into categories, which is what occurs in the second stage of categorization I will discuss.  In the end, however, the "core" word remains within a grouping.

 

This made me think about a thesaurus, and what it really is. If you think about it, a thesaurus has conceptual relationship keys, where words of a similar "idea", "gist", "direction", "focus", etc. are given a relationship to each other.

 

It is *this* attribute that we can take advantage of. It is quite simple, really.

 

Take the list of "center" words for every neighborhood in the text set after the ORB process (including stopword removal, etc.) has run. Read in the related words for each center word from a thesaurus. Group the neighborhoods whose center words are related within the thesaurus. We end up with a relatively small number of initial categories, especially if we use a very good thesaurus and a good stopword list. The better the thesaurus, the more specific the categories; the better the stopword list, the fewer insignificant connections there are among categories, and thus the greater the accuracy.
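As a rough sketch of the grouping step just described, something like the following Python could work. The toy thesaurus, the function name group_center_words, and the choice to link words transitively (two center words land in the same category if they share any thesaurus term) are all assumptions made for illustration, not a settled design.

```python
from collections import defaultdict

# Toy thesaurus invented for illustration; a real one would be read from a file.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car"},
    "truck": {"vehicle"},
    "bank": {"lender"},
    "loan": {"lender", "credit"},
}


def group_center_words(center_words, thesaurus):
    """Group center words that are linked, directly or transitively,
    through shared thesaurus terms (hypothetical helper)."""
    parent = {word: word for word in center_words}

    def find(word):
        # Follow parent links to the representative of the word's group.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # Remember which center word first used each term; a later center
    # word that uses the same term is merged into that word's group.
    first_user = {}
    for word in center_words:
        for term in {word} | set(thesaurus.get(word, ())):
            if term in first_user:
                union(word, first_user[term])
            else:
                first_user[term] = word

    groups = defaultdict(set)
    for word in center_words:
        groups[find(word)].add(word)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    words = ["car", "automobile", "truck", "bank", "loan", "river"]
    for category in group_center_words(words, THESAURUS):
        print(category)
```

Transitive linking keeps the number of initial categories low, at the cost of sometimes chaining together words that are only loosely related; that is the kind of thing the human review stage would catch.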

 

A human user can then view each category and combine or split categories as required. Basically, a relationship table is created for each category and/or word encountered (still deciding on the exact method in my head).

 

These relationship tables are then saved so that future data can be run against them. If a user decides to combine categories, then the thesaurus for those two categories reflects the change: the words of both, along with all related and thesaurus-listed words falling within that grouping, are considered to fall within the combined category.
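One possible shape for such a relationship table, sketched in Python under my own assumptions: the table is a plain mapping from category name to word list, it is saved as JSON, and the helper names merge_categories, save_table, and load_table are hypothetical. The exact method is still undecided, so treat this only as an illustration.

```python
import json


def merge_categories(table, keep, absorb):
    """Return a new table in which category `absorb` is folded into `keep`.
    The table maps a category name to a list of words (assumed layout)."""
    if keep == absorb:
        raise ValueError("cannot merge a category into itself")
    merged = dict(table)
    merged[keep] = sorted(set(table[keep]) | set(table[absorb]))
    del merged[absorb]
    return merged


def save_table(table, path):
    """Write the table to disk so later data can be run against it."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(table, handle, indent=2, sort_keys=True)


def load_table(path):
    """Read a previously saved table back in."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Splitting a category would be the reverse operation: move a chosen subset of words out under a new category name and save the table again.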

 

After this categorization is initially created, new neighborhoods can be brought into the ORB and automatically categorized accordingly. This is, really, a good concept-based taxonomy generation system. We can then use indexing software to retrieve taxonomy-related documents.
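A minimal sketch of how a new neighborhood might be categorized automatically, assuming a saved table that maps category names to word lists and a thesaurus that maps a word to its related words. The function name categorize and the "largest overlap wins" rule are my own assumptions here, not a fixed part of the design.

```python
def categorize(center_word, table, thesaurus):
    """Return the category whose words overlap most with the center word
    and its thesaurus entries, or None if nothing overlaps."""
    terms = {center_word} | set(thesaurus.get(center_word, ()))
    best_name, best_overlap = None, 0
    # Visit categories in sorted order so ties resolve the same way each run.
    for name in sorted(table):
        overlap = len(terms & set(table[name]))
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name
```

A neighborhood that comes back as None matches no existing category and would be a natural candidate for human review.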

 

I can code this up as a demonstration within the next few days. I'm going to have to do some of this in Perl, though, until I can figure out a couple of small algorithm-type problems in VB, after which I will bring it across into one seamless software package.

 

Web text miner<done>,

Stopword remover<done>,

ORB Generator<done>, 

Conceptual Relationship Creator/Manager

 

visualizations,

update agent.   

 

 

Yours truly, 

Nathan Einwechter