
Sunday, August 29, 2004


Method for Ambiguation/Disambiguation:

Design and Encoding Overview

Nathan Einwechter

InOrb Technologies/OntologyStream Inc.

August 21, 2004

(still under edit as of 1:25 PM Eastern time)

Introduction

Ambiguation and disambiguation are two opposite operations. Both are foundational capabilities that extend the power of the OrbSuite™ further into the HIP domain, as envisioned by Dr. Paul S. Prueitt and outlined in many OntologyStream Inc. (OSI) papers and presentations.

This paper, however, documents a renewed vision for allowing these processes to occur and presents a simplified encoding structure that removes much of the complexity and reduces memory usage.

Definitions

· Ambiguation

o Within the context of language processing in general

§ Brings multiple concept constructions and word neighborhoods together to form a single common concept construction

o Within the context of Orbs

§ Brings multiple Orb entries together to form a single entry

· Disambiguation

o Within the context of language processing in general

§ Separates a single concept construction or set of word neighborhoods into multiple concept constructions

o Within the context of Orbs

§ Pulls a single Orb entry apart into multiple distinct Orb entries

· Two (2) Types:

o Disambiguation of original Orb entries

o Disambiguation of previously ambiguated Orb entries

Detail

Ambiguation

Previously, ambiguation was to be done by directly manipulating the Orb tables themselves. This solution, though direct, creates usability problems since no management capabilities were allowed within this scheme. In our new work, an ambiguation word list management interface has been designed which is very similar to the Stopword/Goword/Available Words list system within the OrbSuite™.

The ambiguation list management system is to be linked directly with any thesaurus system and will operate as a distinct list management system. It is important to note that the Thesaurus is one type of ambiguation in that it brings together concepts, but is distinct in that it brings them together before the Orb is built.

One needs both a thesaurus that is used in the measurement of original text as Orbs are produced, and a thesaurus resource that can be manipulated and used to create disambiguation/ambiguation on existing Orbs. A class of thesaurus services is envisioned. The thesaurus services will allow easy-to-understand manipulation of category membership, so that two words are treated the same, or a single word is treated in different ways depending on the linguistic environment in which the word is measured to occur.

The ambiguation list contains linkage among words. As such, it will be a TreeView type of list with parents (roots) and children (leaves).
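As a rough illustration of the root/child linkage, the list could be held as a mapping from each root to its children. This is only a sketch in Python: the names `ambiguation_list` and `root_of` are hypothetical, and the quick/fast/agile grouping is borrowed from the examples in this paper rather than from any OrbSuite™ data format.

```python
# Hypothetical in-memory form of the ambiguation list: root -> children (leaves).
ambiguation_list = {
    "quick": ["fast", "agile"],
}


def root_of(word, ambiguation_list):
    """Return the root a word is ambiguated under, or the word itself if unlisted."""
    for root, children in ambiguation_list.items():
        if word == root or word in children:
            return root
    return word
```

A word that appears nowhere in the list is treated as its own root, so unlisted Orb centers are left untouched.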

There are two (2) different types of encoding currently available for ambiguation under this design: blind encoding and transparent encoding.

1) Blind Encoding

Blind encoding allows concepts to be linked by the ambiguation table and brought together under the root.

Example:

Ambiguation Table

Root	Children
quick	fast, agile

Original Orb

Center	Neighborhoods
quick	fox, brown, jumped, 1.txt
jack	rabbits, run, away, again, 5.txt
fast	jets, screamed, over, 125.txt
agile	birds, duck, wires, 3.txt

Resulting Orb

Center	Neighborhood
quick	fox, brown, jumped, 1.txt; birds, duck, wires, 3.txt; jets, screamed, over, 125.txt
jack	rabbits, run, away, again, 5.txt

Blind thesaurus encoding

“Blind encoding” takes up the least amount of memory and is expected to be the most commonly used for general purposes. Blind encoding forces the end user and those viewing the results to see the resultant Orb entry as a single entity, and not as multiple Orb entries brought together. This is the essence of ambiguation.

In blind encoding an Orb must be completely re-generated when the ambiguation table is modified in any way. This means that the original text or data set that the Orb was created from must be available to the program to read while making these changes.
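The blind merge can be sketched in a few lines of Python. This is an assumption-laden illustration, not the OrbSuite™ implementation: the Orb is modeled as a dictionary from center to a list of neighborhood strings, and `blind_encode` is a hypothetical name. The order in which neighborhoods are appended is arbitrary here.

```python
def blind_encode(orb, ambiguation_list):
    """Merge each child's neighborhoods into its root's entry (blind encoding)."""
    # Invert root -> children into child -> root for quick lookup.
    child_to_root = {
        child: root
        for root, children in ambiguation_list.items()
        for child in children
    }
    merged = {}
    for center, neighborhoods in orb.items():
        root = child_to_root.get(center, center)
        merged.setdefault(root, []).extend(neighborhoods)
    return merged


original_orb = {
    "quick": ["fox, brown, jumped, 1.txt"],
    "jack": ["rabbits, run, away, again, 5.txt"],
    "fast": ["jets, screamed, over, 125.txt"],
    "agile": ["birds, duck, wires, 3.txt"],
}
resulting_orb = blind_encode(original_orb, {"quick": ["fast", "agile"]})
```

The merged entry keeps no record of which center each neighborhood came from, which is why a change to the ambiguation table requires re-generating the Orb from the original text.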

Transparent thesaurus encoding

Unlike blind encoding, transparent encoding puts the origin of any given ambiguated concept directly into the encoding so one can see precisely where each element in the subject matter indicator neighborhood came from. Transparent encoding allows for the full manipulation of an Orb’s ambiguation table (as well as other operations which occur directly on the Orb) to occur without having to completely re-generate the Orb, which means the original text set is also not required.

Using the original Orb and ambiguation table from the first example, but with transparent encoding:

Resultant Orb

Center	Neighborhood
quick/fast/agile	fox, brown, jumped, 1.txt/jets, screamed, over, 125.txt/birds, duck, wires, 3.txt
jack	rabbits, run, away, again, 5.txt

Compare with the blind encoding of the same Orb:

Center	Neighborhood
quick	fox, brown, jumped, 1.txt; birds, duck, wires, 3.txt; jets, screamed, over, 125.txt
jack	rabbits, run, away, again, 5.txt
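The difference between the two tables can be made concrete with a small Python sketch. It assumes, for illustration only, that each center has a single neighborhood string and that "/" never occurs inside a word or file name; `transparent_encode` and `transparent_decode` are hypothetical names.

```python
def transparent_encode(orb, ambiguation_list):
    """Join grouped centers and their neighborhoods with '/', keeping positions aligned."""
    group_of = {}
    for root, children in ambiguation_list.items():
        members = [root, *children]
        for word in members:
            group_of[word] = members
    encoded = {}
    done = set()
    for center in orb:
        if center in done:
            continue
        members = [w for w in group_of.get(center, [center]) if w in orb]
        encoded["/".join(members)] = "/".join(orb[w] for w in members)
        done.update(members)
    return encoded


def transparent_decode(encoded):
    """Recover the original entries from the encoding alone (no source text needed)."""
    orb = {}
    for key, value in encoded.items():
        for center, neighborhood in zip(key.split("/"), value.split("/")):
            orb[center] = neighborhood
    return orb


original_orb = {
    "quick": "fox, brown, jumped, 1.txt",
    "jack": "rabbits, run, away, again, 5.txt",
    "fast": "jets, screamed, over, 125.txt",
    "agile": "birds, duck, wires, 3.txt",
}
encoded_orb = transparent_encode(original_orb, {"quick": ["fast", "agile"]})
```

Because the key and the neighborhood string are split at matching positions, the original entries can be read back directly from the encoded Orb.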

The encoding type will be selectable from the ambiguation list area of the program. Ultimately, the encoding type can be selected automatically from user input through a project wizard interface, which will co-ordinate and reconcile the various encoding mechanisms and capabilities according to the user’s requirements for a specific project.

The ambiguation table can be populated by the user in a number of ways:

1) “Key” words directly inputted into the table

2) Multiple concepts are highlighted in the Orb Results window, then ambiguated together into the table by an option within a right-click menu (user prompted for root word)

3) Inherited from Thesaurus

4) Imported from another project

The ambiguation table is to be automatically saved with the project when the project itself is saved.

Disambiguation

The HIP (Human-centric Information Production) process of disambiguation requires user interaction. HIP acknowledges that a broad set of rules cannot always be directly applied to achieve the correct disambiguation, and as such each disambiguation operation occurs according to specific information. Obtaining this information is the key to both disambiguation and ambiguation.

The encoding of information must split neighborhoods.

Automated methods for disambiguation are rather complex and require machine-learning code implementing techniques such as hidden Markov models or latent semantic indexing.

HIP bypasses all of these algorithms and uses a simplified encoding system, similar to the transparent encoding of ambiguation, to separate neighborhoods.

The disambiguation of the concept:

quick | brown, fox, jumped, 1.txt; jets, screamed, over, 125.txt

into 2 concepts is encoded as:

quick | brown, fox, jumped, 1.txt/jets, screamed, over, 125.txt

By keeping the now-separated neighborhoods in a single hash-table encoding, we save an increasingly large amount of space as the number of disambiguations within an Orb grows. If a concept that has been disambiguated is later to be ambiguated, the user is given the option of choosing which elements of the concept are to be ambiguated. The encoding allows us to distinguish purely disambiguated concepts from ambiguated concepts (which use the transparent encoding) by observing the number of words within the key.
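The key-based distinction can be sketched as follows. This is a hypothetical Python helper, assuming the "/" separator shown in the examples in this paper: more than one word in the key indicates a transparently ambiguated entry, while a single-word key whose value contains "/" indicates a purely disambiguated one.

```python
def classify_entry(key, value):
    """Classify an Orb entry by counting the words in its key."""
    centers = key.split("/")
    parts = value.split("/")
    if len(centers) > 1:
        return "ambiguated"      # several centers share one entry
    if len(parts) > 1:
        return "disambiguated"   # one center, neighborhoods split apart
    return "plain"               # untouched original entry
```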

The second form of disambiguation is when an ambiguation has already been done with transparent encoding. Removing an element from the ambiguation table is one way to do this. It is here that we risk running into problems. As such, a distinction between explicit and implicit ambiguation and disambiguation needs to be made.

By removing a word from the ambiguation table, we create an implicit disambiguation. We are disambiguating an ambiguated concept by merely returning it to its original state, not disambiguating an original concept. It is only by adding a disambiguation entry to the disambiguation table that an explicit disambiguation occurs. Although the two are much the same, explicit disambiguations help us to observe what the end user did and is doing, and leave a history of what has been done so that we can recall why certain things look the way they do.
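An implicit disambiguation on a transparently encoded entry might look like the sketch below. It is written in Python under the assumption that centers and neighborhoods are aligned by position around "/" separators, as in the transparent encoding example; `remove_from_group` is a hypothetical name.

```python
def remove_from_group(key, value, word):
    """Take one word out of a transparently encoded entry and restore its original entry."""
    centers = key.split("/")
    parts = value.split("/")
    i = centers.index(word)  # raises ValueError if the word is not in the key
    restored = (centers.pop(i), parts.pop(i))
    remaining = ("/".join(centers), "/".join(parts))
    return remaining, restored


remaining, restored = remove_from_group(
    "quick/fast/agile",
    "fox, brown, jumped, 1.txt/jets, screamed, over, 125.txt/birds, duck, wires, 3.txt",
    "fast",
)
```

Nothing is written to a disambiguation table here, which is what makes the operation implicit rather than explicit.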

Unlike the ambiguation table, the disambiguation table must be thrown out every time an Orb is re-generated. The disambiguation table contains a listing of each disambiguation operation in the form:

Concept center	Split Locations
quick	1

This allows the user to see where disambiguations have occurred, and will allow the user to double click on an entry to go directly to the entry, or double click on a split location to go directly to that neighborhood location to see the neighborhoods split off.

If a neighborhood to be split off sits in the middle of other neighborhoods that are to remain together, the neighborhood that is to be split off is moved to the end of the neighborhood listing.
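The reordering can be illustrated with a short Python sketch. It assumes the separators used in the examples in this paper (";" between neighborhoods that stay together, "/" before each split-off neighborhood) and a zero-based `index` into the neighborhoods that are still together; `split_off` is a hypothetical name.

```python
def split_off(value, index):
    """Move the neighborhood at `index` to the end of the listing, after a '/'."""
    kept_part, *already_split = value.split("/")
    kept = [n.strip() for n in kept_part.split(";")]
    if len(kept) < 2:
        raise ValueError("nothing would remain together after the split")
    moved = kept.pop(index)
    return "/".join(["; ".join(kept), *already_split, moved])


entry = "fox, brown, jumped, 1.txt; birds, duck, wires, 3.txt; jets, screamed, over, 125.txt"
split_entry = split_off(entry, 1)
```

Here the middle neighborhood is sent to the end, so the two neighborhoods that remain together stay adjacent.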

The disambiguation encoding is complementary to the transparent encoding available for ambiguation, and is similar in that it allows disambiguation operations to occur without having to re-generate the Orb or have access to the original text set used to create the Orb. If these two encodings are used together, the Orb is fully transportable without the original text set.

Conclusion

By enabling ambiguation and disambiguation operations to occur within the Orb system, we provide the end user with a better capability to manage the conceptual content held within Orbs, and narrow/refine result sets according to their specific domain of application as part of the HIP process.
