
ORB Visualization

(soon)

 

 

Orb Projections

Example to illustrate the point

 

 

Communicated by Paul Prueitt: 12/26/2003 2:55 PM

 


Orb Projections

 

This tutorial focuses on how a controlled vocabulary is created and then used to produce a topological cover of the “largest Orb” related to a specific document collection.  The “largest Orb” is the one that encodes all co-occurrence relationships within a parameter-specified measurement window and measurement boundary.  For most of the Orbs we have studied, the measurement is made with a word-level n-gram window that is passed over sentences.

 

In most of the examples word level n-gram windows have been where n=5.  The measurement window and measurement boundary parameters can be varied to produce slightly different results.  To illustrate, the “3-grams” of the first sentence of this paragraph are:

 

{ In most of, most of the, of the examples, the examples word, examples word level,

word level n-gram, level n-gram windows,

n-gram windows have, windows have been, have been where, been where n=5 }
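The window pass can be sketched in a few lines of Python. This is a minimal illustration, assuming words are separated by whitespace and the trailing period is dropped; the function name word_ngrams is ours and is not part of any Orb software.

```python
def word_ngrams(sentence, n):
    """Slide a window of n words over one sentence and return the n-grams in order."""
    # Assumption: words are whitespace-separated; a trailing period is dropped.
    words = sentence.rstrip(".").split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "In most of the examples word level n-gram windows have been where n=5."
trigrams = word_ngrams(sentence, 3)
print(len(trigrams))   # 11
print(trigrams[0])     # In most of
print(trigrams[-1])    # been where n=5
```

A sentence of 13 words yields 13 - 3 + 1 = 11 windows, which matches the set listed above.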

 

The boundary of the n-gram measure may be the sentence boundary or a larger boundary, say three consecutive sentences.  In the examples the measurement boundary has been the sentence, so the last word of one sentence is not deemed to have a co-occurrence relationship with the first word of the next sentence.
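The effect of the measurement boundary can be sketched as follows. The text does not say exactly how a window yields co-occurrence pairs, so this sketch assumes that every earlier/later pair of words sharing some n-word window co-occurs, that a period ends a sentence, and that a sentence shorter than n words is treated as one partial window. The function name is hypothetical.

```python
def cooccurrence_pairs(text, n=5):
    """Collect (earlier word, later word) pairs seen inside an n-word window.

    The window never crosses a sentence boundary, so the last word of one
    sentence is never paired with the first word of the next.
    """
    pairs = set()
    # Assumption: a period ends a sentence; real text needs a better splitter.
    for sentence in text.split("."):
        words = sentence.split()
        for i, a in enumerate(words):
            # Pair a with the next n-1 words, which share some window with it.
            for b in words[i + 1:i + n]:
                pairs.add((a, b))
    return pairs


pairs = cooccurrence_pairs("the cat sat. dogs bark loudly.", n=5)
print(("cat", "sat") in pairs)    # True: same sentence
print(("sat", "dogs") in pairs)   # False: different sentences
```

Widening the boundary to, say, three consecutive sentences would only change how the text is cut into units before the inner loops run.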

 

Please review the one page overview of the Orb technology:

 

http://www.bcngroup.org/python3/ten.htm#_Example_of_the

 

In this one page we explain that the Orb has two formal representations that are equivalent in the abstract, in that each representation can always be uniquely derived from the other.  Orb 2 is a sub-graph of Orb 1 when the following holds:

 

Orb 1  = {   <  a,  r,  b  > j   | j is in a set J }

-->     

Orb 2  =  {   <  a,  r,  b  > i   | i is in I and I is a subset of J   }

 

All "smaller" Orbs are in fact sub-graphs of the largest Orb, and this is both highly interesting to mathematicians and highly useful.
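If each Orb is held as a set of ordered triples, the sub-graph condition reduces to a subset test. The sketch below assumes that representation; the words and the relation label "co-occurs" are made up for illustration.

```python
# Each Orb is modeled as a set of (a, r, b) triples; the words are made up.
orb_1 = {
    ("license", "co-occurs", "ruling"),
    ("ruling", "co-occurs", "carrier"),
    ("carrier", "co-occurs", "tariff"),
}
orb_2 = {
    ("license", "co-occurs", "ruling"),
    ("ruling", "co-occurs", "carrier"),
}


def is_sub_orb(smaller, larger):
    """True when every triple of `smaller` also appears in `larger`."""
    return smaller <= larger


print(is_sub_orb(orb_2, orb_1))  # True
print(is_sub_orb(orb_1, orb_2))  # False
```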

 

If a smaller Orb has a certain "semantic" similarity to the larger Orb, then we have a fractal relationship in “semantic space”.  A formal definition of semantic space is not possible at this time.  We are using the expression “fractal semantic space” as a metaphor.


Example to illustrate the point

 

If we have 200 short stories with an average of 100 word occurrences each, then we have 20,000 occurrences of words.  But suppose we have only 1,500 unique words.  Suppose, as is reasonable, that the 10% of the unique words that occur most frequently (150 words) account for 35% of the word occurrences.  Suppose also that 600 of the words occur only once.

 

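Setting aside the 150 most frequent words (the top 10%) and the 600 words that occur only once leaves 1,500 - 150 - 600 = 750 words.  So we have a core of 750 words with more than one occurrence, but not so many occurrences as to make visualizing the local graph neighborhoods difficult (as in the figure below).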

 

 

Figure: The local graph neighborhood for the word “then” in the FCC public ruling 1997-2003.

 

Gather together those words that are semantically very similar in the context one has in mind, and one has perhaps 400 tokens, where a token is a class of words that are treated exactly the same.  These tokens can populate a terminological reconciliation container, as in the SchemaLogic Inc SchemaServer product.  One word from the class is selected to represent the class.
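A token in this sense can be sketched as a lookup from each word to the representative of its class. The classes below are hypothetical, and the sketch says nothing about how SchemaServer stores its terms; it only shows the word-to-representative idea.

```python
def build_token_map(classes):
    """Map every word to the representative of its class.

    `classes` maps a representative word to the other words it stands for.
    """
    token_map = {}
    for representative, members in classes.items():
        token_map[representative] = representative
        for word in members:
            token_map[word] = representative
    return token_map


# Hypothetical classes; which words count as "very similar" depends on context.
classes = {
    "ruling": ["rulings", "decision", "order"],
    "carrier": ["carriers", "operator"],
}
token_map = build_token_map(classes)
print(token_map["decision"])         # ruling
print(len(token_map))                # 7 words
print(len(set(token_map.values())))  # 2 tokens
```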

 

The collapse and extension of the tokens can be done using a very simple Perl script (which we will make public domain) that acts on a text file containing the ordered triples:

 

Orb 1  =  {   <  a,  r,  b  > j   | j is in a set J       }        

-->     

Orb 2  =  {   <  a,  r,  b  > h   | h is in a set H       }

 

The set J and the set H are sets of integers that count, or index, the ordered triples.  This use of the integers to create subscripts is very common in the foundations of mathematics.
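The Perl script itself is not reproduced here. As a stand-in, the collapse step might look like the following Python sketch. It assumes the text file holds one triple per line as three tab-separated fields, that words missing from the token map are kept unchanged, and that the triples are simply re-indexed from zero; all of these are our assumptions, and the extension (reverse) step is not shown.

```python
def read_triples(lines):
    """Parse tab-separated 'a<TAB>r<TAB>b' lines into a dict indexed by integers."""
    triples = {}
    for j, line in enumerate(ln for ln in lines if ln.strip()):
        a, r, b = line.rstrip("\n").split("\t")
        triples[j] = (a, r, b)
    return triples


def collapse(triples, token_map):
    """Replace each word by its class representative and re-index the result."""
    collapsed = []
    for a, r, b in triples.values():
        triple = (token_map.get(a, a), r, token_map.get(b, b))
        if triple not in collapsed:  # merged words can yield duplicate triples
            collapsed.append(triple)
    return dict(enumerate(collapsed))


lines = ["rulings\tco-occurs\tcarrier\n", "decision\tco-occurs\tcarrier\n"]
orb_1 = read_triples(lines)   # indexed by J = {0, 1}
orb_2 = collapse(orb_1, {"rulings": "ruling", "decision": "ruling"})
print(sorted(orb_1))  # [0, 1]
print(orb_2)          # {0: ('ruling', 'co-occurs', 'carrier')}
```

Here the index set of the collapsed Orb is H = {0}, and its one triple does not appear among the original triples at all.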

 

Some important and powerful algorithms can be defined using the notation that we are introducing.  One benefit of the notation, and of the underlying set and category theory, is that what are essentially theoretical results can be checked empirically.  For example, H and J are not likely to have a subset relationship after the eventChemistry (eC) operators are applied.  Why is this the case?

 

The two eC operators that are easiest to understand are the disambiguation-merge and disambiguation-split operators.  In the merge, one combines two or more word occurrences and treats the category created as a single subject indicator.  In the split, one takes a single word and creates two or more subject indicators.  Change the context, and the disambiguation operators and the other eventChemistry operators change with it.

 

http://www.bcngroup.org/area2/KSF/Notation/notation.htm#_Section_4.1:_Description
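One way to picture the merge and split operators is as rewrites over a set of triples. The sketch below is our own illustration, not the eventChemistry implementation: the split rule (choosing an indicator from the word at the other end of the triple) and labels such as "bank/finance" are hypothetical.

```python
def merge(triples, words, indicator):
    """Treat every word in `words` as the single subject indicator `indicator`."""
    def swap(w):
        return indicator if w in words else w
    return {(swap(a), r, swap(b)) for a, r, b in triples}


def split(triples, word, choose):
    """Replace `word` by an indicator chosen from the word at the other end."""
    out = set()
    for a, r, b in triples:
        new_a = choose(b) if a == word else a
        new_b = choose(a) if b == word else b
        out.add((new_a, r, new_b))
    return out


def by_context(other):
    """Hypothetical disambiguation rule, for illustration only."""
    return "bank/finance" if other == "loan" else "bank/land"


orb = {("bank", "co-occurs", "loan"), ("bank", "co-occurs", "river"),
       ("stream", "co-occurs", "river")}
after_split = split(orb, "bank", by_context)
after_merge = merge(after_split, {"stream", "river"}, "watercourse")
print(after_merge <= orb)  # False: the operators introduce new labels
```

Under these assumptions the rewritten triples carry labels that never occurred in the original Orb, which suggests one reason a subset relationship is not to be expected after the operators are applied.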

 

Due to the organization of the 400 tokens, one has a two-level upper taxonomy of the subject indicators over the small collection of short stories.

 

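Remember that the "largest Orb" has 1,500 nodes and many tens of thousands of co-occurrence relationships.  A fully connected graph with n nodes has n*n links between the nodes when each ordered pair of nodes, including a node paired with itself, is counted as a possible link.  In real Orbs developed from real text the connectivity is around 4%, which for 1,500 nodes is about 0.04 * 1,500 * 1,500 = 90,000 relationships.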
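The connectivity figure can be checked for any Orb held as a set of triples. The sketch below takes n*n as the number of possible links, as the text does; the function name and the tiny sample Orb are ours.

```python
def connectivity(triples):
    """Fraction of the n*n possible ordered node pairs that are actually linked."""
    nodes = {a for a, _, _ in triples} | {b for _, _, b in triples}
    if not nodes:
        return 0.0
    links = {(a, b) for a, _, b in triples}
    return len(links) / (len(nodes) ** 2)


# Scale of the running example: 1,500 nodes at 4% connectivity.
n = 1500
print(round(0.04 * n * n))  # 90000

tiny = {("a", "r", "b"), ("b", "r", "c")}
print(connectivity(tiny))   # 2 links out of 9 possible pairs
```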

 

If this largest Orb is

 

{   <  a,  r,  b  >   }

 

then the size of the set

 

{ atoms }

 

is 1,500.  A one-to-one relationship exists between these atoms, the set of Subject Matter Indicator neighborhoods (of radius 1) and the nodes of the graph of the Orb.  
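The correspondence between atoms, radius-1 neighborhoods and nodes can be sketched directly from the triples. This assumes a neighborhood of radius 1 means the atoms one link away in either direction; the sample words are made up.

```python
from collections import defaultdict


def atoms(triples):
    """All words that appear as a or b in some triple: the nodes of the graph."""
    return {a for a, _, _ in triples} | {b for _, _, b in triples}


def neighborhoods(triples):
    """Radius-1 neighborhood of each atom: the atoms one link away from it."""
    near = defaultdict(set)
    for a, _, b in triples:
        near[a].add(b)
        near[b].add(a)
    return dict(near)


orb = {("then", "co-occurs", "if"), ("then", "co-occurs", "else"),
       ("if", "co-occurs", "else")}
hoods = neighborhoods(orb)
print(sorted(atoms(orb)))        # ['else', 'if', 'then']
print(sorted(hoods["then"]))     # ['else', 'if']
print(set(hoods) == atoms(orb))  # True: one neighborhood per atom
```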

 

Remove every triple in which a or b, or both, is not in the taxonomy.  This is a simple projection of the largest Orb onto the Orb that is the subject indicator for the short stories under the taxonomy.  The projection is well defined mathematically and algorithmically.
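As a minimal sketch, assuming the Orb is a set of triples and the taxonomy is a set of terms, the projection is a single filter. The data below is hypothetical.

```python
def project(orb, taxonomy):
    """Keep only the triples whose a and b are both in the taxonomy."""
    return {(a, r, b) for a, r, b in orb if a in taxonomy and b in taxonomy}


# Hypothetical data for illustration.
largest_orb = {
    ("ruling", "co-occurs", "carrier"),
    ("carrier", "co-occurs", "then"),
    ("then", "co-occurs", "whereas"),
}
taxonomy = {"ruling", "carrier"}

projected = project(largest_orb, taxonomy)
print(projected)                 # {('ruling', 'co-occurs', 'carrier')}
print(projected <= largest_orb)  # True: the projection is a sub-graph
```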

 

It is this simple.