Wednesday, June 09, 2004
Second Tutorial on OrbSuite™
(see also the first tutorial) (link to Mad Wolf Software)
Separating concepts using Orb constructions
Notation for the separation of events in a generalized space
This second tutorial addresses issues related to the situational development of auxiliary resources, such as rule sets, thesauri, dictionaries, ontologies and controlled vocabularies. The first tutorial introduced the OrbSuite™ software and demonstrated how a set of Orb triples is produced from any collection of ASCII text files.
A major part of the Orb constructions is derived from the CCM patent owned by Applied Technical Systems Inc, but is different in significant ways from the ATS applications. The basic construction is described in the Orb Notational Paper. The encoding is one significant difference between CCM and Orbs.
The Orb constructions can be viewed in two ways. The first is as subject matter indicator neighborhoods after the SLIP conjecture is developed. These neighborhoods are what the user will see first (Figure 1) .

Figure 1: One of the subject matter indicator neighborhoods
The second is as a set of hierarchical relationships between a “center” and elements of a word level “n-gram window” in the form:
( word1, word2, word3, word4, word5 )
that is passed over the text. This is the first step in developing Orbs using this particular construction.
We exclude the center from making a relationship, and we exclude non-center words from having a relationship with other non-center words (this comes later, as will be explained). The reasons for starting with this particular measurement are explained next; additional background can be found in the notational paper.
As discussed in the CCM patent, the Orb (Ontology referential base) construction uses the concept of a letter level n-gram (there is a large literature on this at NSA), but most of this literature is about systems developed at the letter level. Some of the research shows that, whether at the letter level or at the word level, n-grams with a window size of five are optimal in some statistical sense.
However, the work we do is categorical, not statistical, in nature, and thus our initial parsing, or measurement, is done to capture categorical relationships that might be outside of the n-gram window. Our current measurement of co-occurrence is perhaps the simplest that one can reasonably define. But, of course, we are looking for the freedom to vary the window size and to use whichever conjectural structural relationship, as in SLIP, best fits the object of investigation. Variations on these methods may also apply to non-linguistic data such as image data.
In a specific case, the index will be over pairs (i, j). The notation is a standard one from set theory. For example, one way in which the set of n-gram relationships might be indicated is via a listing such as:
{ < word(1), relationship, word(3) >,
< word(2), relationship, word(3) >,
< word(4), relationship, word(3) >,
< word(5), relationship, word(3) > }
The paired index would then be N = ( (1,3), (2,3), (4,3), (5,3) ). Indexing is just a simple way of notating what is in and what is out of a set. In this case there are four elements, and the relationship is “non-specific”.
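The window measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OrbSuite™ implementation: the function names `paired_index` and `center_triples`, the odd window size, and the literal label "relationship" are all hypothetical choices made for the example.

```python
def paired_index(window=5):
    """Return the (i, j) pairs relating each non-center position i to the center j.

    Positions are 1-based, as in the tutorial's word(1) .. word(5) notation.
    Assumes an odd window size so that a single center exists.
    """
    center = window // 2 + 1
    return [(i, center) for i in range(1, window + 1) if i != center]


def center_triples(words, window=5, relation="relationship"):
    """Pass the window over an ordered word list and collect
    <non-center word, relation, center word> triples."""
    half = window // 2
    triples = []
    for start in range(len(words) - window + 1):
        gram = words[start:start + window]
        center = gram[half]
        for pos, word in enumerate(gram):
            if pos != half:  # the center is never paired with itself
                triples.append((word, relation, center))
    return triples


print(paired_index())  # [(1, 3), (2, 3), (4, 3), (5, 3)]
```

Each window contributes four triples, so a list of six words (two windows) yields eight.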
From the set of triples defined from the n-gram relationships we obtain a “derived” set of relationships using the SLIP conjecture. The SLIP conjecture refines the notion of relationship in a specific direction. From the standard SLIP conjecture we develop a total of five relationships with the center word being the “relationship”, as opposed to the previous set of relationships, where the relationship was merely one that was structurally defined by membership in the n-gram window. The derived relationships indicate that two words (neither of them the center) are in the same neighborhood. This is a link relationship. The link relationship is discussed in the SLIP foundational papers, and shown in Figure 2.

Figure 2: The “standard” SLIP conjecture
It is important to note that a great deal of variability exists that has not been explored, and the specific relationship we have encoded into the SLIP browsers is not the only relationship that can be defined.
As mentioned, there are five elements in the conjectured set of relationships from one five-word, word level n-gram window:
{ < word(1), word(3), word(2) >,
< word(1), word(3), word(4) >,
< word(1), word(3), word(5) >,
< word(2), word(3), word(4) >,
< word(2), word(3), word(5) > }
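A minimal sketch of how the five listed triples could be generated from one window follows. The pairing rule used here (the first member of each pair lies before the center) is an assumption inferred only from the listing above; it is not taken from the SLIP papers. Pairing every two non-center words would give a sixth triple, < word(4), word(3), word(5) >, which does not appear in the listing. The function name `link_triples` is hypothetical.

```python
def link_triples(gram):
    """Return <a, center, b> triples for one window.

    Assumed rule (inferred from the tutorial's listing): a is a word to the
    left of the center, and b is any later non-center word in the window.
    """
    half = len(gram) // 2
    center = gram[half]
    triples = []
    for i in range(half):                  # a: positions left of the center
        for j in range(i + 1, len(gram)):  # b: any later position
            if j != half:                  # the center is the relationship, not a member
                triples.append((gram[i], center, gram[j]))
    return triples


window = ["word1", "word2", "word3", "word4", "word5"]
for triple in link_triples(window):
    print(triple)
```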
The means by which the Orbs are encoded allows great flexibility over a set of basic operations that we are calling Orb arithmetic. As the n-gram window is passed over the ordered set of all words in the fables (after the sentence structure and stop words are omitted) we build a large set of these ordered triples.
Once this set is developed, various transformations can be applied. For example, we can selectively omit some of the elements in order to produce a subject indicator that is exclusive to a specific subject and which can then be used to instrument the detection of that subject in new text.
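One way to picture the selective omission described above is as ordinary set operations on triples. The sketch below is an assumption about what such a transformation might look like; the tutorial does not define Orb arithmetic at this level of detail, and the names `omit_words` and `detect` as well as the sample words are hypothetical.

```python
def omit_words(triples, words_to_omit):
    """Keep only the triples that mention none of the given words."""
    omit = set(words_to_omit)
    return {t for t in triples if omit.isdisjoint(t)}


def detect(indicator, new_triples):
    """Return the indicator triples that also occur in triples built from new text."""
    return set(indicator) & set(new_triples)


# Hypothetical triples in the <a, center, b> form.
triples = {
    ("wolf", "stream", "lamb"),
    ("wolf", "hole", "mouse"),
    ("lamb", "water", "stream"),
}
indicator = omit_words(triples, ["mouse"])
print(sorted(indicator))
```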
A theory of convolutional operators, acting on Orb sets, is provided in the Orb Notational Paper.
We have completed just enough tools to show how to separate subject matter indicators by altering a controlled vocabulary. The relationship that controlled vocabularies have to upper taxonomies is discussed in the Orb work done for the FCC.
Creating a go-list will, as our software is more fully developed, allow for the evolution of the controlled vocabulary using OrbSuite™. The go-list is created either directly or by taking the complement of a stop-list within the words available.

Figure 1: Stop word list, after import of a stop word list
The stop-list we use in Figure 1 is from VisualText Inc. The stop-list has 524 words in it. In Figure 2 we show the list of available words as being 78. By selecting “None” in the “Type to use” selection, we find that there are 135 unique words in the two fables but that 57 of these are in the stop-list (this result is not shown in a figure). These numbers help us understand the nature of the task we have in measuring linguistic variation and applying some type of semantic theory to patterns of co-occurrence. Applying the 524-word stop-list leaves 78 words that occur in the two fables and are not in the stop-list.
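The go-list as the complement of a stop-list, and the count relationship behind the numbers above (135 unique words, 57 stopped, 78 remaining), can be sketched as follows. The function names, the lower-casing step, and the toy word lists are assumptions made for illustration; the real VisualText stop-list and fable texts are not reproduced here.

```python
def go_list(available_words, stop_list):
    """Complement of the stop-list within the unique available words."""
    stop = {w.lower() for w in stop_list}
    unique = {w.lower() for w in available_words}
    return sorted(unique - stop)


def word_counts(available_words, stop_list):
    """Return (unique words, unique words in the stop-list, words remaining)."""
    stop = {w.lower() for w in stop_list}
    unique = {w.lower() for w in available_words}
    return len(unique), len(unique & stop), len(unique - stop)


# Toy data only; not the fables or the stop-list used in the tutorial.
words = "The wolf and the lamb met at the stream".split()
stops = ["the", "and", "at", "of"]
print(go_list(words, stops))
print(word_counts(words, stops))
```

The three counts always satisfy unique = stopped + remaining, which is the same relationship as 135 = 57 + 78 in the tutorial's example.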

Figure 2: Available words after subtraction from a stop-list
These 78 words are discovered using the Orb construction process as discussed in the Orb Notational Paper. The co-occurrence is between a “center” of the five-word word-level n-gram and the four words that are also in the n-gram window.

Figure 3: 78 “concepts” in two fable stories
As discussed in Section 1, this procedure produces a specific set of ordered triples that are encoded into the referential information base (key-less hash table). The form of this set is standard and the encoding access is based on this standard. In the case of the two fables and this stop list, we have the following data rendering in the OrbSuite (Figure 3) and the SLIP browser (Figure 4).

Figure 4: The 76 atoms (two seem to be missing) scattered to the circle (a) and rendered as Subject Matter indicator neighborhoods (b)
The question is how to separate the two sets of subject indicators so that each group corresponds to exactly one fable. Then, when we see a subject matter indicator, we know which fable produced it.

Figure 5: A partial gather (20,000 iterations) of the set of fable atoms
This is an important result, since one can use the ability to separate concept indicators to assist in the development of high-resolution real time analysis of social discourse.
Figure 5 indicates both clustering and connections between the clusters. We use the SLIP browser's properties to take the points between the two clusters and separate these into a single Orb construction and then view that construction by itself.

Figure 6: Using the SLIP browser to take out the connectors
The event space (the set of subject indicators and their connections) within just this set of connectors is shown in the next figure.

Figure 7: An Orb with 28 atoms
In Figure 7 we have 28 atoms that show strong connections between the occurrences of words from one fable to another fable.
Orb arithmetic can be done directly on an existing set of Orb constructions. However, in this case, we actually recreated an Orb using a smaller controlled vocabulary so that the new Orb would have exactly two SLIP primes. (For the definition of a “SLIP prime” see the theorems in the SLIP foundational paper.)

Figure 8: The two sets of atoms after removing words from the controlled vocabulary
The two sets of atoms now have no inter-set connections and when clustered form two groups.

Figure 9: The most highly connected atoms
The most highly connected atoms in each of the two separated groups are “mouse” and “lamb”. In the larger graph, defined by the set of ordered triples, these two atoms are each part of a sub-graph that has no path connecting it to the other sub-graph.
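The separation into two disconnected sub-graphs can be checked with a standard connected-components search. The sketch below assumes that each triple < a, center, b > contributes an undirected edge between a and b, and that "most highly connected" means the atom with the most distinct neighbors; both are assumptions for illustration, as are the function names and the toy triples.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Undirected graph: each <a, center, b> triple links atoms a and b."""
    graph = defaultdict(set)
    for a, _center, b in triples:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def components(graph):
    """Group atoms into sets with no path between different sets (breadth-first search)."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups


def most_connected(graph, group):
    """Atom in the group with the most distinct neighbors (ties broken alphabetically)."""
    return max(sorted(group), key=lambda atom: len(graph[atom]))


# Toy triples only; not the triples produced from the fables.
toy = [("mouse", "wolf", "hole"), ("mouse", "wolf", "grain"),
       ("lamb", "wolf", "stream"), ("lamb", "wolf", "water")]
graph = build_graph(toy)
groups = components(graph)
print(len(groups), sorted(most_connected(graph, g) for g in groups))
```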
One of the fables is about a wolf and a lamb and the other is about a wolf and a mouse. The separation can be automated to select those co-occurrence patterns which best represent subjects. Over time, a set of specialized indicator patterns can be inventoried so that thematic analysis can easily occur over the set of themes where these indicators have been found in the past.
In the previous section we manipulated a specific controlled vocabulary in order to remove those (word) indicators that were co-occurring in two different textual descriptions of subjects. In this case the two subjects were two of the Aesop fables, each being around 400 words in length. The word level 5-grams are moved over a list of words that are in the right order, as in the original text, but only include words that are in a controlled vocabulary.
The entire approach is based on a tri-level architecture that does not depend on the basic elements of measurement and analysis being words and patterns of word co-occurrence. The basic elements can be any invariant that is present in many cases of some type of event structure.
We have developed generalized notation on a number of occasions. The first of these generalized notations was developed and presented in Moscow in 1997 as the voting procedure. The concept is that three levels of organization are responsible for any event structure in any observational space. The tri-level architecture was grounded in an interpretation of Pribram’s work in cognitive and quantum neuroscience, and in Walter Freeman’s observations about the (non) invariance of signal propagation between olfactory sensory receptors and the neuronal processing centers in the cortex. James Houk’s work on memory encoding helped to develop a specific biologically feasible network architecture. (See Chapter 4, Foundations.)
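The step of moving 5-grams over only the controlled-vocabulary words, in their original order, might be sketched like this. The function name `windows_over_vocabulary` and the sample sentence and vocabulary are hypothetical, and the sketch assumes exact string matching against the vocabulary.

```python
def windows_over_vocabulary(text_words, vocabulary, size=5):
    """Keep only vocabulary words, preserving text order, then slide the window."""
    vocab = set(vocabulary)
    kept = [w for w in text_words if w in vocab]
    return [tuple(kept[i:i + size]) for i in range(len(kept) - size + 1)]


# Toy sentence and vocabulary; not the text of the fables.
text = "a wolf met a lamb by the stream one day and spoke".split()
vocabulary = {"wolf", "met", "lamb", "stream", "day", "spoke"}
for gram in windows_over_vocabulary(text, vocabulary):
    print(gram)
```

Because filtering happens before windowing, shrinking the controlled vocabulary changes which words end up in the same window, which is why removing words can break the connections between two groups of atoms.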
The current opportunity is to apply the tri-level to Earth observations as part of our bid for funding from a NASA Earth Science Technology Office, Advanced Information Systems Technology (AIST) mini solicitation.
“The Earth Science Enterprise (ESE) is one of six NASA enterprises seeking to fulfill the agency’s vision and carry out its mission. The ESE mission is to understand and protect our home planet by using our view from space to study the Earth system and improve predictions of Earth system change. The ESE, . . . , provides accurate, objective scientific data and analysis to advance our understanding of Earth system processes and to help policy makers and citizens achieve economic growth and effective, responsible stewardship of the Earth’s resources. The ESE research program aims to acquire deeper scientific understanding of the components of the Earth system, their interactions, and the consequences of changes in the Earth system for life. These interactions occur on a continuum of spatial and temporal scales ranging from short-term weather to long-term climate and motions of the solid Earth, and from local and regional to global.
… scientific focus areas are: Atmospheric Composition, Carbon Cycle and Ecosystems, Climate Variability and Change, Earth Surface and Interior, Water and Energy Cycle and Weather.”
The key elements to our two-year proposal are:
1) The development of a unique stand-alone virtual private network based on advanced Multiple User Domain (MUD) technology as seen in the Manor software from Madwolfsw Inc.

Figure 10: A visual chat discussion (June 2004)
Figure 10 is a screen shot of a discussion with Madwolf Software founders and Ontologystream regarding a plan to present to an existing coffee shop chain the notion of a Safe Net.
2) The development of an Orb repository for all scientific data on Carbon Cycles and Ecosystems
3) The development of an Orb repository for all scientific data on Earth Surface and Interior
4) The development of a theory of event type and atomic event elements based on this Orb encoded data and the use of the tri-level architecture, including the Prueitt voting procedure, shallow link analysis (SLIP), and quasi-axiomatic theory.