Wednesday, August 11, 2004
Tutorial
On the Nature of a
new Memetic Technology
based on
categorical abstraction and event chemistry
edited : Tuesday, August 10, 2004
Two level Fixed Upper Taxonomy from subject
indicators 4
Dr. Paul S. Prueitt
Founder, BCNGroup Inc (1992) and OntologyStream Inc (2000)
Copyright 2004 OntologyStream Inc
703-981-2676
The development of a novel, simple and powerful technology has occurred over a period of years, and has several key contributors. The principle on which this technology is based has the nature of a specific extension of Hilbert type mathematics and is grounded in cognitive and behavioral sciences in very specific ways.
The technology is a fundamental advancement in science with specific consequences to measuring the human production of information as part of every day social interaction.
Central to this technology is the measurement of invariance across multiple instances. The measurement can be applied to a diverse spectrum of targets. In each application the interface to Orb (Ontology referential base) data encoding, sets of ordered triples, will be slightly different, but the core data encoding and the data transformations are generic.
The final section of this paper develops an illustration where the measurement is not a standard measurement and the semantics, that might be laid over the measured structure, is not obvious.
One thinks about this as stratified category theory. A natural separation of instances from the invariance within instances is used to develop a stratified theory of type [1][1]. For example, if the instance is a sentence, one may look for pairs of words that are co-occurring within that sentence. The measurement of co-occurrence produced a precise relationship having the form < a, r, b > where a and b are co-occurring words and r is the relationship of having co-occurred.
A complete measurement of a text corpus produces a set { < a, r, b > } and this set is equivalent to a graph with the relationships the links and the set of co-occurring words the nodes. The relationship, “co-occurrence of words in text”, is not generic; nor is it the case that the co-occurrence might be conditions on, for example noun or verb types.
We appeal to the notion that regularity in co-occurrence of words within sentence boundaries should be indicators of subject matter. One has to develop empirical evidence that this is so. However, such a measurement is precise with linear dependency on the boundary of our definition of instance, eg as sentence or multiple sentences, so that if the boundary is increased than the number of co-occurrences increase.
The set of words considered significant can be altered as can the identification criterion for words having multiple potential contexts. The definition of relationship can also be varied in a complicated fashion by introducing entity extraction rules or other types of parsing rules. The precision of these measurements allows the empirical science to be developed in an orderly and objective fashion.
Subject matter is modeled, in the abstract, as the basins of attraction within a mathematical topology on a graph (if discretize in some fashion) or within a mathematical topology as a continuum.
Model for the structural holonomy between continuum and discrete formalisms
Differential Ontology Framework (DOF), allows a structural holonomy between discrete and continuum formal models, but does not address the issue of ontology reification.
The co-occurrence relationship produces a discrete measurement of the subject matter being exchanged within the social discourse. Latent Semantic Indexing (LSI) is a well-known technology that uses linear algebra and Hilbert space mathematics to create a continuum model, a Hilbert space manifold, of the subject matter being exchanged in the social discourse. It is argued elsewhere that LSI topology can be mapped to a discrete graph having a type of structural homology to the LSI manifold.
We use the terminology of memetics, with certain distinction being made regarding the ontic status of a meme as a replication mechanism DUE to the expression of individuals within a social system. This means that we find a fundamental distinction between the replicator function of genes and the replication of indexical frame (Lissack 2004).
If, on the other hand, memes
are redefined such that the evolutionary selection process is no longer an aspect
of the ontology of memes but rather of the environmental niche of which the
memes are evidence, then the field may have other avenues of advancement and a
potential relevance to managers. Such a redefinition would entail recognition
of the relationship between a given meme and the context of the social and
ideational environment of which it is an affordance and which it demands be
attended to. Memes in this casting are a label for successful boundary object
indexicals and lose their privileged status as replicators. Instead, the
replicator status is ascribed to the environmental niches and the memes are
their representatives, symbols, or semantic indexicals.
The indexical is an abstract concept that becomes practical by actually looking for the words that are co-occurring around a central term. What is then a central term? The answer is non-precise in that a subject matter might be indicated, successfully, by more than one specific term. This variability is consistent with everyday experience. In standard taxonomy, we have the concept of a “broad term” which subsumes more than one narrow terms, allowing for an organization to terminology. This b-t/n-t concept has been found useful.
The organization of subject matter indicators into broad-term, narrow-term taxonomy can be mapped to a graph where each broad term is the center of a graphical neighborhood, and the narrow terms are all equal distant from the center.
We have exposed several
problems.
Q1: How can we fix a specific “upper taxonomy” within the constraints of a small number of nodes and two levels, so that the taxonomy covers the universe of discourse for a specific period of time?
Q2: How
can we provide additional layers, called a hidden taxonomy, which interfaces
with intelligent search and metadata extraction technology, while keeping the
upper taxonomy fixed?
Our solution
is to have human enumeration of a two-level broad-term, narrow-term taxonomy
that is to be fixed for a period of, say, nine months. This human enumeration is matched to a
bottom up adaptive elaboration of a hidden taxonomy using machine based rules
and machine learning techniques.
An addition to this two level
taxonomy we wish to provide an adaptive elaboration of the bottom elements of
the two-level taxonomy so that the taxonomy extends into and covers the subject
matter.
Figure 1:
Taxonomy candidates derived from machine or algorithm
Second
step: Organize the taxonomy candidates into a two
level taxonomy using broad term (bt) / narrow term (nt) relationships.
Figure 2:
Figure 1 organized by BT/NT relationships
Figure 1 and 2 indicate a top
down enumeration of taxonomy using polling instruments and knowledge
engineering/management methods. The
bt/nt relationships are used. For example,
c(21) is a broad term having three terms,
{ c(11), c(3), c(40)
}
with a more narrow meaning or
context.
Once the
upper taxonomy is fixed we have a finer resolution of subject matter indicators
that MUST match the bottom layer of the upper taxonomy. (This matching between the lower level of
the upper taxonomy and the top most level of the hidden taxonomy is the key to
our approach to mapping memetic expression in social discourse.)
We will use
elements of a class of adaptive elaboration instruments to enhance search and
retrieval algorithms. For example, new linguistic
variation in text categorized indicates evolution of subject indicators.
As the social discourse
changes, these patterns of linguistic variation can be empirically observed to
change and with these changes comes an evolution in nuance. These changes are
to be observed and then linked within the hidden taxonomy and the associations
between the hidden taxonomy and the upper taxonomy can be allowed to evolve as
the social discourse introduces this nuance.
Figure 3: Upper Taxonomy and Hidden Taxonomy
The upper
taxonomy continues to be fixed until or unless there are reasons to change the
upper taxonomy because of the introduction of new topics, or the forgetfulness
of topics that are no long considered relevant to the purposes of the
controlled vocabulary
Documents can
also be placed into repositories using the upper taxonomy as user defined
metadata, however search and retrieval using the Subject Matter
Indicator neighborhoods will
also use a multi-pass rule engine to provide higher resolution to the subject
indicators.
A machine
derived, bottom up, taxonomy can be generated in several ways. Most of the well known techniques take a set
of complex data sets and cluster the elements to introduce metaconcept
boundaries and to suggest relationships between these high order
constructions.
One way of thinking about this clustering process is that there is an implicit topology on the space of complex data sets. A method of differential ontology can be used to produce an explicit representation of these boundaries and these relationships.
Figure 4: A topology with two neighborhoods
Figure 4 could be used to produce a two level taxonomy with two nodes in the top level, one corresponding to each of the neighborhoods with “radius” = 2. Under the first top node (derived from the upper left neighborhood), we have ten subject indicators within a radius of 2 units. Under the second top node (derived from the lower right neighborhood), we also have ten subject indicators within a radius of 2 units. Neighborhoods can be made broader by taking only the underlying nodes with distance 0 and 1. In this case, the first top node would have 6 children and the second top node would have 4 children.
In the above topology, we have a simple notion of distance in graphical constructions (not trees – but more general graphic constructions). Topological logics can be used to measure the presence of subject matter indicators.
The logics we have studied include plausible reasoning and Mill’s logic. It is important to point out that the logic is not related to semantics, but rather to the identification of semantic indicators. OWL (Ontology Web Language) can be used to encode information and rules about how to detect a subject matter indicator; but we hold that a simple visualization interface is best if and when one begins to talk about the fidelity of correspondence to things in the world and behavioral causes of things in the world.
Our thoughts in this regard are consistent with the fundamental insights written into the Topic Maps standard 1.0.
Many companies have products that address concept extractions. Cost and ease of deployment are the limiting factors in bringing knowledge of these technologies to the client. Entrieva Inc and Applied technical Systems Inc both have concept extraction systems. Our technology matches any of these extraction/detection processes to the lower level of the upper taxonomy.
Anything that is surprising, simpler than anticipated and more powerful by several orders of magnitude when compared with previous technology is not going to be accepted easily even when understood. The initial non-acceptance is in fact understood from the very notions of autopoiesis[2][2] and the very simplest first principles of memetic science.
Memetic technology must go hand in hand with objective science based on a healthy and open interplay between theory and observation. There are social issues related to training and education. The application of this new technology to the study of memetic expression and memetic interaction is, however, immediately possible and requires only a 120-day technology development project with funding for three full time people.
A demonstration of principle has been available and has been discussed in public forums for a period of almost two years. The outline of this discussion can be viewed from the URL:
http://www.bcngroup.org/procurementModel/to-be/di.htm
The sections that follow have URLs to complete research papers and tutorials on various aspects of the new technology. In these sections we are complete and minimal in our description.
We develop a short tutorial on the core technology. Software supporting the tutorial is available at:
http://www.bcngroup.org/AIC/tutorials/3gram.zip
Community adoption and use requires a supported collaborative environment where the Orb technology is made available while also making available communication services that bring in a number of scholars who will use the memetic tools to develop a specific set of capabilities as defined in the Five Tasks listed below.
1) 1) Development of a symbolic language depicting categorical invariance in behavior related to social response under stress.
2) 2) Development of a symbolic language depicting categorical invariance in suppressive responses to aggression arising from social stress.
3) 3) Measurement technology directed at capturing the expression of causative elements of social behavior as predictors of action
4) 4) Measurement technology directed at capturing the consequences of suppressive responses to aggression.
5) 5) Development of a theoretical framework that supports predictive inference from the observations made about effects to precise knowledge about causes.
The Phase 1 of our project has identified a small group of scholars who are involved in the definition of a new memetic science. These individuals will meet several times in face-to-face conferences and will work within a distribution collaborative environment (Groove). Several memetic technologies have been identified and others may be added to the Groove toolset. As we identify this appreciative field the core team will extend a select community membership using resources judicially.
A second step will lead to the production of a red team blue team gaming environment that tests the usability of our community defined language, measurement tools and theory.
The third step will be the demonstration of the gaming environment to military leadership and the proper control of technology enabling a broad based use of the environment within the Department of Defense.
Precise notational language is develop in a paper published on the Internet in December 2003:
http://www.bcngroup.org/area2/KSF/Notation/notation.htm
However, we are going to use the SLIP software invented by Paul Prueitt in 2001 to make specific examples and to point to specific objects.
We have developed several tests of “text understanding technologies” uses a small collection of 312 short stories, the Aesop fables. As we are familiar with this collection and some of the structural characteristics we choose to redevelop a study of underlying structure invariance within the fables.
N #
1 26
2 407
3 2576
4 5866
5 6598
6 5446
7 3778
8 2250
9 1155
10 547
11 213
12 70
13 16
14 0
15 0
16 0
17 0
18 0
19 0
20 0
Table 1: size of letter level n-grams verses number of distinct categories
In order to bring this study to bear on the most abstract and thus the most general level of formalism, we decided to not look at the words in the fables but rather the letter level n-grams. In Table 1 we report the number of unique categories of occurrences of 3 letter patterns, where the three letters occur within a single word and the occurrence is immediately adjacent.
So for example in "binning" 3 grams from the fable collection
the brown fox
we would have the following 3 grams
{ the, bro, row, own, fox }
Now clearly, in this context, "row" and "rowing" are only very loosely related semantically. But the purpose here is to separate semantics from a precise and exact measurement of the invariances and their co-occurrences.
So the Orbs developed using only 3-grams have a problem, that might be sorted out using some type of ambiguation / disambiguation process, but this problem is not addressed here. The problem is an interesting one, however.
The measurement of 3-grams produced by this sort of odd transformation of text gives us a more generic case, that exposes a class of operational issues in the context of data where a set of measured “atoms” are occurring and co-occurrence of these measured atoms is used to develop a covering over the occurrences.
In our work on measuring the subject content of text using complete word occurrences, we use the direct analogy to mathematical topology where an open interval from real number a to real number b, (a, b), can cover the points in-between the two numbers. A line can be covered by small open intervals so that all points in the line are covered and the size of the intervals made small. In a similar fashion, the covering of “subject matter indicator neighborhoods” is developed and expressed as a upper taxonomy.
http://www.ontologystream.com/beads/enumeration/note1.htm
Considering the more general case exposes the issues and strengths of this methodology. How are Orbs developed and how might they be used when the language is not English and perhaps the data is not even natural language? For example, suppose that state symbols are derived from a set of behavioral atoms and then used to measure behavioral expression.
The answer is straightforward and simple. The meaning of the measurement is separated from the measurement itself and thus the measurement process is simplified. Once the measurement is done, simple techniques are used to transform the Orb structure and to thus produce information that is easy to interpret.
The table tells us exactly and precisely what we should expect from this measurement of the letter level n-grams where the grams occur completely with words. Clearly the 1-gram categories must number 26*2 or 52. If we make all letters case insensitive, then we have number of categories is 26 when n is 1. Also clearly there are no 14-letter words existing in this specific collection.
The procedure was developed, in Perl, so that each word in each sentence was expanded into a sequence of letter level 3-grams. The result of the transformation of the set of fables is a new text collection where each “word” has exactly 3 letters. Sentence boundaries are kept based on the sentence boundaries in the original collection of fables.
The collection of text was then processed by the experimental Orb system as discussed in the notational paper we have made reference to before:
http://www.bcngroup.org/area2/KSF/Notation/notation.htm
In the next section we give the URL that downloads an Orb with software for viewing the Orb neighborhoods.
In Figure 1 we have a screen shot of 2300 categorical invariances extracted from a random sample of 1/5 of the entire initial measurement of the 3-grams we produced for this tutorial. It is noticed that in Table 1 we report there are in fact 2576 categorical invariances in the complete measurement. However, the occurrences of the categories are distributed in the data in such a way that a particular measurement picks up one or more occurrence of each of 2300 out of 2576 3 letter combinations.
Figure 1: 2300 atoms are separated into a central core and outliers
Each of the 2300 occurrences scattered to the circle in Figure 1 also has the property that some co-occurrence has occurred with at least one other category. The precise issue of measurement is about the categories of occurrence, and the frequency of occurrence is left out of the visualization (for several reasons).
By “co-occurrence of two categories” we mean that at least one instance of a co-occurrence between elements of the categories has been observed. Just so that we are clear about this. In our specific example here, the categories all have the same invariance occurrence of a three-letter pattern, such as “row”. But of course we might define categories to have internal differences if there is a reason for this. We might wish to say that “row” and “oar” are to be treated as the “same thing”.
a b
Figure 2: Initial distribution of scattered atoms and after iterated gather
The software being used here is simple to use, and it may be that seeing the software work will help understand what is going on, and thus to gain some appreciation as to the power of this utility.
In Figure 2 we show the evolution of the topology on the circle where the degree of interconnectedness of the categories pull atoms together using a standard scatter-gather technique that is easy to understand, given a few minutes of technical discussion.
Figure 3: A drill down into the structure of the 3-gram Orb
The scatter-gather technique pulls more as a function of how interconnected the complete set of atoms are. The process is stochastic and the technical details interesting. But the effect is easy to understand. We remove the center of the distribution so that those categories that are highly connected are set aside. We look at the complement of the center and repeat this process to obtain a set of categories that are not so fully connected, but yet are still randomly occurring within the data. This is our candidate for a cover. A similar technique is used in the “conceptual role-up” of the NdCore text understanding system now deployed at INSCOM and Army Intelligence. However, the technique discussed here is more advanced and has a greater transparency to informed users.
In Figure 3 the 450 categorical atoms from the R box (colored green) is shown as points on the circle. These points are the first and third elements of a set of ordered triples in the form
{ < a, r, b > }
where a and b are atoms and r is the relationship of having a categorical co-occurrence. All of these 450 atoms are visually “tossed” into the display window as seen in Figure 4, when mouse clicks on atoms bring a specific picture of a cover neighborhood, as in Figure 5.
Figure 4: Categorical atoms thrown into a visualization space
The visualization can be made global, however, the methodology we have developed suggests that the local information is exactly what one wanted to start with in the development of a semantics over the space of invariance that was initially measured from the 3-letter n-grams.
Figure 5: One of the cover neighborhoods
The discussion on introducing a semantic interpretation of the elements of the cover is treated in the papers at:
http://www.ontologystream.com/beads/enumeration/note1.htm
We understand that our remarks here are likely to have raised more questions than settled questions. The most important observation is that the Orbs are a precise and exact measurement of data invariance and co-occurrence of data invariance and produces new and simpler approach then statistical data mining.