
Chapter 3

Semiotic Design for Document Understanding

Revision: January 18, 2000




1. Introduction

2. Quasi-axiomatic theory and plausible reasoning

3. On the question of ontologies

4. Classification of issues regarding computational document understanding

5. Using theme vectors as a basis for data mining

6. The use of formal models and semiotic models

7. Definition of a formal model from theme phrases

8. The selection process







1: Introduction

Computation is a result of specific cognitive processes, and reflects a natural order in the world. This natural order is not transparent, and perhaps it is not perfect, but it is there. The order is in the nearly fractal expression of the growth of plants, the formation of crystals, and the categorization of stars in the sky. The order of natural processes led the ancient Greek philosophers and mathematicians to idealize the causes of this perceived order into concepts known as the Pythagorean Spheres.

The classical notion of computation is based on treating nearly the same things as being the same thing. For example, the difference between 1 and 2 is exactly the same as the difference between 2 and 3. Counting is periodic. Nature is aperiodic. No two real things are ever exactly the same in nature. The enumeration of things in the world is made via a process of mental abstraction.

The extensibility of mathematics and classical logics has been brought into question by a failure of modern science to accommodate the many special modeling needs of the soft sciences.  The position we take on this is quite literally complex.  We feel that classical, or Hilbert, mathematics has limitations that will be overcome with the development of a more complete understanding of the process of creating and using abstractions. 

This chapter will bring critical remarks to bear on 20th century information science, where quantitative analysis often reduces linguistic and semantic issues to enumeration. In this reduction we may be seen to have lost some essential aspects of the tacit knowledge that we humans have, which is not as yet abstracted into some type of formal system. Perhaps we may also come to commonly understand that any reduction to enumerated things will lose a pragmatic dimension that may only be regarded as “within the moment”.

Having framed the chapter in this way, we are reminded that linguistics has not been associated with a theory of semantics that has a full temporal expression.  Many academicians and many business people have acted as if a full theory of semantics has been available for some time, but perhaps we have just not understood what they are talking about. 

It seems easy to argue that an understanding of the temporal expression of complex systems is necessary to produce web-based mediation of knowledge management. Just as easily, we argue that such an understanding of temporal expression has not been developed by the current generation of information technologists. We argued in the last chapter that a categorical distinction must be made between the nature of a formal system and the nature of a natural system.

Consistent with our distinction between natural and computational systems is the conjecture that computation is a product of the laws of evolution, under which biological systems have shaped themselves to reflect, albeit imperfectly, natural order. In this view, computation comes to exist because the computational-like responses of complex systems change the world and, as a consequence, produce a survival advantage. This evolved consequence derives from the underlying physics, but is something created in a complex fashion.

Luis Rocha states (1996) that "Cognitive agents survive in a particular environment by categorizing their perceptions, feeling, thoughts and language". The fact that categorization is generally successful in making sense of the complexity of the world is evidence that the natural order is striking.   All that is known is evidence that complexity is real, and that we have yet to understand the nature of this complexity.

Though some details about complexity are still unknown, the fact that there is complexity in natural systems is not surprising. As biological systems became more complex, an evolution of computational-like mechanisms occurred, and then finally the computer. These computational-like mechanisms include tools for the simplest form of measurement of the environment. We reach into theoretical biology and ecological psychology to find empirical grounding for the new anticipatory technology.

It is natural that such mechanisms be derived from physical properties of biochemical systems and the physics of electromagnetic distribution of energy. As the mechanisms for anticipatory technology are developed we note that the role of the human must become more central to our social notions about information technology.

It is natural that a complex measurement function be defined as part of a computational process cycle expressed as informational rendering followed by human perception expressed as decision.  Such cycles are to be driven by the periodic forcing functions that are observed at all levels of biological organization, now extended to human/computer co-processes of information. 

Instrumented measurement within an architecture of action / perception cycles is then the type of measurement that we require for the mediation of knowledge management.

Perceptual mechanisms include components of the neuronal and immune systems in mammals. These mechanisms are directed at creating, from an explicit experience, a decomposition into an implicit representation of the world. Living systems then use this representation to maintain a form, i.e., a set of structural invariances, over a period of time. Mammals are equipped with cognitive capabilities that strengthen the anticipatory responses seen in all living systems.

For some living systems, the consequences of causes and the context of properties of objects can be stored. For humans, this implicit representation is stored in multiple memory systems (Schacter & Tulving, 1994) and provides the basis for various creations of the human mind, including language, mathematics and logic. In the tri-level architecture this implicit representation is stored as a set of ordered triples in the form < a, r, b > where a and b are graph nodes and r is a relation.
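
As a minimal sketch of how such a set of ordered triples might be held in a machine, consider the following Python fragment. The class names Triple and TripleStore, the index by node, and the example triples are our assumptions for illustration only; the text specifies nothing beyond the form < a, r, b >.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One ordered triple < a, r, b >: nodes a and b joined by relation r."""
    a: str
    r: str
    b: str


class TripleStore:
    """A small in-memory set of triples, indexed by the nodes they mention."""

    def __init__(self):
        self._triples = set()
        self._by_node = defaultdict(set)

    def add(self, a, r, b):
        """Store < a, r, b >; adding the same triple twice has no effect."""
        t = Triple(a, r, b)
        self._triples.add(t)
        self._by_node[a].add(t)
        self._by_node[b].add(t)
        return t

    def about(self, node):
        """Return, in sorted order, every triple in which `node` is a or b."""
        return sorted(self._by_node.get(node, set()))

    def __len__(self):
        return len(self._triples)


# Hypothetical example content.
store = TripleStore()
store.add("smoke", "indicates", "fire")
store.add("fire", "produces", "heat")
store.add("smoke", "indicates", "fire")  # duplicate; the set keeps one copy

fire_triples = store.about("fire")
```

Here fire_triples holds both triples that mention the node "fire", which is the kind of local neighborhood a graph of nodes and relations makes easy to retrieve.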

Figure 1: Separation of implicit representation from explicit experience.

Let us examine the phenomenon of perceptional measurement a bit deeper. Figure 1 illustrates a conjectured mechanistic separation of implicit structure from the explicit experience of the world. These structures are precursors to the substructure and ultrastructure knowledge artifacts that we mimic in the tri-level architecture.

Specific mechanisms are involved in the encoding of implicit structure as recourse for remembrance. The full detail remains subject to scientific investigations.  However, one might look to a general formalism that models the system properties of metabolism-repair (M,R) systems (see Casti, 1996), while noting, as John Casti does, the elegant discussions regarding the intrinsic limitations of Newtonian formalism. 

The work that is now contemplated asks questions about how the Newtonian formalism can be supplemented by Peircean logic (see Finn, 1991; Burch, 1991) and by basic notions from the American ecological psychology community (see Howard Pattee, 1996).

The fundamental insights behind a number of tools are to be integrated together within an algorithmic framework supported by a general theory of open systems. These tools include situational language and logics (Pospelov, 1986; 1996), Qualitative Structural Activity Relationship (QSAR) analysis of situations (Finn, 1996) and iconic computation (Burch, 1996; Laubenbacher et al., 1996; and Jeffery Johnson, 1996). Between 2000 and 2003 we added a number of fundamental technologies including the Orb (Ontology referential base) innovations.

General systems properties provide a unifying foundation for understanding how systems derive implicit structure and encode, as memory, a structural representation of invariance from the experiential stream into a substrate.

The biological brain works in this way.  The substrate is our memory. Through adaptation and mutual influence, the components of the memory substrate develop, over time, a one to one correspondence to the structural invariance of systemic genotypes that are the causes of the things experienced.  Once these components have been developed, the Mill’s logic developed in Chapter 9 can be used in the layered architecture of chapters 11 and 12 to encode the relations between the presence or absence of substrate and the presence or absence of properties of the emergent wholes.  This can be done in a stratified model of cause and effect where levels interact by modifications to chains, thus giving us this temporal expression of formalism that we seek.


As discussed in Chapter 1, biological genotypes arise because predictable chains of events are supported by a complex ecosystem. The presence of event paths in the ecosystem produces a limitation on what may occur, and information about what is to occur may be propagated in reverse into the past – thus building the basis for anticipation through knowledge use.

The notion of a "circuit" in biochemistry and electrical engineering is defined as a tendency towards a specific chain of reactions.  Biochemical circuits have non-deterministic decision points that critically depend on environmental conditions.  We will set aside, as diversionary, the philosophical argument against non-deterministic events.  It is only important to recognize that non-deterministic events must be used to explain selective attention and individual differences within anticipatory models.  But the possible evolutionary paths in complex systems limit this non-deterministic aspect. Some specific set of constraints on the paths must occur due to the presence of ultrastructure (see Chapters 1 and 2). 

The use of structural invariance in the multi-scale models of circuits will result in distributed adaptation to repeated interaction between the active players in a complex ecosystem environment such as eBusiness.

Business ultrastructure (Deming and Long, 198?) is discovered through a process of descriptive enumeration of an AS-IS model, using a process-flow type of model from business process re-engineering. Human intuition and knowledge are required to do this effectively. An interview process is generally employed to build an AS-IS model. From this AS-IS model one may be able to define a set of categories for knowledge artifact routing and discovery. The categories can be configured as a set of category policies into which substructural artifacts can vote to place information into retrieval bins. A second order control system makes modifications to the definition of the category policies, to the substructural artifacts, and to the relationships that can occur in the temporal expression of information within a business process re-engineering project.
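
The voting of substructural artifacts into retrieval bins can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the artifacts are taken to be single words, each category policy is a hypothetical table of vote weights, and the bin is chosen by the largest total vote. None of these choices is fixed by the discussion above.

```python
from collections import Counter

# Hypothetical category policies: each maps substructural artifacts
# (here, plain words) to the weight of the vote they cast.
CATEGORY_POLICIES = {
    "billing": {"invoice": 2.0, "payment": 1.5, "refund": 1.0},
    "shipping": {"parcel": 2.0, "delivery": 1.5, "refund": 0.5},
}


def route(text, policies=CATEGORY_POLICIES, threshold=1.0):
    """Return the retrieval bin with the most votes, or None when no
    category collects at least `threshold` votes."""
    votes = Counter()
    for word in text.lower().split():
        for category, weights in policies.items():
            if word in weights:
                votes[category] += weights[word]
    if not votes:
        return None
    category, score = max(votes.items(), key=lambda kv: kv[1])
    return category if score >= threshold else None


billing_bin = route("Invoice and payment overdue")
shipping_bin = route("parcel delivery delayed")
no_bin = route("hello world")
```

In this picture a second order control system would be whatever process edits the weight tables over time; that part is left out of the sketch.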

However, adaptation and recognition of invariance are not the only factors that are required to describe general properties of complex natural systems. Biochemical systems clearly need the notion of degenerate (underdetermined) circuits in a virtual ecosystem (see work by G. Edelman, 1987). Complex natural systems, with degeneracy, depend on conditions not immediately preceding the time of observation, and thus the initial condition and boundary conditions do not fully determine the behavior of the system.

How can we make sense of this?  It is in the business context that funding becomes available, even if we are not able to find proper funding for the precursor development of the anticipatory technology.  We recognized that anticipatory technology had to be figured out without a great deal of real economic support and that the first applications had to be extremely simple to use, self teaching and not encumbered with any costs.

In the late 1990s, there were some generalizations from physical systems theory that were promising. The assembly of cellular metabolites (from a universe of cellular environments W) into physically realizable cellular metabolisms, G, as suggested by Casti’s equation:

f: W → G,          f ∈ H(W, G)

(1996), is shaped by a class of constraints, some of which can be modeled by canalization in switching nets and other features of artificial life systems. The notion of canalization provided us with a workable model of value and production chains. The intuition was suggestive, but we had not yet come to appreciate the role that human intuitive faculties had to play in the development of our first notions about a knowledge operating system.

The comments of the preceding paragraphs point out that the nature of the transformation f and the class of all such transformations, H(W, G), cannot be expected to be exclusively numerical. By looking closely at the nature of enumeration and the forced equivalence between different members of a prototype, we see that it is category theory that is a missing ingredient. The central issue facing process engineering is whether or not this class of transformations can be characterized using the category policy and the capture of human judgment about types of use chains.

One of the leading Russian logicians in QAT, Zabezhailo (1995), demonstrates that a predictive theory of biochemistry is realizable by using special QAT logics to perform iconic computations. These computations are carried out using inference operators. Since we cannot cover the details of Zabezhailo's work here, we return to metaphor. The paths in "iconic" space are analogous to trajectories in numerical state spaces; however, these iconic trajectories (Pospelov refers to them as syntagmatic chains) are lawfully constrained by the information in a database containing the results of specific analysis of biochemical structural activity relationships (Zabezhailo, 1996).

Part of the excitement that we found in the tri-level architecture was derived from the suggestion that the nature of value and production chains could be captured by the formalism derived from our study of QAT. But how can we complete our thoughts in an interpretation of the scientific literature in complex systems and neuropsychology? We still needed to understand more. We needed new language and some new terms and expressions to carry the new understanding that the scholars have found.

In Prueitt (1995) the author stated a hypothesis about a universal set of features that all complex natural systems have. This paper is revised as Chapter 1.

Figure 2: The Process Compartment Hypothesis is based on the observation that all natural systems form, maintain a relative stability for a finite period of time, and then dissipate.

The hypothesis is really nothing more than the observation that every natural system forms through an aggregation process, has a period of relative stability, and then de-structures through a process of dissipation. While stable, the system has a basic character, or signature, that governs the evolution of the system. This basic signature, a system image, has the ability to assemble new constructs from a finite but open set of basic components.

The basic signature is coupled to the autonomy of the system, but sits aside somehow.  An ultrastructure is felt as the total set of external affordance coming from the event paths in specific ecological circuits.  This new language fits our purposes. 

A central challenge for Prueitt’s hypothesis is to identify the nature of a specific system image for any class of phenomena under investigation. For Pospelov, this goes to the question of why an object, like a city, exists (Pospelov, 1984). "How the object exists" is a different question; one that refers more to the physical composition of the object and the set of all relationships to other objects. Why a complex object exists is known only if one is able to examine the nature of causes and of structural composition.

We come to appreciate that the notion of a self-image can be used to provide a metaphor for theorizing about the system image of a process compartment.

Moreover, a general notion developed that the set of lawful dynamics of a compartment is the system image for the assembly of processes that are occurring within this compartment. Ah, we seem to have something here.

However, in spite of the strength of this introspection, a systems description of the properties of a system image was not completed. For example, the theory of reflective control details some of the complexity of this problem (Lefebvre, 1996). What are we still missing?

Let us reflect on what Figure 2 might signify. Figure 2 represents processing in three nested levels, as delineated by the average lifetime for compartments at each level. At time scale 1 many subprocesses are occurring, but are not observable from time scale 2. An example is the molecular vibration on the surface of the desk in front of me. Likewise the changes at the third level are too slow to be observed from the second time scale. An example is the gradual change in average global temperature.

Clearly the nesting of compartments is an approximation to real world dynamics where the time scales are not uniform from one process to another. This reminds us of the central thesis of Chapter 2, that a formal system is useful but should not be confused with the natural system that the formal system stands for. In other words, what we are missing is the reminder that human cognition has something more than algorithmic natures.

Part of the statement of the Process Compartment Hypothesis is that significant regularity in temporal stratification of processes does in fact exist and can be observed in the world. 

What remains now are the empirical studies, and new formalisms that can encode the results of these studies without imposing the types of limitations that we have noticed regarding the application of Hilbert mathematics to models of human awareness.


2: Quasi Axiomatic Theory (QAT) and Plausible Reasoning

The semiotic systems demonstrated by the Pospelov - Finn group have suggested, to us, that machine readable ontologies have been created using the extensive logical theory and heuristic methods developed by the Russian community. These ontologies have the special features of a computational system that is open to perturbation at all levels of its formulation. This is a critical feature that can be exploited in the development of new systems for machine-aided investigation of natural phenomena. Knowledge management technology must stem from this work and receive extensions of this work. 

Relationships between natural concepts can be identified as linkages between nodes in a formalism. In very static situations such a representational schema can be easily built to represent a situation’s ontic structure. Thus it is possible to model the evolution of a situation in well-understood contexts as the stable phase of the compartment (Figure 2). However, in non-static situations, or even static situations in unknown contexts, it is only possible to represent a situation’s evolution if a so-called "second order cybernetic" system is available.

Second order cybernetics are sets of rule modification transforms relative to a specific class of situations. These rule modification transforms are required to capture the nature of circuits available to processes as they occur in the model and are attempts to capture the computational nature of the system image.   Humans do this quite naturally and formalization is a hard thing to accomplish, but some formalization of second order rules is possible.

Consider again the Quasi Axiomatic Theory (QAT)

G = < S, S’, R >

where

·         S is a set of axioms

·         S’ is a set of elementary empirical statements

·         R is the set of inference rules.

We have suggested that QAT is natural because an important distinction between plausible reasoning and reliable reasoning corresponds exactly to the difference between the formation process and the period of stability that occurs once a compartmentalized entity has come into being. In both cases, there is an aggregation that uses a substrate as a source and involves the formation of a whole that accounts for systemic needs of the level into which the emergence is occurring. The reliance of QAT on J. S. Mill’s logic (see Finn, 1991; 1996, and Chapter 9) is natural because natural systems decompose into basic elements when the system image disappears and the system dies or dissipates. This process is accounted for through the automated modification, deletion and addition of axioms.

Plausible reasoning can be transformed into reliable reasoning given a well behaved set of empirical statements and a method for constructing (compressing) the set of empirical observations S’ into a set of axioms S with a specific inferential logic R (and perhaps a specific geometry) to reflect the composition of axioms into statements about evidence. One has such a reduction of plausible reasoning to reliable reasoning in elementary geometry and in elementary arithmetic. All of the empirical observations are made in the form of theorems that are “proven” using deductive processes and the axioms. The reduction is not “perfect” in the sense discussed by Gödel, but the elegance of these formal systems is wonderful.
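
The relationship between S, S’ and R can be made concrete with a small sketch. The Python below is an illustration under loud assumptions, not the inference apparatus of Finn or Pospelov: statements are plain strings, each rule is a hypothetical function from known statements to inferred statements, and the single example rule is transitivity of the "less than" relation from elementary arithmetic. It shows which empirical statements are not yet accounted for by the axioms, and how the addition of an axiom changes that.

```python
from dataclasses import dataclass, field


@dataclass
class QuasiAxiomaticTheory:
    """Illustrative container for G = < S, S', R > (names are assumptions)."""
    axioms: set = field(default_factory=set)        # S
    observations: set = field(default_factory=set)  # S'
    rules: list = field(default_factory=list)       # R

    def closure(self, max_rounds=100):
        """Apply the rules to the axioms until nothing new is inferred."""
        known = set(self.axioms)
        for _ in range(max_rounds):
            new = set()
            for rule in self.rules:
                new.update(rule(frozenset(known)))
            if new <= known:
                break
            known |= new
        return known

    def unexplained(self):
        """Observations that the axioms and rules do not account for."""
        return self.observations - self.closure()


def transitivity(known):
    """Example rule: from 'x<y' and 'y<z' infer 'x<z'."""
    pairs = [s.split("<") for s in known if s.count("<") == 1]
    return {f"{a}<{d}" for a, b in pairs for c, d in pairs if b == c}


theory = QuasiAxiomaticTheory(
    axioms={"1<2", "2<3"},
    observations={"1<3", "3<4", "1<4"},
    rules=[transitivity],
)
missing_before = theory.unexplained()

# Addition of an axiom, one of the modifications the text mentions.
theory.axioms.add("3<4")
missing_after = theory.unexplained()
```

Before the new axiom is added, two of the three observations cannot be derived; afterwards every observation follows from S under R, which is the sense in which the observations have been compressed into the axioms.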

In those cases where this compression is well defined and the statements are expressed in natural language, we have an ideal form of natural language compression (abstraction). If the inverse of compression exists we have text generation from a small data source. If we have both compression and its inverse we then have the ideal form of machine message understanding, and thus the new type of knowledge management technology we are looking for.  The technology that we have been proposing is not a technology that creates an artificial intelligence, but rather a smart processor that does what a computer can do in organizing and presenting data invariance to human perception. 

The primary open problem that remained was the problem of interpretation, and this problem is addressed by managing some information as a set of causes of events that occur at the level of observation.  Other information is managed as part of the analysis of compositional elements that go into the formation of the event.  In this way, the evidence that something is interpreted in a valid way can be applied during the aggregation process.  A theory of cause can be developed that places event information into situational context.  The interpretation merely biases the aggregation so that any human interpretive viewpoint is partially or wholly accommodated with some degree of under-constraint. 

In Chapter 2 we considered the possibility that a single formal system might adequately reflect some evidence, while not being able to reflect other evidence that remains significant; for example, a geometric or topological feature of S. 

Because of these features, S may partition into higher order categories, each capable of producing a separate and quite different formalism, each complete with various possible interpretive viewpoints. Each of the formalisms, what we will call "quasi formal systems", may develop the essential set of properties that a classical logical or algebraic system has, i.e. properties of completeness and consistency. We will take this issue of multiple ontologies up again after a brief discussion about translatability issues, and implications that can be drawn about the source of natural language translation problems.

Ordered sequences of quasi formal systems and situations with multiple ontologies are two critical issues that must be understood to judge the Russian work, on plausible reasoning and QAT, in the light that it was intended. For the Soviet school, the privileged work in situational control was grounded in a pragmatism linking the non-stationarity of the world with observations about how transient systems arise and express coherent behavior. This work, on situational control, led to the unique features of QAT.

3: On the question of ontologies

The Orb notation and discussions in this book support a conjecture that natural processes compartmentalize through the constrained assembly of components. Mental events are seen as a primary example where the aggregation process is complex and admits tipping points, or points where the future behavior is subject to structural and functional changes.

As a result of compartmentalization, multiple formalisms appear necessary to generalize about the nature of complex natural systems.  This is because each compartment is the consequence of a different set of causes and has a different set of properties.  This difference is often seen as context, but we believe that the usual discussions about context are not complete.  A pragmatic axis, something that exists only in the present moment, is also necessary in the grounding of formalism as a model of a compartmentalized process.

The compartments are also caused by observation of one system by another.  This is a partial cause, of course.  But the observation entangles the two systems in a way that might not be part of the entanglement of a third system observing the first two.   In this way, multiple perspectives ‘cause’ multiple realities.  This was exactly what we were looking for.   Developing the notion of system entanglement answers many questions about human perspective in social situations. 

Paradox arises, and this is a necessary aspect of a ‘rational’ treatment of system entanglement.  However, science itself lays the ground for a maturity that allows paradox in specific cases.  Up becomes down when up and down are mixed together during the entanglement of one complex object with another. 

Karl Pribram has communicated to the scientific community the extensive experimental evidence that a formation of electro-chemical compartments, thought of as radiant energy distributed in a spectrum, occurs within the dendritic networks of the human brain (Pribram, 1991). But this evidence, while certainly integrated into the mainstream scientific literature, has not been the basis for a comprehensive theory of brain function. The concepts exist in the literature in a fragmented form and are countered by traditional views about neural processing (see Chapter 4).

An extensive literature investigates parallel processing here, optical holography there, Lie algebras over there, quantum mechanics in this other place. But, up to 2004, there has been very little integration, because there is no common formal foundation that is mathematical and logic-like in nature. There are isolated communities, but they have generally not been supported economically and thus have not been able to develop a common framework that can be communicated beyond work that is often either philosophical, and thus not complete as a scientific inquiry, or very narrowly focused, such as the study of molecular computing, blind sight or business processes.

The community is in an odd place of having to strongly criticize the very methods of investigation that have allowed them to see the next step. So we stand on an edge, satisfied neither with our technology nor with our deterministic science.

As has been stated in Chapter 1, the general theory of weakly linked oscillators has categorical invariance with quasi axiomatic theory, and the investigation of formal relationships between systems of oscillators and QAT could serve to establish the science of open systems in a framework that could unite various academic groups. The relevance of this theoretical work is that it shows us how to create multiple realities in the form of machine ontologies and inference engines. These realities do not have to be consistent with each other, but they do need to share a common substrate from which an experimental ground is established and enriched by the various points of view.

The theory of coupled oscillators has been under development by mathematicians and physicists for some years. Systems of these oscillators produce an emergent energy manifold that can transform many (subfeatural) systems into a single coherent system that is phase locked. This corresponds to the formation of a compartment in physical systems, and to the specification of inference apparatus in QAT.   The question of system autonomy and entanglement can be explored in a context that has an exact modeling relationship to physical devices. 
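
Phase locking in a system of coupled oscillators can be illustrated numerically. The sketch below uses the Kuramoto model, a standard textbook model of coupled phase oscillators that is not named in this chapter and is chosen here only for its simplicity. The function names, the number of oscillators, the Gaussian spread of natural frequencies and the step size are all assumptions. The order parameter r measures coherence, with values near 0 for incoherent phases and values near 1 for a phase locked system.

```python
import cmath
import math
import random


def order_parameter(phases):
    """Magnitude of the mean phase vector: near 0 incoherent, near 1 locked."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))


def simulate(coupling, n=50, steps=2000, dt=0.01, seed=0):
    """Euler integration of the Kuramoto model in its mean-field form:
    d(theta_i)/dt = omega_i + K * r * sin(psi - theta_i)."""
    rng = random.Random(seed)
    freqs = [rng.gauss(0.0, 0.5) for _ in range(n)]  # natural frequencies
    phases = [rng.uniform(0.0, 2 * math.pi) for _ in range(n)]
    for _ in range(steps):
        mean = sum(cmath.exp(1j * p) for p in phases) / n
        r, psi = abs(mean), cmath.phase(mean)
        phases = [p + dt * (w + coupling * r * math.sin(psi - p))
                  for p, w in zip(phases, freqs)]
    return order_parameter(phases)


r_uncoupled = simulate(coupling=0.0)  # oscillators ignore one another
r_coupled = simulate(coupling=4.0)    # strong coupling, well above onset
```

With no coupling the oscillators drift at their own frequencies and the order parameter stays low; with strong coupling the many oscillators settle into one coherent rhythm, which is the numerical counterpart of the formation of a compartment described above.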

We can explore categorical relationships between natural and formal systems (Rosen, 1975; 1985) and consider non-stationary relationships between two or more quasi-formal systems. These relationships, in the context of specific automated reasoning systems, provide examples of systemic differentiation of logical inference into multiple systems of logic. Each of these systems can be referred to as a first order cybernetic system S_i, indexed by a finite set I, each sharing a common second order cybernetic system, as reflected in the inferential logic for S. Moreover, it is conceivable that each first order system is undergoing transformations C(S_i) of the type that Pospelov anticipated in Equation 2. In this case, the transformation of linkages between distinct systems is part of the second order cybernetics.

A second order cybernetic system is analogous to the system image discussed in the context of the Process Compartment hypothesis. The systems of logic correspond to middle level interaction, with autonomous compartments (wholes) having specific sets of properties (see Chapter on Mill’s logic). They can also be thought of as a collection of models, each of which is consistent within, but not consistent across multiple models. The commonalities are within a common substrate of physical and energy processes. The commonality is viewed from a perspective that entangles the complexity of the natural systems (i.e. the expressed entailment from unobserved levels of each system).

The field manifold created by a system of coupled oscillators is analogous to the spectral processing referenced by Pribram.   There is a differentiation of a behavioral system, controlled by neural connectivity, and an awareness system controlled by electromagnetic phase coherence. He has called the theory "transformational realism", since a common substrate is the distribution of radiant energy.  Reality binds the behavioral and the awareness system together. 


4: Classification of issues regarding computational document understanding

This section discusses a core set of problems that are faced by document understanding and knowledge management.  We return to the aspect of the problem that is both linguistic and requiring a temporal logic.

Computational document understanding may be possible if a second order system selects the proper context for disambiguation of the text. This is the hardest problem faced by machine translation systems. Linguist Faina Citkin communicated to the author a categorization schema for treating issues of translatability. Dr. Citkin provided the primary translations for several U. S. Army funded conferences (1994 – 2000) on Applied Russian Semiotics.

For us, her categorization of the translation issues into these three classes provided essential insight about the critical communication problems encountered by the Russian logicians at these U.S. Army sponsored conferences, and by their Western counterparts (such as the Rosen school). The insights were, and still remain, of a rather personal nature, since the complexity of the relationship between these scientific groups is extreme and involves a political dimension. However, the difficulties, though really quite severe, are not a great deal different, qualitatively, from the difficulties in many of the corporate reordering processes that take place, or fail to take place. Thus we will discuss this categorization here, in order to bring some proper linguistics into the picture of next generation knowledge management that we are building.

In Citkin’s categorization schema, there are three types of terminological relativity: referential, pragmatic and interlingual. These will be discussed only briefly.

Special texts, like product manuals, often have one to one correspondences to devices or processes. The issue of their understanding, and thus their translatability, is included in the first class. The class of interlingual type terminological relativity is implicated when there is a clear external object for each concept expressed. Technical jargon has this distinction, at least on the surface. A poem might have less clear reference to external objects, and minimalist art would have even less correspondence to a finite and specific set of things in the world.

The first class of issues can be resolved if a knowledge domain has been encoded to allow automated checking procedures between the source text and the target text. One way of visualizing this is to imagine that the world consists of sequences of ordered pairs,

{ (state, gesture)i }

and that both the state and the gesture are elements of a finite space of entities. The first class of translation issues is resolved when we know this finite state space completely, and have a means of knowing exactly what a state or a gesture is, in either language. Now the problem for translation is merely one of substitution. The temporal aspect moves us from one state / gesture pair to the next. Process modeling methodologies develop a flow diagram where state transitions are made clear. In certain circumstances the use of process modeling methodology reduces a business activity to a simple set of flow and relational diagrams. Three additional steps can be taken once this reduction has been completed. An AS-IS model can be developed based on the diagrams. This model is taken to management and then to various stakeholders for refinement. Following the AS-IS validation a TO-BE model is developed and validated. The last step is generally the most difficult and involves implementing changes via specific business process re-engineering practices and tool sets.
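The substitution view of first class translation can be sketched in a few lines. This is only an illustration under the assumption that the state and gesture spaces are fully enumerated; the vocabulary entries and the names STATE_MAP, GESTURE_MAP and translate are hypothetical placeholders, not taken from any system described here.

```python
# Hypothetical finite vocabulary: every state and every gesture has exactly
# one counterpart in the target language (the first-class assumption).
STATE_MAP = {"valve_open": "tgt:valve_open", "valve_closed": "tgt:valve_closed"}
GESTURE_MAP = {"turn_left": "tgt:turn_left", "turn_right": "tgt:turn_right"}


def translate(sequence):
    """Translate a sequence of (state, gesture) pairs by pure substitution.

    A KeyError is raised when a state or gesture lies outside the known
    finite space, which is where this first-class approach stops working.
    """
    return [(STATE_MAP[state], GESTURE_MAP[gesture]) for state, gesture in sequence]


source = [("valve_closed", "turn_left"), ("valve_open", "turn_right")]
target = translate(source)
```

The lookup fails loudly on an unknown state, whereas the closed knowledge sources discussed next may fail without reporting it.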

The knowledge domain, in this case, can be something like an expert system or object database, but these knowledge sources are not open systems and thus will fail unpredictably if the context changes. Since the system may also fail to tell us about the failure, it will, as it were, lie to us on a fairly regular basis. This is the current problem with machine intelligence systems. Agile manufacturing methodology was developed to address this problem in the context of business re-engineering.

Two near term solutions can be brought to bear with agile manufacturing methodology.  The first involves knowledge representation using concept tokenization.  The knowledge domain is represented as a semantic net or ontology like a semiotic table, in which case the possibility for automated document understanding and thus translation of meaning is enhanced.  The second involves the development of the tri-level architecture. 

The second class, the class of pragmatic issues, is also related to a theory of interlingua where the situation addressed is dynamic. The tri-level architecture assumes the existence of a table where the system states that a process compartment can assume are all specified and related, via a composition function, to a database of subfeatures. The properties of this table are represented in the form of a database plus a specific situational language and contextual logic. In the case of Finn’s system for structural pharmacology and several other Russian systems, this work has been done and can be demonstrated. The Pospelov-Finn systems have the ability to produce an "emergent" ontology for situations where pragmatic and interlingua issues characterize the hard tasks. In this case, when the tools are available, the emergent ontology is computable in context.

Consider the problem from a linguistic point of view. An underlying ontology, as expressed in a semantic net or table, can assume different system states and thus the sense of the terms may drift. This is done with the voting procedure. The voting procedure orders a set of category artifacts and effects either a routing of information or a decision to retrieve information. Both of these effects are virtual, in that the routing or decision is not about either the categories or the semantic-net / table. The effects are middle level events within a tri-level architecture. The substructure and ultrastructure are “encapsulated” away from the perception of the user, and thus appear as states or gestures in the same form, e.g. finite state space elements, as when we know exactly all states or gestures prior to use. However, in this case the states and gestures are newly created and given unique meaning by the user in the context of use. The tri-level architecture is an agile architecture working with a virtual state space, exactly as we need.

The rules that govern an adaptive ontology allow a modification of the sense of the target term so that text is understood in a sense that is consistent with the source term.  Again, from a linguistic point of view, a translation process must import some of the knowledge that tracks this drift in sense.  This commonality is held by the natural language as understood by participants in the language use.  In the tri-level the commonality is held by the substructural and ultrastructural artifacts and then entangled through the composition function. 

The target representation is semantically invariant to the source representation.  The representation would not be of an early Wittgensteinian sense, where all of the tokens of language have a one to one correspondence to realities and facts in the natural world.  This case is the Citkin class of interlingua type terminological relativity (first class covered above.)  The second class, the class of pragmatic issues, is addressed in the later Wittgensteinian view that language points to reality and must have an interpretant.  In this sense Wittgenstein comes to the same position as developed by American pragmatist C. S. Peirce. 

Thus pragmatics is, as it should be, related only to a specific situation at a specific time (or state of the ontology). Interlingua type relativism is a condition of equality, i.e., this word in the source language is that word in the target language. Pragmatic type relativism is a condition of system transitions from one state into another, but under a uniform set of rules. As demonstrated by Pospelov and Finn, this set of rules can be captured in the special semiotic logics of applied semiotics.

The third class, the class of referential type, includes issues arising where a term’s meaning in the source language has an ontology that does not exist in the target language. Here the process compartment that shapes the source term’s meaning, in the world of someone’s experience, does not correspond to any possible neural processing compartment responsible for generating signs in the target language. As can be said about the appreciation of poetry, overcoming issues of referential type involves creativity and a perceptual measurement of new observables. This class is treated extensively in the works of linguist Benjamin Whorf.

An example of a referential issue would be found in the translation of a world view created by Russian scientific deference to Marx and Pavlov’s scientific materialism in post World War II USSR. In the West there was no such deference, or at least the deferences were of a different type. A second example is the deference given to two valued logic by Western philosophers and scientists. This deference is deeply grounded in our culture. In the West, the notion that non Boolean logic would be of "ontological" value is ridiculed. A third example would be the structure and form of Hopi sand (medicine) drawings. Most people unaware of Indian "Old Way" would never imagine that a relationship could be made between colored sand designs on a dirt floor and the healing process.  In each of these examples, the problem with translatability is that there are no containers to place meaning in target languages, unless that language has a similar referential type.

The quality of any automated reasoning system is a function of its power to reveal the basic signature of a situation under investigation (see Ritz and Huber, 1996). To do this, it is often necessary to resolve paradox.  A system that resolves paradoxes will produce information complementarity.  An entanglement of the viewpoint’s substructure and ultrastructure can accommodate multiple viewpoints.  Accommodation produces the emergence of a new system for understanding both ontologies and their natural inter-relationships.

Thus the requirement for agile knowledge engineering and process re-engineering in the commercial world is similar to the problem of having a common "scientific" methodology in physics, neuroscience and psychology. In chapter 2 we also imagined what is involved in the fusion of two separate thought processes. Here we used the model of weakly linked physical oscillators. The formation of a marriage or friendship between individuals is another illustration of a system where an entanglement process is occurring.

5: Using theme vectors as a basis for data mining

The above sections communicate the general principles that shape an emerging theory and practice of knowledge processing. Automated reasoning and document understanding are viewed in this light.

In this section, we develop a notation and architecture for extracting synthetic concepts from the thematic analysis of document collections. The notation is constructed to allow further integration with QAT and category theory, and is extended in Chapters 11 and 12 to address the technology requirements of large scale knowledge management projects such as those needed by the federal government to manage the national secrets.  This work predated the Orb technology by five years. 

Let C be a collection of documents, T a set of computed theme vectors, and I the inverted index for T (see Figure 3).

T contains a set of theme vectors,

T = { (n, t)j | dj ∈ C }

where dj is a document, n = { n1 , . . . , n16 } and t = { t1 , . . . , t16 }. The positive integer ni is the semantic weight of the theme ti .

Figure 3: Each document in a collection C is represented by a vector of weights and phrases. The full set of theme vectors T is represented as an inverted index I.

We are interested in an automated procedure that uses an index of theme vector based classifications to produce category schemata for document retrieval and a set of distinct situationally specific ontologies. Duplicate document detection and similarity analysis in text collections must work with this index (see Chapters 11 and 12).

Figure 4: User views can take the form of a simple hierarchy (a) or as a more complex semantic net (b)

The method presented is a modification of several published methods for identifying concepts using vector representation of documents (van Rijsbergen, 1979). It borrows features from Hecht-Nielsen’s method, based on word stemming plus vector clustering by neural nets, and from the D. Pospelov and V. Finn methods for situational representation.

A schematic diagram showing the architecture for knowledge extraction is drawn in Figure 5. C and T have been introduced above. S is the representational space for the collection’s theme vectors. S is formally a simple Euclidean space with, for moderate size collections in one subject field, about 1500 dimensions. Each dimension is created to delineate a single theme phrase.
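One way to picture the space S is as a coordinate system with one axis per distinct theme phrase. The sketch below assumes a theme vector is stored as a list of (weight, phrase) pairs; the toy vectors have two components rather than sixteen, and the phrases and the function names build_axes and embed are illustrative assumptions only.

```python
def build_axes(theme_vectors):
    """Assign one dimension of S to each distinct theme phrase."""
    axes = {}
    for vector in theme_vectors:
        for _weight, phrase in vector:
            axes.setdefault(phrase, len(axes))
    return axes


def embed(vector, axes):
    """Place a theme vector in S: its weight on the phrase's axis, 0 elsewhere."""
    point = [0] * len(axes)
    for weight, phrase in vector:
        point[axes[phrase]] = weight
    return point


theme_vectors = [
    [(40, "trade policy"), (25, "tariffs")],
    [(30, "tariffs"), (10, "shipping")],
]
axes = build_axes(theme_vectors)
points = [embed(vector, axes) for vector in theme_vectors]
```

A real collection with about 1500 axes would call for a sparse representation, since each document touches at most sixteen of them.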

Figure 5: Schematic diagram for knowledge extraction and situational representation

The relationships between topic areas and the topics can be separated into manageable groups. Subject fields with greater than 1500 themes can be compartmentalized into a small number of topic areas. The themes can be linked using thesaurus and co-occurrence tables.

Suppose that we have a document collection C about a small number of narrow topics. Let T be the set of generated theme vectors. The phrase component of each vector component for every element in T can be sorted into bins. This sorting process will saturate as the probability goes to 1 that a new theme phrase already has an assigned bin. New bins are created when necessary so that each bin is representative of a single theme and every theme has been placed in a bin. This process creates an "inverted index" of the themes. The inverted index is ranked by the number of documents having that theme. This ranked index is denoted by the symbol I.

I = { ti | i ∈ index set J }

The weight of the word phrase, ti , is dropped in the notation to simplify writing.
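The binning and ranking procedure described above can be sketched as follows. The sketch assumes each document's theme vector is available as a list of (weight, phrase) pairs keyed by a document id; the sample collection and the name build_inverted_index are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(collection):
    """Map each theme phrase to the ids of documents carrying it,
    ranked by document count (most widely shared themes first)."""
    bins = defaultdict(set)
    for doc_id, vector in collection.items():
        for _weight, phrase in vector:  # weights are dropped, as in the text
            bins[phrase].add(doc_id)
    # Ties are broken alphabetically so that the ranking is reproducible.
    ranked = sorted(bins.items(), key=lambda item: (-len(item[1]), item[0]))
    return dict(ranked)


collection = {
    "d1": [(40, "trade policy"), (25, "tariffs")],
    "d2": [(30, "tariffs"), (10, "shipping")],
    "d3": [(20, "tariffs"), (15, "trade policy")],
}
index = build_inverted_index(collection)
```

Because Python dictionaries preserve insertion order, iterating over the result yields the themes in ranked order.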

Now a user can fix a view of the document collection by marking as "valid" all themes having relevance to that view. This procedure is, of course, based on an intuitive judgment by a user. This procedure of marking can be done using specially constructed user profiles. The profiles are taken as the union, in some way, of text units that are grouped together for this purpose.

The valid themes for a view define a subspace Sview . This subspace can be used for trending of synthetic concepts across categories such as time or another concept. The inverted index I can be restricted to the valid themes for a specific view. The result is a new inverted index denoted here by the symbol J.

J = { ti | i ∈ index set K, which is a subset of the set J }
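Restricting the inverted index to a view amounts to filtering it by the set of themes the user has marked valid. A minimal sketch, assuming the index is a dictionary from theme phrase to document ids in ranked order; the sample data and the name restrict_index are illustrative assumptions.

```python
def restrict_index(index, valid_themes):
    """Keep only the entries marked valid for a view, preserving rank order."""
    return {phrase: docs for phrase, docs in index.items() if phrase in valid_themes}


index = {
    "tariffs": {"d1", "d2", "d3"},
    "trade policy": {"d1", "d3"},
    "shipping": {"d2"},
}
valid = {"tariffs", "shipping"}  # themes a hypothetical user marked as valid
view_index = restrict_index(index, valid)
```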

An assertion can be made about the completeness of Sview as a representational space with respect to a knowledge domain. If the collection of documents is comprehensive then additional computation of theme vectors for new documents will not increase the dimension of Sview . This is because the process of creating new bins for themes will saturate if the size of N is finite. We will assume that N is finite, although having the character that elements of N may change, may be deleted, or that a new element may be recognized at any time.
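The saturation claim can be observed empirically by counting how many new bins each successive document opens. The sketch below is only an illustration on a toy stream of theme vectors; the data and the name new_bins_per_document are assumptions.

```python
def new_bins_per_document(theme_vectors):
    """For each document in order, count the previously unseen theme
    phrases it contributes, i.e. how many new bins it opens."""
    seen = set()
    counts = []
    for vector in theme_vectors:
        phrases = {phrase for _weight, phrase in vector}
        counts.append(len(phrases - seen))
        seen |= phrases
    return counts


stream = [
    [(40, "trade policy"), (25, "tariffs")],
    [(30, "tariffs"), (10, "shipping")],
    [(20, "tariffs"), (15, "trade policy")],
    [(12, "shipping"), (11, "tariffs")],
]
counts = new_bins_per_document(stream)
```

When the trailing counts stay at zero over many new documents, the collection can be regarded as comprehensive for the view, in the sense used above.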

N denotes a class of natural kind. It is an inventory of all of the things that are constituents of events that arise in a specific arena. For example, a list of all man made pharmacological agents would be a class of natural kind. The set of all atomic elements is another class of natural kind. And under certain circumstances, the set of validated themes J is also a class of natural kind. A theory of how the elements of a class are created requires a list of subfeatures and a theory about how the parts of emergent features are aggregated to form a whole. We will return to this point later when we discuss the so called "derived model" in the last section.

N can be thought of as the situations that arise from J. These situations are often described by concepts in the form:

concept = { ti | i = 1, . . . , k }

where ti is associated with an element in the set of subfeatures J.

6: The use of formal models and semiotic models

We have introduced the general notion for a synthetic concept with n components:

concept = { ai | i = 1, . . . , n }

where ai is a theme phrase. With this very simple construction it is possible to view the occurrence of the full concept or even individual themes within a single synthetic concept. These concepts can be extracted from the index I using stochastic clustering or scatter/gather techniques (Hearst & Pedersen, 1996). The problem is, of course, that the concept has not been validated as meaningful. In what follows, we will suggest some techniques for refining the description of a meaningful concept. The rest of this chapter is technically oriented and may be skipped without loss to philosophical issues.

If used in a certain way, an advanced natural language processor engine can replace a major part of the time intensive steps that Pospelov was required to make to produce his formal models. In Situational Control (pg 32) Pospelov states: "situational control demands great expenditures for the creation of a preliminary base of data about the object of control, its functioning and methods of controlling it." This step must be automated with procedures similar to what we suggest below.

Consider the case where our document collection contains diplomatic messages regarding the internal affairs of a country, industry, company, computer network, social situation, medical case, etc. This collection of messages is collected into a document database which is then represented by theme vectors. A set T of theme vectors is produced. From this set of theme vectors we can easily merge the individual theme phrases into a table and count the number of messages having a specific theme. This produces an occurrence ranked inverted index I on the message themes. An expert on the internal affairs of the country is asked to mark those themes that are of most interest from a certain point of view.

This produces a new inverted index containing themes of interest. These themes can be quickly converted into either a hierarchical taxonomy, perhaps with "see also" hyperlinks, or a semantic net. The index, now called a theme index, can be further refined into a concept index, and concept indices can be linked together into an analysis of a real time situation. Using the power of modern relational databases we can keep track of pedigree and of information on the sources of learning or adaptation, as well as institute wrapper procedures to normalize information.

Documents can be classified by user validated aggregations of computed themes. This is done by linking the representations of documents to semantic characteristics that are encoded at the level of the theme structure and combining this information with a study of the co-occurrence of themes and the contextual category if that is known.

We have developed an additional capability. This capacity is to provide a unique situational analysis based on the relationship of part to whole as revealed by the extension of Mill’s logic presented in chapter 9. To understand this capability, we return to the definition of Pospelov’s formal systems (Situational Control , pg 36):

Definition: The term formal system refers to a four-term expression:

M = < T, P, A, R>

where T = set of basic elements, P = syntactic rules, A = set of axioms, and R = semantic rules.

The interested reader can refer to Situational Control for a more detailed treatment of formal systems. For our purposes we need only refer to a figure from page 37.

Figure 6: Taken from Figure 1.8, Situational Control (pg 37)

In Figure 6, the set of base elements T are combined in various ways to produce three types of sets: axioms, semantically correct aggregates, and syntactically correct aggregates. We should remember that mathematical logic is founded on a similar construction and therefore that most of the results of mathematical logic will somehow apply later on to the theory of knowledge processing that we are constructing. For example, the set of axioms can be specified to consist of independent, non contradictory and self evident statements about the set of base elements T. The axioms are our analog to the quantum mechanical substrate that gives rise to the mechanisms that in turn produce some of the material and energetic constraints that are involved in the creation of awareness.

Rules of inference can also be formulated to maintain notions about true or false inference from assignments of a measure of truth to the axioms. The set of syntactically correct aggregations of base elements can be defined either by listing or by some implicit set of rules. The semantically correct aggregations could then be interpreted as those syntactically correct aggregations that have an assignment of true as a consequence of the inference rules. However, this interpretation is only one of a number of interpretations for the formal relationships between T, P, A, and R.

We have found a simplification of theory. Pospelov himself notes that the axiom set can be the same as the set of semantically correct aggregations. In this case the rules of inference need not be known, but we certainly will lose the property that the axioms be independent and the axiom set be minimal in size. In the case where the system under study is a natural complex system, such as a national economy, there is no fully understood set of inference rules. One can only observe that national economies experience modal changes from one condition into another. Each condition can be defined "phenomenologically" as a semantically correct aggregation of an unknown or partially unknown set of base elements. We view only the surface features. Given this caveat, we will define the following formal model.

7: Definition of a formal model from theme phrases

Let the set, T, of basic elements = J . T is now the set of theme phrases that have been chosen, from I, by an expert as representative of the expert’s view of the situations addressed by the messages. The size of T, denoted by |T|, is finite and small - perhaps less than 300 elements. Let P be the set containing a single syntactic rule stating that any subset of T will be considered a syntactically correct aggregation. With |T| = 300, the size of the set of syntactically correct aggregations is 2^300, which is a very large number. At this point we have a lower and an upper envelope on the semantic rules. Any possible semantic rule must assign the possibility of being meaningful to an element of the "power set". It is noted that one way to specify a set of semantic rules is to explicitly list the semantically correct aggregations.
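To give a feel for the size of the power set, the following sketch enumerates every syntactically correct aggregation for a three element T and then computes only the count for |T| = 300. The element names and the function name syntactic_aggregations are illustrative assumptions.

```python
from itertools import combinations


def syntactic_aggregations(themes):
    """Enumerate every subset of T. Under the single rule in P, each
    subset is a syntactically correct aggregation."""
    items = sorted(themes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


small_T = {"t1", "t2", "t3"}
subsets = list(syntactic_aggregations(small_T))  # 2**3 subsets, empty set included

# For |T| = 300, enumeration is out of the question; only the count is computable.
count_300 = 2 ** 300
digits = len(str(count_300))  # number of decimal digits in 2^300
```

The count 2^300 is roughly 2 x 10^90, which is why the semantic rules must select candidates rather than filter an exhaustive listing.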

Let A, the set of axioms, be defined to be equal to the set of semantically correct aggregations. T, P and A so defined leave only one remaining definition. The definition for the set of axioms is a bootstrap, since at this point there is no means for identifying which of the syntactically correct aggregations of base elements are meaningful with respect to the view under consideration. We need to create the semantic rules.

The notion of stochastic clustering can be applied to the task of clustering theme vector representations of documents. In fact the implicit notions of distance between products of stochastic clustering can be applied to produce internal measures of nearness between documents and between synthetic concepts. This fact will be exploited elsewhere. Here we are interested only in the crudest notion of an aggregation into meaningful collections.

We define a semantic rule that states the following: If a subset of T is grouped together by a clustering procedure, then the subset is meaningful. Such a rule would reduce the number of "candidate" semantically correct aggregations from 2^300 to a much smaller number, perhaps 2,000. However, such a rule is dependent on pairwise measures of similarity based on theme vector distance. Selecting good pairwise measures of distance is an interesting problem that has been worked on by a number of researchers. This problem is equivalent to the construction of a good axiom set and proper rules of inference. We are interested in bypassing this problem by employing any reasonable pairwise measure and then employing "checking" procedures to validate potentially meaningful aggregations.

What results is a compound semantic rule with two parts, (1) clustering and (2) checking. We can easily see that the use of the voting procedure (Appendix) as a routing engine has exactly these two parts.

To summarize our compound semantic rule: a set of themes serves as subfeatures to be aggregated using an algorithm to cluster theme vectors. When vector clustering identifies a collection of theme vectors as being close, then the individual themes within those theme vectors are grouped together as a syntactically correct aggregation. This is now a category and this category is part of a collection of categories called a category policy. The aggregation is treated as a synthetic concept and checked to see if the synthetic concept is meaningful. Checking for meaning can be as simple as asking the expert if the synthetic concept is suggestive of the situations known to exist and referenced by the message collection. This process of checking can be used to develop a refinement of the category policy over time, a refinement that involves the perception by users of the consequences of setting the categories in a certain fashion.
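The two parts of the compound semantic rule can be sketched as a clustering step followed by a checking step. The sketch below is not the stochastic clustering referred to in the text; it substitutes a simple single-link grouping on Jaccard similarity as one "reasonable pairwise measure", and it stands in for the expert with a callback. The threshold, the sample data and all function names are assumptions.

```python
def jaccard(a, b):
    """Overlap of two theme sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(theme_sets, threshold=0.3):
    """Single-link grouping: a document joins (and merges) every existing
    cluster that contains a member at or above the similarity threshold."""
    clusters = []
    for themes in theme_sets:
        merged, rest = [themes], []
        for group in clusters:
            if any(jaccard(themes, member) >= threshold for member in group):
                merged.extend(group)
            else:
                rest.append(group)
        clusters = rest + [merged]
    return clusters


def compound_rule(theme_sets, is_meaningful, threshold=0.3):
    """(1) cluster, then (2) check: return the aggregations that pass."""
    candidates = [frozenset().union(*group) for group in cluster(theme_sets, threshold)]
    return [candidate for candidate in candidates if is_meaningful(candidate)]


documents = [
    {"tariffs", "trade policy"},
    {"tariffs", "shipping"},
    {"elections", "coalition"},
    {"coalition", "parliament"},
]
# Hypothetical expert judgment: only trade-related aggregations are meaningful.
concepts = compound_rule(documents, lambda themes: "tariffs" in themes)
```

Here clustering proposes two candidate aggregations and the check retains one; in practice the check would be an expert's judgment or an automated procedure such as the trending described next.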

At least one automated checking procedure has been identified. Synthetic concepts can be trended over feature sets, such as a time sequence, to see if temporal distributions reveal locally normal profiles. Other visualization methods are clearly possible and have the advantage that a user is able to use human intuition to organize and structure the set A.

In Pospelov’s book, Situational Control, he describes methods for "deconstructing" the set A and reconstructing a new minimal size axiom set A’ and rules of inference that will generate from A’ a copy of A. In this case the formal model has a good axiom set and the inference rules are able to generate conjectures about new aggregations not originally in A, but from the same situation. This enables computational knowledge generation, the inverse of knowledge compression, as demonstrated by the Russian semiotic systems.

8: The selection process

This section contains an example of how machine readable ontologies might be constructed. It is not the only type of example, but one that demonstrates much of the tri-level framework for identifying subject indicators using Orb (Ontology referential bases).

Document management with concepts requires the delineation of the concepts that exist in a document collection, and are judged to be significant by the user.

This delineation often starts with a document collection C. Using software, one can construct the complemented document collection C* = {(d,v)} , where d is a document and v is a theme vector.

In Figure 5, above, a set of theme vectors T is computed from a document collection C . The elements S , N , and F are described as the visual themespace, a class of natural types and subfeatures of these types that form the basis for modeling.

In Figure 7, twelve theme phrases are evaluated by the user with values v (valid or meaningful), -v (not valid), and ? (not determined). As a result, the three valid themes, t1 , t6 , and t10 , are brought forward as the leading nodes of a hierarchy. The notion of valid is grounded in the meaningfulness of the phrase to the user. The themes evaluated to be not valid are dropped, not to be considered again. Theme phrases evaluated as "not determined" are placed into a new list and redisplayed to the user.

Figure 7: Twelve theme phrases are evaluated as valid, not valid or not determined.

In Figure 8, seven theme phrases, originally evaluated as "not determined to be meaningful", are displayed to the user for evaluation. Again, the user is asked to exclude theme phrases from further evaluation, delay evaluation to the next iteration, or assign an evaluation to the phrase. At every level, other than the top one, the user is asked to place the phrase as subordinate to one of the existing nodes.

Figure 8: Second level evaluation and the resulting hierarchy.

The iterative process continues until there are no phrases to be evaluated. This results in the definition of an abstract tree structure relating the phrases to some higher level construct, which we call synthetic concepts. Of course, the relationship between these structures and real concepts expressed (intended or unintended) in the document collection must be refined.
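The iterative evaluation loop can be sketched as follows. The user is replaced by a scripted evaluator so that the example is reproducible; the script, the phrase labels and the function name select_themes are assumptions, and the sketch does not attempt to reproduce the specific evaluations shown in Figures 7 and 8.

```python
def select_themes(phrases, evaluate):
    """Iterate until no phrase is left undetermined.

    evaluate(phrase, round_no, nodes) returns (mark, parent), where mark is
    "v" (valid), "-v" (not valid) or "?" (not determined) and parent is
    None for a top-level node. Returns a dict: valid phrase -> parent.
    """
    parents = {}
    pending = list(phrases)
    round_no = 0
    while pending:
        deferred = []
        for phrase in pending:
            mark, parent = evaluate(phrase, round_no, parents)
            if mark == "v":
                parents[phrase] = parent
            elif mark == "?":
                deferred.append(phrase)
            # "-v": the phrase is dropped and never shown again
        if len(deferred) == len(pending):
            break  # guard: stop if a whole round made no progress
        pending = deferred
        round_no += 1
    return parents


# Scripted stand-in for the user's judgments, keyed by (phrase, round).
script = {
    ("t1", 0): ("v", None), ("t2", 0): ("?", None),
    ("t4", 0): ("-v", None), ("t6", 0): ("v", None),
    ("t2", 1): ("v", "t1"),
}
hierarchy = select_themes(["t1", "t2", "t4", "t6"], lambda p, r, nodes: script[(p, r)])
```

The returned parent map is enough to draw the trees: phrases whose parent is None are roots, and every other phrase hangs under the node the user chose for it.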

The selection procedure allows the user (expert) to designate what is meaningful, and to organize the phrases into a hierarchy. Once the hierarchy is defined, then the interface can show the hierarchy in the form of trees.

Figure 9: Theme phrase hierarchy represented as a set of trees.

The hierarchy defines a set of meaningful aggregations of the theme phrases:

c1 = { t1 , t2 , t9 , t8 , t3 }

c2 = { t1 , t2 , t9 , t8 }

c3 = { t1 }

c4 = { t6 , t5 , t11 }

c5 = { t6 }

c6 = { t10 }

The results of formal computation are left open to the user’s input. This openness occurs at several levels, initially at the level of the basic subfeatures. At the level of concept subfeatures, phrases can be added and subtracted through the exercise of human judgment. Later, the user is allowed to effect an analysis and refinement of the relational logic between concepts.

Each of these concepts can be placed within a graph construction, where a formal relationship, inherited from the subset relationship, can be drawn in (Figure 10). The formal relationship, defined from set theory, is at this point a mere link between two concepts.
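The links inherited from the subset relationship can be computed directly from the six aggregations listed above. This sketch reports every proper subset pair, including transitive ones such as c3 within c1; whether a drawn graph would show only the direct links is a presentation choice not settled here. The function name subset_links is an assumption.

```python
concepts = {
    "c1": {"t1", "t2", "t9", "t8", "t3"},
    "c2": {"t1", "t2", "t9", "t8"},
    "c3": {"t1"},
    "c4": {"t6", "t5", "t11"},
    "c5": {"t6"},
    "c6": {"t10"},
}


def subset_links(concepts):
    """Return (smaller, larger) pairs where smaller is a proper subset of larger."""
    return sorted(
        (a, b)
        for a, themes_a in concepts.items()
        for b, themes_b in concepts.items()
        if themes_a < themes_b
    )


links = subset_links(concepts)
```

Note that c6 takes part in no link, so it would appear as an isolated node.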

In a strictly formal sense, a number of "artificial" relationships are implicit in the way basic elements are collected together into sets. Intuitively, there must be a formalism that provides a descriptive basis for representing these relationships. For example, consider the case where the concept is refined by a logical AND between each phrase: c1 = { t1 and t2 and t9 and t8 and t3 }. As a second example, consider the case where the concept is refined by a logical OR between each phrase: c1 = { t1 or t2 or t9 or t8 or t3 }.

The discussion of minimal and maximal collections with respect to a concept shows that changes in the logical relationships between basic features are reflected in relationships between the aggregated constructs. In this case the change from logical AND to logical OR reverses the set inclusion of the collections returned through the query by concept.
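The reversal of set inclusion can be checked on a small example. The sketch assumes each document is reduced to its set of theme phrases; the four documents and the function name query are hypothetical.

```python
def query(concept, documents, mode):
    """Return ids of documents matching the concept's phrases.

    mode="and": every phrase must occur in the document;
    mode="or": at least one phrase must occur.
    """
    test = all if mode == "and" else any
    return {
        doc_id
        for doc_id, themes in documents.items()
        if test(phrase in themes for phrase in concept)
    }


documents = {
    "d1": {"t1", "t2", "t9", "t8", "t3"},
    "d2": {"t1", "t2", "t9", "t8"},
    "d3": {"t1"},
    "d4": {"t3"},
}
c1 = {"t1", "t2", "t9", "t8", "t3"}
c2 = {"t1", "t2", "t9", "t8"}  # c2 is a proper subset of c1

and_c1, and_c2 = query(c1, documents, "and"), query(c2, documents, "and")
or_c1, or_c2 = query(c1, documents, "or"), query(c2, documents, "or")
```

Under AND the larger concept c1 returns the smaller (minimal) collection, while under OR it returns the larger (maximal) one.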

Figure 10: Semantic net for Figure 8.

From the point of view of software design, the important issues are:

1) the concepts themselves are defined through a process of user selections.

2) the concepts can be represented as a graphical icon, initially the symbols ‘c2’, ‘c7’, ‘c5’ etc., and the links between icons must be represented as line objects.

Both the line objects and the icon objects need to have an object properties resource that is displayed to the user via mouse clicks.

After the formation of the initial set of synthetic concepts, the relationships between concepts are instantiated as a default using a formal relationship based on minimal and maximal collections. One possible interface allows the user to explore the minimal and maximal collections associated with the concepts that they define. More complex formal treatments of the relationships between concepts are possible.

Formal relationships may suggest, to the user, a natural relationship between the concepts. However, the user can also specify relationships between concepts. Moreover, the notion of a concept relationship has now been introduced and can be used to add, subtract and modify linkages between concepts, for example establishing a relationship of a certain type between c5 and c3 .

In the notation of Pospelov, this can be written as the syntagma <c5, r, c3> where r is the relationship. As noted in the Introduction, the trick to automated document understanding may be in identifying a virtual state space where chains of syntagma are the trajectories of automated reasoning.
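A syntagma can be stored as a simple triple, and chains of syntagma can be followed from concept to concept. The sketch below is only one possible reading of "chains"; the relation label "part_of", the second triple and the names Syntagma and chains_from are assumptions added for illustration.

```python
from typing import NamedTuple


class Syntagma(NamedTuple):
    source: str
    relation: str
    target: str


def chains_from(start, syntagmas, max_length=5):
    """Enumerate chains of syntagmas that begin at a concept, following
    source -> target links without revisiting a concept."""
    results = []

    def extend(node, path, visited):
        if len(path) >= max_length:
            return
        for s in syntagmas:
            if s.source == node and s.target not in visited:
                new_path = path + [s]
                results.append(new_path)
                extend(s.target, new_path, visited | {s.target})

    extend(start, [], {start})
    return results


store = [
    Syntagma("c5", "r", "c3"),        # the syntagma <c5, r, c3> from the text
    Syntagma("c3", "part_of", "c2"),  # hypothetical second link
]
chains = chains_from("c5", store)
```

Each chain is a candidate trajectory through the concept space; a reasoning procedure would still have to decide which chains are valid.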

Each of the seven objects in Figure 5, above, has an independent role and can be treated separately. For example, J , Sview , and N can be stored in a small computer space and called into being when the original view of a document collection is appropriate. This enables a process compartment approach towards text understanding. The compartment in this case can be called a Knowledge Processing Unit (KPU), Figure 11.

It is important to note that a compartmentalization of document views into classes of natural kind is operationally dependent on lexicon and knowledge catalog resources. An iterative feedback between a knowledge processing unit and a linguistic processor would focus linguistic analysis and produce better results.

Figure 11: Iterative feedback between a Knowledge Processing Unit (KPU) and ConText

A single KPU can be used as a classification engine. The computation involved is minimal, except for NLP computation of a theme vector. The theme vector can be placed into a visual representation using the classes of natural kind as a finite basis of a state space consisting of syntagmatic units. The means for storage and maintenance of these syntagmatic units and the composition of valid syntagmatic chains is the core of the knowledge bases built by QAT methods.

Classification methods based on simple associative neural networks are also possible. Once one or more KPUs are created, then a training set of documents can be used to encode a distributed relationship between the class of natural kind and individual documents. By altering the order of presentation and rules of assembly, a single document can be associated with multiple concepts. After training, new documents would be classified as concepts within a specific view. Almost no computation, and almost no computer memory, is required for classification using a trained classifier engine, and thus the user, with proper software, could quickly be shown the conceptual relationships that a document might have in multiple views.
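To illustrate why classification with a trained associative engine is cheap, the sketch below uses a plain Hebbian-style weight table. This is a swapped-in, simplified technique and not necessarily the network intended here: the class name AssociativeClassifier, the representation of a theme vector as a dictionary from theme to weight, and the example themes are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Optional


class AssociativeClassifier:
    """Hebbian-style associative memory from themes to concepts (illustrative only)."""

    def __init__(self) -> None:
        # weights[concept][theme] accumulates co-occurrence strength.
        self.weights: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))

    def train(self, theme_vector: Dict[str, float], concept: str) -> None:
        """Strengthen the association between each theme and the given concept."""
        for theme, weight in theme_vector.items():
            self.weights[concept][theme] += weight

    def scores(self, theme_vector: Dict[str, float]) -> Dict[str, float]:
        """Dot product of the document's theme vector with each concept's weights."""
        return {
            concept: sum(w.get(theme, 0.0) * value for theme, value in theme_vector.items())
            for concept, w in self.weights.items()
        }

    def classify(self, theme_vector: Dict[str, float]) -> Optional[str]:
        """Return the best-scoring concept, or None if nothing has been trained."""
        s = self.scores(theme_vector)
        return max(s, key=s.get) if s else None


# Hypothetical training set: two documents, each labelled with one concept of a view.
clf = AssociativeClassifier()
clf.train({"tax": 1.0, "audit": 0.5}, "c2")
clf.train({"treaty": 1.0, "tariff": 0.8}, "c7")
best = clf.classify({"tariff": 1.0, "tax": 0.2})
```

Once trained, classification is a handful of multiplications per concept. The scores method returns a value for every concept, so a document can be shown against several concepts in a view rather than only the top one.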

The procedure outlined above uses the power of a Natural Language Processor (NLP) to bypass the time-intensive first step in constructing a Pospelov-type formal model. We defined the formal model:

M1 = < T, P, A, R >

where T = a set of themes, P = { power set operator P(.) on T}, A = {semantically correct elements of P(T)}, and R = { compound semantic rule }.
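The components T, P(T) and A can be sketched in a few lines. In the sketch below, the names FormalModel, power_set and build_model are hypothetical, and the predicate is_valid merely stands in for the validation step that decides which subsets are semantically correct. The compound semantic rule R is not modelled, since its form is not specified at this point.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List


def power_set(themes: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets of the theme set, smallest first (2**n subsets for n themes)."""
    items = sorted(themes)
    return [frozenset(c) for n in range(len(items) + 1) for c in combinations(items, n)]


@dataclass
class FormalModel:
    T: FrozenSet[str]        # base set of themes
    A: List[FrozenSet[str]]  # subsets of T judged semantically correct

    def P(self) -> List[FrozenSet[str]]:
        """The power set P(T), computed on demand because it grows as 2**|T|."""
        return power_set(self.T)


def build_model(themes: Iterable[str], is_valid: Callable[[FrozenSet[str]], bool]) -> FormalModel:
    """Select A from P(T) using a stand-in validity predicate (an assumption)."""
    T = frozenset(themes)
    A = [subset for subset in power_set(T) if is_valid(subset)]
    return FormalModel(T=T, A=A)


# Hypothetical themes and a toy validity test: pairs of themes that include "law".
model = build_model(
    {"finance", "law", "trade"},
    lambda s: len(s) == 2 and "law" in s,
)
```

Because P(T) doubles in size with every added theme, exhaustive enumeration of this kind is only practical for small theme sets. For realistic collections, A would have to be selected without materializing the full power set.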

We can now define the so-called "derived model":

Md = < Td, Pd, Ad, Rd >

This derived model can be developed by following a procedure for deconstructing examples as outlined in Pospelov’s Situational Analysis.

By referring to Figure 12, the reader can follow the creation of the formal model M1. J is computed using a natural language processor and a brief interaction with one or more human experts. J is the base set of elements T for the formal model M1. Through validation procedures, a class of natural kind N is identified by selecting from the power set P(T). Initially this class is simply the axiom set A.

Figure 12: Knowledge extraction and situational representation using a user defined view.

The formal model M1 can be constructed using existing software systems. However, more can be done once M1 exists. M1 contains a description, A, of meaningful aggregations of subfeature representations of situations in the world. Using this set of descriptions, it is possible to create a theory of natural kind and a new set of subfeatures, Td = F, where each of the elements of a natural class is modeled as the emergent combination of subfeatures. The natural class is initially modeled as being isomorphic to the set A.

The theory of natural kind is specified as a set, Rd, of inference rules for determining the meaningfulness of synthetic concepts, as well as the logico-transformation rules governing how referential objects are formed in an external world. A theory of natural kind is a deep result that can be appreciated by examining the work of M. Zabezhailo and V. Finn on structural pharmacology (Zabezhailo et al, 1995) or the work of M. Mikheyenkova and V. Finn on modeling social collectives (Mikheyenkova, 1995). The logico-transformation rules are a meta-formalism that can be combined with the theory of plausible reasoning as developed by Finn.

Note that the logico-transformation rules are not part of any formal model. Logico-transformation rules play an important role in moving from a single formal model into a more powerful semiotic model where transition between formal models will be allowed. Logico-transformation rules are intended to explain why a situation would arise as an example of a natural kind.

The semantic rules, R, are a surface result that provides a pragmatic way to delineate all, or most, of the natural kinds in a situation. Our strategic plan is to apply R to build a classification engine for knowledge-based document management.