Chapter 9
Similarity Analysis and the Mosaic Effect
Introduction
A preliminary
theory and notation for similarity analysis is outlined with application to the
reduction of a mosaic effect observed in declassified collections.
Definition of
mosaic effect
The syntactical
mosaic effect occurs when structural parts of a single image or text unit are
separated into disjoint parts, each part judged not to have a certain piece of
information but where the combination of two or more of these units is judged
to reveal this information.
The semantic mosaic
effect occurs when structural parts of a single image or text unit are
separated into perhaps overlapping parts. Each part is judged not to imply a
certain concept but the combination of two or more of these units is judged to
support the inference of this concept.
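The syntactical definition can be made concrete with a small sketch. It assumes, purely for illustration, that each released unit can be modeled as a set of atomic facts and that the protected information is revealed exactly when a given set of facts is jointly available. The fact names and the helper mosaic_combinations are hypothetical and are not part of the formalism developed in this chapter.

```python
from itertools import combinations

def reveals(facts, protected):
    """A collection of facts reveals the protected information
    when it contains every fact the inference needs."""
    return protected <= facts

def mosaic_combinations(units, protected):
    """Return groups of units (by index, smallest groups first) that are
    individually safe but jointly reveal the protected information."""
    hits = []
    for size in range(2, len(units) + 1):
        for group in combinations(range(len(units)), size):
            if any(reveals(units[i], protected) for i in group):
                continue  # a unit that leaks on its own is not a mosaic case
            combined = set().union(*(units[i] for i in group))
            if reveals(combined, protected):
                hits.append(group)
    return hits

# Hypothetical released units, each judged safe in isolation.
units = [
    {"city", "year"},
    {"employer"},
    {"year", "travel_route"},
]
# Hypothetical set of facts that together support the protected inference.
protected = {"city", "employer", "travel_route"}

print(mosaic_combinations(units, protected))  # → [(0, 1, 2)]
```

In this toy case no single unit and no pair of units reveals the protected information; only the combination of all three does.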
As a general rule,
an increase in the quality of similarity analysis causes a decrease in the
occurrence of mosaic effects.
This relationship,
between the quality of similarity analysis and the occurrences of mosaic
effects, is one of three fundamental relationships that argue for a change in
the nature of the discussion about computer mediated knowledge management and
the use of computers to provide situational analysis. Its application to the
management of classification by the Federal Government would make tractable
tasks mandated by Executive Order 12958.
Additional
fundamental relationships open the possibility of defining notational systems
that provide a control language for the declassification process, or more
generally for a new "horizontal" information technology supporting
"conceptual checkers". These new tools are applied to text and are
based on the identification and interpretation of conceptual substructures
referenced by the text. Conceptual checking will work like a spell checker or a
grammar checker.
The relationships
between concepts, their similarities and dissimilarities, are essential to such
a technology. However, context limits the scope of notation in a concept space,
and thus it is essential to account for context in the notation. The interplay
between substructural similarities and properties of interpretation is thus to
be seen as an evolutionary process that can be controlled via an annotation
language such as the one developed in Tonfoni (1999).
Similarity and
inference
There is a
fundamental relationship between similarity analysis in text and images and the
mosaic effect seen in declassification releases. The effect reveals information
that, when gathered together, supports inferences and provides strong evidence that
certain things are true.
For example, an
agent may feel that a spy is working in a certain area. A mosaic effect might
provide sufficient evidence from declassified material for conjecturing the
exact identity of the spy. Whereas the declassified material does not
explicitly identify the spy, the material does lead to an identification that
might not have otherwise occurred.
At the heart of the
mosaic effect there are different types of similarity.
The type of
similarity that we address below is a relationship between causes of, or the
properties of, a class of two or more situations. This type of similarity
depends on a part-to-whole relationship that is exploited in various bi-level
voting procedures and is discussed in various literatures (including the work
by D. Hofstadter and his students) (Hofstadter, 1995). The voting procedure was
developed by the author from an interpretation of J. S. Mill’s logic and a
Russian cybernetic system (described in various chapters of this book).
The following are
some preliminary notes on the grounding of a generalization of the Duplicate
Document Detection (D3) formalism (Prueitt, 1999, Chapter 11). It shows the
importance of both connectionist and evolutionary programming and the paradigms
that support their analysis. The development of the notation should proceed
with the peer review and contributions of several scholars in related areas of
research.
1.2: Conceptual
foundation, based on bi-level mathematics
Let S2 be the sign
system for all basins of attraction for an "integrated" system of
oscillating point sources of magnetic flux (Kowalski et al, 1988). These basins
of attraction are a structural cause for a number of system behaviors such as
when a group of source points comes to share the same phase of oscillation.
Phase locking is called entrainment and is a manifestation of events where two
or more point sources act together as an integrated whole. The sign system S2
is made specific with a one to one correspondence between symbols and basins.
The emergent
phenomena are basins of attraction in a manifold that develops either in
simulation or in physical reality. In one class of simulated systems, the
basins of attraction of a simple layered artificial neural network represent
emergent phenomena.
However, a model of
weakly coupled oscillators is a clearer model of bi-level computation since the
basic elements are each something like a physical pendulum. Linkage
relationships, between point sources, can be specified in either the physical
systems or in the simulations. Systems of weakly coupled oscillators give us a
means to verify hypotheses about physical systems with emergent properties.
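Entrainment of this kind can be illustrated with a short simulation. The sketch below uses the Kuramoto model as a stand-in for a system of weakly coupled oscillators. That choice, along with the coupling strength, frequencies, and step size, is an assumption made for illustration and is not the apparatus of Kowalski et al. The order parameter is close to 1 when the point sources share a common phase.

```python
import cmath
import math

def order_parameter(phases):
    """Magnitude of the mean phase vector: values near 1.0 indicate phase locking."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))

def simulate(phases, freqs, coupling, dt=0.01, steps=5000):
    """Euler integration of
    d(theta_i)/dt = w_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(phases)
    phases = list(phases)
    for _ in range(steps):
        rates = [
            freqs[i] + (coupling / n) * sum(math.sin(pj - phases[i]) for pj in phases)
            for i in range(n)
        ]
        phases = [p + dt * rate for p, rate in zip(phases, rates)]
    return phases

start = [0.0, 1.0, 2.0, 2.5]      # initial phases in radians (arbitrary)
freqs = [0.95, 1.0, 1.05, 1.0]    # slightly different natural frequencies

uncoupled = simulate(start, freqs, coupling=0.0)
coupled = simulate(start, freqs, coupling=1.0)

print("uncoupled:", round(order_parameter(uncoupled), 3))
print("coupled:  ", round(order_parameter(coupled), 3))
```

With the linkage switched off the point sources drift independently; with it switched on they settle into a shared phase, which is the integrated behavior that a symbol in S2 is meant to sign.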
The mosaic effect
can be studied in these simulations, since a conjecture about a fact
corresponds to the formation of a basin signed by the sign system S2. The
related theories on deduction and induction may be grounded in neurophysiology
and thus have long-term validity as a motivation for basic research on text and
image understanding.
Let S1 be the sign
system for the set of point sources. These point sources provide a model of a
simple type of substructure. The bi-level model considers the elements, modeled
by the notation in S2, to be an aggregation of elements from the sign system
S1. However, the "entanglement" of the elements of cognition, memory
substructure, is not nearly so dominant as the bi-level model suggests.
One finds that
removing elements from S1 does not modify only one or a few basins – but rather
modifies many or all of them. The "entanglement" in this model is not
natural. The bi-level model is not powerful enough. A "top-down"
environmental level is needed to establish context.
In a bi-level
model, the sign system S2 represents the "ultrastructure" of the
emergent manifold, but this ultrastructure does not have an ontological referent
outside of the happenstance of the manifold. It is merely an artifact of the
binding of the elements of substructure into one whole. We need a top down
fitness function.
The fitness
functions, in evolutionary programming, cause the configuration of basic
elements to evolve towards some implicit representation of an ecosystem. With
the bi-level model, we see that a fitness function is indeed defined, but in an
ad hoc fashion. The fitness function fits the basin – without putting pressure
on the basin to change. There is no "action-perception" cycle.
The evolutionary,
or adaptive, pressure must come from a measurement process that actually is
open to the complex nature of the environment. In the case of declassification
annotation and judgments, the required openness is to the cognitive processing
of the analyst.
In the bi-level
model, each emergent phenomenon is not co-selected by substructure and
ecosystem affordance, as is the case in natural systems having a specific
chemical distribution in an environment, or a set of behaviors in a real
behavioral space. We need a third level that supplies context and puts
pressure on basins to change.
The way to extend
the bi-level model to a tri-level model is developed in C. S. Peirce’s
"Unifying Logical Vision" (ULV), as interpreted by modern research on
situational logics, knowledge representation and neuropsychology.
Unifying Logical
Vision (ULV)
The ULV is stated
in the following way:
"Concepts are like chemical
compounds, they are composed of atoms"
This vision was
instrumental in the development, by D. Pospelov and V. Finn (1970-1995), of the
theory of situational control. The issue addressed by this Russian research
team was the role of environmental factors in determining which of many
possible basins are actually manifest in a natural system at any one time. This
role introduces a third level to the organizational stratification of the
theory.
These levels are
"real" levels separated by "gaps", not the hierarchical
levels seen in subsumption graphs and tree data structures.
For example, since
the third level of analysis is about things that are at a higher time scale,
the environmental role is seen in incomplete sets of rules of behavior rather
than all at one time.
The properties and
relationships between basins, in so far as they are understood, are represented
in the situational logic produced for the sign system. The Russian logic has
the form of five notational languages, two inner languages and three outer
languages that progressively build up the situational logic having special
properties related to an openness to change in the axioms and inference rules.
In our interpretation of the Russian work, the two inner and three outer
languages are with respect to a gap that necessarily separates structure from
function in natural systems. It is a gap that separates S1 from S2.
The Tri-level
notation
After the grounding
of similarity analysis in the previous section, we now develop a specific
notation for similarity analysis in document and image collections.
The relationship
between this notation and the 4 by 4 D3 formalism [4] is one of generalization
from four absolute levels { page segment, page, document, collection} to three
relative levels,
{ substructure, middle, contextual }.
If one considers
the four levels of the D3 formalism, one sees that each of the middle two levels
lies between two other levels. In this case, any analysis of similarity has three
levels. For example, the page’s substructure is non-overlapping page segments,
and the page’s context is held within the document. Page segments can be given
an n-gram substructure, but this was not done in the 4 by 4 D3 formalism.
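The step from four absolute levels to three relative levels can be sketched as follows. The function name relative_views and the dictionary layout are assumptions introduced only for this illustration.

```python
# The four absolute levels of the D3 formalism, from finest to coarsest.
D3_LEVELS = ["page segment", "page", "document", "collection"]

def relative_views(levels):
    """For every level that has a neighbor on both sides, return the
    (substructure, middle, contextual) view seen from that level."""
    return [
        {
            "substructure": levels[i - 1],
            "middle": levels[i],
            "contextual": levels[i + 1],
        }
        for i in range(1, len(levels) - 1)
    ]

for view in relative_views(D3_LEVELS):
    print(view)
```

Prepending a finer level such as "n-gram" to the list would give the page segment its own three-level view without changing the function.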
Four types of
similarity metrics, {exact, near-exact, non-near exact but similar, different},
are also generalized to a class of N similarity metrics,
{exact, near-exact, similar through relationship r, different}.
The cardinality of
this class of metrics depends on the number of relationships, { r }, that are
active in the collection. We suggest that this number is not constant from
context to context, and that this fact represents a primary obstacle that has
not been addressed in any of the large-scale projects. Since a shift in context
requires a shift in time, the formalism is called the N(t) by 3 SA
formalism.
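A minimal sketch of such a family of similarity classes is given below. It assumes that near-exact similarity can be approximated by character n-gram overlap with a fixed threshold, and that each relationship r is supplied as a predicate on a pair of text units. The threshold, the predicate shares_name, and the example sentences are hypothetical. Because the dictionary of relationships is an argument, the number of classes can differ from one context to the next.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a whitespace-normalized, lowercased text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard overlap of two sets; 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def classify(x, y, relations, near_threshold=0.8):
    """Assign a pair of text units to one similarity class.

    `relations` maps a relationship name r to a predicate on (x, y) and
    stands for the relationships active in the current context.
    """
    if x == y:
        return "exact"
    if jaccard(ngrams(x), ngrams(y)) >= near_threshold:
        return "near-exact"
    for name, holds in relations.items():
        if holds(x, y):
            return f"similar through {name}"
    return "different"

def capitalized(text):
    """Crude stand-in for name detection: the capitalized tokens of a text."""
    return {w.strip(".,") for w in text.split() if w[:1].isupper()}

def shares_name(x, y):
    """Hypothetical relationship: both units mention the same capitalized token."""
    return bool(capitalized(x) & capitalized(y))

relations_at_t0 = {"shared-name": shares_name}
a = "The cable was sent from Lisbon."

print(classify(a, a, relations_at_t0))
print(classify(a, "The cable was  sent from Lisbon", relations_at_t0))
print(classify(a, "Funds arrived in Lisbon on Monday.", relations_at_t0))
print(classify("The cable was sent.", "no names appear here", relations_at_t0))
```

Passing a different dictionary at a later time changes the number of classes, which is the sense in which the count N depends on t.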
On the issue of
descriptive enumeration and relations
Let
A = { (a, r, b) }
be the set of
active relationships in S1 at time t0. These active relationships
are determined empirically to be some subset of all potential relationships
P = { (a, r, b) }.
Potential
relationships are also to be determined empirically. The requirement that
relationships be determined empirically is a harsh one, but one required by the
nature of the phenomenon.
A method for
determination of sets of relationships can be advanced and thus the harshness
of requiring empirical determination of all possible relationships is
tractable. This method is descriptive enumeration.
The set of active
prototypes in substructure can be determined by a descriptive enumeration of
invariance. By invariance we mean those patterns that form equivalence classes
seen from the perspective of the middle level. From descriptive enumeration,
each invariance is assigned a symbol. As this set is being developed, it is
possible to consider all subsets of size 2. Each of these subsets {a, b} defines
a certain number of potential relationships (a,r,b). These relationships are
identified through a process of descriptive enumeration, this time about the
class of relations.
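The bookkeeping behind descriptive enumeration can be sketched in a few lines. The sketch assumes that the symbols and relation names have already been produced by human judgment; it only forms the potential set P from all subsets of size 2 and then selects the active set A from recorded observations. The symbol names, relation names, and observations shown are hypothetical.

```python
from itertools import combinations

def potential_relationships(symbols, relation_names):
    """P: every triple (a, r, b) over the unordered pairs {a, b} of symbols."""
    return {
        (a, r, b)
        for a, b in combinations(sorted(symbols), 2)
        for r in relation_names
    }

def active_relationships(potential, observed):
    """A: the subset of P confirmed by observation (here, analyst judgments)."""
    return {triple for triple in potential if triple in observed}

# Hypothetical symbols assigned to invariances, and hypothetical relation names.
symbols = {"s1", "s2", "s3"}
relation_names = {"co-occurs", "same-source"}

P = potential_relationships(symbols, relation_names)

# Judgments that an analyst might record during descriptive enumeration.
observed = {("s1", "co-occurs", "s2"), ("s2", "same-source", "s3")}
A = active_relationships(P, observed)

print(len(P), sorted(A))
```

The size of P grows with the number of pairs times the number of relation names, which is one reason the determination of A is left to empirical observation rather than exhaustive computation.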
The process of
descriptive enumeration requires strong support from human decision makers,
since any formal method relying entirely on computational process is likely to
be intractable.
Active
relationships, between substructure, are instantiated in the context of
building an ensemble at the next higher level of organization. Of the three
levels, this ensemble level is the middle level. The formation of judgement is
normally about the objects and relations in a middle level.
The middle level is
defined by a set of temporal invariance; e.g., objects having permanence in
some interval of time, interacting with each other. Substructure and context do
not interact with ensembles, by definition. In fact, a "level" is
properly defined to be the whole class of all objects that interact with each
other. From this definition, it follows that each level is separated from the others by a
"gap". The complex systems research community calls this separation an
"epistemic" gap. Physicists call a class of such gaps Heisenberg
gaps, and mind-body philosophers call certain types of gaps Cartesian gaps. In
each case, it is commonly thought that it is not possible to fully formalize
the gap’s nature.
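One possible computational reading of this definition of level is sketched below. It assumes that interaction is symmetric and that a level is closed under chains of interaction, so that levels are the connected classes of an interaction relation. Both assumptions, and the object names, are introduced only for illustration; the sketch says nothing about the nature of the gap itself.

```python
def levels(objects, interactions):
    """Partition objects into classes that are connected by interaction."""
    parent = {o: o for o in objects}

    def find(o):
        # Follow parent links to the class representative (with path halving).
        while parent[o] != o:
            parent[o] = parent[parent[o]]
            o = parent[o]
        return o

    for a, b in interactions:
        parent[find(a)] = find(b)  # merge the two classes

    groups = {}
    for o in objects:
        groups.setdefault(find(o), set()).add(o)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical objects: atoms interact with atoms, compounds with compounds.
objects = ["atom1", "atom2", "compound1", "compound2", "environment"]
interactions = [("atom1", "atom2"), ("compound1", "compound2")]

for level in levels(objects, interactions):
    print(sorted(level))
```

No interaction pair crosses from one class to another, which is how the separation between levels shows up in this toy model.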
Seen from the
perspective of the middle level, substructure is a statistical artifact where
individual substructure invariance is treated as a member of a prototype class.
This principle is illustrated by the regard that a chemical compound has for an
individual atom, or a factory has for an individual worker. It may be appealing
to say substructure has the relationship of "is a part of" to the
ensemble. However, this use of language is simply not correct. There is an
incompleteness of description that cannot easily be overcome.
Likewise, context
has only incomplete relationships to ensembles since context is only partially
a function of the environment. Again, we have an incomplete description, but of
a different kind. Now the incompleteness has to do with the waiting time
required for patterns to complete. Sometimes the incompleteness of description
is not a problem, but at other times it leads to errors.
The definition of
level comes from an appeal to the physics of complex systems, particularly the
physics of quantum events. However, the model of three levels {substructure,
middle, contextual} is relative. Any specific substructure level might be a
middle level seen from a different perspective and the same may be said for the
level of context.
Temporal
dimension
Each
"real" object has a period of consistency where the object maintains
its "temporal invariance". In consideration of the properties of a
level, the temporal dimension is essential. The formation event introduces a
new object within a level that existed prior to the emergence. Once created,
the object has a stable existence before suddenly losing its cohesiveness.
Once formed, the
whole may be modified by new internal emergence. However, the level shapes the
emergence and thus the temporal state of the level is reflected in the
properties that the new object has.
There is an
assumption that the set of all objects that have an active relationship is
invariant over some period of time starting at t0 and lasting until
t1.
Let t0
be when an ensemble is first formed. We need for the sign system, S2, to have a
one to one correspondence to those objects in the level that have active
relationships to the new ensemble.
The selected set,
S2, is
{ o1, o2, o3, . . . , on },
over the period
from t0 to t1. Each of the objects in this set has a set
of properties { p } and relationships { r } to other objects. These properties
and relationships are to be discovered.
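A data structure for the selected set might look like the following sketch. The class and field names are hypothetical; the only features taken from the text are the interval from t0 to t1, the one-to-one correspondence between signs and objects, and the sets { p } and { r } that are initially empty because they are still to be discovered.

```python
from dataclasses import dataclass, field

@dataclass
class LevelObject:
    name: str
    properties: set = field(default_factory=set)      # { p }, to be discovered
    relationships: set = field(default_factory=set)   # { (r, other_name) }, to be discovered

@dataclass
class SelectedSet:
    t0: float
    t1: float
    objects: list

    def sign_system(self):
        """One sign per object: a one-to-one map from symbols to objects."""
        return {f"o{i}": obj.name for i, obj in enumerate(self.objects, start=1)}

    def is_valid_at(self, t):
        """The set is assumed invariant only on the interval [t0, t1]."""
        return self.t0 <= t <= self.t1

s2 = SelectedSet(t0=0.0, t1=10.0, objects=[LevelObject("memo"), LevelObject("cable")])
s2.objects[0].relationships.add(("cites", "cable"))  # a discovered relationship

print(s2.sign_system())
print(s2.is_valid_at(5.0), s2.is_valid_at(11.0))
```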
Tri-level
notation
The tri-level
notation must account for three issues.
1) The first issue is the emergence of a level, or of a new object into
an existing level, that is seen as an interaction between substructure and
context.
2) The second issue has to do with the entanglement of substructural
invariance in the objects of a level.
3) The third issue has to do with the interpretation of signs as
referential to specific concepts.
These three issues
are treated in the following three subsections.
The emergent
manifold and its invariance
Intellectually, the
problem we face is explaining how levels come into existence in the first
place. In section 1, we grounded our discussion of similarity in a model of
weakly coupled oscillating point sources. This model has several unique
qualities. First, it is possible to perform experiments with either a physical
apparatus or within a computer simulation. Second, the bi-level nature is
illustrative of the ensemble behavior of network models of neural networks or
of evolutionary programming like genetic algorithms. Thus we have several ways
to motivate a deep and empirically grounded discussion about the emergence of a
level.
If our model were
only bi-level then there would be no level, only an emergent
"object". However, any object implicitly defines a level as the set
of all objects that it interacts with. In many formal theories, the emergent
object is considered alone – without taking into account the implicitly defined
level that forms the "full" environment of the object. We feel that
this practice is motivated by the avoidance of "necessary" logical
paradox.
Sometimes this
paradox can be stated: "there are things that do not exist". In
particular, the paradox comes up when one talks about the existence of memory
when the memory is not actually, at that moment, the contents of an awareness
state.
We use the term
‘full environment’ here to remind us that the level is not statically defined
and has a temporal dimension.
Entanglement of
substructural invariance in the objects of a level
Perhaps the best
example of entanglement is seen when we attempt to represent the referent
concepts signed by text or discourse. The boundary of a concept referentially
is just not as crisp as classical logic would have us believe. It is not always
possible to say that a concept is either present or not present in a passage of
text.
One reason for
non-crisp delineation comes from the nature of physical mechanisms that produce
mental images. These mechanisms may operate in a three-tiered modality, again
due to the underlying qualities producing physical stratification of temporal
processes into three levels with gaps.
Interpretation
of signs as referential to specific concepts
Presumably concepts
were occurring within the mental images of humans before natural language came
into its modern form. One imagines that specific types of behavior were
coincident with the presence of a concept. Later, these specific types of
behavior were part of the substance of sign systems and then language systems.
Interpretation of
signs then became the basis for communication of information. The tri-level
model of complex processes, including concept formation, is thus seen in the
light that Peirce and Popper used in their analysis.
For Peirce the
three levels correspond to an interpretation within context, the sign system
itself as the middle level, and the objects of the world as the substructure.
For Popper the
three levels are a subjective experience of reality in a present moment
context, the collective knowledge that has come to exist as a historical heritage,
and the objects of the world.
Conclusion
The author has
described the minimal complexity required to understand a technical solution to
the mosaic effect in large bulk declassifications. It is argued elsewhere that
this minimal complexity is required if the Nation is to acquire the
technical capability to manage the national secrets in accordance with statute
and with the Constitution.