
Chapter 9

Similarity Analysis and the Mosaic Effect


A preliminary theory and notation for similarity analysis is outlined with application to the reduction of a mosaic effect observed in declassified collections.


Definition of mosaic effect

The syntactical mosaic effect occurs when structural parts of a single image or text unit are separated into disjoint parts, each part judged not to have a certain piece of information but where the combination of two or more of these units is judged to reveal this information.

The semantic mosaic effect occurs when structural parts of a single image or text unit are separated into perhaps overlapping parts. Each part is judged not to imply a certain concept but the combination of two or more of these units is judged to support the inference of this concept.
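
The syntactical case can be made concrete with a small sketch. It assumes, purely for illustration, that each released unit and the protected information can be modeled as sets of attribute labels; the function names and the sample labels are hypothetical and do not come from any declassification system.

```python
from itertools import combinations

def reveals(parts, secret):
    """True if the union of the given parts contains every item of the secret."""
    combined = set().union(*parts)
    return secret <= combined

def mosaic_combinations(units, secret, max_size=3):
    """Return index tuples of units that jointly reveal the secret,
    although no single unit in the tuple reveals it alone."""
    hits = []
    for size in range(2, max_size + 1):
        for idx in combinations(range(len(units)), size):
            group = [units[i] for i in idx]
            each_safe = all(not reveals([u], secret) for u in group)
            if each_safe and reveals(group, secret):
                hits.append(idx)
    return hits

# Hypothetical released units: each is judged safe on its own.
units = [{"city", "year"}, {"employer"}, {"year", "alias"}]
secret = {"city", "employer", "alias"}
hits = mosaic_combinations(units, secret)
```

Here no pair of units covers the protected set, but the three units together do, so only the triple is reported. The semantic case would need a judgment of inference rather than set containment, which this sketch does not attempt.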

As a general rule, an increase in the quality of similarity analysis causes a decrease in the occurrence of the mosaic effect.

This relationship, between the quality of similarity analysis and the occurrences of mosaic effects, is one of three fundamental relationships that argue for a change in the nature of the discussion about computer-mediated knowledge management and the use of computers to provide situational analysis. Its application to the management of classification by the Federal Government would make tractable the tasks mandated by Executive Order 12958.

Additional fundamental relationships open the possibility of defining notational systems that provide a control language for the declassification process, or more generally for a new "horizontal" information technology supporting "conceptual checkers". These new tools are applied to text and are based on the identification and interpretation of conceptual substructures referenced by the text. Conceptual checking will work like a spell checker or a grammar checker.

The relationships between concepts, their similarities and dissimilarities, are essential to such a technology. However, context limits the scope of notation in a concept space, and thus it is essential to account for context in the notation. The interplay between substructural similarities and properties of interpretation is thus to be seen as an evolutionary process that can be controlled via an annotation language such as the one developed in Tonfoni (1999).

Similarity and inference

There is a fundamental relationship between similarity analysis in text and images and the mosaic effect seen in declassification releases. The effect reveals information that, when gathered together, supports inferences and provides strong evidence that certain things are true.

For example, an agent may feel that a spy is working in a certain area. A mosaic effect might provide sufficient evidence from declassified material for conjecturing the exact identity of the spy. Whereas the declassified material does not explicitly identify the spy, the material does lead to an identification that might not have otherwise occurred.

At the heart of the mosaic effect there are different types of similarity.

The type of similarity that we address below is a relationship between the causes of, or the properties of, a class of two or more situations. This type of similarity depends on a part-to-whole relationship that is exploited in various bi-level voting procedures and is discussed in various literatures, including the work by D. Hofstadter and his students (Hofstadter, 1995). The voting procedure was developed, by the author, from an interpretation of J. S. Mill’s logic and a Russian cybernetic system (described in various chapters of this book).

The following are some preliminary notes on the grounding of a generalization of the Duplicate Document Detection (D3) formalism (Prueitt, 1999, Chapter 11). They show the importance of both connectionist and evolutionary programming and the paradigms that support their analysis. The development of the notation should proceed with the peer review and contributions of several scholars in related areas of research.

1.2: Conceptual foundation, based on bi-level mathematics

Let S2 be the sign system for all basins of attraction for an "integrated" system of oscillating point sources of magnetic flux (Kowalski et al., 1988). These basins of attraction are a structural cause for a number of system behaviors, such as when a group of source points comes to share the same phase of oscillation. Phase locking is called entrainment and is a manifestation of events where two or more point sources act together as an integrated whole. The sign system S2 is made specific with a one-to-one correspondence between symbols and basins.

The emergent phenomena are basins of attraction in a manifold that develops either in simulation or in physical reality. In one class of simulated systems, the basins of attraction of a simple layered artificial neural network represent emergent phenomena.

However, a model of weakly coupled oscillators is a clearer model of bi-level computation since the basic elements are each something like a physical pendulum. Linkage relationships, between point sources, can be specified in either the physical systems or in the simulations. Systems of weakly coupled oscillators give us a means to verify hypotheses about physical systems with emergent properties.
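
Entrainment of this kind can be explored in a few lines of simulation. The sketch below uses the Kuramoto model, a standard textbook model of coupled phase oscillators; it is swapped in here for illustration and is not claimed to be the system studied by Kowalski et al. The frequencies, starting phases, coupling strength, and step size are all assumed values.

```python
import math

def simulate(freqs, phases, coupling, dt=0.01, steps=5000):
    """Euler integration of the Kuramoto model:
    d(theta_i)/dt = w_i + (K / N) * sum_j sin(theta_j - theta_i)."""
    n = len(freqs)
    theta = list(phases)
    for _ in range(steps):
        new = []
        for i in range(n):
            pull = sum(math.sin(theta[j] - theta[i]) for j in range(n))
            new.append(theta[i] + dt * (freqs[i] + coupling * pull / n))
        theta = new
    return theta

def coherence(theta):
    """Order parameter r in [0, 1]; r near 1 means the phases are locked."""
    n = len(theta)
    re = sum(math.cos(t) for t in theta) / n
    im = sum(math.sin(t) for t in theta) / n
    return math.hypot(re, im)

# Assumed point sources: slightly different natural frequencies.
freqs = [1.0, 1.1, 0.9, 1.05]
start = [0.0, 0.5, 1.0, 2.0]
uncoupled = coherence(simulate(freqs, start, coupling=0.0))
coupled = coherence(simulate(freqs, start, coupling=2.0))
```

With coupling switched off, the phases drift apart according to their own frequencies; with coupling on, the coherence measure should approach 1, which is the phase locking described above. Linkage between specific pairs of sources could be modeled by replacing the single coupling constant with a matrix.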

The mosaic effect can be studied in these simulations, since a conjecture about a fact corresponds to the formation of a basin signed by the sign system S2. The related theories on deduction and induction may be grounded in neurophysiology and thus have long-term validity as a motivation for basic research on text and image understanding.

Let S1 be the sign system for the set of point sources. These point sources provide a model of a simple type of substructure. The bi-level model considers the elements, modeled by the notation in S2, to be an aggregation of elements from the sign system S1. However, the "entanglement" of the elements of cognition, memory substructure, is not nearly so dominant as the bi-level model suggests.

One finds that removing elements from S1 does not modify only one or a few basins, but rather modifies many or all of them. The "entanglement" in this model is not natural. The bi-level model is not powerful enough. A "top-down" environmental level is needed to establish context.

In a bi-level model, the sign system S2 represents the "ultrastructure" of the emergent manifold, but this ultrastructure does not have an ontological referent outside of the happenstance of the manifold. It is merely an artifact of the binding of the elements of substructure into one whole. We need a top-down fitness function.

The fitness functions, in evolutionary programming, cause the configuration of basic elements to evolve towards some implicit representation of an ecosystem. With the bi-level model, we see that a fitness function is indeed defined, but in an ad hoc fashion. The fitness function fits the basin without putting pressure on the basin to change. There is no "action-perception" cycle.

The evolutionary, or adaptive, pressure must come from a measurement process that actually is open to the complex nature of the environment. In the case of declassification annotation and judgments, the required openness is to the cognitive processing of the analyst.

In the bi-level model, each emergent phenomenon is not co-selected by substructure and ecosystem affordance, as is the case in natural systems having a specific chemical distribution in an environment, or a set of behaviors in a real behavioral space. We need a third level that supplies context and puts pressure on basins to change.

The way to extend the bi-level model to a tri-level model is developed in C. S. Peirce’s "Unifying Logical Vision" (ULV), as interpreted by modern research on situational logics, knowledge representation and neuropsychology.

Unifying Logical Vision (ULV)

The ULV is stated in the following way:

"Concepts are like chemical compounds, they are composed of atoms"

This vision was instrumental in the development, by D. Pospelov and V. Finn (1970-1995), of the theory of situational control. The issue addressed by this Russian research team was the role of environmental factors in determining which of many possible basins are actually manifest in a natural system at any one time. This role introduces a third level to the organizational stratification of the theory.

These levels are "real" levels separated by "gaps", not the hierarchical levels seen in subsumption graphs and tree data structures.

For example, since the third level of analysis is about things that are at a higher time scale, the environmental role is seen in incomplete sets of rules of behavior rather than all at one time.

The properties and relationships between basins, insofar as they are understood, are represented in the situational logic produced for the sign system. The Russian logic has the form of five notational languages, two inner languages and three outer languages, that progressively build up a situational logic having special properties related to an openness to change in the axioms and inference rules. In our interpretation of the Russian work, the two inner and three outer languages are with respect to a gap that necessarily separates structure from function in natural systems. It is a gap that separates S1 from S2.

The Tri-level notation

After the grounding of similarity analysis in the previous section, we now develop a specific notation for similarity analysis in document and image collections.

The relationship between this notation and the 4 by 4 D3 formalism [4] is one of generalization from four absolute levels { page segment, page, document, collection} to three relative levels,

{ substructure, middle, contextual }.

If one considers the four levels of the D3 formalism, one sees that each of the middle two levels lies between two other levels. In this case, any analysis of similarity has three levels. For example, the page’s substructure is non-overlapping page segments, and the page’s context is held within the document. Page segments can be given an n-gram substructure, but this was not done in the 4 by 4 D3 formalism.
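
The move from absolute to relative levels can be sketched as a sliding window over an ordered list of levels. The function name and the dictionary keys below are illustrative choices, not part of the D3 formalism.

```python
# The four absolute levels of the D3 formalism, ordered from finest to coarsest.
D3_LEVELS = ["page segment", "page", "document", "collection"]

def tri_level_views(levels):
    """Return one (substructure, middle, contextual) view for every level
    that has a neighbouring level on both sides."""
    return [
        {
            "substructure": levels[i - 1],
            "middle": levels[i],
            "contextual": levels[i + 1],
        }
        for i in range(1, len(levels) - 1)
    ]

views = tri_level_views(D3_LEVELS)

# Adding an n-gram level below the page segment gives the segment
# its own substructure, and so a third relative view.
extended = tri_level_views(["n-gram"] + D3_LEVELS)
```

The four absolute levels yield two relative views, one centered on the page and one on the document. Prepending an n-gram level adds a view centered on the page segment.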

Four types of similarity metrics, {exact, near-exact, non-near exact but similar, different}, are also generalized to a class of N similarity metrics,

{exact, near-exact,  similar through relationship r, different}.

The cardinality of this class of metrics depends on the number of relationships, { r }, that are active in the collection. We suggest that this number is not constant from context to context, and that this fact represents a primary obstacle that has not been addressed in any of the large-scale projects. Since a shift in context requires a shift in time, the formalism is called the N(t) by 3 SA formalism.
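
A minimal sketch of such a class of metrics follows. It assumes that text units are plain strings, that near-exact similarity can be approximated by a character-level ratio from Python's standard difflib module with an arbitrary threshold of 0.9, and that each active relationship r is supplied as a named predicate. All three are assumptions made for illustration, not part of the formalism.

```python
from difflib import SequenceMatcher

def classify(a, b, relations, near_threshold=0.9):
    """Place a pair of text units into one of the N similarity classes."""
    if a == b:
        return "exact"
    if SequenceMatcher(None, a, b).ratio() >= near_threshold:
        return "near-exact"
    # 'relations' holds the relationships active at the current time t.
    for name, holds in relations.items():
        if holds(a, b):
            return f"similar through {name}"
    return "different"

def metric_count(relations):
    """N(t): exact, near-exact and different, plus one class per active r."""
    return 3 + len(relations)

# Hypothetical relationship active in this context.
relations = {"same first word": lambda a, b: a.split()[:1] == b.split()[:1]}

results = [
    classify("alpha report", "alpha report", relations),
    classify("alpha report 1999", "alpha report 1998", relations),
    classify("alpha report", "alpha memo on budget", relations),
    classify("alpha report", "zeta memo", relations),
]
```

Because the dictionary of relationships can change from one context to the next, the value returned by metric_count changes with it, which is the sense in which N depends on t.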

On the issue of descriptive enumeration and relations


Let

A = { (a, r, b) }

be the set of active relationships in S1 at time t0. These active relationships are determined empirically to be some subset of all potential relationships

P = { (a, r, b) }.

Potential relationships are also to be determined empirically. The requirement that relationships be determined empirically is a harsh one, but one required by the nature of the phenomenon.

A method for the determination of sets of relationships can be advanced, and thus the harsh requirement of empirical determination of all possible relationships becomes tractable. This method is descriptive enumeration.

The set of active prototypes in substructure can be determined by a descriptive enumeration of invariance. By invariance we mean those patterns that form equivalence classes seen from the perspective of the middle level. From descriptive enumeration, each invariance is assigned a symbol. As this set is being developed, it is possible to consider all subsets of size 2. Each of these subsets {a, b} defines a certain number of potential relationships (a, r, b). These relationships are identified through a process of descriptive enumeration, this time about the class of relations.
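
The bookkeeping behind this enumeration can be sketched as follows. The sketch assumes that invariances have already been assigned symbols, that a list of candidate relation names has been enumerated by a human reviewer, and that pairs are treated as unordered, which is a simplification. The symbols and relation names are hypothetical.

```python
from itertools import combinations

def potential_relationships(symbols, relation_names):
    """P: every candidate triple (a, r, b) over the subsets {a, b} of size 2."""
    return {
        (a, r, b)
        for a, b in combinations(sorted(symbols), 2)
        for r in relation_names
    }

def active_relationships(potential, observed):
    """A: the candidates that were actually observed, so A is a subset of P."""
    return {triple for triple in potential if triple in observed}

# Hypothetical symbols and relation names from descriptive enumeration.
symbols = {"s1", "s2", "s3"}
relation_names = ["co-occurs", "precedes"]
P = potential_relationships(symbols, relation_names)

# Observations marked by a reviewer; the second refers to an unknown symbol
# and is therefore not a member of P.
observed = {("s1", "co-occurs", "s2"), ("s2", "precedes", "s9")}
A = active_relationships(P, observed)
```

Three symbols give three subsets of size 2, and two relation names give six potential triples. Only the observation that falls inside P is retained as active.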

The process of descriptive enumeration requires strong support from human decision makers, since any formal method relying entirely on computational process is likely to be intractable.

Active relationships, between substructure, are instantiated in the context of building an ensemble at the next higher level of organization. Of the three levels, this ensemble level is the middle level. The formation of judgement is normally about the objects and relations in a middle level.

The middle level is defined by a set of temporal invariance; e.g., objects having permanence in some interval of time, interacting with each other. Substructure and context do not interact with ensembles, by definition. In fact, a "level" is properly defined to be the whole class of all objects that interact with each other. From this definition, it follows that each level is separated from the others by a "gap". The complex systems research community calls this separation an "epistemic" gap. Physicists call a class of such gaps Heisenberg gaps, and mind-body philosophers call a certain type of gap a Cartesian gap. In each case, it is commonly thought that it is not possible to fully formalize the gap’s nature.
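
One possible reading of this definition treats interaction as a symmetric relation and a level as a maximal group of objects connected through it. The sketch below follows that reading only; the transitive grouping, the function name, and the sample objects are assumptions, and the text above does not commit to them.

```python
def levels_from_interactions(objects, interactions):
    """Group objects into levels: two objects share a level when a chain
    of pairwise interactions connects them."""
    neighbours = {o: set() for o in objects}
    for a, b in interactions:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, levels = set(), []
    for start in objects:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from 'start'.
        level, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in level:
                continue
            level.add(node)
            stack.extend(neighbours[node] - level)
        seen |= level
        levels.append(level)
    return levels

# Hypothetical objects: atoms interact with atoms, compounds with compounds.
objects = ["atom1", "atom2", "compound1", "compound2"]
interactions = [("atom1", "atom2"), ("compound1", "compound2")]
levels = levels_from_interactions(objects, interactions)
```

No interaction crosses between the two groups, so they come out as separate levels. The absence of any connecting pair is the computational counterpart of the gap.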

Seen from the perspective of the middle level, substructure is a statistical artifact where individual substructure invariance is treated as a member of a prototype class. This principle is illustrated by the regard that a chemical compound has for an individual atom, or a factory has for an individual worker. It may be appealing to say substructure has the relationship of "is a part of" to the ensemble. However, this use of language is simply not correct. There is an incompleteness of description that cannot easily be overcome.

Likewise, context has only incomplete relationships to ensembles since context is only partially a function of the environment. Again, we have an incomplete description, but of a different kind. Now the incompleteness has to do with the waiting time required for patterns to complete. Sometimes the incompleteness of description is not a problem, but at other times it leads to errors.

The definition of level comes from an appeal to the physics of complex systems, particularly the physics of quantum events. However, the model of three levels {substructure, middle, contextual} is relative. Any specific substructure level might be a middle level seen from a different perspective and the same may be said for the level of context.

Temporal dimension

Each "real" object has a period of consistency where the object maintains its "temporal invariance". In consideration of the properties of a level, the temporal dimension is essential. The formation event introduces a new object within a level that existed prior to the emergence. Once created, the object has a stable existence before suddenly losing its cohesiveness.

Once formed, the whole may be modified by new internal emergence. However, the level shapes the emergence and thus the temporal state of the level is reflected in the properties that the new object has.

There is an assumption that the set of all objects that have an active relationship is invariant over some period of time starting at t0 and lasting until t1.

Let t0 be the time when an ensemble is first formed. We need the sign system, S2, to have a one-to-one correspondence to those objects in the level that have active relationships to the new ensemble.

The selected set, S2, is

{ o1, o2, o3, . . . , on},

over the period from t0 to t1. Each of the objects in this set has a set of properties { p } and relationships { r } to other objects. These properties and relationships are to be discovered.

Tri-level notation

The tri-level notation must account for three issues.

1) The first issue is the emergence of a level, or of a new object into an existing level, that is seen as an interaction between substructure and context.

2) The second issue has to do with the entanglement of substructural invariance in the objects of a level.

3) The third issue has to do with the interpretation of signs as referential to specific concepts.

These three issues are treated in the following three subsections.

The emergent manifold and its invariance

Intellectually, the problem we face is explaining how levels come into existence in the first place. In section 1, we grounded our discussion of similarity in a model of weakly coupled oscillating point sources. This model has several unique qualities. First, it is possible to perform experiments with either a physical apparatus or within a computer simulation. Second, the bi-level nature is illustrative of the ensemble behavior of neural network models or of evolutionary programming techniques such as genetic algorithms. Thus we have several ways to motivate a deep and empirically grounded discussion about the emergence of a level.

If our model were only bi-level then there would be no level, only an emergent "object". However, any object implicitly defines a level as the set of all objects that it interacts with. In many formal theories, the emergent object is considered alone, without taking into account the implicitly defined level that forms the "full" environment of the object. We feel that this practice is motivated by the avoidance of "necessary" logical paradox.

Sometimes this paradox can be stated: "there are things that do not exist". In particular, the paradox comes up when one talks about the existence of memory when the memory is not actually, at that moment, the contents of an awareness state.

We use the term ‘full environment’ here to remind us that the level is not statically defined and has a temporal dimension.

Entanglement of substructural invariance in the objects of a level

Perhaps the best example of entanglement is seen when we attempt to represent the referent concepts signed by text or discourse. The referential boundary of a concept is just not as crisp as classical logic would have us believe. It is not always possible to say that a concept is either present or not present in a passage of text.

One reason for non-crisp delineation comes from the nature of physical mechanisms that produce mental images. These mechanisms may operate in a three-tiered modality, again due to the underlying qualities producing physical stratification of temporal processes into three levels with gaps.

Interpretation of signs as referential to specific concepts

Presumably concepts were occurring within the mental images of humans before natural language came into its modern form. One imagines that specific types of behavior were coincident with the presence of a concept. Later, these specific types of behavior were part of the substance of sign systems and then language systems.

Interpretation of signs then became the basis for communication of information. The tri-level model of complex processes, including concept formation, is thus seen in the light that Peirce and Popper used in their analysis.

For Peirce the three levels correspond to an interpretation within context, the sign system itself as the middle level, and the objects of the world as the substructure.

For Popper the three levels are a subjective experience of reality in a present moment context, the collective knowledge that has come to exist as a historical heritage, and the objects of the world.


The author has described the minimal complexity required to understand a technical solution to the mosaic effect in large bulk declassifications. It is argued elsewhere that this minimal complexity is required if the Nation is to acquire the technical capability to manage the national secrets in accordance with statute and with the Constitution.