Appendix

Description of the Minimal Voting Procedure (MVP)

First Published, in Moscow Russia:  1997

Appendix: Description of the Minimal Voting Procedure (MVP)

To instantiate a voting procedure, we need the following triple < C, O1, O2 > :

·         A set of categories C = { Cq } as defined by a training set O1.

·         A means to produce a document representational set for members of O1.

·         A means to produce a document representational set for members of a test set, O2 .

We assume that we have a training collection O1 with m document passages,

O1 = { d1 , d2 , . . . , dm }

Documents that are not single passages can be substituted here. The notion introduced above can be generalized to replace documents with a more abstract notion of an "object".

Objects

O = { O1 , O2 , . . . , Om }

can be documents, semantic passages that are discontinuously expressed in the text of documents, or other classes of objects, such as electromagnetic events, or the coefficients of spectral transforms.

Some representational procedure is used to compute an "observation" Dr about the semantics of the passages. The subscript r is used to remind us that various types of observations are possible and that each of these may result in a different representational set. For linguistic analysis, each observation produces a set of theme phrases. We use the following notion to indicate this:

Dr : --> { t1 , t2 , . . . , tn }

This notion is read "the observation Dr of the passage di produces the representational set { t1 , t2 , . . . , tn }"

We now combine these passage level representations to form a category representation.

Each "observation", Dr , of the passages in the training set O1 has a "set" of theme phrases

Dr : --> Tk = { t1 , t2 , . . . , tn}

Let A be the union of the individual passage representational sets Tk.

A = Union Tk.

This set A is the representation set for the complete training collection O1 .

The set A can be partitioned, with overlaps, to match the categories to which the passages were assigned. Let T*q be the union of all theme phrase representation sets Tk for all passages that are assigned to the category q.

T*q =  Union Tk  such that, dk, is assigned to the category q.

The category representation set, T*q, is defined for each category number q.

The overlap between category representation T*q, and T*s, is one statistical measure of the "cognitive entanglement" between category q and category s. This fact leads to a method for identifying the minimal intersections of structural features of the category representational sets.

J. S. Mill’s logics relies on the discovery of meaningful subsets of representational elements. The first principles of J S Mill’s argumentation are:

1. that negative evidence should be acquired as well as positive evidence

2. that a bi-level argumentation should involve a decomposition of passages and categories into a set of representational phrases

3. that the comparison of passage and category representation should generalize (provide the grounding for computational induction) from the training set to the test set .

It is assumed that each "observation", Dk, of the test set O2 is composed from a "set" of basic elements, in this case the theme phrases in A. Subsets of the set are composed, or aggregated, into wholes that are meaningful in a context that depends only statistically on the characteristics of basic elements.

This general framework provides for situational reasoning and computational argumentation about natural systems.

For the time being, it is assumed that the set of basic elements is the full phrase representational set

A = Union Tk.

for the training collection O1.

Given the data:

T*q for each C q , q = 1, . . , n

and the representational sets Tk , from the observations Dk, for each passage, dk, from the test set O2, we generate the hypothesis that the observation Dk should be categorized into category q.

This hypothesis will be voted on by using each phrase in the representational set for Dk by making the following inquiries for each element ti of the representational set Tk:

1. does an observation of a passage, Dk, have the property p, where p is the property that this specific representational element, ti , is also a member of the representational set T*q for category q.

2.  does an observation of a passage, Dk, have the property p, where p is the property that this specific representational element, ti , is not a member of the representational set T*q for category q.

Truth of the first inquiry produces a positive vote, from the single passage level representational element, that the passage is in the category.

Truth of the second inquiry produces a negative vote, from the single representational element, that the passage is not in the category. These votes are tallied.

Data structure for recording the votes

For each passage, dk , we define the matrix Ak as a rectangular matrix of size m x h where m is the size of a specific passage representational set Tk, and h is the number of categories. The passages are indexed by k, each passage has it’s own matrix.

Each element ti of Tk, will get to vote for or against the hypothesis that this kth passage should be in the category having the category representational set T*q. Thus Ak is defined by the rule:

ai,j = -1 if the phrase is not in T*q

or

ai,j = 1 if the phrase is in T*q

Matrix Ak is used to store the individual + - votes placed by each agent (i.e., the representational element of the phrase representation of the passage.)

This linear model produces ties for first place, and places a semi-order (having ties for places) on the categories by counting discrete votes for and against the hypothesis that the document is in that category.

A second data structure to record weighted votes

A non-linear (weighted) model uses internal and external weighting to reduce the probability of ties to near zero and to account for structural relationships between themes.

Matrix Bk is defined:

bi,j = ai,j * weight of the phrase in Tk

if the phrase is not in T*q

or

bi,j = ai,j * weight of the phrase in T*q

if the phrase is in T*q

This difference between the two multipliers is necessary and sufficient to break ties resulting from the linear model (matrix Ak).

Data structure to record the results

For each passage representation and each category, the tally is made from the matrix Bk and stored in a matrix C having the same number of records as the size of the document collection, and having h columns – one column for each category.

The information in matrix C is transformed into a matrix D having the same dimension as C. The elements of each row in C are reordered by the tally values. To illustrate, suppose we have only 4 categories and passage 1 tallies {-1214,-835,451,1242} for categories 1, 2, 3 and 4 respectively. So

cat1 --> -1214, cat2 --> -835, cat3 --> 451 and cat4 --> 1242.

By holding these assignments constant and ordering the elements by size of tally we have the permutation of the

ordering ( 1, 2, 3, 4) to the ordering (4, 2, 3, 1).

( 1, 2, 3, 4) -->  ( 4, 2, 3, 1).

This results show that for passage 1, the first place placement is category 4, the second place is category 2, etc. The matrix D would then have (4, 2, 3, 1), as its first row.