Appendix
Description
of the Minimal Voting Procedure (MVP)
First Published, in Russia: 1997
Appendix: Description of the Minimal Voting Procedure (MVP)
To instantiate a voting procedure, we need
the following triple < C, O1, O2 > :
·
A set of categories C
= { Cq } as defined by a training set O1.
·
A means to produce a
document representational set for members of O1 .
·
A means to produce a
document representational set for members of a test set, O2 .
We assume that we have a training collection O1 with m document
passages,
O1 = { d1 , d2 , . . . , dm
}
Documents that are not single passages can be
substituted here. The notion introduced above can be generalized to replace
documents with a more abstract notion of an "object".
Objects
O = { O1 , O2 , . . . , Om
}
can be documents, semantic passages that are
discontinuously expressed in the text of documents, or other classes of
objects, such as electromagnetic events, or the coefficients of spectral
transforms.
Some representational procedure is used to
compute an "observation" Dr about the semantics of the
passages. The subscript r is used to remind us that various types of
observations are possible and that each of these may result in a different
representational set. For linguistic analysis, each observation produces a set
of theme phrases. We use the following notion to indicate this:
Dr
: di à { t1 , t2 , . . . , tn
}
This notion is read "the observation Dr
of the passage di produces the representational set { t1
, t2 , . . . , tn }"
We now combine these passage level
representations to form a category representation.
·
each
"observation", Dr, of the passages in the training set O1 has a
"set" of theme phrases
Dr : di à Tk = { t1 , t2 , . . . , tn
}
·
Let A be the union
of the individual passage representational sets Tk.
A = È Tk.
This set A is the representation set
for the complete training collection O1.
·
The set A can be
partitioned, with overlaps, to match the categories to which the passages were
assigned. Let T*q be the union of all theme phrase
representation sets Tk for all passages that are assigned to
the category q.
T*q
= È Tk
…
such that, dk, is assigned to the
category q.
The category
representation set, T*q, is defined for each category
number q.
·
The overlap between
category representation T*q, and T*s, is
one statistical measure of the "cognitive entanglement" between
category q and category s. This fact will lead to a method for identifying the
minimal intersections of structural features of the category representational
sets.
J. S. Mill’s logics relies on the discovery
of meaningful subsets of representational elements. The first principles of J S
Mill’s argumentation are:
1.
that negative evidence
should be acquired as well as positive evidence
2.
that a bi-level
argumentation should involve a decomposition of passages and categories into a
set of representational phrases
3.
that the comparison of
passage and category representation should generalize (provide the grounding
for computational induction) from the training set to the test set .
It is assumed that each
"observation", Dk, of the test set O2 is composed
from a "set" of basic elements, in this case the theme phrases in A.
Subsets of the set are composed, or aggregated, into wholes that are meaningful
in a context that depends only statistically on the characteristics of basic
elements.
The general framework provides for
situational reasoning and computational argumentation about natural systems.
For the time being, it is assumed that the
set of basic elements is the full phrase representational set
A = È Tk.
for the training collection O1.
Given the data:
T*q
for each C q , q = 1, . .
, n and
the representational sets Tk , from
the observations Dk, for each passage, dk, from the test
set O2, we generate
the hypothesis that the observation Dk should be categorized into
category q.
This hypothesis will be voted on by using
each phrase in the representational set for Dk by making the
following inquiries for each element ti of the representational set Tk:
1.
does an observation of a
passage, Dk, have the property p, where p is the property that this
specific representational element, ti , is also a member of the
representational set T*q for category q.
2.
does an observation of a
passage, Dk, have the property p, where p is the property that this
specific representational element, ti , is not a member of
the representational set T*q for category q.
Truth of the first inquiry produces a
positive vote, from the single passage level representational element, that the
passage is in the category. Truth of the second inquiry produces a negative
vote, from the single representational element, that the passage is not
in the category. These votes are tallied.
Data structure for recording the votes
For each passage, dk , we define
the matrix Ak as a rectangular matrix of size m x h where m is the
size of a specific passage representational set Tk, and h is
the number of categories. The passages are indexed by k, each passage has it’s
own matrix.
Each element ti of Tk,
will get to vote for or against the hypothesis that this kth passage should be
in the category having the category representational set T*q.
Thus Ak is defined by the rule:
ai,j =
-1 if the phrase is not in T*q
or
ai,j = 1
if the phrase is in T*q
Matrix Ak is used to store the
individual + - votes placed by each agent (i.e., the representational element
of the phrase representation of the passage.)
This linear model produces ties for first
place, and places a semi-order (having ties for places) on the categories by
counting discrete votes for and against the hypothesis that the document is in
that category.
A second data structure to record weighted
votes
A non-linear (weighted) model uses internal
and external weighting to reduce the probability of ties to near zero and to
account for structural relationships between themes.
Matrix Bk is defined:
bi,j = ai,j
* weight of the phrase in Tk
if the phrase is
not in T*q
or
bi,j = ai,j
* weight of the phrase in T*q
if the phrase is in
T*q
This difference between the two multipliers
is necessary and sufficient to break ties resulting from the linear model
(matrix Ak).
Data structure to record the results
For each passage representation and each
category, the tally is made from the matrix Bk and stored in a matrix
C having the same number of records as the size of the document collection, and
having h columns – one column for each category.
The information in matrix C is transformed
into a matrix D having the same dimension as C. The elements of each row in C
are reordered by the tally values. To illustrate, suppose we have only 4
categories and passage 1 tallies {-1214,-835,451,1242} for categories 1, 2, 3
and 4 respectively. So
cat1 à -1214, cat2 à -835, cat3 à 451 and cat4 à 1242.
By holding these assignments constant and
ordering the elements by size of tally we have the permutation of the ordering
( 1, 2, 3, 4) to the ordering (4, 2, 3, 1).
( 1,
2, 3, 4) à ( 4, 2, 3, 1).
This results show that for passage 1, the
first place placement is category 4, the second place is category 2, etc. The
matrix D would then have (4, 2, 3, 1), as its first row.