Appendix
Description of the Minimal Voting Procedure (MVP)
First Published, in Moscow Russia: 1997
Appendix: Description of the Minimal Voting Procedure (MVP)
To instantiate a voting procedure, we need the following triple < C,
O1, O2 > :
·
A set of categories C = { Cq } as defined by a training set O1.
· A means to
produce a document representational set for members of O1.
· A means to
produce a document representational set for members of a test set, O2
.
We assume that we have a training collection O1 with m
document passages,
O1 = { d1 , d2
, . . . , dm }
Documents that are not single passages can be substituted here. The
notion introduced above can be generalized to replace documents with a
more abstract notion of an "object".
Objects
O = { O1 , O2 ,
. . . , Om }
can be documents, semantic passages that are discontinuously expressed
in the text of documents, or other classes of objects, such as
electromagnetic events, or the coefficients of spectral transforms.
Some representational procedure is used to compute an "observation" Dr
about the semantics of the passages. The subscript r is used to remind
us that various types of observations are possible and that each of
these may result in a different representational set. For linguistic
analysis, each observation produces a set of theme phrases. We use the
following notion to indicate this:
Dr : --> { t1
, t2 , . . . , tn }
This notion is read "the observation Dr of the passage di
produces the representational set { t1 , t2 , . .
. , tn }"
We now combine these passage level representations to form a category
representation.
Each "observation", Dr , of the passages in the training set
O1 has a "set" of theme phrases
Dr : --> Tk =
{ t1 , t2 , . . . , tn}
Let A be the union of the individual passage representational sets Tk.
A = Union Tk.
This set A is the representation set for the complete training
collection O1 .
The set A can be partitioned, with overlaps, to match the categories to
which the passages were assigned. Let T*q be the union of
all theme phrase representation sets Tk for all passages
that are assigned to the category q.
T*q = Union Tk
such that, dk, is assigned to the category q.
The category representation set, T*q, is defined for each
category number q.
The overlap between category representation T*q, and T*s,
is one statistical measure of the "cognitive entanglement" between
category q and category s. This fact leads to a method for identifying
the minimal intersections of structural features of the category
representational sets.
J. S. Mill’s logics relies on the discovery of meaningful subsets of
representational elements. The first principles of J S Mill’s
argumentation are:
1. that negative evidence should be
acquired as well as positive evidence
2. that a bi-level argumentation should involve a decomposition of
passages and categories into a set of representational phrases
3. that the comparison of passage and category representation should
generalize (provide the grounding for computational induction) from the
training set to the test set .
It is assumed that each "observation", Dk, of the test set O2
is composed from a "set" of basic elements, in this case the theme
phrases in A. Subsets of the set are composed, or aggregated, into
wholes that are meaningful in a context that depends only statistically
on the characteristics of basic elements.
This general framework provides for situational reasoning and
computational argumentation about natural systems.
For the time being, it is assumed that the set of basic elements is the
full phrase representational set
A = Union Tk.
for the training collection O1.
Given the data:
T*q for each C q
, q = 1, . . , n
and the representational sets Tk , from the observations Dk,
for each passage, dk, from the test set O2, we
generate the hypothesis that the observation Dk should be
categorized into category q.
This hypothesis will be voted on by using each phrase in the
representational set for Dk by making the following
inquiries for each element ti of the representational set Tk:
1. does an observation of a passage, Dk,
have the property p, where p is the property that this specific
representational element, ti , is also a member
of the representational set T*q for category q.
2. does an observation of a passage, Dk, have the
property p, where p is the property that this specific representational
element, ti , is
not a member of the representational set T*q for
category q.
Truth of the first inquiry produces a positive vote, from the single
passage level representational element, that the passage is in the
category.
Truth of the second inquiry produces a negative vote, from the single
representational element, that the passage is not in the category.
These votes are tallied.
Data structure for recording the votes
For each passage, dk , we define the matrix Ak as
a rectangular matrix of size m x h where m is the size of a specific
passage representational set Tk, and h is the number of
categories. The passages are indexed by k, each passage has it’s own
matrix.
Each element ti of Tk, will get to vote for or
against the hypothesis that this kth passage should be in the category
having the category representational set T*q. Thus Ak
is defined by the rule:
ai,j = -1 if the phrase is
not in T*q
or
ai,j = 1 if the phrase is
in T*q
Matrix Ak is used to store the individual + - votes placed
by each agent (i.e., the representational element of the phrase
representation of the passage.)
This linear model produces ties for first place, and places a
semi-order (having ties for places) on the categories by counting
discrete votes for and against the hypothesis that the document is in
that category.
A second data structure to record
weighted votes
A non-linear (weighted) model uses internal and external weighting to
reduce the probability of ties to near zero and to account for
structural relationships between themes.
Matrix Bk is defined:
bi,j = ai,j *
weight of the phrase in Tk
if the phrase is not in T*q
or
bi,j = ai,j *
weight of the phrase in T*q
if the phrase is in T*q
This difference between the two multipliers is necessary and sufficient
to break ties resulting from the linear model (matrix Ak).
Data structure to record the results
For each passage representation and each category, the tally is made
from the matrix Bk and stored in a matrix C having the same number of
records as the size of the document collection, and having h columns –
one column for each category.
The information in matrix C is transformed into a matrix D having the
same dimension as C. The elements of each row in C are reordered by the
tally values. To illustrate, suppose we have only 4 categories and
passage 1 tallies {-1214,-835,451,1242} for categories 1, 2, 3 and 4
respectively. So
cat1 --> -1214, cat2 --> -835,
cat3 --> 451 and cat4 --> 1242.
By holding these assignments constant and ordering the elements by size
of tally we have the permutation of the
ordering ( 1, 2, 3, 4) to the ordering (4, 2, 3, 1).
( 1, 2, 3, 4) --> ( 4, 2, 3,
1).
This results show that for passage 1, the first place placement is
category 4, the second place is category 2, etc. The matrix D would
then have (4, 2, 3, 1), as its first row.