Chapter 10
The Information Retrieval performance measures
Information Retrieval (IR) queries are not the same as natural language queries. With an IR query we may only specify an exact match criterion. These exact match criteria may be quite complex and may even involve stochastic elements, but no matter how complex the query, retrieval occurs only if a criterion is matched exactly. We must therefore be able to determine when two objects are exactly the same.
Let O be defined to be the set of all documents in a database. Let T be the set of documents that are placed into a category by some unknown, but valid, IR query statement. Now let q be a, perhaps distinct, IR query that returns the set S. The precision of the query q with respect to the category defined exactly by the set T is:

Pn(q,T) = nr / n

where n is a cutoff point that discards those elements of the hitlist beyond the first n elements, and nr is the number of elements from the set T that are in the first n elements of the hitlist.
We might assume that n would be fixed at the point where the query q no longer finds matches, but the full hitlist is often not computed because hitlists can be very large. In this case, the precision computed with a cutoff point is assumed to be a good approximation to the precision computed on the full hitlist.
The recall of an IR query q is defined as the proportion of the total number of elements from the set T that were retrieved in the top n documents:

Rn(q,T) = nr / Nr

where Nr is the size of the set T.
Example: Suppose T is the set { a1, a2, a3, . . . , a25 }. Suppose that T is a subset of O and that the size of O is 50. Now suppose that the IR query q returns a set R of n = 10 documents. If the size of R ∩ T is 8 then the precision is 8/10 = 0.8 and the recall is 8/25 = 0.32.
If we double the size of the allowed hitlist and suppose that we obtain the same quality of results for the additional retrieved elements, then we have n = 20 and the size of R ∩ T is 16. In this case, the precision is 16/20 = 0.8, but the recall is 16/25 = 0.64.
Recall improves as the cutoff number is increased. Precision, however, may vary, with a downward trend, as the cutoff number increases.
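The cutoff-based definitions above can be sketched directly. In the sketch below the hitlist contents are hypothetical, constructed so that 8 of every 10 hits fall in T, which reproduces the figures of the worked example:

```python
def precision_at(hitlist, T, n):
    """Precision at cutoff n: fraction of the first n hits that are in T."""
    return sum(1 for d in hitlist[:n] if d in T) / n

def recall_at(hitlist, T, n):
    """Recall at cutoff n: fraction of T found in the first n hits."""
    return sum(1 for d in hitlist[:n] if d in T) / len(T)

# The category T with 25 members, as in the example.
T = {f"a{i}" for i in range(1, 26)}

# Hypothetical hitlist: 8 correct hits followed by 2 incorrect ones, twice.
hitlist = []
for block in range(2):
    hitlist += [f"a{block * 8 + i}" for i in range(1, 9)]   # 8 elements of T
    hitlist += [f"x{block}", f"y{block}"]                   # 2 non-members

print(precision_at(hitlist, T, 10), recall_at(hitlist, T, 10))  # 0.8 0.32
print(precision_at(hitlist, T, 20), recall_at(hitlist, T, 20))  # 0.8 0.64
```

Doubling the cutoff with the same hit quality leaves precision unchanged while recall doubles, as in the example.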
Using a perfect algorithm, the precision does not decrease as the cutoff number is increased, but the cutoff number will eventually exceed the number of records that the algorithm will return. Using a perfect algorithm, the recall will stay constant at 1.
The case where we have a perfect algorithm is exceptional. In the non-perfect situation we do not know where the correct retrievals will be placed in the hitlist. Because of ranking algorithms the hitlist is generally ordered, but the ordering may be less than perfect when the retrieval query is less than perfect. If the precision is uniform as we move down an ordered hitlist, then the harmonic mean of the recall and precision numbers, called the F score, will increase and then decrease as the cutoff number is increased. Precision will not, in general, be uniform.
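The rise-and-fall of the F score can be illustrated with a minimal sketch. The ranking here is an idealized assumption, not taken from the text: the first Nr hits are all correct (uniform precision of 1 until T is exhausted), and every later hit is incorrect, so precision eventually declines:

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (the F score)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Idealized hitlist assumption: the first Nr hits are all in T.
Nr = 25
for n in (5, 15, 25, 35, 50):
    nr = min(n, Nr)                 # correct hits among the first n
    p, r = nr / n, nr / Nr
    print(n, round(f_score(p, r), 3))   # rises to 1.0 at n = 25, then falls
```

Under this assumption the F score peaks exactly where the cutoff equals the size of T; with a less perfect ordering the peak is lower and broader.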
Precision and
recall require that we be able to count the number of items in a category and
the number of items in the category that are also in the hitlist. We claim that
there are two fundamental problems with these performance measures. The first
is related to the inexactness of mental categorization and to the entanglement
that almost always exists between natural categories. In this case, how does one
count items? The second is related to the difference between how humans
interact with environments and how a computational program is forced to
interact.
Inexactness and
entanglement are bound together. The inexactness of mental categorization is
seen immediately when one uses introspection to examine the contents of
personal mental images. The entanglement is related to how concepts and other
mental states move one into the other in an embedded and dynamic fashion. The
stratified categorization produced by the Voting Procedure provides a formalism
to address both inexactness and entanglement.
Human interaction
involves interpretation, whereas computer interaction has a weak basis for
interpretation. Humans have a subjective reality that is not shared directly
with other people or things. This subjective reality is often the source of
motivation and interpretation. In fact, from a social point of view, the
ability to judge the subjective reality of others is a highly prized human
trait. The computer has no subjective reality, and cannot judge the subjective reality of people and things. Perhaps the best that can be expected is that
information about subjective realities can be made objective and procedural.
Proper conversion, to objective form, might produce tools for machine
interpretation of context.
The interpretation of subjective realities is a different issue than inexactness and entanglement. Interpretation is a constraint that effectively reduces a set of possible mental states; but the inexactness and entanglement of objective representations may be increased or decreased during the constraint of interpretation. Thus, while distinct, the interpretation involved in changing mental states is informed by the boundaries between concepts and the entanglement that concepts are seen to have.
Extension of Precision and Recall measures to stratified categorization
The Voting Procedure takes a crisp assignment policy for a collection of objects O and produces a stratified assignment policy. The total set of categories used is a category policy

C = { C1, C2, C3, . . . , Cn }

for the set O.
A crisp assignment policy takes an object and places it into exactly one category. We use the notation:

Oj → Ck

which is read: object Oj is placed into category Ck.
A stratified assignment policy takes an object and ranks the preferred placement into each of the categories in a category policy. In this case we use the notation:

Oj →r Ck

for an r-th place preference for object Oj to be placed into category Ck.
A complete stratified assignment policy can be written:

Oj →N ( Ck1, Ck2, Ck3, . . . , Ckn )

where N = { 1, 2, 3, . . . , n } is the index set that orders the category policy and the sequence (k1, k2, k3, . . . , kn) is a permutation of N.
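The two kinds of assignment policy can be sketched as plain data structures; the object and category names below are illustrative placeholders, not taken from the text:

```python
# The category policy: an ordered set of category names.
categories = ["C1", "C2", "C3", "C4"]

# A crisp assignment maps each object to exactly one category: O1 -> C3.
crisp = {"O1": "C3"}

# A stratified assignment ranks every category in the policy: the list is
# a full preference ordering, i.e. a permutation of the category policy.
stratified = {"O1": ["C3", "C1", "C4", "C2"]}

# The r-th place preference O1 ->r Ck is simply position r in the ranking.
assert stratified["O1"][0] == crisp["O1"]               # first preference
assert sorted(stratified["O1"]) == sorted(categories)   # a true permutation
```

A crisp policy is thus the special case of a stratified policy in which only the first-place preference is retained.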
In the Voting
Procedure we assume that there exists a well defined crisp placement of each
object from O into exactly one category. Our goal is to produce a valid
stratified assignment policy. Thus we are in a typical research situation where
a training set and a test set are given. Knowledge about the categorization and
assignment policies in the training set is to be used to establish procedures
for placing elements of the test set. Recall and precision are measured on the retrieval results of a query against the test set, relative to the assignment policy of the test set.
A more general case would be one where the assignment policy is itself stratified, but this case has not yet come up in our experimental work. A review of the history and research of the TIPSTER program and the Text REtrieval Conference (TREC) demonstrates the problems involved in defining reasonable crisp category and assignment policies for natural routing and ad hoc problems.
We have redefined
the nature of the retrieval problem from a crisp setting to one where there are
levels of representations and associations.
Issues that should be expected:

· An independent category policy should have recall and precision characteristics that are exactly the same as the IR recall and precision measures.

· If the category definitions are not independent, then the recall and precision measures should reflect categorical entanglement.
If the pairwise
intersections of category representational sets are all empty, then the
category policy can be said to have independent category definitions. In this
case, the measure of precision and recall should be applied to only first place
assignment preferences; i.e., an object is only in one category and any
placement other than in the first place should be regarded as an error.
The retrieval that
is made by an assignment policy is simply a hitlist that contains the category
names. If the cutoff length of the hitlist is n = 1, then we have a crisp
assignment policy. So in the crisp case with independent categories we have
either a 0 or a 1 for both the precision and recall of the assignment query.
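The crisp, independent-category case can be sketched as a category hitlist of length one; the helper name below is hypothetical:

```python
def assignment_precision_recall(hit_categories, true_category, n=1):
    """With independent categories and cutoff n = 1, the category hitlist
    reduces to a crisp assignment: precision and recall are both 0 or 1.
    Nr = 1 because the crisp target contains a single category."""
    nr = 1 if true_category in hit_categories[:n] else 0
    return nr / n, nr / 1

print(assignment_precision_recall(["C3"], "C3"))   # (1.0, 1.0)
print(assignment_precision_recall(["C5"], "C3"))   # (0.0, 0.0)
```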
Recall and Precision in single placement actions
For single placement actions, such as message routing into exactly one bin, the test set has only one element, so T = { O1 }. Consider the two formulas:

P1(q,T) = nr / 1

and

R1(q,T) = nr / Nr

where nr is the number of elements from the set T that are in the correct assignment policy and o(T) = Nr.
If the target
placement and the computed placement are both crisp, then Nr = 1 and
nr = 0 or 1. So the placement is either completely correct or
completely incorrect.
Suppose that an object, O, is known to be in a specific category, say Ck. If there is a single routing decision to let a system place the object into one category, then

if O → Ck then P1(q,T) = R1(q,T) = 1, since nr = 1, and

if O → Ci for any i ≠ k, then P1(q,T) = R1(q,T) = 0, since nr = 0;

where O → Ci is read, "object O is placed into category Ci by the voting procedure".
If the target placement is crisp, and the computed placement is stratified, then the precision and recall can be different. The parameter nr is a measure of the degree to which the element from the test set T is characterized by the target assignment policy.

If O ∈ Ck is known from the training set, and

O →N ( Ck1, Ck2, Ck3, . . . , Ckn )

is the computed assignment, then

nr = 1 if k = k1
nr = (n + 1 – i)/n if k = ki
Nr = 1
Example: Let k = 3, so we have the target placement O → C3, and suppose the computed placement is:

O → ( C4, C6, C3, C1, C5, C7, C9, C2, C10, C8 )

In this case, nr = 8/10, and precision and recall are both 8/10.
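The partial credit nr = (n + 1 – i)/n for a crisp target found at place i of an n-long ranking can be sketched as follows; the helper name n_r is hypothetical:

```python
def n_r(target, ranking):
    """Credit when the crisp target category appears at place i of an
    n-long computed ranking: (n + 1 - i)/n, so first place scores 1
    and last place scores 1/n."""
    n = len(ranking)
    i = ranking.index(target) + 1     # 1-based position of the target
    return (n + 1 - i) / n

ranking = ["C4", "C6", "C3", "C1", "C5", "C7", "C9", "C2", "C10", "C8"]
print(n_r("C3", ranking))   # C3 is in place 3 of 10 -> (10 + 1 - 3)/10 = 0.8
```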
The IR measure Nr is the size of the set of possible candidates that can be placed into a crisp category. In the generalization to stratified categories, Nr may be non-integer when viewed from a single category. This is first seen when we consider a stratified target placement. If the target placement is stratified, then Nr can also vary and assume non-integer values.
If

O →N ( Ck1, Ck2, Ck3, . . . , Ckn )

is known from the training set, and

O ∈ Ck

is the computed assignment, then

Nr = 1 if k = k1
Nr = n/(n + 1 – i) if k = ki
nr = 1
Note the reciprocal relationship in Nr.

Example: Suppose the target placement is

O → ( C4, C6, C3, C1, C5, C7, C9, C2, C10, C8 )

and the computed placement is O → C3.

In this case, Nr = 10/8. This number is interpreted to be larger than 1 because C3 has to be moved up to get to first place, and thus C3 is out of the focus of the first place assignment. Recall is 8/10.
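The stratified-target case can be sketched in the same style; the helper name N_r is hypothetical, and the reciprocal relationship with the nr credit defined earlier is visible in the formula:

```python
def N_r(target, ranking):
    """When the target placement is stratified and the computed placement
    is crisp, Nr = n/(n + 1 - i): the reciprocal of the (n + 1 - i)/n
    credit used when the roles are reversed."""
    n = len(ranking)
    i = ranking.index(target) + 1     # 1-based position in the target ranking
    return n / (n + 1 - i)

target_ranking = ["C4", "C6", "C3", "C1", "C5", "C7", "C9", "C2", "C10", "C8"]
Nr = N_r("C3", target_ranking)       # 10/8 = 1.25
recall = 1 / Nr                      # nr = 1, so recall = 8/10
print(Nr, recall)
```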
There is a problem
with saying that the precision here is 1.
To consider this
task, we go to the substructure of representational features.
Let Vi,j be the intersection of the category representational sets T*i and T*j. Suppose that the size, o(Vi,j), of Vi,j is m and that the size of the union of all representational elements, A, is n. Then a measure of the category entanglement between Ci and Cj is m/n. In better notation:

m( Ci, Cj ) = o(Vi,j) / o(A)
However, we need to define a measure for o(Vi,j) such that the sum of the measures over i, for a fixed j, is the number 1:

Σ { m( Ci, Cj ) | i ∈ N } = 1
Thus we can think
about possible entanglement as being spread over the set of pairwise
entanglements as a distribution function whose summation is 1. This
distribution function is centered at each category.
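A minimal sketch of this entanglement measure, assuming each category is represented by a set of features; the feature sets are invented for illustration, and the raw overlap o(Vi,j)/o(A) is renormalized over i for each fixed Cj so that the measures sum to 1, as required:

```python
# Hypothetical category representational sets.
reps = {
    "C1": {"a", "b", "c"},
    "C2": {"b", "c", "d"},
    "C3": {"d", "e"},
}
A = set().union(*reps.values())    # union of all representational elements

def raw_m(ci, cj):
    """Raw entanglement: size of the intersection over the size of A."""
    return len(reps[ci] & reps[cj]) / len(A)

def m(ci, cj):
    """Entanglement renormalized so that, for fixed cj, it sums to 1 over ci."""
    total = sum(raw_m(ck, cj) for ck in reps)
    return raw_m(ci, cj) / total

for ci in reps:
    print(ci, round(m(ci, "C2"), 3))    # a distribution centered at C2
```

Each category thus carries its own distribution of entanglement over the category policy.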
Let

( Cr1, Cr2, Cr3, . . . , Crn )

be the ground truth and

( Cq1, Cq2, Cq3, . . . , Cqn )

be the computed placement. The recall and precision measures can now be computed as a sum of n terms, t1, t2, . . . , tn, one for each placement position in the assignment policy:

ti = 1/n if Cri = Cqi

or

ti = m( Cri, Cqi )/n if Cri ≠ Cqi
The recall/precision is 1 if the ground truth and the computed placement are the same, or if the entanglement is complete so that m( Cri, Cqi ) = 1 for all categories. If the categories are independent then m( Cri, Cqi ) = 0, and only a first place assignment is acceptable. Clearly, there are higher order effects to consider that have not been considered here.
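The term-by-term score can be sketched as a short function; the helper name is hypothetical, and the entanglement measure m is passed in so that the independent-category and fully entangled cases can both be checked:

```python
def stratified_score(truth, computed, m):
    """Sum of n per-position terms: 1/n where the two rankings agree at a
    position, and m(Cri, Cqi)/n where they differ."""
    n = len(truth)
    total = 0.0
    for cr, cq in zip(truth, computed):
        total += 1 / n if cr == cq else m(cr, cq) / n
    return total

truth    = ["C1", "C2", "C3"]
computed = ["C1", "C3", "C2"]

# Independent categories: m is identically 0, so only exact per-position
# agreement counts -> one agreement out of three positions gives 1/3.
print(stratified_score(truth, computed, lambda a, b: 0.0))

# Complete entanglement: m is identically 1, so every position scores 1/n.
print(stratified_score(truth, computed, lambda a, b: 1.0))
```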
Recall and Precision of stratified placement of a test or training set
Let C = { C1, C2, C3, . . . , Cn } be a categorization policy for the set O that has been defined by a crisp placement of the elements from a test set T ⊂ O. For each element of T the Voting Procedure will produce a stratified placement policy:

Oj →N ( Ck1, Ck2, Ck3, . . . , Ckn )

where N = { 1, 2, 3, . . . , n } is the index set that orders the category policy and the sequence (k1, k2, k3, . . . , kn) is a permutation of N.
In this case, local information about features from the training set is used to identify the entanglement between categories. If we counted only first place then we would not expect to be able to achieve recall and precision scores of 1. Working with the training set allows us to see the effects on the performance measures that result from changing crisp categorization policies to stratified ones.
If there exists a separate test set S, S ≠ T, S ⊂ O, then we have the usual train-and-test experimental task. The local information from the training set can be used to place elements of the test set. In either case, we need a performance measure that is consistent with the IR recall and precision measures.
Suppose that all objects O ∈ O have been placed into stratified assignment policies. Clearly, if we counted only first place assignments then we could use the standard IR measures; the precision formula:

Pn(q,T) = nr / n

and the recall formula:

Rn(q,T) = nr / Nr

The parameter n is a cutoff for the hitlist, nr is the number of elements from the set T that are in the hitlist, and o(T) = Nr. The query q addresses the assignment policy table where the results from the voting procedure have been placed.
The predictive measure when object placement has subjective constraints
The use of
precision and recall can be questioned based on examples from experience. In
particular, the motivation for placement of documents into categories by humans
may not correspond to SQL statements that exactly recall only those documents
that have already been placed into the category.
The criterion for placement of documents (images, objects, etc.) into categories by humans may be predictive of future placements in ways that are not measured by the IR performance measures. The next placement of a document by a human may not be retrieved correctly even by what was judged to be a perfect IR query. This might or might not be traceable to some reason that the user can specify in terms that correspond to the representational data.