Chapter 10
The Information Retrieval performance measures
Information Retrieval (IR) queries are not the same as natural language queries. With an IR query we may only specify an exact match criterion. These exact match criteria may be quite complex and may even involve stochastic elements, but no matter how complex the query, retrieval occurs only if a criterion is matched exactly. We must therefore be able to determine when two objects are exactly the same.
Let O be defined to be the set of all documents in a database. Let T be the set of documents that are placed into a category by some unknown, but valid, IR query statement. Now let q be a, perhaps distinct, IR query that returns the set S. The precision of the query q with respect to the category defined exactly by the set T is:

Pn(q,T) = nr / n

where n is a cutoff point that discards those elements of the hitlist beyond the first n elements, and nr is the number of elements from the set T that are in the first n elements of the hitlist.
We might assume that n would be fixed at the point where the query q no longer finds matches, but the full hitlist is often not computed because hitlists can be very large. In this case, the precision computed with a cutoff point is assumed to be a good approximation to the precision computed on the full hitlist.
The recall of an IR query q is defined as the proportion of the total number of elements from the set T that were retrieved in the top n documents:

Rn(q,T) = nr / Nr

where Nr is the size of the set T.
Example: Suppose T is the set { a1, a2, a3, . . . , a25 }. Suppose that T is a subset of O and that the size of O is 50. Now suppose that the IR query q returns a set R of n = 10 documents. If the size of R ∩ T is 8 then the precision is 8/10 = 0.8 and the recall is 8/25 = 0.32.
If we double the size of the allowed hitlist and suppose that we obtain the same quality of results for the additional retrieved elements, then we have n = 20 and the size of R ∩ T is 16. In this case, the precision is 16/20 = 0.8, but the recall is 16/25 = 0.64.
Recall improves as the cutoff number is increased. Precision, however, may vary, with a downward trend, as the cutoff number increases.
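The cutoff-based definitions above can be sketched directly. In the sketch below the hitlist contents are hypothetical, constructed so that 8 of every 10 hits fall in T, which reproduces the figures of the worked example:

```python
def precision_at(hitlist, T, n):
    """Precision at cutoff n: fraction of the first n hits that are in T."""
    return sum(1 for d in hitlist[:n] if d in T) / n

def recall_at(hitlist, T, n):
    """Recall at cutoff n: fraction of T found in the first n hits."""
    return sum(1 for d in hitlist[:n] if d in T) / len(T)

# The category T with 25 members, as in the example.
T = {f"a{i}" for i in range(1, 26)}

# Hypothetical hitlist: 8 correct hits followed by 2 incorrect ones, twice.
hitlist = []
for block in range(2):
    hitlist += [f"a{block * 8 + i}" for i in range(1, 9)]   # 8 elements of T
    hitlist += [f"x{block}", f"y{block}"]                   # 2 non-members

print(precision_at(hitlist, T, 10), recall_at(hitlist, T, 10))  # 0.8 0.32
print(precision_at(hitlist, T, 20), recall_at(hitlist, T, 20))  # 0.8 0.64
```

Doubling the cutoff with the same hit quality leaves precision unchanged while recall doubles, as in the example.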
Using a perfect algorithm, the precision does not decrease as the cutoff number is increased, but the cutoff number will eventually exceed the number of records that the algorithm will return. Using a perfect algorithm, the recall will stay constant at 1.
The case where we have a perfect algorithm is exceptional. In the non-perfect situation we do not know where the correct retrievals will be placed in the hitlist. Because of ranking algorithms the hitlist is generally ordered, but the ordering may be less than perfect when the retrieval query is less than perfect. If the precision is uniform as we move down an ordered hitlist, then the harmonic mean of the recall and precision numbers, called the F score, will increase and then decrease as the cutoff number is increased. Precision will not, in general, be uniform.
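The rise-and-fall of the F score can be illustrated with a minimal sketch. The ranking here is an idealized assumption, not taken from the text: the first Nr hits are all correct (uniform precision of 1 until T is exhausted), and every later hit is incorrect, so precision eventually declines:

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (the F score)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Idealized hitlist assumption: the first Nr hits are all in T.
Nr = 25
for n in (5, 15, 25, 35, 50):
    nr = min(n, Nr)                 # correct hits among the first n
    p, r = nr / n, nr / Nr
    print(n, round(f_score(p, r), 3))   # rises to 1.0 at n = 25, then falls
```

Under this assumption the F score peaks exactly where the cutoff equals the size of T; with a less perfect ordering the peak is lower and broader.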
Precision and
recall require that we be able to count the number of items in a category and
the number of items in the category that are also in the hitlist. We claim that
there are two fundamental problems with these performance measures. The first
is related to the inexactness of mental categorization and to the entanglement
that almost always exists between natural categories. In this case, how does one
count items? The second is related to the difference between how humans
interact with environments and how a computational program is forced to
interact.
Inexactness and
entanglement are bound together. The inexactness of mental categorization is
seen immediately when one uses introspection to examine the contents of
personal mental images. The entanglement is related to how concepts and other
mental states move one into the other in an embedded and dynamic fashion. The
stratified categorization produced by the Voting Procedure provides a formalism
to address both inexactness and entanglement.
Human interaction
involves interpretation, whereas computer interaction has a weak basis for
interpretation. Humans have a subjective reality that is not shared directly
with other people or things. This subjective reality is often the source of
motivation and interpretation. In fact, from a social point of view, the
ability to judge the subjective reality of others is a highly prized human
trait. The computer has no subjective reality, and cannot judge the subjective reality of people and things. Perhaps the best that can be expected is that
information about subjective realities can be made objective and procedural.
Proper conversion, to objective form, might produce tools for machine
interpretation of context.
The interpretation of subjective realities is a different issue than inexactness and entanglement. Interpretation is a constraint that effectively reduces a set of possible mental states; but the inexactness and entanglement of objective representations may be increased or decreased during the constraint of interpretation. Thus, while distinct, the interpretation involved in changing mental states is informed by the boundaries between concepts and the entanglement that concepts are seen to have.
Extension of Precision and Recall measures to stratified categorization
The Voting Procedure takes a crisp assignment policy for a collection of objects O and produces a stratified assignment policy. The total set of categories used is a category policy

C = { C1, C2, C3, . . . , Cn }

for the set O.
A crisp assignment policy takes an object and places it into exactly one category. We use the notation:

Oj → Ck

which is read: object Oj is placed into category Ck.
A stratified assignment policy takes an object and ranks the preferred placement into each of the categories in a category policy. In this case we use the notation:

Oj →r Ck

for an r-th place preference for object Oj to be placed into category Ck.
A complete stratified assignment policy can be written:

Oj →N ( Ck1, Ck2, Ck3, . . . , Ckn )

where N = { 1, 2, 3, . . . , n } is the index set that orders the category policy and the sequence (k1, k2, k3, . . . , kn) is a permutation of N.
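The two kinds of assignment policy can be sketched as plain data structures; the object and category names below are illustrative placeholders, not taken from the text:

```python
# The category policy: an ordered set of category names.
categories = ["C1", "C2", "C3", "C4"]

# A crisp assignment maps each object to exactly one category: O1 -> C3.
crisp = {"O1": "C3"}

# A stratified assignment ranks every category in the policy: the list is
# a full preference ordering, i.e. a permutation of the category policy.
stratified = {"O1": ["C3", "C1", "C4", "C2"]}

# The r-th place preference O1 ->r Ck is simply position r in the ranking.
assert stratified["O1"][0] == crisp["O1"]               # first preference
assert sorted(stratified["O1"]) == sorted(categories)   # a true permutation
```

A crisp policy is thus the special case of a stratified policy in which only the first-place preference is retained.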
In the Voting
Procedure we assume that there exists a well defined crisp placement of each
object from O into exactly one category. Our goal is to produce a valid
stratified assignment policy. Thus we are in a typical research situation where
a training set and a test set are given. Knowledge about the categorization and
assignment policies in the training set is to be used to establish procedures
for placing elements of the test set. Recall and precision are measured on the retrieval results of a query against the test set, relative to the assignment policy of the test set.
A more general case would be one where the assignment policy is itself stratified, but this case has not yet come up in our experimental work. A review of the history and research of the TIPSTER program and the Text REtrieval Conference (TREC) demonstrates the problems involved in defining reasonable crisp category and assignment policies for natural routing and ad hoc problems.
We have redefined
the nature of the retrieval problem from a crisp setting to one where there are
levels of representations and associations.
Issues that should be expected:

· An independent category policy should have recall and precision characteristics that are exactly the same as the IR recall and precision measures.

· If the category definitions are not independent, then the recall and precision measures should reflect categorical entanglement.
If the pairwise
intersections of category representational sets are all empty, then the
category policy can be said to have independent category definitions. In this
case, the measure of precision and recall should be applied to only first place
assignment preferences; i.e., an object is only in one category and any
placement other than in the first place should be regarded as an error.
The retrieval that
is made by an assignment policy is simply a hitlist that contains the category
names. If the cutoff length of the hitlist is n = 1, then we have a crisp
assignment policy. So in the crisp case with independent categories we have
either a 0 or a 1 for both the precision and recall of the assignment query.
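The crisp, independent-category case can be sketched as a category hitlist of length one; the helper name below is hypothetical:

```python
def assignment_precision_recall(hit_categories, true_category, n=1):
    """With independent categories and cutoff n = 1, the category hitlist
    reduces to a crisp assignment: precision and recall are both 0 or 1.
    Nr = 1 because the crisp target contains a single category."""
    nr = 1 if true_category in hit_categories[:n] else 0
    return nr / n, nr / 1

print(assignment_precision_recall(["C3"], "C3"))   # (1.0, 1.0)
print(assignment_precision_recall(["C5"], "C3"))   # (0.0, 0.0)
```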
Recall and Precision in single placement actions
For single placement actions, such as message routing into exactly one bin, the test set has only one element, so T = { O1 }. Consider the two formulas:

P1(q,T) = nr / 1

and

R1(q,T) = nr / Nr

where nr is the number of elements from the set T that are in the correct assignment policy and o(T) = Nr.
If the target
placement and the computed placement are both crisp, then Nr = 1 and
nr = 0 or 1. So the placement is either completely correct or
completely incorrect.
Suppose that an object, O, is known to be in a specific category, say Ck. If there is a single routing decision to let a system place the object into one category, then

if O → Ck then P1(q,T) = R1(q,T) = 1, since nr = 1, and

if O → Ci for any i ≠ k, then P1(q,T) = R1(q,T) = 0, since nr = 0;

where O → Ci is read, "object O is placed into category Ci by the voting procedure".
If the target placement is crisp, and the computed placement is stratified, then the precision and recall can be different. The parameter nr is a measure of the degree to which the element from the test set T is characterized by the target assignment policy.

If O ∈ Ck is known from the training set, and

O →N ( Ck1, Ck2, Ck3, . . . , Ckn )

is the computed assignment, then

nr = 1 if k = k1
nr = (n + 1 – i)/n if k = ki
Nr = 1
Example: Let k = 3, so we have the target placement O → C3, and suppose the computed placement is:

O → ( C4, C6, C3, C1, C5, C7, C9, C2, C10, C8 )

In this case, nr = 8/10, and precision and recall are both 8/10.
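The partial credit nr = (n + 1 – i)/n for a crisp target found at place i of an n-long ranking can be sketched as follows; the helper name n_r is hypothetical:

```python
def n_r(target, ranking):
    """Credit when the crisp target category appears at place i of an
    n-long computed ranking: (n + 1 - i)/n, so first place scores 1
    and last place scores 1/n."""
    n = len(ranking)
    i = ranking.index(target) + 1     # 1-based position of the target
    return (n + 1 - i) / n

ranking = ["C4", "C6", "C3", "C1", "C5", "C7", "C9", "C2", "C10", "C8"]
print(n_r("C3", ranking))   # C3 is in place 3 of 10 -> (10 + 1 - 3)/10 = 0.8
```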
The IR measure Nr is the size of the set of possible candidates that can be placed into a crisp category. In the generalization to stratified categories, Nr may be non-integer when viewed from a single category. This is first seen when we consider a stratified target placement. If the target placement is stratified, then Nr can also vary and assume non-integer values.
If

O →N ( Ck1, Ck2, Ck3, . . . , Ckn )

is known from the training set, and

O ∈ Ck

is the computed assignment, then

Nr = 1 if k = k1
Nr = n/(n + 1 – i) if k = ki
nr = 1
Note the reciprocal relationship in Nr.

Example: Suppose the target placement is

O → ( C4, C6, C3, C1, C5, C7, C9, C2, C10, C8 )

and the computed placement is O → C3.

In this case, Nr = 10/8. This number is interpreted to be larger than 1 because C3 has to be moved up to get to first place, and thus C3 is out of the focus of the first place assignment. Recall is 8/10.
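The stratified-target case can be sketched in the same style; the helper name N_r is hypothetical, and the reciprocal relationship with the nr credit defined earlier is visible in the formula:

```python
def N_r(target, ranking):
    """When the target placement is stratified and the computed placement
    is crisp, Nr = n/(n + 1 - i): the reciprocal of the (n + 1 - i)/n
    credit used when the roles are reversed."""
    n = len(ranking)
    i = ranking.index(target) + 1     # 1-based position in the target ranking
    return n / (n + 1 - i)

target_ranking = ["C4", "C6", "C3", "C1", "C5", "C7", "C9", "C2", "C10", "C8"]
Nr = N_r("C3", target_ranking)       # 10/8 = 1.25
recall = 1 / Nr                      # nr = 1, so recall = 8/10
print(Nr, recall)
```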
There is a problem
with saying that the precision here is 1.
To consider this
task, we go to the substructure of representational features.
Let Vi,j be the intersection of the category representational sets T*i and T*j. Suppose that the size, o(Vi,j), of Vi,j is m and that the size of the union of all representational elements, A, is n. Then a measure of the category entanglement between Ci and Cj is m/n. In better notation:

m( Ci, Cj ) = o(Vi,j) / o(A)
However, we need to define a measure for o(Vi,j) such that the sum of the measures over i, for a fixed j, is the number 1:

Σ { m( Ci, Cj ) | i ∈ N } = 1
Thus we can think
about possible entanglement as being spread over the set of pairwise
entanglements as a distribution function whose summation is 1. This
distribution function is centered at each category.
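A minimal sketch of this entanglement measure, assuming each category is represented by a set of features; the feature sets are invented for illustration, and the raw overlap o(Vi,j)/o(A) is renormalized over i for each fixed Cj so that the measures sum to 1, as required:

```python
# Hypothetical category representational sets.
reps = {
    "C1": {"a", "b", "c"},
    "C2": {"b", "c", "d"},
    "C3": {"d", "e"},
}
A = set().union(*reps.values())    # union of all representational elements

def raw_m(ci, cj):
    """Raw entanglement: size of the intersection over the size of A."""
    return len(reps[ci] & reps[cj]) / len(A)

def m(ci, cj):
    """Entanglement renormalized so that, for fixed cj, it sums to 1 over ci."""
    total = sum(raw_m(ck, cj) for ck in reps)
    return raw_m(ci, cj) / total

for ci in reps:
    print(ci, round(m(ci, "C2"), 3))    # a distribution centered at C2
```

Each category thus carries its own distribution of entanglement over the category policy.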
Let

( Cr1, Cr2, Cr3, . . . , Crn )

be the ground truth and

( Cq1, Cq2, Cq3, . . . , Cqn )

be the computed placement. The recall and precision measures can now be computed as a sum of n terms, t1, t2, . . . , tn, one for each placement position in the assignment policy:

ti = 1/n if Cri = Cqi

or

ti = m( Cri, Cqi )/n if Cri ≠ Cqi
The recall/precision is 1 if the ground truth and the computed placement are the same, or if the entanglement is complete so that m( Cri, Cqi ) = 1 for all categories. If the categories are independent then m( Cri, Cqi ) = 0, and only a first place assignment is acceptable. Clearly, there are higher order effects to consider that have not been considered here.
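The term-by-term score can be sketched as a short function; the helper name is hypothetical, and the entanglement measure m is passed in so that the independent-category and fully entangled cases can both be checked:

```python
def stratified_score(truth, computed, m):
    """Sum of n per-position terms: 1/n where the two rankings agree at a
    position, and m(Cri, Cqi)/n where they differ."""
    n = len(truth)
    total = 0.0
    for cr, cq in zip(truth, computed):
        total += 1 / n if cr == cq else m(cr, cq) / n
    return total

truth    = ["C1", "C2", "C3"]
computed = ["C1", "C3", "C2"]

# Independent categories: m is identically 0, so only exact per-position
# agreement counts -> one agreement out of three positions gives 1/3.
print(stratified_score(truth, computed, lambda a, b: 0.0))

# Complete entanglement: m is identically 1, so every position scores 1/n.
print(stratified_score(truth, computed, lambda a, b: 1.0))
```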
Recall and Precision of stratified placement of a test or training set
Let C = { C1, C2, C3, . . . , Cn } be a categorization policy for the set O that has been defined by a crisp placement of the elements from a test set T ⊂ O. For each element of T the Voting Procedure will produce a stratified placement policy:

Oj →N ( Ck1, Ck2, Ck3, . . . , Ckn )

where N = { 1, 2, 3, . . . , n } is the index set that orders the category policy and the sequence (k1, k2, k3, . . . , kn) is a permutation of N.
In this case, local information about features from the training set is used to identify the entanglement between categories. If we counted only first place then we would not expect to be able to achieve recall and precision scores of 1. Working with the training set allows us to see the effects on the performance measures that result from changing crisp categorization policies to stratified ones.
If there exists a separate test set S, S ≠ T, S ⊂ O, then we have the usual train-and-test experimental task. The local information from the training set can be used to place elements of the test set. In either case, we need a performance measure that is consistent with the IR recall and precision measures.
Suppose that all objects O ∈ O have been placed into stratified assignment policies. Clearly, if we counted only first place assignments then we could use the standard IR measures; the precision formula:

Pn(q,T) = nr / n

and the recall formula:

Rn(q,T) = nr / Nr

The parameter n is a cutoff for the hitlist, nr is the number of elements from the set T that are in the hitlist, and o(T) = Nr. The query q addresses the assignment policy table where the results from the voting procedure have been placed.
The predictive measure when object placement has subjective constraints
The use of
precision and recall can be questioned based on examples from experience. In
particular, the motivation for placement of documents into categories by humans
may not correspond to SQL statements that exactly recall only those documents
that have already been placed into the category.
The criterion for placement of documents (images, objects, etc.) into categories by humans may be predictive of future placements in ways that are not measured by the IR performance measures. The next placement of a document by a human may not be retrieved correctly even by what was judged to be a perfect IR query. This might or might not be traceable to some reason that the user can specify in terms that correspond to the representational data.