
7.3.4 Ideal answer sets

An ideal answer set, as intended in this thesis, can be created for any granularity by sorting the collection of disjoint fragments by relevance, after which the fragments are in the ideal order. The ranking is not ideal in the absolute sense, as better evaluation scores may be achieved by allowing fragments of various levels of granularity to appear in the same ranking. Nevertheless, the ideally ranked list of fragments represents the best result set that can be returned without aggregating the indexed fragments into bigger answers. Consequently, the ideal result sets are well suited to comparing the IR performance of fragment collections at a single granularity level because the effects of indexing techniques and similarity computation are eliminated. We can thus study whether we gain anything by discarding data-oriented content in the index, or whether we lose relevant content by requiring full-text content of the indexed fragments. Another subject of interest is to see how close to perfect we can actually get in a realistic setting.

4 Source code available at http://sourceforge.net/projects/evalj/

For both the baseline and full-text collection of each granularity, an answer set simulating an ideal run is built from the assessments.

The two-dimensional relevance assessments are first quantised into a one-dimensional scale using the generalised quantisation function f_gen(e, s). Given the fixed set of fragments, those with a non-irrelevant assessment are then sorted with the quantised assessment value as the primary sort key and the fragment size as the secondary sort key, so that the most relevant and the biggest fragments come first. Last, the top 1,500 fragments for each topic are listed in the ideal answer set, although, for most of the 32 topics, there are fewer than 1,500 disjoint answers assessed as non-irrelevant. After all the relevant answers have been used, the rest of the list is filled with irrelevant answers sorted into descending order by size.
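This construction can be summarised with a short sketch. The names below (Fragment, f_gen, ideal_answer_set) are hypothetical, and the value table of f_gen is assumed to follow the common INEX-style generalised quantisation rather than being quoted from this thesis; only the sorting and padding logic is taken from the description above.

```python
# Minimal sketch of the ideal answer set construction, under the assumptions
# stated above. Quantised value is the primary sort key, fragment size the
# secondary key (both descending); the list is padded up to 1,500 answers
# with irrelevant fragments in descending order of size.

from dataclasses import dataclass

@dataclass
class Fragment:
    path: str        # identifier of the disjoint fragment
    size: int        # fragment size in characters
    e: int           # exhaustivity assessment (0-3)
    s: int           # specificity assessment (0-3)

def f_gen(e: int, s: int) -> float:
    """Generalised quantisation; the exact mapping is an assumption
    (a common INEX-style table), not quoted from the thesis."""
    table = {(3, 3): 1.00,
             (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
             (1, 3): 0.50, (2, 2): 0.50, (2, 1): 0.50,
             (1, 2): 0.25, (1, 1): 0.25}
    return table.get((e, s), 0.0)   # unassessed or (0, 0) counts as irrelevant

def ideal_answer_set(fragments: list[Fragment], cutoff: int = 1500) -> list[Fragment]:
    relevant = [f for f in fragments if f_gen(f.e, f.s) > 0.0]
    irrelevant = [f for f in fragments if f_gen(f.e, f.s) == 0.0]
    # Most relevant and biggest fragments first.
    relevant.sort(key=lambda f: (f_gen(f.e, f.s), f.size), reverse=True)
    # Pad with irrelevant answers, biggest first.
    irrelevant.sort(key=lambda f: f.size, reverse=True)
    return (relevant + irrelevant)[:cutoff]
```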

The resulting ideal answer sets are only ideal with regard to the strict quantisation of inex_eval_ng, and nearly ideal with regard to the generalised quantisation, GR, and PRUM. For example, the ideal ranking for the generalised quantisation of inex_eval_ng requires the sort key to be defined as the product

size × f_gen(e, s),

whereas the ideal ranking for GR with P(Re) only requires a small change: when answers where (e, s) = (2, 3) are given a quantised value smaller than 0.75, the order is ideal. For the PRUM metric, however, the creation of an ideal answer set would require the computation of quantised values for the near misses, too, which depend on the elements within a short navigational distance.
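To make the difference concrete, the metric-specific ideal rankings amount to alternative sort keys over the same quantised assessments. The sketch below reuses the hypothetical Fragment and f_gen names from the previous listing; it illustrates the keys named in the text and is not the thesis implementation.

```python
# Alternative sort keys for metric-specific ideal rankings (illustrative only;
# Fragment and f_gen are the hypothetical definitions from the previous sketch).

def key_default(f):
    # Key actually used for the ideal answer sets: quantised value, then size.
    return (f_gen(f.e, f.s), f.size)

def key_inex_eval_ng_gen(f):
    # Ideal ranking for the generalised quantisation of inex_eval_ng:
    # the product size * f_gen(e, s), as stated in the text.
    return f.size * f_gen(f.e, f.s)

def key_gr(f):
    # GR with P(Re): same as the default key, except that (e, s) = (2, 3) must
    # receive a quantised value below 0.75; 0.70 is an arbitrary placeholder.
    q = 0.70 if (f.e, f.s) == (2, 3) else f_gen(f.e, f.s)
    return (q, f.size)

# Usage: fragments.sort(key=key_inex_eval_ng_gen, reverse=True)
```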

For metrics other than inex_eval_ng with strict quantisation, the ideal answer sets thus represent systems of superior quality rather than systems that yield the maximum scores. If we built a different ideal run for each metric, we could not properly compare the results with each other, as they would describe different systems. Hence, we settle for a single method for creating a perfect ranking of the indexed fragments.

The need to develop a new method for computing the ideal answer set is a point that can be criticised. Other methods have already been proposed in order to evaluate the behaviour of different evaluation metrics in the works of de Vries5 and Kazai et al.

5 http://homepages.cwi.nl/~arjen/INEX/metrics2004.html

[KL05]; however, none has become a standard. New metrics proposed each year bring new assumptions about the user models and user behaviour, which in turn imply new requirements for the ideal system for XML retrieval. Since we are not trying to evaluate the evaluation metrics themselves, but only evaluate different configurations of an XML retrieval system, we define an ideal system as one that achieves either the best or a reasonably high score with any suitable metric. Moreover, the fragment collections that are ranked in the upcoming tests consist of disjoint fragments, which makes the computation rather simple compared with the earlier methods that also have to sort overlapping fragments into an ideal order. How different metrics deal with overlap is an additional factor that, fortunately, can be ignored in this thesis.

The completeness of the ideal answer sets is directly proportional to that of the assessments, which has been a tempting target for criticism in past years [HKW+04]. It is more than possible that there are relevant answers in the document collection that did not make it into the top 100 of any participant's result list. These answers may not have been assessed at all, in which case they are considered irrelevant in the evaluation. Nevertheless, if an answer is not included in the top 100 of any of the 56 officially submitted answer sets, it is highly likely that the answer actually is irrelevant. Moreover, even if a few relevant answers were missing, the ideal answer sets still give us a perspective on what could theoretically be achieved.

The ideal parameters for selecting the indexed fragments can be estimated by comparing the evaluation results of different ideal runs. The ideal parameters vary from one query to another, however, so that whichever fragment collection is valued best for one query might not be best for another. The evaluations differ especially when, for instance, small answers (10–500 characters) are assessed as highly relevant for one query but completely irrelevant for another. Bigger answers are usually not systematically discriminated against by the assessors. The average scores are somewhat more stable, and any general conclusions should be drawn from them in order to avoid tuning the parameters towards specific queries. Another thing to consider is that, when working with ideal runs, conclusions drawn by comparing results in an ideal setting might not generalise to a realistic setting.