
Each test run can be seen as an independent system that first computes similarities between the test queries and the fragments of a given collection, and then returns the answers it considers the most relevant. In this section, we present the run specifications, which define how the indices are created from the fragment collection and how the queries are processed. At the outcome of a run, each query is answered with a ranked list of XML fragments sorted by relevance. These result lists form the basis for the evaluation of the test run.

7.2.1 Queries

The set of test queries is based on the 32 “Content-Only” (CO) topics of the INEX 2003 initiative, for which the corresponding relevance assessments are available. Content-Only topics are traditional in Information Retrieval. They contain a combination of keywords and keyphrases specifying the topic of the user’s information need; a keyphrase consists of at least two keywords. The CO topics contain no structural hints or constraints on the XML structure of the answers, which makes the actual queries inherently independent of any document type. By default, any fragment is a potential answer regardless of its tag name or context. Only the content of the answers matters when their relevance to the topic is judged. An example of a CO topic is presented in Section 8.6, where the corresponding results are also studied in detail.

Converting the topics into queries is rather straightforward. After the explicitly marked phrases are recognised, the topic is represented as a vector. Only the title and keywords parts of the topic are used.
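A minimal sketch of the conversion is shown below; the record fields title and keywords, and the use of quotation marks to mark phrases, are illustrative assumptions rather than the actual INEX topic syntax.

```python
# A minimal sketch of the topic-to-query conversion; the field names and
# the quoting convention for marked phrases are illustrative assumptions.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to"}  # abbreviated

def topic_to_query(topic):
    """Build a query from the title and keywords parts of a CO topic."""
    text = topic["title"] + " " + topic["keywords"]
    # Explicitly marked phrases (here: quoted spans) are recognised first.
    phrases = re.findall(r'"([^"]+)"', text)
    text = re.sub(r'"[^"]+"', " ", text)
    # The remaining words become the single-word dimensions of the vector.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    return Counter(words), phrases
```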

7.2.2 Baseline process

The baseline runs should be simple in that they introduce as few new factors affecting the results as possible. The number of variables should be minimal in order to make the results easy to interpret and compare. Moreover, reduced optionality helps eliminate possible side-effects.

As a first step, two separate inverted indices are built from the fragment collection to be tested. A word index is created after punctuation and stopwords are removed and the remaining words are stemmed with the Porter algorithm [Por80]. The phrase index is based on Maximal Frequent Sequences (MFS) [AM99]. Maximal phrases of two or more words are stored in the phrase index if they occur in seven or more fragments. The threshold of seven comes from the computational complexity of the algorithm: although lower values would produce more MFSs, the computation would take too long to be practical. More details concerning the configuration of the phrase index are included in the PhD thesis of Antoine Doucet [Dou05].
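As an illustration of the first of these indices, the sketch below builds a word index under the specifications above; the fragment representation and the abbreviated stopword list are assumptions, and NLTK’s PorterStemmer stands in for the Porter algorithm [Por80]. The MFS-based phrase index is omitted, as its construction follows the algorithm detailed in [Dou05].

```python
# A sketch of the word-index construction, assuming fragments arrive as
# (fragment_id, text) pairs; the stopword list is abbreviated on purpose.
import re
from collections import defaultdict
from nltk.stem import PorterStemmer   # one implementation of [Por80]

STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to"}

def build_word_index(fragments):
    """Map each stem to the set of fragments it occurs in."""
    stemmer = PorterStemmer()
    index = defaultdict(set)
    for frag_id, text in fragments:
        for word in re.findall(r"[a-z]+", text.lower()):  # punctuation dropped
            if word not in STOPWORDS:
                index[stemmer.stem(word)].add(frag_id)
    return index
```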

Besides the indexed fragments, the queries, too, are represented as normalised vectors. As we have two indices, we initially compute two relevance scores for each fragment, based on the cosine product: one indicates the similarity of the query terms to the terms in the fragment, whereas the other indicates phrase similarity. The Retrieval Status Value (RSV) combines these scores with equal weights. After all the fragments are ranked by their RSV, the top 1,500 fragments for each query are included in the result lists, which consist of disjoint fragments of a fixed granularity.
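As a worked illustration, the following sketch computes the two cosine similarities and their equal-weight combination; representing the normalised vectors as plain dictionaries of term weights is an assumption made for brevity.

```python
# A minimal sketch of the RSV computation; the 0.5/0.5 weighting reflects
# the equal weight given to word similarity and phrase similarity.
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as dicts of term weights."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def rsv(query_words, query_phrases, frag_words, frag_phrases):
    word_score = cosine(query_words, frag_words)
    phrase_score = cosine(query_phrases, frag_phrases)
    return 0.5 * word_score + 0.5 * phrase_score
```

Ranking all fragments by this value and truncating the list at 1,500 answers per query then yields the result lists described above.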

7.2.3 Additional options

The baseline process is a heavily compromised version of a full XML retrieval system. One of the biggest compromises is the fixed granularity of the answers. Dynamic granularity requires nothing more than combining the RSV scores of sibling fragments, after which the fragments can be merged into bigger ones. For example, instead of returning three relevant subsections, it might be more appropriate to return the entire section. This would also enable us to set the granularity of indexed fragments to the absolute minimal size instead of settling for the compromise of indexing answer-sized fragments.
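One conceivable realisation of this idea is sketched below; the tree interface and the plain sum as the sibling combination rule are illustrative assumptions, as the actual combination method is deliberately left open here.

```python
def select_answer(node, rsv):
    """Choose between a fragment and its parent by combining sibling RSVs.

    `node` is assumed to expose a `children` list; the plain sum below is
    only one conceivable way of combining the scores of the siblings.
    """
    if not node.children:
        return node, rsv(node)
    picks = [select_answer(child, rsv) for child in node.children]
    combined = sum(score for _, score in picks)   # combine sibling scores
    best_child, best_score = max(picks, key=lambda p: p[1])
    # Return the whole section when its parts together outscore any one part.
    return (node, combined) if combined > best_score else (best_child, best_score)
```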

There is no standard score combination method that could be used in a baseline process. As any such method would have a strong impact on the final results returned to the user, the test runs are expected to be more useful without any score combination. Only after the fragment selection algorithm and the fragment expansion techniques have been thoroughly tested will it be the right time to add score combination methods to the test run specification.

Query expansion with words from the answers improves the overall quality even when only the most highly ranked answers are used, provided that the answers at the top ranks are relevant [Voo94]. The sooner the precision of the results drops, the greater the potential gain that query expansion can offer. At this point, however, we first want to find out how to attain the best possible results without query expansion, so that the maximum benefit can be gained from it later.
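For illustration, a simple form of such expansion, in the spirit of pseudo-relevance feedback, might look like the following sketch; the parameters top_k and n_terms are hypothetical choices rather than values used in these runs.

```python
# An illustrative sketch of query expansion from top-ranked answers;
# top_k and n_terms are hypothetical parameters, not values from the runs.
from collections import Counter

def expand_query(query_vector, ranked_answer_terms, top_k=10, n_terms=5):
    """Add the most frequent terms of the top answers to the query vector."""
    pool = Counter()
    for terms in ranked_answer_terms[:top_k]:  # term lists of the top answers
        pool.update(terms)
    for term, _ in pool.most_common(n_terms):
        query_vector[term] = query_vector.get(term, 0) + 1
    return query_vector
```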

Phrase detection could also be optimised by tuning the parameters for creating MFSs. Although this would lead to better evaluation scores, the main purpose of the tests is to find out how the indexed fragments should be selected and whether we can take advantage of the document markup with fragment expansion techniques. Optimisation of the process should thus not change the observations that can be made about the tested subjects. Therefore, a less than optimal phrase index serves our purposes well.