4.5 Evaluation of fragment selection

4.5.3 Ideal fragment selection

So far, we have seen what can be measured from a collection of indexed fragments, along with some figures from the evaluation of the example articles. We next focus on grading the measured values. The starting point is that if we want to index all full-text content without too much data-oriented content, 100% coverage does not describe an ideal fragment collection. We then ask what the coverage of an ideal fragment collection is, or whether we even need to know it. As expected, no single right answer can be given to these questions.

The ultimate goal of the evaluation is to find out which values of the absolute measures yield the best IR performance with an arbitrary document collection and what is required of the related IR techniques. However, a sufficient goal for now is to find parameter values or settings that, first, do not prevent the system from achieving the ideal IR performance and, second, help the system approach it. Even this goal may turn out to be utopian in light of two simple claims: 1) the fewer fragments we search, the easier it is to find the relevant ones, and 2) the fewer fragments we index, the fewer queries will have access to the ideal set of answers. In any case, the best performance is expected from fragments that comply with the qualities of an XML Full-Text fragment as defined in Section 3.2.1. These qualities cannot be measured with any simple metric because their evaluation is based on human perception.

Assuming that the ideal fragment collection can be selected from one document collection, the absolute values can be measured and either generalised or regularised in order to be applicable to other collections. However, not all values generalise in all cases. For example, the number of characters is dependent on the character encoding, and the number of words is specific to the language of the documents. Moreover, document collections are different from each other in that the proportion of full-text and data varies from one document and one collection to another. If the properties of a collection can be regularised, we can define a set of rules by which the values of the same properties can be inferred for other collections.
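The dependence of size measures on representation can be illustrated with a minimal sketch. The function name `size_measures` and the sample string are hypothetical and not part of the evaluation framework; whitespace tokenisation is only a stand-in for a real, language-specific word counter.

```python
import unicodedata


def size_measures(text: str) -> dict:
    """Return several size measures for the same piece of text."""
    nfc = unicodedata.normalize("NFC", text)  # precomposed characters
    nfd = unicodedata.normalize("NFD", text)  # base letters + combining marks
    return {
        "chars_nfc": len(nfc),
        "chars_nfd": len(nfd),
        "bytes_utf8": len(nfc.encode("utf-8")),
        "bytes_utf16": len(nfc.encode("utf-16-le")),
        # Whitespace splitting is an assumption; it is not suitable for
        # languages that do not separate words with spaces.
        "tokens": len(nfc.split()),
    }


# Hypothetical sample: "naive cafe" written with two accented letters.
sample = "na\u00efve caf\u00e9"
print(size_measures(sample))
```

The same ten visible characters yield different counts depending on the normalisation form and the encoding, which is why a size threshold tuned on one collection may have to be regularised before it is applied to another.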

Measuring performance and the overall effects on the quality of Information Retrieval is more complicated because other techniques are always involved, e.g. clustering, query processing, scoring, score combination, relevance feedback, and query expansion. Because division into fragments is one of the first steps of indexing XML, comparing different fragment collections is difficult: the subsequent steps cannot always be repeated identically.

Performance at different levels of granularity is especially challenging to compare. Some precision and recall metrics favour exhaustive answers whereas others prioritise the specificity of the answers. When the size of a relevant answer plays a role in the evaluation metric, the choice of granularity sets intrinsic boundaries on the achievable scores. If a score combination method is applied so that the granularity of the returned fragments may be decided at the time of query processing, a fixed granularity is not a problem. However, after score combination, the fragment collection is no longer evaluated on its own merits.
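How granularity bounds a size-sensitive score can be shown with a toy calculation. The measure below (the share of relevant text in the smallest fragment that fully contains the relevant span) is an assumption made for illustration only; it is not one of the official evaluation metrics, and the character offsets are invented.

```python
def best_specificity(relevant_span, fragments):
    """Upper bound on specificity for a fixed fragment collection.

    relevant_span and each fragment are (start, end) character offsets.
    Only fragments that contain the whole relevant span are considered.
    """
    r_start, r_end = relevant_span
    relevant_len = r_end - r_start
    best = 0.0
    for f_start, f_end in fragments:
        if f_start <= r_start and r_end <= f_end:
            best = max(best, relevant_len / (f_end - f_start))
    return best


# Hypothetical 4000-character document with a 200-character relevant passage.
relevant = (1200, 1400)
coarse = [(0, 4000)]                                          # one big fragment
fine = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]  # four fragments

print(best_specificity(relevant, coarse))
print(best_specificity(relevant, fine))
```

Under this toy measure, no ranking function can score above the bound set by the fragment boundaries, so two systems indexed at different granularities are not competing for the same maximum.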

CHAPTER 5

Fragment expansion

When the indexed fragments have been selected but not yet isolated from the document collection, we can apply the techniques of fragment expansion to the selected fragments. The purpose of fragment expansion is to help an XML search engine sort the fragments by relevance with respect to any given query. A good technique aids in the task by making the fragments more self-descriptive. Two different approaches are studied: 1) how to find metadata in the contextual information which is available in the document but not in the fragment itself, and 2) how to locate and weight the descriptive content inside the fragment. The XML structure of the fragments plays an essential role in both approaches, which makes fragment expansion the last chance to utilise the fragment context information and the XML markup before the fragments are separated into independent entities and converted into another format, e.g. plain text.
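The first approach can be sketched as follows, under the assumption that useful context is found in the titles of the ancestor elements of a fragment. The element names (`article`, `sec`, `title`, `p`), the sample document, and the function `ancestor_titles` are hypothetical; a heterogeneous collection would not guarantee any of these names.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are assumptions.
DOC = (
    "<article><title>Indexing XML</title>"
    "<sec><title>Fragment expansion</title>"
    "<p>Context helps ranking.</p></sec></article>"
)


def ancestor_titles(root, fragment, title_tag="title"):
    """Collect the titles of the ancestors of a fragment, outermost first."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    titles = []
    node = parents.get(fragment)
    while node is not None:
        title = node.find(title_tag)  # direct children only
        if title is not None and title.text:
            titles.append(title.text.strip())
        node = parents.get(node)
    return list(reversed(titles))


root = ET.fromstring(DOC)
fragment = root.find("./sec/p")
print(ancestor_titles(root, fragment))
```

The collected strings could be attached to the fragment before it is separated from its document, because the ancestor chain is no longer available afterwards.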

This chapter is organised as follows. The methodology that XML markup supplies to the content provider for adding style and structure to the written text is studied in Section 5.1. Section 5.2 is devoted to analysing the XML structure. Whether unsupervised algorithms for the analysis are feasible is also investigated. In Section 5.3, we consider the possibilities regarding different weighting schemes in the vector space model.

5.1 Markup semantics

The inspiration for finding meanings in the markup of structured documents comes from similar research on natural language where various interpretations of the semantics of style have been proposed [Lan70]. For example, by analysing the semantic components of the sentential structures we can classify texts into distinct stylistic categories [LS81]. In a similar fashion, we strive to understand how semantics can be encoded in XML documents and how to decode the information written “between the tags”. Instead of concentrating on the lexical and grammatical categories of a natural language, we now collect statistics of different types of nodes and node structures of markup languages.
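Collecting such statistics can be sketched in a few lines. The choice of statistics (element frequency and the amount of text directly owned by each element type), the function `node_statistics`, and the sample markup are assumptions made for illustration, not the measures defined later in this chapter.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def node_statistics(xml_text):
    """Count elements by name and the text characters they directly contain."""
    root = ET.fromstring(xml_text)
    element_counts = Counter()
    text_chars = Counter()
    for elem in root.iter():
        element_counts[elem.tag] += 1
        # Text before the first child belongs to the element itself.
        text_chars[elem.tag] += len((elem.text or "").strip())
        # Text after each child (its tail) also belongs to the parent.
        for child in elem:
            text_chars[elem.tag] += len((child.tail or "").strip())
    return element_counts, text_chars


# Hypothetical fragment mixing plain text with inline emphasis.
counts, chars = node_statistics(
    "<p>Plain <em>stressed</em> text and <em>more</em>.</p>"
)
print(counts)
print(chars)
```

Element types that occur often and hold only short pieces of text inside running text are candidates for inline, style-like markup, whereas types that hold long text directly resemble structural containers.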

In document Indexing Heterogeneous XML for Full-Text Search (pages 86-90)