
Titles as fragment descriptors

The nearest preceding title was appended to 81,554 out of 86,386 fragments (94.41%) in the fragment collection Base1Ti, and to 235,049 out of 236,630 fragments (99.33%) in the collection Base4Ti.
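The association step can be sketched as a single pass over the document in document order. The following is a minimal illustration only, not the implementation evaluated here: the element names (`title`, `sec`, `p`, `article`), the sample document, and the function name `append_nearest_titles` are all assumptions made for the example. The last title seen before a fragment root corresponds roughly to what the XPath expression `preceding::title[1]` would select.

```python
import xml.etree.ElementTree as ET


def append_nearest_titles(root, fragment_tags, title_tag="title"):
    """Append the nearest preceding title to each fragment.

    Walks the tree in document order and remembers the last title seen.
    Returns (tag, title, expanded_text) tuples; title is None when no
    title precedes the fragment root.
    """
    last_title = None
    expanded = []
    for elem in root.iter():  # pre-order, i.e. document order
        if elem.tag == title_tag:
            last_title = " ".join(" ".join(elem.itertext()).split())
        elif elem.tag in fragment_tags:
            text = " ".join(" ".join(elem.itertext()).split())
            if last_title is not None:
                text = f"{text} {last_title}"
            expanded.append((elem.tag, last_title, text))
    return expanded


# Hypothetical sample document; the tag names are assumptions.
SAMPLE = """<article><fm><title>Clustering survey</title></fm>
<bdy><sec><title>K-means</title><p>Text one</p></sec>
<sec><title>Applications</title><p>Text two</p></sec></bdy></article>"""

for tag, title, text in append_nearest_titles(ET.fromstring(SAMPLE), {"article", "p"}):
    print(tag, "|", title, "|", text)
```

In this sketch a fragment rooted at the article element has no preceding title and is therefore left unchanged, whereas each paragraph receives the title of its enclosing section.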

The only fragments with no additional title added were those that comprise a whole article element. The evaluation scores of the fragment collections where titles were added are presented in Table 8.9 with the comparative scores of their counterparts without the additional titles.

By comparing the results at the two granularity levels, we observe quite contradictory scores for the collections where the fragment expansion techniques other than appending the titles are applied (BaseXFtEmLi). According to the strict quantisation of inex_eval_ng, this particular configuration is best for the Base1 granularity, whereas for the smaller fragments of Base4 granularity, it is hardly better than the baseline. The generalised quantisation leads to fewer surprises in the relative system rankings: the baseline collections have clearly the lowest scores, whereas the ‘All’ collections have clearly the highest scores for each granularity.

| Division | strict -o | generalised -o | GR | PRUM |
|---|---|---|---|---|
| Base1 | 0.0815 | 0.0591 | 21.0692 | 1.0231 |
| Base1Ti | 0.0956 | 0.0631 | 21.2052 | 1.2752 |
| Base1FtEmLi | 0.0982 | 0.0665 | 21.1706 | 1.3478 |
| Base1All | 0.0954 | 0.0686 | 21.1940 | 1.2952 |
| Base4 | 0.1026 | 0.0729 | 27.1790 | 1.7181 |
| Base4Ti | 0.1081 | 0.0751 | 27.3364 | 1.8464 |
| Base4FtEmLi | 0.1033 | 0.0769 | 27.0489 | 2.2274 |
| Base4All | 0.1170 | 0.0831 | 27.6774 | 2.5398 |

Table 8.9: The descriptive value of titles measured in absolute evaluation scores.

| Division | strict -o | generalised -o | GR | PRUM |
|---|---|---|---|---|
| Base1Ti | +17.3 | +6.8 | +0.6 | +24.6 |
| Base1All | -2.9 | +3.2 | +0.1 | -3.9 |
| Base4Ti | +5.4 | +3.0 | +0.6 | +7.5 |
| Base4All | +13.3 | +8.1 | +2.3 | +14.0 |

Table 8.10: Relative improvement (%) of associating titles with the indexed fragments.

The relative effect of appending title words to the fragments is shown in Table 8.10. As the only fragment expansion technique, including the contents of the title elements in the fragments of the Base1 collection seems to have a strong positive effect on the average precision whereas the positive effect is more modest on the Base4 collection. Together with other fragment expansion techniques, the effect of the granularity is reversed: the other techniques seem to make the titles unnecessary when the fragments are big (Base1All).
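The percentages in Table 8.10 can be recomputed from the absolute scores in Table 8.9. The sketch below assumes that each collection with titles is compared with its counterpart without titles, that is, BaseXTi with BaseX and BaseXAll with BaseXFtEmLi; this pairing is inferred from the arithmetic rather than stated explicitly, and the names `SCORES`, `COUNTERPART`, and `relative_improvement` are introduced only for the example.

```python
# Absolute scores copied from Table 8.9:
# (strict -o, generalised -o, GR, PRUM)
SCORES = {
    "Base1": (0.0815, 0.0591, 21.0692, 1.0231),
    "Base1Ti": (0.0956, 0.0631, 21.2052, 1.2752),
    "Base1FtEmLi": (0.0982, 0.0665, 21.1706, 1.3478),
    "Base1All": (0.0954, 0.0686, 21.1940, 1.2952),
    "Base4": (0.1026, 0.0729, 27.1790, 1.7181),
    "Base4Ti": (0.1081, 0.0751, 27.3364, 1.8464),
    "Base4FtEmLi": (0.1033, 0.0769, 27.0489, 2.2274),
    "Base4All": (0.1170, 0.0831, 27.6774, 2.5398),
}

# Assumed pairing: collection with titles -> counterpart without titles.
COUNTERPART = {
    "Base1Ti": "Base1",
    "Base1All": "Base1FtEmLi",
    "Base4Ti": "Base4",
    "Base4All": "Base4FtEmLi",
}


def relative_improvement(with_titles, without_titles):
    """Percentage change per metric, rounded to one decimal."""
    return tuple(
        round(100.0 * (w - b) / b, 1)
        for w, b in zip(with_titles, without_titles)
    )


for name, base in COUNTERPART.items():
    print(name, relative_improvement(SCORES[name], SCORES[base]))
```

Under this pairing, the four rows agree with Table 8.10 to one decimal place.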

There is a strong agreement on this behaviour among the metrics inex_eval_ng and PRUM with the strict quantisation, which can be explained as follows.

First, as shown in Figure 6.4, more than half of the fragments in the Base1 granularity are section and front matter elements where the nearest preceding title is the title of the article. Second, the most relevant section elements at the Base1 granularity are more likely to contain the title words than the smaller fragments at Base4 granularity. Given that, appending the titles does not add any new terms to the most relevant sections; it merely increases the corresponding term frequencies. Third, as sibling elements are considered equal, the words in the article title are also added to the less relevant sections, this time as new terms, which causes the df values of the title words to increase and the corresponding term weights throughout the collection to decrease. Fortunately, the less relevant sections are nonetheless relevant according to the generalised quantisation, which explains the 3.2% improvement originating in the titles in the Base1All collection. Fourth, creating associations to the referred content also adds article titles to the fragments, in particular when the references point to bibliographical entries. The proliferation of the title words makes them look more common than they actually are, which in turn causes their descriptive value to deteriorate.
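The third point can be illustrated with a toy collection. The sketch below assumes a plain tf·idf weight with idf = ln(N/df), which is a simplification chosen for the example and not necessarily the weighting scheme used in the experiments; the fragments, the title word, and the function names are all hypothetical.

```python
import math
from collections import Counter


def document_frequencies(fragments):
    """Number of fragments in which each term occurs at least once."""
    df = Counter()
    for terms in fragments:
        df.update(set(terms))
    return df


def tf_idf(term, fragment, fragments):
    """Plain tf * ln(N / df); an assumed, simplified weighting scheme."""
    df = document_frequencies(fragments)[term]
    return fragment.count(term) * math.log(len(fragments) / df)


# Four sibling sections of one article; only the first mentions the title word.
article = [
    ["clustering", "centroid", "distance"],
    ["image", "compression"],
    ["speech", "coding"],
    ["video", "codec"],
]
# Four fragments from other articles.
others = [["parsing"], ["indexing"], ["routing"], ["caching"]]
title = ["clustering"]

before = article + others
after = [fragment + title for fragment in article] + others

w_before = tf_idf("clustering", before[0], before)
w_after = tf_idf("clustering", after[0], after)
print(w_before, w_after)
```

In this toy setting the document frequency of the title word rises from one to four, so the weight of the word in the most relevant section decreases even though its term frequency doubles.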

Why these problems do not have a negative effect on the results at the Base4 granularity can be explained by looking into the composition of the fragment index. The fragments of the Base4 collection (Figure 6.3) are considerably smaller than the maximum size of 20,000 characters of the Base1 granularity, the most common fragment root element being p (paragraph). The associated titles are now found at the beginning of the sections and subsections, and as they describe smaller portions of text, they are often more specific than the article titles which have to be general enough to label the whole article. Moreover, linked content is associated with less than 40% of the fragments at the Base4 granularity, whereas the corresponding figure was as high as 65% for the Base1 granularity.

We can thus conclude that when the fragments are smaller, e.g. 8,000 characters or less in size, we are less likely to overdo the fragment expansion by associating the most useful search terms with too many fragments and inflating their value.

The connection between titles and the linked content is obvious when we compare the percentages in Table 8.10 to those of links presented in Table 8.5 in Section 8.3. While it is clear that the two fragment expansion techniques interfere with each other, we cannot pick out either one of them as the scapegoat. The evaluation scores only show that the best configuration for fragment expansion is highly dependent on the fragment size and the different aspects of the retrieval task which are reflected in the evaluation metrics.

Figure 8.7: Curves demonstrating the effect of associating fragments with related titles.

The precision at low recall levels is shown in Figure 8.7. By comparing the curves to the corresponding values of average precision in the legend, we observe that they do not seem like a good match at all. What is not shown in the zoomed figures is that the appended titles affect the curves all the way to the 1,500th result for each query, thus improving the recall more than the other fragment expansion techniques. The improvement in precision may be more modest at low recall levels because the importance of the title words is diluted by artificially increasing their document frequencies. Statistical document frequencies should be a good remedy for this, as they also make the methods scalable to the incremental indexing of an infinite number of documents.

<inex_topic topic_id="124" query_type="CO" ct_no="125">
<title>application, algorithm, +clustering, +k-means, +c-means, "vector quantization", "speech compression", "image compression", "video compression" </title>
<description>find elements about clustering procedures, particularly k-means, aka c-means, aka Generalized Lloyd algorithm, aka LBG, and their application to image speech and video compression. </description>
<narrative>interested only in application, and or the algorithm or procedure applied to speech video or image compression. </narrative>
<keywords>vector, quantization, VQ, LVQ, "Generalized Lloyd", GLA, LBG, "cluster analysis", clustering, "image compression", "video compression", "speech compression"</keywords>
</inex_topic>

Figure 8.8: The official Content Only topic #124 of the INEX 2003 initiative.
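To make the structure of the topic in Figure 8.8 concrete, the following sketch splits the title element into individual query terms. It assumes that terms are separated by commas, that a leading + marks an emphasised term, and that double quotes delimit a phrase; the function name `parse_topic_title` and the output fields are hypothetical and are not part of any INEX tool.

```python
import xml.etree.ElementTree as ET

# Title element copied from Figure 8.8; other elements omitted for brevity.
TOPIC_XML = """<inex_topic topic_id="124" query_type="CO" ct_no="125">
<title>application, algorithm, +clustering, +k-means, +c-means,
"vector quantization", "speech compression",
"image compression", "video compression" </title>
</inex_topic>"""


def parse_topic_title(topic_xml):
    """Split a CO topic title into terms with emphasis and phrase flags."""
    title = ET.fromstring(topic_xml).findtext("title", default="")
    terms = []
    for raw in title.split(","):
        item = " ".join(raw.split())  # normalise whitespace and newlines
        if not item:
            continue
        emphasised = item.startswith("+")
        item = item.lstrip("+")
        phrase = item.startswith('"') and item.endswith('"')
        terms.append(
            {"text": item.strip('"'), "emphasised": emphasised, "phrase": phrase}
        )
    return terms


for term in parse_topic_title(TOPIC_XML):
    print(term)
```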

Associating related titles with the indexed fragments is the most influential fragment expansion technique presented in this thesis as it affects the term frequencies of nearly all the indexed fragments and the document frequencies of the most descriptive index terms.

As a technique complementing the others, it is especially important to the relatively small fragments. The presented results imply high prospects for future research on heuristics for finding titles in heterogeneous XML documents without assuming the names of the title elements.