

8.4 Emphasis on the emphasised

To illustrate the magnitude of the potential influence of structure-based phrase detection and phrase weighting, we present the number of qualified inline elements in Table 8.6. The parameters were set to a minimum of three (3) characters in at least one Text node child and a minimum of five (5) characters in at least one Text node sibling. By definition, EntityReference nodes may not appear in the Simple inline elements, but the frequencies show that this additional condition disqualifies only 1.2–1.3% of all the qualified inline elements.
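The two length parameters can be sketched as follows. This is a minimal illustration only, assuming Python's standard `xml.dom.minidom`; the helper names `is_qualified` and `first_element` are hypothetical, the full definition of a qualified inline element is not reproduced, stripping whitespace before counting characters is an assumption, and the EntityReference condition that separates the Simple elements from the others is left out.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

MIN_CHILD_CHARS = 3    # at least one Text node child of this length
MIN_SIBLING_CHARS = 5  # at least one Text node sibling of this length


def text_lengths(nodes):
    """Lengths of the Text nodes in `nodes`, whitespace stripped (assumption)."""
    return [len(n.data.strip()) for n in nodes if n.nodeType == Node.TEXT_NODE]


def is_qualified(element):
    """Hypothetical check covering only the two length parameters."""
    parent = element.parentNode
    if parent is None:
        return False
    siblings = [n for n in parent.childNodes if n is not element]
    child_ok = any(n >= MIN_CHILD_CHARS for n in text_lengths(element.childNodes))
    sibling_ok = any(n >= MIN_SIBLING_CHARS for n in text_lengths(siblings))
    return child_ok and sibling_ok


def first_element(xml, tag):
    """Parse `xml` and return the first element named `tag` (helper for the example)."""
    return minidom.parseString(xml).getElementsByTagName(tag)[0]


if __name__ == "__main__":
    sample = "<p>The <it>critical path</it> method is used.</p>"
    print(is_qualified(first_element(sample, "it")))
```

In the sample, the element name `it` is made up for the example; any inline element with enough text inside it and next to it would pass this partial check.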

Division            Simple (5,3)   Others (5,3)   All (5,3)
Base1                    532,475          6,750     539,225
Base1Ft                  523,853          6,743     530,596
Base4                    520,965          6,643     527,608
Base4Ft                  511,866          6,636     518,502
Whole collection         544,495          7,226     551,721

Table 8.6: Qualified inline elements in the tested collections.

Comparing the numbers of the baseline collections to those of the full-text oriented collections leads to a noteworthy observation: by discarding data-oriented content we also lose many qualified inline elements, 8,622 simple inline elements in the Base1Ft collection and 9,099 in the Base4Ft collection. The amounts are not alarming, though, and we can still consider mixed content a good indicator of full-text content. For one thing, some of the lost inline elements are left out because their source fragment is too small, not because it looks too data-oriented. For another, compared to the 5–6% reduction in the total size of the collection, the 1.6–1.8% reduction in the number of qualified inline elements is within reasonable limits.
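The arithmetic behind these figures can be checked with a short sketch. The counts are copied from the Simple column of Table 8.6; the function name `reduction` is only illustrative.

```python
# Counts of simple inline elements, copied from Table 8.6.
SIMPLE = {
    "Base1": 532_475,
    "Base1Ft": 523_853,
    "Base4": 520_965,
    "Base4Ft": 511_866,
}


def reduction(baseline, full_text):
    """Return (elements lost, loss as a percentage of the baseline count)."""
    lost = SIMPLE[baseline] - SIMPLE[full_text]
    return lost, 100.0 * lost / SIMPLE[baseline]


if __name__ == "__main__":
    for base, ft in [("Base1", "Base1Ft"), ("Base4", "Base4Ft")]:
        lost, pct = reduction(base, ft)
        print(f"{ft}: {lost:,} elements lost ({pct:.2f}% of {base})")
```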

In order to see real phrases that are emphasised in the INEX document collection, we study Figure 8.4, which shows a selection of qualified inline elements, each preceded by its frequency. The seemingly best phrases are found in those inline elements that have a relatively low frequency in the whole collection. The phrases that benefit most from heavier weighting are such that the words of the phrase occur in quite many fragments, but they are emphasised in

Figure 8.4: Qualified inline elements preceded by the frequency in the document collection.

In order to show when phrase weighting is useful, we consider the phrase “critical path” and the related word frequencies in the 236,630 fragments of the Base4 collection. The word ‘critical’ occurs in 8,413 fragments, ‘path’ in 12,927 fragments, both words in 950 fragments, and the phrase “critical path” in 631 fragments. All of these frequencies are directly proportional to the corresponding document frequencies in the tf*idf weighting scheme. As the content of a qualified inline element, the phrase occurs 16 times in 16 different fragments. It is intuitively easy to understand that a phrase is emphasised only once in a fragment although it might occur there several times. The term frequency of the words ‘critical’ and ‘path’ thus increases in 16 fragments when the qualified inline elements are given extra weight. As both words have a rather high document frequency, triplication of the inline element might be more effective than duplication. Nevertheless, those fragments where the phrase is emphasised are expected to be more relevant to corresponding queries than those where the words occur unemphasised.
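To make the effect on term weights concrete, the following sketch plugs the fragment frequencies quoted above into a plain tf*idf formula. The exact tf*idf variant is not given here, so the logarithmic idf and the raw term frequency are assumptions, as are the function names and the example term frequency of 2.

```python
import math

N_FRAGMENTS = 236_630                      # fragments in the Base4 collection
DF = {"critical": 8_413, "path": 12_927}   # fragment frequencies from the text


def idf(term):
    """Assumed idf variant: natural log of N / df."""
    return math.log(N_FRAGMENTS / DF[term])


def weighted_tf(tf, emphasised=1, factor=1):
    """Term frequency after counting each emphasised occurrence `factor` times.

    factor=1: no extra weight, factor=2: duplication, factor=3: triplication.
    """
    return tf + (factor - 1) * emphasised


def tfidf(term, tf, factor=1):
    return weighted_tf(tf, emphasised=1, factor=factor) * idf(term)


if __name__ == "__main__":
    for term in DF:
        plain, dup, trip = (tfidf(term, 2, f) for f in (1, 2, 3))
        print(f"{term}: plain={plain:.2f} duplicated={dup:.2f} triplicated={trip:.2f}")
```

Because both words occur in thousands of fragments, their assumed idf values are modest, which is why a single extra occurrence moves the weight only a little and a third copy may be needed for a visible effect.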

The actual effect on XML retrieval was studied by creating fragment indices with eight different configurations for both of the tested granularities. Besides the tests where the duplication of inline elements is either on or off, we also tested whether triplication is better than duplication, and whether there is any difference between the simple inline elements and all qualified inline elements. The results of the baseline runs on each index are reported in Table 8.7.

Division       strict -o   generalised -o        GR     PRUM
Base1             0.0815           0.0591   21.0692   1.0231
Base1Em           0.0894           0.0589   21.3262   1.3529
Base1Em3          0.0958           0.0635   21.6545   1.3932
Base1sEm          0.0924           0.0616   21.5971   1.6038
Base1sEm3         0.0886           0.0613   21.6547   1.3999
Base1FtLiTi       0.0969           0.0669   20.9796   1.4314
Base1All          0.0954           0.0686   21.1940   1.2952
Base1sEmAll       0.0918           0.0679   21.0951   1.2591
Base4             0.1026           0.0729   27.1790   1.7181
Base4Em           0.1040           0.0733   27.2178   2.4100
Base4Em3          0.1025           0.0727   27.2384   2.1141
Base4sEm          0.1055           0.0761   26.5913   1.8040
Base4sEm3         0.1029           0.0717   26.3486   2.4839
Base4FtLiTi       0.1133           0.0806   27.6353   2.8075
Base4All          0.1170           0.0831   27.6774   2.5398
Base4sEmAll       0.1138           0.0821   27.1297   2.5799

Table 8.7: Added weight on the emphasised content.

Finding agreement among the different metrics is somewhat challenging, so the interpretation of the results may, at times, seem vague. When duplicating the content is the only fragment expansion technique applied, in most cases the definition for the simple inline elements (BaseXsEm) seems to work better than the one that allows for entity references to appear in the text content (BaseXEm). In the case where triplication is applied, the metrics slightly favour all the qualified inline elements (BaseXEm3). With other fragment expansion techniques, the metrics almost unanimously suggest that duplicating only simple inline elements (BaseXsEmAll) is not sufficient: a better precision is achieved by also duplicating inline elements with entity references (BaseXAll).

Whether duplication is more effective than triplication is somewhat dependent on the granularity. In most cases, triplication of the simple inline elements (BaseXsEm3) leads to a lower search quality than the duplication thereof (BaseXsEm). When all qualified inline elements are concerned, however, triplication improves the results more than duplication at Base1 granularity (Base1Em3), but at the granularity of the Base4 collections (Base4Em3), the results are better when the qualified inline elements are only given double weight (Base4Em).

The actual effect of giving heavier weights to emphasised content is shown in Table 8.8, where each collection is compared to its counterpart without additional weighting. The biggest improvement is displayed in the PRUM scores, according to which, however, duplication of inline elements does not improve the results at all when other fragment expansion techniques are applied. Both the improvement and the decline in the results seem to be more pronounced in the Base1 collections where the fragments are fewer and bigger.

In order to find out which weighting scheme works best for the detected phrases, we may try to analyse which configuration leads to the biggest improvement in the evaluation scores. Table 8.8 tells us that, of the two weighting schemes, duplication works better when simple inline elements are weighted at Base1 granularity and when all qualified inline elements are weighted at Base4 granularity, whereas triplication is preferable when all qualified inline elements of the Base1 collection are weighted. Which weighting method is best for the simple inline elements at Base4 granularity seems to depend on the evaluation metric. We can also draw the conclusion that, as the only fragment expansion technique, giving additional weight to the qualified inline elements does not automatically result in a significant improvement in average precision, although, in most of the tested cases, it does.

Division       strict -o   generalised -o      GR     PRUM
Base1Em             +9.7             –0.3    +1.2    +32.2
Base1Em3           +17.5             +7.4    +2.8    +36.2
Base1sEm           +13.4             +4.2    +2.5    +56.8
Base1sEm3           +8.7             +3.7    +2.8    +36.8
Base1All            –1.5             +2.5    +1.0     –9.5
Base1sEmAll         –5.3             +1.5    +0.6    –12.0
Base4Em             +1.4             +0.5    +0.1    +40.3
Base4Em3            –0.1             –0.3    +0.2    +23.0
Base4sEm            +2.8             +4.4    –2.2     +5.0
Base4sEm3           +0.3             –1.6    –3.1    +44.6
Base4All            +3.3             +3.1    +0.2     –9.5
Base4sEmAll         +0.4             +1.9    –1.8     –8.1

Table 8.8: Percentual change of increasing the absolute frequency of qualified inline elements.
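The percentages in Table 8.8 can be reproduced from the scores in Table 8.7, as the sketch below shows for two rows. The choice of reference run is an assumption that happens to match the published numbers: runs with emphasis as the only technique are compared to the plain baseline, and the "All" runs to the corresponding FtLiTi run. The function name `percent_change` is illustrative.

```python
# (strict -o, generalised -o, GR, PRUM) scores copied from Table 8.7.
SCORES = {
    "Base1": (0.0815, 0.0591, 21.0692, 1.0231),
    "Base1Em": (0.0894, 0.0589, 21.3262, 1.3529),
    "Base1FtLiTi": (0.0969, 0.0669, 20.9796, 1.4314),
    "Base1All": (0.0954, 0.0686, 21.1940, 1.2952),
}


def percent_change(run, reference):
    """Relative change of each metric in percent, rounded to one decimal."""
    return tuple(
        round(100.0 * (new - old) / old, 1)
        for new, old in zip(SCORES[run], SCORES[reference])
    )


if __name__ == "__main__":
    print("Base1Em  vs Base1:      ", percent_change("Base1Em", "Base1"))
    print("Base1All vs Base1FtLiTi:", percent_change("Base1All", "Base1FtLiTi"))
```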

The effect of phrase weighting at the low recall levels is pictured in Figure 8.5, which shows the curves associated with all qualified inline elements. The curves, zoomed into the first 50 answers per query, support observations similar to those made from the overall evaluation scores. Although the effect seems more or less positive for the granularity [200, 20K], it seems close to random for the smaller fragments in the Base4 collection. However, the figures mostly agree that the curves are furthest apart from their comparative counterparts at the lowest levels of recall, after which they start to converge.

Figure 8.6 shows the average precision at low recall levels with a focus on the simple inline elements. In all the test cases, giving triple weight to simple inline elements leads to a higher average precision than a double weight, but only when the very first answers are considered. Although the improvement is limited to the top two or three answers per query, the observation may be valuable to applications where high precision is preferred. Nonetheless, the triplication of the simple inline elements causes the precision to sink after the first few answers, and as it sinks even lower than the baseline, it is anything but recommendable for tasks where high recall is appreciated.

Figure 8.5: Emphasising qualified inline elements has the greatest effect on retrieval quality at the low recall levels of 1–50.

[Figure 8.6, panels and surviving legend values (average precision):
[200, 20K], Simple Emphasis (strict): All with Simple Em 0.0918; All but emphasis 0.0969
[200, 20K], Simple Emphasis (generalised): All with Simple Em 0.0679; All but emphasis 0.0669
[150, 8K], Simple Emphasis (strict): All with Simple Em 0.1138; All but emphasis 0.1133
[150, 8K], Simple Emphasis (generalised): All with Simple Em 0.0821; All but emphasis 0.0806; 3x Simple Em 0.0717; Simple Emphasis 0.0761; Baseline 0.0729]

Figure 8.6: The effect of emphasising simple inline elements shown at recall levels 1–50.

Giving extra weight to qualified inline elements has a strong effect on the term frequencies as over 500,000 inline elements are involved. However, the overall effect on an individual fragment depends on the term frequency before the additional weighting. Although duplication of the inline element increases the term frequencies by 1, the actual term frequency rarely doubles because the same terms tend to occur unemphasised in the same fragments as well. Further tests with collections of different granularities would produce more data to be analysed, but the most certain conclusion that can be drawn from the results presented is that neither duplication nor triplication alone is a reliable method for improving the retrieval quality. Instead, the variance in the results suggests that we need more sophisticated weighting methods where different inline elements can be weighted individually. For example, we could consider the context of the whole fragment when weighting a detected phrase, so that the corresponding term weights in the whole fragment are doubled instead of duplicating a single occurrence of the terms. Future work on this fragment expansion technique should thus be directed at weighting methods because the quality of the detected phrases can hardly be improved if independence of document types is required of the methods.
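The difference between the two weighting ideas can be sketched on a single fragment. This is an illustration under stated assumptions, not a description of the tested system: term frequencies are kept in a `collections.Counter`, the example fragment and its counts are invented, and both function names are hypothetical.

```python
from collections import Counter


def duplicate_emphasised(tf, phrase_terms, copies=2):
    """Fragment expansion as tested: the emphasised occurrence is counted
    `copies` times, so each phrase term gains copies - 1 occurrences."""
    out = Counter(tf)
    for term in phrase_terms:
        out[term] += copies - 1
    return out


def scale_in_fragment(tf, phrase_terms, factor=2):
    """Proposed alternative: multiply the weight of each phrase term
    throughout the whole fragment by `factor`."""
    out = Counter(tf)
    for term in phrase_terms:
        out[term] *= factor
    return out


if __name__ == "__main__":
    # Invented fragment in which "critical path" is emphasised once.
    fragment_tf = Counter({"critical": 4, "path": 3, "schedule": 2})
    phrase = ["critical", "path"]
    print(duplicate_emphasised(fragment_tf, phrase))
    print(scale_in_fragment(fragment_tf, phrase))
```

With duplication, a term that already occurs four times rises only to five, whereas scaling within the fragment takes it to eight; terms outside the phrase are unaffected in both cases.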