
5.2 Analysis of full-text fragments

5.2.2 Changes in the typeface

Systems that index full text often neglect the information that is implicitly available in its layout and presentation. They may perform sophisticated linguistic analysis, including full grammatical parsing, but at the same time they can only process the plain-text presentation of the XML documents, where all markup and other information about the text formatting has been removed. To address this shortfall, we want to find methods that analyse formatting properties, typefaces in particular, by taking document structures into account.

Motivation for the analysis comes from the rather simple writing technique of adding emphasis to selected parts of text as described in Section 5.1.1. From what we know about labelling concepts with differing typefaces, we infer the following hypothesis:

A temporary change in the typeface implies a phrase boundary.

Assuming that we can detect the temporary changes in the typeface, the hypothesis should turn out to be helpful in phrase detection, which in turn aids in building separate phrase indices and enables schemes for phrase weighting.

Thanks to the XML markup, we can locate the portions of text content that are supposed to be formatted with a typeface different from the surrounding content. In order to be independent of document types, we intend to ignore what the specific typefaces are called, what they look like, or even how they differ from other typefaces. Therefore, any kind of change in the typeface is potentially interesting, which makes this theory applicable to all documents where typefaces are used for presentational purposes.

It is necessary to read document type definitions if, for example, we want to know what the intended content models of elements are. Without knowing the content model, we cannot confidently determine whether or not text nodes containing only whitespace can be ignored. Zwol et al. presented an approach where they first analyse the DTD in order to find those elements that can have the mixed content model [vZWD05]. The descendant elements in the mixed content, that is, the inline-level elements, are then given a multiplied weight. Although their results showed slight improvement when inline-level elements were given heavier weights, the overall performance of their system was too low to demonstrate the significance of the idea.
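The first step of such a DTD-based approach, finding the elements that are declared with mixed content, can be illustrated with a minimal sketch. It is not the implementation of [vZWD05]: the function name, the regular expression, and the sample declarations are assumptions made for illustration, and the sketch only handles declarations that spell out the mixed content model without parameter entities.

```python
import re

# Matches declarations of the form <!ELEMENT name (#PCDATA | a | b)*>.
# Declarations that use parameter entities are not handled by this sketch.
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+([^\s>]+)\s+\(\s*#PCDATA((?:\s*\|\s*[^\s|()]+)+)\s*\)\*\s*>")


def mixed_content_models(dtd_text):
    """Map each element with mixed content to the element names allowed inline."""
    models = {}
    for name, alternatives in ELEMENT_DECL.findall(dtd_text):
        models[name] = [alt.strip() for alt in alternatives.split("|") if alt.strip()]
    return models


# Hypothetical declarations, written for this example only.
sample_dtd = """
<!ELEMENT sec (st, p+)>
<!ELEMENT st (#PCDATA)>
<!ELEMENT p (#PCDATA | it | b | scp)*>
"""

print(mixed_content_models(sample_dtd))  # {'p': ['it', 'b', 'scp']}
```

Only p is reported because st allows character data alone and sec allows element content alone.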

DTDs and schemas also tell us which elements can be used for adding emphasis or changing the typeface. However, by only knowing the element name, we cannot know whether the change is temporary within the parent element or whether that typeface is dominant in the whole of the parent element. For example, all the full text of one document can be presented in italics as far as the DTD is concerned. Because of the distinction between temporary and permanent changes in the typeface, we cannot rely on any interpretations of element names; we have to look at the relations of element and text nodes instead.

We know that marking a phrase with a typeface that differs from the surrounding text implies that the phrase is the content of an inline element such as the it element in Example (5.4).

(5.4) ...difference in firing modes. Timed (stochastic)
PN’s use the <it>strong firing mode</it> in which a transition is forced to fire immediately after it is enabled...

This is a typical example of the intention to change the typeface temporarily in order to emphasise the important content. There are numerous other cases where elements appear at the inline level without having anything to do with typefaces. We find typical examples in the XHTML specification [W3C02], which allows tables and lists to appear at the inline level. They should be simple to distinguish, though, as their content differs greatly from that of the inline elements that change typefaces. Table and list elements have structured content, i.e. they contain several child elements, whereas inline elements marking phrases typically have only one text node child. Nevertheless, it is intuitively clear that inline elements with structured content also imply phrase boundaries.

Even when an inline element is used for changing the typeface, it might not contain a useful phrase. Examples include elements for subscript, superscript, embedded program code, preformatted text, mathematical formulae, etc. The content of these elements is often treated differently from full-text content at a later point of indexing, and not much harm is done if it is treated as full text during fragment expansion. For example, program code is mostly ignored merely due to the large number of special characters and stopwords such as if and then.

In contrast to Example (5.4), the inline elements in Examples (5.5) and (5.6) do not contain phrases although they mark temporary changes in the typeface. In cases of this kind, the markup is placed in the middle of a word, and the inline elements are not separated by word delimiters from the surrounding content. More signs of deceptive inline elements are seen in the sizes of text nodes. The text length at the inline level is extremely short (1–2 characters) in Example (5.5), whereas in Example (5.6), the inline elements themselves only contain 1–2 characters.

(5.5) <ti>C<scp>OMPARISON OF</scp>
A<scp>RCHITECTURAL</scp> F<scp>EATURES</scp></ti>

(5.6) ...through SilRi (<it>si</it>mple
<it>l</it>ogic-based <it>R</it>DF
<it>i</it>nterpreter, ...

In the light of these examples, the original hypothesis should be amended with a remark that a phrase boundary can only occur between words. However tempting it might be, the assumption that words are separated with whitespace characters is specific to languages and writing systems. In languages like English, leading and trailing non-whitespace characters provide negative evidence when deciding the usefulness of an inline element, but languages with other kinds of writing systems need different criteria for finding word boundaries.

Discovery of useful phrases is the subsequent goal of detecting phrase boundaries. In addition to the hypothesis, we also claim that the typeface does not change within a useful phrase, as shown in Example (5.4). In order to recognise the significant phrase boundaries, we define a qualified inline element, which contains a phrase that can be considered more important to the fragment than the surrounding content. The phrase is often emphasised with a different typeface when the text content is displayed, but this is more of a consequence than a requirement. An XML element is considered a qualified inline element when it meets the following conditions:

(C1) The text node siblings contain at least n characters after whitespace has been normalised.

(C2) The text node descendants contain at least m characters after normalisation.

(C3) The element has no element node descendants.

(C4) The element content is separated from the text node siblings by word delimiters.

Defining the lower bounds of n and m improves the quality of detected phrases in the qualified inline elements. If the text node siblings contain fewer than n non-whitespace characters, the inline element is not likely to denote a temporary change in the typeface. It might not even change the typeface at all. If the normalised string value of the inline element is shorter than m characters, it is likely that the string is too short to be indexed. Setting the minimum length of the trimmed phrase m reduces the number of qualified inline elements, which in turn reduces processing overhead. If the values of n and m are set properly, the fourth condition may be redundant.
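Conditions (C1) to (C4) can be sketched as a single test over a parsed element. The following is only an illustration under stated assumptions: the function names, the default values of n and m, and the hypothetical p wrapper around the examples are not taken from the text. The sketch counts characters after collapsing whitespace runs, and it treats any non-alphanumeric character as a word delimiter, which suits English but not all writing systems.

```python
import xml.etree.ElementTree as ET


def normalise(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join((text or "").split())


def is_delimiter(ch):
    """Assumption: any non-alphanumeric character delimits words.
    The empty string (no neighbouring text) also counts as a delimiter."""
    return not ch.isalnum()


def is_qualified_inline(parent, index, n=10, m=3):
    """Check (C1)-(C4) for the child element of `parent` at `index`.
    The defaults for n and m are placeholders, not values from the text."""
    element = parent[index]
    # ElementTree stores text nodes as .text (before the first child)
    # and .tail (after each child).
    siblings = [parent.text] + [child.tail for child in parent]
    before = (parent.text if index == 0 else parent[index - 1].tail) or ""
    after = element.tail or ""
    content = "".join(element.itertext())

    c1 = sum(len(normalise(t)) for t in siblings) >= n
    c2 = len(normalise(content)) >= m
    c3 = len(element) == 0  # no child elements, hence no element descendants
    c4 = ((is_delimiter(before[-1:]) or is_delimiter(content[:1]))
          and (is_delimiter(after[:1]) or is_delimiter(content[-1:])))
    return c1 and c2 and c3 and c4


# The <p> wrapper is hypothetical; the content follows Examples (5.4) and (5.6).
ex54 = ET.fromstring(
    "<p>PN's use the <it>strong firing mode</it> in which a transition "
    "is forced to fire</p>")
ex56 = ET.fromstring(
    "<p>through SilRi (<it>si</it>mple <it>l</it>ogic-based "
    "<it>R</it>DF <it>i</it>nterpreter</p>")

print(is_qualified_inline(ex54, 0))                      # True
print([is_qualified_inline(ex56, i) for i in range(4)])  # [False, False, False, False]
```

In this sketch the elements of Example (5.6) fail both (C2) and (C4). They still fail with m=1, where only (C4) rejects them, which shows why the fourth condition becomes redundant only when m is large enough.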

In the third condition, the definition for qualified inline elements differs most from the HTML definition for inline elements in that qualified inline elements may only contain string-valued content, not other inline elements. This definition is also independent of any tag names and document types. Furthermore, a descriptive definition is more useful than a prescriptive one because it reflects the practice of how tag names are actually used instead of simply assuming the intended usage.

Setting a threshold for the maximum size of an inline element is excluded from the definition, which allows phrases of arbitrary length to qualify. Fortunately, the number of excessively large inline elements among the qualified ones is minimal (less than 0.01%) in the document collection described in Section 6.1. The ones that occur in the collection contain several sentences of text. The number of lengthy phrases can be reduced by modifying the condition (C3) as follows:

(C3b) The element has only text node descendants.

With the modified condition (C3b), we have the definition of a simple inline element where entity references are not allowed. Because the probability of a phrase containing entity references is proportional to the phrase length, the condition reduces the number of lengthy phrases the most. Consequently, some useful phrases similar to the contents of the it element in Example (5.7) do not qualify as simple inline elements.

(5.7) ... The <it>Audio &amp; Video Recorder</it>
consists of audio-visual equipment such as ...
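Most XML parsers expand entity references before the application sees the text, so a check for (C3b) has to look at the unparsed markup or at a parser that reports entity reference nodes. The sketch below takes the first route as an assumption. The function name is hypothetical, and the test is deliberately crude: any "<" in the element content is taken to signal a non-text node.

```python
import re

# Named entity references such as &amp;. Character references such as &#38;
# are deliberately not matched here.
ENTITY_REFERENCE = re.compile(r"&[A-Za-z_:][\w.:-]*;")


def is_simple_inline_source(inner_markup):
    """(C3b) on unparsed markup: only character data, no child elements
    and no entity references."""
    has_element = "<" in inner_markup
    has_entity = ENTITY_REFERENCE.search(inner_markup) is not None
    return not has_element and not has_entity


print(is_simple_inline_source("strong firing mode"))          # True
print(is_simple_inline_source("Audio &amp; Video Recorder"))  # False
```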

According to the definition, an inline element does not qualify if more than one element is needed for changing the typeface temporarily. Example (5.8) shows two different phrases where two inline-level elements define the changes in the typeface. The typeface of both phrases is actually subject to three different changes: boldface, italics, and capitalisation.

(5.8) ...similar in several respects to Bertrand Meyer’s
<it>BILINEAR</it> [<ref>12</ref>],
pp. 141-146 and <it>TWO_WAY_LIST</it> ...

From our experience, the inline-level content where two or more changes have been applied to the typeface consists mostly of single characters, e.g. italicised superscript, and pieces extracted from program code, e.g. names of variables, functions, or even files. Example (5.8) represents the best phrases in the test collection where two inline-level elements are used, and even their quality hardly compares with that of the most common case presented in Example (5.4).

Although detecting cases similar to Example (5.8) would not do much harm, we cannot consider it worthwhile because of the questionable quality of the phrases combined with the added complexity of computation. Moreover, the presupposition behind the heuristics (two or more elements changing the typeface) does not apply to all document types although the heuristics themselves are independent of document types. While some DTDs only define elements for marking temporary typefaces, others may have corresponding definitions for attributes, which implies only one element occurring at the inline level.

Hoi et al. rely on the HTML document type [HLX03], but they are unable to account for cases such as Example (5.9) where the style is defined by inline CSS.

(5.9) A new background and font color with inline CSS

Because of CSS and other stylesheet languages, knowing the tag names is not sufficient when analysing HTML documents, and this conclusion can be generalised to other document types that allow such structural heterogeneity.
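To illustrate why tag names alone fall short, the following sketch looks for style attributes instead of element names. It is an assumption-laden illustration: the class name, the set of properties regarded as presentational, and the sample markup are hypothetical, and the style value is split naively on semicolons.

```python
from html.parser import HTMLParser

# Assumed set of CSS properties that alter the presentation of text.
PRESENTATIONAL_PROPERTIES = {
    "font", "font-family", "font-style", "font-weight", "font-size",
    "font-variant", "color", "background", "background-color",
    "text-decoration",
}


class InlineStyleFinder(HTMLParser):
    """Collect start tags whose style attribute changes the presentation."""

    def __init__(self):
        super().__init__()
        self.styled = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        declared = set()
        for declaration in style.split(";"):
            prop, sep, _value = declaration.partition(":")
            if sep:
                declared.add(prop.strip().lower())
        changed = sorted(declared & PRESENTATIONAL_PROPERTIES)
        if changed:
            self.styled.append((tag, changed))


# Hypothetical markup; the span name says nothing about typefaces.
finder = InlineStyleFinder()
finder.feed('<p>Plain text with <span style="background-color: yellow; '
            'color: red">a new background and font color</span>.</p>')
print(finder.styled)  # [('span', ['background-color', 'color'])]
```

Rules from external or embedded stylesheets would need a CSS parser and selector matching, which this sketch does not attempt.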

In document Indexing Heterogeneous XML for Full-Text Search (pages 99-104)