
4.4 Algorithm for fragment selection

4.4.3 Tree traversal

As each of the divided XML documents is considered independent of the other documents, we process the collection of documents serially, one XML document at a time. Because the definitions for different kinds of full-text fragments require that we operate on document trees, the algorithm essentially defines how the tree is traversed and what is done with the traversed nodes.

The naïve approach is to traverse the whole tree from the first to the last node and test all the element nodes on the way. Besides being slow, this approach may also lead to problems with nested elements, because it is possible that both a parent and its child element qualify. Being thorough may nevertheless work if overlapping and nested fragments are acceptable in the index; otherwise, additional testing is necessary for each node that is traversed. If we optimise the tree traversal by adding conditions on branch selection, we not only make the traversal more efficient but also gain the option of restricting the output to disjoint fragments at no extra cost.
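To illustrate why the naïve approach yields nested results, the following sketch (Python, standard xml.etree.ElementTree; the size-based qualifies predicate is a hypothetical stand-in for the actual fragment tests) tests every element node and keeps all matches:

```python
import xml.etree.ElementTree as ET

def naive_select(root, qualifies):
    # Traverse the whole tree and test every element node on the way;
    # nothing stops the descent, so a parent and its child can both qualify.
    return [e for e in root.iter() if qualifies(e)]

# Hypothetical qualification test: at least 4 characters of contained text.
qualifies = lambda e: len("".join(e.itertext())) >= 4

doc = ET.fromstring("<sec><p>some text</p></sec>")
print([e.tag for e in naive_select(doc, qualifies)])  # ['sec', 'p']: nested matches
```

Both sec and its child p qualify, so the result overlaps; the optimised traversal below avoids this by stopping the descent at the first accepted node.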

In the optimised approach, the beginning state of the division consists of an empty fragment collection F = ∅ and the current node c, which is set to be the document root. The current node is first tested for size. If the size is bigger than the maximum size of the granularity, c is assigned each of the child element nodes one by one. If the size of the current node falls in the given range, the node is tested for full-text likelihood. In the case that the content of c is too structured, so that T/E < 1, c is again assigned each of the child element nodes. Otherwise, when the size of c fits the given range and the content passes as full text, we add c to F and move on to the next branch (subtree sibling) in the document tree. The acceptance as a full-text fragment and a size below the given range are both stopping conditions for the recursive testing of child nodes. In other words, the child nodes are tested until they contain enough full-text or until they are too small. The tree traversal is then repeated for each document in collection C. When the algorithm stops, the fragment bodies that have been added to F represent the disjoint full-text fragments that can be selected from the XML documents in the collection.
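The optimised traversal can be sketched as runnable code. In this minimal Python illustration, the character-count size measure, the bounds G_MIN and G_MAX, and the T/E ≥ 1 full-text test are all simplifying assumptions standing in for the measures defined earlier in the thesis:

```python
import xml.etree.ElementTree as ET

G_MIN, G_MAX = 50, 500  # hypothetical size bounds for one granularity level

def size(elem):
    # Fragment size approximated as the total character count of contained text.
    return len("".join(elem.itertext()))

def ftl_is_text(elem):
    # Full-text likelihood test: require at least as many non-empty text
    # nodes (T) as element nodes (E) in the subtree, i.e. T/E >= 1.
    elems = list(elem.iter())
    texts = [t for e in elems for t in (e.text, e.tail) if t and t.strip()]
    return len(texts) >= len(elems)

def select_fulltext(node, fragments):
    if size(node) > G_MAX:            # too big: descend into children
        for child in node:
            select_fulltext(child, fragments)
    elif size(node) < G_MIN:          # too small: discard, skip children
        return
    elif not ftl_is_text(node):       # too data-oriented: descend
        for child in node:
            select_fulltext(child, fragments)
    else:                             # fits the range and reads as full text
        fragments.append(node)        # accept; do not test descendants

doc = ET.fromstring(
    "<article><meta><id>42</id></meta>"
    "<sec><p>" + "word " * 20 + "</p><p>" + "text " * 20 + "</p></sec>"
    "</article>")
F = []
select_fulltext(doc, F)
print([n.tag for n in F])  # ['p', 'p']: two disjoint paragraph fragments
```

The article and sec elements fail the T/E test because of their structural children, the meta subtree is discarded as too small, and only the two paragraphs are accepted, so the output is disjoint by construction.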

The pseudo code for the algorithm is presented in Figure 4.1.

Algorithm SelectFullText(node) {
    if size(node) > Gmax
        // Discard node (too big)
        for each child: SelectFullText(node.child);
    else if size(node) < Gmin
        // Discard node (too small)
        break;  // Skip children
    else if ftl(node) = data
        // Discard node as data
        for each child: SelectFullText(node.child);
    else
        accept(node);
        break;
}

Figure 4.1: Pseudo code of the algorithm that returns a set of disjoint qualified full-text fragments from a given XML document.

The accepted size range [Gmin, Gmax] is by nature a global variable rather than an actual parameter, and only the current node c is passed on at each recursive function call. The algorithm can be further optimised by reordering the tests, although the optimal order depends on the input documents. For example, if the test for full-text likelihood ftl(node) fails more often than the test for the fragment size, it should come before the size test, assuming that the two tests can be computed at an equal cost.

If a relatively small maximum size is given to the algorithm, the conditions for testing the child elements should be reconsidered.

Small elements are likely to have text node children, which implies that their child elements are at the inline level. Although some inline-level elements, e.g. lists and tables, are seemingly good candidate fragments whose inclusion in the division is easily justified, we are bound to discard their text node siblings if disjoint fragments are required. Whether this is generally acceptable depends on the nature of the documents and on what would actually be lost in the orphaned text nodes. As a solution for these cases, we can define a soft limit for the maximum size of a fragment. Elements that have text node children are then tested against the soft limit instead of the actual maximum size. This solution compromises the definition of a qualified full-text fragment, though.
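One way to realise such a soft limit is sketched below; the value G_SOFT and the detection of text-node children via direct text content are illustrative assumptions, not values from the thesis:

```python
import xml.etree.ElementTree as ET

G_MAX = 200    # hypothetical hard maximum size
G_SOFT = 400   # hypothetical soft maximum for elements with text-node children

def has_text_children(elem):
    # True if non-whitespace text occurs directly under the element,
    # which indicates that its element children are at the inline level.
    if elem.text and elem.text.strip():
        return True
    return any(c.tail and c.tail.strip() for c in elem)

def max_size_for(elem):
    # Test elements with text-node children against the soft limit, so that
    # inline markup (e.g. <em>) does not force us to orphan the surrounding
    # text nodes when disjoint fragments are required.
    return G_SOFT if has_text_children(elem) else G_MAX

mixed = ET.fromstring("<p>Prose with an <em>inline</em> element.</p>")
plain = ET.fromstring("<sec><p>Prose only inside child elements.</p></sec>")
print(max_size_for(mixed), max_size_for(plain))  # 400 200
```

The mixed-content paragraph is allowed to grow up to the soft limit, whereas the purely structural section is still bound by the hard maximum.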

The problem of determining the set of full-text fragments is actually treated as a classification problem in the algorithm. It classifies all the nodes of a document into three pre-defined classes: 1) qualified full-text fragments, 2) "too small" fragments, and 3) other fragments (which are either too big or too data-oriented). Although the classes are fixed, the classification rules must be learned if no supervision is available. The unsupervised classifier is trained by collecting statistics from the document collection itself before it can be used in the algorithm. How to learn the minimum and maximum size for each granularity level was studied in Section 3.3, and some experiments with an actual document collection are reported in Chapter 6.
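The training step can be pictured with a small sketch. The interquartile rule below is only an illustrative assumption standing in for the actual learning procedure of Section 3.3; it shows the general shape of collecting size statistics from the collection before running the algorithm:

```python
import xml.etree.ElementTree as ET

def element_sizes(root):
    # Character count of the contained text of each element in the tree.
    return [len("".join(e.itertext())) for e in root.iter()]

def learn_size_range(docs, lo_q=0.25, hi_q=0.75):
    # One plausible unsupervised rule (an assumption, not the thesis method):
    # take interquartile bounds of the observed non-empty element sizes
    # as the minimum and maximum size for a granularity level.
    sizes = sorted(s for d in docs for s in element_sizes(d) if s > 0)
    lo = sizes[int(lo_q * (len(sizes) - 1))]
    hi = sizes[int(hi_q * (len(sizes) - 1))]
    return lo, hi

docs = [ET.fromstring("<a><b>xx</b><c>xxxx</c></a>"),
        ET.fromstring("<a><b>xxxxxx</b></a>")]
print(learn_size_range(docs))  # (4, 6)
```

In practice the statistics would be gathered over the whole collection C, and the learned bounds [Gmin, Gmax] would then parameterise the traversal of Figure 4.1.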