Structural issues - Indexing Heterogeneous XML for Full-Text Search

Dividing a document collection into fragments with unsupervised methods hardly results in an ideal fragment collection, but it is reasonable to expect the result to be a good compromise. There are different ways to achieve this goal depending on the choices that are made about how the fragments are related to whole XML documents.

Overlapping fragment context information For example, all parts of a section are under the same section title, which belongs to the fragment context. Several fragments linking to the same element is another typical example of context information that is shared by multiple fragments. If the link relations are preserved, the common element has to be duplicated, once for each fragment that links to it. Otherwise, the fragments are not independent. Terms appearing in the copied elements may then seem more common than they really are, which in some weighting schemes leads to lower term weights. This should be taken into account when duplicating content.
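The weighting effect of duplicated context can be illustrated with a small sketch. It assumes a plain log(N / df) inverse document frequency, which is only one of the weighting schemes the remark could apply to; the fragment contents and the function name `idf` are hypothetical.

```python
import math

def idf(term, fragments):
    """Inverse document frequency log(N / df), with fragments as the indexed units."""
    df = sum(1 for tokens in fragments if term in tokens)
    return math.log(len(fragments) / df) if df else 0.0

# Hypothetical section: one title shared by three fragments, plus one unrelated fragment.
title = ["indexing"]
bodies = [["xml", "trees"], ["term", "weights"], ["result", "lists"]]
other = ["query", "languages"]

# The title is indexed once, attached to the first fragment only.
without_copies = [title + bodies[0], bodies[1], bodies[2], other]
# The title is copied into every fragment of the section to keep them independent.
with_copies = [title + body for body in bodies] + [other]

idf_once = idf("indexing", without_copies)   # log(4 / 1)
idf_copied = idf("indexing", with_copies)    # log(4 / 3)
```

Under these assumptions the title term receives a clearly lower weight once it is copied, although it was written only once in the original document.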

Partly overlapping fragment bodies When a section with three subsections is divided into two fragments, we may choose to include the second subsection in both fragments. In addition to the shared subsection, each fragment then also contains one non-overlapping subsection. Even partial overlap complicates the process of creating indices, and in any case, overlap is in itself an undesired property in the result list for a query.

Nested fragments A special case of overlapping fragment bodies is one where a fragment is completely included in another as a descendant fragment that comprises at least one descendant element. Nested fragments increase redundancy in the index, unless the original document structure is preserved and the overlapping parts of each fragment are only indexed once. The attempt to avoid redundancy would, however, compromise the goal of keeping the fragments independent. Another factor against nested fragments is that although the granularity of returned answers should be determined dynamically, e.g. based on their topical relevance, this does not imply that the size of the indexed units cannot be fixed.

In conclusion, we can safely choose not to have nested fragments in the index.

Discontinuous fragment bodies Full-text fragments may contain sections of data that are not needed in the index, e.g. data charts or listings of program code. Removing the data is rather straightforward, but it is still unclear whether we have good reason to deny access to the data sections when the stripped fragment is given to the user. Returning a node or a location in the original document is challenging if part of the corresponding subtree is excluded.
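A minimal sketch of the removal step is given below. The element names in `DATA_TAGS` and the helper names are hypothetical, and the sketch is not the method used in this work. It strips a copy of the fragment, so the original subtree stays available if the data sections are to remain accessible to the user.

```python
import copy
import xml.etree.ElementTree as ET

DATA_TAGS = {"code", "chart"}  # hypothetical names for non-full-text content

def _strip(parent):
    """Remove DATA_TAGS children recursively; return the number of removed subtrees."""
    removed = 0
    for child in list(parent):
        if child.tag in DATA_TAGS:
            index = list(parent).index(child)
            if child.tail:  # keep the text that follows the removed element
                if index > 0:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + child.tail
                else:
                    parent.text = (parent.text or "") + child.tail
            parent.remove(child)
            removed += 1
        else:
            removed += _strip(child)
    return removed

def strip_data_sections(fragment):
    """Return a stripped copy of `fragment` and the number of removed subtrees."""
    stripped = copy.deepcopy(fragment)
    return stripped, _strip(stripped)

fragment = ET.fromstring(
    "<sec><p>Sorting is shown below.</p>"
    "<code>qsort(a, n);</code>"
    "<p>It runs fast.</p></sec>"
)
stripped, removed = strip_data_sections(fragment)
indexed_text = "".join(stripped.itertext())
```

The stripped copy has a discontinuous body with respect to the original document, which is why mapping an answer back to a node or location in the original remains the harder part.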

Fragment bodies with several root elements When a big element has a flat structure, it typically contains many small child elements or many text node children. The most natural way to divide such an element into fragments is to define split points where one fragment ends and another one begins. All the root elements of the resulting fragments have a common parent element which is not, however, part of the fragment. Fragment segmentation is an additional challenge that is avoided when the element structures are not flat.
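The idea of split points can be sketched as follows, assuming a simple size threshold as the splitting criterion. The function name `split_flat_element` and the parameter `max_chars` are hypothetical, and for brevity the sketch only handles child elements, not text node children of the flat element.

```python
import xml.etree.ElementTree as ET

def split_flat_element(parent, max_chars):
    """Group consecutive children of `parent` into fragments holding at most
    `max_chars` characters of text. A single oversized child still forms a
    fragment of its own. Each fragment is a list of root elements."""
    fragments, current, size = [], [], 0
    for child in parent:
        length = len("".join(child.itertext()))
        if current and size + length > max_chars:
            fragments.append(current)  # split point before this child
            current, size = [], 0
        current.append(child)
        size += length
    if current:
        fragments.append(current)
    return fragments

# A flat element with six small children, each containing 8 characters of text.
doc = ET.fromstring(
    "<list>" + "".join(f"<item>entry {i:02d}</item>" for i in range(6)) + "</list>"
)
fragments = split_flat_element(doc, max_chars=20)
```

With these values the six children are grouped into three fragments of two root elements each, and the common parent `list` belongs to none of them.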

Fragment bodies with orphaned text nodes In the rare case that a block-level element is too big to be indexed as a single unit, we may have to divide the element into fragments that start and end with either text or inline-level elements. However, block-level elements are rarely so big that we would lose significant amounts of full-text by requiring that all text nodes have a parent element node in the same fragment.

The decisions concerning the structural issues do not have to be black-and-white yes or no answers; we can also define a degree in the grey area, e.g. by describing how much overlap and nesting is acceptable. Independent fragments can overlap and they can be nested, but combining them at a later point in time may be more efficient with non-overlapping fragments.

Any overlap also interferes with the concept of document frequency (df). For example, if a term occurs exactly once in two different documents and zero times elsewhere, then the term occurs exactly twice in the document collection and the document frequency of the term equals 2. However, if the documents overlap and the term occurs in the common part of the overlapping documents, it only occurs once in the collection, which contradicts the df value. This issue is discussed in more detail in Section 5.3.
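The mismatch can be reproduced with a small sketch in which fragments are modelled as ranges of token positions over one shared token sequence. This representation and the function names are assumptions made for illustration only.

```python
# Hypothetical collection: one token sequence, with fragments given as
# half-open (start, end) ranges of token positions.
tokens = ["alpha", "beta", "gamma", "delta", "epsilon"]

def document_frequency(term, fragments):
    """Number of fragments that contain the term at least once."""
    return sum(1 for start, end in fragments if term in tokens[start:end])

def collection_occurrences(term, fragments):
    """Number of distinct covered token positions where the term occurs."""
    covered = set()
    for start, end in fragments:
        covered.update(range(start, end))
    return sum(1 for position in covered if tokens[position] == term)

disjoint = [(0, 3), (3, 5)]
overlapping = [(0, 3), (2, 5)]  # both fragments include position 2, "gamma"

df_disjoint = document_frequency("gamma", disjoint)        # 1
df_overlapping = document_frequency("gamma", overlapping)  # 2
occurrences = collection_occurrences("gamma", overlapping) # 1
```

With overlapping fragments the df value of "gamma" is 2 although the term occurs only once in the collection, which is the contradiction described above.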

The solutions to these issues often come from the features and limitations specific to the implementations that assume certain user models, e.g. ones where standalone answers are required, or others where pointers to relevant nodes in the document tree are sufficient. Overlap of the answers is acceptable when they are treated as starting points for user navigation, whereas overlap in standalone answers causes redundancy. Moreover, if all the indexed fragments are potential answers to users’ queries, the format and presentation of the answers may set requirements for the structure of the

