
3.4 Measuring the probability of full-text

3.4.3 Independence of document types

Whether the full-text indicators proposed in Sections 3.4.1 and 3.4.2 actually classify XML fragments correctly with appropriately defined pivot points should, without question, be tested on authentic XML documents. However, a question can be raised about the quality of the test documents: How many document types need to be sufficiently represented in the document collection where the distinguishing traits (the full-text indicators and the corresponding pivot points) are verified? If we can show that the measures are independent of document types, one document type suffices in the verification, as the validity for the other types can then be inferred, which is ideal. Otherwise, a wider variety of documents is needed in the test collection.

One argument against the independence of the document types says:

“...some DTD might induce a high proportion of elements and some [might] not...”⁴

⁴ From an anonymous reviewer of the ACM Fourteenth Conference on Information and Knowledge Management 2005 (CIKM’05).

According to our first counter-argument, a DTD cannot dictate that an element for full-text content parents a high proportion of elements. We reckon that the elements parenting full-text must have either the “text only” or the mixed content model. If elements are not allowed in the content model, even a single text node in the element content makes the proportion equal to 1/1. When both elements and text are allowed, we need to consider the appropriate definitions. With a DTD, the mixed content model is defined as shown in Example (3.5).

(3.5) <!ELEMENT fulltext (#PCDATA|a|b|c|d)*>

There is no other way to define the mixed content model in a DTD. The fulltext elements as defined in the example may or may not contain text and the listed elements (a, b, c, and d) in an arbitrary order. The definition does not enforce the occurrence of any possible child element. XML Schema definitions, however, allow for more complex definitions of the mixed content model. An example is shown in Figure 3.1.

    <xsd:element name="fulltext">
      <xsd:complexType mixed="true">
        <xsd:sequence>
          <xsd:element name="a" type="xsd:string" minOccurs="1"/>
          <xsd:element name="b" type="xsd:string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>

Figure 3.1: Schema definition for a full-text element.
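The difference between the two kinds of definition can be illustrated with a small Python sketch. It does not run a real validator: the two helper functions (hypothetical names) only mimic, for the direct children of a fulltext element, what Example (3.5) and Figure 3.1 permit. The sketch assumes the XML Schema defaults of minOccurs = maxOccurs = 1 wherever Figure 3.1 leaves them unstated.

```python
import xml.etree.ElementTree as ET

DTD_ALLOWED = {"a", "b", "c", "d"}  # child names listed in Example (3.5)
XSD_SEQUENCE = ["a", "b"]           # order required by the sequence in Figure 3.1


def child_names(fragment_xml):
    """Return the tag names of the direct children of the fragment root."""
    return [child.tag for child in ET.fromstring(fragment_xml)]


def allowed_by_dtd_mixed(fragment_xml):
    """(#PCDATA|a|b|c|d)*: any listed child, any number of times, any order."""
    return all(name in DTD_ALLOWED for name in child_names(fragment_xml))


def allowed_by_xsd_sequence(fragment_xml):
    """Mixed sequence: a followed by b, each exactly once (assumed defaults)."""
    return child_names(fragment_xml) == XSD_SEQUENCE


text_only = "<fulltext>Plain text without any child elements.</fulltext>"
in_order = "<fulltext>Some <a>marked</a> text <b>here</b>.</fulltext>"

print(allowed_by_dtd_mixed(text_only), allowed_by_xsd_sequence(text_only))
print(allowed_by_dtd_mixed(in_order), allowed_by_xsd_sequence(in_order))
```

The text-only fragment passes the DTD-style check but not the schema-style one, which is the sense in which only a schema definition can force child elements into mixed content.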

With a schema definition, the appearance and order of the child elements can be enforced in the mixed content model. The element a in Figure 3.1 provides such an example. However, only two content models in the descendant elements of the fragment may steer the T/E value towards the range of data values: the empty and “element only” content models. If the content model suggests the occurrence of several such elements, we may well question whether the defined element is truly designated for full-text content.

Another argument concerns the previously defined pivot point of the T/E measure:

“...proposal sets an arbitrary value of 1.0 for the ratio of elements. ...different values may be appropriate for other collections (depending on the characteristics of the DTD) and no single value may be appropriate in a heterogeneous collection.”⁵

In response to these arguments, we show that the pivot point value of the T/E measure comes from the content models of full-text elements, not from properties specific to any DTD. Thus, the value of 1.0 is neither chosen arbitrarily nor dependent on the characteristics of a DTD. If only text content is allowed, the T/E value is at least 1.00, depending on the number of entity references. If mixed content is allowed, the occurrence of any elements cannot be enforced with a DTD. Moreover, requiring elements with the “element only” content model is not meaningful, as it even contradicts the synonymous concept of free text, although it is possible with a schema definition.
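The reasoning can be checked on small fragments with a Python sketch. It assumes that the T/E value is the number of non-whitespace text nodes divided by the number of element nodes in the fragment, fragment root included; the actual definition is given in Section 3.4.1 and may differ in detail. The names te_value and classify, as well as the sample fragments, are illustrative only. The parser used here resolves entity references into the surrounding text, so any extra text nodes they might produce are not counted.

```python
import xml.etree.ElementTree as ET

PIVOT = 1.0  # pivot point of the T/E measure discussed in the text


def te_value(fragment_xml):
    """Ratio of non-whitespace text nodes to element nodes (assumed definition)."""
    root = ET.fromstring(fragment_xml)
    elements = text_nodes = 0
    for element in root.iter():
        elements += 1
        if element.text and element.text.strip():
            text_nodes += 1
        # Tail text follows a child element inside its parent; skip the root's.
        if element is not root and element.tail and element.tail.strip():
            text_nodes += 1
    return text_nodes / elements


def classify(fragment_xml):
    """Binary decision against the pivot point."""
    return "full-text" if te_value(fragment_xml) >= PIVOT else "data"


samples = {
    "text only": "<fulltext>Only character data.</fulltext>",
    "mixed": "<fulltext>Some <a>marked</a> text <b>here</b>.</fulltext>",
    "element only": "<record><name>Ann</name><age>30</age></record>",
}
for label, xml in samples.items():
    print(label, round(te_value(xml), 2), classify(xml))
```

Under these assumptions the three fragments yield 1/1, 5/3, and 2/3, so only the record with the “element only” root falls below the pivot point.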

With the pivot point of 1.00, single elements can be misjudged if we only look at the T/E value, but not all full-text content can be: it is always possible to have a full-text element with a T/E value of at least 1.00. One of the few cases where full-text content could be misjudged as data is when some structured content, such as a table, occurs at the inline level. Another such case occurs if all the words of the full-text are wrapped in individual elements, e.g., when the text is tagged with elements marking the part-of-speech of the words. Judging these borderline cases as data might be a very reasonable choice of action. Data content can be misclassified, too, if mixed content is allowed in data elements, but again, instead of questioning the pivot point value, we may as well question whether the content is data.
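The part-of-speech case is easy to reproduce. The sketch below again assumes that T/E is the count of non-whitespace text nodes over the count of element nodes; if whitespace-only text nodes were counted as well, the result would change. The tag names p, em, s, and w and the pos attribute are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def te_counts(fragment_xml):
    """Return (text nodes, element nodes) under the assumed T/E definition."""
    root = ET.fromstring(fragment_xml)
    texts = elements = 0
    for el in root.iter():
        elements += 1
        texts += bool(el.text and el.text.strip())
        # Tail text belongs to the parent's content; the root has no parent here.
        texts += bool(el is not root and el.tail and el.tail.strip())
    return texts, elements


# The same sentence with light inline markup and with every token wrapped.
plain = "<p>Dogs <em>often</em> bark.</p>"
tagged = ('<s><w pos="NNS">Dogs</w> <w pos="RB">often</w> '
          '<w pos="VBP">bark</w><w pos=".">.</w></s>')

for name, xml in (("plain", plain), ("tagged", tagged)):
    texts, elements = te_counts(xml)
    print(name, texts, elements, texts / elements >= 1.0)
```

With every token wrapped, the sentence element itself adds an element node without adding a text node, so the ratio drops to 4/5 and the fragment lands on the data side of the pivot point.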

There are also cases which may at first sight seem like full-text documents but where, as expected, the appropriate full-text indicators reveal the true nature of the content. A typical example of such a document is shown in Figure 3.2, where the content cannot be considered data in the sense it has in the context of databases. However, not all the criteria for an XML Full-Text fragment are met, either, as the interpretation of the content is heavily dependent on the tag names, and the fragment has to be seen as data in the binary classification.

⁵ From another anonymous reviewer at the CIKM’05 conference.

    ...
    <SCNDESCR>SCENE England; afterwards France.</SCNDESCR>
    <PLAYSUBT>KING HENRY V</PLAYSUBT>
    <PROLOGUE><TITLE>PROLOGUE</TITLE>
    <STAGEDIR>Enter Chorus</STAGEDIR>
    <SPEAKER>Chorus</SPEAKER>
    <LINE>O for a Muse of fire, that would ascend</LINE>
    <LINE>The brightest heaven of invention,</LINE>
    <LINE>A kingdom for a stage, princes to act</LINE>
    <LINE>And monarchs to behold the swelling scene!</LINE>
    ...

Figure 3.2: Excerpt from the Shakespeare play Henry V marked up in XML, courtesy of Jon Bosak.
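As a rough check of this classification, the node counts of the prologue in Figure 3.2 can be computed with a short Python sketch. Two assumptions are made: T/E is taken to be the number of non-whitespace text nodes divided by the number of element nodes (the exact indicators are defined in Sections 3.4.1 and 3.4.2), and a closing PROLOGUE tag is appended only to make the truncated excerpt well-formed. The function name count_nodes is illustrative.

```python
from xml.dom import minidom

# Excerpt of Figure 3.2; the closing </PROLOGUE> tag is an assumption added
# here so that the truncated fragment parses.
EXCERPT = """<PROLOGUE><TITLE>PROLOGUE</TITLE>
<STAGEDIR>Enter Chorus</STAGEDIR>
<SPEAKER>Chorus</SPEAKER>
<LINE>O for a Muse of fire, that would ascend</LINE>
<LINE>The brightest heaven of invention,</LINE>
<LINE>A kingdom for a stage, princes to act</LINE>
<LINE>And monarchs to behold the swelling scene!</LINE>
</PROLOGUE>"""


def count_nodes(node):
    """Count (non-whitespace text nodes, element nodes) at and below node."""
    texts = int(node.nodeType == node.TEXT_NODE and bool(node.data.strip()))
    elements = int(node.nodeType == node.ELEMENT_NODE)
    for child in node.childNodes:
        child_texts, child_elements = count_nodes(child)
        texts += child_texts
        elements += child_elements
    return texts, elements


root = minidom.parseString(EXCERPT).documentElement
texts, elements = count_nodes(root)
print(texts, elements, texts / elements)
```

Every leaf element holds exactly one text node, but the PROLOGUE element adds an element node without text of its own, giving 7/8 = 0.875, which is below the pivot point of 1.00.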

In document Indexing Heterogeneous XML for Full-Text Search (pages 59-62)