
Recent Work


Semfacet was developed in 2015 as a system for semantic faceted search. It was designed to query Resource Description Framework (RDF) datasets represented with Web Ontology Language (OWL) ontologies. Facets were represented as pairs of predicates and values. The system was able to use the faceted structure to limit the scope of queries in order to reduce the execution time of searches. The creators were able to show that evaluation of faceted queries in their system was feasible in polynomial time. Facet suggestions are updated with each iteration. [6]

FacetSearch was introduced in 2018 as a system to explore and analyze bibliographic data [18]. FacetSearch was extended from Paperlens, a system for visualizing trends in scientific literature [19].

Research has also been done on improving the speed of faceted search systems. Klungre introduced a technique that reduces the time needed to compute facet suggestions by indexing parts of queries; this increases speed at the cost of a slight reduction in accuracy [20]. Mukhopadhyay et al. devised a technique for Apache Spark that filters on cached results of previous queries when facets are added, in order to reduce query execution time [21].

In 2019, Niu et al. used various machine learning models to predict facet interactions and performed a user study to determine user behavior and preferences when interacting with a faceted search system [22]. Users tended to use facets throughout the entire search process. Most users in the experiment said that they liked the idea of faceted search but had problems with the specific implementation used in the study. They also mentioned that the facets added an element of choice overload. The authors concluded that future faceted search implementations should be more responsive to user input.

3 Active Faceted Search

Unlike many other approaches, which attempt to group facets into an ontological hierarchy, our approach simply leaves them as a flat set of Boolean facets, where each facet has two possible values in the user interface: selected or unselected.

Users interact with the active faceting system in the following way: at each iteration, the system takes the list of facets seen and the list of facets selected and uses them to predict a ranking of the facets not yet seen. The top-ranked facets are presented to the user in groups of five. The user responds by selecting a facet of interest. If the facets presented are not sufficient, the user may keep requesting the next group of facets until one of interest is found.

Search results are then ranked according to how well they fit the combination of all selected facets. A result can be ranked highly even if it does not match a specific facet, as long as it matches well with the others. That is, in our system successive iterations never reduce the result space; they merely change the ranking of results. This contrasts with conventional faceted search, where facet selection acts as an orthogonal filter. It also allows our system to learn from feedback that is suboptimal or even conflicting.
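The interaction described above can be summarized as a simple loop. The sketch below is illustrative only: rank_unseen_facets, rank_results, and get_user_choice are hypothetical callables standing in for the ranking model of Section 3.1 and for the user interface; they are not part of the system's actual code.

def active_faceted_search(rank_unseen_facets, rank_results, get_user_choice):
    # One possible shape of the interaction loop: present unseen facets in
    # groups of five until the user selects one, then re-rank all results.
    seen, selected = set(), set()
    while True:
        candidates = [f for f in rank_unseen_facets(seen, selected) if f not in seen]
        if not candidates:
            return
        chosen = None
        for start in range(0, len(candidates), 5):
            group = candidates[start:start + 5]
            seen.update(group)
            chosen = get_user_choice(group)   # a facet of interest, or None to ask for the next group
            if chosen is not None:
                break
        if chosen is None:
            return
        selected.add(chosen)
        # Selecting a facet never filters results; it only changes their ranking.
        yield rank_results(selected)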

3.1 Computational Model

We model the facet recommendation process as a contextual multi-armed bandit problem [8]. Each facet in the data is an arm of the bandit model, and we use Term Frequency-Inverse Document Frequency (TF-IDF) [23] statistics to estimate the expectation of each arm. Rather than always offering the single best facet, the system balances the exploration/exploitation tradeoff by offering facets that are near optimal but have high uncertainty. The user is able to reduce this uncertainty by rewarding arms that they have found useful for the given task.

3.1.1 Weighing Facets

We begin with a set of documents D, a set of facets F, and a set of tuples, each of which associates a document with a facet. We assign a weight to each facet-document pair by calculating its term frequency-inverse document frequency (TF-IDF) score. A standard way to measure term frequency is to simply count the occurrences of each pair. This method gives undue weight to documents associated with large numbers of terms, without regard to the specificity of each term. We therefore calculate the augmented term frequency by counting the frequency of the document-facet pair and dividing it by the count of the most frequently occurring term in that document.

\[
\mathrm{tf}(f, d) = \frac{\mathrm{count}(f, d)}{\max\{\mathrm{count}(f', d) : f' \in F\}}
\]
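As an illustration, the augmented term frequency of one document can be computed as below. This is a minimal sketch assuming NumPy and a vector of facet counts for a single document; the function name and argument are hypothetical.

import numpy as np

def augmented_tf(doc_counts):
    # Augmented term frequency of every facet within a single document.
    # doc_counts holds count(f, d) for one document d and every facet f; each
    # count is normalized by the largest count in the document, as in tf(f, d) above.
    counts = np.asarray(doc_counts, dtype=float)
    max_count = counts.max()
    if max_count == 0.0:          # document associated with no facet at all
        return np.zeros_like(counts)
    return counts / max_count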

The inverse document frequency measures the rarity of a term. We calculate the inverse document frequency of a term by dividing the total number of documents by the number of documents associated with that term, and then taking the logarithm of the result.

\[
\mathrm{idf}(f) = \log\frac{|D|}{|\{d \in D : \mathrm{count}(f, d) > 0\}|}
\]
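A corresponding sketch for the inverse document frequency, again assuming NumPy, is shown below. The count matrix is assumed to have facets as rows and documents as columns; that orientation is a choice made for these sketches, not something specified here.

import numpy as np

def inverse_document_frequency(count_matrix):
    # IDF of every facet, given an |F| x |D| matrix of facet-document counts
    # (facets as rows, documents as columns).
    counts = np.asarray(count_matrix, dtype=float)
    n_docs = counts.shape[1]
    # A facet is "associated with" a document when its count there is non-zero.
    docs_with_facet = np.count_nonzero(counts > 0, axis=1)
    # Guard against facets associated with no document to avoid division by zero.
    docs_with_facet = np.maximum(docs_with_facet, 1)
    return np.log(n_docs / docs_with_facet)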

Multiplying the augmented term frequency by the inverse document frequency gives us the TF-IDF score.

\[
\mathrm{TFIDF}(f, d) = \mathrm{tf}(f, d) \cdot \mathrm{idf}(f)
\]

We create a weight matrix W that associates the TF-IDF score of each facet with each document.
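Putting the two pieces together, the weight matrix W could be built as follows. This is a sketch under the same assumptions as above (NumPy, facets as rows of the count matrix and of W); it is not the system's actual implementation.

import numpy as np

def tfidf_weight_matrix(count_matrix):
    # Build the weight matrix W of TF-IDF scores from an |F| x |D| matrix of
    # facet-document counts (facets as rows, documents as columns).
    counts = np.asarray(count_matrix, dtype=float)
    # Augmented term frequency: normalize each document (column) by its largest count.
    col_max = counts.max(axis=0, keepdims=True)
    tf = np.divide(counts, col_max, out=np.zeros_like(counts), where=col_max > 0)
    # Inverse document frequency of each facet (row).
    n_docs = counts.shape[1]
    docs_with_facet = np.maximum(np.count_nonzero(counts > 0, axis=1), 1)
    idf = np.log(n_docs / docs_with_facet)
    return tf * idf[:, np.newaxis]       # W[f, d] = tf(f, d) * idf(f)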

3.1.2 Ranking Facets

The system needs to balance exploration and exploitation while taking into account facets that were selected in previous steps. For these reasons, we adopt a contextual bandit approach. The system ranks facets according to the regularized LinRel algorithm [24].

The user selects one or more facets. At each step t, we define the matrix Xt as the subset of W that contains only the selected facets. We then obtain regularized least-squares estimates for each facet f:

\[
a_f = x_f \cdot (X_t^{T} X_t + \mu I)^{-1} \cdot X_t^{T}
\]

where µ is a regularization parameter and I is an identity matrix of size |F|.
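A sketch of this estimate, assuming NumPy and the facets-as-rows orientation of W used in the earlier sketches (so that X_t consists of the rows of W for the selected facets), is shown below. Under that orientation the identity matrix in the code follows the dimension of X_t^T X_t, and computing the coefficients for all facets at once is a convenience of the sketch.

import numpy as np

def linrel_coefficients(W, selected, mu=1.0):
    # Regularized LinRel coefficients a_f for every facet.
    # W: |F| x |D| TF-IDF weight matrix; selected: indices of the facets chosen so
    # far, whose rows of W form the context matrix X_t; mu: regularization parameter.
    X_t = W[selected, :]
    d = X_t.shape[1]
    inv = np.linalg.inv(X_t.T @ X_t + mu * np.eye(d))
    # a_f = x_f (X_t^T X_t + mu I)^{-1} X_t^T, computed for all facets at once.
    return W @ inv @ X_t.T               # one row per facet, shape |F| x t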

We do not want to rank facets simply by relevance. Instead, we rank them based on an upper confidence bound (UCB).

\[
\mathrm{ucb} = a_f \cdot y_t + \frac{c}{2}\,\lVert a_f \rVert
\]

where yt is the vector of rewards observed for the selected facets and c is a parameter used to adjust the confidence estimate. In later sections, we demonstrate the effect of using different values of c.
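Given the coefficients from the previous sketch, the UCB of every facet can be computed as below. Treating the reward vector yt as observed positive feedback for the selected facets and the default value of c are assumptions of the sketch.

import numpy as np

def ucb_scores(A, y_t, c=1.0):
    # Upper confidence bound of every facet.
    # A: |F| x t matrix whose rows are the LinRel coefficients a_f;
    # y_t: rewards observed for the t selected facets; c: exploration parameter.
    exploit = A @ y_t                                  # a_f . y_t
    explore = 0.5 * c * np.linalg.norm(A, axis=1)      # (c / 2) * ||a_f||
    return exploit + explore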

We apply a penalty p to the UCB of facets that have already been presented to the user, scaled by the number of times they have been presented, s.

\[
\mathrm{ucb} = a_f \cdot y_t + \frac{c}{2}\,\lVert a_f \rVert - p \cdot s
\]
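The penalty can be folded into the same computation; the sketch below simply subtracts it per facet. The values of c and p here are placeholders, not the values used in the experiments.

import numpy as np

def penalized_ucb(A, y_t, times_presented, c=1.0, p=0.1):
    # UCB with a presentation penalty: p is scaled by the number of times s
    # each facet has already been shown to the user.
    base = A @ y_t + 0.5 * c * np.linalg.norm(A, axis=1)
    return base - p * np.asarray(times_presented, dtype=float)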

3.1.3 Ranking Search Results

In contrast to facet ranking, we do not apply any exploration to search results. Thus, we determine the ranking of search results by how relevant they are to the selected facets. We achieve this using standard linear regression.

\[
a_I = (X_t^{T} X_t + \mu I)^{-1} \cdot X_t^{T}
\]
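Under the same assumptions as the earlier sketches (NumPy, W with facets as rows and documents as columns), the result ranking could be computed as below. Multiplying a_I by the reward vector yt to obtain one relevance score per document is an assumption of the sketch; the text above only states the regression estimate itself.

import numpy as np

def result_scores(W, selected, y_t, mu=1.0):
    # Relevance score of every document given the selected facets.
    # W: |F| x |D| TF-IDF weight matrix; selected: indices of the selected facets.
    X_t = W[selected, :]
    d = X_t.shape[1]
    a_I = np.linalg.inv(X_t.T @ X_t + mu * np.eye(d)) @ X_t.T   # shape |D| x t
    return a_I @ y_t                                            # one score per document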
