
4. Content-based Image Retrieval with Graph Techniques

4.1 Introduction to content-based image retrieval

Retrieving useful information from a large dataset is an important task in this era of data explosion. Using web search engines, users have grown accustomed to retrieving useful web pages and documents through text-based queries. However, retrieving multimedia content, such as images, audio, and video, is a more challenging task. To retrieve information from a large multimedia database using a text-based search engine, users have to enter queries in text format (keywords, sentences in a natural language, or statements in a certain query language) to describe the content they are interested in. This approach relies heavily on text-based retrieval techniques. However, it is difficult to precisely and completely describe the content of multimedia items in any language. Another difficulty is that this approach also requires detailed descriptions of the items in the multimedia database.

However, for a large multimedia database, text descriptions are almost always inadequate. Earlier attempts at annotating large multimedia databases have shown that the annotations are ambiguous, erroneous, and deficient [154, 155, 156].

Because of the difficulties faced by text-based search technologies, content-based image retrieval (CBIR) has become a critical topic in computer vision and multimedia content retrieval. With a CBIR system, a user provides one or multiple query images and the system returns similar images according to the content of the query image [52, 56]. The retrieval results are normally presented to the user in the order of their relevance to the query image(s).

CBIR systems can be categorized into two types. One is designed to serve a specific use case, for example, a CBIR system that retrieves pictures of furniture that matches the style of the furniture in a given query image, or a system that retrieves pictures of animals of the same breed as the one in a query image. Another type of CBIR system serves general use cases, similar to search engines on the Internet. For a better user experience, a general CBIR system needs to understand the purpose of the query—the intention of the user and the expectation of the retrieved result. Unlike a text-based system, where the intention can be elaborated by providing more description in the query, for example, “running Labrador dog” or “layered birthday cake”, a CBIR system has to determine the intention using clues from the query images and knowledge gained from previous records.

4.1.1 Purpose of a query

The purposes of a query can be summarized into two broad categories depending on whether the user knows what they are looking for.

Finding “similar” images or a specific group of images from the dataset

In this category, the user has a clear definition of “similarity” when he/she triggers the search. The results returned from the CBIR system must match this definition of “similarity”, and the CBIR system has to be able to rank images according to it. The majority of research on CBIR systems falls into this category [52, 157, 158, 159]. The target of the search can be a concrete concept or an abstract concept.

– The target can be a concrete concept, such as “the specific dog breed in this picture”, “this type of vehicle” or “more pictures of this person”. When the target is clearly defined and understood, the retrieval problem becomes a classification problem and the CBIR system will return the pictures of the same class as the query images. With the fast development of pattern recognition, especially recent advances in deep learning technologies, systems that understand 1000 classes can perform at a level close to or even better than human beings [160, 161]. Automatic face recognition has also been reported to outperform human accuracy [162, 163]. A deep learning based CBIR system normally performs well in this case, as long as the concept is precisely known.

– It is also possible that the target concept is abstract and cannot be matched to a concrete class, for example, “pictures of the same artistic style as the query image”, “pictures as peaceful as this one”, or “pictures arousing similar sentiment”. Since the definition of “similarity” is abstract, it is difficult to apply the aforementioned classification approach. One has to clearly define a measurement of “style” or “sentiment” and find corresponding features to rank images. There has been limited related research on this topic [164, 165] and the performance of CBIR systems is often unsatisfactory in this case.
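For the concrete-target case above, the classification-based retrieval approach can be sketched as follows. The `classify` function here is a hypothetical stand-in for any trained classifier that returns a predicted class and a confidence score; it is not taken from any cited system:

```python
def retrieve_by_class(query_feature, dataset, classify):
    """Return dataset images of the same predicted class as the query,
    ranked by the classifier's confidence.

    `dataset` is a sequence of (image_id, feature) pairs and `classify`
    maps a feature vector to (predicted_class, confidence).
    """
    query_class, _ = classify(query_feature)
    matches = []
    for image_id, feature in dataset:
        predicted_class, confidence = classify(feature)
        if predicted_class == query_class:
            matches.append((image_id, confidence))
    # Present the most confident matches first.
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

This is only a sketch of the idea that retrieval reduces to classification once the target concept is precisely known; a real system would score images offline rather than classify the whole dataset per query.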

Seeking the answers to a question related to the query image

In this category, the user is looking for images that may help him/her answer a question related to the query image. Although this type of use case is common, there has been little research in this area from the CBIR perspective [166, 167]. One of the difficulties of this type of use case is that numerous questions can be initiated from an image. Fig. 4.1 shows a query image and some questions related to the image.

– What is this event?

– Why is a military vehicle on the street?

– Where is this place?

– Who are the reporters in the picture?

– What is the model of the vehicle in this picture?

Figure 4.1 A query image and a number of questions related to the image

With the query image alone, it is impossible for a system to figure out which question to answer. The user may elaborate the query with additional text information, or the system can present the results in groups, with each group answering a specific question. An ideal system might answer these questions directly in text form. However, presenting the relevant images is often more convenient and brings extra information to the user.

The purpose of a query also affects how the results should be presented to the user. Sometimes, a user may expect to retrieve either a certain group of images or one specific image, for example, when querying “this breed of dog” or “this specific dog”. In another situation, the user may want to extend the scope of the query image and expect to see more diverse results, for example, “show me different people with this type of hairstyle”. In the latter case, the diversity of the retrieved images must be kept so that the results are not “too similar” to the query image.

4.1.2 Gaps in CBIR systems

Researchers have identified certain gaps that a CBIR system must bridge to fulfill the requirements of users [168]. The most important one is the semantic gap, which describes the disparity between a user and a CBIR system when interpreting an image [159, 169]. With different purposes of a query, the interpretations of an image may be totally different, and so is the definition of similarity. For example, given the query image in Fig. 4.1, a user may interpret the image as a serious public security event, whereas the CBIR system interprets it as a vehicle. As discussed in the previous section, it is difficult for a CBIR system to reduce this gap without understanding the purpose of the query.

The feature gap (also known as the sensor gap [170]) refers to the situation in which a CBIR system does not have effective features to evaluate similarities between images, even when the concept of “similarity” is clearly defined. This may be due to the limitations of image understanding technology, such as inferior performance in image classification under certain conditions [171]. More often, the functionality of a CBIR system cannot cover all possible use cases. A system that is designed to recognize different breeds of dogs may not be able to distinguish different types of vehicles.

In [168], the authors also defined the performance gap and the usability gap. These two gaps both address the question of how easily and quickly a user can find and locate the images he/she is interested in among the results provided by a CBIR system.

4.1.3 Architecture of a CBIR system

A typical CBIR system consists of the following key components [159].

Feature extraction

This component is responsible for extracting relevant features from the images in the dataset and the query images. The features can be low-level features (such as color features, texture features, and shape features [159]), middle-level features (such as SIFT and HOG features [172]), or high-level features (such as class-specific representations of an image given by a deep neural network [57]). Feature extraction is normally executed offline and the features are quantized and stored in a database for fast access [173]. Feature extraction for query images is executed online.
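As a concrete illustration of a low-level color feature, the sketch below computes a per-channel color histogram. The function name and the binning scheme are illustrative choices, not taken from any cited system:

```python
def color_histogram(pixels, bins=8):
    """Quantize each RGB channel into `bins` ranges and count pixels.

    `pixels` is a list of (r, g, b) tuples with values in 0..255.
    Returns a normalized histogram of length 3 * bins, a simple
    low-level color feature vector for an image.
    """
    hist = [0.0] * (3 * bins)
    for r, g, b in pixels:
        for channel, value in enumerate((r, g, b)):
            # Map the 0..255 channel value to one of `bins` buckets.
            index = channel * bins + min(value * bins // 256, bins - 1)
            hist[index] += 1
    total = len(pixels) or 1
    # Normalize so images of different sizes are comparable.
    return [count / total for count in hist]
```

In an offline indexing pass, such a vector would be computed and stored for every image in the dataset; at query time the same function is applied to the query image.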

Similarity measurement

To measure the similarity between a query image and images in the dataset, a metric needs to be defined. The most common measurement is the Minkowski metric, which is defined as

d(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^r \right)^{1/r},  (4.1)

where x and y are the feature vectors of two images, d is the dimension of the features, and r ≥ 1 is a constant. In particular, Eq. 4.1 gives the Euclidean distance when r = 2 and the Manhattan distance when r = 1. When multiple features are used, the similarity measurements can be assembled using a statistical method to generate an overall similarity score. Retrieved images are ordered according to their similarities to the query image. In a large-scale CBIR system involving datasets with millions or billions of images, the system may just return the k nearest images for fast retrieval instead of ranking all images. To further speed up the process, an approximation of the k-nearest neighbor search is often used [174].
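Eq. 4.1 and the k-nearest retrieval step can be sketched as follows; this is a plain linear scan with illustrative function names, whereas a large-scale system would use an approximate nearest-neighbor index instead:

```python
def minkowski(x, y, r=2):
    """Minkowski distance of Eq. 4.1: r=2 gives the Euclidean
    distance and r=1 the Manhattan distance."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def k_nearest(query, dataset, k, r=2):
    """Rank (image_id, feature) pairs by distance to the query feature
    and return the ids of the k nearest images."""
    ranked = sorted(dataset, key=lambda item: minkowski(query, item[1], r))
    return [image_id for image_id, _ in ranked[:k]]
```

For example, `minkowski([0, 0], [3, 4])` gives the familiar Euclidean distance 5.0, and `k_nearest` simply truncates the ranked list to k results rather than ordering the whole dataset for the user.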

Presentation of retrieval results

After the k nearest images are retrieved from the dataset, a CBIR system normally displays the results as a list of images ordered by their similarities. To reduce the semantic gap and improve the user experience, some CBIR systems incorporate relevance feedback, a technique that refines the results with the help of users’ feedback [175]. Some other systems allow a user to either select the features that are used for similarity comparison or refine the retrieval results using the keywords that the user gives to a query image [176, 177]. In [178], the CBIR system organizes images in the dataset into a tree structure using a clustering algorithm and allows users to browse the retrieval results in a hierarchical view. This approach combines the visual content and the semantics, thus making it easier for users to locate the target images.
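Relevance feedback can take many forms; one classic formulation, borrowed from text retrieval, is Rocchio-style query refinement. The sketch below assumes feedback operates directly on feature vectors, and the parameter values are conventional defaults rather than values from the cited systems:

```python
def rocchio_refine(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style relevance feedback: move the query feature toward
    the mean of user-marked relevant results and away from the mean of
    non-relevant ones. All inputs are equal-length feature vectors."""
    def mean(vectors, dim):
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors)
                for i in range(dim)]

    dim = len(query)
    rel_mean = mean(relevant, dim)
    nonrel_mean = mean(nonrelevant, dim)
    return [alpha * query[i] + beta * rel_mean[i] - gamma * nonrel_mean[i]
            for i in range(dim)]
```

The refined vector is then used to re-run the nearest-neighbor search, so each feedback round pulls the results closer to what the user marked as relevant.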

A good CBIR system shall use not only the visual content of the images but also all available information, such as tags, annotations, dates, locations, and surrounding texts. A key requirement for a general CBIR system is the ability to handle a large and continuously evolving dataset. All operations, including feature extraction and similarity measurement, must be fast and efficient to provide real-time responses.

The retrieved results shall be presented with good user experience and a method shall be available for users to refine the query and locate the target images quickly. The next subsection will show how graph techniques can help improve the user experience of CBIR systems.