
Computational image analysis

The variability of presentation within an image or a video frame is huge: multiple objects may be located anywhere in the image, at any size and with any pose or orientation. Objects are often partially occluded by other objects. Depending on lighting conditions, the color hues captured by the camera vary tremendously. The human brain deals with the complexity of this visual information without a problem, but for a computer, image analysis is a highly demanding task. In the following I explain some computational procedures for automatically analyzing the contents of a digital image.

2.4.1 Filters for recognizing simple shapes

For automatic image analysis, the standard approach is to first search for, i.e. detect, simple details and patterns in an image, e.g. lines, edges, corners, and blobs. Each sub-image at every possible position, size and orientation within the whole image is evaluated separately for the sought shapes. Simple basis functions, i.e. filters, are designed based on the visual appearance of the sought shapes. They are specifically suited for the preliminary edge detection, corner detection, blob detection, etc. from image patches.

They are at the core of nearly every higher-level feature extraction scheme for computer vision. A linear filter response within image $I$ at pixel position $(h, v)$ is computed as
$$
x(h, v) = \sum_{i=1}^{N_h} \sum_{j=1}^{N_v} I\!\left(h - \frac{N_h + 1}{2} + i,\; v - \frac{N_v + 1}{2} + j\right) F(i, j), \qquad (2.7)
$$
where $F$ denotes the 2-dimensional shape detection filter of size $N_h \times N_v$. The filter dimensions $N_h$ and $N_v$ are assumed to be odd, e.g. 3x3, 5x5, etc.
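As a concrete illustration of Eq. (2.7), the following sketch computes the filter response at every interior pixel by direct summation. The function name `filter_response` and the Prewitt example are my own choices; an optimized implementation would use a convolution routine instead of explicit loops.

```python
import numpy as np

def filter_response(image, F):
    """Linear filter response x(h, v) as in Eq. (2.7): an Nh-by-Nv
    filter F is correlated with the image at every interior pixel."""
    Nh, Nv = F.shape
    assert Nh % 2 == 1 and Nv % 2 == 1, "filter dimensions must be odd"
    rh, rv = Nh // 2, Nv // 2
    H, W = image.shape
    out = np.zeros_like(image, dtype=float)
    for h in range(rh, H - rh):
        for v in range(rv, W - rv):
            patch = image[h - rh:h + rh + 1, v - rv:v + rv + 1]
            out[h, v] = np.sum(patch * F)
    return out

# Example filter: 3x3 Prewitt approximation of the horizontal derivative
prewitt_h = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=float)
```

Applied to an image with a vertical intensity step, this filter responds strongly at the step and not elsewhere.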

For detecting edges in an image, simple linear filters approximating the image derivative include the Prewitt, Sobel, and Roberts filters. However, these filters alone are usually not enough for reliable interest point detection, and some post-processing is necessary for noise-robust detection. For example, the Canny edge detector by Canny (1986), although old, is still a dominant approach. It first uses two linear filters to extract the horizontal and vertical components of the gradient at each position of the image. These gradient images are combined to provide a likelihood of an edge existing at each pixel position, as well as the direction of the potential edge. The edge likelihood values are post-processed first by non-maximum suppression, which damps the majority of them to zero. Then thresholding with hysteresis is used to produce crisp edges from the remaining edge likelihood values.
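The pipeline above can be sketched roughly as follows. This is a simplified illustration, not Canny's exact algorithm: the names `sobel_gradients` and `canny_like` are my own, and the initial Gaussian smoothing of the original detector is omitted.

```python
import numpy as np

def sobel_gradients(img):
    """Horizontal and vertical gradient components via 3x3 Sobel filters."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    H, W = img.shape
    gx, gy = np.zeros((H, W)), np.zeros((H, W))
    for h in range(1, H - 1):
        for v in range(1, W - 1):
            patch = img[h-1:h+2, v-1:v+2]
            gx[h, v] = (patch * kx).sum()
            gy[h, v] = (patch * ky).sum()
    return gx, gy

def canny_like(img, lo=0.2, hi=0.6):
    gx, gy = sobel_gradients(img)
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-12                 # edge likelihood in [0, 1]
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180
    # non-maximum suppression along the quantized gradient direction
    H, W = img.shape
    nms = np.zeros_like(mag)
    for h in range(1, H - 1):
        for v in range(1, W - 1):
            a = ang[h, v]
            if a < 22.5 or a >= 157.5:   n1, n2 = mag[h, v-1], mag[h, v+1]
            elif a < 67.5:               n1, n2 = mag[h-1, v+1], mag[h+1, v-1]
            elif a < 112.5:              n1, n2 = mag[h-1, v], mag[h+1, v]
            else:                        n1, n2 = mag[h-1, v-1], mag[h+1, v+1]
            if mag[h, v] >= n1 and mag[h, v] >= n2:
                nms[h, v] = mag[h, v]
    # hysteresis: keep weak edges only if connected to a strong edge
    strong, weak = nms >= hi, nms >= lo
    edges = strong.copy()
    for _ in range(H * W):
        grown = np.zeros_like(edges)
        grown[1:-1, 1:-1] = (edges[:-2, 1:-1] | edges[2:, 1:-1] |
                             edges[1:-1, :-2] | edges[1:-1, 2:])
        new = edges | (weak & grown)
        if (new == edges).all():
            break
        edges = new
    return edges
```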

For blob detection, the linear Laplacian of Gaussian (LoG) filter is a standard choice. The difference of Gaussians (DoG), computed as the difference between the responses to Gaussian filters with different scales, is also used as an approximation of the LoG. To detect blobs, the maxima and minima of scale-normalized LoG or DoG responses are searched for.
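A minimal sketch of DoG blob detection along these lines is shown below. The names `gaussian_kernel`, `smooth` and `dog_blobs` are illustrative, and a full multi-scale search with scale normalization is omitted for brevity; only a single scale pair is examined.

```python
import numpy as np

def gaussian_kernel(sigma):
    """2-D Gaussian filter, normalized to sum to one."""
    r = int(3 * sigma)
    y, x = np.mgrid[-r:r+1, -r:r+1]
    k = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return k / k.sum()

def smooth(img, sigma):
    """Gaussian smoothing by direct correlation with edge padding."""
    k = gaussian_kernel(sigma)
    r = k.shape[0] // 2
    pad = np.pad(img, r, mode="edge")
    out = np.empty_like(img, dtype=float)
    H, W = img.shape
    for h in range(H):
        for v in range(W):
            out[h, v] = (pad[h:h+2*r+1, v:v+2*r+1] * k).sum()
    return out

def dog_blobs(img, sigma=1.0, k=1.6, thresh=0.01):
    """Difference-of-Gaussians response; local extrema mark blobs."""
    dog = smooth(img, k * sigma) - smooth(img, sigma)
    H, W = img.shape
    blobs = []
    for h in range(1, H - 1):
        for v in range(1, W - 1):
            n = dog[h-1:h+2, v-1:v+2]
            c = dog[h, v]
            if abs(c) > thresh and (c == n.max() or c == n.min()):
                blobs.append((h, v))
    return blobs
```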

Haar-like filters developed by Viola and Jones (2001a) are very popular due to their computational efficiency and versatility in detecting different patterns. The filters consist only of values -1 and 1, arranged as rectangular areas. An efficient implementation of Haar-like feature computation by Viola and Jones (2001b) utilizes a sum-up table called an integral image.
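The integral-image trick can be illustrated as follows. The helper names are my own, and the two-rectangle feature shown is just one of the simplest Haar-like patterns: after one pass to build the table, any rectangle sum costs only four lookups.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[h, v] = sum of img[:h, :v]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum over any rectangle in O(1) from four table lookups."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def haar_two_rect(ii, top, left, height, width):
    """A two-rectangle Haar-like feature: right half (weight +1)
    minus left half (weight -1), via the integral image."""
    half = width // 2
    return (rect_sum(ii, top, left + half, height, half)
            - rect_sum(ii, top, left, height, half))
```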

An example of a non-linear simple-pattern-detecting filter is the local binary pattern (LBP) filter proposed by Ojala et al. (2002). When computing an LBP response for a pixel, the values of the pixels within the neighbourhood of the central pixel are compared to the value of the central pixel to obtain directional information about local intensity differences. The pixel value differences are converted to binary values based on the sign of the difference – thus the name binary pattern – giving one pattern for each pixel used as a neighbourhood center.
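An 8-neighbour LBP code for a single pixel might be computed as in this sketch. It is a fixed-radius illustration; the circular, interpolated neighbourhoods and uniform-pattern mapping of Ojala et al. are omitted.

```python
import numpy as np

def lbp_code(img, h, v):
    """8-neighbour local binary pattern: each neighbour is compared to
    the centre pixel; the signs of the differences form an 8-bit code."""
    center = img[h, v]
    # clockwise neighbour offsets starting from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dh, dv) in enumerate(offsets):
        if img[h + dh, v + dv] >= center:
            code |= 1 << bit
    return code
```

In a flat region every comparison succeeds and the code is 255; a bright central pixel surrounded by darker neighbours yields 0.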

2.4.2 Using histograms to compact information

When analyzing an image with many different filters, the amount of numbers characterizing the image becomes even larger than the number of pixels. This data explosion must obviously be suppressed and only relevant information retained. One simple and popular solution for reducing the amount of data is to collect a histogram of attributes to describe the whole image or some part of it. The idea is also extensively utilized in text processing, and thus the name Bag-of-Words (BoW) (Harris (1954)) is also used.

A histogram or BoW is a collection of counts of selected attributes from an image. Each number in the histogram refers to the occurrences of a certain attribute within the image. One attribute may account for, e.g., sufficiently strong responses from a certain type of filter, where an occurrence is determined using a threshold value.

The simplest of the histogram-based image feature vectors, also used in Publications IV and VI, is the RGB color histogram. The color histogram feature (Novak and Shafer (1992)) is popular in analyzing video frames particularly because of its low computational load. The histogram of oriented gradients (HOG) (McConnell (1986)) is another very popular histogram-based feature. To collect a HOG, multiple linear filters detecting gradual changes of luminance values at many scales and orientations within the image are used, and the notable responses of each gradient filter type are accumulated into the HOG vector. The LBP representation (Ojala et al. (1994)) is ultimately a histogram-based feature as well.

After the initial nonlinear filtering stage, a histogram of binary pattern representations is collected from an image patch to form an LBP feature vector.
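As an illustration of such a histogram feature, a joint RGB color histogram can be sketched as follows. The function name and the per-channel quantization into 8 bins are illustrative choices.

```python
import numpy as np

def color_histogram(img, bins=8):
    """Joint RGB color histogram of an H x W x 3 uint8 image: each
    channel is quantized into `bins` levels, and the joint bin counts
    are normalized to sum to one."""
    q = (img.astype(int) * bins) // 256          # per-channel bin index
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3)
    return hist / hist.sum()
```

A single-color image puts all of its mass into one of the 512 joint bins, whatever the image size, which shows how the histogram discards spatial layout while keeping color content.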

In the work of Publications IV and VI on video analysis I have utilized histograms of scale-invariant feature transform (SIFT) (Lowe (2004)) features, which are explained in the next section. A large set of SIFT feature vectors, i.e. SIFT descriptors, is first computed from training material. Then the K-means algorithm is applied to this set of SIFT descriptors, and the K mean SIFT descriptors are collected into a codebook. When analyzing a new video, SIFT descriptors are extracted from each video frame and compared to those of the codebook. The counts of descriptors that match the codebook items closely enough are returned as a SIFT BoW feature vector for each video frame.
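The codebook matching step can be sketched as below. The function name is hypothetical and the example uses 2-dimensional vectors for readability; real SIFT descriptors are 128-dimensional, but the logic is the same.

```python
import numpy as np

def bow_vector(descriptors, codebook, max_dist):
    """Bag-of-Words: assign each descriptor to its nearest codebook
    entry and count the assignments; descriptors farther than
    `max_dist` from every codeword are ignored."""
    bow = np.zeros(len(codebook), dtype=int)
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        k = dists.argmin()
        if dists[k] <= max_dist:
            bow[k] += 1
    return bow
```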

2.4.3 Crafted features for image analysis

There are many hand-crafted image feature extraction algorithms designed to tackle specific problems of computer vision. The scale-invariant feature transform descriptors, which are utilized in Publications IV and VI, are an example of a crafted analysis method which detects simple shapes and patterns invariantly to changes of scale, orientation and illumination. Another example of a crafted image feature computation scheme discussed here is facial point extraction. A facial point extraction algorithm by Zhu and Ramanan (2012) is utilized in Publication VI.

The scale-invariant feature transform is an image analysis scheme used successfully ever since its invention by Lowe (2004). SIFT analysis gives localized information about the image by returning feature vectors, called descriptors, for the image to be analyzed.

Each descriptor is associated with a specific location, a key point of interest, in the image.

Detection of key points within an image starts with computing differently smoothed versions of the image using Gaussian filters of different sizes, i.e. scales. Between each pair of adjacent scales, the difference of these smoothed images is used to produce DoG responses. Local minima and maxima of scale-normalized DoG responses in the three dimensions – width, height and scale – are detected as potential key points. The final set of key points is selected from these by excluding the poor ones in areas of low contrast and on edge lines. For each key point, a descriptor is constructed. It consists of histograms of 8 orientations from 16 small image patches within the neighborhood of the key point, making the descriptor length 128.
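The scale-space extremum search can be sketched as follows. This illustrative fragment assumes the DoG responses have already been stacked into a 3-D array; contrast thresholding is crude and the edge-line rejection step is left out.

```python
import numpy as np

def scale_space_extrema(dog_stack, thresh=0.03):
    """Local extrema of a DoG stack over width, height and scale:
    a point is kept if it is the maximum or minimum of its 3x3x3
    neighbourhood. `dog_stack` has shape (num_scales, H, W)."""
    S, H, W = dog_stack.shape
    keypoints = []
    for s in range(1, S - 1):
        for h in range(1, H - 1):
            for v in range(1, W - 1):
                cube = dog_stack[s-1:s+2, h-1:h+2, v-1:v+2]
                c = dog_stack[s, h, v]
                if abs(c) > thresh and (c == cube.max() or c == cube.min()):
                    keypoints.append((s, h, v))
    return keypoints
```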

Facial feature point detection (FFPD) is an example of a crafted, very-high-level feature extraction scheme. There are many different FFPD algorithms specifically designed for tracking and analyzing faces in images. An FFPD algorithm finds the image coordinates corresponding to a set of points of a human face in an image using some lower-level features, e.g. SIFT descriptors. Such high-level feature extraction schemes are highly complex and utilize many filters, transforms, and heuristically derived selection schemes to achieve their output. The face point detector of Zhu and Ramanan (2012), which is used in Publication VI, utilizes HOG descriptors to find potential interest points and models different facial poses with trees.

2.4.4 Feature selection

The huge pool of different low-level analysis methods, i.e. features, available for image analysis calls for methods to select the best features for the application at hand. Using all the available features is not feasible due to the curse of dimensionality, discussed first by Bellman (1957). The term describes the difficulty of dealing with very high-dimensional data. Thus, to select the best-performing features or filters from the huge pool available, feature selection must be performed. Feature selection methods are generally divided into three categories, namely filter, wrapper and embedded methods (Kumar and Minz (2014)).

The filter methods are the simplest feature selection methods, mainly used as a preprocessing step to exclude the distinctly useless features (Kumar and Minz (2014)).

The filter methods evaluate each feature individually, not taking correlations between different features into account, and thus fail to reduce redundancy in the remaining feature set. The "filter" in this case is some particular way of computing a statistical score for the usefulness of each potential feature in describing the data at hand. Examples include the Chi-squared test, information gain and the correlation coefficient score. The features with the highest scores are then selected for use.
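A filter-method selection step using the correlation coefficient score might look like the sketch below. The name `filter_select` and the ranking by `argsort` are my own illustrative choices.

```python
import numpy as np

def filter_select(X, y, k):
    """Filter-method feature selection: score each feature column of X
    by its absolute Pearson correlation with the target y, and return
    the indices of the k highest-scoring features."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    scores = np.abs(Xc.T @ yc) / denom
    return np.argsort(scores)[::-1][:k]
```

Note that if two selected features are perfectly correlated with each other, both still receive a high score, illustrating the redundancy problem mentioned above.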

The idea in wrapper-type feature selection methods is to try out different feature sets by training the system with each subset of features and then selecting the best-performing set for use. The drawback of these methods is their high computational cost. Often greedy
