
Independent Component Analysis

Sparseness is somewhat related to the concept of statistical independence, which states that two variables are independent if and only if p(x, y) = p(x)p(y), i.e. the joint probability of the variables can be factorized into the product of the marginal densities. The relation may not be immediately obvious, so let us consider a simple example. In Fig. 3.5 we show a scatter plot of the joint density of two random variables. In the left hand plot the two variables are independent, whereas in the right hand plot they have been rotated (mixed), so the probability density function (pdf) can no longer be written as the product of the marginals. In accordance with the Central Limit Theorem, this has an interesting implication for the marginals: under certain regularity conditions, mixtures of non-Gaussian variables are always closer to Gaussian than the original variables.

Therefore, if the original variables have a sparse, or supergaussian, distribution, maximizing independence is equivalent to maximizing sparseness. In both cases we are looking for directions, or basis functions, that maximize nongaussianity.

Independent Component Analysis (ICA) is a method that attempts to recover these directions by maximizing some measure of nongaussianity.
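To make this effect concrete, the following short NumPy sketch mixes two sparse sources and measures the nongaussianity of sources and mixtures; the Laplacian sources, the 45° rotation, and the use of excess kurtosis as the measure of nongaussianity are illustrative choices made here, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent, sparse (supergaussian) sources, e.g. Laplacian.
s = rng.laplace(size=(2, 100_000))

# Mix (rotate) them: each mixture is a weighted sum of independent non-Gaussian variables.
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = A @ s

def excess_kurtosis(v):
    """Excess kurtosis: zero for a Gaussian, positive for supergaussian variables."""
    v = (v - v.mean()) / v.std()
    return np.mean(v ** 4) - 3.0

print("sources :", [round(excess_kurtosis(si), 2) for si in s])   # roughly 3 for a Laplacian
print("mixtures:", [round(excess_kurtosis(xi), 2) for xi in x])   # noticeably closer to 0
```

The marginals of the mixtures are still supergaussian, but markedly less so than the sources, which is exactly what makes maximizing nongaussianity a useful criterion for recovering the original directions.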

After ICA was first described [15, 60], it took only a few years until it was applied to natural images [9, 116, 55]. While ICA is very similar to sparse coding in some respects, there are a few important differences. In ICA we usually consider a complete model, which has as many basis functions as pixels, so the matrix of basis functions is invertible. Furthermore, we restrict the analysis to the noiseless case, i.e. there is no Gaussian reconstruction error and the first term in Eq. 3.3 vanishes [75].

Figure 3.5: Illustration of the connection between sparseness and independence. a) Two independent source variables aligned with the axes, shown with the histograms of their marginal distributions. b) Mixtures of the supergaussian sources: the two variables have been mixed together, so they now have dependencies. In this simple case, the dependency can be read directly off the scatter plot: if one variable takes on a high value, the other has a high chance of also having a strong negative or positive activation. The key point is that the marginal distributions have changed and have become less sparse, or more Gaussian.

Before we discuss the application of ICA to natural images, let us quickly review the basic ICA model and one possible way to estimate it. We will focus on the likelihood-based approach for estimating the ICA model, since it forms the basis for much of the work described later in this thesis.

To define ICA as a probabilistic model, we write the data as a mixture of sources, $\mathbf{x} = A\mathbf{s}$, and define the distribution of the data in terms of the densities of the independent sources

$$p_x(\mathbf{x}) = |\det W|\, p_s(W\mathbf{x}) = |\det W| \prod_i p_i(\mathbf{w}_i^T \mathbf{x}), \qquad (3.4)$$

where we assume that the mixing is invertible and $W = A^{-1}$ is the inverse of the mixing matrix, so the $\mathbf{w}_i^T$ are the rows of the inverse mixing, or filter, matrix $W$. We denote the pdf of the mixtures by $p_x$ and that of the sources by $p_s$; the $p_i$ denote the marginal distributions of the individual sources. For the first equality we have used a well-known result for the density of a linear transform, and for the second equality the independence of the sources.
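As a minimal illustration of the generative side of this model, the sketch below samples unit-variance Laplacian sources, mixes them, and evaluates the logarithm of Eq. (3.4) for a few data vectors; the dimensionality, the random mixing matrix, and the choice of Laplacian marginals are assumptions made here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2                                        # complete model: as many sources as mixtures
A = rng.normal(size=(n, n))                  # mixing matrix, assumed invertible
W = np.linalg.inv(A)                         # filter matrix; its rows are the filters w_i

s = rng.laplace(scale=1 / np.sqrt(2), size=(n, 10_000))   # unit-variance Laplacian sources
x = A @ s                                    # observed data, x = A s

def log_px(x, W):
    """Log-density of the data under Eq. (3.4), with unit-variance Laplacian marginals."""
    s_hat = W @ x                            # recovered sources
    log_pi = -np.sqrt(2) * np.abs(s_hat) - 0.5 * np.log(2)
    return np.log(np.abs(np.linalg.det(W))) + log_pi.sum(axis=0)

print(log_px(x[:, :5], W))                   # log p_x(x) for the first five samples
```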

Given $T$ samples of the data vector, denoted by $\mathbf{x}(t)$, we can now write the log-likelihood of the $\mathbf{w}_i$ as

$$\log p_x(\mathbf{x}(1),\ldots,\mathbf{x}(T)\,|\,W) = T \log|\det W| + \sum_{t=1}^{T} \sum_i \log p_i(\mathbf{w}_i^T \mathbf{x}(t)). \qquad (3.5)$$

In principle, estimating the model would require estimating not only the matrix $W$ but also the densities of the sources, $p_i$. This is a nonparametric estimation problem, and estimating the true densities, which tend to be strongly peaked at zero, can lead to problems with gradient optimization methods. Thus it is preferable to use smooth proxy distributions for the estimation of the filters. It turns out that if we know that all of the sources are supergaussian, we can plug in any supergaussian pdf for the sources and still obtain a consistent estimate of the filters $\mathbf{w}_i$ [56]. In cases where we are interested in the densities of the sources in addition to the filters, or in nonlinear models where the above does not hold, we can use a simple family of densities, such as the generalized normal distribution, to infer the shape of the marginals.
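A direct transcription of this likelihood, using the smooth log cosh proxy discussed below in place of the true source densities, might look as follows; the function and variable names are mine, chosen for illustration.

```python
import numpy as np

def ica_log_likelihood(X, W):
    """Log-likelihood of the filter matrix W under Eq. (3.5).

    X : data array of shape (n, T), one data vector x(t) per column.
    W : filter matrix of shape (n, n).
    The marginals are modelled by the smooth supergaussian proxy
    log p_i(s) = -2 log cosh(s), with normalization constants omitted.
    """
    T = X.shape[1]
    S = W @ X                                            # source estimates s(t) = W x(t)
    return T * np.log(np.abs(np.linalg.det(W))) - 2.0 * np.sum(np.log(np.cosh(S)))
```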

A supergaussian pdf that naturally comes to mind for the estimation, owing to its simplicity, is the Laplacian distribution, which we have already seen as the sparseness prior in the sparse coding model. Normalized to zero mean and unit variance, it is given by

$$\log p_i(s_i) = -\sqrt{2}\,|s_i| - \tfrac{1}{2}\log 2. \qquad (3.6)$$

However, the derivative of this density has a discontinuity at zero, so it is convenient to replace it by a smooth version, the logistic distribution, given by

$$\log p_i(s_i) = -2\log\cosh\!\left(\frac{\pi}{2\sqrt{3}}\, s_i\right) + \log\frac{\pi}{4\sqrt{3}}. \qquad (3.7)$$

For convenience the various normalization factors are usually omitted, and the density that is used is just $\log p_i(s_i) = -2\log\cosh s_i$.
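The practical difference between the two choices is easiest to see in their derivatives, which are what enter the gradient below. The small comparison here (with sample points chosen for illustration) contrasts the discontinuous sign function of the Laplacian with the smooth tanh of the log cosh proxy.

```python
import numpy as np

s = np.linspace(-1.0, 1.0, 9)

# Laplacian log-density, Eq. (3.6): derivative -sqrt(2) * sign(s), discontinuous at zero.
dlogp_laplace = -np.sqrt(2) * np.sign(s)

# Smooth proxy log p(s) = -2 log cosh(s): derivative -2 tanh(s), smooth everywhere.
dlogp_logcosh = -2.0 * np.tanh(s)

for si, dl, dc in zip(s, dlogp_laplace, dlogp_logcosh):
    print(f"s = {si:+.2f}   Laplacian: {dl:+.3f}   log cosh: {dc:+.3f}")
```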

The ICA model is estimated by taking the gradient of the log-likelihood w.r.t. the filter matrix $W$. Substituting the derivative of the marginal distributions, $\frac{\partial}{\partial u}\log\cosh(u) = \tanh(u)$, we obtain the gradient as

$$\frac{\partial}{\partial W}\log p_x(\mathbf{x}\,|\,W) = T\,(W^T)^{-1} - 2\sum_{t=1}^{T}\tanh(W\mathbf{x}(t))\,\mathbf{x}(t)^T, \qquad (3.8)$$

and maximum likelihood estimation proceeds by taking gradient steps of the form

$$W \leftarrow W + \mu\,\frac{\partial}{\partial W}\log p_x(\mathbf{x}\,|\,W), \qquad (3.9)$$

Figure 3.6: a) ICA basis functions and b) ICA filters estimated using the FastICA algorithm on 16×16 pixel natural image patches. The dimensionality of the data has been reduced to 120 dimensions by PCA. For white data the filters are equal to the basis functions, i.e. $A = W^T$, but when projecting back to the original, non-white space, the whitening matrix is absorbed into the filters, so the high frequencies are emphasized, whereas the basis functions have the same power spectrum as the image patches. Note that the appearance of the sparse coding basis is in between the ICA filters and basis functions, since there the whitening is not part of the model.

where the step size is given by µ, a small constant. This has come to be known as the Bell-Sejnowski algorithm [8]. There are various ways to make the estimation of this model computationally more efficient, for example using a modified gradient update rule [13] or with a fixed-point algorithm like FastICA [48]. We will not go into the details here, since most of the work of this thesis is based on the gradient algorithm in the simple form presented above.
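The gradient algorithm in this simple form can be written down in a few lines. The sketch below is one possible implementation on whitened data; the whitening step, the step size, the iteration count, and the toy two-source example are assumptions made for illustration rather than the exact procedure used in the thesis.

```python
import numpy as np

def fit_ica_gradient(X, n_iter=5000, mu=0.01, seed=0):
    """Estimate ICA filters by plain gradient ascent, Eqs. (3.8)-(3.9).

    X is assumed to be centered and whitened, with shape (n, T).
    The marginals are modelled by the log cosh proxy, whose derivative
    gives the tanh nonlinearity of Eq. (3.8).
    """
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = np.eye(n) + 0.1 * rng.normal(size=(n, n))      # initial filter matrix
    for _ in range(n_iter):
        S = W @ X                                      # current source estimates
        # Gradient of the log-likelihood, Eq. (3.8), divided by T for a per-sample step.
        grad = np.linalg.inv(W).T - (2.0 / T) * np.tanh(S) @ X.T
        W = W + mu * grad                              # gradient step, Eq. (3.9)
    return W

# Toy usage: recover a random mixing of two Laplacian sources.
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 20_000))                      # true independent sources
X = rng.normal(size=(2, 2)) @ S                        # mixed (observed) data
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))                       # whitening via the eigendecomposition
X_white = (E / np.sqrt(d)) @ E.T @ X
S_hat = fit_ica_gradient(X_white) @ X_white            # estimated sources
# Up to sign and permutation, each estimated source should match one true source.
print(np.round(np.corrcoef(np.vstack([S_hat, S]))[:2, 2:], 2))
```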

Applied to natural image data, ICA produces basis functions (i.e. the columns of the matrix $A$) that are very similar to the sparse coding basis functions or the simple cells of primary visual cortex. The similarity should not be surprising, however, because ICA is a special case of sparse coding [89]. The advantage of ICA here is that estimating the model is much easier and faster, since the sources or independent components can be computed in closed form and do not have to be estimated by gradient descent. The filters that give the independent components can be plotted in the same way as basis functions, as shown in Fig. 3.6.

While it is not possible to draw a clear distinction between independence and sparseness in the linear models we have considered so far, this is not the case in general. In the neural processing hierarchy, there is no reason why higher levels of abstraction should have increasingly sparse encodings, and in fact there is no evidence that this is the case [5]. On the other hand, it is conceivable that a continued maximization of independence, to encode for different objects, persons, etc., may be useful even in very high-level representations, so it is important not to confuse sparseness with independence [80, 110].