
3.5 Non-linear Manifold Learning

3.5.3 Local Approaches

Local non-linear manifold learning algorithms attempt to resolve the embedding to the low-dimensional manifold through a graph representation in which certain local neighborhood information of the input data is preserved. Local approaches are advantageous for embedding problems where the data lies on a space whose local geometry is close to Euclidean but whose global geometry is not. Local approaches usually involve sparse matrix computations and are, to a certain extent, computationally efficient. Several local non-linear manifold learning algorithms have been proposed, but the most popular ones are Kernel Principal Component Analysis (KPCA) [144], Locally Linear Embedding (LLE) [141] and Laplacian Eigenmaps (LE) [14].

Kernel Principal Component Analysis: Kernel Principal Component Analysis (KPCA) [144] extends classical linear PCA beyond the linear manifold assumption. Similar to PCA, KPCA is essentially an eigenvalue-decomposition based approach, but rather than explicitly evaluating data in the intrinsic space, it exploits pairwise similarities computed by so-called kernel functions.

In essence, the theoretical grounds of KPCA are built on the family of kernel methods [25] that allow us to handle more complex data manifolds with intrinsic non-linearities. The main idea in KPCA is to transform data containing non-linearities to a higher dimensional space, through a non-linear mapping function, where data dependencies become linear. Once the data is mapped to such a space, classical PCA is applied to perform dimensionality reduction. In other words, KPCA can be seen as an attempt to adapt non-linear problems such that they meet the requirements of classical PCA.

To this end, KPCA exploits the so-called kernel trick [153] to compute data covariances, by which it also avoids computing the mapping explicitly. The input to KPCA is the data vectors and a positive semi-definite kernel function to compute pairwise similarities. The output of KPCA, as with PCA, is the coordinates of the intrinsic space $\mathcal{Y} \subset \mathbb{R}^d$. Indeed, the type of kernel is a significant factor in determining the quality of the intrinsic coordinates; KPCA with a linear kernel is nothing but classical PCA. Popular kernels include the Gaussian, polynomial, and hyperbolic tangent kernels.

Let $\kappa(\cdot,\cdot)$ be a positive semi-definite kernel on $\mathcal{X} \times \mathcal{X}$; that is, for every $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$, $\kappa(\mathbf{x}_i,\cdot)$ and $\kappa(\cdot,\mathbf{x}_j)$ are real-valued functions when $\mathbf{x}_i$ and $\mathbf{x}_j$ are fixed, respectively.

Then, $\kappa(\cdot,\cdot)$ can be viewed as a real-valued function, denoted by $\Psi$, that implies a transformation mapping the input space $\mathcal{X}$ to a high-dimensional dot-product kernel (Hilbert) space $\mathcal{H}$ [146]:
$$
\Psi : \mathcal{X} \mapsto \mathcal{H}, \qquad \mathbf{x} \mapsto \kappa(\cdot,\mathbf{x}). \tag{3.18}
$$

Without loss of generality, assume $\mathcal{H}$ is a zero-mean space, $\sum_{i=1}^{N}\Psi(\mathbf{x}_i) = 0$. Thus, KPCA obtains the intrinsic manifold $\mathcal{Y} \subset \mathbb{R}^d$ in the kernel space by applying PCA to the covariance matrix in $\mathcal{H}$:
$$
\bar{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} \Psi(\mathbf{x}_i)\Psi(\mathbf{x}_i)^T, \tag{3.19}
$$

that results in the eigen-decomposition problem:
$$
\bar{\Sigma} = \mathbf{V}\Lambda\mathbf{V}^T. \tag{3.20}
$$

The final solution is obtained via the kernel trick, by formulating Equation 3.20 in terms of dot products [153]:
$$
N\mathbf{K} = \bar{\mathbf{V}}\bar{\Lambda}\bar{\mathbf{V}}^T, \tag{3.21}
$$


where $\bar{\Lambda}$ and $\bar{\mathbf{V}}$ are, respectively, the diagonal matrix of the ordered eigenvalues and the matrix of eigenvectors of the kernel matrix $\mathbf{K}$. In this way, the projection of the input data onto the intrinsic coordinates is obtained by:
$$
\mathbf{Y} = \mathbf{K}\bar{\mathbf{V}}. \tag{3.22}
$$

In theory, KPCA is simple and can be readily applied to non-linear problems. However, it lacks an intuitive geometric interpretation, and as a result, the choice of the kernel function is not straightforward [97]. Moreover, standard KPCA requires building a kernel matrix between all pairs of data points, and therefore its usage may be limited for large real-world datasets.
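As an illustration of the computations above, the following is a minimal NumPy sketch of KPCA with a Gaussian kernel. The function name, the bandwidth parameter `gamma`, and the explicit double-centering of $\mathbf{K}$ (used in place of the idealized zero-mean assumption on $\mathcal{H}$) are illustrative choices rather than part of the formulation above.

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Minimal KPCA sketch: Gaussian kernel, centering, eigendecomposition, projection."""
    # Pairwise squared Euclidean distances and the Gaussian kernel matrix K
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * sq_dists)

    # Double-centre K so the implicitly mapped data is zero-mean in feature space
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition of the centred kernel matrix (the kernel trick)
    eigvals, eigvecs = np.linalg.eigh(K_c)           # ascending order
    top = np.argsort(eigvals)[::-1][:d]              # indices of the d largest
    V_bar = eigvecs[:, top] / np.sqrt(eigvals[top])  # normalised eigenvectors

    # Project the input data onto the intrinsic coordinates, Y = K V_bar (Eq. 3.22)
    return K_c @ V_bar
```

For an $N \times D$ data matrix `X`, a call such as `kernel_pca(X, d=2, gamma=0.5)` returns an $N \times 2$ matrix of intrinsic coordinates.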

Locally Linear Embedding: Locally Linear Embedding (LLE) [141] is an unsupervised non-linear manifold learning algorithm that aims to restore intrinsic manifolds by preserving local data structures. The local structure in LLE is constructed through a linear combination of the $k$-nearest neighbors, and the embedding problem is resolved through an eigenvector-based optimization, provided that the local structure of every data point is preserved. This follows the geometric intuition that a topological manifold can be represented by a collection of overlapping coordinate charts and that any smooth embedding preserves the local relationships.

Unlike Local Principal Component Analysis (L-PCA) and ISOMAP, which reconstruct the data manifold only locally or only globally, LLE takes a hybrid approach combining a local and a global data profile. It is local in that it attempts to preserve the local data structures, and it is global in that it finds a global coordinate system through a single eigenvalue decomposition.

The input to LLE is the data vectors $\{\mathbf{x}_i\}_{i=1}^{N}$ sampled from a non-linear space $\mathcal{X} \subset \mathbb{R}^D$. LLE assumes that the neighborhoods of the data manifold can be captured in the high-dimensional space and that the geometric structure of the data manifold can be represented by an undirected graph. The output of LLE is the coordinates of the points $\{\mathbf{y}_i\}_{i=1}^{N}$ in the intrinsic space $\mathcal{Y} \subset \mathbb{R}^d$.

LLE utilizes the $k$-nearest-neighbor model $\mathcal{N}_k(\mathbf{x}_i)$ to perform data reconstruction in the embedded intrinsic space. It assumes that each point $\mathbf{x}_i$ can be reconstructed by a weighted linear combination of the points in its neighborhood:

$$
\mathbf{x}_i \approx \sum_{j \in \mathcal{N}(i)} w_{ij}\mathbf{x}_j, \tag{3.23}
$$

where $\{w_{ij}\}_{j=1}^{k}$ are the reconstruction weight coefficients. LLE obtains the optimal reconstruction weights by minimizing the reconstruction cost:
$$
\min_{\mathbf{w}_i} \sum_{i=1}^{N} \Big\| \mathbf{x}_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\mathbf{x}_j \Big\|^2, \tag{3.24}
$$
subject to the sparseness constraint that each point is reconstructed only from its neighbors:
$$
w_{ij} = 0 \quad \text{if } \mathbf{x}_j \notin \mathcal{N}_k(\mathbf{x}_i), \tag{3.25}
$$
and the translation invariance constraint:
$$
\sum_{j \in \mathcal{N}(i)} w_{ij} = 1. \tag{3.26}
$$
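To make the weight-fitting step in Equations 3.23–3.26 concrete, the following is a minimal NumPy sketch that solves for the reconstruction weights of a single point via its local Gram matrix; the function name and the regularization term (commonly added for numerical stability when $k > D$) are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def lle_weights(x_i, neighbors, reg=1e-3):
    """Reconstruction weights for one point (Eqs. 3.23-3.26); neighbors has shape (k, D)."""
    Z = neighbors - x_i                     # local differences (x_j - x_i)
    G = Z @ Z.T                             # local Gram matrix, shape (k, k)
    # Small regulariser for stability when G is singular (e.g. when k > D)
    G = G + reg * np.trace(G) * np.eye(G.shape[0])
    w = np.linalg.solve(G, np.ones(G.shape[0]))
    return w / w.sum()                      # translation invariance: sum_j w_ij = 1
```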

Applying the obtained optimal weights in the intrinsic space $\mathcal{Y}$, LLE finds the optimal embedded coordinates by minimizing the reconstruction cost:
$$
\min_{\mathbf{y}_i} \sum_{i=1}^{N} \Big\| \mathbf{y}_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\mathbf{y}_j \Big\|^2, \tag{3.27}
$$

subject to the centering constraint, which eliminates the degree of freedom that allows $\mathbf{y}_i$ to be translated by a constant amount:

$$
\sum_{i=1}^{N} \mathbf{y}_i = \mathbf{0}, \tag{3.28}
$$

and the unit covariance constraint to avoid degenerate solutions:
$$
\frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i \mathbf{y}_i^T = \mathbf{I}_{d \times d}. \tag{3.29}
$$
Denote by $\mathbf{W}$ the matrix of the reconstruction weights, $w_{ij} = [\mathbf{W}]_{ij}$. We can rewrite the quadratic optimization in Equation 3.27 in matrix form as

$$
\|(\mathbf{I} - \mathbf{W})\mathbf{Y}\|_F^2 = \mathbf{Y}^T\mathbf{M}\mathbf{Y}, \tag{3.30}
$$
where $\mathbf{M} = (\mathbf{I} - \mathbf{W})^T(\mathbf{I} - \mathbf{W})$.

LLE obtains $\mathbf{Y}$ by solving the eigenvalue decomposition problem:
$$
\mathbf{M}\mathbf{Y} = \mathbf{Y}\Lambda, \tag{3.31}
$$
where $\Lambda$ is a diagonal matrix of Lagrange multipliers. To this end, the optimal $\mathbf{Y}$ is obtained from the $d+1$ eigenvectors of $\mathbf{M}$ corresponding to the $d+1$ smallest ordered eigenvalues ($\lambda_0 \le \lambda_1 \le \dots \le \lambda_d$), where the first eigenvector is constant and is excluded.
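In practice this pipeline is available off-the-shelf, for instance through scikit-learn's `LocallyLinearEmbedding`; the parameter values in the sketch below are illustrative only.

```python
from sklearn.manifold import LocallyLinearEmbedding

# Illustrative choices: k = 10 neighbours and intrinsic dimension d = 2
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method='standard')
Y = lle.fit_transform(X)  # X: (N, D) data matrix, Y: (N, 2) intrinsic coordinates
```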

LLE is a non-iterative global optimizer that does not get stuck in local optima [166]. However, LLE has a number of limitations. The eigenvectors computed in Equation 3.31 are not guaranteed to be independent. LLE cannot deal with unevenly or sparsely sampled data. Moreover, as the weighting scheme in the original LLE is relatively sensitive to small changes in the neighborhood, it can easily be affected by noisy data and by the neighborhood graph meta-parameters.

Nevertheless, LLE has gained great popularity and become the backbone of many manifold learning algorithms [14, 187, 47]. In particular, the well-known Stochastic Neighbor Embedding (SNE) [47] algorithm and its variants, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) [111] and trust-region Stochastic Neighbor Embedding (tr-SNE) [126], can be seen as extensions of LLE that exploit a stochastic framework for finding the intrinsic space of lower dimension.


Laplacian Eigenmaps: Laplacian Eigenmaps (LE) [14] is a non-linear manifold learning algorithm that relies on spectral graph theory. LE is similar to LLE, but rather than characterizing the local structure of the data manifold by geometric intuition, it models it through spectral graph theory. In LE, the data manifold is approximated by a neighborhood graph and, in analogy to the Laplace-Beltrami operator [35] on manifolds, the embedding is provided by the Laplacian of the adjacency graph.

The input to LE is the data vectors $\{\mathbf{x}_i\}_{i=1}^{N}$ sampled from a non-linear manifold $\mathcal{X} \subset \mathbb{R}^D$, and the output is the coordinates of the points $\{\mathbf{y}_i\}_{i=1}^{N}$ in the intrinsic space $\mathcal{Y} \subset \mathbb{R}^d$. LE makes assumptions similar to those of LLE.

LE follows three main computational steps. It starts by building a nearest-neighbor graph representation of the data, in which the graph edge weights $w_{ij} = [\mathbf{W}]_{i,j}$ are determined by the heat kernel (Gaussian kernel), $w_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / t)$. Given the weight matrix $\mathbf{W}$, LE then computes the embedded coordinates by minimizing the following cost function:
$$
\sum_{i,j} (y_i - y_j)^2 w_{ij}. \tag{3.32}
$$

By this means, the cost function penalizes any instance where a pair of points that are close in the ambient high-dimensional space is mapped to far distant points in the intrinsic low-dimensional space. Setting $\mathbf{D}$ to be the diagonal degree matrix with $D_{ii} = \sum_j w_{ij}$ and $\mathbf{L} = \mathbf{D} - \mathbf{W}$, the minimization cost function can be written as
$$
\sum_{i,j} (y_i - y_j)^2 w_{ij} = 2\,\mathbf{y}^T\mathbf{L}\mathbf{y}, \tag{3.33}
$$
which is minimized subject to the scale constraint
$$
\mathbf{y}^T\mathbf{D}\mathbf{y} = 1. \tag{3.34}
$$
The general embedding problem of a graph onto a $d$-dimensional space is given by
$$
\min_{\mathbf{Y}} \operatorname{tr}(\mathbf{Y}^T\mathbf{L}\mathbf{Y}) \quad \text{subject to} \quad \mathbf{Y}^T\mathbf{D}\mathbf{Y} = \mathbf{I}. \tag{3.35}
$$
As in the case $d = 1$, the solution to the minimization problem is given by the generalized eigenvalue problem:
$$
\mathbf{L}\mathbf{Y} = \lambda\mathbf{D}\mathbf{Y}. \tag{3.36}
$$


Similar to LLE, the first eigenvector of $\mathbf{L}$ is excluded, and the optimal solution for $\mathbf{Y}$ is given by the $d+1$ eigenvectors corresponding to the $d+1$ smallest ordered eigenvalues.
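The following is a minimal NumPy/SciPy sketch of the LE pipeline described above: heat-kernel weights on a $k$-nearest-neighbor graph, the degree matrix $\mathbf{D}$, the Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{W}$, and the generalized eigenvalue problem of Equation 3.36. The function name, the neighborhood size $k$, and the kernel width $t$ are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, d=2, k=10, t=1.0):
    """Minimal Laplacian Eigenmaps sketch following Eqs. 3.32-3.36."""
    # k-NN graph whose entries hold Euclidean distances between neighbours
    A = kneighbors_graph(X, k, mode='distance').toarray()
    A = np.maximum(A, A.T)                        # symmetrise the adjacency
    W = np.where(A > 0, np.exp(-A**2 / t), 0.0)   # heat-kernel edge weights

    D = np.diag(W.sum(axis=1))                    # degree matrix
    L = D - W                                     # graph Laplacian

    # Generalised eigenproblem L y = lambda D y (Eq. 3.36), ascending eigenvalues
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, 1:d + 1]                    # drop the constant eigenvector
```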

Sharing the same assumptions, LE has advantages and disadvantages similar to those of LLE.

Several extensions to the original LE have been proposed [46, 154, 84, 114]. Hessian Eigenmaps (HE) [46] extends LE by replacing the Laplacian operator with the Hessian operator. The Hessian operator is used to address a deficiency of the Laplacian, as it guarantees asymptotic convergence under a broad range of assumptions.

The key importance of LE lies in its solid foundation in spectral graph theory. LE shares similar theoretical grounds with many other machine learning algorithms, such as spectral clustering [73, 42] and graph partitioning [21]. Indeed, LE is one of the most popular and well-studied non-linear manifold learning algorithms.

Local Tangent Space Alignment: Local Tangent Space Alignment (LTSA) [187] is a non-linear manifold learning algorithm that relies on aligning the manifold's tangent bundle to recover the intrinsic manifold. LTSA differs from the previous non-linear algorithms in two main ways. First, it characterizes the embedded manifold through overlapping local tangent spaces, rather than directly manipulating the manifold surface. Second, it adopts a different approach to performing the eigenvalue decomposition.

LTSA has inputs, outputs, and assumptions similar to those of LE and LLE. There are three main computational steps in LTSA. The first step is to compute local tangent coordinates of the embedded intrinsic manifold, followed by computing a local coordinate transformation matrix. The second step is to build and align a global transformation matrix. The third step is to obtain the coordinate points in the intrinsic manifold.

Given a point $\mathbf{x}_i \in \mathcal{X}$ and its associated $k$-nearest neighbors $\mathcal{N}_k(\mathbf{x}_i)$, LTSA computes the corresponding orthogonal projections $\{\tilde{\mathbf{p}}_{ij}\}_{j \in \mathcal{N}_k(\mathbf{x}_i)}$ on the tangent space of $\mathcal{X}$ at $\bar{\mathbf{x}}_i$, denoted $T_{\bar{\mathbf{x}}_i}\mathcal{X}$, by:
$$
\tilde{\mathbf{p}}_{ij} = \{\mathbf{x}_j - \bar{\mathbf{x}}_i\}_{j \in \mathcal{N}_k(\mathbf{x}_i)}, \tag{3.37}
$$
where $\bar{\mathbf{x}}_i$ is the center of the set including $\mathbf{x}_i$ and its neighboring points. Denote by $\tilde{\mathbf{P}}_i \in \mathbb{R}^{D \times k}$ the matrix containing the projections $\tilde{\mathbf{p}}_{ij}$, and rewrite Equation 3.37 in matrix form:

$$
\tilde{\mathbf{P}}_i = \mathbf{X}_i\mathbf{H}, \tag{3.38}
$$

where $\mathbf{X}_i$ is the matrix containing the neighbors of $\mathbf{x}_i$ and $\mathbf{H} = \mathbf{I}_k - \frac{1}{k}\mathbf{1}_k\mathbf{1}_k^T \in \mathbb{R}^{k \times k}$ is the centering matrix. LTSA applies PCA to $\tilde{\mathbf{P}}_i$ via the Singular Value Decomposition (SVD):
$$
\mathbf{X}_i\mathbf{H} = \mathbf{U}\mathbf{S}\mathbf{V}^T, \tag{3.39}
$$
which provides the coordinates of the local tangent space of $\mathcal{Y}$ at $\bar{\mathbf{y}}_i$, $T_{\bar{\mathbf{y}}_i}\mathcal{Y} \subset \mathbb{R}^d$, given by the $d$ left singular vectors $\mathbf{U}_d$ corresponding to the rank-$d$ truncated SVD:
$$
\mathbf{X}_i\mathbf{H} = \mathbf{U}_d\mathbf{S}_d\mathbf{V}_d^T. \tag{3.40}
$$
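As a small worked example of the local step in Equations 3.37–3.40, the sketch below centers a neighborhood and extracts its $d$-dimensional tangent coordinates from a truncated SVD. Points are stored as rows here (transposed relative to the column convention used above), and the function name is illustrative.

```python
import numpy as np

def local_tangent_coordinates(X_i, d):
    """Local LTSA step (Eqs. 3.37-3.40); X_i holds the k neighbours of x_i as rows (k, D)."""
    P = X_i - X_i.mean(axis=0)                 # centre the neighbourhood (Eq. 3.37)
    U, S, Vt = np.linalg.svd(P, full_matrices=False)
    # Coordinates of the neighbours in the d-dimensional local tangent space
    Theta = U[:, :d] * S[:d]                   # equivalently P @ Vt[:d].T
    return Theta
```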


Let $f$ be a parametrization (embedding) from $T_{\bar{\mathbf{y}}_i}\mathcal{Y}$ to $\mathcal{X}$ such that $\mathbf{x}_j = f(\mathbf{y}_j)$. By applying a first-order Taylor expansion in the neighborhood of $\bar{\mathbf{y}}_i$, one obtains:
$$
\mathbf{x}_j - \bar{\mathbf{x}}_i = \nabla f(\bar{\mathbf{y}}_i)^T(\mathbf{y}_j - \bar{\mathbf{y}}_i), \tag{3.41}
$$
which, collected over all neighbors, turns into the matrix form:
$$
\mathbf{X}_i\mathbf{H} = \nabla f(\bar{\mathbf{x}}_i)\mathbf{Y}_i\mathbf{H}. \tag{3.42}
$$

Using Equation 3.40, we can rewrite $\mathbf{Y}_i\mathbf{H}$ as:
$$
\mathbf{Y}_i\mathbf{H} = (\nabla f(\bar{\mathbf{x}}_i))^{-1}\mathbf{X}_i\mathbf{H} = (\nabla f(\bar{\mathbf{x}}_i))^{-1}\mathbf{U}_d\mathbf{S}_d\mathbf{V}_d^T. \tag{3.43}
$$
Denote $\mathbf{L} = (\nabla f(\bar{\mathbf{x}}_i))^{-1}\mathbf{U}_d$ and $\Theta_i = \mathbf{S}_d\mathbf{V}_d^T$. Then it follows that:
$$
\mathbf{Y}_i\mathbf{H} = \mathbf{L}\Theta_i, \tag{3.44}
$$
and hence $\mathbf{L} = \mathbf{Y}_i\mathbf{H}\mathbf{V}_d\mathbf{S}_d^{-1}$ gives the local coordinate relations in $T_{\bar{\mathbf{y}}_i}\mathcal{Y}$. From these, LTSA obtains the optimal embedded coordinates through a cost function governing the global alignment.

LTSA converts the local relations for all $\mathbf{Y}_i$ instances into a single global model. It achieves this by extending the $k \times k$ matrix $\mathbf{W}_i$ to an $N \times N$ matrix $\mathbf{K}$ that defines a single global alignment kernel matrix. Note that $\mathbf{K}$ is defined as a symmetric positive-definite matrix. This results in the non-iterative LTSA cost function:

$$
\min_{\mathbf{Y}:\, \mathbf{Y}\mathbf{Y}^T = \mathbf{I}} \|\mathbf{Y}\mathbf{K}\|_F^2, \tag{3.48}
$$
which can be solved through the eigenvalue decomposition of $\mathbf{K}$.
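For reference, the full LTSA pipeline is also available off-the-shelf; scikit-learn exposes it as a variant of its LLE solver. The parameter values below are illustrative only.

```python
from sklearn.manifold import LocallyLinearEmbedding

# LTSA via scikit-learn; k = 12 neighbours and d = 2 are illustrative choices
ltsa = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='ltsa')
Y = ltsa.fit_transform(X)  # X: (N, D) data matrix, Y: (N, 2) aligned coordinates
```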

Compared to LLE and LE, LTSA has significantly lower computational complexity, as it employs eigenvalue decomposition on local data patches rather than on the entire dataset. However, the performance of LTSA depends on the distribution pattern and the intrinsic linearity of the data clusters.