

The class-conditional density of each class is modeled as a multivariate Gaussian,

\[
p(\mathbf{x} \mid y) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y)
= \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_y|^{1/2}}
\exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_y)^{\mathsf{T}} \boldsymbol{\Sigma}_y^{-1} (\mathbf{x} - \boldsymbol{\mu}_y) \right). \qquad (2.20)
\]

Denoting the prior probability of the k-th class as p(y = k) = π_k, the posterior probability is found using Bayes' theorem

\[
p(y = k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = k)\, p(y = k)}{p(\mathbf{x})}
= \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}. \qquad (2.21)
\]

A slightly more sophisticated version of this classifier is used in Publications III and IV to detect audio segments containing speech. More precisely, in that case the class-dependent densities p(x|y) are defined not as single Gaussian densities but as weighted sums of Gaussian densities.
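To make Equations (2.20)–(2.21) concrete, the following minimal Python sketch evaluates the posterior class probabilities of a Gaussian generative classifier. The function names (gaussian_pdf, posterior) and the toy two-class data are illustrative assumptions, not taken from the thesis or its publications.

```python
# Minimal sketch of the generative classifier in Eqs. (2.20)-(2.21):
# Gaussian class-conditional densities combined with class priors via Bayes' theorem.
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x | mu, cov), Eq. (2.20)."""
    d = x.shape[0]
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

def posterior(x, priors, means, covs):
    """Posterior class probabilities p(y = k | x), Eq. (2.21)."""
    joint = np.array([pi * gaussian_pdf(x, mu, cov)
                      for pi, mu, cov in zip(priors, means, covs)])
    return joint / joint.sum()

# Toy example with two classes in 2-D
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
print(posterior(np.array([1.8, 2.1]), priors, means, covs))
```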

One can see that, in contrast to empirical risk minimization, a loss function is not used in this approach to find a decision rule. Advantages and disadvantages of the density estimation approach are summarized in Table 2.2.

Table 2.2: Advantages and disadvantages of density estimation.

Benefits:
• If the estimated density is the "true" model of the data, then this approach is optimal.
• Training is separated from prediction. There is no need to re-train the model if the loss function changes.

Drawbacks:
• If the model is poorly fitted, the prediction can be highly inaccurate.

2.1.2 Unsupervised learning

In contrast to supervised learning, unsupervised learning aims at discovering hidden structure in the data based on some a priori assumptions but without access to any labels or response variables. The two most commonly known unsupervised learning tasks include dimensionality reduction and clustering. The former aims at finding a compact representation of data points while retaining the most relevant information.

Therefore, it can be viewed as a form of lossy compression. The clustering task, in turn, aims at discovering groups of similar objects in a dataset.

Dimensionality reduction

Given an input object represented by a d-dimensional vector x = (x^(1), x^(2), ..., x^(d))^T, dimensionality reduction aims at finding a vector of lower dimensionality s = (s^(1), s^(2), ..., s^(m))^T, where m < d, so that as much information as possible about the original data is preserved. This aim can be formalized within an encoder-decoder framework [24]. An encoder is a mapping from the original feature space to some lower-dimensional space that assigns a compact representation s = encoder(x) to any given feature vector x. The decoder is used to obtain a reconstruction x_rec = decoder(s) from the encoding s of the original feature vector x. These mappings are found by minimizing some loss function ℓ(·,·) that measures dissimilarity between a feature vector x and its reconstruction x_rec:

\[
\sum_{i=1}^{N} \ell\big(\mathbf{x}_i, \mathrm{decoder}_{\theta}(\mathrm{encoder}_{\phi}(\mathbf{x}_i))\big) \rightarrow \min_{\phi,\, \theta}, \qquad (2.22)
\]

where φ and θ are parameters of the encoder and decoder, respectively. In other words, the mapping decoder(encoder(·)), usually referred to as an autoencoder [25] in the context of artificial neural network models, approximates the identity mapping.
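The optimization in Equation (2.22) can be illustrated with a small NumPy sketch in which both encoder and decoder are linear maps trained by plain gradient descent. The parameter names (W_enc, W_dec), the learning rate, and the synthetic data are assumptions made for illustration, not the setup used in the thesis.

```python
# Sketch of Eq. (2.22) with a linear encoder and decoder trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                # N = 200 samples, d = 10 features
m, lr = 3, 0.01                               # embedding size and step size (illustrative)

W_enc = rng.normal(scale=0.1, size=(m, 10))   # phi: encoder parameters
W_dec = rng.normal(scale=0.1, size=(10, m))   # theta: decoder parameters

for _ in range(5000):
    S = X @ W_enc.T                           # s = encoder(x) for every row of X
    X_rec = S @ W_dec.T                       # x_rec = decoder(s)
    err = X_rec - X                           # reconstruction error
    # gradients of the squared-error loss, scaled by 1/N for a stable step size
    grad_dec = 2 * err.T @ S / len(X)
    grad_enc = 2 * (err @ W_dec).T @ X / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print("final summed reconstruction error:", np.sum(err ** 2))
```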

Dimensionality reduction is used to combat the curse of dimensionality [26], a problem that occurs when the number of features, d, is large relative to the training set size. As the dimensionality increases, one needs larger datasets to achieve satisfactory generalization ability of a learning algorithm, as some feature combinations may never occur in the training set (a similar argument holds for real-valued data).

Perhaps the most well-known dimensionality reduction technique is principal component analysis (PCA) [27]. It seeks an orthogonal matrix U ∈ R^(d×m) to define encoding and decoding mappings as s = encoder(x) = U^T x and x_rec = decoder(s) = Us, respectively. This is achieved by minimizing the mean squared error, ℓ_mse(x, x̂) = ||x − x̂||^2:

\[
\sum_{i=1}^{N} \big\| \mathbf{x}_i - \mathbf{U}\mathbf{U}^{\mathsf{T}} \mathbf{x}_i \big\|^2 \rightarrow \min_{\mathbf{U}} \qquad (2.23)
\]
\[
\text{s.t. } \mathbf{U}^{\mathsf{T}} \mathbf{U} = \mathbf{I}. \qquad (2.24)
\]

Assuming that the data is centered, that is, the empirical mean of the training set equals zero, the solution to this problem is the matrix having as columns the eigenvectors corresponding to the m largest eigenvalues of the empirical covariance matrix C = ∑_{i=1}^{N} x_i x_i^T. PCA falls into the category of linear dimensionality reduction methods because the encoding mapping is linear [28].
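A minimal NumPy sketch of this solution follows, assuming centered data and using illustrative variable names: it builds the empirical covariance matrix, keeps the top-m eigenvectors as columns of U, and applies the resulting encoder-decoder pair of Equations (2.23)–(2.24).

```python
# Sketch of PCA as an encoder-decoder pair: U holds the m leading eigenvectors
# of the empirical covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # toy data, N = 500, d = 5
X = X - X.mean(axis=0)                                   # center the data
m = 2

C = X.T @ X                                    # empirical covariance matrix (unscaled)
eigvals, eigvecs = np.linalg.eigh(C)           # eigh returns ascending eigenvalues
U = eigvecs[:, np.argsort(eigvals)[::-1][:m]]  # top-m eigenvectors as columns

S = X @ U            # encoder: s = U^T x for every row of X
X_rec = S @ U.T      # decoder: x_rec = U s
print("mean squared reconstruction error:", np.mean(np.sum((X - X_rec) ** 2, axis=1)))
```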

Dimensionality reduction can also be seen as part of representation learning [29], a class of techniques that aim at finding representations of the data that make it easier to extract useful information for a learning algorithm. Typically, this is achieved by learning a mapping (encoder) from raw measurements or low-level features to some fixed-dimensional vector space. Elements of this space, known as embeddings, are then used either for visualization or as an input to an algorithm to solve a separate learning task. More often, however, the term embeddings is used specifically when similar input objects are expected to have similar representations in the embedding space. One example is word embeddings [30], where a vector of real numbers represents each word from a fixed vocabulary. Other examples include face [31] and speaker [32] embeddings, where face images and speech utterances, respectively, are embedded into a vector space such that any two representations of inputs corresponding to the same person are characterized by a small distance and representations of different individuals are characterized by a large distance.

One can point out that learning such a mapping requires labeled data, so learning embeddings could be categorized as a supervised learning technique. But since learning representations is rarely the final goal, but rather an intermediate step in solving another applied problem, learning embeddings is usually categorized as an unsupervised learning task.

One broadly used technique to learn embeddings is linear discriminant analysis (LDA) [28]. It aims at projecting the data into a lower-dimensional space in such a way that separation between classes is maximized. The first step in LDA computes the between-class, Σ_b, and within-class, Σ_w, covariance matrices defined as follows:

\[
\boldsymbol{\Sigma}_b = \sum_{k=1}^{K} (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^{\mathsf{T}} \qquad (2.25)
\]
\[
\boldsymbol{\Sigma}_w = \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{\mathsf{T}}, \qquad (2.26)
\]

where µ is the global mean, µ_k is the class mean (in (2.26), the mean of the class to which x_i belongs) and K is the number of classes.

Then, LDA seeks the projection U that maximizes between-class variability, tr(U^T Σ_b U), while minimizing within-class variability, tr(U^T Σ_w U), leading to the optimization problem

\[
\frac{\operatorname{tr}(\mathbf{U}^{\mathsf{T}} \boldsymbol{\Sigma}_b \mathbf{U})}{\operatorname{tr}(\mathbf{U}^{\mathsf{T}} \boldsymbol{\Sigma}_w \mathbf{U})} \rightarrow \max_{\mathbf{U}} \qquad (2.27)
\]
\[
\text{s.t. } \mathbf{U}^{\mathsf{T}} \mathbf{U} = \mathbf{I}. \qquad (2.28)
\]

This optimization problem is non-convex, so it is hard to solve directly. Conventionally, the objective function in the original problem is replaced by an alternative objective, tr((U^T Σ_w U)^{-1} U^T Σ_b U), leading to a closed-form solution obtained by selecting the top eigenvectors of the matrix Σ_w^{-1} Σ_b [33].
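The conventional solution described above could be sketched in NumPy as follows. The function name lda_projection, the assumption of an invertible within-class matrix, and the toy data are illustrative choices, not part of the thesis.

```python
# Sketch of the conventional LDA solution: build the scatter matrices of
# Eqs. (2.25)-(2.26) and take the top eigenvectors of inv(Sigma_w) @ Sigma_b.
import numpy as np

def lda_projection(X, y, m):
    """Return a d x m projection matrix from data X (N x d) and integer labels y."""
    mu = X.mean(axis=0)                         # global mean
    d = X.shape[1]
    Sigma_b = np.zeros((d, d))
    Sigma_w = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        diff = (mu_k - mu)[:, None]
        Sigma_b += diff @ diff.T                # between-class scatter
        Sigma_w += (Xk - mu_k).T @ (Xk - mu_k)  # within-class scatter
    # eigenvectors of Sigma_w^{-1} Sigma_b, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma_w, Sigma_b))
    order = np.argsort(eigvals.real)[::-1][:m]
    return eigvecs[:, order].real

# Toy usage: two Gaussian classes in 3-D projected to one dimension
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
U = lda_projection(X, y, m=1)
print(U.shape)   # (3, 1)
```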

LDA has a probabilistic extension known as probabilistic linear discriminant analysis (PLDA) that is extensively used for face and speaker recognition [34]. In contrast with classical LDA, typically used as a tool for dimensionality reduction, PLDA can be reformulated as a probabilistic classifier to discriminate between a finite set of hypotheses describing the relationship among a set of biometric templates. Further, its discriminative formulation leads to a classifier known as the pairwise support vector machine (PSVM) in the speaker recognition community [35, 36], which in turn forms a basis for an alternative learning algorithm proposed in Publication II.

Clustering

Another major category of unsupervised learning is clustering, which aims to identify groups of similar objects (clusters) in a dataset. The underlying assumption is that the dataset consists of a few groups of objects so that objects in the same group are more similar to each other than to those in different groups. For instance, one of the most well-known clustering algorithms, k-means [37], can be seen as a method to approximately solve the following optimization problem:

\[
\sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \big\| \mathbf{x}_i - \boldsymbol{\mu}_k \big\|^2 \rightarrow \min_{z_{ik},\, \boldsymbol{\mu}_k}, \qquad (2.29)
\]
\[
\text{s.t. } \sum_{k=1}^{K} z_{ik} = 1, \qquad (2.30)
\]

where K is the number of clusters, assumed to be known or fixed in advance, and z_ik ∈ {0, 1} is an indicator variable that equals 1 if the i-th data point belongs to the k-th cluster and 0 otherwise. Finally, µ_k is the centroid of the k-th cluster. The k-means algorithm is a block-coordinate descent (alternating minimization) procedure for solving this problem: it alternates between updating z_ik while keeping µ_k fixed and vice versa until convergence.
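A minimal sketch of this alternating scheme, with illustrative names and toy data rather than any configuration used in the publications, might look as follows.

```python
# Sketch of k-means as alternating minimization of Eqs. (2.29)-(2.30):
# alternate between assigning points to the nearest centroid (z_ik) and
# recomputing centroids (mu_k).
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # initialize mu_k
    for _ in range(n_iters):
        # update z_ik: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update mu_k: mean of the points assigned to cluster k (keep old centroid if empty)
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage on two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```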

Interestingly, k-means can be viewed as an instance of the encoder-decoder framework. In this case, the encoder maps the input vector to the index of the cluster with the closest centroid and the decoder outputs the centroid. That is, each vector in the input space is approximated by one element of a fixed dictionary (or codebook) of vectors. Therefore, encoding-decoding can be seen as vector quantization (VQ), a generalization of scalar quantization to multi-dimensional data. Vector quantization based modeling has a long history of applications to automatic speaker recognition and related tasks, beginning in the 1980s [38]. In [39], so-called centroid models based on VQ were used to design a lightweight speaker verification system for mobile devices with limited hardware resources such as CPU speed and memory. In Publication III and [40], codebook models are used for speech activity detection.
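As a small illustration of this VQ interpretation, the sketch below uses an arbitrary hand-picked codebook: encode returns the index of the nearest centroid and decode returns that centroid as the reconstruction.

```python
# The vector-quantization view of k-means: encoder = nearest-centroid index,
# decoder = centroid lookup. The codebook here is an arbitrary illustrative example.
import numpy as np

codebook = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])   # K centroids (the "codebook")

def encode(x, codebook):
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))  # index of nearest centroid

def decode(index, codebook):
    return codebook[index]                                       # reconstruction = centroid

x = np.array([4.2, 5.3])
print(decode(encode(x, codebook), codebook))   # x is quantized to [5., 5.]
```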

Many unsupervised learning tasks, including those reviewed above, can be formulated as estimation of a probability density function p(x) given a training set of i.i.d. samples x_i ∼ p(x). For instance, in a probabilistic formulation of PCA [41] one seeks a set of parameters {µ, U, σ} to fit a Gaussian density function of the form p(x) = N(x | µ, UU^T + σ^2 I) to a given dataset. It was shown in [41] that, similar to classical PCA, the columns of the matrix U obtained via maximum likelihood estimation (discussed in Section 2.2.3) are scaled eigenvectors of the empirical covariance matrix. In turn, some clustering methods can be seen as estimating parameters of so-called mixture models [6], which are discussed in the following sections. Once estimated, the parameters of a mixture model can be used to find assignments of the data points to groups discovered in the dataset. Speaker diarization is an example of a task for which mixture models are adopted (e.g. [42], [43]) to cluster speech segments according to speaker identity.
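For illustration, the closed-form maximum-likelihood solution of probabilistic PCA reported in [41] can be sketched as below (up to an arbitrary rotation of U). The variable names and the toy data are assumptions made for this example only.

```python
# Sketch of the closed-form ML fit of probabilistic PCA: p(x) = N(x | mu, U U^T + sigma^2 I),
# built from the eigendecomposition of the empirical covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))   # toy data, N = 1000, d = 6
m = 2

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)               # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]

sigma2 = lam[m:].mean()                          # ML noise variance: mean of discarded eigenvalues
U = V[:, :m] * np.sqrt(lam[:m] - sigma2)         # ML columns of U: scaled top eigenvectors

cov_model = U @ U.T + sigma2 * np.eye(X.shape[1])
print(np.round(cov_model, 2))                    # fitted model covariance
```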