
Latent variable models and the Gaussian mixture model

In a latent variable model (LVM), the observed data is assumed to be affected by one or more unobservable, or latent, variables or factors [20, 15]. Normally, in LVMs and the related topic of factor analysis, the goal is to explain the variance of the observed data with an arbitrary number of unobserved variables, which are found by means of factor analysis. The ideas behind LVMs were originally developed in the social sciences for behavioural modelling [20, 21], but they have also found widespread use within machine learning. LVMs and factor analysis will be discussed in more detail in subsection 2.4.

The Gaussian mixture model (GMM), or mixture of Gaussians [15, 16], consists of multiple multivariate Gaussian distributions, each with a mean µ_k and a covariance matrix Σ_k. A GMM is formalised as a weighted sum of its mixture components:

p(x_i | θ) = \sum_{k=1}^{K} π_k N(x_i | µ_k, Σ_k),   (2)

where N is the probability density function (pdf) of the multivariate Gaussian (Normal) distribution (MVN):

N(x | µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp( −\frac{1}{2} (x − µ)^T Σ^{−1} (x − µ) ).   (3)

In the MVN, the mean vector, µ, defines the centre point of the distribution within the D-dimensional space in which it lies, and the covariance matrix, Σ ∈ R^{D×D}, defines how the probability density behaves around the mean, i.e., how the probability mass is distributed. |Σ| is the determinant of Σ. The covariance of a multivariate random variable is given by

cov[x] = E[ (x − E[x])(x − E[x])^T ],   (4)

where for the Gaussian distribution we have E[x] = µ. The covariance can be either general, diagonal or isotropic (spherical) [16], of which the first is the least restricted in terms of the number of parameters. The general covariance matrix is exactly as defined in (4); however, we may restrict it to the diagonal matrix diag(σ_i^2), so that the matrix contains only the individual variances of each dimension of x.

Lastly, in the most restrictive option, we have Σ = σ^2 I, so that the covariance is simply the identity matrix scaled by a variance parameter. The differences between these are illustrated in Figure 5. The restricted covariances are useful because evaluating (3) requires computing the inverse of the covariance inside the exponent, which is also known as the precision matrix [15]. In (2), K denotes the number of mixture components and D the dimensionality of the vectors x_i.

Figure 5: Three multivariate normal distributions in 3D and 2D plots. The leftmost distribution has the general-case covariance matrix, while the central and rightmost distributions have spherical and diagonal covariances, respectively. The general-case covariance may have an arbitrary angle for the direction of the highest variance, while the diagonal one is restricted to follow one of the axes for the shape of its variance. Lastly, the spherical covariance clearly has a circular shape. Plots generated with GNU Octave (https://www.gnu.org/software/octave/) mesh and contour plots.
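As a concrete illustration of the three covariance structures and of the MVN density (3), the following NumPy sketch builds a general, a diagonal and a spherical covariance matrix for D = 2 and evaluates the density at a single point. The function name mvn_pdf and the numerical values are invented for this example only.

import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, cf. (3)."""
    D = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    # The inverse of the covariance (the precision matrix) appears inside the exponent.
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.zeros(2)
Sigma_general   = np.array([[2.0, 0.8], [0.8, 1.0]])  # full matrix, arbitrary orientation
Sigma_diagonal  = np.diag([2.0, 1.0])                 # axis-aligned ellipse
Sigma_spherical = 1.5 * np.eye(2)                     # circular contours

x = np.array([0.5, -0.2])
for name, S in [("general", Sigma_general),
                ("diagonal", Sigma_diagonal),
                ("spherical", Sigma_spherical)]:
    print(name, mvn_pdf(x, mu, S))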

Further, the π_k are known as the mixing weights or component priors. Two probabilistic constraints, 0 ≤ π_k ≤ 1 and \sum_{k=1}^{K} π_k = 1, must be satisfied. The π_k represent the Bayesian prior knowledge we have regarding the training data. The model encapsulates a 1-of-K coded latent variable z, also known as a one-hot vector, where one of the vector values z_k is one and the rest are zeroes. This one-hot vector can be thought of as tying each observation to a mixture component, with the index of the one in the vector pointing to the kth mixture component. The constraint \sum_k z_k = 1 must hold. For the mixing weights, π_k, we have

p(z_k = 1) = π_k.   (5)
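To make the mixture density (2) and the one-hot latent variable concrete, the sketch below evaluates a two-component GMM density at a point and draws a sample by first drawing z according to the mixing weights (5) and then sampling from the selected component. The component values and names are invented for this illustration.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component GMM in 2D (parameter values are made up for this sketch).
pis    = np.array([0.3, 0.7])                        # mixing weights, sum to one
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density, cf. (3)."""
    D = mu.shape[0]
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
           np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

def gmm_pdf(x):
    """Weighted sum over the mixture components, cf. (2)."""
    return sum(pi * mvn_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Sampling: draw the one-hot latent z from the mixing weights (5),
# then draw x from the component that z points to.
k = rng.choice(len(pis), p=pis)          # index of the 'one' in the one-hot vector z
x = rng.multivariate_normal(mus[k], Sigmas[k])
print(gmm_pdf(x))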

The parameters of a GMM can be trained with an iterative algorithm known as the expectation-maximisation (EM) algorithm [22, 15, 16]. The algorithm consists of two steps, computed at each iteration. The E-step (expectation) comes first in the iteration. Here, inference on the latent variables is done based on the current model parameter values: using the current parameters of the distribution, we compute the expected component assignments of the observed data. The second step, the maximisation or M-step, is an optimisation of the parameters given the expectations computed in the E-step. The purpose of the EM algorithm is to maximise the log-likelihood of the observed data x_i and the missing or hidden data z_i, given the parameters θ.

Figure 6: GMM training process via the EM algorithm: here, starting from a random initial position, the mixture component parameters are shifted towards the clusters. In this toy example, we conveniently have three data clusters and three mixture components.

The likelihood is a function that indicates how reasonable our model is given the data we observe (our training data), where a higher likelihood is in favour of our model, and vice versa. For the GMM, the observed-data log-likelihood is

l(θ) = \sum_{i=1}^{N} log p(x_i | θ) = \sum_{i=1}^{N} log [ \sum_{k=1}^{K} π_k N(x_i | µ_k, Σ_k) ].   (6)
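A minimal sketch of (6), assuming the parameters are stored as arrays or lists as in the earlier sketches (the function name is invented for this example):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Observed-data log-likelihood, cf. (6): a log of a sum over components for each data point."""
    ll = 0.0
    for x in X:
        ll += np.log(sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)))
    return ll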

As noted in [15], due to the logarithm in front of the sum in (6), this log-likelihood is difficult to optimise directly. The EM algorithm therefore works with the complete data log-likelihood,

l_c(θ) = \sum_{i=1}^{N} log p(x_i, z_i | θ).   (7)

The complete data is the set of random variables {X, Z}, where X is the observed data and Z the unobserved data (x_i ∈ X and z_i ∈ Z). While Z is not directly observable, it is assumed that there is exactly one Gaussian component in the mixture that generated a particular observed sample in X. This data-generating process is encoded in Z. As the z_i are latent and therefore unknown, this log-likelihood cannot be directly evaluated.

The EM algorithm solves this problem such that, instead of evaluating (7), we evaluate its expectation [15]:

Q(θ, θ^{t−1}) = E[ l_c(θ) | X, θ^{t−1} ] = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} log π_k + \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} log N(x_i | µ_k, Σ_k),   (8)

where r_{ik} = p(z_{ik} = 1 | x_i, θ^{t−1}) is the responsibility of component k for data point i.

Note the introduction of the variable z_i for the assumed unobserved data. Instead of evaluating a likelihood, in the EM algorithm we evaluate the expectation of the joint distribution of the observed data (the x_i's) and the unobserved data, represented by the z_i's. We see that the sufficient statistics, the means and covariances, appear only in the right-hand sum. This is utilised in the steps of the algorithm. Here, t refers to the current iteration of the algorithm and θ^{t−1} refers to the parameters of the previous iteration. Q is a new function known as the auxiliary function. The parameters θ^t in the M-step are optimised via the following:

θ^t = argmax_θ Q(θ, θ^{t−1}),   (9)

that is, we want to find the model parameters that maximise the expected log-likelihood in (8). In the EM algorithm, maximum likelihood estimation (MLE) [23] is used to find the parameters. The basic principle of MLE is that we take the log-likelihood function and the data, and obtain an estimate (the maximum likelihood estimate) for each parameter of the model. The estimator for a given parameter is obtained by setting the derivative of the log-likelihood function with respect to that parameter to zero and solving for the parameter. This estimator can then be used to compute the estimate for the parameter of interest.
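Once the responsibilities are available, the auxiliary function (8) can be written down directly. The sketch below assumes a responsibility matrix R of shape (N, K) and parameter containers as in the earlier sketches; the names are illustrative only.

import numpy as np
from scipy.stats import multivariate_normal

def auxiliary_Q(X, R, pis, mus, Sigmas):
    """Expected complete-data log-likelihood, cf. (8): a responsibility-weighted sum of
    log mixing weights plus a responsibility-weighted sum of component log-densities."""
    Q = 0.0
    for i, x in enumerate(X):
        for k in range(len(pis)):
            Q += R[i, k] * np.log(pis[k])
            Q += R[i, k] * multivariate_normal.logpdf(x, mean=mus[k], cov=Sigmas[k])
    return Q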

Now, the E-step in the EM iteration, based on (8), is done via [15]:

r_{ik} = \frac{π_k N(x_i | θ_k^{t−1})}{\sum_{k'=1}^{K} π_{k'} N(x_i | θ_{k'}^{t−1})}.   (10)

That is, for each observed data point i, the responsibility that the kth component takes for this data point is obtained with the Bayes rule. The responsibility computation can be seen as a soft clustering of the data points, in contrast to the hard clustering in the K-means algorithm [16]. Thus, if we wanted to use the EM-GMM framework for clustering purposes, we could assign each data point to the component with the largest responsibility value to obtain cluster memberships. We will discuss this similarity briefly later. The M-step is done in multiple phases. First, the mixing weight for the kth component is set to be the proportion of the observed data that has been assigned to this component [15]:

π_k = \frac{r_k}{N},  where  r_k = \sum_{i=1}^{N} r_{ik}.   (11)
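As an illustration of how (10) and (11) translate into code, the following NumPy/SciPy sketch assumes that X is an N × D data array and that the parameters are stored as in the earlier sketches; the function names are invented for this example.

import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    """Responsibilities r_ik via the Bayes rule, cf. (10)."""
    N, K = X.shape[0], len(pis)
    R = np.empty((N, K))
    for k in range(K):
        R[:, k] = pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    R /= R.sum(axis=1, keepdims=True)   # normalise over the components for each data point
    return R

def update_mixing_weights(R):
    """pi_k as the (soft) fraction of the data assigned to component k, cf. (11)."""
    return R.sum(axis=0) / R.shape[0]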

The new mean vector for the kth component is obtained by taking the mean over all data points, weighted by the responsibility of the kth component for each data point [15]:

µ_k = \frac{\sum_{i=1}^{N} r_{ik} x_i}{r_k}.   (12)

The updated covariance matrix is obtained by following the MLE reasoning for the covariance parameter, in which we set the derivative of the log-likelihood, l(x_i | π_k, µ_k, Σ_k), with respect to the parameter of interest (here Σ_k) to zero and solve for that parameter. This results in the form [15]:

Σ_k = \frac{\sum_{i=1}^{N} r_{ik} (x_i − µ_k)(x_i − µ_k)^T}{r_k}.   (13)
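The remaining M-step updates, (12) and (13), can be sketched with the same assumed array layout; again this is only an illustrative implementation, not the one used in the cited sources.

import numpy as np

def m_step_means_covs(X, R):
    """Responsibility-weighted mean (12) and covariance (13) for every component."""
    N, D = X.shape
    K = R.shape[1]
    r_k = R.sum(axis=0)                                        # soft counts per component
    mus = (R.T @ X) / r_k[:, None]                             # (12)
    Sigmas = np.empty((K, D, D))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (R[:, k, None] * diff).T @ diff / r_k[k]   # (13)
    return mus, Sigmas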

Assuming that the convergence threshold is set appropriately, the algorithm is guaranteed to end up in a local maximum in terms of the log-likelihood [15, 16]. For the initialisation of the model in terms of the means and covariances that describe the mixture components, it is possible to use the K-means algorithm [24, 15, 16]. Other approaches include random initialisation and farthest point clustering [15].
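Of the initialisation options mentioned above, plain random initialisation is the simplest; one possible sketch is given below. Using the overall data covariance for every component and uniform mixing weights is an assumption of this example, not something prescribed by the sources.

import numpy as np

def random_init(X, K, rng=np.random.default_rng()):
    """Simple random initialisation of the EM-GMM parameters."""
    N, D = X.shape
    mus = X[rng.choice(N, size=K, replace=False)]      # K distinct data points as initial means
    Sigmas = np.tile(np.cov(X, rowvar=False), (K, 1, 1))
    pis = np.full(K, 1.0 / K)
    return pis, mus, Sigmas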

Algorithm 1 The naive EM algorithm for GMMs

1: Initialise each µ_k, Σ_k, π_k, the convergence threshold t and the update variable u, and evaluate the initial log-likelihood.
2: while u ≥ t do
3:    Evaluate the r_ik's                                           ▷ E-step
4:    Using the new r_ik's, update π_k (11), µ_k (12) and Σ_k (13). ▷ M-step
5:    Re-evaluate the log-likelihood for the updated model and set the increase in log-likelihood to be u.
6: end while
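Putting the pieces together, the following sketch mirrors Algorithm 1. It reuses the random_init, e_step, update_mixing_weights, m_step_means_covs and gmm_log_likelihood helpers sketched earlier in this section, all of which are illustrative rather than taken from the cited sources.

import numpy as np

def em_gmm(X, K, threshold=1e-4, max_iter=200, rng=np.random.default_rng()):
    """Naive EM for GMMs, following Algorithm 1 (helpers defined in the earlier sketches)."""
    pis, mus, Sigmas = random_init(X, K, rng)
    prev_ll = gmm_log_likelihood(X, pis, mus, Sigmas)
    for _ in range(max_iter):
        R = e_step(X, pis, mus, Sigmas)                  # E-step
        pis = update_mixing_weights(R)                   # M-step
        mus, Sigmas = m_step_means_covs(X, R)
        ll = gmm_log_likelihood(X, pis, mus, Sigmas)
        if ll - prev_ll < threshold:                     # increase u fell below threshold t
            break
        prev_ll = ll
    return pis, mus, Sigmas, R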

2.4.1 Discussion

The EM-GMM scheme described earlier can be seen as a more robust way of doing clustering that is conceptually close to the K-means algorithm (in fact, [16] notes that K-means is a special case of GMM-EM). In the case of GMMs, we get a probabilistic alignment (soft alignment) of each data point belonging to each mixture component, whereas in the case of K-means, each data point either belongs to a cluster or does not (hard alignment). These are often called soft clustering and hard clustering, respectively.

In K-means clustering [25, 26], we first select k points at random and then iterate two steps: in the first, we assign data points to their nearest cluster centre according to some distance metric (usually the Euclidean, or L2, metric), and in the second, we set each cluster centre to be the mean of all the points assigned to it. The first step can be formalised as follows [16]:

r_{ik} = 1 if k = argmin_{k'} ||x_i − µ_{k'}||^2, and r_{ik} = 0 otherwise,   (14)

where we get a one-hot encoding for the cluster assignments, analogously to the responsibilities in EM-GMM. The updated cluster means are obtained such that [16]

µ_k = \frac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}.   (15)

Both (14) and (15) result from the objective or loss function [16]:

J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} ||x_i − µ_k||^2,

with (15) being the MLE for µ_k. Now, suppose that in the EM-GMM computation we set each Gaussian component's covariance matrix to be σ^2 I_D, use a constant π_k = 1/K, and use an indicator function to set the responsibilities in the E-step such that, for a given data point, the component responsibility ends up being either one for the most likely component, or zero otherwise [15]. The result is the same clustering that is obtained from K-means, as only the means of the components are relevant in the E-step due to the way we defined the covariances. EM-GMM can thus be seen as a more robust way of modelling in comparison to K-means, due to the possibility of modelling the individual component variances. Another important difference arising from the probabilistic responsibility assignments in EM-GMM is that it captures the uncertainty of each component assignment, obtained via 1 − max_k(r_ik) [15], i.e., we can look at the relative magnitude of the highest assigned component responsibility to obtain a measure of the uncertainty related to the component (or cluster) assignment.
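The special case described above can be imitated directly: fix every covariance to σ^2 I, keep π_k = 1/K, and replace the soft responsibilities with hard one-hot assignments. Below is a minimal sketch of such a hard E-step together with the uncertainty measure 1 − max_k(r_ik); the function names are assumptions of this example.

import numpy as np

def hard_e_step(X, mus):
    """Hard (one-hot) responsibilities: with Sigma_k = sigma^2 I and uniform pi_k,
    the most responsible component is simply the nearest mean, as in K-means (14)."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (N, K)
    R = np.zeros_like(d2)
    R[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1.0           # one-hot row per data point
    return R

def assignment_uncertainty(R_soft):
    """Uncertainty of a soft assignment, 1 - max_k r_ik, per data point [15]."""
    return 1.0 - R_soft.max(axis=1)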