
Degree Program in Computer Science

Bachelor’s Thesis

Mikko Haavisto

DEEP GENERATIVE MODELS FOR FACIAL KEYPOINTS DETECTION

Examiners: Lasse Lensu, D.Sc. (Tech.)
Supervisor: Lasse Lensu, D.Sc. (Tech.)

Lappeenranta University of Technology
Faculty of Technology Management
Degree Program in Computer Science
Mikko Haavisto

Deep Generative Models for Facial Keypoints Detection

Bachelor’s Thesis 2013

33 pages, 10 figures, 2 tables, and 1 appendix.

Examiners: Lasse Lensu D.Sc. (Tech.)

Keywords: deep learning, generative models, facial keypoints

Deep learning, a new area of machine learning research, has moved machine learning closer to one of its original goals: artificial intelligence and a general learning algorithm. The key idea is to pretrain models in a completely unsupervised way so that they can finally be fine-tuned for the task at hand using supervised learning. In this thesis, a general introduction to deep learning models and algorithms is given, and these methods are applied to facial keypoints detection. The task is to predict the positions of 15 keypoints on grayscale face images. Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. In the experiments, we pretrained deep belief networks (DBN) and finally performed a discriminative fine-tuning. We varied the depth and size of the architecture. We tested both deterministic and sampled hidden activations and the effect of additional unlabeled data on pretraining. The experimental results show that our model provides better results than the publicly available benchmarks for the dataset.

Lappeenranta University of Technology
Faculty of Technology Management
Degree Program in Computer Science
Mikko Haavisto

The Use of Deep Generative Models for Facial Keypoint Detection

Bachelor's Thesis 2013

33 pages, 10 figures, 2 tables and 1 appendix.

Examiners: Lasse Lensu, D.Sc. (Tech.)

Keywords: deep learning, generative models, facial keypoints

Deep learning is a new direction of machine learning research that has brought machine learning closer to its original goals: artificial intelligence and a general learning algorithm. The central idea of deep learning is that models are first trained in an unsupervised way, after which they can be fine-tuned in a supervised way for a specific task. This thesis introduces deep learning models and algorithms and applies these methods to the detection of facial keypoints. The task is to detect 15 facial keypoints in grayscale images. Each keypoint is specified by an (x,y) real-valued pair in the pixel space of the image. In the experiments, deep belief networks (DBN) were pretrained and finally fine-tuned discriminatively. The depth and size of the architecture were varied. In pretraining, deterministic and sampled hidden activations as well as the effect of additional unlabeled data were studied. Based on the experimental results, the model performs better than the publicly available benchmarks for the dataset.

CONTENTS

1 INTRODUCTION
  1.1 Background
    1.1.1 Deep Learning
    1.1.2 Applications of Deep Generative Models
  1.2 Objectives and Restrictions
  1.3 Structure of the Thesis
2 METHODS
  2.1 Energy Based Models
  2.2 Restricted Boltzmann Machine
    2.2.1 Binary RBM
    2.2.2 Gaussian-Bernoulli RBM
    2.2.3 Contrastive Divergence
    2.2.4 Denoising Score Matching
  2.3 Deep Belief Nets
3 EXPERIMENTS
  3.1 Facial Keypoints Dataset
  3.2 Setup for Experiments
  3.3 Architectures
  3.4 Results
    3.4.1 Visualizations
    3.4.2 Learning Progress
4 DISCUSSION
  4.1 Future Work
5 CONCLUSIONS
REFERENCES
APPENDICES
Appendix 1: Learning Progress of Pretraining

ABBREVIATIONS AND SYMBOLS

AI    Artificial Intelligence
CD    Contrastive Divergence
CPU   Central Processing Unit
DAE   Denoising Autoencoder
DBN   Deep Belief Net
DSM   Denoising Score Matching
GMM   Gaussian Mixture Model
GPU   Graphics Processing Unit
LSA   Latent Semantic Analysis
MAE   Mean Absolute Error
MCMC  Markov Chain Monte Carlo
MSE   Mean Squared Error
PCA   Principal Component Analysis
RBM   Restricted Boltzmann Machine
RMSE  Root Mean Squared Error
SM    Score Matching

a         Hidden units bias vector
b         Visible units bias vector
E         Energy function
E_P(x)[·] Expectation with respect to distribution P
F         Free energy function
g         Logistic function
h         Hidden (latent) units vector
ĥ         All possible configurations of hidden units
N         Normal distribution
P         Probability distribution function
P̂         Empirical distribution of the training set
q_σ       Corruptor model: isotropic Gaussian of variance σ²
v         Visible (observed) units vector
ṽ         Corrupted visible units vector
v̂         All possible configurations of visible units
W         Weight matrix
x         Discrete valued vector
Z         Partition function
ψ         Score function
σ         Standard deviation
θ         Model parameters


1 INTRODUCTION

The goal in machine learning is essentially to learn from data. This learning is usually done using a supervised learning algorithm, where learning consists of finding a mapping between inputs and outputs, the latter more commonly referred to as labels.

A general learning algorithm could be described as a way of automatically learning features at multiple levels of abstraction. In the case of images, the first abstraction level could contain simple abstractions such as straight lines with different orientations. The level above forms more complex shapes, such as polygons, based on abstractions from the layer below. These characteristics allow a system to learn a complex function mapping directly from data, without depending on human-crafted features. This becomes important for higher-level abstractions, which are often hard for humans to specify in terms of raw data.

Most existing feature extraction methods are domain specific. For example, applications in signal processing and in image processing generally use different feature extraction methods.

Brain plasticity research exploring sensory substitution [1] also indicates the existence of some sort of general purpose learning algorithm in the human brain.

An architecture describes the structure of a particular machine learning model. The depth of an architecture refers to the number of levels of composition of non-linear operations in the learned function, in other words, the number of levels of abstraction. Most current supervised learning algorithms correspond to shallow architectures (1, 2 or 3 levels), such as neural networks with one hidden layer, support vector machines and random forests. In addition to the limitations caused by shallow architectures [2], these algorithms are trained using only labeled data and are unable to benefit from unlabeled data. Even though researchers have put a huge effort into labeling data sets, the underlying problem still remains in every machine learning application that uses supervised learning.

By contrast, object recognition in the visual cortex, a common example of an artificial intelligence (AI) related task, uses many layers of nonlinear processing and requires very little labeled input [3]. Object recognition is still a difficult task in computer vision research. This is because the image of an object varies with viewpoint, scale and size, or when the object is translated or rotated into a different pose. Objects can also be partially obstructed from view, which is extremely common in video sequences and real-world images.

The main focus of this thesis is on exploring methods that can overcome the limitations described above. These methods move machine learning closer to one of its original goals: artificial intelligence and a general learning algorithm.

1.1 Background

Increased computing power and massive data sets have made the implementation of larger and more complex machine learning systems feasible. These factors have also had a huge impact on the development of a new area of machine learning research, commonly referred to as deep learning.

1.1.1 Deep Learning

In 2006, Hinton et al. [4] introduced an unsupervised learning algorithm for deep generative models called Deep Belief Nets (DBN). The building block of a DBN is an energy-based graphical model called the restricted Boltzmann machine (RBM), which can learn a probability distribution over its set of inputs. DBNs have the following attractive features:

• The greedy layer-by-layer learning algorithm can find a good set of model parameters even for models that contain millions of parameters.

• Pretraining can be done in a completely unsupervised way. The very limited labeled data can then be used to fine-tune the model for the specific task at hand using standard gradient-based optimization.

• There is an efficient and accurate way of performing inference, which makes the values of the hidden variables easy to infer.

More recently, in 2012, Hinton et al. [5] developed an interesting biology-inspired method to prevent overfitting by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. This method, called "dropout", can be seen as an efficient way of performing model averaging.


1.1.2 Applications of Deep Generative Models

Experiments in speech recognition show that models pretrained as DBNs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM) [6][7].

In the field of object recognition, a large, deep convolutional neural network was considerably better than previous state-of-the-art methods in the classification and localization tasks of the ImageNet Large Scale Visual Recognition Challenge 2012 [8][9]. The network contained 60 million parameters and 650,000 neurons and was trained on raw RGB pixel values.

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Such deep autoencoder networks significantly outperform common methods, such as Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA), and can be used for data visualization purposes [10]. A related technique called "semantic hashing" provides a very fast method for document and image retrieval [11].

An important point is that deep learning methods do not necessarily require domain-specific knowledge of the particular task at hand. Deep generative models have been applied successfully to a wide variety of machine learning competitions on Kaggle [12], for example, molecular activity prediction [13] and job salary prediction [14].

1.2 Objectives and Restrictions

The objective of this thesis is to construct a deep generative model that can learn high-level feature representations in an unsupervised way. The main goal is to find out how well this model can extract features from small grayscale face images in order to detect facial keypoints.

1.3 Structure of the Thesis

Section 1 is an introduction to deep learning models and explains the motivation behind these methods. Section 2 gives a detailed description of the methods used in this thesis. The experiments are explained in Section 3. The results are discussed in Section 4 and a short conclusion is given in Section 5.


2 METHODS

2.1 Energy Based Models

Energy-based models associate a scalar energy to each configuration of the variables of interest [15]. Learning corresponds to modifying an energy function so that desirable configurations have low energy. A model may define a probability distribution through an energy function E(x) as follows:

\[ P(x) = \frac{\exp(-E(x))}{Z}, \tag{1} \]

where the normalizing factor Z is called the partition function by analogy with physical systems,

\[ Z = \sum_{x} \exp(-E(x)), \tag{2} \]

with a sum over the input space when x is discrete, e.g., a binary-valued vector. If a real-valued x is used, the sum is replaced by an appropriate integral.

Deep learning primitive models consist of visible (v) and hidden (h) units. Thus, we introduce the following change to the probability distribution function:

\[ P(v,h) = \frac{\exp(-E(v,h))}{Z}. \tag{3} \]

Because only the visible part v is observed in a generative model, one is interested in the marginal

\[ P(v) = \sum_{h} \frac{\exp(-E(v,h))}{Z}, \tag{4} \]

which is simply achieved by summing the joint distribution over h. By defining a free energy as

\[ F(v) = -\log \sum_{h} \exp(-E(v,h)), \tag{5} \]

the marginal distribution of v is formed as

\[ P(v) = \frac{\exp(-F(v))}{Z}, \tag{6} \]

with the partition function Z = \sum_{\hat{v}} \exp(-F(\hat{v})). The free energy is just a marginalization of energies in the log-domain.

We introduce θ to represent the parameters of the model. The data log-likelihood is then

\[ \log P(v) = \log \frac{\exp(-F(v))}{Z} = -F(v) - \log Z = -F(v) - \log\!\left( \sum_{\hat{v}} \exp(-F(\hat{v})) \right), \tag{7} \]

where all possible configurations of the inputs are denoted by \hat{v}, and the log-likelihood gradient is

\[
\frac{\partial \log P(v)}{\partial \theta}
= -\frac{\partial F(v)}{\partial \theta} - \frac{1}{\sum_{\hat{v}} \exp(-F(\hat{v}))} \sum_{\hat{v}} \exp(-F(\hat{v})) \left( -\frac{\partial F(\hat{v})}{\partial \theta} \right)
= -\frac{\partial F(v)}{\partial \theta} + \frac{1}{Z} \sum_{\hat{v}} \exp(-F(\hat{v})) \frac{\partial F(\hat{v})}{\partial \theta}
= -\frac{\partial F(v)}{\partial \theta} + \sum_{\hat{v}} P(\hat{v}) \frac{\partial F(\hat{v})}{\partial \theta}. \tag{8}
\]

The log-likelihood gradient contains two terms, which are referred to as the positive and the negative phase. The terminology reflects their effect on the probability density defined by the model. If the free energy can be computed tractably, the positive phase term can be solved analytically. However, the negative phase term is often intractable because of the sum over all possible input configurations. This leads to an approximation that computes the expected value, i.e., the average log-likelihood gradient over the training set:

\[ E_{\hat{P}}\!\left[ \frac{\partial \log P(v)}{\partial \theta} \right] = -E_{\hat{P}}\!\left[ \frac{\partial F(v)}{\partial \theta} \right] + E_{P}\!\left[ \frac{\partial F(v)}{\partial \theta} \right], \tag{9} \]

where the expectations are over v, with \hat{P} the empirical distribution of the training set and P the model's distribution.

Maximum likelihood learning is often infeasible because the exact computation of the expectation with respect to the model's distribution takes time that is exponential in min{V, H}, i.e., the number of visible or hidden units. However, by sampling from P and computing the free energy tractably, one can obtain a stochastic estimator of the log-likelihood gradient using a Monte Carlo method. Such a method is introduced in Section 2.2.3.
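To make the positive and negative phases of Eq. 9 concrete, the following NumPy sketch (added here as an illustration; it is not code from the thesis) enumerates every state of a very small binary RBM, computes the partition function Z by brute force, and evaluates the exact log-likelihood gradient of Eq. 8 for the weight matrix. All sizes and values are arbitrary placeholders. The enumeration has 2^(V+H) terms, which is exactly why the exact computation is only feasible for toy models and why the sampling-based approximation of Section 2.2.3 is needed.

```python
# A minimal sketch (not from the thesis): exact log-likelihood gradient of a
# tiny binary RBM by brute-force enumeration of all visible and hidden states.
import itertools
import numpy as np

rng = np.random.default_rng(0)
V, H = 4, 3                        # tiny model: 2**(V + H) = 128 joint states
W = 0.1 * rng.standard_normal((V, H))
b = np.zeros(V)                    # visible biases
a = np.zeros(H)                    # hidden biases

def energy(v, h):
    # E(v, h) = -v^T W h - b^T v - a^T h
    return -(v @ W @ h + b @ v + a @ h)

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=V)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=H)]

# partition function Z, Eq. (2): sum over every joint configuration
Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)

def free_energy(v):
    # F(v) = -log sum_h exp(-E(v, h)), Eq. (5)
    return -np.log(sum(np.exp(-energy(v, h)) for h in states_h))

def log_p(v):
    # log P(v) = -F(v) - log Z, Eq. (7)
    return -free_energy(v) - np.log(Z)

def expected_h(v):
    # E[h | v]: logistic of the hidden pre-activation (derived in Section 2.2.1)
    return 1.0 / (1.0 + np.exp(-(v @ W + a)))

# exact gradient of log P(v) w.r.t. W for one training vector, Eq. (8)
v_data = states_v[5]
positive_phase = np.outer(v_data, expected_h(v_data))   # -dF(v)/dW
negative_phase = np.zeros_like(W)                        # sum_v P(v) dF(v)/dW
for v in states_v:
    negative_phase += np.exp(-free_energy(v)) / Z * np.outer(v, expected_h(v))
grad_W = positive_phase - negative_phase

print("log P(v_data) =", log_p(v_data))
print("norm of exact gradient:", np.linalg.norm(grad_W))
```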

Figure 1. Restricted Boltzmann machine. The top layer represents a vector of stochastic binary hidden units h and the bottom layer represents a vector of stochastic binary visible units v. The symmetric weights W connect these layers.

A restricted Boltzmann machine is a particular type of Markov random field that has a two-layer architecture [16], in which the visible binary stochastic units v ∈ {0,1}^V are connected to hidden binary stochastic units h ∈ {0,1}^H, as shown in Figure 1. The energy of the model with V visible units and H hidden units is defined as follows:

\[ E(v,h;\theta) = -\sum_{i=1}^{V}\sum_{j=1}^{H} v_i h_j W_{ij} - \sum_{i=1}^{V} v_i b_i - \sum_{j=1}^{H} h_j a_j, \tag{10} \]

where θ = {W, b, a} are the model parameters: W_{V×H} represents the symmetric weights, and b_V and a_H are the bias terms. The conditional distribution over the hidden units h can be derived as follows:

\[
P(h|v) = \frac{\exp(-E(v,h))}{\sum_{\hat{h}} \exp(-E(v,\hat{h}))}
= \frac{\exp(v^{T}Wh + b^{T}v + a^{T}h)}{\sum_{\hat{h}} \exp(v^{T}W\hat{h} + b^{T}v + a^{T}\hat{h})}
= \frac{\exp(b^{T}v) \prod_{j=1}^{H} \exp\!\left(\sum_{i=1}^{V} v_i W_{ij} h_j + a_j h_j\right)}{\exp(b^{T}v) \prod_{j=1}^{H} \sum_{\hat{h}_j} \exp\!\left(\sum_{i=1}^{V} v_i W_{ij} \hat{h}_j + a_j \hat{h}_j\right)}
= \prod_{j=1}^{H} \frac{\exp\!\left(h_j \left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right)\right)}{\sum_{\hat{h}_j} \exp\!\left(\hat{h}_j \left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right)\right)}
= \prod_{j=1}^{H} P(h_j|v). \tag{11}
\]

When \hat{h}_j ∈ {0,1}, we get the usual stochastic binary unit:

\[ P(h_j = 1|v) = \frac{\exp\!\left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right)}{1 + \exp\!\left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right)} = g\!\left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right), \tag{12} \]

where g(x) = (1 + \exp(-x))^{-1} is the logistic function. Since v and h play a symmetric role in the energy function,

\[ P(v|h) = \prod_{i=1}^{V} P(v_i|h) \tag{13} \]

and in the binary case

\[ P(v_i = 1|h) = g\!\left(\sum_{j=1}^{H} h_j W_{ij} + b_i\right). \tag{14} \]

Using the same factorization, the free energy of the model can be derived from Eq. 5:

\[ F(v) = -\sum_{i=1}^{V} v_i b_i - \sum_{j=1}^{H} \log\!\left(1 + \exp\!\left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right)\right) \tag{15} \]

and the derivatives with respect to the model parameters θ are:

\[ \frac{\partial F(v)}{\partial W} = -g\!\left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right) v = -v h^{T} \tag{16} \]

\[ \frac{\partial F(v)}{\partial a} = -g\!\left(\sum_{i=1}^{V} v_i W_{ij} + a_j\right) = -h \tag{17} \]

\[ \frac{\partial F(v)}{\partial b} = -v \tag{18} \]
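As a concrete reference for the formulas above, here is a small NumPy sketch (an illustration added for this section, not the thesis implementation) of the binary RBM quantities: the conditionals of Eqs. 12 and 14, the free energy of Eq. 15, and the free-energy gradients of Eqs. 16-18, vectorized over a minibatch. The array names and sizes are placeholders.

```python
# A minimal NumPy sketch (illustrative, not the thesis code) of the binary RBM
# conditionals, free energy and free-energy gradients, vectorized over a
# minibatch of N visible vectors.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    # the logistic function g(x) = 1 / (1 + exp(-x)) of Eq. (12)
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, a):
    # P(h_j = 1 | v) = g(sum_i v_i W_ij + a_j), Eq. (12); v has shape (N, V)
    return sigmoid(v @ W + a)

def p_v_given_h(h, W, b):
    # P(v_i = 1 | h) = g(sum_j h_j W_ij + b_i), Eq. (14); h has shape (N, H)
    return sigmoid(h @ W.T + b)

def free_energy(v, W, a, b):
    # F(v) = -sum_i v_i b_i - sum_j log(1 + exp(sum_i v_i W_ij + a_j)), Eq. (15)
    return -(v @ b) - np.sum(np.log1p(np.exp(v @ W + a)), axis=1)

def free_energy_grads(v, W, a, b):
    # Eqs. (16)-(18), averaged over the minibatch
    h = p_h_given_v(v, W, a)
    dW = -(v.T @ h) / len(v)       # -v h^T
    da = -h.mean(axis=0)           # -h
    db = -v.mean(axis=0)           # -v
    return dW, da, db

V, H, N = 6, 4, 8
W = 0.1 * rng.standard_normal((V, H))
a, b = np.zeros(H), np.zeros(V)
v = (rng.random((N, V)) < 0.5).astype(float)

h_prob = p_h_given_v(v, W, a)
h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)  # Bernoulli sample
print("F(v):", free_energy(v, W, a, b))
print("gradient shapes:", [g.shape for g in free_energy_grads(v, W, a, b)])
```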

2.2.2 Gaussian-Bernoulli RBM

We describe the traditional parametrization of the Gaussian RBM here. However, as we introduce later in Section 2.2.4, we use a modified version of the energy function in the experiments.

To model real-valued continuous data, the hidden units of the first-level RBM remain binary, but the visible units are replaced by linear units with Gaussian noise [10]. The binary RBM's energy function is replaced with:

\[ E(v,h;\theta) = -\sum_{i=1}^{V}\sum_{j=1}^{H} \frac{v_i}{\sigma_i} h_j W_{ij} + \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j=1}^{H} h_j a_j. \tag{19} \]

Here, v ∈ R^V denotes the real-valued activities of the visible units. Each visible unit adds a quadratic offset to the energy function, where σ_i controls the width of the parabola. We obtain the conditional distribution over the visible units v:

\[
P(v|h) = \frac{\exp(-E(v,h))}{\int_{\hat{v}} \exp(-E(\hat{v},h))\, d\hat{v}}
= \prod_{i=1}^{V} \frac{1}{\sigma_i \sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma_i^2}\Big(v_i - b_i - \sigma_i \sum_{j=1}^{H} h_j W_{ij}\Big)^{2}\right)
= \prod_{i=1}^{V} \mathcal{N}\!\left(b_i + \sigma_i \sum_{j=1}^{H} h_j W_{ij},\; \sigma_i^2\right). \tag{20}
\]

Similarly to the binary-binary case, this gives a stochastic binary hidden unit in which the real-valued visible activity is scaled by the standard deviation:

\[ P(h_j = 1|v) = g\!\left(\sum_{i=1}^{V} \frac{v_i}{\sigma_i} W_{ij} + a_j\right), \tag{21} \]

and we derive the free energy of the model from Eq. 5:

\[ F(v) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j=1}^{H} \log\!\left(1 + \exp\!\left(\sum_{i=1}^{V} \frac{v_i}{\sigma_i} W_{ij} + a_j\right)\right). \tag{22} \]

The derivatives with respect to the model parameters θ are:

\[ \frac{\partial F(v)}{\partial W} = -g\!\left(\sum_{i=1}^{V} \frac{v_i}{\sigma_i} W_{ij} + a_j\right) v = -v h^{T} \tag{23} \]

\[ \frac{\partial F(v)}{\partial a} = -g\!\left(\sum_{i=1}^{V} \frac{v_i}{\sigma_i} W_{ij} + a_j\right) = -h \tag{24} \]

\[ \frac{\partial F(v)}{\partial b} = -\frac{1}{\sigma^2}(v - b) \tag{25} \]

\[ \frac{\partial F(v)}{\partial \sigma} = \frac{(v - b)^2}{\sigma^3} - \frac{h^{T} W v}{\sigma^2} \tag{26} \]
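The following NumPy sketch (illustrative only; not code from the thesis) implements the Gaussian-Bernoulli conditionals of Eqs. 20 and 21: binary hidden units driven by the scaled visible activities v_i/σ_i, and Gaussian visible units with mean b_i + σ_i Σ_j h_j W_ij. The sizes are placeholders, and the variances are simply fixed to unity, as is often done in practice.

```python
# A minimal sketch (illustrative only) of the Gaussian-Bernoulli RBM
# conditionals, Eqs. (20)-(21).
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, a, sigma):
    # P(h_j = 1 | v) = g(sum_i (v_i / sigma_i) W_ij + a_j), Eq. (21)
    return sigmoid((v / sigma) @ W + a)

def sample_v_given_h(h, W, b, sigma, rng):
    # v_i | h ~ N(b_i + sigma_i * sum_j h_j W_ij, sigma_i^2), Eq. (20)
    mean = b + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)

V, H, N = 5, 3, 4
W = 0.1 * rng.standard_normal((V, H))
a, b = np.zeros(H), np.zeros(V)
sigma = np.ones(V)                 # variances fixed to unity for this sketch

v = rng.standard_normal((N, V))    # stand-in for normalized pixel data
h_prob = p_h_given_v(v, W, a, sigma)
h = (rng.random(h_prob.shape) < h_prob).astype(float)
v_recon = sample_v_given_h(h, W, b, sigma, rng)
print(h_prob.shape, v_recon.shape)
```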

2.2.3 Contrastive Divergence

Because maximum likelihood learning via Eq. 9 is infeasible, an approximation is needed. In practice, learning is done by following an approximation to the gradient of a different objective function, called "Contrastive Divergence" (CD) [17]. In the binary-binary RBM this corresponds to the following learning rule:

\[ \Delta W = \alpha \left( E_{P_{\text{data}}}\!\left[ v h^{T} \right] - E_{P_{T}}\!\left[ v h^{T} \right] \right) \tag{27} \]

\[ \Delta a = \alpha \left( E_{P_{\text{data}}}[h] - E_{P_{T}}[h] \right) \tag{28} \]

\[ \Delta b = \alpha \left( E_{P_{\text{data}}}[v] - E_{P_{T}}[v] \right) \tag{29} \]

where α is the learning rate and P_T represents the distribution defined by running a Gibbs chain, initialized at the data, for T full steps.

Due to the special bipartite structure of the RBM, quite an efficient Gibbs sampler exists. Alternating Gibbs sampling, as shown in Figure 2, updates in parallel all of the units in one layer given the current states of the units in the other layer, and vice versa (see Eq. 11 and Eq. 13). In the CD-T algorithm, the correlations between the activities of the two layers are measured after the first update of the hidden units and again after T steps. The difference of these two correlations provides the learning rule (Eqs. 27, 28, 29) for updating the parameters of the model. Setting T = ∞ recovers maximum likelihood learning, as it corresponds to sampling from the model's equilibrium distribution; however, CD learning with T = 1 has been shown to work quite well [17].
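As shown in the sketch below (an illustration in NumPy; the thesis itself used Pylearn2), a single CD-1 update amounts to one alternating Gibbs step initialized at the data, followed by the learning rules of Eqs. 27-29. The data, layer sizes and learning rate are arbitrary placeholders.

```python
# A minimal sketch (not the thesis implementation) of one CD-1 parameter
# update for a binary-binary RBM, following Eqs. (27)-(29) with T = 1.
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=rng):
    # positive phase: hidden probabilities driven by the data
    h0 = sigmoid(v0 @ W + a)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # one step of alternating Gibbs sampling (Figure 2)
    v1 = sigmoid(h0_sample @ W.T + b)       # reconstruction probabilities
    h1 = sigmoid(v1 @ W + a)                # hidden probabilities after T = 1
    n = len(v0)
    # learning rules, Eqs. (27)-(29): data correlations minus model correlations
    W += lr * ((v0.T @ h0) - (v1.T @ h1)) / n
    a += lr * (h0 - h1).mean(axis=0)
    b += lr * (v0 - v1).mean(axis=0)
    return np.mean((v0 - v1) ** 2)          # reconstruction error monitor

V, H, N = 16, 8, 32
W = 0.01 * rng.standard_normal((V, H))
a, b = np.zeros(H), np.zeros(V)
data = (rng.random((N, V)) < 0.3).astype(float)

for epoch in range(5):
    err = cd1_update(data, W, a, b)
    print(f"epoch {epoch}: reconstruction error {err:.4f}")
```

Repeating the Gibbs step T times before measuring the negative-phase statistics would give the general CD-T procedure of Figure 2.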

Figure 2. CD-T learning procedure that uses alternating Gibbs sampling. The top layers represent hidden units and the bottom layers visible units. The data is used to initialize the Markov chain.


2.2.4 Denoising Score Matching

It is quite difficult for the Gaussian RBM to learn the variances from natural images using Contrastive Divergence and the parameterization of the energy function presented in Eq. 19. Therefore, the variances are often fixed to unity [18][19]. To overcome this limitation, we introduce a modified energy function for the Gaussian RBM and the Denoising Score Matching (DSM) algorithm.

Score matching (SM) is an alternative to Contrastive Divergence in which probability density models whose partition function is intractable can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data [20].

The Denoising Autoencoder (DAE) is a deep learning model that forces the hidden layer to discover more robust features by reconstructing the input from a corrupted version of it. DAEs have proven to be an empirically successful alternative to RBMs for pretraining deep networks. Pascal Vincent showed that there is an equivalence between the DAE and a Gaussian RBM trained using denoising score matching [21], which leads to the following objective function:

\[ J_{\mathrm{DSM}\,q_\sigma}(\theta) = E_{q_\sigma(\tilde{v},v)}\!\left[ \frac{1}{2} \left\| \psi(\tilde{v};\theta) - \frac{\partial \log q_\sigma(\tilde{v}|v)}{\partial \tilde{v}} \right\|^{2} \right], \tag{30} \]

where E_{q_\sigma(\tilde{v},v)}[·] is the expectation with respect to the Gaussian noise model. The visible units are corrupted using additive isotropic Gaussian noise: \tilde{v} = v + \epsilon, \epsilon \sim \mathcal{N}(0, \sigma^{2}I). Thus, the latter term, the gradient of the Parzen window density estimator with respect to the corrupted visible units, corresponds to

\[ \frac{\partial \log q_\sigma(\tilde{v}|v)}{\partial \tilde{v}} = \frac{1}{\sigma^{2}}(v - \tilde{v}). \tag{31} \]

The gradient of the log density with respect to the corrupted visible units \tilde{v} is called the score: \psi(\tilde{v};\theta) = \frac{\partial \log p(\tilde{v};\theta)}{\partial \tilde{v}}. We now add the following modification to the energy function of the Gaussian RBM to achieve the equivalence with the Denoising Autoencoder:¹

\[ E(v,h;\theta) = -\frac{1}{\sigma^{2}}\left( v^{T}b + v^{T}Wh + h^{T}a - \frac{1}{2} v^{T}v \right) \tag{32} \]

\[ P(h|v) = g\!\left( \frac{Wv + a}{\sigma^{2}} \right) \tag{33} \]

\[ P(v|h) = \mathcal{N}\!\left( Wh + b,\; \sigma^{2} \right) \tag{34} \]

\[ F(v) = \frac{\frac{1}{2} v^{T}v - b^{T}v}{\sigma^{2}} - \log\!\left( 1 + \exp\!\left( \frac{Wv + a}{\sigma^{2}} \right) \right) \tag{35} \]

\[ \psi(v) = -\frac{1}{\sigma^{2}}\left( v - b - g\!\left( \frac{Wv + a}{\sigma^{2}} \right) W^{T} \right) \tag{36} \]

Substituting Eq. 31 and Eq. 36 into Eq. 30 leads to the final objective function, which can then be minimized by using a gradient descent algorithm:

\[ J_{\mathrm{DSM}\,q_\sigma}(\theta) = E_{q_\sigma(v,\tilde{v})}\!\left[ \frac{1}{2} \left\| -\frac{1}{\sigma^{2}}\left( \tilde{v} - b - g\!\left( \frac{W\tilde{v} + a}{\sigma^{2}} \right) W^{T} \right) - \frac{1}{\sigma^{2}}(v - \tilde{v}) \right\|^{2} \right]. \tag{37} \]

¹ This is not the same energy function as presented in Eq. 4.3 of [21], because that energy function has no hidden variables and is therefore not an RBM. This parameterization was proposed by Ian Goodfellow in the Pylearn2 [22] implementation of the Gaussian RBM.
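The sketch below (an illustration under stated assumptions, not the thesis implementation) evaluates the DSM objective of Eq. 37 for the parameterization above, with W stored as a V×H matrix, a single scalar σ shared by the model and the corruptor, and random data standing in for contrast-normalized images. The sizes loosely follow the first layer used later in the experiments (1024 visible units), but they are only placeholders.

```python
# A minimal sketch (illustrative; parameter shapes are assumptions) of the
# denoising score matching objective of Eq. (37) for the Gaussian RBM
# parameterization of Eqs. (32)-(36), with a single scalar sigma.
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_score(v, W, a, b, sigma):
    # psi(v) = -(1/sigma^2) * (v - b - g((v W + a)/sigma^2) W^T), Eq. (36)
    h = sigmoid((v @ W + a) / sigma**2)
    return -(v - b - h @ W.T) / sigma**2

def dsm_objective(v, W, a, b, sigma, rng=rng):
    # corrupt the data: v_tilde = v + eps, eps ~ N(0, sigma^2 I)
    v_tilde = v + sigma * rng.standard_normal(v.shape)
    # score of the corruptor, Eq. (31): (v - v_tilde) / sigma^2
    target = (v - v_tilde) / sigma**2
    diff = model_score(v_tilde, W, a, b, sigma) - target
    return 0.5 * np.mean(np.sum(diff**2, axis=1))     # Eq. (37)

V, H, N = 1024, 1000, 16            # e.g. 32x32 inputs and 1000 hidden units
W = 0.01 * rng.standard_normal((V, H))
a, b = np.zeros(H), np.zeros(V)
v = rng.standard_normal((N, V))     # stand-in for contrast-normalized images

print("J_DSM =", dsm_objective(v, W, a, b, sigma=0.4))
```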

2.3 Deep Belief Nets

An efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially. In 2006, Hinton et al. introduced Deep Belief Nets [4], with a learning algorithm that greedily trains one layer at a time, exploiting unsupervised learning for each layer. Each layer in a DBN captures high-order correlations between the activities of the hidden features in the layer below. A key feature of this algorithm is its greedy layer-by-layer training, which can be repeated several times. A variational lower bound justifies the greedy layerwise training of RBMs [4].

The main building block of a DBN is the restricted Boltzmann machine. The idea of the greedy learning algorithm is quite simple: train the first RBM normally using the data in its visible layer, then freeze the learned parameter vector and use the hidden layer activations as data when training the second RBM. This process is outlined in Figure 3.

Finally, a normal feed-forward neural network can be initialized using the weights learned by the pretrained RBMs. For the output, a softmax regression layer is usually chosen for classification tasks. For continuous values, a linear Gaussian layer can be used to model conditionally Gaussian data. The neural network can then be trained using the standard supervised backpropagation algorithm. This training phase is commonly referred to as fine-tuning.

Figure 3. Deep Belief Nets. The first RBM is trained to model the raw input as its visible layer. The second RBM is trained using the transformed data from the previous layer as training examples for its visible layer. This process can be repeated to increase the depth of the DBN. Finally, a discriminative fine-tuning can be performed by adding a final layer that represents the desired outputs and backpropagating the error derivatives.
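To illustrate the greedy procedure described above, the sketch below (illustrative; the thesis used Pylearn2, and its first layer was a Gaussian RBM trained with denoising score matching rather than a binary RBM) trains a small stack of binary RBMs with CD-1, freezing each layer and feeding its deterministic hidden expectations to the next layer as data. All sizes, data and hyperparameters are placeholders.

```python
# A minimal sketch (not the thesis code) of greedy layer-by-layer DBN
# pretraining: train an RBM, freeze it, transform the data with the hidden
# expectations, and train the next RBM on the transformed data.
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.1, rng=rng):
    """Toy binary-binary RBM trainer using CD-1 (Eqs. 27-29)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a, b = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        h0 = sigmoid(data @ W + a)
        h0_s = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_s @ W.T + b)
        h1 = sigmoid(v1 @ W + a)
        W += lr * ((data.T @ h0) - (v1.T @ h1)) / len(data)
        a += lr * (h0 - h1).mean(axis=0)
        b += lr * (data - v1).mean(axis=0)
    return W, a, b

# Toy binary data standing in for the activations of a (trained) first layer.
data = (rng.random((200, 64)) < 0.3).astype(float)

layer_sizes = [32, 32, 64]          # illustrative sizes, not the thesis config
rbms, layer_input = [], data
for n_hidden in layer_sizes:
    W, a, b = train_rbm_cd1(layer_input, n_hidden)
    rbms.append((W, a, b))
    # deterministic transformation: hidden expectations become the next data
    layer_input = sigmoid(layer_input @ W + a)

print("pretrained", len(rbms), "layers; top representation:", layer_input.shape)
```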


3 EXPERIMENTS

3.1 Facial Keypoints Dataset

The facial keypoints dataset available through a Kaggle competition [23] is used in the experiments. The task is to predict keypoint positions on 96×96 pixel grayscale face images. The problem is challenging because facial features vary from one individual to another, and environmental conditions, such as illumination, viewing angle and pose, cause a large amount of variation. Features learned to predict keypoints can also be suitable for several other applications, such as tracking faces, analysing facial expressions, detecting dysmorphic facial signs for medical diagnosis, and face recognition.

The dataset is divided into a training set and a public test set. Each predicted keypoint is specified by an (x,y) real-valued pair in the space of pixel indices. There are 15 keypoints², which represent locations on the eyes, eyebrows, nose and mouth. The public test set of 1783 images does not have keypoint values, and it is used to test performance against other competitors through the platform. There are also two benchmarks on the public leaderboard: the averages benchmark and the patch search benchmark. Even though the training set size is 7049 images, only 2140 samples have all 15 keypoints and the rest have only 4 keypoints labeled. To make use of all the training data, algorithms need to somehow handle this missing label information.

² Keypoint ids: left_eye_center, right_eye_center, left_eye_inner_corner, left_eye_outer_corner, right_eye_inner_corner, right_eye_outer_corner, left_eyebrow_inner_end, left_eyebrow_outer_end, right_eyebrow_inner_end, right_eyebrow_outer_end, nose_tip, mouth_left_corner, mouth_right_corner, mouth_center_top_lip and mouth_center_bottom_lip.

3.2 Setup for Experiments

We used the Pylearn2 machine learning research library [22] to implement the models. It is built on top of Theano [24], which can compile code for both CPU and GPU backends. Many deep learning algorithms, like Contrastive Divergence for RBMs, can be parallelized on the GPU, which significantly speeds up learning.

In data preprocessing, we first downsampled the images by a factor of 3, which decreased the input dimension to 32×32 = 1024. This common practice in image processing is usually applied to achieve reasonable memory usage and faster training times. Then we applied global contrast normalization so that the images have zero mean and unit variance.

The original dataset was split into training, validation and test sets. The test and validation sets consist of 300 fully labeled examples. The public test set gave a convenient way to test the effect of additional unlabeled data; this additional data is used in unsupervised pretraining. Training is stopped based on the results on the validation set.
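A minimal sketch of the preprocessing described in this section is given below (an illustration; the thesis does not specify the exact resampling method, so simple 3×3 block averaging is assumed here): the 96×96 images are downsampled by a factor of 3 to 32×32 and each image is globally contrast normalized to zero mean and unit variance.

```python
# A minimal sketch (assumptions noted above) of the preprocessing pipeline:
# downsample 96x96 faces to 32x32 and apply global contrast normalization.
import numpy as np

def downsample_by_3(img):
    # img: (96, 96) array -> (32, 32) by averaging non-overlapping 3x3 blocks
    return img.reshape(32, 3, 32, 3).mean(axis=(1, 3))

def global_contrast_normalize(img, eps=1e-8):
    # zero mean and unit variance per image
    img = img - img.mean()
    return img / (img.std() + eps)

rng = np.random.default_rng(6)
faces = rng.random((4, 96, 96))                # stand-in for grayscale faces
processed = np.stack([
    global_contrast_normalize(downsample_by_3(f)) for f in faces
])
flat = processed.reshape(len(processed), -1)   # (N, 1024) input vectors
print(flat.shape, flat.mean(axis=1)[:2], flat.std(axis=1)[:2])
```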

3.3 Architectures

The surrounding feed-forward neural network with a linear Gaussian output layer is the same for each pretrained DBN. To test the effect of increasing the depth, we trained a new RBM on top of the previous DBN. We fine-tuned each DBN separately, which led to a total of three DBNs per test case. We transformed the data from the lower layer by calculating the expectations of the hidden units. These deterministic activations can be considered as features that the layer can extract. Because the hidden units are binary stochastic units, it is also natural to interpret the activations as including a sampling step from the Bernoulli distribution. We therefore also ran tests using activations sampled from the deterministic expectations each time a training example is shown to the model.

Figure 4. The architecture of the medium-size DBN used in the experiments.

We varied the size of the network, more specifically the number of hidden units in each layer. Three hidden layer configurations were tested: 1000-1000-4000, 1500-1500-6000 and 2000-2000-8000. The medium-size architecture used in the experiments is outlined in Figure 4. By initializing the hidden biases of the first two layers to -2, we encouraged sparse hidden activities, which helped to learn deeper models.
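The two ways of transforming data between layers compared in these experiments can be sketched as follows (an illustration, not the thesis code; the layer size and the bias value of -2 are taken from the description above, while the weights and inputs are random placeholders): deterministic hidden expectations versus binary activations sampled from the corresponding Bernoulli distribution.

```python
# A minimal sketch (illustrative) of deterministic versus sampled activations
# for transforming data from one layer to the next during pretraining.
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transform_deterministic(v, W, a):
    # expectations of the hidden units given the data
    return sigmoid(v @ W + a)

def transform_sampled(v, W, a, rng=rng):
    # draw binary activations from the Bernoulli distribution each time
    p = sigmoid(v @ W + a)
    return (rng.random(p.shape) < p).astype(float)

V, H, N = 1024, 1500, 8            # e.g. the medium-size first hidden layer
W = 0.01 * rng.standard_normal((V, H))
a = np.full(H, -2.0)               # biases initialized to -2 encourage sparsity
v = rng.standard_normal((N, V))

print(transform_deterministic(v, W, a).mean(),   # sparse, real-valued features
      transform_sampled(v, W, a).mean())         # sparse, binary features
```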

We trained the first-layer Gaussian RBM with 1024 visible units (32×32 pixels) using Denoising Score Matching with a σ = 0.4 Gaussian corruptor. The second and third layers are normal binary RBMs, which we trained using the CD-1 algorithm. The linear Gaussian output layer has 30 units, which correspond to the (x,y) pixel coordinates of the 15 keypoints.

We used the objective function as the stopping criterion with Denoising Score Matching and the reconstruction error with Contrastive Divergence. In supervised fine-tuning, early stopping is based on the Mean Squared Error. In these experiments, early stopping is the only method we used to prevent overfitting.

3.4 Results

We used the following metrics to evaluate the performance of the predictions: Mean Squared Error (MSE), Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). We also evaluated the best model in each size category on the Kaggle platform, which uses the RMSE metric.
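For reference, the metrics can be computed as in the following sketch (illustrative; the array shapes assume 15 (x,y) keypoints per image, i.e., 30 values, and the data is synthetic):

```python
# A minimal sketch of the evaluation metrics, computed between predicted and
# true keypoint coordinates of shape (N, 30).
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

rng = np.random.default_rng(8)
y_true = rng.uniform(0, 96, size=(300, 30))          # 15 (x, y) keypoints
y_pred = y_true + rng.normal(0, 2.0, size=y_true.shape)
print(f"MSE {mse(y_true, y_pred):.3f}  MAE {mae(y_true, y_pred):.3f}  "
      f"RMSE {rmse(y_true, y_pred):.3f}")
```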

Table 1. Comparison to Kaggle benchmarks. "(e)" denotes the usage of the extended training set.

Model                    RMSE
Averages Benchmark       3.96244
Patch Search Benchmark   3.80685
DBN-3 (Small) (e)        3.49923
DBN-3 (Medium) (e)       3.45029
DBN-3 (Large) (e)        3.45612

The experiments showed that discriminative performance improved when deterministic activations were used in pretraining. Increasing the size of the network from small to medium improved the performance. Increasing the size further decreased the performance, but it was still better than with only one layer. Our best model was the medium-size DBN-3, trained using deterministic activations between layers. The extended dataset for pretraining improved the overall performance. The best DBN-3 in each size category overcame the reference benchmarks in the Kaggle competition (Table 1). The test results are summarized in Table 2.

3.4.1 Visualizations

Figure 5. Some of the weights learned by the Gaussian RBM in the first layer.

Figure 5 shows some of the filters learned by the first-level Gaussian RBM. Example keypoint predictions by the best model are shown in Figure 6. Keypoints on faces that match the dominant scale in the dataset were predicted well. Mouth keypoints caused some error in these images when the face was not aligned. When an image was at a notably smaller scale, the predictions were clearly wrong and roughly corresponded to an average guess. The model tolerated occlusions such as glasses, hair and beard quite well.

Figure 6. Keypoint predictions. Green indicates points from the left side and red points from the right side of the face, relative to the subject.

3.4.2 Learning Progress

The convergence of the MSE on the training set during supervised fine-tuning (Figure 7) shows that the complexity of the best model was sufficient for the task. During training, the differences in the monitored channels (objective function, MSE) between the training and validation sets started to increase. Training was stopped when the MSE on the validation set did not improve for 20 consecutive epochs.

In unsupervised pretraining, overfitting did not occur. For the RBMs we monitored the reconstruction error; in the Gaussian RBM case, the objective function was also monitored. Early stopping was applied when the progress during 10 consecutive epochs was smaller than 1%. The learning progress of the RBMs is shown in Appendix 1.
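The two stopping rules described above can be sketched as follows (an illustration; the function and variable names are hypothetical): stop fine-tuning when the validation MSE has not improved for 20 consecutive epochs, and stop pretraining when the monitored quantity has improved by less than 1% over 10 consecutive epochs.

```python
# A minimal sketch (illustrative) of the early-stopping rules used above.
def supervised_early_stop(val_mse_history, patience=20):
    # stop when the best validation MSE is at least `patience` epochs old
    best_epoch = min(range(len(val_mse_history)), key=val_mse_history.__getitem__)
    return len(val_mse_history) - 1 - best_epoch >= patience

def pretraining_early_stop(monitor_history, window=10, min_rel_progress=0.01):
    # stop when relative progress over the last `window` epochs is below 1%
    if len(monitor_history) <= window:
        return False
    old, new = monitor_history[-window - 1], monitor_history[-1]
    return (old - new) / abs(old) < min_rel_progress

# usage: call after each epoch with the history of monitored values
print(supervised_early_stop([5.0, 4.0, 3.9] + [3.95] * 20))            # True
print(pretraining_early_stop([10.0 - 0.001 * i for i in range(12)]))   # True
```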

Figure 7. Learning progress of DBN-3’s supervised fine-tuning.

Table 2. Prediction metrics for the DBNs. "(e)" denotes the usage of the extended training set and "(s)" denotes sampled hidden activations between layers in pretraining. The Small, Medium and Large network sizes correspond to 1000-1000-4000, 1500-1500-6000 and 2000-2000-8000 hidden units in each layer, respectively. The best results were obtained with the medium-size DBN-3 (e).

                     Small                           Medium                          Large
Model            MSE       MAE      RMSE         MSE       MAE      RMSE         MSE       MAE      RMSE
DBN-1            5.182930  1.499364 2.276605     4.840961  1.459513 2.200218     5.132676  1.490685 2.265541
DBN-2            4.263798  1.429895 2.064897     4.415898  1.443715 2.101404     4.608888  1.481311 2.146832
DBN-3            4.262035  1.443279 2.064470     4.143560  1.405758 2.035574     4.163945  1.423680 2.040575
DBN-1 (e)        4.844518  1.409683 2.201027     4.581566  1.371640 2.140459     4.827625  1.380519 2.197186
DBN-2 (e)        3.973304  1.356106 1.993315     3.897413  1.357655 1.974187     3.717659  1.325147 1.928123
DBN-3 (e)        3.653829  1.330983 1.911499     3.384984  1.275147 1.839833     3.431095  1.290669 1.852322
DBN-2 (e) (s)    3.969170  1.372654 1.992278     4.032181  1.372653 2.008029     3.738019  1.321421 1.933396
DBN-3 (e) (s)    3.723732  1.344598 1.929697     3.462854  1.289099 1.860875     3.547598  1.305768 1.883507


4 DISCUSSION

We found that our best model is surprisingly expressive: it can learn well, but it will easily overfit if early stopping is not used. Figure 7 is a good example of learning progress with a variance problem. The monitoring curves on the validation set do not follow those on the training set, and the model's ability to generalize would have started to deteriorate without early stopping.

The filters in Figure 5 show that modeling the real-valued input data using a Gaussian RBM and Denoising Score Matching gave meaningful results. The first layer plays an important role in the whole network because the upper layers are trained based on its hidden activations. The filters clearly correspond to recognizable face images. This is due to the fact that the images in the dataset have roughly the same scale and alignment. These weights are clearly better than random values for initializing a feed-forward network. We want to emphasize that these useful filters were learned in a completely unsupervised way. We showed that using the extended training set improved the results, and we are intrigued by the possibility of exploiting large databases such as Google image search. We also found that deterministic activations were better than sampled activations when transforming data from the lower layer.

Analysis of the keypoint predictions tells us that our model can predict meaningful keypoints on facial images that have the same scale. Faces with small rotations typically got some error on one side, but the prediction is probably still good enough, for example, for use in tracking. The biggest errors were on images at a completely different scale. Predictions for those cases are useless and can be interpreted as the model just making an educated guess. We found that the model tolerated occlusions caused by glasses, hair and beard well. That is a really nice feature to have, because real-world use cases most certainly contain noisy data.

There is quite a big bias between the test results and the Kaggle benchmarks. Some of it is probably caused by the fact that for the 4-keypoint images, the nose tip is usually labeled below the nose. This imbalance did not cause error in our tests because we only used fully labeled examples in supervised learning and testing. The public test set also contained more examples at a smaller scale.

4.1 Future Work

A few simple ways to generate more data for this task would be vertical mirroring, small rotations and translations. This could help the model to learn some rotational and translational invariance, and with the additional data a larger network could be trained. A better method to handle overfitting, such as dropout or weight decay, could also be beneficial. The current early stopping method is a very crude way to deal with this problem, especially on larger networks. Scale invariance is harder to achieve and would require changes to the network architecture.

An interesting way to learn hierarchical feature representations is to use small image patches instead of full images. The idea is to use deep models to produce feature representations for spatial pyramids. Even though the concept is quite different from the method used in this thesis, it still exploits the same deep learning models and algorithms.

5 CONCLUSIONS

The goal of this thesis was to find out how well deep generative models can extract features from small grayscale images to detect facial keypoints. The facial keypoints dataset available through a Kaggle competition was used in the experiments.

In this thesis, a general introduction to deep learning models and algorithms was given. We introduced the main building block of deep learning, an energy-based generative model called the Restricted Boltzmann Machine. We derived Contrastive Divergence, an unsupervised learning algorithm for energy-based models. A novel approach, Denoising Score Matching, was introduced to train a Gaussian RBM to model real-valued data. We described how Deep Belief Nets can be constructed by using greedy layer-by-layer training.

Analysis of the predicted keypoints showed that the proposed deep model was able to predict meaningful keypoints on facial images that have the same scale. It also overcame the reference benchmarks in the Kaggle competition. The model tolerated occlusions caused by glasses, hair and beard well. In pretraining, it was able to benefit from the extended unlabeled training set. Transforming data from the lower layer using deterministic activations was better than using sampled activations. The complexity of the model was sufficient for the task, and we had to use early stopping during training to prevent overfitting.

The results of this work imply that a Gaussian RBM trained with Denoising Score Matching can successfully model real-valued pixel values and that increasing the depth improves the discriminative performance of the deep model. A better method to handle overfitting than early stopping is needed; for example, the usage of dropout and weight decay could improve the performance.

REFERENCES

[1] Paul Bach-y-Rita and Stephen W. Kercel. Sensory substitution and the human-machine interface. Trends in Cognitive Sciences, 7(12):541–546, 2003.

[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book, Now Publishers, 2009.

[3] T. S. Lee, D. Mumford, R. Romero, and V. A. F. Lamme. The role of the primary visual cortex in higher level vision. Vision Research, 38(15/16):2429–2454, 1998.

[4] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.

[5] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[6] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.

[7] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke. An application of pretrained deep neural networks to large vocabulary conversational speech recognition. Technical Report 001, Department of Computer Science, University of Toronto, 2012.

[8] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In P. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[9] ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012). http://www.image-net.org/challenges/LSVRC/2012/. Accessed August 12, 2013.

[10] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[11] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, July 2009.

[12] Machine learning competitions. http://www.kaggle.com. Accessed August 12, 2013.

[13] George Dahl. Deep learning how I did it: Merck 1st place interview. http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/. Accessed August 12, 2013.

[14] Vlad Mnih. Q&A with job salary prediction first prize winner Vlad Mnih. http://blog.kaggle.com/2013/05/06/qa-with-job-salary-prediction-first-prize-winner-vlad-mnih/. Accessed August 12, 2013.

[15] Yann LeCun and Fu Jie Huang. Loss functions for discriminative training of energy-based models. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AIStats'05), 2005.

[16] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 194–281. MIT Press, Cambridge, MA, USA, 1986.

[17] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, August 2002.

[18] KyungHyun Cho, Alexander Ilin, and Tapani Raiko. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines. In Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN'11), Part I, pages 10–17, Berlin, Heidelberg, 2011. Springer-Verlag.

[19] Nan Wang, Jan Melchior, and Laurenz Wiskott. An analysis of Gaussian-binary restricted Boltzmann machines for natural images. In Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 287–292, 2012.

[20] Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, December 2005.

[21] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, July 2011.

[22] Pylearn2: a machine learning library based on Theano. http://deeplearning.net/software/pylearn2/. Accessed August 12, 2013.

[23] Facial keypoints detection. http://www.kaggle.com/c/facial-keypoints-detection. Accessed August 12, 2013.

[24] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

Appendix 1: Learning Progress of Pretraining

Figure A1.1. Learning progress of the 1st layer Gaussian RBM's pretraining.

Figure A1.2. Learning progress of the 2nd layer RBM's pretraining.

Figure A1.3. Learning progress of the 3rd layer RBM's pretraining.
