
Facial Emotional Recognition Experiment by applying R-CNN

Nguyen Gia Hong

Master's thesis

School of Computing Computer Science

Autumn 2020


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu.

School of Computing Computer Science

Author: Nguyen Gia Hong

Title of the Master’s thesis: Facial Emotional Recognition Experiment by applying R-CNN

Supervisor of the Master’s thesis: Xiao-Zhi Gao

Term and year of completion: Autumn 2020

ABSTRACT

Analyzing and detecting emotions from human facial muscle movements is a problem that has been defined and developed over several years because of the benefits it provides. During this development, datasets and methods have become more and more complex, and both the difficulty and the achievable accuracy have increased. Besides the normal Convolutional Neural Network (CNN), this thesis implements a Residual CNN to detect human facial emotions from input images or camera video. The results of the experiment show that this method is competitive with other modern methods for emotion recognition on complex images, as reported in scientific studies, while requiring far fewer parameters. Because of the scope of the study and the hardware limitations, ensemble methods, which would give better results, could not be implemented. The best result of this experiment with a Residual Convolutional Neural Network is 66.8% on the FER2013 dataset.

Keywords: CNN, Residual CNN, Facial emotional recognition, FER2013


PREFACE

I developed this recognition experiment in the Python language using the Anaconda framework, the Spyder environment, and other computer-vision-related libraries. Through this project on facial emotion detection, I have gained experience with many modern architectures, for example VGG16 (also called OxfordNet) and the YOLO network, and with how these networks work. After this interesting project, I will try to focus more on the computer vision field.

I would like to thank my supervisor, Professor Xiao-Zhi Gao. Mr. Tran Thuong Khanh, as a brother and a Ph.D. candidate in the computer vision field at Oulu University, also guided me and helped me a lot by explaining detailed theories about facial emotion recognition throughout this thesis.

Joensuu, 05/10/2020 Nguyen Gia Hong


ABBREVIATIONS

CNN Convolutional Neural Network

Residual CNN Residual Convolutional Neural Network

VGG16 Visual Geometry Group 16-layer network

YOLO You Only Look Once

FER2013 Facial Expression Recognition 2013

SVM Support Vector Machine

FACS Facial Action Coding System

ResNet-4 4-Residual-block Network


CONTENT

ABSTRACT

PREFACE

ABBREVIATIONS

CONTENT

LIST OF FIGURES

1 INTRODUCTION

1.1 Problems

1.2 Scope of the thesis and target

2 RELATED STUDIES

2.1 Deep Learning and Linear Support Vector Machine

2.1.1 Brief introduction

2.1.2 Results

2.1.2.1 Winning the Face-to-Face Recognition Contest in 2013

2.1.2.2 Comparison experiment between Softmax and Linear SVM

2.1.3 Advantages, disadvantages, and challenges

2.2 Facial expression recognition using a multi-level convolutional neural network

2.2.1 Brief introduction

2.2.2 Result

2.2.3 Advantages, disadvantages, and challenges

3 BASIC THEORY

3.1 Human emotions through facial expressions

3.1.1 Emotions on human faces

3.1.2 The facial action coding system

3.2 Convolutional Neural Network in image recognition

3.2.1 Convolutional Neural Network introduction

3.2.2 Basic layers in Convolutional Neural Network

3.2.3 Convolutional layer (CONV)

3.2.3.1 Stride, depth, and zero-padding

3.2.4 Pooling layer (POOL)

4 RESIDUAL CONVOLUTIONAL NEURAL NETWORK

4.1 Introduction

4.2 Expressive inference from facial muscles

4.2.1 Analysis of typical examples based on FACS table

4.3 Residual CNN architecture

5 EXPERIMENTS

5.1 FER dataset introduction

5.1.1 FER2013 dataset exploration

5.2 Experiment environments

5.3 Evaluation metrics

5.4 Experiment framework

5.4.1 Python language

5.4.2 Spyder IDE

5.4.3 Project code structure

5.5 Data processing

5.6 Training parameters configuration

5.7 Evaluation results on FER2013 dataset

5.7.1 Comparison with other very deep modern architectures

5.7.2 Confusion matrix

5.8 Results in reality

6 CONCLUSION

7 REFERENCES


LIST OF FIGURES

Figure 1.1: Series of human basic emotions on still images.

Figure 1.2: Examples of three facial expression datasets.

Figure 2.1: Graph of the test results of two models on averages on 8-fold validation.

Figure 2.2: Multi-level architecture facial expression recognition.

Figure 2.3: An example of some output of Grad-CAM filters from multi-level architecture.

Figure 2.4: An example of the input and Grad-CAM images from multi-level architecture.

Figure 3.1: An example of a ConvNet created by basic layers. It converts a 3-dimensional input block (36x36x3) into a 3-dimensional output block (1x1x2) – it only has 2 classes.

Figure 3.2: An example of a bigger ConvNet called VGG stacked by many basic layers.

Figure 3.3: Example of spatial arrangement.

Figure 3.4: Max-pooling layer downsamples the volume spatially.

Figure 4.1: Emotional recognition system.

Figure 4.2: Angry emotions in FER2013.

Figure 4.3: Disgust emotions in FER2013.

Figure 4.4: Fear emotions in FER2013.

Figure 4.5: Happy emotions in FER2013.

Figure 4.6: Sad emotions in FER2013.

Figure 4.7: Surprise emotions in FER2013.

Figure 4.8: Neutral emotions in FER2013.


Figure 4.9: 2D Menpo facial landmark configuration. Left: Frontal 2D landmarks (68 points), Middle: Left 2D landmarks (39 points), Right: Right 2D landmarks (39 points).

Figure 4.10: 3D Menpo facial landmark configuration. Left: Frontal 3D landmarks, Middle: Left 3D landmarks, Right: Right 3D landmarks.

Figure 4.11: Vanishing gradient problem in very deep normal CNN network.

Figure 4.12: One Residual block.

Figure 4.13: Full ResNet architecture in the thesis.

Figure 5.1: FER2013 dataset emotions distribution.

Figure 5.2: Some examples in the dataset. There are many variants between images such as ages, face angles, expression strength, and some blank images.

Figure 5.3: Distribution of each class in three separated datasets: training set, validation set, and test set.

Figure 5.4: Project code structure.

Figure 5.5: Small Residual CNN definition code.

Figure 5.6: Training process on Small Residual CNN code.

Figure 5.7: Generator applied data augmentation methods to generate more image data while training.

Figure 5.8: Evaluation results after 100 epochs on our Residual CNN.

Figure 5.9: Confusion matrix.

Figure 5.10: The combination of two models: emotion and gender recognition.

Figure 5.11: Another good recognition example, but four heads are not in a frontal position, so the model cannot detect them.

Figure 5.12: Recognition result in a scene of the End Game movie.


Figure 5.13: Recognition result in a scene of the Three Idiots movie.


1 INTRODUCTION

Human facial expressions play a significant role in modern social communication.

A casual conversation involves both verbal and non-verbal communication. Non-verbal communication includes eye contact, various facial expressions, body language, gestures, and so on. A smile shows happiness, a sad expression shows a loss, a fearful expression shows being scared of something, and a surprised expression indicates that something unpredictable has happened. According to Darwin and Phillip [1], facial expressions are among the universal, natural, and powerful signals people use to convey their intentions and emotional states. In the fields of computer vision and machine learning, many scientists have researched automated facial analysis systems because of their important practical applications in robot interaction systems. Through this, many facial recognition systems have tried to encode emotional expressions from facial features.

Since the early 20th century, Paul and Wallace [2] have defined six basic facial expressions based on their studies. They declared that human emotion is universal: humans perceive expressions the same way no matter which culture they come from. Together with the neutral state, there are seven basic expression classes: anger, disgust, fear, happiness, sadness, surprise, and neutral; the disgust expression was added later as one of the basic facial expressions. However, recent studies in neuroscience and psychology have argued that the pattern of the six basic facial features is culturally specific and not universal [3]. Because of its historical value, the FER2013 dataset is built on the universal hypothesis of facial expressions. Currently, many Chinese researchers have collected and developed datasets for Chinese people, such as Ma Jialin and his colleagues [4], which contributes to promoting applications in Chinese industry [5].

At present, facial recognition systems can be classified into two main categories depending on their input or design: still-image-based systems and systems based on a series of frames. A still-image-based system analyzes the emotion of each image using only the information available in that image, mainly extracting features based on the relative positions of the elements in the image. Figure 1.1 shows a series of still images of basic emotions.


1.1 Problems

Human emotion recognition plays an important role in robot interaction systems. Many methods can be used to recognize a person’s emotions, from voice, expressions, and gestures, or even electroencephalography (EEG). From a scientific viewpoint, analyzing human emotions and reactions helps us understand more about humans. Therefore, it has a great impact on daily life as well as on research in different fields.

Figure 1.1: Series of human basic emotions on still images.

Some recent studies have involved expression and perception in everyday applications such as driver emotion monitoring to ensure safety while driving [6], in the educational field such as cognitive state analysis [7], in studies of children with autism spectrum disorders [8][9][10], and in studies of people with schizophrenia [11]. Besides that, systems that automatically detect and recognize human facial expressions have immense potential in areas related to commerce and advertising. Customers will more or less show their expressions on their faces when observing and evaluating items or glancing at a billboard; their interest and their feelings at that moment can be read from their facial expressions [12]. Such systems also have many applications in crowd management and analysis for authorities or managers in crowded places such as supermarkets and airports.

However, the current development of facial recognition methods is divided into two branches with two different types of data: lab-controlled data and complex real data. Recent studies easily achieve an accuracy of more than 90% on controlled datasets such as CK+ [13], or more than 80% on the MMI dataset [14]. But for complex real datasets, the accuracy drops below 60%, as on FER2013 [15]. Therefore, this study focuses on experimentation, comparison, and evaluation on a complex real dataset.

Figure 1.2: Examples of three facial expression datasets.

1.2 Scope of the thesis and target

As mentioned above, emotional computing is a broad branch of research because human emotional characteristics are expressed in many aspects. Since there is no common convention, datasets and studies follow different conventions and use different experimental settings. This makes it very difficult to compare methods and research on this topic in a general and objective way. The thesis’s scope and objectives are therefore as follows:

In terms of approach, along with the strong development and great achievements of deep learning neural network architectures and related models in recent times [16][17][18][19][20], the study will focus on surveying, evaluating, experimenting and improving on deep learning models and modern methods.

In terms of data, public human facial expression databases differ in many ways, including the collection environment, the input format (a single image or a series of contiguous images), the distribution of expressions, the quality of the images, the labels of each training sample (Table 5.1), and the quantity. Within the scope of this thesis, the input data of the model is complex real data.

The dataset used for this experiment and the comparative evaluation are described in Chapter 5.


In terms of measurement, different studies typically differ greatly in preprocessing, network architecture, and the way training and testing are implemented on different datasets. All of these factors influence the results and performance. Hence, a thorough assessment of the impact of the network architecture and other factors is not possible based only on the reported results. This thesis therefore implements and retrains some modern deep learning network models to compare their results with some proposed models.

Regarding the goal of the topic, given the many challenges of complex real data, this thesis focuses only on improving accuracy on the dataset. Because the author’s background is in computer science, without much experience in biomedicine and psychology, this thesis accepts the hypothesis that there are six discrete expressions and treats the task as a classification problem.


2 RELATED STUDIES

2.1 Deep Learning and Linear Support Vector Machine

For classification problems solved with deep learning models, it is common to see the Softmax [21] function used in the last layer to perform the classification and to minimize the cross-entropy loss. However, in this study, Yichuan [22] used a Linear Support Vector Machine to replace the Softmax activation function, aiming to let the model minimize an error function based on the margin distance from the theory of Support Vector Machines (SVMs).

2.1.1 Brief introduction

Yichuan focuses on comparing deep learning models that use different functions in the output layer, mainly comparing the performance of the Softmax function against Support Vector Machine theory. This approach earned notable achievements, such as winning the Face-to-Face Recognition Contest organized by the International Conference on Machine Learning (ICML) in 2013. That contest was held on Kaggle with more than 120 participating teams. In addition, the author also performed comparison tests on classic datasets such as MNIST [23] and CIFAR-10 [24].

Linear Support Vector Machines were originally built for the binary classification problem with two classes. Given a training set and the corresponding labels $x_n, y_n$, $n = 1, \dots, N$, $x_n \in \mathbb{R}^D$, $y_n \in \{-1, 1\}$, learning an SVM is the following constrained optimization:

$$\min_{w,\xi} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \xi_n \tag{2.1}$$

$$\text{s.t.} \quad w^T x_n y_n \ge 1 - \xi_n, \; \forall n \tag{2.2}$$

$$\xi_n \ge 0, \; \forall n \tag{2.3}$$


Here $\xi_n$ is a slack variable, which exists to penalize data points that violate the constraints. Combining the two constraints with the objective function, we get an equivalent unconstrained objective:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n y_n,\ 0) \tag{2.4}$$

The problem is now to choose a weight vector $w$ that minimizes the objective in Equation 2.4, which is the primal form of the L1-SVM problem with the hinge loss as its error function. Since the hinge loss is not differentiable everywhere, another more common variant called the L2-SVM is formed as follows:

$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{n=1}^{N} \max(1 - w^T x_n y_n,\ 0)^2 \tag{2.5}$$

The L2-SVM objective is differentiable. Also, since the L2-SVM penalty is quadratic, it punishes data points that violate the margin more heavily than the L1-SVM does.

The study also mentioned the Kernel SVM method. Although Kernel SVMs work quite well, computing the kernel matrix can take a lot of time and memory, and extending them to the multi-class classification problem is inefficient compared to the multi-class SVM method. Kernel SVMs also have problems with huge amounts of data. Therefore, in Yichuan’s study, only linear SVMs with normal deep learning models were used.

The model tested by Yichuan is noted to have the same basic architecture as proposed in the studies of Zhong & Ghosh [25] and Nagi et al. [26]. Yichuan also replaced the plain hinge loss with the L2-SVM loss for the reason that it punishes errors more heavily. The weights of the lower layers of the model are computed through backpropagation by differentiating the output of the Support Vector Machine layer. To do this, the Support Vector Machine’s objective needs to be differentiable with respect to the activations of the previous layer. Writing the objective of Equation 2.4 as $\mathcal{L}(w)$ and replacing the input $x$ with the activation value $h$ of the preceding layer, we have the derivative:

$$\frac{\partial \mathcal{L}(w)}{\partial h_n} = -C\, y_n\, w\, \mathbb{1}\{w^T h_n y_n < 1\} \tag{2.6}$$

where $\mathbb{1}\{\cdot\}$ is an indicator function, which has a value of 1 if its condition is met and 0 otherwise. Meanwhile, for the L2-SVM, we have:

$$\frac{\partial \mathcal{L}(w)}{\partial h_n} = -2C\, y_n\, w \max(1 - w^T h_n y_n,\ 0) \tag{2.7}$$

After this, the implementation of the backpropagation process is similar to that of models using the softmax function. Yichuan found that the L2-SVM performs better than the L1-SVM in most cases and used the L2-SVM in the experimental parts.
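To make the loss and its gradient concrete, the following is a minimal numpy sketch of the L2-SVM objective (Equation 2.5) and its gradient with respect to the previous layer’s activations (Equation 2.7) for a single output unit. The regularization constant C and the toy data are illustrative assumptions, not values from the study.

```python
# Sketch of the L2-SVM loss (Eq. 2.5) and its gradient w.r.t. activations
# (Eq. 2.7) for one output unit; C and the toy data are illustrative.
import numpy as np

def l2_svm_loss(w, h, y, C=1.0):
    # 0.5 * w^T w + C * sum max(1 - y_n * w^T h_n, 0)^2
    margins = np.maximum(1 - y * (h @ w), 0)
    return 0.5 * w @ w + C * np.sum(margins ** 2)

def l2_svm_grad_h(w, h, y, C=1.0):
    # Eq. 2.7: -2C * y_n * w * max(1 - y_n * w^T h_n, 0), one row per sample.
    margins = np.maximum(1 - y * (h @ w), 0)
    return -2 * C * (y * margins)[:, None] * w[None, :]

h = np.array([[0.5, 1.0], [1.5, -0.5]])  # activations of the previous layer
y = np.array([1.0, -1.0])                # binary labels in {-1, +1}
w = np.array([0.2, -0.1])
print(l2_svm_loss(w, h, y))
print(l2_svm_grad_h(w, h, y))
```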

2.1.2 Results

2.1.2.1 Winning the Face-to-Face Recognition Contest in 2013

This is a face recognition contest held in 2013 at an ICML workshop, hosted by LISA of the University of Montreal. The training data of the contest includes 28,709 images at 48×48 resolution covering 7 different expressive shades of human emotion. The validation and test sets each contain 3,589 images, and this is a multi-class classification problem.

Yichuan submitted results with his model, achieving 69.1% accuracy on the public test set and 71.2% accuracy on the private test set, which won the competition. Due to noise, mislabeling, and other factors, human accuracy measured on this dataset ranges between 60% and 65%.

The model consists of a simple convolutional neural network with a final one-vs-all linear SVM layer. The stochastic gradient descent algorithm with momentum was used during training.

2.1.2.2 Comparison experiment between Softmax and Linear SVM

Yichuan compared the capabilities of these two output functions on the same deep learning model. Both models were tested with 8-fold cross-validation, with a reflection layer.


Figure 2.1: Graph of the test results of two models on averages on 8-fold validation. [22]

Figure 2.1 shows that, when comparing Softmax and L2-SVM as the output function, the error of the DLSVM model decreases more in the second half of the training process. The difference is not large, but it is consistently better.

Dataset          Softmax   DLSVM (L2-SVM)
Training set     67.5%     68.9%
Validation set   69.3%     69.4%
Test set         70.1%     71.2%

Table 2.1: Comparison of the two models based on accuracy. The test set decides the winner.

2.1.3 Advantages, disadvantages, and challenges

• Advantages

o This study by Yichuan is a fundamental contribution to the training of deep learning models.

o The biggest reason this research matters is that it won the facial expression recognition competition, and that dataset is still widely used today.


• Disadvantages, and challenges

o It lacks an analysis of the features of human facial expressions and does not mention the underlying theory.

o The research mostly focused on comparing different kinds of parameters without focusing on the expression recognition problem itself.

o The presentation of the model used in the research was very primitive and lacked specific details about the parameter settings.

2.2 Facial expression recognition using a multi-level convolutional neural network

2.2.1 Brief introduction

Hai-Duong Nguyen et al. [27] performed a study using multi-level neural networks and experimented with the FER2013 dataset. The research results are evaluated on the dataset with significant improvements in accuracy compared to previous architectures.

The plain network architecture proposed by Hai-Duong Nguyen is built from convolutional layers that are very common in image classification problems, inspired by VGG networks [28], with 18 layers organized in 5 blocks (Figure 2.2). Each block contains convolution layers and subsequent layers (pooling layers and fully connected layers). The model is fed a grayscale image at 48×48 resolution, and at the end of the network a softmax output layer classifies the seven expressions.

Figure 2.2: Multi-level architecture facial expression recognition [27].

Hai-Duong Nguyen et al. used the Gradient-weighted Class Activation Mapping (Grad-CAM) [29] technique to visualize where the information useful for classification is contained. This method also explains how a convolutional layer places attention on particular positions of the object in specific layers throughout the network. Figure 2.3 shows a Grad-CAM result generated by the 24th filter in the 2nd block of the network. At this level, the network focuses only on the eyes and nose, the parts that play an important role in the emotion recognition problem. In contrast, in the 1st block, the network pays attention only to the background and other areas of useless information. Besides, Figure 2.4 shows a Grad-CAM result generated from the 3rd block of the network. Although those filters create higher-level features, it can be seen that not all of them contribute to the recognition process; only a few filters in the middle levels of the network contribute to recognition and categorization. From this observation, the authors proposed the multi-level convolutional neural network for this dataset.

Figure 2.3: Example outputs of some Grad-CAM filters (the 2nd, 8th, and 22nd filters) from the multi-level architecture [27].


Figure 2.4: An example of a grayscale input image and the Grad-CAM images after the 1st and 3rd blocks of the multi-level architecture [27].

2.2.2 Result

The model proposed by the authors is evaluated on the FER2013 dataset; details about this dataset are presented in the dataset introduction (Section 5.1). The study experiments with a conventional model, three MLCNNs, and three ensemble models based on the three MLCNNs. The results of the study by Hai-Duong Nguyen et al. are shown in the following table:


Model                Architecture               Accuracy (%)
Conventional model   18 layers                  69.21
MLCNN 1              Connection type 1          73.03
MLCNN 2              Connection type 2          72.81
MLCNN 3              Connection type 3          72.36
Ensemble model 1     3 MLCNNs + fc7             73.17
Ensemble model 2     3 MLCNNs + 2xfc512 + fc7   74.09
Ensemble model 3     3 MLCNNs + 3xfc512 + fc7   73.73

Table 2.2: Performance of Hai-Duong Nguyen et al. on the FER2013 test set [27].

2.2.3 Advantages, disadvantages, and challenges

• Advantages

o The authors took advantage of the existing VGG network architecture as inspiration for their network model, aiming to improve the accuracy of the recognition model, and used the Grad-CAM method for visualization, which is useful for observing the training process of the network.

• Disadvantages

o The downside of this study is that the authors compare a basic network with multi-level neural networks but do not re-implement modern models under the same configuration, which would allow a more objective comparison of the strength of multi-level neural networks and modern neural networks.


3 BASIC THEORY

3.1 Human emotions through facial expressions.

3.1.1 Emotions on human faces.

The human face is a complex entity that plays a major role in this study. The face not only reflects emotions but also shows mental factors, social interactions, and physiological signals [10]. This study only mentions the most important content the face shows: human emotions. The structure and combination of facial muscles make the face the most expressive part of a human; it is impossible to tell someone’s state without looking at their face. Professor Albert Mehrabian of the University of California, Los Angeles concluded that there are three basic elements in communication: the words used, the communication style and tone of voice, and non-verbal behaviors.

Of these, non-verbal behaviors account for 55% of the effectiveness of a conversation [30]. Non-verbal behaviors include gestures, facial expressions, posture, and various body movements. Among these components, facial expression is an essential contributor to the effectiveness of the interaction. Our face is a complex component with major differences from the rest of our body. Concretely, it is one of the most complex signal systems we have, including over 40 structurally and functionally autonomous muscles, each of which can be activated independently of the others [30]. Facial muscles serve emotional expression and allow us to share and exchange social information with others through verbal and non-verbal communication. All the muscles in our body are controlled by nerves, which run to the brain and spinal cord. The nerves have a two-way connection: they can activate muscles based on brain signals, and they transmit information about the current muscle state back to the brain. The facial nerves emerge from deep inside the brain, in the skull below the ear, and branch out to all the facial muscles like a tree. Moreover, the facial nerves are also connected to the younger motor regions in our neocortex, areas primarily responsible for the facial movements needed to speak [30]. The brain stem and cortex are extremely sensitive to whether a facial expression is intentional or unintentional.

While the brain stem controls unintentional and unconscious expressions that occur naturally, the cortex handles consciously controlled and intentional facial expressions. That is the reason a fake smile does not appear natural, while a genuine smile arises easily. In other words, making a fake smile does not even feel the way it should, because it does not activate the same nerves as a real smile.

The facial nerves connect most of the facial muscles to the brain, and the same regions in the brain stem control emotional processing and the expression of emotions. Magnetic Resonance Imaging (MRI) studies have identified a specific pair of areas that are very sensitive to potential visual and auditory threats: the left and right amygdala. The amygdala is associated with the processing of terrifying events that directly threaten us, as well as pleasurable stimuli. Besides dealing with fear and pleasure, the amygdala has also been found to be responsible for the autonomic functions involved in emotional stimulation. It controls the release of cortisol and other stress hormones into the bloodstream, which affects heart rate and observable behaviors such as changes in posture and facial expressions [30].

According to the universal hypothesis, the perception and expression of emotion on the face are identical regardless of a person’s background and culture. In Charles Darwin’s first work on facial expressions and their importance, in 1872, he declared that facial expressions are innate, do not need to be learned, and have evolutionary significance for survival [1]. Paul Ekman studied an isolated tribe in New Guinea and, observing their facial expressions, realized that those expressions were similar to those in the civilized world around them. When he showed them photographs of emotional expressions of people from the civilized world, the people of this tribe were also able to recognize those emotions and react [31]. Ekman then studied further and concluded that there is a set of emotions that are always expressed with the same facial expressions, regardless of gender, age, culture, or social history [32].

3.1.2 The facial action coding system

The Facial Action Coding System (FACS) developed by Ekman [33] is a system that fully describes human facial behaviors using a set of Action Units (AUs). The system works by detecting one or more facial movements and then referring to a table of contents to indicate what kind of emotion is being expressed. There are 46 different action units described in [33]. From these, Ekman describes facial expressions as combinations of those units.


Emotion    Action Units                  Description
Happy      6 + 12                        Cheek raiser, lip corner puller.
Sad        1 + 4 + 15                    Inner brow raiser, brow lowerer, lip corner depressor.
Surprise   1 + 2 + 5 + 26                Inner brow raiser, outer brow raiser, upper lid raiser, jaw drop.
Fear       1 + 2 + 4 + 5 + 7 + 20 + 26   Inner brow raiser, outer brow raiser, brow lowerer, upper lid raiser, lid tightener, lip stretcher, jaw drop.
Angry      4 + 5 + 7 + 23                Brow lowerer, upper lid raiser, lid tightener, lip tightener.
Disgust    9 + 15 + 16                   Nose wrinkler, lip corner depressor, lower lip depressor.

Table 3.1: Combination table of AUs by emotion [33].

This part explains where emotions come from and which nerves control them, giving a more holistic view of human facial expressions. The human face is one of the most important parts for analyzing human expressions.

3.2 Convolutional Neural Network in image recognition.

Convolutional Neural Networks (CNNs) are similar to ordinary neural networks: they are made of various parameters and hyperparameters that can be updated by learning over time. Each neuron receives several inputs, performs a dot product, and optionally follows it with a non-linearity. The entire CNN network still represents a single differentiable function: from raw image pixels to the desired output. CNNs have a loss function, and they can also apply a Support Vector Machine (SVM) or Softmax in the last layer, the Fully Connected layer (FC layer). A CNN architecture assumes its input is an image, allowing certain properties of the input to be encoded into the architecture. This makes the model more efficient to implement and at the same time greatly reduces the huge number of redundant parameters in the network. [34]

3.2.1 Convolutional Neural Network introduction

Unlike a normal neural network, the layers of a Convolutional Network (ConvNet) have neurons arranged in three dimensions: width, height, and depth. The depth refers to the third dimension of the feature block. For example, an input image from the CIFAR-10 dataset is an input block of size 32×32×3 (width, height, and depth respectively). In a CNN, each neuron in the next layer is connected only to a small region of neurons in the previous layer instead of to all the neurons, as in a fully connected layer. Moreover, the final output layer is 1×1×10, because by the end of the architecture the output should be a vector of class scores along the depth dimension. A ConvNet is shown in Figure 3.1 below:

Figure 3.1: An example of a ConvNet created by basic layers. It converts a 3-dimensional input block (36×36×3) into a 3-dimensional output block (1×1×2) – it only has 2 classes.

3.2.2 Basic layers in Convolutional Neural Network

A ConvNet is a series of layers, and each layer converts one feature block to another through a differentiable function [34]. Three main types of layers make up a ConvNet architecture:

Convolutional Layer, Pooling Layer, and a Fully-Connected Layer [34]. By arranging these layers, a ConvNet architecture is created.

A complete simple ConvNet architecture is arranged as ordered: Input layer – CONV – RELU – POOL – FC [34]. More detail of each type of layer as follows:

• INPUT layer [32×32×3] will hold the raw pixel values of the image. In this case an input image with width 32, height 32, and with three color channels R, G, B. [34]

• The CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32×32×12] if we decided to use 12 filters. [34]

• RELU layer will apply an element-wise activation function, such as the max (0, x) thresholding at zero. This leaves the size of the volume unchanged ([32×32×12]).

• POOL layer will perform a downsampling operation along the spatial dimensions (width, height), which results in volume such as [16×16×12]. [34]

• The FC layer will compute the class scores, resulting in a volume of size [1×1×10], where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.

[34]

In this way, ConvNets transform the input image from the original pixel values into a series of 10 class scores in the final layer. Concretely, the CONV and FC layers perform transformations that involve not only the input feature block but also parameters (the weights and biases of the neurons). On the other hand, the RELU and POOL layers perform a fixed function, such as downsampling the output of the previous layer. The parameters in the CONV and FC layers are trained with the gradient descent algorithm to reduce, over time, the error calculated by the loss function, serving the ultimate goal of increasing the accuracy of the predicted outputs compared with the actual labels.

Figure 3.2: An example of a bigger ConvNet called VGG stacked by many basic layers. [34]

In short, a ConvNet is a stack of layers that converts an input image into the desired output. The network contains the basic layers CONV, RELU, POOL, and FC. Each of these layers takes a three-dimensional input and returns a three-dimensional output of the same or a different size, depending on the configuration.

3.2.3 Convolutional layer (CONV)

The convolutional layer is extremely important and does almost all of the computation in this architecture. It is so central that, although the architecture contains many other essential layers, the whole architecture is named after it: the Convolutional Neural Network.

The CONV layer consists of a set of learnable filters. Every filter is spatially small in width and height but extends through the full depth of the input block. For example, a typical filter on the first layer of a ConvNet could be 5×5×3 (5 pixels width, 5 pixels height, and 3 color channels). During processing, each filter is convolved across the width and height of the input block, computing the dot product between the filter and the input block at every position. As the filter slides across the width and height of the input volume, an output called a feature map is created.


For example, assume that the input block of an RGB image has a size of 32×32×3 and the filter size is 5×5. Then each neuron in the CONV layer has weights for one 5×5×3 region of the input block, making a total of 76 weights per region (5 × 5 × 3 + 1 = 76, where the extra parameter is the bias). In another example, assume that the input block has a size of 100×100×50 and the filter size is 3×3. Then each neuron in the CONV layer has weights for one 3×3×50 region, for a total of 451 weights (3 × 3 × 50 + 1 = 451, where the extra parameter is the bias).

3.2.3.1 Stride, depth, and zero-padding

Stride, depth, and zero-padding are three hyperparameters that are used to control the size of the output block.

• Stride is a hyperparameter that indicates the sliding step of the filter. For example, when the stride is 2, the filter moves 2 pixels at a time.

• Depth is a hyperparameter that corresponds to the number of filters which can be used.

• Zero-padding is also a hyperparameter that pads the input block with zeros around the border. It’s used to control the spatial size of the output volumes.

Accordingly, the spatial size of the output block can be calculated from the input block size (W), the filter size of the CONV layer (F), the stride (S), and the zero-padding (P) with the formula (W − F + 2P)/S + 1. For example, for an input of size 9×9 and a 3×3 filter with stride 1 and zero-padding 0, we get a 7×7 output. With stride 2 we get a 4×4 output.
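As a quick sanity check, here is a tiny hypothetical helper (not part of the thesis code) that applies this formula and rejects invalid hyperparameter combinations:

```python
# Output size of a convolution: (W - F + 2P) / S + 1.
def conv_output_size(W, F, S=1, P=0):
    if (W - F + 2 * P) % S != 0:
        # The filter does not tile the input evenly with this stride.
        raise ValueError("invalid hyperparameter configuration")
    return (W - F + 2 * P) // S + 1

print(conv_output_size(9, 3, S=1, P=0))  # 7
print(conv_output_size(9, 3, S=2, P=0))  # 4
```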

Figure 3.3: Example of spatial arrangement. Left: The filter slides across the input with W=5, F=3, S=1, and P=1, giving output size (5−3+2)/1+1=5. Right: The filter slides across with S=2, giving output size (5−3+2)/2+1=3. Notice that stride S=3 could not be used, since (5−3+2)=4 is not divisible by 3. [34]


Use of zero-padding: in the left example of Figure 3.3, note that the input size is 5 and the output size is also 5. This is achieved because the filter size is 3 and zero-padding P=1 pads one zero at each end of the input. If zero-padding were not used, the output block would only have a spatial size of 3, resulting in lost information. In general, setting the zero-padding to P = (F − 1)/2 when the stride is S = 1 ensures that the input block and the output block have the same spatial size. Using zero-padding in this way is very common in many ConvNet architectures. [34]

The constraint of stride: the hyperparameters that affect the spatial size are bound to each other. For example, with input size W=10, P=0, and F=3, stride S=2 cannot be used, because (10−3+0)/2+1=4.5 is not an integer, meaning the filter does not fit neatly across the input. Hence, this hyperparameter configuration is invalid, and a ConvNet library could throw an exception. [34]

3.2.4 Pooling layer (POOL)

In ConvNet architectures, it is common to see a pooling layer inserted between convolutional layers, a pattern originating from the LeNet network [16]. The function of this layer is to reduce the spatial size of the feature block in order to reduce the number of parameters and the computation in the network. The pooling layer operates independently on each slice of the feature block and resizes it spatially, using a function (such as max or average) to compute each spatial position in the output feature block. The most common form is a 2×2 pooling filter with a stride of S=2 on each slice of the input block. After this computation, the pooling layer discards around 75% of the activations in the input. Each max-pooling computation takes a 2×2 patch of numbers and returns the largest one while maintaining the depth of the feature block. More concretely, the max-pooling layer works as follows:


Figure 3.4: The max-pooling layer downsamples the volume spatially. Left: the input volume of size 224×224×64 is pooled with a filter size of 2 and stride 2 into an output volume of size 112×112×64; notice that the depth of the volume is preserved. Right: the most common downsampling operation is max, giving rise to max-pooling, here shown with a stride of 2. Each max-pooling function is taken over 4 numbers (a little 2×2 square). [34]
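For illustration, the following is a small numpy sketch (an assumption for clarity, not thesis code) of 2×2 max pooling with stride 2 on a single activation slice:

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
# Split the 4x4 slice into non-overlapping 2x2 tiles and take each tile's max.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6. 8.]
               #  [3. 4.]]
```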


4 RESIDUAL CONVOLUTIONAL NEURAL NETWORK

4.1 Introduction

A facial expression recognition system receives an input image I and performs the following steps:

• Detect F faces in an input image I.

• Classify the expression E for each F face in input image I.

• Perform data analysis based on E prediction results.

The scope of this thesis is developing and contributing to the recognition module, which sits in the middle of the system and plays its core role (Figure 4.1). After that, the results are combined or compared with other available facial recognition methods to prepare for further analysis.

In short, the core process of developing an emotion recognition module is to process the input image I, detect the faces F, and return an emotion E as one of the seven basic labels for each face F, with a float value between zero and one for each label. Note that the recognition module cannot produce a result if the input image I does not contain any faces F.


Figure 4.1: Emotional recognition system.

4.2 Expressive inference from facial muscles

Scientific research in the field of emotion detection has discovered that emotions are formed from facial muscle configurations. Human facial muscle configurations can reveal more information than just an emotion, but certain configurations belong to a certain emotional class according to the FACS table. For example, people often smile when happy, frown when angry, and grimace when sad.

4.2.1 Analysis of typical examples based on FACS table

An angry expression is a strong emotional response, and it is a dangerous emotion because it can trigger violence. Feelings of anger have many different causes. When people are frustrated, they may feel angry at the obstacles on their path to success. They can also get angry when someone wants to hurt them. In addition, physical violence and threatening words are also causes of anger. Anger has a significant effect on the entire body: it raises the blood pressure, makes the body heat up inside, and tenses almost all of the muscles in the body. There are several signals for recognizing an angry person:


• Eyebrows are lowered and bunched together, with vertical wrinkles between them.

• Eyelids are straight and strained.

• Eyes are tight, with narrowed pupils, and focused on the source of anger.

• Lips are tightly closed or slightly open, prepared to yell.

Figure 4.2: Angry emotions in FER2013.

A disgust expression is a strong emotion that expresses disagreement, aversion, and disapproval. Disgust is closely related to sensory perception, especially taste and sight; it appears, for instance, when people are fed lemons or spoiled food to induce this emotion. This emotion is mainly expressed through the mouth and nose area, especially the wrinkled nasal area. The facial configurations are described as follows:

• Upper lip lifted.

• There are wrinkles on the nose.

• Cheek also lifted.

• Eyelids are lifted but not tight. There are wrinkles under the eyes.

• The eyebrows are pulled down.


Figure 4.3: Disgust emotions in FER2013.

An expression of fear occurs in stressful or dangerous situations. A person feels scared of a bad event that might happen in the future or sees a violent situation. When a person feels fear, the body automatically prepares to escape from or resist the possible threats. At that time, heart rate and blood pressure increase, the eyes open, and the pupils widen so the eyes can absorb the maximum amount of light. When the fear becomes extremely strong, it can affect some muscle groups and cause temporary paralysis. The facial markings of an expression of fear are as follows:

• Eyebrows are lifted and pulled inward.

• There are wrinkles on the forehead.

• The upper eyelid is lifted.

• The mouth is open and lips stretch according to the intensity of fear emotion.


Figure 4.4: Fear emotions in FER2013.

A happy expression is a positive emotional state of a human being. It often occurs with a smile on the face. Happiness is expressed when people feel satisfied or have achieved something.

The typical features of a happy facial expression:

• The lip corners are wide and raised.

• The mouth can be opened and teeth can be seen.

• Cheeks lifted.

• Wrinkles under the lower eyelid may appear.

• Wrinkles appear out of the corners of the eyes.

Figure 4.5: Happy emotions in FER2013.


The sad expression is related to pain or caused by loss, hopelessness, grief, helplessness, or disappointment. Sadness is an emotional state that occurs when people feel pain, and its source is often the loss of something. With this emotion, the facial muscles lose tension and may show the following typical features:

• The inner part of the eyebrows is pulled down.

• The corners of the lips pulled down, and lips trembled.

• Staring and unfocused eyes.

Figure 4.6: Sad emotions in FER2013.

An expression of surprise appears suddenly: it comes without thinking and lasts only for a short time. Surprise has both positive and negative elements and cannot be anticipated; the sudden emotion often leads on to a happy or sad emotion. The typical features of the surprise feeling are the raised eyebrows and widened eyes:

• Eyebrows are lifted and pulled inward.

• Horizontal wrinkles appear on the forehead.

• Eyes wide opened.

• The mouth can open wide with the lower jaw pulled down.


Figure 4.7: Surprise emotions in FER2013.

Neutral is the state in which a person does not show any emotion. The feelings may still be there inside, but they are controlled without engaging the facial muscles. Facial muscles in the neutral state are relaxed and not significantly changed.

Figure 4.8: Neutral emotions in FER2013.

In recent studies, it is easy to see that emotions are expressed through the facial muscles in only a few areas, especially the mouth and eye areas. The mouth and eye area information is extracted as facial landmarks. These landmarks were researched and published by Vahid Kazemi and Josephine Sullivan [42]. Based on these landmarks, researchers can extract the coordinates of the mouth and eye areas and use machine learning or deep learning methods to observe the change in muscle configuration. Furthermore, the facial landmark method was extended to multi-pose 2D and 3D landmarks in the Menpo benchmark research [43]. It is therefore considered a basis for facial expression recognition.


Figure 4.9: 2D Menpo facial landmark configuration. Left: Frontal 2D landmarks (68 points), Middle: Left 2D landmarks (39 points), Right: Right 2D landmarks (39 points). [43]

Figure 4.10: 3D Menpo facial landmark configuration. Left: Frontal 3D landmarks, Middle: Left 3D landmarks, Right: Right 3D landmarks. [43]

4.3 Residual CNN architecture

The Residual Network (ResNet) was introduced in 2015 and won 1st place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57%. It also took first place in the ILSVRC and COCO 2015 competitions for ImageNet localization, ImageNet detection, COCO segmentation, and COCO detection. A ResNet is a CNN designed to work with hundreds or thousands of convolutional layers.


A problem that occurs when building a normal CNN with many convolutional layers is the vanishing gradient phenomenon, which leads to poor learning: if the number of epochs is too small, the network cannot give a good result, and otherwise the training time becomes very long. In practice, gradients usually get smaller and smaller in the deeper layers. As a result, gradient descent barely updates the weights of the deeper layers, and the network cannot converge and cannot produce a good result. ResNet was created to solve this vanishing gradient problem.

Figure 4.11: Vanishing gradient problem in very deep normal CNN network.

ResNet includes all the normal types of layers: CONV, POOL, RELU, and FC. Each residual block learns a residual mapping F(x) and adds it to the block input x through a shortcut connection, so the block outputs F(x) + x. The figures below show one residual block and the full ResNet used in the thesis.

Figure 4.12: One residual block: two 3×3 convolutions with a ReLU between them, whose output is added to the block input via the shortcut connection and passed through a final ReLU.

Figure 4.13: Full ResNet architecture in the thesis.

5 EXPERIMENTS

5.1 FER dataset introduction

With the growth of emotional computing in recent decades and the rapid growth in hardware power, datasets for facial expression classification have also grown strongly. A facial expression recognition database is a collection of images or video clips containing human faces labeled with the ground truth of each emotional state. Datasets are labeled for training and testing new algorithms for developing facial expression recognition systems. Most datasets are based on the basic emotion theory of Paul Ekman and Armindo Freitas-Magalhaes, assuming the existence of six discrete fundamental human emotions (anger, fear, disgust, surprise, happiness, sadness).

Dataset name      Number of examples         Emotions
JAFFE [35]        213 images                 6 basic emotions + neutral
CK+ [13]          593 image sequences        6 basic emotions + neutral
MMI [14]          740 images, 2,900 videos   6 basic emotions + neutral
FER2013 [15]      35,887 images              6 basic emotions + neutral
AFEW 7.0 [36]     1,809 videos               6 basic emotions + neutral
SFEW 2.0 [37]     1,766 images               6 basic emotions + neutral
Oulu-Casia [38]   2,880 image sequences      6 basic emotions
Emotio-Net [39]   1,000,000 images           23 emotions
AffectNet [40]    450,000 images             6 basic emotions + neutral

Table 5.1: Information about some famous datasets. [41]

5.1.1 FER2013 dataset exploration

The main dataset used to train, evaluate, and compare models is that of the Challenges in Representation Learning: Facial Expression Recognition Challenge held in 2013 on Kaggle [15].

Figure 5.1: FER2013 dataset emotions distribution.

The FER2013 dataset contains 35,887 grayscale images with a resolution of 48×48. Each image contains a grayscale face in one of seven classes (0 = Angry, 1 = Disgust, 2 = Fear, 3 = Happy, 4 = Sad, 5 = Surprise, 6 = Neutral). One of the biggest challenges of this dataset is the imbalanced data caused by the lack of Disgust examples. In addition, the dataset contains some non-face examples, wrongly labeled examples, and partial-face examples.
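As an illustration, the following sketch shows one common way to load the FER2013 CSV release, whose pixels column stores each 48×48 grayscale face as space-separated integers; the file name and variable names are assumptions, not the thesis’s exact code:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('fer2013.csv')          # columns: emotion, pixels, Usage
faces = np.stack([np.array(p.split(), dtype=np.uint8).reshape(48, 48)
                  for p in df['pixels']])
labels = df['emotion'].to_numpy()        # 0=Angry, 1=Disgust, ..., 6=Neutral
print(faces.shape)                       # (35887, 48, 48)
print(np.bincount(labels))               # class distribution (cf. Figure 5.1)
```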


Figure 5.2: Some examples in the dataset. There are many variants between images such as ages, face angles, expression strength, and some blank images.

Figure 5.3: Distribution of each class in three separated datasets: training set (blue), validation set (green), and test set (red). Imbalanced data in Disgust class.


5.2 Experiment environments

The scope of the thesis is for research and academic purposes. The experiments were performed on a personal computer, because the ResNet architecture built for this project is not a huge network. The general configuration of the computer used to train the model is as follows:

• Operating System: Windows 10

• Processor: Xeon CPU E3-1246 v3, 3.5GHz

• RAM: 16GB

• GPU: GTX 760

• Disk size: 500GB SSD

5.3 Evaluation metrics

Two common evaluation metrics for classification problems are the accuracy score and the confusion matrix.

The accuracy score is a simple measurement commonly used in most classification problems. This metric summarizes the predictions as a single number, so it is easy to evaluate the results. The accuracy is calculated as shown in the equation below:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6.1}$$

With:

• True Positive (TP): data belongs to class A and is predicted to belong to class A.

• True Negative (TN): data does not belong to class A and is predicted not to belong to class A.

• False Positive (FP): data does not belong to class A and is predicted to belong to class A.

• False Negative (FN): data belongs to class A and is predicted not to belong to class A.

The downside of the accuracy metric is that it does not comprehensively reflect classifier performance on unbalanced data. A better evaluation method is the confusion matrix.


In machine learning and statistics, the confusion matrix is a tabular measure that allows the visualization of model performance on each class. Each row of the matrix represents the ground-truth class of the data points, while each column represents the predicted class. The name confusion matrix comes from the fact that it makes it easy to see whether the system confuses two classes. The result of this method is a square matrix with each dimension equal to the number of data classes. The value in the i-th row and j-th column is the number of points that actually belong to class i but are predicted to belong to class j.
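For illustration, here is a small sketch computing the accuracy score and confusion matrix with scikit-learn; y_true and y_pred are hypothetical label arrays, not results from this thesis:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 3, 3, 4, 6, 3]   # ground-truth FER2013 class indices
y_pred = [0, 3, 4, 4, 6, 3]   # model predictions
print(accuracy_score(y_true, y_pred))   # 0.8333... (5 of 6 correct)
# scikit-learn's convention: rows are true classes, columns are predictions.
print(confusion_matrix(y_true, y_pred, labels=range(7)))
```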

5.4 Experiment framework

5.4.1 Python language

Python was created by Guido van Rossum and first released in 1991. Python is a high-level programming language for general-purpose programming. It is designed with the strengths of being easy to read, learn, and remember. Python is a very clean language with a clear structure that is convenient for beginners. Nowadays, it is widely used in the fields of data science, machine learning, and deep learning.

5.4.2 Spyder IDE

Spyder is a scientific Python development environment, a free integrated development environment (IDE) included in the Anaconda framework. It includes editing, interactive testing, debugging, and introspection features.

5.4.3 Project code structure


Figure 5.4: Project code structure.

The file trainers.py is where all the required functions and the train_dataset are invoked for the training process. All results of the training process are saved in the trained_models folder. The file main_demo.py applies the trained model that gives the best results to predict on the test_dataset.

The model definition files are implemented mainly with Keras modules and functions. They are all defined in Python files, which makes the models easy to call and to develop further. An example of the residual CNN model definition is shown below.

(Figure 5.4 shows the project layout: the directories models, utils, train_dataset, test_dataset, trained_models, and predictions, plus the files trainers.py and main_demo.py.)


Figure 5.5: Small Residual CNN definition code.
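Since Figure 5.5 reproduces the definition code as an image, the following is a minimal sketch of what such a small residual CNN could look like in Keras, reconstructed from the thesis description (four residual blocks, 3×3 convolutions, 64×64×1 grayscale input, 7 output classes). The layer widths are illustrative assumptions, not the thesis’s exact configuration.

```python
from tensorflow.keras import layers, models

def residual_block(x, filters):
    # Shortcut path: 1x1 convolution so the channel counts match for the add.
    shortcut = layers.Conv2D(filters, (1, 1), padding='same')(x)
    # Main path: two 3x3 convolutions with a ReLU between them (Figure 4.12).
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    # Element-wise addition of shortcut and main path, then a final ReLU.
    out = layers.add([shortcut, y])
    return layers.Activation('relu')(out)

def small_resnet(input_shape=(64, 64, 1), num_classes=7):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(inputs)
    for filters in (32, 64, 128, 256):       # four residual blocks (ResNet-4)
        x = residual_block(x, filters)
        x = layers.MaxPooling2D((2, 2))(x)   # downsample between blocks
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inputs, outputs)

model = small_resnet()
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```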

The training procedures are also implemented with Keras modules and data processing modules. Each trainer has its own set of training parameters and datasets. Each trainer also has some callback functions to manage the training process:

• The Model Checkpoint function saves the model weights when the training achieves a better loss and accuracy score compared to the previous epochs.

• The Early Stopping function stops the training process when the loss and accuracy scores stop changing across epochs.


• The Reduce Learning Rate function decays the learning rate across epochs.

Training parameters are also set separately in the training file. The structure of a training procedure is as follows:

Figure 5.6: Training process on Small Residual CNN code.
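Since Figure 5.6 reproduces the training code as an image, here is a minimal sketch of such a training procedure with the three callbacks described above, assuming the Keras API and the parameter values from Section 5.6; the file path and generator names are hypothetical.

```python
from tensorflow.keras.callbacks import (ModelCheckpoint, EarlyStopping,
                                        ReduceLROnPlateau)

callbacks = [
    # Save weights whenever validation accuracy improves on the previous best.
    ModelCheckpoint('trained_models/small_resnet.h5', monitor='val_accuracy',
                    save_best_only=True, verbose=1),
    # Stop when validation accuracy has not improved for 50 epochs (patience).
    EarlyStopping(monitor='val_accuracy', patience=50, verbose=1),
    # Reduce the learning rate by a factor of 0.1 after 12 stagnant epochs.
    ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=12,
                      verbose=1),
]

# train_generator and val_generator yield augmented batches (batch size 32).
model.fit(train_generator, validation_data=val_generator, epochs=100,
          callbacks=callbacks)
```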

Besides, the saved directory contains the training results such as error scores, accuracy scores, visualization images, and weight sets. The utils directory contains modules that define training-related utilities such as data augmentation, data processing files, preprocessing files, and related recognition visualization functions.

5.5 Data processing

For image data on which face detection has not yet been performed, faces are detected with the Haar-cascade frontal face detection method provided by the OpenCV library.
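A minimal sketch of this detection step with OpenCV’s bundled frontal-face cascade is shown below; the image path and the 64×64 crop size are assumptions for illustration.

```python
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

image = cv2.imread('input.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Returns one (x, y, w, h) rectangle per detected face.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face = cv2.resize(gray[y:y + h, x:x + w], (64, 64))  # crop and resize
```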


Before the training process runs, the raw image data are processed with the ImageDataGenerator class provided by the Keras library. The generator applies data augmentation methods to enrich the image dataset, as shown in the figure below:

Figure 5.7: Generator applied data augmentation methods to generate more image data while training.
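A sketch of such a data augmentation generator is shown below; the specific augmentation ranges are assumptions, not the thesis’s reported settings.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1. / 255,          # normalize pixel values to [0, 1]
    rotation_range=10,         # random rotations
    width_shift_range=0.1,     # random horizontal shifts
    height_shift_range=0.1,    # random vertical shifts
    zoom_range=0.1,            # random zoom
    horizontal_flip=True)      # random mirroring

# x_train, y_train are the preprocessed FER2013 arrays; augmented batches
# are generated on the fly during training.
train_generator = datagen.flow(x_train, y_train, batch_size=32)
```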

5.6 Training parameters configuration

The training parameter configuration of the training process is hard-coded for the basic setup. The basic configuration shared by all models is as follows:

• Input image size: 64 × 64 × 1.

• Image color channels : 3.

• Total classes: 7.

• Learning rate: 0.001.

• Total epochs: 100.

• Batch size: 32.

• Patience: 50.

• Verbose: 1.

• If the validation accuracy does not improve for 50 consecutive epochs, training is stopped.

• If the validation accuracy does not improve for 12 consecutive epochs, the learning rate is reduced by a factor of 0.1.

• The chosen model is the one with the best validation accuracy.

5.7 Evaluation results on FER2013 dataset

The figure below shows the model accuracy and model loss values while training our Residual CNN on FER2013.


Figure 5.8: Evaluation results after 100 epochs on our residual CNN.

5.7.1 Comparison with other very deep modern architectures

Due to hardware limitations, this thesis could not retrain the other big networks. The table below compares our network with other big networks using the results from their original scientific reports.


Networks name        Params (×10⁶)   Accuracy score (%)
VGG19 [44]                138              70.8
GoogleNet [18]           6.79             71.97
Inception_v3 [17]        24.1             72.72
Cbam_resnet50 [45]       28.5             73.39
Bam_resnet50 [46]        23.8             73.14
(Ours) Resnet_4           3.4              66.5

Table 5.2: Comparison with other very deep modern models.

5.7.2 Confusion matrix

The confusion matrix of our best Residual CNN trained on FER2013 is as follows:


Figure 5.9: Confusion matrix.

The confusion matrix shows that predictions are weakest for the Disgust class, since the dataset is unbalanced. The Disgust and Angry classes are also very similar in the FER2013 dataset.
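Such a matrix can be computed with scikit-learn; a minimal sketch, assuming the test images and one-hot labels are already loaded as x_test and y_test.

# Compute the confusion matrix with scikit-learn (illustrative sketch).
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(model.predict(x_test), axis=1)  # predicted class ids
y_true = np.argmax(y_test, axis=1)                 # one-hot ground truth

# FER2013 label order: 0=Angry ... 6=Neutral.
labels = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']
print(confusion_matrix(y_true, y_pred))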

5.8 Results in reality

The trained model was also tested on real emotion grid images found on the Internet and on a real-time camera to verify its effectiveness; the results are shown in Figures 5.10, 5.11, 5.12, and 5.13.
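A minimal sketch of the real-time camera loop follows; the model path and the label list are assumptions, and the rest combines the face detection and preprocessing steps from Section 5.5.

# Real-time webcam demo: detect faces, classify each crop (sketch;
# the model path is a placeholder).
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model('trained_models/small_resnet.hdf5')
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        face = cv2.resize(gray[y:y + h, x:x + w], (64, 64)) / 255.0
        pred = model.predict(face.reshape(1, 64, 64, 1))
        label = emotions[int(np.argmax(pred))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow('demo', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
        break
cap.release()
cv2.destroyAllWindows()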


Figure 5.10: The combination of two models, emotion recognition and gender recognition.


Figure 5.11: Another good recognition example; however, four heads are not in an upright position, so the model cannot detect them.


Figure 5.12: Recognition result on a scene from the movie Avengers: Endgame.

Figure 5.13: Recognition result on a scene from the movie 3 Idiots.


6 CONCLUSION

During the implementation of this thesis, I have researched, tested, and improved a deep learning Residual Network for the emotional recognition problem. I found that many networks give better results than the network in this thesis, but they are very large and require a large number of parameters to train, especially some ensemble networks [47]. Based on the numerical scores, the confusion matrix, and the visualized image results throughout the experiments, we can conclude that:

• Resnet_4 gives good results for the emotional recognition problem compared to other, larger modern networks.

• Resnet_4 predicts well on all seven classes, even though FER2013 is an unbalanced dataset, which is very common in practice.

• Resnet_4 also performs consistently well on the FER2013 and IMDB datasets.

• Resnet_4 requires a small number of parameters to train compared to other big networks.

However, due to hardware and time limitations, the experimental process also has shortcomings, such as not experimenting with adding more residual blocks to Resnet_4, which has only four. In addition, the other large modern networks could not be retrained for a more objective comparison. Since it is a classification network, Resnet_4 will probably also work well on other classification problems beyond emotion classification.


7 REFERENCES

[1] Charles D. and Phillip P., The expression of the emotions in man and animals, Oxford University Press, USA, 1998.

[2] Paul E. and Wallace V. F., Constants across cultures in the face and emotion, Journal of personality and social psychology, page 124, 1971.

[3] Rachael E. J. et al., Facial expressions of emotion are not culturally universal, Proceedings of the national academy of sciences, 2012.

[4] Jialin M. et al., Development of a facial-expression database of Chinese Han, Hui, and Tibetan people, International Journal of Psychology, 2019.

[5] Internet, China is testing emotion recognition systems in Xinjiang, https://www.techspot.com/news/82611-china-testing-emotion-recognition-systems- xinjiang.html, 2019.

[6] Mira J. and Byoung C. K., Driver’s facial expression recognition in real-time for safe driving, 2018.

[7] Ruyi X. et al., Towards emotion-sensitive learning cognitive state analysis of big data in education: deep learning-based expression analysis using ordinal information, 2019.

[8] Michela B. and Simona A. and Chiara F., Emotional decoding in facial expression, scripts and videos: A comparison between normal, autistic and Asperger children, 2012.

[9] Giorgio C. and Marco W. B. and Letizia A., The understanding of the emotional meaning of facial expressions in people with autism, Journal of autism and developmental disorders, 1999.

[10] Madeline B. H. and Alex M. and Gregory L. W., Facial emotion recognition in autism spectrum disorders: a review of behavioral and neuroimaging studies, Neuropsychology review, 2010.

[11] Christian G. K. et al., Facial emotion recognition in schizophrenia: intensity effects and error pattern, American journal of psychiatry, 2003.

[12] Gozde Y. et al., Deep learning-based face analysis system for monitoring customer interest, Journal of ambient intelligence and humanized computing, 2019.

[13] Patrick L. et al., The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, IEEE computer society conference on computer vision and pattern recognition-workshops, 2010.


[14] Maja P. et al., Web-based database for facial expression analysis, IEEE International conference on multimedia and expo, 2005.

[15] Ian J. Goodfellow et al., Challenge in representation learning: a report of three machine learning contests, International conference on neural information processing springer, 2013.

[16] Yann L. et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE, 1998.

[17] Kaiming H. et al., Deep residual learning for image recognition, IEEE conference on computer vision and pattern recognition (CVPR), 2016.

[18] Christian S. et al., Going deeper with convolutions, IEEE conference on computer vision and pattern recognition (CVPR), 2015.

[19] Karen S. and Andrew Z., Very deep convolutional networks for large-scale image recognition, 2015.

[20] Alex K. and Ilya S. and Geoffrey E. H., ImageNet classification with deep convolutional neural networks, 2012.

[21] John S. Bridle, Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition, 1990.

[22] Yichuan T., Deep learning using linear Support Vector Machine, 2013.

[23] Yann L. et al., Gradient-based learning applied to document recognition, 1998.

[24] Alex K., Learning multiple layers of features from tiny images, 2009.

[25] Zhong S. and Ghosh J., Decision boundary focused neural network classifier, 2000.

[26] Nagi J. et al., Convolutional neural support vector machines: hybrid visual pattern classifiers for multi-robot systems, 11th International Conference on Machine Learning and Applications, 2012.

[27] Hai D. N. et al., Facial expression recognition using a multi-level convolutional neural network, Proceedings from International Conference on Pattern Recognition and Artificial Intelligence, 2018.

[28] Karen S. and Andrew Z., Very deep convolutional networks for large-scale image recognition, 2015.

[29] Omkar M. P and Andrea V. and Andrew Z. et al., Deep face recognition, 2015.

[30] Facial expression analysis: The complete pocket guide, https://imotions.com/blog/facial-expression-analysis/, 2019.
