
DEEP FACE DETECTION FOR INTERACTIVE AVATAR

Bachelor's thesis
Faculty of Information Technology and Communication Sciences
Examiner: Prof. Joni Kämäräinen
May 2021

ABSTRACT

Viljami Romppanen: Deep face detection for interactive avatar
Bachelor's thesis
Tampere University
Information Technology
May 2021

Face detection is one of the most studied problems in the field of computer vision. It is usually the first step for other face-related technologies such as face recognition and verification. In this thesis, a face detection system is implemented for an interactive avatar that aims to interact naturally with humans. The implemented system consists of convolutional neural networks (CNN), a machine learning technique often used to solve image classification tasks. In general, these methods have enabled face detection systems to detect faces in real-world scenes where distances and imaging conditions may vary greatly. The networks used in this thesis are the Single Shot MultiBox Detector (SSD) and MobileNet.

The structure of this thesis can be divided into two broad parts: theory and implementation. The theory part provides the basic background and explanations related to machine learning, neural networks and object detection networks. Its goal is to introduce all of the main elements from which the object detection networks are finally constructed. The implementation part describes how the networks are assembled into a functional system that can detect faces and provide coordinates for those detections. The evaluation is also related to the implementation, since it aims to measure the performance of the implemented system.

The purpose of this thesis is to use pre-trained networks and models, which are then adapted to meet the requirements of this implementation. A survey of different kinds of face detection systems shows that there are multiple approaches and systems that are able to detect faces in images. The networks and structures described above were chosen for their accuracy, efficiency and modern design. The results and performance of the proposed implementation satisfied the requirements and goals of the system well.

Keywords: machine learning, neural network, face detection

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Viljami Romppanen: Deep face detection for interactive avatar
Bachelor's thesis

Tampere University
Information Technology
May 2021

Face detection is one of the most studied problems in computer vision applications. It is usually also one of the first steps in other face-related applications such as face identification and verification. In this work, a face detection system is built for an interactive avatar whose goal is to communicate naturally with humans. The implemented face detection system uses convolutional neural networks (CNN) and machine learning, which are often applied to image classification. In general, these have enabled the development of face detection for real-world situations where distances and imaging conditions may vary greatly. The neural networks used in this work are the Single Shot MultiBox Detector (SSD) and MobileNet.

The work can be divided into two large parts: theory and implementation. The theory part reviews the theoretical foundations of machine learning, neural networks and object detection. The main goal of the theory part is to introduce the main elements that are finally utilized in the object detection networks. The implementation part describes how the neural networks are used in a functional system that can detect faces and produce coordinates for them. The evaluation is also part of the implementation, since its goal is to produce measurements of the implementation that can be compared with other implementations.

The purpose of the work is to utilize pre-trained neural networks and models, which are adapted to meet the requirements of this implementation. A study of many different face detection systems shows that several different systems and approaches to face detection are freely available. The neural networks described above, which are used in this work, were chosen for their accuracy, efficiency and up-to-date approach. The results and performance of the methods used fulfil the goals and requirements set for the work well.

Keywords: machine learning, neural network, face detection

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


CONTENTS

1. Introduction
2. Related work
3. Theory
   3.1 Machine learning
   3.2 Neural networks
       3.2.1 Activation functions
       3.2.2 Cost functions and optimization
       3.2.3 Convolutional neural network
   3.3 Neural networks for object detection
       3.3.1 MobileNet
       3.3.2 Single Shot MultiBox Detector
4. Implementation
5. Evaluation
   5.1 Metrics
   5.2 Results
6. Conclusions
References


LIST OF SYMBOLS AND ABBREVIATIONS

AP      Average Precision
CNN     Convolutional Neural Network
COCO    Common Objects in Context
CS      Continuous Score
DS      Discrete Score
FDDB    Face Detection Data Set and Benchmark
IoU     Intersection over Union
MSE     Mean Squared Error
ReLU    Rectified Linear Unit
ROC     Receiver Operating Characteristic
SGD     Stochastic Gradient Descent
SSD     Single Shot MultiBox Detector


1. INTRODUCTION

Detecting faces in an image is an easy task for humans, but for computers it has always been challenging because of the dynamic nature of faces. Faces must be detected even though lighting, orientation, facial hair and so on may vary greatly.

That is why face detection is one of the most studied problems in the computer vision field [1]. Furthermore, it is usually the first step for other face-related technologies such as face recognition and verification [2]. Perhaps the best-known application of face detection is in digital cameras, which can adjust the focus to the locations of the faces in a scene in order to take sharper pictures.

For computer applications that aim to interact naturally with humans, such as the interactive avatar presented in Figure 1.1, face detection plays an essential role. The purpose of this thesis is to implement a face detection system that provides a direction, derived from the interacting human's face, in which the avatar can look. Another purpose is to implement a method that decides which face to look at when multiple humans are interacting with the avatar at the same time.

Figure 1.1. The demo system of the avatar

The proposed implementation consists of machine learning and CNNs, since they are the state-of-the-art methods in the era of deep learning. The live video feed of the interacting human's face is provided by a basic webcam on top of the screen where the avatar is located, working as the "eyes" of the avatar. The implementation is designed to be light and efficient, so the environment requirements are modest and the system can easily be deployed in a number of different set-ups.

Related work and different kinds of face detection approaches are introduced in Chapter 2. Chapter 3 provides the basic theoretical background of object detection and machine learning without going into fine details. The proposed implementation is introduced in Chapter 4 and the results are discussed in Chapter 5. Finally, the conclusions and improvements for future work are presented in Chapter 6.


2. RELATED WORK

The early solutions for face detection date back to the early 1970s, when simple heuristic and anthropometric techniques were used. These techniques make many assumptions, such as a plain background, so any change in imaging conditions would mean fine-tuning or completely redesigning the system. After practical face recognition and video coding systems started to become a reality in the 1990s, research interest in face detection grew. Over the past decade there has been a lot of research interest in several aspects of face detection. Furthermore, the use of statistics and neural networks has enabled the detection of faces in cluttered scenes at various distances from the camera. [2]

Face detection techniques can be organized into two broad categories based on their different approaches to utilizing face knowledge. These categories are called the feature-based approach and the image-based approach, as presented in Figure 2.1. The feature-based approach dates from the early days of face detection research. It works by manipulating distances, angles and area measurements of the visual features of the scene. Two of the pioneering works in the feature-based approach are [3, 4]. This thesis focuses on the image-based approach, which is generally considered the more modern technique for face detection. The image-based approach utilizes pattern recognition theory and addresses face detection as a general recognition problem. In this approach the representations of the faces are usually classified into a face group with the use of learning algorithms, such as neural networks. [2]

One of the earliest neural network-based face detection studies is [5], while another study that was very popular in its time is [6]. The latter is also known as the Viola–Jones face detector, which is based on Haar-like features, an AdaBoost classifier and cascade inference [1].

This thesis is based on the face detection part of the live-age-estimator work [7], which uses machine learning, CNNs and a specific object detector network structure. Another work with a similar structure, inspired by the spread of the Covid-19 pandemic, is a deep neural network-based face mask detection system [8]. The structure of these detection systems represents one of the latest approaches to face detection with high accuracy and efficiency.


Figure 2.1. Approaches for face detection [2]


3. THEORY

This chapter presents the theoretical background of the thesis. Section 3.1 introduces the fundamentals of machine learning, including supervised learning. Section 3.2 introduces the essential elements of neural networks, including the CNN, which is one of the most important neural network structures in this thesis. Section 3.3 presents the structure of the neural networks that are used specifically for object detection.

3.1 Machine learning

Machine learning can be defined in a number of ways. One of the most universally accepted and relevant definitions is by Tom M. Mitchell, who stated that "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." In the context of face detection, E represents past image data with faces, T is the task of detecting faces in new image data and P is the performance measure of correct detections. [9]

Machine learning is implemented with machine learning algorithms, which consist of different components, such as an optimization algorithm, a cost function, a model and a data set [10]. A model is a summarized knowledge representation of the data; for example, the form of a model can be a set of mathematical equations [9].

In modern solutions the model is often built with neural networks, which then form the backbone of the machine learning algorithm [11]. A data set is the input for a model and it is usually divided into training data and test data. Training data is used as past data for training the model, and test data is used as unseen new data for the final performance measurements and tests. [9] In addition, part of the data set is often further set aside as validation data, which is used for fine-tuning and testing the model's behaviour before the larger test data set is used. The other components mentioned above are introduced more specifically in Section 3.2.

One subcategory of machine learning whose major motivation is to learn from past data is supervised learning [9]. The basic idea of supervised learning is to show the machine learning system what to do with the training data. A supervised learning algorithm's experience is a training data set in which each example is associated with a label or target. For example, the training data could be a set of images of different kinds of dogs, where each image is annotated with a specific breed. When this is applied to the supervised learning algorithm, it is able to learn to classify dogs into different breeds based on what it learned by studying the training data set. [10]

3.2 Neural networks

Neural networks are adaptive statistical models inspired by the structure of the brain. They are organized into layers, and each layer is built from basic units usually called neurons or nodes. The basic structure of a neuron is shown in Figure 3.1. A neuron receives input from other neurons, or from an external source, and then computes an output. The connections between neurons are realized as a set of weighted connections, and learning is usually achieved by modifying those connection weights.

Figure 3.1. Basic structure of a neuron with three inputs

The computed output $y(x)$ of a neuron is the weighted sum of the inputs applied to an activation function, as in the following equation

$$y(x) = a\left(\sum_i w_i x_i\right) = a(W^T X), \tag{3.1}$$

where $a$ is an activation function whose details are introduced in Subsection 3.2.1, $W = [w_1, w_2, \ldots, w_i]^T$ is the vector of weights and $X = [x_1, x_2, \ldots, x_i]^T$ is the vector of inputs. [12]

As mentioned above, neural networks are organized into layers which are built from neurons; a simple example of such a structure is shown in Figure 3.2. A network in which information flows through the layers, as introduced above, is known as a feed-forward neural network. Basically, it means that information flows in only one direction (forward) and never backward. [14]

Figure 3.2. A simple three-layer neural network structure

3.2.1 Activation functions

An activation function is the last part of the neuron; it decides the output based on the given inputs. It also plays a key role in determining whether the learning model is linear or non-linear. [13] There are many different kinds of activation functions with different purposes and functionality. Two of the most common ones are the Sigmoid function

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}, \tag{3.2}$$

and the rectified linear unit (ReLU)

$$\mathrm{ReLU}(x) = \max(0, x). \tag{3.3}$$

As seen from Figures 3.3 and 3.4, the Sigmoid function maps the values into the range $[0, 1]$ and ReLU maps the values into the range $[0, \infty)$. [15]

Figure 3.3. Sigmoid activation function

Figure 3.4. ReLU activation function

For the current trend of deep learning and deep neural networks, the ReLU activation function and its variants are usually more relevant than Sigmoid. In deep neural networks, the main advantages of ReLU over Sigmoid are its faster computation and the wider range it uses for mapping values. The computation is faster because ReLU does not need to compute expensive exponential operations, as can be seen by comparing Equations 3.3 and 3.2. The wider mapping range, as mentioned above, also prevents the gradients from vanishing the way they do with Sigmoid. [15]
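As an illustration, a short sketch of both activation functions and their gradients (assuming NumPy; not part of the original thesis) shows why Sigmoid gradients vanish for large inputs while ReLU gradients do not:

```python
import numpy as np

def sigmoid(x):
    # Equation 3.2: maps any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Equation 3.3: maps any input into the range [0, inf)
    return np.maximum(0.0, x)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))                     # [~0.00005, 0.5, ~0.99995]
print(relu(x))                        # [0.0, 0.0, 10.0]

# Gradients: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) approaches 0
# for large |x| (vanishing gradient), while relu'(x) stays 1 for x > 0.
print(sigmoid(x) * (1 - sigmoid(x)))  # [~0.00005, 0.25, ~0.00005]
print((x > 0).astype(float))          # [0.0, 0.0, 1.0]
```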


3.2.2 Cost functions and optimization

A cost function, also called a loss function or error function, is used to measure the difference between the correct answer and the answer that is predicted by the learning algorithm. Based on that measurement, and with the help of other algorithms such as optimization, the learning algorithm is able to make changes and reduce the loss. The changes could be, for example, updates to the weights, which lead to better predictions and performance. [16]

Two of the most common cost functions are the mean squared error (MSE) and log loss, also known as binary cross-entropy [16]. MSE can be defined as

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(f_i - y_i)^2, \tag{3.4}$$

where $n$ is the number of data points, $f_i$ is the predicted output value and $y_i$ is the correct output value for the $i$th data point [10]. Log loss can be defined as

$$Logloss = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right], \tag{3.5}$$

where $n$ is the number of data points, $y_i$ is the correct output value for the $i$th data point and $p_i$ is the predicted probability that the $i$th output belongs to the desired class [17]. The main properties of these cost functions are that MSE emphasizes large errors because of the square, while log loss takes probabilities and transforms them into a numeric measure of the error based on their correctness [16]. The optimal type of cost function is often related to the activation function in the output layer, as well as to the type of problem or task that needs to be solved. [10]
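A minimal NumPy sketch of both cost functions (illustrative, not from the thesis) could look like this:

```python
import numpy as np

def mse(f, y):
    # Equation 3.4: mean of squared differences between
    # predictions f and targets y
    return np.mean((f - y) ** 2)

def log_loss(p, y, eps=1e-12):
    # Equation 3.5: binary cross-entropy between predicted
    # probabilities p and binary targets y; eps avoids log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
print(mse(p, y))       # ~0.047
print(log_loss(p, y))  # ~0.228
```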

As mentioned above, learning is usually achieved by modifying the weights that connect the neurons. Optimization algorithms are used to modify and find the optimal values for the model parameters, or weights, by minimizing the cost function [10]. One of the most used optimization algorithms in machine learning is stochastic gradient descent (SGD) and its variants [10].


In general, the equation for updating a model parameter $w$ with gradient descent is

$$w = w - \alpha \frac{\partial E(w)}{\partial w}, \tag{3.6}$$

where $\alpha$ is the learning rate and $E$ is the cost function. The learning rate is a scale factor that decides the step size of the parameter update. The distinguishing property of SGD is that it iteratively computes the weight update for only one example, or a small batch of examples, at a time. This can make the computation of SGD faster than that of other gradient descent variants. Two other often-used optimization algorithms with different kinds of properties are Adam and RMSProp. The choice of optimization algorithm depends mainly on the user's familiarity with the algorithm, which allows ease in tuning its parameters. [10]
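The following sketch (illustrative names, assuming NumPy) applies the update rule of Equation 3.6 to a one-parameter model with the MSE cost of Equation 3.4, using one randomly chosen example per step in the style of SGD:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x plus noise; the model is f(x) = w * x
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + rng.normal(0, 0.1, size=100)

w = 0.0       # initial parameter
alpha = 0.1   # learning rate

for step in range(1000):
    i = rng.integers(len(x))   # pick one example at random (stochastic)
    error = w * x[i] - y[i]
    grad = 2 * error * x[i]    # d/dw of the per-example cost (w*x - y)^2
    w = w - alpha * grad       # Equation 3.6
print(w)                       # converges near 3.0
```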

3.2.3 Convolutional neural network

The CNN is a neural network that is often used for image classification tasks, such as object detection. Usually the layer structure of a CNN consists of convolutional layers, pooling layers and fully-connected layers, as presented in Figure 3.5. [18] In image classification tasks, the main advantage of the CNN compared to a traditional neural network is its improved efficiency: a CNN is able to reduce the number of required parameters without losing model accuracy, which basically means faster and more efficient computation. [10]

Figure 3.5. Basic layer structure of a CNN

The core of the CNN is the convolutional layer, which is a structure of fixed-size filters that applies complex functions to the input image. Each filter produces a feature map, and multiple filters produce multiple feature maps. The process is implemented by sliding a window of the locally trained filters over the image. During this process each filter has the same weight values, which is also known as weight sharing. This provides the ability to represent the same feature for the whole image in one feature map. [19] The output of feature map $O$ at coordinate $(i, j)$ with cross-correlation is

$$O(i, j) = E\big((K * I)(i, j)\big) = E\left(\sum_m \sum_n I(i+m, j+n)K(m, n)\right), \tag{3.7}$$

where $K$ is the kernel and $I$ is the input image. In practice this cross-correlation operation is implemented in many neural network libraries and simply named convolution [10].

Figure 3.6. Basic example of two dimensional cross-correlation, where the kernel is sliding completely inside the image. [10]
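A direct NumPy implementation of the two-dimensional cross-correlation in Equation 3.7 (a sketch with illustrative names, omitting the activation $E$ and keeping the kernel completely inside the image, as in Figure 3.6) could be:

```python
import numpy as np

def cross_correlation_2d(image, kernel):
    # Equation 3.7 without the activation E: slide the kernel over
    # the image and take the sum of elementwise products at each
    # position ("valid" mode, the kernel stays inside the image)
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlation_2d(image, kernel))  # 3x3 feature map
```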

After the convolutional layer, a pooling process is applied to the feature maps. Pooling is performed with the same kind of sliding-window process as described above. Some of the most popular pooling operations are maximum, L2 and average pooling. Maximum pooling passes on the maximum values, whereas average pooling takes the average of the inputs and L2 pooling calculates the L2 norm. An example of maximum pooling is presented in Figure 3.7 with a stride (the filter's step size) of 2. The main responsibility of pooling is to increase the robustness of feature extraction [18] and to reduce the size of the feature maps, which reduces the number of parameters [19].

Figure 3.7. Max pooling
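A corresponding sketch of maximum pooling with a stride of 2, as in Figure 3.7 (again with illustrative names):

```python
import numpy as np

def max_pool_2d(feature_map, size=2, stride=2):
    # Slide a size x size window over the feature map with the given
    # stride and keep only the maximum value inside each window
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 0.],
               [1., 8., 3., 4.]])
print(max_pool_2d(fm))  # [[6. 4.] [8. 9.]]
```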


After the convolutional and pooling layers, the data is converted into a one-dimensional vector for the fully connected layer [19], which acts as the classifier of the CNN. The neurons in the fully connected layer are connected to all of the neurons in the previous layer, as in a traditional neural network (see Figure 3.2). Finally, the last fully connected layer is followed by an output layer, where a softmax activation function is usually used for classification. [18]

3.3 Neural networks for object detection

Object detection is a field of computer vision concerned with locating instances of objects in images or videos. The use of machine learning and CNNs has led to enormous breakthroughs in the field of object detection. [20] The two main networks for object detection that are used in this thesis are MobileNet and SSD.

3.3.1 MobileNet

MobileNet is a network architecture that is designed to be efficient and low in latency to meet the requirements of mobile and embedded vision solutions. To achieve low latency and high efficiency, MobileNet uses depthwise separable convolutions, meaning that each convolution is separated into a depthwise convolution and a pointwise convolution. The depthwise convolution applies a single filter to each input channel. Then the pointwise convolution applies a 1×1 convolution to combine the outputs of the depthwise convolution. A normal convolution does the filtering and combining in one step. This separation between the layers has the effect of reducing the computation and model size. [21] The comparison between these layer structures is shown in Figure 3.8.

A mathematical justification of the reduced computation in depthwise separable convolution can be derived from the computational cost of normal convolution

$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F, \tag{3.8}$$

and the computational cost of depthwise separable convolution

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F, \tag{3.9}$$

where $D_K$ is the width and height of the kernel, $M$ is the number of input channels, $N$ is the number of output channels and $D_F$ is the width and height of the feature map. Based on these costs, the ratio between Equations 3.9 and 3.8 gives the reduction in computation.

Because MobileNet uses 3×3 depthwise separable convolutions, the reduction in computation is 8 to 9 times compared to normal convolution, without a significant loss in accuracy. [21]
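A quick numeric check of Equations 3.8 and 3.9 (a sketch; the layer parameter values are illustrative) confirms the claimed reduction for 3×3 kernels:

```python
def conv_cost(dk, m, n, df):
    # Equation 3.8: cost of a normal convolution
    return dk * dk * m * n * df * df

def dw_separable_cost(dk, m, n, df):
    # Equation 3.9: depthwise convolution + 1x1 pointwise convolution
    return dk * dk * m * df * df + m * n * df * df

# Illustrative layer: 3x3 kernel, 64 input and 128 output channels,
# 56x56 feature map
dk, m, n, df = 3, 64, 128, 56
ratio = conv_cost(dk, m, n, df) / dw_separable_cost(dk, m, n, df)
print(ratio)  # ~8.4, i.e. the 8 to 9 times reduction mentioned above
```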

Figure 3.8. Comparison of the layer structure between a normal convolutional layer (left) and a depthwise separable convolutional layer (right), where BN is batch normalization. [21]

The whole MobileNet is built with one normal convolutional layer as the first layer. That is followed by 13 depthwise separable convolution blocks stacked on top of each other with batch normalization and ReLU, as presented in Figure 3.8. The last layers consist of average pooling, a fully connected layer and a classifier with the softmax function. This structure of MobileNet is presented in Figure 3.9. For customizing the base architecture of MobileNet, the width multiplier $\alpha$ and the resolution multiplier $\rho$ can be changed from their default values of 1. The role of $\alpha$ is to thin the network at each layer, and $\rho$ is applied to the input image to subsequently reduce the internal representation of every layer. [21]

Figure 3.9. Architecture of MobileNet. [21]


3.3.2 Single Shot MultiBox Detector

SSD is an object detection network whose purpose is to do localization and classification at once. The SSD is based on a feed-forward CNN that produces a fixed-size set of bounding boxes. The presence of object class instances is then scored in those boxes, and a non-maximum suppression step is applied to produce the final detections. The name "Single Shot" comes from the fact that localization and detection are done in one pass, by looking at a single image once. The name "MultiBox" comes from the multibox algorithm, which is slightly modified in SSD. It basically means that the SSD can recognize objects of different classes even if their bounding boxes overlap. [22] The SSD framework is shown in Figure 3.10.

Figure 3.10. The SSD framework. (a) presents the input image with ground truth boxes for the training phase. A set of default boxes with different aspect ratios at each cell location is evaluated in multiple feature maps of different scales; (b) and (c) present 8×8 and 4×4 feature maps. The shape offsets and confidences (c1, c2, ..., cp) are predicted for all object categories for each default box. During the training phase, these default boxes are matched to the ground truth boxes and treated as positives or negatives. The model loss is calculated with Smooth L1 and Softmax. [22]

The SSD uses a base network, designed for image classification, as its early network layers. The purpose of the base network is to produce a high-quality feature extraction from the image. The classification and detection process is then done in the feature layers added by SSD, so any classification layers of the base network are removed. [22] For example, MobileNet is used as the base network in this thesis. The architecture of SSD is presented in Figure 3.11.

As described above, the SSD generates a large set of bounding boxes for the detections in the image, from multiple feature maps. This means that one object might have multiple different bounding boxes. The non-maximum suppression step, which produces the final detections, plays an important role in filtering most of those boxes. In that step the best detections are kept and others are removed, based on confidence loss and Jaccard overlap thresholds. [22]
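A common greedy form of the non-maximum suppression step is sketched below (assuming NumPy; the threshold value and names are illustrative, and the overlap measure is the IoU defined later in Equation 5.1), not necessarily the exact procedure of [22]:

```python
import numpy as np

def iou(box, boxes):
    # Jaccard overlap between one box and an array of boxes,
    # boxes given as [x1, y1, x2, y2]
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop remaining boxes that
    # overlap it too much, and repeat for the rest
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]
    return keep
```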


Figure 3.11. SSD architecture with VGG-16 as the base network. The added convolutional feature layers, which allow predictions of detections at multiple scales, can be seen at the end of the base network. [22]


4. IMPLEMENTATION

The face detection system is implemented with SSD, where MobileNet is used as the base network. The depth parameter of MobileNet is set to $\alpha = 0.75$ for fast and efficient performance, and the input size is set to 240×180. The network is initialized with Common Objects in Context (COCO) pre-trained weights. The training process is done with the TensorFlow Object Detection API and the Face Detection Data Set and Benchmark (FDDB) [23] face database. The system is implemented in Python and used through OpenCV. As previously mentioned, this implementation is based on the face detection part of [7].

The demo system for the whole interactive avatar implementation, presented in Figure 1.1, is located at the Tampere University Hervanta campus. It consists of a screen, a webcam and a computer with the Ubuntu operating system. The implementation was developed in a Windows 10 environment and also tested with a laptop without powerful processors, since the network structure is designed to be fast and efficient. The requirements for the environment are quite adaptive, and almost any kind of modern platform with a webcam can run the implementation. Also, by manually fine-tuning the code, almost any kind of video or image feed can be passed to the implementation.

The basic workflow of the face detection system consists of reading an image from the webcam, preprocessing the image for the network, passing the image to the network, extracting detections from the network's output, selecting satisfactory detections based on thresholds and returning the bounding box coordinates of the detection whose area is the biggest. Since the avatar can target only one face at a time while the network can detect multiple faces in the scene, providing only one target is handled by returning the detection with the biggest area, as described above. Also, if there are no faces in the scene, a default target coordinate is returned. This workflow is repeated until a stopping command is given. The code of the face detection implementation and the used models are available on GitHub1.

1 https://github.com/viljamirom/face_detection
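The workflow above could be sketched roughly as follows with OpenCV's dnn module (a minimal illustration only, not the thesis code; the model file names, preprocessing and confidence threshold are assumptions, and the actual implementation is in the GitHub repository):

```python
import cv2

# Assumed file names for an exported SSD-MobileNet TensorFlow model
net = cv2.dnn.readNetFromTensorflow('frozen_inference_graph.pb',
                                    'graph.pbtxt')
cap = cv2.VideoCapture(0)       # webcam feed
CONF_THRESHOLD = 0.5
DEFAULT_TARGET = (0.5, 0.5)     # fallback target when no face is visible

while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    # Preprocess the frame into the network's input blob
    blob = cv2.dnn.blobFromImage(frame, size=(240, 180))
    net.setInput(blob)
    detections = net.forward()  # shape (1, 1, N, 7) for SSD outputs

    best, best_area = DEFAULT_TARGET, 0.0
    for det in detections[0, 0]:
        conf = det[2]
        if conf < CONF_THRESHOLD:
            continue            # filter out uncertain detections
        x1, y1 = det[3] * w, det[4] * h
        x2, y2 = det[5] * w, det[6] * h
        area = (x2 - x1) * (y2 - y1)
        if area > best_area:    # keep only the biggest face
            best_area = area
            best = ((x1 + x2) / 2 / w, (y1 + y2) / 2 / h)
    print(best)                 # direction for the avatar to look at

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break                   # stopping command
cap.release()
```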


5. EVALUATION

The evaluation is done with the FDDB [23] and WIDER FACE [24] benchmarks. The WIDER FACE benchmark works as a reference evaluation, since its validation data set is quite difficult and partly irrelevant for this kind of implementation: the system is designed to be efficient, and the actual faces in the implemented environment are larger and easier to detect. The variability and difficulty of the WIDER FACE data set is presented in Figure 5.1. The other benchmark is FDDB, whose data set was also used for training the implemented system and is more similar to the faces in the implemented scene. The evaluation metrics are introduced in Section 5.1 and the results are shown in Section 5.2.

Figure 5.1. WIDER FACE data set variability example [24]

5.1 Metrics

In general, metrics are used to measure how well different implementations can solve the presented problem. In this implementation the problem is to detect faces at certain locations in images. For the evaluations, the detections from the system are classified as true positives, false positives or false negatives. These classifications are made by matching the ground truth coordinates to the predicted coordinates with an intersection over union (IoU) threshold. The IoU between prediction coordinates and ground truth coordinates is calculated as the ratio of the intersected area to the joined area

$$IoU(B_p, B_{gt}) = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}, \tag{5.1}$$

where $B_p$ is the bounding box coordinates of the prediction and $B_{gt}$ is the bounding box coordinates of the ground truth [23].

If the IoU between the coordinates is over the threshold value, the prediction is classified as a true positive. If the prediction coordinates cannot be matched with any ground truth coordinates, the prediction is classified as a false positive. One ground truth can be matched with only one prediction, which basically means that some predictions might be classified as false positives even if they satisfy the IoU threshold value. If a ground truth cannot be matched with any prediction, a false negative is counted. [23]
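A small scalar sketch of Equation 5.1 for axis-aligned boxes (illustrative names, boxes given as (x1, y1, x2, y2)):

```python
def iou(box_p, box_gt):
    # Equation 5.1: intersection area divided by union area
    x1 = max(box_p[0], box_gt[0])
    y1 = max(box_p[1], box_gt[1])
    x2 = min(box_p[2], box_gt[2])
    y2 = min(box_p[3], box_gt[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    return inter / (area_p + area_gt - inter)

# A prediction shifted slightly from the ground truth
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39, below a
                                                # typical 0.5 threshold
```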

The metrics used are precision and recall (also known as the true positive rate), which are constructed from the counts of the detection classifications introduced above.

The equation for precision is

$$precision = \frac{TP}{TP + FP}, \tag{5.2}$$

where $TP$ is the number of true positives and $FP$ is the number of false positives [25]. The equation for recall is

$$recall = \frac{TP}{TP + FN}, \tag{5.3}$$

where $FN$ is the number of false negatives [25]. As seen from Equations 5.2 and 5.3, these metrics present a trade-off: maximizing one tends to lower the other. For example, a higher recall value often means a more sensitive model with more true positives and fewer false negatives. At the same time false positives are more likely to increase, which might lower the precision metric.

The metrics mentioned above are used in this thesis for generating precision-recall and Receiver Operating Characteristic (ROC) curves for analyzing the performance.

For summarizing those curves, the Average Precision (AP) metric is used for the WIDER FACE precision-recall curves. AP is defined as the mean precision at a set of eleven equally spaced recall levels [26]. The Discrete Score (DS) and Continuous Score (CS) metrics are used for the FDDB ROC curves. DS assigns a score of 1 to a detection if its IoU is over 0.5 and 0 otherwise, while CS assigns a score to the detection based on the IoU value itself. [23] DS is also used for scoring the detections in the WIDER FACE benchmark [24].
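As an illustration of these metrics (a sketch with made-up counts and a toy curve; the eleven-point AP below uses the interpolated precision of the PASCAL VOC definition [26]):

```python
import numpy as np

def precision(tp, fp):
    # Equation 5.2
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation 5.3
    return tp / (tp + fn)

def eleven_point_ap(recalls, precisions):
    # Mean of the maximum precision at recall >= r, for eleven
    # equally spaced recall levels r = 0.0, 0.1, ..., 1.0 [26]
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11

print(precision(tp=90, fp=10))  # 0.9
print(recall(tp=90, fn=30))     # 0.75

# A toy precision-recall curve
r = np.array([0.2, 0.4, 0.6, 0.8])
p = np.array([1.0, 0.9, 0.7, 0.5])
print(eleven_point_ap(r, p))    # ~0.65
```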

5.2 Results

Precision-recall curves for the WIDER FACE benchmark are presented in Figure 5.3 and ROC curves for the FDDB benchmark are shown in Figure 5.2. The procedures and tools for generating the curves are provided by the respective benchmarks, FDDB [23] and WIDER FACE [24]. The summarized results based on those curves are presented in Table 5.1.

Benchmark                     Score metric   Result   Result with thresholds
FDDB                          DS             93.4%    77.1%
FDDB                          CS             69.7%    57.0%
WIDER FACE Validation easy    AP             39.9%    36.6%
WIDER FACE Validation medium  AP             26.0%    21.8%
WIDER FACE Validation hard    AP             10.9%    9.1%

Table 5.1. Evaluation results

Figure 5.2. ROC curves with the FDDB benchmark

The DS and CS results in the FDDB benchmark ROC curves show that the true positive rate increases quite quickly to a relatively high level and then almost stabilizes. Due to the threshold values in Figure 5.2(b), the total number of false positives decreases to under 200: some uncertain detections are filtered out without significant loss in the true positive rate. From these results it can be concluded that the implementation is quite accurate and confident in detecting faces in the FDDB data set, because even with the thresholds it detects almost the same number of faces correctly.

Compared to other implementations on the FDDB results page, these results show good performance, especially for a light and efficient implementation. However, ROC curves should mainly be analyzed for determining the strengths of different implementations [23], not directly for comparing performances.


The reference results for the WIDER FACE benchmark are quite poor compared to the other implementations reported on the WIDER FACE results page. However, the results still show a capability to perform on much harder data than the implementation is designed for. The results are poorer because the implementation is designed to be light and efficient, it was not trained with the WIDER FACE data set, and only the validation data set was used for the benchmarks. The results with thresholds in Figure 5.3 show the impact of filtering uncertain detections and the trade-off between the metrics, as mentioned above. When only the most confident detections are accepted, based on the thresholds, the number of false positives decreases and precision does not drop below a certain level. However, at the same time recall does not reach as high levels as it does without thresholds. The lower sensitivity in making detections leads to a higher number of false negatives, which lowers the recall. Basically the trade-off is between making accurate detections and making a lot of detections. The AP is almost the same even without thresholds, so the thresholds are quite optimal for making the implementation more accurate on the WIDER FACE data set.


Figure 5.3. Precision-recall curves for the implemented SSD-MobileNet network with the WIDER FACE benchmark


6. CONCLUSIONS

The goal of this thesis was to implement a face detection system that is able to provide the coordinates of faces in a webcam's video feed and to provide only one face target at a time. The proposed implementation used the SSD and MobileNet networks to provide those coordinates. The networks were chosen to be light and efficient for fast computation and adaptive environment requirements. The code and models used in this thesis are available on GitHub1.

Benchmark results and experiments with the demo system, with a moving avatar head, show that the proposed implementation works well and fast. The avatar is able to turn its head in real time towards a human face at different distances and positions. Also, the accuracy is at a satisfying level, because no false detections were noticed in the test environment. Targeting only one face at a time was implemented by targeting the face whose area is the biggest, which is assumed to be the closest face in the scene. That method works reasonably well; however, it could be further improved, as discussed below.

Possible future improvements concern the targeting method. If multiple faces move close to each other, the biggest-area method would almost constantly switch the target back and forth between the faces. Also, the closest face is not always the face that is interacting with the avatar. This could be improved by detecting whether a face's lips are moving and then targeting that face even if some other face is closer. Implementing a face recognition system could be another future improvement to enhance the interaction experience: it could memorize faces and associate a face with a name that the user gives when interacting with the avatar.

1 https://github.com/viljamirom/face_detection

REFERENCES

[1] Junjie, Y. et al. Face Detection by Structural Models. Elsevier B.V., 2014, pp. 790–799.

[2] Erik, H. and Low, B. K. Face Detection: A Survey. Elsevier Inc., 2001, pp. 236–274.

[3] Kanade, T. Picture Processing System by Computer Complex and Recognition of Human Faces. Available: https://repository.kulib.kyoto-u.ac.jp/dspace/bitstream/2433/162079/2/D_Kanade_Takeo.pdf. Department of Science, Kyoto University, 1973.

[4] Kotropoulos, C. and Pitas, I. Rule-Based Face Detection in Frontal Views. Washington DC: IEEE, 1997.

[5] Rowley, H., Baluja, S. and Kanade, T. Neural Network-Based Face Detection. IEEE, 1998.

[6] Viola, P. and Jones, M. J. Robust Real-Time Face Detection. Boston: Kluwer Academic Publishers, 2004.

[7] Tommola, J., Ghazi, P., Adhikari, B. and Huttunen, H. Real Time System for Facial Analysis. Available: https://arxiv.org/ftp/arxiv/papers/1809/1809.05474.pdf. Submitted to EUVIP2018, 2018.

[8] Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P. and Hemanth, J. SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Available: https://www.sciencedirect.com/science/article/pii/S2210670720309070. 2021.

[9] Chandramouli, S., Dutt, S. and Das, A. Machine Learning. Pearson Education India, 2018.

[10] Goodfellow, I., Bengio, Y. and Courville, A. Deep Learning. Available: http://www.deeplearningbook.org. MIT Press, 2016.

[11] Kavlakoglu, E. AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What's the Difference? Available: https://www.ibm.com/cloud/blog/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks. IBM, 2020.

[12] Herve, A., Valentin, D. and Edelman, B. Neural Networks. Thousand Oaks, Calif.; London: SAGE, 1999.

[13] Cole, M. R. Hands-on Neural Network Programming with C#: Add Powerful Neural Network Capabilities to Your C# Enterprise Applications. Birmingham, UK: Packt Publishing Ltd., 2018.

[14] Wu, C. H. and McLarty, J. W. Neural networks and genome informatics. New York: Elsevier Science B.V., 2000, pp. 20–21.

[15] Alcantara, G. Empirical analysis of non-linear activation functions for deep neural networks in classification tasks. Available: https://arxiv.org/pdf/1710.11272.pdf. 2017.

[16] Mueller, J. P. and Massaron, L. Deep Learning. Hoboken, New Jersey: For Dummies, 2019.

[17] Malhotra, R., Shakya, A., Ranjan, R. and Banshi, R. Software defect prediction using Binary Particle Swarm Optimization with Binary Cross Entropy as the fitness function. Available: https://iopscience-iop-org.libproxy.tuni.fi/article/10.1088/1742-6596/1767/1/012003/pdf. IOP Publishing, 2021.

[18] Guo, T., Dong, J., Li, H. and Gao, Y. Simple convolutional neural network on image classification. IEEE, 2017, pp. 721–722.

[19] Sarıgül, M., Ozyildirim, B. and Avci, M. Differential convolutional neural network. United States: Elsevier Ltd, 2019.

[20] Liu, L. et al. Deep Learning for Generic Object Detection: A Survey. International Journal of Computer Vision 128.2, 2020.

[21] Howard, A. et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. Available: https://arxiv.org/abs/1704.04861. arXiv preprint arXiv:1704.04861, 2017.

[22] Liu, W. et al. SSD: Single shot multibox detector. Available: https://arxiv.org/abs/1512.02325. European Conference on Computer Vision, 2016.

[23] Jain, V. and Learned-Miller, E. FDDB: A Benchmark for Face Detection in Unconstrained Settings. University of Massachusetts, Amherst, 2010.

[24] Yang, S., Luo, P., Loy, C. C. and Tang, X. WIDER FACE: A Face Detection Benchmark. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[25] Davis, J. and Goadrich, M. The Relationship between Precision-Recall and ROC Curves. Association for Computing Machinery, 2006.

[26] Everingham, M. et al. The Pascal Visual Object Classes (VOC) Challenge. Boston: Springer US, 2010.
