
EFFICIENT DEEP LEARNING FOR PERSON DETECTION

Faculty of Information Technology and Communication Sciences
Master of Science Thesis
April 2020


ABSTRACT

Olli Eloranta: Efficient deep learning for person detection
Master of Science Thesis
Tampere University
Computing Sciences
April 2020

This work researches how an efficient object detection neural network can be implemented.

The object detectors use image classification networks in their pipeline as so-called feature extractors, and their efficiencies are researched as well. The best practices for speeding up both classification and detection networks are presented. The main concepts of machine learning, neural networks and object detection are also presented to form an understanding of the subject.

Based on the neural network research, an SSD-MobileNetV2 object detector model is trained with multiple different parameters to evaluate how the parameters affect the detection speed and accuracy. Changing the input image size and the number of channels in the network greatly affects the performance of the model, but other effects on performance are discussed as well. The results can be used to more quickly select the appropriate parameters for training an object detector, depending on the application.

Keywords: Deep learning, Computer vision, Object detection

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ

Olli Eloranta: Efficient deep learning for person detection
Master of Science Thesis
Tampere University
Information Technology
April 2020

This work studies how efficient neural networks can be implemented for object detection. Efficiency here means a good trade-off between speed and accuracy. Object detection networks also use an image classification network as a so-called feature extractor, so both are studied to find methods that speed up the network. The general concepts of machine learning, neural networks and object detection are presented to deepen the understanding of the subject.

Based on the research, one neural network architecture, SSD-MobileNetV2, was selected, and its training parameters were varied to determine their effects. The effects were compared against each other, and the most significant factors were found to be the input image size and the number of channels in the network.

With the help of this research, a suitable architecture for different applications can be found more quickly.

Keywords: Deep learning, Computer vision, Object detection

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

This thesis was written for the Faculty of Information Technology and Communication Sciences at Tampere University during the academic year 2019-2020. The actual research was done at Wapice Oy during the summer and August of 2019.

I would like to express my gratitude to Wapice, and especially to my team leader Ilari Kampman, for giving me the opportunity to research this interesting subject at work. The data science team helped me in my research and provided me a great environment for conducting it. I have learned a lot on this subject with the help of my colleagues.

Professor Heikki Huttunen gave me instructions for my research and good ideas for writing the thesis when supervising my work at the university. I would like to thank him for his help.

Additionally, I would like to thank my family and friends for helping me work on my thesis and supporting me during my studies. I could not have succeeded in my studies as well without the support of the people close to me.

Tampere, 28th April 2020 Olli Eloranta


CONTENTS

1 Introduction
2 Machine learning
2.1 Basic concepts
2.2 Methods and training
2.3 Evaluation metrics
3 Neural networks
3.1 Basic architecture
3.2 Training process
3.3 Convolutional neural network
4 Deep convolutional network architectures
4.1 ResNet
4.2 MobileNet
4.3 SqueezeNet
4.4 ShuffleNet
5 Neural network based object detection
5.1 Methods
5.2 Two-stage object detection
5.3 One-stage object detection
5.4 Non-maximum suppression
6 Research setup
6.1 Detection datasets
6.2 Comparison of network parameters
7 Results
7.1 Effects of parameters
7.2 Discussion
8 Conclusions
References


LIST OF SYMBOLS AND ABBREVIATIONS

AP      Average precision
AR      Average recall
AUC     Area under curve
CNN     Convolutional neural network
COCO    Common objects in context, an object detection dataset
CPU     Central processing unit
FLOPS   Floating point operations per second
GPU     Graphical processing unit
HOG     Histogram of oriented gradients
IoU     Intersection over union
MAC     Memory access cost
mAP     Mean average precision
MLP     Multi-layer perceptron
MSE     Mean squared error
NMS     Non-maximum suppression
R-FCN   Region-based fully convolutional network
RCNN    Region-based convolutional neural network
ReLU    Rectified linear unit
RoI     Region of interest
RPN     Region proposal network
SSD     Single shot detector
YOLO    You only look once


1 INTRODUCTION

Modern machine learning methods have revolutionized several areas of technology and science. Computer vision has been used in simple applications for a long time, but due to the increasing power of computers and availability of data, it is more widely used than ever before. Deep neural networks have emerged as the state-of-the-art method for classifying and detecting objects in images in the past several years. The success of using a deep convolutional neural network in the 2012 ImageNet image classification contest has led to a huge amount of research on improving neural network models.

These days, the best models can classify and detect objects as well as, or in some cases better than, humans.

In the development of state-of-the-art object detectors, researchers attempt to find methods to increase the accuracy in any way possible. However, this usually means making the architecture of the model more complicated. Training and running these models might require several of the highest-end GPUs available. For example, in a few recently released state-of-the-art papers [1][2], the models are trained with four or eight of the most expensive and powerful GPUs available. The deployed model needs at least one of these powerful GPUs to run detections.

It is important to find and research extremely powerful models for machine learning tasks.

On the other hand, it might not be worth the cost in most applications to use powerful and expensive computers to run these tasks. The lower the cost of deploying a model is in different applications, the lower the threshold is to actually develop a solution. Optimizing a model’s architecture, the operations used in it and the computing hardware itself allows the model to be run even with inexpensive computers.

Some critical applications, such as self-driving cars, require very fast and powerful models for detecting traffic and pedestrians at high speed to prevent accidents. Lower speed and accuracy are enough for predicting the number of people in a surveillance camera feed, where people are walking slowly. Efficient neural network designs are required for these non-critical applications to reduce hardware costs.

The goal of this work is to research and train an efficient neural network architecture for detecting people in images. The architecture is supposed to have minimal decrease in accuracy while focusing on making it as fast as possible to detect objects. We selected person detection as our target class, since it has broad applications, such as detecting pedestrians in traffic or surveilling social distancing in public spaces.


Object detector models and their methods for improving speed are studied for the work.

We pick the most relevant detector for further inspection. The model is trained with different parameters to find their effects on the model's performance. An object detection dataset and methods for evaluating the model are studied to train the model. The performances are evaluated and the model with the most efficient speed and accuracy trade-off is determined based on the evaluations.

The design should be able to run on Nvidia's Jetson Nano hardware unit. The designed object detection model can then detect people directly from a camera feed connected to the device. This low-cost setup could be used in a non-critical application to reduce costs. The Jetson Nano does not have as powerful specifications as the multi-GPU systems that state-of-the-art object detection models are usually implemented on. This is why the size and efficiency of the detector network are crucial.

The basics of machine learning and pattern recognition are presented in Section 2 to form an understanding of how machine learning methods work in general. In Section 3, the theory behind neural networks and how they are trained is presented. The following Section 4 analyzes relevant deep neural network architectures for image classification.

Then, object detection is defined in Section 5. The most relevant object detector models are presented as well. Section 6 describes the training setup of the picked object detector model, and the results of the evaluations are presented in Section 7. Finally, Section 8 draws conclusions from the research and presents further thoughts on the subject of efficient object detection.


2 MACHINE LEARNING

2.1 Basic concepts

Machine learning algorithms build a mathematical model that learns to solve tasks based on the example data given to it [3, 4]. The examples are sets of features, such as a vector of temperature values or a matrix of pixel values, that can be processed by the learning algorithm. The algorithm uses the examples as training data to learn their features and to make predictions on similar data, depending on the task at hand. The tasks that machine learning models solve are often named after how the algorithm handles an example. Two common tasks are classification and regression.

In classification tasks, the machine learning algorithm attempts to assign an example to the correct class of c categories. The algorithm is given an input vector of values, and it assigns it to one of the discrete classes. The output can also be represented as a probability distribution of the classes. Machine learning classifiers can be taught to separate the given examples to their respective classes in many applications. An example classification task is image classification, where the machine learning algorithm attempts to distinguish what class an image belongs to. The examples contain the image as a matrix of pixel values and the respective class, such as a person, animal or vehicle.

Regression tasks do not have a set of discrete classes as target outputs. The given data and ways of learning are similar to those in classification tasks, but the output is a continuous numerical value. Regression tasks are used in applications where the data and desired outputs are quantifiable. For example, a regression algorithm's task could be to predict next year's market share of a company based on the data from previous years.

Both classification and regression are supervised learning algorithms. During training of the algorithm, it is given the correct ground truth value, or target of the prediction along with the input. When the algorithm produces an output from the input vector, it is shown the correct answer. The algorithm then refines its parameters for its future answers to be more accurate.

Unsupervised learning algorithms do not have target values given to them. They aim to learn properties from the data without any hints given about the examples. An example unsupervised machine learning task is clustering, where the algorithm attempts to separate the data into clusters based on the similarity of the examples in the data.


Figure 2.1. A linear regression model's estimated line $\hat{y}(x)$ based on training points.

2.2 Methods and training

To start training a machine learning model, a dataset is needed [4]. Datasets are collections of examples, each example containing its features. The examples in a dataset are usually contained in a matrix-like structure. Each row contains the features of one example, and the matrix contains a row for each of the examples. With supervised learning models, the rows also contain the target value of the example. Classification datasets transform the class names into corresponding numbers so the data can be more easily processed.

Machine learning models that are trained on simple examples with only a few variables can possibly be taught with a dataset of hundreds of examples. Training on complicated and multidimensional examples, such as images, might require tens of thousands of images in a dataset to train a good model. The type, size and complexity of a dataset are major factors in deciding which machine learning method should be used.

A simple machine learning task is picked as an example to understand the training of a machine learning algorithm. Linear regression is a regression task where a line is fitted on the given data. The fitted line can then be used to predict outputs that were not given as examples during training. An example fit is presented in Figure 2.1.

In the simplest example of only one input value $x$, linear regression aims to predict the correct value $y$ with the learned weight $w$ and bias $b$ as

$$\hat{y} = wx + b, \quad (2.1)$$

where $\hat{y}$ is the predicted output of the algorithm. The input $x$ is a real value, often between 0 and 1. The bias is a constant value regardless of the number of weight values. In this example there is only one input value and one associated weight value.


The aim of the model is to change the weight values $w$ to minimize the error between the predictions $\hat{y}$ and target values $y$ by observing the examples. The error can also be called the loss of the machine learning model. There are multiple different ways to compute the loss of the model. One example loss function is the mean squared error (MSE). With the training data of $N$ examples $x_1, x_2, \dots, x_N$ and target values $y_1, y_2, \dots, y_N$, MSE has the form

$$MSE = \frac{1}{N} \sum_{k=1}^{N} \left( y_k - \hat{y}(x_k) \right)^2. \quad (2.2)$$

The squared difference between the model's output prediction $\hat{y}$ and the approximated target value $y$ is calculated for all inputs.

Only knowing the loss of the model does not give instructions on how to minimize it.

Some direction needs to be found on how the weight value $w$ should be changed. For a linear regression model, the minimizing weights can be found by solving for the weight values where the gradient of the MSE loss function is 0.

Using the loss function to optimize the parameters will teach the model to fit well on the given examples. However, the aim of machine learning algorithms is not to fit on the given data, but to generalize to any future data as well. A good model performs well on data it has not seen before. Therefore, the model’s performance is evaluated on a separate test set. This is acquired by splitting the dataset into two parts.

The correct values in the test set are not given to the model during training. It only sees how well it performs based on the loss function. Thus, for the model to learn, it must balance making the training loss as small as possible and minimizing the difference between the train and test loss.

If the model has not yet learned enough from the train set, it is said to be underfitting. On the other hand, if the difference between the train and test losses is too high, it is said to be overfitting. An overfitted model might learn to perfectly minimize the loss on the train set, but it does not perform well on the test set.

Underfitting and overfitting of a model can be seen from its bias and variance. The model bias indicates how large the difference is between a given train value and the model's estimate. Variance indicates how much the model's output changes if it is trained with different samples of the whole data. The best model can be found by finding the best trade-off between bias and variance.

The example linear regression model also presented in Figure 2.1 had only one weight value that draws a straight line. This is easy to train, but usually a good fit is more complex than a straight line. The model’s complexity, or capacity, can be increased to give it the ability to model more complex information.


Figure 2.2. A linear, quadratic and 6th degree function produced by linear regression models.

For a linear regression model, increasing the model's capacity means giving it additional degrees to form a polynomial. The simple form of Equation 2.1 is further developed into the form

$$\hat{y} = b + w_1 x + w_2 x^2 + \dots + w_d x^d, \quad (2.3)$$

where $d$ represents the degree, or capacity, of the linear regression model. Linear regression models with too little capacity, meaning a polynomial of too small a degree, likely underfit on the given data. A model with large capacity can solve complex problems, but it has a chance of overfitting. The correct capacity must be found.

A visualization on finding the correct capacity of the model is presented in Figure 2.2.

The model in the first graph ($d = 1$) has underfitted and does not learn to fit well on the given points. It has a high bias and low variance. The last graph ($d = 6$) has overfitted. It performs well on the given points, but the curve turns upwards before the first and after the last point, which probably would not happen with additional examples. It has a low bias and high variance. The middle graph of a quadratic polynomial ($d = 2$) is the best fit of these three examples.
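As a small illustration of the capacity discussion above, the following Python sketch (using numpy and made-up sample points, not data from this work) fits polynomials of degree 1, 2 and 6 to a handful of noisy points and prints the training MSE of Equation 2.2 for each; the highest-degree fit reaches the lowest training error, which by itself says nothing about how well it generalizes.

```python
import numpy as np

# Hypothetical noisy training points roughly following a quadratic curve.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([0.1, 0.4, 1.1, 2.2, 4.1, 6.0, 9.2])

for degree in (1, 2, 6):
    # Least-squares polynomial fit, i.e. linear regression on powers of x.
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    mse = np.mean((y - y_hat) ** 2)          # Equation 2.2
    print(f"degree {degree}: training MSE = {mse:.4f}")
```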

2.3 Evaluation metrics

A loss function provides a way for the model to determine its performance during training.

It can give information to the user as well, but the overall performance of the model is usually determined with other metrics. These evaluation metrics can be used to validate if the model is good enough for the required application. If the model is not working well enough, it might not be a good approach for critical applications, such as in a medical context.

The predictions of machine learning models can be placed in a confusion matrix [5]. The rows represent the correct response and the columns represent the responses that the model outputs. Each prediction lands in one of the four corners of the confusion matrix.

The confusion matrix is presented in Table 2.1. For example, if a classification model predicts that the input belongs to a certain class and it does not actually belong to it, it would lead to a false positive result.


              Positive          Negative
Positive      True Positive     False Negative
Negative      False Positive    True Negative

Table 2.1. A confusion matrix. The target responses are on the left and the model’s predictions on the top.

Score Correct Precision Recall

0.98 True 1.0 0.25

0.90 True 1.0 0.50

0.85 False 0.67 0.50

0.68 True 0.75 0.75

0.55 False 0.6 0.75

0.43 True 0.67 1.0

Table 2.2. Example set of predictions to calculate precision and recall

To evaluate the model, it is presented with all the values in the test set. Each of the predictions the model makes falls into one of the corners of the confusion matrix. From the number of predictions in each corner, the precision and recall values can be calculated. Precision measures the percentage of correct predictions out of all predictions, and recall measures the percentage of positive examples that were found. They have the following formulas:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad (2.4)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad (2.5)$$

where TP, FP and FN are the numbers of true positive, false positive and false negative predictions in the confusion matrix.

In regard to a current issue, consider an example machine learning model that classifies whether a person is infected with the COVID-19 disease or not. If a person is ill and the model predicts it correctly, it is denoted as a correct prediction. The precision and recall values are calculated in descending order of the model's confidence scores for the predictions. The example predictions and the calculated precision and recall values are presented in Table 2.2.

The precision and recall values are then plotted into a precision-recall curve [6]. The predictions from Table 2.2 are added to the curve one at a time in the descending order of scores. The recall gets higher each time a correct prediction is done. Every time a false prediction is done, the precision value drops. The example would result in the precision-recall curve presented in Figure 2.3.


Figure 2.3. The precision-recall curve produced from the predictions in Table 2.2.

The higher the precision-recall curve is, the better the model is performing. From the curve, the average precision (AP) can be calculated. It is a numeric value describing how well the model produces correct predictions. The average precision AP is calculated as the area under the precision-recall curve $P(R)$ of the plot:

$$AP = \int_0^1 P(R)\, dR, \quad (2.6)$$

which results in an average precision value between 0 and 1. With multiclass data, the mean average precision (mAP) is defined as the mean of the AP over all the classes in the dataset. AP is sometimes used with the same meaning as mAP. Datasets with only one class naturally have the same mAP and AP.

In machine learning competitions, the precision-recall curve is interpolated so that it has only rectangular areas under the curve. The interpolation can be done either at a fixed set of recall steps or dynamically every time the precision drops, and the AUC is then calculated from the rectangular areas between the interpolation steps.
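The same bookkeeping can be written out in a few lines of Python. The sketch below recomputes the running precision and recall values for the scored predictions of Table 2.2 and then approximates the AP as a step-wise (non-interpolated) sum of precision times recall increments; the number of ground-truth positives (four) is an assumption read off the recall column of the table.

```python
# Correctness flags of the predictions from Table 2.2, sorted by confidence score.
correct = [True, True, False, True, False, True]
num_ground_truth = 4          # assumed number of positive examples (recall reaches 1.0)

precisions, recalls = [], []
tp = 0
for i, is_correct in enumerate(correct, start=1):
    tp += int(is_correct)
    precisions.append(tp / i)                 # Equation 2.4
    recalls.append(tp / num_ground_truth)     # Equation 2.5

# Step-wise approximation of the area under the precision-recall curve.
ap = 0.0
prev_recall = 0.0
for p, r in zip(precisions, recalls):
    ap += p * (r - prev_recall)
    prev_recall = r

print(precisions, recalls, round(ap, 3))
```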


3 NEURAL NETWORKS

A neural network is a machine learning algorithm that connects a large number of simple computational units to form a powerful model [3]. It can be difficult to train due to the required computational power and amount of data. However, as the computational power of computers and the number of available machine learning datasets on the internet have been increasing, neural networks have lately become relevant [7]. Especially tasks with images, such as image classification, have been solved with neural networks.

3.1 Basic architecture

A simple neural network consists of artificial neurons that are connected to each other [7]. The neurons form layers, where each layer's neurons have the previous layer's neurons' outputs as their inputs. This simple architecture with multiple layers of neurons, also presented in Figure 3.1, is called a multi-layer perceptron, or an MLP.

The MLP network has an input layer for providing the input values. The input values are given to the hidden layers, which contain the network’s neurons that calculate their outputs based on learned weight values. The output layer contains the calculated outputs of the neural network. A network where the layers have their inputs from the previous layers is called a feedforward neural network.

Figure 3.1. The simple multi-layer perceptron architecture with an input layer, hidden layers and an output layer.


Each neuron in the neural network calculates its output based on an activation function and the weighted input values, along with an added overall bias. The weighted input value with the weight vector $w$ connected to the input vector $x$ from the previous layer is calculated as

$$z = w \cdot x + b, \quad (3.1)$$

where b is the added bias. The activation function calculates the neuron’s output with the weighted input. The simplest neuron type is the perceptron, which has an activation function of

$$a(z) = \begin{cases} 0, & \text{if } z \le 0 \\ 1, & \text{if } z > 0. \end{cases} \quad (3.2)$$

The calculated binary output is then given to all the perceptrons in the next layer as one of their input values $x$.

The binary output of a perceptron means that a very small change in the weights might determine whether the output is calculated as 0 or 1. On the other hand, a large change in the weights might still keep the output the same as before. A better approach is to use an activation function that is smoothed out instead of a step function.

One example of a neuron with an activation function that provides any value between 0 and 1 is the sigmoid neuron. The sigmoid function is defined as

$$a(z) = \frac{1}{1 + \exp(-z)}. \quad (3.3)$$

If there is a small change in the sigmoid weights, it also has a small effect on the output.

The network can also be trained more precisely as the neurons in it can have a larger range of values than just 0 or 1.

Another activation function, called ReLU is defined as

a(z) = max(0, z). (3.4)

ReLU is extremely simple and only suppresses the negative values to zero. It has been proven to be a well-performing activation function in convolutional neural network architectures [8]. With ReLU activations, the network is faster to train and results in better accuracy than with the sigmoid function. The training of the convolutional kernels' values is done similarly to MLPs, with gradient descent and backpropagation [7]. The differences between the perceptron, ReLU and sigmoid activation functions can be seen from the visualization in Figure 3.2.


Figure 3.2. The perceptron, ReLU and sigmoid activation functions.
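The activation functions above translate directly into a few lines of numpy. The sketch below evaluates the step, sigmoid, ReLU and softmax functions of Equations 3.2-3.5 on an arbitrary example vector of weighted inputs.

```python
import numpy as np

def perceptron(z):
    return np.where(z > 0, 1.0, 0.0)        # Equation 3.2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # Equation 3.3

def relu(z):
    return np.maximum(0.0, z)               # Equation 3.4

def softmax(z):
    e = np.exp(z - np.max(z))               # shifted for numerical stability
    return e / e.sum()                       # Equation 3.5

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(perceptron(z), sigmoid(z), relu(z), softmax(z), softmax(z).sum())
```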

The fully connected layers are practical in the hidden layers of the neural network, but they are difficult to interpret as an output layer. A typical way of getting a better output is to use a softmax layer as the output layer. The softmax layer transforms the hidden layer values into a probability distribution. Thus, each of the output neurons in the softmax layer can be interpreted as a specific class probability.

The weighted input values $z$ are used to calculate the distribution for each of the output values. The activation of the $j$th softmax neuron in a softmax layer of $k$ neurons is calculated as

$$a_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}. \quad (3.5)$$

The softmax activation uses an exponential function similarly to the sigmoid activation function. The exponential of each weighted input in the softmax layer is divided by the sum over all the inputs in the layer to get the output distribution. The sum of the outputs of the softmax layer is always 1. Especially in classification tasks, the predicted class can easily be found by searching for the softmax neuron with the highest value.

The presented building blocks are enough to build a simple neural network architecture.

How the architecture is formed is based on the task and data that are used. When the architecture is designed, it needs to be trained on some data to acquire the neural network model.


Figure 3.3. Gradient descent where two variables $v_1$ and $v_2$ affect the loss function $L$.

3.2 Training process

The concepts and building blocks of a neural network are quite simple. However, the large number of neurons and the layer-like structure makes it difficult to train. The same concepts introduced in section 2.2 for training machine learning models apply to neural networks as well. However, some additional methods are needed when training neural networks.

Initially, the weights and biases of the neurons in the network are randomly initialized, and the aim of the training process is to find the optimal values for them. The process is similar to that in other machine learning algorithms, but the number of tuneable parameters is very large. With the optimal values, the network can correctly do the required task.

The loss function of a neural network can be the same as the loss functions in basic machine learning models. For example, MSE can be used to determine the accuracy of the neural network model. Due to the layered architecture of neural networks, the optimization of the neuron weights is done in a slightly more complicated fashion than with simpler machine learning models.

The minimization of the loss function is done with gradient descent. When variable values are changed, the value of the loss function changes as well, either for the better or for the worse.

The "movement" can be seen from the gradient of the loss function. In a simple situation with only two variables, v1 and v2 affecting the loss function L, we can visualize the minimization of the loss function as in Figure 3.3.

The two variables of the model were updated, and the change in the gradient is seen as the arrow in the figure. As we can see from the direction of the arrow, the updates were in the correct direction and the loss function became smaller. Continuing approximately in the same direction could help in finding the optimal variable values.


The gradient of a loss function $L$ concerning $n$ variables is calculated as

$$\nabla L = \left( \frac{\partial L}{\partial v_1}, \dots, \frac{\partial L}{\partial v_n} \right)^T, \quad (3.6)$$

where $T$ means the transpose operation. The updated value $v_i'$ for each variable $v_i$ can be calculated by using the gradient with

$$v_i' = v_i - \mu \nabla L, \quad (3.7)$$

where $\mu$ is the learning rate. If the variables are changed as drastically as in Figure 3.3, the minimal point might not be found. The large changes would cause the result to go over the minimal point. Therefore, a learning rate is chosen to make the changes smaller.

Typical learning rate values are small, between $10^{-1}$ and $10^{-5}$ depending on the complexity and architecture of the network. The learning rate is typically adaptive and might start with a higher value and get smaller during training to fine-tune to a minimal loss.

Stochastic gradient descent is used to make learning by gradient descent faster. In stochastic gradient descent, the gradient $\nabla L$ is calculated with a small batch of training inputs instead of only one training input. Using the mean of the gradients within the batch of training data makes the training process quicker and more accurate. A proper batch size is dependent on the complexity and size of the training dataset and can be adjusted to improve the fitting of the network.

The backpropagation algorithm is used to calculate the gradient of the loss function. If a neuron is given a small change at the first hidden layer of an MLP network, its output will affect all the neurons in the next layer. Furthermore, the change will continue to slightly change the inputs to all the following layers as well because of its effect on the next layer. Backpropagation aims to backtrack the small cumulative changes that the updated weights cause. Backpropagation calculates the errors of the neurons starting from the last layer and going backwards in the network’s architecture.

After the error of the output layer is calculated, the hidden layers' errors can be backpropagated. For each of the hidden layers $l$, the backpropagated error $\delta^l$ is calculated as

$$\delta^l = \left( (w^{l-1})^T \delta^{l-1} \right) \odot f(x^l), \quad (3.8)$$

where $w^{l-1}$ contains the weights of the previous layer and $f(x^l)$ contains the outputs of the layer based on its inputs and activation function. The $\odot$ operator denotes the Hadamard product of the two vectors, which means multiplying the values in the vectors element-wise.


After the hidden layer errors are backpropagated, the gradient of the loss function can be calculated. The gradient of a weight variable from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer is calculated as

$$\frac{\partial L}{\partial w_{jk}^{l}} = y_k^{l+1} \delta_j^{l}, \quad (3.9)$$

where $y_k^{l+1}$ is the output of the neuron in question. The gradient of the neuron's bias variable is calculated as

$$\frac{\partial L}{\partial b_j^l} = \delta_j^l. \quad (3.10)$$

Equation 3.7 can then be used to update the weight and bias values in the network based on the backpropagated gradients. This whole process is done for one batch of data at a time. When all the inputs in the training dataset have been fed through the network, one epoch is said to be completed. The network is trained for a suitable number of epochs to prevent either underfitting or overfitting.
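As a compact illustration of Equations 3.6-3.10, the sketch below trains a tiny two-layer MLP with sigmoid activations on the XOR problem using plain numpy and full-batch gradient descent; the layer sizes, learning rate and number of epochs are arbitrary choices for the example, not values used elsewhere in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR toy dataset: four examples with two features each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialised weights and biases of a 2-4-1 network.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0  # learning rate, mu in Equation 3.7

for epoch in range(5000):
    # Forward pass through both layers.
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)

    # Output error and backpropagated hidden error (Equation 3.8),
    # using the derivative of the sigmoid, a * (1 - a).
    delta2 = (a2 - Y) * a2 * (1 - a2)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)

    # Gradient descent updates (Equations 3.7, 3.9 and 3.10).
    W2 -= lr * a1.T @ delta2
    b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * X.T @ delta1
    b1 -= lr * delta1.sum(axis=0)

print(np.round(a2.ravel(), 2))  # outputs typically approach [0, 1, 1, 0]
```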

The introduced quite small and shallow network architecture is enough for simple tasks.

To increase the performance, more than only a few hidden layers should be added to form a deep neural network. However, just adding layers does not improve the performance too much, as making the network deeper introduces problems with the gradient descent algorithm [7].

The gradients in the first hidden layers of a network tend to have a different scale of values than the last hidden layers. The deeper the network is, the larger the difference.

This makes updating the weights with gradient descent difficult. Sometimes the gradient is very low in the first layers of the network and large in the end of the network. On the other hand, the gradient might sometimes be very high in the first layers and very low in the last layers. These two problems are called the vanishing gradient and exploding gradient problem, respectively.

Both vanishing and exploding gradients are the result of gradients generally being unstable the deeper a neural network is. Both follow from the fact that the gradient in the first layers is computed as a product of terms from the later layers. With more layers the product becomes more unstable.

Changes in the activation function or gradient descent algorithm can help in avoiding unstable gradients. However, neural networks with fully connected neurons should not be used in the more complex, deep networks. There are other approaches that can be used to form a neural network. The most common deep neural network, especially for working with image data, is a deep convolutional neural network.



3.3 Convolutional neural network

Convolutional neural networks (CNNs) are used especially in image recognition tasks [7]. They utilize convolutional layers to learn the characteristics of images. Convolutional networks generally attempt to find different features in images based on small neighborhoods of pixels. The neighborhoods are processed with 2D convolutions.

The convolution operation is a mathematical operation that combines two signals to get a third signal, similarly to how addition combines two numbers to get a third number [9].

The convolution operation has an input signal and an impulse signal, or a convolutional kernel, that are combined to get an output signal.

The convolution operation is represented with the asterisk symbol. The convolution of the input signal $x[n]$ with the convolutional kernel $k[n]$ to get an output signal $y[n]$ is denoted as

$$y[n] = x[n] * k[n]. \quad (3.11)$$

The operation can be thought of as sliding the convolutional kernel through the input signal. At each point the kernel's values are multiplied with the respective input signal's values and the multiplications are added together. It is better understood with an example of discrete signals. With an input signal of $x[n] = [1, 2, 3, 2]$ and kernel $k[n] = [1, 2, 1]$, the output would be $y[n] = [1, 4, 8, 10, 7, 2]$. Figure 3.4 shows the first few calculations of the convolution. One output value is calculated at each point of the signal.

Figure 3.4. The calculation of the first two values of a 1D convolution: y[1] = 1·0 + 2·0 + 1·1 = 1 and y[2] = 1·0 + 2·1 + 1·2 = 4.
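The example can be checked with numpy, whose convolve function performs exactly this sliding multiply-and-sum; in the default 'full' mode the input is zero-padded at both ends, as in Figure 3.4.

```python
import numpy as np

x = np.array([1, 2, 3, 2])    # input signal
k = np.array([1, 2, 1])       # convolutional kernel
print(np.convolve(x, k, mode="full"))   # -> [ 1  4  8 10  7  2]
```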

A convolution with 2D information, such as images, works with the same idea as the previous 1D version. Instead of sliding a 1D kernel through the 1D input, the 2D kernel moves through the 2D image one row of pixels at a time [7]. At each pixel location, the resulting pixel value in the output image is calculated in the same way by multiplying the respective kernel values with the small neighborhood of values in the input image. Figure 3.5 presents in practice how a convolution works with an image array.


Figure 3.5. 2D convolution. The kernel slides through each part of the input.

The kernel weight values of a convolutional neural network also involve the activation function of the convolutional layer and the bias value. Thus, the output value of a convolution operation with the input neighborhood $X$ and kernel $K$ would be

$$a(K * X + b), \quad (3.12)$$

where $a$ is the activation function and $b$ the activation bias. Similar activation functions could be used for calculating the values inside a convolutional kernel and an MLP network's neuron. However, the ReLU (Rectified Linear Unit) activation function has been widely used as the activation function for convolutional neural networks.

The output of a network’s layer’s convolution operation is a so-called feature map. Feature maps can also be called activation maps because they contain the activations of the layer.

A convolutional layer in a neural network consists of multiple feature maps, each with a different learned kernel.

The number of feature maps can be expressed as the number of channels in the layer.

The feature maps learn to distinguish different features from the image, such as edges or shapes. However, the kernels are quite small relative to the input image, and they can only learn to distinguish very small features the size of a few pixels. For the kernel to learn more general features of an image, the input needs to be made smaller.

Pooling layers are used to simplify or diminish the outputs of the convolutional layers.

The output feature map of the convolutional layer is given to the pooling layer, which summarizes small regions of the feature map. With 2x2 pooling, the pooling layer takes areas of 2x2 pixels and summarizes them into one pixel value in the pooling layer's output feature map. After a pooling layer, the feature map has halved dimensions and contains a quarter of the input's values. The pooled information can then be given to the next convolutional layer.

One common pooling method is max-pooling. It outputs the maximum values of the input areas. Max-pooling, like other pooling methods, aims to find the searched features in regions of the image. A searched feature means a feature such as an edge or a basic shape, which the kernel's values have been updated to recognize. If the region contains the searched feature, it has a large activation. The large activation is noticed by the max-pooling layer as it picks the highest activation values from the feature map.
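A 2x2 max-pooling operation can be sketched in a few lines of numpy by grouping the feature map into non-overlapping 2x2 blocks and keeping the maximum of each block; the example feature map below is made up.

```python
import numpy as np

# Hypothetical 4x4 feature map produced by a convolutional layer.
feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [0, 2, 5, 7],
                        [1, 1, 3, 4]])

h, w = feature_map.shape
# Group the map into non-overlapping 2x2 blocks and keep the maximum of each.
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 2]
                #  [2 7]]
```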

A functional convolutional neural network is formed by alternating between convolutional layers and pooling layers. The first layers that produce large feature maps can be taught to learn very rough features in the image, such as edges, corners and small shapes. The layers at the end of the network have small feature maps that cover large parts of the image, and they can learn to distinguish complex shapes.

After the input dimensions have been cut with a pooling layer, convolutional layers often increase the number of channels. The size of the inputs keeps getting smaller, but the number of the learned feature maps gets larger. Fewer feature maps are needed for the basic low-level features. The high-level features are more varied and benefit from a larger number of channels.

The output of the last convolutional layer is difficult to interpret as the image class. Because of this, one or several fully connected hidden layers can be added to the end of the convolutional network. The fully connected layers are connected to all the values in the last convolutional layer. The final classification probabilities can then be given as an output with a softmax layer at the end of the network.

In Figure 3.6, a simple example of a convolutional neural network architecture is presented. The network can be trained to classify small 32x32 sized input images into 10 classes. Changing the kernel size, the number of layers or other parts of the architecture will affect the performance of the network.

Figure 3.6. An example deep convolutional neural network architecture for classifying images between 10 classes. The layer dimensions are 32x32x3 (input), 32x32x8 and 16x16x8 (convolution and max-pooling), 16x16x16 and 8x8x16 (convolution and max-pooling), 8x8x32 (convolution), 1x1024 (fully connected) and 1x10 (softmax output).
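A network with the layer dimensions of Figure 3.6 could be sketched roughly as follows in PyTorch; the 3x3 kernel size and padding are assumptions chosen so that the feature map sizes match the figure, and the framework is used only for illustration.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the example architecture in Figure 3.6 (32x32x3 input, 10 classes)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),    # 32x32x8
            nn.MaxPool2d(2),                                          # 16x16x8
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),   # 16x16x16
            nn.MaxPool2d(2),                                          # 8x8x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # 8x8x32
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 8 * 32, 1024), nn.ReLU(),                  # 1x1024
            nn.Linear(1024, num_classes),                             # 1x10 logits
        )

    def forward(self, x):
        # Softmax turns the final logits into class probabilities.
        return torch.softmax(self.classifier(self.features(x)), dim=1)

probs = SimpleCNN()(torch.randn(1, 3, 32, 32))
print(probs.shape, probs.sum().item())   # torch.Size([1, 10]), ~1.0
```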

As presented earlier, the kernels in different parts of a convolutional neural network learn to find different features. Figure 3.7 presents an example set of features from the second and third layers of a 5-layer convolutional neural network. In the first image, one of the kernels in the second layer seems to recognize corner-like shapes. In the second image, a kernel in the third layer can recognize roughly the shape of a person’s upper body.

There is a max pooling layer between the two convolutional layers, which results in the different input sizes: layer 2 has an input size of 26x26 pixels and layer 3 an input size of 13x13 pixels.

Figure 3.7. Example features of feature maps in different layers of a convolutional neural network and their respective location in the input image. Adapted from [10].

The presented types of layers and the ReLU activation are just examples to be used for a deep convolutional neural network. The best approach and architecture are dependent on the data and application. There are many different architectures with different approaches to get powerful classification models. In the next chapter some of these network architectures are presented.


4 DEEP CONVOLUTIONAL NETWORK ARCHITECTURES

There is a huge number of different convolutional neural network architectures. Some are powerful and require a large amount of computing power while others have a worse accuracy but are much faster. Four common deep convolutional neural network architectures that emphasize efficiency in their design are researched. Table 4.1 presents the main features that are developed in each of the architectures.

Architecture    Presented features
ResNet          Residual learning to create extremely deep CNNs.
MobileNet       Depthwise separable convolutions to reduce computational cost of convolutional layers.
SqueezeNet      Sparser representation of the network to reduce model size.
ShuffleNet      Shuffled group convolutions to improve performance on multiple GPUs.

Table 4.1. Main features that affect efficiency in the researched CNNs.

4.1 ResNet

ResNet was developed to tackle the problems of training deep neural networks [11].

Stacking a large number of layers makes the model more complicated, but it does not increase the accuracy. Deep networks have a problem of accuracy degradation, and ResNet proposes residual networks that can be made extremely deep.

In residual learning, the network does not aim to directly fit a mapping $H(x)$ of an input vector $x$. Instead, it attempts to approximate the residual function $F(x) = H(x) - x$. This transforms the original mapping into the form $F(x) + x$. The network consists of residual building blocks with the function

$$y = F_W(x) + x, \quad (4.1)$$

where $x$ and $y$ are the input and output of the block, and $F_W(x)$ is the residual mapping learned by the block's weights $W$. The mapping can be done flexibly, but ResNet uses residual blocks of two or three convolutional layers. With neural networks, the residual mapping is achieved with skip connections, where the input features are given both to the following residual block and the layer after the block.

The base ResNet architectures consist of either 34 or 18 convolutional layers, followed by average pooling and a fully connected layer for predictions. All the convolutional layers are in residual blocks of two layers. Without the residual connections the 34-layer ResNet has a higher error than the 18 layer ResNet on the ImageNet [12] classification dataset.

With the residual connections it is the other way around. The ResNet network can be made extremely deep to improve accuracy. The 152-layer ResNet was the top state-of-the-art classification neural network of 2015 due to the residual connections.

The residual connections do not increase the number of parameters or computational complexity in the network. This means that using residual connections has few downsides in neural networks. The residual connections can make deep neural networks more accurate, but they have a smaller effect on simple networks. However, the ResNet network can be made smaller and more lightweight by removing layers in the network. Further halving the layers to 9 could allow the model to work in lightweight applications.
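A basic two-layer residual block with an identity skip connection, as used in the 18- and 34-layer ResNets, could be sketched in PyTorch as follows; the sketch assumes equal input and output channels and stride 1 so that the input can be added to the block output directly.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a two-layer residual block y = F(x) + x (Equation 4.1)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)   # the skip connection adds the block input back

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)        # torch.Size([1, 64, 56, 56])
```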

4.2 MobileNet

MobileNet is a convolutional neural network architecture designed to be used in mobile and embedded computer vision applications [13]. The architecture aims to be computationally light while it tries to maintain comparable performance to larger architectures.

This efficiency is achieved with convolutional layers that use depthwise separable filters and hyperparameters that shrink the model.

Depthwise separable convolutions split a standard convolutional layer into a depthwise convolution and a pointwise convolution, the latter meaning a convolution with a 1x1 filter. The depthwise separable convolution layer is presented in Figure 4.1. Where a standard convolution both filters and combines results in its output, the depthwise separable convolution does this in two separate layers. The depthwise convolution applies a filter to each input channel of the layer, and the pointwise convolution creates linear combinations of the depthwise layer's outputs. This reduces computation cost and model size.

Figure 4.1. A depthwise separable convolution as used in MobileNet: a 3x3 depthwise convolution and a 1x1 convolution, each followed by batch normalisation and ReLU.
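In PyTorch, a depthwise convolution is an ordinary convolution whose groups argument equals the number of input channels, so the block of Figure 4.1 can be sketched roughly as below; the channel counts are arbitrary example values.

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution (Figure 4.1)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

block = depthwise_separable(32, 64)
print(block(torch.randn(1, 32, 112, 112)).shape)   # torch.Size([1, 64, 112, 112])
```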


The computational cost of operations performed on a computer can be depicted as FLOPS (floating point operations per second). A standard convolution with a square convolutional kernel has a computational cost of

$$FLOPS = D_K D_K M N D_F D_F, \quad (4.2)$$

where $D_K$ is the kernel size, $M$ and $N$ the number of input and output channels and $D_F$ the feature map size. Here the input and output feature maps have the same dimensions apart from the channels. The depthwise convolution has a computational cost of

$$FLOPS = D_K D_K M D_F D_F. \quad (4.3)$$

The depthwise convolution only filters input channels and does not create new features.

It requires a 1x1 pointwise convolution to produce new features from the filtered results.

The combination of a depthwise convolution and a pointwise convolution is called a depthwise separable convolution. Its computational cost is

$$FLOPS = D_K D_K M D_F D_F + M N D_F D_F. \quad (4.4)$$

To get the reduction of computational cost, the depthwise separable convolution can be compared with the normal convolution. The total reduction in cost is

$$\Delta FLOPS = \frac{D_K D_K M D_F D_F + M N D_F D_F}{D_K D_K M N D_F D_F} = \frac{1}{N} + \frac{1}{D_K^2}. \quad (4.5)$$

Using 3x3 convolutions requires 8 to 9 times more computation than depthwise separable convolutions.

The standard MobileNet architecture consists of 28 layers where each layer is followed by batch normalisation and ReLU. However, the size of these layers can be adjusted with the width and resolution multiplier hyperparameters. Adjusting the width multiplier makes all the convolutional layers in the network thinner, reducing the number of input and output channels. The width multiplier α can be added to the computational cost to get the multiplied cost:

$$FLOPS = D_K D_K \alpha M D_F D_F + \alpha M \alpha N D_F D_F, \quad (4.6)$$

where $\alpha = 1$ means the baseline MobileNet and lower values are thinned ones. The resolution multiplier affects the size of the network's input image and reduces the resolution of the feature maps of all following layers accordingly.


                              M=32, N=64    M=32, N=64    M=16, N=32
                              DF=224        DF=112        DF=112
Basic convolution             920           230           58
Depthwise separable conv      120           29            8
Bottleneck conv (t=1)         170           42            11
Bottleneck conv (t=2)         340           84            23

Table 4.2. Examples of computational costs in MFLOPS (millions of operations). The convolutions use 3x3 kernels and feature maps of size $D_F \times D_F$ with $M$ input and $N$ output channels.

The resolution multiplier $\rho$ can be added to the computational cost to get the final cost of a modifiable depthwise separable convolution in MobileNet:

$$FLOPS = D_K D_K \alpha M \rho D_F \rho D_F + \alpha M \alpha N \rho D_F \rho D_F, \quad (4.7)$$

where $\rho = 1$ is the baseline model and lower values are models with lower resolutions.

The following version of MobileNet, called MobileNetV2, uses the same principles as the first one [14]. New residual bottleneck layers are implemented as a basic building block to further increase the efficiency of the network. They include the depthwise separable convolutional layers with linear bottlenecks and inverted residuals.

Linear bottlenecks are used to prevent non-linearities such as ReLU from losing too much information. The non-zero volume of the result of a ReLU corresponds with a linear transformation. In addition, ReLU loses information on the channels it collapses except if the input can be represented as a lower-dimensional subspace. This low-dimensional data that is assumed to contain the necessary information can be captured with linear bottleneck layers within the convolution blocks, optimizing the neural network's performance.

The bottleneck layers are followed by expansions that again increase the number of channels. Inverted residuals are then used to work as shortcuts between the bottlenecks in the network. The computational cost of a bottleneck convolution is

$$FLOPS = D_F D_F M t (M + D_K^2 + N), \quad (4.8)$$

where $t$ is the expansion factor. It is computationally slightly more costly than a depthwise separable convolution. However, bottleneck layers can be used with smaller input and output dimensions with the same accuracy, resulting in a smaller cost.

To compare the differences in the computational costs of the presented types of convolutions, a few examples are given. Table 4.2 presents the computational costs in MFLOPS (millions of operations) for three combinations of channel and dimension values. All the convolutions use 3x3 kernels and have the same image dimensions in the input and output, but a different number of channels.
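The entries of Table 4.2 follow directly from Equations 4.2, 4.4 and 4.8. The helper below evaluates them for the first column (M = 32, N = 64, D_F = 224) and reports the results in millions of operations, which agrees with the tabulated values to within rounding.

```python
def basic_conv(dk, m, n, df):
    return dk * dk * m * n * df * df                     # Equation 4.2

def depthwise_separable_conv(dk, m, n, df):
    return dk * dk * m * df * df + m * n * df * df       # Equation 4.4

def bottleneck_conv(dk, m, n, df, t):
    return df * df * m * t * (m + dk * dk + n)           # Equation 4.8

dk, m, n, df = 3, 32, 64, 224                             # first column of Table 4.2
for name, cost in [("basic", basic_conv(dk, m, n, df)),
                   ("depthwise separable", depthwise_separable_conv(dk, m, n, df)),
                   ("bottleneck t=1", bottleneck_conv(dk, m, n, df, 1)),
                   ("bottleneck t=2", bottleneck_conv(dk, m, n, df, 2))]:
    print(f"{name}: {cost / 1e6:.0f} MFLOPs")
```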


From these numbers the computational differences can be better realized. The depthwise separable convolution has the smallest computational cost. However, as said, the bottleneck convolution can work just as well with smaller dimensions. Depending on the situation, a lighter form of convolution than the basic one should be used in lightweight applications.

4.3 SqueezeNet

SqueezeNet is a CNN architecture that focuses on minimizing the number of parameters and the size of a network [15]. Three main strategies are applied to squeeze the network.

Some of the 3x3 convolutions are changed to 1x1 convolutions to decrease the number of convolutional parameters. So-called squeeze layers are used to decrease the number of input channels to the 3x3 filters in the network, also decreasing the number of parameters. Finally, pooling layers that perform downsampling are concentrated towards the end of the network. This is done to increase the performance achievable with a limited number of parameters.

The SqueezeNet architecture consists of Fire modules and a convolutional layer at the beginning and end of the network. The Fire module consists of a squeeze layer with 1x1 convolutions which is connected to an expand layer with multiple 1x1 and 3x3 convolutions. The SqueezeNet architecture is presented in Figure 4.2. The numbers above the architecture represent the number of filters forwarded to the next module. Additionally, ReLU is applied to the squeeze and expand layers and a dropout of 0.5 is used after the last Fire module.

Figure 4.2. The SqueezeNet architecture. Adapted from [15].
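A Fire module of the kind described above could be sketched in PyTorch as follows; the squeeze and expand channel counts are example values rather than a prescription.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Sketch of a SqueezeNet Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expands."""
    def __init__(self, in_ch: int, squeeze_ch: int, expand_ch: int):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand_ch, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand_ch, 3, padding=1),
                                       nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        # The two expand branches are concatenated along the channel dimension.
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

out = Fire(96, 16, 64)(torch.randn(1, 96, 55, 55))
print(out.shape)   # torch.Size([1, 128, 55, 55])
```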

In addition to the minimization of parameters by the architecture, quantization and deep compression were used to further make the network smaller. Quantization changes the parameters to be represented with a smaller number of bits. Deep compression combines network pruning and Huffman coding to compress neural networks [16]. Network pruning sets parameters that are below a threshold to zero. This creates a sparser representation of the network. Huffman coding uses codewords to represent the weights in the network. Common weights are represented with fewer bits than uncommon ones to reduce the needed space for the network.



The SqueezeNet architecture can have a similar accuracy on the classification datasets to larger and well-performing networks even when the size of the network is reduced drastically [15]. Using deep compression with 6 bits and removing one third of the network parameters with pruning leads to a very small model. It still functions with the same accuracy, but other architectures can have 50 times the parameters and 500 times the size of SqueezeNet.

4.4 ShuffleNet

ShuffleNet introduces ShuffleNet units that make a network more efficient [17]. It utilizes pointwise group convolution combined with a channel shuffle operation. These reduce the computational cost of the convolutions in the network.

Training can be split across multiple GPUs with group convolutions. The grouped convolution layers only communicate in certain layers [18]. The parallelization can give a slightly increased accuracy and a shorter training time. The groups of the group convolutions can be combined with channel shuffle operations [17]. The channel shuffle operation is illustrated in Figure 4.3. It shuffles the outputs of the groups' convolutional layers. Less information is lost when a group provides input data to another group.

Figure 4.3. The channel shuffle operation on grouped convolutions. Adapted from [17].
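The channel shuffle operation itself is only a reshape and transpose over the channel dimension. A possible PyTorch sketch is shown below.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave the channels of grouped convolutions (Figure 4.3)."""
    n, c, h, w = x.shape
    # Split channels into groups, swap the group and per-group dimensions, then flatten.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(6).float().view(1, 6, 1, 1)    # channels 0..5 in 3 groups of 2
print(channel_shuffle(x, groups=3).flatten())   # tensor([0., 2., 4., 1., 3., 5.])
```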

ShuffleNet units are residual blocks, meaning that each unit has a residual branch with convolutional layers and a skip connection that combines the input of the unit with the output of the residual branch. The residual branch consists of a pointwise group convolution with a channel shuffle operation after it.

The shuffled output is given to a 3x3 depthwise convolution, which is followed by another pointwise group convolution. The output is combined with the input of the unit.

The computational cost of a ShuffleNet unit is

$$FLOPS = hw\left(\frac{2cm}{g} + 9m\right), \quad (4.9)$$

where $h \times w \times c$ is the input feature map size, $m$ the number of bottleneck output channels and $g$ the number of groups. This allows more channels with the same cost than ResNet, where the residual block's cost is

$$FLOPS = hw(2cm + 9m^2). \quad (4.10)$$

All convolutions of the ShuffleNet unit have batch normalisation. Additionally, the first pointwise group convolution and the combined output both have ReLU activations.

The ShuffleNet architecture consists of an initial convolutional layer and three stages of ShuffleNet units, each with multiple stacked units. The input size is halved between the stages. The number of groups used by the units can be modified with a network scaling factor to fit the computational needs.

ShuffleNet can use wide feature maps compared to other networks with the same computational budget due to its ShuffleNet units. However, the network is not as efficient on small computers with low power. Mobile computers do not have multiple GPUs and cannot apply the presented operations as well as more powerful computers.

ShuffleNetV2 has been developed to make the network more efficient for lightweight environments [19]. It is developed with minimal computational cost in mind. Equal channel width in a convolution's input and output minimizes the cost. The computational cost $B$ in FLOPS of a 1x1 convolution is

$$B = hwc_1c_2, \quad (4.11)$$

where $h$ and $w$ are the feature map dimensions and $c_1$ and $c_2$ the input and output channels. The memory access cost (MAC) of a convolution is

$$MAC \ge 2\sqrt{hwB} + \frac{B}{hw}, \quad (4.12)$$

which reaches its lower bound when the input and output channels are the same size.

Using many group convolutions increases the cost of the network. The MAC of a 1x1 group convolution is

$$MAC \ge hwc_1 + \frac{Bg}{c_1} + \frac{B}{hw}, \quad (4.13)$$

where $g$ represents the number of groups. From the equation we can see that the MAC increases when $g$ increases.

Network fragmentation between GPUs also decreases the efficiency of the network.

GPUs have good parallelization within them, so minimizing the need to move information to other GPUs can improve efficiency. Additionally, the use of element-wise operations such as ReLU or additions occupies much of the GPU processing time. Minimizing the use of these operations speeds up the network.


Figure 4.4. (a) The ShuffleNet unit, (b) the ShuffleNetV2 unit. Adapted from [19].

To tackle the mentioned issues that affect the computational cost, a new operation called channel split is introduced in ShuffleNetV2 to improve the ShuffleNet unit. Figure 4.4 compares the units in both versions. The feature channels are split into two branches at the beginning of each ShuffleNet unit. This produces two groups for the convolutions, so the 1x1 convolutions in the unit are no longer group convolutions. At the end of the unit, the branches are concatenated together instead of an add operation. The resulting concatenated feature map is shuffled with the channel shuffle operation.


5 NEURAL NETWORK BASED OBJECT DETECTION

5.1 Methods

Object detection introduces new problems for computer vision technologies. Instead of classifying an image into a single class, object detectors are required to localize multiple objects within the image. Figure 5.1 presents a comparison between classification and object detection. The first image contains the output of a classifier that can distinguish whether a photo is of a person or a dog. The second image contains the output of an object detector that locates objects in the image and classifies each of them as either a person or a dog.

Object detection is both a regression and a classification task. Regression is used to localize the object, meaning that the pixel location and size of the object are predicted. For example, an object can be represented by its width, height, and centre x and y coordinates. For visualization, a bounding box is drawn around the object based on the size and location output by the detector. Additionally, the object is classified, and the final output of the detector contains the location and class probabilities of the object.
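For example, a box in the centre-size representation can be converted to the corner coordinates used for drawing with a small helper like the one below (my own illustration, not code from this work):

def center_to_corners(cx, cy, w, h):
    # Convert (centre x, centre y, width, height) to (xmin, ymin, xmax, ymax).
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

print(center_to_corners(50, 40, 20, 10))  # (40.0, 35.0, 60.0, 45.0)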

In an object detection setting, a prediction is correct if its intersection over union (IoU) with a ground truth object exceeds the required threshold [6]. The IoU measures the overlap between the predicted object and the ground truth object. It can be calculated with

\mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}}. \qquad (5.1)

Figure 5.1. Comparison between classifying images and detecting objects in images.


Figure 5.2. Calculating the IoU of two bounding boxes.

Calculations of the IoU are visualized in Figure 5.2. One of the squares is the ground truth object in the image and the other is the prediction of the object detector. A common value for an IoU threshold is 0.5, meaning that the prediction is correct if the overlapping area is larger than the combined non-overlapping area of the two boxes. In the example figure that threshold is not yet satisfied, so the prediction would count as a false detection. The average precision of an object detector can be calculated with Equation 2.6 like with other machine learning methods.
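As a minimal sketch of Equation 5.1 (function and variable names are my own, not from this work), the IoU of two axis-aligned boxes given as corner coordinates can be computed as follows:

def iou(box_a, box_b):
    # Boxes are given as (xmin, ymin, xmax, ymax).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, below a 0.5 threshold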

The objects in the image need to be detected despite their different scales, aspect ratios and locations. For example, people could be upright or lying down in images, and they could be close to or far away from the camera. Objects could in principle be detected simply by classifying small parts of the image with a sliding window across the whole image. However, this would require a huge amount of computation, since different aspect ratios and scales would all have to be considered. There are multiple approaches that tackle this problem.

The histogram of oriented gradients (HOG) object detector is a common approach that uses normalized local histograms of image gradient orientations [20]. The gradient orientations can characterize edges and other intensity changes in local areas of images. The histograms of gradient orientations are combined over small areas into HOG descriptor blocks. The descriptors are tiled as overlapping detection windows on the image, and the window information is finally given to a linear support vector machine classifier to classify the regions.
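As an illustration of this pipeline, OpenCV ships a HOG descriptor together with a pre-trained linear SVM for pedestrian detection. The sketch below is only an assumed usage example (the image file name is a placeholder) and is not the method used in this work:

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())  # pre-trained linear SVM

image = cv2.imread("street_scene.jpg")  # placeholder file name
# Overlapping detection windows are evaluated at several scales.
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)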

Another approach, used for face detection, is the Viola and Jones algorithm [21]. It aims to be an efficient detector for finding faces in different poses. The approach first transforms the image into a representation that allows more efficient computation of its features. It then produces a large number of these features, which are given to a set of weak classification functions used to find initial locations of faces in window areas.

The large number of weak classifiers are trained together to form a strong classifier by focusing on the classifiers that are the most accurate for classifying the features. The windows with possible faces are then given to another set of classifiers arranged in a cascade structure.


Architecture   Presented features

Two-stage
  Faster-RCNN  Translation invariant region proposals
  R-FCN        Voting of regions of interest from position-sensitive score maps

One-stage
  SSD          Pre-defined anchor boxes for quick detections
  YOLO         Efficient backbone network and detections in pre-defined grids
  CornerNet    Detections with corner heatmaps straight from the convolutional layers

Table 5.1. Main features presented in the researched object detector architectures.

The areas are put through increasingly complex classifiers in this cascade. The simpler classifiers at the start of the cascade are used to efficiently discard non-face windows, while the remaining candidates pass through the more complex and accurate classifiers.
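The cascade idea is available, for example, through OpenCV's pre-trained Haar cascade face detector. The sketch below is an assumed usage example (the image file name is a placeholder) rather than part of this work:

import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("group_photo.jpg")  # placeholder file name
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Each candidate window passes through the cascade; easy negatives are rejected early.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)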

Recently, deep neural networks have become increasingly popular in object detection tasks due to their good performance and the availability of large training datasets. The complexity of the models allows them to generalize well to varied scenes.

The deep neural network object detectors are split into two-stage detectors and the more recent and efficient one-stage detectors.

Both two-stage detectors and one-stage detectors use an image classification convolutional neural network as their so-called backbone feature extractor. The image classifier is trained to extract features from the image, and a detection network is set on top of it to find the objects from the features.
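For instance, the convolutional part of a pre-trained classifier can be reused as a backbone by dropping its classification layers. The sketch below uses torchvision's MobileNetV2 purely as an illustration; it is not the training setup used later in this work:

import torch
from torchvision import models

backbone = models.mobilenet_v2(pretrained=True).features  # keep only the convolutional layers

image_batch = torch.randn(1, 3, 300, 300)   # dummy input batch
feature_maps = backbone(image_batch)        # e.g. a 1280-channel feature map of size 10x10
# A detection head would predict box locations and class scores from these features.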

Five of the most relevant object detector architectures are presented. One-stage detectors have a larger focus in this work because of their speed compared to two-stage detectors. Table 5.1 summarizes the main features introduced by each of the researched object detector architectures. The functionality and more detailed architecture of each method are presented in the following subsections.

5.2 Two-stage object detection

In deep learning, two-stage object detectors split the detection of objects into two stages [22]. In the first stage, a region proposal network (RPN) proposes regions where possible objects could be found. The region proposal network is usually built on a common image classification network with some of its last layers removed. The proposed regions are given to a separate bounding box regression and classification network, which crops the features of each proposal and predicts its class and refined bounding box. Commonly used two-stage detectors are Faster-RCNN and R-FCN.



Figure 5.3. The Faster-RCNN architecture. Adapted from [23].

Faster-RCNN

The Faster-RCNN (Region-based convolutional neural network) object detector constructs the RPN by adding two additional layers on top of the feature extractor [23]. One layer encodes the convolutional map into feature vectors, and the other calculates object scores and regressed bounds for region proposals of different scales and aspect ratios.

The Faster-RCNN architecture can be seen in Figure 5.3. Faster-RCNN shares the weights of its convolutional layers between the region proposal and detection stages to increase its speed. To generate the actual region proposals, a small network slides over the last shared convolutional layer, and the resulting proposals are classified by a Fast-RCNN [24] detection network. The small sliding network is fully connected to a window of the input convolutional feature map, and each of these windows is mapped to a lower-dimensional vector that is fed to the fully connected box-regression and classification layers.

At each location of the sliding window, k region proposals are predicted, which correspond to k anchors. Each anchor has a specific scale and aspect ratio, meaning that objects of different sizes and shapes can be passed to the classification layer. For a feature map of size W × H, the RPN predicts W × H × k proposals. The region proposal method is translation invariant, which gives it good accuracy.
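To illustrate the anchor mechanism, the sketch below generates the k anchor boxes for one sliding-window location; the scales and aspect ratios are assumed example values, not the exact ones used by Faster-RCNN:

from itertools import product

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    # k = len(scales) * len(ratios) anchors centred at (cx, cy), each with area scale**2.
    boxes = []
    for scale, ratio in product(scales, ratios):
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

print(len(anchors_at(16, 16)))  # k = 9, so a 38x50 feature map yields 38 * 50 * 9 proposals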

The Faster-RCNN network is trained in four steps. First, the RPN is initialized with a model that has been pre-trained on a classification dataset and fine-tuned for the region proposal task. Second, the separate Fast R-CNN detection network is trained with the proposals given by the first network. Next, the detector network is used to initialize the RPN training so that the shared convolutional layers are fixed and only the layers unique to the RPN are fine-tuned. Finally, with the shared layers still fixed, the layers unique to Fast R-CNN are fine-tuned.

All this is done in one training pipeline.
