

Joni Tepsa

AI-BASED OBJECT RECOGNITION ON RGBD CAMERA IMAGES

Faculty of Engineering and Natural Sciences

Bachelor’s thesis

April 2020


ABSTRACT

Joni Tepsa: AI-Based Object Recognition on RGBD Camera Images

Bachelor’s thesis

Tampere University

Bachelor of Science (Technology), Degree Programme in Engineering Sciences, Automation Engineering

April 2020

In this thesis, it was researched how RGBD camera images can be used in object recognition algorithms and how they affect the performance of the algorithm. In the beginning, the history of artificial intelligence and the kinds of object recognition algorithms that already exist are presented. Later on, the RGBD image structure is presented, and it is analyzed how such images are captured with a ZED camera.

In the practical part, the YOLOv3 algorithm was implemented, trained with the RGB-D Object Dataset, and then evaluated. Good learning results were achieved with the YOLOv3 algorithm.

During the evaluation, it became apparent that the implemented network was overfitting; potential reasons for this were analyzed and potential solutions discussed. One of the research questions was how well YOLOv3 compares to earlier research. For this question, there was no clear answer, since the dataset was not split in the same way and, thus, a direct comparison to earlier research was not possible.

In the end, several suggestions on how the implementation could be improved are presented, and a couple of future research topics are discussed.

Keywords: Machine learning, object recognition, YOLOv3, RGBD camera images, ZED camera, depth camera, epipolar geometry, convolutional neural networks, deep learning

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

Firstly, I would like to say this has been the biggest project I have done in the engineering world to this date, and I have learned so much from this. This became a longer project than I thought, and the last bits were written during my exchange studies in Austria.

I would like to thank Aref for giving such an interesting topic to write about, and I would like to thank Eelis for guiding and advising during this thesis. I appreciate that I got answers to my emails very quickly and sometimes during the same day.

Lastly, I want to thank all my friends and Hanna for being supportive through this long process and giving me motivation for finishing this thesis.

I look forward to taking all the lessons I learned during this thesis into my master’s studies, and later into writing my Master’s thesis.

Graz, 23.4.2020


TABLE OF CONTENTS

1. INTRODUCTION

2. OVERVIEW OF AI-BASED OBJECT RECOGNITION

2.1 Object recognition

2.2 Algorithms of object recognition

3. IMAGE ACQUISITION

3.1 RGBD image

3.2 ZED camera

4. THEORY AND METHODS

4.1 Epipolar geometry

4.2 Convolutional neural network

4.3 YOLOv3

5. IMPLEMENTATION AND RESULTS

6. CONCLUSIONS

BIBLIOGRAPHY


LIST OF FIGURES

Figure 1. Typical object recognition system adapted from (Dickinson 1999).

Figure 2. Decision tree on whether to go outside to play (Oritnk 2012).

Figure 3. Pixels in an RGB image are formed from the corresponding pixels of the three primary color component image planes (Raj 2018).

Figure 4. Example of a depth image (Stereolabs 2020e).

Figure 5. Visual representation of RGBD tensor.

Figure 6. ZED camera (Stereolabs 2020g).

Figure 7. Epipolar geometry between two cameras and a scene adapted from (Szeliski 2011).

Figure 8. Depth calculation from two camera image planes adapted from (Krutikova et al. 2017).

Figure 9. Visual presentation of a sigmoid neuron.

Figure 10. Step and sigmoid functions.

Figure 11. A connection from the local receptive field of size 5x5 to the next layer’s neuron in the convolution layer (Nielsen 2015).

Figure 12. RGB camera image and corresponding depth image from the RGB-D Object Dataset adapted from (Lai et al. 2011).

Figure 13. A mask image from the RGB-D Object Dataset (Lai et al. 2011).

Figure 14. The generated ground truth box over the mask image on the left and the RGB image on the right.

Figure 15. Structure of the YOLOv3 head.

Figure 16. Calculating IOU (Rosbrock 2016).

Figure 17. Validation data from the network on different training data sizes.


LIST OF SYMBOLS AND ABBREVIATIONS

2D 2-dimensional

3D 3-dimensional

AI Artificial intelligence

CNN Convolutional neural network

DPM Deformable parts model

fps frames per second

GAP Global average pooling

GPU Graphics processing unit

IOU Intersection over union

ISP Image signal processor

KSVM Gaussian kernel support vector machine

LinSVM Linear support vector machine

NMS Non-maximum suppression

PIL Python imaging library

RF Random forest

SAD Sum of absolute differences

SDK Software development kit

SVD Singular value decomposition

YOLO You only look once

𝑏 bias

𝑏ℎ height of the predicted bounding box in coordinates of the image

𝑏𝑤 width of the predicted bounding box in coordinates of the image

𝑏𝑥 𝑥-coordinate of the predicted bounding box in coordinates of the image

𝑏𝑦 𝑦-coordinate of the predicted bounding box in coordinates of the image

𝐵 number of bounding boxes predicted in YOLOv3

𝑐 first camera origin

𝑐0 left camera center in epipolar geometry

𝑐1 right camera center in epipolar geometry

𝑐𝑗 general presentation of any camera center

𝑐𝑥 distance between predicting cell’s left side and the top-left corner of the image

𝑐𝑦 distance between predicting cell’s top border and the top-left corner of the image

𝐶 camera calibration matrix

𝐶𝑗 general presentation of any camera calibration matrix

𝑑 disparity

𝑑0 disparity value of observed point represented in the left camera image plane

𝑑1 disparity value of observed point represented in the right camera image plane

𝑑𝑏𝑎𝑠𝑒 distance between cameras in a stereo rig

𝑑𝑝𝑙𝑎𝑛𝑒 distance between camera center and a corresponding image plane in a stereo rig

𝐷 depth value of a pixel in a depth map

𝐸 essential matrix

𝐸𝑆𝐴𝐷(𝑗) result of the sum of absolute differences in point 𝑗


𝑓𝑥 the focal length of the camera sensor in 𝑥-dimension

𝑓𝑦 the focal length of the camera sensor in 𝑦-dimension

𝐻𝑛 the neuron of the next hidden layer

𝐼 input data array

𝐼𝑚 identity matrix

𝐼0 template image point at discrete pixel location

𝐼1 corresponding image point to 𝐼0 in another image plane

𝐼𝑂𝑈𝑝𝑟𝑒𝑑𝑡𝑟𝑢𝑡ℎ intersection over a union of truth label and predicted bounding box

𝐾 kernel array

𝑀 height of an image in pixels

𝑁 width of an image in pixels

𝑜𝑐𝑥 center of optical 𝑥-axis

𝑜𝑐𝑦 center of optical 𝑦-axis

𝑝 observed point on the scene

𝑝0 observed point represented in the left camera image plane

𝑝1 observed point represented in the right camera image plane

𝑝𝑐 3-dimensional camera-centered point

𝑝ℎ height of the anchor

𝑝𝑤 width of the anchor

𝑅 rotation matrix

𝑅⃗ rotation

𝑅0 first camera canonical orientation

𝑅90° rotation matrix with rotation value 90°

𝑠 corrects skew between optical axes

𝑆 matrix containing orthogonal basis vectors and translation vector 𝑡̂

𝑡̂ translation vector

𝑡ℎ height of the predicted bounding box

𝑡𝑤 width of the predicted bounding box

𝑡𝑥 translation vector’s 𝑥-component

𝑡𝑥−𝑜𝑓𝑓 offset between predicting cell’s top-left corner and center of the predicted bounding box in 𝑥-dimension

𝑡𝑦 translation vector’s 𝑦-component

𝑡𝑦−𝑜𝑓𝑓 offset between predicting cell’s top-left corner and center of the predicted bounding box in 𝑦-dimension

𝑡𝑧 translation vector’s 𝑧-component

𝑡 translation

𝑢 displacement between two corresponding image pixel’s value

𝑢0 the first column of matrix U

𝑢1 the second column of matrix U

𝑈 left-singular vectors of the original matrix before singular value decomposition

𝑣0 the first column of matrix 𝑉

𝑣1 the second column of matrix 𝑉

𝑣2 the third column of matrix 𝑉

𝑉 right-singular vectors of the original matrix before singular value decomposition

𝑤𝑗 general presentation of any weight value in artificial neuron

𝑥𝑗 general presentation of any pixel measurement in the camera image plane

𝑥̂0 ray direction vector from the left camera center to the observed point

𝑥̂1 ray direction vector from the right camera center to the observed point


𝑥̂𝑗 general presentation of any ray direction vector from camera center to observed point

𝑦 output of an artificial neuron

𝑦𝑠𝑖𝑔𝑚𝑜𝑖𝑑 the output of a sigmoid neuron model

𝑦𝑠𝑡𝑒𝑝 the output of a perceptron neuron model

𝑧 output function

𝑍 singular values of cross product operator with a translation vector 𝑡̂

𝛼𝑖 general presentation of any input into a neural network layer

𝜎() sigmoid function

Σ singular values of the original matrix before singular value decomposition

[ 𝑗]× general presentation of matrix form of the cross product operator with the vector inside

Pr () general presentation of the probability

Pr (𝐶𝑙𝑎𝑠𝑠𝑖|𝑂𝑏𝑗𝑒𝑐𝑡) a set of conditional class probabilities


1. INTRODUCTION

Artificial intelligence (AI) is generally thought to be one of the main elements of Industry 4.0. AI is a growing trend in the world right now. In fact, according to Dean (2019), around 100 machine learning papers per day were published at the end of 2018.

Implementing object recognition on computers has been studied for a long time, and it now has many applications in our daily lives. For example, smartphones can be unlocked with facial recognition, and parking halls can automatically read license plates when cars enter. Object recognition makes it possible to increase the level of automation in factories and on robots, thus making it one of the critical elements in the world of automation.

The first version of the You Only Look Once (YOLO) deep neural network was released in 2016 (Redmon et al. 2016). YOLO was an immediate success in the world of object recognition, and Redmon continued the research on the YOLO algorithm; in 2018, the latest version of YOLO, called YOLOv3, was published (Redmon & Farhadi 2018). This new deep neural network was heavier than its predecessor but more capable in object recognition, achieving better results.

In 2010, Kinect was released as an accessory to the Xbox 360. Kinect is an RGB camera that also has an infrared emitter, thus making it possible to capture RGBD images (Cruz et al. 2012). Even though the Kinect was not designed for research purposes, OpenKinect was made publicly available soon after the release. The OpenKinect project was started by Hector Martin, who hacked the Kinect and made it possible to access all its data on a computer (Cruz et al. 2012). After a couple of years, Microsoft released an official software development kit (SDK) for the Kinect (Cruz et al. 2012). Kinect is considered the first cheap and widely available RGBD camera, and it revolutionized RGBD research.

In this thesis, the focus will be on researching how well the YOLOv3 neural network works with RGBD camera images and how well it performs compared to earlier research with RGBD camera images. The research will also answer how using RGBD camera images differs from using just RGB images.


First, there is a brief introduction to the history of AI and how AI can be defined. After the discussion about AI in general, the focus moves to object recognition, which is an essential part of this thesis. The third chapter’s first sub-chapter focuses on how RGBD images are formed and what their applications are. The second sub-chapter goes through how the ZED camera works, which is one of the publicly available RGBD cameras. In chapter four, all the theory needed for forming a depth image and for deep neural networks is discussed. Chapter five contains the implementation and results of the research done in this thesis. The last chapter presents the conclusions of the research.


2. OVERVIEW OF AI-BASED OBJECT RECOGNITION

The history of AI goes back to the early 1940s, when Warren McCulloch and Walter Pitts showed that logical connectives could be represented by networks of artificial neurons. In their work, the neurons are defined as either “on” or “off”, where a neuron is turned “on” when it receives enough stimulation from its neighboring neurons. Their work is acknowledged as the first work in the field of AI. (Russell et al. 2016)

Nowadays, AI has grown into such a large field of science that it has an immense number of applications. These applications vary from quite simple-looking chatbots to more complex systems like self-driving cars. Since AI is designed to be used for any intellectual task generally done by humans, and in any field of science, it can be said that AI is a universal field of science (Russell et al. 2016).

The term AI is extensive, and it causes misunderstandings. Many times, AI is considered to be like the human mind. According to Loukides and Lorica, this definition is problematic since we do not understand enough about the human mind (Loukides & Lorica 2018). Cronin, on the other hand, defines AI as a wide variety of things, from a reasoner and problem-solver to androids and everything in between (Cronin 2014). Cronin’s explanation is good, but a better description of what AI is was given by Achin at an AI conference in Japan, where his definition was: “AI is computer systems able to perform tasks that ordinarily require human intelligence.” (Achin 2017)

In this thesis, the focus will be on machine learning, which is a sub-field of AI. Machine learning aims to make the computer learn by itself without being directly programmed (Samuel 1959). The general structure of a machine learning algorithm is that the computer first learns to do something, for example object recognition, by going through a set of examples called the training dataset. The training dataset consists of trainable data, for example images, and truth labels. Truth labels define all the possible outputs of the algorithm, for example a category such as car. From the training dataset, the computer learns which features the examples with a specific truth label have in common, and so creates a mathematical model of which data features result in each truth label. When the computer has learned to perform the task well enough, it is fed with data that is unknown to it, and the result is a truth label for that previously unknown data. (Louridas & Ebert 2016)
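As a minimal illustration of this train-then-predict workflow (not part of the thesis; scikit-learn and its small digits dataset are used here purely as stand-ins), a classifier could be trained on labeled examples and then asked to label previously unseen data roughly like this:

```python
# Minimal train-then-predict workflow: learn from examples with truth labels,
# then predict labels for previously unseen data.
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()                                   # small example image dataset
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                              # training phase
print("accuracy on unseen images:", model.score(X_test, y_test))
```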


A considerable number of object recognition systems are powered by machine learning. In this thesis, the implementation part of object recognition is done with machine learning. Next, sections 2.1 and 2.2 concentrate in more detail on object recognition and on the algorithms used in object recognition.

2.1 Object recognition

The first tests and research in computer vision were done in the early 1960s. This research focused on pattern recognition in an office environment (Roberts 1960; Tippett et al. 1965, according to Andreopoulos & Tsotsos 2013). As can be seen, AI-based object recognition is not a new thing and has interested researchers since the invention of computers. Object recognition is one of the sub-areas of computer vision, and in the context of this thesis, the focus will be on that area.

The object recognition problem can be divided into four smaller levels. This makes it easier to describe the problem and divides the understanding of object recognition algorithms into more logical steps. These four steps are:

1. Detection, whether an object is present in the image

2. Localization, detection plus the accurate location of the object

3. Recognition, localization plus an accurate description of the object

4. Understanding, recognition plus follow-up actions. (Tsotsos et al. 2005)

Detection can be done for the whole image or just a part of it at a time. There can be algorithms that look for specific color values or shapes, or even both at the same time. The detection part answers the question of whether there is a desirable object in the picture. Usually, the answer is a probability telling how likely it is that an object is in the image. With some threshold value, the answer can be turned into a true-or-false decision.

Localization is the part where the algorithm decides where in the picture the desirable object is. In this part, the object’s exact location is determined. The solution can be described in coordinates, in pixels, or even with words like “behind” or “in front”.

Recognition is an essential part of object recognition because it tells what the observed object is. This part is also the most complex one, and here algorithms differ the most in how they work. Recognition compares data of the object image to previously known data and concludes what the object’s class is. The data comparison can be made in a lot of different ways, such as comparing probabilities, pixels, or even features like variances, to determine the object’s class.


Understanding is the last part of the object recognition problem, and it depends heavily on the recognition part. In this part, the objective is to determine what the algorithm does with the knowledge of the object’s class. For instance, a robot’s object recognition algorithm recognizes a block on the table; the algorithm then decides what to do with this object, which could be picked up or just ignored. Usually, this is the most prominent part of object recognition.

In object recognition, the data acquisition process can be divided into passive and active approaches. When there is no way of controlling data acquisition, the system is referred to as a passive approach. An active approach means that the object recognition system has an intelligent way of improving data acquisition by adjusting sensors to get better performance, for example by moving a sensor closer to the object. In mobile robot applications, an active approach increases the autonomy of the robot and gives the robot better images, for example to use in object recognition. For those reasons, an active approach is, in many cases, a desirable feature in mobile robot applications. (Chen et al. 2011) There are also significant improvements in vision systems where mobility and power efficiency become a problem when using an active versus a passive approach. If an active approach controls only one parameter, for instance zoom, it can be classified as a limited active approach. Active approaches that control more than one parameter can be described as fully active approaches. Since passive approaches are more straightforward to implement, they are a lot more popular than active approaches. (Andreopoulos & Tsotsos 2013)

2.2 Algorithms of object recognition

As computational capabilities have increased over the years and distributed systems have become more capable, learning algorithms have become much more important in the object recognition problem (Andreopoulos & Tsotsos 2013). One of the most significant improvements in learning algorithms has been the emergence of convolutional neural networks (CNNs), which will be discussed in more detail in section 4.2.

Object recognition systems typically contain the same process steps. One way to present a typical object recognition system is by dividing it into the following components: feature extraction, feature grouping, hypothesizing objects, and verifying objects (Dickinson 1999). An example of an object recognition system is shown in the following Figure 1.


Typical object recognition system adapted from (Dickinson 1999).

Firstly, there is feature extraction, where features, for example mean values, are extracted. Features are selected based on the application and on what the recognizable objects are. Secondly, there is feature grouping, where the extracted features are grouped into functional collections. These collections are called indexing primitives. Hypothesizing objects compares indexing primitives to prior data and returns candidate objects that might be correctly classified. The last part is verifying objects, where each candidate object is assigned a comparable value, for example a probability, which tells how close it is to the looked-for object. Then, from all the scores, the optimal one is chosen based on some attribute, and thus the object is recognized. (Dickinson 1999)

When comparing Figure 1 to the list of object recognition sub-problems in the earlier chapter 2.1, many similarities can be seen. The detection sub-problem happens between extracting features and grouping features, and the localization sub-problem happens at the same time. Since candidate objects have a score or probability, they must also have an accurate description, meaning that recognition happens partly in the hypothesize objects step of Figure 1. In the understanding part, objects are recognized and some follow-up actions might be defined; hence understanding happens after verifying objects in Figure 1.

Object recognition algorithms can be divided into real-time and slower-than-real-time algorithms. The interest in real-time algorithms comes from the fact that the range of possible applications grows hugely compared to slower-than-real-time algorithms. All algorithms that run at least 30 frames per second (fps), where one frame is one image, are identified as real-time algorithms (Sadeghi & Forsyth 2014). The following algorithms that are looked at in more detail are not real-time algorithms.

There is a vast number of different datasets available for training and evaluating object recognition systems. However, only a minimal number of datasets contain both color and depth images. One of these datasets is the RGB-D Object Dataset, which is publicly available (Lai et al. 2011). The implementation phase of this thesis uses this dataset. It was chosen because it has been widely used for research purposes, and thus earlier research can be used as a point of reference for the implementation. The RGB-D Object Dataset will be discussed in chapter 5 in more detail.


The first algorithm that will be looked at, which was used with the RGB-D Object Dataset, is the Linear Support Vector Machine (LinSVM) (Lai et al. 2011). The main idea of the LinSVM algorithm is to classify data using a high-dimensional feature space 𝑍. Firstly, the algorithm transforms every data point into a feature vector. These vectors are then mapped to the high-dimensional feature space 𝑍, where a linear decision surface is constructed. Before the linear decision surface is constructed, support vectors are determined for every class. Support vectors define which data points belong to which class. The linear decision surface is then built so that the distance between the support vectors is maximized. (Cortes & Vapnik 1995)

The second algorithm that will be looked at, which was used with the RGB-D Object Dataset, is the Gaussian kernel support vector machine (KSVM) (Lai et al. 2011). The Gaussian kernel support vector machine uses the same principle for object recognition as LinSVM, but with the ability to handle non-linear data. KSVM uses the kernel trick to transform example data from a lower-dimensional space to a higher-dimensional, or even infinite-dimensional, space. The kernel trick means using one of the kernel functions to transform the data to a higher dimension. KSVM uses the Gaussian kernel, also known as the Radial Basis Function. Applying the kernel trick makes it possible to fit a linear decision surface to the data and, thus, to use a support vector machine in decision making. With the kernel trick, the dimensionality of the space does not affect the design of the linear classifier. (Theodoridis & Koutroumbas 2008) Using KSVM instead of LinSVM is slower in training and evaluation since there are more calculation operations due to the kernel calculations (Chang et al. 2010).
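To make the difference between the two classifiers concrete, the following sketch (not from the thesis; scikit-learn and a toy two-moons dataset are used only as convenient stand-ins) trains a linear SVM and a Gaussian-kernel SVM on the same feature vectors:

```python
# Minimal LinSVM vs. Gaussian-kernel SVM comparison on toy feature vectors.
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # non-linear toy data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin_svm = SVC(kernel="linear").fit(X_tr, y_tr)       # linear decision surface
ksvm = SVC(kernel="rbf", gamma=1.0).fit(X_tr, y_tr)  # Gaussian (RBF) kernel trick

print("LinSVM accuracy:", lin_svm.score(X_te, y_te))
print("KSVM accuracy:  ", ksvm.score(X_te, y_te))
```

On clearly non-linear data like this, the RBF-kernel version typically separates the classes better, at the cost of extra kernel computations.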

Random forest (RF) is a machine learning algorithm that uses decision trees. A decision tree can be thought of as a function whose input is a vector of values and whose output is a single value (Russell et al. 2016). Figure 2 shows an example of a simple decision tree. The decision tree reaches its output by doing comparison tests on each value of the input vector. The tests are visualized in Figure 2 as white rectangles with rounded corners; they are done on the values of the input vector and are called internal nodes (Russell et al. 2016). Branches leave from each internal node, showing all the possible outcomes of the test. For example, the test Windy has two potential outputs, true and false. The decision tree’s output is called a leaf node, and these are visualized in Figure 2 as grey squares containing the output value of the tree. (Russell et al. 2016) The random forest contains a selected number of sub-trees that are built using a randomized subsample of the available input data. For example, if the input data contains 100 points, the sub-trees are built with 10 points that are randomly selected for each tree independently from the input data.

Decision tree on whether to go outside to play (Oritnk 2012).

There are plenty of different algorithms for deciding how many sub-samples to use. In the evaluation phase, the test data is fed to all decision trees, and each tree independently votes on the output class of the test data. The algorithm then chooses the class that has the most votes. (Breiman 2001) Using random forests on the RGB-D Object Dataset, Lai et al. (2011) got similar results compared to the LinSVM and KSVM algorithms.
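The per-tree voting can be made visible with a short sketch (not from the thesis; scikit-learn, its iris dataset, and the 10-percent subsample size are arbitrary stand-ins), as shown below:

```python
# Sketch of random-forest voting: each tree predicts independently and the
# forest returns the majority class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, max_samples=0.1, random_state=0)
forest.fit(X, y)

sample = X[:1]
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("individual tree votes:", votes)
print("vote counts per class:", np.bincount(votes))
print("majority decision:    ", int(forest.predict(sample)[0]))
```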

The last algorithm looked at in more detail in this chapter is the convolutional k-means descriptor. The convolutional k-means descriptor is an unsupervised learning algorithm that is very scalable, since it uses all available data and uses the data in a unified way (Blum et al. 2012). Unsupervised learning means that the algorithm does not have pre-defined classes into which it categorizes input; instead, it creates the classes itself (Liu 2011). In the training phase, the convolutional k-means descriptor gathers a set of training patches around interest points by extracting the surrounding area of each point. The interest points are chosen with different strategies, which are not discussed in this chapter. After extraction, some pre-processing is done to the patches. From these processed patches, the k-means algorithm learns centroids for each class and adds the centroids to a feature response directory. In the evaluation phase, the algorithm runs an interest point detection algorithm on the input image. Again, the algorithm extracts the surrounding areas and calculates feature responses from them. Then, for each feature response, the distance to all centroids in the feature response directory is calculated. All the distances that are closer than the average distance are kept and summed together, forming a feature histogram. The feature histograms are then fed to LinSVM, which outputs the class of the input image. (Blum et al. 2012)

On the RGB-D Object Dataset, Blum et al. (2012) got an excellent result that beat all the previous results. Blum et al. (2012) only performed the RGBD variation of the experiments on the RGB-D Object Dataset; thus, there is no data on how well the convolutional k-means descriptor performs using only RGB pictures on the dataset. However, in another study, Blum et al. (2012) found that switching from RGB images to RGBD images increased the accuracy from 82,9 % to 89,5 %.

The following Table 1 collects the performance of all the algorithms looked at in detail in this chapter.

Table 1. Performance of different algorithms on the RGB-D Object Dataset.

Algorithm                                             RGB (%)        RGBD (%)
LinSVM (Lai et al. 2011)                              74,3 ± 3,3     81,9 ± 2,8
KSVM (Lai et al. 2011)                                74,5 ± 3,1     83,8 ± 3,5
RF (Lai et al. 2011)                                  74,7 ± 3,6     79,6 ± 4,0
Convolutional k-means descriptor (Blum et al. 2012)   -              86,4 ± 2,3


3. IMAGE ACQUISITION

In this thesis, object recognition was done on RGBD images captured with an RGBD camera. These images might increase the object recognition percentage compared to standard RGB images. On the other hand, adding a 4th element to every pixel increases the number of parameters calculated by the algorithm and will probably make computation slower compared to RGB images. In this section, the RGBD image is introduced in detail, as well as a depth camera.

3.1 RGBD image

An RGBD image can be thought of as a version of an RGB image containing more information. An RGBD image consists of an RGB image and a depth image, which contains geometrical information.

The RGB image is a data format used widely in computer science that displays images in the RGB color model. It consists of three components: red, green, and blue. These components are called primary colors, and mixing them in different proportions produces all the colors that can be seen (Butterfield et al. 2016). An image represented in digital format consists of 𝑀 ⋅ 𝑁 pixels, where 𝑀 is the height of the image in pixels and 𝑁 is the width of the image in pixels. Each pixel contains three values that each represent one primary color plane of the image. These values are within an agreed range; for example, in MATLAB pixel values are between 0 and 255. Combining these three primary color planes results in a viewable RGB image. The following Figure 3 is a visual representation of the structure of an RGB image.

Pixels in an RGB image are formed from the corresponding pixels of the three primary color component image planes. (Raj 2018).
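As a small illustration (not from the thesis), an RGB image can be represented in Python as an M x N x 3 array of 8-bit values, one plane per primary color:

```python
# An RGB image as an M x N x 3 array: one 8-bit plane per primary color.
import numpy as np

M, N = 4, 6                                    # image height and width in pixels
rgb = np.zeros((M, N, 3), dtype=np.uint8)      # all-black image
rgb[:, :, 0] = 255                             # set the red plane to full intensity
print(rgb.shape)                               # (4, 6, 3)
print(rgb[0, 0])                               # one pixel: [255, 0, 0], i.e. pure red
```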


The depth image is a data format that represents a depth map. A depth map is an image that contains a depth value for every possible pixel in the picture. The values of a depth image can be represented in various ways; for example, the ZED camera outputs depth data directly in meters (Stereolabs 2020d). Usually, it is not possible to show depth maps on the computer screen when the data units of the image are meters. Therefore, it is best to transform the depth data from meters to a data type that the used programming language can display. For example, in the Python programming language, there is a library called the Python Imaging Library (PIL). When using PIL, a programmer can display images if the data consists of either integer values between 0 and 255 or floating-point values between 0 and 1 (Clark & Contributors Handbook 2020). Depth images are, like RGB images, easy for humans to visualize. The following Figure 4 is an example of a depth map image.

Example of a depth image (Stereolabs 2020e).
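As a hedged sketch of the conversion described above (not from the thesis), a depth map in meters could be scaled into the 0-255 range and saved with PIL roughly like this; the 0-20 m clipping range is an assumption matching the ZED camera's stated maximum depth:

```python
# Scale a metric depth map to 8-bit gray values so PIL can display it.
import numpy as np
from PIL import Image

depth_m = np.random.uniform(0.5, 20.0, size=(376, 672))   # fake depth map in meters
max_range = 20.0                                           # assumed maximum depth (m)

depth_8bit = np.clip(depth_m / max_range, 0.0, 1.0) * 255  # map [0, 20] m -> [0, 255]
img = Image.fromarray(depth_8bit.astype(np.uint8), mode="L")
img.save("depth_preview.png")                              # viewable grayscale image
```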

Combining an RGB image and the corresponding depth image results in an RGBD image. An RGBD image is rather hard for humans to visualize, since the depth plane just messes up the color planes, and humans can also sense some depth from the RGB image alone. In tensor format, the depth image is added to the RGB image as a 4th value for every pixel in the image. The following Figure 5 visualizes how an RGBD image tensor looks.

RGBD images open up new potential applications that are hard or even impossible with just standard RGB images, for example pose recognition and object segmentation (Cruz et al. 2012). RGBD images contain more data than RGB images, therefore making it possible to do more detailed object recognition.
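As a minimal sketch (not from the thesis), the RGBD tensor can be formed by stacking the depth map onto the three color planes as a fourth channel; the normalization to 0-255 is an assumption so that all channels share the same range:

```python
# Stack a depth map onto an RGB image to form an M x N x 4 RGBD tensor.
import numpy as np

M, N = 480, 640
rgb = np.random.randint(0, 256, size=(M, N, 3), dtype=np.uint8)       # fake RGB image
depth = np.random.uniform(0.5, 20.0, size=(M, N)).astype(np.float32)  # fake depth (m)

# Normalize depth to 0-255 so all four channels share the same value range.
depth_8bit = (np.clip(depth / 20.0, 0.0, 1.0) * 255).astype(np.uint8)
rgbd = np.dstack([rgb, depth_8bit])            # shape (480, 640, 4)
print(rgbd.shape, rgbd.dtype)
```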


Visual representation of RGBD tensor.

Some RGBD camera systems use infrared light for depth measurement. Infrared light is not suitable for outdoor use, which limits the potential applications of some cameras (Cruz et al. 2012). Infrared light also has another problem with materials that do not reflect infrared light well or at all (Cruz et al. 2012). Adding a depth camera to an RGB camera also adds one more layer where noise can occur; therefore, a noise filtering method must be applied to get rid of the noise. Using RGBD images also increases the computation needed in recognition systems.

Much research has been done on RGBD images, and they also have a lot of practical applications. One application that benefits a lot from depth data is constructing a virtual skeleton of a human standing in front of the sensor. With this skeleton, it is possible to make gesture-controlled applications (Cruz et al. 2012). A second application that is widely used with RGBD images is 3-dimensional (3D) reconstruction of the scene. Traditional ways use a dedicated scanner to construct 3D images. These scanners are usually more expensive and bigger than RGBD cameras, thus making RGBD cameras excellent for mobile applications. The 3D images are, for example, applied in augmented reality applications or in controlling a robot based on the scene it sees. (Cruz et al. 2012)


3.2 ZED camera

In this chapter, the ZED camera is looked into in more detail. The ZED camera is one of the cameras used to take RGBD pictures and videos, developed and sold by Stereolabs. The ZED camera is very flexible and offers a lot more than just RGBD image capturing. It also has motion, position, and environmental sensors (Stereolabs 2020c), which makes it possible to have all these sensor readings behind one application programming interface offered by Stereolabs (Stereolabs 2020b). ZED cameras come from the factory already calibrated (Stereolabs 2020a), which makes setting them up easy and fast. The ZED SDK has a built-in feature for using multiple ZED cameras at the same time on the same computer (Stereolabs 2020h). The following Figure 6 shows a picture of the ZED camera.

ZED camera (Stereolabs 2020g).

The ZED camera has two identical camera sensors that are separated by 12 cm (Stereolabs 2020d). Both camera sensors can record up to 4 million pixels at 2208x1242 resolution (Stereolabs 2020b). The ZED camera has an image signal processor (ISP), which synchronizes the left and right video frames into a side-by-side format. The ISP is also able to run different kinds of image processing algorithms on the images. (Stereolabs 2020b)

The ZED camera can measure depth up to 20 m at up to 100 fps. The ZED camera forms the depth image from the left and right RGB images by comparing pixel values between these two images (Stereolabs 2020f). The depth image is formed using epipolar geometry, which is discussed in more detail in chapter 4.1. Since the ZED camera uses epipolar geometry to form the depth image, it can be used both indoors and outdoors (Stereolabs 2020f), giving it an advantage over other RGBD cameras that use infrared light to measure depth and cannot operate outdoors.

The ZED camera has many different video modes with different image resolutions and frame rates. The lowest resolution is 1344x376, which offers up to 100 fps, and the highest resolution is 4416x1242, which only has 15 fps (Stereolabs 2020b). The ZED SDK makes it possible to record RGBD videos in various video packaging standards like H.264, H.265, or lossless compression (Stereolabs 2020h). It is possible to use ZED cameras on multiple different operating systems, including Windows 10, Ubuntu, et cetera. The ZED SDK is supported in many modern programming languages, for example C++ and Python. (Stereolabs 2020f)


4. THEORY AND METHODS

In this section, the theory and understanding needed to implement the YOLOv3 object recognition system are looked into in more detail. The first chapter focuses on epipolar geometry, which is widely used in computer vision. Although epipolar geometry is not needed in the YOLOv3 algorithm, the ZED camera uses it to form the depth picture. After the first chapter, the focus moves to theory more relevant to YOLO.

4.1 Epipolar geometry

Epipolar geometry is used as projection geometry, which allows representing every point from one view in another view, making it possible to do depth calculations from standard RGB images. Epipolar geometry is mainly used in stereo vision rigs and also with active infrared camera applications, for example with the Intel D435 camera (Intel 2020). It has a lot of applications in the field of computer vision and mobile robots.

Epipolar geometry can be thought to be universal since it does not depend on the scene structure; the observed object does not matter. Instead, epipolar geometry depends on the internal parameters of the cameras and on their relative pose (Hartley & Zisserman 2004). Generally, the cameras used in stereo image systems are carefully chosen for the system, and they are calibrated, which means that all the internal properties of a camera are known. Therefore, this chapter only discusses calibrated methods of epipolar geometry, even though uncalibrated methods also exist.

In order to be able to map 3D camera-centered points 𝑝𝑐 to 2-dimensional (2D) pixel coordinates, a camera calibration matrix 𝐶 is needed. There exist two kinds of parameters for cameras, internal and external. From these two, only internal parameters of a camera define the camera calibration matrix, and the external parameters need to be calculated from the stereo image. One of the ways to represent the camera calibration matrix 𝐶 is

C = \begin{bmatrix} f_x & s & o_{cx} \\ 0 & f_y & o_{cy} \\ 0 & 0 & 1 \end{bmatrix} ,    (1)

where 𝑓𝑥 and 𝑓𝑦 are focal lengths for the sensor in 𝑥 and 𝑦 dimension, 𝑠 is a parameter that corrects any skews between sensors, and lastly 𝑜𝑐𝑥 and 𝑜𝑐𝑦 are the centers of the optical axes in pixel coordinates. (Szeliski 2011)
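As a small numerical sketch (not from the thesis; the focal lengths and optical center below are made-up example values), a camera-centered 3D point can be projected to pixel coordinates with a calibration matrix of the form in equation (1):

```python
# Project a 3D camera-centered point p_c to pixel coordinates with a calibration
# matrix C (equation 1). Focal lengths and optical center are made-up examples.
import numpy as np

fx, fy = 700.0, 700.0          # focal lengths in pixels (assumed values)
ocx, ocy = 640.0, 360.0        # optical center in pixels (assumed values)
s = 0.0                        # no skew between the optical axes

C = np.array([[fx, s,  ocx],
              [0., fy, ocy],
              [0., 0., 1. ]])

p_c = np.array([0.2, -0.1, 2.0])       # camera-centered 3D point (meters)
homogeneous = C @ p_c                  # projective image coordinates
pixel = homogeneous[:2] / homogeneous[2]
print("pixel coordinates:", pixel)
```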


The following Figure 7 visually represents the relation between two camera image planes capturing the same object on the scene. The main component making it possible to represent points from one image plane on another is the epipolar plane. The epipolar plane is the plane spanned by the two camera centers 𝑐0 and 𝑐1, the observed point 𝑝 on the 3D scene, and the baseline connecting the camera centers; all of these lie coplanar to each other. Since all these points lie on the epipolar plane, it is possible to connect points between the image planes via the observed point and the rays projecting from the image planes to the observed point 𝑝. An epipolar line is the line where the epipolar plane and a camera image plane intersect. For every epipolar line in one camera image plane, a corresponding one can be found in the other. Thus, when searching for corresponding points between the camera image planes, the search can be limited to corresponding epipolar lines. (Hartley & Zisserman 2004)

Epipolar geometry between two cameras and a scene adapted from (Szeliski 2011).

The most fundamental property in epipolar geometry is that the relative position between two cameras can be encoded by a rotation 𝑅⃗ and a translation 𝑡. When starting to estimate the relative position between two cameras, the first camera can be set at the origin 𝑐 = 0 and at a canonical orientation 𝑅0 = 𝐼𝑚, where 𝐼𝑚 is an identity matrix of size 3 by 3. This representation has the benefit of not losing generality. (Szeliski 2011)

The ray direction vector from any camera center 𝑐𝑗 to the observed point 𝑝 on the scene can be defined in a general form as

\hat{x}_j = C_j^{-1} x_j ,    (2)

where 𝐶𝑗 is the corresponding camera’s calibration matrix and 𝑥𝑗 is the pixel measurement in the corresponding camera image plane (Szeliski 2011).



From here on, the left camera is marked with subscript 0 and the right camera with subscript 1. The observed point 𝑝 in the scene is defined in the left image by the following equation

p_0 = d_0 \hat{x}_0 ,    (3)

where 𝑑0 is the disparity value and 𝑥̂0 is the ray direction vector (Szeliski 2011).

Now, to transfer the point 𝑝 from the left image to the second image, the right image, it can be mapped by the following transformation

d_1 \hat{x}_1 = p_1 = R \, p_0 + \hat{t} ,    (4)

where 𝑑1 is the disparity value of 𝑝 in the second image, 𝑥̂1 is the second ray direction vector, 𝑝1 is the observed point’s location in the second image, 𝑅 is the rotation matrix, and 𝑡̂ is the translation vector. (Szeliski 2011)

Now it is possible to substitute 𝑝0 in equation (4) by its definition in equation (3), which results in the following equation:

d_1 \hat{x}_1 = R (d_0 \hat{x}_0) + \hat{t} .    (5)

Next, both sides are crossed with the translation vector 𝑡̂ (which rotates the vectors by 90°), resulting in

d_1 [\hat{t}]_\times \hat{x}_1 = d_0 [\hat{t}]_\times R \, \hat{x}_0 ,    (6)

where [𝑡̂]× is the matrix form of the cross product operator with the vector 𝑡̂ = [𝑡𝑥, 𝑡𝑦, 𝑡𝑧], defined in matrix form as

[\hat{t}]_\times = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix} ,    (7)

where 𝑡𝑥, 𝑡𝑦, and 𝑡𝑧 are the 𝑥-, 𝑦-, and 𝑧-components of the translation vector 𝑡̂. (Szeliski 2011)

A skew-symmetric matrix has the property that when it is multiplied by the same vector on both sides, the result is zero; in matrix form, the other vector must be transposed. (Szeliski 2011) To use this property, equation (6) is multiplied by 𝑥̂1ᵀ from the left side, which results in the following equation

\hat{x}_1^T d_1 [\hat{t}]_\times \hat{x}_1 = \hat{x}_1^T d_0 [\hat{t}]_\times R \, \hat{x}_0 .    (8)

Because 𝑑1 and 𝑑0 are scalars, they can be swapped with 𝑥̂1ᵀ on both sides,

d_1 \hat{x}_1^T [\hat{t}]_\times \hat{x}_1 = d_0 \hat{x}_1^T [\hat{t}]_\times R \, \hat{x}_0 .    (9)

Now the left side has a skew-symmetric matrix multiplied by the same vector on both sides, making the left side equal to zero:

0 = d_0 \hat{x}_1^T [\hat{t}]_\times R \, \hat{x}_0 .    (10)

By applying the zero-product property to equation (10) and defining the essential matrix 𝐸 as

E = [\hat{t}]_\times R ,    (11)

equation (10) results in an equation which is called the epipolar constraint (Szeliski 2011)

\hat{x}_1^T E \, \hat{x}_0 = 0 .    (12)

In order to be able to map points with triangulation from one image to another, the epipolar constraint (12) must always hold (Hartley & Zisserman 2004), making it one of the essential definitions in epipolar geometry.
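As a quick numerical illustration (not from the thesis; the rotation angle, baseline, and test point are arbitrary choices), one can build an essential matrix from a known relative pose and verify that the epipolar constraint (12) holds for a simulated point pair:

```python
# Verify the epipolar constraint x1^T E x0 = 0 for a synthetic two-camera setup.
import numpy as np

def skew(t):
    """Matrix form of the cross product operator [t]x (equation 7)."""
    tx, ty, tz = t
    return np.array([[0, -tz, ty],
                     [tz, 0, -tx],
                     [-ty, tx, 0]])

# Arbitrary relative pose: small rotation about the y-axis plus a baseline shift.
angle = np.deg2rad(5.0)
R = np.array([[np.cos(angle), 0, np.sin(angle)],
              [0, 1, 0],
              [-np.sin(angle), 0, np.cos(angle)]])
t = np.array([0.12, 0.0, 0.0])            # 12 cm baseline, like the ZED camera

E = skew(t) @ R                            # essential matrix (equation 11)

p0 = np.array([0.3, -0.2, 2.5])            # point in the first camera's frame
p1 = R @ p0 + t                            # same point in the second camera's frame

x0 = p0 / p0[2]                            # normalized ray directions
x1 = p1 / p1[2]
print("epipolar constraint residual:", x1 @ E @ x0)   # ~0 up to rounding error
```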

In order to calculate the disparity map, first the translation vector 𝑡̂ and the rotation matrix 𝑅 must be estimated. The translation vector 𝑡̂ can be estimated by taking the singular value decomposition (SVD) of the essential matrix 𝐸

E = [\hat{t}]_\times R = U \Sigma V^T = \begin{bmatrix} u_0 & u_1 & \hat{t} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} v_0^T \\ v_1^T \\ v_2^T \end{bmatrix} ,    (13)

where 𝑈 is a matrix containing the left-singular vectors of 𝐸, Σ contains the singular values of 𝐸, 𝑉 is a matrix containing the right-singular vectors of 𝐸, 𝑢0 and 𝑢1 are the column vectors of the matrix 𝑈, and 𝑣0, 𝑣1, and 𝑣2 are the column vectors of the matrix 𝑉. (Trefethen & Bau 1997; Szeliski 2011)

Equation (13) shows the result of taking the SVD of the essential matrix 𝐸 when there are no noisy measurements. Thus, it can be seen that the last singular value of the Σ matrix equals zero; in noisy cases, it is a non-negative real number. (Szeliski 2011)

After estimating the translation vector 𝑡̂, it becomes possible to calculate the rotation matrix 𝑅. The cross product operator [𝑡̂]× can be defined as

[\hat{t}]_\times = S Z R_{90°} S^T ,    (14)

where 𝑆 is a matrix containing orthogonal basis vectors and 𝑡̂, 𝑍 is a matrix containing the singular values of [𝑡̂]×, and 𝑅90° is a rotation matrix with a rotation value of 90°. (Szeliski 2011) Combining equations (13) and (14) results in the following equation

S Z R_{90°} S^T = U \Sigma V^T .    (15)

In the case when there are no noisy measurements, which is assumed here, the following similarities hold: 𝑆 = 𝑈 and 𝑍 = Σ. This means that the components 𝑆, 𝑈, Σ, and 𝑍 can be eliminated from equation (15) (Szeliski 2011), which results in

R_{90°} U^T R = V^T ,    (16)

and solved with respect to 𝑅

R = U R_{90°}^T V^T .    (17)

There are four possible rotation matrices 𝑅 that can be generated with the following equation

R = \pm U R_{90°}^T V^T .    (18)

All four rotations need to be calculated since there is no other information available about 𝐸 and 𝑡̂ other than their sign. Also, with the matrices 𝑈 and 𝑉 it is possible to take their additive inverse and still get a valid result from the SVD. After calculating all four possible rotation matrices 𝑅, only the two that satisfy the equation |𝑅| = 1 are kept. Both of the remaining potential rotation matrices 𝑅 are paired with both possible signs of the translation direction ±𝑡̂, giving four different combinations. The correct rotation matrix 𝑅 is the one for which the largest number of points is seen in front of both cameras. (Szeliski 2011)

After calculating the correct rotation matrix 𝑅, it can be applied to one of the cameras, making both camera image planes coplanar. Although the camera image planes are now coplanar, they are still not row-aligned. (Szeliski 2011)

Before calculating all the corresponding points from the aligned camera image planes, some pre-processing is generally done to each image plane, but this is not discussed in this section (Krutikova et al. 2017).

In order to be able to calculate a disparity map from the camera image planes, all points from one of the image planes must be found in the other image plane, thus forming all corresponding point pairs between the image planes. There is a variety of different algorithms for how these corresponding points can be searched, and in this section only one of them is discussed: the sum of absolute differences (SAD). (Szeliski 2011)

In equation form, SAD is defined as

E_{SAD}(u) = \sum_j \left| I_1(x_j + u) - I_0(x_j) \right| = \sum_j \left| e_j \right| ,    (19)

where 𝐼0(𝑥𝑗) is the template image point at a discrete pixel location, 𝐼1(𝑥𝑗 + 𝑢) is the corresponding image point in the other image, and 𝑢 is the displacement between these image points. One way of finding all the corresponding points with SAD is to use a sliding window approach. (Szeliski 2011)

After all the corresponding points are found, the disparity map can be formed from the displacements 𝑢 by mapping them to the correct pixels. In the disparity map, these values are called disparities 𝑑. (Szeliski 2011)
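As a rough sketch of SAD matching for a single pixel (not from the thesis; the window size, search range, and the synthetic 8-pixel shift are arbitrary choices), the cost of equation (19) can be minimized over candidate displacements along a row of rectified images:

```python
# SAD block matching for one pixel of a rectified stereo pair (equation 19).
import numpy as np

rng = np.random.default_rng(0)
left = rng.random((100, 200))            # fake rectified left image
right = np.roll(left, -8, axis=1)        # right image shifted by a known disparity

def sad_disparity(left, right, row, col, window=5, max_disp=32):
    """Return the displacement u that minimizes the SAD cost for one pixel."""
    half = window // 2
    patch_l = left[row - half:row + half + 1, col - half:col + half + 1]
    costs = []
    for u in range(max_disp):
        patch_r = right[row - half:row + half + 1,
                        col - u - half:col - u + half + 1]
        costs.append(np.abs(patch_l - patch_r).sum())   # E_SAD(u)
    return int(np.argmin(costs))

print("estimated disparity:", sad_disparity(left, right, row=50, col=120))  # 8
```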

Generally, some image post-processing is done to the disparity map before computing the depth map, in order to eliminate faulty values, but these techniques are not discussed in this section.

Depth calculation from two camera image planes adapted from (Krutikova et al. 2017).

Figure 8 visually represents how the depth values can be calculated from two camera image planes once the disparity map has been calculated. For every pixel, the corresponding depth value of the scene point from the camera can be calculated with the following equation

D = \frac{d_{plane} \cdot d_{base}}{d} ,    (20)

where 𝐷 is the depth value of a pixel, 𝑑𝑝𝑙𝑎𝑛𝑒 is the distance from the camera centers to the image plane, 𝑑𝑏𝑎𝑠𝑒 is the distance between the cameras, and 𝑑 is the disparity. (Krutikova et al. 2017)
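Equation (20) is a one-liner in code. In the sketch below (not from the thesis), the focal length expressed in pixels stands in for 𝑑𝑝𝑙𝑎𝑛𝑒 and is a made-up value; the 12 cm baseline matches the ZED camera's stated sensor separation:

```python
# Depth from disparity (equation 20): D = d_plane * d_base / d.
import numpy as np

def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.12):
    """Convert disparities in pixels to depth in meters."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    return focal_px * baseline_m / disparity_px

print(depth_from_disparity([8, 16, 32]))   # larger disparity -> smaller depth
```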

4.2 Convolutional neural network

A convolutional neural network is a variation of the neural network made for handling inputs that are known to have a grid-like topology. For example, CNNs are useful for processing images and time-series data. A standard feedforward neural network is called a CNN when at least one of its layers is a convolutional layer (Goodfellow et al. 2016). The fundamental building component of a neural network is the neural network layer, of which the convolutional layer is a variation. Going further, the fundamental building component of neural network layers is the artificial neuron. The first artificial neuron model is called the perceptron, and the model was developed by Rosenblatt (1962). Nowadays, the perceptron model is no longer standard; instead, the most common model for an artificial neuron is the sigmoid neuron (Nielsen 2015). In this chapter, the theory of CNNs is introduced with a bottom-up approach, starting from defining the general artificial neuron model.

The sigmoid neuron is a modified version of the perceptron neuron. The next Figure 9 shows a visual representation of the sigmoid neuron and the equations relating to it.

Visual presentation of a sigmoid neuron.

The basic principle is the same in both neurons (Nielsen 2015), and thus the general principle of the perceptron neuron can also be understood from the visual representation of the sigmoid neuron.

The neuron takes several inputs 𝛼1, 𝛼2, 𝛼3, ..., 𝛼𝑖 and outputs only one value 𝑦. Every input has a corresponding weight 𝑤1, 𝑤2, 𝑤3, ..., 𝑤𝑖 assigned to it. These weights represent how much each input affects the output of the neuron. Each neuron also has its overall bias 𝑏, which is a value weighting the overall direction of what the neuron outputs. (Nielsen 2015) For example, a low bias value 𝑏 gives the weights more effect, and a high bias 𝑏 gives them less effect. The neuron calculates the output function 𝑧 from the following equation

z = \sum_i w_i \alpha_i - b .    (21)

The neuron’s final output is calculated with an activation function, with the output function 𝑧 as its input: the step function for the perceptron and the sigmoid function for the sigmoid neuron. (Nielsen 2015)


Even though the perceptron and sigmoid neuron models have the same working principle, the sigmoid neuron was developed in order to rectify some problems that the perceptron model has. The main difference between the two models is that in the perceptron model every variable is binary, meaning that it can only have the values zero or one, while in the sigmoid model every variable can have a floating-point value between zero and one. In the perceptron model, when using binary values, there is not much room to adjust the output of the neuron. For example, changing one weight value from zero to one might make the neuron output zero instead of one in all situations afterwards. The neuron models also differ in what function they use to calculate the output 𝑦. The perceptron model uses the step function as the final function, and the sigmoid neuron uses the sigmoid function. (Nielsen 2015) The following Figure 10 shows the shapes of both output functions. From now on, all references to neurons refer to the sigmoid model.

Step and sigmoid functions.

The step function is defined in equation form as

y_{step}(z) = \begin{cases} 0 & \text{if } z \le 0 \\ 1 & \text{if } z > 0 \end{cases} ,    (22)

where 𝑧 is the output function defined in equation (21), and 𝑦𝑠𝑡𝑒𝑝 is the output of the perceptron neuron (Nielsen 2015).

The sigmoid function is defined in equation form as

y_{sigmoid}(z) = \frac{1}{1 + e^{-z}} ,    (23)

where 𝑧 is the output function defined in equation (21), and 𝑦𝑠𝑖𝑔𝑚𝑜𝑖𝑑 is the output of the sigmoid neuron (Nielsen 2015).
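As a small sketch (not from the thesis; the weights, bias, and inputs are arbitrary example values), a single sigmoid neuron following equations (21) and (23) can be computed like this:

```python
# One sigmoid neuron: weighted sum minus bias (eq. 21) fed through the sigmoid (eq. 23).
import numpy as np

def sigmoid_neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) - bias      # z = sum_i w_i * a_i - b
    return 1.0 / (1.0 + np.exp(-z))         # y = 1 / (1 + e^-z)

inputs = np.array([0.5, 0.1, 0.9])          # example inputs a_i
weights = np.array([0.8, -0.4, 0.3])        # example weights w_i
print(sigmoid_neuron(inputs, weights, bias=0.2))   # value between 0 and 1
```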


The neural network layer is a series of artificial neurons working in parallel, each calculating its output from the input that the layer is fed. The input can come either from a previous neural network layer or from the input layer. The input layer is the first layer of the neural network and introduces the data to the network. One way to visualize a layer is to think that all neurons belonging to that layer are in a single row, the next layer has its own row, and so on. Each neuron feeds its output to every neuron in the next layer, except in the output layer. Layers where each neuron feeds its output to every neuron in the next layer are called fully-connected layers (Nielsen 2015). The output layer is the last layer of a neural network, from which the output of the whole neural network can be read. Every layer between the input and output layers is called a hidden layer. They are called hidden layers since their output is not visible (Goodfellow et al. 2016). Neural networks that have several hidden layers are called deep neural networks, and learning done on these networks is called deep learning.

The convolution layer has many differences from the standard neural network layer. The main one is that the convolution layer uses the mathematical operation convolution instead of matrix multiplication when calculating the output of the layer (Goodfellow et al. 2016). Because convolution layers are used to process data whose format has a grid-like topology, it is easier to visualize the neurons in a matrix shape instead of a row shape. Convolution layers are not fully-connected layers. Instead, only a part of the input, called the local receptive field, is connected to each hidden neuron. The hidden neuron refers to the next layer’s neuron in the CNN. (Nielsen 2015) The next Figure 11 visually represents this connection.

A connection from the local receptive field of size 5𝑥5 to the next layer’s neuron in the convolution layer (Nielsen 2015).

Using the sliding window approach, the convolution layer connects all input neurons to the next layer’s neurons. Generally, the sliding window is moved one data point at a time in one direction until it arrives at the edge of the input, but sometimes it is desirable to move several data points at once. The number of data points moved is called the stride length. (Nielsen 2015) The convolutional layer uses weight and bias sharing, meaning that every hidden neuron’s value is calculated with the same weights and bias from its corresponding local receptive field. The array where the weights are stored is called a kernel. Weight and bias sharing make it possible for the convolutional layer to detect the same feature with all neurons, just in different parts of the image. (Goodfellow et al. 2016)

The convolution from the convolution layer to the next layer’s hidden neuron can be written as

H_n(i, j) = y_{sigmoid}\left(b + \sum_m \sum_n I(i + m, j + n) \cdot K(m, n)\right) ,    (24)

where 𝐻𝑛(𝑖, 𝑗) is the next layer’s hidden neuron at position (𝑖, 𝑗) in the neuron array, 𝐾(𝑚, 𝑛) is the weight value at position (𝑚, 𝑛) in the kernel array, and 𝐼(𝑖 + 𝑚, 𝑗 + 𝑛) is the input data point at position (𝑖 + 𝑚, 𝑗 + 𝑛) in the data array (Nielsen 2015).

When convolution layers are used with multi-dimensional data, like RGB images, the layer has as many kernels as the input data has dimensions (Goodfellow et al. 2016). For example, an RGB image has three dimensions; therefore, the corresponding convolution layer has three kernels, one for each dimension. Thus, it is possible for the convolution layer to learn features that occur only in a particular dimension.
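As a rough sketch of equation (24) (not from the thesis; the input size, kernel size, and bias are arbitrary example values), a single-channel convolution with shared weights and bias followed by the sigmoid activation can be written directly with loops:

```python
# Direct implementation of equation (24): slide a shared kernel over the input,
# add the bias, and pass the result through the sigmoid activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(image, kernel, bias):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1          # valid convolution, stride 1
    out_w = image.shape[1] - kw + 1
    hidden = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            receptive_field = image[i:i + kh, j:j + kw]
            hidden[i, j] = sigmoid(bias + np.sum(receptive_field * kernel))
    return hidden

image = np.random.rand(28, 28)               # fake single-channel input
kernel = np.random.rand(5, 5) - 0.5          # shared 5x5 weights
print(conv_layer(image, kernel, bias=0.1).shape)   # (24, 24)
```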

Pooling layers are commonly used together with convolution layers. A pooling layer comes just after a convolution layer, and it is used to simplify the output of the convolution layer (Nielsen 2015). The pooling layer uses a sliding window approach, taking a small part of the input at a time, performing a summarization operation, and then outputting one value to the next layer. These layers are mainly used to enhance detected features (Goodfellow et al. 2016). Many different variations of pooling layers exist, for example max pooling and L2 pooling, but those are not discussed in this chapter. One great advantage of using a pooling layer is that it makes it possible to use input data of varying size. In these cases, the pooling layer just adapts its summarization area to match the wanted output count. (Goodfellow et al. 2016)


4.3 YOLOv3

In this chapter, the theory relating to the YOLOv3 deep learning algorithm is discussed in more detail. YOLOv3 is a deep neural network made for object recognition, which predicts the location and the class of the object. YOLOv3 is based on the earlier versions YOLO9000, also called YOLOv2, and the original YOLO (Redmon et al. 2016; Redmon & Farhadi 2017), and therefore the basic principle of the network has stayed the same in all versions. The YOLO algorithm has always competed with state-of-the-art methods in object recognition, thus proving its capability. Beginning with the first version of YOLO, Redmon et al. (2016) have focused on making YOLO as fast as possible at object recognition, with the result that all the versions are capable of running in real time.

YOLOv3 is a deep neural network that uses only convolution layers and residual connections for feature extraction. This feature extraction part is called the darknet53 backbone or body. (Redmon & Farhadi 2018) Table 2 shows the structure of the YOLOv3 network.

In the current version, Redmon & Farhadi (2018) use a combination of three layers to make predictions from the extracted features: global average pooling (GAP), a fully-connected layer, and a softmax layer. GAP is a layer that calculates the average value of all elements in each feature map. GAP is used to reduce the size of the following fully-connected layers and, thus, the total number of parameters in the network. Another benefit of using GAP to shrink the number of parameters is that, at the same time, the risk of overfitting decreases; overfitting means that the machine learning algorithm learns details too closely and is not able to generalize (Seel 2012). (Hsiao et al. 2019) Softmax layers are used in multi-class networks to transform the outputs of earlier layers into probabilities, values between zero and one, resulting in a readable output of the network (Yuan 2016). The softmax layer is usually the last layer of the network, but it can also be used as a hidden layer. YOLOv3’s prediction part, containing the three layers just described, is called the darknet53 head.
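As a minimal sketch of these two operations (not from the thesis; the feature map and class counts are arbitrary examples matching the table below), global average pooling collapses each feature map to a single number and softmax turns raw scores into probabilities:

```python
# Global average pooling and softmax on fake feature maps.
import numpy as np

features = np.random.rand(1024, 8, 8)        # 1024 feature maps of size 8x8

gap = features.mean(axis=(1, 2))             # GAP: one average value per feature map
scores = np.random.rand(1000)                # stand-in for a fully-connected layer output

def softmax(x):
    e = np.exp(x - x.max())                  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(scores)
print(gap.shape, probs.sum())                # (1024,) and probabilities summing to 1
```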


Table 2. YOLOv3 layer structure adapted from (Redmon & Farhadi 2018).

       Type             Filters   Size        Output
Darknet53 body
       Convolutional    32        3 x 3       256 x 256
       Convolutional    64        3 x 3 / 2   128 x 128
  1x   Convolutional    32        1 x 1
       Convolutional    64        3 x 3
       Residual                               128 x 128
       Convolutional    128       3 x 3 / 2   64 x 64
  2x   Convolutional    64        1 x 1
       Convolutional    128       3 x 3
       Residual                               64 x 64
       Convolutional    256       3 x 3 / 2   32 x 32
  8x   Convolutional    128       1 x 1
       Convolutional    256       3 x 3
       Residual                               32 x 32
       Convolutional    512       3 x 3 / 2   16 x 16
  8x   Convolutional    256       1 x 1
       Convolutional    512       3 x 3
       Residual                               16 x 16
       Convolutional    1024      3 x 3 / 2   8 x 8
Darknet53 head
  4x   Convolutional    512       1 x 1
       Convolutional    1024      3 x 3
       Residual                               8 x 8
       Avgpool                    Global
       Connected                  1000
       Softmax

The YOLOv3 network can be divided into three main parts. First, the network splits the input image into an 𝑆 x 𝑆 grid of cells. The latest version uses three different values for 𝑆, dividing the input image into 13 x 13, 26 x 26, and 52 x 52 grid cells. (Redmon & Farhadi 2018) The second part is bounding box prediction. A bounding box is a rectangular border around the looked-for object, indicating the location and size of the object. At each scale, every grid cell predicts 𝐵 bounding boxes using dimension clusters as anchor boxes. These dimension clusters are calculated for each dataset independently from the training dataset using k-means clustering. (Redmon & Farhadi 2017) In the paper where Redmon & Farhadi (2018) published YOLOv3, they used nine dimension clusters on three different scales on the COCO dataset (Tsung-Yi et al. 2015).
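To illustrate how a predicted box is decoded from the network outputs (not stated in this excerpt, but following the standard YOLOv2/YOLOv3 formulation and the symbols listed earlier: 𝑏𝑥 = σ(𝑡𝑥) + 𝑐𝑥, 𝑏𝑦 = σ(𝑡𝑦) + 𝑐𝑦, 𝑏𝑤 = 𝑝𝑤·e^𝑡𝑤, 𝑏ℎ = 𝑝ℎ·e^𝑡ℎ; the concrete numbers below are made-up examples), a sketch could look like this:

```python
# Decode one YOLOv3 box prediction (t_x, t_y, t_w, t_h) into image coordinates,
# using the predicting cell's offset (c_x, c_y) and an anchor (p_w, p_h).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(t, cell_xy, anchor_wh, grid_size, image_size):
    tx, ty, tw, th = t
    cx, cy = cell_xy                              # cell's top-left corner in grid units
    pw, ph = anchor_wh                            # anchor size in grid units
    stride = image_size / grid_size               # pixels per grid cell
    bx = (sigmoid(tx) + cx) * stride              # box center x in pixels
    by = (sigmoid(ty) + cy) * stride              # box center y in pixels
    bw = pw * np.exp(tw) * stride                 # box width in pixels
    bh = ph * np.exp(th) * stride                 # box height in pixels
    return bx, by, bw, bh

print(decode_box((0.2, -0.1, 0.5, 0.3), cell_xy=(6, 7),
                 anchor_wh=(3.6, 2.8), grid_size=13, image_size=416))
```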

Every bounding box is assigned a predicting cell from the grid based on which cell the center of the predicted bounding box belongs to. For each predicted bounding box, the network defines a confidence score, describing how well the predicted bounding box fits
