

COMPUTER VISION METHODS FOR PARKING SPOT EXTRACTION FROM LIDAR TOP-DOWN IMAGES

Master of Science Thesis
Faculty of Information Technology and Communication Sciences
Examiners: Prof. Esa Rahtu, Prof. Joni Kämäräinen
January 2022


ABSTRACT

Tiitus Hiltunen: Computer Vision Methods for Parking Spot Extraction from LiDAR Top-down Images
Master of Science Thesis
Tampere University
Master’s Programme in Information Technology
January 2022

The goal of parking spot extraction is to find the corner coordinates of a parking spot in pixel coordinates. If the image is georeferenced, meaning that each pixel coordinate has a corresponding longitudinal and latitudinal coordinate, this information can be used to find the coordinates of the parking spot in geographic coordinates. This opens the door to multiple applications, such as speeding up the annotation of parking garages. This annotation is important, for example, for autonomous vehicles navigating inside a parking garage. Typically these georeferenced images are projected from a top-down view in relation to the parking spot.

Although parking spots are quite a simple feature in terms of their visual structure, parking spot extraction from top-down images is a complex and multifaceted problem. This is especially the case when deep learning approaches are considered but the data samples lack quantity and diversity, which makes it difficult to use advanced deep learning methods such as instance segmentation.

If the parking spot extractor model is not based on deep learning, some assumptions must be made about the visual nature of parking spots. Parking spots are typically rectangular, but parallelogram-shaped parking spots are not atypical either. In addition, parking spots can appear in the top-down image at any rotation or be occupied by a car. They can also have visual cues symbolizing, for example, a parking spot reserved for families or people with physical disabilities.

The top-down images utilized in this work were based on point clouds, which were gathered using scanners inside parking garages. These point clouds and the top-down images generated from them produce their own set of problems for feature extraction, mainly noise due to negligent scanning. Therefore the model has to be robust enough to detect partially missing parking spots and parking spots whose lines are only partly visible in the top-down image.

The proposed method utilizes histogram of oriented gradients features along with a logistic regression classifier for parking spot proposals, and a convolutional neural network called TilhiNet for proposal verification. The experiments show that histogram of oriented gradients features work well and robustly with a small amount of sample data in the case of parking spot extraction from point cloud-based top-down images.

Keywords: parking spot extraction, histogram of oriented gradients, object detection, LiDAR, convolutional neural networks

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Tiitus Hiltunen: Parkkipaikkojen havainnointi LiDAR-yläilmakuvista konenäkömenetelmillä (Computer Vision Methods for Parking Spot Extraction from LiDAR Top-down Images)
Master of Science Thesis
Tampere University
Master’s Programme in Information Technology
January 2022

The goal of parking spot extraction is to detect the corner coordinates of parking spots in a top-down image in pixel coordinates. If the image is georeferenced, that is, each pixel of the image has a corresponding longitudinal and latitudinal coordinate, the parking spot coordinates can be converted from image coordinates to geographic coordinates. This opens up possibilities for several applications, such as speeding up the annotation of parking spots in parking garages. This annotation is important, for example, for the navigation of autonomous vehicles inside a parking garage.

Parking spots are visually very simple in structure, but extracting them from top-down images is a fairly complex problem. This is especially true when large amounts of training data, which would allow the use of more advanced deep learning methods such as instance segmentation, are not available.

If the method used does not rely on deep learning, some assumptions must be made about the visual nature of a parking spot in the top-down image. Parking spots are typically rectangular, but parallelogram-shaped parking spots are not uncommon either. In addition, parking spots can appear in the image at any rotation angle, and they may contain a vehicle or a symbol that restricts the spot to a certain group of users, for example families or people with physical disabilities.

The top-down images used in this work were created from point clouds, which were gathered with scanners built for point cloud collection inside parking garages. Point clouds created by such scanning bring their own problems to the extraction, such as noise caused by careless scanning. Therefore the method used has to be robust enough to handle parking spots that are only partially visible or difficult to distinguish.

The method utilizes histogram of oriented gradients features together with a logistic regression classifier for proposing parking spots, and a self-built convolutional neural network called TilhiNet for verifying the parking spot proposals. The experiments show that histogram of oriented gradients features work well and robustly with a small amount of data for parking spot extraction from point cloud-based top-down images.

Keywords: parking spot extraction, histogram of oriented gradients, object detection, LiDAR, convolutional neural networks

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


PREFACE

After working on this thesis for the past six months, I can confidently admit that computer vision is a topic that never becomes dull or monotonous. The applications seem limitless and, due to that, so does my own curiosity.

I want to thank the Indoor Map Automation Team at HERE Technologies for the support and stimulating discussions concerning the topic of this thesis. I also want to thank my supervisor, Professor Esa Rahtu, for giving valuable help, especially with the writing process.

In Tampere, 19th January 2022

Tiitus Hiltunen


CONTENTS

1. Introduction
2. Background
   2.1 Object Detection
       2.1.1 Sliding Window Method
       2.1.2 Non-maximum Suppression
   2.2 Histogram of Oriented Gradients
   2.3 Logistic Regression
   2.4 Deep Neural Networks
   2.5 Convolutional Neural Networks
       2.5.1 Batch Normalization
       2.5.2 Dropout
       2.5.3 Pooling
   2.6 Light Detection and Ranging
   2.7 Related Research Problems
3. Considered Methods
   3.1 Instance Segmentation
   3.2 Edge Detection and Hough Transform
   3.3 Parking Space Proposals
   3.4 Manual Annotation
4. Proposed Method
   4.1 Overall Architecture
   4.2 Preprocessing
   4.3 Parking Spot Proposals
       4.3.1 HOG Features
       4.3.2 Classifier
   4.4 Verification of Proposals with Convolutional Neural Networks
   4.5 Post-Processing
5. Experiments
   5.1 Data Gathering
   5.2 Training of the HOG Detector
   5.3 Evaluation Metrics
   5.4 Preliminary Results
   5.5 Training of the CNN TilhiNet
   5.6 Final Results
   5.7 Possible Improvements
6. Conclusions
References
Appendix A: Pseudocode for the Proposed Method
Appendix B: TilhiNet Architecture
Appendix C: HOG Proposal Visualizations for Unseen Data
Appendix D: Test Data Result for TilhiNet
Appendix E: Ground Truth for Test Metrics


LIST OF SYMBOLS AND ABBREVIATIONS

x_s     Horizontal coordinate of the sliding window position
y_s     Vertical coordinate of the sliding window position
CNN     Convolutional Neural Network
DNN     Deep Neural Network
HOG     Histogram of Oriented Gradients
IoU     Intersection over Union
LiDAR   Light Detection and Ranging
LR      Logistic Regression
mIoU    Mean Intersection over Union
NMS     Non-Maximum Suppression
NN      Neural Network
SGD     Stochastic Gradient Descent


1. INTRODUCTION

Parking spot extraction from images is a problem that has many applications. One of these applications is the generation of geographic maps that include the longitudinal and latitudinal coordinates of the parking spots. The information about the geographical coordinates of the parking spots is useful, for example, for autonomous vehicles and for drivers of non-autonomous cars looking for vacant parking spots.

Parking garages are one of the most important buildings for indoor maps in navigation and map creation, since they are used by almost every driver. They are also quite simple and consistent in terms of their visual structure. This simplicity implies that, with a certain set of assumptions, extraction of features such as parking spots can be done either with more traditional computer vision methods relying on strong assumptions, or with massive amounts of sample data and fewer assumptions using deep learning-based approaches. An example of a geographic indoor map featuring a parking garage is shown in Figure 1.1.

One of the ways of creating geographic maps of parking garages is through top-down view images. Creating a parking garage indoor map containing labeled parking spots and other important features based on a top-down image can take hundreds of hours, whereas a computer vision model could achieve the same or better precision in a matter of hours or even minutes. If successful, outdoor maps could be extended to indoor spaces very rapidly, which would not only allow for autonomous driving and parking but also help a typical driver navigate inside the parking garage. One of the most time-consuming sub-tasks in parking garage map creation is the manual annotation and labeling of the parking spots.

A parking spot is a physical area inside which a vehicle can be parked. This area is usually marked on the floor of a parking garage, parking lot, or next to a driving road.

Parking spaces consist of multiple parking spots. A typical parking spot made for sedans is marked with two white parallel lines that are approximately twice as long as the shortest possible distance between the lines.

The goal of parking spot extraction is to extract parking spots from an image in a form that allows each spot to be placed at its corresponding location in the image based on its pixel coordinates. Usually these are the corner coordinates of the parking spot, of which there are typically four, but there can be more if the parking spot is next to a pillar or some other obstruction that prevents it from having the typical rectangular shape. An example of such a case is demonstrated in Figures 1.2 and 1.3 below.

Figure 1.1. An example geographic parking garage map used in car navigators. Image from the HERE Technologies website (Krome 2018)

Figure 1.2. Obstructed parking spot

Figure 1.3. Labeled obstructed parking spot

Georeferencing of an image means that each pixel in the image has a longitudinal and latitudinal coordinate corresponding to an exact geographic location on earth. In the case of a top-down image, the extracted parking spots in pixel coordinates can be used to find their geographical coordinates given a proper georeferencing of the image. In applications where this is useful, the demand for accuracy of the detected corner coordinates is much higher.

There is a plethora of ways to measure the performance of parking spot extraction. These metrics take the ground truth parking spot coordinates and the predicted parking spot coordinates and calculate a value that describes their similarity. Such metrics include Intersection over Union (IoU), which compares the areas of the ground truth and the prediction, and center point distance, which compares their center points. Utilizing these two metrics together gives a good understanding of the success of the parking spot extraction.
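As a minimal sketch of these two metrics for axis-aligned boxes (the function names and the box format are illustrative, not taken from the thesis), the following could be used:

```python
import math

def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def center_distance(box_a, box_b):
    """Euclidean distance between the center points of two boxes."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(cax - cbx, cay - cby)
```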

Intuitively, parking spot extraction from top-down images seems like an easy task for a modern computer vision algorithm to solve: find the sidelines of a parking spot and return the end coordinates of those lines as the coordinates of the parking spot. As mentioned earlier, parking garage features are often fairly consistent visually, and parking spots are no different in this case. Most parking spots are rectangular and have at least two white lines that are usually twice as long as the distance between them. In addition, parking spots are usually grouped in rows, which, when used as an assumption, can reduce the search space considerably for certain computer vision algorithms.

Data used for indoor feature extraction can come in the form of light detection and ranging (LiDAR) data (US Department of Commerce and Administration 2012). This data is gathered with scanners attached to a car, which is then driven through the indoor space. LiDAR data is point cloud data, where each voxel in the cloud has an x, y, and z value based on its position within the point cloud. Some scanners also take panorama images, which make it possible to map each point cloud voxel to an RGB value. This type of data makes it possible to generate a view from a top-down perspective for indoor maps, where satellite images are not available. The top-down image can then be used to extract meaningful features from the point cloud. In the case of a parking garage, these features could be, for example, parking spots, turning arrows, or other markings on the floor.

However, there are many problems with parking spot extraction, especially from top-down images that have been artificially created, for instance, from the aforementioned LiDAR data. Point cloud-based top-down images are still far from high-resolution satellite images in terms of clarity, resulting in problems such as irregular noise and blind spots due to lack of scanning coverage. Given the randomness of the noise and the blind spots, implementations have to make certain assumptions about the visual nature of the extracted feature. Finding out which assumptions generalize well and are robust to noise is also a challenging task.

Many research questions can be asked about parking spot extraction at this point. What computer vision-based extraction methods work in the case of LiDAR point cloud data?

What are the solid assumptions about parking spots that result in the most robustness in the face of noise and incomplete images? The data available for the experiments of this thesis comes from three different parking garages with roughly 1500 parking spots in total. From the perspective of feature extraction, this is a rather small and not very diverse set of training data, given that the parking spots inside a single garage look roughly the same. What sort of method provides accurate results, in terms of the absolute accuracy of the detections, when there is a lack of quantity and diversity in the training data?

This thesis focuses on parking spot extraction from indoor parking garage top-down LiDAR images in a case where there is not enough data, in quantity or diversity, to be utilized, for instance, for semantic segmentation model training. Chapter 2 formulates the problem and discusses prior related work regarding the task. Chapter 3 goes in-depth into the feature extraction approaches that were considered and explains which approaches the nature of the LiDAR data renders almost instantly ineffective. Chapter 4 proposes and discusses a method for parking spot extraction, Chapter 5 covers the experiments and results of the proposed method, and Chapter 6 concludes the findings and insights of the thesis.


2. BACKGROUND

Parking spot extraction connects multiple branches of computer vision and machine learning. This chapter reviews the key concepts of computer vision and deep learning, with emphasis on the concepts relevant to the problem of parking spot extraction.

2.1 Object Detection

The goal of object detection is to detect and classify objects in an image. Typically this means drawing a bounding box tightly around the detected object and then giving it a predicted class label. In general, object detection answers both what is in the image and where it is located. There are many approaches to this problem, but this subchapter only goes over the concepts that are relevant to this thesis.

2.1.1 Sliding Window Method

One of the more conventional ways of localizing objects in an image is to use a sliding window method. It works by sliding a window of a fixed size through the image and then, for each window, using a classifier to detect possible objects. The bounding box of the detected object is, in this case, the position of the current instance of the sliding window. The idea behind the sliding window method is visualized in Figure 2.1 and as pseudocode in Algorithm 1.

The method is somewhat slow, especially when the image size is large in comparison to the step size of the sliding window, which defines how many pixels the window shifts each step. A smaller step size yields more precise detections but a slower run time. On the other hand, with a larger step size the algorithm runs much faster, but the detections are less accurate.


Figure 2.1. The idea behind the sliding window method. The window of fixed size is shifted horizontally based on the specified step size. The information in the current window can be used for classification, for instance. When the window reaches the right edge of the image, it starts from the left again but one step size lower.

Algorithm 1 Sliding window classifier for multiple classes

Require: image, window_width, window_height, step_size, classifier
  x ← 0
  y ← 0
  predictions ← {}
  while y + window_height < image_height do
      while x + window_width < image_width do
          patch ← image[x : x + window_width, y : y + window_height]
          pred_class ← classifier(patch)
          if pred_class ≠ 0 then
              coordinates ← (x, y, x + window_width, y + window_height)
              predictions.insert((pred_class, coordinates))
          end if
          x ← x + step_size
      end while
      x ← 0
      y ← y + step_size
  end while

One limitation of the method is that the shape of the bounding box is limited to the shape of the sliding window, or at least to the predefined set of shapes of the sliding window.

This property requires the objects to be roughly horizontal or vertical and not rotated too much, or the classifier will miss them. Even when the classifier does detect the desired object, the bounding box may either not contain the whole object or contain the object along with excessive amounts of other parts of the image. If one cannot make any assumptions about the size of the object in the image, more advanced techniques should be used, such as Gaussian pyramids for detecting objects at multiple scales (Adelson et al. 1984).

2.1.2 Non-maximum Suppression

No matter what object detection approach is used, the proposed bounding boxes for objects can overlap a lot. This overlapping is not a problem if the classes of the overlapping boxes are different, since two classes can naturally appear together in an image, for example a cat lying on a carpet. But if the bounding boxes overlap significantly and the predicted classes are the same, one conclusion is that the detected object is the same. How to merge these bounding boxes, or select the best bounding box out of the overlapping ones, is the problem that non-maximum suppression (NMS) tries to solve. (Hosang et al. 2017)

The original NMS algorithm solves the problem of overlapping bounding boxes in a straightforward manner. First one defines a threshold value t ∈ [0, 1]. Then, when looking at the bounding boxes, if the overlap between two boxes is higher than t, the bounding box with the higher confidence score is chosen over the one with the lower confidence score. There are different variants of this method, such as looking at the bounding box that intersects the most in comparison to all the other overlapping boxes.

No matter what approach is used, non-maximum suppression is an essential part of any object detection algorithm in order to get rid of overlapping bounding boxes. (Hosang et al. 2017)
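A minimal sketch of the greedy NMS variant described above, assuming axis-aligned boxes given as (x1, y1, x2, y2) with a confidence score (the function and parameter names are illustrative):

```python
def nms(detections, t=0.5):
    """Greedy non-maximum suppression.

    detections: list of (box, score) pairs with box = (x1, y1, x2, y2).
    Boxes overlapping an already kept box by more than the IoU threshold t are dropped.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    kept = []
    # Process detections from highest to lowest confidence score.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k) <= t for k, _ in kept):
            kept.append((box, score))
    return kept
```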

2.2 Histogram of Oriented Gradients

Object detectors have come a long way from the beginning phases of object detection.

One of the very first object detectors considered viable utilized histogram of oriented gradients feature descriptors (Dalal and Triggs 2005). It has mostly been applied to human detection, but it has been used successfully for other tasks as well (Bougharriou et al. 2017; Ramirez Cerna 2013).

The idea behind the histogram of oriented gradients is to apply a derivative mask to the grayscale image window of interest. The gradients are calculated vertically, horizontally, or in both directions, which is mathematically

I_x(j, k) = I(j, k+1) - I(j, k-1),    (2.1)

I_y(j, k) = I(j-1, k) - I(j+1, k).    (2.2)


After that the gradients are moved to the polar coordinate system

r = \sqrt{I_x^2 + I_y^2},    (2.3)

\theta = \frac{180}{\pi} \left( \arctan_2(I_x, I_y) \bmod \pi \right),    (2.4)

where \arctan_2 is the four-quadrant inverse tangent. The reason for the modulo \pi is to limit the angle between 0 and 180 degrees so that the histogram constructed later does not separate gradients that point in opposite directions. (Tomasi 2015)

After that, the gradient window of interest will be divided into non-overlapping cells of size C ×C pixels. Inside each cell, a histogram of gradient orientations will be calculated.

There are B orientation bins, which determine the different angles that a gradient can belong to. Each gradient contributes to two adjacent bins to make the calculation more robust to small changes in the orientation of the image. Each bin has a number between 0 and B - 1, and the width of a bin is w = 180/B. Bin i has its center at c_i = w(i + 1/2), and the two adjacent votes of a pixel are

v_j = r \frac{c_{j+1} - \theta}{w}   to bin number   j = \left\lfloor \frac{\theta}{w} - \frac{1}{2} \right\rfloor \bmod B    (2.5)

and

v_{j+1} = r \frac{\theta - c_j}{w}   to bin number   (j + 1) \bmod B,    (2.6)

which results in a vector with B non-negative entries per cell. The cells are then grouped into 2 x 2 overlapping blocks of size 2C x 2C pixels with a stride length of C. For each block, the calculated cell histograms are concatenated into a single block feature b, which is then normalized by

b \leftarrow \frac{b}{\sqrt{\|b\|^2 + \epsilon}},    (2.7)

where \epsilon is used to avoid division by zero. (Tomasi 2015)

These block features from all the blocks are then concatenated to produce a feature vector h, which is also normalized by

h \leftarrow \frac{h}{\sqrt{\|h\|^2 + \epsilon}},    (2.8)

followed by

h_n \leftarrow \min(h_n, \tau),    (2.9)

followed by

h \leftarrow \frac{h}{\sqrt{\|h\|^2 + \epsilon}},    (2.10)

where, in Equation 2.9, \tau is a predetermined threshold value that prevents extreme gradients from flattening the other values in the normalization process. Given all this normalization, the HOG features are largely independent of the image contrast. The feature vector h can be used as an input feature to a machine learning classifier, for example a support vector machine or a logistic regressor. (Tomasi 2015)

Figure 2.2. Example of HOG features, original image from the Unsplash website (Unsplash 2021). Any major change in contrast will be detected with the HOG features. This makes them great features for object detection, since anything that does not blend into the background will be stored in the HOG feature vector.

An example visualization of HOG features with cell size C = 8 and B = 9 orientation bins is in Figure 2.2 above. Depending on the situation, one can alter the size of the cells and blocks and the number of orientation bins to get more detail-oriented but volatile, or more approximate but robust, HOG features.
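As an illustration, HOG features with cell size C = 8 and B = 9 orientation bins can be computed, for example, with scikit-image. The window size and parameter choices below mirror the figure and are only illustrative, not necessarily the exact configuration used later in the thesis:

```python
import numpy as np
from skimage.feature import hog

# Hypothetical grayscale window cropped from a top-down image.
window = np.random.rand(170, 80)

features, hog_image = hog(
    window,
    orientations=9,          # B = 9 orientation bins
    pixels_per_cell=(8, 8),  # C = 8
    cells_per_block=(2, 2),  # 2 x 2 cells per block
    block_norm="L2-Hys",     # block normalization with clipping, as in Section 2.2
    visualize=True,          # also return a visualization like Figure 2.2
)
print(features.shape)        # flattened feature vector h
```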

2.3 Logistic Regression

One of the simplest classifiers to use on top of a set of features in binary classification is logistic regression (LR). Given the sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}    (2.11)

and a feature vector \phi, the basic idea is to calculate the posterior probability of a class C_1 as

P(C_1 \mid \phi) = \sigma(w^T \phi).    (2.12)


The other class probability can then be calculated as

P(C_2 \mid \phi) = 1 - P(C_1 \mid \phi).    (2.13)

Note that the weight vector w has to be of the same dimension as the input feature vector \phi. This weight vector is updated based on the predicted values y and the ground truth values t. The update is typically done with the gradient of the cross-entropy loss, which in this case is calculated as

\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) \phi_n.    (2.14)

The sigmoid outputs values between 0 and 1, which in this case correspond to the confidence of the input belonging to class 1. In binary classification applications, the probability 0.5 is typically used as the decision boundary between the classes. (Bishop 2006)
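A minimal sketch of training such a binary classifier on HOG feature vectors, for example with scikit-learn; the data arrays and their sizes here are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder training data: one HOG feature vector phi per window,
# with label 1 for "parking spot" and 0 for "not a parking spot".
X_train = np.random.rand(200, 3780)
y_train = np.random.randint(0, 2, size=200)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# predict_proba returns sigma(w^T phi) for class 1, i.e. P(C1 | phi).
probs = clf.predict_proba(np.random.rand(5, 3780))[:, 1]
labels = (probs > 0.5).astype(int)   # 0.5 as the decision boundary
```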

2.4 Deep Neural Networks

Since the 2000s, neural networks (NNs) have been one of the hot topics in machine learning research. As the name might suggest, neural networks attempt to mimic the structure of the human brain.

A neural network is constructed of nodes that make up the layers of the network. Each node has a weight w and a bias term b. The first layer is called the input layer, which takes in an input vector x. The layers in the middle are called hidden layers, and the last layer is called the output layer. NNs that include any hidden layers are called multi-layer perceptrons (MLP). If there are more than two hidden layers, the neural network is called a deep neural network (DNN). (Nielsen 2018, p. 3-4)

For each layer, there is a forward operation F(\cdot) which defines the output y for that specific layer as follows:

y = F(x) = Wx + b = \sum_{i=0}^{n} w_i x_i + b_i.    (2.15)

For this output, there is an activation function, which transforms the output in some manner. Typically used activation functions are the sigmoid (as described in Equation 2.11) and the rectified linear unit (ReLU). Activation functions are central in NN training, since they make the output of the layer non-linear and therefore allow the network to learn a more complex set of features.

Any trainable deep neural network has to have a cost function. This cost function, also known as the loss function or objective function, defines the metric that dictates how to update the weights of the DNN. When one has an input x, a desired output t, and an output z predicted by the DNN, and the loss function is, for example, the commonly used mean square error, then

L = \frac{1}{n} \sum_{i=1}^{n} (t_i - z_i)^2    (2.16)

becomes zero as the predicted output z approaches the desired output t. To get there, the weights of the DNN have to be updated. This is done using backpropagation, where the calculated loss is propagated back through the network and, based on the selected optimizer, the weights of each node are updated. (Nielsen 2018, p. 49-57)
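As a toy illustration of Equations 2.15-2.16 and one backpropagation update (a sketch with made-up layer sizes and data, not the network used in this thesis):

```python
import torch
import torch.nn as nn

# A small fully connected network: input -> hidden -> output.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()                      # mean square error, Eq. 2.16
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 8)                      # input batch
t = torch.randn(32, 1)                      # desired outputs

z = model(x)                                # forward pass, Eq. 2.15 per layer
loss = loss_fn(z, t)                        # L = mean((t - z)^2)
optimizer.zero_grad()
loss.backward()                             # backpropagate the loss
optimizer.step()                            # update the weights with the optimizer
```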

2.5 Convolutional Neural Networks

Plain deep neural networks can sometimes be too simplistic to get accurate results for some tasks. In addition, increasing the number of layers in a DNN might boost the model's performance, but it also makes the model too slow for practical purposes. To solve both of these problems, one can turn to convolutional neural networks (CNN).

What makes CNNs different from plain DNNs is that CNNs utilize local receptive fields.

This means that the network does not connect every node of one layer to every node of the next layer, but only to nodes close to each other. This procedure decreases the number of calculations that are needed. The output of a CNN layer is a set of feature maps. The idea behind the generation of those feature maps is shown in Figure 2.3 below. (Nielsen 2018, p. 170-171)

The second thing that convolutional layers utilize is shared weights. No matter where the localized region is currently scanning the matrix, the weights are the same. One can think of this localized region as a filter that is run across the matrix. This further decreases the amount of memory and computation required by the CNN. Within a specific feature map, the shared weights serve the function of extracting the same feature from different parts of the matrix. In order to capture multiple features, one has to use several feature maps.

Figure 2.3. On the left, an input to a convolutional layer; on the right, a feature map generated from the input without padding using kernel size (3, 3) and stride length 1

2.5.1 Batch Normalization

The problem that many CNNs have is that the values generated by the non-linear activation functions get so high that the learning rates and initial values for the network have to be selected very carefully. Batch normalization addresses this by learning the mean and variance of each mini-batch fed into the network and normalizing it. This effectively speeds up training and makes it more robust to different learning rates and initialization values. (Ioffe and Szegedy 2015)

Batch normalization layers are typically used right after convolutional layers, as most problems happen when the activation function of the convolutional layer outputs very high values. Later layers can therefore take in normalized inputs and perform their operations in a more consistent manner.

2.5.2 Dropout

As the number of parameters in a CNN increases, the probability of overfitting the model increases as well. Overfitting essentially means memorizing the training data instead of finding the patterns that describe the data. In a case like this, the model is unable to generalize to novel data outside the training data. One way of trying to rectify this problem is to use dropout layers (Srivastava et al. 2014).

How dropout layers work is simple: during training, a certain fraction of the layer's units is randomly dropped at each training step, meaning their outputs are temporarily set to zero. For example, a dropout layer with a value of 0.5 drops, on average, half of the units of the affected layer at each step. Because the network cannot rely on any single unit always being present, it is less prone to overfitting, since individual parameters cannot be overly optimized for the training data.

One usually places multiple dropout layers in different parts of the CNN to customize this regularization. Typically dropout layers are placed near the end of a CNN, since those parts commonly play a larger role in overfitting.


Figure 2.4. Example of average pooling with pool size (2, 2), stride 1 and without padding

2.5.3 Pooling

Quite often the features that the CNN learns could easily be compressed to a form that doesn’t require as many nodes. In fact, in many cases doing so will only make the model better at finding what is relevant in each feature map. This sort of compression in the case of CNNs is called pooling. There are many different ways of pooling. In max pooling, one describes a certain area of the feature map with the maximum value of that area. Similarly, in average pooling, the area is described by calculating the average value of that area.

Typically this area is (2, 2) to (8, 8) pixels, but larger values can be tried as well. (Nielsen 2018, p. 174)

A visualization of average pooling for a 5 x 5 matrix is in Figure 2.4. Pooling layers are typically placed after convolutional layers, since they are great at summarizing the information in the feature maps of the convolutional layer.
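A short check of the pooling setup in Figure 2.4, using pool size (2, 2) and stride 1 on an arbitrary 5 x 5 input (a sketch, with max pooling included for comparison):

```python
import torch
import torch.nn.functional as F

x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)  # 5 x 5 feature map
avg = F.avg_pool2d(x, kernel_size=2, stride=1)   # average pooling -> 4 x 4 output
mx = F.max_pool2d(x, kernel_size=2, stride=1)    # max pooling -> 4 x 4 output
print(avg.shape, mx.shape)                        # torch.Size([1, 1, 4, 4]) twice
```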

2.6 Light Detection and Ranging

Light Detection and Ranging (LiDAR) is one way of capturing all sorts of features from the real world. In essence, it means using laser sensors to detect the location of a certain voxel point in 3D space. Combined with panorama images taken during scanning, each point in the point cloud can be given an RGB value based on the RGB value of the corresponding point in the panorama image.

Now in order to form a cohesive 3D point cloud, one needs a thorough scanning to grasp all the possible details in an indoor space. LiDAR sensors are typically used by attaching them to a car or a pushcart and then moving with the scanners around the space of interest. This results in vast amounts of point cloud data which can then be utilized in feature extraction. Sample visualization of a LiDAR point cloud is in Figure 2.5 below.


Figure 2.5. Example patch of a LiDAR point cloud. The uneven density of the point cloud is due to the biased scanning route: the middle of the parking garage driveway is quite dense, since the scanner was driven on it. The points on the ceiling are further away from the scanner and therefore sparser. The blind spots of the LiDAR scanner can be seen behind the vehicle, where there are seemingly no points at all.

One of the downsides of LiDAR scanning is its susceptibility to blank spots due to insufficiently thorough scanning. Any error in the scanning process might result in completely blank spots in the point cloud, which are very difficult to rectify after the scan has been completed. If the mistakes made during scanning are severe enough, forming a cohesive point cloud using point cloud registration might not even be possible (Myronenko and Song 2009). Therefore it is essential to perform the scans thoroughly and diligently.

In the case of parking spot extraction, using 3D point cloud data is in many ways excessive and unnecessary since parking spots don’t have a 3D shape. A better way of utilizing the point cloud, in this case, is to project it as an image from a top-down view so that parking spots are going to appear in the image as 2D polygons, typically in the shape of a rectangle. Now, this can more easily be used for parking spot extraction. An example of a top-down projected point cloud image is in Figure 2.6.

LiDAR point clouds are great for creating top-down images of indoor spaces. The benefit of using a point cloud over projected panorama images is that panorama images can warp easily due to insufficient or slightly inaccurate information about the camera's parameters or mild errors in the projection matrix calculations. Point cloud-based top-down images will always be projected from the top-down view, given that there were no errors in the 3D coordinate assignment of the voxels during scanning. Since all the points have a 3D coordinate, point cloud data can be used to generate an image from any angle without being concerned about warping the image.
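A minimal sketch of such a top-down projection, assuming the point cloud is given as separate coordinate and color arrays and a fixed voxel (pixel) size; the function name, the height filter, and the "last point wins" rule are illustrative assumptions rather than the exact procedure used in this work:

```python
import numpy as np

def topdown_image(points, colors, voxel_size=0.02, z_max=0.5):
    """Project a point cloud to a top-down RGB image.

    points: (N, 3) array of x, y, z in meters; colors: (N, 3) RGB values in 0-255.
    Only points below z_max are kept so that ceiling points do not cover the floor.
    """
    keep = points[:, 2] < z_max
    pts, col = points[keep], colors[keep]

    # Convert metric x, y coordinates to pixel indices using a fixed voxel size.
    xy_min = pts[:, :2].min(axis=0)
    pix = ((pts[:, :2] - xy_min) / voxel_size).astype(int)

    h, w = pix[:, 1].max() + 1, pix[:, 0].max() + 1
    image = np.zeros((h, w, 3), dtype=np.uint8)
    image[pix[:, 1], pix[:, 0]] = col        # last point per pixel wins in this sketch
    return image
```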


Figure 2.6. Sample patch of a LiDAR-based top-down image. The image includes parking spots of multiple kinds: gray ones, which are typical parking spots, blue ones that are accessible (disabled) parking spots, and purple ones that are reserved for women. There are two dark blocks, which are entrances to an elevator and stairs. Black spots next to some parking spots are cross-sections of the pillars of the parking garage. This top-down image can be georeferenced and used for extraction of different features, such as the ones already mentioned, in the earth's coordinates. Noise resulting from uneven scanning can be seen around the middle line of each parking section.

Point cloud data has also been used directly with different deep learning networks (Li et al. 2018; Qi, Su et al. 2016; Qi, L. Yi et al. 2017; Y. Zhou and Tuzel 2017). Many of these methods utilize the three-dimensional shape of an object to classify it. However, parking spots have no 3D shape, and examining them in 3D would add unwanted complexity to the learning process. Hence it is better to simplify the learning to a 2D case for parking spot extraction.

2.7 Related Research Problems

When inspecting prior work done regarding parking spot extraction, most of the published work is not aiming to extract parking spot coordinates per se, but looking at parking spot occupancy (Amato, Carrara, Falchi, Gennaro, Meghini et al. 2016; Amato, Carrara, Falchi, Gennaro and Vairo 2016; Fabian 2013). This parking spot occupancy research aims at detecting a parking spot and determining whether there is a car on it or not. This information can be used to keep count of vacant parking spots in any given parking lot, which can help drivers find a spot to park in with less effort. All of this is done in real time using video, or in any time frame with satellite images, for instance. The set of computer vision models used in parking spot occupancy detection is diverse, ranging from more traditional approaches (Hsu and Chen 2019; Tătulea et al. 2019) to deep learning models (Acharya et al. 2018; Amato, Carrara, Falchi, Gennaro and Vairo 2016). The more conventional approaches often make many assumptions about the structure of a parking spot. The reason for this is that those approaches are seldom learning-based, where successful assumptions would be found more or less automatically. More flexible and especially deep learning methods do not require many assumptions at all, but the amount of data required for quality results can be very high (Shahinfar et al. 2020).

Another related research area in terms of parking features is parking space extraction.

This research aims to find parking spaces from, for instance, satellite images. There is a dataset called APKLOT that includes hundreds of images of parking spaces in outdoor parking areas (Hurst-Tarrab et al. 2020). Parking space extraction is a slightly different task from parking spot extraction, since parking spaces come in many different sizes, whereas parking spots typically have roughly the same size. However, extracting parking spaces first would at least reduce the size of the search space.


3. CONSIDERED METHODS

This chapter goes over the considered approaches that did not make it into the proposed method. It also looks at scenarios where these approaches might work, if at all.

3.1 Instance Segmentation

Semantic segmentation has been used for quite some time now to segment objects out of images. The difference between semantic segmentation and object detection is that instead of generating a bounding box for the detected object, in semantic segmentation, each pixel will be classified (Guo et al. 2020). That is why semantic segmentation is sometimes called pixel-wise classification.

The limitation of semantic segmentation is that all the objects belonging to the same class will be labeled as one object. So in a case where there are, for example, two cats next to each other in an image, the semantic segmentation algorithm will not only consider them to be of the same class but also to be the very same object. The case where these instances of classes are classified and segmented separately, which combines object detection and semantic segmentation, is called instance segmentation (J. Yi et al. 2019).

In the case of parking spot extraction, instance segmentation seems like an obvious approach to go for. It does all that is required for the parking spot extraction task: given a top-down image as input, all the parking spots are classified as separate instances pixel-wise, which can then be used to generate coordinates for the parking spots.

The weakness of this method in parking spot extraction, however, is the extensive need for diversity in the training data. Parking spots inside one garage might be quite easily segmented by an instance segmentation model, but when the model is given novel data from a parking garage with differently styled and shaped parking spots, it cannot seem to find a way to segment the area properly. This is mainly a problem of diversity in the data and would be resolved in the future when more scanned parking garages can be utilized in training.

These types of problems could be fixed with more training data, and the approach seems to work, for example, in the case of Nvidia's parking spot extraction (Nvidia 2020). At the time of writing this thesis, there was not enough available data to utilize this sort of approach for LiDAR top-down images. Typically instance segmentation models use dozens if not hundreds of thousands of labeled instances (Cordts et al. 2016; Waqas Zamir et al. 2019; Xia et al. 2018). The most common benchmark dataset used in instance segmentation, the COCO dataset, includes 328 000 images with over 2.5 million labeled instances spanning 80 different classes (Lin et al. 2015). The number of labeled instances that could be gathered from the available training data would be more than a thousand, but certainly less than ten thousand. In addition, using parking spots from only two to three different parking garages makes the dataset somewhat unbalanced.

3.2 Edge Detection and Hough Transform

Traditional approaches were also considered for parking spot extraction. Intuitively, utilization of edge detection with Hough transform line detection (Duda and Hart 1972) seemed like a decent approach. Parking spot lines are first detected using an edge detection algorithm, such as Canny (Canny 1986), and then turned into a list of lines using Hough transform line detection (Duda and Hart 1972). This list is then post-processed so that the end coordinates of the parallel lines become the corner coordinates of the parking spots.
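A sketch of this baseline with OpenCV follows; the file name and threshold values are illustrative only, and, as discussed below, no single set of parameter values worked well across the LiDAR top-down images:

```python
import cv2
import numpy as np

gray = cv2.imread("topdown.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

edges = cv2.Canny(gray, 50, 150)                          # Canny edge map
lines = cv2.HoughLinesP(
    edges,
    rho=1,                 # distance resolution in pixels
    theta=np.pi / 180,     # angular resolution in radians
    threshold=80,          # minimum number of votes
    minLineLength=60,      # discard very short segments
    maxLineGap=10,         # allow small gaps within a line
)

# Each detected line is (x1, y1, x2, y2); parallel pairs would then be
# post-processed into parking spot corner candidates.
for x1, y1, x2, y2 in (lines.reshape(-1, 4) if lines is not None else []):
    cv2.line(gray, (x1, y1), (x2, y2), 255, 1)
```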

The problem with this approach lies in the disadvantages of LiDAR data: many parking spots do not have adequate lines visible. One could attempt to rectify this by preprocessing the top-down image using, for example, morphological closing, dilation, or something similar, although that might blur the image too much for the edge detection to work well. In addition, no matter what thresholds are selected for the Canny edge detector, some detected lines will have nothing to do with the parking spots themselves. These lines could be, for instance, walls and other markings on the parking garage floor. Furthermore, the parking lines that can be detected are rarely detected in their entirety by the edge detector due to insufficient choices of parameter values. These problems are visualized in Figure 3.1 below.


Figure 3.1. Lines detected with the Canny edge detector and Hough lines in a morphologically closed top-down image. The closing operation reduced the noise of the top-down image drastically. However, it was unable to fill in the blanks left by the noise removal, and hence the detected edges of most of the parking spots are not fully complete.

Another thing that has to be considered with this approach is that the Canny edge detector is not a learning-based method. This makes it very difficult to handle parking lines in an environment with different contrast or peculiar lighting conditions, where the old threshold values no longer work well.

This approach could work, although that would require clean data without the noise and sudden contrast changes typically present in LiDAR top-down images. If Canny cannot detect the edges properly, one has to stitch small detected edges together with Hough transform detected lines, and that can easily generate lines in undesired locations.

Some deep learning-based edge detection approaches were also tried, but with unconvincing results. The problem lies in the fact that the algorithm will still detect many edges that are not part of parking spots. In addition, some blurry or noisy parking lines were not detected, resulting in the same problems as with the conventional edge detector.

In conclusion, edge detection with Hough transform line detection as an approach for parking spot extraction is not flexible and robust enough. The problem of occlusion and noise in LiDAR top-down images almost necessitates using an approach that can learn from data. One can try different parameter values for the Canny edge detector and the Hough transform line detector, but no matter what values are tested, the best one can hope for is decent results on that one specific image. If the same parameter values are applied to another top-down image, especially one from a different parking garage, chances are that a great part of the detected lines are undesirable, or most parking lines are not detected at all.

3.3 Parking Space Proposals

One way of reducing the search space for any parking spot approach is to find parking space proposals. In essence, this would mean detecting the parking spaces that include multiple parking spots in the top-down image and using those for parking spot extraction.

The APKLOT dataset, already mentioned in Section 2.7, includes satellite images of outdoor parking lots (Hurst-Tarrab et al. 2020). By training a YOLOv5 object detector (Jocher et al. 2021) with APKLOT, one would assume that parking spaces would become easily detectable from LiDAR top-down images as well.

One inference attempt with an APKLOT-pretrained YOLOv5 is in Figure 3.2. As can be seen from the figure, some parking spaces were not detected at all, and some detections had only some of the parking spots inside the bounding box. This is quite detrimental, since one undetected parking space means losing many potential parking spot detections.

Furthermore, if one looks at the confidence of each detection next to the detected parking space in Figure 3.2, one can see that the detections are not very confident. In essence, this means that the object detector did not generalize very well from APKLOT to indoor LiDAR top-down images. The problem also lies in the fact that indoor parking sections are a bit different from outdoor parking spaces: outdoor spaces usually have rectangular shapes, but indoor spaces might have walls and other obstructions between parking spots that result in shapes with more than four corners. This might confuse the model even further.

It also seems that the model isn’t quite sure where the boundaries of a parking space lie.

For example, in Figure 3.2 the top-left-most detection only includes two parking spots, when it should contain the whole row. This effect is most likely due to parking spots being so simple in their visual structure that it is hard for the model to find a good reference point on the garage floor for the endpoint of a specific parking space.

To use parking space proposals effectively, one should be able to detect all parking spaces in some manner. The complete set of parking spots in the top-down image is not extracted if one of the parking spaces is undetected. There is no point in using this approach without retraining with indoor parking garage parking spaces if the results are even remotely similar to the one visualized in Figure 3.2.

Figure 3.2. Parking space proposals detected with YOLOv5. For the most part, the confidences of the detections are poor, and many of the parking sections were either left unproposed or not fully detected.

3.4 Manual Annotation

The last considered method was the one that the whole problem was trying to get rid of: manual annotation. If annotating the parking spots by hand works better in all aspects than the other methods, there is no point in complicating things with machine learning-based approaches.

However, the most significant caveat with manual annotation is the amount of time it takes. Some reports from the company’s labeling team indicated that the labeling of parking spots using Adobe Illustrator (Adobe Inc. 2019) could easily take dozens of hours for a single parking garage. All of this results in a lot of billable working hours.


There is also something called efficient annotation, where the annotation process uses machine learning approaches to detect objects similar to the already labeled ones. These semi-automated annotation approaches can improve the annotation time by up to ten-fold. However, humans still need to perform the annotation, albeit with less effort. (Liao et al. 2021)

All of the above becomes a significant problem when the scale of the parking garage map product increases. Suppose multiple parking garages were received for manual annotation each day. In this case, there would be only one way for manual annotation to keep up with the incoming data: hiring more employees to do manual labeling. This solution is undesirable and only temporarily fixes the problem, since the scaling could continue exponentially and for a long time. With computer vision methods, one can continually expand the cloud-based extraction pipeline to use more nodes to scale the system if necessary. Expanding the cloud system is much more cost-efficient than hiring new employees and will become a problem much later, if exponential scaling occurs. There is also the possibility of extracting parking spots from the parking garage levels in parallel in the cloud. This way, annotating a single parking garage takes only as long as its largest level in terms of area. Hence, with a proper implementation, machine learning-based annotation time is not necessarily dependent on the number of floors in the parking garage.


4. PROPOSED METHOD

This section introduces the proposed method for parking spot extraction. The main assumptions that this method makes are

• In LiDAR point cloud-based top-down images generated with a fixed voxel size, parking spots are approximately of size 170 × 80 pixels

• Parking spots are of rectangular shape

• Parking spots are separated from each other by a parking line of white or some other bright color

The first assumption is quite robust. Since one can decide the resolution of the top-down slice of the LiDAR point cloud, the projected top-down image will always have the same resolution. As long as the real-world parking spots are roughly the same size, the assumption about these dimensions in image coordinates does not limit the method excessively.

One could criticize any of these assumptions, but a decent set of assumptions is necessary when there is a lack of training data. Otherwise, more dynamic and abstract assumptions would have to be inferred from the diversity and depth of the training data alone. These assumptions were found to be very effective, although the second point about the rectangular shape of the parking spot made certain top-down parking garage images instantly unusable, since they primarily contained parking spots that were, for example, parallelogram-shaped. Extending the model to generalize to those kinds of cases is left for future work.

4.1 Overall Architecture

Given the assumptions listed above, the proposed method was implemented and grounded on those assumptions. The overall architecture of the proposed method is shown in Figure 4.1 below.


Figure 4.1. The overall architecture of the proposed method: preprocessing to prepare the top-down image for detection, proposing parking spots using the sliding window method with HOG features and logistic regression classification, verifying the proposals using a convolutional neural network, and cleaning up the detections in the post-processing phase.

As can be seen from Figure 4.1 above, there are in total four stages in the overall architecture. More detailed descriptions of each pipeline stage are given in Sections 4.2-4.5. Pseudocode for the complete proposed method is in Appendix A.

4.2 Preprocessing

Before the top-down image is fed into the later stages, it has to be preprocessed. One of the problems that many top-down LiDAR parking garage images have is that the horizontal parking spots are at a slight angle, say between -2 and 2 degrees. For the post-processing stage to work, all the horizontally rotated parking spots should be as close to 0 degrees in their rotation as possible. In essence, this means that the image has to be rotated slightly.

This slight rotation was implemented by using Hough transform line detection to find straight lines in the image. These lines had to be at least 250 pixels long and continuous. Subsequently, the angles of the lines are calculated and collected into a list. Outliers are then removed from the list, and the median angle is taken as the angular offset of the image. Finally, the image is rotated in the opposite direction by the found median angle, resulting in a deskewed top-down image.
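A sketch of this deskewing step with OpenCV follows; the 250-pixel minimum length comes from the description above, while the other parameter values, the simple ±10 degree cutoff used as outlier rejection, and the rotation sign convention (which may need flipping for image coordinates) are assumptions:

```python
import cv2
import numpy as np

def deskew(gray):
    """Rotate the top-down image so that near-horizontal lines become horizontal."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=250, maxLineGap=5)
    if lines is None:
        return gray

    # Angle of each detected line segment in degrees.
    angles = np.array([np.degrees(np.arctan2(y2 - y1, x2 - x1))
                       for x1, y1, x2, y2 in lines.reshape(-1, 4)])

    # Keep only near-horizontal lines as a crude outlier rejection, then take the median.
    angles = angles[np.abs(angles) < 10]
    if angles.size == 0:
        return gray
    offset = float(np.median(angles))

    h, w = gray.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), offset, 1.0)  # rotate back by the offset
    return cv2.warpAffine(gray, rot, (w, h))
```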

There is also a need for specific processing of the top-down image in terms of noise. The HOG-based proposals covered in Section 4.3 do not need any denoising operations, although the CNN used after them utilizes a morphological dilation operation to strengthen the parking lines of the top-down image. In essence, this means that two copies of the deskewed original image are generated, one for the HOG detector and one for the CNN classifier.

4.3 Parking Spot Proposals

The first actual detection stage of the proposed method mainly proposes a set of possible parking spots. A sliding window scans the top-down image with a fixed window size, and a pre-trained classifier gives a set of parking spot proposals. The step size had to be small enough that reasonably confident parking spot proposals could be found.

When the confidence exceeded a certain threshold, the area near the detection was scanned more precisely for an improved detection. This algorithm is explained in more detail in sub-chapter 4.3.2.

To deal with vertical and otherwise rotated parking spots using a sliding window method with a fixed window size, one has to propose a set of angles to search. This is done by running an edge detector on the top-down image and detecting lines from the edges with the Hough transform. A selected threshold then determines the most common angles in the set of lines. Those angles are searched by the sliding window by rotating the image appropriately, detecting the parking spots at that angle, and converting the detected parking spot coordinates back to the original top-down image coordinates.

4.3.1 HOG Features

As was previously discussed, parking spots have a somewhat simple visual structure, and therefore, accurate detection is quite challenging amidst all noise and occlusion.

However, specific manipulations can be done to the input data so that the detector can detect the parking lines more reliably. One of these manipulations is to compute the histogram of oriented gradients features of a given input, a method that was explained in detail in sub-chapter 2.2. A visualization of the histogram of oriented gradients in the case of a parking spot is in Figure 4.2 below.

As can be seen from Subfigure 4.2b, parking spot lines are pronounced in the HOG features. This is exactly the desired effect, since LiDAR top-down images might be too blurry or have too high a contrast variance for a model to utilize without any prior feature extraction. If the sliding window size is selected appropriately, the extracted HOG features will be fairly consistent between parking spots.



Figure 4.2. Visualization of HOG features for parking spot extraction. The contrast changes around the parking spot lines are stored into the feature vector.

4.3.2 Classifier

With just dozens or a few hundred samples of training data, one can train a classifier that takes in a vector of HOG features and learns the visual structure of a parking spot in a robust manner. One could use any classifier on the HOG features, for example a support vector machine (SVM) (Cortes and Vapnik 1995) or a random forest (Ho 1995). This method uses a logistic regression classifier, which is explained in more detail in sub-chapter 2.3.

The selection of the classifier was mainly done experimentally. Since logistic regression is relatively simple in comparison to many other classifiers, it can produce good results with a limited amount of training data. In addition, given that the purpose of this stage is to propose parking spots, logistic regression gives out a relatively uniform spread of probability values for its predictions, whereas, for example, a support vector machine tends to avoid the middle ground between 0 and 1 in binary classification. Therefore logistic regression makes it possible to select a threshold probability that lets through even the less confident proposals, which might still be parking spots, while a support vector machine would score occluded proposals well under that threshold due to the margin it enforces between the two classes in the feature space.


Figure 4.3. Visualization of the method used for detection accuracy improvement. The original detection is a reference point around which a better parking spot candidate is searched.

The sliding window used a certain step size, but to get the parking spot's sidelines detected as accurately as possible, the step size in the vertical direction was half of the original step size. If the sliding window and the HOG-based logistic regressor found a relatively confident proposal for a parking spot (say with confidence over 0.5), the area close to that detection was searched for an even more confident detection. If a more confident detection was found, it was used as the detected parking spot for that sliding window instance. If the position of the sliding window is (x_s, y_s), the search area extended from (x_s − stepsize/2, y_s − stepsize/4) to (x_s + stepsize/2, y_s + stepsize/4). This search process is demonstrated in Figure 4.3 above.

With the help of this improvement method, it was possible to use a slightly bigger step size to make the inference process faster. The search was more precise in the vertical direction because of the shape of the sliding window and of the parking spot being searched; modest errors in the vertical direction could easily push the other parking line outside the current sliding window.
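The refinement search around a confident detection could be sketched as below; score_window is an assumed helper that evaluates the HOG-based classifier at a given window position, and the loop bounds mirror the search area described above.

def refine_detection(image, xs, ys, score_window, step_size):
    """Search the neighbourhood of (xs, ys) for a higher-confidence window position."""
    best_x, best_y = xs, ys
    best_score = score_window(image, xs, ys)
    for dy in range(-step_size // 4, step_size // 4 + 1):
        for dx in range(-step_size // 2, step_size // 2 + 1):
            score = score_window(image, xs + dx, ys + dy)
            if score > best_score:
                best_x, best_y, best_score = xs + dx, ys + dy, score
    return best_x, best_y, best_score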

4.4 Verification of Proposals with Convolutional Neural Networks

The earlier stage gave a set of parking spot proposals, but there was a need to have a model to accurately verify or discard those proposals. It would also be ideal to have a model that looks at the data from a slightly different perspective than the HOG-based image classifier.

CNNs often yield excellent results in binary image classification, especially if the image size is kept relatively small. The number of parameters in the CNN then stays low, which makes the model less prone to overfitting and faster to converge.


The proposed CNN, TilhiNet, includes four convolutional blocks with batch normalization, ReLU activation, and max pooling, after which there is a dense layer with eight nodes and another dense layer with two nodes and softmax activation for classification. The more detailed structure of TilhiNet is in Table B.1. The model has in total 34 064 parameters, of which 33 872 are trainable parameters.
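Since Table B.1 is not reproduced here, the following Keras sketch only approximates a TilhiNet-like verifier: four convolutional blocks with batch normalization, ReLU and max pooling, followed by an eight-node dense layer and a two-node softmax layer. The filter counts, hidden-layer activation and input size are assumptions, so the parameter count will not match the reported 34 064 exactly.

from tensorflow import keras
from tensorflow.keras import layers

def build_tilhinet_sketch(input_shape=(96, 192, 1)):      # grayscale input assumed
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters in (8, 8, 16, 16):                         # four convolutional blocks
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(8, activation='relu')(x)
    outputs = layers.Dense(2, activation='softmax')(x)     # parking spot / not a parking spot
    return keras.Model(inputs, outputs)

model = build_tilhinet_sketch()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])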

To create a set of images for the CNN to work with, one takes the top-down image used in the proposal phase and crops the proposals out of it. The crops are slightly larger than the proposed bounding box, since the CNN classifies more easily when it can see more of the context around the proposal. These crops are then fed into TilhiNet, which outputs a binary classification of each proposal as a parking spot or not a parking spot. The positively labeled proposals are preserved and the negatively labeled ones are discarded.

For the sake of simplifying the model, TilhiNet was made to take in grayscale images to remove possible tone and color variance between parking garages. This grayscaling also removes possible color-coding confusion, where an unseen parking garage might assign different meanings to differently colored parking spots.
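A hedged sketch of these two steps, cropping a proposal with extra context and converting it to grayscale before verification, is given below; the margin value and function names are illustrative assumptions.

import cv2

def crop_for_verification(image, x, y, w, h, margin=0.15):
    """Crop the proposal box enlarged by a relative margin and convert it to grayscale."""
    mx, my = int(w * margin), int(h * margin)
    x0, y0 = max(x - mx, 0), max(y - my, 0)
    x1 = min(x + w + mx, image.shape[1])
    y1 = min(y + h + my, image.shape[0])
    crop = image[y0:y1, x0:x1]
    if crop.ndim == 3:                     # color image
        crop = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    return crop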

For this verification process to be beneficial, the CNN model must have very high accuracy. If the model does not perform well enough, too many true positive proposals are lost and too many false positives are preserved, which undoes the work of the HOG-based proposal stage.

4.5 Post-Processing

After the parking spot proposal verification, a reliable way was needed to fix possible errors made by the previous stages. This stage was applied separately to each of the angles found in the preprocessing phase.

Given that the sliding window uses a specific step size, there is always a possibility that the classifier used under the sliding window will classify the parking spot one or a couple of steps too early or too late. This problem was rectified by creating a list of parking spots, each identified by its top-left (x, y)-coordinate. Two histograms were then built, one of the different x-values and one of the y-values.

For rows of parking spots in an aligned parking garage, there would be a clear peak in the x-value histogram. One would go through the detected parking spots, take their top-left coordinates, and if a coordinate was for some reason shifted away from the major row indicated by the histogram, it was shifted to that row. This was done for parking spots that were one to four step sizes away from the major row. The same procedure was applied to the y-coordinates to align the columns.
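The alignment step could be sketched roughly as follows, operating on the list of top-left corners described above; the decision rule for when to snap a coordinate is an illustrative assumption.

from collections import Counter

def snap_coordinate(value, histogram, step_size, max_steps=4):
    """Snap a coordinate to a clearly more popular value that is one to four steps away."""
    for major, count in histogram.most_common():
        steps = abs(value - major) / step_size
        if count > histogram[value] and 1 <= steps <= max_steps:
            return major
    return value

def align_detections(top_left_corners, step_size):
    """Align detected parking spots to the dominant rows and columns."""
    x_hist = Counter(x for x, _ in top_left_corners)
    y_hist = Counter(y for _, y in top_left_corners)
    return [(snap_coordinate(x, x_hist, step_size),
             snap_coordinate(y, y_hist, step_size))
            for x, y in top_left_corners]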


(a) The window given to the HOG classifier. (b) Coordinates of the parking spot.

Figure 4.4. Comparison between a HOG sliding window instance and the desired crop based on the extracted coordinates. There is no HOG feature activation at the top and bottom edges of the right image. For the classifier to have enough information, those edges are required in the original sliding window phase but are cropped out later for cleaning purposes.

The second step in the post-processing phase was rescaling the detected parking spots. For the HOG, it is better to get full coverage of the parking lines of a spot, as shown in Figure 4.4, but to keep the detections from overlapping, they have to be scaled down slightly. The aim was to rescale the detections so that their edges would lie in the middle of the painted parking lines in the top-down image.

The third step was connecting parking spot coordinates that lie close to each other. If the corner coordinates of two different parking spots are closer than a threshold c, both corners are shifted to the midpoint between them. In this way there are no gaps between parking spot detections. Since this procedure may unintentionally make the parking spots non-rectangular, the spots were forced back into a rectangular shape at the end.
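A sketch of this corner-merging step is given below; the threshold value, the corner representation (four (x, y) corners per spot in the rotated, axis-aligned frame) and the way rectangles are restored through bounding boxes are illustrative assumptions.

import numpy as np

def merge_close_corners(spots, c=10.0):
    """Merge corners of different spots that are closer than c, then restore rectangles."""
    spots = [np.asarray(s, dtype=float) for s in spots]   # each spot: 4 x 2 array of corners
    for i in range(len(spots)):
        for j in range(i + 1, len(spots)):
            for a in range(4):
                for b in range(4):
                    if np.linalg.norm(spots[i][a] - spots[j][b]) < c:
                        mid = (spots[i][a] + spots[j][b]) / 2.0
                        spots[i][a] = mid
                        spots[j][b] = mid
    # Force each spot back to a rectangle via its axis-aligned bounding box.
    return [np.array([[s[:, 0].min(), s[:, 1].min()],
                      [s[:, 0].max(), s[:, 1].min()],
                      [s[:, 0].max(), s[:, 1].max()],
                      [s[:, 0].min(), s[:, 1].max()]]) for s in spots]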

This post-processing pipeline assumes that the parking spots are arranged in rows or columns and that the HOG classifier did not make too many mistakes in the detection process. If the classifier made many mistakes, detections could be shifted from the correct row to a wrong one, which is not desired. Parking spots were shifted only if they were one to four steps away from a major row, because anything further away points to a weakness of the HOG classifier itself rather than something post-processing should compensate for. The broader the scan area used for shifting, the higher the probability of shifting something that should not be shifted. This step can fix minor errors but will not cure all the problems that the HOG classifier might cause.


5. EXPERIMENTS

This chapter presents the experiments and results of this thesis. The experiments cover data gathering and the training of both the HOG-based detector and the TilhiNet verification network. For the results, five top-down images from three different parking garages were tested with the proposed method.

5.1 Data Gathering

For training, there were top-down images from two different parking garages, located in Munich and Tampere. A third parking garage, located in Oslo, included mostly parallelogram-shaped parking spots, which meant that most of that garage could not be utilized, whereas the other two had plenty of rectangular parking spots. The top-down images were generated using external point cloud processing software unknown to the author.

The first fifty samples for the HOG were cropped manually out of the training images. The crops were put into two folders, positive and negative, indicating whether or not the image included a complete parking spot. Examples of the kinds of images in both folders are shown in Figure 5.1 below. The top row includes negative and the bottom row positive samples.

After these samples were in their corresponding folders, all the samples were resized to size (180, 90) so that a fixed-size HOG feature vector could be generated from them. The reasoning behind this size is that most of the manual crops were very close to this value, and as was mentioned in Chapter 4, one of the assumptions behind the method was that a parking spot in a LiDAR top-down image is roughly 170 × 80 pixels. This window size made sure that the whole parking spot would fit inside the crop.

To extend the dataset further without excessive manual work, proposed parking spot samples were cropped out of the top-down image and saved after each HOG detector test. The only manual work required afterwards was to move each image to its corresponding folder, either positive or negative. Using this method, one could find weaknesses of the HOG detector and try to minimize them by retraining. This is sometimes called hard negative mining.



Figure 5.1. Example samples used for HOG-detector training. The upper row includes negative samples and the bottom row positive samples. The parking spot had to be fully in the crop to classify as positive.

The TilhiNet data was slightly different, although it was gathered similarly to the HOG proposal data, using hard negative mining. TilhiNet was trained with crops containing slightly more context than those of the HOG detector. This cropping was mainly meant to make the CNN look at the proposal from a slightly different perspective and with greater context than the HOG detector. The difference between a typical HOG training sample (a) and a TilhiNet sample (b) is visualized in Figure 5.2 below.

In the end, the HOG data included 230 positive and 403 negative samples, yielding 633 training samples in total. The performance of the model plateaued quite early with the existing data, and therefore adding more training data for the HOG seemed pointless, since all the training data came from only two parking garages. Seeing a similar type of parking spot five hundred times as opposed to one hundred times does not really improve performance, and this was clearly observed during HOG detector training.

For TilhiNet, there were 1712 training samples, of which 875 were positive and 837 negative. This difference in class balance between the two datasets was intentional and reflects the goals of the two stages: the HOG-based classifier needs more negative examples to learn what not to propose to the next stage, whereas TilhiNet needs a lot of positive examples to cover the whole variety of parking spots and not lose any true positives. In addition, with this balance the HOG-based classifier can discard most of the false positives, so the majority of the proposals reaching TilhiNet are true positives. TilhiNet therefore has to be especially good at recognizing true positives to yield the greatest improvement in recall.

There was also an additional testing dataset made for TilhiNet performance measurements. It included 424 samples from the Munich garage, 329 of which were positive and 95 negative. This was done to test the model with samples outside the training data.
