
EVALUATION OF SIMULTANEOUS LOCALIZATION AND MAPPING (SLAM) ALGORITHMS

Master of Science Thesis
Faculty of Engineering and Natural Sciences
Examiners: Associate Professor Reza Ghabcheloo, Jukka Yrjänäinen
December 2021


ABSTRACT

Chinna Bharath Garigipati: Evaluation of Simultaneous Localization and Mapping (SLAM) Algorithms

Master of Science Thesis
Tampere University
Degree Programme in Automation Engineering, MSc (Tech)
December 2021

Simultaneous Localization and Mapping (SLAM) is a core component for the successful implementation of most autonomous robotic systems, such as unmanned ground and aerial vehicles and industrial mobile robots. It is crucial to have a robust localization and mapping system if the robot's navigation depends on it while moving through an unknown environment. However, the performance of a SLAM system varies based on the suitability of the algorithms to factors like environmental conditions, the types of available sensors, dynamic motion, the on-board computational capability of the robot, and more.

This thesis goes through the typical structure and components of a SLAM system and the factors affecting its performance. Various popular state-of-the-art LIDAR and visual SLAM algorithms, as well as odometry algorithms that do not include loop closure, are studied. The algorithm performance is tested using datasets from experiments conducted during the thesis. The custom datasets are collected using a multi-sensor setup including LIDARs, stereo cameras, GNSS, and IMU sensors. Various experiments were conducted to simulate a few real-world conditions that a typical SLAM system could face, like rough terrain, different sensor elevations, varying robot velocities, and different mounting positions. Finally, the SLAM results on the custom data and the computational efficiency of the selected algorithms are analyzed using absolute position error and by recording the computational resource usage during runtime.

The purpose of this thesis is to provide an insight into the building blocks of the current state-of-the-art visual and LIDAR SLAM systems, to understand how external factors affect the systems' performance and how they compare with each other, and to create a multi-sensor dataset that can simultaneously test visual and LIDAR SLAM systems and be utilized for later research.

Keywords: Simultaneous Localization and Mapping, SLAM, Localization, Odometry, visual SLAM, LIDAR SLAM

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

I would like to thank my supervisors, Assoc. Prof. Reza Ghabcheloo and Mr. Jukka Yrjänäinen, for their guidance while writing my thesis and for providing the opportunity to work on this interesting topic while working as a research assistant in their group. I am also thankful for Prof. Reza Ghabcheloo's mentorship throughout my Master's degree.

Furthermore, I would like to thank Nikolay Serbenyuk for his help in building the sensor setup and outdoor data collection with the GIM machine, and Tuomas Välimäki for his help with the Optitrack system.

Finally, I would like to thank my parents for encouraging my choices and for their constant support throughout my studies.

Tampere, 7th December 2021

Chinna Bharath Garigipati


CONTENTS

1. Introduction
2. Introduction to Visual and LIDAR SLAM
   2.1 Typical Structure and Components of a SLAM System
   2.2 Types of Sensor Data
       2.2.1 Image Data
       2.2.2 Point Cloud Data
       2.2.3 Complementary Sensors Data
   2.3 Visual SLAM Front-End
       2.3.1 ORB Feature Detector
       2.3.2 SuperPoint Detector and Descriptor
       2.3.3 KLT Optical Flow Tracker
   2.4 LIDAR SLAM Front-end
       2.4.1 Point Cloud Feature Extraction
       2.4.2 Iterative Closest Point (ICP)
       2.4.3 Normal Distributions Transform (NDT)
   2.5 Back-end Optimization
       2.5.1 Bundle Adjustment
       2.5.2 Factor Graphs
3. Methodology and Experimental Setup
   3.1 Selection of SLAM Algorithms
   3.2 Overview of the Chosen SLAM Algorithms
       3.2.1 SVO2
       3.2.2 ORB SLAM 3
       3.2.3 BASALT VIO
       3.2.4 Kimera VIO
       3.2.5 LOAM
       3.2.6 LeGO LOAM
       3.2.7 HDL graph SLAM
       3.2.8 LIO SAM
   3.3 Data Collection Setup
       3.3.1 Sensors Setup
       3.3.2 Sensors Calibration
       3.3.3 Ground Truth / Reference Trajectory
       3.3.4 Error Metrics
4. Data Collection Experiments and Results
   4.1 Outdoor Tests
       4.1.1 Loop with Terrain Change
       4.1.2 Loop with Speed Change
   4.2 Indoor Tests
       4.2.1 Indoor Square Loop
       4.2.2 Straight Line with 360-Degree Rotation
       4.2.3 No Motion with Moving Objects
   4.3 CPU and Memory Usage of SLAM Algorithms
5. Conclusion
References


LIST OF SYMBOLS AND ABBREVIATIONS

APE Absolute Pose Error
BRIEF Binary Robust Independent Elementary Features
DSO Direct Sparse Odometry
FAST Features from Accelerated Segment Test
GPS Global Positioning System
GTSAM Georgia Tech Smoothing and Mapping
ICP Iterative Closest Point
IMU Inertial Measurement Unit
iSAM Incremental Smoothing and Mapping
KLT Kanade–Lucas–Tomasi
LeGO LOAM Lightweight and Ground-Optimized Lidar Odometry and Mapping
LIDAR Light Detection and Ranging
LOAM LIDAR Odometry and Mapping
LSD Large-Scale Direct Monocular SLAM
MAP Maximum-a-Posteriori
MIR Mobile Industrial Robot
NDT Normal Distributions Transform
ORB Oriented FAST and Rotated BRIEF
RADAR Radio Detection and Ranging
RANSAC Random Sample Consensus
RGB Red-Green-Blue color model for images
RGB-D Red-Green-Blue Depth model for depth images
RMSE Root-Mean-Square Error
ROS Robot Operating System
RPE Relative Pose Error
SIFT Scale-Invariant Feature Transform
SLAM Simultaneous Localization and Mapping
SONAR Sound Navigation and Ranging
STD Standard Deviation
SURF Speeded-Up Robust Features
SVO Semi-direct Visual-Inertial Odometry
TAU Tampere University
TUNI Tampere Universities
VIO Visual-Inertial Odometry


1. INTRODUCTION

Localization and mapping are two of the essential components of autonomous robots.

They help the robot understand its position in the environment with respect to its surroundings and store the detected features on a map. Navigation modules can use the pose and map output to perform path planning, navigation tasks, obstacle avoidance, and more. Nonlinear state estimation methods such as the extended Kalman filter (EKF) and the unscented Kalman filter (UKF), using wheel odometry, inertial measurement unit (IMU), and global positioning system (GPS) data, are commonly used to estimate the pose of the robot [1].

However, developments in simultaneous localization and mapping (SLAM) systems promise increased robustness of localization in previously unexplored environments, both indoors and outdoors, all while generating a map of the newly visited environment using sensors such as cameras and LIDARs. This technology has a key role in applications such as unmanned aerial vehicles (UAVs), unmanned ground vehicles (UGVs), autonomous cars, service robots, and virtual and augmented reality, where it is used to track the user's pose. SLAM is a still-evolving technology; in this thesis, we introduce and test a few popular SLAM methods and compare how they stand against each other.

Many SLAM algorithms share similar individual building blocks; this thesis goes through the structure and components of different visual and LIDAR-based SLAM and odometry systems. We present the similarities in how the individual pieces are utilized and emphasize which design choices make a particular SLAM system different from the rest; this is important to understand in order to pick a SLAM system for a specific application. During the data collection experimentation, we test multiple challenging scenarios to compare how the selected algorithms perform. Not all SLAM algorithms utilize or fit the same sensor models or types for their estimation, and popular public benchmark datasets are either meant for testing visual-inertial SLAM only [2] or do not provide variations of the same type of sensor [3]; a fair comparison therefore requires data collected simultaneously from all the necessary sensors during experimentation. Hence, for a fair comparison of the performance, we used a multi-sensor setup including two LIDARs, two stereo cameras, a global positioning system (GPS), and a nine-axis inertial measurement unit (IMU) for data collection. The collected data is used to evaluate the 6 degrees of freedom (DoF) localization performance through absolute pose error (APE) and relative pose error (RPE) with respect to the reference ground truth trajectories. We also analyze how the design choices made in these SLAM systems affect the accuracy of the trajectories they generate.

The initial studies in this thesis aim to provide an insight into the workings of various modern visual and LIDAR SLAM systems and how they compare with each other. Through the conducted data collection experiments, the ideal use cases for the selected SLAM systems are also identified based on factors such as sensor mounting positions, environmental changes, and available computation resources; this can serve as a simple guide for selecting a SLAM system to deploy in any specific application that matches the tested conditions.

The structure of this thesis is as follows. Section 2 presents the theoretical background on visual and LIDAR SLAM methods; it first describes the types of sensor data and their differences, and then the general layout of current state-of-the-art SLAM systems and a few of their essential components. Section 3 explains the methodology used for selecting the SLAM algorithms for testing and then briefly explains how they work; it also goes through the data collection, ground truth reference systems, and evaluation metrics used in this thesis. Section 4 describes the design of the data collection experiments, their execution, and the performance of the algorithms on those datasets. Finally, section 5 presents the concluding remarks of this thesis.


2. INTRODUCTION TO VISUAL AND LIDAR SLAM

Modern SLAM systems follow a loose general structure, and there are also some well-established concepts widely utilized in various stages of the SLAM pipeline. Understanding this general framework and the key concepts helps us recognize the novelties that each SLAM system provides, which in turn helps us analyze the performance differences and the suitability of the SLAM algorithms to a given application environment. The following chapter briefly describes the general structure and a few standard building-block components of modern visual and LIDAR SLAM systems.

2.1 Typical Structure and Components of a SLAM System

Although various SLAM algorithms use different sensors for state estimation, they usually follow a similar algorithm structure consisting of a front-end and a back-end, as described in [4]. The front-end of a SLAM system varies based on the sensor. Its job is to process the raw sensor data and convert it into an interpretable representation, which can be used for a maximum-a-posteriori estimate of the robot state in the back-end; this process is called data association. To achieve this, the front-end of a typical feature-based visual SLAM algorithm consists of a feature detector; it extracts features from the input image and represents them in the form of descriptors. The same features are tracked over consecutive images, and the transformation between successive images can then be estimated from the tracked features using image pose-estimation techniques [5].

The front-end of a LIDAR-based SLAM needs to first fully form the point cloud by accumulating the incoming points, and it also needs to compensate for the skewing of the point cloud caused by the robot's motion. After the accumulation of point clouds, it typically performs registration or scan matching, which is the process of finding the transformation between two successive point clouds in time. The most common methods for this step are the iterative closest point (ICP) and the normal distributions transform (NDT). Compared to odometry, full SLAM methods also perform loop closure detection in the front-end, based on the estimated poses and map information generated in the back-end.

The back-end of the SLAM system utilizes the data generated from the observations in the data association step of the front-end to perform the maximum-a-posteriori estimate of the robot's SLAM states, which usually consist of a set of discrete poses along the path taken by the robot and the landmark locations. Factor graphs have been the most popular back-end approach in current state-of-the-art SLAM systems [4] for solving the maximum-a-posteriori problem, due to their intuitive representation and their capability to support multi-modality in a sparse structure. On the other hand, visual-inertial bundle adjustment has been the most common method of back-end optimization in visual SLAM systems; bundle adjustment can also be posed as a factor graph problem. In the following sections, we will discuss a few of the essential components of modern visual and LIDAR-based SLAM systems.

Figure 2.1. Components of a SLAM system [4]

2.2 Types of Sensor Data

The types of sensors used in the front-end impact the performance and cost of the simultaneous localization and mapping (SLAM) system due to many factors; hence it is essential to understand the characteristics of various sensors when selecting a sensor configuration for the required application. Current state-of-the-art SLAM research is mainly centered around image-based [6][7][8] and point cloud-based [9][10] SLAM. Other sensor modalities such as RGB-D (Red-Green-Blue depth) [11], radio detection and ranging (RADAR) [12], and sound navigation and ranging (SONAR) [13] are also utilized for SLAM applications. In addition, localization estimates can be improved by complementary sensor information from the Global Positioning System (GPS), an Inertial Measurement Unit (IMU), and wheel encoders [14]. We will discuss image, point cloud, and complementary sensor data in the following sections, as the scope of this thesis is limited to them.

2.2.1 Image Data

Visual and visual-inertial localization and mapping are some of the most common methods currently in use, primarily due to the relatively inexpensive sensors compared to the others. Monocular and stereo camera configurations are usually used; monocular sensors have the advantage of compact size, but their lack of scale perception makes them less desirable for SLAM applications, whereas a stereo camera pair can estimate depth using the disparity between the two cameras [15].

Furthermore, sensors with a global shutter are preferable to avoid the rolling-shutter distortion that affects the accuracy of data association, resulting in an inaccurate final estimate. Image data can also easily be used for other applications such as object recognition and semantic segmentation. The drawbacks are that environmental factors like changes in illumination and weather conditions impact the quality of the information in the data, affecting the SLAM system's performance.

Feature-based SLAM algorithms like ORB SLAM [6] are the most popular type for image data; these methods usually depend on feature detector-based visual odometry. The landmarks/features in successive images are tracked using a feature detector to calculate the initial pose estimate. On the other hand, direct SLAM methods such as direct sparse odometry (DSO) [16] and large-scale direct monocular SLAM (LSD) [17] calculate the initial transformation estimates between the images using direct image alignment of the keyframes. Finally, to refine the estimate further, optimization methods such as bundle adjustment or pose graph optimization are used.

2.2.2 Point Cloud Data

Point clouds are a three-dimensional representation of the environment generated by sensors such as LIDARs, 3D RADARs, and RGB-D time-of-flight cameras. A point cloud consists of points, each with an intensity value and X, Y, Z coordinates; the additional third dimension of point clouds makes the SLAM problem simpler than with 2D images. Furthermore, active sensors like LIDARs and time-of-flight sensors generate point clouds that are less dependent on the environmental conditions for consistency. Although point cloud data is accurate, the drawbacks are that traditional mechanical rotating LIDARs have poor resolution along the axis of rotation due to the limited number of scan lines [18]; they also have a relatively low frame rate and can be expensive. Nevertheless, they have become less expensive in recent years due to advancements in the field, and due to their current demand, manufacturers are competing to provide the most affordable versions of LIDARs.

In LIDAR-based SLAM, localization using point clouds requires finding the transformation between successive point clouds. The iterative closest point (ICP) and normal distributions transform (NDT) [19] are widely used for this process, often referred to as point cloud scan matching; we will discuss their detailed working in the following sections. Point-to-point and point-to-plane are the most popular variants of ICP; the LOAM [9] algorithm uses these methods for scan matching and is the basis for many of the best-performing algorithms on the KITTI public dataset. Current state-of-the-art LIDAR SLAM systems use variants of ICP scan matching for odometry and pose graphs for optimization because of their sparse nature and accuracy.

2.2.3 Complementary Sensors Data

Complementary sensor data such as IMU, GPS, and wheel odometry help the SLAM system increase accuracy. In loosely coupled systems, their data can be used along with the primary sensor data during the optimization stages to estimate the robot pose. For example, LIO-SAM [10], a LIDAR-based SLAM, utilizes IMU data to remove the distortion caused by the robot's motion during a LIDAR sweep and also integrates GPS- and IMU-based pose estimates. Due to its higher data rate, IMU data can help capture the information between keyframes, whereas GPS can help prevent drift in the localization.

Tightly coupled SLAM systems widely use the IMU preintegration method [20] to estimate the pose transform between two keyframe time steps and for point cloud correction. Back-end optimization methods such as factor graph optimization allow for easy integration of multiple sensor data streams to calculate the final pose estimate.

2.3 Visual SLAM Front-End

It is a challenging task to represent the visual SLAM states, consisting of a series of robot poses and landmark locations, as a direct function of the raw pixel values of the incoming images. The visual SLAM front-end performs intermediate operations that translate these pixel values into the SLAM state, which can later be optimized in the back-end. In a feature-based SLAM, the front-end generally performs feature extraction and obtains the reference key points from the images; for a SLAM system, the feature extractor should be robust and fast for accurate real-time performance. We will discuss the ORB feature extractor, which satisfies these conditions [21], in the following section.

Next, it needs to track the correspondences of the existing key points in the new image frames in a step called data association; the key points which can be tracked over multiple frames are labeled as landmarks. Descriptor matching, along with the aid of optical flow trackers such as the KLT tracker, is preferred due to its performance efficiency; using these point correspondences, the trackers can also determine how image frames have moved by estimating a warp function. In the following sections, we take a closer look at the mentioned detector and tracker.

2.3.1 ORB Feature Detector

The front-end of a visual SLAM system extracts and tracks visual landmarks or features over a series of images to estimate the robot's pose. The feature detector must be robust to varying orientation, scale, and lighting to successfully track similar features over a series of images, so robust feature detectors are an essential requirement for a visual SLAM front-end. There are well-known keypoint detectors and descriptors such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), or the Harris detector known for their robust qualities, but the detector's speed also plays a vital role in the overall SLAM system performance in real-time applications.

Figure 2.2. Benchmark detector speed comparison [21]

This is why the oriented FAST and rotated BRIEF (ORB) detector [22] is widely preferred; this detector and descriptor offer speed along with scale and orientation invariance. According to the OpenCV detector survey conducted in [21], the ORB detector is more than twice as fast as the other established detectors, as shown in figure (2.2).

The ORB detector builds on top of the features from accelerated segment test (FAST) feature detector [23], which was published in 2006. Firstly, a feature or key point in an image is a point with a noticeable change in the intensity values in one (edge) or both dimensions (corner). FAST detects these points by comparing the intensity of a target point p with a set of pixels forming a ring around it, as shown in figure (2.3). It considers the pixel p a key point if at least 12 contiguous pixels in that ring have intensity levels greater than the intensity of p + threshold or less than the intensity of p − threshold.

The accelerated segment test used in the FAST detector to reject outliers makes it very efficient; it performs this test by considering only the intensities of a set of intermittent points on the ring, for example pixels 1, 5, 9, 13 in figure (2.3): if fewer than three of these four selected pixels satisfy the intensity condition, point p is rejected as an outlier. The ORB detector performs FAST extraction at various image scale levels; it also adds orientation invariance to FAST, which does not have it on its own. ORB adds an orientation component to FAST by a method called orientation by intensity centroid. This is based on the assumption that the intensity centroid is offset from the pixel center of the patch; therefore, the slope of the line joining the two centers gives the orientation of the patch.

Figure 2.3. FAST detector segment test [23]

m_{pq} = \sum_{x,y} x^p y^q I(x, y) \quad (2.1)

\theta = \operatorname{atan2}(m_{01}, m_{10}) \quad (2.2)

This is performed using the patch moments m_{pq} shown in equation (2.1), where the function I gives the intensity at pixel coordinates x, y, and the indices p, q of the moment m take the value 1 or 0, indicating which of the coordinates x, y is considered. The orientation of the patch is then given by θ in (2.2). ORB also adds learning-based rotation invariance to the binary robust independent elementary features (BRIEF) descriptor, which works by performing binary tests between pixels on blurred patches, as described in [22].
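To make the front-end steps concrete, the following is a minimal sketch of ORB detection and descriptor matching between two consecutive frames using OpenCV; the image file names are placeholders and the parameter values are illustrative, not those used by any of the SLAM systems tested in this thesis.

```python
import cv2

# Two consecutive grayscale frames (placeholder file names).
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# ORB: FAST keypoints over an image pyramid plus rotated BRIEF descriptors.
orb = cv2.ORB_create(nfeatures=1000, scaleFactor=1.2, nlevels=8)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are matched with the Hamming distance; cross-checking
# keeps only mutually consistent matches, a simple form of outlier rejection.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(kp1)}/{len(kp2)} keypoints, {len(matches)} cross-checked matches")
```

In a SLAM front-end, the matched keypoint pairs would then feed a pose estimator (for example an essential matrix or PnP solver) and, eventually, the back-end optimization.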

2.3.2 SuperPoint Detector and Descriptor

Recent developments in feature detectors are centered around deep learning-based methods [24] to improve speed and robustness. Although they are not directly used in the algorithms tested in this thesis, it is essential to understand the future trend. One such promising detector is SuperPoint [25]. Unlike traditional hand-crafted detectors such as ORB, SuperPoint is a learning-based, self-supervised detector and descriptor. Convolutional neural networks are well studied for conventional applications such as human or object detection, but such applications rely on the hand-annotated ground truth labels necessary during the training process. It is difficult to achieve similar results with supervised learning for image feature points, as it is semantically harder for humans to define and label a key point.

The SuperPoint detector introduces a self-supervised method to address this problem. In this method, the detector network is initially trained on a very large synthetically generated dataset consisting of simple shapes with known ground truth. The network trained on this synthetic data is referred to as MagicPoint, which performs well on synthetic data but not as well on real-world data compared to traditional detectors. To make MagicPoint more reliable, the synthetic images are augmented with multi-scale and multi-warping distortions during the homographic adaptation step, and MagicPoint is retrained, resulting in a more robust and reliable SuperPoint detector capable of performing in real-time applications.

Figure 2.4. SuperPoint detector training stages

2.3.3 KLT Optical Flow Tracker

During the data association step of visual SLAM, we need to find the point correspondences of the set of landmark features in the incoming new images. Methods like feature descriptor matching or optical flow trackers can find these point correspondences, with the latter performing more efficiently. Published in 1991, the Kanade–Lucas–Tomasi (KLT) tracker [26] is an optical flow-based tracker commonly used in visual SLAM algorithms.

The KLT tracker utilizes the key points detected by the chosen detector; it computes the motion vector of a keypoint from its displacement between successive frames. In order not to lose track of all the key points, the tracker resets the actively tracked key points after a predefined fixed number of frames.

The sum of squared differences (SSD) error of image warping with an incremental update \Delta p, and the resulting update step, are

E = \sum_{x} \left[ I(W(x; p + \Delta p)) - T(x) \right]^2 \quad (2.3)

\Delta p = H^{-1} \sum_{x} \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ T(x) - I(W(x; p)) \right] \quad (2.4)

Using these known data associations between frames, the pose transformation of the camera can then be estimated using the warping function W(x; p), where the pixel coordinates of the image being warped are represented by x, and p represents the parameters of the warping function that need to be estimated. First, equation (2.3) defines the sum of squared differences error function between the warped image and the previous template, with prior p and incremental change \Delta p. After linearizing and rearranging, \Delta p is given by equation (2.4), where H is the second-order partial derivative (Hessian matrix) with respect to the warp parameters. \Delta p is then added to p, and these steps are repeated iteratively until convergence. After the parameters p of the warping function are estimated, the key points along with the warp estimate can be utilized to initiate the back-end optimization.

Algorithm 1: KLT tracker algorithm
Input: new images
Output: warp function W
    Tracked landmarks = 0
    for each new image do
        Detect keypoints
        Match landmarks from previous frames
        if Tracked landmarks < threshold then
            Initialize new landmarks from detected keypoints
            Initialize motion vectors for new landmarks
        end if
        Compute motion vectors for landmarks using new pixel locations
        if E > error threshold then
            Calculate \Delta p (2.4)
            Update SSD error E (2.3)
        end if
    end for
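A minimal sketch of such a tracking loop, using OpenCV's pyramidal Lucas-Kanade implementation, is shown below; the video source, window size, and re-detection threshold are assumed placeholder values, and the warp/pose estimation that a full front-end would perform on the tracked correspondences is omitted.

```python
import cv2

cap = cv2.VideoCapture("sequence.mp4")            # placeholder video source
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Initial keypoints; Shi-Tomasi corners are used here, but a SLAM front-end
# could equally seed the tracker with FAST or ORB keypoints.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                              qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Pyramidal Lucas-Kanade: estimate each keypoint's displacement (motion vector).
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None,
                                                winSize=(21, 21), maxLevel=3)
    pts = nxt[status.flatten() == 1].reshape(-1, 1, 2)

    # Re-detect when too many tracks are lost, mirroring the reset in Algorithm 1.
    if len(pts) < 100:
        pts = cv2.goodFeaturesToTrack(gray, maxCorners=300,
                                      qualityLevel=0.01, minDistance=7)
    prev_gray = gray
```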

2.4 LIDAR SLAM Front-end

Similar to the visual SLAM front-end, the LIDAR-based SLAM front-end also needs to perform feature extraction and data association on the incoming sensor data before the optimization stage in the back-end. In the case of traditional rotating LIDARs, the front-end needs to wait for the scan to complete and accumulate the incoming points into a point cloud.

This stage also deals with removing the distortion caused by the robot motion during the sweep, in a step called point cloud de-skewing, where every new point is projected back to the start time of the scan either using the previous motion estimate or using the information from an IMU.

After accumulating the point cloud, most LIDAR SLAM algorithms perform feature extraction and scan registration, as described in the following sections.


2.4.1 Point Cloud Feature Extraction

In most LIDAR SLAM applications, edge and planar points are extracted during feature extraction using the method described in LIDAR odometry and mapping in real-time (LOAM) [9]. In this method, a smoothness value c is computed for every new point using equation (2.5); it is the normalized distance between the target point and a set S of neighboring points on either side of the target point. If the average distance between the point and its neighbors is high, the point is considered to be an edge point, and if the average distance is low, the points are lying on the same plane. Edge and planar thresholds on the smoothness value c are set, using which definite edge and planar points are classified.

c = \frac{1}{|S| \cdot \left\| X^{L}_{(k,i)} \right\|} \left\| \sum_{j \in S,\, j \neq i} \left( X^{L}_{(k,i)} - X^{L}_{(k,j)} \right) \right\| \quad (2.5)

Figure 2.5. Edge (yellow) and planar (red) features [9]
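As an illustration of equation (2.5), the following sketch computes the smoothness value c for the points of a single scan line with NumPy; the neighborhood size and the edge/planar thresholds are arbitrary example values, not those of the LOAM implementation.

```python
import numpy as np

def smoothness(scan_line: np.ndarray, S: int = 5) -> np.ndarray:
    """Smoothness c for each point of one scan line (N x 3), following eq. (2.5)."""
    n = scan_line.shape[0]
    c = np.full(n, np.nan)
    for i in range(S, n - S):
        # S neighboring points on each side of the target point i.
        neighbors = np.vstack((scan_line[i - S:i], scan_line[i + 1:i + 1 + S]))
        diff_sum = np.sum(scan_line[i] - neighbors, axis=0)
        c[i] = np.linalg.norm(diff_sum) / (2 * S * np.linalg.norm(scan_line[i]))
    return c

# Points with large c would be labeled edge features, points with small c planar
# features (the 0.5 / 0.05 thresholds below are purely illustrative).
line = np.random.rand(64, 3) * 10.0
c = smoothness(line)
edge_mask, planar_mask = c > 0.5, c < 0.05
```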

After classifying edge and planar points, a LIDAR SLAM front-end estimates the pose transform between two successive point clouds in the point cloud registration step. The most commonly used point cloud registration methods are the iterative closest point (ICP) and its variants, such as point-to-line and point-to-plane ICP; the normal distributions transform (NDT) is also commonly used for denser point clouds due to its computational efficiency. In the following sections, we briefly discuss the working of the ICP and NDT registration methods.


2.4.2 Iterative Closest Point (ICP)

Iterative closest point (ICP) is a commonly used point cloud registration method in LIDAR-based SLAM systems. To estimate the pose transform, rotation R and translation t, between two point clouds, the point correspondences between them need to be found. Typically it is computationally inefficient to perform this operation exactly; ICP sidesteps this step and assumes the closest neighbor p of a point x in the target point cloud to be its corresponding point. The error term in equation (2.6) is the squared distance after applying the transformation R and t to the target points; non-linear least-squares optimization methods such as the Levenberg–Marquardt algorithm [27] are utilized to minimize the error. After finding and applying the transformation, the point correspondences are re-established, and all the above steps are repeated iteratively until convergence.

E(R, t) = \frac{1}{N_p} \sum_{i=1}^{N_p} \left\| x_i - R p_i - t \right\|^2 \quad (2.6)

This basic version of ICP is called point-to-point ICP; in some cases, point-to-point ICP can lead to inaccurate results due to poor point correspondences [28]. The point-to-line and point-to-plane versions of ICP try to solve this issue by redesigning the error term so that it represents the distance between a point from the source point cloud and the closest line or plane in the target point cloud.

d_{\epsilon} = \frac{\left| \left( \tilde{X}^{L}_{(k+1,i)} - \bar{X}^{L}_{(k,j)} \right) \times \left( \tilde{X}^{L}_{(k+1,i)} - \bar{X}^{L}_{(k,l)} \right) \right|}{\left| \bar{X}^{L}_{(k,j)} - \bar{X}^{L}_{(k,l)} \right|} \quad (2.7)

Equation (2.7) from LOAM [9] represents the point-to-line distance error of point i, which is used for edge feature points. Here \tilde{X}^{L}_{(k+1,i)} represents the 3D coordinates of point i in LIDAR frame k+1, expressed in the LIDAR coordinate frame L. The point j is the closest point to i in the previous frame k, and (j, l) form the closest edge line to point i; all three points i, j, l are required to be edge points.

d_{\epsilon} = \frac{\left| \left( \tilde{X}^{L}_{(k+1,i)} - \bar{X}^{L}_{(k,j)} \right) \cdot \left( \left( \tilde{X}^{L}_{(k+1,i)} - \bar{X}^{L}_{(k,l)} \right) \times \left( \tilde{X}^{L}_{(k+1,i)} - \bar{X}^{L}_{(k,m)} \right) \right) \right|}{\left| \left( \bar{X}^{L}_{(k,j)} - \bar{X}^{L}_{(k,l)} \right) \times \left( \bar{X}^{L}_{(k,j)} - \bar{X}^{L}_{(k,m)} \right) \right|} \quad (2.8)

In equation (2.8), similarly to the point-to-line case, the point-to-plane error is the distance from point i of the new frame to the plane represented by the three points j, l, m from the previous frame; this method is applied to the planar points of the point cloud. During every iteration of ICP, the newly estimated transformation (R, t) is applied to point i to find the new \tilde{X}^{L}_{(k+1,i)} for the next iteration.


Figure 2.6. Point-to-point vs point-to-plane ICP error [28]
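The following is a minimal point-to-point ICP sketch in NumPy/SciPy: nearest neighbors provide the assumed correspondences, and a closed-form SVD solution (instead of the Levenberg-Marquardt optimization mentioned above) minimizes the error of equation (2.6) in each iteration. It is an illustration of the idea, not the registration code used by any of the tested algorithms.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(source, target, iterations=30):
    """Align source (N x 3) to target (M x 3); return rotation R and translation t."""
    R, t = np.eye(3), np.zeros(3)
    src = source.copy()
    tree = cKDTree(target)
    for _ in range(iterations):
        # Data association: assume the nearest target point is the correspondence.
        _, idx = tree.query(src)
        corr = target[idx]
        # Closed-form least-squares rigid transform (Kabsch/Umeyama) via SVD.
        mu_s, mu_c = src.mean(axis=0), corr.mean(axis=0)
        H = (src - mu_s).T @ (corr - mu_c)
        U, _, Vt = np.linalg.svd(H)
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:          # guard against reflections
            Vt[-1, :] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_c - R_step @ mu_s
        src = src @ R_step.T + t_step          # apply and accumulate the transform
        R, t = R_step @ R, R_step @ t + t_step
    return R, t
```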

2.4.3 Normal Distributions Transform (NDT)

The main drawbacks of the ICP registration algorithm are that it depends on the correspondences of individual points rather than the local surface shape, and that searching for the nearest neighbor of every individual point is inefficient. ICP also requires outlier rejection methods, such as distance-based weighting, to deal with poor correspondences, but these do not always improve performance [19]. The normal distributions transform (NDT) registration [19] tries to address these issues by formulating a different approach to scan matching.

In this method, instead of using the point locations directly, normal distributions are used: the point cloud is divided into a set of equally sized cubic cells called voxels, each of which contains at least a fixed minimum number of points. The mean q and covariance C, given by equations (2.9, 2.10), represent how the points spread out inside these cells. Using the probability density function p(x) in equation (2.11), we can find the probability of observing a point at location x inside the cell.

q = \frac{1}{n} \sum_{k=1}^{n} x_k \quad (2.9)

C = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - q)(x_k - q)^T \quad (2.10)

p(x) = \frac{1}{c} \exp\left( -\frac{(x - q)^T C^{-1} (x - q)}{2} \right) \quad (2.11)

Using this representation of the point cloud, the NDT method can then use numerical optimization to solve the registration problem. The optimization parameters are the rotation and translation (R, t), represented by the parameter vector p = [t_x, t_y, t_z, r_x, r_y, r_z, w] in the transformation function T(p, x), where s = \sin(w), c = \cos(w), and e = 1 - \cos(w). To find the registration, the transformation function T(p, x) needs to be optimized such that, when the resulting transformation is applied, the probability density (2.11) of the transformed points is maximized.

T(p, x) =
\begin{bmatrix}
e r_x^2 + c & e r_x r_y - s r_z & e r_x r_z + s r_y \\
e r_x r_y + s r_z & e r_y^2 + c & e r_y r_z - s r_x \\
e r_x r_z - s r_y & e r_y r_z + s r_x & e r_z^2 + c
\end{bmatrix} x +
\begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix} \quad (2.12)
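To illustrate equations (2.9)-(2.11), the sketch below builds the per-voxel normal distributions of a point cloud with NumPy; the voxel size and minimum point count are arbitrary example values, and the numerical optimization of T(p, x) that performs the actual registration is omitted.

```python
import numpy as np
from collections import defaultdict

def build_ndt_cells(points, voxel_size=1.0, min_points=5):
    """Group points (N x 3) into cubic voxels and fit a Gaussian (q, C) per voxel."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(np.floor(p / voxel_size).astype(int))].append(p)

    ndt = {}
    for key, pts in cells.items():
        pts = np.asarray(pts)
        if len(pts) < min_points:
            continue                                    # skip sparse voxels
        q = pts.mean(axis=0)                            # mean, eq. (2.9)
        C = np.cov(pts.T) + 1e-6 * np.eye(3)            # covariance, eq. (2.10), regularized
        ndt[key] = (q, np.linalg.inv(C))
    return ndt

def cell_score(x, cell):
    """Unnormalized probability of observing a point at x inside a cell, eq. (2.11)."""
    q, C_inv = cell
    d = x - q
    return float(np.exp(-0.5 * d @ C_inv @ d))
```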

2.5 Back-end Optimization

After the abstraction of the sensor data into a usable format, such as feature points and initial pose estimates in the front-end, this new data is utilized in the back-end for refining the final estimate. Classical probabilistic approaches to this step include Kalman filters, extended Kalman filters, and particle filters. The current standard approaches [4] formulate the problem as a maximum-a-posteriori problem, in the form of bundle adjustment for visual SLAM or the more general factor graph optimization, which is applicable to all modalities of SLAM because of its simple representation of the problem and its generality in taking multiple constraints. In the following two sections, we look into the workings of bundle adjustment and factor graph-based back-end methods.

2.5.1 Bundle Adjustment

In visual SLAM, bundle adjustment [29] is used to estimate the pose transform between image frames and the 3D coordinates of image features by minimizing the re-projection error of the tracked feature points. Bundle adjustment is generally used for multi-view 3D scene reconstruction applications; in some cases, it also needs to estimate the internal camera parameters, which are known a priori in the case of a SLAM application.

When a set of known features from the previous image frame is projected into the next frame using the initial guess of the transformation between the two frames, the difference between the projected features and the observed features in the new image frame gives us the reprojection error. Hence, the reprojection error E is a function of the estimate of the pose transform and of the 3D feature point locations, as described in equation (2.13) from [29]. The function Q(a_j, b_i) is the projection function of point i into image j, where a_j contains the known internal camera parameters and the external parameters, i.e. the initial estimate of the pose transform that needs to be optimized, and b_i contains the 3D location estimate of point i. The function d() measures the Euclidean distance between the estimated projection and the actual feature point location x_{ij} of feature point i in image frame j. The error term sums this error over the n features tracked among the m image frames, as illustrated in figure (2.7), where the estimate from the previous frame is used in the projection function Q of the next frame.

Figure 2.7. Illustration of n = 7 points projecting onto m = 3 images [29]

E = \sum_{i=1}^{n} \sum_{j=1}^{m} d\left( Q(a_j, b_i), x_{ij} \right)^2 \quad (2.13)

During a cycle of bundle adjustment, it is computationally expensive to consider all of the trajectory frames for optimization. Instead, most SLAM algorithms use a sliding window-based approach called fixed-lag smoothing [7][30][31], where only a fixed number of recent keyframes are used to determine the error term, enabling real-time operation. To further improve the estimate's accuracy, current visual-inertial SLAM systems include the IMU preintegration step introduced in [32] to capture and utilize the motion information between keyframes. The resulting IMU residuals, along with the feature reprojection errors, are minimized together in visual-inertial bundle adjustment to get the final estimate.
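The sketch below evaluates the reprojection error of equation (2.13) for a pinhole camera with NumPy; a real bundle adjustment would minimize this residual jointly over the poses and 3D points (for example with a sparse Levenberg-Marquardt solver), and a visual-inertial system would add the pre-integrated IMU residuals to the same cost. The intrinsics and the pose representation are illustrative assumptions.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection Q(a_j, b_i): 3D point X into image j with pose (R, t)."""
    x_cam = R @ X + t
    return (K @ (x_cam / x_cam[2]))[:2]

def reprojection_error(K, poses, points, observations):
    """Sum of squared distances d(Q(a_j, b_i), x_ij)^2 over all observations.

    poses: list of (R, t) per image frame j
    points: array of 3D landmark position estimates b_i
    observations: list of (i, j, uv) with the measured pixel location uv of
    landmark i in frame j
    """
    E = 0.0
    for i, j, uv in observations:
        R, t = poses[j]
        E += float(np.sum((project(K, R, t, points[i]) - uv) ** 2))
    return E
```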

2.5.2 Factor Graphs

Factor graphs are a sparse, graph-based representation of the maximum-a-posteriori problem, where the unknown parameters are represented as nodes in a graph and the edge factors represent the sensor measurements connecting the nodes. In the SLAM estimation problem, at any point in time the current estimate most likely depends on the subset of measurements local to that point in time; for example, finding the transformation between two keyframes depends on the shared key points between those frames.

Factor graphs provide an intuitive way to formulate this local dependency as edge factors between the variables to be estimated in a sparse graph-based structure. They can be used to represent a variety of robotics problems, including navigation, optimal control, bundle adjustment, and other modes of SLAM. Their sparsity enables easy modeling of complex systems with multiple factors. SLAM using factor graphs is also referred to as iSAM [33], which stands for incremental smoothing and mapping, as the non-linear optimization is applied incrementally to the graph structure to update the nodes.

Figure 2.8. Factor graph representation of SLAM [33]

X = \arg\max \, p(X \mid Z) = \arg\max \, p(Z \mid X)\, p(X) \quad (2.14)

Maximum-a-posteriori estimation of the SLAM state X [4] is performed by finding the state X with the maximum probability given the measurements Z, as in equation (2.14), which by Bayes' theorem can be written as p(Z|X) p(X): the probability of the observed measurements Z given the state X, multiplied by the known prior probability of X. When the prior p(X) is unknown, the problem reduces to maximum likelihood estimation.

X = \arg\max \left( p(X) \prod_{k=1}^{m} p(Z_k \mid X_k) \right) \quad (2.15)

As the measurements are independent of each other, (2.14) can be written as (2.15), which also indicates that the measurement Z_k at time step k only depends on the local subset X_k of the SLAM state.

p(Z_k \mid X_k) \propto \exp\left( -\frac{1}{2} \left\| h_k(X_k) - z_k \right\|^2_{\Omega_k} \right) \quad (2.16)

X = \arg\min \left( -\log \left( p(X) \prod_{k=1}^{m} p(Z_k \mid X_k) \right) \right) = \arg\min \sum_{k=0}^{m} \left\| h_k(X_k) - z_k \right\|^2_{\Omega_k} \quad (2.17)


Expanding (2.15), it can be written as a probability density function given by (2.16), where the function h_k is the known observation model and \Omega_k is the information matrix. Maximizing (2.16) gives the same result as minimizing its negative log, given by (2.17); this is now in the form of a least-squares problem and can be solved using non-linear least-squares solvers to get the maximum-a-posteriori estimate of the states.
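As a toy illustration of equation (2.17), the sketch below stacks the residuals h_k(X_k) − z_k for a one-dimensional pose chain with odometry and one loop closure measurement and solves the resulting non-linear least-squares problem with SciPy; back-ends such as GTSAM or g2o solve the same kind of problem on manifolds and incrementally, which is not shown here.

```python
import numpy as np
from scipy.optimize import least_squares

# Example measurements z_k: relative 1D odometry between consecutive poses and
# one loop closure factor relating the last pose back to the first.
odometry = [1.0, 1.1, 0.9, 1.05]        # z for x1-x0, x2-x1, x3-x2, x4-x3
loop = (4, 0, -4.0)                     # measured x0 - x4

def residuals(x):
    """Stacked residuals h_k(X_k) - z_k; a prior factor fixes the first pose."""
    res = [x[0] - 0.0]                                            # prior
    res += [x[k + 1] - x[k] - z for k, z in enumerate(odometry)]  # odometry factors
    i, j, z = loop
    res.append(x[j] - x[i] - z)                                   # loop closure factor
    return np.asarray(res)

x0 = np.zeros(5)                        # initial guess for the five poses
sol = least_squares(residuals, x0)      # Gauss-Newton / trust-region solver
print(sol.x)                            # MAP estimate under unit information matrices
```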


3. METHODOLOGY AND EXPERIMENTAL SETUP

This chapter first goes through the methodology behind the selection of the visual and LIDAR-based SLAM algorithms chosen for testing in this thesis. We then give an overview of the chosen algorithms to understand their key differences and to see how the components described in chapter 2 are utilized within them. Section 3.3 presents a detailed account of the sensor suite used during the experimentation and the necessary calibration processes conducted for the selected sensors. Finally, the error metrics used in this thesis, the absolute pose error (APE) and the relative pose error (RPE), are described.

3.1 Selection of SLAM Algorithms

Figure 3.1. Selected Visual and LIDAR SLAM algorithms

Eight SLAM algorithms have been chosen for comparison in this thesis; figure (3.1) shows their categorization. The set consists of four visual and four LIDAR-based algorithms. Both groups contain two odometry algorithms that do not include loop closure and two full SLAM algorithms that include loop closure. Among the visual algorithms, SVO SLAM has a semi-direct front-end that should help it react to fast changes as it skips the feature extraction step, whereas the rest contain a feature-based front-end. ORB SLAM3 is a feature-based full SLAM; BASALT VIO is bundle adjustment-based visual-inertial odometry with non-linear factor recovery; and finally, Kimera VIO is factor graph-based visual-inertial odometry.

Among the LIDAR-based algorithms, the full SLAM methods include LIO SAM and HDL graph SLAM, where LIO SAM implements IMU preintegration between keyframes and a GTSAM library-based factor graph back-end, and HDL graph SLAM is a g2o library-based graph SLAM. LeGO LOAM and LOAM are the two odometry algorithms, where LeGO LOAM is optimized for speed and designed for ground vehicles, and LOAM is a point-to-line and point-to-plane ICP based odometry algorithm.

3.2 Overview of the Chosen SLAM Algorithms

3.2.1 SVO2

Unlike traditional feature-based visual algorithms, Semi-direct Visual Odometry (SVO) [34] has a front-end similar to direct visual odometry methods, where it works on direct pixel intensities with sparse model-based image alignment instead of performing feature extraction and matching. This makes it capable of operating at higher frame rates, which is preferable for fast-moving applications such as drones. It uses the local intensity gradient's direction to estimate the motion, making it more robust in scenes with poor texture.

In the motion estimation thread of SVO (figure 3.2), it first performs sparse model-based image alignment by minimizing the photometric error. Unlike classic direct methods, SVO performs bundle adjustment during the pose and structure refinement step, which is initialized by the initial estimate and the reprojection of features resulting from the earlier direct image alignment step.

Figure 3.2. SVO pipeline [34]


SVO 2.0 adds multi-camera and IMU support to SVO, and it also adds global loop closure using the distributed bag of words (DBoW) library.

3.2.2 ORB SLAM 3

ORB SLAM3 [6] is a feature-based visual-inertial odometry system with loop closure, making it a full SLAM. ORB SLAM3 has three parallel threads running tracking, local mapping, and loop closure (figure 3.3). The tracking thread consists of the ORB feature detector discussed earlier in section (2.3.1); the same set of features is also later used for place recognition during relocalization and loop closure detection. The tracking thread uses a constant velocity model to estimate the search locations for the previous frame's features. Using the tracked features, it calculates the new frame pose using the perspective-n-point algorithm; if the features are lost, it initializes global relocalization and saves the current local map.

If more than 20 frames have passed since the last keyframe and the current frame tracks at least a minimum number of points but less than 90% of the points of the reference keyframe, a new keyframe is added to the bundle adjustment optimization. ORB SLAM3 uses distributed bag of words (DBoW) based loop closure detection and performs the loop correction using graph optimization.

Figure 3.3. ORB SLAM3 architecture [6]

3.2.3 BASALT VIO

BASALT VIO [7] is a feature-based visual-inertial odometry algorithm with a pose graph-based loop closure. Most visual-inertial algorithms pre-integrate IMU measurements between keyframes; the creators of BASALT argue that if the selected keyframes are far apart, the pre-integrated IMU measurement does not retain much information due to accumulated noise. To address this issue, BASALT first performs local VIO with bundle adjustment at a high frame rate in a smaller window, during which the keyframes are selected and the non-linear factors that represent the visual-inertial information between keyframes are extracted. The keyframes and non-linear factors are then optimized during global bundle adjustment.

The front-end uses the FAST feature detector [23] to extract feature points; these features are then tracked over consecutive frames using a sparse patch-based KLT optical flow tracker [26]. The local bundle adjustment step minimizes the reprojection error generated from the reprojection of tracked features and the pre-integrated IMU residuals. It then selects keyframes and extracts non-linear visual-inertial factors used in the global bundle adjustment to generate the final estimate.

Figure 3.4. BASALT VIO pipeline

Basalt also extracts ORB features from keyframes and uses a factor graph structure for loop closure detection and correction, respectively.

3.2.4 Kimera VIO

Kimera visual-inertial odometry is part of the Kimera C++ library [8], which performs semantic 3D mesh reconstruction of the environment using metric-semantic SLAM with visual-inertial data. Besides Kimera-VIO, the library also includes the Kimera-RPGO, Mesher, and Semantics modules.

Kimera VIO performs keyframe-based maximum-a-posteriori estimation using factor graphs, and it allows the user to specify the time horizon of the factor graph to be optimized. The front-end begins with Shi-Tomasi corner detection; the detected features are then tracked through the new image frames using the KLT tracker. Next, five-point/three-point RANSAC is used to solve the relative pose between keyframes, and pre-integrated IMU measurements are calculated between keyframes.

The calculated initial estimate and the pre-integrated IMU measurements are used to initiate the pose graph in the back-end. Using the specified time frame length, the pose graph is then optimized using the GTSAM library [33].


Figure 3.5. Kimera-VIO module of the Kimera library [8]

3.2.5 LOAM

LIDAR Odometry and Mapping (LOAM) [9] is a point cloud-based odometry and mapping algorithm that uses point-to-plane and point-to-line scan matching for registration. Using the methods described in section (2.4), LOAM starts by extracting edge and planar features from the point cloud based on the smoothness value c of the points. To eliminate distortion due to motion, every new point is projected back to the start time of the point cloud using a constant velocity model. During the registration, point-to-line ICP is used on the classified edge points, and the point-to-plane ICP method is used on the planar points to find the pose transform between successive point clouds.

Figure 3.6. LOAM system components [9]

The sequence of pose estimates is put together during transform integration to obtain the trajectory taken by the robot, and the point clouds are also stitched together, at a lower rate than the pose estimates, to obtain the point cloud map of the environment. The version of LOAM used in this thesis is A-LOAM [35] from the Hong Kong University of Science and Technology.


3.2.6 LeGO LOAM

Lightweight and Ground-Optimized Lidar Odometry and Mapping (LeGO LOAM) [36], similar to LOAM, is a point cloud odometry and mapping algorithm; it is lightweight and can run on embedded systems due to its assumption of an always-present ground plane.

LeGO LOAM first extracts planar and edge features by calculating the smoothness value of the points. It then performs a two-step registration process between successive point clouds: it first performs point-to-plane ICP on the planar points from a predefined bottom few scan lines to calculate the Z translation, roll, and pitch angles of the pose transform, and then uses edge features from the rest of the point cloud to calculate the X, Y translations and yaw angle using point-to-line ICP. This two-step process reduces the computation necessary.

Figure 3.7. LeGO LOAM system components [36]

Similar to LOAM, the pose estimates are then combined during transform integration to get the trajectory estimate, and point clouds are stitched with reference to this estimate to get the point cloud map.

3.2.7 HDL graph SLAM

HDL graph SLAM [37] is a factor graph-based LIDAR SLAM package; by utilizing the graph structure, it allows the user to add optional edge constraints such as IMU, GPS, floor plane detection, and loop closure.

Figure 3.8. Nodes of HDL graph SLAM


The incoming point cloud first passes through a prefiltering node, where it is downsampled to increase the computational efficiency of the later steps. Next, in the scan matching node, the initial transformation estimate between two point clouds is calculated with a user-specified registration method, such as a variant of NDT or ICP scan matching; this initial estimate is used to initialize the nodes of the factor graph.

Loop closures are detected with the user-specified scan matching method; once loops are detected, they are added to the factor graph as edge constraints. The modularity of HDL graph SLAM enables it to range from a simple scan match odometry method to a full SLAM with multiple constraints.

3.2.8 LIO SAM

LIO SAM is a tightly coupled LIDAR-inertial odometry and mapping framework [10], which uses factor graphs to formulate the maximum-a-posteriori problem, enabling the use of additional sensors and global loop closure. It uses the IMU preintegration introduced in [32] between keyframes to provide improved accuracy. The robot state at each factor graph node of LIO SAM consists of [R^T, P^T, v^T, b^T]^T, which are the rotation matrix, position vector, robot speed, and IMU bias, respectively.

Figure 3.9. LIO SAM factor graph implementation [10]

The LIO SAM factor graph consists of four different types of edge factors, as shown in figure (3.9): the IMU preintegration factor, the scan match odometry factor, the GPS factor, and the loop closure factor, where the GPS and loop closure factors are optional. Upon adding a new node, it performs incremental smoothing of the factor graph using iSAM2 from the GTSAM library. The LIDAR odometry factor is generated similarly to LOAM, where the edge and planar features are extracted using the smoothness value, and then ICP scan matching is used to find the pose transform. But unlike LOAM, LIO SAM does not use all the LIDAR frames for registration; instead, it picks keyframes based on a preset minimum distance between two keyframes.

The absolute measurements from GPS help prevent drift in the estimate over a long period of operation. The loop closure implemented in LIO SAM is based on Euclidean distance: if the latest pose is close enough to any of the previous nodes, the algorithm uses scan matching to find the transformation between the current node and the existing node, which is then added as a loop closing factor between those two nodes.
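A Euclidean-distance loop closure candidate search of this kind can be sketched with a KD-tree as below; the search radius and the exclusion window for recent keyframes are illustrative parameters, not LIO SAM's actual settings, and the scan matching verification step is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def find_loop_candidate(keyframe_positions, radius=5.0, min_gap=50):
    """Return (current_index, candidate_index) if an old keyframe lies within radius."""
    positions = np.asarray(keyframe_positions)
    cur = len(positions) - 1
    if cur <= min_gap:
        return None
    tree = cKDTree(positions[:cur - min_gap])       # only sufficiently old keyframes
    neighbors = tree.query_ball_point(positions[cur], r=radius)
    if not neighbors:
        return None
    # Pick the geometrically closest old keyframe; scan matching would then
    # verify it and produce the loop closing factor.
    dists = np.linalg.norm(positions[neighbors] - positions[cur], axis=1)
    return cur, neighbors[int(np.argmin(dists))]
```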

Algorithm    | Sensors         | Front-end features    | Pose estimation | Back-end optimization | Loop closure
SVO2         | Visual-inertial | Direct                | BA              | BA (GTSAM)            | Yes (DBoW)
ORB3         | Visual-inertial | ORB                   | BA              | BA (g2o)              | Yes (DBoW)
Basalt VIO   | Visual-inertial | FAST                  | KLT + BA        | BA                    | Yes (DBoW)
Kimera VIO   | Visual-inertial | Shi-Tomasi            | RANSAC          | Pose graph (GTSAM)    | No
LOAM         | LIDAR           | Edge & planar points  | ICP             | -                     | No
LeGO LOAM    | LIDAR           | Edge & planar points  | ICP             | -                     | No
HDL Graph    | LIDAR           | Voxels with PDF       | ICP/NDT         | Pose graph (g2o)      | ICP/NDT based
LIO SAM      | LIDAR + IMU     | Edge & planar points  | ICP             | Pose graph (GTSAM)    | ICP based

Table 3.1. SLAM algorithms component comparison

3.3 Data Collection Setup

This section describes the equipment and methods used during the data collection process of this thesis. The data is collected into ROS bag files using the Robot Operating System (ROS) environment on a Jetson Xavier NX device with a six-core 64-bit CPU and 8 GB of DDR4 memory. To avoid a write bottleneck, the data is written to a 1 TB M.2 NVMe SSD, which supports a write speed of 4100 MB/s. The data from each sensor is recorded in the form of ROS topics inside the bag file; the bag files can then be played back later to simulate the live data stream from the sensors while running the SLAM algorithms. The following sections describe the sensors used, the necessary calibration processes, and the reference trajectories collected.

3.3.1 Sensors Setup

The five sensors used for data collection consist of two LIDARs, two stereo cameras, and an IMU. The first LIDAR is the Velodyne VLP-16, a mechanical 360-degree spinning LIDAR with 16 scan lines spread over a vertical field of view of 30 degrees; it has a vertical resolution of 2 degrees and a horizontal resolution of 0.1 to 0.4 degrees depending on the rotation speed, which varies from 5 to 20 Hz. The second LIDAR is the Robosense Bpearl; it is also a spinning LIDAR, but with 32 scan lines and a much wider vertical field of view of 90 degrees, with a vertical resolution of 2.81 degrees and a horizontal resolution of 0.2 to 0.4 degrees depending on the speed, which varies from 10 to 20 Hz. Both LIDARs have a maximum range of 100 meters and an accuracy of ±3 cm.


Figure 3.10. Sensor bar with two LIDARs, two stereo cameras, and an IMU

The stereo cameras include the ZED, a rolling-shutter RGB stereo camera recording higher-resolution 1280x720 images at 15 frames per second; it has a stereo baseline of 120 mm with horizontal and vertical fields of view of 90 and 60 degrees. The other stereo camera is the Realsense T265 from Intel, a monochrome fisheye stereo camera with a global shutter. It has an image resolution of 848x800 pixels recorded at 30 frames per second, a baseline of 64 millimeters, and a field of view of 163 degrees; it also has a built-in IMU with an accelerometer and gyroscope, which can be useful for visual-inertial SLAM.

In the testing stage, we used the Velodyne VLP-16 with LOAM, LeGO LOAM, and HDL graph SLAM; with LIO SAM, we used the Velodyne in combination with the Xsens IMU. We used the Realsense T265 with Basalt VIO, SVO2, and Kimera VIO. Kimera VIO did not support the ultra-wide-angle fisheye model of the T265 perfectly and failed on the outdoor datasets. Finally, we used the ZED camera with ORB SLAM3.

3.3.2 Sensors Calibration

To use the sensors for SLAM, certain parameters need to be obtained through a calibration process: the intrinsic parameters of the cameras, which include the focal length, principal point, and distortion model parameters, and their extrinsic parameters, which consist of the baseline transformation between the left and right stereo camera pairs. In the case of the IMU, the gyroscope and accelerometer noise parameters need to be obtained.

For LIDAR and IMU-based SLAM, the transformation between the two sensors needs to be estimated.

For the two stereo cameras, ZED and T265, rosbag calibration datasets with a checkerboard pattern in the field of view (figure 3.11) were recorded while performing motion in all six degrees of freedom. For the LIDAR-IMU calibration, a dataset with the sensor bar from figure (3.10) moving in all six degrees of freedom was recorded. Finally, to estimate the IMU noise parameters, an hour-long dataset in which the IMU is left unmoved was recorded.

Figure 3.11. T265 calibration dataset with checkerboard

Figure 3.12. Barrel and pincushion radial distortion [38]

The camera parameters of the ZED camera are estimated using the inbuilt camera calibration tool of ROS [39], which uses the camera calibration from OpenCV. It estimates the distortion parameters [k1, k2, k3, p1, p2], where k1, k2, and k3 are the radial distortion coefficients, which account for the curving of straight lines as we move away from the center of the image, with x_distorted = x(1 + k1 r^2 + k2 r^4 + k3 r^6) and y_distorted = y(1 + k1 r^2 + k2 r^4 + k3 r^6). p1 and p2 define the tangential distortion, caused by the misalignment between the lens and the image plane, with x_distorted = x + [2 p1 x y + p2 (r^2 + 2 x^2)] and y_distorted = y + [p1 (r^2 + 2 y^2) + 2 p2 x y]. The extrinsic parameters include the baseline transformation between the left and right cameras. To calibrate the visual-inertial sensor Realsense T265, we used the Kalibr toolbox [40], which estimates the camera parameters mentioned above and the transformation between the cameras and the IMU frame.

The LIDAR-IMU six-DoF rigid transformation is calibrated using the targetless calibration method from the LIDAR IMU calib toolbox; it calculates the external pose transformation and the time offset between the two sensors, which are necessary for LIDAR-IMU SLAM [41] such as LIO SAM. Finally, the IMU noise parameters are estimated using the IMU utils toolbox [42].
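To give an idea of what the noise-parameter estimation involves, the sketch below computes a non-overlapping Allan deviation from a static gyroscope log, which is the kind of analysis such a toolbox performs internally; the file layout and sampling rate in the usage comment are assumptions, not the actual recording settings.

```python
import numpy as np

def allan_deviation(omega, fs, taus):
    """Non-overlapping Allan deviation of a 1-D rate signal.

    omega : gyroscope rate samples [rad/s] recorded while the IMU is static
    fs    : sampling frequency [Hz]
    taus  : iterable of averaging times [s]
    """
    sigmas = []
    for tau in taus:
        m = int(tau * fs)                    # samples per cluster
        n_clusters = len(omega) // m
        if n_clusters < 2:
            sigmas.append(np.nan)
            continue
        clusters = omega[:n_clusters * m].reshape(n_clusters, m).mean(axis=1)
        # Allan variance: half the mean squared difference of consecutive cluster averages
        avar = 0.5 * np.mean(np.diff(clusters) ** 2)
        sigmas.append(np.sqrt(avar))
    return np.array(sigmas)

# Assumed usage (CSV columns: time, gx, gy, gz; 400 Hz IMU):
# data = np.loadtxt("imu_static.csv", delimiter=",")
# adev = allan_deviation(data[:, 1], fs=400.0, taus=np.logspace(-2, 3, 50))
# The angle random walk is read from the curve at tau = 1 s and the bias
# instability from its flat minimum.
```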


3.3.3 Ground Truth/ Reference Trajectory

We used two different sources of ground truth trajectories during the data collection experiments, one for the outdoor case and one for the indoor case. The outdoor ground truth is generated by fusing the onboard GNSS location and heading data with the IMU and wheel odometry of the GIM machine (4.1) through a Simulink EKF model. The GNSS module used is a Trimble BD982 [43] with dual antenna input, which helps in calculating the heading; the module has an accuracy of 0.008 m + 1 ppm horizontal and 0.015 m + 1 ppm vertical while running in RTK mode. The fused Simulink EKF output is streamed through a rosbridge and collected as an odometry message into the same bag file as the sensor data.

The indoor datasets contain ground truth data generated by the OptiTrack motion capture system [44] installed at the Robolab of Tampere University. This high-precision motion tracking system tracks infrared reflectors attached to the sensor bar using multiple infrared cameras installed on the ceiling. In total, eight OptiTrack Flex 3 high-speed cameras installed at the Robolab track the markers at up to 100 fps. Using multiple markers attached to the sensor bar body, we can define a virtual rigid body whose six-DoF motion can then be tracked. The tracked six-DoF pose is streamed to ROS as an odometry message, where it is collected along with the sensor data.
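A minimal sketch of how such a ground-truth odometry stream could be dumped to a TUM-format trajectory file for later evaluation is shown below; the topic name "/optitrack/odom" is an assumption and depends on how the motion-capture bridge is configured.

```python
#!/usr/bin/env python
# Record a ground-truth odometry stream to a TUM-format trajectory file
# (timestamp tx ty tz qx qy qz qw) so it can be compared against the
# SLAM trajectory estimates.
import rospy
from nav_msgs.msg import Odometry

out = open("ground_truth_tum.txt", "w")

def callback(msg):
    p = msg.pose.pose.position
    q = msg.pose.pose.orientation
    out.write("%.9f %f %f %f %f %f %f %f\n" %
              (msg.header.stamp.to_sec(), p.x, p.y, p.z, q.x, q.y, q.z, q.w))

if __name__ == "__main__":
    rospy.init_node("ground_truth_recorder")
    rospy.Subscriber("/optitrack/odom", Odometry, callback)  # assumed topic name
    rospy.spin()
    out.close()
```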

3.3.4 Error Metrics

The standard methods to measure the accuracy of a trajectory when a reference trajectory is available are the Absolute Pose Error (APE) and the Relative Pose Error (RPE). When measuring the absolute pose error, the starting points of the reference trajectory and the subject trajectory first need to be aligned. The error is then the RMSE of the Euclidean distances between the two trajectories at every subsequent point in time. Equation (3.1) defines the absolute pose error, where $Q_i$ is the transformation estimate generated by the SLAM algorithm from time 0 to time $i$, and $P_i$ is the ground truth reference transformation of the same time step. The RMSE of this error over all time steps is then calculated with (3.2) [45].

APE measures the global consistency of the trajectory. The two trajectories are aligned using the Umeyama alignment algorithm [46], which gives the rigid-body transformation between the two trajectories.
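A compact sketch of the rigid (scale-free) variant of this alignment is given below; the full Umeyama method additionally estimates a scale factor, which can matter for monocular methods, and evaluation toolboxes normally handle this step automatically.

```python
import numpy as np

def rigid_alignment(x, y):
    """Least-squares rotation R and translation t mapping point set x onto
    point set y; both are (n, 3) arrays of matched trajectory positions."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov = (y - mu_y).T @ (x - mu_x) / x.shape[0]
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # correct a reflection if present
    R = U @ S @ Vt
    t = mu_y - R @ mu_x
    return R, t
```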

$$E_i = Q_i^{-1} P_i \tag{3.1}$$

$$\mathrm{APE}_{rmse} = \left( \frac{1}{n} \sum_{i=1}^{n} \left\| \mathrm{trans}(E_i) \right\|^2 \right)^{\frac{1}{2}} \tag{3.2}$$
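Assuming the estimated and ground-truth poses have already been time-associated and aligned (for example with the alignment sketch above), a minimal implementation of (3.1)–(3.2) could look as follows.

```python
import numpy as np

def ape_rmse(Q, P):
    """APE RMSE as in equations (3.1)-(3.2).

    Q, P : sequences of 4x4 homogeneous pose matrices; Q are the SLAM
    estimates and P the (already aligned) ground-truth poses at the
    same time steps."""
    sq_errors = []
    for Qi, Pi in zip(Q, P):
        Ei = np.linalg.inv(Qi) @ Pi                        # equation (3.1)
        sq_errors.append(np.linalg.norm(Ei[:3, 3]) ** 2)   # trans(Ei)
    return float(np.sqrt(np.mean(sq_errors)))
```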

A drawback of APE is that if there is a bad outlier transformation at the beginning of the estimate, it subsequently affects the APE of all estimates at future time steps; because of this characteristic of APE, [47] claims this metric to be sub-optimal. Perfect alignment of the two trajectories is also not always possible for APE to work effectively, for example in the case of straight-line motion at the beginning, where only two dimensions can be aligned perfectly.

The Relative Pose Error (RPE), however, compares the pose transform between two consecutive time steps of the ground truth trajectory with that of the estimated trajectory; it is therefore invariant to the quality of the alignment and to outliers in the trajectory. In this thesis, we use RPE when perfect alignment of the trajectories is not possible due to the structure of the trajectory, and we measure global consistency with APE when possible.

Equation (3.3) defines the RPE, where $(Q_i^{-1} Q_{i+\Delta})$ represents the estimated pose transform between times $i$ and $i+\Delta$, and $(P_i^{-1} P_{i+\Delta})$ represents the actual transformation between those points. Composing the inverse of the estimated relative transformation with the ground truth relative transformation yields the error $F_i$.

$$F_i = \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1} \left(P_i^{-1} P_{i+\Delta}\right) \tag{3.3}$$
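A corresponding sketch of the RPE in (3.3), again assuming time-associated pose lists, is shown below; here only the translational part of $F_i$ is reported.

```python
import numpy as np

def rpe_translational(Q, P, delta=1):
    """Translational RPE per equation (3.3) for a fixed frame offset delta."""
    errors = []
    for i in range(len(Q) - delta):
        dQ = np.linalg.inv(Q[i]) @ Q[i + delta]   # estimated relative motion
        dP = np.linalg.inv(P[i]) @ P[i + delta]   # ground-truth relative motion
        Fi = np.linalg.inv(dQ) @ dP               # equation (3.3)
        errors.append(np.linalg.norm(Fi[:3, 3]))
    return np.array(errors)
```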


4. DATA COLLECTION EXPERIMENTS AND RESULTS

After studying the necessary core concepts and the workings of the chosen SLAM methods, we now move to the experimentation stage, where we test a few challenging indoor and outdoor cases for the SLAM systems using the test setup described in chapter 3. In this chapter, we first go through the outdoor experiments conducted during this thesis using the GIM machine and compare the localization performance of the SLAM algorithms; we then look at the indoor experiments performed with the MIR robot at Robolab, Tampere University, and their localization results. Finally, we compare the CPU and memory usage of the chosen algorithms in order to evaluate their efficiency.

4.1 Outdoor Tests

Figure 4.1. GIM machine with sensor bar mounted on top.

This section describes the outdoor dataset experiments conducted during this thesis. The tests were conducted at the Mobile Hydraulics Lab of Tampere University. The sensor bar containing two LIDARs, two stereo cameras, and an IMU, as described in section 3.3.1, was mounted on the GIM machine shown in figure (4.1), which is a modified Avant 600 multipurpose loader with various sensors and controllers installed; the GNSS localization from the GIM machine is also collected as a reference trajectory. The collected outdoor datasets include a loop with terrain change, changes in speed, and lighting changes. The following sections describe these datasets and the performance of the selected SLAM algorithms in more detail.

4.1.1 Loop with Terrain Change

This experiment contains a series of datasets where we repeat the same trajectory with the Avant loader in the shape of a figure-eight loop around the Mobile Hydraulics Lab, as shown in figure (4.2). Each of these datasets has the sensor bar mounted at a different position on the loader, including three elevation levels of the boom, a top mount, a back mount, and a side mount.

Figure 4.2. Loop with terrain change; top view

The trajectory starts at the center of the loop and continues towards the shorter leg, where the objects are closer and the environment is more urban, with buildings, roads, and trees. It then returns to the start and continues towards the longer leg, where the terrain changes from tarmac to gravel with a downward inclination and the objects are farther away compared to the shorter leg. This dataset should allow us to test how the change in terrain affects localization performance and the benefit of having loop closure after passing through a challenging environment; having the same dataset with different mounting positions should also help us pick the best mounting place for the sensors.

Table 4.1 contains the Root Mean Squared Error (RMSE) of the absolute pose error (APE) of each algorithm's trajectory with respect to the GNSS reference trajectory. The ICP-based odometry LOAM has performed best in the side mount position, where the LIDAR has the widest horizontal field of view of around 270 degrees. It can also be seen that the performance of the ground-dependent LEGO LOAM degrades as the elevation of the sensor increases as we move down the table. The LIDAR SLAM that uses the IMU has the best performance of 1.142 meters of RMSE, but its performance degrades when attached
