
2. Introduction to Visual and LIDAR SLAM

2.3 Visual SLAM Front-End

It is challenging to represent the visual SLAM state, consisting of a series of robot poses and landmark locations, as a direct function of the raw pixel values of incoming images. The visual SLAM front-end performs intermediate operations that translate these pixel values into the SLAM state, which can later be optimized in the back-end. In feature-based SLAM, the front-end generally performs feature extraction to obtain reference key points from the images; for a SLAM system, the feature extractor must be both robust and fast for accurate real-time performance. The following section discusses the ORB feature extractor, which satisfies these conditions [21].

Next, the front-end needs to track correspondences of the existing key points in new image frames, a step called data association; key points that can be tracked over multiple frames are labeled as landmarks. Descriptor matching, aided by optical flow trackers such as the KLT tracker, is preferred for its efficiency; using these point correspondences, the trackers can also determine how image frames have moved by estimating a warp function. In the following sections, we take a closer look at the mentioned detector and tracker.

2.3.1 ORB Feature Detector

The front-end of a visual SLAM system extracts and tracks visual landmarks or features over a series of images to estimate the robot's pose. The feature detector must be robust to varying orientation, scale, and lighting to successfully track similar features across frames; robust feature detectors are therefore an essential requirement for a visual SLAM front-end. There are well-known keypoint detectors and descriptors such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), or the Harris detector, known for their robustness; but the detector's speed also plays a vital role in overall SLAM system performance in real-time applications.

Figure 2.2. Benchmark detector speed comparison [21]

This is why the oriented FAST and rotated BRIEF (ORB) detector [22] is widely preferred; this detector and descriptor offer speed along with scale and orientation invariance. According to the OpenCV detector survey conducted in [21], the ORB detector is more than twice as fast as the other established detectors, as shown in figure (2.2).

The ORB detector builds on top of the features from accelerated segment test (FAST) feature detector [23], published in 2006. First, a feature or key point in an image is a point with a noticeable change in intensity in one dimension (edge) or in both (corner). FAST detects these points by comparing the intensity of a target point p with a set of pixels forming a ring around it, as shown in figure (2.3). It considers the pixel p a key point if at least 12 contiguous pixels in that ring have intensities greater than I(p) + threshold or less than I(p) − threshold.
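The segment test described above can be sketched in a few lines of Python; the ring offsets, threshold value, and synthetic test images below are illustrative choices, not taken from the FAST reference implementation.

```python
import numpy as np

# Offsets of the 16 pixels forming a discretised ring of radius 3.
RING = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
        (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_fast_corner(img, y, x, t=20, n=12):
    """True if >= n contiguous ring pixels are all brighter than I(p) + t
    or all darker than I(p) - t."""
    p = int(img[y, x])
    labels = []
    for dy, dx in RING:
        v = int(img[y + dy, x + dx])
        labels.append(1 if v > p + t else (-1 if v < p - t else 0))
    doubled = labels + labels  # the ring wraps around
    for sign in (1, -1):
        run = 0
        for lab in doubled:
            run = run + 1 if lab == sign else 0
            if run >= n:
                return True
    return False

# A small bright blob: all 16 ring pixels are darker, so the test fires.
blob = np.zeros((21, 21), dtype=np.uint8)
blob[9:12, 9:12] = 200
# A straight edge: only ~7 contiguous ring pixels are darker, so it fails.
edge = np.zeros((21, 21), dtype=np.uint8)
edge[8:, :] = 200
print(is_fast_corner(blob, 10, 10), is_fast_corner(edge, 8, 10))  # True False
```

Note how the edge case is rejected: only the continuity requirement of 12 pixels distinguishes corners from straight edges, since both produce darker ring pixels.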

The accelerated segment test used in the FAST detector to reject outliers makes it very efficient; it performs this test by considering only the intensities of a set of intermittent points in the ring, for example pixels 1, 5, 9, 13 in figure (2.3). If fewer than three of these four pixels satisfy the brightness condition, point p is rejected as an outlier. The ORB detector performs FAST extraction at multiple image scale levels; it also adds orientation invariance, which FAST lacks on its own, through a method called orientation by intensity centroid. This is based on the assumption that the intensity centroid of a patch is offset from its geometric center. Therefore, the slope of the

Figure 2.3. FAST detector segment test [23]

line joining the two centers should result in the orientation of the patch.

m_pq = Σ_{x,y} x^p y^q I(x, y)    (2.1)

θ = atan2(m_01, m_10)    (2.2)

This is performed using the patch moments m_pq shown in equation (2.1), where the function I gives the intensity at pixel coordinates (x, y), and the moment indices p, q take the value 1 or 0, indicating whether the corresponding coordinate x or y is included. The orientation of the patch is then given by θ in equation (2.2). ORB also adds learning-based rotation invariance to the binary robust independent elementary features (BRIEF) descriptor, which works by performing binary tests between pixels on blurred patches, as described in [22].
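As a worked example, the centroid orientation of equations (2.1) and (2.2) can be computed directly on a patch; the patch size and contents below are arbitrary illustrations.

```python
import numpy as np

def patch_orientation(patch):
    """theta = atan2(m01, m10), with x, y measured from the patch centre."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0  # centre the coordinate grid
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)  # moment with p=1, q=0
    m01 = np.sum(ys * patch)  # moment with p=0, q=1
    return np.arctan2(m01, m10)

# Mass to the right of centre: centroid offset along +x, so theta ~ 0.
patch = np.zeros((7, 7))
patch[:, 5] = 1.0
# Mass below centre: centroid offset along +y, so theta ~ pi/2.
patch2 = np.zeros((7, 7))
patch2[6, :] = 1.0
print(patch_orientation(patch), patch_orientation(patch2))
```

Because the moments weight each pixel by its intensity, the resulting angle points from the patch centre toward its intensity centroid, which is the orientation assigned to the key point.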

2.3.2 SuperPoint Detector and Descriptor

Recent developments in feature detectors are centered around deep learning-based methods [24] that improve speed and robustness. Although they are not directly used in the algorithms tested in this thesis, it is essential to understand the future trend. One such promising detector is SuperPoint [25]; unlike traditional hand-crafted detectors such as ORB, SuperPoint is a learning-based, self-supervised detector and descriptor. Convolutional neural networks are well studied for conventional applications such as human or object detection, but such applications rely on the hand-annotated ground-truth labels necessary during the training process. It is difficult to achieve similar results with supervised learning for image feature points, as it is semantically harder for humans to define and label a key point.

The SuperPoint detector introduces a self-supervised method to address this problem: the detector network is initially trained on a very large synthetically generated

Figure 2.4. SuperPoint detector training stages

dataset consisting of simple shapes with known ground truth. The network trained on this synthetic data is referred to as MagicPoint; it performs well on synthetic data but not as well on real-world data compared to traditional detectors. To make MagicPoint more reliable, the synthetic images are augmented with multi-scale and multi-warping distortion during the homographic adaptation step, and MagicPoint is retrained, resulting in a more robust and reliable SuperPoint detector capable of performing in real-time applications.

2.3.3 KLT Optical Flow Tracker

During the data association step of visual SLAM, we need to find the point correspondences of the set of landmark features in incoming new images. Methods like feature descriptor matching or optical flow trackers can find these point correspondences, with the latter performing more efficiently. Published in 1991, the Kanade–Lucas–Tomasi (KLT) tracker [26] is a commonly used optical flow-based tracker in visual SLAM algorithms.

The KLT tracker utilizes the key points detected by the chosen detector; it computes the motion vector of a key point from its displacement between successive frames. In order not to lose track of all key points, the tracker re-initializes the actively tracked key points after a predefined fixed number of frames.
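The bookkeeping described above, and mirrored in Algorithm 1 below, might be sketched as follows; the class, its names, and the reset policy are hypothetical illustrations, not taken from any particular KLT implementation.

```python
# Illustrative track bookkeeping: motion vectors are the displacement of
# each tracked key point between frames, and tracking is re-initialised
# from fresh detections when too few key points survive.
class TrackState:
    def __init__(self, min_tracks=3):
        self.min_tracks = min_tracks
        self.positions = {}  # landmark id -> (x, y) in the last frame
        self.motion = {}     # landmark id -> (dx, dy) between frames
        self.next_id = 0

    def update(self, matches, detections):
        """matches: landmark id -> new (x, y); detections: fresh key points."""
        # Keep only the landmarks matched in the new frame and record
        # their motion vectors relative to the previous positions.
        self.motion = {i: (x - self.positions[i][0], y - self.positions[i][1])
                       for i, (x, y) in matches.items()}
        self.positions = dict(matches)
        # Re-initialise from new detections when too few tracks survive.
        if len(self.positions) < self.min_tracks:
            for (x, y) in detections:
                self.positions[self.next_id] = (x, y)
                self.motion[self.next_id] = (0.0, 0.0)
                self.next_id += 1

state = TrackState(min_tracks=3)
state.update({}, [(0, 0), (1, 1), (2, 2)])  # bootstrap from detections
state.update({0: (0.5, 0.0), 1: (1.5, 1.0)}, [(5, 5), (6, 6)])
print(state.motion[0])  # (0.5, 0.0)
```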

SSD error of image warping with incremental Δp:

E = Σ_x [ I(W(x; p + Δp)) − T(x) ]²    (2.3)

Using these known data associations between frames, the pose transformation of the camera can then be estimated using the warp function W(x; p), where x represents the pixel coordinates of the image being warped and p represents the parameters of the warp function to be estimated. First, equation (2.3) defines the

Algorithm 1 KLT tracker algorithm

4: Match landmarks from previous frames
5: if Tracked landmarks < Threshold then
6:     Initialize new landmarks from detected keypoints
7:     Initialize motion vectors for new landmarks
8: end if
9: Compute motion vectors for landmarks using new pixel locations
10: if E > error threshold then
11:     Calculate Δp using equation (2.4)
12:     Update SSD error E using equation (2.3)
13: end if
14: end for

sum of squared differences error function between the warped image and the template T, with prior parameters p and incremental change Δp. After linearizing and rearranging, Δp is given by equation (2.4), where H is the Hessian matrix (the second-order partial derivative term) of the linearized error with respect to the warp parameters:

Δp = H⁻¹ Σ_x [ ∇I (∂W/∂p) ]ᵀ [ T(x) − I(W(x; p)) ]    (2.4)

Δp is then added to p, and these steps are repeated iteratively until convergence. Once the parameters p of the warp function are estimated, the key points along with the warp estimate can be used to initialize the back-end optimization.
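As a sketch of this iterative scheme, consider the special case of a pure-translation warp W(x; p) = x + p, where ∂W/∂p reduces to the identity and H becomes a 2×2 matrix of summed gradient products. The bilinear sampler and the synthetic Gaussian images below are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def bilinear(img, ys, xs):
    """Sample img at float coordinates (ys, xs) with bilinear interpolation."""
    y0 = np.clip(np.floor(ys).astype(int), 0, img.shape[0] - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, img.shape[1] - 2)
    fy, fx = ys - y0, xs - x0
    return ((1 - fy) * (1 - fx) * img[y0, x0] + (1 - fy) * fx * img[y0, x0 + 1]
            + fy * (1 - fx) * img[y0 + 1, x0] + fy * fx * img[y0 + 1, x0 + 1])

def klt_translation(T, I, iters=100):
    """Estimate p = (dy, dx) such that I(x + p) ~ T(x)."""
    h, w = T.shape
    gy, gx = np.mgrid[0:h, 0:w].astype(float)
    p = np.zeros(2)
    for _ in range(iters):
        warped = bilinear(I, gy + p[0], gx + p[1])  # I(W(x; p))
        err = T - warped                            # residual of (2.3)
        iy, ix = np.gradient(warped)                # image gradient at the warp
        # For a translation warp dW/dp is the identity, so H is the 2x2
        # sum of outer products of the image gradient.
        H = np.array([[np.sum(iy * iy), np.sum(iy * ix)],
                      [np.sum(iy * ix), np.sum(ix * ix)]])
        b = np.array([np.sum(iy * err), np.sum(ix * err)])
        dp = np.linalg.solve(H, b)                  # Delta-p of (2.4)
        p += dp
        if np.linalg.norm(dp) < 1e-6:
            break
    return p

# Template with a Gaussian blob at (30, 30); new image with the blob
# displaced to (32, 33), so the estimated p should be close to (2, 3).
gy, gx = np.mgrid[0:64, 0:64].astype(float)
T = np.exp(-((gy - 30) ** 2 + (gx - 30) ** 2) / 100.0)
I = np.exp(-((gy - 32) ** 2 + (gx - 33) ** 2) / 100.0)
print(np.round(klt_translation(T, I), 2))
```

The loop converges because each Gauss–Newton step reduces the SSD of equation (2.3); for larger displacements, practical trackers run this iteration over an image pyramid, coarse to fine.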