
EXPERIMENTAL EVALUATION OF DEPTH CAMERAS FOR PALLET DETECTION AND POSE ESTIMATION

Master of Science thesis
Faculty of Engineering and Natural Sciences
Examiners: Prof. Reza Ghabcheloo and MSc Jukka Yrjänäinen
February 2021


ABSTRACT

Arvi Syrjänen: Experimental evaluation of depth cameras for pallet detection and pose estimation
Master of Science thesis

Tampere University

Degree Programme in Automation Engineering, MSc (Tech)
February 2021

AGVs are widely used for automatic handling of goods in industrial environments, often using some kind of standardized pallets as a platform. Conventionally used AGVs require well-defined, structured and obstacle-free working environments. To work in a dynamic environment, where the exact places of the handled pallets are not necessarily known, the pallet handler needs to detect the target pallets from its environment and localise them accurately enough for manipulation. This can be done using model-based pattern recognition approaches or data-intensive deep learning approaches. Regardless of the chosen approach, the algorithm needs data to work with, which could be conventional image data or depth data. Depth data produced by laser scanners or depth cameras can be used for accurate pose estimation.

This thesis aims to evaluate depth cameras and to propose criteria against which future depth cameras can be evaluated. The evaluation focuses on the needs of the target application, Himmeli Robotics, a research project in the Laboratory of Innovative Hydraulics and Automation (IHA) at the Faculty of Engineering and Natural Sciences of Tampere University. The project includes technical research, market research and product development for a system that aims to partially or fully automate mobile pallet handling tasks in indoor logistics.

During this thesis, a short literature review on pallet detection and pose estimation from depth data is conducted, along with a more in-depth review of the theory, working principles and noise characteristics of the commonly used depth camera technologies. Based on the reviews, evaluation criteria for depth cameras focused on object detection and pose estimation are suggested. A set of industry-standard depth cameras is evaluated against these criteria using both qualitative and quantitative test methods, either common in the literature or newly proposed. Noise patterns that occur especially during the qualitative tests are investigated further for each camera individually, and possible solutions to prevent the noise or reduce its effect are presented. Finally, the most suitable depth cameras for the target application and a focus for further development are suggested.

The following properties of the depth data were chosen as the evaluation criteria: absolute error and variance of the depth data, point cloud density, tolerance for different materials, shapes and illumination conditions, and the edge fidelity of the recorded objects in the depth data. The evaluated cameras were the Realsense D435i, Realsense D415, Realsense L515, Stereolabs ZED, Azure Kinect DK and IFM O3D303. Of these, the IFM O3D303 was chosen as the primary depth camera option for the target application due to its robustness with different materials and illumination conditions and its fairly small absolute depth error and depth data variance. However, using the IFM O3D303 necessitates further testing and the development of a post-processing filter, whose functionality is described in the thesis. The Azure Kinect DK is a good second choice due to its small absolute depth error and variance of depth data, good frame rate and dense point clouds.

Keywords: depth camera, stereo depth, active stereo, LiDAR, time-of-flight, amplitude modulated continuous wave ToF, pulse based ToF, automatic pallet picking, Automated Guided Vehicle, object detection, pose detection

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ

Arvi Syrjänen: Experimental evaluation of depth cameras for pallet detection and pose estimation
Master of Science thesis

Tampereen yliopisto

Degree Programme in Automation Engineering, MSc (Tech)
February 2021

Automated guided vehicles are used in industry for the automatic handling of goods, which are usually loaded on pallets of standardized size. The AGVs in common use today require a precisely defined and obstacle-free operating environment in which the locations of the pallets are predetermined. To operate in a dynamic environment, where the pallet locations are not necessarily known exactly, the handler must be able to detect a pallet and its exact pose from its surroundings. For this purpose there are pattern recognition algorithms that exploit the standardized dimensions of the pallets, as well as machine learning algorithms based on data-intensive deep learning. Regardless of its nature, the algorithm needs data, which can be conventional image data or data produced by a depth camera. The range data produced by a depth camera can be used to estimate the exact pose of an object.

The purpose of this thesis is to propose a set of evaluation criteria and to evaluate depth cameras against these criteria. The evaluation should take into account the needs of the target application, a research project in the Innovative Hydraulics and Automation research group at the Faculty of Engineering and Natural Sciences of Tampere University. The research project, named Himmeli Robotics, includes technical research, market research and product development for mobile robotics aimed at the automatic handling of pallets.

This thesis conducts a short literature review on pallet detection and pose estimation from depth camera data, and a broader literature review on the working principles of different depth cameras and on the theory and causes of the noise characteristic of each depth camera technology.

The thesis also presents a set of evaluation criteria for depth cameras that takes the requirements of object detection and pose estimation into account, and tests and compares six current depth cameras against each other based on these criteria, using both qualitative and quantitative test methods drawn from the literature and newly proposed in this work. The noise patterns that appear especially during the qualitative testing are investigated further for each camera model to determine their causes and to find possible solutions for preventing the noise or reducing its effect. Finally, the thesis proposes the most suitable depth cameras for the application and possible topics for further development.

The following properties are proposed as the evaluation criteria: the absolute error of the depth data, the variance of the depth data, point cloud density, the effect of different surface materials, shapes and illumination conditions on the depth data, and the clarity and integrity of object edges in the depth data. The evaluated depth cameras were the Realsense D435i, Realsense D415, Realsense L515, Stereolabs ZED, Azure Kinect DK and IFM O3D303. Of these, the IFM O3D303 was chosen as the primary depth camera for the application because of its robustness to surface materials and illumination and its small absolute error and data variance. Using the IFM O3D303, however, requires the development of a post-processing filter, whose functionality is described in the thesis. The Azure Kinect DK was chosen as the secondary option thanks to its small absolute error and data variance, its frame rate and its point cloud density.

Keywords: depth camera, stereo camera, time-of-flight, range measurement, automatic material handling, automated guided vehicle, object detection, pose estimation

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

I would like to thank Reza Ghabcheloo and Jukka Yrjänäinen for making this work possible and for supervising and examining it. Thanks also to my family for all the support and encouragement I have received during my studies. Thanks to all my fellow students and to Tampereen Teekkarien Hiihto- ja Purjehdusseura for the great years behind us, and hopefully for those still to come!

Tampere, 14th February 2021 Arvi Syrjänen


CONTENTS

1 Introduction
2 Depth cameras and pallet detection from depth data
2.1 Pallet detection and pose estimation from data produced by depth sensors
2.2 Depth camera types and depth data
2.3 Depth by triangulation
2.4 Time-of-Flight depth sensing
2.4.1 Pulse-based Time-of-Flight
2.4.2 Amplitude Modulated Continuous Wave Time-of-Flight
2.4.3 Types of Time-of-Flight sensors
2.5 Noise in depth images
2.5.1 Effect of surface optical properties
2.5.2 Sources of error in stereo depth images
2.5.3 Sources of error in Time-of-Flight depth images
3 Evaluated depth cameras and evaluation methodology
3.1 Evaluation criteria
3.2 Evaluated depth cameras
3.3 Test methods
3.3.1 Qualitative test setup
3.3.2 Quantitative test setup
4 Qualitative evaluation of depth cameras
4.1 ZED stereo camera
4.2 Realsense D435i camera
4.3 IFM O3D303 camera
4.4 Azure Kinect DK camera
4.5 Realsense L515 camera
5 Quantitative evaluation of depth cameras
5.1 Absolute depth error test results
5.2 Precision test results
5.3 Pallet front face fill rate test results
5.4 IFM O3D303 intensity-related error with the black pallet
5.5 Comments on the used evaluation methods
6 Discussion and conclusion
References


LIST OF SYMBOLS AND ABBREVIATIONS

AGV Automated Guided Vehicle
ASC Advanced Scientific Concepts, Inc.
CNN Convolutional Neural Network
DC Direct current
LiDAR Light Detection and Ranging, method for measuring distances utilizing time-of-flight of emitted light
SL Structured Light, method for estimating depth using triangulation and coded light
CW-ToF Continuous Wave modulated ToF measurements
PB-ToF Pulse modulated ToF measurements
DK Developer Kit
DL Deep Learning
EPAL European Pallet Association e.V.
EUR-pallet The standard European pallet as specified by EPAL
FOV Field of view
GAPD Geiger-mode Avalanche Photo Diode
IFM A manufacturer of sensors and controls for industrial automation
IHA Innovative Hydraulics and Automation
IMU Inertial Measurement Unit
IR Infrared
ISO International Organization for Standardization
MATLAB A numerical computing environment
MPI Multipath Interference
NIR Near-Infrared
PDS Pallet Detection System from depth data developed by IFM
PMD Photonic Mixer Device/Detector, another name for a CW-ToF camera
PMD Technologies A developer of semiconductor components for ToF imaging
RANSAC Random SAmple Consensus, a method to fit a model to data
MSAC M-estimator SAmple Consensus, a method to fit a model to data
RGB The red, green and blue channels of a color image
RMSE Root Mean Squared Error
ROI Region of Interest
SDK Software Development Kit
SNR Signal-to-Noise ratio
SORT Simple Online and Realtime Tracking
SPAD Single Photon Avalanche Diode
ToF Time of Flight
USB Universal Serial Bus
VGA Video Graphics Array, refers to the resolution of 640 x 480
VSC Visual Servo Control
XGA Extended Graphics Array, refers to the resolution of 1024 x 768 in this thesis


1 INTRODUCTION

Automated Guided Vehicles (AGVs) like the Toyota AGV [1] or the Rocla AGV [2] have been widely used for the automatic handling of material in industrial environments for over a decade [3]. The handled goods are usually loaded on standardized pallets like the EPAL-pallet [4]. Conventional AGVs are almost always under some form of centralized control and require well-defined, structured and obstacle-free working environments [3, 5]. To relieve such constraints and make operating in unstructured and dynamic environments without well-defined pallet locations possible, the mobile manipulator needs to sense the target pallets on its own, for example by using visual data. The pallet manipulation can be controlled using the found pallet pose data as feedback, as in the series of papers by Aref et al. [6–9], where the automatic pallet picking problem was tackled using a visual servo controlled (VSC) articulated-frame-steering hydraulic mobile machine equipped with a forklift manipulator. The system successfully performs path planning and controls both the mobile platform at the macro level and the rotary-prismatic-rotary forklift manipulator in the close vicinity of the pallets using visual feedback, inertial measurement unit (IMU) data and odometry readings. Fiducial markers are attached to the face of the used pallet and their poses are detected from a video feed.

This thesis is conducted as a part of Himmeli Robotics, a research project in the Laboratory of Innovative Hydraulics and Automation (IHA) at the Faculty of Engineering and Natural Sciences of Tampere University. Himmeli Robotics includes technical research, market research and product development for a system that aims to partially or fully automate mobile pallet handling tasks in indoor logistics. The problems to be solved for the system to work in a dynamic environment without well-defined pallet locations are detection, tracking and pose estimation of the pallets. Object detection answers the question of whether there are objects of interest in the scene and localises them with bounding boxes, while object tracking aims to keep track of these detections over time as the object or the camera moves. Pose estimation provides a more accurate pose of said objects with respect to the system; a full 6-DOF (degrees of freedom) pose estimator provides the translation T = [x, y, z] and rotation R = [r, p, y] of an object.

To relieve the system from the need of using special markers, like the ones used in [6–9], a way to detect a general pallet is needed. In addition to solving the object detection problem, the detected objects should be tracked to associate the detected objects with an identity and to reduce the effect of possible occlusions and sensor noise.


A potential approach for object detection is using conventional color or grayscale images with a model-based pattern recognition approach [10, 11] or a data-driven deep learning (DL) approach [12, 13]. A comparison of three Convolutional Neural Network (CNN) models for pallet detection was conducted in [13], including the Faster R-CNN [14], YOLOv4 [15] and SSD [16] models. In the paper, a data set of pallets was gathered and annotated, the models were each trained with this data and then compared against each other in a pallet detection task. The evaluation involves the detection of the pallet front face and the pallet fork pockets separately.

The tracking problem can be solved using tracking-by-detection methods, where the detection and tracking tasks are separated and the output of an independent object detector is combined with a tracker that associates the detections with targets [17]. In a paper by Mandal and Adu-Gyamfi [18], YOLOv4 and other object detectors were combined with the DeepSORT tracker [19] to track cars from traffic footage with good results and real-time performance. SORT [17] is a widely used tracking-by-detection method that uses the Kalman filter for target motion prediction and the Hungarian algorithm for detection association; a sketch of the association step is given below. DeepSORT extends SORT with a deep learning appearance descriptor for improved target association. In contrast to the traffic surveillance problem in [18], in our application it is most often the camera that moves, not the objects in the scene. If odometry data from the mobile platform is available, it could be fused with the detection data as was done in [6–9].
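As an illustration of the SORT-style association step described above, the following minimal sketch (not from the thesis or from [17]) matches predicted track boxes to new detections by maximising IoU with the Hungarian algorithm; the box format and the 0.3 IoU gate are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # Boxes as [x1, y1, x2, y2]; returns intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_gate=0.3):
    # Hungarian assignment on the negative-IoU cost matrix.
    cost = np.array([[-iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches above the IoU gate.
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= iou_gate]
```

In a full tracker the track boxes would come from a Kalman-filter prediction step and unmatched detections would spawn new tracks; only the assignment itself is sketched here.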

The pallets can also be detected and localised from depth data [5, 20–22]. Using conventional color or grayscale image data for pallet detection might be a reasonable approach, especially at longer distances, since depth cameras do not have very long accurate operating ranges, as the results of this thesis will later show. However, depth data could be used for more accurate pose estimation of the manipulated objects at closer distances [23]. This requires the system to be equipped with suitable depth-sensing hardware. The system should also be updatable as the associated depth camera technology matures.

Therefore, the research problem of this thesis is to evaluate depth cameras and find the conditions where the various depth camera technologies fail to produce the desired output, while proposing criteria that focus on the needs of the target application and test methods that can be used to evaluate upcoming depth cameras against the proposed criteria.

The thesis is structured as follows. Chapter 2 presents approaches for pallet detection and pose estimation from depth data found in the literature and more importantly presents the theory and different working principles of depth cameras. Chapter 3 describes the evaluation criteria, the chosen test devices and the conducted practical experiments.

The analysis of the results is split into two chapters: the qualitative test results are assessed in chapter 4 and the quantitative test results in chapter 5. Chapter 5 also comments on the used evaluation methods and presents improvements to them.

Finally, chapter 6 summarizes the found results and concludes the whole thesis.


2 DEPTH CAMERAS AND PALLET DETECTION FROM DEPTH DATA

This chapter starts with an overview of approaches found in the literature for pallet detection and pose estimation from depth sensor produced data. Next, the commonly used depth camera technologies and the data types used to represent depth data are introduced, then the different depth camera working principles are shown and finally, the noise sources of depth images found in the literature are presented.

2.1 Pallet detection and pose estimation from data produced by depth sensors

Depth camera research and development has truly accelerated during the last decade. A large portion of the papers in the literature during the first half of the decade is linked to the tremendously popular Microsoft Kinect structured-light depth camera [24]. The original Kinect had good performance for the time combined with reasonable pricing and wide-scale availability, and since then more accurate affordable depth cameras have been developed [24]. The advancements in depth camera technology have also contributed towards the development of machine learning from 3D data, as realistic 3D data has become more widely available simultaneously with the development of deep learning methods with conventional 2D images [25].

In [20] a model-based approach for object detection and pose estimation of euro pallets was developed. In the paper, the standardized shape of the pallets [4] is used as a geometrical constraint on point clouds generated with a depth camera. The method successfully detected euro pallets lying on the ground in real time using depth data gathered with a Kinect v2 time-of-flight depth camera. The pallets were assumed to lie flat on the ground, so the resulting pose estimate consists of the x and y coordinates and the yaw angle around the vertical axis. In their approach the front surface of the pallet was detected by first removing the ground plane from the point cloud to reduce computations, then vertical surfaces were detected from the point cloud using region growing, and lastly, the three blocks of the pallet front face were detected as vertical surfaces with the distances between each other defined by the standardized pallet dimensions.

A similar approach is taken in [21], where first all planar segments are detected from the point cloud using a region growing algorithm. From the set of planar segments, all non-vertical segments and segments that significantly differ in width from the used pallets are filtered out. Pallets are detected from the remaining segments with sliding window template matching; different types of pallets are detected by using multiple templates. The pose of the pallet is determined by the middle point and the normal of the recognized patch.
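Both pipelines above begin by extracting planar segments from the point cloud. The following minimal sketch (not part of either paper) shows the ground-plane removal and vertical-plane filtering steps using the Open3D library; the input file name scene.pcd and the thresholds are assumptions.

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.pcd")  # hypothetical input cloud

# Fit the dominant plane (assumed to be the ground) with RANSAC.
(a, b, c, d), inliers = pcd.segment_plane(distance_threshold=0.02,
                                          ransac_n=3,
                                          num_iterations=1000)
rest = pcd.select_by_index(inliers, invert=True)  # drop the ground points

# Keep a plane only if it is roughly vertical, i.e. its normal is nearly
# perpendicular to the ground normal (a, b, c).
ground_n = np.array([a, b, c]) / np.linalg.norm([a, b, c])
(pa, pb, pc, pd), plane_idx = rest.segment_plane(0.02, 3, 1000)
plane_n = np.array([pa, pb, pc]) / np.linalg.norm([pa, pb, pc])
if abs(np.dot(ground_n, plane_n)) < 0.1:
    candidate = rest.select_by_index(plane_idx)  # possible pallet front face
```

The papers additionally check the segment width and the spacing of the three front-face blocks against the standardized pallet dimensions; those checks are omitted here.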

Papers that address pallet detection and/or pose estimation from depth data using model-based methods can be found in the literature [20–22], but only a few papers focus on data-intensive learning approaches with depth data [5]. However, in [5] an architecture using DL methods for pallet detection, localisation and tracking with 2D laser rangefinder data was developed. The approach is based on detecting the pallets using a Faster R-CNN network coupled with a CNN-based classifier, and localising and tracking the pallets with a Kalman filter based module. The pallets are localised with bounding boxes from the 2D laser scans. The 2D laser rangefinder provides range data in polar coordinates around the sensor, which is converted into binary images of the surroundings. The data set containing range scans of 340 pallets used for training and evaluation is publicly available [26].

Development of new deep learning methods for depth data and the adaptation of existing ones is a very active field of research [27]. Considering the maturity of 2D CNN architectures and their effectiveness with conventional RGB images, the most straightforward way to utilize depth data in machine learning approaches would be to handle the data as RGB-D images and stack the depth values as a fourth channel onto an existing CNN architecture [27]. This allows the maturity of 2D deep learning architectures and large RGB data sets to be exploited.

Examples of deep learning architectures designed from the ground up specifically for 3D data are the PointNet++ [28] for object classification and segmentation from point clouds and the VoxelNet [29] that first discretizes point clouds to voxel grids for object detection.

Griffiths and Boehm [27] provide a good overview of deep learning methods for object classification, detection and segmentation with depth data, and Sahin et al. [30] provide an extensive and in-depth review of the current state-of-the-art deep learning methods for object pose estimation. The detection and pose estimation of pallets and other objects is a study of its own and will not be addressed any further in this thesis; instead, the rest of the thesis will focus on depth cameras.


2.2 Depth camera types and depth data

Depth cameras can be divided into four classes based on their working principle: passive stereo cameras, active stereo cameras, structured-light cameras and Time-of-Flight (ToF) cameras. The stereo cameras and the structured-light cameras estimate the depth of a point in the scene by triangulation when the camera system's dimensions are known. ToF cameras measure the time that a light signal takes to travel from the camera to the scene and back, and the distance is then calculated using the speed of light. See figure 2.1 for an illustration of the different technologies.

Figure 2.1. Most common depth camera technologies [31]

3D data can be presented in many different ways such as depth images, voxel grids, multi-views, meshes and point clouds [25]. This thesis will consider mostly depth images and point clouds as they are the most common data formats outputted by depth cameras.

In depth images, each pixel contains a depth value that is usually the orthogonal distance from the camera focal plane to the scene, D_orth in figure 2.2. Assuming the pinhole camera model and knowing the camera's focal length f and optical center location, the orthogonal distance D_orth can be calculated from the undistorted depth image's radial distance D_rad using equation 2.1, where x_p and y_p are the pixel's distances to the optical center.

D_{orth} = D_{rad} \cdot \cos\left(\arctan\left(\frac{\sqrt{x_p^2 + y_p^2}}{f}\right)\right)    (2.1)
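As an illustration of equation 2.1, a minimal NumPy sketch (not from the thesis) that converts a radial-distance depth image into orthogonal distances; the image size, focal length and principal point values are assumptions.

```python
import numpy as np

def radial_to_orthogonal(d_rad, fx, cx, cy):
    # d_rad: HxW depth image of radial distances (undistorted pinhole model).
    h, w = d_rad.shape
    xs, ys = np.meshgrid(np.arange(w) - cx, np.arange(h) - cy)
    theta = np.arctan(np.sqrt(xs**2 + ys**2) / fx)  # angle off the optical axis
    return d_rad * np.cos(theta)                    # equation 2.1

d_orth = radial_to_orthogonal(np.full((480, 640), 2.0), fx=600.0, cx=320.0, cy=240.0)
```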

Figure 2.2. Orthogonal distance Dorth and radial distance Drad in the pinhole camera model [32].

A point cloud is a collection of any number of points, which are data structures that contain coordinate values for each dimension of 3D Euclidean space. Since the vertical angles β are equal in figure 2.2, the coordinate values X, Y, Z of a point corresponding to a pixel of the depth image can be calculated using equation 2.2.

X = D_{orth} \, x_p / f
Y = D_{orth} \, y_p / f
Z = D_{orth}    (2.2)

One can also incorporate RGB data into the point cloud, which is often readily available if the used sensor is an RGB-D camera. The points then become data structures with six values, one for each coordinate and one for each color channel. An accurate calibration between the depth frame and the color frame is required for the color channels to fall on the correct points when creating an RGB point cloud from separate depth and RGB images. Examples of a depth image, a point cloud and a color point cloud are presented in figure 2.3.
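A minimal sketch (not from the thesis) of the back-projection in equation 2.2, turning an orthogonal-distance depth image into an N x 3 point cloud; the intrinsics are assumed values, separate fx/fy are used in place of the single f of eq. 2.2, and invalid (zero) depth pixels are dropped.

```python
import numpy as np

def depth_to_points(d_orth, fx, fy, cx, cy):
    # d_orth: HxW orthogonal depth image; returns an Nx3 array of XYZ points.
    h, w = d_orth.shape
    xs, ys = np.meshgrid(np.arange(w) - cx, np.arange(h) - cy)
    valid = d_orth > 0                      # zero marks invalid depth
    z = d_orth[valid]
    x = z * xs[valid] / fx                  # equation 2.2
    y = z * ys[valid] / fy
    return np.column_stack((x, y, z))

points = depth_to_points(np.full((480, 640), 1.5), fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```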


Figure 2.3. An example of a depth image (top), a point cloud (left) and a color point cloud (right) from the same scene. The darker pixels in the depth image indicate the scene is closer, black pixels indicate invalid depth values. In the point cloud on the left different RGB colors represent different depth values. They are drawn by the visualisation program and not a part of the point cloud itself. The color point cloud on the right is fused with the corresponding color image from the same camera.

2.3 Depth by triangulation

A passive stereo depth camera takes images of the same scene simultaneously with two or more separate image sensors, which are displaced horizontally from each other. The problems of calibration, correspondence and triangulation need to be solved to resolve the depth value of a point in the scene from the images. In calibration, the relative positions and orientations of the imagers as well as their individual internal parameters (focal length, optical center and lens distortions) are found. From two images taken of the same scene at the same time, the correspondence problem is solved, where the pixels that represent the same point in the scene are found in both images. The relative difference of the corresponding pixels is called disparity. Finding corresponding pixels might be difficult in regions of homogeneous intensity and color. The full set of disparities found from an image pair is called a disparity map or a disparity image [33, 34].

Depth calculated by triangulation from coplanar stereo images is based on the constant ratio of the heights and bases of the triangles O_l P O_r and p_l P p_r in figure 2.4. The depth Z can be calculated using eq. 2.3, where d is the disparity, B is the baseline, i.e. the horizontal distance between the imager optical centers, and f is the focal length of the imagers. Disparity is the difference in image coordinates (pixels) between the corresponding points, d = x_l − x_r. The larger the disparity is, the closer the feature is, and vice versa.

Figure 2.4. Triangulation of depth Z from two coplanar pinhole-camera images. Disparity d = x_l − x_r is the difference of the same real-world point P in the stereo image pair; pixels p_l and p_r both depict point P.

\frac{Z}{B} = \frac{Z - f}{B - d} \;\;\Rightarrow\;\; Z = f\frac{B}{d}    (2.3)

In reality, stereo cameras are rarely coplanar and resemble more the situation depicted in figure 2.5. However, knowing the calibration between the cameras and taking advantage of the epipolar constraint, the image planes can be rectified and the distance Z can then be calculated with eq. 2.3 as if the epipolar lines were parallel and on the same plane [34, 35].
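A short worked sketch of eq. 2.3 (not from the thesis); the baseline and focal length are assumed example values, chosen only to show how quickly disparity shrinks with distance.

```python
# Stereo depth from disparity, Z = f*B/d (eq. 2.3).
f_px = 640.0        # focal length in pixels (assumed)
baseline_m = 0.055  # baseline in metres (assumed)

for d_px in (40.0, 20.0, 4.0):          # example disparities
    z = f_px * baseline_m / d_px
    print(f"disparity {d_px:4.1f} px -> depth {z:.2f} m")
# 0.88 m, 1.76 m, 8.80 m: far-away points map to only a few pixels of disparity.
```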


Figure 2.5. Arbitrary geometrical configuration between the two images that resembles the real-world situation. x is found from the epipolar line l due to epipolar geometry. [36]

Correspondence matching methods try to find pixels that depict the same real-world point in both images. It is a computationally intensive task processed either on-chip on the camera [37] or on a separate host computer [38]. There are multiple approaches to correspondence matching [34], but they can roughly be divided into two: feature-based matching and correlation-based matching [39]. Feature-based matching is based on finding clear geometrical elements in the scene, such as lines, corners and curves. This is difficult for homogeneous scenes and results in quite a sparse depth map if the scene is not abundant with good features, but the advantages of the method are its robustness against intensity differences and its lower computational demand [39]. An example of correlation-based matching is taking a fixed-size window around a point P and finding a window with matching values surrounding point P in the other image. This is computationally more demanding than feature-based matching and sensitive to intensity variations [39]. Correlation-based matching requires a textured scene but usually results in denser disparity maps than feature-based matching [39].

The epipolar lines are used to reduce the search space for feature correspondences. Due to the epipolar geometry, a point in frame 1 must fall on the corresponding epipolar line of frame 2, which is the line where the epipolar plane cuts frame 2, as can be seen from figure 2.5. This reduces the search space from a plane search to a line search, reducing search time and the possibility of false matches [34].

Structured light and active stereo depth imaging try to relieve the restriction of requiring geometrical features or a textured scene by projecting their own features onto the scene with light at visible or infrared (IR) frequencies. They both estimate depth with the same principle of triangulation as conventional passive stereo imaging does.

No structured light (SL) cameras are included in the depth camera evaluation of this thesis, but the technology is still mentioned here due to its relevance in the field. The original Kinect depth camera is based on SL technology. SL replaces one of the stereo camera's imagers with a projector that projects light (usually at IR wavelengths) with an encoded pattern onto the scene. The encoded pattern simplifies the correspondence problem as the pattern is known a priori [40]. The scene is imaged with the other camera and the distance to the scene is triangulated from the found correspondences [40]. Choosing the projected pattern and interpreting depth from it is not trivial. The projected pattern(s) may use temporal coding, direct coding or spatial neighbourhood coding strategies, of which temporal coding uses different patterns at different points in time, direct coding uses color-coded patterns and spatial neighbourhood coding projects a unique pattern for each pixel, from which each pixel can be triangulated individually [40]. Spatial neighbourhood coding using IR light is currently the most common strategy in commercial SL cameras [39].

Active stereo depth imaging also uses a projector to add features to the scene. Instead of projecting a coded, known pattern like structured light does, an arbitrary, non-repeating pattern is projected simply to add features for the stereo camera, from which it creates the disparity map [41]. Active stereo cameras can create the disparity map using both the natural features of the scene and the projected ones, which makes them quite robust against both bright and dark scenes, as shown later in this thesis. Apart from the projected features, active stereo cameras are no different from passive ones.

2.4 Time-of-Flight depth sensing

ToF cameras and LiDARs operate in a similar way. The distance to the scene is calculated by utilizing the constant speed of light and the round-trip time of a near-infrared (NIR) or IR light signal that is emitted to the scene and detected after it has been reflected back by the scene. The signal's round-trip time ∆t is measured either directly (pulse-based ToF) or indirectly using the modulated wave's phase shift (continuous wave ToF). The distance d is then calculated using equation 2.4, where c is the speed of light.

d = \frac{\Delta t \, c}{2}    (2.4)

2.4.1 Pulse-based Time-of-Flight

There are different ways to implement a pulse-based ToF camera (PB-ToF) [42], but they all emit very short pulses of light, capture the reflected light and measure the time difference directly. PB-ToF cameras usually use fast electronics like Single Photon Avalanche Diodes (SPADs) to capture light. A naive approach to a direct ToF measurement would be to start a very fast counter when emitting a light pulse and count how long it takes for the pulse to travel to the scene and back to the camera. Say an accuracy of ±1 mm is wanted for a depth sensor. The depth sensor would then have to discriminate light pulses with a 6.6 ps resolution, and a SPAD implemented with silicon technology cannot achieve this kind of accuracy at room temperature [43]. Because of this, pulse-based measurements employ averaging and multi-measurement techniques.
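As a quick check of the quoted timing requirement, the round-trip time corresponding to a 1 mm depth difference is

\Delta t = \frac{2\,\Delta d}{c} = \frac{2 \times 10^{-3}\,\mathrm{m}}{3 \times 10^{8}\,\mathrm{m/s}} \approx 6.7\,\mathrm{ps}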


In figure 2.6 the working principle of a PB-ToF camera is illustrated. A pulse is emitted at t = 0, and it returns from the scene after ∆t. Upon return, the pulse is measured using two temporal windows C1 and C2, which are separated by 180 degrees in phase. When the shutter times w1 and w2 are set equal to the emitted pulse width w_pulse, the calculations are simplified and the possibility of ambiguous depths is eliminated [44]. The travel time of the light pulse can be calculated from the ratio of the measurements g1 and g2, resulting in the equation for distance d taking the form of equation 2.5, where c is the speed of light and w = w_pulse = w1 = w2 is the pulse width used. The in-depth formulation of the equation is provided in [44].

d = \frac{1}{2} c \, w \left(\frac{g_2}{g_1 + g_2}\right)    (2.5)

Figure 2.6. Pulsed light ToF cameras emit short pulses of light for time w_pulse and the reflection is sampled during two out-of-phase windows C1 and C2.

To compensate for the ambient light that affects g1 and g2, two measurements are made, once with the active illumination and once without. The values from the measurement without illumination are subtracted from the illuminated ones to obtain depth values unaffected by ambient light. To further improve the signal-to-noise ratio (SNR), multiple pulse cycles can be averaged per measurement; e.g. the PB-ToF camera used by Sarbolandi et al. in their paper [44] uses 3000 pulses per range measurement.
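A minimal sketch (not from the thesis) of the distance calculation in eq. 2.5 combined with the ambient-light compensation described above; the window width and the sample values are assumptions.

```python
C = 3.0e8  # speed of light, m/s

def pb_tof_distance(g1, g2, g1_dark, g2_dark, w):
    # Subtract the ambient-only measurement, then apply eq. 2.5.
    g1c, g2c = g1 - g1_dark, g2 - g2_dark
    return 0.5 * C * w * (g2c / (g1c + g2c))

# 30 ns pulse/window width and arbitrary example electron counts.
print(pb_tof_distance(g1=900.0, g2=350.0, g1_dark=100.0, g2_dark=100.0, w=30e-9))
# -> about 1.07 m
```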

2.4.2 Amplitude Modulated Continuous Wave Time-of-Flight

The round-trip time can also be determined using the phase shift of a modulation signal. In Amplitude Modulated Continuous Wave ToF (CW-ToF) measurements the emitted NIR light is amplitude modulated with a radio frequency signal. The returning signal is captured using a CMOS or a CCD/CMOS image sensor and the phase difference φ between the emitted and the returning signals can be estimated with cross-correlation [45–47]. Knowing the modulation frequency f_mod, the distance d can then be calculated using equation 2.6, where c is the speed of light.

d = \frac{c}{4 \pi f_{mod}} \, \varphi    (2.6)

CW-ToF cameras, also known as Photonic Mixer Devices/Detectors (PMD), perform CW-ToF measurements in every pixel in parallel [47]. The returning signal is typically sampled with four measurements that are at a 90-degree phase difference from each other. Figure 2.7 illustrates the process.

Figure 2.7. Illustration of the sampling of a sinusoidal amplitude modulated CW-ToF signal with four measurements m_0–m_3 [47].

The measurements m_0, m_1, m_2 and m_3 are used to calculate the amplitude A and offset B of the returning signal and the phase difference φ between the emitted and returning signals with equations 2.7, 2.8 and 2.9 [47, 48]. The emitted (solid red curve) and the returning (dotted green curve) signals are illustrated in figure 2.8. The shift in offset B consists mostly of background illumination and possibly the DC component of the signal [45, 48]. More in-depth formulations of the CW-ToF equations with sinusoidal signals are provided in [45, 46]; other signal shapes can be used as well [47].


Figure 2.8. The phase difference φ between the emitted signal (solid curve) and the received signal (dotted curve). The offset B and the modulation amplitude A of the received signal together make up the overall intensity of the signal. The difference in offset B is caused mainly by background light. [48]

\varphi = \arctan\left(\frac{m_3 - m_1}{m_0 - m_2}\right)    (2.7)

A = \frac{\sqrt{(m_3 - m_1)^2 + (m_0 - m_2)^2}}{2}    (2.8)

B = \frac{m_0 + m_1 + m_2 + m_3}{4}    (2.9)
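A minimal NumPy sketch (not from the thesis) of equations 2.6-2.9, recovering phase, amplitude, offset and distance from the four samples; the modulation frequency and sample values are assumptions, and arctan2 is used in place of the plain arctan of eq. 2.7 to resolve the full quadrant.

```python
import numpy as np

C = 3.0e8  # speed of light, m/s

def cw_tof_demodulate(m0, m1, m2, m3, f_mod):
    phi = np.arctan2(m3 - m1, m0 - m2)                       # eq. 2.7 (full quadrant)
    phi = np.mod(phi, 2.0 * np.pi)                           # keep phase in [0, 2*pi)
    amplitude = np.sqrt((m3 - m1)**2 + (m0 - m2)**2) / 2.0   # eq. 2.8
    offset = (m0 + m1 + m2 + m3) / 4.0                       # eq. 2.9
    distance = C / (4.0 * np.pi * f_mod) * phi               # eq. 2.6
    return distance, amplitude, offset

# Example with a 30 MHz modulation frequency and arbitrary sample values.
print(cw_tof_demodulate(250.0, 100.0, 250.0, 400.0, f_mod=30e6))
# -> (1.25, 150.0, 250.0): 1.25 m, amplitude 150, offset 250
```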

The maximum range of a CW-ToF camera is limited by the frequency of the modulating signal f_mod. A measured phase difference φ can equally well result from measurements at distances d, d + d_amb, d + 2d_amb, ..., d + n·d_amb, where d_amb is the ambiguity distance of the CW-ToF camera and d is any distance. The ambiguity distance d_amb is calculated with eq. 2.10, where c is the speed of light.

d_{amb} = \frac{c}{2 f_{mod}}    (2.10)

The maximum operating range can be increased either by using a lower modulation frequency, which decreases the accuracy of the CW-ToF camera, or by using multiple different modulation frequencies [46]. When using multiple modulation frequencies, the object's true location is where the different individual frequencies agree, as illustrated in figure 2.9. This new ambiguity frequency is called the beat frequency.
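A short worked example of eq. 2.10 (values assumed, not from the thesis): the ambiguity distances for two common modulation frequencies and the extended range given by their beat frequency.

```python
C = 3.0e8  # speed of light, m/s

def ambiguity_distance(f_mod):
    return C / (2.0 * f_mod)  # eq. 2.10

print(ambiguity_distance(80e6))          # 80 MHz  -> 1.875 m
print(ambiguity_distance(60e6))          # 60 MHz  -> 2.5 m
print(ambiguity_distance(80e6 - 60e6))   # 20 MHz beat frequency -> 7.5 m unambiguous range
```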


Figure 2.9. The unambiguous measurement distance can be extended by using multiple modulation frequencies [39].

As in the PB-ToF case, taking more sequential samples for one measurement improves the measurement precision, decreasing the variance of the data. The period of time used to collect samples in the pixels for one depth frame is the integration time of the ToF camera [48, 49]. The user of a ToF camera can usually choose the integration time; some ToF cameras support automatically adjusting the integration time and even using multiple integration times [48, 50].

2.4.3 Types of Time-of-Flight sensors

Scanning LiDARs are based on the PB-ToF principle [46]. They contain one or multiple individual laser emitters (the rotating type typically has from 16 to 64 [51]) and dedicated photodetectors, each of which measures a single point in the scene. The emitter-detector pairs scan the scene using one or more mirrors that are rotated or actuated in some other way. In contrast, CW-ToF cameras illuminate the whole scene with modulated light and the imaging pixel array measures the round-trip time for each pixel individually. The electronic requirements of pulsed light sensors are more demanding, as individual photons need to be captured with a very precise time-of-arrival resolution. As mentioned, very fast electronics like Single Photon Avalanche Diodes (SPADs), also known as Geiger-mode Avalanche Photo Diodes (GAPDs), have to be used [43]. In comparison, the requirements for CW modulated sensors are relaxed, and most consumer-priced ToF depth cameras are CW-ToF cameras [44].

Another type of PB-ToF depth camera exists, often called the 3D Flash LiDAR. 3D Flash LiDARs flood-illuminate the whole scene at once, which is then measured by SPADs [46]. 3D Flash LiDARs were first developed and commercialized by a California-based company, Advanced Scientific Concepts, Inc. (ASC), which was purchased by Continental Automotive Group in 2016 [52]. A company named ASC3D picked up where ASC left off and continued LiDAR development and production for the space, military, manned airborne and bathymetric markets, whereas Continental AG's focus is on LiDARs for the automotive industry [53]. As the technology has matured, new companies have recently announced upcoming Flash LiDAR products [54, 55].

2.5 Noise in depth images

Noise in depth images manifests itself as invalid depth values, wrong depth values or temporal noise [56].

• By invalid depth values this thesis refers to a situation where the depth estimation has completely failed. This can be due to the target being too close, too far or due to the surface material properties of the target. Often the pixels for which depth could not be estimated are given the value zero in the depth image.

• By wrong depth values this thesis refers to the camera succeeding in estimating a valid depth value for a point that does not, however, correspond to the real distance of the point in the scene.

• By temporal noise this thesis refers to pixels in the depth image that change their value over time even though there is no change in the scene. The pixel value can flicker between a set of valid values or between an invalid value and a valid value.

These error types can arise for various reasons, and the same source of noise can result in multiple types of error. The sources of error are strongly related to the working principle of the depth camera and the surface properties of the materials in the scene.

2.5.1 Effect of surface optical properties

Optical properties of the scene's surfaces, like reflectance, absorbance and refraction, might have a strong influence on the resulting depth image depending on the depth camera's working principle.

Reflectance measures the ratio of the light flux returning from a surface to the total light flux incident upon the surface [57]. A surface is often modelled either as a specular or a Lambertian (diffusive, matte) reflector [58, 59], although in reality surfaces are combinations of both models rather than strictly one or the other. When a light ray hits a perfect specular reflector it is reflected away from the surface as a single ray at the same angle as it arrived [58]. On the contrary, when a light ray hits a perfect Lambertian reflector it diffuses in all directions away from the surface with varying intensity according to Lambert's cosine law, such that the reflecting area has the same apparent brightness regardless of the viewing angle [59]. See the reflection patterns of both models in figure 2.10. As will be shown in chapter 4, specular reflectors cause problems with both stereo cameras and ToF cameras.


Figure 2.10. Specular and Lambertian reflection patterns on a non-absorbing surface, the arrow length represents luminous intensity.

Absorbanceis the material’s property to transform the energy of electromagnetic radi- ation to other forms of energy (like thermal) instead of reflecting it back to the environ- ment or transmitted through the material reducing the wave’s intensity in the process [60].

Surfaces with a high absorbance factor cause problems with ToF measurements as the intensity of the returning signal’s low.

Refraction affects the depth images when there are see-through materials in the scene, e.g. when only one of the stereo camera's imagers looks through a glass window or the light pulse of a LiDAR travels through water. Wrong depth values are induced in stereo depth images if only one imager is covered by e.g. a glass window, as the scene is shifted unequally in the image pair due to refraction. When the emitted pulse of a ToF camera travels through any medium other than air, the resulting depth measurement is most probably wrong, as light travels at different speeds in different mediums.

2.5.2 Sources of error in stereo depth images

The depth resolution of a depth camera refers to the accuracy with which it can distinguish differences in the depth direction. The depth data produced by stereo cameras is not linearly quantized, because the depth value is calculated using pixel disparity, and one pixel of disparity corresponds to a longer real distance the farther away the scene is. In theory, the depth resolution of stereo depth images decreases quadratically with the distance to the target [61], as illustrated in fig. 2.11.


Figure 2.11. Illustration of the decrease in depth resolution with increasing distance Z using a stereo camera with baseline b, focal length f and vertical difference between pixels r (individual camera resolution), where R is the minimum distinguishable depth difference [61].

For the same reason that the depth resolution decreases, the disparity error grows quadratically with the distance to the scene, as a matching error of one pixel when creating the disparity map corresponds to a larger depth error the farther away the scene is [39]. The disparity error follows eq. 2.11, which is the result of taking the derivative of eq. 2.3 with respect to the disparity. Eq. 2.11 expresses the resulting difference in depth δZ for a disparity deviation of δd. Commonly used correspondence algorithms achieve sub-pixel disparity errors [39].

\frac{\delta Z}{\delta d} = \frac{Bf}{d^2}, \qquad \delta Z = \frac{Z^2}{Bf}\,\delta d    (2.11)

As can be seen from eq. 2.11, a wider baseline B reduces the disparity error. This results in less overall deviation in the depth data and a longer maximum operating distance before the depth information is overwhelmed by noise [62]. On the other hand, widening the baseline (without changing the viewing direction of the cameras) also moves the minimum end of the operating range farther away, since the individual imagers' fields of view (FOV) need to overlap on a point to estimate depth for that point, as seen in figure 2.11. The individual camera resolution combined with the FOV also has an effect on the disparity error, as a higher resolution enables the disparities of far-away features to be distinguished more precisely, although higher resolution in the raw images also increases the computational cost [62].
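A small sketch of eq. 2.11 (not from the thesis), showing how a fixed sub-pixel matching error maps to a growing depth error with distance; the baseline, focal length and 0.25 px error are assumed values.

```python
# Depth error from a disparity error, dZ = Z^2/(B*f) * dd  (eq. 2.11).
f_px = 640.0        # focal length in pixels (assumed)
baseline_m = 0.055  # baseline in metres (assumed)
dd_px = 0.25        # sub-pixel disparity error (assumed)

for z in (1.0, 2.0, 4.0, 8.0):
    dz = z**2 / (baseline_m * f_px) * dd_px
    print(f"Z = {z:.0f} m -> depth error ~ {dz*1000:.0f} mm")
# ~7, 28, 114, 455 mm: the error grows quadratically with distance.
```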

Since commercial stereo cameras consist of two or more CMOS [37, 38] or CCD [63] digital image sensors, the depth images produced by stereo cameras are affected by the same noise sources as conventional digital images, like photon shot noise, dark current shot noise, readout noise sources, fixed pattern noise sources and thermal noise [63, 64]. Since the basic assumption in stereo matching is that the corresponding pixels should be similar in the left and right images, noise in the individual images causes matching errors or an inability to find matching points while creating the disparity map, which leads to wrong and invalid depth values in the depth image [63]. The magnitude of the wrong depth values is determined by equation 2.11.

Lack of illumination causes problems for passive stereo cameras. When there is no light in the scene, no features can be extracted from the raw images and no valid depth can be estimated. The change is gradual, as better illumination makes more subtle features visible. In addition to producing more valid depth estimates, well-lit images also yield more accurate ones [65]. When more light is available during the exposure time, the images have less noise and a better SNR, because some of the image sensor noise sources, like the dark current noise and the readout noise sources, are proportionally more severe in low-light situations, as they are not dependent on the amount of light in the image [64]. The dark current noise refers to the electrons that are generated in the pixel due to thermal effects even without any incident photons. The readout noise sources are independent of the image and the exposure time, and they are caused by the amplification electronics during pixel readout. Examples of stereo depth camera performance in different illuminations are presented in sections 4.1, 4.2 and 5.3.

Shadows due to occlusion occur when the line of sight of one of the cameras to a point in the scene is blocked, which results in the inability to estimate depth at that point [34], e.g. points P_o in figure 2.12. Occluded areas appear as "shadows" of invalid depth pixels in depth images.

Figure 2.12. Examples of occlusion in stereo depth images. Depth can be estimated at points P_v, but not at points P_o because they are occluded in the right camera. [34]

Reflections are viewpoint dependent and cause errors in depth estimation [66]. To a stereo camera, the viewpoint-dependent difference between bright reflection highlights in the images looks indistinguishable from the disparity of real features. The problems are most relevant with reflective materials, as nearby light sources cause reflection spots on the surface. Very reflective materials might even reflect the surrounding scene like a mirror. Examples of errors due to reflection are presented in sections 4.1 and 4.2. Stereo cameras that see IR light, like most active stereo cameras, are affected by reflections in the IR region as well.

Repetitive patterns may cause erroneous depth measurements, as they might cause the stereo matching algorithm to find false-positive matches [33].

2.5.3 Sources of error in Time-of-Flight depth images

As photons are detected only with a certain probability, photon shot noise affects all measurements with light, including ToF measurements [67, 68]. The number of photons detected by any observer follows a Poisson distribution, so the standard deviation of the shot noise is the square root of the number of photons detected [45]. Therefore the variance of a depth pixel in a ToF camera is inversely related to the signal amplitude it captures [68]. Photon shot noise cannot be reduced by signal processing methods and is the theoretical limit of ToF measurement accuracy [45, 68].

The precision of CW-ToF depth measurements is linearly dependent on the precision of the phase estimation, as eq. 2.6 states. The precision of the phase calculation depends on the returning signal's modulation amplitude due to photon shot noise [45, 69, 70]. The modulation amplitude is inversely proportional to the squared distance, as the emitted light attenuates according to the inverse square law [49]. The amplitude of the returning signal is also affected by the material properties and the orientation of the reflecting surface. Disregarding the effect of quantization, which takes place in the analog-to-digital conversion of the IR signal, the probability distribution of the phase estimation follows an offset normal distribution [69]. The distribution widens with lower amplitudes and converges to a uniform distribution when the amplitude of the signal approaches zero, as illustrated in figure 2.13. [39, 71] show experimentally that the variance of commonly used CW-ToF cameras' depth measurements seems to increase approximately linearly as the distance grows.

The chosen integration time of a CW-ToF camera and the number of pulses per measurement of a PB-ToF camera have an effect on the accuracy of ToF measurements [44, 47, 49, 67]. Increasing the integration time or the number of pulses generally improves the precision of the measurement. However, motion of either the camera or the scene causes motion blur in the depth image, which is more severe with a longer integration time and a larger number of pulses [44, 72]. Too long an integration time when recording reflective or nearby objects results in over-saturation (overexposure) and leads to invalid pixels, at least in CW-ToF cameras [47, 69].


Figure 2.13. The probability distribution of the phase estimation as a function of the modulation amplitude. The distribution follows an offset normal distribution that converges to a uniform distribution as the amplitude approaches zero. [69]

Systematic errors occur in the depth data when the models used by the ToF camera, like the modulation and correlation functions, do not represent reality completely accurately. A systematic "wiggling" error is common in CW-ToF cameras, where the measured distance wiggles around the ground truth (see fig. 2.14) because the generated amplitude-modulated signal contains higher-order harmonics that are not modelled [48, 67, 73]. PB-ToF cameras also suffer from systematic errors due to inaccuracies between the generated signals and the modelled ones. For example, the distance calculation presented in eq. 2.5 and fig. 2.6 is based on the assumption that the emitted signal is perfectly rectangular [44].


Figure 2.14. Systematic "wiggling" error common with CW-ToF cameras. Left: Modu- lation of the light source, intensity I plotted over time t for one oscillating period. Right:

Mean depth deviation of the simulated and measured distance from the real distance, plotted over the real distance. Images from [74].

Phase-wrapping occurs when the ambiguity distance of a CW-ToF camera, calculated with eq. 2.10, is exceeded. Any parts of the scene farther away than d_amb appear closer than they actually are, as the depth camera chooses the closest possible distance for the ambiguous signals. For example, a CW-ToF camera with d_amb = 5 m imaging a point 6 meters away registers that point at a distance of 1 meter in the depth image.

Multipath interference (MPI) happens when the emitted signal is reflected from multiple points in or outside of the scene and these reflections contribute to the measurement of a single point [75]. A simplified case is illustrated in figure 2.15, where the returning signal r_1 is affected by scattered light from the emitted signals e_2 and e_3. In chapter 4 a noticeable case of MPI is encountered when imaging a concave corner object made out of reflective material. Multiple compensation methods for MPI have been proposed [75–77]; more recent ones are based on deep learning methods [78, 79].


Figure 2.15. Omitting background illumination, without MPI the returning ray r_1 should only contain light originating from the emitted signal e_1. However, MPI affects the measurements such that scattered light s_1 and s_2 from the emitted signals e_2 and e_3 also contribute to the returning signal r_1.

Flying pixels due to depth inhomogeneity refer to a situation where a pixel in the depth image falls on a depth boundary between a foreground and a background in the scene [67] and the pixel ends up collecting reflected light from both areas. Lefloch et al. [67] suggest that it is the raw measurements of a single depth frame (m_0–m_3 in fig. 2.7) that fall on different depth levels. A larger pixel size increases the incidence of flying pixels. When the raw samples of a single measurement cycle fall on different depth levels, the flying pixel's depth value is not restricted to lie between the distances to the fore- and background, due to the non-linear relationship between the phase calculation and the raw samples (eq. 2.7) [67]. The problem of flying pixels at object edges is related to the MPI problem, as here too the light is collected from multiple paths in one sensor pixel.

Intensity-related error or amplitude-related error is a systematic error that occurs in some CW-ToF cameras. In signals that return from absorbing or deflecting surfaces, the (de)modulation amplitude of the signal can be low. In theory, a lower returning signal amplitude should only increase the variance of the data, not move its mean, but for some reason a bias is systematically introduced to depth calculations from low-amplitude signals [44, 67, 70, 80–83]. Lindner and Fuchs [82, 83] both suggest in their dissertations that the error is caused by unmodeled nonlinear characteristics in the pixel electronics that are most prominent at the limits of the operating range. Lefloch et al. [67] suggest that the underlying reason might be the light scattering error, which on its own is a known error source in ToF cameras [84]. The light scattering error is caused by multiple reflections between the camera lens and the image sensor, and Lefloch et al. suggest that pixels with low-intensity signals are affected relatively more. The intensity-related error is not apparent in all CW-ToF cameras, as can later be seen in section 4.3 of this thesis.

It is interesting to note that in most references about the intensity-related error where a commercially available CW-ToF camera is used, the cameras have ToF chips developed by PMD Technologies [67, 80, 81]. The intensity-related bias is analyzed in section 4.3 focusing on the IFM O3D303 camera, which also uses a ToF chip by PMD Technologies [50].

Sensor interference occurs when multiple active sensors point at the same scene. ToF cameras often use the same or similar wavelengths of light [50, 85, 86], and when pointed at the same scene the emitted signals might interfere with each other. Other active sensors, like active stereo cameras, might also cause interference in ToF cameras.

Error caused by component temperature. The warm-up time affects the variance and the absolute depth error of ToF camera measurements. As the sensor warms up, the variance and absolute depth error might increase or decrease; stable values are reached when the electronics have warmed up after being powered for some time [44, 48, 67]. This should be taken into account when gathering experimental measurements.

Dusty environments. It was noticed during the thesis work that the depth measurements of some ToF-cameras were quite sensitive to dust particles in the scene. Examples and more discussion can be found in chapter 4.4.

Motion artifacts are present at object edges due to sequential sampling when imaging a non-static scene. The severity depends on the frequency and the number of samples per measurement [44, 67, 72]. Some post-processing solutions exist, as addressed by Gottfried et al. in [72]. Although this thesis focuses on analysing static scenes, this might be a relevant source of error as the target application is mobile.

Strong ambient light reduces the signal-to-noise ratio of the emitted signals and degrades the depth images [49]. For example, bright sunlight or intense artificial light might create challenges for ToF measurements, as can be seen during the sensor evaluation of this thesis.


3 EVALUATED DEPTH CAMERAS AND EVALUATION METHODOLOGY

The chosen evaluation methods aim to assess the used sensors and to find conditions where the tested devices might fail to produce the desired results. The sensor evaluation is split into two parts: qualitative evaluation and quantitative evaluation. Not every aspect of the depth cameras is practical to measure quantitatively, so the qualitative evaluation tries to find the weak spots of each technology more generally [87].

3.1 Evaluation criteria

This thesis suggests the following metrics as criteria for depth camera evaluation, for the reasons given with each metric.

Absolute depth error, or trueness of the measurements. The ISO standard on the accuracy of measurements, ISO 5725-1, describes trueness as follows: "Trueness refers to the closeness of agreement between the arithmetic mean of a large number of test results and the true or accepted reference value." [88]. When considering the target application, pallet manipulation with a forklift, the detected data must be as close as possible to the ground truth; otherwise it might be difficult to localize the pallet for manipulation. The tolerance for a forklift's forks to correctly fit in the EUR-pallet's hole pockets [4] is only a few centimeters, varying somewhat with the fork size used.

Precision, or the variance of the depth measurements. ISO 5725-1 describes precision as follows: "Precision refers to the closeness of agreement between test results" [88]. Of course, some object detection algorithms are more robust against noise in the data than others, but sensor noise is a considerable difficulty in object detection and pose estimation from depth data [89, 90].
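For concreteness, the following sketch shows how trueness (bias) and precision (spread) in the ISO 5725-1 sense can be computed from repeated depth measurements of a target at a known reference distance. It is only an illustration of the two definitions; the actual measurement procedure of this thesis is described in section 3.3.

```python
import numpy as np

def trueness_and_precision(measurements, reference):
    """Trueness as the difference between the mean of the measurements and the
    reference value, precision as the sample standard deviation (ISO 5725-1)."""
    m = np.asarray(measurements, dtype=float)
    bias = m.mean() - reference      # absolute depth error (trueness)
    spread = m.std(ddof=1)           # precision
    return bias, spread

# Example: 100 simulated readings of a flat target at 2.0 m (1 cm bias, 3 mm noise)
readings = 2.0 + 0.01 + 0.003 * np.random.randn(100)
print(trueness_and_precision(readings, 2.0))
```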

Point cloud density. The pixels-per-degree resolution is defined by the depth image resolution divided by the field-of-view (FOV) angle of the camera. The FOV angle is constant, but the FOV corresponds to a larger area the farther away the scene is from the camera. Combining the pixels-per-degree resolution with the distance to the scene makes up the point cloud density. The point cloud density of an imaged object directly affects the range an object can be detected from, because an object detection algorithm is bound to have some lower limit for point cloud density below which it fails to detect the object.
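This relationship can be made concrete with a short calculation: the sketch below derives the pixels-per-degree resolution and an approximate point density on a fronto-parallel plane at a given distance, using manufacturer values from table 3.1. It ignores lens distortion and is meant only to illustrate how density falls off with distance.

```python
import math

def point_cloud_density(res_px, fov_deg, distance_m):
    """Pixels-per-degree resolution and approximate point density (points/m^2)
    on a fronto-parallel plane at the given distance. Lens distortion ignored."""
    px_per_deg = (res_px[0] / fov_deg[0], res_px[1] / fov_deg[1])
    # Size of the imaged plane at the given distance
    width = 2.0 * distance_m * math.tan(math.radians(fov_deg[0]) / 2.0)
    height = 2.0 * distance_m * math.tan(math.radians(fov_deg[1]) / 2.0)
    density = (res_px[0] * res_px[1]) / (width * height)
    return px_per_deg, density

# Realsense D415 (1280 x 720 px over 65 x 40 deg, table 3.1) at 3 m:
print(point_cloud_density((1280, 720), (65, 40), 3.0))
```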


Tolerance for different materials, shapes and illumination conditions. As can be seen later in this thesis, different materials, shapes and illumination conditions can cause invalid or wrong depth data. As it is desirable to be able to work in a wide range of working environments, a robust depth camera is preferable.

Edge fidelity. Good edge fidelity means that objects are distinct from their background and that the object edges are clear and crisp [87, 91]. Unclear edges might cause difficulties for object detection when the fore- and background boundaries are not distinct. For example, a pose estimation algorithm proposed in [92] uses a pre-processing method to remove the background from depth images. Fore- and background boundaries are identified by large depth gradients, which might fail if the boundaries between fore- and background are not clear, like in the ZED camera depth images discussed in chapter 4.1.
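To illustrate why edge fidelity matters for such pre-processing, the sketch below marks depth-discontinuity pixels with a simple gradient threshold, in the spirit of the background-removal step referenced from [92] (the exact method of [92] is not reproduced here). If the camera smears an object edge over several pixels, the per-pixel gradient stays below the threshold and the boundary is missed.

```python
import numpy as np

def depth_boundary_mask(depth, grad_thresh=0.10):
    """Mark pixels lying on large depth discontinuities. The threshold is an
    illustrative value in metres of depth change per pixel."""
    dz_dy, dz_dx = np.gradient(depth.astype(np.float32))
    grad_mag = np.hypot(dz_dx, dz_dy)
    return grad_mag > grad_thresh
```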

3.2 Evaluated depth cameras

A set of depth cameras was chosen for evaluation from off-the-shelf commercial products, to represent the commonly used standard of the industry today. The details of the cameras and the used settings are collected in table 3.1.

Table 3.1. The tested cameras with the used settings. Values are manufacturer provided.

| Camera | Depth resolution (H x V, px) | Frame rate (fps) | FOV (H x V, deg) | Operating range | Other streams | Power consum. |
|---|---|---|---|---|---|---|
| ZED | 2208 x 1242 / 1920 x 1080 | 15 / 30 | 90 x 60 | 0,3m - 25m | left RGB, right RGB | 1,9W |
| D435i | 1280 x 720 | 30 | 86 x 57 | 0,280m - 10m | left BW+IR, right BW+IR, RGB, IMU | 3,5W (D4 Vision Processor) |
| D415 | 1280 x 720 | 30 | 65 x 40 | 0,450m - 10m | left BW+IR, right BW+IR, RGB | 3,5W (D4 Vision Processor) |
| IFM O3D303 | 352 x 265 | 7 | 60 x 45 | 0,3m - 8m, max. 30m | Modulation amplitude | 10W |
| Azure Kinect WFOV | 1024 x 1024 | 15 | 120 x 120 | 0,25m - 2,21m | IR, RGB, IMU | 5,9W |
| Azure Kinect NFOV | 640 x 576 | 30 | 75 x 65 | 0,5m - 3,86m | IR, RGB, IMU | 5,9W |
| L515 | 1024 x 768 | 30 | 70 x 43 | 0,25m - 6,5m | IR, RGB, IMU | 3,3W |


Stereolabs ZED camera [38] is a passive stereo camera. It can output very dense depth images with 2,2k resolution at 15 fps. Combining the 2,2k resolution with its FOV results in the best pixels-per-degree resolution of the evaluated cameras, 24,5 x 20,7 pixels/degree. This larger resolution was used in the qualitative tests; in the quantitative tests the resolution of 1920x1080 was used.

Intel Realsense D400 series [37] active stereo cameras D415 and D435i both output 1280 x 720 depth images at 30 fps (maximum of 90 fps with lower resolutions). In this thesis, the D435i is used in the qualitative tests and D415 in the quantitative tests. The left and right imagers of the D400 cameras provide monochrome images that record all light in the visible spectrum and IR light at least up to 1000nm wavelength [37]. The D400 cameras emit IR light to the scene to support stereo matching in low light situations.

The D435i has a larger FOV and a slightly narrower baseline. Also, its left and right monochrome + IR imagers have a resolution of 1280 x 800 on their own, while the D415 left and right imagers have a resolution of 1920 x 1080. The wider FOV, lower individual imager resolution and narrower baseline of the D435i result in it having at least two times more depth noise than the D415 [93], due to the reasons discussed in chapter 2.5.2. This and the D435i qualitative test results led to switching to the D415 for the quantitative tests. During the qualitative tests it is noticed that the D435i has severe issues with difficult lighting conditions, against which the D415 is a lot more robust. The wider baseline of the D415 means losing depth data in the near operating range, because the D415 imager FOVs do not overlap at near distances, but the trade-off of less depth noise and better performance in the far end of the operating range is probably worth it considering our target application. Also, the narrower FOV of the D415 means a larger pixels-per-degree resolution and a denser point cloud. However, it must be noted that the D435i imagers have global shutters, which work better when there is motion in the scene as they do not suffer from motion blur, while the rolling shutter imagers of the D415 do. This might turn out to be a problem at some point if the D415 is chosen as the camera for object detection algorithm development.
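As a rough illustration of the reasoning referenced from chapter 2.5.2, the random depth error of a stereo pair grows as dz ≈ z²·e_d / (f·b), where e_d is the disparity error in pixels, f the focal length in pixels and b the baseline. The sketch below evaluates this relation with placeholder values (approximate FOVs from table 3.1 and assumed baselines); it is not the exact calculation behind the manufacturer's figure quoted above.

```python
import math

def stereo_depth_error(z, hfov_deg, width_px, baseline_m, disparity_err_px=0.1):
    """Approximate RMS depth error of a stereo camera at range z using
    dz = z^2 * e_d / (f * b). All example values are placeholders."""
    f_px = width_px / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
    return (z ** 2) * disparity_err_px / (f_px * baseline_m)

# Illustrative comparison at 3 m with assumed ~50 mm / ~55 mm baselines:
print(stereo_depth_error(3.0, 86, 1280, 0.050))  # D435i-like geometry
print(stereo_depth_error(3.0, 65, 1280, 0.055))  # D415-like geometry
```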

IFM O3D303 [50] is a CW-ToF camera that uses a ToF chip by PMD Technologies. Combining the 352x265 resolution with the FOV, the IFM has the lowest pixels-per-degree resolution of the tested depth cameras with 5,87 x 5,87 pixels/deg. The operating range is from 0,3m to 8m, after which the point cloud density becomes very low; individual measurements are possible up to 30m. When analysing the IFM O3D303 it is important to note that, in addition to the euclidean point cloud data, it outputs radial distance images, where each pixel has the radial distance Drad as its value, see eq. 2.1. The IFM frame rate depends on which resolution, how many modulation frequencies, how many integration times (exposure time in the IFM documentation) and which post-processing filters are used. The manufacturer claims a maximum of 25 fps [50], but with the used settings of 352x265 resolution, 2 modulation frequencies and 2 integration times, the actual frame rate was around 6-7 fps. The effect of multiple modulation frequencies and exposure times on the IFM depth data will be discussed more in chapter 4.3. Aside from depth data, the O3D303 outputs amplitude images, where each pixel denotes the modulation amplitude of the returning signal, A in eq. 2.8 [94].
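Since eq. 2.1 relates the radial distance to the Cartesian coordinates, a minimal sketch of that conversion is given below, assuming an ideal pinhole model with known intrinsics (fx, fy, cx, cy) and ignoring lens distortion; it is not the camera's own conversion routine.

```python
import numpy as np

def radial_to_point_cloud(d_rad, fx, fy, cx, cy):
    """Convert a radial distance image (each pixel = distance along its viewing
    ray) to a Cartesian point cloud with an ideal pinhole model."""
    d_rad = np.asarray(d_rad, dtype=np.float64)
    h, w = d_rad.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Unit viewing-ray direction for every pixel
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones((h, w))], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Scale each unit ray by the measured radial distance
    return rays * d_rad[..., None]
```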

Azure Kinect Developer Kit (DK) [95] is also a CW-ToF camera. It is the successor to the popular Kinect and Kinect v2 depth cameras. It offers two different fields-of-view, a wide FOV (WFOV) and a narrow FOV (NFOV), and a great maximum resolution for a ToF camera with its 1024x1024 pixels, which is used in the qualitative tests with the WFOV. However, the maximum operating range is shorter with the WFOV, so the NFOV with a resolution of 640x576 is used in the quantitative tests. Both FOVs have the same pixels-per-degree resolution of 8,53 x 8,53 px/deg, so the point cloud density is unaffected. The manufacturer claims an operating range of 0,5 - 3,86m for the NFOV, but during the tests it becomes apparent that the maximum operating distance is far longer, at least in the used conditions.

Intel Realsense L515 LiDAR [86] is a solid-state LiDAR camera and makes use of the PB-ToF principle. It produces the densest point clouds out of the ToF cameras with its pixels-per-degree resolution of 14,63 x 17,86 px/deg. It includes an IR laser emitter, a microscopic MEMS mirror which scans the scene in the horizontal and vertical directions, reflecting the emitted signal over the entire scene, and a single avalanche photodiode which receives the returning light pulses [96]. The working principle is similar to that of conventional rotating LiDARs, but the resolution of 1024x768 is more suitable for object detection: conventional rotating LiDARs typically have 16 to 64 vertically stacked IR emitter and photodiode pairs [51] and a mirror rotating mechanically around an axis, creating a point cloud that is horizontally complete (360 deg) but only a few rows of points high vertically.

3.3 Test methods

The rest of this chapter will introduce the experimental methodology, first for the qualita- tive tests and then for the quantitative tests. Table 3.2 provides a summary of the tests.
