
Depth camera research and development has truly accelerated during the last decade. A large portion of the papers in the literature from the first half of the decade is linked to the tremendously popular Microsoft Kinect structured-light depth camera [24]. The original Kinect offered good performance for its time combined with reasonable pricing and wide-scale availability, and since then more accurate affordable depth cameras have been developed [24]. The advancements in depth camera technology have also contributed to the development of machine learning on 3D data, as realistic 3D data has become more widely available at the same time as deep learning methods for conventional 2D images have matured [25].

In [20] a model-based approach for object detection and pose estimation of euro pallets was developed. In the paper, the standardized shape of the pallets [4] is used as a geometrical constraint on point clouds generated with a depth camera. The approach successfully detected euro pallets lying on the ground level in real time using depth data gathered with a Kinect v2 time-of-flight depth camera. The pallets were assumed to lie flat on the ground level, so the resulting pose estimate consists of the x and y coordinates and the yaw angle around the vertical axis. In their approach the front surface of the pallet was detected by first removing the ground plane from the point cloud to reduce computations, then detecting vertical surfaces from the point cloud using region growing, and lastly finding the three blocks of the pallet front face as vertical surfaces whose mutual distances are defined by the standardized pallet dimensions.
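As an illustration of the first step of such a pipeline, the sketch below removes the dominant ground plane from a point cloud using RANSAC plane segmentation from the Open3D library. This is not the implementation of [20], which used region growing for surface detection; the file name and thresholds are placeholders.

```python
import open3d as o3d

# Illustrative sketch (not the pipeline of [20]): remove the dominant ground
# plane with RANSAC before searching the remaining points for the vertical
# surfaces of the pallet front face. "scene.pcd" and the thresholds are
# placeholder values.
pcd = o3d.io.read_point_cloud("scene.pcd")

plane_model, inliers = pcd.segment_plane(distance_threshold=0.02,
                                         ransac_n=3,
                                         num_iterations=1000)
remaining = pcd.select_by_index(inliers, invert=True)

# plane_model holds [a, b, c, d] of the plane ax + by + cz + d = 0.
print("Detected ground plane:", plane_model)
o3d.visualization.draw_geometries([remaining])
```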

A similar approach is taken in [21], where first all planar segments are detected from the point cloud using a region growing algorithm. From the set of planar segments, all non-vertical segments and segments whose width differs significantly from that of the used pallets are filtered out. Pallets are detected from the remaining segments with sliding-window template matching, and different types of pallets are detected by using multiple templates. The pose of the pallet is determined by the middle point and the normal of the recognized patch.

Papers that address pallet detection and/or pose estimation from depth data using model-based methods can be found in the literature [20–22], but only a few papers focus on data-intensive learning approaches with depth data [5]. However, in [5] an architecture using DL methods for pallet detection, localisation and tracking with 2D laser rangefinder data was developed. The approach detects pallets using a Faster R-CNN network coupled with an R-CNN-based classifier, and localises and tracks the pallets with a Kalman filter based module. The pallets are localised with bounding boxes from the 2D laser scans. The 2D laser rangefinder provides range data in polar coordinates around the sensor, which is converted into binary images of the surroundings. The data set used for training and evaluation, containing range scans of 340 pallets, is publicly available [26].
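As a rough sketch of the preprocessing step described above (the conversion itself, not the detection network), the function below converts a polar 2D laser scan into a binary image. The angular range, grid resolution and image size are illustrative assumptions, not values taken from [5].

```python
import numpy as np

def scan_to_binary_image(ranges, angle_min=-np.pi / 2, angle_max=np.pi / 2,
                         resolution=0.02, size=256):
    """Convert a polar 2D laser scan (ranges in meters) into a binary image."""
    angles = np.linspace(angle_min, angle_max, len(ranges))
    xs = ranges * np.cos(angles)          # forward direction of the sensor
    ys = ranges * np.sin(angles)          # lateral direction

    image = np.zeros((size, size), dtype=np.uint8)
    cols = (xs / resolution).astype(int)
    rows = (ys / resolution).astype(int) + size // 2   # sensor at the centre row
    valid = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    image[rows[valid], cols[valid]] = 1
    return image
```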

Development of new deep learning methods for depth data and the adaptation of existing ones is a very active field of research [27]. Considering the maturity of 2D CNN architectures and their effectiveness with conventional RGB images, the most straightforward way to utilize depth data in machine learning approaches is to handle the data as RGB-D images, stacking the depth values as a fourth channel onto an existing CNN architecture [27]. This allows the maturity of 2D deep learning architectures and large RGB data sets to be exploited.
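A minimal sketch of this idea, assuming PyTorch and a torchvision ResNet-18 (neither is prescribed by [27]): the pretrained first convolution is widened from three to four input channels so that a depth channel can be stacked onto the RGB image.

```python
import torch
import torch.nn as nn
import torchvision

# Assumed setup: a pretrained 2D CNN whose first layer is widened for RGB-D.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

old_conv = model.conv1                       # original 3-channel input layer
new_conv = nn.Conv2d(4, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)

with torch.no_grad():
    # Keep the pretrained RGB filters and initialise the depth channel with
    # their mean, a common heuristic for an added fourth channel.
    new_conv.weight[:, :3] = old_conv.weight
    new_conv.weight[:, 3] = old_conv.weight.mean(dim=1)

model.conv1 = new_conv

rgbd_batch = torch.randn(1, 4, 224, 224)     # 3 color channels + 1 depth channel
logits = model(rgbd_batch)
```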

Examples of deep learning architectures designed from the ground up specifically for 3D data are PointNet++ [28] for object classification and segmentation from point clouds, and VoxelNet [29], which first discretizes point clouds into voxel grids for object detection.

Griffiths and Boehm [27] provide a good overview of deep learning methods for object classification, detection and segmentation with depth data, and Sahin et al. [30] provide an extensive and in-depth review of the current state-of-the-art deep learning methods for object pose estimation. The detection and pose estimation of pallets and other objects is a field of study in its own right and will not be addressed further in this thesis; instead, the rest of the thesis will focus on the topic of depth cameras.

2.2 Depth camera types and depth data

Depth cameras can be divided into four classes based on their working principle: passive stereo cameras, active stereo cameras, structured-light cameras and Time-of-Flight (ToF) cameras. Stereo cameras and structured-light cameras estimate the depth of the scene by triangulating the depth of a point when the camera system's dimensions are known. ToF cameras measure the time a light signal takes to travel from the camera to the scene and back; the distance is then calculated using the speed of light. See figure 2.1 for an illustration of the different technologies.

Figure 2.1. Most common depth camera technologies [31]
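As a back-of-the-envelope illustration of these two working principles (the numbers below are made up, not taken from any particular sensor):

```python
# Stereo / structured light: depth by triangulation, assuming the focal
# length and the baseline of the camera system are known.
focal_length_px = 600.0     # focal length in pixels
baseline_m = 0.05           # distance between the two viewpoints in meters
disparity_px = 15.0         # pixel offset of a matched point
depth_stereo = focal_length_px * baseline_m / disparity_px      # 2.0 m

# Time-of-Flight: depth from the round-trip time of a light signal.
speed_of_light = 299_792_458.0    # m/s
round_trip_time = 13.3e-9         # seconds
depth_tof = speed_of_light * round_trip_time / 2.0              # ~2.0 m

print(depth_stereo, depth_tof)
```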

3D data can be presented in many different ways such as depth images, voxel grids, multi-views, meshes and point clouds [25]. This thesis will consider mostly depth images and point clouds as they are the most common data formats outputted by depth cameras.

In depth images, each pixel contains a depth value that is usually the orthogonal distance from the camera focal plane to the scene, $D_{orth}$ in figure 2.2. Assuming the pinhole camera model and knowing the camera's focal length $f$ and optical center location, the orthogonal distance $D_{orth}$ can be calculated from the undistorted depth image's radial distance $D_{rad}$ using equation 2.1, where $x_p$ and $y_p$ are the pixel's distances to the optical center.

$$D_{orth} = D_{rad} \cdot \cos\left(\arctan\left(\frac{\sqrt{x_p^2 + y_p^2}}{f}\right)\right) \quad (2.1)$$

Figure 2.2. Orthogonal distance Dorth and radial distance Drad in the pinhole camera model [32].
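A minimal NumPy sketch of equation 2.1, assuming the focal length and optical center are known in pixels (the function name and parameters are illustrative, not code from any cited work):

```python
import numpy as np

def radial_to_orthogonal(d_rad, f, cx, cy):
    """Convert a radial-distance depth image to orthogonal distances (eq. 2.1)."""
    h, w = d_rad.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    xp = u - cx                  # pixel distance to the optical center in x
    yp = v - cy                  # pixel distance to the optical center in y
    return d_rad * np.cos(np.arctan(np.sqrt(xp**2 + yp**2) / f))
```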

A point cloud is a collection of any number of points, which are data structures that contain coordinate values for each dimension of the 3D Euclidean space. Since the vertical angles $\beta$ are equal in figure 2.2, the coordinate values $X$, $Y$, $Z$ of a point corresponding to a pixel of the depth image can be calculated using equation 2.2.

$$X = D_{orth} \cdot \frac{x_p}{f}, \qquad Y = D_{orth} \cdot \frac{y_p}{f}, \qquad Z = D_{orth} \quad (2.2)$$
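Equation 2.2 can be applied in the same per-pixel fashion; the sketch below (again an illustrative NumPy function, not code from the thesis) back-projects an orthogonal-distance depth image into a point cloud in the camera frame.

```python
import numpy as np

def depth_to_point_cloud(d_orth, f, cx, cy):
    """Back-project an orthogonal-distance depth image to 3D points (eq. 2.2)."""
    h, w = d_orth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    xp, yp = u - cx, v - cy
    X = d_orth * xp / f
    Y = d_orth * yp / f
    Z = d_orth
    points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # drop invalid (zero-depth) pixels
```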

RGB data, which is often readily available if the used sensor is an RGB-D camera, can also be incorporated into the point cloud. The points then become data structures with six values, one for each coordinate and one for each color channel. An accurate calibration between the depth frame and the color frame is required for the color channels to fall on the correct points when creating an RGB point cloud from separate depth and RGB images. Examples of a depth image, a point cloud and a color point cloud are presented in figure 2.3.
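As a hedged example, a color point cloud can be built from registered depth and color images with the Open3D library as sketched below; the file names, depth scale and intrinsic parameters are placeholders.

```python
import open3d as o3d

# Placeholder inputs: a registered depth/color image pair and assumed intrinsics.
color = o3d.io.read_image("color.png")
depth = o3d.io.read_image("depth.png")

rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
    color, depth, depth_scale=1000.0, convert_rgb_to_intensity=False)

# width, height, fx, fy, cx, cy (illustrative values)
intrinsic = o3d.camera.PinholeCameraIntrinsic(640, 480, 600.0, 600.0, 320.0, 240.0)

pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
o3d.visualization.draw_geometries([pcd])
```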

Figure 2.3. An example of a depth image (top), a point cloud (left) and a color point cloud (right) from the same scene. Darker pixels in the depth image indicate parts of the scene that are closer; black pixels indicate invalid depth values. In the point cloud on the left, the different RGB colors represent different depth values; they are drawn by the visualisation program and are not part of the point cloud itself. The color point cloud on the right is fused with the corresponding color image from the same camera.