A passive stereo depth camera takes images of the same scene simultaneously with two or more separate image sensors, which are displaced horizontally from each other.

The problems of calibration, correspondence and triangulation need to be solved to resolve the depth value of a point in the scene from the images. In calibration, the relative positions and orientations of the imagers are found, as well as the internal parameters of each imager: focal length, optical center and lens distortions. The correspondence problem is solved between two images taken of the same scene at the same time: the pixels that represent the same scene point are found in both images. The relative difference in position of the corresponding pixels is called disparity. Finding corresponding pixels might be difficult in regions of homogeneous intensity and color. The full set of disparities found from an image pair is called a disparity map or a disparity image [33, 34].
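As an illustration of the calibration step, the sketch below estimates the intrinsics of both imagers and the rotation and translation between them from chessboard image pairs using OpenCV. It is a minimal sketch: the file-name pattern, board dimensions and square size are assumed placeholders, not values from this thesis.

```python
# A minimal stereo-calibration sketch with OpenCV; file names, board
# size and square size are assumed placeholders.
import glob

import cv2
import numpy as np

BOARD = (9, 6)      # inner chessboard corners (columns, rows), assumed
SQUARE = 0.025      # chessboard square side in metres, assumed

# 3-D corner coordinates in the chessboard's own coordinate frame.
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, pts_l, pts_r = [], [], []
pairs = zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png")))
for lf, rf in pairs:
    gray_l = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gray_r = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(gray_l, BOARD)
    ok_r, corners_r = cv2.findChessboardCorners(gray_r, BOARD)
    if ok_l and ok_r:  # keep only pairs where both views see the board
        obj_pts.append(objp)
        pts_l.append(corners_l)
        pts_r.append(corners_r)

size = gray_l.shape[::-1]  # image size as (width, height)

# Intrinsics (focal length, optical centre, distortion) per imager ...
_, K_l, d_l, _, _ = cv2.calibrateCamera(obj_pts, pts_l, size, None, None)
_, K_r, d_r, _, _ = cv2.calibrateCamera(obj_pts, pts_r, size, None, None)

# ... then the rotation R and translation T between the imagers; the
# length of T is the stereo baseline B.
_, K_l, d_l, K_r, d_r, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, pts_l, pts_r, K_l, d_l, K_r, d_r, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```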

Depth calculated by triangulation from coplanar stereo images is based on the constant ratio of the heights and bases of the similar triangles $O_l P O_r$ and $p_l P p_r$ in figure 2.4. Depth $Z$ can be calculated using eq. 2.3, where $d$ is the disparity, $B$ is the baseline, i.e. the horizontal distance between the imager optical centers, and $f$ is the focal length of the imagers.

Figure 2.4. Triangulation of depth $Z$ from two coplanar pinhole-camera images. Disparity $d = x_l - x_r$ is the difference in position of the same real-world point $P$ in the stereo image pair; pixels $p_l$ and $p_r$ both depict point $P$.

Disparity is the difference in image coordinates (pixels) between the corresponding points, $d = x_l - x_r$. The larger the disparity, the closer the feature, and vice versa.

$$\frac{Z}{B} = \frac{Z - f}{B - d} \quad\Leftrightarrow\quad Z = \frac{fB}{d} \tag{2.3}$$
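As a concrete check of eq. 2.3, the short sketch below computes depth from disparity; the focal length, baseline and disparity values are made up purely for illustration.

```python
# Depth from disparity via eq. 2.3 for a rectified, coplanar stereo pair.
def depth_from_disparity(d: float, f: float, B: float) -> float:
    """Z = f*B/d; d in pixels, f in pixels, B in metres -> Z in metres."""
    if d <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return f * B / d

# Illustrative values: f = 700 px, B = 0.05 m, d = 35 px  ->  Z = 1.0 m.
print(depth_from_disparity(35.0, 700.0, 0.05))
```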

In reality, stereo cameras are rarely coplanar and resemble more the situation depicted in figure 2.5. However, knowing the calibration between the cameras and taking advantage of the epipolar constraint, the image planes can be rectified, after which the distance $Z$ can be calculated with eq. 2.3 as if their epipolar lines were parallel and lay on the same plane [34, 35].
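A hedged rectification sketch along these lines, reusing the calibration results ($K_l$, $d_l$, $K_r$, $d_r$, $R$, $T$ and the image size) from the earlier example, could look as follows with OpenCV; the input image pair is assumed.

```python
# Rectify an arbitrary stereo pair so that its epipolar lines become
# horizontal and aligned; inputs come from the calibration sketch above.
import cv2

R_l, R_r, P_l, P_r, Q, _, _ = cv2.stereoRectify(K_l, d_l, K_r, d_r, size, R, T)
map_lx, map_ly = cv2.initUndistortRectifyMap(K_l, d_l, R_l, P_l, size, cv2.CV_32FC1)
map_rx, map_ry = cv2.initUndistortRectifyMap(K_r, d_r, R_r, P_r, size, cv2.CV_32FC1)

# img_l and img_r are an (assumed) unrectified grayscale pair from the cameras.
rect_l = cv2.remap(img_l, map_lx, map_ly, cv2.INTER_LINEAR)
rect_r = cv2.remap(img_r, map_rx, map_ry, cv2.INTER_LINEAR)
```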

Figure 2.5. Arbitrary geometrical configuration between the two images, resembling the real-world situation. $x$ is found from the epipolar line $l$ due to epipolar geometry. [36]

Correspondence matching methods try to find pixels that depict the same real-world point in both images. It is a computationally intensive task, processed either on-chip on the camera [37] or on a separate host computer [38]. There are multiple approaches to correspondence matching [34], but they can roughly be divided into two categories: feature-based matching and correlation-based matching [39]. Feature-based matching is based on finding clear geometrical elements in the scene, such as lines, corners and curves. This is difficult for homogeneous scenes and results in quite a sparse depth map if the scene does not abound with good features, but the advantages of the method are its robustness against intensity differences and its lower computational demand [39]. An example of correlation-based matching is taking a fixed-size window around a point $P$ in one image and finding a window with matching values around the corresponding point in the other image. This is computationally more demanding than feature-based matching and sensitive to intensity variations [39]. Correlation-based matching requires a textured scene but usually results in denser disparity maps than feature-based matching [39].
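A window-based (correlation-style) matcher of this kind is available off the shelf in OpenCV. The sketch below computes a dense disparity map from a rectified 8-bit grayscale pair, such as the one produced by the rectification sketch above; the disparity range and block size are arbitrary example values.

```python
# Correlation-based (block-matching) disparity with OpenCV's StereoBM;
# rect_l and rect_r are assumed rectified 8-bit grayscale images.
import cv2

matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)  # assumed values
disp16 = matcher.compute(rect_l, rect_r)      # fixed-point, 4 fractional bits
disparity = disp16.astype("float32") / 16.0   # disparity map in pixels
```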

The epipolar lines are used to reduce the search space for feature correspondences. Due to the epipolar geometry, a point in frame 1 must fall on the corresponding epipolar line of frame 2, which is the line where the epipolar plane intersects frame 2, as can be seen in figure 2.5. This reduces the search space from a plane to a single line, reducing both the search time and the possibility of false matches [34].
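Given the fundamental matrix $F$ from calibration, the epipolar line that constrains the search can be computed directly; the pixel coordinate below is an arbitrary example, not a value from this thesis.

```python
# The epipolar line in frame 2 for a pixel in frame 1, using the
# fundamental matrix F from the calibration sketch above.
import cv2
import numpy as np

pt1 = np.float32([[320.0, 240.0]])             # an arbitrary pixel in frame 1
lines = cv2.computeCorrespondEpilines(pt1.reshape(-1, 1, 2), 1, F)
a, b, c = lines[0, 0]                          # line ax + by + c = 0 in frame 2
# The correspondence search can now be restricted to this single line.
```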

Structured light and active stereo depth imaging try to relax the requirement of geometrical features or a textured scene by projecting their own features onto the scene with light at visible or infrared (IR) frequencies. Both estimate depth with the same triangulation principle as conventional passive stereo imaging.

No structured light (SL) cameras are included in the depth camera evaluation of this thesis, but the technology is still mentioned here due to its relevance in the field. The original Kinect depth camera is based on SL technology. SL technology replaces one of the stereo camera's imagers with a projector that projects light (usually at IR wavelengths) with an encoded pattern onto the scene. The encoded pattern simplifies the correspondence problem, as the pattern is known a priori [40]. The scene is imaged with the other camera and the distance to the scene is triangulated from the found correspondences [40]. Choosing the projected pattern and interpreting the depth from it is not trivial. The projected pattern(s) may use temporal coding, direct coding or spatial neighbourhood coding strategies: temporal coding uses different patterns at different points in time, direct coding uses color-coded patterns, and spatial neighbourhood coding projects a unique pattern for each pixel, from which each pixel can be triangulated individually [40]. Spatial neighbourhood coding using IR light is currently the most common strategy in commercial SL cameras [39].
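As a toy illustration of temporal coding, the sketch below generates Gray-code stripe patterns in which every projector column receives a unique on/off sequence over the projected frames; commercial SL cameras use considerably more elaborate, often proprietary, patterns.

```python
# Toy temporal-coding sketch: Gray-code stripe patterns give every
# projector column a unique bit sequence across the frames, so the
# column a camera pixel sees can be decoded from its intensity over time.
import numpy as np

def gray_code_patterns(width: int, height: int) -> list[np.ndarray]:
    n_bits = int(np.ceil(np.log2(width)))
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)                  # binary column index -> Gray code
    frames = []
    for b in range(n_bits - 1, -1, -1):        # one frame per bit, MSB first
        stripe = ((gray >> b) & 1).astype(np.uint8) * 255
        frames.append(np.tile(stripe, (height, 1)))
    return frames

patterns = gray_code_patterns(1024, 768)       # 10 frames encode 1024 columns
```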

Active stereo depth imaging also uses a projector to add features to the scene. Instead of projecting a coded, known pattern as in structured light, an arbitrary, non-repeating pattern is projected simply to add features from which the stereo camera creates the disparity map [41]. Active stereo cameras can create the disparity map using both the natural features of the scene and the projected ones, which makes them quite robust against both bright and dark scenes, as shown later in this thesis. Apart from the projected features, active stereo cameras do not differ from passive ones.