
3.3 Sensors and Calibration

The input data for the classifier was the image from the stereo camera. More specifically, OpenCV was used to create a depth map and point cloud from the calibrated stereo images, as well as a rectified intensity image. The classifier calculates its features from the rectified intensity image, the depth map and the point cloud. Furthermore, the camera and robot were calibrated every time the camera was moved, so that 3D coordinates could be transformed from the camera frame to the robot frame.
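As a rough illustration, an OpenCV pipeline of the kind described above could look like the sketch below. The calibration file, image file names and matcher parameters are placeholders, not the values used in this work; the calibration data is assumed to come from an earlier cv2.stereoCalibrate run.

```python
import cv2
import numpy as np

# Raw stereo pair (placeholder file names).
left_raw = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right_raw = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
h, w = left_raw.shape

# Calibration results, assumed saved earlier by a cv2.stereoCalibrate run.
calib = np.load("stereo_calib.npz")  # placeholder file
K1, D1, K2, D2 = calib["K1"], calib["D1"], calib["K2"], calib["D2"]
R, T = calib["R"], calib["T"]

# Rectification transforms for both cameras, plus the reprojection matrix Q.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, (w, h), R, T)

# Rectified intensity images.
m1x, m1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, (w, h), cv2.CV_32FC1)
m2x, m2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, (w, h), cv2.CV_32FC1)
left_rect = cv2.remap(left_raw, m1x, m1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_raw, m2x, m2y, cv2.INTER_LINEAR)

# Disparity via semi-global block matching (illustrative parameters).
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)
disparity = matcher.compute(left_rect, right_rect).astype(np.float32) / 16.0

# Depth map and point cloud from the reprojection matrix Q.
points_3d = cv2.reprojectImageTo3D(disparity, Q)
depth = points_3d[:, :, 2]
```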

An example rectified image can be seen in Figure 5 (the image has been cropped for better display). The image shows a salt box and a deodorant stick on top of the grasping area.

Because the robot is table-mounted and the camera is calibrated, some further constraints could be imposed; for example, the height of the grasping surface is always known, since the robot is bolted to it.

Figure 5. Rectified stereo image.

3.3.1 Data Filtering

The source data required some filtering in preparation for feature extraction. The depth map produced by stereo vision had significant holes in it, and several of the features used require an accurate depth map.

Taking the scene from Figure 5 as an example, the corresponding depth map straight from the stereo algorithm is shown in Figure 6. The black areas correspond to holes in the depth map, and they are present around the objects' outlines, which is an important area for the algorithm. There are also a few bright areas that suggest something is closer to the camera where nothing exists. These could create false grasps, so both categories of problems need to be addressed.

First, a median filter was applied to fill in any tiny holes in the depth map. This also made the depth values more stable; the grasping method does not need fine detail from the depth perception, only the large outlines of objects. A 7x7 median filter was used to obtain smooth depth values while still preserving edges.
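A minimal sketch of this step is shown below. SciPy is used here because cv2.medianBlur restricts 7x7 kernels to 8-bit images; the choice of library is an assumption, not necessarily the one used in this work.

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_depth(depth: np.ndarray) -> np.ndarray:
    """Apply a 7x7 median filter to the raw depth map.

    Tiny holes (invalid pixels surrounded by valid depths) take the
    median of their neighbourhood, while larger holes and object
    edges survive the filtering.
    """
    return median_filter(depth.astype(np.float32), size=7)
```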

The effect of this can be seen in Figure 7. Most of the small discrepancies, especially in the grasping area, have disappeared. The table and objects form a smooth depth surface, with some holes remaining.

Figure 6. Depth map from stereo camera.

To fix the blank areas after median filtering, the depth values were interpolated horizontally line-by-line, so that for each gap in the depth map, the bordering depth value that was further away was used for all the points in the gap. This leads to strong preservation of edges, which is important for the grasping algorithm, as edge points are used for contact points.
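A minimal sketch of this gap filling follows, assuming holes are marked with a depth value of zero and that larger depth values mean further from the camera (both are assumptions; the exact representation is not specified here).

```python
import numpy as np

def fill_gaps(depth: np.ndarray, hole: float = 0.0) -> np.ndarray:
    """Fill holes line-by-line with the farther bordering depth value.

    For each horizontal run of missing pixels, the valid depth bordering
    the run that is further from the camera is copied into the whole gap.
    This keeps object/background edges sharp instead of blending them.
    """
    out = depth.copy()
    height, width = out.shape
    for y in range(height):
        row = out[y]
        x = 0
        while x < width:
            if row[x] != hole:
                x += 1
                continue
            # Found the start of a gap; scan to its end.
            start = x
            while x < width and row[x] == hole:
                x += 1
            left = row[start - 1] if start > 0 else None
            right = row[x] if x < width else None
            borders = [v for v in (left, right) if v is not None]
            if borders:
                # Larger depth value = further from the camera.
                row[start:x] = max(borders)
    return out
```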

After interpolation, the depth map in the grasping area is mostly smooth, as seen in Figure 8. There are some small parts of the depth map where the table surface was not detected correctly. While these errors are not desirable, the classifier can handle them, as the object edges are still preserved.

Linear interpolation of the depth values was also tested. It was abandoned because it masks the edges between object and background, causing, for example, the discontinuity feature to register no discontinuities.

Figure 7. Depth map with 7x7 median filtering.

3.3.2 Selection of Candidate Pairs

The candidates for classification were selected by performing edge detection on the processed depth image. Due to the median filtering used and the quality of the depth map, spurious edges were often also detected on the table surface. As can be seen in Figure 9, edge detection produced many points where no object edges exist.
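The edge detector itself is not specified here; as an illustration, a Canny pass over the filled depth map could produce the candidate edge points. The normalisation and thresholds below are placeholder assumptions.

```python
import cv2
import numpy as np

# filled_depth is the interpolated depth map from the previous step.
# Scale to 8-bit for Canny; detector thresholds are illustrative.
depth_8u = cv2.normalize(filled_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
edges = cv2.Canny(depth_8u, 50, 150)
edge_points = np.column_stack(np.nonzero(edges))  # (row, col) of each edge pixel
```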

Taking all combinations of all the edge points generates a large number of pairs, which required some fast filtering to cut the dataset down to a manageable size.

In fact, Figure 9 shows 1727 points. The formula for unique pair combinations is

\[
N_{\mathrm{pairs}} = \frac{N_{\mathrm{points}}\,(N_{\mathrm{points}} - 1)}{2} \tag{1}
\]

where $N_{\mathrm{points}}$ is the number of points. With 1727 points, $N_{\mathrm{pairs}}$ comes to a little under 1.5 million pair combinations.

Figure 8. Final interpolated depth map.

Figure 9. Image showing edge points detected.

The average feature extraction time per pair was roughly 60 milliseconds after optimizations, which would make the total time over 24 hours. This is too long for one scene, especially considering that more complex scenes may have closer to 10000 edge points, which would put the number of pairs near 50 million and the total time over a month.
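A quick back-of-the-envelope check of these estimates with Equation (1):

```python
def n_pairs(n_points: int) -> int:
    """Unique unordered pairs, Equation (1)."""
    return n_points * (n_points - 1) // 2

per_pair_s = 0.060  # ~60 ms of feature extraction per pair

print(n_pairs(1727))                        # 1490401, a little under 1.5 million
print(n_pairs(1727) * per_pair_s / 3600)    # ~24.8 hours for one scene
print(n_pairs(10000))                       # 49995000, near 50 million
print(n_pairs(10000) * per_pair_s / 86400)  # ~34.7 days
```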

Due to these problems, some initial filtering of the grasp pairs was implemented before classification. The worst grasps were removed based on their 3D location: pairs too close to the table surface or outside the effective grasping area were discarded.
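A sketch of such a pre-filter is given below. The workspace bounds and clearance are assumed placeholder values; the robot-frame coordinates would come from the camera-to-robot calibration described earlier.

```python
import numpy as np

TABLE_Z = 0.0         # table surface height in the robot frame (known from mounting)
MIN_CLEARANCE = 0.01  # minimum height above the table, in metres (assumed)
X_RANGE = (-0.4, 0.4) # effective grasping area in robot coordinates (assumed)
Y_RANGE = (0.2, 0.8)

def keep_pair(p1: np.ndarray, p2: np.ndarray) -> bool:
    """Reject pairs too close to the table or outside the grasping area."""
    mid = (p1 + p2) / 2.0
    if mid[2] < TABLE_Z + MIN_CLEARANCE:
        return False
    return (X_RANGE[0] <= mid[0] <= X_RANGE[1]
            and Y_RANGE[0] <= mid[1] <= Y_RANGE[1])
```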

Finally, if too many grasps remained, a large random sample of the remaining candidate pairs was taken as the representative candidates for classification. This was found not to noticeably degrade classifier performance, provided the random sample was large enough that it was statistically likely to contain a good selection of usable grasps. A sample size of 10000 candidate pairs was used for classifying the test set. A sample size of 25000 candidates was also tested, and the classification accuracy did not change with the larger sample.
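The subsampling step itself is straightforward; a sketch using NumPy, with the variable names as placeholders:

```python
import numpy as np

rng = np.random.default_rng()
SAMPLE_SIZE = 10000  # candidate budget used for the test set

# pairs: list of surviving candidate pairs after the 3D pre-filter.
if len(pairs) > SAMPLE_SIZE:
    idx = rng.choice(len(pairs), size=SAMPLE_SIZE, replace=False)
    pairs = [pairs[i] for i in idx]
```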