
State of the Art

Current state-of-the-art methods use a depth map and combine salient intensity and depth features to define candidate grasp locations. Features that help distinguish the stability of a grasp are also added. Finally, machine learning methods are applied to pick the most effective grasp.

The method chosen as best for implementation is a method by Le [3], developed on the Stanford STAIR2 robot. It expands upon previous work by Saxena at Stanford, where a grasping system based on 2D edge features and depth features was trained on simulated training material to perform grasping. Their system produces a single grasp location, and the path planner figures out how to perform the grasp [38] [7]. The system was later extended to more cluttered environments and augmented with a machine learning system for judging grasp quality and stability [39].

Le's method was chosen because it offers significant improvements in accuracy over Saxena's method. Le [3] reports a mean success rate of 69% in tests of the method from Saxena [39], compared to an 82% mean success rate for the method presented by Le.

Le [3] extends Saxena's work by introducing several new visual features. He also chooses a set of contact points, one for each finger of the gripper, as the classification target. This choice removes most of the complexity from the path planner. Having the contact points available to the learning algorithm allows the classifier to learn which pair of points is the best candidate, instead of just selecting a potentially good location in the image.

In addition, classification with SVM is improved by utilizing a ranking cost function.

Normal classification with SVM usually aims to assign the right class to each feature vector. Ranking classifiers instead place more value on the highest-ranked results, so that, for example, only the top k = 10 results are ordered correctly and the highest value is placed on getting the top result right. This is commonly used for web search engines and similar applications, because the user cares most about having the top results correct. The same applies to grasping: only the highest-ranked grasp is significant, as it is the one actually used to grasp the object.

The cost function used by Le [3] is NDCG (Normalized Discounted Cumulative Gain). NDCG is a ranking cost function that prioritizes the ordering of the top k results, applying decreasing importance to results further away from the top result [30].
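For reference, the short sketch below shows one common formulation of NDCG@k. The relevance values and the value of k in the example are illustrative placeholders, not figures from Le's paper.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k results.
    Results further from the top are discounted logarithmically."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the predicted ordering normalized by the DCG of the
    ideal (descending) ordering, giving a value in [0, 1]."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: relevance of grasp candidates in the order the classifier ranked them
print(ndcg_at_k([3, 1, 2, 0, 0], k=3))
```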

Le's grasping algorithm is shown at a high level in Algorithm 1. The workflow is that of a basic machine vision system [40], where data is first gathered from sensors and processed into suitable features. The features form a feature vector, which is then run through the trained classifier to generate the grasp [8]. Finally, the grasp is executed by the robot.

Algorithm 1: Chosen Method

1. Take a picture of the grasping area with the camera
2. Compute point cloud and depth map from the depth triangulation sensor
3. Perform edge detection to find grasp candidate points
4. Compute all triples of candidate points
5. Perform feature extraction for the candidate triples. Extract angle, distance, discontinuity and sphere features.
6. Classify the feature vectors with the trained ranking SVM
7. Select the top-ranked triple as the grasp affordance
8. Compute a trajectory to perform the grasp
9. Execute the grasp
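To give a sense of scale for step 4, the number of candidate triples grows cubically with the number of candidate points. Assuming, purely for illustration, a few hundred edge points (a figure not taken from Le's paper), there are already over a million triples to evaluate, which is why later steps benefit from pruning the candidate set:

```python
import math

# Illustrative only: 200 candidate points is an assumed figure.
n_points = 200
print(math.comb(n_points, 3))   # 1313400 candidate triples (three-fingered hand)
print(math.comb(n_points, 2))   # 19900 candidate pairs (two-fingered gripper)
```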

The robot used for implementation in Le's paper is the Stanford STAIR2. It has a 7-DOF (Degrees of Freedom) Barrett robot arm and a three-fingered 4-DOF hand. For depth data, the robot uses an active triangulation sensor with a laser projector and a camera to obtain very dense depth maps. As the robot has a three-fingered hand, candidate triples are used to describe the grasps.

The features used include angle features, distance features, discontinuity features and sphere features. These are all explained in detail when describing the implementation done for this study in Section 3. Some additional features, such as raw template and depth data, are also used, but these only affect the performance of the algorithm slightly [3].

3 IMPLEMENTATION

The implemented method is a variation of the method presented in [3]. Some changes had to be made due to the different configuration of the robot and the use of stereo vision for depth perception. The features used are largely the same as in Le's paper. Some changes were introduced to take advantage of specifics of our robot configuration, as this helps overcome the problems caused by the additional limitations the configuration places on the system. This is explained in greater detail in Section 3.7.

The grasping algorithm is shown at a high level in Algorithm 2. It follows the same general machine vision system structure as Le’s method described in Algorithm 1.

Algorithm 2: Grasping Algorithm

1. Take a picture of the grasping area with the stereo camera
2. Compute point cloud and depth map from the calibrated stereo pair
3. Filter the depth map
4. Perform edge detection to find grasp candidate points
5. Compute all pairs of candidate points, then take a subset of those pairs as candidates for classification
6. Perform feature extraction for the candidate pairs. Extract angle, distance, discontinuity and sphere features, among a few others.
7. Classify the feature vectors with the trained ranking SVM
8. Select the top-ranked pair as the grasp affordance
9. Compute a trajectory to approach the grasp from straight above
10. Execute the grasp

There are a few differences from Le's original implementation. First, the sensor and robot hand are different. The stereo camera produces a less dense depth map than the active depth sensor used by Le, and the robot hand is a two-fingered gripper, so candidate pairs are used instead of candidate triples. Second, additional filtering is performed to get the stereo camera depth data to a level usable for feature extraction. In addition, due to the less accurate depth data, the generation of candidate pairs from edge detection also requires limiting the number of candidate pairs. Finally, path planning had to be done so that objects are grasped from above, due to the different robot configuration.
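As a rough illustration of steps 4 and 5 of Algorithm 2, the sketch below detects edge points with OpenCV's Canny detector and prunes the resulting pairs by random subsampling. The thresholds, point step, and maximum pair count are placeholder values; the actual candidate selection used in this work is described later in this section.

```python
import itertools
import random

import cv2
import numpy as np

def candidate_pairs(gray_image, max_pairs=2000, point_step=5):
    """Find grasp candidate points on intensity edges and return a pruned
    set of candidate pairs. All numeric parameters are illustrative only."""
    edges = cv2.Canny(gray_image, 50, 150)           # binary edge map
    ys, xs = np.nonzero(edges)                       # edge pixel coordinates
    points = list(zip(xs, ys))[::point_step]         # thin out the edge points
    pairs = list(itertools.combinations(points, 2))  # every pair of candidates
    if len(pairs) > max_pairs:                       # keep the pair count tractable
        pairs = random.sample(pairs, max_pairs)
    return pairs

# Usage: pairs = candidate_pairs(cv2.imread("left.png", cv2.IMREAD_GRAYSCALE))
```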

The features are largely similar. The additional features mentioned in Le's paper, such as raw depth data, are not used, as they were not described in detail in Le's paper. In some cases a little interpretation had to be done when implementing the features, such as how the number of discontinuities is counted. For the distance features, the distance between the points and the robot base was added due to the different robot configuration. A new sphere feature was added to detect collisions between the base of the gripper and objects, to avoid overreaching. The height from the table surface was also considered, as it is implicitly known in our robot setup.
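The following is a minimal sketch, under assumed coordinate conventions, of how the added distance-to-base and height-above-table quantities could be computed for a candidate pair. The robot base position, table height, and point format are placeholder assumptions, not values from the actual setup.

```python
import numpy as np

# Placeholder assumptions: points are 3D coordinates in a frame where the
# z-axis points up and the table surface lies at z = TABLE_Z; the robot base
# is located at BASE_XYZ. Neither value comes from the real configuration.
BASE_XYZ = np.array([0.0, 0.0, 0.0])
TABLE_Z = 0.0

def added_pair_features(p1, p2):
    """Distance of the grasp midpoint from the robot base and its height
    above the table surface, as rough analogues of the added features."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    midpoint = (p1 + p2) / 2.0
    dist_to_base = np.linalg.norm(midpoint - BASE_XYZ)    # reach needed by the arm
    height_above_table = midpoint[2] - TABLE_Z            # implicitly known table height
    return dist_to_base, height_above_table

# Example with made-up contact points (metres)
print(added_pair_features([0.4, 0.10, 0.05], [0.4, 0.15, 0.05]))
```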

The implementation has a total of 42 features. Table 1 shows the number of features split by type.

Table 1. Number of features by type

Angle Features (depth)    12

The training material was created specifically for this work, as the chosen paper does not go into great detail about the training material used or how it was created.

This section explains the implementation in detail, first going through the hardware setup used, then explaining the sensors used and the data filtering done before feature extraction. The features are explained one by one. After this, the classifier and training methods are explained, as well as how the training set was labelled. Finally, some limitations of the system are addressed.

3.1 Robot Hardware and Configuration

The robot configuration differs slightly from the one used in the chosen paper. Their robot was mobile, with the camera and arm mounted on the robot. The robot used in this work consists of a Melfa industrial robot arm mounted on a table and a camera on a stand at the side of the table. Objects to be grasped lie on the table. The camera was moved to slightly different locations in the same general direction from the robot arm.

The robot is a table-mounted MELFA RV-3SB robot arm with 6 DOF (Degrees of Freedom). The grasping area is next to the robot on the table. A picture of the robot setup in our lab can be seen in Figure 1.

Figure 1. Picture of the robot setup.

The gripper is a Weiss Robotics WRT-102 consisting of Weiss tactile sensors attached to a Schunk PG-70 gripper. The tactile sensors were not used for the grasping in this work.

A closer view of the gripper is visible in Figure 2.

The stereo camera used is a Bumblebee 2 by Point Grey, a FireWire-attached stereo camera capable of capturing 648x488 stereo image pairs. The camera and stand can be seen in Figure 3. The open-source library OpenCV was used to compute the depth map and point cloud from the stereo images. The original work used an active depth sensor to obtain a denser depth map, but the method was adaptable to the less precise depth readings from the stereo camera.
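As an indicative example only, a disparity map and point cloud can be computed from a calibrated, rectified stereo pair with OpenCV roughly as follows. The matcher parameters and the reprojection matrix Q are placeholders; the actual calibration and depth filtering used in this work are described in the following sections.

```python
import cv2
import numpy as np

def stereo_depth(left_gray, right_gray, Q):
    """Compute a disparity map with semi-global block matching and reproject
    it to a 3D point cloud. Q is the 4x4 reprojection matrix obtained from
    stereo calibration (cv2.stereoRectify); parameter values are illustrative."""
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=64,   # must be divisible by 16
                                    blockSize=9)
    # compute() returns fixed-point disparities scaled by 16
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity = cv2.medianBlur(disparity, 5)             # simple noise smoothing
    points_3d = cv2.reprojectImageTo3D(disparity, Q)     # per-pixel (X, Y, Z)
    return disparity, points_3d
```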

Figure 2. PG-70 gripper.

Figure 3. Bumblebee stereo camera.