3.7 Limitations

The test setup introduced several limitations. The difference in the robot configuration required some changes to the algorithm itself, because the robot grasps from a different direction than the one the camera views from. This also meant the grasps had to be simplified to be solely top grasps performed from straight above.

The camera posed the biggest limitations. The depth map produced by a stereo camera is not as accurate as that produced by other types of depth sensors, and has problems detecting depth where there is no texture. This was addressed by requiring the table to be covered with some texture (newspaper was used), so that the stereo algorithm has something in the background from which to compute depth.

None of the objects used for grasping were problematic to detect; in practice, however, glossy or glass objects might pose problems for the hardware. In light of this, using a more accurate depth sensor would definitely improve results.

4 EXPERIMENTS

The performance of the implementation was tested with a small experiment. The experiments were performed with the hardware setup explained in Section 3.1, with the robot mounted on a table and the stereo camera on a stand on the side of the table.

The limitations described in Section 3.7 were taken into account. The grasping area of the table was covered with newspapers to provide texture for the stereo vision algorithm, and the robot performed grasps from above. The grasps were made so that the two-fingered gripper was positioned sideways from the point of view of the camera, with one finger on the left side of the object and one finger on the right side.

The experiments were performed using a labelled training set. The labelling was done by a combination of automatic classification of obviously bad grasps, followed by a human operator classifying each remaining grasp into one of three categories: Bad, Mediocre or Perfect (see Table 2 for more details about the categories). Due to the large number of grasps, only some examples could be tested with the robot, and a large part of the training material was classified based on the judgement of the human operator.
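As an illustration, a labelling pass of this kind could be structured as in the following sketch. The helper names are hypothetical and the automatic pre-filter is only a placeholder for the simple checks actually used; this is not the tool used in this work.

```python
# Hypothetical sketch of a three-category labelling pass (not the actual tool).
LABELS = {"b": "Bad", "m": "Mediocre", "p": "Perfect"}

def label_grasps(candidates, is_obviously_bad):
    """Exclude grasps rejected by the automatic pre-filter, then ask the
    operator to grade each remaining grasp into one of the three categories."""
    labelled = []
    for grasp in candidates:
        if is_obviously_bad(grasp):          # placeholder automatic check
            continue                          # excluded from the training set
        key = ""
        while key not in LABELS:
            key = input("Grade this grasp [b/m/p]: ").strip().lower()
        labelled.append((grasp, LABELS[key]))
    return labelled
```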

In the second stage of the experiment, completely novel objects not in the training set were chosen. These objects were placed 10 times at random locations in the grasping area. The trained classifier was then run to choose the best grasp from each image. The success of the grasp was evaluated by the human operator, with the strict criterion of accepting only sideways grasps on stable surfaces of the object. The grasp visualizer from the training tool was used to help with the task.

Figure 15 shows the objects in the test set. All objects are novel to the robot, in the sense that they were not included in the training set. Some of the objects in the set are more difficult than others; for example, the tapered shampoo bottle can only be grasped from near the cork at the top.

Figure 15. Pictures of the test set objects.

Table 3 lists the grasp results for the set of novel test objects. The objects are in the same order as in Figure 15, read from left to right and top to bottom.

Table 3. Classification results.

DATASET NAME          SUCCESSFUL GRASPS
Box                   70 %
Herbal Spice          64 %
Wasabi Box            30 %
Screwdriver           40 %
Stamper               50 %
Block of Wood         40 %
Saltshaker            60 %
Shampoo               60 %
Shampoo (tapered)     40 %

The success of grasps depends highly on the object. Some kinds of objects were very simple, such as a box. Figure 16 shows a successful grasp on the box object. Some more difficult objects were also included, such as a tapered shampoo bottle, where the only stable, thin enough area for grasping was around the cap of the bottle.

In the vast majority of failed cases the grasp was still on the object, and the failures were mostly due to misoriented grasps that could not reliably be judged as stable, for example grasps with contact points on the front and back faces of the object, which are hard for the path planner to execute in this setup. Figure 17 shows one example of a grasp where it is hard to tell whether it would be successful (even when examining the grasp in the 3D point cloud view). Another failure case was corner grasps, which are not very stable.

Some of the objects seem quite similar, such as Box and Block of Wood, yet have quite different results. This is likely explained by the fact that most of the problematic objects were somewhat smaller, which means the low quality of the depth perception caused a relatively larger discrepancy in their outlines. This explains the lower results for Wasabi Box, Screwdriver and Block of Wood, as they were among the smaller objects.

The remaining grasps were misclassifications, like the one in Figure 18, where the grasp is on one edge of the object and could not possibly work.

Figure 16. Example of a good grasp.

The majority of grasps that were not on the object were caused by the stereo camera capturing an area of the table surface with a large difference in height, causing the algorithm to think there was a better object available than the one being grasped. Such a case of bad depth map data can be seen in Figure 19, where the depth map has a region on the table surface that could be interpreted as a highly salient object.

Figure 17. Example of a mediocre grasp.

Figure 18. Example of a bad grasp.

Figure 19. Example of a failure case.

5 DISCUSSION

The review of the field of robotic grasping was interesting and fruitful, and revealed many interesting areas for improvement. The chosen state-of-the-art method by Le was adaptable to the laboratory hardware used; however, it did not perform as well as expected, due to several factors.

One of the problems encountered was the use of stereo vision for depth perception. The low resolution of the stereo camera provided additional hurdles. Some improvements had to be made to be able to use the method with stereo depth perception at all.

Further difficulties were the result of the difference in robot configuration. Le used an autonomous mobile robot, with the gripper and camera attached to the same frame. Our laboratory has a more industrial setting, with the robot arm attached to a grasping surface and the stereo camera on the side. This caused problems because the camera does not see the graspable object from the direction the gripper approaches from, which limits the available grasps to those that are feasible from both the camera and the robot arm frames.

Finally, the generation of the training set and the handling of the sheer number of pairs generated as grasp candidates proved difficult. Even when feature extraction was optimized, extracting features for all the candidates in a single scene would have taken more time than was reasonable. This was likely also caused in part by the lack of accuracy of the stereo depth sensor.

The number of grasp candidates was kept manageable during training by automatically excluding obviously bad grasps, and finally by filtering the edge detections by hand to remove points that were not viable. For actual classification, no better solution was found than selecting a random subset of the candidate pairs large enough that at least one viable grasp was statistically almost certain to be present.
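To illustrate the subset-size argument: if a fraction p of the candidate pairs is viable, a random sample of n candidates contains at least one viable grasp with probability 1 - (1 - p)^n, so the n required for a given confidence level follows directly. The sketch below makes this concrete; the 1 % viability figure is purely illustrative.

```python
import math

def subset_size(p_viable: float, confidence: float = 0.99) -> int:
    """Smallest random-sample size n such that the probability of drawing
    at least one viable grasp, 1 - (1 - p_viable)**n, reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_viable))

# Purely illustrative: if ~1 % of candidate pairs were viable, about 459
# random candidates would give a 99 % chance of including at least one.
print(subset_size(0.01))
```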

5.1 Future Work

Future work in learning grasp affordances should definitely try to improve on the method of Le [3]. Much better results than those of our implementation could likely be obtained with a higher quality depth sensor and other improvements.

The training of the classifier can definitely be improved upon. The current revision of the method creates a very high number of candidates, which requires much time and effort for creating the training set. The sheer number of samples also makes it infeasible to drive the robot directly for each sample, which would give much more robust training labels.

If ways could be found to significantly reduce the number of candidates, or to filter out the obviously bad grasps more thoroughly, much higher quality training material could be created. One possible solution would be to simply handpick training examples and ensure the best possible quality by driving the robot for this subset. Training material generation is definitely an interesting direction to research, and one of the biggest improvements that could be made.

The learning could be improved by using a larger number of categories for the labelled training set. Three quality classes were used in our work; with a larger number of quality categories, the ranking SVM could learn an even better ranking.

Further improvements could also be made to the handling of the type of robot configuration used, where the camera is beside the robot. In this study, the problem was solved by limiting grasps to only sideways top grasps, where the object is grasped from straight above [34]. Much better results could be obtained if the robot could also use front-back facing grasps. Indeed, many of the grasps considered failure cases in the test classification were of the kind where the classifier chose an angled or front-back grasp instead of a horizontal one.

Additionally, work could be done to lift some of the limitations imposed in this paper.

The largest and most important limitation to remove would be the requirement of texture on the surface from which objects are grasped. The data filtering presented here improves the stereo depth map quite a bit, but it still requires depth readings of the table between the objects. Better solutions would likely be found if the table surface were located by fitting a plane to it. Other methods of repairing the depth map could also be tried, such as the Successive Over Relaxation (SOR) technique used in [25].
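As a rough illustration of the plane-fitting idea (this is not part of the implementation described in this thesis), a simple RANSAC estimate of the table plane from an N x 3 point cloud could look as follows; the iteration count and inlier threshold are illustrative values.

```python
import numpy as np

def fit_table_plane(points, n_iters=500, inlier_dist=0.005, rng=None):
    """RANSAC plane fit: repeatedly pick 3 random points, form a candidate
    plane and keep the one with the most points within inlier_dist (metres).
    Returns (unit_normal, d) of the plane unit_normal . x + d = 0."""
    rng = rng or np.random.default_rng()
    best_inliers, best_plane = -1, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal.dot(p0)
        inliers = int(np.sum(np.abs(points @ normal + d) < inlier_dist))
        if inliers > best_inliers:
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane
```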

An interesting direction for further thought would be to continue moving more of the grasping process into the classifier, so that machine learning can be applied to it.

The current method still includes some path planning to avoid collisions. If the classifier could also take the approach path into account, even better results could be obtained. This work already tried this in a minimal fashion by using a sphere feature to block the approach direction of the robot tool. This could be taken further by including a path option in the feature vector. Such an inclusion would increase the number of candidate combinations even further, so the problem of filtering candidates would need to be solved first.

Finally, the most obvious direction is improving the feature set and the classifier to obtain much higher success percentages, and to cover more corner cases where the classification results underperform. For a working robot utilizing this, the success rate needs to be near 100%.

There can be some leniency, in that the robot can try multiple times to get a successful grasp, either by manipulating the object without successfully grasping it or by locomoting to a different position where a better view or grasping angle is available. Still, in the end the robot needs to be able to pick up the desired object to complete its task.

One interesting method that was not considered in this work is segmentation. There has been recent work in grasping with depth segmentation [25] that is very likely to improve the grasp success rate over the method presented in this thesis.

In light of the results of this thesis, depth segmentation looks like a very viable course to investigate. Segmentation could possibly solve the problem of finding a small set of good grasp candidates, which would make the method presented here perform significantly better.

6 CONCLUSION

This study has surveyed the field of robotic grasping and looked into state-of-the-art methods for learning grasp affordances. Le's [3] method was chosen as the one most suitable for the MELFA robot arm and stereo camera in use. The method was implemented with small changes, and the implementation was explained in detail in this work. Finally, the implementation's classification performance was evaluated.

The survey of the field of robotic grasping through grasp affordances found much interesting research on the topic. Many methods and directions have been tried over the years, with different choices of primary data used for features, whether 2D intensity images, 3D models or depth maps.

Furthermore, published methods differed on how machine learning methods were used, especially on the kind of training used to train the classifier. There were many lines of research available, from training by a human operating the robot, to imitating a human demonstrating the grasps with finger-placement sensors. Some papers used simulated training material, such as rendered 3D graphics, to simplify the generation of training materials. Finally, training materials labeled by humans were used, and proved to be the most reliable.

The method described in Le's paper [3] was chosen for implementation because it had achieved the highest success rates and seemed the most robust. It also showed promise as the most adaptable method, allowing use with our hardware with minor modifications and improvements. The method used several ingenious ideas to achieve good results.

Multiple contact points were used to bring more of the problem into the classification stage for machine learning, instead of using a hardcoded path planner to perform part of the work.

Several clever features were presented by Le, such as angle features at the contact points, which provide information about the stability of the grasp. Discontinuities between the contact points were used to judge whether a single object is grasped. To prevent the tool fingers from colliding with the objects or the grasping surface, point cloud sphere features were introduced, which detect the possibility of collision near the contact points. Together these features provide the classifier with enough information to make a decision.
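To illustrate the sphere feature idea, one simple realization is to count the scene points falling inside a sphere placed near a contact point along the approach direction, where a high count indicates a likely collision. The sketch below assumes the scene is an N x 3 point cloud; the offsets and radius are made-up values rather than those of Le's method.

```python
import numpy as np

def sphere_feature(cloud, center, radius):
    """Count scene points inside a sphere placed at `center`; a high count
    suggests a finger or the gripper base would collide near that spot."""
    dist_sq = np.sum((cloud - center) ** 2, axis=1)
    return int(np.sum(dist_sq < radius ** 2))

# Illustrative usage: probe 3 cm above a contact point along the approach
# direction with a 2 cm sphere (offsets and radius are made-up values).
# risk = sphere_feature(cloud, contact_point + 0.03 * approach_dir, 0.02)
```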

Additionally, the implemented paper suggested the use of an SVM classifier with the NDCG ranking cost function, so that the classifier optimizes for the top-ranked grasps instead of trying to maximize accuracy over all the grasps. As the top-ranked result is always the grasp that is executed, this leads to better grasping results overall.
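For reference, NDCG can be computed from the graded labels taken in the order produced by the classifier, as in the sketch below. The 2^rel - 1 gain and 1/log2(rank + 1) discount shown are the common choices and may differ in detail from the exact formulation used in the ranking SVM training.

```python
import numpy as np

def ndcg(relevances, k=None):
    """NDCG of a ranking: `relevances` are the graded labels (e.g. 0 = Bad,
    1 = Mediocre, 2 = Perfect) in the order the classifier ranked the grasps."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum((2.0 ** rel - 1.0) * discounts)
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1.0) * discounts)
    return dcg / idcg if idcg > 0 else 0.0

# NDCG@1 reflects the quality of the single grasp that would be executed.
print(ndcg([2, 0, 1, 2, 0], k=1))
```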

The author of this work adapted Le's method to work with the lower resolution depth maps from our Bumblebee 2 stereo camera. This was done by posing some restrictions on the table surface texture, and then filtering the depth map with a median filter to remove salt-and-pepper noise. Finally, the depth map surfaces were expanded by interpolation to cover the holes still present in the depth map data.
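A rough sketch of this kind of depth map clean-up is shown below, using a median filter followed by a nearest-valid-pixel fill as a stand-in for the surface-expanding interpolation; the kernel size and fill strategy are illustrative rather than the exact ones used in the implementation.

```python
import cv2
import numpy as np
from scipy import ndimage

def repair_depth(depth):
    """depth: float32 map from the stereo matcher, with 0 or NaN marking holes.
    1) a 5x5 median filter suppresses salt-and-pepper outliers,
    2) each remaining hole is filled with the value of the nearest valid pixel,
       a crude stand-in for the surface-expanding interpolation described above."""
    d = np.nan_to_num(depth.astype(np.float32), nan=0.0)
    d = cv2.medianBlur(d, 5)              # float32 input allows ksize 3 or 5
    holes = d <= 0.0
    # For every hole pixel, find the index of the nearest valid pixel ...
    nearest = ndimage.distance_transform_edt(
        holes, return_distances=False, return_indices=True)
    # ... and copy that pixel's depth value into the hole.
    return d[tuple(nearest)]
```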

Additional changes were needed to suit the robot configuration, where the camera is not attached to the robot but is instead beside it. This required restricting the grasps to be performed from straight above, with contact points on the sides of the object as viewed from the camera. In this study, a sphere feature was also added for detecting objects that block the gripper base, preventing the robot from reaching the given contact points.

Additionally, since the robot always knows the location of the table when calibrated, the height from the table was used as a feature.

REFERENCES

[1] J. J. Gibson. The Theory of Affordances. Lawrence Erlbaum Associates, 1977.

[2] P. Järvinen. On Research Methods. Opinpajan Kirja, Tampere, 2004. ISBN 952-99233-1-7.

[3] Quoc V. Le, D. Kamm, A. F. Kara, and A. Y. Ng. Learning to grasp objects with multiple contact points. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2010, pages 5062–5069, 2010.

[4] M.R. Cutkosky and R.D. Howe. Human grasp choice and robotic grasp analysis. In Dextrous Robot Hands, pages 5–31. Springer-Verlag New York, Inc., New York, NY, USA, 1990.

[5] J. Coelho, J. Piater, and R. Grupen. Developing haptic and visual perceptual categories for reaching and grasping with a humanoid robot. Robotics and Autonomous Systems, 37(2–3):195–218, 2001.

[6] A. Bicchi and V. Kumar. Robotic grasping and contact: a review. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2000, volume 1, pages 348–353, 2000.

[7] A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic grasping of novel objects using vision. The International Journal of Robotics Research, 27(2):157–173, 2008.

[8] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 4th revised edition, 2008. ISBN 978-1597492720.

[9] F. Reuleaux. The Kinematics of Machinery: Outlines of a Theory of Machines. Macmillan, 1876. ISBN 978-0132611084.

[10] V.-D. Nguyen. Constructing stable grasps in 3D. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1987, volume 4, pages 234–239, Mar 1987.

[11] B. Dizioğlu and K. Lakshiminarayana. Mechanics of form closure. Acta Mechanica, 52:107–118, 1984. doi:10.1007/BF01175968.

[12] V.-D. Nguyen. Constructing force-closure grasps. The International Journal of Robotics Research, 7(3):3–16, 1988.

[13] B. Faverjon and J. Ponce. On computing two-finger force-closure grasps of curved 2D objects. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1991, volume 1, pages 424–429, Apr 1991.

[14] M.R. Cutkosky. Mechanical properties for the grasp of a robotic hand. In CMU, Technical Report, volume 19, Sept 1984.

[15] H. Hanafusa and H. Asada. Stable prehension by a robot hand with elastic fingers. In Proceedings of the 7th International Symposium on Industrial Robots, 1977.

[16] B. Baker, S. Fortune, and E. Grosse. Stable prehension with a multi-fingered hand. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1985, volume 2, pages 570–575, Mar 1985.

[17] Quoc V. Le, A. Saxena, and A.Y. Ng. Active perception: Interactive manipulation for improving object detection. Technical report, Stanford University, 2010.

[18] W.H. Warren. Perceiving affordances: Visual guidance of stair climbing. Journal of Experimental Psychology: Human Perception and Performance, 10(5):683–703, Oct 1984.

[19] C. de Granville. Learning Grasp Affordances. Master’s thesis, School of Computer Science, University of Oklahoma, 2008.

[20] A. Bendiksen and G. Hager. A vision-based grasping system for unfamiliar planar objects. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 1994, volume 4, pages 2844–2849, May 1994.

[21] R. Platt, R.A. Grupen, and A.H. Fagg. Learning grasp context distinctions that generalize. In Proceedings of the 6th IEEE-RAS International Conference on Humanoid Robots (ICHR), 2006, pages 504–511. IEEE, Dec 2006.

[22] E. Chinellato, R.B. Fisher, A. Morales, and A.P. del Pobil. Ranking planar grasp configurations for a three-finger hand. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2003, volume 1, pages 1133–1138, Sept 2003.

[23] D.L. Bowers and R. Lumia. Manipulation of unmodeled objects using intelligent grasping schemes. IEEE Transactions on Fuzzy Systems, 11(3):320–330, June 2003.

[24] R.B. Rusu, A. Holzbach, R. Diankov, G. Bradski, and M. Beetz. Perception for mobile manipulation and grasping using active stereo. In Proceedings of the 9th IEEE/RAS International Conference on Humanoid Robots (ICHR), 2009, pages 632–638, Dec 2009.

[25] D. Rao, Q.V. Le, T. Phoka, M. Quigley, A. Sudsang, and A.Y. Ng. Grasping novel objects with depth segmentation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2010, pages 2578–2585, Oct 2010.

[26] K. Huebner, S. Ruthotto, and D. Kragic. Minimum volume bounding box decomposition for shape approximation in robot grasping. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2008, pages 1628–1633, May 2008.

[27] A. Morales. Learning to Predict Grasp Reliability with a Multifinger Robot Hand by using Visual Features. PhD thesis, Universitat Jaume I, 2004.

[28] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995. doi:10.1007/BF00994018.

[29] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM, Jul 2002.

[30] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
