
Degree Program in Information Technology

Master’s Thesis

Arne Pajunen

LEARNING GRASP AFFORDANCES FROM VISION

Examiners: Professor Heikki Kälviäinen, D.Sc. (Tech.) Jarmo Ilonen
Supervisor: Professor Ville Kyrki


Lappeenranta University of Technology
Faculty of Technology Management

Degree Program in Information Technology
Arne Pajunen

Learning Grasp Affordances from Vision

Master’s Thesis

2012

58 pages, 19 figures, 3 tables, 2 algorithms and 0 appendices.

Examiners: Professor Heikki Kälviäinen, D.Sc. (Tech.) Jarmo Ilonen

Keywords: grasping, grasp affordance, robot, novel object, machine learning, machine vision, computer vision

Robots operating in complex, dynamic working environments require the ability to manipulate and grasp objects. This thesis examines previous work and the state of the art in robotic grasping and learning grasp affordances. Modern methods are surveyed, and Le's machine learning based classifier is implemented because it provides the highest success rate of the reviewed methods and is adaptable to our specific robot hardware. The implemented method uses intensity and depth features to rank grasp candidates. The performance of the implementation is presented.


Lappeenranta University of Technology
Faculty of Technology Management
Degree Program in Information Technology
Arne Pajunen

Learning Grasp Points from Stereo Vision (Tartuntapisteiden oppiminen stereonäöstä)

Master's Thesis

2012

58 pages, 19 figures, 3 tables, 2 algorithms and 0 appendices.

Examiners: Professor Heikki Kälviäinen, D.Sc. (Tech.) Jarmo Ilonen

Keywords (translated from Finnish): grasping, grasp point, robot, unknown object, machine learning, machine vision
Keywords: grasping, grasp affordance, robot, novel object, machine learning, machine vision, computer vision

Robots working in complex and changing environments need the ability to manipulate and grasp objects. This thesis reviews previous research and the current state of robotic grasping and of machine learning of grasp points. Modern methods are surveyed, and the machine learning based classifier by Le is implemented because it offers the best success rate of the studied methods and can be adapted to the available robot. The implemented method uses features computed from the intensity and depth images to classify potential grasp points. The results of this implementation are presented.


I wish to thank Professor Ville Kyrki for supervising the laboratory portion of this work.

Furthermore, thanks to Professor Heikki Kälviäinen and Jarmo Ilonen for seeing the writing of this work to completion.

Finally, thank you to Jarmo Ilonen for assistance with operating the robot hardware and other laboratory equipment, and Pekka Paalanen for making the LaTeX template.

Lappeenranta, December 17th, 2012

Arne Pajunen


CONTENTS

1 INTRODUCTION
1.1 Background
1.2 Objectives and Restrictions
1.3 Structure of the Thesis
2 PREVIOUS WORK ON GRASP AFFORDANCES
2.1 Human Grasping
2.2 Robotic Grasping
2.3 Types of grasps
2.3.1 Form-closure Grasps
2.3.2 Force-closure Grasps
2.3.3 Equilibrium Grasps
2.3.4 Stable Grasps
2.4 Difficulties in Robotic Grasping
2.5 Affordances
2.6 Grasp Affordances
2.7 Grasping with 2D intensity image
2.8 Grasping with 3D model
2.8.1 Depth Segmentation
2.9 Grasping by Shape
2.10 Grasping of Novel Objects
2.11 Learning Grasps
2.11.1 Support Vector Machines
2.11.2 SVM for Learning Grasps
2.12 Training methods for Machine Learning
2.12.1 Human imitation
2.12.2 Reinforcement learning
2.12.3 Simulated training
2.12.4 Human Labelling
2.13 State of the Art
3 IMPLEMENTATION
3.1 Robot Hardware and Configuration
3.2 Grasping Area and Objects
3.3 Sensors and Calibration
3.3.1 Data filtering
3.3.2 Selection of Candidate Pairs
3.4 Features
3.4.1 Angle Features
3.4.2 Line Discontinuity Feature
3.4.3 Distance Features
3.4.4 Sphere Features
3.5 Classifier
3.6 Training
3.6.1 Initial Filtering of Training Set
3.6.2 Training Tool
3.7 Limitations
4 EXPERIMENTS
5 DISCUSSION
5.1 Future Work
6 CONCLUSION
REFERENCES


ABBREVIATIONS AND SYMBOLS

BMRM Bundle Methods for Regularized Risk Minimization
CAD Computer Aided Design
DOF Degrees of Freedom
DVD Digital Versatile Disc
k-NN k-Nearest Neighbor
NDCG Normalized Discounted Cumulative Gain
POMDP Partially Observable Markov Decision Process
SLAM Simultaneous Localization and Mapping
SOR Successive Over Relaxation

SVM Support Vector Machines


1 INTRODUCTION

This is a Master of Science thesis examining the state of the art in learning grasp affordances [1] using machine vision. Modern methods in the field of robotic grasping are reviewed in a literature survey. One suitable method using machine learning and stereo vision is chosen for evaluation with our specific robot configuration. This work is a constructive research project using quantitative research methods [2].

Robotic grasping has recently attracted renewed interest [3]. Autonomous grasping of objects seen for the first time is a very desirable ability for general purpose robots, and it would also simplify the construction of industrial robots operating in constrained environments. This makes grasping a very important research direction in the field of robotics. With recent advances in navigation and mobility, object manipulation remains an obstacle to building robots capable of performing tasks autonomously.

1.1 Background

Autonomous robots operating in complex, previously unfamiliar and dynamic environments require the ability to grasp objects to be able to perform tasks. Grasping is required e.g. for cleaning, for retrieving objects and for using tools.

One example of a practical task would be emptying a dishwasher. The dishes might be in any number of configurations, and possibly not seen before by the robot. The robot needs to be able to find a way to pick up the cutlery to move it to a storage cabinet. Construction tasks require manipulating components into the correct locations. The use of tools is required as well, and while there are methods to swap the tool on the robot arm, it would be easiest if the robot could use its gripper to grasp and use the same tools as humans do.

The objective is then set to grasping objects. However, in practice, one has to go a little deeper. Simply grasping objects the robot is preprogrammed with is relatively easy. For objects the robot perceives for the first time, it is better to research methods for finding graspable locations. This representation is called a grasp affordance.

Grasp affordances are points describing potential locations where the robot can attempt a grasp. This representation means the robot potentially does not need previous knowledge of the object. Creating grasp affordances from stereo vision with machine learning allows the robot to grasp previously unseen objects for which a 3D model is not available.

1.2 Objectives and Restrictions

The goal of this thesis is to survey the field of robotic grasping and examine state-of-the-art methods for learning grasp affordances. The method that is expected to provide the best results with our lab's MELFA robot arm, using a stereo camera for depth perception, is selected. The selected method is then implemented, and the implementation with any required changes is explained. Finally, the grasping performance of the implementation is evaluated.

The method chosen for implementation is a slightly modified derivative of work by Quoc Le [3] at Stanford University on their STAIR robot platform. The changes and improvements required to make the chosen method work with the given equipment are presented. Finally, the performance of the implementation is evaluated using simulated grasps.

This thesis focuses only on methods suitable for the specific hardware mentioned. A table-mounted robot with a stereo camera observing the grasping area places some restrictions on the method. A two-fingered gripper is used. Grasping is planned and then executed using the robot hardware. No real-time methods, such as tactile feedback from the gripper, are used.

Some restrictions are based on the implementation. The camera and robot must be calibrated so that the mapping of 3D coordinates from the stereo camera to robot coordinates is known. Because of this, the table surface location is also known. Furthermore, the stereo camera requires some texture on the objects and background so that depth information can be obtained. This is taken into account by only focusing on objects with sufficient texture, and ignoring extremely difficult object surface materials like glass or non-textured colored surfaces.

1.3 Structure of the Thesis

This thesis is organized as follows: Section 2 reviews existing work on grasp affordances and robotic grasping, explaining the different approaches to learning to grasp with robots.

Section 3 describes the implementation of the chosen state-of-the-art method in detail, and the changes made due to the robot configuration. In Section 4 the experiments used to evaluate the implementation are described and the results are presented.

Near the end, Section 5 goes over the results and problems encountered, and offers possible directions for future work and ways to improve the performance of the chosen method, both in our lab configuration and in general. Finally, Section 6 summarizes the conclusions made in this study.


2 PREVIOUS WORK ON GRASP AFFORDANCES

Robots operating in previously unfamiliar and dynamic environments require the ability to grasp objects. Grasping is required e.g. for cleaning, for retrieving objects and for using tools. An example of a practical task for a fixed robot would be emptying a dishwasher, where objects might be placed in different positions every time. A mobile robot can encounter many scenarios requiring grasping. Retrieval of objects requires picking up the object after the hard part of finding it is done.

The availability of reliable grasping would also enable many applications, such as using generic tools for a variety of tasks simply by picking them up. This would be great for a general assembly robot. Unfortunately this also places more requirements on the quality of the grip, as a screwdriver cannot be used if it is held by the screw end.

There are two primary approaches to grasping. The first is purely analytical, modeling the physical force exerted by the fingers to create a stable grasp. The analytical approach usually requires simplifications that make the method not work outside controlled laboratory conditions. Recently more human-like grasping approaches using grasp affordances and machine learning have been developed, which tackle some of this complexity.

2.1 Human Grasping

Humans have evolved various mechanisms for grasping objects. Use of tools is one of the defining characteristics of the human species. It is therefore a good idea to explore how humans perform grasping, and see if any of the smart mechanisms that have emerged through evolution can be copied.

Humans perform grasping by combining a variety of senses and motor abilities. Humans learn locations that look graspable based on both intensity and stereo depth perception.

Human hands can also perform different types of grasps such as power or pinch grasps.

Using tactile feedback, humans can sense the type of surface being grasped. Humans can also evaluate the force required to prevent the object from slipping using the pressure sensitivity of the fingers.

Cutkosky [4] describes a taxonomy of different grasps used by humans for different purposes. In this taxonomy, grasps are subdivided into precision and power grasps, with precision grasps being used for smaller objects and increased dexterity. Precision grasps use the tips of the fingers to hold objects, and are further divided into circular grasps, where the fingertips are placed around a radial object, and prismatic grasps, where the opposed thumb and other fingers pinch the object in different ways. On the other end of the spectrum, power grasps are divided into prehensile and non-prehensile grasps. A non-prehensile grasp could be described as pushing or holding something with the entire hand.

Prehensile grasps are divided into circular grasps, where the object is held in the palm and the fingers wrap around it, and prismatic grasps, which are similar to wrapping one's hand around a pole-like object. Cutkosky developed this taxonomy by observing machinists working with metal parts and tools.

All of these abilities enable humans to grasp even objects such as raw eggs without breaking the shell, which is a very difficult task for a robot. Finally, humans can build experience. Humans learn by trial and error, by testing grasps and abandoning non-functional ones until they arrive at a suitable grasp. Coelho et al. [5] explore a learning system for a humanoid robot that is based on the way human infants learn. They develop a haptic grasp model that learns grasps for simple 2-dimensional shapes from local visual features.

Mechanically it is relatively simple to copy the senses that humans use, and to build grippers capable of similar grasps as humans. The hard part is copying the learning of grasps. One of the simplest ways to make use of human grasping is by making a robot learn by imitating a human teacher. This is incidentally also the way human children learn how to do many things at early stages of development.

2.2 Robotic Grasping

Robotic grasping is the act of the robot finding a suitable location for grasping an object, and then moving its gripper to that location and closing the gripper to hold the object.

This is done by taking data from the sensors to compute a trajectory to a location where the gripper can grasp the object to obtain a stable grasp.

If a 3D model of the object is known to the robot, or can be constructed, it is possible to compute the grasp on the model, and then fit the model to the observed object. Using methods similar to Computer Aided Design (CAD), the stability of the grasp can be simulated using many different methods [6]. This is possible if the object or the general type of the object is known beforehand, or a sufficiently accurate 3D model of it can be obtained from the sensors.

Having an existing model might be a reasonable assumption for an industrial assembly robot. However, for more general purpose robots it is necessary to be able to grasp unfamiliar objects. Grasping of novel objects is where the biggest difficulties lie, as the data from sensors is usually limited enough that a complete model of the object cannot be formed [7].

The problem of robotic grasping is also very suitable for machine learning, as it is very difficult to code hard rules to describe what a good grasp looks like. Therefore it is necessary to define what kind of sensors and other inputs should be used, what features to extract from them, and what kind of training data and learning method should be used [8].

However, even before that, one needs to look in depth at what grasps are and analyze what kind of result the classification algorithm should produce.

2.3 Types of grasps

Many different types of holds for constraining objects have been researched. The early work is based on mechanics and kinematics, later introducing force control and dynamics. [6]

A grasp is essentially the placement of finger contacts such that the object is fully constrained. The finger point contacts can be of several primitive kinds. All more complex types of finger contacts can be described based on these primitives. Two of the earliest grasp types researched are form-closure and force-closure grasps.

The movement of an object can be partially or completely prevented by contact with other surfaces. If that contact is maintained by applying force via these surfaces, that is force-closure. If the contact can maintain itself without regard for additional force applied, then the grasp is form-closure. The origins of these concepts date back to the father of kinematics, Franz Reuleaux [9].

There are three basic types of finger contacts. Frictionless point contacts are the first of the three primitive contact types. They are such that the finger exerts a force normal to the surface through the contact point. The second kind, hard-finger contact, is a point contact with friction, and the force is exerted into a friction cone describing the wrench convex.

The last type, soft-finger contact, is an area contact that also allows the finger to apply torque around the normal at the contact point. [10]


2.3.1 Form-closure Grasps

A form-closure grasp is achieved when the object is enclosed by the fingers such that it cannot move. The mechanics of form-closure were first explored by Lakshiminarayana [11].

Furthermore, he shows that contacts should be applied along the surface normals and constructs a model for total restraint based on a minimum number of contacts with form-closure.

2.3.2 Force-closure Grasps

Force-closure grasps are such that the object is completely held by the fingers' contact points exerting force on the object. Nguyen [10] [12] presents a method for finding different types of finger placements on polygonal objects that form a force-closure grasp.

Nguyen further extended this to 3D grasps. He also proves that all 3D force-closure grasps can be made stable.

The work on force-closure grasps on polygonal objects is extended to curved objects by Ponce [13]. This work only considers two-fingered grasps.

2.3.3 Equilibrium Grasps

Cutkosky further evolves the framework of describing grasps in the context of robots.

Grasping an object in a stable fashion requires an equilibrium of forces from the fingers. This involves taking into account such things as the stiffness of the object, friction, weight, fragility of the object and how well the object size fits the geometry of the gripper. Cutkosky creates a model for describing the kinematics of grips for general gripper configurations in the presence of a variety of forces. [14]

2.3.4 Stable Grasps

Lakshiminarayana indicates that the lower bound on the number of fingers needed for a form-closure grasp is 4 fingers in 2D and 7 fingers in 3D. There has been work to reduce this. Hanafusa and Asada [15] study stable grasping with a three-fingered hand with the fingers in even 120° spacings. Baker et al. [16] show this is not enough for stable grasping in the absence of friction, and a further degree of freedom in moving the fingers is required. This allows placing the fingers such that the force is exerted along the normal at the contact point, which is required to obtain stable grasps.

2.4 Difficulties in Robotic Grasping

There are multiple levels of difficulty in creating a robust grasping algorithm for a robot. The first problem is that robot hardware usually only enables limited forms of grasping. Most common is a gripper capable of pinch grips. Multi-fingered robot hands are starting to become more common in research. Having several fingers enables more complex grasp types, but it also introduces complexity in mapping grasp affordances into suitable grasps for a multi-fingered gripper.

Additional problems are posed by objects that are not directly graspable. An example could be a thin disc-like object lying flat on a table. It is not easily graspable from any point. An obvious multi-stage solution is to move the object to the edge of the table and grasp it there, where a graspable location is visible.

Another case is where the object should be grasped in a way associated with the specific kind of object. This requires identifying the object, which can be difficult depending on the orientation of the object. One option here is also to use a multi-stage plan, where the robot first manipulates the object to identify it until a satisfactory recognition is accomplished, and only then is the object grasped. [17]

These types of complex grasps requiring multi-stage planning are not considered in this work.

2.5 Affordances

Affordances are an old concept in the social and behavioral sciences. The word was originally coined by Gibson [1] to signify utilitarian functions of objects as perceived by humans and other living organisms. For example, a rigid flat surface affords standing on it. Another example could be that the empty space of a doorway affords passage through it. As an example aligning this idea with robotic grasping, the handle of a tool affords grasping and using the tool.


This idea was further developed by many social scientists. Warren's stair climbing experiments found that affordances are perceived at body scale. His experiments showed that taller humans judge a taller stair height to still afford climbing, compared to a shorter control group. [18]

Grasp affordances follow this concept to signify locations where the object affords grasping. In line with the original definition, affordances are utilitarian locations on an object that afford use of the object. Most current research focuses on grasp affordances purely from the perspective that they are locations where the object can be grasped, i.e. the utility is picking up the object.

After the problem of grasping objects has been solved, the more advanced goal is to find the affordances that allow the robot to achieve its objective. For example, a tool should be picked up by its handle so it can be used. One can grasp a hammer by its head; however, to use it effectively as a tool, it should be grasped by the handle.

There has been some work focusing on this aspect of affordances. The vast majority of current research on grasping of novel objects is still aiming at perfecting the ability to grasp objects at all.

The idea of grasping objects from useful locations is embedded into the training data, teaching the robot to pick an object up from the most convenient location, which often happens to be the part designed for grasping, i.e. the handle. One example of such work with grasp affordances is DeGranville [19], who teaches the robot how to grasp objects by human demonstration.

2.6 Grasp Affordances

The most common approach is to try to find grasp affordances. In this study, grasp affordances are considered purely from the perspective of points that afford grasping and picking up an object. The most common approach to finding grasp affordances is to find visually salient locations that are suitable for stable grasping. The robot then has to find these graspable locations on novel objects and choose one of them to pick up the object from, without needing any previous data on what kind of object it is.

Grasp affordances are points describing a potential location where the robot can attempt a grasp. Creating grasp affordances from stereo vision with machine learning allows the robot to grasp previously unseen objects for which a 3D model is not available. This is also a configuration that is very close to how humans do this task, which is a good direction to try, since mimicking nature is often a good starting point for developing methods for machines to do the same thing.

Some success has been had in grasping objects based simply on the 2D intensity image. This method has its limitations, and most modern methods use some way of generating a depth map for the object, whether by a dedicated depth sensor or stereo vision. Other less successful routes include the creation of a 3D model and analyzing this model for stable grasps. Unfortunately, current methods for creating the models are not accurate enough and lead to more complexity with worse results than simpler methods. Analyzing 3D models also has problems with novel objects.

2.7 Grasping with 2D intensity image

Grasping objects purely based on 2D intensity information is a very difficult task. Grasping is a naturally 3-dimensional activity, so depth information is often very valuable. Still, some methods for pure 2D grasping have been developed, and they often provide value as additional features in combination with depth-based features.

One method is to use edge detection to find the outline of the object, then grasp based on the outline [20]. If points on the edges are selected to form a good grip on the outline, there is a very good likelihood of obtaining a stable grasp on the object itself. Analytical work on grasps in 2D has found very good ways of modeling grasps on 2D polygonal objects, taking into account a mechanical model to produce stable grasps [12].

Another method of grasping is simplifying objects to primitive shapes. A bounding shape, such as a square or circle, can be fitted to the outline of the object, and the grasp is then based on the fitted shape. Piatt [21] performs grasping by fitting an ellipsoid to the shape of the object, and assuming new objects with similar ellipsoids should be grasped similarly.

Usually 2D approaches involve finding local visual features that describe good grasp locations. Chinellato et al. [22] learn grasps with special attention to the hand kinematics (in their case, a Barrett robot hand). They use features from the 2D shape of the object to learn stable force-closure grasps.

Grasping by shape can work well in controlled environments. Bowers and Lumia [23] use a fuzzy logic expert system to learn grasps for shapes which can be perfectly detected from the 2D image, as they use a dark table combined with bright objects, with no overlap allowed. Their fuzzy grasping system achieves a 100% success rate in their experiments.

Unfortunately, these methods, when used in isolation, are very susceptible to background texture, and cannot distinguish between an object and a picture of an object. 2D grasping works well only for simple planar objects. This is why 2D grasping is mostly used in concert with other methods, as one of the features. Usually depth information is required to grasp objects in complex environments.

2.8 Grasping with 3D model

Use of a 3D model generated from some kind of depth sensor (e.g. stereo camera, laser depth sensor) provides interesting opportunities. If a 3D model of an object can be constructed, the search for the grasp can be done on the 3D model.

With an accurate 3D model, finding two parallel surfaces on opposite sides of the object that allow for stable grasping is possible. One can simulate the properties of grasps to find stable grasp solutions. [10]

Unfortunately, most methods of building a 3D model of the objects are fairly heavyweight computationally, and often do not produce a very accurate 3D representation, especially if the object is examined from only one angle. This is why pure 3D analysis provides suboptimal results for robotic grasping.

One way to improve the accuracy of 3D grasping is to examine the model from multiple angles [17]. This can be performed either by interactive manipulation of the object, or by a mobile robot moving around the object and reconstructing a better 3D model of it, using techniques familiar from visual SLAM (Simultaneous Localization and Mapping).

Unfortunately, moving around the object is not always possible. The object could lie against a wall, for example. This method works best if the robot can obtain an intermediate grasp on the object to be able to examine it. Due to those problems, examining the object is best used as an addition to improve the 3D data in a combination method, if the initial grasp fails.


2.8.1 Depth Segmentation

One interesting method that was not considered in this work is segmentation. There has been recent work in grasping with depth segmentation. Rusu [24] performs grasping and manipulation based on segmenting stereo point clouds into planes.

Le [25] uses depth segmentation to find shapes for initial grasp candidates, which are then grasped using a simple strategy. Even this simple strategy achieves good results in their tests, given a high-quality segmentation.

In this study, depth segmentation was not considered because the stereo camera used is unlikely to produce a good quality segmentation.

2.9 Grasping by Shape

Grasping based on general graspable features, instead of object identification, produces better results. One method to do this is to envelop the object by primitive shapes in 2D, or bounding volumes in 3D, and compute a stable grasp based on the primitive shape.

Huebner [26] devised a method to perform grasping based on minimum bounding volumes, which are further decomposed into smaller bounding boxes to better approximate the shape of the object.

Huebner’s subdivided bounding volume approach allows grasping from different points on the object, and performing different grasps based on tasks. The largest bounding box could be grasped for a good grip to move the object, grasping the smallest bounding box allows accomplishing tasks such as showing the object to a camera, and finally grasping by the outermost bounding box allows handing the object over to another actor (human or robot). [26]

2.10 Grasping of Novel Objects

A common distinction is between grasping familiar and unfamiliar objects [7]. With familiar objects, the robot can be assumed to have a 3D model of the objects, which enables accurate analysis of stable grasps. The only issue is matching the objects seen by the robot with the model.

For many tasks, grasping novel objects never seen before is a requirement. For a general purpose robot there is simply no way to program in all the items it would have to grasp, e.g. a model of all the cutlery in the world for a dishwasher-emptying robot.

Grasping of novel objects is a more interesting problem, as it is also applicable to grasping known objects. The additional 3D model data could simply be used to improve the accuracy of grasp generation.

The requirements of practical applications for robots lead to the current research in grasping of novel objects, where the object is assumed unknown, and the robot simply finds grasp affordances from which it can grasp the object.

2.11 Learning Grasps

Writing hardcoded rules for finding the grasp is inefficient. This may be suitable for industrial robots with very fixed operating scenarios, but even there the problem has such a large number of variables that applying a machine learning approach is the only reasonable option. Machine learning simply produces better results.

The problem of robotic grasping is very suitable for machine learning, as it is very hard to code hard rules describing what a good grasp looks like. This section describes the use of SVM for learning grasp affordances.

k-Nearest Neighbor (k-NN) is possibly the simplest machine learning method available. In its basic form, it is a method where the k nearest training samples are used to classify the result [8]. k-NN voting has been used in robotic grasping for predicting the quality of grasps [27]. This method works for simple scenarios; however, better machine learning methods should be applied for operation in complex cluttered environments.
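As a minimal illustration of the idea only (not the method used later in this work, and with purely hypothetical feature vectors and quality scores), k-NN voting for grasp quality could look like the following sketch:

    import numpy as np

    def knn_grasp_quality(query, train_features, train_scores, k=5):
        """Predict grasp quality as the mean score of the k nearest training samples."""
        dists = np.linalg.norm(train_features - query, axis=1)  # Euclidean distances
        nearest = np.argsort(dists)[:k]                         # indices of the k closest samples
        return train_scores[nearest].mean()                     # simple k-NN voting by averaging

    # Hypothetical toy data: 100 labelled grasps with 20-dimensional feature vectors.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(100, 20))
    scores = rng.uniform(0.0, 1.0, size=100)
    print(knn_grasp_quality(features[0], features, scores, k=5))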

2.11.1 Support Vector Machines

Support Vector Machines (SVM) are a machine learning method originally devised for two-class categorization problems. An SVM tries to find the optimal hyperplane separating the two classes' feature vectors, using a training set containing pre-labelled feature vectors. [28]

Support Vector Machines have been extended to handle multiple classes, and have become one of the de facto machine learning methods [8]. In this work, a ranking SVM with the NDCG cost function is used for learning the best grasp based on the features used.

Ranking SVM is a method for using SVM to learn a ranking of the samples, instead of simply classifying them [29]. Its original use is in search engines, for learning the ranking of search results. It is also very suitable for our task of learning to find the best grasp location.

NDCG is an improved cost function for ranking SVM. It emphasizes the ordering of the highest ranking results [30]. This means it produces better results when only the top results are important.
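For reference, a common formulation of NDCG over the top k results is shown below; the exact gain and discount functions used in [30] may differ in detail.

    \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}

Here rel_i is the relevance label of the result at rank i, and IDCG@k is the DCG@k of the ideal ordering. NDCG@k therefore lies in [0, 1] and weights mistakes near the top of the ranking most heavily.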

2.11.2 SVM for Learning Grasps

Recently it has been found that traditional search algorithms are insufficient for finding solutions to the grasping problem. Instead, machine learning should be used. Machine learning is also very suitable for generalizing grasping to novel objects, instead of just pre-programmed ones.

Support Vector Machines were used by Pelossof [31] to classify grasps. They used the simulation framework GraspIt! [32] to generate simulated training and test sets using superquadrics. The gripper used was a three-fingered Barrett robot hand. They had the machine learning algorithm learn the connection between the grasp parameters, the parameters used to generate the superquadric, and the quality measure of the grasp. This allowed their method to find a good quality grasp for any superquadric shape.

SVM is very suitable for learning to identify grasps on novel objects. The implemented method is described later in this study, and in that method SVM is used to classify grasps.

2.12 Training methods for Machine Learning

Grasping is most suitable for supervised learning methods. The biggest question is how to produce the data and labelling of the training set. Usually an image of graspable objects has such a large number of grasp candidates that it is not feasible to have the robot try them all to produce training material.

2.12.1 Human imitation

One possible method of teaching is to have a human demonstrate the grasps, which can give very good positive examples, and is very intuitive for the teacher to perform.

There are two main ways in which teaching by example can be done. The simplest and most direct to implement is to manually operate the robot to the correct grasp, which is then saved [33]. This is also called human teleoperation.

The other alternative is to use some kind of sensor to record the grasp as made by a human hand, and map this to the robot hand. This has the benefit of being very easy to understand for the human trainer, and the robot can attempt to imitate the human grasps.

Both recording human grasps and robot teleoperation are examined as teaching methods by DeGranville et al. They examine using these methods for learning probability density functions for grasps. They also examine learning the grasp actions available on objects, in other words grasp affordances. [19] [34]

Naturally there are also problems with a robot imitating a human. The largest issue is the difference in grasp types, as humans can perform a wide variety of different grasps, whereas robot grasping usually focuses on one specific type of grasping, e.g. force-closure.

One current research direction is to use force control to allow easily manipulating the robot into desired positions. One application of force control is using force and torque sensors in the robot arm to allow directly manipulating the robot by e.g. pushing it [35].

This allows the human operator to teach the robot by applying a small suggestive force to the robot arm, making it faster to move the robot into grasp positions and thus speeding up teaching.

The biggest problem with teaching via mimicry is the low volume of examples produced, as each demonstrated grasp only provides a single labeled training sample. Negative examples are also hard to produce. This is why alternative methods have been actively researched.


2.12.2 Reinforcement learning

Reinforcement learning is a general machine learning method where the learner performs actions and tries to optimize a reward function associated with those actions [36]. It is an exploratory learning method where the intent is to try different approaches, and constantly integrate feedback from those trials to improve future attempts.

Reinforcement learning has been tried for having the robot learn by itself by trying grasps. Usually some basis still needs to be provided by other methods, from which the robot can then improve by trying varying grasps. Grasping is such a complex task that purely random exploration will not produce results. For reinforcement learning, starting with "tabula rasa", a blank slate with nothing learned, is usually not efficient, as arriving at even a single successful grasp with no initial training is unlikely. One of the other grasping methods can be used to seed the learning, after which reinforcement learning can improve the success rate of the result. As an example of reinforcement learning, Hsiao [37] has used Partially Observable Markov Decision Processes (POMDP) to significantly improve control over grasping.

2.12.3 Simulated training

Simulation of grasps and training data significantly speeds up the process of implementing and improving grasping. Such tools are especially important for a topic as complex as grasping novel objects.

GraspIt! is one such simulator framework that contains many tools for simulating grasps.

GraspIt! includes models of several different types of grippers, simulation of grasping with visualization of weak points, tools for offline analysis of the quality of grasps and several other features that help with the process of designing and implementing robotic grasping. It can also integrate with an actual robot and work as the control framework for grasping [32].

Simulation of the training material has been tried in several ways. Saxena uses raytraced sample objects with labelled regions to generate simulated training and test data sets [7].

This method has the benefit of producing a very comprehensive training set with relatively little effort. The raytraced images are also very high resolution and have high accuracy, allowing the learning algorithm to focus on the significant features. The simulated learning results still correlate well with real-world performance.

The big question is how well the simulated training maps to the real-world task. Simulated training material is not perfectly comparable to objects in real-world environments, and does not include the noise that real 2D images and reconstructed 3D models have. It is also not possible to physically try the grasps, so the labeling relies completely on the intuition of the maker of the training material.

2.12.4 Human Labelling

A training set with human-placed labels is the simplest and also the most robust method, as it is hard to do better than the classification accuracy of a human. Unfortunately, it also takes the most time, which is why there is a lot of research into avoiding it. The work can be alleviated a little by having good tools for visualizing the grasp candidates being labelled. Using actual pictures from the sensors and labeling the grasp candidates produces the most accurate result; however, it is also very time-consuming and painstaking work, as the human operator must evaluate each grasp candidate and decide whether it is suitable for grasping.

Using this method, it is possible to run the robot in unclear cases to verify the suitability of the grasp candidate for actual grasping. This requires that the scene can be restored to its previous state afterwards, as otherwise more samples cannot be tried if the object moves.

This can be achieved by careful marking of object positioning.

2.13 State of the Art

Current state-of-the-art methods use a depth map and combine salient intensity and depth features to define grasp locations. Additionally, features that can distinguish the stability of the grasp are added. Finally, machine learning methods are applied to pick the most effective grasp.

The method chosen as best for implementation is a method by Le [3] developed on the Stanford STAIR2 robot. It expands upon previous work by Saxena at Stanford, where they trained a grasping system based on 2D edge features and depth features with simulated training material to perform grasping. Their system produces a single grasp location, and the path planner figures out how to perform the grasp [38] [7]. Later the system was extended to more cluttered environments and included a machine learning system for judging grasp quality and stability [39].

Le's method was chosen because it offers significant improvements in accuracy over Saxena's method. Le [3] reports a mean success rate of 69% in tests of the method from Saxena [39], compared to a mean success rate of 82% for the method presented by Le.

Le [3] extends Saxena's work by introducing several new visual features. He also chooses as the classification target a set of contact points, one for each finger of the gripper. This choice removes most of the complexity from the path planner. Having the contact points available to the learning algorithm allows the classifier to learn which pair of points is the best candidate, instead of just selecting a potentially good location in the image.

In addition, classification with SVM is improved by utilizing a ranking cost function.

Normal classification with SVM usually aims to assign the right class to each feature vector. Ranking classifiers instead place more value on the highest ranked results, such that, for example, only the k = 10 top results are ordered correctly and the highest value is placed on getting the top result right. This is commonly used for web search engines and similar applications, because the user cares most about having the top results correct. This also applies to grasping, as only the highest ranked grasp is significant, as it is the one actually used to grasp the object.

The cost function used by Le [3] is NDCG (Normalized Discounted Cumulative Gain). NDCG is a ranking cost function that prioritizes the ordering of the top k results, applying decreasing importance to results further away from the top result [30].

Le's grasping algorithm is shown at a high level in Algorithm 1. The workflow is that of a basic machine vision system [40], where data is gathered first from sensors and processed into suitable features. The features form a feature vector, which is then run through the trained classifier to generate the grasp [8]. Finally, the grasp is executed by the robot.

Algorithm 1: Chosen Method
1. Take a picture of the grasping area with the camera.
2. Compute the point cloud and depth map from the depth triangulation sensor.
3. Perform edge detection to find grasp candidate points.
4. Compute all triples of candidate points.
5. Perform feature extraction for the candidate triples. Extract the angle, distance, discontinuity and sphere features.
6. Classify the feature vectors with the trained ranking SVM.
7. Select the top-ranked triple as the grasp affordance.
8. Compute a trajectory to perform the grasp.
9. Execute the grasp.

The robot used for implementation in Le's paper is the Stanford STAIR2. It has a 7-DOF (Degrees of Freedom) Barrett robot arm and a three-fingered 4-DOF hand. For depth data, the robot uses an active triangulation sensor with a laser projector and a camera to obtain very dense depth maps. As the robot has a three-fingered hand, candidate triples are used to describe the grasps.

The features used include angle features, distance features, discontinuity features and sphere features. These are all explained in detail when describing the implementation done for this study in Section 3. Some additional features are also used, such as raw template and depth data, but these affect the performance of the algorithm only slightly [3].


3 IMPLEMENTATION

The implemented method is a variation of the method presented in [3]. Some changes had to be made due to the different configuration of the robot and the use of stereo vision for depth perception. The features used are largely the same as in Le's paper. Some changes were introduced to take advantage of specifics of our robot configuration, as this allows overcoming the problems caused by the additional limitations it places. This is explained in greater detail in Section 3.7.

The grasping algorithm is shown at a high level in Algorithm 2. It follows the same general machine vision system structure as Le’s method described in Algorithm 1.

Algorithm 2: Grasping Algorithm
1. Take a picture of the grasping area with the stereo camera.
2. Compute the point cloud and depth map from the calibrated stereo pair.
3. Filter the depth map.
4. Perform edge detection to find grasp candidate points.
5. Compute all pairs of candidate points, then take a subset of those pairs as candidates for classification.
6. Perform feature extraction for the candidate pairs. Extract the angle, distance, discontinuity and sphere features, among a few others.
7. Classify the feature vectors with the trained ranking SVM.
8. Select the top-ranked pair as the grasp affordance.
9. Compute a trajectory to approach the grasp from straight above.
10. Execute the grasp.

There are a few differences from Le's original implementation. First, the sensor and robot hand are different. The stereo camera produces a less dense depth map than the active depth sensor used by Le. The robot hand is a two-fingered gripper, so instead of candidate triples, candidate pairs are used. Second, additional filtering is performed to get the stereo camera depth data to a level usable for feature extraction. In addition, due to the less accurate depth data, the generation of candidate pairs from edge detection also requires filtering the number of candidate pairs. Path planning had to be done to grasp from above due to the different robot configuration.

The features are largely similar. The additional features mentioned in Le's paper, like raw depth data, are not used, as they were not described in detail in Le's paper. In some cases a little interpretation had to be done for the implementation of the features, such as the counting of the number of discontinuities. For the distance features, the distance between the points and the robot base was added due to the different robot configuration. A new sphere feature was added to detect collisions between the base of the gripper and objects, to avoid overreaching. Also, the height from the table surface was considered, as this is implicitly known in our robot setup.

The implementation has a total of 42 features. Table 1 shows the number of features split by type.

Table 1. Number of features by type.

Angle Features (depth)       12
Angle Features (intensity)   12
Discontinuity Features        6
Distance Features             7
Sphere Features               3
Height Features               2
Total Dimensions             42
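As a sketch of how the 42-dimensional feature vector of Table 1 could be assembled, with hypothetical placeholder functions standing in for the feature computations described in the following sections:

    import numpy as np

    # Placeholders for the feature computations of Sections 3.4.1-3.4.4; each returns a
    # zero vector with the dimensionality listed in Table 1.
    def angle_features(pair, image):                    return np.zeros(12)
    def discontinuity_features(pair, depth, intensity): return np.zeros(6)
    def distance_features(pair):                        return np.zeros(7)
    def sphere_features(pair, cloud):                   return np.zeros(3)
    def height_features(pair):                          return np.zeros(2)

    def build_feature_vector(pair, depth_map, intensity, point_cloud):
        """Concatenate the per-pair features into one 42-dimensional vector (Table 1)."""
        vec = np.concatenate([
            angle_features(pair, depth_map),                     # 12, from the depth image
            angle_features(pair, intensity),                     # 12, from the intensity image
            discontinuity_features(pair, depth_map, intensity),  # 6
            distance_features(pair),                             # 7
            sphere_features(pair, point_cloud),                  # 3
            height_features(pair),                               # 2
        ])
        assert vec.shape == (42,)
        return vec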

Training material was created using our own approach, as the chosen paper does not go into great detail about the training material used or how it was created.

This section explains the implementation in detail, first going through the hardware setup used, then explaining the sensors used and the data filtering done before feature extraction. The features are explained one by one. After this, the classifier and training methods are explained, as well as how the training set was labelled. Finally, some limitations of the system are addressed.

3.1 Robot Hardware and Configuration

The robot configuration differs slightly from the one used in the chosen paper. Their robot was mobile, with the camera and arm mounted on the robot. The robot used in this work consists of a Melfa industrial robot arm mounted on a table and a camera on a stand at the side of the table. Objects to be grasped lie on the table. The camera was moved to slightly different locations in the same general direction from the robot arm.

The robot is a table-mounted MELFA RV-3SB robot arm with 6 DOF (Degrees of Freedom). The grasping area is next to the robot on the table. A picture of the robot setup in our lab can be seen in Figure 1.

Figure 1. Picture of the robot setup.

The gripper is a Weiss Robotics WRT-102, consisting of Weiss tactile sensors attached to a Schunk PG-70 gripper. The tactile sensors were not used for grasping in this work. A closer view of the gripper is visible in Figure 2.

The stereo camera used is a Bumblebee 2 by Point Grey, a FireWire-attached stereo camera capable of capturing 648x488 stereo image pairs. The camera and stand can be seen in Figure 3. The open-source library OpenCV was used for computation of the depth map and point cloud from the stereo images. The original work used an active depth sensor to obtain a denser depth map, but the method was adaptable to the less precise depth readings from the stereo camera.

Figure 2. PG-70 gripper.

Figure 3. Bumblebee stereo camera.

3.2 Grasping Area and Objects

For training material, a variety of graspable and non-graspable (too large) objects were used to provide good coverage in the training set. In Figure 4 a variety of objects can be seen. The DVD boxset is an example of an object too large to grasp, while the stamp box, salt box and Rubik's cube are all graspable.

Figure 4. Examples of objects.

3.3 Sensors and Calibration

As input data for the classifier, the image from the stereo camera was used. More specifically, OpenCV was used to create a depth map and point cloud from the calibrated stereo images, as well as a rectified intensity image. The classifier calculates its features from the rectified intensity image, the depth map and the point cloud. Furthermore, the camera and robot were calibrated every time the camera was moved, so that 3D coordinates could be translated from the camera frame to the robot frame.
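A minimal sketch of this step using the current OpenCV Python API (the original implementation most likely used the OpenCV C/C++ interface of the time; the file names, the source of the Q matrix and the matcher parameters below are illustrative assumptions):

    import cv2
    import numpy as np

    # Hypothetical inputs: a rectified stereo pair and the 4x4 disparity-to-depth
    # reprojection matrix Q obtained from stereo calibration.
    left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)
    Q = np.load("stereo_Q.npy")

    # Semi-global block matching; parameter values are illustrative only.
    stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
    disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed point -> pixels

    points_3d = cv2.reprojectImageTo3D(disparity, Q)  # per-pixel (X, Y, Z) point cloud
    depth_map = points_3d[:, :, 2]                    # Z channel used as the depth map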

An example rectified image can be seen in Figure 5 (the image has been cropped for better display). The image shows a salt box and a deodorant stick on top of the grasping area.

Because the robot is table-mounted and the camera is calibrated, some further assumptions could be made, such as the fact that the grasping surface height is always known, because the robot is bolted to it.

Figure 5. Rectified stereo image.

3.3.1 Data filtering

The source data required some filtering in preparation for feature extraction. The depth map computed using stereo vision had significant holes in it, and for several of the features used, an accurate depth map was required.

Taking for example the scene from Figure 5, the corresponding depth map straight from the stereo algorithm is shown in Figure 6. The black areas correspond to holes in the depth map, and they are present around the objects' outlines, which is an important area for the algorithm. There are also a few bright areas, indicating something closer to the camera where there is nothing. This could create false grasps, so both categories of problems need to be addressed.

First, a median filter was applied to fill in any tiny holes in the depth map. This also made the depth values more stable, as our grasping method does not really need very fine details from the depth perception, only the large outlines of objects. A median filter of 7x7 size was used to obtain very smooth depth values, while still preserving edges.

The effect of this can be seen in Figure 7. Most of the small discrepancies, especially in the grasping area, have disappeared. The table and objects form a smooth depth surface, with some holes remaining.

Figure 6. Depth map from stereo camera.

To fix the blank areas remaining after median filtering, the depth values were interpolated horizontally line by line, so that for each gap in the depth map, the bordering depth value that was further away was used for all the points in the gap. This leads to strong preservation of edges, which is important for the grasping algorithm, as edge points are used for contact points.

After interpolation, the depth map in the grasping area is mostly smooth, as seen in Figure 8. There are some small parts of the depth map where the table surface was not detected correctly. While these errors are not desirable, the issues can be handled by the classifier, as the object edge is still preserved.

It was also tested whether linearly interpolating the depth values would work. This was abandoned, as it masks edges between the object and the background, causing for example the discontinuity feature to not register any discontinuities.
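A sketch of the filtering described above, under the assumptions that the depth map is a float array in metres, that holes are marked with the value 0, and that "further away" means a larger depth value; scipy's median filter is used here purely for convenience and is not necessarily the routine used in the implementation:

    import numpy as np
    from scipy.ndimage import median_filter

    def filter_depth_map(depth, invalid=0.0):
        """7x7 median filter followed by horizontal hole filling that keeps object edges:
        each gap is filled with the farther (larger) of its two bordering depth values."""
        out = median_filter(depth.astype(np.float32), size=7)
        for row in out:                              # process each scanline independently
            bad = row == invalid
            if not bad.any():
                continue
            edges = np.diff(bad.astype(np.int8))
            starts = np.flatnonzero(edges == 1) + 1  # first invalid pixel of each gap
            ends = np.flatnonzero(edges == -1) + 1   # one past the last invalid pixel
            if bad[0]:
                starts = np.r_[0, starts]
            if bad[-1]:
                ends = np.r_[ends, row.size]
            for s, e in zip(starts, ends):
                left = row[s - 1] if s > 0 else np.nan
                right = row[e] if e < row.size else np.nan
                fill = np.nanmax([left, right])      # prefer the farther border value
                if not np.isnan(fill):
                    row[s:e] = fill
        return out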


Figure 7. Depth map with 7x7 median filtering.

3.3.2 Selection of Candidate Pairs

The candidates for classification were selected by performing edge detection on the processed depth image. Due to the median filtering used and the quality of the depth map, spurious edges were often also detected on the table surface. As can be seen in Figure 9, edge detection produced a lot of points where no object edges exist.

The large number of pairs generated by taking all combinations of all the edge points required some fast filtering to cut the dataset down to a manageable size.

In fact, Figure 9 shows 1727 points. The formula for the number of unique pair combinations is

N_{\mathrm{pairs}} = \frac{N_{\mathrm{points}}(N_{\mathrm{points}} - 1)}{2} \qquad (1)

where N_{\mathrm{points}} is the number of points. With 1727 points, Equation 1 gives a little under 1.5 million pair combinations.

Figure 8. Final interpolated depth map.

Figure 9. Image showing edge points detected.

The average feature extraction time per pair was roughly 60 milliseconds after optimizations, which would make the total time over 24 hours. This is too much for one scene, especially considering that more complex scenes may have closer to 10000 edge points, which would put the number of pairs near 50 million, already over a month of computation time.


Due to those problems, it was decided to implement some initial filtering of the grasp pairs before using them for classification. The worst grasps were removed based on their 3D location if they were too close to the table surface or outside the effective grasping area.

Finally, if too many grasps remained, a large random sample of the remaining grasp candidate pairs was taken as the representative candidates for classification. This was found not to noticeably degrade the classifier performance if the random sample was large enough that it was statistically likely to contain a good selection of usable grasps. A sample size of 10000 candidate pairs was used for classifying the test set. It was tested whether using a sample size of 25000 candidates would make any difference, and there was no change in the classification accuracy with the larger sample size.
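A sketch of the candidate selection under assumed conventions (edge points already transformed to robot coordinates with z as height above the table, a 1 cm minimum height, and the 10000-pair sample size mentioned above):

    import numpy as np
    from itertools import combinations

    def sample_candidate_pairs(points_robot, min_height=0.01, max_pairs=10000, seed=0):
        """Drop edge points too close to the table, form all point pairs (Equation 1),
        and keep a random sample of at most max_pairs pairs for classification."""
        pts = points_robot[points_robot[:, 2] > min_height]   # filter by height above table
        pairs = list(combinations(range(len(pts)), 2))        # N*(N-1)/2 candidate pairs
        if len(pairs) > max_pairs:
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(pairs), size=max_pairs, replace=False)
            pairs = [pairs[i] for i in idx]
        return pts, pairs

    # Hypothetical usage with random points in a 40 cm cube above the table.
    points = np.random.default_rng(1).uniform(0.0, 0.4, size=(500, 3))
    pts, pairs = sample_candidate_pairs(points)
    print(len(pairs))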

3.4 Features

There were several categories of features used to create the feature vector for the classifier.

These features could be categorized as angle features, discontinuity features, distance features and sphere features.

The angle features consider the stability of the grasp by calculating the angle of the edge at the contact point, and calculating the difference between that angle and the angle of the line connecting the contact points. Discontinuity features try to ensure that a single object is grasped, from its portion closest to the camera. Distance features ensure the distance between the contact points is within graspable limits, and allow preferring the optimal grasp width.

Finally, sphere features calculate the number of point cloud points blocking the robot hand and gripper, to make sure there is nothing in the way that the robot hand could collide with. Sphere features also easily remove grasps that are detected too close to the table, as the table surface is in the way.

The following sections explain these features in detail.

3.4.1 Angle Features

The most important new feature of the method in [3] is the angle feature. This feature is calculated by taking a histogram over the gradients of a template patch (size 10x10 pixels) near the contact point, and selecting the two most significant angles present in

(37)

the gradient. The gradient field is calculated with multiple edge detection algorithms:

Sobel, Prewitt and Roberts. The feature is taken for both the intensity and depth images separately. The gradient is divided into 36 bins with 10 width for the histogram.

These detected angles are then normalized by taking the difference between the angle and the line connecting the two contact points of the grasping pair. This provides a strong representation of the stability of the grasp. If the angles at both ends of the grasp pair are linear to the line connecting the contact points, this means the grasp points are on perpendicular surfaces, which inherently produces stable grasps.

Figure 10 shows the angle feature visualized. The blue square shows the 10x10 template at the end of the purple line connecting the contact points. The green lines are the most significant angles in the histogram of the gradient, and the yellow lines are the second most significant. As can be seen, the green lines are nearly parallel with the line connecting the contact points, detecting the fact that the contact points are on two sides parallel to each other, so the pair is a good candidate for a stable grasp. The second most significant angle at both contact points detects that the grasp is near the top of the object, so this grasp is not quite perfect, as it is more likely to slip. A better grasp would be lower down the object.

Figure 10. Visualization of the angle features.

As an example, Figure 11 shows the gradient for the template of the right contact point from Figure 10. The gradient is calculated with the Roberts edge filter. The dominant gradient angles point towards the right, tilted slightly upwards. The histogram is shown in Figure 12, clearly showing the two dominant peaks.

Figure 11. Gradient of the angle template.

Figure 12. Histogram of gradient angles in the template.

3.4.2 Line Discontinuity Feature

The line discontinuity feature distinguishes grasps that are not on the same object from ones that are. The intensity and depth values along the line between the contact points are sampled, and some statistics on these values are calculated.


First, the standard deviation of the intensity and depth values is taken. Second, the difference between the minimum and maximum values is computed. Third, any discontinuities larger than a set threshold are counted. For strong grasps the number of discontinuities should be zero and the depth should deviate as little as possible, which ensures that the grasp is on a single object and is the leanest grasp available.
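As a rough illustration, assuming a depth map depth in millimetres and contact points p1 and p2 given as [row column] pixel coordinates, the statistics could be computed along the lines of the following sketch; the 20 mm jump threshold and the variable names are illustrative only.

    nSamples = 50;                                    % samples along the line
    rows = round(linspace(p1(1), p2(1), nSamples));
    cols = round(linspace(p1(2), p2(2), nSamples));
    vals = depth(sub2ind(size(depth), rows, cols));   % depth profile on the line
    devFeature   = std(vals);                         % deviation along the line
    rangeFeature = max(vals) - min(vals);             % min-max difference
    discontinuityCount = sum(abs(diff(vals)) > 20);   % jumps above the threshold

The same statistics are computed from the intensity image as well.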

The discontinuity feature distinguishes grasps where the contact points lie on two separate objects, grasps where another object is in front of the object holding the contact points, and grasps where the points are not on any object at all.

Imagine a grasp where the two grasp points are on the table, off the left and right side of the object, so that the connecting line passes over the object. Figure 13 shows the depth plot for such a grasp. The object can be clearly seen as two significant discontinuities. A solid grasp on a simple object usually has no discontinuity to speak of. If there were multiple objects, all their edges would be visible.

Figure 13. Plot of depth values on the line between grasp points.

One potential improvement would be to evaluate the magnitude of the discontinuities as a continuous value instead of a thresholded count. A second possible improvement would be to distinguish between positive and negative discontinuities.

3.4.3 Distance Features

The distance features are calculated by taking the 2D and 3D Euclidean distances between the contact points, meaning the second norm, or simply the "straight line" distance between the points. Additionally, both the 2D and 3D Manhattan distances between the contact points are taken. The Manhattan distance is the distance between the points measured along the axes, also known as the taxicab distance.

Furthermore, the Euclidean distance feature is normalized so that zero lies at the best grasp width of the robot. In the case of the MELFA robot with the Weiss Robotics PG-70 based gripper, the optimal grasp width is around 60mm, with viable grasps between 10mm and 72mm. The formula for the normalized distance is

d = \lVert d_{pair} - d_{optimal} \rVert \qquad (2)

where d is the distance feature, d_{pair} is the Euclidean distance between the points and d_{optimal} = 60mm is the optimal gripper width. [3]
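The pair distances can be written compactly; the sketch below assumes 3-D contact points P1 and P2 in millimetres and their 2-D pixel locations q1 and q2 as row vectors, and it further assumes that the 3-D Euclidean distance is the one normalized against the optimal width. All names are illustrative, not the actual implementation.

    dOptimal = 60;                                    % optimal gripper width, mm
    d3dEuclidean = norm(P1 - P2);
    d3dManhattan = sum(abs(P1 - P2));
    d2dEuclidean = norm(q1 - q2);
    d2dManhattan = sum(abs(q1 - q2));
    widthFeature = abs(d3dEuclidean - dOptimal);      % zero at the optimal width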

Figure 14 shows a contact point pair and how the 2D Euclidean distance and 2D Manhattan distance are measured. The 3D distances are not shown, but they are analogous, simply extended to the third dimension.

Figure 14. Example of Euclidean and Manhattan distance for a point pair.

In addition to the distance between the contact points, the distances from the contact points to the robot base and to the camera are calculated. This makes it possible to prefer grasps that lie in the area where the robot can grasp best, as well as objects that are closer to the camera.

Le [3] discards the height from the table to make the method more universal; however, in our configuration the table height is always known due to the required calibration between the table-mounted robot and the stereo camera. For this reason the distance from the table surface is used to distinguish grasps that are too close to the table.

3.4.4 Sphere Features

The sphere features are calculated by making a point cloud of the depth map and calculating the number of points within a sphere. The method used defines one sphere feature, taken just outside the contact points (moving along the 3-D line connecting the contact points). This filters out contact points blocked by something else, and also removes contact points that lie on the surface of the table, as there are points from the table surface within the sphere.

In addition, this work adds a third sphere feature to account for the bottom of the gripper. This improves results because the camera is located to the side while the robot grasps from above the table, which could lead to grasps being detected so low on the object that in practice the robot could not execute them. To keep things simple, this feature is calculated by taking the middle point between the contact points and moving upward along the Z axis by the depth of the gripper.

The location and size of the sphere features were chosen through experimentation. The location was set so that the sphere's center is 20mm away from the contact point along the line connecting the contact points. The radius of the sphere was set at 15mm so that it does not contain points from the object surface at the contact point.
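Each sphere feature then reduces to counting point cloud points inside a ball. The following sketch assumes an N-by-3 point cloud cloud in millimetres and 3-D contact points P1 and P2 as row vectors, with the 20mm offset and 15mm radius described above; the names are illustrative, not the actual implementation.

    offset = 20;  radius = 15;                          % millimetres, see above
    direction = (P1 - P2) / norm(P1 - P2);              % outward from the grasp
    center    = P1 + offset * direction;                % sphere just outside P1
    d = sqrt(sum(bsxfun(@minus, cloud, center).^2, 2));
    sphereFeature = sum(d < radius);                    % number of blocking points

The feature for the other contact point mirrors this along the connecting line, and the gripper-bottom feature uses a centre placed above the midpoint of the contact points instead.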

3.5 Classifier

Similar to the source method [3], Support Vector Machines are used to learn the classification of the grasp contact point pairs. A ranking classifier is trained with the NDCG cost function with k = 10, which focuses on getting the top 10 best grasps right while caring less about the order of the remaining grasps.

The NDCG [30] cost function places decreasing value on the correct ranking of items the further away from the top result they are. This method is commonly used in information retrieval, for example in search engines. It is also very logical in the context of robotic grasping, as the best grasp candidate is taken as the grasp to execute, and it is not considered important whether the remaining grasps are in perfect order. The remaining grasps cannot be used anyway, since the act of trying a grasp is likely to disturb the object. For this reason, if the first attempt fails, the search for a grasp must begin from the beginning.
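For reference, one common form of NDCG@k computes DCG@k as the sum over the top k predicted items of (2^{rel_i} - 1)/log2(i + 1) and divides it by the DCG of the ideal ordering. The sketch below evaluates this for one predicted ranking, assuming labels holds the ground-truth quality classes (0-2) in predicted order; it is only an illustration of the measure, not the BMRM implementation, and the names are illustrative.

    k   = 10;
    rel = labels(:);                               % grades in predicted order
    n   = min(k, numel(rel));
    pos = (1:n)';
    dcg  = sum((2.^rel(1:n)  - 1) ./ log2(pos + 1));
    best = sort(rel, 'descend');                   % ideal ordering of the grades
    idcg = sum((2.^best(1:n) - 1) ./ log2(pos + 1));
    ndcg = dcg / idcg;                             % assumes at least one grade > 0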

Training and classification were done with the BMRM (Bundle Methods for Regularized Risk Minimization) software package, which supports many different kinds of machine learning methods, including SVM learning with ranking cost functions. [41]

The classifier is trained by feeding it a collection of different scenes, with a carefully filtered set of grasp candidates that have been classified into three categories by a human operator.

The marked categories are shown in Table 2.

Table 2. Quality classes.

0 = Bad grasp, not viable at all

1 = Mediocre grasp, points on the object, but not in good locations

2 = Perfect grasp, contact points on the sides of the object in desired configuration

The categories were applied so that 0 = Bad marked any grasp which was unlikely to produce decent grasping results; these kinds of grasps were never successful in trial runs. 1 = Mediocre was used for grasps that were on the object and looked reasonable, but had potential problems; in trial runs these grasps had variable success, and all grasps which could not be classified into the best class were placed here. The final class 2 = Perfect signified the best grasp one could hope for, with contact points in the best positions on the sides of the object and angle features indicating that the grasp was very close to perfectly level.

3.6 Training

The creation of the training set is one of the more difficult undertakings for the system. It can be hard even for a human operator to tell whether a specific grasp will be successful, and because pairs of contact points are used, the number of grasp pairs quickly reaches thousands.

Multiple solutions were devised to simplify the creation of the training set. Most important among these is using the fact that many of the features themselves are human-understandable and very robust. This makes it possible to filter out grasps that are not physically possible before any learning is applied.

Another important tool was the development of training software in Matlab that allowed the trainer to easily go through grasps and classify them into one of the three categories. Saving and restoring partway, as well as undoing misclicks, all proved useful features to have.

Due to the large number of grasps it was not feasible to run the robot for each training sample, so the operator had to rely on his experience from previous trials to determine whether grasps were viable.

To cope with the large number of grasps, multiple solutions were devised. First, the set of pairs was filtered to restrict the number of samples. Second, a visualization tool was made for the grasp pairs, which allowed the trainer to easily determine whether grasps were viable without running the robot.

3.6.1 Initial Filtering of Training Set

To filter the bad grasp pairs out and allow the human operator to focus on the ambiguous cases, several of the features were used for filtering before applying machine learning.

The easiest feature to use was the grasp width. In the training samples there were no very lean grasp points, so contact pairs less than 15mm in width could be classified as failures. Also, as the gripper is limited in maximum grasp width, contact pairs wider than 100mm were filtered out. This is a little larger than the actual gripper maximum width of 72mm, but due to possible inaccuracies in the depth map some safety margin was allowed so that the human operator could teach the classifier to sort out the edge cases. Furthermore, contact points that were too close to the table, at less than 15mm, were considered bad.

Finally, the sphere features were used to discard grasps where the gripper would collide with objects. A suitable threshold was found by trial and error, checking that no grasp points on the object were considered bad.
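Put together, the rule-based pre-filtering amounts to a handful of threshold tests. The sketch below assumes per-pair vectors widths (contact point distance in mm), heights (height of the lower contact point above the table in mm) and sphereCounts (the sphere feature values), with sphereThreshold standing in for the experimentally chosen limit; all names are illustrative.

    minWidth = 15;  maxWidth = 100;  minHeight = 15;   % millimetres, see above
    keep = widths >= minWidth & widths <= maxWidth ...
           & heights >= minHeight ...
           & sphereCounts <= sphereThreshold;          % gripper collision test
    trainingPairs = pairs(keep, :);                    % pairs shown to the trainer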

Using these features it was possible to narrow the thousands of contact point pairs in each scene down to a few hundred possible grasps. With around ten scenes this cut the number of grasp pairs down to between 1000 and 2000. This amount is sufficiently low for a human trainer to go through, even if it is still high enough to be tedious. It was absolutely necessary to make an easy-to-use training tool to ensure the quality of the labelling was good.

3.6.2 Training Tool

The training tool was developed as a piece of Matlab software that allowed the human trainer to easily go through grasps and classify them into one of the three categories. Saving and restoring the process partway, so that the partial classification of a scene could be saved and labelling continued later, was very helpful. It was also found important to have a multilevel undo ability in case of misclicks.

Due to the large number of grasps remaining after filtering it was not feasible to operate the robot for each training sample, so the human operator had to rely on his experience from a limited number of trial grasps to determine whether a given grasp was viable.

For this purpose a visualization tool was made for the grasp pairs, which allowed the trainer to examine the grasps far more easily and determine whether grasp candidates were viable. The visualizer graphically shows the angle features and can display the grasp in a 3-D model view together with the point cloud data, allowing three-dimensional inspection of the grasp candidate.

Visual color indicators for potential problems, such as a too wide grasp, were also added. Questionable values were shown in red, acceptable values in yellow and values in the perfect range in green. This kind of indicator was implemented for all features which were human-interpretable.

3.7 Limitations

The test setup introduced several limitations. The difference in the robot configuration required some changes in the algorithm itself, because the robot grasps from a different direction than the one the camera views from. This also meant the grasps had to be simplified to be solely top grasps from straight above.

The camera posed the biggest limitations. The depth map produced by a stereo camera is not as accurate as those produced by other types of depth sensors, and it has problems detecting depth where there is no texture. This was solved by covering the table with a textured surface (newspaper was used) so that the background gives the stereo algorithm something to compute depth from.


None of the objects used for grasping were problematic to detect; however, in practice glossy or glass objects might pose problems for the hardware. In light of this, using a more accurate depth sensor would definitely improve results.
