
COMPUTER SIMULATION OF AN AUTOMATED ROBOT GRASPING WITH OBJECT POSE ESTIMATION

Software frameworks that are suitable to build a reliable computer simulation

Bachelor's thesis, Science and Engineering. Examiner: Associate Prof. Roel Pieters. April 2021


ABSTRACT

Dmitrii Panasiuk: Computer simulation of an automated robot grasping with object pose estimation
Bachelor's thesis
Tampere University, Science and Engineering
April 2021

This Bachelor's thesis presents the results of research exploring which software frameworks are required to create a reliable computer simulation of automated robot grasping.

First, automated grasping was defined and broken into components. Then, for each stage of the process, a corresponding framework was studied. The research also covered how object pose estimation or grasp evaluation can be performed with a neural network, which solutions exist and how they have to be configured. The research was carried out by setting up example simulations on a personal machine. The result of the work is a comprehensive analysis of the software frameworks required for creating a simulation of automated grasping, together with example simulations that are explained within the thesis.

Keywords: Automated grasping, Object pose estimation, ROS, Gazebo, MoveIt, PVN3D, GraspNet, Panda robot

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


CONTENTS

1. Introduction
1.1 Research objectives
1.2 Stages of grasping: computer implementations of real-life tasks
2. Robotics
2.1 Panda robot: robot model used in the research
2.2 Robot Operating System
2.3 Robot movement planning: MoveIt
3. Robot's perception: methods for determining grasps
3.1 Grasp detection: GraspNet
3.2 Object pose estimation: PVN3D
3.3 Kinect camera
4. Simulations
4.1 Test objects
4.2 Gazebo
4.3 Final simulations
4.4 Discussion
5. Conclusions
References
Appendix A: Example definition of a box in grasp_trial_world
Appendix B: ROS environment for Panda control simulation


LIST OF FIGURES

1.1 Stages of robot's grasping in real life.
1.2 Frameworks responsible for simulating robot's grasping on the computer.
1.3 Computer frameworks for grasping simulation are united by ROS.
2.1 Gazebo representation of a Panda robot.
2.2 User interface of Rviz.
3.1 Schematic representation of the GraspNet pipeline [12].
3.2 Schematic representation of the PVN3D pipeline [18].
4.1 CrankSlider dataset as seen by the Kinect camera.
4.2 Contents of the ROS package for Gazebo integration [32].
4.3 Gazebo simulation running with MoveIt.
4.4 Rviz interface with a set goal state.
4.5 Gazebo simulation with the robot moved into a goal state.
4.6 Gazebo simulation with objects from the dataset and Kinect.
4.7 RGB image collected by Kinect in Gazebo.
4.8 Output of the PVN3D model: estimated object poses.
4.9 ROS environment for image gathering and pose estimation.


1. INTRODUCTION

One of the main catalysts of the development of humanity is the automation of production processes. It has allowed people to step aside from manual labor, invest their time into research and development, and increase the productivity and quality of products [1]. At the same time, robots are used to remove people from unhealthy or dangerous tasks such as welding or work with chemical compounds [2]. Automation utilizes various types of robots, from the simplest ones to complicated machines that have to perform millions of calculations and operations. Regardless of its purpose, a robot first has to be designed and tested. However, as the complexity of robotic solutions increases, the cost of assembling a test machine becomes much higher. Hence, a solution has to be simulated and examined within a computer environment before it can be considered feasible and passed on to real-life tests.

1.1 Research objectives

Automated robot grasping is the future of robotics, as the field is increasingly shifting towards unsupervised operation. Automated grasping has a wide variety of uses on manufacturing sites, such as item sorting or assembly. This thesis studies which robotics software frameworks are required to deploy a reliable and realistic automated grasping simulation. Simulations have to be highly realistic so that the gap between test runs on the computer and real-life performance is minimal.

The task example considered in the research was modelling of automated object grasping by a robot (a Panda robot was used as an example [3]): information about the target object is collected by a camera and then processed by a neural network in order to obtain possible grasping poses. After that, the robot hand approaches the target and tries to pick it up.

Thus, the questions targeted by the research were as follows:

• What are the actions involved in automated robot grasping and what do they require?
• Which software components are required to create a simulation of automated grasping on the computer? What are the options for substituting components, if any?
• How simple or troublesome is it to set up a simulation on a blank machine?

In the next chapters, the overall process of grasping is first broken into individual components; then suitable frameworks are considered and described, together with the neural network used for object pose estimation. Finally, test setups are presented along with a discussion of the setup and performance.

1.2 Stages of grasping: computer implementations of real-life tasks

In order to simulate the process of robot grasping, one needs to consider what grasping actually is and what it consists of.

1.2.1 Robot grasping stages

Figure 1.1 shows a schematic diagram of the stages of automated unsupervised grasping with a robot in real life:

1. At the beginning of the process, the robot is in an arbitrary start state (it can be a constant predefined home state, but does not have to be). The robot is waiting for commands.
2. First, the target object is observed and image data is collected for further computations.
3. Next, the captured image is processed by a pre-trained model and the pose of the object is estimated.
4. Using the output of the model, a transformation is applied to the object's pose to obtain possible grasps, from which the best one is chosen.
5. The path for the robot arm's movement from the current state towards the grasping location is calculated and planned.
6. The robot arm moves towards the grasping state.
7. The robot tries to pick up the object. First, closing the gripper has to be successful, which is usually checked by the change of the distance between the fingers of the gripper. Then, the robot attempts to lift the object without dropping it.
8. In case of success, the task is completed; in case of a failure, the robot retreats to the initial state and the process is repeated.

If the output of the neural network is already a set of possible grasps rather than object poses, steps 3 and 4 are combined and handled by the network. The diagram assumes that the input data and all calculations are represented in the simplest form for the task: a 6-DoF pose. However, that is not always the case [4]. Other possible cases imply additional computations.
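To make the mapping from these stages to software components concrete, the schematic Python sketch below expresses the loop as code. Every object and method name in it (robot, camera, perception and their methods) is hypothetical and only mirrors the numbered stages above; the actual frameworks that implement each call are introduced in the following chapters.

def automated_grasp_cycle(robot, camera, perception, max_attempts=3):
    """Schematic loop over the stages in Figure 1.1 (all objects are hypothetical)."""
    for _ in range(max_attempts):
        rgbd = camera.capture()                     # stage 2: collect image data
        pose = perception.estimate_pose(rgbd)       # stage 3: neural-network pose estimate
        grasp = perception.best_grasp(pose)         # stage 4: transform predefined grasps
        plan = robot.plan_to(grasp)                 # stage 5: path calculation and planning
        robot.execute(plan)                         # stage 6: approach the grasping state
        if robot.close_gripper() and robot.lift():  # stage 7: closing and lifting checks
            return True                             # stage 8: success, task completed
        robot.move_home()                           # stage 8: failure, retreat and retry
    return False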

Figure 1.1. Stages of robot's grasping in real life.

Figure 1.2. Frameworks responsible for simulating robot's grasping on the computer.

1.2.2 Software frameworks

Each stage of the real-life process has to be recreated in a computer simulation. However, there is no single framework that could perform all of the above-mentioned tasks. Instead, separate frameworks have to be used to solve separate tasks. Additionally, one framework has to model the simulation itself and act as the representation of the physical robot in real life.

Figure 1.2 depicts how the various frameworks share the duties of the grasping simulation and which one is used for each stage. All of them were studied in the research and will be covered in the next chapters.

Figure 1.3. Computer frameworks for grasping simulation are united by ROS.

1.2.3 Frameworks communication

As mentioned above, the computations at various stages of the simulation are performed by different frameworks. Thus, communication between those frameworks is required in order to share information and pass the outputs of processing as the inputs of successive steps. All the communication within the computer simulation is managed by one more separate framework: ROS, which stands for Robot Operating System [5]. It is an open-source system that is used to solve a wide range of tasks, one of which is transferring messages between the separate software instances involved in operating the robot model. Thus, ROS unites the frameworks performing detached computations, and generally each process happens within the ROS environment. Hence, the diagram in Figure 1.2 can be updated to obtain the one in Figure 1.3.

2. ROBOTICS

2.1 Panda robot: robot model used in the research

The core of the robot grasping simulation is the chosen robot that will be picking up objects.

The robot model used in the study is a Franka Emika Panda robot [3]. The robot consists of 7 joints and hence has 6 links connecting pairs of joints. Additionally, it has one fixed link connecting the base joint to the ground. Together, that gives the Panda high flexibility and allows the robot to approach points in space with a wide variety of poses. On the tip of the arm, the Panda robot has a hand with 2 fingers that acts as an end effector and performs the grasping. The model of the Panda robot launched in Gazebo (considered in Chapter 4.2) is presented in Figure 2.1.

2.2 Robot Operating System

It is essential to begin the description of the used frameworks with the Robot Operating System, or ROS. ROS is the basis of the whole simulation setup and provides communication to all processes. In the next chapters, when other frameworks and their responsibilities are described, they will be referenced as parts of the ROS environment created for the project. Thus, basic ROS principles have to be defined first.

2.2.1 ROS terminology

In the following sub-chapter, the main ROS terms are defined based on the ROS documentation [6]. They are then supported by examples from the project.

Package

Packages are the main ROS building blocks. They are containers used for storing all software related to a certain project that uses ROS. A package contains ROS nodes, configuration files, libraries, data and outside software. The main benefit of a package is that all files related to one project can be stored in a single organized unit, which can then be easily transferred to another machine. Additionally, if the machine on which the environment is being run hosts another project, package sourcing helps to organise imports and dependencies so that version compatibility is maintained.

Figure 2.1. Gazebo representation of a Panda robot.

Node

Nodes are the building pieces of a package and of the ROS environment. Each node is a single computational unit that performs its respective actions. Nodes transfer data between each other by passing messages related to separate topics. For example, in the covered project, separate nodes are the separate software instances that were depicted in Figure 1.3.

Message

Messages are the means of communication between nodes. Messages are published by nodes into certain topics and received from a topic by the nodes that are subscribed to it. Message passing is anonymous: a subscriber does not know who the publisher is and cannot request any data through the topic. Internally, messages consist of usual data types or data structures, such as int, string, arrays, et cetera. They can also contain several data types, thus passing several parameters. Messages are defined by .msg files stored in the package. Examples of messages in the Panda simulation (a minimal publish/subscribe sketch in Python follows the examples):

• GraspResult - notifies of the success of the grasp; it consists of a bool variable indicating whether the grasp was successful and a string variable describing an error if one has happened.
• FrankaState - describes the current position and state of the robot; it consists of 42 variables, mainly of type float.
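As an illustration of the publish/subscribe mechanism described above, the minimal rospy sketch below publishes and receives a plain Bool, standing in for the success flag of GraspResult. The topic name /grasp_result and the use of std_msgs/Bool instead of the package-specific message type are assumptions made only for this example.

#!/usr/bin/env python
import rospy
from std_msgs.msg import Bool

def publisher():
    # Publishes a stand-in "grasp succeeded" flag once per second.
    rospy.init_node("grasp_reporter")
    pub = rospy.Publisher("/grasp_result", Bool, queue_size=10)
    rate = rospy.Rate(1)
    while not rospy.is_shutdown():
        pub.publish(Bool(data=True))
        rate.sleep()

def on_result(msg):
    rospy.loginfo("Grasp successful: %s", msg.data)

def subscriber():
    # Any node subscribed to the same topic receives the messages anonymously.
    rospy.init_node("grasp_listener")
    rospy.Subscriber("/grasp_result", Bool, on_result)
    rospy.spin()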

Service

As mentioned earlier, nodes cannot request information through topics; instead, this is done with services. The requesting node sends a message to the service, which then receives a reply message from the target node and transfers it back to the requester. A service is defined with .srv files that contain two message types: request and reply. Example of a service used in the project (a minimal client-side call is sketched after it):

• computeGrasps - the request is a boolean flag asking for a computation, and the reply is of type PoseArray, which comprises the proposed grasp poses.
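A minimal client-side sketch of calling a service is shown below. It uses the standard std_srvs/Trigger type as a stand-in for the project-specific computeGrasps service, and the service name /compute_grasps is likewise an assumption made for illustration only.

import rospy
from std_srvs.srv import Trigger

rospy.init_node("grasp_requester")
rospy.wait_for_service("/compute_grasps")
compute_grasps = rospy.ServiceProxy("/compute_grasps", Trigger)
response = compute_grasps()                 # blocks until the reply message arrives
rospy.loginfo("success=%s, message=%s", response.success, response.message)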

Launch files

A ROS environment can be initialized by means of launch files. A launch file contains the definitions of all nodes, topics and services; it launches the frameworks and links nodes to topics.

One of the launch files used in test runs was panda_simulation.launch. A typical node declaration from that file looks as follows:

<node name="spawn_urdf" pkg="gazebo_ros" type="spawn_model"

args="-param robot_description -urdf -model panda"/>

The node spawn_urdf is used to add a Panda model into the Gazebo simulation. The declaration includes the name, the title of the ROS package from which the node type is referenced, and the arguments that are passed when the node is created.

For larger environments, it is a common approach for the main launch file to call other launch files that are responsible for initialising sub-parts of the simulation. For example, panda_simulation.launch uses a separate call of grasp_trial_world.launch (described in Chapter 4.2) to launch Gazebo and move_group.launch to start the MoveIt interface.


Figure 2.2. User interface of Rviz.

2.2.2 ROS package used in the research

The main focus of the research was to study which software frameworks have to be utilized in order to recreate the action of automated grasping by a Panda robot in a computer simulation. That was done by implementing a simulation on a blank Linux machine.

The core of the project was obtained by recreating the work done by Saad Ahmad in his Master's thesis [4]. Ahmad's ROS package with the created environment was copied to the machine. One of the tasks was to configure it to work with the dataset of target objects created by Kulunu Samarawickrama in his Master's thesis work [7].

2.3 Robot movement planning: MoveIt

After the image data is collected by the Kinect camera in the Gazebo simulation, it is passed through the PVN3D network and the poses of the objects are estimated. Grasps are then generated from a set of predefined ones and the target grasp is chosen. The next step is to calculate how the robot can reach the position of the grasp: the MoveIt interface performs that task [8].

MoveIt is an open-source framework for robot motion planning and simulation. It is built on top of ROS and utilizes ROS tools such as messaging and topics. It is installed as a ROS package and hence cannot be operated without ROS. MoveIt is commonly used with a graphical interface, the ROS Visualizer or Rviz [9]. An example of the Rviz interface with the Panda robot in it is shown in Figure 2.2.

MoveIt is not used only for computer simulations of robot tasks. Instead, MoveIt is widely used to provide manipulator functionality for real-life robot runs. For example, automated grasping in real life would have the frameworks layout shown in Figure 1.3, but a real robot would be used instead of Gazebo.

During the research, the MoveIt framework was not studied in detail; it was mostly used as a black-box interface. Hence, the pipeline of motion planning is not covered thoroughly.

2.3.1 MoveIt in the research simulation

As will be described in Chapter 4.3.1, in the grasping simulation the MoveIt framework was represented by the node /move_group. The node was launched by the original panda_simulation.launch file through another launch file, move_group.launch, created by the developers of the MoveIt interface in their original ROS package for Panda robot tasks, panda_moveit_config. Hence, the package was used as a ready-made tool that does not require significant intervention.

The node /move_group receives the required end state from the Python script that computes grasp poses. It then runs motion planning with the MoveIt pipeline and sends an execution message to the /gazebo node.
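The way such an end state can be handed to /move_group from Python is shown in the sketch below, which uses the moveit_commander API. The planning group name "panda_arm" follows the panda_moveit_config convention and the goal pose values are purely illustrative.

import sys
import rospy
import moveit_commander
from geometry_msgs.msg import Pose

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("grasp_pose_sender")
group = moveit_commander.MoveGroupCommander("panda_arm")

goal = Pose()                               # illustrative Cartesian goal for the gripper
goal.position.x, goal.position.y, goal.position.z = 0.4, 0.0, 0.4
goal.orientation.w = 1.0

group.set_pose_target(goal)
group.go(wait=True)                         # plan with the MoveIt pipeline and execute
group.stop()
group.clear_pose_targets()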


3. ROBOT’S PERCEPTION: METHODS FOR DETERMINING GRASPS

The central part of automated robot grasping is the generation of the object's grasp pose.

The most common approach to the problem is to use neural networks. There are multiple models created directly for that purpose: Grasp Pose Detection (GPD) [10], Dex-Net [11], GraspNet [12] and others [13, 14, 15]. Another approach is to predict object poses and then robustly compute grasps based on them; examples are PoseCNN [16], PVNet [17], PVN3D [18] and others [19, 20]. Both categories of models target the same task: grasp proposal. Therefore, GraspNet and PVN3D were considered in the research, and PVN3D was used in the final setup.

3.1 Grasp detection: GraspNet

Information about the GraspNet model was retrieved from the original article written by Arsalan Mousavian, Clemens Eppner and Dieter Fox [12].

3.1.1 Overview of the model

GraspNet works with point cloud images of single objects only; segmentation of the object from the background is not considered. The model can work with any unknown object and does not require object-class-specific training. As an input, GraspNet receives a depth image of an object. A partial point cloud is then extracted using plane fitting. The point cloud represents only part of the object, as the input image is taken from one view and the full model cannot be retrieved. Next, the point cloud is passed through the pipeline shown in Figure 3.1, where circles represent input and/or output data, rectangles represent processing stages and the rhombus is a logical operation.

At the end, a grasp pose is defined with 6 degrees of freedom, consisting of translational and rotational components, each having 3 degrees of freedom.

Figure 3.1. Schematic representation of the GraspNet pipeline [12].

3.1.2 Grasp Sampler

The grasp sampler is the first stage of the processing. It utilizes the principles of a Variational Autoencoder [21].

The encoder and decoder utilize the PointNet++ architecture [22], and the points of the input image are represented with a 3D coordinate and a corresponding feature vector that depicts the point's location relative to other points. In the process of training, each ground-truth grasp g and the point cloud of the object X are transformed by the encoder into a subspace of a latent space Z. After that, the decoder reconstructs it back into a grasp g̃. The training aims to minimize the error between g and g̃ and pushes the form of the latent space Z towards a normal distribution N(0, I).

During operation, the encoder part is omitted: the decoder samples values of z from the space Z that resulted from training and transforms z and X into sampled grasps.
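The sketch below illustrates only this inference-time sampling step in PyTorch: the latent code z is drawn from N(0, I) and decoded together with a feature of the point cloud into grasp candidates. The decoder here is a deliberately simplified placeholder MLP (the real GraspNet decoder is PointNet++-based), and the latent dimension and tensor sizes are illustrative.

import torch

class PlaceholderGraspDecoder(torch.nn.Module):
    # Simplified stand-in for the PointNet++-based GraspNet decoder.
    def __init__(self, latent_dim=2, feat_dim=1024):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(latent_dim + feat_dim, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 7),        # 3D translation + quaternion per grasp
        )

    def forward(self, z, cloud_feature):
        return self.mlp(torch.cat([z, cloud_feature], dim=-1))

decoder = PlaceholderGraspDecoder()
cloud_feature = torch.randn(64, 1024)       # stand-in feature of the object point cloud X
z = torch.randn(64, 2)                      # z ~ N(0, I); the encoder is omitted at test time
sampled_grasps = decoder(z, cloud_feature)  # 64 sampled grasp candidates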

3.1.3 Grasp Evaluator

The grasp sampler stage is trained solely with positive examples. Due to that, it is prone to produce a certain rate of false grasps. Hence, grasp evaluation has to be performed.

As can be seen in Figure 3.1, the output produced by the sampler stage is passed to the evaluator. Before that, the point cloud of the object is added to the point cloud of the grasp pose (which represents the location of the robot gripper). Thus, the input of the evaluator stage is the point cloud of the object with the gripper located in the place suggested by the grasp sampler. The usage of a combined point cloud instead of two separate point clouds enhances the relative information between the object and the grasp pose.

The grasp evaluator is also based on the PointNet++ architecture [22]. Exact descriptions of the used parameters and layers can be found in the original paper [12].

3.1.4 Grasp Refinement

The creators of GraspNet found that a large portion of the grasps rejected by the evaluator stage are in fact close in space to successful grasps that were not found by the model. Hence, it was decided to add another network that tries to alter dismissed grasps by various small ∆g so that they can become successful.

The grasp refinement stage is based on the fact that the evaluator produces a function s(g) that represents the probability of grasp success for a given grasp pose. Hence, ∆g can be chosen along the derivative of s(g), δs/δg, which represents the direction of movement towards increasing success probability. The parameters of the computation are set up so that the total translation of the gripper's geometrical center is no more than 1 cm.

Refined grasps are then passed back to the grasp evaluator, and its output can again be passed through the refinement stage. The number of evaluator-refinement iterations is set by the user and defined before the start of the model operation.
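The sketch below shows the idea of one refinement step in PyTorch. The evaluator network is a placeholder (the real one is PointNet++-based and takes the combined point cloud, not a 7-number pose vector), and the step scaling is only meant to reproduce the 1 cm translation cap mentioned above.

import torch

# Placeholder differentiable evaluator s(g): predicted probability of grasp success.
evaluator = torch.nn.Sequential(
    torch.nn.Linear(7, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 1), torch.nn.Sigmoid(),
)

g = torch.randn(32, 7, requires_grad=True)  # rejected grasps: 3D translation + quaternion
evaluator(g).sum().backward()               # g.grad is ds/dg, pointing towards higher s(g)

with torch.no_grad():
    step = g.grad.clone()
    # Scale the translational part so the gripper centre moves at most 1 cm.
    t_norm = step[:, :3].norm(dim=-1, keepdim=True)
    step[:, :3] *= 0.01 / t_norm.clamp(min=0.01)
    refined_g = g + step                    # passed back to the evaluator for the next round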

3.1.5 Output

After all iterations of grasp refinement and evaluation are completed, the successful grasps are given as the output of the computation.

3.2 Object pose estimation: PVN3D

PVN3D is a neural network for object pose estimation, and in the research it was used as the main one. The choice was not based on accuracy or efficiency; PVN3D was used because of the availability of an integration into ROS and Gazebo. The information about the PVN3D model was retrieved from the original article written by Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan and Jian Sun [18].

3.2.1 Comparison with GraspNet

It is not reasonable to compare the performance of PVN3D and GraspNet, as the models have different end goals. The output of GraspNet contains grasps that are predicted to be successful, whereas PVN3D estimates the pose of a previously known object, and the output can be used to transform a set of predefined grasps according to the object's position in the camera frame. Hence, there are two major differences:

• GraspNet can work with any unknown object, whereas PVN3D has to be trained with the exact set of objects that will then be used in the grasping.
• GraspNet gives ready grasps as an output, whereas PVN3D requires an additional script that receives the estimated object pose, uses the set of predefined grasps for the object and computes how each grasp has to be translated and rotated to fit the current object position.

The suitability of the two models can be compared. Clearly, PVN3D has a disadvantage in that it cannot be used with random objects. However, when considered in terms of production automation (one of the most common uses of robots), the disadvantage vanishes, as on manufacturing sites objects are usually well known and certain robots are used with certain objects.

Another point that may be considered a benefit of GraspNet is that it produces the end result: grasp poses. However, it can in fact be a better approach to produce object poses, as it is unambiguous and the probability of a mistake is significantly lower than with suggesting a set of grasps. After the pose is assessed, grasps are chosen from ground-truth possibilities and thus cannot be false, unless the pose was estimated incorrectly.
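A minimal sketch of that transformation step is given below: a predefined grasp stored in the object's own frame is moved into the camera frame by composing it with the pose estimated by PVN3D. Both 4x4 matrices carry illustrative values only.

import numpy as np

def grasp_in_camera_frame(T_cam_obj, T_obj_grasp):
    # T_cam_obj:   4x4 object pose estimated by the network (rotation + translation)
    # T_obj_grasp: 4x4 predefined grasp pose expressed in the object frame
    return T_cam_obj @ T_obj_grasp

T_cam_obj = np.eye(4)
T_cam_obj[:3, 3] = [0.0, 0.0, 0.5]          # object 0.5 m in front of the camera
T_obj_grasp = np.eye(4)
T_obj_grasp[:3, 3] = [0.0, 0.0, 0.05]       # grasp 5 cm above the object origin
print(grasp_in_camera_frame(T_cam_obj, T_obj_grasp))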

The last point is that PVN3D is capable of working with several objects in the frame, semantically segmenting them and computing the pose of each one. However, as the grasps are then taken from the predefined ones, occlusion of the object's surroundings is not considered, and it can affect the grasping possibility.

All in all, both models have their relative advantages, and their suitability has to be considered for each exact application.

3.2.2 Overview of the model

PVN3D requires an RGBD image as input, which is a regular RGB image with a fourth component representing depth: the distance from the camera to a point in the image. The goal of the network is to determine known objects in the image, separate them from each other and find suitable transformations for the objects' models so that the models correspond to the objects in the image. The overview of the pipeline is presented in Figure 3.2, where circles represent input and/or output data and rectangles represent processing stages.

3.2.3 Feature extraction

The input image is broken into RGB and depth components. The RGB image is processed by PSPNet [23] with an ImageNet-pretrained [24] ResNet34 [25]. This block enhances the appearance information that is then used for object segmentation and keypoint detection. The depth component is passed to PointNet++ [22], which extracts geometry features.

Figure 3.2. Schematic representation of the PVN3D pipeline [18].

The two outputs are then fused by DenseFusion [26].

3.2.4 Object segmentation and keypoint detection

After the initial processing, the extracted point-wise features are passed to multilayer perceptrons (MLPs) that process them simultaneously.

The 3D keypoint detection module is responsible for suggesting correspondences between possible keypoints of the observed objects and the keypoints of known target objects. The block tries to find, for each point, the target keypoint offsets that fit the visible points. Thus, it considers feature vectors individually for separate points, looks for point-wise correspondences between the input and the target model objects, and predicts the Euclidean translations required to align corresponding points.

The semantic segmentation module performs a point-wise prediction of correspondences between visible objects and known target objects: for each point of the image, the module predicts a class to which it could belong. The usage of the module can seem redundant, as the 3D keypoint detector also processes points in a class-wise manner and predicts point correspondences. However, the idea behind the block is for the two MLPs to enhance each other's performance. The enhancement happens because the segmentation module extracts both global and local features for object segmentation and helps to obtain object size information.

The object center voting module predicts the objects' centers. Its structure is similar to that of the first MLP, as the center of an object can be considered a keypoint. Again, Euclidean translations are predicted.

The outputs of the last two modules are used in a clustering algorithm [27] for distinguishing separate instances of objects of the same class. Hough voting is performed with the keypoints suggested by the 3D keypoint detector to choose the final keypoints, which are used for comparison with the targets.

3.2.5 Pose Estimation

The input to the last stage consists of the sets of keypoints of the input image in the camera coordinate frame, labeled with object classes. Least-squares fitting is performed with respect to the target objects' keypoints given in the object coordinate frame.
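A compact NumPy sketch of such a least-squares fit (the Kabsch/Umeyama method without scaling) is shown below; it is a generic implementation of the idea, not PVN3D's own code. Both inputs are 3xN arrays of matched keypoints.

import numpy as np

def fit_rigid_transform(model_kps, camera_kps):
    # Find R, t minimising ||R @ model_kps + t - camera_kps|| in the least-squares sense.
    mu_m = model_kps.mean(axis=1, keepdims=True)
    mu_c = camera_kps.mean(axis=1, keepdims=True)
    H = (model_kps - mu_m) @ (camera_kps - mu_c).T   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = mu_c - R @ mu_m
    return R, t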

3.2.6 Output

The output of the model is a set of class-labeled rotation and translation matrices that describe the poses of the objects found in the input image.

3.3 Kinect camera

In order to compute possible grasps, target object poses were estimated by the PVN3D algorithm [18]. As an input, it requires images with a depth component that describes the distance between the camera and a point in the image. One of the most utilized ways to obtain images with depth is to use a Kinect camera [28]. The second version of the Kinect camera was used in the simulation. Alongside a normal camera that creates an RGB image, Kinect v2 uses a Time-of-Flight (ToF) camera that captures the time difference between emitting infrared light and receiving it back after being reflected by the object. In this way, the ToF camera computes the distance between the Kinect and a particular spot of the viewed scene and calculates the depth component [29]. The result of Kinect capturing is two images: a usual RGB image and a depth image. On the Kinect device, the usual camera and the depth camera are physically separate, so they have a difference in their locations relative to the scene. This is compensated by applying a transformation to the components at the preprocessing stage.
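The sketch below shows, with NumPy, what such a compensation can look like: a depth pixel is back-projected with the depth camera intrinsics, moved into the colour camera frame with an extrinsic transform, and re-projected with the colour intrinsics. The intrinsic matrices and the offset used here are illustrative numbers, not real Kinect v2 calibration values.

import numpy as np

K_depth = np.array([[365.0, 0.0, 256.0], [0.0, 365.0, 212.0], [0.0, 0.0, 1.0]])
K_color = np.array([[1060.0, 0.0, 960.0], [0.0, 1060.0, 540.0], [0.0, 0.0, 1.0]])
T_color_depth = np.eye(4)
T_color_depth[0, 3] = -0.052                # assumed offset between the two cameras

def depth_pixel_to_color_pixel(u, v, depth_m):
    p_depth = depth_m * np.linalg.inv(K_depth) @ np.array([u, v, 1.0])  # back-project
    p_color = (T_color_depth @ np.append(p_depth, 1.0))[:3]            # change of frame
    uvw = K_color @ p_color                                            # re-project
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

print(depth_pixel_to_color_pixel(256, 212, 0.8))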

In the simulation, the Kinect camera is attached to the end effector of the Panda robot, so that the direction in which the Panda arm points determines the field of view of the camera. The Kinect camera can be seen attached to the robot in Figure 2.1.

4. SIMULATIONS

4.1 Test objects

Parts of packages created by Master's students at Tampere University were used in the research. The majority of the models were predefined, as they were taken from previous research.

The research was done in the scope of an ongoing larger research project in the Robolab at Tampere University, which covers automated grasping of diesel engine parts with the Panda robot. Hence, in the current paper the objects used for grasping were also diesel engine parts, as the neural network model was pre-trained for them. The considered objects were, for example, a piston head, a connecting rod and bearings. The dataset was created by a Master's student at the Robolab [7]. The object models as seen by the Kinect camera are shown in Figure 4.1.

4.2 Gazebo

Gazebo is an open-source robotics physics simulation software that is used for simulating a robot's dynamic actions with real-world conditions taken into account: gravity, friction, inertia, torque, etc. Gazebo provides a graphical user interface that gives visual feedback about robot actions (but it can be used without the GUI as well). Additionally, data collection can be performed through Gazebo; the software supports adding various sensors and cameras. Thus, while the other frameworks considered in the research perform actions "under the hood", Gazebo is the representation of the actual process of robot grasping: retrieving images of the objects, approaching the grasping pose and picking up.

A Gazebo simulation is defined by the world description, which includes robots, sensors, objects, etc. In this chapter, the main components of a Gazebo simulation are first described and then linked to the simulation in the research.

4.2.1 Gazebo usage in the research simulation

In the simulation of automated grasping, Gazebo is used for several tasks:

• Representation of the objects in the world together with the Panda robot, simulating real-life conditions.
• Providing a view of the robot performing the task.
• Image data collection with the Kinect camera.
• Feedback on the grasp attempt.

Figure 4.1. CrankSlider dataset as seen by the Kinect camera.

4.2.2 World

The world file is used during the launch of a Gazebo simulation, as it contains the description of the simulation, which covers the used models (robot, objects), the placement of those models, physics settings and simulation parameters.

The typical way to create a world file is to create an empty world in Gazebo, modify it with the GUI and then save the configuration. The world file is saved in an XML format and thus can afterwards be edited with a text editor.

During the research, the simulation described in Chapter 4.3.1 was set up. The main parts of the Gazebo world file grasp_trial_world.world that was used for launching the Gazebo interface are described in the current chapter.

4.2.3 Object models

Models used in the simulation can be robots and static or dynamic objects. The syntax for declaring a model in the world file is usually the same:

<model:ModelName>

<id>modelId</id>

<xyz>x y z</xyz>

<rpy>r p y</rpy>

</model:ModelName>

x, y, z determine the model's location with respect to the world center, and r, p, y describe roll, pitch and yaw respectively, which are rotations about the model's axes.

Models can also be added through the Gazebo GUI or by creating a ROS node spawn_model in the simulation's ROS launch file. The latter method was used in the example simulation to spawn the Panda robot:

<node name="spawn_urdf" pkg="gazebo_ros" type="spawn_model"

args="-param robot_description -urdf -model panda"/>

In order to be able to export a model into the world, it has to be added into the Gazebo folder with two files: a .config file containing the name of the model, the current version and other meta information, and an .sdf file, a file format that was originally created to define robot specifications for Gazebo simulations [30].

In the simulation used in the study, the test objects were added with the GUI. When objects are exported with a .world file, static objects have to be marked with a parameter in the definition:

<static>true</static>

When, instead of exporting a predefined model, an object is created in the world file from scratch, a variety of parameters have to be defined. One example is the box that was used as a table for the test objects and can be seen in Figure 4.8. The code that was used to create it in the world file can be seen in Appendix A.

First, the name of the model is given: unit_box. Then the position has to be defined through x, y, z and r, p, y. The model has to possess inertia, which is specified through the inertia matrix. Collision parameters define the geometry of an object that is used for contact checking, which is crucial in the grasping simulation. The shape is defined as a box (which is a standard model in Gazebo) with given sizes. Friction of a model could be defined by passing values for the Coulomb friction coefficient into <mu> </mu>, but the table box does not have to be picked up and hence friction is not needed. The visual model for the object is the same as the collision geometry. Finally, the material is specified as Gazebo/Grey, which is a plain color with no texture.

The geometry parameters of an object that has to be picked up are specified in significantly greater detail, as various surface parameters can affect the results and act as an error source. In order to eliminate, or at least minimize, the difference between real life and the computer simulation, great attention has to be paid to the object's collision model. Here is an example of the collision parameters for one of the test objects from grasp_trial:

<collision name=’collision’>

<laser_retro>0</laser_retro>

<max_contacts>10</max_contacts>

<pose frame=’’>0 0 0 0 -0 0</pose>

<geometry>

<mesh>

<uri>model://cran_field_round_peg/cran_feld_peg1.stl</uri>

<scale>0.001 0.001 0.001</scale>

</mesh>

</geometry>

<surface>

<friction>

<ode>

<mu>1000</mu>

<mu2>100</mu2>

</ode>

<torsional>

<ode/>

</torsional>

</friction>

<contact>

<ode>

<kp>1e15</kp>

<kd>1e13</kd>

<min_depth>0.0001</min_depth>

<max_vel>0</max_vel>

</ode>

</contact>

<bounce/>

</surface>

</collision>

Friction coefficients are defined based on the material of the test object. Additionally, multiple variables are defined for the contact of the object's surface with another surface: kp represents the stiffness coefficient, kd is the damping coefficient, and min_depth is the minimal required depth of interpenetration for the contact to be registered; otherwise no force is applied to the object model. Finally, max_vel is the maximum correction velocity of contacts [31].

Properly defining collision parameters is a challenging task, and it plays a vital role in the simulation of robot grasping.

4.2.4 World parameters

A Gazebo world can be defined with great precision. Among the various details that can be specified, some valuable examples are:

• Physics. It is possible to set up the direction of gravity with respect to the world coordinate frame. It is also possible to set up the magnetic field's direction and the atmosphere type to reach a highly realistic simulation. Physics of the grasp_trial_world:

<param:Global>

<gravity>0 0 -9.8</gravity>

<magnetic_field>6e-06 2.3e-05 -4.2e-05</magnetic_field>

<atmosphere type=’adiabatic’/>

</param:Global>

• Light. The source of light can be defined, whether it is natural sun, a lamp or something else. The direction of light is defined in the world coordinate frame, and various parameters are chosen, such as diffusion, attenuation or shadow casting:

<light name=’sun’ type=’directional’>

<cast_shadows>1</cast_shadows>

<pose frame=’’>0 0 10 0 -0 0</pose>

<diffuse>0.8 0.8 0.8 1</diffuse>

<specular>0.1 0.1 0.1 1</specular>

<attenuation>

<range>1000</range>

<constant>0.9</constant>

<linear>0.01</linear>

<quadratic>0.001</quadratic>

</attenuation>

<direction>-0.5 0.5 -1</direction>

</light>

Figure 4.2. Contents of the ROS package for Gazebo integration [32].

4.2.5 Integration of Gazebo into ROS

The Gazebo simulation of the Panda robot and the test objects had to be integrated into the ROS environment in order to communicate with the other frameworks. For that, ROS has the package gazebo_ros_pkgs, which covers all interfaces needed for a comprehensive integration of a stand-alone Gazebo simulation. Figure 4.2 depicts the components of the package [32].

Control of robot movements is performed with the ROS package ros_control [33]. The package provides interfaces to collect the states of the robot's joints, pass them to the path planner and return the outputs to the joint controllers. There are five types of controllers defined in ROS:

• effort_controllers - a target torque/force is passed to the joints.
• joint_state_controller - parameters of the joint state are passed.
• position_controller - one or multiple joint positions are controlled.
• velocity_controller - same as position_controller, but the velocity is controlled.
• joint_trajectory_controller - provides extra functionality to every other controller.

The first, second and last types were used in the setup.
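A minimal sketch of commanding the arm controller directly (bypassing MoveIt) is shown below: a single-point trajectory_msgs/JointTrajectory is published to the controller's command topic. The topic name, the joint names and the target joint values follow common Panda conventions and are assumptions about the setup; in a real run they should be taken from the controller configuration.

import rospy
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint

rospy.init_node("manual_trajectory_sender")
pub = rospy.Publisher("/panda_arm_controller/command", JointTrajectory, queue_size=1)
rospy.sleep(1.0)                                     # let the publisher connect

traj = JointTrajectory()
traj.joint_names = ["panda_joint%d" % i for i in range(1, 8)]
point = JointTrajectoryPoint()
point.positions = [0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.785]  # an often-used home pose
point.time_from_start = rospy.Duration(3.0)          # reach the pose within 3 seconds
traj.points.append(point)
pub.publish(traj)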


4.2.6 Gazebo alternative: Webots

There are multiple robot physics simulators that are alternatives to Gazebo. One of the main ones is Webots [34]. It can fully replace Gazebo in the setup in Figure 1.3 and perform the same tasks. Just like Gazebo, Webots has an easy-to-understand GUI and a ROS integration package. With reference to a recently published comprehensive comparison of physics simulators for robotics, it can be said that in terms of functionality Gazebo and Webots have almost no differences [35].

Webots was not used in the research, but it was covered in the initial considerations, and the only difference found is that Gazebo seems to be used more frequently, and hence it is easier to find supporting material online.

4.3 Final simulations

The goal of the research was to study which software frameworks are suitable for creating a reliable computer simulation of automated grasping, and the most trustworthy way to investigate the topic was to create the simulation while covering the frameworks. However, achieving a completely operable simulation of the process depicted in Figure 1.3 did not fit into the scope of the Bachelor's thesis and requires significantly more time investment. Nevertheless, two simulations were covered that resemble separate actions from the pipeline of automated grasping. The first one is a Gazebo simulation with the test objects and the Panda robot added; the Panda can be operated by passing poses to the MoveIt interface, calculating the path and executing it. The other simulation is a Gazebo world with the CrankSlider dataset and the Kinect camera. The simulation uses a Python script [7] that makes the camera fly around the dataset capturing images, which are then passed to the PVN3D script, and estimated object poses with marked keypoints are received.

4.3.1 Gazebo simulation with MoveIt control

The Gazebo world used in the setup was described in Chapter 4.2. The idea of the simulation was to integrate the Panda robot into Gazebo, build ROS communication between the Gazebo and MoveIt instances and then control the Panda robot in Gazebo with predefined poses.

Figure 4.3 shows the Gazebo simulation and the Rviz interface running simultaneously, with the Panda robot in the home state. The test objects can be seen on the box in front of the robot. In Figure 4.4, a predefined goal state of the Panda robot is passed to MoveIt and shown in the Rviz interface; the orange texture represents the goal state. After the goal state is provided, MoveIt calculates an approaching path, executes it and sends commands to the Gazebo simulation by means of the ROS messaging described in Chapter 2.2. The Panda robot that has moved into the goal state can be seen in Figure 4.5; the test objects are now in the proximity of the gripper.

Figure 4.3. Gazebo simulation running with MoveIt.

Figure 4.4. Rviz interface with a set goal state.

The graph of the ROS environment during the setup run is shown in Appendix B. Even though grasping and image data collection are not considered, the graph is significant. The main part, related to controlling the robot state, is at the bottom: the /gazebo node represents the Gazebo simulation, and /move_group is the MoveIt interface that computes how the Panda gripper can move from position A to position B. These nodes communicate with each other using multiple topics:

• /gazebo publishes the current state of the joints into the topic /joint_states, which is then received by /robot_state_publisher and passed to /move_group.

• The topics /panda_arm_controller and /panda_hand_controller are controllers of type joint_state_controller that are used for passing commands to execute the path computed by MoveIt. The node /move_group publishes commands into topics inside the controllers, while /gazebo is subscribed to them and listens to those messages.
• The node /gazebo publishes feedback on command execution through the same topics, and the feedback is then received by /move_group.

Figure 4.5. Gazebo simulation with the robot moved into a goal state.

4.3.2 Gazebo simulation with Kinect and object pose estimation

The goal of the simulation was to collect data with the Kinect camera inside the Gazebo simulation and to test the PVN3D model with the Kinect data by running the model and receiving object poses. The simulation used a Python script for the camera movement and the PVN3D computation [7].

The Gazebo interface with the simulation can be seen in Figure 4.6. The CrankSlider dataset is located in the middle of the coordinate frame, while the Kinect camera rotates above it and collects images. An example of a collected image can be seen in Figure 4.7. The example demonstrates the RGB component only; the depth component on its own is hardly discernible and would not be informative.

The result of PVN3D processing for the same time stamp as in Figure 4.7 can be seen in Figure 4.8. The objects are segmented, and the object poses are estimated and designated by white borders. Keypoints are marked as red dots. It can be seen in Figure 4.8 that PVN3D succeeded in predicting the object poses, even though some keypoints are placed incorrectly. The performance was generally the same for all Kinect locations in the described simulation.

Figure 4.6. Gazebo simulation with objects from the dataset and Kinect.

The graph of the ROS environment of the simulation can be seen in Figure 4.9. As before, the node /gazebo represents the Gazebo simulation, and the node n_pvn3d_LM_pred runs the Python computation of the PVN3D script. The topic /kinect1 is used for publishing pictures gathered by the Kinect camera in the Gazebo simulation; the subtopics /depth and /color are for the depth and color components respectively. An example of a picture collected through the latter can be seen in Figure 4.7. The PVN3D node is subscribed to the /kinect1 topic; it extracts the Kinect images from there and, after processing, publishes the results into the /pvn3d_label_img topic. An example of a published image can be seen in Figure 4.8.
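The sketch below shows how a node such as the PVN3D one can consume those streams: the colour and depth images are subscribed to, paired by their time stamps and converted to NumPy arrays with cv_bridge. The full image sub-topic names under /kinect1 are assumptions about the simulated camera plugin.

import rospy
import message_filters
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()

def on_rgbd(color_msg, depth_msg):
    rgb = bridge.imgmsg_to_cv2(color_msg, desired_encoding="bgr8")
    depth = bridge.imgmsg_to_cv2(depth_msg, desired_encoding="passthrough")
    rospy.loginfo("received RGB %s and depth %s", rgb.shape, depth.shape)
    # here the RGB-D pair would be handed to the pose estimation model

rospy.init_node("kinect_listener")
color_sub = message_filters.Subscriber("/kinect1/color/image_raw", Image)
depth_sub = message_filters.Subscriber("/kinect1/depth/image_raw", Image)
sync = message_filters.ApproximateTimeSynchronizer([color_sub, depth_sub],
                                                   queue_size=10, slop=0.05)
sync.registerCallback(on_rgbd)
rospy.spin()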

4.4 Discussion

All simulations require a Linux operating environment. The reason is the availability of software: of all the mentioned frameworks, only Webots has a Windows alternative; everything else is compatible with Linux only.

Figure 4.7. RGB image collected by Kinect in Gazebo.

Figure 4.8. Output of the PVN3D model: estimated object poses.

At the beginning of the research, Ubuntu 20.04 was chosen as the Linux environment, and that resulted in a significant number of issues related to software compatibility. The version of Ubuntu defines which versions of robotics software can be used; hence, for Ubuntu 20.04 the versions of ROS, Gazebo and MoveIt also had to be new. Thus, the source of the majority of issues was found to be the relative novelty of the environment version, which results in a lack of testing of how the new framework versions behave with each other. That, in turn, led to the presence of various errors and bugs. Eventually, it was decided to downgrade to Ubuntu 16.04 LTS and begin setting up the frameworks from scratch again. Finally, for each involved piece of software an older version had to be used, as that provided the most stable integration with the others.

It is worth mentioning that, even with the software versions chosen properly, it still takes a lot of effort to set up the environment. Reasons include additional packages that are needed for proper operation and typical Linux issues such as file sourcing and referencing.

Figure 4.9. ROS environment for image gathering and pose estimation.

The resulting environment consisted of the following software versions:

• Ubuntu 16.04 LTS
• ROS Kinetic
• Gazebo 7.16.1-1 xenial
• MoveIt 0.9.18-1 xenial


5. CONCLUSIONS

The goal of this thesis was to investigate which software frameworks are needed for creating a dependable simulation of automated Panda robot grasping. During the research, the grasping pipeline was described, and the main robotics frameworks were studied and, as a result, used to set up simulations of the main processes of automated grasping.

A robotic automated grasp simulation pipeline was set up, including the main components:

• Simulation of a robot manipulator with an RGB-D camera
• Object pose estimation in the simulation environment
• Robot control with motion planning

The work concludes that the considered frameworks, namely Gazebo, MoveIt, the Python interpreter and ROS, are suitable and sufficient for creating a reliable simulation of automated robot grasping, as was shown by the simulations. A user has to be highly cautious and attentive to software versions when setting up the environment.

Additionally, the PVN3D network performs well in the Gazebo environment with the CrankSlider dataset.


REFERENCES

[1] Groover, M. P. Automation. Oct. 22, 2020. URL: https://www.britannica.com/technology/automation (visited on 04/18/2021).
[2] Pham, Q. C., Madhavan, R., Righetti, L., Smart, W. and Chatila, R. The Impact of Robotics and Automation on Working Conditions and Employment [Ethical, Legal, and Societal Issues]. IEEE Robotics Automation Magazine 25 (June 2018), pp. 126–128. DOI: 10.1109/MRA.2018.2822058.
[3] Franka Emika Panda robot datasheet. Apr. 2020. URL: https://s3-eu-central-1.amazonaws.com/franka-de-uploads/uploads/Datasheet-EN.pdf (visited on 04/15/2021).
[4] Ahmad, S. Robotic assembly, using RGBD-based object pose estimation grasp detection. (Sept. 2020).
[5] Shalom, L. Introduction to Robot Operation System (ROS). Dec. 28, 2020. URL: https://dzone.com/articles/ros-robotic-operation-systems (visited on 04/15/2021).
[6] ROS concepts. URL: http://wiki.ros.org/ROS/Concepts (visited on 03/24/2021).
[7] Samarawickrama, K. RGB-D based Deep Learning methods for Robotic Perception and Grasping. Master's thesis, not yet published.
[8] Coleman, D., Sucan, I. A., Chitta, S. and Correll, N. Reducing the Barrier to Entry of Complex Robotic Software: a MoveIt! Case Study. Journal of Software Engineering for Robotics (May 2014). DOI: 10.6092/JOSER_2014_05_01_p3. URL: https://aisberg.unibg.it//handle/10446/87657.
[9] Summary of Rviz package. May 16, 2018. URL: http://wiki.ros.org/rviz (visited on 04/18/2021).
[10] Pas, A. ten, Gualtieri, M., Saenko, K. and Jr., R. P. Grasp Pose Detection in Point Clouds. CoRR abs/1706.09911 (2017). arXiv: 1706.09911. URL: http://arxiv.org/abs/1706.09911.
[11] Mahler, J., Liang, J., Niyaz, S., Laskey, M., Doan, R., Liu, X., Ojea, J. A. and Goldberg, K. Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics. CoRR abs/1703.09312 (2017). arXiv: 1703.09312. URL: http://arxiv.org/abs/1703.09312.
[12] Mousavian, A., Eppner, C. and Fox, D. 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. CoRR abs/1905.10520 (2019). arXiv: 1905.10520. URL: http://arxiv.org/abs/1905.10520.
[13] Pinto, L. and Gupta, A. Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. CoRR abs/1509.06825 (2015). arXiv: 1509.06825. URL: http://arxiv.org/abs/1509.06825.
[14] Levine, S., Sampedro, P. P., Krizhevsky, A., Ibarz, J. and Quillen, D. Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. (2017). URL: https://drive.google.com/open?id=0B0mFoBMu8f8%20BaHYzOXZMdzVOalU.
[15] Redmon, J. and Angelova, A. Real-Time Grasp Detection Using Convolutional Neural Networks. CoRR abs/1412.3128 (2014). arXiv: 1412.3128. URL: http://arxiv.org/abs/1412.3128.
[16] Xiang, Y., Schmidt, T., Narayanan, V. and Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. CoRR abs/1711.00199 (2017). arXiv: 1711.00199. URL: http://arxiv.org/abs/1711.00199.
[17] Peng, S., Liu, Y., Huang, Q., Bao, H. and Zhou, X. PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation. CoRR abs/1812.11788 (2018). arXiv: 1812.11788. URL: http://arxiv.org/abs/1812.11788.
[18] He, Y., Sun, W., Huang, H., Liu, J., Fan, H. and Sun, J. PVN3D: A Deep Point-wise 3D Keypoints Voting Network for 6DoF Pose Estimation. CoRR abs/1911.04231 (2019). arXiv: 1911.04231. URL: http://arxiv.org/abs/1911.04231.
[19] Pavlakos, G., Zhou, X., Chan, A., Derpanis, K. G. and Daniilidis, K. 6-DoF Object Pose from Semantic Keypoints. CoRR abs/1703.04670 (2017). arXiv: 1703.04670. URL: http://arxiv.org/abs/1703.04670.
[20] Park, K., Patten, T. and Vincze, M. Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation. CoRR abs/1908.07433 (2019). arXiv: 1908.07433. URL: http://arxiv.org/abs/1908.07433.
[21] Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. 2014. arXiv: 1312.6114 [stat.ML].
[22] Qi, C. R., Yi, L., Su, H. and Guibas, L. J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. CoRR abs/1706.02413 (2017). arXiv: 1706.02413. URL: http://arxiv.org/abs/1706.02413.
[23] Zhao, H., Shi, J., Qi, X., Wang, X. and Jia, J. Pyramid Scene Parsing Network. CoRR abs/1612.01105 (2016). arXiv: 1612.01105. URL: http://arxiv.org/abs/1612.01105.
[24] Deng, J., Dong, W., Socher, R., Li, L., Kai Li and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255. DOI: 10.1109/CVPR.2009.5206848.
[25] He, K., Zhang, X., Ren, S. and Sun, J. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.
[26] Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-Fei, L. and Savarese, S. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. CoRR abs/1901.04780 (2019). arXiv: 1901.04780. URL: http://arxiv.org/abs/1901.04780.
[27] Comaniciu, D. and Meer, P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24.5 (2002), pp. 603–619. DOI: 10.1109/34.1000236.
[28] Cruz, L., Lucio, D. and Velho, L. Kinect and RGBD Images: Challenges and Applications. Aug. 2012. DOI: 10.1109/SIBGRAPI-T.2012.13.
[29] Wasenmüller, O. and Stricker, D. Comparison of Kinect V1 and V2 Depth Images in Terms of Accuracy and Precision. (Nov. 2016). DOI: 10.1007/978-3-319-54427-4_3.
[30] SDF format documentation. URL: http://sdformat.org/tutorials?cat=specification& (visited on 04/17/2021).
[31] World File Syntax. Dec. 21, 2004. URL: http://playerstage.sourceforge.net/doc/Gazebo-manual-0.5-html/worldfile_syntax.html (visited on 04/17/2021).
[32] Summary of gazebo_ros_pkgs. Feb. 13, 2021. URL: http://wiki.ros.org/gazebo_ros_pkgs (visited on 04/18/2021).
[33] Summary of ros_control package. Aug. 16, 2020. URL: http://wiki.ros.org/ros_control (visited on 04/17/2021).
[34] Webots documentation. URL: https://cyberbotics.com/doc/guide/index (visited on 04/18/2021).
[35] Collins, J., Chand, S., Vanderkop, A. and Howard, D. A Review of Physics Simulators for Robotic Applications. IEEE Access 9 (2021), pp. 51416–51431. DOI: 10.1109/ACCESS.2021.3068769.


APPENDIX A: EXAMPLE DEFINITION OF A BOX IN GRASP_TRIAL_WORLD

<model name=’unit_box’>

<pose frame=’’>1.46069 -0.089674 0.5 0 -0 0</pose>

<link name=’link’>

<inertial>

<mass>1</mass>

<inertia>

<ixx>0.166667</ixx>

<ixy>0</ixy>

<ixz>0</ixz>

<iyy>0.166667</iyy>

<iyz>0</iyz>

<izz>0.166667</izz>

</inertia>

</inertial>

<collision name=’collision’>

<geometry>

<box>

<size>0.325959 0.450021 0.341363</size>

</box>

</geometry>

<max_contacts>10</max_contacts>

<surface>

<contact>

<ode/>

</contact>

<bounce/>

<friction>

<torsional>

<ode/>

</torsional>


<ode/>

</friction>

</surface>

</collision>

<visual name=’visual’>

<geometry>

<box>

<size>0.325959 0.450021 0.341363</size>

</box>

</geometry>

<material>

<script>

<name>Gazebo/Grey</name>

<uri>file://media/materials/scripts/gazebo.material</uri>

</script>

</material>

</visual>

<self_collide>0</self_collide>

<kinematic>0</kinematic>

</link>

</model>

APPENDIX B: ROS ENVIRONMENT FOR PANDA CONTROL SIMULATION
