
Full text of "Proceedings of the Robotics: Science & Systems 2006 Workshop - Manipulation for Human Environments"

Learning to Grasp Novel Objects using Vision 

Ashutosh Saxena, Justin Driemeyer, Justin Kearns, Chioma Osondu, Andrew Y. Ng 

{asaxena, jdriemeyer, jkearns, cosondu, ang}@cs.stanford.edu

Computer Science Department 

Stanford University, Stanford, CA 94305 



Abstract — We consider the problem of grasping novel objects, 
specifically, ones that are being seen for the first time through 
vision. We present a learning algorithm which predicts, as a 
function of the images, the position at which to grasp the object. 
This is done without building or requiring a 3-d model of the 
object. Our algorithm is trained via supervised learning, using 
synthetic images for the training set. Using our robot arm, we 
successfully demonstrate this approach by grasping a variety of 
differently shaped objects, such as duct tape, markers, mugs, 
pens, wine glasses, knife-cutters, jugs, keys, toothbrushes, books, 
and others, including many object types not seen in the training 
set. 1 

I. Introduction 

If we are seeing a novel object for the first time through a 
vision system, how can we autonomously grasp the object? In 
this paper, we address the problem of grasping non-deformable 
objects, including ones not seen before, and that the robot is 
perceiving for the first time through a web-camera. 

Modern-day robots can be carefully hand-programmed or 
"scripted" to carry out amazing manipulation tasks, from 
using tools to assemble complex machinery, to balancing a 
spinning top on the edge of a sword [14]. However, fully 
autonomous grasping of a previously unknown object still 
remains a challenging problem. If the object was previously 
known, or if we are able to obtain a full 3-d model of it, 
then various approaches, for example ones based on friction 
cones [4], pre-stored primitives [6], or other algorithms can 
be applied. However, in practical scenarios it is generally very 
difficult to obtain an accurate 3-d reconstruction of an object 
that we are seeing for the first time through vision. 2 

In this paper, we show that even without building a 3-d 
model of the object to be grasped, it is possible to identify 
a good grasp using learning algorithms. Specifically, there 
are certain visual features that indicate good grasps, and that 
remain consistent across many different objects. For example: 
jugs, cups, and mugs have handles; objects such as screw- 
drivers, toothbrushes, etc. can be grasped along the midpoint 
of their length; and so on. Given only a quick glance at almost 
any rigid object, most primates can quickly choose a grasp to 
pick it up; our work represents a first step towards designing a 

1 An extended version of this paper will appear in the proceedings of the 10th
International Symposium on Experimental Robotics (ISER), 2006 [13].

2 This is particularly true if we have only a single camera. But for objects 
without texture, even a stereo system would work poorly, and be able to 
reconstruct only the visible portions of the object. Finally, even if we try to 
"engineer" the problem away and use a laser (or active stereo) to estimate 
depths, we would still have only a 3-d reconstruction of the front face of the 
object. 



vision grasping algorithm which can do the same. We also take 
inspiration from Castiello [2], who showed that for commonly 
used objects, cognitive cues and prior knowledge are used in 
visually guided grasping by humans and monkeys. 

Learning algorithms have been applied to grasping problems 
before. For example, Jebara et al. [8] used a supervised 
learning algorithm to learn grasps, but only assuming a full 
3-d model of the object. Piater described an algorithm [9] to
position single fingers given a top-down view of an object, but
applied it only to very simple objects (specifically, square,
triangle and round "blocks"). Platt et al. [10], [11] learned
to sequence together manipulation gaits, but again assumed a 
specific, known, object. Wheeler et al. [15] used Q-learning 
for hand selection. 

To pick up an object, we need to identify the grasp — more 
formally, a position and configuration for the end-effector. This 
paper focuses on the task of grasp identification, and thus 
we will consider only objects that can be picked up without 
performing complex manipulation, 3 and that are commonly 
found in an office or household environment, e.g., toothbrush, 
pens, books, mugs, martini glass, jugs, keys, duct tape roll, 
markers. (Fig. 1) 

This paper will emphasize grasping previously unknown 
objects in uncluttered environments (for example, when the 
objects are placed against a uniform-colored background). 
However, Section IV will also present preliminary results 
on applying our approach to the cluttered background of a 
dishwasher rack. The remainder of this paper is structured as 
follows. Section II describes our machine learning approach 
for grasp identification. Trajectory planning (on our 5 dof arm) 
is then briefly discussed in Section III. Section IV presents our 
experimental results, and finally Section V concludes. 

II. Learning the Grasping Point 

There are certain visual features that indicate good grasps, 
and that remain consistent across many different objects. For 
example: jugs, cups, and mugs have handles; objects like 
screwdrivers, toothbrushes, etc. can be grasped in the center. 
We propose a learning algorithm that learns to use visual 
features to identify good grasping points across a large range 
of objects. 

More precisely, we will predict the grasp as a function of the
image. An image is a projection of the three-dimensional world 

3 For example, picking up a heavy book lying flat on a table might require
a sequence of complex manipulations, such as to first slide it to the edge of 
the table. 




Fig. 1. Some real objects on which the grasping algorithm was tested. 



onto an image plane, which does not have depth information. 
Therefore, we will predict the 2-d location of the grasp in the 
image, which corresponds to the projection of the 3-d grasp 
point into the image plane. We use supervised learning for 
this task, with synthetic images (generated using computer 
graphics) as our training data. We then use two (or more) 
images to triangulate and obtain the 3-d location of the grasp. 

A. Synthetic Data for Training 

Collecting real-world data is cumbersome and prone to
labeling errors. Generating perfectly labeled synthetic data is
significantly easier and less time-consuming than collecting
and labeling real images.

Therefore, we generate synthetic images along with labels 
denoting the correct grasp (Fig. 2) using a computer graphics 
ray tracer. 4 The advantages of using synthetic images are 
multi-fold [5]. Once a synthetic model for the object has 
been created, a large number of training examples can be 
generated with random lighting conditions, camera position 
and orientation, etc. Additionally, to increase the diversity in 
our data, we randomized some properties of the object as well 
such as color, scale, and text (e.g. on the face of a book). The 
time-consuming part of synthetic data generation is the manual 
creation of the numerical models of the objects. However, 
there are many objects for which models are available on the 
internet, and that can be used with only minor modifications. 
Synthetic data also provides perfectly labeled data, i.e., the 
exact location of the grasp in 3-d coordinate space, which 
would be significantly more difficult to obtain if using training 
data from real images. 
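As a rough illustration of the randomization described above, the following sketch draws one random rendering configuration per training image. The parameter names, value ranges, and the `sample_scene_parameters` helper are hypothetical and are not taken from the paper; in practice such values would be written into the ray tracer's scene description before rendering.

```python
import random

def sample_scene_parameters(rng):
    """Draw one random rendering configuration (all ranges are illustrative assumptions)."""
    return {
        # Random light placement around the object.
        "light_position": [rng.uniform(-2.0, 2.0) for _ in range(3)],
        # Random camera position above the table, plus a small roll.
        "camera_position": [rng.uniform(-1.0, 1.0),
                            rng.uniform(-1.0, 1.0),
                            rng.uniform(0.5, 1.5)],
        "camera_roll_deg": rng.uniform(-15.0, 15.0),
        # Randomized object properties: scale and color.
        "object_scale": rng.uniform(0.8, 1.2),
        "object_color_rgb": [rng.random() for _ in range(3)],
    }

rng = random.Random(0)  # fixed seed so the sampled set is reproducible
scenes = [sample_scene_parameters(rng) for _ in range(5)]
```

Because the object model and the sampled camera pose are both known, the grasp label for each rendered image can be obtained by projecting the known 3-d grasp point through that camera, which is what makes the labels exact.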

4 Ray tracing [3] is a standard image rendering method in computer graphics. 
It handles many real-world phenomena such as multiple specular reflections,
texture mapping, soft shadows, smooth curves, and caustics. We used PovRay, 
an open source ray tracer. 



B. Grasping Point Classification 

Given the training set, our algorithm learns to identify 
grasping regions in the images. More precisely, given the 
training set, the learning algorithm predicts the 2-d position 
of the grasp projected into the image plane. The algorithm 
uses a set of features of the image, which include edges and 
texture information, applied at various scales [12]. Using these 
features, we apply logistic regression to decide whether each 
position in the 2-d image plane corresponds to a valid grasp 
point. 

In detail, the logistic regression algorithm models the prob- 
ability of a particular patch of the image being a valid grasping 
point as: 

$$P(y = 1 \mid x; w) = \frac{1}{1 + e^{-w^T x}} \qquad (1)$$

Here, $w \in \mathbb{R}^{459}$ are the parameters, which are learned by
maximum likelihood. The features $x \in \mathbb{R}^{459}$ we use for
the patch include edges and texture information [12], applied 
at three scales, and appended with the filter outputs for the 
surrounding patches. Fig. 3 shows some predicted valid grasps 
on real images. 
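The model in Eq. (1) and its maximum-likelihood fit can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the random features stand in for the real 459-dimensional patch features, and the use of batch gradient ascent, the learning rate, and the iteration count are all assumptions, since the paper does not say which optimizer was used.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written via tanh to avoid overflow."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def grasp_probability(w, X):
    """P(y = 1 | x; w) for each row x of X, as in Eq. (1)."""
    return sigmoid(X @ w)

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Maximum-likelihood fit by batch gradient ascent on the mean log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # Gradient of the mean log-likelihood: X^T (y - p) / n.
        grad = X.T @ (y - grasp_probability(w, X)) / len(y)
        w += lr * grad
    return w

# Toy stand-in data: 200 patches with 459 random features (not real image features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 459))
w_true = rng.normal(size=459)
y = (X @ w_true > 0).astype(float)  # 1 = valid grasp patch, 0 = not

w = fit_logistic(X, y)
p = grasp_probability(w, X)
train_accuracy = np.mean((p > 0.5) == (y == 1))
```

At test time, the same `grasp_probability` call would be applied to the feature vector of every candidate patch in a new image, and patches with high probability would be treated as candidate grasp points.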

C. Approximate Triangulation 

Given two (or more) images of a new object from different 
camera positions and the predicted 2-d grasp positions in each 
image, we need to triangulate to obtain 3-d positions of the 
grasping points (Fig. 4). Note that we perform triangulation 
only to identify the 3-d position of the grasp, not for full 3-d 
reconstruction. Indeed, many of our test objects are textureless 
or reflective, and 3-d reconstruction using standard stereopsis 
would perform poorly on them. We use a triangulation algo- 
rithm that is more complex than one based on standard geo- 
metric calculations to handle the learning algorithm's output 
being slightly noisy/uncertain, and to handle the possibility 
of there being multiple valid grasp points on an object. In our 



Fig. 2. Synthetic images of the objects used for training. 




Fig. 3. Grasping point classification. The red points show the predicted grasping point. 



experiments, this significantly increases algorithmic robustness 
in the face of ambiguity in triangulation. 
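For reference, the standard geometric calculation that the paragraph contrasts against can be sketched as a least-squares intersection of camera rays. This is the plain geometric baseline, not the statistical triangulation with Gaussian cones used in the paper; the camera positions and the `triangulate_rays` helper are illustrative assumptions.

```python
import numpy as np

def triangulate_rays(origins, directions):
    """Least-squares 3-d point closest to a set of camera rays.

    Minimizes the sum of squared perpendicular distances to each ray,
    which gives the linear system  sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) c_i.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(origins, directions):
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projector onto the plane normal to the ray
        A += P
        b += P @ np.asarray(c, dtype=float)
    return np.linalg.solve(A, b)

# Two hypothetical camera centres, each with a ray through the same grasp point.
target = np.array([0.3, 0.1, 0.2])
cams = [np.array([0.0, 0.0, 0.5]), np.array([0.6, -0.2, 0.5])]
rays = [target - c for c in cams]
estimate = triangulate_rays(cams, rays)
```

With noise-free rays the estimate coincides with the true point. With noisy 2-d predictions, or with several valid grasp points per image, this baseline has no way to express uncertainty or to decide which predictions correspond, which is the motivation the text gives for a more elaborate method.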







Fig. 4. Statistical Triangulation of the grasping point. The dark blue lines, 
originating from the camera locations, show the camera direction. Light blue 
shows the Gaussian cones for the candidate grasp regions. The dark blue spot 
is the predicted grasping point. 

III. Control 

To grasp an object (using our 5-dof arm), we plan a 
trajectory to take the end-effector to an approach position, 5 
and then move the end-effector in a straight line towards the 
predicted grasp point. 
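The approach-then-advance motion can be sketched in Cartesian space as follows. This is an illustration under stated assumptions: the 10 cm approach distance, the number of waypoints, and both helper functions are hypothetical, and converting the waypoints into joint positions for the arm (inverse kinematics) is not shown.

```python
import numpy as np

def approach_position(grasp_point, approach_dir, distance):
    """Point `distance` away from the grasp point, back along the approach direction."""
    d = np.asarray(approach_dir, dtype=float)
    d = d / np.linalg.norm(d)
    return np.asarray(grasp_point, dtype=float) - distance * d

def straight_line_waypoints(start, goal, n_steps):
    """Evenly spaced Cartesian waypoints from start to goal, both endpoints included."""
    return np.linspace(np.asarray(start, dtype=float),
                       np.asarray(goal, dtype=float),
                       n_steps)

grasp = np.array([0.40, 0.00, 0.10])   # predicted grasp point, metres (made-up values)
downward = np.array([0.0, 0.0, -1.0])  # approach direction for a downward grasp
start = approach_position(grasp, downward, distance=0.10)
path = straight_line_waypoints(start, grasp, n_steps=5)
```

Here the end-effector would first be moved to `start`, 10 cm above the grasp point, and then stepped through `path` until it reaches the grasp point itself.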

We use two classes of grasps: downward and outward, 
which arise because of the workspace of the 5-dof arm (Fig. 5). 

5 The approach position is defined to be a point a fixed distance away from 
the predicted grasp point. 



A "Downward" grasp is for objects that are close to the base of 
the arm, and which the arm can reach in a downward direction. 
An "Outward" grasp is for objects further away from the base, 
which the arm is unable to reach in a downward direction. To 
choose the class of the grasp, we scan the workspace of the 
arm and determine which region contains the object. 6 

IV. Experiments 

A. Hardware Setup 

We used the STAIR (STanford AI Robot, see Fig. 7) robot 
built at Stanford University. This platform is equipped with a 
robotic arm mounted on a mobile platform, together with other 
equipment such as cameras, microphones, etc. The long-term 
goal of the STAIR project is to create a robot that can navigate 
home and office environments, pick up and interact with 
objects and tools (including carrying out more complex tasks 
such as unloading a dishwasher), and intelligently converse 
with and help people in these environments. Clearly, the ability
to grasp a novel object represents an interesting and necessary 
small step towards these goals. 

The robotic arm on STAIR is a light 4 kg, 5-dof arm 
(Katana [7]) equipped with a parallel plate gripper. It holds 
a payload of 500 g, and has a horizontal reach of 62 cm and a
vertical reach of 79 cm. The positioning accuracy of the arm
is ±1 mm. It is a position controlled arm, i.e., it requires 
specification of joint locations instead of torques. Our vision 
system uses a low-quality webcam mounted near the end- 
effector. 

6 We determine the position of objects in the picture of the robot workspace 
by thresholding the saturation channel of the HSV (Hue-Saturation-Value) representation of the
image. We use this position to determine the region in which the object lies. 
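The saturation thresholding described in footnote 6 can be sketched as follows. This is a minimal illustration: the threshold value of 0.3, the use of the centroid as the object position, and the synthetic test image are assumptions not stated in the paper.

```python
import numpy as np

def saturation_channel(rgb):
    """HSV saturation, (max - min) / max, of an RGB image with values in [0, 1]."""
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    safe_mx = np.where(mx > 0, mx, 1.0)  # avoid division by zero on black pixels
    return np.where(mx > 0, (mx - mn) / safe_mx, 0.0)

def locate_object(rgb, threshold=0.3):
    """Centroid (row, col) of pixels whose saturation exceeds the threshold, or None."""
    mask = saturation_channel(rgb) > threshold
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

# Synthetic test image: white background (saturation 0) with a saturated red square.
image = np.ones((100, 100, 3))
image[20:40, 60:80] = [1.0, 0.0, 0.0]
centroid = locate_object(image)
```

A white or grey background has near-zero saturation, so colored objects stand out in this channel. The resulting image position could then be compared against the workspace regions to choose between a downward and an outward grasp.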




Fig. 5. The robot arm picking up various objects: jug, box, screwdriver, duct-tape, wine glass, book, a chip-holder, powerhorn, and cellphone. 



B. Results and Discussion 

We first tested the algorithm for its predictive capability 
on synthetic images not in the training set. The average 
classification accuracy was 94.2%, although the accuracy in 
predicting a 3-d grasp point was higher than the classification 
accuracy may suggest because 3-d triangulation takes care of 
some errors in the classification step. 

Next, we tested the algorithm on the STAIR robot. The 
task was to use input from a web-camera, mounted on the 
robot, to pick up an object placed in front of the robot against 
a white background. The parameters of the vision algorithm 
were trained from synthetic images of a small set of objects, 
namely books, martini glasses, white-board erasers, coffee 
mugs, tea cups and pencils. We performed experiments on 
coffee mugs, wine glasses, pencils, books, and erasers — but 
all of different dimensions and appearance than the ones in the 
training set — as well as a large set of novel objects, such as 
duct tape rolls, markers, a translucent box, jugs, knife-cutters, 
a cellphone, pens, keys, screwdrivers, a stapler, toothbrushes, a 
thick coil of wire, a strangely shaped power horn, etc. (Fig. 1). 

The algorithm for predicting grasps in images appears to 
generalize very well. Despite being tested on images of real 
(rather than synthetic) objects, including many very different 



from ones in the training set, it was usually able to identify 
correct grasp points. We note that test error (in terms of 
average error in predicting a good grasp point) on the real 
images was only somewhat higher than the error on synthetic 
images, showing that the algorithm trained on synthetic images 
transfers well to real images. (Over all 5 object types used in 
the synthetic data, average absolute error was 0.8cm 7 in the 
synthetic images; and over all the 11 real test objects, average 
error was 1.8cm.) For comparison, neonate humans can grasp 
simple objects with an average accuracy of 1.5 cm. [1] 

Table I shows the errors in actual grasping points that 
we obtained on the real dataset. The table presents results 
separately for objects which are similar to those we trained 
on (e.g., coffee mugs) and those which were very dissimilar 
to the training objects (e.g., duct tape). In addition to reporting 
errors in grasp positions, we also report the grasp-rate, i.e., the 
fraction of times the robotic arm was able to physically pick 
up the object (out of 4 trials). On average, the robot succeeded
in picking up a novel object 87.5% of the time. Overall, the 
algorithm worked well when there was a clear best grasping 
region in the object (e.g. for pens, mugs, wine glass). 

7 Units based on typical size of real world objects represented by the 
synthetic images (e.g., a typical mug is 25 cm. high, etc.) 



TABLE I

Average absolute error in locating the grasp point for different objects, as well as success rate in grasping objects using our robot arm.

Objects similar to ones in the training set

Tested on                        Mean Error (cm)   Grasp-Rate
Mugs                             2.8               75%
Pens                             0.9               100%
Wine Glass                       1.1               100%
Books                            2.9               75%
Eraser/Cellphone                 1.6               100%
Overall                          1.86              90%

Novel Objects

Tested on                        Mean Error (cm)   Grasp-Rate
Keys/Markers                     1.2               100%
Toothbrush/Cutter/Screwdriver    1.1               100%
Jug                              1.7               75%
Powerhorn                        3.5               50%
Duct Tape                        1.8               100%
Coiled Wire                      1.4               100%
Overall                          1.77              87.5%



For simple objects like cellphones, wine glasses, keys, 
toothbrushes, etc., the algorithm performed perfectly (100% 
grasp-rate). However, objects such as mugs and jugs allow 
only a narrow trajectory of approach; as a result, a minor 
error in grasping point prediction can cause the arm to hit 
and move the object, resulting in failure to grasp, and thus a 
lower overall success rate. We believe that these problems can 
be solved with better control strategies using haptic feedback. 
Some of the failures can also be attributed to the fixed gripper 
width used across all objects; this can be solved by learning 
how much the gripper should open. Videos of the arm picking 
up various objects are available at 

http://ai.stanford.edu/~asaxena/learninggrasp/

In many instances, the algorithm was able to pick up com- 
pletely novel objects (strangely shaped power-horn, duct-tape, 
etc.; see Fig. 1) by identifying grasping points. Perceiving 
a transparent wine glass is a difficult problem for standard 
vision (e.g., stereopsis) algorithms because of reflections, etc. 
However, as shown in Table I, our algorithm successfully 
picked it up 100% of the time. The same rate of success holds 
even if the glass is 2/3 filled with water. 

Finally, we also tested the algorithm for predicting the 
grasping point in a cluttered environment, specifically in 
a dishwasher. To make the algorithm robust to dishwasher 
clutter, we included a number 8 of hand labeled, real images 
of objects in a dishwasher along with the synthetic images for 
training. This resulted in performance that appeared extremely 
robust to the background dishwasher clutter. A few examples 
of predicted grasp points are shown in Figure 6. In the case of 
multiple objects, our algorithm picks the grasping point which 

8 30 real images in addition to 2500+ synthetic images 




Fig. 7. The robot arm mounted on a mobile base. The 5-dof arm was used 
to pick up the objects. 



has the highest response. 

V. Conclusions 

We described a machine learning algorithm for identifying a 
grasping point on a previously unknown object that a robot is 
perceiving for the first time using vision. Our algorithm does 
not require (and nor does it build) a 3-d model of the object, 
and was applied to grasping a number of novel objects using 
our robotic arm. 

Acknowledgments 

We thank Morgan Quigley, Anya Petrovskaya and Jimmy 
Zhang for help with the robot arm control driver software. We 
also thank David Ho and Seokchang Ryu for help with running 
simulations. This work was supported by the DARPA transfer 
learning program under contract number FA8750-05-2-0249. 

References 

[1] T. G. R. Bower, J. M. Broughton, and M. K. Moore. Demonstration of intention in the reaching behaviour of neonate humans. Nature, 228:679-681, 1970.
[2] U. Castiello. The neuroscience of grasping. Nature Reviews Neuroscience 6, pages 726-736, 2005.
[3] A. S. Glassner. An Introduction to Ray Tracing. Morgan Kaufmann Publishers, Inc., San Francisco, 1989.







Fig. 6. Grasping point classification (shown in red) for objects placed in dishwasher. In case of multiple objects, the algorithm picks one it thinks is the best. 



[4] M. T. Mason and J. K. Salisbury. Manipulator grasping and pushing operations. In Robot Hands and the Mechanics of Manipulation. The MIT Press, Cambridge, MA, 1985.
[5] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In ICML, 2005.
[6] Miller et al. Automatic grasp planning using shape primitives. In ICRA, 2003.
[7] Neuronics. Katana user manual. http://www.neuronics.ch/, 2004.
[8] R. Pelossof et al. An SVM learning approach to robotic grasping. In ICRA, 2004.
[9] J. H. Piater. Learning visual features to predict hand orientations. In ICML Workshop on Machine Learning of Spatial Knowledge, 2002.
[10] R. Platt et al. Manipulation gaits: Sequences of grasp control tasks. In ICRA, 2004.
[11] R. Platt, A. H. Fagg, and R. Grupen. Reusing schematic grasping policies. In IEEE-RAS International Conference on Humanoid Robots, Tsukuba, Japan, 2005.
[12] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In NIPS 18, 2005.
[13] A. Saxena, J. Driemeyer, J. Kearns, C. Osondu, and A. Y. Ng. Learning to grasp novel objects using vision. In ISER, 2006.
[14] T. Shin-ichi and M. Satoshi. Living and working with robots. Nipponia, 2000.
[15] D. Wheeler, A. H. Fagg, and R. Grupen. Learning prospective pick and place behavior. In IEEE Conf on Development & Learning, 2002.