Learning to Grasp Novel Objects using Vision
Ashutosh Saxena, Justin Driemeyer, Justin Kearns, Chioma Osondu, Andrew Y. Ng
{asaxena, jdriemeyer, jkearns, cosondu, ang}@cs.stanford.edu
Computer Science Department
Stanford University, Stanford, CA 94305
Abstract — We consider the problem of grasping novel objects,
specifically, ones that are being seen for the first time through
vision. We present a learning algorithm which predicts, as a
function of the images, the position at which to grasp the object.
This is done without building or requiring a 3-d model of the
object. Our algorithm is trained via supervised learning, using
synthetic images for the training set. Using our robot arm, we
successfully demonstrate this approach by grasping a variety of
differently shaped objects, such as duct tape, markers, mugs,
pens, wine glasses, knife-cutters, jugs, keys, toothbrushes, books,
and others, including many object types not seen in the training
set. 1
I. Introduction
If we are seeing a novel object for the first time through a
vision system, how can we autonomously grasp the object? In
this paper, we address the problem of grasping non-deformable
objects, including ones not seen before, and that the robot is
perceiving for the first time through a web-camera.
Modern-day robots can be carefully hand-programmed or
"scripted" to carry out amazing manipulation tasks, from
using tools to assemble complex machinery, to balancing a
spinning top on the edge of a sword [14]. However, fully
autonomous grasping of a previously unknown object still
remains a challenging problem. If the object was previously
known, or if we are able to obtain a full 3-d model of it,
then various approaches, for example ones based on friction
cones [4], pre-stored primitives [6], or other algorithms can
be applied. However, in practical scenarios it is generally very
difficult to obtain an accurate 3-d reconstruction of an object
that we are seeing for the first time through vision. 2
In this paper, we show that even without building a 3-d
model of the object to be grasped, it is possible to identify
a good grasp using learning algorithms. Specifically, there
are certain visual features that indicate good grasps, and that
remain consistent across many different objects. For example:
jugs, cups, and mugs have handles; objects such as screw-
drivers, toothbrushes, etc. can be grasped along the midpoint
of their length; and so on. Given only a quick glance at almost
any rigid object, most primates can quickly choose a grasp to
pick it up; our work represents a first step towards designing a
1 An extended version of this paper will appear in the proceedings of the 10th
International Symposium on Experimental Robotics (ISER), 2006 [13].
2 This is particularly true if we have only a single camera. But for objects
without texture, even a stereo system would work poorly, and be able to
reconstruct only the visible portions of the object. Finally, even if we try to
"engineer" the problem away and use a laser (or active stereo) to estimate
depths, we would still have only a 3-d reconstruction of the front face of the
object.
vision grasping algorithm which can do the same. We also take
inspiration from Castiello [2], who showed that for commonly
used objects, cognitive cues and prior knowledge are used in
visually guided grasping by humans and monkeys.
Learning algorithms have been applied to grasping problems
before. For example, Jebara et al. [8] used a supervised
learning algorithm to learn grasps, but only assuming a full
3-d model of the object. Piater described an algorithm [9] to
position single fingers given a top-down view of an object, but
applied it only to very simple objects (specifically, square,
triangular, and round "blocks"). Platt et al. [10], [11] learned
to sequence together manipulation gaits, but again assumed a
specific, known object. Wheeler et al. [15] used Q-learning
for hand selection.
To pick up an object, we need to identify the grasp — more
formally, a position and configuration for the end-effector. This
paper focuses on the task of grasp identification, and thus
we will consider only objects that can be picked up without
performing complex manipulation, 3 and that are commonly
found in an office or household environment, e.g., toothbrush,
pens, books, mugs, martini glass, jugs, keys, duct tape roll,
markers. (Fig. 1)
This paper will emphasize grasping previously unknown
objects in uncluttered environments (for example, when the
objects are placed against a uniform-colored background).
However, Section IV will also present preliminary results
on applying our approach to the cluttered background of a
dishwasher rack. The remainder of this paper is structured as
follows. Section II describes our machine learning approach
for grasp identification. Trajectory planning (on our 5 dof arm)
is then briefly discussed in Section III. Section IV presents our
experimental results, and finally Section V concludes.
II. Learning the Grasping Point
There are certain visual features that indicate good grasps,
and that remain consistent across many different objects. For
example: jugs, cups, and mugs have handles; objects like
screwdrivers, toothbrushes, etc. can be grasped in the center.
We propose a learning algorithm that learns to use visual
features to identify good grasping points across a large range
of objects.
More precisely, we will predict the grasp as a function of the
image. An image is a projection of the three-dimensional world
3 For example, picking up a heavy book lying flat on table might require
a sequence of complex manipulations, such as to first slide it to the edge of
the table.
Fig. 1. Some real objects on which the grasping algorithm was tested.
onto an image plane, which does not have depth information.
Therefore, we will predict the 2-d location of the grasp in the
image, which corresponds to the projection of the 3-d grasp
point into the image plane. We use supervised learning for
this task, with synthetic images (generated using computer
graphics) as our training data. We then use two (or more)
images to triangulate and obtain the 3-d location of the grasp.
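The relationship between a 3-d grasp point and its 2-d image label can be illustrated with a simple pinhole camera model. The following is a minimal sketch under stated assumptions, not the code used in our experiments: the function name `project_point`, the focal length, and the principal point are hypothetical values chosen only for illustration.

```python
import numpy as np

def project_point(point_cam, focal_px, principal_point):
    """Project a 3-d point given in camera coordinates (z pointing forward)
    onto the image plane of an ideal pinhole camera, returning pixel (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    u = focal_px * x / z + principal_point[0]
    v = focal_px * y / z + principal_point[1]
    return np.array([u, v])

# Hypothetical grasp point 50 cm in front of the camera (units: meters).
grasp_3d = np.array([0.05, -0.02, 0.50])
# Hypothetical intrinsics: 500 px focal length, 640x480 image center.
label_2d = project_point(grasp_3d, focal_px=500.0, principal_point=(320.0, 240.0))
print(label_2d)
```

The depth z is divided out by the projection, which is why a single image constrains the grasp only to a ray and two or more views are needed to recover its 3-d location.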
A. Synthetic Data for Training
Collecting real-world data is cumbersome and prone to
labeling errors. Generating perfectly labeled synthetic data is
significantly easier and less time-consuming than collecting
and labeling real images.
Therefore, we generate synthetic images along with labels
denoting the correct grasp (Fig. 2) using a computer graphics
ray tracer. 4 The advantages of using synthetic images are
multi-fold [5]. Once a synthetic model for the object has
been created, a large number of training examples can be
generated with random lighting conditions, camera position
and orientation, etc. Additionally, to increase the diversity in
our data, we randomized some properties of the object as well
such as color, scale, and text (e.g. on the face of a book). The
time-consuming part of synthetic data generation is the manual
creation of the numerical models of the objects. However,
there are many objects for which models are available on the
internet, and that can be used with only minor modifications.
Synthetic data also provides perfectly labeled data, i.e., the
exact location of the grasp in 3-d coordinate space, which
would be significantly more difficult to obtain if using training
data from real images.
4 Ray tracing [3] is a standard image rendering method in computer graphics.
It handles many real-world phenomena such as multiple specular reflections,
texture mapping, soft shadows, smooth curves, and caustics. We used PovRay,
an open source ray tracer.
B. Grasping Point Classification
Given the training set, our algorithm learns to identify
grasping regions in the images. More precisely, given the
training set, the learning algorithm predicts the 2-d position
of the grasp projected into the image plane. The algorithm
uses a set of features of the image, which include edges and
texture information, applied at various scales [12]. Using these
features, we apply logistic regression to decide whether each
position in the 2-d image plane corresponds to a valid grasp
point.
In detail, the logistic regression algorithm models the prob-
ability of a particular patch of the image being a valid grasping
point as:
\[ P(y = 1 \mid x; w) = \frac{1}{1 + e^{-w^{T} x}} \qquad (1) \]
Here, $w \in \mathbb{R}^{459}$ are the parameters, which are learned by
maximum likelihood. The features $x \in \mathbb{R}^{459}$ we use for
the patch include edges and texture information [12], applied
at three scales, and appended with the filter outputs for the
surrounding patches. Fig. 3 shows some predicted valid grasps
on real images.
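The logistic model of Eq. (1) and its maximum-likelihood fit can be sketched in a few lines. This is an illustrative sketch only: it uses a toy 4-dimensional feature vector with random values in place of the 459 edge and texture features, and plain batch gradient ascent, which is one standard way to maximize the likelihood and is assumed here rather than taken from our implementation. The names `grasp_probability` and `fit_logistic` are hypothetical.

```python
import numpy as np

def grasp_probability(w, X):
    """Eq. (1): P(y = 1 | x; w) = 1 / (1 + exp(-w^T x)), applied to each row of X."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def fit_logistic(X, y, learning_rate=0.5, steps=2000):
    """Fit w by batch gradient ascent on the mean log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the mean log-likelihood: X^T (y - p) / n.
        gradient = X.T @ (y - grasp_probability(w, X)) / len(y)
        w += learning_rate * gradient
    return w

# Toy data (assumption): 200 patches with 4 made-up features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 0] = 1.0  # constant bias feature
true_w = np.array([-0.5, 2.0, -1.0, 0.0])
# Labels sampled from the logistic model itself, so they are noisy.
y = (rng.random(200) < grasp_probability(true_w, X)).astype(float)

w_hat = fit_logistic(X, y)
patch_scores = grasp_probability(w_hat, X)  # one probability per patch
```

Applying `grasp_probability` to the feature vector of every patch in an image yields a probability map from which candidate grasp regions can be read off.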
C. Approximate Triangulation
Given two (or more) images of a new object from different
camera positions and the predicted 2-d grasp positions in each
image, we need to triangulate to obtain 3-d positions of the
grasping points (Fig. 4). Note that we perform triangulation
only to identify the 3-d position of the grasp, not for full 3-d
reconstruction. Indeed, many of our test objects are textureless
or reflective, and 3-d reconstruction using standard stereopsis
would perform poorly on them. We use a triangulation algo-
rithm that is more complex than one based on standard geo-
metric calculations to handle the learning algorithm's output
being slightly noisy/uncertain, and to handle the possibility
of there being multiple valid grasp points on an object. In our
Fig. 2. Synthetic images of the objects used for training.
Fig. 3. Grasping point classification. The red points show the predicted grasping point.
experiments, this significantly increases algorithmic robustness
in the face of ambiguity in triangulation.
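For reference, the standard geometric calculation that our method extends can be sketched as a least-squares intersection of camera rays. This sketch is the simple baseline only, not the statistical triangulation used in our experiments: it assumes one predicted grasp ray per camera, ignores uncertainty, and cannot handle multiple candidate grasp points. The function name `closest_point_to_rays` and the camera positions are hypothetical.

```python
import numpy as np

def closest_point_to_rays(origins, directions):
    """Return the 3-d point minimizing the sum of squared distances to a set
    of rays, each given by a camera origin and a viewing direction."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        # P removes the component along the ray, leaving the offset from it.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ c
    # A is invertible as long as the rays are not all parallel.
    return np.linalg.solve(A, b)

# Hypothetical setup: two camera positions 30 cm apart viewing one grasp point.
target = np.array([0.1, 0.2, 0.6])
cams = [np.array([0.0, 0.0, 0.0]), np.array([0.3, 0.0, 0.0])]
dirs = [target - c for c in cams]
estimate = closest_point_to_rays(cams, dirs)
print(estimate)
```

With noisy 2-d predictions the rays no longer intersect exactly, and this baseline returns the point nearest to all of them; handling several plausible grasp regions per image requires the more elaborate treatment described above.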
Fig. 4. Statistical Triangulation of the grasping point. The dark blue lines,
originating from the camera locations, show the camera direction. Light blue
shows the Gaussian cones for the candidate grasp regions. The dark blue spot
is the predicted grasping point.
III. Control
To grasp an object (using our 5-dof arm), we plan a
trajectory to take the end-effector to an approach position, 5
and then move the end-effector in a straight line towards the
predicted grasp point.
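The approach-then-advance motion can be sketched in Cartesian space as follows. This is a minimal sketch under assumptions: the 10 cm standoff, the number of waypoints, and the function name `approach_waypoints` are hypothetical, and the conversion of waypoints to joint commands for the arm is not shown.

```python
import numpy as np

def approach_waypoints(grasp_point, approach_direction, standoff=0.10, num_steps=5):
    """Return the approach position (a fixed distance back from the grasp point
    along the approach direction) and straight-line waypoints to the grasp."""
    direction = np.asarray(approach_direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    grasp = np.asarray(grasp_point, dtype=float)
    approach = grasp - standoff * direction
    # Evenly spaced interpolation from the approach position to the grasp point.
    fractions = np.linspace(0.0, 1.0, num_steps)[:, None]
    waypoints = approach + fractions * (grasp - approach)
    return approach, waypoints

# Hypothetical downward grasp: the end-effector moves along -z (units: meters).
approach, path = approach_waypoints([0.30, 0.00, 0.05], [0.0, 0.0, -1.0])
print(approach)
print(path)
```

For an outward grasp the same function would be called with a horizontal approach direction instead of a vertical one.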
We use two classes of grasps: downward and outward,
which arise because of the workspace of the 5-dof arm (Fig. 5).
5 The approach position is defined to be a point a fixed distance away from
the predicted grasp point.
A "Downward" grasp is for objects that are close to the base of
the arm, and which the arm can reach in a downward direction.
An "Outward" grasp is for objects further away from the base,
which the arm is unable to reach in a downward direction. To
choose the class of the grasp, we scan the workspace of the
arm and determine which region contains the object. 6
IV. Experiments
A. Hardware Setup
We used the STAIR (STanford AI Robot, see Fig. 7) robot
built at Stanford University. This platform is equipped with a
robotic arm mounted on a mobile platform, together with other
equipment such as cameras, microphones, etc. The long-term
goal of the STAIR project is to create a robot that can navigate
home and office environments, pick up and interact with
objects and tools (including carrying out more complex tasks
such as unloading a dishwasher), and intelligently converse
with and help people in these environments. Clearly, the ability
to grasp a novel object represents an interesting and necessary
small step towards these goals.
The robotic arm on STAIR is a lightweight (4 kg), 5-dof arm
(Katana [7]) equipped with a parallel plate gripper. It has
a payload of 500 g, a horizontal reach of 62 cm, and a
vertical reach of 79 cm. The positioning accuracy of the arm
is ±1 mm. It is a position-controlled arm, i.e., it requires
specification of joint positions instead of torques. Our vision
system uses a low-quality webcam mounted near the end-
effector.
6 We determine the position of objects in the picture of the robot workspace
by thresholding the saturation channel of HSV (Hue-Saturation-Value) of the
image. We use this position to determine the region in which the object lies.
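Locating an object by thresholding the saturation channel can be sketched with NumPy alone. This is an illustrative sketch, not our implementation: the threshold of 0.4, the synthetic test image, and the function names `saturation_channel` and `object_centroid` are assumptions.

```python
import numpy as np

def saturation_channel(rgb):
    """HSV saturation per pixel: (max - min) / max, defined as 0 for black."""
    rgb = np.asarray(rgb, dtype=float)
    max_c = rgb.max(axis=-1)
    min_c = rgb.min(axis=-1)
    sat = np.zeros_like(max_c)
    nonzero = max_c > 0
    sat[nonzero] = (max_c[nonzero] - min_c[nonzero]) / max_c[nonzero]
    return sat

def object_centroid(rgb, threshold=0.4):
    """Return the (row, col) centroid of pixels whose saturation exceeds the
    threshold, or None if no pixel does."""
    mask = saturation_channel(rgb) > threshold
    if not mask.any():
        return None
    rows, cols = np.nonzero(mask)
    return rows.mean(), cols.mean()

# Synthetic example (assumption): a red patch on a white background.
image = np.full((60, 80, 3), 255, dtype=np.uint8)
image[20:30, 50:60] = (200, 30, 30)
centroid = object_centroid(image)
print(centroid)
```

A white or gray background has near-zero saturation, so colored objects stand out in this channel; objects that are themselves white or gray would need a different cue.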
Fig. 5. The robot arm picking up various objects: jug, box, screwdriver, duct-tape, wine glass, book, a chip-holder, powerhorn, and cellphone.
B. Results and Discussion
We first tested the algorithm for its predictive capability
on synthetic images not in the training set. The average
classification accuracy was 94.2%. The accuracy in predicting
the 3-d grasp point was higher than this figure might suggest,
because 3-d triangulation corrects some of the errors made in
the classification step.
Next, we tested the algorithm on the STAIR robot. The
task was to use input from a web-camera, mounted on the
robot, to pick up an object placed in front of the robot against
a white background. The parameters of the vision algorithm
were trained from synthetic images of a small set of objects,
namely books, martini glasses, white-board erasers, coffee
mugs, tea cups and pencils. We performed experiments on
coffee mugs, wine glasses, pencils, books, and erasers — but
all of different dimensions and appearance than the ones in the
training set — as well as a large set of novel objects, such as
duct tape rolls, markers, a translucent box, jugs, knife-cutters,
a cellphone, pens, keys, screwdrivers, a stapler, toothbrushes, a
thick coil of wire, a strangely shaped power horn, etc. (Fig. 1).
The algorithm for predicting grasps in images appears to
generalize very well. Despite being tested on images of real
(rather than synthetic) objects, including many very different
from ones in the training set, it was usually able to identify
correct grasp points. We note that test error (in terms of
average error in predicting a good grasp point) on the real
images was only somewhat higher than the error on synthetic
images, showing that the algorithm trained on synthetic images
transfers well to real images. (Over all 5 object types used in
the synthetic data, average absolute error was 0.8cm 7 in the
synthetic images; and over all the 11 real test objects, average
error was 1.8cm.) For comparison, neonate humans can grasp
simple objects with an average accuracy of 1.5 cm. [1]
Table I shows the errors in actual grasping points that
we obtained on the real dataset. The table presents results
separately for objects which are similar to those we trained
on (e.g., coffee mugs) and those which were very dissimilar
to the training objects (e.g., duct tape). In addition to reporting
errors in grasp positions, we also report the grasp-rate, i.e., the
fraction of times the robotic arm was able to physically pick
up the object (out of 4 trials). On average, the robot succeeded
in picking up a novel object 87.5% of the time. Overall, the
algorithm worked well when there was a clear best grasping
region in the object (e.g. for pens, mugs, wine glass).
7 Units based on typical size of real world objects represented by the
synthetic images (e.g., a typical mug is 25 cm. high, etc.)
TABLE I
Average absolute error in locating the grasp point for
different objects, as well as success rate in grasping objects
using our robot arm.

Objects similar to ones in the training set
Tested on                        Mean error (cm)   Grasp-rate
Mugs                             2.8               75%
Pens                             0.9               100%
Wine Glass                       1.1               100%
Books                            2.9               75%
Eraser/Cellphone                 1.6               100%
Overall                          1.86              90%

Novel Objects
Tested on                        Mean error (cm)   Grasp-rate
Keys/Markers                     1.2               100%
Toothbrush/Cutter/Screwdriver    1.1               100%
Jug                              1.7               75%
Powerhorn                        3.5               50%
Duct Tape                        1.8               100%
Coiled Wire                      1.4               100%
Overall                          1.77              87.5%
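The overall grasp-rates in Table I can be checked with simple arithmetic. The sketch below assumes that every row corresponds to exactly 4 trials, so that 75% means 3 successes and 50% means 2; the per-row success counts are inferred from the percentages under that assumption and are not separately reported.

```python
TRIALS_PER_ROW = 4  # assumption: each table row reports 4 grasp attempts

# Success counts inferred from the grasp-rate column (e.g., 75% of 4 = 3).
similar_objects = {"Mugs": 3, "Pens": 4, "Wine Glass": 4, "Books": 3,
                   "Eraser/Cellphone": 4}
novel_objects = {"Keys/Markers": 4, "Toothbrush/Cutter/Screwdriver": 4,
                 "Jug": 3, "Powerhorn": 2, "Duct Tape": 4, "Coiled Wire": 4}

def overall_grasp_rate(successes):
    """Percentage of successful grasps pooled over all rows."""
    total_trials = TRIALS_PER_ROW * len(successes)
    return 100.0 * sum(successes.values()) / total_trials

similar_rate = overall_grasp_rate(similar_objects)
novel_rate = overall_grasp_rate(novel_objects)
print(similar_rate, novel_rate)
```

Under this assumption the pooled rates are 18 of 20 and 21 of 24 successful grasps, which match the "Overall" rows of the table.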
For simple objects like cellphones, wine glasses, keys,
toothbrushes, etc., the algorithm performed perfectly (100%
grasp-rate). However, objects such as mugs and jugs allow
only a narrow trajectory of approach; as a result, a minor
error in grasping point prediction can cause the arm to hit
and move the object, resulting in failure to grasp, and thus a
lower overall success rate. We believe that these problems can
be solved with better control strategies using haptic feedback.
Some of the failures can also be attributed to the fixed gripper
width used across all objects; this can be solved by learning
how much the gripper should open. Videos of the arm picking
up various objects are available at
http://ai.stanford.edu/~asaxena/learninggrasp/
In many instances, the algorithm was able to pick up com-
pletely novel objects (strangely shaped power-horn, duct-tape,
etc.; see Fig. 1) by identifying grasping points. Perceiving
a transparent wine glass is a difficult problem for standard
vision (e.g., stereopsis) algorithms because of reflections, etc.
However, as shown in Table I, our algorithm successfully
picked it up 100% of the time. The same rate of success holds
even if the glass is 2/3 filled with water.
Finally, we also tested the algorithm for predicting the
grasping point in a cluttered environment, specifically in
a dishwasher. To make the algorithm robust to dishwasher
clutter, we included a number 8 of hand-labeled real images
of objects in a dishwasher along with the synthetic images for
training. This resulted in performance that appeared extremely
robust to the background dishwasher clutter. A few examples
of predicted grasp points are shown in Figure 6. In the case of
multiple objects, our algorithm picks the grasping point which
8 30 real images in addition to 2500+ synthetic images
Fig. 7. The robot arm mounted on a mobile base. The 5-dof arm was used
to pick up the objects.
has the highest response.
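Selecting the grasping point with the highest response amounts to taking the maximum of the classifier's probability map. The following minimal sketch assumes the per-patch probabilities are stored in a 2-d array; the function name `best_grasp_pixel` and the toy values are hypothetical.

```python
import numpy as np

def best_grasp_pixel(probability_map):
    """Return (row, col, probability) of the highest classifier response."""
    flat_index = np.argmax(probability_map)
    row, col = np.unravel_index(flat_index, probability_map.shape)
    return int(row), int(col), float(probability_map[row, col])

# Toy probability map (assumption): two candidate grasp regions.
probs = np.zeros((4, 5))
probs[1, 3] = 0.9  # strongest candidate
probs[2, 0] = 0.7  # weaker candidate, e.g., on a second object
best = best_grasp_pixel(probs)
print(best)
```

When several objects are in view, this rule implicitly picks one object, namely the one containing the most confident grasp prediction.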
V. Conclusions
We described a machine learning algorithm for identifying a
grasping point on a previously unknown object that a robot is
perceiving for the first time using vision. Our algorithm does
not require (and nor does it build) a 3-d model of the object,
and was applied to grasping a number of novel objects using
our robotic arm.
Acknowledgments
We thank Morgan Quigley, Anya Petrovskaya and Jimmy
Zhang for help with the robot arm control driver software. We
also thank David Ho and Seokchang Ryu for help with running
simulations. This work was supported by the DARPA transfer
learning program under contract number FA8750-05-2-0249.
References
[1] T. G. R. Bower, J. M. Broughton, and M. K. Moore. Demonstration of
intention in the reaching behaviour of neonate humans. Nature, 228:679-
681, 1970.
[2] U. Castiello. The neuroscience of grasping. Nature Reviews Neuro-
science 6, pages 726-736, 2005.
[3] A. S. Glassner. An Introduction to Ray Tracing. Morgan Kaufmann
Publishers, Inc., San Francisco, 1989.
Fig. 6. Grasping point classification (shown in red) for objects placed in dishwasher. In case of multiple objects, the algorithm picks one it thinks is the best.
[4] M. T. Mason and J. K. Salisbury. Manipulator grasping and pushing
operations. In Robot Hands and the Mechanics of Manipulation. The
MIT Press, Cambridge, MA, 1985.
[5] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance
using monocular vision and reinforcement learning. In ICML, 2005.
[6] Miller et al. Automatic grasp planning using shape primitives. In
ICRA, 2003.
[7] Neuronics. Katana user manual. http://www.neuronics.ch/, 2004.
[8] R. Pelossof et al. An SVM learning approach to robotic grasping.
In ICRA, 2004.
[9] J. H. Piater. Learning visual features to predict hand orientations. In
ICML Workshop on Machine Learning of Spatial Knowledge, 2002.
[10] R. Platt et al. Manipulation gaits: Sequences of grasp control tasks.
In ICRA, 2004.
[11] R. Platt, A. H. Fagg, and R. Grupen. Reusing schematic grasping
policies. In IEEE-RAS International Conference on Humanoid Robots,
Tsukuba, Japan, 2005.
[12] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single
monocular images. In NIPS 18, 2005.
[13] A. Saxena, J. Driemeyer, J. Kearns, C. Osondu, and A. Y. Ng. Learning
to grasp novel objects using vision. In ISER, 2006.
[14] T. Shin-ichi and M. Satoshi. Living and working with robots. Nipponia,
2000.
[15] D. Wheeler, A. H. Fagg, and R. Grupen. Learning prospective pick
and place behavior. In IEEE Conf on Development & Learning, 2002.