We would like robots that can richly perceive their world, semantically distinguish its fine details, and physically interact with it well enough to perform useful manipulation. This is hard to achieve with previous methods: prior work has not equipped robots with a scalable ability to understand the dense visual state of their varied environments. The limitations have been both in the state representations used and in how to acquire them without significant human labeling effort. In this thesis we present work that leverages self-supervision, particularly via a mix of geometric computer vision, deep visual learning, and robotic systems, to scalably produce dense visual inferences of the world state. These methods either enable robots to teach themselves dense visual models without human supervision, or they act as a large multiplying factor on the value of information provided by humans. Specifically, we develop a pipeline for providing ground truth labels of visual data in cluttered and multi-object scenes, we introduce the novel application of dense visual object descriptors to robotic manipulation and provide a fully robot-supervised pipeline to acquire them, and we leverage this dense visual understanding to efficiently learn new manipulation skills through visuomotor imitation. On real robot hardware we demonstrate contact-rich manipulation of household objects, including generalizing across a class of objects, manipulating deformable objects, and manipulating a textureless symmetrical object, all with closed-loop, real-time, vision-based manipulation policies.
Thesis Supervisor: Prof. Russ Tedrake