Doctoral Thesis: Building Blocks for Human-AI Alignment: Specify, Inspect, Model, and Revise
Serena Booth
Abstract: The learned behaviors of AI systems and robots should align with the intentions of their human designers. In service of this goal, people, and especially experts, must be able to easily specify, inspect, model, and revise AI system and robot behaviors. In this thesis, I study each of these problems. First, I study how experts write reward function specifications for reinforcement learning (RL). I find that these specifications are written with respect to the RL algorithm, not independently of it, and that experts often write erroneous specifications that fail to encode their true intent, even in a trivial setting. Second, I study how to inspect an agent’s learned behaviors. To do so, I introduce two related methods for finding environments that exhibit particular behaviors; these methods support humans in inspecting the behaviors an agent learns from a given specification. Third, I study the cognitive science theories that govern how people build conceptual models to explain observed examples of agent behaviors. While some foundations of these theories appear in typical interventions that support humans in learning about agent behaviors, I find there is significant room to build better curricula for interaction, for example by showing counterexamples of alternative behaviors. I conclude by speculating about how these building blocks of human-AI interaction can be combined to enable people to revise their specifications and, in doing so, create better-aligned agents.
Thesis Supervisor: Prof. Julie Shah
Thesis Committee: Drs. Dylan Hadfield-Menell, Leslie Kaelbling, Elena Glassman, and Peter Stone
Details
- Date: Friday, October 6
- Time: 9:30 am - 11:00 am
- Category: Thesis Defense
- Location: 32-D463 (Star)
Additional Location Details:
To attend remotely, please contact the doctoral candidate at sbooth@mit.edu.