Abstract: Intelligent agents require the ability to perceive their environments, understand their high-level semantics, and communicate with humans. While computer vision has recently made great strides on visual recognition tasks, the predominant paradigm is to predict one or more fixed visual categories for each image. I will describe a line of work that significantly expands the vocabulary of our computer vision systems and allows them to express visual concepts in natural language, such as “a picture of a girl playing with a stack of legos”, or “a couple holding hands and walking on a beach”. In particular, the final model can take an image and both detect and describe in natural language all of its salient regions. My modeling techniques draw on recent advances in Deep Learning that allow us to construct and train neural networks with hundreds of millions of neurons that take raw images and map them directly to natural language sentences. I will show that the model generates qualitatively compelling results and quantitative evaluation and control experiments demonstrate the strength of this approach with respect to simpler baselines and previous methods.