High-quality cameras are now available on handheld and wearable mobile devices. The high resolution of these cameras, coupled with pervasive wireless connectivity and advanced computer vision algorithms, makes it feasible to develop new ways to interact with mobile video. Two important examples are interactive object recognition and search-by-content. Interactive recognition continuously locates objects in a video stream, recognizes them, and labels them with information associated with the objects in the user’s view. Example use cases include an augmented shopping application that recognizes products or brands to inform customers about the items they buy, and a driver assistance application that recognizes vehicles and signs to improve driver safety. Interactive search-by-content allows users to discover videos using textual queries (e.g., “child dog play”). Instead of requiring broadcasters to manually annotate videos with metadata tags, our search system uses vision algorithms to automatically produce textual tags.
These two services must be highly interactive because users expect timely feedback in response to their interactions and to changes in content. However, achieving high interactivity without sacrificing accuracy or efficiency is challenging. The required computer vision algorithms use computationally intensive deep neural networks and must run at 30 frames per second. The cost of recognizing an object scales with the size of the corpus of objects, which makes recognition infeasible on a mobile device. Off-loading recognition operations to servers introduces network and processing delays; when the total delay exceeds one frame time, recognition accuracy degrades.
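As a rough illustration of why off-loading delay matters at 30 frames per second, the sketch below converts a delay into the number of newer frames captured while a server result is still in flight. The helper name `frames_stale` and the delay values in the usage note are assumptions chosen for illustration, not measurements from this work.

```python
import math

FRAME_RATE_FPS = 30  # frame rate stated in the text; one frame time is about 33.3 ms


def frames_stale(delay_ms: float) -> int:
    """Count the newer frames captured during `delay_ms` (hypothetical helper).

    A result for frame k that arrives `delay_ms` later describes a scene
    that is this many frames old.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    # Multiply before dividing so that whole-frame delays stay exact.
    return math.floor(delay_ms * FRAME_RATE_FPS / 1000.0)
```

Under these assumptions, a 20 ms delay stays within one frame time, so the result still matches the frame on screen, while a 100 ms delay leaves the result three frames behind the camera.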
This dissertation presents two systems that study the trade-off between accuracy and efficiency for interactive recognition and search, and demonstrate how to achieve both goals. Glimpse enables interactive object recognition for camera-equipped mobile devices. Because the algorithms for object recognition entail significant computation, Glimpse runs them on servers across the network. To “hide” latency, Glimpse uses an active cache of video frames on the device and performs tracking on a subset of frames to correct the stale results obtained from the processing pipeline. Our results show that Glimpse achieves a precision of 90% for face recognition, a 2.8× improvement over a scheme that performs server-side recognition without an active cache. For fast-moving objects such as road signs, Glimpse achieves precision of up to 80%; without the active cache, interactive recognition is non-functional (1.9% precision). Panorama enables search on live video streams. It introduces three new mechanisms: (1) an intelligent frame selector that reduces the number of frames on which expensive recognition must be run, (2) a distributed scheduler that uses feedback from the vision algorithms to dynamically determine the order in which streams must be processed, and (3) a search-ranking method that uses visual features to improve search relevance. Our experimental results show that incorporating visual features doubles search relevance from 45% to 90%. At current Amazon Web Services pricing, Panorama achieves 90% search accuracy at 24× lower cost than a scheme that runs recognition on every frame.
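The active-cache idea described for Glimpse above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Glimpse implementation: the class `ActiveCache`, its methods, the `stride` parameter, and the toy `shift_tracker` are all hypothetical names, and a real system would plug in an actual visual tracker.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


class ActiveCache:
    """Hypothetical cache of recent frames used to bring a stale result up to date."""

    def __init__(self, track: Callable[[object, object, Box], Box], capacity: int = 64):
        self.track = track  # assumed tracker: (prev_frame, next_frame, box) -> box
        self.frames: Deque[Tuple[int, object]] = deque(maxlen=capacity)

    def add(self, frame_id: int, frame: object) -> None:
        """Store a newly captured frame; the oldest frame is evicted when full."""
        self.frames.append((frame_id, frame))

    def catch_up(self, result_frame_id: int, box: Box, stride: int = 2) -> Optional[Box]:
        """Track a box recognized on an old frame forward to the newest frame.

        Only every `stride`-th cached frame is visited, plus the newest one,
        so tracking runs on a subset of frames.
        """
        cached = [(i, f) for i, f in self.frames if i >= result_frame_id]
        if not cached or cached[0][0] != result_frame_id:
            return None  # the recognized frame is no longer in the cache
        subset = cached[::stride]
        if subset[-1][0] != cached[-1][0]:
            subset.append(cached[-1])  # always finish on the newest frame
        for (_, prev), (_, nxt) in zip(subset, subset[1:]):
            box = self.track(prev, nxt, box)
        return box


def shift_tracker(prev: int, nxt: int, box: Box) -> Box:
    """Toy tracker: each frame is just a horizontal scene offset in pixels."""
    x, y, w, h = box
    return (x + (nxt - prev), y, w, h)
```

In this toy setup, a server result that arrives for an older frame is replayed through the cached frames so that the label is drawn where the object is now rather than where it was when the frame was sent. A larger `stride` trades tracking work for coarser motion steps.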
Thesis Supervisor(s): Hari Balakrishnan, Dina Katabi, Victor Bahl