Aggregating and denoising crowd-labeled data is a task that has gained significance with the advent of crowdsourcing platforms and the demand for massive labeled datasets. In this paper, we propose a permutation-based model for crowd-labeled data that is a significant generalization of the popular "Dawid-Skene" model. Working in a high-dimensional non-asymptotic framework, we derive optimal rates of convergence for the permutation-based model. We show that the permutation-based model, owing to its richness, offers significantly more robust estimation, while surprisingly incurring only a small statistical penalty compared to the Dawid-Skene model. Finally, we propose a polynomial-time algorithm, called OBI-WAN, for provably efficient estimation under these models.
Joint work with Sivaraman Balakrishnan (CMU) and Martin J. Wainwright (UC Berkeley)
Nihar B. Shah is a final-year PhD candidate at UC Berkeley, working with Martin Wainwright and Kannan Ramchandran. His research interests include statistics, machine learning, and information theory, with applications to crowdsourcing. He is the recipient of the Microsoft Research PhD fellowship 2014-16, the Berkeley fellowship 2011-13, the IEEE Data Storage Best Paper and Best Student Paper awards for the years 2011/2012, and the SVC Aiya Medal from the Indian Institute of Science.
Host: Prof. Guy Bresler