Datasets in the wild are rarely captured by the clean models on which learning theory is built, yet we run our algorithms on them all the same, with no guarantee that performance will degrade gracefully as the data strays from the models originally posited. To complicate matters further, modern ML tasks are typically high-dimensional, and data often arrives in a dynamic, non-stationary fashion. In these settings, designing learning algorithms that work under even the cleanest of models can be challenging, let alone ones that are resilient to the pathologies of real-world data.
The first part of this talk will center on developing provable algorithms for handling all of these issues simultaneously. I will present the first algorithms for online linear regression and contextual bandits to achieve optimal error guarantees even when a constant fraction of the data has been corrupted. These results provably cannot be obtained using off-the-shelf techniques like Huber regression or least trimmed squares. And importantly, unlike existing guarantees in algorithmic robust statistics, ours do not posit any model for the data-generating process.
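To make the comparison concrete, the following is a minimal sketch of Huber regression, one of the off-the-shelf robust estimators mentioned above. All data, the threshold `delta`, and the optimization loop are illustrative choices for a toy 1D instance, not part of the talk's results; the sketch only shows the kind of estimator being contrasted, in a batch (not online or adversarial) setting.

```python
def huber_grad(w, xs, ys, delta=1.0):
    """Gradient of the Huber loss for the no-intercept model y ~ w * x.

    Residuals smaller than delta contribute quadratically (as in least
    squares); larger residuals contribute only linearly, which caps the
    influence any single outlier can exert on the fit.
    """
    g = 0.0
    for x, y in zip(xs, ys):
        r = w * x - y
        g += (r if abs(r) <= delta else delta * (1.0 if r > 0 else -1.0)) * x
    return g / len(xs)

# Toy data: six clean points from y = 2x, plus two grossly corrupted points.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.2, 2.2]
ys = [2 * x for x in xs[:6]] + [50.0, -40.0]

# Plain gradient descent on the Huber objective.
w = 0.0
for _ in range(2000):
    w -= 0.05 * huber_grad(w, xs, ys)

# Ordinary least squares for comparison (closed form, no intercept):
# heavily dragged off the true slope by the two corrupted points.
w_ols = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

On this toy instance the Huber fit stays near the true slope 2 while the least-squares fit does not, which is the classical motivation for such estimators; the point of the result above is that this style of estimator provably does not suffice in the online, adversarially corrupted setting.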
In the second half of the talk, we consider the regime where a majority of the data has been corrupted. Here, it turns out that exploiting generative assumptions about the data is unavoidable, and even under “benign” distributional assumptions, obtaining efficient algorithms is a challenging open question. I will present the first subexponential-time algorithm for mixed linear regression, a popular instantiation of this problem.
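For readers unfamiliar with the problem, the following is a minimal sketch of the mixed linear regression data model: each sample is drawn from one of several hidden linear models, and the learner sees only the pooled data, not the component assignments. The two slopes, the 1D setting, and the noise level here are illustrative choices, not parameters from the talk.

```python
import random

random.seed(0)

# Two hidden regression components (slopes of a 1D linear model).
w_components = [3.0, -1.0]

data = []
for _ in range(10):
    x = random.uniform(-1, 1)
    w = random.choice(w_components)   # latent component: hidden from the learner
    y = w * x + random.gauss(0, 0.01)  # small observation noise
    data.append((x, y))

# The learner's task: recover both slopes from `data` alone,
# without knowing which component generated which sample.
```

Even this clean Gaussian-style version of the problem is computationally delicate, which is why subexponential-time guarantees are the relevant benchmark.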
Thesis Supervisor: Prof. Ankur Moitra
To attend this defense, please contact the doctoral candidate at sitanc at mit dot edu