The success of deep learning has been driven primarily by empirical breakthroughs, but these successes have left open many theoretical questions that existing theory cannot answer. For example, despite the nonconvexity of the training objectives, deep neural networks can be reliably trained to fully memorize the training data, yet still perform very well on unseen test data. Although progress has been made in recent years, the gap between theory and practice remains wide. This thesis aims to take a step toward understanding the practice of deep learning from theoretical perspectives.
The first part of this thesis investigates the optimization landscape of neural networks. It establishes the existence of spurious local minima for general datasets and activation functions, which suggests that the convergence of optimization methods on neural networks cannot be explained via the training objectives alone. The second part discusses the implicit bias of neural network training, an approach to understanding generalization in deep learning through the optimization trajectory. Through a unified analysis based on a tensor representation of neural networks, we show how different linear neural network architectures lead to different global minima in overparameterized setups. The last part addresses a major theory-practice gap in stochastic finite-sum optimization: practical algorithms shuffle and iterate through the component indices, while most theoretical analyses assume the indices are sampled uniformly at random. To close this gap, we develop tight convergence rates for shuffling-based SGD that are faster than those of uniform-sampling SGD.
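To make the sampling distinction concrete, the following is a minimal sketch (not taken from the thesis; all function names and the toy objective are illustrative) contrasting with-replacement SGD, as assumed in most analyses, with the shuffling-based variant used in practice, on a simple finite sum of quadratics:

```python
import random

def sgd_uniform(grads, x, lr, steps):
    """With-replacement SGD: each step samples a component index uniformly."""
    for _ in range(steps):
        i = random.randrange(len(grads))
        x -= lr * grads[i](x)
    return x

def sgd_shuffled(grads, x, lr, epochs):
    """Shuffling-based SGD (random reshuffling): each epoch permutes the
    indices and then passes through every component exactly once."""
    idx = list(range(len(grads)))
    for _ in range(epochs):
        random.shuffle(idx)
        for i in idx:
            x -= lr * grads[i](x)
    return x

# Toy finite sum: f(x) = (1/n) * sum_i (x - a_i)^2 / 2, minimized at mean(a).
a = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, ai=ai: x - ai for ai in a]  # gradient of each component

random.seed(0)
x_rr = sgd_shuffled(grads, 0.0, lr=0.1, epochs=200)
x_iid = sgd_uniform(grads, 0.0, lr=0.1, steps=800)
# Both iterates end up near the minimizer mean(a) = 2.5; the shuffled run
# visits each component exactly once per epoch, the uniform run does not.
```

Both samplers reach a neighborhood of the minimizer here; the thesis's contribution concerns how fast each does so, showing that the shuffling-based scheme admits tighter rates than the with-replacement one.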
Thesis Supervisors: Prof. Suvrit Sra, Prof. Ali Jadbabaie
Thesis Reader: Prof. Asu Ozdaglar
To attend this defense, please contact the doctoral candidate at chulheey at mit dot edu