Doctoral Thesis: Making sense of training large AI models

Thursday, June 6
1:45 pm - 3:00 pm

E18-304 (IDSS) or https://mit.zoom.us/my/jadbabai

By: Kwangjun Ahn

Abstract:
Today, one of the most impressive applications of optimization is the training of large AI models. Currently, however, such models are trained with ad hoc heuristics and at a very large computational cost, mainly due to a lack of understanding of their working mechanisms. In this thesis, we conduct a systematic study of large-model optimization, crucially informed by how these models are trained in practice. The aim is to extract insights that will, in turn, help us design better optimization algorithms in the future.
The first part of the thesis investigates two interesting phenomena in the optimization of Transformer-based models, which are built on one of the most popular architectures for language modeling. We study how training such models can give rise to remarkable properties, such as in-context learning, and we also aim to understand the main challenges associated with Transformer training.
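To make the in-context learning setup concrete, the following is a minimal sketch of the standard in-context linear regression benchmark; it is an illustration, not the construction analyzed in the thesis, and the dimensions and the least-squares reference predictor are illustrative assumptions. Each prompt contains example pairs from a fresh linear task, and a model that learns in context must predict the label of a query from those examples alone.

```python
import numpy as np

# Illustrative in-context linear regression prompt (hypothetical values, not the thesis's setup).
rng = np.random.default_rng(0)
d, n = 5, 20                       # feature dimension and number of in-context examples

w_star = rng.normal(size=d)        # task vector drawn fresh for this prompt
X = rng.normal(size=(n, d))        # context inputs
y = X @ w_star                     # context labels
x_query = rng.normal(size=d)       # query whose label must be inferred from the context

# A model that "learns in context" behaves like a learning algorithm run on the prompt;
# least squares on the context examples is one natural reference algorithm.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("prediction from context:", w_hat @ x_query)
print("ground-truth label:     ", w_star @ x_query)
```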
The second part of the thesis focuses on understanding practical optimization algorithms. Motivated by the success of Adam in practice, we offer a theoretical account of its effectiveness based on an online learning framework that underscores the importance of Adam's algorithmic components. We then discuss the working mechanisms of flatness optimizers, such as sharpness-aware minimization (SAM), a practical optimizer known to improve model prediction performance. We formally define the notion of flat minima and study how algorithms like SAM can find them efficiently.
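For readers unfamiliar with the two optimizers mentioned above, here is a minimal numerical sketch of their update rules (Adam, Kingma & Ba; SAM, Foret et al.) on a toy quadratic loss. It is illustrative background only, not the analysis carried out in the thesis, and all hyperparameter values are arbitrary.

```python
import numpy as np

# Toy objective: L(w) = 0.5 * w^T A w, with gradient A w.
A = np.diag([10.0, 1.0])
def grad(w):
    return A @ w

w_adam = np.array([1.0, 1.0])
w_sam = np.array([1.0, 1.0])

# Adam: momentum plus a coordinate-wise adaptive step size.
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 101):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias corrections
    v_hat = v / (1 - beta2**t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

# SAM: descend using the gradient taken at an adversarially perturbed point.
lr, rho = 0.05, 0.1
for _ in range(100):
    g = grad(w_sam)
    perturb = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step of radius rho
    w_sam -= lr * grad(w_sam + perturb)               # descent with the perturbed gradient

print("Adam iterate:", w_adam, "SAM iterate:", w_sam)
```

The SAM step first perturbs the weights toward higher loss within a radius rho and then descends using the gradient at the perturbed point; this bias toward flatter regions is the behavior whose connection to flat minima the thesis studies.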
