Doctoral Thesis: Co-designing Efficient Systems and Algorithms for Sparse and Quantized Deep Learning Computing

Wednesday, November 20
1:30 pm - 3:00 pm

MIT 32-D463

By: Haotian Tang

Thesis Supervisor: Prof. Song Han

Thesis Committee: Prof. Song Han, Prof. Saman Amarasinghe, Prof. Phillip Isola


Abstract:
Deep learning models are growing rapidly in complexity and scale, encompassing a wide range of input types, such as 1D text, 2D images, and 3D point clouds. This expansion necessitates a focus on computational efficiency. This thesis systematically explores efficiency in two resource-intensive domains: autonomous driving and generative AI, leveraging foundational model compression techniques like sparsity and quantization, along with system and algorithm co-optimization.
In this presentation, I’ll start with our recent advancements in accelerating image generation models. We developed HART, a model that addresses the challenges of continuous image token modeling through hybrid tokenization: it represents the image’s overall structure with discrete, quantized tokens, while learning only continuous residual tokens to capture fine details. HART achieves 4.5-7.7x higher throughput compared to diffusion models.
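The hybrid tokenization idea can be illustrated with a minimal sketch: a latent vector is snapped to its nearest codebook entry (the discrete, quantized token), and whatever the codebook cannot express is kept as a continuous residual. The codebook, dimensions, and function names below are illustrative assumptions, not HART's actual implementation:

```python
import numpy as np

# Toy codebook for discrete (VQ-style) tokens; a real model learns this.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))  # 8 codes, 4-dim latents

def hybrid_tokenize(latent):
    """Split a continuous latent into a discrete token + continuous residual."""
    # Nearest codebook entry captures the coarse structure.
    dists = np.linalg.norm(codebook - latent, axis=1)
    idx = int(np.argmin(dists))
    # Fine detail that quantization discards stays in the residual.
    residual = latent - codebook[idx]
    return idx, residual

latent = rng.standard_normal(4)
idx, residual = hybrid_tokenize(latent)
# Discrete token + continuous residual reconstructs the latent exactly.
recon = codebook[idx] + residual
```

The point of the split is that the discrete tokens can be modeled autoregressively while only the (much easier) residuals need continuous modeling.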
Next, I’ll introduce our work on accelerating large language models using quantization, with two high-performance GPU systems: TinyChat and QServe, designed for edge and cloud deployment, respectively. TinyChat accelerates edge LLM inference by 3x using activation-aware weight quantization (AWQ), while QServe further optimizes cloud-based inference by quantizing activations and KV caches, boosting NVIDIA TensorRT-LLM throughput by 1.2-2.4x on A100 GPUs.
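The core of weight quantization can be sketched in a few lines: weights are grouped, each group gets a floating-point scale, and the weights are rounded to a small integer range. This is a simplified, generic sketch; the group size is arbitrary, and it omits the activation-aware channel rescaling that distinguishes AWQ:

```python
import numpy as np

def quantize_weights_int4(w, group_size=4):
    """Symmetric 4-bit weight quantization with one scale per group
    (simplified; AWQ additionally protects salient channels using
    activation statistics before rounding)."""
    groups = w.reshape(-1, group_size)
    # Map the largest magnitude in each group to the INT4 range [-7, 7].
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    """Recover approximate FP32 weights from INT4 values and scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(1)
w = rng.standard_normal(16).astype(np.float32)
q, scales = quantize_weights_int4(w)
w_hat = dequantize(q, scales)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

Storing 4-bit integers plus a few scales cuts weight memory roughly 4x versus FP16, which is why weight-only quantization helps most on memory-bound edge inference.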
In the final section, I’ll cover our systems and algorithms focused on sparsity. We developed TorchSparse, a high-performance GPU system for sparse convolution on 3D data that achieves 1.7-3.3x speedup over current state-of-the-art systems. Additionally, we present BEVFusion, an end-to-end multi-sensor 3D perception framework that achieves superior accuracy with 1.9x less computation compared to prior methods.
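The key property sparse convolution exploits is that 3D data is mostly empty, so compute happens only at active coordinates. Below is a toy 2D sketch of that gather-at-active-sites idea; the coordinates, cross-shaped kernel, and scalar channels are illustrative assumptions, not TorchSparse's actual GPU kernels:

```python
# Sparse input: only a few active coordinates carry features.
features = {(0, 0): 1.0, (0, 1): 2.0, (2, 2): 3.0}
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # cross-shaped kernel
weights = {off: 0.2 for off in offsets}  # one scalar weight per offset

def sparse_conv(features, weights):
    """Convolve only at active sites, skipping empty space entirely
    (submanifold style: outputs live on the input coordinates)."""
    out = {}
    for coord in features:
        acc = 0.0
        for off, w in weights.items():
            nbr = (coord[0] + off[0], coord[1] + off[1])
            if nbr in features:  # gather a neighbor only if it is active
                acc += w * features[nbr]
        out[coord] = acc
    return out

out = sparse_conv(features, weights)
```

A dense convolution over the same grid would touch every cell; here the cost scales with the number of active points and kernel offsets, which is what makes sparse convolution attractive for point clouds.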

Bio: Haotian Tang is a final-year Ph.D. candidate in MIT’s EECS department, advised by Prof. Song Han, with research interests in systems and machine learning. He served as system co-lead on the AWQ project, which won the MLSys 2024 Best Paper Award. Haotian has also been recognized as an outstanding reviewer five times across ICLR, ICML, and NeurIPS. His research contributions have received over 3,400 citations on Google Scholar and 7,000+ stars on GitHub. He has interned as a research scientist at NVIDIA Research (2024), Waymo Research (2023), and OmniML Inc. (2022, now part of NVIDIA).
