Doctoral Thesis: Efficient Algorithms and Systems for Large Language Models

Monday, November 17
3:00 pm - 5:00 pm

MIT Room 35-225, https://zoom.guangxuanx.com

By: Guangxuan Xiao


Large Language Models (LLMs) are transforming how we interact with information and augment human capabilities across all domains. Yet their computational demands—from memory requirements to quadratic attention complexity—create critical barriers between breakthrough capabilities and real-world deployment. This dissertation presents a comprehensive framework for efficient LLMs, progressing from immediate optimizations to architectural innovations.

We begin by compressing models through quantization with SmoothQuant, a post-training method that addresses the core challenge of activation outliers. By developing a mathematically equivalent transformation that migrates quantization difficulty from activations to weights, SmoothQuant enables the first practical 8-bit weight and 8-bit activation (W8A8) quantization for billion-scale models, achieving memory reduction and speedup without accuracy loss.
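To make the idea concrete, the sketch below shows a SmoothQuant-style per-channel smoothing transform in PyTorch. It is a minimal illustration, not the thesis implementation; the migration strength `alpha`, the tensor shapes, and the function names are assumptions.

```python
import torch

def smooth_scales(act_absmax, weight, alpha=0.5):
    """Per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    act_absmax: (in_features,) per-channel max |activation| from calibration data.
    weight:     (out_features, in_features) weight of the following linear layer.
    alpha is the migration strength; 0.5 here is an assumed default.
    """
    w_absmax = weight.abs().amax(dim=0)                  # per-input-channel max |W|
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha)
    return s.clamp(min=1e-5)


def apply_smoothing(x, weight, s):
    """Mathematically equivalent rewrite: (X / s) @ (s * W)^T == X @ W^T,
    so outliers move from activations into weights before W8A8 quantization."""
    x_smooth = x / s          # activations become easier to quantize
    w_smooth = weight * s     # quantization difficulty migrates into the weights
    return x_smooth, w_smooth
```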

For infinite-length sequence processing, we discover the “attention sink” phenomenon that underpins StreamingLLM: initial tokens act as stabilizing anchors for attention regardless of their semantic relevance. This insight enables constant-memory streaming over unbounded text, extending processing from thousands to millions of tokens. We extend this principle to vision-language models (VLMs) with StreamingVLM, creating a unified framework for real-time video understanding that processes hours of content while maintaining temporal coherence.
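A minimal sketch of the resulting cache policy follows, assuming a PyTorch-style KV cache; the number of sink tokens and the window size are illustrative choices rather than the thesis settings. The policy keeps a handful of initial “sink” tokens plus a rolling window of recent tokens and evicts everything in between.

```python
import torch

def evict_kv(keys, values, num_sink=4, window=1024):
    """Constant-memory KV cache: keep `num_sink` initial tokens (attention sinks)
    plus the most recent `window` tokens; drop everything in between.

    keys, values: (batch, heads, seq_len, head_dim). Sizes are assumptions.
    """
    seq_len = keys.shape[2]
    if seq_len <= num_sink + window:
        return keys, values
    keep = torch.cat([
        torch.arange(num_sink),                      # stabilizing sink tokens
        torch.arange(seq_len - window, seq_len),     # recent sliding window
    ])
    return keys[:, :, keep], values[:, :, keep]
```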

For finite long-context processing, we develop complementary solutions targeting different bottlenecks. DuoAttention exploits our discovery of a functional dichotomy among attention heads: only a fraction are critical “retrieval heads” that require full attention, while most are “streaming heads” that focus on recent tokens. This enables dramatic memory reduction through hybrid KV caching. XAttention addresses the pre-filling bottleneck with an antidiagonal scoring mechanism that identifies and computes only the essential attention blocks, achieving substantial acceleration.
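The hybrid caching idea behind DuoAttention can be sketched as below (a simplified illustration: the head classification `is_retrieval_head` and the per-head budgets are assumptions, and XAttention's block scoring is omitted). Retrieval heads keep the full cache, while streaming heads keep only sink plus recent tokens.

```python
import torch

def hybrid_kv_cache(keys, values, is_retrieval_head, num_sink=4, window=256):
    """Per-head hybrid caching: retrieval heads keep the full KV cache, while
    streaming heads keep only sink + recent tokens.

    keys, values: (batch, heads, seq_len, head_dim); is_retrieval_head: list[bool].
    Returns ragged per-head caches (one list entry per head), since budgets differ.
    """
    batch, heads, seq_len, dim = keys.shape
    kept_k, kept_v = [], []
    for h in range(heads):
        if is_retrieval_head[h] or seq_len <= num_sink + window:
            idx = torch.arange(seq_len)              # full cache for retrieval heads
        else:
            idx = torch.cat([torch.arange(num_sink),                    # sinks
                             torch.arange(seq_len - window, seq_len)])  # recent
        kept_k.append(keys[:, h, idx])
        kept_v.append(values[:, h, idx])
    return kept_k, kept_v
```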

Moving beyond existing models, we architect natively efficient models by analyzing Mixture of Block Attention (MoBA), a sparse attention mechanism applied during pre-training. Through statistical analysis of signal-to-noise ratios, we derive that smaller blocks are theoretically optimal, yet small blocks suffer from poor hardware efficiency. We resolve this tension with FlashMoBA, a custom CUDA kernel that makes small-block architectures practical and achieves up to a 9x speedup.
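A rough sketch of MoBA-style block-sparse attention for a single query is shown below, under illustrative assumptions (`block_size`, `top_k`, no causal masking): each key/value block is summarized by its mean-pooled key, the query is routed to the top-scoring blocks, and attention is computed only within them. FlashMoBA's contribution is making this pattern efficient on GPUs when the block size is small.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """MoBA-style block attention for a single query vector.

    q: (head_dim,); k, v: (seq_len, head_dim). Causal masking is omitted and
    any trailing partial block is ignored, to keep the sketch short.
    """
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    k_blocks = k[: n_blocks * block_size].reshape(n_blocks, block_size, dim)
    v_blocks = v[: n_blocks * block_size].reshape(n_blocks, block_size, dim)

    # Route the query: score each block by q . (mean-pooled key), keep the top-k.
    block_scores = k_blocks.mean(dim=1) @ q                   # (n_blocks,)
    chosen = block_scores.topk(min(top_k, n_blocks)).indices

    # Compute attention only within the selected blocks.
    k_sel = k_blocks[chosen].reshape(-1, dim)
    v_sel = v_blocks[chosen].reshape(-1, dim)
    attn = F.softmax(k_sel @ q / dim ** 0.5, dim=-1)
    return attn @ v_sel                                       # (head_dim,)
```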

This dissertation charts a path from addressing immediate deployment barriers to reimagining LLM architectures from first principles, establishing a comprehensive framework for efficient AI that addresses today’s challenges while laying groundwork for the next generation of computationally efficient and universally accessible artificial intelligence.