Doctoral Thesis: Self-Orchestrating Language Models: Leveraging Semantic Dependence for Efficient Inference
By: Tian Jin
Details
- Date: Wednesday, March 4
- Time: 12:30 pm - 2:00 pm
- Location: 32-G575 zoom: https://mit.zoom.us/j/92132984483
Abstract:
Large language models (LLMs) demonstrate impressive capabilities, but their deployment presents significant efficiency challenges. Autoregressive decoding imposes substantial inference latency and under-utilizes hardware accelerators in low-batch-size regimes. Discrete diffusion models can generate in parallel but struggle to match autoregressive quality without many denoising steps. Long-context reasoning creates memory bottlenecks that strain even state-of-the-art accelerators.
My thesis is that language models can direct their own inference execution strategy by annotating semantic dependence—which tokens require knowledge of which others—in their generation. I call such models self-orchestrating language models. A co-designed runtime interprets these annotations as optimization directives to parallelize generation and evict context, achieving Pareto-optimal quality-efficiency trade-offs.
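The core idea above — a runtime reading the model's own dependence annotations and turning them into a parallel execution schedule — can be sketched roughly as follows. The annotation format here (a map from each chunk to the chunks it depends on) is a simplifying assumption for illustration, not the thesis's actual interface.

```python
# Hypothetical sketch: group annotated generation chunks into "waves".
# Chunks in the same wave share no semantic dependence on one another,
# so a runtime could decode them in parallel.

def parallel_waves(deps):
    """deps: dict mapping chunk id -> list of prerequisite chunk ids.
    Returns a list of waves; each wave is a sorted list of chunk ids
    whose prerequisites have all been generated in earlier waves."""
    indegree = {c: len(ps) for c, ps in deps.items()}
    children = {c: [] for c in deps}
    for c, ps in deps.items():
        for p in ps:
            children[p].append(c)
    wave = [c for c, d in indegree.items() if d == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))
        nxt = []
        for c in wave:
            for ch in children[c]:
                indegree[ch] -= 1
                if indegree[ch] == 0:
                    nxt.append(ch)
        wave = nxt
    return waves

# Example: chunks B and C both depend only on A, so they form one
# parallel wave; D must wait for both.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(parallel_waves(deps))  # [['A'], ['B', 'C'], ['D']]
```

This is just a topological grouping; the actual systems co-design the annotation scheme with model training so that the model emits dependence information reliably during generation.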
I demonstrate this approach through three self-orchestrating systems. First, PASTA uses semantic dependence to parallelize autoregressive decoding, training the model to annotate which output chunks can generate independently. Second, Planned Diffusion uses semantic dependence to derive a denoising order for discrete diffusion, autoregressively generating a plan that specifies which chunks to denoise in parallel. Third, TIP uses semantic dependence to evict intermediate reasoning steps from the KV cache, reducing memory consumption while preserving accuracy.
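The third system's eviction idea can be illustrated with a toy sketch: once the model annotates an intermediate reasoning step as no longer semantically depended on, the runtime drops that step's entries from the KV cache. The cache representation and step-level annotation below are illustrative assumptions, not TIP's actual implementation.

```python
# Hypothetical sketch of annotation-driven KV-cache eviction: remove
# the cache entries of reasoning steps the model has marked evictable,
# keeping everything still needed for the final answer.

def evict_steps(kv_cache, evictable_steps):
    """kv_cache: dict mapping step id -> list of per-token KV entries.
    Returns (reduced cache, number of token entries freed)."""
    freed = sum(len(kv_cache[s]) for s in evictable_steps if s in kv_cache)
    kept = {s: v for s, v in kv_cache.items() if s not in evictable_steps}
    return kept, freed

cache = {"step1": ["k1", "k2", "k3"], "step2": ["k4"], "answer": ["k5"]}
kept, freed = evict_steps(cache, {"step1"})
print(freed)         # 3
print(sorted(kept))  # ['answer', 'step2']
```

The point of the sketch is only the interface: memory savings come from the model itself declaring which intermediate tokens later generation no longer depends on.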
Host
- Tian Jin
- Email: tianjin@mit.edu