Doctoral Thesis: Self-Orchestrating Language Models: Leveraging Semantic Dependence for Efficient Inference
By: Tian Jin
Details
- Date: Wednesday, March 4
- Time: 12:30 pm - 2:00 pm
- Location: 32-G575 zoom: https://mit.zoom.us/j/92132984483
Abstract:
Large language models (LLMs) demonstrate impressive capabilities, but their deployment presents significant efficiency challenges. Autoregressive decoding imposes substantial inference latency and under-utilizes hardware accelerators in low-batch-size regimes. Discrete diffusion models can generate in parallel but struggle to match autoregressive quality without many denoising steps. Long-context reasoning creates memory bottlenecks that strain even state-of-the-art accelerators.
My thesis is that language models can direct their own inference execution strategy by annotating semantic dependence—which tokens require knowledge of which others—in their generation. I call such models self-orchestrating language models. A co-designed runtime interprets these annotations as optimization directives to parallelize generation and evict context, achieving Pareto-optimal quality-efficiency trade-offs.
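The core idea above — a runtime reading the model's own dependence annotations and turning them into a parallel execution schedule — can be sketched roughly as follows. The annotation format here (a map from each chunk to the chunks it depends on) is a simplifying assumption for illustration, not the thesis's actual interface.

```python
# Hypothetical sketch: group annotated generation chunks into "waves".
# Chunks in the same wave share no semantic dependence on one another,
# so a runtime could decode them in parallel.

def parallel_waves(deps):
    """deps: dict mapping chunk id -> list of prerequisite chunk ids.
    Returns a list of waves; each wave is a sorted list of chunk ids
    whose prerequisites have all been generated in earlier waves."""
    indegree = {c: len(ps) for c, ps in deps.items()}
    children = {c: [] for c in deps}
    for c, ps in deps.items():
        for p in ps:
            children[p].append(c)
    wave = [c for c, d in indegree.items() if d == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))
        nxt = []
        for c in wave:
            for ch in children[c]:
                indegree[ch] -= 1
                if indegree[ch] == 0:
                    nxt.append(ch)
        wave = nxt
    return waves

# Example: chunks B and C both depend only on A, so they form one
# parallel wave; D must wait for both.
deps = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(parallel_waves(deps))  # [['A'], ['B', 'C'], ['D']]
```

This is just a topological grouping; the actual systems co-design the annotation scheme with model training so that the model emits dependence information reliably during generation.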
I demonstrate this approach through three self-orchestrating systems. First, PASTA uses semantic dependence to parallelize autoregressive decoding, training the model to annotate which output chunks can generate independently. Second, Planned Diffusion uses semantic dependence to derive a denoising order for discrete diffusion, autoregressively generating a plan that specifies which chunks to denoise in parallel. Third, TIP uses semantic dependence to evict intermediate reasoning steps from the KV cache, reducing memory consumption while preserving accuracy.
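The third system's eviction idea can be illustrated with a toy sketch: once the model annotates an intermediate reasoning step as no longer semantically depended on, the runtime drops that step's entries from the KV cache. The cache representation and step-level annotation below are illustrative assumptions, not TIP's actual implementation.

```python
# Hypothetical sketch of annotation-driven KV-cache eviction: remove
# the cache entries of reasoning steps the model has marked evictable,
# keeping everything still needed for the final answer.

def evict_steps(kv_cache, evictable_steps):
    """kv_cache: dict mapping step id -> list of per-token KV entries.
    Returns (reduced cache, number of token entries freed)."""
    freed = sum(len(kv_cache[s]) for s in evictable_steps if s in kv_cache)
    kept = {s: v for s, v in kv_cache.items() if s not in evictable_steps}
    return kept, freed

cache = {"step1": ["k1", "k2", "k3"], "step2": ["k4"], "answer": ["k5"]}
kept, freed = evict_steps(cache, {"step1"})
print(freed)         # 3
print(sorted(kept))  # ['answer', 'step2']
```

The point of the sketch is only the interface: memory savings come from the model itself declaring which intermediate tokens later generation no longer depends on.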
Host
- Tian Jin
- Email: tianjin@mit.edu