Doctoral Thesis: Taming Data Movement Overheads in Latency-Critical Cloud Services
32-G882 (also available through Zoom)
By: Nikita Lazarev
Thesis Supervisor(s): Christina Delimitrou (MIT) and Zhiru Zhang (Cornell University)
Details
- Date: Wednesday, January 22
- Time: 3:00 pm - 4:30 pm
- Category: Thesis Defense
- Location: 32-G882 (also available through Zoom)
Abstract:
Cloud providers are being urged to enhance the efficiency, performance, and reliability of datacenter infrastructures to support applications across many domains with diverse quality-of-service requirements. Data movement is a significant source of overhead in today’s servers, and it is particularly critical for emerging interactive and realtime cloud applications. In this thesis, I investigate and propose a set of novel approaches to mitigate data movement overheads in general-purpose datacenters. These approaches establish a roadmap towards more efficient and reliable cloud services that are severely bottlenecked by data movement. In particular, I propose, implement, and evaluate three systems for applications in (1) microservices, (2) serverless, and (3) realtime cloud-native services, using virtualized radio access networks (vRAN) as an example; such services are known to raise challenges for existing cloud infrastructures.
First, we discuss Dagger – a system for mitigating the overheads of remote procedure calls in interactive cloud microservices. Dagger introduces a novel yet practical solution enabling fast, low-latency communication between distributed fine-grained application components. We then present Sabre – a practical and efficient system for mitigating the challenging overhead of cold starts in serverless. Sabre relies on emerging tightly-coupled accelerators for compression and dramatically reduces the latency of page movement in serverless microVMs without compromising CPU cost. Finally, we build Slingshot – to the best of our knowledge, the first infrastructure that enables fault tolerance in realtime cloud-native services such as vRAN. With Slingshot, we make substantial progress towards deploying reliable realtime distributed systems in the general-purpose cloud by addressing the key challenges of fast state migration, realtime fault detection, and low-latency disaggregation.
Host
- Nikita Lazarev
- Email: niknik@mit.edu