Doctoral Thesis: Compiler-Hardware Co-Design for Pervasive Parallelization

Wednesday, May 24
10:00 am - 11:30 am

32-144

Victor A. Ying

Abstract:
Parallelization is critical to fast computation, but it remains a painstaking and piecemeal practice. This dissertation shows how new compilers and hardware can make parallelization of complex programs simple and systematic. It combines new hardware and compiler techniques to parallelize challenging applications across hundreds of cores without restrictive assumptions about the structure of source programs.

I follow a novel approach to eliminate the burden of explicit manual parallelization by exploiting implicit fine-grained parallelism. To maximize implicit parallelism, compilers should thoughtfully use hardware support for speculation and serialize tasks only when necessary. Serialization can be reduced by dividing code into shorter tasks. To make spawning many short parallel tasks efficient, careful hardware-compiler co-design is needed to reduce per-task overheads. To make short tasks scale to large numbers of cores, parallel composition uncovers parallelism across whole programs, while distributed software and hardware queues provide high-throughput task scheduling. This distributed management of tiny tasks provides compilers with new opportunities and challenges. In particular, novel combinations of static and dynamic information can exploit data locality while maintaining load balance on large multicore systems.

I present three systems that embody this new approach. First, T4 automates the parallelization of sequential programs and scales hard-to-parallelize real-world programs to tens of cores, resulting in order-of-magnitude speedups. Second, S5 builds on T4 with abstract data types that remove dependences. As a result, S5 scales real-world applications to hundreds of cores, delivers additional order-of-magnitude speedups over T4, and outperforms manually parallelized code tuned by experts. Finally, ASH is an accelerator that demonstrates that the same approach can be applied with simpler mechanisms tailored for digital circuit simulation. A small ASH implementation is 32× faster than a large multicore CPU running a state-of-the-art parallel simulator.

Thesis Committee: Professors Daniel Sanchez (supervisor), Joel Emer, and Saman Amarasinghe