As modern chip multi-processors (CMPs) scale to larger core counts, data movement is proving to be a major architectural challenge. Unless data is kept near the cores that use it, i.e. in nearby cache banks or memories, processors will be hopelessly throttled waiting for data to arrive. Indeed, without major architectural innovations, just the energy of moving data across the CMP will soon dominate the power budget.
This thesis presents novel architectural techniques to significantly reduce data movement. Since caches are notoriously unpredictable, many prior techniques employ empirical heuristics based on common-case application behaviors. Our approach differs by trying to rigorously reason about memory system performance from first principles. This lets us intelligently adapt the memory system, deciding for each application (i) what data to cache, (ii) how much data to cache, (iii) where to place threads and their data on the CMP, and (iv) how well the resulting memory system will behave.
Navigating these complex tradeoffs requires sophisticated models, so to produce a practical system we take a hybrid approach. Hardware is kept simple and cheap, but highly configurable; its job is to monitor applications and enforce policy. We then infrequently adapt the policy to suit running applications; these reconfigurations are generally done in software, but can be done in hardware if convenient. Our split design lets us produce a high-quality policy, with costs amortized over many memory references, and retains the performance benefits of a hardware implementation.
We illustrate this approach in three important areas. First, we use partitioning to guarantee convex cache performance. Convexity eliminates performance cliffs, improving performance and fairness, and also makes cache performance easier to reason about. Second, we use planning theory to design a practical cache replacement policy, and we present a general-purpose cache model that predicts such policies’ performance. These contributions improve cache performance generally. Third, to specifically address CMP scaling, we construct virtual cache hierarchies out of shared cache resources, including forthcoming technologies like 3D-stacked memories. Each virtual hierarchy is dynamically sized and placed near applications to minimize access latency.
These techniques affect the full system stack, from microarchitecture through algorithms. We show that grounding architecture in analytical models significantly improves performance at low cost and implementation complexity. Moreover, doing so yields qualitative benefits. For example, virtual cache performance scales independently of system size, and our analytical models allow us to reason about convergence, convexity, and optimality.
Thesis Supervisor: Prof. Daniel Sanchez