Next-generation multicore processors and applications will operate on massive data with significant sharing. A major implementation challenge is the storage required to track the sharers of each data block: in conventional directory-based cache coherence protocols, this storage overhead scales quadratically with the number of cores. This thesis proposes ACKwise, a broadcast-based limited directory protocol that tracks sharers in a cost-effective manner.
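The limited-directory idea can be illustrated with a minimal sketch (hypothetical class and field names, not the thesis implementation): each directory entry tracks up to k sharers exactly; once the k pointers overflow, it falls back to a broadcast mode that retains only a sharer count, so invalidation acknowledgements can still be tallied without per-sharer pointers.

```python
class AckwiseEntry:
    """Sketch of a limited directory entry with broadcast fallback."""

    def __init__(self, k):
        self.k = k
        self.sharers = set()    # exact sharer IDs while there are <= k sharers
        self.broadcast = False  # set once the k hardware pointers overflow
        self.count = 0          # number of sharers tracked in broadcast mode

    def add_sharer(self, core_id):
        if self.broadcast:
            self.count += 1
            return
        if core_id in self.sharers:
            return
        if len(self.sharers) < self.k:
            self.sharers.add(core_id)
        else:
            # Pointer overflow: switch to broadcast mode, keeping only a count.
            self.broadcast = True
            self.count = len(self.sharers) + 1
            self.sharers.clear()

    def invalidate_targets(self, num_cores):
        """Return (cores to send invalidations to, acknowledgements to expect)."""
        if self.broadcast:
            # Broadcast to everyone, but the count bounds the expected acks.
            return list(range(num_cores)), self.count
        return sorted(self.sharers), len(self.sharers)
```

The storage per entry is k pointers plus a count, independent of the core count, which is what breaks the quadratic scaling of a full-map directory.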
Another major challenge is the limited cache capacity and the data movement incurred by conventional cache hierarchy organizations at massive data scales. Both factors adversely affect memory access latency and energy consumption. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., utilization) and reduce data movement by exploiting locality and controlling replication. First, a locality-aware replication scheme is proposed that better manages the private caches. It adapts seamlessly between private and logically shared caching of on-chip data at the fine granularity of individual cache lines, relying on low-overhead, in-hardware runtime profiling of each cache line's locality and permitting private caching only for data blocks with high spatio-temporal locality.
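The profiling step can be sketched as follows (a hypothetical model; the class, field names, and threshold value are illustrative, not the thesis hardware): each line accumulates a reuse counter while privately cached, and at eviction the counter decides whether the line remains eligible for private replication or is demoted to remote access at its shared location.

```python
PRIVATE, REMOTE = "private", "remote"

class LocalityClassifier:
    """Sketch of per-cache-line locality profiling with a reuse threshold."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.mode = {}   # line address -> PRIVATE or REMOTE
        self.reuse = {}  # line address -> accesses since the last fill

    def classify(self, addr):
        # Lines start out privately cacheable (optimistic default).
        return self.mode.get(addr, PRIVATE)

    def on_access(self, addr):
        self.reuse[addr] = self.reuse.get(addr, 0) + 1

    def on_eviction(self, addr):
        # Lines that proved their reuse stay private; low-reuse lines are
        # demoted to remote (shared) access, avoiding wasteful replication.
        high = self.reuse.pop(addr, 0) >= self.threshold
        self.mode[addr] = PRIVATE if high else REMOTE
```

Only high-locality lines then consume private cache capacity, while low-locality lines are served word-by-word from the shared location, reducing both evictions and coherence traffic.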
Second, a timestamp-based memory ordering validation scheme is proposed that allows this locality-aware private cache replication to be implemented in conventional processors under popular memory consistency models. Third, a locality-aware LLC replication scheme is proposed that better manages the last-level cache by balancing data locality against off-chip miss rate. Finally, all of the above schemes are combined into a cache hierarchy replication scheme that delivers high data locality and low miss rates at every level of the cache hierarchy. Together, these techniques make efficient use of on-chip cache capacity and provide low-latency, low-energy memory access while retaining the convenience of shared memory. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22%, at a storage overhead of 30.7 KB per core.
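The core of timestamp-based validation can be sketched as follows (a hypothetical model, not the thesis microarchitecture): committed stores advance a per-line version timestamp, a load records the version it observed from its local replica, and a commit-time check confirms no newer version has since been produced, i.e., the replica was not stale when the load must appear in the memory order.

```python
class TimestampValidator:
    """Sketch of commit-time validation of loads on replicated lines."""

    def __init__(self):
        self.version = {}  # line address -> timestamp of last committed store
        self.clock = 0     # logical clock advanced by every store

    def on_store(self, addr):
        self.clock += 1
        self.version[addr] = self.clock

    def on_load(self, addr):
        # Record the version the load observed from its (possibly stale) replica.
        return self.version.get(addr, 0)

    def validate(self, addr, observed):
        # The load may commit only if no store has produced a newer version;
        # otherwise it must be replayed with fresh data.
        return self.version.get(addr, 0) == observed
```

A failed validation triggers a replay rather than a coherence violation, which is what lets the replication scheme coexist with conventional consistency models.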
Thesis Supervisor: Srinivas Devadas