Re-Nuca: Boosting CMP performances through block replication

Foglia, Pierfrancesco; Giovanna, Monni; Prete, Cosimo Antonio; Marco, Solinas

doi:10.1109/DSD.2010.41

Abstract— Chip Multiprocessor (CMP) systems have become the reference architecture for designing micro-processors, thanks to the improvements in semiconductor nanotechnology that have continuously provided a crescent number of faster and smaller per-chip transistors. The interests for CMPs grew up since classical techniques for boosting performance, e.g. the increase of clock frequency and the amount of work performed at each clock cycle, can no longer deliver to significant improvement due to energy constrains and wire delay effects. CMP systems generally adopt a large last-level-cache (LLC) (typically, L2 or L3) shared among all cores, and private L1 caches. As the miss resolution time for private caches depends on the response time of the LLC, which is wire-delay dominated, performance are affected by wire delay. NUCA caches have been proposed for single and multi core systems as a mechanism for tolerating wire-delay effects on the overall performance. In this paper, we introduce a novel NUCA architecture, called Re-NUCA, specifically suited for (but not limited to) CMPs in which cores are placed at different sides of the shared cache. The idea is to allow shared blocks to be replicated inside the shared cache, in order to avoid the limitations to performance improvements that arise in classical D-NUCA caches due to the conflict hit problem. Our results show that Re-NUCA outperforms D-NUCA of more then 5% on average, but for those applications that strongly suffer from the conflict hit problem we observe performance improvements up to 15%.