CINECA IRIS Institutional Research Information System

On the road to computer systems able to support the requirements of exascale applications, Chip Multi-Processors (CMPs) are equipped with an ever increasing number of cores interconnected through fast on-chip networks. To exploit such new architectures, the parallel software must be able to scale almost linearly with the number of cores available. To this end, the overhead introduced by the run-time system of parallel programming frameworks and by the architecture itself must be small enough in order to enable high scalability also for very fine-grained parallel programs. An approach to reduce this overhead is to use non-conventional architectural mechanisms revealing useful when certain concurrency patterns in the running application are statically or dynamically recognized. Following this idea, this paper proposes a run-time support able to reduce the effective latency of inter-thread cooperation primitives by lowering the contention on individual caches. To achieve this goal, the new home-forwarding hardware mechanism is proposed and used by our runtime in order to reduce the amount of cache-to-cache interactions generated by the cache coherence protocol. Our ideas have been emulated on the Tilera TILEPro64 CMP, showing a significant speedup improvement in some first benchmarks.

The home-forwarding mechanism to reduce the cache coherence overhead in next-generation CMPs

MENCAGLI, GABRIELE^Primo;VANNESCHI, MARCO^Secondo;LAMETTI, SILVIA^Ultimo

2018-01-01

Abstract

On the road to computer systems able to support the requirements of exascale applications, Chip Multi-Processors (CMPs) are equipped with an ever increasing number of cores interconnected through fast on-chip networks. To exploit such new architectures, the parallel software must be able to scale almost linearly with the number of cores available. To this end, the overhead introduced by the run-time system of parallel programming frameworks and by the architecture itself must be small enough in order to enable high scalability also for very fine-grained parallel programs. An approach to reduce this overhead is to use non-conventional architectural mechanisms revealing useful when certain concurrency patterns in the running application are statically or dynamically recognized. Following this idea, this paper proposes a run-time support able to reduce the effective latency of inter-thread cooperation primitives by lowering the contention on individual caches. To achieve this goal, the new home-forwarding hardware mechanism is proposed and used by our runtime in order to reduce the amount of cache-to-cache interactions generated by the cache coherence protocol. Our ideas have been emulated on the Tilera TILEPro64 CMP, showing a significant speedup improvement in some first benchmarks.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2018
			
	Codice DOI
	
				https://dx.doi.org/10.1016/j.future.2017.01.009
			
	Tutti gli autori
	
						Mencagli, Gabriele; Vanneschi, Marco; Lametti, Silvia
					
	Appare nelle tipologie:
	
				1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
preprint-fgcs-2018.pdf accesso aperto Tipologia: Documento in Pre-print Licenza: Creative commons Dimensione 1.02 MB Formato Adobe PDF Visualizza/Apri	1.02 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/841180

Citazioni

ND

4

2

social impact