The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade-off between failure-free and recovery performances. In this paper we describe our contribution in fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have a good scalability, the literature protocols do not always have these properties.

Fault tolerance for data parallel programs

BERTOLLI, CARLO;VANNESCHI, MARCO
2010-01-01

Abstract

The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade-off between failure-free and recovery performances. In this paper we describe our contribution in fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have a good scalability, the literature protocols do not always have these properties.
2010
Bertolli, Carlo; Vanneschi, Marco
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11568/138199
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? 1
social impact