FTXS@HPDC
Publications
2017
Umar Kalim, Mark K. Gardner, Wu Feng
A Non-Invasive Approach for Realizing Resilience in MPI.
FTXS@HPDC (2017)
Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert
Optimal Checkpointing Period with Replicated Execution on Heterogeneous Platforms.
FTXS@HPDC (2017)
Saurabh Hukerikar, Rizwan A. Ashraf, Christian Engelmann
Towards New Metrics for High-Performance Computing Resilience.
FTXS@HPDC (2017)
Ayush Patwari, Ignacio Laguna, Martin Schulz, Saurabh Bagchi
Understanding the Spatial Characteristics of DRAM Errors in HPC Clusters.
FTXS@HPDC (2017)
Anne Benoit, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, Hongyang Sun
Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.
FTXS@HPDC (2017)
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS@HPDC 2017, Washington, DC, USA, June, 2017
FTXS@HPDC (2017)
2016
Maher Salloum, Jackson R. Mayo, Robert C. Armstrong
In-Situ Mitigation of Silent Data Corruption in PDE Solvers.
FTXS@HPDC (2016)
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS@HPDC 2016, Kyoto, Japan, May 31, 2016
FTXS@HPDC (2016)
Scott Levy, Kurt B. Ferreira
An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart.
FTXS@HPDC (2016)
Allan S. Nielsen, Jan S. Hesthaven
Fault Tolerance in the Parareal Method.
FTXS@HPDC (2016)
Francesco Rizzi, Karla Morris, Khachik Sargsyan, Paul Mycek, Cosmin Safta, Bert J. Debusschere, Olivier P. Le Maître, Omar M. Knio
ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner.
FTXS@HPDC (2016)
Fumiyoshi Shoji
The K computer and its failures.
FTXS@HPDC (2016)
Zachary W. Parchman, Geoffroy Vallée, Thomas J. Naughton, Christian Engelmann, David E. Bernholdt, Stephen L. Scott
Adding Fault Tolerance to NPB Benchmarks Using ULFM.
FTXS@HPDC (2016)
Piyush Sao, Oded Green, Chirag Jain, Richard W. Vuduc
A Self-Correcting Connected Components Algorithm.
FTXS@HPDC (2016)
2015
Catello Di Martino, Saurabh Jha, William Kramer, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
LogDiver: A Tool for Measuring Resilience of Extreme-Scale Systems and Applications.
FTXS@HPDC (2015)
Sudhanva Gurumurthi
Failures in Large-Scale Systems: Insights from the Field.
FTXS@HPDC (2015)
Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS 2015, Portland, Oregon, USA, June 15, 2015
FTXS@HPDC (2015)
Aiman Fang, Andrew A. Chien
How Much SSD Is Useful for Resilience in Supercomputers.
FTXS@HPDC (2015)
Alireza Goudarzi, Dorian C. Arnold, Darko Stefanovic, Kurt B. Ferreira, Guy Feldman
A Principled Approach to HPC Event Monitoring.
FTXS@HPDC (2015)
Brian Austin, Eric Roman, Xiaoye Li
Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices.
FTXS@HPDC (2015)
Felix Loh, Parameswaran Ramanathan, Kewal K. Saluja
Transient Fault Resilient QR Factorization on GPUs.
FTXS@HPDC (2015)
Aurélien Cavelan, Yves Robert, Hongyang Sun, Frédéric Vivien
Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors.
FTXS@HPDC (2015)
Daniel Alfonso Gonçalves de Oliveira, Laércio Lima Pilla, Caio B. Lunardi, Luigi Carro, Philippe O. A. Navaux, Paolo Rech
The Path to Exascale: Code Optimizations and Hardening Solutions Reliability.
FTXS@HPDC (2015)
Jeremiah J. Wilke, Keita Teranishi, Janine C. Bennett, Hemanth Kolla, David S. Hollman, Nicole Slattengren
Evolving the Message Passing Programming Model via a Fault-Tolerant, Object-oriented Transport Layer.
FTXS@HPDC (2015)
Qiang Guan, Nathan DeBardeleben, Sean Blanchard, Song Fu
Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection.
FTXS@HPDC (2015)