Author of the publication

Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems.

, , , , and . DSN, page 37-44. IEEE Computer Society, (2015)

Other publications of authors with the same name

Are we witnessing the spectre of an HPC meltdown?, , , , , , , , and . Concurr. Comput. Pract. Exp., (2019)

Scaling the Summit: Deploying the World's Fastest Supercomputer., , , , , , , , , and 7 other author(s). ISC Workshops, volume 11887 of Lecture Notes in Computer Science, page 330-351. Springer, (2019)

Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers., , , and . Euro-Par, volume 12247 of Lecture Notes in Computer Science, page 37-51. Springer, (2020)

Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems., , , , and . DSN, page 37-44. IEEE Computer Society, (2015)

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility., , , , and . SC, page 38:1-38:12. ACM, (2015)

GPU age-aware scheduling to improve the reliability of leadership jobs on Titan., , , , and . SC, page 7:1-7:11. IEEE/ACM, (2018)

GPU lifetimes on titan supercomputer: survival analysis and reliability., , , , , and . SC, page 41. IEEE/ACM, (2020)

Understanding failures through the lifetime of a top-level supercomputer., , , and . J. Parallel Distributed Comput., (2021)

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation., , , , , , , , , and 2 other author(s). HPCA, page 331-342. IEEE Computer Society, (2015)

Analyzing a Five-Year Failure Record of a Leadership-Class Supercomputer., , , and . SBAC-PAD, page 196-203. IEEE, (2019)