Article

Lightweight checkpointing for concurrent ML

L. Ziarek and S. Jagannathan.
Journal of Functional Programming 20 (2): 137--173 (2010)
DOI: 10.1017/S0956796810000067

Abstract

Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics for safe reexecution in multithreaded code is not obvious, however. For a thread to reexecute correctly a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If not done properly, data inconsistencies and other undesirable behavior might result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of its design, and provide a detailed description of its implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads to use stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
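The abstract's central requirement, that every thread which has witnessed a faulting thread's effects must also be reverted, can be illustrated with a small sketch. This is not the paper's algorithm or the stabilizers API. It assumes a hypothetical, already-recorded log of one-directional communication events (sender, receiver) in the order they occurred since the last checkpoint, and the names `threads_to_revert` and `log` are invented for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical event record: (sender, receiver), listed in the order
# the communications happened after the last checkpoint.
Event = Tuple[str, str]


def threads_to_revert(faulting: str, events: Iterable[Event]) -> Set[str]:
    """Return the threads that must roll back together with `faulting`.

    Assumption: a thread is affected once it receives a message from a
    thread that is already affected. Events are scanned in order, so a
    message sent before the sender became affected does not propagate.
    """
    affected = {faulting}
    for sender, receiver in events:
        if sender in affected:
            affected.add(receiver)
    return affected


# t3 sends to t4 before t3 hears from t2, so t4 and t5 are unaffected.
log = [("t1", "t2"), ("t3", "t4"), ("t2", "t3"), ("t4", "t5")]
result = threads_to_revert("t1", log)
```

Here `result` contains t1, t2 and t3 only. The example shows why, as the abstract notes, the set of threads in a consistent global checkpoint depends on the dynamic order of interactions and cannot be read off the program text.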

BibTeX key: ziarek2010lightweight
entry type: article
booktitle: Journal of Functional Programming
year: 2010
number: 2
pages: 137--173
publisher: Cambridge University Press
volume: 20
issn: 0956-7968
DOI: 10.1017/S0956796810000067
url: https://www.cambridge.org/core/article/lightweight-checkpointing-for-concurrent-ml/A8CBF8766727B44869F7C7C5D01B9EC5

Cite this publication

%0 Journal Article
%1 ziarek2010lightweight
%A Ziarek, Lukasz
%A Jagannathan, Suresh
%B Journal of Functional Programming
%D 2010
%I Cambridge University Press
%K Checkpoints Concurrency Roleback
%N 2
%P 137--173
%R 10.1017/S0956796810000067
%T Lightweight checkpointing for concurrent ML
%U https://www.cambridge.org/core/article/lightweight-checkpointing-for-concurrent-ml/A8CBF8766727B44869F7C7C5D01B9EC5
%V 20
%X Transient faults that arise in large-scale software systems can often be repaired by reexecuting the code in which they occur. Ascribing a meaningful semantics for safe reexecution in multithreaded code is not obvious, however. For a thread to reexecute correctly a region of code, it must ensure that all other threads that have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If not done properly, data inconsistencies and other undesirable behavior might result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward because thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction, called stabilizers, that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Safe global states are computed through lightweight monitoring of communication events among threads (e.g., message-passing operations or updates to shared variables). We present a formal characterization of its design, and provide a detailed description of its implementation within MLton, a whole-program optimizing compiler for Standard ML. Our experimental results on microbenchmarks as well as several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads to use stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs.
