Distributed Snapshots - Review

The paper Distributed Snapshots: Determining Global States of Distributed Systems (1985) by K. Mani Chandy and Leslie Lamport presents algorithms which allow processes in distributed systems to determine a global state of the system during a computation.  Many problems and tasks in distributed systems can be formulated as the problem of detecting global states, such as stable property detection and checkpointing.  The paper’s algorithms provide mechanisms for solving such problems and tasks.

A snapshot, in this setting, is carefully defined to conform to a concept which is both practical and useful.  Specifically, the authors enforce that each process in a distributed system can record its own state and the messages it sends and receives and nothing else.  That is, one process cannot record the state of another process directly.  Furthermore, the algorithms described by the authors maintain the important property that they do not interfere with the underlying computations of the system.  In other words, an effective snapshot algorithm can’t alter the computations of the processes.

The goal of these snapshots, at least described in this paper, is to be able to provide a mechanism for solving a class of problems related to the global state of a distributed system.  If y is a predicate function defined on the global states of a distributed system (such that y(S) is either true or false for any global state S), then the predicate y is a stable property of the system if y(S) implies y(S’) for all global states S’ of the system reachable from S.  This says that if y is a stable property and y is true at a point in a computation of the system, then y is true at all later points in that computation.  They give the examples of “computation has terminated,” “the system is deadlocked,” and “all tokens in a token ring have disappeared.”

The authors model a distributed system as a finite set of processes and a finite set of channels described as a directed graph where each vertex represents a process and each edge represents a channel.  Channels deliver messages between processes, and the state of a channel is the sequence of messages sent along the channel (excluding messages received along the channel).  In this way, the global state of a distributed system is a set of component process and channel states.

In order to capture the complete and consistent global state of the system, the authors describe the use of markers, which are sent and received along channels and mark logical points in time.  When a process records its state, it immediately sends a marker through each of its outbound channels.  The global state detection algorithm, then, is described by a marker sending rule and a marker receiving rule.

The marker sending rule states that each process sends one marker along each of its outbound channels after it records it state and before it sends further messages along the channel, and the marker receiving rule states that when a process receives a marker if it has not recorded its state then it records is state and the state of channel it received the marker from as empty otherwise it records the state of the channel as the sequence of messages received along the channel after the processes state was recorded and before it received the marker.

One interesting outcome of this algorithm is that the recorded global state isn’t necessarily identical to any of the actualized states of the system.  This seems problematic, but the authors explain that the utility of such a snapshot comes from the fact that the snapshot state is guaranteed to be reachable from the initial state and from which the final global state is reachable.  Depending on ones use, this can be sufficient (and certainly better than an inconsistent snapshot).

This paper, which is almost 35 years old, introduces concepts which are still being leveraged in distributed systems today.  One would imagine that the gap between the resulting snapshot produced by this algorithm and a snapshot which describes a state that is guaranteed to have been actualized by the system would have been filled by now, but the lack of such progress hints at a logical impossibility that may not ever be resolved.  Nonetheless, it poses and interesting problem that may be built upon by modern approaches.

One immediate consideration is to determine the likelihood of a snapshot state being actualized in a distributed system.  For example, if a simple system takes snapshots such that the resulting snapshot is one of N possible configurations, it might be more useful to know a distribution over those configurations (versus, presumably, an assumed uniform distribution over them).  In the case of debugging a system based on recorded snapshots, having a better understanding of the most likely state configuration can lead to more effective troubleshooting and debugging, and can possibly lead to more intelligent snapshotting where this distribution is considered.

A possible approach would be to use machine learning to watch attributes of the state leading up to a snapshot, and have it learn how the sequence of events can lead to particular states which can’t be directly observed by any specific process.  These models, then, can effectively take educated guesses as to how the checkpoint will unfold, and can possibly influence how a checkpoint is made (e.g., by utilizing physical clocks to add synchrony to a system to control how and when markers are sent or when snapshots are created) in order to “direct” a snapshot procedure towards states which are preferable for a given task (such as debugging).  Alternatively, machine learning models may be used outside the system to determine a potential “true” snapshot state given the realized snapshot state.  If a model is trained in a controlled setting to learn how snapshot states correspond to actualized states, these models may be used during debugging and troubleshooting to better understand how a system has behaved in the past.

Considering how complex global states can become, it makes sense to utilize generalizable models to learn how these systems are behaving.  Although the underlying snapshot algorithms may not change, the effectiveness of the snapshot themselves may become more valuable.

Show Comments