Paxos Made Live - Review

As our knowledge and understanding of complex theories matures and grows, it is important to keep in mind the practical implications of research.  Although producing logically sound and meaningful research is difficult in itself, implementations in the real world provides a new set of challenges that are often not considered in a research setting.  Paxos, a family of protocols for solving consensus in distributed systems, is a prime example of such research, and the paper Paxos made live - An engineering perspective by Tushar D. Chandra, Rovert Griesemer, and Joshua Redstone (2007) provides an excellent summary of how such challenges are dealt with in a practical setting.

The paper Paxos made live describes the experience of an engineering team at Google as they build a fault-tolerant database based on Paxos.  The authors make a point to emphasise how translating the pseudo-code from the pure research setting to production ready code proved unexpectedly difficult.  Their goal is to replace the database, which they refer to as “3DB”, that their fault-tolerant Chubby system is based on with a built-from-scratch system they’d build based on Paxos.  They describe how the 3DB system that they are currently using had a history of bugs related to replication, and even point out that it’s replication mechanism was not based on a proven algorithm so they couldn’t even be sure it was correct.  Given the importance of Chubby at Google, they had decided to utilize Paxos (a proven fault-tolerant consensus algorithm) to rebuild that portion of Chubby.

The authors spend a great deal of time discussing the challenges in implementing Paxos.  These challenges include handling disk corruption, dealing with possible stale data in the master, reliably detecting master turnover, handling the group membership problem, creating snapshots, and working with database transactions.  The authors note how these challenges are not usually brought up in literature on Paxos (or theoretical consensus algorithms in general), but are critical to the success of the actual implementation of the algorithm.  The authors describe each problem and how they solve them either by borrowing from previous work or creating custom solutions of their system.

Next, the authors describe the software engineering aspect of the implementation.  They need to build their system to be robust enough to run as a user-facing fault-tolerant system.  They note that bugs found in fault-tolerant systems are much more critical than bugs of similar gravity in other kinds of software given the nature of the system.  They describe how they code the core algorithm as two explicit state machines in order to express the algorithm effectively.  They are liberal with their use of runtime consistency checking in order to verify their code and test data structure for efficiency.  They describe the explicit testing they perform on the system, and how they utilize a safety mode and a liveness mode to facilitate this.  And, finally, they describe the challenges faced when dealing with concurrency and how their original intentions for testing concurrent fault-tolerant code could not be satisfied as the project grew.

The authors finish the paper by describing unexpected failures with their implementation, measurements they use to compare their new system to the previous system based on 3DB, and a summary of their work as well as describing some shortcomings and open problems.

As a research paper, I very much enjoyed this work.  The authors did not hide the issues they faced and spoke very frankly and transparently about the situations and how they managed to resolve them.  It’s easy for these papers to focus too narrowly on theory without delving into application, and idealizing a situation cannot lead to the criticism and insight needed to progress these fields.  The engineering perspective shown here is a welcomed angle to an otherwise theoretical and idealistic problem.

I found their approaches to handling the challenges of the implementation to be very grounded, giving a more intuitive sense of the limitations of Paxos as a framework.  For example, they discuss the necessity to properly handle disk corruption.

In Paxos, there is no contingency to handle situations which resemble disk corruption.  The authors mention this coming in one of two forms: either the file(s) change or the file(s) become inaccessible.  The authors addressed these issues by incorporating a checksum of the contents of the file to make sure data wasn’t incorrectly altered and by having new replicas leave a marker in Google Filesystem after start-up.

To me, this section highlights the strengths of this paper.  The author’s had to deal with real-world phenomenon that any distributed system needs to take into account.  They note a possible issue (disk corruption) and utilize tools at their disposal to properly deal with this.  This makes for great reference for those looking to implement similar systems, and their detailed response to these issues makes it very tangible.

However, this section also exposes a weakness of this paper.  In addressing this specific issue, I felt the authors failed to properly define the issue in the language of Paxos and, as such, leads to a solution that is hard to accept in general.  Although I’m sure they dealt with the situation more rigorously during the implementation, I felt they could have extended Paxos to deal with these kinds of situations in general.

They note at the end of that section that their procedures allow for the system to not have to flush writes to disk immediately (under certain circumstances), and that they had considered schemes to exploit this observation, but those schemes were not implemented.  I think if they would have better defined this situation in terms of the Paxos algorithm, they may have been able to extend (or allow others to extend) the algorithm in general to leverage this.

Most of the paper tends to do this (and, unfortunately, discussing every instance of this is outside the scope of this assignment).  In general, Paxos Made Live does an excellent job describing the real-world implementation of Paxos, but fails to provide the theoretical backing to describe their solutions.  Knowing that this wasn’t their intention when writing the paper makes these weaknesses understandable, but it does make one question if others have properly addressed these issues in a more rigorous manner and if those solutions are compatible with theirs.

Show Comments