The paper TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems (2016) by Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi is, as the title suggests, a taxonomy of non-deterministic concurrency bugs which is compiled from a study of 104 distributed concurrency (DC) bugs from four cloud-scale datacenter distributed systems. Software systems, especially in cloud settings, are becoming more complex and, as a result, are becoming more challenging to deal with. The authors in this paper set out to provide a comprehensive study of real-world distributed concurrency bugs. TaxDC provides data and information related to DC bugs across several axes of analysis such as the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies. The main contribution of the paper comes from TaxDB being the first large-scale DC-bug benchmark.
The authors, first, describe the methodology used during the study. The study examines bugs from the widely-deployed open-source datacenter distributed systems Hadoop MapReduce, which represents distributed computing frameworks, HBase and Cassandra, which represents distributed key-value stores, and ZooKeeper, which represents synchronization services. The authors started the study by examining an open source cloud bug study (CBS) database from which they filtered out local concurrency (LC) bugs and excluded DC bugs that do not contain clear descriptions and then picked 104 samples from the remaining detailed DC bugs. The study characterizes bugs along three key stages: triggering, errors and failures, and fixing.
The authors describe triggering processes from two perspectives: timing conditions and input preconditions. Concerning timing conditions, the authors identify the smallest set of concurrent events such that those events can guarantee the manifestation of a bug. Most DB bugs are triggered either by message timing bugs, which is caused by the untimely delivery of messages, or by fault timing bugs, which is caused by untimely faults or reboots. Input preconditions, such as faults, reboots, and multiple protocols, are what cause the triggering path for the timing conditions previously discussed.
Next, the authors describe the errors and failures found with these bugs. First errors are the point that bridges the triggering and error-propagation process, and these errors are categorized into local errors and global errors based on whether they can be observed from the triggering node alone. These errors will eventually lead to wide-range fatal failures, and the authors categorize these symptoms accordingly.
The authors analyze bug patches in order to understand developers’ fix strategies. The authors find, in general, that DC bugs can be fixed by either disabling the triggering timing or changing the system’s handling of that timing. The authors attempt to postulate some possible and common misbeliefs which may lead to DC bugs. Finally, the authors describe their findings so as to direct future research with regards to DC bugs.
This paper engages a more practical mindset than most research papers. The specificity of their study and how immediately applicable it is makes the paper stand out as something that may be of interest to many people working in this field and not just researchers. Specifically, the Lessons Learned section provides much material which bridges the gap between the findings in their study and actionable advice that can be implemented in real-world applications.
Despite the very clear methodology and seemingly meaningful results of this work, I have a hard time understanding how the results of this work may actually be utilized in the real-world. For instance, the authors discuss bug detection tools to combat these errors. Bug detection tools look for bugs that match specific patterns, and the study provides some guidance and patterns that can be exploited by DC bug detection, including discussions on generic detection frameworks, invariant-guided detection, misconception-guided bug detection, error-guided bug detection, and software testing. Although the discussions themselves are comprehensive and agreeable, I don’t see how these forms of bug detection couldn’t be implemented in large-scale datacenters already. The paper doesn’t necessarily state this to be the case, but I question how pertinent these observations are to this specific discussion, and if they are, the paper wasn’t able to communicate this to me.
Another section of the paper that left me puzzled was the discussion on distributed transactions. The authors pose the question of whether or not DC bugs can be eliminated through distributed transactions, and answer that the actual implementations of theoretically proven distributed transactions are not always correct and the distributed transactions are only a subset of a full complete system. The first point seems obvious, but I could not understand what the second point was not explored further. The authors provide an example of a bug involving ZooKeeper in HBase for coordinating and sharing states between HBase masters and region servers. Briefly, ZooKeeper is able to provide linearization of updates, but HBase must hand its concurrent operations to ZooKeeper, causing a race condition. The source of this issue makes sense, but I don’t see why these two systems can’t be built so as to avoid these issues in the first place? The authors state that the issue, simply speaking, is that there are many protocols that do not use distributed transactions. But, given that the authors previously stated that these bugs can theoretically be solved using transactions, I question why more effort hasn’t been put into making these systems compatible. If there is some other reason why this isn’t the case, the authors fail to mention it.
Overall, the paper offers an excellent taxonomy of DC bugs and provides interesting insight into the methodologies of the study as well as some seemingly meaningful conclusions about the current state of fault-tolerance in distributed systems. Although I criticize the lack of explanation for some of the discussions, I recognize that this work is very large and would take much effort to fully explore. I’m interested to see how works like this can be utilized in practice, and what other research may end up stemming from this.