Does The Cloud Need Stabilizing? (2018) by Murat Demirbas, Aleksey Charapko, and Ailidani Ailijiang discusses the factors that contribute most to the high availability of cloud computing services. The authors introduce the concept of self-stabilization, a type of fault tolerance, and describe how such systems can leverage it to achieve the required high availability by dealing with faults in a principled, unified manner. They explain that cloud computing systems rely on infrastructure support to simplify their design and reduce the need for sophisticated fault-tolerance mechanisms, but as the uses for these services grow more complex (as with the composition of microservices), the need for more sophisticated recovery techniques, such as self-stabilization, becomes more critical.
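To make the notion of self-stabilization concrete (the paper itself stays at the architectural level), the classic illustration is Dijkstra's K-state token ring: started from any state, even an arbitrarily corrupted one, the ring converges to a legitimate state with exactly one privileged machine (one "token"). A minimal Python sketch of that algorithm, with the ring size and seed chosen only for demonstration:

```python
import random

def privileged(x, i):
    """Machine i holds the token ('is privileged') per Dijkstra's K-state rule."""
    n = len(x)
    if i == 0:
        return x[0] == x[n - 1]
    return x[i] != x[i - 1]

def step(x, i, K):
    """Fire machine i's move if it is privileged, passing the token along."""
    n = len(x)
    if i == 0 and x[0] == x[n - 1]:
        x[0] = (x[0] + 1) % K
    elif i != 0 and x[i] != x[i - 1]:
        x[i] = x[i - 1]

def stabilize(x, K, rounds=1000):
    """Repeatedly fire an arbitrary privileged machine (at least one always
    exists); return the number of tokens held at the end."""
    n = len(x)
    for _ in range(rounds):
        candidates = [i for i in range(n) if privileged(x, i)]
        step(x, random.choice(candidates), K)
    return sum(privileged(x, i) for i in range(n))

# Start from an arbitrary (corrupted) state; K must exceed the ring size.
random.seed(0)
ring = [random.randrange(7) for _ in range(5)]
print(stabilize(ring, K=7))  # converges to exactly one privileged machine: 1
```

No matter how the state array is corrupted, the system converges to, and then stays within, the set of legitimate states — which is exactly the property the authors propose applying to cloud services.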
The authors start by presenting examples of global-scale, high-availability systems and describe the architecture and design principles of these services and their interactions with each other. Internet-scale applications utilize cluster management, data storage, data processing, and application layers for their software infrastructure. Each layer works as a distributed system that adopts a simple architecture pattern in which a master node maintains the critical state and workers accept tasks from the master and compute on data in a stateless manner. Large-scale applications typically rely on a Service-Oriented Architecture (SOA), where functionality is divided into isolated services and each service communicates with the others through a communication protocol over a network. This pattern provides scalability through functional decomposition, and recent movements toward stateless computing in the cloud indicate its success in achieving this.
In the next section, the authors review the literature on what types of faults occur in cloud computing systems and what fault-tolerance and recovery mechanisms are employed to deal with them. The authors explain that the crash-fault approach to fault tolerance, where a node failure that does not impact any global state of the system simply triggers a restart or the provisioning of a new node, is crucial for small-scale failures. Large-scale failures are much more problematic and require more intricate recovery models. The authors describe a number of recovery models utilized in these systems to handle such large-scale failures.
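The crash-restart pattern — stateless workers whose failure is handled simply by provisioning a replacement, since all critical state lives with the master — can be sketched as follows. The task queue, the squaring work, and the injected crash are all illustrative, not from the paper:

```python
import queue

def run_worker(tasks, results, fail_once):
    """Stateless worker: pulls tasks and computes (here, squares) until the
    queue is empty. A crash on a chosen task simulates a crash fault; the
    in-flight task is put back, standing in for master-side re-delivery."""
    while not tasks.empty():
        task = tasks.get()
        if task in fail_once:
            fail_once.discard(task)   # crash only once per task
            tasks.put(task)           # stand-in for the master re-enqueueing it
            raise RuntimeError(f"worker crashed on task {task}")
        results.put((task, task * task))

def supervisor(task_list, fail_once=()):
    """Crash-fault recovery: because workers hold no global state, recovering
    from a worker crash is just provisioning a fresh worker."""
    tasks, results = queue.Queue(), queue.Queue()
    for t in task_list:
        tasks.put(t)
    fail_once = set(fail_once)
    while not tasks.empty():
        try:
            run_worker(tasks, results, fail_once)
        except RuntimeError:
            pass                      # detected crash: restart the worker
    return sorted(results.queue)

print(supervisor([1, 2, 3], fail_once={2}))  # [(1, 1), (2, 4), (3, 9)]
```

The point of the sketch is that recovery requires no diagnosis at all: the crashed worker is simply replaced, which is exactly why this approach breaks down for the large-scale, state-corrupting failures the authors turn to next.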
In the final section, the authors point to opportunities for applying self-stabilization techniques to emerging topics in cloud computing. Transactions are often used to abstract the complexity of distributed systems away from developers, but they are plagued with performance and compatibility issues. The authors give examples where self-stabilization can be used in place of transactions and describe some of the benefits of this approach.
Overall, the paper describes the current state of large-scale distributed systems and their approaches to fault tolerance and recovery. The authors, being proponents of self-stabilization, aim to highlight how such strategies may be implemented in these systems, specifically in cases where multiple microservices work together to provide higher-level services.
In terms of the knowledge provided by this paper, I found the survey of current cloud computing design principles very enlightening. The breakdown of faults found in cloud computing services is very comprehensible and provides some implicit insight into the recovery strategies found in these systems. For example, 15% of failures are the result of bugs, 10% are the result of misconfigurations, and 16% come from bugs and misconfigurations introduced during system upgrades. These kinds of failures, to me, seem difficult to systematically catch and recover from, since they are introduced on a level outside the system's considerations. In other words, it would seem the developers of these systems have to assume that the system itself functions as expected; otherwise, the number of fault-inducing circumstances they'd have to consider would be infeasible. Obviously nobody can expect every possible fault to be accounted for, but it's not clear where this line (as fuzzy as it may be) is drawn.
The paper describes the recovery models used by such systems. The idea is that any fault that arises should fall into one of these categories, and the prescribed recovery strategy should be able to handle it. This makes sense, but it necessitates the ability to accurately categorize these issues, which doesn't seem trivial. Furthermore, some of these methods consist of developers correcting the issue in an offline fashion, which isn't automated and isn't suitable for the large-scale systems discussed in this paper. Of course, in lieu of a better recovery option, the developers don't have much of a choice but to "manually" fix these kinds of issues, but the paper doesn't delve into the details of these kinds of faults, and I feel this is where some focus should be placed.
As a developer, system upgrades are almost the only time I become anxious about faults, because they naturally introduce unknown variables into a system's functionality. The longer a system runs, the more likely we are to experience an unforeseeable fault, giving us the opportunity to analyze and fix the issue so that we can avoid (or properly recover from) it in the future. When an update is deployed, there is a possibility that we are unwittingly introducing more of these unforeseeable faults. This may be avoidable, but based on the statistics offered by the paper and experience in practice, upgrade-induced faults also seem to be of higher concern than faults that are more predictable or outside our control.
I’d like to see more emphasis on how these faults can be handled and how self-stabilization may facilitate this. The paper emphasizes self-stabilizing recovery, but it lacks detail on how one might actually implement such recovery methods. I find the idea intriguing, since self-stabilization might be the best approach to dealing with developer-induced faults, and one might imagine a recovery model where code and configuration versioning are incorporated directly, so that the system has some “intelligence” with regard to how it resolves faults through self-stabilization. But, as it stands, it would seem self-stabilization has some way to go before it can feasibly be used in such a manner.
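One shape the versioned-configuration recovery model I'm imagining could take is a reconciliation loop: treat the known-good (versioned) configuration as the legitimate state and repeatedly correct any drift toward it, without ever diagnosing how the corruption occurred. This is a hypothetical sketch of that idea, not something the paper proposes; the config keys are invented:

```python
def reconcile(actual, desired):
    """One stabilization round: converge the live configuration toward the
    versioned desired state, whatever the source of the corruption."""
    for key, want in desired.items():
        if actual.get(key) != want:
            actual[key] = want        # repair missing or corrupted entries
    for key in set(actual) - set(desired):
        del actual[key]               # remove stray entries
    return actual

# A corrupted live config converges to the desired (versioned) one.
desired = {"replicas": 3, "timeout_ms": 500}
live = {"replicas": 7, "debug": True}  # arbitrary corruption
print(reconcile(live, desired))        # {'replicas': 3, 'timeout_ms': 500}
```

Run periodically, such a loop is self-stabilizing with respect to configuration state: from any starting point it reaches the desired state in one round and stays there. The hard part, which the paper leaves open, is defining the "desired state" for computation and data rather than just configuration.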