Understanding Distributed Systems – Resiliency
Introduction
Chapter 24 – Common Failure Causes
- Hardware Faults
- Incorrect Error Handling
- Configuration Changes
- Single Points of Failure
- Network Faults
- Resource Leaks
- Load Pressure
- Cascading Failures
Chapter 25 – Redundancy
Redundancy, the replication of functionality or state, is a critical defense against failures. When replicated over multiple nodes, functionality or state can be maintained even if a node fails. This redundancy not only enhances availability but also enables horizontal scaling, as discussed in Part III.
Redundancy is a key reason why distributed applications can achieve better availability than single-node applications. However, not all forms of redundancy improve availability. Marc Brooker outlines four prerequisites:
- The complexity added by redundancy mustn’t cost more availability than it adds.
- The system must reliably detect which redundant components are healthy and which are unhealthy.
- The system must be able to run in degraded mode.
- The system must be able to return to fully redundant mode.
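A rough back-of-the-envelope calculation helps show why redundancy can raise availability and why the prerequisites above matter. The sketch below assumes that replicas fail independently and that any single healthy replica can serve requests. Both are simplifying assumptions rather than claims from the chapter, and the function name and the 99% figure are purely illustrative.

```python
def combined_availability(node_availability: float, replicas: int) -> float:
    """Availability of a group that is up while at least one replica is up.

    Assumes independent failures, which correlated faults violate.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - node_availability) ** replicas


if __name__ == "__main__":
    # Illustrative figure: each node is assumed to be up 99% of the time.
    for n in (1, 2, 3):
        print(n, round(combined_availability(0.99, n), 6))
```

This idealized gain only materializes if the prerequisites hold: a failover path that is itself buggy, or health checks that misjudge replicas, can cost more availability than the extra replicas add.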
Chapter 26 – Fault Isolation
While redundancy addresses infrastructure faults, some failures are so highly correlated that redundancy alone cannot tolerate them. Fault isolation limits their impact instead. One example is shuffle sharding, a variation of partitioning that helps mitigate the impact of degraded partitions on stateless services.
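As a minimal sketch of the shuffle sharding idea, each client can be mapped to a small, pseudo-randomly chosen subset of nodes (a "virtual shard"), so that a client that degrades its nodes fully overlaps with only a few other clients. The function name, the hashing scheme, and the node names below are assumptions made for illustration, not the book's implementation.

```python
import hashlib
import random


def shuffle_shard(client_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Pick a deterministic pseudo-random subset of nodes for one client."""
    if not 1 <= shard_size <= len(nodes):
        raise ValueError("shard_size must be between 1 and len(nodes)")
    # Seed a private RNG from a stable hash of the client ID so the same
    # client maps to the same virtual shard on every call.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(nodes, shard_size))


if __name__ == "__main__":
    # Hypothetical fleet of eight stateless nodes.
    nodes = [f"node-{i}" for i in range(8)]
    for client in ("alice", "bob", "carol"):
        print(client, shuffle_shard(client, nodes, 2))
```

With 8 nodes and a shard size of 2 there are 28 possible virtual shards, so two clients are assigned exactly the same pair of nodes only occasionally, whereas plain partitioning into 4 fixed pairs would put a quarter of all clients on the same partition.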
Chapter 27 – Downstream Resiliency
This chapter explores tactical resiliency patterns that stop faults from propagating from one component or service to another, complementing the architectural measures (redundancy and fault isolation) of the previous chapters.
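The summary does not name specific patterns. As one illustration, a circuit breaker is a commonly used pattern of this kind: after repeated failures it stops calling an unhealthy dependency for a while, so callers fail fast instead of piling up behind it. The class below is a simplified, single-threaded sketch; the class names, thresholds, and injectable clock are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of waiting on an unhealthy dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to the dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or return an error immediately. A production version would also need locking for concurrent callers.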
Chapter 28 – Upstream Resiliency
In contrast to the previous chapter, which deals with downstream failures such as being unable to reach an external dependency, this chapter discusses mechanisms that protect a service against upstream pressure.
Summary
As the complexity of a system increases, so does the likelihood of failures. Because it is impossible to build an infallible system, engineers often end up prioritizing minimizing and tolerating failures over scaling out. The goal is to reduce the blast radius of failures and to stop cracks from propagating from one component to another.