Concept: Fault vs. Failure
1. The Snapshot
A Fault is a local deviation from spec (one component misbehaves); a Failure is a system-level breakdown (the system as a whole stops providing the required service to users).
2. The Description
This distinction is crucial for building reliable systems. Since it is impossible to reduce the probability of a fault to zero, the engineering goal is to design systems that prevent faults from becoming failures.
3. Author Quotes
"Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user." (p. 7)
4. Defining Features
- Fault: A component-level error (e.g., HDD crash, network packet loss).
- Failure: A system-level outage (e.g., users receive "500 Internal Server Error" because the service as a whole cannot fulfil requests).
- Goal: Build fault-tolerant systems that prevent faults from causing failures.
5. The Boundary
- Fault is NOT Failure: You can have many faults (e.g., several redundant nodes crashing) without a system failure, provided the architecture is resilient enough that the remaining components keep serving users.
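The boundary above can be sketched in code. This is a minimal illustration, not a production pattern: `make_replica`, `serve`, `ReplicaFault`, and `ServiceFailure` are hypothetical names invented for this note. Each replica may raise a fault, and the caller only sees a failure when every replica has faulted.

```python
class ReplicaFault(Exception):
    """One replica deviates from its spec (a fault)."""


class ServiceFailure(Exception):
    """No replica can serve the request (a failure)."""


def make_replica(name, healthy=True):
    """Return a hypothetical replica: a callable that answers or raises."""
    def handle(request):
        if not healthy:
            raise ReplicaFault(f"{name} is down")
        return f"{name} served {request}"
    return handle


def serve(replicas, request):
    """Try each replica in turn; individual faults are tolerated."""
    faults = []
    for replica in replicas:
        try:
            return replica(request), faults
        except ReplicaFault as fault:
            faults.append(str(fault))  # record the fault and keep going
    # Every replica faulted: the faults have become a failure.
    raise ServiceFailure(f"all replicas faulted: {faults}")


replicas = [
    make_replica("a", healthy=False),
    make_replica("b", healthy=False),
    make_replica("c"),
]
response, faults = serve(replicas, "GET /")
# Two faults occurred, yet the user still got a response.
```

Here two of the three replicas fault, but the request is still answered by the third, so the user never observes a failure. Only when the list contains no healthy replica does `serve` raise `ServiceFailure`.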
6. The Prototype
A car with a flat tire has a Fault. If you have a spare and fit it, you still reach your destination, so there is no Failure. If you have no spare, you are stranded: the fault has become a failure.
7. Helpful Info
Netflix's "Chaos Monkey" is a famous application of this concept. It deliberately introduces faults (for example, by terminating server instances) to verify that the system tolerates them without experiencing a failure.
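The idea of deliberate fault injection can be sketched as a toy experiment. This is not Chaos Monkey's actual interface: `Cluster` and `chaos_round` are hypothetical names made up for this note, and "killing" a server is simulated by flipping a flag.

```python
import random


class Cluster:
    """A hypothetical pool of servers (not Chaos Monkey's real API)."""

    def __init__(self, names):
        self.alive = {name: True for name in names}

    def kill(self, name):
        """Inject a fault by marking one server as dead."""
        self.alive[name] = False

    def handle(self, request):
        """Serve from any live server; raise if none remain (a failure)."""
        for name, up in self.alive.items():
            if up:
                return f"{name}: {request}"
        raise RuntimeError("no live servers")


def chaos_round(cluster, rng):
    """Kill one randomly chosen live server and return its name."""
    live = [name for name, up in cluster.alive.items() if up]
    victim = rng.choice(live)
    cluster.kill(victim)
    return victim


rng = random.Random(0)  # seeded so the experiment is repeatable
cluster = Cluster(["s1", "s2", "s3"])
victim = chaos_round(cluster, rng)
reply = cluster.handle("GET /health")  # still answered: a fault, not a failure
```

After one round of injected faults the cluster still answers the health request from a surviving server. If `handle` raised instead, the experiment would have exposed a fault that the design lets escalate into a failure.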
8. The Swap Test
"We use redundancy to ensure that a hardware Fault doesn't result in a total system Failure."
9. Source Reference
ddia/pages/page_029.txt
🧠 Pedagogical Tracking
| Milestone | Status | Date | Lesson Ref | Notes |
|---|---|---|---|---|
| Introduced in Lesson | ⚪ | | | |
| Active Recall #1 | ⚪ | | | |
| 1-Day Review | ⚪ | | | |
| 1-Week Review | ⚪ | | | |
| 1-Month Review | ⚪ | | | |