
Concept: Fault vs. Failure

1. The Snapshot

A Fault is a local deviation from spec (one component misbehaves); a Failure is when the system as a whole stops providing the required service to users.

2. The Description

This distinction is crucial for building reliable systems. Since it is impossible to reduce the probability of a fault to zero, the engineering goal is to design systems that prevent faults from becoming failures.

3. Author Quotes

"Note that a fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user." (p. 7)

4. Defining Features

  • Fault: A component-level deviation from spec (e.g., a hard disk crash, network packet loss).
  • Failure: A system-level loss of service (e.g., users get "500 Internal Server Error" instead of the page they requested).
  • Goal: Build fault-tolerant systems that prevent faults from causing failures.

5. The Boundary

  • Fault is NOT Failure: You can have many faults (e.g., individual redundant nodes crashing) without a system failure, provided the architecture is resilient enough to keep serving users.
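
The boundary can be made concrete with a minimal sketch. Everything here is hypothetical and for illustration only (the names ReplicaFault, SystemFailure, make_replica and fault_tolerant_read are not from the book): a read is tried against redundant replicas, each replica error counts as a fault, and a failure is raised only if no replica can answer.

```python
class ReplicaFault(Exception):
    """One component deviated from its spec (a fault)."""


class SystemFailure(Exception):
    """The system as a whole could not serve the request (a failure)."""


def make_replica(name, healthy):
    """Return a hypothetical replica read function; unhealthy replicas always fault."""
    def read(key):
        if not healthy:
            raise ReplicaFault(f"{name} is down")
        return f"{key}@{name}"
    return read


def fault_tolerant_read(replicas, key):
    """Try each replica in order; faults are tolerated until none are left."""
    faults = []
    for read in replicas:
        try:
            return read(key), faults
        except ReplicaFault as exc:
            faults.append(str(exc))  # a fault occurred, but the request may still succeed
    # Only when every redundant component has faulted does the user see a failure.
    raise SystemFailure(f"all {len(replicas)} replicas faulted: {faults}")


replicas = [
    make_replica("node-a", healthy=False),
    make_replica("node-b", healthy=False),
    make_replica("node-c", healthy=True),
]
value, faults = fault_tolerant_read(replicas, "user:42")
print(value)        # user:42@node-c
print(len(faults))  # 2
```

Two faults occurred during this read, yet the caller still received an answer. With all three replicas unhealthy, the same call would raise SystemFailure instead.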

6. The Prototype

A car with a flat tire has a Fault. If you have a spare and change the tire, you still reach your destination, so there is no Failure. If you have no spare, you are stranded: the fault has become a failure.

7. Helpful Info

Netflix's "Chaos Monkey" is a well-known application of this concept. It deliberately introduces faults by randomly terminating instances in production, so that engineers can confirm the system as a whole keeps serving users, i.e., that these faults do not become failures.
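
The idea of deliberate fault injection can be sketched in a few lines. This is a toy simulation under stated assumptions, not Chaos Monkey's actual API: Instance, Service and inject_fault are hypothetical names, and "terminating" an instance just flips a flag in memory.

```python
import random


class Instance:
    """Hypothetical server instance that can be terminated."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is terminated")
        return f"ok:{request}"


class Service:
    """Hypothetical service that routes a request to the first live instance."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # this instance faulted; try the next one
        raise RuntimeError("service failure: no instance could serve the request")


def inject_fault(service, rng):
    """Terminate one randomly chosen live instance and return its name."""
    victim = rng.choice([i for i in service.instances if i.alive])
    victim.alive = False
    return victim.name


rng = random.Random(0)  # seeded so the experiment is repeatable
service = Service([Instance(f"i-{n}") for n in range(3)])

killed = inject_fault(service, rng)   # deliberately cause a fault
response = service.handle("ping")     # then check that users are still served
print(f"terminated {killed}, response: {response}")
```

If the call after inject_fault raised RuntimeError, the experiment would have exposed a path by which a single fault turns into a failure.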

8. The Swap Test

"We use redundancy to ensure that a hardware Fault doesn't result in a total system Failure."

9. Source Reference

ddia/pages/page_029.txt


🧠 Pedagogical Tracking

Milestone | Status | Date | Lesson Ref | Notes
Introduced in Lesson
Active Recall #1
1-Day Review
1-Week Review
1-Month Review