Reliability & Availability

A system should be resilient (fault-tolerant) and performant under expected load


  • design for failure and trigger them deliberately e.g. kill processes without a warning
  • consider hardware faults such as blackouts, hard disk crashes, add redundancy as necessary
  • consider software faults such as
    • processes that slow down or that return corrupted responses
    • fault cascading where the a fault triggers faults in other components
  • measure/monitor the system to identify faults


A system should be able to handle load increases

  • Queries per second (QPS) to a web server
  • Ratio of read/writes in a DB
  • Cache hit/miss rate
  • Number of simultaneous users in a realtime system

Handling load

  • scaling up (vertical scaling), simple
  • scaling out (horizontal scaling), complex
  • manual scale, for predictable systems, simple
  • elastic scale, add resources as load increases, for unpredictable systems, complex


  • throughput: number of requests processed per second
  • latency: time to handle the request
  • response time: latency + network/queue delays

For the response time we use percentiles, given some metrics gathered for a set of requests in a period of time sort them from fastests to slowest, the common metrics are p50, p95, p99, p999 (used in SLAs)

When a requests involves parallel calls to multiple services, the response time is equal to the service which took the maximum time


Data should not be lost once sent to a system

Monitoring & metrics collection

Capture metrics about the data going in/out of the system