A system should be resilient (fault-tolerant) and performant under expected load


  • design for failure and trigger them deliberately e.g. kill processes without a warning
  • consider hardware faults such as blackouts, hard disk crashes, add redundancy as necessary
  • consider software faults such as
    • processes that slow down or that return corrupted responses
    • fault cascadig where the a fault triggers faults in other components
  • measure/monitor the system to identify faults



  • Queries per second (QPS) to a web server
  • Ratio of read/writes in a DB
  • Cache hit/miss rate
  • # of simultaneous users in a realtime system

Approximations for back of the envelope calculations

Operations in a day

Exact value:        24 h/day * 3600 s/h = 86,400 s
Approximate value:  ~80k s
Approximate (frac): ~80k s (not needed)

Example: 200 M users in a chat system sending 40 messages in average every day

\[ \frac{200M * 40}{80k \; s} = \frac{8 * 10^9}{8 * 10^4 \; s} = 10^5 QPS \]

Operations in a month

Exact value:        30 day/month 24 h/day * 3600 s/h = 2,592,000 s
Approximate value:  ~2.5M s
Approximate (frac): ~5/2M s

Example: 500 M requests per month

\[ \frac{500M}{5/2M \; s} = \frac{500 * 2}{5} \; QPS = 200 \; QPS \]

Handling load

  • scaling up (vertical scaling), simple
  • scaling out (horizontal scaling), complex
  • manual scale, for predictable systems, simple
  • elastic scale, add resources as load increases, for unpredictable systems, complex


  • throughput, # requests processed per second
  • latency (time to handle the request)
  • response time (latency + network/queue delays)

For the response time we use percentiles, given some metrics gathered for a set of requests in a period of time sort them from fastests to slowest, the common metrics are p50, p95, p99, p999 (used in SLAs)

When a requests involves parallel calls to multiple services, the response time is equal to the service which took the maximum time