Reliability & Availability

A system should be resilient (fault-tolerant) and performant under expected load.

Strategies:

  • Design for failure and trigger faults deliberately, e.g., kill processes without warning.
  • Consider hardware faults such as blackouts and hard disk crashes, and add redundancy as necessary.
  • Consider software faults such as:
    • Processes that slow down or that return corrupted responses.
    • Fault cascading, where a fault triggers faults in other components.
  • Measure/monitor the system to identify faults.
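
As a minimal sketch of triggering faults deliberately, the wrapper below makes a call fail on chosen attempts so that the caller's recovery path gets exercised. The names inject_faults, call_with_retry, and InjectedFault are hypothetical and exist only for this illustration; a real setup would more likely kill actual processes or machines.

```python
from typing import Callable, Set


class InjectedFault(Exception):
    """Raised by the injector to simulate a crashed dependency."""


def inject_faults(func: Callable, fail_on_calls: Set[int]) -> Callable:
    """Wrap func so that the given call numbers (1-based) raise InjectedFault."""
    calls = {"count": 0}

    def wrapper(*args, **kwargs):
        calls["count"] += 1
        if calls["count"] in fail_on_calls:
            raise InjectedFault(f"injected fault on call {calls['count']}")
        return func(*args, **kwargs)

    wrapper.calls = calls  # exposed so the number of attempts can be inspected
    return wrapper


def call_with_retry(func: Callable, max_attempts: int = 3):
    """Call func until it succeeds or max_attempts is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err  # remember the fault and try again
    raise last_error


# The first two calls fail on purpose; the third one succeeds.
flaky = inject_faults(lambda: "ok", fail_on_calls={1, 2})
result = call_with_retry(flaky, max_attempts=3)
```

If the faults outnumber the retries, call_with_retry re-raises the last InjectedFault, which is the signal that the system does not yet tolerate that failure pattern.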

Scalability

A system should be able to handle load increases. Load is described with load parameters, which depend on the architecture, for example:

  • Queries per second (QPS) to a web server.
  • Ratio of reads/writes in a DB.
  • Cache hit/miss rate.
  • Number of simultaneous users in a real-time system.
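
The first three load parameters can be derived from raw counters. The sketch below assumes a hypothetical LoadSample record holding counts collected over one measurement window; the field and function names are illustrative, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class LoadSample:
    """Counters collected over one measurement window (hypothetical shape)."""
    requests: int
    window_seconds: float
    reads: int
    writes: int
    cache_hits: int
    cache_misses: int


def qps(sample: LoadSample) -> float:
    """Queries per second over the window."""
    return sample.requests / sample.window_seconds


def read_write_ratio(sample: LoadSample) -> float:
    """Reads per write; infinite when there were no writes."""
    if sample.writes == 0:
        return float("inf")
    return sample.reads / sample.writes


def cache_hit_rate(sample: LoadSample) -> float:
    """Fraction of cache lookups that were hits; 0.0 when there were no lookups."""
    lookups = sample.cache_hits + sample.cache_misses
    return sample.cache_hits / lookups if lookups else 0.0


sample = LoadSample(requests=6000, window_seconds=60,
                    reads=900, writes=100,
                    cache_hits=450, cache_misses=50)
```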

Handling load:

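  • Scaling up (vertical scaling): move to a more powerful machine; simple, but limited by the largest machine available.
  • Scaling out (horizontal scaling): distribute the load across multiple machines; complex.
  • Manual scaling: a human decides when to add resources; simple, suits systems with predictable load.
  • Elastic scaling: resources are added automatically as load increases; complex, suits systems with unpredictable load.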
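
One way to picture elastic scaling is a rule that turns the observed load into an instance count. The sketch below is an assumption for illustration only: the function desired_instances, its parameters, and the min/max bounds are hypothetical and do not describe any specific autoscaler.

```python
import math


def desired_instances(current_qps: float,
                      qps_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20) -> int:
    """Return how many instances are needed to serve current_qps.

    Assumes every instance can handle qps_per_instance and that load is
    spread evenly; the result is clamped to [min_instances, max_instances].
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    needed = math.ceil(current_qps / qps_per_instance)
    return max(min_instances, min(max_instances, needed))


# 950 QPS at 100 QPS per instance needs 10 instances.
instances = desired_instances(current_qps=950, qps_per_instance=100)
```

With manual scaling, a person would run the same calculation ahead of time and provision a fixed number of machines.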

Performance

  • Throughput: number of requests processed per second.
  • Latency: time to handle the request.
  • Response time: latency + network/queue delays.

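Response time is reported in percentiles. Gather the response times for a set of requests over a period of time and sort them from fastest to slowest; the pN value is the response time that N% of the requests did not exceed (p50 is the median). Common percentiles are p50, p95, p99, and p999 (the 99.9th percentile), and they are the kind of metric used in SLAs.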
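
A minimal sketch of the percentile calculation, using the nearest-rank method (one of several common definitions; monitoring tools may interpolate or approximate instead). The function name percentile and the sample data are assumptions for illustration.

```python
import math
from typing import Sequence


def percentile(response_times_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(response_times_ms)          # fastest to slowest
    rank = math.ceil(p * len(ordered) / 100)     # 1-based rank
    return ordered[rank - 1]


# 100 samples of 1..100 ms, given in unsorted (reversed) order.
samples = list(range(100, 0, -1))
p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
p99 = percentile(samples, 99)
p999 = percentile(samples, 99.9)
```

With only 100 samples, p999 collapses onto the slowest request; high percentiles need a large number of samples to be meaningful.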

When a request involves parallel calls to multiple services, its response time is determined by the slowest call: it equals the maximum of the individual response times.
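
The effect of parallel calls can be sketched with asyncio, where asyncio.sleep stands in for a service call. The service names and delays are made up for the illustration.

```python
import asyncio
import time
from typing import Dict


async def call_service(name: str, delay_s: float) -> str:
    """Simulate a backend call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return name


async def fan_out(delays: Dict[str, float]) -> float:
    """Call all services in parallel and return the total elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(*(call_service(n, d) for n, d in delays.items()))
    return time.perf_counter() - start


# Hypothetical services with different response times (in seconds).
delays = {"users": 0.05, "orders": 0.2, "recommendations": 0.1}
elapsed = asyncio.run(fan_out(delays))
# elapsed is close to max(delays.values()), not to their sum.
```

This is also why a single slow backend call is enough to slow down the whole request.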

Durability

Data should not be lost once the system has accepted it.

Monitoring & metrics collection

Capture metrics about the data going in and out of the system.