With replication

  • Copies of each partition are stored on multiple nodes for fault tolerance.

Partitioning strategies

  • Unfair partitioning is skewed and can create hotspots: nodes with a disproportionate share of the data or the query load.
  • Assigning records to nodes randomly spreads the data evenly, but reads become expensive: there is no way to tell which node holds a given key, so all nodes must be queried.
  • By key range (e.g., as in a paper dictionary split into volumes: Node 1 holds words from A to B, Node 2 from B to C, etc.).
    • Within each partition, keys are stored in order.
    • Range scans are easy.
    • This may lead to hotspots (e.g., if all keys belong to the range A to B, Node 1 will become the hotspot).
  • By hash of key (e.g., take the hash of the key and assign each partition a contiguous range of hashes, similar to consistent hashing; see the sketch after this list).
    • It distributes data evenly.
    • Range scans are no longer efficient, since keys that were adjacent are now scattered across the partitions.
    • Cassandra allows a multi-column primary key; the first part of the key is hashed to determine the partition, and the other columns are used as a concatenated index for SSTables.
  • A hybrid approach can be taken for skewed workloads (e.g., where all writes/reads hit one hot key).
    • Append a random two-digit suffix to the key (key00, key01, …, key99), spreading writes over 100 different keys; reads must then fetch all 100 keys and merge the results, trading read performance for write throughput (see the sketch below).
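
A minimal Python sketch of both ideas: hash partitioning and hot-key splitting. All names here (NUM_PARTITIONS, partition_for_key, salted_key, keys_to_read) are illustrative, not any database's API, and the modulo stands in for assigning each partition a contiguous slice of the hash space.

```python
import hashlib
import random

NUM_PARTITIONS = 8  # illustrative; real clusters use many more

def partition_for_key(key: str) -> int:
    # Hash the key, then map the hash onto the fixed set of partitions.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % NUM_PARTITIONS

def salted_key(hot_key: str) -> str:
    # Hot-key splitting: append a random two-digit suffix on writes,
    # spreading one logical key across 100 physical keys.
    return f"{hot_key}{random.randint(0, 99):02d}"

def keys_to_read(hot_key: str) -> list[str]:
    # The read-side cost: a reader must fetch all 100 variants and merge.
    return [f"{hot_key}{i:02d}" for i in range(100)]
```

Note that partition_for_key("handbag") and partition_for_key("handbags") will usually land on different partitions, which is exactly why range scans stop being efficient under hash partitioning.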

Rebalancing partitions

Things that change in a database over time:

  • Query throughput increases and the dataset grows, requiring more CPUs, RAM, and disk, so new nodes are added to the cluster.
  • A machine fails, and other machines need to take over its responsibilities.

Rebalancing requirements:

  • Load should be fairly shared.
  • The database should accept reads/writes while it’s being rebalanced.
  • No more data than necessary should be moved (to minimize I/O).

Strategies:

  • Fixed number of partitions (Riak):
    • Many more partitions are created than there are nodes, and several partitions are assigned to each node.
    • When a new node is added, it steals some partitions from every other node until load is evenly shared (see the sketch after this list).
    • When a node is removed, it distributes its partitions to every other node.
    • The number of partitions is fixed when the database is set up and is not changed afterward.
    • Choosing the number of partitions is difficult if the size of the dataset varies.
  • Dynamic partitioning (MongoDB):
    • A partition is split when it grows beyond a configured size, and merged with an adjacent partition when it shrinks below a threshold.
    • The number of partitions adapts to the dataset’s size.
    • An empty database starts with a single partition, so all writes go to a single node while the others sit idle (some databases mitigate this with pre-splitting, i.e., configuring an initial set of partitions).
  • Partitioning proportionally to nodes (Cassandra):
    • Each node has a fixed number of partitions, so the total number of partitions is proportional to the number of nodes; while the node count is stable, partitions simply grow with the dataset.
    • When a node is added, it splits some existing partitions and takes over half of each, making partitions smaller and redistributing the data.
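
A toy Python sketch of the fixed-number-of-partitions strategy, assuming a round-robin initial placement (initial_assignment and add_node are made-up names, not Riak's actual algorithm). It demonstrates the key property: keys never change partitions, and only whole partitions move between nodes.

```python
NUM_PARTITIONS = 256  # fixed when the database is first set up

def initial_assignment(nodes: list[str]) -> dict[int, str]:
    # Round-robin the fixed set of partitions over the initial nodes.
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    # The new node steals whole partitions from the existing nodes until
    # every node holds roughly NUM_PARTITIONS / (old count + 1) partitions.
    counts: dict[str, int] = {}
    for node in assignment.values():
        counts[node] = counts.get(node, 0) + 1
    target = NUM_PARTITIONS // (len(counts) + 1)
    new_assignment = dict(assignment)
    stolen = 0
    for partition, node in sorted(assignment.items()):
        if stolen == target:
            break
        if counts[node] > target:  # only steal from overloaded nodes
            new_assignment[partition] = new_node
            counts[node] -= 1
            stolen += 1
    return new_assignment

before = initial_assignment(["node1", "node2", "node3"])
after = add_node(before, "node4")
moved = sum(1 for p in before if before[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions moved")  # 64, not all 256
```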

Request routing

Problem: How does a client know which node to connect to?

  • Clients can connect to any node via a round-robin load balancer.
    • Cassandra and Riak use a gossip protocol among the nodes to disseminate changes in cluster state.
    • A request can be sent to any node, which forwards it to the appropriate node.
    • This puts more complexity on the database nodes but avoids a dependency on an external coordination service.
  • Requests are sent to a routing tier acting as a partition-aware load balancer.
    • ZooKeeper is a coordination service that keeps track of the cluster metadata, mapping partitions to nodes. Whenever a partition changes ownership, or a node is added or removed, ZooKeeper notifies the routing tier.
  • The client is partition-aware and connects directly to the appropriate node, with no intermediary (see the sketch below).
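
As a sketch of that last option, here is a hypothetical partition-aware client in Python. It caches the partition-to-node map, hashes each key to find the owning node directly, and re-fetches the map after a rebalance; fetch_assignment stands in for reading the cluster metadata from wherever it lives (e.g., a coordination service), and none of this is a real client library's API.

```python
import hashlib

class PartitionAwareClient:
    def __init__(self, fetch_assignment, num_partitions: int):
        # fetch_assignment() returns a dict mapping partition -> node address.
        self.fetch_assignment = fetch_assignment
        self.num_partitions = num_partitions
        self.assignment = fetch_assignment()

    def node_for_key(self, key: str) -> str:
        # Hash the key to a partition, then look up the owning node.
        h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
        return self.assignment[h % self.num_partitions]

    def refresh(self) -> None:
        # Call this when a request hits a node that no longer owns the
        # partition, e.g., after rebalancing has moved it.
        self.assignment = self.fetch_assignment()

# Usage: a static map stands in for real cluster metadata.
client = PartitionAwareClient(
    lambda: {p: f"node{p % 3 + 1}" for p in range(8)}, num_partitions=8
)
print(client.node_for_key("user:42"))  # e.g., "node2"
```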