Consistent Hashing

Rahul Pradeep
6 min read · Jun 21, 2021

To understand the need for consistent hashing, we first need to understand the need for data partitioning. A dataset that is small enough can be stored on a single physical machine. Every read request is served by this one machine. This single machine is the ‘master’.

For fault tolerance, let’s add a few slaves to this master. In this configuration, any request that modifies the state goes through the master. The slaves keep themselves in sync with the master so that they can serve read requests. This eliminates a single point of failure for read requests. Since a single master handles all writes, we still have a single point of failure for write requests. But this failure can be corrected easily by promoting one of the slaves to master, either manually or automatically.

As the dataset grows, this kind of system fails to scale simply because we don’t have infinite disk capacity on a single machine. At some point, we need to look at partitioning the data and distributing these partitions across multiple physical machines. This is a commonly used pattern in many NoSQL database systems.

Partition Strategies

To understand the various ways to partition data, let us make the following assumptions:

  • The dataset that we are dealing with consists of key:value pairs.
  • The database system has N physical machines.
  • There are four kinds of operations possible on this dataset: create, read, update and delete.

Random Partitioning

Whenever a new key needs to be written to our database, we choose one machine at random and write the key:value pair to that machine. This is the simplest form of partitioning. Create operations in such a partitioning scheme are highly performant, but all other operations are costly. For a read operation on a key, we first need to find the machine that holds the key, which requires sending parallel read requests to all N machines. The same needs to be done for update and delete requests as well.
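Here is a minimal Python sketch of this scheme (the in-memory dicts below merely stand in for real machines, and the function names are my own), showing why a read has to ask every machine:

```python
import random

N = 4
machines = [dict() for _ in range(N)]  # each dict stands in for one physical machine

def create(key, value):
    # Writes are cheap: pick any machine at random and store the pair there.
    random.choice(machines)[key] = value

def read(key):
    # Reads are costly: we don't know which machine holds the key,
    # so every machine has to be asked (a real system would query them in parallel).
    for machine in machines:
        if key in machine:
            return machine[key]
    return None

create("user:42", {"name": "Alice"})
print(read("user:42"))
```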

Hash Partitioning

In this scheme, for every key, we first compute the hash of the key. The output of hash(key) is a number. To assign a machine to this key, we simply take hash(key)%N. This gives us a number between 0 and N-1, each of which identifies a unique machine in our database cluster. Any CRUD request on a key is now sent to only a single machine.
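A minimal sketch of this routing, assuming MD5 as the hash function (any reasonably uniform hash would do):

```python
import hashlib

N = 4  # number of machines in the cluster

def machine_for(key: str) -> int:
    # hash(key) gives a large integer; taking it modulo N picks a machine in [0, N-1].
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % N

# Every CRUD request for this key is routed to exactly one machine.
print(machine_for("user:42"))
```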

Fault Tolerance

In a distributed database, we need to make sure we handle node/process failures gracefully. For the purpose of this example, let’s assume that when a node fails, only its ability to serve CRUD requests is compromised and the data itself isn’t lost. If the database is optimised for availability (an AP system), it needs to make sure node failures don’t impact clients performing CRUD operations on the data.

In case one of our N machines fails, we have a few options:

  • Bring up another machine in its place and copy all the data held by the failed machine to this new machine. Bringing a machine up could be expensive, and this could lead to a situation where some machines hold more data than others. In the long run, that could bring those overloaded machines down.
  • Copy the data held by the failed machine to one of the N-1 machines that are already up. This is still expensive, but less so than the first approach. The final state is similar to the first approach, and it will lead to a non-uniform cluster.
  • Re-distribute all data across the N-1 active machines. This is probably the most expensive option because it involves moving large amounts of data across machines, but the final state is desirable in the long run: data will be evenly distributed across the N-1 machines and there won’t be any overloaded machine in the cluster.

If we assume that node failures are far less frequent than requests, we can pick the third approach because its final state is the most desirable.

For example, with 4 machines, if machine 3 goes down, we need to re-hash all keys using hash(key)%3 instead of hash(key)%4.

Since re-distributing all data is very expensive, we have to optimise it. The goal is to minimise the movement of data over the network.
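To get a feel for how expensive the re-hash is, here is a quick back-of-the-envelope experiment (the key set and MD5 hash are purely illustrative): with modulo hashing, shrinking the cluster from 4 to 3 machines moves roughly three quarters of all keys.

```python
import hashlib

def machine_for(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(machine_for(k, 4) != machine_for(k, 3) for k in keys)
print(f"{moved / len(keys):.0%} of keys change machines")  # roughly 75%
```

Only keys whose hash gives the same remainder modulo 3 and modulo 4 stay put, which happens for only about a quarter of them.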

Consistent Hashing using Hash Rings

Instead of representing the output of the modulus operation on a line, we can represent it on a circular ring.

Let val = hash(key)%4

If val lies in quadrant 0, we walk clockwise to the nearest active machine on the ring, which is Machine 0. Hence, the key will live on Machine 0.

In case one of the machines fails, only the keys that were previously on the failed machine are re-distributed to the nearest active machine, which in this case is Machine 1.

This significantly reduces data movement over the network. If the number of machines in the cluster is small, this scheme can still result in non-uniform data sizes across the cluster. In the above example, Machine 1 will end up with 50% of the data while the others hold only 25% each. To minimise this impact, we can use a large cluster. Having a large number of nodes in the ring ensures that the extra load a node picks up when its preceding node fails is very small.
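Here is a minimal sketch of this basic hash ring in Python (the 2**32 ring size, MD5 placement and class name are my own choices, not prescribed by any particular database): keys and machines are hashed onto the same ring, lookups walk clockwise, and removing a machine only moves the keys it owned.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32

def ring_hash(value: str) -> int:
    # Place both keys and machines on the same ring of size 2**32.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class HashRing:
    def __init__(self, machines):
        self.ring = sorted((ring_hash(m), m) for m in machines)

    def lookup(self, key: str) -> str:
        # Walk clockwise from the key's position to the nearest machine.
        idx = bisect.bisect_left(self.ring, (ring_hash(key),))
        return self.ring[idx % len(self.ring)][1]

    def remove(self, machine: str):
        # On failure, only the keys owned by this machine fall to its clockwise successor.
        self.ring = [(pos, m) for pos, m in self.ring if m != machine]

ring = HashRing(["machine-0", "machine-1", "machine-2", "machine-3"])
owner = ring.lookup("user:42")
ring.remove("machine-3")
print(owner, ring.lookup("user:42"))  # unchanged unless machine-3 owned the key
```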

In practice, having a large cluster size is unnecessary and can be counterproductive. Instead of having a large number of physical machines, we can create logical machines, called virtual nodes. One physical machine hosts one or more virtual nodes. Virtual nodes also allow us to have heterogeneous machines in the cluster.

Let val = hash(key)%8

Find val on the hash ring and then walk clockwise to the nearest virtual node. Once the virtual node holding the key is determined, we can find the machine that hosts it.

In case one of the machines fails, say Machine 0, we need to re-distribute the keys held by VN0 and VN1 to other machines, and we need to do it uniformly so that there is no large skew in load. After removing VN0 and VN1 from the hash ring, all keys in VN0 will now be handled by VN3 and all keys in VN1 by VN6. So, all the keys handled by Machine 0 are now distributed between Machine 1 and Machine 3.
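A sketch of the same ring with virtual nodes (two per machine here, and names like machine-0#vn1 are purely illustrative): each physical machine is hashed onto the ring several times, so when it fails its keys are spread over several successors instead of just one.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32
VNODES_PER_MACHINE = 2

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

machines = ["machine-0", "machine-1", "machine-2", "machine-3"]

# Each physical machine hosts several virtual nodes at different ring positions.
ring = sorted(
    (ring_hash(f"{m}#vn{i}"), m)
    for m in machines
    for i in range(VNODES_PER_MACHINE)
)

def lookup(key: str) -> str:
    # Find the nearest virtual node clockwise, then return the machine hosting it.
    idx = bisect.bisect_left(ring, (ring_hash(key),))
    return ring[idx % len(ring)][1]

print(lookup("user:42"))
```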

The above method ensures minimal movement of data over the network and a nearly uniform final state of the cluster.
