Amazon Dynamo — a conceptual overview (Part 2)

4 min readJul 6, 2021

In Part 1 of this series, we saw the various techniques that Amazon Dynamo employs to achieve high availability for write requests. In this post, we will deep dive into the different failure handling techniques used by Dynamo.

Quorum Failure

In Part 1, we saw how we can configure the values of R and W to achieve different levels of consistency in the system.

To achieve strong consistency, R and W values are configured such that R+W>N, where N is the number of the replicas for a given partition. For example, let N = 5, R=3 and W=3. In this case, if there is a network partition causing 3 of the nodes to be unreachable, the system becomes unavailable for both reads and writes for the keys in that partition.

In an eventual consistency configuration, a reasonable configuration would be N=5, R=1 and W=3. In this mode, R+W≤N and hence there could be replicas with stale values. Even in this mode, writes can fail if a network partition causes 3 of the nodes to be unreachable.

Didn’t we emphatically described Amazon Dynamo as an always writeable store ?

It sure is ! If majority of the replicas are unreachable, the system seeks help from the other live nodes. It uses a technique called hinted handoffs.

Hinted Handoffs

If nodes holding some of the replicas are unreachable, then Dynamo asks other live nodes to hold the data on behalf of the unreachable ones.

Let us suppose there are N replicas and 1 of the replicas are unreachable. In this case, when write arrives at one of the N-1 live nodes, it persists in its local disk and sends off the data to the other N-2 live replica nodes. It also sends off this data to 1 of the live nodes in the cluster, which is not the assigned replica for the partition. Along with the data, a context is also sent. This context contains information about the actual unreachable node. Since this node is not the assigned replica for the partition, it writes this in a separate location in its disk and periodically checks if the actual node is back online. Once that node is back online, it transfers the replicated data and deletes it from its local disk. This technique is called hinted handoff.

Hinted Replica Failure

In the above example, Node 3 takes the hint and persists in its local temp storage. By the time Node 2 comes back online, Node 3 fails permanently. In this scenario, Node 2 replica will go out of sync with other replicas (Node 0 and Node 1) permanently.

Hence hinted handoff is not sufficient to handle such permanent failures. We would need a mechanism to synchronise the replicas periodically. One way to do it would be to exchange a hash value of the entire replica and in case of a mismatch, synchronise it by sending all the data from one node to another. The node which has the maximum number of writes (although it is not clear from the paper) can be the one who initiates the data transfer in case of a hash mismatch.

Since this involves a large amount of data to be transferred over network, this is not desired. Another strategy would be to do these round trips for every key in the replica. This is clearly expensive and not desired. Both of these strategies require a linear scan on all the keys in the replica to determine the mismatches.

Amazon Dynamo synchronises replicas by storing the hashes of each record in a data structure known as Merkle Trees.

Replica Synchronisation using Merkle Trees

In Merkle Trees, each node contains the cryptographic hash of its children. Leaf nodes in the tree corresponds to the key:values pairs in the replica.

k4:v4 is a mismatch and it is found in log(n) time complexity

Using this data structure, mismatches can be found in logarithmic time complexity and hence it requires lesser data transfers between nodes.

Key set held by a node changes when the number of nodes in the cluster changes. This would require a re-calculation of Merkle Trees in those nodes.

Amazon Dynamo — a conceptual overview (Part 2)

Quorum Failure

Hinted Handoffs

Hinted Replica Failure

Replica Synchronisation using Merkle Trees

References

Written by Rahul Pradeep