Amazon Dynamo — a conceptual overview (Part 1)

Rahul Pradeep
6 min read · Jul 5, 2021

I happened to read the Amazon Dynamo paper recently. I had read it before, but not nearly as intently. Its architecture brings multiple computer science concepts together and makes them work in tandem to achieve the desired goals of the system.

Ref: https://en.wikipedia.org/wiki/Amazon_DynamoDB

This will be a three part series of posts where each part will focus on a set of concepts used in the architecture of Amazon Dynamo.

  • Part 1 will focus on the challenges of building an always writeable store. We will discuss the concepts to overcome those challenges.
  • Part 2 will mainly focus on the failure handling techniques like hinted handoffs and replica synchronisation using Merkle Trees.
  • Part 3 will deep dive into the ring topology in Amazon Dynamo architecture.

An always writeable store!

Writes are sacrosanct in the world of an always writeable store. They should never fail. A typical e-commerce example the paper mentions is a shopping cart: any cart edit made by the user must surely happen. The potential revenue impact of a failure to do so can be huge at Amazon's scale.

To understand what an always writeable store means, let's look at a simple distributed data storage system where every record (key:value pair) is replicated across the cluster. This redundancy is mainly for fault tolerance, as it lets us recover from a node failure by reusing one of the replicas. Redundancy also improves read scalability: we can scatter read requests for a key across all the replicas instead of overloading just one. One of the replicas is designated the master and the others followers. The master replica is responsible for handling write requests, and the followers are updated either synchronously or asynchronously depending on the consistency mode we choose for the system.

With this architecture in mind, let us examine whether it is an 'always writeable store' by looking at some of the failure scenarios.

Master replica failure

Since a single node (the master) handles writes for a subset of records, it becomes a single point of failure. Failure of this node can make our system unavailable for write requests. We can surely do better!

We can promote one of the follower replicas to be the new master and bring up another follower replica to cover the loss. Is that enough? Not quite! All the clients need to be notified of this change so that they update their local mappings (record-to-node mappings) and send all future write requests to the new master instead of the failed one.

Till now, this system seems tolerant of node failures for write requests. But what about the requests in flight just moments before the master replica failed? Those will inevitably fail, and the client would need to retry and hope that, before all the retries are exhausted, a new fully functional master comes up and the client mappings are updated.

While such a system is fault tolerant towards node failures, there can still be some write request failures while the system is recovering. Retries could take a long time and can impact SLAs. This isn't an always writeable system yet!

Eliminating the master!

If there are no master replicas, who actually assumes the responsibility of handling write requests? The answer is — everyone!

This definitely invites more trouble. If multiple nodes handle writes, our data can become inconsistent pretty quickly. But this might not be as bad as you think, if we are able to resolve the conflicts efficiently!

Either the server or the client can take responsibility for resolving these conflicts. Amazon Dynamo chooses to do this on the client side. Clients have the flexibility to use any strategy for conflict resolution; it could be as simple as 'last write wins' or complex business logic.
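As a minimal sketch of client-side resolution (the function name and data shapes below are illustrative, not Dynamo's actual API), a 'last write wins' resolver could look like this:

```python
# Hypothetical client-side conflict resolution: given divergent versions of
# a record returned by a read, 'last write wins' keeps the version with the
# newest timestamp. A real client (e.g. a shopping cart) might instead merge
# the contents of all versions.
def resolve_last_write_wins(versions):
    """versions: list of (timestamp, value) pairs for the same key."""
    return max(versions, key=lambda v: v[0])[1]

versions = [(1625000000, {"items": ["book"]}),
            (1625000042, {"items": ["book", "pen"]})]
assert resolve_last_write_wins(versions) == {"items": ["book", "pen"]}
```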

Since every replica can handle writes, we will end up with multiple versions of the same record. We don't necessarily need to keep all versions around. If we can establish a causality relation between two versions of the same record, we can simply overwrite the ancestor with its descendant. The key to this is how we represent versions.

Vector Clocks

On a single machine, it is easy to establish causality relations if we use the machine's local time as the version. When we have multiple machines, as in a typical distributed system, it becomes difficult to establish a causal relation between two events because the local clocks might not be consistent with each other.

A vector clock is simply a vector of local clocks, with each machine in the cluster having its own local clock. Every source machine sends its own view of the vector clock at that point in time along with the message to the destination machine. When the destination machine receives this message, it updates its vector clock using the following algorithm.

for i in range(len(destination_vector_clock)):
    destination_vector_clock[i] = \
        max(destination_vector_clock[i], source_vector_clock[i])

# local clock of the destination machine
# corresponds to destination_vector_clock[k]
destination_vector_clock[k] = destination_vector_clock[k] + 1
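In runnable form, the update might look like this (a sketch, assuming fixed-size clocks with one slot per machine; k is the index of the destination machine's own slot):

```python
def receive(destination_vector_clock, source_vector_clock, k):
    """Update the destination's vector clock on receiving a message."""
    # Take the element-wise maximum of the two clocks ...
    for i in range(len(destination_vector_clock)):
        destination_vector_clock[i] = max(destination_vector_clock[i],
                                          source_vector_clock[i])
    # ... then tick the destination machine's own local clock,
    # since receiving the message is itself an event.
    destination_vector_clock[k] += 1
    return destination_vector_clock

assert receive([1, 2, 0], [0, 3, 1], k=0) == [2, 3, 1]
```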

Amazon Dynamo uses vector clocks to represent the versions of every record. If there is a causal relationship between two updates, then the older version is overridden by the later one. Otherwise, both versions are persisted.

Version change over time. Ref : https://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
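To make the causality check concrete, here is a small sketch (the clock values and helper name are illustrative, not from the paper):

```python
# Version a causally precedes version b iff every component of a is <= the
# corresponding component of b, and the clocks are not identical.
def happens_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

v1 = [1, 0, 0]   # first write, handled by node 0
v2 = [2, 0, 0]   # a later write that saw v1
v3 = [1, 0, 1]   # a concurrent write handled by node 2

assert happens_before(v1, v2)       # v1 is an ancestor of v2: safe to discard
assert not happens_before(v2, v3)   # neither precedes the other...
assert not happens_before(v3, v2)   # ...so both versions must be kept
```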

Quorum reads and writes

As we have seen above, to make the system highly available for writes, we need multiple nodes to participate in the write request. To offer sufficient redundancy, Amazon Dynamo allows us to configure the minimum number of nodes, W, that should accept a write request before it is deemed successful.

The same goes for read requests. Any read request is deemed successful only if at least R nodes return a valid response. If we set R+W>N, where N is the total number of replicas, it mimics a quorum-like system. This is because if R+W>N, at least one node is common between the read set and the write set. This gives us a strongly consistent system (CP).

R=2, W=2 and N=3
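One way to convince ourselves of the overlap argument is to enumerate every possible read set and write set (a brute-force sketch of the math, not how Dynamo works internally):

```python
from itertools import combinations

# A toy check of the quorum condition for the configuration above.
N, R, W = 3, 2, 2
assert R + W > N  # the strong-consistency condition

replicas = range(N)
# Every possible write set of size W intersects every read set of size R,
# so a read always contacts at least one replica holding the latest write.
overlap_always = all(
    set(w) & set(r)
    for w in combinations(replicas, W)
    for r in combinations(replicas, R)
)
assert overlap_always
```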

Tuneable consistency and durability semantics

Amazon Dynamo was built keeping in mind the need for flexibility. As per the CAP theorem, we can choose consistency or availability in the event of a network partition in a distributed system.

As we have seen in the above section we can make Amazon Dynamo behave as a strongly consistent system (CP) by configuring the values of R and W such that R+W>N.

Eventual Consistency

We can achieve eventual consistency if we keep a low value of R, say R=1, with R+W≤N. In this case, a read request for a key will go to one of the N replicas, but that replica could still hold a stale version of the key.
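A toy illustration of the staleness window (the version labels are made up):

```python
# With N=3, W=2, R=1 a successful write reaches only 2 of the 3 replicas.
N, W, R = 3, 2, 1
replicas = ["v2", "v2", "v1"]  # the third replica has not yet seen the write
assert R + W <= N              # the eventual-consistency configuration

# A read with R=1 contacts a single replica; if it happens to be the lagging
# one, the client observes the stale version v1 until the replicas converge.
stale_read_possible = any(v == "v1" for v in replicas)
assert stale_read_possible
```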

Durable writes

If W=1, a write returns as soon as it is persisted on one of the nodes. If that node goes down before the write is replicated to at least one of the remaining replicas, we lose that write. To make writes durable, we should have W>1.
