Thundering Herd and Stale Set Operations

Rahul Pradeep
Jun 22, 2021 · 4 min read


On the Internet, a huge amount of content is consumed every day. For global services like Google, Facebook, and Netflix, serving this data to end users seamlessly is a priority. To design a web service that satisfies these requirements, we need to retrieve data very fast at very high scale. In this post, we will look at how caching solves this problem and at the problems that come with it. We will dig deep into two of them: thundering herd and stale sets.

Demand-Filled Look-Aside Caches

Web services that serve data from a database often use a cache on the side to reduce load on their DBs. Data in a database is usually stored on disk and serves as the source of truth.

As the database is the source of truth, any modification to the data should be done directly in the DB. There is no escaping writes: the write load has to be handled by the DB itself. Can we do something about the read load?

Disk reads are typically slower than in-memory reads. Commodity disk hardware is made up of rotating platters, and reading involves positioning the head over the right track before reading blocks of data. Solid-state drives offer random access but are far more expensive. RAM also offers random access and is much faster still. Since RAM is expensive too, we cache only a copy of the hot data in it for faster access. What counts as hot data depends on your use case; typically, we want to cache the subset of data that is most likely to be read in the near future, and a reasonable way to approximate this is to simply cache recently read data.

Read requests in a web service can be made faster if we cache a subset of the data in RAM and always look into the cache before querying the DB. This is called look-aside caching. But what happens when the data changes?

On any modification of the data, we can simply delete it from the cache. The next read won't find it in RAM and will fetch it from the DB. Since we want subsequent reads for this data to stay out of the DB, that read can then fill the cache to benefit future reads. A cache that is filled on demand like this is called a demand-filled look-aside cache.
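Here is a minimal sketch of both paths in Python. Plain dicts stand in for the cache and the database, and the `user:1` key is made up for illustration; in production the cache would be something like memcached and the DB your source-of-truth store.

```python
# Minimal sketch of a demand-filled look-aside cache. Plain dicts stand
# in for the cache and the DB; these are illustrative stand-ins only.

db = {"user:1": "v1"}   # source of truth (hypothetical key and value)
cache = {}              # in-memory look-aside cache

def read(key):
    value = cache.get(key)       # 1. look aside: check RAM first
    if value is not None:
        return value             # cache hit: the DB sees no load
    value = db[key]              # 2. cache miss: fall back to the DB
    cache[key] = value           # 3. demand-fill for future reads
    return value

def write(key, value):
    db[key] = value              # all writes go to the source of truth
    cache.pop(key, None)         # invalidate; the next read re-fills
```

The first `read("user:1")` hits the DB and fills the cache; every later read is served from RAM until the next write deletes the entry.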

Caching is powerful, but demands great responsibility!

Permanent Inconsistency

When two concurrent clients observe a cache miss for the same key and both try to fill the cache, the cached data can become permanently inconsistent with the source-of-truth database.

The above diagram illustrates one scenario in which we end up with a permanently inconsistent cache. This scenario is called a stale set.
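To make the race concrete, here is one possible interleaving, continuing the dict-based sketch above (the values `v1` and `v2` are hypothetical):

```python
# Stale set: Client A's delayed demand-fill overwrites a newer delete.
write("user:1", "v1")           # initial state: DB holds v1, cache empty

# Client A starts a read, misses the cache, and reads from the DB...
a_value = db["user:1"]          # A sees v1

# ...but before A can fill the cache, Client B updates the data:
write("user:1", "v2")           # DB now holds v2; cache entry deleted

# Client A now completes its demand-fill with the old value:
cache["user:1"] = a_value       # cache holds v1, DB holds v2

assert read("user:1") == "v1"   # reads keep returning the stale v1
assert db["user:1"] == "v2"     # while the source of truth says v2
```

Nothing ever deletes the key again, so reads keep returning the stale value until the entry expires or is evicted.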

Thundering Herd

The above diagram also highlights another problem. Clients A and B both observe a cache miss and try to fill the cache with the latest data from the DB. In a highly concurrent system, many such clients could read the same key from the DB within a very short span of time. This creates a hot spot on that key and can bring down the database.

Leases

To fix the stale-set problem, we need a way to discard the stale set operation initiated by Client A.

To do this, note that every time a client (A) observes a cache miss for a key (k), our in-memory cache can be in one of two states:

  • Client A is the first client to see this cache miss. In this case, the cache grants a lease to Client A, telling it to go ahead, query the DB for the latest data, and set it in the cache.
  • Another client already holds an active lease for k. In this case, the cache still responds with a cache miss, but tells Client A to retry after a short interval.

This ensures that only one client goes to the database for a given key at a time, which solves the thundering herd problem.

But it still doesn't solve the stale-set problem. For that, the in-memory cache must be able to identify whether a set operation is stale.

In line with our demand-filled model, we delete a key from the cache whenever it is updated in the DB. When we delete the key, we also expire the active lease, if any, for that key. Now, if we modify the set operation to carry the client's lease token along with the payload, the cache is fully equipped to discard any set operation whose lease has expired, as sketched below.
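Here is a sketch of the whole protocol in Python, assuming a single-threaded cache server for simplicity (a real server would lock around each operation); `LeaseCache` and `read_through` are names invented for this example, not a real library API.

```python
import time
import uuid

class LeaseCache:
    """Look-aside cache that grants leases on misses (sketch)."""

    def __init__(self, retry_after=0.05):
        self.data = {}           # key -> cached value
        self.leases = {}         # key -> currently active lease token
        self.retry_after = retry_after

    def get(self, key):
        if key in self.data:
            return "hit", self.data[key]
        if key in self.leases:
            # A lease is already out: do NOT let this client hit the
            # DB too; ask it to retry shortly (fixes thundering herd).
            return "retry", None
        token = uuid.uuid4().hex  # first miss: grant a lease
        self.leases[key] = token
        return "lease", token

    def set(self, key, value, token):
        # Accept the fill only if it carries the active lease for this
        # key; a lease expired by a delete marks the set as stale.
        if self.leases.get(key) != token:
            return False          # stale set: silently discarded
        del self.leases[key]
        self.data[key] = value
        return True

    def delete(self, key):
        # Called on every DB write: drop the value AND expire any
        # outstanding lease, so in-flight fills of old data are rejected.
        self.data.pop(key, None)
        self.leases.pop(key, None)

def read_through(cache, db, key):
    """Client-side read path against a LeaseCache."""
    while True:
        status, payload = cache.get(key)
        if status == "hit":
            return payload
        if status == "retry":
            time.sleep(cache.retry_after)  # someone else is filling
            continue
        value = db[key]                    # we hold the lease: fill it
        cache.set(key, value, payload)     # rejected if a write raced us
        return value
```

In the earlier race, Client B's write calls `delete`, which expires Client A's lease; Client A's late `set` then fails the token check, and the stale value never lands in the cache.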
