SSTables and LSM Trees

Rahul Pradeep
7 min read · Jun 21, 2021


SSTables and LSM-Trees are techniques used to store and retrieve data efficiently. In this post, we will limit our discussion to data retrieval based on a primary key. Before diving deeper into these techniques, let us look at a simplified storage engine, where the keys are kept in memory and the values are stored on disk. We will examine the pros and cons of this engine and gradually work towards fixing its shortcomings by introducing SSTables and LSM-Trees.

There are two primary components to this storage engine: a hashTable and a log (an append-only file). Let's see how the various operations work in this database; a minimal sketch in code follows the list.

  • write(key, data) : The data is appended to the log at the latest offset, say offset1. The first four bytes of the record indicate the size of the data to be read. The key is then updated in the hashTable, with the hashTable value being the offset of the data in the log.
  • read(key) : The offset of the data in the log is found by a lookup in the hashTable. Once the offset is found, we seek to it in the file and start reading. Since the first four bytes indicate the size of the data, we know when to stop reading.
  • update(key, data) : Since we are using a log, we never update the data for an existing key in place; doing so could force us to rewrite all data from that offset onwards. Instead, we simply append the new data to the log, just like a write, and update the offset in our in-memory hashTable. The old data is still lying around at the old offset, but let's worry about that later.
  • delete(key) : Delete is trickier. We append a special record to the log, called a tombstone, indicating that the data for this key has been deleted, and we remove the entry for the key from the hashTable.
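
Here is a minimal sketch of such an engine, assuming each record is a 4-byte big-endian length prefix followed by the raw bytes. The LogStore class name and the tombstone marker are illustrative choices, not a real library.

# Minimal sketch of the hash-indexed, append-only log described above.
import os
import struct

TOMBSTONE = b"__tombstone__"

class LogStore:
    def __init__(self, path):
        self.index = {}                     # key -> byte offset of the latest record
        self.log = open(path, "ab+")        # append-only data file

    def write(self, key, data):
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        # First four bytes hold the record size, then the payload itself.
        self.log.write(struct.pack(">I", len(data)) + data)
        self.log.flush()
        self.index[key] = offset            # an update is just a pointer move

    def read(self, key):
        offset = self.index[key]            # KeyError if missing or deleted
        self.log.seek(offset)
        size, = struct.unpack(">I", self.log.read(4))
        return self.log.read(size)

    def delete(self, key):
        self.write(key, TOMBSTONE)          # append a tombstone record...
        del self.index[key]                 # ...and drop the key from the index

For example, store = LogStore("data.log") followed by store.write("user:1", b"alice") and store.read("user:1") exercises the write and read paths.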

So far, so good. All the operations run in constant time. What about the data growth in our log?

Log-structured Storage Segments

Data in our log is ever growing, since we are never really deleting anything, so we will soon run out of disk space. One way to keep this growth in check is to have a background thread compact the log file. This works, but we cannot serve any requests while the compaction is happening, which compromises the availability of our system.

To keep our system available for online requests, we can freeze the current log file as a snapshot and write a new, compacted copy of it that ignores old and deleted data. These snapshots are also called segments. While this is happening, a fresh log file and hashTable accept the online requests. How would our write and read requests change in this approach?

  • write(key, data) : Same as before, except that we append the data to the latest log file. The in-memory hashTable now has to store the log file path along with the offset for each key, as there can be multiple log files in the system at a time.
  • read(key) : Determine the log file path and the offset for the given key and read the data the same way we did in the previous approach.

To make sure this background cleanup happens frequently enough, we need to limit the segment size. We can create a new segment once a threshold size is reached, and do the compaction and merging periodically in a background thread.
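
A rough sketch of what that compaction could look like, reusing the record layout from the LogStore sketch above: only offsets still referenced by the in-memory hashTable are live, so we copy just those records into a fresh segment and build a new index for it. The function name and signature are assumptions for illustration.

# Copy only the live records (those the hashTable still points at) into a new
# segment; old and deleted data is simply left behind and can be discarded.
import struct

def compact(old_log_path, new_log_path, index):
    """`index` maps key -> offset in the old log; returns the index for the new log."""
    new_index = {}
    with open(old_log_path, "rb") as old, open(new_log_path, "wb") as new:
        for key, offset in index.items():
            old.seek(offset)
            size, = struct.unpack(">I", old.read(4))   # 4-byte size prefix, as before
            data = old.read(size)
            new_index[key] = new.tell()
            new.write(struct.pack(">I", size) + data)
    return new_index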

This approach seems to keep our data size on disk in check. The Achilles' heel of the design is now really the hashTable.

  • Fault-tolerance : In case of a node failure, we lose our hashTable. To fix this, we can persist the hashTable every time we cut a log segment. To recover from such a fault, we simply read the latest persisted hashTable and rebuild the in-memory one.
  • Large number of keys : At some point the keys will no longer fit in memory. Maintaining a hashTable on disk is possible, but complicated and not very performant.
  • Range queries : Scanning all keys in a given range requires us to look at every key, since the hashTable keeps keys in no particular order.

Sorted String Tables (SSTables)

Sorting is a powerful technique. Sorting all the keys allows us to reduce the size of our hashTable. Extending our previous idea of dividing our file into multiple segments, we do the same here, the only difference being that each file contains keys in sorted order.

Segment-0: ["aaabc":0, "aabc":980, "bbcd":1500]
Segment-1: ["bcd":2000, "ccc":2200, "ccd":2500]
Segment-2: ["cde":3000]

In this approach, we can limit the in-memory hashTable to contain only some of the keys. This is called a sparse index. For example, the sparse index would contain the following keys: "aaabc":0, "bcd":2000, "cde":3000. Each key in this index is the first key of a segment file. Such a segment file is called an SSTable. Any read operation for a key first needs to determine the segment where the key would lie, which can be done with a binary search on the sparse index. Once we know the segment, we simply read it sequentially from disk and return the data once we find the matching key.
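
As a sketch of this read path, assume each SSTable segment stores one key:value pair per line in sorted order (a simplification of a real binary format) and that the sparse index is a sorted list of (first key, segment path) pairs. The helper names here are hypothetical.

# Locate the candidate segment via binary search on the sparse index, then
# scan that segment sequentially until the key is found or passed.
import bisect

sparse_index = [("aaabc", "segment-0"), ("bcd", "segment-1"), ("cde", "segment-2")]

def find_segment(key, index=sparse_index):
    first_keys = [k for k, _ in index]
    i = bisect.bisect_right(first_keys, key) - 1     # last segment starting <= key
    return index[i][1] if i >= 0 else None

def sstable_read(key):
    segment = find_segment(key)
    if segment is None:
        return None
    with open(segment) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(":")
            if k == key:
                return v
            if k > key:                              # keys are sorted: we can stop early
                return None
    return None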

We still need to scan through all the keys in a segment, which can make reads slow if the segments are big enough. One possibility is to make the segments smaller, but then there would be far too many small segment files, which hurts the compaction and merging phase. If we make the segments larger, the sparse index becomes much smaller but the scans get longer. In practice, we can afford to spend a bit more memory on the index by decoupling the sparse index keys from the segment boundaries: keep one index entry for every X KB of data within a segment. Scanning through a few KB of data is fast enough.

Keeping it Sorted!

The million-dollar question is: how do we keep the keys sorted on disk efficiently? Keeping keys sorted in memory can be done with any balanced binary search tree implementation (AVL trees, red-black trees, etc.). All writes go into this in-memory structure until it reaches a certain size threshold, after which we flush it onto disk. This in-memory data structure is called a memTable.

Keeping it sorted in memory is easy, and flushing it periodically onto disk seems efficient. But how do we handle fault tolerance here? Any node failure would cause all un-flushed data to be lost forever. To prevent this, we can use a write-ahead log (WAL). Every write operation first appends a record to the log and is then applied to the memTable. WAL writes are very fast because they are sequential. We can recover from a fault by rebuilding the memTable from a scan of the WAL, and we can limit the WAL's size by rolling it periodically, since data that has already been flushed needn't be kept in it.
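
A minimal write-path sketch tying these pieces together: append to the WAL, apply to the memTable, and flush a sorted segment once a (deliberately tiny) threshold is crossed. A real memTable would be a balanced tree or skip list; a plain dict sorted at flush time keeps the sketch short, and the class name and JSON record format are assumptions.

import json

MEMTABLE_LIMIT = 4                       # flush threshold, tiny here for illustration

class MemTableStore:
    def __init__(self, wal_path):
        self.wal = open(wal_path, "a")   # write-ahead log: sequential appends only
        self.memtable = {}
        self.segment_id = 0

    def write(self, key, value):
        # 1. Durability first: append the operation to the WAL.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # 2. Then apply it to the in-memory table.
        self.memtable[key] = value
        # 3. Flush a sorted segment once the threshold is reached.
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        path = f"segment-{self.segment_id}"
        with open(path, "w") as seg:
            for k in sorted(self.memtable):          # written in key order -> an SSTable
                seg.write(f"{k}:{self.memtable[k]}\n")
        self.segment_id += 1
        self.memtable.clear()                        # the WAL can be rolled here too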

Every time a memTable is flushed onto disk, it creates a new segment file. Over time this can create many segment files. Again, borrowing ideas from our simplified storage engine, we can have a background thread to periodically do compaction and merging.
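
The merge step itself can look like a k-way merge over sorted inputs. Here is a sketch under the same assumed key:value-per-line segment format; when a key appears in several segments, the value from the newest segment wins.

import heapq

def merge_segments(segment_paths, output_path):
    """`segment_paths` are ordered from oldest to newest; writes one merged segment."""
    def records(path, age):
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(":")
                yield key, -age, value               # newer segment sorts first on ties

    streams = [records(p, i) for i, p in enumerate(segment_paths)]
    last_key = None
    with open(output_path, "w") as out:
        for key, _, value in heapq.merge(*streams):  # inputs are sorted, so this streams
            if key != last_key:                      # first hit comes from the newest segment
                out.write(f"{key}:{value}\n")
                last_key = key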

Log Structured Merge Trees (LSM Trees)

SSTables are periodically compacted and merged into larger SSTables. The idea is to keep a small number of SSTables; we will see how this is key to good read performance. This cascading list of SSTables is the key idea behind LSM Trees.

For each SSTable file we keep a sparse in-memory index, and the SSTable files can be ordered from oldest to newest. With this in mind, let us see how read and write operations work.

  • write(key, data) : Append the key:data record to the WAL and then write it into the memTable.
  • read(key) : First look in the memTable. If the key doesn't exist there, it could mean one of two things (see the sketch after this list):
      • The key was written or updated a long time ago and may be present in an older SSTable file. In this case, we go through the SSTables from newest to oldest, using each table's sparse index, and return the data as soon as we find the key.
      • The key was deleted. We can only conclude that a key is deleted after going through all the SSTables.
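
An illustrative version of that read path, with dicts standing in for the memTable and the on-disk SSTables (newest first) and a sentinel standing in for tombstones; this is a sketch of the lookup order, not a real engine.

TOMBSTONE = object()                      # sentinel marking a deleted key

def lsm_read(key, memtable, sstables):
    """`sstables` is a list of dict-like tables ordered from newest to oldest."""
    if key in memtable:
        value = memtable[key]
        return None if value is TOMBSTONE else value
    for table in sstables:                # fall back to the SSTables, newest first
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None                           # never written, or compacted away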

Performance of LSM Trees

As a write involves one sequential append to the WAL and an in-memory update to the memTable, LSM Trees can deliver very high write throughput.

For reads, we have to go through all the SSTables in the worst case. To get high read throughput, it is important to keep the number of SSTables low. We can do this with size-based compaction, in which newer, smaller SSTables are merged into older, larger ones.

For deleted (or absent) keys, we still have to scan all the SSTables just to determine that the key isn't there. One optimisation is a Bloom filter: an in-memory data structure that approximates the contents of a set. It can tell you with 100% certainty that a key is not present in the set, so a read can skip any SSTable whose filter rules the key out.
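
A toy Bloom filter to illustrate the idea; real engines use multiple independent hash functions and a carefully sized bit array, whereas this sketch just salts Python's built-in hash().

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash.
        return [hash((salt, key)) % self.size for salt in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False => definitely absent; True => possibly present (false positives allowed).
        return all(self.bits[pos] for pos in self._positions(key))

Keeping one such filter per SSTable lets a read skip every table whose filter says the key is definitely absent.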

Range Queries

One advantage of storing keys in sorted order is that it enables efficient range queries. For any range of keys [key_x, key_y], we can determine exactly which SSTables to look in, thus limiting the key space we have to scan.
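
A sketch of such a range scan, assuming the non-overlapping, sorted key:value-per-line segments from the earlier examples: binary-search the sparse index for the first candidate segment, then stream forward until the range is exhausted.

import bisect

def range_scan(key_x, key_y, segments):
    """`segments` is a list of (first_key, path) pairs sorted by first_key."""
    first_keys = [k for k, _ in segments]
    start = max(bisect.bisect_right(first_keys, key_x) - 1, 0)
    results = []
    for _, path in segments[start:]:
        with open(path) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(":")
                if k > key_y:                  # past the end of the range: done
                    return results
                if k >= key_x:
                    results.append((k, v))
    return results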

Conclusions

This approach to storage and retrieval is used in several database systems, such as Google BigTable, Apache HBase, and Apache Cassandra.
