Stream processing using Kafka and Spark
The simplest architecture for processing a stream of data from Kafka would probably involve a single process that reads one record at a time and processes it.
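As a rough illustration, here is a minimal sketch of such a single-process consumer loop written in Scala against the Kafka Java client; the broker address, topic name and process function are placeholders, not part of any real setup:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // hypothetical broker
props.put("group.id", "single-threaded-processor") // hypothetical consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("max.poll.records", "1")                 // force one record per read, for illustration

def process(value: String): Unit = println(s"processed: $value") // placeholder processing

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))           // hypothetical topic

while (true) {
  val records = consumer.poll(Duration.ofMillis(100))
  records.forEach(record => process(record.value()))              // read one, process one, move on
}
```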
In the above architecture, depending on the processing complexity, we can achieve a fairly low single-record processing latency. But our throughput would be limited since we only have a single process (thread) reading, processing and moving each record forward. To increase throughput, we would want to parallelise some of the workload. With Kafka, this becomes fairly easy because it has already partitioned the stream (topics) into multiple sub-streams (partitions). Now, if we augment our process to have multiple threads, with each thread consuming from a single partition, we could significantly increase throughput.
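A sketch of this per-partition threading model, again with placeholder names. Each thread owns its own KafkaConsumer (the Kafka consumer is not thread-safe) and is pinned to exactly one partition via assign:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val numPartitions = 4 // assumed partition count of the hypothetical "events" topic

val threads = (0 until numPartitions).map { partitionId =>
  new Thread(() => {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // One consumer per thread, assigned to exactly one partition of the topic.
    val consumer = new KafkaConsumer[String, String](props)
    consumer.assign(Collections.singletonList(new TopicPartition("events", partitionId)))

    while (true) {
      val records = consumer.poll(Duration.ofMillis(100))
      records.forEach(record => process(record.value())) // process() as in the earlier sketch
    }
  })
}
threads.foreach(_.start())
```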
This architecture would give us a low single-record processing latency with reasonably high throughput. True stream processing systems like Apache Storm have a similar architecture.
Apache Spark, on the other hand, optimises for throughput. To understand how we can achieve better throughput, let us first break down the cost of processing N independent records in a stream. Assuming a fixed processing cost of X per record and an IO (read) cost of Y per record, the total cost is N*X + N*Y.
Since all our records are independent, it is unlikely that we can reduce our processing cost. We can, however, try to reduce our IO cost by doing batch reads. Instead of reading N records one by one, we can read all of them at once. This reduces the IO cost by a factor of N, bringing the total cost to approximately N*X + Y. Let's see what we get if we introduce this idea in our earlier architecture.
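Continuing the earlier consumer sketch, batching simply means pulling many records per poll instead of one; the configuration values below are illustrative:

```scala
// Fetch records in larger batches: one round trip to the broker is amortised over many records.
props.put("max.poll.records", "500")  // up to 500 records per poll
props.put("fetch.min.bytes", "65536") // let the broker accumulate data before responding

val batch = consumer.poll(Duration.ofMillis(500))  // roughly one IO cost (Y) for the whole batch
batch.forEach(record => process(record.value()))   // still N processing costs (N*X)
```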
In this architecture, each of our processing threads (Processor X) consumes records in batches from a single partition (Partition X). There is a 1:1 mapping between processing threads and partitions of the stream. Apache Spark employs a similar architecture to consume streams of data from Kafka: every processing cycle processes a micro-batch of data from the stream.
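A minimal sketch of this using Spark Streaming's direct Kafka integration (spark-streaming-kafka-0-10); the broker, topic, group name and processing logic are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("kafka-micro-batch")
val ssc = new StreamingContext(conf, Seconds(10)) // one micro-batch every 10 seconds

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-stream-processor",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

def process(value: String): Unit = println(s"processed: $value") // placeholder processing

// Each Kafka partition maps to one Spark partition, and hence one processing task per micro-batch.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("events"), kafkaParams)
)

stream.foreachRDD { rdd =>
  rdd.foreach(record => process(record.value()))
}

ssc.start()
ssc.awaitTermination()
```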
Durability and Reliability
Once we have read the data, we need to make sure we don't lose it. This is critical to uphold at-least-once processing guarantees: in an error scenario, we need to be able to re-process the failed records. The techniques used to make data durable depend on the source of the data stream. There are two kinds of stream sources:
- Unreliable source: These sources do not implicitly guarantee durability. An example is a raw network stream, where we cannot read back data that was consumed earlier but failed processing.
- Reliable source: These sources implicitly guarantee durability. Apache Kafka is a good example of a reliable source. Kafka ensures records are not deleted just because they have been read; we can always go back and read older data (within the configured retention). Kafka also ensures that there is no data loss even if some of the Kafka nodes go down.
For an unreliable source, the processing engine needs to ensure that the data it reads is preserved, at least until it is completely processed. One way to do that is to make a local copy of the data on disk managed by the processing engine. In this case, the processing engine needs to make sure there is sufficient redundancy in the data across the cluster to guard against node failures in the processing engine.
There is a good benefit in keeping the processing engine simpler by offloading data storage to a separate reliable store. Apache Spark provides an option to use HDFS for this.
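For instance, with the receiver-based approach in Spark Streaming, received data can be written to a write-ahead log on a reliable filesystem before it is processed; the checkpoint path below is a placeholder:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("durable-receiver")
  // Persist received blocks to the write-ahead log in the checkpoint directory
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// Checkpoint directory on reliable storage such as HDFS (hypothetical path)
ssc.checkpoint("hdfs://namenode:8020/checkpoints/durable-receiver")
```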
For a reliable source, it is easier to ensure durability. The processing engine only needs to make sure that offsets / checkpoints (the last-read pointer in a stream of data) are recorded properly so that it can re-read and re-process data in case of failures. These offsets are much smaller than the actual data itself and can be stored in reliable storage like HDFS, ZooKeeper, etc. Kafka provides an API to manage these offsets for every consumer. Older versions of Kafka stored them in ZooKeeper; later versions use an internal topic to manage these offsets per consumer.
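With the plain Kafka consumer from the earlier sketches, recording the last processed position looks roughly like this (topic, partition and offset values are illustrative):

```scala
import java.util.Collections
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition

// After successfully processing records up to lastProcessedOffset on this partition,
// commit the *next* offset to read so that a restart resumes from the right place.
val partition = new TopicPartition("events", 0)
val lastProcessedOffset = 42L
consumer.commitSync(
  Collections.singletonMap(partition, new OffsetAndMetadata(lastProcessedOffset + 1))
)
```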
Apache Spark and Kafka
Let's see how all these ideas tie into the architecture of stream processing using Apache Spark and Apache Kafka.
Driver
The driver is responsible for figuring out which offset ranges to read for the current micro-batch. It then tells each executor which partitions and offset ranges it should care about.
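Conceptually, the driver's plan for one micro-batch is just a set of offset ranges, one per Kafka partition. The spark-streaming-kafka-0-10 integration models this with OffsetRange objects; the values here are made up for illustration:

```scala
import org.apache.spark.streaming.kafka010.OffsetRange

// One micro-batch's worth of work: for each topic-partition, the half-open
// range [fromOffset, untilOffset) that the corresponding executor task will read.
val plan = Array(
  OffsetRange("events", 0, 1000L, 1500L),
  OffsetRange("events", 1, 980L, 1490L)
)
```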
Executor
Once the micro-batch processing starts, each executor talks to its assigned partitions and consumes the offset ranges instructed by the driver. It reads all the data in its offset range in bulk and starts processing that chunk.
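Continuing the direct-stream sketch, the executor-side view is visible from within a micro-batch: each task handles exactly one Kafka partition and knows the offset range it is processing (this follows the pattern from Spark's Kafka integration guide):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges computed by the driver for this micro-batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { records =>
    // Runs on an executor; this task reads exactly one Kafka partition.
    val range: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    println(s"${range.topic} ${range.partition} ${range.fromOffset} -> ${range.untilOffset}")
    records.foreach(record => println(record.value())) // placeholder per-record processing
  }
}
```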
Offset Commits
Once the entire micro-batch is processed, the driver sends an offset commit request to Kafka. This ensures that Kafka remembers the offset up to which the consumer has already processed the stream. Kafka keeps this state in ZooKeeper (older versions) or in an internal offsets topic (newer versions), as noted earlier.
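With the direct stream, this commit is issued explicitly from the driver once the batch has been processed, using the commitAsync API from the Kafka integration (continuing the sketch above):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the micro-batch here ...

  // Commit only after processing succeeds, so a failure causes re-processing
  // (at-least-once) rather than data loss.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```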