Spark Streaming — the Good, the Bad and the Ugly

Rahul Pradeep
Oct 25, 2021 · 5 min read


Stream processing systems have become increasingly popular over the last decade and have found their use in a variety of real-time systems across the industry. There are mainly two ways to implement a stream processing system: continuous streaming and micro-batches.

Continuous streaming

The likes of Apache Storm and Apache Flink implement this approach. The idea is that processing happens at the granularity of a single event in the stream. As soon as an event appears in the stream, the system reads it, processes it and moves it forward. Single event processing latencies are very low in this approach.
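Conceptually (this is only an illustrative sketch, not actual Storm or Flink API), a continuous streaming worker boils down to a loop that handles one event at a time:

// Illustrative sketch: the per-event processing loop behind continuous
// streaming engines. Source, Sink and the process function are made up.
trait Source[E] { def next(): E }              // blocking read of the next event
trait Sink[R]   { def emit(result: R): Unit }  // forward a result downstream

def runWorker[E, R](source: Source[E], sink: Sink[R], process: E => R): Unit = {
  while (true) {
    val event = source.next()    // pick up a single event as soon as it arrives
    sink.emit(process(event))    // process it and move it forward immediately
  }
}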

Micro-batches

This approach was born out of fitting a batch processing system into the streaming world. The key idea is to form batches of a small number of events and process them the way a batch processing system would. The system waits a certain duration to collect a batch of events before it starts processing them.
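In Spark Structured Streaming, for example, this batching duration is expressed as a trigger. A minimal sketch using the built-in rate source (the interval below is arbitrary):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Collect incoming rows and process them as one micro-batch every 10 seconds.
val spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

val events = spark.readStream.format("rate").load()   // test source generating rows

events.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))       // micro-batch interval
  .start()
  .awaitTermination()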

The micro-batching approach is a natural evolution of a batch processing system. This extension is elegant and has a lot of positives going for it, but in certain scenarios it starts to show weaknesses. For the remainder of the post, we will gloss over the good bits (because they are not that interesting) and cover the bad bits (the interesting ones) in fair detail.

The Good

High throughput

In most real-life use cases, such as real-time data processing pipelines, micro-batch based stream processing is good enough since we don't really need millisecond latencies. What we want instead is high throughput, i.e. the maximum number of events passing through the system in a given time. But if we are building something like real-time chat delivery, implementing it on Apache Spark might not be the best idea.

Interoperability with batch processing code

With a few minor changes, you can turn your Spark stream processing workloads into batch processing workloads and vice versa. All the key features available in the Spark batch processing world are also available in Spark Streaming.
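A sketch of what this looks like in practice (the paths and column names below are made up): the same transformation function can be applied to a batch DataFrame and a streaming DataFrame.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// The same business logic, written once against the DataFrame API.
def countViewsPerProduct(events: DataFrame): DataFrame =
  events.groupBy(col("product_id")).count()

val spark = SparkSession.builder.appName("interop-demo").getOrCreate()

// Batch: read a static dataset and write the result out once.
val batchCounts = countViewsPerProduct(spark.read.parquet("/data/product_views"))
batchCounts.write.mode("overwrite").parquet("/data/view_counts")

// Streaming: apply the same function to a stream of the same shape.
val streamingViews = spark.readStream.format("rate").load()
  .withColumn("product_id", col("value") % 100)        // fake product ids
countViewsPerProduct(streamingViews)
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()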

The Bad and the Ugly

While the drawbacks mentioned below are based specifically on my experience with Spark Streaming, most of them should hold equally true for any micro-batch based streaming system.

Starvation while consuming from multiple stream sources

Spark Streaming allows you to write streaming applications that consume from multiple streaming sources. For example, a streaming application that generates real-time recommendations can listen to order events and product-view events. While this can seem like a good thing, it has a bunch of negatives that are hard to ignore. To understand them, let's take a step back and look at how a Spark streaming application behaves when multiple streams are involved.
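As a rough sketch (the Kafka topic names and broker address here are made up), such a two-source application reads both streams inside the same job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("recommendations").getOrCreate()

// Stream 1: product-view events
val productViewEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "product_view_events")
  .load()

// Stream 2: order events
val orderEvents = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "order_events")
  .load()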

Every micro-batch contains events from all the configured sources, and the batch completes only once all the jobs operating on it have completed. So if the computation on events from one stream is more expensive than on the other, the throughput of the system with respect to the computationally cheaper stream suffers.

To work around this, we would have to split the application into two, each consuming from one stream. This too has limitations:

  • Splitting the application based on stream sources can only be done if there are no interactions (for example, joins or shared state) between the streams.
  • There is the additional overhead of managing and maintaining two applications instead of one, even though the business logic is very similar.

Plethora of jobs scheduled for one micro-batch

In Spark, every action results in a job being scheduled for the micro-batch. Suppose your streaming application does the following:

// pseudo code for generating popular products and user recommendations

// read events from the stream sources
product_view_events = sparkContext.readStream('product_view');
order_events = sparkContext.readStream('order');

// group product view events for a user
grouped_product_view_events = product_view_events.map(_business_logic_function_).groupBy(user_id);

// group product view events for a product id
pid_grouped_product_view_events = product_view_events.map(_business_logic_function_).groupBy(product_id);
// update the daily view count per product (popular products)
pid_grouped_product_view_events.foreach(_update_view_count_per_day_);

// group order events for a user
grouped_order_events = order_events.map(_business_logic_function_).groupBy(user_id);

// update recommendations for a user in the database
grouped_product_view_events.foreach(_update_user_recommendations_);
grouped_order_events.foreach(_update_user_recommendations_);

The above application will result in three jobs being scheduled, one for every action (each call to foreach).

By default, Spark schedules these jobs sequentially, which means that if the job computing popular products takes a lot of time, user recommendations are delayed. The overall time taken for a single micro-batch to complete is the sum of the time taken by all three jobs. The total number of cores needed by the application is the maximum required by any of the jobs, but since the jobs execute sequentially, the jobs that require fewer cores bring down the utilisation of the total core allocation. This isn't very efficient.
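To make that concrete with some made-up numbers: suppose the three jobs need 8, 4 and 2 cores and run for 6, 3 and 3 seconds respectively. The application has to be allocated 8 cores (the maximum), and the micro-batch takes 6 + 3 + 3 = 12 seconds, but only 8×6 + 4×3 + 2×3 = 66 of the 8×12 = 96 available core-seconds are actually used, roughly 69% utilisation.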

There are ways to fix this, but each comes with its own challenges and drawbacks. Some of them are:

Running jobs within a micro-batch concurrently

There is a setting in Spark that allows us to schedule and execute the jobs within a micro-batch concurrently. Because of the higher concurrency, we would need to allocate more cores to see a real benefit.
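For the DStream-based API, the knob in question is presumably the undocumented spark.streaming.concurrentJobs setting, often combined with the FAIR scheduler. A rough sketch:

import org.apache.spark.SparkConf

// Allow up to 3 streaming jobs to run concurrently instead of one at a time,
// and use the FAIR scheduler so they share executors more evenly.
// spark.streaming.concurrentJobs is an undocumented setting; test its
// behaviour carefully before relying on it.
val conf = new SparkConf()
  .setAppName("recommendations")
  .set("spark.streaming.concurrentJobs", "3")
  .set("spark.scheduler.mode", "FAIR")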

In our previous example, there are two jobs which originate from ‘product_view’ events. Running them concurrently would mean that there are multiple reads from the stream source and most of the compute is duplicated. In a sequential execution mode, we can eliminate this concern by caching intermediate results.
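In the sequential case, caching that shared intermediate result might look roughly like this (the DStream, types and update functions below are hypothetical placeholders):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

case class ProductView(userId: String, productId: String)        // placeholder type
def businessLogic(e: ProductView): ProductView = e                // placeholder
def updateViewCountPerDay(rdd: RDD[ProductView]): Unit = ()       // placeholder
def updateUserRecommendations(rdd: RDD[ProductView]): Unit = ()   // placeholder

def wire(productViewStream: DStream[ProductView]): Unit = {
  val enriched = productViewStream.map(businessLogic _)  // shared computation
  enriched.cache()                                        // persist each micro-batch's RDD

  enriched.foreachRDD(rdd => updateViewCountPerDay(rdd))      // job 1 reuses the cache
  enriched.foreachRDD(rdd => updateUserRecommendations(rdd))  // job 2 reuses the cache
}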

Breaking jobs into multiple applications

Breaking the application up based on the number of jobs adds the additional overhead of managing the code and deployment for each of the resulting applications. This becomes particularly painful when the multiplexing factor between a stream source and its dependent jobs is high.

Similar to concurrent job executions, this also has the issue of compute duplication.

No HA during application upgrades

When you deal with real-time applications, you cannot compromise on availability. In the case of a simple web service, where many machines run the same application code, any deployment strategy supporting high availability keeps some machines up and running at all times, which ensures a smooth upgrade of the application.

In a real-time streaming application based on Spark Streaming, this is not really possible out of the box. This is because there is one component, the Spark driver, that is responsible for determining and scheduling the next micro-batch. The driver is a single point of failure, so any downtime of the driver causes downtime of the application. The Spark master limits the damage by supervising the driver: any driver failure is noticed by the master, which restarts it.
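With the standalone cluster manager, this supervision comes from submitting the driver in cluster mode with spark-submit's supervise flag; for the restarted driver to pick up where it left off, the application also needs to recover its state from a checkpoint. A rough sketch (the checkpoint path and app name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Checkpoints must live on fault-tolerant storage such as HDFS or S3.
val checkpointDir = "hdfs:///checkpoints/recommendations"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recommendations")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... define sources and transformations here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// On a clean start this builds a fresh context; after a driver restart it
// rebuilds the context, and any pending micro-batches, from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()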

Light at the end of the tunnel

Some of the drawbacks of enabling concurrent job execution, like the higher resource demand and the inability to rely on caching, are a necessary evil. If the re-compute cost is fairly small and you have enough capacity for the higher resource demand, then simply enabling concurrent job execution in Spark Streaming can improve your application's overall performance.

The continuous streaming approach doesn't have a single driver responsible for cutting a micro-batch. This means that every node or thread in your distributed continuous streaming application independently consumes and processes events. This approach seems to eliminate some of the concerns we have with micro-batch based streaming. Fortunately, Apache Spark has an experimental continuous processing mode, which does look like a more suitable approach for real-time streaming applications.
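In Structured Streaming this mode is selected with a continuous trigger. A minimal sketch (the interval is arbitrary, and only a limited set of sources, sinks and operations is supported in this mode):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

spark.readStream.format("rate").load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))   // checkpoint interval, not a batch interval
  .start()
  .awaitTermination()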
