Capacity estimation for compute heavy web applications
Estimating the number of cores, the amount of memory, and the disk your application needs is pivotal to ensuring optimal resource usage and minimising cost. To understand how capacity estimation can be done, we first need to understand the kind of application we are dealing with. Broadly, an application is either compute heavy or I/O heavy. In this post, I will cover the key things to keep in mind while estimating capacity for compute heavy applications.
Compute Heavy Applications
Web applications that primarily perform expensive computations with minimal or no I/O calls are classified as compute heavy. For example: given a user and their shopping cart, compute the total cart value; or given a user and a set of recommended products, score and rank them using an ML model.
In such applications, the most important physical resource is the CPU core. So, how do you come up with the number of CPU cores required for your application? Well, we need to look at the requirements.
Let’s say we need to build an e-commerce mobile app which needs to display a ranked list of recommended products to the user on its homepage. We are particularly interested in estimating capacity for the Ranking Service, which takes a user and a set of products they recently interacted with as input, and outputs a ranked list produced by scoring the products with an ML model. Assuming the ML model is held in memory and no extra I/O is required, this is a compute heavy service. Let’s answer a few basic questions to understand the scale of the system.
What is the maximum number of concurrent users accessing the homepage? Around 100K users at peak.
What is the expected latency of a single homepage render for an optimal user experience? Latency should be 500 ms or better.
As per the requirements, 100K users will be opening the homepage concurrently, so our application should be able to support 100K concurrency. To enable this, we need 100K application threads in our service, where a single application thread independently handles one user at peak load. But how do application threads equate to physical cores?
In a compute heavy application, all the work happens on the CPU core. This means everything a single application thread does in our example happens on a CPU core. So, can we allocate 100K physical cores to our application? Of course we can! Would it be optimal? Well, that depends on the compute workload. Let’s answer a few more questions.
What is the on-CPU time for the ranking computation? In the worst case, around 300 ms of CPU time.
Since our computation takes at most 300 ms of on-CPU time, we still have around 200 ms to spare, as our target homepage render latency is around 500 ms. So, if we provision 100K physical cores, each core will handle a single user request and the response will be returned in 300 ms. Since we have an extra 200 ms in hand, can we use a smaller number of physical cores by trading off latency? Of course we can! So, how many physical cores can we save? Here is where it gets tricky.
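Before getting into the tricky part, a back-of-the-envelope sketch is useful. If each request occupies a core for 300 ms out of a 500 ms latency budget, a single core can time-share more than one request per window, which gives a theoretical lower bound on cores. This ignores context-switch overhead entirely, so treat it as a floor, not a target:

```python
import math

def min_cores(concurrency: int, on_cpu_ms: float, target_latency_ms: float) -> int:
    # Each request needs on_cpu_ms of CPU within a target_latency_ms
    # window, so one core can serve target_latency_ms / on_cpu_ms
    # requests per window. Divide concurrency by that and round up.
    return math.ceil(concurrency * on_cpu_ms / target_latency_ms)

print(min_cores(100_000, 300, 500))  # → 60000
```

In other words, no fewer than 60K cores can meet the 500 ms target for 100K concurrent requests; the real number will be higher once scheduling overheads are accounted for.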
Concurrency != Parallelism
We often confuse concurrency with parallelism. In our Ranking Service example, if we had allocated 100K cores to support 100K concurrency, we would have used a parallelism of 100K, by virtue of 100K cores, to support a concurrency of 100K. While this is totally acceptable, it may not be optimal.
To understand this better, let’s refresh our memory on how threads are scheduled on a CPU. In a typical OS, threads are given time-slices of the CPU in a round-robin manner. Every time a thread is taken off the CPU, its context (operating data, instructions, etc.) is saved so that it can be restored the next time the thread is scheduled. This is called a thread context switch. There is a cost associated with a context switch, and it often depends on the context itself.
Enough OS theory! So, how many cores should the Ranking Service use? The answer: the number of cores at which the off-CPU time of a thread (including context-switch time) does not exceed 200 ms within a 500 ms window. That sounds complicated! An easier way to determine the number is to perform a load test: iterate on the number of cores until your max latency is 500 ms and the supported concurrency is 100K. More often than not, this will be the limit of your system, and CPU utilisation will be very high. Let’s say we performed the load test and the optimal number of cores came out to be 70K. So, we were able to support a concurrency of 100K using a parallelism of 70K.
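The iteration itself can be structured as a binary search over core counts, since p99 latency only gets worse as cores shrink. Here is a minimal sketch; `run_load_test` is a hypothetical hook standing in for your real load-testing harness, and `fake_load_test` is a toy stand-in whose latency cliff sits at 70K cores purely for illustration:

```python
def find_min_cores(run_load_test, low, high, target_p99_ms=500.0):
    # Binary search for the smallest core count whose load-tested
    # p99 latency still meets the target. run_load_test(cores) is a
    # hypothetical hook that drives peak concurrency against a
    # deployment with that many cores and returns observed p99 (ms).
    while low < high:
        mid = (low + high) // 2
        if run_load_test(mid) <= target_p99_ms:
            high = mid          # target met: try fewer cores
        else:
            low = mid + 1       # too slow: need more cores
    return low

def fake_load_test(cores):
    # Toy model: latency blows up once cores drop below 70K.
    return 300.0 if cores >= 70_000 else 600.0

print(find_min_cores(fake_load_test, 60_000, 100_000))  # → 70000
```

Each probe here is a full load-test run, so in practice the search is worth it: log2 of the core range is far fewer runs than stepping linearly.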
In our Ranking Service example, based on the load test, we can support a peak concurrency of 100K with 70K physical cores at a max (or 99th-percentile) latency of 500 ms. But what if the peak load happens for only 1 hour a day, and for the remaining 23 hours concurrency never exceeds 50K? That means for 23 hours a day most of our cores will be idle, which is never a good thing since we will be paying for that compute unnecessarily.
If we know the times of day when peak load occurs, we can spawn additional cores just before the peak hours and destroy them afterwards. This can even be done on demand when extra load is detected. Such auto-scaling and on-demand scaling capabilities are standard in pretty much every cloud service provider these days. This way, we can further decrease the overall physical-core footprint of our application.
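The savings from scheduled scaling are easy to quantify. Assuming, purely for illustration, that 50K off-peak concurrency scales roughly linearly down to 35K cores (a number a real deployment would get from another load test, not from this guess):

```python
PEAK_CORES = 70_000        # from the load test at 100K concurrency
OFF_PEAK_CORES = 35_000    # assumed: 50K concurrency, rough linear scaling
PEAK_HOURS, OFF_PEAK_HOURS = 1, 23

# Daily core-hours: flat provisioning vs scaling down off-peak.
flat = PEAK_CORES * (PEAK_HOURS + OFF_PEAK_HOURS)
scaled = PEAK_CORES * PEAK_HOURS + OFF_PEAK_CORES * OFF_PEAK_HOURS

print(flat, scaled)                     # → 1680000 875000
print(f"{1 - scaled / flat:.0%} saved")
```

Under these assumptions, scaling down for 23 hours cuts daily core-hours nearly in half, which is why knowing your traffic shape matters as much as knowing your peak.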