Concurrency control and Linux's Pressure Stall Information
I did some performance analysis recently where I needed to download and process a bunch of data. It was easily parallelizable, since the data set was made up of many independent items. I was mostly interested in the average time to process each individual item, so the degree of parallelization wasn’t too interesting in itself: I wanted some parallelism, but I didn’t want to overload my test machine.
One of the other topics I’ve been interested in recently is concurrency control, and how we can apply it more effectively within my team’s services. We would like to allow our servers to do as much work as possible without overloading them. Overloading a server can make everything slow down; even cheap, fast calls get crazy slow.
In Stop Rate Limiting, Jon Moore describes a method of concurrency control using a dynamically determined limit. The method works like this: given some limit to the number of jobs, when you go to schedule a job beyond the limit, you check to see if the system is overloaded. If it isn’t, you add a constant to the limit and allow the new request. On the other hand, when the system is overloaded, you reduce the limit by dividing by a constant. This is known as Additive Increase/Multiplicative Decrease, and is used in algorithms like TCP congestion control.
My performance problem provided a scenario where I could try AIMD. My jobs are all assumed to be similar in size, so the limit is the maximum number of concurrent jobs. (If the jobs weren’t consistently sized, I might estimate their weight and use a sum of the weights as the limit.) When the system wasn’t overloaded, I increased the limit by 1, at most once per second. When it was overloaded, I multiplied the limit by a factor of 0.9, again at most once per second.
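Those rules can be sketched as a small controller. This is a hypothetical sketch in Python, not the actual code I ran: the `AimdLimit` name and the `is_overloaded` callback are my own, standing in for whatever overload check you plug in.

```python
import time

class AimdLimit:
    """Additive-increase/multiplicative-decrease concurrency limit.

    is_overloaded is any callable returning True when the system is
    under pressure (for example, a PSI check).
    """

    def __init__(self, is_overloaded, initial=1, increase=1,
                 decrease_factor=0.9, min_limit=1, interval=1.0):
        self.is_overloaded = is_overloaded
        self.limit = float(initial)
        self.increase = increase
        self.decrease_factor = decrease_factor
        self.min_limit = min_limit
        self.interval = interval            # adjust at most once per interval
        self._last_adjust = float("-inf")   # allow the first adjustment

    def adjust(self):
        """Re-evaluate the limit, at most once per interval."""
        now = time.monotonic()
        if now - self._last_adjust < self.interval:
            return
        self._last_adjust = now
        if self.is_overloaded():
            # Multiplicative decrease: cut the limit by a constant factor.
            self.limit = max(self.min_limit, self.limit * self.decrease_factor)
        else:
            # Additive increase: grow the limit by a constant.
            self.limit += self.increase

    def has_capacity(self, running):
        """Whether a new job may start, given `running` jobs in flight."""
        self.adjust()
        return running < self.limit
```

A scheduler loop would call `has_capacity(running)` before launching each new job and simply wait when it returns `False`.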
I needed a way to check whether the system was overloaded. An obvious choice would be load average, but I’d recently learned about Linux’s Pressure Stall Information (PSI), so I thought I would try using it. PSI exposes three counters, one each for CPU, memory, and I/O, and each tells you how often some or all tasks are blocked waiting for that resource. I looked at PSI values for some of our other servers and set up three ranges for each resource: overloaded, acceptable, and underloaded, measured as the percent of time that some tasks were stalled waiting for the resource. The maximum healthy values I used were 10% for CPU, 1% for memory, and 1% for I/O.
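Reading PSI is just parsing a small text file per resource. Here’s a rough sketch of how such a check might look; the `/proc/pressure/{cpu,memory,io}` files and their `some avg10=…` format are real kernel interfaces, but the Python, the function names, and the use of the 10-second average are my own choices.

```python
def parse_psi(text):
    """Return the 'some' avg10 stall percentage from PSI file contents.

    Each /proc/pressure/* file has lines like:
        some avg10=1.23 avg60=0.50 avg300=0.10 total=123456
    """
    for line in text.splitlines():
        if line.startswith("some"):
            fields = dict(kv.split("=") for kv in line.split()[1:])
            return float(fields["avg10"])
    raise ValueError("no 'some' line in PSI data")

def read_psi(resource):
    """Read the 'some' avg10 value for 'cpu', 'memory', or 'io'."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())

# Max healthy "some" percentages from the post: 10% CPU, 1% memory, 1% I/O.
THRESHOLDS = {"cpu": 10.0, "memory": 1.0, "io": 1.0}

def overloaded():
    """True if any resource's stall percentage exceeds its threshold."""
    return any(read_psi(r) > limit for r, limit in THRESHOLDS.items())
```

An `overloaded` function like this is exactly the shape of check an AIMD limiter needs to decide between increasing and decreasing.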
The end result is in this gist.
So, how did it go?
I inadvertently ended up running two different types of jobs. One was more compute heavy, and one was more balanced between compute and network transfer. The more balanced version downloaded archives, reassembled them, and then ran a couple of consistency checks across the data. The compute heavy version did mostly the same thing, except that it ended up running the consistency checks many times per job. Another way to look at it: the balanced job did O(N) downloads and O(N) scans; the compute heavy one did O(N) downloads and O(N²) scans.
For the balanced job, the concurrency control worked pretty well. I ran this job a few times. Each time, utilization was just about right: work got done quickly and the system remained responsive. The number of concurrent workers stayed at 50 to 70 for the duration, CPU usage was around 80%, load was 73 to 120 (on a 128-core system), and I/O utilization peaked around 40%. Overall, I was happy with how it turned out.
For the compute heavy job, the concurrency control didn’t work as well. I ran this job once. The number of concurrent jobs was nearer 140, load was also about 140, and CPU usage was pegged at 100%. The system was overloaded badly enough that I interrupted it and ran it differently, so that the network-to-compute ratio was closer to 1:1 again.
If I were to do this again, or for production use, I would add a step to the compute-heavy job that checks in with the controller to see whether it’s ok to do the next computational step. That would let the limiter claw back some of the work if it accidentally allowed too much through. I would also want to tune the overload detector quite a bit; I need to learn more about how to interpret the PSI values.