A long time ago, an engineer told me that you can categorize all optimization problems as either CPU bound or I/O bound. While I rarely find myself dealing with optimization problems in general, I have wondered whether these broad categories could help reveal potential bottlenecks before they arise. Each type of optimization problem requires a unique set of tools and analysis techniques to solve. If one category of problem is far more common than the other, perhaps we can hone our skills and tooling to ensure we are prepared to address issues when they do arise.
After some research, two things became clear to me:
- When building web apps, CPU is almost never the bottleneck
- I am hardly the first person to come to this conclusion
In fact, it would seem the sentiment that I/O causes the most latency is so widespread that some authors go to great lengths to defend the practice of CPU optimization even in a world dominated by I/O bottlenecks. Nevertheless, I would like to outline why optimization problems stemming from I/O seem so rampant to me.
Why is I/O a Bottleneck?
First off, I think it's important to remember Moore and his contribution to optimization problems everywhere. While Moore's law hasn't held true for many years at this point, CPU speeds do continue to increase. This creates a situation where CPU bound optimization problems can sometimes solve themselves thanks to ever increasing hardware capabilities.
Second, I/O comes from so many places. Part of the problem with I/O bottlenecks is that they work their way into many levels of the software stack: decoding an incoming HTTP request, making database calls, keeping relevant data in memory to ensure timely indexing. Each step requires I/O of some sort, and this is just for a fairly simple request. More broadly, I think this is indicative of a problem in the premise itself. The work which the CPU does is very well defined. I/O, on the other hand, just means communication between two devices, and those can be devices on a single board, in a single machine, or across a network.
Designing Around I/O Bottlenecks
So what can we do to optimize for I/O bottlenecks, since they seem to be so pervasive? To start, it's worth identifying the different types of I/O in our application to understand where it's worth investing time to add telemetry. For most applications, most I/O latency will come from calls to a dependency like a database. If that's the case, generating metrics for each database query's duration can be valuable, along with monitoring the database's slow query log. Memory access is another place latency can stem from. In general, I disable swap or keep it to an absolute minimum on the machines I run, to prevent unexpected disk I/O when accessing application memory under high load. In any case, memory allocation and usage statistics are often worth collecting before you run into a critical latency spike.
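As a concrete illustration, here is a minimal sketch of query-level telemetry in Kotlin over plain JDBC. The connection string, the `orders` table, and the `recordMetric` sink are all hypothetical stand-ins; in a real service you would wire the timing into whatever metrics library you already use.

```kotlin
import java.sql.DriverManager
import kotlin.system.measureNanoTime

// Hypothetical metrics sink; swap in Micrometer, StatsD, or whatever
// telemetry stack is already in place.
fun recordMetric(name: String, millis: Double) {
    println("$name: %.2f ms".format(millis))
}

fun main() {
    // Connection details and schema are placeholders for illustration.
    DriverManager.getConnection("jdbc:postgresql://localhost/appdb", "app", "secret").use { conn ->
        conn.prepareStatement("SELECT id, status FROM orders WHERE status = ?").use { stmt ->
            stmt.setString(1, "PENDING")
            // Time the query *and* the row fetch: drivers may stream results,
            // so consuming the ResultSet is part of the I/O cost.
            val elapsedNanos = measureNanoTime {
                stmt.executeQuery().use { rs ->
                    while (rs.next()) { /* consume rows */ }
                }
            }
            recordMetric("db.query.orders_by_status", elapsedNanos / 1_000_000.0)
        }
    }
}
```

Tagging the metric per query, as above, is what makes the slow query log actionable: the log tells you which statement is slow, and the per-query metric tells you how often it actually hurts.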
Regardless of the type of I/O in question, I think the simplest solution is likely the most impactful: use less I/O wherever possible, and remember that CPU cycles are cheap. In practice, I've found this boils down to sending as little data over the wire as possible. JSON is highly readable, but even after compression it's likely more data over the wire than protobuf or a simple CSV. Database-side filtering versus application-side filtering is always a tradeoff, but if more complex database filtering can prevent excess records being sent over the wire, the benefits are likely worthwhile. In a recent project, I worked to optimize a computation engine which operated on subsets of data from a database. Increasing the size of the request payload when loading this data allowed me to select only the values actually required for computation. This tradeoff yielded substantial latency improvements, as the time taken to load data before processing was dramatically reduced.

Limiting I/O applies within a single machine as well. Disk reads are well known to be slow, making them an easy target for optimization, but holding less data in memory is also an effective way to reduce run times. A number of small machines running in parallel will often outpace an equally sized large machine due to improved data locality and CPU cache hit rates.
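To make the filtering tradeoff concrete, here is a hedged sketch in Kotlin comparing the two approaches over JDBC. The `sensor_readings` table, its columns, and the `status = 'VALID'` predicate are invented for illustration, not taken from the project described above.

```kotlin
import java.sql.Connection

// Both functions return the same data to the caller, but the second pushes
// filtering and column selection to the database, cutting wire traffic.

fun loadReadingsNaive(conn: Connection): List<Pair<Long, Double>> =
    conn.prepareStatement("SELECT * FROM sensor_readings").use { stmt ->
        stmt.executeQuery().use { rs ->
            buildList {
                while (rs.next()) {
                    // Application-side filtering: every row and every column
                    // crosses the wire before we decide what to keep.
                    if (rs.getString("status") == "VALID") {
                        add(rs.getLong("id") to rs.getDouble("value"))
                    }
                }
            }
        }
    }

fun loadReadingsLean(conn: Connection): List<Pair<Long, Double>> =
    conn.prepareStatement(
        // Database-side filtering plus column projection: only the rows we
        // need, and only the two columns the computation actually uses.
        "SELECT id, value FROM sensor_readings WHERE status = 'VALID'"
    ).use { stmt ->
        stmt.executeQuery().use { rs ->
            buildList {
                while (rs.next()) add(rs.getLong("id") to rs.getDouble("value"))
            }
        }
    }
```

The lean version transfers only the matching rows and the two columns the computation needs; for wide tables with selective predicates, that difference tends to dominate the query's wall-clock time.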
Another approach to reducing the latency incurred by I/O is to leverage multi-threading, preloading, and asynchronous I/O. Languages like Kotlin and JavaScript have impressive capabilities for dealing with I/O asynchronously. If the language supports it, I have found that layering async I/O on top of multi-threaded parallel execution can significantly improve overall performance and throughput. When possible, I also begin the process of loading data over the wire before it's strictly needed. Preloading data which will be required for a significant number of operations can push the I/O latency into the startup phase of the application, where it's less critical.
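Here is a minimal sketch of both ideas using Kotlin coroutines. The `fetchChunk` function is a hypothetical stand-in for a real network or database call, with `delay` simulating its round-trip latency.

```kotlin
import kotlinx.coroutines.*

// Hypothetical remote fetch; the 100 ms delay stands in for real I/O latency.
suspend fun fetchChunk(id: Int): List<Int> {
    delay(100)
    return List(10) { id * 10 + it }
}

fun main() = runBlocking {
    // Preloading: kick off a fetch for data we know we'll need at startup,
    // instead of paying the latency on the first request that uses it.
    val preloaded: Deferred<List<Int>> = async { fetchChunk(0) }

    // Async fan-out: the ten fetches overlap rather than running one after
    // another, so the total wait is roughly one round trip instead of ten.
    val chunks = (1..10).map { id -> async { fetchChunk(id) } }.awaitAll()

    println("preloaded=${preloaded.await().size}, fetched=${chunks.sumOf { it.size }}")
}
```

Because every `async` call starts its I/O immediately, the preloaded chunk is already in flight before anything awaits it, and the fan-out completes in roughly the time of the slowest single fetch.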
Conclusion
While the conclusions in this post might not be revolutionary, I find they come in handy more often than expected. Distributed systems are hard to build, and understanding tradeoffs is an art. The more litmus tests and intuition we can apply to the design process, the easier it becomes to focus on the root of the problem. Along those lines, I very often optimize to reduce I/O where possible, even if this requires somewhat more complex designs.