Dealing with operations is a challenging element of any live service. Nobody enjoys being chained to their phone and laptop at all hours of the day, and understandably, engineers often dread being on call. Nevertheless, operational readiness and health are hallmarks of any high-functioning team. Teams try many avenues to get there: operational review meetings, multiple concurrent on-calls, planned operational improvement sprint work, and so on. Having worked on different teams and at different companies, I’ve begun to wonder if the core challenge being addressed is simpler than it seems: the on-call is overloaded. In my experience, engineers with too much on their plate often end up getting less done, not more. Much like the Skylab 4 astronauts during the 1973 “strike in space”, software engineers require free time to do their best work. In this post I hope to outline how on-calls become overloaded, and the benefits that arise when on-calls have fewer tasks on their plate.
Operational work comes in waves
Over years of on-call shifts, I’ve noticed that operational load tends to come in waves. These waves follow drastic changes to either the service or the company as a whole. When large new features are released or optimizations rolled out, the risk of unexpected errors increases along with them. This can make for periods of high load for the on-call or customer support engineers on the team. On the flip side, periods of heavy design or planning can create lulls where the system generally looks after itself. During these quieter periods, it’s tempting for managers and lead engineers to increase the workload that on-calls are responsible for. In a perfect world, this added workload would mirror the cycles of operational demand. In practice, I’ve watched those added responsibilities create extremely hectic on-call shifts when demand picks up again.
Unlike normal sprint work, on-call work is rarely prioritized and requires frequent context switching. This combination creates an environment where operational work is lower quality than expected and often gets only half completed. Ironically, this can make the operational health of the team worse, as on-call engineers attempt to fix active issues while also completing tasks like running standup, collecting metrics, and pushing forward company-wide operations initiatives. Under these conditions, alarms don’t get tuned and bugs don’t get properly root-caused. The solution is to ensure that the on-call always has enough reserved bandwidth to make the operational improvements required to fix issues correctly the first time.
Space to improve the service
Building on the issues that arise when on-call engineers are overworked, I think it’s important to consider the natural motivations of high-quality engineers. Engineers who are passionate about the systems they work on have a natural urge to improve the world around them. When you give these engineers the time to explore and propose operational improvements, the whole team benefits. About a year ago, an engineer on my team expressed a deep interest in building smoke tests for our service. They had identified that we often saw customer issues caused by changes made by other teams, and an automated periodic system test could help us catch these issues. I approved the idea and we released our first automated smoke test about a week and a half later. Reflecting on this experience, I realized that the timing of this suggestion coincided with a major overhaul of alarm tuning we had recently completed. With fewer alarms creating noise in the system, on-call engineers finally had the space to identify and fix weaknesses that had previously gone unconsidered. If we had not taken steps to reduce the load on the on-call, I doubt we would have undertaken an unplanned operational improvement that has since proved invaluable.
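To make the idea concrete, here is a minimal sketch of the kind of periodic smoke test I have in mind. The endpoint URLs, latency budget, and scheduling approach are all hypothetical illustrations, not our actual implementation:

```python
import time
import requests  # any HTTP client works; requests is used here for brevity

# Hypothetical endpoints that exercise the critical customer path end to end.
SMOKE_CHECKS = [
    ("login", "https://example.internal/api/health/login"),
    ("data-load", "https://example.internal/api/health/data-load"),
]

LATENCY_BUDGET_SECONDS = 2.0  # illustrative threshold, not a real SLO


def run_smoke_tests() -> list[str]:
    """Hit each critical endpoint and report anything failing or slow."""
    failures = []
    for name, url in SMOKE_CHECKS:
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=10)
            elapsed = time.monotonic() - start
            if response.status_code != 200:
                failures.append(f"{name}: HTTP {response.status_code}")
            elif elapsed > LATENCY_BUDGET_SECONDS:
                failures.append(f"{name}: slow response ({elapsed:.1f}s)")
        except requests.RequestException as exc:
            failures.append(f"{name}: request failed ({exc})")
    return failures


if __name__ == "__main__":
    # In practice this would run on a schedule (cron, CI, etc.) and notify
    # the on-call only when failures accumulate across runs.
    problems = run_smoke_tests()
    if problems:
        print("Smoke test failures:", problems)
```

The value isn’t in the specific checks; it’s that the test runs continuously and catches breakage introduced by other teams before customers report it.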
Alert responsiveness
In the past, I worked closely with a team that managed a massive fleet of database instances supporting active reads and writes for the majority of the company’s data. As expected, this team had a slew of alarms and metrics used to capture the current health of the system. I suspect many of these alarms had been created when the company was much smaller; as the amount of data stored grew, the number of alarms increased but the effort put into tuning them did not. As a result, the alarms were notoriously unreliable and noisy. Then, on an otherwise normal Tuesday, support staff started reaching out to the team about a flood of customer reports. While the application itself was still working fine and internal users reported no issues, external customers were seeing a significantly higher rate of data load failures. The root cause was eventually traced to a database reader instance that had become overloaded and locked up.
The obvious question on everyone’s minds was “how did we not get an alert for this issue?” The answer, of course, was that we had. Because of the sheer number of alarms set on various database health metrics, the on-call engineer had received an alert, assumed it was noise, and moved on to their other on-call work. This cemented a key lesson for me about service monitoring and alerting: too many alarms is as risky as too few. A key job of any on-call engineer should be identifying alarms that are no longer working well and adjusting them. If a database appears to be overloaded on a regular basis, maybe the metric being tracked isn’t a good one. I remember reading a post about system resources years ago which said “You can’t take it [CPU load] with you. You can’t save up low CPU load and spend it later on.” High CPU load isn’t necessarily a bad thing if the system is still responsive.
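As a rough sketch of what that tuning might look like, the check below pages on responsiveness (sustained latency or query failures) rather than raw CPU. The metric names, thresholds, and sample shape are hypothetical, chosen only to illustrate the idea:

```python
from dataclasses import dataclass


@dataclass
class ReaderSample:
    """One monitoring sample from a database reader (hypothetical shape)."""
    cpu_percent: float
    p99_query_latency_ms: float
    failed_queries_per_min: int


def should_page(window: list[ReaderSample]) -> bool:
    """Page only when the reader is actually unresponsive, not merely busy.

    CPU alone is a poor signal: a reader pinned at 90% CPU that still answers
    queries quickly is healthy. Sustained high latency or query failures are
    what customers actually feel.
    """
    if not window:
        return False
    sustained_slow = all(s.p99_query_latency_ms > 500 for s in window)
    sustained_errors = all(s.failed_queries_per_min > 10 for s in window)
    return sustained_slow or sustained_errors


# Example: high CPU but healthy latency and no errors -> no page.
window = [ReaderSample(cpu_percent=92, p99_query_latency_ms=40,
                       failed_queries_per_min=0)] * 5
assert should_page(window) is False
```

Requiring the condition to hold across an entire window of samples is one simple way to keep a single noisy data point from waking anyone up.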
Conclusions
While working through my thoughts and experiences with on-call, I found myself circling back to the same simple conclusion: the engineers actually handling the on-call shifts are the best equipped to identify shortcomings and address them. Empowering on-call engineers to change alarms, tweak systems, or propose new monitoring approaches is the fastest way to improve overall operational health. When engineering leaders or managers insert themselves into the discussion too heavily, operational health begins to decline. This is not to say that providing guidance and feedback isn’t helpful; I have witnessed first hand how impactful a principal-level engineer joining the on-call rotation can be. But I continue to believe that the best way to maintain operational health is to allow the on-call to be bored some of the time.