Sleep Friendly Alerting

This article was originally published on the Last9 blog.

When Things Break at 3AM

If you are a Site Reliability Engineer (SRE), we are sure you have that one email inbox or channel that is continuously bombarded with alert notifications, where the unread count runs into the thousands! However, you dare NOT snooze this channel due to FOMOA (fear of missing out on an alert 😅). “What if I miss that one important alert?” you think.

And we can all agree that this approach is not feasible - alerts do get missed, things break, and 3 am war rooms are not uncommon!

Today, we want to talk about a common aspect of an SRE’s life - ALERT FATIGUE. The term, also known as alarm fatigue or pager fatigue, refers to the situation wherein one is exposed to numerous, frequent alarms and consequently becomes desensitized to them.

In simpler words, alert fatigue happens due to the overwhelming number of alerts received. After prolonged exposure to high alert volume, the engineer naturally pays less attention. This desensitization can lead to delayed responses, missed critical warnings, or even system failures. It reminds us of the famous “The Boy Who Cried Wolf” story - your alerting application being the boy who continuously screams about the “wolf” attacking your system.

It is also interesting to note that this is not just a software-world phenomenon - several fatal incidents have occurred in the medical and aviation industries due to alert fatigue (beyond the scope of this article, but there is one striking example here if you are interested).

A Day in the Life of an SRE

With the emergence of cloud and digital transformation, monitoring metrics and alerting notifications have multiplied many times over. System alerts, deployment mails, tickets, logging, codebase popups, environment bug alarms, etc. - everything competes for your focus. Moreover, things stretch your multitasking mind a little more if you’re also scrolling through messages or social media on your phone.

Here comes the urge to turn off, or stay numb to, low-priority alerts like regular system updates or repetitive deployment emails. These can be categorized as non-critical alerts with no call to action required. But imagine a day when a major system blocker hits because of a security breach - that is exactly the alert you cannot afford to have muted.

Understandably, turning off the alerts altogether or deliberately ignoring them makes the whole idea of alerting meaningless. Still, the burnout from tons of notifications across apps like Teams, Slack, Skype, Outlook, etc., gradually numbs you to them, and it is perfectly natural to switch off the noise or purposely ignore it once in a while. But there has to be a balance between silencing alerts and not missing the potentially critical ones. Outside of work, a good example is Apple’s iOS Focus mode, which sends fewer alerts during your focus time and notifies your contacts when they try to reach you. If you want to avoid missing your favorite social media notification throughout the day, tweak the settings.

Along similar lines, regular infrastructure monitoring and threshold tweaks are necessary for creating the right amount of “noise” - the balance of silence and signal that lets you stay productive while still surfacing what is legitimate.

How to Have a Perfect Balance of Alerts

We already know that DevOps engineers and SREs wear multiple hats and are exposed to alerts and context switching 24x7. Even acknowledging an open ticket, which takes as little as a few seconds, can dramatically break your focus and dent your productivity.

Low-priority alerts can be duplicates, irrelevant, or correlated with one another. There is, therefore, a need to intelligently separate the genuine alerts from false positives and negatives. Grouping correlated alerts into a single event diagnostic, instead of sending a stream of near-identical emails, also keeps your inbox from flooding (a rough sketch of such grouping follows the list below). While it is often good practice for the user to define rules and workflows to reduce mail clutter and message pings, our existing tools have room for improvement and should do better. Essentially, alerts should have the following features:

  1. Be as limited in quantity as possible - you don’t need multiple alerts for the same failure point

  2. Be actionable - an intelligent alert that doesn’t just state the problem but gives additional information (including but not limited to the cascading impact of the alert, a possible resolution, a quick link to the dashboard for additional details, etc.)

  3. Be directed to the right person - an alert should be categorized to alert the relevant point of contact and not add fatigue to the entire team.

  4. Be sent in advance of failure - the most effective alert comes before the failure, giving teams time to fix before S**T hits the fan!
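To make the first and third points concrete, here is a minimal, purely illustrative Python sketch of how raw alerts might be collapsed by failure point and routed to a single owner instead of the whole team. The field names, services, and routing table are assumptions for the example, not any particular tool’s schema.

```python
from collections import defaultdict

# Hypothetical routing table: which person owns which service (illustrative only).
ROUTES = {
    "checkout-api": "alice@example.com",
    "payments-db": "bob@example.com",
}

def group_and_route(alerts):
    """Collapse duplicate alerts per failure point and pick one recipient each.

    `alerts` is a list of dicts with (assumed) keys: 'service', 'error', 'timestamp'.
    Returns one summary per (service, error) pair instead of one email per event.
    """
    grouped = defaultdict(list)
    for alert in alerts:
        # Alerts about the same service + error are treated as one failure point.
        grouped[(alert["service"], alert["error"])].append(alert)

    summaries = []
    for (service, error), events in grouped.items():
        summaries.append({
            "service": service,
            "error": error,
            "count": len(events),  # how many raw events were collapsed
            "first_seen": min(e["timestamp"] for e in events),
            "notify": ROUTES.get(service, "oncall@example.com"),  # fall back to on-call
        })
    return summaries

# Example: three raw events become a single summary routed to one person.
raw = [
    {"service": "checkout-api", "error": "HTTP 500", "timestamp": 1},
    {"service": "checkout-api", "error": "HTTP 500", "timestamp": 2},
    {"service": "checkout-api", "error": "HTTP 500", "timestamp": 3},
]
print(group_and_route(raw))
```

The point of the sketch is simply that one failure point should produce one notification, addressed to one accountable person.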

Over and above the features mentioned, we have also noticed that engineers tend to stress out even when alerts are few. The silence sends the overthinking mind into a spiral - “is the system not breaking, or is the alerting system itself broken?”

Therefore, the first step is to trust your system, alerting tool, and the workflows you have set up. The next step, of course, is to choose the right tool, and only then can you go ahead and enjoy a good night’s sleep without the 3 am interruptions!

How can Last9 help?

Last9’s SLO Manager tool allows you to set and manage SLOs for your services seamlessly. It’s a simple 3-step process:

Source metrics from the multiple silos where your data lives → define your SLIs and SLOs → connect through a webhook with your favorite alerting system (Slack, PagerDuty, Opsgenie, email, etc.).
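As a rough illustration of that last step, here is a tiny Python sketch of a webhook receiver that forwards an incoming alert to a Slack incoming webhook. The payload keys and the webhook URL are made up for the example; they are not Last9’s actual schema or endpoints.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical Slack incoming-webhook URL; replace with your own.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the alert payload posted by the SLO/alerting tool.
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length) or b"{}")

        # The keys below ('slo', 'service', 'status') are assumptions for
        # illustration, not a documented payload format.
        text = (f"SLO alert: {alert.get('slo', 'unknown SLO')} on "
                f"{alert.get('service', 'unknown service')} is "
                f"{alert.get('status', 'breaching')}")

        # Slack incoming webhooks accept a JSON body with a 'text' field.
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhookHandler).serve_forever()
```

In practice most alerting systems (PagerDuty, Opsgenie, Slack) accept webhooks directly, so a hand-rolled receiver like this is only needed when you want custom formatting or routing in between.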

From then on, you get the most manageable system of alerts you have ever experienced. We (cheekily) call it sleep-friendly-alerting, primarily because it grew out of our own pain point of those midnight interruptions and fires! Here’s how it works:

  1. We understand what is “normal” for your system: Last9 analyzes your historical metrics to learn the expected levels of important system metrics (throughput, availability, latency, etc.)

  2. Notice when “normal” is about to break: Before an outage reaches its peak, we can detect that your system’s normal levels are shifting and that things may break. This allows us to send you alerts for services “under threat” (a toy sketch of this idea follows the list).

  3. Send intelligent alerts that are actionable: Our alerts now know when your system is about to break. However, we go one step further and tell you the potential cascading effects of a particular failure and point you instantly to the breaking SLO and service.
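The sketch below is not Last9’s actual algorithm; it is a toy Python illustration of the general idea behind the first two steps: learn a baseline from historical values of a metric and flag the service as “under threat” when recent samples drift several standard deviations away from it. The metric, values, and threshold are all assumptions for the example.

```python
from statistics import mean, stdev

def is_under_threat(history, recent, sigma=3.0):
    """Toy baseline check: flag drift before a hard failure.

    `history` is a list of past values of a metric (e.g., p99 latency in ms),
    `recent` is a list of the latest samples. Returns True when the recent
    average strays more than `sigma` standard deviations from the baseline.
    """
    baseline = mean(history)
    spread = stdev(history) or 1e-9  # avoid dividing by zero on a flat history
    drift = abs(mean(recent) - baseline) / spread
    return drift > sigma

# Example: latency has hovered around 200 ms; the latest samples creep towards 400 ms.
history = [195, 202, 198, 205, 199, 201, 203, 197, 200, 204]
recent = [310, 355, 390]
print(is_under_threat(history, recent))  # True: alert before the outage peaks
```

A real system would use far richer models and seasonality-aware baselines, but the principle is the same: the alert fires on the deviation from “normal,” not only after the hard failure.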

But let's accept one thing before you leave: your systems will go wrong. Surprising side effects will emerge. You want to find them as fast and as early as possible. And if you are the on-call engineer at 3 AM and see an alert from Last9, you can hold off on cursing us and be sure that it’s not just another cry of “wolf”!