

The value your application provides to your customers depends on your application being available, functional, and reliable. To keep it that way, you need alerts that fire when your application fails, or is about to fail, on one or more of those three fronts. The key to writing good alerts is understanding the goals of your application. Folks in the SRE (Site Reliability Engineering) discipline often refer to these well-defined goals as SLOs, or service level objectives. Therefore, step 1 for writing good alerts is writing good SLOs, step 2 is defining and gathering metrics, and step 3 is writing the alerts themselves. As we go through the following exercise, keep in mind that writing too many alerts, alerts that are not actionable, and alerts that highlight non-customer-impacting failures can be a net negative for you and your engineering team. For the sake of example, consider this simple payment processing application as we walk through the three steps.

Implementing SLOs
SLOs are used to prioritize engineering work by calculating the impact on the error budget, which is the rate at which SLOs can be missed. You will need to elicit feedback from both internal and customer stakeholders to ensure the SLOs are measurable, achievable, valuable, and have a clear error budget. For example, an SLO of 100% uptime for the whole payments system is not achievable and does not have a good value proposition. However, an SLO for the payment system to process 95% of transactions within 2 seconds over a 24-hour aggregation window is measurable, reasonable, sets a clear expectation for the customer, and provides value. To be more specific, only the upper path in the diagram that flows through the Merchant Service needs to meet this SLO. The lower path through the Report Service could allow for much slower responses and a larger error budget.
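To make the error budget concrete, here is a quick back-of-the-envelope calculation for the 95% / 24-hour objective; the daily transaction volume is a made-up number purely for illustration:

slo_target = 0.95             # fraction of transactions that must complete within 2 seconds
daily_transactions = 100_000  # hypothetical volume, for illustration only

error_budget_ratio = 1 - slo_target                          # 5% of transactions may miss the objective
error_budget_count = daily_transactions * error_budget_ratio

print(f"Error budget: {error_budget_ratio:.0%} of traffic, "
      f"or {error_budget_count:.0f} transactions per 24-hour window")
# Error budget: 5% of traffic, or 5000 transactions per 24-hour window

Planned maintenance, risky deploys, and genuine incidents all draw from that same budget, which is why it is such a useful tool for prioritizing engineering work.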
Gathering Metrics
However, both paths need a translation from business terms to technical terms in order to be ingested as a metric that alerts can be written against; this is where SLIs (service level indicators) come in. In this case a simple SLI, successful HTTP requests / total HTTP requests, provides a valuable availability metric: request success rate.
sum( count_over_time({job="payment-system"} |= "transaction_status" |= "success" [24h]) ) / sum( count_over_time({job="payment-system"} |= "transaction_id" [24h]) ) * 100
That, coupled with an SLI for request duration (latency), gives us the two metrics we need to know whether we are meeting our SLO.
sum( count_over_time({job="payment-system"} |= "transaction_processing_time <= 2s" [24h]) ) / sum( count_over_time({job="payment-system"} |= "transaction_id" [24h]) ) * 100
We won’t dive into the implementation details of exporting these metrics from the application, but there are plenty of SDKs, utilities, and services that provide this functionality. Once these two metrics are flowing into your observability tool of choice, you can graph availability and latency over time.
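Purely as a sketch of what the application side could look like, assume the payment service writes one structured log line per transaction and a log agent ships those lines to the {job="payment-system"} stream the queries above read from. The field names, the charge() helper, and the log format below are illustrative assumptions, not a prescribed schema:

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("payment-system")

def charge(transaction_id, amount):
    # Stand-in for the real call into the Merchant Service (hypothetical).
    pass

def process_payment(transaction_id, amount):
    start = time.monotonic()
    try:
        charge(transaction_id, amount)
        status = "success"
    except Exception:
        status = "failure"
    elapsed = time.monotonic() - start
    # One line per transaction; the queries above filter on these substrings,
    # including the literal "transaction_processing_time <= 2s" marker.
    latency_marker = "<= 2s" if elapsed <= 2.0 else "> 2s"
    logger.info("transaction_id=%s transaction_status=%s transaction_processing_time %s",
                transaction_id, status, latency_marker)

process_payment("txn-1001", 42.00)
# transaction_id=txn-1001 transaction_status=success transaction_processing_time <= 2s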
Writing Alerts
The alert for our “the payment system to process 95% of transactions within 2 seconds over a 24-hour aggregation window” SLO is simply a matter of taking the metric queries above and applying a threshold.
( sum( count_over_time({job="payment-system"} |= "transaction_status" |= "success" [24h]) ) / sum( count_over_time({job="payment-system"} |= "transaction_id" [24h]) ) ) * 100 < 95
and
( sum( count_over_time({job="payment-system"} |= "transaction_processing_time <= 2s" [24h]) ) / sum( count_over_time({job="payment-system"} |= "transaction_id" [24h]) ) ) * 100 < 95
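If your observability stack evaluates these expressions as alert rules, that is all you need. Purely as an illustrative sketch of the query-plus-threshold idea, the same check could also be driven by a small script against the log store's instant query API; this assumes the queries above run against Grafana Loki (whose LogQL syntax they use), that Loki is reachable at the placeholder URL below, and that polling by hand is only a stand-in for a real alert rule:

import requests

LOKI_URL = "http://loki.example.internal:3100"   # placeholder address
SLO_TARGET = 95.0

SUCCESS_RATE_QUERY = (
    '( sum( count_over_time({job="payment-system"} |= "transaction_status" |= "success" [24h]) )'
    ' / sum( count_over_time({job="payment-system"} |= "transaction_id" [24h]) ) ) * 100'
)

def current_success_rate():
    # Loki's instant metric query endpoint returns a vector of samples.
    resp = requests.get(f"{LOKI_URL}/loki/api/v1/query", params={"query": SUCCESS_RATE_QUERY})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else None

rate = current_success_rate()
if rate is None or rate < SLO_TARGET:
    print(f"ALERT: payment success rate {rate} is below the {SLO_TARGET}% SLO")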
More Considerations
The SLOs, SLIs, and alerts above cover the system at a high level, and at a 95% target that might be sufficient. However, what if the target is 99% or 99.99999%? Suddenly the speed at which the alert fires, the engineer logs in, the root cause is identified, and a fix is implemented and deployed becomes key. Being able to pinpoint the root cause using metrics is valuable and brings us back to the SLI drawing board. What lower-level component metrics might give us a more granular view of where the failure occurred? Some ideas:
DNS request success rate
API Gateway routing success rate
Load balancer forwarding success rate
Merchant service resource usage
Merchant service restarts
External bank connection status
Merchant cluster resource usage
Database connection success rate/load
Database slow query occurrence rate
Once all these metrics are flowing into your observability system, you can write alerts that give a more granular picture of where the failure is occurring. For example, if your database connection success rate drops sharply in a short period, you might look into whether your connection pool is sized correctly or whether the database has enough resources assigned. There are hundreds of potential metrics and alerts that could be written, so be selective and focus on what provides the most value for your engineers and system to meet the requirements laid out by your customer. For a deeper dive on this topic we highly recommend Google’s SRE Workbook.
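As one sketch of what that finer-grained instrumentation could look like, the Merchant Service might log every database connection attempt in the same structured style as the transaction logs above; the field names and the pool.connect() call are illustrative assumptions:

import logging
import time

logger = logging.getLogger("payment-system")

def get_db_connection(pool):
    start = time.monotonic()
    try:
        conn = pool.connect()   # hypothetical connection pool API
        logger.info("db_connection_status=success db_connection_time=%.3fs",
                    time.monotonic() - start)
        return conn
    except Exception as exc:
        logger.warning("db_connection_status=failure db_connection_error=%r", exc)
        raise

From lines like these you can derive a database connection success rate SLI and alert on sudden drops, in exactly the same way as the transaction success rate above.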
Conclusion
Implementing effective alerts for your application's reliability and performance is crucial and begins with well-defined Service Level Objectives (SLOs). By setting measurable, achievable, and valuable SLOs, you can prioritize engineering work and ensure your application meets customer expectations. Gathering relevant metrics through Service Level Indicators (SLIs) like request success rates and latency is essential for monitoring these SLOs. Writing alerts based on these metrics helps in identifying and addressing potential issues promptly, thereby maintaining the application's availability and functionality. Ultimately, a balanced approach to SLOs, SLIs, and alerts, coupled with the ability to drill down into granular component metrics, is key to ensuring the reliability and performance of your application.