---
name: Alert Tuning
description: Reduces alert noise by making alerts actionable, symptom-based, and tied to SLOs. Identifies and kills flappy, duplicate, and cause-based alerts. Use when alert fatigue is degrading on-call response quality.
---
# Alert Tuning
Every alert that fires without requiring human action trains engineers to ignore alerts. Alert fatigue causes real incidents to be missed. The goal: every page wakes someone up for a reason they cannot automate away.
## The Audit — Run First
Before writing new alert rules, audit what you have.
1. Pull the last 30 days of alert history. Calculate: total pages, pages outside business hours, pages that resulted in no action taken.
2. Identify flappy alerts: any alert that fires and resolves more than 3 times in a 24-hour window without a corresponding incident.
3. Identify cause-based alerts: anything that fires on a system-internal metric (queue depth, pod restart count, CPU) without a direct user-impact correlation.
4. Calculate the actionability rate: (pages with a documented remediation / total pages). Below 80% means your alert set is broken.
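The four audit steps can be sketched as a small script over exported alert history. This is a minimal illustration, not a prescribed tool: the `Page` record and its fields (`action_taken`, `has_runbook`, `incident_id`) are hypothetical names, each record is assumed to represent one fire-and-resolve cycle, and business hours are assumed to be 09:00 to 17:00, Monday to Friday, in the timestamps' own timezone.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Page:
    """One fire-and-resolve cycle of an alert (hypothetical export schema)."""
    alert_name: str
    fired_at: datetime
    action_taken: bool                 # did the responder have to do anything?
    has_runbook: bool                  # is there a documented remediation?
    incident_id: Optional[str] = None  # linked incident, if any


def is_outside_business_hours(ts: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    # Assumption: weekends and anything outside 09:00-17:00 count as off-hours.
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def audit(pages: list[Page]) -> dict:
    """Steps 1 and 4: page counts and the actionability rate."""
    total = len(pages)
    documented = sum(1 for p in pages if p.has_runbook)
    return {
        "total_pages": total,
        "off_hours_pages": sum(1 for p in pages if is_outside_business_hours(p.fired_at)),
        "no_action_pages": sum(1 for p in pages if not p.action_taken),
        "actionability_rate": documented / total if total else 0.0,
    }


def flappy_alerts(pages: list[Page], max_fires: int = 3,
                  window: timedelta = timedelta(hours=24)) -> set[str]:
    """Step 2: alerts that fired more than max_fires times inside any
    24-hour window, counting only fires with no linked incident."""
    fires = defaultdict(list)
    for p in pages:
        if p.incident_id is None:
            fires[p.alert_name].append(p.fired_at)

    flappy = set()
    for name, times in fires.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Slide the window start forward until it spans less than 24 hours.
            while times[end] - times[start] >= window:
                start += 1
            if end - start + 1 > max_fires:
                flappy.add(name)
                break
    return flappy
```

An `actionability_rate` below `0.8` from `audit()` corresponds to the "broken" threshold in step 4. Step 3 (cause-based alerts) is left out because it depends on knowing which metric each alert watches, which this hypothetical schema does not carry.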
## Symptom vs Cause
Page on symptoms, not causes. Symptoms are what users experience. Causes are what engineers investigate.
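A symptom-based paging decision can be sketched as a comparison between observed tail latency and the SLO threshold. This is only an illustration: the function names, the nearest-rank percentile method, and the idea of passing raw latency samples in milliseconds are assumptions, and a real setup would normally express this as a rule in the monitoring system rather than in application code.

```python
def p99(latencies_ms: list[float]) -> float:
    """99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_page(latencies_ms: list[float], slo_threshold_ms: float) -> bool:
    """Page only when users are experiencing latency beyond the SLO."""
    return p99(latencies_ms) > slo_threshold_ms
```

Note that the check never looks at CPU, restarts, or queue depth: those stay available for investigation once the page has fired.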
- Wrong: alert on CPU above 80%.
- Right: alert on p99 latency above SLO threshold.
- Wrong: alert on pod restart count above 2.…