Alert Tuning

by Skill Me

Reduces alert noise by making alerts actionable, symptom-based, and tied to SLOs. Identifies and kills flappy, duplicate, and cause-based alerts. Use when alert fatigue is degrading on-call response quality.

Tags: alerting, sre, slo, observability, on-call

SKILL.md preview

---
name: Alert Tuning
description: Reduces alert noise by making alerts actionable, symptom-based, and tied to SLOs. Identifies and kills flappy, duplicate, and cause-based alerts. Use when alert fatigue is degrading on-call response quality.
---

# Alert Tuning

Every alert that fires without requiring human action trains engineers to ignore alerts. Alert fatigue is how real incidents get missed. The goal: every page wakes someone up for a reason they cannot automate away.

## The Audit — Run First

Before writing new alert rules, audit what you have.

1. Pull the last 30 days of alert history. Calculate: total pages, pages outside business hours, pages that resulted in no action taken.
2. Identify flappy alerts: any alert that fires and resolves more than 3 times in a 24-hour window without a corresponding incident.
3. Identify cause-based alerts: anything that fires on a system-internal metric (queue depth, pod restart count, CPU) without a direct user-impact correlation.
4. Calculate the actionability rate: (pages with a documented remediation / total pages). Below 80% means your alert set is broken.
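
The audit steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Page` record, its field names, and the 09:00 to 17:00 weekday definition of business hours are all assumptions, and it treats each record as one fire-and-resolve cycle. Adapt the shape to whatever your paging tool exports.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Page:
    """One fire-and-resolve cycle of an alert (hypothetical export format)."""
    alert_name: str
    fired_at: datetime
    action_taken: bool              # did the responder have to do anything?
    has_remediation: bool           # is a remediation documented for this page?
    incident_id: Optional[str] = None


def outside_business_hours(ts: datetime) -> bool:
    # Assumption: business hours are Mon-Fri, 09:00-17:00 local time.
    return ts.weekday() >= 5 or not (9 <= ts.hour < 17)


def audit_summary(pages: list[Page]) -> dict:
    """Steps 1 and 4: page counts and the actionability rate."""
    if not pages:
        raise ValueError("no pages in the audit window")
    total = len(pages)
    return {
        "total": total,
        "outside_business_hours": sum(outside_business_hours(p.fired_at) for p in pages),
        "no_action": sum(not p.action_taken for p in pages),
        "actionability_rate": sum(p.has_remediation for p in pages) / total,
    }


def flappy_alerts(pages: list[Page], max_fires: int = 3,
                  window: timedelta = timedelta(hours=24)) -> set[str]:
    """Step 2: alerts firing more than max_fires times within any window,
    counting only pages with no corresponding incident."""
    fires = defaultdict(list)
    for p in pages:
        if p.incident_id is None:
            fires[p.alert_name].append(p.fired_at)

    flappy = set()
    for name, times in fires.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window until it spans less than `window`.
            while times[end] - times[start] >= window:
                start += 1
            if end - start + 1 > max_fires:
                flappy.add(name)
                break
    return flappy
```

Run `audit_summary` over the 30-day history and compare `actionability_rate` against the 80% bar; anything returned by `flappy_alerts` is a candidate for deletion or retuning.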

## Symptom vs Cause

Page on symptoms, not causes. Symptoms are what users experience. Causes are what engineers investigate.

- Wrong: alert on CPU above 80%.
- Right: alert on p99 latency above SLO threshold.
- Wrong: alert on pod restart count above 2.…
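
As a rough illustration of the "right" example above, here is a sketch of a symptom-based paging decision driven by p99 latency against an SLO threshold. The function names, the nearest-rank percentile method, and the 500 ms threshold in the usage lines are assumptions chosen for the example; a real setup would normally express this as a rule in the monitoring system itself.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_page(latencies_ms: list[float], slo_threshold_ms: float) -> bool:
    """Page only on the user-visible symptom: p99 latency over the SLO.
    CPU, queue depth, and restart counts are deliberately not inputs."""
    return percentile(latencies_ms, 99) > slo_threshold_ms


# Hypothetical window of request latencies: most are fast, a slow tail exists.
window = [100.0] * 98 + [900.0, 950.0]
page = should_page(window, slo_threshold_ms=500.0)
```

The decision takes only user-facing measurements as input, so a busy but healthy host cannot trigger a page.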
