Why
Detection without response is an alarm nobody answers. A mature anomaly management capability covers the full lifecycle: Detect → Notify → Triage → Resolve → Review. Defined SLAs turn anomaly alerts into resolved incidents rather than ignored notifications. Without a response workflow, the same spike recurs next month because root causes are never addressed.
What
Define a response workflow with severity-based SLAs, a triage process, resolution tracking, and post-incident review. The output is a documented runbook that every alert links to.
How
Define Severity Levels and SLAs
| Severity | Detection → Acknowledgement | Acknowledgement → Resolution | Escalation Path |
|---|---|---|---|
| Critical (>$5K/day) | < 1 hour | < 4 hours | Auto-escalate to Eng Director at 1 hour |
| High ($1K–5K/day) | < 4 hours | < 24 hours | Auto-escalate to manager at 4 hours |
| Medium ($200–1K/day) | < 24 hours | < 5 business days | Included in weekly FinOps review |
| Low (<$200/day) | Weekly digest | Best effort | No escalation — tracked in backlog |
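
The severity table above can be expressed as a small lookup so that alerts are classified and escalated consistently. The following is a minimal sketch, not a reference implementation. The function names, the inclusive lower bounds at exactly $1K and $200, and the rule that escalation fires only while the alert is still unacknowledged are all assumptions that the table does not state.

```python
from datetime import datetime, timedelta
from typing import Optional

# Escalation timers taken from the severity table.
# Medium and Low have no auto-escalation, so they are absent here.
ESCALATION = {
    "Critical": (timedelta(hours=1), "Eng Director"),
    "High": (timedelta(hours=4), "manager"),
}


def classify_severity(daily_impact_usd: float) -> str:
    """Map estimated daily impact in USD to a severity level.

    Assumption: lower bounds are inclusive, so exactly $1,000/day is High
    and exactly $200/day is Medium. Exactly $5,000/day stays High because
    the table defines Critical as strictly greater than $5K.
    """
    if daily_impact_usd > 5000:
        return "Critical"
    if daily_impact_usd >= 1000:
        return "High"
    if daily_impact_usd >= 200:
        return "Medium"
    return "Low"


def escalation_target(
    severity: str,
    detected_at: datetime,
    now: datetime,
    acknowledged: bool,
) -> Optional[str]:
    """Return who to escalate to, or None if no escalation is due.

    Assumption: escalation applies only to unacknowledged alerts.
    """
    if acknowledged or severity not in ESCALATION:
        return None
    deadline, target = ESCALATION[severity]
    return target if now - detected_at >= deadline else None
```

A scheduler could call `escalation_target` every few minutes for each open alert and page the returned role when it is not `None`.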
Define the Triage Process
When an alert arrives, the owner must answer one question: is this expected or unexpected?
Anomaly Lifecycle

| Stage | Mode | What happens |
|---|---|---|
| Detect | Automated | ML-based or threshold detection |
| Notify | Automated | Route to owner via Slack |
| Triage | Human | Expected or unexpected? Who owns it? |
| Resolve | Human | Stop the bleeding. Remediate or accept. |
| Review | Human | Find the root cause. Prevent recurrence. |

Expected causes: planned deployment, migration, seasonal spike, data transfer job, auto-scaling event. Action: acknowledge and close.
Unexpected causes: misconfiguration, runaway process, security breach, orphaned resources, accidental instance type change. Action: remediate and investigate root cause.
Create an ITSM ticket template for anomalies with fields: anomaly date, affected resource/account, estimated daily impact ($), severity, expected/unexpected classification, root cause, remediation action, lessons learned.
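
The ticket template and the expected/unexpected decision can be captured in one record type. This is a sketch under stated assumptions: the class name `AnomalyTicket`, the field names, and the `next_action` helper are hypothetical and only mirror the fields and actions listed above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Classification(Enum):
    EXPECTED = "expected"
    UNEXPECTED = "unexpected"


@dataclass
class AnomalyTicket:
    """Hypothetical ticket template mirroring the fields listed above."""

    anomaly_date: date
    affected_resource: str              # resource or account identifier
    estimated_daily_impact_usd: float
    severity: str
    classification: Classification
    # The remaining fields are filled in during resolution and review.
    root_cause: str = ""
    remediation_action: str = ""
    lessons_learned: str = ""

    def next_action(self) -> str:
        """Return the triage action for this ticket's classification."""
        if self.classification is Classification.EXPECTED:
            return "acknowledge and close"
        return "remediate and investigate root cause"
```

Leaving the last three fields empty by default lets the ticket be created at triage time and completed later.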
Configure ITSM Integration
Auto-create tickets from alerts for High and Critical severity:
| Provider | Integration Path |
|---|---|
| AWS | EventBridge → Lambda → Jira/ServiceNow API |
| Azure | Action Group → Logic App → Jira/ServiceNow connector |
| GCP | Pub/Sub → Cloud Function → Jira/ServiceNow API |
| Cross-cloud | Cloud Custodian notify action with ticket creation |
For Medium/Low, aggregate into a weekly digest rather than creating individual tickets.
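
Whichever integration path is used, the function in the middle applies the same routing rule: Critical and High alerts become tickets, everything else goes to the digest. The sketch below shows only that rule and makes no network call. The alert keys (`severity`, `resource`, `daily_impact_usd`) and the project key `FINOPS` are hypothetical, and the payload is loosely modelled on a Jira-style create-issue body, so check it against your ITSM tool's API before use.

```python
from typing import Any, Dict, List, Optional

# Severities that get an individual ticket, per the rule above.
TICKET_SEVERITIES = {"Critical", "High"}


def route_alert(
    alert: Dict[str, Any],
    digest: List[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Return a ticket payload for Critical/High alerts.

    Medium and Low alerts are appended to `digest` and None is returned.
    """
    if alert["severity"] in TICKET_SEVERITIES:
        return {
            "fields": {
                "project": {"key": "FINOPS"},      # hypothetical project key
                "issuetype": {"name": "Task"},     # assumed issue type
                "summary": f"[{alert['severity']}] Cost anomaly: {alert['resource']}",
                "description": (
                    f"Estimated daily impact: ${alert['daily_impact_usd']:,.2f}"
                ),
            }
        }
    digest.append(alert)
    return None
```

The caller would send the returned payload to the ticketing API and flush `digest` once a week.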
Define KPIs for Anomaly Management
Track these metrics to know if the system is working:
| KPI | Target | Data Source |
|---|---|---|
| Mean Time to Detect (MTTD) | < 24h (threshold), < 4h (ML) | Detection system logs |
| Mean Time to Acknowledge (MTTA) | < 4h for Critical/High | Slack reaction or ticket creation time |
| Mean Time to Resolve (MTTR) | < 24h for Critical | ITSM ticket timestamps |
| False Positive Rate | < 30% | Manual triage classification |
| Financial Impact Avoided | Track per incident | Calculated: hourly cost × estimated hours saved |
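
These KPIs can be computed from incident timestamps. The sketch below assumes a hypothetical `Incident` record. It also assumes that MTTD runs from the start of the anomalous spend to detection, MTTA from detection to acknowledgement, and MTTR from acknowledgement to resolution (matching the SLA table's columns), and that false positives are excluded from the avoided-impact total. None of these definitions is fixed by the table, so adjust them to your own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""

    started_at: datetime        # when the anomalous spend began
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    false_positive: bool
    hourly_cost_usd: float
    estimated_hours_saved: float


def _mean_hours(deltas: Iterable[timedelta]) -> float:
    """Average a series of durations and express the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def compute_kpis(incidents: List[Incident]) -> Dict[str, float]:
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        "mttd_hours": _mean_hours(i.detected_at - i.started_at for i in incidents),
        "mtta_hours": _mean_hours(i.acknowledged_at - i.detected_at for i in incidents),
        "mttr_hours": _mean_hours(i.resolved_at - i.acknowledged_at for i in incidents),
        "false_positive_rate": sum(i.false_positive for i in incidents) / len(incidents),
        # Financial impact avoided = hourly cost x estimated hours saved,
        # summed over incidents that were not false positives.
        "impact_avoided_usd": sum(
            i.hourly_cost_usd * i.estimated_hours_saved
            for i in incidents
            if not i.false_positive
        ),
    }
```

Computing the KPIs per severity level, by filtering the incident list before calling `compute_kpis`, makes the results directly comparable with the per-severity targets.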
Establish Post-Incident Review
Run a lightweight review (15 minutes) after every Critical/High anomaly:
Post-Anomaly Review

Three questions:

1. What happened? Root cause: deployment, misconfiguration, scaling, data transfer, or external factor?
2. Could we have prevented it? Should there be a policy, guardrail, or approval gate that would stop this from recurring?
3. Could we have detected it faster? Do thresholds need adjusting? Was the right person notified?

Output:

- Update detection thresholds
- Create preventive ticket (policy change)
- Update runbook if new pattern found

Recurring anomalies that reveal systemic waste should be converted into optimisation tickets (links to S8-10 Anomaly-to-Optimisation Handoff).
Deliverable Checklist
- Severity levels and SLAs documented
- Triage process defined (expected vs unexpected decision tree)
- ITSM ticket template created for anomalies
- Auto-ticket creation configured for Critical/High
- Weekly digest configured for Medium/Low
- KPI tracking operational (MTTD, MTTA, MTTR, false positive rate)
- Post-incident review process documented
- Escalation paths configured with auto-escalation timers