Why

Detection without response is an alarm nobody answers. A mature anomaly management capability covers the full lifecycle: Detect → Notify → Triage → Resolve → Review. Defined SLAs turn anomaly alerts into resolved incidents rather than ignored notifications. Without a response workflow, the same spike recurs next month because root causes are never addressed.

What

Define a response workflow with severity-based SLAs, a triage process, resolution tracking, and post-incident review. The output is a documented runbook that every alert links to.

How

Define Severity Levels and SLAs

Severity	Detection → Acknowledgement	Acknowledgement → Resolution	Escalation Path
Critical (>$5K/day)	< 1 hour	< 4 hours	Auto-escalate to Eng Director at 1 hour
High ($1K–$5K/day)	< 4 hours	< 24 hours	Auto-escalate to manager at 4 hours
Medium ($200–$1K/day)	< 24 hours	< 5 business days	Included in weekly FinOps review
Low (<$200/day)	Weekly digest	Best effort	No escalation — tracked in backlog
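The severity bands and escalation timers in the table can be encoded as data so that alerting code and the runbook stay in sync. The following is a minimal sketch, not a definitive implementation. The names `SlaPolicy`, `POLICIES`, `classify_severity`, and `escalation_target` are hypothetical. It assumes that an impact of exactly $1,000/day counts as High and exactly $200/day as Medium (the table leaves those boundaries open), and it approximates "5 business days" as 7 calendar days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SlaPolicy:
    acknowledge_within: Optional[timedelta]  # None: handled in the weekly digest
    resolve_within: Optional[timedelta]      # None: best effort
    escalate_to: Optional[str]               # None: no auto-escalation


# Targets copied from the severity table. Treating "5 business days" as
# 7 calendar days is a simplifying assumption.
POLICIES = {
    "Critical": SlaPolicy(timedelta(hours=1), timedelta(hours=4), "Eng Director"),
    "High": SlaPolicy(timedelta(hours=4), timedelta(hours=24), "manager"),
    "Medium": SlaPolicy(timedelta(hours=24), timedelta(days=7), None),
    "Low": SlaPolicy(None, None, None),
}


def classify_severity(daily_impact_usd: float) -> str:
    """Map estimated daily impact (USD) to a severity band.

    Assumption: the shared boundaries $1,000 and $200 fall into the
    higher band; exactly $5,000 is High because Critical is ">$5K".
    """
    if daily_impact_usd > 5000:
        return "Critical"
    if daily_impact_usd >= 1000:
        return "High"
    if daily_impact_usd >= 200:
        return "Medium"
    return "Low"


def escalation_target(
    severity: str, detected_at: datetime, now: datetime, acknowledged: bool
) -> Optional[str]:
    """Return who to escalate to, or None if no escalation is due.

    In the table the escalation timer equals the acknowledgement window
    (1 hour for Critical, 4 hours for High), so that window is reused here.
    """
    policy = POLICIES[severity]
    if acknowledged or policy.escalate_to is None:
        return None
    if now - detected_at >= policy.acknowledge_within:
        return policy.escalate_to
    return None
```

A scheduled job could call `escalation_target` for every open alert and page the returned role. Keeping the thresholds in one dictionary means a threshold change after a post-incident review touches a single place.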

Define the Triage Process

When an alert arrives, the owner must answer one question: is this expected or unexpected?

Anomaly Lifecycle
═══════════════════════════════════════════════════════════════

  Detect    → Notify    → Triage       → Resolve    → Review
  ─────────   ─────────   ───────────   ──────────   ────────
  ML-based    Route to    Expected or   Stop the     Root
  or          owner via   unexpected?   bleeding.    cause.
  threshold   Slack.      Who owns it?  Remediate    Prevent
                                        or accept.   again.

  ◄── AUTOMATED ──►       ◄────── HUMAN ──────────────────►

Expected causes: planned deployment, migration, seasonal spike, data transfer job, auto-scaling event. Action: acknowledge and close.

Unexpected causes: misconfiguration, runaway process, security breach, orphaned resources, accidental instance type change. Action: remediate and investigate root cause.

Create an ITSM ticket template for anomalies with fields: anomaly date, affected resource/account, estimated daily impact ($), severity, expected/unexpected classification, root cause, remediation action, lessons learned.
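The ticket template and the expected/unexpected decision can be captured in a small record type before it is mapped onto a specific ITSM tool. This is an illustrative sketch only: the class name `AnomalyTicket`, its field names, and the `next_action` helper are assumptions chosen to mirror the field list above, not the schema of any particular ticketing product.

```python
from dataclasses import asdict, dataclass
from datetime import date

VALID_CLASSIFICATIONS = {"expected", "unexpected"}


@dataclass
class AnomalyTicket:
    # Fields known at triage time.
    anomaly_date: date
    affected_resource: str              # resource or account identifier
    estimated_daily_impact_usd: float
    severity: str
    classification: str                 # "expected" or "unexpected"
    # Fields filled in during resolution and review.
    root_cause: str = ""
    remediation_action: str = ""
    lessons_learned: str = ""

    def __post_init__(self) -> None:
        # Reject anything other than the two triage outcomes.
        if self.classification not in VALID_CLASSIFICATIONS:
            raise ValueError(
                f"classification must be one of {sorted(VALID_CLASSIFICATIONS)}"
            )

    def next_action(self) -> str:
        """Return the action the triage process prescribes."""
        if self.classification == "expected":
            return "acknowledge and close"
        return "remediate and investigate root cause"

    def to_fields(self) -> dict:
        """Flatten to a plain dict with an ISO date, ready for serialisation."""
        fields = asdict(self)
        fields["anomaly_date"] = self.anomaly_date.isoformat()
        return fields
```

For example, `AnomalyTicket(date(2024, 3, 1), "acct-123", 1200.0, "High", "unexpected").next_action()` returns `"remediate and investigate root cause"`; the account identifier and amounts are made up for illustration.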

Configure ITSM Integration

Auto-create tickets from alerts for High and Critical severity:

Provider	Integration Path
AWS	EventBridge → Lambda → Jira/ServiceNow API
Azure	Action Group → Logic App → Jira/ServiceNow connector
GCP	Pub/Sub → Cloud Function → Jira/ServiceNow API
Cross-cloud	Cloud Custodian `notify` action with ticket creation

For Medium/Low, aggregate into a weekly digest rather than creating individual tickets.
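Whichever integration path is used, the function in the middle (Lambda, Logic App, or Cloud Function) makes the same routing decision. The sketch below shows that decision in isolation and builds a request body in the shape the Jira REST API expects for issue creation (`fields` with `project`, `issuetype`, `summary`, `description`, `labels`). It does not send anything. The alert dictionary keys, the project key `FINOPS`, the issue type `Task`, and the label names are all hypothetical and would need to match the actual alert source and Jira configuration.

```python
from typing import Any, Dict, List, Tuple

TICKET_SEVERITIES = {"Critical", "High"}
DIGEST_SEVERITIES = {"Medium", "Low"}


def route_alert(severity: str) -> str:
    """Return "ticket" for Critical/High and "digest" for Medium/Low."""
    if severity in TICKET_SEVERITIES:
        return "ticket"
    if severity in DIGEST_SEVERITIES:
        return "digest"
    raise ValueError(f"unknown severity: {severity!r}")


def build_jira_issue(
    alert: Dict[str, Any], project_key: str = "FINOPS", issue_type: str = "Task"
) -> Dict[str, Any]:
    """Build a create-issue body; project key and issue type are placeholders."""
    severity = alert["severity"]
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": issue_type},
            "summary": f"[{severity}] Cost anomaly on {alert['resource']}",
            "description": (
                f"Estimated daily impact: ${alert['daily_impact_usd']:,.2f}\n"
                f"Detected: {alert['detected_at']}"
            ),
            "labels": ["cost-anomaly", f"severity-{severity.lower()}"],
        }
    }


def dispatch(
    alerts: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split alerts into ticket payloads and entries for the weekly digest."""
    tickets: List[Dict[str, Any]] = []
    digest: List[Dict[str, Any]] = []
    for alert in alerts:
        if route_alert(alert["severity"]) == "ticket":
            tickets.append(build_jira_issue(alert))
        else:
            digest.append(alert)
    return tickets, digest
```

Separating routing from the HTTP call keeps the severity rule testable without credentials, and the same `dispatch` function can sit behind any of the provider paths in the table.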

Define KPIs for Anomaly Management

Track these metrics to know if the system is working:

KPI	Target	Data Source
Mean Time to Detect (MTTD)	< 24h (threshold), < 4h (ML)	Detection system logs
Mean Time to Acknowledge (MTTA)	< 1h for Critical, < 4h for High	Slack reaction or ticket creation time
Mean Time to Resolve (MTTR)	< 4h for Critical, < 24h for High	ITSM ticket timestamps
False Positive Rate	< 30%	Manual triage classification
Financial Impact Avoided	Track per incident	(hourly cost × estimated hours saved)
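Most of these KPIs can be derived from ticket timestamps and triage fields. The sketch below is one possible calculation, under stated assumptions: the `Incident` record and its field names are hypothetical, time to acknowledge is measured from detection, time to resolve is measured from acknowledgement (matching the SLA table columns), and a false positive is whatever the triager marked as such. MTTD is omitted because it needs the anomaly start time from the detection system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List, Optional


@dataclass
class Incident:
    detected_at: datetime
    acknowledged_at: Optional[datetime]
    resolved_at: Optional[datetime]
    false_positive: bool                  # set manually during triage
    hourly_cost_usd: float = 0.0
    estimated_hours_saved: float = 0.0


def _mean_hours(deltas: Iterable[timedelta]) -> Optional[float]:
    """Average a set of durations in hours; None when there are none."""
    seconds = [d.total_seconds() for d in deltas]
    return mean(seconds) / 3600 if seconds else None


def compute_kpis(incidents: List[Incident]) -> Dict[str, Optional[float]]:
    if not incidents:
        raise ValueError("at least one incident is required")
    acked = [i for i in incidents if i.acknowledged_at is not None]
    resolved = [i for i in acked if i.resolved_at is not None]
    return {
        # Detection -> acknowledgement.
        "mtta_hours": _mean_hours(i.acknowledged_at - i.detected_at for i in acked),
        # Acknowledgement -> resolution (assumption, see lead-in).
        "mttr_hours": _mean_hours(i.resolved_at - i.acknowledged_at for i in resolved),
        "false_positive_rate": sum(i.false_positive for i in incidents) / len(incidents),
        # Hourly cost x estimated hours saved, real anomalies only.
        "impact_avoided_usd": sum(
            i.hourly_cost_usd * i.estimated_hours_saved
            for i in incidents
            if not i.false_positive
        ),
    }
```

Comparing `false_positive_rate` against the 30% target each week gives an early signal that detection thresholds need retuning.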

Establish Post-Incident Review

Run a lightweight review (15 minutes) after every Critical/High anomaly:

Post-Anomaly Review
═══════════════════════════════════════════════════════════════

  Three questions:

  1. What happened?
     Root cause: deployment, misconfiguration, scaling,
     data transfer, or external factor?

  2. Could we have prevented it?
     Should there be a policy, guardrail, or approval
     gate that would stop this from recurring?

  3. Could we have detected it faster?
     Do thresholds need adjusting? Was the right
     person notified?

  Output:
  → Update detection thresholds
  → Create preventive ticket (policy change)
  → Update runbook if new pattern found

Convert recurring anomalies that reveal systemic waste into optimisation tickets (see S8-10 Anomaly-to-Optimisation Handoff).
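Recurrence can be spotted by grouping closed anomaly tickets by resource and root cause. The following sketch assumes tickets are available as dictionaries with hypothetical keys `resource`, `root_cause`, and `classification`, and it treats two or more unexpected occurrences of the same pair as "recurring". That threshold is an assumption and should be set to whatever the team agrees on.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def recurring_anomalies(
    tickets: Iterable[Dict[str, str]], min_occurrences: int = 2
) -> List[Tuple[str, str]]:
    """Return (resource, root_cause) pairs seen at least min_occurrences times.

    Only tickets classified as "unexpected" are counted, since expected
    spikes are acknowledged and closed without remediation.
    """
    counts = Counter(
        (t["resource"], t["root_cause"])
        for t in tickets
        if t["classification"] == "unexpected"
    )
    return sorted(pair for pair, n in counts.items() if n >= min_occurrences)
```

Each returned pair is a candidate for an optimisation ticket rather than another one-off remediation.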

Deliverable Checklist

Severity levels and SLAs documented
Triage process defined (expected vs unexpected decision tree)
ITSM ticket template created for anomalies
Auto-ticket creation configured for Critical/High
Weekly digest configured for Medium/Low
KPI tracking operational (MTTD, MTTA, MTTR, false positive rate)
Post-incident review process documented
Escalation paths configured with auto-escalation timers