PHASE 03 // IMPLEMENT

S4-05 · Understand Cloud Usage and Cost · Anomaly Management

Define Anomaly Response Workflow

Why

Detection without response is an alarm nobody answers. A mature anomaly management capability covers the full lifecycle: Detect → Notify → Triage → Resolve → Review. Defined SLAs turn anomaly alerts into resolved incidents rather than ignored notifications. Without a response workflow, the same spike recurs next month because root causes are never addressed.

What

Define a response workflow with severity-based SLAs, a triage process, resolution tracking, and a post-incident review. The output is a documented runbook that every alert links to.

How

Define Severity Levels and SLAs

Severity                Detection → Acknowledgement   Acknowledgement → Resolution   Escalation Path
─────────────────────   ───────────────────────────   ────────────────────────────   ───────────────────────────────────────
Critical (>$5K/day)     < 1 hour                      < 4 hours                      Auto-escalate to Eng Director at 1 hour
High ($1K–$5K/day)      < 4 hours                     < 24 hours                     Auto-escalate to manager at 4 hours
Medium ($200–$1K/day)   < 24 hours                    < 5 business days              Included in weekly FinOps review
Low (<$200/day)         Weekly digest                 Best effort                    No escalation; tracked in backlog
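The severity bands and escalation timers above can be expressed as a small lookup. This is a minimal sketch, not a reference implementation. The names `Sla`, `SLAS`, `classify`, and `escalation_target` are hypothetical, and the handling of exact boundary values ($200, $1K, $5K) is an assumption, since the table does not say which band they fall into.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Sla:
    ack_within: Optional[timedelta]  # None = weekly digest, no per-alert deadline
    escalate_to: Optional[str]       # None = no auto-escalation


# Values taken from the severity table; the structure is illustrative.
SLAS = {
    "Critical": Sla(timedelta(hours=1), "Eng Director"),
    "High": Sla(timedelta(hours=4), "manager"),
    "Medium": Sla(timedelta(hours=24), None),
    "Low": Sla(None, None),
}


def classify(daily_impact_usd: float) -> str:
    """Map estimated daily impact to a severity band.

    Assumption: exactly $5K is High, exactly $1K is High, exactly $200 is Medium.
    """
    if daily_impact_usd > 5000:
        return "Critical"
    if daily_impact_usd >= 1000:
        return "High"
    if daily_impact_usd >= 200:
        return "Medium"
    return "Low"


def escalation_target(
    severity: str,
    detected_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
) -> Optional[str]:
    """Return who to escalate to, or None if no escalation is due."""
    sla = SLAS[severity]
    if sla.escalate_to is None or acknowledged_at is not None:
        return None
    if now - detected_at >= sla.ack_within:
        return sla.escalate_to
    return None
```

A scheduler could call `escalation_target` every few minutes for each open alert and page the returned role when it is not `None`.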

Define the Triage Process

When an alert arrives, the owner must answer one question: is this expected or unexpected?

Anomaly Lifecycle
═══════════════════════════════════════════════════════════════

  Detect    → Notify    → Triage       → Resolve    → Review
  ─────────   ─────────   ───────────   ──────────   ────────
  ML-based    Route to    Expected or   Stop the     Root
  or          owner via   unexpected?   bleeding.    cause.
  threshold   Slack.      Who owns it?  Remediate    Prevent
                                        or accept.   again.

  ◄── AUTOMATED ──►       ◄────── HUMAN ──────────────────►

Expected causes: planned deployment, migration, seasonal spike, data transfer job, auto-scaling event. Action: acknowledge and close.

Unexpected causes: misconfiguration, runaway process, security breach, orphaned resources, accidental instance type change. Action: remediate and investigate root cause.

Create an ITSM ticket template for anomalies with fields: anomaly date, affected resource/account, estimated daily impact ($), severity, expected/unexpected classification, root cause, remediation action, lessons learned.
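One way to keep the template consistent across tools is to model it as a record before mapping it to ITSM fields. The sketch below mirrors the fields listed above. The class name `AnomalyTicket`, the field names, and the `next_action` helper are hypothetical, not part of any ITSM product.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Optional

EXPECTED = "expected"
UNEXPECTED = "unexpected"


@dataclass
class AnomalyTicket:
    anomaly_date: date
    affected_resource: str             # resource or account identifier
    estimated_daily_impact_usd: float
    severity: str                      # Critical / High / Medium / Low
    classification: Optional[str] = None  # set during triage
    root_cause: str = ""
    remediation_action: str = ""
    lessons_learned: str = ""


def next_action(ticket: AnomalyTicket) -> str:
    """Return the triage action implied by the ticket's classification."""
    if ticket.classification == EXPECTED:
        return "acknowledge and close"
    if ticket.classification == UNEXPECTED:
        return "remediate and investigate root cause"
    return "triage: classify as expected or unexpected"


ticket = AnomalyTicket(
    anomaly_date=date(2024, 3, 1),
    affected_resource="example-account/compute",  # illustrative value
    estimated_daily_impact_usd=1500.0,
    severity="High",
)
ticket_fields = asdict(ticket)  # plain dict, ready to map to ITSM fields
```

Leaving `classification` empty until triage makes untriaged tickets easy to query.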

Configure ITSM Integration

Auto-create tickets from alerts for High and Critical severity:

Provider      Integration Path
───────────   ────────────────────────────────────────────────────────
AWS           EventBridge → Lambda → Jira/ServiceNow API
Azure         Action Group → Logic App → Jira/ServiceNow connector
GCP           Pub/Sub → Cloud Function → Jira/ServiceNow API
Cross-cloud   Cloud Custodian notify action with ticket creation

For Medium/Low, aggregate into a weekly digest rather than creating individual tickets.
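The routing rule (tickets for Critical/High, digest for Medium/Low) could look like the sketch below, whichever function runtime hosts it. The input dict shape, the `FINOPS` project key, and the function names are assumptions for illustration. The payload follows the `fields` layout accepted by Jira's REST API v2 create-issue call. The HTTP request itself is left out because authentication and base URL are site-specific.

```python
import json

TICKET_SEVERITIES = {"Critical", "High"}


def route(anomaly: dict) -> str:
    """Decide whether an anomaly becomes a ticket or a digest entry."""
    return "ticket" if anomaly["severity"] in TICKET_SEVERITIES else "weekly_digest"


def build_issue_payload(anomaly: dict, project_key: str = "FINOPS") -> dict:
    """Build a Jira-style create-issue body (project key is hypothetical)."""
    summary = (
        f"[{anomaly['severity']}] Cost anomaly on {anomaly['resource']} "
        f"(~${anomaly['daily_impact_usd']:,.0f}/day)"
    )
    description = "\n".join(
        [
            f"Anomaly date: {anomaly['anomaly_date']}",
            f"Affected resource/account: {anomaly['resource']}",
            f"Estimated daily impact ($): {anomaly['daily_impact_usd']:.2f}",
            "Classification: (set during triage)",
        ]
    )
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Task"},
        }
    }


def build_digest(anomalies: list) -> str:
    """Aggregate Medium/Low anomalies into one plain-text digest."""
    lines = [
        f"- {a['severity']}: {a['resource']} ~${a['daily_impact_usd']:,.0f}/day"
        for a in anomalies
        if route(a) == "weekly_digest"
    ]
    return "Weekly anomaly digest\n" + "\n".join(lines)


example = {
    "severity": "Critical",
    "resource": "prod-account/ec2",  # illustrative value
    "daily_impact_usd": 6200.0,
    "anomaly_date": "2024-03-01",
}
body = json.dumps(build_issue_payload(example))  # request body for the ITSM call
```

Keeping payload construction separate from the HTTP call lets the same code run unchanged in a Lambda, a Cloud Function, or a unit test.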

Define KPIs for Anomaly Management

Track these metrics to know if the system is working:

KPI                               Target                         Data Source
───────────────────────────────   ────────────────────────────   ─────────────────────────────────────
Mean Time to Detect (MTTD)        < 24h (threshold), < 4h (ML)   Detection system logs
Mean Time to Acknowledge (MTTA)   < 4h for Critical/High         Slack reaction or ticket creation time
Mean Time to Resolve (MTTR)       < 24h for Critical             ITSM ticket timestamps
False Positive Rate               < 30%                          Manual triage classification
Financial Impact Avoided          Track per incident             Hourly cost × estimated hours saved
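These KPIs reduce to simple arithmetic over ticket timestamps and triage flags. The sketch below shows one way to compute them. The function names are hypothetical, and the boolean false-positive flag recorded at triage is an assumption, since the table does not define how a false positive is marked.

```python
from datetime import datetime
from statistics import mean
from typing import Iterable, Tuple


def mean_hours(intervals: Iterable[Tuple[datetime, datetime]]) -> float:
    """Mean duration in hours over (start, end) pairs.

    Use (occurred, detected) for MTTD, (detected, acknowledged) for MTTA,
    and (acknowledged, resolved) for MTTR.
    """
    return mean((end - start).total_seconds() / 3600 for start, end in intervals)


def false_positive_rate(flags: Iterable[bool]) -> float:
    """Share of alerts flagged as false positives during manual triage."""
    flags = list(flags)
    return sum(flags) / len(flags)


def impact_avoided(hourly_cost_usd: float, estimated_hours_saved: float) -> float:
    """Financial impact avoided = hourly cost × estimated hours saved."""
    return hourly_cost_usd * estimated_hours_saved


# Illustrative data only.
acks = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 11, 0)),   # 2 hours
    (datetime(2024, 3, 2, 9, 0), datetime(2024, 3, 2, 13, 0)),   # 4 hours
]
mtta_hours = mean_hours(acks)
fp_rate = false_positive_rate([True, False, False, False])
avoided = impact_avoided(50.0, 48)
```

For example, a runaway process costing $50/hour that is stopped 48 hours earlier than it otherwise would have been counts as $2,400 avoided.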

Establish Post-Incident Review

Run a lightweight review (15 minutes) after every Critical/High anomaly:

Post-Anomaly Review
═══════════════════════════════════════════════════════════════

  Three questions:

  1. What happened?
     Root cause: deployment, misconfiguration, scaling,
     data transfer, or external factor?

  2. Could we have prevented it?
     Should there be a policy, guardrail, or approval
     gate that would stop this from recurring?

  3. Could we have detected it faster?
     Do thresholds need adjusting? Was the right
     person notified?

  Output:
  → Update detection thresholds
  → Create preventive ticket (policy change)
  → Update runbook if new pattern found

Recurring anomalies that reveal systemic waste should be converted into optimisation tickets (see S8-10 Anomaly-to-Optimisation Handoff).
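Recurrence can be spotted by grouping closed tickets on resource and root cause. This is a minimal sketch: the dict keys, the function name `recurring_patterns`, and the threshold of two occurrences are assumptions, not values defined in this runbook.

```python
from collections import Counter


def recurring_patterns(tickets: list, min_occurrences: int = 2) -> list:
    """Return (resource, root_cause) pairs seen at least min_occurrences times.

    Only unexpected anomalies with a recorded root cause are counted.
    """
    counts = Counter(
        (t["affected_resource"], t["root_cause"])
        for t in tickets
        if t.get("classification") == "unexpected" and t.get("root_cause")
    )
    return sorted(key for key, n in counts.items() if n >= min_occurrences)


# Illustrative ticket history.
history = [
    {"affected_resource": "dev/ec2", "root_cause": "orphaned resources", "classification": "unexpected"},
    {"affected_resource": "dev/ec2", "root_cause": "orphaned resources", "classification": "unexpected"},
    {"affected_resource": "prod/s3", "root_cause": "data transfer job", "classification": "expected"},
    {"affected_resource": "prod/rds", "root_cause": "misconfiguration", "classification": "unexpected"},
]
candidates = recurring_patterns(history)
```

Each pair returned is a candidate for an optimisation ticket rather than another one-off incident.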

Deliverable Checklist

  • Severity levels and SLAs documented
  • Triage process defined (expected vs unexpected decision tree)
  • ITSM ticket template created for anomalies
  • Auto-ticket creation configured for Critical/High
  • Weekly digest configured for Medium/Low
  • KPI tracking operational (MTTD, MTTA, MTTR, false positive rate)
  • Post-incident review process documented
  • Escalation paths configured with auto-escalation timers