PHASE 03 // IMPLEMENT

S4-04 · Understand Cloud Usage and Cost · Anomaly Management

Set Up Anomaly Detection

Why

At a constant burn rate, a cost spike that runs unnoticed for two weeks costs 14× what it would have cost if caught on day one. Without automated detection, anomalies surface only when the monthly invoice arrives, by which point the money is already spent and cannot be recovered. Budget alerts give early visibility against known thresholds; ML-based detection catches patterns that static thresholds miss.

Without Anomaly Detection
═══════════════════════════════════════════════════════════════

  Day 1          Day 7           Day 30           Day 35
  Anomaly        Still           Invoice          Someone
  starts         running         arrives          notices

  $500/hr        $84K            $360K            "Why is
  runaway        wasted          on the bill       this so high?"

  ◄──────── Detection gap: 30+ days ─────────────────────────►
  ◄──────── Financial impact: unrecoverable ──────────────────►
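
The figures in the timeline follow from simple arithmetic. A minimal sketch, assuming the $500/hr runaway burns at a constant rate (the function name is illustrative only):

```python
HOURLY_BURN_USD = 500  # runaway rate taken from the timeline above


def wasted_spend(days_undetected: int, hourly_rate: float = HOURLY_BURN_USD) -> float:
    """Cumulative spend of a constant-rate anomaly left running for N days."""
    return hourly_rate * 24 * days_undetected


for day in (1, 7, 14, 30):
    print(f"Day {day:>2}: ${wasted_spend(day):,.0f}")

# Ratio of a two-week detection gap to same-day detection.
print(wasted_spend(14) / wasted_spend(1))
```

Day 7 comes out at $84,000 and day 30 at $360,000, matching the diagram; the two-week total is 14 times the day-one total.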

What

Deploy automated anomaly detection using cloud-native tools at multiple levels of the hierarchy (organisation, account, service), with alert routing to the right person.

How

Choose Detection Method

Use cloud-native anomaly detection as the starting point — it’s free or low-cost and integrates with billing data natively.

Native Detection Pipelines
═══════════════════════════════════════════════════════════════

  AWS:   Cost Anomaly Detection → SNS → Chatbot → Slack/Teams
                                      └→ Webhook → PagerDuty

  GCP:   Anomaly Detection → Pub/Sub → Function → Slack/PagerDuty

  Azure: Anomaly Alert → Action Group → Logic App → Slack/Teams
                                      └→ Webhook → PagerDuty

AWS offers zero-code Slack/Teams integration through AWS Chatbot (since renamed Amazon Q Developer in chat applications). Azure uses Logic Apps for drag-and-drop workflows. GCP requires a small Cloud Function to route alerts from Pub/Sub.
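
For the GCP path, the routing function can be quite small. The sketch below assumes a Pub/Sub-triggered function whose event carries a base64-encoded JSON body, and a Slack incoming webhook as the destination. The alert field names (project_id, service, impact_usd) are hypothetical placeholders, not the actual notification schema, so check the payload your project really receives before relying on them.

```python
import base64
import json
import urllib.request


def decode_pubsub_event(event: dict) -> dict:
    """Decode the base64-encoded JSON body of a Pub/Sub-triggered event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def build_slack_message(alert: dict) -> dict:
    """Format an alert as a Slack incoming-webhook payload.

    The alert keys used here are hypothetical; adapt them to the real payload.
    """
    project = alert.get("project_id", "unknown project")
    service = alert.get("service", "unknown service")
    impact = alert.get("impact_usd", 0)
    return {
        "text": f"Cost anomaly in {project} ({service}): "
                f"about ${impact:,.0f} over expected spend"
    }


def post_to_slack(webhook_url: str, message: dict) -> None:
    """POST the JSON message to a Slack incoming webhook."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10).close()


def handle_alert(event: dict, context=None, webhook_url: str = "") -> dict:
    """Entry point: decode, format, and (if a webhook is set) deliver."""
    message = build_slack_message(decode_pubsub_event(event))
    if webhook_url:  # skip the network call when no webhook is configured
        post_to_slack(webhook_url, message)
    return message
```

Keeping decoding and formatting separate from delivery lets the function be exercised locally with a hand-built event before it is wired to a real channel.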

Configure Detection Scope

Set up monitors at different levels for different purposes:

Monitor Scope            What It Catches                          Alert Recipient
Organisation / Billing   Catastrophic org-wide spikes             FinOps lead, CTO
Account / Sub / Project  Team or application-level spikes         Service Owner, Eng Manager
Service level            Specific service runaway (e.g., egress)  Engineer, SRE
Individual resource      Single resource gone rogue (optional)    Resource owner (via tag)

Don’t monitor everything at the same granularity. Organisation-level catches catastrophic events (compromised credentials, massive deployment errors). Account-level catches team-specific issues. Service-level catches runaway services like data transfer spikes.
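
As one illustration of layered scopes, the sketch below builds request bodies for AWS Cost Anomaly Detection: a service-dimension monitor for the broad view and a per-account monitor for team-level spikes. The helper names, monitor names, and dollar threshold are assumptions for illustration. The dictionary shapes follow the Cost Explorer API as commonly documented, but verify them against the current boto3 reference before use.

```python
def org_service_monitor(name: str = "org-all-services") -> dict:
    """Broad monitor: tracks every AWS service visible to the calling account."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def account_monitor(account_id: str) -> dict:
    """Narrow monitor: scoped to a single linked account."""
    return {
        "MonitorName": f"account-{account_id}",
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Dimensions": {"Key": "LINKED_ACCOUNT", "Values": [account_id]}
        },
    }


def sns_subscription(name: str, monitor_arns: list, topic_arn: str,
                     min_impact_usd: int) -> dict:
    """Alert subscription that publishes to SNS once impact passes a floor."""
    return {
        "SubscriptionName": name,
        "MonitorArnList": list(monitor_arns),
        "Subscribers": [{"Type": "SNS", "Address": topic_arn}],
        "Frequency": "IMMEDIATE",  # SNS subscribers use immediate delivery
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(min_impact_usd)],
            }
        },
    }


# Assumed usage with boto3 (not executed here):
#   ce = boto3.client("ce")
#   arn = ce.create_anomaly_monitor(AnomalyMonitor=org_service_monitor())["MonitorArn"]
#   ce.create_anomaly_subscription(
#       AnomalySubscription=sns_subscription("org-alerts", [arn], topic_arn, 1000))
```

A higher impact floor on the organisation-level subscription and a lower one on account-level subscriptions keeps the CTO's channel quiet while still surfacing team-sized spikes to the people who own them.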

Set Up Notification Routing

Detection is worthless if the alert goes to a generic inbox. Route to the person who can act.

Routing Logic
═══════════════════════════════════════════════════════════════

  ANOMALY FOUND
       │
       ▼
  CHECK FOR "OWNER" TAG
       │
       ├── YES → Route to owner (Slack/Email)
       │
       └── NO  → Escalate to FinOps Practitioner
                  (manual triage for "homeless" spend)

Implement a serverless router — a lightweight function (Lambda, Azure Function, Cloud Function) that enriches alerts by looking up the owner tag on the affected resource and routing the notification to the correct Slack channel or email.

The function logic:

  1. Parse the alert payload to get Account ID and Resource ID
  2. Call the cloud provider's API to read the resource's tags
  3. If owner tag exists, look up their Slack channel in a config file
  4. Post a formatted message to the correct channel
  5. If no owner tag, post to #finops-central for manual triage
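
The five steps above can be sketched as a provider-neutral function. Everything named here is an assumption for illustration: the alert keys (account_id, resource_id), the channel map, and the two injected callables, get_tags (which would wrap the provider's tagging API) and post (which would wrap the Slack or email client).

```python
from typing import Callable, Mapping

FALLBACK_CHANNEL = "#finops-central"

# Hypothetical owner-to-channel config; in practice loaded from a config file.
OWNER_CHANNELS: Mapping[str, str] = {
    "team-payments": "#payments-cost",
    "team-search": "#search-cost",
}

TagLookup = Callable[[str, str], Mapping[str, str]]  # (account, resource) -> tags
Poster = Callable[[str, str], None]                  # (channel, text) -> None


def route_alert(alert: Mapping[str, str], get_tags: TagLookup, post: Poster,
                owner_channels: Mapping[str, str] = OWNER_CHANNELS) -> str:
    """Route one anomaly alert and return the channel it was posted to."""
    # 1. Parse the alert payload.
    account_id = alert["account_id"]
    resource_id = alert["resource_id"]

    # 2. Read the resource's tags through the injected lookup.
    tags = get_tags(account_id, resource_id)

    # 3. Resolve the owner tag to a channel, if both exist.
    owner = tags.get("owner")
    channel = owner_channels.get(owner) if owner else None

    text = f"Cost anomaly on {resource_id} (account {account_id})"
    if channel is None:
        # 5. No owner tag, or an owner with no configured channel: manual triage.
        channel = FALLBACK_CHANNEL
        text += f" [no routable owner; tag value: {owner!r}]"
    else:
        text += f" [owner: {owner}]"

    # 4. Post the formatted message.
    post(channel, text)
    return channel
```

One design choice worth noting: an owner tag whose value is missing from the config is treated the same as no tag at all, so a typo in a tag cannot silently drop an alert. Injecting get_tags and post also makes it straightforward to drive the router with a synthetic anomaly, as the checklist requires.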

Configure Budget Alerts

In addition to ML-based anomaly detection, set static budget alerts at provisioning time as a safety net.

Provider  Tool                 Configuration
AWS       AWS Budgets          Per-account budgets with 50%, 80%, 100% thresholds
Azure     Azure Budgets        Per-subscription budgets with progressive alerts
GCP       GCP Billing Budgets  Per-project budgets with Pub/Sub notification

Budget alerts should be provisioned automatically as part of the workload onboarding pipeline (see S2-02 Automated Provisioning).
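
For the AWS row, an onboarding pipeline could generate the budget request per account. A minimal sketch, assuming a monthly cost budget that notifies an SNS topic at the 50%, 80%, and 100% thresholds; the helper name and arguments are illustrative, and the dictionary is shaped for the AWS Budgets create_budget call as commonly documented, so confirm it against the current API reference.

```python
def budget_request(account_id: str, name: str, monthly_limit_usd: int,
                   sns_topic_arn: str, thresholds=(50, 80, 100)) -> dict:
    """Build keyword arguments for an AWS Budgets create_budget call."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        # One notification per threshold, each on actual (not forecast) spend.
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": float(pct),
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "SNS", "Address": sns_topic_arn}
                ],
            }
            for pct in thresholds
        ],
    }


# Assumed usage with boto3 (not executed here):
#   boto3.client("budgets").create_budget(
#       **budget_request("123456789012", "team-a-monthly", 5000, topic_arn))
```

Pointing the budget notifications at the same SNS topic as the anomaly alerts would let both alert types flow through one routing path.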

Deliverable Checklist

  • Cloud-native anomaly detection enabled (per provider)
  • Organisation-level monitor configured
  • Account/subscription/project-level monitors configured
  • Serverless alert router deployed with tag-based routing
  • Fallback routing to FinOps central channel
  • Budget alerts configured per account/subscription/project
  • Alert routing tested with a synthetic anomaly