Why
A cost spike that runs unnoticed for two weeks costs 14× what it would have cost if caught on day one. Without automated detection, anomalies are discovered when the monthly invoice arrives, and by then the money is already spent and cannot be recovered. Budget alerts provide early visibility; ML-based detection catches patterns that static thresholds miss.
Without Anomaly Detection
═══════════════════════════════════════════════════════════════
Day 1          Day 7          Day 30         Day 35
Anomaly        Still          Invoice        Someone
starts         running        arrives        notices
$500/hr        $84K           $360K          "Why is
runaway        wasted         on the bill    this so high?"
◄──────── Detection gap: 30+ days ─────────────────────────►
◄──────── Financial impact: unrecoverable ──────────────────►
What
Deploy automated anomaly detection using cloud-native tools at multiple levels of the hierarchy (organisation, account, service), and route each alert to the person who can act on it.
How
Choose Detection Method
Use cloud-native anomaly detection as the starting point — it’s free or low-cost and integrates with billing data natively.
Native Detection Pipelines
═══════════════════════════════════════════════════════════════
AWS:   Cost Anomaly Detection → SNS → Chatbot → Slack/Teams
                                └→ Webhook → PagerDuty
GCP:   Anomaly Detection → Pub/Sub → Function → Slack/PagerDuty
Azure: Anomaly Alert → Action Group → Logic App → Slack/Teams
                       └→ Webhook → PagerDuty
AWS uses “Chatbot” for zero-code Slack/Teams integration. Azure uses Logic Apps for drag-and-drop workflows. GCP requires a Cloud Function (a small code snippet) to route alerts.
Configure Detection Scope
Set up monitors at different levels for different purposes:
| Monitor Scope | What It Catches | Alert Recipient |
|---|---|---|
| Organisation / Billing | Catastrophic org-wide spikes | FinOps lead, CTO |
| Account / Sub / Project | Team or application-level spikes | Service Owner, Eng Manager |
| Service level | Specific service runaway (e.g., egress) | Engineer, SRE |
| Individual resource | Single resource gone rogue (optional) | Resource owner (via tag) |
Don’t monitor everything at the same granularity. Organisation-level catches catastrophic events (compromised credentials, massive deployment errors). Account-level catches team-specific issues. Service-level catches runaway services like data transfer spikes.
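The scope table above can be kept as data so that monitors and their recipients are defined in one place. The following is a minimal sketch, not a provider API: the `Scope` enum, the `Monitor` dataclass, and the `recipients_for` helper are hypothetical names introduced for illustration, and the recipient strings are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Scope(Enum):
    ORGANISATION = "organisation"
    ACCOUNT = "account"
    SERVICE = "service"
    RESOURCE = "resource"


@dataclass(frozen=True)
class Monitor:
    scope: Scope
    catches: str
    recipients: List[str]
    optional: bool = False  # resource-level monitoring is marked optional in the table


# One entry per row of the scope table (hypothetical in-code representation).
MONITORS: Dict[Scope, Monitor] = {
    Scope.ORGANISATION: Monitor(
        Scope.ORGANISATION, "Catastrophic org-wide spikes", ["FinOps lead", "CTO"]
    ),
    Scope.ACCOUNT: Monitor(
        Scope.ACCOUNT, "Team or application-level spikes", ["Service Owner", "Eng Manager"]
    ),
    Scope.SERVICE: Monitor(
        Scope.SERVICE, "Specific service runaway (e.g., egress)", ["Engineer", "SRE"]
    ),
    Scope.RESOURCE: Monitor(
        Scope.RESOURCE, "Single resource gone rogue", ["Resource owner (via tag)"], optional=True
    ),
}


def recipients_for(scope: Scope) -> List[str]:
    """Return the alert recipients for a monitor scope."""
    return list(MONITORS[scope].recipients)


def required_scopes() -> List[Scope]:
    """Scopes that should always have a monitor (everything not marked optional)."""
    return [m.scope for m in MONITORS.values() if not m.optional]
```

Keeping the mapping as data means a deployment script could iterate over `required_scopes()` and create one monitor per scope with whichever provider tool is in use.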
Set Up Notification Routing
Detection is worthless if the alert goes to a generic inbox. Route to the person who can act.
Routing Logic
═══════════════════════════════════════════════════════════════
ANOMALY FOUND
│
▼
CHECK FOR "OWNER" TAG
│
├── YES → Route to owner (Slack/Email)
│
└── NO → Escalate to FinOps Practitioner
         (manual triage for "homeless" spend)
Implement a serverless router: a lightweight function (Lambda, Azure Function, Cloud Function) that enriches each alert by looking up the owner tag on the affected resource, then routes the notification to the correct Slack channel or email.
The function logic:
- Parse the alert payload to get Account ID and Resource ID
- Call Cloud API to read resource tags
- If an `owner` tag exists, look up their Slack channel in a config file
- Post a formatted message to the correct channel
- If there is no `owner` tag, post to `#finops-central` for manual triage
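The routing steps listed above can be sketched in a provider-neutral way. This is an illustration under stated assumptions, not a finished Lambda or Cloud Function: the payload fields (`account_id`, `resource_id`, `impact_usd`), the `OWNER_CHANNELS` map, and the `get_tags` callback are hypothetical, and real alert payloads differ per provider. The tag lookup is injected so the logic can be exercised without cloud credentials, and the function returns the chosen channel and message rather than posting to Slack itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping

FALLBACK_CHANNEL = "#finops-central"

# Hypothetical owner-to-channel map; the text suggests keeping this in a config file.
OWNER_CHANNELS: Dict[str, str] = {
    "payments-team": "#payments-alerts",
    "data-platform": "#data-platform-alerts",
}

# Tag lookup signature: (account_id, resource_id) -> resource tags.
TagLookup = Callable[[str, str], Mapping[str, str]]


@dataclass(frozen=True)
class RoutedAlert:
    channel: str
    message: str


def route_alert(
    alert: Mapping[str, Any],
    get_tags: TagLookup,
    owner_channels: Mapping[str, str] = OWNER_CHANNELS,
) -> RoutedAlert:
    # 1. Parse the alert payload (assumed field names, not a provider schema).
    account_id = str(alert["account_id"])
    resource_id = str(alert["resource_id"])
    impact = float(alert.get("impact_usd", 0.0))

    # 2. Read the resource tags through the injected cloud API lookup.
    tags = get_tags(account_id, resource_id)

    # 3. Resolve the owner tag to a channel. A missing tag, or an owner with
    #    no configured channel, falls back to central triage (an assumption).
    owner = tags.get("owner")
    channel = owner_channels.get(owner, FALLBACK_CHANNEL) if owner else FALLBACK_CHANNEL

    # 4. Build the formatted message; the caller posts it to the channel.
    message = (
        f"Cost anomaly in account {account_id}: {resource_id} "
        f"(owner: {owner or 'untagged'}, estimated impact ${impact:,.2f})"
    )
    return RoutedAlert(channel=channel, message=message)


if __name__ == "__main__":
    def fake_tags(account_id: str, resource_id: str) -> Mapping[str, str]:
        # Stand-in for a real tag API call.
        return {"owner": "payments-team"} if resource_id == "i-123" else {}

    alert = {"account_id": "111122223333", "resource_id": "i-123", "impact_usd": 500.0}
    print(route_alert(alert, fake_tags))
```

In a deployed function, `get_tags` would wrap the provider's tagging API and the returned `RoutedAlert` would be handed to a Slack or email client.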
Configure Budget Alerts
In addition to ML-based anomaly detection, set static budget alerts at provisioning time as a safety net.
| Provider | Tool | Configuration |
|---|---|---|
| AWS | AWS Budgets | Per-account budgets with 50%, 80%, 100% thresholds |
| Azure | Azure Budgets | Per-subscription budgets with progressive alerts |
| GCP | GCP Billing Budgets | Per-project budgets with Pub/Sub notification |
Budget alerts should be provisioned automatically as part of the workload onboarding pipeline (links to S2-02 Automated Provisioning).
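To make the progressive thresholds concrete, the check a static budget alert performs can be sketched as a small function. This is an illustration only: `crossed_thresholds` and `DEFAULT_THRESHOLDS` are hypothetical names, and the 50%, 80%, 100% values come from the AWS row of the table above. The actual evaluation is done by each provider's budget service.

```python
from typing import List, Sequence

# Thresholds from the table above, expressed as fractions of the budget.
DEFAULT_THRESHOLDS = (0.5, 0.8, 1.0)


def crossed_thresholds(
    spend: float,
    budget: float,
    thresholds: Sequence[float] = DEFAULT_THRESHOLDS,
) -> List[float]:
    """Return every threshold that current spend has reached, lowest first."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in sorted(thresholds) if ratio >= t]


if __name__ == "__main__":
    # A $10,000 monthly budget with $8,500 spent so far.
    for t in crossed_thresholds(8_500, 10_000):
        print(f"Alert: {t:.0%} of budget reached")
```

An onboarding pipeline could hold the same threshold tuple as configuration and pass it to whichever budget tool the provider offers, so every new account, subscription, or project starts with identical alerts.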
Deliverable Checklist
- Cloud-native anomaly detection enabled (per provider)
- Organisation-level monitor configured
- Account/subscription/project-level monitors configured
- Serverless alert router deployed with tag-based routing
- Fallback routing to FinOps central channel
- Budget alerts configured per account/subscription/project
- Alert routing tested with a synthetic anomaly