Why
Spot instances (AWS), Spot and preemptible VMs (GCP), and Spot VMs (Azure) typically cost 60–90% less than on-demand capacity, with the exact discount depending on instance type, region, and current demand. For eligible workloads (batch processing, CI/CD, dev/test, stateless services), moving to spot is often the largest saving available from a single action. Many organisations avoid spot out of fear of interruptions; a clear eligibility policy and an interruption handling strategy reduce that risk to a manageable level.
What
Identify fault-tolerant and stateless workloads eligible for spot/preemptible instances, define an eligibility policy, and implement interruption handling.
How
Define Eligibility Criteria
| Eligible (Good candidates) | Not Eligible (Avoid) |
|---|---|
| Batch processing jobs | Stateful databases |
| CI/CD build agents | Single-instance production services |
| Dev/test environments | Long-running transactions (>2 hours) |
| Stateless web workers (in an Auto Scaling group) | Services without health checks or automatic restart |
| Data processing / ETL | Workloads without graceful shutdown |
| ML training (with checkpointing) | Compliance workloads requiring guaranteed uptime |
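
The criteria in the table can be turned into a simple automated check, which helps when screening a large inventory of workloads. The sketch below is one possible encoding, not a definitive policy: the `Workload` fields and the `spot_blockers` function are hypothetical names introduced for illustration, and the 2-hour threshold is taken from the table.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; all field names are illustrative."""
    name: str
    stateless_or_checkpointed: bool   # losing an instance does not lose data
    replicas: int                     # instances serving the same role
    has_health_check: bool            # failed instances are detected and restarted
    graceful_shutdown: bool           # handles a termination signal cleanly
    max_uninterruptible_hours: float  # longest unit of work that cannot be resumed
    is_production: bool
    requires_guaranteed_uptime: bool  # e.g. a compliance requirement


def spot_blockers(w: Workload) -> list[str]:
    """Return the reasons a workload is not spot-eligible (empty list = eligible)."""
    blockers = []
    if not w.stateless_or_checkpointed:
        blockers.append("stateful without checkpointing")
    if w.is_production and w.replicas < 2:
        blockers.append("single-instance production service")
    if w.max_uninterruptible_hours > 2:
        blockers.append("uninterruptible work longer than 2 hours")
    if not w.has_health_check:
        blockers.append("no health check / restart")
    if not w.graceful_shutdown:
        blockers.append("no graceful shutdown")
    if w.requires_guaranteed_uptime:
        blockers.append("requires guaranteed uptime")
    return blockers


# Example inventory entries (made-up workloads)
ci = Workload("ci-agents", True, 10, True, True, 0.5, False, False)
db = Workload("orders-db", False, 1, True, True, 0.0, True, True)
print(spot_blockers(ci))  # []
print(spot_blockers(db))
```

Returning the list of blockers rather than a yes/no answer makes the result usable in the published policy: each rejected workload comes with the reason it was rejected.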
Implement Interruption Handling
| Provider | Interruption Notice | Handling Strategy |
|---|---|---|
| AWS | 2-minute warning | EventBridge rule → drain connections → checkpoint → terminate gracefully |
| Azure | 30-second warning | Scheduled Events API → graceful shutdown |
| GCP | 30-second warning | Shutdown script → checkpoint → terminate |
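For containerised workloads, use spot-aware node groups (EKS managed node groups with Spot capacity, AKS spot node pools, GKE Spot or preemptible node pools) together with pod disruption budgets, which limit how many replicas of a service are evicted at once when a node is drained.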
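
On plain VMs, the notices in the table can also be read from each provider's instance metadata service. The sketch below covers only the detection step and assumes the caller fetches the endpoints itself (AWS IMDSv2 needs a session token, Azure needs the `Metadata: true` header, GCP needs `Metadata-Flavor: Google`). The function names are hypothetical, and the response shapes follow the providers' documented examples, so verify them against the current documentation before relying on them.

```python
import json
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Metadata endpoints, reachable only from inside a VM. Fetching is left to the caller.
AWS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
AZURE_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
GCP_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted"


def aws_deadline(status: int, body: str) -> Optional[datetime]:
    """AWS answers 404 until an interruption is scheduled, then JSON with a UTC time."""
    if status != 200:
        return None
    action = json.loads(body)
    parsed = datetime.strptime(action["time"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def azure_deadline(body: str) -> Optional[datetime]:
    """Azure lists scheduled events; a spot eviction has EventType 'Preempt'.

    Assumes NotBefore is populated in RFC 1123 form, as in the documented example.
    """
    for event in json.loads(body).get("Events", []):
        if event.get("EventType") == "Preempt":
            return parsedate_to_datetime(event["NotBefore"])
    return None


def gcp_preempted(body: str) -> bool:
    """GCP's 'preempted' metadata value is the text TRUE or FALSE."""
    return body.strip() == "TRUE"


def seconds_left(deadline: datetime, now: datetime) -> float:
    """Time remaining to drain and checkpoint; never negative."""
    return max(0.0, (deadline - now).total_seconds())
```

A polling loop would call these every few seconds and, once a deadline appears, stop accepting new work, drain connections, and write a checkpoint within `seconds_left`. What draining and checkpointing involve is workload-specific and not shown here.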
Deploy and Measure
Start with non-production workloads. Measure interruption frequency, job completion rate, and cost savings against an on-demand baseline. Expand to eligible production workloads once these metrics show that interruption handling works reliably.
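
The three measurements can be computed from figures most teams already have in their scheduler and billing export. The sketch below is one way to do it; `SpotPeriod`, its fields, and `summarise` are hypothetical names, and normalising interruptions per 100 instance-hours is an assumption chosen to make periods of different size comparable.

```python
from dataclasses import dataclass


@dataclass
class SpotPeriod:
    """Hypothetical figures for one reporting period (e.g. a month)."""
    instance_hours: float        # total spot instance-hours consumed
    interruptions: int           # number of reclaim events
    jobs_started: int
    jobs_completed: int
    spot_cost: float             # what was actually paid
    on_demand_equivalent: float  # the same hours priced at on-demand rates


def summarise(p: SpotPeriod) -> dict[str, float]:
    """Compute interruption frequency, job completion rate, and savings."""
    if p.instance_hours <= 0 or p.jobs_started <= 0 or p.on_demand_equivalent <= 0:
        raise ValueError("period has no usage to measure")
    return {
        "interruptions_per_100_instance_hours": 100 * p.interruptions / p.instance_hours,
        "job_completion_rate": p.jobs_completed / p.jobs_started,
        "savings": p.on_demand_equivalent - p.spot_cost,
        "savings_pct": 100 * (1 - p.spot_cost / p.on_demand_equivalent),
    }


# Made-up example figures, for illustration only
report = summarise(SpotPeriod(2000, 12, 400, 392, 300.0, 1000.0))
for metric, value in report.items():
    print(f"{metric}: {value:.2f}")
```

Tracking the same summary each month gives a consistent basis for deciding when to expand to production workloads.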
Deliverable Checklist
- Eligibility policy defined and published
- Eligible workloads identified with savings estimate
- Interruption handling implemented per provider
- Spot deployed for non-prod workloads
- Savings measured and tracked monthly