Why

Spot Instances (AWS), Spot/Preemptible VMs (GCP), and Spot VMs (Azure) are typically discounted by 60–90% compared to on-demand pricing. For eligible workloads (batch processing, CI/CD, dev/test, stateless services), moving to spot is often the largest saving available from a single change. Many organisations avoid spot out of fear of interruptions; a clear eligibility policy and an interruption handling strategy reduce that risk to a level the workload can tolerate.

What

Identify fault-tolerant and stateless workloads eligible for spot/preemptible instances, define an eligibility policy, and implement interruption handling.

How

Define Eligibility Criteria

Eligible (Good candidates)	Not Eligible (Avoid)
Batch processing jobs	Stateful databases
CI/CD build agents	Single-instance production services
Dev/test environments	Long-running transactions (>2 hours)
Stateless web workers (with ASG)	Services without health check / restart
Data processing / ETL	Workloads without graceful shutdown
ML training (checkpointing)	Compliance workloads requiring guaranteed uptime
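
The criteria above could be encoded as a simple screening function so that the policy is applied consistently across teams. The following is a minimal sketch, not a definitive policy engine: the `Workload` fields and the `spot_blockers` name are hypothetical, and treating checkpointing as an exemption from the 2-hour and statefulness rules is an assumption made to reconcile the "ML training (checkpointing)" row with the "Not Eligible" column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative only."""
    name: str
    production: bool
    stateless: bool            # no local state lost when the instance is reclaimed
    replicas: int              # number of instances serving the workload
    has_health_check: bool     # failed instances are detected and restarted
    graceful_shutdown: bool    # handles a termination signal cleanly
    max_task_hours: float      # longest uninterruptible unit of work
    checkpoints: bool = False  # work can resume from a saved checkpoint
    guaranteed_uptime: bool = False  # compliance requires guaranteed uptime


def spot_blockers(w: Workload, max_task_hours: float = 2.0) -> List[str]:
    """Return the reasons a workload is not eligible; an empty list means eligible."""
    blockers = []
    # Assumption: checkpointing makes stateful or long-running work acceptable.
    if not w.stateless and not w.checkpoints:
        blockers.append("stateful without checkpointing")
    if w.production and w.replicas < 2:
        blockers.append("single-instance production service")
    if w.max_task_hours > max_task_hours and not w.checkpoints:
        blockers.append("long-running task without checkpointing")
    if not w.has_health_check:
        blockers.append("no health check / restart")
    if not w.graceful_shutdown:
        blockers.append("no graceful shutdown")
    if w.guaranteed_uptime:
        blockers.append("compliance requires guaranteed uptime")
    return blockers
```

Returning the list of blockers rather than a boolean makes the result usable in the published policy: each rejected workload carries the reason it was rejected.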

Implement Interruption Handling

Provider	Interruption Notice	Handling Strategy
AWS	2-minute warning	EventBridge rule → drain connections → checkpoint → terminate gracefully
Azure	30-second warning	Scheduled Events API → graceful shutdown
GCP	30-second warning	Shutdown script → checkpoint → terminate
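
As an illustration of the AWS row, the sketch below watches for the 2-minute notice from on the instance itself by polling the instance metadata service, which is an alternative to the EventBridge rule named in the table. The metadata paths and IMDSv2 token headers are the documented AWS ones; the function names, the polling interval default, and the `on_interrupt` callback are assumptions for illustration, and the drain/checkpoint logic is left to the caller.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw spot/instance-action body, or None if no notice exists."""
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is scheduled
            return None
        raise


def parse_instance_action(body: Optional[str]) -> Optional[dict]:
    """Parse the notice; return None when there is nothing to act on."""
    if body is None:
        return None
    notice = json.loads(body)
    if notice.get("action") in {"stop", "terminate", "hibernate"}:
        return notice
    return None


def watch_for_interruption(
    on_interrupt: Callable[[dict], None],
    fetch: Callable[[], Optional[str]] = fetch_instance_action,
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until a notice arrives, then call on_interrupt once.

    Returns True if a notice was handled, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = parse_instance_action(fetch())
        if notice is not None:
            on_interrupt(notice)  # caller drains connections and checkpoints here
            return True
        polls += 1
        sleep(poll_seconds)
    return False
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised off-instance with a fake notice before it is trusted in production.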

For containerised workloads: use spot-aware node groups (EKS managed node groups with Spot capacity, AKS Spot node pools, GKE Spot or preemptible node pools) and define pod disruption budgets so that node drains do not evict too many replicas at once.
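
Whichever provider or orchestrator delivers the notice, the workload usually sees it as a termination signal (SIGTERM for containers being evicted) and has to stop at a safe point. The following is a minimal sketch of a checkpointing job under that assumption; the `CheckpointedJob` class, its JSON checkpoint format, and the `next_index` key are hypothetical, and it assumes items are processed in order and that processing is safe to resume from the last saved index.

```python
import json
import os
import signal
from typing import Callable, Sequence


class CheckpointedJob:
    """Processes items in order and records progress after each one."""

    def __init__(self, checkpoint_path: str) -> None:
        self.checkpoint_path = checkpoint_path
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None) -> None:
        # Signal handlers should do as little as possible: just set a flag.
        self.stop_requested = True

    def install_signal_handler(self) -> None:
        signal.signal(signal.SIGTERM, self.request_stop)

    def load_offset(self) -> int:
        try:
            with open(self.checkpoint_path) as f:
                return json.load(f)["next_index"]
        except FileNotFoundError:
            return 0  # no checkpoint yet: start from the beginning

    def save_offset(self, next_index: int) -> None:
        # Write to a temp file and rename so a kill mid-write
        # cannot leave a truncated checkpoint behind.
        tmp = self.checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": next_index}, f)
        os.replace(tmp, self.checkpoint_path)

    def run(self, items: Sequence, process: Callable) -> bool:
        """Return True if all items finished, False if stopped early."""
        for index in range(self.load_offset(), len(items)):
            if self.stop_requested:
                return False
            process(items[index])
            self.save_offset(index + 1)
        return True
```

A replacement instance that starts the same job with the same checkpoint path picks up at the first unprocessed item. The checkpoint would need to live on storage that survives the instance (a network volume or object store) for this to help after a reclaim.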

Deploy and Measure

Start with non-production workloads and measure three things: interruption frequency, job completion rate, and cost savings. Expand to eligible production workloads once the team is confident in the interruption handling.
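
The three measurements can be reduced to a small monthly summary. This is a sketch under stated assumptions: the `SpotReport` fields and the `summarise` function are hypothetical, interruption frequency is normalised per 100 instance-hours (one of several reasonable choices), and the on-demand equivalent cost is assumed to come from pricing the same instance-hours at on-demand rates.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SpotReport:
    """Hypothetical monthly inputs, taken from billing and job-scheduler data."""
    instance_hours: float
    interruptions: int
    jobs_started: int
    jobs_completed: int
    spot_cost: float
    on_demand_equivalent_cost: float  # same usage priced at on-demand rates


def summarise(r: SpotReport) -> Dict[str, float]:
    """Compute interruption frequency, completion rate, and savings."""
    if r.instance_hours <= 0 or r.jobs_started <= 0 or r.on_demand_equivalent_cost <= 0:
        raise ValueError("instance_hours, jobs_started and on-demand cost must be positive")
    saved = r.on_demand_equivalent_cost - r.spot_cost
    return {
        "interruptions_per_100_instance_hours": 100 * r.interruptions / r.instance_hours,
        "job_completion_rate": r.jobs_completed / r.jobs_started,
        "savings_amount": saved,
        "savings_pct": 100 * saved / r.on_demand_equivalent_cost,
    }
```

Tracking the completion rate alongside savings keeps the comparison honest: a discount that comes with a falling completion rate may be costing more in reruns than it saves.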

Deliverable Checklist

Eligibility policy defined and published
Eligible workloads identified with savings estimate
Interruption handling implemented per provider
Spot deployed for non-prod workloads
Savings measured and tracked monthly