PHASE 03 // IMPLEMENT

S7-03 · Optimize Usage & Cost · Architecting & Workload Placement

Assess Spot & Preemptible Instance Usage

Why

Spot instances (AWS), Preemptible/Spot VMs (GCP), and Spot VMs (Azure) typically cost 60–90% less than on-demand pricing. For eligible workloads (batch processing, CI/CD, dev/test, stateless services), this is often the largest saving available from a single action. Many organisations avoid spot for fear of interruptions. A clear eligibility policy and an interruption handling strategy make that risk manageable.

What

Identify fault-tolerant and stateless workloads eligible for spot/preemptible instances, define an eligibility policy, and implement interruption handling.

How

Define Eligibility Criteria

Eligible (Good candidates) | Not Eligible (Avoid)
Batch processing jobs | Stateful databases
CI/CD build agents | Single-instance production services
Dev/test environments | Long-running transactions (>2 hours)
Stateless web workers (with ASG) | Services without health check / restart
Data processing / ETL | Workloads without graceful shutdown
ML training (checkpointing) | Compliance workloads requiring guaranteed uptime
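
The criteria above can also be written as a small, reviewable check. The following is a minimal sketch only: the Workload fields, the blocker wording, and the way each table row maps to a field are assumptions made for illustration, not part of any provider API.

```python
from dataclasses import dataclass
from typing import List

# Threshold taken from the ">2 hours" row of the eligibility table.
MAX_UNINTERRUPTIBLE_HOURS = 2.0


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload; all field names are assumptions."""
    name: str
    production: bool
    replicas: int
    stateful: bool
    checkpoints: bool
    has_health_check: bool
    graceful_shutdown: bool
    longest_task_hours: float
    needs_guaranteed_uptime: bool = False


def spot_blockers(w: Workload) -> List[str]:
    """Return the reasons a workload is not eligible (empty list = eligible)."""
    blockers = []
    if w.stateful and not w.checkpoints:
        blockers.append("stateful without checkpointing")
    if w.production and w.replicas < 2:
        blockers.append("single-instance production service")
    if w.longest_task_hours > MAX_UNINTERRUPTIBLE_HOURS and not w.checkpoints:
        blockers.append("long-running work that cannot resume")
    if not w.has_health_check:
        blockers.append("no health check / restart")
    if not w.graceful_shutdown:
        blockers.append("no graceful shutdown")
    if w.needs_guaranteed_uptime:
        blockers.append("compliance requires guaranteed uptime")
    return blockers


def is_spot_eligible(w: Workload) -> bool:
    """A workload is eligible when no blocker applies."""
    return not spot_blockers(w)
```

Returning the list of blockers rather than a bare yes/no lets the published policy explain each rejection, which also helps when estimating what would need to change to make a workload eligible.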

Implement Interruption Handling

Provider | Interruption Notice | Handling Strategy
AWS | 2-minute warning | EventBridge rule → drain connections → checkpoint → terminate gracefully
Azure | 30-second warning | Scheduled Events API → graceful shutdown
GCP | 30-second warning | Shutdown script → checkpoint → terminate
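
To make the three notice mechanisms concrete, the sketch below recognises an interruption notice from each provider's payload and derives a time budget for draining and checkpointing. The payload shapes are assumptions that should be verified against current provider documentation: an EventBridge event with detail-type "EC2 Spot Instance Interruption Warning", an Azure Scheduled Events document containing an event of type "Preempt", and the GCP metadata value instance/preempted. The function names and the safety margin are hypothetical.

```python
from typing import Any, Dict

# Notice periods from the table above, in seconds.
NOTICE_SECONDS = {"aws": 120, "azure": 30, "gcp": 30}


def aws_is_interruption(event: Dict[str, Any]) -> bool:
    """True for an EventBridge event announcing a Spot interruption."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Spot Instance Interruption Warning"
    )


def azure_is_interruption(document: Dict[str, Any]) -> bool:
    """True if a Scheduled Events document lists a Preempt event."""
    return any(e.get("EventType") == "Preempt" for e in document.get("Events", []))


def gcp_is_interruption(preempted_value: str) -> bool:
    """True if the metadata server reports instance/preempted as TRUE."""
    return preempted_value.strip().upper() == "TRUE"


def shutdown_budget(provider: str, safety_margin: float = 5.0) -> float:
    """Seconds left for drain and checkpoint after an assumed safety margin."""
    return max(0.0, NOTICE_SECONDS[provider] - safety_margin)
```

With only about 30 seconds on Azure and GCP, the checkpoint step has to fit a much smaller budget than on AWS, so it is worth timing the checkpoint during non-production trials.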

For containerised workloads: use spot-aware node groups (EKS managed node groups, AKS spot node pools, GKE Spot or preemptible node pools) together with pod disruption budgets.
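
Inside a container, a node drain usually reaches the application as a SIGTERM, so the worker itself needs to stop cleanly. The following is a minimal sketch, assuming the work can be split into items and that a checkpoint callback persists progress; process_items, install_sigterm_handler, and the callback signatures are hypothetical names chosen for this example.

```python
import signal
import threading
from typing import Callable, Iterable

# Set when the process has been asked to stop (for example on SIGTERM).
stop_requested = threading.Event()


def _on_sigterm(signum, frame) -> None:
    """Signal handler: only record the request, do the real work elsewhere."""
    stop_requested.set()


def install_sigterm_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)


def process_items(
    items: Iterable,
    checkpoint: Callable[[int], None],
    handle: Callable[[object], None],
) -> int:
    """Process items in order and checkpoint after each one.

    Stops before starting a new item once a stop has been requested.
    Returns the number of items completed, so a restarted worker can
    resume from that offset.
    """
    done = 0
    for item in items:
        if stop_requested.is_set():
            break
        handle(item)
        done += 1
        checkpoint(done)
    return done
```

Checking the flag between items keeps each item atomic, which only works if a single item finishes well within the provider's notice period.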

Deploy and Measure

Start with non-production workloads and measure interruption frequency, job completion rate, and cost savings. Expand to eligible production workloads once those measurements show that interruption handling works reliably and the team is confident in it.
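
The three measurements can be computed from a handful of counters. This is a sketch under stated assumptions: the PilotStats fields and the per-100-instance-hours normalisation are illustrative choices, and the on-demand baseline is taken as the same instance hours priced at the on-demand rate.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PilotStats:
    """Hypothetical counters collected during a spot trial."""
    instance_hours: float     # total spot instance hours consumed
    interruptions: int        # interruption notices received
    jobs_started: int
    jobs_completed: int
    on_demand_rate: float     # on-demand price per instance hour
    spot_cost: float          # actual spot spend for the same period


def summarize(stats: PilotStats) -> Dict[str, float]:
    """Return interruption frequency, completion rate, and savings."""
    baseline = stats.instance_hours * stats.on_demand_rate
    savings = baseline - stats.spot_cost
    return {
        "interruptions_per_100_instance_hours":
            100.0 * stats.interruptions / stats.instance_hours,
        "job_completion_rate": stats.jobs_completed / stats.jobs_started,
        "savings": savings,
        "savings_pct": 100.0 * savings / baseline,
    }
```

Tracking the same summary every month gives a consistent basis for the savings figure in the deliverable checklist.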

Deliverable Checklist

  • Eligibility policy defined and published
  • Eligible workloads identified with savings estimate
  • Interruption handling implemented per provider
  • Spot deployed for non-prod workloads
  • Savings measured and tracked monthly