vastai-prod-checklist Claude Skill
Execute Vast.ai production deployment checklist and rollback procedures.
| name | vastai-prod-checklist |
| description | Execute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist". |
| allowed-tools | Read, Bash(vastai:*), Bash(curl:*), Grep |
| version | 1.0.0 |
| license | MIT |
| author | Jeremy Longshore <jeremy@intentsolutions.io> |
| compatible-with | claude-code, codex, openclaw |
| tags | ["saas","vast-ai","deployment"] |
Vast.ai Production Checklist
Overview
Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.
Prerequisites
- Vast.ai account with sufficient credits
- Docker images tested and published to registry
- Checkpoint-based training pipeline
Instructions
Account & Authentication
- API key stored in secrets manager (not in code or env files)
- Dedicated SSH key pair for Vast.ai (not shared with other services)
- Account balance sufficient for planned workload duration + 50% buffer
- Billing alerts configured at cloud.vast.ai
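The balance rule above is simple arithmetic, and a short sketch makes the 50% buffer concrete. The function names here are hypothetical, and the hourly price is assumed to be the `dph_total` value of the offer you intend to rent.

```python
def required_balance(dph_total: float, planned_hours: float, buffer: float = 0.5) -> float:
    """Minimum account balance (USD) for a planned run, including the safety buffer."""
    if dph_total < 0 or planned_hours < 0:
        raise ValueError("dph_total and planned_hours must be non-negative")
    return dph_total * planned_hours * (1.0 + buffer)


def balance_is_sufficient(balance: float, dph_total: float, planned_hours: float) -> bool:
    """Compare the current balance against the buffered requirement."""
    return balance >= required_balance(dph_total, planned_hours)
```

For a 12-hour job at $2.00/hr, this asks for 2.00 × 12 × 1.5 = $36.00 before launch.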
Instance Selection
- GPU type validated for workload (VRAM, compute capability)
- Reliability filter set to >= 0.98 for production jobs
- Internet speed filter set to inet_down >= 200 for data transfer
- Disk allocation includes room for checkpoints + data + 20% overhead
- CUDA version supported by the host meets or exceeds the version the Docker image requires
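As a rough illustration of the selection criteria above, the sketch below builds a query string from the filters this checklist names (`reliability`, `inet_down`, `num_gpus`, `rentable`, `dph_total`) and sizes the disk with the 20% overhead. The helper names are hypothetical; check the field names against your installed `vastai` CLI before relying on them.

```python
import math
from typing import Optional


def disk_gb_needed(data_gb: float, checkpoint_gb: float, overhead: float = 0.20) -> int:
    """Disk to request: data + checkpoints plus overhead, rounded up to whole GB."""
    raw = (data_gb + checkpoint_gb) * (1.0 + overhead)
    # Round first so floating-point noise does not push the result up by one GB.
    return math.ceil(round(raw, 6))


def build_offer_query(min_reliability: float = 0.98,
                      min_inet_down: int = 200,
                      num_gpus: int = 1,
                      max_dph_total: Optional[float] = None) -> str:
    """Assemble a space-separated filter string for an offer search."""
    parts = [
        f"reliability>={min_reliability:g}",
        f"inet_down>={min_inet_down:d}",
        f"num_gpus={num_gpus:d}",
        "rentable=true",
    ]
    if max_dph_total is not None:
        parts.append(f"dph_total<={max_dph_total:g}")
    return " ".join(parts)
```

The resulting string is meant to be passed as the quoted argument to `vastai search offers ... --raw`, in the same way the verification script does.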
Data Safety
- Training data encrypted before upload to instances
- Checkpoint saving every N steps (not just per epoch)
- Checkpoints uploaded to persistent storage (S3/GCS) periodically
- Instance cleanup script removes data before destruction
- No sensitive data (API keys, PII) embedded in Docker images
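One way to make step-based checkpoints safe to move between an instance and persistent storage is to record a digest at save time and re-check it after download. The sketch below is an assumption about how that could look, not a Vast.ai feature: the `.sha256` sidecar convention and all function names are hypothetical, and the upload itself (S3/GCS) is left out.

```python
import hashlib
from pathlib import Path


def should_checkpoint(step: int, every_n_steps: int) -> bool:
    """Save every N optimizer steps rather than only at epoch boundaries."""
    return step > 0 and step % every_n_steps == 0


def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large checkpoints are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_digest(path: Path) -> Path:
    """Write '<name>.sha256' next to the checkpoint and return the sidecar path."""
    sidecar = path.with_name(path.name + ".sha256")
    sidecar.write_text(sha256_of(path))
    return sidecar


def verify_digest(path: Path) -> bool:
    """True only if the sidecar exists and matches the file's current contents."""
    sidecar = path.with_name(path.name + ".sha256")
    return sidecar.exists() and sidecar.read_text().strip() == sha256_of(path)
```

Uploading the sidecar alongside the checkpoint lets a replacement instance call `verify_digest` before resuming.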
Spot Instance Protection
- Spot preemption handler implemented (save checkpoint on SIGTERM)
- Auto-recovery: detect destroyed instance, provision replacement, resume
- On-demand fallback configured for critical final training stages
- Checkpoint integrity verification after recovery
Monitoring & Alerting
- GPU utilization monitoring (alert if < 50% for > 10 min)
- Instance health polling every 60 seconds
- Cost accumulation tracking with budget threshold alerts
- Training loss/metrics logged to external service (W&B, MLflow)
- Dead instance detection (auto-destroy stuck instances)
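The utilization rule above (below 50% for more than 10 minutes, polled every 60 seconds) can be expressed as a small pure function. This is a sketch under the assumption that you already collect one utilization percentage per poll, for example from `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`; the function name is hypothetical.

```python
from typing import Sequence


def low_utilization_alert(samples: Sequence[float],
                          poll_interval_s: int = 60,
                          threshold_pct: float = 50.0,
                          window_s: int = 600) -> bool:
    """True if the latest samples stayed below the threshold for longer than window_s."""
    # n samples span (n - 1) intervals, so this many are needed to exceed the window.
    needed = window_s // poll_interval_s + 2
    if len(samples) < needed:
        return False
    return all(s < threshold_pct for s in samples[-needed:])
```

With the defaults, 12 consecutive low readings (an 11-minute span) trigger the alert, and a single reading at or above 50% resets it.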
Cost Controls
- Maximum dph_total set in search queries
- Auto-destroy timeout for all instances (e.g., 24h max)
- Daily spending limit configured
- Cost-per-job tracking for budget reporting
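The controls above reduce to two checks per instance: has it outlived its timeout, and has the day's spend reached the limit? The sketch below is one possible shape for that decision. The `InstanceBudget` type and function names are hypothetical, and the running hours and daily spend are assumed to come from your own tracking.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InstanceBudget:
    dph_total: float        # hourly price of the rented offer, USD
    max_hours: float        # auto-destroy timeout
    daily_limit_usd: float  # spending cap across all instances for the day


def accrued_cost(dph_total: float, hours_running: float) -> float:
    """Cost of one instance so far, for per-job reporting."""
    return dph_total * hours_running


def should_destroy(budget: InstanceBudget,
                   hours_running: float,
                   spent_today_usd: float) -> Tuple[bool, str]:
    """Return (destroy?, reason) for a single instance."""
    if hours_running >= budget.max_hours:
        return True, "auto-destroy timeout reached"
    if spent_today_usd >= budget.daily_limit_usd:
        return True, "daily spending limit reached"
    return False, "within budget"
```

A polling loop could call `should_destroy` for each instance and, when it returns true, run `vastai destroy instance <id>` after a final checkpoint upload.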
Verification Script
```bash
#!/bin/bash
set -euo pipefail

echo "Vast.ai Production Readiness Check"

# 1. Auth: confirm the API key works and the balance is above a minimum
vastai show user --raw | python3 -c "
import sys, json
u = json.load(sys.stdin)
balance = u.get('balance', 0)
print(f'  Auth: OK | Balance: \${balance:.2f}')
assert balance >= 10, f'Balance too low: \${balance:.2f}'
" && echo "  Balance: PASS" || echo "  Balance: FAIL"

# 2. Offer availability: at least one matching offer must exist
COUNT=$(vastai search offers 'reliability>0.98 num_gpus=1 rentable=true' --raw --limit 1 \
  | python3 -c "import sys, json; print(len(json.load(sys.stdin)))")
if [ "$COUNT" -ge 1 ]; then
  echo "  Offers available: $COUNT+ | PASS"
else
  echo "  Offers available: 0 | FAIL"
fi

# 3. Docker image pullable
docker pull pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime > /dev/null 2>&1 \
  && echo "  Docker image: PASS" || echo "  Docker image: FAIL"

echo "Pre-flight checks complete."
```
Output
- Production readiness checklist verified
- Verification script passes all checks
- Cost controls and monitoring configured
- Data safety measures in place
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Insufficient balance | Credits depleted mid-job | Set up auto-top-up or balance alerts |
| Instance preempted during final epoch | Spot instance reclaimed | Use on-demand for final training stage |
| Checkpoint corrupted | Interrupted mid-save | Implement atomic checkpoint writes (save to temp, rename) |
| GPU utilization drops to 0% | Data pipeline bottleneck | Profile data loading; increase disk I/O |
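The atomic-write fix for corrupted checkpoints (save to temp, then rename) can be sketched as follows. The function name is hypothetical, and the payload is assumed to be already-serialized bytes; with a framework you would serialize into the temporary file instead.

```python
import os
import tempfile


def atomic_write_bytes(path: str, payload: bytes) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push data to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX within a single filesystem
    except BaseException:
        # Leave any previous checkpoint untouched and clean up the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-save, the previous checkpoint at `path` remains intact and only a stray `.tmp` file is left behind.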
Next Steps
For version upgrades, see vastai-upgrade-migration.
Examples
Pre-launch audit: Run the verification script, check all boxes, confirm Docker image pulls successfully, and verify at least 3 matching offers are available before starting a production training run.
Budget-safe launch: Cap dph_total at 2.00 in search queries (for example, dph_total<=2.00), set an auto-destroy timeout of 12 hours, and configure a daily spend alert at $50 to prevent cost overruns.