vastai-prod-checklistClaude Skill

Execute Vast.ai production deployment checklist and rollback procedures.

1.9k Stars
259 Forks
2025/10/10

Install & Download

Linux / macOS:

请登录后查看安装命令

Windows (PowerShell):

请登录后查看安装命令

Download and extract to ~/.claude/skills/

namevastai-prod-checklist
descriptionExecute Vast.ai production deployment checklist for GPU workloads. Use when deploying training pipelines to production, preparing for large-scale GPU jobs, or auditing production readiness. Trigger with phrases like "vastai production", "deploy vastai", "vastai go-live", "vastai launch checklist".
allowed-toolsRead, Bash(vastai:*), Bash(curl:*), Grep
version1.0.0
licenseMIT
authorJeremy Longshore <jeremy@intentsolutions.io>
compatible-withclaude-code, codex, openclaw
tags["saas","vast-ai","deployment"]

Vast.ai Production Checklist

Overview

Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.

Prerequisites

  • Vast.ai account with sufficient credits
  • Docker images tested and published to registry
  • Checkpoint-based training pipeline

Instructions

Account & Authentication

  • API key stored in secrets manager (not in code or env files)
  • Dedicated SSH key pair for Vast.ai (not shared with other services)
  • Account balance sufficient for planned workload duration + 50% buffer
  • Billing alerts configured at cloud.vast.ai

Instance Selection

  • GPU type validated for workload (VRAM, compute capability)
  • Reliability filter set to >= 0.98 for production jobs
  • Internet speed filter set to inet_down >= 200 for data transfer
  • Disk allocation includes room for checkpoints + data + 20% overhead
  • CUDA version on host matches Docker image requirements

Data Safety

  • Training data encrypted before upload to instances
  • Checkpoint saving every N steps (not just per epoch)
  • Checkpoints uploaded to persistent storage (S3/GCS) periodically
  • Instance cleanup script removes data before destruction
  • No sensitive data (API keys, PII) embedded in Docker images

Spot Instance Protection

  • Spot preemption handler implemented (save checkpoint on SIGTERM)
  • Auto-recovery: detect destroyed instance, provision replacement, resume
  • On-demand fallback configured for critical final training stages
  • Checkpoint integrity verification after recovery

Monitoring & Alerting

  • GPU utilization monitoring (alert if < 50% for > 10 min)
  • Instance health polling every 60 seconds
  • Cost accumulation tracking with budget threshold alerts
  • Training loss/metrics logged to external service (W&B, MLflow)
  • Dead instance detection (auto-destroy stuck instances)

Cost Controls

  • Maximum dph_total set in search queries
  • Auto-destroy timeout for all instances (e.g., 24h max)
  • Daily spending limit configured
  • Cost-per-job tracking for budget reporting

Verification Script

#!/bin/bash
set -euo pipefail
echo "Vast.ai Production Readiness Check"

# 1. Auth
vastai show user --raw | python3 -c "
import sys, json; u=json.load(sys.stdin)
balance = u.get('balance', 0)
print(f'  Auth: OK | Balance: \${balance:.2f}')
assert balance >= 10, f'Balance too low: \${balance:.2f}'
" && echo "  Balance: PASS" || echo "  Balance: FAIL"

# 2. Offer availability
COUNT=$(vastai search offers 'reliability>0.98 num_gpus=1 rentable=true' --raw --limit 1 | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")
echo "  Offers available: $COUNT+ | PASS"

# 3. Docker image pullable
docker pull pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime > /dev/null 2>&1 && echo "  Docker image: PASS" || echo "  Docker image: FAIL"

echo "Pre-flight checks complete."

Output

  • Production readiness checklist verified
  • Verification script passes all checks
  • Cost controls and monitoring configured
  • Data safety measures in place

Error Handling

ErrorCauseSolution
Insufficient balanceCredits depleted mid-jobSet up auto-top-up or balance alerts
Instance preempted during final epochSpot instance reclaimedUse on-demand for final training stage
Checkpoint corruptedInterrupted mid-saveImplement atomic checkpoint writes (save to temp, rename)
GPU utilization drops to 0%Data pipeline bottleneckProfile data loading; increase disk I/O

Resources

Next Steps

For version upgrades, see vastai-upgrade-migration.

Examples

Pre-launch audit: Run the verification script, check all boxes, confirm Docker image pulls successfully, and verify at least 3 matching offers are available before starting a production training run.

Budget-safe launch: Set max_dph=2.00, auto-destroy timeout of 12 hours, and daily spend alert at $50 to prevent cost overruns.

Similar Claude Skills & Agent Workflows