vastai-deploy-integrationClaude Skill

Deploy Vast.ai integrations to Vercel, Fly.io, and Cloud Run platforms.

1.9k Stars
259 Forks
2025/10/10

Install & Download

Linux / macOS:

请登录后查看安装命令

Windows (PowerShell):

请登录后查看安装命令

Download and extract to ~/.claude/skills/

namevastai-deploy-integration
descriptionDeploy ML training jobs and inference services on Vast.ai GPU cloud. Use when deploying GPU workloads, configuring Docker images, or setting up automated deployment scripts. Trigger with phrases like "deploy vastai", "vastai deployment", "vastai docker", "vastai production deploy".
allowed-toolsRead, Write, Edit, Bash(vastai:*), Bash(docker:*), Bash(ssh:*)
version1.0.0
licenseMIT
authorJeremy Longshore <jeremy@intentsolutions.io>
compatible-withclaude-code, codex, openclaw
tags["saas","vast-ai","deployment"]

Vast.ai Deploy Integration

Overview

Deploy ML training jobs and inference services on Vast.ai GPU cloud. Covers Docker image optimization, automated provisioning scripts, data transfer strategies, and deployment automation.

Prerequisites

  • Vast.ai CLI authenticated
  • Docker image published to a registry
  • Training/inference code tested locally

Instructions

Step 1: Optimized Docker Image

# Dockerfile.vastai — optimized for fast pulls on Vast.ai
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

# Install dependencies in a single layer
COPY requirements.txt /tmp/
RUN pip install --no-cache-dir -r /tmp/requirements.txt && rm /tmp/requirements.txt

# Copy application code
COPY src/ /workspace/src/
COPY scripts/ /workspace/scripts/

WORKDIR /workspace
CMD ["python", "src/train.py"]
# Build and push
docker build -t ghcr.io/yourorg/training:v1 -f Dockerfile.vastai .
docker push ghcr.io/yourorg/training:v1

Step 2: Automated Deployment Script

#!/usr/bin/env python3
"""deploy.py — Automated Vast.ai deployment with monitoring."""
import subprocess, json, time, argparse, sys

def deploy(args):
    # Search for matching offer
    query = (f"num_gpus={args.gpus} gpu_name={args.gpu} "
             f"reliability>{args.reliability} dph_total<={args.max_price} "
             f"disk_space>={args.disk} rentable=true")

    offers = json.loads(subprocess.run(
        ["vastai", "search", "offers", query, "--order", "dph_total",
         "--raw", "--limit", "5"],
        capture_output=True, text=True, check=True).stdout)

    if not offers:
        print(f"ERROR: No offers matching: {query}", file=sys.stderr)
        sys.exit(1)

    offer = offers[0]
    print(f"Selected: {offer['gpu_name']} ${offer['dph_total']:.3f}/hr "
          f"(ID: {offer['id']})")

    # Create instance
    cmd = ["vastai", "create", "instance", str(offer["id"]),
           "--image", args.image, "--disk", str(args.disk)]
    if args.onstart:
        cmd.extend(["--onstart-cmd", args.onstart])

    result = json.loads(subprocess.run(
        cmd, capture_output=True, text=True, check=True).stdout)
    instance_id = result["new_contract"]
    print(f"Instance {instance_id} provisioning...")

    # Wait for running
    for _ in range(30):
        info = json.loads(subprocess.run(
            ["vastai", "show", "instance", str(instance_id), "--raw"],
            capture_output=True, text=True).stdout)
        if info.get("actual_status") == "running":
            print(f"READY: ssh -p {info['ssh_port']} root@{info['ssh_host']}")
            return instance_id, info
        time.sleep(10)

    raise TimeoutError("Instance did not start")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", default="RTX_4090")
    parser.add_argument("--gpus", type=int, default=1)
    parser.add_argument("--image", required=True)
    parser.add_argument("--disk", type=int, default=50)
    parser.add_argument("--max-price", type=float, default=0.50)
    parser.add_argument("--reliability", type=float, default=0.95)
    parser.add_argument("--onstart", default="")
    deploy(parser.parse_args())

Step 3: Data Transfer Strategies

# Small datasets (<5GB): SCP directly
scp -P $PORT ./data.tar.gz root@$HOST:/workspace/

# Large datasets (>5GB): Use rsync with compression
rsync -avz --progress -e "ssh -p $PORT" ./data/ root@$HOST:/workspace/data/

# Very large datasets: Pre-stage on cloud storage
ssh -p $PORT root@$HOST "wget -q https://storage.example.com/dataset.tar.gz -O /workspace/data.tar.gz"

Step 4: Health Check After Deploy

ssh -p $PORT -o StrictHostKeyChecking=no root@$HOST << 'CHECK'
echo "=== Deploy Health Check ==="
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
df -h /workspace | tail -1
echo "=== Ready ==="
CHECK

Output

  • Optimized Docker image for fast Vast.ai pulls
  • Automated deployment script with GPU/price selection
  • Data transfer patterns (SCP, rsync, cloud storage)
  • Post-deploy health check verification

Error Handling

ErrorCauseSolution
Docker pull timeoutImage too large (>10GB)Use multi-stage builds; minimize image layers
Disk space exhaustedInsufficient disk allocationIncrease --disk parameter
SSH timeout after deployInstance still loading imageWait longer or use smaller base image
CUDA version mismatchImage CUDA > host CUDAFilter offers by cuda_max_good

Resources

Next Steps

For event-driven workflows, see vastai-webhooks-events.

Examples

One-command deploy: python deploy.py --gpu A100 --image ghcr.io/org/train:v1 --max-price 2.00 --disk 100

Multi-GPU deploy: Set --gpus 4 and --gpu H100_SXM for distributed training with torchrun.

Similar Claude Skills & Agent Workflows