RunPod GPU Workers¶
On-demand GPU transcription workers for processing large backlogs. Uses Parakeet TDT 0.6B v3 by default for fast transcription (~100x realtime).
Overview¶
RunPod pods are managed by the cast2md server. They:
- Start on demand via the web UI or API
- Connect back to the server via Tailscale
- Register as transcriber nodes
- Process jobs from the queue
- Auto-terminate when the queue is empty
Prerequisites¶
- RunPod account with API key
- Tailscale account with auth key
- cast2md server accessible via Tailscale
Configuration¶
Server Environment¶
```bash
# Required
RUNPOD_ENABLED=true
RUNPOD_API_KEY=your_runpod_api_key
RUNPOD_TS_AUTH_KEY=tskey-auth-xxx

# Server connection (for pods to reach the server)
RUNPOD_SERVER_URL=https://your-server.ts.net
RUNPOD_SERVER_IP=100.x.x.x  # Tailscale IP (required, MagicDNS unavailable in pods)
```
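To confirm these values point at a reachable server, a quick check from any Tailscale-connected machine (the `/api/health` endpoint is the same one used in the Debugging section below):

```bash
curl https://your-server.ts.net/api/health
```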
Settings¶
| Setting | Default | Description |
|---|---|---|
| `runpod_enabled` | `false` | Master switch |
| `runpod_max_pods` | `3` | Maximum concurrent pods |
| `runpod_auto_scale` | `false` | Auto-start on queue growth |
| `runpod_scale_threshold` | `10` | Queue depth to trigger auto-scale |
| `runpod_gpu_type` | `NVIDIA RTX A5000` | Preferred GPU |
| `runpod_blocked_gpus` | (see below) | GPUs to exclude |
| `runpod_whisper_model` | `parakeet-tdt-0.6b-v3` | Default model |
| `runpod_idle_timeout_minutes` | `10` | Auto-terminate idle pods |
Usage¶
Starting Pods¶
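Pods can be started from the web UI or with an API call. The request below mirrors the persistent-pod example later on this page, minus the `persistent` flag; repeat it to start more pods, up to `runpod_max_pods`:

```bash
# Start one on-demand pod
curl -X POST http://localhost:8000/api/runpod/pods
```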
Monitoring¶
Check pod status:
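The `GET` request below is an assumption; it mirrors the `DELETE` collection path used under Terminating:

```bash
# Assumed listing endpoint
curl http://localhost:8000/api/runpod/pods
```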
Track pod creation progress:
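The per-instance path below is hypothetical (the real route may differ); setup progress is also visible in the admin UI:

```bash
# Hypothetical per-instance status route
curl http://localhost:8000/api/runpod/pods/{instance_id}
```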
Terminating¶
```bash
# Terminate all pods
curl -X DELETE http://localhost:8000/api/runpod/pods

# Terminate specific pod
curl -X DELETE http://localhost:8000/api/runpod/pods/{pod_id}
```
Transcription Models¶
Default Models¶
| Model | Backend | Languages | Speed |
|---|---|---|---|
| `parakeet-tdt-0.6b-v3` | Parakeet (NeMo) | 25 EU languages | ~100x realtime |
| `large-v3-turbo` | Whisper | 99+ languages | ~30-40x realtime |
| `large-v3` | Whisper | 99+ languages | ~20x realtime |
| `large-v2` | Whisper | 99+ languages | ~20x realtime |
| `medium` | Whisper | 99+ languages | ~40x realtime |
| `small` | Whisper | 99+ languages | ~60x realtime |
Managing Models¶
Via the Settings page or API:
```bash
# List models
curl http://localhost:8000/api/runpod/models

# Add model
curl -X POST http://localhost:8000/api/runpod/models \
  -H "Content-Type: application/json" \
  -d '{"model_id": "large-v3-turbo"}'

# Remove model
curl -X DELETE http://localhost:8000/api/runpod/models \
  -H "Content-Type: application/json" \
  -d '{"model_id": "large-v3-turbo"}'
```
GPU Compatibility¶
**Blocked GPUs:** RTX 40-series consumer GPUs and the NVIDIA L4 have CUDA compatibility issues with NeMo/Parakeet (CUDA error 35). They work fine with Whisper but fail with Parakeet.
Working GPUs (Parakeet):
- NVIDIA RTX A5000 (~$0.22/hr, ~87x realtime)
- NVIDIA RTX A6000
- NVIDIA RTX A4000
- NVIDIA GeForce RTX 3090
- NVIDIA L40
Blocked GPUs (default blocklist):
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4080
- NVIDIA L4
Blocked GPUs are automatically skipped during pod creation. To modify the blocklist, change the `runpod_blocked_gpus` setting:
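How the list is supplied isn't shown here; if it follows the same environment-variable convention as `RUNPOD_ENABLED` above, an override might look like this (variable name and comma-separated format are assumptions):

```bash
# Assumed env-var form of runpod_blocked_gpus; name and list format are guesses
RUNPOD_BLOCKED_GPUS="NVIDIA GeForce RTX 4090,NVIDIA GeForce RTX 4080,NVIDIA L4"
```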
Auto-Termination¶
Pods automatically terminate to prevent runaway costs:
| Condition | Default | Description |
|---|---|---|
| Empty Queue | ~2 min | 2 consecutive empty checks, 60s apart |
| Idle Timeout | 10 min | No jobs processed (safety net) |
| Server Unreachable | 5 min | Can't reach server |
| Circuit Breaker | 3 failures | Consecutive transcription failures |
Server-Controlled Termination¶
When a node worker decides to terminate:
- Worker calls `POST /api/nodes/{node_id}/request-termination`
- Server releases claimed jobs back to the queue
- Server terminates the pod via the RunPod API
- Server cleans up setup state and node registration
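The worker's call is an ordinary HTTP request; as an illustration (no request body is assumed):

```bash
curl -X POST http://localhost:8000/api/nodes/{node_id}/request-termination
```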
Persistent/Dev Mode¶
To keep pods running for debugging:
```bash
# Create persistent pod
curl -X POST http://localhost:8000/api/runpod/pods \
  -H "Content-Type: application/json" \
  -d '{"persistent": true}'

# Enable on existing pod
curl -X PATCH http://localhost:8000/api/runpod/pods/{instance_id}/persistent \
  -H "Content-Type: application/json" \
  -d '{"persistent": true}'
```
Pod Lifecycle¶
```
1. API call to create pod
   └─> Server generates instance ID
   └─> Background thread creates RunPod pod
   └─> Pod starts with Tailscale + cast2md worker

2. Pod setup
   └─> Start Tailscale (userspace networking)
   └─> Install cast2md from GitHub
   └─> GPU smoke test (Parakeet only)
   └─> Register with server as transcriber node

3. Processing
   └─> Claims jobs via standard node protocol
   └─> 3-slot prefetch queue keeps audio ready
   └─> Transcribes using Parakeet or Whisper

4. Auto-termination
   └─> Empty queue / idle / unreachable / circuit breaker
   └─> Notifies server before shutdown
   └─> Server releases jobs and cleans up
```
GPU Smoke Test¶
During pod setup, a GPU smoke test runs before the worker starts (Parakeet only). It transcribes 1 second of silence to catch CUDA errors early.
- Timeout: 120 seconds
- If it fails, the pod is marked as FAILED in the admin UI
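If a pod fails the smoke test, a quick manual check over SSH (see Debugging below) is to confirm the GPU is visible and scan the worker log; the grep pattern here is a guess, not a documented log message:

```bash
nvidia-smi                               # confirm the GPU is visible in the container
grep -i "smoke" /tmp/cast2md-node.log    # log pattern is a guess
```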
Docker Image¶
RunPod pods use a custom Docker image with pre-installed dependencies:
| Component | Notes |
|---|---|
| CUDA 12.4.1 | Runtime only |
| PyTorch 2.4.0+cu124 | Pinned for CUDA compatibility |
| NeMo toolkit | Latest version |
| Parakeet model | Pre-downloaded (~600MB) |
| faster-whisper | Fallback for Whisper models |
Image: `meltforce/cast2md-afterburner:cuda124`
Built automatically via GitHub Actions when `deploy/afterburner/Dockerfile` changes.
Tailscale Networking¶
RunPod containers don't have `/dev/net/tun`, so Tailscale runs in userspace mode:
- Inbound connections work normally (SSH, etc.)
- Outbound connections use the HTTP proxy on `localhost:1055`
- HTTPS is not supported through the proxy (use HTTP; traffic is still encrypted by Tailscale's WireGuard tunnel)
- MagicDNS is unavailable, so `RUNPOD_SERVER_IP` is required
Debugging¶
```bash
# SSH into pod
ssh root@<pod-tailscale-hostname>

# View worker logs
tail -100 /tmp/cast2md-node.log

# Check proxy
ss -tlnp | grep 1055

# Test server connectivity
curl -x http://localhost:1055 http://<server-ip>:8000/api/health

# Tailscale status
tailscale status
```
CLI Afterburner¶
For running pods from the command line:
```bash
source deploy/afterburner/.env

python deploy/afterburner/afterburner.py                  # Process queue
python deploy/afterburner/afterburner.py --dry-run        # Validate config
python deploy/afterburner/afterburner.py --test           # Test connectivity
python deploy/afterburner/afterburner.py --keep-alive     # Persistent mode
python deploy/afterburner/afterburner.py --terminate-all  # Stop all pods
```
The afterburner also supports parallel execution.
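The exact invocation isn't shown here; as a sketch, assuming a hypothetical `--parallel` option that controls how many pods the script drives at once:

```bash
# --parallel is a hypothetical flag used for illustration; check
# `python deploy/afterburner/afterburner.py --help` for the real option name
python deploy/afterburner/afterburner.py --parallel 3
```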