# RunPod GPU Performance
Performance benchmarks and optimization guide for RunPod GPU transcription workers.
## GPU Comparison

### Parakeet-Compatible GPUs
| GPU | Price/hr | Speed | $/episode-hr | Episodes/$ |
|---|---|---|---|---|
| RTX A4000 | $0.16 | ~50x realtime | $0.0032 | 312 |
| RTX A5000 | $0.22 | ~87x realtime | $0.0025 | 395 |
| RTX 3090 | $0.30 | ~80x realtime | $0.0038 | 267 |
| RTX A6000 | $0.45 | ~110x realtime | $0.0041 | 244 |
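The derived columns follow directly from price and speed. A quick sketch reproducing them (figures taken from the table above; assumes 1-hour episodes):

```python
# Derived columns:
#   cost per episode-hour = price per hour / realtime speed
#   episodes per dollar   = realtime speed / price per hour
gpus = {
    "RTX A4000": (0.16, 50),
    "RTX A5000": (0.22, 87),
    "RTX 3090": (0.30, 80),
    "RTX A6000": (0.45, 110),
}

for name, (price_per_hr, speed) in gpus.items():
    cost = price_per_hr / speed
    print(f"{name}: ${cost:.4f}/episode-hr, {speed / price_per_hr:.0f} episodes/$")
```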
> **Best Value:** RTX A5000 offers the best cost per episode despite not being the fastest or cheapest GPU.
### Blocked GPUs (CUDA Error 35)
These GPUs fail with NeMo/Parakeet due to CUDA compatibility issues (Ada Lovelace architecture):
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4080
- NVIDIA L4
Ampere GPUs (A-series, RTX 30-series) work fine.
## Bandwidth Analysis

### Upload Capacity vs GPU Speed
At typical podcast bitrate (~128 kbps = 0.96 MB/min of audio):
| Bandwidth | MB/s | Realtime Equivalent |
|---|---|---|
| 25 Mbit/s | 3.1 | ~195x |
| 50 Mbit/s | 6.25 | ~390x |
| 100 Mbit/s | 12.5 | ~780x |
> **Key Finding:** Even the fastest GPU (~110x realtime) needs only ~14 Mbit/s of sustained upload (110 × 128 kbps) to stay saturated. At 50 Mbit/s upload, you can feed ~4-5 A5000-class pods (~11 Mbit/s each) before bandwidth becomes a bottleneck.
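The arithmetic behind that finding, assuming the ~128 kbps bitrate from the table above (function names are illustrative):

```python
BITRATE_MBITS = 0.128  # ~128 kbps typical podcast bitrate

def upload_needed_mbits(realtime_speed):
    """Sustained upload (Mbit/s) required to keep one pod fed."""
    return realtime_speed * BITRATE_MBITS

def pods_per_link(link_mbits, realtime_speed):
    """How many pods of a given speed one uplink can keep saturated."""
    return link_mbits / upload_needed_mbits(realtime_speed)

print(upload_needed_mbits(110))  # fastest GPU: ~14 Mbit/s
print(pods_per_link(50, 87))     # A5000s on a 50 Mbit/s link: ~4.5
```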
## Multi-Pod Scaling

### Test Results (2x A5000 + MacBook)
Configuration:
- 2x RunPod A5000 pods (Parakeet)
- 1x MacBook local worker (Whisper large-v3-turbo)
- 50 Mbit/s upload bandwidth
Observed Performance:
- Last hour: 79 episodes, 8271 audio minutes
- Throughput: 138 hours of audio per wall-clock hour
- Both pods stayed constantly busy (no idle time)
### Per-Node Stats (24h sample)
| Node | Jobs | Avg Time | Notes |
|---|---|---|---|
| RunPod A5000 (1) | 78 | 320 sec | Parakeet |
| RunPod A5000 (2) | 41 | 343 sec | Parakeet (newer pod) |
| MacBook | 326 | 344 sec | Whisper large-v3-turbo |
## Scaling Recommendations

### When to Add Pods
| Queue Size | Pods | Reasoning |
|---|---|---|
| < 50 | 1 | Single pod sufficient |
| 50-200 | 2 | Parallel processing, no bandwidth issues |
| 200-500 | 3 | Still within 50 Mbit/s capacity |
| > 500 | 3-4 | Consider auto-scale |
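The table above reduces to a simple lookup; a sketch with the same thresholds (the function name is illustrative):

```python
def recommended_pods(queue_size):
    """Suggested pod count for a given transcription queue depth."""
    if queue_size < 50:
        return 1   # single pod sufficient
    if queue_size <= 200:
        return 2   # parallel processing, no bandwidth issues
    if queue_size <= 500:
        return 3   # still within 50 Mbit/s capacity
    return 4       # > 500: consider auto-scale
```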
### Bottlenecks (in order)
1. GPU processing -- primary bottleneck, scales with pods
2. Episode length -- longer episodes = lower throughput
3. Job coordination -- minor overhead between jobs
4. Bandwidth -- only limiting at 5+ pods
## Verification
To confirm pods aren't bandwidth-limited:
```bash
# Check if pods are busy
curl -s https://server/api/nodes | jq '.nodes[] | {name, status}'

# Check running jobs (should be 2-3x pod count for prefetch)
curl -s https://server/api/queue/status | jq '.transcribe_queue.running'
```
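The 2-3x prefetch rule of thumb in the second command can be checked programmatically; a minimal sketch (the ratio comes from the source above, the function name is illustrative):

```python
def prefetch_healthy(running_jobs, pod_count):
    """True when running jobs fall in the expected 2-3x-per-pod prefetch window."""
    return 2 * pod_count <= running_jobs <= 3 * pod_count
```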
## Cost Optimization

### A5000 Economics
| Metric | Value |
|---|---|
| Hourly cost | ~$0.22 |
| Processing speed | 87x realtime |
| Cost per episode-hour | $0.0025 |
| 100 episodes (avg 2 hrs each) | ~$0.50 |
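As a sanity check on the last row, a back-of-the-envelope batch cost using the rates above (assumes per-second billing with no pod idle time):

```python
A5000_PRICE_PER_HR = 0.22
A5000_SPEED = 87  # x realtime

def batch_cost_usd(episodes, avg_hours_each=2.0):
    """GPU-time cost to transcribe a batch of episodes on one A5000."""
    gpu_hours = episodes * avg_hours_each / A5000_SPEED
    return gpu_hours * A5000_PRICE_PER_HR

print(f"${batch_cost_usd(100):.2f}")  # ~$0.50 for 100 two-hour episodes
```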
### Comparison to Local Processing
| Method | Speed | Cost/100 episodes |
|---|---|---|
| RunPod A5000 | 87x | $0.50 |
| MacBook M1 (MLX) | ~15x | Free (electricity) |
| Server CPU | ~1x | Free (electricity) |
### Cost Control Tips
- Start pods only when queue has work -- empty queue = wasted billing
- Use auto-scale wisely -- only enable for regular large backlogs
- Monitor stuck jobs -- 10-min idle timeout catches these
- Server reliability -- 5-min unreachable timeout prevents orphaned pods