# Three-backend GPU fleet live: NIM + Dell + DGX Spark GB10
The SMA research platform now operates three independent GPU compute backends in parallel, each running pharma-relevant workloads continuously:
## Active backends (2026-04-27)
| Backend | Hardware | Workload | Status |
|---|---|---|---|
| NIM API (cloud) | NVIDIA hosted | Boltz-2 + ESMFold + MolMIM saturator | 162 calls/5 min, 86% OK |
| Dell Demo Center (free 90-day) | RTX Pro 6000 Blackwell 96 GB | Boltz-2 PPI saturator | 4,200+ pair predictions, 82% GPU util |
| NVIDIA DGX Spark GB10 (owned) | Grace Blackwell 128 GB unified, sm_121 | Chai-1 ligand saturator + Qwen 35B local LLM | 21+ predictions completed, ~33/hour throughput |
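For cross-backend comparison it helps to normalize the status figures above to a common rate. A minimal sketch; the helper name is made up:

```python
def effective_per_hour(calls: int, window_min: float, ok_frac: float) -> float:
    """Successful calls per hour, from a windowed call count and success rate."""
    return calls * (60.0 / window_min) * ok_frac

# NIM saturator: 162 calls per 5-minute window at 86% OK
nim_rate = effective_per_hour(162, 5, 0.86)  # ~1,672 OK calls/hour
```

At roughly 1,670 successful calls/hour the cloud NIM path runs far hotter than the Spark ligand saturator's ~33 predictions/hour, though the two workloads are not directly comparable.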
## Hardware comparison: measured benchmarks
Running the same workloads on each backend produced apples-to-apples timing data, now published at /infrastructure/gpu-benchmark:
| Workload | Spark GB10 (48 SMs) | Dell RTX Pro 6000 (188 SMs) |
|---|---|---|
| LLM tokens/sec (Qwen 35B Q8) | 49.8 t/s | 186.6 t/s (3.7× faster) |
| Matmul BF16 8192² | 99.9 TFLOPS | 395.4 TFLOPS (4.0×) |
| Memory ceiling | 122 GB unified | 96 GB GDDR |
Trade-off: Dell wins on raw compute throughput; Spark wins on memory capacity for larger models (e.g. the 140 GB DeepSeek V4-Flash MoE doesn't fit in Dell's 96 GB).
## Pharma-grade methodology decisions
- Boltz-2 vs Chai-1 dual deployment: Boltz-2 hangs on Spark's sm_121 with PyTorch 2.11+cu130 (blocked in the Lightning predict phase), while Chai-1 0.6.1 runs natively on the same hardware. Spark therefore runs Chai-1 and Dell stays the primary Boltz-2 host; both peer-reviewed methodologies provide complementary scoring.
- BindCraft dual-mode: Spark runs design plus the AF2-confidence filter (Bennett 2023 methodology); Dell runs full BindCraft including PyRosetta scoring, since no aarch64 PyRosetta wheel exists for Spark's Grace CPU.
- RFdiffusion deprecated in favor of BindCraft 1.5 (Pacesa, Nature 2025): 10–100% binder success rates vs RFdiffusion's 1–10%.
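The routing implied by these decisions can be written down as a static table. A sketch; the registry format and names are hypothetical, but the assignments come straight from the bullets above:

```python
# Workload -> backend assignments from the methodology decisions above.
ROUTES = {
    "boltz2": "dell",             # Boltz-2 hangs on Spark sm_121
    "chai1": "spark",             # Chai-1 0.6.1 runs natively on GB10
    "bindcraft-design": "spark",  # design + AF2-confidence filter only
    "bindcraft-full": "dell",     # PyRosetta scoring: no aarch64 wheel
}

def route(workload: str) -> str:
    """Return the assigned backend, or fail loudly on unknown workloads."""
    try:
        return ROUTES[workload]
    except KeyError:
        raise ValueError(f"no backend assignment for {workload!r}")
```

Keeping the table explicit (rather than deriving it from capability probes) makes each routing decision auditable against the methodology notes.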
## Storage architecture (Phase 2 complete)
The canonical research-data root migrated from the constrained moltbot host to Spark's /data/research-data/ (3.6 TB capacity, 215 GB hosted so far). The automated Dropbox cloud mirror remains in place, and a daily migrator pulls the fleet-results staging area into the canonical tree.
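A minimal version of such a staging-to-canonical migrator; this is a hedged sketch under the assumption that the real tool preserves relative paths and never overwrites canonical data (any checksum or dedup logic is omitted):

```python
import shutil
from pathlib import Path

def migrate(staging: Path, canonical: Path) -> int:
    """Move files from the staging tree into the canonical tree,
    preserving relative paths; skip files that already exist."""
    moved = 0
    for src in staging.rglob("*"):
        if not src.is_file():
            continue
        dst = canonical / src.relative_to(staging)
        if dst.exists():
            continue  # never clobber canonical data
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), dst)
        moved += 1
    return moved
```

Because files are moved rather than copied, a second daily run over an unchanged staging tree is a no-op.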
## Next milestones
- 4th backend evaluation: Modal.com's serverless tier ($30/mo free credits) as the successor once the 90-day Dell access ends on 2026-07-22
- TurboQuant KV-cache compression (Google, ICLR 2026): integrate once the llama.cpp PR lands
- Cross-backend orthogonal validation pipeline (3-LLM consensus rule for any external claim)
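One simple form of the 3-LLM consensus rule is majority voting. A sketch assuming a 2-of-3 quorum; the post doesn't specify the threshold, so `quorum` is an assumption:

```python
from collections import Counter

def consensus(verdicts: list[bool], quorum: int = 2) -> bool:
    """Accept an external claim only if at least `quorum` of the
    three independent LLM verdicts say it is supported."""
    assert len(verdicts) == 3, "rule is defined over exactly 3 LLMs"
    return Counter(verdicts)[True] >= quorum
```

Raising `quorum` to 3 gives a stricter unanimity rule at the cost of rejecting more true claims.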
All numbers are from measured runs, not vendor specs. Source code and raw benchmark data: /api/v2/infrastructure/gpu-roi.