What is Cumulus Labs?

Cumulus Labs provides serverless GPU inference with self-reported 12.5-second cold starts (a claimed 4x faster than Modal) and pay-per-compute pricing designed to eliminate idle GPU waste. A YC W2026 company and an NVIDIA Inception Program member, it was founded by Veer Shah (ex-Space Force SBIR, NASA) and Suryaa Rajinikanth (ex-TensorDock lead engineer, ex-Palantir).

The platform supports any containerized AI model (LLMs, image generation, speech-to-text, computer vision) and handles GPU selection, load balancing, and failover automatically. Its proprietary inference engine, Ion, is optimized for NVIDIA Grace chips and reportedly achieves 7,167 tokens/second on a 7B model. Deployment takes a single Python function call, and scale-to-zero pricing means nothing is billed while no requests are running.

Cumulus also offers Cumulus OS, an on-premises GPU cluster manager with fleet management, intelligent bin-packing, and Kubernetes-native orchestration. The founders claim 50-70% cost savings versus traditional GPU cloud providers, which they attribute to a pay-per-compute model that charges only for actual GPU usage.
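To see when a savings claim in that range is plausible, the arithmetic can be sketched as below. This is an illustration only, not Cumulus Labs' pricing formula: the hourly rate, the utilization figure, and the `premium` multiplier (how much more a busy second costs than a reserved second) are all hypothetical assumptions. Under them, the savings fraction is simply 1 - premium × utilization.

```python
def always_on_cost(hourly_rate: float, hours: float) -> float:
    """Cost of a reserved GPU that is billed for every hour, busy or idle."""
    return hourly_rate * hours


def pay_per_compute_cost(hourly_rate: float, hours: float,
                         utilization: float, premium: float = 1.0) -> float:
    """Cost when only busy time is billed, at `premium` times the reserved rate.

    `utilization` is the fraction of wall-clock time the GPU does real work.
    Both `utilization` and `premium` are illustrative assumptions.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return hourly_rate * premium * hours * utilization


def savings_fraction(utilization: float, premium: float = 1.0) -> float:
    """Fraction saved versus always-on billing: 1 - premium * utilization."""
    return 1.0 - premium * utilization


if __name__ == "__main__":
    # Hypothetical workload: $2.00/hour GPU, 720-hour month, busy 25% of the
    # time, with usage-based compute priced at 1.5x the reserved rate.
    reserved = always_on_cost(2.0, 720)
    usage_based = pay_per_compute_cost(2.0, 720, utilization=0.25, premium=1.5)
    print(f"always-on:       ${reserved:,.2f}")
    print(f"pay-per-compute: ${usage_based:,.2f}")
    print(f"savings:         {savings_fraction(0.25, 1.5):.1%}")
```

With these made-up numbers the savings come to 62.5%. The same formula shows the model's limit: once premium × utilization reaches 1 (for example, a GPU that is busy most of the day), usage-based billing stops being cheaper than a reserved instance.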

Key Features

✓Multimodal inference optimization
✓High-speed inference OS
✓Scalable compute
✓Multi-model support

Pros & Cons

Pros

+12.5-second cold starts (a claimed 4x faster than Modal), paired with pay-per-compute pricing
+Scale-to-zero eliminates idle GPU waste, with 50-70% claimed savings
+NVIDIA Inception Program membership, a possible hardware partnership signal
+Supports both serverless cloud and on-prem deployment via Cumulus OS
+Strong technical backgrounds from TensorDock, Palantir, and NASA programs

Cons

-Two-person team competing against well-funded Modal, Replicate, and RunPod
-Grace chip optimization is niche: most customers use H100/A100 GPUs
-No disclosed customers or revenue metrics
-Benchmarks are self-reported without independent validation

Cumulus Labs Pricing

Free trial available

Serverless: Pay-per-compute (usage-based)

✓Scale-to-zero
✓12.5s cold starts
✓No idle costs
✓50-70% savings vs GPU cloud

Cumulus OS: Contact for pricing

✓On-prem GPU management
✓Fleet management
✓Kubernetes-native
✓Intelligent bin-packing
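For readers unfamiliar with the term, bin-packing here means fitting workloads onto as few GPUs as possible. The listing does not describe how Cumulus OS does this, so the sketch below swaps in a textbook heuristic, first-fit decreasing, purely as an illustration. The job names, memory sizes, and 80 GB capacity are hypothetical.

```python
from typing import Dict, List


def first_fit_decreasing(jobs: Dict[str, int], gpu_capacity_gb: int) -> List[List[str]]:
    """Pack jobs (name -> memory in GB) onto GPUs with first-fit decreasing.

    A textbook heuristic, not Cumulus OS's actual scheduler. It does not
    guarantee the minimum number of GPUs, but it is simple and usually close.
    """
    gpus: List[List[str]] = []  # job names placed on each GPU
    free: List[int] = []        # remaining memory per GPU, same order as `gpus`

    # Place the largest jobs first; ties are broken by name for determinism.
    for name, need in sorted(jobs.items(), key=lambda kv: (-kv[1], kv[0])):
        if need > gpu_capacity_gb:
            raise ValueError(f"{name} needs {need} GB, more than one GPU holds")
        for i, room in enumerate(free):
            if need <= room:
                # First GPU with enough room wins.
                gpus[i].append(name)
                free[i] -= need
                break
        else:
            # No existing GPU fits this job, so open a new one.
            gpus.append([name])
            free.append(gpu_capacity_gb - need)
    return gpus


if __name__ == "__main__":
    # Hypothetical models and memory footprints on 80 GB GPUs.
    jobs = {"llm-7b": 40, "embedder": 30, "sdxl": 24, "whisper": 10, "detector": 6}
    for index, placed in enumerate(first_fit_decreasing(jobs, gpu_capacity_gb=80)):
        print(f"GPU {index}: {placed}")
```

In this example the five jobs total 110 GB and land on two GPUs rather than five. A production scheduler would also weigh compute contention, interconnect topology, and failover, none of which this sketch models.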

Common Use Cases

Teams running multimodal AI models at scale

•Multimodal model serving
•High-throughput inference
•Production AI deployment

Using Cumulus Labs with Respan

Cumulus Labs provides GPU inference infrastructure while Respan monitors the AI applications running on it. Together they optimize both compute costs and AI output quality.