Serverless Inference
Run ML models without managing GPU infrastructure.
On this page
Overview
Serverless inference runs models on managed GPU infrastructure that scales from zero on demand. You pay only for inference runs, not idle compute.
How it works
- 1Select a model from the catalog or deploy your own
- 2Send an inference request via the invoke or predictions API
- 3Platform routes the request to GPU infrastructure
- 4Inference runs and results are returned
- 5Credits are deducted based on model pricing
When to use
- Variable or unpredictable traffic patterns
- Prototyping and development
- Infrequent or batch usage
- No infrastructure management overhead