
Serverless Inference

Run ML models without managing GPU infrastructure.

Overview

Serverless inference runs models on managed GPU infrastructure that scales from zero on demand. You pay only for inference runs, not idle compute.

How it works

1. Select a model from the catalog or deploy your own.
2. Send an inference request via the invoke or predictions API.
3. The platform routes the request to GPU infrastructure.
4. Inference runs and the results are returned.
5. Credits are deducted based on model pricing.
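
The request in step 2 could be built along the following lines. This is a minimal sketch, not the platform's documented API: the base URL, the `/v1/models/{model_id}/invoke` path, the `input` payload key, and the bearer-token header are all assumptions made for illustration. Consult the Invoke Endpoint page for the actual contract.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the platform's real base URL.
BASE_URL = "https://api.example.com"


def build_invoke_request(model_id: str, inputs: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that invokes a model.

    The path, payload shape, and auth header are assumed, not documented.
    """
    body = json.dumps({"input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/models/{model_id}/invoke",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def invoke(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (steps 2 to 4)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before any credits are spent. A generous timeout is used because a model that has scaled to zero may take longer to answer its first request.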

When to use

Variable or unpredictable traffic patterns
Prototyping and development
Infrequent or batch usage
Teams that want to avoid infrastructure management overhead
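
One way to reason about the traffic criteria above is a rough break-even check between per-run billing and an always-on alternative. The sketch below is illustrative only: the function names, the flat dedicated cost, and the assumption that serverless cost is simply runs multiplied by a per-run price are hypothetical, and real pricing may differ per model.

```python
def serverless_cost(runs: int, credits_per_run: float) -> float:
    """Assumed billing model: pay per inference run, nothing while idle."""
    return runs * credits_per_run


def cheaper_option(runs: int, credits_per_run: float, dedicated_flat_credits: float) -> str:
    """Compare per-run billing with a hypothetical flat cost for the same period."""
    if serverless_cost(runs, credits_per_run) < dedicated_flat_credits:
        return "serverless"
    return "dedicated"
```

With low or bursty volume the per-run total stays below the flat cost; once sustained volume pushes it above, dedicated capacity may be worth evaluating.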

Related pages

Dedicated Servers
Model Catalog
Invoke Endpoint