Serverless Inference

Run ML models without managing GPU infrastructure.

Overview

Serverless inference runs models on managed GPU infrastructure that scales from zero on demand. You pay only for inference runs, not idle compute.

How it works

  1. Select a model from the catalog or deploy your own
  2. Send an inference request via the invoke or predictions API
  3. Platform routes the request to GPU infrastructure
  4. Inference runs and results are returned
  5. Credits are deducted based on model pricing
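
As a rough illustration of step 2, the sketch below builds an HTTP request for an invoke-style call using only the Python standard library. The base URL, the `/models/{model_id}/invoke` path, the `{"input": ...}` payload shape, and the bearer-token header are all assumptions made for this example; they are not taken from the platform's API reference, so check the actual invoke or predictions API documentation for the real values.

```python
import json
import urllib.request


def build_invoke_request(
    base_url: str, model_id: str, inputs: dict, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a hypothetical invoke endpoint."""
    # Hypothetical path layout; the real platform may use a different route.
    url = f"{base_url.rstrip('/')}/models/{model_id}/invoke"
    # Hypothetical payload shape: model inputs wrapped in an "input" field.
    body = json.dumps({"input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values throughout; nothing here points at a real service.
request = build_invoke_request(
    "https://api.example.com/v1", "example-model", {"prompt": "Hello"}, "YOUR_API_KEY"
)

# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Steps 3 to 5 (routing, execution, and credit deduction) happen on the platform side, so the client only needs to send the request and read the response.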

When to use

  • Variable or unpredictable traffic patterns
  • Prototyping and development
  • Infrequent or batch usage
  • Workloads where you want to avoid infrastructure management overhead
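
One way to reason about the criteria above is a break-even estimate between paying per inference run and paying for always-on compute. The sketch below is illustrative only: the per-run price, the hourly rate for a dedicated GPU, and the idea of billing both in credits are hypothetical numbers and assumptions chosen for the arithmetic, not actual platform pricing.

```python
def serverless_cost(runs: int, credits_per_run: int) -> int:
    """Total credits when billed per inference run (no charge while idle)."""
    return runs * credits_per_run


def dedicated_cost(hours: int, credits_per_hour: int) -> int:
    """Total credits for an always-on instance, billed whether or not it is used."""
    return hours * credits_per_hour


def break_even_runs(hours: int, credits_per_hour: int, credits_per_run: int) -> float:
    """Number of runs at which per-run billing costs the same as always-on compute."""
    return dedicated_cost(hours, credits_per_hour) / credits_per_run


# Hypothetical prices: 2 credits per run, 500 credits per GPU-hour, 720 hours per month.
monthly_serverless = serverless_cost(runs=10_000, credits_per_run=2)    # 20_000 credits
monthly_dedicated = dedicated_cost(hours=720, credits_per_hour=500)     # 360_000 credits
threshold = break_even_runs(hours=720, credits_per_hour=500, credits_per_run=2)  # 180_000.0 runs
```

Under these made-up prices, a workload below roughly 180,000 runs per month would cost less on per-run billing, which matches the infrequent and unpredictable traffic patterns listed above.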