Batch Inference

Processing multiple ML predictions as a group at scheduled intervals rather than one at a time on demand, optimizing for throughput and cost over latency.

Batch inference processes predictions in bulk — running your model on thousands or millions of inputs at once, typically on a schedule (hourly, nightly). This contrasts with real-time inference, where predictions are generated on-demand for each request.
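The core mechanical difference can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `model` object with a `predict` method that accepts a list of inputs; the names are not from any particular library.

```python
def predict_batch(model, inputs, batch_size=64):
    """Run inference over all inputs in fixed-size chunks.

    One model call per chunk (amortizing per-call overhead),
    instead of one call per item as in real-time serving.
    """
    results = []
    for i in range(0, len(inputs), batch_size):
        chunk = inputs[i:i + batch_size]
        results.extend(model.predict(chunk))
    return results
```

In a real pipeline this loop would typically be wrapped by a scheduler (cron, Airflow, etc.) and the results written to a store rather than returned in memory.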

Batch inference is dramatically cheaper than real-time inference for several reasons: GPU utilization is higher when models process full batches, major LLM API providers offer roughly 50% discounts on their batch endpoints, and you can run on spot/preemptible instances since exact timing isn't critical. A nightly batch job processing 100K recommendations might cost $50, while serving the same predictions in real time could cost $500 or more.
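The illustrative figures above imply a simple back-of-the-envelope calculation; the dollar amounts are the example numbers from the paragraph, not measured costs.

```python
predictions = 100_000
batch_total = 50        # $ for the nightly batch job (illustrative)
realtime_total = 500    # $ lower bound for serving the same volume live

cost_ratio = realtime_total / batch_total      # 10x cheaper in batch
batch_per_pred = batch_total / predictions     # $0.0005 per prediction
```

Even before provider discounts, the ratio tends in this direction because batch jobs keep accelerators saturated instead of idling between requests.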

Common growth use cases for batch inference: precomputing content recommendations for all users, generating personalized email content for campaigns, scoring all accounts for churn risk, embedding new content for search indexes, and generating SEO meta descriptions for product pages. The pattern is simple: if the prediction can be slightly stale (hours, not seconds), batch it. Reserve real-time inference for interactive features where freshness matters.
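The first use case above, precomputing recommendations for all users, can be sketched as a nightly job. Everything here is hypothetical scaffolding (`model.predict`, the user list, a dict standing in for a key-value cache); the point is the shape: compute in bulk offline, serve with a cheap lookup.

```python
def run_nightly_recommendations(model, users, cache, batch_size=1000):
    """Precompute recommendations for every user and cache the results,
    so the serving path is a key-value lookup, not a live model call."""
    for i in range(0, len(users), batch_size):
        chunk = users[i:i + batch_size]
        recs = model.predict(chunk)
        for user, rec in zip(chunk, recs):
            cache[user] = rec  # stale-tolerant: refreshed on the next run
    return cache
```

At serving time the application reads `cache[user_id]`; if a prediction is a few hours old, that is acceptable by design, which is exactly the staleness test in the paragraph above.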
