Batch vs realtime serving
There are two ways to run a model's predictions, and picking the wrong one wastes money or misses deadlines.
- Realtime: one request at a time, behind an endpoint. Latency is low (an answer in milliseconds) and predictions are fresh (they use the data available at request time), but each prediction costs more because you keep a server warm. Use it when a user or workflow is waiting: a fraud check at checkout, a chatbot reply, routing a support ticket on submit.
- Batch: score many rows at once in a scheduled job. It is cheaper per item, but latency is higher (results land hours later) and predictions are staler. Use it when nothing is waiting: overnight lead scoring, a nightly no-show prediction run, re-tagging a document archive.
Run the editor: the same 1,000 predictions cost 5× more in realtime than in batch. That is the price of "answer now."
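One way to see where a gap like that can come from is a minimal sketch that bills both modes at the same hourly rate: the realtime server is paid for every hour it stays warm, while the batch job is paid for only while it runs. The rate and the hour counts below are hypothetical, picked only so the ratio matches the 5× figure above; they are not real pricing.

```python
def serving_cost_cents(hours_billed: int, rate_cents_per_hour: int) -> int:
    """Cost of keeping compute running for the given number of hours."""
    return hours_billed * rate_cents_per_hour


# All numbers below are assumptions for illustration, not real prices.
RATE_CENTS_PER_HOUR = 50   # same machine type for both modes
REALTIME_WARM_HOURS = 10   # endpoint stays up the whole time requests may arrive
BATCH_RUN_HOURS = 2        # scheduled job runs, scores every row, then shuts down

realtime_cost = serving_cost_cents(REALTIME_WARM_HOURS, RATE_CENTS_PER_HOUR)
batch_cost = serving_cost_cents(BATCH_RUN_HOURS, RATE_CENTS_PER_HOUR)
cost_ratio = realtime_cost / batch_cost

print(f"realtime: {realtime_cost} cents, batch: {batch_cost} cents, ratio: {cost_ratio:.0f}x")
```

With these made-up inputs the realtime endpoint is billed for five times as many hours as the batch job, even though both score the same predictions. Change the hour counts and the ratio moves with them.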
The decision rule
Ask one question: is something waiting on this prediction right now?
- Yes → realtime.
- No → batch (cheaper, simpler, and you can re-run it).
Freshness is the tiebreaker: if a prediction must reflect data from seconds ago, realtime; if "as of last night" is fine, batch.
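The rule above is small enough to write down as a function. This is a sketch only: the function name and its two boolean parameters are hypothetical, chosen to mirror the "is something waiting?" question and the freshness tiebreaker.

```python
def choose_serving_mode(something_waiting: bool, needs_fresh_data: bool = False) -> str:
    """Pick "realtime" or "batch" using the who-is-waiting rule.

    something_waiting: a user or workflow is blocked on this prediction right now.
    needs_fresh_data:  the prediction must reflect data from seconds ago.
    """
    if something_waiting:
        return "realtime"
    # Nothing is waiting, so freshness decides.
    if needs_fresh_data:
        return "realtime"
    # "As of last night" is fine: take the cheaper, re-runnable option.
    return "batch"


print(choose_serving_mode(something_waiting=True))    # fraud check at checkout
print(choose_serving_mode(something_waiting=False))   # overnight lead scoring
```

The first call returns "realtime" and the second returns "batch", matching the examples listed for each mode.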
Why a builder cares
Teams burn money running realtime endpoints for jobs nothing waits on, or miss SLAs trying to batch a request a user is staring at. Choosing the mode by "who's waiting," and defaulting to the cheaper batch option when no one is, is the call you'll actually make. You'll compute the tradeoff next.