Batch vs realtime serving

There are two ways to run a model's predictions, and picking the wrong one wastes money or misses deadlines.

Realtime: one request at a time, behind an endpoint. Low latency (an answer in milliseconds) and fresh (it uses data as of the request), but a higher cost per prediction, because you keep a server warm. Use it when a user or workflow is waiting: a fraud check at checkout, a chatbot reply, routing a support ticket on submit.
Batch: score many rows at once in a scheduled job. Cheaper per item, but higher latency (results land hours later) and staler inputs. Use it when nothing is waiting: overnight lead scoring, a nightly no-show prediction run, re-tagging a document archive.

In this lesson's interactive example, the same 1000 predictions cost 5× as much in realtime as in batch. That's the price of "answer now."
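To make the 5× figure concrete, here is a minimal sketch of the arithmetic. The per-prediction prices are hypothetical, chosen only so the ratio comes out to 5×; real prices depend on your hardware, traffic, and provider.

```python
# Hypothetical per-prediction prices in dollars (assumptions, not real quotes).
# Realtime is priced higher because a warm server is paid for even when idle.
REALTIME_COST_PER_PREDICTION = 0.005
BATCH_COST_PER_PREDICTION = 0.001


def total_cost(n_predictions: int, cost_per_prediction: float) -> float:
    """Total cost of scoring n_predictions at a flat per-prediction price."""
    return n_predictions * cost_per_prediction


n = 1000
realtime_total = total_cost(n, REALTIME_COST_PER_PREDICTION)
batch_total = total_cost(n, BATCH_COST_PER_PREDICTION)
ratio = realtime_total / batch_total

print(f"realtime: ${realtime_total:.2f}")
print(f"batch:    ${batch_total:.2f}")
print(f"realtime costs {ratio:.1f}x as much as batch")
```

A flat per-prediction price is a simplification. In practice, realtime cost is driven mostly by how long the endpoint stays warm, so the gap widens when traffic is low.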

The decision rule

Ask one question: is something waiting on this prediction right now?

Yes → realtime.
No → batch (cheaper, simpler, and you can re-run it).

Freshness is the tiebreaker: if a prediction must reflect data from seconds ago, realtime; if "as of last night" is fine, batch.
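The rule above can be sketched as a small function. This is one reading of the rule, not a definitive implementation: the names `ServingMode`, `choose_mode`, `something_waiting`, and `needs_seconds_fresh_data` are hypothetical, and treating the freshness tiebreaker as a second check after "is something waiting?" is an assumption.

```python
from enum import Enum


class ServingMode(Enum):
    REALTIME = "realtime"
    BATCH = "batch"


def choose_mode(something_waiting: bool,
                needs_seconds_fresh_data: bool = False) -> ServingMode:
    """Pick a serving mode from the 'who's waiting' question.

    something_waiting: a user or workflow is blocked on this prediction now.
    needs_seconds_fresh_data: the prediction must reflect data from seconds
        ago (assumed here to act as a tiebreaker toward realtime).
    """
    if something_waiting:
        return ServingMode.REALTIME
    if needs_seconds_fresh_data:
        return ServingMode.REALTIME
    # Nothing is waiting and "as of last night" is fine: the cheaper default.
    return ServingMode.BATCH


# Examples taken from the text above.
fraud_check = choose_mode(something_waiting=True)
lead_scoring = choose_mode(something_waiting=False)
print(fraud_check.value, lead_scoring.value)
```

Defaulting `needs_seconds_fresh_data` to `False` mirrors the idea that batch is the default when no one is waiting, so you have to opt in to realtime explicitly.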

Why a builder cares

Teams burn money running realtime endpoints for jobs nothing waits on, or miss SLAs by batching a request a user is staring at. Choosing the mode by asking "who's waiting," and defaulting to the cheaper batch mode when no one is, is the call you'll actually make. You'll compute the tradeoff next.