Serving Modes
Overview
Model serving is how we deliver a model's predictions to users as a service. Users expect predictions when they request them, so we need to decide how and when those predictions are generated: ahead of time in batches, or on demand as requests arrive.
Batch Prediction
Batch prediction involves generating predictions on a set schedule, like once a day or week.
- Predictions run on a large dataset at once
- Best for static or offline predictions like monthly sales forecasts
- Simplest to implement; prefer it whenever the use case allows
Batch prediction suits tasks that don't require real-time responses; because nothing is computed at request time, it is easy to manage and scale.
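To make this concrete, here is a minimal batch-scoring sketch in Python. It assumes a scikit-learn model saved with joblib and features stored in a CSV file; the paths, file names, and cron schedule are hypothetical.

```python
import joblib
import pandas as pd

# Hypothetical paths; substitute your own model and data locations.
MODEL_PATH = "models/sales_forecaster.joblib"
INPUT_PATH = "data/features.csv"
OUTPUT_PATH = "data/predictions.csv"

def run_batch_job() -> None:
    model = joblib.load(MODEL_PATH)      # load the trained model once
    features = pd.read_csv(INPUT_PATH)   # score the entire dataset in one pass
    features["prediction"] = model.predict(features)
    features.to_csv(OUTPUT_PATH, index=False)  # persist results for downstream use

if __name__ == "__main__":
    # Typically triggered on a schedule, e.g. a daily cron entry:
    #   0 2 * * * python run_batch_job.py
    run_batch_job()
```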
On-demand Prediction
On-demand prediction generates predictions when an event occurs or a user makes a request. This method is more flexible than batch prediction.
- Triggered by events or user requests
- Ideal for use cases that require timely predictions
- More complex to implement than batch prediction
On-demand prediction provides flexibility and responsiveness, and allows users to get predictions when they need them.
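As a sketch of on-demand serving, the FastAPI endpoint below scores one request as it arrives instead of waiting for a scheduled run; the model path and feature names are placeholder assumptions, not a prescribed API.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("models/model.joblib")  # hypothetical path; load once at startup

class Features(BaseModel):
    amount: float       # placeholder feature names for illustration
    customer_age: int

@app.post("/predict")
def predict(features: Features):
    # Each incoming request triggers a single prediction.
    score = model.predict([[features.amount, features.customer_age]])[0]
    return {"prediction": float(score)}

# Run locally with, for example:  uvicorn serve:app --reload
```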
Latency Types
Latency refers to the time it takes for a model to respond after receiving a request. The acceptable latency varies depending on the use case.
- Near-real-time
  - Predictions take minutes; suitable for stream processing
  - Requests and responses arrive as data streams
- Real-time
  - Predictions take less than a second
  - For high-priority use cases like fraud detection
Lower latency means faster predictions, but achieving it can require stronger infrastructure or model optimization.
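One common way to hit near-real-time latency is micro-batching: let events accumulate briefly, then score them together. The sketch below models the stream with an in-process queue and uses a placeholder model call; a production system would read from a real stream (e.g. Kafka) instead.

```python
import queue
import time

# The "stream" is modeled by an in-process queue for illustration.
event_stream: "queue.Queue[dict]" = queue.Queue()

def predict_batch(events: list) -> list:
    """Placeholder for a real model call that scores a micro-batch."""
    return [0.0 for _ in events]

def consume_stream(window_seconds: float = 5.0) -> None:
    while True:
        time.sleep(window_seconds)       # wait out the micro-batch window
        batch = []
        while not event_stream.empty():  # drain whatever arrived in the window
            batch.append(event_stream.get())
        if batch:
            scores = predict_batch(batch)  # latency: seconds to minutes, not sub-second
            print(f"scored {len(batch)} events: {scores}")
```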
Edge Deployment
Edge deployment involves running models directly on users' devices. This minimizes latency by eliminating the network round trip to a cloud-hosted model.
- Reduces latency to almost zero
- Models run on smartphones, tablets, or other devices
- Examples: facial recognition, image filters, and navigation
Edge deployment improves speed and reduces dependence on cloud resources, making it ideal for applications requiring immediate responses.
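As one illustration of edge inference, the sketch below runs a TensorFlow Lite model directly on the device, so no network round trip is involved; the model file name and input are placeholders.

```python
import numpy as np
import tensorflow as tf

# Hypothetical compact model exported for on-device use.
interpreter = tf.lite.Interpreter(model_path="face_detector.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy image shaped to whatever the model expects, for illustration.
image = np.random.rand(*input_details[0]["shape"]).astype(np.float32)

interpreter.set_tensor(input_details[0]["index"], image)
interpreter.invoke()  # inference runs entirely on-device
prediction = interpreter.get_tensor(output_details[0]["index"])
print(prediction)
```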