The Gap Between Research and Production
A data science team trains a model. It performs well on validation data. Then it takes six months to get into production — and when it finally arrives, the live environment behaves differently from the training data, latency exceeds what the application can tolerate, and the model starts drifting within weeks.
This is not a failure of machine learning. It's a failure of infrastructure. ML pipelines — the systems responsible for data ingestion, feature engineering, model serving, monitoring, and retraining — are as critical as the models themselves. In real-time contexts, where predictions must be generated in milliseconds to be useful, pipeline architecture becomes the central engineering challenge.
This article walks through the key components, design decisions, and technology choices involved in building ML pipelines that scale — with a focus on low-latency, high-throughput production environments.
Anatomy of a Real-Time ML Pipeline
Before optimizing anything, you need to understand what you're building. A real-time ML pipeline involves several distinct layers that must be designed, deployed, and monitored independently.
┌─────────────────────────────────────────────────────────────────────┐
│                        REAL-TIME ML PIPELINE                        │
├─────────────┬──────────────┬──────────────┬──────────────┬──────────┤
│    DATA     │   FEATURE    │    MODEL     │   SERVING    │ MONITOR  │
│  INGESTION  │ ENGINEERING  │   TRAINING   │    LAYER     │  & OPS   │
│             │              │              │              │          │
│  Kafka      │  Feature     │  Batch       │ REST / gRPC  │  Drift   │
│  Kinesis    │  Store       │  retraining  │  endpoints   │  detect  │
│  Pub/Sub    │  (online +   │  Continuous  │  Model       │  A/B     │
│  Webhooks   │  offline)    │  learning    │  registry    │  testing │
└─────────────┴──────────────┴──────────────┴──────────────┴──────────┘
Each layer has distinct scaling characteristics, latency budgets, and failure modes. Real-time systems differ from batch systems primarily in the data ingestion and serving layers — the feature engineering and monitoring layers share many concerns with both paradigms.
1. Data Ingestion: Building the Live Feed
Real-time ML starts with a stream of events — user actions, sensor readings, transaction records, API calls. The ingestion layer's job is to capture these events reliably, at scale, with minimal latency, and deliver them to the systems that need them.
Event Streaming vs. Request-Response
The two primary ingestion patterns are event streaming (Kafka, AWS Kinesis, Google Pub/Sub) and synchronous request-response (the prediction service receives raw input directly via API call). The right choice depends on your latency requirements and whether you need to share events across multiple consumers.
| Approach | Latency | Throughput | Best For |
|---|---|---|---|
| Direct API Call | ~5–50ms | Medium | User-facing predictions, single consumer |
| Kafka / Kinesis | ~10–200ms | Very High | Multi-consumer events, audit trail, replay |
| CDC (Change Data Capture) | ~100–500ms | High | Database changes as event triggers |
| Batch micro-windows | ~1–60s | Very High | Near-real-time where <1s latency not needed |
Backpressure and Fault Tolerance
Traffic spikes are the rule, not the exception. Your ingestion layer must handle bursts gracefully — applying backpressure upstream when consumers are saturated, persisting events with durable storage (Kafka's log is the canonical example), and supporting replay for failure recovery and pipeline testing. Dropping events under load is not an acceptable failure mode for production ML systems.
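To make the backpressure idea concrete, here is a minimal in-memory sketch: a bounded buffer that refuses new events when saturated (telling the producer to slow down or retry) instead of silently dropping them. The class and method names are illustrative; real systems get this behaviour from Kafka's durable log and client-side buffering.

```python
from collections import deque

class BackpressureBuffer:
    """Bounded ingestion buffer: accept events until full, then signal
    the producer to back off rather than dropping events on the floor."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def offer(self, event):
        # Returns False when saturated -- the caller should pause or retry,
        # mirroring how Kafka producers block or buffer when brokers lag.
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def drain(self, max_batch):
        # Consumers pull bounded batches, freeing capacity for producers.
        batch = []
        while self.events and len(batch) < max_batch:
            batch.append(self.events.popleft())
        return batch

buf = BackpressureBuffer(capacity=3)
accepted = [buf.offer(i) for i in range(5)]
print(accepted)        # [True, True, True, False, False]
print(buf.drain(2))    # [0, 1]
print(buf.offer(99))   # True -- capacity freed after the drain
```

The key property is that rejection is explicit and recoverable: the producer learns it must slow down, and no event is lost without a trace.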
2. Feature Engineering at Low Latency
Feature engineering — transforming raw events into the numerical representations your model expects — is often the largest source of latency in real-time pipelines. It's also where the "training-serving skew" problem most frequently originates: the features computed offline during training differ from those computed online at inference time.
The Feature Store Pattern
A feature store solves two problems simultaneously: consistency between training and serving (both use the same feature definitions), and low-latency retrieval at inference time (pre-computed features are cached and served from a fast store rather than computed on demand).
Key Distinction
Feature stores have two distinct tiers. The offline store (typically a data warehouse like BigQuery or Snowflake) holds historical feature values used for training. The online store (Redis, DynamoDB, Cassandra) holds the latest feature values and serves them to the inference layer with sub-millisecond reads. Keeping these in sync — while guaranteeing consistency — is the core engineering challenge of feature store design.
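The online-store read path can be sketched with a plain dict standing in for Redis or DynamoDB; the function names and key scheme here are illustrative, not any particular feature store's API. The streaming job writes the latest values, and the inference layer does a single point read.

```python
import time

# Stand-in for an online store such as Redis: entity key -> latest features.
# In production this would be a hash read like redis.Redis().hgetall(key).
online_store = {}

def write_features(entity_id, features):
    # The materialization job writes the newest snapshot with a timestamp,
    # so serving always reads the latest known values.
    online_store[entity_id] = {"values": features, "updated_at": time.time()}

def read_features(entity_id, default=None):
    # Sub-millisecond point read at inference time; fall back to defaults
    # (or a cold-start policy) when the entity has no materialized features.
    entry = online_store.get(entity_id)
    return entry["values"] if entry else (default or {})

write_features("user:1001", {"purchase_count_7d": 4, "avg_session_duration": 182.5})
print(read_features("user:1001"))
print(read_features("user:9999", default={"purchase_count_7d": 0}))
```

The sync challenge mentioned above lives in `write_features`: the same feature definitions that populate the offline store must drive these writes, or training and serving silently diverge.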
Point-in-Time Correct Training Data
One of the most common and costly mistakes in ML engineering is "data leakage" from feature computation — accidentally using future information when computing training features for a past event. Feature stores with proper point-in-time query semantics prevent this by ensuring that when you hydrate features for a training example at time T, you only use feature values that were known at time T.
from feast import FeatureStore
from datetime import datetime
import pandas as pd

store = FeatureStore(repo_path=".")

# Entity dataframe with event timestamps — critical for point-in-time correctness
entity_df = pd.DataFrame({
    "user_id": [1001, 1002, 1003],
    "event_timestamp": [
        datetime(2025, 1, 10, 14, 30),
        datetime(2025, 1, 11, 9, 15),
        datetime(2025, 1, 12, 16, 45),
    ]
})

# Feast retrieves feature values as they existed AT each event_timestamp
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:purchase_count_7d",
        "user_features:avg_session_duration",
        "user_features:churn_risk_score",
    ],
).to_df()
3. Model Serving: The Inference Layer
Getting a trained model to serve predictions reliably under load is a separate engineering discipline from training it. The inference layer must satisfy strict latency SLAs, handle concurrent requests, version models safely, and integrate into your broader application architecture.
Serving Frameworks
Several frameworks have emerged specifically for production ML serving, each with different trade-offs:
- TorchServe / TF Serving: Framework-native options for PyTorch and TensorFlow respectively. Low overhead, mature, but tied to a single framework.
- Triton Inference Server (NVIDIA): Framework-agnostic, supports concurrent model execution, batching, and GPU optimization. The right choice when squeezing maximum throughput from GPU infrastructure.
- BentoML: Python-native, developer-friendly, with excellent support for custom preprocessing logic alongside model execution. Strong choice for teams prioritising iteration speed.
- FastAPI + custom serving: Maximum flexibility. Appropriate when models are simple (e.g. scikit-learn) or when you need tight control over the serving logic.
Dynamic Batching
Individual inference requests are inefficient on GPU hardware. Dynamic batching collects incoming requests within a short time window (typically 5–50ms) and processes them as a single batch, dramatically increasing GPU utilization without meaningfully increasing perceived latency. For high-traffic systems, this single optimization can reduce infrastructure costs by 3–5×.
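The mechanics of dynamic batching can be sketched in a few lines: take the first pending request, then wait up to the window for more before invoking the model once per batch. Parameter names here are illustrative; serving frameworks like Triton implement this internally with tuned defaults.

```python
import queue
import time

def run_batches(reqs, model_fn, max_batch=8, window_s=0.01):
    """Dynamic-batching sketch: collect requests for up to `window_s`
    (or until `max_batch`), then make one model call per batch."""
    batches = []
    while True:
        first = reqs.get()
        if first is None:                  # shutdown sentinel
            return batches
        batch = [first]
        deadline = time.monotonic() + window_s
        while len(batch) < max_batch:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                      # window expired, ship the batch
            try:
                item = reqs.get(timeout=timeout)
            except queue.Empty:
                break
            if item is None:
                reqs.put(None)             # keep sentinel for the outer loop
                break
            batch.append(item)
        batches.append(model_fn(batch))    # one model call per batch

reqs = queue.Queue()
for x in [1, 2, 3, 4, 5]:
    reqs.put(x)
reqs.put(None)
print(run_batches(reqs, model_fn=lambda xs: [x * x for x in xs], max_batch=3))
# With max_batch=3 the five requests become two batches: [1, 4, 9] and [16, 25]
```

The trade-off is visible in the two knobs: a larger window raises per-request latency slightly but packs fuller batches, which is where the GPU utilization gains come from.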
"Production ML is a software engineering problem first. The model is only as good as the infrastructure that keeps it running, up-to-date, and observable." — Chip Huyen, Designing Machine Learning Systems
Model Registry and Versioning
Never deploy a model directly from a training run. A model registry (MLflow, Weights & Biases, SageMaker Model Registry) creates a governed artifact store with versioning, metadata, performance benchmarks, and deployment history. This makes rollbacks trivial, enables canary deployments, and maintains a full audit trail of what's in production.
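The registry semantics that make rollbacks trivial can be illustrated with a toy in-memory version; real systems use MLflow or SageMaker Model Registry, and all names below are hypothetical. The essential idea: versions are immutable records, and "production" is just a pointer.

```python
import time

class ModelRegistry:
    """Toy sketch of registry semantics: immutable versioned records plus
    stage pointers, so promotion and rollback are pointer swaps."""
    def __init__(self):
        self.versions = {}   # model name -> list of version records
        self.stages = {}     # (name, stage) -> version number

    def register(self, name, artifact_uri, metrics):
        version = len(self.versions.setdefault(name, [])) + 1
        self.versions[name].append({
            "version": version, "artifact_uri": artifact_uri,
            "metrics": metrics, "registered_at": time.time(),
        })
        return version

    def promote(self, name, version, stage="production"):
        self.stages[(name, stage)] = version

    def current(self, name, stage="production"):
        v = self.stages.get((name, stage))
        return next(r for r in self.versions[name] if r["version"] == v)

reg = ModelRegistry()
reg.register("churn", "s3://models/churn/v1", {"auc": 0.81})
reg.register("churn", "s3://models/churn/v2", {"auc": 0.84})
reg.promote("churn", 2)
print(reg.current("churn")["artifact_uri"])   # s3://models/churn/v2
reg.promote("churn", 1)                       # rollback is just a pointer swap
print(reg.current("churn")["version"])        # 1
```

Because no version record is ever mutated, the full deployment history survives every rollback, which is exactly the audit trail described above.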
4. Scaling Strategies
Scaling an ML serving layer requires thinking across multiple dimensions simultaneously: compute capacity for inference, storage for features and artifacts, and network throughput for data movement.
Horizontal vs. Vertical Scaling
For CPU-based models (tree ensembles, logistic regression, lightweight tabular models), horizontal scaling via Kubernetes replicas is straightforward and cost-effective. For GPU-based deep learning models, vertical scaling (larger GPU instances) often provides better cost-efficiency at moderate traffic levels before horizontal scaling takes over.
Caching Inference Results
For use cases where the same input recurs frequently — product recommendations, content scoring, fraud scoring for repeat customers — caching inference results at the application layer can eliminate the majority of model calls entirely. A well-designed cache key strategy and appropriate TTL can reduce inference load by 40–70% for typical recommendation and ranking systems.
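A minimal cache-key strategy looks like the sketch below: hash the canonicalized features and prefix with the model version so every deploy naturally invalidates stale entries. The in-memory dict stands in for Redis; class and parameter names are illustrative.

```python
import hashlib
import json
import time

class InferenceCache:
    """TTL cache sketch for inference results; in production the same key
    scheme would typically back a Redis SETEX/GET pair."""
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.entries = {}

    @staticmethod
    def key(model_version, features):
        # Sorted-key JSON canonicalizes the input; including the model
        # version means a new deploy never serves a stale model's scores.
        payload = json.dumps(features, sort_keys=True)
        return f"{model_version}:{hashlib.sha256(payload.encode()).hexdigest()}"

    def get_or_compute(self, model_version, features, predict_fn):
        k = self.key(model_version, features)
        hit = self.entries.get(k)
        if hit and time.monotonic() - hit[1] < self.ttl_s:
            return hit[0], True                    # (result, cache_hit)
        result = predict_fn(features)
        self.entries[k] = (result, time.monotonic())
        return result, False

cache = InferenceCache(ttl_s=60)
score, hit = cache.get_or_compute("v3", {"user_id": 1001}, lambda f: 0.42)
score2, hit2 = cache.get_or_compute("v3", {"user_id": 1001}, lambda f: 0.42)
print(hit, hit2)   # False True
```

The TTL is the business lever: a fraud score might tolerate seconds of staleness, a product recommendation minutes, and the hit rate scales accordingly.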
Autoscaling Configuration
Kubernetes Horizontal Pod Autoscaler (HPA) works well for inference services when configured against custom metrics (requests per second or inference queue depth) rather than CPU alone. GPU-based workloads benefit from node autoscaling with instance warm-up buffers to avoid cold-start latency spikes during traffic surges.
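A custom-metric HPA of this kind might look like the following manifest sketch. The deployment name, metric name, and replica bounds are placeholders; the metric itself must be exposed through a metrics adapter such as the Prometheus adapter.

```yaml
# Illustrative autoscaling/v2 HPA scaling on requests per second
# rather than CPU; all names and thresholds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-service
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_per_second
        target:
          type: AverageValue
          averageValue: "200"
```

Scaling on request rate rather than CPU matters because inference services are often memory- or GPU-bound: CPU can sit low while the request queue grows.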
Sigmix Labs Engineering Practice
On projects where we've built real-time inference systems, we consistently find that infrastructure pre-work pays the largest dividends. Investing two weeks in a clean feature store, a robust serving framework, and a monitoring baseline prevents months of debugging mysterious model degradation in production. We treat MLOps infrastructure with the same rigour as application infrastructure.
5. Monitoring, Drift Detection, and Retraining
A model that performed well at deployment will eventually degrade. The world changes: user behaviour shifts, data distributions evolve, upstream systems change schemas. Monitoring catches this before it causes business impact.
What to Monitor
- Service-level metrics: Latency (p50/p95/p99), throughput, error rates — standard infrastructure observability.
- Data quality metrics: Missing values, out-of-range inputs, schema violations — catches upstream data issues before they corrupt predictions.
- Feature drift: Statistical comparison of feature distributions between training data and live data. Population Stability Index (PSI) and KL divergence are common metrics.
- Prediction drift: Distribution shift in model outputs, even when ground truth labels aren't yet available. A sudden change in the fraction of "high risk" predictions is a signal worth investigating.
- Outcome monitoring: Where ground truth is available (e.g. did the predicted churner actually churn?), tracking actual model performance over time is the gold standard.
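The PSI computation mentioned above fits in a short function: bin the training-time distribution, re-bin the live data on the same edges, and compare the proportions. The thresholds in the docstring are the commonly cited rule of thumb, not a standard.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time (expected) and live (actual) feature
    distribution. Commonly cited rule of thumb: <0.1 stable, 0.1-0.25
    moderate shift, >0.25 significant shift."""
    # Bin edges come from the training distribution so both samples are
    # compared on the same grid.
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    # Floor empty buckets so the log term stays defined.
    exp_pct = np.clip(exp_counts / exp_counts.sum(), 1e-6, None)
    act_pct = np.clip(act_counts / act_counts.sum(), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
# Same distribution: PSI near zero. Mean shifted by 0.8 sigma: PSI large.
print(round(population_stability_index(baseline, rng.normal(0.0, 1.0, 10_000)), 3))
print(round(population_stability_index(baseline, rng.normal(0.8, 1.0, 10_000)), 3))
```

In practice this runs per feature on a schedule, with the per-feature PSI values feeding the retraining triggers discussed next.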
Automated Retraining Triggers
Rather than retraining on a fixed schedule, mature ML systems retrain in response to observed signals: feature drift exceeding a threshold, performance metric degradation, or a defined volume of new labeled examples accumulating. This creates a feedback loop that keeps models aligned with current data without unnecessary retraining overhead.
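The trigger logic itself is simple once the monitoring signals exist; a sketch follows, with every threshold illustrative and in need of tuning per model and business tolerance.

```python
def should_retrain(feature_psi, current_auc, baseline_auc, new_labeled_count,
                   psi_threshold=0.25, auc_drop_threshold=0.03,
                   min_new_labels=50_000):
    """Signal-driven retraining check: returns the list of reasons that
    fired, so the pipeline can log *why* a retrain was kicked off."""
    reasons = []
    if max(feature_psi.values()) > psi_threshold:
        reasons.append("feature_drift")
    if baseline_auc - current_auc > auc_drop_threshold:
        reasons.append("performance_degradation")
    if new_labeled_count >= min_new_labels:
        reasons.append("label_volume")
    return reasons

print(should_retrain({"purchase_count_7d": 0.31, "session_len": 0.08},
                     current_auc=0.82, baseline_auc=0.84,
                     new_labeled_count=12_000))
# Only the drift on purchase_count_7d fires: ['feature_drift']
```

Returning the reasons rather than a bare boolean is deliberate: it gives the orchestrator an audit trail and lets different reasons route to different retraining jobs.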
6. Technology Stack Reference
There is no universal stack for real-time ML pipelines. But the following combinations cover the majority of production use cases we encounter:
| Layer | Open Source Options | Managed Cloud Options |
|---|---|---|
| Event Streaming | Apache Kafka, Redpanda | AWS Kinesis, Google Pub/Sub, Confluent Cloud |
| Stream Processing | Apache Flink, Spark Structured Streaming | AWS Kinesis Data Analytics, Google Dataflow |
| Feature Store | Feast, Hopsworks | AWS SageMaker Feature Store, Vertex AI Feature Store |
| Model Serving | BentoML, Triton, TorchServe | AWS SageMaker Endpoints, Vertex AI Endpoints |
| Model Registry | MLflow, DVC | W&B, SageMaker Model Registry |
| Orchestration | Prefect, Airflow, Dagster | Vertex AI Pipelines, SageMaker Pipelines |
| Monitoring | Evidently AI, Prometheus + Grafana | AWS Model Monitor, Arize, WhyLabs |
Starting Points Over Silver Bullets
The temptation when building ML infrastructure is to adopt a comprehensive platform that handles everything. In practice, the projects that succeed faster start with the simplest infrastructure that closes the gap between model and production — a REST endpoint, a basic feature computation layer, and a single latency metric — then evolve incrementally.
Complexity should be pulled in by demonstrated need, not pushed in by architectural ambition. A single Kafka topic, a Redis-backed feature cache, a BentoML service, and a Grafana dashboard can take a model from notebook to production serving 10,000 requests per minute in less time than designing a fully distributed feature store.
At Sigmix Labs, we help engineering teams scope the right level of ML infrastructure for their use case — building what's needed now with the architecture to grow. If you're navigating model deployment, pipeline design, or MLOps practices, we're ready to dig into the specifics with you.
