Operate models at scale

MLOps and Production ML

Focus on reproducibility, deployment, observability, governance, and the infrastructure decisions senior interviews often probe.

Featured topics

5 topic cards built for interview prep

Each topic includes a summary, practical learning goals, representative interview prompts, and a suggested roadmap day.

Intermediate · 85 min · Day 31

Data Validation and Data Quality

Cover production data quality checks for schemas, ranges, missingness, freshness, outliers, and contract testing.

Learning objectives

  • Write data checks that catch schema, distribution, and freshness failures
  • Separate hard failures from warn-only checks (see the sketch below)
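
A minimal sketch of the hard-fail/warn split, assuming a pandas DataFrame with hypothetical columns (user_id, amount, and a tz-aware event_ts); the 5% missingness and 24-hour freshness thresholds are placeholders, not recommendations:

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # True = hard failure stops the pipeline; False = warn only


def run_checks(df: pd.DataFrame) -> list[CheckResult]:
    results = []
    # Hard failure: missing columns mean downstream code cannot run.
    expected = {"user_id", "amount", "event_ts"}
    results.append(CheckResult("schema", expected.issubset(df.columns), blocking=True))
    # Hard failure: the entity key must never be null.
    results.append(CheckResult("key_not_null", bool(df["user_id"].notna().all()), blocking=True))
    # Warn only: moderate missingness is logged, not fatal.
    results.append(CheckResult("amount_missingness", df["amount"].isna().mean() < 0.05, blocking=False))
    # Warn only: freshness (assumes event_ts is tz-aware UTC).
    age = pd.Timestamp.now(tz="UTC") - df["event_ts"].max()
    results.append(CheckResult("freshness", age < pd.Timedelta("24h"), blocking=False))
    return results


def gate(results: list[CheckResult]) -> None:
    for r in results:
        if r.passed:
            continue
        if r.blocking:
            raise RuntimeError(f"hard data-quality failure: {r.name}")
        print(f"WARN: {r.name} failed (non-blocking)")
```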
Intermediate · 95 min · Day 31

Training Pipeline Orchestration

Cover scheduled and event-driven retraining, lineage, artifact storage, reproducibility, and failure recovery.

Learning objectives

  • Design batch, streaming, and triggered training pipelines
  • Track lineage across data, features, code, parameters, and model artifacts (see the sketch below)
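
One way to make lineage concrete is an immutable run record that hashes every input. This is a toy stand-in for a real metadata store; the file paths and the git call are assumptions about the project layout:

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def record_lineage(data_path: str, features: list[str], params: dict, model_path: str) -> dict:
    record = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "data_sha256": sha256_of(data_path),  # exact training data
        "features": sorted(features),         # feature set used
        "code_rev": subprocess.check_output(  # assumes a git checkout
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "params": params,                     # hyperparameters
        "model_sha256": sha256_of(model_path),  # resulting artifact
    }
    with open("lineage.json", "w") as f:      # stand-in for a metadata store
        json.dump(record, f, indent=2)
    return record
```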
Intermediate · 85 min · Day 32

Model Registry and CI/CD for ML

Cover versioning, promotion, rollback, and the differences between software CI/CD and ML release workflows.

Learning objectives

  • Track datasets, features, models, prompts, and evaluation reports together
  • Explain shadow deployments, canaries, and rollback signals (see the sketch below)
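
A hypothetical shape for "everything promotes together": a frozen release bundle plus a rollback predicate over canary metrics. All names and the 2% tolerance are invented for illustration, and metrics are assumed higher-is-better:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseBundle:
    """A model never promotes alone: pin everything the evaluation depended on."""
    model_version: str
    dataset_version: str
    feature_view_version: str
    prompt_version: str | None  # None for non-LLM models
    eval_report_uri: str


def should_rollback(canary: dict[str, float], baseline: dict[str, float],
                    tolerance: float = 0.02) -> bool:
    # Rollback signal: the canary regresses on any guarded metric
    # by more than the tolerance.
    return any(canary[m] < baseline[m] - tolerance for m in baseline)
```

In a shadow deployment the same predicate runs on mirrored traffic before any user sees the new version.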
Advanced · 100 min · Day 33

Model Serving Patterns

Compare batch, online, streaming, edge, shadow, canary, blue-green, and async serving patterns.

Learning objectives

  • Choose a serving pattern from latency, freshness, cost, and reliability constraints
  • Design fallbacks for model, feature, and dependency failures (see the sketch below)
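
A sketch of layered fallbacks for an online path; feature_client, model_client, cache, and the timeouts are all assumptions, and the stale-cache-then-default ordering is one policy among several:

```python
DEFAULT_SCORE = 0.0  # heuristic floor returned when everything else fails


def predict_with_fallback(entity_id, feature_client, model_client, cache):
    # Feature dependency failure: fall back to a (possibly stale) cached copy.
    try:
        features = feature_client.get(entity_id, timeout=0.05)
    except TimeoutError:
        features = cache.get(f"feat:{entity_id}")
    if features is None:
        return DEFAULT_SCORE  # no features at all: serve the heuristic default

    # Model failure: fall back to the last served prediction, then the default.
    try:
        return model_client.predict(features, timeout=0.1)
    except Exception:
        cached = cache.get(f"pred:{entity_id}")
        return cached if cached is not None else DEFAULT_SCORE
```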
Intermediate · 90 min · Day 34

Monitoring, Drift, and Retraining

Build an operational view of production models that covers data quality, drift, business outcomes, and recovery.

Learning objectives

  • Separate data drift, concept drift, and performance degradation
  • Choose alerting thresholds that drive action rather than noise (see the PSI sketch below)
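
A common opening move for the data-drift piece is the Population Stability Index, sketched here for a continuous feature (the 0.1/0.25 reading at the end is a rule of thumb, not a standard):

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference (training) distribution only.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6  # keep the log finite on empty bins
    ref_frac, cur_frac = ref_frac + eps, cur_frac + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Rule-of-thumb reading: < 0.1 stable, 0.1-0.25 investigate, > 0.25 alert.
```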

Practice prompts

Daily-plan topics tied directly to this pillar

These are pulled from the same 133-day roadmap content used by Browse Questions.

Day 57 · MLOps · ML lifecycle

Lifecycle: data → train → eval → deploy → monitor → retrain

  • Walk through the end-to-end ML lifecycle and the failure modes at each stage.
  • Where do most ML projects actually fail in the lifecycle, and what catches it earlier?
Day 57 · MLOps · ML lifecycle

ML roles: research, applied, MLE, MLOps, platform

  • Compare the responsibilities of an ML researcher vs an MLE vs an MLOps engineer.
  • When does a company actually need a dedicated ML platform team — and what's the smallest valid platform?
Day 57 · MLOps · ML lifecycle

Team topology & ownership boundaries

  • How would you structure an ML team at a 50-person startup vs a 5,000-person company?
  • Where do ownership disputes typically erupt between data, ML, and platform teams — and how do you preempt them?
Day 58 · MLOps · Feature stores

Why feature stores: training/serving skew, reuse

  • What is training/serving skew and how does a feature store eliminate it?
  • When is a feature store overkill?
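
One concrete way to frame the skew answer: the bug is usually two implementations of "the same" feature. A minimal illustration (spend_ratio is a made-up feature):

```python
def spend_ratio(spend_30d: float, spend_365d: float) -> float:
    """One feature definition, registered once."""
    return spend_30d / max(spend_365d, 1e-9)  # guard against divide-by-zero


# Without a feature store, training recomputes this in SQL and serving
# reimplements it in application code; the two drift apart silently.
# With one, offline materialization and the online lookup both execute
# this single definition, so the math cannot diverge.
```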
Day 58 · MLOps · Feature stores

Feast, Tecton, Vertex Feature Store

  • Walk through how Feast separates the offline and online stores.
  • Compare Feast vs Tecton vs Vertex Feature Store — when does each fit?
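
A sketch of the two Feast paths, assuming a repo already applied with a driver_stats feature view keyed by driver_id (names borrowed from Feast's quickstart; the field names are assumptions):

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes feature_store.yaml lives here

# Offline store: point-in-time-correct training frame.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-02"], utc=True),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_stats:avg_rating", "driver_stats:trips_today"],
).to_df()

# Online store: low-latency lookup of the same feature definitions at serving time.
online = store.get_online_features(
    features=["driver_stats:avg_rating", "driver_stats:trips_today"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```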
Day 58 · MLOps · Feature stores

Point-in-time correctness

  • Why is point-in-time correctness critical for training data, and how does a feature store handle it?
  • Walk me through how a leak from forward-looking features actually breaks model rollout.
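
A compact way to demonstrate the leak in an interview is pandas merge_asof, which only ever joins feature values at or before each label's timestamp (toy data; the spend_30d feature is invented):

```python
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-03-10", "2024-03-05"]),
    "label": [0, 1, 0],
})
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-02-20", "2024-03-05", "2024-03-08"]),
    "spend_30d": [10.0, 42.0, 7.0],
})

# For each label row, take the latest feature value at or before event_ts,
# never a future one. User 2's only feature lands on 2024-03-08, after the
# 2024-03-05 label, so it correctly comes back NaN instead of leaking.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
```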
Day 59 · MLOps · Model registry & versioning

MLflow registry: stages, lineage, model cards

  • Walk through promoting a model from staging → production in MLflow.
  • Why is lineage (data → model → deployment) important for compliance?
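
A minimal promotion sketch, assuming a model named churn-model with version 3 already registered. Newer MLflow favors aliases over the older fixed stages, so this uses the alias API (the legacy stage call is shown commented out):

```python
from mlflow import MlflowClient

client = MlflowClient()
name, version = "churn-model", "3"  # assumed registered model and version

# Alias-based promotion: point the "production" alias at the new version.
# Rollback is just repointing the alias at the previous version.
client.set_registered_model_alias(name=name, alias="production", version=version)

# Legacy (deprecated) stage API, still seen in older setups:
# client.transition_model_version_stage(
#     name=name, version=version, stage="Production",
#     archive_existing_versions=True,
# )

# The registered version carries its run_id, which links the deployed model
# back to the run's data, parameters, and metrics (the lineage compliance asks for).
mv = client.get_model_version_by_alias(name, "production")
print(mv.version, mv.run_id)
```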
Day 59 · MLOps · Model registry & versioning

Versioning: code + data + model + features

  • Why is versioning data as important as versioning code in ML?
  • What does DVC give you that git-LFS doesn't?
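
A sketch of the DVC side of that answer: Git stores a small pointer file, and dvc.api resolves it back to the exact data at any revision (the path and tag are hypothetical):

```python
import dvc.api

# Read a dataset exactly as it existed at a Git tag: Git versions the small
# .dvc pointer file, and DVC resolves it to content-addressed storage.
data = dvc.api.read(
    "data/train.csv",  # hypothetical DVC-tracked path
    repo=".",          # a Git URL works here too
    rev="v1.2.0",      # any Git ref: tag, branch, or commit
)

# Beyond pointer files (which git-LFS also has), DVC adds reproducible
# pipelines (dvc repro), a shared content-addressed cache, and pluggable
# cloud remotes, which git-LFS does not cover.
url = dvc.api.get_url("data/train.csv", repo=".", rev="v1.2.0")
```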