Clarify migration risk
Ask why the migration is happening: reliability, cost, latency, feature reuse, compliance, or developer velocity. Then identify the non-negotiables: zero downtime, prediction parity, and reversible rollout.
Migration architecture
- Baseline audit: document legacy features, model artifact, preprocessing, dependencies, and current latency.
- Dual feature path: compute old and new features for the same requests.
- Shadow serving: send live traffic to the new stack without affecting user decisions.
- Parity dashboard: compare features, predictions, confidence, errors, and latency.
- Canary ramp: move 1%, 5%, 25%, 50%, then 100% only if guardrails pass.
- Rollback: keep legacy serving and feature paths warm until post-migration confidence is high.
Checks
- feature value parity by feature and segment
- prediction score differences and decision differences
- p50, p95, p99 latency
- error rates and timeout rates
- offline replay and online shadow comparison
- model registry and artifact lineage
Failure modes
- Feature skew: same feature name but different semantics.
- Silent preprocessing drift: missing value handling or categorical encoding changes.
- Rollback gap: old system is decommissioned too early.
- Shadow mismatch: shadow path lacks exact production context.
What the architect signal looks like
State the rollback trigger before ramping traffic. A good migration answer is more about risk control than new platform enthusiasm.