ML System Design

Design an Anomaly Detection Platform

Detect actionable anomalies across metrics, logs, transactions, or infrastructure events without flooding responders.

Advanced · Weak labels · Alerting · Time-series baselines · Unsupervised learning · Feedback loops

Prompt

Design an anomaly detection platform that monitors many time series or events, alerts responders, learns from feedback, and controls false positives.

Evaluation lens

Precision at alert budget · Time to detect · False positives · Incident usefulness · Drift

Clarify what counts as an anomaly

An anomaly is not simply a rare point. It must be actionable. Ask who receives the alert, what they can do, and how many alerts per day are acceptable.

Architecture

  1. Ingestion: metrics, logs, traces, events, transactions, and metadata.
  2. Baseline models: seasonal statistics, robust z-scores, EWMA, isolation forests, or forecasting residuals.
  3. Contextual rules: suppress known deployments, holidays, maintenance windows, and duplicate alerts.
  4. Alert router: group related anomalies and route by service, product, or owner.
  5. Feedback loop: responders mark useful, duplicate, benign, or missed alerts.
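As an illustration of the baseline-model step, the sketch below combines a robust z-score (median/MAD instead of mean/std) with an EWMA residual check. This is a minimal sketch, not the platform's implementation; `alpha` and `k` are illustrative assumptions rather than tuned values.

```python
import statistics


def robust_z(value, history):
    """Robust z-score using median and MAD, so a few past
    spikes in `history` do not inflate the baseline."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history) or 1e-9
    return 0.6745 * (value - med) / mad


class EWMADetector:
    """Flags a point when its deviation from an exponentially
    weighted moving average exceeds k times the EWMA of
    absolute deviations (a crude adaptive threshold)."""

    def __init__(self, alpha=0.3, k=4.0):
        self.alpha, self.k = alpha, k
        self.mean = None
        self.dev = 0.0

    def update(self, x):
        if self.mean is None:       # first point seeds the baseline
            self.mean = x
            return False
        err = x - self.mean
        is_anomaly = abs(err) > self.k * self.dev and self.dev > 0
        # update baseline and deviation estimates after scoring
        self.mean += self.alpha * err
        self.dev += self.alpha * (abs(err) - self.dev)
        return is_anomaly
```

In practice each series would get its own detector instance, with seasonal decomposition applied upstream so daily and weekly cycles do not dominate the residuals.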

Metrics

Use precision at alert budget, time to detect, time to acknowledge, incident correlation, and responder feedback. If labels are weak, sampled human review is better than pretending all alerts are labeled.
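Precision at alert budget can be computed directly from ranked alerts and responder feedback. A minimal sketch, assuming each alert carries a model score and a usefulness label from feedback or sampled review:

```python
def precision_at_budget(alerts, budget):
    """Precision among the top-`budget` alerts ranked by score.

    `alerts` is a list of (score, is_useful) pairs, where
    is_useful (0 or 1) comes from responder feedback or
    sampled human review of fired alerts.
    """
    top = sorted(alerts, key=lambda a: a[0], reverse=True)[:budget]
    if not top:
        return 0.0
    return sum(useful for _, useful in top) / len(top)
```

Fixing the budget (e.g. 20 alerts per day per team) makes the metric comparable across model changes: a new model only wins if it puts more useful alerts inside the same budget.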

Failure modes

  • Alert fatigue: too many low-value alerts destroy trust.
  • Seasonality miss: daily or weekly cycles look anomalous.
  • Concept drift: normal behavior changes after launches or growth.
  • Correlation blindness: hundreds of metrics alert for one root cause.
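Correlation blindness is usually attacked with grouping before routing. One simple sketch (field names `service`, `ts`, `metric` are assumptions for illustration): bucket alerts by service and time window so one root cause pages once instead of hundreds of times.

```python
from collections import defaultdict


def group_alerts(alerts, window_s=300):
    """Collapse alerts that share a service and fire within the
    same time window into one group for routing.

    Each alert is a dict with "service", "ts" (epoch seconds),
    and "metric" keys.
    """
    groups = defaultdict(list)
    for a in alerts:
        bucket = a["ts"] // window_s
        groups[(a["service"], bucket)].append(a)
    return list(groups.values())
```

Real systems go further, grouping across services via dependency graphs or shared deploy events, but even time-and-owner bucketing removes most duplicate pages.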

What the architect signal looks like

Close with an alert budget and a suppression strategy. A senior answer optimizes responder usefulness, not model novelty.
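A suppression strategy can be stated concretely. The sketch below (a hypothetical shape, assuming alerts carry `service`, `metric`, and `ts` fields) drops alerts that fall inside maintenance windows or duplicate a recent alert for the same series:

```python
def suppress(alert, recent, maintenance_windows, dedupe_s=3600):
    """Return True if the alert should be suppressed.

    - `recent` maps (service, metric) -> last alert timestamp,
      mutated in place when an alert is allowed through.
    - `maintenance_windows` is a list of (start, end) epoch
      ranges during which alerts are muted.
    """
    for start, end in maintenance_windows:
        if start <= alert["ts"] <= end:
            return True  # muted: planned maintenance
    key = (alert["service"], alert["metric"])
    last = recent.get(key)
    if last is not None and alert["ts"] - last < dedupe_s:
        return True  # duplicate of a recent alert on this series
    recent[key] = alert["ts"]
    return False
```

Closing an interview answer with this kind of explicit budget-plus-suppression policy signals that you are optimizing what responders actually experience.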