Clarify what counts as an anomaly
An anomaly is not simply a rare data point; it must be actionable. Before choosing a model, ask who receives the alert, what they can do about it, and how many alerts per day are acceptable.
Architecture
- Ingestion: metrics, logs, traces, events, transactions, and metadata.
- Baseline models: seasonal statistics, robust z-scores, EWMA, isolation forests, or forecasting residuals.
- Contextual rules: suppress known deployments, holidays, maintenance windows, and duplicate alerts.
- Alert router: group related anomalies and route by service, product, or owner.
- Feedback loop: responders mark useful, duplicate, benign, or missed alerts.
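As a rough illustration of how a baseline model and a contextual rule fit together, here is a minimal sketch of a robust z-score detector that drops alerts falling inside maintenance windows. The function names, the `(timestamp, value)` input shape, and the 3.5 threshold are assumptions made for this example, not part of any particular system.

```python
from statistics import median

def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; this sketch skips scoring.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores
    # under a normal distribution.
    return [0.6745 * (v - med) / mad for v in values]

def detect(points, maintenance_windows, threshold=3.5):
    """points: list of (timestamp, value); maintenance_windows: list of (start, end)."""
    scores = robust_z_scores([value for _, value in points])
    alerts = []
    for (ts, value), score in zip(points, scores):
        if abs(score) < threshold:
            continue  # within the normal range
        if any(start <= ts <= end for start, end in maintenance_windows):
            continue  # contextual rule: suppress known maintenance
        alerts.append((ts, value, score))
    return alerts

# Hypothetical series: two spikes, the second inside a maintenance window.
points = [(0, 10), (1, 11), (2, 10), (3, 12), (4, 11), (5, 50), (6, 10), (7, 48)]
alerts = detect(points, maintenance_windows=[(7, 7)])
```

Median and MAD are used instead of mean and standard deviation so that the spikes themselves do not inflate the baseline they are measured against.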
Metrics
Use precision at alert budget (the share of useful alerts among the top N the team can handle per day), time to detect, time to acknowledge, incident correlation, and responder feedback. If labels are weak, sampled human review is better than pretending all alerts are labeled.
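Precision at alert budget can be computed from a reviewed sample of alerts. The sketch below assumes each reviewed alert is a `(score, is_useful)` pair, where `is_useful` comes from responder feedback or sampled human review; the function name and input shape are illustrative.

```python
def precision_at_budget(reviewed_alerts, budget):
    """Fraction of useful alerts among the `budget` highest-scoring ones.

    reviewed_alerts: list of (score, is_useful) pairs.
    budget: number of alerts the team can handle in the period.
    """
    top = sorted(reviewed_alerts, key=lambda a: a[0], reverse=True)[:budget]
    if not top:
        return 0.0
    return sum(1 for _, useful in top if useful) / len(top)

# Hypothetical reviewed sample.
reviewed = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.2, False)]
p_at_3 = precision_at_budget(reviewed, budget=3)
```

Here two of the three highest-scoring alerts were marked useful, so precision at a budget of 3 is 2/3.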
Failure modes
- Alert fatigue: too many low-value alerts destroy trust.
- Seasonality miss: normal daily or weekly cycles look anomalous when the baseline ignores them.
- Concept drift: normal behavior changes after launches or growth.
- Correlation blindness: hundreds of metrics alert for one root cause.
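One simple mitigation for correlation blindness is to collapse alerts from the same service that fire close together into a single grouped alert. The sketch below assumes alerts are dicts with `service`, `metric`, and `ts` (epoch seconds) keys and uses a 5-minute window; both are illustrative choices, and grouping by service is only a proxy for shared root cause.

```python
from collections import defaultdict

def group_alerts(alerts, window_seconds=300):
    """Merge alerts for the same service that fire within window_seconds of each other."""
    by_service = defaultdict(list)  # service -> list of groups (each a list of alerts)
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        groups = by_service[alert["service"]]
        # Extend the latest group if this alert is close to its most recent member.
        if groups and alert["ts"] - groups[-1][-1]["ts"] <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return [
        {
            "service": service,
            "start": group[0]["ts"],
            "metrics": sorted({a["metric"] for a in group}),
        }
        for service, groups in by_service.items()
        for group in groups
    ]

# Hypothetical alerts: two checkout metrics fire together, one fires much later.
raw = [
    {"service": "checkout", "metric": "latency", "ts": 100},
    {"service": "checkout", "metric": "error_rate", "ts": 160},
    {"service": "search", "metric": "cpu", "ts": 120},
    {"service": "checkout", "metric": "latency", "ts": 1000},
]
grouped = group_alerts(raw)
```

Four raw alerts become three routed alerts, and the first checkout group lists both metrics so the responder sees them as one event.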
What the architect signal looks like
Close with an alert budget and a suppression strategy. A senior answer optimizes responder usefulness, not model novelty.
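An alert budget can be turned into a concrete threshold by replaying historical anomaly scores. The sketch below assumes scores are already computed for a past period and that an alert fires when a score is strictly greater than the threshold; the function name and parameters are hypothetical.

```python
def threshold_for_budget(historical_scores, days, alerts_per_day):
    """Pick a threshold so that `score > threshold` would have fired
    at most days * alerts_per_day times over the historical period."""
    allowed = int(days * alerts_per_day)
    if allowed <= 0:
        return float("inf")  # no budget: never alert
    ranked = sorted(historical_scores, reverse=True)
    if allowed >= len(ranked):
        return float("-inf")  # budget exceeds history: everything fits
    # Scores strictly above the (allowed + 1)-th highest number at most `allowed`.
    return ranked[allowed]

# Hypothetical scores from two days of history, with a budget of 1.5 alerts per day.
scores = [1, 9, 3, 7, 5, 8, 2, 6]
threshold = threshold_for_budget(scores, days=2, alerts_per_day=1.5)
fired = [s for s in scores if s > threshold]
```

The threshold should be recomputed periodically, since concept drift changes the score distribution and therefore the alert volume at a fixed cutoff.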