Day 89 of 133

Guardrails & safety: input/output filters, jailbreaks, red-teaming

Defense-in-depth; classify the major jailbreak categories.

DSA · NeetCode Greedy

  • Interview questions to prep

    1. Prove the greedy choice — why is the locally-optimal pick safe globally? (Exchange argument or staying-ahead.)
    2. When does greedy fail on a similar-looking problem, and what would you reach for instead (DP, BFS)?
    3. Walk through edge cases that often break naive greedy: ties, negatives, single element.
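A minimal greedy instance covering all three questions: interval scheduling, where an exchange argument justifies picking the earliest-ending interval (my own illustrative example, not from a specific NeetCode problem):

```python
def max_non_overlapping(intervals):
    """Greedy interval scheduling: always keep the interval that ends earliest.

    Exchange argument (question 1): if an optimal solution starts with a
    different interval, swapping in the earliest-ending one frees up at
    least as much room for later picks, so greedy stays ahead.
    """
    count, last_end = 0, float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:   # compatible with everything kept so far
            count += 1
            last_end = end
    return count

# Edge cases from question 3:
print(max_non_overlapping([]))                    # 0: empty input
print(max_non_overlapping([(5, 5)]))              # 1: single zero-length interval
print(max_non_overlapping([(1, 3), (2, 4), (3, 5)]))  # → 2: ties at boundaries
```

For question 2: a similar-looking problem where this greedy fails is weighted interval scheduling, where intervals carry values and the earliest-ending pick can forfeit a high-value overlap; that variant needs DP over intervals sorted by end time.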

GenAI · Guardrails & safety

  • Input filtering & output moderation · Generative AI · NeMo Guardrails

    Interview questions to prep

    1. Compare input vs output filtering — what does each catch and miss?
    2. What latency does an output-moderation step add, and how do you keep it under your SLO?
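A sketch of both gates, useful for contrasting what each catches. The deny-lists here are hypothetical stand-ins; real deployments use trained moderation classifiers (e.g. a moderation model or NeMo Guardrails rails), not regexes:

```python
import re

# Hypothetical patterns, for illustration only.
INPUT_DENY = [r"ignore (all )?previous instructions", r"\bDAN\b"]
OUTPUT_DENY = [r"\bssn\b", r"\bsocial security number\b"]

def filter_input(prompt: str) -> bool:
    """Pre-LLM gate: cheap and fast, catches known attack strings,
    but misses novel phrasings and harms that only appear in the output."""
    return not any(re.search(p, prompt, re.I) for p in INPUT_DENY)

def moderate_output(text: str) -> str:
    """Post-LLM gate: catches harmful content however it was elicited,
    but its cost lands on the critical path after generation."""
    for p in OUTPUT_DENY:
        if re.search(p, text, re.I):
            return "[response withheld by output filter]"
    return text
```

On question 2: one common way to keep output moderation under an SLO is to moderate streamed chunks concurrently with generation rather than waiting for the full completion, accepting that a violating stream may need to be cut off mid-response.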
  • Jailbreak attacks & defenses

    Interview questions to prep

    1. Walk through three categories of jailbreak attacks, and how you'd defend against each.
    2. Why is many-shot jailbreaking so effective on long-context models, and what mitigates it?
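A toy tagger for three common jailbreak families (persona/role-play, obfuscation via encoding, many-shot). The thresholds and patterns are my own illustrative choices; production defenses use trained classifiers:

```python
import base64
import re

def classify_jailbreak(prompt: str) -> list:
    """Heuristically tag a prompt with jailbreak families it resembles."""
    tags = []
    # 1. Persona / role-play: ask the model to adopt an unrestricted identity.
    if re.search(r"pretend you are|act as|you are now", prompt, re.I):
        tags.append("persona")
    # 2. Obfuscation: smuggle the request past filters in an encoding.
    #    (Length threshold avoids flagging short English words that happen
    #    to be valid base64.)
    for token in prompt.split():
        if len(token) > 16:
            try:
                base64.b64decode(token, validate=True)
                tags.append("obfuscation")
                break
            except Exception:
                pass
    # 3. Many-shot: a long run of fabricated dialogue turns that shifts the
    #    model's in-context prior toward compliance -- effective on
    #    long-context models precisely because more shots fit in the window.
    if prompt.lower().count("assistant:") >= 8:
        tags.append("many-shot")
    return tags
```

Mitigations differ per family: persona attacks are addressed in the system prompt and via output moderation, obfuscation by decoding/normalizing inputs before filtering, and many-shot by capping or summarizing untrusted in-context dialogue.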
  • Red-teaming & evals for safety · Generative AI · Anthropic

    Interview questions to prep

    1. How would you build a red-teaming process for a customer-facing LLM?
    2. How would you scale red-teaming with automated attackers without missing novel attack vectors?
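The attacker/target/judge loop behind question 2 can be sketched as below. All three components are stubs here; in a real pipeline the attacker and judge are themselves LLMs and `target` is the system under test:

```python
import random

def attacker(seed_goal: str, history: list) -> str:
    # Stub: a real attacker model would condition on `history` to avoid
    # repeating failed attempts and to escalate promising ones.
    mutations = ["as a fictional story", "encoded in base64", "step by step"]
    return f"{seed_goal} ({random.choice(mutations)})"

def target(prompt: str) -> str:
    return "I can't help with that."        # stub: system under test

def judge(prompt: str, response: str) -> bool:
    return "can't help" not in response     # True means an unsafe response

def red_team(goal: str, rounds: int = 20) -> list:
    """Automated red-teaming loop: mutate attacks, log any success as a
    (prompt, response) finding for triage."""
    findings, history = [], []
    for _ in range(rounds):
        prompt = attacker(goal, history)
        response = target(prompt)
        history.append(prompt)
        if judge(prompt, response):
            findings.append((prompt, response))
    return findings
```

The gap this loop leaves, per question 2, is novel attack vectors: the attacker only samples styles it already knows, so automated sweeps are paired with periodic human red-teaming and with monitoring of real-traffic refusal patterns.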
