Day 89 of 133

Guardrails & safety: input/output filters, jailbreaks, red-teaming

Defense-in-depth; classify the major jailbreak categories.

DSA · NeetCode Greedy

  • Interview questions to prep

    1. Prove the greedy choice — why is the locally-optimal pick safe globally? (Exchange argument or staying-ahead.)
    2. When does greedy fail on a similar-looking problem, and what would you reach for instead (DP, BFS)?
    3. Walk through edge cases that often break naive greedy: ties, negatives, single element.
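A minimal greedy instance covering all three questions: interval scheduling, where an exchange argument justifies picking the earliest-ending interval (my own illustrative example, not from a specific NeetCode problem):

```python
def max_non_overlapping(intervals):
    """Greedy interval scheduling: always keep the interval that ends earliest.

    Exchange argument (question 1): if an optimal solution starts with a
    different interval, swapping in the earliest-ending one frees up at
    least as much room for later picks, so greedy stays ahead.
    """
    count, last_end = 0, float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:   # compatible with everything kept so far
            count += 1
            last_end = end
    return count

# Edge cases from question 3:
print(max_non_overlapping([]))                    # 0: empty input
print(max_non_overlapping([(5, 5)]))              # 1: single zero-length interval
print(max_non_overlapping([(1, 3), (2, 4), (3, 5)]))  # → 2: ties at boundaries
```

For question 2: a similar-looking problem where this greedy fails is weighted interval scheduling, where intervals carry values and the earliest-ending pick can forfeit a high-value overlap; that variant needs DP over intervals sorted by end time.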

GenAI · Guardrails & safety

  • Input filtering & output moderation · Generative AI · NeMo Guardrails

    Interview questions to prep

    1. Compare input vs output filtering — what does each catch and miss?
    2. What latency does an output-moderation step add, and how do you keep it under your SLO?
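A sketch of both gates, useful for contrasting what each catches. The deny-lists here are hypothetical stand-ins; real deployments use trained moderation classifiers (e.g. a moderation model or NeMo Guardrails rails), not regexes:

```python
import re

# Hypothetical patterns, for illustration only.
INPUT_DENY = [r"ignore (all )?previous instructions", r"\bDAN\b"]
OUTPUT_DENY = [r"\bssn\b", r"\bsocial security number\b"]

def filter_input(prompt: str) -> bool:
    """Pre-LLM gate: cheap and fast, catches known attack strings,
    but misses novel phrasings and harms that only appear in the output."""
    return not any(re.search(p, prompt, re.I) for p in INPUT_DENY)

def moderate_output(text: str) -> str:
    """Post-LLM gate: catches harmful content however it was elicited,
    but its cost lands on the critical path after generation."""
    for p in OUTPUT_DENY:
        if re.search(p, text, re.I):
            return "[response withheld by output filter]"
    return text
```

On question 2: one common way to keep output moderation under an SLO is to moderate streamed chunks concurrently with generation rather than waiting for the full completion, accepting that a violating stream may need to be cut off mid-response.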
  • Jailbreak attacks & defenses

    Interview questions to prep

    1. Walk through three categories of jailbreak attacks, and how you'd defend against each.
    2. Why is many-shot jailbreaking so effective on long-context models, and what mitigates it?
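A toy tagger for three common jailbreak families (persona/role-play, obfuscation via encoding, many-shot). The thresholds and patterns are my own illustrative choices; production defenses use trained classifiers:

```python
import base64
import re

def classify_jailbreak(prompt: str) -> list:
    """Heuristically tag a prompt with jailbreak families it resembles."""
    tags = []
    # 1. Persona / role-play: ask the model to adopt an unrestricted identity.
    if re.search(r"pretend you are|act as|you are now", prompt, re.I):
        tags.append("persona")
    # 2. Obfuscation: smuggle the request past filters in an encoding.
    #    (Length threshold avoids flagging short English words that happen
    #    to be valid base64.)
    for token in prompt.split():
        if len(token) > 16:
            try:
                base64.b64decode(token, validate=True)
                tags.append("obfuscation")
                break
            except Exception:
                pass
    # 3. Many-shot: a long run of fabricated dialogue turns that shifts the
    #    model's in-context prior toward compliance -- effective on
    #    long-context models precisely because more shots fit in the window.
    if prompt.lower().count("assistant:") >= 8:
        tags.append("many-shot")
    return tags
```

Mitigations differ per family: persona attacks are addressed in the system prompt and via output moderation, obfuscation by decoding/normalizing inputs before filtering, and many-shot by capping or summarizing untrusted in-context dialogue.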
  • Red-teaming & evals for safety · Generative AI · Anthropic

    Interview questions to prep

    1. How would you build a red-teaming process for a customer-facing LLM?
    2. How would you scale red-teaming with automated attackers without missing novel attack vectors?
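The attacker/target/judge loop behind question 2 can be sketched as below. All three components are stubs here; in a real pipeline the attacker and judge are themselves LLMs and `target` is the system under test:

```python
import random

def attacker(seed_goal: str, history: list) -> str:
    # Stub: a real attacker model would condition on `history` to avoid
    # repeating failed attempts and to escalate promising ones.
    mutations = ["as a fictional story", "encoded in base64", "step by step"]
    return f"{seed_goal} ({random.choice(mutations)})"

def target(prompt: str) -> str:
    return "I can't help with that."        # stub: system under test

def judge(prompt: str, response: str) -> bool:
    return "can't help" not in response     # True means an unsafe response

def red_team(goal: str, rounds: int = 20) -> list:
    """Automated red-teaming loop: mutate attacks, log any success as a
    (prompt, response) finding for triage."""
    findings, history = [], []
    for _ in range(rounds):
        prompt = attacker(goal, history)
        response = target(prompt)
        history.append(prompt)
        if judge(prompt, response):
            findings.append((prompt, response))
    return findings
```

The gap this loop leaves, per question 2, is novel attack vectors: the attacker only samples styles it already knows, so automated sweeps are paired with periodic human red-teaming and with monitoring of real-traffic refusal patterns.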
