Day 82 of 133

RAG evaluation: Ragas + LLM-as-judge + DSA Heap review

Core metrics: faithfulness, context precision, context recall, and answer relevance (sketched in the Ragas example below).
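A minimal sketch of computing these with Ragas, assuming the classic `evaluate` API (a `datasets.Dataset` with `question`, `answer`, `contexts`, and `ground_truth` columns, plus a configured judge LLM such as the default OpenAI backend). Newer Ragas releases use an `EvaluationDataset` interface instead, so check the installed version; the example row is made up.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One made-up evaluation row: the user question, the generated answer, the
# retrieved chunks, and a reference answer for the recall-style metrics.
eval_rows = {
    "question": ["What is the refund window for annual plans?"],
    "answer": ["Annual plans can be refunded within 30 days of purchase."],
    "contexts": [["Refund policy: annual subscriptions are refundable for 30 days."]],
    "ground_truth": ["Annual plans are refundable within 30 days of purchase."],
}

result = evaluate(
    Dataset.from_dict(eval_rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.9, ...}
```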

DSA · NeetCode Heap / Priority Queue

  • Find Median From Data Stream

    Interview questions to prep

    1. Walk through the two-heaps trick (max-heap for the lower half, min-heap for the upper half). What invariant ties them together? (See the sketch below.)
    2. What's the space cost over a long-running stream, and how would you bound it (windowed median)?
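A minimal Python sketch of the two-heaps approach behind the LeetCode MedianFinder interface, using heapq min-heaps and negating values on the max-heap side; an illustration of the invariant, not a tuned submission.

```python
import heapq


class MedianFinder:
    """Two-heaps median: a max-heap holds the lower half, a min-heap the upper half.

    Invariant: every element in `small` <= every element in `large`, and
    len(small) is either equal to len(large) or exactly one larger.
    """

    def __init__(self):
        self.small = []  # lower half, max-heap simulated by negating values
        self.large = []  # upper half, plain min-heap

    def addNum(self, num: int) -> None:
        # Push into the lower half, then move its max across so ordering holds.
        heapq.heappush(self.small, -num)
        heapq.heappush(self.large, -heapq.heappop(self.small))
        # Rebalance so `small` never has fewer elements than `large`.
        if len(self.large) > len(self.small):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def findMedian(self) -> float:
        if len(self.small) > len(self.large):
            return float(-self.small[0])
        return (-self.small[0] + self.large[0]) / 2.0


mf = MedianFinder()
for x in (5, 15, 1, 3):
    mf.addNum(x)
print(mf.findMedian())  # 4.0 (median of 1, 3, 5, 15)
```

addNum is O(log n) and findMedian is O(1), but space grows as O(n) over an unbounded stream; bounding it with a windowed median needs eviction support (for example lazy deletion with a count of stale entries, or an indexed/order-statistics structure), which plain heapq does not provide.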

GenAI · RAG evaluation

  • Ragas metrics & hallucination failure modes

    Interview questions to prep

    1. Walk through the four Ragas metrics and what each tells you.
    2. How would you build a golden eval set for a domain-specific RAG?
    3. Why is hallucination often a grounding or retrieval failure rather than only a model-quality failure?
    4. How would you separate hallucinations caused by missing context, bad retrieval, ambiguous prompts, and missing guardrails?
  • LLM-as-judge for RAG (Evidently)

    Interview questions to prep

    1. Compare LLM-as-judge vs human eval for RAG — when does LLM-as-judge fail?
    2. What biases would an LLM judge introduce, and how do you control for them?
  • Long-context evals

    Interview questions to prep

    1. Why are needle-in-haystack tests easy to game, and what are stronger evals?
    2. What does 'lost in the middle' mean for long-context LLMs, and how do you mitigate it? (A reordering sketch follows this list.)
  • RAG evaluation in production

    Interview questions to prep

    1. How would you create a golden dataset for RAG from real user questions, documents, and expected citations?
    2. Which business metrics would you track for RAG: deflection, task success, escalation rate, latency, and cost?
    3. What dashboard slices show whether retrieval or generation is failing in production?
    4. How would you teach an assistant to say 'I don't know' when confidence or grounding drops? (See the grounding-gate sketch after this list.)
    5. What validation checks would you add before allowing a grounded answer to reach the user?
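On the long-context bullet above: one common mitigation for 'lost in the middle' is to reorder retrieved chunks so the highest-scoring ones sit at the start and end of the prompt, pushing the weakest into the middle. A minimal plain-Python sketch of that idea (similar in spirit to reorderers such as LangChain's LongContextReorder, not its exact code); the chunk strings are placeholders.

```python
from typing import List


def reorder_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Place the most relevant chunks at the edges of the prompt.

    Assumes the input is sorted most-relevant-first (e.g. by retriever score).
    Rank 1 goes to the front, rank 2 to the back, rank 3 to the front, and so
    on, so the weakest chunks land in the middle, where long-context models
    are most likely to overlook them.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# Ranks 1..6 come out as 1, 3, 5, 6, 4, 2: strongest at both ends.
print(reorder_for_long_context(["c1", "c2", "c3", "c4", "c5", "c6"]))
```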
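On the LLM-as-judge and 'I don't know' questions above: a sketch of a groundedness gate that only releases an answer when a judge LLM rates it supported by the retrieved context, and fails closed otherwise. `judge_llm`, the judge prompt, the JSON verdict schema, and the 0.7 threshold are all illustrative assumptions, not a specific library's API.

```python
import json
from typing import Callable, List

JUDGE_PROMPT = """You are grading a RAG answer for groundedness.
Question: {question}
Retrieved context:
{context}
Answer: {answer}

Reply with JSON only: {{"supported": true or false, "score": 0.0-1.0, "unsupported_claims": []}}"""

FALLBACK = "I don't know. I couldn't find enough supporting context to answer confidently."


def grounded_answer(
    question: str,
    answer: str,
    contexts: List[str],
    judge_llm: Callable[[str], str],  # hypothetical: prompt string in, completion string out
    min_score: float = 0.7,
) -> str:
    """Release the answer only if an LLM judge rates it grounded in the context."""
    # 1. Cheap retrieval check: refuse when nothing relevant was retrieved.
    if not contexts:
        return FALLBACK

    # 2. LLM-as-judge groundedness check against the retrieved chunks.
    prompt = JUDGE_PROMPT.format(
        question=question, context="\n---\n".join(contexts), answer=answer
    )
    try:
        verdict = json.loads(judge_llm(prompt))
    except (json.JSONDecodeError, TypeError):
        return FALLBACK  # unparseable judge output: fail closed

    # 3. Gate on the verdict; otherwise say "I don't know".
    if verdict.get("supported") and verdict.get("score", 0.0) >= min_score:
        return answer
    return FALLBACK
```

Logging each verdict alongside retrieval scores is what makes the dashboard-slice question answerable: empty or low-scoring contexts point at retrieval failures, while judge-flagged unsupported claims point at generation failures.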
