Day 82 of 133
RAG evaluation: Ragas + LLM-as-judge + DSA Heap review
Faithfulness, context precision, context recall, and answer relevancy.
DSA · NeetCode Heap / Priority Queue
- Find Median From Data Stream
Interview questions to prep
- Walk through the two-heaps trick (max-heap for the left half, min-heap for the right half). What invariant ties them together? A sketch follows this list.
- What's the space cost over a long-running stream, and how would you bound it (windowed median)?
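A minimal sketch of the two-heaps median in Python (class and method names are my own, not the LeetCode stubs): the max-heap holds the smaller half, the min-heap holds the larger half, and the size difference never exceeds one.

```python
import heapq


class MedianFinder:
    """Median of a stream via two heaps.
    Invariant: every value in `lo` (max-heap) <= every value in `hi` (min-heap),
    and len(lo) is equal to or one greater than len(hi)."""

    def __init__(self):
        self.lo = []  # max-heap via negated values: smaller half
        self.hi = []  # min-heap: larger half

    def add_num(self, num: int) -> None:
        # Push into the left half, move its max to the right half,
        # then rebalance so the left half never falls behind.
        heapq.heappush(self.lo, -num)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def find_median(self) -> float:
        if len(self.lo) > len(self.hi):
            return float(-self.lo[0])
        return (-self.lo[0] + self.hi[0]) / 2


mf = MedianFinder()
for x in [5, 2, 8, 1]:
    mf.add_num(x)
print(mf.find_median())  # 3.5
```

For the space question: both heaps grow with the stream (O(n)); a windowed median bounds memory but needs lazy deletion or an indexed structure, since plain heaps don't support removing arbitrary elements cheaply.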
GenAI · RAG evaluation
Interview questions to prep
- Walk through the four Ragas metrics (faithfulness, answer relevancy, context precision, context recall) and what each tells you; a scoring sketch follows this list.
- How would you build a golden eval set for a domain-specific RAG?
- Why is hallucination often a grounding or retrieval failure rather than only a model-quality failure?
- How would you separate hallucinations caused by missing context, bad retrieval, ambiguous prompts, and missing guardrails?
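A minimal scoring sketch, assuming a ragas 0.1-style API (column names such as `ground_truth` differ between versions, and `evaluate` defaults to an OpenAI judge, so an API key must be configured); the example row is made up.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Each row needs the question, the retrieved contexts, the generated answer,
# and a reference ground truth (used by context recall).
eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "contexts": [["Refunds are accepted within 30 days of purchase."]],
    "answer": ["You can request a refund within 30 days."],
    "ground_truth": ["Refunds are accepted within 30 days of purchase."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```

Read the split: low context precision/recall points at retrieval, low faithfulness with good contexts points at generation, and low answer relevancy often points at prompting or query understanding.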
Interview questions to prep
- Compare LLM-as-judge with human eval for RAG: when does LLM-as-judge fail? (A judge sketch follows this list.)
- What biases would an LLM judge introduce, and how do you control for them?
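A bare-bones judge sketch; the rubric, score scale, and the `call_llm` callable are illustrative assumptions, not a fixed API. Bias controls worth pairing with it: randomize candidate order for pairwise judging, pin a rubric instead of free-form scoring, and spot-check against human labels.

```python
import json

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score the answer from 1-5 for groundedness (is every claim supported by the
context?) and 1-5 for relevance (does it address the question?).
Reply with JSON: {{"groundedness": <int>, "relevance": <int>, "reason": "<short>"}}"""


def judge(question: str, context: str, answer: str, call_llm) -> dict:
    """Run one LLM-as-judge pass. `call_llm` is any callable that takes a
    prompt string and returns the model's text completion (placeholder)."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"groundedness": None, "relevance": None,
                "reason": "unparseable judge output"}
```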
Interview questions to prep
- Why are needle-in-haystack tests easy to game, and what are stronger evals? (See the harness sketch after this list.)
- What does 'lost in the middle' mean for long-context LLMs, and how do you mitigate it?
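A tiny harness sketch for the needle-placement step (function and parameter names are my own). The point: a single, lexically unique needle can be found by shallow pattern matching, so stronger variants use paraphrased needles, several needles, or questions that require combining them; sweeping the depth parameter is also how you expose "lost in the middle" degradation.

```python
def build_needle_cases(filler_docs, needle, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Place a known fact (the 'needle') at varying depths inside long filler
    text. Each returned string is one long-context prompt body; pair it with a
    question that only the needle answers."""
    haystack = "\n".join(filler_docs)
    cases = []
    for d in depths:
        cut = int(len(haystack) * d)
        cases.append(haystack[:cut] + "\n" + needle + "\n" + haystack[cut:])
    return cases
```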
Interview questions to prep
- How would you create a golden dataset for RAG from real user questions, documents, and expected citations?
- Which business metrics would you track for RAG: deflection rate, task success, escalation rate, latency, and cost?
- What dashboard slices show whether retrieval or generation is failing in production?
- How would you teach an assistant to say 'I don't know' when confidence or grounding drops?
- What validation checks would you add before a grounded answer reaches the user? (A gating sketch follows this list.)
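A gating sketch for the abstain path; `generate` and `grounding_score` are hypothetical callables (the latter could be an entailment or faithfulness scorer), and the thresholds are made-up numbers you would tune on the golden set.

```python
def answer_or_abstain(question, retrieved, generate, grounding_score,
                      min_similarity=0.35, min_grounding=0.7):
    """Gate a grounded answer before it reaches the user.
    `retrieved` is a list of (chunk, similarity) pairs; `generate` drafts an
    answer from the surviving chunks; `grounding_score` checks the draft
    against those chunks. Abstain if retrieval or grounding is too weak."""
    strong = [chunk for chunk, sim in retrieved if sim >= min_similarity]
    if not strong:
        return "I don't know - I couldn't find relevant sources for that."
    draft = generate(question, strong)
    if grounding_score(draft, strong) < min_grounding:
        return "I don't know - I couldn't verify an answer against the sources I found."
    return draft
```

Logging which branch fired (no strong retrieval vs. failed grounding) gives you the dashboard slice that separates retrieval failures from generation failures.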
References & further reading