Day 82 of 133

RAG evaluation: Ragas + LLM-as-judge + DSA Heap review

Core metrics: faithfulness, context precision, context recall, and answer relevance (sketched in the Ragas example below).
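A minimal sketch of computing these with Ragas, assuming the classic `evaluate` API (a `datasets.Dataset` with `question`, `answer`, `contexts`, and `ground_truth` columns, plus a configured judge LLM such as the default OpenAI backend). Newer Ragas releases use an `EvaluationDataset` interface instead, so check the installed version; the example row is made up.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One made-up evaluation row: the user question, the generated answer, the
# retrieved chunks, and a reference answer for the recall-style metrics.
eval_rows = {
    "question": ["What is the refund window for annual plans?"],
    "answer": ["Annual plans can be refunded within 30 days of purchase."],
    "contexts": [["Refund policy: annual subscriptions are refundable for 30 days."]],
    "ground_truth": ["Annual plans are refundable within 30 days of purchase."],
}

result = evaluate(
    Dataset.from_dict(eval_rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores, e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.9, ...}
```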

DSA · NeetCode Heap / Priority Queue

  • Find Median From Data Stream

    Interview questions to prep

    1. Walk through the two-heaps trick (max-heap for the lower half, min-heap for the upper half). What invariant ties them together? (See the sketch below.)
    2. What's the space cost over a long-running stream, and how would you bound it (windowed median)?
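A minimal Python sketch of the two-heaps approach behind the LeetCode MedianFinder interface, using heapq min-heaps and negating values on the max-heap side; an illustration of the invariant, not a tuned submission.

```python
import heapq


class MedianFinder:
    """Two-heaps median: a max-heap holds the lower half, a min-heap the upper half.

    Invariant: every element in `small` <= every element in `large`, and
    len(small) is either equal to len(large) or exactly one larger.
    """

    def __init__(self):
        self.small = []  # lower half, max-heap simulated by negating values
        self.large = []  # upper half, plain min-heap

    def addNum(self, num: int) -> None:
        # Push into the lower half, then move its max across so ordering holds.
        heapq.heappush(self.small, -num)
        heapq.heappush(self.large, -heapq.heappop(self.small))
        # Rebalance so `small` never has fewer elements than `large`.
        if len(self.large) > len(self.small):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def findMedian(self) -> float:
        if len(self.small) > len(self.large):
            return float(-self.small[0])
        return (-self.small[0] + self.large[0]) / 2.0


mf = MedianFinder()
for x in (5, 15, 1, 3):
    mf.addNum(x)
print(mf.findMedian())  # 4.0 (median of 1, 3, 5, 15)
```

addNum is O(log n) and findMedian is O(1), but space grows as O(n) over an unbounded stream; bounding it with a windowed median needs eviction support (for example lazy deletion with a count of stale entries, or an indexed/order-statistics structure), which plain heapq does not provide.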

GenAI · RAG evaluation

  • Ragas metrics & hallucination failure modes

    Interview questions to prep

    1. Walk through the four Ragas metrics and what each tells you.
    2. How would you build a golden eval set for a domain-specific RAG?
    3. Why is hallucination often a grounding or retrieval failure rather than only a model-quality failure?
    4. How would you separate hallucinations caused by missing context, bad retrieval, ambiguous prompts, and missing guardrails?
  • LLM-as-judge for RAG (Evidently)

    Interview questions to prep

    1. Compare LLM-as-judge vs human eval for RAG — when does LLM-as-judge fail?
    2. What biases would an LLM judge introduce, and how do you control for them?
  • Long-context evals

    Interview questions to prep

    1. Why are needle-in-haystack tests easy to game, and what are stronger evals?
    2. What does 'lost in the middle' mean for long-context LLMs, and how do you mitigate it? (A reordering sketch follows this list.)
  • RAG evaluation in production

    Interview questions to prep

    1. How would you create a golden dataset for RAG from real user questions, documents, and expected citations?
    2. Which business metrics would you track for RAG: deflection, task success, escalation rate, latency, and cost?
    3. What dashboard slices show whether retrieval or generation is failing in production?
    4. How would you teach an assistant to say 'I don't know' when confidence or grounding drops? (See the grounding-gate sketch after this list.)
    5. What validation checks would you add before allowing a grounded answer to reach the user?
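On the long-context bullet above: one common mitigation for 'lost in the middle' is to reorder retrieved chunks so the highest-scoring ones sit at the start and end of the prompt, pushing the weakest into the middle. A minimal plain-Python sketch of that idea (similar in spirit to reorderers such as LangChain's LongContextReorder, not its exact code); the chunk strings are placeholders.

```python
from typing import List


def reorder_for_long_context(chunks_by_relevance: List[str]) -> List[str]:
    """Place the most relevant chunks at the edges of the prompt.

    Assumes the input is sorted most-relevant-first (e.g. by retriever score).
    Rank 1 goes to the front, rank 2 to the back, rank 3 to the front, and so
    on, so the weakest chunks land in the middle, where long-context models
    are most likely to overlook them.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# Ranks 1..6 come out as 1, 3, 5, 6, 4, 2: strongest at both ends.
print(reorder_for_long_context(["c1", "c2", "c3", "c4", "c5", "c6"]))
```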
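On the LLM-as-judge and 'I don't know' questions above: a sketch of a groundedness gate that only releases an answer when a judge LLM rates it supported by the retrieved context, and fails closed otherwise. `judge_llm`, the judge prompt, the JSON verdict schema, and the 0.7 threshold are all illustrative assumptions, not a specific library's API.

```python
import json
from typing import Callable, List

JUDGE_PROMPT = """You are grading a RAG answer for groundedness.
Question: {question}
Retrieved context:
{context}
Answer: {answer}

Reply with JSON only: {{"supported": true or false, "score": 0.0-1.0, "unsupported_claims": []}}"""

FALLBACK = "I don't know. I couldn't find enough supporting context to answer confidently."


def grounded_answer(
    question: str,
    answer: str,
    contexts: List[str],
    judge_llm: Callable[[str], str],  # hypothetical: prompt string in, completion string out
    min_score: float = 0.7,
) -> str:
    """Release the answer only if an LLM judge rates it grounded in the context."""
    # 1. Cheap retrieval check: refuse when nothing relevant was retrieved.
    if not contexts:
        return FALLBACK

    # 2. LLM-as-judge groundedness check against the retrieved chunks.
    prompt = JUDGE_PROMPT.format(
        question=question, context="\n---\n".join(contexts), answer=answer
    )
    try:
        verdict = json.loads(judge_llm(prompt))
    except (json.JSONDecodeError, TypeError):
        return FALLBACK  # unparseable judge output: fail closed

    # 3. Gate on the verdict; otherwise say "I don't know".
    if verdict.get("supported") and verdict.get("score", 0.0) >= min_score:
        return answer
    return FALLBACK
```

Logging each verdict alongside retrieval scores is what makes the dashboard-slice question answerable: empty or low-scoring contexts point at retrieval failures, while judge-flagged unsupported claims point at generation failures.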
