Day 52 of 133
Attention & Transformer (Q,K,V) + Graphs
Self-attention end-to-end; multi-head, RoPE, FlashAttention, MQA/GQA.
DSA · NeetCode Graphs
- Max Area of Island
Interview questions to prep
- Is this BFS, DFS, or Union-Find? Defend the choice over the other two (a DFS sketch follows this list).
- Walk through complexity in terms of V and E. Where do those costs come from?
- How would you handle disconnected components, self-loops, or duplicate edges?
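A minimal DFS sketch for Max Area of Island, assuming the usual grid of 0s (water) and 1s (land); the function name and the trick of sinking visited cells are illustrative choices, not the only valid approach.

```python
from typing import List

def max_area_of_island(grid: List[List[int]]) -> int:
    rows, cols = len(grid), len(grid[0])

    def dfs(r: int, c: int) -> int:
        # Out of bounds or water contributes no area.
        if r < 0 or r >= rows or c < 0 or c >= cols or grid[r][c] == 0:
            return 0
        grid[r][c] = 0  # mark visited by sinking the cell
        return 1 + dfs(r + 1, c) + dfs(r - 1, c) + dfs(r, c + 1) + dfs(r, c - 1)

    # Every land cell is visited exactly once, so time is O(V) = O(rows * cols).
    return max((dfs(r, c) for r in range(rows) for c in range(cols)), default=0)
```

BFS or Union-Find also work here; DFS is the shortest to write, and the complexity argument (each cell processed once) is the same for all three.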
- Pacific Atlantic Water Flow
Interview questions to prep
- Is this BFS, DFS, or Union-Find? Defend the choice over the other two (a BFS-from-the-oceans sketch follows this list).
- Walk through complexity in terms of V and E. Where do those costs come from?
- How would you handle disconnected components, self-loops, or duplicate edges?
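A sketch of the ocean-inward BFS idea, assuming the standard heights matrix input: start from each ocean's border cells, move only to cells of equal or greater height, then intersect the two reachable sets. The helper name and return format are illustrative.

```python
from collections import deque
from typing import List, Set, Tuple

def pacific_atlantic(heights: List[List[int]]) -> List[List[int]]:
    rows, cols = len(heights), len(heights[0])

    def bfs(starts: List[Tuple[int, int]]) -> Set[Tuple[int, int]]:
        seen, q = set(starts), deque(starts)
        while q:
            r, c = q.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                # Reverse the flow: walk to neighbors that are at least as high.
                if (0 <= nr < rows and 0 <= nc < cols
                        and (nr, nc) not in seen
                        and heights[nr][nc] >= heights[r][c]):
                    seen.add((nr, nc))
                    q.append((nr, nc))
        return seen

    pacific = bfs([(0, c) for c in range(cols)] + [(r, 0) for r in range(rows)])
    atlantic = bfs([(rows - 1, c) for c in range(cols)] + [(r, cols - 1) for r in range(rows)])
    return [[r, c] for r, c in pacific & atlantic]  # O(rows * cols) time and space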
DL · Attention & Transformer
Interview questions to prep
- Walk me through self-attention (Q, K, V) end-to-end (see the sketch after this list).
- Why divide the dot-product scores by √d_k before the softmax?
- What does multi-head attention buy you over a single head?
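A minimal single-head sketch of scaled dot-product attention, assuming PyTorch; shapes are noted inline, and the 1/√d_k factor keeps the score variance roughly constant so the softmax does not saturate.

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over the keys
    return weights @ v                                 # (batch, seq_len, d_k)
```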
Interview questions to prep
- Walk me through the transformer block: attention → add+norm → FFN → add+norm (a minimal sketch follows this list).
- Compare absolute vs relative vs RoPE positional encodings.
- Why did transformers replace RNN/LSTM architectures for most GenAI workloads?
- What do residual connections, layer norm, and feed-forward activation choices contribute to transformer stability?
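A minimal post-norm transformer block sketch in the order the question gives (attention, add + norm, FFN, add + norm), assuming PyTorch; the dimensions and GELU activation are illustrative defaults, not mandated by the original paper.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-layer, then residual add and layer norm.
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        # Position-wise feed-forward sub-layer, then residual add and layer norm.
        return self.norm2(x + self.ffn(x))
```

Most modern stacks move the layer norm before each sub-layer (pre-norm), which trains more stably at depth; either way, the residual path is what keeps gradients flowing through many blocks.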
Interview questions to prep
- What problem does FlashAttention solve, and how?
- Compare MHA, MQA, and GQA in terms of KV-cache trade-offs (a rough size comparison follows this list).
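A back-of-the-envelope KV-cache comparison; every number below (layers, heads, head size, context length, fp16) is an illustrative assumption, chosen only to show how the cache shrinks as more query heads share a KV head.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # K and V are each cached per layer: 2 * layers * kv_heads * head_dim * seq * batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

layers, q_heads, head_dim, seq, batch = 32, 32, 128, 4096, 1
print("MHA :", kv_cache_bytes(layers, q_heads, head_dim, seq, batch) / 2**30, "GiB")  # one KV head per query head
print("GQA8:", kv_cache_bytes(layers, 8, head_dim, seq, batch) / 2**30, "GiB")        # 8 KV groups shared by 32 query heads
print("MQA :", kv_cache_bytes(layers, 1, head_dim, seq, batch) / 2**30, "GiB")        # a single shared KV head
```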
Interview questions to prep
- Implement scaled dot-product self-attention and track the shapes of Q, K, V, scores, and output (a multi-head, causal-mask sketch follows this list).
- How do masks change self-attention for causal language modeling?
- What changes when you split attention into multiple heads and then concatenate them?
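A minimal multi-head, causally masked self-attention sketch, assuming PyTorch; intermediate shapes are annotated, and the module and projection names are illustrative. Splitting into heads and concatenating back changes only how the d_model channels are grouped, not the output shape.

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape                                  # (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)             # each (b, t, d_model)
        # Split heads: (b, t, d_model) -> (b, n_heads, t, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (b, heads, t, t)
        # Causal mask: position i may attend only to positions <= i.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal, float("-inf"))
        ctx = torch.softmax(scores, dim=-1) @ v             # (b, heads, t, d_head)
        # Concatenate heads: (b, heads, t, d_head) -> (b, t, d_model).
        return self.out(ctx.transpose(1, 2).contiguous().view(b, t, d))
```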
References & further reading
- Attention Is All You Need (Vaswani et al.)
- The Illustrated Transformer (Jay Alammar)
- CS224n: NLP with Deep Learning (Stanford)