Day 52 of 133

Attention & Transformer (Q,K,V) + Graphs

Self-attention end-to-end; multi-head, RoPE, FlashAttention, MQA/GQA.

DSA · NeetCode Graphs

  • Max Area of Island · DSA · Graphs (DFS sketch after the question list below)

    Interview questions to prep

    1. Is this BFS, DFS, or Union-Find? Defend the choice over the other two.
    2. Walk through complexity in terms of V and E. Where do those costs come from?
    3. How would you handle disconnected components, self-loops, or duplicate edges?
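
A minimal DFS sketch for Max Area of Island, to ground the BFS/DFS/Union-Find discussion above. The grid is assumed to be a list of lists of 0/1 ints and is mutated in place to mark visited cells; every cell is visited at most once, giving O(rows * cols) time.

```python
# A minimal DFS sketch for Max Area of Island (assumes a 0/1 grid of ints).
def max_area_of_island(grid):
    rows, cols = len(grid), len(grid[0])

    def dfs(r, c):
        # Out of bounds or water contributes no area.
        if r < 0 or r >= rows or c < 0 or c >= cols or grid[r][c] == 0:
            return 0
        grid[r][c] = 0  # sink the cell so it is never counted twice
        return 1 + dfs(r + 1, c) + dfs(r - 1, c) + dfs(r, c + 1) + dfs(r, c - 1)

    return max((dfs(r, c) for r in range(rows) for c in range(cols)), default=0)

# The largest island below has area 4.
print(max_area_of_island([[0, 1, 0, 0],
                          [1, 1, 0, 1],
                          [0, 1, 0, 0]]))
```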

DL · Attention & Transformer

  • Self-attention (Q, K, V) end-to-end · Deep Learning · Jay Alammar (NumPy sketch after the question list below)

    Interview questions to prep

    1. Walk me through self-attention (Q, K, V) end-to-end.
    2. Why divide the attention logits by √d_k before the softmax?
    3. What does multi-head attention buy you over a single head?
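
A shape-annotated NumPy sketch of single-head scaled dot-product self-attention for questions 1 and 2; the projection matrices and function names here are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (seq_len, d_k)
    d_k = q.shape[-1]
    # Dividing by sqrt(d_k) keeps the logits' variance roughly constant as d_k grows,
    # so the softmax does not saturate.
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v                           # (seq_len, d_k)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (4, 8)
```
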
  • Interview questions to prep (a transformer-block sketch follows this list)

    1. Walk me through the transformer block: attention → add+norm → FFN → add+norm.
    2. Compare absolute vs relative vs RoPE positional encodings.
    3. Why did transformers replace RNN/LSTM architectures for most GenAI workloads?
    4. What do residual connections, layer norm, and feed-forward activation choices contribute to transformer stability?
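
A minimal NumPy sketch of one post-norm transformer block (attention → add+norm → FFN → add+norm) for question 1; learnable LayerNorm gain/bias, dropout, and multi-head splitting are omitted, and all parameter names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)  # learnable gain/bias omitted for brevity

def transformer_block(x, w_q, w_k, w_v, w_o, w1, b1, w2, b2):
    # Self-attention sublayer with residual connection, then add & norm.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = layer_norm(x + attn @ w_o)
    # Position-wise feed-forward sublayer with residual connection, then add & norm.
    ffn = np.maximum(x @ w1 + b1, 0) @ w2 + b2   # ReLU here; GELU is the usual modern choice
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
d, d_ff, seq = 8, 32, 4
shapes = [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]
params = [rng.normal(size=s) * 0.1 for s in shapes]
print(transformer_block(rng.normal(size=(seq, d)), *params).shape)  # (4, 8)
```
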
  • Sparse, linear, FlashAttention, MQA, GQA · Deep Learning · FlashAttention (GQA sketch after the question list below)

    Interview questions to prep

    1. What problem does FlashAttention solve, and how?
    2. Compare MHA, MQA, and GQA — KV-cache trade-offs.
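
A minimal NumPy sketch of grouped-query attention to make the KV-cache trade-off in question 2 concrete: only n_kv_heads key/value heads are cached and each group of query heads shares one of them, so the cache shrinks by a factor of n_heads / n_kv_heads (n_kv_heads = 1 recovers MQA, n_kv_heads = n_heads recovers standard MHA). Shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(q, k, v, n_heads, n_kv_heads):
    # q: (seq, n_heads, d_head); k, v: (seq, n_kv_heads, d_head)
    group = n_heads // n_kv_heads
    # Repeat each cached K/V head so every query head in a group reads the same cache entry.
    k = np.repeat(k, group, axis=1)              # (seq, n_heads, d_head)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hqk,khd->qhd", softmax(scores), v)   # (seq, n_heads, d_head)

rng = np.random.default_rng(0)
seq, n_heads, n_kv_heads, d_head = 5, 8, 2, 16
q = rng.normal(size=(seq, n_heads, d_head))
k = rng.normal(size=(seq, n_kv_heads, d_head))   # only 2 K/V heads cached instead of 8
v = rng.normal(size=(seq, n_kv_heads, d_head))
print(gqa(q, k, v, n_heads, n_kv_heads).shape)   # (5, 8, 16)
```
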
  • Interview questions to prep (a causal multi-head attention sketch follows this list)

    1. Implement scaled dot-product self-attention and track the shapes of Q, K, V, scores, and output.
    2. How do masks change self-attention for causal language modeling?
    3. What changes when you split attention into multiple heads and then concatenate them?
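
A minimal NumPy sketch of causal multi-head self-attention for questions 1-3: the model dimension is split into heads, a lower-triangular mask blocks attention to future positions, and the heads are concatenated and projected back. Names such as split_heads and mha_causal are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha_causal(x, w_q, w_k, w_v, w_o, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):                                   # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    mask = np.triu(np.ones((seq, seq), dtype=bool), 1)    # True above the diagonal = future tokens
    scores = np.where(mask, -1e9, scores)                 # causal mask: no attention to the future
    out = softmax(scores) @ v                             # (n_heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq, d_model)    # concatenate heads
    return out @ w_o                                      # output projection

rng = np.random.default_rng(0)
seq, d_model, n_heads = 6, 8, 2
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(mha_causal(x, w_q, w_k, w_v, w_o, n_heads).shape)   # (6, 8)
```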

References & further reading