Day 52 of 133

Attention & Transformer (Q,K,V) + Graphs

Self-attention end-to-end; multi-head, RoPE, FlashAttention, MQA/GQA.

DSA · NeetCode Graphs

  • Max Area of Island · DSA · Graphs (DFS sketch after the question list below)

    Interview questions to prep

    1. Is this BFS, DFS, or Union-Find? Defend the choice over the other two.
    2. Walk through complexity in terms of V and E. Where do those costs come from?
    3. How would you handle disconnected components, self-loops, or duplicate edges?
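
A minimal DFS sketch for Max Area of Island, to ground the BFS/DFS/Union-Find discussion above. The grid is assumed to be a list of lists of 0/1 ints and is mutated in place to mark visited cells; every cell is visited at most once, giving O(rows * cols) time.

```python
# A minimal DFS sketch for Max Area of Island (assumes a 0/1 grid of ints).
def max_area_of_island(grid):
    rows, cols = len(grid), len(grid[0])

    def dfs(r, c):
        # Out of bounds or water contributes no area.
        if r < 0 or r >= rows or c < 0 or c >= cols or grid[r][c] == 0:
            return 0
        grid[r][c] = 0  # sink the cell so it is never counted twice
        return 1 + dfs(r + 1, c) + dfs(r - 1, c) + dfs(r, c + 1) + dfs(r, c - 1)

    return max((dfs(r, c) for r in range(rows) for c in range(cols)), default=0)

# The largest island below has area 4.
print(max_area_of_island([[0, 1, 0, 0],
                          [1, 1, 0, 1],
                          [0, 1, 0, 0]]))
```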

DL · Attention & Transformer

  • Self-attention (Q, K, V) end-to-end · Deep Learning · Jay Alammar (NumPy sketch after the question list below)

    Interview questions to prep

    1. Walk me through self-attention (Q, K, V) end-to-end.
    2. Why divide the attention logits by √d_k before the softmax?
    3. What does multi-head attention buy you over a single head?
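
A shape-annotated NumPy sketch of single-head scaled dot-product self-attention for questions 1 and 2; the projection matrices and function names here are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # each (seq_len, d_k)
    d_k = q.shape[-1]
    # Dividing by sqrt(d_k) keeps the logits' variance roughly constant as d_k grows,
    # so the softmax does not saturate.
    scores = q @ k.T / np.sqrt(d_k)              # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v                           # (seq_len, d_k)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (4, 8)
```
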
  • Interview questions to prep (a transformer-block sketch follows this list)

    1. Walk me through the transformer block: attention → add+norm → FFN → add+norm.
    2. Compare absolute vs relative vs RoPE positional encodings.
    3. Why did transformers replace RNN/LSTM architectures for most GenAI workloads?
    4. What do residual connections, layer norm, and feed-forward activation choices contribute to transformer stability?
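
A minimal NumPy sketch of one post-norm transformer block (attention → add+norm → FFN → add+norm) for question 1; learnable LayerNorm gain/bias, dropout, and multi-head splitting are omitted, and all parameter names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)  # learnable gain/bias omitted for brevity

def transformer_block(x, w_q, w_k, w_v, w_o, w1, b1, w2, b2):
    # Self-attention sublayer with residual connection, then add & norm.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = layer_norm(x + attn @ w_o)
    # Position-wise feed-forward sublayer with residual connection, then add & norm.
    ffn = np.maximum(x @ w1 + b1, 0) @ w2 + b2   # ReLU here; GELU is the usual modern choice
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
d, d_ff, seq = 8, 32, 4
shapes = [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff,), (d_ff, d), (d,)]
params = [rng.normal(size=s) * 0.1 for s in shapes]
print(transformer_block(rng.normal(size=(seq, d)), *params).shape)  # (4, 8)
```
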
  • Sparse, linear, FlashAttention, MQA, GQA · Deep Learning · FlashAttention (GQA sketch after the question list below)

    Interview questions to prep

    1. What problem does FlashAttention solve, and how?
    2. Compare MHA, MQA, and GQA — KV-cache trade-offs.
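
A minimal NumPy sketch of grouped-query attention to make the KV-cache trade-off in question 2 concrete: only n_kv_heads key/value heads are cached and each group of query heads shares one of them, so the cache shrinks by a factor of n_heads / n_kv_heads (n_kv_heads = 1 recovers MQA, n_kv_heads = n_heads recovers standard MHA). Shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(q, k, v, n_heads, n_kv_heads):
    # q: (seq, n_heads, d_head); k, v: (seq, n_kv_heads, d_head)
    group = n_heads // n_kv_heads
    # Repeat each cached K/V head so every query head in a group reads the same cache entry.
    k = np.repeat(k, group, axis=1)              # (seq, n_heads, d_head)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hqk,khd->qhd", softmax(scores), v)   # (seq, n_heads, d_head)

rng = np.random.default_rng(0)
seq, n_heads, n_kv_heads, d_head = 5, 8, 2, 16
q = rng.normal(size=(seq, n_heads, d_head))
k = rng.normal(size=(seq, n_kv_heads, d_head))   # only 2 K/V heads cached instead of 8
v = rng.normal(size=(seq, n_kv_heads, d_head))
print(gqa(q, k, v, n_heads, n_kv_heads).shape)   # (5, 8, 16)
```
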
  • Interview questions to prep (a causal multi-head attention sketch follows this list)

    1. Implement scaled dot-product self-attention and track the shapes of Q, K, V, scores, and output.
    2. How do masks change self-attention for causal language modeling?
    3. What changes when you split attention into multiple heads and then concatenate them?
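
A minimal NumPy sketch of causal multi-head self-attention for questions 1-3: the model dimension is split into heads, a lower-triangular mask blocks attention to future positions, and the heads are concatenated and projected back. Names such as split_heads and mha_causal are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha_causal(x, w_q, w_k, w_v, w_o, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):                                   # (seq, d_model) -> (n_heads, seq, d_head)
        return t.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    mask = np.triu(np.ones((seq, seq), dtype=bool), 1)    # True above the diagonal = future tokens
    scores = np.where(mask, -1e9, scores)                 # causal mask: no attention to the future
    out = softmax(scores) @ v                             # (n_heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq, d_model)    # concatenate heads
    return out @ w_o                                      # output projection

rng = np.random.default_rng(0)
seq, d_model, n_heads = 6, 8, 2
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(mha_causal(x, w_q, w_k, w_v, w_o, n_heads).shape)   # (6, 8)
```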

References & further reading