Day 50 of 133

NLP foundations: tokenization & embeddings

Tokenizers: BPE, WordPiece, and SentencePiece. Embeddings: from static word2vec to BERT-style contextual representations.

DSA · NeetCode Backtracking

  • Interview questions to prep

    1. Walk through your pruning strategy — what subtrees do you skip and why is it safe? (sketched below)
    2. Where does memoization apply? Could this be a DP problem in disguise?
    3. What's the worst-case time complexity, and what's the depth of the recursion stack?
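
A minimal pruning sketch for question 1, using combination sum as a stand-in problem (the problem choice and all values are illustrative): once the candidates are sorted, any branch whose next candidate exceeds the remaining target can be skipped safely, because every deeper node in that subtree only overshoots further.

```python
def combination_sum(candidates, target):
    """Backtracking with pruning: all combinations of candidates summing to target."""
    candidates.sort()                          # sorting is what makes the pruning below safe
    results, path = [], []

    def backtrack(start, remaining):
        if remaining == 0:                     # found a valid combination
            results.append(path.copy())
            return
        for i in range(start, len(candidates)):
            # Prune: candidates are sorted, so once one exceeds the remaining
            # target, every later candidate (and every deeper extension of this
            # path) overshoots too -- that whole subtree can never hit the target.
            if candidates[i] > remaining:
                break
            path.append(candidates[i])
            backtrack(i, remaining - candidates[i])   # reuse allowed: stay at index i
            path.pop()                         # undo the choice before trying the next branch

    backtrack(0, target)
    return results

print(combination_sum([2, 3, 6, 7], 7))        # [[2, 2, 3], [7]]
```

For question 3, the recursion depth of this sketch is at most target // min(candidates), while the number of explored nodes is exponential in the worst case; memoization (question 2) only helps once the answer depends on (start, remaining) alone rather than on the path itself.
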
  • N-Queens (DSA · Backtracking)

    Interview questions to prep

    1. How do you check 'queen attacks me' in O(1) using the diagonal-set trick? (sketched below)
    2. What's the state-space size, and how much does pruning actually save in practice?
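
A minimal sketch of the diagonal-set trick from question 1 (names are my own): squares on the same "\" diagonal share row - col, squares on the same "/" diagonal share row + col, so three hash sets give an O(1) attacked check.

```python
def solve_n_queens(n):
    """Return all N-Queens solutions as lists of column indices, one per row."""
    solutions, placement = [], []              # placement[r] = column of the queen in row r
    cols, diag, anti = set(), set(), set()     # occupied columns, "\" diagonals, "/" diagonals

    def place(row):
        if row == n:
            solutions.append(placement.copy())
            return
        for col in range(n):
            if col in cols or (row - col) in diag or (row + col) in anti:
                continue                       # attacked square: prune this branch in O(1)
            cols.add(col); diag.add(row - col); anti.add(row + col)
            placement.append(col)
            place(row + 1)
            placement.pop()                    # backtrack: undo the placement
            cols.discard(col); diag.discard(row - col); anti.discard(row + col)

    place(0)
    return solutions

print(len(solve_n_queens(8)))                  # 92 solutions on the 8x8 board
```

For question 2: placing one queen per row already cuts the naive n^n board space down to at most n! column permutations, and the diagonal pruning visits far fewer nodes than that in practice.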

NLP · Foundations

  • Tokenization · Interview questions to prep

    1. Compare BPE, WordPiece, and SentencePiece tokenizers (a toy BPE merge loop is sketched below).
    2. Why does tokenizer choice affect cross-lingual performance?
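
A toy sketch of the BPE merge loop behind question 1 (the word-frequency corpus and the number of merges are made up): count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, repeat. WordPiece differs mainly in scoring merges by likelihood rather than raw frequency, and SentencePiece runs directly on raw text with whitespace kept as an ordinary symbol.

```python
from collections import Counter

# Word frequencies, with words split into characters plus an end-of-word marker.
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6, ("w", "i", "d", "e", "s", "t", "</w>"): 3}

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for step in range(5):                          # learn 5 merges on the toy corpus
    best = pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(step, best)
```

Question 2 follows from this loop: merges are learned from one corpus's frequency statistics, so a vocabulary trained mostly on English tends to oversegment other languages into many short, low-information pieces.
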
  • Embeddings · Interview questions to prep

    1. Walk through how word2vec (skip-gram) is trained (update step sketched below).
    2. How are contextual embeddings (BERT) different from static ones (word2vec)?
    3. What problem does negative sampling solve in skip-gram training?
    4. Compare CBOW, skip-gram, and GloVe — what objective is each optimizing?
    5. When would n-grams or TF-IDF outperform dense embeddings in a production baseline?
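
A numpy sketch of one skip-gram-with-negative-sampling update, covering questions 1 and 3 (vocabulary size, embedding dimension, learning rate, and the sampled word ids are all illustrative): instead of a softmax over the full vocabulary, the true context word is pushed toward a sigmoid output of 1 and a handful of sampled words toward 0.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 1000, 50, 5, 0.05                # vocab size, embedding dim, negatives, step size
W_in = rng.normal(scale=0.1, size=(V, d))      # "input" (center-word) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))     # "output" (context-word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives):
    """One SGD step on -log sigmoid(u_ctx.v) - sum(log sigmoid(-u_neg.v))."""
    v = W_in[center].copy()                    # center vector (copy keeps the pre-update value)
    ids = np.array([context] + list(negatives))
    labels = np.array([1.0] + [0.0] * len(negatives))
    u = W_out[ids]                             # (k+1, d) context/negative vectors (fancy indexing copies)
    scores = sigmoid(u @ v)                    # predicted "is a real context word" probabilities
    grad = scores - labels                     # gradient of the logistic loss wrt each score
    W_in[center] -= lr * grad @ u              # update the center vector
    W_out[ids] = u - lr * grad[:, None] * v    # update the k+1 output vectors
    return -np.log(scores[0] + 1e-9) - np.log(1.0 - scores[1:] + 1e-9).sum()

negatives = rng.integers(0, V, size=k)         # a real model samples by smoothed unigram frequency
print(sgns_step(center=3, context=17, negatives=negatives))
```

On question 2: every row of W_in here is one static vector per word type, whereas BERT produces a fresh vector for each token occurrence from its surrounding sentence.
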
  • Core NLP tasks · Interview questions to prep

    1. Compare token-level (NER) vs sequence-level (classification) tasks (label shapes sketched below).
    2. Why is dependency parsing harder than POS tagging, and where does it still matter today?
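
A tiny illustration of the label shapes behind question 1 (sentence and tags are made up): token-level tasks need one label per token, sequence-level tasks need one label per example.

```python
sentence = ["Ada", "Lovelace", "visited", "London"]

# Token-level task (NER): one BIO tag per token, aligned with the input.
ner_labels = ["B-PER", "I-PER", "O", "B-LOC"]
assert len(ner_labels) == len(sentence)

# Sequence-level task (classification): a single label for the whole sentence.
sequence_label = "travel"
```
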
  • Classic baselines · Interview questions to prep

    1. Compare bag-of-words and TF-IDF for sentiment analysis — what signal does TF-IDF add?
    2. When would stemming hurt compared with lemmatization?
    3. How would you build a strong non-LLM NLP baseline before using transformers? (sketched below)
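
A minimal non-LLM baseline sketch for question 3, assuming scikit-learn is available (the training sentences are made up): word-plus-bigram TF-IDF features feeding a linear classifier, which is usually the bar to clear before reaching for transformers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and acting",
         "loved the soundtrack", "boring and terrible"]
labels = ["pos", "neg", "pos", "neg"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # TF-IDF down-weights terms that appear in every document
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["loved the acting"]))
```

Swapping TfidfVectorizer for CountVectorizer turns this back into plain bag-of-words, which is one way to answer question 1 empirically.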
