Day 50 of 133
NLP foundations: tokenization & embeddings
Subword tokenization (BPE, WordPiece, SentencePiece) and the shift from static word2vec embeddings to contextual BERT embeddings.
DSA · NeetCode Backtracking
- Letter Combinations of a Phone Number (DSA · Backtracking)
Interview questions to prep
- Walk through your pruning strategy — what subtrees do you skip and why is it safe?
- Where does memoization apply? Could this be a DP problem in disguise?
- What's the worst-case time complexity, and what's the depth of the recursion stack?
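To anchor the complexity and recursion-depth answers above, here is a minimal backtracking sketch for the problem (names and structure are my own, not taken from NeetCode). Worst case is O(4^n · n) for n digits, and the recursion stack is at most n deep.

```python
# A minimal backtracking sketch for Letter Combinations of a Phone Number.
DIGIT_TO_LETTERS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

def letter_combinations(digits: str) -> list[str]:
    if not digits:
        return []
    results: list[str] = []
    path: list[str] = []

    def backtrack(i: int) -> None:
        # Base case: one letter chosen per digit -> record the combination.
        if i == len(digits):
            results.append("".join(path))
            return
        for letter in DIGIT_TO_LETTERS[digits[i]]:
            path.append(letter)   # choose
            backtrack(i + 1)      # explore
            path.pop()            # un-choose (backtrack)

    backtrack(0)
    return results

# Example: letter_combinations("23") -> ["ad", "ae", "af", "bd", ..., "cf"]
```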
- N-Queens (DSA · Backtracking)
Interview questions to prep
- How do you check 'queen attacks me' in O(1) using the diagonal-set trick?
- What's the state-space size, and how much does pruning actually save in practice?
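A minimal sketch of the diagonal-set trick, assuming a solution-counting variant of the problem: queens on the same "/" diagonal share row + col, queens on the same "\" diagonal share row - col, so three hash sets answer "is this square attacked?" in O(1).

```python
# O(1) attack checks for N-Queens via column and diagonal hash sets.
def solve_n_queens(n: int) -> int:
    cols: set[int] = set()    # occupied columns
    diag1: set[int] = set()   # occupied "/" diagonals (row + col)
    diag2: set[int] = set()   # occupied "\" diagonals (row - col)
    count = 0

    def backtrack(row: int) -> None:
        nonlocal count
        if row == n:
            count += 1
            return
        for col in range(n):
            # O(1) pruning: skip any square already attacked.
            if col in cols or (row + col) in diag1 or (row - col) in diag2:
                continue
            cols.add(col); diag1.add(row + col); diag2.add(row - col)
            backtrack(row + 1)
            cols.discard(col); diag1.discard(row + col); diag2.discard(row - col)

    backtrack(0)
    return count

# Example: solve_n_queens(8) -> 92
```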
NLP · Foundations
Interview questions to prep
- Compare BPE, WordPiece, and SentencePiece tokenizers.
- Why does tokenizer choice affect cross-lingual performance?
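As a rough aid for the tokenizer comparison above, here is a toy BPE training loop over a whitespace-split corpus. It is only a sketch of the frequency-based merge idea; WordPiece instead picks merges by likelihood gain, and SentencePiece works on raw text (no pre-tokenization) with BPE or unigram-LM segmentation. The corpus and helper names are illustrative assumptions.

```python
# Toy byte-pair-encoding (BPE) training: repeatedly merge the most frequent
# adjacent symbol pair across the corpus.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a tuple of symbols (characters to start with).
    vocab = Counter(tuple(word) for word in corpus)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        new_vocab: Counter = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Example: train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
# learns merges like ("l", "o") then ("lo", "w"), building a "low" subword.
```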
Interview questions to prep
- Walk through how word2vec (skip-gram) is trained.
- How are contextual embeddings (BERT) different from static ones (word2vec)?
- What problem does negative sampling solve in skip-gram training?
- Compare CBOW, skip-gram, and GloVe — what objective is each optimizing?
- When would n-grams or TF-IDF outperform dense embeddings in a production baseline?
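For the skip-gram and negative-sampling questions above, a bare-bones NumPy update step helps make the objective concrete. The vocabulary size, dimensions, and uniform negative sampling below are illustrative simplifications (word2vec actually draws negatives from a smoothed unigram distribution).

```python
# One SGD step of skip-gram with negative sampling (SGNS):
# pull the (center, context) pair together, push k random negatives away.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k, lr = 1000, 50, 5, 0.025
W_in = rng.normal(scale=0.01, size=(vocab_size, dim))   # center-word vectors
W_out = rng.normal(scale=0.01, size=(vocab_size, dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center: int, context: int) -> None:
    """SGD on  -log sigma(u_ctx . v)  -  sum_neg log sigma(-u_neg . v)."""
    negatives = rng.integers(0, vocab_size, size=k)
    v = W_in[center]
    targets = np.concatenate(([context], negatives))
    labels = np.concatenate(([1.0], np.zeros(k)))   # 1 = real pair, 0 = negative
    u = W_out[targets]                              # (k+1, dim)
    scores = sigmoid(u @ v)                         # (k+1,)
    grad = (scores - labels)[:, None]               # dLoss / d(u . v)
    W_in[center] -= lr * (grad * u).sum(axis=0)
    W_out[targets] -= lr * grad * v

# Usage: call sgns_step(center_id, context_id) over (center, context) pairs
# sampled from sliding windows in the training corpus.
```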
Interview questions to prep
- Compare token-level (NER) vs sequence-level (classification) tasks.
- Why is dependency parsing harder than POS tagging, and where does it still matter today?
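A quick way to see the token-level vs sequence-level distinction is where each prediction attaches in a spaCy pipeline. The sketch below assumes the `en_core_web_sm` model has been downloaded (`python -m spacy download en_core_web_sm`) and is only meant to show the shape of the outputs.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a London startup for $1 billion.")

# Token-level predictions: one label per token (POS tag, dependency arc).
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# NER: labeled spans over tokens.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Sequence-level classification (e.g. sentiment) would instead assign one
# label to the whole doc, typically via a text-classification component.
```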
Interview questions to prep
- Compare bag-of-words and TF-IDF for sentiment analysis — what signal does TF-IDF add?
- When would stemming hurt compared with lemmatization?
- How would you build a strong non-LLM NLP baseline before using transformers?
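For the non-LLM baseline question, a common starting point is TF-IDF n-gram features feeding a linear classifier. The scikit-learn sketch below uses placeholder toy data; real use would add a proper train/validation split and hyperparameter search.

```python
# TF-IDF word/bigram features + logistic regression: a strong non-LLM baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great movie, loved it", "terrible plot and worse acting"]  # toy data
labels = [1, 0]  # 1 = positive, 0 = negative

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=1),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["loved the acting"]))
```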
References & further reading
- Hugging Face NLP course (Hugging Face)
- CS224n: NLP with Deep Learning (Stanford)
- The Illustrated Transformer (Jay Alammar)
- 75Hard GenAI/LLM Challenge: Text representation techniques