Day 47 of 133
ViT, CLIP, multimodal LLMs
Patch tokenization, [CLS] token, contrastive image-text training.
DSA · NeetCode Backtracking
- Permutations — DSA · Backtracking (a sketch follows the questions below)
Interview questions to prep
- Walk through your pruning strategy — what subtrees do you skip and why is it safe?
- Where does memoization apply? Could this be a DP problem in disguise?
- What's the worst-case time complexity, and what's the depth of the recursion stack?
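A minimal Python sketch covering the prep questions above. The `used` array is the pruning: any branch that reuses an element already on the path would only rebuild a permutation we've seen, so skipping it is safe. Memoization doesn't apply in the usual sense — DP can *count* permutations, but since all n! outputs must be emitted, caching can't reduce the enumeration work.

```python
from typing import List

def permute(nums: List[int]) -> List[List[int]]:
    """Generate all permutations of nums via backtracking."""
    results: List[List[int]] = []
    path: List[int] = []
    used = [False] * len(nums)

    def backtrack() -> None:
        if len(path) == len(nums):
            results.append(path[:])  # snapshot the completed permutation
            return
        for i, n in enumerate(nums):
            if used[i]:
                continue  # prune: element already on the current path
            used[i] = True
            path.append(n)
            backtrack()
            path.pop()        # undo the choice before trying the next sibling
            used[i] = False

    backtrack()
    return results

# Worst case O(n * n!): n! leaves, each copied in O(n).
# Recursion stack depth is n (one frame per position filled).
print(permute([1, 2, 3]))
```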
- Subsets II — DSA · Backtracking (a sketch follows the questions below)
Interview questions to prep
- Walk through your pruning strategy — what subtrees do you skip and why is it safe?
- Where does memoization apply? Could this be a DP problem in disguise?
- What's the worst-case time complexity, and what's the depth of the recursion stack?
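Same pattern for Subsets II, as a sketch: sorting groups equal values together, and the `i > start` check prunes sibling branches that would rebuild an identical subset at the same tree level — safe because the skipped subtree is a verbatim copy of one already explored.

```python
from typing import List

def subsets_with_dup(nums: List[int]) -> List[List[int]]:
    """All unique subsets of nums, which may contain duplicates."""
    nums.sort()  # group duplicates so they can be skipped deterministically
    results: List[List[int]] = []
    path: List[int] = []

    def backtrack(start: int) -> None:
        results.append(path[:])  # every node in the tree is a valid subset
        for i in range(start, len(nums)):
            if i > start and nums[i] == nums[i - 1]:
                continue  # prune: same value at the same level => duplicate subtree
            path.append(nums[i])
            backtrack(i + 1)
            path.pop()

    backtrack(0)
    return results

# Worst case O(n * 2^n) when all elements are distinct; recursion depth is n.
print(subsets_with_dup([1, 2, 2]))
```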
DL · ViT, CLIP, multimodal
Interview questions to prep — ViT (sketch below)
- How does ViT tokenize an image, and what's the role of the [CLS] token?
- When does a ViT beat a CNN, and when does its appetite for large training sets hurt it?
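A toy PyTorch sketch of the first question (hyperparameters are illustrative, not the paper's exact configuration): a stride-16 convolution is the standard trick for "flatten each 16×16 patch and apply a shared linear projection", and the learnable [CLS] token is prepended so its final-layer state can serve as the whole-image representation for classification.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Toy ViT front end: patchify, project, prepend [CLS], add positions."""

    def __init__(self, img_size: int = 224, patch: int = 16, dim: int = 192):
        super().__init__()
        n_patches = (img_size // patch) ** 2  # 14 * 14 = 196 for these defaults
        # Stride-`patch` conv == "flatten each patch, then shared Linear"
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # +1 for [CLS]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls.expand(b, -1, -1)                  # one [CLS] per image
        return torch.cat([cls, tokens], dim=1) + self.pos  # (B, 197, dim)

x = torch.randn(2, 3, 224, 224)
print(PatchEmbed()(x).shape)  # torch.Size([2, 197, 192])
```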
Interview questions to prep — CLIP (sketch below)
- How does CLIP enable zero-shot image classification?
- Walk me through CLIP's contrastive training objective.
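A compact sketch of CLIP's symmetric contrastive (InfoNCE-style) objective: the i-th image and i-th caption in a batch are the positive pair, every other pairing is a negative. Note the real model learns its temperature as a parameter; it's fixed here for simplicity. Zero-shot classification reuses the same similarity matrix — embed one prompt per class name and pick the class whose text embedding is most similar to the image.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched image/text pairs."""
    img = F.normalize(img_emb, dim=-1)  # unit vectors -> dot = cosine similarity
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(img))       # positives sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(img_emb, txt_emb).item())
```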
Interview questions to prep — multimodal LLMs (sketch below)
- How do multimodal LLMs like LLaVA fuse vision encoders with language models?
- Compare early fusion vs late fusion in vision-language models — what does each cost in compute and quality?
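A fusion sketch in the LLaVA style: a frozen vision encoder yields patch features, a small projector maps them into the LLM's embedding space, and the image tokens are simply concatenated with the text token embeddings so the unmodified LLM attends over both. The two-layer MLP and dimensions match LLaVA-1.5's reported setup (the original LLaVA used a single linear layer), but treat them as illustrative; cross-attention designs like Flamingo instead pay extra parameters to keep the text sequence short.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """LLaVA-style fusion: project frozen vision features into LLM token space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector (LLaVA-1.5); LLaVA-1.0 used one linear layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.proj(patch_feats)  # (B, n_patches, llm_dim)
        # Image tokens are prepended to the text sequence; the LLM sees one
        # mixed sequence and fuses modalities through ordinary self-attention.
        return torch.cat([vis_tokens, text_embeds], dim=1)

patch_feats = torch.randn(1, 576, 1024)  # e.g. 24x24 patches from a CLIP ViT-L
text_embeds = torch.randn(1, 32, 4096)   # already-embedded text tokens
print(VisionToLLMProjector()(patch_feats, text_embeds).shape)  # (1, 608, 4096)
```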
References & further reading
- Vision Transformer (ViT) paper, "An Image is Worth 16x16 Words" (Google)
- CLIP paper, "Learning Transferable Visual Models From Natural Language Supervision" (OpenAI)
- Papers with Code — SOTA leaderboards