Day 47 of 133
ViT, CLIP, multimodal LLMs
Patch tokenization, [CLS] token, contrastive image-text training.
DSA · NeetCode Backtracking
- Permutations — DSA · Backtracking (a sketch follows the questions below)
Interview questions to prep
- Walk through your pruning strategy — what subtrees do you skip and why is it safe?
- Where does memoization apply? Could this be a DP problem in disguise?
- What's the worst-case time complexity, and what's the depth of the recursion stack?
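A minimal Python sketch covering the prep questions above. The `used` array is the pruning: any branch that reuses an element already on the path would only rebuild a permutation we've seen, so skipping it is safe. Memoization doesn't apply in the usual sense — DP can *count* permutations, but since all n! outputs must be emitted, caching can't reduce the enumeration work.

```python
from typing import List

def permute(nums: List[int]) -> List[List[int]]:
    """Generate all permutations of nums via backtracking."""
    results: List[List[int]] = []
    path: List[int] = []
    used = [False] * len(nums)

    def backtrack() -> None:
        if len(path) == len(nums):
            results.append(path[:])  # snapshot the completed permutation
            return
        for i, n in enumerate(nums):
            if used[i]:
                continue  # prune: element already on the current path
            used[i] = True
            path.append(n)
            backtrack()
            path.pop()        # undo the choice before trying the next sibling
            used[i] = False

    backtrack()
    return results

# Worst case O(n * n!): n! leaves, each copied in O(n).
# Recursion stack depth is n (one frame per position filled).
print(permute([1, 2, 3]))
```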
- Subsets II — DSA · Backtracking (a sketch follows the questions below)
Interview questions to prep
- Walk through your pruning strategy — what subtrees do you skip and why is it safe?
- Where does memoization apply? Could this be a DP problem in disguise?
- What's the worst-case time complexity, and what's the depth of the recursion stack?
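Same pattern for Subsets II, as a sketch: sorting groups equal values together, and the `i > start` check prunes sibling branches that would rebuild an identical subset at the same tree level — safe because the skipped subtree is a verbatim copy of one already explored.

```python
from typing import List

def subsets_with_dup(nums: List[int]) -> List[List[int]]:
    """All unique subsets of nums, which may contain duplicates."""
    nums.sort()  # group duplicates so they can be skipped deterministically
    results: List[List[int]] = []
    path: List[int] = []

    def backtrack(start: int) -> None:
        results.append(path[:])  # every node in the tree is a valid subset
        for i in range(start, len(nums)):
            if i > start and nums[i] == nums[i - 1]:
                continue  # prune: same value at the same level => duplicate subtree
            path.append(nums[i])
            backtrack(i + 1)
            path.pop()

    backtrack(0)
    return results

# Worst case O(n * 2^n) when all elements are distinct; recursion depth is n.
print(subsets_with_dup([1, 2, 2]))
```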
DL · ViT, CLIP, multimodal
Interview questions to prep — ViT (sketch below)
- How does ViT tokenize an image, and what's the role of the [CLS] token?
- When does a ViT beat a CNN, and when does its appetite for large training sets hurt it?
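A toy PyTorch sketch of the first question (hyperparameters are illustrative, not the paper's exact configuration): a stride-16 convolution is the standard trick for "flatten each 16×16 patch and apply a shared linear projection", and the learnable [CLS] token is prepended so its final-layer state can serve as the whole-image representation for classification.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Toy ViT front end: patchify, project, prepend [CLS], add positions."""

    def __init__(self, img_size: int = 224, patch: int = 16, dim: int = 192):
        super().__init__()
        n_patches = (img_size // patch) ** 2  # 14 * 14 = 196 for these defaults
        # Stride-`patch` conv == "flatten each patch, then shared Linear"
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # +1 for [CLS]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls.expand(b, -1, -1)                  # one [CLS] per image
        return torch.cat([cls, tokens], dim=1) + self.pos  # (B, 197, dim)

x = torch.randn(2, 3, 224, 224)
print(PatchEmbed()(x).shape)  # torch.Size([2, 197, 192])
```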
Interview questions to prep — CLIP (sketch below)
- How does CLIP enable zero-shot image classification?
- Walk me through CLIP's contrastive training objective.
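A compact sketch of CLIP's symmetric contrastive (InfoNCE-style) objective: the i-th image and i-th caption in a batch are the positive pair, every other pairing is a negative. Note the real model learns its temperature as a parameter; it's fixed here for simplicity. Zero-shot classification reuses the same similarity matrix — embed one prompt per class name and pick the class whose text embedding is most similar to the image.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched image/text pairs."""
    img = F.normalize(img_emb, dim=-1)  # unit vectors -> dot = cosine similarity
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(img))       # positives sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(img_emb, txt_emb).item())
```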
Interview questions to prep — multimodal LLMs (sketch below)
- How do multimodal LLMs like LLaVA fuse vision encoders with language models?
- Compare early fusion vs late fusion in vision-language models — what does each cost in compute and quality?
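A fusion sketch in the LLaVA style: a frozen vision encoder yields patch features, a small projector maps them into the LLM's embedding space, and the image tokens are simply concatenated with the text token embeddings so the unmodified LLM attends over both. The two-layer MLP and dimensions match LLaVA-1.5's reported setup (the original LLaVA used a single linear layer), but treat them as illustrative; cross-attention designs like Flamingo instead pay extra parameters to keep the text sequence short.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """LLaVA-style fusion: project frozen vision features into LLM token space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector (LLaVA-1.5); LLaVA-1.0 used one linear layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        vis_tokens = self.proj(patch_feats)  # (B, n_patches, llm_dim)
        # Image tokens are prepended to the text sequence; the LLM sees one
        # mixed sequence and fuses modalities through ordinary self-attention.
        return torch.cat([vis_tokens, text_embeds], dim=1)

patch_feats = torch.randn(1, 576, 1024)  # e.g. 24x24 patches from a CLIP ViT-L
text_embeds = torch.randn(1, 32, 4096)   # already-embedded text tokens
print(VisionToLLMProjector()(patch_feats, text_embeds).shape)  # (1, 608, 4096)
```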
References & further reading
- Vision Transformer (ViT) paper, "An Image is Worth 16x16 Words" (Google)
- CLIP paper, "Learning Transferable Visual Models From Natural Language Supervision" (OpenAI)
- Papers with Code — SOTA leaderboards