Day 49 of 133
CV consolidation + DSA Backtracking
Run a 60-minute CV breadth quiz; rehearse mAP, IoU, U-Net, and DETR.
DSA · NeetCode Backtracking
- Palindrome Partitioning
Interview questions to prep
- Walk through your pruning strategy — what subtrees do you skip and why is it safe?
- Where does memoization apply? Could this be a DP problem in disguise?
- What's the worst-case time complexity, and what's the depth of the recursion stack?
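A minimal Python sketch of the standard backtracking approach to Palindrome Partitioning (illustrative code, not from the plan): precompute an `is_pal` table so the recursion only ever extends prefixes that are palindromes; that table is both the pruning and the memoization answer, since it caches palindrome checks instead of recomputing them.

```python
def partition(s: str) -> list[list[str]]:
    n = len(s)
    # is_pal[i][j] is True iff s[i..j] is a palindrome; O(n^2) to fill,
    # then every palindrome check inside the recursion is O(1).
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            if s[i] == s[j] and (j - i < 2 or is_pal[i + 1][j - 1]):
                is_pal[i][j] = True

    result, path = [], []

    def backtrack(start: int) -> None:
        if start == n:                 # consumed the whole string: record a copy
            result.append(path[:])
            return
        for end in range(start, n):
            if is_pal[start][end]:     # prune: never recurse on a non-palindromic prefix
                path.append(s[start:end + 1])
                backtrack(end + 1)
                path.pop()             # undo the choice and try a longer prefix

    backtrack(0)
    return result

# partition("aab") -> [['a', 'a', 'b'], ['aa', 'b']]
```

Worst case (e.g. a string of identical characters) there are O(2^n) valid partitions, each copied in O(n), so the bound is O(n · 2^n); the recursion stack never goes deeper than n frames.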
DL · CNN architectures
Interview questions to prep
- What problem did ResNet's residual connections actually solve?
- Why did 1×1 convs become so important (Inception, bottleneck blocks)?
Interview questions to prep
- Explain why training error went UP with depth before ResNet.
- Walk me through a residual block.
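As a concrete reference for the residual-block and 1×1-bottleneck questions above, here is a minimal PyTorch sketch of a ResNet-style bottleneck block (a sketch of the standard design with assumed sizes, not code from any source in this plan). The two 1×1 convs shrink and then restore the channel count so the 3×3 conv runs on a cheaper representation, and the skip connection means the block only has to learn a residual on top of the identity.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus a skip connection."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),        # 1x1: shrink channels
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),  # cheap 3x3 on fewer channels
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),        # 1x1: restore channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + F(x): the identity path gives gradients a direct route through deep stacks.
        return self.relu(x + self.body(x))

x = torch.randn(2, 256, 56, 56)
print(Bottleneck(256)(x).shape)   # torch.Size([2, 256, 56, 56])
```

The identity path is also the answer to the degradation question: before ResNet, deeper plain stacks showed higher training error because optimization struggled, whereas with y = x + F(x) a block can fall back to the identity and gradients flow straight through the addition.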
Interview questions to prep
- How do depthwise separable convolutions reduce compute?
- What does EfficientNet's compound scaling do that one-axis scaling doesn't?
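A short sketch of the depthwise-separable idea, with assumed channel sizes (128 in, 256 out) chosen only to make the parameter counts concrete; compound scaling is a policy (scale depth, width, and resolution together under one budget) rather than a layer, so it is not shown here.

```python
import torch.nn as nn

cin, cout, k = 128, 256, 3

standard = nn.Conv2d(cin, cout, k, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    # depthwise: one k x k filter per input channel (groups=cin), no channel mixing
    nn.Conv2d(cin, cin, k, padding=1, groups=cin, bias=False),
    # pointwise: 1x1 conv mixes channels
    nn.Conv2d(cin, cout, 1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))             # 128*256*3*3 = 294,912
print(count(depthwise_separable))  # 128*3*3 + 128*256 = 33,920 (~8.7x fewer)
```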
DL · ViT, CLIP, multimodal
Interview questions to prep
- How does ViT tokenize an image, and what's the role of the [CLS] token?
- When does a ViT beat a CNN, and when does its hunger for data hurt it?
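A minimal sketch of ViT-style tokenization with typical but assumed sizes (224-pixel image, 16-pixel patches, 768-dim embeddings): cut the image into patches, linearly project each one, prepend a learned [CLS] token, and add position embeddings; the [CLS] output is what the classification head reads.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img=224, patch=16, dim=768):
        super().__init__()
        self.n = (img // patch) ** 2                      # 196 patches for 224/16
        # conv with stride = kernel = patch size is the usual "cut into patches + project"
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learned [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)          # one [CLS] per image
        return torch.cat([cls, tokens], dim=1) + self.pos  # (B, 197, dim)

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)     # torch.Size([2, 197, 768])
```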
Interview questions to prep
- How does CLIP enable zero-shot image classification?
- Walk me through CLIP's contrastive training objective.
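A compact sketch of a CLIP-style symmetric contrastive objective (the shapes and temperature are assumptions, and this paraphrases the published pseudocode rather than CLIP's actual code): normalize both embeddings, take all pairwise similarities in the batch, and apply cross-entropy in both directions so each image's highest logit is its own caption and vice versa.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    # img_emb, txt_emb: (B, D) embeddings of B matched image/text pairs
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(len(img))              # i-th image matches i-th text
    # symmetric: pick the right text for each image, and the right image for each text
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Zero-shot classification falls out of the same geometry: embed prompts like "a photo of a {label}" for every class and predict the class whose text embedding is closest to the image embedding.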
Interview questions to prep
- Compare contrastive learning, masked prediction, and autoencoding as self-supervised objectives.
- How would you evaluate whether a self-supervised embedding transfers to a downstream product task?
- What data leakage or shortcut-learning failure modes appear in self-supervised pretraining?
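For the transfer-evaluation question, the usual first check is a linear probe: freeze the pretrained encoder, fit a linear classifier on its embeddings for the downstream labels, and compare against a supervised or randomly initialized baseline. A sketch with stand-in random embeddings (in practice they would come from the frozen self-supervised encoder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_emb, train_y, test_emb, test_y):
    # frozen embeddings in, accuracy out: a quick read on how linearly separable
    # the downstream labels are in the pretrained representation
    clf = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
    return accuracy_score(test_y, clf.predict(test_emb))

rng = np.random.default_rng(0)   # stand-in data only
acc = linear_probe(rng.normal(size=(500, 128)), rng.integers(0, 10, 500),
                   rng.normal(size=(100, 128)), rng.integers(0, 10, 100))
```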
Interview questions to prep
- How do multimodal LLMs like LLaVA fuse vision encoders with language models?
- Compare early fusion vs late fusion in vision-language models — what does each cost in compute and quality?
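A schematic sketch of the two fusion styles with toy dimensions and illustrative variable names (LLaVA's actual recipe is early fusion through a trained projector, but this is not its code). Early fusion maps vision tokens into the LLM's embedding space so every attention layer sees image and text jointly, which costs more compute over longer sequences but generally helps quality; late fusion pools each modality separately and merges only at the end, which is cheap but gives the language model no token-level access to the image.

```python
import torch
import torch.nn as nn

d_vis, d_txt = 1024, 4096                 # vision-encoder and LLM hidden sizes (toy values)
vis_tokens = torch.randn(1, 196, d_vis)   # patch features from a frozen vision encoder
txt_tokens = torch.randn(1, 32, d_txt)    # prompt token embeddings from the LLM

# Early fusion (LLaVA-style): project image patches into the LLM embedding space
# and prepend them, so self-attention mixes image and text at every layer.
projector = nn.Linear(d_vis, d_txt)
early_input = torch.cat([projector(vis_tokens), txt_tokens], dim=1)   # (1, 228, 4096)

# Late fusion: run each tower to a pooled vector and combine only at the head.
img_vec = vis_tokens.mean(dim=1)                     # (1, 1024)
txt_vec = txt_tokens.mean(dim=1)                     # (1, 4096)
late_vec = torch.cat([img_vec, txt_vec], dim=-1)     # fed to a small fusion MLP / scorer
```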
References & further reading
- CS231n: Convolutional Neural Networks for Visual Recognition (Stanford)
- Papers with Code: state-of-the-art leaderboards
- Lilian Weng: Self-Supervised Representation Learning