Day 100 of 133
CV deep dive: production detection, video, multimodal
YOLO production pipeline; video transformers; CLIP fusion.
DSA · NeetCode Graphs
- Redundant Connection (Union-Find sketch after the questions below)
Interview questions to prep
- Is this BFS, DFS, or Union-Find? Defend the choice over the other two.
- Walk through complexity in terms of V and E. Where do those costs come from?
- How would you handle disconnected components, self-loops, or duplicate edges?
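A minimal Union-Find sketch for Redundant Connection, assuming the standard setup (nodes labeled 1..n, the input is a tree plus exactly one extra edge): process edges in order and return the first edge whose endpoints are already connected.

```python
def find_redundant_connection(edges):
    """Return the edge whose addition creates a cycle (Redundant Connection)."""
    n = len(edges)                      # a tree plus one extra edge -> n nodes, n edges
    parent = list(range(n + 1))         # nodes are labeled 1..n
    rank = [0] * (n + 1)

    def find(x):
        # Path compression: point visited nodes closer to the root.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                # already connected -> this edge closes a cycle
        if rank[ra] < rank[rb]:         # union by rank keeps trees shallow
            ra, rb = rb, ra
        parent[rb] = ra
        if rank[ra] == rank[rb]:
            rank[ra] += 1
        return True

    for u, v in edges:
        if not union(u, v):
            return [u, v]
    return []


if __name__ == "__main__":
    print(find_redundant_connection([[1, 2], [1, 3], [2, 3]]))  # -> [2, 3]
```

With path compression and union by rank the loop runs in near-linear O(E · α(V)) time with O(V) extra space, which is the cost breakdown the complexity question above is asking for.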
Specialization · CV deep dive
Production detection · Interview questions to prep
- Walk through deploying a YOLO model in production: pre/post-processing, NMS, calibration (see the NMS sketch after this list).
- How would you handle class imbalance and rare categories in a production detection model?
- When are classical image-processing features enough, and when do CNNs or ViTs become necessary?
- How would you debug preprocessing or signal-processing artifacts that quietly break a vision model?
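For the post-processing step, a greedy NMS sketch in NumPy; the score/IoU thresholds and the [x1, y1, x2, y2] corner box format are illustrative assumptions, not tied to any particular YOLO release.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.25):
    """Greedy NMS: drop low-confidence boxes, then suppress boxes that
    overlap a higher-scoring kept box above the IoU threshold.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices (into the original arrays) of kept boxes, best first.
    """
    keep_mask = scores >= score_threshold            # confidence filter first
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    idx_map = np.flatnonzero(keep_mask)              # map back to original indices

    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                   # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(idx_map[i]))
        # Intersection of box i with every remaining candidate.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_threshold]       # keep only low-overlap boxes
    return keep


if __name__ == "__main__":
    boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
    scores = np.array([0.9, 0.8, 0.75])
    print(nms(boxes, scores))  # -> [0, 2]; box 1 is suppressed by box 0
```

In a real pipeline this runs per class (or class-agnostic, depending on the model), and the two thresholds become calibration knobs that trade recall against duplicate detections.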
Video · Interview questions to prep
- Compare 3D CNNs vs two-stream vs video transformers for action recognition.
- Why is temporal modeling hard, and what does sparse temporal sampling buy you? (See the sampling sketch after this list.)
- What makes text-to-video generation harder than text-to-image generation?
- How would you evaluate temporal consistency, motion quality, and prompt adherence for generated video?
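A sketch of TSN-style sparse temporal sampling, the idea the sampling question above points at: divide the clip into equal segments and take one frame per segment, with random jitter at train time and the center frame at eval time. The segment count here is an illustrative choice.

```python
import random

def sample_frame_indices(num_frames, num_segments=8, training=True):
    """Sparse temporal sampling: one frame per equal-length segment.

    Random offset within each segment at train time (temporal jitter);
    the segment's center frame at eval time for determinism.
    """
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(start, int((s + 1) * seg_len) - 1)
        if training:
            indices.append(random.randint(start, end))
        else:
            indices.append((start + end) // 2)
    return indices


if __name__ == "__main__":
    # A 10-second clip at 30 fps covered by only 8 frames.
    print(sample_frame_indices(300, num_segments=8, training=False))
```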
Multimodal · Interview questions to prep
- Walk me through how a multimodal model fuses vision and text, with CLIP as the case study (see the contrastive-fusion sketch after this list).
- Where does CLIP fail, and how do later models (SigLIP, EVA-CLIP) improve it?
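A minimal PyTorch sketch of CLIP-style fusion: each modality has its own encoder, both embeddings are L2-normalized, the temperature-scaled cosine-similarity matrix serves as logits, and a symmetric cross-entropy over the diagonal (matching) pairs is the training objective. The embedding size, batch size, and fixed logit scale below are illustrative stand-ins; in CLIP the scale is the exponential of a learnable temperature.

```python
import torch
import torch.nn.functional as F

def clip_style_logits(image_features, text_features, logit_scale=100.0):
    """Cosine-similarity logits between a batch of image and text embeddings."""
    image_features = F.normalize(image_features, dim=-1)   # unit-length vectors
    text_features = F.normalize(text_features, dim=-1)
    return logit_scale * image_features @ text_features.t()  # (B, B) similarity matrix

def contrastive_loss(logits_per_image):
    """Symmetric InfoNCE: matching image/text pairs sit on the diagonal."""
    targets = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    loss_i = F.cross_entropy(logits_per_image, targets)       # image -> text
    loss_t = F.cross_entropy(logits_per_image.t(), targets)   # text -> image
    return (loss_i + loss_t) / 2

if __name__ == "__main__":
    # Hypothetical batch of 4 already-encoded image/text pairs, 512-d embeddings.
    img = torch.randn(4, 512)
    txt = torch.randn(4, 512)
    print(contrastive_loss(clip_style_logits(img, txt)))
```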
References & further reading
- Papers with Code (SOTA leaderboards)
- Vision Transformer (ViT) paper (Google)
- CLIP paper (OpenAI)