Day 100 of 133
CV deep dive: production detection, video, multimodal
YOLO production pipeline; video transformers; CLIP fusion.
DSA · NeetCode Graphs
- Redundant Connection (Union-Find sketch after the questions below)
Interview questions to prep
- Is this BFS, DFS, or Union-Find? Defend the choice over the other two.
- Walk through complexity in terms of V and E. Where do those costs come from?
- How would you handle disconnected components, self-loops, or duplicate edges?
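A minimal Union-Find sketch for Redundant Connection, assuming the standard setup (nodes labeled 1..n, the input is a tree plus exactly one extra edge): process edges in order and return the first edge whose endpoints are already connected.

```python
def find_redundant_connection(edges):
    """Return the edge whose addition creates a cycle (Redundant Connection)."""
    n = len(edges)                      # a tree plus one extra edge -> n nodes, n edges
    parent = list(range(n + 1))         # nodes are labeled 1..n
    rank = [0] * (n + 1)

    def find(x):
        # Path compression: point visited nodes closer to the root.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                # already connected -> this edge closes a cycle
        if rank[ra] < rank[rb]:         # union by rank keeps trees shallow
            ra, rb = rb, ra
        parent[rb] = ra
        if rank[ra] == rank[rb]:
            rank[ra] += 1
        return True

    for u, v in edges:
        if not union(u, v):
            return [u, v]
    return []


if __name__ == "__main__":
    print(find_redundant_connection([[1, 2], [1, 3], [2, 3]]))  # -> [2, 3]
```

With path compression and union by rank the loop runs in near-linear O(E · α(V)) time with O(V) extra space, which is the cost breakdown the complexity question above is asking for.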
Specialization · CV deep dive
Production detection · Interview questions to prep
- Walk through deploying a YOLO model in production: pre/post-processing, NMS, calibration (see the NMS sketch after this list).
- How would you handle class imbalance and rare categories in a production detection model?
- When are classical image-processing features enough, and when do CNNs or ViTs become necessary?
- How would you debug preprocessing or signal-processing artifacts that quietly break a vision model?
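For the post-processing step, a greedy NMS sketch in NumPy; the score/IoU thresholds and the [x1, y1, x2, y2] corner box format are illustrative assumptions, not tied to any particular YOLO release.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.25):
    """Greedy NMS: drop low-confidence boxes, then suppress boxes that
    overlap a higher-scoring kept box above the IoU threshold.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices (into the original arrays) of kept boxes, best first.
    """
    keep_mask = scores >= score_threshold            # confidence filter first
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    idx_map = np.flatnonzero(keep_mask)              # map back to original indices

    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]                   # highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(idx_map[i]))
        # Intersection of box i with every remaining candidate.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_threshold]       # keep only low-overlap boxes
    return keep


if __name__ == "__main__":
    boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]], dtype=float)
    scores = np.array([0.9, 0.8, 0.75])
    print(nms(boxes, scores))  # -> [0, 2]; box 1 is suppressed by box 0
```

In a real pipeline this runs per class (or class-agnostic, depending on the model), and the two thresholds become calibration knobs that trade recall against duplicate detections.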
Video · Interview questions to prep
- Compare 3D CNNs vs two-stream vs video transformers for action recognition.
- Why is temporal modeling hard, and what does sparse temporal sampling buy you? (See the sampling sketch after this list.)
- What makes text-to-video generation harder than text-to-image generation?
- How would you evaluate temporal consistency, motion quality, and prompt adherence for generated video?
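A sketch of TSN-style sparse temporal sampling, the idea the sampling question above points at: divide the clip into equal segments and take one frame per segment, with random jitter at train time and the center frame at eval time. The segment count here is an illustrative choice.

```python
import random

def sample_frame_indices(num_frames, num_segments=8, training=True):
    """Sparse temporal sampling: one frame per equal-length segment.

    Random offset within each segment at train time (temporal jitter);
    the segment's center frame at eval time for determinism.
    """
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(start, int((s + 1) * seg_len) - 1)
        if training:
            indices.append(random.randint(start, end))
        else:
            indices.append((start + end) // 2)
    return indices


if __name__ == "__main__":
    # A 10-second clip at 30 fps covered by only 8 frames.
    print(sample_frame_indices(300, num_segments=8, training=False))
```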
Multimodal · Interview questions to prep
- Walk me through how a multimodal model fuses vision and text, with CLIP as the case study (see the contrastive-fusion sketch after this list).
- Where does CLIP fail, and how do later models (SigLIP, EVA-CLIP) improve it?
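A minimal PyTorch sketch of CLIP-style fusion: each modality has its own encoder, both embeddings are L2-normalized, the temperature-scaled cosine-similarity matrix serves as logits, and a symmetric cross-entropy over the diagonal (matching) pairs is the training objective. The embedding size, batch size, and fixed logit scale below are illustrative stand-ins; in CLIP the scale is the exponential of a learnable temperature.

```python
import torch
import torch.nn.functional as F

def clip_style_logits(image_features, text_features, logit_scale=100.0):
    """Cosine-similarity logits between a batch of image and text embeddings."""
    image_features = F.normalize(image_features, dim=-1)   # unit-length vectors
    text_features = F.normalize(text_features, dim=-1)
    return logit_scale * image_features @ text_features.t()  # (B, B) similarity matrix

def contrastive_loss(logits_per_image):
    """Symmetric InfoNCE: matching image/text pairs sit on the diagonal."""
    targets = torch.arange(logits_per_image.size(0), device=logits_per_image.device)
    loss_i = F.cross_entropy(logits_per_image, targets)       # image -> text
    loss_t = F.cross_entropy(logits_per_image.t(), targets)   # text -> image
    return (loss_i + loss_t) / 2

if __name__ == "__main__":
    # Hypothetical batch of 4 already-encoded image/text pairs, 512-d embeddings.
    img = torch.randn(4, 512)
    txt = torch.randn(4, 512)
    print(contrastive_loss(clip_style_logits(img, txt)))
```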
References & further reading
- Papers with Code (SOTA leaderboards)
- Vision Transformer (ViT) paper (Google)
- CLIP paper (OpenAI)