Day 100 of 133

CV deep dive: production detection, video, multimodal

YOLO production pipeline; video transformers; CLIP fusion.

DSA · NeetCode Graphs

  • Interview questions to prep

    1. Is this BFS, DFS, or Union-Find? Defend the choice over the other two (see the Union-Find sketch after this list).
    2. Walk through complexity in terms of V and E. Where do those costs come from?
    3. How would you handle disconnected components, self-loops, or duplicate edges?
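
  • Code sketch for question 1 · a minimal Union-Find, assuming Python; the names DSU and count_components are illustrative. Path compression plus union by rank makes each union near O(alpha(V)), which makes the V and E costs in question 2 explicit, and question 3's self-loops and duplicate edges are absorbed as no-op unions.

      class DSU:
          """Disjoint Set Union with path compression and union by rank."""

          def __init__(self, n: int):
              self.parent = list(range(n))
              self.rank = [0] * n

          def find(self, x: int) -> int:
              # Path compression: point nodes on the path at their grandparent.
              while self.parent[x] != x:
                  self.parent[x] = self.parent[self.parent[x]]
                  x = self.parent[x]
              return x

          def union(self, a: int, b: int) -> bool:
              ra, rb = self.find(a), self.find(b)
              if ra == rb:
                  return False  # self-loops and duplicate edges land here
              if self.rank[ra] < self.rank[rb]:
                  ra, rb = rb, ra
              self.parent[rb] = ra
              if self.rank[ra] == self.rank[rb]:
                  self.rank[ra] += 1
              return True

      def count_components(n: int, edges: list) -> int:
          # O(V) to initialize, O(E * alpha(V)) across all unions.
          dsu = DSU(n)
          components = n
          for a, b in edges:
              if dsu.union(a, b):
                  components -= 1
          return components

      # Disconnected node 4, a self-loop, and a duplicate edge, all handled:
      assert count_components(5, [(0, 1), (1, 2), (2, 2), (0, 1), (3, 3)]) == 3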

Specialization · CV deep dive

  • Interview questions to prep · production detection

    1. Walk through deploying a YOLO model in production: pre/post-processing, NMS, calibration (an NMS sketch follows this list).
    2. How would you handle class imbalance and rare categories in a production detection model?
    3. When are classical image-processing features enough, and when do CNNs or ViTs become necessary?
    4. How would you debug preprocessing or signal-processing artifacts that quietly break a vision model?
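
  • Code sketch for question 1 · greedy NMS, assuming Python/NumPy; the function name nms is illustrative. Production YOLO exports often fuse NMS into the graph or run a batched, class-aware variant, but the greedy core is the same: keep the highest-scoring box, suppress everything that overlaps it beyond an IoU threshold, repeat.

      import numpy as np

      def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
          """Greedy NMS. boxes: (N, 4) as [x1, y1, x2, y2]; returns kept indices."""
          x1, y1, x2, y2 = boxes.T
          areas = (x2 - x1) * (y2 - y1)
          order = scores.argsort()[::-1]  # highest confidence first
          keep = []
          while order.size > 0:
              i = order[0]
              keep.append(int(i))
              # IoU of the kept box against all remaining candidates.
              xx1 = np.maximum(x1[i], x1[order[1:]])
              yy1 = np.maximum(y1[i], y1[order[1:]])
              xx2 = np.minimum(x2[i], x2[order[1:]])
              yy2 = np.minimum(y2[i], y2[order[1:]])
              inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
              iou = inter / (areas[i] + areas[order[1:]] - inter)
              order = order[1:][iou <= iou_thresh]  # suppress heavy overlaps
          return keep

      # Two near-duplicate boxes and one distinct box -> two detections kept:
      boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
      print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]
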
  • Interview questions to prep · video

    1. Compare 3D CNNs vs two-stream vs video transformers for action recognition.
    2. Why is temporal modeling hard, and what does sparse temporal sampling buy you? (See the sampling sketch after this list.)
    3. What makes text-to-video generation harder than text-to-image generation?
    4. How would you evaluate temporal consistency, motion quality, and prompt adherence for generated video?
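
  • Code sketch for question 2 · TSN-style sparse temporal sampling, assuming Python; sparse_sample is an illustrative name. Splitting the clip into equal segments and taking one frame per segment covers the full duration at a fixed per-clip cost, which is what sparse sampling buys over dense sliding windows.

      import random

      def sparse_sample(num_frames: int, num_segments: int = 8, train: bool = True) -> list:
          """One frame per equal-length segment: random offset while training
          (augmentation), segment center at eval (determinism)."""
          seg_len = num_frames / num_segments
          indices = []
          for k in range(num_segments):
              start = int(k * seg_len)
              end = max(start, int((k + 1) * seg_len) - 1)
              indices.append(random.randint(start, end) if train else (start + end) // 2)
          return indices

      # A 300-frame clip reduced to 8 representative frames:
      print(sparse_sample(300, 8, train=False))  # [18, 55, 93, 130, 168, 205, 243, 280]
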
  • Interview questions to prep · multimodal

    1. Walk me through how a multimodal model fuses vision and text, with CLIP as the case study (see the fusion sketch after this list).
    2. Where does CLIP fail, and how do later models (SigLIP, EVA-CLIP) improve it?
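
  • Code sketch for question 1 · CLIP-style late fusion, assuming Python/NumPy; clip_logits and symmetric_contrastive_loss are illustrative names. Each modality is embedded separately, L2-normalized, and compared via temperature-scaled cosine similarity; training pushes matched pairs onto the diagonal with a symmetric cross-entropy. For question 2, SigLIP swaps this softmax contrastive loss for a pairwise sigmoid loss.

      import numpy as np

      def clip_logits(img_emb, txt_emb, temperature=0.07):
          """L2-normalize each modality, then scaled cosine similarity: (B, D) -> (B, B)."""
          img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
          txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
          return (img @ txt.T) / temperature

      def symmetric_contrastive_loss(logits):
          """Matched pairs sit on the diagonal; cross-entropy runs both
          image->text (rows) and text->image (columns), then averages."""
          def xent(l):
              l = l - l.max(axis=1, keepdims=True)  # numerical stability
              log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
              return -np.mean(np.diag(log_probs))
          return 0.5 * (xent(logits) + xent(logits.T))

      rng = np.random.default_rng(0)
      logits = clip_logits(rng.normal(size=(4, 512)), rng.normal(size=(4, 512)))
      print(symmetric_contrastive_loss(logits))  # batch of 4; chance level ~ ln(4)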

References & further reading