Day 47 of 133

ViT, CLIP, multimodal LLMs

Patch tokenization, [CLS] token, contrastive image-text training.

DSA · NeetCode Backtracking

  • Permutations · DSA · Backtracking (sketch below)

    Interview questions to prep

    1. Walk through your pruning strategy — what subtrees do you skip and why is it safe?
    2. Where does memoization apply? Could this be a DP problem in disguise?
    3. What's the worst-case time complexity, and what's the depth of the recursion stack?
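
    A minimal Python sketch of one common answer (the function name and structure are my own, not NeetCode's reference solution): with distinct elements the only safe prune is the used-element check, which skips every subtree that would place the same element on the path twice.

    ```python
    from typing import List

    def permute(nums: List[int]) -> List[List[int]]:
        """All permutations of distinct integers via backtracking."""
        results: List[List[int]] = []
        path: List[int] = []
        used = [False] * len(nums)

        def backtrack() -> None:
            if len(path) == len(nums):   # leaf: one complete permutation
                results.append(path[:])
                return
            for i, x in enumerate(nums):
                if used[i]:              # prune: element already on the path
                    continue
                used[i] = True
                path.append(x)
                backtrack()              # recursion depth is at most len(nums)
                path.pop()               # undo the choice before the next branch
                used[i] = False

        backtrack()
        return results

    print(permute([1, 2, 3]))  # 6 lists; worst case O(n * n!) work, stack depth n
    ```

    Memoization buys nothing here: every permutation must be emitted, so the n! output is itself the lower bound.
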
  • Subsets II · DSA · Backtracking (sketch below)

    Interview questions to prep

    1. Walk through your pruning strategy — what subtrees do you skip and why is it safe?
    2. Where does memoization apply? Could this be a DP problem in disguise?
    3. What's the worst-case time complexity, and what's the depth of the recursion stack?
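
    A minimal sketch of the standard duplicate-skipping answer (again my own naming): sorting groups equal values, and the `i > start` guard prunes any sibling subtree that would be identical to one already explored at the same depth.

    ```python
    from typing import List

    def subsets_with_dup(nums: List[int]) -> List[List[int]]:
        """All unique subsets of a list that may contain duplicates."""
        nums.sort()                          # sorting puts duplicates next to each other
        results: List[List[int]] = []
        path: List[int] = []

        def backtrack(start: int) -> None:
            results.append(path[:])          # every node of the tree is a valid subset
            for i in range(start, len(nums)):
                if i > start and nums[i] == nums[i - 1]:
                    continue                 # prune: a twin subtree was already explored
                path.append(nums[i])
                backtrack(i + 1)
                path.pop()

        backtrack(0)
        return results

    print(subsets_with_dup([1, 2, 2]))  # [[], [1], [1,2], [1,2,2], [2], [2,2]]
    ```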

DL · ViT, CLIP, multimodal

  • Vision Transformer (ViT) · Deep Learning · Google (sketch below)

    Interview questions to prep

    1. How does ViT tokenize an image, and what's the role of the [CLS] token?
    2. When does a ViT beat a CNN, and when does data-hungriness hurt it?
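
    A minimal PyTorch sketch of the tokenization step, assuming ViT-Base/16 shapes (224x224 input, 16x16 patches, width 768); the class name is mine:

    ```python
    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Cut an image into non-overlapping patches and linearly embed them.
        A stride-p conv is equivalent to 'reshape into p x p patches, then Linear'."""
        def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
            super().__init__()
            self.num_patches = (img_size // patch) ** 2            # 14 * 14 = 196
            self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learned [CLS] embedding
            self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

        def forward(self, x):                              # x: (B, 3, 224, 224)
            x = self.proj(x)                               # (B, dim, 14, 14)
            x = x.flatten(2).transpose(1, 2)               # (B, 196, dim): one token per patch
            cls = self.cls_token.expand(x.shape[0], -1, -1)
            x = torch.cat([cls, x], dim=1)                 # [CLS] first; its final state feeds the classifier head
            return x + self.pos_embed                      # add learned position embeddings

    print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
    ```
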
  • CLIP · Deep Learning (sketch below)

    Interview questions to prep

    1. How does CLIP enable zero-shot image classification?
    2. Walk me through CLIP's contrastive training objective.
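
    A minimal sketch of the symmetric contrastive objective (real CLIP learns the temperature as a parameter; the fixed 0.07 here is its initial value):

    ```python
    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired embeddings:
        matched image/text pairs sit on the diagonal of the logit matrix."""
        image_emb = F.normalize(image_emb, dim=-1)       # unit vectors -> dot product = cosine
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.shape[0])          # i-th image matches i-th caption
        loss_i2t = F.cross_entropy(logits, targets)      # each image picks its caption
        loss_t2i = F.cross_entropy(logits.t(), targets)  # each caption picks its image
        return (loss_i2t + loss_t2i) / 2

    print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
    ```

    Zero-shot classification reuses the same machinery at inference: embed one prompt per class ("a photo of a {label}") and assign the image to the most similar text embedding.
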
  • Multimodal LLMs · Deep Learning (sketch below)

    Interview questions to prep

    1. How do multimodal LLMs like LLaVA fuse vision encoders with language models?
    2. Compare early fusion vs late fusion in vision-language models — what does each cost in compute and quality?
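
    A minimal sketch of LLaVA-style fusion; the dimensions (CLIP ViT-L features at 1024, a 7B-class LLM at 4096) and the adapter name are my assumptions:

    ```python
    import torch
    import torch.nn as nn

    class VisionToLLMAdapter(nn.Module):
        """Project vision-encoder patch features into the LLM's token-embedding
        space and prepend them, so one decoder attends over both modalities."""
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)  # LLaVA v1 used a single linear layer here

        def forward(self, patch_feats, text_embeds):
            # patch_feats: (B, N_patches, vision_dim) from the frozen vision encoder
            # text_embeds: (B, N_text, llm_dim) from the LLM's embedding table
            image_tokens = self.proj(patch_feats)                 # (B, N_patches, llm_dim)
            return torch.cat([image_tokens, text_embeds], dim=1)  # one fused input sequence

    print(VisionToLLMAdapter()(torch.randn(2, 576, 1024), torch.randn(2, 32, 4096)).shape)
    # torch.Size([2, 608, 4096])
    ```

    This is input-level (early) fusion: simple to train, but every image costs hundreds of LLM tokens of context. Designs like Flamingo instead inject vision later through interleaved cross-attention layers, trading extra parameters for a shorter input sequence.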
