Computer vision interviews at Google, Meta, Apple, Tesla, Nvidia, and Amazon have become the battleground for ML engineer candidates. These companies aren't just looking for someone who can implement a CNN from scratch: they want engineers who can architect real-time detection systems for autonomous vehicles, design multimodal models that understand both images and text, and debug distribution shifts in production vision pipelines serving millions of users daily.
What makes computer vision interviews particularly brutal is the expectation that you'll connect theoretical concepts to real-world constraints. You might start by explaining how Feature Pyramid Networks improve small object detection, then suddenly find yourself designing the entire inference pipeline for a Tesla FSD chip with 5ms latency requirements. Many candidates know the theory behind ResNet skip connections but stumble when asked to choose between MobileNet and EfficientNet for an edge deployment scenario with specific memory and power budgets.
Here are the top 31 computer vision interview questions organized by the core areas that matter most for landing ML engineer roles at top tech companies.
Image Classification and CNN Architectures
Image classification and CNN architecture questions test whether you understand the fundamental building blocks of computer vision, but more importantly, whether you can make informed decisions about model selection and optimization for real deployments. Most candidates can recite the differences between ResNet and VGG, but they fail when asked to diagnose why their ResNet-50 is plateauing on a 10,000-class dataset or how to choose between architectures for strict latency requirements.
The key insight interviewers are looking for is your ability to connect architectural design choices to specific performance outcomes. When you explain why skip connections enable training deeper networks, you should immediately follow with scenarios where deeper isn't better, like when you're memory-constrained or when your dataset is too small to benefit from the additional capacity.
Before tackling any advanced CV topic, you need a rock-solid grasp of convolutional neural networks and image classification fundamentals. Interviewers at Google and Nvidia frequently probe your understanding of architectural design choices, training strategies, and the evolution from AlexNet to modern architectures, and candidates often stumble when asked to justify one design over another.
You're building an image classification model at Google for a dataset with 10,000 classes. You notice your ResNet-50 baseline is converging slowly and plateauing at low accuracy. Walk me through what architectural and training modifications you would prioritize and why.
Sample Answer
Most candidates default to simply making the network deeper or wider, but that fails here because with 10,000 classes the bottleneck is often in the classifier head and gradient flow, not raw capacity. You should first switch to a cosine learning rate schedule with warmup, add label smoothing (e.g., $\epsilon = 0.1$) to soften the massive softmax distribution, and consider replacing the single fully connected layer with a normalized classifier where both features and weights are L2-normalized before computing logits. Architecturally, swapping to a ResNeXt or EfficientNet backbone gives better accuracy per FLOP through grouped convolutions or compound scaling. Finally, mixup or CutMix regularization becomes critical at this class count to prevent overfitting on rare classes.
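A minimal PyTorch sketch of these modifications, assuming a ResNet-50 backbone with 2048-d pooled features; the cosine-classifier scale, label smoothing value, and schedule lengths are illustrative assumptions rather than tuned settings:

```python
# Sketch of the modifications above: normalized (cosine) classifier head,
# label smoothing, and a warmup + cosine learning rate schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

NUM_CLASSES = 10_000

class NormalizedClassifier(nn.Module):
    """L2-normalize features and weights before computing logits."""
    def __init__(self, feat_dim, num_classes, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.scale = scale  # sharpens the softmax over cosine similarities

    def forward(self, features):
        return self.scale * F.normalize(features, dim=1) @ F.normalize(self.weight, dim=1).t()

backbone = torchvision.models.resnet50(weights=None)
backbone.fc = nn.Identity()  # expose the 2048-d pooled features
model = nn.Sequential(backbone, NormalizedClassifier(2048, NUM_CLASSES))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # softens the 10k-way targets

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5),  # warmup
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=95),                 # cosine decay
    ],
    milestones=[5],  # switch from warmup to cosine after 5 epochs
)
```

Being able to point at the normalized head and explain why it stabilizes a 10,000-way softmax is exactly the kind of theory-to-implementation bridge interviewers reward.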
Explain why ResNets with skip connections can be trained to hundreds of layers while plain VGG-style networks degrade in performance beyond roughly 20 layers.
A Tesla Autopilot team asks you to choose between a MobileNetV2 and an EfficientNet-B0 for an on-device classification task running on their FSD chip with strict latency requirements under 5ms. How do you decide?
During a design review at Meta, a colleague proposes replacing all $3 \times 3$ convolutions in your classification network with $1 \times 1$ convolutions to reduce compute. What is wrong with this proposal, and what would you suggest instead?
You're at Amazon working on product image classification and observe that your model achieves 95% accuracy on the validation set but only 82% in production. Describe how you would diagnose whether this is an architectural issue or a data distribution issue, and what CNN-level changes might help.
Walk me through how batch normalization interacts with the convolutional layers in a CNN during training versus inference, and explain why the ordering of Conv, BN, and ReLU matters.
Object Detection
Object detection interviews go beyond asking you to explain YOLO versus Faster R-CNN. Interviewers want to see that you understand the nuanced tradeoffs between one-stage and two-stage approaches, especially in the context of real-time systems like autonomous vehicles or robotics applications where latency directly impacts safety and user experience.
A common mistake candidates make is treating anchor generation and feature pyramid design as separate topics when they're deeply interconnected. The best answers demonstrate how anchor scales and aspect ratios must be carefully tuned based on your dataset's object size distribution, and how FPN's top-down pathway specifically addresses the small object detection problem that single-scale features can't solve.
Companies like Tesla, Waymo, and Amazon expect you to explain the full pipeline of object detection, from anchor generation to non-max suppression, and to compare one-stage versus two-stage detectors with precision. You will struggle here if you only memorize model names without understanding how region proposals, feature pyramids, and loss functions interact under real-world constraints.
Walk me through how anchor boxes are generated in Faster R-CNN and explain why using multiple scales and aspect ratios at each spatial location matters for detection performance.
Sample Answer
Anchors are pre-defined bounding boxes placed at every spatial position of the feature map, each with multiple scales and aspect ratios, so the network has a diverse set of reference boxes to regress from. At each location, Faster R-CNN typically generates $k$ anchors (e.g., $k=9$ for 3 scales times 3 aspect ratios), and the Region Proposal Network predicts whether each anchor contains an object and refines its coordinates via learned offsets $\Delta x, \Delta y, \Delta w, \Delta h$. Without multiple scales and aspect ratios, you would miss objects whose shapes deviate from a single template, leading to poor recall on tall, wide, or small objects. This design lets the detector cover the space of possible object geometries densely without requiring an explicit search over box dimensions at inference time.
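A hedged sketch of that dense anchor grid, assuming a stride-16 feature map and the classic 3-scale, 3-ratio configuration; the specific scales and ratios are illustrative and would be tuned to your dataset's object-size distribution:

```python
# Dense anchor generation in the style of Faster R-CNN's RPN.
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors in (x1, y1, x2, y2) image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # anchor center in image space
            for scale in scales:
                for ratio in ratios:
                    # Keep area roughly scale^2 while varying aspect ratio h/w = ratio.
                    w = scale * np.sqrt(1.0 / ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors, dtype=np.float32)

# A 50x38 feature map at stride 16 yields 50 * 38 * 9 = 17,100 anchors.
print(generate_anchors(38, 50).shape)  # (17100, 4)
```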
You are building a real-time pedestrian detection system for an autonomous vehicle. How would you decide between a one-stage detector like YOLO and a two-stage detector like Faster R-CNN, and what tradeoffs should you communicate to your team?
Explain how Feature Pyramid Networks (FPN) improve object detection across different scales, and describe what would happen to your model's performance on small objects if you removed the top-down pathway.
During evaluation of your object detection model, you notice that mAP at IoU=0.5 is strong but mAP at IoU=0.75 drops significantly. What does this tell you about your model's predictions, and how would you systematically improve localization quality?
Describe how non-maximum suppression works, then explain a scenario where standard NMS fails and what alternative you would propose to handle it at Waymo-scale inference on dense urban scenes.
Image Segmentation
Segmentation questions reveal whether you can think beyond pixel-level accuracy to consider the practical constraints of real-world deployment. Tesla and Waymo interviewers particularly focus on real-time segmentation scenarios where you need to balance the rich spatial information from encoder-decoder architectures against the computational overhead that kills your frame rate.
The critical distinction most candidates miss is when to use semantic versus instance versus panoptic segmentation based on the specific application requirements. Understanding that Mask R-CNN's two-stage approach gives you instance-level precision but at the cost of inference speed helps you make the right architectural choice when the interviewer presents a concrete deployment scenario.
Segmentation questions test whether you can distinguish between semantic, instance, and panoptic segmentation and articulate the architectural differences each requires. Interviewers at Apple and Meta often ask you to design or improve segmentation systems for specific use cases, so surface-level knowledge of U-Net or Mask R-CNN alone will not be enough.
You are building a real-time pedestrian segmentation system for an autonomous vehicle at Waymo. Would you choose a single-stage approach like a fully convolutional network or a two-stage approach like Mask R-CNN, and why?
Sample Answer
The two realistic options are a single-stage fully convolutional approach or a two-stage Mask R-CNN pipeline. The single-stage approach wins here because real-time autonomous driving demands low latency, and fully convolutional architectures like BiSeNet or DDRNet can run at 30+ FPS while Mask R-CNN typically runs at 5-10 FPS. However, if you also need to distinguish individual pedestrians (instance segmentation) rather than just labeling all pedestrian pixels together, you would need to sacrifice some speed or adopt a real-time instance segmentation model like YOLACT or SOLOv2. The key tradeoff you should articulate is that semantic segmentation is faster but loses instance identity, while two-stage methods give you per-instance masks at a significant latency cost.
A Meta interviewer asks you to explain how panoptic segmentation unifies semantic and instance segmentation. Walk through how a model like Panoptic FPN produces its final output and how conflicts between overlapping predictions are resolved.
You are improving a U-Net based medical image segmentation model at Apple for segmenting skin lesions, but it struggles with small, irregularly shaped lesions. What architectural or training modifications would you propose?
Tesla's occupancy network predicts a 3D voxel grid of semantic labels from multi-camera inputs. How does this differ from traditional 2D image segmentation, and what are the key architectural components needed to lift 2D features into a 3D segmentation volume?
Explain the difference between semantic segmentation and instance segmentation. Given an image with five overlapping cups on a table, describe what each method would output.
Video Understanding and Temporal Models
Video understanding questions test your grasp of temporal modeling, but interviewers are really evaluating whether you can handle the computational and memory challenges that come with adding the time dimension to vision problems. The jump from processing single frames to video sequences introduces new constraints around memory usage, temporal consistency, and variable-length input handling that separate strong candidates from average ones.
What trips up most candidates is not recognizing that different temporal modeling approaches solve different problems: 3D convolutions capture short-term motions well but are computationally expensive, while Two-Stream networks can leverage pre-computed optical flow but add preprocessing overhead. Google and Meta particularly value candidates who can articulate these tradeoffs in the context of serving video models at scale.
When interviews shift to video, you are expected to reason about temporal modeling, action recognition, and efficient processing of sequential frames. This section is where many candidates falter because they have limited hands-on experience with 3D convolutions, optical flow, and video transformers, yet companies like Waymo, Google, and Meta consider these skills essential for production systems.
You are building an action recognition system at Waymo to classify pedestrian behaviors from dashcam video. Walk me through how you would decide between a Two-Stream network using optical flow versus a 3D convolutional approach like SlowFast, and what tradeoffs matter most for a real-time autonomous driving pipeline.
Sample Answer
Start by noting that Two-Stream networks require precomputing optical flow (e.g., via TV-L1 or RAFT), which adds a separate expensive preprocessing step that is hard to justify in a latency-sensitive driving pipeline. A 3D CNN like SlowFast avoids this by learning temporal features end-to-end, where the Slow pathway captures spatial semantics at low frame rate and the Fast pathway captures fine-grained motion at high frame rate. For real-time autonomous driving, you care about inference latency, so SlowFast is generally preferred because it processes raw frames directly without the optical flow bottleneck. However, if you have offline analysis tasks where accuracy matters more than speed, Two-Stream networks with precomputed flow can still be competitive because explicit motion representations give a strong inductive bias for motion-heavy classes like "pedestrian stepping off curb."
A Meta engineer asks you: we have a video recommendation system that needs to extract semantic embeddings from user-uploaded clips of varying lengths (3 seconds to 10 minutes). How would you design the temporal modeling component to handle this variable-length input efficiently at scale?
Explain the core idea behind temporal shift modules (TSM) and why Google might prefer them over full 3D convolutions for deploying video classification on edge devices.
Tesla's Autopilot team asks you to design a module that fuses information across 8 camera views over a temporal window of 2 seconds for predicting whether a vehicle in an adjacent lane is about to cut in. How would you structure the spatiotemporal attention mechanism, and where would you place it in the perception stack?
You are given a pretrained image-based Vision Transformer (ViT) and asked to adapt it for short video clip classification without training from scratch. Describe at least two strategies for injecting temporal reasoning into the existing architecture.
Multimodal and Vision-Language Models
Multimodal and vision-language questions have become critical at companies building AI assistants, search engines, and content understanding systems. These questions test your understanding of how to bridge the semantic gap between visual and textual representations, but interviewers are especially interested in your ability to handle the practical challenges of training and deploying models that need to understand both modalities simultaneously.
The sophistication expected here has increased dramatically with the success of models like CLIP and GPT-4V. Candidates need to understand not just how contrastive learning aligns image and text embeddings, but also how to diagnose and fix failure modes like cultural bias or domain shift that become amplified in zero-shot scenarios where the model encounters concepts it has never seen during training.
With the rise of models like CLIP, Flamingo, and GPT-4V, interviewers increasingly ask you to explain how visual and textual representations are aligned and what makes contrastive learning effective at scale. You should be prepared to discuss zero-shot transfer, prompt engineering for vision tasks, and the trade-offs of different fusion strategies, as these questions are now standard at Google, Meta, and Microsoft.
Explain how CLIP aligns image and text representations during training. If you were asked to adapt CLIP for a domain-specific zero-shot classification task (e.g., classifying manufacturing defects), what changes would you consider and why?
Sample Answer
This question is checking whether you can articulate contrastive learning mechanics and reason about domain adaptation trade-offs. CLIP trains by encoding images and text into a shared embedding space and optimizing a symmetric contrastive loss over a batch of image-text pairs, maximizing cosine similarity for matched pairs and minimizing it for unmatched ones. The image-to-text direction is $$\mathcal{L}_{i \to t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(v_i, t_j)/\tau)}$$ and the full loss averages this with the analogous text-to-image term. For a domain-specific task like defect classification, you would consider fine-tuning on domain-specific image-text pairs, crafting better text prompts that include domain terminology, or using a lightweight adapter on top of frozen CLIP features. The key insight is that CLIP's zero-shot ability degrades on distributions far from its web-scraped pretraining data, so prompt engineering and small-scale fine-tuning are your primary levers.
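A minimal sketch of that symmetric loss; the random embeddings and temperature below are placeholders standing in for CLIP's image and text encoders:

```python
# Symmetric contrastive (InfoNCE) loss over a batch of image-text pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logits = image_features @ text_features.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings for a batch of 8 pairs, 512-d each.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```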
You are designing a vision-language model that needs to answer questions about images. Walk me through the trade-offs between early fusion (combining image and text features before a shared transformer) and late fusion (processing them separately then combining at the output), and when you would pick one over the other.
Suppose you deploy a CLIP-based zero-shot classifier for content moderation at scale, and you notice it performs well on common categories but fails on nuanced or culturally specific content. How would you diagnose and address this failure mode?
In models like Flamingo, visual tokens are injected into a frozen large language model via gated cross-attention layers. Why is gating important here, and what would happen if you removed it and used standard cross-attention instead?
What is prompt ensembling in the context of CLIP's zero-shot inference, and why does it typically improve classification accuracy compared to using a single text prompt per class?
Training, Deployment, and Real-World CV Systems
Training and deployment questions separate candidates who have only worked in research settings from those ready to build production vision systems. Companies like Amazon and Nvidia want to hear about your experience with model optimization techniques like quantization and pruning, but they're more interested in your systematic approach to diagnosing and solving real-world problems like distribution shift and hardware constraints.
The most revealing aspect of these questions is how you handle the inevitable tradeoffs between model accuracy and deployment constraints. When your YOLOv8 model exceeds the latency budget on edge hardware, there's no single right answer, but your approach to systematically reducing inference time while measuring the accuracy impact shows whether you can ship models that work in the real world, not just on benchmarks.
Knowing model architectures is only half the battle: interviewers at Nvidia, Tesla, and Amazon will press you on data augmentation strategies, handling distribution shift, model compression, and latency optimization for edge deployment. This section covers the practical engineering challenges that separate candidates who have shipped CV systems from those who have only trained models in notebooks.
You are deploying a real-time object detection model on an Nvidia Jetson edge device for a warehouse robotics application, but your YOLOv8 model exceeds the latency budget of 30ms per frame. Walk me through the specific steps you would take to reduce inference time while preserving acceptable accuracy.
Sample Answer
The standard move is to apply post-training quantization from FP32 to INT8 using TensorRT, which alone can cut latency by 2 to 4x on Jetson hardware. But here, you should also consider structured channel pruning before quantization, because removing entire convolutional filters reduces both compute and memory bandwidth, and the two techniques compound. Export the model to ONNX, run TensorRT's INT8 calibration with a representative dataset of roughly 500 to 1000 images from your warehouse domain, and benchmark layer by layer to find bottlenecks. If you are still over budget, reduce input resolution (e.g., from 640x640 to 416x416) or swap in a lighter backbone such as a MobileNet-based variant, since latency scales roughly with $\text{resolution}^2 \times \text{channels}$. Always validate on your tail-case classes (small or occluded items) after each compression step, because accuracy degradation is rarely uniform across categories.
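A hedged sketch of the export step, using a small torchvision backbone as a stand-in for the real detector; the file names and trtexec flags shown are indicative, not a complete calibration recipe:

```python
# Export a PyTorch model to ONNX as the first step of the TensorRT INT8 path.
import torch
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None).eval()  # stand-in for the detector
dummy = torch.randn(1, 3, 640, 640)  # batch of 1 at the current input resolution

torch.onnx.export(
    model, dummy, "detector.onnx",
    opset_version=13,
    input_names=["images"], output_names=["predictions"],
)

# On the Jetson, build an INT8 TensorRT engine from the ONNX graph, for example:
#   trtexec --onnx=detector.onnx --int8 --saveEngine=detector_int8.engine
# Calibrate with ~500-1000 representative warehouse images so the INT8 ranges match
# the deployment distribution, then re-benchmark and re-validate tail-case classes.
```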
Your team at Amazon has trained a product defect detection model that performs well on validation data, but after deployment to a new fulfillment center, precision drops significantly. How do you diagnose and address this distribution shift?
A colleague suggests using heavy data augmentation, including random erasing, CutMix, and aggressive color jitter, for training a medical imaging classifier on a dataset of only 5,000 labeled X-rays. Is this a good strategy, and what would you recommend instead or in addition?
You are building Tesla's occupancy network and need to serve predictions at 10 FPS across 8 camera feeds simultaneously on the vehicle's onboard compute. How would you design the multi-camera inference pipeline to meet this throughput requirement, and what tradeoffs would you make?
Explain how you would set up an automated data quality pipeline for a large-scale image classification system that ingests hundreds of thousands of new training images per week.
How to Prepare for Computer Vision Interviews
Practice Architecture Justification on Real Constraints
Don't just memorize model architectures. Set up scenarios with specific constraints (5ms latency, 2GB memory, 95% accuracy target) and practice justifying your architectural choices. Time yourself explaining why you'd pick MobileNet over ResNet for edge deployment, including specific numbers about FLOPs and parameter counts.
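For example, a quick parameter-count comparison you can run with torchvision (FLOP counting requires an additional profiling tool, which is omitted here):

```python
# Compare parameter counts of two candidate backbones for an edge deployment.
import torchvision

def count_params(model):
    return sum(p.numel() for p in model.parameters())

resnet = torchvision.models.resnet50(weights=None)
mobilenet = torchvision.models.mobilenet_v2(weights=None)

print(f"ResNet-50 params:   {count_params(resnet) / 1e6:.1f}M")    # ~25.6M
print(f"MobileNetV2 params: {count_params(mobilenet) / 1e6:.1f}M")  # ~3.5M
```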
Build Mental Models for Scale and Performance Tradeoffs
Create a personal reference sheet mapping common CV architectures to their typical inference times, memory usage, and accuracy ranges on standard datasets. When an interviewer mentions deploying on a Jetson device or processing 1M images per day, you should immediately know which architectures are viable candidates.
Master the Art of Failure Mode Analysis
Practice diagnosing why models fail in production by working through specific scenarios: distribution shift, class imbalance, annotation errors, hardware limitations. For each failure mode, have a concrete debugging process and multiple potential solutions ready to discuss.
Connect Theory to Implementation Details
When you explain concepts like anchor generation or attention mechanisms, immediately follow with implementation considerations: memory layout, computational complexity, and numerical stability. Interviewers want to see that you can bridge from research papers to working code.
Prepare Real-World Deployment Stories
Have 2-3 detailed stories ready about computer vision projects where you had to handle real constraints: model compression for mobile deployment, handling edge cases in production, or optimizing inference pipelines. Include specific metrics and the business impact of your technical decisions.
Frequently Asked Questions
How deep does my Computer Vision knowledge need to be for an ML Engineer interview?
You should understand foundational concepts like convolutional neural networks, image classification, object detection, and segmentation at both a theoretical and practical level. Interviewers often expect you to explain architectural choices (e.g., why use ResNet over VGG, how feature pyramid networks work) and discuss trade-offs in real systems such as latency vs. accuracy. Be prepared to go deep on topics like data augmentation strategies, transfer learning, and evaluation metrics such as mAP and IoU.
Which companies ask the most Computer Vision questions during interviews?
Companies with heavy visual data pipelines tend to ask the most CV questions. Think Tesla (Autopilot), Apple (Face ID, ARKit), Meta (content understanding), Google (Google Lens, Photos), Amazon (Go stores, visual search), and numerous robotics startups. Defense and healthcare imaging companies like Palantir and Viz.ai also lean heavily on CV expertise. If a company's core product relies on processing images or video, expect CV to dominate your technical rounds.
Will I need to write code during a Computer Vision interview?
Yes, coding is almost always required for ML Engineer roles. You may be asked to implement parts of a CV pipeline from scratch, such as non-max suppression, an image preprocessing module, or a custom loss function in PyTorch or TensorFlow. Some interviews also include general algorithm and data structure problems, so keep your coding skills sharp. You can practice relevant problems at datainterview.com/coding to build confidence.
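For instance, a from-scratch non-max suppression along these lines is a common warm-up exercise; the (x1, y1, x2, y2) box format and IoU threshold below are assumptions:

```python
# Greedy non-maximum suppression over axis-aligned boxes.
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float = 0.5):
    """Return indices of kept boxes, highest-scoring first."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the current top box with all remaining boxes.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap the kept box too much
    return torch.tensor(keep, dtype=torch.long)
```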
How do Computer Vision interviews differ for ML Engineers compared to other roles?
As an ML Engineer, your interviews will emphasize building and deploying CV models in production, not just research. Expect questions on model optimization (quantization, pruning, ONNX export), serving infrastructure, and handling real-world data challenges like class imbalance and domain shift. Compared to a Research Scientist role, you will face more system design and coding rounds and fewer questions about novel architectures or publishing papers.
How can I prepare for Computer Vision interviews if I have no real-world experience?
Start by completing hands-on projects using public datasets like COCO, Pascal VOC, or ImageNet subsets. Train and fine-tune models for tasks such as object detection or semantic segmentation, and document your results thoroughly. Contributing to open-source CV libraries or reproducing key papers also demonstrates practical skill. Review common interview questions at datainterview.com/questions to identify gaps in your knowledge and practice explaining your project decisions clearly.
What are the most common mistakes candidates make in Computer Vision interviews?
One major mistake is memorizing model architectures without understanding why specific design choices were made. Interviewers will probe your reasoning, not just your recall. Another common error is ignoring the data side of CV: failing to discuss data collection, labeling quality, augmentation, and bias. Finally, many candidates neglect to connect their answers to production realities like inference speed, model size, and edge deployment constraints, which are critical for ML Engineer roles.

