Duolingo Machine Learning Engineer Interview Guide

Dan Lee, Data & AI Lead
Last update: February 24, 2026
Duolingo Machine Learning Engineer Interview

Duolingo Machine Learning Engineer at a Glance

Interview Rounds

7 rounds

Difficulty

Python · EdTech · Language Learning · Personalization · MLOps

From what candidates tell us, Duolingo's coding bar is the part that catches ML specialists off guard. The role demands expert-level software engineering alongside expert-level ML, and if your algorithm skills are rusty, strong modeling chops alone won't carry you through.

Duolingo Machine Learning Engineer Role

Primary Focus

EdTech · Language Learning · Personalization · MLOps

Skill Profile

Math & Stats · Software Eng · Data & SQL · Machine Learning · Applied AI · Infra & Cloud · Business · Viz & Comms

Math & Stats

High

Strong understanding of statistical analysis, probability, and their application in machine learning models, including probabilistic models (e.g., BG/NBD, Gamma-Gamma) and different statistical approaches (Frequentist vs. Bayesian).

Software Eng

Expert

Exceptional proficiency in programming, data structures, algorithms, system design, and software development best practices. This includes extensive experience with coding challenges, pair programming, code reviews, and complexity analysis, with a focus on areas like backtracking, dynamic programming, and string manipulation.

Data & SQL

Medium

Experience in designing and optimizing data pipelines for machine learning models, ensuring efficient data flow and processing.

Machine Learning

Expert

Deep expertise in designing, implementing, and optimizing various machine learning models. This includes a solid understanding of ML principles, model evaluation (e.g., AUC), dimensionality reduction, and different learning paradigms (supervised, unsupervised, reinforcement learning).

Applied AI

Low

The sources we reviewed don't call out applied AI as a primary requirement for this role, but given Duolingo's domain, a general awareness of modern AI trends, including NLP advances and generative models, is likely beneficial. Treat the low rating as a conservative estimate reflecting that lack of explicit mention.

Infra & Cloud

Medium

Understanding of system design principles and the ability to integrate machine learning models into production systems. Specific cloud or MLOps platform expertise is not explicitly detailed but implied for deployment and scalability.

Business

Medium

Ability to collaborate effectively with cross-functional teams, a strong focus on improving user experiences, and a keen interest in educational technology and language learning.

Viz & Comms

Medium

Strong communication skills, including the ability to explain technical reasoning, discuss trade-offs, and present project work effectively. While data visualization is not explicitly mentioned, it is generally an expected component of communicating data insights.

What You Need

  • Machine Learning Model Design & Implementation
  • Data Structures & Algorithms
  • Statistical Analysis
  • Machine Learning Principles
  • Data Pipeline Optimization
  • System Design
  • Algorithmic Problem Solving
  • Collaborative Coding & Code Review
  • Problem Solving
  • Cross-functional Collaboration
  • Model Evaluation (e.g., AUC)
  • Dimensionality Reduction
  • Supervised, Unsupervised, and Reinforcement Learning
  • Probabilistic Models (e.g., BG/NBD, Gamma-Gamma)
  • Complexity Analysis

Nice to Have

  • Educational Technology Familiarity
  • Interest in Language Learning
  • User Experience Focus
  • Collaborative Spirit
  • Innovation

Languages

Python

Tools & Technologies

TensorFlow · PyTorch · ML Libraries (e.g., scikit-learn, XGBoost, LightGBM)


Birdbrain, Duolingo's ML-powered lesson sequencing system, shows up in your first week and never leaves your screen. You'll work on the models that decide when learners see vocabulary again, how exercise difficulty gets calibrated, and which content surfaces next. ML engineers here own the full loop: feature engineering, model training, production deployment, and experiment analysis. Nobody hands off code to a separate platform team.

A Typical Week

A Week in the Life of a Duolingo Machine Learning Engineer

Typical L5 workweek · Duolingo

Weekly time split

Coding 28% · Meetings 18% · Analysis 12% · Writing 12% · Research 10% · Infrastructure 10% · Break 10%

Culture notes

  • Duolingo runs at a fast but sustainable pace — the 'ship it' and 'test it first' values mean you're constantly iterating through experiments, but the Pittsburgh HQ culture is genuinely not a burnout shop and most people wrap up by 6 PM.
  • Duolingo requires in-office work at their Pittsburgh headquarters most days, with a hybrid policy that allows some flexibility, and the office itself is colorful and well-stocked in a way that makes being there easy.

The coding-heavy split is what makes this role feel more like a software engineering job than most ML positions. You're not spending your days in notebooks and slide decks: you might be debugging a flaky training pipeline on Tuesday, writing PyTorch feature transformations on Wednesday, and reviewing a teammate's PR on Thursday. Design docs and experiment plans also claim a real chunk of the week, because Duolingo's "test it first" culture means you write up latency benchmarks and success metrics before anyone greenlights a launch.

Projects & Impact Areas

Birdbrain's spaced repetition models decide when to resurface vocabulary and how to score exercise difficulty differently across language families (character-based vs. Romance languages, for example, require very different calibration). Newer product lines like Duolingo Math and Music are where greenfield ML work lives, since those surfaces need recommendation and sequencing approaches built from scratch rather than inherited from the language learning stack.

Skills & What's Expected

Software engineering is rated expert-level, not "nice to have," and that's what filters people out. ML depth is also rated expert, but most candidates already prep for that. The surprise is the algorithms bar. Meanwhile, GenAI knowledge is rated low. The skill profile emphasizes probabilistic models (BG/NBD, Gamma-Gamma), Bayesian vs. Frequentist reasoning, and classical ML paradigms like supervised, unsupervised, and reinforcement learning. Your comfort with stats and probability matters far more here than your opinions on the latest foundation model.

Levels & Career Growth

Growth at Duolingo tends to come from owning a new product surface end-to-end rather than managing people. From what we can tell, the jump between levels hinges on driving cross-functional alignment with curriculum and product teams who think in pedagogical terms, not model metrics. That soft skill is harder to develop than any technical gap.

Work Culture

Based on employee accounts, most people work from Duolingo's Pittsburgh office on a regular basis, though the exact policy isn't publicly documented. The pace is fast but not a burnout shop. The company offers a two-week winter break, and the day-to-day rhythm described by engineers suggests people wrap up at a reasonable hour. Duolingo's published operating principles ("test it first," "reduce complexity," "bias toward action") aren't just wall art; they shape how experiment launches get approved and how design docs get reviewed.

Duolingo Machine Learning Engineer Compensation

Duolingo's RSUs follow a four-year vesting schedule at roughly 25% per year, though the source data doesn't specify whether that includes a one-year cliff or quarterly vesting from day one. Ask your recruiter to clarify the exact vesting mechanics before you sign, because that distinction determines whether you're waiting 12 months for your first equity payout or receiving it much sooner.
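To see why the mechanics matter, here's a quick sketch with hypothetical numbers (a $200K grant vesting linearly over 48 months) comparing a one-year cliff to quarterly vesting from day one:

```python
def vested(grant: float, months: int, cliff_months: int = 0, period_months: int = 1) -> float:
    """Value vested after `months` on a linear 48-month schedule.

    Nothing vests before the cliff; after that, vesting accrues in
    period-sized tranches (monthly by default, quarterly if period_months=3).
    """
    total_months = 48
    if months < cliff_months:
        return 0.0
    tranches = (months // period_months) * period_months
    return grant * min(tranches, total_months) / total_months


grant = 200_000  # hypothetical total RSU grant value

# One-year cliff: nothing for 11 months, then 25% lands at month 12.
print(vested(grant, 11, cliff_months=12))   # 0.0
print(vested(grant, 12, cliff_months=12))   # 50000.0

# Quarterly from day one: the first tranche arrives after just 3 months.
print(vested(grant, 3, period_months=3))    # 12500.0
```

Either way the four-year total is identical; the cliff only changes when the first dollar arrives, which is exactly the detail worth confirming with the recruiter.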

Base salary, RSU grant size, and sign-on bonus are all negotiable components at Duolingo, from what candidates report. Your strongest play is bringing a competing offer to the table, then focusing the conversation on the RSU grant or sign-on rather than trying to stretch all three simultaneously. Duolingo's ML team is small enough that each hire fills a visible gap in their adaptive learning or NLP pipelines, which gives you more leverage than you might expect from a company of roughly 800 people.

Duolingo Machine Learning Engineer Interview Process

7 rounds · ~4 weeks end to end

Initial Screen

1 round

Recruiter Screen

60m · Phone

This initial conversation with a recruiter will cover your background, experience, and career aspirations. You'll discuss your interest in Duolingo, the specific Machine Learning Engineer role, and general fit with the company culture. Expect to briefly touch upon your technical skills and availability.

behavioral · general

Tips for this round

  • Clearly articulate your relevant experience and how it aligns with Duolingo's mission and the MLE role.
  • Research Duolingo's products, recent news, and values to demonstrate genuine interest.
  • Be prepared to discuss your salary expectations and visa sponsorship needs (if applicable).
  • Have a concise 'elevator pitch' ready for your professional background and why you're a good fit.
  • Prepare a few questions to ask the recruiter about the role, team, or interview process.

Technical Assessment

2 rounds

Coding & Algorithms

60m · Video Call

You'll participate in a technical video interview focusing on fundamental data structures and algorithms. The interviewer will present a coding problem, and you'll be expected to write efficient, correct code while explaining your thought process. Proficiency in Python or Java is generally preferred.

algorithms · data_structures · engineering

Tips for this round

  • Practice standard algorithm problems, focusing on common data structures like arrays, linked lists, trees, and graphs.
  • Work on optimizing your solutions for both time and space complexity.
  • Clearly communicate your approach, assumptions, and edge cases before and during coding.
  • Be comfortable with Python or Java syntax and standard library functions.
  • Test your code thoroughly with various inputs, including edge cases, to catch potential bugs.

Onsite

4 rounds

System Design

60m · Video Call

You'll be challenged to design a machine learning system from scratch, addressing various components from data ingestion and model training to deployment and monitoring. This round assesses your ability to think at a high level about scalable, robust, and production-ready ML solutions. No coding is required, but a deep understanding of ML lifecycle is essential.

ml_system_design · system_design · ml_operations

Tips for this round

  • Clearly define the problem statement, scope, and key metrics for success at the outset.
  • Discuss data sources, feature engineering, model selection, training strategies, and evaluation metrics.
  • Consider aspects like scalability, latency, reliability, and potential failure points in your design.
  • Address MLOps considerations such as model versioning, deployment strategies (e.g., A/B testing), and monitoring.
  • Be prepared to justify your design choices and discuss trade-offs for different components.

Tips to Stand Out

  • Master Python or Java. While some flexibility exists, being highly proficient in Python or Java for coding and pair programming rounds is crucial, as these are the primary languages used.
  • Focus on practical application. Duolingo emphasizes object-oriented programming, data structure implementation, and working within existing codebases over pure algorithmic grinding. Practice solving problems that involve building features or refactoring code.
  • Understand product impact. Duolingo values engineers who can connect technical solutions to user experience. Be prepared to discuss how your ML models and backend systems directly influence user-facing features and product metrics.
  • Prepare for collaborative coding. The pair programming round is significant. Practice communicating your thought process clearly, asking clarifying questions, and actively collaborating with an interviewer.
  • Solidify ML fundamentals and system design. For an MLE role, deep knowledge of machine learning algorithms, model evaluation, and the ability to design scalable ML systems are non-negotiable. Review MLOps concepts.
  • Demonstrate data-driven thinking. Duolingo is a data-driven company. Show how you use data to inform your decisions, evaluate experiments, and measure the success of your ML models.
  • Test your environment. For virtual technical rounds, ensure your development environment, internet connection, and screen-sharing tools are fully functional to avoid technical delays.

Common Reasons Candidates Don't Pass

  • Weak coding fundamentals. Candidates often struggle with writing clean, efficient, and bug-free code, or lack a solid grasp of core data structures and algorithms.
  • Poor communication during technical rounds. Inability to articulate thought processes, ask clarifying questions, or collaborate effectively during coding and design sessions is a significant red flag.
  • Lack of ML depth or practical experience. For an MLE role, insufficient understanding of machine learning principles, model lifecycle, or inability to apply ML concepts to real-world problems can lead to rejection.
  • Inability to connect tech to product. Failing to demonstrate how technical solutions, especially ML models, impact user experience and business metrics shows a lack of product sense valued by Duolingo.
  • Limited system design capabilities. Struggling to design scalable and robust ML systems, considering aspects like data pipelines, deployment, and monitoring, indicates a gap in senior-level readiness.
  • Cultural misalignment. Not demonstrating a collaborative spirit, passion for education, or alignment with Duolingo's values can result in a poor fit assessment.

Offer & Negotiation

Duolingo's compensation packages for Machine Learning Engineers typically include a competitive base salary, Restricted Stock Units (RSUs), and potentially a sign-on bonus. RSUs usually vest over a four-year period, with a common schedule of 25% per year. When negotiating, focus on your total compensation package, leveraging any competing offers you may have. Base salary, RSU grants, and sign-on bonuses are generally negotiable components. Be prepared to articulate your value and market worth based on your experience and the specific skills you bring to the role.

The most common reason candidates get rejected is weak coding fundamentals. Duolingo's second coding round has you working inside an existing codebase, collaborating with an interviewer on feature integration and debugging in Python or Java. That's a different muscle than solving isolated algorithm puzzles, and ML specialists who live in notebooks often struggle with it.

The two rounds labeled "behavioral" are misleading. One is actually a code review session where you evaluate someone else's code for bugs, maintainability, and engineering best practices. Only the final round covers traditional behavioral territory, probing how you've shipped ML products that connect to real user outcomes. From what candidates report, Duolingo's hiring committee cares whether you can tie model improvements back to learner efficacy (think lesson completion, retention curves across their 40+ language courses), not just offline metrics.

Duolingo Machine Learning Engineer Interview Questions

Algorithms & Coding

Expect questions that force you to write clean, bug-free Python under time pressure while explaining complexity trade-offs. Candidates often stumble by over-optimizing too early instead of nailing correct edge-case handling first.

Duolingo logs a user’s lesson outcomes as a string of '1' (correct) and '0' (wrong) in chronological order; return the length of the longest contiguous streak where the user has at most $k$ wrong answers. Implement in $O(n)$ time.

Easy · Sliding Window

Sample Answer

Most candidates default to checking every substring, but that fails here because it is $O(n^2)$ and times out on long user histories. Use a sliding window with two pointers and a running count of zeros. Expand the right pointer, shrink from the left while zeros exceed $k$, and track the maximum window length. Edge cases: $k=0$, empty string, and all zeros.


def longest_streak_with_k_wrongs(outcomes: str, k: int) -> int:
    """Return the max length of a contiguous window with at most k '0's.

    Args:
        outcomes: String of '1' and '0' in chronological order.
        k: Maximum number of wrong answers allowed in the window.

    Returns:
        Length of the longest valid window.
    """
    if k < 0:
        return 0
    n = len(outcomes)
    left = 0
    zeros = 0
    best = 0

    for right in range(n):
        if outcomes[right] == '0':
            zeros += 1

        while zeros > k and left <= right:
            if outcomes[left] == '0':
                zeros -= 1
            left += 1

        best = max(best, right - left + 1)

    return best


if __name__ == "__main__":
    assert longest_streak_with_k_wrongs("", 2) == 0
    assert longest_streak_with_k_wrongs("111", 0) == 3
    assert longest_streak_with_k_wrongs("101001", 1) == 3  # "101"
    assert longest_streak_with_k_wrongs("000", 2) == 2
Practice more Algorithms & Coding questions

Machine Learning & Modeling

Most candidates underestimate how much model selection depends on objective/metric alignment for learning outcomes (retention, mastery, engagement). You’ll be pushed to justify features, evaluation (e.g., AUC vs calibration), and failure modes for personalization.

You trained a model to predict whether a learner will answer the next exercise correctly, but after launch you see AUC is unchanged while the predicted probabilities are consistently too high for all users. What metric and modeling change do you make to fix this, and why does AUC not catch the issue?

Easy · Evaluation and Calibration

Sample Answer

Use a calibration-focused metric (log loss or Brier score, plus calibration curves like ECE) and calibrate the model with Platt scaling or isotonic regression. AUC only measures ranking, so it can stay flat even when every predicted probability is shifted upward. In Duolingo personalization, overconfident $p(\text{correct})$ breaks downstream decisions like difficulty selection and spaced repetition because thresholds and expected value calculations depend on calibrated probabilities, not just ordering.
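A minimal numpy sketch of that failure mode, using synthetic data and hypothetical function names: a strictly monotone push of probabilities toward 1 leaves AUC untouched while the Brier score degrades, which is exactly why you need a calibration-focused metric.

```python
import numpy as np


def brier(y: np.ndarray, p: np.ndarray) -> float:
    """Mean squared error between labels and predicted probabilities."""
    return float(np.mean((p - y) ** 2))


def auc(y: np.ndarray, s: np.ndarray) -> float:
    """Pairwise AUC: P(random positive outscores random negative); ties count 1/2."""
    pos, neg = s[y == 1], s[y == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return float(gt + 0.5 * eq)


rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.9, size=2000)          # true per-exercise P(correct)
y = (rng.uniform(size=2000) < p_true).astype(int)  # simulated outcomes

# Strictly monotone transform: ranking (and therefore AUC) is unchanged,
# but every probability is pushed upward -- systematic overconfidence.
p_over = np.sqrt(p_true)

assert abs(auc(y, p_true) - auc(y, p_over)) < 1e-6  # AUC is blind to the shift
assert brier(y, p_over) > brier(y, p_true)          # the calibration metric catches it
```

Platt scaling or isotonic regression would then learn the inverse of that monotone distortion from a held-out set, restoring calibrated probabilities without touching the ranking.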

Practice more Machine Learning & Modeling questions

Statistics & Probabilistic Modeling

Your ability to reason about uncertainty shows up in questions on probability, inference, and user-level heterogeneity (e.g., BG/NBD or Gamma-Gamma style thinking). Interviewers look for disciplined assumptions, not just formula recall.

You are modeling user practice activity with BG/NBD using (recency $r$, frequency $f$, age $T$) from Duolingo lessons. How would you decide between BG/NBD and a simple survival model for churn, and what diagnostic would you run to catch obvious misfit?

Medium · User-level Probabilistic Models

Sample Answer

BG/NBD wins here because it is built for intermittent, noncontractual repeat events and directly predicts future event counts from $(r,f,T)$ while capturing user-level heterogeneity. A survival model wins if the product question is strictly time-to-churn and you have strong time-varying covariates that matter more than event counts. For diagnostics, run calibration checks: compare predicted vs. empirical holdout counts by decile, and inspect whether high-$f$ users are systematically underpredicted. That last check is where most candidates fail.
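This is not the BG/NBD likelihood itself, just a sketch of the decile diagnostic, with synthetic per-user rates standing in for model predictions (names are illustrative):

```python
import numpy as np


def decile_table(predicted: np.ndarray, actual: np.ndarray, n_bins: int = 10):
    """Mean predicted vs. mean empirical holdout count within prediction deciles."""
    order = np.argsort(predicted)
    return [
        (float(predicted[idx].mean()), float(actual[idx].mean()))
        for idx in np.array_split(order, n_bins)  # lowest to highest decile
    ]


rng = np.random.default_rng(7)
rates = rng.gamma(2.0, 1.0, size=5000)  # heterogeneous per-user practice rates
predicted = rates                        # a well-specified model recovers the rate
actual = rng.poisson(rates)              # observed holdout event counts

for pred_mean, act_mean in decile_table(predicted, actual):
    print(f"pred={pred_mean:5.2f}  actual={act_mean:5.2f}")
# A well-calibrated model tracks closely in every decile; a systematic gap in the
# top deciles (heavy users underpredicted) is the classic misfit to look for.
```

With a real fitted model, `predicted` would be the expected holdout count per user and `actual` the observed count in the holdout window.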

Practice more Statistics & Probabilistic Modeling questions

ML System Design & MLOps

The bar here isn’t whether you can name components, it’s whether you can design an end-to-end personalization system that’s reliable in production. You’ll need to cover data/feature freshness, online vs batch scoring, monitoring, and safe rollout.

You are launching a new personalized "Next Lesson" ranker that selects the next skill for a learner. Design the offline-to-online pipeline so features are consistent and fresh, and name 3 monitors that would catch silent failures within 1 hour.

Easy · Feature Store and Monitoring

Sample Answer

Reason through it: Start by defining the prediction moment (lesson end) and freeze the feature schema tied to that timestamp so training and serving read the same definitions. Use a daily batch job to build training examples with point-in-time correct features, and an online feature layer that computes fast-changing signals (recent mistakes, streak, session context) while slower signals (historical mastery, long-term engagement) come from a cached store updated hourly or daily. Put a single source of truth for feature transforms in shared code, then validate parity by logging a sample of online feature vectors and recomputing them offline. Monitors: feature null rate and distribution drift per key feature, training-serving skew checks on logged feature hashes, and a business proxy like completion rate or time-to-next-session dropping sharply post-deploy.
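One way to sketch that parity check (function names and hashing scheme are illustrative, not Duolingo's actual implementation): log a stable hash of each served feature vector, then recompute it offline and compare.

```python
import hashlib
import json


def feature_hash(features: dict) -> str:
    """Stable, order-insensitive hash of a feature vector.

    Log this alongside each online prediction; the offline job recomputes
    point-in-time features and compares hashes to surface training-serving skew.
    Rounding tolerates benign float jitter between the two paths.
    """
    canonical = json.dumps(
        {k: round(float(v), 6) for k, v in features.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


online = {"streak": 12, "recent_mistakes": 3, "mastery": 0.81234567}
offline = {"mastery": 0.81234572, "recent_mistakes": 3, "streak": 12}  # tiny float drift

assert feature_hash(online) == feature_hash(offline)                   # parity holds
assert feature_hash(online) != feature_hash({**online, "streak": 13})  # skew is caught
```

In practice you'd sample a small fraction of requests for this check and alert on the mismatch rate, which doubles as one of the three monitors the answer asks for.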

Practice more ML System Design & MLOps questions

ML Coding (Model Implementation)

Rather than abstract theory, you’ll be asked to implement or modify core ML routines (training loop, evaluation, feature handling) with correctness and efficiency. Common pitfalls include data leakage, wrong metric computation, and sloppy train/val splitting.

Implement a PyTorch binary classifier to predict whether a Duolingo learner will answer the next exercise correctly, given dense features and a binary label, and report validation AUC with an early stopping criterion on AUC.

Easy · Training Loop and Metric Implementation

Sample Answer

This question checks whether you can write a correct, leak-free training loop and compute AUC properly. You are being graded on details: a deterministic split, switching between train and eval modes, no gradients during evaluation, and an AUC computed from predicted probabilities rather than hard labels. This is where most people fail: they accidentally compute accuracy, or they threshold predictions into hard labels before computing AUC. Keep it clean, keep it testable.

import math
import random
from dataclasses import dataclass
from typing import Tuple, Dict, Any, Optional

import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def train_val_split(
    X: np.ndarray,
    y: np.ndarray,
    val_frac: float = 0.2,
    seed: int = 42
) -> Tuple[Tuple[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray]]:
    """Deterministic split. No shuffling inside DataLoader for validation."""
    assert X.ndim == 2
    assert y.ndim == 1
    assert len(X) == len(y)
    n = len(X)
    idx = np.arange(n)
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)
    n_val = int(round(n * val_frac))
    val_idx = idx[:n_val]
    tr_idx = idx[n_val:]
    return (X[tr_idx], y[tr_idx]), (X[val_idx], y[val_idx])


def binary_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Compute ROC AUC from scratch using rank statistics.

    Handles ties by assigning average ranks.
    Returns 0.5 if AUC is undefined (all positives or all negatives).
    """
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score).astype(float)
    assert y_true.shape == y_score.shape

    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    if n_pos == 0 or n_neg == 0:
        return 0.5

    order = np.argsort(y_score)
    scores_sorted = y_score[order]
    y_sorted = y_true[order]

    ranks = np.empty_like(scores_sorted, dtype=float)
    i = 0
    rank = 1
    while i < len(scores_sorted):
        j = i
        while j + 1 < len(scores_sorted) and scores_sorted[j + 1] == scores_sorted[i]:
            j += 1
        avg_rank = (rank + (rank + (j - i))) / 2.0
        ranks[i:j + 1] = avg_rank
        rank += (j - i + 1)
        i = j + 1

    sum_ranks_pos = ranks[y_sorted == 1].sum()
    auc = (sum_ranks_pos - (n_pos * (n_pos + 1) / 2.0)) / (n_pos * n_neg)
    return float(auc)


class MLPBinaryClassifier(nn.Module):
    def __init__(self, d_in: int, hidden: int = 64, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # logits


@dataclass
class TrainConfig:
    batch_size: int = 256
    lr: float = 1e-3
    weight_decay: float = 1e-5
    epochs: int = 50
    patience: int = 5
    seed: int = 42
    device: str = "cpu"


def train_model_auc_early_stop(
    X: np.ndarray,
    y: np.ndarray,
    config: TrainConfig,
    val_frac: float = 0.2
) -> Dict[str, Any]:
    set_seed(config.seed)

    (X_tr, y_tr), (X_va, y_va) = train_val_split(X, y, val_frac=val_frac, seed=config.seed)

    X_tr_t = torch.tensor(X_tr, dtype=torch.float32)
    y_tr_t = torch.tensor(y_tr, dtype=torch.float32)
    X_va_t = torch.tensor(X_va, dtype=torch.float32)
    y_va_t = torch.tensor(y_va, dtype=torch.float32)

    tr_loader = DataLoader(TensorDataset(X_tr_t, y_tr_t), batch_size=config.batch_size, shuffle=True)
    va_loader = DataLoader(TensorDataset(X_va_t, y_va_t), batch_size=config.batch_size, shuffle=False)

    model = MLPBinaryClassifier(d_in=X.shape[1]).to(config.device)
    opt = torch.optim.AdamW(model.parameters(), lr=config.lr, weight_decay=config.weight_decay)
    loss_fn = nn.BCEWithLogitsLoss()

    best_auc = -math.inf
    best_state: Optional[Dict[str, torch.Tensor]] = None
    bad_epochs = 0

    for epoch in range(1, config.epochs + 1):
        model.train()
        total_loss = 0.0
        n_seen = 0
        for xb, yb in tr_loader:
            xb = xb.to(config.device)
            yb = yb.to(config.device)
            opt.zero_grad(set_to_none=True)
            logits = model(xb)
            loss = loss_fn(logits, yb)
            loss.backward()
            opt.step()
            bs = len(xb)
            total_loss += float(loss.item()) * bs
            n_seen += bs
        train_loss = total_loss / max(1, n_seen)

        model.eval()
        all_probs = []
        all_true = []
        with torch.no_grad():
            for xb, yb in va_loader:
                xb = xb.to(config.device)
                logits = model(xb)
                probs = torch.sigmoid(logits).cpu().numpy()
                all_probs.append(probs)
                all_true.append(yb.numpy())
        y_prob = np.concatenate(all_probs)
        y_true = np.concatenate(all_true).astype(int)
        val_auc = binary_auc(y_true, y_prob)

        if val_auc > best_auc + 1e-6:
            best_auc = val_auc
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= config.patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)

    return {
        "model": model,
        "best_val_auc": best_auc,
        "train_size": len(X_tr),
        "val_size": len(X_va)
    }


if __name__ == "__main__":
    # Demo with synthetic data (replace with Duolingo feature matrix and labels).
    set_seed(42)
    n, d = 5000, 20
    X = np.random.normal(size=(n, d)).astype(np.float32)
    w = np.random.normal(size=(d,)).astype(np.float32)
    logits = X @ w
    p = 1.0 / (1.0 + np.exp(-logits))
    y = (np.random.uniform(size=(n,)) < p).astype(np.int64)

    cfg = TrainConfig(device="cpu")
    out = train_model_auc_early_stop(X, y, cfg)
    print({"best_val_auc": out["best_val_auc"], "train_size": out["train_size"], "val_size": out["val_size"]})
Practice more ML Coding (Model Implementation) questions

Data Pipelines & Feature/Data Quality

In production personalization, small data issues become big model issues, so you must show you can reason about pipeline reliability. Focus on idempotency, backfills, schema changes, joins at user/item granularity, and keeping features consistent online/offline.

You log Duolingo lesson events (start, answer, complete) and build a daily feature table for a next-exercise ranking model: user_id, skill_id, rolling_7d_accuracy, rolling_7d_count. How do you make the pipeline idempotent and backfill-safe when late events arrive, without changing model features between offline training and online serving?

Easy · Idempotency and Backfills

Sample Answer

The standard move is to build features from raw immutable events using deterministic keys, event-time windows, and partition overwrite for the affected dates (recompute $D-7$ to $D$). But here, late events matter because training labels and features must stay time-consistent, so you also need an explicit feature timestamp (as-of time) and strict event-time filtering so you never leak post-prediction events into a past training row.
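A compact pandas sketch of that pattern, with a toy schema (column and function names are hypothetical): rebuild affected daily partitions from immutable raw events, with an explicit as-of time and strict event-time filtering.

```python
import pandas as pd


def rebuild_partitions(events: pd.DataFrame, dates: list) -> dict:
    """Recompute daily feature partitions from immutable raw events.

    Idempotent by construction: rerunning for the same dates produces identical
    partitions, so late events are handled by re-running the affected date range.
    The explicit feature_as_of column keeps training rows time-consistent.
    """
    out = {}
    for d in dates:
        as_of = pd.Timestamp(d) + pd.Timedelta(days=1)  # features as of end of day d
        window = events[
            (events["event_time"] < as_of)                       # never leak future events
            & (events["event_time"] >= as_of - pd.Timedelta(days=7))  # 7-day rolling window
        ]
        feats = (
            window.groupby(["user_id", "skill_id"])
            .agg(rolling_7d_count=("correct", "size"),
                 rolling_7d_accuracy=("correct", "mean"))
            .reset_index()
        )
        feats["feature_as_of"] = as_of
        out[d] = feats  # overwrite the partition for date d
    return out


events = pd.DataFrame({
    "user_id": [1, 1, 1],
    "skill_id": [10, 10, 10],
    "event_time": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 10:00", "2024-01-03 09:00"]),
    "correct": [1, 0, 1],
})

first = rebuild_partitions(events, ["2024-01-01"])
second = rebuild_partitions(events, ["2024-01-01"])  # rerun: identical output
assert first["2024-01-01"].equals(second["2024-01-01"])
```

When a late event for Jan 1 arrives, you append it to the raw event log and re-run `rebuild_partitions` for the dates whose 7-day windows it touches; online serving reads the same transform code, so features stay consistent.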

Practice more Data Pipelines & Feature/Data Quality questions

Behavioral & Product Collaboration

You’ll be evaluated on how you work with product, learning science, and design when goals conflict (accuracy vs motivation vs fairness). Strong answers are structured, specific, and show you can drive decisions with evidence while staying collaborative.

A PM wants to ship a new lesson ranking model because offline AUC is up, but learning scientists report more rage quits after mistakes. How do you drive the decision, including what evidence you demand and what you are willing to compromise on (accuracy, motivation, fairness)?

Easy · Cross-functional decision-making under metric conflict

Sample Answer

Get this wrong in production and you ship a model that optimizes AUC while hurting retention, trust, and long-run learning. The right call is to insist on an online readout tied to the product goal (e.g., day-1 retention, lesson completion, time-to-next-session) plus learning outcomes, segmented by proficiency and locale. You push for a holdout or staged rollout with explicit guardrails (quit-rate, error-streak abandonment) and a pre-agreed reversal plan. You align on a north star, then treat AUC as a diagnostic, not the decision metric.

Practice more Behavioral & Product Collaboration questions

The compounding challenge here isn't any single area. It's that Duolingo's loop pairs heavy code output with deep probabilistic reasoning about learner behavior, so you'll need to implement something like a BG/NBD model in PyTorch and defend your uncertainty estimates to an interviewer who knows forgetting curves cold. Most MLE candidates prep modeling and algorithms as separate tracks, but Duolingo's questions frequently blend them, asking you to code a working training loop for a model rooted in the same Gamma-Gamma or retention-probability math you discussed minutes earlier.

Sharpen that overlap at datainterview.com/questions, where you can drill the stats-meets-implementation style Duolingo favors.

How to Prepare for Duolingo Machine Learning Engineer Interviews

Know the Business

Updated Q1 2026

Official mission

Our mission is to develop the best education in the world and make it universally available.

What it actually means

Duolingo's real mission is to provide the highest quality education globally through technology, making it universally accessible. They achieve this by continuously improving their product, prioritizing long-term user growth, and leveraging a freemium business model to fund innovation.

Pittsburgh, Pennsylvania · Hybrid, 3 days/week

Key Business Metrics

Revenue: $964M (+41% YoY)

Market Cap: $5B (-74% YoY)

Employees: 830 (+15% YoY)

Current Strategic Priorities

  • Develop the best education in the world and make it universally available
  • Evolve from a language learning app into a broader educational platform
  • Bridge the gap between online learning and real-world impact

Competitive Moat

Scale advantage · AI-driven personalization · Freemium business model · Gamified language learning · Network effects

Duolingo pulled in $964M in revenue with 41% year-over-year growth, yet the company runs with roughly 830 employees. That lean headcount shapes how they build. Their company strategy overview frames the next chapter as evolving from a language app into a broader education platform, with new subjects and a push to bridge online learning with real-world impact. For ML engineers, this means the product surface is expanding faster than the team.

Your "why Duolingo" answer needs to go beyond the product itself. Duolingo's operating principles emphasize data-driven decisions and measuring whether users are actually learning, not just opening the app. Reference that distinction, then connect it to something from their Scala backend rewrite or their strategy around making Duolingo a credible professional credential. Specificity about their engineering choices and educational mission beats enthusiasm about the owl every time.

Try a Real Interview Question

Online AUC for personalized ranking

python

Given an iterator of pairs $(y, s)$ where $y \in \{0,1\}$ is the label and $s \in \mathbb{R}$ is a model score, compute the AUC defined as $$\mathrm{AUC}=\frac{1}{n_+ n_-}\sum_{i:y_i=1}\sum_{j:y_j=0}\Big(\mathbb{1}[s_i>s_j]+\tfrac{1}{2}\mathbb{1}[s_i=s_j]\Big).$$ Return $\mathrm{AUC}$ as a float, or `None` if $n_+=0$ or $n_-=0$.

from typing import Iterable, Optional, Tuple


def auc_from_stream(examples: Iterable[Tuple[int, float]]) -> Optional[float]:
    """Compute AUC from a stream of (label, score) pairs.

    Args:
        examples: Iterable of (y, s) where y is 0/1 and s is a float.

    Returns:
        AUC as float, or None if there are no positive or no negative labels.
    """
    pass
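Attempt the stub yourself first. For self-checking afterward, here is one possible solution: sort the negatives once, then for each positive count negatives scored strictly below it plus half the ties via binary search, for $O((n_+ + n_-)\log n_-)$ time overall.

```python
import bisect
from typing import Iterable, Optional, Tuple


def auc_from_stream(examples: Iterable[Tuple[int, float]]) -> Optional[float]:
    """Compute AUC from (label, score) pairs; ties count as half a win."""
    pos, neg = [], []
    for y, s in examples:
        (pos if y == 1 else neg).append(s)
    if not pos or not neg:
        return None  # AUC undefined without both classes
    neg.sort()
    total = 0.0
    for s in pos:
        below = bisect.bisect_left(neg, s)           # negatives strictly below s
        ties = bisect.bisect_right(neg, s) - below   # negatives tied with s
        total += below + 0.5 * ties
    return total / (len(pos) * len(neg))
```

In the interview, mention the tradeoff: this two-pass version buffers all scores, whereas the brute-force double loop is $O(n_+ n_-)$; a truly single-pass streaming AUC requires approximation (e.g., bucketed histograms).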

700+ ML coding problems with a live Python executor.

Practice in the Engine

Duolingo's own engineering blog on interviewing makes clear that strong software engineering ability is a hard requirement, not a bonus. The coding rounds reward clean, production-quality solutions with time left to discuss tradeoffs. Practice consistently at datainterview.com/coding.

Test Your Readiness

How Ready Are You for Duolingo Machine Learning Engineer?

1 / 10
Algorithms & Coding

Can you design and code an O(n) or O(n log n) solution to a string or array problem (for example, longest substring without repeating characters), and explain time and space complexity tradeoffs?
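As a calibration point, the example problem above has a standard $O(n)$ sliding-window solution; this sketch is the shape of answer the question is probing for:

```python
def longest_unique_substring(s: str) -> int:
    """Length of the longest substring without repeating characters.

    Sliding window: O(n) time, O(min(n, alphabet size)) extra space.
    """
    last_seen = {}  # char -> index of its most recent occurrence
    start = 0       # left edge of the current window
    best = 0
    for i, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= start:
            start = last_seen[ch] + 1  # jump the window past the repeat
        last_seen[ch] = i
        best = max(best, i - start + 1)
    return best
```

If you can produce something like this, state its complexity, and name the space tradeoff unprompted, this readiness item is covered.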

Use this to find your weak spots before committing to a full prep cycle, then close the gaps at datainterview.com/questions.

Frequently Asked Questions

How long does the Duolingo Machine Learning Engineer interview process take?

From first recruiter call to offer, expect roughly 4 to 6 weeks. The process typically starts with a recruiter screen, moves to a technical phone screen, and then an onsite (or virtual onsite) loop. Scheduling the onsite can add a week or two depending on team availability. I've seen some candidates move faster if the team has urgent headcount, but don't count on it.

What technical skills are tested in the Duolingo MLE interview?

Python is the primary language, so be fluent in it. You'll be tested on data structures and algorithms, ML model design and implementation, statistical analysis, and system design. Data pipeline optimization also comes up, which makes sense given Duolingo's scale of over 500 million registered users. Collaborative coding and code review skills matter too, so write clean, readable code during your interviews.

How should I tailor my resume for a Duolingo Machine Learning Engineer role?

Lead with ML projects that had measurable impact. Duolingo cares about shipping things, so highlight models you actually deployed, not just trained. If you've worked on personalization, recommendation systems, or NLP, put those front and center. Quantify everything: latency improvements, accuracy gains, user engagement lifts. Keep it to one page and make sure Python is listed prominently since that's their primary language.

What is the total compensation for a Duolingo Machine Learning Engineer?

Duolingo pays competitively, especially for their Pittsburgh headquarters where cost of living is lower than the Bay Area. For a mid-level MLE, total comp (base plus equity plus bonus) typically falls in the $180K to $250K range. Senior roles can push $300K or higher. Equity is a significant component since Duolingo is publicly traded (DUOL). Exact numbers vary by level and negotiation, so always ask about the full package breakdown.

How do I prepare for the Duolingo behavioral interview as a Machine Learning Engineer?

Duolingo's values are very specific, so study them. 'Test it first,' 'Ship it,' and 'Reduce complexity' tell you exactly what they want to hear. Prepare stories about times you ran experiments before committing to a solution, shipped something imperfect and iterated, or simplified an overly complex system. Their 'Be candid and kind' value means they'll ask about conflict resolution too. Have 2 to 3 stories ready for each theme.

How hard are the coding and SQL questions in Duolingo's MLE interview?

The coding questions are medium to hard difficulty, focused on data structures and algorithms in Python. You should be comfortable with dynamic programming, graph problems, and string manipulation. SQL isn't always a standalone round, but data manipulation skills come up in the context of pipeline work. Practice Python coding problems regularly at datainterview.com/coding to build speed and accuracy. Algorithmic problem solving is a core skill they list, so don't skip this prep.

What ML and statistics concepts should I know for the Duolingo Machine Learning Engineer interview?

Expect questions on supervised and unsupervised learning, model evaluation metrics (precision, recall, AUC), A/B testing, and statistical significance. Duolingo is big on experimentation ('Test it first' is literally a core value), so understand hypothesis testing cold. You should also be ready to discuss model training pipelines, feature engineering, and how to handle class imbalance. NLP concepts are worth reviewing given Duolingo is a language learning platform.
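Since experimentation comes up so often, it's worth being able to run the arithmetic of a basic A/B readout from scratch. A minimal sketch of a two-proportion z-test using only the standard library (the function name is illustrative, and the normal approximation assumes reasonably large samples):

```python
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates (A vs. B).

    Returns (z, p_value) under the pooled-proportion null hypothesis.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided
    return z, p_value
```

For example, 100/1000 vs. 130/1000 conversions yields $z \approx 2.1$ and $p \approx 0.035$, significant at the conventional 5% level. Be ready to discuss what this test does not cover: peeking, multiple comparisons, and practical versus statistical significance.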

What does the Duolingo onsite interview look like for Machine Learning Engineers?

The onsite typically has 4 to 5 rounds spread across a full day. Expect at least one coding round, one ML system design round, one ML fundamentals or applied ML round, and one or two behavioral rounds. The system design round will likely involve designing an ML system relevant to education or personalization. Cross-functional collaboration is something they evaluate, so expect questions about working with product teams and other engineers.

What business metrics and product concepts should I understand for a Duolingo MLE interview?

Duolingo generates nearly $1B in annual revenue, monetizing through subscriptions, ads, and their English proficiency test. Understand engagement metrics like DAU/MAU ratio, retention curves, and streak behavior. They care deeply about learner outcomes ('Learners first' is value number one), so think about how ML can improve learning effectiveness, not just engagement. Knowing how their recommendation and notification systems likely work will help you stand out in system design rounds.

What format should I use to answer behavioral questions at Duolingo?

Use the STAR format (Situation, Task, Action, Result) but keep it tight. Duolingo values people who 'Prioritize ruthlessly,' so don't ramble. Spend 10% on situation, 10% on task, 60% on your specific actions, and 20% on results with numbers. Always tie your answer back to a Duolingo value when it fits naturally. For example, if you're describing a tradeoff you made, connect it to 'Reduce complexity' or 'Take the long view.' Practice your stories out loud until they're under 2 minutes each.

What are common mistakes candidates make in the Duolingo Machine Learning Engineer interview?

The biggest one I see is treating the ML system design round like a pure algorithms exercise. Duolingo wants to see you think about the full pipeline, from data collection to deployment to monitoring. Another common mistake is ignoring their mission. This is an education company, not a social media app. If you design a system that optimizes engagement at the expense of learning, that's a red flag. Finally, don't write messy code. They evaluate collaborative coding and code review skills, so treat your interview code like production code.

How can I practice for the Duolingo MLE interview effectively?

Start with Python coding problems at datainterview.com/coding, aiming for medium to hard difficulty. Then move to ML system design, practicing end-to-end designs for things like personalized lesson recommendations or adaptive difficulty systems. Review ML fundamentals and stats questions at datainterview.com/questions. Give yourself 3 to 4 weeks of focused prep. Mock interviews help a lot for the behavioral rounds, especially for getting your stories concise and value-aligned.

Dan Lee's profile image

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn