ML Engineer MasterClass (April) | 6 seats left

GROUP BY

GROUP BY

Lesson Objectives

By the end of this lesson, you will:

  • Learn how to summarize data using GROUP BY
  • Understand aggregate functions like SUM(), COUNT(), AVG(), MIN(), MAX()
  • Write SQL queries to analyze user engagement with Netflix’s watch history

🎬 Scenario: Analyzing User Watch Patterns

Your manager at Netflix, Alex, wants insights into how users engage with the platform. They need a summary of:

  • The total number of shows watched per user
  • The average watch duration per user
  • The longest and shortest watch session per user

Alex asks:

Can you group user watch data and provide aggregated statistics for each user?

Your task is to query the watch_history table and provide the requested insights.

history_iduser_idseries_idwatch_datewatch_duration_minutescompleted
1122024-02-10T18:30:0060false
2242024-01-20T21:00:0050true
3312024-02-01T19:45:0075true
4432024-02-15T17:10:0030false
55102024-02-18T23:00:0045true

1. Understanding GROUP BY

The GROUP BY clause allows us to group data by a specific column and apply aggregate functions to summarize the results.

Syntax

SQL
SELECT column1, aggregate_function(column2)  
FROM table_name  
GROUP BY column1;
  • GROUP BY column1 groups rows that share the same value in column1.
  • The aggregate function performs calculations on the grouped rows.

2. Counting Shows Watched per User

To find the total number of shows watched per user, we can use COUNT() to count the number of series_id values for each user.

Example Query: Counting Shows Watched

SQL
SELECT user_id, COUNT(series_id) AS total_shows_watched  
FROM watch_history  
GROUP BY user_id;
user_idtotal_shows_watched
13
23
32
42
52

What’s Happening?

  • COUNT(series_id) counts how many different shows each user has watched.
  • GROUP BY user_id ensures each row in the output represents a unique user.

3. Calculating Average Watch Duration

To find the average time users spend watching, we use AVG() on the watch_duration_minutes column.

Example Query: Average Watch Duration Per User

SQL
SELECT user_id, AVG(watch_duration_minutes) AS avg_watch_time  
FROM watch_history  
GROUP BY user_id;
user_idavg_watch_time
145.0
260.0
367.5
430.0
542.5

What’s Happening?

  • AVG(watch_duration_minutes) calculates the average watch time per user.
  • GROUP BY user_id ensures each row represents a single user.

4. Finding Maximum & Minimum Watch Duration

We can use MAX() and MIN() to determine the longest and shortest viewing session for each user.

Example Query: Longest & Shortest Watch Session

SQL
1SELECT user_id,  
2       MAX(watch_duration_minutes) AS max_watch_time,  
3       MIN(watch_duration_minutes) AS min_watch_time  
4FROM watch_history  
5GROUP BY user_id;
user_idmax_watch_timemin_watch_time
16025
29040
37560
45030
54540

What’s Happening?

  • MAX(watch_duration_minutes) returns the longest watch session for each user.
  • MIN(watch_duration_minutes) returns the shortest watch session per user.

✍️ SQL Exercises

Exercise 1: Counting the Number of Shows Watched per User

Alex wants to know how many shows each user has watched.

Filter condition:

  • Count the number of series_id each user has watched.
  • Sort by total_shows_watched in descending order.
user_idtotal_shows_watched
13
23
32
42
52

Write an SQL query to return the requested data.


Exercise 2: Finding the Average Watch Duration per User

Alex also wants to see how much time each user spends watching on average.

Filter condition:

  • Calculate the average watch duration for each user.
  • Sort by avg_watch_time in descending order.
user_idavg_watch_time
367.5
260.0
145.0
542.5
430.0

Write an SQL query to return the requested data.