Hi @hal
That’s an amazing question, I’m very happy to dive into in. It’s absolutely critical to understand the difference between filtering inside a JOIN
statement vs filtering in the WHERE
clause.
I see that you stumbled upon this problem in a VERY complex query. For simplicity, let’s consider the most simple ARPU calculation based on all revenue and all users.
By definition, we need to sum up revenue from all purchases that haven’t been refunded and divide it by the number of all users (regardless whether they purchased or not). It’s important not to confuse ARPU and ARPPU.
Filtering inside the WHERE clause
Let’s consider an attempt to calculate ARPU via filtering refunded purchases inside the WHERE
clause. We don’t even need to aggregate anything yet, let’s simply join 2 tables to see the problem:
SELECT *
FROM users u
LEFT JOIN purchases p
ON p.user_id = u.id
WHERE
refunded = FALSE
Do you see it?
No worries if not, but here’s a little hint: let’s count records in this result set:
SELECT COUNT(*)
FROM users u
LEFT JOIN purchases p
ON p.user_id = u.id
WHERE
refunded = FALSE
Filtering inside the JOIN statement
As in the previous paragraph, let’s eye ball the result set first:
SELECT *
FROM users u
LEFT JOIN purchases p
ON p.user_id = u.id
AND refunded = FALSE
Is the difference clear now? What can we tell if we count the records?
SELECT COUNT(*)
FROM users u
LEFT JOIN purchases p
ON p.user_id = u.id
AND refunded = FALSE
As you can see, the number of records in the joined table is way higher when we filter inside the JOIN statement. Why is it so?
It has something to do with the order of SQL commands exection inside a query. The entire FROM
clause (including all JOIN
-s) is executed before the WHERE
clause.
With this new insight, let’s look at our queries.
The query with a WHERE
filter has only records of users with purchases aka “customers” in the result set. That’s because the JOIN
was executed first – we attached all purchases to all users. Then we applied the WHERE
filter to the entire result set and then selected all records that have refunded = FALSE
.
We filtered out all records with refunded = TRUE
(refunded purchases AND all records with refunded = NULL
). In other words, we removed all users without purchases. If we calculate ARPPU – that’s exactly what we want, but for ARPU calculation it’s a critical mistake.
Back to the query with a filter inside the JOIN
statement. The JOIN
is executed first, so it’s the same as joining a smaller purchases table to the users
table:
WITH non_refunded_purchases AS (
SELECT *
FROM purchases
WHERE
refunded = FALSE
)
SELECT *
FROM users u
LEFT JOIN non_refunded_purchases p
ON p.user_id = u.id
Final ARPU calculation
I hope it’s clear now why we need to distinguish between where we filter – inside a JOIN
or WHERE
statements. It’s not like one of them is wrong, it’s all about the context – what are we trying to calculate?
Let’s complete our challenge and calculate ARPU:
SELECT
SUM(amount) / COUNT(DISTINCT(u.id)) AS ARPU
FROM users u
LEFT JOIN purchases p
ON p.user_id = u.id
AND refunded = FALSE
I’ll leave you with a little homework here. Can you explain why do we have COUNT(DISTINCT(u.id))
and not COUNT(u.id)
?
Have fun!