In 105. a great and important point is made in the difference of limiting a query in the join conditions vs. limiting a query in the where clause.
I understand the scenario of why limiting the query in the join condition makes sense (filtering the purchases table to only consider all purchases that were not refunded) and I see how using the where clause ultimately returns wrong output (filtering in the where clause will ignore the ‘negative space’ of all the NULLs in purchase columns for users who did not make a purchase).
I tried challenging my understanding a bit and came up with the ‘b’ CTE below (‘a’ CTE is the correct query as described in the exercise). My question is then: how am I making a logical error in trying to utilise the where clause in the ‘b’ CTE below? As far as I can see, the difference in counts between ‘a’ and ‘b’ is 104 rows (purchases that were refunded?).
WITH a AS (
SELECT
SUM(amount) / COUNT(DISTINCT(u.id)) AS ARPU
FROM users u
LEFT JOIN purchases p
ON u.id = p.user_id
AND refunded = FALSE
),
b AS (
SELECT
SUM(amount) / COUNT(DISTINCT(u.id)) AS ARPU
FROM users as u
LEFT JOIN purchases AS p
ON u.id = p.user_id
WHERE refunded IS NULL OR refunded = FALSE
)
SELECT * FROM b
-- SELECT * FROM a
/*
The 'a' CTE above is correct. I can see that 'b' CTE is not correct,
but I'm unable to wrap my mind around why, for the 'b' version, the
'amount' is greater in relation to the distinct user count.
*/