FORUM Everything about SQL Filter of refunded values in ARPU calculation

Filter of refunded values in ARPU calculation

I am exploring an alternative solution to the one shown for Exercise 204.

My question is, why can we not filter out refunded values within the WHERE clause?

WITH ab_test_categorization AS (
  SELECT
    user_id,
    custom_parameters ->> 'ab_test_name' AS ab_test_name,
    custom_parameters ->> 'ab_test_variation' AS ab_test_variation,
    created_at AS categorized_at
  FROM mobile_analytics.events
  WHERE
    custom_parameters ->> 'ab_test_name' IS NOT NULL
    AND action = 'signup'
), ab_test_stats AS (
  SELECT
    ab_test_variation AS variation,
    COUNT(DISTINCT(c.user_id)) / SUM(p.amount) AS arpu
  FROM ab_test_categorization c
  LEFT JOIN purchases p
    ON c.user_id = p.user_id
  WHERE
    ab_test_name = 'longer_onboarding_201803'
    AND p.refunded = FALSE
  GROUP BY 1
)

SELECT
  variation,
  arpu
FROM ab_test_stats

REPLIES

Hi @hal 👋

That’s an amazing question, I’m very happy to dive into in. It’s absolutely critical to understand the difference between filtering inside a JOIN statement vs filtering in the WHERE clause.

I see that you stumbled upon this problem in a VERY complex query. For simplicity, let’s consider the most simple ARPU calculation based on all revenue and all users.

By definition, we need to sum up revenue from all purchases that haven’t been refunded and divide it by the number of all users (regardless whether they purchased or not). It’s important not to confuse ARPU and ARPPU.

Filtering inside the WHERE clause

Let’s consider an attempt to calculate ARPU via filtering refunded purchases inside the WHERE clause. We don’t even need to aggregate anything yet, let’s simply join 2 tables to see the problem:

SELECT *
FROM users u
LEFT JOIN purchases p
  ON p.user_id = u.id
WHERE
  refunded = FALSE

Do you see it? 👀

No worries if not, but here’s a little hint: let’s count records in this result set:

SELECT COUNT(*)
FROM users u
LEFT JOIN purchases p
  ON p.user_id = u.id
WHERE
  refunded = FALSE

Filtering inside the JOIN statement

As in the previous paragraph, let’s eye ball the result set first:

SELECT *
FROM users u
LEFT JOIN purchases p
  ON p.user_id = u.id
    AND refunded = FALSE

Is the difference clear now? What can we tell if we count the records?

SELECT COUNT(*)
FROM users u
LEFT JOIN purchases p
  ON p.user_id = u.id
    AND refunded = FALSE

As you can see, the number of records in the joined table is way higher when we filter inside the JOIN statement. Why is it so? 🤔

It has something to do with the order of SQL commands exection inside a query. The entire FROM clause (including all JOIN-s) is executed before the WHERE clause. 💡

With this new insight, let’s look at our queries. 🔍

The query with a WHERE filter has only records of users with purchases aka “customers” in the result set. That’s because the JOIN was executed first – we attached all purchases to all users. Then we applied the WHERE filter to the entire result set and then selected all records that have refunded = FALSE.

We filtered out all records with refunded = TRUE (refunded purchases ✅ AND ⚠ all records with refunded = NULL). In other words, we removed all users without purchases. 💣 If we calculate ARPPU – that’s exactly what we want, but for ARPU calculation it’s a critical mistake. 🐛

Back to the query with a filter inside the JOIN statement. The JOIN is executed first, so it’s the same as joining a smaller purchases table to the users table:

WITH non_refunded_purchases AS (
  SELECT *
  FROM purchases
  WHERE
    refunded = FALSE
)

SELECT *
FROM users u
LEFT JOIN non_refunded_purchases p
  ON p.user_id = u.id

Final ARPU calculation

I hope it’s clear now why we need to distinguish between where we filter – inside a JOIN or WHERE statements. It’s not like one of them is wrong, it’s all about the context – what are we trying to calculate?

Let’s complete our challenge and calculate ARPU:

SELECT 
  SUM(amount) / COUNT(DISTINCT(u.id)) AS ARPU
FROM users u
LEFT JOIN purchases p
  ON p.user_id = u.id
    AND refunded = FALSE

I’ll leave you with a little homework here. Can you explain why do we have COUNT(DISTINCT(u.id)) and not COUNT(u.id)? ❓

Have fun! 🍻 ❤

WRITE A REPLY