Hi @itslutfi
That’s a great question, JOIN-ing is one of the fundamentals of SQL and Data Analysis. I’d recommend you to skim through the lessons for INNER and LEFT joins.
I’m also glad that you stumbled upon the joining problem with that example, it’s super awesome because you’ll see this pattern in all conversion calculations later in the course and at work.
Debugging a query
Let’s simplify your query a bit and focus only on joining tables. I think I know what went wrong, let me give you an overview of how I usually debug such problems with queries.
PRO TIP: Use first letters of tables for aliases. That way it’s easy to understand which tables you use for filtering, sorting, selecting, etc.
Here’s our simplified query:
SELECT
COUNT(DISTINCT(u.id)) AS users_count,
COUNT(DISTINCT(p.id)) AS purchases_count
FROM users u
JOIN purchases p
ON p.user_id = u.id
AND refunded = FALSE
Here’s the result set:
users_count |
purchases_count |
805 |
812 |
Now let’s change the order of tables:
SELECT
COUNT(DISTINCT(u.id)) AS users_count,
COUNT(DISTINCT(p.id)) AS purchases_count
FROM purchases p
JOIN users u
ON p.user_id = u.id
AND refunded = FALSE
users_count |
purchases_count |
805 |
812 |
As you can see, the results are identical and the order of tables doesn’t make a difference. It means the problem is somewhere else. Since our query is very simple, the only other place to look at is the type of JOIN we’re using.
Fixing the problem
Let’s look at the JOIN type in our query. You’re using INNER JOIN
(JOIN
is just a shorter alias) and by definition it returns only rows that are present in both tables. That’s why when we change the order of tables the results are identical.
Let’s look at our task again. We need to calculate average revenue per user:
- We count all users.
- We sum up their revenue.
- We divide 2 by 3.
Atm our query has a bug in the first step – INNER JOIN
removes all users without purchases.
To count all users and still be able to sum up their revenue we need to use LEFT JOIN
– it keeps all rows from the left table (users) and joins only matching rows from the right table (purchases).
SELECT
COUNT(DISTINCT(u.id)) AS users_count,
COUNT(DISTINCT(p.id)) AS purchases_count
FROM users u
LEFT JOIN purchases p
ON p.user_id = u.id
AND refunded = FALSE
users_count |
purchases_count |
6082 |
812 |
As you can see, the amount of purchases stayed the same and the number of users is way higher. In fact, this query counts all users in our database, we can verify it by counting them separately:
SELECT COUNT(*) AS users_count
FROM users
Things to think about
Now what if we change the order in the query with LEFT JOIN
?
SELECT
COUNT(DISTINCT(u.id)) AS users_count,
COUNT(DISTINCT(p.id)) AS purchases_count
FROM purchases p
LEFT JOIN users u
ON p.user_id = u.id
AND refunded = FALSE
Of course, the results will be completely different because of our JOIN type. LEFT JOIN
will keep all records from the left table (in that case purchases) and only join matching records from the right table (users).
Summary
To select a correct JOIN type we need to deeply understand what we’re calculating. Ideally, break down our task into a simple algorithm and then validate each step with simpler queries like this one:
SELECT COUNT(*) AS users_count
FROM users
Most imporatnly, always remember that INNER JOIN
returns the records that are present in both table. You can visualize it as an intersection.
The LEFT JOIN
always keeps all records from the left table and only joins the matching records from the right table. The base of your conversion (the denominator) usually comes from the left table, because you need to count all records from it.
I hope that was useful! Ping me if you have more questions.