Hi @tks
When to use DISTINCT
One-to-many or many-to-many relationship are the most popular in data warehouses – a user has multiple pageviews, product has many purchases, book has many readers, etc. It simply comes from the nature of any business.
That way when we join users and purchases, for example, many user records will be duplicated (a user record for each user purchase). Note that this is the case for both INNER JOIN
and LEFT JOIN
.
When not to use DISTINCT
I think for safety reasons we can always use DISTINCT
to ensure that we’re counting unique records. If you doubt whether your query might count duplicates – throw in DISTINCT
and the problem is solved.
A subquery from your example is very simple. There’re no JOIN-s and we know that records in the users
table are always unique, no need for DISTINCT
:
SELECT
utm_source,
COUNT(*) AS users_count
FROM users
WHERE
utm_source IS NOT NULL
GROUP BY 1
But as soon as we joined smth we’ll have to use DISTINCT
to count users correctly:
SELECT
COUNT(*) AS wrong_users_count,
COUNT(DISTINCT(u.id)) AS correct_users_count
FROM users u
LEFT JOIN purchases p
ON u.id = p.user_id
Final thoughts
Always using DISTINCT
won’t save you from the need to deeply understand relations in your database.
I recommend to check results with/without DISTINCT
to make sure that you have a correct mental model of the data.