110. CPA: Cost per acquisition

Hello,

With reference to the CTE - total_users in following query for this chapter, my query is that why isn’t distinct used with COUNT(*) AS users_count

WITH spends AS (
  SELECT
    utm_source,
    SUM(amount) AS total_spend
  FROM marketing_spends
  GROUP BY 1  
), total_users AS (
  SELECT
    utm_source,
    COUNT(*) AS users_count
  FROM users
  WHERE 
    utm_source IS NOT NULL
  GROUP BY 1
)

SELECT
  s.utm_source,
  total_spend / users_count AS CPA
FROM spends s
INNER JOIN total_users u
  ON s.utm_source = u.utm_source

REPLIES

makaroni4

over 3 years Reply

Hi @tks

When to use DISTINCT

One-to-many or many-to-many relationship are the most popular in data warehouses – a user has multiple pageviews, product has many purchases, book has many readers, etc. It simply comes from the nature of any business.

That way when we join users and purchases, for example, many user records will be duplicated (a user record for each user purchase). Note that this is the case for both INNER JOIN and LEFT JOIN.

When not to use DISTINCT

I think for safety reasons we can always use DISTINCT to ensure that we’re counting unique records. If you doubt whether your query might count duplicates – throw in DISTINCT and the problem is solved.

A subquery from your example is very simple. There’re no JOIN-s and we know that records in the users table are always unique, no need for DISTINCT:

SELECT
  utm_source,
  COUNT(*) AS users_count
FROM users
WHERE 
  utm_source IS NOT NULL
GROUP BY 1

But as soon as we joined smth we’ll have to use DISTINCT to count users correctly:

SELECT  
  COUNT(*) AS wrong_users_count,
  COUNT(DISTINCT(u.id)) AS correct_users_count
FROM users u
LEFT JOIN purchases p
  ON u.id = p.user_id

Final thoughts

Always using DISTINCT won’t save you from the need to deeply understand relations in your database.

I recommend to check results with/without DISTINCT to make sure that you have a correct mental model of the data.

tks