Group By error

I created the following query for the exercise on lesson 114. Detecting campaign with the highest CPC:

SELECT
  utm_campaign,
  amount / clicks AS CPC
FROM marketing_spends
GROUP BY 1
ORDER BY 2 DESC

However this returns an error: column marketing_spends.amount must appear in the GROUP BY.

Why is it required to include it on the group by clause? Seems that the following query runs without errors:

SELECT
  utm_campaign,
  amount / clicks AS CPC
FROM marketing_spends
GROUP BY 1, 2
ORDER BY 2 DESC

Thank you!

REPLIES

Hi @madalina!

That’s a great question! I’m glad you asked it, it’s a huge milestone in understanding grouping in Data Analysis with SQL. Let’s dive into it!

When I write such queries, I always imagine them visually. When we use GROUP BY, we’re basically telling our SQL database to first split our dataset into multiple groups, and then calculate some value for each group. Let’s look at these 2 steps separately.

Step 1: grouping

How SQL database knows which groups we want to have? We specify it in the SELECT clause. Every column (like utm_campaign) or a superposition of columns (like amount / clicks) splits the dataset into groups.

How many groups? A group for each unique value of the specified column. For example, if we have 100 unique values in the utm_campaign – we’ll have 100 groups. Then, if we have 50 unique values of amount / clicks – we’ll split each utm_campaign group into more groups with unique combinations of utm_campaign + amount / clicks. We’ll have 50 * 100 = 5000 groups in total.

Step 2: calculating

All rigth, we have 5000 groups, what’s next? Now we need to specify what do we want to calculate for each group. That’s why these values are called aggregates – we’re transforming a group (multiple rows) into a single number.

You’ve seen these functions many times: COUNT, AVG, etc. The most important thing to remember – we output one value for each unique group.

Step 3: a hidden bug

In the end, the SELECT clause will have 2 kinds of keywords – columns that we use for specifying groups and aggregate functions we’re calculating for each group (yep, we can have multiple aggregates for each unique group, that’s very handy when doing research).

In your original query, SQL engine basically tells us that amount / clicks is not an aggregate function. It knows that it’s a superposition of columns and not an aggregate function, so it asks you to specify this in the GROUP BY clause:

SELECT
  utm_campaign,
  amount / clicks AS CPC
FROM marketing_spends
GROUP BY 1, 2
ORDER BY 2 DESC

⚠ This query is valid, but is it correct? 🤔

There’s no aggregate function and we’re using spend and click data not as a statistic, but as a dimension. Is it really our intention here?

Step 4: putting it all together

I guess what we’re really looking for is the amount of money we spent on a campaign and how much clicks we got out of it. Each row in the marketing_spends table contains data only for a campaign on a given day, so by grouping records by utm_campaign we can calculate spend per campaign and the amount of clicks:

SELECT
  utm_campaign,
  SUM(amount) AS total_spend,
  SUM(clicks) AS total_clicks
FROM marketing_spends
GROUP BY 1

Later we can use these statistics to calculate more complex metrics:

WITH campaign_stats AS (
  SELECT
    utm_campaign,
    SUM(amount) AS total_spend,
    SUM(clicks) AS total_clicks
  FROM marketing_spends
  GROUP BY 1
)

SELECT
  *,
  1.0 * total_clicks / total_spend AS CPC
FROM campaign_stats

Voilà! 🐰 🎩

Excellent explanation! I’m glad I asked because the answer is more complex than I thought. Thank you!

WRITE A REPLY