Hi @madalina!
That’s a great question! I’m glad you asked it, it’s a huge milestone in understanding grouping in Data Analysis with SQL. Let’s dive into it!
When I write such queries, I always imagine them visually. When we use GROUP BY
, we’re basically telling our SQL database to first split our dataset into multiple groups, and then calculate some value for each group. Let’s look at these 2 steps separately.
Step 1: grouping
How SQL database knows which groups we want to have? We specify it in the SELECT clause. Every column (like utm_campaign
) or a superposition of columns (like amount / clicks
) splits the dataset into groups.
How many groups? A group for each unique value of the specified column. For example, if we have 100 unique values in the utm_campaign
– we’ll have 100 groups. Then, if we have 50 unique values of amount / clicks
– we’ll split each utm_campaign
group into more groups with unique combinations of utm_campaign
+ amount / clicks
. We’ll have 50 * 100 = 5000
groups in total.
Step 2: calculating
All rigth, we have 5000 groups, what’s next? Now we need to specify what do we want to calculate for each group. That’s why these values are called aggregates
– we’re transforming a group (multiple rows) into a single number.
You’ve seen these functions many times: COUNT
, AVG
, etc. The most important thing to remember – we output one value for each unique group.
Step 3: a hidden bug
In the end, the SELECT
clause will have 2 kinds of keywords – columns that we use for specifying groups and aggregate functions we’re calculating for each group (yep, we can have multiple aggregates for each unique group, that’s very handy when doing research).
In your original query, SQL engine basically tells us that amount / clicks
is not an aggregate function. It knows that it’s a superposition of columns and not an aggregate function, so it asks you to specify this in the GROUP BY
clause:
SELECT
utm_campaign,
amount / clicks AS CPC
FROM marketing_spends
GROUP BY 1, 2
ORDER BY 2 DESC
This query is valid, but is it correct?
There’s no aggregate function and we’re using spend and click data not as a statistic, but as a dimension. Is it really our intention here?
Step 4: putting it all together
I guess what we’re really looking for is the amount of money we spent on a campaign and how much clicks we got out of it. Each row in the marketing_spends
table contains data only for a campaign on a given day, so by grouping records by utm_campaign
we can calculate spend per campaign and the amount of clicks:
SELECT
utm_campaign,
SUM(amount) AS total_spend,
SUM(clicks) AS total_clicks
FROM marketing_spends
GROUP BY 1
Later we can use these statistics to calculate more complex metrics:
WITH campaign_stats AS (
SELECT
utm_campaign,
SUM(amount) AS total_spend,
SUM(clicks) AS total_clicks
FROM marketing_spends
GROUP BY 1
)
SELECT
*,
1.0 * total_clicks / total_spend AS CPC
FROM campaign_stats
Voilà!