Hi, can you show an example of how to calculate Retention (1,7,30 days) by monthly cohorts? I have been trying to do it for 3 days on different datasets, the result is not working. thanks
Hey @ki.polikutin
Long time no see! Hope you’ve already figured it out! :fingers_crossed: Just in case, here’s my thinking process and the query for monthly cohorts retention.
First, let’s start with our cohorts – we want them to be monthly. More precisely, we want to group users into calendar month cohorts. There’re 2 functions that come very handy here.
DATE_TRUNC
will extract a specific part from a timestamp. In our case, it’ll return another timestamp with the first day of the month and 00:00:00
for hours, minutes and seconds.
The other function is TO_CHAR
– it’ll return a string in the format we specified. Here’re both in action:
SELECT
DATE_TRUNC('month', created_at),
TO_CHAR(created_at, 'YYYY-MM')
FROM users
Let’s use DATE_TRUNC()
to assign each user a cohort:
WITH users_with_cohorts AS (
SELECT
DATE_TRUNC('month', created_at) AS cohort,
*
FROM users
)
SELECT *
FROM users_with_cohorts
Now let’s look at user retention. For every user we need to answer the question “Was a user active on D7”? If we have it, then we just need to group this data by users’ cohorts and we’re done.
Here comes the question: how do we define if a user was active? Ideally, we should have an activity table that looks smth like this:
user_id | active_at |
1 | 2018-01-01 |
1 | 2018-01-02 |
2 | 2018-01-03 |
3 | 2018-01-04 |
Such an activity table could be built based on our production database or analytics events data. It’s all a matter of definition.
Let’s use Bindle’s mobile analytics data and have a very simple definition of activity: If there’s a mobile event on a certain date – then a user was active. Later, we can tighten our definition up and use a specific event. For example, a user is active on day X only if there’s a reading event.
WITH user_activity AS (
SELECT
user_id,
created_at AS active_at
FROM mobile_analytics.events
WHERE
user_id IS NOT NULL
)
SELECT *
FROM user_activity
Note that since user activity is based on mobile_analytics.events
table users might have multiple activity records for the same date. It might affect our calculation later, keep it in mind.
Now let’s join users with cohorts and user activity. At this point for every user record (from users_with_records
) we’ll have multiple activity records. Let’s subtract user’s signup timestamp from each activity’s timestamp – it’ll give us a day on which a user was active:
WITH users_with_cohorts AS (
SELECT
DATE_TRUNC('month', created_at) AS cohort,
id AS user_id,
*
FROM users
), user_activity AS (
SELECT
user_id,
created_at AS active_at
FROM mobile_analytics.events
WHERE
user_id IS NOT NULL
)
SELECT
c.user_id,
cohort,
DATE_PART('day', a.active_at - c.created_at) AS activity_day
FROM users_with_cohorts c
LEFT JOIN user_activity a
ON c.user_id = a.user_id
As you can see, until this point our preparations were quite generic – we haven’t specified which retention are we going to calculate – D1, D7, etc. It’s time to do this now.
All we have to do is to group and count users by their cohorts and activity days:
WITH users_with_cohorts AS (
SELECT
DATE_TRUNC('month', created_at) AS cohort,
id AS user_id,
created_at
FROM users
), user_activity AS (
SELECT
user_id,
created_at AS active_at
FROM mobile_analytics.events
WHERE
user_id IS NOT NULL
), activity_days_per_user AS (
SELECT
c.user_id,
cohort,
DATE_PART('day', a.active_at - c.created_at) AS activity_day
FROM users_with_cohorts c
LEFT JOIN user_activity a
ON c.user_id = a.user_id
)
SELECT
cohort,
COUNT(DISTINCT(CASE WHEN activity_day = 1 THEN user_id END)) AS D1,
COUNT(DISTINCT(CASE WHEN activity_day = 7 THEN user_id END)) AS D7,
COUNT(DISTINCT(CASE WHEN activity_day = 30 THEN user_id END)) AS D30
FROM activity_days_per_user
GROUP BY 1
Well here it is – retention for monthly cohorts. You can extend it to any day (D60
or D180
for example).
We can go even further – calculate activity weeks/months instead of days (DATE_PART('week')
or DATE_PART('month')
) and simply group by cohort
and activity_month
instead of conditionally counting D1/D7/D30.
Looking forward for more questions, have fun!