Recently I spotted a strange $200 transaction in my bank statement. It was from a service I thought I'd canceled. You know what came next: I wrote to customer support, asked for a refund, and after some time got my $200 back. This story had a happy ending, but it made me wonder how many such recurring transactions I hadn't noticed.
It's 2020, and I thought bank apps were already smart enough to detect subscriptions. I went to my N26 bank's Subscriptions report hoping to find a nice list of recurring transactions there. Drum roll… it was empty!
I definitely pay for some subscription services (Spotify, for example) but none of them showed up in the report.
I've no idea how N26 detects subscriptions or why it's broken for me. The good news is that N26, like any other bank, lets you download all transactions as a CSV file. Let's download it and improvise a simple subscription detection mechanism based solely on SQL.
What we're about to do is a typical piece of data research, or ad hoc analysis. It's not far from the day-to-day reality of data analysts, marketers, and product managers mining their data.
In my experience, any analysis (or anything in general) has a much higher chance of success if it has a goal. Here I want to know which companies charge me regularly, for how long, and how much money I've already paid them.
To answer these questions we'll have to write one (potentially massive) SQL query. To kick it off, I usually try to visualize the result set that the final query will return. Such a visualization helps a lot in anticipating intermediate steps and bottlenecks. Looking at our goals, I'd imagine the final result like this:
payee   | first_transaction_date | total_money_spent_eur
Spotify | 2019-04-01             | -250
Netflix | 2020-01-01             | -30
It’s a good goal for a small blog post. Let’s get started!
Let's download the CSV file with our bank transactions. Here's a link to the N26 report file that will be used in this blog post (the data is made up, of course, but the format is real).
Feel free to download the transactions from your own bank and follow along with them!
In case you have a problem with anything just leave a comment below or ping me on Twitter.
Let's inspect the CSV file with the transactions; it should look something like this:
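For reference, here are a couple of made-up rows in roughly the N26 export format (the header names are approximated from the columns we'll create below):

"Date","Payee","Account number","Transaction type","Payment reference","Category","Amount (EUR)","Amount (Foreign Currency)","Foreign Currency","Exchange Rate"
"2020-01-03","Spotify","DE00000000000000000000","MasterCard Payment","","Subscriptions","-9.99","","",""
"2020-02-03","Spotify","DE00000000000000000000","MasterCard Payment","","Subscriptions","-9.99","","",""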
Basically, CSV files are tables. In the next couple of steps we’ll replicate this table in the actual SQL database and copy the data.
I’ll use a PostgreSQL database with a free client called Postico. I’ll create a simple database via the client’s interface:
An alternative approach is to use SQL queries for database (and table) creation as well. To create a new database you'll need to run a query while connected to the template database (template1 in the case of PostgreSQL):

CREATE DATABASE personal_finance;
We're calling our database personal_finance in case you want to add more data to it in the future (other bank accounts, investments, etc.).
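If you prefer the command line over a GUI client, the same query can be run with psql, the client that ships with PostgreSQL (assuming a local server with default settings):

psql -c 'CREATE DATABASE personal_finance;'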
Now we need a table to host the transactions from the CSV file; let's call it n26_transactions (if you use Chase bank, it'd be chase_transactions). Let's examine which columns we'll need by looking at the first row of the CSV file:
Date, Payee, Account number, etc. In our table they'll be called created_at (date is a reserved word in SQL, and it's better not to use it as a column name), payee, account_number, etc.
We're using lowercase column names with underscores _ instead of spaces; in programming this is called snake case.
Now that we have a list of columns let’s create a table. Simply click + Table if you’re using Postico:
To add new columns, switch to the Structure tab of the n26_transactions table and add the columns (the created_at column has type date, all other columns have type text):
The hacker way to add a table is with a CREATE TABLE SQL query:
CREATE TABLE n26_transactions (
created_at date,
payee text,
account_number text,
transaction_type text,
payment_reference text,
category text,
amount_eur text,
amount_foreign text,
foreign_currency text,
exchange_rate text
);
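You can double-check that the table came out right with psql's \d meta-command (Postico shows the same information in the Structure tab):

\d n26_transactions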
Note that we used text columns even for the columns containing numbers (amount_eur, etc.). The reason is that all values in the CSV file are wrapped in quotation marks "" and, from the database's perspective, they're all text (importing straight into a numeric column could fail when the SQL engine can't parse a number from an empty string, for example). No problem though: we'll be able to change the column types later.
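If you later want to convert a column in place instead of casting it in every query, PostgreSQL's ALTER TABLE can do it. Here's a sketch, assuming every non-empty value in the column parses as a number (NULLIF turns empty strings into NULLs so the cast doesn't fail):

ALTER TABLE n26_transactions
  ALTER COLUMN amount_eur TYPE numeric
  USING NULLIF(amount_eur, '')::numeric;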
At this point we've got our personal_finance database and n26_transactions table. It's time to import the data! Let's use the special COPY query for that. Here's the final query that imports the transactions from the CSV file:
COPY n26_transactions(
created_at,
payee,
account_number,
transaction_type,
payment_reference,
category,
amount_eur,
amount_foreign,
foreign_currency,
exchange_rate
) FROM '/Users/makaroni4/Downloads/n26-csv-transactions-fake.csv' DELIMITER ',' CSV HEADER;
The order of column names in the COPY command should match the order of the columns in the CSV file. Otherwise, Postgres will try to map the data to the default order of the columns in the n26_transactions table.
Don't forget to adjust the path to the CSV file in the FROM clause. As you can see, the query uses the path to the CSV file on my computer: /Users/makaroni4/Downloads/n26-csv-transactions-fake.csv.
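One caveat: COPY ... FROM reads the file on the database server, which usually requires superuser rights. If your setup complains about permissions, psql's \copy meta-command accepts the same options but reads the file from your local machine instead:

\copy n26_transactions(created_at, payee, account_number, transaction_type, payment_reference, category, amount_eur, amount_foreign, foreign_currency, exchange_rate) FROM '/Users/makaroni4/Downloads/n26-csv-transactions-fake.csv' DELIMITER ',' CSV HEADER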
At this point, we should have all the transactions from the CSV file in the n26_transactions table. I start every piece of data research with a SELECT * query and a look at the raw data:
SELECT *
FROM n26_transactions
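A quick sanity check that the import worked: the row count should match the number of data rows in the CSV file (COPY also reports it after a successful import):

SELECT COUNT(*)
FROM n26_transactions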
In the result, the records follow the same order as in the original CSV file, so it's quite hard to spot any recurring payments. Let's order the transactions by payee and transaction date (the created_at column):
SELECT *
FROM n26_transactions
WHERE
amount_eur::numeric < 0
ORDER BY payee ASC, created_at DESC
Note that we filtered out income transactions (those with a positive amount).
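By the way, :: is PostgreSQL's shorthand cast syntax: amount_eur::numeric is equivalent to the standard CAST(amount_eur AS numeric). We need the cast because we stored the amounts as text:

SELECT CAST('-9.99' AS numeric) = '-9.99'::numeric -- returns true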
I bet you've already spotted the transactions from Spotify: it's definitely a subscription!
It looks like Spotify bills us every month on the 3rd, but other companies could bill every 30 or 31 days, for example. Detecting the billing frequency is definitely the bottleneck here.
Let's try counting the days between consecutive transactions for every payee. To do this we'll employ the ROW_NUMBER() window function to index all transactions within each payee and the LAG() window function to access the date of the previous transaction:
SELECT
ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
*
FROM n26_transactions
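A side note on the date arithmetic: in PostgreSQL, subtracting one date value from another yields an integer number of days, which is exactly what we want for date_diff:

SELECT DATE '2020-02-03' - DATE '2020-01-03' AS date_diff -- returns 31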
This query counts the days between consecutive transactions for every payee. It also produces a lot of noise in the result: records with a NULL date_diff. These are the first transactions of each payee, where LAG() has no previous row to look at. They definitely make it hard for us to spot the recurring transactions, so let's filter them out.
We'll do it by wrapping the previous query into a CTE (Common Table Expression, essentially a named subquery) and filtering out the rows with a NULL date_diff:
WITH transactions_with_date_diff AS (
SELECT
ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
*
FROM n26_transactions
)
SELECT *
FROM transactions_with_date_diff
WHERE
date_diff IS NOT NULL
Sweet, we’re down to 400 transactions from the initial 800. Let’s browse through the result and spot more recurring transactions to find patterns in billing dates.
Here's one from Vodafone Mobile, and it's definitely a subscription. Its date_diff varies from 28 to 33, so let's filter for records with a date_diff in this range (rounding it out to between 25 and 35 to be safe):
WITH transactions_with_date_diff AS (
SELECT
ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
*
FROM n26_transactions
)
SELECT *
FROM transactions_with_date_diff
WHERE
date_diff IS NOT NULL
AND date_diff BETWEEN 25 AND 35
Awesome, we’re down to just 90 transactions and most of them look pretty recurring!
Let's recall the final report we planned. We wanted to know which companies charge us regularly, since when, and how much money we've paid them in total.
Our job at this point is to convert the result of the previous query into that final report. We'll definitely need to aggregate the data, so let's translate our requirements into SQL:
When did the subscription start? MIN(created_at)
How many times have we been billed? COUNT(*)
How much have we paid in total? SUM(amount_eur::numeric). Note that we cast the amount_eur string value into a number.
When was the latest transaction? MAX(created_at)
Let’s put everything into one query:
WITH transactions_with_date_diff AS (
SELECT
ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
LAST_VALUE(amount_eur) OVER(
  PARTITION BY payee
  ORDER BY created_at
  -- the default window frame ends at the current row, so we extend it to reach the partition's actual last value
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_amount,
*
FROM n26_transactions
)
SELECT
payee,
COUNT(*) AS transactions_count,
MIN(created_at) AS subscription_started,
MAX(created_at) AS latest_transaction,
SUM(amount_eur::numeric) AS total_amount
FROM transactions_with_date_diff
WHERE
date_diff IS NOT NULL
AND date_diff BETWEEN 25 AND 35
GROUP BY 1
ORDER BY 2 DESC
As you can see in the result, there's a bunch of payees with just 1 transaction; those don't look like recurring payments. Let's filter them out for now (if they're real subscriptions, they'll show up in the report next month):
WITH transactions_with_date_diff AS (
SELECT
ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
LAST_VALUE(amount_eur) OVER(
  PARTITION BY payee
  ORDER BY created_at
  -- same frame fix as above: reach the partition's actual last value
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_amount,
*
FROM n26_transactions
)
SELECT
payee,
COUNT(*) AS transactions_count,
MIN(created_at) AS subscription_started,
MAX(created_at) AS latest_transaction,
SUM(amount_eur::numeric) AS total_amount
FROM transactions_with_date_diff
WHERE
date_diff IS NOT NULL
AND date_diff BETWEEN 25 AND 35
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 2 DESC
Pretty neat, huh? I think this query used pretty much every SQL concept available: filtering, grouping, aggregate and window functions, and sorting.
Let’s get back to our initial inspection query:
SELECT *
FROM n26_transactions
ORDER BY payee ASC, created_at DESC
If you scroll through the result set long enough, you'll see recurring LOCAL GYM transactions:
payee
LOCAL GYM AUG-29609684
LOCAL GYM DEC-33094239
LOCAL GYM JAN-33997758
Our final query didn't spot them because the payee was different each month (somehow LOCAL GYM managed to include the month of payment as part of the payee value rather than in the payment description).
There are a couple of ways we could fix that issue. We can manually adjust the payee values in a subquery and use it for our report:
SELECT
(CASE
WHEN payee LIKE 'LOCAL GYM %' THEN 'LOCAL GYM'
ELSE payee
END) AS adjusted_payee,
*
FROM n26_transactions
ORDER BY payee ASC, created_at DESC
or we can extract the first word from the payee value and use it as the new payee label:
SELECT
SPLIT_PART(payee, ' ', 1) AS adjusted_payee,
payee,
*
FROM n26_transactions
ORDER BY payee ASC, created_at DESC
The first approach is more precise but requires us to go over all transactions manually to spot the recurring ones (which kind of defeats the purpose of our research). Either way, we can turn the result of these queries into a subquery and use it as the input for our final query from Step 6.
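Here's a sketch of that combination built on top of the SPLIT_PART version (the same filters and aggregations as in Step 6, just grouped by the new adjusted_payee column):

WITH adjusted_transactions AS (
  SELECT
    SPLIT_PART(payee, ' ', 1) AS adjusted_payee,
    *
  FROM n26_transactions
), transactions_with_date_diff AS (
  SELECT
    created_at - LAG(created_at) OVER(PARTITION BY adjusted_payee ORDER BY created_at) AS date_diff,
    *
  FROM adjusted_transactions
)
SELECT
  adjusted_payee,
  COUNT(*) AS transactions_count,
  MIN(created_at) AS subscription_started,
  MAX(created_at) AS latest_transaction,
  SUM(amount_eur::numeric) AS total_amount
FROM transactions_with_date_diff
WHERE
  date_diff IS NOT NULL
  AND date_diff BETWEEN 25 AND 35
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 2 DESC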
Since we've now also used a CASE statement and the scalar function SPLIT_PART, we've covered pretty much every aspect of SQL in one blog post!
As you can see, a "simple" SQL query can be very insightful for personal finance. Imagine how powerful data insights are when you're running marketing campaigns or building a product.
In case you want to learn how SQL works and how to use it to run a modern Internet company, I recommend trying the SQL Habit course. It's a simulation of a startup that will teach you not just SQL but also how to use data to run successful marketing campaigns and build web and mobile products.