In this lesson we’ll look into another application of regular expression – monitoring.
Previously, we saw how regular expressions are useful for validating user input – postal codes, for example. The good news is that we can run these validations right in the browser or in the mobile app (with exactly the same regular expressions) and prevent invalid data from entering our database.
Sometimes there’s no such possibility. For example, a typo in marketing campaign configuration might lead to a data gap in our analytics. From user’s perspective everything will be fine: they’re clicking our ads and use our product. From a data and marketing perspective it could be a big problem – we can’t process marketing attribution (UTM parameters) because they don’t follow our convention.
This is where regular expressions come into play – we can monitor (run SQL query regularly) marketing campaign names with a regex that will ensure everything is following the right convention.
You can read more about this approach in the Dashboards and alarms chapter, but the idea is simple: our BI (Business Intelligence) system runs SQL query that filters invalid campaign names. If the query returns something, the BI system will email us that there’s a campaign with an invalid name.
The monitoring system we just talked about might seem complex, but, in the end, it’s just an SQL query that knows how to filter invalid campaign names. All we need is a regex.
Let’s look at Bindle’s convention for campaign names (20180208_us-ca-gb-au_mf_bindle-launch), here’s a list of features we encode in every name:
date of launch: 20180208
list of countries: us, ca, gb and au
genders: male and female
ad description bindle-launch
Every time we need to match something really complex it’s better to break it down to separate regex-es first.
Matching the date of launch
Any launch date (20180208 or 20181231) is always a combination of a year (4 digits), month (2 digits) and day (2 digits).
How can we match 8 digits in the beginning of a string? Show explanation.
We can match a single digit either with a special symbol \d or with a range [0-9].
Let’s add a range quantifier (curly brackets) to specify that we need exactly 8 digits in a row: \d{8} or [0-9]{8}.
The last step is to match the beginning of string with the caret symbol: ^\d{8} or ^[0-9]{8}.
It’s always a good idea to test our regex at every step with the regexp_matches(text_input, regex) function to see if it’s actually working:
SELECT regexp_matches('20180208', '^\d{8}')
Matching the list of countries
This one is a bit trickier: the number of countries is arbitrary. It could be us, us-au, us-au-gb, etc.
The word “arbitrary” in regex terms would be one or many, sounds familiar?
Take a minute to think of a regex to match the list of countries. Show explanation.
A single country code is simply a combination of two lower case letters: [a-z]{2}.
Note that we can’t use a special character \w here because it contains digits as well.
The [a-z]{2} regex will match the first first country (which is always present ) – the rest is optional, there could be no countries in the list. “Optional” in regex is zero or many quantifier – asterisk symbol *.
Let’s define a non-matching group (because we’re not really interested in doing smth with the contents of this group) for the rest of the country list: (?:-[a-z]{2})*. Let’s put it all together:
Note that we’re using a non-matching group (?:). It’s a good practice when you’re interested only in filtering text records. Later we’ll take a look functions that operate on matching groups – either to extract them or to replace the matched group text.
Matching ad description
Matching the ad description is probably the easiest – since it’s just a bunch of words concatenated with a dash. Clearly they all can be of an arbitrary length: a-very-great-black-friday-campaign, so we can get away by defining a range of all lower case letters and a dash:
Note that we have 2 dashes inside the range: one for defining all letters a-z and then the dash character itself -.
Putting it all together
At this point we have 4 regex-es for every feature and features are concatenated with underscores _ in a campaign name. Let’s do the same with our regex-es (and add the end of string special character $ to match the whole name):
SELECT DISTINCT(utm_campaign)
FROM marketing_spends
WHERE
utm_campaign ~ '^\d{8}_[a-z]{2}(?:-[a-z]{2})*_(?:m|f|mf)_[a-z-]+$'
If we want to list all invalid campaign names (in our monitoring query, for example) we can negate the matching operator:
SELECT
DISTINCT(utm_campaign) AS invalid_campaign_name
FROM marketing_spends
WHERE
utm_campaign !~ '^\d{8}_[a-z]{2}(?:-[a-z]{2})*_(?:m|f|mf)_[a-z-]+$'
It’s probably the longest lesson in SQL Habit! Hope it was a nice ride, congratz on writing the longest regex ever!
SQL Habit is a course (or, as some of the students say, “business simulator”). It’s based on a story of a fictional startup called Bindle. You’ll play a role of their Data Analyst and solve real-life challenges from Business, Marketing, and Product Management.
SQL Habit course is made of bite-sized lessons (you’re looking at one atm) and exercises. They always have a real-life setting and detailed explanations. You can immediately apply everything you’ve learned at work.
“well worth the money”
Fluent in SQL in a month
Master Data Analysis with SQL with real life examples from Product Management, Marketing, Finance and more.