It's Oracle Groundbreaker's Appreciation Day! So I'm going to discuss one of my favourite features of SQL: Row Pattern Matching.
Added in Oracle Database 12c, the match_recognize
clause enables you to use a regular expression to search for a pattern across rows. This makes it easy to answer many questions which were difficult-to-impossible to solve using SQL
There are already many posts discussing the syntax of SQL pattern matching. So I'm going to focus on how the pattern processing works. And show how you can use this to solve common problems that are tricky with classic SQL, such as:
To do this, I'm going to analyse a sample of my running training data, as I discuss in this presentation:
Let's get started!
I want to search my training data for each series of consecutive days I ran. You can solve this using analytic functions with the Tabibitosan method. But I prefer the pattern matching solution!
To use this I need to define one or more pattern variables I'm looking for in my data. Then use these variables in a regular expression defining the pattern itself.
To find groups of consecutive runs, I'm searching for:
This gives the following pattern and variable:
pattern ( init consecutive* ) define consecutive as run_date = prev ( run_date ) + 1
Now you may be wondering: what's the deal with the init
variable? It's not in the define
clause, so what does it mean?
Any undefined variable in the pattern is "always true". This means it matches the next row in the data set.
To see why this is necessary, consider the definition of consecutive
. It compares the current date to the previous. But the first row in the data set has no prior row!
Without init
, the database tries to match the first row in the database to consecutive
. Which means it's checking if the current date equals null
. Which is never true. This continues throughout the data set. So the pattern never matches any rows. Fixing consecutive
to handle null
is messy.
To simplify this, the always-true init
variable matches the first row in the data set. The database then checks the second row to see if it matches consecutive
in the pattern. If it does, it adds this row to the group. consecutive
uses the regular expression quantifier, star. Which means zero or more matches. So it can add any number of rows to the group. The pattern ends when it encounters a row where the difference between the current date and the previous is not one day.
At this point the database starts searching for the pattern again. Beginning with the first variable, init
. As this is always-true, a new group begins on the next row. The database can then search for the next set of consecutive dates.
Searching for consecutive chains is a common business question, for example:
Whether you use Tabibitosan or match_recognize
, knowing how to group consecutive rows is useful SQL technique.
Running every day is a lofty goal. One that is unachievable for most of us. And it's definitely unachievable for me!
So rather than search for dates with consecutive runs, I want check I'm running at least three times in a seven day period.
Now here's the kicker: I don't want to check I ran three times in each calendar week, something you can do with group by
. There could be any period of time between one seven day group and the next. So if I go running on the 1st January, I want to include all runs up to and including the 7th January in the first group.
The next seven day period should start from the date of the next run. Not the 8th January. So if I run on the 10th, I want to check I ran three times between 10th and 17th.
At first glance it seems you can solve this using the window clause of analytic functions. But (in general) you can't. You either have to use recursive with
or the model
clause. Both of which are tricky to write.
With match_recognize
it's easy. You're looking for:
Which is the following pattern and variable:
pattern ( within7 {3, } ) define within7 as run_date < first ( run_date ) + 7
The quantifier in the pattern {3,}
means match at least three instances of this variable. The database must match the whole pattern to return the rows. So this excludes any runs where I ran two or less times in a seven day period.
The function first ( run_date )
returns the date for the first row matched in this group.
Notice there's no need for an always-true init
variable here. This is because for the first row we're asking:
Is the current value (the first) = the first value plus seven
i.e. is something less than itself plus seven. This already always true! So another always-true variable is unnecessary.
Often when splitting data into periods of time you want all the periods to be consecutive. For example, report number of sales each hour. But there are many cases where you want to split the activity into groups like this, such as:
In each of these cases it's critical that you calculate the duration of each from the time of its first activity. Otherwise you may end up over- or under-calculating some batches.
Photo by Doruk Yemenici on Unsplash
I'm training for a 10k. I want to estimate how long it'll take me to complete the race.
As the vast majority of training runs will be (notably!) less than 10k, I want to group together runs into batches with a total distance of 10k. I can then find my average pace across each batch. Enabling me to estimate how long it'll take me to complete the race.
To do this, I need to:
With pattern matching, all you need is:
pattern ( ten_k+ ) define ten_k as sum ( distince_in_km ) <= 10
This works because the sum
in the pattern variable gives the cumulative total. The quantifier plus matches one or more rows. So this keeps adding rows until the total distance exceeds 10k. At this point the database starts a new group. The row taking the total past 10k is the first in the new group. And the cumulative sum
is reset to the distance for this row.
Real-world examples of this are:
These are forms of bin-packing problem. This, and related issues such as the knapsack problem, are "hard" computer science problems. Which means there are no general case solutions which are guaranteed to be both fast and optimal.
While pattern matching doesn't guarantee either of these, it is easy write. If you need to process your data in a FIFO manner, it's often good enough. If not, it can give a rapid first pass for further refinement as needed.
Many training programs recommend each week you complete one long jog and several shorter sprints. To see if I'm following this pattern, I want to search for:
The pattern and variables for this are:
pattern ( five_km one_km {2,} ) define five_km as distance_in_km = 5, one_km as distance_in_km = 1 and run_date < first ( run_date ) + 7
The general form of this problem is searching for a sequence of two or more different events. Each of which can happen many times. Real-world examples of this include:
Technical share traders are hunting for trends in stock prices as a signal to buy or sell. For example, they may want to find all companies where the share price:
Also known as a W pattern. You can do this using lag
and/or laed
. But this is messy. Particularly if you want to adjust the number of days in each rise or fall.
With match_recognize
, the pattern and variables are as easy as:
pattern ( rise{3} fall rise fall{3} ) define rise as price > prev ( price ), fall as price < prev ( price )
Knowing who your best customers are and keeping them happy is a solid way to stay in business. So it's important to know when loyal customers have stopped buying from you.
For example, you may want to find all the customers that:
This is hard to solve with classic SQL. But with row pattern matching, you can identify the rows with:
pattern ( init two_or_more{6,} two_total* ) define two_or_more as orders_in_month >= 2 and mth < add_months ( sysdate, -6 ) and mth = add_months ( prev ( mth ), 1 ), two_total as mth >= add_months ( sysdate, -6 )
This returns all regular customers. To get only those with two or less recent orders, filter the total orders for two_total in a where clause after this (HT Iudith Mentzel).
Many applications export files containing different record types. The records must appear in a given order, for example:
You can confirm a file meets these criteria with this pattern:
pattern ( header ( order_head order_line+ )+ summary{2} footer )
The key with all these is you're searching for events in a specific order. Some of these events can happen one or more times. And often need to occur within a given time period.
This type of problem is where SQL pattern matching shines. Without it, queries often needs to access the same table many times. Or uses analytic functions nested in subquery upon subquery upon subquery.
SQL using either of these methods is tricky to write, debug, and adapt to changes. But pattern matching makes easy!
So I urge you to go out, learn how match_recognize
works, and start using its power to unlock the patterns in your data!
UPDATED 24 Oct 2019: Fixed customer retention analysis expression; see comment from Iudith Mentzel.
Regarding the example for "Customer Retention Analysis", it looks to me that some modifications are still required.
For example, for the following data it returns a match, though it should not:
create table cust_test (
cust_id integer,
mth date,
orders_in_month integer
)
/
insert into cust_test
select 1 cust_id,
add_months(date '2018-10-01', rownum - 1) mth,
case
when rownum = 8 then 1 else 2
end orders_in_month
from dual
connect by level = 2
and mth < add_months ( sysdate, -6 )
and mth = add_months ( prev ( mth ), 1 ),
two_total as mth >= add_months ( sysdate, -6 )
and sum ( two_total.orders_in_month ) = 2
and mth < add_months ( sysdate, -6 )
and ( mth = add_months ( prev ( mth ), 1 ) or
count(two_or_more.*) = 1 ),
two_total as mth >= add_months ( sysdate, -6 )
and mth < sysdate
and sum ( two_total.orders_in_month ) = sysdate
)
/
no data found
But the problem here is that we need the "end" row to ensure that the matching did not stop "too early",
before it has "covered" the last 6 months.
Otherwise, we can only check the last 6 months total in an outer query, which is kind of "missing the point" of the MATCH_RECOGNIZE:
select * from (
select * from cust_test
match_recognize (
partition by cust_id
order by mth
measures
match_number() mn,
classifier() var,
final sum(last_6.orders_in_month) last_6_total
all rows per match
pattern ( two_or_more{6,} last_6* )
define
two_or_more as orders_in_month >= 2
and mth < add_months ( sysdate, -6 )
and ( mth = add_months ( prev ( mth ), 1 ) or
count(two_or_more.*) = 1 ),
last_6 as mth >= add_months ( sysdate, -6 )
)
where nvl(last_6_total,0)
Note match_recognize returns a table. So you can place a where clause directly after it. There's no need to use a subquery to filter last_6_total.
e.g.:
select ...
from ...
match_recognize (
...
)
where last_6_total ...
So I disagree this is "missing the point" of this clause!