OGB Appreciation Day: SQL Pattern Matching (match_recognize) Use Cases

October 10, 2019 | 10 minute read
Chris Saxon
Developer Advocate
Text Size 100%:

It's Oracle Groundbreaker's Appreciation Day! So I'm going to discuss one of my favourite features of SQL: Row Pattern Matching.

Added in Oracle Database 12c, the match_recognize clause enables you to use a regular expression to search for a pattern across rows. This makes it easy to answer many questions which were difficult-to-impossible to solve using SQL

There are already many posts discussing the syntax of SQL pattern matching. So I'm going to focus on how the pattern processing works. And show how you can use this to solve common problems that are tricky with classic SQL, such as:

To do this, I'm going to analyse a sample of my running training data, as I discuss in this presentation:

Let's get started!

How to Find Consecutive Rows with SQL

I want to search my training data for each series of consecutive days I ran. You can solve this using analytic functions with the Tabibitosan method. But I prefer the pattern matching solution!

To use this I need to define one or more pattern variables I'm looking for in my data. Then use these variables in a regular expression defining the pattern itself.

To find groups of consecutive runs, I'm searching for:

  • A pattern of one or more rows
  • A variable where the current date equals the previous date plus one

This gives the following pattern and variable:

  pattern ( init consecutive* )   
    consecutive as run_date = prev ( run_date ) + 1

Now you may be wondering: what's the deal with the init variable? It's not in the define clause, so what does it mean?

Any undefined variable in the pattern is "always true". This means it matches the next row in the data set.

To see why this is necessary, consider the definition of consecutive. It compares the current date to the previous. But the first row in the data set has no prior row!

Without init, the database tries to match the first row in the database to consecutive. Which means it's checking if the current date equals null. Which is never true. This continues throughout the data set. So the pattern never matches any rows. Fixing consecutive to handle null is messy.

To simplify this, the always-true init variable matches the first row in the data set. The database then checks the second row to see if it matches consecutive in the pattern. If it does, it adds this row to the group. consecutive uses the regular expression quantifier, star. Which means zero or more matches. So it can add any number of rows to the group. The pattern ends when it encounters a row where the difference between the current date and the previous is not one day.

At this point the database starts searching for the pattern again. Beginning with the first variable, init. As this is always-true, a new group begins on the next row. The database can then search for the next set of consecutive dates.

So what's the real world use of this?

Searching for consecutive chains is a common business question, for example:

  • How many consecutive days did sales exceed one million?
  • How many consecutive days did each student attend classes?
  • What's the longest unbroken period of uptime for a server or application?

Whether you use Tabibitosan or match_recognize, knowing how to group consecutive rows is useful SQL technique.

How to Group Rows into Seven Days Batches, with Variable Gaps Between with SQL

Running every day is a lofty goal. One that is unachievable for most of us. And it's definitely unachievable for me!

So rather than search for dates with consecutive runs, I want check I'm running at least three times in a seven day period.

Now here's the kicker: I don't want to check I ran three times in each calendar week, something you can do with group by. There could be any period of time between one seven day group and the next. So if I go running on the 1st January, I want to include all runs up to and including the 7th January in the first group.

The next seven day period should start from the date of the next run. Not the 8th January. So if I run on the 10th, I want to check I ran three times between 10th and 17th.

At first glance it seems you can solve this using the window clause of analytic functions. But (in general) you can't. You either have to use recursive with or the model clause. Both of which are tricky to write.

With match_recognize it's easy. You're looking for:

  • Three or more instances of a variable
  • Where the date of the current row is within seven days of the first date in the group

Which is the following pattern and variable:

  pattern ( within7 {3, } )   
    within7 as run_date < first ( run_date ) + 7

The quantifier in the pattern {3,} means match at least three instances of this variable. The database must match the whole pattern to return the rows. So this excludes any runs where I ran two or less times in a seven day period.

The function first ( run_date ) returns the date for the first row matched in this group.

Notice there's no need for an always-true init variable here. This is because for the first row we're asking:

Is the current value (the first) = the first value plus seven

i.e. is something less than itself plus seven. This already always true! So another always-true variable is unnecessary.

So what's the real world use of this?

Often when splitting data into periods of time you want all the periods to be consecutive. For example, report number of sales each hour. But there are many cases where you want to split the activity into groups like this, such as:

  • Finding how long it takes to treat patients. To do this you need the duration between a patient first presenting a condition and their final treatment.
  • Splitting user application activity into sessions. Each session may last up to one hour. A new session starts at the exact time they use the application after the last session.
  • Check the duration between staff clock in and clock out times. This enables to you find people regularly working short or long shifts.

In each of these cases it's critical that you calculate the duration of each from the time of its first activity. Otherwise you may end up over- or under-calculating some batches.

How to Solve Bin Packing Problems with SQL

Photo by Doruk Yemenici on Unsplash

I'm training for a 10k. I want to estimate how long it'll take me to complete the race.

As the vast majority of training runs will be (notably!) less than 10k, I want to group together runs into batches with a total distance of 10k. I can then find my average pace across each batch. Enabling me to estimate how long it'll take me to complete the race.

To do this, I need to:

  • Group together all rows
  • Where the total distance of these rows is less than or equal to 10k

With pattern matching, all you need is:

  pattern ( ten_k+ )   
    ten_k as sum ( distance_in_km ) <= 10

This works because the sum in the pattern variable gives the cumulative total. The quantifier plus matches one or more rows. So this keeps adding rows until the total distance exceeds 10k. At this point the database starts a new group. The row taking the total past 10k is the first in the new group. And the cumulative sum is reset to the distance for this row.

So what's the real world use of this?

Real-world examples of this are:

These are forms of bin-packing problem. This, and related issues such as the knapsack problem, are "hard" computer science problems. Which means there are no general case solutions which are guaranteed to be both fast and optimal.

While pattern matching doesn't guarantee either of these, it is easy write. If you need to process your data in a FIFO manner, it's often good enough. If not, it can give a rapid first pass for further refinement as needed.

How to Find Sequences of Events with SQL

Many training programs recommend each week you complete one long jog and several shorter sprints. To see if I'm following this pattern, I want to search for:

  • One 5k run
  • Followed by two or more 1k runs
  • Where the 1k runs take place within seven days of the 5k run

The pattern and variables for this are:

  pattern ( five_km one_km {2,} )   
    five_km as distance_in_km = 5, 
    one_km as distance_in_km = 1
       and run_date < first ( run_date ) + 7

So what's the real world use of this?

The general form of this problem is searching for a sequence of two or more different events. Each of which can happen many times. Real-world examples of this include:

Stock Market Technical Analysis

Technical share traders are hunting for trends in stock prices as a signal to buy or sell. For example, they may want to find all companies where the share price:

  • Fell three days in a row
  • Rose one day
  • Fell one day
  • Rose three days in a row

Also known as a W pattern. You can do this using lag and/or laed. But this is messy. Particularly if you want to adjust the number of days in each rise or fall.

With match_recognize, the pattern and variables are as easy as:

  pattern ( rise{3} fall rise fall{3} )
    rise as price > prev ( price ),
    fall as price < prev ( price )

Customer Retention Analysis

Knowing who your best customers are and keeping them happy is a solid way to stay in business. So it's important to know when loyal customers have stopped buying from you.

For example, you may want to find all the customers that:

  • Placed two or more orders every month for a period of six months or more
  • In the past six months have placed a maximum of two orders

This is hard to solve with classic SQL. But with row pattern matching, you can identify the rows with:

  pattern ( init two_or_more{6,} two_total* )
    two_or_more as orders_in_month >= 2
       and mth < add_months ( sysdate, -6 )
       and mth = add_months ( prev ( mth ), 1 ),
    two_total as mth >= add_months ( sysdate, -6 )

This returns all regular customers. To get only those with two or less recent orders, filter the total orders for two_total in a where clause after this (HT Iudith Mentzel).

Check Files Conform to a Given Structure

Many applications export files containing different record types. The records must appear in a given order, for example:

  1. One overall header
  2. One or more order headers
  3. Each of these order headers must have one or more order detail lines beneath them
  4. Two summary rows
  5. A final footer

You can confirm a file meets these criteria with this pattern:

  pattern ( header ( order_head order_line+ )+ summary{2} footer )

The key with all these is you're searching for events in a specific order. Some of these events can happen one or more times. And often need to occur within a given time period.

This type of problem is where SQL pattern matching shines. Without it, queries often needs to access the same table many times. Or uses analytic functions nested in subquery upon subquery upon subquery.

SQL using either of these methods is tricky to write, debug, and adapt to changes. But pattern matching makes easy!

So I urge you to go out, learn how match_recognize works, and start using its power to unlock the patterns in your data!

Want More?

UPDATED 24 Oct 2019: Fixed customer retention analysis expression; see comment from Iudith Mentzel.

Chris Saxon

Developer Advocate

Chris Saxon is an Oracle Developer Advocate for SQL. His job is to help you get the best out of the Oracle Database and have fun with SQL!

To help you with this he blogs at All Things SQL. He also creates videos combining SQL and magic on YouTube at the The Magic of SQL.

If you have questions about working with Oracle Database technology, please reach out to him. You can do this via Twitter or on Ask Tom.

Previous Post

Announcing a New Sample Schema: Customer Orders

Chris Saxon | 4 min read

Next Post

How to Rollback After Create Table Commits in Oracle Database

Chris Saxon | 3 min read