Thursday Feb 06, 2014

Sessionization analysis with 12c SQL pattern matching is super fast

Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:

Most of these have been related to explaining the basic concepts and along with some specific use cases.

In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenarios is likely to be the analysis of sessionization data as this is currently a hot topic when we start considering machine-data and in more general data terms, big data.

To help us create a meaningful test data set we used decided to use the TPC-H schema because it contained approximately seven years of data which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.

Analysis of Sessionization Data

To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:

Session_Id User
1 Mary
2 Sam
3 Richard
11 Mary
12 Sam
13 Richard
22 Sam
23 Mary
23 Richard
32 Sam
33 Richard
34 Mary
43 Richard
43 Sam
44 Mary
47 Sam
48 Sam
53 Mary
54 Richard
59 Sam
60 Sam
63 Mary
63 Richard
68 Sam


The following sections compare the use of 11g window functions vs. 12c MATCH_RECOGNIZE clause.
Part 2- To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:

NewImage

The 12c SQL to create the initial result set is as follows:

SELECT user_id, session_id start_time, no_of_events, duration
FROM Events MATCH_RECOGNIZE
 (PARTITION BY User_ID ORDER BY Time_Stamp
  MEASURES match_number() session_id,
           count(*) as no_of_events,
           first(time_stamp) start_time,
           last(time_stamp) - first(time_stamp) duration
  ONE ROW PER MATCH 
  PATTERN (b s*)
  DEFINE
       s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10)
 )
;

as a comparison here is how to achieve the above using 11g analytical window functions

CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
 Sum(Session_Increment) over (partition by User_ID order by Time_Stampasc) Session_ID
FROM ( SELECT Time_Stamp, User_ID,
 CASE WHEN (Time_Stamp - Lag(Time_Stamp) over (partition by User_ID order by Time_Stampasc)) < 10
 THEN 0 ELSE 1 END Session_Increment
 FROM Events);
SELECT User_ID, Min(Time_Stamp) Start_Time,
 Count(*) No_Of_Events, Max(Time_Stamp) -Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;


As you can see the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.


Part 3 - However, to get business value from this derived data set we need to do some additional processing.  Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:

  • Within-partition Session_ID
  • Number of events in a session
  • Total duration 

To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2)  we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:

select count(*)
from ( select /* mr_sessionize_prev */ *
 from
 ( select o_pbykey,
session_id,
 start_time,
 no_of_events,
 duration
 from orders_v MATCH_RECOGNIZE
 (
  PARTITION BY o_pbykey
  ORDER BY O_custkey, O_Orderdate
  MEASURES match_number() session_id, count(*) as no_of_events,
           first(o_orderdate) start_time,
           last(o_orderdate) - first(o_orderdate) duration
  PATTERN (b s*)
  DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
 )
 )
 where No_Of_Events >= 20
); 

Here is code based on using the NEXT() function to compare the current record against the next record:

select count(*)
from ( select /* mr_sessionize_prev */ *
 from
 ( select o_pbykey, session_id, start_time, no_of_events, duration
 from orders_v MATCH_RECOGNIZE
 (
PARTITION BY o_pbykey
ORDER BY O_custkey, O_Orderdate
MEASURES match_number() session_id, count(*) as no_of_events,
         first(o_orderdate) start_time,
         last(o_orderdate) - first(o_orderdate) duration
PATTERN (s* e)
DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
 )
 )
 where No_Of_Events >= 20

Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses  window functions (which in my opinion is a lot more complex):

select count(*)
from (
 select /* wf */ *
 from (select O_pbykey,
              Session_ID,
              min(O_Orderdate) Start_Time,
              count(*) No_Of_Events,
             (max(O_Orderdate) - Min(O_Orderdate)) Duration
        from (select O_Orderdate,
O_Custkey, o_pbykey,
                     sum(Session_Increment)
                     over(partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
               from ( select O_Custkey,
                             O_Orderdate,
                             O_pbykey,
                             case when (O_Orderdate –
                             Lag(O_Orderdate)
                                 over(partition by o_pbykey
                                      order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                             then 0 else 1 end Session_Increment
                      from orders_v
                    )
            )
       group by o_pbykey, Session_ID
    )
 where No_Of_Events >= 20


The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
border-style: initial; border-width: 0px; display: block; margin-left: auto; margin-right: auto;" title="NewImage.png" src="http://lh6.ggpht.com/-HDaEnake_3o/UrL_AViG4iI/AAAAAAAAA20/WyCX1sPjRVk/NewImage.png?imgmax=800" alt="NewImage" width="598" height="275" border="0">

Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT() )are shown below. Please note that on the graph the X-axis shows the number of partitions within each test-run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster compared to using window functions, which is good news if you are looking for a reason to upgrade to Database 12c.

MR Performance Results

Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause offers excellent scalability and continues to deliver excellent performance. So it performs well and scales well as your data volumes increase.

However, it is important to remember that the 11g window function code shows similar attributes of excellent scalability and excellent performance. If you are using 11g at the moment and you have not considered using Oracle Database to run your sessionization analysis then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. If you need a reason to upgrade to Database 12c then MATCH_RECOGNIZE does offer significant performance benefits if you are doing pattern matching operations either inside the Oracle Database 11g or using an external processing engine.

Lastly, when you are designing your own MATCH_RECONGIZE implementations and you are using the NEXT() or PREV() functions it is worth investigating if using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.

In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (keith.laker@oracle.com).
Have fun with MATCH_RECOGNIZE….

About

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

Search

Archives
« February 2014 »
SunMonTueWedThuFriSat
      
1
2
3
4
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
 
       
Today