By KLaker on Feb 06, 2014
Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:
- Oracle Database 12c Pattern Matching Podcasts - Part 1
- Oracle Database 12c Pattern Matching Podcasts - Part 2
- Oracle Database 12c Pattern Matching Podcasts - Part 3
- Apple iBook - SQL Pattern Matching on your iPad
- SQL analytical mash-ups deliver real-time WOW! for big data
Most of these have been related to explaining the basic concepts and along with some specific use cases.
In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenarios is likely to be the analysis of sessionization data as this is currently a hot topic when we start considering machine-data and in more general data terms, big data.
To help us create a meaningful test data set we used decided to use the TPC-H schema because it contained approximately seven years of data which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.
Analysis of Sessionization Data
To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:
The following sections compare the use of 11g window functions vs. 12c MATCH_RECOGNIZE clause.
Part 2- To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:
The 12c SQL to create the initial result set is as follows:
SELECT user_id, session_id start_time, no_of_events, duration FROM Events MATCH_RECOGNIZE (PARTITION BY User_ID ORDER BY Time_Stamp MEASURES match_number() session_id, count(*) as no_of_events, first(time_stamp) start_time, last(time_stamp) - first(time_stamp) duration ONE ROW PER MATCH PATTERN (b s*) DEFINE s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10) ) ;
as a comparison here is how to achieve the above using 11g analytical window functions
CREATE VIEW Sessionized_Events as SELECT Time_Stamp, User_ID, Sum(Session_Increment) over (partition by User_ID order by Time_Stampasc) Session_ID FROM ( SELECT Time_Stamp, User_ID, CASE WHEN (Time_Stamp - Lag(Time_Stamp) over (partition by User_ID order by Time_Stampasc)) < 10 THEN 0 ELSE 1 END Session_Increment FROM Events);SELECT User_ID, Min(Time_Stamp) Start_Time, Count(*) No_Of_Events, Max(Time_Stamp) -Min(Time_Stamp) Duration FROM Sessionized_Events GROUP BY User_ID, Session_ID;
As you can see the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.
Part 3 - However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:
- Within-partition Session_ID
- Number of events in a session
- Total duration
To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2) we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:
select count(*) from ( select /* mr_sessionize_prev */ * from ( select o_pbykey, session_id, start_time, no_of_events, duration from orders_v MATCH_RECOGNIZE ( PARTITION BY o_pbykey ORDER BY O_custkey, O_Orderdate MEASURES match_number() session_id, count(*) as no_of_events, first(o_orderdate) start_time, last(o_orderdate) - first(o_orderdate) duration PATTERN (b s*) DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100) ) ) where No_Of_Events >= 20 );
Here is code based on using the NEXT() function to compare the current record against the next record:
select count(*) from ( select /* mr_sessionize_prev */ * from ( select o_pbykey, session_id, start_time, no_of_events, duration from orders_v MATCH_RECOGNIZE ( PARTITION BY o_pbykey ORDER BY O_custkey, O_Orderdate MEASURES match_number() session_id, count(*) as no_of_events, first(o_orderdate) start_time, last(o_orderdate) - first(o_orderdate) duration PATTERN (s* e) DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100) ) ) where No_Of_Events >= 20
Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses window functions (which in my opinion is a lot more complex):
select count(*) from ( select /* wf */ * from (select O_pbykey, Session_ID, min(O_Orderdate) Start_Time, count(*) No_Of_Events, (max(O_Orderdate) - Min(O_Orderdate)) Duration from (select O_Orderdate, O_Custkey, o_pbykey, sum(Session_Increment) over(partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID from ( select O_Custkey, O_Orderdate, O_pbykey, case when (O_Orderdate – Lag(O_Orderdate) over(partition by o_pbykey order by O_custkey, O_Orderdate)) <= 100 -- Threshold then 0 else 1 end Session_Increment from orders_v ) ) group by o_pbykey, Session_ID ) where No_Of_Events >= 20
The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
border-style: initial; border-width: 0px; display: block; margin-left: auto; margin-right: auto;" title="NewImage.png" src="http://lh6.ggpht.com/-HDaEnake_3o/UrL_AViG4iI/AAAAAAAAA20/WyCX1sPjRVk/NewImage.png?imgmax=800" alt="NewImage" width="598" height="275" border="0">
Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT() )are shown below. Please note that on the graph the X-axis shows the number of partitions within each test-run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster compared to using window functions, which is good news if you are looking for a reason to upgrade to Database 12c.
Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause offers excellent scalability and continues to deliver excellent performance. So it performs well and scales well as your data volumes increase.
However, it is important to remember that the 11g window function code shows similar attributes of excellent scalability and excellent performance. If you are using 11g at the moment and you have not considered using Oracle Database to run your sessionization analysis then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. If you need a reason to upgrade to Database 12c then MATCH_RECOGNIZE does offer significant performance benefits if you are doing pattern matching operations either inside the Oracle Database 11g or using an external processing engine.
Lastly, when you are designing your own MATCH_RECONGIZE implementations and you are using the NEXT() or PREV() functions it is worth investigating if using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.
In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (firstname.lastname@example.org).
Have fun with MATCH_RECOGNIZE….