Tuesday Apr 01, 2014

Limited Edition Exadata X4-2C - Brighten Up Your Data Center

Oracle has always been at the forefront of efforts to revolutionise your data center. To date, for obvious reasons, the focus has been on optimizing energy and space efficiency. As of today we are moving into an exciting new phase in terms of the look and feel of your data center. Oracle recently added a new fashion design team to its engineered system group to help us re-imagine the next generation data center and the first exciting fruits of this new partnership of both technology and fashion are now available for our customers to order…..

For a short period only, Oracle is offering its data warehouse customers the chance to buy a limited edition EXADATA X4-2C. This new Exadata configuration is going to brighten up your data center with its exciting range of color coordinated racks! Now you can enjoy running those really sophisticated business queries in glorious technicolor. Most importantly, the great news is that we are not charging you anything extra for this fabulous new technicolor data warehouse experience:

X4 2C

HARDWARE, SOFTWARE AND COLORENGINEERED TO WORK TOGETHER

Each color-coded rack comes with its own color-linked version of Enterprise Manager to add more colour, brightness and joy to all those day-to-day tasks as you can see below on these specially designed monitoring screens: 

EMC

 Your Exadata DBA is really going to thank you!

So what happens if you buy a 1/2 rack then slowly add more Exadata nodes? Great question - well, while stocks last you can actually create your own multi-colored Exadata rack. As always we are ahead of the game because we know what our customers want. SO WHY NOT HAVE A  TECHNICOLOR DATA WAREHOUSE in your data center! Go on, you know it makes sense….

X4 2C 2

BUT YOU GOTTA HURRY - This new Exadata X4-2C range is a limited edition, special order only model. Stocks are limited. To brighten up your data center make sure you contact your Oracle Sales Representative right now because you do not want to miss out on this exciting opportunity to put one of these gorgeous, colour-coded dudes in your data center. And don't forget, only Oracle gives you  HARDWARE, SOFTWARE AND COLORENGINEERED TO WORK TOGETHER

Oracle 1-800-633-0738


Thursday Feb 06, 2014

Sessionization analysis with 12c SQL pattern matching is super fast

Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:

Most of these have been related to explaining the basic concepts and along with some specific use cases.

In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenarios is likely to be the analysis of sessionization data as this is currently a hot topic when we start considering machine-data and in more general data terms, big data.

To help us create a meaningful test data set we used decided to use the TPC-H schema because it contained approximately seven years of data which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.

Analysis of Sessionization Data

To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:

Session_Id User
1 Mary
2 Sam
3 Richard
11 Mary
12 Sam
13 Richard
22 Sam
23 Mary
23 Richard
32 Sam
33 Richard
34 Mary
43 Richard
43 Sam
44 Mary
47 Sam
48 Sam
53 Mary
54 Richard
59 Sam
60 Sam
63 Mary
63 Richard
68 Sam


The following sections compare the use of 11g window functions vs. 12c MATCH_RECOGNIZE clause.
Part 2- To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:

NewImage

The 12c SQL to create the initial result set is as follows:

SELECT user_id, session_id start_time, no_of_events, duration
FROM Events MATCH_RECOGNIZE
 (PARTITION BY User_ID ORDER BY Time_Stamp
  MEASURES match_number() session_id,
           count(*) as no_of_events,
           first(time_stamp) start_time,
           last(time_stamp) - first(time_stamp) duration
  ONE ROW PER MATCH 
  PATTERN (b s*)
  DEFINE
       s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10)
 )
;

as a comparison here is how to achieve the above using 11g analytical window functions

CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
 Sum(Session_Increment) over (partition by User_ID order by Time_Stampasc) Session_ID
FROM ( SELECT Time_Stamp, User_ID,
 CASE WHEN (Time_Stamp - Lag(Time_Stamp) over (partition by User_ID order by Time_Stampasc)) < 10
 THEN 0 ELSE 1 END Session_Increment
 FROM Events);
SELECT User_ID, Min(Time_Stamp) Start_Time,
 Count(*) No_Of_Events, Max(Time_Stamp) -Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;


As you can see the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.


Part 3 - However, to get business value from this derived data set we need to do some additional processing.  Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:

  • Within-partition Session_ID
  • Number of events in a session
  • Total duration 

To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2)  we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:

select count(*)
from ( select /* mr_sessionize_prev */ *
 from
 ( select o_pbykey,
session_id,
 start_time,
 no_of_events,
 duration
 from orders_v MATCH_RECOGNIZE
 (
  PARTITION BY o_pbykey
  ORDER BY O_custkey, O_Orderdate
  MEASURES match_number() session_id, count(*) as no_of_events,
           first(o_orderdate) start_time,
           last(o_orderdate) - first(o_orderdate) duration
  PATTERN (b s*)
  DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
 )
 )
 where No_Of_Events >= 20
); 

Here is code based on using the NEXT() function to compare the current record against the next record:

select count(*)
from ( select /* mr_sessionize_prev */ *
 from
 ( select o_pbykey, session_id, start_time, no_of_events, duration
 from orders_v MATCH_RECOGNIZE
 (
PARTITION BY o_pbykey
ORDER BY O_custkey, O_Orderdate
MEASURES match_number() session_id, count(*) as no_of_events,
         first(o_orderdate) start_time,
         last(o_orderdate) - first(o_orderdate) duration
PATTERN (s* e)
DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
 )
 )
 where No_Of_Events >= 20

Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses  window functions (which in my opinion is a lot more complex):

select count(*)
from (
 select /* wf */ *
 from (select O_pbykey,
              Session_ID,
              min(O_Orderdate) Start_Time,
              count(*) No_Of_Events,
             (max(O_Orderdate) - Min(O_Orderdate)) Duration
        from (select O_Orderdate,
O_Custkey, o_pbykey,
                     sum(Session_Increment)
                     over(partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
               from ( select O_Custkey,
                             O_Orderdate,
                             O_pbykey,
                             case when (O_Orderdate –
                             Lag(O_Orderdate)
                                 over(partition by o_pbykey
                                      order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                             then 0 else 1 end Session_Increment
                      from orders_v
                    )
            )
       group by o_pbykey, Session_ID
    )
 where No_Of_Events >= 20


The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
border-style: initial; border-width: 0px; display: block; margin-left: auto; margin-right: auto;" title="NewImage.png" src="http://lh6.ggpht.com/-HDaEnake_3o/UrL_AViG4iI/AAAAAAAAA20/WyCX1sPjRVk/NewImage.png?imgmax=800" alt="NewImage" width="598" height="275" border="0">

Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT() )are shown below. Please note that on the graph the X-axis shows the number of partitions within each test-run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster compared to using window functions, which is good news if you are looking for a reason to upgrade to Database 12c.

MR Performance Results

Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause offers excellent scalability and continues to deliver excellent performance. So it performs well and scales well as your data volumes increase.

However, it is important to remember that the 11g window function code shows similar attributes of excellent scalability and excellent performance. If you are using 11g at the moment and you have not considered using Oracle Database to run your sessionization analysis then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. If you need a reason to upgrade to Database 12c then MATCH_RECOGNIZE does offer significant performance benefits if you are doing pattern matching operations either inside the Oracle Database 11g or using an external processing engine.

Lastly, when you are designing your own MATCH_RECONGIZE implementations and you are using the NEXT() or PREV() functions it is worth investigating if using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.

In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (keith.laker@oracle.com).
Have fun with MATCH_RECOGNIZE….

Tuesday Apr 26, 2011

Big Data: Vertical Behavioral Analytics

[Read More]

Tuesday Mar 23, 2010

Big Data’s Killer App…

[Read More]

Sunday Feb 21, 2010

OpenWorld 2010: call for papers begins on March 2...

[Read More]

Wednesday Feb 17, 2010

Delivering software to support the cloud

[Read More]

Thursday Feb 11, 2010

Oracle Exadata: A Single Source of Truth (New Video)

[Read More]

Wednesday Feb 10, 2010

Oracle in Leaders Quadrant for Gartner's Data Integration MQ

[Read More]

Sunday Feb 07, 2010

New LinkedIn group for Oracle Data Warehouse Professionals

[Read More]

Monday Jan 18, 2010

Delivering hardware to support a data warehouse cloud strategy

[Read More]

Tuesday Jan 05, 2010

New whitepaper on Exadata Hybrid Columnar Compression

[Read More]

Monday Dec 28, 2009

EDW and the Cloud

[Read More]

Monday Nov 09, 2009

Next Generation Data Warehouse Platforms

[Read More]

Monday Aug 03, 2009

A not so fabulous new release

[Read More]

Tuesday Jun 30, 2009

Kindle - A great device for storing documentation?

[Read More]
About

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
2
4
5
6
7
8
9
10
11
12
13
14
16
18
19
20
21
23
24
25
26
27
28
29
30
   
       
Today