Tuesday Jan 27, 2015

MATCH_RECOGNIZE and the Optimizer

If you have already been working with the new 12c pattern matching feature then you have probably spotted some new keywords appearing in your explain plans. Essentially there are four new keywords that you need to be aware of:
  • MATCH RECOGNIZE
  • SORT
  • BUFFER
  • DETERMINISTIC FINITE AUTO
The first three bullet points are reasonably obvious (at least I hope they are!) but just in case… the keywords MATCH RECOGNIZE refer to the row source for evaluating the MATCH_RECOGNIZE clause. The SORT keyword means the row source sorts the data before running it through the state machine to find the matches.
The last keyword is the most interesting and is linked to the use of a “state machine”, as mentioned in the previous sentence. Its presence or absence affects the performance of your pattern matching query. The importance of this keyword is based on the way that pattern matching is performed: to search for a pattern containing a specific set of events we build something called a “state machine”. At this point I will turn to Wikipedia for a definition of a state machine:
…a mathematical model of computation used to design both computer programs and sequential logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time; the state it is in at any given time is called the current state. It can change from one state to another when initiated by a triggering event or condition; this is called a transition…
The classic example of a state machine is a traffic light, which moves through a given sequence of states in a set order and always in that order: Red -> Red & Yellow -> Green -> Yellow. The traffic light can also be viewed as a “deterministic” state machine. This means that for every state there is exactly one transition for a given input, i.e. it is not possible to have two different transitions leading out of a particular state. With our traffic light model it is clear that from the red state there is only one possible transition, which is to red & yellow.
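The traffic light cycle can be sketched as a tiny deterministic state machine. This is purely my own illustration of the concept (the state names and the transition table below are invented for the example, not anything the database generates):

```python
# Deterministic state machine for the traffic light cycle: each state
# has exactly one outgoing transition, so advancing is a single lookup
# with no choices to explore and no backtracking.
TRANSITIONS = {
    "red": "red+yellow",
    "red+yellow": "green",
    "green": "yellow",
    "yellow": "red",
}

def next_state(state):
    # Determinism in action: one lookup, one possible successor.
    return TRANSITIONS[state]

state = "red"
cycle = [state]
for _ in range(4):
    state = next_state(state)
    cycle.append(state)
# After one full cycle we are back at red.
```

A non-deterministic machine would instead map some state to a *set* of possible successors, which is exactly what forces a pattern matching engine to try alternatives and backtrack.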
Let’s use our normal stock ticker sample schema, which tracks stock market prices for three ticker symbols, and look at two very similar pattern matching queries:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO LAST UP
 PATTERN (STRT DOWN* UP*)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
   AFTER MATCH SKIP TO LAST UP
 PATTERN (STRT DOWN UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;


Match Recognize keywords in Optimizer explain plan


Match Recognize keywords in Optimizer explain plan

Note that the key difference between the two SQL statements is the PATTERN clause. The first statement checks for zero or more instances of two different events: 1) where the price in the current row is less than the price in the previous row and 2) where the price in the current row is more than the price in the previous row. The second statement checks for only one instance of each down-up event. This difference in the definition of the pattern results in different explain plans, where the plan for the second statement includes the key phrase “DETERMINISTIC FINITE AUTO”.

The phrase “DETERMINISTIC FINITE AUTO” means that the state machine we constructed is deterministic, and thus when running the sorted rows through the state machine we don’t do backtracking (I will write a separate blog post on this topic very soon as it is a key concept in pattern matching; for the moment I will simply point you to the Wikipedia page on backtracking, where personally I found the section headed “Description of the method” the most useful). The key benefit of a “DETERMINISTIC FINITE AUTO” plan is that execution is more efficient because there is no backtracking.

When we analyze the PATTERN clause and build the corresponding state machine we are able to detect a deterministic finite automaton by checking the state machine: if any state has two or more outgoing transitions then we regard the state machine as non-deterministic, and if any final state is followed by a non-final state then the state machine is also regarded as non-deterministic. At the moment we can only detect a few trivial cases such as PATTERN (A B C), PATTERN (A B+), PATTERN (A B*), etc.
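The trivial cases reduce to a simple rule: an alternation, or a quantifier on any variable other than the last one, creates a state with two or more outgoing transitions. The real check is internal to the database; the sketch below is only my own re-statement of that rule, using the pattern strings from this post:

```python
# Toy version of the trivial determinism check described above.
# A pattern is "detectably deterministic" only if it has no
# alternation and any +/* quantifier sits on the final variable
# (a repeated variable followed by another variable gives the state
# after the repetition two outgoing transitions).
def is_detectably_deterministic(pattern):
    tokens = pattern.split()
    if "|" in tokens:  # alternation: two transitions out of one state
        return False
    for i, tok in enumerate(tokens):
        if tok[-1] in "+*" and i != len(tokens) - 1:
            return False
    return True

results = {p: is_detectably_deterministic(p) for p in (
    "STRT DOWN UP",      # like PATTERN (A B C)  -> deterministic
    "STRT DOWN UP+",     # like PATTERN (A B+)   -> deterministic
    "STRT DOWN UP*",     # like PATTERN (A B*)   -> deterministic
    "STRT | DOWN | UP",  # alternation           -> non-deterministic
    "STRT DOWN* UP",     # * before last token   -> non-deterministic
    "STRT DOWN+ UP",     # like PATTERN (A B+ C) -> non-deterministic
)}
```

Each pattern above corresponds to one of the queries in this post, so you can compare the result against whether its plan shows DETERMINISTIC FINITE AUTO.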
The first example of these patterns that we can detect is shown above (see the second statement, where we have the STRT DOWN UP pattern) and the other two examples of these types of deterministic patterns are shown below:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
 PATTERN (STRT DOWN UP+)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
 PATTERN (STRT DOWN UP*)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

Match Recognize keywords in Optimizer explain plan



Match Recognize keywords in Optimizer explain plan

For PATTERN (A | B) or PATTERN (A B+ C) we simply regard the state machine as non-deterministic; therefore, the explain plans contain only the keywords MATCH RECOGNIZE (SORT), as shown below:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
 PATTERN (STRT | DOWN | UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
   PARTITION BY symbol ORDER BY tstamp
   MEASURES STRT.tstamp AS start_tstamp,
            LAST(DOWN.tstamp) AS bottom_tstamp,
            LAST(UP.tstamp) AS end_tstamp
   ONE ROW PER MATCH
 PATTERN (STRT DOWN* UP)
   DEFINE
      DOWN AS DOWN.price < PREV(DOWN.price),
      UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.start_tstamp;

Match Recognize keywords in Optimizer explain plan

Match Recognize keywords in Optimizer explain plan
Within the current version of 12c (12.1.2) we do not check the mutual exclusiveness of the DEFINE predicates when detecting a deterministic state machine; therefore, the execution plan defaults to a MATCH RECOGNIZE (SORT) style plan, where we may or may not have to use backtracking. Obviously, as we continue to develop the MATCH_RECOGNIZE feature we will expand our ability to detect a deterministic state machine, which means we will process your pattern more efficiently.
In summary, if you want the most efficient execution plan then try to define your pattern in such a way that we are able to create a deterministic state machine. This assumes, of course, that backtracking is not needed within each partition/data set in order to identify the required pattern (more on this in my next blog post).
Hope this information is useful. If you have any questions then feel free to contact me directly (keith.laker@oracle.com).

Wednesday Mar 04, 2009

Managing Optimizer statistics in an Oracle Database 11g

Knowing when and how to gather optimizer statistics has become somewhat of a dark art, especially in a data warehouse environment where statistics maintenance can be hindered by the fact that as the data set increases the time it takes to gather statistics will also increase. By default the DBMS_STATS package will gather global (table level), partition level, and sub-partition level statistics for each of the tables in the database. The only exception to this is if you have hash sub-partitions. Hash sub-partitions do not need statistics, as the optimizer can accurately derive any necessary statistics from the partition level statistics, because the hash sub-partitions are all approximately the same size due to the linear hashing algorithm.
As mentioned above, the length of time it takes to gather statistics will grow proportionally with your data set, so you may now be wondering whether the optimizer truly needs statistics at every level for a partitioned table, or whether time could be saved by skipping one or more levels. The short answer is "no", as the optimizer will use statistics from one or more of the levels in different situations:

The optimizer will use global or table level statistics if one or more of your queries touches two or more partitions.

The optimizer will use partition level statistics if your queries do partition elimination, such that only one partition is necessary to answer each query. If your queries touch two or more partitions the optimizer will use a combination of global and partition level statistics.

The optimizer will use sub-partition level statistics if your queries do partition elimination, such that only one sub-partition is necessary. If your queries touch two or more sub-partitions the optimizer will use a combination of sub-partition and partition level statistics.

How to gather statistics?
Global statistics are by far the most important statistics but they also take the longest time to collect because a full table scan is required. However, in Oracle Database 11g this issue has been addressed with the introduction of incremental global statistics. Typically with partitioned tables, new partitions are added and data is loaded into these new partitions. After a partition is fully loaded, partition level statistics need to be gathered and the global statistics need to be updated to reflect the new data. If the INCREMENTAL preference for the partitioned table is set to TRUE, and the DBMS_STATS GRANULARITY parameter is set to AUTO, Oracle will gather statistics on the new partition and update the global table statistics by scanning only those partitions that have been modified and not the entire table. Below are the steps necessary to use incremental global statistics:

SQL> exec dbms_stats.set_table_prefs('SH', 'SALES', 'INCREMENTAL', 'TRUE');

SQL> exec dbms_stats.gather_table_stats( ownname=>'SH', tabname=>'SALES', partname=>'23_MAY_2008', granularity=>'AUTO');

Incremental global statistics work by storing a synopsis for each partition in the table. A synopsis is statistical metadata for that partition and the columns in the partition. Each synopsis is stored in the SYSAUX tablespace and takes approximately 10KB. Global statistics are generated by aggregating the synopses from each partition, thus eliminating the need for the full table scan (see the figure below). When a new partition is added to the table you only need to gather statistics for the new partition; the global statistics will be automatically updated by aggregating the new partition's synopsis with the existing partitions' synopses.

[Figure: incremental global statistics gathering]
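As a rough analogy for how aggregating synopses replaces a full table scan, consider deriving a column's global number of distinct values (NDV). This is only a sketch: the real synopsis is an internal hash-based structure, and the partition names and values below are invented for the example:

```python
# Each partition keeps a "synopsis" of the distinct values seen in a
# column; the global NDV is derived by merging the synopses rather
# than rescanning every partition.
partition_synopses = {
    "sales_q3_2000": {"A", "B", "C"},
    "sales_q4_2000": {"B", "C", "D"},
}

def global_ndv(synopses):
    merged = set()
    for synopsis in synopses.values():
        merged |= synopsis  # union of per-partition distinct values
    return len(merged)

# Loading a new partition only requires gathering its own synopsis;
# the merge picks it up without touching the older partitions' data.
partition_synopses["sales_q1_2001"] = {"D", "E"}
```

Note that simply summing the per-partition NDVs (3 + 3 + 2) would over-count the values shared between partitions, which is exactly why a synopsis, and not just the partition level NDV, is needed to maintain accurate global statistics.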

But what if you are not using Oracle Database 11g and you can't afford to gather partition level statistics (not to mention global statistics) after data is loaded? In Oracle Database 10g (10.2.0.4) you can use the DBMS_STATS.COPY_TABLE_STATS procedure. This procedure enables you to copy statistics from an existing [sub]partition to a new [sub]partition and will adjust the statistics to account for the additional partition of data (for example the number of blocks and number of rows). For a range partitioned table it sets the new partition's high bound partitioning value as the maximum value of the first partitioning column, and the high bound partitioning value of the previous partition as the minimum value of the first partitioning column. For a list-partitioned table it will find the maximum and minimum from the list of values.

SQL> exec dbms_stats.copy_table_stats('sh', 'sales', 'sales_q3_2000', 'sales_q4_2000', force=>TRUE);

When should you gather Statistics?
If you use the automatic stats job or dbms_stats.gather_schema_stats with the option "GATHER AUTO", Oracle only collects statistics at the global level if the table has changed more than 10% or if the global statistics have not yet been collected. Partition level statistics will always be gathered if they are missing. For most tables this frequency is fine.
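The staleness rule amounts to a simple ratio check. The sketch below is my own simplified model of it (the default 10% threshold is from the text above; the function name and row counts are invented for illustration):

```python
# Simplified model of the GATHER AUTO staleness check: a table's
# statistics are regathered once the proportion of modified rows
# crosses the threshold (10% by default), or if stats are missing.
def is_stale(num_rows, rows_modified, threshold=0.10):
    if num_rows == 0:
        return True  # no usable statistics yet: always gather
    return rows_modified / num_rows > threshold

# 150 changed rows out of 1,000 (15%) triggers regathering;
# 50 out of 1,000 (5%) does not.
```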
However, in a data warehouse environment there is one scenario where this is not the case. If a partitioned table is constantly having new partitions added, data is loaded into each new partition, and users instantly begin querying the new data, then it is possible to get into a situation where an end-user's query supplies a value in one of the WHERE clause predicates that is outside the [min,max] range for the column according to the optimizer statistics. For predicate values outside the statistics' [min,max] range the optimizer prorates the selectivity for that predicate based on the distance between the value and the max (assuming the value is higher than the max). This means the farther the value is from the maximum value, the lower the selectivity will be, which may result in sub-optimal execution plans.
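The prorating behaviour can be sketched like this. This is my own simplified linear-decay model, not the optimizer's exact formula, and the column bounds and NDV used are invented for the example:

```python
# Simplified model of out-of-range selectivity prorating: inside
# [min, max] assume a uniform estimate of 1/NDV; beyond the range,
# scale the estimate down linearly with the distance from the bound,
# reaching zero one range-width past it.
def prorated_selectivity(value, col_min, col_max, ndv):
    base = 1.0 / ndv                      # uniform in-range estimate
    width = col_max - col_min
    if col_min <= value <= col_max:
        return base
    if value > col_max:
        distance = value - col_max
    else:
        distance = col_min - value
    return max(0.0, base * (1 - distance / width))

# Stats say the column ends at 100, but a freshly loaded partition
# holds values around 110: the estimate shrinks as we move past max,
# even though the new rows are actually there to be found.
```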
You can avoid this "Out of Range" situation by using the new incremental Global Statistics or the copy table statistics procedure.

More information on Incremental Global Statistics or the copy table statistics procedure can be found on the Optimizer developers blog.

About

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
