Maintaining statistics on large partitioned tables

We have gotten a lot of questions recently regarding how to gather and maintain optimizer statistics on large partitioned tables. The majority of these questions can be summarized into two topics:
  1. When queries access a single partition with stale or non-existent partition level statistics I get a sub optimal plan due to "Out of Range" values
  2. Global statistics collection is extremely expensive in terms of time and system resources

This article will describe both of these issues and explain how you can address them both in Oracle Database 10gR2 and 11gR1.

Out of Range Large tables are often decomposed into smaller pieces called partitions in order to improve query performance and ease of data management. The Oracle query optimizer relies on both the statistics of the entire table (global statistics) and the statistics of the individual partitions (partition statistics) to select a good execution plan for a SQL statement. If the query needs to access only a single partition, the optimizer uses only the statistics of the accessed partition. If the query access more than one partition, it uses a combination of global and partition statistics.

"Out of Range" means that the value supplied in a where clause predicate is outside the domain of values represented by the [minimum, maximum] column statistics. The optimizer prorates the selectivity based on the distance between the predicate value and the maximum value (assuming the value is higher than the max), that is, the farther the value is from the maximum value, the lower the selectivity will be. This situation occurs most frequently in tables that are range partitioned by a date column, a new partition is added, and then queried while rows are still being loaded in the new partition. The partition statistics will be stale very quickly due to the continuous trickle feed load even if the statistics get refreshed periodically. The maximum value known to the optimizer is not correct leading to the "Out of Range" condition. The under-estimation of selectivity often leads the query optimizer to pick a sub optimal plan. For example, the query optimizer would pick an index access path while a full scan is a better choice.

The "Out of Range" condition can be prevented by using the new copy table statistics procedure available in Oracle Database10.2.0.4 and 11g. This procedure copies the statistics of the source [sub] partition to the destination [sub] partition. It also copies the statistics of the dependent objects: columns, local (partitioned) indexes etc. It adjusts the minimum and maximum values of the partitioning column as follows; it uses the high bound partitioning value as the maximum value of the first partitioning column (it is possible to have concatenated partition columns) and high bound partitioning value of the previous partition as the minimum value of the first partitioning column for range partitioned table. It can optionally scale some of the other statistics like the number of blocks, number of rows etc. of the destination partition.

Assume we have a table called SALES that is ranged partitioned by quarter on the SALES_DATE column. At the end of every day data is loaded into latest partition. However, statistics are only gathered at the end of every quarter when the partition is fully loaded. Assuming global and partition level statistics (for all fully loaded partitions) are up to date, use the following steps in order to prevent getting a sub-optimal plan due to "out of range".

1. Lock the table statistics using LOCK_TABLE_STATS procedure in DBMS_STATS. This is to avoid interference from auto statistics job.

EXEC DBMS_STATS.LOCK_TABLE_STATS('SH','SALES');

2. Before beginning the initial load into each new partition (say SALES_Q4_2000) copy the statistics from the previous partition (say SALES_Q3_2000) using COPY_TABLE_STATS. You need to specify FORCE=>TRUE to override the statistics lock. 

EXEC DBMS_STATS.COPY_TABLE_STATS ('SH', 'SALES', 'SALES_Q3_2000', 'SALES_Q4_2000', FORCE=>TRUE);

Expensive global statistics collection

In data warehouse environment it is very common to do a bulk load directly into one or more empty partitions. This will make the partition statistics stale and may also make the global statistics stale. Re-gathering statistics for the effected partitions and for the entire table can be very time consuming. Traditionally, statistics collection is done in a two-pass approach:

  • In the first pass we will scan the table to gather the global statistics
  • In the second pass we will scan the partitions that have been changed to gather their partition level statistics.
The full scan of the table for global statistics collection can be very expensive depending on the size of the table. Note that the scan of the entire table is done even if we change a small subset of partitions.

In Oracle Database 11g, we avoid scanning the whole table when computing global statistics by deriving the global statistics from the partition statistics. Some of the statistics can be derived easily and accurately from partition statistics. For example, number of rows at global level is the sum of number of rows of partitions. Even global histogram can be derived from partition histograms. But the number of distinct values (NDV) of a column cannot be derived from partition level NDVs. So, Oracle maintains another structure called a synopsis for each column at the partition level. A synopsis can be considered as sample of distinct values. The NDV can be accurately derived from synopses. We can also merge multiple synopses into one. The global NDV is derived from the synopsis generated by merging all of the partition level synopses. To summarize


  1. Gather statistics and create synopses for the changed partitions only
  2. Oracle automatically merges partition level synopses into a global synopsis
  3. The global statistics are automatically derived from the partition level statistics and global synopses

Incremental maintenance feature is disabled by default. It can be enabled by changing the INCREMENTAL table preference to true. It can also be enabled for a particular schema or at the database level. If you are interested in more details of the incremental maintenance feature, please refer to the following paper presented in SIGMOD 2008 and to our previous blog entry on new ndv gathering in 11g.

Assume we have table called SALES that is range partitioned by day on the SALES_DATE column. At the end of every day data is loaded into latest partition and partition statistics are gathered. Global statistics are only gathered at the end of every month because gathering them is very time and resource intensive. Use the following steps in order to maintain global statistics after every load.

1 -Turn on incremental feature for the table. 

EXEC DBMS_STATS.SET_TABLE_PREFS('SH','SALES','INCREMENTAL','TRUE');

2 -At the end of every load gather table statistics using GATHER_TABLE_STATS command. You don't need to specify the partition name. Also, do not specify the granularity parameter. The command will collect statistics for partitions where data has change or statistics are missing and update the global statistics based on the partition level statistics and synopsis.

EXEC DBMS_STATS.GATHER_TABLE_STATS('SH','SALES');
Note: that the incremental maintenance feature was introduced in Oracle Database 11g Release 1. However, we also provide a solution in Oracle Database10g Release 2 (10.2.0.4) that simulates the same behavior. The 10g solution is a new value, 'APPROX_GLOBAL AND PARTITION' for the GRANULARITY parameter of the GATHER_TABLE_STATS procedures. It behaves the same as the incremental maintenance feature except that we don't update the NDV for non-partitioning columns and number of distinct keys of the index at the global level. For partitioned column we update the NDV as the sum of NDV at the partition levels. Also we set the NDV of columns of unique indexes as the number of rows of the table. In general, non-partitioning column NDV at the global level becomes stale less often. It may be possible to collect global statistics less frequently then the default (when table changes 10%) since approx_global option maintains most of the global statistics accurately.

Let's take a look at an example to see how you would effectively use the Oracle Database 10g approach.
After the data load is complete, gather statistics using DBMS_STATS.GATHER_TABLE_STATS for the last partition (say SALES_11FEB2009), specify granularity => 'APPROX_GLOBAL AND PARTITION'. It will collect statistics for the specified partition and derive global statistics from partition statistics (except for NDV as described before).

EXEC DBMS_STATS.GATHER_TABLE_STATS ('SH', 'SALES', 'SALES_11FEB2009', GRANULARITY => 'APPROX_GLOBAL AND PARTITION');

It is necessary to install the one off patch for bug 8719831 if you are using the above features in 10.2.0.4 (patch 8877245) or in 11.1.0.7 (patch 8877251)
Comments:

I have a partitioned table, where each day data is loaded in a new partition. I implemented this feature,gathered stats using incremental option(ALL required parameters for this are correct) but when I check the LAST_ANALYZED in the USER_TAB_PARTITIONS for that particular table, all the partitions of that table have TODAY's date(effectively, the date when I run the gather stats).My question is if INCREMENTAL option doesn't even touch the unchanged partitions, why the LAST_ANALYZED data is changing for all of them.???

Posted by guest on April 30, 2012 at 05:38 AM PDT #

How command:
EXEC DBMS_STATS.GATHER_TABLE_STATS('SH','SALES');
works on Oracle Database10g Release 2 (10.2.0.4).

I assume it is more efficient to execute
EXEC DBMS_STATS.GATHER_TABLE_STATS ('SH', 'SALES', 'SALES_11FEB2009', GRANULARITY => 'APPROX_GLOBAL AND PARTITION');

Am I correct?

Posted by korepetycje on July 06, 2012 at 01:44 AM PDT #

Hi,

You are correct there is quite a difference between the two commands you specified. The first command,
EXEC DBMS_STATS.GATHER_TABLE_STATS('SH','SALES');
will automatically gather partition and global level statistics and will therefore take longer to complete.

However, your second command,
EXEC DBMS_STATS.GATHER_TABLE_STATS ('SH', 'SALES', 'SALES_11FEB2009', GRANULARITY => 'APPROX_GLOBAL AND PARTITION');
will collect statistics for only partition SALES_11FEB2009 and derive the global level statistics. Because the global statistics are derived rather than gathered this will execute much faster than the first command.

Thanks,
Maria

Posted by Maria Colgan on July 13, 2012 at 05:58 PM PDT #

What is the difference between Incremental Statistics Gathering and partition level statistics gathering. They both seems to be doing the same (gather partition stats & update global stats).

What are the benefits of using Incremental Statistics Gathering over partition level stats?

Appreciate your input on this.

Posted by guest on September 10, 2012 at 06:30 AM PDT #

As described in the blog post, if incremental statistics is enabled, the global statistics are derived from partition level statistics and synopses. If it is disabled, the global statistics are collected by scanning the entire table and can be very expensive.

You can collect statistics only for partitions by specifying
granularity => 'PARTITION' (not recommended though).
In this case, dbms_stats aggregate the statistics of partitions
to derive the global statistics without scanning the entire table.
However, the statistics aggregated will not be
accurate (especially NDV) as synopses are not available at
partition level. This type of aggregation happens if
- incremental option is NOT enabled.
- global stats are not collected at all (global_stats column
of dba_tab_statistics is NO for the table)
- all partitions have stats

Posted by guest on September 10, 2012 at 10:33 AM PDT #

Hi,

Thanks for the article. I have one quick question. If we use Incremental apporach (APPROX_GLOBAL AND PARTITION), don't we need to collect global stats at all?

Thanks
Ashok Ravi

Posted by guest on February 04, 2013 at 03:51 AM PST #

Hi Ashok,

APPROX_GLOBAL AND PARTITION is the 10.2 version of incremental statistics. It behaves the same as the incremental maintenance feature except that we don't update the NDV for non-partitioning columns and number of distinct keys of the index at the global level. For partitioned column we update the NDV as the sum of NDV at the partition levels. Also we set the NDV of columns of unique indexes as the number of rows of the table.

So, you don't need to gather global statistics they will be aggregated from the partition statistics. That said, you may want to collect global statistics once in a while when they get really out of sync with the actual ndv of non partitioned columns.

Thanks,
Maria

Posted by guest on February 04, 2013 at 02:38 PM PST #

Hi ,

I am working on statistics but a have come across a strange behaviour of gather_database_stats. A bug has already been reported in 11.2.0.3.0 version where gather_schema_stats doesn't work when a filter list is provided.

Please find the below code

DECLARE
l_obj_filter_list DBMS_STATS.objecttab := DBMS_STATS.objecttab ();
BEGIN
l_obj_filter_list.EXTEND (1);
l_obj_filter_list (1).ownname := 'OWNER';
l_obj_filter_list (1).objtype := 'TABLE';
l_obj_filter_list (1).objname :='PARTITION_TABLE';
l_obj_filter_list (1).partname :='PARTITION_P1';

dbms_stats.gather_database_stats (
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
block_sample => FALSE,
method_opt => 'FOR ALL COLUMNS SIZE AUTO',
DEGREE => DBMS_STATS.DEFAULT_DEGREE,
granularity => 'AUTO', --or ALL- or PARTITION nothing works when I pass a partition name .When partition_name is NULL then it works
CASCADE => DBMS_STATS.AUTO_CASCADE,
stattab => null,
statid =>NULL,
options => 'GATHER',
statown => NULL,
gather_sys => FALSE,
no_invalidate => DBMS_STATS.AUTO_INVALIDATE,
gather_temp => FALSE,
gather_fixed => FALSE,
stattype => 'DATA',
obj_filter_list => l_obj_filter_list);

end;

If I supply a partition name to the above code then in no case the partition is going to get analysed. But if i analyse the table by passing the partition name to be NULL then it works. Could be please suggest me something? My main motto is to pass the partition name ,keep the granularity as 'AUTO' so that when a partition name is passed then that partition should get analysed and when it not passed the table should be analysed globally .

Thanks

Rahul

Posted by Rahul Kumar on March 21, 2013 at 09:41 PM PDT #

Hi

I have a table of size 35GB, 3 list partition and 8 hash SUBPARTITIONS on each partition. I am trying to get the stats:--

EXEC DBMS_STATS.GATHER_TABLE_STATS ('SCOTT','EMP',GRANULARITY =>'ALL' ,estimate_percent=>30 , degree => 32,no_invalidate=> false, FORCE=>TRUE);

But it is taking long time...
Can you explain what should be the best way to handle these type of situations

Posted by guest on October 25, 2013 at 01:31 AM PDT #

I would avoid specifying the non default values in general.
Please see comments for the following parameters to improve performance.

1) Granularity => ALL
BY default, dbms_stats does not collect statistics
for hash sub partitions. Usually hash sub partitions are uniform we rarely see need for stats for them. When you specify ALL for the
argument, it is going to collect stats for them too and it will increase time.

2) estimate_percent =>30

Strongly recommended to use DBMS_STATS.AUTO_SAMPLE_SIZE from 11.
Please see the following post.

https://blogs.oracle.com/optimizer/entry/improvement_of_auto_sampling_statistics_gathering_feature_in_oracle_database_11g

3) degree 32

By default, it will use degree specified at the table. If you are forcing parallel, try using DBMS_STATS.AUTO_DEGREE

Posted by Sunil Chakkappen on October 25, 2013 at 03:33 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

The Oracle Optimizer blog is written by members of the Optimizer development team. The goal of this blog is to provide an insight into the workings of the Optimizer and the statistics it relies on. The views expressed on this blog are our own and do not necessarily reflect the views of Oracle and its affiliates. The views and opinions expressed by visitors on this blog are theirs solely and may not reflect ours.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today