Insights into Statistics, Query Optimization and the Oracle Optimizer

Oracle's AUTO Sampling Statistics Gathering

Maria Colgan
Master Product Manager
Optimizer statistics in Oracle are managed via a PL/SQL package, DBMS_STATS. It provides several procedures to gather statistics for a table, schema, or a database. For example, gather_table_statistics is used to gather statistics on a table. This procedure has an estimate_percent parameter, which specifies the sampling percentage of the statistics gathering. The users can specify any number between 0 ~ 100 for this parameter. For example, suppose you have a table BIGT, you can specify a 1% sampling percentage as follows:
exec dbms_stats.gather_table_stats(null, 'BIGT', 
                       estimate_percent => 1);

It is not always easy for users to pick the appropriate sampling percentage. If the specified sampling percentage is too high, it can take longer to gather statistics. On the contrary, if the data is skewed and the specified sampling percentage is too low, the resulting statistics can be inaccurate.

For this reason, Oracle introduced the AUTO value for the estimate_percent parameter. For example, you can gather statistics on BIGT as follows:

exec dbms_stats.gather_table_stats(null, 'BIGT', 
                       estimate_percent => dbms_stats.auto_sample_size);
The use of AUTO_SAMPLE_SIZE is strongly recommended and it has many advantages over a specified estimate percentage. First, when AUTO sample size is specified, the system automatically determines the appropriate sampling percentage. Second, AUTO sample size is more flexible than a fixed sampling percentage. A fixed sampling percentage size that was good at some point in time may not be appropriate after the data distribution in the table has changed. On the other hand when AUTO value is used Oracle will adjust the sample size when the data distribution changes. Finally, some statistics-gathering features depend on the use of auto sampling, such as incremental statistics and HYBRID histograms in Oracle Database 12c.
Do demonstrate the performance of AUTO_SAMPLE_SIZE, we used the standard TPC-D data generator to generate a Lineitem table. The Lineitem table is about 230G and contains 1.8 million rows with 16 columns. The schema of the lineitem table is as follows:


column name column type
l_shipdate date
l_orderkey number
l_discount number
l_extendedprice number
l_suppkey number
l_quantity number
l_returnflag varchar2
l_partkey number
l_linestatus varchar2
l_tax number
l_commitdate date
l_receiptdate date
l_shipmode varchar2
l_linenumber number
l_shipinstruct varchar2
l_comment varchar2

The following table gives the elapsed time of gathering statistics on the Lineitem table by different sampling percentages:
Sampling Percentage Elapsed Time (sec)
1% sampling 797
100% sampling (Compute) 18772
Auto sampling in Oracle 1908

We also compare the quality of the statistics gathered using different estimate percentages. Among all the statistics of a column, number of distinct values (NDV) is the one whose accuracy used to be an issue. We define the accuracy rate of NDV of a column as follows:

accuracy rate = 1 - (|estimated NDV - actual NDV|)/actual NDV. 
The accuracy rate ranges from 0% to 100%. The higher the accuracy rate is, the more accurate the gathered statistics are. Since 100% sampling always lead to an accuracy rate of 100%, we do not report it. We focus on the columns which has at least one statistics accuracy rate below 99.9% when using different estimate percentages. The following table illustrates the accurate rates of the columns:
Column Name Actual NDV Auto Sampling 1% Sampling
orderkey 450,000,000 98.0% 50%
comment 181,122,127 98.60% 4.60%
partkey 60,000,000 99.20% 98.20%
suppkey 3,000,000 99.60% 99.90%
extendedprice 3,791,320 99.60% 94.30%

In short, the elapsed time of AUTO sampling is 10 times faster than 100% sampling but yields the statistics of similar quality (accuracy rate close to 100%).

Join the discussion

Comments ( 1 )
  • Eli dias Sunday, March 15, 2015

    Fantastic information

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.