There's now an updated post on auto sample size that covers Oracle Database 12c, but this one is still relevant to Oracle Database 11g...
I recently got asked this question and thought it was worth a quick blog post to explain in a little more detail what is going on with the new AUTO_SAMPLE_SIZE in Oracle Database 11g and what you should expect to see in the dictionary views.
Let’s take the SH.CUSTOMERS table as an example. There are 55,500 rows in the SH.CUSTOMERS tables.
If we gather statistics on the SH.CUSTOMERS using the new AUTO_SAMPLE_SIZE but without collecting histogram we can check what sample size was used by looking in the USER_TABLES and USER_TAB_COL_STATISTICS dictionary views.
The sample sized shown in the USER_TABLES is 55,500 rows or the entire table as expected. In USER_TAB_COL_STATISTICS most columns show 55,500 rows as the sample size except for four columns (CUST_SRC_ID, CUST_EFF_TO, CUST_MARTIAL_STATUS, CUST_INCOME_LEVEL ).
The CUST_SRC_ID and CUST_EFF_TO columns have no sample size listed because there are only NULL values in these columns and the statistics gathering procedure skips NULL values. The CUST_MARTIAL_STATUS (38,072) and the CUST_INCOME_LEVEL (55,459) columns show less than 55,500 rows as their sample size because of the presence of NULL values in these columns. In the SH.CUSTOMERS table 17,428 rows have a NULL as the value for CUST_MARTIAL_STATUS column (17428+38072 = 55500), while 41 rows have a NULL values for the CUST_INCOME_LEVEL column (41+55459 = 55500). So we can confirm that the new AUTO_SAMPLE_SIZE algorithm will use all non-NULL values when gathering basic table and column level statistics.
Now we have clear understanding of what sample size to expect lets include histogram creation as part of the statistics gathering.
Again we can look in the USER_TABLES and USER_TAB_COL_STATISTICS dictionary views to find the sample size used.
The sample size seen in USER_TABLES is 55,500 rows but if we look at the column statistics we see that it is same as in previous case except for columns CUST_POSTAL_CODE and CUST_CITY_ID. You will also notice that these columns now have histograms created on them. The sample size shown for these columns is not the sample size used to gather the basic column statistics. AUTO_SAMPLE_SIZE still uses all the rows in the table - the NULL rows to gather the basic column statistics (55,500 rows in this case). The size shown is the sample size used to create the histogram on the column.
When we create a histogram we try to build it on a sample that has approximately 5,500 non-null values for the column. Typically all of the histograms required for a table are built from the same sample. In our example the histograms created on CUST_POSTAL_CODE and the CUST_CITY_ID were built on a single sample of ~5,500 (5,450 rows) as these columns contained only non-null values.
However, if one or more of the columns that requires a histogram has null values then the sample size maybe increased in order to achieve a sample of 5,500 non-null values for those columns. n addition, if the difference between the number of nulls in the columns varies greatly, we may create multiple samples, one for the columns that have a low number of null values and one for the columns with a high number of null values. This scheme enables us to get close to 5,500 non-null values for each column.
You can get a copy of the script I used to generate this post here.
hi Maria,
We have some datawarehouse environments which uses method_opt =>'FOR ALL INDEXED COLUMNS SIZE AUTO' for stats collection. As a result we dont have stats on any non-indexed column. In case we have query which included non indexed column we try to gather stats for that column by adding it in method_opt. I was thinking if moving to method_opt =>'FOR ALL COLUMNS SIZE AUTO' will be ok for datawarehouse environment with daily partitions (retention of 12 months or more). Can this change of approach have any pitfalls (one thing is increased stats gather time).
Regards
Amit
Hi,
Its always great reading such informative details on this blog.I wanted to know what are the criteria for determining the the sample size with dbms_stats.auto_sample_size.Sometimes optimizer takes sample_size 50 sometimes might take ~100%, so what all are the determining factors.
Anand
Hi Amit,
We would strongly encourage you to let the method_opt parameter default to 'FOR ALL COLUMNS SIZE AUTO', even in a data warehousing environment.This will allow Oracle to create histograms on the column seen in the where clause predicates and group-by clauses of the workload that have data skews, without impacting the statistics gathered on any other column.
We discourage the setting of the method_opt to 'FOR ALL INDEXED COLUMNS SIZE AUTO' because it causes Oracle to gather histograms on every column used in an index, which could unnecessarily extend the elapse time to gather statistics and waste a lot of system resources. This setting also has a nasty side effect of preventing Oracle from collecting basic column statistics for non-index columns, which in a data warehousing environment, with ad-hoc queries, could result in some very poor plans.
Thanks,
Maria
Hi Anand,
In Oracle Database 11g the DBMS_STATS.AUTO_SAMPLE_SIZE algorithm has been completely rewritten. It is now a hash based algorithm that looks at all of the rows in the table (100 % sample). More information on how the new approach works can be found in the white paper, Efficient and scalable statistics gathering for large databases in Oracle 11g (http://portal.acm.org/citation.cfm?id=1376721)
Thanks,
Maria
Hi Maria,
thanks for your explanation of auto_sample_size.
Up to now, we are still using 'estimate_percent' with values like 100% or 30% (Oracle 11gR2). Our concern is with the histograms: estimate_percent=30 will use a 30% sample for creating the histogram. auto_sample_size will only create the histograms on basis of 5500 samples. That sounds somehow arbitrary. We have lots of tables with huge data skew (and millions of records) and are suspecting that the histograms may not be precise enough.
Due to that, I am not convinced to take that auto_sample_size approach.
Time is not such an issue for us: We can afford gathering the statistics over night.
Can you explain the reason why you have chosen to always work with a 5500 sample for creating histograms? Are there any plans to make that more flexible/intelligent in the future?
Best
Reiner Kuehl
Reiner,
I have the same concern as you, and can confirm that it doesn't work. We have a table with 100 million records, and the histogram for one of the columns randomly only gets 1 entry, which causes the CBO to estimate that 50 million records have a particular value when only a few do, and full table scans aren't our friend. I am not a statistician, but I'm pretty sure that if 0 rows in a random sample of 5500 have a particular value, that its extremely unlikely that half of 100 million have it.
Take a look at http://jonathanlewis.wordpress.com/2009/04/23/histogram-change/, it looks like Oracle went from one extreme to the other for values missing in the histogram. You can even turn the "fix" off, see http://www.dbaportal.eu/?q=node/111.