I thought the new AUTO_SAMPLE_SIZE in Oracle Database 11g looked at all the rows in a table so why do I see a very small sample size on some tables?
By Maria Colgan on Apr 02, 2012
I recently got asked this question and thought it was worth a quick blog post to explain in a little more detail what is going on with the new AUTO_SAMPLE_SIZE in Oracle Database 11g and what you should expect to see in the dictionary views. Let’s take the SH.CUSTOMERS table as an example. There are 55,500 rows in the SH.CUSTOMERS tables.
If we gather statistics on the SH.CUSTOMERS using the new AUTO_SAMPLE_SIZE but without collecting histogram we can check what sample size was used by looking in the USER_TABLES and USER_TAB_COL_STATISTICS dictionary views.
The sample sized shown in the USER_TABLES is 55,500 rows or the entire table as expected. In USER_TAB_COL_STATISTICS most columns show 55,500 rows as the sample size except for four columns (CUST_SRC_ID, CUST_EFF_TO, CUST_MARTIAL_STATUS, CUST_INCOME_LEVEL ).
The CUST_SRC_ID and CUST_EFF_TO columns have no sample size listed because there are only NULL values in these columns and the statistics gathering procedure skips NULL values. The CUST_MARTIAL_STATUS (38,072) and the CUST_INCOME_LEVEL (55,459) columns show less than 55,500 rows as their sample size because of the presence of NULL values in these columns. In the SH.CUSTOMERS table 17,428 rows have a NULL as the value for CUST_MARTIAL_STATUS column (17428+38072 = 55500), while 41 rows have a NULL values for the CUST_INCOME_LEVEL column (41+55459 = 55500). So we can confirm that the new AUTO_SAMPLE_SIZE algorithm will use all non-NULL values when gathering basic table and column level statistics.
Now we have clear understanding of what sample size to expect lets include histogram creation as part of the statistics gathering.
Again we can look in the USER_TABLES and USER_TAB_COL_STATISTICS dictionary views to find the sample size used.
The sample size seen in USER_TABLES is 55,500 rows but if we look at the column statistics we see that it is same as in previous case except for columns CUST_POSTAL_CODE and CUST_CITY_ID. You will also notice that these columns now have histograms created on them. The sample size shown for these columns is not the sample size used to gather the basic column statistics. AUTO_SAMPLE_SIZE still uses all the rows in the table - the NULL rows to gather the basic column statistics (55,500 rows in this case). The size shown is the sample size used to create the histogram on the column.
When we create a histogram we try to build it on a sample that has approximately 5,500 non-null values for the column. Typically all of the histograms required for a table are built from the same sample. In our example the histograms created on CUST_POSTAL_CODE and the CUST_CITY_ID were built on a single sample of ~5,500 (5,450 rows) as these columns contained only non-null values.
However, if one or more of the columns that requires a histogram has null values then the sample size maybe increased in order to achieve a sample of 5,500 non-null values for those columns. n addition, if the difference between the number of nulls in the columns varies greatly, we may create multiple samples, one for the columns that have a low number of null values and one for the columns with a high number of null values. This scheme enables us to get close to 5,500 non-null values for each column.
You can get a copy of the script I used to generate this post here.