In real-world data, there is often a relationship or correlation between the data stored in different columns of the same table. For example, consider a customers table where the values in a cust_state_province column are influenced by the values in a country_id column, because the state of California is only going to be found in the United States. If the Oracle Optimizer is not aware of these real-world relationships, it could potentially miscalculate the cardinality estimate if multiple columns from the same table are used in the where clause of a statement. With extended statistics you have an opportunity to tell the optimizer about these real-world relationships between the columns.
By creating extended statistics on a group of columns, the optimizer can determine a more accurate cardinality estimate when the columns are used together in a where clause of a SQL statement.You can use DBMS_STATS.CREATE_EXTENDED_STATS to define the column group you want to have statistics gathered on as a whole. Once the group has been established Oracle will automatically maintain the statistics on that column group when statistics are gathered on the table.
If we continue with the initial example of the customers table, When the value of cust_state_province is 'CA' we know the value of country_id will be 52790 or the USA. There is also a skew in the data in these two columns; because the company used in the SH is based in San Francisco so the majority of rows in the table have the values 'CA' and 52790. Both the relationship between the columns and the skew in the data can make it difficult for the optimizer to calculate the cardinality of these columns correctly when they are used together in a query.
With just basic statistics we see the optimizer thinks there will only be 127 row returned because it assumes both columns will reduce the number of rows returned (# of rows in the table X 1/NDV of column1 X 1/NDV of column2). We know that this is not true in this case. We must provide better statistic to the optimizer so it can determine the correct cardinality estimate. Prior to Oracle Database 11g the only option open to us would be to make the optimizer aware of the data skew in both the country_id column (most rows have 52790 as the value) and the cust_state_province column (most rows have 'CA' as the value). We can do this by gathering histograms on the skewed columns.
Now that we have histograms on both the country_id and the cust_state_province columns let's see if the optimizers estimate is more accurate.
There is a slightly improvement in the estimate since we have histograms for the individual columns. However the optimizer is still not aware that there is a relationship or correlation between these two columns. We can tell the optimizer about this correlation by creating extended statistics on these two columns as a group. Once the extended statistics have been created the next time statistics are gathered on the
customers table an extra set of statistics, for the combine group of
country_id and
cust_state_province, will be collected. The DBMS_STATS.CREATE_EXTENDED_STATS function can be used to create the extended statistics or a 'column group' for
country_id and
cust_state_province.
After creating the extended statistics and regather statistics, you will see a system generated column name in USER_TAB_COL_STATISTICS, which represents the new column group. A subset of statistics are maintained for column groups including;
- Number of distinct values
- Number of Nulls
- Histograms
So now we have statistics on the column group lets confirm that is enough information for the optimizer to get the correct estimation.
The cardinality estimate is still off. Why did the extended statistics not help in this case? The extended statistics were not actually used here. If you look back at the output from USER_TAB_COL_STATISTICS you can see there is a histogram created on the
country_id and
cust_state_province columns. However, there is no histogram created on the column group. Because a histogram provides the optimizer with more information than standard statistics the optimizer ignores the extended statistics and uses the individual column statistics instead.
Since we have executed the query again, the optimizer will have recorded that a histogram on the created extension is beneficial for the query. A histogram will be automatically created on the column group the next time statistics are gathered on the table
With the histogram in place on the column group, the optimizer will now use the extended statistics and the cardinality estimates is now accurate.
Extended statistics are used even if we have them for only a subset of predicates in the statement. Lets say there is a column group created on (c1, c2) and we have a SQL statement with a where clause that contains c1 = 1 and c2 =1 and c3 = 1. The optimizer will use the extended statistics on C1,C2 and multiply that by the selectivity of third predicate. It will use all the available statistics on column group, including histograms. The optimizer will also use the extended statistics (in a limited way) if a subset of the column group is present in the query.
maybe you could post an article about the Cardinality Feedback sometime in the future. There is little known about this new feature. Thank you in advance.
Regards,
Martin
Very interesting article, I was wondering why extended could not be used for correlated columns accross two tables in case of a join leading to bad cardinality due to the assumption that columns are not correlated ?
We took your advice and blogged about Cardinality Feedback in our latest post.
http://blogs.oracle.com/optimizer/entry/cardinality_feedback
Hope you found it useful!
Thanks,
Maria
Thx
very nice and useful article.
Will try this.
Great Article!
I have question though. When extended statistics are created, Oracle assigns a System Generated name to the extension which is fine. However, is it possible to import the extended stats into another schema name(or uat to production for example). Will they work with the system generated name?
Again, Thanks for sharing the knowledge.
Cheers!
Extended statistics can be exported and imported on another system just like all other column statistics. The system generated name of the extended statistics is based off of the column names, so the same name will be generate on all systems. Thus ensuring we will be able to use the extended statistics on any system.
Very Nice Article
Dear Ms. Colgan;
On your "Understanding Optimizer Statistics" paper, dated January 2012, you write:
"...This whitepaper is the first in a two part series on Optimizer statistics..."
Has the second part come out? Given the quality of the first part, am very eager to get my hands on the second.
Thank you.
Hi Ross,
I am hoping to post the second part of the statistics paper on Oracle.com later this week. As soon as I post it I will add a link to the paper on the blog.
Thanks,
Maria
If you are stuck on 10g, another option is to partition your table. e.g. one partition for country USA, default partition for others. Oracle will collect statistics on a per-partition basis.
Use of a literal will allow it to come up with a plan specific to the partition you have nominated. In addition to this, a histogram would provide estimates on the skew within a partition.
Hi ,
Nice article but I have a question on the example provided in this article. You showed that extended stats works better with Histogram. Most of the time, our queries use bind variables and not literals. Due to this , it's not feasible for us to create histograms. Would extended stats still help us or we need to follow some other approach ?
Thanks
Yes, extended statistics will still help you if you have multiple columns that are used together in a query and those columns are correlated.
Have you considered revising your approach to histograms and binds since the introduction of Adaptive Cursor Sharing?
Thanks,
Maria
Very nice article Maria, thank you very much.
Regards,
Srikar
Is it possible to force a histogram onto an extended stat if column usage is not getting the histogram to appear? Thanks
Hi Fran,
Yes, it is possible to force a histogram on a column group using the method_opt parameter in the gather_table_stats procedure. You just need to know the name of the virtual column that represents the column group you are interested in. Below is an example of how to do it.
However, I have never come across a scenario where we haven't created a histogram on the column after the column usage information got recorded for the column group.
Example
========
SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';
COLUMN_NAME HISTOGRAM
------------------------------ ---------------
SYS_STUWMBUN3F$#398R7BS0YVS86R NONE
CUST_ID NONE
CUST_FIRST_NAME NONE
CUST_LAST_NAME NONE
CUST_GENDER FREQUENCY
CUST_YEAR_OF_BIRTH NONE
CUST_MARITAL_STATUS NONE
CUST_STREET_ADDRESS NONE
CUST_POSTAL_CODE HEIGHT BALANCED
CUST_CITY HEIGHT BALANCED
CUST_STATE_PROVINCE FREQUENCY
COUNTRY_ID NONE
CUST_MAIN_PHONE_NUMBER NONE
CUST_INCOME_LEVEL NONE
CUST_CREDIT_LIMIT NONE
CUST_EMAIL NONE
So, to create a histogram on the virtual column SYS_STUWMBUN3F$#398R7BS0YVS86R I would use the following gather_table_stats command.
BEGIN
dbms_stats.gather_table_stats('SH','CUSTOMERS', method_opt=>'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 SYS_STUWMBUN3F$#398R7BS0YVS86R')
;
END;
/
SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';
COLUMN_NAME HISTOGRAM
------------------------------ ---------------
SYS_STUWMBUN3F$#398R7BS0YVS86R HEIGHT BALANCED
CUST_ID NONE
CUST_FIRST_NAME NONE
CUST_LAST_NAME NONE
CUST_GENDER FREQUENCY
CUST_YEAR_OF_BIRTH NONE
CUST_MARITAL_STATUS NONE
CUST_STREET_ADDRESS NONE
CUST_POSTAL_CODE HEIGHT BALANCED
CUST_CITY HEIGHT BALANCED
CUST_STATE_PROVINCE FREQUENCY
COUNTRY_ID NONE
CUST_MAIN_PHONE_NUMBER NONE
CUST_INCOME_LEVEL NONE
CUST_CREDIT_LIMIT NONE
CUST_EMAIL NONE
thanks for this nice article!