X

Insights into Statistics, Query Optimization and the Oracle Optimizer

Extended Statistics

Maria Colgan
Master Product Manager

In real-world data, there is often a relationship or correlation between the data stored in different columns of the same table. For example, consider a customers table where the values in a cust_state_province column are influenced by the values in a country_id column, because the state of California is only going to be found in the United States. If the Oracle Optimizer is not aware of these real-world relationships, it could potentially miscalculate the cardinality estimate if multiple columns from the same table are used in the where clause of a statement. With extended statistics you have an opportunity to tell the optimizer about these real-world relationships between the columns.

By creating extended statistics on a group of columns, the optimizer can determine a more accurate cardinality estimate when the columns are used together in a where clause of a SQL statement.You can use DBMS_STATS.CREATE_EXTENDED_STATS to define the column group you want to have statistics gathered on as a whole. Once the group has been established Oracle will automatically maintain the statistics on that column group when statistics are gathered on the table.

If we continue with the initial example of the customers table, When the value of cust_state_province is 'CA' we know the value of country_id will be 52790 or the USA. There is also a skew in the data in these two columns; because the company used in the SH is based in San Francisco so the majority of rows in the table have the values 'CA' and 52790. Both the relationship between the columns and the skew in the data can make it difficult for the optimizer to calculate the cardinality of these columns correctly when they are used together in a query.
 

initial_query_and_plan.png



With just basic statistics we see the optimizer thinks there will only be 127 row returned because it assumes both columns will reduce the number of rows returned (# of rows in the table X 1/NDV of column1 X 1/NDV of column2). We know that this is not true in this case. We must provide better statistic to the optimizer so it can determine the correct cardinality estimate. Prior to Oracle Database 11g the only option open to us would be to make the optimizer aware of the data skew in both the country_id column (most rows have 52790 as the value) and the cust_state_province column (most rows have 'CA' as the value). We can do this by gathering histograms on the skewed columns.
 
histogram_stats.png


Now that we have histograms on both the country_id and the cust_state_province columns let's see if the optimizers estimate is more accurate.
 
histogram_plan.png


There is a slightly improvement in the estimate since we have histograms for the individual columns. However the optimizer is still not aware that there is a relationship or correlation between these two columns. We can tell the optimizer about this correlation by creating extended statistics on these two columns as a group. Once the extended statistics have been created the next time statistics are gathered on the customers table an extra set of statistics, for the combine group of country_id and cust_state_province, will be collected. The DBMS_STATS.CREATE_EXTENDED_STATS function can be used to create the extended statistics or a 'column group' for country_id and cust_state_province.
 
create_extended_stats.png


After creating the extended statistics and regather statistics, you will see a system generated column name in USER_TAB_COL_STATISTICS, which represents the new column group. A subset of statistics are maintained for column groups including;
  • Number of distinct values
  • Number of Nulls
  • Histograms
 
system_generated_stats.png


So now we have statistics on the column group lets confirm that is enough information for the optimizer to get the correct estimation.
 
Not_working.png

The cardinality estimate is still off. Why did the extended statistics not help in this case? The extended statistics were not actually used here. If you look back at the output from USER_TAB_COL_STATISTICS you can see there is a histogram created on the country_id and cust_state_province columns. However, there is no histogram created on the column group. Because a histogram provides the optimizer with more information than standard statistics the optimizer ignores the extended statistics and uses the individual column statistics instead.

Since we have executed the query again, the optimizer will have recorded that a histogram on the created extension is beneficial for the query. A histogram will be automatically created on the column group the next time statistics are gathered on the table
 
p1.png

With the histogram in place on the column group, the optimizer will now use the extended statistics and the cardinality estimates is now accurate.
 
working.png

Extended statistics are used even if we have them for only a subset of predicates in the statement. Lets say there is a column group created on (c1, c2) and we have a SQL statement with a where clause that contains c1 = 1 and c2 =1 and c3 = 1. The optimizer will use the extended statistics on C1,C2 and multiply  that by the selectivity of third predicate. It will use all the available statistics on column group, including histograms. The optimizer will also use the extended statistics (in a limited way) if a subset of the column group is present in the query.

Join the discussion

Comments ( 16 )
  • Martin Decker Friday, March 18, 2011
    Maria,
    maybe you could post an article about the Cardinality Feedback sometime in the future. There is little known about this new feature. Thank you in advance.
    Regards,
    Martin
  • guest Monday, May 9, 2011
    Hello,
    Very interesting article, I was wondering why extended could not be used for correlated columns accross two tables in case of a join leading to bad cardinality due to the assumption that columns are not correlated ?
  • Maria Colgan Thursday, June 2, 2011
    Hi Martin,
    We took your advice and blogged about Cardinality Feedback in our latest post.
    http://blogs.oracle.com/optimizer/entry/cardinality_feedback
    Hope you found it useful!
    Thanks,
    Maria
  • Anjul Saturday, July 2, 2011

    Thx

    very nice and useful article.

    Will try this.


  • guest Sunday, September 11, 2011

    Great Article!

    I have question though. When extended statistics are created, Oracle assigns a System Generated name to the extension which is fine. However, is it possible to import the extended stats into another schema name(or uat to production for example). Will they work with the system generated name?

    Again, Thanks for sharing the knowledge.

    Cheers!


  • Maria Colgan Monday, September 12, 2011

    Extended statistics can be exported and imported on another system just like all other column statistics. The system generated name of the extended statistics is based off of the column names, so the same name will be generate on all systems. Thus ensuring we will be able to use the extended statistics on any system.


  • feras Friday, September 30, 2011

    Very Nice Article


  • Ross Monday, March 26, 2012

    Dear Ms. Colgan;

    On your "Understanding Optimizer Statistics" paper, dated January 2012, you write:

    "...This whitepaper is the first in a two part series on Optimizer statistics..."

    Has the second part come out? Given the quality of the first part, am very eager to get my hands on the second.

    Thank you.


  • Maria Colgan Monday, March 26, 2012

    Hi Ross,

    I am hoping to post the second part of the statistics paper on Oracle.com later this week. As soon as I post it I will add a link to the paper on the blog.

    Thanks,

    Maria


  • WW Wednesday, August 29, 2012

    If you are stuck on 10g, another option is to partition your table. e.g. one partition for country USA, default partition for others. Oracle will collect statistics on a per-partition basis.

    Use of a literal will allow it to come up with a plan specific to the partition you have nominated. In addition to this, a histogram would provide estimates on the skew within a partition.


  • guest Wednesday, February 27, 2013

    Hi ,

    Nice article but I have a question on the example provided in this article. You showed that extended stats works better with Histogram. Most of the time, our queries use bind variables and not literals. Due to this , it's not feasible for us to create histograms. Would extended stats still help us or we need to follow some other approach ?

    Thanks


  • guest Wednesday, February 27, 2013

    Yes, extended statistics will still help you if you have multiple columns that are used together in a query and those columns are correlated.

    Have you considered revising your approach to histograms and binds since the introduction of Adaptive Cursor Sharing?

    Thanks,

    Maria


  • guest Tuesday, March 26, 2013

    Very nice article Maria, thank you very much.

    Regards,

    Srikar


  • Fran Wednesday, April 24, 2013

    Is it possible to force a histogram onto an extended stat if column usage is not getting the histogram to appear? Thanks


  • guest Wednesday, April 24, 2013

    Hi Fran,

    Yes, it is possible to force a histogram on a column group using the method_opt parameter in the gather_table_stats procedure. You just need to know the name of the virtual column that represents the column group you are interested in. Below is an example of how to do it.

    However, I have never come across a scenario where we haven't created a histogram on the column after the column usage information got recorded for the column group.

    Example

    ========

    SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';

    COLUMN_NAME HISTOGRAM

    ------------------------------ ---------------

    SYS_STUWMBUN3F$#398R7BS0YVS86R NONE

    CUST_ID NONE

    CUST_FIRST_NAME NONE

    CUST_LAST_NAME NONE

    CUST_GENDER FREQUENCY

    CUST_YEAR_OF_BIRTH NONE

    CUST_MARITAL_STATUS NONE

    CUST_STREET_ADDRESS NONE

    CUST_POSTAL_CODE HEIGHT BALANCED

    CUST_CITY HEIGHT BALANCED

    CUST_STATE_PROVINCE FREQUENCY

    COUNTRY_ID NONE

    CUST_MAIN_PHONE_NUMBER NONE

    CUST_INCOME_LEVEL NONE

    CUST_CREDIT_LIMIT NONE

    CUST_EMAIL NONE

    So, to create a histogram on the virtual column SYS_STUWMBUN3F$#398R7BS0YVS86R I would use the following gather_table_stats command.

    BEGIN

    dbms_stats.gather_table_stats('SH','CUSTOMERS', method_opt=>'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 SYS_STUWMBUN3F$#398R7BS0YVS86R')

    ;

    END;

    /

    SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';

    COLUMN_NAME HISTOGRAM

    ------------------------------ ---------------

    SYS_STUWMBUN3F$#398R7BS0YVS86R HEIGHT BALANCED

    CUST_ID NONE

    CUST_FIRST_NAME NONE

    CUST_LAST_NAME NONE

    CUST_GENDER FREQUENCY

    CUST_YEAR_OF_BIRTH NONE

    CUST_MARITAL_STATUS NONE

    CUST_STREET_ADDRESS NONE

    CUST_POSTAL_CODE HEIGHT BALANCED

    CUST_CITY HEIGHT BALANCED

    CUST_STATE_PROVINCE FREQUENCY

    COUNTRY_ID NONE

    CUST_MAIN_PHONE_NUMBER NONE

    CUST_INCOME_LEVEL NONE

    CUST_CREDIT_LIMIT NONE

    CUST_EMAIL NONE


  • rajmohan Tuesday, September 2, 2014

    thanks for this nice article!


Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha