### Extended Statistics

#### By Maria Colgan-Oracle on Mar 17, 2011

In real-world data, there is often a relationship or correlation between the data stored in different columns of the same table. For example, in the customers table, the values in the cust_state_province column are influenced by the values in the country_id column, as the state of California is only going to be found in the United States. Until now, the Optimizer had no way of knowing about these real-world relationships and could potentially miscalculate the cardinality estimate if multiple columns from the same table are used in the where clause of a statement. With extended statistics you now have an opportunity to tell the Optimizer about these real-world relationships between the columns.

By creating extended statistics on a group of columns, the Optimizer can determine a more accurate cardinality estimate when the columns are used together in a where clause of a SQL statement.You can use DBMS_STATS.CREATE_EXTENDED_STATS to define the column group you want to have statistics gathered on as a whole. Once the group has been established Oracle will automatically maintain the statistics on that column group when statistics are gathered on the table.

If we continue with the initial example of the customers table, When the value of cust_state_province is 'CA' we know the value of country_id will be 52790 or the USA. There is also a skew in the data in these two columns; because the company used in the SH is based in San Francisco so the majority of rows in the table have the values 'CA' and 52790. Both the relationship between the columns and the skew in the data can make it difficult for the optimizer to calculate the cardinality of these columns correctly when they are used together in a query.

With just basic statistics we see the optimizer thinks there will only be 127 row returned because it assumes both columns will reduce the number of rows returned (# of rows in the table X 1/NDV of column1 X 1/NDV of column2). We know that this is not true in this case. We must provide better statistic to the optimizer so it can determine the correct cardinality estimate. Prior to Oracle Database 11g the only option open to us would be to make the optimizer aware of the data skew in both the country_id column (most rows have 52790 as the value) and the cust_state_province column (most rows have 'CA' as the value). We can do this by gathering histograms on the skewed columns.

Now that we have histograms on both the country_id and the cust_state_province columns let's see if the optimizers estimate is more accurate.

There is a slightly improvement in the estimate since we have histograms for the individual columns. However the optimizer is still not aware that there is a relationship or correlation between these two columns. In Oracle Database 11g we can tell the optimizer about this correlation by creating extended statistics on these two columns as a group. Once the extended statistics have been created the next time statistics are gathered on the CUSTOMERS table an extra set of statistics, for the combine group of country_id and cust_state_province, will be collected. The DBMS_STATS.CREATE_EXTENDED_STATS function can be used to create the extended statistics or a 'column group' for country_id and cust_state_province.

After creating the extended statistics and regather statistics, you will see a system generated column name in USER_TAB_COL_STATISTICS, which represents the new column group. A subset of statistics are maintained for column groups including;

So now we have statistics on the column group lets confirm that is enough information for the optimizer to get the correct estimation.

The cardinality estimate is still off. Why did the extended statistics not help in this case? The extended statistics were not actually used here. If you look back at the output from USER_TAB_COL_STATISTICS you can see there is a histogram created on the country_id and cust_state_province columns. However, there is no histogram created on the column group. Because a histogram provides the Optimizer with more information than standard statistics the Optimizer ignores the extended statistics and uses the individual column statistics instead.

Since we have executed the query again, the optimizer will have recorded that a histogram on the created extension is beneficial for the query. A histogram will be automatically created on the column group the next time statistics are gathered on the table

With the histogram in place on the column group, the Optimizer will now use the extended statistics and the cardinality estimates is now accurate.

Extended Statistics are used even if we have them for only a subset of predicates in the statement. Lets say there is a column group created on (c1, c2) and we have a SQL statement with a where clause that contains c1 = 1 and c2 =1 and c3 = 1. The optimizer will use the extended statistics on C1,C2 and multiply that by the selectivity of third predicate. It will use all the available statistics on column group, including histograms. The optimizer will also use the extended statistics (in a limited way) if a subset of the column group is present in the query.

Maria Colgan+

By creating extended statistics on a group of columns, the Optimizer can determine a more accurate cardinality estimate when the columns are used together in a where clause of a SQL statement.You can use DBMS_STATS.CREATE_EXTENDED_STATS to define the column group you want to have statistics gathered on as a whole. Once the group has been established Oracle will automatically maintain the statistics on that column group when statistics are gathered on the table.

If we continue with the initial example of the customers table, When the value of cust_state_province is 'CA' we know the value of country_id will be 52790 or the USA. There is also a skew in the data in these two columns; because the company used in the SH is based in San Francisco so the majority of rows in the table have the values 'CA' and 52790. Both the relationship between the columns and the skew in the data can make it difficult for the optimizer to calculate the cardinality of these columns correctly when they are used together in a query.

With just basic statistics we see the optimizer thinks there will only be 127 row returned because it assumes both columns will reduce the number of rows returned (# of rows in the table X 1/NDV of column1 X 1/NDV of column2). We know that this is not true in this case. We must provide better statistic to the optimizer so it can determine the correct cardinality estimate. Prior to Oracle Database 11g the only option open to us would be to make the optimizer aware of the data skew in both the country_id column (most rows have 52790 as the value) and the cust_state_province column (most rows have 'CA' as the value). We can do this by gathering histograms on the skewed columns.

Now that we have histograms on both the country_id and the cust_state_province columns let's see if the optimizers estimate is more accurate.

There is a slightly improvement in the estimate since we have histograms for the individual columns. However the optimizer is still not aware that there is a relationship or correlation between these two columns. In Oracle Database 11g we can tell the optimizer about this correlation by creating extended statistics on these two columns as a group. Once the extended statistics have been created the next time statistics are gathered on the CUSTOMERS table an extra set of statistics, for the combine group of country_id and cust_state_province, will be collected. The DBMS_STATS.CREATE_EXTENDED_STATS function can be used to create the extended statistics or a 'column group' for country_id and cust_state_province.

After creating the extended statistics and regather statistics, you will see a system generated column name in USER_TAB_COL_STATISTICS, which represents the new column group. A subset of statistics are maintained for column groups including;

- Number of distinct values
- Number of Nulls
- Histograms

So now we have statistics on the column group lets confirm that is enough information for the optimizer to get the correct estimation.

The cardinality estimate is still off. Why did the extended statistics not help in this case? The extended statistics were not actually used here. If you look back at the output from USER_TAB_COL_STATISTICS you can see there is a histogram created on the country_id and cust_state_province columns. However, there is no histogram created on the column group. Because a histogram provides the Optimizer with more information than standard statistics the Optimizer ignores the extended statistics and uses the individual column statistics instead.

Since we have executed the query again, the optimizer will have recorded that a histogram on the created extension is beneficial for the query. A histogram will be automatically created on the column group the next time statistics are gathered on the table

With the histogram in place on the column group, the Optimizer will now use the extended statistics and the cardinality estimates is now accurate.

Extended Statistics are used even if we have them for only a subset of predicates in the statement. Lets say there is a column group created on (c1, c2) and we have a SQL statement with a where clause that contains c1 = 1 and c2 =1 and c3 = 1. The optimizer will use the extended statistics on C1,C2 and multiply that by the selectivity of third predicate. It will use all the available statistics on column group, including histograms. The optimizer will also use the extended statistics (in a limited way) if a subset of the column group is present in the query.

Posted by

Martin Deckeron March 17, 2011 at 05:03 PM PDT #Posted by

gueston May 09, 2011 at 12:47 AM PDT #Posted by

Maria Colganon June 02, 2011 at 02:21 AM PDT #Thx

very nice and useful article.

Will try this.

Posted by

Anjulon July 02, 2011 at 04:46 PM PDT #Great Article!

I have question though. When extended statistics are created, Oracle assigns a System Generated name to the extension which is fine. However, is it possible to import the extended stats into another schema name(or uat to production for example). Will they work with the system generated name?

Again, Thanks for sharing the knowledge.

Cheers!

Posted by

gueston September 10, 2011 at 06:44 PM PDT #Extended statistics can be exported and imported on another system just like all other column statistics. The system generated name of the extended statistics is based off of the column names, so the same name will be generate on all systems. Thus ensuring we will be able to use the extended statistics on any system.

Posted by

Maria Colganon September 12, 2011 at 11:50 AM PDT #Very Nice Article

Posted by

ferason September 30, 2011 at 09:20 AM PDT #Dear Ms. Colgan;

On your "Understanding Optimizer Statistics" paper, dated January 2012, you write:

"...This whitepaper is the first in a two part series on Optimizer statistics..."

Has the second part come out? Given the quality of the first part, am very eager to get my hands on the second.

Thank you.

Posted by

Rosson March 26, 2012 at 09:48 AM PDT #Hi Ross,

I am hoping to post the second part of the statistics paper on Oracle.com later this week. As soon as I post it I will add a link to the paper on the blog.

Thanks,

Maria

Posted by

Maria Colganon March 26, 2012 at 09:50 AM PDT #If you are stuck on 10g, another option is to partition your table. e.g. one partition for country USA, default partition for others. Oracle will collect statistics on a per-partition basis.

Use of a literal will allow it to come up with a plan specific to the partition you have nominated. In addition to this, a histogram would provide estimates on the skew within a partition.

Posted by

WWon August 28, 2012 at 07:49 PM PDT #Hi ,

Nice article but I have a question on the example provided in this article. You showed that extended stats works better with Histogram. Most of the time, our queries use bind variables and not literals. Due to this , it's not feasible for us to create histograms. Would extended stats still help us or we need to follow some other approach ?

Thanks

Posted by

gueston February 27, 2013 at 11:22 AM PST #Yes, extended statistics will still help you if you have multiple columns that are used together in a query and those columns are correlated.

Have you considered revising your approach to histograms and binds since the introduction of Adaptive Cursor Sharing?

Thanks,

Maria

Posted by

gueston February 27, 2013 at 01:28 PM PST #Very nice article Maria, thank you very much.

Regards,

Srikar

Posted by

gueston March 26, 2013 at 03:51 AM PDT #Is it possible to force a histogram onto an extended stat if column usage is not getting the histogram to appear? Thanks

Posted by

Franon April 24, 2013 at 09:45 AM PDT #Hi Fran,

Yes, it is possible to force a histogram on a column group using the method_opt parameter in the gather_table_stats procedure. You just need to know the name of the virtual column that represents the column group you are interested in. Below is an example of how to do it.

However, I have never come across a scenario where we haven't created a histogram on the column after the column usage information got recorded for the column group.

Example

========

SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';

COLUMN_NAME HISTOGRAM

------------------------------ ---------------

SYS_STUWMBUN3F$#398R7BS0YVS86R NONE

CUST_ID NONE

CUST_FIRST_NAME NONE

CUST_LAST_NAME NONE

CUST_GENDER FREQUENCY

CUST_YEAR_OF_BIRTH NONE

CUST_MARITAL_STATUS NONE

CUST_STREET_ADDRESS NONE

CUST_POSTAL_CODE HEIGHT BALANCED

CUST_CITY HEIGHT BALANCED

CUST_STATE_PROVINCE FREQUENCY

COUNTRY_ID NONE

CUST_MAIN_PHONE_NUMBER NONE

CUST_INCOME_LEVEL NONE

CUST_CREDIT_LIMIT NONE

CUST_EMAIL NONE

So, to create a histogram on the virtual column SYS_STUWMBUN3F$#398R7BS0YVS86R I would use the following gather_table_stats command.

BEGIN

dbms_stats.gather_table_stats('SH','CUSTOMERS', method_opt=>'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 SYS_STUWMBUN3F$#398R7BS0YVS86R')

;

END;

/

SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';

COLUMN_NAME HISTOGRAM

------------------------------ ---------------

SYS_STUWMBUN3F$#398R7BS0YVS86R HEIGHT BALANCED

CUST_ID NONE

CUST_FIRST_NAME NONE

CUST_LAST_NAME NONE

CUST_GENDER FREQUENCY

CUST_YEAR_OF_BIRTH NONE

CUST_MARITAL_STATUS NONE

CUST_STREET_ADDRESS NONE

CUST_POSTAL_CODE HEIGHT BALANCED

CUST_CITY HEIGHT BALANCED

CUST_STATE_PROVINCE FREQUENCY

COUNTRY_ID NONE

CUST_MAIN_PHONE_NUMBER NONE

CUST_INCOME_LEVEL NONE

CUST_CREDIT_LIMIT NONE

CUST_EMAIL NONE

Posted by

gueston April 24, 2013 at 09:59 AM PDT #thanks for this nice article!

Posted by

rajmohanon September 02, 2014 at 03:00 AM PDT #