Extended Statistics

In real-world data, there is often a relationship or correlation between the data stored in different columns of the same table. For example, in the customers table, the values in the cust_state_province column are influenced by the values in the country_id column, as the state of California is only going to be found in the United States. Until now, the Optimizer had no way of knowing about these real-world relationships and could potentially miscalculate the cardinality estimate if multiple columns from the same table are used in the where clause of a statement. With extended statistics you now have an opportunity to tell the Optimizer about these real-world relationships between the columns.

When you create extended statistics on a group of columns, the Optimizer can determine a more accurate cardinality estimate when those columns are used together in the where clause of a SQL statement. You can use DBMS_STATS.CREATE_EXTENDED_STATS to define the column group you want statistics gathered on as a whole. Once the group has been established, Oracle will automatically maintain the statistics on that column group whenever statistics are gathered on the table.

If we continue with the initial example of the customers table, when the value of cust_state_province is 'CA' we know the value of country_id will be 52790, the USA. There is also a skew in the data in these two columns: because the company used in the SH schema is based in San Francisco, the majority of rows in the table have the values 'CA' and 52790. Both the relationship between the columns and the skew in the data can make it difficult for the optimizer to calculate the cardinality correctly when these columns are used together in a query.

initial_query_and_plan.png
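
The screenshot itself is not reproduced here. As a rough sketch, a query of the following shape would exercise the estimate discussed in this post. The exact statement in the screenshot may differ; this version assumes the SH sample schema is installed and that you are connected as SH.

```sql
-- Sketch only: assumes the SH sample schema and a session connected as SH.
-- Ask the optimizer for a plan for a query that filters on both correlated columns.
EXPLAIN PLAN FOR
SELECT COUNT(*)
FROM   customers
WHERE  cust_state_province = 'CA'
AND    country_id = 52790;

-- Display the plan; the Rows column holds the optimizer's cardinality estimate.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Comparing the Rows column against the real result of the COUNT(*) shows how far off the estimate is.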



With just basic statistics, we see the optimizer thinks only 127 rows will be returned, because it assumes each column will independently reduce the number of rows returned (number of rows in the table x 1/NDV of column 1 x 1/NDV of column 2, where NDV is the number of distinct values). We know that this is not true in this case. We must provide better statistics to the optimizer so it can determine the correct cardinality estimate. Prior to Oracle Database 11g, the only option open to us would be to make the optimizer aware of the data skew in both the country_id column (most rows have 52790 as the value) and the cust_state_province column (most rows have 'CA' as the value). We can do this by gathering histograms on the skewed columns.
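
To make the independence formula above concrete, the following sketch recomputes it from the dictionary statistics. It assumes basic statistics already exist on SH.CUSTOMERS and that you are connected as SH. The result need not match the plan exactly, since the optimizer also accounts for nulls and for any histograms that are already present.

```sql
-- Sketch only: recompute num_rows x 1/NDV(col1) x 1/NDV(col2) from the stored statistics.
SELECT t.num_rows,
       c1.num_distinct AS ndv_country_id,
       c2.num_distinct AS ndv_cust_state_province,
       ROUND(t.num_rows / (c1.num_distinct * c2.num_distinct)) AS independent_estimate
FROM   user_tables t
JOIN   user_tab_col_statistics c1
       ON  c1.table_name  = t.table_name
       AND c1.column_name = 'COUNTRY_ID'
JOIN   user_tab_col_statistics c2
       ON  c2.table_name  = t.table_name
       AND c2.column_name = 'CUST_STATE_PROVINCE'
WHERE  t.table_name = 'CUSTOMERS';
```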

histogram_stats.png
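
A call along the following lines would gather histograms on the two skewed columns. This is a sketch rather than the exact command from the screenshot: the bucket count of 254 and the choice to leave every other column at SIZE AUTO are assumptions.

```sql
-- Sketch only: request histograms (up to 254 buckets) on the two skewed columns,
-- while letting Oracle decide about histograms for all other columns.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SH',
    tabname    => 'CUSTOMERS',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 COUNTRY_ID, CUST_STATE_PROVINCE');
END;
/
```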


Now that we have histograms on both the country_id and the cust_state_province columns, let's see if the optimizer's estimate is more accurate.

histogram_plan.png


There is a slight improvement in the estimate now that we have histograms on the individual columns. However, the optimizer is still not aware that there is a relationship or correlation between these two columns. In Oracle Database 11g we can tell the optimizer about this correlation by creating extended statistics on these two columns as a group. Once the extended statistics have been created, the next time statistics are gathered on the CUSTOMERS table an extra set of statistics, for the combined group of country_id and cust_state_province, will be collected. The DBMS_STATS.CREATE_EXTENDED_STATS function can be used to create the extended statistics, or 'column group', for country_id and cust_state_province.

create_extended_stats.png
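
For readers without the screenshot, creating the column group could look roughly like this. The owner SH and the order of the columns inside the extension string are assumptions; the statement in the screenshot may differ.

```sql
-- Sketch only: CREATE_EXTENDED_STATS is a function, so it can be called from a query.
-- It returns the system-generated name of the new column group.
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(
         'SH',
         'CUSTOMERS',
         '(COUNTRY_ID,CUST_STATE_PROVINCE)') AS column_group_name
FROM   dual;

-- The column group has no statistics until the table statistics are regathered.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'CUSTOMERS');
END;
/
```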


After creating the extended statistics and regathering statistics, you will see a system-generated column name in USER_TAB_COL_STATISTICS, which represents the new column group. A subset of statistics is maintained for column groups, including:
  • Number of distinct values
  • Number of Nulls
  • Histograms

system_generated_stats.png
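
The screenshot above shows the dictionary output. Queries of roughly this shape would list the column group and its statistics, assuming a session connected as SH. The column list chosen here is illustrative and is not necessarily what the screenshot displayed.

```sql
-- Sketch only: map the system-generated name back to the columns it covers.
SELECT extension_name, extension
FROM   user_stat_extensions
WHERE  table_name = 'CUSTOMERS';

-- Show the statistics kept for each column, including the column group's hidden column.
SELECT column_name, num_distinct, num_nulls, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'CUSTOMERS'
ORDER  BY column_name;
```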


So now that we have statistics on the column group, let's confirm that this is enough information for the optimizer to get the correct estimate.

Not_working.png

The cardinality estimate is still off. Why did the extended statistics not help in this case? The extended statistics were not actually used here. If you look back at the output from USER_TAB_COL_STATISTICS you can see there is a histogram created on the country_id and cust_state_province columns. However, there is no histogram created on the column group. Because a histogram provides the Optimizer with more information than standard statistics the Optimizer ignores the extended statistics and uses the individual column statistics instead.

Since we have executed the query again, the optimizer will have recorded that a histogram on the column group would be beneficial for the query. A histogram will be automatically created on the column group the next time statistics are gathered on the table.

p1.png
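
As a sketch of that step, the following regathers statistics and then checks whether the column group now has a histogram. It assumes the default METHOD_OPT (FOR ALL COLUMNS SIZE AUTO) is in effect and that the column group was created with the extension string shown in the call; both are assumptions rather than details taken from the screenshot.

```sql
-- Sketch only: regather with the default settings so Oracle can act on the recorded column usage.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS('SH', 'CUSTOMERS');
END;
/

-- Look up the column group by its system-generated name and check its histogram type.
SELECT column_name, histogram
FROM   user_tab_col_statistics
WHERE  table_name  = 'CUSTOMERS'
AND    column_name = DBMS_STATS.SHOW_EXTENDED_STATS_NAME(
                       'SH', 'CUSTOMERS', '(COUNTRY_ID,CUST_STATE_PROVINCE)');
```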


With the histogram in place on the column group, the Optimizer will now use the extended statistics, and the cardinality estimate is now accurate.

working.png


Extended statistics are used even if we have them for only a subset of the predicates in the statement. Let's say there is a column group created on (c1, c2) and we have a SQL statement with a where clause that contains c1 = 1 and c2 = 1 and c3 = 1. The optimizer will use the extended statistics on (c1, c2) and multiply the resulting selectivity by the selectivity of the third predicate. It will use all the available statistics on the column group, including histograms. The optimizer will also use the extended statistics (in a limited way) if only a subset of the column group is present in the query.

Comments:

Maria, maybe you could post an article about the Cardinality Feedback sometime in the future. There is little known about this new feature. Thank you in advance. Regards, Martin

Posted by Martin Decker on March 17, 2011 at 05:03 PM PDT #

Hello, Very interesting article. I was wondering why extended statistics cannot be used for correlated columns across two tables, in the case of a join that leads to a bad cardinality estimate due to the assumption that the columns are not correlated?

Posted by guest on May 09, 2011 at 12:47 AM PDT #

Hi Martin, We took your advice and blogged about Cardinality Feedback in our latest post. http://blogs.oracle.com/optimizer/entry/cardinality_feedback Hope you found it useful! Thanks, Maria

Posted by Maria Colgan on June 02, 2011 at 02:21 AM PDT #

Thx
very nice and useful article.
Will try this.

Posted by Anjul on July 02, 2011 at 04:46 PM PDT #

Great Article!
I have a question though. When extended statistics are created, Oracle assigns a system-generated name to the extension, which is fine. However, is it possible to import the extended stats into another schema (or from UAT to production, for example)? Will they work with the system-generated name?

Again, Thanks for sharing the knowledge.
Cheers!

Posted by guest on September 10, 2011 at 06:44 PM PDT #

Extended statistics can be exported and imported on another system just like all other column statistics. The system-generated name of the extended statistics is based on the column names, so the same name will be generated on all systems. This ensures we will be able to use the extended statistics on any system.

Posted by Maria Colgan on September 12, 2011 at 11:50 AM PDT #

Very Nice Article

Posted by feras on September 30, 2011 at 09:20 AM PDT #

Dear Ms. Colgan;

On your "Understanding Optimizer Statistics" paper, dated January 2012, you write:

"...This whitepaper is the first in a two part series on Optimizer statistics..."

Has the second part come out? Given the quality of the first part, am very eager to get my hands on the second.

Thank you.

Posted by Ross on March 26, 2012 at 09:48 AM PDT #

Hi Ross,

I am hoping to post the second part of the statistics paper on Oracle.com later this week. As soon as I post it I will add a link to the paper on the blog.

Thanks,
Maria

Posted by Maria Colgan on March 26, 2012 at 09:50 AM PDT #

If you are stuck on 10g, another option is to partition your table. e.g. one partition for country USA, default partition for others. Oracle will collect statistics on a per-partition basis.

Use of a literal will allow it to come up with a plan specific to the partition you have nominated. In addition to this, a histogram would provide estimates on the skew within a partition.

Posted by WW on August 28, 2012 at 07:49 PM PDT #

Hi ,

Nice article, but I have a question on the example provided in this article. You showed that extended stats work better with a histogram. Most of the time, our queries use bind variables and not literals. Due to this, it's not feasible for us to create histograms. Would extended stats still help us, or do we need to follow some other approach?

Thanks

Posted by guest on February 27, 2013 at 11:22 AM PST #

Yes, extended statistics will still help you if you have multiple columns that are used together in a query and those columns are correlated.

Have you considered revising your approach to histograms and binds since the introduction of Adaptive Cursor Sharing?

Thanks,
Maria

Posted by guest on February 27, 2013 at 01:28 PM PST #

Very nice article Maria, thank you very much.

Regards,
Srikar

Posted by guest on March 26, 2013 at 03:51 AM PDT #

Is it possible to force a histogram onto an extended stat if column usage is not getting the histogram to appear? Thanks

Posted by Fran on April 24, 2013 at 09:45 AM PDT #

Hi Fran,

Yes, it is possible to force a histogram on a column group using the method_opt parameter in the gather_table_stats procedure. You just need to know the name of the virtual column that represents the column group you are interested in. Below is an example of how to do it.

However, I have never come across a scenario where we haven't created a histogram on the column after the column usage information got recorded for the column group.

Example
========

SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
SYS_STUWMBUN3F$#398R7BS0YVS86R NONE
CUST_ID                        NONE
CUST_FIRST_NAME                NONE
CUST_LAST_NAME                 NONE
CUST_GENDER                    FREQUENCY
CUST_YEAR_OF_BIRTH             NONE
CUST_MARITAL_STATUS            NONE
CUST_STREET_ADDRESS            NONE
CUST_POSTAL_CODE               HEIGHT BALANCED
CUST_CITY                      HEIGHT BALANCED
CUST_STATE_PROVINCE            FREQUENCY
COUNTRY_ID                     NONE
CUST_MAIN_PHONE_NUMBER         NONE
CUST_INCOME_LEVEL              NONE
CUST_CREDIT_LIMIT              NONE
CUST_EMAIL                     NONE

So, to create a histogram on the virtual column SYS_STUWMBUN3F$#398R7BS0YVS86R I would use the following gather_table_stats command.

BEGIN
  dbms_stats.gather_table_stats('SH', 'CUSTOMERS',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 SYS_STUWMBUN3F$#398R7BS0YVS86R');
END;
/

SQL> select column_name, histogram from user_tab_col_statistics where table_Name='CUSTOMERS';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
SYS_STUWMBUN3F$#398R7BS0YVS86R HEIGHT BALANCED
CUST_ID                        NONE
CUST_FIRST_NAME                NONE
CUST_LAST_NAME                 NONE
CUST_GENDER                    FREQUENCY
CUST_YEAR_OF_BIRTH             NONE
CUST_MARITAL_STATUS            NONE
CUST_STREET_ADDRESS            NONE
CUST_POSTAL_CODE               HEIGHT BALANCED
CUST_CITY                      HEIGHT BALANCED
CUST_STATE_PROVINCE            FREQUENCY
COUNTRY_ID                     NONE
CUST_MAIN_PHONE_NUMBER         NONE
CUST_INCOME_LEVEL              NONE
CUST_CREDIT_LIMIT              NONE
CUST_EMAIL                     NONE

Posted by guest on April 24, 2013 at 09:59 AM PDT #

About

The Oracle Optimizer blog is written by members of the Optimizer development team. The goal of this blog is to provide an insight into the workings of the Optimizer and the statistics it relies on. The views expressed on this blog are our own and do not necessarily reflect the views of Oracle and its affiliates. The views and opinions expressed by visitors on this blog are theirs solely and may not reflect ours.
