Lies, damned lies, and statistics

There was huge interest in our OOW session on Managing Optimizer Statistics. It seems statistics and the maintenance of them continues to baffle people. However, understanding and managing Optimizer statistics is key to optimal SQL execution. Knowing when and how to gather statistics in a timely manner is critical to maintaining acceptable performance. In order to clarify all of the information surrounding statistics we have put together a two part whitepaper series on Optimizer statistics and the management of them.

Today we have published part one of the series, which describes in detail, with worked examples, the different concepts of Optimizer statistics. Part two of the series will focus on the best practices for gathering statistics in specific cases, and will examine the fears and misconception that surround certain types of statistics. We hope to publish part two by the end of this year.

Here is a quick look at the Introduction and the start of the paper. You can find the full paper here.

Introduction

When the Oracle database was first introduced the decision of how to execute a SQL statement was determined by a Rule Based Optimizer (RBO). The Rule Based Optimizer, as the name implies, followed a set of rules to determine the execution plan for a SQL statement. The rules were ranked so if there were two possible rules that could be applied to a SQL statement the rule with the lowest rank would be used.

In Oracle Database 7, the Cost Based Optimizer (CBO) was introduced to deal with the enhanced functionality being added to the Oracle Database at this time, including parallel execution and partitioning, and to take the actual data content and distribution into account. The Cost Based Optimizer examines all of the possible plans for a SQL statement and picks the one with the lowest cost, where cost represents the estimated resource usage for a given plan. The lower the cost the more efficient an execution plan is expected to be. In order for the Cost Based Optimizer to accurately determine the cost for an execution plan it must have information about all of the objects (tables and indexes) accessed in the SQL statement, and information about the system on which the SQL statement will be run.

This necessary information is commonly referred to as Optimizer statistics. Understanding and managing Optimizer statistics is key to optimal SQL execution. Knowing when and how to gather statistics in a timely manner is critical to maintaining acceptable performance. This whitepaper is the first in a two part series on Optimizer statistics, and describes in detail, with worked examples, the different concepts of Optimizer statistics including;

  • What are Optimizer statistics
  • Gathering statistics
  • Managing statistics
  • Additional types of statistics

What are Optimizer Statistics?

Optimizer statistics are a collection of data that describe the database, and the objects in the database. These statistics are used by the optimizer to choose the best execution plan for each SQL statement. Statistics are stored in the data dictionary, and can be accessed using data dictionary views such as USER_TAB_STATISTICS. Optimizer statistics are different from the performance statistics visible through V$ views. The information in the V$ views relates to the state of the system and the SQL workload executing on it.

Figure 1:Optimizer statistics stored in the data dictionary used by the Optimizer to determine execution plans

Table and Column Statistics

Table statistics include information on the number of rows in the table, the number of data blocks used for the table, as well as the average row length in the table. The optimizer uses this information, in conjunction with other statistics, to compute the cost of various operations in an execution plan, and to estimate the number of rows the operation will produce. For example, the cost of a table access is calculated using the number of data blocks combined with the value of the parameter DB_FILE_MULTIBLOCK_READ_COUNT. You can view table statistics in the dictionary view USER_TAB_STATISTICS.

Column statistics include information on the number of distinct values in a column (NDV) as well as the minimum and maximum value found in the column. You can view column statistics in the dictionary view USER_TAB_COL_STATISTICS. The optimizer uses the column statistics information in conjunction with the table statistics (number of rows) to estimate the number of rows that will be returned by a SQL operation. For example, if a table has 100 records, and the table access evaluates an equality predicate on a column that has 10 distinct values, then the optimizer, assuming uniform data distribution, estimates the cardinality to be the number of rows in the table divided by the number of distinct values for the column or 100/10 = 10.


You can find the complete paper here. Part 2 should be coming soon. Happy Reading!

Maria Colgan+

Comments:

Hi Maria,

The Paper on Optimizer and statistics is very informative.

Thanks a lot,
baskar.l

Posted by baskar on November 14, 2011 at 03:40 PM PST #

The comment on Height Based Histogram that for Column with nearly Unique values, the histgrams are only automatically created if there are Range Predicate queries seems to be incorrect. I tested this in my test database, both 10g and 11g, the histograms are created even when the queries against these columns are just EQUALITY predicate columns.

Can you please validate this ?

Posted by Vivek Sharma on November 24, 2011 at 01:26 AM PST #

Hi Maria,
I would like your view on the use of cardinality feedback in 11.2.0.2. We have encountered serveral performance issues over the past number of months where a query will suddenly choose a different poorly optimized plan. Reviewing the different plans for these queries I could see it was using cardinality feedback. We have also encountered a situation where an incorrect value was being returned for a count(*) on a view this was attributed to the use of cardinality feedback by Oracle support.

Regards Jean

Posted by guest on February 17, 2012 at 04:23 AM PST #

Hi Maria,

First of all I would like to thank you for your effort in helping Oracle Database professionals in understanding the Oracle Database optimizer better.

I have few questions...

1> When we gather system statistics using dbms_stats.gather_system_stats procedure, what exactly Oracle does in the background.

2> How is noworkload statistics help?

3> What happens when we collect system statistics during peak hour.

4> What is the performance overhead of the above procedure.

5> Is it fine if we gather system statistics every 6 hours(or N hours) since we call this procedure from our application which can not predict the peak hours.

Please help me in getting my concepts right.

Posted by guest on March 29, 2012 at 10:21 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

The Oracle Optimizer blog is written by members of the Optimizer development team. The goal of this blog is to provide an insight into the workings of the Optimizer and the statistics it relies on. The views expressed on this blog are our own and do not necessarily reflect the views of Oracle and its affiliates. The views and opinions expressed by visitors on this blog are theirs solely and may not reflect ours.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today