Introduction

This blog post is the third in a series. The overarching concept, covered in the previous posts, is:

  • If a SQL statement performs poorly, you should first consider whether or not cardinality misestimates are the root cause.

This post explains:

  • Why extended statistics cannot always improve cardinality estimates
  • What you can do if extended statistics don’t help
  • Why you might want to test a problem SQL statement with dynamic statistics

Take a closer look at your ‘problem’ SQL

I mentioned earlier in this series that, for some query predicates, the optimizer needs help to estimate cardinality accurately. For example:

select *
from items
where substr(item_type_code,2,2) = 'ZZ';

Queries like this suggest (to me) that there’s an issue with the database schema: the item_type_code is probably non-atomic, a composite of multiple codes concatenated together rather than stored in separate columns. A schema change would be ideal, but that is rarely practical, in which case expression statistics can be the answer instead.
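
For illustration, here is a minimal sketch of how expression statistics could be created for that predicate. It assumes an ITEMS table with an ITEM_TYPE_CODE column, much like the example; it is not taken from the original application.

--
-- Sketch: create expression statistics on the SUBSTR expression,
-- then re-gather table statistics so the optimizer can use them
--
declare
   ext_name varchar2(128);
begin
   ext_name := dbms_stats.create_extended_stats(
                  ownname   => user,
                  tabname   => 'ITEMS',
                  extension => '(substr(item_type_code,2,2))');
   dbms_stats.gather_table_stats(
                  ownname    => user,
                  tabname    => 'ITEMS',
                  method_opt => 'for all columns size auto');
end;
/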

Extended statistics are not a panacea. For example, column groups require the relevant columns to appear in equality (“=”) predicates. Here, I will present a case that is not fixable with expression statistics, one I saw in a genuine (and complex) application query. Before that, I want to reassure you that the Oracle Optimizer is excellent at estimating the selectivity of LIKE predicates that are equivalent to range filters. For example:

select count(*)
from  items
where item_type_code like 'A1%';

The next case is more challenging because the extra “%” means that the LIKE is no longer equivalent to a range filter:

select count(*)
from  items
where item_type_code like '%A1%';

This query has only one plan on my database: a FULL scan of ITEMS, so the quality of the cardinality estimate cannot affect the plan. The point of the example is not to demonstrate performance problems; instead, it explores how discrepancies can arise between the optimizer’s row count estimate and the actual row count. In more complex real-world queries, this can sometimes lead to suboptimal performance.

You can try the example for yourself, but expect your row counts to differ from mine because the item type code is generated randomly:

--
-- Create a test table with a pseudo-random item type code
--
create table items as
select rownum id,
       dbms_random.string('U',2)||(mod(rownum,91)+9) item_type_code
from   dual connect by rownum < 200000;
--
-- Enable stats so we see E-Rows and A-Rows in the plan
--
alter session set statistics_level = all;
--
-- Our test query
--
select count(*) from items where item_type_code like '%A1%';

  COUNT(*)
----------
       874

select *
from   table(DBMS_XPLAN.DISPLAY_CURSOR
                  (format=>'allstats last'));
---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |       |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ITEMS |      1 |  10000 |    874 |
---------------------------------------------------------------
(I have removed some text from the plan above for brevity)

The COUNT(*) query returned 874, and A-Rows (the actual number of rows seen during execution) agrees with this, but the optimizer estimated 10000 (E-Rows). A bug? No!

For homework, see if you can beat the optimizer and think of a method to calculate an accurate estimate for the number of rows ‘%A1%’ will match. You have access to a statistical summary: number of rows, number of nulls, min, max, average column value length, and number of distinct values. The optimizer doesn’t know the method used to generate the data, so it would be cheating to use this knowledge to your advantage.

The optimizer arrived at 10000 via a built-in rule, used because it could not derive an accurate estimate from the available statistics. A cost-based optimizer still needs rules! The value 10000 is not fixed; it will change if you adjust the number of rows in the table.
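
As an aside, in my example the guess works out at roughly five percent of the table’s row count. That is an observation about this particular case rather than a documented formula, but you can check the arithmetic for yourself:

--
-- Roughly 5% of NUM_ROWS in this example (an observation, not a guarantee)
--
select num_rows, round(num_rows * 0.05) as five_percent
from   user_tables
where  table_name = 'ITEMS';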

Expression statistics won’t help here; the predicate does not use an expression on the column.

The take-away point here is that it is challenging to estimate selectivity accurately for some predicates; basic statistics, and even extended statistics, cannot help in every case. The optimizer uses predicate selectivity to calculate cardinality, so a poor selectivity estimate can lead to a suboptimal plan, and a suboptimal plan can result in noticeably poor performance.

How can you identify misestimates?

You can spot misestimates using SQL Monitor or DBMS_XPLAN by looking for discrepancies between estimated and actual row counts (E-Rows vs. A-Rows), as in the example above. For more complex plans, you must allow for “Starts,” where a plan operation is executed more than once (see the Starts column in the plan above): the row estimate is usually per start, whereas A-Rows is a total across all starts. With practice, inspecting SQL Monitor reports or DBMS_XPLAN output (with plan statistics) will help you identify problem areas in the SQL execution plan.
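
If you are licensed to use SQL Monitor (it requires the Tuning Pack), a text report is easy to pull for a given SQL ID. This is a generic sketch; substitute your own SQL ID:

--
-- Sketch: fetch a SQL Monitor report for a SQL ID of your choice
--
select dbms_sqltune.report_sql_monitor(
          sql_id => '&sql_id',
          type   => 'TEXT') as report
from dual;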

You can manually inspect SQL statement WHERE clauses, but it isn’t easy to be specific about what to look for because the database has ways of dealing with many complex cases. For example, I used to think that CASE expressions in WHERE clauses would always make accurate costing impossible; who knows what might happen at runtime? Nevertheless, the database can often generate reasonable selectivity estimates for CASE expressions.

I like to experiment, so I prefer to isolate predicates I am unsure about and try them with COUNT(*) queries on test tables (much like the example above). It is usually easy to do this, and the effort is often well spent. The same pattern is often repeated in many SQL statements (shining a spotlight into awkward corners of the database schema). If you have a working example, you can immediately try extended statistics to see if they help. It is often possible to identify a small number of problem areas that, when resolved, will fix many problem SQL statements at once.

Having said all of this, there is no need to over-think it. You probably want to avoid a lot of effort and cut straight to the chase; more on this below.

Why can predicates be so complicated?

Complex predicates and poor SQL performance in OLTP and ODS-style databases can sometimes be a side-effect of a poorly designed schema or an evolved application. The content and meaning of column data might change over time, usually in response to changing business requirements. Ultimately, SQL statements may need to accommodate a less-than-perfect schema, and complex query predicates can be the end result. The good news is that extended statistics will often help in cases like this.

Data warehouse/BI/DSS queries and ELT SQL statements are often inherently complex, with many conditional predicates, aggregations, and multiple table joins.  Accurate costing will be a challenge even if the schema is exceptionally well designed. It would be fantastic if the database could measure cardinality at parse-time, especially for systems that have long-running, complex SQL statements. You can be forgiven for thinking that this is chicken-and-egg, but luckily for us, the Oracle Database has a way to do it.

What is the solution?

Here is something you can try:

-- Setting to allow us to see E-Rows and A-Rows in the plan
alter session set statistics_level = all;

select /*+ dynamic_sampling(5) */ count(*)
from items where item_type_code like '%A1%';

  COUNT(*)
----------
       874

select *
from   table(DBMS_XPLAN.DISPLAY_CURSOR(
       format=>'allstats last'));
---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |       |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ITEMS |      1 |    858 |    874 |
---------------------------------------------------------------
Note
-----
   - dynamic statistics used: dynamic sampling (level=5)

We now have an estimate of 858 rows against an actual of 874. Pretty good. What is happening? The hint directs the optimizer to sample the data accessed by the query to estimate how many rows ‘%A1%’ is likely to match. The number 5 in parentheses is the dynamic sampling level, which tells the optimizer how and what to sample; more on this below.

That looks too easy. What’s the catch?

Sampling data takes time.

When a SQL statement is hard parsed, and dynamic sampling is enabled, the optimizer will transparently execute some SQL statements to sample the underlying data. This is the process of dynamic sampling and is used to gather dynamic statistics. The optimizer uses dynamic statistics to adjust cardinality estimates. Sampling queries take time to execute, so the hard parse will take longer than it would otherwise.

The system overhead for dynamic sampling will depend on the hard parse rate and the dynamic sampling level, which controls the type of sampling used and the number of database blocks accessed. The default level is two, allowing sampling to kick in if a SQL statement accesses objects without statistics. Changing this value database-wide or container-wide is a significant change. Higher levels will increase hard parse times and potentially affect many SQL execution plans. If hard parse rates are low, the parse-time overhead may be completely negligible, especially if SQL statements are relatively long-running and executed multiple times per hard parse. Nevertheless, it is best to be selective about when and where it is used: target specific SQL statements or (after testing) database sessions that fulfill a particular business function (e.g., end-of-day ELT).

Larger sample sizes will take longer, but the results will be more accurate and less likely to be led astray by skewed data values. Sampling a skewed dataset (rather than looking at the whole set) risks missing significant data values.

Dynamic statistics will not resolve all cases of cardinality misestimate; the feature must balance the extra parse time against how much can be achieved in such a short space of time. Also, remember that you can use the SQL Tuning Advisor if you have the appropriate license. It can spend more time looking for alternative plans and correcting misestimates, and it will create a SQL profile to ‘repair’ your plan without the need for dynamic statistics.
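
For completeness, here is a rough sketch of running the SQL Tuning Advisor against a single statement (it assumes the Tuning Pack license; the task name is made up, and you should substitute your own SQL ID):

--
-- Sketch: run the SQL Tuning Advisor for one SQL ID
--
var tname varchar2(64)
exec :tname := dbms_sqltune.create_tuning_task(sql_id => '&sql_id', task_name => 'my_tuning_task')
exec dbms_sqltune.execute_tuning_task(:tname)
select dbms_sqltune.report_tuning_task(:tname) from dual;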

Dynamic statistics are not intended to replace gathered statistics; the feature is designed to improve the quality of optimizer costing.

Testing dynamic sampling

If you test dynamic sampling on a workload, be careful about how you interpret the results. If workload SQL statements don’t take very long to complete, an increase in parse time could affect end-to-end times enough to make results look poor. Dynamic statistics are collected during hard parse, not soft parse. Suppose you want to test SQL statements that would normally execute many times in a production system. Your test profile should be similar: one hard parse and multiple executions (with or without soft parse). This will reduce the significance of the hard parse overhead and more accurately represent the behavior of the production workload.
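
For example, here is a minimal sketch of that kind of test profile (the choice of 100 executions and the pattern variable are arbitrary): the query inside the loop is hard parsed on the first iteration and the cursor is reused for the remaining executions.

--
-- Sketch: one hard parse, many executions
--
declare
   l_pattern varchar2(10) := '%A1%';
   l_count   number;
begin
   for i in 1 .. 100 loop
      select /*+ dynamic_sampling(5) */ count(*)
        into l_count
        from items
       where item_type_code like l_pattern;
   end loop;
end;
/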

You might find that dynamic statistics consistently delivers good-enough performance rather than the best possible performance for every SQL statement. Take a holistic view: in my experience, some fast queries may take a little longer to complete, but longer-running SQL benefits significantly. If you value consistent, good-enough performance over everything else, bear this in mind when looking over your test results.

Controlling dynamic sampling

You can set the dynamic sampling level for a database, container, or individual session using the OPTIMIZER_DYNAMIC_SAMPLING parameter. For example:

alter session set optimizer_dynamic_sampling = 5;

There are a couple of hints and hint patterns, but I suggest that you keep it simple:

select /*+ dynamic_sampling(5) */ count(*)
from items where item_type_code like '%A1%';

You can use a SQL patch to apply hints to individual SQL statements, avoiding the need to change the application:

-- Get the SQL ID for our query (yours might be different)
select sql_id
from   v$sql
where  sql_text like 'select count(*) from items%';

SQL_ID
-------------
0w1kf2nh7qwz7

declare
   n varchar2(100);
begin
  n := dbms_sqldiag.create_sql_patch(
         sql_id    =>'0w1kf2nh7qwz7',
         hint_text =>'dynamic_sampling(11)',
         name      =>'mypatch');
end;
/

select count(*) from items where item_type_code like '%A1%';
select * from table(DBMS_XPLAN.DISPLAY_CURSOR());
--------------------------------------------
| Id  | Operation          | Name  | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT   |       |       |
|   1 |  SORT AGGREGATE    |       |     1 |
|*  2 |   TABLE ACCESS FULL| ITEMS |   858 |
--------------------------------------------

Note
-----
   - dynamic statistics used: dynamic sampling (level=AUTO)
   - SQL patch "mypatch" used for this statement

If you used a SQL patch and you want to remove the overhead of dynamic sampling, it is easy to capture the plan in a SQL plan baseline and then drop the SQL patch:

var n number
begin
   :n := dbms_spm.load_plans_from_cursor_cache
                      (sql_id=>'0w1kf2nh7qwz7');
end;
/

exec dbms_sqldiag.drop_sql_patch('mypatch')

select count(*) from items where item_type_code like '%A1%';
select * from table(DBMS_XPLAN.DISPLAY_CURSOR());
---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |       |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ITEMS |      1 |  10000 |    874 |
---------------------------------------------------------------

Note
-----
   - SQL plan baseline SQL_PLAN_4wpcgptm5rpbt43e9edce used for this statement

After dropping the SQL patch, “E-Rows” reverts to the misestimated value (10000 in my case). There is no longer any dynamic sampling, but the misestimate will not affect the plan because the SQL plan baseline is now in control.

Another possible scenario is that you have a session setting of level 8 (for example), and most SQL statements run very well. Nevertheless, one or two queries do not perform well with dynamic statistics (which can happen in rare cases), or perhaps they need to target level 11 instead. You can override the session setting with a hint or a SQL patch. In this example, dynamic sampling is disabled for a single query:

declare
   n varchar2(100);
begin
  n := dbms_sqldiag.create_sql_patch(
    sql_id    =>'0w1kf2nh7qwz7',
    hint_text =>'dynamic_sampling(0)',
    name      =>'disqry9');
end;
/

If you have been trying this out, here’s the post-test SQL patch and SQL plan baseline cleanup:

var n number
exec dbms_sqldiag.drop_sql_patch('mypatch')
exec dbms_sqldiag.drop_sql_patch('disqry9')
begin
   :n := dbms_spm.drop_sql_plan_baseline(
        plan_name=>
           'SQL_PLAN_4wpcgptm5rpbt43e9edce'); /* from above */
end;
/

What dynamic sampling level should you use?

Increasing dynamic sampling levels up to 10 will cause the optimizer to sample more blocks, increase parse times, and generally derive more accurate results. Level 11 samples data differently. It may execute multiple samples and does more work to handle skew, group-by, and table join cardinality misestimates (more on this below).

Targeting an individual query

Try levels 8, 10, and 11 to see if dynamic statistics help. If these tests do not improve the situation, it is unlikely that dynamic sampling will be a solution. Level 10 reads all blocks, which is OK in a test scenario, but it can take a long time and is not really intended to be a full-time setting.

After testing higher levels, try testing lower levels to see how sensitive the plan is to sample size. Lower levels may yield a good plan and keep the parse-time overhead very low. If you consider it necessary to keep the hard parse overhead down, use the lower level.
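
For example, you can hint the test query at a few candidate levels and compare E-Rows with A-Rows for each run. The levels below are just a sketch; adjust them to suit:

alter session set statistics_level = all;

select /*+ dynamic_sampling(4) */ count(*) from items where item_type_code like '%A1%';
select * from table(DBMS_XPLAN.DISPLAY_CURSOR(format=>'allstats last'));

select /*+ dynamic_sampling(8) */ count(*) from items where item_type_code like '%A1%';
select * from table(DBMS_XPLAN.DISPLAY_CURSOR(format=>'allstats last'));

select /*+ dynamic_sampling(11) */ count(*) from items where item_type_code like '%A1%';
select * from table(DBMS_XPLAN.DISPLAY_CURSOR(format=>'allstats last'));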

For individual longer-running queries that are not frequently hard-parsed, I would generally recommend levels 8, 9, or 11. You might notice that a large number of internal Oracle queries use level 11.

An entire workload

Generally speaking, dynamic sampling is most appropriate for longer-running queries that access large volumes of data. It is less likely to be suitable for OLTP-style workloads where SQL statements should terminate in seconds (or sub-second), where we risk taking longer to parse a query than to execute it.

If you intend to use dynamic sampling level 3 or more for an entire workload, you should test it first. I have encountered customer and commercial (non-OLTP) applications running with levels 4 through 6. I generally recommend testing levels 8, 9, and 11, too, if possible. If you are licensed to use SQL Performance Analyzer, it is a great way to test the effects of dynamic statistics on a workload captured in a SQL tuning set.
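
If you do have that license, the outline looks something like this. It is a rough sketch, assuming a SQL tuning set called MY_STS already exists; MY_STS and my_spa_task are made-up names, and changing the session-level parameter between trials is just one way to vary the sampling level:

--
-- Rough sketch: compare a SQL tuning set workload before and after
-- changing the dynamic sampling level (requires the relevant license)
--
declare
   l_task varchar2(64);
   l_exec varchar2(64);
begin
   l_task := dbms_sqlpa.create_analysis_task(
                sqlset_name => 'MY_STS',
                task_name   => 'my_spa_task');

   -- Trial 1: the current dynamic sampling level
   l_exec := dbms_sqlpa.execute_analysis_task(
                task_name      => l_task,
                execution_type => 'TEST EXECUTE',
                execution_name => 'before');

   -- Change the level for this session, then run trial 2
   execute immediate 'alter session set optimizer_dynamic_sampling = 8';
   l_exec := dbms_sqlpa.execute_analysis_task(
                task_name      => l_task,
                execution_type => 'TEST EXECUTE',
                execution_name => 'after');

   -- Compare the two trials
   l_exec := dbms_sqlpa.execute_analysis_task(
                task_name      => l_task,
                execution_type => 'COMPARE PERFORMANCE',
                execution_name => 'diff');
end;
/

select dbms_sqlpa.report_analysis_task('my_spa_task', type => 'TEXT') from dual;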

It is not just complex predicates

Throughout this series of posts, I have concentrated on query predicates. The reason for this is that I see so many WHERE clause predicates that make accurate costing non-viable (unless action is taken). In addition, I want SQL developers to think carefully about the predicates they use and consider whether schema improvements would make them simpler. A bug is often suspected when a query performs poorly, especially if a plan changes after an upgrade. Nevertheless, if you care to look, the reason is often easy to see and sitting right there in the WHERE clause.

Now is the time to mention that accurate costing can be difficult for beautifully written queries accessing a near-perfect schema. Sometimes, an exemplary query will have cardinality issues associated with join and group-by misestimates. This will be the topic for the next blog post and, when published, I will provide a link here.

Statistics feedback

I need to clarify something you might observe if you execute the example above. If you run the test query more than once, you are likely to see the cardinality estimate settle to the correct value – something like this:

select count(*) from items where item_type_code like '%A1%'

---------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |
---------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |       |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| ITEMS |      1 |    874 |    874 |
---------------------------------------------------------------
Note
-----
   - statistics feedback used for this statement

This is statistics feedback, as indicated by the “Note” section of the plan. This topic is out of scope for this blog post, but you can see that the Oracle Database has more than one mechanism in place to bring queries into line and will, in many cases, rescue plans on the second execution. Dynamic statistics is designed to get the plan right on the initial execution and has a broader scope than statistics feedback.

A diagnostic tool

If dynamic sampling yields a good plan, then you might want to look more closely at the SQL statement to see if extended statistics could help (which might help many SQL statements in one go). Testing SQL statements with dynamic statistics is a valuable diagnostic tool. You can use it up-front if you are investigating a SQL statement performance issue.

You could argue that if you can resolve a SQL performance problem with dynamic statistics, there is no need to consider extended statistics. Perhaps, but taking time to consider extended statistics might solve many problems at once. It will reduce the risk of encountering problems with new SQL statements and reduce dependence on dynamic statistics overall (and its associated overheads).