A Higher-Level Perspective on SQL Tuning, Part 4

Common causes for poorly performing SQL

By Connor McDonald

August 12, 2019

In my previous article on SQL tuning, I discussed the importance of taking the correct approach to identifying the true execution plan for a SQL statement. Using the EXPLAIN PLAN command will sometimes yield the true plan, but with the optimizer advancements over the years, the only reliable way to observe the plan that will actually be used for the execution of a SQL statement is to (1) execute the statement and then (2) use DBMS_XPLAN.DISPLAY_CURSOR or Real-Time SQL Monitoring to access the true plan.

I also showed that the key to analyzing the execution plan was to understand the importance of cardinality to the optimizer. The concept of the “best” plan—the plan achieving the best performance from a SQL statement—is predominantly determined by how closely the optimizer estimates of the row cardinality for each step in the SQL plan match the actual rows yielded from those steps. Alignment of the estimated values versus actual values is a good indication that the plan is optimal, and, conversely, a large divergence between the estimated values and the actual values is a strong indicator that the optimizer may have chosen a suboptimal plan. I’ll spend this article covering the most common reasons why a SQL statement ends up with a poor plan and, hence, poor performance.

Stale or Missing Statistics

In a completely default installation of Oracle Database, a nightly automatic task will gather statistics on those objects in the database that do not have any optimizer statistics recorded or for which the existing statistics are deemed to be “stale”—that is, out of date. For most databases, this is sufficient, but exceptional circumstances may arise that require additional steps. For example, if a table changes in volume by a large amount during the day, an explicit refresh of the statistics may be required to ensure that subsequent SQL statement executions do not degrade from that point in time until the nightly statistics update task executes. Listing 1 shows an example of how stale statistics lead to a poor execution plan.

Listing 1: Stale statistics leading to poor execution plan
SQL> select /*+ gather_plan_statistics */ count(hiredate) 
  2  from   emp
  3  where  job = 'CLERK';

COUNT(HIREDATE)
---------------
         262144

1 row selected.

SQL> select *
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------
SQL_ID  5q4k4713n5whf, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(hiredate) from emp 
where job = 'CLERK'

Plan hash value: 2177616286

-------------------------------------------------------------------------------------------------
| Id |Operation                            | Name  |Starts|E-Rows |A-Rows |   A-Time   |Buffers |
-------------------------------------------------------------------------------------------------
|  0 |SELECT STATEMENT                     |       |     1|       |     1 |00:00:05.05 |   6204 |
|  1 | SORT AGGREGATE                      |       |     1|     1 |     1 |00:00:05.05 |   6204 |
|  2 |  TABLE ACCESS BY INDEX ROWID BATCHED| EMP   |     1|     3 |   262K|00:00:05.06 |   6204 |
|* 3 |   INDEX RANGE SCAN                  | JOB_IX|     1|     3 |   262K|00:00:05.02 |    831 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("JOB"='CLERK')

Using GATHER_PLAN_STATISTICS (covered in the previous article) to compare estimated versus actual rows, the optimizer was expecting only three rows (E-Rows) to match the criteria of JOB = 'CLERK'. A small number of rows typically directs the optimizer toward using an index to quickly locate that small subset, which is why the JOB_IX index was chosen. However, the A-Rows column shows that more than 262,000 rows actually matched the JOB = 'CLERK' criteria, so more than 262,000 rows were fetched from the table one ROWID at a time, leading to the five-second response time (A-Time). Such a large discrepancy between estimated and actual rows should prompt the person tuning the SQL to check whether the statistics for the table are a true reflection of the data, and the quick check in Listing 2 shows that this is not the case.

Listing 2: Statistics versus true table data

SQL> select count(*)
  2  from emp;

  COUNT(*)
----------
    917504

1 row selected.

SQL>
SQL> select num_rows
  2  from   user_tables
  3  where  table_name = 'EMP';

  NUM_ROWS
----------
        14

1 row selected.

The optimizer “thinks” there are only 14 rows in the table, whereas there are nearly one million (917,504). Clearly a large data load was performed without the statistics being regathered afterward. Listing 3 refreshes the statistics and shows the improvement in the execution plan.

Listing 3: Up-to-date statistics yield a better plan

SQL> exec dbms_stats.gather_table_stats('','EMP')

PL/SQL procedure successfully completed.

SQL> select /*+ gather_plan_statistics */ count(hiredate)
  2  from   emp
  3  where  job = 'CLERK';

COUNT(HIREDATE)
---------------
         262144

1 row selected.

SQL> select *
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------
SQL_ID  5q4k4713n5whf, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(hiredate) from emp where job = 'CLERK'

Plan hash value: 2083865914

---------------------------------------------------------------------------------
| Id | Operation          | Name |Starts |E-Rows |A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |     1 |       |     1 |00:00:02.02 |    5417 |
|  1 |  SORT AGGREGATE    |      |     1 |     1 |     1 |00:00:02.02 |    5417 |
|* 2 |   TABLE ACCESS FULL| EMP  |     1 |   262K|   262K|00:00:02.02 |    5417 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("JOB"='CLERK')

Of course, the addition of even a single row to a table renders the optimizer statistics “out of date,” but it is not practical to recalculate optimizer statistics every time a table is changed. Good communication between developers and administrators is essential so that both are kept informed of when additional statistics gathering operations may be required. Oracle Database 19c reduces the likelihood that out-of-date statistics will cause performance problems, with two new enhancements:

  • Standard data manipulation language (DML) is tracked, and some statistics are automatically adjusted to reflect the changes.

  • There’s a lightweight and more frequent gathering of statistics on those tables that get stale between the daily full statistics refresh cycles.
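Between nightly runs, you can check whether a table's statistics are flagged as stale directly from the data dictionary. Here is a minimal sketch, assuming the EMP table from the examples; because DML monitoring data is buffered in memory, it is flushed first so the staleness flag is current:

```sql
-- Flush in-memory DML monitoring data so the staleness flag is current.
exec dbms_stats.flush_database_monitoring_info

-- STALE_STATS becomes YES once roughly 10% of the rows have changed
-- since the last gather (the default STALE_PERCENT preference).
select table_name, num_rows, last_analyzed, stale_stats
from   user_tab_statistics
where  table_name = 'EMP';

-- The raw change counts that drive the staleness calculation.
select inserts, updates, deletes, truncated
from   user_tab_modifications
where  table_name = 'EMP';
```

If STALE_STATS shows YES after a large intraday load, an explicit DBMS_STATS.GATHER_TABLE_STATS call avoids waiting for the nightly task.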

Skewed Data

Because databases are designed to collect an ever-growing volume of data, much of that data is continuous in nature. For example, data attributes such as time stamps, sequence numbers, and sales amounts have no theoretical upper and lower limits beyond those imposed by the business definitions or the database software itself.

In many cases, there will be an even distribution of values between the lower and upper limits, and hence the probability that a candidate value will be present in a table column is the same as the probability that any other value will be present. However, data domains also exist that are discrete in nature—where the number of elements is finite and known in advance. In such circumstances, often the frequency distribution of the elements is not even across the domain. Listing 4 shows an example of this, with the discrete domain of JOB not evenly distributed across all possible values.

Listing 4: Skewed data for JOB

SQL> select job, count(*)
  2  from   emp
  3  group by job
  4  order by 1;

JOB              COUNT(*)
-------------- ----------
ANALYST            200000
CLERK              400000
DIRECTOR            10026
MANAGER            300000
SALESMAN           400000
SENIOR MANAGER      89974

By default, the optimizer will assume an even distribution of data across each domain, which can lead to variability in the quality of the plans it chooses. For queries on the JOB value of ANALYST, the actual and estimated values are comparable, as shown in Listing 5.

Listing 5: Good estimate for ANALYST

SQL> select /*+ gather_plan_statistics */ count(hiredate)
  2  from   emp
  3  where  job = 'ANALYST';

COUNT(HIREDATE)
---------------
         200000

1 row selected.

SQL>
SQL> select *
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------
SQL_ID  56x6jyj0pbr6w, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(hiredate) from emp where job = 'ANALYST'

Plan hash value: 2083865914

--------------------------------------------------------------------------------
| Id | Operation          | Name |Starts |E-Rows |A-Rows |   A-Time   |Buffers |
--------------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |     1 |       |     1 |00:00:00.02 |   8294 |
|  1 |  SORT AGGREGATE    |      |     1 |     1 |     1 |00:00:00.02 |   8294 |
|* 2 |   TABLE ACCESS FULL| EMP  |     1 |   233K|   200K|00:00:00.03 |   8294 |
--------------------------------------------------------------------------------

However, when the same query is performed for a different JOB position, such as when the JOB value is DIRECTOR, the comparison is much worse, as shown in Listing 6.

Listing 6: Poor estimate for DIRECTOR

SQL> select /*+ gather_plan_statistics */ count(hiredate)
  2  from   emp
  3  where  job = 'DIRECTOR';

COUNT(HIREDATE)
---------------
          10026

1 row selected.

SQL>
SQL> select *
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------
SQL_ID  gvudu9njvwn0a, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(hiredate) from emp where job = 'DIRECTOR'

Plan hash value: 2083865914

---------------------------------------------------------------------------------
| Id | Operation          | Name |Starts |E-Rows |A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |     1 |       |     1 |00:00:00.03 |    8294 |
|  1 |  SORT AGGREGATE    |      |     1 |     1 |     1 |00:00:00.03 |    8294 |
|* 2 |   TABLE ACCESS FULL| EMP  |     1 |   233K| 10026 |00:00:00.03 |    8294 |
---------------------------------------------------------------------------------

Note that in both Listing 5 and Listing 6, the estimated rows (E-Rows) value was 233K, because the optimizer assumed an even distribution of values throughout the JOB column. Listing 7 shows some simple arithmetic that reveals where the row estimate comes from.

Listing 7: Calculation of row estimate

SQL> select count(*) from emp;

  COUNT(*)
----------
   1400000

1 row selected.

SQL> select count(distinct job) from emp;

COUNT(DISTINCTJOB)
------------------
                 6

1 row selected.

There are 1,400,000 rows in the table, distributed across six distinct JOB values. The optimizer assumes an even distribution (1,400,000 / 6 = 233,333 rows).
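The same arithmetic can be reproduced from the data dictionary. Here is a sketch, assuming statistics have been gathered on EMP and no histogram yet exists on JOB:

```sql
-- Without a histogram, the estimate for JOB = :b is approximately
-- NUM_ROWS / NUM_DISTINCT, that is, 1400000 / 6 = 233333.
select t.num_rows,
       c.num_distinct,
       round(t.num_rows / c.num_distinct) as even_dist_estimate
from   user_tables t
join   user_tab_col_statistics c on c.table_name = t.table_name
where  t.table_name  = 'EMP'
and    c.column_name = 'JOB';
```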

The tuning solution is to create a histogram on the column(s) that have skewed data to provide the optimizer with more information for each of the discrete values in the domain. With the additional information available, the optimizer estimates are now correct for each of the discrete values, as shown in Listing 8.

Listing 8: Corrected optimizer estimates with histogram

SQL> begin
  2    dbms_stats.gather_table_stats('','EMP',
  3      method_opt=>'for all columns size auto');
  4  end;
  5  /

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> select /*+ gather_plan_statistics */ count(hiredate)
  2  from   emp
  3  where  job = 'ANALYST';

COUNT(HIREDATE)
---------------
         200000

1 row selected.

SQL>
SQL> select *
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------
SQL_ID  56x6jyj0pbr6w, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(hiredate) from emp where job = 'ANALYST'

Plan hash value: 2083865914

---------------------------------------------------------------------------------
| Id | Operation          | Name |Starts |E-Rows |A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |     1 |       |     1 |00:00:00.02 |    8294 |
|  1 |  SORT AGGREGATE    |      |     1 |     1 |     1 |00:00:00.02 |    8294 |
|* 2 |   TABLE ACCESS FULL| EMP  |     1 |   200K|   200K|00:00:00.02 |    8294 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("JOB"='ANALYST')


20 rows selected.

SQL> select /*+ gather_plan_statistics */ count(hiredate)
  2  from   emp
  3  where  job = 'DIRECTOR';

COUNT(HIREDATE)
---------------
          10026

1 row selected.

SQL>
SQL> select *
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------
SQL_ID  gvudu9njvwn0a, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(hiredate) from emp where job = 'DIRECTOR'

Plan hash value: 2083865914

---------------------------------------------------------------------------------
| Id | Operation          | Name |Starts |E-Rows |A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------
|  0 | SELECT STATEMENT   |      |     1 |       |     1 |00:00:00.02 |    8294 |
|  1 |  SORT AGGREGATE    |      |     1 |     1 |     1 |00:00:00.02 |    8294 |
|* 2 |   TABLE ACCESS FULL| EMP  |     1 | 10026 | 10026 |00:00:00.03 |    8294 |
---------------------------------------------------------------------------------

By default, the database will track queries to automatically decide which columns should have a histogram gathered at the next DBMS_STATS call to gather statistics, including the automatic nightly task. To learn more about how and when histograms are created, check out this excellent blog post on the topic by Oracle Database Optimizer Product Manager Nigel Bayliss.
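Whether a histogram was in fact created, and of what type, can be confirmed in the data dictionary. A quick check against the EMP table from the examples:

```sql
-- HISTOGRAM is NONE when no histogram exists; a skewed, low-NDV column
-- such as JOB would typically receive a FREQUENCY histogram.
select column_name, num_distinct, num_buckets, histogram
from   user_tab_col_statistics
where  table_name = 'EMP'
order by column_name;
```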

Cross-Column Correlation

Having accurate statistics on each table and each column might still not be enough information for the optimizer when SQL queries include multiple predicates. For example, just because I know there is a 50-50 split of boys and girls among the students at my children’s school and that there is a 50-50 split of children who take the soccer option in sports versus other sports, I cannot conclude that half of the soccer class are girls. The reality is that less than 10% of the soccer class is girls, because soccer has traditionally been a male-dominated sport in Australia. An interesting aside is that it is also one of the fastest-growing contact sports for girls in Australia, so perhaps there will soon be a 50-50 split.

This is an example of a correlation between two or more columns in a database table. By default, the optimizer assumes no such correlation: it simply multiplies the individual selectivities. In the school example, it would estimate that 25% of all students are girls who take soccer (50% girls multiplied by 50% choosing the soccer option), which amounts to assuming that half the soccer class is girls. That assumption is nowhere close to the actual split (10% girls, 90% boys), and hence a poor optimizer plan is likely.

Using some information on publicly available vehicle registration data in Western Australia, Listing 9 shows another example of cross-column correlation resulting in poor optimizer estimates, this time using a correlation between the manufacturer of a car (column MAKE) and the model name (column MODEL).

Listing 9: Cross-column correlation leading to poor estimates

SQL> select /*+ gather_plan_statistics */ count(*)
  2  from   VEHICLES
  3  where  MAKE  = 'HONDA'
  4  and    MODEL = 'CIVIC';

  COUNT(*)
----------
    114468

SQL> select * 
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');
 
-------------------------------------------------------------------
| Id | Operation               |Name    |Starts | E-Rows | A-Rows |
-------------------------------------------------------------------
|  1 |  SORT AGGREGATE         |        |     1 |      1 |      1 |
|  2 |   TABLE ACCESS BY INDEX |VEHICLES|     1 |     20K|    114K|
|  3 |     INDEX RANGE SCAN    |MAKE_IX |     1 |     45K|    182K|
-------------------------------------------------------------------

The optimizer does not understand that because Honda produces a car named Civic, it is unlikely, or even impossible, that there will ever be a Toyota Civic, a Volkswagen Civic, or a General Motors Civic. It assumes that all such combinations are equally likely. The execution plan above shows that the optimizer estimates 20,000 Honda Civics, whereas in reality there were 114,000 such registered vehicles.
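Under the independence assumption, that 20K estimate is simply the table cardinality multiplied by the individual selectivity of each predicate. The arithmetic can be sketched from the dictionary, assuming statistics exist on VEHICLES and neither column has a histogram:

```sql
-- E-Rows for MAKE = :m AND MODEL = :d with independent columns:
--   NUM_ROWS * (1 / NDV(MAKE)) * (1 / NDV(MODEL))
select round(t.num_rows / (mk.num_distinct * md.num_distinct))
         as independence_estimate
from   user_tables t,
       user_tab_col_statistics mk,
       user_tab_col_statistics md
where  t.table_name  = 'VEHICLES'
and    mk.table_name = 'VEHICLES' and mk.column_name = 'MAKE'
and    md.table_name = 'VEHICLES' and md.column_name = 'MODEL';
```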

The statistics collected on a table can also be extended with a facility called, unsurprisingly, extended statistics. Listing 10 shows the DBMS_STATS call that nominates the columns for which a correlation exists (MAKE and MODEL). The call to create extended statistics does not regather statistics for the table; it adds a hidden virtual column to the table, and the statistics subsequently gathered on that column capture the combined distribution of the correlated columns. The return value from the function call (SYS_STUF3GLKIOP5F4B0BTTCFTMX0W) is the name of the hidden column, but the developer does not need to reference this value in ongoing work.

Listing 10: Creating extended statistics for cross-column relationships

SQL> select 
  2    dbms_stats.create_extended_stats('','VEHICLES',
  3         '(MAKE,MODEL)') ext
  4  from dual;

EXT
----------------------------------------------------------
SYS_STUF3GLKIOP5F4B0BTTCFTMX0W

SQL> begin
  2    dbms_stats.gather_table_stats('','VEHICLES'
  3      ,method_opt=>'for all columns size auto');
  4  end;
  5  /

With the statistics regathered, the optimizer has additional information about the data correlation. Listing 11 repeats the vehicle query and shows that the estimated rows now match the actual rows, giving the optimizer a much higher probability of choosing the optimal execution plan.

Listing 11: Optimizer utilizing extended statistics for better estimate

SQL> select /*+ gather_plan_statistics */ count(*)
  2  from   VEHICLES
  3  where  MAKE = 'HONDA'
  4  and    MODEL = 'CIVIC';

  COUNT(*)
----------
    114468

SQL> select * 
  2  from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');

--------------------------------------------------------------
| Id | Operation          |Name    |Starts | E-Rows | A-Rows |
--------------------------------------------------------------
|  1 |  SORT AGGREGATE    |        |     1 |      1 |      1 |
|* 2 |   TABLE ACCESS FULL|VEHICLES|     1 |    114K|    114K|
--------------------------------------------------------------

Note that extended statistics target equality predicates. If the MODEL or MAKE predicates in the example were range or inequality predicates, the benefits of extended statistics would be significantly reduced.
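The column groups defined on a table can be listed from the data dictionary, and removed if they stop being useful. A sketch using the VEHICLES example:

```sql
-- List column-group extensions and their hidden column names.
select extension_name, extension
from   user_stat_extensions
where  table_name = 'VEHICLES';

-- Drop the column group if it is no longer needed.
exec dbms_stats.drop_extended_stats('','VEHICLES','(MAKE,MODEL)')
```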

To Infinity and Beyond!

Up-to-date statistics on every single table and every single column, histograms, and cross-column correlation information represent only a tiny fraction of all the statistics an optimizer for a database platform needs in order to claim to optimize any SQL statement. The volume of statistics needed is infinite, because there is an infinite number of ways to code a SQL statement. There is no practical limit to the number of tables in a query, the way those tables may be joined to each other, and the permutations of predicates that could be applied to any of the columns. It is simply not possible to have all the required statistics collected in advance.

Even with a single table, it is easy to create a SQL query for which none of the existing statistics gathered by the database will be useful. Listing 12 creates a table, appropriately named TOUGH_DATA, with 1,000,000 rows, storing randomly generated character strings in the column STR. I then gathered statistics and counted the strings that have an 'XX' suffix.

Listing 12: Table TOUGH_DATA with random strings in a column STR

SQL> create table TOUGH_DATA nologging as
  2  select
  3    rownum pk,
  4    dbms_random.string('U',10) str
  5  from dual
  6  connect by level <= 1000000
  7  /

Table created.

SQL> exec dbms_stats.gather_table_stats('','TOUGH_DATA')

PL/SQL procedure successfully completed.

SQL> select count(*)
  2  from   TOUGH_DATA
  3  where  str like '%XX'
  4  /

  COUNT(*)
----------
      1452

The count is 1,452, but the optimizer is supposed to derive an estimated result close to that value before the query is executed. Any statistics the optimizer might have available to it on the STR column are reflective of the column values in their entirety—not just the trailing two characters. Thus, beyond the fact that the number of matching rows is less than the total number of rows in the table, there is no other information that can assist the optimizer, so the optimizer falls back on some hard-coded defaults to use as a “best guess.” Listing 13 shows the defaults used by the optimizer in this instance.

Listing 13: Optimizer resorting to a default guess

SQL> select count(*)
  2  from   TOUGH_DATA
  3  where  str like '%XX'
  4  /

------------------------------------------------
| Id | Operation          | Name       | Rows  |
------------------------------------------------
|  0 | SELECT STATEMENT   |            |     1 |
|  1 |  SORT AGGREGATE    |            |     1 |
|* 2 |   TABLE ACCESS FULL| TOUGH_DATA | 50000 |
------------------------------------------------

The value of 50,000 rows is based on the total number of rows in the table (1,000,000). In this instance, when the optimizer decided that no existing statistics would be useful for estimating the number of matching rows, it used a default of 5% of the table rows. This might seem like the best estimate that can be hoped for, because the only way to garner a good estimate of the rows that will match the predicate would be to run the actual query.

But there is another option available to the optimizer: dynamic sampling. The optimizer will use the supplied predicates to perform some sampling of the true data in the table instead of using the statistics in the data dictionary to estimate the selectivity of the data based on those predicates. Listing 14 shows the DYNAMIC_SAMPLING hint for looking up a subset of the data to see how much of it matches the XX suffix predicate.

Listing 14: Dynamic sampling to help the optimizer make an estimate

SQL> select /*+ dynamic_sampling(t 2) */ count(*)
  2  from   TOUGH_DATA t
  3  where  str like '%XX';

------------------------------------------------
| Id | Operation          | Name       | Rows  |
------------------------------------------------
|  0 | SELECT STATEMENT   |            |     1 |
|  1 |  SORT AGGREGATE    |            |     1 |
|* 2 |   TABLE ACCESS FULL| TOUGH_DATA |  1252 |
------------------------------------------------

The use of dynamic sampling yielded a much better estimate, 1,252, instead of the default, 50,000. The improved accuracy of the estimate comes at the cost of performing some additional database work to evaluate a subset of the data. For queries that are run at a low frequency, such as ad hoc queries in data warehouses, the slight cost of dynamically sampling the data can often be justified by the corresponding improvement in the resultant optimizer plan.
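Dynamic sampling does not have to be hinted statement by statement. The OPTIMIZER_DYNAMIC_SAMPLING parameter (levels 0 through 11) applies it session-wide, which can suit ad hoc reporting sessions; note that the parameter levels have slightly different semantics from the statement-level hint. A sketch:

```sql
-- Level 11: the optimizer decides for itself when sampling is
-- worthwhile and can reuse the results for later queries.
alter session set optimizer_dynamic_sampling = 11;

select count(*)
from   TOUGH_DATA
where  str like '%XX';
```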

Summary

This concludes my series on SQL tuning. You now know about ensuring the functional correctness of SQL, common SQL syntax mistakes that are often misinterpreted as performance issues, detecting and analyzing poorly performing SQL statements, and the most common tuning solutions to keep SQL running optimally.

Always keep in mind that correctness of functionality trumps performance. But once you have ensured that your SQL meets the business requirement it was coded for, the better you can make it perform, the more value you will deliver to the users of the applications you build. Hopefully this series has given you the tools and techniques to get your SQL statements running smoothly.

Next Steps

LEARN more about SQL tuning.

READ
“A Higher-Level Perspective on SQL Tuning”
“A Higher-Level Perspective on SQL Tuning, Part 2”
“A Higher-Level Perspective on SQL Tuning, Part 3”
more about real-time statistics.

TRY Oracle Autonomous Transaction Processing.

DOWNLOAD Oracle Database 19c.

Illustration by Wes Rowell