I’ve heard you say the optimizer learns from its mistakes. What do you mean by that?
The optimizer works by estimating how many rows the various steps in a plan would produce and, drawing predominantly on that information, it chooses a plan. (I discussed this in detail in the November/December 2014 issue of Oracle Magazine.) After running the plan, the optimizer has access to what really happened during query execution—how many rows were actually produced by each step—but historically it has never done anything with that information. The optimizer would continue to make the same estimate every time it optimized that query.
The estimated row count is perhaps the single most important element of the query optimization process. The optimizer assigns costs to the different steps in a plan, based in large part on row count estimates, and uses that information to determine access methods (use an index or do a full table scan, for example), join methods (use nested loops or hash joins), and parallel execution methods (broadcast an entire set of rows from one layer of execution servers to the next or distribute them round-robin), to name a few. If the optimizer guesses wrong about how many rows will be returned, the result will often be a bad query plan.
For example, if I told you that your SELECT * FROM one_billion_row_table WHERE indexed_column = 5 query would return half a billion rows, you and the optimizer would probably choose to use a full table scan. If, upon executing the query, you received only two rows, you would realize instantly that you and the optimizer should have used an index. In earlier database releases—prior to Oracle Database 11g—the optimizer would not have realized that it had made a mistake; it would have stuck with the same full table scan plan over and over. But as of Oracle Database 11g, the optimizer may well realize that it has made a mistake and correct itself. And in Oracle Database 12c, the optimizer will not only correct itself but will also remember—persistently—that it has made a mistake and continue to correct itself, even after your query is flushed from the shared pool. And it will correct other similar queries as well.
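The intuition behind that choice can be sketched with a toy cost model. The numbers and formulas below are hypothetical, for illustration only; Oracle's actual costing is far more elaborate.

```python
# Toy illustration of why the row estimate drives the access-path choice.
# The costs and formulas here are hypothetical, not Oracle's actual model.

def choose_access_path(estimated_rows, table_blocks=1_000_000):
    # Full table scan: read every block once, regardless of rows returned.
    full_scan_cost = table_blocks
    # Index access: roughly one index probe plus one block visit per row.
    index_cost = 3 + estimated_rows
    return "INDEX" if index_cost < full_scan_cost else "FULL SCAN"

print(choose_access_path(500_000_000))  # huge estimate -> FULL SCAN
print(choose_access_path(2))            # tiny estimate -> INDEX
```

If the estimate says half a billion rows, the scan is cheaper; if it says two rows, the index wins. Feed the model a wrong estimate and it confidently picks the wrong path, which is exactly the optimizer's predicament.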
Let’s see this in action. I’ll start by creating a copy of some test data I have: city and state names, with one row for each person in a given city/state.
SQL> create table people
  2  compress
  3  as
  4  select name city_name, state
  5  from cities_state
  6  /

Table created.
I am using Oracle Database 12c, so there are already statistics on this table, as you can see in Listing 1.
Code Listing 1: Existing table statistics with Oracle Database 12c
SQL> select to_char(sysdate,'dd-mon-yyyy hh24:mi') now,
  2         num_rows num_rows,
  3         to_char( last_analyzed, 'dd-mon-yyyy hh24:mi' ) la
  4    from user_tables
  5   where table_name = 'PEOPLE'
  6  /

NOW                 NUM_ROWS      LA
—————————————————   ———————————   —————————————————
24-apr-2015 20:13   114,673,091   24-apr-2015 20:13

SQL> select column_name,
  2         num_distinct,
  3         round(114673091/num_distinct) the_guess,
  4         num_nulls,
  5         histogram,
  6         notes
  7    from user_tab_col_statistics
  8   where table_name = 'PEOPLE'
  9  /

COLUMN_NAM  NUM_DISTINCT  THE_GUESS  NUM_NULLS  HISTOGRAM  NOTES
——————————  ————————————  —————————  —————————  —————————  —————————————
CITY_NAME            675    169,886          0  NONE       STATS_ON_LOAD
STATE                 50  2,293,462          0  NONE       STATS_ON_LOAD
The fact that the NOW column of the query against USER_TABLES is the same as the LAST_ANALYZED column shows that statistics were created as the data was loaded—this is a new feature of Oracle Database 12c. Since Oracle Database 10g, the database has been computing statistics automatically for indexes when you create or rebuild them. With Oracle Database 12c, the database does the same for tables. Further, as you can see in the example in Listing 1, there are basic statistics on the columns in the table as well.
I added a column, THE_GUESS, to the query against USER_TAB_COL_STATISTICS. This is the row estimate the optimizer will use when I reference that column in an equality predicate in the WHERE clause. Because I have no histograms on either of my columns yet, the optimizer will assume that the data is distributed uniformly, so the expected number of rows from WHERE STATE = something will be the number of rows in the table divided by the number of distinct values in the STATE column. When I look at the data in the STATE column, I quickly see that this assumption is not true and that the optimizer’s guess will be very far off:
SQL> select *
  2    from (
  3         select state,
  4                count(*) cnt
  5           from people
  6          group by state
  7          order by state
  8         )
  9   where rownum <= 5
 10  /

STATE        CNT
——————————   ——————————
Alabama       1,139,854
Alaska          291,825
Arizona       4,130,606
Arkansas        667,541
California   25,145,815
I can see that Alaska has two orders of magnitude fewer people in this table than California does, for example. The data is obviously skewed, but right now the optimizer does not realize that. If I were to query by using WHERE STATE = something, the optimizer would guess that it would receive about 2.3 million rows, where, in fact, it would receive wildly varying numbers of rows, depending on the STATE value input into the query. I can observe this easily, as shown in Listing 2.
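A quick back-of-the-envelope check, using the row counts from my listings, shows how far the uniform assumption strays. This is a sketch of the arithmetic, not the optimizer's actual code:

```python
# Without a histogram, the optimizer assumes a uniform distribution:
# expected rows for "WHERE state = :x" is num_rows / num_distinct.
# The counts below come from the listings in this article.

num_rows = 114_673_091
num_distinct_states = 50
uniform_guess = round(num_rows / num_distinct_states)
print(uniform_guess)   # 2,293,462 -- the one-size-fits-all guess

actual = {"Alaska": 291_825, "California": 25_145_815}
for state, cnt in actual.items():
    # Ratio of guess to reality: ~8x too high for Alaska,
    # ~10x too low for California.
    print(state, round(uniform_guess / cnt, 1))
```

One fixed guess cannot be right for both Alaska and California; it misses each one by roughly an order of magnitude, in opposite directions.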
Code Listing 2: Plan statistics with serious skew
SQL> select /*+ gather_plan_statistics */ count(*)
  2    from people
  3   where state = 'Alaska'
  4  /

  COUNT(*)
——————————
   291,825

SQL> select *
  2    from table
  3         (dbms_xplan.display_cursor
  4         (format=>'allstats last')
  5         )
  6  /

———————————————————————————————————————————————————————————————
| Id | Operation          | Name   | Starts | E-Rows | A-Rows |
———————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT   |        |      1 |        |      1 |
|  1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|* 2 |  TABLE ACCESS FULL | PEOPLE |      1 |  2293K |   291K |
———————————————————————————————————————————————————————————————
By using the GATHER_PLAN_STATISTICS hint, I can get additional, extremely useful information from DBMS_XPLAN; namely, I can see the E-Rows (how many rows the optimizer estimated) and A-Rows (how many rows the query actually processed) values in the plan output. I predicted that the optimizer would estimate 2.3 million rows (E-Rows), but I knew it would actually process an order of magnitude fewer—about 291,000 rows. I knew that, but the optimizer didn’t. Furthermore, this skew appears with other states as well, as shown in Listing 3.
Code Listing 3: Plan statistics for a different state and more skew
SQL> select /*+ gather_plan_statistics */ count(*)
  2    from people
  3   where state = 'California'
  4  /

  COUNT(*)
——————————
25,145,815

———————————————————————————————————————————————————————————————
| Id | Operation          | Name   | Starts | E-Rows | A-Rows |
———————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT   |        |      1 |        |      1 |
|  1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|* 2 |  TABLE ACCESS FULL | PEOPLE |      1 |  2293K |    25M |
———————————————————————————————————————————————————————————————
In this case, given the information it has access to—the current set of statistics—the optimizer underestimates the number of rows by an order of magnitude. As I pointed out in my November/December 2014 column, when the optimizer is off by orders of magnitude, the possibility of a bad plan is very high, so the skew in Listing 3 is cause for concern.
So, what can you or I do about it? For a query this simple—a single-table query with a single-column predicate—I might only need to gather statistics again. Since Oracle Database 10g, the database has been “spying” on you and me. It has been watching your queries and recording the columns you use in your WHERE clause in the data dictionary. DBMS_STATS can use that information later to determine which columns might warrant additional inspection in the form of histograms. (I wrote about that originally in the May/June 2006 issue of Oracle Magazine.)
After I gather statistics again, I see a marked difference in the statistics collected:
SQL> exec dbms_stats.gather_table_stats( user, 'PEOPLE', no_invalidate=>FALSE );

PL/SQL procedure successfully completed.

SQL> select column_name,
  2         num_distinct,
  3         histogram
  4    from user_tab_col_statistics
  5   where table_name = 'PEOPLE'
  6  /

COLUMN_NAM  NUM_DISTINCT  HISTOGRAM
——————————  ————————————  —————————
CITY_NAME            675  NONE
STATE                 50  FREQUENCY
This is an early example of learning on the part of the optimizer. By watching the predicates used in queries, the optimizer will learn to inspect those columns when gathering statistics and will automatically generate histograms on single columns used in WHERE clauses that contain skewed information. That is, DBMS_STATS knew I used STATE in a WHERE clause, and when it gathered statistics this second time, it took the time to inspect the values in the STATE column, concluded that the data was not uniformly distributed, and decided to create a FREQUENCY histogram on the data.
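Conceptually, a FREQUENCY histogram replaces the uniform guess with a per-value lookup. The following toy sketch uses a hypothetical miniature data set to show the idea; it is not how Oracle stores histograms internally:

```python
# Sketch of what a FREQUENCY histogram buys the optimizer: instead of
# one uniform guess, it records a per-value count and looks the value up.
from collections import Counter

# Hypothetical miniature data set standing in for the STATE column.
rows = ["Alaska"] * 3 + ["California"] * 250 + ["Arkansas"] * 7

histogram = Counter(rows)        # value -> exact row count

def estimate(state):
    # With a FREQUENCY histogram, the estimate IS the recorded count;
    # an unseen value falls back to a minimal guess of 1.
    return histogram.get(state, 1)

print(estimate("Alaska"))        # 3, not len(rows)/3 ≈ 87
print(estimate("California"))    # 250
```

With 50 distinct states, a frequency histogram is cheap (one bucket per value) and turns the estimate from a table-wide average into an exact per-value count.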
Now, when I query again against STATE as shown in Listing 4, the estimated cardinalities are going to be much more accurate. Now the E-Rows estimate is very close to the A-Rows actual values; this is exactly what I want to see. When the optimizer estimates the cardinality correctly, the odds are good that the plan I am looking at is the right one. (Conversely, if the estimates are far off, the odds are that I am looking at a suboptimal plan.)
Code Listing 4: Better plan statistics after DBMS_STATS
SQL> select /*+ gather_plan_statistics */ count(*)
  2    from people
  3   where state = 'Alaska'
  4  /

  COUNT(*)
——————————
   291,825

———————————————————————————————————————————————————————————————
| Id | Operation          | Name   | Starts | E-Rows | A-Rows |
———————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT   |        |      1 |        |      1 |
|  1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|* 2 |  TABLE ACCESS FULL | PEOPLE |      1 |   291K |   291K |
———————————————————————————————————————————————————————————————

SQL> select /*+ gather_plan_statistics */ count(*)
  2    from people
  3   where state = 'California'
  4  /

  COUNT(*)
——————————
25,145,815

———————————————————————————————————————————————————————————————
| Id | Operation          | Name   | Starts | E-Rows | A-Rows |
———————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT   |        |      1 |        |      1 |
|  1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|* 2 |  TABLE ACCESS FULL | PEOPLE |      1 |    25M |    25M |
———————————————————————————————————————————————————————————————
But that is just the tip of the iceberg. I still have other columns in that table (the CITY_NAME column, for example) that might suffer from the same issue, and I might also have much more complicated predicates that the optimizer would have a hard time estimating. Fixing single-column estimates is easy—I only need to reference those columns in a predicate and gather statistics with DBMS_STATS. Note that I need to use the default METHOD_OPT parameter to enable DBMS_STATS to automatically determine on which columns to gather histograms. If I supply a METHOD_OPT=> ‘some inputs….’ parameter value, only the histograms I request will be collected.
If I query the CITY_NAME column, for example, as shown in Listing 5, I can see that the CITY_NAME column has a histogram on it as well, because it, too, is skewed. (There are a lot more people living in New York City, New York, than in Anchorage, Alaska, for example.) But that solves only some simple cases: predicates involving a single column. What about something more complex and more likely to happen in the real world? What about a predicate involving two columns, for example?
Code Listing 5: DBMS_STATS adding another histogram
SQL> select count(*)
  2    from people
  3   where city_name = 'Anchorage municipality'
  4  /

  COUNT(*)
——————————
   291,825

SQL> exec dbms_stats.gather_table_stats( user, 'PEOPLE', no_invalidate=>FALSE );

PL/SQL procedure successfully completed.

SQL> select column_name,
  2         num_distinct,
  3         num_nulls,
  4         histogram,
  5         notes
  6    from user_tab_col_statistics
  7   where table_name = 'PEOPLE'
  8  /

COLUMN_NAM  NUM_DISTINCT  NUM_NULLS  HISTOGRAM  NOTES
——————————  ————————————  —————————  —————————  —————
CITY_NAME            675          0  HYBRID
STATE                 50          0  FREQUENCY
The optimizer has a harder job with more and more-complex predicates—even with a predicate involving just two columns. When the optimizer sees a predicate such as WHERE x = ‘A’ and y = ‘B’, it will default to simply taking what it assumes to be the probability of x = ‘A’ being true (X percent of the rows being returned) and the probability of y = ‘B’ being true (Y percent of the rows being returned) and multiplying them by each other. So, if x = ‘A’ is assumed to return about 1 percent of the rows and y = ‘B’ is assumed to return about 1 percent of the rows, the optimizer will assume that 0.01 percent of the rows will be returned by the WHERE clause. But what if it were almost always true that when Y = ‘B’ and X = ‘A’, that query would return 1 percent of the rows in the table, not 0.01 percent? (That estimate would be off by two orders of magnitude.)
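The arithmetic is easy to sketch. The selectivities below are hypothetical, and integer math keeps the numbers exact:

```python
# The default "independence" heuristic: multiply individual selectivities.
# With correlated columns, the product can be off by orders of magnitude.

num_rows = 1_000_000
rows_x = num_rows // 100   # assume x = 'A' matches 1% of rows -> 10,000
rows_y = num_rows // 100   # assume y = 'B' matches 1% of rows -> 10,000

# Independence assumption: 1% of 1% = 0.01% of the table.
independent_estimate = rows_x * rows_y // num_rows
print(independent_estimate)   # 100 rows

# But if y = 'B' is implied by x = 'A' (perfectly correlated columns),
# the true cardinality is simply the rows where x = 'A'.
true_cardinality = rows_x
print(true_cardinality)       # 10,000 rows -- off by two orders of magnitude
```

The heuristic is perfectly reasonable when the columns really are independent; it fails precisely when one column's value constrains the other's, which is common in real data.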
This is where the optimizer needs some assistance or some introspection (self-learning). In Oracle Database 11g, the DBA could create extended statistics on (X,Y), enabling the optimizer to know that there is a relationship between X and Y, but that presupposes that the DBA knows that those statistics need to be created. What if this is a system where the queries are not known ahead of time—such as a data warehouse or a system where the DBA just isn’t familiar with every single query submitted to the system (and that probably describes almost every database today)? It would be hard for the DBA to proactively create these statistics.
Enter the learning optimizer—an optimizer that learns from its mistakes. Let’s take a look at what the optimizer would estimate for a two-column predicate first, as shown in Listing 6.
Code Listing 6: First estimate on two-column predicate
SQL> select /*+ gather_plan_statistics */ city_name, state, count(*)
  2    from people
  3   where city_name = 'New York City'
  4     and state = 'New York'
  5   group by city_name, state
  6  /

CITY_NAME      STATE     COUNT(*)
—————————————  ————————  —————————
New York City  New York  8,175,132

—————————————————————————————————————————————————————————————————
| Id | Operation            | Name   | Starts | E-Rows | A-Rows |
—————————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT     |        |      1 |        |      1 |
|  1 |  SORT GROUP BY NOSORT|        |      1 |   671K |      1 |
|* 2 |  TABLE ACCESS FULL   | PEOPLE |      1 |   671K |  8175K |
—————————————————————————————————————————————————————————————————
I can see that the optimizer estimated 671,000 rows but actually received 8,175,000 rows. It was off by more than an order of magnitude, even though I have histograms on the CITY_NAME column and the STATE column. The problem is that even though I have a lot of details on the individual columns, I have no information on the relationship between the two columns.
In my data set, I have only “big” cities. New York City appears only in the state of New York. By default, the optimizer will use the probability of CITY_NAME = ‘New York City’ being true and multiply that by the probability of STATE = ‘New York’ being true. That will grossly underestimate the expected number of rows, because the CITY_NAME in this case implies the state!
Starting in Oracle Database 11g, the optimizer would recognize this condition. It would realize that in this query, the E-Rows and A-Rows columns are very far apart, and it would reoptimize the query on the next execution, as shown in Listing 7.
Code Listing 7: New estimate on two-column predicate reoptimized with statistics feedback
SQL> select /*+ gather_plan_statistics */ city_name, state, count(*)
  2    from people
  3   where city_name = 'New York City'
  4     and state = 'New York'
  5   group by city_name, state
  6  /

CITY_NAME      STATE     COUNT(*)
—————————————  ————————  —————————
New York City  New York  8,175,132

—————————————————————————————————————————————————————————————————
| Id | Operation            | Name   | Starts | E-Rows | A-Rows |
—————————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT     |        |      1 |        |      1 |
|  1 |  SORT GROUP BY NOSORT|        |      1 |  8175K |      1 |
|* 2 |  TABLE ACCESS FULL   | PEOPLE |      1 |  8175K |  8175K |
—————————————————————————————————————————————————————————————————

Note
————
- statistics feedback used for this statement
Now I see that the E-Rows and A-Rows column values are very close to each other. The optimizer reoptimized the plan with the actual observed cardinalities, not its own estimate. I can see that this happened in the Note section of the plan. (In Oracle Database 11g, this was called cardinality feedback, and in Oracle Database 12c, it is called statistics feedback.)
In Oracle Database 11g, this reoptimizing capability would fix the New York City/New York problem, but only for that specific query and only for as long as that query existed in the shared pool. If I restarted the database, gathered statistics, or parsed a lot of SQL so that the shared pool aged this query out, the optimizer would have to relearn this lesson. In Oracle Database 11g, the optimizer’s learning process was short-term—the learning would not last forever.
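The mechanism can be caricatured in a few lines. This is purely illustrative: Oracle keeps this state in the cached cursor in Oracle Database 11g and, as discussed next, persists what it learned in Oracle Database 12c.

```python
# Sketch of cardinality/statistics feedback: after execution, remember the
# actual row count for a statement and prefer it on the next optimization.
# Purely illustrative -- not Oracle's implementation.

feedback = {}   # sql_id -> actual rows observed at runtime

def optimize(sql_id, initial_estimate):
    # Use the observed cardinality if a prior run recorded one.
    return feedback.get(sql_id, initial_estimate)

def execute(sql_id, actual_rows):
    feedback[sql_id] = actual_rows   # record what really happened

est1 = optimize("q1", 671_000)   # first parse: the (bad) 671K estimate
execute("q1", 8_175_132)         # run it; observe 8,175,132 rows
est2 = optimize("q1", 671_000)   # reoptimize: now uses the actual count
print(est1, est2)
```

The 11g limitation is visible in the sketch: the `feedback` dict lives only in memory, so flushing the shared pool (or restarting) empties it and the lesson is lost.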
Starting in Oracle Database 12c, however, the optimizer learning that takes place not only is long-term but is also shared across similar queries. A new capability called SQL plan directives (bit.ly/dbms_spd) enables the optimizer to share the knowledge that it is having a hard time estimating cardinalities of certain types of predicates and also to alert DBMS_STATS to gather additional statistics to help out in the future.
For example, after running the New York query in Listing 7, the optimizer has learned that when it attempts to estimate the rows returned by a predicate in the form WHERE CITY_NAME = X and STATE = Y, the default heuristic it uses—just multiplying probabilities by each other—is flawed. The optimizer needs more information that it can get via dynamic sampling (bit.ly/asktom_dynsam). In Oracle Database 12c, the optimizer will generate a SQL plan directive in the SGA (it is eventually written to disk to a persistent table) that warns all subsequent queries that use CITY_NAME and STATE together that they should dynamically sample to get a better cardinality estimate. I can demonstrate this easily: I’ll force any SQL plan directives I have in memory to disk so I can query them, as shown in Listing 8.
Code Listing 8: Creating SQL plan directives
SQL> exec dbms_spd.flush_sql_plan_directive;

PL/SQL procedure successfully completed.

SQL> select o.subobject_name col_name,
  2         o.object_type, d.type, d.state, d.reason
  3    from dba_sql_plan_directives d, dba_sql_plan_dir_objects o
  4   where d.directive_id = o.directive_id
  5     and o.owner = user
  6     and o.object_name = 'PEOPLE'
  7   order by 1, 2, 3, 4, 5;

COL_NAME   OBJECT  TYPE              STATE   REASON
—————————  ——————  ————————————————  ——————  ————————————————————————————————————
CITY_NAME  COLUMN  DYNAMIC_SAMPLING  USABLE  SINGLE TABLE CARDINALITY MISESTIMATE
STATE      COLUMN  DYNAMIC_SAMPLING  USABLE  SINGLE TABLE CARDINALITY MISESTIMATE
           TABLE   DYNAMIC_SAMPLING  USABLE  SINGLE TABLE CARDINALITY MISESTIMATE
This SQL plan directive is telling me that if I use CITY_NAME and STATE together, I had better use dynamic sampling when hard-parsing (optimizing) the query, because there will be a serious cardinality misestimation if I don’t. I can see this directive in effect by querying a different CITY_NAME and STATE combination, as shown in Listing 9.
Code Listing 9: Demonstration of SQL plan directive and dynamic sampling
SQL> select /*+ gather_plan_statistics */ count(*)
  2    from people
  3   where city_name = 'Los Angeles city'
  4     and state = 'California'
  5  /

  COUNT(*)
——————————
 3,792,620

———————————————————————————————————————————————————————————————
| Id | Operation          | Name   | Starts | E-Rows | A-Rows |
———————————————————————————————————————————————————————————————
|  0 | SELECT STATEMENT   |        |      1 |        |      1 |
|  1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|* 2 |  TABLE ACCESS FULL | PEOPLE |      1 |  5756K |  3792K |
———————————————————————————————————————————————————————————————

Note
————
- dynamic statistics used: dynamic sampling (level=2)
- 1 SQL plan directive used for this statement
In my data set, there is just one Los Angeles city, and it is in California. If the optimizer had used its default method of estimating cardinalities, it would have underestimated the number of rows coming back—by a large margin. Instead, the Note section of this plan tells me that the SQL plan directive said to use dynamic sampling. And because the optimizer did that, the estimated cardinality (E-Rows) is very close—within an order of magnitude—to the actual cardinality (A-Rows).
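Dynamic sampling itself is simple to sketch: scan a small random sample at parse time, count the matches, and scale up to the full table. The data below is synthetic, and this is not Oracle's implementation:

```python
# Sketch of dynamic sampling: estimate cardinality from a random sample
# taken at hard-parse time, then scale the matching fraction up.
import random

random.seed(42)   # deterministic for the demonstration

# Synthetic table: 90% of rows match the correlated predicate.
table = [("Los Angeles city", "California")] * 9_000 + \
        [("Sacramento city", "California")] * 1_000

def dynamic_sample_estimate(table, predicate, sample_size=200):
    sample = random.sample(table, sample_size)
    matching = sum(1 for row in sample if predicate(row))
    return round(matching / sample_size * len(table))

est = dynamic_sample_estimate(
    table, lambda r: r == ("Los Angeles city", "California"))
print(est)   # close to the true 9,000, unlike an independence-based guess
```

Because the sample actually evaluates the combined predicate, correlation between the columns is captured automatically; the price is the extra sampling I/O at every hard parse.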
In Listing 9, the optimizer has learned from its mistake in a broad sense. It has corrected an entire class of queries—any query using CITY_NAME and STATE together—regardless of the value used in the predicate. It has also learned this across database restarts, statistics gathering, and so on. This SQL plan directive persists in the data dictionary and will continue to be used.
Dynamic sampling is a good fix for the issue at hand, but it is not without cost. Every time I hard-parse a query to which a directive applies, the database must do this extra sampling work. In most cases, this is nominal overhead, but it is overhead and something I can avoid. If I had permanent—on-disk—extended statistics (bit.ly/dbms_stats) on the column group (CITY_NAME,STATE), the optimizer wouldn’t have to dynamically collect them. Fortunately, in Oracle Database 12c, DBMS_STATS will not only pick up those single-column statistics as demonstrated previously but will also generate the multicolumn, extended statistics, as shown in Listing 10.
Code Listing 10: Generating multicolumn, extended statistics
SQL> exec dbms_stats.gather_table_stats( user, 'PEOPLE', no_invalidate=>FALSE );

PL/SQL procedure successfully completed.

SQL> select column_name,
  2         num_distinct,
  3         histogram
  4    from user_tab_col_statistics
  5   where table_name = 'PEOPLE'
  6  /

COLUMN_NAME                     NUM_DISTINCT  HISTOGRAM
——————————————————————————————  ————————————  —————————
CITY_NAME                                675  HYBRID
STATE                                     50  FREQUENCY
SYS_STSJTGBD9HKF9J6ZR#9#ZGT39Y           714  HYBRID

SQL> SELECT extension_name, extension
  2    FROM user_stat_extensions
  3   WHERE table_name = 'PEOPLE';

EXTENSION_NAME                  EXTENSION
——————————————————————————————  —————————————————————————
SYS_STSJTGBD9HKF9J6ZR#9#ZGT39Y  ("CITY_NAME","STATE")
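A short calculation with the counts from Listing 10 shows why the column-group statistic helps. This ignores histograms for simplicity and is not the optimizer's exact formula:

```python
# Why a column-group statistic fixes the estimate: the number of distinct
# (city, state) PAIRS is far smaller than the product of the individual
# distinct counts when the columns are correlated.
# Counts are taken from the listings in this article.

distinct_cities = 675
distinct_states = 50
distinct_pairs  = 714          # the extension's NUM_DISTINCT in Listing 10
num_rows        = 114_673_091

# Independence-based guess for an equality on both columns:
independent_guess = round(num_rows / (distinct_cities * distinct_states))
# With the column-group statistic, divide by the real pair count:
column_group_guess = round(num_rows / distinct_pairs)

print(independent_guess)    # ~3,398 rows -- wildly low
print(column_group_guess)   # ~160,607 rows -- a far more plausible average
```

If city and state were truly independent, there would be 675 × 50 = 33,750 possible pairs; the real data has only 714, and the extended statistic hands the optimizer exactly that number, so no parse-time sampling is needed.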
Over time, the Oracle Database optimizer has automated many of the things we used to have to do manually. In Oracle Database 10g, the optimizer started collecting single-column histograms; in Oracle Database 11g, it introduced cardinality feedback (now known as statistics feedback); and in Oracle Database 12c, it introduced SQL plan directives, providing much better long-term memory for “things learned” over time while we’re querying our data.