Introduction
This is a follow-on from a previous blog post about plan stability. It is worth reading that post first if you haven't already.
This blog post proposes that if you have a SQL performance issue, you should first consider whether extended statistics will resolve it.
Column Expressions Used in Predicates
Here is a classic example of a column that is wrapped in an expression and used in a predicate:
select sum(sale_val),count(*) from customers where upper(surname) = 'ALLEN';
The optimizer needs to know the statistical properties of UPPER(SURNAME) if it is to estimate cardinality accurately. The basic statistics for the SURNAME column alone will not help. Instead, the optimizer falls back on an internal default selectivity (1% in this case). You can see that at work in this simplified example: the estimate is 10 rows (1% of 1000), but the reality is 500:
create table t as
select 'HELLO'||mod(rownum,2) as hcol
from dual connect by rownum<=1000;

select count(*) from t where upper(hcol) = 'HELLO0';

  COUNT(*)
----------
       500

select * from table(DBMS_XPLAN.DISPLAY_CURSOR());

---------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |
---------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |
|   1 |  SORT AGGREGATE    |      |     1 |     7 |
|*  2 |   TABLE ACCESS FULL| T    |    10 |    70 |
---------------------------------------------------
A straightforward solution is available:
select dbms_stats.create_extended_stats(user,'t','(upper(hcol))') from dual;

begin
   dbms_stats.gather_table_stats(user,'t',
      method_opt=>'for columns (upper(hcol))',
      no_invalidate=>FALSE);
end;
/
Rerun the query and DBMS_XPLAN, and you will see the improved estimate:
---------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes |
---------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |
|   1 |  SORT AGGREGATE    |      |     1 |     7 |
|*  2 |   TABLE ACCESS FULL| T    |   500 |  3500 |
---------------------------------------------------
UPPER(column) statistics are now available, so it is possible to derive cardinality estimates from the data itself rather than relying on a value generated by an internal rule.
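If you want to see what has been created, the extension is implemented as a hidden virtual column, so it appears in the usual dictionary views. For example, a quick check against the demo table above:

-- List the extension and confirm its statistics were gathered
select extension_name, extension
from   user_stat_extensions
where  table_name = 'T';

select column_name, num_distinct, histogram
from   user_tab_col_statistics
where  table_name = 'T';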
I was about to publish this post when (coincidentally) a real case came in. It was a perfect example, so I decided to include it below (the complete script is here). You might see a way to improve the query, but I am making no judgments here. It is a real example that benefits from expression statistics, so we can address the problem without changing the application.
Create a test table:
create table t as
select sysdate-50+rownum/100 delivery_date,1 control_flag,
rownum account_id
from dual connect by rownum<=10000;
The column names have been changed, but the real query included the following predicates:
SELECT /* MYQUERY */ COUNT(*)
FROM t
WHERE ( "DELIVERY_DATE" >= trunc(sysdate@!, 'fmyear')
AND ( coalesce("CONTROL_FLAG", 1) = 1
OR coalesce("CONTROL_FLAG", 1) = 3 )
AND ( coalesce(substr(to_char("ACCOUNT_ID"), 4, 1), '1') = '1'
OR coalesce(substr(to_char("ACCOUNT_ID"), 4, 1), '1') = '3' )
AND "DELIVERY_DATE" <= trunc(sysdate@! - 1) );
  COUNT(*)
----------
      1775
SELECT * FROM table(DBMS_XPLAN.DISPLAY_CURSOR());
SQL_ID  8caz15dwj8w8k, child number 0

-----------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |
-----------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       |
|   1 |  SORT AGGREGATE     |      |     1 |    15 |
|*  2 |   FILTER            |      |       |       |
|*  3 |    TABLE ACCESS FULL| T    |     2 |    30 | < Poor estimate
-----------------------------------------------------
You can see that count(*) returns 1775, but the estimate is 2. This is a clear sign of potential trouble!
Expressions used in WHERE clauses can be identified by inspection, but if you use Oracle Database 12c Release 2 and above, the DBA_EXPRESSION_STATISTICS view can help:
select distinct table_name,owner,expression_text
from dba_expression_statistics
where (table_name,owner)
in (select object_name, object_owner
from v$sql_plan
where object_type = 'TABLE'
and sql_id = '8caz15dwj8w8k'); /* the SQL ID taken from DBMS_XPLAN */
TABLE_NAME OWNER EXPRESSION_TEXT
---------- ----- --------------------
T APP COALESCE(SUBSTR(TO_CHAR("ACCOUNT_ID"),4,1),'1')
T APP COALESCE("CONTROL_FLAG",1)
T APP "DELIVERY_DATE"
Warning! The rows returned above will contain false positives because irrelevant expressions used in the select list will also appear. The database doesn’t track all expression usage either, but I still think that this view is a useful starting point.
Now for the fun part. I will steal the expression text listed in the dictionary view:
select dbms_stats.create_extended_stats(USER,'T',
q'[(COALESCE("CONTROL_FLAG",1))]')
from dual;
select dbms_stats.create_extended_stats(USER,'T',
q'[(COALESCE(SUBSTR(TO_CHAR("ACCOUNT_ID"),4,1),'1'))]')
from dual;
begin
dbms_stats.gather_table_stats(USER,'T',
method_opt=>q'[for columns (COALESCE("CONTROL_FLAG",1))]',
no_invalidate=>FALSE);
dbms_stats.gather_table_stats(USER,'T',
method_opt=>q'[for columns (COALESCE(SUBSTR(TO_CHAR("ACCOUNT_ID"),4,1),'1'))]',
no_invalidate=>FALSE);
end;
/
Here is the plan now:
-----------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |
-----------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       |
|   1 |  SORT AGGREGATE     |      |     1 |    13 |
|*  2 |   FILTER            |      |       |       |
|*  3 |    TABLE ACCESS FULL| T    |  1365 | 17745 | < Great estimate
-----------------------------------------------------
This is an elementary example, and the corrected cardinality estimate did not change the SQL execution plan here. Bear in mind that a missing statistic and an imperfect cardinality estimate do not necessarily mean that all plans will be sub-optimal. However, the real query included additional joins, and further plan decisions were based on the poor cardinality estimate. This ultimately led to a sub-optimal plan being chosen.
Expression Statistics are Important
It is generally considered bad practice to wrap expressions around columns used in WHERE-clause predicates because it makes cardinality estimation more difficult and potentially unreliable. It may negatively affect SQL execution performance too (although this might not be noticeable in many cases). I hope, therefore, that you find you don't need expression statistics very often. However, a significant proportion of the problem SQL statements I see do benefit from them. For this reason, if you are faced with a SQL statement to tune, then I suggest that you consider the potential need for expression statistics first.
You can get a quick idea by inspecting WHERE clauses, or try using the dba_expression_statistics query above.
Column Groups
Another scenario that can significantly affect cardinality estimates is where column values are strongly correlated. In the following query, values in the make and model columns are correlated (because, if the model is 320d, then the make is always going to be BMW):
select count(*) from purchased_cars where make = 'BMW' and model = '320d';
The optimizer's default assumption is that column values are not correlated, so it will not increase the cardinality estimate unless it knows there is a correlation. Column groups are the solution, and if you want more detail, check out an earlier post on the topic.
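To see the effect for yourself, here is a minimal sketch (my own demo table, not taken from a real application) where MODEL fully determines MAKE:

create table purchased_cars as
select case mod(rownum,2) when 0 then 'BMW'  else 'AUDI' end as make,
       case mod(rownum,2) when 0 then '320d' else 'A4'   end as model
from dual connect by rownum<=1000;

exec dbms_stats.gather_table_stats(user,'purchased_cars')

-- 500 rows match, but under the independence assumption the optimizer
-- estimates 1000 * 1/2 * 1/2 = 250 rows.
select count(*) from purchased_cars
where make = 'BMW' and model = '320d';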
A side note:
If you are using Oracle Database 12c Release 2 onwards (or Release 1 with BP) and you have set the DBMS_STATS preference AUTO_STAT_EXTENSIONS to ON, then column groups will be created automatically in response to SQL plan directives. However, OFF is now the default and recommended setting, so I will assume that you will want to initiate column group creation and/or detection yourself.
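You can check (and, if you wish, change) the preference with DBMS_STATS:

-- 'OFF' means column groups are not created automatically in response
-- to SQL plan directives.
select dbms_stats.get_prefs('AUTO_STAT_EXTENSIONS') from dual;

-- To opt in to automatic creation (not the recommended default):
-- exec dbms_stats.set_global_prefs('AUTO_STAT_EXTENSIONS','ON')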
It is straightforward to deal with the example above as follows:
select dbms_stats.create_extended_stats(user,'purchased_cars','(make,model)') from dual;

begin
   dbms_stats.gather_table_stats(user,'purchased_cars',
      method_opt=>'for columns (make,model)');
end;
/
Identifying column value correlations is (on the face of it) not very easy unless you have a good understanding of application queries and the underlying data. Few of us have the time. Is there an easy way to determine whether a poorly performing SQL statement will benefit from a column group? There is. The basic technique is outlined in this post. It is possible to acquire candidates for an entire workload.
While writing this post, I looked into making the process as easy and as quick as possible for an individual ‘problem statement’. Therefore, I have uploaded some examples to GitHub. You will find scripts that will allow you to check if column groups will benefit a specific SQL statement. The procedure is as follows:
1) Locate the problem SQL statement in the cursor cache. Note its SQL_ID
2) Log in as a DBA and execute the following script (supplying the SQL ID).
SQL> @@cg_from_sqlid 11r5va7shq602 n
The script above will produce a list of proposed column groups (the "n" parameter tells the script to report column groups rather than create them immediately). It works by seeding column usage, parsing the SQL statement, and listing the recommended column groups for the tables accessed by the SQL statement. This technique reports on columns that are used together, so it will recommend column groups whether or not the column values are closely correlated. That is a reasonable approach because the column group statistics will then provide an accurate measure of correlation wherever those columns appear together in predicates.
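If you are curious about the underlying mechanism, here is a simplified sketch of what the script automates (using the PURCHASED_CARS example; the GitHub script is more complete):

-- Record column usage for statements parsed in the next 300 seconds.
exec dbms_stats.seed_col_usage(null, null, 300)

-- Parse the problem statement so its column usage is captured.
explain plan for
select count(*) from purchased_cars
where make = 'BMW' and model = '320d';

-- Report the recorded usage, including any recommended column groups...
select dbms_stats.report_col_usage(user,'purchased_cars') from dual;

-- ...and create the recommended column groups from the recorded usage.
select dbms_stats.create_extended_stats(user,'purchased_cars') from dual;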
Alternatively, if you want to understand your data more closely and are interested in only specifying column groups for column values that are closely correlated, then it is possible to do this. It can be time-consuming and resource-intensive to do it manually, but you can use a sample of table rows to make it faster. I have included some scripts for fun: they look for pairs of columns with correlated values. It is a brute force method – testing combinations of column pairs – so consider it a starting point for you to play around with.
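To give you a flavour of the idea, the core test for a single pair of columns can be as simple as comparing distinct-value counts (a sketch; the GitHub scripts apply this across all column pairs):

-- If MAKE and MODEL were independent, we would expect close to
-- NDV(make) * NDV(model) distinct combinations. A much smaller
-- NDV(make,model) indicates strong correlation.
select count(distinct make)             as ndv_make,
       count(distinct model)            as ndv_model,
       count(distinct make||'~'||model) as ndv_pair
from   purchased_cars sample (50);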
You can check for column correlations in a table using a 50% row sample:
SQL> @@corr_from_table user customers 50 n
Check all tables accessed by a SQL statement (specified using a SQL ID):
SQL> @@corr_from_sqlid 11r5va7shq602 50 n
Auto Column Group Detection
The examples above rely on DBMS_STATS.SEED_COL_USAGE or manual data analysis. Oracle Database 12c onwards will automatically detect the need for column groups using SQL plan directives, but note that detection will sometimes require relevant SQL statements to be hard parsed more than once. This mechanism has the advantage of suggesting column groups only where significant cardinality misestimates are detected.
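You can inspect the directives that have been created for tables in your schema. For example, a sketch using the standard dictionary views:

-- List SQL plan directives and the objects they relate to.
select o.object_name, o.subobject_name as column_name,
       d.type, d.state, d.reason
from   dba_sql_plan_directives d
join   dba_sql_plan_dir_objects o on (o.directive_id = d.directive_id)
where  o.owner = user
order  by o.object_name, o.subobject_name;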
You can see if a column group has been detected for a table using the column usage report – the column group is (A,B) in this case:
SQL> select dbms_stats.report_col_usage(user,'test_table')
  2  from dual;

LEGEND:
.......

EQ         : Used in single table EQuality predicate
RANGE      : Used in single table RANGE predicate
LIKE       : Used in single table LIKE predicate
NULL       : Used in single table is (not) NULL predicate
EQ_JOIN    : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER     : Used in single table FILTER predicate
JOIN       : Used in JOIN predicate
GROUP_BY   : Used in GROUP BY expression
..................................................

##################################################

COLUMN USAGE REPORT FOR APP.TEST_TABLE
......................................

1. A          : EQ
2. B          : EQ
3. (A, B)     : FILTER
##################################################
You can create the column group as follows:
SQL> select dbms_stats.create_extended_stats(user,'test_table') from dual;

##################################################
EXTENSIONS FOR APP.TEST_TABLE
.............................
1. (A, B) : SYS_STSNA$6DVXJXTP05EH56DTIR0X created
##################################################
Column Groups for all Queries
If you discover that you like column groups, remember that you can seed or auto-detect column usage for an entire workload (you can do it in real-time or using a workload captured in a SQL tuning set). You don’t need to work on one SQL statement or one table at a time.
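For example, assuming the workload has been captured in a SQL tuning set called MY_STS owned by APP (hypothetical names), you can seed column usage from it, or simply monitor the live workload for a period of time:

-- Seed column usage from a captured SQL tuning set: each statement in
-- the STS is parsed and its column usage recorded.
exec dbms_stats.seed_col_usage('MY_STS', 'APP')

-- Alternatively, record column usage for the live workload for an hour.
exec dbms_stats.seed_col_usage(null, null, 3600)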
Caution
Extended statistics are associated with a table and, for this reason, each one can affect many SQL execution plans. This can be seen as both a positive and a negative. For example, a new column group can improve the performance of many queries at once. However, there is always a small risk that it will harm the performance of one or more queries. If you want to proceed with caution, then test the effect of adding extended statistics. Perceived risk is perhaps a reason why extended statistics can be under-used, but there are strategies you can use to convince yourself of their benefit and, as a side-effect, strongly mitigate risk:
- Keep in mind that you can use pending statistics and optimizer_use_pending_statistics to hide changes in statistics from workloads while testing them in a dedicated session (see the sketch after this list, and "t4.sql" for a complete example).
- If you are licensed to use SQL Performance Analyzer (SPA), you can test a workload with and without column group statistics. For example, you can compare two SPA tests, one executed in a session where pending statistics are invisible and one where they are visible (using optimizer_use_pending_statistics).
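In outline, the pending statistics approach looks like this (a sketch using the PURCHASED_CARS example):

-- Newly gathered statistics for this table are held as pending rather
-- than published, so they are invisible to the workload.
exec dbms_stats.set_table_prefs(user,'purchased_cars','PUBLISH','FALSE')

begin
   dbms_stats.gather_table_stats(user,'purchased_cars',
      method_opt=>'for columns (make,model)');
end;
/

-- Make the pending statistics visible in this session only and test.
alter session set optimizer_use_pending_statistics = true;

-- Publish them for all sessions once you are happy.
exec dbms_stats.publish_pending_stats(user,'purchased_cars')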
If you intend to create a column group for columns used in a join, then you should create a matching column group at both ends of the join, i.e., on both of the tables involved. Seeding column usage will result in this anyway, but there might be some rare corner-case exceptions.
Expressions are evaluated when statistics are gathered, so expression statistics add some overhead to statistics gathering. In general, the overhead is minimal, but it depends on several factors. If a table has many columns and UPPER(surname) is added, the difference is unlikely to be noticed; if the table has only two columns, the difference will be proportionally larger. The overhead also depends on the number of expression statistics created and the cost of evaluating each expression.
Conclusion
It is arguably better to improve SQL statements and data models to avoid the need for expression statistics. However, in the real world, I think we have to accept that this is not always practical, so you should always be on the lookout for potential cases whenever you encounter a poorly performing SQL statement.
Column groups are sometimes a necessity, and they are generally not the result of poorly-formed queries or sub-optimal data models.
Post a comment if you have any questions or if you’d like to suggest corrections/improvements to the GitHub scripts.
