Friday Feb 03, 2017

Why are HCC stats being bumped when Smart Scanning row major data in 12.2?

In 12.2, there is a stat "cell blocks pivoted" that points to a new optimization. When Smart Scan processes data, it has to create a new synthetic block that only contains the columns that are needed and the rows that pass the offloaded predicates.

If Smart Scan has to create a new block anyway, why create a row-major block when it can just as easily create an uncompressed HCC block? It is roughly the same amount of work, but once the synthetic block is returned to the RDBMS for processing, columnar data is roughly 10X cheaper to process when the query uses rowsets. So where does the name of this stat come from? Simple: it works just like a spreadsheet PIVOT operation, swapping rows for columns.
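As a toy illustration of the pivot idea (plain Python, nothing to do with Oracle internals), turning a batch of row-major tuples into column-major lists is just a transpose, like a spreadsheet PIVOT:

```python
# Row-major: one tuple per row, as a row source would produce them.
rows = [
    (155190, "Brand#45", 21),
    (67310,  "Brand#12", 45),
    (63700,  "Brand#45", 14),
]

# Column-major: one list per column, cheap to scan with rowset processing.
columns = [list(col) for col in zip(*rows)]

print(columns[1])   # the whole second column in one contiguous list
```

The column values are now contiguous, which is what makes batch (rowset) processing on the RDBMS side so much cheaper.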

So what is a "rowset"? It simply means that a step in a query plan takes in a set of rows and processes that entire batch in one go before fetching more rows; this is significantly more efficient than processing one row at a time. You can see when a query uses rowsets via:

select * from table(dbms_xplan.display_cursor('<sql_id>',1,'-note +projection'));

in which case you will see projection metadata with details such as "(rowsets=200)".

When Smart Scan sees that the query on the RDBMS is using rowsets, it tries to create a columnar output block instead of a row-major output block and when this arrives on the RDBMS, the table scan will see an HCC block and bump the HCC stats even though the table being scanned doesn't use HCC. Let's see how this works in practice:

I created a TPC-H Scale 1 database in NOCOMPRESS format:

SQL> select COMPRESSION,COMPRESS_FOR from user_tables where table_name='SUPPLIER';

COMPRESS COMPRESS_FOR
-------- ------------------------------
DISABLED

and ran Query 6:

SQL> select sum(l_extendedprice * l_discount) as revenue
from   lineitem
where  l_shipdate >= to_date( '1994-01-01',  'YYYY-MM-DD')
        and l_shipdate < add_months(to_date( '1994-01-01',  'YYYY-MM-DD'),  12)
        and l_discount between .06 - 0.01 and .06 + 0.01
        and l_quantity < 24;

SQL> select name,value from v$sysstat where name = 'cell blocks pivoted';

NAME                           VALUE
------------------------------ ----------
cell blocks pivoted                  8261

Just to show how this is tied to rowsets, let's disable this optimization and recheck the stat:

SQL> alter session set "_rowsets_enabled"=false;

SQL> select sum(l_extendedprice * l_discount) as revenue
from   lineitem
where  l_shipdate >= to_date( '1994-01-01',  'YYYY-MM-DD')
        and l_shipdate < add_months(to_date( '1994-01-01',  'YYYY-MM-DD'),  12)
        and l_discount between .06 - 0.01 and .06 + 0.01
        and l_quantity < 24;

SQL> select name,value from v$sysstat where name = 'cell blocks pivoted';

NAME                           VALUE
------------------------------ ----------
cell blocks pivoted                  8261

The stat is unchanged: with rowsets disabled, Smart Scan no longer pivots its output blocks. I hope this helps.

Saturday Jan 21, 2017

When bloggers get it wrong - part 2



Thursday Jan 12, 2017

When bloggers get it wrong - part 1

I've read a number of blog entries from various people who've clearly put great diligence into trying to understand how the decision is made between using the buffer cache for a table scan and using direct path reads for it. Since this is a critical decision from a performance perspective, I've decided to write a more definitive account. Part 1 deals with a few principles, part 2 (which will follow very shortly) will show how this works with tracing and clear up a few misunderstandings, and part 3 will deal with PQ in more detail as soon as I have time.

Serial Scans

Small Table Threshold (STT) 

Serial scans use direct path reads when the object is not a small table; note that the size used to determine whether it is "small" now comes from the optimizer stats rather than from the segment header.

_small_table_threshold defaults to 2% of the buffer cache or to 20 blocks, whichever is bigger (NB: this parameter ignores the effects of compression). An object is small enough for buffer cache reads to be more effective than direct path reads if the number of blocks it contains is less than or equal to the value of _small_table_threshold. Such scans bump the stat "table scans (short tables)".

Medium Table Threshold (MTT)

So what happens when an object is bigger than STT? We fall into the MTT case. MTT is not directly configurable; it defaults to 10% of the buffer cache blocks, but the choice also depends on a cost-based decision that includes how busy the cache is.
NB: PQ and Exadata have different rules.
NB: MTT also depends on _object_statistics being enabled.

For STT < #blocks < MTT, if the cost-based decision chooses a buffer cache scan, the stat "table scans (short tables)" will still get bumped.

So when is a medium table not worth caching?

First, when MTT < #blocks (ignore VLOT when you see it mentioned in blogs: it hasn't been used for many years but persists in the trace output, and hence in blogs, for historic reasons).

Second, unless the scan is part of a select for update: different rules apply there, because cached scans are more useful if the segment about to be updated was first the subject of a query. Tables larger than MTT also factor in the storage reduction factor: the ratios used are 10X for HCC and 3.3X for OLTP compression.

Third, a cost based analysis is done comprising the % currently in cache, the % buffers are dirty (check-pointing cost), the current I/O bottleneck, and any RAC access costs. If the buffer cache is still used the stat “table scans (long tables)” will get bumped. But, if Direct Read is chosen the stat “table scans (direct read)” will get bumped.
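The decision tree above can be sketched in a few lines. This is a hypothetical toy model for illustration only: the function name, parameters, and cut-off values (other than the STT and MTT defaults stated above) are invented, and the real cost model is internal to Oracle and far more nuanced.

```python
def scan_method(blocks, cache_blocks, cached_pct=0.0, dirty_pct=0.0, io_busy=False):
    """Toy model of the serial full-scan decision; not Oracle's actual code."""
    stt = max(cache_blocks * 0.02, 20)   # _small_table_threshold default: 2% or 20 blocks
    mtt = cache_blocks * 0.10            # medium table threshold: ~10% of the cache
    if blocks <= stt:
        return "buffer cache"            # bumps "table scans (short tables)"
    if blocks <= mtt:
        # cost-based choice; a buffer cache scan here still counts as short-table
        return "buffer cache" if cached_pct > 0.5 and not io_busy else "direct read"
    # larger than MTT: weigh % cached, checkpoint cost of dirty buffers, I/O load
    if cached_pct > 0.9 and dirty_pct < 0.1 and not io_busy:
        return "buffer cache"            # bumps "table scans (long tables)"
    return "direct read"                 # bumps "table scans (direct read)"

print(scan_method(blocks=100, cache_blocks=10000))     # small table: buffer cache
print(scan_method(blocks=50000, cache_blocks=10000))   # large, cold: direct read
```

The point of the sketch is the shape of the decision, not the numbers: small tables always go through the cache, medium tables get a cost-based choice, and large tables must clear a much higher bar before the cache is used.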

Automatic Big Table Caching (ABTC)

So does this mean the buffer cache never helps with large tables? Not at all: large tables can still use the buffer cache for a portion of their scan, based on the public parameter DB_BIG_TABLE_CACHE_PERCENT_TARGET, which sets aside a percentage of the buffer cache for larger tables.

On a single instance, ABTC works with both serial and PQ scans, BUT mixing serial and PQ may lead to fewer cache hits. On RAC, it only works with PQ and requires that PARALLEL_DEGREE_POLICY be set to AUTO or ADAPTIVE.

ABTC works by tracking the heat of medium and large tables and is designed to take advantage of any runs of buffers currently cached. Then, if the decision to use ABTC is made, it periodically checks how many buffers in the next section are already in memory to see whether a Direct Read or a buffer cache scan would be better for the next chunk. Let's see how the heat tracking works by looking at the v$bt_scan_obj_temps view.


  DATAOBJ# SIZE_IN_BLKS TEMPERATURE
---------- ------------ -----------
     79042         1689        3000

Each time we scan a medium or larger table it will bump that object's temperature by 1000. Then, the object with the highest temperature will get absolute priority for caching and, if needed, will evict any previously cached ABTC blocks of other objects.
If the hottest object only partially fits in the buffer cache (which is quite likely), the first section that fits will be read through the buffer cache and cached then the rest of the scan will switch to Direct Read when the ABTC allocation in the cache is fully utilized.
NB: The object will still be check-pointed at the beginning so that it is safe to switch to Direct Read at any time.
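The heat tracking described above can be modelled in a few lines. This is a toy sketch with invented names; the real bookkeeping lives inside the RDBMS and cellsrv and is considerably more involved.

```python
temps = {}

def note_scan(obj):
    """Each scan of a medium or larger table bumps its temperature by 1000."""
    temps[obj] = temps.get(obj, 0) + 1000

def hottest():
    """The hottest object gets absolute priority for the ABTC cache area."""
    return max(temps, key=temps.get)

note_scan("LINEITEM")
note_scan("SALES")
note_scan("SALES")
note_scan("SALES")
print(hottest(), temps[hottest()])   # SALES at temperature 3000, like the view above
```

Three scans yield a temperature of 3000, matching the kind of value seen in v$bt_scan_obj_temps; the hottest object then wins the right to evict other objects' ABTC blocks.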

Part 2 will include tracing to show how to make use of this feature. 

Friday Nov 18, 2016

Controlling the offload of specific operators

One of the joys of regular expressions is that you can write a pattern that is painfully expensive to match, and offloading such patterns to the cells can significantly impact other users and overall throughput (including heartbeat issues). If you have a user who is prone to writing bad regexp patterns, you as DBA can prevent regexp (or any other operator) from being offloaded to the cells.

Let's take a very simple example using a cut down version of TPC-H Query 16 and a NOT LIKE predicate: 

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
group by p_brand, p_type, p_size;


 |*  3 |    TABLE ACCESS STORAGE FULL| PART | 29833 |  1048K|    |   217   (2)| 00:00:01 |  1 |  8

   3 - storage(("P_SIZE"=3 OR "P_SIZE"=9 OR "P_SIZE"=14 OR "P_SIZE"=19 OR "P_SIZE"=23 OR "P_SIZE"=36 OR "P_SIZE"=45 OR "P_SIZE"=49) AND "P_BRAND"<>'Brand#45' AND "P_TYPE" NOT LIKE 'MEDIUM

Here we see all the predicates get offloaded as expected. So, for example, to stop NOT LIKE being offloaded, we need to find the operator in v$sqlfn_metadata:

SQL> column descr format a18
SQL> select func_id, descr, offloadable from v$sqlfn_metadata where descr like '%LIKE%';

   FUNC_ID DESCR              OFF
---------- ------------------ ---
        26  LIKE              YES
        27  NOT LIKE          YES
        99  LIKE              NO
       120  LIKE              YES
       121  NOT LIKE          YES
       524  REGEXP_LIKE       YES
       525  NOT REGEXP_LIKE   YES
       537  REGEXP_LIKE       YES
       538  NOT REGEXP_LIKE   YES

We can ignore all but the two basic LIKE operators in this case, so to disable the offload of our LIKE predicates we use:

SQL> alter session set cell_offload_parameters="OPT_DISABLED={26,27};";

and we see this reflected in the offloadable column in v$sqlfn_metadata.

SQL> select func_id, descr, offloadable from v$sqlfn_metadata where descr like '%LIKE%';

   FUNC_ID DESCR              OFF
---------- ------------------ ---
        26  LIKE              NO
        27  NOT LIKE          NO
        99  LIKE              NO
       120  LIKE              YES
       121  NOT LIKE          YES

To re-enable them you would use:

SQL> alter session set cell_offload_parameters="OPT_DISABLED={};";

One thing to note about this param is that it doesn't work like events (whose settings are additive): here the new value replaces the previous one, so every operator you want disabled has to be included in the same alter session (and the param is limited to a maximum of 255 characters, which limits the number of operators that can be disabled). With the offload of LIKE and NOT LIKE disabled we can see the impact on the plan:

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
group by p_brand, p_type, p_size;


 |*  3 |    TABLE ACCESS STORAGE FULL| PART | 29833 |  1048K|    |   217   (2)| 00:00:01 |  1 |  8

   3 - storage(("P_SIZE"=3 OR "P_SIZE"=9 OR "P_SIZE"=14 OR "P_SIZE"=19 OR "P_SIZE"=23 OR "P_SIZE"=36 OR "P_SIZE"=45 OR "P_SIZE"=49) AND "P_BRAND"<>'Brand#45')

and the NOT LIKE is no longer in the storage filter. Now let's say that you as DBA are faced with a more complex problem and want to temporarily halt all complex processing on the cells. There is a parameter that will disable everything except the simple comparison operators and NULL checks:

SQL> alter session set "_cell_offload_complex_processing"=FALSE;

Now let's see what happens:

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
group by p_brand, p_type, p_size;


 |*  3 |    TABLE ACCESS STORAGE FULL| PART | 29833 |  1048K|    |   217   (2)| 00:00:01 |  1 |  8

   3 - filter(("P_SIZE"=3 OR "P_SIZE"=9 OR "P_SIZE"=14 OR "P_SIZE"=19 OR "P_SIZE"=23 OR "P_SIZE"=36 OR "P_SIZE"=45 OR "P_SIZE"=49) AND "P_BRAND"<>'Brand#45' AND "P_TYPE" NOT LIKE 'MEDIUM POLISHED%')

Well, we got no storage predicates at all, and we didn't expect that because we had one simple predicate, namely p_brand <> 'Brand#45', and the IN predicate had been rewritten to a series of OR'ed comparisons. So what happened? This parameter only permits simple predicates that are linked by ANDs and can be attached directly to one column. Disjuncts are not pushable, so they are normally evaluated by an eva tree or by pcode, neither of which is sent to the cell with this parameter set to FALSE. So why wasn't our one simple predicate offloaded? Note where it is in the explain plan: it comes after the rewritten IN, and since the predicates are sorted by the optimizer on effectiveness, we stop looking as soon as we see one that can't be offloaded. Let's remove the IN and see what happens:
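That early-exit behavior can be sketched like this (a hypothetical model, not actual kernel code): predicates arrive in optimizer order, and offload evaluation stops at the first one that cannot be pushed.

```python
def storage_predicates(ordered_preds, can_offload):
    """Return the prefix of predicates that make it into the storage() clause."""
    pushed = []
    for pred in ordered_preds:
        if not can_offload(pred):
            break                    # everything after this stays on the RDBMS
        pushed.append(pred)
    return pushed

# With complex processing disabled, the OR'ed IN-list rewrite is not pushable,
# and because it sorts ahead of the simple predicate, nothing is offloaded:
preds = ["P_SIZE IN-list (OR chain)", "P_BRAND<>'Brand#45'"]
print(storage_predicates(preds, lambda p: "OR" not in p))    # []

# Drop the IN-list and the simple predicate is offloaded again:
print(storage_predicates(preds[1:], lambda p: "OR" not in p))
```

This is why removing the IN-list below lets the simple p_brand predicate reach the storage clause.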

SQL> explain plan for select  p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%';

|*  2 |   TABLE ACCESS STORAGE FULL| PART |   190K|  6686K|   217   (2)| 00:00:01 | 1 | 8 |

   2 - storage("P_BRAND"<>'Brand#45')
       filter("P_BRAND"<>'Brand#45' AND "P_TYPE" NOT LIKE 'MEDIUM POLISHED%')

As expected, the simple predicate is now offloaded. If you look at v$sqlfn_metadata, you'll see this param reflected in the offloadable column:

SQL> select func_id, descr, offloadable from v$sqlfn_metadata where descr like '%LIKE%';

   FUNC_ID DESCR              OFF
---------- ------------------ ---
        26  LIKE              NO
        27  NOT LIKE          NO
        99  LIKE              NO
       120  LIKE              NO

       121  NOT LIKE          NO
       524  REGEXP_LIKE       NO
       525  NOT REGEXP_LIKE   NO
       537  REGEXP_LIKE       NO
       538  NOT REGEXP_LIKE   NO

I hope you never need any of this in real life but it's good to have it in the toolbag.

Wednesday Jun 29, 2016

More on tracing the offload server

I posted a while back on Tracing Hybrid Columnar Compression in an offload server, so this is a quick follow-up.

1. I have trouble remembering the syntax for setting a regular parameter in an offload server without bouncing it. Since I need to keep this written down somewhere, I thought it might be of use to support folks and DBAs.

2. I forgot to show you how to specify which offload group to set the trace event in.

So this example should do both: 


CellCLI> alter cell offloadGroupEvents = "immediate cellsrv.cellsrv_setparam('my_parameter', 'TRUE')", offloadGroupName = "SYS_122110_160621"

This will, of course, set the parameter only temporarily, until the next time the offload server is bounced; adding it to the offload group's init.ora as well will take care of that.

Wednesday Jun 22, 2016

Shining some light on Database In-Memory vs the Exadata Columnar Cache

Over the last few weeks I have gotten the following questions from a number of different sources: "What's the difference between Oracle's various columnar technologies and how do they fit together?" Or, put another way, "Which scenarios does each benefit, and would I ever need both, or will one cover all the bases?" When a question comes up often enough it is always worth a blog post, so here goes.

They say a picture is worth a thousand words so my own mental picture of the difference between DBIM and the Exadata Columnar Cache is that of the difference between an Indycar design (or a Formula 1 design) and a NASCAR vehicle:

Both of these vehicles are thoroughbreds in their own way: both honed over the years to obtain every ounce of performance possible from the given chassis but you would not do as well entering an Indycar into a NASCAR race and vice versa. So let's start at the beginning.

Hybrid Columnar Compression 

We originally designed Hybrid Columnar Compression (HCC) in response to customer requirements to retain growing volumes of colder data: it was primarily a compression technology rather than a performance technology. The goal was to compress colder data as much as possible while retaining the ability to run both table scans and index-access scans against it at acceptable speed and so we chose to apply a spreadsheet pivot to chunks of data to turn a set of rows into columnar format. 

The compression benefits of a columnar format were well known at that time, but the existing pure columnar databases had struggled to make significant sales, due partly to the painfully slow single-row retrieval that comes from needing one disk I/O per column.

Therefore, we chose a hybrid approach to enable us to retain row retrieval in one or at the most two disk I/Os but still have columnar benefits. The primary goal of our initial design was to get the maximum compression by applying first semantic compression then bitwise compression and choosing from a variety of algorithms based on what used the least space on disk.

The default compression level uses ZLIB for its bitwise compression because, on spinning disks, it is typically far cheaper to decompress ZLIB than it is to do the extra disk I/O of reading uncompressed data. The bulk of the performance improvement came from this cost reduction, with additional improvements coming from (a) direct addressing of the columns of interest (instead of navigating the row to get to a column), and (b) applying predicates and expressions on columnar data. Research had shown that for traditionally applied predicates (as opposed to DBIM's SIMD) the bulk of the performance improvement is seen by the time you have ~1,000 values. For this reason, the default Query High compression targets 1,000 rows, as long as they fit in 64 KB after compression.

The Columnar Cache 

One weakness of the hybrid design is that as table width grows, a query will typically use a smaller and smaller percentage of the data in a compression unit, and the lack of true columnar disk I/O starts to hurt performance. The other weakness is that with very narrow tables the cost becomes dominated by decompression time; ideally one would switch to Query Low, which has much cheaper decompression costs, but the extra disk I/O of reading the unused columns would hurt performance significantly, so no one uses it.

Both the problems with very narrow and very wide tables required true columnar I/O so we came up with a simple solution: rearrange 1 MB worth of HCC data into true columnar form and store the page mapping in the flash cache's hash table to tell us where the columns are. It is not uncommon to see savings of 50-80% disk I/O from the columnar cache and significant improvements to the wall clock for queries which are cell CPU bound. The thing to note here is that every block in the 1 MB aligned chunk must be in HCC format: if there are any row major blocks from updates or from Inserts that were too small to compress, they will stop the block from being re-arranged into Columnar Cache format.

The advent of the NVMe flash cache in Exadata X5 and X6, with blindingly high scan speeds, might have removed one of the arguments for using Query High as opposed to Query Low, but performance measurements on NVMe showed that the compression unit sizes for Query Low were too small and that they should be the same size as Query High uses. (Note: getting better compression is still the main reason to use Query High, which is why it will remain the default.) The next major release of Oracle will use the larger CU size for Query Low, but in the meantime there is a parameter available to force larger CU sizes if any customers need to use Query Low to avoid being cell CPU bound (particularly on the EF machines). Please email me if you have an X5 or X6 and need to try a larger size.

However, HCC and its columnar cache do nothing to help RDBMS-bound queries, which is the scenario many customers face, and there is nothing the columnar cache can do for hot data that is still being updated. What our customers were telling us was that they needed to extend columnar processing seamlessly across the Information Management Lifecycle to cover both hot and cold data. And with that, we get to the story of Database In-Memory.

Database In-Memory 

Oracle's unique Hybrid approach had already proved that optimal performance requires both quick access to a single row of data and quick columnar scan access. The issue was how to do this on hot data that is still undergoing DML without breaking OLTP performance. Oracle's innovation was to maintain the data simultaneously in the traditional row-major on-disk format and in an In-Memory segment holding the data in columnar format. By keeping the columnar data in memory, there would be no changes to the existing OLTP code path and no redo generated to slow write transactions down. Rows which had very recently been updated, and which were for that reason invalidated in the column store, would still be in the buffer cache, allowing quick row retrieval when needed.

You would be forgiven for thinking that all we did was take the HCC formats and put long runs of them in memory, making it simply a new form of HCC. But that's the antithesis of what we did. Remember, the HCC formats were chosen primarily for their compression abilities and for the ability to retrieve a row from disk in as few I/Os as possible. Since the focus of this project was raw columnar performance, the columnar formats were refined and reworked into forms finely tuned for query performance using SIMD instructions (SIMD predicates are not supported on some HCC formats), including the introduction of dictionary encoding (which is not available to HCC because of its shorter column runs and the requirement that Oracle blocks on disk are always entirely self-describing).

In addition to fast SIMD predicates and higher scan speeds, bringing the columnar data into the RDBMS brought several more advantages:

  1. With the very high scan speeds (because there's no I/O involved) enabling predicates to find narrow ranges of data fast, more indexes could be dropped to improve OLTP transactional performance.
  2. More predicates could be pushed into the columnar format: Smart Scan is unable to offload any predicates that require support from the RDBMS kernel, in particular PL/SQL functions. Many LOB, CLOB, and XML predicates require PL/SQL runtime support.
  3. In the RDBMS, Vector Group By can be used to do the actual grouping not just as a predicate. Smart Scan offloads Vector Group By as the equivalent of an improved bloom filter: it can do the pruning but not the grouping - that is still done after it is returned to RDBMS.

Why would I use both formats?

To bring this all together: there is a coherent information lifecycle management plan that gets optimal OLTP performance AND optimal query performance by placing the hotter data in memory in columnar format; as data cools to the point where it is unlikely to undergo significant updates, it is tiered out to the Exadata Columnar Cache via HCC formats, and optionally further tiered out to ZFSSA or BDA when the data is cold enough that it is only being retained for compliance purposes and is unlikely to be queried.

Disclaimer: these are my personal thoughts and do not represent Oracle's official viewpoint in any way, shape, or form.

Tuesday May 24, 2016

What you need to know about Offload Quarantines

Several questions have come up in the last few weeks about offload quarantines, which means a blog post on this topic is overdue. We work hard to stress test every new rpm that is released, but on rare occasions customers can encounter an issue with the thin database layer that exists in the offload server. This layer is known externally as "Smart Scan" and internally as "FPLIB" (a.k.a. Filter Projection Library).

A crash in the thin database layer can be caused either by an issue with some aspect of the SQL (for example, with the predicates) or by an issue with the data in some region of disk (for example, with the OLTP compression symbol table). The worst, and rarest, form of crash is where striping leads to every offload server failing simultaneously: these are known colloquially as "Railroad Crashes". The most important thing is to make sure the retry mechanism doesn't immediately resubmit the query and re-crash the offload server, bringing business operations to a halt. In a hospital, the floor nurse would call a code and the crash team would come running with a crash cart to stabilize the patient. Two members of my family are nurses, and I'm reminded that nurses are a lot like technical support engineers in that while doing their job they sometimes have to deal with abuse from frustrated patients (customers): please remember that both groups work hard to resolve your issues, and be kind to them!

What are Quarantines?

The option of calling a crash cart is not available to us here, so we created a quarantine system: after a crash, the exception handler remembers both the sql_id and the disk region being processed and creates a persistent quarantine for both. When a sql_id or a disk region is quarantined, any Smart Scan operations on them will be executed in passthru mode.

Currently it is hard to see when this is happening, but an upcoming release has a new stat to make this easier. If an operation has been quarantined you can see its effects by monitoring:

  • cell num bytes in passthru due to quarantine

Listing Quarantines 

When a quarantine has been created you can look at it in detail using CellCLI:

CellCLI> list quarantine 1 detail
     name:                      1
         clientPID:             12798
         crashReason:           ORA-600[17099]
         creationTime:          2011-02-07T17:18:13-08:00
         dbUniqueID:            2022407934
         dbUniqueName:          YAMA
         incidentID:            16
         planLineID:            37
         quarantineReason:      Crash
         quarantineType:        "SQL PLAN"
         rpmVersion:            OSS_11.
         sqlID:                 1jw05wutgpfyf
         sqlPlanHashValue:      3386023660

The 'list detail' version of the command gives us everything we would need to know about exactly what has been quarantined and why it was quarantined. CellCLI also supports manually creating a quarantine using the attributes shown by 'list detail'.

Quarantine Escalation

This is the topic that has caused the most confusion: if three new quarantines are generated within a 24-hour period, the quarantine is escalated to a database quarantine. Using the 'list detail' option we would then see:

quarantinePlan:         SYSTEM
quarantineReason:       Escalated
quarantineType:         DATABASE

Note: the number of quarantines in 24 hours before escalation is configurable via a cellinit param: please contact Technical Support if you feel you have a valid need to change this.

The final level of escalation is where, if more than one database has been escalated to a database quarantine, the system escalates to a complete offload quarantine: Smart Scan is disabled completely and all I/O goes through regular block I/O. I'm glad to say I have never seen this happen.
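The escalation ladder can be summarized in a few lines. This is a toy model: the function name is invented, and the threshold of three quarantines per 24 hours is the configurable default described above, not a hard-wired constant.

```python
def quarantine_level(new_quarantines_in_24h, dbs_with_db_quarantine):
    """Toy model of quarantine escalation; not the actual cellsrv logic."""
    if dbs_with_db_quarantine > 1:
        return "OFFLOAD"     # Smart Scan disabled entirely; all regular block I/O
    if new_quarantines_in_24h >= 3:
        return "DATABASE"    # the whole database runs Smart Scan in passthru
    return "SQL PLAN"        # only the offending plan step is quarantined

print(quarantine_level(1, 0))   # SQL PLAN
print(quarantine_level(3, 0))   # DATABASE
print(quarantine_level(3, 2))   # OFFLOAD
```

Each rung trades away more offload capability in exchange for keeping the cells stable.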

Dropping Quarantines 

The next question is how and when are the quarantines removed. Any quarantine can be removed manually using CellCLI. Quarantines are also automatically dropped by certain operations:

  • Plan Step Quarantines:
    • Dropped on rpm upgrades
  • Disk Region Quarantines
    • Dropped on rpm upgrades, or on successful writes to the quarantined disk region
  • Database quarantines
    • Dropped on rpm upgrades
  • Offload quarantines
    • Dropped on rpm upgrades

What about CDB?

In 12.1, we changed the architecture of cellsrv to support multiple RDBMS versions running at the same time by introducing the concept of offload servers. When a new rpm is installed, it typically contains an offload server for each supported RDBMS version. This is known internally as multi-DB. Any given operation is tagged with the RDBMS version it is coming from and routed to the offload server for that version. A crash in Smart Scan typically means that only the offload server has to restart, and not the central cellsrv that maintains the Storage Index and does Smart IO. A side effect of this is that all operations for that RDBMS version can revert to Block IO while the offload server restarts, minimizing disruption.

The architecture change necessitated a change to the way quarantines are created, checked, and dropped. In multi-DB, installation of a new rpm no longer drops all quarantines. Instead, system-created quarantines now record the offload server's rpm version, and manually created quarantines can optionally specify the offload rpm they are to affect. In multi-DB, a quarantine is observed if the specified offload rpm matches the actual offload rpm the operation will be sent to, or, if no offload rpm is specified, regardless of offload group.
Multi-DB quarantines are dropped if the matching offload rpm is uninstalled or a new rpm is installed for that offload version. Multi-DB quarantines with no offload rpm specified must be dropped manually.

Please let me know if you have any questions. 


Friday Apr 08, 2016

Using INMEMORY with External Tables

Several people have asked if there is any way to use INMEMORY on External Tables, because the INMEMORY syntax is not supported in Create Table or Alter Table with Organization External (actually, Create Table parses it but then ignores it).

While there is no way out of having to load the data into Oracle, there is a short cut to simplify life when you have an external process recreating flat file data on a regular basis and want to query the latest incarnation against corporate data in Oracle Tablespaces. One example we've seen of this is where you have a map-reduce job on Hadoop that highly summarizes large amounts of ephemeral data and it is the summary data that needs to be joined against Oracle data.

Instead of truncating the Oracle table and doing a flat file load each time, we can use a Materialized View on the external table and simply issue an on-demand refresh and let the materialized view take care of the truncate and load under the covers for us.

Since INMEMORY does support Materialized Views, we can now automatically get INMEMORY query speeds on the data from an external source.

Let's see how this works. For example here's a loader table I use with TPC-H Scale 1 for local testing:

create table p_et(
p_partkey number, p_name varchar2(55), p_mfgr char(25),
p_brand char(10), 
p_type varchar2(25), p_size number,
p_container char(10), p_retailprice number, p_comment varchar2(23)) 
organization external( 
 default directory T_WORK 
 access parameters( 
 records delimited by newline 
 fields terminated by '|' 
 missing field values are null) 
 location (T_WORK:'part.tbl')) 
 reject limit unlimited;

Instead of just getting the external data into an Oracle MVIEW segment to query, I can get it into an INMEMORY segment and get much better query performance:

SQL> create materialized view im_p_et
inmemory memcompress for capacity low
as select * from p_et;

SQL> alter materialized view im_p_et ENABLE QUERY REWRITE ;

That succeeds, so first let's make sure our queries are actually going to use this Materialized View:

SQL> begin
  2    dbms_mview.explain_rewrite(
  3      query=>'select count(p_comment) from p_et',
  4      mv=>'IM_P_ET');
  5  end;
  6  /

PL/SQL procedure successfully completed.

SQL> select message from rewrite_table;
QSM-01151: query was rewritten
QSM-01209: query rewritten with materialized view, IM_P_ET, using text match algorithm

OK so we are sure this simple query can use our Materialized View, let's verify that it actually did:

Note: my table was too small so I had to force the XT MVIEW to be loaded:

SQL> begin dbms_inmemory.repopulate(schema_name=>'TPCH', table_name=>'IM_P_ET', force=>TRUE); end;
2 /
PL/SQL procedure successfully completed.

SQL> select INMEMORY_SIZE,POPULATE_STATUS from v$im_segments where SEGMENT_NAME='IM_P_ET';

INMEMORY_SIZE POPULATE_STATUS
------------- -------------
      7602176 COMPLETED

SQL> set autotrace on 
SQL> select count(p_comment) from p_et;


Execution Plan
Plan hash value: 703290957
| Id  | Operation       | Name  | Rows  | Bytes | Cost (%CPU)| Time  |
|   0 | SELECT STATEMENT       |  | 1 |    13 |    50  (62)| 00:00:01 |
|   1 |  SORT AGGREGATE       |  | 1 |    13 |       |  |
|   2 |   MAT_VIEW REWRITE ACCESS INMEMORY FULL| IM_P_ET |   152K|  1941K|    50  (62)| 00:00:01 |

Statistics
----------------------------------------------------------
          1  recursive calls
 3  db block gets
1881  consistent gets
 0  physical reads
 0  redo size
550  bytes sent via SQL*Net to client
607  bytes received via SQL*Net from client
  2  SQL*Net roundtrips to/from client
  0  sorts (memory)
  0  sorts (disk)
  1  rows processed 

and in the query plan we see ' MAT_VIEW REWRITE ACCESS INMEMORY FULL' and in the stats section we see physical reads is now zero. We are now getting very fast INMEMORY queries against a Materialized View built using data from the flat file on disk.
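If you want belt-and-braces confirmation, the Database In-Memory session statistics should also have been bumped. A quick sketch (the 'IM scan CUs%' names are the standard DBIM scan counters; run this in the same session as the query):

```sql
select n.name, s.value
from   v$statname n, v$mystat s
where  n.statistic# = s.statistic#
and    n.name like 'IM scan CUs%'
and    s.value > 0;
```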

Here are some things to remember when using Materialized Views with External Tables:

  • The Materialized View must have rewrite enabled 
  • The optimizer_mode must be set to a cost-based mode, e.g. 'first_rows_100' 
  • The query_rewrite_enabled parameter must be set to 'true' or 'force' 
  • The query_rewrite_integrity parameter for an XT must be set to 'trusted' or 'stale_tolerated'
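As a sketch, the session-level settings for the last two bullets look like this (choose the least permissive values that work for your environment):

```sql
alter session set query_rewrite_enabled = force;
alter session set query_rewrite_integrity = stale_tolerated;
```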

Tuesday Mar 15, 2016

Compression in a well-balanced system

This continues the previous entry (Is there such a thing as fast enough?)

Exadata is designed to present a well-balanced system between disk scan rates, cell CPU processing rates, network bandwidth, and RDBMS CPU such that, on average across a lot of workloads and a lot of applications, there should be no one obvious bottleneck. Good performance engineering means avoiding a design like this:

because that would be nuts. If you change one component in a car, e.g. the engine, you need to rework the other components such as the chassis and the brakes. When we rev the Exadata hardware we go through the same exercise to make sure the components match to avoid bottlenecks.

One of the components in a well-balanced system that people often leave off this list is the compression algorithm. With compression you are balancing storage costs and disk scan rates against the CPUs' decompression rate. In a disk-bound system you can alleviate the bottleneck by compressing the data down to the point that it is being scanned at the same rate that the CPUs can decompress and process it. Squeeze it too far, say with HCC "Compress For Archive High" which uses BZ2, and the CPUs become a major bottleneck because the decompression costs unbalance the system (but it's wonderful if you think you're unlikely to ever need the data again).

Since HCC was released with Exadata 11.2, DBAs have almost always used Query High (ZLIB) because it balances spinning disks against the Xeon CPUs quite nicely while retaining a storage size small enough to retrieve in a single disk I/O on 256 KB formatted extents. With Query High we tend to see an equal chance of a given query being disk bound or CPU decompression bound.

Prior to X3, HCC Query Low (LZO) was never a performance option for scans since the reduced CPU load couldn't compensate for the 2X disk reads required, even on flash, so Query Low scans were going to be disk bound. The only utility for Query Low was when there were a lot of single-row reads: the 32 KB storage size combined with cheap decompression made row retrieval much faster than Query High, where you are decompressing twice as much data with a more expensive decompression algorithm. With the X3 flash cache we saw something new: the Xeons would decompress LZO at a speed that very roughly matched the SSD scan speeds of the flash cache. While some queries could now run faster using Query Low than Query High, that didn't provide a compelling reason to use Query Low since it uses not only twice as much storage space but twice as much space in the flash cache, and you had to be sure it was in the flash cache to see that speed.

With the introduction of X5 just over a year ago the balance changed. I am a huge fan of the X5-EF with its flash-only configuration. While X5-HC offers 10% more storage capacity and a flash cache that is as fast as flash tablespace, the X5-EF offers consistency: you never need to ask whether the data is in the cache or pay the overhead of maintaining the cache. It also offers the chance to leave the past and the notion of spinning disks behind. The X5-EF can scan at over 250 GB/sec and offers 288 fast cores in the storage layer for decompression. Clearly if your primary concern is storage capacity you will want to stay with Query High, but if we are thinking purely in terms of exploring performance, the time has come for a reassessment of the role of HCC Query Low on X5. This will be the topic of the next blog entry.

In the meantime, I'll leave you with my own mental image of the X5-EF - the sleek well-balanced very responsive BMW i8 which can do 0 to 60 in 4.4 seconds and 50 to 75 mph in 2.6 seconds (hmmm, though since the i8 is a hybrid maybe it corresponds to the X5-HC ... just a thought).

Thursday Mar 10, 2016

Is there such a thing as fast enough?

Have you seen this video which went viral?

What fascinates me is that I remember upgrading from Win 3.1 to Win 95 and being blown away by how fast and smooth it was. The feeling of "this is it", this is fast enough to feel really productive. But sadly today's amazement all too quickly becomes tomorrow's frustration. We become addicted to the buzz of doing more interesting things and doing them more quickly. When I started in DW in the 90's, we thought an analyst's attention span was around 10 minutes: if complex TB-scale queries came back in ten minutes they could keep their train of thought and go on asking interesting questions of the data. It wasn't long before 10 minutes was painful; we wanted one minute, then 10 seconds, then ...

At Hotsos 2016 in Dallas this week I was talking about flash performance. When Exadata X3 came out the performance was amazing. In some scenarios the flash cache could serve data faster than the Xeons could decompress ZLIB, and for the first time LZO became a viable performance option for flash. With X5-EF, the NVMe flash subsystem can scan at over 250 GB/sec, dwarfing the 40 GB/sec the QDR InfiniBand links can return, which is why having Smart Scan cut the data down at the storage layer is now more essential than ever.

I look at 250+ GB/sec and can't quite get my head around what that looks like as a processing rate - reading the entire Library of Congress in less than a minute? I wonder if a time will come when I will look at NVMe and think how on earth did we live with the frustration? I don't know, but I do know that with X5 now and with future hardware options on the horizon, this is the most exciting time in my 30+ year career to be working at the heart of database performance architecture.

I'm going to be continuing this discussion with some of the performance findings from Hotsos: stay tuned.

Let me know if there are any Smart Scan performance topics you'd like to see posts on. 


Monday Feb 15, 2016

Hotsos 2016

If you want to understand more about how Smart Scan works with Hybrid Columnar Compression, I will be presenting a performance paper at Hotsos 2016 in Dallas March 6-10.


We will be looking at the effect that changing CPU, disk, and flash technologies in recent Exadata hardware revisions has on HCC, and how that interacts with the columnar cache and different workloads.

Tuesday Dec 15, 2015

Double counting of HCC stats prior to next major release

HCC stats have been confusing for many since their introduction in 11.2 both because of the lack of a common naming convention and because of double counting.

When we did DBIM we spent a lot of time getting the stats right, both in terms of a naming convention that lent itself to straightforward wildcard searches and in terms of contents. For the next major release, the introduction of the IMC-format Columnar Cache meant that the time had come to revamp the HCC stats more in line with the DBIM stats.

The double counting came from the fact that the decompression library (kdzd) had a stats container that was updated for each operation, with the final tally copied into the stat in a one-to-one correspondence. Each CU and each column processed on the cell was matched by the set of calls that looked at the CUs in the synthetic blocks returned by Smart Scan to see if they were secondarily compressed (which sometimes happens in order to squeeze the results into the output buffer); consequently each CU and each column incremented the same stat twice.

In the next major release these have all been split into pairs of stats 'HCC scan [ cell | rdbms ] %'.

This table provides a handy mapping for anyone looking to find where their favourite stats have gone:

Table 1. Load Stats

 Previous Stat  New RDBMS Stat  New Cell Stat
 EHCC Analyze CUs Decompressed  HCC analyze table CUs
 EHCC Analyzer Calls  HCC analyzer calls
 EHCC Archive High CUs Compressed Conventional  HCC load conventional CUs archive high
 EHCC Archive High CUs Compressed Direct  HCC load direct CUs archive high
 EHCC Archive Low CUs Compressed Conventional  HCC load conventional CUs archive low
 EHCC Archive Low CUs Compressed Direct  HCC load direct CUs archive low
 EHCC Query High CUs Compressed Conventional  HCC load conventional CUs query high
 EHCC Query High CUs Compressed Direct  HCC load direct CUs query high
 EHCC Query Low CUs Compressed Conventional  HCC load conventional CUs query low
 EHCC Query Low CUs Compressed Direct  HCC load direct CUs query low
 EHCC CU Row Pieces Compressed Conventional  HCC load conventional CUs row pieces
 EHCC CU Row Pieces Compressed Direct  HCC load direct CUs row pieces
 EHCC CUs Compressed Conventional  HCC load conventional CUs 
 EHCC CUs Compressed Direct  HCC load direct CUs 
 EHCC Rows Compressed Conventional  HCC load conventional rows
 EHCC Rows Compressed Direct  HCC load direct rows
 EHCC Rows Not Compressed Conventional  HCC load conventional rows not compressed
 EHCC Rows Not Compressed Direct  HCC load direct rows not compressed
 EHCC Compressed Length Compressed Conventional  HCC load conventional bytes compressed
 EHCC Compressed Length Compressed Direct  HCC load direct bytes compressed
 EHCC Decompressed Length Compressed Conventional  HCC load conventional bytes uncompressed
 EHCC Decompressed Length Compressed Direct  HCC load direct bytes uncompressed
 EHCC Attempted Block Compressions  HCC block compressions attempted
 EHCC Block Compressions  HCC block compressions completed

Table 2. Query Stats

Previous Stat  New RDBMS Stat  New Cell Stat
 EHCC Archive High CUs Decompressed  HCC scan rdbms CUs archive high  HCC scan cell CUs archive high
 EHCC Archive Low CUs Decompressed  HCC scan rdbms CUs archive low  HCC scan cell CUs archive low
 EHCC Query High CUs Decompressed  HCC scan rdbms CUs query high  HCC scan cell CUs query high
 EHCC Query Low CUs Decompressed  HCC scan rdbms CUs query low  HCC scan cell CUs query low
 EHCC CUs Decompressed  HCC scan rdbms CUs decompressed  HCC scan cell CUs decompressed
 EHCC Check CUs Decompressed  HCC block check CUs
 EHCC Dump CUs Decompressed  HCC block dump CUs
 EHCC Rowid CUs Decompressed  HCC fetch by rowid CUs
 EHCC Normal Scan CUs Decompressed  HCC scan rdbms CUs normal
 EHCC Turbo Scan CUs Decompressed  HCC scan rdbms CUs turbo
 cell CUs processed for compressed  HCC scan cell CUs processed for compressed
 cell CUs processed for uncompressed  HCC scan cell CUs processed for uncompressed
 cell CUs sent compressed  HCC scan cell CUs sent compressed
 cell CUs sent head piece  HCC scan cell CUs sent head piece
 cell CUs sent uncompressed  HCC scan cell CUs sent uncompressed
 EHCC DML CUs Decompressed  HCC DML CUs
 EHCC Conventional DMLs  HCC DML conventional
 EHCC CUs all rows pass minmax  HCC scan rdbms CUs optimized read  HCC scan cell CUs optimized read
 EHCC CUs no rows pass minmax  HCC scan rdbms CUs pruned  HCC scan cell CUs pruned
 EHCC CUs some rows pass minmax <removed> <removed>
 EHCC Columns Decompressed  HCC scan rdbms CUs columns accessed  HCC scan cell CUs columns accessed
 EHCC Compressed Length Decompressed  HCC scan rdbms bytes compressed  HCC scan cell bytes compressed
 EHCC Decompressed Length Decompressed  HCC scan rdbms bytes decompressed  HCC scan cell bytes decompressed
 EHCC Pieces Buffered for Decompression  HCC scan rdbms CUs row pieces accessed  HCC scan cell CUs row pieces accessed
 EHCC Preds all rows pass minmax  HCC scan rdbms CUs predicates received  HCC scan cell CUs predicates received
 EHCC Preds no rows pass minmax  HCC scan rdbms CUs predicates applied  HCC scan cell CUs predicates applied
 EHCC Preds some rows pass minmax  HCC scan rdbms CUs predicates optimized  HCC scan cell CUs predicates optimized
 EHCC Total Columns for Decompression  HCC scan rdbms columns theoretical max  HCC scan cell columns theoretical max
 EHCC Total Pieces for Decompression  HCC scan rdbms CUs row pieces total  HCC scan cell CUs row pieces total
 EHCC Total Rows for Decompression  HCC scan rdbms rows  HCC scan cell rows
 EHCC Used on ZFS Tablespace  HCC usage ZFS
 EHCC Used on Pillar Tablespace  HCC usage pillar
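One benefit of the new naming convention is that a single wildcard now pulls back a whole family of related stats. For example, to see the cell-side scan stats for the current session (a sketch using the standard v$statname/v$mystat join):

```sql
select n.name, s.value
from   v$statname n, v$mystat s
where  n.statistic# = s.statistic#
and    n.name like 'HCC scan cell%'
order by n.name;
```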

In follow-up entries I'll be looking at how the new-in-12.2 Columnar Cache works, what the new Columnar Cache stats mean, and how Set membership in Storage Index works.

Tuesday Nov 24, 2015

Offloading row level security to Smart Scan

The question came up this week about whether the predicates for row level security are offloaded to Smart Scan. The simple answer is yes, as long as the policy function contains off-loadable predicates.

Let's see this in action. Using the public TPC-H schema we can set up the customer table so that customers can only see their own entries in that table. We need a row level security policy to see if each row is one that the current user is permitted to see and a function for the policy to call to make that determination:

grant dba to Customer#000000042 identified by hello;
grant select on customer to public;

create or replace function tpch.trls_c_name(
  p_schema in varchar2,
  p_object in varchar2)
    return varchar2
as begin
  return 'sys_context(''userenv'',''session_user'') in (upper(c_name), ''TPCH'')';
end;
/

begin
  dbms_rls.add_policy(
    object_schema   => 'TPCH',
    object_name     => 'CUSTOMER',
    policy_name     => 'P1',
    policy_function => 'trls_c_name');
end;
/

This defines the row level security policy P1 on the customer table: the trls_c_name function returns a predicate that is applied to every row. The predicate lets a row through if the current user is TPCH, who owns the data, or if the session_user matches the uppercased c_name column, so each customer can only see their own rows.

Now we can connect as the customer and try it:

connect Customer#000000042/hello

select * from tpch.customer where c_acctbal > 0;

C_NAME                    C_NATIONKEY
------------------------- -----------
Customer#000000042                  5

So the customer is now restricted to seeing their own entries. Let's look at the plan:

explain plan for select * from tpch.customer where c_acctbal > 0;
select * from table(dbms_xplan.display(format=>'all'));

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$F5BB74E1
   2 - SEL$F5BB74E1 / CUSTOMER@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("C_ACCTBAL">0 AND (SYS_CONTEXT('userenv','session_user')='TPCH'
       OR UPPER("C_NAME")=SYS_CONTEXT('userenv','session_user')))
       filter("C_ACCTBAL">0 AND (SYS_CONTEXT('userenv','session_user')='TPCH'
       OR UPPER("C_NAME")=SYS_CONTEXT('userenv','session_user')))

Column Projection Information (identified by operation id):
-----------------------------------------------------------

This shows that sys_context is off-loadable in this usage. For the evaluation of sys_context to work on the cell, the metadata sent to Smart Scan encapsulates the relevant contents of userenv so it is available to be checked there.

If you think this is the cause of a Smart Scan problem you are encountering, you can use the hidden parameter "_cell_offload_sys_context" to disable it and see if that resolves the issue. 
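For example, at the session level (as with any hidden parameter, only use this for diagnosis and ideally at the direction of Oracle Support):

```sql
alter session set "_cell_offload_sys_context" = false;
```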

Tuesday Aug 11, 2015

Tracing Hybrid Columnar Compression in an offload server

I had previously commented on how to use the FPLIB facility in a trace event, but the question came up today of how to trace HCC in an offload server. The facility name in this case is ADVCMP (Advanced Compression) and the hierarchy is:




No compression occurs on the cell so we are only interested in the decompression tracing in an offload server.

So in this case the correct syntax is [facility.event] so: 

cellcli -e 'alter cell offloadgroupEvents = "trace[advcmp.advcmp_decomp.*] disk=lowest"'

Monday Jun 15, 2015

Why you don't want to set _partition_large_extents = FALSE

I've seen some blogs recommending that _partition_large_extents be set to FALSE for a variety of space-conserving reasons without the authors thinking about the negative impact this is going to have on Smart Scan. Large Extents cause an INITIAL allocation of 8 MB and a NEXT allocation of 1 MB, and they have long been the default for tablespaces on Exadata. You can verify that large extents are in use by a given table or partition by:

select segment_flags from sys_dba_segs where segment_name = '<table_name>' and owner = '<schema_name>';

The segment flag bit for large extents is 0x40000000.
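As a sketch, you can test that bit directly with bitand (the hex-to-decimal conversion is shown inline; TPCH is the example schema used elsewhere in these posts):

```sql
select segment_name,
       case when bitand(segment_flags, to_number('40000000', 'XXXXXXXX')) <> 0
            then 'LARGE EXTENTS'
            else 'SMALL EXTENTS'
       end as extent_style
from   sys_dba_segs
where  owner = 'TPCH';
```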

This pattern of allocation is designed to work optimally with Smart Scan because Smart Scan intrinsically works in 1 MB chunks. Reads of ASM allocation units are split into maximum 1 MB chunks to be passed to the filter processing library to have their blocks sliced and diced to create the synthetic blocks that contain only the rows and columns of interest to return to the table scan driver. When less than 1 MB gets allocated at a time to a segment and the next contiguous blocks get allocated to a different segment, each separate run of blocks will be read by a different multi-block read (MBR). Each run will be passed separately to Smart Scan and we get sub-optimal chunks to work on, increasing both the overhead of processing and the number of round trips needed to process the table scan. The design of Smart Scan is predicated on scooping up contiguous runs of data from disk for efficient processing.

This matters particularly for HCC data and for chained rows.

Hybrid Columnar Compression 

We made two improvements to HCC row-based processing in 12c. The first was the introduction of an Order By clause on index-access plans that guaranteed any compression unit would only be decompressed once. The second was 256 KB formatting of extents. Prior to 12c, a Fetch By Rowid would read the first block into the buffer cache in order to get the list of blocks in that Compression Unit and then read the remaining blocks into the buffer cache using the exact read-ahead count. This guaranteed a minimum of two physical reads to retrieve one row. By formatting 256 KB at a time to the segment, we know a priori that we can issue the first read for all the blocks from the RDBA to the next 256 KB alignment, thus the majority of Compress For Query reads can be done with a single physical read.

However this formatting can only happen when large extents are used. You can verify that 256 KB formatting is being used by:

select flag from gv$filespace_usage where tablespace_id = <ts_number> and rfno = <relative_fno>;

The flag bit for 256 KB formatting is 0x08. Note: 256 KB formatting requires a minimum 8 KB block size.
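As a sketch, testing that bit for one datafile might look like this (the tablespace and relative file numbers are illustrative placeholders):

```sql
select tablespace_id, rfno,
       decode(bitand(flag, 8), 8, 'YES', 'NO') as formatted_256k
from   gv$filespace_usage
where  tablespace_id = 7   -- assumed tablespace number
and    rfno = 4;           -- assumed relative file number
```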

Chained Rows

When Smart Scan processes chained rows all it can see is the 1 MB chunk it is given. It has no ability to initiate a disk I/O to retrieve chained row pieces that lie outside that 1 MB chunk (and even if it could, the chances that the required row piece is on the same cell are relatively low). We changed the direct load path to work with the natural 1 MB alignment of Smart Scan so that direct loaded rows would never cross a 1 MB boundary. However, as of this writing, we have not yet changed Updates to keep all the row pieces inside a 1 MB chunk, nor have we taught Alter Table Shrink Space Compact to obey the 1 MB alignment rule.

Keeping 1 MB chunks contiguous for the same segment maximizes the chances that, after non-bulk inserts and updates, Smart Scan is able to see all the row pieces needed to satisfy the column references and does not have to send the block back to the RDBMS for buffer cache chained row processing. Using old-style 64 KB allocations maximizes the chances Smart Scan will not be able to process those blocks. Just saying ...


Getting Exadata for its table scan performance but then setting "_partition_large_extents" = FALSE is like buying a BMW M5 and insisting on filling it with 87 octane. The correct answer to conserving space on small tables is to override the default by explicitly setting INITIAL and NEXT in the Create Table statement.
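For example, a small dimension table can keep small extents explicitly rather than by flipping the underscore parameter (table name and sizes here are illustrative):

```sql
create table small_dim (
  code        number,
  description varchar2(40)
) storage (initial 64K next 64K);
```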

Roger MacNicol

Oracle Data Storage Technology 



