
Everything about Table Scans including offload to Exadata, Big Data SQL, and External Tables

Recent Posts

Understanding External Table URowids

I finally found time to get back to External Tables and have a list of blog posts I need to write on this topic. Here's a brief one. DBMS_ROWID will nicely break down a heap table's rowid into file number, block number, and row number for you, but it doesn't handle the rowids coming from External Tables. So let's look at how to make sense of them. They fall under the datatype UROWID, which is a nominally opaque rowid defined by the data source. The first byte of a UROWID tells you which data source it came from and consequently how to deconstruct it. The easiest way to see what is happening is via the SQL Dump function:

SQL> column xtrowid format a55
SQL> select c_custkey, dump(rowid,16) "XTROWID" from c_et
  2    where c_custkey < 10;

 C_CUSTKEY XTROWID
---------- -------------------------------------------------------
         1 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,1
         2 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,2
         3 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,3
         4 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,4
         5 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,5
         6 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,6
         7 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,7
         8 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,8
         9 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,0,9

9 rows selected.

Datatype 208 is 'internal universal rowid'. The first byte of a UROWID tells you what kind of UROWID it is:

- physical rowid
- logical rowid
- remote rowid
- external table rowid
- cube rowid

This blog post is only about External Table UROWIDs. The next four bytes give you the object number; here 0x127a2 is 75682 and we can confirm this simply:

SQL> select obj# from sys.obj$ where name = 'C_ET';

      OBJ#
----------
     75682

1 row selected.

The next four bytes are the sequence number, and the last eight bytes are the row number within the sequence, as you can see from the monotonically increasing integers. You can think of the sequence number as granules; here we go further into the table just to show the sequence increasing:

SQL> select c_custkey, dump(rowid,16) "XTROWID" from c_et
  2    where mod(c_custkey, 100) = 0
  3    and c_comment like '%slyly regular request%';

 C_CUSTKEY XTROWID
---------- -------------------------------------------------------
      9100 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,23,8c
     33000 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,80,e8
     59100 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,0,0,0,0,0,0,0,e6,dc
     71400 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,1,0,0,0,0,0,0,1f,d5
     88700 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,1,0,0,0,0,0,0,63,69
    108200 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,1,0,0,0,0,0,0,af,95
    113000 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,1,0,0,0,0,0,0,c2,55
    131000 Typ=208 Len=17: 4,0,1,27,a2,0,0,0,1,0,0,0,0,0,1,8,a5

8 rows selected.

Hope this helps, Roger
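
As a postscript, here is a small sketch that turns the byte layout described above into a number directly from SQL. It simply parses the DUMP() string shown above (bytes 2-5 are the object number), so the lpad/regexp juggling is just string handling against the same c_et example table; treat it as an illustration rather than a supported API.

-- decode the data object number from an external table UROWID (bytes 2-5 of the dump)
select c_custkey,
       to_number(lpad(regexp_substr(dump(rowid,16), '[^,]+', 1, 2), 2, '0') ||
                 lpad(regexp_substr(dump(rowid,16), '[^,]+', 1, 3), 2, '0') ||
                 lpad(regexp_substr(dump(rowid,16), '[^,]+', 1, 4), 2, '0') ||
                 lpad(regexp_substr(dump(rowid,16), '[^,]+', 1, 5), 2, '0'),
                 'xxxxxxxx') as data_obj#
from   c_et
where  c_custkey < 10;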

How to tell if the Exadata column cache is fully loaded

When a background activity is happening on the cell you typically can't use the RDBMS v$ views to monitor it in the same way. One such question is how to tell if a segment is fully loaded in the Exadata column cache, since this does not appear in the equivalent In-Memory v$ views. When a segment is scanned by Smart Scan sufficiently often to be eligible for the AUTOKEEP pool (typically that means at least twice an hour), the eligible 1 MB chunks are written to flash in 12.1.0.2 style format and put on a background queue. Lower priority tasks pick up the queued 1 MB 12.1.0.2 format chunks from the flash cache, run them through the In-Memory loader, and rewrite the pure columnar representation in place of the old 12.1.0.2 style column cache chunks. The easiest way that I know of to tell when this completes is to monitor that background activity with the following query until it shows zero:

select name, sum(value) value from (
      select extractvalue(value(t),'/stat/@name') name,
             extractvalue(value(t),'/stat') value
      from v$cell_state cs,
           table(xmlsequence(extract(xmltype(cs.statistics_value),
                                     '//stats[@type="columnarcache"]/stat'))) t
      where statistics_type='CELL')
where name in ('outstanding_imcpop_requests')
group by name;
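
If you would rather block until the load has drained than re-run the query by hand, a minimal polling sketch along these lines works. It only reuses the query above; it assumes you have access to V$CELL_STATE and execute on DBMS_LOCK (on newer releases DBMS_SESSION.SLEEP can be used instead).

declare
  l_outstanding number;
begin
  loop
    -- same XML extraction as the query above, collapsed to a single count
    select nvl(sum(to_number(extractvalue(value(t),'/stat'))), 0)
      into l_outstanding
      from v$cell_state cs,
           table(xmlsequence(extract(xmltype(cs.statistics_value),
                                     '//stats[@type="columnarcache"]/stat'))) t
     where cs.statistics_type = 'CELL'
       and extractvalue(value(t),'/stat/@name') = 'outstanding_imcpop_requests';
    exit when l_outstanding = 0;
    dbms_lock.sleep(10);   -- check every 10 seconds
  end loop;
end;
/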

Revisiting buffer cache Very Large Object Threshold (VLOT)

If you turn on NSMTIO tracing you will see references to VLOT:

qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches (local/remote) and checking storage reduction factors (OLTP/EHCC Comp)

I had said you could ignore VLOT, and Frits Hoogland pointed out that tracing showed it had some impact, so let me clarify. VLOT is the absolute upper bound at which cached reads can even be considered. It defaults to 500% of the number of buffers in the cache, i.e. _very_large_object_threshold = 500. While this number is not used in any calculations, it is used in two places as a cutoff to decide whether to do those calculations at all:

1) Can we consider using Automatic Big Table Caching (a.k.a. DWSCAN) for this object?
2) Should we do a cost analysis for Buffer Cache scan vs Direct Read scan on tables larger than the MTT?

The logic for tables above the calculated medium table threshold (MTT) that are NOT part of searched DMLs and are NOT on Exadata with the statistics based storage reduction factor enabled (_statistics_based_srf_enabled) is:

If _serial_direct_read == ALWAYS, use Direct Read.
If _serial_direct_read == NEVER, use Buffer Cache.
If _serial_direct_read == AUTO and #blocks in table < VLOT, use the cost model.
Else use Direct Read: "qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]"

In practice 5X the buffer cache is so large that the cost based decision will come to the same conclusion anyway - the default VLOT simply saves the time spent doing the analysis. For example, I got a quick count of the number of blocks in the non-partitioned TPC-H Scale 1 lineitem table:

select segment_name, sum(blocks), sum(bytes)
from user_extents
where segment_name = 'LINEITEM';

and created my buffer cache to be exactly the same size. With this setup, setting _very_large_object_threshold=100 bypassed the cost model and went straight to a Direct Read scan, while setting it to 200 forced the use of the cost model.

The moral of this is that the default value of VLOT rarely changes the decisions made. Only if you reduce VLOT to a much smaller multiplier of the cache size will you start to see a few more of your larger buffer cache scans move to direct read because they are no longer eligible for cost analysis. If you wish to stop some of the largest buffer cache scans from happening you would need to set _very_large_object_threshold to less than 200.
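
As a footnote, here is a rough way to put a specific segment next to the default cutoff. It is only a sketch: it assumes the DEFAULT buffer pool, the default setting of 500%, and uses the BUFFERS column of V$BUFFER_POOL as the buffer count.

-- compare a segment's block count against the default VLOT cutoff (5X the buffer count)
select s.segment_name,
       s.blocks              as segment_blocks,
       5 * p.buffers         as vlot_cutoff_blocks,
       case when s.blocks >= 5 * p.buffers
            then 'Direct Read without cost analysis'
            else 'eligible for the cost model (when _serial_direct_read = AUTO)'
       end                   as note
from   user_segments s,
       v$buffer_pool  p
where  s.segment_name = 'LINEITEM'
and    p.name = 'DEFAULT';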

Random thoughts on block sizes

I heard "Oracle only tests on 8k and doesn't really test 16k" I heard someone assert that one reason you should only use 8k block sizes is that, and I quote, "Oracle only tests on 8k and doesn't really test 16k". I tried googling that rumour and tracked it back to AskTom. Also as disks and memory get bigger and CPUs get faster it is natural to ask if 8k is now "too small". So here are my thoughts: 1. A quick scan of the data layer regression tests showed a very large number running on 16k blocks 2. Oracle typically runs it DW stress tests on 16k blocks So, clearly, the assertion is untrue but I did spot some areas where 32k testing could be improved. I also added a note to AskTom clarifying testing on 16k. Does this mean I should look at using 16k blocks for me DW Whoa, not so fast. Just because table scans typically run slightly faster on 16k blocks and compression algorithms typically get slightly better compression on 16k blocks does not mean your application will see improvements 1. Multiple block sizes mean added work for the DBA 2. Databases do a lot more than just scan tables - e.g. row based writes and reads could perform worse 3. Most apps have low hanging fruit that will give you far better ROI on your time than worrying about block sizes (if you don't believe me, attend one of Jonathan Lewis's excellent index talks). Most applications run fine on 8k because it is a good trade off between different access paths and, in general, 8k is still the right choice for most applications.  What about 'N' row pieces In general Oracle's block layout tries to ensure row pieces are split on column boundaries but in the case of very wide columns we will split in the middle of a column if too much space would be wasted by aligning with column boundaries. When a column is split in the middle it creates what is known as an 'N' row piece. Rows are split by default at 255 column boundaries assuming the row piece fits in the block If you have a table with very wide rows or some very wide inline columns, smaller block sizes will result both in rows being split more often and in columns being split in the middle. At a minimum the number of block gets to retrieve a single row will likely increase. This is one case to consider where a 16k block size may be worth investigating.  The curious case of 70X compression We had case of a customer legitimately point out that ZLIB would give much better than 70X compression on his very repetitive data but COMPRESS FOR QUERY HIGH was stuck at 70X. And, as it turns out, this is a factor of block size. Hybrid Columnar Compression (HCC) takes up to 8 MB of raw data and compresses it down to one CU (Compression Unit) subject to row limits depending on compression type. For example COMPRESS FOR ARCHIVE LOW (which is also ZLIB) is limited to 8,000 rows in a CU. But there is another limit which you may deduce from block dumps which is that HCC will only allow one head piece row per block because HCC rowids are the RDBA of the head piece plus the row offset into the CU. So if you have incredibly compressible data where 8,000 rows compress down to less 16k moving to a larger block size can be bad news because you will end up with wasted space at the end of each block. Summary There are typically better things to spend time on than worrying about block size but if you wish to run your DW on 16k blocks they are thoroughly tested just like 8k blocks are.

I heard "Oracle only tests on 8k and doesn't really test 16k" I heard someone assert that one reason you should only use 8k block sizes is that, and I quote, "Oracle only tests on 8k and doesn't really...

External Tables Part 1: Project Columns All vs Referenced

I normally blog about table scans on Oracle native data, but Oracle also supports a wide variety of features for scanning external tables and I need to cover these too. One of the ways I learn new things is being sent a bug report and saying to myself "Oh! I didn't know you could do that". So today I'm going to start with the grammar:

Alter Table <xt> Project Column [ All | Referenced ]

This DDL changes the value in the Property column displayed in user_external_tables:

SQL> select property
  2    from user_external_tables
  3    where table_name='C_ET';

PROPERTY
----------
ALL

Here we have an external table defined using the legacy driver ORACLE_LOADER. This driver defaults to projecting all the columns in the base table rather than just those needed to satisfy the query (i.e. the referenced columns), and to discarding rows that have data conversion issues up to the reject limit. So, for example, if we have a DATE column in our external table that contains dirty data that won't convert cleanly to Oracle internal dates using the supplied mask, we can either import it as a VARCHAR2 so as not to lose values, or import it as a date but lose rows even on queries that don't need that date column.

We can change the behaviour to only project the referenced columns:

SQL> alter table c_et project column referenced;

Table altered.

SQL> select property
  2    from user_external_tables
  3    where table_name='C_ET';

PROPERTY
----------
REFERENCED

The driver will now ignore unreferenced columns: if the date column is not used we will get all the rows in the external data and the query will run faster, since datatype conversion is expensive, but we may get an inconsistent number of rows depending on which columns are used in any given query. This is OK if the DBA knows a priori that there are no conversion errors, or if they are willing to live with inconsistency in the number of rows returned.

The big data drivers such as ORACLE_HIVE have a different default behaviour, which is to only project referenced columns and to replace data with conversion errors with NULL values, i.e. they default to returning a consistent number of rows with best performance.

Take away: in order to get consistency and fast table scan performance with ORACLE_LOADER, the trick is to define the external table with the 'convert_error store_null' option and switch to 'Project Column Referenced'. For example:

CREATE TABLE "T_XT" (
  c0 varchar(10),
  c1 varchar(10)
)
ORGANIZATION external (
  TYPE oracle_loader
  DEFAULT DIRECTORY DMPDIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    CHARACTERSET al32utf8
    ESCAPE
    FIELDS CSV WITH EMBEDDED terminated by ',' enclosed by '|'
    REJECT ROWS WITH ALL NULL FIELDS
    nullif equal blanks
    truncate_columns
    convert_error store_null
    (
      "C0" CHAR(10),
      "C1" CHAR(10)
    )
  )
  location
  (
    'simple.dat'
  )
) REJECT LIMIT UNLIMITED parallel 4;

If you want more information on handling conversion errors when scanning an external table, please check the Utilities Guide section on each driver type.
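
To make the inconsistency concrete, here is a hypothetical illustration: the table and column names (orders_xt, o_orderdate) are invented, and it assumes PROJECT COLUMN REFERENCED without the convert_error store_null option. The answer simply depends on whether the dirty DATE column is referenced.

-- with PROJECT COLUMN REFERENCED and a dirty DATE column, each query can see a
-- different number of rows depending on the columns it touches
select count(*)           from orders_xt;  -- DATE column not referenced: no conversion attempted, all rows returned
select count(o_orderdate) from orders_xt;  -- DATE column referenced: rows that fail conversion are rejected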

The beginners guide to Oracle Table Scans

I was asked a question yesterday that reminded me there are always people completely new to the topic who need an introduction - somewhere to start before the other articles make sense. So, here's my brief write-up of everything you need to know about the basics of Oracle Table Scans.

Oracle has four main ways of scanning a table: the pre-9ir2 table scan, the 9ir2 TurboScan, the 11.1.0.1 Exadata SmartScan, and the 12.1.0.2 In-Memory Scan. Before we summarize each one, the other fundamental piece of information is the Oracle dictum that all blocks must be self-describing: a table scan routine should be able to inspect a block and understand what object it belongs to, whether it needs undo applied, and how the data is laid out, without reference to any external structures or secondary storage.

The original table scan routine

Oracle uses a "dataflow" query engine, which means a query plan is built from nodes, like a sausage machine, that have three basic operations: Open, Next, Close. 'Open' means you ask the next node in the chain to prepare to do some work, including acquiring any resources it may need; 'Next' means you fetch one unit of work from your child, e.g. a row; and 'Close' means telling your child node to shut down and release any resources it may be holding. You build a query by connecting the right kinds of nodes together in the order you want: one node just sorts, another groups, another does hash joins. The end of the sausage machine is the node seen on query plans as "Table Access Full". This node would ask the data layer to fetch a block from disk and then get rows one at a time from the data layer. This is the workhorse table scan: it can scan any kind of data and do SCN manipulations like row versions, but it is not the fastest way to scan a table.

9ir2 TurboScan

In 9ir2 we introduced a much faster way of scanning tables called TurboScan. The data layer function which had been handing out rows one at a time was replaced by one that stays in a tight loop retrieving rows from disk and pushing them into a callback supplied by "Table Access Full". An automation tool was used to generate several versions of this routine that optimized out common choices that have to be made: does the user need rowids to be projected? do they need predicates applied? is the data compressed or not? is the data column-major or row-major? etc. Every time a CPU reaches a branch in the code it tries to guess which side of the branch will be taken, but if it guesses wrong there can be a considerable stall during which no work gets done. By removing most of the branches, the code runs much more quickly.

TurboScan is used for all queries which do not use RAW datatypes and which do not need special SCN processing. Both the pre-9ir2 scan and TurboScan can use the buffer cache to get blocks (typically small to medium tables) or use Direct Read to get blocks (typically medium to large tables). See: https://blogs.oracle.com/smartscan-deep-dive/when-bloggers-get-it-wrong-part-1

TurboScan can be disabled for triage purposes by setting:

SQL> alter session set events='12099 trace name context forever, level 1';

or, specifically, you can disable it only for HCC tables by setting:

SQL> alter session set "_arch_comp_dbg_scan"=1;

Exadata SmartScan

In 11.1.0.1 we introduced Exadata SmartScan intelligent storage.
This is where a thin layer of database processing is embedded in the storage cells: the table scan routine offloads simple search criteria and a list of the columns it needs to storage, and the storage cells pre-process the blocks to remove rows that fail the search criteria and remove columns which are not needed by the table scan. If all the rows are removed, the block doesn't have to be sent back at all.

SmartScan can drastically reduce the amount of data returned on the interconnect and put on the RDBMS memory bus, and the space used in the SGA by the returned data. An additional significant benefit is gained when the CPU fetches the reduced blocks into the CPU cache: since only relevant information exists in the block, no space is wasted by unwanted columns interspersed with the wanted columns, meaning more relevant data can fit in memory and the CPU prefetch can do a better job of predicting which memory cache line to fetch next. Only TurboScan Direct Read scans can use this offload capability.

You can disable SmartScan for triage purposes by setting:

SQL> alter session set cell_offload_processing=FALSE;

or

SQL> select /*+ opt_param('cell_offload_processing','false') */ <col> from <tab> where <predicate>;

In-Memory Scans

In-Memory scans were introduced in 12.1.0.2 and brought a revolutionary increase in table scan speeds. With In-Memory scans the table or partition is loaded into an in-memory tablespace in the SGA known as the In-Memory area. Data is stored in compressed columnar format, typically up to 500,000 values in each columnar compression unit. This tablespace is kept transactionally consistent with the data on disk by means of an invalidation bitmap.

Just like with SmartScan, only TurboScan can use In-Memory scans with In-Memory objects. Instead of getting a block from disk, the specialized version of the scan routines fetches a column run from each column of interest, processes the search criteria, then returns column runs with the failing rows removed to the "Table Access Full" node. If any rows have been modified and committed by other users, or the user's own transaction has modified any rows, the scan will see these rows set in the invalidation bitmap. These rows are removed from the columnar results and the additional rows required are fetched from the buffer cache before moving on to the next set of column runs. This works well because the most recently modified blocks are the ones most likely to still be in the buffer cache.
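
As a postscript, one way to see which of these scan flavors your session actually used is to look at the session statistics before and after running a query. A quick sketch (statistic names as they appear in 12c; 'cell scans' only increments on Exadata when Smart Scan was used):

-- which scan paths has my session used so far?
select n.name, s.value
from   v$mystat s, v$statname n
where  s.statistic# = n.statistic#
and    n.name in ('table scans (short tables)',
                  'table scans (long tables)',
                  'table scans (direct read)',
                  'table scans (IM)',
                  'cell scans');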

What's new in 12.2: CELLMEMORY Part 3

The Cellmemory Stats in RDBMS

The RDBMS stats for Cellmemory are designed to closely follow the pattern used by the In-Memory stats.

Query Stats

Each column in each one MB of disk blocks will be rewritten into one IMC format Column CU in flash, and a set of Column CUs comprise an overall Compression Unit, so these stats reflect the number of 1 MB rewrites that were processed (not the number of column CUs):

"cellmemory IM scan CUs processed for query" - #1 MB chunks scanned in MEMCOMPRESS FOR QUERY format
"cellmemory IM scan CUs processed for capacity" - #1 MB chunks scanned in MEMCOMPRESS FOR CAPACITY format
"cellmemory IM scan CUs processed no memcompress" - #1 MB chunks scanned in NO CELLMEMORY format (12.1.0.2 format)

Load Stats

"cellmemory IM load CUs for query" - #1 MB chunks successfully rewritten from 12.1.0.2 to MEMCOMPRESS FOR QUERY format
"cellmemory IM load CUs for capacity" - #1 MB chunks successfully rewritten from 12.1.0.2 to MEMCOMPRESS FOR CAPACITY format
"cellmemory IM load CUs no memcompress" - #1 MB chunks successfully rewritten into 12.1.0.2 format

Before a rewrite happens a routine is called that looks through the blocks in the 1 MB chunk and determines if it is eligible for rewrite. Reasons it may not be include transactional metadata from the commit cache, the presence of block formats that can't be rewritten (although this list is getting smaller with each rpm), and the amount of space the rewrite will take up. A rewrite into 12.1.0.2 format must fit in the original 1 MB of flash cache. An IMC format rewrite is not permitted to exceed 8 MB. This limit is highly unlikely to be reached by MEMCOMPRESS FOR CAPACITY, but could be reached when trying to rewrite HCC blocks with much greater than 8X original compression into MEMCOMPRESS FOR QUERY format. This is one reason that the default is FOR CAPACITY.

"cellmemory IM scan CUs rejected for query" - #1 MB chunks that could not be rewritten into MEMCOMPRESS FOR QUERY for whatever reason
"cellmemory IM scan CUs rejected for capacity" - #1 MB chunks that could not be rewritten into MEMCOMPRESS FOR CAPACITY for whatever reason
"cellmemory IM scan CUs rejected no memcompress" - #1 MB chunks that could not even be rewritten into 12.1.0.2 format for whatever reason
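
A postscript for the curious: all of the statistics above are visible in the usual RDBMS stats views, so a quick sketch like this shows which rewrites and scans have happened since instance startup (swap v$sysstat for v$mystat to scope it to your session):

-- which Cellmemory rewrite/scan stats have been bumped so far?
select name, value
from   v$sysstat
where  name like 'cellmemory IM%'
and    value > 0
order  by name;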

What's new in 12.2: CELLMEMORY Part 2

Question: do I need to know anything in this blog post?
Answer: No, it is a true cache and CELLMEMORY works automatically.

Question: so why should I read this blog post?
Answer: because you like to keep a toolkit of useful ways to control the system when needed.

The DDL

The Exadata engineering team has done a lot of work to make the flash cache automatically handle a variety of very different workloads happening simultaneously, which is why we now typically discourage users from specifying:

SQL> Alter Table T storage (cell_flash_cache keep)

and instead trust the AUTOKEEP pool to recognize table scans that would most benefit from caching and to cache them in a thrash-resistant way. Our Best Practice for CELLMEMORY is typically not to use the DDL. That said, there will be cases when the DBA wishes to override the system's default behavior, and we will look at a couple of those reasons. But first here's the DDL, which is a very cut down portion of the INMEMORY syntax:

Alter Table T [ [ NO ] CELLMEMORY [ MEMCOMPRESS FOR [ QUERY | CAPACITY ] [ LOW | HIGH ] ] ]

So let's break this down piece by piece.

Examples

SQL> Alter Table T NO CELLMEMORY

The NO CELLMEMORY clause prevents a table being eligible for the rewrite from 12.1.0.2 format to 12.2 In-Memory format. There are a variety of reasons you may wish to do this and we'll look at those at the end of this post.

SQL> Alter Table T CELLMEMORY
SQL> Alter Table T CELLMEMORY MEMCOMPRESS FOR CAPACITY

These allow a table to be cached in the default 12.2 In-Memory format (you'd only ever need this to undo a NO CELLMEMORY you had done earlier, or to revert a change in the compression level used).

SQL> Alter Table T CELLMEMORY MEMCOMPRESS FOR QUERY

I mentioned above that CELLMEMORY, by default, undergoes two rounds of compression: first a round of semantic compression and secondly a round of bitwise compression using LZO. This is the default to try and keep CELLMEMORY's flash cache footprint as close as we can to HCC flash space usage, while still running queries much faster than HCC. But, even though LZO has a relatively low decompression cost, it is not zero. There are some workloads which run faster with MEMCOMPRESS FOR QUERY, but they also typically use twice as much flash space. It would not be appropriate for the system to automatically start using twice as much flash, but if you wish to experiment with this, the option is there. Also, compressing with LZO takes CPU time which is not needed with MEMCOMPRESS FOR QUERY.

What about [ LOW | HIGH ]?

Currently, unlike INMEMORY, these are throw-away words, but we retained them in the grammar for future expansion and because people are used to specifying them in a MEMCOMPRESS clause. In effect this means:

CELLMEMORY's MEMCOMPRESS FOR QUERY is roughly equivalent to INMEMORY's MEMCOMPRESS FOR QUERY HIGH
CELLMEMORY's MEMCOMPRESS FOR CAPACITY is roughly equivalent to INMEMORY's MEMCOMPRESS FOR CAPACITY LOW

Why might I want to overrule the default?

There are a few reasons:

Turning CELLMEMORY off may be useful if you know an HCC table will only be queried a few times and then dropped or truncated soon, and it is not worth the CPU time to create the In-Memory formats
Turning CELLMEMORY off may be useful if your cells are already under a lot of CPU pressure and you wish to avoid the CPU cost of creating the In-Memory formats
Switching to MEMCOMPRESS FOR QUERY may be useful to get better query performance on some workloads (YMMV) and reduce the CPU cost of creating In-Memory formats

What about INMEMORY with CELLMEMORY?
We allow both INMEMORY and CELLMEMORY on the same table:

SQL> Create Table T (c1 number) INMEMORY NO CELLMEMORY;
SQL> Create Table T (c1 number) INMEMORY CELLMEMORY MEMCOMPRESS FOR QUERY;

Why would you want both INMEMORY and CELLMEMORY? DBIM implements a priority system where the In-Memory area is first loaded with the critical tables and then finally down to the PRIORITY LOW tables. If you have a PRIORITY LOW table that is unlikely to get loaded in memory, it is fine to also specify CELLMEMORY so as to still get columnar performance.

Note: an HCC encoded INMEMORY table will still get automatic CELLMEMORY if you don't use any DDL at all.

Roger
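
As a postscript, a quick way to find the PRIORITY LOW candidates mentioned above is to check the In-Memory columns in the dictionary (these columns exist from 12.1.0.2 onwards). The sketch below lists In-Memory tables whose low priority means they may rarely be populated and so might also benefit from CELLMEMORY:

-- In-Memory tables least likely to be populated in the In-Memory area
select table_name, inmemory, inmemory_priority, inmemory_compression
from   user_tables
where  inmemory = 'ENABLED'
and    inmemory_priority = 'LOW';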

What's new in 12.2: CELLMEMORY Part 1

Many people know that in 12.1.0.2 we introduced a ground-breaking columnar cache that rewrote 1 MB chunks of HCC format blocks in the flash cache into pure columnar form, in a way that allowed us to only do the I/O for the columns needed but also to recreate the original block when that was required. This showed up in stats as "cell physical IO bytes saved by columnar cache".

But in 12.1.0.2 we had also introduced Database In-Memory (or DBIM), which rewrote heap blocks into pure columnar form in memory. That project introduced:

- new columnar formats optimized for query performance
- a new way of compiling predicates that supported better columnar execution
- the ability to run predicates against columns using SIMD instructions which could execute the predicate against multiple rows simultaneously

so it made perfect sense to rework the columnar cache in 12.2 to take advantage of the new In-Memory optimizations.

Quick reminder of DBIM essentials

In 12.2, tables have to be marked manually for DBIM using the INMEMORY keyword:

SQL> Alter Table <tab> INMEMORY

When a table tagged as INMEMORY is queried, the background process is notified to start loading it into the In-Memory area. This design was adopted so that the user's scans are not impeded by loading. Subsequent scans that come along will check what percentage is currently loaded in memory and make a rule based decision:

For Exadata
- Greater than 80% populated, use In-Memory and the Buffer Cache for the rest
- Less than 80% populated, use Smart Scan
- The 80% cutoff between BC and DR is configurable with an undocumented underscore parameter
- Note: if an In-Memory scan is selected even for a partially populated table, Smart Scan is not used

For non-Exadata
- Greater than 80% populated, use In-Memory and the Buffer Cache for the rest
- Less than 80% populated, use In-Memory and Direct Read for the rest (note: this requires the segment to be checkpointed first)
- The 80% cutoff between BC and DR is configurable with an undocumented underscore parameter

While DBIM can provide dramatic performance improvements, it is limited by the amount of usable SGA on the system that can be set aside for the In-Memory area. After that, performance becomes that of disk access to heap blocks from the flash cache or disk groups. What was needed was a way to increase the In-Memory area so that cooler segments could still benefit from the In-Memory formats without using valuable RAM, which is often a highly constrained resource.

Cellmemory

Cellmemory works in a similar way to the 12.1.0.2 columnar cache in that 1 MB chunks of HCC formatted blocks are cached in columnar form automatically, without the DBA needing to be involved or do anything. This means blocks are cached after Smart Scan has processed them rather than before, as happens with ineligible scans.

The 12.1.0.2 columnar cache format simply takes all the column-1 compression units (CUs) and stores them contiguously, then all the column-2 CUs stored contiguously, etc., so that each column can be read directly from the cache without reading unneeded columns. This happens during Smart Scan, so that the reformatted blocks are returned to the cell server along with the query results for that 1 MB chunk.

In 12.2, eligible scans continue to be cached in the 12.1.0.2 columnar cache format after Smart Scan has processed the blocks, so that columnar disk I/O is available immediately. The difference is that if and only if the In-Memory area is in use on the RDBMS node (i.e.
the In-Memory feature is already in use in that database), the address of the beginning of the columnar cache entry is added to a queue so that a background process running at a lower priority can read those cache entries and invoke the DBIM loader to rewrite the cache entry into In-Memory formatted data.

Unlike the columnar cache, which has multiple column-1 CUs, the IMC format creates a single column using the new formats that use semantic compression and support SIMD etc. on the compressed intermediate data. By default a second round of compression using LZO is then applied. When 1 MB of HCC ZLIB compressed blocks is rewritten in this way it typically takes around 1.2 MB (YMMV obviously).

Coming up

Upcoming blog entries will cover:

- Overriding the default behaviour with DDL
- New RDBMS stats for Cellmemory
- New cellsrv stats for Cellmemory
- Flash cache pool changes
- Tracing Cellmemory
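
A small footnote before those follow-ups: the 12.1.0.2-style columnar cache savings mentioned at the top of this post are easy to check for yourself. A sketch:

-- how much physical I/O has the columnar cache saved since instance startup?
select name, value
from   v$sysstat
where  name = 'cell physical IO bytes saved by columnar cache';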

More tricks with OPT_PARAM

Did you know you can set most parameters for the execution of a single statement without using an Alter Session by using an OPT_PARAM hint? For example, regular parameters (here forcing the storage clause in the query plan):

SQL> select /*+ OPT_PARAM('cell_offload_plan_display' 'always') */ col1 from table1;

and underscore parameters:

SQL> select /*+ OPT_PARAM('_arch_comp_dbg_scan',2048) */ n_name from nation;

However, if you try conflicting settings that set a different value in an inner query block, the value you set in the outermost query block will prevail. In this trivial example _arch_comp_dbg_scan=2048 will prevail:

SQL> select /*+ OPT_PARAM('_arch_comp_dbg_scan',2048) */ n_name from nation
     where n_nationkey = (
       select /*+ OPT_PARAM('_arch_comp_dbg_scan',0) */ min(r_regionkey)
       from region where n_name < r_name);

Another point to remember is that not all parameters get their value from the cursor environment. For example, the buffer cache gets the raw value of _serial_direct_read, so it cannot be overruled in this way:

SQL> select /*+ OPT_PARAM('_serial_direct_read', 'always') */ n_name from nation;

will not force a DR scan of the table, but an alter session will. Just one more tool to keep in your toolkit for when you need it.

Roger
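
Postscript: if you want to confirm that the hint actually reached the cursor, a sketch like the one below against V$SQL_OPTIMIZER_ENV shows the optimizer environment a given child cursor was compiled with (replace <sql_id>; note this is only an assumption for hidden parameters, which typically show up here only when changed from their defaults):

-- what parameter values was this cursor compiled with?
select name, value, isdefault
from   v$sql_optimizer_env
where  sql_id = '<sql_id>'
and    name in ('cell_offload_plan_display', '_arch_comp_dbg_scan');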

Improvements to HCC with wide tables in 12.2

HCC Compression Unit Sizing

Since the beginning Oracle has provided four compression levels to offer a trade-off between the compression ratio and various other factors including table scans and the performance of single-row retrieval. I cannot emphasize enough that the various trade-offs mean that YMMV with the different levels and you should always test what works best with your application and hardware when choosing the level. Historically people have rarely used Query Low, since the fast compression with reduced compression ratio means that the extra disk I/O is slower than the cost of decompression with Query High. The one time that Query Low makes sense on spinning disks is if you still have a significant number of row retrieval operations (including from index access joins).

NVMe Flash

X5 introduced NVMe technology, which means that the extra I/O from Query Low is faster than ZLIB decompression, and that makes Query Low beneficial. So we needed to reassess the sizing of Compression Units. From 11.2.0.1 to 12.1.2.4 the sizing guidelines are as follows:

Name          Target Rows   Target Minimum Size  Target Maximum Size  Compression
Query Low     1000 to 8000  32 kb                32 kb                LZO
Query High    1000 to 8000  32 kb                64 kb                ZLIB
Archive Low   8000          64 kb                256 kb               ZLIB
Archive High  8000          256 kb               256 kb               BZ2

So, for example, Compress for Query High aims to pivot around at least 1000 rows and create a minimum compression unit size of 32 kb and a maximum of 64 kb. Using 12.1.2.3 I ran these against a TPC-H Lineitem table that contained between 1 and 6 copies of each column. For Query Low's fixed 32 kb CUs this gave us the following:

Additional copies of lineitem  Rows per 32 kb CU
0                              2797
1                              580
2                              318
3                              216
4                              162
5                              129

and for Query High's 32 to 64 kb CUs this gave us:

Additional copies of lineitem  Rows per CU  CU ave size (kb)
0                              5031         32
1                              1010         32
2                              936          51
3                              794          63
4                              595          67
5                              476          63

So we see that the CU size remains at 32 kb as long as we are getting 1000 rows or more, then increases in size to 64 kb to try and fit in at least 1000 rows. It became clear that this size range was inadequate for wide tables, so to get more efficient compression and longer column runs for faster predicate performance (and also better CELLMEMORY rewrites) we removed the fixed size for Query Low and increased the maximums:

Query Low: 32 kb to 64 kb
Query High: 32 kb to 80 kb

This will not affect narrow tables at all, but wider tables should see better table compression and faster scans at the cost of slightly slower single row retrieval for data loaded by the 12.2 RDBMS. If you have HCC wide tables and typically cache them on flash cache, you should consider re-evaluating Query Low for data loaded in 12.2 (or use Alter Table Move Compress to recompress existing data).

Roger
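
As a postscript, if you do decide to re-evaluate Query Low on 12.2 for an existing wide table, a minimal sketch looks like this. The table name is hypothetical, and ALTER TABLE ... MOVE rewrites the existing rows, so it needs spare space and time proportional to the segment and leaves any indexes needing a rebuild.

-- recompress existing data with the 12.2 CU sizing, then compare the footprint
alter table lineitem_wide move column store compress for query low;

select segment_name, round(bytes/1024/1024) as mb
from   user_segments
where  segment_name = 'LINEITEM_WIDE';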

Why are HCC stats being bumped when Smart Scanning row major data in 12.2?

In 12.2, there is a stat "cell blocks pivoted" that points to a new optimization. When Smart Scan processes data, it has to create a new synthetic block that only contains the columns that are needed and the rows that pass the offloaded predicates. If Smart Scan has to create a new block, why would you create a new row-major block when we can just as easily create an uncompressed HCC block? It is roughly the same amount of work, but once the synthetic block is returned to the RDBMS for processing, columnar data is 10X cheaper to process when the query uses rowsets. So where does the name of this stat come from? Simple: it works just like a spreadsheet PIVOT operation, swapping rows for columns.

So what is a "rowset"? It simply means that a step in a query plan will take in a set of rows before processing and then process that entire batch of rows in one go before fetching more rows; this is significantly more efficient than processing one row at a time. You can see when a query uses rowsets via:

select * from table(dbms_xplan.display_cursor('<sql_id>',1,'-note +projection'));

in which case you will see projection metadata with details such as "(rowsets=200)".

When Smart Scan sees that the query on the RDBMS is using rowsets, it tries to create a columnar output block instead of a row-major output block, and when this arrives on the RDBMS the table scan will see an HCC block and bump the HCC stats even though the table being scanned doesn't use HCC.

Let's see how this works in practice. I created a TPC-H Scale 1 database in NOCOMPRESS format:

SQL> select COMPRESSION, COMPRESS_FOR from user_tables where table_name='SUPPLIER';

COMPRESS COMPRESS_FOR
-------- ------------------------------
DISABLED

and ran Query 6:

SQL> select sum(l_extendedprice * l_discount) as revenue
     from   lineitem
     where  l_shipdate >= to_date('1994-01-01', 'YYYY-MM-DD')
            and l_shipdate < add_months(to_date('1994-01-01', 'YYYY-MM-DD'), 12)
            and l_discount between .06 - 0.01 and .06 + 0.01
            and l_quantity < 24;

SQL> select name, value from v$sysstat where name = 'cell blocks pivoted';

NAME                           VALUE
------------------------------ ----------
cell blocks pivoted                  8261

Just to show how this is tied to rowsets, let's disable the optimization and recheck the stat:

SQL> alter session set "_rowsets_enabled"=false;

SQL> select sum(l_extendedprice * l_discount) as revenue
     from   lineitem
     where  l_shipdate >= to_date('1994-01-01', 'YYYY-MM-DD')
            and l_shipdate < add_months(to_date('1994-01-01', 'YYYY-MM-DD'), 12)
            and l_discount between .06 - 0.01 and .06 + 0.01
            and l_quantity < 24;

SQL> select name, value from v$sysstat where name = 'cell blocks pivoted';

NAME                           VALUE
------------------------------ ----------
cell blocks pivoted                  8261

The stat has not moved: with rowsets disabled, Smart Scan goes back to producing row-major output blocks and no pivoting takes place. I hope this helps.

When bloggers get it wrong - part 2

In Part 2 we are going to look at making use of the trace events that show what was discussed in Part 1. NB: Oracle no longer adds new numeric trace events; going forward, new trace events use the Unified Tracing Service whose grammar is much simpler. The elements we need are:

trace[[x.]y] disk = [ lowest | low | medium | high | highest ]

For example, Table Scan tracing is in the DATA hierarchy:

[1] DATA
[2] KDS     "Kernel Data Scan"
[3] KDSFTS  "Full Table Scan"
[3] KDSRID  "RowID"

'trace[KDSFTS] disk low' - only trace full table scans
'trace[KDSRID] disk low' - only trace fetch by rowid
'trace[KDS.*] disk low'  - trace both table scans and fetch by rowid

NB: don't use 'lowest' with KDS - it is used for memory tracing only.

Tracing Full Table Scans: KDSFTS

At the beginning of a granule we see if it is possible to use Turbo Scan (which is a prerequisite for using Exadata Smart Scan) and the data object number being scanned:

kdst_mode(): kdst_mode_possible: 1 fastPathFetch: 1
kdst_mode(): push_complex: 1 sage_push: 0 pred_split: 0 objd_ktid: 78768

If we can use Turbo Scan, we see the initialization of the row source and the sql_id and text of the query:

kdsirsf(): not DR|DW td: 0x7fb2a16245e0 flag: 0x1080008 heap: 1

The trace also told us this was not an ABTC scan (see further down this entry).
NB: several scans may be interleaved, so use the 'td' address to follow the scan of interest:

kdsirsf(): kcb_clone on td: 0x7fb2a16245e0 parent: 0x7fb2a16245f8 child: 0x7fb2a1624c70
Dumping sqltext for sqlid: dch22s8kyzgp8
kxsDumpSqlTextFromSQLID SQL= select sum(S_ACCTBAL) from supplier

This is important because it enables us to tie the trace into the SQL v$ views. The scan then starts:

kdsttgr: start td 0x7fb2a16245e0 for 32767 rows at slot 0 blkrows 0
kdstf000010100001000km -> FCO ROWP RSET ARR

First we see the extent map to be scanned get loaded, the start rdba of each extent, and the number of blocks in the extent:

map with 10 extents
Ext: 0, dba: 0x140000e, len: 3, skp: 0
Ext: 1, dba: 0x1400011, len: 4, skp: 0

Second, we will see each block in turn get fetched: since extent 0 started at rdba 0x140000e, we will see 0x140000e, 0x140000f, 0x1400010 get fetched in order, etc. Since this is a serial scan, it will be followed by extent 1 (a PQ granule would have had the extents sorted by rdba):

kdst_fetch0(): req tsn: 5 dba: 0x140000e dsf: 0x1 dsf2: 0x0
kdst_fetch0(): req tsn: 5 dba: 0x140000f dsf: 0x100001 dsf2: 0x4000

Finally, when all the blocks have been consumed we will see the scan terminate:

kdsers(): ending td: 0x7fb2a16245e0
kdsers(): kcb_endscan on td: 0x7fb2a16245e0

Tracing the decision to use buffer cache or direct read

To understand the decision whether to use buffer cache or direct read for a scan, you need to enable tracing for buffer cache decision making:

alter session set events '10358 trace name context forever, level 2';
alter session set events 'trace [NSMTIO] disk highest';

The Medium Table Threshold (MTT)

Our query against the Supplier table starts with "Is Small Table"? Our table has 1689 blocks, the buffer cache has space for 13,512 blocks, and the Small Table Threshold is therefore 270 blocks:

kcbism: islarge 1 next 0 nblks 1689 type 2, bpid 3, kcbisdbfc 0 kcbnhl 1024 kcbstt 270 keep_nb 0 kcbnbh 13512 kcbnwp 1

Clearly this isn't a small table; so "Is Medium Table"? Well, 1689 blocks is greater than 1/10 of the buffer cache, i.e. 1,351 blocks, so it isn't medium:

kcbimd: nblks 1689 kcbstt 270 kcbpnb 1351 bpid 3 kcbisdbfc 0 is_medium 0

So "Is Very Large Object"?
Well, it's small enough relative to the buffer cache to be worth costing out, so it's not large:

kcbivlo: nblks 1689 vlot 500 pnb 13512 kcbisdbfc 0 is_large 0

NSMTIO shows the decision (but we don't give any internal details of the cost model used). Here we see the "Direct Path Check" shows the cost model predicted that using Direct Read would be best and that we need to force a checkpoint:

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc: DirectRead: tsn: 5, objd: 79042, objn: 79042 ckpt: 1, nblks: 1689, ntcache: 0, ntdist:0

Let's make the buffer cache much bigger so that MTT is bigger than the table and see the tracing for that. With space for 28,368 blocks in the cache, MTT is now 2,836 blocks, and STT is therefore 567 blocks, so we see the is_medium check set to true:

kcbism: islarge 1 next 0 nblks 1689 type 2, bpid 3, kcbisdbfc 0 kcbnhl 2048 kcbstt 567 keep_nb 0 kcbnbh 28368 kcbnwp 1
kcbimd: nblks 1689 kcbstt 567 kcbpnb 2836 bpid 3 kcbisdbfc 0 is_medium 1

So is it worth it to "Cache Medium Table"? This has to factor in such things as the current state of the cache and the aging of the current contents, and whether the entire database can be cached (kcbisdbfc):

kcbcmt1: scann age_diff adjts last_ts nbuf nblk has_val kcbisdbfc 0 8168 0 28368 1689 0 0
kcbivlo: nblks 1689 vlot 500 pnb 28368 kcbisdbfc 0 is_large 0

NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc: DirectRead: tsn: 5, objd: 79042, objn: 79042 ckpt: 1, nblks: 1689, ntcache: 0, ntdist:0

and here we see that even though the table is smaller than 10% of the buffer cache, the current state of the buffer cache meant that it was not worth caching and so we will still use Direct Read.

Forcing a Buffer Cache scan using ABTC

Since we couldn't get our scan to use the buffer cache, let's try using Automatic Big Table Caching by setting the system parameter DB_BIG_TABLE_CACHE_PERCENT_TARGET=50. Now we see it costing a serial DW scan:

kcbism: islarge 1 next 0 nblks 1689 type 2, bpid 3, kcbisdbfc 0 kcbnhl 2048 kcbstt 567 keep_nb 0 kcbnbh 28368 kcbnwp 1
kcbivlo: nblks 1689 vlot 500 pnb 28368 kcbisdbfc 0 is_large 0

The difference is that now we look at the object's temperature, and here the ABTC policy returns that the table should be cached (ABTC policy is true), so we no longer choose Direct Read:

NSMTIO: kcbimd: serial DW scan <0.5, 79042> nblks 1689 policy 1 imc 0 pq 0)
NSMTIO: kcbimd: serial DW cache scan <0.5, 79042> nblks 1689 pq
NSMTIO: qertbFetch: NoDirectRead:[- STT < OBJECT_SIZE < MTT]: Obect's size: 1689 (blocks), Threshold: MTT(2836 blocks)

When bloggers get it wrong - part 1

I've read a number of blog entries from various people who've clearly put great diligence into trying to understand how the decision is made to use the buffer cache for a table scan or whether it is better to use direct read for the scan. Since this is a critical decision from a performance perspective, I've decided to write a more definitive account. Part 1 deals with a few principles, and Part 2 (which will follow very shortly) will show how this works with tracing and clear up a few misunderstandings. Part 3 will deal with PQ in more detail as soon as I have time.

Serial Scans

Small Table Threshold (STT)

Serial scans use direct path reads when the object is not a small table and, since 11.2.0.2, the size used to determine whether it is "small" has come from stats rather than from the segment header. _small_table_threshold defaults to 2% of the buffer cache or to 20 blocks, whichever is bigger (NB: this parameter ignores the effects of compression). An object is small enough for buffer cache reads to be more effective than direct path read if the number of blocks it contains is less than or equal to the value of _small_table_threshold. Such scans will bump the stat "table scans (short tables)".

Medium Table Threshold (MTT)

So what happens when an object is bigger than STT? We fall into the MTT case. MTT is not directly configurable and defaults to 10% of the buffer cache blocks, but it also depends on a cost based decision that includes how busy the cache is. NB: PQ and Exadata have different rules. NB: MTT also depends on _object_statistics being enabled. For STT < #blocks < MTT, if the cost based decision means a buffer cache scan is chosen, the stat "table scans (short tables)" will still get bumped.

So when is a medium table not worth caching? First, when MTT < #blocks (ignore VLOT in blogs; it hasn't been used for many years but persists in the trace and blogs for historic reasons). Second, this assumes the scan is not part of a select for update: for those, different rules apply because cache scans are more useful if the segment being updated is first the subject of a query. Tables larger than MTT also factor in the storage reduction factor; the ratios used are 10X for HCC and 3.3X for OLTP. Third, a cost based analysis is done comprising the % currently in cache, the % of buffers that are dirty (check-pointing cost), the current I/O bottleneck, and any RAC access costs. If the buffer cache is still used, the stat "table scans (long tables)" will get bumped; but if Direct Read is chosen, the stat "table scans (direct read)" will get bumped.

Automatic Big Table Caching (ABTC)

So does this mean the buffer cache never helps with large tables? Not at all: large tables can still use the buffer cache for a portion of their scan based on the public parameter DB_BIG_TABLE_CACHE_PERCENT_TARGET, which sets aside a percentage of the buffer cache for larger tables. On a single instance, ABTC works with both Serial and PQ scans, BUT mixing Serial and PQ may lead to fewer cache hits. On RAC, it only works with PQ and requires that PARALLEL_DEGREE_POLICY be set to AUTO or ADAPTIVE.

ABTC works by tracking the heat of medium and large tables and is designed to take advantage of any runs of buffers currently cached. Then, if the decision to use ABTC is made, it periodically checks the cache for how many buffers in the next section are already in memory to see whether Direct Read or a Buffer Cache scan would be better for the next chunk. Let's see how the heat tracking works by looking at the v$bt_scan_obj_temps view.
SQL> select DATAOBJ#, SIZE_IN_BLKS, TEMPERATURE from V$BT_SCAN_OBJ_TEMPS;

  DATAOBJ# SIZE_IN_BLKS TEMPERATURE
---------- ------------ -----------
     79042         1689        3000

Each time we scan a medium or larger table it will bump that object's temperature by 1000. Then, the object with the highest temperature will get absolute priority for caching and, if needed, will evict any previously cached ABTC blocks of other objects. If the hottest object only partially fits in the buffer cache (which is quite likely), the first section that fits will be read through the buffer cache and cached, then the rest of the scan will switch to Direct Read when the ABTC allocation in the cache is fully utilized. NB: the object will still be check-pointed at the beginning so that it is safe to switch to Direct Read at any time.

Part 2 will include tracing to show how to make use of this feature.
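
As a footnote, if you want to see the derived thresholds discussed above on your own instance, the classic (SYS-only, unsupported) peek at the underlying parameters looks like this; treat it as a diagnostic sketch rather than something to build on:

-- values of the scan-related thresholds (requires connecting as SYS)
select p.ksppinm  as name,
       v.ksppstvl as value
from   x$ksppi p, x$ksppcv v
where  p.indx = v.indx
and    p.ksppinm in ('_small_table_threshold',
                     '_very_large_object_threshold',
                     'db_big_table_cache_percent_target');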

Controlling the offload of specific operators

One of the joys of regexp is that you can write a pattern that is painfully expensive to match, and offloading these to the cell can cause significant impact on other users and overall throughput (including heartbeat issues). If you have a user who is prone to writing bad regexp expressions, you as DBA can prevent regexp (or any other operator) from being offloaded to the cells.

Let's take a very simple example using a cut down version of TPC-H Query 16 and a NOT LIKE predicate:

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
group by p_brand, p_type, p_size;

SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);

|*  3 |    TABLE ACCESS STORAGE FULL| PART | 29833 |  1048K|    |   217   (2)| 00:00:01 |  1 |  8 |
------------------------------------------------------------------------------------------------------------
   3 - storage(("P_SIZE"=3 OR "P_SIZE"=9 OR "P_SIZE"=14 OR "P_SIZE"=19 OR "P_SIZE"=23 OR
       "P_SIZE"=36 OR "P_SIZE"=45 OR "P_SIZE"=49) AND "P_BRAND"<>'Brand#45' AND "P_TYPE" NOT LIKE
       'MEDIUM POLISHED%')

Here we see all the predicates get offloaded as expected. So, for example, to stop NOT LIKE being offloaded we would need to find the operator in v$sqlfn_metadata:

SQL> column descr format a18
SQL> select func_id, descr, offloadable from v$sqlfn_metadata where descr like '%LIKE%';

   FUNC_ID DESCR              OFF
---------- ------------------ ---
        26  LIKE              YES
        27  NOT LIKE          YES
        99  LIKE              NO
       120  LIKE              YES
       121  NOT LIKE          YES
       ...
       524  REGEXP_LIKE       YES
       525  NOT REGEXP_LIKE   YES
       537  REGEXP_LIKE       YES
       538  NOT REGEXP_LIKE   YES

We can ignore all but the two basic LIKE operators in this case, so to disable the offload of our LIKE predicates we use:

SQL> alter session set cell_offload_parameters="OPT_DISABLED={26,27};";

and we see this reflected in the offloadable column in v$sqlfn_metadata:

SQL> select func_id, descr, offloadable from v$sqlfn_metadata where descr like '%LIKE%';

   FUNC_ID DESCR              OFF
---------- ------------------ ---
        26  LIKE              NO
        27  NOT LIKE          NO
        99  LIKE              NO
       120  LIKE              YES
       121  NOT LIKE          YES

To re-enable them you would use:

SQL> alter session set cell_offload_parameters="OPT_DISABLED={};";

One thing to note about this param is that it doesn't work like events (whose settings are additive): here it replaces the previous value, so every operator you want disabled has to be included in the same alter session (and the param is limited to a maximum of 255 characters, limiting the number of operators that can be disabled).
With the offload of LIKE and NOT LIKE disabled we can see the impact on the plan:

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
group by p_brand, p_type, p_size;

SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);

|*  3 |    TABLE ACCESS STORAGE FULL| PART | 29833 |  1048K|    |   217   (2)| 00:00:01 |  1 |  8 |
------------------------------------------------------------------------------------------------------------
   3 - storage(("P_SIZE"=3 OR "P_SIZE"=9 OR "P_SIZE"=14 OR "P_SIZE"=19 OR "P_SIZE"=23 OR
       "P_SIZE"=36 OR "P_SIZE"=45 OR "P_SIZE"=49) AND "P_BRAND"<>'Brand#45')

and the NOT LIKE is no longer in the storage filter. Now let's say that you as DBA are faced with a more complex problem and want to halt all complex processing on the cells temporarily. There is a parameter that will disable everything except the simple comparison operators and NULL checks:

SQL> alter session set "_cell_offload_complex_processing"=FALSE;

Now let's see what happens:

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
group by p_brand, p_type, p_size;

SQL> select * FROM TABLE(DBMS_XPLAN.DISPLAY);

|*  3 |    TABLE ACCESS STORAGE FULL| PART | 29833 |  1048K|    |   217   (2)| 00:00:01 |  1 |  8 |
------------------------------------------------------------------------------------------------------------
   3 - filter(("P_SIZE"=3 OR "P_SIZE"=9 OR "P_SIZE"=14 OR "P_SIZE"=19 OR "P_SIZE"=23 OR
       "P_SIZE"=36 OR "P_SIZE"=45 OR "P_SIZE"=49) AND "P_BRAND"<>'Brand#45' AND "P_TYPE" NOT LIKE
       'MEDIUM POLISHED%')

Well, we got no storage predicates at all, and we didn't expect that because we had one simple predicate, namely p_brand <> 'Brand#45', and the IN predicate had been rewritten to a series of OR'ed comparisons. So what happened? This parameter only permits simple predicates that are linked by ANDs and can be attached directly to one column. Disjuncts are not pushable, so they are normally evaluated by an eva tree or by pcode, neither of which are sent to the cell with this parameter set to FALSE. So why wasn't our one simple predicate offloaded? Well, note where it is in the explain plan. It comes after the rewritten IN, and since the predicates are sorted by the optimizer on effectiveness, we stop looking as soon as we see one that can't be offloaded. Let's remove the IN and see what happens:

SQL> explain plan for select p_brand, p_type, p_size
from part
where p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%';

|*  2 |   TABLE ACCESS STORAGE FULL| PART |   190K|  6686K|   217   (2)| 00:00:01 | 1 | 8 |
---------------------------------------------------------------------------------------------------
   2 - storage("P_BRAND"<>'Brand#45')
       filter("P_BRAND"<>'Brand#45' AND "P_TYPE" NOT LIKE 'MEDIUM POLISHED%')

As expected, the simple predicate is now offloaded. If you look at v$sqlfn_metadata you'll see this param is reflected in the offloadable column:

SQL> select func_id, descr, offloadable from v$sqlfn_metadata where descr like '%LIKE%';

   FUNC_ID DESCR              OFF
---------- ------------------ ---
        26  LIKE              NO
        27  NOT LIKE          NO
        99  LIKE              NO
       120  LIKE              NO
       121  NOT LIKE          NO
       ...
       524  REGEXP_LIKE       NO
       525  NOT REGEXP_LIKE   NO
       537  REGEXP_LIKE       NO
       538  NOT REGEXP_LIKE   NO

I hope you never need any of this in real life but it's good to have it in the toolbag.
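When the emergency is over, re-enabling complex offload processing is simply the inverse of the setting shown above (a session-level sketch; as with any underscore parameter, preferably only under guidance from support):

SQL> alter session set "_cell_offload_complex_processing"=TRUE;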


Shining some light on Database In-Memory vs the Exadata Columnar Cache in 12.1.0.2

Over the last few weeks I have gotten the following questions from a number of different sources: "what's the difference between Oracle's various columnar technologies and how do they fit together?" Or, put another way, "what scenarios does each benefit, and would I ever need both or will one cover all the bases?" When a question comes up often enough it is always worth a blog post, so here goes. They say a picture is worth a thousand words, so my own mental picture of the difference between DBIM and the Exadata Columnar Cache is that of the difference between an Indycar design (or a Formula 1 design) and a NASCAR vehicle. Both of these vehicles are thoroughbreds in their own way: both honed over the years to obtain every ounce of performance possible from the given chassis, but you would not do as well entering an Indycar into a NASCAR race and vice versa. So let's start at the beginning.

Hybrid Columnar Compression

We originally designed Hybrid Columnar Compression (HCC) in response to customer requirements to retain growing volumes of colder data: it was primarily a compression technology rather than a performance technology. The goal was to compress colder data as much as possible while retaining the ability to run both table scans and index-access scans against it at acceptable speed, and so we chose to apply a spreadsheet pivot to chunks of data to turn a set of rows into columnar format. The compression benefits of a columnar format were well known at that time, but the existing pure columnar databases had struggled to make significant sales, due partly to the painfully slow single-row retrieval that comes from needing one disk I/O per column. Therefore, we chose a hybrid approach to enable us to retain row retrieval in one or at most two disk I/Os but still have columnar benefits. The primary goal of our initial design was to get the maximum compression by applying first semantic compression then bitwise compression, choosing from a variety of algorithms based on what used the least space on disk. The default compression level used ZLIB for its bitwise compression because on spinning disks it is typically far cheaper to decompress ZLIB than it is to do the extra disk I/O of uncompressed data. The bulk of the performance improvement came from this cost reduction, with additional performance improvements coming from (a) direct addressing of columns of interest (instead of navigating the row to get to a column), and (b) applying predicates and expressions on columnar data. Research had shown that for traditionally applied predicates (as opposed to DBIM's SIMD) the bulk of the performance improvements are seen by the time you have ~1,000 values. So, for this reason, the default Query High compression targets 1,000 rows as long as they fit in 64 KB after compression.

The Columnar Cache

One weakness with the hybrid design is that as table width grows, a query will typically use a smaller and smaller percentage of the data in a compression unit, and the lack of true columnar disk I/O starts to hurt performance. The other weakness is that with very narrow tables the cost becomes dominated by the decompression time; ideally one would switch to Query Low, which has much cheaper decompression costs, but the extra disk I/O of any unused columns would hurt performance significantly, so no one uses it.
Both the problems with very narrow and very wide tables required true columnar I/O, so we came up with a simple solution: rearrange 1 MB worth of HCC data into true columnar form and store the page mapping in the flash cache's hash table to tell us where the columns are. It is not uncommon to see savings of 50-80% disk I/O from the columnar cache and significant improvements to the wall clock for queries which are cell CPU bound. The thing to note here is that every block in the 1 MB aligned chunk must be in HCC format: if there are any row-major blocks from updates, or from inserts that were too small to compress, they will stop the chunk from being rearranged into Columnar Cache format.

The advent of the NVMe flash cache in Exadata X5 and X6 with blindingly high scan speeds meant that one of the arguments for using Query High as opposed to Query Low should have vanished, but performance measurements on NVMe showed that the compression unit sizes for Query Low were too small and that they should be the same size that Query High uses. (Note: getting better compression is still the main reason to use Query High, which is why it will remain the default.) The next major release of Oracle will use the larger CU size for Query Low, but in the meantime there is a parameter available to force larger CU sizes in 12.1.0.2 if any customers need to use Query Low to avoid being cell CPU bound (particularly on the EF machines). Please email me if you have an X5 or X6 and need to try a larger size.

However, HCC and its columnar cache do nothing to help RDBMS-bound queries, which is the scenario many customers face, and there is nothing the columnar cache can do for hot data that is still being updated. What our customers were telling us was that they needed to extend columnar processing seamlessly across the Information Management Lifecycle for both hot and cold data. And with that, we get to the story of Database In-Memory.

Database In-Memory

Oracle's unique hybrid approach had already proved that optimal performance required both quick access to a single row of data and quick columnar scan access. The issue was how to do this on hot data that is still undergoing DMLs without breaking OLTP performance. Oracle's unique innovation was to split the data simultaneously into the traditional row-major on-disk format and also into an In-Memory segment with the data in columnar format. By having the columnar data in memory there would be no changes to the existing OLTP code path and no redo generated to slow write transactions down. Rows which had very recently been updated, and which were for that reason invalidated in the column store, would still be in the buffer cache, allowing quick row retrieval when needed. You would be forgiven for thinking that all we did was take HCC formats and put long runs of them in memory, so it would simply be a new form of HCC. But that's the antithesis of what we did. Remember that the HCC formats were chosen primarily for their compression abilities and the ability to retrieve a row from disk in as few I/Os as possible.
Since the focus of this project was raw columnar performance, the columnar formats were refined and reworked into forms finely tuned for query performance using SIMD instructions (SIMD predicates are not supported on some HCC formats), including the introduction of dictionary encoding (which is not available to HCC due to its shorter column runs and the requirement that Oracle blocks on disk are always entirely self-describing). In addition to fast SIMD predicates and higher scan speeds, bringing the columnar data into the RDBMS brought several more advantages:

With the very high scan speeds (because there's no I/O involved) enabling predicates to find narrow ranges of data fast, more indexes could be dropped to improve OLTP transactional performance.

More predicates could be pushed into the columnar format: Smart Scan is unable to offload any predicates that require support from the RDBMS kernel, in particular PL/SQL functions, and many LOB, CLOB, and XML predicates require PL/SQL runtime support.

In the RDBMS, Vector Group By can be used to do the actual grouping, not just act as a predicate. Smart Scan offloads Vector Group By as the equivalent of an improved bloom filter: it can do the pruning but not the grouping - that is still done after the data is returned to the RDBMS.

Why would I use both formats?

To bring this all together, there is a coherent information lifecycle management plan that gets optimal OLTP performance AND optimal query performance: place the hotter data in memory columnar, and as data cools to the point where it is unlikely to undergo significant updates, tier it out to the Exadata Columnar Cache via HCC formats. Optionally, tier it out further to ZFSSA or BDA when the data is cold enough that it is only being retained for compliance purposes and is unlikely to be queried.

Disclaimer: these are my personal thoughts and do not represent Oracle's official viewpoint in any way, shape, or form.


What you need to know about Offload Quarantines

Several questions have come up in the last few weeks about offload quarantines, which means a blog post on this topic is overdue. We work hard to stress test every new rpm that is released, but on rare occasions customers can encounter an issue with the thin database layer that exists in the offload server. This layer is known externally as "Smart Scan" and internally as "FPLIB" (a.k.a. Filter Projection Library). A crash in the thin database layer could be caused either by an issue with some aspect of the sql_id (for example, with the predicates) or by an issue with the data on some region of disk (for example, with the OLTP compression symbol table). The worst, and rarest, form of crashes are where striping leads to every offload server failing simultaneously: these are known colloquially as "Railroad Crashes". The most important thing is to make sure the retry mechanism doesn't immediately resubmit the query and re-crash the offload server, causing a halt to business operations. In a hospital, the floor nurse would call a code and the crash team would come running with a crash cart to stabilize the patient. Two members of my family are nurses and I'm reminded that nurses are a lot like technical support engineers in that while doing their job they sometimes have to deal with abuse from frustrated patients (customers): please remember that both groups work hard to resolve your issues and be kind to them!

What are Quarantines?

The option of calling a crash cart is not available to us here, so starting in early 11.2.0.3 we created a quarantine system where, after a crash, the exception handler remembers both the sql_id and the disk region being processed and creates a persistent quarantine for both. When a sql_id or a disk region is quarantined, any Smart Scan operations on them will be executed in passthru mode. Currently it is hard to see when this is happening, but an upcoming release has a new stat to make this easier. If an operation has been quarantined you can see its effects by monitoring:

cell num bytes in passthru due to quarantine

Listing Quarantines

When a quarantine has been created you can look at it in detail using CellCLI:

CellCLI> list quarantine 1 detail
         name:                  1
         clientPID:             12798
         crashReason:           ORA-600[17099]
         creationTime:          2011-02-07T17:18:13-08:00
         dbUniqueID:            2022407934
         dbUniqueName:          YAMA
         incidentID:            16
         planLineID:            37
         quarantineReason:      Crash
         quarantineType:        "SQL PLAN"
         remoteHostName:        sclbndb02.us.oracle.com
         rpmVersion:            OSS_11.2.0.3.0_LINUX.X64_1012
         sqlID:                 1jw05wutgpfyf
         sqlPlanHashValue:      3386023660

The 'list detail' version of the command gives us everything we would need to know about exactly what has been quarantined and why it was quarantined. CellCLI also supports manually creating a quarantine using the attributes shown by 'list detail'.

Quarantine Escalation

This is the topic that has caused the most confusion: if three new quarantines are generated within a 24 hour period, the quarantine is escalated to a database quarantine.
Using the 'list detail' option we would then see:

         quarantinePlan:        SYSTEM
         quarantineReason:      Escalated
         quarantineType:        DATABASE

Note: the number of quarantines in 24 hours before escalation is configurable via a cellinit param: please contact Technical Support if you feel you have a valid need to change this. The final level of escalation is that if more than one database has been escalated to a database quarantine, the system will escalate to a complete offload quarantine where Smart Scan is disabled completely and all I/O goes through regular block I/O. I'm glad to say that I have never seen this happen.

Dropping Quarantines

The next question is how and when quarantines are removed. Any quarantine can be removed manually using CellCLI. Quarantines are also automatically dropped by certain operations:

Plan step quarantines: dropped on rpm upgrades
Disk region quarantines: dropped on rpm upgrades, or on successful writes to the quarantined disk region
Database quarantines: dropped on rpm upgrades
Offload quarantines: dropped on rpm upgrades

What about CDB?

In 12.1, we changed the architecture of cellsrv to support multiple RDBMS versions running at the same time by introducing the concept of offload servers. When a new rpm is installed it typically contains offload servers for 11.2.0.4, 12.1.0.1, 12.1.0.2 (and 12.2.0.1). This is known internally as multi-DB. Any given operation is tagged with the RDBMS version it is coming from and routed to the offload server for that version. A crash in Smart Scan typically means that only the offload server has to restart and not the central cellsrv that maintains Storage Index and does Smart IO. A side effect of this is that all operations for that RDBMS version can revert to Block IO while the offload server restarts, minimizing disruption.

The architecture change necessitated a change to the way quarantines are created, checked, and dropped. In multi-DB, installation of a new rpm no longer drops all quarantines. Instead, system-created quarantines now record the offload server's rpm version. Manually created quarantines can also optionally specify the offload rpm they are to affect. In multi-DB, a quarantine is observed if the offload rpm specified matches the actual offload rpm the operation will be sent to, or if no offload rpm is specified, regardless of offloadgroup. Multi-DB quarantines are dropped if the matching offload rpm is uninstalled or a new rpm is installed for that offload version. Multi-DB quarantines with no offload rpm specified must be dropped manually.

Please let me know if you have any questions.

Roger
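For reference, removing a quarantine manually from CellCLI looks roughly like this (a sketch; the quarantine name comes from 'list quarantine', and the exact syntax for your cell software version is in the CellCLI reference):

CellCLI> list quarantine
CellCLI> drop quarantine 1
CellCLI> drop quarantine all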


Using INMEMORY with External Tables

Several people have asked if there is any way to use INMEMORY on External Tables because the INMEMORY syntax is not supported in Create Table and Alter Table with Organization External (actually Create Table parses it but then ignores it). While in 12.1.0.2 there is no way out of having to load the data into Oracle, there is a short cut to simplify life when you have an external process recreating flat file data on a regular basis and want to query the latest incarnation against corporate data in Oracle tablespaces. One example we've seen of this is where you have a map-reduce job on Hadoop that highly summarizes large amounts of ephemeral data and it is the summary data that needs to be joined against Oracle data. Instead of truncating the Oracle table and doing a flat file load each time, we can use a Materialized View on the external table and simply issue an on-demand refresh, letting the materialized view take care of the truncate and load under the covers for us. Since INMEMORY does support Materialized Views, we can now automatically get INMEMORY query speeds on the data from an external source. Let's see how this works. For example, here's a loader table I use with TPC-H Scale 1 for local testing:

create table p_et(
  p_partkey number,
  p_name varchar2(55),
  p_mfgr char(25),
  p_brand char(10),
  p_type varchar2(25),
  p_size number,
  p_container char(10),
  p_retailprice number,
  p_comment varchar2(23))
organization external(
  type ORACLE_LOADER
  default directory T_WORK
  access parameters(
    records delimited by newline
    nobadfile
    nologfile
    fields terminated by '|'
    missing field values are null)
  location (T_WORK:'part.tbl'))
reject limit unlimited;

Instead of just getting the external data into an Oracle MVIEW segment to query, I can get it into an INMEMORY segment and get much better query performance:

SQL> create materialized view im_p_et inmemory memcompress for capacity low as select * from p_et;

SQL> alter materialized view im_p_et ENABLE QUERY REWRITE;

That succeeds, so first let's make sure our queries are actually going to use this Materialized View:

SQL> begin
       DBMS_MVIEW.EXPLAIN_REWRITE( query=>'select count(p_comment) from p_et', mv=>'im_p_et');
     end;
     /

PL/SQL procedure successfully completed.

SQL> select message from rewrite_table;

QSM-01151: query was rewritten
QSM-01209: query rewritten with materialized view, IM_P_ET, using text match algorithm

OK, so we are sure this simple query can use our Materialized View; let's verify that it actually did.

Note: my table was too small so I had to force the XT MVIEW to be loaded:

SQL> begin
       dbms_inmemory.repopulate(schema_name=>'TPCH', table_name=>'IM_P_ET', force=>TRUE);
     end;
     /

PL/SQL procedure successfully completed.
SQL> select INMEMORY_SIZE, POPULATE_STATUS from v$im_segments where SEGMENT_NAME='IM_P_ET';

INMEMORY_SIZE POPULATE_STAT
------------- -------------
      7602176 COMPLETED

SQL> set autotrace on
SQL> select count(p_comment) from p_et;

COUNT(P_COMMENT)
----------------
          200000

Execution Plan
----------------------------------------------------------
Plan hash value: 703290957

---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |         |     1 |    13 |    50  (62)| 00:00:01 |
|   1 |  SORT AGGREGATE                         |         |     1 |    13 |            |          |
|   2 |   MAT_VIEW REWRITE ACCESS INMEMORY FULL | IM_P_ET |   152K|  1941K|    50  (62)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          1  recursive calls
          3  db block gets
       1881  consistent gets
          0  physical reads
          0  redo size
        550  bytes sent via SQL*Net to client
        607  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

In the query plan we see 'MAT_VIEW REWRITE ACCESS INMEMORY FULL', and in the statistics section we see physical reads is now zero. We are now getting very fast INMEMORY queries against a Materialized View built using data from the flat file on disk. Here are some things to remember when using Materialized Views with External Tables:

The Materialized View must have rewrite enabled
The optimizer_mode must be set to cost based e.g. 'first_rows_100'
The query_rewrite_enabled parameter must be set to 'true' or 'force'
The query_rewrite_integrity parameter for an XT must be set to 'trusted' or 'stale_tolerated'
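Once the external process has rewritten the flat file, the on-demand refresh mentioned above is all that's needed to reload the INMEMORY copy. A minimal sketch using the MVIEW from this example (a complete, non-atomic refresh so the truncate and load happen under the covers):

SQL> exec DBMS_MVIEW.REFRESH(list => 'IM_P_ET', method => 'C', atomic_refresh => FALSE);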


Compression in a well-balanced system

This continues the previous entry (Is there such a thing as fast enough?). Exadata is designed to present a well-balanced system between disk scan rates, cell CPU processing rates, network bandwidth, and RDBMS CPU, such that on average, across a lot of workloads and a lot of applications, there should be no one obvious bottleneck. Good performance engineering means avoiding a design where one component wildly outclasses the rest, because that would be nuts. If you change one component in a car, e.g. the engine, you need to rework the other components such as the chassis and the brakes. When we rev the Exadata hardware we go through the same exercise to make sure the components match to avoid bottlenecks.

One of the components in a well-balanced system that people often leave off this list is the compression algorithm. With compression you are balancing storage costs as well as the disk scan rates against the CPUs' decompression rate. In a disk-bound system you can alleviate the bottleneck by compressing the data down to the point that it is being scanned at the same rate that the CPUs can decompress and process it. Squeeze it too far, say with HCC "Compress For Archive High" which uses BZ2, and the CPUs become a major bottleneck because the decompression costs unbalance the system (but it's wonderful if you think you're unlikely to ever need the data again).

Since HCC was released with Exadata 11.2, DBAs have almost always used Query High (ZLIB) since it balances spinning disks against the Xeon CPUs quite nicely while retaining a storage size small enough to retrieve in a single disk I/O on 256 KB formatted extents. With Query High we tend to see an equal chance of some query being disk bound or CPU decompression bound. Prior to X3, HCC Query Low (LZO) was never a performance option for scans since the reduced CPU load couldn't compensate for the 2X disk reads required, even on flash, so Query Low scans were going to be disk bound. The only utility for Query Low was if there were a lot of single-row reads: the 32 KB storage size combined with cheap decompression made row retrieval much faster than Query High, where you are decompressing twice as much data with a more expensive decompression algorithm.

With the X3 flash cache we saw something new: the Xeons would decompress LZO at a speed that very roughly matched the SSD scan speeds of the flash cache. While some queries could now run faster using Query Low than Query High, that didn't provide a compelling reason to use Query Low since it uses not only twice as much storage space but twice as much space in the flash cache, and you had to be sure it was in the flash cache to see that speed. With the introduction of X5 just over a year ago the balance changed. I am a huge fan of the X5-EF with its flash-only configuration. While X5-HC offers 10% more storage capacity and a flash cache that is as fast as flash tablespace, the X5-EF offers consistency: you never need to ask whether the data is in the cache or pay the overhead of maintaining the cache. It also offers the chance to leave the past and the notion of spinning disks behind. The X5-EF can scan at over 250 GB/sec and offers 288 fast cores in the storage layer for decompression. Clearly if your primary concern is storage capacity you will want to stay with Query High, but if we are thinking purely in terms of exploring performance, the time has come for a reassessment of the role of HCC Query Low on X5. This will be the topic of the next blog entry.
In the meantime, I'll leave you with my own mental image of the X5-EF - the sleek, well-balanced, very responsive BMW i8, which can do 0 to 60 in 4.4 seconds and 50 to 75 mph in 2.6 seconds (hmmm, though since the i8 is a hybrid maybe it corresponds to the X5-HC ... just a thought).


Is there such a thing as fast enough?

Have you seen this video which went viral? What fascinates me is that I remember upgrading from Win 3.1 to Win 95 and being blown away by how fast and smooth it was. The feeling of "this is it", this is fast enough to feel really productive. But sadly, so often today's amazement all too quickly becomes tomorrow's frustration. We become addicted to the buzz of doing more interesting things and doing them more quickly. When I started in DW in the 90's, we thought an analyst's attention span was around 10 minutes: if complex TB-scale queries came back in ten minutes they could keep their train of thought and go on asking interesting questions of the data. It wasn't long before 10 minutes was painful; we wanted one minute, then 10 seconds, then ...

At Hotsos 2016 in Dallas this week I was talking about flash performance. When Exadata X3 came out the performance was amazing. In some scenarios the flash cache could serve data faster than the Xeons could decompress ZLIB, and for the first time LZO became a viable performance option for flash. With X5-EF, the NVMe flash subsystem can scan at over 250 GB/sec, dwarfing the QDR Infiniband links' 40 GB/sec data return, which is why having Smart Scan cut the data down at the storage layer is now more essential than ever. I look at 250+ GB/sec and can't quite get my head around what that looks like as a processing rate - reading the entire Library of Congress in less than a minute? I wonder if a time will come when I will look at NVMe and think how on earth did we live with the frustration? I don't know, but I do know that with X5 now and with future hardware options on the horizon, this is the most exciting time in my 30+ year career to be working at the heart of database performance architecture. I'm going to be continuing this discussion with some of the performance findings from Hotsos: stay tuned. Let me know if there are any Smart Scan performance topics you'd like to see posts on.

Roger


Double counting of HCC stats prior to next major release

HCC stats have been confusing for many since their introduction in 11.2, both because of the lack of a common naming convention and because of double counting. When we did DBIM we spent a lot of time getting the stats right, both in terms of a naming convention that lent itself to straightforward wildcard searches and in terms of contents. For the next major release, the introduction of the IMC format Columnar Cache meant that the time had come to revamp the HCC stats more in line with the DBIM stats.

The double counting came from the fact that the decompression library (kdzd) had a stats container that was updated for each operation and the final tally was copied into the stat in a one-to-one correspondence. Each CU and each column that was processed on the cell was matched by the set of calls that looked at the CUs in the synthetic blocks returned by Smart Scan to see if they were secondarily compressed or not (which sometimes happens to squeeze the results into the output buffer); consequently each CU and each column incremented the same stat twice. In the next major release these have all been split into pairs of stats 'HCC scan [ cell | rdbms ] %'.

This table provides a handy mapping for anyone looking to find where their favourite stats have gone (previous stat -> new stat):

EHCC Analyze CUs Decompressed -> HCC analyze table CUs
EHCC Analyzer Calls -> HCC analyzer calls
EHCC Archive High CUs Compressed Conventional -> HCC load conventional CUs archive high
EHCC Archive High CUs Compressed Direct -> HCC load direct CUs archive high
EHCC Archive Low CUs Compressed Conventional -> HCC load conventional CUs archive low
EHCC Archive Low CUs Compressed Direct -> HCC load direct CUs archive low
EHCC Query High CUs Compressed Conventional -> HCC load conventional CUs query high
EHCC Query High CUs Compressed Direct -> HCC load direct CUs query high
EHCC Query Low CUs Compressed Conventional -> HCC load conventional CUs query low
EHCC Query Low CUs Compressed Direct -> HCC load direct CUs query low
EHCC CU Row Pieces Compressed Conventional -> HCC load conventional CUs row pieces
EHCC CU Row Pieces Compressed Direct -> HCC load direct CUs row pieces
EHCC CUs Compressed Conventional -> HCC load conventional CUs
EHCC CUs Compressed Direct -> HCC load direct CUs
EHCC Rows Compressed Conventional -> HCC load conventional rows
EHCC Rows Compressed Direct -> HCC load direct rows
EHCC Rows Not Compressed Conventional -> HCC load conventional rows not compressed
EHCC Rows Not Compressed Direct -> HCC load direct rows not compressed
EHCC Compressed Length Compressed Conventional -> HCC load conventional bytes compressed
EHCC Compressed Length Compressed Direct -> HCC load direct bytes compressed
EHCC Decompressed Length Compressed Conventional -> HCC load conventional bytes uncompressed
EHCC Decompressed Length Compressed Direct -> HCC load direct bytes uncompressed
EHCC Attempted Block Compressions -> HCC block compressions attempted
EHCC Block Compressions -> HCC block compressions completed
Table 2. Query Stats (previous stat -> new RDBMS stat / new cell stat)

EHCC Archive High CUs Decompressed -> HCC scan rdbms CUs archive high / HCC scan cell CUs archive high
EHCC Archive Low CUs Decompressed -> HCC scan rdbms CUs archive low / HCC scan cell CUs archive low
EHCC Query High CUs Decompressed -> HCC scan rdbms CUs query high / HCC scan cell CUs query high
EHCC Query Low CUs Decompressed -> HCC scan rdbms CUs query low / HCC scan cell CUs query low
EHCC CUs Decompressed -> HCC scan rdbms CUs decompressed / HCC scan cell CUs decompressed
EHCC Check CUs Decompressed -> HCC block check CUs
EHCC Dump CUs Decompressed -> HCC block dump CUs
EHCC Rowid CUs Decompressed -> HCC fetch by rowid CUs
EHCC Normal Scan CUs Decompressed -> HCC scan rdbms CUs normal
EHCC Turbo Scan CUs Decompressed -> HCC scan rdbms CUs turbo
cell CUs processed for compressed -> HCC scan cell CUs processed for compressed
cell CUs processed for uncompressed -> HCC scan cell CUs processed for uncompressed
cell CUs sent compressed -> HCC scan cell CUs sent compressed
cell CUs sent head piece -> HCC scan cell CUs sent head piece
cell CUs sent uncompressed -> HCC scan cell CUs sent uncompressed
EHCC DML CUs Decompressed -> HCC DML CUs
EHCC Conventional DMLs -> HCC DML conventional
EHCC CUs all rows pass minmax -> HCC scan rdbms CUs optimized read / HCC scan cell CUs optimized read
EHCC CUs no rows pass minmax -> HCC scan rdbms CUs pruned / HCC scan cell CUs pruned
EHCC CUs some rows pass minmax -> <removed>
EHCC Columns Decompressed -> HCC scan rdbms CUs columns accessed / HCC scan cell CUs columns accessed
EHCC Compressed Length Decompressed -> HCC scan rdbms bytes compressed / HCC scan cell bytes compressed
EHCC Decompressed Length Decompressed -> HCC scan rdbms bytes decompressed / HCC scan cell bytes decompressed
EHCC Pieces Buffered for Decompression -> HCC scan rdbms CUs row pieces accessed / HCC scan cell CUs row pieces accessed
EHCC Preds all rows pass minmax -> HCC scan rdbms CUs predicates received / HCC scan cell CUs predicates received
EHCC Preds no rows pass minmax -> HCC scan rdbms CUs predicates applied / HCC scan cell CUs predicates applied
EHCC Preds some rows pass minmax -> HCC scan rdbms CUs predicates optimized / HCC scan cell CUs predicates optimized
EHCC Total Columns for Decompression -> HCC scan rdbms columns theoretical max / HCC scan cell columns theoretical max
EHCC Total Pieces for Decompression -> HCC scan rdbms CUs row pieces total / HCC scan cell CUs row pieces total
EHCC Total Rows for Decompression -> HCC scan rdbms rows / HCC scan cell rows
EHCC Used on ZFS Tablespace -> HCC usage ZFS
EHCC Used on Pillar Tablespace -> HCC usage pillar

In follow-up entries I'll be looking at how the new-in-12.2 Columnar Cache works, what the new Columnar Cache stats mean, and how set membership in Storage Index works.
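To see why the new naming convention is friendlier to wildcard searches, a quick hedged sketch against the new stats once you are on a release that has them (stat names taken from the mapping above):

SQL> select name, value
       from v$sysstat
      where name like 'HCC scan cell%'
      order by name;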


Offloading row level security to Smart Scan

The question came up this week about whether the predicates for row level security are offloaded to Smart Scan. The simple answer is yes, as long as the policy_function contains off-loadable predicates. Let's see this in action. Using the public TPC-H schema we can set up the customer table so that customers can only see their own entries in that table. We need a row level security policy to check whether each row is one that the current user is permitted to see, and a function for the policy to call to make that determination:

grant dba to Customer#000000042 identified by hello;
grant select on customer to public;

create or replace function tpch.trls_c_name(
  p_schema in varchar2,
  p_object in varchar2)
  return varchar2
as
begin
  return 'sys_context(''userenv'',''session_user'') in (upper(c_name), ''TPCH'')';
end;
/

begin
  dbms_rls.add_policy(
    object_schema   => 'TPCH',
    object_name     => 'CUSTOMER',
    policy_name     => 'P1',
    policy_function => 'trls_c_name');
end;
/

This defines the row level policy on the customer table to call the trls_c_name function on a per-row basis. The function's predicate allows a row to be seen either by TPCH, who owns the data, or by a customer whose session_user matches the c_name column. Now we can connect as the customer and try it:

connect Customer#000000042/hello

select * from tpch.customer where c_acctbal > 0;

C_NAME                    C_NATIONKEY
------------------------- -----------
Customer#000000042                  5

So the customer is now restricted to seeing their own entries. Let's look at the plan:

explain plan for select * from tpch.customer where c_acctbal > 0;
select * from table(dbms_xplan.display(format=>'all'));

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$F5BB74E1
   2 - SEL$F5BB74E1 / CUSTOMER@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - storage("C_ACCTBAL">0 AND (SYS_CONTEXT('userenv','session_user')='TPCH' OR
       UPPER("C_NAME")=SYS_CONTEXT('userenv','session_user')))
       filter("C_ACCTBAL">0 AND (SYS_CONTEXT('userenv','session_user')='TPCH' OR
       UPPER("C_NAME")=SYS_CONTEXT('userenv','session_user')))

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "C_NATIONKEY"[NUMBER,22], "C_NAME"[VARCHAR2,25]
   2 - "C_NATIONKEY"[NUMBER,22], "C_NAME"[VARCHAR2,25]

This shows that sys_context is off-loadable in this usage; for the evaluation of sys_context to work on the cell, the metadata sent to Smart Scan encapsulates the relevant contents of userenv so that it is available to be checked there. If you think this is the cause of a Smart Scan problem you are encountering, you can use the hidden parameter "_cell_offload_sys_context" to disable it and see if that resolves the issue.
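A minimal sketch of that diagnostic step, assuming you are comfortable toggling the hidden parameter at the session level (as always with underscore parameters, preferably under guidance from support):

SQL> alter session set "_cell_offload_sys_context" = FALSE;

Re-running the explain plan should then show the sys_context portion of the policy predicate only in the filter line rather than in the storage line.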


Why you don't want to set _partition_large_extents = FALSE

I've seen some blogs recommending that _partition_large_extents be set to FALSE for a variety of space-conserving reasons without the authors thinking about the negative impact this is going to have on Smart Scan. Large extents cause an INITIAL allocation of 8 MB and a NEXT allocation of 1 MB, and they have been the default for tablespaces on Exadata since 11.2.0.2. You can verify that large extents are in use by a given table or partition with:

select segment_flags
  from sys_dba_segs
 where segment_name = <table_name> and owner = <schema_name>;

The segment flag bit for large extents is 0x40000000.

This pattern of allocation is designed to work optimally with Smart Scan because Smart Scan intrinsically works in 1 MB chunks. Reads of ASM allocation units are split into maximum 1 MB chunks to be passed to the filter processing library, which slices and dices their blocks to create the synthetic blocks that contain only the rows and columns of interest to return to the table scan driver. When less than 1 MB gets allocated at a time to a segment and the next contiguous blocks get allocated to a different segment, each separate run of blocks will be read by a different MBR. Each run will be passed separately to Smart Scan and we get sub-optimal chunks to work on, increasing both the overhead of processing and the number of round trips needed to process the table scan. The design of Smart Scan is predicated on scooping up contiguous runs of data from disk for efficient processing. This matters particularly for HCC data and for chained rows.

Hybrid Columnar Compression

We made two improvements to HCC row-based processing in 12.1.0.1. The first was the introduction of an Order By clause on index-access plans that guaranteed any compression unit would only be decompressed once. The second was 256 KB formatting of extents. Prior to 12c, a Fetch By Rowid would read the first block into the buffer cache in order to get the list of blocks in that Compression Unit and then read the remaining blocks into the buffer cache using the exact read-ahead count. This guaranteed a minimum of two physical reads to retrieve one row. By formatting 256 KB at a time to the segment, we know a priori that we can issue the first read for all the blocks from the RDBA to the next 256 KB alignment; thus the majority of Compress For Query reads can be done with a single physical read. However, this formatting can only happen when large extents are used. You can verify that 256 KB formatting is being used by:

select flag
  from GV$FILESPACE_USAGE
 where tablespace_id = <tsnumber> and rfno = <relative_fno>;

The flag bit for 256 KB formatting is 0x08. Note: 256 KB formatting requires a minimum 8 KB block size.

Chained Rows

When Smart Scan processes chained rows, all it can see is the 1 MB chunk it is given. It has no ability to initiate a disk I/O to retrieve chained row pieces that lie outside that 1 MB chunk (and even if it could, the chances that the required row piece is on the same cell are relatively low). We changed the direct load path in 11.2.0.3 to work with the natural 1 MB alignment of Smart Scan so that direct loaded rows would never cross a 1 MB boundary. However, as of 12.1.0.2 we have not yet changed Updates to keep all the row pieces inside a 1 MB chunk, nor have we taught Alter Table Shrink Space Compact to obey the 1 MB alignment rule.
Keeping 1 MB chunks contiguous for the same segment maximizes the chances that, after non-bulk inserts and updates, Smart Scan is able to see all the row pieces needed to satisfy the column references and does not have to send the block back to the RDBMS for buffer cache chained row processing. Using 11.2.0.1 style 64 KB allocations maximizes the chances Smart Scan will not be able to process those blocks. Just saying ...

Conclusion

Getting Exadata for its table scan performance but then setting "_partition_large_extents" = FALSE is like buying a BMW M5 and insisting on filling it with 87 octane. The correct answer to conserving space on small tables is to over-ride the default by explicitly setting INITIAL and NEXT in the Create Table statement, not by undoing this critical table scan optimization.

Roger MacNicol
Oracle Data Storage Technology
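As a footnote to the segment_flags check above, here is a hedged sketch of testing the 0x40000000 bit directly (the owner and segment names are placeholders for your own):

select segment_name,
       case when bitand(segment_flags, to_number('40000000','xxxxxxxx')) <> 0
            then 'YES' else 'NO' end as large_extents
  from sys_dba_segs
 where owner = 'TPCH' and segment_name = 'PART';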


Alter Table Shrink Space Cascade and SmartScan

Over the years, updates can cause rows to become highly fragmented, sapping performance on Exadata table scans. The offload server, and hence SmartScan, gets data to process 1 MB at a time. Because of this, SmartScan is only able to process row pieces that are available in the current 1 MB chunk it is working on. Unlike RDBMS table scans, SmartScan is not able to initiate disk I/Os to retrieve further row pieces and, even if it could, it is unlikely that they would be present on the same cell. When SmartScan finds it needs a row for a projected column that is not present in the blocks available to it, it will apply the predicates to the row pieces it does have, but if those predicates pass it has to return the row unprocessed for the RDBMS to fetch the missing row pieces from the buffer cache.

There are three main tools available for cleaning up a segment (Alter Table Shrink, Alter Table Move, and export/import), but one of them isn't as helpful as you might have thought. Consider the following sequence of events where we update the 256th column to cause widespread fragmentation:

SQL> update t set c256 = 'abcdefghijklmnopqrstuvwxyz';

2000000 rows updated.

SQL> commit;

Commit complete.

SQL> analyze table t compute statistics;

Table analyzed.

SQL> select chain_cnt from dba_tables where table_name = 'T' and owner = 'FUSION';

 CHAIN_CNT
----------
   2000000

SQL> select sum(blocks) from user_segments where segment_name = 'T';

SUM(BLOCKS)
-----------
     139264

SQL> alter table t enable row movement;

Table altered.

SQL> alter table t shrink space cascade;

Table altered.

SQL> analyze table t compute statistics;

Table analyzed.

SQL> select chain_cnt from dba_tables where table_name = 'T' and owner = 'FUSION';

 CHAIN_CNT
----------
   1970068

1 row selected.

Note: 'chain_cnt' does not count chained rows; rather, it counts rows whose row pieces are chained across more than one block. A row that is in three pieces, where all three pieces are in the same block, has a zero chain_cnt.

In this particular artificial scenario Shrink has not gained us much reduction in space used and, more importantly, it hasn't reduced the kind of fragmentation that affects SmartScan performance. This is because Shrink works in two phases. In Phase 1, the segment is scanned down from the end of the segment to the beginning. Rows with their head piece in the currently scanned block are moved together with all their row pieces, with the segment scanned from the beginning upwards looking for space for the entire row. When it is unable to move any more entire rows, Phase 2 starts scanning down again from the end of the segment trying to move individual row pieces to blocks with space. This meant that while Phase 1 could potentially reduce chaining for relocated rows, Phase 2 was very unlikely to reduce the chain count and could in fact increase the chain_cnt. The moral of this is that Shrink really is for freeing up blocks close to the High Water Mark and not for cleaning up fragmented rows.

Now let's try Alter Table Move with the same segment:

SQL> alter table t move;

Table altered.

SQL> analyze table t compute statistics;

Table analyzed.

SQL> select chain_cnt from dba_tables where table_name = 'T' and owner = 'FUSION';

 CHAIN_CNT
----------
     45976

1 row selected.

SQL> select sum(blocks) from user_segments where segment_name = 'T';

SUM(BLOCKS)
-----------
      92160

1 row selected.
OK, that did what we hoped: more space has been reclaimed and, more importantly for SmartScan, the number of fragmented rows has been reduced considerably. With the fix for bug 19433200, the mechanics of Shrink have been reworked and it is now better at reducing the chain_cnt. However, even with the improvements made, when faced with heavily fragmented rows, Alter Table Move or export/import are likely to provide significantly better table scan performance with SmartScan.

Roger MacNicol
Oracle Data Storage Technology
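One practical follow-up if you go the Alter Table Move route: the move changes rowids, so any indexes on the table are left unusable and need rebuilding before they can be used again. A minimal sketch (the index name here is hypothetical):

SQL> alter table t move;
SQL> alter index t_ix1 rebuild;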


Working around heartbeat issues caused by tracing or by regexp

I had noted in my first post that using the highest level of tracing caused timeout issues with the offload server heartbeat monitor. Heartbeat issues can also occur with expensive (and badly formed) regexp expressions. By default the heartbeat monitor is set to 6 seconds, which is the maximum time permitted to process 1 MB of data in the offload server and mark the task completed, and is far more time than it should reasonably take. Operations such as expensive tracing to disk or badly formed regexp expressions that cause that time period to be exceeded lead to this in the alert log:

State dump signal delivered to CELLOFLSRV<10180> by pid - 9860, uid - 3318
Thu Mar  5 12:26:31 2015 561 msec State dump completed for CELLOFLSRV<10180>
Clean shutdown signal delivered to CELLOFLSRV<10180> by pid - 9860, uid - 3318
CELLOFLSRV <10180> is exiting with code 1

where the restart server bounces the offload server to clear the perceived hang. Increasing the timeout via:

CellCLI> alter cell events = "immediate cellsrv.cellsrv_setparam('_cell_oflsrv_heartbeat_timeout_sec','60')"

enables the tracing to proceed without the restart server kicking in. My point in writing this entry was to provide a work-around when tracing is needed, but also to address a couple of blog posts I'd seen that recommend leaving it set at 60 or 90 seconds. This is not a good idea. The heartbeat exists to catch genuine but rare issues, and leaving this set to an increased value will hinder the offload server restarting quickly to resume work. This is one parameter that should be reset to the default when the work-around is no longer needed, unless otherwise directed by support.

Roger MacNicol
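Resetting it when you are done is the same call with the default value mentioned above (6 seconds):

CellCLI> alter cell events = "immediate cellsrv.cellsrv_setparam('_cell_oflsrv_heartbeat_timeout_sec','6')"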


Examining the new Columnar Cache with v$cell_state

12.1.0.2 introduced the new Columnar Flash Cache, where 1 MB of blocks that are all in HCC format are rewritten so as to make each column from each CU contiguous. This works by first writing all the block headers to an array, then writing all the CU headers to an array, then all the Column 1 column-level CUs, then all the Column 2 column-level CUs, and so on. The flash cache hash table maintains a simple mapping of column numbers to 64 KB flash pages so, for any given query, we can simply do asynchronous disk I/O of the minimum set of 64 KB pages required to cover the referenced columns.

Within the "flashcache" cell stats there is a new nested cell stat called "columnarcache" that enables you to track how it is working.

> set long 50000000
> set pagesize 10000
> select xmltype(STATISTICS_VALUE).getclobval(2,2) from v$cell_state;

XMLTYPE(STATISTICS_VALUE).GETCLOBVAL(2,2)
--------------------------------------------------------------------------------
....
   <stats type="flashcache">
    <stats type="columnarcache">
      <stat name="columnar_cache_size">0</stat>
      <stat name="columnar_cache_keep_size">0</stat>
      <stat name="columnar_cache_keep_read">0</stat>
      <stat name="columnar_cache_keep_read_bytes">0</stat>
      <stat name="columnar_attempted_read">0</stat>
      <stat name="columnar_attempted_read_bytes">0</stat>
      <stat name="columnar_cache_hits">0</stat>
      <stat name="columnar_cache_hits_read_bytes">0</stat>
      <stat name="columnar_cache_hits_saved_bytes">0</stat>
      <stat name="columnar_cache_pop">0</stat>
      <stat name="columnar_cache_pop_bytes">0</stat>
      <stat name="columnar_cache_pop_throttled">0</stat>
      <stat name="columnar_cache_pop_invalid_buf">0</stat>
      <stat name="columnar_cache_pop_size_align">0</stat>
      <stat name="columnar_cache_pop_findchdrfailure_mainchdr">0</stat>
      <stat name="columnar_cache_pop_findchdrfailure_childchdr">0</stat>
    </stats>
  </stats>

I typically spool the output of this and wade through it in an editor, but if we want to monitor how it is working under some workload we need to extract individual stats. For example, I bounced the cells and verified the cache was empty:

> select xmlcast(xmlquery('/cell_stats/stats/stats/stat[@name="columnar_cache_size"]'
         passing xmltype(STATISTICS_VALUE) returning content) as varchar2(200)) "COLUMNAR CACHE SIZE"
  from v$cell_state
 where xmlexists('/cell_stats/stats/stats[@type="columnarcache"]' passing xmltype(STATISTICS_VALUE));

COLUMNAR CACHE SIZE
--------------------------------------------------------------------------------
0

I am using the 1 GB TPC-H schema, which takes a little more than 400 MB on disk when compressed with Query Low:

SQL> select sum(bytes) from user_segments
     where SEGMENT_NAME in ('SUPPLIER','PARTSUPP','LINEITEM','ORDERS','PART','CUSTOMER');

SUM(BYTES)
----------
 420675584

and checking the columnar cache again shows about half of the data has been rewritten into columnar cache format instead of caching raw blocks:

SQL> select xmlcast(xmlquery('/cell_stats/stats/stats/stat[@name="columnar_cache_size"]'
         passing xmltype(STATISTICS_VALUE) returning content) as varchar2(200)) "COLUMNAR CACHE SIZE"
  from v$cell_state
 where xmlexists('/cell_stats/stats/stats[@type="columnarcache"]' passing xmltype(STATISTICS_VALUE));

COLUMNAR CACHE SIZE
--------------------------------------------------------------------------------
179306496

So let's look at how the cache helped:

SQL> select xmlcast(xmlquery('/cell_stats/stats/stats/stat[@name="columnar_cache_hits_read_bytes"]' passing
         xmltype(STATISTICS_VALUE) returning content) as varchar2(200)) "HIT READ BYTES"
  from v$cell_state
 where xmlexists('/cell_stats/stats/stats[@type="columnarcache"]' passing xmltype(STATISTICS_VALUE));

HIT READ BYTES
--------------------------------------------------------------------------------
1909456896

SQL> select xmlcast(xmlquery('/cell_stats/stats/stats/stat[@name="columnar_cache_hits_saved_bytes"]' passing
         xmltype(STATISTICS_VALUE) returning content) as varchar2(200)) "HIT SAVED BYTES"
  from v$cell_state
 where xmlexists('/cell_stats/stats/stats[@type="columnarcache"]' passing xmltype(STATISTICS_VALUE));

HIT SAVED BYTES
--------------------------------------------------------------------------------
1128267776

which shows we were able to eliminate about 35% of the disk I/O for this query run! We could, of course, have gotten that information more simply with the regular stat "cell physical IO bytes saved by columnar cache", but I wanted to show how to pull values from v$cell_state for use in scripts.

Many people only use Query High compression as they find the increased disk I/O from Query Low more than offsets the savings from cheaper decompression costs. However, with the columnar cache in place, those trade-offs have changed. It may be worth re-evaluating the decision as to when to use Query Low vs. Query High, particularly on CPU-bound cells.

Addendum: performance bug 20525311 affecting the columnar cache with selective predicates is fixed in the next rpm.

Roger MacNicol
Oracle Data Storage Technology
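For completeness, the simpler check mentioned above, pulling the regular statistic from the database side, might look like this:

SQL> select name, value
       from v$sysstat
      where name = 'cell physical IO bytes saved by columnar cache';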

