Big Data SQL, Data Warehousing | Sunday, August 7, 2016

Big Data SQL Quick Start. Storage Indexes - Part 10.

By: Alexey Filanovskiy | Product Manager

Today I'm going to explain a very powerful Big Data SQL feature: Storage Indexes. First of all, note that the name "Storage Index" can be misleading. It is in fact a dynamic structure that is built automatically over your data after you scan it. There is no specific command that the user has to run, and there is nothing to maintain or rebuild (unlike a B-tree index). You just run your workload and, after a while, you may notice better performance. Storage Indexes are not completely new to Big Data SQL: Oracle Exadata also has this feature, and Big Data SQL simply reuses it.

How it works. 

The main idea is that we can build some metadata over the unit of the scan (a block or a set of blocks). For example, we scan HDFS blocks with a given query that has a predicate in its WHERE clause (such as WHERE id=123). If a block doesn't return any rows, we build statistics for that column, such as the minimum and maximum values within the block.

Subsequent scans can use these statistics to skip scanning the block entirely.
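The pruning idea can be sketched in a few lines of Python. This is an illustrative toy, not Big Data SQL internals; note in particular that the real feature builds its metadata lazily, only for blocks that returned no rows:

```python
# Toy sketch of min/max block pruning, the idea behind Storage Indexes.
# Illustrative only: the real feature builds its metadata lazily, per HDFS block.

def build_index(blocks, column):
    """Record the (min, max) of `column` for every data block."""
    return [(min(r[column] for r in b), max(r[column] for r in b)) for b in blocks]

def scan(blocks, index, column, value):
    """Read only blocks whose [min, max] range could contain `value`."""
    hits = []
    for block, (lo, hi) in zip(blocks, index):
        if value < lo or value > hi:
            continue  # block skipped entirely: no IO spent on it
        hits.extend(r for r in block if r[column] == value)
    return hits

blocks = [[{"id": 1}, {"id": 5}], [{"id": 100}, {"id": 200}]]
idx = build_index(blocks, "id")
print(scan(blocks, idx, "id", 123))  # -> [] (only the second block is read)
```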

This is a very powerful feature for highly selective columns. The fine-grained unit for a Storage Index in the Hadoop case is the HDFS block. As you may know, HDFS blocks are quite large (256 MB by default on the Big Data Appliance), so being able to skip a full scan of a block brings significant performance benefits.

A query initially scans a granule, and if the scan returns no rows, a Storage Index is built over it (if at least one row is found in the block, a Storage Index is not created for that particular block).
In HDFS, data is usually stored in three copies. To maximize performance and get the benefit of Storage Indexes as quickly as possible, Big Data SQL (since version 3.1) uses a deterministic ordering of the hosts.

If you scan the table once and create Storage Indexes over the first replica, the second scan is performed over the same copy and can use the Storage Index right away. To conclude, I want to show you a couple of good and bad examples for Storage Indexes. I have a table with a fairly unique column:

SQL> SELECT num_distinct FROM user_tab_col_statistics
WHERE table_name = 'STORE_SALES_CSV'
AND COLUMN_NAME = 'SS_TICKET_NUMBER';

num_distinct 
------------ 
849805312

The table is quite big:

SQL> select count(1) from STORE_SALES_CSV;

  COUNT(1)
-------------------
6 385 178 703

This means that, on average, each value appears in the dataset 7-8 times, which is quite selective (this is a 900.1 GB dataset). To show Storage Indexes in action, I run a query whose predicate returns 2 rows.

SQL> select count(1) from STORE_SALES_CSV where SS_TICKET_NUMBER=187378862;

The first scan consumes a lot of IO and CPU and finishes in 10.6 minutes. The second and subsequent scans finish extremely fast, in 3 seconds (thanks to the Storage Index, I know for certain that no data in my table matches the predicate). To check Storage Index efficiency, I query the session statistics view:

SQL> SELECT n.name,
       CASE n.name
          WHEN 'cell XT granule predicate offload retries' THEN
           value
          WHEN 'cell XT granules requested for predicate offload' THEN
           value
          ELSE
           round(value / 1024 / 1024 / 1024, 2)
       END val,
       CASE n.name
          WHEN 'cell XT granule predicate offload retries' THEN
           'Granules'
          WHEN 'cell XT granules requested for predicate offload' THEN
           'Granules'
          ELSE
           'GBytes'
       END metric
  FROM v$mystat   s,
       v$statname n
 WHERE s.statistic# = n.statistic#
   AND n.name IN ('cell XT granule IO bytes saved by storage index',
                  'cell XT granule bytes requested for predicate offload')
 ORDER BY metric;

------------------------------------------------------------------------------------- 

cell XT granule IO bytes saved by storage index         899.86  GBytes
cell XT granule bytes requested for predicate offload   900.11  GBytes

Based on these statistics, we can conclude that only (cell XT granule bytes requested for predicate offload - cell XT granule IO bytes saved by storage index) = 256 MB were actually read, which is exactly one HDFS block.
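The arithmetic is easy to verify:

```python
# Verify the arithmetic from the session statistics above.
requested_gb = 900.11  # cell XT granule bytes requested for predicate offload
saved_gb = 899.86      # cell XT granule IO bytes saved by storage index

read_mb = (requested_gb - saved_gb) * 1024
print(round(read_mb))  # -> 256, i.e. a single 256 MB HDFS block
```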

First scan.

I don't recommend disabling Storage Indexes in a real production environment, but they do have one side effect: the first scan with Storage Indexes enabled takes longer than without them. In my previous example, the first scan took 10.6 minutes, but all subsequent ones finished in seconds.

If I disable Storage Indexes, the first, second, and all subsequent scans take the same time, about 5.1 minutes.

I can summarize all of the above in a table:

                           Elapsed time with Storage Indexes   Elapsed time without Storage Indexes
First scan of the table    10.6 minutes                        5.1 minutes
Second scan of the table   3 seconds                           5.1 minutes

 

Query with an unselective predicate.

Given that in the previous example the Storage Index caused a performance degradation for the first query, it's interesting to check the behavior of a query that uses an unselective predicate. The same table has a column SS_QUANTITY with only 100 distinct values:

SQL> SELECT num_distinct FROM user_tab_col_statistics
WHERE table_name = 'STORE_SALES_CSV'
AND COLUMN_NAME = 'SS_QUANTITY';

num_distinct
------------
100

whose values are between 0 and 100:

SQL> SELECT min(SS_QUANTITY), max(SS_QUANTITY) FROM STORE_SALES_CSV;

MIN(SS_QUANTITY) MAX(SS_QUANTITY)
---------------- ----------------
               0              100

With Storage Indexes enabled, I ran the following query 3 times:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=82;

and all 3 times it finished in 5.3 minutes. This query returns a lot of rows.

But when I queried with a nonexistent value (such as a negative one), which returns 0 rows:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=-1;

the behavior was like in the previous example: the first scan took 10.5 minutes, the second and subsequent ones 3 seconds. At the end of the test I disabled Storage Indexes and ran the query multiple times; each run took 5.3 minutes. Let me summarize the results in a table:

                                        Elapsed time with Storage Indexes   Elapsed time without Storage Indexes
First run, returns many rows            5.3 minutes                         5.3 minutes
Second and next runs, return many rows  5.3 minutes                         5.3 minutes
First run, returns 0 rows               10.5 minutes                        5.3 minutes
Second and next runs, return 0 rows     3 seconds                           5.3 minutes

Based on these experiments we can see:

1) A Storage Index is built only when a block returns no rows

2) If a block returns at least one row, no SI is built for it

3) This means the first scan suffers no performance degradation in this case, but the second scan has no index to accelerate it either

Order by

In the last example, we saw that a query with an unselective predicate is a bad candidate for Storage Indexes, unless you sort your source data.

What does this mean? I created a new dataset from the original one with a CTAS Hive statement:

hive> create table csv.store_sales_quantity_sort stored as textfile
as select * from csv.store_sales order by SS_QUANTITY;

After this, I ran my query twice again in Oracle RDBMS (the table STORE_SALES_CSV is mapped to the Hive table store_sales_quantity_sort), using the predicate that returns many rows:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY= 82;

As you can see, the second run is much faster (even though the query returns many rows), and now the Storage Indexes kick in. I can prove this with the SI statistics:

 cell XT granule IO bytes saved by storage index         601.72 GBytes
 cell XT granule bytes requested for predicate offload   876.47 GBytes

To analyze how your columns are sorted, and to reorder the data, you may use this tool.
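A small simulation (with made-up numbers, not data from this post) shows why sorting matters: in unsorted data, every block's [min, max] range covers the whole domain of a low-cardinality column, so nothing can be pruned, while after sorting only the few blocks that actually hold the value overlap it:

```python
# Simulated low-cardinality column (values 0..100), split into fixed-size blocks.
import random

random.seed(0)
values = [random.randint(0, 100) for _ in range(100_000)]
BLOCK = 1000

def blocks_read(vals, target):
    """Count blocks whose [min, max] range cannot exclude `target`."""
    n = 0
    for i in range(0, len(vals), BLOCK):
        b = vals[i:i + BLOCK]
        if min(b) <= target <= max(b):
            n += 1
    return n

unsorted_reads = blocks_read(values, 82)        # every block overlaps 82
sorted_reads = blocks_read(sorted(values), 82)  # only a couple of blocks do
print(unsorted_reads, sorted_reads)
```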

Bucketing.

Another trick that can dramatically improve performance is bucketing. The only caveat is that it works only if you know the exact number of distinct values, or the maximum number of distinct values. If you know in advance which query predicate you will use (like SS_QUANTITY in my examples above), you can prepare the data in an optimal way for it.

The statement below creates 100 files, and in the ideal case each file holds the rows for one value (otherwise you get a hash-like distribution). After creating an Oracle external table over the result and querying the data twice, we can check the benefit that Storage Indexes bring us:

hive> CREATE TABLE csv.store_sales_quantity_bucketed(
  ss_sold_date_sk bigint,
  ss_sold_time_sk bigint,
  ss_item_sk bigint,
  ss_customer_sk bigint,
  ss_cdemo_sk bigint,
  ss_hdemo_sk bigint,
  ss_addr_sk bigint,
  ss_store_sk bigint,
  ss_promo_sk bigint,
  ss_ticket_number bigint,
  ss_quantity int,
  ss_wholesale_cost double,
  ss_list_price double,
  ss_sales_price double,
  ss_ext_discount_amt double,
  ss_ext_sales_price double,
  ss_ext_wholesale_cost double,
  ss_ext_list_price double,
  ss_ext_tax double,
  ss_coupon_amt double,
  ss_net_paid double,
  ss_net_paid_inc_tax double,
  ss_net_profit double)
CLUSTERED BY (SS_QUANTITY) INTO 100 BUCKETS 
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
stored as textfile; 
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table csv.store_sales_quantity_bucketed select * from csv.store_sales; 

In Oracle RDBMS run:

SQL> select count(1) from STORE_SALES_CSV_QUANTITY_BUCK  where SS_QUANTITY= 82;
.... 
elapsed time: 822 sec
SQL> select count(1) from STORE_SALES_CSV_QUANTITY_BUCK  where SS_QUANTITY= 82; 
.... 
elapsed time: 8 sec 
SQL> SELECT * FROM xt_stat;
cell XT granule IO bytes saved by storage index           867.53  GBytes
cell XT granule bytes requested for predicate offload     876.47 GBytes 

The fact that we read only 8.94 GB of data in the second case explains why the elapsed time was reduced so significantly. Bucketing together with Storage Indexes can bring significant performance benefits if you use the bucketed column in your WHERE predicate.
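Conceptually, bucketing routes each row to one of the 100 files by a hash of the clustering column, so an equality predicate on that column only ever touches one file. A hypothetical sketch, with Python's built-in hash standing in for Hive's hash function:

```python
# Toy model of hash bucketing: rows are routed to one of 100 files by
# hash(ss_quantity) % 100, so an equality predicate touches ~1 bucket.
NUM_BUCKETS = 100

def bucket_for(quantity):
    # Hive uses its own hash function; Python's is just a stand-in here.
    return hash(quantity) % NUM_BUCKETS

rows = [{"ss_quantity": q % 101} for q in range(10_000)]
buckets = {}
for r in rows:
    buckets.setdefault(bucket_for(r["ss_quantity"]), []).append(r)

# A query with ss_quantity = 82 only needs the rows of a single bucket:
target = bucket_for(82)
print(len(buckets[target]), "of", len(rows), "rows scanned")
```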

Joins and Storage Indexes.

Storage Indexes can be very powerful and useful for joins, together with Bloom filters. To show this, I took the table from the previous example, STORE_SALES_CSV_QUANTITY_BUCK, and joined it with a small table containing only 2 rows.

SQL> CREATE TABLE test_couple_rows AS SELECT 3 q FROM dual UNION ALL SELECT 4 q FROM dual;

Now I'm going to join it with the big fact table, which is bucketed by the SS_QUANTITY column, using SS_QUANTITY as the join predicate, with Bloom filters.

SQL>  SELECT /*+ use_hash(tt ss)*/  COUNT(1)
  FROM test_couple_rows tt,
       STORE_SALES_CSV_QUANTITY_BUCK SS
 WHERE ss.Ss_Quantity = tt.q
 AND tt.q>0;

Let's check the execution plan to make sure that we are using Bloom filters.

The query finished in 12 seconds, and we saved a lot of IO thanks to the SI:

cell XT granule IO bytes saved by storage index           859.91 GBytes
cell XT granule bytes requested for predicate offload     876.47 GBytes 
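The mechanics can be sketched as follows (a simplified illustration of the Bloom filter idea, not Oracle's implementation): the small table's join keys are hashed into a bit array, and probe-side values that miss the filter are discarded before the join, which is what lets the storage layer skip so much IO:

```python
# Minimal Bloom filter: keys from the small dimension table go in; probe-side
# values that definitely cannot join are rejected early.
from hashlib import sha256

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several deterministic bit positions from the key.
        for i in range(self.num_hashes):
            digest = sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for q in (3, 4):  # the two join keys from test_couple_rows
    bf.add(q)

# The probe side drops non-matching values without ever joining them.
candidates = [q for q in range(101) if bf.might_contain(q)]
print(candidates)  # 3 and 4 always pass; a few false positives are possible
```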

Please feel free to ask any questions in the comments!
