Today I'm going to explain a very powerful Big Data SQL feature: Storage Indexes. First of all, I want to note that the name "Storage Index" can be misleading. In fact, it's a dynamic structure that is automatically built over your data after you scan it. There is no specific command that a user has to run, and nothing to maintain or rebuild (unlike a B-tree index). You just run your workload, and after a while you may notice better performance. Storage Indexes are not something completely new in Big Data SQL: Oracle Exadata also has this feature, and Big Data SQL simply reuses it.
How it works.
The main idea is that we can build metadata over the unit of the scan (a block or multiple blocks). For example, suppose we scan HDFS blocks with a given query that has a predicate in the WHERE clause (like WHERE id=123). If a block doesn't return any rows, we build statistics for that column, such as the minimum and maximum values within the block.
Subsequent scans can then use these statistics to skip blocks whose min/max range cannot contain the predicate value.
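Conceptually, the per-block metadata works something like the following sketch (this is an illustration only; the Storage Index is internal to the cells and there is no user-visible table like this):

```sql
-- Hypothetical per-block metadata kept by a Storage Index
-- (illustration only, not a real dictionary view):
--
--   block_id | column | min_value | max_value
--   ---------+--------+-----------+----------
--   blk_001  | ID     |      1000 |      1999
--   blk_002  | ID     |      2000 |      2999
--
-- For a query such as:
SELECT * FROM my_table WHERE id = 123;
-- a block can be skipped entirely whenever
--   NOT (123 BETWEEN min_value AND max_value)
-- holds for that block's ID statistics.
```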
This is a very powerful feature for highly selective columns. The fine-grained unit for a Storage Index in the case of Hadoop is the HDFS block. As you may know, an HDFS block is pretty large (the default on the Big Data Appliance is 256MB), so being able to skip a full scan of a block brings significant performance benefits.
If you scan the table once and create Storage Indexes over the first replica, the next scan will be performed over the same copy and will already use the Storage Index. Now I want to show you a couple of good and bad examples for Storage Indexes. I have a table with one fairly unique column:
the table is quite big:
which means that on average each value appears in the dataset 7-8 times, which is quite selective (this is a 900.1 GB dataset). To show Storage Indexes in action, I run a query with a predicate that returns only 2 rows.
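The query text itself isn't reproduced here; a hedged reconstruction of such a selective lookup would look like this (the column name SS_TICKET_NUMBER is invented for illustration; the actual column used in the test isn't named):

```sql
-- Hypothetical reconstruction: a highly selective predicate on the
-- near-unique column of the 900 GB table
SELECT COUNT(*)
  FROM store_sales_csv
 WHERE ss_ticket_number = 123456789;  -- assumed column and value
```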
The first scan consumes a lot of IO and CPU and finishes in 10.6 minutes. The second and subsequent scans finish extremely fast, in 3 seconds (thanks to the Storage Index, we already know which blocks cannot contain rows matching the predicate). To check the Storage Index efficiency, I query the session statistics view:
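A query of this shape (a sketch using the standard V$MYSTAT/V$STATNAME join, not necessarily the exact query from the test) returns the relevant counters:

```sql
-- Session-level Storage Index statistics for Big Data SQL offloaded scans
SELECT n.name,
       ROUND(s.value / 1024 / 1024 / 1024, 2) AS gbytes
  FROM v$statname n
  JOIN v$mystat   s ON s.statistic# = n.statistic#
 WHERE n.name IN ('cell XT granule bytes requested for predicate offload',
                  'cell XT granule IO bytes saved by storage index');
```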
Based on these statistics, we can conclude that only (cell XT granule bytes requested for predicate offload - cell XT granule IO bytes saved by storage index) = 256 MB were actually read, which is exactly one HDFS block.
I don't recommend disabling Storage Indexes in a real production environment, but they do have one side effect: the first scan with Storage Indexes enabled takes longer than without them. In my previous example the first scan took 10.6 minutes, but all subsequent scans finished in seconds:
If I disable Storage Indexes, the first, second, and all subsequent scans take the same time, about 5.1 minutes:
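For completeness, on Exadata the usual way to switch Storage Indexes off for testing is the hidden parameter _kcfis_storageidx_disabled; assuming Big Data SQL honors the same switch, a sketch would be (hidden underscore parameters should only be used in test environments, ideally with guidance from Oracle Support):

```sql
-- Test environments only: disable Storage Index usage for this session
ALTER SESSION SET "_kcfis_storageidx_disabled" = TRUE;
-- ... run the workload and take timings ...
ALTER SESSION SET "_kcfis_storageidx_disabled" = FALSE;
```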
I can summarize all of the above in a table:
| | Elapsed time with Storage Indexes | Elapsed time without Storage Indexes |
|---|---|---|
| First scan of the table | 10.3 minutes | 5.1 minutes |
| Second scan of the table | 3 seconds | 5.1 minutes |
Querying by an unselective predicate.
Given that in the previous example the Storage Index caused a performance degradation for the first query, it's interesting to check the behavior of a query that uses an unselective predicate. The same table has a column SS_QUANTITY which has only 100 unique values:
all between 0 and 100:
With Storage Indexes enabled, I ran a query like the following 3 times:
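A hedged reconstruction of such a query (the exact text isn't shown; the literal value is assumed):

```sql
-- Unselective predicate: SS_QUANTITY has only ~100 distinct values,
-- so almost every HDFS block contains at least one matching row
SELECT COUNT(*)
  FROM store_sales_csv
 WHERE ss_quantity = 50;   -- 50 is an assumed example value
```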
and all 3 times it finished in 5.3 minutes. This query returns a lot of rows.
But when I queried with a nonexistent predicate value (like a negative number), which returns 0 rows:
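For example, something like this (again a hedged reconstruction; the value is assumed):

```sql
-- A value outside the actual 0-100 range: no block returns any rows,
-- so a Storage Index can be built for every block
SELECT COUNT(*)
  FROM store_sales_csv
 WHERE ss_quantity = -1;   -- -1 is an assumed example value
```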
the behaviour was like in the previous example: the first scan took 10.5 minutes, the second and subsequent ones 3 seconds. At the end of the test I disabled Storage Indexes and ran the query multiple times, getting 5.3 minutes each time. Let me summarize the results in a table:
| | Elapsed time with Storage Index | Elapsed time without Storage Index |
|---|---|---|
| First run, returns many rows | 5.3 minutes | 5.3 minutes |
| Second and next runs, return many rows | 5.3 minutes | 5.3 minutes |
| First run, returns 0 rows | 10.5 minutes | 5.3 minutes |
| Second and next runs, return 0 rows | 3 seconds | 5.3 minutes |
Based on these experiments we can see:
1) A Storage Index is built only when a block doesn't return any rows
2) If a block returns at least one row, the SI will not be built for it
3) This means the first scan will not suffer a performance degradation, but the second scan will not have an index to accelerate it either
In the last example we saw that a query with an unselective predicate is a bad candidate for Storage Indexes, unless you sort your source data.
What does that mean? I create a new dataset from the original one with a Hive CTAS statement:
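The statement itself isn't reproduced here; a Hive CTAS of this shape (a sketch, assuming the source table is named store_sales_csv in Hive) sorts the data by the predicate column so that each HDFS block covers a narrow min/max range:

```sql
-- Hive: rewrite the dataset ordered by SS_QUANTITY
CREATE TABLE store_sales_quantity_sort
STORED AS TEXTFILE
AS
SELECT *
  FROM store_sales_csv
  SORT BY ss_quantity;  -- SORT BY orders rows within each reducer's output
```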
After this I repeat my query twice in Oracle RDBMS (the table STORE_SALES_CSV is mapped to the Hive table store_sales_quantity_sort), using a predicate that returns many rows:
As you can see, the second run is way faster (even for a query that returns many rows), and now Storage Indexes kick in. I can prove this with the SI statistics:
For analyzing how well columns are sorted, and for reordering the data, you may use this tool.
One more trick that can dramatically improve performance is bucketing. The only caveat is that it works only if you know the exact number of distinct values (or the maximum number of distinct values) in the column. If you know ahead of time that you will use a certain column in query predicates (like SS_QUANTITY in my examples above), you can prepare the data in an optimal way for this.
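A Hive bucketing setup of this shape (a sketch; table names follow those mentioned in the text, the column list is abbreviated) produces one file per bucket:

```sql
-- Hive: one bucket per distinct SS_QUANTITY value (100 known values)
CREATE TABLE store_sales_csv_quantity_buck (
  ss_quantity INT
  -- ... remaining STORE_SALES columns elided ...
)
CLUSTERED BY (ss_quantity) INTO 100 BUCKETS
STORED AS TEXTFILE;

SET hive.enforce.bucketing = true;  -- older Hive versions need this
INSERT OVERWRITE TABLE store_sales_csv_quantity_buck
SELECT * FROM store_sales_csv;
```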
this statement will create 100 files, each containing the rows for one value (that's the ideal case; otherwise you will get a hash-like distribution).
Now, after creating an Oracle external table and querying the data twice, we can check the benefits that Storage Indexes bring us:
In Oracle RDBMS run:
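A pair of statements of this shape would reproduce the measurement (a hedged sketch; the predicate value is assumed):

```sql
-- Query the bucketed external table (run twice), then check SI savings
SELECT COUNT(*)
  FROM store_sales_csv_quantity_buck
 WHERE ss_quantity = 50;   -- assumed example value

SELECT n.name,
       ROUND(s.value / 1024 / 1024 / 1024, 2) AS gbytes
  FROM v$statname n
  JOIN v$mystat   s ON s.statistic# = n.statistic#
 WHERE n.name LIKE 'cell XT granule%';
```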
The fact that we read only 8.94GB of data in the second case explains why the elapsed time was significantly reduced. Bucketing together with Storage Indexes can bring significant performance benefits if you use the bucketed column in a WHERE predicate.
Joins and Storage Indexes.
Storage Indexes can be very powerful and useful for joins, together with Bloom filters. To show this I took the table from the previous example, STORE_SALES_CSV_QUANTITY_BUCK, and a small table that contains only 2 rows.
Now I'm going to join the small table with the big fact table, which is bucketed by the SS_QUANTITY column, using SS_QUANTITY as the join predicate, with Bloom filters.
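The join itself can be sketched like this (the small table's name and contents are assumed, since only its 2-row size is given in the text):

```sql
-- Hypothetical 2-row dimension table restricting SS_QUANTITY
CREATE TABLE quantity_filter (qty NUMBER);
INSERT INTO quantity_filter VALUES (1);
INSERT INTO quantity_filter VALUES (2);
COMMIT;

-- Hash join on the bucketed column; the optimizer can build a Bloom
-- filter from the small table and push it down to the cell-side scan
SELECT COUNT(*)
  FROM quantity_filter f,
       store_sales_csv_quantity_buck s
 WHERE s.ss_quantity = f.qty;
```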
Let's check the plan to make sure that we are using Bloom filters:
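The plan can be inspected with the standard DBMS_XPLAN package; Bloom filter usage shows up as JOIN FILTER CREATE / JOIN FILTER USE steps and :BF0000-style predicate names:

```sql
-- Display the execution plan of the last statement run in this session
SELECT *
  FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'TYPICAL'));
```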
The query finished in 12 seconds and we saved a lot of IO thanks to the SI:
Please feel free to ask any questions in the comments!