

Big Data SQL Quick Start. Partition Pruning - Part 7.

Alexey Filanovskiy
Product Manager

Partitioning is a very common technique in data warehousing and in databases of all kinds. I assume the reader knows what partitioning is, so I will not cover the theory; if you need a refresher, you could look at how partitioning works in Oracle RDBMS.

Instead, I will start directly with the practical part.

Hive partitioning.

Hive was originally created as an easy way to write MapReduce jobs over HDFS. HDFS is a file system with a Linux-like directory structure, so it is natural that a partition in Hive is simply a subdirectory. I used the Intel BigBench dataset to create a partitioned Hive table. I took two tables - the big fact table store_sales and the small dimension table date_dim. They have the following relationship:

The fact table (store_sales) has no explicit time attributes; instead it is joined to the dimension (dictionary) table date_dim through ss_sold_date_sk, and date_dim carries the explicit date columns (d_dom - day of month, d_moy - month of year, d_year - year). Now I'm going to create a partitioned store_sales table:

SQL> CREATE TABLE store_sales_part(
  ss_sold_date_sk bigint,
...
  ss_net_profit double)
  partitioned by (
  year INT,
  month INT,
  day INT)
stored as ORC;

The statement above creates a partitioned table with three virtual columns - year, month, and day. Now I will insert data into this Hive table (I added a few parameters that are mandatory for dynamic partitioning):

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.max.dynamic.partitions=10000;
hive> INSERT INTO TABLE store_sales_part PARTITION (year, month, day)
      SELECT store_sales.*, dt.d_year, dt.d_moy, dt.d_dom
      FROM store_sales, date_dim dt
      WHERE dt.d_date_sk = store_sales.ss_sold_date_sk;
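
As a quick sanity check, the partitions created by the dynamic insert can also be listed from Hive itself; each entry corresponds to one year=/month=/day= subdirectory on HDFS (output truncated here - the entries shown are the same ones that appear in the HDFS listing below):

hive> SHOW PARTITIONS store_sales_part;
...
year=2005/month=9/day=8
year=2005/month=9/day=9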

After this insert, I want to check the file distribution on HDFS:

$ hadoop fs -du -h /user/hive/warehouse/orc.db/store_sales_part/*/*/|tail -2
168.5 M  505.5 M  /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=9/day=8
168.7 M  506.0 M  /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=9/day=9

So, the new table store_sales_part has three virtual columns that are not actually stored in the data files (and do not occupy space there), but can be used to avoid unnecessary I/O. Those columns can also be queried from the Hive console:

hive> select ss_sold_date_sk, year, month, day from store_sales_part limit 2;
OK
36890   2001    1       1
36890   2001    1       1
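
Because year, month, and day are partition columns, a filter on them in a Hive query already restricts the scan to the matching subdirectories. A minimal sketch, reusing one of the partitions seen in the HDFS listing above (the exact count is not important here):

hive> SELECT COUNT(*) FROM store_sales_part WHERE year=2005 AND month=9 AND day=9;

Only the files under .../year=2005/month=9/day=9 are read for this query. The rest of this post shows how Big Data SQL applies the same pruning from the Oracle side.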

Great! Now let's turn to Oracle RDBMS and create an external table there that is linked with this Hive table:

SQL> CREATE TABLE STORE_SALES_ORC_PART
  (SS_SOLD_DATE_SK NUMBER(10,0),
   ....
   SS_NET_PROFIT BINARY_DOUBLE,
   YEAR NUMBER,
   MONTH NUMBER,
   DAY NUMBER)
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_HIVE
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS
   (com.oracle.bigdata.cluster=bds30
    com.oracle.bigdata.tablename=orc.store_sales_part))
  REJECT LIMIT UNLIMITED
  PARALLEL;
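
A quick smoke test confirms that the Hive partition columns come through the external table as ordinary Oracle columns, mirroring the Hive query above (output omitted):

SQL> SELECT SS_SOLD_DATE_SK, YEAR, MONTH, DAY
     FROM STORE_SALES_ORC_PART
     WHERE ROWNUM <= 2;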

Now in Oracle we have an external table whose partition columns can be used to prune unneeded partitions. Let's verify this! First, query the table without a partition predicate:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART

and after the query has finished, check the statistics:

SQL> SELECT   n.name,  round(s.value / 1024 / 1024 / 1024)  
FROM v$mystat   s, v$statname n
WHERE s.statistic# = 462  
AND s.statistic# = n.statistic#;
--------------------------------------------
cell XT granule bytes requested for predicate offload         298  
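
One note on this query: the statistic number (462 here) can differ between database versions, so a more portable variant filters v$statname by the statistic name shown in the output instead:

SQL> SELECT n.name, ROUND(s.value / 1024 / 1024 / 1024) gb
     FROM v$mystat s, v$statname n
     WHERE s.statistic# = n.statistic#
     AND n.name = 'cell XT granule bytes requested for predicate offload';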

Then check the size of the whole table directory on HDFS:

$ hadoop fs -du -h -s /user/hive/warehouse/orc.db/store_sales_part/
297.8 G  893.5 G  /user/hive/warehouse/orc.db/store_sales_part

It matches (298 GB requested vs. 297.8 GB on disk), so everything works correctly. Now try to query one particular year:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART WHERE YEAR=2004

Check the database statistics:

SQL> SELECT   n.name,  round(s.value / 1024 / 1024 / 1024)  
FROM v$mystat   s, v$statname n
WHERE s.statistic# = 462  
AND s.statistic# = n.statistic#;
--------------------------------------------
cell XT granule bytes requested for predicate offload         60

Compare it with the size of the corresponding HDFS directory:

$ hadoop fs -du -h -s /user/hive/warehouse/orc.db/store_sales_part/year=2004
60.3 G  893.5 G  /user/hive/warehouse/orc.db/store_sales_part/year=2004

But the most interesting thing is that you can also filter on a subpartition. For example, if I want statistics for all Decembers (across all years), I will read only the December partitions, like this:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART WHERE MONTH=12

Check the database statistics:

SQL> SELECT   n.name,  round(s.value / 1024 / 1024 / 1024)  
FROM v$mystat   s, v$statname n
WHERE s.statistic# = 462  
AND s.statistic# = n.statistic#;
--------------------------------------------
cell XT granule bytes requested for predicate offload         23

And the great thing is that on HDFS we have essentially the same amount of data:

$ hadoop fs -du -h -s /user/hive/warehouse/orc.db/store_sales_part/*/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2001/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2002/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2003/month=12
5.1 G  15.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2004/month=12
2.1 G  6.3 G  /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=12
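
Summing these partitions gives 4 x 5.1 GB + 2.1 GB ≈ 22.5 GB, which is consistent with the 23 GB the database reported as requested for predicate offload.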

Simple, but a very powerful feature!
