
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Big Data SQL Quick Start. Data types - Part 8.

Alexey Filanovskiy
Product Manager

Today I'm going to share one of the easiest ways to improve overall Big Data SQL performance. Big Data SQL is a complex system with two main pieces - the Database and Hadoop - and each side has its own datatypes: Oracle datatypes on the RDBMS side, Java datatypes on the Hadoop side. Every time you query Hadoop data from the Oracle RDBMS, a datatype conversion takes place, and data conversion is a very expensive CPU operation.

AVRO, RCFile, ORC and Parquet files.

Let's zoom in on the Hadoop server piece, where we find a few components - the Hadoop part (DataNode) and the Oracle part (Smart Scan). In between sits the "External Table Service" (part of the Big Data SQL software); this is where the datatype transformation happens (read as: this is where we spend a lot of CPU).

But there is good news: if you already did the ETL once, transforming the source data into Parquet or ORC files, you can reap the benefits of that transformation. By properly mapping Hive datatypes to Oracle datatypes you simply pass the data through without any transformation. You can find the compliance matrix here. But it's always better to see once than to hear a hundred times, so let me give you an example. I have a table in Hive (an ORC file):

hive> show create table store_sales;
OK
CREATE TABLE store_sales(
ss_sold_date_sk bigint,
ss_sold_time_sk bigint,
ss_item_sk bigint,
ss_customer_sk bigint,
ss_cdemo_sk bigint,
ss_hdemo_sk bigint,
ss_addr_sk bigint,
ss_store_sk bigint,
ss_promo_sk bigint,
ss_ticket_number bigint,
ss_quantity int,
ss_wholesale_cost double,
ss_list_price double,
ss_sales_price double,
ss_ext_discount_amt double,
ss_ext_sales_price double,
ss_ext_wholesale_cost double,
ss_ext_list_price double,
ss_ext_tax double,
ss_coupon_amt double,
ss_net_paid double,
ss_net_paid_inc_tax double,
ss_net_profit double)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://bds30-ns/user/hive/warehouse/orc.db/store_sales'
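
As a side note: if you don't want to hand-craft the matching Oracle DDL, the DBMS_HADOOP package can generate it for you from the Hive metastore with the recommended datatype mappings. A minimal sketch (the generated table name here is just illustrative):

SQL> SET SERVEROUTPUT ON
DECLARE
  l_ddl CLOB;
BEGIN
  -- generate, but do not execute, the external table DDL for hive table orc.store_sales
  DBMS_HADOOP.CREATE_EXTDDL_FOR_HIVE(
    CLUSTER_ID      => 'bds30',
    DB_NAME         => 'orc',
    HIVE_TABLE_NAME => 'store_sales',
    HIVE_PARTITION  => FALSE,
    TABLE_NAME      => 'STORE_SALES_ORC_GEN',
    PERFORM_DDL     => FALSE,
    TEXT_OF_DDL     => l_ddl);
  DBMS_OUTPUT.PUT_LINE(l_ddl);
END;
/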

And in the Oracle RDBMS I create two external tables. The first one maps Hive double to Oracle NUMBER(7,2):

SQL> CREATE TABLE STORE_SALES_ORC_NUM    (
SS_SOLD_DATE_SK NUMBER(10,0),
SS_SOLD_TIME_SK NUMBER(10,0),
SS_ITEM_SK NUMBER(10,0),
SS_CUSTOMER_SK NUMBER(10,0),
SS_CDEMO_SK NUMBER(10,0),
SS_HDEMO_SK NUMBER(10,0),
SS_ADDR_SK NUMBER(10,0),
SS_STORE_SK NUMBER(10,0),
SS_PROMO_SK NUMBER(10,0),
SS_TICKET_NUMBER NUMBER(10,0),
SS_QUANTITY NUMBER(10,0),
SS_WHOLESALE_COST NUMBER(7,2),
SS_LIST_PRICE NUMBER(7,2),
SS_SALES_PRICE NUMBER(7,2),
SS_EXT_DISCOUNT_AMT NUMBER(7,2),
SS_EXT_SALES_PRICE NUMBER(7,2),
SS_EXT_WHOLESALE_COST NUMBER(7,2),
SS_EXT_LIST_PRICE NUMBER(7,2),
SS_EXT_TAX NUMBER(7,2),
SS_COUPON_AMT NUMBER(7,2),
SS_NET_PAID NUMBER(7,2),
SS_NET_PAID_INC_TAX NUMBER(7,2),
SS_NET_PROFIT NUMBER(7,2)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.cluster=bds30
com.oracle.bigdata.tablename=orc.store_sales)
)
REJECT LIMIT UNLIMITED
PARALLEL ;

For the second table I define the datatypes according to the mapping matrix: Oracle BINARY_DOUBLE instead of NUMBER(7,2) over Hive DOUBLE:

SQL> CREATE TABLE STORE_SALES_ORC    (
SS_SOLD_DATE_SK NUMBER(10,0),
SS_SOLD_TIME_SK NUMBER(10,0),
SS_ITEM_SK NUMBER(10,0),
SS_CUSTOMER_SK NUMBER(10,0),
SS_CDEMO_SK NUMBER(10,0),
SS_HDEMO_SK NUMBER(10,0),
SS_ADDR_SK NUMBER(10,0),
SS_STORE_SK NUMBER(10,0),
SS_PROMO_SK NUMBER(10,0),
SS_TICKET_NUMBER NUMBER(10,0),
SS_QUANTITY NUMBER(10,0),
SS_WHOLESALE_COST BINARY_DOUBLE,
SS_LIST_PRICE BINARY_DOUBLE,
SS_SALES_PRICE BINARY_DOUBLE,
SS_EXT_DISCOUNT_AMT BINARY_DOUBLE,
SS_EXT_SALES_PRICE BINARY_DOUBLE,
SS_EXT_WHOLESALE_COST BINARY_DOUBLE,
SS_EXT_LIST_PRICE BINARY_DOUBLE,
SS_EXT_TAX BINARY_DOUBLE,
SS_COUPON_AMT BINARY_DOUBLE,
SS_NET_PAID BINARY_DOUBLE,
SS_NET_PAID_INC_TAX BINARY_DOUBLE,
SS_NET_PROFIT BINARY_DOUBLE
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.cluster=bds30
com.oracle.bigdata.tablename=orc.store_sales)
)
REJECT LIMIT UNLIMITED
PARALLEL ;
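
A quick sanity check in the data dictionary shows that the two definitions differ only in the measure columns (a sketch; it assumes both tables live in the current schema):

SQL> SELECT table_name, data_type
FROM   user_tab_columns
WHERE  table_name IN ('STORE_SALES_ORC', 'STORE_SALES_ORC_NUM')
AND    column_name = 'SS_NET_PAID_INC_TAX'
ORDER  BY table_name;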

For the performance test I run 10 concurrent queries, each one filtering the data on a column that in the first case has the double -> number mapping and in the second case the double -> binary_double mapping. Like this:

SQL> SELECT  COUNT(1) FROM STORE_SALES_ORC_NUM WHERE ss_net_paid_inc_tax=:bind

and this:

SQL> SELECT  COUNT(1) FROM STORE_SALES_ORC WHERE ss_net_paid_inc_tax=:bind
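
(For reference, in SQL*Plus the bind variable can be set up like this before running the queries; the value itself is just illustrative:)

SQL> VARIABLE bind NUMBER
SQL> EXEC :bind := 1234.56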

All 10 queries finished at almost the same time.

Mapping type                             Proper?  Elapsed time
double (hive) -> number (oracle)         No       16.1 min
double (hive) -> binary_double (oracle)  Yes      10.8 min

So, fine, I get better performance with the proper datatype mapping - but what's going on under the hood? Let's check the graphs.

CPU consumption is very high in both cases:

But since we are CPU bound, IO throughput becomes the more interesting metric:

In the first case we perform a complex transformation: we spend more CPU time on the cell side, and because of that we cannot read any faster (we are stuck on CPU). The second query doesn't transform the data; it just passes it to Oracle Smart Scan as is.

TextFiles and Sequence Files.

While for AVRO, RCFile, ORC and Parquet it matters what you actually store on HDFS, TextFile and SequenceFile work in a completely different way. The Hadoop InputFormat for a CSV file reads a byte stream, finds the text rows (normally terminated by a newline) and then parses off the columns. Here is how it works, step by step:

1. The Java part of "External Table Services" reads the HDFS blocks and passes a byte buffer up to C

2. The C part of "External Table Services" parses the buffer for newlines to find a row

3. The C part of "External Table Services" parses the row for "|" to find each column value - always a string, like "-11.52"

4. The C part of "External Table Services" then converts the string it found, "-11.52", to an Oracle Number

The difference here is that the conversion from string to Oracle NUMBER (step 4) is much more efficient than the corresponding conversion from string to an IEEE 754 binary floating point value (Oracle BINARY_DOUBLE).
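
You can get a rough feel for the relative cost of the two casts right inside the database. Below is a minimal PL/SQL sketch; it exercises the SQL engine rather than the Big Data SQL C code, so treat the numbers as illustrative only:

SQL> SET SERVEROUTPUT ON
DECLARE
  s VARCHAR2(20);
  n NUMBER;
  d BINARY_DOUBLE;
  t PLS_INTEGER;
BEGIN
  t := DBMS_UTILITY.GET_TIME;
  FOR i IN 1 .. 5000000 LOOP
    s := '-11.' || TO_CHAR(MOD(i, 100));  -- vary the input so nothing gets constant-folded
    n := TO_NUMBER(s);                    -- string -> Oracle NUMBER
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('string -> NUMBER:        ' || (DBMS_UTILITY.GET_TIME - t) || ' cs');
  t := DBMS_UTILITY.GET_TIME;
  FOR i IN 1 .. 5000000 LOOP
    s := '-11.' || TO_CHAR(MOD(i, 100));
    d := TO_BINARY_DOUBLE(s);             -- string -> IEEE 754 double
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('string -> BINARY_DOUBLE: ' || (DBMS_UTILITY.GET_TIME - t) || ' cs');
END;
/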

And of course I ran a test (over CSV files) to prove this. As in the example above, I created two Oracle RDBMS tables: one using NUMBER, the second using BINARY_DOUBLE:

SQL> CREATE TABLE STORE_SALES_CSV_NUM    (
SS_SOLD_DATE_SK NUMBER(10,0),
SS_SOLD_TIME_SK NUMBER(10,0),
SS_ITEM_SK NUMBER(10,0),
SS_CUSTOMER_SK NUMBER(10,0),
SS_CDEMO_SK NUMBER(10,0),
SS_HDEMO_SK NUMBER(10,0),
SS_ADDR_SK NUMBER(10,0),
SS_STORE_SK NUMBER(10,0),
SS_PROMO_SK NUMBER(10,0),
SS_TICKET_NUMBER NUMBER(10,0),
SS_QUANTITY NUMBER(10,0),
SS_WHOLESALE_COST NUMBER(7,2),
SS_LIST_PRICE NUMBER(7,2),
SS_SALES_PRICE NUMBER(7,2),
SS_EXT_DISCOUNT_AMT NUMBER(7,2),
SS_EXT_SALES_PRICE NUMBER(7,2),
SS_EXT_WHOLESALE_COST NUMBER(7,2),
SS_EXT_LIST_PRICE NUMBER(7,2),
SS_EXT_TAX NUMBER(7,2),
SS_COUPON_AMT NUMBER(7,2),
SS_NET_PAID NUMBER(7,2),
SS_NET_PAID_INC_TAX NUMBER(7,2),
SS_NET_PROFIT NUMBER(7,2) )
ORGANIZATION EXTERNAL (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS (
com.oracle.bigdata.cluster=bds30
com.oracle.bigdata.tablename=csv.store_sales) )
REJECT LIMIT UNLIMITED
PARALLEL ;

For the second table I defined Oracle BINARY_DOUBLE instead of NUMBER(7,2):

SQL> CREATE TABLE STORE_SALES_CSV (
SS_SOLD_DATE_SK NUMBER(10,0),
SS_SOLD_TIME_SK NUMBER(10,0),
SS_ITEM_SK NUMBER(10,0),
SS_CUSTOMER_SK NUMBER(10,0),
SS_CDEMO_SK NUMBER(10,0),
SS_HDEMO_SK NUMBER(10,0),
SS_ADDR_SK NUMBER(10,0),
SS_STORE_SK NUMBER(10,0),
SS_PROMO_SK NUMBER(10,0),
SS_TICKET_NUMBER NUMBER(10,0),
SS_QUANTITY NUMBER(10,0),
SS_WHOLESALE_COST BINARY_DOUBLE,
SS_LIST_PRICE BINARY_DOUBLE,
SS_SALES_PRICE BINARY_DOUBLE,
SS_EXT_DISCOUNT_AMT BINARY_DOUBLE,
SS_EXT_SALES_PRICE BINARY_DOUBLE,
SS_EXT_WHOLESALE_COST BINARY_DOUBLE,
SS_EXT_LIST_PRICE BINARY_DOUBLE,
SS_EXT_TAX BINARY_DOUBLE,
SS_COUPON_AMT BINARY_DOUBLE,
SS_NET_PAID BINARY_DOUBLE,
SS_NET_PAID_INC_TAX BINARY_DOUBLE,
SS_NET_PROFIT BINARY_DOUBLE)
ORGANIZATION EXTERNAL (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS (
com.oracle.bigdata.cluster=bds30
com.oracle.bigdata.tablename=csv.store_sales) )
REJECT LIMIT UNLIMITED
PARALLEL;

For the performance test I used a full table scan query (actually, it's the query that statistics gathering runs):

SQL> SELECT
to_char(COUNT(SS_SOLD_DATE_SK)),
substrb(dump(MIN(SS_SOLD_DATE_SK),16,0,64),1,240),
substrb(dump(MAX(SS_SOLD_DATE_SK),16,0,64),1,240),
...
to_char(COUNT(SS_NET_PROFIT)),
substrb(dump(MIN(SS_NET_PROFIT),16,0,64),1,240),
substrb(dump(MAX(SS_NET_PROFIT),16,0,64),1,240)
FROM STORE_SALES_CSV

and the same query for the second table:

SQL> SELECT
to_char(COUNT(SS_SOLD_DATE_SK)),
substrb(dump(MIN(SS_SOLD_DATE_SK),16,0,64),1,240),
substrb(dump(MAX(SS_SOLD_DATE_SK),16,0,64),1,240),
...
to_char(COUNT(SS_NET_PROFIT)),
substrb(dump(MIN(SS_NET_PROFIT),16,0,64),1,240),
substrb(dump(MAX(SS_NET_PROFIT),16,0,64),1,240)
FROM STORE_SALES_CSV_NUM

The performance numbers are very different.

Datatype transformation                    Elapsed time  Comments
string (hive) -> NUMBER (oracle)           18 min        String -> Oracle NUMBER conversion
string (hive) -> BINARY_DOUBLE (oracle)    64 min        String -> Oracle BINARY_DOUBLE conversion, which is very expensive

It's also obvious from the graphs that we spend far more CPU in the BINARY_DOUBLE case:

Important: if you define BINARY_DOUBLE in the Oracle RDBMS and you have the double type in Parquet, ORC, RCFile or AVRO files, you don't do any conversion - you just pass the data directly to Oracle Smart Scan.

In the case of TextFile or SequenceFile you always do a data transformation (because a text input format is read as strings), and here you have to choose the cheapest one, which is NUMBER(7,2) rather than BINARY_DOUBLE.
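
If you are not sure which case applies to your table, check its InputFormat in Hive (a quick sketch):

hive> DESCRIBE FORMATTED store_sales;
-- look at the "InputFormat:" line in the output:
--   OrcInputFormat / MapredParquetInputFormat -> pass-through mapping is possible
--   TextInputFormat / SequenceFileInputFormat -> values always arrive as strings
--                                                and must be converted (prefer NUMBER)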
