Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Big Data SQL Quick Start. NoSQL databases - Part9.

Alexey Filanovskiy
Product Manager
It's not a secret that lately IT people are talking a lot about NoSQL. Some even use it. NoSQL databases could have some advantages over RDBMS (like scalability), but many of NoSQL databases don't have many features that are quite common for RDBMS (like transaction support, maturity for backup and recovery tools). Also, many NoSQL databases are schema-less, which can be an advantage (in sense of application development agility), but it also could be a disadvantage (in sense of supportability).  It's a long discussion that stays out of the scope of this blog and you could easily find in the web many posts about this and many opinions about this (I assume, that reader is familiar with those). My personal opinion is that NoSQL databases could be useful for some particular cases, but it rarely stays alone and it's why seamless integration with databases may be needed. Big Data SQL could provide it. Let's have an example. I have a huge company with tens of millions customers. I do store customers profiles in HBase database, because:

-  Number of customers (10s of millions)

-  All need low latency for reading operation (People and applications)

-  Profile tags (metrics) could vary from case to case (data is pretty sparse)

-  Application developers want to have more flexibility and want easily add new columns

I also have Oracle Database like an analytical platform and I have existing BI application, which doesn't know what HBase is. I just need to use data from those profiles (stored in HBase) in my BI reporting. The overall picture looks like this:

Now let me show how this challenge may be resolved with Big Data SQL.

Create HBase structures and insert the data.

Here is a simple example of creating HBase table (with HBase shell) and adding a couple of values there: 

# hbase shell
hbase(main):001:0> create 'emp', 'personal data', 'professional data'
=> Hbase::Table - emp
hbase(main):002:0> put 'emp','6179995','personal data:name','raj'
hbase(main):003:0> put 'emp','6179995','personal data:city','palo alto'
hbase(main):004:0> put 'emp','6179995','professional data:position','developer'
hbase(main):005:0> put 'emp','6179995','professional data:salary','50000'
hbase(main):007:0> put 'emp','6274234','personal data:name','alex'
hbase(main):008:0> put 'emp','6274234','personal data:city','belmont'
hbase(main):009:0> put 'emp','6274234','professional data:position','pm'
hbase(main):010:0> put 'emp','6274234','professional data:salary','60000'

After this we will have following data into HBase table (here is a screenshot from HUE HBase browser):

looks like a table from a relational database. But HBase is more flexible. If developer will want to add a new field into the table (analog of a new column in the relational database), he just will need to use following simple API:

hbase(main):001:0> put 'emp','6274234','professional data:skills','Hadoop'

and new field immediately will appear into the table:

No DDL or metadata changing is not needed , just another one put operation. Very flexible!

Create Hive and Oracle External tables.

Now we have real data into HBase. Hive provides us opportunity to represent it like a table. Hive has very powerful integration tool- StorageHandlers. Using HBase StorageHandlers we define (add metadata in Hive) the way how to interpret NoSQL data:

(rowkey STRING,
 ename STRING, 
 city STRING,  
 position STRING,
 salary STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
('hbase.columns.mapping' = ':key#binary,
personal data:name,
personal data:city, 
professional data:position,
professional data:salary')
TBLPROPERTIES('hbase.table.name' = 'emp');

Now we could use HQL (Hive Query Language) for access HBase table. But our final goal is using Oracle RDBMS SQL for this. It's not a problem, only one thing that we need is connect to the Oracle RDBMS and create an external table, that will be linked to the hive table, like this: 

  rowkey number,
  ename VARCHAR2(4000),
  city VARCHAR2(4000),
  position VARCHAR2(4000),
  salary number
      ( com.oracle.bigdata.cluster=bds30

This explanation seems a bit tangled, but I hope this diagram could resolve this misunderstanding:

Data stored into HBase, metadata stored into Hive and Oracle RDBMS.

Build Hybrid Report.

let's imagine, one day business user comes to you and asks the question: "I need a report of all sales by years for each position (job role)". To answer on this question I need information from STORE_SALES table (sales), EMP_HBASE (position) and DATE_DIM (year). It's a three different tables, two dimension tables (date_dim and emp_hbase) and one fact table (store_sales), they could be joined by follow keys:

In my infrastructure I store "STORE_SALES" on HDFS in ORC format, customers profiles (EMP_HBASE) I store in HBase and date dimension table (date_dim) i store in Oracle RDBMS as permanent table.Overall picture looks like this:

Thanks for Big Data SQL I could query all the data within the single query, like this:

store_sales_orc s,
emp_hbase e,
date_dim d
WHERE e.rowkey = s.ss_customer_sk
AND s.ss_sold_date_sk =d. d_date_sk
AND e.rowkey > 0
GROUP BY e.position, d.d_year

Query plan looks like usual Oracle RDBMS plan:

Bingo! One query cover three data sources.

Big Data SQL and NoSQL patterns and anti-patterns.

After my previous example you may be so excited and decide to use NoSQL databases always for any use case. But it's wrong feeling. Many NoSQL (including HBase and Oracle NoSQL DB) databases work well when you access data by key. For example, fetch value by key or scan small range of data. On the opposite side stay queries when you don't use key (instead this you use one of the filed from value). Let me demonstrate this.

In my test stand I co-locate HBase and Big Data SQL on the 3 node Hadoop cluster and use one server for database.

I create HBase table load data from HDFS parquetfile:

hive> CREATE TABLE IF NOT EXISTS fil.store_sales_hbase_bin
( ss_sold_date_sk BIGINT,
   ss_sold_time_sk BIGINT,   
   ss_item_sk BIGINT,       
   ss_customer_sk BIGINT,       
   ss_cdemo_sk BIGINT,       
   ss_hdemo_sk BIGINT,       
   ss_addr_sk BIGINT,       
   ss_store_sk BIGINT,       
   ss_promo_sk BIGINT,       
   ss_ticket_number BIGINT,       
   ss_quantity INT,       
   ss_wholesale_cost DOUBLE,       
   ss_list_price DOUBLE,       
   ss_sales_price DOUBLE,       
   ss_ext_discount_amt DOUBLE,       
   ss_ext_sales_price DOUBLE,       
   ss_ext_wholesale_cost DOUBLE,       
   ss_ext_list_price DOUBLE,       
   ss_ext_tax DOUBLE,       
   ss_coupon_amt DOUBLE,       
   ss_net_paid DOUBLE,       
   ss_net_paid_inc_tax DOUBLE,       
   ss_net_profit DOUBLE)
 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' =
hive> set hive.hbase.bulk=true;
hive> insert into fil.store_sales_hbase_bin select * from parq.store_sales;

and after this compare performance for basic operations. HBase co-located with Hadoop servers (Big Data SQL also installed on those servers) and it's fair comparison, because numbers of disks, CPUs are the same (please note it's not official benchmark, it's just an example).

Totally I have around 189 Millions in both tables. HBase is really strong if we query data by keys. In my DDL column "SS_TICKET_NUMBER" is key for HBase:

SQL> SELECT COUNT(1) FROM store_sales_hbase_small  WHERE  SS_TICKET_NUMBER=187378869

because of this query took less than one second. Also you could run a lot of simple queries over HBase (but make sure that you have key in where predicate). Same query over parquetfile:

SQL> SELECT COUNT(1) FROM store_sales_parq_small WHERE  SS_TICKET_NUMBER=187378869 

will take about 8 seconds.

But if we will try query HBase, using not key columns - we will get full scan and very low performance as consequence. Parquet files will work a much faster. Also, i convert this data (the same table) in CSV format and I ran few tests. You could find conclusion in table below for few type of the queries.

Table. Example of performance numbers:


As you could noted before, combination of NoSQL + Big Data SQL is doing pretty well, when it read data by NoSQL key, but how scalable it is?

To answer on this question I ran simple test. I fire the same query:

with different concurrency (number of simultaneous queries) and got follow results:

 Number of simultaneous queries  Average elapsed time, sec
 10  1.2
 15  1.3
 20  1.9
 25  2.2
 30  3
 45  4.3
 60  6.2


let me summarize all findings regarding NoSQL databases and Big Data SQL. 

1) You could query NoSQL data by Oracle SQL with Big Data SQL

2) You could run many concurrent queries over NoSQL databases

3) You could have good performance until you work over NoSQL key column (column, which is mapped to NoSQL key)

4) You could query data even using non-key column, but performance will low

5) Full scan of NoSQL databases is slower than full scan of textfile or other Hadoop file formats, because of extra overhead of API NoSQL DB.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.