- Number of customers (10s of millions)
- All consumers (people and applications) need low-latency reads
- Profile tags (metrics) can vary from case to case (the data is pretty sparse)
- Application developers want more flexibility and the ability to easily add new columns
I also have Oracle Database as an analytical platform, plus an existing BI application that doesn't know what HBase is. I just need to use the data from those profiles (stored in HBase) in my BI reporting. The overall picture looks like this:
Now let me show how this challenge can be solved with Big Data SQL.
Create HBase structures and insert the data.
Here is a simple example of creating an HBase table (with the HBase shell) and adding a couple of values to it:
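A shell session along these lines does the job; the table name, column family, and values are illustrative, not the exact ones from my environment:

```
hbase shell
create 'emp_hbase', 'cf'
put 'emp_hbase', '1', 'cf:name', 'John'
put 'emp_hbase', '1', 'cf:position', 'Manager'
put 'emp_hbase', '2', 'cf:name', 'Mary'
put 'emp_hbase', '2', 'cf:position', 'Analyst'
scan 'emp_hbase'
```

The final scan prints every cell back, so you can check what was actually stored.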
After this we have the following data in the HBase table (here is a screenshot from the HUE HBase browser):
It looks like a table from a relational database, but HBase is more flexible. If a developer wants to add a new field to the table (the analog of a new column in a relational database), he just needs the following simple API call:
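In the HBase shell, for example, a single put with a brand-new column qualifier is enough (table and column names are illustrative):

```
put 'emp_hbase', '1', 'cf:department', 'Sales'
```

The same thing can be done through the Java client API with a Put object; no schema change is involved either way.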
and the new field immediately appears in the table:
No DDL or metadata change is needed, just one more put operation. Very flexible!
Create Hive and Oracle External tables.
Now we have real data in HBase, and Hive gives us a way to represent it as a table. Hive has a very powerful integration mechanism: StorageHandlers. Using the HBase StorageHandler we define (as metadata in Hive) how to interpret the NoSQL data:
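The Hive DDL for such a mapping typically looks like the following sketch, assuming an HBase table emp_hbase with a column family cf (the names are illustrative):

```sql
CREATE EXTERNAL TABLE emp_hbase (
  key       STRING,
  name      STRING,
  position  STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:name,cf:position"
)
TBLPROPERTIES ("hbase.table.name" = "emp_hbase");
```

The hbase.columns.mapping property ties each Hive column either to the HBase row key (:key) or to a column-family:qualifier pair.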
Now we can use HQL (Hive Query Language) to access the HBase table. But our final goal is to use Oracle RDBMS SQL for this. That's not a problem: the only thing we need to do is connect to the Oracle RDBMS and create an external table linked to the Hive table, like this:
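With Big Data SQL this is done through the ORACLE_HIVE access driver; a sketch of such a DDL (directory name, column names, and Hive schema are assumptions):

```sql
CREATE TABLE emp_hbase (
  key       VARCHAR2(100),
  name      VARCHAR2(100),
  position  VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename = default.emp_hbase
  )
)
REJECT LIMIT UNLIMITED;
```

After this the HBase data is queryable from the database as an ordinary Oracle table.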
This explanation may seem a bit tangled, but I hope this diagram clears it up:
Data is stored in HBase; metadata is stored in Hive and the Oracle RDBMS.
Build Hybrid Report.
Let's imagine that one day a business user comes to you and asks: "I need a report of all sales by year for each position (job role)." To answer this question I need information from the STORE_SALES table (sales), EMP_HBASE (position), and DATE_DIM (year). These are three different tables, two dimension tables (DATE_DIM and EMP_HBASE) and one fact table (STORE_SALES), and they can be joined by the following keys:
In my infrastructure I store STORE_SALES on HDFS in ORC format, the customer profiles (EMP_HBASE) in HBase, and the date dimension table (DATE_DIM) in the Oracle RDBMS as a permanent table. The overall picture looks like this:
Thanks to Big Data SQL, I can query all the data within a single query, like this:
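A query in the spirit of that report would look like the following sketch; the measure column and the join keys are assumptions, since they depend on the actual DDL:

```sql
SELECT d.d_year,
       e.position,
       SUM(s.ss_net_paid) AS total_sales            -- measure column is an assumption
FROM   store_sales s                                -- ORC on HDFS, via Big Data SQL
JOIN   date_dim   d ON s.ss_sold_date_sk = d.d_date_sk
JOIN   emp_hbase  e ON s.ss_customer_sk  = e.key    -- assumed join key
GROUP  BY d.d_year, e.position;
```

From the optimizer's point of view these are just three Oracle tables; the fact that two of them live on Hadoop is hidden behind the external table definitions.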
The query plan looks like a usual Oracle RDBMS plan:
Bingo! One query covers three data sources.
Big Data SQL and NoSQL patterns and anti-patterns.
After the previous example you may be so excited that you decide to use NoSQL databases for every use case. But that would be a mistake. Many NoSQL databases (including HBase and Oracle NoSQL DB) work well when you access data by key: for example, fetching a value by key or scanning a small range of keys. On the opposite side are queries that don't use the key (and instead filter on one of the fields from the value). Let me demonstrate this.
In my test stand I co-locate HBase and Big Data SQL on a 3-node Hadoop cluster and use one server for the database.
I create an HBase table and load data into it from a Parquet file on HDFS:
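One way to do this is through Hive: define an HBase-backed table and fill it from the Parquet-backed one. A sketch, where the column list and mapping are illustrative (only SS_TICKET_NUMBER as the row key matches my actual DDL):

```sql
-- HBase-backed Hive table; SS_TICKET_NUMBER becomes the HBase row key
CREATE TABLE store_sales_hbase (
  ss_ticket_number BIGINT,
  ss_sold_date_sk  BIGINT,
  ss_customer_sk   BIGINT,
  ss_net_paid      DECIMAL(7,2)
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:sd,cf:cs,cf:np"
);

-- copy the rows from the existing Parquet-backed Hive table
INSERT INTO TABLE store_sales_hbase
SELECT ss_ticket_number, ss_sold_date_sk, ss_customer_sk, ss_net_paid
FROM   store_sales_parquet;
```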
and after this compare performance for basic operations. HBase is co-located with the Hadoop servers (Big Data SQL is also installed on those servers), so it's a fair comparison: the numbers of disks and CPUs are the same (please note this is not an official benchmark, just an example).
In total I have around 189 million rows in both tables. HBase is really strong when we query data by key. In my DDL the column SS_TICKET_NUMBER is the HBase key:
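A lookup on that key, in the spirit of the following sketch (table name and literal value are illustrative), can be served by a key lookup instead of a full scan:

```sql
SELECT *
FROM   store_sales_hbase
WHERE  ss_ticket_number = 1234567;
```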
Because of this, the query took less than one second. You can also run a lot of simple queries over HBase (but make sure the key is in the WHERE predicate). The same query over the Parquet file:
takes about 8 seconds.
But if we try to query HBase using non-key columns, we get a full scan and, as a consequence, very low performance; Parquet files work much faster in that case. I also converted this data (the same table) to CSV format and ran a few tests. You can find the conclusions for a few types of queries in the table below.
Table. Example of performance numbers:
As you may have noticed, the combination of NoSQL + Big Data SQL does pretty well when it reads data by the NoSQL key, but how scalable is it?
To answer this question I ran a simple test. I fired the same query:
with different concurrency levels (numbers of simultaneous queries) and got the following results:
| Number of simultaneous queries | Average elapsed time, sec |
Let me summarize all the findings regarding NoSQL databases and Big Data SQL:
1) You can query NoSQL data with Oracle SQL through Big Data SQL.
2) You can run many concurrent queries over NoSQL databases.
3) You get good performance as long as you filter on the NoSQL key column (the column that is mapped to the NoSQL key).
4) You can query data using non-key columns as well, but performance will be low.
5) A full scan of a NoSQL database is slower than a full scan of a text file or other Hadoop file formats, because of the extra overhead of the NoSQL DB API.