By anuj.sahni on Jan 12, 2015
We recently announced Oracle Big Data SQL support for Oracle NoSQL Database. This feature allows Oracle Database users to connect to external data repositories such as Oracle NoSQL Database and/or Hadoop, and enables them to fetch data from any or all of the repositories at once through a single SQL query. We believe this is a significant step forward, as it streamlines the adoption of big data technologies into the traditional technology stack.
If you are following the big data space, you would agree that everyone likes its scalability, availability, and performance-at-cost characteristics, but almost everyone struggles to modify their enterprise applications to work with big data technologies, because those technologies don't talk the same language that traditional database systems do.
For innovation to happen fast, we need to lower the adoption barrier for the developers and data scientists who have been at the core of this data revolution. By providing the same declarative SQL language for relational and non-relational data, we have made it simpler for everyone to adopt these big data technologies.
SQL on Oracle NoSQL Database
Having an access layer based on a declarative language is definitely a great thing, but a common trend these days is that every NoSQL vendor writes its own custom SQL-like layer. If you want your existing application to access relational data as well as non-relational data stored in a NoSQL database, you cannot write a single SQL query that spans the two separate stores, which again complicates things from an application development and manageability point of view.
Oracle Big Data SQL for Oracle NoSQL Database fills this gap by providing exactly the same SQL access layer to both stores, so the queries look exactly the same no matter which store you access. To achieve this, Oracle Big Data SQL requires that Oracle NoSQL Database support access to its data via Hive. Thus, plugging Oracle NoSQL Database into Oracle Database is as simple as three steps: (1) creating a NoSQL store and a table, (2) configuring Hive to access the NoSQL table, and (3) configuring Oracle Database to talk to HCatalog.
Please note that Big Data SQL provides much better query performance than running a query directly on generic external tables, because with the former approach the computation (for selecting and formatting data) is moved to where the data is, as opposed to moving all the data to the Oracle Database node first and then applying the predicates. The same proprietary cell technology (developed for Exadata) is used to apply the query predicates in parallel across all the distributed NoSQL partitions, making the overall query run faster.
In the rest of the article I will give you the technical details of how to perform the first two steps, so you can get hands-on experience with this feature.
Figure 1: Flow diagram showing steps to access data in Oracle NoSQL Database using hive queries.
Step 1: Create a TABLE in Oracle NoSQL Database and load some data that we would like to access from HIVE later on.
- For this demonstration we will be using KVLITE, but you can very well deploy a real cluster by following the steps covered in this webinar. Please remember to set the environment variable KVHOME to point to the directory where you unzipped the latest Oracle NoSQL Database binaries (download here) and KVROOT (e.g. /u01/kvroot) to the directory where you would like the metadata files to be stored.
$ java -jar $KVHOME/lib/kvstore.jar kvlite -root $KVROOT -host localhost &
- Once you have an Oracle NoSQL Database cluster deployed (in our case a single instance), connect to the administration shell using the runadmin command, and then from the kv-> prompt connect to the store (kvstore in my case).
$ java -jar $KVHOME/lib/kvstore.jar runadmin -port 5000 -host localhost
kv-> CONNECT STORE -name kvstore
- Create a MOVIE table with a few columns (to capture movie-specific information):
kv-> EXEC "CREATE TABLE MOVIE( \
       id INTEGER, \
       original_title STRING, \
       overview STRING, \
       poster_path STRING, \
       release_date STRING, \
       vote_count INTEGER, \
       runtime INTEGER, \
       popularity DOUBLE, \
       genres STRING, \
       PRIMARY KEY (id) )"
- Use SHOW to confirm the table was created successfully and DESCRIBE to make sure all the column types are appropriate as well.
kv-> EXEC 'SHOW TABLES'
kv-> EXEC 'DESC AS JSON TABLE MOVIE'
- Next we will load some movie data into the table. You can download a movie.dat file with details of about 10 movies, each represented as a JSON string. We will load these strings directly into the MOVIE table.
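In case you want to build your own data file, each record in movie.dat is a JSON object whose field names match the table columns. A record might look like the following (shown pretty-printed; the values below are made up purely to illustrate the format, they are not from the actual file):

```json
{
  "id": 1,
  "original_title": "Example Movie",
  "overview": "A made-up plot summary used only to illustrate the record format.",
  "poster_path": "/example_poster.jpg",
  "release_date": "1994-09-23",
  "vote_count": 1200,
  "runtime": 142,
  "popularity": 7.8,
  "genres": "Drama"
}
```

Note that the field types must agree with the DDL above (e.g. vote_count is an INTEGER, popularity a DOUBLE), or the load will reject the record.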
kv-> PUT TABLE -name MOVIE -file /tmp/movie.dat
Inserted 10 rows in movie table
- You can now COUNT the rows and GET the rows to make sure data is indeed there.
kv-> AGGREGATE TABLE -name MOVIE -count
Row count: 10
kv-> GET TABLE -name MOVIE
Step 2: Create a Hive external table using the storage handler for Oracle NoSQL Database (shipped in release 3.2.5).
- Before we begin doing anything with Hive, we first need to install Hadoop on our system.
- After Hadoop is installed and tested, download the Hive archive from here. I am using apache-hive-0.14.0 at the time of this article, but you can download any later version. Unzip the archive and create the environment variable HIVE_HOME to point to this directory.
- Here are some of the environment variables that are set in my ~/.bashrc file:
export HADOOP_HOME=/u01/home/hadoop/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export KVHOME=/u01/nosql/kvhome
export HIVE_HOME=/u01/home/hadoop/apache-hive-0.14.0
export HIVE_AUX_JARS_PATH=$KVHOME/lib/kvclient.jar
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME
- NOTICE that we have HIVE_AUX_JARS_PATH set to kvclient.jar. We need to set this variable to make sure Hive can find the storage handler for Oracle NoSQL Database, which ships in the client driver itself.
- Now the only thing left is creating an external table to mirror the table that exists in Oracle NoSQL Database. Let's connect to Hive first and then create the table using this DDL:
$ hive
hive> CREATE EXTERNAL TABLE IF NOT EXISTS MOVIE(
          id INT,
          original_title STRING,
          overview STRING,
          poster_path STRING,
          release_date STRING,
          vote_count INT,
          runtime INT,
          popularity DOUBLE,
          genres STRING)
      STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
      TBLPROPERTIES (
          "oracle.kv.kvstore"="kvstore",
          "oracle.kv.hosts"="localhost:5000",
          "oracle.kv.hadoop.hosts"="localhost",
          "oracle.kv.tableName"="MOVIE");
Time taken: 0.174 seconds
Note: The STORED BY clause is where we define the storage handler class for Oracle NoSQL Database, and TBLPROPERTIES is where we define the name of the store, the host name and port where the storage node agent is listening (in our case we have only one node), and the name of the table in NoSQL DB. We have MOVIE as the table there, so that is what we use for the oracle.kv.tableName property.
- That is all one needs to do to configure Hive to work with Oracle NoSQL Database. Pretty simple, right?
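Before running queries, it is worth confirming that Hive registered the external table correctly. These are standard Hive commands; DESCRIBE should list the same columns we declared in the NoSQL DDL:

```sql
-- List registered tables; MOVIE should appear in the output
SHOW TABLES;
-- Print the column names and Hive types mapped from the NoSQL table
DESCRIBE MOVIE;
```

If DESCRIBE fails with a ClassNotFoundException for the storage handler, double-check that HIVE_AUX_JARS_PATH points at kvclient.jar as shown above.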
Step 2b: Now that we have the NoSQL DB and Hive setup done, we just need to test running Hive queries.
- Here are some simple queries to fetch MOVIE data using the Hive Query Language (HQL):
hive> SELECT id, original_title, release_date, genres FROM MOVIE;
hive> SELECT id, original_title, release_date, genres FROM MOVIE WHERE release_date > 1990;
hive> SELECT id, original_title, release_date, genres FROM MOVIE WHERE genres LIKE '%Comedy%';
- Some other queries that require MapReduce jobs to be spun off to get the results:
hive> SELECT count(*) FROM MOVIE WHERE genres LIKE '%Drama%';
hive> SELECT id, original_title, release_date, genres FROM MOVIE WHERE genres LIKE '%Drama%' ORDER BY release_date;
hive> SELECT MAX(release_date) FROM MOVIE;
Using the Hive storage handler from the latest release, one can access NoSQL data as Hive external tables. The metadata of these external tables is stored in HCatalog, which is later referenced by Oracle Big Data SQL to decide where to dispatch the queries in parallel. Configuring Oracle NoSQL Database has become so simple (using the out-of-the-box Hive storage handler) that it takes only a couple of steps to complete.
In part 2 of this blog I will talk about how the Hive metastore can be configured with Oracle Database to finally run SQL queries on NoSQL data.
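For the curious, the Oracle Database side of step 3 boils down to an external table declaration over the ORACLE_HIVE access driver. The sketch below is only an assumption of what that DDL looks like; the table name, directory object, and VARCHAR2 sizes are placeholders I chose for illustration, and part 2 will walk through the real configuration:

```sql
-- Sketch only: an Oracle-side external table over the Hive MOVIE table above.
-- movie_bdsql, DEFAULT_DIR, and the column sizes are illustrative placeholders.
CREATE TABLE movie_bdsql (
  id             NUMBER,
  original_title VARCHAR2(200),
  overview       VARCHAR2(4000),
  poster_path    VARCHAR2(200),
  release_date   VARCHAR2(20),
  vote_count     NUMBER,
  runtime        NUMBER,
  popularity     NUMBER,
  genres         VARCHAR2(200)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE                 -- Big Data SQL access driver
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename: default.movie   -- Hive table to read from
  )
)
REJECT LIMIT UNLIMITED;
```

Once such a table exists, a plain SELECT against movie_bdsql can be joined with regular Oracle tables in a single SQL statement, which is exactly the promise described at the start of this article.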