X

Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Big Data SQL Quick Start. Offloading - Part2.

Alexey Filanovskiy
Product Manager


After reading these articles: Big Data SQL Quick Start. Introduction, One Fast Query All your Data, you know what Big Data SQL is, and you understand that it allows you query data from Hadoop through the Oracle Database.  But you also should know that it’s not just reading data. Big Data SQL allows you to process data stored in HDFS locally and return back to the database only data relevant to the query. Let’s imagine a simple diagram of a data management system that includes Oracle Database and Hadoop:


If you run Big Data SQL on engineered systems like Big Data Appliance and Exadata, your architecture will look like:

 

And now let’s list some of the advantages of the each tier:

1) Databases are good for:

i. Transactional workload

ii. Concurrency reads/writes

iii. Store critical data

iv. Multiple joins, complex queries

2) Hadoop is good for:

a. Batch, not concurrent workload

b. Relatively simple queries

c. Horizontal scalability for this simple queries

d. Unstructured and semi-structured type of data

This list suggests to us that it would be nice to have a system that could store “raw” data on the Hadoop tier, accomplish all the rudimentary processing jobs, and return back only prepared, clean data on the database tier. So, Big Data SQL actually lets you accomplish this simply. On the Hadoop tier it performs:
- Deserialization of the data. If you store data in some serialized format like AVRO, JSON, or Parquet Big Data SQL will deserialize it on the Hadoop tier
- Column pruning. If your table has 400 columns and in the query, you select only 1 column, Big Data SQL will prune 399 columns and return back only 1.

- Datatype transformation. Oracle Database has an own datatype, and data type conversion is not a cheap operation. On the Hadoop layer, Big Data SQL transforms data to the Oracle Database format
- Apply functions. If you have filter predicates (like “id=…”) it’s good to have that work done on the Hadoop tier. You can list all of the functions that can be done on the Hadoop tier by running this query in Oracle:

SQL> SELECT * FROM v$sqlfn_metadata WHERE offloadable = 'YES'

From here on out, I’ll call queries for those most of the workload done on the Hadoop tier like “OFFLOADABLE”. Now, let’s consider the example from my previous blog post (Big Data SQL Quick Start. Introduction):

SQL> SELECT min(w.ws_sold_time_sk)
FROM WEB_SALES w
WHERE w.ws_sold_date_sk = 2451047

Filtering is the main part of this query from Big Data SQL perspective; it will sift out inappropriate rows and pass on the RDBMS side only eligible rows. Also, Big Data SQL will perform column pruning.  That means that from all columns in this table, it will pass only one that we are going to calculate. Let’s check execution plan and query statistics in OEM:



Here in the Oracle Enterprise Manager we also could check offload efficiency: 

As an alternative to OEM, you can check execution statistics with following query (run it after user query in your favorite SQL tool, like SQLdeveloper or SQL*Plus):

SQL> SELECT n.name,
round(s.value/1024/1024)
FROM v$mystat s,
v$statname n
WHERE s.statistic# IN (462,463)
AND s.statistic# = n.statistic#;


cell XT granule bytes requested for predicate offload 32768
cell interconnect bytes returned by XT smart scan     32

This is a classic example of a good query for Big Data SQL.

How to determine that offloading didn’t happen?

Before I answer this question, I’ll explain in more detail how offloading works. The picture below will help make things clear:

In the first step, Big Data SQL applies storage indexes, reads data from HDFS, and performs deserialization. Actually “External Table Service” has two modes for reading data: C or Java. You may specify it during table creation in access parameter settings:


ACCESS PARAMETERS
( com.oracle.bigdata.datamode=java
)
or

ACCESS PARAMETERS
( com.oracle.bigdata.datamode=c
)

Use C mode when it possible (not all data formats are available for c mode). In both cases, Big Data SQL reads an HDFS block,
translates it to the Oracle format (which is needed for Smart Scan subcomponent) and passes the data to Smart scan it there. On the Smart Scan level, Big Data SQL performs column pruning, filtering, parses JSON, XML and so on. Resulting data is pushed to the Database level over the network in native Oracle database format. How do you figure out if your query is offloadable or not? The first step to figuring this out this is to open SQL monitoring in Oracle Enterprise Manager (this is true for debugging of all Oracle queries) and have a look at the events.Let’s start with and example of “proper” (offloadable) query (I name proper case when we perform bigger part of the workload on the storage side). I implemented a senseless PL/SQL function, which is actually doing nothing:

SQL> create or replace function fnull(input number) return number is
Result number;
begin
Result:=input;
return(Result);
end fnull;

And then run following SQL:

SQL> SELECT fnull(MIN(WS_SOLD_DATE_SK)),
fnull(MIN(WS_NET_PROFIT))
FROM WEB_SALES

The table WEB_SALES has many columns (34 total) and in my SQL I list all of them (but show only two in this blog). The function min is offloadable, which means that we could execute it on the Hadoop side:

SQL> SELECT NAME,
offloadable,
AGGREGATE
FROM v$sqlfn_metadata
WHERE NAME = 'MIN'
AND AGGREGATE='YES'

NAME offloadable AGGREGATE
---- ----------- ---------
MIN YES         YES

Then, we will get this filtered by Big Data SQL, the result will be transmitted to the db side and then our PL/SQL function will be applied on
the database tier. Let’s inspect what OEM shows us:



87% of the events are “user IO” and all of this IO is the cell side (“cell external table smart scan”). 13% of the wait events are CPU waits (most probably it’s PL/SQL execution waits). Also, will be useful check offloading statistics. We could do this in OEM: 

Alternatively, we could run following query in SQLdeveloper, SQL*Plus or in other SQL tools:

SQL> SELECT n.name,
round(s.value/1024/1024)
FROM v$mystat s, v$statname n
WHERE s.statistic# IN (462,463)
AND s.statistic# = n.statistic#;

This example shows offloadable query, which means it’s good for Big Data SQL. In contrast, let’s consider anti-pattern of the query, for example, we could reverse our functions like this:

SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)),
MIN(fnull(WS_NET_PROFIT))
FROM WEB_SALES

It looks very similar to the previous query and returns the same result, but… Execution will be different. As soon as we wrap a column in PL/SQL (PL/SQL could not be performed on the Hadoop tier), we have to move all data on the database tier. OEM also shows different wait events:

You can observe only CPU waits on the screen above. We can click on the query details and see that smart scan returns all eligible bytes:

Avoid this case if possible,  and try to perform as much of the query as possible on the Hadoop tier. The Hadoop tier has performed some work in this
query - it transforms HDFS block to the database format and for queries where not all list of the columns are listed it prune unused columns. It’s just not doing all it can.

Database’s side datatype transformation

Ok, but what if Big Data SQL agents on some node(s) become unavailable? Let’s look at this case! We start by stopping Big Data SQL (via
Cloudera Manager):


After this run any query. Here’s a simple one:

SQL> SELECT COUNT(1)
FROM STORE_SALES ss
WHERE ss.ss_sold_date_sk = 2452200

The first sign that offloading isn’t happening will be slow query performance. Then, in OEM, you will find many “Application: External Procedure Call” events. Here is a screen from Enterprise Manager:

Also, you can check network utilization on the database side (it will be significant):

Restrictions for the Big Data SQL.

Here is a brief set of restrictions:

1) Big Data SQL is full Oracle Database SQL. All details you could find here: https://docs.oracle.com/database/121/SQLRF/toc.htm

2) Not all functions can be done on the Hadoop tier. You can find the list of the functions that can be offloaded with the following view:

SQL> SELECT NAME FROM v$sqlfn_metadata WHERE offloadable ='YES'

3) Even for non-offloadable operations Big Data SQL will perform column pruning and datatype conversion (which saves a lot of resources)

4) Other operations (non-offloadable) will be done on the database side

Join the discussion

Comments ( 1 )
  • Paweł Tuesday, February 9, 2016

    Very nice introduction to Big Data :)


Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.