
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Big Data SQL Quick Start. Parallel Query - Part 3.

Alexey Filanovskiy
Product Manager

Not every query can be offloaded, and not every part of a SQL statement can be executed on the cell (storage) side. Some queries are quite complex and may contain PL/SQL. To increase the performance of such queries you can use database parallel query, and the good news is that it works together with Big Data SQL. The second piece of good news is that for the end user or application there is no difference. Let's consider a complex query that requires a lot of filtering resources as well as database resources (for performing sorting, for example). As a function that cannot be offloaded, I chose RANK:

SQL> SELECT name, offloadable, aggregate
     FROM v$sqlfn_metadata
     WHERE name = 'RANK';

and it will return:

NAME  OFFLOADABLE  AGGREGATE
----  -----------  ---------
RANK  NO           NO
RANK  NO           YES
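The same dictionary view can be used to check any function you plan to use before writing the query. A quick sketch (the function list here is just an illustration):

```sql
-- List offload eligibility for a few common functions.
-- V$SQLFN_METADATA is a standard Oracle dynamic performance view;
-- OFFLOADABLE = 'YES' means the function can be evaluated during Smart Scan.
SELECT name, offloadable, aggregate
FROM   v$sqlfn_metadata
WHERE  name IN ('RANK', 'MIN', 'MAX', 'SUM')
ORDER  BY name, aggregate;
```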

Now let's write a query that uses this non-offloadable function:

SQL> SELECT /*+ NOPARALLEL */
RANK() OVER(PARTITION BY ws.ws_bill_customer_sk ORDER BY ws.ws_ship_date_sk), ws.ws_item_sk
FROM WEB_SALES ws
WHERE
ws.ws_sold_date_sk > 245
OFFSET 500000 ROWS FETCH NEXT 10 ROWS ONLY;

Run it in serial mode first (with the /*+ NOPARALLEL */ hint) and check the results in OEM:

As we can see, there are a lot of CPU events on the database side.

And now let's repeat this query with some degree of parallelism (with the /*+ PARALLEL(4) */ hint):

SQL> SELECT /*+ PARALLEL(4) */
RANK() OVER(PARTITION BY ws.ws_bill_customer_sk ORDER BY ws.ws_ship_date_sk), 
ws.ws_item_sk
FROM WEB_SALES ws
WHERE
ws.ws_sold_date_sk > 245
OFFSET 500000 ROWS FETCH NEXT 10 ROWS ONLY;

And again, let's have a look in OEM:

In both cases we perform the same actions on the storage side; the only difference with the second query is parallel processing on the database side. Because the database part of the work completes faster, we see more storage waits (the database processing is now fast enough that it ends up waiting for the storage).
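If you want to confirm the offload numerically rather than from the OEM graphs, session statistics are one option. A sketch, with the caveat that the `cell XT%` statistic names are specific to Big Data SQL external table scans and may vary between versions:

```sql
-- Show Big Data SQL offload statistics for the current session.
SELECT n.name, s.value
FROM   v$mystat   s,
       v$statname n
WHERE  n.statistic# = s.statistic#
AND    n.name LIKE 'cell XT%'
ORDER  BY n.name;
```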

It is also interesting to see what is going on on the Big Data side from an IO perspective. Let's check the Hadoop graphs:

 

The serial query creates less IO workload on the storage side, but the reason is slowness on the database side. In other words, the database doesn't have enough resources to handle the traffic coming from the cell side. The cell degree of parallelism does not depend directly on the database DOP, but it can change as a side effect (as it did in our example). Because the database is actively involved in processing this query, the elapsed time of the parallel query is lower than the elapsed time of the serial query:

Serial (DOP=1): 7.1 min

Parallel (DOP=4): 3.2 min

This doesn't mean that Big Data SQL is useless even in this case. Data type transformation and column pruning are still done on the storage side, but to get better performance we need to parallelize this SQL on the database side.

Hence, for queries that do almost all of the work on the cell side, the performance difference between a serial query and a parallel query will be minimal. Let me demonstrate this with the following example:

SQL>  SELECT /*+ noparallel monitor*/ COUNT(1)
FROM WEB_SALES ws
WHERE
ws.ws_sold_date_sk = 2451507
AND ws.ws_sold_time_sk=70565
AND ws.ws_bill_customer_sk=19628510
AND ws.ws_ext_discount_amt=4433.48;

This is a serial query, and we can observe the following wait events in OEM:

Almost all events (99%) are related to the cell (storage) side – "cell external table smart scan".

And of course it's interesting to see what happens in the case of a parallel query:

SQL>  SELECT /*+ parallel(4) monitor*/ COUNT(1)
FROM WEB_SALES ws
WHERE
ws.ws_sold_date_sk = 2451507
AND ws.ws_sold_time_sk=70565
AND ws.ws_bill_customer_sk=19628510
AND ws.ws_ext_discount_amt=4433.48;

And again, the type of the wait events is the same – "cell external table smart scan" (a storage event). In contrast to the previous example, query execution time as well as cell disk utilization are roughly the same for serial and parallel mode:

Serial (DOP=1): 3.1 min

Parallel (DOP=4): 2.9 min

The cause is that all of the processing was done on the cell side, so database parallelism doesn't matter (the instance is simply not involved in this processing).
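Since both queries were run with the monitor hint, another way to compare them is a SQL Monitor report, which breaks elapsed time down by wait class. A minimal sketch (the `&sql_id` substitution variable is a placeholder for the SQL_ID of your statement):

```sql
-- Text SQL Monitor report for a monitored statement.
SELECT DBMS_SQLTUNE.report_sql_monitor(
         sql_id => '&sql_id',
         type   => 'TEXT') AS report
FROM   dual;
```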

Cell-side parallelism.

Actually, the cell (Hadoop node) also has its own degree of parallelism. A user cannot set it directly (at least not without underscore parameters), but in fact every single cell (Hadoop) node processes multiple blocks in parallel. If you run a serial query, it coordinates all of the parallel threads on the cell side. This picture illustrates it:

Each cell (Hadoop) node has its own degree of parallelism (let's say it is equal to N). A serial query coordinates all of these parallel threads. In the case of a parallel query, we have a Query Coordinator (QC) that controls the Parallel Execution servers (PX). Each PX server is responsible for handling its part of the input data splits on the cell (Hadoop) side and, as a consequence, coordinates its part of the parallel threads. The picture below shows an example where we have a query with database DOP=2 and each PX server handles A and B parallel threads on the cell side, respectively.

It is important to note that the sum of A and B still produces N threads, which means that queries with different DOPs on the database side (such as select /*+ noparallel */, select /*+ parallel(4) */ or select /*+ parallel(32) */) will produce the same parallelism on the cell side.
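One way to confirm the database-side DOP a statement actually received (as opposed to the DOP requested by the hint) is V$SQL_MONITOR, which records both values for monitored statements. A sketch:

```sql
-- Requested vs. allocated PX servers for recent monitored statements.
SELECT sql_id,
       px_servers_requested,
       px_servers_allocated,
       status
FROM   v$sql_monitor
WHERE  username = USER
ORDER  BY sql_exec_start DESC;
```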
