Friday Feb 19, 2016

How Can Graph Databases and Analytics Help Your Business?

Several videos describing the value of graph analyses are now available from our Oracle Big Data Spatial and Graph + Oracle Labs teams.


Check out this blog post for details :). 

Big Data Lite 4.4.0 is now available on OTN



It's now available for download on OTN. Check out this VM to help you learn about Oracle's big data platform. [Read More]

Thursday Feb 04, 2016

Hadoop Compression. Compression rate – Part1.

Compression codecs.

Text files (CSV with "," delimiter):

Codec    Average rate    Minimum rate    Maximum rate
bzip2    17.36           3.88            61.81
gzip     9.73            2.9             26.55
lz4      4.75            1.66            8.71
snappy   4.19            1.61            7.86
lzo      3.39            2               5.39

RC File: 

Codec    Average rate    Minimum rate    Maximum rate
bzip2    17.51           4.31            54.66
gzip     13.59           3.71            44.07
lz4      7.12            2               21.23
snappy   6.02            2.04            15.38
lzo      4.37            2.33            7.02

Parquet file:

Codec    Average rate    Minimum rate    Maximum rate
gzip     17.8            3.9             60.35
snappy   12.92           2.63            45.99

[Read More]

Using Spark(Scala) and Oracle Big Data Lite VM for Barcode & QR Detection

Big Data and Scalable Image Processing and Analytics

Guest post by Dave Bayard - Oracle's Big Data Pursuit Team 

One of the promises of Big Data is its flexibility to work with large volumes of unstructured data such as images and photos. In today's world there are many sources of images, including social media photos, security cameras, satellite images, and more. Many kinds of image processing and analytics are possible, from optical character recognition (OCR) and license plate detection to barcode detection, face recognition, geological analysis, and more. And there are many open source libraries, such as OpenCV, Tesseract, and ZXing, that are available to leverage.

[Read More]

Tuesday Jan 26, 2016

Big Data SQL Quick Start. Offloading - Part2.

After reading the articles Big Data SQL Quick Start. Introduction and One Fast Query, All Your Data, you know what Big Data SQL is and you understand that it allows you to query data in Hadoop through Oracle Database. But you should also know that it's not just about reading data. Big Data SQL allows you to process data stored in HDFS locally and return to the database only the data relevant to the query. Let's imagine a simple diagram of a data management system that includes Oracle Database and Hadoop:
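(The diagram itself follows the jump. As a hedged SQL illustration of the offloading idea, consider a query like the one below against a hypothetical Big Data SQL external table over HDFS; the table and column names are made up for this sketch. The WHERE predicate is evaluated by the Big Data SQL agents on the Hadoop nodes, so only the matching rows travel back to the database.)

-- movieapp_log is a hypothetical external table over log files in HDFS
SELECT custid, COUNT(*) AS purchases
FROM   movieapp_log
WHERE  activity = 'purchase'   -- filtered locally on the Hadoop nodes
GROUP  BY custid;              -- only the qualifying rows reach the database for the aggregation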

[Read More]

Monday Jan 25, 2016

Parallel Correlated Filters and Expressions in 12c

In the last post we looked at multiple parallelizers and how they can cause a SQL statement to use more PX servers than expected. You will notice that some types of statements behave differently in 12c compared to 11g in this regard. For these statements 12c does not use multiple parallelizers, thereby avoiding their negative consequences.

Correlated filters

Here is a simple query that checks if the sales table has any rows without a corresponding channel_id in the channels table.
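The query itself appears after the jump; a hedged sketch of a query with that shape (not necessarily the exact statement from the post) would be:

SELECT /*+ parallel(2) */ COUNT(*)
FROM   sales s
WHERE  NOT EXISTS
  (SELECT 1 FROM channels c WHERE c.channel_id = s.channel_id);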

[Read More]

Tuesday Jan 19, 2016

Big Data SQL Quick Start. Introduction - Part1.

Today I am going to explain the steps required to start working with Big Data SQL. It's really easy! I hope that after this article you will all agree with me. First, if you want to get caught up on what Big Data SQL is, I recommend that you read these blogs: Oracle Big Data SQL: One Fast Query and Big Data SQL 2.0 - Now Available.

The above blogs cover the design goals of Big Data SQL. One of those goals is transparency: you just define a table that links to a directory in HDFS or a table in HCatalog and then continue working with it like a regular Oracle Database table. It's also useful to read the product documentation.
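As a hedged sketch of what "defining a table" looks like, here is roughly the DDL for a Big Data SQL external table over a Hive table; the table name, columns, and directory are hypothetical, and the access parameters may differ slightly in your release:

CREATE TABLE movieapp_log (
  custid    NUMBER,
  movieid   NUMBER,
  activity  VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE                 -- pick up the data through the Hive metastore definition
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.movieapp_log)
)
REJECT LIMIT UNLIMITED;

Once created, the table can be queried and joined like any other Oracle table.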

Your first query with Big Data SQL

Let's start with the simplest example and query data that is actually stored in HDFS through Oracle Database using Big Data SQL. I'm going to begin by checking the data that actually lives in HDFS. To do this, I run the hive console and check the hive table DDL:
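The actual DDL is shown after the jump; as a hedged stand-in, the check itself is just a couple of commands in the hive console against a hypothetical table name:

-- run inside the hive console; movieapp_log is a hypothetical table name
SHOW CREATE TABLE movieapp_log;
SELECT * FROM movieapp_log LIMIT 5;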

[Read More]

Wednesday Jan 13, 2016

BIWA 2016 - here's my list of must-attend sessions and labs

It's almost here - the 2016 BIWA conference at the Oracle Conference Center. The conference starts on January 26 with a welcome by the conference leaders at 8:30am. The BIWA summit is dedicated to providing all the very latest information and best practices for data warehousing, big data, spatial analytics and BI. This year the conference has expanded to include the most important query language on the planet: SQL. There will be a whole track dedicated to YesSQL! The full agenda is available here.

Unfortunately I won't be able to attend this year's conference, but if I were going to be there, this would be my list of must-attend sessions and hands-on labs.

[Read More]

Thursday Jan 07, 2016

Data loading into HDFS - Part1

Today I'm going to start the first article in a series devoted to a very important topic in the Hadoop world – data loading into HDFS. Before anything else, let me explain the different approaches to loading and processing data in different IT systems.

Schema on Read vs Schema on Write

When we talk about data loading, we usually load into a system that belongs to one of two types. The first is schema on write. With this approach we have to define columns, data formats, and so on up front. During reads, every user observes the same data set. Because we have already performed the ETL (transformed the data into the format most convenient for that particular system), reads are fast and overall system performance is good. But keep in mind that we already paid the penalty for this when we loaded the data. Classic examples of schema-on-write systems are relational databases such as Oracle or MySQL.


Schema on Write

Another approach is schema on read. In this case we load data as-is, without any changes or transformations. We skip the ETL step (we don't transform the data) and we don't have any headaches with data format and data structure; we just load the file onto the file system, like copying photos from a flash card or external storage to your laptop's disk. How to interpret the data is decided during the read. Interestingly, the same data (the same files) can be read in different ways. For instance, if you have binary data, you have to define a serialization/deserialization framework and use it within your select to get structured data; otherwise you will just get a set of bytes. As another example, even with the simplest CSV files you could read the same column as a number or as a string, which leads to different results for sorting and comparison operations.

Schema on Read

The Hadoop Distributed File System is a classical example of a schema-on-read system. You can find more details about the schema-on-read and schema-on-write approaches here. Now we are going to talk about loading data into HDFS. I hope that after the explanation above you understand that loading data into Hadoop is not the same as ETL (the data is not transformed).
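To make the CSV point above concrete, here is a hedged HiveQL sketch: the same files in HDFS are exposed through two table definitions, one reading the amount column as a string and one reading it as a number (the table names and path are made up for illustration):

-- both tables point at the same hypothetical files under /data/sales/
CREATE EXTERNAL TABLE sales_as_strings (
  sale_id STRING,
  amount  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales/';

CREATE EXTERNAL TABLE sales_as_numbers (
  sale_id INT,
  amount  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales/';

-- sorting or comparing by amount now behaves differently:
-- as strings '9.5' > '10.2', as numbers 9.5 < 10.2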

[Read More]

Thursday Dec 24, 2015

Oracle Big Data Lite 4.3.0 is Now Available on OTN

Big Data Lite 4.3.0 is now available on OTN



This latest release is packed with new features - here's the inventory of what's included:

  • Oracle Enterprise Linux 6.7
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.7)
  • Cloudera Manager (5.4.7)
  • Oracle Big Data Spatial and Graph 1.1
  • Oracle Big Data Discovery 1.1.1
  • Oracle Big Data Connectors 4.3
    • Oracle SQL Connector for HDFS 3.4.0
    • Oracle Loader for Hadoop 3.5.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.1
    • Oracle XQuery for Hadoop 4.2.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.4.7)
  • Oracle Table Access for Hadoop and Spark 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1.2 with Oracle REST Data Services 3.0
  • Oracle Data Integrator 12cR1 (12.2.1)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.2.0
  • Oracle Perfect Balance 2.5.0
Also, this release uses GitHub as the repository for all of our sample code (https://github.com/oracle/BigDataLite). This gives us a great mechanism for updating the samples/demos between releases. Users simply double-click the "Refresh Samples" icon on the desktop to download the latest collateral.

Friday Dec 18, 2015

Oracle among vendors that lead the pack in The Forrester Wave™: Enterprise Data Warehouse, Q4 2015, achieving highest scores for current product offering and product strategy

According to Forrester, a leading independent research firm, EDWs are now evolving beyond traditional storage and delivery. It is now scale, performance, and innovation that distinguish the EDW leaders. Oracle was top ranked by Forrester in both the current offering and the strategy categories.
[Read More]

Monday Dec 07, 2015

Multiple Parallelizers

The number of PX servers that a SQL statement needs depends on the execution plan and the degree of parallelism (DOP) of the statement. Most statements use DOP*2 PX servers as a result of the producer/consumer model used in parallel execution.

We sometimes get questions from users about statements using many more PX servers than they expect. They say "I have a statement with a DOP of 8 and I see that it uses 64 processes". These types of statements typically involve multiple parallelizers. We looked at how you can identify such statements in the previous post; that post also covers basic terminology, so please read it before going into this one.

In this post we will look at a few example cases that can generate plans with multiple parallelizers and how those plans behave at runtime. We will also talk about the implications of having such statements in your system. The examples used here are based on 12c. Ideally a user should be able to control the number of PX servers for a statement; this is not the case when you have multiple parallelizers. We are trying to minimize cases of multiple parallelizers, but I wanted to explain what the current behavior is.

Multiple nonconcurrent parallelizers

The typical case where you can see multiple parallelizers is temp table transformation. With temp table transformation Oracle creates temporary tables at runtime, stores intermediate results in those tables, and queries them during the execution of the statement. The optimizer can decide to use temp table transformation with subquery factoring (a.k.a. the WITH clause), grouping sets, star queries, and in-memory aggregation.

Let's look at a simple query with grouping sets as an example to understand how multiple parallelizers work.

SELECT /*+ parallel(2) */ channel_desc, calendar_month_desc, SUM(amount_sold)
FROM sales, times, channels
WHERE sales.time_id=times.time_id 
AND sales.channel_id= channels.channel_id
GROUP BY GROUPING SETS(channel_desc, calendar_month_desc);
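The plan itself is not reproduced here; you can display it in your own session after running the query with DBMS_XPLAN, for example:

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'TYPICAL'));
-- a NULL sql_id means "the last statement executed in this session";
-- the TYPICAL format includes the TQ, IN-OUT and PQ Distrib columns for parallel plans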

This plan has 4 parallelizers (Line Ids 2, 11, 20, 29). Parallelizers #1 and #4 have one DFO each, which means each of them needs one PX server set (2 PX servers as the DOP is 2). Parallelizers #2 and #3 have 2 DFOs each, which means they need 2 PX server sets (4 PX servers as the DOP is 2). The important question here is: will these parallelizers execute concurrently? If they are concurrent this statement will use 12 PX servers; if they are not concurrent the statement can run with fewer PX servers.

In the case of temp table transformation the parallelizers do not execute concurrently; they run one after the other. Here is a timeline of execution for this plan.

T0: First parallelizer #1 starts and allocates 2 PX servers, it joins three tables and loads the result set into a temporary table (SYS_TEMP_0FD9D6719_E74332).

T1: When this parallelizer finishes it releases the PX servers back to the pool. These PX servers are now available for anyone to use.

T2: Now parallelizer #2 can start. It allocates 4 PX servers from the pool; these may be the same ones released by the previous parallelizer or different PX servers from the pool. This parallelizer reads from the temporary table created previously, does the group by operation, and loads the results into a new temporary table (SYS_TEMP_0FD9D671A_E74332).

T3: Now it releases the PX servers back to the pool and these PX servers become available in the system.

T4, T5: Parallelizer #3 does the same thing: it allocates 4 PX servers and releases them when it finishes.

T6, T7: Now the last parallelizer (#4) starts. It allocates 2 PX servers, reads from the temporary table created previously (SYS_TEMP_0FD9D671A_E74332), sends the results to the user, and releases the PX servers when it finishes.

This sequence of execution shows that the maximum number of PX servers used concurrently was 4, not more. In this case we used DOP*2 PX servers concurrently even though we had multiple parallelizers. This is because a parallelizer allocates PX servers only when it starts and releases them when it finishes, and a parallelizer is started only when it is needed in the runtime execution of the plan.

Here is what v$pq_sesstat shows after running this query.
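(You can reproduce this in your own session with a simple query of the view:)

select statistic, last_query, session_total, con_id
from v$pq_sesstat;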

STATISTIC                      LAST_QUERY SESSION_TOTAL     CON_ID
------------------------------ ---------- ------------- ----------
Queries Parallelized                    1             1          0
DML Parallelized                        0             4          0
DDL Parallelized                        0             0          0
DFO Trees                               1             5          0
Server Threads                          2             0          0
Allocation Height                       2             0          0
Allocation Width                        1             0          0
Local Msgs Sent                        86          3113          0
Distr Msgs Sent                         0             0          0
Local Msgs Recv'd                      90          3123          0
Distr Msgs Recv'd                       0             0          0
DOP                                     2             0          0
Slave Sets                              1             0          0

Even though we used 4 PX servers concurrently it shows we used 2 (Server Threads); this is because v$pq_sesstat only shows stats from the last active parallelizer, which was Line Id 32 in this case.

To find the actual number of PX servers used concurrently you need to look at v$px_session while the statement is running. Here is what that view shows when the statement is running and the second parallelizer is active (my session's SID, which is also the QC SID, was 26 in this case).

select qcsid, sid, server_group, server_set, server#
from v$px_session where qcsid = 26;

     QCSID        SID SERVER_GROUP SERVER_SET    SERVER#
---------- ---------- ------------ ---------- ----------
        26        791            1          1          1 <--- Server Set #1, PX Server #1
        26         14            1          1          2 <--- Server Set #1, PX Server #2
        26        786            1          2          1 <--- Server Set #2, PX Server #1
        26         33            1          2          2 <--- Server Set #2, PX Server #2
        26         26                                    <--- Query Coordinator

This shows that at the time I looked at v$px_session there were 5 sessions working for my query: one was the QC (sid=26) and the other 4 were PX server sessions. There was one parallelizer active (server_group=1) and it had 2 PX server sets (server_set) and 4 PX servers (server#). As the statement proceeds, this view will show the currently active parallelizers and PX servers.

SQL Monitor also nicely shows which parallelizers are active by showing the active operations in the plan.

In this screenshot you can see that only the first parallelizer is active at this time.
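If you do not have access to the graphical SQL Monitor pages, a text version of the report can be pulled while the statement is running; a minimal sketch (substitute the SQL ID of your statement):

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id => :sql_id,     -- the SQL ID of the running statement
         type   => 'TEXT') AS report
FROM dual;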

Multiple concurrent parallelizers

A typical case where multiple parallelizers run concurrently is noncorrelated subqueries. Here is an example showing this case.

SELECT /*+ parallel(2) */ DISTINCT prod_id
FROM sales
WHERE amount_sold >
  (SELECT AVG(amount_sold) FROM sales)
AND quantity_sold >
  (SELECT AVG(quantity_sold) FROM sales);

We have 3 parallelizers in this plan, Line Ids 1, 10, and 16. Here is a timeline of the execution for this plan.

T0: This plan starts with parallelizer #1; since this parallelizer needs 2 PX server sets, it allocates 4 PX servers as the DOP is 2. This parallelizer requires the outputs of the filter subqueries, so now the other parallelizers will be started.

T1: While still keeping the initial 4 PX servers allocated, we now start the second parallelizer (#2). This parallelizer only needs one PX server set, so it allocates 2 PX servers. We now have a total of 6 PX servers concurrently allocated for this query.

T2: When the first subquery finishes running, this second parallelizer is finished, so it releases the 2 PX servers it allocated.

T3, T4: Now the third parallelizer (#3) starts, allocates 2 PX servers and releases them when it finishes.

The first parallelizer now has the results of both subqueries and can continue scanning and filtering the sales table. So, this query starts at most 2 parallelizers concurrently and uses at most 6 PX servers at any point.

UPDATE, 14 Dec 2015: The actual order of allocating PX servers is parallelizer #1, #3, and #2. Please see the comments for this correction.

If we look at v$pq_sesstat after this query ends it reports 8 PX servers (Server Threads), not 6. This is because this view does not show the number of PX servers used concurrently; it shows the accumulated number of PX servers during the execution, even if the same PX servers were released and allocated again.

STATISTIC                      LAST_QUERY SESSION_TOTAL     CON_ID
------------------------------ ---------- ------------- ----------
Queries Parallelized                    1             3          0
DML Parallelized                        0             0          0
DDL Parallelized                        0             0          0
DFO Trees                               3             7          0
Server Threads                          8             0          0
Allocation Height                       2             0          0
Allocation Width                        1             0          0
Local Msgs Sent                       120           296          0
Distr Msgs Sent                         0             0          0
Local Msgs Recv'd                     120           296          0
Distr Msgs Recv'd                       0             0          0
DOP                                     2             0          0
Slave Sets                              4             0          0

Again, if you want to find out the number of PX servers allocated concurrently for statements with multiple parallelizers use the view v$px_session instead.

Implications of multiple parallelizers

Downgrades

A statement with a single parallelizer allocates the required number of PX servers at the start and uses them without releasing them until it finishes, so the number of PX servers throughout the execution is constant. Statements with multiple parallelizers are different: as we saw in the above examples, they allocate PX servers when each parallelizer starts. Since parallelizers can start at different times during the execution, each parallelizer may be running with a different number of PX servers based on the number of available processes in the system. Basically, the rules about DOP downgrades I talked about before apply to each parallelizer individually.

Consider the sequence of execution above. When parallelizer #1 starts it will try to allocate 2 PX servers. Assuming there are enough available PX servers in the system, it will get those processes. When it finishes and releases them, parallelizer #2 will start and try to allocate 4 PX servers. If at that point there are no available PX servers in the system, this parallelizer will run serially. The same is true for the subsequent parallelizers.
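Whether a parallelizer gets the PX servers it asks for therefore depends on what is free in the pool at the moment it starts. One way to watch the pool while a statement runs is a query like this:

select statistic, value
from v$px_process_sysstat
where statistic like 'Servers%';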

Parallel Statement Queuing

Parallel Statement Queuing decides to queue or run a parallel statement based on its DOP and the number of available PX servers. It assumes no statement will use more than 2 PX server sets (thus DOP*2 PX servers). Consider the case where the DOP is 8 and there are 16 PX servers available below parallel_servers_target. Oracle will allow this query to run, but if the statement uses multiple concurrent parallelizers and starts using 24 PX servers, the queuing point will be exceeded. When a statement is picked from the queue and allowed to run, it is free to allocate any number of PX servers it needs. Depending on how many of these kinds of statements are running at the same time, all PX servers in the system can be consumed. This effectively eliminates the benefit of queuing, and statements may start getting downgraded because of a PX server shortage in the system. So, if you are using Parallel Statement Queuing and you see that parallel_servers_target is exceeded, look for statements with multiple parallelizers as possible suspects. We are working to fix this behavior in future releases. Until then, make sure there is enough gap between parallel_servers_target and parallel_max_servers to prevent downgrades.
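A quick way to check how much headroom you currently have between these two parameters:

select name, value
from   v$parameter
where  name in ('parallel_servers_target', 'parallel_max_servers');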

Database Resource Manager DOP limits

We have had some users setting DOP limits using Database Resource Manager (DBRM) and expecting all statements to be limited to DOP*2 PX servers. As of today that expectation does not hold for statements with multiple concurrent parallelizers. For example, even if DBRM limits the DOP to 2, a statement with a DOP of 2 can use 6 PX servers concurrently, as we saw in example #2.
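For reference, this is roughly how such a DOP limit is set with DBRM; a minimal sketch with a hypothetical plan name. Note again that it caps the DOP per parallelizer, not the total number of PX servers a multi-parallelizer statement can use.

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'LIMIT_DOP_PLAN',                 -- hypothetical plan name
    comment => 'Cap parallelism for all consumer groups');
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan                     => 'LIMIT_DOP_PLAN',
    group_or_subplan         => 'OTHER_GROUPS',  -- every plan needs a directive for OTHER_GROUPS
    comment                  => 'DOP limit of 2',
    parallel_degree_limit_p1 => 2);              -- the DOP limit discussed above
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/
-- to make the plan active: ALTER SYSTEM SET resource_manager_plan = 'LIMIT_DOP_PLAN';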

Summary

Here is a short summary of what we have discussed in this post.

1. Multiple parallelizers can run concurrently or nonconcurrently.

2. Each parallelizer allocates PX servers when it starts; the number of PX servers allocated by each parallelizer depends on the number of PX server sets and can be at most DOP*2.

3. v$pq_sesstat may show incorrect information depending on the concurrency of multiple parallelizers, watch the behavior at runtime instead.

4. Each parallelizer may or may not get the required number of PX servers depending on the number of available PX servers in the system.

5. Parallel Statement Queuing assumes each statement will use DOP*2 PX servers; this can cause statements with multiple parallelizers to be allowed to run and exceed the queuing point (parallel_servers_target). To prevent downgrades in this case make sure there is enough gap between parallel_servers_target and parallel_max_servers.

I am planning to cover Parallel Statement Queuing in detail in a future post; we will come back to this topic there, so please come back for that one too.

In each release we are trying to decrease the number of cases that can generate plans with multiple parallelizers. If you are on 11g you will see that some plans with multiple parallelizers will switch to using a single parallelizer when you upgrade to 12c. In the next post I will talk about those cases and compare 11g to 12c in this regard.

Thursday Dec 03, 2015

Looking forward to #ukoug_tech15 conference

Next week I will be in Birmingham at the annual UKOUG Tech conference. I will be presenting on some of the new SQL features that we added in Database 12c to support analysis of big data. To give you a taster of what to expect during my session, here is a sort-of relevant Dilbert cartoon, courtesy of Scott Adams:

Dilbert-Approximate-Answers


 

[Read More]

Tuesday Dec 01, 2015

My highlights from DOAG 2015...

Last week I was in Nuremberg at the excellent annual German Oracle user group conference - DOAG15. This was my first visit to DOAG and I was amazed at the size of the event. It is huge, but most importantly it is very well organized and definitely one of the best, possibly the best, user conferences that I have attended.

...and then there was the chance to drink some Glühwein with the database PM team on the way to the speaker dinner.

Nuremberg

[Read More]

Tuesday Nov 24, 2015

Little things to know about ... Oracle Partitioning (part one of hopefully many)

Oracle Partitioning is one of the most commonly used options on top of Enterprise Edition - if not the most commonly used one, which is, as you can guess, always a discussion in our buildings ;-)

Over the years Oracle Partitioning has matured significantly and become more powerful and flexible. But, as in real life, with power and flexibility always come little things that are good to know (no, that's not an equivalent of complexity). So I am happy to see Connor McDonald blogging about just such a little detail around Interval Partitioning.

Check it out, it's worth it.

About

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
