Friday Jun 24, 2016

Updates to Big Data Lite 4.5

There are some updates to Big Data Lite 4.5 that you will want to take advantage of:
  • Updated Big Data SQL configuration that enables predicate pushdown to engage, which can deliver dramatic performance benefits
  • Update to the Big Data Spatial framework (fixes a configuration issue)
  • New samples for the Oracle Shell for Hadoop Loaders

To get the latest updates, simply double-click the Refresh Samples icon on the desktop.

As an example of the Big Data SQL performance improvement (keep in mind this is the Big Data Lite VM, so the absolute numbers aren't especially meaningful, but they illustrate the performance concepts), I ran a few tests with the following query:

select count(*)  from table
where custid = 1185972
and activity=6
and genreid=7;

I ran this query 3 times against different tables:

  • Avro - not partitioned - 17s
  • Avro - genre partitions - illustrates partition pruning - 1.1s
  • Parquet - not partitioned - illustrates predicate pushdown into parquet - 0.4s
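If you want to confirm that predicate pushdown actually engaged, you can inspect the Big Data SQL external table (XT) session statistics right after running the test query. A hedged sketch; the exact statistic names vary between Big Data SQL releases, so this just filters for anything XT-related:

```sql
-- Run in the same session, immediately after the test query.
-- Statistic names differ between Big Data SQL releases, so we
-- simply list every non-zero XT statistic for this session.
select n.name, s.value
from   v$statname n
join   v$mystat   s on s.statistic# = n.statistic#
where  n.name like '%XT%'
and    s.value > 0
order  by n.name;
```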

The VM doesn't have Big Data SQL cells, so you don't see Smart Scan and storage indexes, which would deliver even more performance benefits.

BDS 3.0.x is a great release.  Take it for a spin!

Thursday Jun 02, 2016

Adaptive Distribution Methods in Oracle Database 12c

In my post about common distribution methods in Parallel Execution I talked about a few problematic execution plans that can be generated when the optimizer statistics are stale or non-existent. Oracle Database 12c brings some adaptive execution features that can fix some of those issues at runtime by looking at the actual data rather than statistics. In this post we will look at one of these features which is about adapting the distribution method on the fly during statement execution.

Adaptive Distribution Methods

One of the problems I mentioned in the earlier post was hash distribution with low cardinality. In that case there were only a few rows in the table but the optimizer statistics indicated many rows because they were stale. Because of this stale information we were picking hash distribution and as a result only some of the consumer PX servers were receiving rows. This made the statement slower because not all PX servers were doing work.

This is one of the problems we are trying to fix by using adaptive distribution methods in 12c. To show what an adaptive distribution method is and how it works I will use the same example from the older post and try to see how it works in 12c.

You can go back and look at the post I linked, but as a reminder here are the tables we used.

create table c as 
with t as (select rownum r from dual connect by level<=10000) 
select rownum-1 id,rpad('X',100) pad 
from t,t 
where rownum<=10;

create table s as 
with t as (select rownum r from dual connect by level<=10000)
select mod(rownum,10) id,rpad('X',100) pad 
from t,t 
where rownum<=10000000;

exec dbms_stats.set_table_stats(user,'C',numrows=>100000);
exec dbms_stats.gather_table_stats(user,'S');

Just like in the 11g example I modified the optimizer statistics for table C to make them stale.

Here is the same SQL statement I used before, this time without optimizer_features_enable set.

select /*+ parallel(8) leading(c) use_hash(s) */ count(*) 
from c,s 

Here is the SQL Monitor report for this query in 12.1.

Rather than picking broadcast distribution for table C based on optimizer statistics like in 11g, here we see that the plan shows another distribution method, PX SEND HYBRID HASH in lines 7 and 12. We also see a new plan step called STATISTICS COLLECTOR. These are used to adapt the distribution method at runtime based on the number of rows coming from table C.
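You can also see these plan steps without a SQL Monitor report by looking at the plan directly with DBMS_XPLAN. A sketch using the tables from this post; the join predicate `c.id = s.id` is an assumption (both tables have an `id` column, but the statement above elides the WHERE clause):

```sql
-- Assumed join predicate: c.id = s.id (not shown in the statement above)
explain plan for
select /*+ parallel(8) leading(c) use_hash(s) */ count(*)
from c, s
where c.id = s.id;

select * from table(dbms_xplan.display);

-- In the output, look for plan lines such as:
--   PX SEND HYBRID HASH
--   STATISTICS COLLECTOR
```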

The query coordinator (QC) at runtime looks at the number of rows coming from table C. If the total number of rows is less than or equal to DOP*2, it decides to use broadcast distribution, as the cost of broadcasting a small number of rows will not be high. If the number of rows from table C is greater than DOP*2, the QC decides to use hash distribution for table C. The distribution method for table S is determined based on this decision: if table C is distributed by hash, so is table S; if table C is distributed by broadcast, table S is distributed by round-robin.

The QC looks at the number of rows from table C at runtime using the statistics collector. Each PX server scanning table C counts its rows using the statistics collector until it reaches a threshold; once the threshold is reached it stops counting and the statistics collector is bypassed. The PX servers return their individual counts to the QC, and the QC makes the distribution decision for both tables.

In this example table C is distributed by broadcast and table S is distributed by round-robin, as the number of rows from table C is 10 and the DOP is 8. You can verify this by looking at the number of rows from table C (line ID 10), which is 10, and the number of rows distributed at line ID 7, which is 80: 10 rows were scanned and 80 rows were distributed, because all 10 rows were broadcast to each of the 8 PX servers. For an easier way to find out the actual distribution method used at runtime, please see an earlier post that shows how to do it in SQL Monitor.

If we look at the Parallel tab we now see that all consumer PX servers perform a similar amount of work, as opposed to some of them staying idle in 11g.
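Besides SQL Monitor, one classic way to see how many rows each PX server sent and received is the v$pq_tqstat view, queried from the same session immediately after the parallel statement completes. A sketch:

```sql
-- Must be run in the same session, right after the parallel query;
-- shows rows and bytes per table queue for producers and consumers.
select dfo_number, tq_id, server_type, process, num_rows, bytes
from   v$pq_tqstat
order  by dfo_number, tq_id, server_type desc, process;
```

With an even distribution, the consumer rows in num_rows should be roughly equal across processes; heavy skew toward a few consumers is the symptom the adaptive methods are designed to avoid.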

Another problem I mentioned before was using hash distribution when the data is skewed. We will look at how Oracle Database 12c solves this problem in a later post.

Wednesday Jun 01, 2016

Big Data SQL Quick Start. NoSQL databases - Part9.

[Read More]

Thursday May 19, 2016

Big Data SQL Quick Start. Data types - Part8.

[Read More]

Tuesday May 17, 2016

Single PX Server Execution

I recently helped a customer tune a few queries in their Oracle Database In Memory POC. I want to talk about a simplified version of one of their queries as it is a nice tuning example and also a good opportunity to talk about a new parallel execution feature introduced in Oracle Database 12c.

Let me start with what the feature is and then look at the performance problem this particular customer was having....

[Read More]

Friday May 13, 2016

New pattern matching tutorial on LiveSQL


If you always wanted to try our new SQL pattern matching feature, MATCH_RECOGNIZE, but never had access to a Database 12c instance, then you really need to check out our great new LiveSQL playground environment. LiveSQL is a great place to learn about all the new features of Database 12c along with all the existing features from earlier releases.

The new tutorial is called “Log file sessionization analysis with MATCH_RECOGNIZE” and you can view it by clicking here. The objective is to introduce you to some of the important keywords and concepts that are part of the MATCH_RECOGNIZE clause.

[Read More]

Monday May 02, 2016

Big Data SQL Quick Start. Partition Pruning - Part7.

[Read More]

Friday Apr 29, 2016

Big Data SQL Quick Start. Predicate Push Down - Part6.

[Read More]

Monday Apr 25, 2016

SQL Pattern Matching Deep Dive - Part 3, greedy vs. reluctant quantifiers

Welcome to the third post in this deep-dive series on SQL pattern matching using the MATCH_RECOGNIZE feature that is part of Database 12c. 

In the first part of this series we looked at a wide range of topics including ensuring query consistency, how to correctly use predicates and how to manage sorting. In the second part we looked at using the built-in measures to understand how a data set is matched to a pattern. 

In this post I am going to review the concepts of greedy and reluctant quantifiers. I will break this down into a number of areas: 1) overview of regular expressions, 2) understanding quantifiers, and 3) greedy vs. reluctant quantifiers. The examples in this post use the built-in measures to help show the difference between greedy and reluctant matching. If you are not familiar with the MATCH_NUMBER() function or the CLASSIFIER() function then please take some time to read the second post in this series.
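As a taste of what follows, here is a hedged sketch contrasting a greedy quantifier (`down*`) with its reluctant form (`down*?`). It assumes a simple ticker table with (symbol, tstamp, price) columns, similar to the one used in the Oracle documentation examples:

```sql
select symbol, tstamp, price, match_num, var_match
from ticker
match_recognize (
  partition by symbol
  order by tstamp
  measures
    match_number() as match_num,
    classifier()   as var_match
  all rows per match
  -- greedy: down* matches as many DOWN rows as possible;
  -- the reluctant form down*? would match as few as possible
  pattern (strt down* up)
  define
    down as price < prev(price),
    up   as price > prev(price)
);
```

The CLASSIFIER() output shows which pattern variable each row was mapped to, which is exactly how you can observe the greedy and reluctant variants mapping rows differently.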

Overview of regular expressions

[Read More]

Thursday Apr 21, 2016

Common Distribution Methods in Parallel Execution

Parallel execution uses the producer/consumer model when executing a SQL statement. The execution plan is divided up into DFOs, each DFO is executed by a PX server set. Data is sent from one PX server set (producer) to another PX server set (consumer) using different types of distribution methods.

In this post we will look at the most common distribution methods used in joins in Oracle Database 11gR2. The distribution method is chosen by the optimizer and depends on the DOP, the cardinality of the tables, the number of partitions, etc...
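While the optimizer normally chooses the distribution method, you can influence it with the PQ_DISTRIBUTE hint. A hedged sketch using hypothetical customers/sales tables:

```sql
-- Force hash/hash distribution for the join to s;
-- (broadcast, none) and (none, broadcast) are other common options.
select /*+ parallel(8) leading(c) use_hash(s) pq_distribute(s hash hash) */
       count(*)
from   customers c, sales s
where  c.cust_id = s.cust_id;
```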

[Read More]

Monday Apr 18, 2016

Big Data SQL Quick Start. Joins. Bloom Filter and other features - Part5.

Using bloom filter for join operations: 

[Read More]

Friday Apr 15, 2016

Is an approximate answer just plain wrong?

We are starting to see a significant change in the way we analyze data as a result of the growth of interest in big data and the newer concept of the Internet of Things. Ever since databases were first created, everyone has been obsessed, quite rightly so, with ensuring queries return the correct answer - i.e. precise, accurate answers. This key requirement is derived from the need to run operational, transactional applications. If we check our bank balance online we want the figure we see to be accurate right down to the last cent, and for good reason. Yet increasingly, both in our online and offline experiences, we deal with numbers that are not 100% accurate, and somehow we manage to make good use of these approximate answers. Here are a couple of examples of where we are already using approximations: route planning on our smartphones and crowd counting information in newspapers...
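Oracle Database 12c itself embraces this idea with APPROX_COUNT_DISTINCT, which trades a small, bounded error for much faster distinct counts. A hedged sketch against a hypothetical sales table:

```sql
-- Exact count: accurate, but can be expensive on very large tables
select count(distinct cust_id) from sales;

-- Approximate count: typically much faster, with a small error margin
select approx_count_distinct(cust_id) from sales;
```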

[Read More]

Thursday Apr 14, 2016

Big Data SQL Quick Start. Security - Part4.

Big Data SQL is a way to access data stored in HDFS through the Oracle RDBMS, using the Oracle external table mechanism. In the context of security, "table" is the key word: it means you can apply standard security approaches to those tables. Today I want to give you a couple of examples with:

  • Oracle Virtual Private Database
  • Oracle Data Redaction

Oracle Data Redaction

I don't want to duplicate the Oracle Data Redaction documentation in this blog - all the details are available there - but let me remind you of the main idea of this feature. Databases very often contain sensitive data, like credit card numbers, SSNs, or other personal information. It can be useful to keep this data in its original form to resolve issues with the billing department, but at the same time other departments (like a call center) may need only part of this information (like the last 4 digits of the credit card), and for security compliance you cannot show them the original data.
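To make the call-center scenario concrete, here is a hedged sketch of a partial redaction policy that leaves only the last 4 digits of a 16-digit card number visible. The schema, table, and column names are hypothetical:

```sql
begin
  dbms_redact.add_policy(
    object_schema       => 'BDSQL',
    object_name         => 'CUSTOMERS_EXT',      -- hypothetical external table
    policy_name         => 'redact_card',
    column_name         => 'CREDIT_CARD',
    function_type       => dbms_redact.partial,
    -- predefined format: mask the first 12 digits of a 16-digit card
    function_parameters => dbms_redact.redact_ccn16_f12,
    expression          => '1=1');               -- apply to every query
end;
/
```

With the policy in place, a call-center user querying the table sees something like ****-****-****-1234 while the stored data is unchanged.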

[Read More]

Wednesday Apr 13, 2016

Oracle OpenWorld 2016 call for papers is OPEN!

Yes, it's that time of year again! If you have a story to tell about data warehousing, big data, and SQL analytics, then we want to hear from you, because the OpenWorld 2016 call for presentations is now open. Mark your calendars: this year the Oracle OpenWorld conference will be held September 18-22, 2016 at the Moscone Center in San Francisco.

We are looking for proposals that describe insights and improvements that attendees can put to use in their own jobs: exciting innovations, strategies to modernize their business, different or easier ways to implement key features, unique use cases, lessons learned, the best of best practices...

[Read More]

Tuesday Apr 12, 2016

SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER()

Welcome to the second post in this deep dive series on SQL pattern matching using the new MATCH_RECOGNIZE feature that is part of Database 12c. In the first part of this series we looked at the areas of ensuring query consistency, how to correctly use predicates and how to manage sorting.

In this post I am going to review the two built-in measures that we have provided to help you understand how your data set is mapped to the pattern that you have defined. This post breaks down into three areas: 1) a review of the built-in measures, 2) understanding how to control the output (number of rows returned), and 3) bringing these two topics together with some examples...

[Read More]

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
