Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

Alexey Filanovskiy
Product Manager

It's very frequent case when somebody talks about Big Data he or she also wants to know how to apply Machine Learning algorithms over this data sets. Oracle Big Data SQL provides the easy and seamless way to do this. Secret of this in Oracle Advanced Analytics (OAA) option, which has been existing for many years. This is the set of the existing algorithms, together with SQL Developer Data Minner interface, allow easily in drag and drop style create advanced models. OAA works over the Oracle DB tables. Big Data SQL allows us to get access to Hadoop data through the External table. Easy!

It's better to see once rather than hear multiple times.

let me give an example. Let's imagine we have some store and for sure we have customers who make purchases there. Roughly customers could be divided into 3 categories: who spend a little money, who spend a lot of money and average customer.

Now, having statistics of the sales and some personal information about customers, we want to understand what is the profile of Big Spender (a person who spend a lot of money in our store).

I simplified my example and I only have two tables:

- Fact of the sales

- Dimension table with customer's info

Big Fact stores data into JSON format on the HDFS. Dimension tables stored into Oracle RDBMS. 

As you may remember from my previous post, Big Data SQL allows us to represent semi-structure (or even structure) data as a table in Oracle Database.

Now we have two tables, which is related by customer key (primary key for the dimension table, foreign key for the fact table).

After this I build the model in SQL Developer:

Data Minner allows us to write SQL inside the model, let's have a look:


This query defines first 5% of customers who spend more money as "big spender". Bottom 5% as "low spender". Others are average customers.

I materialize this aggregate and join it with customers table. After this run Machine learning algorithms (they already created I just use it).

Decision Trees.

Let's have a look at the results:

First is decision tree:

it's hard to see here something. Let's zoom out two nodes which show us Big Spenders (our target): 

From this Node we could conclude, that there is pretty high probability, that Big Spender profile is:

- Female

- With high annual_income (more than 84 500)

- Who lives in either California or Maryland or New Jersey

- Her marital status is Divorced or Married or Separated or Widowed

The second node shows us:

that, there is also pretty high probability, that Big Spender profile is:

- Female

- With high annual_income (more than 84 500)

- Who is younger than 43.5

- Her marital status is Single

Naive Bayes

I also used Naive Bayes classifier in my example:

it shows, that most probably big Spender is:

- Female

- Younger than 34.5

- Single

- With high (more than 118 712 annual income)


The couple of Oracle Advances Analytics and Big Data SQL allows us to:

- Use HDFS as a cheap storage

- Use HDFS as schema-on read storage, which allows us to define the data schema during the read (parse semi-structure data)

- Use Oracle Advanced analytics Drag and Drop intuitive interface for building classification models over Big Data (data stored into Hadoop).

All code and Advanced Analytics models, which have been used in my example available here


Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.