Wednesday Mar 26, 2014

Oracle Big Data Lite Virtual Machine - Version 2.5 Now Available

Oracle Big Data Appliance Version 2.5 was released last week. This release delivers some great new features, including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to Cloudera's Distribution including Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine. Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured, allowing you to begin using the system right away. The following components are included in Oracle Big Data Lite Virtual Machine v2.5:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition
  • Cloudera’s Distribution including Apache Hadoop (CDH4.6)
  • Cloudera Manager 4.8.2
  • Cloudera Enterprise Technology, including:
    • Cloudera RTQ (Impala 1.2.3)
    • Cloudera RTS (Search 1.2)
  • Oracle Big Data Connectors 2.5
    • Oracle SQL Connector for HDFS 2.3.0
    • Oracle Loader for Hadoop 2.3.1
    • Oracle Data Integrator 11g
    • Oracle R Advanced Analytics for Hadoop 2.3.1
    • Oracle XQuery for Hadoop 2.4.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Monday Mar 24, 2014

Demonstration: Auditing Data Access Across the Enterprise

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption and auditing.

Auditing data access across the enterprise - including databases, operating systems and Hadoop - is critically important and oftentimes required for SOX, PCI and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c. 


In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.


The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server-generated Agent Activation Key to complete the activation.


On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

  • HDFS: collects all file access activity
  • MapReduce:  identifies who ran what jobs on the cluster
  • Oozie:  audits who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges and more. Check out the auditor's guide for more details. In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user and any unsuccessful attempts at selecting from any table in any schema.
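
As a sketch, those three policies could be expressed with Oracle's traditional (pre-unified) audit syntax as follows - the exact statements will depend on how auditing is configured in your environment:

AUDIT SELECT ON hr.employees BY ACCESS;                -- all SELECTs on the sensitive table
AUDIT ALL BY hr BY ACCESS;                             -- all statements issued by the HR user
AUDIT SELECT TABLE BY ACCESS WHENEVER NOT SUCCESSFUL;  -- failed SELECTs on any table in any schema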

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop:


In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. And, they are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to appropriate security administrators.

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

  • an Oozie workflow that removes salary data from HDFS
  • numerous HDFS commands that upload files, change file access privileges, copy files and list the contents of directories and files
  • Hive commands that query, create, alter and drop tables
  • Oracle Database commands that connect as different users, create and drop users, select from tables and delete records from a table (a sketch of this last category follows below)
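
As a sketch, the database portion of these scripts might contain statements along the following lines (user names, passwords and queries are illustrative only):

CONNECT hr/hr
SELECT first_name, salary FROM hr.employees;   -- audited: SELECT on HR.EMPLOYEES

CONNECT drevil/drevil
SELECT salary FROM hr.employees;               -- fails without privileges and is audited
                                               -- as an unsuccessful SELECT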

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see our alert has been triggered by the users' activity.


Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS:

[Image: alert details]

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enable you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports delivered out of the box with Audit Vault, you can create your own custom reports to meet your specific needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

[Image: Hadoop monitoring report]

As you can see, the report tells you all of the key elements required to understand the activity: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the IP address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity, including DrEvil's failed attempts to open the sensitive salaries.txt file.

Notice that events may be related to one another. The Hive command "ALTER TABLE my_salarys RENAME TO my_salaries" will generate two events. The first event is sourced from the Metastore: the alter table command is captured and the metadata definition is updated. The Hive command also impacts HDFS, because the table name is represented by an HDFS folder. Therefore, an HDFS event is logged that renames the "my_salarys" folder to "my_salaries".

Next, consider an Oozie workflow that performs a simple task: delete a file "salaries2.txt" in HDFS. This Oozie workflow generates the following events:


  1. First, an Oozie workflow event is generated indicating the start of the workflow.
  2. The workflow definition is read from the "workflow.xml" file found in HDFS.
  3. An Oozie working directory is created.
  4. The salaries2.txt file is deleted from HDFS.
  5. Oozie runs its clean-up process.

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. Its flexible reporting allows you to sequence these independent events into a logical series of related activities.

The reporting focus so far has been on Hadoop, but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But what other unsuccessful events have occurred recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours:


Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS - but also in the Oracle Database.


Security and integration with the rest of the Oracle ecosystem are two table stakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Built-in sorting optimizations to support analytical SQL

One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

Sort Optimization 1: Ordering Groups

Many analytical functions include a PARTITION BY and/or an ORDER BY clause, both of which by definition imply that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause, this can create situations where a lot of different sorts are needed. For example, if we have a SQL statement that includes the following:

Rank()   Over (Partition by x    Order by w)
Sum(a)   Over (Partition by w, x Order by z)
Ntile(4) Over (Partition by x    Order by y)
Sum(b)   Over (Partition by x, y Order by z)

this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.

The first step is to create the notion of "Ordering Groups". This optimization looks for ways to group together sets of analytic functions that can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups, which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

[Image: the four analytic functions collapsed into two ordering groups]

Rank() and Sum(a) can be evaluated from a single sort on (x, w, z), and Ntile(4) and Sum(b) from a single sort on (x, y, z). This allows us to reduce the original list of sorts down from four to just two.

Sort Optimization 2: Eliminating Sorts

We can further reduce the number of sorts that need to be performed by carefully scheduling the execution so that:

  • Ordering groups whose sort matches that of the GROUP BY execute first (immediately after the GROUP BY)
  • Ordering groups whose sort matches that of the ORDER BY execute last (immediately before the ORDER BY)

In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 
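
For example, in the sketch below (using the classic EMP table for illustration) the analytic function's PARTITION BY and ORDER BY keys match the GROUP BY keys, so the sort performed for the GROUP BY can be reused and no additional sort is needed for the analytic function:

SELECT deptno, job, SUM(sal) AS total_sal,
       RANK() OVER (PARTITION BY deptno ORDER BY job) AS job_rank
FROM emp
GROUP BY deptno, job;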

Sort Optimization 3: RANK Predicates

Where a SQL statement includes RANK() functions, there are additional optimizations that kick in. Instead of sorting all the data, computing the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.
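
For example, in a classic Top-N query such as the sketch below (again using the EMP table for illustration), the rnk <= 10 predicate can be evaluated during the sort itself, so rows that can never make the top ten are discarded early rather than being fully sorted:

SELECT empno, sal
FROM (SELECT empno, sal,
             RANK() OVER (ORDER BY sal DESC) AS rnk
      FROM emp)
WHERE rnk <= 10;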


Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 

Friday Mar 21, 2014

Open World 2014 - guidelines for call-for-papers…

[Image: Oracle OpenWorld banner]

Most of you will already have received an email from the OOW team announcing the call for papers for this year's conference. Each year, customers ask me how they can increase their chances of getting their paper accepted. Well, I am going to start by stating that product managers have absolutely no influence over which papers are accepted - even mentioning that a product manager will be co-presenting with you will not increase your chances!

So how do you make sure that your presentation title and abstract catch the eye of the selection committee? Here is my top 10 list of guidelines for submitting proposals:

1) Read the "call-for-papers" carefully and follow its instructions - even if you have submitted presentations for lots of Oracle conferences it is always a good idea to carefully read the call for papers and to make sure you follow the instructions. There is an excellent section towards the end of the call-for-papers web page entitled "Tips and Guidelines".

2) Address the theme of the conference - if the theme is available when the call for papers is announced, then try to address it within your abstract.

3) Address the key data warehouse focus areas - for this year's OOW 2014 the key focus areas for data warehousing will be partitioning, analytical SQL, parallel execution, workload management and logical data warehouse. If possible try to include one or more of these focus areas within your abstract.

4) Have a strong biography - You need to use your biography to differentiate and build credibility. This is an important topic because it allows you to differentiate yourself from all the other presenters who are trying to get speaking slots. Your biography must explain why you are an authority on the topic you have chosen for your presentation and why people will want to listen to what you have to say.

5) Have a strong business case - build your presentation around a strong business case, relevant to your industry and/or your target audience (DBAs, developers, architects etc). Try to explain in clear and simple terms the problem you needed to solve, how you solved it using Oracle technology and the direct technical/business benefits.

6) Make the title and abstract interesting - your title and abstract must be easy to read, and you should introduce your main idea as early as possible. Review the titles and abstracts from previous conferences as a guide. Ideally, make the issue relevant to the delegates attending OOW, get to the point, and make sure it is easy to read.

7) Look at previous presentations - the content catalog for last year's conference is available online; see here: You can review all the titles and abstracts that were accepted and use them as guidelines for creating your own title and abstract for this year's conference.

8) Write clear outcomes - The majority of the best presentations have clearly stated outcomes. What do you expect that conference attendees will be able do or know at the end of your session? Consider including a sentence at the end of your abstract such as the following: “At the end of this presentation, participants will be able to . . . .”

9) Don't submit your paper right away - once you have a title and abstract, show it to a few colleagues and get some feedback. You probably know many people who'd be happy to give you ideas on making your paper better.

10) Keep the number of submissions low - you do not increase your chances of getting a paper accepted by submitting lots of different papers.

I cannot guarantee you success if you follow these guidelines, but I hope they prove helpful. Good luck with your submission(s) and I look forward to seeing you at this year's OpenWorld in San Francisco.


Wednesday Mar 19, 2014

Announcing Encryption of Data-at-Rest on Big Data Appliance

With the release of Big Data Appliance software bundle 2.5, BDA completes the encryption story underneath Cloudera CDH. BDA already came with network encryption, ensuring no network sniffing can be applied between the nodes; it now adds encryption of data-at-rest.

A Brief Overview

Encryption of data-at-rest can be done in two modes. One mode leverages the Trusted Platform Module (TPM) on the motherboard to provide a key to encrypt the data on disk. This mode does not require a password or passphrase but relies on the motherboard. The second mode leverages a passphrase, which in turn is used to generate a public-private key pair with OpenSSL. The key pair is encrypted as well.

The passphrase encryption has a few more interesting aspects. For one, it requires the passphrase to be entered upon rebooting the system; leveraging the TPM option does not require any manual intervention at reboot. On Big Data Appliance it is possible to regularly change the passphrase without impacting the encryption or requiring re-encryption of the data.

Neither encryption method affects user access to user data. In other words, on an unprotected cluster a user who can read data before encryption will be able to read data after encryption. The goal is to ensure data is protected on physical media - for example after theft or incorrect disposal of a disk. Both forms protect from that, but only passphrase-based encryption protects from disposal or theft of an entire server.

On BDA, it is possible to switch between these two methods. This does have an impact on the running cluster, as data needs to be re-encrypted. For this step the cluster will be down; however, data is not duplicated, so there is no need to reserve double the space to do the re-encryption.

How to Encrypt Data

As with all installations or changes on Big Data Appliance, you will leverage Mammoth to do the install with encryption or to make changes to the system if you are already in production. Before you set up either of the two modes of data-at-rest encryption, you should consider your requirements. Changing the mode - as described above - is possible, but will require the cluster to be down for re-encryption.

Full Set of Security Features

Out-of-the-box encryption is yet another feature that is specific to Oracle Big Data Appliance. On top of pre-configured Kerberos, Apache Sentry and Oracle Audit Vault, encryption now adds another security dimension. To read more about the full set of features, start here.

Thursday Mar 13, 2014

Video: Big Data Connectors and IDH (Strata)

The certification of Oracle Big Data Connectors on Intel Distribution for Hadoop is now complete (see our previous post). This video from Strata gives you a nice overview of IDH and BDC.

Friday Mar 07, 2014

Intel® Distribution for Apache Hadoop* certified with Oracle Big Data Connectors

Intel partnered with Oracle to certify compatibility between Intel® Distribution for Apache Hadoop* (IDH) and Oracle Big Data Connectors*. Users can now connect IDH to Oracle Database with Oracle Big Data Connectors, taking advantage of the high-performance, feature-rich components of that product suite. Applications on IDH can leverage the connectors for fast load into Oracle Database, in-place query of data in HDFS with Oracle SQL, analytics in Hadoop with R, XQuery processing on Hadoop, and native Hadoop integration within Oracle Data Integrator.

Read the whole post here.

Thursday Feb 06, 2014

Sessionization analysis with 12c SQL pattern matching is super fast

Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:

Most of these have been related to explaining the basic concepts along with some specific use cases.

In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching, such as stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenario is likely to be the analysis of sessionization data, as this is currently a hot topic when we start considering machine data and, in more general terms, big data.

To help us create a meaningful test data set we decided to use the TPC-H schema, because it contained approximately seven years of data, which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.

Analysis of Sessionization Data

To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:

Time_Stamp User
1 Mary
2 Sam
3 Richard
11 Mary
12 Sam
13 Richard
22 Sam
23 Mary
23 Richard
32 Sam
33 Richard
34 Mary
43 Richard
43 Sam
44 Mary
47 Sam
48 Sam
53 Mary
54 Richard
59 Sam
60 Sam
63 Mary
63 Richard
68 Sam
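
For reference, the code examples below assume a simple Events table along these lines (the column types are illustrative only):

CREATE TABLE events (
  time_stamp NUMBER,        -- event time in seconds; the session gap threshold is 10
  user_id    VARCHAR2(30)   -- the partition key for sessionization
);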

The following sections compare the use of 11g window functions with the 12c MATCH_RECOGNIZE clause.
Part 2 - To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective of this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:


The 12c SQL to create the initial result set is as follows:

SELECT user_id, session_id, start_time, no_of_events, duration
FROM Events MATCH_RECOGNIZE
     ( PARTITION BY User_ID
       ORDER BY Time_Stamp
       MEASURES match_number() session_id,
                count(*) as no_of_events,
                first(time_stamp) start_time,
                last(time_stamp) - first(time_stamp) duration
       PATTERN (b s*)
       DEFINE s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10)
     );

As a comparison, here is how to achieve the same result using 11g analytical window functions:

CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
       Sum(Session_Increment)
         over (partition by User_ID order by Time_Stamp asc) Session_ID
FROM ( SELECT Time_Stamp, User_ID,
              CASE WHEN (Time_Stamp - Lag(Time_Stamp)
                           over (partition by User_ID order by Time_Stamp asc)) < 10
                   THEN 0 ELSE 1 END Session_Increment
       FROM Events);

SELECT User_ID, Session_ID, Min(Time_Stamp) Start_Time,
       Count(*) No_Of_Events, Max(Time_Stamp) - Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;

As you can see, the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand, but it produces the same output - i.e. our initial sessionized data set.

Part 3 - However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to be by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:

  • Within-partition Session_ID
  • Number of events in a session
  • Total duration 

To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards, or 2) we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:

select count(*)
from ( select /* mr_sessionize_prev */ *
       from ( select o_pbykey, session_id, start_time, no_of_events, duration
              from orders_v MATCH_RECOGNIZE
                   ( PARTITION BY o_pbykey
                     ORDER BY O_custkey, O_Orderdate
                     MEASURES match_number() session_id, count(*) as no_of_events,
                              first(o_orderdate) start_time,
                              last(o_orderdate) - first(o_orderdate) duration
                     PATTERN (b s*)
                     DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
                   )
            )
       where No_Of_Events >= 20
     );

Here is code based on using the NEXT() function to compare the current record against the next record:

select count(*)
from ( select /* mr_sessionize_next */ *
       from ( select o_pbykey, session_id, start_time, no_of_events, duration
              from orders_v MATCH_RECOGNIZE
                   ( PARTITION BY o_pbykey
                     ORDER BY O_custkey, O_Orderdate
                     MEASURES match_number() session_id, count(*) as no_of_events,
                              first(o_orderdate) start_time,
                              last(o_orderdate) - first(o_orderdate) duration
                     PATTERN (s* e)
                     DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
                   )
            )
       where No_Of_Events >= 20
     );

Finally, we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses window functions (and which, in my opinion, is a lot more complex):

select count(*)
from (
  select /* wf */ *
  from (select o_pbykey,
               min(O_Orderdate) Start_Time,
               count(*) No_Of_Events,
               (max(O_Orderdate) - min(O_Orderdate)) Duration
        from (select O_Orderdate, O_Custkey, o_pbykey,
                     sum(Session_Increment)
                       over (partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
              from (select O_Orderdate, O_Custkey, o_pbykey,
                           case when (O_Orderdate -
                                      lag(O_Orderdate)
                                        over (partition by o_pbykey
                                              order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                                then 0 else 1 end Session_Increment
                    from orders_v))
        group by o_pbykey, Session_ID)
  where No_Of_Events >= 20
);

The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
[Image: the final sessionized result set]

Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT()) are shown below. Please note that on the graph the X-axis shows the number of partitions within each test run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general, the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster than the equivalent window function code, which is good news if you are looking for a reason to upgrade to Database 12c.

[Image: MATCH_RECOGNIZE performance results]

Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause offers excellent scalability and continues to deliver excellent performance. So it performs well and scales well as your data volumes increase.

However, it is important to remember that the 11g window function code shows similarly excellent scalability and performance. If you are using 11g at the moment and have not considered using Oracle Database to run your sessionization analysis, then it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. If you need a reason to upgrade to Database 12c, then MATCH_RECOGNIZE does offer significant performance benefits if you are doing pattern matching operations, either inside Oracle Database 11g or using an external processing engine.

Lastly, when you are designing your own MATCH_RECOGNIZE implementations and you are using the NEXT() or PREV() functions, it is worth investigating whether using the alternate function offers any significant performance benefits. Obviously, much will depend on the nature of the comparison you are trying to formulate, but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.

In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email.
Have fun with MATCH_RECOGNIZE….

Wednesday Feb 05, 2014

OTN Virtual Developer Day Database 12c content now available on-demand

Thank you to everyone who attended the SQL pattern matching session during yesterday's OTN Virtual Developer Day event. We had a great crowd of people join our live workshop session. I hope everyone enjoyed using the amazing platform which the OTN team put together to host the event.  

The great news is that all the content from the event is now available for download, and you can watch all the on-demand videos from the four tracks (Big Data DBA, Big Data Developer, Database DBA and Database Developer).

The link to the fantastic OTN VDD platform is here: and this is what the landing page looks like:


This page will give you access to the keynote session by Tom Kyte and Jonathan Lewis, which covered the landscape of Oracle DB technology evolution and adoption. The content looks at what's next for Oracle Database 12c, examining the high-value technologies and techniques that are driving greater database efficiencies and innovation.

You will be able to access the videos, slides from each presentation and a huge range of technical hands-on labs covering big data and database technologies, including my SQL Pattern Matching workshop. If you want to download the VirtualBox image for the Database tracks it is available here: (this contains everything you need to run my SQL Pattern Matching workshop).

While you are doing the workshop, if you have any questions then please feel free to email me -


Monday Jan 27, 2014

Announcing: Oracle Big Data Lite Virtual Machine

You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.

[Image: Oracle Big Data Lite]

Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox image that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It's been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB of memory to run (this means that your computer should have at least 8GB of total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.

How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:


There will be 45 minute sessions delivered by product experts (from both Oracle and Oracle Aces) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution". Some of the big data technical sessions include:

  • Oracle NoSQL Database Installation and Cluster Topology Deployment
  • Application Development & Schema Design with Oracle NoSQL Database
  • Processing Twitter Data with Hadoop
  • Use Data from a Hadoop Cluster with Oracle Database
  • Make the Right Offers to Customers Using Oracle Advanced Analytics
  • In-DB Map Reduce with SQL/Hadoop
  • Pattern Matching in SQL

Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!

FREE OTN virtual workshop - Learn about SQL pattern matching with Oracle Database 12c.

[Image: OTN Virtual Developer Day]

Make sure you are free on Tuesday, February 4, because the OTN team is hosting another of their virtual developer day events. Most importantly, it is FREE. Even more important is the fact that I will be running a 12c pattern matching workshop at 11:45am Pacific Time. Of course there are lots of other sessions that you can attend relating to big data and Oracle Database 12c, and the OTN team has created two streams to help you learn about these two important areas:

  • Oracle Database application development — Learn expert tips and tricks on how to develop applications for Oracle Database 12c and Big Data environments more effectively.
  • Oracle Database platform deployment processes — From integration, to data migration, experts showcase new capabilities in Oracle 12c and Big Data environments that will allow you to deliver greater database performance and integration.

You can sign-up for the event and pick your tracks and sessions via this link:

My pattern matching session is included in the Oracle 12c DBA section of the application development track and the workshop will cover the following topics:

  • Part 1 - Introduction to SQL Pattern Matching
  • Part 2 - Pattern Match: simple example
  • Part 3 - How to use built-in measures
  • Part 4 - Searching for more complex patterns
  • Part 5 - Deep dive into how SQL Pattern Matching works
  • Part 6 - More Advanced Topics

As my session is only 45 minutes long I am only going to cover the first three topics and leave you to work through the last three topics in your own time. During the 45 minute workshop I will be available to answer any questions via the live Q&A chat feature.

There is a link to the full agenda on the invitation page. The OTN team will be providing a Database 12c Virtualbox VM that you will be able to download later this week. For the pattern matching session I will be providing the scripts to install our sample schema, the slides from the webcast and the workshop files which include a whole series of exercises that will help you learn about pattern matching and test your SQL skills. 

The big data team has kindly included my pattern matching content inside their Virtualbox image so if you want to focus on the sessions offered on the big data tracks but still want to work on the pattern matching exercises after the event then you will have everything you need already installed and ready to go!

Don't forget to register as soon as possible and I hope you have a great day…Let me know if you have any questions or comments.

Friday Jan 24, 2014

How to create more sophisticated reports with SQL

[Image: SQL pivot cube report]

This is a continuation of a series of posts that cover Oracle's SQL extensions for reporting and analysis. As reviewed in earlier blog posts (Part 1 and Part 2), we have extended SQL's analytical processing capabilities by introducing a family of aggregate and analytic SQL functions. Over the last couple of days I have been exploring how to use some of these SQL features to create tabular-style reports/views. In the past, when I worked as a BI Beans product manager, I would typically build these types of reports using OLAP cubes and/or Java code. It has been very interesting to work through some of my old report scenarios and transfer my Java aggregation processing back into the database, using SQL to do all the heavy lifting without having to resort to low-level coding.

[Read More]

Friday Jan 17, 2014

StubHub's Data Scientists reap benefits of integrated approach….

We have released yet another great customer video, this time with StubHub.

Many customers are still pulling data out of their data warehouse and shipping it to specialised processing engines so they can mine their data, run spatial analytics and/or build multi-dimensional cubes. The problem with this approach, as the team at StubHub points out, is that typically when you move the data to these specialised engines you have to work with a subset of the data that is sitting in your data warehouse. When you work with a subset of data you immediately start to impose compromises on your analytical workflows. If you can't work with all your data then you can't be sure that your analytical model is as good as it could be, and that could mean losing customers or missing out on additional revenue.

The other problem comes from everyone using their own favourite tool to do their analysis: how do you share your discoveries, how do you develop a high level of corporate-wide analytical skills?

StubHub asked Oracle to help them resolve these two key problems...

[Read More]

Monday Jan 06, 2014

CaixaBank deploys new big data infrastructure on Oracle

CaixaBank is Spain’s largest domestic bank by market share, with a customer base of 13.7 million. It is also Spain’s leading bank in terms of innovation and technology, and one of the most prominent innovators worldwide. CaixaBank was recently awarded the title of World’s Most Innovative Bank at the 2013 Global Banking Innovation Awards (November 2013).

Like most financial services companies, CaixaBank wants to get closer to its customers by collecting data about their activities across all the different channels (offices, internet, phone banking, ATMs, etc.). In the old days we used to call this CRM, and then this morphed into the "360-degree view", etc. While many companies have delivered these types of projects, and customers feel much more connected and in control of their relationship with their bank, the capture of streams of big data has the potential to create another revolution in the way we interact with our bank. What banks like CaixaBank want to do is capture data in one part of the business and make it available to all the other lines of business as quickly as possible.

Big data is allowing businesses like CaixaBank to significantly enhance the business value of their existing customer data by integrating it with all sorts of other internal and external data sets. This is probably the most exciting part of big data, because the potential business benefits are really only constrained by the imagination of the team working on these types of projects. However, that in itself does create problems in terms of securing funding and ownership of projects, because the benefits can be difficult to estimate. This is where all the industry use cases, conference papers and blog posts can help, by providing insight into what is going on across the market in broad general terms.

To help them implement a strategic Big Data project, CaixaBank has selected Oracle for the deployment of its new Big Data infrastructure. This project, which includes an array of Oracle solutions, positions CaixaBank at the forefront of innovation in the banking industry. The new infrastructure will allow CaixaBank to maximize the business value from any kind of data and embark on new business innovation projects based on valuable information gathered from large data sets. Projects currently under review include:

  • Development of predictive models to improve customer value
  • Identifying cross-selling and up-selling  opportunities
  • Development of personalized offers to customers
  • Reinforcement of risk management and brand protection services
  • More powerful fraud analysis 
  • General streamlining of current processes to reduce time-to-market
  • Support new regulatory requirements

The Oracle solution (including Oracle Engineered Systems, Oracle Software and Oracle Consulting Services) consists of the implementation of a new Information Management Architecture that provides a unified corporate data model and new advanced analytic capabilities. (For more information about how Oracle's Reference Architecture can help you integrate structured, semi-structured and unstructured information into a single logical information resource that can be exploited for commercial gain, click here to view our whitepaper.)

The importance of the project is best explained by Juan Maria Nin, CEO of CaixaBank:

“Business innovation is the key for success in today’s highly competitive banking environment. The implementation of this Big Data solution will help CaixaBank remain at the forefront of innovation in the financial sector, delivering the best and most competitive services to our customers”.

The Oracle press release is here:

Friday Dec 20, 2013

BAE Systems Choose Big Data Appliance for Critical Projects

Here's another great story about how to use data warehousing and big data technologies to solve real-world problems with diverse data sets using Oracle technology. BAE Systems is taking unstructured, semi-structured, operational and social media data and using it to solve complex problems such as financial crime, cyber security and digital transformation. The volumes of data that BAE deals with are very large, and this creates its own set of challenges and problems in terms of optimising hardware and software to work efficiently and effectively together. Although BAE had their own in-house Hadoop experts, they chose Oracle Big Data Appliance for their Hadoop cluster because it's easier, cheaper, and faster to operate.

BAE is working with many telco customers to explore the new areas that are being opened up by the use of big data to manage browsing data and call record data. These data sources are being transformed to provide additional insight for the network operations teams, analysis of customer quality and to drive marketing campaigns.




Click on the image to watch the video, or click here:


The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

