Thursday May 15, 2014

DBAs Guide to Sandboxes vs. Data Marts

I had an interesting response to my first post on the topic of sandboxing (DBA's Guide to Deploying Sandboxes in the Cloud). The following question was asked: what is the difference between a data mart and sandbox?

This is actually a great question so I thought it would be useful to convert my answer into a short blog post. I am sure there will be lots of different opinions on this topic just as there are alternative names for "sandbox environment" (from analytical sandbox, to analytical appliance to discovery zone etc etc) but here is my attempt at an answer:

OLAP1 IndustryDataModels

In my experience data marts tend to be a single subject area data repository and/or linked to a specific corporate application (such as finance, HR, CRM, ERP, logistics, sales tracking etc). The source data is pushed to a specific line of business for analysis. The push and loading processes implements all the necessary data cleansing and transformation routines so the data arrives into its destination schema ready for use. Most importantly, the data push happens on a regular basis and is driven by the needs of the business.  Many customers implement a data lifecycle management workflow to ensure that sufficient historical data is available to support the required analysis. In many cases the life of the data mart is largely open-ended and the IT team will ensure that regular backups and all the usual patching and maintenance operations are performed on a regular basis. Where the mart is seen as a mission critical system then high-availability features can be implemented and in extreme cases a disaster recovery site is setup and data synchronised between the production and DR systems.

The schema itself is typically organized to support reporting requirements and will be based around the standard relational models such as star and/or snowflake schemas although this is not a mandatory requirements. Sometimes a 3NF approach is required to support the particular needs of the business. The majority of queries within the mart are "well-defined" and "well-known" and subject to tuning and monitoring by the DBA team. 

A "sandbox" is generally meant as a non-operational environment where business analysts and data scientists can test ideas, manipulate data and model "what if" scenarios without placing an excessive computational load on the core operational processes. It has a finite life expectancy so that when timer runs out the sandbox is deleted and the associated discoveries are either incorporated into the enterprise warehouse, or data mart, or simply abandoned. The primary driver from an organisational perspective is to use a 'fail-fast" approach. At any one point in time an organization might be running any number of analytical experiments spread across hundreds of sandboxes. However, at some point in time those experiments will be halted and evaluated and the "hardware" resources being consumed will be returned to a general pool for reuse by existing projects or used to create environments for new projects. 

In general terms a sandbox environment is never patched or upgraded, except in exceptional circumstances. There is never a great urgency to apply software or operating system patches so the ITM team will  just incorporated these tasks into the normal cycles.

A sandbox should never be considered mission critical so there is no need to implement high availability features or build and manage a DR environment. If a sandbox becomes unavailable due to a fault (hardware or software) there is no pressing urgency to resolve the issue - in Oracle parlance a "sandbox going down" is not a P1 issue for support.

Below is a summary of how I view the differences between marts and sandboxes:

Business Centric Attributes

Data Mart


Business scope

Single subject area

Potentially a mixing pot of data sourced from multiple systems

Core objective

Managing the business

Discovery of new products, markets and/or customer segments

Query scope

Batch reporting for dashboards and pre-configured reports along with limited ad-hoc analysis

Ad –hoc data discovery

Frequency of update

Regular, scheduled data loads

One-off and ad-hoc loads as required

Data Volumes

Driven by external factors such as GRC requirements

Driven by needs of project

Data Quality

Very important – data is fully cleansed and transformed during load process

Raw data is loaded, transformations, cleansing and enhancements are incorporated into discovery process

Typical Output

Historical reports, KPIs, scorecards, multi-dimensional analysis

Data mining models: forecasts, predictions, scoring

Sophisticated analytics (aggregations, spatial, graph etc)

Typical query patterns

Pre-defined patterns returning small data sets, easily tuned

Complex ad hoc queries over massive data volumes

IT Centric Attributes

Data Mart


Mission critical



Performance SLAs



Individual/LOB chargeback for resource usage



HA features/DR Site




Full + incremental



As required to resolve specific issues along with scheduled maintenance programs

Only when necessary

Use of beta software


Possibly if project needs specific features

Life expectancy


Limited – typically 90 days or less

Number of instances

Most companies choose to
implent a very small number of subject-specific marts

Large companies likely to have many hundreds of sandoxes running at any given point in time

Table: Key differences between data mart and sandbox

As you can see from this table, in some ways sandboxes are similar to data marts and in other ways they are not. For me, the key difference is in the life expectancy - a sandbox should never outstay its welcome. The best sandbox environments that I have come across are those where strict time limits are enforced on their duration. If you let a sandbox live on too long then you run the danger of it morphing into a shadow data mart and that is a very dangerous situation if you look at the attributes and descriptions listed in the first table (Business Centric Attributes). 

So why is there so much confusion about the differences between marts and sandboxes? Much of this is down to niche vendors trying to jump on specific marketing bandwagons. At the moment the latest marketing bandwagon is the concept of "analytical databases" which in reality is nothing more than a data mart (and in many cases these vendors are simple peddling highly specialised data silos). These niche vendor platforms are simply not designed to run hundreds of environments with resources being continually returned to a centralised pool for redistribution to existing or new projects - which is a core requirement for effective sandboxing.

Over the next 2-3 months I am will share why and how the unique features of Oracle Database 12c provide the perfect platform for supporting environments running hundreds of thousands of sandbox-driven projects.

Tuesday May 13, 2014

Announcing: Big Data Lite VM 3.0

Last week we released Big Data Lite VM 3.0. It contains the latest update on the VM for the entire stack.

Oracle Enterprise Linux 6.4
Oracle Database 12c Release 1 Enterprise Edition ( with Oracle Advanced Analytics, Spatial & Graph, and more
Cloudera’s Distribution including Apache Hadoop (CDH 5.0)
Oracle Big Data Connectors 3.0
        Oracle SQL Connector for HDFS 3.0.0
        Oracle Loader for Hadoop 3.0.0
        Oracle Data Integrator 12c
        Oracle R Advanced Analytics for Hadoop 2.4.0
        Oracle XQuery for Hadoop 3.0.0
Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.5)
Oracle JDeveloper 11g
Oracle SQL Developer 4.0
Oracle Data Integrator 12cR1
Oracle R Distribution 3.0.1

The download page is on OTN in its usual place.

Friday May 09, 2014

New 12c sessionization analytics workshop now available on OLL

I have just uploaded a new workshop on sessionization analytics using the 12c pattern matching feature, MATCH_RECOGNIZE, to the Oracle Learning Library. The workshop is based on analysis of the log files generated by our the Big Data Lite Movieplex application, which is part of our Big Data Lite virtual machine. Oracle Movieplex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalised experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable. 

All the activity from our application is captured in a log file and we are going to analyze the data captured in that file by using SQL pattern matching to create a sessionization result set for our business users and data scientists to explore and analyze. The sections in the workshop (I have recorded a video of this workshop, see links below) will step you through the process of creating our sessionization result set using the Database 12c pattern matching features.

BD VM sessionization

The workshop and video are available on the Oracle Learning Library using the following links:

For more information (whitepapers, multi-media Apple iBooks, tutorials etc) about SQL pattern matching and analytical SQL then checkout our home page on OTN:


Tuesday Apr 29, 2014

Oracle Data Warehouse and Big Data Magazine April Edition for Customers + Partners

Follow us on Facebook Twitter Blogger
Oracle Data Warehouse and Big Data Magazine APRIL Edition for Customers + Partners

The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. The content for this magazine is taken from the various data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the April Edition

Please share this link to our magazine with your customers and partners

This magazine is optimized for display on tablets and smartphones using the Flipboard App which is available from the Apple App store and Google Play store

Monday Apr 28, 2014

DBAs Guide to Deploying Sandboxes in the Cloud


The need for a private, secure and safe area for data discovery within the data warehouse ecosystem is growing rapidly as many companies start investing in and investigating "big data". Business users need space and resources to evaluate new data sources to determine their value to the business and/or explore news way of analyzing existing datasets to extract even more value.  These safe areas are most commonly referred to as "Sandboxes" or "Discovery Sandboxes" or "Discovery Zones".  If you are not familiar with the term then Forrester Research defines a "sandbox" as:

“data exploration environment where a power user can analyse production […] with near complete freedom to modify data models, enrich data sets and run the analysis whenever necessary, without much dependency on IT and production environment restrictions.” *1

These sandboxes are tremendously useful for business users because they allow them to quickly and informally explore new data sets or new ways of analyzing data without having to go through the formal rigour normally associated with data flowing into the EDW or deploying analytical scripts within the EDW. They provide business users with a high degree of freedom. The real business value is highlighted in a recent article by Ralph Kimball:

In several of the e-commerce enterprises interviewed for this white paper, analytic sandboxes were extremely important, and in some cases hundreds of the sandbox experiments were ongoing simultaneously.

As one interviewee commented “newly discovered patterns have the most disruptive potential, and insights from them lead to the highest returns on investment" *2

Key Characteristics

So what are they key characteristics of a sandbox? Essentially there are three:

  1. Used by skilled business analysts and data scientists
  2. Environment has fewer rules of engagement
  3. Time boxed

Sandboxes are not really designed to be used by CIOs or CEOs or general BI users. They are designed for business analysts and data scientists who have a strong knowledge of SQL, detailed understanding of the business and the source data that is being evaluated/analyzed. As with many data exploration projects you have to be able to understand the results that come back from a query and be able to determine very quickly if they make sense.

As I stated before, the normal EDW rules of engagement are significantly relaxed within the sandbox and new data flowing into the sandbox is typically disorganised and dirty. Hence the need for strong SQL skills to create simplified but functional data cleaning and transformation scripts with the emphasis being to make new data usable as quickly as possible. Part of the "transformation" process might be to generate new data points derived from existing attributes. A typical example of this is where a data set contains date-of-birth information, which in itself is quite a useful piece of information, that can be transformed to create a new data point of "age". Obviously the business analysts and data scientist need to be reasonably proficient in SQL to create the required transformation steps - it is not a complicated process but it highlights the point that the business community needs to have the necessary skills so that they are self-sufficient.

Most importantly the sandbox environment needs to have a time limit. In the past this is where most companies have gone wrong! Many companies fail to kill off their sandboxes. Instead these environments evolve and flourish into shadow marts and/or data warehouses which end up causing havoc as users can never be sure which system contains the correct data. Today, most enlightened companies enforce a 90-day timer on their sandboxes. Once the 90 day cycle is complete then ownership of the processes and data are either moved over to the EDW team, who can then start to apply the corporate standards to the various objects and scripts, or the environment and all its data is simply dropped.

The only way a business can support the hundreds of live sandbox experiments described in Kimball's recent report (*2) is by enforcing these three key characteristics.

Choosing your deployment model:

Over the years that I have spent working on various data warehouse projects I have seen a wide variety of  weird and wonderful deployment models designed to support sandboxing. In very general terms these various deployment models reduce down to one of the following types:

  1. Desktop sandbox
  2. Detached sandbox
  3. Attached sandbox

each one of these deployment models has benefits and advantages as described here:

1. Desktop Sandboxes

Many business users prefer to use their desktop tools, such as spreadsheet packages, because the simple row-column data model gives them a simplified and easily managed view of their data set. However, this approach places a significant processing load on the desktop computer (laptop or PC) and while some vendors offer a way to off-load some of that processing to bespoke middleware servers this obviously means implementing an additional specialised middleware server on dedicated hardware.  Otherwise, companies have to invest large amounts of money upgrading their desktop systems with additional memory and solid-state disks.

Creating a new sandbox is just a question of opening a new, fresh worksheet and loading the required data set. Obviously, the size and breadth of the dataset is limited by the resources on the desktop system and complicated calculations can take a considerable time to run with little or no scope for additional optimisation or tuning. Desktop sandbox are, by default, data-silos and completely disconnected from the enterprise data warehouse which makes it very difficult to do any sort of joined-up analysis. 

The main advantage of this approach is that power users can easily run what-if models where they redefine their data model to test new "hierarchies", add new dimensions or new attributes. They can even change the data by simply over-typing existing values. Collaboration is a simple process of emailing the spreadsheet model to other users for comments. The overriding assumption here is that users who receive the spreadsheet are actually authorised to view the data! Of course there is nothing to prevent recipients forwarding the data to other users. Therefore, it is fair to say that data security is non-existent.

For DBAs, the biggest problem with this approach is that it offers no integration points into the existing cloud management infrastructure. Therefore, it is difficult for the IT team to monitor the resources being used and make appropriate x-charges.  Of course the DBA has no control over the deletion of desktop based sandboxes so there is a tendency for these environments to take on a life of their own with business users using them to create "shadow" production systems that are never decommissioned.

Overall, the deployment of desktop sandboxes is not recommended.

2. Detached Sandboxes

Using a detached, dedicated sandbox platform resolves many of the critical issues related to desktop sandbox platforms most notably the issues relating to: data security and processing scalability. Assuming a relatively robust platform is used to manage the sandboxes then the security profiles implemented in the EDW can be replicated across to the stand-alone platform. This approach still allows users to redefine their data model to test new "hierarchies", add new dimensions or new attributes within what-if models and even change data points but this ability is "granted" by the DBA rather than being automatically taken and enforced by the business user. In terms of sharing results there is no need to distribute data via email and this ensures everyone gets the same consistent view of the results (and by default the original source, should there be a need to work backwards from the results to the source).

Key concerns for business users is the level of latency that occurs from the need to unload and reload not only the required data but also all the supporting technical and business metadata. Unloading, moving and importing large historical data sets can be very time consuming and can require large amounts of resources on the production system - which may or may not be available depending on the timing of the request. 

For the DBA issues arise around the need to monitor additional hardware and software services in the data center. For IT this means more costs because additional floor space, network bandwidth, power and cooling may be required. Of course, assuming that the sandbox platform fits into the existing monitoring and control infrastructure then x-charging can be implemented. In this environment the DBA has full control over the deletion of a sandbox so they can prevent the spread of "shadow" production data sets. For important business discoveries, the use of detached sandboxes does provide the IT team with the opportunity to grab the loading and analysis scripts and move them to the production EDW environment. This helps to reduce the amount of time and effort needed to "productionize" discoveries.

While detached sandboxes remove some of the disadvantages of desktop platforms it is still not an ideal way to deliver sandboxes to the business community.

3. Attached Sandboxes

Attached sandboxes resolve all the problems associated with the other two scenarios. Oracle provides a rich set of in-database features that allow business users to work with in-place data, which in effect, removes the issue of data latency. Oracle Database is able to guarantee complete isolation for any changes to dimensions, hierarchies, attributes and/or even individual data points so there is no need to unload, move and then reload data. All the existing data security policies remain in place which means there is no need to replicate security profiles to other systems where there is the inherent risk that something might be missed in the process.

For the DBA, x-charging can be implemented using existing infrastructure management tools. The DBA has full control over the sandbox in terms of resources (storage space, CPU, I/O) and duration. The only concern that is normally raised regarding the use of attached sandboxes is the impact on the existing operational workloads. Fortunately, Oracle Database, in conjunction with our engineered systems, has a very robust workload management framework (see earlier posts on this topic: This means that the DBA can allocate sufficient resource to each sandbox while ensuring that the key operational workloads continue to meet their SLAs. Overall, attached sandboxes, within an Oracle Database environment, is a win-win solution: both the DBA and the business community get what they need.


Deployment Model



Desktop Sandbox

High degree of local control over data
“Fast” performance
Quick and easy sharing of results

Reduced data scalability
Not easy to integrate new data
Very costly to implement
Undermines data consistency-governance
Data security is compromised

Detached Sandbox

Reduces workload on EDW
Upload personal/external data to sandbox
Explore large volumes of data without limits

Requires additional hardware and software
Requires replication of corporate data
High latency
Replication + increased management of operational metadata

Attached Sandbox

Upload additional data to virtual partitions Easy to mix new data with corporate data
No replication of corporate data
Efficient use of DW platform resources
Data access controlled by enterprise security features

Requires robust workload management tools

From this list of pros and cons it is easy to see that the "Attached Sandbox"  is the best deployment model to use. Fortunately, Oracle Database 12c has a number of new features and improvements to existing features that mean it is the perfect platform for deploying and managing attached sandboxes.

B-O-X-D: the lifecycle of a sandbox

Now we know what type of sandbox we need to deploy (just in case you were not paying attention - attached sandboxes!) to keep our business users happy the next step is to consider the lifecycle of the sandbox along with the tools and features that support each of the key phases. To make things easier I have broken this down into four key DBA-centric phases as shown below:

Sandbox lifecycle

Over the next four weeks I will cover these four key phases of the sandbox lifecycle and explain which Oracle tools and Oracle Database features are relevant and how they can be used. 


*1 Solve the Data Management Conflict Between Business and IT, by Brad Peters - Information Management Newsletters, July 20, 2010

*2 The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics by Ralph Kimball

Tuesday Apr 22, 2014

Announcing: Big Data Appliance 3.0 and Big Data Connectors 3.0

Today we are releasing Big Data Appliance 3.0 (which includes the just released Oracle NoSQL Database 3.0) and Big Data Connectors 3.0.These releases deliver a large number of interesting and cool features and enhance the overall Oracle Big Data Management System that we think is going to be the core of information management going forward.

This post highlights a few of the new enhancements across the BDA, NoSQL DB and BDC stack.

Big Data Appliance 3.0:
  • Pre-configured and pre-installed CDH 5.0 with default support for YARN and MR2
  • Upgrade from BDA 2.5 (CDH 4.6) to BDA 3.0 (CDH 5.0)
  • Full encryption (at rest and over the network) from a single vendor in an appliance
  • Kerberos and Apache Sentry pre-configured
  • Partition Pruning through Oracle SQL Connector for Hadoop
  • Apache Spark (incl. Spark Streaming) support
  • More
Oracle NoSQL Database 3.0:
  • Table data model support layered on top of distributed key-value model
  • Support for Secondary Indexing
  • Support for "Data Centers" => Metro zones for DR and secondary zones for read-only workloads
  • Authentication and network encryption
  • More

You can read about all of these features by going to links above and reading the OTN page, data sheets and other relevant information.

While BDA 3.0 immediately delivers upgrade from BDA 2.5, Oracle will also support the current version and we fully expect more BDA 2.x releases based on more CDH 4.x releases. As a customer you now have a choice how to deploy BDA and which version it is you want to run, while knowing you can upgrade to the latest and greatest in a safe manner.

Thursday Apr 17, 2014

Analyzing our Big Data Lite movie app clickstream data

Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature: 

 To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our BIg Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here:

The data set is based around an online customer movie application. Here is a little background…

Movieplex on otn

Oracle MoviePlex Demo Application

Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

As the users watch movies, search for their favorite actors, rate their movies the system records all the clicks in a log file. The contents of that log file looks like this:

Avro file

In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its related Java-based programming language MapReduce to process and convert these types of files into usable information.  Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which is solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations… let's bring on the code!

Sessionization using Java

Accessing the Avro file

At the moment In the context of Big Data, everything seem to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic map reduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. This is part four of a six-part series that explains the movieplex demo: Part 4. Turn Clicks into Value - Flume & Hive. The movie demo lasts about 5 mins and you can watch here:

The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.

Calculating the sessionization data

 Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code.  Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):

The actual logic for the sessionization analysis is about 100 lines of code as shown here (at s before, the code window is scrollable so you can review the very long code sample):

As you can see from the code listing this requires a strong knowledge of Java and with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making anything changes  to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientists and this might negatively impact the level of project agility.

Making life easier and richer with SQL

My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constant changing requirements during the data discovery phase.

Accessing the Avro file

First step is to create a DIRECTORY object to access the raw data file:

CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
GRANT READ, WRTIE ON DIRECTORY session_file_dir to pmuser;

Next I created an external table over the data file. Notice that in the avro file, each key column  has an identifier and a value.


The objective for this external table was to keep the processing as simple as possible:

 LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log') 

Cleaning the data

Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to breakout the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:

    WHEN '1' THEN '1'
    WHEN '2}' THEN '2'
    WHEN '3}' THEN '3'
    WHEN '4}' THEN '4'
    WHEN '5}' THEN '5'
    WHEN '6}' THEN '6'
    WHEN '7}' THEN '5'
    WHEN '8}' THEN '8'
    WHEN '9}' THEN '9'
    WHEN '10' THEN '10'
    WHEN '11' THEN '11'
    ELSE null
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '1' THEN 'Y' END as act_rate
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '2' THEN 'Y' END as act_complete 
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '3' THEN 'Y' END as act_pause
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '4' THEN 'Y' END as act_start 
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '5' THEN 'Y' END as act_browse 
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '6' THEN 'Y' END as act_list
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '7' THEN 'Y' END as act_search
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '8' THEN 'Y' END as act_login
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '9' THEN 'Y' END as act_logout
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '10' THEN 'Y' END as act_incomplete
, case SUBSTR(ACTIVITY_ID,12,1) WHEN '11' THEN 'Y' END as act_purchase

Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.


 Is there a better, simpler way to deal with the avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw avro file and access stand query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used along side SQL pattern matching.

Calculating the sessionization data

The code to calculate the sessionization information is as follows:

FROM vwsession_data 
 (PARTITION BY cust_id ORDER BY sess_date 
  MEASURES match_number() session_id, 
    COUNT(*) no_of_events,
    FIRST(sess_date) start_date,
    TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
    LAST(sess_date) end_date,
    TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
    TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
    LAST(activity_id) last_act_id,
    COUNT(act_rate) act_rate,
    COUNT(act_complete) act_complete,
    COUNT(act_pause) act_pause,
    COUNT(act_start) act_start,
    COUNT(act_browse) act_browse,
    COUNT(t(act_list) act_list,
    COUNT(act_search) act_search,
    COUNT(act_login) act_login, 
    COUNT(act_logout) act_logout,
    COUNT(act_incomplete) act_incomplete,
    COUNT(act_purchase) act_purchase 
 PATTERN (strt s+) 
    s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)

This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the duration between events within a session as 5 minutes. That means if the user does nothing for about  5 minutes then I will assume that a new session has started even if the user has not logged out. 

Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and there is at least one (or more) "start" event is logged and at least  one (or more) "complete"  event is logged,

PATTERN (strt s* f+ c+) 
 f as act_start = 'Y',
 c as act_complete = 'Y',
 s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)

The output from this query is shown below:

Start Complete Data Full

Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams. 

A different view: Calculating the sessionization data for each movie

While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated. Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

So how easy would it be to change the focus of the analysis to movies? Well the code took me approximately 30 seconds to write - actually cut & paste from the customer sessionization example, switch the customer id for the movie id and finally removed a few fields. Here is the new code:

 (PARTITION BY movie_id ORDER BY sess_date 
  MEASURES COUNT(*) no_of_events,
     FIRST(sess_date) start_date,
     TO_CHAR(first(sess_date), 'hh24:mi:ss') start_time,
     LAST(sess_date) end_date,
     TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
     TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
     MIN(activity_id) act_id,
     MAX(activity_id) last_act_id,
     COUNT(act_rate) act_rate,
     COUNT(act_complete) act_complete,
     COUNT(act_pause) act_pause,
     COUNT(act_start) act_start,
     COUNT(act_browse) act_browse,
     COUNT(act_list) act_list,
     COUNT(act_search) act_search,
     COUNT(act_login) act_login, 
     COUNT(act_logout) act_logout,
     COUNT(act_incomplete) act_incomplete,
     COUNT(act_purchase) act_purchase 
 PATTERN (strt s*) 
     s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date))*1440) <= 120)
 ) MR;

The speed at which you can create these additional/alternative result sets is really great!  It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.


The main take-away from this post is that (hopefully) I shown how easy it is to use SQL for sessionization analysis.  Firstly in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility but I will let you be the judge of that last statement….

Tuesday Apr 15, 2014

OpenWorld call for Papers closes today!

 Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is Today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here and this provides all the details of how and what to submit.

I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the World's most beautiful waterfront cities.

If you would like any help submitting a proposal then feel free to email during today and I will do my best to provide answers and/or guidance. My email address is

Have a great day and get those papers entered into our OpenWorld system right now! 

Thursday Apr 03, 2014

Updated: Price Comparison for Big Data Appliance and Hadoop

Untitled Document

It was time to update this post a little. Big Data Appliance grew, got more features and prices as well as insights just changed all across the board. So, here is an update.

The post is still aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not included in the pricing and packaging of Oracle Big Data Appliance when compared to "I'm doing this myself - DIY style".

Oracle Big Data Appliance Details

A few of the most overlooked items in pricing out a Hadoop cluster are the cost of software, the cost of actual production-ready hardware and the required networking equipment. A Hadoop cluster needs more than just CPUs and disks... For Oracle Big Data Appliance we assume that you would want to run this system as a production system (with hot-pluggable components and redundant components in your system). We also assume you want the leading Hadoop distribution plus support for that software. You'd want to look at securing the cluster and possibly encrypting data at rest and over the network. Speaking of network, InfiniBand will eliminate network saturation issues - which is important for your Hadoop cluster.

With that in mind, Oracle Big Data Appliance is an engineered system built for production clusters.  It is pre-installed and pre-configured with Cloudera CDH and all (I emphasize all!) options included and we (with the help of Cloudera of course) have done the tuning of the system for you. On top of that, the price of the hardware (US$ 525,000 for a full rack system - more configs and smaller sizes => read more) includes the cost of Cloudera CDH, its options and Cloudera Manager (for the life of the machine - so not a subscription).

So, for US$ 525,000 you get the following:

  • Big Data Appliance Hardware (comes with Automatic Service Request upon component failures)
  • Cloudera CDH and Cloudera Manager
  • All Cloudera options as well as Accumulo and Spark (CDH 5.0)
  • Oracle Linux and the Oracle JDK
  • Oracle Distribution of R
  • Oracle NoSQL Database Community Edition
  • Oracle Big Data Appliance Enterprise Manager Plug-In

The support cost for the above is a single line item.. The list price for Premier Support for Systems per the Oracle Price list (see source below) is US$ 63,000 per year.

To do a simple 3 year comparison with other systems, the following table shows the details and the totals for Oracle Big Data Appliance. Note that the only additional item is the install and configuration cost which are done by Oracle personnel or partners, on-site:

Year 1 Year 2 Year 3 3 Year
BDA Cost

Annual Support Cost

On-site Install (approximately)


For this you will get a full rack BDA (18 Sun X4-2L servers, 288 cores (Two Intel Xeon E5-2650V2 CPUs per node), 864TB disk (twelve 4TB disks per node), plus software, plus support, plus on-site setup and configuration. Or in terms of cost per raw TB at purchase and at list pricing: $697.

HP DL-380 Comparative System (this is changed from the original post to the more common DL-380's)

To build a comparative hardware solution to the Big Data Appliance we picked an HP-DL180 configuration and built up the servers using the website for pricing. The following is the price for a single server.

Model Number Description Quantity Total Price
653200-B21 ProLiant DL380p Gen8 Rackmount Factory Integrated 8 SFF CTO Model (2U) with no processor, 24 DIMM with no memory, open bay (diskless) with 8 SFF drive cage, Smart Array P420i controller with Zero Memory, 3 x PCIe 3.0 slots, 1 FlexibleLOM connector, no power supply, 4 x redundant fans, Integrated HP iLO Management Engine
2.6GHz Xeon E5-2650 v2 processor (1 chip, 8 cores) with 20MB L3 cache - Factory Integrated Only
HP 1GbE 4-port 331FLR Adapter - Factory Integrated Only
460W Common Slot Gold Hot Plug Power Supply
HP Rack 10000 G2 Series - 10842 (42U) 800mm Wide Cabinet - Pallet Universal Rack
8GB (1 x 8GB) Single Rank x8 PC3L-12800R (DDR3-1600) Registered CAS-11 Low Voltage Memory Kit
HP Smart Array P222/512MB FBWC 6Gb 1-port Int/1-port Ext SAS controller 1
4TB 6Gb SAS 7.2K LFF hot-plug SmartDrive SC Midline disk drive (3.5") with 1-year warranty

Grand Total for a single server (list prices)


On top of this we need InfiniBand switches. Oracle Big Data Appliance comes with 3 IB switches, allowing us to expand the cluster without suddenly requiring extra switches. And, we do expect these machines to be a part of a much larger clusters. The IB switches are somewhere in the neighborhood of US$ 6,000 per switch, so add $18,000 per rack and add a management switch (BDA uses a Cisco switch) which seems to be around $15,000 list. The total switching comes to roughly $33,000.

We will also need Cloudera Enterprise subscription - and to compare apples to apples, we will do it for all software. Some sources (see this document) peg CDH Core at $3,382 list per node and per year (24*7 support). Since BDA has more software (all options) and that pricing is not public I am going to make an educated calculation and rounding and double the price with a rounding to the nearest nice and round number. That gets me to $7,000 per node, per year for 24*7 support. 

BDA also comes with on-disk encryption, which is even harder to price out. My somewhat educated guess is around $1,500 list or so per node and per year. Oh, and lets not forget the Linux subscription, which lists at $1,299 per node per year. We also run a MySQL database (enterprise edition with replication), which costs list subscription $5,000. We run it replicated over 2 nodes.

This all gets us to roughly $10,000 list price per node per year for all applicable software subscriptions and support and an additional $10,000 for the two MySQL nodes.

HP + Cloudera Do-it-Yourself System

Let's go build our own system. The specs are like a BDA, so we will have 18 servers and all other components included. 

Year 1 Year 2 Year 3 Total




SW Subscriptions and Support

Installation and Configuration


Some will argue that the installation and configuration is free (you already pay your data center team), but I would argue that something that takes a short amount of time when done by Oracle, is worth the equivalent if it takes you a lot longer to get all this installed, optimized, and running. Nevertheless, here is some math on how to get to that cost anyways: approximately 150 hours of labor per rack for the pure install work. That adds up to US $15,000 if we assume a cost per hour of $100. 

Note: those $15,000 do NOT include optimizations and tuning to Hadoop, to the OS, to Java and other interesting things like networking settings across all these areas. You will now need to spend time to figure out the number of slots you allocate per node, the file system block size (do you use Apache defaults, or Cloudera's or something else) and many more things at system level. On top of that, we pre-configure for example Kerberos and Apache Sentry giving you a secure authorization and authentication method, as well as have a one-click on-disk and network encryption setting. Of course you can contact various other companies to do this for you.

You can also argue that "you want the cheapest hardware possible", because Hadoop is built to deal with failures, so it is OK for things to regularly fail. Yes, Hadoop does deal well with hardware failures, but your data center is probably much less keen about this idea, because someone is going to replace the disks (all the time). So make sure the disks are hot-swappable. An oh, that someone swapping the disks does cost money... The other consideration is failures in important components like power... redundant power in a rack is a good thing to have. All of this is included (and thought about) in Oracle Big Data Appliance.

In other words, do you really want spend weeks installing, configuring and learning or would you rather start to build applications on top of the Hadoop cluster and thus providing value to your organization.

The Differences

The main differences between Oracle Big Data Appliance and a DIY approach are:

  1. A DIY system - at list price with basic installation but no optimization - is a staggering $220 cheaper as an initial purchase
  2. A DIY system - at list price with basic installation but no optimization - is almost $250,000 more expensive over 3 years.
    Note to purchasing, you can spend this on building or buying applications on your cluster (or buy some real intriguing Oracle software)
  3. The support for the DIY system includes five (5) vendors. Your hardware support vendor, the OS vendor, your Hadoop vendor, your encryption vendor as well as your database vendor. Oracle Big Data Appliance is supported end-to-end by a single vendor: Oracle
  4. Time to value. While we trust that your IT staff will get the DIY system up and running, the Oracle system allows for a much faster "loading dock to loading data" time. Typically a few days instead of a few weeks (or even months)
  5. Oracle Big Data Appliance is tuned and configured to take advantage of the software stack, the CPUs and InfiniBand network it runs on
  6. Any issue we, you or any other BDA customer finds in the system is fixed for all customers. You do not have a unique configuration, with unique issues on top of the generic issues.


In an apples-to-apples comparison of a production Hadoop cluster, Oracle Big Data Appliance starts of with the same acquisition prices and comes out ahead in terms of TCO over 3 years. It allows an organization to enter the Hadoop world with a production-grade system in a very short time reducing both risk as well as reducing time to market.

As always, when in doubt, simply contact your friendly Oracle representative for questions, support and detailed quotes.


HP and related pricing: or (the latter is a paid service - sorry!)
Oracle Pricing:
MySQL Pricing:

Tuesday Apr 01, 2014

Limited Edition Exadata X4-2C - Brighten Up Your Data Center

Oracle has always been at the forefront of efforts to revolutionise your data center. To date, for obvious reasons, the focus has been on optimizing energy and space efficiency. As of today we are moving into an exciting new phase in terms of the look and feel of your data center. Oracle recently added a new fashion design team to its engineered system group to help us re-imagine the next generation data center and the first exciting fruits of this new partnership of both technology and fashion are now available for our customers to order…..

For a short period only, Oracle is offering its data warehouse customers the chance to buy a limited edition EXADATA X4-2C. This new Exadata configuration is going to brighten up your data center with its exciting range of color coordinated racks! Now you can enjoy running those really sophisticated business queries in glorious technicolor. Most importantly, the great news is that we are not charging you anything extra for this fabulous new technicolor data warehouse experience:

X4 2C


Each color-coded rack comes with its own color-linked version of Enterprise Manager to add more colour, brightness and joy to all those day-to-day tasks as you can see below on these specially designed monitoring screens: 


 Your Exadata DBA is really going to thank you!

So what happens if you buy a 1/2 rack then slowly add more Exadata nodes? Great question - well, while stocks last you can actually create your own multi-colored Exadata rack. As always we are ahead of the game because we know what our customers want. SO WHY NOT HAVE A  TECHNICOLOR DATA WAREHOUSE in your data center! Go on, you know it makes sense….

X4 2C 2

BUT YOU GOTTA HURRY - This new Exadata X4-2C range is a limited edition, special order only model. Stocks are limited. To brighten up your data center make sure you contact your Oracle Sales Representative right now because you do not want to miss out on this exciting opportunity to put one of these gorgeous, colour-coded dudes in your data center. And don't forget, only Oracle gives you  HARDWARE, SOFTWARE AND COLORENGINEERED TO WORK TOGETHER

Oracle 1-800-633-0738

Wednesday Mar 26, 2014

Oracle Big Data Lite Virtual Machine - Version 2.5 Now Available

Oracle Big Data Appliance Version 2.5 was released last week.  Some great new features in this release- including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to Cloudera Distribution of Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine.  Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured - allowing you to begin using the system right away. The following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition (
  • Cloudera’s Distribution including Apache Hadoop (CDH4.6)
  • Cloudera Manager 4.8.2
  • Cloudera Enterprise Technology, including:
    • Cloudera RTQ (Impala 1.2.3)
    • Cloudera RTS (Search 1.2)
  • Oracle Big Data Connectors 2.5
    • Oracle SQL Connector for HDFS 2.3.0
    • Oracle Loader for Hadoop 2.3.1
    • Oracle Data Integrator 11g
    • Oracle R Advanced Analytics for Hadoop 2.3.1
    • Oracle XQuery for Hadoop 2.4.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Monday Mar 24, 2014

Demonstration: Auditing Data Access Across the Enterprise

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption and auditing.

Auditing data access across the enterprise - including databases, operating systems and Hadoop - is critically important and oftentimes required for SOX, PCI and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c. 


In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.


The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server generated Agent Activation Key to complete the activation.


On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

  • HDFS: collects all file access activity
  • MapReduce:  identifies who ran what jobs on the cluster
  • Oozie:  audits who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges and more. Check out the auditor's guide for more details. In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user and any unsuccessful attempts at selecting from any table in any schema.

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop:


In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. And, they are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to appropriate security administrators.

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

  • an oozie workflow that removes salary data from HDFS
  • numerous HDFS commands that upload files, change file access privileges, copy files and list the contents of directories and files
  • hive commands that query, create, alter and drop tables
  • Oracle Database commands that connect as different users, create and drop users, select from tables and delete records from a table

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see our alert has been triggered by the users' activity.


Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS:

alert details

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enables you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports that are delivered out of box with Audit Vault, you can create your own custom reports that meet your own personal needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

monitor events

As you can see, the report tells you all of the key elements required to understand: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the ip address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity - including DrEvils failed attempts to open the sensitive salaries.txt file.

Notice, events may be related to one another. The Hive command "ALTER TABLE my_salarys RENAME TO my_salaries" will generate two events. The first event is sourced from the Metastore; the alter table command is captured and the metadata definition is updated. The Hive command also impacts HDFS; the table name is represented by an HDFS folder. Therefore, an HDFS event is logged that renames the "my_salarys" folder to "my_salaries".

Next, consider an Oozie workflow that performs a simple task: delete a file "salaries2.txt" in HDFS. This Oozie worflow generates the following events:


  1. First, an Oozie workflow event is generated indicating the start of the workflow.
  2. The workflow definition is read from the "workflow.xml" file found in HDFS.
  3. An Oozie working directory is created
  4. The salaries2.txt file is deleted from HDFS
  5. Oozie runs its clean-up process

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. It's flexible reporting allows you to sequence these independent events into a logical series of related activities.

The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But, what other unsuccessful events have occured recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours:


Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS - but also in the Oracle Database.


Security and integration with the rest of the Oracle ecosystem are two tablestakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Built-in sorting optimizations to support analytical SQL

One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

Sort Optimization 1: Ordering Groups

Many of analytical functions include PARTITION BY and/or an ORDER BY clause both of which by definition implies that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause this can create situations where lot of different sorts are needed. For example, if we have a SQL statement that included the following:

Rank() Over (Partition by (x) Order by (w))
Sum(a) Over (Partition by (w,x) Order by (z))
Ntile() Over (Partition by (x) Order by (y))
Sum(b) Over (Partition by (x,y) Order by (z))

this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.

The first step is create the notion of "Ordering Groups". This optimizations looks for ways to group together sets of analytic functions which can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

Screen Shot 2014 03 13 at 13 39 37

This allows us to reduce the original list of sorts down from 4 to just 2.

Sort Optimization 2: Eliminating Sorts

We can further reduce the number sorts that need to be performed by carefully scheduling the execution so that:

  • Ordering groups with sorts corresponding to that in the GROUP BY execute first (immediately after the GROUP BY) 
  • Ordering groups with sorts corresponding to that in the ORDER BY execute last (immediately before the ORDER BY)

In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 

Optimization 3 : RANK Predicates

Where a SQL statement includes RANK() functions there are additional optimizations that kick-in. Instead of sorting all the data, adding the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.


Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 

Friday Mar 21, 2014

Open World 2014 - guidelines for call-for-papers…

OOW Banner 2013

Most of you will already have received an email from the OOW team announcing the call for papers for this year's conference: Each year, customers ask me how they can increase their chances of getting their paper accepted? Well, I am going to start by stating that product managers have absolutely no influence over which papers are accepted - even mentioning that a product manager will be co-presenting with you will not increase your chances!

So how do you increase you make sure that your presentation title and abstract catches the eye of the selection committee? Here is my top 10 list of guidelines for submitting proposals:

1) Read the "call-for-papers" carefully and follow its instructions - even if you have submitted presentations for lots of Oracle conferences it is always a good idea to carefully read the call for papers and to make sure you follow the instructions. There is an excellent section towards the end of the call-for-papers web page, "Tips and Guidelines"

2) Address the theme of the conference - If this is available when the call the for papers is announced then try to address the theme of the conference within your abstract.

3) Address the key data warehouse focus areas - for this year's OOW 2014 the key focus areas for data warehousing will be partitioning, analytical SQL, parallel execution, workload management and logical data warehouse. If possible try to include one or more of these focus areas within your abstract.

4) Have a strong biography - You need to use your biography to differentiate and build credibility. This is an important topic because it allows you to differentiate yourself from all the other presenters who are trying to get speaking slots. Your biography must explain why you are an authority on the topic you have chosen for your presentation and why people will want to listen to what you have to say.

5) Have a strong business case - build your presentation around a strong business case, relevant to your industry and/or your target audience (DBAs, developers, architects etc). Try to explain in clear and simple terms the problem you needed to solve, how you solved it using Oracle technology and the direct technical/business benefits.

6) Make the title and abstract interesting - Your title and abstract must be easy to read and make sure you introduce your main idea as early as possible. Review the titles and abstracts from previous conferences as a guide. Ideally make the issue relevant to the delegates attending OWW, get to the point, and make sure it is easy to read.

7) Look at previous presentations - the content catalog for last year's conference is available online,see here: You can review all the titles and abstracts that were accepted and use them as guidelines for creating your own title and abstract for this year's conference.

8) Write clear outcomes - The majority of the best presentations have clearly stated outcomes. What do you expect that conference attendees will be able do or know at the end of your session? Consider including a sentence at the end of your abstract such as the following: “At the end of this presentation, participants will be able to . . . .”

9) Don’t submit your paper right away - Once you have a title and abstract show it to a few colleagues. Get some feedback. You probably know many people who’d be happy to give you ideas on making your paper better.

10) Keep number of submissions low - You do not increase your chances of getting a paper accepted by submitting lots of different papers.

I cannot guarantee you success if you follow these guideline but I hope they prove helpful. Good luck with your submission(s) and I look forward to seeing at you at this year's OpenWorld in San Francisco.


Wednesday Mar 19, 2014

Announcing Encryption of Data-at-Rest on Big Data Appliance

With the release of Big Data Appliance software bundle 2.5, BDA completes the encryption story underneath Cloudera CDH. BDA already came with network encryption, ensuring no network sniffing can be applied in between the nodes, it now adds encryption of data-at-rest.

A Brief Overview

Encryption of data-at-rest can be done in 2 modes. One mode leverages the Trusted Platform Module (TPM) on the motherboard to provide a key to encrypt the data on disk. This mode does not require a password or pass phrase but relies on the motherboard. The second mode leverages a passphrase, which in turn will be used to generate a private-public key pair generated with OpenSSL. The key pair is encrypted as well.

The passphrase encryption has a few more interesting aspects. For one, it does require the passphrase to be entered upon re-booting the system. Leveraging the TPM option does not require any manual intervention at reboot. On Big Data Appliance it is possible to regularly change the passphrase without impacting the encryption, or required re-encryption of the data.

Neither one of the encryption methods affect user access to user data. In other words, on an unprotected cluster a user that can read data before encryption will be able to read data after encryption. The goal is to ensure data is protected on physical media - like theft or incorrect disposal of a disk. Both forms protect from that, but only passphrase based encryption protects from disposal or theft of a server.

On BDA, it is possible to switch between these two methods. This does have impact on running the cluster as data needs to be re-encrypted. For this step the cluster will be down, however data is not duplicated, so there is no need to reserve double the space to do the re-encryption.

How to Encrypt Data

As with all installation or changes on Big Data Appliance you will leverage Mammoth to do the install with encryption or to make changes to the system if you are already in production. Before you set up either of the two modes of data-at-rest encryption, you should consider your requirements. Changing the mode - as described - is possible, but will require the cluster to be down for re-encryption.

Full Set of Security Features

Encryption - out-of-the-box is yet another feature that is specific to Oracle Big Data Appliance. On top of pre-configured Kerberos, Apache Sentry, Oracle Audit Vault Encryption now adds another security dimension. To read more about the full set of features start here.


The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.


« September 2015