Thursday Apr 17, 2014

Analyzing our Big Data Lite movie app clickstream data

Since last year's OpenWorld I have posted quite a few articles on our new 12c pattern matching feature: 

 To date most of my online demos and tutorials have used a very simple data set consisting of a single table with 60 rows of fictitious trade data. Over the last few months I have been searching around for a slightly bigger and more challenging data set to use. Fortunately, our Big Data PMs (led by Marty Gubar) have put together a very rich big data schema as part of their recently launched Big Data Virtual Machine that you can download from our OTN web page, see here: http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.

The data set is based around an online customer movie application. Here is a little background…

Movieplex on otn

Oracle MoviePlex Demo Application

Oracle MoviePlex is a fictitious on-line movie streaming company. Customers log into Oracle MoviePlex where they are presented with a targeted list of movies based on their past viewing behavior. Because of this personalized experience and reliable and fast performance, customers spend a lot of money with the company and it has become extremely profitable.

As the users watch movies, search for their favorite actors and rate movies, the system records all the clicks in a log file. The contents of that log file look like this:

Avro file

In its raw state this information is not very helpful. It needs a lot of processing to convert the various data points into usable information. In many cases companies have been turning to Hadoop and its Java-based MapReduce programming framework to process and convert these types of files into usable information. Most business users will want to summarise this information by customer and/or movie and then merge this information with other data sets. So how can we make access to and analysis of this type of data much easier? As part of this post I am going to compare the Hadoop-Java-MapReduce approach with an alternative approach using 12c SQL. My objective is not to determine which solution is the best because each approach has its own camp of supporters. Once we have reviewed the two solutions I will put everything into context and make some recommendations…..so let's bring on the code!

Sessionization using Java

Accessing the Avro file

At the moment, in the context of Big Data, everything seems to revolve around Hadoop, MapReduce and Java. It is quite natural for a big data developer to extend their basic MapReduce processing to include more complicated requirements. In the case of our movie demo there is a lot of processing that needs to be done using the native features of Hadoop to collect and process the weblog file being generated by our application. There is an excellent video prepared by Marty Gubar (Director of Product Management for Big Data) which explains this process. It is part four of a six-part series that explains the movieplex demo: Part 4. Turn Clicks into Value - Flume & Hive. The video lasts about 5 mins and you can watch it here: https://www.youtube.com/watch?v=IwrjJUoUwXY.

The steps shown in the movie explain how to access the avro file and then how to clean the data to provide some interesting additional metrics.

Calculating the sessionization data

 Creating the sessionization analysis is a little more complicated. In fact, it requires 370 lines of Java code.  Here is the Java code we created for doing the sessionization analysis (the code window is scrollable so you can review the very long code sample):

The actual logic for the sessionization analysis is about 100 lines of code as shown here (as before, the code window is scrollable so you can review the very long code sample):

As you can see from the code listing this requires a strong knowledge of Java and, with 370 lines of code, if we decide to change the details of the pattern that we are searching for it is going to be a lengthy process to make the required changes. Making any changes to the code to reflect changing business requirements is definitely going to be beyond the skills of your average business analyst and/or data scientist, and this might negatively impact the level of project agility.

Making life easier and richer with SQL

My personal view is that sessionization analysis is quite a sophisticated requirement and is best achieved using SQL because the requirements and parameters for the analysis will naturally evolve over time as new questions are triggered by existing answers. Therefore, you need an agile approach to analytics that can deal with constantly changing requirements during the data discovery phase.

Accessing the Avro file

First step is to create a DIRECTORY object to access the raw data file:

CREATE DIRECTORY session_file_dir AS '/home/oracle/applog';
GRANT READ, WRITE ON DIRECTORY session_file_dir to pmuser;

Next I created an external table over the data file. Notice that in the avro file, each key column  has an identifier and a value.

{"custId":1077225"movieId":100"genreId":8"time":"2011-07-30:22:55:32""recommended":"N""activity":6}

The objective for this external table was to keep the processing as simple as possible:

CREATE TABLE RAW_SESSION_DATA 
(
  CUST_ID VARCHAR2(20) 
, MOVIE_ID VARCHAR2(20) 
, GENRE_ID VARCHAR2(20) 
, SESSION_DATE VARCHAR2(50) 
, RECOMMENDED_ID VARCHAR2(20) 
, ACTIVITY_ID VARCHAR2(20)
, RATING_ID VARCHAR2(20)
) 
ORGANIZATION EXTERNAL 
( 
 TYPE ORACLE_LOADER 
 DEFAULT DIRECTORY SESSION_FILE_DIR 
 ACCESS PARAMETERS
 (RECORDS DELIMITED BY NEWLINE
 NOBADFILE
 NODISCARDFILE
 NOLOGFILE
 FIELDS TERMINATED BY ','
 OPTIONALLY ENCLOSED BY "'"
 MISSING FIELD VALUES ARE NULL)
 LOCATION (SESSION_FILE_DIR: 'movieapp_30months.log') 
) 
REJECT LIMIT 1;
ALTER TABLE RAW_SESSION_DATA 
PROJECT COLUMN REFERENCED;

Cleaning the data

Now the next stage is to clean the data and remove the column identifiers such as custId, movieId, genreId etc. To do this processing I simply used the SQL SUBSTR() function. At the same time I decided to break out the activity ID column to create unique columns for each type of activity. This approach is useful if you are going to be using the data as part of a data mining project because in many cases it is useful to uniquely identify specific attributes. In many cases you will want to create a fact table from this process but in this case I have created a view while I do some basic prototyping:

CREATE OR REPLACE VIEW VW_SESSION_DATA AS SELECT 
  SUBSTR(CUST_ID,11) AS CUST_ID
, CASE SUBSTR(MOVIE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(MOVIE_ID,11) END AS MOVIE_ID
, CASE SUBSTR(GENRE_ID,11) WHEN 'null' THEN null ELSE SUBSTR(GENRE_ID,11) END AS GENRE_ID
, TO_DATE(SUBSTR(SESSION_DATE, 9,19), 'YYYY-MM-DD HH24:MI:SS') AS SESS_DATE
, CASE SUBSTR(RECOMMENDED_ID,15) WHEN 'null' THEN null ELSE SUBSTR(RECOMMENDED_ID,16,1) END AS REC_ID
, (CASE SUBSTR(ACTIVITY_ID,12,2)
    WHEN '1' THEN '1'
    WHEN '2}' THEN '2'
    WHEN '3}' THEN '3'
    WHEN '4}' THEN '4'
    WHEN '5}' THEN '5'
    WHEN '6}' THEN '6'
    WHEN '7}' THEN '7'
    WHEN '8}' THEN '8'
    WHEN '9}' THEN '9'
    WHEN '10' THEN '10'
    WHEN '11' THEN '11'
    ELSE null
  END) AS ACTIVITY_ID
, CASE SUBSTR(RATING_ID,9) WHEN 'null' THEN null ELSE SUBSTR(RATING_ID,10,1) END as RATING_ID
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '1'  THEN 'Y' END as act_rate
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '2}' THEN 'Y' END as act_complete 
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '3}' THEN 'Y' END as act_pause
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '4}' THEN 'Y' END as act_start 
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '5}' THEN 'Y' END as act_browse 
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '6}' THEN 'Y' END as act_list
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '7}' THEN 'Y' END as act_search
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '8}' THEN 'Y' END as act_login
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '9}' THEN 'Y' END as act_logout
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '10' THEN 'Y' END as act_incomplete
, case SUBSTR(ACTIVITY_ID,12,2) WHEN '11' THEN 'Y' END as act_purchase
FROM RAW_SESSION_DATA;

Running a query against this view generates our basic data set that we can then feed into our MATCH_RECOGNIZE clause to create the sessionization result set.
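As a quick sanity check - and purely as an illustrative sketch rather than part of the original workflow - a simple aggregate query against the view confirms that the cleaned-up columns are usable before we move on to pattern matching:

SELECT cust_id,
       MIN(sess_date) AS first_event,
       MAX(sess_date) AS last_event,
       COUNT(*)       AS no_of_events
FROM   vw_session_data
GROUP  BY cust_id
ORDER  BY no_of_events DESC
FETCH FIRST 10 ROWS ONLY;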

Sessionization

Is there a better, simpler way to deal with the Avro file? Well yes there is. As part of the Database 12c release programme we will be adding in-database support for JSON data. This will allow us to take the raw Avro file and access and query it directly from within the database without the need to go through the complicated cleaning process! At the moment we have not made any collateral (PPTs etc) on this feature publicly available so I can't provide you with any supporting links that will give you more information. However, once this feature is released I will revisit this little demo to explore how the new JSON feature can be used alongside SQL pattern matching.
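Purely to give a flavour of the direction - and this is a hedged sketch based on the JSON support as it subsequently shipped, not on anything announced at the time - querying the raw log lines might look something like this, assuming a hypothetical staging table RAW_SESSION_JSON with a single VARCHAR2 column JSON_DOC holding one log record per row:

SELECT jt.*
FROM   raw_session_json r,
       JSON_TABLE(r.json_doc, '$'
         COLUMNS (cust_id   NUMBER       PATH '$.custId',
                  movie_id  NUMBER       PATH '$.movieId',
                  genre_id  NUMBER       PATH '$.genreId',
                  sess_date VARCHAR2(30) PATH '$.time',
                  rec_id    VARCHAR2(1)  PATH '$.recommended',
                  activity  NUMBER       PATH '$.activity')) jt;

No SUBSTR() gymnastics required - the JSON functions pull each attribute out by name.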

Calculating the sessionization data

The code to calculate the sessionization information is as follows:

SELECT *
FROM vw_session_data 
MATCH_RECOGNIZE
 (PARTITION BY cust_id ORDER BY sess_date 
  MEASURES match_number() session_id, 
    COUNT(*) no_of_events,
    FIRST(sess_date) start_date,
    TO_CHAR(FIRST(sess_date), 'hh24:mi:ss') start_time,
    LAST(sess_date) end_date,
    TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
    TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
    LAST(activity_id) last_act_id,
    COUNT(act_rate) act_rate,
    COUNT(act_complete) act_complete,
    COUNT(act_pause) act_pause,
    COUNT(act_start) act_start,
    COUNT(act_browse) act_browse,
    COUNT(act_list) act_list,
    COUNT(act_search) act_search,
    COUNT(act_login) act_login, 
    COUNT(act_logout) act_logout,
    COUNT(act_incomplete) act_incomplete,
    COUNT(act_purchase) act_purchase 
 ONE ROW PER MATCH 
 PATTERN (strt s+) 
 DEFINE 
    s as (round(to_number(sess_date - prev(sess_date)) * 1440) <= 5)
 );

This statement uses many of the MATCH_RECOGNIZE features that I have discussed in previous posts. In this specific example I have decided to set the maximum gap between events within a session to 5 minutes. That means that if the user does nothing for more than 5 minutes then I will assume that a new session has started, even if the user has not logged out. 

Beyond the basic sessionization model there is another interesting pattern that we can search for within our data set. If a user starts watching a movie then we might not get another event being logged for over two hours. Therefore, we could look for instances where the time between events is less than 150 minutes (assuming most movies last around 2 hours) and at least one "start" event and at least one "complete" event are logged:

PATTERN (strt s* f+ c+) 
 DEFINE 
 f as act_start = 'Y',
 c as act_complete = 'Y',
 s as (round(to_number(sess_date - prev(sess_date))*1440) <= 150)
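For completeness, here is a sketch of how that pattern fragment slots back into the full statement - same view as before with a trimmed set of measures, so treat it as illustrative rather than the exact query behind the screenshot below:

SELECT *
FROM vw_session_data
MATCH_RECOGNIZE
 (PARTITION BY cust_id ORDER BY sess_date
  MEASURES match_number() session_id,
           COUNT(*) no_of_events,
           FIRST(sess_date) start_date,
           LAST(sess_date) end_date,
           COUNT(act_start) act_start,
           COUNT(act_complete) act_complete
  ONE ROW PER MATCH
  PATTERN (strt s* f+ c+)
  DEFINE
     f as act_start = 'Y',
     c as act_complete = 'Y',
     s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 150)
 );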

The output from this query is shown below:

Start Complete Data Full

Looking at the results what really jumps out is the level of interaction on the site before, during and after the user has watched a movie. Many of the records that are returned show users searching our movie database, going to their lists of favourite movies and rating the movies they have just watched. All this implies that if we can get customers to begin watching a movie there is a huge knock-on effect in terms of how long they remain on the site and the number of "events" they create while on the site. This is the sort of information that is useful to marketing and campaign management teams. 

A different view: Calculating the sessionization data for each movie

While developing the above analysis I began to wonder about analysing the data not from a customer perspective but from the point of view of each movie. This would allow me to look at the session activity over time for each movie and understand the usage patterns across movies and genres. For example, are there certain movies or types of movies that are more or less likely to be rated? Is the process of watching a movie the same across all movies in terms of the events that are registered during a session? There are all sorts of new avenues of analysis that could be created from looking at the data in this way.

So how easy would it be to change the focus of the analysis to movies? Well, the code took me approximately 30 seconds to write - I actually cut & pasted the customer sessionization example, switched the customer id for the movie id and finally removed a few fields. Here is the new code:

SELECT *
FROM vw_session_data td
MATCH_RECOGNIZE
 (PARTITION BY movie_id ORDER BY sess_date 
  MEASURES COUNT(*) no_of_events,
     FIRST(sess_date) start_date,
     TO_CHAR(first(sess_date), 'hh24:mi:ss') start_time,
     LAST(sess_date) end_date,
     TO_CHAR(LAST(sess_date), 'hh24:mi:ss') end_time, 
     TO_CHAR(ROUND(TO_NUMBER(LAST(sess_date) - FIRST(sess_date)) * 1440), '999,999') duration,
     MIN(activity_id) act_id,
     MAX(activity_id) last_act_id,
     COUNT(act_rate) act_rate,
     COUNT(act_complete) act_complete,
     COUNT(act_pause) act_pause,
     COUNT(act_start) act_start,
     COUNT(act_browse) act_browse,
     COUNT(act_list) act_list,
     COUNT(act_search) act_search,
     COUNT(act_login) act_login, 
     COUNT(act_logout) act_logout,
     COUNT(act_incomplete) act_incomplete,
     COUNT(act_purchase) act_purchase 
 ONE ROW PER MATCH 
 PATTERN (strt s*) 
 DEFINE 
     s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date))*1440) <= 120)
 ) MR;

The speed at which you can create these additional/alternative result sets is really great!  It is now relatively easy to continue this discovery process by analysing the information by movie genre or any of the other related attributes.
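For example, here is a hedged sketch of what a genre-level version might look like - again just a cut & paste job, swapping the partition key and trimming the measures:

SELECT *
FROM vw_session_data
MATCH_RECOGNIZE
 (PARTITION BY genre_id ORDER BY sess_date
  MEASURES COUNT(*) no_of_events,
           FIRST(sess_date) start_date,
           LAST(sess_date) end_date,
           COUNT(act_start) act_start,
           COUNT(act_complete) act_complete,
           COUNT(act_rate) act_rate
  ONE ROW PER MATCH
  PATTERN (strt s*)
  DEFINE
     s as (ROUND(TO_NUMBER(sess_date - PREV(sess_date)) * 1440) <= 120)
 );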

Summary

The main take-away from this post is that (hopefully) I have shown how easy it is to use SQL for sessionization analysis. Firstly, in terms of creating the framework to support the normal "sessionization" transformation process: taking the source log file, extracting the data points and then grouping the information by user. The great news is that processing of the source file will be greatly simplified when the JSON feature is released because it will be possible to query the log file in a much simpler way to extract the required data points. Secondly, SQL provides a much simpler and easier way to support the evolutionary nature of the discovery process. Changing the focus from customer analysis to movie analysis is both quick and easy. I am not convinced that a Java-based approach to data discovery can offer the same level of simplicity and agility but I will let you be the judge of that last statement….

Tuesday Apr 15, 2014

OpenWorld call for Papers closes today!

 Just a gentle reminder - if you have not submitted a paper for this year's OpenWorld conference then there is still just enough time because the deadline is Today (Tuesday, April 15) at 11:59pm PDT. The call for papers website is here http://www.oracle.com/openworld/call-for-papers/index.html and this provides all the details of how and what to submit.

I have been working with a number of customers on some really exciting papers so I know this year's conference is going to be really interesting for data warehousing and analytics. I would encourage everyone to submit a paper, especially if you have never done this before. Right now both data warehousing and analytics are among the hottest topics in IT and I am sure all of you have some great stories that you could share with your industry peers who will be attending the conference. It is a great opportunity to present to your peers and also learn from them by attending their data warehouse/analytics sessions during this week long conference. And of course you get a week of glorious Californian sunshine and the chance to spend time in one of the World's most beautiful waterfront cities.

If you would like any help submitting a proposal then feel free to email during today and I will do my best to provide answers and/or guidance. My email address is keith.laker@oracle.com.

Have a great day and get those papers entered into our OpenWorld system right now! 

Thursday Apr 03, 2014

Updated: Price Comparison for Big Data Appliance and Hadoop

Untitled Document

It was time to update this post a little. Big Data Appliance has grown and gained more features, and prices as well as insights have changed across the board. So, here is an update.

The post is still aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not included in the pricing and packaging of Oracle Big Data Appliance when compared to "I'm doing this myself - DIY style".

Oracle Big Data Appliance Details

A few of the most overlooked items in pricing out a Hadoop cluster are the cost of software, the cost of actual production-ready hardware and the required networking equipment. A Hadoop cluster needs more than just CPUs and disks... For Oracle Big Data Appliance we assume that you would want to run this system as a production system (with hot-pluggable components and redundant components in your system). We also assume you want the leading Hadoop distribution plus support for that software. You'd want to look at securing the cluster and possibly encrypting data at rest and over the network. Speaking of network, InfiniBand will eliminate network saturation issues - which is important for your Hadoop cluster.

With that in mind, Oracle Big Data Appliance is an engineered system built for production clusters.  It is pre-installed and pre-configured with Cloudera CDH and all (I emphasize all!) options included and we (with the help of Cloudera of course) have done the tuning of the system for you. On top of that, the price of the hardware (US$ 525,000 for a full rack system - more configs and smaller sizes => read more) includes the cost of Cloudera CDH, its options and Cloudera Manager (for the life of the machine - so not a subscription).

So, for US$ 525,000 you get the following:

  • Big Data Appliance Hardware (comes with Automatic Service Request upon component failures)
  • Cloudera CDH and Cloudera Manager
  • All Cloudera options as well as Accumulo and Spark (CDH 5.0)
  • Oracle Linux and the Oracle JDK
  • Oracle Distribution of R
  • Oracle NoSQL Database Community Edition
  • Oracle Big Data Appliance Enterprise Manager Plug-In

The support cost for the above is a single line item. The list price for Premier Support for Systems per the Oracle price list (see source below) is US$ 63,000 per year.

To do a simple 3 year comparison with other systems, the following table shows the details and the totals for Oracle Big Data Appliance. Note that the only additional item is the install and configuration cost which are done by Oracle personnel or partners, on-site:


                                  Year 1      Year 2     Year 3     3 Year Total
BDA Cost                          $525,000
Annual Support Cost               $63,000     $63,000    $63,000
On-site Install (approximately)   $14,000
Total                             $602,000    $63,000    $63,000    $728,150

For this you will get a full rack BDA (18 Sun X4-2L servers, 288 cores (two Intel Xeon E5-2650 v2 CPUs per node) and 864TB of disk (twelve 4TB disks per node)), plus software, plus support, plus on-site setup and configuration. Or in terms of cost per raw TB at purchase and at list pricing: $697.
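(For reference, that per-TB figure is simply the year-one outlay divided by the raw capacity: $602,000 / 864 TB ≈ $697 per TB.)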

HP DL-380 Comparative System (this is changed from the original post to the more common DL-380's)

To build a comparative hardware solution to the Big Data Appliance we picked an HP DL-380 configuration and built up the servers using the HP.com website for pricing. The following is the price for a single server.

Model Number   Quantity   Total Price   Description
653200-B21     1          $2,051        ProLiant DL380p Gen8 Rackmount Factory Integrated 8 SFF CTO Model (2U) with no processor, 24 DIMM with no memory, open bay (diskless) with 8 SFF drive cage, Smart Array P420i controller with Zero Memory, 3 x PCIe 3.0 slots, 1 FlexibleLOM connector, no power supply, 4 x redundant fans, Integrated HP iLO Management Engine
715218-L21     2          $3,118        2.6GHz Xeon E5-2650 v2 processor (1 chip, 8 cores) with 20MB L3 cache - Factory Integrated Only
684208-B21     1          $25           HP 1GbE 4-port 331FLR Adapter - Factory Integrated Only
503296-B21     1          $229          460W Common Slot Gold Hot Plug Power Supply
AF041A         0          $0            HP Rack 10000 G2 Series - 10842 (42U) 800mm Wide Cabinet - Pallet Universal Rack
731765-B21     8          $1,600        8GB (1 x 8GB) Single Rank x8 PC3L-12800R (DDR3-1600) Registered CAS-11 Low Voltage Memory Kit
631667-B21     1          $599          HP Smart Array P222/512MB FBWC 6Gb 1-port Int/1-port Ext SAS controller
695510-B21     12         $12,588       4TB 6Gb SAS 7.2K LFF hot-plug SmartDrive SC Midline disk drive (3.5") with 1-year warranty

Grand Total for a single server (list prices): $20,210

On top of this we need InfiniBand switches. Oracle Big Data Appliance comes with 3 IB switches, allowing us to expand the cluster without suddenly requiring extra switches. And we do expect these machines to be part of much larger clusters. The IB switches are somewhere in the neighborhood of US$ 6,000 per switch, so add $18,000 per rack, and add a management switch (BDA uses a Cisco switch) which seems to be around $15,000 list. The total switching comes to roughly $33,000.

We will also need a Cloudera Enterprise subscription - and to compare apples to apples, we will do it for all software. Some sources (see this document) peg CDH Core at $3,382 list per node per year (24*7 support). Since BDA has more software (all options) and that pricing is not public, I am going to make an educated estimate: roughly double that price and round to the nearest round number. That gets me to $7,000 per node, per year for 24*7 support. 

BDA also comes with on-disk encryption, which is even harder to price out. My somewhat educated guess is around $1,500 list or so per node per year. Oh, and let's not forget the Linux subscription, which lists at $1,299 per node per year. We also run a MySQL database (enterprise edition with replication), which has a list subscription price of $5,000. We run it replicated over 2 nodes.

This all gets us to roughly $10,000 list price per node per year for all applicable software subscriptions and support and an additional $10,000 for the two MySQL nodes.
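To make the arithmetic behind that estimate explicit:

  ~$7,000 (CDH, all options, 24*7) + ~$1,500 (on-disk encryption) + $1,299 (Linux) ≈ $10,000 per node per year
  18 nodes x $10,000 + 2 MySQL nodes x $5,000 = $190,000 per year

which is the software subscription and support line used in the DIY table below.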

HP + Cloudera Do-it-Yourself System

Let's go build our own system. The specs are like a BDA, so we will have 18 servers and all other components included. 


                                 Year 1      Year 2      Year 3      Total
Servers                          $363,780
Networking                       $33,000
SW Subscriptions and Support     $190,000    $190,000    $190,000
Installation and Configuration   $15,000
Total                            $601,780    $190,000    $190,000    $981,780

Some will argue that the installation and configuration is free (you already pay your data center team), but I would argue that something that takes a short amount of time when done by Oracle is worth the equivalent cost if it takes you a lot longer to get all this installed, optimized, and running. Nevertheless, here is some math on how to get to that cost anyway: approximately 150 hours of labor per rack for the pure install work. That adds up to US$ 15,000 if we assume a cost per hour of $100. 

Note: those $15,000 do NOT include optimizations and tuning of Hadoop, the OS, Java and other interesting things like networking settings across all these areas. You will now need to spend time to figure out the number of slots you allocate per node, the file system block size (do you use Apache defaults, or Cloudera's, or something else) and many more things at system level. On top of that, we pre-configure, for example, Kerberos and Apache Sentry, giving you a secure authorization and authentication method, as well as a one-click on-disk and network encryption setting. Of course you can contact various other companies to do this for you.

You can also argue that "you want the cheapest hardware possible", because Hadoop is built to deal with failures, so it is OK for things to regularly fail. Yes, Hadoop does deal well with hardware failures, but your data center is probably much less keen on this idea, because someone is going to have to replace the disks (all the time). So make sure the disks are hot-swappable. And oh, that someone swapping the disks does cost money... The other consideration is failures in important components like power... redundant power in a rack is a good thing to have. All of this is included (and thought about) in Oracle Big Data Appliance.

In other words, do you really want to spend weeks installing, configuring and learning, or would you rather start to build applications on top of the Hadoop cluster and thus provide value to your organization?

The Differences

The main differences between Oracle Big Data Appliance and a DIY approach are:

  1. A DIY system - at list price with basic installation but no optimization - is a staggering $220 cheaper as an initial purchase
  2. A DIY system - at list price with basic installation but no optimization - is more than $250,000 more expensive over 3 years.
    Note to purchasing, you can spend this on building or buying applications on your cluster (or buy some real intriguing Oracle software)
  3. The support for the DIY system includes five (5) vendors. Your hardware support vendor, the OS vendor, your Hadoop vendor, your encryption vendor as well as your database vendor. Oracle Big Data Appliance is supported end-to-end by a single vendor: Oracle
  4. Time to value. While we trust that your IT staff will get the DIY system up and running, the Oracle system allows for a much faster "loading dock to loading data" time. Typically a few days instead of a few weeks (or even months)
  5. Oracle Big Data Appliance is tuned and configured to take advantage of the software stack, the CPUs and InfiniBand network it runs on
  6. Any issue we, you or any other BDA customer finds in the system is fixed for all customers. You do not have a unique configuration, with unique issues on top of the generic issues.

Conclusion

In an apples-to-apples comparison of a production Hadoop cluster, Oracle Big Data Appliance starts off with roughly the same acquisition price and comes out ahead in terms of TCO over 3 years. It allows an organization to enter the Hadoop world with a production-grade system in a very short time, reducing both risk and time to market.

As always, when in doubt, simply contact your friendly Oracle representative for questions, support and detailed quotes.

Sources:

HP and related pricing: http://www.hp.com or http://www.ideasinternational.com/ (the latter is a paid service - sorry!)
Oracle Pricing: http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf
MySQL Pricing: http://www.oracle.com/us/corporate/pricing/price-lists/mysql-pricelist-183985.pdf

Tuesday Apr 01, 2014

Limited Edition Exadata X4-2C - Brighten Up Your Data Center

Oracle has always been at the forefront of efforts to revolutionise your data center. To date, for obvious reasons, the focus has been on optimizing energy and space efficiency. As of today we are moving into an exciting new phase in terms of the look and feel of your data center. Oracle recently added a new fashion design team to its engineered system group to help us re-imagine the next generation data center and the first exciting fruits of this new partnership of both technology and fashion are now available for our customers to order…..

For a short period only, Oracle is offering its data warehouse customers the chance to buy a limited edition EXADATA X4-2C. This new Exadata configuration is going to brighten up your data center with its exciting range of color coordinated racks! Now you can enjoy running those really sophisticated business queries in glorious technicolor. Most importantly, the great news is that we are not charging you anything extra for this fabulous new technicolor data warehouse experience:

X4 2C

HARDWARE, SOFTWARE AND COLOR ENGINEERED TO WORK TOGETHER

Each color-coded rack comes with its own color-linked version of Enterprise Manager to add more colour, brightness and joy to all those day-to-day tasks as you can see below on these specially designed monitoring screens: 

EMC

 Your Exadata DBA is really going to thank you!

So what happens if you buy a 1/2 rack then slowly add more Exadata nodes? Great question - well, while stocks last you can actually create your own multi-colored Exadata rack. As always we are ahead of the game because we know what our customers want. SO WHY NOT HAVE A  TECHNICOLOR DATA WAREHOUSE in your data center! Go on, you know it makes sense….

X4 2C 2

BUT YOU GOTTA HURRY - This new Exadata X4-2C range is a limited edition, special order only model. Stocks are limited. To brighten up your data center make sure you contact your Oracle Sales Representative right now because you do not want to miss out on this exciting opportunity to put one of these gorgeous, colour-coded dudes in your data center. And don't forget, only Oracle gives you HARDWARE, SOFTWARE AND COLOR ENGINEERED TO WORK TOGETHER

Oracle 1-800-633-0738


Wednesday Mar 26, 2014

Oracle Big Data Lite Virtual Machine - Version 2.5 Now Available

Oracle Big Data Appliance Version 2.5 was released last week. Some great new features in this release - including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to Cloudera Distribution of Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine.  Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured - allowing you to begin using the system right away. The following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.1)
  • Cloudera’s Distribution including Apache Hadoop (CDH4.6)
  • Cloudera Manager 4.8.2
  • Cloudera Enterprise Technology, including:
    • Cloudera RTQ (Impala 1.2.3)
    • Cloudera RTS (Search 1.2)
  • Oracle Big Data Connectors 2.5
    • Oracle SQL Connector for HDFS 2.3.0
    • Oracle Loader for Hadoop 2.3.1
    • Oracle Data Integrator 11g
    • Oracle R Advanced Analytics for Hadoop 2.3.1
    • Oracle XQuery for Hadoop 2.4.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Monday Mar 24, 2014

Demonstration: Auditing Data Access Across the Enterprise

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption and auditing.

Auditing data access across the enterprise - including databases, operating systems and Hadoop - is critically important and oftentimes required for SOX, PCI and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c. 

Configuration

In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.

architecture

The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server generated Agent Activation Key to complete the activation.

agents

On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

  • HDFS: collects all file access activity
  • MapReduce:  identifies who ran what jobs on the cluster
  • Oozie:  audits who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges and more. Check out the auditor's guide for more details. In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user and any unsuccessful attempts at selecting from any table in any schema.
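Purely as an illustration - the demo itself defines these policies through the Audit Vault console rather than by hand - the equivalent intent expressed with the database's traditional AUDIT commands might look something like this:

-- capture every SELECT against the sensitive table
AUDIT SELECT ON hr.employees BY ACCESS;

-- capture everything the HR user does
AUDIT ALL BY hr BY ACCESS;

-- capture unsuccessful attempts to select from any table
AUDIT SELECT TABLE WHENEVER NOT SUCCESSFUL;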

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop:

setup_alert

In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. And, they are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to appropriate security administrators.

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

  • an oozie workflow that removes salary data from HDFS
  • numerous HDFS commands that upload files, change file access privileges, copy files and list the contents of directories and files
  • hive commands that query, create, alter and drop tables
  • Oracle Database commands that connect as different users, create and drop users, select from tables and delete records from a table

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see our alert has been triggered by the users' activity.

alert

Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS:

alert details

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enable you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports that are delivered out of the box with Audit Vault, you can create your own custom reports that meet your own needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

monitor events

As you can see, the report tells you all of the key elements required to understand the activity: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the IP address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity - including DrEvil's failed attempts to open the sensitive salaries.txt file.

Notice, events may be related to one another. The Hive command "ALTER TABLE my_salarys RENAME TO my_salaries" will generate two events. The first event is sourced from the Metastore; the alter table command is captured and the metadata definition is updated. The Hive command also impacts HDFS; the table name is represented by an HDFS folder. Therefore, an HDFS event is logged that renames the "my_salarys" folder to "my_salaries".

Next, consider an Oozie workflow that performs a simple task: delete a file "salaries2.txt" in HDFS. This Oozie workflow generates the following events:

oozie-workflow

  1. First, an Oozie workflow event is generated indicating the start of the workflow.
  2. The workflow definition is read from the "workflow.xml" file found in HDFS.
  3. An Oozie working directory is created
  4. The salaries2.txt file is deleted from HDFS
  5. Oozie runs its clean-up process

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. Its flexible reporting allows you to sequence these independent events into a logical series of related activities.

The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But, what other unsuccessful events have occurred recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours:

across-sources

Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS - but also in the Oracle Database.

Summary

Security and integration with the rest of the Oracle ecosystem are two table stakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Built-in sorting optimizations to support analytical SQL

One of the proof points that I often make for using analytical SQL over more sophisticated SQL-based methods is that we have included specific optimizations within the database engine to support our analytical functions. In this blog post I am going to briefly talk about how the database optimizes the number of sorts that occur when using analytical SQL.

Sort Optimization 1: Ordering Groups

Many analytical functions include a PARTITION BY and/or an ORDER BY clause, both of which by definition imply that an ordering process is going to be required. As each function can have its own PARTITION BY-ORDER BY clause this can create situations where a lot of different sorts are needed. For example, if we have a SQL statement that includes the following:

Rank() Over (Partition by (x) Order by (w))
Sum(a) Over (Partition by (w,x) Order by (z))
Ntile() Over (Partition by (x) Order by (y))
Sum(b) Over (Partition by (x,y) Order by (z))

this could involve four different sort processes to take into account the use of both PARTITION BY and ORDER BY clauses across the four functions. Performing four separate sort processes on a data set could add a tremendous overhead (depending on the size of the data set). Therefore, we have taken two specific steps to optimize the sorting process.

The first step is to create the notion of "Ordering Groups". This optimization looks for ways to group together sets of analytic functions which can be evaluated with a single sort. The objective is to construct a minimal set of ordering groups which in turn minimizes the number of sorts. In the example above we would create two ordering groups as follows:

[Figure: the four analytic functions grouped into two ordering groups]

This allows us to reduce the original list of sorts down from 4 to just 2.
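If you want to see this optimization at work on your own data, one approach (sketched here against a hypothetical table T with columns w, x, y, z, a and b) is to run the statement through EXPLAIN PLAN and count the WINDOW SORT operations that appear in the plan:

EXPLAIN PLAN FOR
SELECT rank()   OVER (PARTITION BY x    ORDER BY w) rnk,
       sum(a)   OVER (PARTITION BY w, x ORDER BY z) sum_a,
       ntile(4) OVER (PARTITION BY x    ORDER BY y) nt,
       sum(b)   OVER (PARTITION BY x, y ORDER BY z) sum_b
FROM   t;

SELECT * FROM TABLE(dbms_xplan.display());

The number of WINDOW SORT steps reported in the plan tells you how many ordering groups the optimizer settled on.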

Sort Optimization 2: Eliminating Sorts

We can further reduce the number of sorts that need to be performed by carefully scheduling the execution so that:

  • Ordering groups with sorts corresponding to that in the GROUP BY execute first (immediately after the GROUP BY) 
  • Ordering groups with sorts corresponding to that in the ORDER BY execute last (immediately before the ORDER BY)

In addition, we can also eliminate sorts when an index or join method (sort-merge) makes sorting unnecessary. 

Sort Optimization 3: RANK Predicates

Where a SQL statement includes RANK() functions there are additional optimizations that kick in. Instead of sorting all the data, adding the RANK and then applying the predicate, the RANK predicate is evaluated as part of the sort process. The net result is that fewer records are actually sorted, resulting in more efficient execution.
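A classic case where this applies is a top-N query, sketched below against a hypothetical SALES table: because the filter on the rank is evaluated while the sort is running, rows that can never satisfy rnk <= 10 are discarded early rather than being fully sorted.

SELECT *
FROM  (SELECT prod_id,
              cust_id,
              amount_sold,
              RANK() OVER (PARTITION BY prod_id ORDER BY amount_sold DESC) rnk
       FROM   sales)
WHERE rnk <= 10;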

Summary 

Overall, these three optimizations ensure that as few sorts as possible are performed when you include SQL analytical functions as part of your SQL statements. 

Friday Mar 21, 2014

Open World 2014 - guidelines for call-for-papers…

OOW Banner 2013

Most of you will already have received an email from the OOW team announcing the call for papers for this year's conference: http://www.oracle.com/openworld/call-for-papers/index.html. Each year, customers ask me how they can increase their chances of getting their paper accepted. Well, I am going to start by stating that product managers have absolutely no influence over which papers are accepted - even mentioning that a product manager will be co-presenting with you will not increase your chances!

So how do you make sure that your presentation title and abstract catch the eye of the selection committee? Here is my top 10 list of guidelines for submitting proposals:

1) Read the "call-for-papers" carefully and follow its instructions - even if you have submitted presentations for lots of Oracle conferences it is always a good idea to carefully read the call for papers and to make sure you follow the instructions. There is an excellent section towards the end of the call-for-papers web page, "Tips and Guidelines"

2) Address the theme of the conference - If this is available when the call for papers is announced then try to address the theme of the conference within your abstract.

3) Address the key data warehouse focus areas - for this year's OOW 2014 the key focus areas for data warehousing will be partitioning, analytical SQL, parallel execution, workload management and logical data warehouse. If possible try to include one or more of these focus areas within your abstract.

4) Have a strong biography - You need to use your biography to differentiate and build credibility. This is an important topic because it allows you to differentiate yourself from all the other presenters who are trying to get speaking slots. Your biography must explain why you are an authority on the topic you have chosen for your presentation and why people will want to listen to what you have to say.

5) Have a strong business case - build your presentation around a strong business case, relevant to your industry and/or your target audience (DBAs, developers, architects etc). Try to explain in clear and simple terms the problem you needed to solve, how you solved it using Oracle technology and the direct technical/business benefits.

6) Make the title and abstract interesting - Your title and abstract must be easy to read and make sure you introduce your main idea as early as possible. Review the titles and abstracts from previous conferences as a guide. Ideally make the issue relevant to the delegates attending OOW, get to the point, and make sure it is easy to read.

7) Look at previous presentations - the content catalog for last year's conference is available online, see here: https://oracleus.activeevents.com/2013/connect/search.ww?eventRef=openworld. You can review all the titles and abstracts that were accepted and use them as guidelines for creating your own title and abstract for this year's conference.

8) Write clear outcomes - The majority of the best presentations have clearly stated outcomes. What do you expect that conference attendees will be able to do or know at the end of your session? Consider including a sentence at the end of your abstract such as the following: “At the end of this presentation, participants will be able to . . . .”

9) Don’t submit your paper right away - Once you have a title and abstract show it to a few colleagues. Get some feedback. You probably know many people who’d be happy to give you ideas on making your paper better.

10) Keep the number of submissions low - You do not increase your chances of getting a paper accepted by submitting lots of different papers.

I cannot guarantee you success if you follow these guidelines but I hope they prove helpful. Good luck with your submission(s) and I look forward to seeing you at this year's OpenWorld in San Francisco.

OOW2014

Wednesday Mar 19, 2014

Announcing Encryption of Data-at-Rest on Big Data Appliance

With the release of Big Data Appliance software bundle 2.5, BDA completes the encryption story underneath Cloudera CDH. BDA already came with network encryption, ensuring no network sniffing can be applied between the nodes; it now adds encryption of data-at-rest.

A Brief Overview

Encryption of data-at-rest can be done in 2 modes. One mode leverages the Trusted Platform Module (TPM) on the motherboard to provide a key to encrypt the data on disk. This mode does not require a password or pass phrase but relies on the motherboard. The second mode leverages a passphrase, which in turn will be used to generate a private-public key pair generated with OpenSSL. The key pair is encrypted as well.

The passphrase encryption has a few more interesting aspects. For one, it does require the passphrase to be entered upon re-booting the system. Leveraging the TPM option does not require any manual intervention at reboot. On Big Data Appliance it is possible to regularly change the passphrase without impacting the encryption or requiring re-encryption of the data.

Neither one of the encryption methods affect user access to user data. In other words, on an unprotected cluster a user that can read data before encryption will be able to read data after encryption. The goal is to ensure data is protected on physical media - like theft or incorrect disposal of a disk. Both forms protect from that, but only passphrase based encryption protects from disposal or theft of a server.

On BDA, it is possible to switch between these two methods. This does have impact on running the cluster as data needs to be re-encrypted. For this step the cluster will be down, however data is not duplicated, so there is no need to reserve double the space to do the re-encryption.

How to Encrypt Data

As with all installation or changes on Big Data Appliance you will leverage Mammoth to do the install with encryption or to make changes to the system if you are already in production. Before you set up either of the two modes of data-at-rest encryption, you should consider your requirements. Changing the mode - as described - is possible, but will require the cluster to be down for re-encryption.

Full Set of Security Features

Out-of-the-box encryption is yet another feature that is specific to Oracle Big Data Appliance. On top of pre-configured Kerberos, Apache Sentry and Oracle Audit Vault, encryption now adds another security dimension. To read more about the full set of features start here.

Thursday Mar 13, 2014

Video: Big Data Connectors and IDH (Strata)

The certification of Oracle Big Data Connectors on Intel Distribution for Hadoop is now complete (see our previous post). This video from Strata gives you a nice overview of IDH and BDC.

Friday Mar 07, 2014

Intel® Distribution for Apache Hadoop* certified with Oracle Big Data Connectors

Intel partnered with Oracle to certify compatibility between Intel® Distribution for Apache Hadoop* (IDH) and Oracle Big Data Connectors*.  Users can now connect IDH to Oracle Database with Oracle Big Data Connectors, taking advantage of the high performance feature-rich components of that product suite. Applications on IDH can leverage the connectors for fast load into Oracle Database, in-place query of data in HDFS with Oracle SQL, analytics in Hadoop with R, XQuery processing on Hadoop, and native Hadoop integration within Oracle Data Integrator.

Read the whole post here.

Thursday Feb 06, 2014

Sessionization analysis with 12c SQL pattern matching is super fast

Over the past six months I have posted a number of articles about SQL pattern matching, as shown here:

Most of these have been related to explaining the basic concepts along with some specific use cases.

In this post I want to review some of the internal performance tests that we have run during the development of this feature. In part 3 of the series of podcasts I covered a number of use cases for SQL pattern matching such as: stock market analysis, tracking governance-compliance, call service quality and sessionization. The most popular scenario is likely to be the analysis of sessionization data as this is currently a hot topic when we start considering machine data and, in more general terms, big data.

To help us create a meaningful test data set we decided to use the TPC-H schema because it contained approximately seven years of data which equated to approximately 1TB of data. One of the objectives of our performance tests was to compare and contrast the performance and scalability of code using the 12c MATCH_RECOGNIZE clause with code using 11g window functions.

Analysis of Sessionization Data

To make things easy to understand I have divided our sessionization workflow into a number of steps.
Part 1 - For the purposes of this specific use case we defined a session as a sequence of one or more events with the same partition key where the gap between the timestamps is less than 10 seconds - obviously the figure for the gap is completely arbitrary and could be set to any number as required. The 1TB of source data looked like this:

Session_Id User
1 Mary
2 Sam
3 Richard
11 Mary
12 Sam
13 Richard
22 Sam
23 Mary
23 Richard
32 Sam
33 Richard
34 Mary
43 Richard
43 Sam
44 Mary
47 Sam
48 Sam
53 Mary
54 Richard
59 Sam
60 Sam
63 Mary
63 Richard
68 Sam


The following sections compare the use of 11g window functions vs. 12c MATCH_RECOGNIZE clause.
Part 2 - To create the first part of the sessionization workflow we took the original source data and used the USER_ID as the PARTITION BY key and the timestamp for the ORDER BY clause. The objective for this first step is to detect the various sessions and assign a surrogate session id to each session within each partition (USER_ID).
This creates an output result set that delivers a simplified sessionization data set as shown here:

NewImage

The 12c SQL to create the initial result set is as follows:

SELECT user_id, session_id, start_time, no_of_events, duration
FROM Events MATCH_RECOGNIZE
 (PARTITION BY User_ID ORDER BY Time_Stamp
  MEASURES match_number() session_id,
           count(*) as no_of_events,
           first(time_stamp) start_time,
           last(time_stamp) - first(time_stamp) duration
  ONE ROW PER MATCH 
  PATTERN (b s*)
  DEFINE
       s as (s.Time_Stamp - prev(s.Time_Stamp) <= 10)
 )
;

as a comparison here is how to achieve the above using 11g analytical window functions

CREATE VIEW Sessionized_Events as
SELECT Time_Stamp, User_ID,
 Sum(Session_Increment) over (partition by User_ID order by Time_Stamp asc) Session_ID
FROM ( SELECT Time_Stamp, User_ID,
 CASE WHEN (Time_Stamp - Lag(Time_Stamp) over (partition by User_ID order by Time_Stamp asc)) < 10
 THEN 0 ELSE 1 END Session_Increment
 FROM Events);
SELECT User_ID, Min(Time_Stamp) Start_Time,
 Count(*) No_Of_Events, Max(Time_Stamp) -Min(Time_Stamp) Duration
FROM Sessionized_Events
GROUP BY User_ID, Session_ID;


As you can see the 11g approach using window functions ( SUM() OVER(PARTITION BY…) ) is a little more complex to understand but it produces the same output - i.e. our initial sessionized data set.


Part 3 - However, to get business value from this derived data set we need to do some additional processing. Typically, with this kind of analysis the business value within the data emerges only after aggregation, which in this case needs to be by session. We need to reduce the data set to a single tuple, or row, per session along with some derived attributes, such as:

  • Within-partition Session_ID
  • Number of events in a session
  • Total duration 

To do this with Database 12c we can use the MATCH_RECOGNIZE clause to determine how many events are captured within each session. There are actually two ways to do this: 1) we can compare the current record to the previous record, i.e. peek backwards or 2)  we can compare the current record to the next record, i.e. peek forwards.
Here is code based on using the PREV() function to compare the current record against the previous record:

select count(*)
from ( select /* mr_sessionize_prev */ *
 from
 ( select o_pbykey,
session_id,
 start_time,
 no_of_events,
 duration
 from orders_v MATCH_RECOGNIZE
 (
  PARTITION BY o_pbykey
  ORDER BY O_custkey, O_Orderdate
  MEASURES match_number() session_id, count(*) as no_of_events,
           first(o_orderdate) start_time,
           last(o_orderdate) - first(o_orderdate) duration
  PATTERN (b s*)
  DEFINE s as (s.O_Orderdate - prev(O_Orderdate) <= 100)
 )
 )
 where No_Of_Events >= 20
); 

Here is code based on using the NEXT() function to compare the current record against the next record:

select count(*)
from ( select /* mr_sessionize_next */ *
 from
 ( select o_pbykey, session_id, start_time, no_of_events, duration
 from orders_v MATCH_RECOGNIZE
 (
PARTITION BY o_pbykey
ORDER BY O_custkey, O_Orderdate
MEASURES match_number() session_id, count(*) as no_of_events,
         first(o_orderdate) start_time,
         last(o_orderdate) - first(o_orderdate) duration
PATTERN (s* e)
DEFINE s as (next(s.O_Orderdate) - s.O_Orderdate <= 100)
 )
 )
 where No_Of_Events >= 20
);

Finally we can compare the 12c MATCH_RECOGNIZE code to the 11g code which uses  window functions (which in my opinion is a lot more complex):

select count(*)
from (
 select /* wf */ *
 from (select O_pbykey,
              Session_ID,
              min(O_Orderdate) Start_Time,
              count(*) No_Of_Events,
             (max(O_Orderdate) - Min(O_Orderdate)) Duration
        from (select O_Orderdate,
O_Custkey, o_pbykey,
                     sum(Session_Increment)
                     over(partition by o_pbykey order by O_custkey, O_Orderdate) Session_ID
               from ( select O_Custkey,
                             O_Orderdate,
                             O_pbykey,
                              case when (O_Orderdate -
                             Lag(O_Orderdate)
                                 over(partition by o_pbykey
                                      order by O_custkey, O_Orderdate)) <= 100 -- Threshold
                             then 0 else 1 end Session_Increment
                      from orders_v
                    )
            )
       group by o_pbykey, Session_ID
    )
 where No_Of_Events >= 20
);


The final output generated by both sets of code (11g using window functions and 12c using MATCH_RECOGNIZE clause ) would look something like this:
NewImage

Part 4 - The performance results for these three approaches (11g window functions vs. MATCH_RECOGNIZE using PREV() vs. MATCH_RECOGNIZE using NEXT()) are shown below. Please note that on the graph the X-axis shows the number of partitions within each test-run and the Y-axis shows the time taken to run each test. There are three key points to note from this graph:
The first is that, in general, the 12c MATCH_RECOGNIZE code is between 1.5x and 1.9x faster compared to using window functions, which is good news if you are looking for a reason to upgrade to Database 12c.

MR Performance Results

Secondly, it is clear from the X-axis that as the number of partitions increases the MATCH_RECOGNIZE clause continues to deliver excellent performance; in other words, it scales well as your data volumes increase.

However, it is important to remember that the 11g window function code scales and performs well too. If you are on 11g at the moment and have not considered using Oracle Database to run your sessionization analysis, it is definitely worth pulling that transformation code back inside the database and using window functions to run those sessionization transformations. And if you need a reason to upgrade to Database 12c, MATCH_RECOGNIZE does offer significant performance benefits over running your pattern matching operations either with 11g window functions or in an external processing engine.

Lastly, when you are designing your own MATCH_RECOGNIZE implementations and you are using the NEXT() or PREV() functions, it is worth investigating whether switching to the other function offers any significant performance benefit. Obviously, much will depend on the nature of the comparison you are trying to formulate, but it is an interesting area and we would welcome feedback on this specific point based on your own data sets.
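
If you want to run that comparison on your own system, one lightweight approach (and this is just a sketch, not the harness used for the results above) is to tag each variant with a marker comment, exactly as the mr_sessionize_prev and mr_sessionize_next examples above do, run both versions and then compare the cumulative statistics Oracle keeps for each statement in V$SQL:

-- Sketch only: pull cumulative execution statistics for the tagged queries.
-- The marker text in the LIKE predicate is simply whatever comment you
-- embedded in your own queries.
select sql_id,
       substr(sql_text, 1, 60)      as sql_snippet,
       executions,
       round(elapsed_time / 1e6, 2) as elapsed_secs,
       round(cpu_time / 1e6, 2)     as cpu_secs,
       buffer_gets
from   v$sql
where  sql_text like '%mr_sessionize_%'
and    sql_text not like '%v$sql%'   -- exclude this monitoring query itself
order  by elapsed_time desc;

ELAPSED_TIME and CPU_TIME are reported in microseconds in V$SQL, hence the division by 1e6; alternatively you can simply use SET TIMING ON in SQL*Plus and run each variant a few times.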

In general, if you are using the MATCH_RECOGNIZE clause then I would love to hear about your use case and experiences in developing your code. You can contact me directly either via this blog or via email (keith.laker@oracle.com).
Have fun with MATCH_RECOGNIZE….

Wednesday Feb 05, 2014

OTN Virtual Developer Day Database 12c content now available on-demand

Thank you to everyone who attended the SQL pattern matching session during yesterday's OTN Virtual Developer Day event. We had a great crowd of people join our live workshop session. I hope everyone enjoyed using the amazing platform which the OTN team put together to host the event.  

The great news is that all the content from the event is now available for download and you can watch all the on-demand videos from the four tracks (Big Data DBA, Big Data Developer, Database DBA and Database Developer). 

The link to the fantastic OTN VDD platform is here: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite and this is what the landing page looks like:

OTNVDD Me

This page will give you access to the keynote session by Tom Kyte and Jonathan Lewis, which covered the landscape of Oracle DB technology evolution and adoption. The keynote looks at what's next for Oracle Database 12c, focusing on the high-value technologies and techniques that are driving greater database efficiency and innovation.

You will be able to access the videos and slides from each presentation, along with a huge range of technical hands-on labs covering big data and database technologies, including my SQL Pattern Matching workshop. If you want to download the VirtualBox image for the Database tracks it is available here: http://www.oracle.com/technetwork/database/enterprise-edition/databaseappdev-vm-161299.html (this contains everything you need to run my SQL Pattern Matching workshop).

While you are doing the workshop, if you have any questions then please feel free to email me - keith.laker@oracle.com.

Enjoy.

Monday Jan 27, 2014

Announcing: Oracle Big Data Lite Virtual Machine

You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.


BigDataLite Picture

Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox image that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It has been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB of memory (this means that your computer should have at least 8GB of total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.

How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am  to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:

OTN_VDD

There will be 45-minute sessions delivered by product experts (from both Oracle and Oracle ACEs) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution". Some of the big data technical sessions include:

  • Oracle NoSQL Database Installation and Cluster Topology Deployment
  • Application Development & Schema Design with Oracle NoSQL Database
  • Processing Twitter Data with Hadoop
  • Use Data from a Hadoop Cluster with Oracle Database
  • Make the Right Offers to Customers Using Oracle Advanced Analytics
  • In-DB Map Reduce with SQL/Hadoop
  • Pattern Matching in SQL

Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!

FREE OTN virtual workshop - Learn about SQL pattern matching with Oracle Database 12c.

otn virtual dvlper day

Make sure you are free on Tuesday February 4 because the OTN team are hosting another of their virtual developer day events. Most importantly, it is FREE. Even more important is the fact that I will be running a 12c pattern matching workshop at 11:45am Pacific Time. Of course there are lots of other sessions that you can attend relating to big data and Oracle Database 12c, and the OTN team has created two streams to help you learn about these two important areas:

  • Oracle Database application development — Learn expert tips and tricks on how to develop applications for Oracle Database 12c and Big Data environments more effectively.
  • Oracle Database platform deployment processes — From integration to data migration, experts showcase new capabilities in Oracle Database 12c and Big Data environments that will allow you to deliver greater database performance and integration.

You can sign-up for the event and pick your tracks and sessions via this link: https://oracle.6connex.com/portal/database2014/login?langR=en_US&mcc=aceinvite

My pattern matching session is included in the Oracle 12c DBA section of the application development track and the workshop will cover the following topics:

  • Part 1 - Introduction to SQL Pattern Matching
  • Part 2 - Pattern Match: simple example
  • Part 3 - How to use built-in measures
  • Part 4 - Searching for more complex patterns
  • Part 5 - Deep dive into how SQL Pattern Matching works
  • Part 6 - More Advanced Topics

As my session is only 45 minutes long I am only going to cover the first three topics and leave you to work through the last three topics in your own time. During the 45 minute workshop I will be available to answer any questions via the live Q&A chat feature.

There is a link to the full agenda on the invitation page. The OTN team will be providing a Database 12c Virtualbox VM that you will be able to download later this week. For the pattern matching session I will be providing the scripts to install our sample schema, the slides from the webcast and the workshop files which include a whole series of exercises that will help you learn about pattern matching and test your SQL skills. 

The big data team has kindly included my pattern matching content inside their VirtualBox image, so if you want to focus on the sessions offered on the big data tracks but still want to work on the pattern matching exercises after the event, you will have everything you need already installed and ready to go!

Don't forget to register as soon as possible and I hope you have a great day…Let me know if you have any questions or comments.
