Wednesday Feb 20, 2013

Looking for tools to solve your big data problems?

Look no further: Infosys today announced its Infosys BigDataEdge developer platform, designed to help you drive value from your big data stack.

By empowering business users to rapidly develop insights from vast amounts of structured and unstructured data, the platform lets organizations make better business decisions in near real time. With Infosys BigDataEdge, enterprises can reduce the time taken to extract information by up to 40 percent and generate insights up to eight times faster.

Read More.

Thursday Feb 07, 2013

New Series of Video Workshops

Contrary to my habit of posting How-To articles, today I'm not going to tell you how to do anything.  Instead, I'm going to ask you to go do something.  What I want you to do is check out the series of video tutorials I'm doing around pieces of the Hadoop ecosystem and Big Data analytics.  

The first entry is on YouTube and in the Big Data Playlist on Oracle Media Network.  It gives a brief overview of setting up Flume flows into HDFS.  Of course, if you're looking for a more detailed How-To, just check the archives -- I covered it a while back.

Some of the topics I'll cover in upcoming months include:

  • Hive User Defined Functions
  • Market Basket Analysis with Hive, R and Data Miner 
  • Customer Segmentation and Clustering in Hadoop
  • Classification and Targeting using Random Forests

[Read More]

Wednesday Jan 30, 2013

Parallel R: Quick Ways to Model More


I am less and less often mistaken for a pirate when I mention the R language.  While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years.  When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have.  That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.

  I want to explore a slightly less-often considered aspect of R development: parallelism.  Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server.  However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.

R is Single-Threaded

The R interpreter is -- and likely always will be -- single-threaded.  This means loading data frames is done in a single thread.  So is building your linear model, or generating that pretty surface plot.  Even on my laptop, that leaves a lot of cores sitting idle while one thread does all the modeling.  No matter how much my web browser might covet those cycles, I'd like to use them for work.

Rather than a complex multithreaded re-implementation, the R interpreter offers a number of ways to allow users to selectively apply parallelism.  Some of these approaches leverage MPI libraries and mirror the message-passing approach.  Others allow a more implicit parallelism via "foreach" or "apply" constructs. We'll just focus on a pair of strategies using the parallelism that's been included in R since version 2.14.1: the parallel library.

 Setting The Stage for Parallel Execution

We're going to need to load a few libraries into our R session before we can execute anything outside of our single thread.  We'll use doParallel and foreach because they allow us to focus on what to parallelize rather than how to coordinate our threads.
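
Assuming both packages are already installed, loading them is just a pair of library calls:

> library(foreach)
> library(doParallel)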

> data(iris)

Knowing that calculations in R will be single-threaded, we want to use the parallel package to operate on logical subsets of the data simultaneously.  For example, I loaded a set of data about Iris which contains a number of different species.  One way I might want to parallelize is to fit the same model to each species simultaneously.  For that, I'm going to have to split the data by species:

> species.split <- split(iris, iris$Species)

 This gives us a list we can iterate over -- or parallelize.  From here on out, it's simply a question of deciding what resources we want to leverage: local CPUs or remote hosts.


We're going to use the makeCluster function to bind together a set of computational resources.  But first we need to decide: do we want to use only local CPUs, or is it necessary to open up socket connections to other machines to distribute our workload?  In the former case we'll use makeCluster to create what's called a FORK cluster (in that it uses UNIX's fork call to create slaves).  In the latter, we'll create a SOCK cluster by opening up sockets to a list of remote hosts and starting slave processes on them.

Here's a FORK cluster which uses all my cores:

> cl <- makeCluster(detectCores(), type="FORK")

And here's a SOCK cluster across three nodes (password-less SSH is required):

> hostlist <- c("", "", "")
cl <- makeCluster(hostlist)

In each case, I call registerDoParallel to bind this cluster to the %dopar% operator.  This is the operator which will let us easily iterate in parallel.
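
With either cluster object in hand, that's a single call:

> registerDoParallel(cl)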

Running in Parallel

Once we've got something to iterate over and a cluster with which to do it, modeling in parallel becomes straightforward.  Suppose I want to fit a model of sepal length as a linear combination of petal characteristics.  In that case, the code is simply:

> species.models <- foreach(i=species.split) %dopar% {
    m <- lm(i$Sepal.Length ~ i$Petal.Width * i$Petal.Length)
    m
}

But I'm not just restricted to fitting linear models on my little cluster.  I can run k-means clustering for several different k simultaneously using basically the same block:

> species.clusters <- foreach(i=2:5) %dopar% {
    km <- kmeans(iris[, 1:4], i)  # cluster on the numeric columns only
    km
}
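
Once the block finishes, a quick way to compare the candidate values of k is to pull the total within-cluster sum of squares out of each fit, for example:

> sapply(species.clusters, function(km) km$tot.withinss)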

When I'm done with my block, I can just call stopCluster(cl) to ensure my processes terminate and I'm not hogging resources.

Using Hadoop

Finally, there will be situations in which I need to deploy in parallel against much larger datasets -- specifically, datasets stored in HDFS.  Both Hive and Pig will let me run an R script as part of a streaming process.  In Hive, the TRANSFORM operator will send data to an R script.  In Pig, you can use the STREAM operator to send a whole bag to an R script.  However, you can't stream from within Pig's FOREACH blocks, so I occasionally use a UDF which invokes R scripts for me.
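
As a rough illustration (the table, column and script names here are hypothetical), a Hive TRANSFORM query that streams rows through an R script might look like:

ADD FILE /path/to/fit_model.R;
SELECT TRANSFORM (sepal_length, petal_width, petal_length)
USING 'fit_model.R'
AS (model_summary STRING)
FROM iris_hive;

Keep in mind that Hive streams rows to the script tab-delimited by default, so the separator your R code reads has to match.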

Regardless of the method you choose to send HDFS data to an R process, it's important to make sure your R script can consume data streaming from standard input.  I find the most expedient way of doing this is via the file function.  A typical script might start:

#! /usr/bin/env Rscript
# Connection to STDIN for reading a data frame
con <- file(description="stdin")
input.data <- read.table(con, header=FALSE, sep=",")  # the variable name here is illustrative


We've covered several ways to push R beyond the bounds of its single-threaded core.  There are forking and socket mechanisms for spreading our work around, not to mention tricks for leveraging the power of Hadoop Streaming.  In each case, however, one thing stands out: we must be smart as modelers and understand what can and should be done in parallel.

[Read More]

Monday Jan 28, 2013

First Oracle BIWA Data Scientists Certified

Those who attended the BIWA Summit a few weeks ago will have seen the data scientist certification. BIWA just listed the first batch of data scientists it certified:

Instructor Level Certificate  - Brendan Tierney

Oracle Data Scientist Certificate 
Don Ferguson, CherryRoad Technologies
Jorge Anicama, IBM (GBS)
Tim Vlamis, Vlamis Software Solutions
Vijayalakshmi Muthukrishnan, Motorola
Sicheng Liu, Deloitte Consulting
Avik Bhattacharya, Printpack Inc.
Ari Kaplan, Ariball
Paul Mitchell, Oracle

Associate Level 
Suresh Anand, Sashatech LLC

Participation Certificate 
Ahmed Kopap
Ekine Akuiyibo
Khader Mohiuddin

For more on the program, see here.

Friday Jan 25, 2013

Announcing: OTN Big Data Developer Day

Announcing the first Big Data Developer Day: a full day with two tracks and hands-on labs covering all things Big Data at Oracle!!

An influx of new data types combined with new approaches for analyzing data are creating untapped growth opportunities that have the potential to transform your business. Oracle is the first vendor to provide a complete and integrated set of enterprise-ready products to address the full spectrum of big data business requirements. Jumpstart your understanding of big data in the enterprise by attending this complimentary one-day hands-on workshop. You will learn from technical experts how to:

  • Write MapReduce on Oracle’s Big Data Platform
  • Manage a Big Data environment
  • Access Oracle NoSQL Database
  • Manage Oracle NoSQL DB Cluster
  • Use data from a Hadoop Cluster with Oracle
  • Develop analytics on big data

Register today to learn these skills which you can immediately put to use within your organization.

For more information and to sign up (space is limited!) click the link here.

So if you are in the Bay Area, do come and learn the coolest new technologies.

Friday Jan 18, 2013

Big Data Appliance X3-2 Updates


Hello world. Wow, time went by too fast. Happy new year, and here is the long-overdue update on the new Big Data Appliance and its software.

Big Data Appliance X3-2

Both the software as well as the hardware of the Big Data Appliance got a refresher.

Hardware Update

A good place to start is to quickly review the hardware differences (no price changes!). On a per node basis the following is a comparison between old and new (X3-2) hardware:

Big Data Appliance v1  →  Big Data Appliance X3-2:

  • CPU: 2 x 6-Core Intel® Xeon® 5675 (3.06 GHz)  →  2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz)
  • Memory (X3-2): 64GB, expandable to 512GB
  • Disk: 12 x 3TB High Capacity SAS  →  12 x 3TB High Capacity SAS (unchanged)
  • KVM Switch: 1 per rack  →  N/A (removed)

For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X3-2. For those wondering what we did with the 2RU we now have left after removing the KVM: that is now open space at the top of the rack.

The higher core count gives a BDA X3-2 more parallel compute power while saving some 30% in energy and heat.

Software Update

As with the hardware, a good place to start is a quick overview of the software changes, summarized below:

Big Data Appliance v1.1.x Software Stack  →  Big Data Appliance V2.0.1 Software Stack:

  • Operating System: Oracle Linux 5.6  →  Oracle Linux 5.8 with UEK
  • Cloudera CDH: CDH 3u4  →  CDH 4.1.x
  • Cloudera Manager: CM 3  →  CM 4.1
  • Oracle Enterprise Manager: Big Data Appliance Plug-In for Enterprise Manager (new in V2.0.1)
  • R: Open Source R  →  Oracle R Distribution 2.x
  • Big Data Connectors *: Big Data Connectors 1.1.x  →  Big Data Connectors 2.0.x
  • Oracle NoSQL Database CE **: NoSQL DB 1.x  →  NoSQL DB 2.x

* Oracle Big Data Connectors is a separately licensed product which can be pre-installed and pre-configured on BDA
** Oracle NoSQL DB 2.x will be pre-installed in a future update to Mammoth but can be applied manually today

Apart from the version updates, bug fixes and a great number of performance improvements across the entire system, the biggest updates are the inclusion of CDH 4.1.2 with highly available name nodes for Hadoop set up by default, Enterprise Manager management of the BDA, the uptake of the Oracle R Distribution and the updates to Oracle NoSQL Database. In a nutshell these updates deliver the following improvements:

Cloudera CDH 4.1.x

The latest version of CDH and CM deliver:

  • Higher overall performance
  • Highly available name nodes with the BDA using failover quorum processes instead of an external HA filer solution
  • Vastly expanded management capabilities via CM 4

On top of this, BDA now has both Zookeeper and Oozie configured out of the box.

Oracle Enterprise Manager

The new Big Data Appliance Plug-In for Enterprise Manager delivers the first end-to-end management of the Hadoop cluster from hardware metrics to software and Hadoop metrics. To achieve the end-to-end management of the system Enterprise Manager delivers all the system metrics users are used to from the Exadata Plug-In for Enterprise Manager. Enterprise Manager enables a seamless transition between the Hardware and high level software monitoring and the expanded Hadoop monitoring and diagnostics from Cloudera Manager. This combination of functionality makes operations for a BDA simpler and allows operations staff to seamlessly switch between their Exadata, Big Data Appliance and other Oracle Engineered systems.

Oracle R Distribution

The big difference between Oracle R Distribution and the open source R distribution is that Oracle R Distribution can dynamically load the math kernel libraries for both Intel and AMD CPUs. This increases the performance of basic calculations, which in turn increases the performance of the overall R calculations because more math is off-loaded onto the CPUs.

Oracle NoSQL Database 2.x

A great number of new features have been added in NoSQL DB 2.x. Most of these are in both the Community Edition and the Enterprise Edition. Charles Lamb has a nice concise post describing what is new here.

Big Data Connectors

To close out, Big Data Connectors got a refresher focused on performance, so download the new products and give them a go via this download page. For more information, read the data sheet here.

Tuesday Nov 20, 2012

Winner of the 2012 Government Big Data Solutions Award

Hot off the press:

The winner of the 2012 Government Big Data Solutions Award is the National Cancer Institute!! Read all the details in the announcement post.

A short excerpt to whet your appetite: "... This solution, based on the Oracle Big Data Appliance with the Cloudera Distribution of Apache Hadoop (CDH), leverages capabilities available from the Big Data community today in pioneering ways that can serve a broad range of researchers. The promising approach of this solution is repeatable across many other Big Data challenges for bioinfomatics, making this approach worthy of its selection as the 2012 Government Big Data Solution Award."

Read the entire post.

Congrats to the entire team!!

Friday Oct 05, 2012

Building Simple Workflows in Oozie


More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want to codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site.

Hive Actions: Prepping for Pig

In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie.

I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me the flexibility of Hive's query language when I want it, but lets me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code.

CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column1 STRING, column2 STRING)
PARTITIONED BY (yr string)
LOCATION '/user/oracle/weather/historic';

As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow. Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of its long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing a SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access.

ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2011')
LOCATION '/user/oracle/weather/historic/yr=2011';

INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history'
SELECT w.stn, w.wban, w.weather_year, w.weather_month,
w.weather_day, w.temp, w.dewp, FROM (
FROM historic_weather SELECT TRANSFORM(...)
USING '/path/to/hive/filters/'
as stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather
) w;

Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive statements using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called ncdc_parse.hql.

Starting Our Workflow

Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions of Workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling. The bare minimum for workflow XML defines a name, a starting point, and an end point:

<workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1">
<start to="ParseNCDCData"/>
<end name="end"/>
</workflow-app>

To this we need to add an action, and within that we'll specify the Hive parameters.  Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure.

<action name="ParseNCDCData">
<hive xmlns="uri:oozie:hive-action:0.2">
<ok to="WeatherMan"/>
<error to="end"/>

There are a couple of things to note here:
  1. I have to give the FQDN (or IP) and port of my JobTracker and NameNode.
  2. I have to include a hive-default.xml file.
  3. I have to include a script file.
  4. The hive-default.xml and script file must be stored in HDFS
That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have different hive-defaults.xml on different clusters (e.g. MySQL or Postgres-backed metastores).
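
Filled in, the action might look something like this sketch (the JobTracker and NameNode addresses are placeholders):

<action name="ParseNCDCData">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>jobtracker.example.com:8021</job-tracker>
    <name-node>hdfs://namenode.example.com:8020</name-node>
    <job-xml>hive-default.xml</job-xml>
    <script>ncdc_parse.hql</script>
  </hive>
  <ok to="WeatherMan"/>
  <error to="end"/>
</action>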


A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to workflow.xml.  At this point, our local directory should contain:

  1. workflow.xml
  2. hive-default.xml (make sure this file contains your metastore connection data)
  3. ncdc_parse.hql


Adding Pig to the Ooze

Adding our Pig script as an action is slightly simpler from an XML standpoint.  All we do is add an action to workflow.xml as follows:

<action name="WeatherMan">
<ok to="end"/>
<error to="end"/>


Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My pig script registers the Weka Jar and a chunk of jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the pig script, because pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes.

While Oozie doesn't assume much, it does make an assumption about the Pig classpath.  Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script.  Anything that uses a REGISTER statement cannot be in the working_directory/lib directory.  Instead, it needs to be in a different HDFS directory and attached to the pig action with an <archive> tag.

Yes, that's as confusing as you think it is.

 You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook.
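
Putting that together, a sketch of the Pig action might look like the following (the addresses and the jar's HDFS path are placeholders; the <archive> element is what attaches the Weka jar):

<action name="WeatherMan">
  <pig>
    <job-tracker>jobtracker.example.com:8021</job-tracker>
    <name-node>hdfs://namenode.example.com:8020</name-node>
    <script>weather_train.pig</script>
    <archive>/user/oracle/jars/weka.jar#weka.jar</archive>
  </pig>
  <ok to="end"/>
  <error to="end"/>
</action>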

Making the Workflow Work

We've got a workflow defined and have collected all the components we'll need to run.  But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie.  We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server.  In the same working directory, we'll make a job properties file.
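
A minimal sketch of such a properties file might look like this (the host names and HDFS paths are placeholders, not the original values):

weatherRoot=weather_ooze
jobTracker=jobtracker.example.com:8021
queueName=default
oozie.libpath=hdfs://namenode.example.com:8020/user/oozie/share/lib
oozie.wf.application.path=hdfs://namenode.example.com:8020/user/oracle/working_dir
outputDir=${weatherRoot}/output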



While some of the pieces of the properties file are familiar (e.g., JobTracker address), others take a bit of explaining.  The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName).  We're simply using them to simplify the directives for the Oozie job.  The oozie.libpath piece is extremely important.  This is a directory in HDFS which holds Oozie's shared libraries: a collection of Jars necessary for invoking Hive, Pig, and other actions.  It's a good idea to make sure this has been installed and copied up to HDFS.  The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory.

We're finally ready to submit our job! After all that work we only need to do a few more things:

  1. Validate our workflow.xml
  2. Copy our working directory to HDFS
  3. Submit our job to the Oozie server
  4. Run our workflow
Let's do them in order. First validate the workflow:
oozie validate workflow.xml

Next, copy the working directory up to HDFS:
hadoop fs -put working_dir /user/oracle/working_dir

Now we submit the job to the Oozie server.  We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our properties file as an argument.
oozie job -oozie -config /path/to/working_dir/ -submit

We've submitted the job, but we don't see any activity on the JobTracker? All I got was this funny bit of output:
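
That output is just an identifier for the newly created job, something along the lines of:

job: 14-20120525161321-oozie-oracle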

This is because submitting a job to Oozie creates an entry for the job and places it in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job.
oozie job -oozie -start 14-20120525161321-oozie-oracle

Of course, if we really want to run the job from the outset, we can change the "-submit" argument above to "-run." This will prep and run the workflow immediately.



So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.

[Read More]

Wednesday Sep 05, 2012

E-Book on big data (featuring Analysts, Customers and more)

As we are gearing up for Openworld, here is a nice E-book on big data to start paging through. It contains Gartner's take on big data, customer and partner interviews and a lot more good info. Enjoy the read so you come prepared for Openworld!!

Read the E-Book here.

For those coming to Oracle Openworld (or the America's Cup races around the same time), you can find big data sessions via this URL.


Wednesday Aug 01, 2012

Flume and Hive for Log Analytics

There's a lot to learn from log data, but to get the most value from it, that data needs to be easy to collect and analyze. Otherwise, time that could be used to learn from data is spent writing parsers and transport components. In this entry we'll simplify log collection and transport using JSON serialization and parts of the Hadoop ecosystem.

Logging everything in JSON is a great idea. As serialization formats go it's engineer-friendly: you and your favorite programming language can both read it. Moreover, having all of your log data structured as universal data structures makes getting started with analytics much simpler. To illustrate how much simpler, we'll take JSON logs written to a flat file, stream them into HDFS, and expose them via Hive for exploration and aggregation.

The Preliminaries

We're going to use three components to put our system together:

  • A flat file that's collecting JSON data. Assume entries look a bit like this:
    {"fieldA":"string data","fieldB":400,"fieldC":0.99}
  • Flume: the distributed log-collection service that's part of the Hadoop ecosystem
  • Hive and a SerDe for handling JSON data

The "Tail Table"

We'll begin by setting up the final destination for our log data. This requires we create a directory in HDFS to hold the log data and define a Hive table over it. Making the directory's easy:

hadoop fs -mkdir /user/oracle/tail_table

Similarly, defining the external table is straightforward in the Hive command line:

CREATE EXTERNAL TABLE IF NOT EXISTS tail_table(fieldA string, fieldB int, fieldC float)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/tail_table';

This gives us a table which will read and respect the types of the values in our JSON records. If a field isn't present for a given record, a NULL value is returned for that column. Fields not included in the CREATE statement are ignored but still exist in the JSON. This allows the schema of the JSON to remain flexible while minimally impacting the Hive table.
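
Once data lands in the table, exploration and aggregation are plain HiveQL; a quick, purely illustrative roll-up might be:

SELECT fieldA, COUNT(*) AS events, AVG(fieldC) AS avg_fieldC
FROM tail_table
GROUP BY fieldA;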

Streaming Data with Flume OG

CDH 3u4 ships with two very different versions of Flume. The default is Flume 0.9.4, or Flume OG. It's great at streaming data into HDFS, but Flume OG has some requirements:

  • You must run Zookeeper to coordinate Flume nodes
  • You must run a Flume master to control Flume nodes
Those caveats aside, setting up Flume to stream data into our Hive table is remarkably simple. We only need to define a source which tails our JSON logs and a sink which writes these into the appropriate HDFS directory. We can set this up via the Flume master's web interface. Just navigate to the Flume master web interface at http://flumemaster.your.domain:35871 and click the config link. From here, select the Flume node you want to configure from the dropdown menu (i.e. the node which has the JSON log file). The rest is easy:
  • Set the source as: tail("/path/to/json.log")
  • Set the sink as: collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30)

This configuration will tail the log file and write a new message into HDFS with each new line. The collectorSink will commit data to our Hive table every 30 seconds. The resulting configuration looks like this: 
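
Expressed as a Flume OG data flow spec (the node name here is hypothetical), that mapping reads roughly:

json_agent : tail("/path/to/json.log") | collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30);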

Streaming Data with Flume NG

The other version of Flume which ships with CDH3 is Flume NG. Flume NG is significantly different from its predecessor. Our tail source from the previous section is gone, but so too are many of the restrictions:
  • Zookeeper is no longer a requirement
  • The master-slave architecture has been replaced by independent Flume agents
  • We can now use Avro RPCs to transfer data in multi-hop flows
That last point is a big advance for Flume. In Flume OG, transfer from application servers to our Hadoop cluster was a gray area. Either our application servers run Flume nodes connected to the Zookeeper instances and Flume masters for the Hadoop cluster, or logs must be transferred into the Hadoop cluster via another method. In Flume NG, we can run independent Flume agents on the application servers and the Hadoop cluster, relying on Avro RPC to handle forwarding.

For this type of multi-hop log transfer, we need a flume-ng-agent running on each application server and one on the Hadoop cluster. The application servers will have a flume.conf file which includes something like this:

app-agent.sources = tail
app-agent.channels = memoryChannel
app-agent.sinks = avro-forward-sink
# the in-memory channel that connects the tail source to the Avro sink
app-agent.channels.memoryChannel.type = memory
app-agent.sources.tail.type = exec
app-agent.sources.tail.command = tail -f /path/to/json.log
app-agent.sources.tail.channels = memoryChannel
app-agent.sinks.avro-forward-sink.type = avro
app-agent.sinks.avro-forward-sink.hostname =
app-agent.sinks.avro-forward-sink.port = 10000
app-agent.sinks.avro-forward-sink.channel = memoryChannel

This sets up a source that runs "tail" and a sink that forwards that data via Avro RPC to the collecting agent on port 10000.

The collecting Flume agent on the Hadoop cluster will need a flume.conf with an avro source and an HDFS sink.
hdfs-agent.sources = avro-collect
hdfs-agent.sinks = hdfs-write
hdfs-agent.channels = memoryChannel
# the in-memory channel that connects the Avro source to the HDFS sink
hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind =
hdfs-agent.sources.avro-collect.port = 10000
hdfs-agent.sources.avro-collect.channels = memoryChannel
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://namenode/user/oracle/tail_table
hdfs-agent.sinks.hdfs-write.hdfs.rollInterval = 30
hdfs-agent.sinks.hdfs-write.channel = memoryChannel

On this side we've defined a source that reads Avro messages on port 10000 and writes the results into HDFS, rolling the file every 30 seconds. It's just like our setup in Flume OG, but now multi-hop forwarding is a snap.

The resulting configuration looks like this: 


No matter which version you deploy, the combination of Flume, Hive and JSON makes it straightforward to set up an end-to-end pipeline for consuming and analyzing serialized log data. With deployments this simple, you can spend more time focusing on your applications and analytics.

[Read More]

Wednesday Jun 27, 2012

FairScheduling Conventions in Hadoop

While scheduling and resource allocation control has been present in Hadoop since 0.20, a lot of people haven't discovered or utilized it in their initial investigations of the Hadoop ecosystem. We could chalk this up to many things:

  • Organizations are still determining what their dataflow and analysis workloads will comprise
  • Small deployments under tests aren't likely to show the signs of strains that would send someone looking for resource allocation options
  • The default scheduling options -- the FairScheduler and the CapacityScheduler -- are not placed in the most prominent position within the Hadoop documentation.

However, for production deployments, it's wise to start with at least the foundations of scheduling in place so that you can tune the cluster as workloads emerge. To do that, we have to ask what the off-the-rack scheduling options are. We have some choices:

  • The FairScheduler, which will work to ensure resource allocations are enforced on a per-job basis.
  • The CapacityScheduler, which will ensure resource allocations are enforced on a per-queue basis.
  • Writing your own implementation of the abstract class org.apache.hadoop.mapred.TaskScheduler is an option, but usually overkill.

If you're going to have several concurrent users and leverage the more interactive aspects of the Hadoop environment (e.g. Pig and Hive scripting), the FairScheduler is definitely the way to go. In particular, we can do user-specific pools so that default users get their fair share, and specific users are given the resources their workloads require.

To enable fair scheduling, we're going to need to do a couple of things. First, we need to tell the JobTracker that we want to use scheduling and where we're going to be defining our allocations. We do this by adding the following to the mapred-site.xml file in HADOOP_HOME/conf:
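
A sketch of the two properties involved (the path to the allocation file is a placeholder):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/allocations.xml</value>
</property>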





What we've done here is simply tell the JobTracker that we'd like task scheduling to use the FairScheduler class rather than a single FIFO queue. Moreover, we're going to be defining our resource pools and allocations in a file called allocations.xml. For reference, the allocation file is read every 15s or so, which allows for tuning allocations without having to take down the JobTracker.

Our allocation file is now going to look a little like this:

<?xml version="1.0"?>
<pool name="dan">
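
Fleshed out, an allocation file along those lines might read as follows (the limits shown are illustrative, not the original values):

<?xml version="1.0"?>
<allocations>
  <pool name="dan">
    <minMaps>8</minMaps>
    <maxMaps>16</maxMaps>
    <minReduces>4</minReduces>
    <maxReduces>8</maxReduces>
    <maxRunningJobs>4</maxRunningJobs>
  </pool>
  <userMaxJobsDefault>2</userMaxJobsDefault>
</allocations>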

In this case, I've explicitly set my username to have upper and lower bounds on the maps and reduces, and allotted myself double the number of running jobs. Now, if I run hive or pig jobs from either the console or via the Hue web interface, I'll be treated "fairly" by the JobTracker. There's a lot more tweaking that can be done to the allocations file, so it's best to dig down into the description and start trying out allocations that might fit your workload.

[Read More]

Monday May 14, 2012

Price comparison for Big Data Appliance

Untitled Document

We are seeing a lot of discussions about pricing and comparable pricing between a "standard Hadoop cluster" and the Oracle Big Data Appliance. This post is aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not included in the pricing and packaging of Oracle Big Data Appliance.

Read the updated post here.

Monday Jan 09, 2012

Big Data Appliance and Big Data Connectors are now Generally Available

Today - January 10th, we announced the general availability of Oracle Big Data Appliance and Oracle Big Data Connectors as well as a partnership with Cloudera. Now that should be fun to start the new year in big data land!!

Big Data Appliance

Oracle Big Data Appliance brings Big Data solutions to mainstream enterprises. Built using industry-standard hardware from Sun and Cloudera's Distribution including Apache Hadoop, the Big Data Appliance is designed and optimized for big data workloads. By integrating the key components of a big data platform into a single product, Oracle Big Data Appliance delivers an affordable, scalable and fully supported big data infrastructure without the risks of a custom built solution. The Big Data Appliance integrates tightly with Oracle Exadata and Oracle Database using Oracle Big Data Connectors, and enables analysis of all data in the enterprise – structured and unstructured.

Big Data Connectors

Built from the ground up by Oracle, Oracle Big Data Connectors delivers a high-performance Hadoop to Oracle Database integration solution and enables optimized analysis using Oracle’s distribution of open source R analysis directly on Hadoop data. By providing efficient connectivity, Big Data Connectors enables analysis of all data in the enterprise – both structured and unstructured.

Cloudera CDH and Cloudera Manager

Oracle Big Data Appliance contains Cloudera’s Distribution including Apache Hadoop (CDH) and Cloudera Manager. CDH is the #1 Apache Hadoop-based distribution in commercial and non-commercial environments. CDH consists of 100% open source Apache Hadoop plus the comprehensive set of open source software components needed to use Hadoop. Cloudera Manager is an end-to-end management application for CDH. Cloudera Manager gives a cluster-wide, real-time view of nodes and services running; provides a single, central place to enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization.

More Information

Data sheets, white papers and other interesting information can be found here:

    * Big Data Appliance OTN page
    * Big Data Connectors OTN page

Happy new year and I hope life just got a bit more interesting!!

Thursday Dec 15, 2011

Understanding a Big Data Implementation and its Components

I often get asked about big data, and more often than not we seem to be talking at different levels of abstraction and understanding. Words like real time show up, words like advanced analytics show up and we are instantly talking about products. The latter is typically not a good idea.

So let’s try to step back and go look at what big data means from a use case perspective and how we then map this use case into a usable, high-level infrastructure picture. As we walk through this all you will – hopefully – start to see a pattern and start to understand how words like real time and analytics fit…

The Use Case in Business Terms

Rather than inventing something from scratch I've looked at the keynote use case describing Smart Mall (you can see a nice animation and explanation of smart mall in this video).

The idea behind this is often referred to as "multi-channel customer interaction", meaning as much as "how can I interact with customers that are in my brick and mortar store via their phone". Rather than having each customer pop out their smart phone to go browse prices on the internet, I would like to drive their behavior pro-actively.

The goals of smart mall are straightforward of course:

  • Increase store traffic within the mall
  • Increase revenue per visit and per transaction
  • Reduce the non-buy percentage

What do I need?

In terms of technologies you would be looking at:

  • Smart Devices with location information tied to an individual
  • Data collection / decision points for real-time interactions and analytics
  • Storage and Processing facilities for batch oriented analytics

In terms of data sets you would want to have at least:

  • Customer profiles tied to an individual linked to their identifying device (phone, loyalty card etc.)
  • A very fine grained customer segmentation
  • Tied to detailed buying behavior
  • Tied to elements like coupon usage, preferred products and other product recommendation like data sets

High-Level Components

A picture speaks a thousand words, so the picture below shows both the real-time decision making infrastructure and the batch data processing and model generation (analytics) infrastructure.

The first – and arguably most important step and the most important piece of data – is the identification of a customer. Step 1 is in this case the fact that a user with a cell phone walks into a mall. By doing so we trigger the lookups in step 2a and 2b in a user profile database. We will discuss this a little more later, but in general this is a database leveraging an indexed structure to do fast and efficient lookups. Once we have found the actual customer, we feed the profile of this customer into our real time expert engine – step 3. The models in the expert system (customer built or COTS software) evaluate the offers and the profile and determine what action to take (send a coupon for something). All of this happens in real time… keeping in mind that websites do this in milliseconds and our smart mall would probably be ok doing it in a second or so.

To build accurate models – and this is where a lot of the typical big data buzz words come around – we add a batch oriented massive processing farm into the picture. The lower half in the picture above shows how we leverage a set of components to create a model of buying behavior. Traditionally we would leverage the database (DW) for this. We still do, but we now leverage an infrastructure before that to go after much more data and to continuously re-evaluate all that data with new additions.

A word on the sources. One key element is POS data (in the relational database) which I want to link to customer information (either from my web store or from cell phones or from loyalty cards). The NoSQL DB – Customer Profiles in the picture shows the web store element. It is very important to make sure this multi-channel data is integrated (and de-duplicated, but that is a different topic) with my web browsing, purchasing, searching and social media data.

Once that is done, I can puzzle together the behavior of an individual. In essence big data allows micro segmentation at the person level. In effect for every one of my millions of customers!

The final goal of all of this is to build a highly accurate model to place within the real time decision engine. The goal of that model is directly linked to our business goals mentioned earlier. In other words, how can I send you a coupon while you are in the mall that gets you to the store and gets you to spend money…

Detailed Data Flows and Product Ideas

Now, how do I implement this with real products and how does my data flow within this ecosystem? That is something shown in the following sections…

Step 1 – Collect Data

To look up data, collect it and make decisions on it you will need to implement a system that is distributed. As these devices essentially keep on sending data, you need to be able to load the data (collect or acquire) without much delay. That is done like below in the collection points. That is also the place to evaluate for real time decisions. We will come back to the Collection points later…

The data from the collection points flows into the Hadoop cluster – in our case of course a big data appliance. You would also feed other data into this. The social feeds shown above would come from a data aggregator (typically a company) that sorts out relevant hash tags for example. Then you use Flume or Scribe to load the data into the Hadoop cluster.

The next step is to add data and start collating, interpreting and understanding the data in relation to each other.

For instance, add user profiles to the social feeds and the location data to build up a comprehensive understanding of an individual user and the patterns associated with this user. Typically this is done using MapReduce on Hadoop. The NoSQL user profiles are batch loaded from NoSQL DB via a Hadoop Input Format and thus added to the MapReduce data sets.

To combine it all with Point of Sales (POS) data, with our Siebel CRM data and all sorts of other transactional data you would use Oracle Loader for Hadoop to efficiently move reduced data into Oracle. Now you have a comprehensive view of the data that your users can go after. Either via Exalytics or BI tools or, and this is the interesting piece for this post – via things like data mining.

That latter phase – here called analyze will create data mining models and statistical models that are going to be used to produce the right coupons. These models are the real crown jewels as they allow an organization to make decisions in real time based on very accurate models. The models are going into the Collection and Decision points to now act on real time data.

In the picture above you see the gray model being utilized in the Expert Engine. That model describes / predicts behavior of an individual customer and based on that prediction we determine what action to undertake.

The above is an end-to-end look at Big Data and real time decisions. Big Data allows us to leverage tremendous data and processing resources to come to accurate models. It also allows us to find out all sorts of things that we were not expecting, creating more accurate models, but also creating new ideas, new business etc.

Once the Big Data Appliance is available you can implement the entire solution as shown here on Oracle technology… now you just need to find a few people who understand the programming models and create those crown jewels.

Friday Nov 11, 2011

My Take on Hadoop World 2011

I’m sure some of you have read pieces about Hadoop World and I did see some headlines which were somewhat, shall we say, interesting?

I thought the keynote by Larry Feinsmith of JP Morgan Chase & Co was one of the highlights of the conference for me. The reason was very simple, he addressed some real use cases outside of internet and ad platforms.

The following are my notes; since the keynote was recorded I presume you can go and look at it at some point…

On the use cases that were mentioned:

  1. ETL – how can I do complex data transformation at scale
    1. Doing Basel III liquidity analysis
    2. Private banking – transaction filtering to feed [relational] data marts
  2. Common Data Platform – a place to keep data that is (or will be) valuable some day, to someone, somewhere
    1. 360 Degree view of customers – become pro-active and look at events across lines of business. For example make sure the mortgage folks know about direct deposits being stopped into an account and ensure the bank is pro-active to service the customer
    2. Treasury and Security – Global Payment Hub [I think this is really consolidation of data to cross reference activity across business and geographies]
  3. Data Mining
    1. Bypass data engineering [I interpret this as running against the full data set rather than on samples]
    2. Fraud prevention – work on event triggers, say a number of failed log-ins to the website. When they occur grab web logs, firewall logs and rules and start to figure out who is trying to log in. Is this me, who forgot his password, or is it someone in some other country trying to guess passwords?
    3. Trade quality analysis – do a batch analysis of all trades done and run them through an analysis or comparison pipeline

One of the key requests – if you can say it like that – was for vendors and entrepreneurs to make sure that new tools work with existing tools. JPMC has a large footprint of BI Tools and Big Data reporting and tools should work with those tools, rather than be separate.

Security and Entitlement – how to protect data within a large cluster from unwanted snooping was another topic that came up.

I thought his Elephant ears graph was interesting (couldn’t actually read the points on it, but the concept certainly made some sense) and it was interesting – when asked to show hands – how the audience did not (!) think that RDBMS and Hadoop technology would overlap completely within a few years.

Another interesting session was the session from Disney discussing how Disney is building a DaaS (Data as a Service) platform and how Hadoop processing capabilities are mixed with Database technologies. I thought this one of the best sessions I have seen in a long time. It discussed real use case, where problems existed, how they were solved and how Disney planned some of it.

The planning focused on three things/phases:

  1. Determine the Strategy – Design a platform and evangelize this within the organization
  2. Focus on the people – Hire key people, grow and train the staff (and do not overload what you have with new things on top of their day-to-day job), leverage a partner with experience
  3. Work on Execution of the strategy – Implement the platform Hadoop next to the other technologies and work toward the DaaS platform

This kind of fit with some of the LinkedIn comments, best summarized in "Think Platform – Think Hadoop". In other words [my interpretation], step back and engineer a platform (like DaaS in the Disney example), then layer the rest of the solutions on top of this platform.

One general observation, I got the impression that we have knowledge gaps left and right. On the one hand are people looking for more information and details on the Hadoop tools and languages. On the other I got the impression that the capabilities of today’s relational databases are underestimated. Mostly in terms of data volumes and parallel processing capabilities or things like commodity hardware scale-out models.

All in all I liked this conference, it was great to chat with a wide range of people on Oracle big data, on big data, on use cases and all sorts of other stuff. Just hope they get a set of bigger rooms next time… and yes, I hope I’m going to be back next year!


The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

