Friday Jan 16, 2015

Deploying SAS High Performance Analytics on Big Data Appliance

Oracle and SAS have an ongoing commitment to our joint customers to deliver value-added technology integrations through engineered systems such as Exadata, Big Data Appliance, SuperCluster,  Exalogic and ZFS Storage Appliance.  Dedicated resources manage and execute on joint SAS/Oracle Database, Fusion Middleware, and Oracle Solaris integration projects; providing customer support, including sizing and IT infrastructure optimization and consolidation.  Oracle support teams are onsite at SAS Headquarters in Cary, NC (USA); and in the field on a global basis.

The latest in this effort is to enable our joint customers to deploy SAS High Performance Analytics on Big Data Appliance. This effort enables SAS users to leverage the lower cost infrastructure Hadoop offers in a production ready deployment on Oracle Big Data Appliance. Here from Paul Kent (VP Big Data, SAS) on some of the details.

Read more on deploying SAS High Performance Analytics on Don't miss the deployment guide and best practices here.

Monday Jan 06, 2014

CaixaBank deploys new big data infrastructure on Oracle

CaixaBank is Spain’s largest domestic bank by market share with a customer base of 13.7 It is also Spain’s leading bank in terms of innovation and technology, and one of the most prominent innovators worldwide. CaixaBank has been recently awarded the title of the World’s Most Innovative Bank at the 2013 Global Banking Innovation Awards (November 2013).

Like most financial services companies CaixaBank wants to get closer to its customers by collecting data about their activities across all the different channels (offices, internet, phone banking, ATMs, etc.). In the old days we used to call this CRM and then this morphed into "360-degree view" etc etc. While many companies have delivered these types of projects and customers feel much more connected and in control of their relationship with their bank the capture of streams of big data has the potential to create another revolution in the way we interact with our bank. What banks like CaixaBank want to do is to capture data in one part of the business and make it available to all the other lines of business as quickly as possible.

Big data is allowing businesses like  CaixaBank to significantly enhance the business value of their existing customer data by integrating it with all sorts of other internal and external data sets. This is probably the most exciting part of big data because the potential business benefits are really only constrained by imagination of the team working on these type of projects. However, that in itself does create problems in terms of securing funding and ownership of projects because the benefits can be difficult to estimate which is where all the industry use cases, conference papers and blog posts can help in terms of providing insight into what is going on in across the market in broad general terms.

To help them implement a strategic Big Data project, CaixaBank has selected Oracle for the deployment of its new Big Data infrastructure. This project, which includes an array of Oracle solutions, positions CaixaBank at the forefront of innovation in the banking industry. The new infrastructure will allow CaixaBank to maximize the business value from any kind of data and embark on new business innovation projects based on valuable information gathered from large data sets. Projects currently under review include:

  • Development of predictive models to improve customer value
  • Identifying cross-selling and up-selling  opportunities
  • Development of personalized offers to customers
  • Reinforcement of risk management and brand protection services
  • More powerful fraud analysis 
  • General streamlining of current processes to reduce time-to-market
  • Support new regulatory requirements

The Oracle solution (including Oracle Engineered Systems, Oracle Software and Oracle Consulting Services) consists in the implementation of a new Information Management Architecture that provides a unified corporate data model and new advanced analytic capabilities (for more information about how Oracle's Reference Architecture can help you integrate structured, semi-structured and unstructured information into a single logical information resource that can be exploited for commercial gain click here to view our whitepaper)

The importance of the project is best explained by Juan Maria Nin, CEO of CaixaBank:

“Business innovation is the key for success in today’s highly competitive banking environment. The implementation of this Big Data solution will help CaixaBank remain at the forefront of innovation in the financial sector, delivering the best and most competitive services to our customers”.

The Oracle press release is here:

Thursday Jul 18, 2013

Practical HDFS Permissions


Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories.  However, there is less readily-available information about what the effects of OS-level permissions are on accessing data in HDFS via higher-level abstractions such as Hive or Pig.  To provide a bit of clarity, I decided to run through the effects of permissions on different interactions with HDFS.

The Setup

In this scenario, we have three users: oracle, dan, and not_dan.  The oracle user has captured some data in an HDFS directory.  The directory has 750 permissions: read/write/execute for oracle, read/execute for dan, and no access for not_dan.  One of the files in the directory has 700 permissions, meaning that only the oracle user can read it.  Each user will tries to do the following tasks:

  • List the contents of the directory
  • Count the lines in a subset of files including the file with 700 permissions
  • Run a simple Hive query over the directory

Listing Files

Each user issues the command

hadoop fs -ls /user/shared/moving_average|more

And what do they see:

[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more

Found 564 items

Obviously, the oracle user can see all the files in its own directory.

[dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items

Similarly, since dan has group read access, that user can also list all the files. The user without group read permissions, however, receives an error.

[not_dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more

ls: Permission denied: user=not_dan, access=READ_EXECUTE,


Counting Rows in the Shell

In this test, each user pipes a set of HDFS files into a unix command and counts rows.  Recall, one of the files has 700 permissions.

The oracle user, again, can see all the available data:

[oracle@localhost ~]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l

The user with partial permissions receives an error on the console, but can access the data they have permissions on.  Naturally, the user without permissions only receives the error.

[dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=dan, access=READ, inode="/user/shared/moving_average/FlumeData.1374082184056":oracle:shared_hdfs:-rw-------
[not_dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=not_dan, access=READ_EXECUTE, inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---

Permissions on Hive

In this final test, the oracle user defines an external Hive table over the shared directory.  Each user issues a simple COUNT(*) query against the directory.  Interestingly, the results are not the same as piping the datastream to the shell.

The oracle user's query runs correctly, while both dan and not_dan's queries fail:

As dan

Job Submission failed with exception ' /user/shared/moving_average/FlumeData.1374082184056 does not exist)'

As not_dan

Job Submission failed with exception '
(Permission denied: user=not_dan, access=READ_EXECUTE,

So, what's going on here? In each case, the query fails, but for different reasons. In the case of not_dan, the query fails because the user has no permissions on the directory. However, the query issued by dan fails because of a FileNotFound exception. Because dan does not have read permissions on the file, Hive cannot find all the files necessary to build the underlying MapReduce job. Thus, the query fails before being submitted to the JobTracker.  The rule then, becomes simple: to issue a Hive query, a user must have read permissions on all files read by the query. If a user has permissions on one set of partition directories,  but not another, they can issue queries against the readable partitions, but not against the entire table.


In a nutshell, the OS-level permissions of HDFS behave just as we would expect in the shell. However, problems can arise when tools like Hive or Pig try to construct MapReduce jobs. As a best practice, permissions structures should be tested against the tools which will access the data. This ensures that users can read
what they are allowed to, in the manner that they need to. 

[Read More]

Tuesday Mar 05, 2013

Hadoop Cluster: Build vs. Buy (part II)

About a year ago we did a comparison (with an update here) of Build your Own Hadoop cluster and a Big Data Appliance, where we focused purely on the hardware and software cost. We thought it could use an update, but luckily an analyst firm did one for us and this time it covers both the Hardware/Software costs, but also ventures a lot more into dealing with other costs.

Read all about it in ESG's Getting Real About Big Data, Build vs Buy (Feb 2013) here.

Some highlights from the report:

  • Oracle Big Data Appliance($450k) is 39% less costly than "build your own"($733k) 
  • OBDA reduces time-to-market by 33% vs "build"

But, the report is not just about those numbers, it covers a number of very interesting things like 3 Hadoop Myths, the importance of big data in the near future and the priority customers give to improving their analytics footprint.

Enjoy the read!

Wednesday Jan 30, 2013

Parallel R: Quick Ways Model More


I am less and less often mistaken for a pirate when I mention the R language.  While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years.  When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have.  That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.

  I want to explore a slightly less-often considered aspect of R development: parallelism.  Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server.  However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.

 R is Single-Threaded

The R interpreter is -- and likely always will be -- single-threaded.  This means loading data frames is done in a single thread.  So is building your linear model, or generating that pretty surface plot.  Even on my laptop, that's a lot of threads to not use for modeling.  No matter how much my web browser might covet those cycles, I'd like to use them for work.

Rather than a complex multithreaded re-implementation, the R interpreter offers a number of ways to allow users to selectively apply parallelism.  Some of these approaches leverage MPI libraries and mirror that message passing approach.  Others allow a more implicit parallelism via "foreach" or "apply" constructs. We'll just focus on a pair of strategies using the parallelism that's been included in R since it's 2.14.1 version: the parallel library.

 Setting The Stage for Parallel Execution

We're going to need to load a few libraries into our R session before we can execute anything outside of our single-thread.  We'll use the doParallel and foreach because they allow us to focus on what to parallelize rather than how to coordinate our threads.

> data(iris)

Knowing that calculations in R will be single-threaded, we want to use the parallel package to operate on logical subsets of the data simultaneously.  For example, I loaded a set of data about Iris which contains a number of different species.  One way I might want to parallelize is to fit the same each species simultaneously.  For that, I'm going to have to split the data by species:

> species.split <- split(iris, iris$Species)

 This gives us a list we can iterate over -- or parallelize.  From here on out, it's simply a question of deciding what resources we want to leverage: local CPUs or remote hosts.


We're going to use the makeCluster function to bind together a set of computational resources.  But first we need to decide: do we want to use only local CPUs, or is it necessary to open up socket connections to other machines distribute our workload?  In the former case we'll use makeCluster to create what's called a FORK cluster (in that it uses UNIX's fork call to create slaves).  In the latter, we'll create a SOCK cluster by opening up sockets to a list of remote hosts and starting slave processes on them.

Here's a FORK cluster which uses all my cores:

> cl <- makeCluster(detectCores())

And here's a SOCK cluster across three nodes (password-less SSH is required)

> hostlist <- c("", "", "")
cl <- makeCluster(hostlist)

In each case, I call registerDoParallel to bind this cluster to the %dopar% operator.  This is the operator which will let us easily iterate in parallel.

Running in Parallel

Once we've got something to iterate over and a cluster with which to do it, modeling in parallel becomes straightforward.  Suppose I want to fit a model of sepal length as a linear combination of petal characteristics.  In that case, the code is simply:

> species.models <- foreach(i=species.split) %dopar% {
m<-lm(i$Sepal.Length ~ i$Petal.Width*i$Petal.Length);

But I'm not just restricted to fitting linear models on my little cluster.  I can run k-means clustering for several different k simultaneously using basically the same block:

> species.clusters<- foreach(i=2:5) %dopar% {
km <- kmeans(iris, i);

When I'm done with my block, I can just call stopCluster(cl) to ensure my processes terminate and I'm not hogging resources.

Using Hadoop

Finally, there will be situations in which I need to deploy in parallel against much larger datasets -- specifically, datasets stored in HDFS.  Both Hive and Pig will let me run an R script as part of a streaming process.  In Hive, the TRANSFORM operator will send data to an R Script.  In Pig, you can use theSTREAM operator to send a whole bag to an R script.  However, you can't stream from within Pig'sFOREACH blocks, so I occasionally use a UDF which invokes R scripts for me.

Regardless of the method you choose to send HDFS data to an R process, it's important to make sure your R script can consume data streaming from standard input.  I find the most expedient way of doing this via the file function.  A typical script might start:

#! /usr/bin/env Rscript
#Connection to STDIN for reading a data frame
con <- file(description="stdin") <- read.table(con, header=FALSE, sep=",")


We've covered several ways to push R beyond the the bounds of its single-threaded core.  There are forking and socket mechanisms for spreading our work around, not to mention tricks for leveraging the power of Hadoop Streaming.  In each case, however, one thing stands out: we must be smart as modelers and understand what can and should be done in parallel.

[Read More]

Monday Jan 28, 2013

First Oracle BIWA Data Scientists Certified

For those who attended the BIWA Summit a few weeks ago, you would have seen the data scientist certification. BIWA just listed the first batch of data scientists it certified:

Instructor Level Certificate  - Brendan Tierney

Oracle Data Scientist Certificate 
Don Ferguson, CherryRoad Technologies
Jorge Anicama, IBM (GBS)
Tim Vlamis, Vlamis Software Solutions
Vijayalakshmi Muthukrishnan, Motorola
Sicheng Liu, Deloitte Consulting
Avik Bhattacharya, Printpack Inc.
Ari Kaplan, Ariball
Paul Mitchell, Oracle

Associate Level 
Suresh Anand, Sashatech LLC

Participation Certificate 
Ahmed Kopap
Ekine Akuiyibo
Khader Mohiuddin

More on the program, see here:

Friday Jan 25, 2013

Announcing: OTN Big Data Developer Day

Announcing the first Big Data Developer Day. A full day with two tracks and hands-on on all things Big Data at Oracle!!

An influx of new data types combined with new approaches for analyzing data are creating untapped growth opportunities that have the potential to transform your business. Oracle is the first vendor to provide a complete and integrated set of enterprise-ready products to address the full spectrum of big data business requirements. Jumpstart your understanding of big data in the enterprise by attending this complementary one-day hands-on workshop. You will learn from technical experts how to:

  • Write MapReduce on Oracle’s Big Data Platform
  • Manage a Big Data environment
  • Access Oracle NoSQL Database
  • Manage Oracle NoSQL DB Cluster
  • Use data from a Hadoop Cluster with Oracle
  • Develop analytics on big data

Register today to learn these skills which you can immediately put to use within your organization.

For more information and to sign up (space is limited!) click the link here.

So if you are in the bay area, do come and learn the coolest new technologies.

Monday Jan 21, 2013

A short Big Data and Engineered Systems Customer Video

Apart from the 2012 Government Big Data Award hear from Thomson Reuters in this video:

Friday Jan 18, 2013

Big Data Appliance X3-2 Updates

Untitled Document

Hello world. Waaw, time went by too fast. Happy new year, and here is the long past due update on the new Big Data Appliance and the software updates.

Big Data Appliance X3-2

Both the software as well as the hardware of the Big Data Appliance got a refresher.

Hardware Update

A good place to start is to quickly review the hardware differences (no price changes!). On a per node basis the following is a comparison between old and new (X3-2) hardware:

Big Data Appliance v1

Big Data Appliance X3-2


2 x 6-Core Intel® Xeon® 5675 (3.06 GHz)
2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz)
64GB expandable to 512GB

12 x 3TB High Capacity SAS

12 x 3TB High Capacity SAS
1 KVM Switch
N/A (removed)

For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X3-2. For those wondering what we did with the 2RU we now have left from the KVM, that is open space, at the top of the rack.

The higher core count gives a BDA X3-2 more parallel compute power while saving some 30% in energy and heat.

Software Update

As we did with Hardware, a good place to start is a quick overview of the software changes in below table:

Big Data Appliance v1.1.x Software Stack Big Data Appliance V2.0.1 Software Stack
Oracle Linux 5.6
Oracle Linux 5.8 with UEK
Cloudera CDH
CDH 3u4
CDH 4.1.x
Cloudera Manager
CM 3
CM 4.1
Oracle Enterprise Manager
Big Data Appliance Plug-In for Enterprise Manager
Open Source R
Oracle R Distribution 2.x
Big Data Connectors *
Big Data Connectors 1.1.x
Big Data Connectors 2.0.x
Oracle NoSQL Database CE **
NoSQL DB 1.x
NoSQL DB 2.x

* Oracle Big Data Connectors is a separately licensed product which can be pre-installed and pre-configured on BDA
** Oracle NoSQL DB 2.x will be pre-installed in a future update to Mammoth but can be applied manually today

Apart from the versions updates, bug fixes and a great number of performance improvements across the entire system, the biggest updates are the inclusion of CDH 4.1.2 and the default set up of highly available name nodes for Hadoop, the Enterprise Manager management of the BDA, the uptake of the Oracle R Distribution and the updates to Oracle NoSQL Database. In a nutshell these updates deliver the following improvements:

Cloudera CDH 4.1.x

The latest version of CDH and CM deliver:

  • Higher overall performance
  • Highly available name nodes with the BDA using failover quorum processes instead of an external HA filer solution
  • Vastly expanded management capabilities via CM 4

On top of this, BDA now has both Zookeeper and Oozie configured out of the box.

Oracle Enterprise Manager

The new Big Data Appliance Plug-In for Enterprise Manager delivers the first end-to-end management of the Hadoop cluster from hardware metrics to software and Hadoop metrics. To achieve the end-to-end management of the system Enterprise Manager delivers all the system metrics users are used to from the Exadata Plug-In for Enterprise Manager. Enterprise Manager enables a seamless transition between the Hardware and high level software monitoring and the expanded Hadoop monitoring and diagnostics from Cloudera Manager. This combination of functionality makes operations for a BDA simpler and allows operations staff to seamlessly switch between their Exadata, Big Data Appliance and other Oracle Engineered systems.

Oracle R Distribution

The big difference between Oracle R Distribution and the Opensource R distribution is that Oracle R Distribution is enabled to dynamically load the math kernel libraries on the CPUs from both Intel and AMD. This increases performance of basic calculations, which in turn increases the performance of the overall R calculations because more math is off-loaded into the CPUs.

Oracle NoSQL Database 2.x

A great number of great new features are added into NoSQL DB 2.x. Most of these are in both the Community Edition as well in the Enterprise Edition. Charles Lamb has a nice concise post describing what is new here.

Big Data Connectors

To close out, Big Data Connectors got a refresher focused on performance, so download the new products here and give them a go via this download page. More information on news, read the data sheet here.

Wednesday Sep 05, 2012

E-Book on big data (featuring Analysts, Customers and more)

As we are gearing up for Openworld, here is a nice E-book on big data to start paging through. It contains Gartner's take on big data, customer and partner interviews and a lot more good info. Enjoy the read so you come prepared for Openworld!!

Read the E-Book here.

For those coming to Oracle Openworld (or the Americas Cup races around the same time), you can find big data sessions via this URL.



The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.


« March 2015