Tuesday May 22, 2012

Following up on a comment

Since I was on vacation I missed an interesting question on one of the posts:

I see the interesting pyramid in the middle of your article showing the current status of big data market in the many eyes of the beholders and have made many thoughts on that. In your opinion, do you think "Big Data is our business" is only applicable for those Internet-scale companies, independent software vendors OR it will happen soon for most large-scale corporations in the upcoming future? I will think that "Big Data improves our business" shall be applicable to those large-scale corporate and up to that line only in the mainstream. Would like to hear more comments from your side especially on the use of Big Data for Corporate IT or in the Corporate level.



Posted by Ivan Ng on April 25, 2012 at 04:23 AM PDT #


I think you can argue about this quite a bit, but I do believe that top tier is something we should be able to achieve, even if we are not an IT, Web or Data company. But you make a good point. At its current state, the wording is a bit off and would lead people to believe what you said... Does a manufacturing company really become a data company?

Maybe it should read "Big Data Transforms Our Business" ... which is what I am really after. Improvements are good, creativity should make us leverage data and change the outcomes of business processes. Then I think it is achievable for everyone...

Then a manufacturing company can transform the business, and a telecom company can all of a sudden personalize every interaction with every customer.

Thanks for the great comment. It certainly helped me formulate some of this more crisply, and I hope it makes sense.

Monday May 14, 2012

Price comparison for Big Data Appliance

We are seeing a lot of discussions about pricing and comparable pricing between a "standard Hadoop cluster" and the Oracle Big Data Appliance. This post is aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not included in the pricing and packaging of Oracle Big Data Appliance.

Read the updated post here.

Wednesday May 09, 2012

Big Data: BDA Networking for the Data Center

Quick note, just published a nice comprehensive paper on how to network a Big Data Appliance into your data center. You can find the paper on the main BDA OTN page. Direct link here.

Happy networking...

Tuesday Apr 17, 2012

Read how the Avengers (tm) use Oracle Big Data Appliance

Decisive Power with Oracle Big Data Appliance

Raw information—like raw power—must be channeled and transformed to be useful. Every day, S.H.I.E.L.D. captures terabytes of information from a variety of sources—surveillance videos, satellites, sensors, field reports, network traffic—and all of this high-volume, high-velocity, and high-variety data is processed, filtered, transformed, and sorted in Oracle Big Data Appliance.

Read all about the new Marvel "The Avengers" movie, the systems in their data center and their use of big data here.

Tuesday Apr 10, 2012

Self-Study course on ODI Application Adapter for Hadoop

For those of you who want to work with Oracle Data Integrator and its Hadoop capabilities, a good way to start is the newly released self-study course from Oracle University. You can find the course here.

Enjoy, and if you have any feedback, do send this into Oracle University by logging in (so we can unleash our big data analytics on it ;-) ).

Monday Apr 09, 2012

Big Data - Must Bring Creativity

As everyone reads more about big data, I get more and more questions on use cases. “How should we use big data?” is the most common question. “Are there applications available for my vertical?” and “what do others in my vertical do with big data?” are a close second on the list. There are various reports out there by many authors which describe possible or real use cases. A search on the beacon of truth will no doubt get you links to most, but are they really relevant?

Then, while I was reading Fortune Magazine (I do read the paper copy – guess I am old fashioned), the graph of disconnected points in my brain finally connected. It was the edition with the Facebook cover article (volume 165, number 4). Interestingly, however, the Facebook article was not the one that connected my dots and persuaded me to write this post.

The article that connected the dots was about the re-invention of JCPenney. More precisely about Ron Johnson, JCPenney’s new CEO and his thoughts on how to re-invent a department store business. For those who do not know who Ron Johnson is, it is the person who – together with Steve Jobs – created the Apple Store for you. That would be the physical, brick-and-mortar one, the hugely successful one (more $$$ per square foot than any other store)…

His quote in the article (following quote is of course copyright Fortune Magazine!) says it all:

“Improvement merely lets you hit the numbers. Creativity is what transforms.” Ron Johnson, CEO, JCPenney

Again, I think that quote says it all. It tells you that you will not find the transformational use case for your business in all those articles and papers on the internet. It says that you will have to dig deeper into the actual business you are running, and try to find that one thing, that one idea that will transform your business. I’m going to guess that this idea – if you are reading this blog post – is somehow related to having more data, or doing much more analytics…

With that in mind look at the following picture. It gives you an idea of the implementations in production of big data solutions (Gartner’s Mark Beyer often uses the 5-20-75 scale to explain where technology is):

What this really means is that big data is not yet in the mainstream of technology deployments. It also means that you will not be able to buy an entire solution off the shelf; instead, you will have to be the first one to implement that crazy successful idea. That is a good thing! It means that there is so much competitive advantage to be had by investing in big data now…

I then caught a glimpse of Oracle Profit (Volume 17, Number 1), which quotes Mark Hurd, Oracle’s President:

“Technology presents the opportunity to Transform Business”

Creativity and technology, now that is something that will really transform a business. And that is why I am writing this post! We – Oracle – can give you the technology platform, we can even give you analytical building blocks (LEGO for your analytics), but you, your business people, are in charge of that big idea.

How do you start with Big Data projects?

I hope the above at least tickled your brain, because the following will make you want to believe! But before we change every business, let’s get grounded in what a real start to a big data project is. For many of us, our first big data project is going to be something that needs to prove “it” can be done.

Step 1) Gather the required technology and people to get a real project done. You will need at least (access to) the following:

  • Advanced analytics (statistics, data mining, graph analytics, semantics, spatial etc.)
  • A technology foundation that can handle a lot of data and allows you to analyze that data (it may not be fully at scale). Make sure you understand what technologies can do what for you, and leverage the internet of ideas to jump start your knowledge on these technologies
  • A data set you know, understand, that at least supports the basic idea you are pursuing in your improvement. For example, if you are going to work on social graph driven churn analysis, you will need both your customer data and relevant social data
  • A small number of people to work with and drive the above – you might not have them all ready to go, so find someone who is willing to learn and excited about technology…

Step 2) Find a problem you have today (risk, fraud, CRM etc.) that actually costs you money, and improve on that problem using the stuff you gathered in step 1. This is your playground, this is your chance to learn (!) and in the process improve something in your organization. It will also prove that this analytics and big data stuff really works and drives business value. Your known problem is something that gives you a quantifiable ROI: “I’ve shortened the time to do X, adding $Y to our coffers by spending only $Z (where Z < Y)”… Oh, and make sure you do this in the shortest possible time frame or you are going to be too late to the party.

For getting things done in #2 (and #4 below), you should actually read the Fortune article mentioned above on how Facebook works, and try to understand the mentality and method applied to new products and features. Build, show and improve or rethink but never take “no” for an answer if you believe in something.

Step 3) Build a production environment and do it all on real hardware and software. Potentially re-evaluate some of the technology you had used before (this stuff better work in production). And then do it all again in step 4 when it really matters.

Step 4) Go catch the big fish! Go after the big idea; what is the new business this opens? That fantastic new component in your services. Rally the troops and go do it! Be creative, use the technology and forget the boundaries. Do make sure you find a corporate sponsor to see this through.

Once you arrive in Step 4, think big, build the first prototype and never, ever hedge your bets. Go for the big idea and focus. Don’t try to do many other things, and don’t try to spread the risk of failure by not really doing anything new.

My Conclusion

IMHO, the way to real innovation by leveraging a big data solution is to first follow the money driven by improvement. Use that pilot to learn the technologies (analytics is key!) while solving a real tangible problem, then go all in and do the cool stuff.

PS. This is my motivational speech. I genuinely believe that most of the technology hurdles are gone, we just need to harness the creativity that naturally lives in a business, apply the technology and stick to the big ideas to transform business.

Tuesday Feb 28, 2012

Interesting IDC report on Oracle's Big Data Solutions

I thought I'd share this link: a very interesting report written by IDC on Oracle's Big Data strategy.

Happy Reading!

Thursday Feb 23, 2012

Announcing the Big Data OTN Forums

The new OTN forums for big data are now live.

Follow this link for the Forum home.

We have two forums, one covering big data as a topic, which includes Oracle Big Data Appliance and the Hadoop ecosystem, a second covering all the components in Oracle Big Data Connectors. The forums are monitored by Oracle and by folks in development, so we hope to provide you with excellent value on all your questions and ideas.

Monday Feb 20, 2012

We are hiring!

Oracle’s Big Data development team is hiring a Product Manager

About Us and the Product

Big Data is here today. Leading-edge enterprises are implementing Big Data projects. Oracle has extended its product line to embrace the new technologies and the new opportunities in Big Data. The Big Data development team has recently introduced the Big Data Appliance, an engineered system that combines hardware and software optimized for Hadoop and Oracle NoSQL Database (More information on Oracle’s Big Data offerings can be found at oracle.com/bigdata). The Big Data development team is part of the data warehouse organization in Oracle’s Database Server Technologies division, a vibrant engineering organization with deep experience in scalable, parallel data processing, complex query optimization, and advanced analytics.

About the Role

This position is located at our headquarters location in Redwood Shores, CA (San Francisco Bay Area). We are seeking a product manager for the Big Data Appliance. As product manager, you would leverage your strong technical background in order to help define the roadmap for the Oracle Big Data Appliance, and become one of the faces of Big Data at Oracle. You will be actively writing collateral, delivering presentations, and visiting customers to ensure the success of Oracle Big Data Appliance and other products in Oracle’s Big Data portfolio, while also working internally within the development organization to ensure that the Big Data Appliance meets all of the current and future requirements of our customers. 

Learn more and apply: http://www.linkedin.com/jobs?viewJob=&jobId=2555091

Thursday Feb 09, 2012

Sizing for data volume or performance or both?

Before you start reading this post please understand the following:

  • General Hadoop capacity planning guidelines and principles are here. I’m not doubting, replicating or replacing them.
  • I’m going to use our (UPDATED as of Dec 2012) Big Data Appliance as the baseline set of components to discuss capacity. That is because my brain does not hold theoretical hardware configurations.
  • This is (a bit of) a theoretical post to make the point that you worry about both (I just gave away the conclusion!), but I’m not writing the definitive guide to sizing your cluster. I just want to get everyone to think of Hadoop as processing AND storage. Not either or!

Now that you are warned, read on…

Imagine you want to store ~50TB (sounded like a nice round number) of data on a Hadoop cluster without worrying about any data growth for now (I told you this was theoretical). I’ll leave the default replication for Hadoop at 3 and simple math now dictates that I need 3 * 50TB = 150TB of Hadoop storage.

I also need space to do my MapReduce work (and this is the same for Pig and Hive, which generate MapReduce), so the system will be even bigger to ensure Hadoop can write temporary files, shuffle data etc.
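To make the storage arithmetic concrete, here is a minimal sketch. The replication factor of 3 and the 50TB of user data come from the text above; the `temp_fraction` parameter (25% headroom for temporary and shuffle files) is purely an assumption for illustration, not a sizing recommendation.

```python
def raw_storage_tb(user_data_tb, replication=3, temp_fraction=0.25):
    """Estimate the raw Hadoop storage needed for a given amount of user data.

    temp_fraction is a hypothetical allowance for MapReduce temporary
    and shuffle files, expressed as a fraction of the replicated data.
    """
    replicated = user_data_tb * replication
    return replicated * (1 + temp_fraction)


# 50TB of user data at the default replication of 3, no headroom
print(raw_storage_tb(50, temp_fraction=0.0))  # 150.0
# Same data with the assumed 25% headroom for temporary files
print(raw_storage_tb(50))  # 187.5
```

The real headroom depends heavily on the jobs you run, so treat the second number as a placeholder rather than a target.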

Now, that is fabulous, so I need to talk to my lovely Oracle Rep and buy a Big Data Appliance which would easily hold the above mentioned 50TB with triple replication. It actually holds 150TB (to make the math easy for me) or so of user data, and you will instantly say that the BDA is way too big!

Ah, but how fast do you want to process data? A Big Data Appliance has 18 nodes, each node has 12 cores to do the work for you. MapReduce is using processes called mappers and reducers (really!) to do the actual work.

Let’s assume that we are allowing Hadoop to spin up 15 mappers per node and 10 reducers per node. Let’s further assume we are going full bore and have every slot allocated to the current and only job’s mappers and reducers (they do not run together I know, theoretical exercise – remember?).

Because you decided the Big Data Appliance was way too big, you have bought 8 equivalent nodes to fit your data. Two of these run your Name Node, your Jobtracker and Secondary Name Node (you should actually have three nodes for all this, but I’m generous and say we run Jobtracker on the Secondary Name Node). That leaves you 6 nodes for the data nodes based on your capacity-based sizing.

That system you just bought based on storage will give us 6 * 15 = 90 mappers and 6 * 10 = 60 reducers working on my workload (the 2 other nodes do not run data nodes and do not run mappers and reducers).

Now let’s assume that I finish my job in N minutes on my lovely 8 node cluster by leveraging the full set of workers, and assume that my business users want to refresh the state of the world every N/2 minutes (it always has to go faster), then the assumption would be to simply get 2 * the number of nodes in my original cluster assuming linear scalability… The assumption is reasonable by the way for a lot of workloads, certainly for the ones in social, search and other data patterns that show little data skew because of their overall data size.

A Big Data Appliance, with 15 of its 18 nodes running as data nodes, gives us 15 * 15 = 225 mappers and 15 * 10 = 150 reducers working on my 50TB of user data… providing a 2.5x speed-up on my data set (225 / 90 = 2.5).
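The slot arithmetic can be sketched in a few lines. The per-node slot counts (15 mappers, 10 reducers) and the data-node counts (6 and 15) are the ones used in the text; the `speedup` function assumes runtime scales linearly with the number of mapper slots, which is the same simplification this post makes.

```python
MAPPERS_PER_NODE = 15   # from the text
REDUCERS_PER_NODE = 10  # from the text


def slots(data_nodes):
    """Return (mapper_slots, reducer_slots) for a number of data nodes."""
    return data_nodes * MAPPERS_PER_NODE, data_nodes * REDUCERS_PER_NODE


def speedup(small_nodes, big_nodes):
    """Idealized speed-up, assuming runtime scales linearly with slots."""
    small_mappers, _ = slots(small_nodes)
    big_mappers, _ = slots(big_nodes)
    return big_mappers / small_mappers


print(slots(6))        # (90, 60)
print(slots(15))       # (225, 150)
print(speedup(6, 15))  # 2.5
```

Real workloads with data skew will not hit the idealized number, so read the result as an upper bound.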

Just another reference point on this, a Terasort of 100GB is run on a 20 node cluster with a total disk capacity of 80TB. Now that is of course a little too much, but you will see the point of not worrying too much about “that big system” and think processing power rather than storage.


You will need to worry about the processing requirements and you will need to understand the characteristics of the machine and the data. You should not size a system, or discard something as too big right away by just thinking about your raw data size. You should really, really consider Hadoop to be a system that scales processing and data storage together and use the benefits of the scale-out to balance data size with runtimes.

PS. Yes, I completely ignored those fabulous compression algorithms… Compression can certainly play a role here but I’ve left it out for now. Mostly because it is extremely hard (at least for my brain) to figure out an average compression rate and because you may decide to only compress older data, and compression costs CPU, but allows faster scan speeds and more of this fun stuff…

Monday Jan 09, 2012

Big Data Appliance and Big Data Connectors are now Generally Available

Today - January 10th, we announced the general availability of Oracle Big Data Appliance and Oracle Big Data Connectors as well as a partnership with Cloudera. Now that should be fun to start the new year in big data land!!

Big Data Appliance

Oracle Big Data Appliance brings Big Data solutions to mainstream enterprises. Built using industry-standard hardware from Sun and Cloudera's Distribution including Apache Hadoop, the Big Data Appliance is designed and optimized for big data workloads. By integrating the key components of a big data platform into a single product, Oracle Big Data Appliance delivers an affordable, scalable and fully supported big data infrastructure without the risks of a custom built solution. The Big Data Appliance integrates tightly with Oracle Exadata and Oracle Database using Oracle Big Data Connectors, and enables analysis of all data in the enterprise – structured and unstructured.

Big Data Connectors

Built from the ground up by Oracle, Oracle Big Data Connectors delivers a high-performance Hadoop to Oracle Database integration solution and enables optimized analysis using Oracle’s distribution of open source R analysis directly on Hadoop data. By providing efficient connectivity, Big Data Connectors enables analysis of all data in the enterprise – both structured and unstructured.

Cloudera CDH and Cloudera Manager

Oracle Big Data Appliance contains Cloudera’s Distribution including Apache Hadoop (CDH) and Cloudera Manager. CDH is the #1 Apache Hadoop-based distribution in commercial and non-commercial environments. CDH consists of 100% open source Apache Hadoop plus the comprehensive set of open source software components needed to use Hadoop. Cloudera Manager is an end-to-end management application for CDH. Cloudera Manager gives a cluster-wide, real-time view of nodes and services running; provides a single, central place to enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization.

More Information

Data sheets, white papers and other interesting information can be found here:

    * Big Data Appliance OTN page
    * Big Data Connectors OTN page

Happy new year and I hope life just got a bit more interesting!!

Thursday Dec 15, 2011

Understanding a Big Data Implementation and its Components

I often get asked about big data, and more often than not we seem to be talking at different levels of abstraction and understanding. Words like real time show up, words like advanced analytics show up and we are instantly talking about products. The latter is typically not a good idea.

So let’s try to step back and go look at what big data means from a use case perspective and how we then map this use case into a usable, high-level infrastructure picture. As we walk through this all you will – hopefully – start to see a pattern and start to understand how words like real time and analytics fit…

The Use Case in Business Terms

Rather than inventing something from scratch, I’ve looked at the keynote use case describing Smart Mall (you can see a nice animation and explanation of smart mall in this video).

The idea behind this is often referred to as “multi-channel customer interaction”, which boils down to “how can I interact with customers that are in my brick and mortar store via their phone”. Rather than having each customer pull out their smartphone to browse prices on the internet, I would like to drive their behavior pro-actively.

The goals of smart mall are straightforward of course:

  • Increase store traffic within the mall
  • Increase revenue per visit and per transaction
  • Reduce the non-buy percentage

What do I need?

In terms of technologies you would be looking at:

  • Smart Devices with location information tied to an individual
  • Data collection / decision points for real-time interactions and analytics
  • Storage and Processing facilities for batch oriented analytics

In terms of data sets you would want to have at least:

  • Customer profiles tied to an individual linked to their identifying device (phone, loyalty card etc.)
  • A very fine grained customer segmentation
  • Tied to detailed buying behavior
  • Tied to elements like coupon usage, preferred products and other product recommendation like data sets

High-Level Components

A picture speaks a thousand words, so the picture below shows both the real-time decision making infrastructure and the batch data processing and model generation (analytics) infrastructure.

The first – and arguably most important step and the most important piece of data – is the identification of a customer. Step 1 is in this case the fact that a user with cell phone walks into a mall. By doing so we trigger the lookups in step 2a and 2b in a user profile database. We will discuss this a little more later, but in general this is a database leveraging an indexed structure to do fast and efficient lookups. Once we have found the actual customer, we feed the profile of this customer into our real time expert engine – step 3. The models in the expert system (customer built or COTS software) evaluate the offers and the profile and determine what action to take (send a coupon for something). All of this happens in real time… keeping in mind that websites do this in milliseconds and our smart mall would probably be ok doing it in a second or so.
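The real-time path described above (identify the customer, look up the profile, let the expert engine pick an action) can be sketched as follows. This is only an illustration: the in-memory dictionary stands in for the indexed profile database, and every name here (`Profile`, `lookup_profile`, `decide_action`, the sample device and offers) is hypothetical rather than part of any Oracle product.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    customer_id: str
    segment: str
    preferred_products: list = field(default_factory=list)


# Stand-in for the indexed user profile database (steps 2a and 2b)
PROFILES_BY_DEVICE = {
    "device-123": Profile("cust-1", "coffee-lover", ["espresso"]),
}


def lookup_profile(device_id):
    """Key-based lookup; returns None for an unknown device."""
    return PROFILES_BY_DEVICE.get(device_id)


def decide_action(profile, offers):
    """Toy 'expert engine' (step 3): pick the first offer for a preferred product."""
    for offer in offers:
        if offer["product"] in profile.preferred_products:
            return {
                "action": "send_coupon",
                "customer": profile.customer_id,
                "offer": offer["name"],
            }
    return {"action": "none", "customer": profile.customer_id}


def on_customer_enters_mall(device_id, offers):
    """Step 1: a device shows up at the mall and triggers the lookup."""
    profile = lookup_profile(device_id)
    if profile is None:
        return {"action": "none", "customer": None}
    return decide_action(profile, offers)


offers = [
    {"name": "10% off shoes", "product": "sneakers"},
    {"name": "free espresso shot", "product": "espresso"},
]
print(on_customer_enters_mall("device-123", offers))
```

In a real deployment the rule inside `decide_action` would be replaced by the model produced in the batch half of the picture.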

To build accurate models – and this is where a lot of the typical big data buzzwords come in – we add a batch-oriented massive processing farm into the picture. The lower half in the picture above shows how we leverage a set of components to create a model of buying behavior. Traditionally we would leverage the database (DW) for this. We still do, but we now leverage an infrastructure before that to go after much more data and to continuously re-evaluate all that data with new additions.

A word on the sources. One key element is POS data (in the relational database) which I want to link to customer information (either from my web store or from cell phones or from loyalty cards). The NoSQL DB – Customer Profiles in the picture shows the web store element. It is very important to make sure this multi-channel data is integrated (and de-duplicated but that is a different topic) with my web browsing, purchasing, searching and social media data.

Once that is done, I can puzzle together the behavior of an individual. In essence big data allows micro segmentation at the person level. In effect for every one of my millions of customers!

The final goal of all of this is to build a highly accurate model to place within the real time decision engine. The goal of that model is directly linked to our business goals mentioned earlier. In other words, how can I send you a coupon while you are in the mall that gets you to the store and gets you to spend money…

Detailed Data Flows and Product Ideas

Now, how do I implement this with real products and how does my data flow within this ecosystem? That is something shown in the following sections…

Step 1 – Collect Data

To look up data, collect it and make decisions on it you will need to implement a system that is distributed. As these devices essentially keep on sending data, you need to be able to load the data (collect or acquire) without much delay. That is done like below in the collection points. That is also the place to evaluate for real time decisions. We will come back to the Collection points later…

The data from the collection points flows into the Hadoop cluster – in our case of course a big data appliance. You would also feed other data into this. The social feeds shown above would come from a data aggregator (typically a company) that sorts out relevant hash tags for example. Then you use Flume or Scribe to load the data into the Hadoop cluster.

The next step is to add data and start collating, interpreting and understanding the data in relation to each other.

For instance, add user profiles to the social feeds and the location data to build up a comprehensive understanding of an individual user and the patterns associated with this user. Typically this is done using MapReduce on Hadoop. The NoSQL user profiles are batch loaded from NoSQL DB via a Hadoop Input Format and thus added to the MapReduce data sets.
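As a rough illustration of that collation step, the sketch below mimics a reduce-side join in plain Python: each record is tagged with its source in a map phase, grouped by user, and merged in a reduce phase. It is not Hadoop code; the record layouts and function names are assumptions made up for this example.

```python
from collections import defaultdict


def map_records(source, records):
    """Map phase: tag each record with its source and emit (user_id, tagged_record)."""
    for record in records:
        yield record["user_id"], (source, record)


def reduce_user(user_id, tagged_records):
    """Reduce phase: merge everything known about one user into a single view."""
    view = {"user_id": user_id, "profile": None, "social": [], "locations": []}
    for source, record in tagged_records:
        if source == "profile":
            view["profile"] = record
        elif source == "social":
            view["social"].append(record["text"])
        elif source == "location":
            view["locations"].append(record["store"])
    return view


def run_join(profiles, social, locations):
    # Shuffle: group all mapped records by user_id, as the framework would
    grouped = defaultdict(list)
    inputs = (("profile", profiles), ("social", social), ("location", locations))
    for source, records in inputs:
        for key, value in map_records(source, records):
            grouped[key].append(value)
    return [reduce_user(key, grouped[key]) for key in sorted(grouped)]


profiles = [{"user_id": "u1", "segment": "runner"}]
social = [{"user_id": "u1", "text": "#newshoes"}]
locations = [
    {"user_id": "u1", "store": "sports-outlet"},
    {"user_id": "u2", "store": "cafe"},
]

for view in run_join(profiles, social, locations):
    print(view)
```

Note that user `u2` ends up with location data but no profile, which is exactly the kind of gap the de-duplication and integration work has to deal with.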

To combine it all with Point of Sales (POS) data, with our Siebel CRM data and all sorts of other transactional data you would use Oracle Loader for Hadoop to efficiently move reduced data into Oracle. Now you have a comprehensive view of the data that your users can go after. Either via Exalytics or BI tools or, and this is the interesting piece for this post – via things like data mining.

That latter phase – here called analyze – will create data mining models and statistical models that are going to be used to produce the right coupons. These models are the real crown jewels as they allow an organization to make decisions in real time based on very accurate models. The models are going into the Collection and Decision points to now act on real time data.

In the picture above you see the gray model being utilized in the Expert Engine. That model describes / predicts behavior of an individual customer and based on that prediction we determine what action to undertake.

The above is an end-to-end look at Big Data and real time decisions. Big Data allows us to leverage tremendous data and processing resources to come to accurate models. It also allows us to find out all sorts of things that we were not expecting, creating more accurate models, but also creating new ideas, new business etc.

Once the Big Data Appliance is available you can implement the entire solution as shown here on Oracle technology… now you just need to find a few people who understand the programming models and create those crown jewels.

Friday Dec 09, 2011

Big Data Videos

I've been a bit quiet, as I've been busy working towards releasing our Big Data Appliance. But I thought I'd share the YouTube versions of the OpenWorld videos on big data:

Big Data -- The Challenge

Big Data -- Gold Mine, or just Stuff

Big Data -- Big Data Speaks

Big Data -- Everything You Always Wanted to Know

Big Data -- Little Data

Should be fun to watch over the weekend!

Thursday Nov 17, 2011

Article on Oracle NoSQL Database Testing

A quick post: it looks like InfoWorld did a test with Oracle NoSQL Database and wrote about it. Read more here:


Enjoy and maybe test this one when you start your investigations into a NoSQL Database!

Friday Nov 11, 2011

My Take on Hadoop World 2011

I’m sure some of you have read pieces about Hadoop World and I did see some headlines which were somewhat, shall we say, interesting?

I thought the keynote by Larry Feinsmith of JP Morgan Chase & Co was one of the highlights of the conference. The reason was very simple: he addressed some real use cases outside of internet and ad platforms.

The following are my notes, since the keynote was recorded I presume you can go and look at Hadoopworld.com at some point…

On the use cases that were mentioned:

  1. ETL – how can I do complex data transformation at scale
    1. Doing Basel III liquidity analysis
    2. Private banking – transaction filtering to feed [relational] data marts
  2. Common Data Platform – a place to keep data that is (or will be) valuable some day, to someone, somewhere
    1. 360 Degree view of customers – become pro-active and look at events across lines of business. For example make sure the mortgage folks know about direct deposits being stopped into an account and ensure the bank is pro-active to service the customer
    2. Treasury and Security – Global Payment Hub [I think this is really consolidation of data to cross reference activity across business and geographies]
  3. Data Mining
    1. Bypass data engineering [I interpret this as running on all of a large data set rather than on samples]
    2. Fraud prevention – work on event triggers, say a number of failed log-ins to the website. When they occur, grab web logs, firewall logs and rules and start to figure out who is trying to log in. Is this me, who forgot his password, or is it someone in some other country trying to guess passwords?
    3. Trade quality analysis – do a batch analysis of all trades done and run them through an analysis or comparison pipeline
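The fraud-prevention trigger in the list above can be pictured as a small sliding-window counter. This is my own sketch and not what was shown in the keynote; the threshold of 3 failures and the 60-second window are arbitrary assumptions, and the follow-up work (pulling web and firewall logs) is reduced to a print statement.

```python
from collections import defaultdict, deque

THRESHOLD = 3        # hypothetical: failed attempts that fire the trigger
WINDOW_SECONDS = 60  # hypothetical: sliding window length


class FailedLoginTrigger:
    def __init__(self, threshold=THRESHOLD, window=WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.failures = defaultdict(deque)  # account -> timestamps of failures

    def record_failure(self, account, timestamp):
        """Record a failed log-in; return True when the trigger fires."""
        recent = self.failures[account]
        recent.append(timestamp)
        # Drop failures that fell out of the sliding window
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold


trigger = FailedLoginTrigger()
events = [("alice", 0), ("alice", 10), ("bob", 15), ("alice", 20), ("alice", 200)]
for account, ts in events:
    if trigger.record_failure(account, ts):
        print(f"t={ts}: investigate log-ins for {account}")
```

Only the third failure for `alice` within the window fires the trigger; the late failure at t=200 starts a fresh window.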

One of the key requests – if you can say it like that – was for vendors and entrepreneurs to make sure that new tools work with existing tools. JPMC has a large footprint of BI Tools and Big Data reporting and tools should work with those tools, rather than be separate.

Security and Entitlement – how to protect data within a large cluster from unwanted snooping was another topic that came up.

I thought his Elephant ears graph was interesting (couldn’t actually read the points on it, but the concept certainly made some sense) and it was interesting – when asked to show hands – how the audience did not (!) think that RDBMS and Hadoop technology would overlap completely within a few years.

Another interesting session was the one from Disney, discussing how Disney is building a DaaS (Data as a Service) platform and how Hadoop processing capabilities are mixed with database technologies. I thought this was one of the best sessions I have seen in a long time. It discussed real use cases, where problems existed, how they were solved and how Disney planned some of it.

The planning focused on three things/phases:

  1. Determine the Strategy – Design a platform and evangelize this within the organization
  2. Focus on the people – Hire key people, grow and train the staff (and do not overload what you have with new things on top of their day-to-day job), leverage a partner with experience
  3. Work on Execution of the strategy – Implement the platform Hadoop next to the other technologies and work toward the DaaS platform

This fitted nicely with some of the LinkedIn comments, best summarized in “Think Platform – Think Hadoop”. In other words [my interpretation], step back and engineer a platform (like DaaS in the Disney example), then layer the rest of the solutions on top of this platform.

One general observation, I got the impression that we have knowledge gaps left and right. On the one hand are people looking for more information and details on the Hadoop tools and languages. On the other I got the impression that the capabilities of today’s relational databases are underestimated. Mostly in terms of data volumes and parallel processing capabilities or things like commodity hardware scale-out models.

All in all I liked this conference, it was great to chat with a wide range of people on Oracle big data, on big data, on use cases and all sorts of other stuff. Just hope they get a set of bigger rooms next time… and yes, I hope I’m going to be back next year!


The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

