Thursday Sep 29, 2011

Added Session: Big Data Appliance

We added a new session to discuss Oracle Big Data Appliance. Here are the session details:

Oracle Big Data Appliance: Big Data for the Enterprise
Wednesday 10:15 AM
Marriott Marquis - Golden Gate C3 

Should be a fun session... see you all there!!

Monday Sep 19, 2011

Focus On: Big Data at OpenWorld

With Oracle OpenWorld rapidly approaching and many new things being announced around big data at the show, I figured it would be good to share some of the sessions that are interesting in the big data context.

All big data related sessions can be found here: Focus on Big Data.

A couple of important highlights to set the scene:

  • Sunday 5.30pm: OpenWorld Welcome Keynote featuring some of the announcements
  • Monday 2.00pm: Extreme Data Management - Are you Ready? by Andy Mendelsohn - this one promises to be a very entertaining session mixing fun and technical content

With those sessions under your belt, you are ready to dive into the details as listed in the Focus On document. And if you just want to look at the new machines, come visit us at the Engineered Systems Showcase in front of the keynote hall or come look around the Big Data area in the database demogrounds section!

See you all in San Francisco!

Friday Aug 05, 2011

Big Data: In-Memory MapReduce

Achieving the impossible often comes at a cost. In today’s big data world that cost is often latency. Bringing real-time, deep analytics to big data requires improvements to the core of the big data platform. In this post we will discuss parallel analytics and how we can bring in-memory – or real-time – to very large data volumes.

The Parallel Platform that Runs in Memory

One of the often overlooked things we did in Oracle 11.2 is that we changed the behavior of data that resides in memory across a RAC cluster. Instead of restricting the data size to the size of the memory in a single node in the cluster, we now allow the database to see the memory as a large pool. We no longer do cache fusion to replicate all pieces of data in an object to all nodes.


That cache fusion process is shown above. In state A, node 1 has acquired pieces of data into the buffer cache, as has node 2. These are different pieces of data. Cache fusion now kicks in and ensures that all data for that object is in the cache of each node.

In 11.2 we have changed this by allowing parallel execution to leverage the buffer cache as a grid. To do that we no longer do cache fusion, but instead pin (affinitize is the not so nice English word we use) data onto the memory of a node based on some internal algorithm. Parallel execution keeps track of where data lives and shuffles the query to that node (rather than using cache fusion to move data around).


The above shows how this works. P1 and P2 are chunks of data living in the buffer cache and are affinitized with a certain node. The parallel server processes on that node will execute the query and send the results back to the node that originated the query. With this in-memory grid we can process much larger data volumes.
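The affinity idea can be illustrated with a small sketch. This is only a toy model, not how the database implements it: the node names, the modulo placement rule standing in for the "internal algorithm", and the helper functions are all hypothetical. It shows each chunk living in the memory of exactly one node, each node aggregating only its local chunks, and the partial results being combined on the originating node.

```python
from collections import defaultdict

# Hypothetical two-node cluster; names are illustrative only.
NODES = ["node1", "node2"]


def home_node(chunk_id, nodes=NODES):
    """Pick the single node a chunk is pinned to.

    A simple modulo rule is an assumption standing in for the
    internal placement algorithm, which the post does not describe.
    """
    return nodes[chunk_id % len(nodes)]


def load_into_grid(chunks):
    """Place every chunk in the cache of exactly one node (no replication)."""
    cache = defaultdict(dict)
    for chunk_id, rows in chunks.items():
        cache[home_node(chunk_id)][chunk_id] = rows
    return dict(cache)


def parallel_sum(cache):
    """Send the work to the data: each node sums its local chunks,
    then the originating node combines the partial results."""
    partials = {
        node: sum(sum(rows) for rows in local_chunks.values())
        for node, local_chunks in cache.items()
    }
    return partials, sum(partials.values())


chunks = {0: [1, 2], 1: [3, 4], 2: [5], 3: [6]}
grid = load_into_grid(chunks)
partials, total = parallel_sum(grid)
```

Because no chunk is copied between nodes, the amount of data that fits in memory grows with the number of nodes rather than being capped by a single node.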

In-Database MapReduce becomes In-Memory MapReduce

We talked about in-database MapReduce quite a bit on this blog, so I won’t repeat myself. If you are new to this topic, have a look at this post.

Because of how the database is architected, any code running within it that leverages parallelism can now use the data hosted in the memory of the machine. Whether this is across the grid or not doesn’t matter. So rather than having to figure out how to create a system that allows MapReduce code to run in memory, you just need to figure out how to write the code; Oracle will ensure that if the data fits, it will leverage memory instead of disk. That is shown in the following picture.


Now, is this big data? Well, if I look at an Exadata X2-8 machine today, we have 2TB of memory to work with, and with Exadata Hybrid Columnar Compression (yes, the data resides in memory in a compressed state) this means I should easily be able to run upwards of 10TB in memory. As memory footprints grow, more data will fit within the memory and we can do more and more interesting analytics on that data, bringing real-time response to at least some fairly complex analytics.
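As a quick back-of-the-envelope check on those numbers: fitting 10TB of data into 2TB of memory implies a compression ratio of roughly 5x. The sketch below just makes that arithmetic explicit. The function name and the usable_fraction parameter are assumptions for illustration; actual compression ratios depend on the data and are not stated in this post.

```python
def in_memory_capacity_tb(memory_tb, compression_ratio, usable_fraction=1.0):
    """Uncompressed data volume (TB) that fits in memory.

    usable_fraction is a hypothetical knob for the share of memory
    actually available to cached data; 1.0 means all of it.
    """
    return memory_tb * usable_fraction * compression_ratio


memory_tb = 2.0    # memory of the machine, from the post
target_tb = 10.0   # data volume to hold in memory, from the post

# Compression ratio implied by the two figures above.
required_ratio = target_tb / memory_tb

capacity_tb = in_memory_capacity_tb(memory_tb, required_ratio)
```

If only part of the memory is available for cached data, the required ratio rises accordingly, which is why the 10TB figure should be read as an estimate rather than a guarantee.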

More at Oracle OpenWorld

For those of you who are going to San Francisco (with or without flowers in your hat), expect to see and hear a lot more on big data and MapReduce! See you there!

Monday Jun 27, 2011

Big Data Accelerator

For everyone who does not regularly listen to earnings calls, Oracle's Q4 call was interesting (as it mostly is). One of the announcements in the call was the Big Data Accelerator from Oracle (Seeking Alpha link here - slightly tweaked for correctness shown below):

 "The big data accelerator includes some of the standard open source software, HDFS, the file system and a number of other pieces, but also some Oracle components that we think can dramatically speed up the entire map-reduce process. And will be particularly attractive to Java programmers [...]. There are some interesting applications they do, ETL is one. Log processing is another. We're going to have a lot of those features, functions and pre-built applications in our big data accelerator."

Not much else we can say right now; more on this (and Big Data in general) at OpenWorld!

Tuesday Jun 07, 2011

Big Data: Achieve the Impossible in Real-Time

Sure, we all want to make the impossible possible… in any scenario, in any business. Here we are talking about driving performance to levels previously considered impossible and doing so by using just data and advanced analytics.

An amazing example of this is the BMW Oracle America's Cup boat and its usage of sensor data and deep analytics (story here).

Consider these two quotes from the article:

"They were measuring an incredible number of parameters across the trimaran, collected 10 times per second, so there were vast amounts of [sensor] data available for analysis. An hour of sailing generates 90 million data points."

"[…] we could compare our performance from the first day of sailing to the very last day of sailing, with incremental improvements the whole way through. With data mining we could check data against the things we saw, and we could find things that weren't otherwise easily observable and findable."

Winning the Cup

BMW Oracle Racing © Photo Gilles Martin-Raget

The end result of all of this (and do read the entire article, it is truly amazing, with things like data projected in sunglasses!) is that the crew can make a sailboat go THREE times as fast as the wind that propels it.

To make this magic happen, a couple of things had to be done:

  1. Put the sensors in place and capture all the data
  2. Organize the data and analyze all of it in real-time
  3. Provide the decisions to the people who need them, exactly when they need them (like in the helmsman’s sunglasses!)
  4. Convince the best sailors in the world to trust and use the analysis to drive the boat

Since this blog is not about sailing but about data warehousing, big data and other (only slightly) less cool things, the intent is to explain how you can deliver magic like this in your company.

Move your company onto the next value curve

The above example gives you an actual environment where the combination of high volume, high velocity sensor data, deep analytics and real-time decisions is used to drive performance. This example is a real big data story.

Sure, a multi-billion dollar business will often collect more data, but the point of the above story is analyzing a previously unseen, massive influx of data – the team estimated 40x more data than in conventional environments. The extra interesting aspect, however, is that decisions are automated. Rather than flooding the sunglasses with data, only relevant decisions and data are projected. There was no need for the helmsman to interpret the data; he simply needed to act on the decision points.

To project the idea of acting on decision points into an organization, your IT will have to start changing, as will your end users. To do so, you need to jump onto the bandwagon called big data. The following describes how to get on that bandwagon.

Today, your organization is doing the best it can by leveraging its current IT and DW platforms. That means – for most organizations – that you have squeezed all the relevant information out of the historical data assets you analyze. You are the dot on the lower value curve and you are on the plateau. Any extra dollar invested in the plateau is just about keeping the lights on, not about generating competitive advantage or business value. To jump to the next curve, you need to find some way to harness the challenges imposed by big data.

Value Curves Today

From an infrastructure perspective, you must design a big data platform. That big data platform is a fundamental part of your IT infrastructure if your company wants to compete over the next few years.

Value Curves Tomorrow

The main components in the big data platform provide:

  • Deep Analytics – a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities
  • High Agility – the ability to create temporary analytics environments in an end-user driven, yet secure and scalable environment to deliver new and novel insights to the operational business
  • Massive Scalability – the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data potential
  • Low Latency – the ability to instantly act based on these advanced analytics in your operational, production environments

Read between the lines and you see that the big data platform is based on the three hottest topics in the industry: security, cloud computing and big data, all working in conjunction to deliver the next generation big data computing platform.

IT Drives Business Value

Over the next couple of years, companies which drive efficiency, agility and IT as a service via the cloud, which drive new initiatives and top line growth leveraging big data and analytics, and which keep all their data safe and secure, will be the leaders in their industry.

Oracle is building the next generation big data platforms on these three pillars: cloud, security and big data. Over the next couple of months – leading up to Oracle OpenWorld – we will cover details about Oracle’s analytical platform and in-memory computing for real-time big data (and general purpose speed!) on this blog.

A little bit of homework to prepare you for those topics is required. If you have not yet read the following, do give them a go; they are a good read:

These - older - blog posts will give you an understanding of in-database MapReduce techniques, how to integrate with Hadoop and a peek at some futuristic applications that I think would be generally cool and will surely be coming down the pipeline in some form or fashion.


The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

