A Closer Look at Oracle Big Data Appliance

Oracle OpenWorld just flew by. A lot happened in the big data space, of course, and you can read plenty of articles, blog posts and other interesting material about it all over the web.

What I thought I’d do here is go through the Big Data Appliance in a little more detail, so everyone understands the make-up of the machine, what software we are putting on it, and how it integrates with Exadata machines.

Now, if you are bored reading, you can actually see and hear Todd and me discuss all this using this link. This should be fun if you have never been to OpenWorld, as the interview was recorded at the OTN Lounge in the Howard Street tent.

Oracle Big Data Appliance

The machine details are as follows; a quick sanity check of the totals appears after the list:

  • 18 Nodes – Sun Servers
  • 2 CPUs per node, each with 6 cores (216 cores total)
  • 12 Disks per node (432 TB raw disk total)
  • Redundant InfiniBand Switches with 10GigE connectivity
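For those who like to check the math, here is a minimal sketch of how the per-node numbers roll up to the rack totals quoted above. The 2 TB-per-disk figure is an assumption implied by the 432 TB raw total, not something stated explicitly in the list.

```python
# Back-of-envelope check of the rack totals quoted above.
nodes = 18
cpus_per_node, cores_per_cpu = 2, 6
disks_per_node, tb_per_disk = 12, 2   # 2 TB drives assumed, implied by the 432 TB raw figure

total_cores = nodes * cpus_per_node * cores_per_cpu   # 18 * 2 * 6  = 216 cores
total_raw_tb = nodes * disks_per_node * tb_per_disk   # 18 * 12 * 2 = 432 TB raw

print(total_cores, total_raw_tb)  # 216 432
```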

To scale the machines, simply add a rack to the original full rack via InfiniBand. By leveraging InfiniBand we generally remove the network bottlenecks within and between machines. We chose InfiniBand over 10GigE connectivity because we believe network capacity of 40Gb/sec is a valuable asset in a Hadoop cluster. We also think that using InfiniBand to connect the Big Data Appliance to an Exadata machine will have a positive influence on the batch loads done into an Oracle system.
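To put a rough number on that claim, the sketch below compares idealized transfer times for a terabyte of data over 1GigE, 10GigE and 40Gb/sec InfiniBand. It is a back-of-envelope illustration only; it assumes pure wire speed and ignores protocol overhead, IPoIB efficiency, congestion and disk throughput limits.

```python
# Rough, idealized transfer times for shuffling or loading 1 TB of data
# over different interconnects. Wire speed only -- no protocol overhead,
# IPoIB efficiency, congestion or disk limits are modeled.
data_tb = 1.0
data_bits = data_tb * 1e12 * 8            # 1 TB expressed in bits

links_gbps = {"1GigE": 1, "10GigE": 10, "InfiniBand QDR": 40}

for name, gbps in links_gbps.items():
    seconds = data_bits / (gbps * 1e9)
    print(f"{name:>15}: ~{seconds / 60:.1f} minutes per TB")
# 1GigE ~133 min, 10GigE ~13 min, InfiniBand QDR ~3.3 min (theoretical best case)
```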


The software we are going to pre-install on the machine is:

  • Oracle Linux and the Oracle HotSpot Java VM
  • Open-source distribution of Apache Hadoop
  • Oracle NoSQL Database Enterprise Edition (also available stand-alone)
  • Oracle Loader for Hadoop (also available stand-alone)
  • Open-source distribution of R (statistical package)
  • Oracle Data Integrator Application Adapter for Hadoop (also available stand-alone with ODI)

The goal of this software stack combined with the Sun hardware as an appliance is to create an enterprise class solution for Big Data that is:

  • Optimized and Complete - Everything you need to store and integrate your lower information density data
  • Integrated with Oracle Exadata - Analyze all your data
  • Easy to Deploy - Risk Free, Quick Installation and Setup
  • Single Vendor Support - Full Oracle support for the entire system and software set

As we get closer to the delivery date, you will see more detailed descriptions of the appliance, so stay tuned.

Comments:

JP,
One of the questions I had at OOW was whether anyone had approached Mellanox, the InfiniBand switch vendor, about their UDA plugin, which bypasses the TCP/IP layer in favor of RDMA. It would seem that this would be a major differentiator for Oracle's pre-built box.

Matt

Posted by Matt Topper on October 14, 2011 at 01:44 PM PDT #

Hi Matt,

Keep that thought! Will update you all on these details as we get closer to the release date.

JP

Posted by Jean-Pierre on October 17, 2011 at 03:58 PM PDT #

Given that all the large production Hadoop clusters (where large means >500 nodes) tend to run 1GbE or bonded 2x1GbE, it's surprising to hear your claim that InfiniBand is better. The only paper to show an improvement is Sur et al., 2010, "Can High-Performance Interconnects Benefit Hadoop Distributed File System?" - and they showed that you needed SSDs for the benefits to kick in. Once you've gone to SSD+InfiniBand you can't (today) call yourself Big Data, more "medium expensive data".

The whole point about Apache Hadoop is to compensate for bandwidth limitations through topology-aware work scheduling; to compensate for switch and server failure through switch-aware block replication. Either there is something badly wrong with the test workloads where locality isn't very good, the worker nodes don't have enough spare compute capacity for work, or this is an ad-hoc justification of a design decision to focus on bandwidth over storage. It may be for non-MapReduce workloads, but if you think that InfiniBand is that important to Hadoop then I fear you may have some fundamental misunderstandings about the technology.

Posted by SteveL on October 24, 2011 at 02:55 AM PDT #

Hi Steve,

Yes, we are fully aware that you are aiming for locality when running MapReduce workloads on a distributed system like Hadoop, and as such you really care about what data lives where. Simplistically speaking, you do that because you try to avoid shuffling data from node 1 to node 34. Any time you do move data - and you will in a distributed cluster - you hit the network, and on 1GigE you will often hit a network bottleneck.
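(As an aside on "what data lives where": at the rack level Hadoop learns this from a small, pluggable topology script - the property is topology.script.file.name in the 0.20.x/1.x line, renamed in later releases. The sketch below, with made-up host names, just illustrates that contract; it is not something that ships on the appliance.)

```python
#!/usr/bin/env python
# Minimal illustration of a Hadoop rack-awareness topology script.
# Hadoop invokes it with one or more host names/IPs as arguments and
# expects one rack path per argument on stdout; block placement and
# task scheduling then use these paths to keep work close to the data.
# The host-to-rack mapping below is made up for the example.
import sys

RACKS = {
    "node01": "/dc1/rack1", "node02": "/dc1/rack1",
    "node34": "/dc1/rack3",
}

for host in sys.argv[1:]:
    print(RACKS.get(host, "/default-rack"))
```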

This is no different from any parallel (distributed) system, an RDBMS for example: if you have no skew, great co-location and so on, everything works wonderfully well. But that doesn't always turn out to be the case...

What we are observing is that the large, mission-critical implementations are moving to, or at least considering, 10GigE. We are simply saying: just move to 40Gb/sec (InfiniBand) and remove your network bottlenecks, because your data distribution is not always going to be perfect.

Leveraging InfiniBand will also allow us to move data efficiently to non-Hadoop components in the environment, like Exadata. We see tremendous value in quickly and efficiently integrating data across these systems.

Last but not least, we attempt to engineer systems that will be highly effective for our customer base. As such, we try to deliver a system that is optimized wherever we can optimize it by leveraging advanced technology. One of those pieces is InfiniBand.

Posted by Jean-Pierre on October 24, 2011 at 04:30 AM PDT #
