Friday Sep 26, 2014

Oracle Big Data Lite 4.0 Virtual Machine Now Available

Big Data Lite 4.0 is now available for download from OTN.  There are lots of new capabilities in this latest version:
  • Oracle Database 12c, including new JSON support and Oracle Big Data SQL-enabled external tables.  Check out this hands-on lab to learn how to securely analyze all your data - across both Hadoop and Oracle Database 12c - using Big Data SQL.
  • New versions of SQL Developer and Data Modeler that support Hive access and automatic generation of Big Data SQL external tables
  • GoldenGate and the latest ODI versions are now included - with some great new hands-on labs.
  • Cloudera Manager is back - you can now optionally use CM to manage your Hadoop environment (requires 10GB memory devoted to the VM).  If you don't want to use CM, you can use the manual CDH configuration with the Big Data Lite services application
  • New versions of the entire stack... Big Data Connectors, NoSQL Database, CDH, JDeveloper and more.

Here's the inventory of all the features and versions:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition - including Oracle Big Data SQL-enabled external tables, Oracle Advanced Analytics, OLAP, Spatial and more
  • Cloudera Distribution including Apache Hadoop (CDH5.1.2)
  • Cloudera Manager (5.1.2)
  • Oracle Big Data Connectors 4.0
    • Oracle SQL Connector for HDFS 3.1.0
    • Oracle Loader for Hadoop 3.2.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.0.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.14)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1

Tuesday Sep 23, 2014

Big Data IM Reference Architecture

Just in time for Oracle OpenWorld, the new Big Data Information Management Reference Architecture is posted on our OTN pages. The reference architecture attempts to create order in the wild west of new technologies and the flurry of new ideas, and, most importantly, tries to go from marketing hype to a real, implementable architecture.

To get all the details, read the paper here. Thanks to the EMEA architecture team, the folks at Rittman Mead Consulting and all others involved.

Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data


Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take the opportunity to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider: SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: having language elements like FROM, WHERE and GROUP BY makes reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.
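
To make that concrete, here is a small, purely illustrative query (the orders and customers tables are hypothetical).  It states only the shape of the answer we want and says nothing about scan order, join algorithms, or how partial aggregations are shuffled.

    -- Declarative: describe the result, not the access path
    SELECT c.state,
           COUNT(*)      AS order_count,
           SUM(o.amount) AS revenue
    FROM   orders o
    JOIN   customers c
      ON   c.customer_id = o.customer_id
    WHERE  o.order_date >= DATE '2014-01-01'
    GROUP  BY c.state;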

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data than ever before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important, if not more so.  As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but also be able to truly apply schema-on-read principles to data stored in Hadoop.
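
As a small, hedged illustration of schema-on-read (generic Hive DDL; the table name, columns, and path are invented), the definition below simply projects a schema onto delimited files that already sit in HDFS.  Nothing is converted or reloaded, and a different schema could be layered onto the same files later.

    -- Hive: overlay a schema on raw, tab-delimited files already in HDFS
    CREATE EXTERNAL TABLE web_logs (
      log_time  STRING,
      user_id   STRING,
      url       STRING,
      status    INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/web_logs';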

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not on distributed data that is intended to be consumed through dynamically invoked readers.  That both hurts parallelism and removes the value of schema-on-read.

The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDe classes defined in the Hive metastore to read the data from HDFS.
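
To sketch what such a definition might look like, the example below declares an Oracle external table over the hypothetical Hive table from earlier.  The ORACLE_HIVE access driver and the com.oracle.bigdata.tablename parameter are assumptions drawn from the Big Data SQL documentation; check the released syntax before relying on it.

    -- Sketch only: an Oracle external table backed by the Hive table web_logs.
    -- ORACLE_HIVE and the access parameter below are assumptions; verify
    -- against the Big Data SQL documentation.
    CREATE TABLE web_logs_ext (
      log_time  VARCHAR2(30),
      user_id   VARCHAR2(20),
      url       VARCHAR2(4000),
      status    NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY default_dir
      ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.web_logs)
    )
    REJECT LIMIT UNLIMITED;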

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.
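
Assuming the hypothetical external table sketched above, the join itself is then ordinary SQL against what looks like just another table:

    -- Join HDFS-resident log data with reference data stored in Oracle Database
    SELECT c.customer_name,
           COUNT(*) AS page_views
    FROM   web_logs_ext w
    JOIN   customers    c
      ON   c.customer_id = w.user_id
    GROUP  BY c.customer_name;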

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the Exadata team proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct-path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement that impedes performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying 100s of TB of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field of every document, we're able to cut down 100s of TB to 100s of GB.
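
A hypothetical version of that query is sketched below, using the JSON_VALUE operator from the 12c JSON support over an external table with a single JSON document column (all names invented).  The state predicate and the two projected fields are exactly what Smart Scan can evaluate down on the DataNodes.

    -- Hypothetical query over 100s of TB of JSON documents in HDFS.
    -- Only matching documents, and only two fields of them, move to the database.
    SELECT JSON_VALUE(doc, '$.email')  AS email,
           JSON_VALUE(doc, '$.status') AS status
    FROM   customer_docs_ext
    WHERE  JSON_VALUE(doc, '$.state') = 'TX';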

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.


So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving what makes each system a valuable part of an overall information architecture.  Big Data SQL unifies metadata, so that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.


Thursday Jun 26, 2014

Big Data Breakthrough; Watch the Webcast on July 15th

Wednesday Jun 25, 2014

Big Data Appliance now part of Exastack Partner Program

With data growing at unprecedented levels, application providers need a robust, scalable and high-performance infrastructure to manage and analyze data at the speed of business. Helping Independent Software Vendors (ISVs) meet these challenges, Oracle PartnerNetwork (OPN) is today extending its Oracle Exastack program to give partners the opportunity to achieve Oracle Exastack Ready or Optimized status for Oracle Big Data Appliance and Oracle Database Appliance.

The announcement was made today during the sixth annual Oracle PartnerNetwork (OPN) global kickoff event, “Engineered for Success: Oracle and Partners Winning Together.” For a replay of the event, visit:

Here is a word from one of our partners: “TCS was looking for an enterprise grade and reliable Hadoop solution for our HDMS document management application,” said Jayant Dani, principal consultant, Component Engineering Group, TCS.  “We were drawn to Oracle's Big Data Appliance for its ready to use platform, superior performance, scalability, and simple scale-out capabilities.”

Read the entire news release here.

Thursday Jun 05, 2014

Globacom and mCentric Deploy BDA and NoSQL Database to analyze network traffic 40x faster

In a fast evolving market, speed is of the essence. mCentric and Globacom leveraged Oracle Big Data Appliance and Oracle NoSQL Database to save over 35,000 call-processing minutes daily and analyze network traffic 40x faster.

Here are some highlights from the profile:

Why Oracle

“Oracle Big Data Appliance works well for very large amounts of structured and unstructured data. It is the most agile events-storage system for our collect-it-now and analyze-it-later set of business requirements. Moreover, choosing a prebuilt solution drastically reduced implementation time. We got the big data benefits without needing to assemble and tune a custom-built system, and without the hidden costs required to maintain a large number of servers in our data center. A single support license covers both the hardware and the integrated software, and we have one central point of contact for support,” said Sanjib Roy, CTO, Globacom.

Implementation Process

It took only five days for Oracle partner mCentric to deploy Oracle Big Data Appliance and perform the software install and configuration, certification, and resiliency testing. The entire process—from site planning to phase-I go-live—was executed in just over ten weeks, well ahead of the four months allocated to complete the project.

Oracle partner mCentric leveraged Oracle Advanced Customer Support Services’ implementation methodology to ensure configurations are tailored for peak performance, all patches are applied, and software and communications are consistently tested using proven methodologies and best practices.

Read the entire profile here.

Tuesday May 13, 2014

Announcing: Big Data Lite VM 3.0

Last week we released Big Data Lite VM 3.0. It contains the latest updates across the entire stack:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition - with Oracle Advanced Analytics, Spatial & Graph, and more
  • Cloudera’s Distribution including Apache Hadoop (CDH 5.0)
  • Oracle Big Data Connectors 3.0
    • Oracle SQL Connector for HDFS 3.0.0
    • Oracle Loader for Hadoop 3.0.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.0
    • Oracle XQuery for Hadoop 3.0.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.5)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

The download page is on OTN in its usual place.

Tuesday Apr 22, 2014

Announcing: Big Data Appliance 3.0 and Big Data Connectors 3.0

Today we are releasing Big Data Appliance 3.0 (which includes the just released Oracle NoSQL Database 3.0) and Big Data Connectors 3.0. These releases deliver a large number of interesting and cool features and enhance the overall Oracle Big Data Management System that we think is going to be the core of information management going forward.

This post highlights a few of the new enhancements across the BDA, NoSQL DB and BDC stack.

Big Data Appliance 3.0:
  • Pre-configured and pre-installed CDH 5.0 with default support for YARN and MR2
  • Upgrade from BDA 2.5 (CDH 4.6) to BDA 3.0 (CDH 5.0)
  • Full encryption (at rest and over the network) from a single vendor in an appliance
  • Kerberos and Apache Sentry pre-configured
  • Partition Pruning through Oracle SQL Connector for Hadoop
  • Apache Spark (incl. Spark Streaming) support
  • More
Oracle NoSQL Database 3.0:
  • Table data model support layered on top of the distributed key-value model (see the sketch after this list)
  • Support for Secondary Indexing
  • Support for "Data Centers" => Metro zones for DR and secondary zones for read-only workloads
  • Authentication and network encryption
  • More
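
As a hedged, conceptual sketch of the table model and secondary indexing mentioned above (names are invented, the DDL is shown for readability, and the exact mechanism for defining tables - for example through the admin CLI - should be checked against the Oracle NoSQL Database 3.0 documentation):

    -- Sketch: a table and a secondary index layered on the key-value store
    CREATE TABLE users (
      id     INTEGER,
      name   STRING,
      state  STRING,
      PRIMARY KEY (id)
    );
    CREATE INDEX users_by_state ON users (state);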

You can read about all of these features by following the links above and reading the OTN pages, data sheets and other relevant information.

While BDA 3.0 immediately delivers an upgrade path from BDA 2.5, Oracle will also support the current version, and we fully expect more BDA 2.x releases based on future CDH 4.x releases. As a customer you now have a choice in how to deploy BDA and which version you want to run, while knowing you can upgrade to the latest and greatest in a safe manner.

Thursday Apr 03, 2014

Updated: Price Comparison for Big Data Appliance and Hadoop


It was time to update this post a little. Big Data Appliance grew, gained more features, and both prices and insights changed across the board. So, here is an update.

The post is still aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not, included in the pricing and packaging of Oracle Big Data Appliance when compared to "I'm doing this myself - DIY style".

Oracle Big Data Appliance Details

A few of the most overlooked items in pricing out a Hadoop cluster are the cost of software, the cost of actual production-ready hardware and the required networking equipment. A Hadoop cluster needs more than just CPUs and disks... For Oracle Big Data Appliance we assume that you would want to run this system as a production system (with hot-pluggable components and redundant components in your system). We also assume you want the leading Hadoop distribution plus support for that software. You'd want to look at securing the cluster and possibly encrypting data at rest and over the network. Speaking of network, InfiniBand will eliminate network saturation issues - which is important for your Hadoop cluster.

With that in mind, Oracle Big Data Appliance is an engineered system built for production clusters.  It is pre-installed and pre-configured with Cloudera CDH with all (I emphasize all!) options included, and we (with the help of Cloudera of course) have done the tuning of the system for you. On top of that, the price of the hardware (US$ 525,000 for a full rack system; more configurations and smaller sizes are available) includes the cost of Cloudera CDH, its options and Cloudera Manager (for the life of the machine - so not a subscription).

So, for US$ 525,000 you get the following:

  • Big Data Appliance Hardware (comes with Automatic Service Request upon component failures)
  • Cloudera CDH and Cloudera Manager
  • All Cloudera options as well as Accumulo and Spark (CDH 5.0)
  • Oracle Linux and the Oracle JDK
  • Oracle Distribution of R
  • Oracle NoSQL Database Community Edition
  • Oracle Big Data Appliance Enterprise Manager Plug-In

The support cost for the above is a single line item. The list price for Premier Support for Systems per the Oracle price list (see source below) is US$ 63,000 per year.

To do a simple 3-year comparison with other systems, the following table shows the details and the totals for Oracle Big Data Appliance. Note that the only additional item is the cost of install and configuration, which is performed on-site by Oracle personnel or partners:

                                  Year 1       Year 2       Year 3       3-Year Total
BDA Cost                          $525,000                                $525,000
Annual Support Cost               $63,000      $63,000      $63,000      $189,000
On-site Install (approximately)

For this you will get a full rack BDA (18 Sun X4-2L servers, 288 cores (two Intel Xeon E5-2650 v2 CPUs per node), 864TB disk (twelve 4TB disks per node)), plus software, plus support, plus on-site setup and configuration. Or in terms of cost per raw TB at purchase and at list pricing: $697.

HP DL-380 Comparative System (this is changed from the original post to the more common DL-380s)

To build a comparative hardware solution to the Big Data Appliance we picked an HP DL380 configuration and built up the servers using the HP website for pricing. The following components make up a single server:

  • 653200-B21 ProLiant DL380p Gen8 Rackmount Factory Integrated 8 SFF CTO Model (2U) with no processor, 24 DIMM with no memory, open bay (diskless) with 8 SFF drive cage, Smart Array P420i controller with Zero Memory, 3 x PCIe 3.0 slots, 1 FlexibleLOM connector, no power supply, 4 x redundant fans, Integrated HP iLO Management Engine
  • 2.6GHz Xeon E5-2650 v2 processor (1 chip, 8 cores) with 20MB L3 cache - Factory Integrated Only
  • HP 1GbE 4-port 331FLR Adapter - Factory Integrated Only
  • 460W Common Slot Gold Hot Plug Power Supply
  • HP Rack 10000 G2 Series - 10842 (42U) 800mm Wide Cabinet - Pallet Universal Rack
  • 8GB (1 x 8GB) Single Rank x8 PC3L-12800R (DDR3-1600) Registered CAS-11 Low Voltage Memory Kit
  • HP Smart Array P222/512MB FBWC 6Gb 1-port Int/1-port Ext SAS controller
  • 4TB 6Gb SAS 7.2K LFF hot-plug SmartDrive SC Midline disk drive (3.5") with 1-year warranty

Grand Total for a single server (list prices)


On top of this we need InfiniBand switches. Oracle Big Data Appliance comes with 3 IB switches, allowing us to expand the cluster without suddenly requiring extra switches. And, we do expect these machines to be part of much larger clusters. The IB switches are somewhere in the neighborhood of US$ 6,000 per switch, so add $18,000 per rack, and add a management switch (BDA uses a Cisco switch) which seems to be around $15,000 list. The total switching comes to roughly $33,000.

We will also need a Cloudera Enterprise subscription - and to compare apples to apples, we will do it for all software. Some sources (see this document) peg CDH Core at $3,382 list per node per year (24*7 support). Since BDA has more software (all options) and that pricing is not public, I am going to make an educated estimate: double that price and round to the nearest nice, round number. That gets me to $7,000 per node, per year, for 24*7 support.

BDA also comes with on-disk encryption, which is even harder to price out. My somewhat educated guess is around $1,500 list or so per node per year. Oh, and let's not forget the Linux subscription, which lists at $1,299 per node per year. We also run a MySQL database (Enterprise Edition with replication), which has a list subscription price of $5,000 per node per year. We run it replicated over 2 nodes.

This all gets us to roughly $10,000 list price per node per year for all applicable software subscriptions and support and an additional $10,000 for the two MySQL nodes.

HP + Cloudera Do-it-Yourself System

Let's go build our own system. The specs are like a BDA, so we will have 18 servers and all other components included. 

                                  Year 1       Year 2       Year 3       Total
SW Subscriptions and Support
Installation and Configuration


Some will argue that the installation and configuration is free (you already pay your data center team), but I would argue that something that takes a short amount of time when done by Oracle is worth at least the equivalent cost if it takes you a lot longer to get all this installed, optimized, and running. Nevertheless, here is some math on how to get to that cost anyway: approximately 150 hours of labor per rack for the pure install work. That adds up to US$ 15,000 if we assume a cost per hour of $100.

Note: that $15,000 does NOT include optimization and tuning of Hadoop, the OS, Java, and other interesting things like networking settings across all these areas. You will now need to spend time figuring out the number of slots to allocate per node, the file system block size (do you use the Apache defaults, Cloudera's, or something else) and many more things at the system level. On top of that, BDA comes pre-configured with, for example, Kerberos and Apache Sentry, giving you secure authorization and authentication, as well as a one-click on-disk and network encryption setting. Of course you can contact various other companies to do this for you.

You can also argue that "you want the cheapest hardware possible", because Hadoop is built to deal with failures, so it is OK for things to regularly fail. Yes, Hadoop does deal well with hardware failures, but your data center is probably much less keen about this idea, because someone is going to replace the disks (all the time). So make sure the disks are hot-swappable. And oh, that someone swapping the disks does cost money... The other consideration is failures in important components like power... redundant power in a rack is a good thing to have. All of this is included (and thought about) in Oracle Big Data Appliance.

In other words, do you really want to spend weeks installing, configuring and learning, or would you rather start building applications on top of the Hadoop cluster and thus provide value to your organization?

The Differences

The main differences between Oracle Big Data Appliance and a DIY approach are:

  1. A DIY system - at list price with basic installation but no optimization - is a staggering $220 cheaper as an initial purchase
  2. A DIY system - at list price with basic installation but no optimization - is almost $250,000 more expensive over 3 years.
    Note to purchasing: you can spend this on building or buying applications for your cluster (or on some really intriguing Oracle software)
  3. The support for the DIY system involves five (5) vendors: your hardware support vendor, the OS vendor, your Hadoop vendor, your encryption vendor, and your database vendor. Oracle Big Data Appliance is supported end-to-end by a single vendor: Oracle
  4. Time to value. While we trust that your IT staff will get the DIY system up and running, the Oracle system allows for a much faster "loading dock to loading data" time. Typically a few days instead of a few weeks (or even months)
  5. Oracle Big Data Appliance is tuned and configured to take advantage of the software stack, the CPUs and InfiniBand network it runs on
  6. Any issue we, you or any other BDA customer finds in the system is fixed for all customers. You do not have a unique configuration, with unique issues on top of the generic issues.


In an apples-to-apples comparison of a production Hadoop cluster, Oracle Big Data Appliance starts off with roughly the same acquisition price and comes out ahead in terms of TCO over 3 years. It allows an organization to enter the Hadoop world with a production-grade system in a very short time, reducing both risk and time to market.

As always, when in doubt, simply contact your friendly Oracle representative for questions, support and detailed quotes.


HP and related pricing: or (the latter is a paid service - sorry!)
Oracle Pricing:
MySQL Pricing:

Wednesday Mar 26, 2014

Oracle Big Data Lite Virtual Machine - Version 2.5 Now Available

Oracle Big Data Appliance Version 2.5 was released last week.  This release has some great new features - including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to Cloudera's Distribution including Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine.  Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured - allowing you to begin using the system right away. The following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition
  • Cloudera’s Distribution including Apache Hadoop (CDH4.6)
  • Cloudera Manager 4.8.2
  • Cloudera Enterprise Technology, including:
    • Cloudera RTQ (Impala 1.2.3)
    • Cloudera RTS (Search 1.2)
  • Oracle Big Data Connectors 2.5
    • Oracle SQL Connector for HDFS 2.3.0
    • Oracle Loader for Hadoop 2.3.1
    • Oracle Data Integrator 11g
    • Oracle R Advanced Analytics for Hadoop 2.3.1
    • Oracle XQuery for Hadoop 2.4.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Wednesday Mar 19, 2014

Announcing Encryption of Data-at-Rest on Big Data Appliance

With the release of Big Data Appliance software bundle 2.5, BDA completes the encryption story underneath Cloudera CDH. BDA already came with network encryption, ensuring no network sniffing can be applied between the nodes; it now adds encryption of data-at-rest.

A Brief Overview

Encryption of data-at-rest can be done in 2 modes. One mode leverages the Trusted Platform Module (TPM) on the motherboard to provide a key to encrypt the data on disk. This mode does not require a password or passphrase but relies on the motherboard. The second mode leverages a passphrase, which in turn is used to generate a private-public key pair with OpenSSL. The key pair is encrypted as well.

The passphrase encryption has a few more interesting aspects. For one, it does require the passphrase to be entered upon rebooting the system. Leveraging the TPM option does not require any manual intervention at reboot. On Big Data Appliance it is possible to regularly change the passphrase without impacting the encryption or requiring re-encryption of the data.

Neither one of the encryption methods affects user access to user data. In other words, on an unprotected cluster a user that can read data before encryption will be able to read data after encryption. The goal is to ensure data is protected on physical media - for example against theft or incorrect disposal of a disk. Both forms protect from that, but only passphrase-based encryption protects against disposal or theft of an entire server.

On BDA, it is possible to switch between these two methods. This does have an impact on the running cluster, as data needs to be re-encrypted. For this step the cluster will be down; however, data is not duplicated, so there is no need to reserve double the space to do the re-encryption.

How to Encrypt Data

As with all installations or changes on Big Data Appliance, you will leverage Mammoth to do the install with encryption, or to make changes to the system if you are already in production. Before you set up either of the two modes of data-at-rest encryption, you should consider your requirements. Changing the mode - as described - is possible, but will require the cluster to be down for re-encryption.

Full Set of Security Features

Out-of-the-box encryption is yet another feature that is specific to Oracle Big Data Appliance. On top of pre-configured Kerberos, Apache Sentry, and Oracle Audit Vault, encryption now adds another security dimension. To read more about the full set of features, start here.

Thursday Mar 13, 2014

Video: Big Data Connectors and IDH (Strata)

The certification of Oracle Big Data Connectors on Intel Distribution for Hadoop is now complete (see our previous post). This video from Strata gives you a nice overview of IDH and BDC.

Friday Mar 07, 2014

Intel® Distribution for Apache Hadoop* certified with Oracle Big Data Connectors

Intel partnered with Oracle to certify compatibility between Intel® Distribution for Apache Hadoop* (IDH) and Oracle Big Data Connectors*.  Users can now connect IDH to Oracle Database with Oracle Big Data Connectors, taking advantage of the high-performance, feature-rich components of that product suite. Applications on IDH can leverage the connectors for fast load into Oracle Database, in-place query of data in HDFS with Oracle SQL, analytics in Hadoop with R, XQuery processing on Hadoop, and native Hadoop integration within Oracle Data Integrator.

Read the whole post here.

Monday Jan 27, 2014

Announcing: Oracle Big Data Lite Virtual Machine

You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.


Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox appliance that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It's been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB of memory to run (this means that your computer should have at least 8GB of total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.

How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am  to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:


There will be 45-minute sessions delivered by product experts (from both Oracle and Oracle ACEs) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution".  Some of the big data technical sessions include:

  • Oracle NoSQL Database Installation and Cluster Topology Deployment
  • Application Development & Schema Design with Oracle NoSQL Database
  • Processing Twitter Data with Hadoop
  • Use Data from a Hadoop Cluster with Oracle Database
  • Make the Right Offers to Customers Using Oracle Advanced Analytics
  • In-DB Map Reduce with SQL/Hadoop
  • Pattern Matching in SQL

Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!

Tuesday Dec 03, 2013

Big Data - Real and Practical Use Cases: Expand the Data Warehouse

As a follow-on to the previous post (here) on use cases, follow the link below to a recording that explains how to go about expanding the data warehouse into a big data platform.

The idea behind it all is to cover a best practice on adding to the existing data warehouse and expanding the system (shown in the figure below) to deal with:

  • Real-Time ingest and reporting on large volumes of data
  • A cost effective strategy to analyze all data across types and at large volumes
  • Deliver SQL access and analytics to all data 
  • Deliver the best query performance to match the business requirements

Roles of the Components

To access the webcast:

  • Register on this page
  • Scroll down and find the session labeled "Best practices for expanding the data warehouse with new big data streams"
  • Have fun
If you want the slides, go to slideshare.

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

