Thursday Apr 03, 2014

Updated: Price Comparison for Big Data Appliance and Hadoop


It was time to update this post a little. Big Data Appliance has grown, gained more features, and prices as well as insights have changed across the board. So, here is an update.

The post is still aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not included in the pricing and packaging of Oracle Big Data Appliance when compared to "I'm doing this myself - DIY style".

Oracle Big Data Appliance Details

A few of the most overlooked items in pricing out a Hadoop cluster are the cost of software, the cost of actual production-ready hardware and the required networking equipment. A Hadoop cluster needs more than just CPUs and disks... For Oracle Big Data Appliance we assume that you would want to run this system as a production system (with hot-pluggable components and redundant components in your system). We also assume you want the leading Hadoop distribution plus support for that software. You'd want to look at securing the cluster and possibly encrypting data at rest and over the network. Speaking of network, InfiniBand will eliminate network saturation issues - which is important for your Hadoop cluster.

With that in mind, Oracle Big Data Appliance is an engineered system built for production clusters.  It is pre-installed and pre-configured with Cloudera CDH and all (I emphasize all!) options included and we (with the help of Cloudera of course) have done the tuning of the system for you. On top of that, the price of the hardware (US$ 525,000 for a full rack system - more configs and smaller sizes => read more) includes the cost of Cloudera CDH, its options and Cloudera Manager (for the life of the machine - so not a subscription).

So, for US$ 525,000 you get the following:

  • Big Data Appliance Hardware (comes with Automatic Service Request upon component failures)
  • Cloudera CDH and Cloudera Manager
  • All Cloudera options as well as Accumulo and Spark (CDH 5.0)
  • Oracle Linux and the Oracle JDK
  • Oracle Distribution of R
  • Oracle NoSQL Database Community Edition
  • Oracle Big Data Appliance Enterprise Manager Plug-In

The support cost for the above is a single line item. The list price for Premier Support for Systems per the Oracle Price list (see source below) is US$ 63,000 per year.

To do a simple 3 year comparison with other systems, the following table shows the details and the totals for Oracle Big Data Appliance. Note that the only additional item is the install and configuration cost, which covers work done on-site by Oracle personnel or partners:


                                  Year 1      Year 2     Year 3     3 Year Total
BDA Cost                          $525,000
Annual Support Cost               $63,000     $63,000    $63,000
On-site Install (approximately)   $14,000
Total                             $602,000    $63,000    $63,000    $728,150

For this you will get a full rack BDA (18 Sun X4-2L servers, 288 cores (two Intel Xeon E5-2650 V2 CPUs per node) and 864TB of disk (twelve 4TB disks per node)), plus software, plus support, plus on-site setup and configuration. Or, in terms of cost per raw TB at purchase and at list pricing: $697.

HP DL380 Comparative System (this is changed from the original post to the more common DL380s)

To build a comparative hardware solution to the Big Data Appliance we picked an HP DL380p configuration and built up the servers using the HP.com website for pricing. The following is the price for a single server.

Model Number | Quantity | Total Price | Description
653200-B21 | 1 | $2,051 | ProLiant DL380p Gen8 Rackmount Factory Integrated 8 SFF CTO Model (2U) with no processor, 24 DIMM with no memory, open bay (diskless) with 8 SFF drive cage, Smart Array P420i controller with Zero Memory, 3 x PCIe 3.0 slots, 1 FlexibleLOM connector, no power supply, 4 x redundant fans, Integrated HP iLO Management Engine
715218-L21 | 2 | $3,118 | 2.6GHz Xeon E5-2650 v2 processor (1 chip, 8 cores) with 20MB L3 cache - Factory Integrated Only
684208-B21 | 1 | $25 | HP 1GbE 4-port 331FLR Adapter - Factory Integrated Only
503296-B21 | 1 | $229 | 460W Common Slot Gold Hot Plug Power Supply
AF041A | 0 | $0 | HP Rack 10000 G2 Series - 10842 (42U) 800mm Wide Cabinet - Pallet Universal Rack
731765-B21 | 8 | $1,600 | 8GB (1 x 8GB) Single Rank x8 PC3L-12800R (DDR3-1600) Registered CAS-11 Low Voltage Memory Kit
631667-B21 | 1 | $599 | HP Smart Array P222/512MB FBWC 6Gb 1-port Int/1-port Ext SAS controller
695510-B21 | 12 | $12,588 | 4TB 6Gb SAS 7.2K LFF hot-plug SmartDrive SC Midline disk drive (3.5") with 1-year warranty

Grand Total for a single server (list prices): $20,210

On top of this we need InfiniBand switches. Oracle Big Data Appliance comes with 3 IB switches, allowing us to expand the cluster without suddenly requiring extra switches. And we do expect these machines to be part of much larger clusters. The IB switches are somewhere in the neighborhood of US$ 6,000 per switch, so add $18,000 per rack, plus a management switch (BDA uses a Cisco switch) which seems to be around $15,000 list. The total switching comes to roughly $33,000.

We will also need a Cloudera Enterprise subscription - and to compare apples to apples, we will do it for all software. Some sources (see this document) peg CDH Core at $3,382 list per node and per year (24*7 support). Since BDA includes more software (all options) and that pricing is not public, I am going to make an educated estimate: double that price and round it up to the nearest nice, round number. That gets me to $7,000 per node, per year for 24*7 support.

BDA also comes with on-disk encryption, which is even harder to price out. My somewhat educated guess is around $1,500 list or so per node and per year. Oh, and let's not forget the Linux subscription, which lists at $1,299 per node per year. We also run a MySQL database (Enterprise Edition with replication), which has a list subscription price of $5,000 per node. We run it replicated over 2 nodes.

This all gets us to roughly $10,000 list price per node per year for all applicable software subscriptions and support, plus an additional $10,000 per year for the two MySQL nodes.
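To make that arithmetic explicit, here is a small back-of-the-envelope sketch (a bash snippet using the assumed list prices above; the rounding mirrors the text):

# Back-of-the-envelope software subscription math for the DIY cluster
# (all inputs are the assumed list prices discussed above).
NODES=18
CDH_ALL_OPTIONS=7000      # estimated CDH + all options, 24*7 support, per node/year
DISK_ENCRYPTION=1500      # educated guess, per node/year
LINUX=1299                # Oracle Linux subscription, per node/year
MYSQL_EE=5000             # MySQL Enterprise Edition, per node/year (2 replicated nodes)

PER_NODE=$((CDH_ALL_OPTIONS + DISK_ENCRYPTION + LINUX))   # 9,799 - call it ~10,000
PER_YEAR=$((PER_NODE * NODES + MYSQL_EE * 2))             # 186,382 - rounded up to the ~$190,000 used below
echo "per node/year: $PER_NODE   cluster-wide/year: $PER_YEAR"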

HP + Cloudera Do-it-Yourself System

Let's go build our own system. The specs mirror a BDA, so we will have 18 servers and all the other components included.


                                 Year 1      Year 2      Year 3      Total
Servers                          $363,780
Networking                       $33,000
SW Subscriptions and Support     $190,000    $190,000    $190,000
Installation and Configuration   $15,000
Total                            $601,780    $190,000    $190,000    $981,780

Some will argue that the installation and configuration is free (you already pay your data center team), but I would argue that something that takes Oracle a short amount of time costs you the equivalent if it takes your team a lot longer to get all this installed, optimized, and running. Nevertheless, here is some math on how to get to that cost anyway: approximately 150 hours of labor per rack for the pure install work, which adds up to US$ 15,000 if we assume a cost per hour of $100.

Note: those $15,000 do NOT include optimization and tuning of Hadoop, the OS, Java and other interesting things like networking settings across all these areas. You will now need to spend time figuring out the number of slots you allocate per node, the file system block size (do you use the Apache defaults, Cloudera's, or something else?) and many more things at the system level; the short sketch below shows the kind of settings you are now on the hook for. On top of that, BDA comes pre-configured with, for example, Kerberos and Apache Sentry, giving you a secure authentication and authorization method, as well as a one-click on-disk and network encryption setting. Of course you can contact various other companies to do this for you.
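As a tiny illustration of what that entails, the sketch below uses standard Hadoop tooling to inspect a couple of the settings a DIY cluster leaves up to you (the property names are just examples, not a complete tuning checklist):

# Standard Hadoop tooling; the properties below are examples only.
$ hdfs getconf -confKey dfs.blocksize       # file system block size: Apache default, Cloudera's, or your own?
$ hdfs getconf -confKey dfs.replication     # replication factor
$ hadoop version                            # which distribution and patch level are you actually running?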

You can also argue that "you want the cheapest hardware possible", because Hadoop is built to deal with failures, so it is OK for things to regularly fail. Yes, Hadoop does deal well with hardware failures, but your data center is probably much less keen on this idea, because someone is going to replace the disks (all the time). So make sure the disks are hot-swappable. And oh, that someone swapping the disks does cost money... The other consideration is failures in important components like power... redundant power in a rack is a good thing to have. All of this is included (and thought about) in Oracle Big Data Appliance.

In other words, do you really want to spend weeks installing, configuring and learning, or would you rather start building applications on top of the Hadoop cluster and thus provide value to your organization?

The Differences

The main differences between Oracle Big Data Appliance and a DIY approach are:

  1. A DIY system - at list price with basic installation but no optimization - is a staggering $220 cheaper as an initial purchase
  2. A DIY system - at list price with basic installation but no optimization - is more than $250,000 more expensive over 3 years.
    Note to purchasing: you can spend this on building or buying applications on your cluster (or buy some really intriguing Oracle software)
  3. The support for the DIY system includes five (5) vendors. Your hardware support vendor, the OS vendor, your Hadoop vendor, your encryption vendor as well as your database vendor. Oracle Big Data Appliance is supported end-to-end by a single vendor: Oracle
  4. Time to value. While we trust that your IT staff will get the DIY system up and running, the Oracle system allows for a much faster "loading dock to loading data" time. Typically a few days instead of a few weeks (or even months)
  5. Oracle Big Data Appliance is tuned and configured to take advantage of the software stack, the CPUs and InfiniBand network it runs on
  6. Any issue we, you or any other BDA customer finds in the system is fixed for all customers. You do not have a unique configuration, with unique issues on top of the generic issues.

Conclusion

In an apples-to-apples comparison of a production Hadoop cluster, Oracle Big Data Appliance starts off at roughly the same acquisition price and comes out ahead in terms of TCO over 3 years. It allows an organization to enter the Hadoop world with a production-grade system in a very short time, reducing both risk and time to market.

As always, when in doubt, simply contact your friendly Oracle representative for questions, support and detailed quotes.

Sources:

HP and related pricing: http://www.hp.com or http://www.ideasinternational.com/ (the latter is a paid service - sorry!)
Oracle Pricing: http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf
MySQL Pricing: http://www.oracle.com/us/corporate/pricing/price-lists/mysql-pricelist-183985.pdf

Wednesday Mar 26, 2014

Oracle Big Data Lite Virtual Machine - Version 2.5 Now Available

Oracle Big Data Appliance Version 2.5 was released last week.  Some great new features in this release - including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to the Cloudera Distribution of Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine.  Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured - allowing you to begin using the system right away. The following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition (12.1.0.1)
  • Cloudera’s Distribution including Apache Hadoop (CDH4.6)
  • Cloudera Manager 4.8.2
  • Cloudera Enterprise Technology, including:
    • Cloudera RTQ (Impala 1.2.3)
    • Cloudera RTS (Search 1.2)
  • Oracle Big Data Connectors 2.5
    • Oracle SQL Connector for HDFS 2.3.0
    • Oracle Loader for Hadoop 2.3.1
    • Oracle Data Integrator 11g
    • Oracle R Advanced Analytics for Hadoop 2.3.1
    • Oracle XQuery for Hadoop 2.4.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Monday Mar 24, 2014

Demonstration: Auditing Data Access Across the Enterprise

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption and auditing.

Auditing data access across the enterprise - including databases, operating systems and Hadoop - is critically important and oftentimes required for SOX, PCI and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c. 

Configuration

In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.

architecture

The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server generated Agent Activation Key to complete the activation.

agents

On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

  • HDFS: collects all file access activity
  • MapReduce:  identifies who ran what jobs on the cluster
  • Oozie:  audits who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges and more. Check out the auditor's guide for more details. In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user and any unsuccessful attempts at selecting from any table in any schema.
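For illustration, those three policies map roughly to classic Oracle audit statements along the following lines (a minimal sketch using traditional AUDIT syntax, run as a suitably privileged user; your own policy framework may differ):

# Minimal sketch of the demonstration's audit policies in traditional AUDIT syntax (illustrative only).
$ sqlplus / as sysdba <<'EOF'
AUDIT SELECT ON hr.employees BY ACCESS;          -- every select on the sensitive table
AUDIT ALL STATEMENTS BY hr BY ACCESS;            -- everything the HR user does
AUDIT SELECT TABLE WHENEVER NOT SUCCESSFUL;      -- failed selects against any table, in any schema
EOF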

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop:

setup_alert

In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. And, they are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to appropriate security administrators.

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

  • an oozie workflow that removes salary data from HDFS
  • numerous HDFS commands that upload files, change file access privileges, copy files and list the contents of directories and files
  • hive commands that query, create, alter and drop tables
  • Oracle Database commands that connect as different users, create and drop users, select from tables and delete records from a table

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see our alert has been triggered by the users' activity.

alert

Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS:

alert details

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enable you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports that are delivered out of the box with Audit Vault, you can create your own custom reports that meet your own personal needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

monitor events

As you can see, the report tells you all of the key elements required to understand: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the IP address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity - including DrEvil's failed attempts to open the sensitive salaries.txt file.

Notice, events may be related to one another. The Hive command "ALTER TABLE my_salarys RENAME TO my_salaries" will generate two events. The first event is sourced from the Metastore; the alter table command is captured and the metadata definition is updated. The Hive command also impacts HDFS; the table name is represented by an HDFS folder. Therefore, an HDFS event is logged that renames the "my_salarys" folder to "my_salaries".

Next, consider an Oozie workflow that performs a simple task: delete a file "salaries2.txt" in HDFS. This Oozie workflow generates the following events (a sketch of what such a workflow might look like follows the list):

oozie-workflow

  1. First, an Oozie workflow event is generated indicating the start of the workflow.
  2. The workflow definition is read from the "workflow.xml" file found in HDFS.
  3. An Oozie working directory is created
  4. The salaries2.txt file is deleted from HDFS
  5. Oozie runs its clean-up process
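For reference, a workflow that produces this trail could look roughly like the sketch below: a hypothetical workflow.xml with a single fs delete action, submitted with the standard Oozie client (the name-node variable, paths and Oozie URL are placeholders, not the demonstration's actual workflow):

$ cat workflow.xml
<workflow-app name="delete-salaries" xmlns="uri:oozie:workflow:0.4">
  <start to="cleanup"/>
  <action name="cleanup">
    <fs>
      <delete path="${nameNode}/user/hrdata/salaries2.txt"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>delete failed</message></kill>
  <end name="end"/>
</workflow-app>

$ oozie job -oozie http://oozieserver:11000/oozie -config job.properties -run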

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. Its flexible reporting allows you to sequence these independent events into a logical series of related activities.

The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But what other unsuccessful events have occurred recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours:

across-sources

Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS - but also in the Oracle Database.

Summary

Security and integration with the rest of the Oracle ecosystem are two table stakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Wednesday Mar 19, 2014

Announcing Encryption of Data-at-Rest on Big Data Appliance

With the release of Big Data Appliance software bundle 2.5, BDA completes the encryption story underneath Cloudera CDH. BDA already came with network encryption, ensuring that no network sniffing can be applied between the nodes; it now adds encryption of data-at-rest.

A Brief Overview

Encryption of data-at-rest can be done in 2 modes. One mode leverages the Trusted Platform Module (TPM) on the motherboard to provide a key to encrypt the data on disk. This mode does not require a password or pass phrase but relies on the motherboard. The second mode leverages a passphrase, which in turn is used to generate a private-public key pair with OpenSSL. The key pair is encrypted as well.
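For readers unfamiliar with the mechanism, the building blocks of the second mode look roughly like the generic OpenSSL sketch below; this is purely illustrative and not the BDA's actual tooling, which Mammoth drives for you:

# Generic OpenSSL illustration of a passphrase-protected key pair (NOT the BDA's actual procedure).
$ openssl genrsa -aes256 -passout pass:MySecretPhrase -out enc_key.pem 2048          # private key, encrypted with the passphrase
$ openssl rsa -in enc_key.pem -passin pass:MySecretPhrase -pubout -out pub_key.pem   # matching public key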

The passphrase encryption has a few more interesting aspects. For one, it does require the passphrase to be entered upon rebooting the system; leveraging the TPM option does not require any manual intervention at reboot. On Big Data Appliance it is possible to regularly change the passphrase without impacting the encryption or requiring re-encryption of the data.

Neither one of the encryption methods affects user access to user data. In other words, on an unprotected cluster a user that can read data before encryption will be able to read data after encryption. The goal is to ensure data is protected on physical media - think theft or incorrect disposal of a disk. Both forms protect from that, but only passphrase-based encryption protects from disposal or theft of an entire server.

On BDA, it is possible to switch between these two methods. This does have an impact on the running cluster, as data needs to be re-encrypted. For this step the cluster will be down; however, data is not duplicated, so there is no need to reserve double the space to do the re-encryption.

How to Encrypt Data

As with all installations or changes on Big Data Appliance, you will leverage Mammoth to do the install with encryption or to make changes to the system if you are already in production. Before you set up either of the two modes of data-at-rest encryption, you should consider your requirements. Changing the mode - as described - is possible, but will require the cluster to be down for re-encryption.

Full Set of Security Features

Out-of-the-box encryption is yet another feature that is specific to Oracle Big Data Appliance. On top of pre-configured Kerberos, Apache Sentry and Oracle Audit Vault, encryption now adds another security dimension. To read more about the full set of features, start here.

Thursday Mar 13, 2014

Video: Big Data Connectors and IDH (Strata)

The certification of Oracle Big Data Connectors on Intel Distribution for Hadoop is now complete (see our previous post). This video from Strata gives you a nice overview of IDH and BDC.

Friday Mar 07, 2014

Intel® Distribution for Apache Hadoop* certified with Oracle Big Data Connectors

Intel partnered with Oracle to certify compatibility between Intel® Distribution for Apache Hadoop* (IDH) and Oracle Big Data Connectors*.  Users can now connect IDH to Oracle Database with Oracle Big Data Connectors, taking advantage of the high performance feature-rich components of that product suite. Applications on IDH can leverage the connectors for fast load into Oracle Database, in-place query of data in HDFS with Oracle SQL, analytics in Hadoop with R, XQuery processing on Hadoop, and native Hadoop integration within Oracle Data Integrator.

Read the whole post here.

Monday Jan 27, 2014

Announcing: Oracle Big Data Lite Virtual Machine

You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.


BigDataLite Picture

Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox appliance that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It's been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB memory to run (this means that your computer should have at least 8GB total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.
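Once you have downloaded the image, importing it is a couple of commands with the standard VirtualBox CLI (the .ova file and VM names below are placeholders; use whatever the download page gives you):

# Import the downloaded appliance and size it to the "developer class" footprint described above
$ VBoxManage import BigDataLite.ova
$ VBoxManage modifyvm "BigDataLite" --cpus 2 --memory 5120
$ VBoxManage startvm "BigDataLite"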

How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am  to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:

OTN_VDD

There will be 45 minute sessions delivered by product experts (from both Oracle and Oracle Aces) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution". Some of the big data technical sessions include:

  • Oracle NoSQL Database Installation and Cluster Topology Deployment
  • Application Development & Schema Design with Oracle NoSQL Database
  • Processing Twitter Data with Hadoop
  • Use Data from a Hadoop Cluster with Oracle Database
  • Make the Right Offers to Customers Using Oracle Advanced Analytics
  • In-DB Map Reduce with SQL/Hadoop
  • Pattern Matching in SQL

Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!

Monday Jan 06, 2014

CaixaBank deploys new big data infrastructure on Oracle

CaixaBank is Spain’s largest domestic bank by market share, with a customer base of 13.7 million. It is also Spain’s leading bank in terms of innovation and technology, and one of the most prominent innovators worldwide. CaixaBank was recently awarded the title of the World’s Most Innovative Bank at the 2013 Global Banking Innovation Awards (November 2013).

Like most financial services companies, CaixaBank wants to get closer to its customers by collecting data about their activities across all the different channels (offices, internet, phone banking, ATMs, etc.). In the old days we used to call this CRM, and then this morphed into the "360-degree view" and so on. While many companies have delivered these types of projects, and customers feel much more connected and in control of their relationship with their bank, the capture of streams of big data has the potential to create another revolution in the way we interact with our bank. What banks like CaixaBank want to do is capture data in one part of the business and make it available to all the other lines of business as quickly as possible.

Big data is allowing businesses like CaixaBank to significantly enhance the business value of their existing customer data by integrating it with all sorts of other internal and external data sets. This is probably the most exciting part of big data, because the potential business benefits are really only constrained by the imagination of the team working on these types of projects. However, that in itself does create problems in terms of securing funding and ownership of projects, because the benefits can be difficult to estimate; this is where all the industry use cases, conference papers and blog posts can help by providing insight into what is going on across the market in broad, general terms.

To help them implement a strategic Big Data project, CaixaBank has selected Oracle for the deployment of its new Big Data infrastructure. This project, which includes an array of Oracle solutions, positions CaixaBank at the forefront of innovation in the banking industry. The new infrastructure will allow CaixaBank to maximize the business value from any kind of data and embark on new business innovation projects based on valuable information gathered from large data sets. Projects currently under review include:

  • Development of predictive models to improve customer value
  • Identifying cross-selling and up-selling  opportunities
  • Development of personalized offers to customers
  • Reinforcement of risk management and brand protection services
  • More powerful fraud analysis 
  • General streamlining of current processes to reduce time-to-market
  • Support new regulatory requirements

The Oracle solution (including Oracle Engineered Systems, Oracle Software and Oracle Consulting Services) consists of the implementation of a new Information Management Architecture that provides a unified corporate data model and new advanced analytic capabilities. (For more information about how Oracle's Reference Architecture can help you integrate structured, semi-structured and unstructured information into a single logical information resource that can be exploited for commercial gain, click here to view our whitepaper.)

The importance of the project is best explained by Juan Maria Nin, CEO of CaixaBank:

“Business innovation is the key for success in today’s highly competitive banking environment. The implementation of this Big Data solution will help CaixaBank remain at the forefront of innovation in the financial sector, delivering the best and most competitive services to our customers”.

The Oracle press release is here: https://emeapressoffice.oracle.com/Press-Releases/CaixaBank-Selects-Oracle-for-the-Deployment-of-its-New-Big-Data-Infrastructure-4183.aspx

Tuesday Dec 03, 2013

Big Data - Real and Practical Use Cases: Expand the Data Warehouse

As a follow-on to the previous post (here) on use cases, follow the link below to a recording that explains how to go about expanding the data warehouse into a big data platform.

The idea behind it all is to cover a best practice on adding to the existing data warehouse and expanding the system (shown in the figure below) to deal with:

  • Real-Time ingest and reporting on large volumes of data
  • A cost effective strategy to analyze all data across types and at large volumes
  • Deliver SQL access and analytics to all data 
  • Deliver the best query performance to match the business requirements

Roles of the Components

To access the webcast:

  • Register on this page
  • Scroll down and find the session labeled "Best practices for expanding the data warehouse with new big data streams"
  • Have fun
  • If you want the slides, go to slideshare.

Wednesday Nov 27, 2013

Big Data - Real and Practical Use Cases

The goal of this post is to explain in a few succinct patterns how organizations can start to work with big data and identify credible and doable big data projects. This goal is achieved by describing a set of general patterns that can be seen in the market today.

Big Data Usage Patterns

The following usage patterns are derived from actual customer projects across a large number of industries, crossing the boundary between commercial enterprises and the public sector. These patterns are also geographically applicable and technically feasible with today’s technologies.

This post will address the following four usage patterns:
  • Data Factory – a pattern that enables an organization to integrate and transform – in a batch method – large, diverse data sets before moving this data into an upstream system like an RDBMS or a NoSQL system. Data in the data factory is possibly transient and the focus is on data processing.
  • Data Warehouse Expansion with a Data Reservoir – a pattern that expands the data warehouse with a large-scale Hadoop system to capture data at lower grain and higher diversity, which is then fed into upstream systems. Data in the data reservoir is persistent and the focus is on data processing as well as data storage and the reuse of data.
  • Information Discovery with a Data Reservoir – a pattern that creates a data reservoir for discovery data marts or discovery systems like Oracle Endeca to tap into a wide range of data elements. The goal is to simplify data acquisition into discovery tools and to initiate discovery on raw data.
  • Closed Loop Recommendation and Analytics system – a pattern that is often considered the holy grail of data systems. This pattern combines both analytics on historical data, event processing or real time actions on current events and closes the loop between the two to continuously improve real time actions based on current and historical event correlation.

Pattern 1: Data Factory

The core business reason to build a Data Factory as it is presented here is to implement a cost savings strategy by placing long-running batch jobs on a cheaper system. The project is often funded by not spending money on the more expensive system – for example by switching Mainframe MIPS off – and instead leveraging those cost savings to fund the Data Factory. The first figure shows a simplified implementation of the Data Factory.
As the image below shows, the data factory must be scalable, flexible and (more) cost effective for processing the data. The typical system used to build a data factory is Apache Hadoop or in the case of Oracle’s Big Data Appliance – Cloudera’s Distribution including Apache Hadoop (CDH).

data factory

Hadoop (and therefore Big Data Appliance and CDH) offers an extremely scalable environment to process large data volumes (or a large number of small data sets) and jobs. Most typical is the offload of large batch updates, matching and de-duplication jobs etc. Hadoop also offers a very flexible model, where data is interpreted on read, rather than on write. This idea enables a data factory to quickly accommodate all types of data, which can then be processed in programs written in Hive, Pig or MapReduce.
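To make the "interpret on read" point concrete, here is a minimal sketch (hypothetical paths and columns) of a data factory laying a Hive schema over raw files that were simply copied into HDFS, with no transformation at load time:

# Land raw files as-is, then interpret them on read via a Hive external table (hypothetical names).
$ hadoop fs -mkdir -p /data/factory/raw_orders
$ hadoop fs -put orders_2013*.csv /data/factory/raw_orders/
$ hive -e "
CREATE EXTERNAL TABLE raw_orders (
  order_id  STRING,
  customer  STRING,
  amount    DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/factory/raw_orders';"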
As shown above, the data factory is an integration platform, much like an ETL tool. Data sets land in the data factory, batch jobs process the data and the processed data moves into the upstream systems. These upstream systems include RDBMSs, which are then used for various information needs. In the case of a Data Warehouse, this is very close to pattern 2 described below, with the difference that in the data factory data is often transient and removed after the processing is done.
This transient nature of data is not a required feature, but it is often implemented to keep the Hadoop cluster relatively small. The aim is generally to just transform data in a more cost effective manner.
In the case of an upstream NoSQL system, data is often prepared in a specific key-value format to be served up to end applications like a website. NoSQL databases work really well for that purpose, but the batch processing is better left to the Hadoop cluster.
It is very common for data to flow in the reverse order or for data from RDBMS or NoSQL databases to flow into the data factory. In most cases this is reference data, like customer master data. In order to process new customer data, this master data is required in the Data Factory.
Because of its low risk profile – the logic of these batch processes is well known and understood – and funding from savings in other systems, the Data Factory is typically an IT department’s first attempt at a big data project. The downside of a Data Factory project is that business users see very little benefit, in that they do not get new insights out of big data.

Pattern 2: Data Warehouse Expansion

The common way to drive new insights out of big data is pattern two. Expanding the data warehouse with a data reservoir enables an organization to capture raw data in a system that adds agility to the organization. The graphical pattern is shown below.


DW Expansion

A Data Reservoir – like the Data Factory from Pattern 1 – is based on Hadoop and Oracle Big Data Appliance, but rather than holding transient data, processing it and handing it off, a Data Reservoir aims to store data at a lower grain than previously stored, and for a period much longer than before.
The Data Reservoir is initially used to capture data, aggregate new metrics and augment (not replace) the data warehouse with new and expansive KPIs or context information. A very typical addition is the sentiment of a customer towards a product or brand which is added to a customer table in the data warehouse.
The addition of new KPIs or new context information is a continuous process. That is, new analytics on raw and correlated data should find their way into the upstream Data Warehouse on a very, very regular basis.
As the Data Reservoir grows and becomes known within the organization because of the new KPIs or context, users should start to look at the Data Reservoir as an environment to “experiment” and “play” with data. With some rudimentary programming skills, power users can start to combine various data elements in the Data Reservoir, using for example Hive. This enables the users to verify a hypothesis without the need to build a new data mart. Hadoop and the Data Reservoir now become an economically viable sandbox for power users, driving innovation, agility and possibly revenue from hitherto unused data.

Pattern 3: Information Discovery

Agility for power users and expert programmers is one thing, but eventually the goal is to enable business users to discover new and exciting things in the data. Pattern 3 combines the data reservoir with a special information discovery system to provide a Graphical User Interface specifically for data discovery. This GUI emulates in many ways how an end user today searches for information on the internet.
To empower a set of business users to truly discover information, they first and foremost require a Discovery tool. A project should therefore always start with that asset.


Once the Discovery tool (like Oracle Endeca) is in place, it pays to start leveraging the Data Reservoir to feed the Discovery tool. As shown above, the Data Reservoir is continuously fed with new data. The Discovery tool is a business user’s tool for creating ad-hoc data marts. Having the Data Reservoir simplifies data acquisition for end users because they only need to look in one place for data.
In essence, the Data Reservoir is now used to drive two different systems: the Data Warehouse and the Information Discovery environment. In practice, users will very quickly gravitate to the appropriate system. But no matter which system they use, they now have the ability to drive value from data into the organization.

Pattern 4: Closed Loop Recommendation and Analytics System

So far, most of what was discussed was analytics and batch based. But a lot of organizations want to come to some real time interaction model with their end customers (or in the world of the Internet of Things – with other machines and sensors).

Closed Loop System

Hadoop is very good at providing the Data Factory and the Data Reservoir, at providing a sandbox, at providing massive storage and processing capabilities, but it is less good at doing things in real time. Therefore, to build a closed loop recommendation system – which should react in real time – Hadoop is only one of the components.
Typically the bottom half of the last figure is akin to pattern 2 and is used to catch all data, analyze the correlations between recorded events (detected fraud for example) and generate a set of predictive models describing something like “if a, b and c during a transaction – mark as suspect and hand off to an agent”. This model would for example block a credit card transaction.
To make such a system work it is important to use the right technology at both levels. Real time technologies like Oracle NoSQL Database, Oracle Real Time Decisions and Oracle Event Processing work on the data stream in flight. Oracle Big Data Appliance, Oracle Exadata/Database and Oracle Advanced Analytics provide the infrastructure to create, refine and expose the models.

Summary

Today’s big data technologies offer a wide variety of capabilities. Leveraging these capabilities alongside the existing environment and skills already in place, according to the four patterns described, does enable an organization to benefit from big data today. It is a matter of identifying the applicable pattern for your organization and then starting on the implementation.

The technology is ready. Are you?

Thursday Nov 14, 2013

Data Scientist Boot camp (Skills and Training)

As almost everyone is interested in data science, take the boot camp to get ahead of the curve. Leverage this free Data Science Boot camp from Oracle Academy to learn some of the following things:

  • Introduction: Providing Data-Driven Answers to Business Questions
  • Lesson 1: Acquiring and Transforming Big Data
  • Lesson 2: Finding Value in Shopping Baskets
  • Lesson 3: Unsupervised Learning for Clustering
  • Lesson 4: Supervised Learning for Classification and Prediction
  • Lesson 5: Classical Statistics in a Big Data World
  • Lesson 6: Building and Exploring Graphs

You will also find the code samples that go with the training, so you can get off to a running start.


Tuesday Nov 12, 2013

Big Data Appliance X4-2 Release Announcement

Today we are announcing the release of the 3rd generation Big Data Appliance. Read the Press Release here.

Software Focus

The focus for this 3rd generation of Big Data Appliance is:

  • Comprehensive and Open - Big Data Appliance now includes all Cloudera Software, including Back-up and Disaster Recovery (BDR), Search, Impala, Navigator as well as the previously included components (like CDH, HBase and Cloudera Manager) and Oracle NoSQL Database (CE or EE).
  • Lower TCO than DIY Hadoop systems
  • Simplified Operations while providing an open platform for the organization
  • Comprehensive security including the new Audit Vault and Database Firewall software, Apache Sentry and Kerberos configured out-of-the-box

Hardware Update

A good place to start is to quickly review the hardware differences (no price changes!). On a per node basis, the following is a comparison between the old (X3-2) and new (X4-2) hardware:


             Big Data Appliance X3-2                      Big Data Appliance X4-2
CPU          2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz)    2 x 8-Core Intel® Xeon® E5-2650 V2 (2.6 GHz)
Memory       64GB                                         64GB
Disk         12 x 3TB High Capacity SAS                   12 x 4TB High Capacity SAS
InfiniBand   40Gb/sec                                     40Gb/sec
Ethernet     10Gb/sec                                     10Gb/sec

For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X4-2. The larger disks give BDA X4-2 33% more capacity than the previous generation while adding faster CPUs. Memory for BDA is expandable to 512 GB per node and can be expanded on a per-node basis, for example for NameNodes, HBase region servers or NoSQL Database nodes.

Software Details

More details in terms of software and the current versions (note that BDA follows a three-month update cycle for Cloudera and other software):


                   Big Data Appliance 2.2 Software Stack   Big Data Appliance 2.3 Software Stack
Linux              Oracle Linux 5.8 with UEK 1             Oracle Linux 6.4 with UEK 2
JDK                JDK 6                                   JDK 7
Cloudera CDH       CDH 4.3                                 CDH 4.4
Cloudera Manager   CM 4.6                                  CM 4.7

And, like we said at the beginning, it is important to understand that all other Cloudera components are now included in the price of Oracle Big Data Appliance. They are fully supported by Oracle and available to all BDA customers.

For more information:



Monday Nov 04, 2013

New Big Data Appliance Security Features

The Oracle Big Data Appliance (BDA) is an engineered system for big data processing.  It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing.  The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security.  Oracle Database customers have benefited from a rich set of security features:  encryption, redaction, data masking, database firewall, label based access control – and much, much more.  They want similar capabilities with their Hadoop cluster.   

Unfortunately, Hadoop wasn’t developed with security in mind.  By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database.  Some critical security features have been implemented – but even those capabilities are arduous to set up and configure.  Oracle believes that a key element of an optimized appliance is that its data should be secure.  Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing.

Security Starts at Authentication

A successful security strategy is predicated on strong authentication – for both users and software services.  Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”.  The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users.

Default Authentication in Hadoop

By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system.  Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user.  In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt.  When logged in as the hr user, you can see the following files.  Notice, we’re using the Hadoop command line utilities for accessing the data:

$ hadoop fs -ls /user/hrdata

Found 1 items
-rw-r--r--   1 oracle supergroup         70 2013-10-31 10:38 /user/hrdata/salaries.txt

$ hadoop fs -cat /user/hrdata/salaries.txt
Tom Brady,11000000
Tom Hanks,5000000
Bob Smith,250000
Oprah,300000000

User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”. 

$ hadoop fs -ls /user
Found 1 items
drwx------   - hr supergroup          0 2013-10-31 10:38 /user/hrdata

However, DrEvil cannot view the contents of the folder due to lack of access privileges:

$ hadoop fs -ls /user/hrdata
ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------

Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster:

$ sudo useradd hr
$ sudo passwd hr
$ su hr
$ hadoop fs -cat /user/hrdata/salaries.txt
Tom Brady,11000000
Tom Hanks,5000000
Bob Smith,250000
Oprah,300000000

Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS.

Big Data Appliance Provides Secure Authentication

The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this through Kerberos integration.


Figure 1: Kerberos Integration

The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc.

At installation, you simply point the BDA to an external KDC or have a highly available KDC automatically installed on the BDA itself. Kerberos will then provide strong authentication not just for the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system.
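In practice, the masquerading attempt shown earlier now fails at the first hurdle: without a valid Kerberos ticket, Hadoop commands are simply rejected. A quick sketch (the realm name is a placeholder):

# On a Kerberized cluster this fails with a "no valid credentials" (GSS/Kerberos) error
# until the user authenticates:
$ hadoop fs -ls /user/hrdata

# Authenticate as a real principal first:
$ kinit hr@EXAMPLE.COM
$ klist                        # shows the ticket granted by the KDC
$ hadoop fs -ls /user/hrdata   # now succeeds, subject to HDFS authorization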

It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment.

Authorize Access to Sensitive Data

Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server.
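For example, tightening up the hrdata folder from the earlier scenario is plain HDFS administration (the group name is illustrative):

# Restrict the sensitive folder to the hr user and an hr-only group
$ hadoop fs -chown -R hr:hrgroup /user/hrdata
$ hadoop fs -chmod -R 750 /user/hrdata      # hr: full access; hrgroup: read/list; everyone else: nothing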

For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration.

Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements:

  • Secure Authorization:  the ability to control access to data and/or privileges on data for authenticated users.
  • Fine-Grained Authorization:  the ability to give users access to a subset of the data (e.g. column) in a database
  • Role-Based Authorization:  the ability to create/apply template-based privileges based on functional roles.
With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system.
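As a rough sketch of what that looks like with the file-based policy provider of this Sentry generation (the group, role and object names are made up for illustration):

$ cat sentry-provider.ini
[groups]
# members of the analysts group get analyst_role
analysts = analyst_role
hr_admins = hr_admin_role

[roles]
# analyst_role may only select from hr.employees; hr_admin_role gets the whole hr database
analyst_role = server=server1->db=hr->table=employees->action=select
hr_admin_role = server=server1->db=hr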

Sentry Object Hierarchy

Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views.

Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future.

Audit Hadoop Cluster Activity

Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster:


Figure 3: Monitored Hadoop services.

At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful (an illustrative audit record is shown after the list below). The other auditing features include:

  • MapReduce:  correlate the MapReduce job that accessed the file
  • Oozie:  describes who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata
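For reference, an HDFS audit record has roughly the following shape (the field values are illustrative and the exact layout depends on the Hadoop version):

# Roughly the shape of an HDFS audit record (values illustrative; layout varies by version):
allowed=false  ugi=drevil  ip=/192.168.1.23  cmd=open  src=/user/hrdata/salaries.txt  dst=null  perm=null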

The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA.

Audit Vault Server

Figure 4: Consolidated audit data across the enterprise. 

Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger violations of audit policies.

Conclusion

Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

Tuesday Oct 29, 2013

Start your journey into Big Data with the Oracle Academy today!

 Big Data has the power to change the way we work, live, and think. The datafication of everything will create unprecedented demand for data scientists, software developers and engineers who can derive value from unstructured data to transform the world.

The Oracle Academy Big Data Resource Guide is a collection of articles, videos, and other resources organized to help you gain a deeper understanding of the exciting field of Big Data. To start your journey, visit the Oracle Academy website here: https://academy.oracle.com/oa-web-big-data.html. This landing pad will guide you through the whole area of big data using the following structure:

  1. What is “Big Data”
  2. Engineered Systems
  3. Integration
  4. Database and Data Analytics
  5. Advanced Information
  6. Supplemental Information

This is a great resource packed with must-see videos and must-read whitepapers and blog posts by industry leaders.

Enjoy


Tuesday Oct 15, 2013

Are you ready for Hadoop?

To find out, take the assessment here. Have fun!
About

The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
