Tuesday Apr 22, 2014

Announcing: Big Data Appliance 3.0 and Big Data Connectors 3.0

Today we are releasing Big Data Appliance 3.0 (which includes the just-released Oracle NoSQL Database 3.0) and Big Data Connectors 3.0. These releases deliver a large number of interesting and cool features and enhance the overall Oracle Big Data Management System that we think is going to be the core of information management going forward.


This post highlights a few of the new enhancements across the BDA, NoSQL DB and BDC stack.

Big Data Appliance 3.0:
  • Pre-configured and pre-installed CDH 5.0 with default support for YARN and MR2
  • Upgrade from BDA 2.5 (CDH 4.6) to BDA 3.0 (CDH 5.0)
  • Full encryption (at rest and over the network) from a single vendor in an appliance
  • Kerberos and Apache Sentry pre-configured
  • Partition Pruning through Oracle SQL Connector for Hadoop
  • Apache Spark (incl. Spark Streaming) support
  • More
Oracle NoSQL Database 3.0:
  • Table data model support layered on top of distributed key-value model
  • Support for Secondary Indexing
  • Support for "Data Centers" => Metro zones for DR and secondary zones for read-only workloads
  • Authentication and network encryption
  • More

You can read about all of these features by going to the links above and reading the OTN page, data sheets and other relevant information.

While BDA 3.0 delivers an upgrade path from BDA 2.5 right away, Oracle will also continue to support the current version, and we fully expect more BDA 2.x releases based on further CDH 4.x releases. As a customer you now have a choice of how to deploy the BDA and which version to run, while knowing you can upgrade to the latest and greatest in a safe manner.

Monday Mar 24, 2014

Demonstration: Auditing Data Access Across the Enterprise

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption and auditing.

Auditing data access across the enterprise - including databases, operating systems and Hadoop - is critically important and oftentimes required for SOX, PCI and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c. 

Configuration

In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.

architecture

The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server generated Agent Activation Key to complete the activation.

agents

On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

  • HDFS: collects all file access activity
  • MapReduce:  identifies who ran what jobs on the cluster
  • Oozie:  audits who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges and more. Check out the auditor's guide for more details. In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user and any unsuccessful attempts at selecting from any table in any schema.

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop:

setup_alert

In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. Triggered alerts are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to the appropriate security administrators.
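To make the alert condition concrete, here is a minimal Python sketch of the same rule: three failed accesses from one IP address under the HR directory. The record layout, field names and IP addresses are assumptions for illustration only; this is not the Audit Vault alert syntax.

```python
from collections import Counter

# Hypothetical audit records; the field names are illustrative assumptions,
# not the actual Audit Vault schema.
events = [
    {"user": "drevil", "ip": "10.0.0.7", "target": "/user/hrdata/salaries.txt", "success": False},
    {"user": "drevil", "ip": "10.0.0.7", "target": "/user/hrdata/salaries.txt", "success": False},
    {"user": "oracle", "ip": "10.0.0.5", "target": "/user/hrdata/salaries.txt", "success": True},
    {"user": "drevil", "ip": "10.0.0.7", "target": "/user/hrdata", "success": False},
    {"user": "drevil", "ip": "10.0.0.9", "target": "/tmp/scratch", "success": False},
]


def critical_ips(events, prefix="/user/hrdata", threshold=3):
    """Return IP addresses with at least `threshold` failed accesses under `prefix`."""
    failures = Counter(
        e["ip"]
        for e in events
        if not e["success"]
        and (e["target"] == prefix or e["target"].startswith(prefix + "/"))
    )
    return sorted(ip for ip, count in failures.items() if count >= threshold)


print(critical_ips(events))
```

Only failed events are counted, and only those under the sensitive directory, so the successful read by oracle and the failure in /tmp do not contribute to the alert.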

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

  • an oozie workflow that removes salary data from HDFS
  • numerous HDFS commands that upload files, change file access privileges, copy files and list the contents of directories and files
  • hive commands that query, create, alter and drop tables
  • Oracle Database commands that connect as different users, create and drop users, select from tables and delete records from a table

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see our alert has been triggered by the users' activity.

alert

Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS:

alert details

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enable you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports that are delivered out of the box with Audit Vault, you can create custom reports that meet your own needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

monitor events

As you can see, the report tells you all of the key elements required to understand: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the IP address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity - including DrEvil's failed attempts to open the sensitive salaries.txt file.

Notice, events may be related to one another. The Hive command "ALTER TABLE my_salarys RENAME TO my_salaries" will generate two events. The first event is sourced from the Metastore; the alter table command is captured and the metadata definition is updated. The Hive command also impacts HDFS; the table name is represented by an HDFS folder. Therefore, an HDFS event is logged that renames the "my_salarys" folder to "my_salaries".

Next, consider an Oozie workflow that performs a simple task: delete a file "salaries2.txt" in HDFS. This Oozie workflow generates the following events:

oozie-workflow

  1. First, an Oozie workflow event is generated indicating the start of the workflow.
  2. The workflow definition is read from the "workflow.xml" file found in HDFS.
  3. An Oozie working directory is created
  4. The salaries2.txt file is deleted from HDFS
  5. Oozie runs its clean-up process

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. Its flexible reporting allows you to sequence these independent events into a logical series of related activities.
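As a rough illustration of what "sequencing" independent events means, the Python sketch below groups audit records by workflow and orders them within each group. The "workflow" and "seq" fields are hypothetical; how Audit Vault actually correlates events is not described here.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical, unordered audit records for the workflow described above.
# "workflow" and "seq" are assumed field names used only for illustration.
events = [
    {"seq": 4, "workflow": "wf-1", "service": "HDFS", "action": "delete salaries2.txt"},
    {"seq": 1, "workflow": "wf-1", "service": "Oozie", "action": "start workflow"},
    {"seq": 3, "workflow": "wf-1", "service": "HDFS", "action": "create working directory"},
    {"seq": 2, "workflow": "wf-1", "service": "HDFS", "action": "read workflow.xml"},
    {"seq": 5, "workflow": "wf-1", "service": "Oozie", "action": "clean up"},
]


def sequence_by_workflow(events):
    """Group events per workflow and list their actions in sequence order."""
    ordered = sorted(events, key=itemgetter("workflow", "seq"))
    return {
        workflow: [e["action"] for e in group]
        for workflow, group in groupby(ordered, key=itemgetter("workflow"))
    }


for workflow, actions in sequence_by_workflow(events).items():
    print(workflow, actions)
```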

The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But what other unsuccessful events have occurred recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours:

across-sources

Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS - but also in the Oracle Database.
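The filter used in that report amounts to two conditions, "failed" and "within the last two hours", applied regardless of source. A minimal Python sketch, assuming hypothetical record fields and timestamps rather than the real Audit Vault schema:

```python
from datetime import datetime, timedelta

# Hypothetical consolidated audit records; field names and values are
# assumptions for illustration, not the Audit Vault schema.
NOW = datetime(2014, 3, 24, 12, 0)
events = [
    {"time": NOW - timedelta(minutes=30), "source": "HDFS", "user": "drevil", "success": False},
    {"time": NOW - timedelta(minutes=20), "source": "Oracle Database", "user": "drevil", "success": False},
    {"time": NOW - timedelta(minutes=10), "source": "Hive", "user": "oracle", "success": True},
    {"time": NOW - timedelta(hours=5), "source": "HDFS", "user": "drevil", "success": False},
]


def failed_within(events, now, window=timedelta(hours=2)):
    """Keep failed events, from any source, that fall inside the time window."""
    cutoff = now - window
    return [e for e in events if not e["success"] and e["time"] >= cutoff]


for e in failed_within(events, NOW):
    print(e["time"], e["source"], e["user"])
```

Because the source is not part of the filter, failures from Hadoop and from the database land in the same result set, which is the point of consolidating the audit data.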

Summary

Security and integration with the rest of the Oracle ecosystem are two table stakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Monday Nov 04, 2013

New Big Data Appliance Security Features

The Oracle Big Data Appliance (BDA) is an engineered system for big data processing.  It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing.  The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security.  Oracle Database customers have benefited from a rich set of security features:  encryption, redaction, data masking, database firewall, label based access control – and much, much more.  They want similar capabilities with their Hadoop cluster.   

Unfortunately, Hadoop wasn’t developed with security in mind.  By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database.  Some critical security features have been implemented – but even those capabilities are arduous to set up and configure.  Oracle believes that a key element of an optimized appliance is that its data should be secure.  Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing.

Security Starts at Authentication

A successful security strategy is predicated on strong authentication – for both users and software services.  Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”.  The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users.

Default Authentication in Hadoop

By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system.  Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user.  In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt.  When logged in as the hr user, you can see the following files.  Notice, we’re using the Hadoop command line utilities for accessing the data:

$ hadoop fs -ls /user/hrdata

Found 1 items
-rw-r--r--   1 oracle supergroup         70 2013-10-31 10:38 /user/hrdata/salaries.txt

$ hadoop fs -cat /user/hrdata/salaries.txt
Tom Brady,11000000
Tom Hanks,5000000
Bob Smith,250000
Oprah,300000000

User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”. 

$ hadoop fs -ls /user
Found 1 items
drwx------   - hr supergroup          0 2013-10-31 10:38 /user/hrdata

However, DrEvil cannot view the contents of the folder due to lack of access privileges:

$ hadoop fs -ls /user/hrdata
ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":hr:supergroup:drwx------

Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster:

$ sudo useradd hr
$ sudo passwd hr
$ su hr
$ hadoop fs -cat /user/hrdata/salaries.txt
Tom Brady,11000000
Tom Hanks,5000000
Bob Smith,250000
Oprah,300000000

Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS.
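The weakness can be reduced to a few lines. The Python sketch below is a toy model, not Hadoop code: the ownership table and function name are assumptions, and the only point is that the check compares against whatever name the client presents.

```python
# Toy model of authorization without strong authentication.
# This is not Hadoop code; the table and names are illustrative assumptions.
OWNER = {"/user/hrdata": "hr"}


def can_read(path, claimed_user):
    """Authorize purely on the name the client presents; nothing verifies it."""
    return OWNER.get(path) == claimed_user


print(can_read("/user/hrdata", "drevil"))  # DrEvil under his own name
print(can_read("/user/hrdata", "hr"))      # anyone who simply claims to be hr
```

The second call succeeds no matter who makes it, because the claimed identity is never verified.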

Big Data Appliance Provides Secure Authentication

The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this through Kerberos integration.


Figure 1: Kerberos Integration

The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc.

At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system.

It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment.

Authorize Access to Sensitive Data

Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server.
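The owner/group/other check behind those Linux-style permissions can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it ignores the HDFS superuser and any extended ACLs, and the function and parameter names are hypothetical.

```python
def allowed(mode, owner, group, user, user_groups, want):
    """Evaluate a 9-character mode string such as 'rwx------'.

    `want` is one of 'r', 'w' or 'x'. Exactly one permission class applies:
    owner if the user owns the object, else group, else other.
    """
    if user == owner:
        bits = mode[0:3]
    elif group in user_groups:
        bits = mode[3:6]
    else:
        bits = mode[6:9]
    return want in bits


# The /user/hrdata folder from the earlier listing: drwx------ hr supergroup
print(allowed("rwx------", "hr", "supergroup", "hr", {"hr"}, "r"))
print(allowed("rwx------", "hr", "supergroup", "drevil", {"users"}, "r"))
```

With mode rwx------ only the owner passes the check, which is why a trusted identity has to be established before this kind of authorization means anything.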

For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration.

Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements:

  • Secure Authorization:  the ability to control access to data and/or privileges on data for authenticated users.
  • Fine-Grained Authorization:  the ability to give users access to a subset of the data (e.g. column) in a database
  • Role-Based Authorization:  the ability to create/apply template-based privileges based on functional roles.

With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system.

Sentry Object Hierarchy

Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views.
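The inheritance and role rules can be modeled in a short Python sketch. The role, group, database and table names are hypothetical, and this is a model of the rule only, not Sentry's API or policy syntax.

```python
# Model of privilege inheritance and role assignment.
# All role, group, database and table names are hypothetical.
ROLES = {
    "analyst": [("sales", None, "select")],      # database-level grant
    "loader": [("sales", "orders", "insert")],   # table-level grant
}
GROUP_ROLES = {"bi_users": ["analyst"], "etl": ["loader"]}


def is_authorized(group, database, table, action):
    """True if any role assigned to the group covers the object and the action."""
    for role in GROUP_ROLES.get(group, []):
        for db, tbl, privilege in ROLES[role]:
            # A grant on the database (tbl is None) is inherited by its tables.
            covers_object = db == database and tbl in (None, table)
            # "all" covers both "select" and "insert".
            covers_action = privilege in ("all", action)
            if covers_object and covers_action:
                return True
    return False


print(is_authorized("bi_users", "sales", "orders", "select"))
print(is_authorized("etl", "sales", "customers", "insert"))
```

The analyst role holds a single database-level grant, yet it authorizes selects on every table in that database, while the loader's table-level grant does not extend to other tables.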

Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future.

Audit Hadoop Cluster Activity

Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster:


Figure 3: Monitored Hadoop services.

At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include:

  • MapReduce:  correlates the MapReduce job that accessed the file
  • Oozie:  describes who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata
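To show how the HDFS audit fields described above (user, time, type of access, success) might be pulled out of a raw log line, here is a small Python sketch. The sample line is made up, and its tab-separated key=value layout is an assumption based on the typical HDFS audit log format; check the logs on your own cluster before relying on it.

```python
# Illustrative line only; the layout is assumed, not taken from a real cluster.
SAMPLE = (
    "2013-10-31 10:41:07,123 INFO FSNamesystem.audit: "
    "allowed=false\tugi=drevil (auth:SIMPLE)\tip=/10.0.0.7\tcmd=open\t"
    "src=/user/hrdata/salaries.txt\tdst=null\tperm=null"
)


def parse_audit_line(line):
    """Split one audit line into the fields an auditor cares about."""
    header, _, payload = line.partition("FSNamesystem.audit: ")
    fields = dict(item.split("=", 1) for item in payload.split("\t"))
    return {
        "time": header[:23],                    # timestamp prefix
        "user": fields["ugi"].split(" ")[0],    # drop the "(auth:...)" suffix
        "operation": fields["cmd"],
        "path": fields["src"],
        "success": fields["allowed"] == "true",
    }


print(parse_audit_line(SAMPLE))
```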

The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA.

Audit Vault Server

Figure 4: Consolidated audit data across the enterprise. 

Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger on violations of audit policies.

Conclusion

Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

Monday Oct 15, 2012

The Oldest Big Data Problem: Parsing Human Language

There's a new whitepaper up on Oracle Technology Network which details the use of Digital Reasoning Systems' Synthesys software on Oracle Big Data Appliance.  Digital Reasoning's approach is inherently "big data friendly," as it leverages multiple components of the Hadoop ecosystem.  Moreover, the paper addresses the oldest big data problem of them all: extracting knowledge from human text.

  You can find the paper here.

  From the Executive Summary:

There is a wealth of information to be extracted from natural language, but that extraction is challenging. The volume of human language we generate constitutes a natural Big Data problem, while its complexity and nuance requires a particular expertise to model and mine. In this paper we illustrate the impressive combination of Oracle Big Data Appliance and Digital Reasoning Synthesys software. The combination of Synthesys and Big Data Appliance makes it possible to analyze tens of millions of documents in a matter of hours. Moreover, this powerful combination achieves four times greater throughput than conducting the equivalent analysis on a much larger cloud-deployed Hadoop cluster.


Friday Oct 14, 2011

A Closer Look at Oracle Big Data Appliance

Oracle Openworld just flew by… a lot of things happened in the big data space of course and you can read a lot of articles, blogs and other interesting materials all over.

What I thought I’d do here is to go through the big data appliance in a little more detail so everyone understands what the make-up of the machine is, what software we are putting on the machine and how it integrates with the Exadata machines.

Now, if you are bored reading, you can actually see and hear Todd and me discuss all this stuff using this link. This should be fun if you have never been to Openworld, as the interview is recorded at the OTN Lounge in the Howard street tent.

Oracle Big Data Appliance

The machine details are as follows:

  • 18 Nodes – Sun Servers
  • 2 CPUs per node, each with 6 cores (216 cores total)
  • 12 Disks per node (432 TB raw disk total)
  • Redundant InfiniBand Switches with 10GigE connectivity
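The totals in the list follow directly from the per-node figures. A quick Python check, where the per-disk capacity is derived from the stated raw total rather than quoted from a spec sheet:

```python
# Per-node figures and the raw total are taken from the list above.
nodes, cpus_per_node, cores_per_cpu, disks_per_node = 18, 2, 6, 12
raw_tb = 432

total_cores = nodes * cpus_per_node * cores_per_cpu
total_disks = nodes * disks_per_node
tb_per_disk = raw_tb / total_disks  # implied capacity per disk

print(total_cores, total_disks, tb_per_disk)
```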

To scale the machines, simply add a rack to the original full rack via InfiniBand. By leveraging InfiniBand we generally remove the network bottlenecks in the machine and between the machines. We chose InfiniBand over 10GigE connectivity because we believe a network capacity of 40Gb/sec is a valuable asset in a Hadoop cluster. We also think that using InfiniBand to connect the big data appliance to an Exadata machine will have a positive influence on the batch loads done into an Oracle system.


The software we are going to pre-install on the machine is:

  • Oracle Linux and Oracle Hotspot
  • Open-source distribution of Apache Hadoop
  • Oracle NoSQL Database Enterprise Edition (also available stand-alone)
  • Oracle Loader for Hadoop (also available stand-alone)
  • Open-source distribution of R (statistical package)
  • Oracle Data Integrator Application Adapter for Hadoop (also available stand-alone with ODI)

The goal of this software stack combined with the Sun hardware as an appliance is to create an enterprise class solution for Big Data that is:

  • Optimized and Complete - Everything you need to store and integrate your lower information density data
  • Integrated with Oracle Exadata - Analyze all your data
  • Easy to Deploy - Risk Free, Quick Installation and Setup
  • Single Vendor Support - Full Oracle support for the entire system and software set

As we get closer to the delivery date, you will see more detailed descriptions of the appliance, so stay tuned.

Thursday Sep 29, 2011

Added Session: Big Data Appliance

We added a new session to discuss Oracle Big Data Appliance. Here are the session details:

Oracle Big Data Appliance: Big Data for the Enterprise
Wednesday 10:15 AM
Marriott Marquis - Golden Gate C3 

Should be a fun session... see you all there!!

About

The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
