Friday Nov 20, 2015

Review of Data Warehousing and Big Data at #oow15


This year OpenWorld was bigger, more exciting and packed with sessions about the very latest technology and product features. Most importantly, both data warehousing and Big Data were at the heart of this year’s conference across a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key focus areas at this year’s conference were:
  • Database 12c for Data Warehousing
  • Big Data and the Internet of Things 
  • Cloud, from on-premise systems to the Cloud to hybrid Cloud deployments
  • Analytics and SQL, which continue to evolve to enable ever more sophisticated analysis
All these topics appeared across the main keynote sessions, including live on-stage demonstrations of how many of our new features can be used to increase the performance and analytical capability of your data warehouse and big data management system - check out the on-demand videos for the keynotes and executive interviews....
[Read More]

Thursday Sep 03, 2015

Oracle Big Data Lite 4.2.1 - Includes Big Data Discovery

We just released Oracle Big Data Lite 4.2.1 VM.  This VM provides many of the key big data technologies that are part of Oracle's big data platform.  Along with all the great features of the previous version, Big Data Lite now adds Oracle Big Data Discovery 1.1:

The list of big data capabilities provided by the virtual machine continues to grow.  Here's a list of all the products that are pre-configured:

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Discovery 1.1
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0 
Take it for a spin - and check out the tutorials and demos that are available from the Big Data Lite download page.

Thursday Jul 02, 2015

Using Oracle Big Data Spatial and Graph

Wondering how to get started with graph analyses?  The latest Oracle Big Data Lite VM includes Oracle's new spatial and graph toolkit for big data.  Check out these two blog posts that describe how to find interesting relationships in data:

 Pretty cool :)


Saturday Jun 20, 2015

Oracle Big Data Spatial and Graph - Installing the Image Processing Framework

Oracle Big Data Lite 4.2 was just released - and one of the cool new features is Oracle Big Data Spatial and Graph.  In order to use this new feature, there is one more configuration step required.  Normally, we include everything you need in the VM - but this is a component that we couldn't distribute.

For the Big Data Spatial Image Processing Framework, you will need to install and configure Proj.4 - Cartographic Projections Library.  Simply follow these steps: 

  • Start the Big Data Lite VM and log in as user "oracle"
  • Launch Firefox and download the Proj.4 source tarball (proj-4.9.1.tar.gz) to ~/Downloads
  • Run the following commands at the linux prompt:
    • cd ~/Downloads
    • tar -xvf proj-4.9.1.tar.gz
    • cd proj-4.9.1
    • ./configure
    • make
    • sudo make install

This will create the shared library in /usr/local/lib/.  Now that the library has been built, create links to it in the appropriate directories.  At the linux prompt:

  • sudo ln -s /usr/local/lib/ /u02/oracle-spatial-graph/shareddir/spatial/demo/imageserver/native/
  • sudo ln -s /usr/local/lib/ /usr/lib/hadoop/lib/native/

That's all there is to it.  Big Data Lite is now ready for Oracle Big Data Spatial and Graph!

Oracle Big Data Lite 4.2 Now Available!

Oracle Big Data Lite Virtual Machine 4.2 is now available on OTN.  For those of you that are new to the VM - it is a great way to get started with Oracle's big data platform.  It has a ton of products installed and configured - including: 

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0

Check out our new product - Oracle Big Data Spatial and Graph (and don't forget to read the blog post on a small config update you'll need to make to use it).  It's a great way to find relationships in data and query and visualize geographic data.  Speaking of analysis... Oracle R Advanced Analytics for Hadoop now leverages Spark for many of its algorithms for (way) faster processing.

 But, that's just a couple of features... download the VM and check it out for yourself :). 

Wednesday Mar 11, 2015

Oracle Big Data Lite 4.1 VM is available on OTN

Oracle Big Data Lite 4.1 VM is now available for download on OTN.  Big Data Lite includes many of the key capabilities of Oracle's big data platform.  Each of the components has been configured to work together - and there are many hands-on labs and demonstrations to help you get started using the system.  Below is a listing of what's included:

  • Oracle Enterprise Linux 6.5
  • Oracle Database 12c Release 1 Enterprise Edition - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.3.0)
  • Cloudera Manager (5.3.0)
  • Oracle Big Data Connectors 4.1
    • Oracle SQL Connector for HDFS 3.2.0
    • Oracle Loader for Hadoop 3.3.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.1.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.2.5)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.3.0
  • Oracle CopyToBDA 1.1 



Tuesday Mar 03, 2015

Are you leveraging Oracle's database innovations for Cloud and Big data?

If you are interested in big data, Hadoop, SQL and data warehousing then mark your calendars: on March 18th at 10:00AM PST/1:00PM EST, you will be able to hear Tom Kyte (Oracle Database Architect) talk about how you can use Oracle Big Data SQL to seamlessly integrate all your Hadoop datasets with the relational schemas stored in Oracle Database 12c. As part of this discussion Tom will outline how you can build the perfect foundation for your enterprise big data management system using Oracle's innovative technology.

If you are working on a data warehousing project and/or a big data project then this is one webcast you will not want to miss, so register today to hear the latest about Oracle Database innovations and best practices. The full list of speakers is:

Tom Kyte
Oracle Database Architect
Keith Wilcox
VP, Database Administration
Bill Callahan
Director, Principal Engineer,
CCC Information Services, Inc.

Why SQL is the natural language for data analysis

Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of data in “big data environments” - data stored and processed outside of classical relational systems - continues to be a significant challenge for the majority of companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it.

The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for any kind of data analysis, including big data, and the benefits that this brings for application developers, DBAs and business users.

[Read More]

Friday Sep 26, 2014

Oracle Big Data Lite 4.0 Virtual Machine Now Available

Big Data Lite 4.0 is now available for download from OTN.  There are lots of new capabilities in this latest version:
  • Oracle Database 12c, including new JSON support and Oracle Big Data SQL-enabled external tables.  Check out this hands-on lab to learn how to securely analyze all your data - across both Hadoop and Oracle Database 12c - using Big Data SQL.  (A short, hypothetical JSON example follows this list.)
  • New versions of SQL Developer and Data Modeler that support Hive access and automatic generation of Big Data SQL external tables
  • GoldenGate and the latest ODI versions are now included - with some great new hands-on labs.
  • Cloudera Manager is back - you can now optionally use CM to manage your Hadoop environment (requires 10GB memory devoted to the VM).  If you don't want to use CM, you can use the manual CDH configuration with the Big Data Lite services application
  • New versions of the entire stack... Big Data Connectors, NoSQL Database, CDH, JDeveloper and more.
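
To make the new JSON support concrete, here is a minimal, hypothetical sketch (the table and column names are made up and are not taken from the hands-on lab): Oracle Database 12c can enforce that a column contains well-formed JSON and then query into the documents with functions such as JSON_VALUE.

    CREATE TABLE movie_events (
      event_doc VARCHAR2(4000)
        CONSTRAINT event_doc_is_json CHECK (event_doc IS JSON)
    );

    INSERT INTO movie_events VALUES ('{"custId":1001,"movieId":42,"rating":5}');

    -- Pull a scalar value out of each JSON document
    SELECT JSON_VALUE(event_doc, '$.rating') AS rating
    FROM   movie_events;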

Here's the inventory of all the features and versions:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition - including Oracle Big Data SQL-enabled external tables, Oracle Advanced Analytics, OLAP, Spatial and more
  • Cloudera Distribution including Apache Hadoop (CDH5.1.2)
  • Cloudera Manager (5.1.2)
  • Oracle Big Data Connectors 4.0
    • Oracle SQL Connector for HDFS 3.1.0
    • Oracle Loader for Hadoop 3.2.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.0.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.14)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1

Monday Sep 15, 2014

Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL

Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL.  The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore.  

Let's take a look at what it takes to get started and then preview this new capability.

Setting up Connections to Hive

The first thing you need to do is set up a JDBC connection to Hive.  Follow these steps to set up the connection:

Download and Unzip JDBC Drivers

Cloudera provides high performance JDBC drivers that are required for connectivity:

  • Download the Hive Drivers from the Cloudera Downloads page to a local directory
  • Unzip the archive
    • unzip <downloaded_hive_jdbc_archive>.zip
  • Three zip files are contained within the archive.  Unzip the JDBC4 archive to a target directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below): 
    • unzip <JDBC4_archive>.zip -d /home/oracle/jdbc/
    • Note: you will get an error when attempting to open a Hive connection in SQL Developer if you use a different JDBC version. Ensure you use JDBC4 and not JDBC41.

Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.

Update SQL Developer to use the Cloudera Hive JDBC Drivers

Update the preferences in SQL Developer to leverage the new drivers:

  • Start SQL Developer
  • Go to Tools -> Preferences
  • Navigate to Database -> Third Party JDBC Drivers
  • Add all of the jar files contained in the zip to the Third-party JDBC Driver Path.  It should look like the picture below:
    [Image: SQL Developer Third Party JDBC Driver preferences]

  • Restart SQL Developer

Create a Connection

Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2.  Click the New Connection button in the SQL Developer toolbar.  You'll need to have an ID, password and the port where Hiveserver2 is running:

[Image: creating a connection to Hiveserver2]

The example above is creating a connection called hive which connects to Hiveserver2 on localhost running on port 10000.  The Database field is optional; here we are specifying the default database.

Using the Hive Connection

The Hive connection is now treated like any other connection in SQL Developer.  The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL:

[Image: SQL Developer - viewing data in Hive]

And, you can use the SQL Worksheet to run custom queries, perform DDL operations - whatever is supported in Hive:


Here, we've altered the definition of a Hive table and then queried that table in the worksheet.
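
As a rough sketch of that workflow (the table and column names below are purely illustrative, not the ones used in the demo), the worksheet accepts standard HiveQL statements such as:

    -- HiveQL run against the Hive connection in the SQL Worksheet
    ALTER TABLE web_logs ADD COLUMNS (referrer STRING);

    SELECT host, request_path, status_code
    FROM   web_logs
    LIMIT  10;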

Create Big Data SQL-enabled Tables Using Oracle Data Modeler

Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables.  Let's create a few tables using the metadata from the Hive Metastore.  Invoke the import wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item.  You will see the same connections found in the SQL Developer connection navigator:

[Image: picking a connection]

After selecting the hive connection and a database, select the tables to import:

[Image: picking tables to import]

There could be any number of tables here - in our case we will select three tables to import.  After completing the import, the logical table definitions appear in our palette:

[Image: imported table definitions]

You can update the logical table definitions - and in our case we will want to do so.  For example, the recommended column in Hive is defined as a string (i.e. there is no precision) - which the Data Modeler casts as a varchar2(4000).  We have domain knowledge and understand that this field is really much smaller - so we'll update it to the appropriate size:

[Image: updating the column properties]

Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c.  Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet:

[Image: DDL preview]
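
For reference, the generated DDL for a Big Data SQL-enabled external table looks broadly like the sketch below. This is a hypothetical example, not the exact output of DDL Preview: the column list, the "bigdatalite" cluster name, the Hive table identifier and the default_dir directory object are all assumptions. The ORACLE_HIVE access driver tells the database to pick up the storage details (location, InputFormat, SerDe) from the Hive metastore.

    CREATE TABLE movieapp_log_json (
      custid       NUMBER,
      movieid      NUMBER,
      activity     NUMBER,
      rating       NUMBER,
      recommended  VARCHAR2(4)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE                  -- storage details come from the Hive metastore
      DEFAULT DIRECTORY default_dir     -- assumes an existing directory object
      ACCESS PARAMETERS ( bigdatalite moviework.movieapp_log_json
      )
    )
    REJECT LIMIT UNLIMITED;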

Edit the Table Definitions

The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing.  For example, edit table movieapp_log_json:

[Image: editing the external table properties]

You can update the source cluster for the data, how invalid records should be processed, how to map hive table columns to the corresponding Oracle table columns (if they don't match), and much more.

Query All Your Data

You now have full Oracle SQL access to data across the platform.  In our example, we can combine data from Hadoop with data in our Oracle Database.  The data in Hadoop can be in any format - Avro, JSON, XML, CSV - as long as there is a SerDe that can parse the data, Big Data SQL can access it!  Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies:

[Image: comparing ratings for blockbuster movies]
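
A query along these lines gives the idea; the Oracle-side table and column names here are illustrative rather than the exact ones in the demo schema:

    SELECT m.title,
           AVG(l.rating) AS avg_rating
    FROM   movieapp_log_json l      -- Big Data SQL external table over the JSON log in HDFS
    JOIN   movies m                 -- ordinary Oracle Database table
      ON   m.movie_id = l.movieid
    WHERE  m.is_blockbuster = 'Y'   -- hypothetical flag column
    GROUP  BY m.title
    ORDER  BY avg_rating DESC;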

Looks like they don't think too highly of them! Of course - the ratings data is fictitious ;)

Friday Sep 12, 2014

Announcement: Big Data SQL is Generally Available

Oracle Big Data SQL and Big Data Appliance 4.0 are generally available

Big Data Appliance 4.0 receives the following upgrades:


  • Big Data SQL: Join data in Hadoop with Oracle Database using Oracle SQL 
    • New external table types for handling data stored in Hadoop 
    • Smart Scan for Hadoop to provide fast query performance 
    • Requires additional license for Big Data SQL 
    • Requires Exadata Database Machine running DB 
  • Automated recovery from server failure 
    • This includes migration of master roles to a different server and re-provisioning of a slave node on a server that has been replaced 
  • NoSQL DB multiple-zone configurations on BDA 
    • When adding nodes to a NoSQL DB BDA cluster, they can be added to an existing zone or to a new zone. 
  •  Update to using the 12c ODI Agent on BDA clusters 

For more information on Big Data SQL, check out:



[Read More]

Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data


Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider, SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: having language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.
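
A small, illustrative query makes the point (the tables here are hypothetical): it states only the shape of the answer and says nothing about access paths, join algorithms or partial aggregation.

    SELECT c.state,
           COUNT(*) AS order_count
    FROM   orders o
    JOIN   customers c ON c.cust_id = o.cust_id
    WHERE  o.order_date >= DATE '2014-01-01'
    GROUP  BY c.state;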

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data than ever before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important, if not more so.  As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but be able to truly apply schema-on-read principles to data stored in Hadoop.

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data that is intended to be consumed through dynamically invoked readers.  That both causes poor parallelism and removes the value of schema-on-read.

The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only join it with internal data to provide SQL over both Hadoop and the relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement that impedes performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 TB set of JSON data stored in HDFS, but only cared about a few fields - email and status - and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field of every document, we're able to cut down 100s of TB to 100s of GB.
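
Expressed in SQL, the query might look like the hypothetical example below (the external table and column names are made up). Smart Scan for Hadoop evaluates the predicate and the column projection next to the DataNodes, so only the matching email and status values travel back to the database.

    SELECT email,
           status
    FROM   customer_json     -- Big Data SQL external table over the JSON data in HDFS
    WHERE  state = 'TX';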

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.


So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving what makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.

[Read More]

Wednesday Mar 26, 2014

Oracle Big Data Lite Virtual Machine - Version 2.5 Now Available

Oracle Big Data Appliance Version 2.5 was released last week.  There are some great new features in this release - including a continued security focus (on-disk encryption and automated configuration of Sentry for data authorization) and updates to the Cloudera Distribution of Apache Hadoop and Cloudera Manager.

With each BDA release, we have a new release of Oracle Big Data Lite Virtual Machine.  Oracle Big Data Lite provides an integrated environment to help you get started with the Oracle Big Data platform. Many Oracle Big Data platform components have been installed and configured - allowing you to begin using the system right away. The following components are included on Oracle Big Data Lite Virtual Machine v 2.5:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition
  • Cloudera’s Distribution including Apache Hadoop (CDH4.6)
  • Cloudera Manager 4.8.2
  • Cloudera Enterprise Technology, including:
    • Cloudera RTQ (Impala 1.2.3)
    • Cloudera RTS (Search 1.2)
  • Oracle Big Data Connectors 2.5
    • Oracle SQL Connector for HDFS 2.3.0
    • Oracle Loader for Hadoop 2.3.1
    • Oracle Data Integrator 11g
    • Oracle R Advanced Analytics for Hadoop 2.3.1
    • Oracle XQuery for Hadoop 2.4.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (2.1.54)
  • Oracle JDeveloper 11g
  • Oracle SQL Developer 4.0
  • Oracle Data Integrator 12cR1
  • Oracle R Distribution 3.0.1

Go to the Oracle Big Data Lite Virtual Machine landing page on OTN to download the latest release.

Monday Mar 24, 2014

Demonstration: Auditing Data Access Across the Enterprise

Security has been an important theme across recent Big Data Appliance releases. Our most recent release includes encryption of data at rest and automatic configuration of Sentry for data authorization. This is in addition to the security features previously added to the BDA, including Kerberos-based authentication, network encryption and auditing.

Auditing data access across the enterprise - including databases, operating systems and Hadoop - is critically important and oftentimes required for SOX, PCI and other regulations. Let's take a look at a demonstration of how Oracle Audit Vault and Database Firewall delivers comprehensive audit collection, alerting and reporting of activity on an Oracle Big Data Appliance and Oracle Database 12c. 


In this scenario, we've set up auditing for both the BDA and Oracle Database 12c.


The Audit Vault Server is deployed to its own secure server and serves as mission control for auditing. It is used to administer audit policies, configure activities that are tracked on the secured targets and provide robust audit reporting and alerting. In many ways, Audit Vault is a specialized auditing data warehouse. It automates ETL from a variety of sources into an audit schema and then delivers both pre-built and ad hoc reporting capabilities.

For our demonstration, Audit Vault agents are deployed to the BDA and Oracle Database 12c monitored targets; these agents are responsible for managing collectors that gather activity data. This is a secure agent deployment; the Audit Vault Server has a trusted relationship with each agent. To set up the trusted relationship, the agent makes an activation request to the Audit Vault Server; this request is then activated (or "approved") by the AV Administrator. The monitored target then applies an AV Server generated Agent Activation Key to complete the activation.


On the BDA, these installation and configuration steps have all been automated for you. Using the BDA's Configuration Generation Utility, you simply specify that you would like to audit activity in Hadoop. Then, you identify the Audit Vault Server that will receive the audit data. Mammoth - the BDA's installation tool - uses this information to configure the audit processing. Specifically, it sets up audit trails across the following services:

  • HDFS: collects all file access activity
  • MapReduce:  identifies who ran what jobs on the cluster
  • Oozie:  audits who ran what as part of a workflow
  • Hive:  captures changes that were made to the Hive metadata

There is much more flexibility when monitoring the Oracle Database. You can create audit policies for SQL statements, schema objects, privileges and more. Check out the auditor's guide for more details. In our demonstration, we kept it simple: we are capturing all select statements on the sensitive HR.EMPLOYEES table, all statements made by the HR user and any unsuccessful attempts at selecting from any table in any schema.
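
As a rough sketch, equivalent Oracle Database 12c unified-audit definitions might look like the statements below. In the demonstration these policies are managed through Audit Vault, and the policy names here are made up.

    -- Capture all SELECT statements against the sensitive table
    CREATE AUDIT POLICY hr_emp_select ACTIONS SELECT ON hr.employees;
    AUDIT POLICY hr_emp_select;

    -- Capture every statement issued by the HR user
    CREATE AUDIT POLICY hr_all_actions ACTIONS ALL;
    AUDIT POLICY hr_all_actions BY hr;

    -- Capture unsuccessful SELECT attempts by any user
    CREATE AUDIT POLICY failed_selects ACTIONS SELECT;
    AUDIT POLICY failed_selects WHENEVER NOT SUCCESSFUL;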

Now that we are capturing activity across the BDA and Oracle Database 12c, we'll set up an alert to fire whenever there is suspicious activity attempted over sensitive HR data in Hadoop:


In the alert definition found above, a critical alert is defined as three unsuccessful attempts from a given IP address to access data in the HR directory. Alert definitions are extremely flexible - using any audited field as input into a conditional expression. And, they are automatically delivered to the Audit Vault Server's monitoring dashboard - as well as via email to appropriate security administrators.

Now that auditing is configured, we'll generate activity by two different users: oracle and DrEvil. We'll then see how the audit data is consolidated in the Audit Vault Server and how auditors can interrogate that data.

Capturing Activity

The demonstration is driven by a few scripts that generate different types of activity by both the oracle and DrEvil users. These activities include:

  • an oozie workflow that removes salary data from HDFS
  • numerous HDFS commands that upload files, change file access privileges, copy files and list the contents of directories and files
  • hive commands that query, create, alter and drop tables
  • Oracle Database commands that connect as different users, create and drop users, select from tables and delete records from a table

After running the scripts, we log into the Audit Vault Server as an auditor. Immediately, we see our alert has been triggered by the users' activity.


Drilling down on the alert reveals DrEvil's three failed attempts to access the sensitive data in HDFS:

[Image: alert details]

Now that we see the alert triggered in the dashboard, let's see what other activity is taking place on the BDA and in the Oracle Database.

Ad Hoc Reporting

Audit Vault Server delivers rich reporting capabilities that enables you to better understand the activity that has taken place across the enterprise. In addition to the numerous reports that are delivered out of box with Audit Vault, you can create your own custom reports that meet your own personal needs. Here, we are looking at a BDA monitoring report that focuses on Hadoop activities that occurred in the last 24 hours:

[Image: BDA monitoring report]

As you can see, the report tells you all of the key elements required to understand the activity: 1) when the activity took place, 2) the source service for the event, 3) what object was referenced, 4) whether or not the event was successful, 5) who executed the event, 6) the IP address (or host) that initiated the event, and 7) how the object was modified or accessed. Stoplight reporting is used to highlight critical activity - including DrEvil's failed attempts to open the sensitive salaries.txt file.

Notice, events may be related to one another. The Hive command "ALTER TABLE my_salarys RENAME TO my_salaries" will generate two events. The first event is sourced from the Metastore; the alter table command is captured and the metadata definition is updated. The Hive command also impacts HDFS; the table name is represented by an HDFS folder. Therefore, an HDFS event is logged that renames the "my_salarys" folder to "my_salaries".

Next, consider an Oozie workflow that performs a simple task: delete a file "salaries2.txt" in HDFS. This Oozie workflow generates the following events:


  1. First, an Oozie workflow event is generated indicating the start of the workflow.
  2. The workflow definition is read from the "workflow.xml" file found in HDFS.
  3. An Oozie working directory is created
  4. The salaries2.txt file is deleted from HDFS
  5. Oozie runs its clean-up process

The Audit Vault reports are able to reveal all of the underlying activity that is executed by the Oozie workflow. Its flexible reporting allows you to sequence these independent events into a logical series of related activities.

The reporting focus so far has been on Hadoop - but one of the core strengths of Oracle Audit Vault is its ability to consolidate all audit data. We know that DrEvil had a few unsuccessful attempts to access sensitive salary data in HDFS. But what other unsuccessful events have occurred recently across our data platform? We'll use Audit Vault's ad hoc reporting capabilities to answer that question. Report filters enable users to search audit data based on a range of conditions. Here, we'll keep it pretty simple; let's find all failed access attempts across both the BDA and the Oracle Database within the last two hours:


Again, DrEvil's activity stands out. As you can see, DrEvil is attempting to access sensitive salary data not only in HDFS - but also in the Oracle Database.


Security and integration with the rest of the Oracle ecosystem are two table stakes that are critical to Oracle Big Data Appliance releases. Oracle Audit Vault and Database Firewall's auditing of data across the BDA, databases and operating systems epitomizes this goal - providing a single repository and reporting environment for all your audit data.

Monday Jan 27, 2014

Announcing: Oracle Big Data Lite Virtual Machine

You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.

[Image: Oracle Big Data Lite]

Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox appliance that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It's been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB memory to run (this means that your computer should have at least 8GB total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.

How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am  to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:


There will be 45-minute sessions delivered by product experts (from both Oracle and Oracle ACEs) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution". Some of the big data technical sessions include:

  • Oracle NoSQL Database Installation and Cluster Topology Deployment
  • Application Development & Schema Design with Oracle NoSQL Database
  • Processing Twitter Data with Hadoop
  • Use Data from a Hadoop Cluster with Oracle Database
  • Make the Right Offers to Customers Using Oracle Advanced Analytics
  • In-DB Map Reduce with SQL/Hadoop
  • Pattern Matching in SQL

Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!


The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

