Thursday Feb 16, 2017

Big Data Lite 4.7.0 is now available on OTN!

The latest release of Big Data Lite is now available on OTN!  This 4.7 release contains key components of Oracle's big data platform.  It has demos, tutorials and more.  Listed below are the products/features that are installed:

  • Oracle Enterprise Linux 6.8
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.9.0)
  • Cloudera Manager (5.9.0)
  • Oracle Big Data Spatial and Graph 2.1
  • Oracle Big Data Discovery 1.4.0
  • Oracle Big Data Connectors 4.7
    • Oracle SQL Connector for HDFS 3.7.0
    • Oracle Loader for Hadoop 3.8.0
    • Oracle Data Integrator 12c (
    • Oracle R Advanced Analytics for Hadoop 2.7
    • Oracle XQuery for Hadoop 4.5.1
    • Oracle Data Source for Hadoop 1.2
  • Oracle NoSQL Database Enterprise Edition 12cR1 (4.2.14)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1.5 with Oracle REST Data Services 3.0.7
  • Oracle Data Integrator 12cR1 (
  • Oracle GoldenGate 12c (
  • Oracle R Distribution 3.2.0
  • Oracle Perfect Balance 2.9.0


Tuesday Jan 17, 2017

Oracle Big Data SQL: Simplifying Information Lifecycle Management


For many years, Oracle Database has provided rich support for Information Lifecycle Management (ILM).  Numerous capabilities are available for data tiering – or storing data in different media based on access requirements and storage cost considerations.  These tiers may scale from in-memory for real time data analysis – to Database Flash for frequently accessed data – to operational data captured in Database Storage and Exadata Cells.

Hadoop offers yet another storage layer, the Hadoop Distributed File System (HDFS), which provides a cost-effective alternative for storing massive volumes of data.  Oracle Big Data SQL makes access to this data seamless from Oracle Database 12c; Big Data SQL is a data virtualization technology that allows users and applications to use Oracle’s rich SQL language across data stored in Oracle Database, Hadoop and NoSQL stores.  One query can combine data from all these sources.

What this means is that ILM can now be extended to use Hadoop to store raw and archived data.  This is especially important since retaining many years of historical information in data warehouses is increasingly a requirement for both analytics and regulatory compliance...
[Read More]

Thursday Mar 17, 2016

Big Data SQL 3.0 is now available!

Oracle Big Data SQL 3.0 is now available!  This is an exciting milestone for Oracle.  With support for Cloudera CDH (both on Big Data Appliance and non-Big Data Appliance), Hortonworks HDP and Oracle Database 12c (both Exadata and non-Exadata) - the benefits derived from unified queries across relational, Hadoop and NoSQL stores can now be achieved across a wide breadth of big data deployments.

Hadoop and NoSQL are rapidly becoming key components of today's data management platform, and many Oracle Database customers use Hadoop or NoSQL in their organization. Using multiple data management solutions typically leads to data silos, where different people and applications can only access a subset of the data needed. Big Data SQL offers an industry-leading solution to deliver one fast, secure SQL query on all data: in Hadoop, Oracle Database, and NoSQL. Big Data SQL leverages both innovations from Oracle Exadata and specific Hadoop features to push processing down to Hadoop, resulting in minimized data movement and extreme performance for SQL on Hadoop.

In summary, Big Data SQL 3.0: 

  • Expands support for Hadoop platforms - covering Hortonworks HDP, Cloudera CDH on commodity hardware as well as on Oracle Big Data Appliance
  • Expands support for database platforms - covering Oracle Database 12c on commodity hardware as well as on Oracle Exadata
  • Improves performance through new features like Predicate Push-Down on top of Smart Scan and Storage Indexes 




Wednesday Mar 16, 2016

Maximum Availability Architecture for Big Data Appliance

Oracle Maximum Availability Architecture (MAA) is Oracle's best practices blueprint based on proven Oracle high availability technologies, along with expert recommendations and customer experiences. MAA best practices have been highly integrated into the design and operational capability of Oracle Big Data Appliance, and together they provide the most comprehensive highly available solution for Big Data.

Oracle MAA papers are published at the MAA home page of the Oracle Technology Network (OTN) website. Oracle Big Data Appliance (BDA) Maximum Availability Architecture is a best-practices blueprint for achieving an optimal high-availability deployment using Oracle high-availability technologies and recommendations.

The Oracle BDA MAA exercise for this paper was executed on Oracle Big Data Appliance and Oracle Exadata Database Machine to validate high availability and to measure downtime in various outage scenarios. The current release of this technical paper covers the first phase of the overall Oracle BDA MAA project. The project comprises the following two phases:

Phase 1: High Availability and Outage scenarios at a single site

Phase 2: Disaster Recovery Scenarios across multiple sites

The white paper covering Phase 1 is now published here

Tuesday Mar 15, 2016

Big Data SQL Quick Start. Parallel Query - Part3.

[Read More]

Sunday Mar 06, 2016

Hadoop Compression. Choosing compression codec. Part2.

Many customers keep asking me about a "default" (single) compression codec for Hadoop. The answer to this question is not so easy, so let me explain why.

Bzip2 or not Bzip2?

In my previous blog post I published the compression rates of several compression codecs in Hadoop. Based on those results you may think that it’s a good idea to compress everything with bzip2. But be careful with this. Within the same research, I noted that bzip2 is on average 3 times slower than gzip both for querying (decompressing) and for archiving (compressing) data, which is not surprising given the complexity of the algorithm.  Are you ready to sacrifice performance? I think it will depend on the compression benefits derived from bzip2 and on how often you query this data (compression speed is not so important after data is stored in Hadoop systems, since you usually compress data once and read it many times).  On average, bzip2 compresses 1.6 times better than gzip.  But, again, my research showed that sometimes you can achieve 2.3 times better compression, while other times you may save only 9% of the disk space (and performance is still much worse compared to gzip and other codecs). The second factor to keep in mind is the frequency of data querying and your performance SLAs. If you don’t care about query performance (you don’t have any SLAs) and you select this data very rarely, bzip2 could be a good candidate.  Otherwise consider other options. I encourage you to benchmark your own data and decide for yourself “Bzip2 or not Bzip2”.
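To make the "benchmark your own data" advice concrete, here is a minimal sketch in Python. It uses the standard library gzip and bz2 modules rather than the Hadoop codecs themselves, and the sample payload is a made-up placeholder, so treat the numbers it prints as a rough local indication only, not a reproduction of the results above.

```python
import bz2
import gzip
import time


def benchmark(codec_name, compress, decompress, payload):
    """Return compression ratio and timings for one codec on one payload."""
    start = time.perf_counter()
    packed = compress(payload)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    assert unpacked == payload  # the round trip must be lossless
    return {
        "codec": codec_name,
        "ratio": len(payload) / len(packed),
        "compress_s": compress_seconds,
        "decompress_s": decompress_seconds,
    }


# Hypothetical sample: replace with a representative slice of your own data.
sample = b"2016-03-06,store_42,widget,19.99\n" * 50_000

results = [
    benchmark("gzip", gzip.compress, gzip.decompress, sample),
    benchmark("bzip2", bz2.compress, bz2.decompress, sample),
]
for r in results:
    print(f"{r['codec']:6s} ratio={r['ratio']:.1f} "
          f"compress={r['compress_s']:.3f}s decompress={r['decompress_s']:.3f}s")
```

Comparing the ratio column against the decompress column for your own files is the trade-off discussed above: how much disk space a codec saves versus how much slower every read becomes.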

[Read More]

Friday Feb 19, 2016

Big Data Lite 4.4.0 is now available on OTN


It's now available for download on OTN.  Check out this VM to help you learn about Oracle's big data platform. [Read More]

Thursday Feb 04, 2016

Hadoop Compression. Compression rate. – Part1.

Compression codecs.

Text files (csv with “,” delimiter):

Codec Type  Average rate  Minimum rate  Maximum rate
bzip2       17.36         3.88          61.81
gzip        9.73          2.9           26.55
lz4         4.75          1.66          8.71
snappy      4.19          1.61          7.86
lzo         3.39          2             5.39

RC File: 

Codec Type  Average rate  Minimum rate  Maximum rate
bzip2       17.51         4.31          54.66
gzip        13.59         3.71          44.07
lz4         7.12          2             21.23
snappy      6.02          2.04          15.38
lzo         4.37          2.33          7.02

Parquet file:

Codec Type  Average rate  Minimum rate  Maximum rate
gzip        17.8          3.9           60.35
snappy      12.92         2.63          45.99

[Read More]

Using Spark(Scala) and Oracle Big Data Lite VM for Barcode & QR Detection

Big Data and Scalable Image Processing and Analytics

Guest post by Dave Bayard - Oracle's Big Data Pursuit Team 

One of the promises of Big Data is its flexibility to work with large volumes of unstructured types of data such as images and photos. In today's world, there are many sources of images including social media photos, security cameras, satellite images, and more. There are many kinds of image processing and analytics that are possible, from optical character recognition (OCR), license plate detection, bar code detection and face recognition to geological analysis and more. And there are many open source libraries such as OpenCV, Tesseract, ZXing, and others that are available to leverage.

[Read More]

Tuesday Jan 19, 2016

Big Data SQL Quick Start. Introduction - Part1.

Today I am going to explain the steps required to start working with Big Data SQL. It’s really easy!  I hope that after this article you will all agree with me. First, if you want to get caught up on what Big Data SQL is, I recommend that you read these blogs: Oracle Big Data SQL: One Fast Query, Big Data SQL 2.0 - Now Available.

The above blogs cover the design goals of Big Data SQL. One of these goals is transparency. You just define a table that links to some directory in HDFS or some table in HCatalog and continue working with it as you would with an ordinary Oracle Database table. It’s also useful to read the product documentation.

Your first query with Big Data SQL

Let’s start with the simplest example and query data that is actually stored in HDFS via Oracle Database using Big Data SQL. I’m going to begin this example by checking the data that actually resides in HDFS. To accomplish this, I run the Hive console and check the Hive table DDL:

[Read More]

Thursday Jan 07, 2016

Data loading into HDFS - Part1

Today I’m going to start the first article devoted to a very important topic in the Hadoop world: data loading into HDFS. First of all, let me explain the different approaches to loading and processing data in different IT systems.

Schema on Read vs Schema on Write

So, when we talk about data loading, we usually load into a system that belongs to one of two types.  One of these is schema on write. With this approach we have to define columns, data formats and so on. During reading, every user observes the same data set. Once we have performed ETL (transformed the data into the format most convenient for a particular system), reading is pretty fast and overall system performance is pretty good. But keep in mind that we already paid the penalty for this when we loaded the data. An example of a schema on write system is a relational database, such as Oracle or MySQL.

Schema on Write

Another approach is schema on read. In this case we load data as-is, without any changes or transformations.  With this approach we skip the ETL step (we don’t transform the data) and we don’t have any headaches with data format and data structure. We just load the file onto the file system, like copying photos from a flash card or external storage to your laptop’s disk. How to interpret the data is decided at read time. Interestingly, the same data (the same files) can be read in different ways. For instance, if you have some binary data and you define a serialization/deserialization framework and use it within your select, you get structured data; otherwise you get just a set of bytes. As another example, even with the simplest CSV files you can read the same column as a numeric or as a string, which produces different results for sorting or comparison operations.
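The CSV example can be sketched in a few lines of Python. The file content and the column name "amount" are made up for illustration; the point is only that the same stored bytes sort differently depending on the type chosen at read time.

```python
import csv
import io

# Hypothetical raw file content, stored as-is (no ETL on load).
raw = "id,amount\na,9\nb,10\nc,100\n"


def read_amounts(text, parse):
    """Apply a schema at read time: `parse` decides the column's type."""
    reader = csv.DictReader(io.StringIO(text))
    return [parse(row["amount"]) for row in reader]


as_numbers = sorted(read_amounts(raw, int))
as_strings = sorted(read_amounts(raw, str))

print(as_numbers)  # [9, 10, 100]
print(as_strings)  # ['10', '100', '9']
```

Read as numbers, the values sort by magnitude; read as strings, they sort character by character, so '9' lands after '100'. Nothing in the file changed between the two reads.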

Schema on Read

Hadoop Distributed File System is a classic example of a schema on read system. More details about the Schema on Read and Schema on Write approaches can be found here. Now we are going to talk about loading data into HDFS. I hope that after the explanation above you understand that loading data into Hadoop is not the same as ETL (the data is not transformed).

[Read More]

Thursday Dec 24, 2015

Oracle Big Data Lite 4.3.0 is Now Available on OTN

Big Data Lite 4.3.0 is now available on OTN

This latest release is packed with new features - here's the inventory of what's included:

  • Oracle Enterprise Linux 6.7
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.7)
  • Cloudera Manager (5.4.7)
  • Oracle Big Data Spatial and Graph 1.1
  • Oracle Big Data Discovery 1.1.1
  • Oracle Big Data Connectors 4.3
    • Oracle SQL Connector for HDFS 3.4.0
    • Oracle Loader for Hadoop 3.5.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.1
    • Oracle XQuery for Hadoop 4.2.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.4.7)
  • Oracle Table Access for Hadoop and Spark 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1.2 with Oracle REST Data Services 3.0
  • Oracle Data Integrator 12cR1 (12.2.1)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.2.0
  • Oracle Perfect Balance 2.5.0
Also, this release is using GitHub as the repository for all of our sample code.  This gives us a great mechanism for updating the samples/demos between releases.  Users simply double click the "Refresh Samples" icon on the desktop to download the latest collateral.

Friday Oct 23, 2015

Performance Study: Big Data Appliance compared with DIY Hadoop

Over the past couple of months a team of Intel engineers has been working with our engineers on Oracle Big Data Appliance and performance, especially in ensuring a BDA outperforms DIY Hadoop out of the box. The good news is that your BDA, as you know it today, is already 1.2x faster. We are now working to include a lot of the findings in BDA 4.3 and subsequent versions, so we are steadily expanding that 1.2x into a 2x out-of-the-box performance advantage. And that is all above and beyond the faster time to value a BDA delivers, as well as on top of the low cost you can get it for.

Read the full paper here.

But, we thought we should add some color to all of this, and if you are at Openworld this year, come listen to Eric explain all of this in detail on Monday October 26th at 2:45 in Moscone West room 3000.

If you can't make it, here is a short little dialog we had over the results and both Eric and Lucy's take on the work they did and what they are up to next.

Q: What was the most surprising finding in tuning the system?

A: We were surprised how well the BDA performed right after its installation. Having worked for over 5 years on Hadoop, we understand it is a long iterative process to extract the best possible performance out of your hardware. BDA was a well-tuned machine and we were a little concerned we might not have much value to add... 

Q: Anything that you thought was exciting but turned out to be not such a big thing?

A: We were hoping for 5x gains from our work, but only got 2x... But, in all seriousness, we were hoping for better results from some of our memory and Java garbage collection tuning. Unfortunately they only resulted in marginal single-digit gains.

Q: What is next?

A: There is a long list of exciting new products coming from Intel in the coming year, such as hardware accelerated compression, 3d-Xpoint, almost zero latency PCIE SSDs and not to forget new processors. We are excited at the idea of tightly integrating them all with Big Data technologies! What is a better test bed than the BDA? A full software/hardware solution!

Looks like we have a lot of fun things to go work on and with, as well as of course looking into performance improvements for BDA in light of Apache Spark.

See you all at Openworld, or once again, read the paper here

Tuesday Oct 13, 2015

Big Data SQL 2.0 - Now Available

With the release of Big Data SQL 2.0 it is probably time to do a quick recap and introduce the marquee features in 2.0. The key goals of Big Data SQL are to expose data in its original format, as stored within Hadoop and NoSQL databases, through high-performance Oracle SQL that is offloaded to storage-resident cells or agents. The architecture of Big Data SQL closely follows the architecture of Oracle Exadata Storage Server Software and is built on the same proven technology.

Retrieving Data

With data in HDFS stored in an undetermined format (schema on read), SQL queries require some constructs to parse and interpret data for it to be processed in rows and columns. For this, Big Data SQL leverages all the Hadoop constructs, notably InputFormat and SerDe Java classes, optionally through Hive metadata definitions. Big Data SQL then layers the Oracle Big Data SQL Agent on top of this generic Hadoop infrastructure as can be seen below.

Accessing HDFS data through Big Data SQL

Because Big Data SQL is based on Exadata Storage Server Software, a number of benefits are instantly available. Big Data SQL not only can retrieve data, but can also score Data Mining models at the individual agent, mapping model scoring to an individual HDFS node. Likewise querying JSON documents stored in HDFS can be done with SQL directly and is executed on the agent itself.

Smart Scan

Within the Big Data SQL Agent, similar functionality exists as is available in Exadata Storage Server Software. Smart Scans apply the filter and row projections from a given SQL query on the data streaming from the HDFS Data Nodes, reducing the data that is flowing to the Database to fulfill the data request of that given query. The benefits of Smart Scan for Hadoop data are even more pronounced than for Oracle Database as tables are often very wide and very large. Because of the elimination of data at the individual HDFS node, queries across large tables are now possible within reasonable time limits enabling data warehouse style queries to be spread across data stored in both HDFS and Oracle Database.
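The idea of applying filters and projections close to the data can be illustrated with a small Python sketch. This is only an analogy, not the Big Data SQL implementation: the function name smart_scan, the row layout and the sample query are all hypothetical.

```python
def smart_scan(rows, predicate, columns):
    """Filter rows and project columns on the 'storage' side.

    Only matching rows, trimmed to the requested columns, are yielded,
    so less data travels to the caller (the 'database' in this analogy).
    """
    for row in rows:
        if predicate(row):
            yield {name: row[name] for name in columns}


# Hypothetical wide rows as they might stream from one data node.
node_rows = [
    {"id": 1, "country": "US", "amount": 120, "notes": "x" * 100},
    {"id": 2, "country": "DE", "amount": 80, "notes": "y" * 100},
    {"id": 3, "country": "US", "amount": 45, "notes": "z" * 100},
]

# Roughly: SELECT id, amount FROM t WHERE country = 'US'
result = list(smart_scan(node_rows, lambda r: r["country"] == "US", ["id", "amount"]))
print(result)
```

Two of the three rows and two of the four columns leave the node; the wide "notes" column never does. With very wide and very large tables, that reduction is where the benefit described above comes from.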

Storage Indexes

Storage Indexes - new in Big Data SQL 2.0 - provide the same benefits of IO elimination to Big Data SQL as they provide to SQL on Exadata. The big difference is that in Big Data SQL the Storage Index works on an HDFS block (on BDA – 256MB of data) and spans 32 columns instead of the usual 8. Storage Index is fully transparent to both Oracle Database and to the underlying HDFS environment. As with Exadata, the Storage Index is a memory construct managed by the Big Data SQL software and invalidated automatically when the underlying files change.

Concepts for Storage Indexes

Storage Indexes work on data exposed via Oracle External tables using both the ORACLE_HIVE and ORACLE_HDFS types. Fields are mapped to these External Tables and the Storage Index is attached to the Oracle (not the Hive) columns, so that when a query references the column(s), the Storage Index - when appropriate - kicks in. In the current version, Storage Index does not support tables defined with Storage Handlers (ex: HBase or Oracle NoSQL Database).
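A rough way to picture the IO elimination is a per-block summary of minimum and maximum column values, which is how Exadata storage indexes are commonly described. The Python sketch below assumes that model carries over; the class and function names are hypothetical and the blocks are tiny stand-ins for HDFS blocks.

```python
from dataclasses import dataclass


@dataclass
class BlockSummary:
    """Min/max of one column for one block (a stand-in for an HDFS block)."""
    block_id: int
    min_value: int
    max_value: int


def build_index(blocks):
    """Record the min and max column value seen in each block."""
    return [BlockSummary(i, min(values), max(values)) for i, values in enumerate(blocks)]


def blocks_to_read(index, low, high):
    """Keep only blocks whose [min, max] range can overlap [low, high]."""
    return [s.block_id for s in index if s.max_value >= low and s.min_value <= high]


# Hypothetical column values, grouped by block.
blocks = [[1, 5, 9], [20, 25, 31], [40, 47, 52]]
index = build_index(blocks)

# Roughly: WHERE col BETWEEN 22 AND 30, so only block 1 can contain matches.
print(blocks_to_read(index, 22, 30))  # [1]
```

Blocks 0 and 2 are skipped without being read at all, which is the IO saving; the rows inside block 1 still need to be filtered afterwards.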

Compound Benefits

The Smart Scan and Storage Index features deliver compound benefits. Where Storage Indexes reduce the IO done, Smart Scan then enacts the same row filtering and column projection. This latter step remains important as it reduces the data transferred between systems.

To learn more about Big Data SQL, join us at Open World in San Francisco at the end of the month.

Thursday Sep 03, 2015

Oracle Big Data Lite 4.2.1 - Includes Big Data Discovery

We just released Oracle Big Data Lite 4.2.1 VM.  This VM provides many of the key big data technologies that are part of Oracle's big data platform.  Along with all the great features of the previous version, Big Data Lite now adds Oracle Big Data Discovery 1.1:

The list of big data capabilities provided by the virtual machine continues to grow.  Here's a list of all the products that are pre-configured:

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Discovery 1.1
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0 
Take it for a spin - and check out the tutorials and demos that are available from the Big Data Lite download page.


The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

