Thursday Feb 04, 2016

Hadoop Compression. Compression rate. – Part1.

Compression codecs.

Text files (csv with “,” delimiter):

Codec Type  Average rate  Minimum rate  Maximum rate
bzip2 17.36 3.88 61.81
gzip 9.73 2.9 26.55
lz4 4.75 1.66 8.71
snappy 4.19 1.61 7.86
lzo 3.39 2 5.39

RC File: 

Codec Type Average rate Minimum rate Maximum rate
 bzip2 17.51 4.31 54.66
 gzip 13.59 3.71 44.07
 lz4 7.12 2 21.23
 snappy 6.02 2.04  15.38
 lzo 4.37 2.33 7.02

Parquet file:

Codec Type Average rate Minimum rate Maximum rate
 gzip 17.8 3.9 60.35
 snappy 12.92 2.63 45.99

[Read More]

Using Spark(Scala) and Oracle Big Data Lite VM for Barcode & QR Detection

Big Data and Scalable Image Processing and Analytics

Guest post by Dave Bayard - Oracle's Big Data Pursuit Team 

One of the promises of Big Data is its flexibility to work with large volumes of unstructured types of data such as images and photos. In todayís world, there are many sources of images including social media photos, security cameras, satellite images, and more. There are many kinds of image processing and analytics that are possible from optical character recognition (OCR), license plate detection, bar code detection, face recognition, geological analysis and more. And there are many open source libraries such as OpenCV, Tesseract, ZXing, and others that are available to leverage.

This blog and demonstration was built to show that scalable image analytics does not need to be difficult. Our primary goal in this blog is to use the Oracle Big Data Lite VM environment to demonstrate how to take an open source library and combine it into a Spark/Scala application. In particular, we will use Spark alongside the ZXing (Zebra Crossing) library to detect barcodes and QR Codes from a set of image files.

It should be also noted that instead of writing our own Spark application that we could instead have leveraged Oracleís Multimedia Analytics framework that comes as a feature of Oracle Big Data Spatial and Graph. For instance, the Multimedia Analytics framework could let us easily support videos as well as still images. In a later blog post, we will extend this example and show how to make it work with the Multimedia Analytics framework. For now you can learn more about the Multimedia Analytics framework here:

This blog has the following objectives:

  • 1.Learn about the ZXing library for barcode and QR code detection.
  • 2.Learn how to build and run a simple Spark(Scala) application using the Oracle Big Data Lite VM.
  • 3.Learn how to build and run a Spark(Scala+Java) application using the ZXing libraries.

Above is an example of a photo containing a QR code. One potential application could include going through a large set of photos and identifying which photos have QR codes and the content of those QR codes. Not everyone will have a set of photos with QR codes in them nor will they have a need to scan them in bulk, but the concepts of this blog (showing how to use an open source image library alongside Spark and Scala) should still apply- just substitute QR code detection libraries with the image processing libraries of your choice.

Initial Oracle Big Data Lite VM Setup:

This demonstration uses the Oracle Big Data Lite VM version, which is available here: . Version of the VM includes CDH 5.4.7, Spark 1.3.0, and Scala 2.10.4.

Once you have the Big Data Lite VM downloaded, imported, and running, then click on the ìRefresh Samplesî icon on the VM desktop to refresh the samples. At this point, you should fine the files needed for this blog are available under your /home/oracle/src/Blogs/SparkBarcode directory.

[Note: If you want to get access to the files referenced in this blog outside of the Oracle Big Data LiteVM, you can find them here: ]

Now run the script located at /home/oracle/src/Blogs/SparkBarcode. The script will download the necessary files for the open source libraries ZXing and SBT as well as copy some sample image files into hdfs.

ZXing (Zebra Crossing) for Barcode Detection:

ZXing (pronounced ìZebra Crossingî) is an open source library that does one-dimensional (barcode) and two-dimensional (QR code) image detection. It is written in java. You can find out more at the website:

As a quick example of ZXing, we can use the ZXing projectís hosted web application to demonstrate its functionality. In the Big Data Lite VM, open up the firefox browser and go to . On the web page, click on the ìBrowseî button and use the file browser to open the /home/oracle/src/Blogs/SparkBarcode/images/test.jpg file. Then click the ìSubmit Queryî button on the web page.

The test.jpg file is a photo containing a 2-dimensional QR code. When you submit the image, the web application should return

The ZXing project has kindly made available the source code for the web application. To understand how to interact with the ZXing API, we can look at how the web application worked and focus on the source of the DecodeServlet class, which is available here:

In particular, navigate to the processImage() method which illustrates how the ZXing APIs (such as MultipleBarcodeReader, MultiFormatReader, etc) are used:

  private static void processImage(BufferedImage image,
                                   HttpServletRequest request,
                                   HttpServletResponse response) throws IOException, ServletException {

    LuminanceSource source = new BufferedImageLuminanceSource(image);
    BinaryBitmap bitmap = new BinaryBitmap(new GlobalHistogramBinarizer(source));
    Collection results = new ArrayList<>(1);

    try {

      Reader reader = new MultiFormatReader();
      ReaderException savedException = null;
      try {
        // Look for multiple barcodes
        MultipleBarcodeReader multiReader = new GenericMultipleBarcodeReader(reader);
        Result[] theResults = multiReader.decodeMultiple(bitmap, HINTS);
        if (theResults != null) {
      } catch (ReaderException re) {
        savedException = re;
      if (results.isEmpty()) {
        try {
          // Look for pure barcode
          Result theResult = reader.decode(bitmap, HINTS_PURE);
          if (theResult != null) {
        } catch (ReaderException re) {
          savedException = re;
      if (results.isEmpty()) {
        try {
          // Look for normal barcode in photo
          Result theResult = reader.decode(bitmap, HINTS);
          if (theResult != null) {
        } catch (ReaderException re) {
          savedException = re;
      if (results.isEmpty()) {
        try {
          // Try again with other binarizer
          BinaryBitmap hybridBitmap = new BinaryBitmap(new HybridBinarizer(source));
          Result theResult = reader.decode(hybridBitmap, HINTS);
          if (theResult != null) {
        } catch (ReaderException re) {
          savedException = re;
      if (results.isEmpty()) {
        try {
          throw savedException == null ? NotFoundException.getNotFoundInstance() : savedException;
        } catch (FormatException | ChecksumException e) {
          errorResponse(request, response, "format");
        } catch (ReaderException e) { // Including NotFoundException
          errorResponse(request, response, "notfound");

    } catch (RuntimeException re) {
      // Call out unexpected errors in the log clearly
      log.log(Level.WARNING, "Unexpected exception from library", re);
      throw new ServletException(re);

    String fullParameter = request.getParameter("full");
    boolean minimalOutput = fullParameter != null && !Boolean.parseBoolean(fullParameter);
    if (minimalOutput) {
      try (Writer out = new OutputStreamWriter(response.getOutputStream(), StandardCharsets.UTF_8)) {
        for (Result result : results) {
    } else {
      request.setAttribute("results", results);
      request.getRequestDispatcher("decoderesult.jspx").forward(request, response);

We can observe that the code makes a series of attempts to identify barcodes/QR codes. If one attempt does not find a barcode, it attempts again using a different method or parameter/hint. This can help the code to better tolerate image detection challenges like variations in lighting, reflection, angle, resolution, image quality, etc. (If you are interested in other ways to use the ZXing APIs, an alternative example is the decode() method in the DecodeWorker class, which can be found here: )

Our next step is to get started coding with ZXing by building a simple standalone Java application. Thanks to the script you ran earlier, we have already downloaded the necessary ZXing java jar libraries, as described in the ZXing Getting Started guide at

Our simple java standalone application will be based off the processImage() method from the web application. The source of the sample application can be found in the file, located in the /home/oracle/src/Blogs/SparkBarcode/SimpleJavaApp/barcodedemo subdirectory.

Essentially, we copied the web applicationís processImage() method and removed the dependencies on the http request and response objects. Explore the source code to see what we mean.

Now run the script ìrun_simple_java.shî to both compile and run our sample test program against the test image.

Our simple standalone java application using ZXing libraries is a success!

Spark Development: A First Scala Application

Our ultimate goal is to combine the open source ZXing library with Spark and run it using the resources of our CDH cluster. Specifically, we want to build a Scala Application that calls the java ZXing libraries using the Spark on Yarn Framework to run our barcode detection on a set of images in parallel. But before we attempt that final application, we will first start with a simple Scala application that uses only Javaís built-in libraries.

If you are new to Spark, you should check out . If you are new to the Scala language, you can find some quick language tips at . This blog assumes you have some experience with working with basic Spark/Scala examples such as word count using the interactive spark-shell.

To help us with our Scala application development, we will want to add the ìsbtî utility to the Big Data Lite VM. SBT is frequently used to manage the build process for Scala applications, much like maven is used with Java applications. The ìsetup.shî script you ran earlier downloaded sbt for the Big Data Lite VM. If you want more information about SBT, you can navigate to here:

Another requirement is to prepare a directory structure for our Spark/Scala application. We will follow the template directory structure described in .

The directories and files for the simple application have been created for you and are located at the /home/oracle/src/Blogs/SparkBarcode/SimpleScalaApp subdirectory. They look likeÖ

There are 2 key files. The simple.sbt file is the build file that sbt uses and contains information such as dependencies on other libraries. The SimpleApp.scala file is the application source. The source looks like:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import java.awt.image.BufferedImage
import javax.imageio.ImageIO

object SimpleApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Scala Image App")
    val sc = new SparkContext(conf)
    val files = sc.binaryFiles("barcode/images/test.jpg")
    val imageResults =
         result => println("\nFile:"+result._1+" Width:"+result._2+" Height:"+result._3)
                          ); //this println gets printed to the main stdout


  def processSparkImage (
                    file: (String, org.apache.spark.input.PortableDataStream)
                   ) : (String, Int, Int)  =
    println("In processSparkImage for "+file._1) //this println goes to the executor stdout
    val img: BufferedImage =
    val height = img.getHeight()
    println("Height is "+height+" for "+file._1)

    return (file._1, img.getWidth(), img.getHeight)
The simple application is based on the Spark quick startís example Scala standalone application- the big changes are that we have modified it to use Sparkís sc.binaryFiles() instead of sc.textFile(), and we have created a processSparkImage() function. Our processSparkImage() function uses standard Java Image APIs to extract the width and height of an image.

At the time of this writing, there does not seem to be much published about sc.binaryFiles(), so it is worth a little bit of extra explanation. The output of sc.binaryFiles is a set of tuples. The first element is the filename and the second element is a Spark PortableDataStream of the contents. In Scala notation, the body of processSparkImage() uses file._2 to point to the PortableDataStream and file._1 to point to the filename. The PortableDataStream can be used where you would use an InputStream.

The rest of the code is pretty standard stuff for initializing the SparkContext, etc.

Run the ìbuild_simple_scala.shî script. If this is your first time running the build script, be patient- it will take a dozen minutes or so as the sbt tool will do a one-time download of supporting libraries to prepare the scala environment for Spark application compilation/development.

Once the build is done, run the ìrun_simple_scala.shî script. Your output should look like below:

Notice that the simple application has printed out the Width and Height of the test image, as expected. Also notice that the run script has given you the URLs of the YARN, Hue, and Spark web UIs. In the terminal window, you can right click on those URLs and choose ìOpen LinkÖî to easily view them. Below are screenshots of the Hue, YARN Resource Manager, and Spark History web UIs:

Building our Scala+ZXing Spark Application:

Now, we can move onto integrating the java ZXing library into our Scala application code. We have done so in the application located at /home/oracle/src/Blogs/SparkBarcode/ScalaBarcodeApp.

Letís look at the code source directory.

Notice that there are both java and scala subdirectories. Under the java subdirectory, we have a barcodedemo.BarcodeProcessor class with our version of the processImage() function from our simple java application above.

Our scala code is BarcodeApp.scala. It looks like this:

/* BarcodeApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import barcodedemo.BarcodeProcessor

import java.awt.image.BufferedImage
import javax.imageio.ImageIO

object BarcodeApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Scala ZXing Barcode App")
    val sc = new SparkContext(conf)
    val files = sc.binaryFiles(args(0))

    val imageResults =

               result => println("\nFile:" + result._1 + "\n Results:" + result._2)
                          //this println gets written to the main stdout


  def processSparkImage (
                    file: (String, org.apache.spark.input.PortableDataStream)
                   ) : (String, String)  =
    println("In processSparkImage for "+file._1) //this println goes to the executor stdout

    val img: BufferedImage =

    return (file._1, BarcodeProcessor.processImage(img))


Notice that the scala code imports our java BarcodeProcessor class. The Scala processSparkImage() method calls the java BarcodeProcessor.processImage() method, which returns the barcode information in a string.

You can also look at the main method for the Scala application. It defines an RDD using sc.binaryFiles() based on the path defined by the first command-line argument args(0). This will allow us to test our application with 1 or many images by changing the command-line arguments of our run command. Then the application calls the map() transformation for the processSparkImage() method. This will cause the Spark executors to run the processSparkImage() method on each binaryFile. Finally, the Scala code collects the results and prints the output for each file.

We can also look at the barcode.sbt file for this application and notice that weíve included a dependency for the necessary ZXing libraries from the central maven repository, telling SBT to go ahead and download them as needed for our build.

name := "Scala ZXing Barcode Application"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

libraryDependencies += "" % "core" % "3.2.1"
libraryDependencies += "" % "javase" % "3.2.1"

Build our application by running ìbuild_barcode.shî.

Then run our application. If you want to run for the single test.jpg image, use ìrun_barcode.shî. If you want to test for a set of approx 25 images, then use ìrun_barcode_many.shî.

Here is the start, middle, and finish of ìrun_barcode_many.shî:

You will notice that a barcode was not detected in every image. In some cases, there simply wasnít a barcode or QR code in the image. In other cases, there was a barcode but the image might have been fuzzy or too much glare or something else.

Running on a real cluster:

To get a feel for the power of scalability, we can copy our Barcode application and deploy it onto real-world hardware. In our case, we will use an older 6-node Oracle Big Data Appliance X3-2. While the BDA X3-2 lacks the huge horsepower of the latest generation BDA X5-2, it can still give us a place to demonstrate scalability. In our example, we will run the ìrun_barcode_many.shî for 50 images against the Big Data Lite VM as well as against a 6-node X3-2 BDA.

The screenshot above shows details from running the Barcode app on the Big Data Lite VM on my laptop. It used 2 executors and ran in 1.3 minutes of clock time. The screenshot below shows details from running on the 6-node BDA X3-2. Notice the different number of Executors (in the Aggregated Metrics by Executor section). On the BDA, we used 50 executors and it took 5 seconds of clock time. Both scenarios used the same code and same set of images.

Notice how the Spark application was able to take advantage of the multiple machines and multiple CPUs of the BDA X3-2 to run in parallel and complete much faster (5seconds versus 1.3 minutes).

Moving Beyond:

Hopefully, this was a good start for your journey into Spark and Image Processing. Here are some possible future paths you could take:
  • You could learn more about Spark. To do so, I liked the book ìLearning Sparkî.
  • You could learn more about image detection and computer vision. To do so, I found the book ìSimpleCVî to explain the concepts well (SimpleCV is python-focused, but its explanation of concepts are useful to any language).
  • You could experiment with other libraries like OpenCV or Tesseract. Cloudera has a blog example using Spark with the open source Tesseract OCR library:
  • You could experiment with tweaking the sample java processImage() code to work as a custom FrameProcessor for the Oracle Big Data Spatial and Graph Multimedia Analytics feature. This could be used, for instance, to scan videos for barcodes. To do so, you can use processImage() as part of an implementation of the oracle.ord.hadoop.mapreduce.OrdFrameProcessor class. See or stay tuned for an upcoming blog/example.

NOTE: If you want to play around with the source files and make modifications, you should probably copy the SparkBarcode directory tree into a new directory outside of /home/oracle/src. This is because the ìRefresh Samplesî utility will wipe-out the /home/oracle/src directory every time it runs.


This blog has shown you how to do Barcode detection using the ZXing libraries. First, we illustrated ZXing in action by using the hosted web application at Next, we built and ran a simple standalone java application using the ZXing APIs. Then, we built a simple Spark(Scala) application that used the built-in java Image APIs (but not yet the ZXing APIs). We ran this Spark(Scala) application against the YARN cluster on our Big Data Lite VM. Then we built our final Spark(Scala) application which included our custom Java code that called the ZXing APIs. We ran this on our Big Data Lite VM YARN cluster for a set of sample images. Finally, we took the same code and images and ran them on a physical Big Data Appliance to show the benefit of scale-out parallelism across nodes and cpus.

Hopefully, this has made you more comfortable with working with tools like the Oracle Big Data Lite VM, Spark, Scala, sbt, and ZXing. Enjoy.

About the Author:

David Bayard is a member of the Big Data Pursuit team for Oracle North America Sales.

Tuesday Jan 19, 2016

Big Data SQL Quick Start. Introduction - Part1.

Today I am going to explain steps that required to start working with Big Data SQL. It’s really easy!  I hope after this article you all will agree with me. First, if you want to get caught up on what Big Data SQL is, I recommend that you read these blogs: Oracle Big Data SQL: One Fast Query, Big Data SQL 2.0 - Now Available.

The above blogs cover design goals of Big Data SQL. One of the goals of Big Data SQL is transparency. You just define table that links to some directory in HDFS or some table in HCatalog and continue working with it like with general Oracle Database table.It’s also useful to read the product documentation.

Your first query with Big Data SQL

Let’s start with simplest one example and query data that is actually stored in HDFS via Oracle Database using Big Data SQL. I’m going to begin this example by checking of the data that actually lies into HDFS. To accomplish this, I run the hive console and check hive table DDL:

hive> show create table web_sales;



  ws_sold_date_sk int,

 ws_sold_time_sk int,


  ws_net_paid_inc_ship float,

  ws_net_paid_inc_ship_tax float,

  ws_net_profit float)









From the DDL statement, we can see the data is text files (CSV), stored on HDFS in the directory:


From the DDL statement we can conclude that fields terminated by “|”. Trust, but verify – let’s check:

# hdfs dfs -ls /user/hive/warehouse/tpc_ds_3T/web_sales|tail -2

... hive 33400655 2015-05-11 13:00 /user/hive/warehouse/tpc_ds_3T/web_sales/part-01923

... hive 32787672 2015-05-11 13:00 /user/hive/warehouse/tpc_ds_3T/web_sales/part-01924

# hdfs dfs -cat /user/hive/warehouse/tpc_ds_3T/web_sales/part-01923|tail -2



Indeed, we have CSV files on HDFS. Let’s fetch it from the database.

New type of External table, new events and new item in the query plan

With Big Data SQL we introduce new types of External Tables (ORACLE_HIVE and ORACLE_HDFS), a new wait event (cell external table smart scan), and a new plan statement (External Table Access Storage Full). Over this HDFS directory, I’ve defined and Oracle External table, like this:













After table creation, I’m able to query data from the database. To begin, I run a very simple query that calculates the minimum value of some column, and has a filter on it.  Then, I can Oracle Enterprise Manager to determine how my query was processed:

SELECT min(w.ws_sold_time_sk) 


WHERE w.ws_sold_date_sk = 2451047

We can see the new type of the wait event “cell external table smart scan”:

and new item in plan statement - “external table access storage full”:

To make sure that your table now exists in Oracle dictionary you can run follow queries:


FROM user_objects t 





----------- -------------


Big Data SQL also adds a new member to Oracle’s metadata - ALL_HIVE_TABLES:

SQL> SELECT table_name,LOCATION,table_type





TABLE_NAME LOCATION                                 TABLE_TYPE

----------- -------------------------------------- -----------

web_returns hdfs://democluster-ns/.../web_returns EXTERNAL_TABLE

See, querying Hadoop with Oracle is easy! In my next blog posts, we’ll look at more complicated queries!

Thursday Jan 07, 2016

Data loading into HDFS - Part1

Today I’m going to start first article that will be devoted by very important topic in Hadoop world – data loading into HDFS. Before all, let me explain different approaches of loading and processing data in different IT systems.

Schema on Read vs Schema on Write

So, when we talking about data loading, usually we do this into system that could belong on one of two types.  One of this is schema on write. With this approach we have to define columns, data formats and so on. During the reading  every user will observe the same data set. As soon as we performed ETL (transform data in format that mostly convenient to some particular system), reading will be pretty fast and overall system performance will be pretty good. But you should keep in mind, that we already paid penalty for this when were loading data. Like example of schema on write system you could consider Relational data base, for example, like Oracle or MySQL.

Schema on Write

Another approach is schema on read. In this case we load data as-is without any changing and transformations.  With this approach we skip ETL (don’t transform data) step and we don’t have any headaches with data format and data structure. Just load file on file system, like coping photos from FlashCard or external storage to your laptop’s disk. How to interpret data you will decide during the data reading. Interesting stuff that the same data (same files) could be read in different manner. For instance, if you have some binary data and you have to define Serialization/Deserialization framework and using it within your select, you will have some structure data, otherwise you will get set of the bytes. Another example, even if you have simplest CSV files you could read the same column like a Numeric or like a String. It will affect on different results for sorting or comparison operations.

Schema on Read

Hadoop Distributed File System is classical example of schema on read system.More details about Schema on Read and Schema on Write approach you could findhere. Now we are going to talk about data loading data into HDFS. I hope after explanation above, you understand that data loading into Hadoop is not equal of ETL (data doesn’t transform).

[Read More]

Thursday Dec 24, 2015

Oracle Big Data Lite 4.3.0 is Now Available on OTN

Big Data Lite 4.3.0 is now available on OTN

This latest release is packed with new features - here's the inventory of what's included:

  • Oracle Enterprise Linux 6.7
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.7)
  • Cloudera Manager (5.4.7)
  • Oracle Big Data Spatial and Graph 1.1
  • Oracle Big Data Discovery 1.1.1
  • Oracle Big Data Connectors 4.3
    • Oracle SQL Connector for HDFS 3.4.0
    • Oracle Loader for Hadoop 3.5.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.1
    • Oracle XQuery for Hadoop 4.2.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.4.7)
  • Oracle Table Access for Hadoop and Spark 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1.2 with Oracle REST Data Services 3.0
  • Oracle Data Integrator 12cR1 (12.2.1)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.2.0
  • Oracle Perfect Balance 2.5.0
Also, this release is using github as the repository for all of our sample code (  This gives us a great mechanism for updating the samples/demos between releases.  Users simply double click the "Refresh Samples" icon on the desktop to download the latest collateral.

Friday Oct 23, 2015

Performance Study: Big Data Appliance compared with DIY Hadoop

Over the past couple of months a team of Intel engineers have been working with our engineers on Oracle Big Data Appliance and performance, especially in ensuring a BDA outperforms DIY Hadoop out of the box. The good news is that your BDA, as you know it today is already 1.2x faster. We are now working to include a lot of the findings in BDA 4.3 and subsequent versions, so we are steadily expanding that 1.2x into a 2x out of box performance advantage. And that is all above and beyond the faster time to value a BDA delivers, as well as on top of the low cost you can get it for. 

Read the full paper here.

But, we thought we should add some color to all of this, and if you are at Openworld this year, come listen to Eric explain all of this in detail on Monday October 26th at 2:45 in Moscone West room 3000.

If you can't make it, here is a short little dialog we had over the results and both Eric and Lucy's take on the work they did and what they are up to next.

Q: What was the most surprising finding in tuning the system?

A: We were surprised how well the BDA performed right after its installation. Having worked for over 5 years on Hadoop, we understand it is a long iterative process to extract the best possible performance out of your hardware. BDA was a well-tuned machine and we were a little concerned we might not have much value to add... 

Q: Anything that you thought was exciting but turned out to be not such a big thing?

A: We were hoping for 5x gains from our work, but only got 2x... But, in all seriousness, we were hoping for better results from some of our memory and Java garbage collection tuning. Unfortunately they only resulted in marginal single digits gains. 

Q: What is next?

A: There is a long list of exciting new products coming from Intel in the coming year; such as hardware accelerated compression, 3d-Xpoint, almost zero latency PCIE SSDs and not to forget new processors. We are excited at the idea of tightly integrating them all with Big Data technologies! What is a better test bed that the BDA? A full software/hardware solution!

Looks like we have a lot of fun things to go work on and with, as well as of course looking into performance improvements for BDA in light of Apache Spark.

See you all at Openworld, or once again, read the paper here

Tuesday Oct 13, 2015

Big Data SQL 2.0 - Now Available

With the release of Big Data SQL 2.0 it is probably time to do a quick recap and introduce the marquee features in 2.0. The key goals of Big Data SQL are to expose data in its original format, and stored within Hadoop and NoSQL Databases through high-performance Oracle SQL being offloaded to Storage resident cells or agents. The architecture of Big Data SQL closely follows the architecture of Oracle Exadata Storage Server Software and is built on the same proven technology.

Retrieving Data With data in HDFS stored in an undetermined format (schema on read), SQL queries require some constructs to parse and interpret data for it to be processed in rows and columns. For this Big Data SQL leverages all the Hadoop constructs, notably InputFormat and SerDe Java classes optionally through Hive metadata definitions. Big Data SQL then layers the Oracle Big Data SQL Agent on top of this generic Hadoop infrastructure as can be seen below.

Accessing HDFS data through Big Data SQL

Because Big Data SQL is based on Exadata Storage Server Software, a number of benefits are instantly available. Big Data SQL not only can retrieve data, but can also score Data Mining models at the individual agent, mapping model scoring to an individual HDFS node. Likewise querying JSON documents stored in HDFS can be done with SQL directly and is executed on the agent itself.

Smart Scan

Within the Big Data SQL Agent, similar functionality exists as is available in Exadata Storage Server Software. Smart Scans apply the filter and row projections from a given SQL query on the data streaming from the HDFS Data Nodes, reducing the data that is flowing to the Database to fulfill the data request of that given query. The benefits of Smart Scan for Hadoop data are even more pronounced than for Oracle Database as tables are often very wide and very large. Because of the elimination of data at the individual HDFS node, queries across large tables are now possible within reasonable time limits enabling data warehouse style queries to be spread across data stored in both HDFS and Oracle Database.

Storage Indexes

Storage Indexes - new in Big Data SQL 2.0 - provide the same benefits of IO elimination to Big Data SQL as they provide to SQL on Exadata. The big difference is that in Big Data SQL the Storage Index works on an HDFS block (on BDA – 256MB of data) and span 32 columns instead of the usual 8. Storage Index is fully transparent to both Oracle Database and to the underlying HDFS environment. As with Exadata, the Storage Index is a memory construct managed by the Big Data SQL software and invalidated automatically when the underlying files change.

Concepts for Storage Indexes

Storage Indexes work on data exposed via Oracle External tables using both the ORACLE_HIVE and ORACLE_HDFS types. Fields are mapped to these External Tables and the Storage Index is attached to the Oracle (not the Hive) columns, so that when a query references the column(s), the Storage Index - when appropriate - kicks in. In the current version, Storage Index does not support tables defined with Storage Handlers (ex: HBase or Oracle NoSQL Database).

Compound Benefits

The Smart Scan and Storage Index features deliver compound benefits. Where Storage Indexes reduces the IO done, Smart Scan then enacts the same row filtering and column projection. This latter step remains important as it reduces the data transferred between systems.

To learn more about Big Data SQL, join us at Open World in San Francisco at the end of the month.

Thursday Sep 03, 2015

Oracle Big Data Lite 4.2.1 - Includes Big Data Discovery

We just released Oracle Big Data Lite 4.2.1 VM.  This VM provides many of the key big data technologies that are part of Oracle's big data platform.  Along with all the great features of the previous version, Big Data Lite now adds Oracle Big Data Discovery 1.1:

The list of big data capabilities provided by the virtual machine continues to grow.  Here's a list of all the products that are pre-configured:

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Discovery 1.1
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0 
Take it for a spin - and check out the tutorials and demos that are available from the Big Data Lite download page.

Tuesday Jul 07, 2015

Update to BDA X5-2 provides more flexibility and capacity with no price changes

As more people pick up big data technologies, we see the workloads run on these big data system evolve and diversify. The initial workloads were all Map Reduce and fit a specific, (micro) batch workload pattern. Over the past couple of years that has changed and that change is reflected in the Hadoop tools - specifically with YARN. While there is still quite a bit of batch work being done, typically using Map Reduce (think Hive, Pig etc) we are seeing our customers move to more mixed workloads where the batch work is augmented with both more online SQL as well as more streaming workloads.

More Horsepower - More Capacity 

The change towards a more mixed workload leads us to change the shape of the underlying hardware. Systems now shift away from the once sacred "1-core to 1-disk ratio" and also from the small memory footprints for the worker nodes.

With the BDA X5-2 update in December 2014, BDA doubled the base memory configuration and added 2.25x more CPU resources in every node as well as upgrading to Intel's fastest Xeon E5 CPU. BDA X5-2 now has 2 * 18 Xeon cores to enable CPU intense workloads like analytics SQL queries using Oracle Big Data SQL, machine learning, graph applications etc.

With the processing covered for these more mixed workloads, we looked at other emerging trends or workloads and their impact on the BDA X5-2 hardware. The most prominent trend we see in big data are the large data volumes we expect to see from the Internet of Things (IoT) explosion and the potential cost associated with storing that data.

To address this issue (and storage cost in general) we are now adding 2x more disk space onto each and every BDA disk doubling the total available space on the system while keeping the list price constant. That is correct, 2x capacity but no price change! 

More Flexibility

And if that isn't enough, we are also changing the way our customers can grow their systems by introducing BDA Elastic Configurations.

As we see customers build out production in large increments, we also see a need to be more flexible in expanding the non-production environments (test, qa and performance environments). BDA X5-2 Elastic Configurations enables expansion of a system in 1-node increments by adding a BDA X5-2 High Capacity (HC) plus InfiniBand Infrastructure into a 6-node Starter Rack.

The increased flexibility enables our customers to start with a production scale cluster of 6 nodes (X5-2 or older) and then increment within the base rack up to 18 nodes, then expand across racks without any additional switching (no top of rack required, all on the same InfiniBand network) to build large(r) clusters. The expansion is of course all supported from the Oracle Mammoth configuration utility and its CLI, greatly simplifying expansion of clusters.

Major Improvement, No Additional Cost

Over the past generations BDA has been quickly adopted to changing usage and workload patterns enabling the adoption of Hadoop into the data ecosystem with minimal infrastructure disruption but with maximum business benefits. The latest update to BDA X5-2 enables flexibility, delivers more storage capacity and runs more workloads then ever before. 

For more information see the BDA X5-2 Data Sheet on OTN

Saturday Jun 20, 2015

Oracle Big Data Spatial and Graph - Installing the Image Processing Framework

Oracle Big Data Lite 4.2 was just released - and one of the cool new features is Oracle Spatial and Graph.  In order to use this new feature, there is one more configuration step required.  Normally, we include everything you need in the VM - but this is a component that we couldn't distribute.

For the Big Data Spatial Image Processing Framework, you will need to install and configure Proj.4 - Cartographic Projections Library.  Simply follow these steps: 

  • Start the Big Data Lite VM and log in as user "oracle"
  • Launch firefox and download this tarball (​ to ~/Downloads
  • Run the following commands at the linux prompt:
    • cd ~/Downloads
    • tar -xvf proj-4.9.1.tar.gz
    • cd proj-4.9.1
    • ./configure
    • make
    • sudo make install

This will create the file in directory /usr/local/lib/.  Now that the file has been created, create links to it in the appropriate directories.  At the linux prompt:

  • sudo ln -s /usr/local/lib/ /u02/oracle-spatial-graph/shareddir/spatial/demo/imageserver/native/
  • sudo ln -s /usr/local/lib/ /usr/lib/hadoop/lib/native/

That's all there is to it.  Big Data Lite is now ready for Orace Big Data Spatial and Graph!

Oracle Big Data Lite 4.2 Now Available!

Oracle Big Data Lite Virtual Machine 4.2 is now available on OTN.  For those of you that are new to the VM - it is a great way to get started with Oracle's big data platform.  It has a ton of products installed and configured - including: 

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0

Check out our new product - Oracle Big Data Spatial and Graph (and don't forget to read the blog post on a small config update you'll need to make to use it).  It's a great way to find relationships in data and query and visualize geographic data.  Speaking of analysis... Oracle R Advanced Analytics for Hadoop now leverages Spark for many of its algorithms for (way) faster processing.

 But, that's just a couple of features... download the VM and check it out for yourself :). 

Friday May 15, 2015

Big Data Spatial and Graph is now released!

Cross-posting this from the announcement of the new spatial and graph capabilities. You can get more detail on OTN.

The product objective is to provide spatial and graph capabilities that are best suited to the use cases, data sets, and workloads found in big data environments.  Oracle Big Data Spatial and Graph can be deployed on Oracle Big Data Appliance, as well as other supported Hadoop and NoSQL systems on commodity hardware. 

Here are some feature highlights.   

Oracle Big Data Spatial and Graph includes two main components:

  • A distributed property graph database with 35 built-in graph analytics to
    • discover graph patterns in big data, such as communities and influencers within a social graph
    • generate recommendations based on interests, profiles, and past behaviors
  • A wide range of spatial analysis functions and services to
    • evaluate data based on how near or far something is to one another, or whether something falls within a boundary or region
    • process and visualize geospatial map data and imagery

Property Graph Data Management and Analysis

The property graph feature of Oracle Big Data Spatial and Graph facilitates big data discovery and dynamic schema evolution with real-world modeling and proven in-memory parallel analytics. Property graphs are commonly used to model and analyze relationships, such as communities, influencers and recommendations, and other patterns found in social networks, cyber security, utilities and telecommunications, life sciences and clinical data, and knowledge networks.  

Property graphs model the real-world as networks of linked data comprising vertices (entities), edges (relationships), and properties (attributes) for both. Property graphs are flexible and easy to evolve; metadata is stored as part of the graph and new relationships are added by simply adding a edge. Graphs support sparse data; properties can be added to a vertex or edge but need not be applied to all similar vertices and edges.  Standard property graph analysis enables discovery with analytics that include ranking, centrality, recommender, community detection, and path finding.

Oracle Big Data Spatial and Graph provides an industry leading property graph capability on Apache HBase and Oracle NoSQL Database with a Groovy-based console; parallel bulk load from common graph file formats; text indexing and search; querying graphs in database and in memory; ease of development with open source Java APIs and popular scripting languages; and an in-memory, parallel, multi-user, graph analytics engine with 35 standard graph analytics.

Spatial Analysis and Services Enrich and Categorize Your Big Data with Location

With the spatial capabilities, users can take data with any location information, enrich it, and use it to harmonize their data.  For example, Big Data Spatial and Graph can look at datasets like Twitter feeds that include a zip code or street address, and add or update city, state, and country information.  It can also filter or group results based on spatial relationships:  for example, filtering customer data from logfiles based on how near one customer is to another, or finding how many customers are in each sales territory.  These results can be visualized on a map with the included HTML5-based web mapping tool.  Location can be used as a universal key across disparate data commonly found in Hadoop-based analytic solutions. 

Also, users can perform large-scale operations for data cleansing, preparation, and processing of imagery, sensor data, and raw input data with the raster services.  Users can load raster data on HDFS using dozens of supported file formats, perform analysis such as mosaic and subset, write and carry out other analysis operations, visualize data, and manage workflows.  Hadoop environments are ideally suited to storing and processing these high data volumes quickly, in parallel across MapReduce nodes.  

Learn more about Oracle Big Data Spatial and Graph at the OTN product website:

Read the Data Sheet

Read the Spatial Feature Overview

Tuesday Apr 14, 2015

Statement of Direction -- Big Data Management System

Click here to start reading the Full Statement of Direction. 

Introduction: Oracle Big Data Management System Today 

As today's enterprises embrace big data, their information architectures must evolve. Every enterprise has data warehouses today, but the best-practices information architecture embraces emerging technologies such as Hadoop and NoSQL. Today’s information architecture recognizes that data not only is stored in increasingly disparate data platforms, but also in increasingly disparate locations: on-premises and potentially multiple cloud platforms. The ideal of a single monolithic ‘enterprise data warehouse’ has faded as a new more flexible architecture has emerged. Oracle calls this new architecture the Oracle Big Data Management System, and today it consists of three key components

  • The data warehouse, running on Oracle Database and Oracle Exadata Database Machine, is the primary analytic database for storing much of a company’s core transactional data: financial records, customer data, point- of-sale data and so forth. Despite now being part of a broader architecture, the requirements on the RDBMS for performance, scalability, concurrency and workload management are in more demand than ever; Oracle Database 12c introduced Oracle Database In-Memory (with columnar tables, SIMD processing, and advanced compression schemes) as latest in a long succession of warehouse-focused innovations. The market-leading Oracle Database is the ideal starting point for customers to extend their architecture to the Big Data Management System.
  • The ‘data reservoir’, hosted on Oracle Big Data Appliance, will augment the data warehouse as a repository for the new sources of large volumes of data: machine-generated log files, social-media data, and videos and images -- as well as a repository for more granular transactional data or older transactional data which is not stored in the data warehouse. Oracle’s Big Data Management System embraces complementary technologies and platforms, including open-source technologies: Oracle Big Data Appliance includes Cloudera’s Distribution of Hadoop and Oracle NoSQL Database for data management.
  • A ‘franchised query engine,’ Oracle Big Data SQL, enables scalable, integrated access in situ to the entire Big Data Management System. SQL is the accepted language for day-to-day data access and analytic queries, and thus SQL is the primary language of the Big Data Management System.  Big Data SQL enables users to combine data from Oracle Database, Hadoop and NoSQL sources within a single SQL statement.  Leveraging the architecture of Exadata Storage Software and the SQL engine of the Oracle Database, Big Data SQL delivers high-performance access to all data in the Big Data Management System.

Using this architecture, the Oracle Big Data Management System combines the performance of Oracle’s market-leading relational database, the power of Oracle’s SQL engine, and the cost-effective, flexible storage of Hadoop and NoSQL. The result is an integrated architecture for managing Big Data, providing all of the benefits of Oracle Database, Exadata, and Hadoop, without the drawbacks of independently-accessed data repositories.  

Note that the scope of this statement of direction is the data platform for Big Data. An enterprise Big Data solution would also be comprised of big data tools and big data applications built upon this data platform. 

Read the full Statement of Direction -- Big Data Management System here.

Wednesday Apr 08, 2015

Noteworthy event for big data and data warehousing

I know there is a choice of conferences and events, but this one strikes me as very interesting. The thing that (amongst a whole bunch of other interesting sessions) strikes me as extremely interesting is the extra day with a master class on Information Architecture in the "big data world". Officially this is called Oracle Information Management and Big Data Reference architecture, and I think we - as a community - can all use some more foundation on information management. Or maybe, a nice refresher for those of us who were in the weeds of our daily jobs?

On top of the master class, there are a whole load of interesting sessions. Some are really on the consumer side, some are focused on the back-end, but I think both perspectives are very useful.

Some of the sessions that look very promising to me (in Brighton) are "A journey into big data and analytics - Liberty Global" and in either location anything on Big Data Discovery - really cool stuff, and something I think will change how we derive value from big data.

So, where do you go? Brighton and Atlanta of course...

Read more about the event here.

PS. I'm hearing our own Reiner Zimmermann is doing a keynote in Brighton! 

Wednesday Mar 18, 2015

Production workloads blend Cloud and On-Premise Capabilities

Prediction #7 - blending production workloadsacross cloud and on-premise in Oracle's Enterprise Big Data Predictions 2015 is a tough nut to crack. Yet, we at Oracle think this is really the direction we all will go. Sure we can debate the timing, and whether or not this happens in 2015, but it is something that will come to all of us who are looking towards that big data future. So let’s discuss what we think is really going to happen over the coming years in the big data and cloud world.

Reality #1 – Data will live both in the cloud and on-premise

We see this today. Organizations run Human Capital Management systems in the cloud, integrate these with data from outside cloud based systems (think for example LinkedIn, staffing agencies etc.) while their general data warehouses and new big data systems are all deployed as on-premise systems. We also see the example in the prediction where various auto dealer systems uplink into the cloud to enable the manufacturer to consolidate all of their disparate systems. This data may be translated into data warehouse entries and possibly live in two worlds – both in the cloud and on-premise for deep diving analytics or in aggregated form.

Reality #2 – Hybrid deployments are difficult to query and optimize

We also see this today and it is one of the major issues of living in the hybrid world of cloud and on-premise. A lot of the issues are driven by low level technical limitations, specifically in network bandwidth and upload / download capacity into and out of the cloud environment. The other challenges are really (big) data management challenges in that they go into the art of running queries across two ecosystems with very different characteristics. We see a trend to use engineered systems on-premise, which delivers optimized performance for the applications, but in the cloud we often see virtualization pushing the trade-off towards ease of deployment and ease of management. These completely different ecosystems make optimization of queries across them very difficult.

Solution – Equality brings optimizations to mixed environments

As larger systems like big data and data warehouse systems move to the cloud, better performance becomes a key success criterion. Oracle is uniquely positioned to drive both standardization and performance optimizations into the cloud by deploying on engineered systems like Oracle Exadata and Oracle Big Data Appliance. Deploying engineered systems enables customers to run large systems in the cloud delivering performance as they see today in on-premise deployments. This then means that we do not live in a world divided in slow and fast, but in a world of fast and fast.
This equivalence also means that we have the same functionality in both worlds, and here we can sprinkle in some – future – Oracle magic, where we start optimizing queries to take into account where the data lives, how fast we can move it around (the dreaded networking bandwidth issue) and where we need to execute code. Now, how are we going to do this? That is a piece magic, and you will just need to wait a bit… suffice it to say we are hard at work at solving this challenging topic.


The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.


« February 2016