Thursday Feb 04, 2016

Using Spark(Scala) and Oracle Big Data Lite VM for Barcode & QR Detection

Big Data and Scalable Image Processing and Analytics

Guest post by Dave Bayard - Oracle's Big Data Pursuit Team 

One of the promises of Big Data is its flexibility to work with large volumes of unstructured types of data such as images and photos. In todayís world, there are many sources of images including social media photos, security cameras, satellite images, and more. There are many kinds of image processing and analytics that are possible from optical character recognition (OCR), license plate detection, bar code detection, face recognition, geological analysis and more. And there are many open source libraries such as OpenCV, Tesseract, ZXing, and others that are available to leverage.

This blog and demonstration was built to show that scalable image analytics does not need to be difficult. Our primary goal in this blog is to use the Oracle Big Data Lite VM environment to demonstrate how to take an open source library and combine it into a Spark/Scala application. In particular, we will use Spark alongside the ZXing (Zebra Crossing) library to detect barcodes and QR Codes from a set of image files.

It should be also noted that instead of writing our own Spark application that we could instead have leveraged Oracleís Multimedia Analytics framework that comes as a feature of Oracle Big Data Spatial and Graph. For instance, the Multimedia Analytics framework could let us easily support videos as well as still images. In a later blog post, we will extend this example and show how to make it work with the Multimedia Analytics framework. For now you can learn more about the Multimedia Analytics framework here:

This blog has the following objectives:

  • 1.Learn about the ZXing library for barcode and QR code detection.
  • 2.Learn how to build and run a simple Spark(Scala) application using the Oracle Big Data Lite VM.
  • 3.Learn how to build and run a Spark(Scala+Java) application using the ZXing libraries.

Above is an example of a photo containing a QR code. One potential application could include going through a large set of photos and identifying which photos have QR codes and the content of those QR codes. Not everyone will have a set of photos with QR codes in them nor will they have a need to scan them in bulk, but the concepts of this blog (showing how to use an open source image library alongside Spark and Scala) should still apply- just substitute QR code detection libraries with the image processing libraries of your choice.

Initial Oracle Big Data Lite VM Setup:

This demonstration uses the Oracle Big Data Lite VM version, which is available here: . Version of the VM includes CDH 5.4.7, Spark 1.3.0, and Scala 2.10.4.

Once you have the Big Data Lite VM downloaded, imported, and running, then click on the ìRefresh Samplesî icon on the VM desktop to refresh the samples. At this point, you should fine the files needed for this blog are available under your /home/oracle/src/Blogs/SparkBarcode directory.

[Note: If you want to get access to the files referenced in this blog outside of the Oracle Big Data LiteVM, you can find them here: ]

Now run the script located at /home/oracle/src/Blogs/SparkBarcode. The script will download the necessary files for the open source libraries ZXing and SBT as well as copy some sample image files into hdfs.

ZXing (Zebra Crossing) for Barcode Detection:

ZXing (pronounced ìZebra Crossingî) is an open source library that does one-dimensional (barcode) and two-dimensional (QR code) image detection. It is written in java. You can find out more at the website:

As a quick example of ZXing, we can use the ZXing projectís hosted web application to demonstrate its functionality. In the Big Data Lite VM, open up the firefox browser and go to . On the web page, click on the ìBrowseî button and use the file browser to open the /home/oracle/src/Blogs/SparkBarcode/images/test.jpg file. Then click the ìSubmit Queryî button on the web page.

The test.jpg file is a photo containing a 2-dimensional QR code. When you submit the image, the web application should return

The ZXing project has kindly made available the source code for the web application. To understand how to interact with the ZXing API, we can look at how the web application worked and focus on the source of the DecodeServlet class, which is available here:

In particular, navigate to the processImage() method which illustrates how the ZXing APIs (such as MultipleBarcodeReader, MultiFormatReader, etc) are used:

  private static void processImage(BufferedImage image,
                                   HttpServletRequest request,
                                   HttpServletResponse response) throws IOException, ServletException {

    LuminanceSource source = new BufferedImageLuminanceSource(image);
    BinaryBitmap bitmap = new BinaryBitmap(new GlobalHistogramBinarizer(source));
    Collection results = new ArrayList<>(1);

    try {

      Reader reader = new MultiFormatReader();
      ReaderException savedException = null;
      try {
        // Look for multiple barcodes
        MultipleBarcodeReader multiReader = new GenericMultipleBarcodeReader(reader);
        Result[] theResults = multiReader.decodeMultiple(bitmap, HINTS);
        if (theResults != null) {
      } catch (ReaderException re) {
        savedException = re;
      if (results.isEmpty()) {
        try {
          // Look for pure barcode
          Result theResult = reader.decode(bitmap, HINTS_PURE);
          if (theResult != null) {
        } catch (ReaderException re) {
          savedException = re;
      if (results.isEmpty()) {
        try {
          // Look for normal barcode in photo
          Result theResult = reader.decode(bitmap, HINTS);
          if (theResult != null) {
        } catch (ReaderException re) {
          savedException = re;
      if (results.isEmpty()) {
        try {
          // Try again with other binarizer
          BinaryBitmap hybridBitmap = new BinaryBitmap(new HybridBinarizer(source));
          Result theResult = reader.decode(hybridBitmap, HINTS);
          if (theResult != null) {
        } catch (ReaderException re) {
          savedException = re;
      if (results.isEmpty()) {
        try {
          throw savedException == null ? NotFoundException.getNotFoundInstance() : savedException;
        } catch (FormatException | ChecksumException e) {
          errorResponse(request, response, "format");
        } catch (ReaderException e) { // Including NotFoundException
          errorResponse(request, response, "notfound");

    } catch (RuntimeException re) {
      // Call out unexpected errors in the log clearly
      log.log(Level.WARNING, "Unexpected exception from library", re);
      throw new ServletException(re);

    String fullParameter = request.getParameter("full");
    boolean minimalOutput = fullParameter != null && !Boolean.parseBoolean(fullParameter);
    if (minimalOutput) {
      try (Writer out = new OutputStreamWriter(response.getOutputStream(), StandardCharsets.UTF_8)) {
        for (Result result : results) {
    } else {
      request.setAttribute("results", results);
      request.getRequestDispatcher("decoderesult.jspx").forward(request, response);

We can observe that the code makes a series of attempts to identify barcodes/QR codes. If one attempt does not find a barcode, it attempts again using a different method or parameter/hint. This can help the code to better tolerate image detection challenges like variations in lighting, reflection, angle, resolution, image quality, etc. (If you are interested in other ways to use the ZXing APIs, an alternative example is the decode() method in the DecodeWorker class, which can be found here: )

Our next step is to get started coding with ZXing by building a simple standalone Java application. Thanks to the script you ran earlier, we have already downloaded the necessary ZXing java jar libraries, as described in the ZXing Getting Started guide at

Our simple java standalone application will be based off the processImage() method from the web application. The source of the sample application can be found in the file, located in the /home/oracle/src/Blogs/SparkBarcode/SimpleJavaApp/barcodedemo subdirectory.

Essentially, we copied the web applicationís processImage() method and removed the dependencies on the http request and response objects. Explore the source code to see what we mean.

Now run the script ìrun_simple_java.shî to both compile and run our sample test program against the test image.

Our simple standalone java application using ZXing libraries is a success!

Spark Development: A First Scala Application

Our ultimate goal is to combine the open source ZXing library with Spark and run it using the resources of our CDH cluster. Specifically, we want to build a Scala Application that calls the java ZXing libraries using the Spark on Yarn Framework to run our barcode detection on a set of images in parallel. But before we attempt that final application, we will first start with a simple Scala application that uses only Javaís built-in libraries.

If you are new to Spark, you should check out . If you are new to the Scala language, you can find some quick language tips at . This blog assumes you have some experience with working with basic Spark/Scala examples such as word count using the interactive spark-shell.

To help us with our Scala application development, we will want to add the ìsbtî utility to the Big Data Lite VM. SBT is frequently used to manage the build process for Scala applications, much like maven is used with Java applications. The ìsetup.shî script you ran earlier downloaded sbt for the Big Data Lite VM. If you want more information about SBT, you can navigate to here:

Another requirement is to prepare a directory structure for our Spark/Scala application. We will follow the template directory structure described in .

The directories and files for the simple application have been created for you and are located at the /home/oracle/src/Blogs/SparkBarcode/SimpleScalaApp subdirectory. They look likeÖ

There are 2 key files. The simple.sbt file is the build file that sbt uses and contains information such as dependencies on other libraries. The SimpleApp.scala file is the application source. The source looks like:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import java.awt.image.BufferedImage
import javax.imageio.ImageIO

object SimpleApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Scala Image App")
    val sc = new SparkContext(conf)
    val files = sc.binaryFiles("barcode/images/test.jpg")
    val imageResults =
         result => println("\nFile:"+result._1+" Width:"+result._2+" Height:"+result._3)
                          ); //this println gets printed to the main stdout


  def processSparkImage (
                    file: (String, org.apache.spark.input.PortableDataStream)
                   ) : (String, Int, Int)  =
    println("In processSparkImage for "+file._1) //this println goes to the executor stdout
    val img: BufferedImage =
    val height = img.getHeight()
    println("Height is "+height+" for "+file._1)

    return (file._1, img.getWidth(), img.getHeight)
The simple application is based on the Spark quick startís example Scala standalone application- the big changes are that we have modified it to use Sparkís sc.binaryFiles() instead of sc.textFile(), and we have created a processSparkImage() function. Our processSparkImage() function uses standard Java Image APIs to extract the width and height of an image.

At the time of this writing, there does not seem to be much published about sc.binaryFiles(), so it is worth a little bit of extra explanation. The output of sc.binaryFiles is a set of tuples. The first element is the filename and the second element is a Spark PortableDataStream of the contents. In Scala notation, the body of processSparkImage() uses file._2 to point to the PortableDataStream and file._1 to point to the filename. The PortableDataStream can be used where you would use an InputStream.

The rest of the code is pretty standard stuff for initializing the SparkContext, etc.

Run the ìbuild_simple_scala.shî script. If this is your first time running the build script, be patient- it will take a dozen minutes or so as the sbt tool will do a one-time download of supporting libraries to prepare the scala environment for Spark application compilation/development.

Once the build is done, run the ìrun_simple_scala.shî script. Your output should look like below:

Notice that the simple application has printed out the Width and Height of the test image, as expected. Also notice that the run script has given you the URLs of the YARN, Hue, and Spark web UIs. In the terminal window, you can right click on those URLs and choose ìOpen LinkÖî to easily view them. Below are screenshots of the Hue, YARN Resource Manager, and Spark History web UIs:

Building our Scala+ZXing Spark Application:

Now, we can move onto integrating the java ZXing library into our Scala application code. We have done so in the application located at /home/oracle/src/Blogs/SparkBarcode/ScalaBarcodeApp.

Letís look at the code source directory.

Notice that there are both java and scala subdirectories. Under the java subdirectory, we have a barcodedemo.BarcodeProcessor class with our version of the processImage() function from our simple java application above.

Our scala code is BarcodeApp.scala. It looks like this:

/* BarcodeApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import barcodedemo.BarcodeProcessor

import java.awt.image.BufferedImage
import javax.imageio.ImageIO

object BarcodeApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Scala ZXing Barcode App")
    val sc = new SparkContext(conf)
    val files = sc.binaryFiles(args(0))

    val imageResults =

               result => println("\nFile:" + result._1 + "\n Results:" + result._2)
                          //this println gets written to the main stdout


  def processSparkImage (
                    file: (String, org.apache.spark.input.PortableDataStream)
                   ) : (String, String)  =
    println("In processSparkImage for "+file._1) //this println goes to the executor stdout

    val img: BufferedImage =

    return (file._1, BarcodeProcessor.processImage(img))


Notice that the scala code imports our java BarcodeProcessor class. The Scala processSparkImage() method calls the java BarcodeProcessor.processImage() method, which returns the barcode information in a string.

You can also look at the main method for the Scala application. It defines an RDD using sc.binaryFiles() based on the path defined by the first command-line argument args(0). This will allow us to test our application with 1 or many images by changing the command-line arguments of our run command. Then the application calls the map() transformation for the processSparkImage() method. This will cause the Spark executors to run the processSparkImage() method on each binaryFile. Finally, the Scala code collects the results and prints the output for each file.

We can also look at the barcode.sbt file for this application and notice that weíve included a dependency for the necessary ZXing libraries from the central maven repository, telling SBT to go ahead and download them as needed for our build.

name := "Scala ZXing Barcode Application"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

libraryDependencies += "" % "core" % "3.2.1"
libraryDependencies += "" % "javase" % "3.2.1"

Build our application by running ìbuild_barcode.shî.

Then run our application. If you want to run for the single test.jpg image, use ìrun_barcode.shî. If you want to test for a set of approx 25 images, then use ìrun_barcode_many.shî.

Here is the start, middle, and finish of ìrun_barcode_many.shî:

You will notice that a barcode was not detected in every image. In some cases, there simply wasnít a barcode or QR code in the image. In other cases, there was a barcode but the image might have been fuzzy or too much glare or something else.

Running on a real cluster:

To get a feel for the power of scalability, we can copy our Barcode application and deploy it onto real-world hardware. In our case, we will use an older 6-node Oracle Big Data Appliance X3-2. While the BDA X3-2 lacks the huge horsepower of the latest generation BDA X5-2, it can still give us a place to demonstrate scalability. In our example, we will run the ìrun_barcode_many.shî for 50 images against the Big Data Lite VM as well as against a 6-node X3-2 BDA.

The screenshot above shows details from running the Barcode app on the Big Data Lite VM on my laptop. It used 2 executors and ran in 1.3 minutes of clock time. The screenshot below shows details from running on the 6-node BDA X3-2. Notice the different number of Executors (in the Aggregated Metrics by Executor section). On the BDA, we used 50 executors and it took 5 seconds of clock time. Both scenarios used the same code and same set of images.

Notice how the Spark application was able to take advantage of the multiple machines and multiple CPUs of the BDA X3-2 to run in parallel and complete much faster (5seconds versus 1.3 minutes).

Moving Beyond:

Hopefully, this was a good start for your journey into Spark and Image Processing. Here are some possible future paths you could take:
  • You could learn more about Spark. To do so, I liked the book ìLearning Sparkî.
  • You could learn more about image detection and computer vision. To do so, I found the book ìSimpleCVî to explain the concepts well (SimpleCV is python-focused, but its explanation of concepts are useful to any language).
  • You could experiment with other libraries like OpenCV or Tesseract. Cloudera has a blog example using Spark with the open source Tesseract OCR library:
  • You could experiment with tweaking the sample java processImage() code to work as a custom FrameProcessor for the Oracle Big Data Spatial and Graph Multimedia Analytics feature. This could be used, for instance, to scan videos for barcodes. To do so, you can use processImage() as part of an implementation of the oracle.ord.hadoop.mapreduce.OrdFrameProcessor class. See or stay tuned for an upcoming blog/example.

NOTE: If you want to play around with the source files and make modifications, you should probably copy the SparkBarcode directory tree into a new directory outside of /home/oracle/src. This is because the ìRefresh Samplesî utility will wipe-out the /home/oracle/src directory every time it runs.


This blog has shown you how to do Barcode detection using the ZXing libraries. First, we illustrated ZXing in action by using the hosted web application at Next, we built and ran a simple standalone java application using the ZXing APIs. Then, we built a simple Spark(Scala) application that used the built-in java Image APIs (but not yet the ZXing APIs). We ran this Spark(Scala) application against the YARN cluster on our Big Data Lite VM. Then we built our final Spark(Scala) application which included our custom Java code that called the ZXing APIs. We ran this on our Big Data Lite VM YARN cluster for a set of sample images. Finally, we took the same code and images and ran them on a physical Big Data Appliance to show the benefit of scale-out parallelism across nodes and cpus.

Hopefully, this has made you more comfortable with working with tools like the Oracle Big Data Lite VM, Spark, Scala, sbt, and ZXing. Enjoy.

About the Author:

David Bayard is a member of the Big Data Pursuit team for Oracle North America Sales.

Tuesday Jan 19, 2016

Big Data SQL Quick Start. Introduction - Part1.

Today I am going to explain steps that required to start working with Big Data SQL. It’s really easy!  I hope after this article you all will agree with me. First, if you want to get caught up on what Big Data SQL is, I recommend that you read these blogs: Oracle Big Data SQL: One Fast Query, Big Data SQL 2.0 - Now Available.

The above blogs cover design goals of Big Data SQL. One of the goals of Big Data SQL is transparency. You just define table that links to some directory in HDFS or some table in HCatalog and continue working with it like with general Oracle Database table.It’s also useful to read the product documentation.

Your first query with Big Data SQL

Let’s start with simplest one example and query data that is actually stored in HDFS via Oracle Database using Big Data SQL. I’m going to begin this example by checking of the data that actually lies into HDFS. To accomplish this, I run the hive console and check hive table DDL:

hive> show create table web_sales;



  ws_sold_date_sk int,

 ws_sold_time_sk int,


  ws_net_paid_inc_ship float,

  ws_net_paid_inc_ship_tax float,

  ws_net_profit float)









From the DDL statement, we can see the data is text files (CSV), stored on HDFS in the directory:


From the DDL statement we can conclude that fields terminated by “|”. Trust, but verify – let’s check:

# hdfs dfs -ls /user/hive/warehouse/tpc_ds_3T/web_sales|tail -2

... hive 33400655 2015-05-11 13:00 /user/hive/warehouse/tpc_ds_3T/web_sales/part-01923

... hive 32787672 2015-05-11 13:00 /user/hive/warehouse/tpc_ds_3T/web_sales/part-01924

# hdfs dfs -cat /user/hive/warehouse/tpc_ds_3T/web_sales/part-01923|tail -2



Indeed, we have CSV files on HDFS. Let’s fetch it from the database.

New type of External table, new events and new item in the query plan

With Big Data SQL we introduce new types of External Tables (ORACLE_HIVE and ORACLE_HDFS), a new wait event (cell external table smart scan), and a new plan statement (External Table Access Storage Full). Over this HDFS directory, I’ve defined and Oracle External table, like this:













After table creation, I’m able to query data from the database. To begin, I run a very simple query that calculates the minimum value of some column, and has a filter on it.  Then, I can Oracle Enterprise Manager to determine how my query was processed:

SELECT min(w.ws_sold_time_sk) 


WHERE w.ws_sold_date_sk = 2451047

We can see the new type of the wait event “cell external table smart scan”:

and new item in plan statement - “external table access storage full”:

To make sure that your table now exists in Oracle dictionary you can run follow queries:


FROM user_objects t 





----------- -------------


Big Data SQL also adds a new member to Oracle’s metadata - ALL_HIVE_TABLES:

SQL> SELECT table_name,LOCATION,table_type





TABLE_NAME LOCATION                                 TABLE_TYPE

----------- -------------------------------------- -----------

web_returns hdfs://democluster-ns/.../web_returns EXTERNAL_TABLE

See, querying Hadoop with Oracle is easy! In my next blog posts, we’ll look at more complicated queries!

Thursday Jan 07, 2016

Data loading into HDFS - Part1

Today I’m going to start first article that will be devoted by very important topic in Hadoop world – data loading into HDFS. Before all, let me explain different approaches of loading and processing data in different IT systems.

Schema on Read vs Schema on Write

So, when we talking about data loading, usually we do this into system that could belong on one of two types.  One of this is schema on write. With this approach we have to define columns, data formats and so on. During the reading  every user will observe the same data set. As soon as we performed ETL (transform data in format that mostly convenient to some particular system), reading will be pretty fast and overall system performance will be pretty good. But you should keep in mind, that we already paid penalty for this when were loading data. Like example of schema on write system you could consider Relational data base, for example, like Oracle or MySQL.

Schema on Write

Another approach is schema on read. In this case we load data as-is without any changing and transformations.  With this approach we skip ETL (don’t transform data) step and we don’t have any headaches with data format and data structure. Just load file on file system, like coping photos from FlashCard or external storage to your laptop’s disk. How to interpret data you will decide during the data reading. Interesting stuff that the same data (same files) could be read in different manner. For instance, if you have some binary data and you have to define Serialization/Deserialization framework and using it within your select, you will have some structure data, otherwise you will get set of the bytes. Another example, even if you have simplest CSV files you could read the same column like a Numeric or like a String. It will affect on different results for sorting or comparison operations.

Schema on Read

Hadoop Distributed File System is classical example of schema on read system.More details about Schema on Read and Schema on Write approach you could findhere. Now we are going to talk about data loading data into HDFS. I hope after explanation above, you understand that data loading into Hadoop is not equal of ETL (data doesn’t transform).

[Read More]

Thursday Dec 24, 2015

Oracle Big Data Lite 4.3.0 is Now Available on OTN

Big Data Lite 4.3.0 is now available on OTN

This latest release is packed with new features - here's the inventory of what's included:

  • Oracle Enterprise Linux 6.7
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.7)
  • Cloudera Manager (5.4.7)
  • Oracle Big Data Spatial and Graph 1.1
  • Oracle Big Data Discovery 1.1.1
  • Oracle Big Data Connectors 4.3
    • Oracle SQL Connector for HDFS 3.4.0
    • Oracle Loader for Hadoop 3.5.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.1
    • Oracle XQuery for Hadoop 4.2.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.4.7)
  • Oracle Table Access for Hadoop and Spark 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1.2 with Oracle REST Data Services 3.0
  • Oracle Data Integrator 12cR1 (12.2.1)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.2.0
  • Oracle Perfect Balance 2.5.0
Also, this release is using github as the repository for all of our sample code (  This gives us a great mechanism for updating the samples/demos between releases.  Users simply double click the "Refresh Samples" icon on the desktop to download the latest collateral.

Friday Nov 20, 2015

Review of Data Warehousing and Big Data at #oow15

DW BigData Review Of OOW15

This year OpenWorld was bigger, more exciting and packed with sessions about the very latest technology and product features. Most importantly, both data warehousing and Big Data were at the heart of this year’s conference across a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key focus areas at this year’s conference were:
  • Database 12c for Data Warehousing
  • Big Data and the Internet of Things 
  • Cloud from on-premise to on the Cloud to running hybrid Cloud systems
  • Analytics and SQL continues to evolve to enable more and more sophisticated analysis. 
All these topics appeared across the main keynote sessions including live on-stage demonstrations of how many of our news features can be used to increase the performance and analytical capability of your data warehouse and big data management system - checkout the on-demand videos for the keynotes and executive interviews....
[Read More]

Thursday Sep 03, 2015

Oracle Big Data Lite 4.2.1 - Includes Big Data Discovery

We just released Oracle Big Data Lite 4.2.1 VM.  This VM provides many of the key big data technologies that are part of Oracle's big data platform.  Along with all the great features of the previous version, Big Data Lite now adds Oracle Big Data Discovery 1.1:

The list of big data capabilities provided by the virtual machine continues to grow.  Here's a list of all the products that are pre-configured:

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Discovery 1.1
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0 
Take it for a spin - and check out the tutorials and demos that are available from the Big Data Lite download page.

Thursday Jul 02, 2015

Using Oracle Big Data Spatial and Graph

Wondering how to get started with graph analyses?  The latest Oracle Big Data Lite VM includes Oracle's new spatial and graph toolkit for big data.  Check out these two blog posts that describe how to find interesting relationships in data:

 Pretty cool :)


Saturday Jun 20, 2015

Oracle Big Data Spatial and Graph - Installing the Image Processing Framework

Oracle Big Data Lite 4.2 was just released - and one of the cool new features is Oracle Spatial and Graph.  In order to use this new feature, there is one more configuration step required.  Normally, we include everything you need in the VM - but this is a component that we couldn't distribute.

For the Big Data Spatial Image Processing Framework, you will need to install and configure Proj.4 - Cartographic Projections Library.  Simply follow these steps: 

  • Start the Big Data Lite VM and log in as user "oracle"
  • Launch firefox and download this tarball (​ to ~/Downloads
  • Run the following commands at the linux prompt:
    • cd ~/Downloads
    • tar -xvf proj-4.9.1.tar.gz
    • cd proj-4.9.1
    • ./configure
    • make
    • sudo make install

This will create the file in directory /usr/local/lib/.  Now that the file has been created, create links to it in the appropriate directories.  At the linux prompt:

  • sudo ln -s /usr/local/lib/ /u02/oracle-spatial-graph/shareddir/spatial/demo/imageserver/native/
  • sudo ln -s /usr/local/lib/ /usr/lib/hadoop/lib/native/

That's all there is to it.  Big Data Lite is now ready for Orace Big Data Spatial and Graph!

Oracle Big Data Lite 4.2 Now Available!

Oracle Big Data Lite Virtual Machine 4.2 is now available on OTN.  For those of you that are new to the VM - it is a great way to get started with Oracle's big data platform.  It has a ton of products installed and configured - including: 

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0

Check out our new product - Oracle Big Data Spatial and Graph (and don't forget to read the blog post on a small config update you'll need to make to use it).  It's a great way to find relationships in data and query and visualize geographic data.  Speaking of analysis... Oracle R Advanced Analytics for Hadoop now leverages Spark for many of its algorithms for (way) faster processing.

 But, that's just a couple of features... download the VM and check it out for yourself :). 

Wednesday Mar 11, 2015

Oracle Big Data Lite 4.1 VM is available on OTN

Oracle Big Data Lite 4.1 VM is now available for download on OTN.  Big Data Lite includes many of the key capabilities of Oracle's big data platform.  Each of the components have been configure to work together - and there are many hands-on labs and demonstrations to help you get started using the system.  Below is a listing of what's included:

  • Oracle Enterprise Linux 6.5
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.3.0)
  • Cloudera Manager (5.3.0)
  • Oracle Big Data Connectors 4.1
    • Oracle SQL Connector for HDFS 3.2.0
    • Oracle Loader for Hadoop 3.3.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.1.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.2.5)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.3.0
  • Oracle CopyToBDA 1.1 



Tuesday Mar 03, 2015

Are you leveraging Oracle's database innovations for Cloud and Big data?

If you are interested in big data, Hadoop, SQL and data warehousing then mark your calendars because on March 18th at 10:00AM PST/1:00PM EST, you will be able to hear Tom Kyte (Oracle Database Architect) talk about how you can use Oracle Big Data SQL to seamlessly integrate all your Hadoop big data datasets with your relational schemas stored in Oracle Database 12c. As part of this discussion Tom will outline how you can build the perfect foundation for your enterprise big data management system using Oracle's innovative technology.

If you are working on a data warehousing project and/or a big data project then this is one webcast you will not want to miss so register today (click here) to hear the latest about Oracle Database innovations and best practices. The full list of speakers is:

Tom Kyte
Oracle Database Architect
Keith Wilcox
VP, Database Administration
Bill Callahan
Director, Principal Engineer,
CCC Information Services, Inc.

Why SQL is the natural language for data analysis

Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of especially “big data environments”, data stored and processed outside of classical relational systems, continues to be a significant challenge for the majority companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it. 

The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for amy kind of data analysis including big data and the benefits that this brings for application developers, DBAs and business users. 

[Read More]

Tuesday Nov 11, 2014

Optimizing Table Scans with Zone Maps

Most of you will be familiar with partition pruning, where the Oracle Database will avoid the need to scan table and index partitions based on query predicates. This optimization is transparent to your application, but for it to work, the database has to find a way of mapping a query filter predicate to the partitioning key column (or columns). Partition pruning can only occur if the query has predicates that match the predetermined shape of a partitioned object. For example, a query on a SALES table partitioned by ORDER_DATE will need to include ORDER_DATE in a join or WHERE clause for it to be optimized by partition pruning.

What if you could do better than this? What if you could prune partitions using a variety of column predicates and dimension hierarchies, irrespective of their appearance in the partitioning key? How about pruning at a much finer level of granularity than a partition? Perhaps we want to optimize queries that filter SALES by SHIP_DATE, STATE and COUNTY, as well as ORDER_DATE. The new Oracle zone map feature is designed to achieve this, and just like partitioning, zone maps are transparent to your queries; you don’t have to change your applications to make use of them.

Zone maps are available in Oracle Database 12c for Oracle Engineered Systems. Conceptually, they divide a table up into contiguous regions of blocks called zones (the default zone size being 1024 blocks). For each zone, the Oracle database records the minimum and maximum values for specified columns using a new database object called a zone map. Queries that filter on zone map columns have the potential to be optimized; it’s possible to prune zones that contain ranges of column values outside the match specified in the query predicate.

Consider a query that filters a sales table by (North American) state; in this case “CA”. A zone map on the STATE column will record the minimum and maximum values for this column for each zone in the table. This makes it possible to skip the zones that we can be certain won’t contain rows for “CA”.

Zone Map Representation

You are probably aware that Exadata storage cells and the Oracle database In-Memory Column Store uses similar storage index techniques, so what benefits do zone maps add? Besides the fact that you can control zone maps explicitly, the most significant difference between zone maps and storage indexes is that zone maps can be used to prune zones using column predicates from multiple (joined) tables. Consider a more realistic scenario, in which the SALES table doesn’t have a STATE column, but instead has a LOCATION_ID referencing a dimension table called LOCATIONS. This is our query for summing the sales figures in California:

SELECT SUM(amount)
FROM   sales     s,
       locations l
WHERE  s.location_id = l.location_id
AND    l.state = 'CA';

It would be great if we could avoid scanning zones in SALES that don’t contain rows associated with “CA”. Before we look at how we can do this, we’ll make the scenario even more realistic by assuming that LOCATIONS is a dimensional hierarchy of State and County, like this:

Zone Map Locations Table Example

 Each State is made up of multiple Counties, so “CA” will be associated with multiple LOCATION_ID values. If we want the “CA” rows in SALES, we’ll need to match the ones marked below in bold/red:

Zone Map Sales Table Example

If we want to optimize a scan for “CA” rows, we will have to address a few issues:

  • The SALES table does not have a STATE column, so no storage index structure on SALESs data will allow us to directly prune disk regions based on “CA”.
  • Table rows associated with “CA” are likely to be physically scattered throughout the SALES table, so it’s unlikely that these rows will be confined to a relatively small number of zones or disk regions. We might not be able to make efficient use of an Exadata storage index on SALES.LOCATION_ID, if any (note that I am consciously ignoring the push down of BLOOM FILTERs to Exadata here, which still suffers from the physical scattering).
  • A SALES storage index based on min/max Location ID is likely to be less efficient than using zones based on min/max State values, simply because each State is made up of multiple Location IDs. This inefficiency is more pronounced if Location IDs for “CA” are not numerically close to one another - it will reduce the chances that the Location IDs we’re searching for will be found within the same min/max Location ID regions.

Of course, zone maps are designed to address these issues - with a little bit of help from another Oracle Database 12c feature called attribute clustering.  I introduced attribute clustering in an earlier post, but don’t worry if you haven’t read that yet; I’ll cover the basics here anyway. You’ve probably deduced that we can reduce the number of zones that contain “CA” rows if we cluster or sort the rows in SALES, keeping these rows close to one another, like this:

Table with Zone Map

Attribute clustering is the feature that’s used to cluster the rows together. Zone maps are used to record the min/max values for specified columns for each zone (and this can include column values derived from joins; LOCATIONS.STATE and LOCATIONS.COUNTY in our case).

The following DDL will create a zone map on our SALES fact table using the dimension table columns LOCATIONS.STATE and LOCATIONS.COUNTY. It will also enable attribute clustering, using the same columns to cluster the table’s rows:

JOIN locations ON (sales_ac.location_id = locations.location_id) 
BY LINEAR ORDER (locations.state, locations.county)

The LINEAR ORDER clause specifies a linear clustering algorithm, which is ideal for this example. Another algorithm is available; it is specified with "INTERLEAVED" and is optimized for more complex combinations of query predicates and dimension tables. Note that the definition of attribute clustering by itself does not change any data stored on disk; instead, it provides a directive for direct path operations; INSERT APPEND and MOVE that will physically perform the clustering operation for us. If there are pre-existing rows in SALES, we can MOVE the table (or its partitions) to re-order them.

Joins between SALES and the dimension table are now candidates for optimization when the query includes predicates on the dimension hierarchy “state” and “state, county”. For example:

SELECT SUM(amount)
FROM   sales
JOIN   locations  ON (sales.location_id = locations.location_id) 
WHERE  locations.state  = 'NM';
SELECT SUM(amount)
FROM   sales
JOIN   locations  ON (sales.location_id = locations.location_id) 
WHERE  locations.state  = 'CA'
AND    locations.county = 'Kern';

By clustering the rows and recording appropriate min/max column values for our zones, we have addressed all of the issues I identified above. What’s more, we can still get benefit from Exadata storage indexes because zone maps and storage indexes complement one another, and they work together transparently.

Zone maps are explicitly created and controlled by the database administrator on a table-by-table basis. They are an inherent part of the physical database design and can be thought of as a coarse anti-index structure (unlike an index, a zone map tells you what zones not to access). Zone maps are very compact, and in some cases it is possible to use them where you would otherwise use an index. This is most relevant in data warehousing environments where scanning is often more appropriate than indexed row retrieval, and where indexes can use a considerable amount of storage space. Zone maps must be refreshed to be synchronized with the underlying table data, so you will need to give some consideration to how you want them to be kept up-to-date if you decide to use them as an alternative to indexes.

In summary, take a look at zone maps if you want to:

  • Optimize scanning queries, particular when joining with one or more tables.
  • Reduce your dependency on indexes, particularly in data warehousing environments.
  • Improve performance in your data warehouse; particularly for star or snowflake schemas.

Here’s an example of using zone maps to optimize a table scan. To compare before and after, start by creating a table that has no zone map or attribute clustering:

CREATE TABLE sales_zm (sale_id NUMBER(10), customer_id NUMBER(10));

Insert 8 million rows with the following PL/SQL code. Why that many? With our example, we'll read one or two zones rather than the entire table, so I'm aiming to make the difference pretty obvious when you look at the block read statistics:

  i NUMBER(10);
  FOR i IN 1..80
    INSERT INTO sales_zm
    FROM   dual
    CONNECT BY LEVEL <= 100000;
EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_ZM');

Run the following query a few times to see what value “consistent gets” settles at:

FROM   sales_zm
WHERE  customer_id = 50;

On my machine, I read 7,545 blocks from the buffer cache, but since the value depends on some storage defaults don’t be surprised if your value is different:

Before Zone Map

The following DDL will create a zone map, but since attribute clustering is a property of the table (like compression), any existing rows will not be re-ordered:


The zone map will not be efficient until we cluster the rows together, so we’ll MOVE the table to achieve this. This will refresh the zone map too:


Run the same query a few times to see what value “consistent gets” settles at:

FROM   sales_zm
WHERE  customer_id = 50;

On my database, I read around 1,051 database blocks instead of 7,545: a considerable improvement:

After Zone Map

You'll find more examples covering zone maps and attribute clustering in the Oracle Learning Library and inside the Oracle Github repository. Full details on zone maps and attribute clustering can be found in the Oracle documentation library; particularly the Oracle 12c Database Data Warehousing Guide.

There's an earlier post on attribute clustering if you haven't read it already.

If there's anything to need to ask, or if you can't find what you need regarding zone maps or attribute clustering, please let me know by leaving a comment below. Thanks!

Friday Sep 26, 2014

Oracle Big Data Lite 4.0 Virtual Machine Now Available

Big Data Lite 4.0 is now available for download from OTN.  There are lots of new capabilities in this latest version:
  • Oracle Database 12c (, including new JSON support and Oracle Big Data SQL-enabled external tables.  Check out this hands-on lab to learn how to securely analyze all your data - across both Hadoop and Oracle Database 12c - using Big Data SQL.
  • New versions of SQL Developer and Data Modeler that support Hive access and automatic generation of Big Data SQL external tables
  • GoldenGate and the latest ODI versions are now included - with some great new hands-on labs.
  • Cloudera Manager is back - you can now optionally use CM to manage your Hadoop environment (requires 10GB memory devoted to the VM).  If you don't want to use CM, you can use the manual CDH configuration with the Big Data Lite services application
  • New versions of the entire stack... Big Data Connectors, NoSQL Database, CDH, JDeveloper and more.

Here's the inventory of all the features and version:

  • Oracle Enterprise Linux 6.4
  • Oracle Database 12c Release 1 Enterprise Edition ( - including Oracle Big Data SQL-enabled external tables, Oracle Advanced Analytics, OLAP, Spatial and more
  • Cloudera Distribution including Apache Hadoop (CDH5.1.2)
  • Cloudera Manager (5.1.2)
  • Oracle Big Data Connectors 4.0
    • Oracle SQL Connector for HDFS 3.1.0
    • Oracle Loader for Hadoop 3.2.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.4.1
    • Oracle XQuery for Hadoop 4.0.1
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.14)
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.0.3
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1

Monday Sep 15, 2014

Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL

Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL.  The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore.  

Let's take a look at what it takes to get started and then preview this new capability.

Setting up Connections to Hive

The first thing you need to do is set up a JDBC connection to Hive.  Follow these steps to set up the connection:

Download and Unzip JDBC Drivers

Cloudera provides high performance JDBC drivers that are required for connectivity:

  • Download the Hive Drivers from the Cloudera Downloads page to a local directory
  • Unzip the archive
    • unzip
  • Three zip files are contained within the archive.  Unzip the JDBC4 archive to a target directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below): 
    • unzip -d /home/oracle/jdbc/
    • Note: you will get an error when attempting to open a Hive connection in SQL Developer if you use a different JDBC version. Ensure you use JDBC4 and not JDBC41.

Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.

Update SQL Developer to use the Cloudera Hive JDBC Drivers

Update the preferences in SQL Developer to leverage the new drivers:

  • Start SQL Developer
  • Go to Tools -> Preferences
  • Navigate to Database -> Third Party JDBC Drivers
  • Add all of the jar files contained in the zip to the Third-party JDBC Driver Path.  It should look like the picture below:
    sql developer preferences

  • Restart SQL Developer

Create a Connection

Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2.  Click the New Connection button in the SQL Developer toolbar.  You'll need to have an ID, password and the port where Hiveserver2 is running:

connect to hiveserver2

The example above is creating a connection called hive which connects to Hiveserver2 on localhost running on port 10000.  The Database field is optional; here we are specifying the default database.

Using the Hive Connection

The Hive connection is now treated like any other connection in SQL Developer.  The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL:

sqldeveloper - view data in hive

And, you can use the SQL Worksheet to run custom queries, perform DDL operations - whatever is supported in Hive:


Here, we've altered the definition of a hive table and then queried that table in the worksheet.

Create Big Data SQL-enabled Tables Using Oracle Data Modeler

Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables.  Let's create a few tables using the metadata from the Hive Metastore.  Invoke the import wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item.  You will see the same connections found in the SQL Developer connection navigator:

pick a connection

After selecting the hive connection and a database, select the tables to import:

pick tables to import

There could be any number of tables here - in our case we will select three tables to import.  After completing the import, the logical table definitions appear in our palette:

imported tables

You can update the logical table definitions - and in our case we will want to do so.  For example, the recommended column in Hive is defined as a string (i.e. there is no precision) - which the Data Modeler casts as a varchar2(4000).  We have domain knowledge and understand that this field is really much smaller - so we'll update it to the appropriate size:

update prop

Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c.  Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet:

preview ddl

Edit the Table Definitions

The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing.  For example, edit table movieapp_log_json:

edit table props

You can update the source cluster for the data, how invalid records should be processed, how to map hive table columns to the corresponding Oracle table columns (if they don't match), and much more.

Query All Your Data

You now have full Oracle SQL access to data across the platform.  In our example, we can combine data from Hadoop with data in our Oracle Database.  The data in Hadoop can be in any format - Avro, json, XML, csv - if there is a SerDe that can parse the data - then Big Data SQL can access it!  Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies:

compare to blockbuster movies

Looks like they don't think too highly of them! Of course - the ratings data is fictitious ;)


The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.


« February 2016