Tuesday Jul 07, 2015

Update to BDA X5-2 provides more flexibility and capacity with no price changes

As more people pick up big data technologies, we see the workloads run on these big data systems evolve and diversify. The initial workloads were almost all MapReduce and fit a specific (micro) batch pattern. Over the past couple of years that has changed, and the change is reflected in the Hadoop tools - specifically in YARN. While there is still quite a bit of batch work being done, typically using MapReduce (think Hive, Pig etc.), we are seeing our customers move to more mixed workloads, where the batch work is augmented with both more online SQL and more streaming workloads.

More Horsepower - More Capacity 

The shift towards more mixed workloads changes the shape of the underlying hardware: systems move away from the once sacred "1-core to 1-disk" ratio and from the small memory footprints typical of worker nodes.

With the BDA X5-2 update in December 2014, BDA doubled the base memory configuration, added 2.25x more CPU resources in every node, and upgraded to Intel's fastest Xeon E5 CPU. BDA X5-2 now has 2 * 18 Xeon cores per node to enable CPU-intensive workloads such as analytic SQL queries using Oracle Big Data SQL, machine learning, and graph applications.

With the processing covered for these more mixed workloads, we looked at other emerging trends and their impact on the BDA X5-2 hardware. The most prominent trend we see in big data is the large data volumes we expect from the Internet of Things (IoT) explosion and the potential cost of storing that data.

To address this issue (and storage cost in general) we are now doubling the capacity of each and every BDA disk, doubling the total available space on the system while keeping the list price constant. That is correct: 2x capacity with no price change!

More Flexibility

And if that isn't enough, we are also changing the way our customers can grow their systems by introducing BDA Elastic Configurations.

As we see customers build out production in large increments, we also see a need for more flexibility in expanding non-production environments (test, QA, and performance environments). BDA X5-2 Elastic Configurations enable expansion of a system in 1-node increments by adding BDA X5-2 High Capacity (HC) nodes plus InfiniBand infrastructure to a 6-node Starter Rack.

The increased flexibility enables our customers to start with a production-scale cluster of 6 nodes (X5-2 or older), grow within the base rack up to 18 nodes, and then expand across racks without any additional switching (no top-of-rack switch required; everything stays on the same InfiniBand network) to build large(r) clusters. The expansion is of course fully supported by the Oracle Mammoth configuration utility and its CLI, greatly simplifying cluster expansion.

Major Improvement, No Additional Cost

Over the past generations BDA has quickly adapted to changing usage and workload patterns, enabling the adoption of Hadoop into the data ecosystem with minimal infrastructure disruption but maximum business benefit. The latest update to BDA X5-2 adds flexibility, delivers more storage capacity, and runs more workloads than ever before.

For more information, see the BDA X5-2 Data Sheet on OTN.

Saturday Jun 20, 2015

Oracle Big Data Spatial and Graph - Installing the Image Processing Framework

Oracle Big Data Lite 4.2 was just released - and one of the cool new features is Oracle Big Data Spatial and Graph.  In order to use this new feature, there is one more configuration step required.  Normally, we include everything you need in the VM - but this is a component that we couldn't distribute.

For the Big Data Spatial Image Processing Framework, you will need to install and configure Proj.4 - Cartographic Projections Library.  Simply follow these steps: 

  • Start the Big Data Lite VM and log in as user "oracle"
  • Launch Firefox and download this tarball (http://download.osgeo.org/proj/proj-4.9.1.tar.gz) to ~/Downloads
  • Run the following commands at the linux prompt:
    • cd ~/Downloads
    • tar -xvf proj-4.9.1.tar.gz
    • cd proj-4.9.1
    • ./configure
    • make
    • sudo make install

This will create the libproj.so file in directory /usr/local/lib/.  Now that the file has been created, create links to it in the appropriate directories.  At the linux prompt:

  • sudo ln -s /usr/local/lib/libproj.so /u02/oracle-spatial-graph/shareddir/spatial/demo/imageserver/native/libproj.so
  • sudo ln -s /usr/local/lib/libproj.so /usr/lib/hadoop/lib/native/libproj.so

That's all there is to it.  Big Data Lite is now ready for Oracle Big Data Spatial and Graph!

Oracle Big Data Lite 4.2 Now Available!

Oracle Big Data Lite Virtual Machine 4.2 is now available on OTN.  For those of you that are new to the VM - it is a great way to get started with Oracle's big data platform.  It has a ton of products installed and configured - including: 

  • Oracle Enterprise Linux 6.6
  • Oracle Database 12c Release 1 Enterprise Edition - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more.
  • Cloudera Distribution including Apache Hadoop (CDH5.4.0)
  • Cloudera Manager (5.4.0)
  • Oracle Big Data Connectors 4.2
    • Oracle SQL Connector for HDFS 3.3.0
    • Oracle Loader for Hadoop 3.4.0
    • Oracle Data Integrator 12c
    • Oracle R Advanced Analytics for Hadoop 2.5.0
    • Oracle XQuery for Hadoop 4.2.0
  • Oracle NoSQL Database Enterprise Edition 12cR1 (3.3.4)
  • Oracle Big Data Spatial and Graph 1.0
  • Oracle JDeveloper 12c (12.1.3)
  • Oracle SQL Developer and Data Modeler 4.1
  • Oracle Data Integrator 12cR1 (12.1.3)
  • Oracle GoldenGate 12c
  • Oracle R Distribution 3.1.1
  • Oracle Perfect Balance 2.4.0
  • Oracle CopyToBDA 2.0

Check out our new product - Oracle Big Data Spatial and Graph (and don't forget to read the blog post on a small config update you'll need to make to use it).  It's a great way to find relationships in data and query and visualize geographic data.  Speaking of analysis... Oracle R Advanced Analytics for Hadoop now leverages Spark for many of its algorithms for (way) faster processing.

 But, that's just a couple of features... download the VM and check it out for yourself :). 

Friday May 15, 2015

Big Data Spatial and Graph is now released!

Cross-posting this from the announcement of the new spatial and graph capabilities. You can get more detail on OTN.

The product objective is to provide spatial and graph capabilities that are best suited to the use cases, data sets, and workloads found in big data environments.  Oracle Big Data Spatial and Graph can be deployed on Oracle Big Data Appliance, as well as other supported Hadoop and NoSQL systems on commodity hardware. 

Here are some feature highlights.   

Oracle Big Data Spatial and Graph includes two main components:

  • A distributed property graph database with 35 built-in graph analytics to
    • discover graph patterns in big data, such as communities and influencers within a social graph
    • generate recommendations based on interests, profiles, and past behaviors
  • A wide range of spatial analysis functions and services to
    • evaluate data based on how near or far things are to one another, or whether something falls within a boundary or region
    • process and visualize geospatial map data and imagery

Property Graph Data Management and Analysis

The property graph feature of Oracle Big Data Spatial and Graph facilitates big data discovery and dynamic schema evolution with real-world modeling and proven in-memory parallel analytics. Property graphs are commonly used to model and analyze relationships, such as communities, influencers and recommendations, and other patterns found in social networks, cyber security, utilities and telecommunications, life sciences and clinical data, and knowledge networks.  

Property graphs model the real world as networks of linked data comprising vertices (entities), edges (relationships), and properties (attributes) on both. Property graphs are flexible and easy to evolve; metadata is stored as part of the graph, and new relationships are added by simply adding an edge. Graphs support sparse data; properties can be added to a vertex or edge but need not be applied to all similar vertices and edges.  Standard property graph analysis enables discovery with analytics that include ranking, centrality, recommender, community detection, and path finding.

Oracle Big Data Spatial and Graph provides an industry-leading property graph capability on Apache HBase and Oracle NoSQL Database, with a Groovy-based console; parallel bulk load from common graph file formats; text indexing and search; querying graphs in the database and in memory; ease of development with open source Java APIs and popular scripting languages; and an in-memory, parallel, multi-user graph analytics engine with 35 standard graph analytics.

Spatial Analysis and Services Enrich and Categorize Your Big Data with Location

With the spatial capabilities, users can take data with any location information, enrich it, and use it to harmonize their data.  For example, Big Data Spatial and Graph can look at datasets like Twitter feeds that include a zip code or street address, and add or update city, state, and country information.  It can also filter or group results based on spatial relationships:  for example, filtering customer data from logfiles based on how near one customer is to another, or finding how many customers are in each sales territory.  These results can be visualized on a map with the included HTML5-based web mapping tool.  Location can be used as a universal key across disparate data commonly found in Hadoop-based analytic solutions. 

Also, users can perform large-scale operations for data cleansing, preparation, and processing of imagery, sensor data, and raw input data with the raster services.  Users can load raster data on HDFS using dozens of supported file formats, perform analysis such as mosaic and subset, write and carry out other analysis operations, visualize data, and manage workflows.  Hadoop environments are ideally suited to storing and processing these high data volumes quickly, in parallel across MapReduce nodes.  

Learn more about Oracle Big Data Spatial and Graph at the OTN product website:

Read the Data Sheet

Read the Spatial Feature Overview

Tuesday Apr 14, 2015

Statement of Direction -- Big Data Management System

Click here to start reading the Full Statement of Direction. 

Introduction: Oracle Big Data Management System Today 

As today's enterprises embrace big data, their information architectures must evolve. Every enterprise has data warehouses today, but the best-practices information architecture embraces emerging technologies such as Hadoop and NoSQL. Today’s information architecture recognizes that data is stored not only in increasingly disparate data platforms, but also in increasingly disparate locations: on-premises and potentially multiple cloud platforms. The ideal of a single monolithic ‘enterprise data warehouse’ has faded as a new, more flexible architecture has emerged. Oracle calls this new architecture the Oracle Big Data Management System, and today it consists of three key components:

  • The data warehouse, running on Oracle Database and Oracle Exadata Database Machine, is the primary analytic database for storing much of a company’s core transactional data: financial records, customer data, point-of-sale data and so forth. Despite now being part of a broader architecture, the demands on the RDBMS for performance, scalability, concurrency and workload management are greater than ever; Oracle Database 12c introduced Oracle Database In-Memory (with columnar tables, SIMD processing, and advanced compression schemes) as the latest in a long succession of warehouse-focused innovations. The market-leading Oracle Database is the ideal starting point for customers to extend their architecture to the Big Data Management System.
  • The ‘data reservoir’, hosted on Oracle Big Data Appliance, will augment the data warehouse as a repository for the new sources of large volumes of data: machine-generated log files, social-media data, and videos and images -- as well as a repository for more granular transactional data or older transactional data which is not stored in the data warehouse. Oracle’s Big Data Management System embraces complementary technologies and platforms, including open-source technologies: Oracle Big Data Appliance includes Cloudera’s Distribution of Hadoop and Oracle NoSQL Database for data management.
  • A ‘franchised query engine,’ Oracle Big Data SQL, enables scalable, integrated access in situ to the entire Big Data Management System. SQL is the accepted language for day-to-day data access and analytic queries, and thus SQL is the primary language of the Big Data Management System.  Big Data SQL enables users to combine data from Oracle Database, Hadoop and NoSQL sources within a single SQL statement.  Leveraging the architecture of Exadata Storage Software and the SQL engine of the Oracle Database, Big Data SQL delivers high-performance access to all data in the Big Data Management System.

Using this architecture, the Oracle Big Data Management System combines the performance of Oracle’s market-leading relational database, the power of Oracle’s SQL engine, and the cost-effective, flexible storage of Hadoop and NoSQL. The result is an integrated architecture for managing Big Data, providing all of the benefits of Oracle Database, Exadata, and Hadoop, without the drawbacks of independently-accessed data repositories.  

Note that the scope of this statement of direction is the data platform for Big Data. An enterprise Big Data solution would also comprise big data tools and big data applications built upon this data platform.

Read the full Statement of Direction -- Big Data Management System here.

Tuesday Apr 07, 2015

Oracle Academy: Data Science Bootcamp for 2015

I'm pleased to announce that Oracle Academy has released our Data Science Bootcamp for 2015.  As I've spent a great deal of time over the past few months helping Oracle Academy develop the content, I wanted to briefly explain what the Bootcamp series is and why it's worth a look.

 What is This Thing?

The Data Science Bootcamp is an attempt at providing asynchronous training for data science fundamentals.  There are videos for each of the 16 lessons, example code, tight integration with our Big Data Lite VM, and even an online textbook.  Between these elements, we think students can learn in the way that best fits their schedule and level of interest.  Watch, try, or read about each problem in whatever way helps you learn best.


Friday Feb 06, 2015

Unified Query: SQL for All Seasons

In a recent interview, the topic of "a SQL for All Seasons" came up.  Initially, the phrasing made me think we were going to talk about a database that staunchly refused to answer queries about divorce.  Instead, the conversation centered around the pain enterprises feel when dealing with polyglot persistence.  As much as we, as developers, may choose to avoid (or embrace) polyglot persistence, in large enterprises it's becoming unavoidable.

What we focus on with Oracle Big Data SQL is unified query, and it's designed to be the complement to polyglot persistence.  Store data in the places the business deems correct, resulting in the "polyglot problem," but query it all simultaneously using a single SQL statement.  We think it's a pretty valuable concept, and it makes storing data in Hadoop or NoSQL stores for business or performance requirements easier to manage.  To explain the concept more fully, we've released a new whitepaper which considers why unified query is important, and what pitfalls can exist in some implementations.


Friday Jan 16, 2015

Deploying SAS High Performance Analytics on Big Data Appliance

Oracle and SAS have an ongoing commitment to our joint customers to deliver value-added technology integrations through engineered systems such as Exadata, Big Data Appliance, SuperCluster, Exalogic, and ZFS Storage Appliance.  Dedicated resources manage and execute on joint SAS/Oracle Database, Fusion Middleware, and Oracle Solaris integration projects, providing customer support, including sizing and IT infrastructure optimization and consolidation.  Oracle support teams are onsite at SAS Headquarters in Cary, NC (USA), and in the field on a global basis.

The latest in this effort is to enable our joint customers to deploy SAS High Performance Analytics on Big Data Appliance. This enables SAS users to leverage the lower-cost infrastructure Hadoop offers in a production-ready deployment on Oracle Big Data Appliance. Hear from Paul Kent (VP Big Data, SAS) on some of the details.

Read more on deploying SAS High Performance Analytics on www.oracle.com/SAS. Don't miss the deployment guide and best practices here.

Thursday Oct 09, 2014

One of the ways Oracle is using Big Data

Today, Oracle is using big data technology and concepts to significantly improve the effectiveness of its support operations, starting with its hardware support group. While the company is just beginning this journey, the initiative is already delivering valuable benefits.

In 2013, Oracle’s hardware support group began to look at how it could use automation to improve support quality and accelerate service request (SR) resolution. Its goal is to use predictive analytics to automate SR resolution with 80% to 95% accuracy.

Oracle’s support group gathers a tremendous amount of data. Each month, for example, it logs 35,000 new SRs and receives nearly 6 TB of telemetry data via automated service requests (ASRs)—which represent approximately 18% of all SRs. Like many organizations, Oracle had a siloed view of this data, which hindered analysis. For example, it could look at SRs but could not analyze the associated text, and it could review SRs and ASRs separately, but not together.

Oracle was conducting manual root-cause analysis to identify which types of SRs were the best candidates for automation. This was a time-consuming, difficult, and costly process, and the company looked to introduce big data and predictive analytics to automate insight.

The team knew that it had to walk before it could run. It started by taking information from approximately 10 silos, such as feeds from SRs and ASRs, parts of databases, and customer experience systems, and migrating the information to an Oracle Endeca Information Discovery environment. Using the powerful Oracle Endeca solution, Oracle could look at SRs, ASRs, and associated notes in a single environment, which immediately yielded several additional opportunities for automation. On the first day of going live with the solution, Oracle identified 4% more automation opportunities.

Next, Oracle focused its efforts on gaining insight in near real time, leveraging the parallel processing of Hadoop to automatically feed Oracle Endeca Information Discovery—dramatically improving data velocity. Oracle’s first initiative with this new environment looked at Oracle Solaris SRs. In the first few weeks of that project, Oracle identified automation opportunities that will increase automated SR resolution from less than 1% to approximately 5%—simply by aggregating all of the data in near real-time. 

Once Oracle proved via these early proofs of concept that it could process data more efficiently and effectively to feed analytical projects, it began to deploy Oracle Big Data Appliance and Oracle Exalytics In-Memory Machine.

Read the entire profile here.

Tuesday Sep 23, 2014

Big Data IM Reference Architecture

Just in time for Oracle OpenWorld, the new Big Data Information Management Reference Architecture is posted on our OTN pages. The reference architecture attempts to create order in the wild west of new technologies and the flurry of new ideas, and most importantly tries to go from marketing hype to a real, implementable architecture.

To get all the details, read the paper here. Thanks to the EMEA architecture team, the folks at Rittman Mead Consulting, and all others involved.

Monday Sep 15, 2014

Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL

Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL.  The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore.  

Let's take a look at what it takes to get started and then preview this new capability.

Setting up Connections to Hive

The first thing you need to do is set up a JDBC connection to Hive.  Follow these steps to set up the connection:

Download and Unzip JDBC Drivers

Cloudera provides high performance JDBC drivers that are required for connectivity:

  • Download the Hive Drivers from the Cloudera Downloads page to a local directory
  • Unzip the archive
    • unzip hive_jdbc_2.5.15.1040.zip
  • Three zip files are contained within the archive.  Unzip the JDBC4 archive to a target directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below): 
    • unzip Cloudera_HiveJDBC4_2.5.15.1040.zip -d /home/oracle/jdbc/
    • Note: you will get an error when attempting to open a Hive connection in SQL Developer if you use a different JDBC version. Ensure you use JDBC4 and not JDBC41.

Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.

Update SQL Developer to use the Cloudera Hive JDBC Drivers

Update the preferences in SQL Developer to leverage the new drivers:

  • Start SQL Developer
  • Go to Tools -> Preferences
  • Navigate to Database -> Third Party JDBC Drivers
  • Add all of the jar files contained in the zip to the Third-party JDBC Driver Path.  It should look like the picture below:
    sql developer preferences

  • Restart SQL Developer

Create a Connection

Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2.  Click the New Connection button in the SQL Developer toolbar.  You'll need to have an ID, password and the port where Hiveserver2 is running:

connect to hiveserver2

The example above creates a connection called hive, which connects to Hiveserver2 on localhost at port 10000.  The Database field is optional; here we are specifying the default database.

Using the Hive Connection

The Hive connection is now treated like any other connection in SQL Developer.  The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL:

sqldeveloper - view data in hive

And, you can use the SQL Worksheet to run custom queries, perform DDL operations - whatever is supported in Hive:


Here, we've altered the definition of a Hive table and then queried that table in the worksheet.
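For illustration, the worksheet statements might look something like the HiveQL below; the table and column names are hypothetical, not the ones shipped in the VM.

    -- Add a column to an existing Hive table, then query it back
    ALTER TABLE web_logs ADD COLUMNS (referrer STRING COMMENT 'HTTP referrer');
    SELECT ip_address, referrer
    FROM   web_logs
    LIMIT  10;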

Create Big Data SQL-enabled Tables Using Oracle Data Modeler

Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables.  Let's create a few tables using the metadata from the Hive Metastore.  Invoke the import wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item.  You will see the same connections found in the SQL Developer connection navigator:

pick a connection

After selecting the hive connection and a database, select the tables to import:

pick tables to import

There could be any number of tables here - in our case we will select three tables to import.  After completing the import, the logical table definitions appear in our palette:

imported tables

You can update the logical table definitions - and in our case we will want to do so.  For example, the recommended column in Hive is defined as a string (i.e. there is no precision) - which the Data Modeler casts as a varchar2(4000).  We have domain knowledge and understand that this field is really much smaller - so we'll update it to the appropriate size:

update prop

Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c.  Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet:

preview ddl
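For Hive-backed sources, the generated definitions are Oracle external tables using the ORACLE_HIVE access driver. A rough sketch of what such DDL looks like (column names, cluster name, and sizes are illustrative, not the exact output of the tool):

    CREATE TABLE movieapp_log_json (
      custid       NUMBER,
      movieid      NUMBER,
      activity     NUMBER,
      rating       NUMBER,
      recommended  VARCHAR2(4)     -- resized from the default VARCHAR2(4000)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE             -- read the data as described in the Hive metastore
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (
        com.oracle.bigdata.cluster=bigdatalite
        com.oracle.bigdata.tablename=default.movieapp_log_json
      )
    )
    REJECT LIMIT UNLIMITED;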

Edit the Table Definitions

The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing.  For example, edit table movieapp_log_json:

edit table props

You can update the source cluster for the data, how invalid records should be processed, how to map Hive table columns to the corresponding Oracle table columns (if they don't match), and much more.
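Under the covers, these editor settings map to com.oracle.bigdata.* access parameters in the external table DDL. As a hedged sketch (the values and column/field names are illustrative):

    ACCESS PARAMETERS (
      com.oracle.bigdata.cluster=bigdatalite                                   -- source Hadoop cluster
      com.oracle.bigdata.erroropt=[{"action":"setnull"}]                       -- how invalid records are handled
      com.oracle.bigdata.colmap=[{"col":"RECOMMENDED","field":"recommended"}]  -- Oracle column to Hive field mapping
    )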

Query All Your Data

You now have full Oracle SQL access to data across the platform.  In our example, we can combine data from Hadoop with data in our Oracle Database.  The data in Hadoop can be in any format - Avro, JSON, XML, CSV - as long as there is a SerDe that can parse the data, Big Data SQL can access it!  Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies:
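A sketch of such a query is below; the Oracle table and column names are hypothetical and chosen only to show the cross-store join, with movieapp_log_json being the Big Data SQL external table defined earlier.

    -- movieapp_log_json: external table over JSON click data in HDFS
    -- movie:             an ordinary Oracle Database table
    SELECT m.title,
           ROUND(AVG(f.rating), 1) AS avg_rating
    FROM   movieapp_log_json f
    JOIN   movie m ON m.movie_id = f.movieid
    WHERE  f.rating IS NOT NULL
    GROUP  BY m.title
    ORDER  BY avg_rating DESC;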

compare to blockbuster movies

Looks like they don't think too highly of them! Of course - the ratings data is fictitious ;)

Tuesday Jul 22, 2014

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

The benefits of Big Data at StubHub:

  • StubHub enabled data scientists to work directly with customer-related data—such as ticket-purchasing history—inside the database, and to use database options to explore the data graphically, build and evaluate multiple data-mining models, and deploy predictions and insights throughout the enterprise—drastically improving StubHub’s agility and responsiveness
  • Developed highly targeted ticket promotional campaigns and offers by having the ability to calculate 180 million customers’ lifetime value (or propensity) instead of just 20,000 values at a time
  • Used Oracle R Enterprise component of Oracle Advanced Analytics—an Oracle Database option—to reduce a fraud issue by up to 90%

Read more or watch the video:

Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data


Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider: SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.
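A trivial example makes the point (the table here is hypothetical): the statement says nothing about scan order, join strategy, or how aggregates are shuffled - only what answer we want.

    -- "How many checkout clicks per state?" - the engine decides how to get there
    SELECT state, COUNT(*) AS clicks
    FROM   weblogs
    WHERE  page = '/checkout'
    GROUP  BY state;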

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data than ever before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important, if not more so.  As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So any solution that does SQL over a relational database and Hadoop has to respect the schemas of the database, but be able to truly apply schema-on-read principles to data stored in Hadoop.
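A quick way to see schema-on-read at work is a plain Hive external table: the files in HDFS stay exactly as they landed, and the schema is applied only when a query reads them. (The table name, path, and delimiter below are illustrative.)

    -- The raw tab-delimited files under /data/weblogs are left untouched;
    -- this definition only tells readers how to interpret them at query time
    CREATE EXTERNAL TABLE weblogs (
      ip_address STRING,
      page       STRING,
      state      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/weblogs';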

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

Yes, but the external table Big Data SQL provides is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That both limits parallelism and removes the value of schema-on-read.

The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct-path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement that impedes performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field of every document, we're able to cut 100s of TB down to 100s of GB.
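In SQL terms the query is no more than this (the external table and column names are hypothetical); Smart Scan pushes the WHERE filter and the two-column projection down to the services running alongside the DataNodes, so only matching email and status values travel back to the database.

    SELECT email, status
    FROM   customer_logs_json      -- Big Data SQL external table over JSON in HDFS
    WHERE  state = 'TX';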

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.


So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving what makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, so that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.


Thursday Jun 26, 2014

Big Data Breakthrough; Watch the Webcast on July 15th

Thursday Jun 05, 2014

Globacom and mCentric Deploy BDA and NoSQL Database to analyze network traffic 40x faster

In a fast-evolving market, speed is of the essence. mCentric and Globacom leveraged Oracle Big Data Appliance and Oracle NoSQL Database to save over 35,000 call-processing minutes daily and analyze network traffic 40x faster.

Here are some highlights from the profile:

Why Oracle

“Oracle Big Data Appliance works well for very large amounts of structured and unstructured data. It is the most agile events-storage system for our collect-it-now and analyze-it-later set of business requirements. Moreover, choosing a prebuilt solution drastically reduced implementation time. We got the big data benefits without needing to assemble and tune a custom-built system, and without the hidden costs required to maintain a large number of servers in our data center. A single support license covers both the hardware and the integrated software, and we have one central point of contact for support,” said Sanjib Roy, CTO, Globacom.

Implementation Process

It took only five days for Oracle partner mCentric to deploy Oracle Big Data Appliance and perform the software installation, configuration, certification, and resiliency testing. The entire process—from site planning to phase-I go-live—was executed in just over ten weeks, well ahead of the four months allocated to complete the project.

Oracle partner mCentric leveraged Oracle Advanced Customer Support Services’ implementation methodology to ensure configurations are tailored for peak performance, all patches are applied, and software and communications are consistently tested using proven methodologies and best practices.

Read the entire profile here.


The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.

