Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data


Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider, SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: having language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data then every before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important if not more.  As Daniel Abadi recently wrote, the flexibility of schema-on-read is gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but be able to really apply schema-on-read principals to data stored in Hadoop.

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

 Yes, but Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That causes both poor parallelism and removes the value of schema-on-read.

  The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HFDS is 100 blocks, Oracle database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into to the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement and impeding performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 of TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data, discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field, for every document, we're able to cut down 100s of TB to 100s of GB.

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.


So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving the makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.

[Read More]

Thursday Jul 18, 2013

Practical HDFS Permissions


Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories.  However, there is less readily-available information about what the effects of OS-level permissions are on accessing data in HDFS via higher-level abstractions such as Hive or Pig.  To provide a bit of clarity, I decided to run through the effects of permissions on different interactions with HDFS.

The Setup

In this scenario, we have three users: oracle, dan, and not_dan.  The oracle user has captured some data in an HDFS directory.  The directory has 750 permissions: read/write/execute for oracle, read/execute for dan, and no access for not_dan.  One of the files in the directory has 700 permissions, meaning that only the oracle user can read it.  Each user will tries to do the following tasks:

  • List the contents of the directory
  • Count the lines in a subset of files including the file with 700 permissions
  • Run a simple Hive query over the directory

Listing Files

Each user issues the command

hadoop fs -ls /user/shared/moving_average|more

And what do they see:

[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more

Found 564 items

Obviously, the oracle user can see all the files in its own directory.

[dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items

Similarly, since dan has group read access, that user can also list all the files. The user without group read permissions, however, receives an error.

[not_dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more

ls: Permission denied: user=not_dan, access=READ_EXECUTE,


Counting Rows in the Shell

In this test, each user pipes a set of HDFS files into a unix command and counts rows.  Recall, one of the files has 700 permissions.

The oracle user, again, can see all the available data:

[oracle@localhost ~]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l

The user with partial permissions receives an error on the console, but can access the data they have permissions on.  Naturally, the user without permissions only receives the error.

[dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=dan, access=READ, inode="/user/shared/moving_average/FlumeData.1374082184056":oracle:shared_hdfs:-rw-------
[not_dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=not_dan, access=READ_EXECUTE, inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---

Permissions on Hive

In this final test, the oracle user defines an external Hive table over the shared directory.  Each user issues a simple COUNT(*) query against the directory.  Interestingly, the results are not the same as piping the datastream to the shell.

The oracle user's query runs correctly, while both dan and not_dan's queries fail:

As dan

Job Submission failed with exception 'java.io.FileNotFoundException(File /user/shared/moving_average/FlumeData.1374082184056 does not exist)'

As not_dan

Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException
(Permission denied: user=not_dan, access=READ_EXECUTE,

So, what's going on here? In each case, the query fails, but for different reasons. In the case of not_dan, the query fails because the user has no permissions on the directory. However, the query issued by dan fails because of a FileNotFound exception. Because dan does not have read permissions on the file, Hive cannot find all the files necessary to build the underlying MapReduce job. Thus, the query fails before being submitted to the JobTracker.  The rule then, becomes simple: to issue a Hive query, a user must have read permissions on all files read by the query. If a user has permissions on one set of partition directories,  but not another, they can issue queries against the readable partitions, but not against the entire table.


In a nutshell, the OS-level permissions of HDFS behave just as we would expect in the shell. However, problems can arise when tools like Hive or Pig try to construct MapReduce jobs. As a best practice, permissions structures should be tested against the tools which will access the data. This ensures that users can read
what they are allowed to, in the manner that they need to. 

[Read More]

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.


« November 2015