By dan.mcclary on Jul 15, 2014
Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores. Big Data SQL has been in development for quite a while now, and will be generally available in a few months. With today's announcement of the product, I wanted to take a moment to explain what we think is important and valuable about Big Data SQL.
SQL on Hadoop
As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press. From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it. It does, however, prompt the question: Why SQL?
There's an argument to be made for SQL simply being a form of skill reuse. If people and tools already speak SQL, then give the people what they know. In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves. If there were a better language for querying Big Data, the community would have turned it up by now.
I think the reality is that the SQL language endures because it is uniquely suited to querying datasets. Consider: SQL is a declarative language for operating on relations in data. It's a domain-specific language where the domain is datasets. In and of itself, that's powerful: language elements like FROM, WHERE, and GROUP BY make reasoning about datasets simpler. It's set theory set into a programming language.
It goes beyond just the language itself. SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on. SQL lets us think about answers, which lets us get more done.
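A small sketch makes the point (the table and column names here are hypothetical): the query states only the shape of the answer, and the database decides how to scan, join, and aggregate.

```sql
-- Declarative: we describe the result we want, not the access path.
-- The optimizer picks scan methods, join algorithms, and aggregation.
SELECT state,
       COUNT(*) AS customer_count
FROM   customers
WHERE  signup_date >= DATE '2014-01-01'
GROUP  BY state;
```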
SQL on Hadoop, then, is somewhat obvious. As data gets bigger, we would prefer to only have to reason about answers.
SQL On More Than Hadoop
For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback. Specifically, data rarely lives in a single place. Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data than ever before. If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.
If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores. Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.
This is why we think Oracle Big Data SQL is obvious too.
It's just a little more complicated than SQL on any one thing. To pull it off, we have to do a few things:
- Maintain the valuable characteristics of the system storing the data
- Unify metadata to understand how to execute queries
- Optimize execution to take advantage of the systems storing the data
For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.
For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read. Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data. Schema-on-read is at least as important, if not more so. As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format. So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but also genuinely apply schema-on-read principles to data stored in Hadoop.
Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.
To unify metadata for planning and executing SQL queries, we require a catalog of some sort. What tables do I have? What are their column names and types? Are there special options defined on the tables? Who can see which data in these tables?
Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables. Tables in Hadoop or NoSQL databases are defined as external tables in Oracle. This makes sense, given that the data is external to the DBMS.
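As a sketch of what this looks like (the table name, columns, and access parameters here are hypothetical, and the exact DDL shape may differ from the shipping product), a Hive-backed dataset surfaces in Oracle as an external table:

```sql
-- Sketch: an Oracle external table over data defined in the Hive
-- metastore. Names and access parameters are illustrative only.
CREATE TABLE web_logs (
  ip_address VARCHAR2(15),
  email      VARCHAR2(100),
  status     VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE                 -- resolve storage via the Hive metastore
  DEFAULT DIRECTORY default_dir
);
```

Once defined this way, the table participates in the Oracle data dictionary like any other: its columns, types, and privileges are all visible to the catalog.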
Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?
Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop. The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers. That both causes poor parallelism and removes the value of schema-on-read.
The external tables Big Data SQL presents are different. They leverage the Hive metastore or user definitions to determine both parallelism and read semantics. That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel. If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read. Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.
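To make that concrete, suppose the Hive side had defined the data like this (a hypothetical HiveQL definition, names illustrative): the storage format declared here is exactly the format Big Data SQL would use to read the files, rather than assuming flat delimited text.

```sql
-- Hypothetical Hive-side definition (HiveQL). Whatever SerDe and
-- InputFormat this declares is what Big Data SQL uses for reads.
CREATE TABLE logs.web_logs (
  ip_address STRING,
  email      STRING,
  status     STRING
)
STORED AS PARQUET;
```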
Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.
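In practice, that join is just SQL (table names here are hypothetical): one statement spans both stores, with no separate export or load step.

```sql
-- Sketch: joining an ordinary Oracle table with a Hadoop-backed
-- external table in a single statement. Names are illustrative.
SELECT c.customer_name,
       l.status
FROM   customers c      -- stored in Oracle Database
JOIN   web_logs  l      -- external table over data in HDFS
ON     c.email = l.email;
```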
Being able to join data from Hadoop with Oracle Database is a feat in and of itself. However, given the size of data in Hadoop, it ends up being a lot of data to shift around. In order to optimize performance, we must take advantage of what each system can do.
In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance. Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself. The solution the Exadata team proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.
The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.
Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement. It does this via something we call Smart Scan for Hadoop.
Moving the work to the data is straightforward. Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers. Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local. Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.
Smart Scan for Hadoop
Consider this: most queries don't select all columns, and most queries have some kind of predicate on them. Moving unneeded columns and rows is, by definition, excess data movement, and it impedes performance. Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.
For example, suppose we were querying a 100 TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.
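A query of that shape might look like this (the external table name is hypothetical):

```sql
-- Two fields projected, one predicate applied. Smart Scan evaluates
-- both at the DataNodes, so only matching rows and columns move.
SELECT email,
       status
FROM   json_docs        -- external table over ~100 TB of JSON in HDFS
WHERE  state = 'TX';
```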
Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading. It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute. Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data. Rather than moving every field of every document, we're able to cut 100s of TB down to 100s of GB.
The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.
So, there you have it: fast queries which join data in Oracle Database with data in Hadoop, while preserving what makes each system a valuable part of the overall information architecture. Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics. Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.
It's SQL that plays data where it lies, letting you place data where you think it belongs.