With the release of Big Data SQL 2.0 it is probably time to do a quick recap and introduce the marquee features in 2.0. The key goals of Big Data SQL are to expose data in its original format, and stored within Hadoop and NoSQL Databases through high-performance Oracle SQL being offloaded to Storage resident cells or agents. The architecture of Big Data SQL closely follows the architecture of Oracle Exadata Storage Server Software and is built on the same proven technology.
With data in HDFS stored in an undetermined format (schema on read), SQL queries require some constructs to parse and interpret data for it to be processed in rows and columns. For this Big Data SQL leverages all the Hadoop constructs, notably InputFormat and SerDe Java classes optionally through Hive metadata definitions. Big Data SQL then layers the Oracle Big Data SQL Agent on top of this generic Hadoop infrastructure as can be seen below.
Because Big Data SQL is based on Exadata Storage Server Software, a number of benefits are instantly available. Big Data SQL not only can retrieve data, but can also score Data Mining models at the individual agent, mapping model scoring to an individual HDFS node. Likewise querying JSON documents stored in HDFS can be done with SQL directly and is executed on the agent itself.
Within the Big Data SQL Agent, similar functionality exists as is available in Exadata Storage Server Software. Smart Scans apply the filter and row projections from a given SQL query on the data streaming from the HDFS Data Nodes, reducing the data that is flowing to the Database to fulfill the data request of that given query. The benefits of Smart Scan for Hadoop data are even more pronounced than for Oracle Database as tables are often very wide and very large. Because of the elimination of data at the individual HDFS node, queries across large tables are now possible within reasonable time limits enabling data warehouse style queries to be spread across data stored in both HDFS and Oracle Database.
Storage Indexes - new in Big Data SQL 2.0 - provide the same benefits of IO elimination to Big Data SQL as they provide to SQL on Exadata. The big difference is that in Big Data SQL the Storage Index works on an HDFS block (on BDA – 256MB of data) and span 32 columns instead of the usual 8. Storage Index is fully transparent to both Oracle Database and to the underlying HDFS environment. As with Exadata, the Storage Index is a memory construct managed by the Big Data SQL software and invalidated automatically when the underlying files change.
Storage Indexes work on data exposed via Oracle External tables using both the ORACLE_HIVE and ORACLE_HDFS types. Fields are mapped to these External Tables and the Storage Index is attached to the Oracle (not the Hive) columns, so that when a query references the column(s), the Storage Index - when appropriate - kicks in. In the current version, Storage Index does not support tables defined with Storage Handlers (ex: HBase or Oracle NoSQL Database).
The Smart Scan and Storage Index features deliver compound benefits. Where Storage Indexes reduces the IO done, Smart Scan then enacts the same row filtering and column projection. This latter step remains important as it reduces the data transferred between systems.
To learn more about Big Data SQL, join us at Open World in San Francisco at the end of the month.