The explosive growth of data has created unprecedented opportunities to discover insights, but the performance challenges of these massive computations can be daunting. Apache Spark SQL gives data scientists a powerful way to process large volumes of data quickly. Apache Spark is rapidly evolving: Spark SQL and its Dataset/DataFrame abstractions provide a more expressive and powerful way to write code than Spark's lower-level RDDs. In addition, Apache Spark's Catalyst engine automatically optimizes Spark SQL code to improve execution speed.
Under the auspices of Project Tungsten, the community has done a great deal of work to improve the execution of Apache Spark SQL; version 2.1.0 includes what is called Whole-stage Codegen, for which Databricks' engineers have demonstrated dramatic improvements of 10x. Please see "Apache Spark 2.0 presented by Databricks' Spark Chief Architect Reynold Xin" for details.
Even with these 10x improvements, processing tens of terabytes or more of data can still take a long time. Oracle engineers have prototyped accelerating Spark SQL using some of Oracle's new hardware and software innovations, which can further increase performance by 10-20x over Whole-stage Codegen (100x to 200x over the Volcano-style interpreter of earlier Apache Spark versions).
Oracle's hardware innovations were inspired by the challenges of optimizing SQL processing for in-memory analytics in Oracle Database. The hardware Oracle designed into current SPARC processors is called the Data Analytics Accelerator (DAX). It is an offload co-processor that uses the same memory as the processor's cores. In addition, Oracle has created an open API (libdax) so that the DAX hardware can easily be used by Apache Spark and other applications. See Open DAX APIs for more.

To show the advantages of libdax in Apache Spark SQL, Oracle engineers have created a proof-of-concept implementation of libdax integrated into Apache Spark. We have demonstrated that Oracle's SPARC S7 processor with DAX can provide a 15.6x per-core improvement over an x86 (E5 v4) processor in SQL performance on a 2-predicate "Between" query. Both the SPARC S7 and SPARC M7 processors contain the DAX offload co-processor. Because the DAX shares the same memory controller as the regular cores, the bottleneck of GPU-like data transfers is completely avoided.
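For readers unfamiliar with the benchmark's shape, a 2-predicate "Between" query is simply a filter with both a lower and an upper bound. The sketch below illustrates the logic in plain Java Streams (the column name and bounds are illustrative stand-ins; the actual benchmark ran under Spark SQL on much larger data):

```java
import java.util.stream.IntStream;

public class BetweenQuery {
    public static void main(String[] args) {
        // Illustrative "age" column; real workloads scan billions of values.
        int[] ages = {12, 25, 40, 70, 33, 18, 64};

        // A 2-predicate "Between" filter: lower bound AND upper bound.
        long matches = IntStream.of(ages)
                .filter(age -> age >= 18 && age <= 64)
                .count();

        System.out.println(matches);
    }
}
```

On DAX hardware, this kind of bounded-range scan is exactly the pattern the co-processor evaluates directly against in-memory columnar data.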
This work debuted at the Apache Spark New York City meetup in December 2016; a video of the meetup is available on YouTube.
Even without the DAX, SPARC processors can outperform x86 processors. Apache Spark is written in Scala, which runs on the JVM, and the SPARC M7 and SPARC S7 processors have demonstrated a 1.5x per-core advantage over x86 cores on JVM performance, as shown by the benchmark results "SPECjbb2015: SPARC T7-1 World Record for 1 Chip Result" and "SPECjbb2015: SPARC S7-2 Multi-JVM and Distributed Results". Oracle achieved these results by making Java (JVM) performance an explicit goal during processor design.
With the SPARC M7 processor, Oracle added a number of Software in Silicon (SWiS) features by building higher-level software functions into the processor. One of these features is the Data Analytics Accelerator, or DAX; the multiple DAX units on the processor deliver unprecedented analytics efficiency. The DAX is an integrated co-processor that provides a specialized set of instructions to run a select group of functions – Scan, Extract, Select, Filter, and Translate – at very high speed. The DAX units share the same memory interface as the processor's cores, so they can take full advantage of the 140-160 GB/sec memory bandwidth of the SPARC M7 processor. The DAX co-processors were originally intended to speed up Oracle Database and have been used in production to run a SQL query in parallel at rates of 170 billion rows per second.
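Conceptually, the DAX Scan primitive compares every element of an in-memory column against a predicate and produces a bit-vector marking the matching rows. A rough software model of that behavior might look like the following (purely illustrative; the real DAX operates in hardware on packed and compressed columnar formats):

```java
import java.util.BitSet;

public class ScanModel {
    // Software model of a hardware scan: compare each column value
    // against a threshold and record matches in a bit-vector.
    static BitSet scanGreaterThan(int[] column, int threshold) {
        BitSet matches = new BitSet(column.length);
        for (int i = 0; i < column.length; i++) {
            if (column[i] > threshold) {
                matches.set(i);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] ages = {15, 22, 17, 45, 19};
        BitSet voters = scanGreaterThan(ages, 18);
        // cardinality() gives the number of rows matching age > 18
        System.out.println(voters.cardinality());
    }
}
```

Returning a compact bit-vector rather than the matching rows themselves is what lets downstream operators (counts, selections, joins) consume scan results without extra data movement.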
Other programming methods can also take advantage of DAX. As an example, consider how the SQL statement that is accelerated in Oracle Database can be written in several other programming styles, each of which can likewise be accelerated using DAX.
SQL (Oracle Database):
SELECT count(*) from citizen WHERE citizen.age > 18
Apache Spark SQL (DSL - Domain Specific Language, or Language Integrated Query):
val nvoters: Long = citizen.filter($"age" > 18).count()
Java Streams API:
long nvoters = citizens.parallelStream().filter(citizen -> citizen.age > 18).count();
Eclipse Collections (formerly Goldman Sachs Collections):
int nvoters = fastList.count(citizen -> citizen.olderThan(18));
The Stream API was introduced in Java 8. See "Processing Data with Java SE 8 Streams, Part 1" and "Part 2: Processing Data with Java SE 8 Streams" for more on what can be done with Java Streams. Streams condense verbose collection-processing code into simple, readable pipelines, and express data processing in a form that modern processors can accelerate. You can write abstract, query-like code with the Stream API without spelling out the details of iterating over the collection. Eclipse Collections offers a similar functional API that can take advantage of the same kind of acceleration.
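To illustrate the condensing effect, compare a hand-written loop with the equivalent stream pipeline (the data here is a simple illustrative list of ages):

```java
import java.util.List;

public class StreamCompare {
    public static void main(String[] args) {
        List<Integer> ages = List.of(30, 12, 19, 45, 16);

        // Verbose collection processing: explicit iteration and mutable state.
        int loopCount = 0;
        for (int age : ages) {
            if (age > 18) {
                loopCount++;
            }
        }

        // Equivalent stream pipeline: declarative, and trivially parallelizable
        // by swapping stream() for parallelStream().
        long streamCount = ages.stream().filter(age -> age > 18).count();

        System.out.println(loopCount + " " + streamCount);
    }
}
```

Because the stream version states *what* to compute rather than *how* to iterate, a runtime (or, in principle, an offload engine like DAX) is free to choose the fastest execution strategy.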
The DAX has also been used to accelerate other software applications, such as ActivePivot from ActiveViam; their results of 6x to 8x speedups with DAX were reported at Oracle OpenWorld 2016. More information on ActivePivot performance can be found at "Accelerating In-Memory Computing with ActivePivot on Oracle".
Copyright 2017, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
The previous information is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.