
  • October 21, 2014

Using NoSQL and Spark (How to Start)

Spark is an open-source cluster computing framework for data analytics. It was built outside of Hadoop's two-stage MapReduce paradigm but runs on top of HDFS. Thanks to this approach, Spark has been adopted quickly and is seen as an attractive choice for the future
of data processing in Hadoop. How to link NoSQL and Spark is a question that often concerns Big Data architects and developers.

Let's take a quick look at this question.

Spark revolves around the concept of a resilient distributed dataset
(RDD), which is a fault-tolerant collection of elements that can be
operated on in parallel. RDDs can be created by referencing a dataset in an external storage system, such as a
shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Good news: NoSQL can be integrated with Spark's Java API through KVInputFormat and KVAvroInputFormat, two NoSQL Java classes that extend the Hadoop abstract class InputFormat.
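For reference, both classes ship with the Oracle NoSQL Database client and live in the oracle.kv.hadoop package (the value type noted for KVAvroInputFormat is an assumption based on its name, not on the article):

import oracle.kv.hadoop.KVInputFormat;     // keys and values exposed as org.apache.hadoop.io.Text
import oracle.kv.hadoop.KVAvroInputFormat; // values presumably exposed as Avro records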

How to proceed

  1. Get Spark resources
  2. Define the configuration parameters for NoSQL (connection, store name, key prefix)
  3. Define the Spark resilient distributed dataset to get data from NoSQL
  4. Dispatch the NoSQL key-value subset into a dataset for keys and a dataset for values
  5. Do some computations
  6. Release Spark resources
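Before going through the steps one by one, here is a minimal end-to-end sketch assembled from the snippets detailed below. The class name, package and parameter order come from the spark-submit example at the end of the post; the argument handling and comments are illustrative assumptions, not an official listing:

package nosql.examples.oracle.com;

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import oracle.kv.hadoop.KVInputFormat;

public class SparkKVInput {
    public static void main(String[] args) throws Exception {
        // 1. Get Spark resources
        SparkConf sparkConf = new SparkConf().setAppName("SparkKVInput");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // 2. Define the configuration parameters for NoSQL
        Configuration hconf = new Configuration();
        hconf.set("oracle.kv.kvstore", args[0]);      // store name, e.g. kvstore
        hconf.setStrings("oracle.kv.hosts", args[1]); // helper host:port, e.g. bigdatalite.localdomain:5000
        hconf.set("oracle.kv.parentKey", args[2]);    // major key prefix, e.g. /T/1

        // 3. Define the RDD reading key-value pairs from NoSQL
        JavaPairRDD<Text, Text> jrdd =
            sc.newAPIHadoopRDD(hconf, KVInputFormat.class, Text.class, Text.class);

        // 4. Dispatch keys and values into their own datasets, mapped to serializable Strings
        JavaRDD<String> strkeys = jrdd.keys().map(new Function<Text, String>() {
            public String call(Text t) { return t.toString(); }
        });
        JavaRDD<String> strvalues = jrdd.values().map(new Function<Text, String>() {
            public String call(Text t) { return t.toString(); }
        });

        // 5. Do some computations: print each key:value pair
        List<String> keys = strkeys.collect();
        List<String> values = strvalues.collect();
        for (int idx = 0; idx < values.size(); idx++) {
            System.out.println(keys.get(idx) + ":" + values.get(idx));
        }

        // 6. Release Spark resources
        sc.stop();
    }
}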

Get Spark resources

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("SparkKVInput"); // the application name allows tracking of the job
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Define the configuration parameters

hconf.set("oracle.kv.kvstore", "kvstore");

hconf.set("oracle.kv.parentKey", "/H/10"); // just a major key prefix

hconf.set("oracle.kv.hosts", String[]{"bigdatalite.localdomain:5000"}); 

Define the Spark resilient distributed dataset to get data from NoSQL

JavaPairRDD<Text, Text> jrdd = sc.newAPIHadoopRDD(hconf, KVInputFormat.class, Text.class, Text.class);

The dataset parameters are the configuration, the InputFormat extension, the Java class for keys, and the Java class for values. The key and value classes are stated in the javadoc of KVInputFormat:

public class KVInputFormat
extends KVInputFormatBase<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text>

Dispatch the NoSQL key-value subset into a dataset for keys and a dataset for values

Setting the datasets for keys and values is easy: 

JavaRDD<Text> rddkeys = jrdd.keys();
JavaRDD<Text> rddvalues = jrdd.values();

Manipulating them directly is not possible, though: Spark does not know how to serialize Text. A mapping is needed to transform each Text dataset into a String dataset. The following code does the trick for the values:
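A minimal sketch of that mapping, using the Spark 1.x Java Function API (the anonymous class shown here is an assumption; any Text-to-String map will do):

import org.apache.spark.api.java.function.Function;

JavaRDD<String> strvalues = rddvalues.map(new Function<Text, String>() {
    public String call(Text value) {
        return value.toString(); // copy each Text into a plain, serializable String
    }
});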

The code for the keys (strkeys) is very similar.

Do some computations

Print keys and values:

import java.util.List;

List<String> keys = strkeys.collect();
List<String> values = strvalues.collect();

// keys and values come back in matching order, since both datasets derive from jrdd
for (int idx = 0; idx < values.size(); idx++) {
    System.out.println(keys.get(idx) + ":" + values.get(idx));
}

Release Spark resources

sc.stop(); // release the JavaSparkContext

How to test it

Put some data into the kvstore:

  • put kv -key /T/1/1/-/1 -value V11_1
  • put kv -key /T/1/1/-/2 -value V11_2
  • put kv -key /T/1/1/-/3 -value V11_3
  • put kv -key /T/1/2/-/3 -value V12_3
  • put kv -key /T/2/2/-/1 -value V22_1
  • put kv -key /T/2/2/-/2 -value V22_2
Generate a jar (spark.jar) containing the NoSQL client jars (kvclient.jar and kvavro.jar) and the class with the Spark code, SparkKVInput. This class takes three parameters, which feed the configuration properties above: oracle.kv.kvstore, oracle.kv.hosts and oracle.kv.parentKey.

An example of the calling command on a BigDataLite 4.1 VM is:

spark-submit --class nosql.examples.oracle.com.SparkKVInput <path  to spark.jar location>/spark.jar kvstore bigdatalite.localdomain:5000 /T/1

The results are:

14/10/21 11:26:54 INFO SparkContext: Job finished: collect at SparkKVInput.java:62, took 0.341590587 s

/T/1/1/-/1:V11_1
/T/1/1/-/2:V11_2
/T/1/1/-/3:V11_3
/T/1/2/-/3:V12_3

14/10/21 11:26:54 INFO SparkUI: Stopped Spark web UI at http://bigdatalite.localdomain:4040

Hope this first introduction to the use of NoSQL key/value pairs in Spark helps you go deeper with Spark as an engine for manipulating NoSQL data.
