Move data between Oracle Database and Apache Hadoop using high-speed connectors.

  • December 1, 2017

See How Easily You Can Move Data Between Apache Hadoop and Oracle Database in the Cloud - Part 2

In Part 1 we saw how easily we can move data between a Big Data Cloud Service and Database Cloud Service.

In this post we look at how data copied from Oracle Database Cloud Service to Big Data Cloud Service can be used in machine learning applications built on Spark MLlib.

Following the steps from this earlier blog post on using Spark with data copied by Copy to Hadoop, we can create a DataFrame that points to the Hive external table created in Part 1.

scala> val movie_ratings_oracle_df = sqlContext.table("moviedemo.movie_ratings_oracle")

This data has a lot of NULLs in the rating column. Let us remove them:

scala> val movie_ratings_oracle_df_notnull = movie_ratings_oracle_df.filter("rating is not null")
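The same kind of filter can also be written against the column itself rather than as a SQL string. Here is a minimal sketch, assuming a Spark 1.x spark-shell where sqlContext is predefined. The toy DataFrame and the names toyDf, toyNotNull and dropped are made up for illustration and stand in for the Hive table.

```scala
import org.apache.spark.sql.functions.col
import sqlContext.implicits._

// Toy rows (made-up values) standing in for the Hive table; None becomes a NULL rating.
val toyDf = Seq[(Int, Int, Option[Int])](
  (1, 10, Some(4)),
  (1, 11, None),
  (2, 10, Some(5))
).toDF("cust_id", "movie_id", "rating")

// Same predicate as filter("rating is not null"), expressed on the column.
val toyNotNull = toyDf.filter(col("rating").isNotNull)

// Comparing counts shows how many rows the filter removed.
val dropped = toyDf.count() - toyNotNull.count()
println(s"Dropped $dropped rows with a NULL rating")
```

Comparing the counts before and after is a quick way to check how much of the data survives the filter before it is used for training.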

Let us convert the DataFrame movie_ratings_oracle_df_notnull to an RDD of ratings by extracting the columns cust_id (position 0), movie_id (position 1) and rating (position 6) from each row.

scala> import org.apache.spark.mllib.recommendation.Rating

scala> val ratings = movie_ratings_oracle_df_notnull.map{row => Rating(row.getDecimal(0).intValue(),row.getDecimal(1).intValue(),row.getDecimal(6).doubleValue())}

The shell responds with:

ratings: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[26] at map at <console>:32
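The map step reads each column with getDecimal, which returns a java.math.BigDecimal, and then narrows it to the Int and Double fields that Rating expects. The conversion can be tried on its own in a plain Scala REPL. This is only a sketch: RatingSketch is a hypothetical stand-in for the MLlib Rating class, the row is modeled as a map from column position to value, and the sample values are made up.

```scala
import java.math.BigDecimal

// Hypothetical stand-in for org.apache.spark.mllib.recommendation.Rating,
// so this sketch runs without Spark on the classpath.
case class RatingSketch(user: Int, product: Int, rating: Double)

// Stand-in for one row: a map from column position to its decimal value.
// Only positions 0 (cust_id), 1 (movie_id) and 6 (rating) are read, as in the post.
def toRating(row: Map[Int, BigDecimal]): RatingSketch =
  RatingSketch(
    row(0).intValue(),    // cust_id  -> Int user id
    row(1).intValue(),    // movie_id -> Int product id
    row(6).doubleValue()  // rating   -> Double score
  )

// Sample values are made up for illustration.
val sampleRow = Map(
  0 -> new BigDecimal("1185972"),
  1 -> new BigDecimal("1012"),
  6 -> new BigDecimal("4")
)
val sample = toRating(sampleRow)
println(sample)
```

Note that intValue() drops any fractional part and silently wraps values that do not fit in an Int. If the IDs could exceed the Int range, intValueExact() throws an ArithmeticException instead of returning a wrong number.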

As described in the Copy to Hadoop + Spark blog post, we can use this RDD in analysis after importing some Spark MLlib classes:

scala> import org.apache.spark.mllib.recommendation.ALS

scala> import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

scala> import org.apache.spark.mllib.recommendation.Rating
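With these classes imported, one possible next step is to train a collaborative filtering model with ALS. The sketch below is an assumption about how the analysis might continue, not necessarily what the linked post does. It assumes a spark-shell where sc is predefined and uses a toy RDD with made-up values so it runs without the Hive table. The names toyRatings, rank, numIterations and lambda, and their values, are illustrative rather than tuned.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy ratings (made-up values) so the sketch runs in any spark-shell;
// in the session above you would pass the `ratings` RDD instead.
val toyRatings = sc.parallelize(Seq(
  Rating(1, 10, 5.0), Rating(1, 11, 1.0),
  Rating(2, 10, 4.0), Rating(2, 12, 2.0),
  Rating(3, 11, 2.0), Rating(3, 12, 5.0)
))

// Hyperparameters are illustrative assumptions, not tuned values.
val rank = 2           // number of latent factors
val numIterations = 5  // ALS iterations
val lambda = 0.01      // regularization strength

// Train a matrix factorization model on the explicit ratings.
val model = ALS.train(toyRatings, rank, numIterations, lambda)

// Predicted rating for a user and a product that both appear in the training data.
val predicted = model.predict(1, 12)

// Top products for user 1, returned as Rating(user, product, predictedScore).
val topForUser = model.recommendProducts(1, 2)
topForUser.foreach(println)
```

On the real data, it would be sensible to hold out part of the ratings RDD and compare predictions against it before trusting any recommendations.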
