rw

This blog introduces Oracle Big Data Service (BDS) and describes how you can connect to a BDS cluster using Hive and Spark connections from Oracle Analytics Cloud (OAC).

Oracle Big Data Service clusters contain a Hadoop Distributed File System (HDFS) and Hive database which ingests and transforms data from diverse sources and formats (structured, semi-structured, and unstructured). You can access and use the refined data from BDS in Oracle Analytics Cloud to gain deeper insights. Architecture

BDS_2_OAC_Connect

Create a     Big Data Service Cluster on Oracle Cloud Infrastructure Create Oracle Cloud Infrastructure (OCI) IAM Policies Required for Big Data Service

For information about OCI Identity and Access Management (IAM) policies required for BDS, see Understanding Big Data Service resources and permissions in IAM policies.

Create a Big Data Service Cluster on OCI

  1. Sign in to your OCI console.
  2. Navigate to Analytics & AI.
  3. Under Data Lake, click Big Data Service.
  4. Select the Compartment where you want to create the BDS cluster.

    Analytics_AI_BDS

  5. Click Create cluster.

    Create_BDS

  6. Select the public or private subnets of the Virtual Cloud Network (VCN) where you want to create the BDS cluster.

For information on how to create VCNs and subnets, refer to the OAC disaster recovery technical paper Disaster Recovery Configuration for Oracle Analytics Cloud.

You can create the BDS cluster either as a secure and highly-available (HA) cluster or as a non-HA cluster.

You can create the BDS cluster in the VCN’s public or private subnet.

  • If the BDS cluster is created in a public subnet, the public IP address isn’t created by default. You must explicitly create the public IP address.
  • If the BDS cluster is created in a private subnet, you can’t access it from the Internet. You must create a private access channel (PAC) in Oracle Analytics Cloud to connect to the BDS cluster on the private endpoint.

Network Security for Big Data Service

In OCI, you use Security Lists and Network Security Groups to manage network security.

You must open certain ports on Big Data Service clusters to allow access to services such as Apache Ambari, Cloudera Manager, and Hue.

Use these Ingress Rules Destination Ports for Hive and Spark connections:

  • Hive: 10010
  • Spark: 10000

Create an OCI Object Storage Bucket

The steps to create OCI Object Storage buckets are described in the document Disaster Recovery Configuration for Oracle Analytics Cloud.  After creating a bucket, record the namespace, bucket name, and object storage URL.

Download the Sample Data File

Download the sample date file used in the blog: NYC Taxi Trip Records for Jan19

Upload the Data File to the Object Storage Bucket

DataFile_in_Bucket

Load the Data File to the Hadoop Distributed File System (HDFS)

# Access the first master node (mn0) of the BDS cluster using ssh as the opc user.

# Switch to the Linux user created on the nodes of the cluster (for example, odhadmin).

sudo su odhadmin

# If the BDS cluster is created as a Secure & High Available Cluster, get a Kerberos token for the odhadmin user.
kinit odhadmin

# Create a local download folder
mkdir -p /home/odhadmin/downloads

cd /home/odhadmin/downloads

# Download the data file from the object storage bucket to the local file system.
wget https://{namespace}.objectstorage.{oci-region}.oci.customer-oci.com/n/{namespace}/b/{bucet-name}/o/yellow_tripdata_2019-01.csv

# Remove the header row from the data file.
sed -i 1d yellow_tripdata_2019-01.csv

# Upload the data file to HDFS.
HDFS_ROOT=/data

hadoop fs -mkdir -p /data/YellowTaxiTrip
hadoop fs -chmod 777 /data/YellowTaxiTrip
hadoop fs -put -f /home/odhadmin/downloads/yellow_tripdata_2019-01.csv /data/YellowTaxiTrip/

Create a Hive Database and Tables and Load the Data

Run the following code on the first master node (mn0) of the BDS cluster.

hive -e "
create database if not exists yellowtaxi;
DROP TABLE IF EXISTS yellowtaxi.trips_ext;
DROP TABLE IF EXISTS yellowtaxi.trips;

CREATE EXTERNAL TABLE yellowtaxi.trips_ext (
  VendorID int,
  tpep_pickup_datetime timestamp,
  tpep_dropoff_datetime timestamp,
  passenger_count int,
  trip_distance double,
  RatecodeID int,
  store_and_fwd_flag string,
  PULocationID int,
  DOLocationID int,
  payment_type int,
  fare_amount double,
  extra double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  improvement_surcharge double,
  total_amount double,
  congestion_surcharge double
 )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS TEXTFILE
  location '/data/YellowTaxiTrip/'
;

CREATE TABLE yellowtaxi.trips (
  VendorID int,
  tpep_pickup_datetime timestamp,
  tpep_dropoff_datetime timestamp,
  passenger_count int,
  trip_distance double,
  RatecodeID int,
  store_and_fwd_flag string,
  PULocationID int,
  DOLocationID int,
  payment_type int,
  fare_amount double,
  extra double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  improvement_surcharge double,
  total_amount double,
  congestion_surcharge double
 )
PARTITIONED BY (p_VendorID int)
STORED AS PARQUET
;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM yellowtaxi.trips_ext t
INSERT OVERWRITE TABLE yellowtaxi.trips PARTITION(p_VendorID)
SELECT
  VendorID,
  tpep_pickup_datetime,
  tpep_dropoff_datetime,
  passenger_count,
  trip_distance,
  RatecodeID,
  store_and_fwd_flag,
  PULocationID,
  DOLocationID,
  payment_type,
  fare_amount,
  extra,
  mta_tax,
  tip_amount,
  tolls_amount,
  improvement_surcharge,
  total_amount,
  congestion_surcharge,
  VendorID
;"

View the Data Loaded in Hue

bds_hue

SSL for Certificate Authority Signed Certificates

In Big Data Service clusters version 3.0.7 and later, you can run the Big Data Service certificate utility tool to use Certificate Authority (CA) SSL certificates for your Oracle Distribution including Apache Hadoop (ODH) clusters.

Connect Oracle Analytics Cloud to Big Data Service Clusters

Prerequisites:

  • Create an Oracle Analytics Cloud instance in either the same public or private subnet of the same VCN as the BDS cluster, or in a different public or private subnet of the same VCN as the BDS cluster.
    • Note: If you create the OAC instance in a different VCN, ensure that the OAC instance can reach the BDS cluster on the other VCN.
  • If you created the BDS cluster in a private subnet, create a private access channel (PAC) in OAC.

You can use Apache Hive or Apache Spark to connect Oracle Analytics Cloud to both secure and unsecure BDS clusters.

  • Connect to unsecure BDS clusters with no Kerberos or SSL.
  • Connect to secure BDS clusters with Kerberos enabled.
  • Connect to secure BDS clusters with Kerberos and SSL enabled.

Connect to Unsecure BDS Clusters with no Kerberos or SSL using HiveServer2

In OAC, create an Apache Hive connection to the OCI BDS cluster’s Hive database.

Use Basic authentication with a username and password to connect.

bds_unsecure

Connect to Secure BDS Clusters with Kerberos Enabled using HiveServer2

Kerberos Authentication

1. Collect the required files for Kerberos authentication.

Before running the cmd to get a Ticket Granting Ticket (TGT), ensure the Kerberos admin_server and kdc ports 749 and 88 are open for ingress access from the Internet or from the OAC private subnet.

0.0.0.0/0 port 749 TCP

0.0.0.0/0 port 88 TCP

0.0.0.0/0 port 88 UDP

2. Sign in to the un0 node (bdsclusun0.ds.dsvcn.oraclevcn.com) of the BDS cluster.

3. Create the folder kerberos under /tmp.

mkdir -p /tmp/kerberos

4. Copy the hive.service.keytab file from /etc/security/keytabs to /tmp/kerberos as oac.keytab.

cp /etc/security/keytabs/hive.service.keytab /tmp/kerberos/oac.keytab

5. Copy the /etc/krb5.conf file to /tmp/kerberos/krb5conf.

cp /etc/krb5.conf /tmp/kerberos/krb5conf

bds_krb5conf

6. If BDS is on a public subnet, update the admin_server and kdc information in the krb5conf file with the public IP address of the cluster’s mn0 node instead of the hostname.

If BDS is on a private subnet, use the same FQDN hostname of the mn0 node of the BDS cluster and configure the PAC in OAC to resolve and reach the BDS cluster mn0 node components.

7. Create a file named service_details.json and save it inside the /tmp/kerberos folder.

{
 "Host" : "bdsclusun0.ds.dsvcn.oraclevcn.com",
 "Port" : "10010",
 "ServicePrincipalName" : "hive/bdsclusun0.ds.dsvcn.oraclevcn.com@BDSCLOUDSERVICE.ORACLE.COM"
}

If you’re using the BDS public IP address, you must to use the public IP address instead of the hostname in the above json file. 

8. Create a ZIP file from the kerberos folder.

cd /tmp

ls -l kerberos

zip -r SSLKerberos.zip kerberos/*

9. Copy the SSLKerberos.zip file to the client machine where you access the OAC URL.

Configure a Private Access Channel for Your Oracle Analytics Cloud Instance

oac_bds_pac

bds_oac_pac2

Create a Hive Connection

1. Sign in to OAC.

2. Create a connection.

oac_create_conn

3. Select the connection type Apache Hive.

Apache_hive_oac

4. Select the authentication type Kerberos and select the Kerberos Credentials ZIP file.

Apache_hive_conn1

Connect to Secure BDS Clusters with Kerberos and SSL Enabled using HiveServer2

1. If the BDS cluster is SSL-enabled and has a well-known CA-signed SSL Certificate, check the Enable TLS checkbox in the connection dialog.

Note: OAC can trust only well-known CA-signed SSL certificates of the data sources.

oac_hive_conn3

2. Similarly, create a Spark connection.

Connecting to Secure BDS Clusters with Kerberos Enabled using Spark3 Thrift Server

Kerberos Authentication

1. Collect the required files for Kerberos authentication.

Before running the cmd to get a Ticket Granting Ticket (TGT), ensure the Kerberos admin_server and kdc ports 749 and 88 are open for ingress access from the internet or from the OAC private subnet.

0.0.0.0/0 port 749 TCP

0.0.0.0/0 port 88 TCP

0.0.0.0/0 port 88 UDP

2. Sign in to the mn0 node (bdsclusmn0.ds.dsvcn.oraclevcn.com) of the BDS cluster.

3. Create a folder kerberos under /tmp.

mkdir -p /tmp/kerberos

4. Copy the spark.service.keytab file from /etc/security/keytabs to /tmp/kerberos as oac.keytab.

cp /etc/security/keytabs/spark.service.keytab /tmp/kerberos/oac.keytab

5. Copy the /etc/krb5.conf file to /tmp/kerberos/krb5conf.

cp /etc/krb5.conf /tmp/kerberos/krb5conf

bds_krb5conf1

6. If BDS is on a public subnet, update the admin_server and kdc information in the krb5conf file with the public IP address of the cluster’s mn0 node instead of the hostname.

If BDS  is on a private subnet, use the same FQDN hostname of the mn0 node of the BDS cluster and configure the PAC in OAC to resolve and reach the BDS Cluster mn0 node components.

7. Create a file named service_details.json and save it inside the /tmp/kerberos folder.

{
 "Host" : "bdsclusmn0.ds.dsvcn.oraclevcn.com",
 "Port" : "10000",
 "ServicePrincipalName" : "spark/bdsclusmn0.ds.dsvcn.oraclevcn.com@BDSCLOUDSERVICE.ORACLE.COM"
}

If you’re using a BDS public IP address, provide the public IP address instead of the hostname in the above JSON file. 

8. Create a ZIP file from the kerberos folder.

cd /tmp

ls -l kerberos

zip -r spark3tskerb.zip kerberos/*

9. Copy the spark3tskerb.zip file to the client machine where you access the OAC URL.

Create a Spark Connection

1. Sign in to OAC.

2. Create a Spark connection.

bds_spark_oac

oac_bds_spark2

Create a Dataset from the BDS Connection

bds_ds_1

oac_ds_bds1

Create a New Dataset

OAC_DS123

Transform the columns while creating the dataset. For example, you can convert the pickup and drop-off times to Linux timestamp format, and calculate trip duration from the difference between the trip drop-off and pickup time.

Create Workbooks using the Respective Datasets

After creating the dataset, create a workbook from the Hive connection.

OAC_bds_wb1

Similarly, create a workbook using the dataset you created from the Spark connection.

oac_bds_wb

Call to Action

Now you’ve learnt how to connect to Oracle Big Data clusters from Oracle Analytics Cloud, try it yourself with your own Oracle Big Data Service clusters. Load your data files (typically extremely large and complex datasets that are difficult or impossible to process using traditional data processing methods) into Oracle Big Data Service clusters and transform the data. Connect Oracle Analytics Cloud to your Oracle BDS clusters and build enhanced data insights in Oracle Analytics Cloud.

If you have questions, post them in the Oracle Analytics Community and we’ll follow up with answers.

REDWOOD

var coll = document.getElementsByClassName(“collapsible”); var i; for (i = 0; i < coll.length; i++) { coll[i].addEventListener("click", function() { this.classList.toggle("active"); var content = this.nextElementSibling; if (content.style.display === "block") { content.style.display = "none"; } else { content.style.display = "block"; } }); }