

OSS data to ADW in almost real-time

Igor Aragao de Souza
Principal Big Data Consultant

Hey you.

Did you know that it is possible to take Oracle Streaming Service (OSS) data and put it into Autonomous Data Warehouse (ADW) in near real-time using just Object Storage? What's more, you can do it without writing any code.

Requirements:

  • An OSS instance;

  • An ADW instance;

  • An Object Storage bucket.

 

How does this work?

The idea is that we can use Apache Flume to write the data to files in Object Storage; then we can use ADW to create an external table over those files, and with this we have the data in ADW in near real-time.

You can play around with the number of files and how often they are written to get as close to real-time as you need.


So let's divide this into two steps:

  1. Read data from OSS and write it to Object Storage
  2. Create the external table using the Object Store file path

 

1-) Read data from OSS and write it to Object Storage

This assumes that you already have an OSS stream with some content.

Option 1:
 

I don't want to reinvent the wheel, so have a look at Publishing To Object Storage From Oracle Streaming Service.

 

If you followed the tutorial, you will see that you need a "connector-config.json" file where you add the connector configuration.

You can check the connector properties here and change how often the files are written.
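The original configuration file is not reproduced in this post. As a rough sketch (the connector name, topic, bucket, namespace, and region are placeholders, not values from the original), a connector-config.json for the Confluent S3 sink connector pointed at the Object Storage S3-compatible endpoint could look like this:

```json
{
  "name": "oss-to-object-storage",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "mytopic",
    "s3.bucket.name": "mybucket",
    "s3.region": "us-ashburn-1",
    "store.url": "https://mynamespace.compat.objectstorage.us-ashburn-1.oraclecloud.com",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
    "flush.size": "3"
  }
}
```

Here flush.size controls how many records go into each file before it is written to the bucket, which is one of the knobs that determines how fresh your files are.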


Example: streaming the content of a log file (file1.json).

This connector configuration will generate 3 files, and each file name is encoded as

<topic>+<kafkaPartition>+<startOffset>.<format>.

You should see:

If you check the first file with oci os object get -ns mynamespace -bn mybucket --name myfile.txt --file - you will see:

The property partitioner.class can be set to io.confluent.connect.storage.partitioner.TimeBasedPartitioner; here you can play around and configure the connector to create one file a day or one file every minute, and with this change how close to real-time your solution is.
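As an illustrative fragment (the exact values are assumptions, not from the original), switching to the time-based partitioner could look like the snippet below; a partition.duration.ms and rotate.interval.ms of 60000 aim for roughly one file per minute:

```json
{
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "partition.duration.ms": "60000",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
  "locale": "en-US",
  "timezone": "UTC",
  "rotate.interval.ms": "60000",
  "timestamp.extractor": "Record"
}
```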

You can have a look at the Amazon S3 Sink Connector for Confluent Platform docs here.


Option 2: using Apache Flume

 

Download Apache Flume here.
 

Untar the downloaded archive anywhere you prefer.

tar xzf apache-flume-1.9.0-bin.tar.gz

You also need to set up JAVA_HOME.

 

We are going to use the Flume Kafka source and the HDFS sink.

HDFS sink doc

Kafka source doc

 

Because we are using the HDFS sink, you need to add the Hadoop jars to the Flume lib folder.

You can download them here.

The idea here is to use the Oracle Object Storage Amazon S3 Compatibility API.

For this, we will use Fuse; you can check this tutorial for a step-by-step Fuse configuration.
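The Fuse setup itself is covered in the linked tutorial. As a sketch only (the bucket name, namespace, region, and the /data mount point are assumptions), mounting the bucket with s3fs over the S3-compatible endpoint might look like:

```shell
# Customer Secret Key from the OCI console, in "access_key:secret_key" form
echo "MYACCESSKEY:MYSECRETKEY" > ${HOME}/.passwd-s3fs
chmod 600 ${HOME}/.passwd-s3fs

# Mount the bucket at /data so the Flume HDFS sink can write to it as a local path
s3fs mybucket /data \
  -o passwd_file=${HOME}/.passwd-s3fs \
  -o url=https://mynamespace.compat.objectstorage.us-ashburn-1.oraclecloud.com \
  -o use_path_request_style
```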

 

Create a JAAS file at /home/user/path/to/flume/conf/flume_kafka_jaas.conf
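The file contents are not reproduced in this post. For the OSS Kafka compatibility layer, which uses SASL/PLAIN, a sketch of flume_kafka_jaas.conf (tenancy, user, stream pool OCID, and auth token are placeholders) would be:

```
KafkaClient {
  org.apache.kafka.common.security.plain.PlainLoginModule required
  username="mytenancy/myuser/ocid1.streampool.oc1..example"
  password="my-auth-token";
};
```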

Create a Flume conf file at /home/user/path/to/flume/conf/conf.properties
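The original conf.properties is not shown here. Assuming the agent name a1 and sink name oci1 used later in this post, a Fuse mount at /data, and placeholder endpoint and topic names, a minimal sketch could be:

```properties
a1.sources = src1
a1.channels = ch1
a1.sinks = oci1

# Kafka source reading from the OSS Kafka-compatible endpoint (placeholder values)
a1.sources.src1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.src1.channels = ch1
a1.sources.src1.kafka.bootstrap.servers = cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092
a1.sources.src1.kafka.topics = mytopic
a1.sources.src1.kafka.consumer.group.id = flume-group
a1.sources.src1.kafka.consumer.security.protocol = SASL_SSL
a1.sources.src1.kafka.consumer.sasl.mechanism = PLAIN

# In-memory channel between source and sink
a1.channels.ch1.type = memory
a1.channels.ch1.capacity = 10000
a1.channels.ch1.transactionCapacity = 1000

# HDFS sink writing to the Fuse-mounted Object Storage bucket
a1.sinks.oci1.type = hdfs
a1.sinks.oci1.channel = ch1
a1.sinks.oci1.hdfs.path = /data/kafka/%{topic}/%y-%m-%d
a1.sinks.oci1.hdfs.fileType = DataStream
a1.sinks.oci1.hdfs.rollInterval = 60
a1.sinks.oci1.hdfs.rollSize = 0
a1.sinks.oci1.hdfs.rollCount = 0
a1.sinks.oci1.hdfs.useLocalTimeStamp = true
```

Here hdfs.rollInterval = 60 rolls a new file every minute, which is one way to tune how real-time the solution is.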

Now you can run Flume with the flume-ng agent command.

Change the file names and paths to whatever you used before.
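The exact command is not shown in the original. With the paths used above and the agent name a1 (an assumption carried over from the example sink name), a typical invocation would be:

```shell
/home/user/path/to/flume/bin/flume-ng agent \
  --conf /home/user/path/to/flume/conf \
  --conf-file /home/user/path/to/flume/conf/conf.properties \
  --name a1 \
  -Dflume.root.logger=INFO,console \
  -Djava.security.auth.login.config=/home/user/path/to/flume/conf/flume_kafka_jaas.conf
```

The java.security.auth.login.config system property is how the JAAS file created earlier is handed to the Kafka client.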

 

As an example, you can use a1.sinks.oci1.hdfs.path = /data/kafka/%{topic}/%y-%m-%d, and this will generate subfolders by topic, year, month, and day, with one set of files per day.

%{topic} >> Kafka topic name

%y-%m-%d >> year-month-day

 

2-) Create the external table using the Object Store file path

Assuming that you already have the ADW instance ready, you just need to create the external table.

Check here for more info.

You can add several files in file_uri_list, and you can play with table partitioning as well.

file_uri_list Doc

format Doc

You can change the column_list to create the specific columns that you want.
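As a hedged sketch (the table name, credential name, file URI, format, and columns below are placeholders; the real column_list must match your stream's data), creating the external table with the DBMS_CLOUD package could look like:

```sql
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'STREAM_DATA_EXT',
    credential_name => 'OBJ_STORE_CRED',
    file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mynamespace/b/mybucket/o/data/kafka/mytopic/*',
    format          => json_object('type' value 'csv'),
    column_list     => 'event_time  TIMESTAMP,
                        event_key   VARCHAR2(100),
                        event_value VARCHAR2(4000)'
  );
END;
/
```

Once created, querying STREAM_DATA_EXT reads the files directly from the bucket, so new files are visible as soon as they land.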

Note: This idea is a perfect fit for the Big Data scenario using HDFS and Hive. Oracle is about to release its new Big Data PaaS solution (BDS - Big Data Service), and when that happens I will publish a second version of this tutorial using HDFS and Hive.

 

Conclusion


How often you write files to Object Storage determines how close to real-time your solution is.

Example:

If you write one file a day, your "real-time" is one day;

If you write one file every minute, your "real-time" is one minute.

 

You can create one table partition for each file, or you can use several files for one table partition. It's a trade-off between the number of files and the number of table partitions that gives you the best performance for your SQL queries.

Example:

You can create the table using one file, and SQL performance will be good until that file becomes a big one.

If you create more files, you can create table partitions over different files and then query by partition when the volume of data increases.
