Data Science and Oracle Streaming Service

March 2, 2020 | 2 minute read
Igor Aragao de Souza
Principal Big Data Consultant

How’s it going there?

Today I want to show how we can read data from Oracle Streaming Service (OSS) using Data Science.

  • Data Science is a platform for data scientists to build, train, and manage models on Oracle Cloud Infrastructure using Python and open source machine learning libraries.
  • Oracle Streaming Service is a Kafka-compatible, secure, scalable, low-cost, pay-as-you-use streaming solution with no lock-in. It lets us do what is described here with very little development and deployment effort.
  • The idea here is to get Donald Trump tweets in almost real-time and do some kind of analysis.

 

Requirements:

  • OSS instance;

  • Data Science instance;

How does this work?

The Oracle Cloud Infrastructure (OCI) Streaming service is mostly API-compatible with Apache Kafka, so you can use the Kafka APIs to produce to and consume from it. Based on that, we are going to use the kafka-python library.

To connect with Twitter we are going to use the Tweepy API.

Architecture

Note: Because OSS is Kafka-compatible, we can use a Kafka API, which means you can easily switch to a regular Kafka broker.

 

Data Science Environment

First, you need to configure your tenancy for Data Science.

You also need to allow the Data Science environment to access the internet, and for this you need to configure a NAT Gateway.

 

Note: I suggest you follow the "getting-started.ipynb" notebook first to validate your environment. If you get an "HTTPSConnectionPool" error, it means there is a problem with your NAT Gateway configuration.

Open the Data Science terminal and install the two libraries:

    pip install tweepy
    pip install kafka-python

OSS Environment

For the OSS you just need to check the authentication part, and for this, you need:
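As a rough guide (an assumption based on how Kafka-compatible access to OCI Streaming is usually configured, not something taken from this post): a bootstrap server for your region, SASL over SSL with the PLAIN mechanism, a username made of tenancy name, user name and stream pool OCID, and an auth token as the password. The sketch below collects these into keyword arguments for kafka-python. The helper name `build_oss_kafka_config`, the endpoint pattern and every placeholder value are hypothetical, so check the Streaming documentation for the exact values in your tenancy.

```python
def build_oss_kafka_config(region, tenancy_name, user_name, stream_pool_ocid, auth_token):
    """Return keyword arguments for kafka-python's KafkaProducer/KafkaConsumer.

    Assumption: the endpoint pattern and the username layout follow the
    Kafka compatibility settings of OCI Streaming; verify them for your tenancy.
    """
    return {
        "bootstrap_servers": "cell-1.streaming.{}.oci.oraclecloud.com:9092".format(region),
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": "{}/{}/{}".format(tenancy_name, user_name, stream_pool_ocid),
        "sasl_plain_password": auth_token,
    }


# Placeholder values only; replace them with your own.
config = build_oss_kafka_config(
    region="eu-frankfurt-1",
    tenancy_name="mytenancy",
    user_name="me@example.com",
    stream_pool_ocid="ocid1.streampool.oc1..example",
    auth_token="my-auth-token",
)
print(config["bootstrap_servers"])
```

The resulting dictionary can be unpacked into `KafkaProducer(**config)` or `KafkaConsumer("my-stream", **config)`. Pointing `bootstrap_servers` at a regular Kafka broker, and dropping the SASL settings if that broker does not use them, should be all it takes to switch.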

Show me the code

Producer - Application that sends the messages.
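A minimal sketch of what such a producer could look like, assuming Tweepy 3.x (the version line that provides `StreamListener`; later releases changed the streaming interface) and a kafka-python `KafkaProducer` already configured for your stream. The names `forward_tweet` and `make_listener` are hypothetical helpers, not part of either library.

```python
import json


def forward_tweet(producer, topic, raw_data):
    """Send one raw tweet (JSON text) to the stream; return True if it was sent."""
    try:
        tweet = json.loads(raw_data)
    except ValueError:
        return False  # not valid JSON, nothing to forward
    if not isinstance(tweet, dict) or "text" not in tweet:
        return False  # e.g. limit or delete notices, which carry no tweet text
    # KafkaProducer.send expects bytes unless a value_serializer is configured.
    producer.send(topic, value=json.dumps(tweet).encode("utf-8"))
    return True


def make_listener(producer, topic):
    """Build a Tweepy 3.x listener that forwards every tweet to the stream."""
    import tweepy  # imported lazily; assumes tweepy 3.x is installed

    class KafkaForwardingListener(tweepy.StreamListener):
        def on_data(self, raw_data):
            forward_tweet(producer, topic, raw_data)
            return True  # keep the Twitter stream open

        def on_error(self, status_code):
            # Returning False disconnects; 420 means we are being rate limited.
            return status_code != 420

    return KafkaForwardingListener()
```

With Tweepy 3.x this would be wired up roughly as `tweepy.Stream(auth, make_listener(producer, "my-stream")).filter(track=["trump"])`, where the stream name and the keyword are placeholders.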

Consumer - Application that receives the messages.
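A matching consumer could be sketched as follows, assuming the messages on the stream are UTF-8 encoded JSON tweets and that `config` is a dictionary of kafka-python connection settings for your stream. `make_consumer` and `read_tweets` are hypothetical helper names.

```python
import json
from itertools import islice


def make_consumer(topic, config):
    """Create a KafkaConsumer for `topic` (requires `pip install kafka-python`)."""
    from kafka import KafkaConsumer  # imported lazily so read_tweets works without it

    # Start from the oldest available message when no offset has been committed.
    return KafkaConsumer(topic, auto_offset_reset="earliest", **config)


def read_tweets(consumer, limit):
    """Read up to `limit` messages and decode each one from JSON bytes into a dict."""
    return [
        json.loads(record.value.decode("utf-8"))
        for record in islice(consumer, limit)
    ]
```

By default a `KafkaConsumer` blocks while waiting for new messages, so `read_tweets` returns only once `limit` messages have arrived; passing `consumer_timeout_ms` in `config` makes the iteration stop after a quiet period instead.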

To connect to Twitter you will need API credentials.
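One way to keep those credentials out of the notebook is to read them from environment variables. This is only a sketch: the variable names are hypothetical, and `make_auth` assumes the Tweepy 3.x `OAuthHandler` interface.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise KeyError if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


def make_auth(credentials):
    """Build a Tweepy 3.x OAuth handler (requires `pip install tweepy`)."""
    import tweepy  # imported lazily; assumes tweepy 3.x

    auth = tweepy.OAuthHandler(
        credentials["TWITTER_CONSUMER_KEY"],
        credentials["TWITTER_CONSUMER_SECRET"],
    )
    auth.set_access_token(
        credentials["TWITTER_ACCESS_TOKEN"],
        credentials["TWITTER_ACCESS_TOKEN_SECRET"],
    )
    return auth
```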

Great, at this point the tweets are being captured. How fast they arrive depends on how “hot/trending” your chosen keywords currently are on Twitter. I chose keywords that collect data about Donald Trump, which is always a hot topic.

Now let's do some kind of analysis. We can create a class that receives the tweets, saves them to a file, and stops streaming when it reaches a limit.
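Such a class might look like the sketch below. It is a hypothetical stand-in rather than the original code: the `on_data` name mirrors the Tweepy 3.x callback, where returning `False` stops the stream, and tweets are written one JSON document per line.

```python
import json


class TweetFileCollector:
    """Append incoming tweets to a file and signal when the limit is reached."""

    def __init__(self, path, limit):
        self.path = path
        self.limit = limit
        self.count = 0

    def on_data(self, raw_data):
        """Save one tweet; return False once no more tweets should be collected."""
        if self.count >= self.limit:
            return False
        with open(self.path, "a", encoding="utf-8") as handle:
            # Re-serialise so that every tweet ends up on exactly one line.
            handle.write(json.dumps(json.loads(raw_data)) + "\n")
        self.count += 1
        return self.count < self.limit
```

The same object can be fed from a consumer loop instead of a Tweepy stream: call `on_data` for each message and break out of the loop when it returns `False`.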

 

Here we can load the tweet data into a pandas DataFrame.
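A minimal sketch of that conversion, assuming the tweets were saved one JSON document per line and that only a handful of standard tweet fields are of interest. The helper names and the chosen columns are assumptions for illustration.

```python
import json

# Top-level tweet fields to keep; "lang" and "source" are used in the checks below.
COLUMNS = ("created_at", "lang", "source", "text")


def tweet_to_row(tweet):
    """Flatten one tweet dict into a row with a few columns."""
    row = {column: tweet.get(column) for column in COLUMNS}
    # The author lives in a nested "user" object; keep just the screen name.
    row["user"] = (tweet.get("user") or {}).get("screen_name")
    return row


def load_rows(path):
    """Read a JSON-lines file of tweets and return one row per tweet."""
    with open(path, encoding="utf-8") as handle:
        return [tweet_to_row(json.loads(line)) for line in handle if line.strip()]


def to_dataframe(rows):
    """Build a DataFrame from the rows (requires pandas)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    return pd.DataFrame(rows)
```

For example, `df = to_dataframe(load_rows("tweets.jsonl"))`, where the file name is a placeholder.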

This is a very simple DataFrame, but we can still check some interesting things, such as the most common tweet languages and client applications:

    df.lang.value_counts()
    df.source.value_counts()

Now you can play with the data and do some nice analyses and visualizations as well.

 

Links

Overview of Data Science

Overview of Streaming Service

Igor Aragao de Souza

Principal Big Data Consultant

Principal Big Data Consultant, Developer Advocate, Streaming Evangelist, Brazilian Geek, Coffee lover, Sepultura Fan, Maker and Hockey Player. Based in Dublin.

My hobbies are playing with IoT, electronics, Twitter sentiment analysis, electric guitar and In-line Skate/Hockey.

I currently have a project on transforming Donald Trump’s tweets into cash.

I also run a monthly meet-up for Brazilian IT professionals in Dublin with an average of 80 attendees, and I am a volunteer mentor assisting at Coder Dojo in Rathmines.
