By Denny Wong on Dec 15, 2014
Sentiment analysis has been a hot topic recently. Sentiment analysis, or opinion mining, refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials. Social media websites are a good source of people’s sentiments. Companies use social networking sites to announce new products, promote their products, collect product reviews and user feedback, interact with their customers, and so on. It is important for companies to sense customer sentiment toward their products so that they can react accordingly and benefit from customers’ opinions.
In this blog post, we will show you how to use Data Miner to perform some basic sentiment analysis (based on text analytics) using Twitter data. The demo data was downloaded from the developer API console page of the Twitter website. The data originated from the Oracle Twitter page and contains about a thousand tweets posted in the past six months (May to Oct 2014). We will determine the sentiment of each tweet (highly favored, moderately favored, or less favored) based on its favorite count and assign that sentiment to the tweet. We then build classification models using these tweets along with their assigned sentiments. The goal is to predict how well a new tweet will be received by customers, which may help the marketing department craft a better tweet before it is posted.
The demo (click here to download the demo Twitter data and workflow) uses the newly added JSON Query node in Data Miner 4.1 to import the Twitter data; please review the “How to import JSON data to Data Miner for Mining” blog entry in a previous post.
Workflow for Sentiment Analysis
The following workflow shows the process we use to prepare the Twitter data, determine the sentiments of the tweets, and build classification models on the data.
The following describes the nodes used in the above workflow:
- Data Source (TWITTER_LARGE)
  - Select the demo Twitter data source. The sample Twitter data is attached to this blog post.
- JSON Query (JSON Query)
  - Select the JSON attributes required for the analysis; we only use the “id”, “text”, and “favorite_count” attributes. The “text” attribute contains the tweet, and the “favorite_count” attribute indicates how many times the tweet has been favorited.
- SQL Query (Cleanse Tweets)
  - Remove numbers, punctuation, and shortened URLs from the tweets because they carry no predictive information.
- Filter Rows (Filter Rows)
  - Remove retweeted tweets because these are duplicates.
- Transform (Transform)
  - Quantile-bin the “favorite_count” data into three quantiles; each quantile represents a sentiment. The top quantile represents the “highly favored” sentiment, the middle quantile represents the “moderately favored” sentiment, and the bottom quantile represents the “less favored” sentiment.
- SQL Query (Recode Sentiment)
  - Assign the quantiles to the tweets as their determined sentiments.
- Create Table (OUTPUT_4_29)
  - Persist the data to a table for the classification model build (optional).
- Classification (Class Build)
  - Build classification models to predict customer sentiment toward a new tweet (how much will customers like this new tweet?).
Data Source Node (TWITTER_LARGE)
Select the JSON_DATA in the TWITTER_LARGE table. The JSON_DATA contains about a thousand tweets to be used for sentiment analysis.
JSON Query Node (JSON Query)
Use the new JSON Query node to select the following JSON attributes. This node projects the JSON data to relational data format, so that it can be consumed within the workflow process.
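For readers who want to see the idea outside Data Miner, the projection can be approximated in a few lines of Python. This is only a sketch, not what the JSON Query node does internally: the sample documents are made up for illustration and the helper name project_tweets is hypothetical. Only the attribute names “id”, “text”, and “favorite_count” come from the workflow.

```python
import json

# Made-up sample documents (an assumption for illustration);
# real tweets carry many more attributes than these.
SAMPLE_JSON_DATA = [
    '{"id": 1, "text": "Hello from the demo", "favorite_count": 3, "lang": "en"}',
    '{"id": 2, "text": "Another sample tweet", "favorite_count": 0, "lang": "en"}',
]


def project_tweets(json_documents):
    """Project each JSON document onto the three attributes used in the workflow."""
    rows = []
    for document in json_documents:
        tweet = json.loads(document)
        # Keep only the attributes needed for the analysis; ignore the rest.
        rows.append((tweet["id"], tweet["text"], tweet["favorite_count"]))
    return rows


rows = project_tweets(SAMPLE_JSON_DATA)
```

Each resulting tuple plays the role of one relational row with the columns id, text, and favorite_count.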
SQL Query Node (Cleanse Tweets)
Use the REGEXP_REPLACE function to remove numbers, punctuation, and shortened URLs from the tweets, because these are considered noise and do not provide any predictive information. Notice that we do not treat hashtags inside tweets specially; these tags are treated as regular words.
We specify the number, punctuation, and URL patterns in regular expression syntax and use the database function REGEXP_REPLACE to replace every match in each tweet with an empty string (that is, the matches are removed).
REGEXP_REPLACE("JSON Query_N$10055"."TWEET", '([[:digit:]*]|[[:punct:]*]|(http[s]?://(.*?)(\s|$)))', '', 1, 0) "TWEETS",
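To experiment with the same cleansing idea outside the database, a rough Python analogue might look like the sketch below. It is an approximation under stated assumptions, not a translation of the Oracle expression: Python has no POSIX classes such as [[:punct:]], so the sketch uses string.punctuation (ASCII only), and the names NOISE_PATTERN and cleanse_tweet are hypothetical.

```python
import re
import string

# Assumed Python analogue of the noise pattern. The URL alternative comes first
# so that a whole URL is removed before its digits and punctuation are considered.
NOISE_PATTERN = re.compile(
    r"https?://\S*\s?"                            # a URL plus one trailing whitespace character, if any
    r"|\d"                                        # any single digit
    r"|[" + re.escape(string.punctuation) + r"]"  # any single ASCII punctuation character
)


def cleanse_tweet(text):
    """Remove URLs, digits, and ASCII punctuation from a tweet."""
    return NOISE_PATTERN.sub("", text)


cleaned = cleanse_tweet("Join us at #OOW14! Details: http://t.co/abc123 now")
```

As in the workflow, the hashtag is not treated specially: the “#” is removed as punctuation and the remaining letters are kept as a regular word.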
Filter Rows Node (Filter Rows)
Remove retweeted tweets because these are duplicates. Retweets usually start with the abbreviation “RT”, so we specify the following row filter condition to filter out those tweets.
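The same rule can be sketched in Python. The check below is an assumption about what “starts with RT” means (the token “RT” followed by a space); it is not the exact condition used in the Filter Rows node, and the function name is_retweet is hypothetical.

```python
def is_retweet(text):
    """Assumed rule: a retweet begins with the token "RT" followed by a space."""
    return text.startswith("RT ")


# Made-up tweets for illustration.
tweets = [
    "RT user Big announcement today",
    "Big announcement today",
    "RTFM is not a retweet",
]

# Keep only the tweets that are not retweets.
originals = [tweet for tweet in tweets if not is_retweet(tweet)]
```

Requiring the trailing space keeps words that merely begin with the letters “RT” from being dropped by mistake.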
Transform Node (Transform)
Use the Transform node to quantile-bin the “favorite_count” data into three quantiles; each quantile represents a sentiment. For simplicity, we just bin the counts into three quantiles without applying any special treatment first.
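To make the binning step concrete, here is a minimal Python sketch of equal-frequency (quantile) binning. It assumes the bin boundaries are taken from the sorted values so that equal counts always fall in the same bin; the Transform node may compute its boundaries differently. The function names and the sample counts are made up for illustration.

```python
def quantile_boundaries(values, bin_count=3):
    """Return the upper boundaries of all bins except the last.

    Assumes a non-empty list; the boundary of bin k is the value found
    k/bin_count of the way through the sorted data.
    """
    ordered = sorted(values)
    return [ordered[len(ordered) * k // bin_count - 1] for k in range(1, bin_count)]


def assign_bin(value, boundaries):
    """Return the 1-based bin number for a value (1 = lowest bin)."""
    for bin_number, upper in enumerate(boundaries, start=1):
        if value <= upper:
            return bin_number
    return len(boundaries) + 1


# Made-up favorite counts for illustration.
favorite_counts = [0, 1, 2, 3, 5, 8, 13, 21, 34]
boundaries = quantile_boundaries(favorite_counts)
bins = [assign_bin(count, boundaries) for count in favorite_counts]
```

With real favorite counts, many tweets are likely to share the same low count, so the three bins may end up noticeably unequal in size.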
SQL Query Node (Recode Sentiment)
Assign the quantiles to the tweets as their determined sentiments: the top quantile represents the “highly favored” sentiment, the middle quantile represents “moderately favored”, and the bottom quantile represents “less favored”. These sentiments become the target classes for the classification model build.
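The recoding itself is a simple lookup. The Python sketch below assumes the bins are numbered 1 (bottom quantile) to 3 (top quantile); that numbering, the mapping name SENTIMENT_BY_BIN, and the sample rows are assumptions for illustration rather than the contents of the Recode Sentiment node.

```python
# Assumed numbering: 1 = bottom quantile, 3 = top quantile.
SENTIMENT_BY_BIN = {
    1: "less favored",
    2: "moderately favored",
    3: "highly favored",
}


def recode_sentiment(binned_rows):
    """Replace the bin number in each (tweet_id, bin_number) row with its sentiment label."""
    return [(tweet_id, SENTIMENT_BY_BIN[bin_number]) for tweet_id, bin_number in binned_rows]


# Made-up (tweet id, bin number) rows for illustration.
labeled = recode_sentiment([(101, 3), (102, 1), (103, 2)])
```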
Classification Node (Class Build)
Build classification models using the sentiment as the target and the tweet id as the case id.
Since the TWEETS column contains the textual tweets, we change its mining type to Text Custom.
Enable the Stemming option for text processing.
Compare Test Results
After the model build completes successfully, open the test viewer to compare the model test results. The SVM model seems to produce the best prediction for the “highly favored” sentiment (57% correct predictions).
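As a reminder of what a per-class figure like this measures, the sketch below computes the percentage of test cases of one actual class that the model predicted correctly. We are assuming this is how the test viewer defines “correct prediction” for a target value; the function name percent_correct and the labels are made up for illustration.

```python
def percent_correct(actual, predicted, target_class):
    """Percentage of cases whose actual class is target_class that were predicted as such."""
    relevant = [p for a, p in zip(actual, predicted) if a == target_class]
    if not relevant:
        raise ValueError("no test cases with the given actual class")
    hits = sum(1 for p in relevant if p == target_class)
    return 100.0 * hits / len(relevant)


# Made-up test labels for illustration.
actual = ["highly favored"] * 4 + ["less favored"] * 2
predicted = [
    "highly favored", "highly favored", "highly favored", "less favored",
    "less favored", "highly favored",
]

highly_favored_pct = percent_correct(actual, predicted, "highly favored")
```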
Moreover, the SVM model has a better lift result than the other models, so we will use this model for scoring.
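For readers unfamiliar with lift, the following Python sketch shows one common way to compute it: rank the test cases by the model’s predicted probability for the target class, take the top fraction, and divide the share of true targets there by the share of true targets overall. The function name lift_at and the scores are made up, and the test viewer may report cumulative lift in a slightly different form.

```python
def lift_at(scores, is_target, fraction):
    """Lift of the top `fraction` of cases when ranked by score, highest first.

    scores    -- predicted probability of the target class for each case
    is_target -- 1 if the case actually belongs to the target class, else 0
    Assumes at least one case actually belongs to the target class.
    """
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, is_target), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(flag for _, flag in ranked[:n_top]) / n_top
    overall_rate = sum(is_target) / len(is_target)
    return top_rate / overall_rate


# Made-up scores and outcomes for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_target = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

top_20_percent_lift = lift_at(scores, is_target, 0.2)
```

A lift above 1 means the model concentrates true targets among its highest-scored cases better than random selection would.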
Sentiment Prediction (Scoring)
Let’s score this tweet “this is a boring tweet!” using the SVM model.
As expected, this tweet receives a “less favored” prediction.
How about the tweet “larry is doing a data mining demo now!”?
Not surprisingly, this tweet receives a “highly favored” prediction.
Last but not least, let’s see the sentiment prediction for the title of this blog post.
Not bad: it gets a “highly favored” prediction, so it seems this title will be well received by the audience.
The best SVM model achieves only 57% accuracy for the “highly favored” sentiment prediction, but that is reasonably better than a random guess, which would be right about a third of the time if the three quantile-based classes are roughly equal in size. With a larger sample of tweets, the model accuracy could probably be improved. The new JSON Query node enables us to perform data mining on JSON data, which is the most popular data format produced by prominent social networking sites.