Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

Using Social Media Data to Build an Audience Attribution Platform

Linear TV publishers and advertisers have historically been unable to adequately measure audience engagement and preferences for their specific television shows and movies. This has led to an inability to build advertising-based products across content platforms. In contrast, digital media provides significantly enhanced audience metrics for tracking advertising and media spend, although there has been considerable controversy over the fidelity of some of these measurements.

Nielsen ratings only provide age and gender breakdowns, and many premium shows have no data at all because some high-end audiences do not use people meters. Consequently, ad buyers and brand managers must work with limited data when judging how well their ads meet advertising goals across the stages of the advertising and communications funnel. Some limitations of this ecosystem may never be solved, since it is impossible to correctly attribute 100% of ad spending across every stage of the funnel, from awareness through product trial to the purchase decision.

In this article, I will present a conceptual and product framework for using social media data to develop an audience attribution and research platform for major entertainment and media companies. The target market for this product is publishers, agencies, and advertisers. The framework is based on work that was designed and implemented with real social media data, but for reasons of confidentiality the focus here is on the conceptual framework and the data science challenges.

A high-level outline of the project, data sources, analytical approaches, outputs, and deliverables is given in the figure below:

Social Media Project Outline

Figure 1: Overview of Project

Converting Raw Social Media Data to Structured Data for Analysis

The core of converting raw social media data of any kind is building domain-specific dictionaries that turn the unstructured data into structured numeric data that can then be used for quantitative analysis, segmentation, and predictive modeling. The preliminary domain was restricted to the client's and competitors' top 50 TV shows, top 200 movies, and about 100 leading brands. Initially, the process was purely manual, with the goal of detecting and validating patterns, keywords, and phrases with the help of the client's domain experts. Ultimately, a series of language models automated some of this work while still allowing for manual intervention, and over time a custom scripted discovery process was built that accelerated dictionary creation. The output from applying the dictionaries was placed in predefined spreadsheet templates, which were merged and loaded into SQL dictionary schemas. The output from all these dictionaries was manually checked and used to form the basis of a ground truth. We also created a classification system, based initially on manual intervention and partial automation, that separated fake tweets and bad data from good data.
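As a minimal sketch of this dictionary step (the show names and terms below are hypothetical placeholders, not the client's actual dictionaries), each tweet can be matched against per-domain term sets and emitted as one structured numeric row:

```python
# Hypothetical per-show term dictionaries; the real dictionaries were
# built and validated with the client's domain experts.
SHOW_TERMS = {
    "show_a": {"sheldon", "penny", "bazinga"},
    "show_b": {"dragons", "winterfell"},
}

def tweet_to_row(tweet_id, text):
    """Convert one raw tweet into a structured numeric row."""
    tokens = {t.strip(".,!?#@").lower() for t in text.split()}
    row = {"tweet_id": tweet_id}
    for show, terms in SHOW_TERMS.items():
        row[show] = len(tokens & terms)  # numeric feature: dictionary hits
    return row
```

Rows in this shape can be appended to spreadsheet templates and later loaded into the SQL dictionary schemas described above.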

There was also a parallel effort to establish ground truth and create classification models to determine gender and age groupings. An ideal attribution platform should be able to identify key demographics for its audience. These key demographics were identified as:

  • Age

  • Gender

  • Income

  • Education

  • Family size

  • Life stage

  • Location (Approximate)

We were able to establish sufficient ground truth and develop classifiers for age and gender with reasonably good accuracy. Classification models for income, education, family size, and life stage were difficult to build; establishing a foundation of ground truth and building reliable, accurate classifiers for these variables would require significant investment.
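One common way to seed this kind of ground truth, sketched below under the assumption that profile bios are available, is to apply self-report patterns to user bios. The patterns here are illustrative only; real coverage would need far more rules plus manual review.

```python
import re

# Hypothetical self-report patterns used to seed gender/age labels.
GENDER_RULES = [
    (re.compile(r"\b(mother|mom|wife|she/her)\b", re.I), "F"),
    (re.compile(r"\b(father|dad|husband|he/him)\b", re.I), "M"),
]
AGE_RULE = re.compile(r"\b(\d{2})\s*(?:yo|y/o|years old)\b", re.I)

def seed_labels(bio):
    """Return (gender, age) guesses from a profile bio, or None values."""
    gender = next((g for rx, g in GENDER_RULES if rx.search(bio)), None)
    m = AGE_RULE.search(bio)
    age = int(m.group(1)) if m else None
    return gender, age
```

Labels seeded this way would then feed the classification models as training data, with manual checks on a sample.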

Overview of Data Flow

Figure 2: From Raw Social Media Data to Multiple Databases

Psychographic Insights Model Based on Computational Psycholinguistics

A key differentiating component of an attribution platform is the ability to analyze an individual's digital communication and infer their psychographic characteristics using psycholinguistic modeling. It is well established in the psycholinguistic literature that the accumulated spoken or written words of an individual, produced over time, reflect the individual's enduring psychographic characteristics.

LIWC (Linguistic Inquiry and Word Count)1 is a transparent text analysis program that counts words in psychologically meaningful categories. Empirical results using LIWC demonstrate its ability to detect meaning in a wide variety of experimental settings, including attentional focus, emotionality, social relationships, thinking styles, and individual differences. In its complete form, this approach can classify an individual along 52 individual personality traits, starting with OCEAN, the "big 5."
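The word-counting idea can be illustrated with a toy version. The category word lists below are tiny stand-ins (the real LIWC dictionary is far larger and licensed); each category score is that category's share of total words.

```python
# Illustrative category word lists, not the real LIWC dictionary.
CATEGORIES = {
    "positive_emotion": {"love", "great", "happy", "awesome"},
    "negative_emotion": {"hate", "awful", "sad", "angry"},
    "social": {"friend", "family", "talk", "we"},
}

def liwc_scores(text):
    """Score text as the fraction of words falling in each category."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    total = len(tokens) or 1
    return {cat: sum(t in words for t in tokens) / total
            for cat, words in CATEGORIES.items()}
```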

The "big 5" are the five personality traits referred to by the acronym OCEAN, which stands for the following characteristics that broadly describe how a person engages with the world:

  1. Openness

  2. Conscientiousness

  3. Extraversion

  4. Agreeableness

  5. Emotional Range (Neuroticism)

Generating meaningful attributes with high confidence requires each individual to produce more than 2,500 words in a conversational setting. This means that, given genuine and sufficiently good-quality tweets from verified sources, it is possible to infer the individual's psychographic attributes. This work required extensive SQL work and the use of an open source personality insights API.
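Given the 2,500-word requirement, a natural preprocessing step is to pool each user's tweets and keep only users whose pooled text clears the threshold before calling a personality-scoring API. A minimal sketch:

```python
from collections import defaultdict

MIN_WORDS = 2500  # per-user threshold cited above for stable trait estimates

def users_ready_for_scoring(tweets):
    """tweets: iterable of (user_id, text) pairs. Returns a dict of
    user_id -> pooled text for users who clear the word threshold."""
    pooled = defaultdict(list)
    for user, text in tweets:
        pooled[user].append(text)
    out = {}
    for user, texts in pooled.items():
        joined = " ".join(texts)
        if len(joined.split()) >= MIN_WORDS:
            out[user] = joined
    return out
```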

Example of Extracting “Agreeableness” as an Individual Psycholinguistic Trait

Figure 3: Example of Extracting “Agreeableness” as an Individual Psycholinguistic Trait

As described earlier, it is important to build accurate dictionaries from the ground up to classify tweets or other social media conversations and determine whether a specific tweet is relevant to the business question. Does the tweet qualify as a comment on one of the main characters of the TV show "The Big Bang Theory"? This can initially be answered by manually building dictionaries of words, phrases, or expressions unique to the show of interest, and by doing the matching manually, in order to build a set of rules that can then be programmatically automated. Some random manual checks were still necessary to ensure the fidelity of the entire process.
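A hedged sketch of that rule pattern, assuming a hand-built character dictionary and a show-context dictionary: a tweet is treated as a character comment only when it matches both.

```python
# Illustrative hand-built dictionaries for one show; in practice these
# were developed and validated with domain experts.
CHARACTERS = {"sheldon", "leonard", "penny", "howard", "raj", "amy"}
SHOW_CONTEXT = {"bazinga", "#bigbangtheory", "big bang theory", "tbbt"}

def is_character_comment(text):
    """True only when the tweet hits both a character term and a
    show-context term, reducing false positives on common words."""
    t = text.lower()
    has_character = any(c in t for c in CHARACTERS)
    has_context = any(s in t for s in SHOW_CONTEXT)
    return has_character and has_context
```

The two-dictionary requirement is what keeps a tweet like "Penny for your thoughts" from being misclassified as show-relevant.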

Personality Attribution and Segment Creation from Computational Psycholinguistic Analysis

Figure 4: Personality Attribution and Segment Creation from Computational Psycholinguistic Analysis

Once we have generated the set of 52 psychographic traits for each relevant individual Twitter user, we can combine that data with demographics and brand-preference data to create segments, which in turn enable the design of customized, targeted advertising campaigns. How can we ensure that the generated psychographic traits genuinely describe the population of interest with high confidence? What is the face validity, construct validity, content validity, and predictive validity of these approaches? We developed a face-validation process by confirming with our clients that the audience traits identified for a specific show using social media data were consistent with the intent of the show's creators. The industry does a good job of designing TV shows and entertainment targeted toward different segments; there is, however, no confirmation process for whether that audience is reached, whether they engaged, what their brand preferences are, or other audience measures.

Building Customer Segments

Latent Class Analysis was used to build customer segments. These segments were built using the psychographic attributes and select client show preferences. Competitor shows, brands, and demographic attributes such as age and gender were used as profiling variables. An illustrative five-segment solution is given below:

Social Media-Driven Five Segment Solution

Figure 5: Social Media-Driven Five Segment Solution
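Latent Class Analysis is typically run in specialized statistical packages; purely to illustrate the underlying mechanics, here is a tiny EM implementation for latent classes over binary indicators (e.g. endorses a trait, prefers a show). This is a sketch, not the production approach.

```python
import random

def lca_fit(X, k, iters=50, seed=0):
    """EM for a latent class model over binary indicator rows X."""
    rng = random.Random(seed)
    n, j = len(X), len(X[0])
    pi = [1.0 / k] * k  # class sizes (priors)
    # Item-endorsement probabilities per class, random start.
    p = [[rng.uniform(0.25, 0.75) for _ in range(j)] for _ in range(k)]
    for _ in range(iters):
        # E-step: posterior class responsibilities per respondent.
        R = []
        for x in X:
            lik = [pi[c] for c in range(k)]
            for c in range(k):
                for xf, pf in zip(x, p[c]):
                    lik[c] *= pf if xf else (1.0 - pf)
            s = sum(lik) or 1e-12
            R.append([l / s for l in lik])
        # M-step: re-estimate priors and item probabilities.
        for c in range(k):
            w = sum(r[c] for r in R) or 1e-12
            pi[c] = w / n
            for f in range(j):
                raw = sum(r[c] * x[f] for r, x in zip(R, X)) / w
                p[c][f] = min(0.99, max(0.01, raw))  # avoid 0/1 collapse
    return pi, p, R
```

Segment membership for respondent i is the argmax over R[i]; profiling variables such as age, gender, and competitor-show preferences are then cross-tabulated against those memberships, as in the five-segment solution above.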

Application Overview

The segmentation outputs at the individual social media/consumer level were then used to create a prototype application in Tableau. The application's screens and functionality are summarized below:



  • Audience Insights and Analytics: Overview of ratings scores and SVOD data with associated demographics, by show and program

  • Audience Composition: Demographics and Big 5 personality attributes by program

  • Audience Engagement: Audience engagement indices for each program/show, by gender and age

  • Program Insights: Key audience demographics, audience duplication levels, and top show preferences

  • Audience Targeting: Conceptualize, design, and create media and content plans for targeted audiences; also gives insight into consumption size and potential based on demographic segments, product categories, and brand affinities

  • Targeted Bundling: Create targeted bundles for audiences with higher brand affinities, achieving significantly higher TRP advantages than a non-targeted plan

This is the first time it has been possible to link brand affinity to an audience's show preferences in the context of linear TV on a continuous-time basis, without using survey data or other one-off approaches. Not all brands were represented, since the brand affinity scores depended on the brands being mentioned in sufficient volume in social media data.
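One simple way to express that link (an assumed form, for illustration) is an affinity index: the brand's mention rate within a show's audience relative to the base rate across all observed users, scaled so that 100 means average affinity.

```python
def affinity_index(brand_mentioners_in_audience, audience_size,
                   brand_mentioners_overall, total_users):
    """Index where 100 = the audience mentions the brand at the base rate."""
    base_rate = brand_mentioners_overall / total_users
    audience_rate = brand_mentioners_in_audience / audience_size
    return round(100 * audience_rate / base_rate)
```

For example, if 30 of a show's 100 tracked audience members mention a brand that only 100 of 1,000 users mention overall, the index is 300: that audience is three times as likely as average to mention the brand.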

Lessons Learned

  • Creating a customer data panel based on user-generated content is feasible and will improve over time.

  • Meaningful and well differentiated segments can be created using social media data and machine learning approaches.

  • Focus should be on building the customer panel around two broad data areas:

    • Data that changes slowly across time or not at all – psychographics, age, gender, socio-economic status, and family with children
    • Data that changes across time – interactions with TV shows, movies, and consumer brands, and sentiment

  • For customer valuation, you need to populate difficult-to-fill fields by using surveys to establish ground truth. Specifically, these are socio-economic status, income groupings, marital status, and family size.

  • Third-party data from various vendors may be used to validate demographics, purchasing, and other behaviors, but since the level of aggregation may be too high, the joins may not produce useful data.

  • As social media data in most cases is not currently available to external third-party vendors, these frameworks can only be created by the companies that manage and have ownership of their users’ social media data.


1. Yarkoni, Tal (2010). Personality in 100,000 Words: A Large-Scale Analysis of Personality and Word Use Among Bloggers. Journal of Research in Personality, 44(3), 363–373; Pennebaker, James W., Martha E. Francis, and Roger J. Booth (2001). Linguistic Inquiry and Word Count: LIWC 2001. Mahwah, NJ: Lawrence Erlbaum Associates.
