A Data Science Framework for Forecasting Opening Box Office Revenue

Today’s entertainment marketplace is highly competitive, with many channels of consumption and sources of content. Traditional moviegoing has changed dramatically as streaming services like Netflix, with their rapidly growing libraries of affordable and convenient content, reshape consumer behavior.


As the landscape keeps changing, forecasting opening box office revenue has become increasingly difficult. Traditional forecasting methods have seen higher error rates in recent years, especially with the surge of record-breaking superhero films despite a decline in total industry box office revenue. National Research Group (NRG), the primary tracking service for all major film studios, has implemented a more comprehensive data science framework that combines survey data with other sources of information to mitigate these obstacles. This article explains that framework, which can also be applied to a variety of other forecasting objectives.

Leveraging Historical Data

National Research Group has collected core measures using a targeted survey approach for all films dating back to the 1980s. However, we have found that the most relevant data, based on correlations, comes from the last 10 years. NRG wants not only to forecast opening box office accurately for its clients but also to do so much further out than the traditional 3.5 weeks before release: as early as a film’s first trailer release into the market. With that in mind, NRG launched its Campaign Management product in 2016. It asks about a film’s core measures (awareness and interest) as early as its first trailer release, along with a breadth of questions about the film’s story, characters, theatricality, originality, etc. that help studios understand how best to optimize and position their product in the marketplace. NRG uses a robust sampling methodology that targets the moviegoing population with specific frequency, size, and demographic representation.

Aside from survey data, NRG collects social media information about each film, film demographic attributes, screen count, and Rotten Tomatoes scores to comprehensively forecast film performance.


The Curse of Dimensionality and Missing Values

With only hundreds of films released since 2016 available for training and validating a model, it is immediately clear that there are too many variables relative to the number of films. On top of the survey measures, seasonality and categorical variables for genre, rating, franchise, sequel, etc. need to be controlled for. We ran a factor analysis on the survey measures to understand which variables naturally grouped together, then created composite measures that capture the most information while limiting the number of variables in the training set. Through this method, we were able to condense 15 variables into 4.


Of course, it is up to the modeler whether to use actual principal components or factor scores within their models. For our purposes, we took a weighted average across the attributes in each group to produce composite measures, which simplifies interpretation later in our forecasts.
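The weighted-average step can be sketched in a few lines of Python. This is only an illustration: the group names, attribute names, and weights below are hypothetical placeholders, since the actual groupings and weights are not listed in this article.

```python
from typing import Dict

# Hypothetical grouping of survey attributes into composites. The real groups
# and weights are not given in the article, so these are made-up placeholders.
GROUPS: Dict[str, Dict[str, float]] = {
    "story_appeal": {"story": 0.5, "characters": 0.3, "originality": 0.2},
    "big_screen_pull": {"theatricality": 0.6, "spectacle": 0.4},
}


def composite_scores(film: Dict[str, float],
                     groups: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Collapse a film's attribute scores into one weighted average per group."""
    scores = {}
    for group, weights in groups.items():
        total_weight = sum(weights.values())
        weighted_sum = sum(film[attr] * w for attr, w in weights.items())
        # Divide by the total weight so the weights need not sum to 1.
        scores[group] = weighted_sum / total_weight
    return scores


# Illustrative survey scores for a single film (not real data).
film = {"story": 60.0, "characters": 50.0, "originality": 40.0,
        "theatricality": 70.0, "spectacle": 80.0}
composites = composite_scores(film, GROUPS)
print(composites)
```

Because each composite stays on the same scale as the underlying survey measures, it can be read directly, which is the interpretability benefit over raw factor scores.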

Another obstacle was that some information existed only for certain titles: some had franchise information, and some had family information. Rotten Tomatoes scores were also available only a few weeks before a title’s release, at the earliest. In short, we had a very significant missing value problem, and there was no sensible way to impute franchise information for films that were not franchises or family information for films that were not for families. To mitigate this issue, we applied Bayesian updating on top of a model that used the entire dataset. We first built a ‘base’ model that uses the entire historical database of films with all their survey measures and attributes. Then, for franchise or family films with additional information, we used Bayesian updating to combine the base model (prior) with the new data to calculate a new forecast (posterior). We implemented Bayesian updates for franchise films, family films, and films with available Rotten Tomatoes and social media data. This methodology gave us a single model framework without having to impute a significant number of missing values.
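To make the prior-to-posterior step concrete, here is a minimal sketch of one common form of Bayesian updating, the normal-normal conjugate update. The article does not specify which likelihood or prior NRG uses, so the Gaussian assumption, the log-revenue scale, the function name bayes_update, and all numbers are assumptions made purely for illustration.

```python
from typing import Optional, Tuple


def bayes_update(prior_mean: float, prior_var: float,
                 obs_mean: Optional[float],
                 obs_var: Optional[float]) -> Tuple[float, float]:
    """Normal-normal conjugate update; returns (posterior mean, posterior variance).

    The prior is the base model's forecast. If the extra signal is missing
    (for example, a non-franchise film), the prior is returned unchanged,
    so nothing has to be imputed.
    """
    if obs_mean is None or obs_var is None:
        return prior_mean, prior_var
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    post_var = 1.0 / (prior_precision + obs_precision)
    # The posterior mean is a precision-weighted average of prior and new evidence.
    post_mean = post_var * (prior_precision * prior_mean + obs_precision * obs_mean)
    return post_mean, post_var


# Illustrative numbers on a log-revenue scale (not real forecasts).
base = (17.0, 0.25)                       # base model forecast: mean, variance
with_franchise = bayes_update(*base, 18.0, 0.25)
without_extra = bayes_update(*base, None, None)
print(with_franchise, without_extra)
```

Updates of this kind can be chained: the posterior after the franchise signal becomes the prior for a later update once Rotten Tomatoes or social media data arrive.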

Building a Dynamic Simulator on Top of the Model

Because some of our forecasts can be up to a year out, some of the information about the films may be estimates, such as screen count and Rotten Tomatoes score. It is also extremely helpful to understand how forecasts move up or down if various measures reach certain levels. For example, what if my awareness was at 20% at this point in time instead of 15%? This information helps studios optimize their media spend and update their marketing materials. We used Power BI to build a dynamic simulator, as it has a fairly comprehensive R module into which we were able to script our entire model code.
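The what-if logic behind such a simulator can be sketched as follows. Our simulator runs R inside Power BI; this Python version is only a stand-in, and the log-linear model form, the coefficient values, and the function names forecast and what_if are hypothetical, not the fitted model.

```python
import math
from typing import Dict

# Hypothetical coefficients for a log-linear placeholder model; these are
# not fitted values and exist only to make the example runnable.
COEFFS = {"intercept": 15.0, "awareness": 0.04, "interest": 0.03}


def forecast(inputs: Dict[str, float]) -> float:
    """Return opening revenue in dollars from the placeholder log-linear model."""
    log_revenue = COEFFS["intercept"] + sum(COEFFS[k] * v for k, v in inputs.items())
    return math.exp(log_revenue)


def what_if(baseline: Dict[str, float], **overrides: float) -> float:
    """Ratio of the scenario forecast to the baseline forecast after overriding inputs."""
    scenario = {**baseline, **overrides}
    return forecast(scenario) / forecast(baseline)


# Illustrative tracking levels, in percent (not real survey results).
baseline = {"awareness": 15.0, "interest": 30.0}
lift = what_if(baseline, awareness=20.0)
print(f"Forecast changes by a factor of {lift:.3f} if awareness reaches 20%")
```

Reporting the scenario as a ratio to the baseline keeps the focus on how much a change in a measure moves the forecast, which is the question a studio is asking when it adjusts media spend.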


As for performance, we have multiple forecasting windows for Campaign Management, and errors decrease as the windows get closer to release, which is expected. When forecasts are far out, accuracy becomes less important because we know that perceptions of films are still changing. We want that to be true, since the whole idea is to help films realize their full potential. We therefore aim for a balance between good accuracy and insightful simulations.


NRG’s forecasting framework has many uses across forecasting and predictive analytics. The framework is flexible enough to use any ‘base’ model and is open to any type of variable selection methodology. Of course, there are many details that I have skipped over, but I hope the information provided here proves useful to traditional forecasters and machine learning practitioners alike.
