At the core of DataFox is a focus on creating a pristine dataset of company information. We want this data to be as clean, accurate, and comprehensive as possible.
In practical terms, this means we want to gather as much relevant information as we can on every company in our database. In one workflow, we contribute toward this goal through data partnerships; we combine, prioritize, and resolve a huge variety of data sources.
In another workflow, we focus on the company data that partnerships cannot provide. We make a meticulous effort to find relevant and unique company data from unstructured text through natural language processing algorithms. It's an enormous undertaking, but the creation of a proprietary, competitively advantageous set of company data is well worth it.
Although we also classify unstructured text from public websites, company blogs, and tweets, this post will focus solely on news published by news organizations.
Every 30 minutes, around the clock, we collect articles from nearly 7,000 of these sources (through RSS feeds) and pass them through our classifier. These sources range from highly specific news sources (e.g. Healthcare IT News) to more widely read websites (e.g. TechCrunch).
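We won't cover our production ingestion here, but a minimal sketch of this kind of polling loop, using the feedparser library (the library choice and the example feeds are assumptions for illustration, not a description of our actual system), might look like:

```python
import time
import feedparser  # assumption: any standard RSS/Atom parser would do

# Illustrative feeds only; the real source list spans ~7,000 feeds.
FEED_URLS = [
    "https://techcrunch.com/feed/",
    "https://www.healthcareitnews.com/rss",
]

def poll_feeds(seen_links):
    """Fetch each feed and yield articles we haven't seen before."""
    for url in FEED_URLS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            if entry.link not in seen_links:
                seen_links.add(entry.link)
                yield {"title": entry.title, "url": entry.link}

seen = set()
while True:
    for article in poll_feeds(seen):
        pass  # hand each new article off to the classifier pipeline
    time.sleep(30 * 60)  # poll every 30 minutes
```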
What they all have in common is a high degree of content curation (thank you editors) and a focus on being very up-to-date (thank you capitalism). These features help prevent garbage from entering our data pipeline and help us keep our classification of the news highly targeted and highly relevant.
Given the constant pipeline of articles being created by these sources, we want to make sense of them and extract important company data from them. For as many companies as possible, we want to automatically tag articles with their appropriate company signals using natural language processing algorithms.
In our domain, we currently classify 68 such signal types that broadly fall into one of the following eight buckets:
Categorization of these signal types can be difficult for many reasons:
We could have approached this as a multi-label classification problem at the article level. In this situation, an article could contain information about a company that both received private funding and expanded to open new offices.
We would then tag the article as containing two signal types, though the specific context of which sentences relate to which signals would be tough to disambiguate.
Instead, we decided to focus on sentences or groups of sentences within the article. We refer to these smaller units as "snippets." By narrowing the lens of the classifier, we focus the problem into a multi-label classification problem per snippet.
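As a rough illustration, snippets could be carved out with spaCy's sentence segmentation; the fixed two-sentence grouping below is an illustrative assumption, not our actual heuristic:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with sentence boundaries

def article_to_snippets(text, window=2):
    """Split an article into consecutive groups of `window` sentences."""
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    return [
        " ".join(sentences[i:i + window])
        for i in range(0, len(sentences), window)
    ]
```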
Though in many cases there will be only one signal per snippet (making the problem look more like multiclass classification), there are plenty of interesting snippets that contain several signals.
Here is one such motivating example from PR Newswire that contains 5 signals:
We decided to tackle this multi-label problem through a natural language processing (NLP) and supervised machine learning (ML) approach. Our customers expect high precision out of these signals, so, over the span of many years, we created a golden set of tagged snippet data.
The precursor to all of our work on classification comes from an essential dataset that is continuously being expanded by our highly trained auditors. DataFox's team of auditing specialists have collectively spent tens of thousands of hours combing through our news pipeline to tag articles with relevant signals.
This huge amount of work has created a useful, clean, and structured set of company data with hundreds of thousands of relevant signal tag examples. Using our set of internal tools, an auditor will read through an article, highlight a relevant snippet, tag it with any applicable signals, and associate each signal with the appropriate company. This human classification step creates structure for our training dataset and gives our machine learning approach a significant competitive advantage.
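A single audited example might be shaped roughly like this (the field names are hypothetical, not our actual schema):

```python
# Hypothetical schema for one audited example; the real internal
# representation is more involved.
tagged_snippet = {
    "article_url": "https://www.example.com/acme-series-b",
    "snippet": "Acme Corp announced a $20M Series B round and plans "
               "to open a new office in Austin.",
    "signals": ["private_funding", "office_expansion"],  # multi-label target
    "company": "Acme Corp",  # the entity each signal is associated with
}
```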
Returning to the crux of the multi-label problem, we knew that in order to make sense of the tagged training data, we still needed to extract relevant features. Specifically, we needed a way to turn the tagged snippet text into features that could be incorporated into a model.
In order to do this, we turned to the Python library spaCy to perform a significant portion of the data cleaning and pre-processing. Using spaCy helped us with the following tasks:
Ultimately, we tested these different methods of preparing the data on different classes of models to gauge the relative effects on the classifier's precision and recall per signal.
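As a simplified sketch of such a pipeline, spaCy can handle the linguistic cleanup while an n-gram vector space model turns the cleaned snippets into features; the specific steps and parameters here are assumptions:

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

def clean(snippet):
    """Lowercase and lemmatize, dropping stop words, punctuation, and whitespace."""
    return " ".join(
        tok.lemma_.lower()
        for tok in nlp(snippet)
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    )

# Unigrams and bigrams as a simple sentence vector space model.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform([clean(s) for s in snippets])  # snippets: list of str
```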
After these data preparation steps, it was time to test out different models and gauge how well they classified the snippets in question. In a typical test/train setup, we set aside 20% of the tagged data as a test set.
We then evaluated individual models using cross-validation on the remaining 80% of the data, reporting the mean and variance of precision and recall across all cross-validated runs.
Here are a few of the model classes we tested, using scikit-learn's implementations and its stochastic gradient descent classifier:
Support Vector Machines
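A convenience of SGDClassifier is that swapping the loss function swaps the (linear) model family, so the model comparison and the cross-validated evaluation described above can be wired up together. A sketch, with illustrative hyperparameters and continuing from the features built earlier:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate

# Different losses yield different linear model families:
#   "log_loss" ("log" in older scikit-learn) -> logistic regression
#   "hinge"                                  -> linear SVM
candidates = {
    "logistic_regression": SGDClassifier(loss="log_loss", alpha=1e-4),
    "linear_svm": SGDClassifier(loss="hinge", alpha=1e-4),
}

# X, y: features and binary labels for one signal on the 80% training split.
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=5, scoring=["precision", "recall"])
    for metric in ("precision", "recall"):
        vals = scores[f"test_{metric}"]
        print(f"{name} {metric}: mean={vals.mean():.3f}, variance={vals.var():.3f}")
```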
In order to compare the results from these models, we used precision, recall, and f1-scores per signal as guiding metrics. Additionally, to choose amongst similarly performing models, we considered the interpretability of the model.
In the end, a tuned logistic regression performed best, lending itself well to future re-training and giving us the ability to inspect the most important features contributing to each signal tag. Here are some high-level results from the model's performance on the test set, where the overall precision (78.1%) and the overall recall (73.2%) combined for a healthy f1-score of 74.6%.
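For any one signal's classifier, both the evaluation and the feature inspection are straightforward in scikit-learn; a sketch, assuming clf is a fitted binary LogisticRegression for one signal and vectorizer is the TF-IDF model from earlier:

```python
import numpy as np
from sklearn.metrics import classification_report

# Precision, recall, and f1 on the held-out 20% test set.
print(classification_report(y_test, clf.predict(X_test)))

# Interpretability: the n-gram features that push hardest toward this signal.
feature_names = np.asarray(vectorizer.get_feature_names_out())
strongest = np.argsort(clf.coef_[0])[-10:]
print(feature_names[strongest])
```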
This structured classification of the news surfaces fascinating examples, from the international marketing activity of interestingly named French companies:
To the can't-miss details of massive funding rounds in Silicon Valley:
As always, at the end of such a project, there are exciting opportunities to improve the current methodology. Here are a few items that are top of mind heading forward:
At the tail end of our signal taxonomy are a few signals that haven't accumulated enough tagged examples in our training set to perform well in a classification setting. We have room to improve those classifications either by artificially boosting our training set or by manually tagging more examples to get a better breadth of coverage.
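For the boosting route, one simple option is to oversample rare positives with scikit-learn's resample; the signal name and target count here are hypothetical:

```python
from sklearn.utils import resample

# Oversample positives for a rare signal until it has enough examples.
positives = [ex for ex in training_examples if "ipo_filing" in ex["signals"]]
boosted = resample(positives, replace=True, n_samples=500, random_state=0)
training_examples = training_examples + boosted
```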
We can consider using neural networks like Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs). In this situation, we would use word embeddings as features rather than sentence vector space models. An advantage of a CNN could be making the classifier more context-aware than n-gram preprocessing allows, while an advantage of an RNN could be making the classifier more temporally aware, processing one word at a time rather than all at once.
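A rough sketch of the CNN direction in PyTorch, with an untuned, purely illustrative architecture:

```python
import torch
import torch.nn as nn

class SnippetCNN(nn.Module):
    """Minimal convolutional text classifier over word embeddings."""
    def __init__(self, vocab_size, embed_dim=100, num_signals=68):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, num_signals)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, embed, seq)
        x = torch.relu(self.conv(x)).max(dim=2).values  # max-pool over time
        return self.fc(x)  # per-signal logits; apply a sigmoid for multi-label
```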