In a previous post, I provided an overview of the key differences between supervised and unsupervised machine learning. For the sake of simplicity, I suggested these two buckets could neatly encompass all the different types of machine learning algorithms data scientists use to discover patterns in big data, but that just isn’t the case. Many popular methods leverage a blend of both, an approach called semi-supervised learning. In today’s blog, I’ll dive deeper into what semi-supervised learning is and which use cases are best suited to this hybrid approach.
The biggest difference between supervised and unsupervised machine learning is this: Supervised machine learning algorithms are trained on datasets that include labels, added by a machine learning engineer or data scientist, which guide the algorithm to understand which features are important to the problem at hand. Unsupervised machine learning algorithms, on the other hand, are trained on unlabeled data and must determine feature importance on their own based on inherent patterns in the data. (If the ideas of training algorithms or quantifying feature importance seem completely foreign, be sure to check out our executive’s guide to predictive modeling!)
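To make that distinction concrete, here is a toy sketch of the two dataset shapes. The spam-filter scenario, feature names, and values are all invented for illustration; the only point is that supervised training data carries a human-provided label per example, while unsupervised training data does not.

```python
# Labeled data for supervised learning: each example pairs features
# with a human-provided answer ("label") the algorithm learns to predict.
labeled_emails = [
    {"num_links": 7, "all_caps_words": 12, "label": "spam"},
    {"num_links": 0, "all_caps_words": 1,  "label": "not_spam"},
]

# Unlabeled data for unsupervised learning: features only. With no
# labels to guide it, the algorithm must find structure on its own.
unlabeled_emails = [
    {"num_links": 3, "all_caps_words": 4},
    {"num_links": 1, "all_caps_words": 0},
]
```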
As you may have guessed, semi-supervised learning algorithms are trained on a combination of labeled and unlabeled data. This is useful for a few reasons. First, labeling massive amounts of data for supervised learning is often prohibitively time-consuming and expensive. What’s more, heavy reliance on human labeling can impose human biases on the model. As a result, including large amounts of unlabeled data during training tends to reduce the time and cost of building the final model, and it can improve the model’s accuracy as well.
For those reasons, semi-supervised learning is a win-win for use cases like webpage classification, speech recognition, and even genetic sequencing. In all of these cases, data scientists can access large volumes of unlabeled data, but the process of actually assigning supervision information to all of it would be an insurmountable task.
Using classification as an example, let’s compare how these three approaches work in practice:
Supervised classification: The algorithm learns to assign labels to types of webpages based on labels supplied by a human during the training process.
Unsupervised clustering: The algorithm looks at inherent similarities between webpages to place them into groups.
Semi-supervised classification: Labeled data is used to help identify that specific groups of webpage types are present in the data and what they might be. The algorithm is then trained on unlabeled data to define the boundaries of those webpage types, and it may even identify new types of webpages that weren’t represented in the human-supplied labels.
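One common way to implement the semi-supervised approach above is self-training: fit a simple classifier on the few labeled examples, then iteratively pseudo-label the unlabeled examples it is most confident about and refit. The sketch below is a minimal, made-up illustration of that loop, using one-dimensional points and a nearest-centroid classifier rather than real webpage data; the class names, values, and confidence threshold are all invented for the example.

```python
def centroids(points, labels):
    """Mean position of each class's currently labeled points."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        if y is None:  # None marks an unlabeled example
            continue
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def self_train(points, labels, threshold=1.0, max_rounds=10):
    """Iteratively pseudo-label unlabeled points that sit clearly
    closer to one class centroid than to the next-nearest one."""
    labels = list(labels)  # don't mutate the caller's list
    for _ in range(max_rounds):
        cents = centroids(points, labels)
        changed = False
        for i, x in enumerate(points):
            if labels[i] is not None:
                continue
            # distance from this point to each class centroid
            dists = sorted((abs(x - c), y) for y, c in cents.items())
            # accept the pseudo-label only when confident: the point is
            # at least `threshold` closer to the best class than the next
            if dists[1][0] - dists[0][0] >= threshold:
                labels[i] = dists[0][1]
                changed = True
        if not changed:
            break
    return labels

points = [0.0, 0.2, 0.5, 4.8, 5.0, 1.0, 4.0, 2.0]
labels = ["a", "a", None, None, "b", None, None, None]
print(self_train(points, labels))
# → ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'a']
```

Note how the three labeled points anchor the two classes, while the unlabeled points flesh out each class's boundary, which mirrors the webpage example: a little supervision says which groups exist, and the unlabeled bulk defines their shape.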
The structure and volume of the data at hand should always inform the data modeling approach you take, no matter the use case. That’s why critically evaluating your data and resources is an integral part of an efficient data science workflow.