Clustering text documents using the natural language processing (NLP) conda pack in OCI Data Science

June 2, 2021 | 9 minute read
Wendy Yip
Data Scientist

Oracle Cloud Infrastructure (OCI) Data Science recently released two conda packs designed for natural language processing (NLP) workloads: the natural language processing conda pack for CPU and the natural language processing conda pack for GPU.  Both conda packs are available to customers when they log in to OCI Data Science. 

Natural language processing is the area of artificial intelligence concerned with how machines work with human language. NLP tasks include sentiment analysis, language detection, key phrase extraction, and clustering of similar documents. Our conda packs come pre-installed with many packages for NLP workloads.

Over the last few years, NLP has advanced greatly with the introduction of Transformer models such as BERT (Bidirectional Encoder Representations from Transformers). Our conda packs include Hugging Face’s state-of-the-art transformers library, which contains pre-trained models for different NLP tasks. In addition, they include the versatile libraries NLTK and Scikit-learn for pre-processing text data and building models, wrappers around BERT such as KeyBERT, and deep learning frameworks such as PyTorch Lightning.

We are going to show an example of how to use the NLP conda pack to process and group a set of documents. We use the 20 Newsgroups dataset, which contains about 20,000 forum posts across 20 different topics.

 

Load the dataset

First, we import the necessary libraries. We are going to use Scikit-learn for loading in the dataset and pre-processing the data.  Also, we are going to use UMAP, which is a dimensionality reduction library, along with Matplotlib and Bokeh for data visualization.


Let’s load the dataset. Scikit-learn has a built-in function for loading the 20 Newsgroups dataset.

These are the 20 Newsgroups categories that a post can belong to.
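The original code for this step is not reproduced here, so the following is only a minimal sketch of how the loading step and the category listing might look. It assumes Scikit-learn's `fetch_20newsgroups` loader with `subset="all"`; the helper names `load_newsgroups` and `summarize` are hypothetical and exist only for illustration.

```python
from sklearn.datasets import fetch_20newsgroups


def load_newsgroups(subset="all"):
    """Return the 20 Newsgroups posts as a scikit-learn Bunch.

    The first call downloads the data, so it needs network access.
    The choice of subset="all" is an assumption, not taken from the post.
    """
    return fetch_20newsgroups(subset=subset)


def summarize(dataset):
    """Return the number of posts and the list of category names."""
    return len(dataset.data), list(dataset.target_names)
```

Calling `summarize(load_newsgroups())` should return the number of posts together with the 20 category names, which is one way to list the categories a post can belong to.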


Let’s look at one post.
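As a rough sketch of this inspection step, the helper below pairs a post's raw text with its category name. It assumes a dataset object shaped like the one Scikit-learn's loader returns (with `data`, `target`, and `target_names` attributes); the function name `describe_post` and the toy post are hypothetical stand-ins, not real Newsgroups content.

```python
from types import SimpleNamespace


def describe_post(dataset, index):
    """Return (category name, raw text) for the post at `index`."""
    category = dataset.target_names[dataset.target[index]]
    return category, dataset.data[index]


# Toy stand-in for the real dataset (hypothetical content).
toy = SimpleNamespace(
    data=[
        "From: someone@example.com\nSubject: Orbit question\n\nHow high is low Earth orbit?",
        "From: other@example.com\nSubject: Tires\n\nWhich tires last longest?",
    ],
    target=[1, 0],
    target_names=["rec.autos", "sci.space"],
)

category, text = describe_post(toy, 0)
print(category)
print(text)
```

With the real dataset, the printed text would include the header lines that come with each post.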


We notice metadata such as the title and the author's name. A classifier may overfit on these parts of the data. Techniques such as TF-IDF can reduce how much this undesirable data pollutes a classifier (we discuss TF-IDF later in the post).

But first, let’s create the data labels from the Newsgroups categories and a pandas DataFrame of the labels to assist UMAP with plotting.
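A minimal sketch of this labeling step follows. It assumes the dataset exposes integer targets and a list of category names, as Scikit-learn's loader does; the function name `make_label_frame`, the column name `category`, and the toy inputs are illustrative assumptions.

```python
import pandas as pd


def make_label_frame(target, target_names):
    """Map integer targets to category names and wrap them in a DataFrame."""
    category_labels = [target_names[i] for i in target]
    return pd.DataFrame({"category": category_labels})


# Toy inputs (hypothetical): three posts drawn from two categories.
hover_df = make_label_frame([1, 0, 1], ["rec.autos", "sci.space"])
print(hover_df)
```

With the real data, the call would presumably be `make_label_frame(dataset.target, dataset.target_names)`, giving one row per post that a plot can use to show each point's category.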

 

Create vectors from the text data

The first step to working with text data is to turn it into features. This is a process known as vectorization, as it is mapping text data into a vector form.

There are different ways of vectorization. We are going to start with the simplest, the word-count vectorizer. It uses a "bag-of-words" approach to handle the text data. This is a word-order independent approach that simply counts how many times a particular word appears in a document. We are going to add an additional requirement that a word must be seen at least 5 times to be part of the generated encoded representation of a post. We are going to use Scikit-learn’s built-in CountVectorizer to do this.
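The sketch below illustrates the word-count approach on a toy corpus. It assumes the frequency requirement is expressed through `CountVectorizer`'s `min_df` parameter, which counts the number of documents containing a word rather than its total occurrences; whether the original notebook used exactly this setting is an assumption. The toy corpus uses `min_df=2` so that a few words survive the filter.

```python
from sklearn.feature_extraction.text import CountVectorizer


def count_vectorize(posts, min_df=5):
    """Fit a bag-of-words vectorizer and return it with the document-term matrix."""
    vectorizer = CountVectorizer(min_df=min_df)
    doc_term_matrix = vectorizer.fit_transform(posts)
    return vectorizer, doc_term_matrix


# Toy corpus (hypothetical); the real input would be the list of posts.
toy_posts = [
    "the hockey game was great",
    "the hockey team won the game",
    "new graphics card for the computer",
]

# Keep only words that appear in at least 2 of the 3 toy documents.
vectorizer, counts = count_vectorize(toy_posts, min_df=2)

# vocabulary_ maps each kept word to its column; sorting gives column order.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

Each row of the matrix is one document and each column is one kept word, so word order within a post plays no role in the representation.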

We use UMAP configured with the cosine distance metric. UMAP is a general-purpose dimensionality reduction algorithm; run in unsupervised mode, it creates an easily visualized 2D representation of our documents. Cosine distance measures how dissimilar two documents are, based on the angle between their vectors rather than their lengths. There are other metrics for measuring similarity, but cosine distance is often found to work better on NLP tasks than metrics such as Euclidean distance. We see how our documents, labeled by the different topics, cluster together.
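To make the choice of metric concrete, here is a small sketch comparing cosine and Euclidean distance on toy word-count vectors, followed by one way the UMAP step might be written. The toy vectors, the helper name `embed_2d`, and the `random_state` value are assumptions; only the 2D output and the cosine metric come from the text.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Toy word-count vectors (hypothetical data); columns are three words.
counts = np.array([
    [2.0, 1.0, 0.0],  # short post
    [4.0, 2.0, 0.0],  # same word mix, twice as long
    [0.0, 0.0, 3.0],  # shares no words with the other two
])

cos = cosine_distances(counts)
euc = euclidean_distances(counts)

# Posts 0 and 1 point in the same direction, so their cosine distance is 0
# even though their Euclidean distance is sqrt(5), about 2.24.
print(cos.round(3))
print(euc.round(3))


def embed_2d(doc_term_matrix, random_state=42):
    """Project documents to 2D with UMAP using the cosine metric.

    Needs the umap-learn package; it is imported here so that the toy
    comparison above runs without it.
    """
    import umap

    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=random_state)
    return reducer.fit_transform(doc_term_matrix)
```

Because cosine distance ignores document length, a long post and a short post with the same mix of words are treated as close, which suits raw word counts.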