Data science has entered a renaissance period in recent years. Ever-increasing computation power in commodity hardware and powerful open source libraries have enabled us to repurpose algorithms that have languished in textbooks for decades. In particular, parallel computation libraries like Nvidia’s CUDA and recent research by leading minds like Geoffrey Hinton and Yann LeCun have led to a resurgence of the neural network, a once-forgotten algorithm whose origins can be traced to the infancy of computers. These “deep” learning algorithms, so-named because of their computational architectures consisting of many intermediate, “hidden” layers, have been used with astounding success in the fields of visual and audio processing, and most recently with natural language processing (NLP).
There are many types of deep learning algorithms, ranging from convolutional and recurrent neural networks to Hidden Markov models and conditional random fields, which also feature many “hidden” intermediate computations that can be executed in parallel by the same frameworks. In NLP, we use these algorithms to detect important words and phrases, determine the semantic subjects and objects of sentences, and classify vast amounts of text content.
Natural Language Processing in Production
At Oracle Data Cloud, the Applied Data Science group in Reston, Virginia is one of many data engineering teams focusing on the problem of “data materialization.” Put simply, our task is turning unstructured data into structured data. Today, one of our largest tasks involves producing a useful classification for billions of web pages per month. Historically, this sort of text classification has largely been relegated to the bag-of-words model that simply makes a guess based on words on the page, or traditional linear algorithms like logistic regression and the linear-kernel support vector machine (SVM). The typical workflow for algorithms like these involves heuristic pre-processing steps for stemming and tokenization, calculating word counts, figuring out which words are most important to establish a feature set for each class, creating a dictionary from that feature set, and then training an algorithm (or, in the case of bag-of-words, doing nothing).
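The traditional workflow described above can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: the training sentences, the stopword list, the `top_k` cutoff, and every function name are hypothetical, and stemming is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical toy training data: (text, label) pairs.
TRAIN = [
    ("The senate passed the budget bill after a long debate", "politics"),
    ("Voters elected a new senate majority in the election", "politics"),
    ("The team won the match with a late goal", "sports"),
    ("The coach praised the team after the final match", "sports"),
]

# Hypothetical stopword list; real pipelines use much longer ones.
STOPWORDS = {"the", "a", "in", "on", "with", "after", "of", "and"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords (stemming omitted)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_feature_sets(examples, top_k=10):
    """Keep the top_k most frequent tokens of each class as that class's features."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(tokenize(text))
    return {
        label: {word for word, _ in counter.most_common(top_k)}
        for label, counter in counts.items()
    }


def classify(text, feature_sets):
    """Bag-of-words guess: pick the class whose feature words occur most often."""
    tokens = Counter(tokenize(text))
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in feature_sets.items()
    }
    return max(scores, key=scores.get)


feature_sets = build_feature_sets(TRAIN)
print(classify("The senate debate on the budget bill", feature_sets))
print(classify("A late goal decided the match", feature_sets))
```

Note that the classifier depends entirely on the two feature sets staying disjoint. As soon as two classes share their most frequent words, the scores tie and the guess becomes arbitrary.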
These approaches can sometimes perform adequately on simple sentiment analysis and classification tasks, but they very quickly become unwieldy or fall apart completely for complicated problems having hundreds of classes. The requirement for a linearly separable feature set—a set of words specific to one topic and not any others (at least, hopefully)—can prove to be an impossible task on its own when, for example, trying to distinguish between two topics like “politics” and “government,” let alone “state politics” and “local politics.”
The next major problem is our need to assign multiple labels. There are ways to accomplish this with traditional classifiers, typically involving ensembles of algorithms, but these require extensive time for research and development, which we simply don’t have when dealing with the ever-changing landscape of the entire internet. Furthermore, these solutions become exponentially more difficult to implement and maintain with each additional classifier, especially as our taxonomies evolve and our production systems age.
Deep Learning: The Natural Choice for NLP
Deep learning—specifically the convolutional neural network (CNN)—solves these aforementioned issues as a design feature. These algorithms use convolution kernels, an old technique in computer vision, to address problems traditionally dealing with images, essentially by providing a consistent transformation of input data to increase the amount of information, or entropy, from which to derive data representations. Surprisingly, given the right architecture, this approach can be used with unstructured text data embedded in a higher dimensional space. We can perhaps imagine that the solution is somewhat analogous to looking at “pictures” of the words we’re dealing with; in truth, the reasons why they work comprise somewhat of a mathematical rabbit hole that invites a reference to Arthur C. Clarke’s famous quote, “any sufficiently advanced technology is indistinguishable from magic.” (The difficulty in explainability and interpretability in neural networks actually has impeded their adoption in some fields.)
To use convolutional networks for text data, we typically start with word embeddings like Google’s popular Word2Vec or Stanford’s Global Vectors for Word Representation (“GloVe”). These differ from traditional “count vectorization” type approaches in the sense that when we look up a word in our dictionary, we don’t just get a numerical index that’s been used for training, but an entire vector that describes a spatial relationship of this word with others. Intuitively, it makes sense that this approach can outperform traditional methods, simply by virtue of using a representational format for the input that encodes more information than was previously possible.
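To make the difference concrete, here is a small sketch contrasting an index lookup with a vector lookup. The three-word vocabulary and the 4-dimensional vectors are made up by hand for illustration; they are not taken from Word2Vec or GloVe, whose trained vectors usually have hundreds of dimensions.

```python
import math

# Count-vectorization style lookup: a word maps to an arbitrary integer index.
word_to_index = {"king": 0, "queen": 1, "banana": 2}

# Embedding style lookup: a word maps to a whole vector.
# These values are hand-picked for illustration only.
word_to_vector = {
    "king":   [0.9, 0.8, 0.1, 0.0],
    "queen":  [0.8, 0.9, 0.1, 0.1],
    "banana": [0.0, 0.1, 0.9, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; close to 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The indices 0, 1 and 2 say nothing about which words are related.
print(word_to_index["king"], word_to_index["queen"], word_to_index["banana"])

# The vectors do: "king" sits much closer to "queen" than to "banana".
royal = cosine_similarity(word_to_vector["king"], word_to_vector["queen"])
fruit = cosine_similarity(word_to_vector["king"], word_to_vector["banana"])
print(f"king~queen: {royal:.2f}, king~banana: {fruit:.2f}")
```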
Now, our two biggest problems are solved: by virtue of representation learning, we’ve bypassed the need to construct and maintain a linearly separable feature set, and our taxonomies can now become arbitrarily complex and be dealt with by a single algorithm, as we train with categorical or binary cross-entropy loss and produce an entire vector of probabilities, rather than just a single prediction.
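The difference between the two losses can be sketched as follows. This is an illustration under stated assumptions, not our network: the label names and the logits (the raw outputs of a final layer) are invented, and the 0.5 decision threshold is a common default rather than a value from our system.

```python
import math


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (exactly one label wins)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash one logit into an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))


def categorical_cross_entropy(probs, target_index):
    """Loss for single-label training: penalize low probability on the true class."""
    return -math.log(probs[target_index])


def binary_cross_entropy(probs, targets):
    """Loss for multi-label training: every label is its own yes/no decision."""
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(probs, targets)]
    return -sum(terms) / len(terms)


LABELS = ["politics", "government", "sports"]  # hypothetical taxonomy
logits = [2.0, 1.5, -3.0]                      # hypothetical final-layer outputs

single_label = softmax(logits)
multi_label = [sigmoid(x) for x in logits]
predicted = [label for label, p in zip(LABELS, multi_label) if p >= 0.5]

print("softmax:", [round(p, 3) for p in single_label])
print("sigmoid:", [round(p, 3) for p in multi_label])
print("labels above threshold:", predicted)
print("BCE against [1, 1, 0]:", round(binary_cross_entropy(multi_label, [1, 1, 0]), 3))
```

With independent sigmoid outputs, both “politics” and “government” can be assigned to the same page, which a softmax output forces us to choose between.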
Character-Based Solutions for Real-World Problems
At the time of this writing, word-based CNN approaches are the state of the art in text classification tasks. However, we now must address some of the hard truths of applying academic research to the real world.
Production code is an entirely different beast, where problems of scale and data cleanliness arise almost instantly. For example, academic research usually does not address:
Scale: Billions of web pages per month means hundreds or thousands of pages classified per second. After downloading the HTML and isolating the interesting bits, there just isn’t a lot of time for an algorithm to run, even distributed across hundreds of compute nodes.
Languages: Besides English, Spanish, and French, enormous portions of open web data are in languages like Chinese, Hindi, Indonesian, and Japanese. Most word-based approaches require that a dictionary be loaded into memory for each language supported (we currently support almost 30). Of course, all this requires that the language on the page be detected ahead of time.
Cleanliness: In addition to common “dirty” features like typos and slang, the content we classify may be in multiple languages, may come to us broken or incomplete due to formatting or artifacts of content extraction algorithms, and may consist entirely of abbreviations that make little sense at all in any language.
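The scale figure above is simple arithmetic, sketched here for a 30-day month. The two monthly volumes are illustrative round numbers, not our reported traffic.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month


def pages_per_second(pages_per_month):
    """Average sustained rate needed to keep up with a monthly page volume."""
    return pages_per_month / SECONDS_PER_MONTH


# Illustrative volumes only: one billion and five billion pages per month.
for volume in (1_000_000_000, 5_000_000_000):
    rate = pages_per_second(volume)
    print(f"{volume:,} pages/month needs about {rate:,.0f} pages/second")
```

One billion pages per month works out to roughly 386 pages per second, and five billion to roughly 1,929, before allowing for any peaks in traffic.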
Right now, a hybrid solution may look the best: something like a word-based CNN for the bulk of English data, falling back to simpler, less effective systems for the long tail. This sort of system appears frequently in production applications; machine learning applications tend to be intricate clockworks of machinery, and production tends to be a world best suited to sledgehammers.
Holding out hope for our “data science sledgehammer,” we turn to character-based algorithms, aiming to trade a small amount of performance for a great deal of resilience and maintainability. Character-based approaches have been making significant gains in NLP for the past few years, especially in named entity recognition and text generation tasks where they can be used with recurrent architectures and conditional random fields. Research from Yann LeCun’s labs at NYU and Facebook AI Research has yielded convolutional architectures for classification with these character-based approaches, and our approach is based on this research, published in 2015 and 2016 by Xiang Zhang et al. and Alexis Conneau et al., respectively.
Using a character-based network immediately frees us from enormous dictionaries of vectors that must be loaded into memory on every node in a shared compute infrastructure, as well as the task of language detection. Remember, language detection may seem comparatively easy, but there may be multiple languages on a website and there may not be enough text to distinguish certain groups of languages (English in particular can be difficult, as there’s such a breadth in its etymology). We’re also completely freed from error-prone preprocessing steps like stemming and tokenization, and even content extraction can be drastically simplified. While these algorithms may currently lag behind word-based approaches in performance, their resilience when dealing with messy data for a diverse range of purposes has made them into the sledgehammer we need. As an added benefit, we can train on languages like Hindi, Russian, Korean, and Indonesian by simply changing the alphabet used and obtaining a truth set, with no need for language-specific embeddings.
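To show how little changes between languages, here is a minimal sketch of character-level encoding. The alphabets, the fixed input length, and the function names are all hypothetical choices for illustration; the only language-specific piece is the alphabet string.

```python
PUNCTUATION = "0123456789 .,;:!?'\"-()"

# Hypothetical alphabets; swapping one for another is the only per-language change.
LATIN = "abcdefghijklmnopqrstuvwxyz" + PUNCTUATION
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя" + PUNCTUATION


def build_char_index(alphabet):
    """Map each character to a positive integer; 0 is kept for padding and unknowns."""
    return {ch: i + 1 for i, ch in enumerate(alphabet)}


def encode(text, char_index, length):
    """Turn text into a fixed-length list of integers, truncating or zero-padding."""
    ids = [char_index.get(ch, 0) for ch in text.lower()[:length]]
    return ids + [0] * (length - len(ids))


latin_index = build_char_index(LATIN)
cyrillic_index = build_char_index(CYRILLIC)

# No tokenizer, stemmer, or word dictionary is involved in either call.
print(encode("Hello, wrld!", latin_index, 16))   # the typo is encoded like any other text
print(encode("привет", cyrillic_index, 16))
```

Typos, slang, and broken fragments all pass through the same lookup, so there is no out-of-vocabulary case to handle beyond the occasional unknown character.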
The crux of any convolutional network architecture—and, to a certain extent, all classification machine learning—can be thought of as manipulating data in a uniform, predictable way to extract additional information. The network itself then uses this information in a process I liken to a child picking up puzzle pieces and examining them to see how they may later fit together. By using a much longer series of progressively wider convolution filters than what are typically used in word-based approaches, we apply a set of progressive, uniform, transformative numerical operations from which our loss function can derive a set of representations to classify data. In other words, the network is spending much more time flipping over much smaller puzzle pieces to learn as much as it can about how they may fit together.
The character-based convolutional network begins by embedding characters, rather than words, into a 16-dimensional space of random uniform vectors. Essentially, this step serves simply to transform letters into numbers. As we only have a small handful of characters in the Latin alphabet, we end up with a much smaller space to span than an entire dictionary of tens of thousands of words. Therefore, we can get away with embedding our input in a comparatively smaller space of 16 dimensions, compared to the word-based vectors that are usually embedded in hundreds of dimensions to better preserve the unique spatial relationships between words.
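The embedding step can be sketched as a lookup into a table of random uniform vectors. The 16 dimensions come from the description above; the reduced alphabet, the uniform range of -1 to 1, and the fixed seed are assumptions made for this example. In a real network the table would typically be a trainable parameter rather than a fixed list.

```python
import random

EMBEDDING_DIM = 16                        # dimensionality described in the text
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # hypothetical reduced alphabet


def make_embedding_table(alphabet, dim, seed=0):
    """One random uniform vector per character, plus row 0 for unknown characters."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # range is an assumption
        for _ in range(len(alphabet) + 1)
    ]


def embed(text, alphabet, table):
    """Replace every character with its vector, giving a (length x dim) matrix."""
    index = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [table[index.get(ch, 0)] for ch in text.lower()]


table = make_embedding_table(ALPHABET, EMBEDDING_DIM)
matrix = embed("deep learning", ALPHABET, table)

print(len(table), "table rows of", EMBEDDING_DIM, "values")
print(len(matrix), "characters x", len(matrix[0]), "dimensions")
```

The whole table here holds 28 rows of 16 values, compared with the tens of thousands of rows and hundreds of columns of a typical word embedding dictionary.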
The magic comes from the temporal convolutions, which continue in a series considerably longer than in these networks’ word-based counterparts. This is presumably where Conneau’s “Very Deep” descriptor comes from, as these networks have between 10 and 50 convolutional layers, in contrast to word-based approaches, which typically have just one (Kim, 2014). That is still far fewer than the hundreds of layers used in massive image networks, since images contain far more entropy from which to extract representations. This is to be expected, if one is to believe that a picture is worth a thousand words.
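A stack of temporal convolutions can be sketched in plain Python as below. This is a simplified illustration, not the published architecture or our implementation: the kernel size of 3, the channel widths, the random weights, and the input length are assumptions, and the pooling, normalization, and shortcut connections used in real very deep networks are left out.

```python
import random


def temporal_conv(sequence, kernel, bias):
    """1-D convolution along the character axis with zero padding, then ReLU.

    sequence: list of T vectors with c_in values each
    kernel:   kernel[k][i][o] for K taps, c_in input and c_out output channels
    Returns a list of T vectors with c_out values each (odd K assumed).
    """
    taps, c_in, c_out = len(kernel), len(kernel[0]), len(kernel[0][0])
    pad = taps // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + sequence + [zero] * pad
    output = []
    for t in range(len(sequence)):
        row = []
        for o in range(c_out):
            total = bias[o]
            for k in range(taps):
                for i in range(c_in):
                    total += padded[t + k][i] * kernel[k][i][o]
            row.append(max(0.0, total))  # ReLU keeps only positive activations
        output.append(row)
    return output


def random_kernel(rng, taps, c_in, c_out):
    """Untrained weights, standing in for what training would learn."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(c_out)]
             for _ in range(c_in)] for _ in range(taps)]


def receptive_field(num_layers, taps=3):
    """Input characters seen by one output position after stacked stride-1 layers."""
    return 1 + num_layers * (taps - 1)


rng = random.Random(0)
# Hypothetical input: 20 characters already embedded in 16 dimensions.
x = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(20)]

channels = [16, 32, 32, 64]  # hypothetical channel widths per layer
for c_in, c_out in zip(channels, channels[1:]):
    x = temporal_conv(x, random_kernel(rng, 3, c_in, c_out), [0.0] * c_out)

print(len(x), "positions x", len(x[0]), "channels")
print("receptive field after 3 layers:", receptive_field(3))
```

Each additional layer lets one output position see two more input characters in this simplified setting, which gives some intuition for why depth matters more when the input units are single characters rather than whole words.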
The most surprising result from this research is that performance apparently continues increasing with additional network depth. In practice, we’ve found that 9 layers are sufficient for research purposes and the 29 layer architecture can reliably produce a noticeable performance boost for the production implementation. Beyond this, training takes over 24 hours on dual Tesla V100 GPU machines and the cost of the compute time can’t be justified by a demonstrable increase in revenue from additional performance.
Other Things to Consider
Deep learning presents a much steeper barrier to entry than traditional machine learning methods, which can be used by simply calling a “fit” function on a pre-made classifier. Furthermore, much like the research itself, the freely available code for these approaches tends to lack basic production considerations—for example, the ability to serialize a model and use it for simple inference after training.
Deep learning frameworks are still relatively new, and much more bare-bones than traditional statistical packages. These frameworks themselves are not without issues; whether you use TensorFlow, PyTorch, Caffe, or any other deep learning framework, it’s important to understand that these libraries primarily serve the purpose of hardware optimization, not basic linear algebra. That means that what’s being abstracted away is enormously complex when compared to most open source libraries, and penetrates down to L2 caching and hardware instruction optimizations. In return for this abstraction layer, you are essentially tasked with learning an entire domain-specific language rather than a simple library purpose-built for convenience, and you will usually have to contend with statically precompiled computation graphs and a deferred execution model that can make debugging very difficult. Frequently, a failure will happen many, many steps removed from your code and outside the environment where it’s running (e.g. the Python interpreter), so breakpoints are useless and the error messages tend not to make sense.
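The debugging problem with deferred execution can be shown with a toy graph in plain Python. This is not the API of TensorFlow or any other framework; the `Node` class and the `constant`, `add`, and `run` functions are invented for the example. The point is only that the line containing the mistake runs without complaint, and the failure appears later, when the graph is executed.

```python
class Node:
    """A deferred operation: it records what to compute without computing it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value


def constant(values):
    """Graph node holding a fixed vector."""
    return Node("const", value=list(values))


def add(a, b):
    """Graph node for element-wise addition; nothing is checked or computed here."""
    return Node("add", inputs=(a, b))


def run(node):
    """Evaluate the graph recursively; shape errors only surface at this point."""
    if node.op == "const":
        return node.value
    left, right = (run(n) for n in node.inputs)
    if len(left) != len(right):
        raise ValueError(f"shape mismatch in add: {len(left)} vs {len(right)}")
    return [l + r for l, r in zip(left, right)]


good = add(constant([1, 2]), constant([3, 4]))
bad = add(constant([1, 2, 3]), constant([4, 5]))  # the bug is here, but no error yet

print(run(good))
try:
    run(bad)
except ValueError as err:
    # The traceback points into run(), not at the line that built the bad graph.
    print("failed at run time:", err)
```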
We’ve found that TensorFlow is the best choice given our requirements, namely widespread support, multi-GPU capabilities, and the ability to deploy to a Java ecosystem. But even TensorFlow is plagued by sparsely available and sometimes incorrect documentation, and we rely heavily on so-called “contrib” functions that are explicitly exempted from API stability guarantees and, consequently, frequently break between minor releases. Careful consideration should be taken prior to investing time and resources in these methods, to evaluate whether the increase in performance is necessary and what the realistic monetary benefit would be. Frequently, the best solutions are the simplest. But for text classification issues, when “simple” fails, we now have a sledgehammer that does the trick.