Natural Language Processing (NLP) is a set of techniques for extracting valuable information and patterns from text or language data. The open source community has developed a wide variety of tools that can be used to implement NLP more effectively, but choosing the right ones for enterprise-level work can be a challenge. We’ve used the DataScience Trends tool to evaluate three of the most popular open source libraries for NLP: the Natural Language Toolkit (NLTK), Gensim, and spaCy.
Want to master spaCy for natural language processing? Check out this three-part video learning path from O'Reilly Media featuring our own data scientist Aaron Kramer.
How NLP Works
NLP can be roughly divided into three phases: preprocessing, representation, and application. Preprocessing involves getting the raw data into a format that’s cleaned and potentially annotated with named entities, parts of speech, and syntactic dependencies. In the representation phase, this data is transformed into a form that is more useful for the desired use case. Finally, it’s time for application, in which these representations are used for any of a variety of tasks such as classification, document retrieval, or question answering. Most open source packages are useful for only one or two of these three phases, so a typical project combines several packages, and the phases they cover often overlap.
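To make the three phases concrete, here is a minimal sketch that uses only the Python standard library. It is a stand-in for illustration, not how NLTK, Gensim, or spaCy implement these steps: the function names, the bag-of-words representation, the cosine-similarity retrieval, and the sample documents are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Preprocessing: lowercase the raw text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def represent(tokens):
    """Representation: turn a token list into a bag-of-words count vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(query, documents):
    """Application: return the document most similar to the query."""
    query_vector = represent(preprocess(query))
    return max(
        documents,
        key=lambda doc: cosine_similarity(query_vector, represent(preprocess(doc))),
    )


# Hypothetical sample documents, for illustration only.
documents = [
    "spaCy is designed for production use",
    "Gensim handles large text collections",
    "NLTK is often used as a teaching tool",
]
best_match = retrieve("large collections of text", documents)
print(best_match)
```

In a real project, each function would typically be replaced by a library call suited to that phase, which is why the choice of packages matters.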
Evaluating Open Source Libraries for NLP
We’ll be using DataScience Trends to measure the number of new daily stars (GitHub’s form of bookmarking a project) each package receives on GitHub, averaged over intervals of one week. Users choose to star tools that they’re interested in, but not necessarily contributing to, so this metric is a good proxy for popularity.
Released in 2001, NLTK is the longest running open source library that we’ll explore. It’s often used as a teaching tool and is favored in the academic community, but it is mainly useful for the preprocessing phase of NLP. The graph below shows that NLTK has typically received an average of 3 or 4 new stars per day since 2015. Every exploration you make with Trends has its own unique, shareable URL, so you can directly interact with the graph below and add your own filters.
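The star metric used in the graph, new stars per day averaged over one-week intervals, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how DataScience Trends computes it: the function name `weekly_average` and the star counts are made up for the example.

```python
from datetime import date, timedelta


def weekly_average(daily_counts):
    """Average daily counts over consecutive 7-day intervals.

    daily_counts is a list of (day, new_stars) pairs sorted by day,
    with one entry per day. Each result pairs the first day of an
    interval with the mean number of new stars per day in it.
    """
    averages = []
    for start in range(0, len(daily_counts), 7):
        week = daily_counts[start:start + 7]
        mean = sum(count for _, count in week) / len(week)
        averages.append((week[0][0], mean))
    return averages


# Made-up star counts for two weeks, for illustration only.
first_day = date(2016, 1, 4)
counts = [3, 4, 2, 5, 3, 4, 7, 4, 4, 3, 5, 4, 4, 4]
daily = [(first_day + timedelta(days=i), c) for i, c in enumerate(counts)]

weekly = weekly_average(daily)
for week_start, mean in weekly:
    print(week_start, mean)
```

Averaging over a week smooths out single-day spikes, such as the 7 in the first week above, so longer-term trends are easier to compare across packages.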
For the most part, NLTK’s star activity is comparable to that of the package Gensim, represented as “piskvorky/gensim” prior to June 15, 2016 and “rare-technologies/gensim” afterwards. Gensim excels at the representation and application aspects of NLP, and is specifically designed to handle large text collections. This makes Gensim a much more suitable choice for enterprise-level NLP, but the lack of a fast and scalable way to conduct preprocessing may explain why the package was slow to gain popularity.
In fact, the graph illustrates that the increase in Gensim’s popularity appears to coincide with the rise of spaCy, represented as “spacy-io/spacy” until October 1, 2016 and “explosion/spacy” afterwards. This is likely because spaCy integrates with Gensim and excels in preprocessing and representation. Since spaCy was designed for production users, it’s much faster than NLTK and has even been named the fastest syntactic parser in the world. In addition to Gensim, spaCy integrates with other popular Python packages including TensorFlow, Keras, and scikit-learn. This versatility helps explain why its popularity has eclipsed that of both NLTK and Gensim in a relatively short time. Not convinced? Go directly to the data visualization and explore for yourself.
For more examples of how DataScience Trends can help you identify noteworthy activity in GitHub’s most popular repositories, download our white paper on open source tools for enterprise data science. Or, check out the tool yourself and see how your favorite libraries measure up.