You might have noticed over the years how search engines and social media recommender systems have gotten smarter and more accurate. Search engines are now capable of parsing ambiguous natural language queries, such as “Where do the Warriors play?” (i.e. Oracle Arena, Oakland), and social media platforms are able to recommend posts related to comments you have written or pages you have visited.
The accuracy of these predictions looks fairly magical at times, but there's a lot under the hood to make it happen.
One of the most interesting pieces of this puzzle is called Named Entity Disambiguation (NED), or Entity Linking (EL).
Named Entity Disambiguation is the task of mapping words of interest, such as names of persons, locations and companies, from an input text document to corresponding unique entities in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The target KB depends on the application, but for generic Named Entity Disambiguation systems a common choice is Wikipedia. Usually, Named Entity Disambiguation systems do not employ Wikipedia directly; instead, they exploit databases that contain structured versions of it, such as DBpedia or Wikidata.
For example, if we have a sentence like "The Indiana Pacers and Miami Heat meet at Miami’s American Airlines Arena”, we can link each Named Entity to its Wikipedia page (e.g. en.wikipedia.org/wiki/Indiana_Pacers).
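To make the goal more concrete, here is a minimal sketch of what the input and output of such a system might look like. The function name and return format are purely illustrative (they are not part of the algorithm presented in this series), and the result is hard-coded for the example sentence:

```python
from typing import Dict

def link_entities(text: str) -> Dict[str, str]:
    """Map each mention detected in `text` to the URL of its Wikipedia entity.

    Illustrative stub: a real NED system would detect mentions and rank
    candidate entities; here we simply return the expected result for the
    example sentence above.
    """
    return {
        "Indiana Pacers": "https://en.wikipedia.org/wiki/Indiana_Pacers",
        "Miami Heat": "https://en.wikipedia.org/wiki/Miami_Heat",
        "American Airlines Arena": "https://en.wikipedia.org/wiki/American_Airlines_Arena",
    }
```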
In this article, explained over two posts, we will present a new Named Entity Disambiguation algorithm that we have developed, which is able to perform real-time predictions with state-of-the-art accuracy by leveraging graph embeddings. In this first post, we’ll present a brief overview of Named Entity Disambiguation, and how we can construct a powerful Knowledge Graph from Wikipedia. In the next post, we’ll show how to create embeddings from this graph, and how to use them to perform Named Entity Disambiguation.
Named Entity Disambiguation is critical in many fields of application, such as text analysis, recommender systems, semantic search and chatbots. All of these fields benefit from high-level representations of the text, in which concepts relevant to the application are separated from the raw text and other non-meaningful data. NED can also be used to quickly extract information from social media posts or news articles, to understand which topics are trending. To get a clearer understanding of the benefits of Named Entity Disambiguation, consider again the initial example query, “Where do the Warriors play?”. One common task that search engines perform is to find documents that are similar to the one given as input, or to find additional information about the entities mentioned in it.
Without Named Entity Disambiguation, the search engine only looks for documents and sources of information that mention “Warriors”. This means the search engine wouldn't retrieve documents which, for example, mention “Golden State” or “the basketball team of San Francisco” (i.e. the Golden State Warriors), leading to so-called False Negatives (FNs). Even worse, the search engine would produce spurious matches, or False Positives (FPs), such as retrieving documents about the 1979 movie “The Warriors”.
Obviously, NED alone is not enough to build a powerful search engine or recommender system, but it still plays a very important role in any modern Natural Language Processing pipeline.
Performing NED on real text data is often not trivial, and a robust NED algorithm must deal with a number of different challenges. Let’s consider another complex example, taken from a tweet: “Floyd revolutionized rock with the Wall”. Here, we have three mentions (Floyd, rock, and Wall), but none of them has a simple unique link to a Wikipedia page. “Rock” could be referring to physical rocks, to Rock music, or even to the wrestler Dwayne Johnson, known as “The Rock”. Similarly, “Wall” could be referring to different things, from actual walls (which could be made of rocks) to Pink Floyd’s album The Wall. To correctly perform NED, we need to consider all three mentions at the same time and, thanks to their shared context, infer that the sentence is about Pink Floyd’s The Wall, and not about generic rocks and walls.
This tweet gives a concrete example of many common challenges of NED: mentions that are partial or informal names (“Floyd” instead of “Pink Floyd”), mentions that are highly ambiguous on their own (“rock”, “Wall”), and the need to consider all mentions jointly, using their shared context, to resolve them correctly.
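To make the idea of joint disambiguation concrete, here is a toy sketch: candidate sets for the three mentions, a tiny hand-made link set, and a brute-force search for the most coherent combination. Everything here (the candidate lists, the links, the coherence score) is made up for illustration; the actual approach, based on graph embeddings, is described in the next post.

```python
from itertools import product

# Candidate sets for the tweet "Floyd revolutionized rock with the Wall".
# Keys are mentions (surface forms), values are candidate Wikipedia page titles.
# Both the candidate lists and the tiny link set below are hand-made for
# illustration; a real system derives them from Wikipedia/DBpedia.
candidates = {
    "Floyd": ["Pink_Floyd", "Floyd_Mayweather_Jr."],
    "rock":  ["Rock_music", "Rock_(geology)", "Dwayne_Johnson"],
    "Wall":  ["The_Wall", "Wall", "Berlin_Wall"],
}

# A few Wikipedia-style links between candidate pages.
links = {
    ("Pink_Floyd", "Rock_music"),
    ("Pink_Floyd", "The_Wall"),
    ("The_Wall", "Rock_music"),
}

def coherence(entities):
    """Toy coherence score: number of pairs of chosen entities that link to each other."""
    return sum(
        (a, b) in links or (b, a) in links
        for i, a in enumerate(entities)
        for b in entities[i + 1:]
    )

# Joint disambiguation: consider all mentions at once and keep the most coherent combination.
best = max(product(*candidates.values()), key=coherence)
print(dict(zip(candidates, best)))
# -> {'Floyd': 'Pink_Floyd', 'rock': 'Rock_music', 'Wall': 'The_Wall'}
```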
As seen in the previous section, context plays a big role in performing correct NED. Thankfully, Wikipedia pages are linked to each other, and pages belonging to the same “context” are often linked to the same pages. We can easily imagine Wikipedia as a large graph, where each page is a vertex and each link between pages is an edge between vertices. Edges/links can express different concepts, such as “birth date” or “grandfather of”. There are also “special” links, called “redirect” and “disambiguation” links. Redirect links are used to connect entities which might appear under different names, such as New York and NYC: in Wikipedia, opening the page of NYC automatically redirects to New York, and we can take advantage of this to create a more robust NED algorithm. Similarly, disambiguation links are used to link pages with very generic or ambiguous names: looking for “Rock” brings you to a page with different options, such as geological rocks, Rock music, and Dwayne Johnson. Once again, using these links will benefit the accuracy of our algorithm.
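As a toy illustration of this view of Wikipedia, the snippet below builds a tiny graph with networkx (our choice here purely for illustration), including a few regular links, a redirect and some disambiguation links; the page titles are a small hand-picked sample.

```python
import networkx as nx

G = nx.Graph()

# Regular page-to-page links.
G.add_edge("Pink_Floyd", "The_Wall")
G.add_edge("Pink_Floyd", "Rock_music")
G.add_edge("The_Wall", "Rock_music")

# Redirect link: "NYC" is an alias of "New_York_City".
G.add_edge("NYC", "New_York_City", kind="redirect")

# Disambiguation links: the generic page "Rock" points to several specific pages.
for target in ["Rock_music", "Rock_(geology)", "Dwayne_Johnson"]:
    G.add_edge("Rock_(disambiguation)", target, kind="disambiguation")

# Pages that share a "context" end up close to each other in the graph.
print(list(G.neighbors("Pink_Floyd")))  # ['The_Wall', 'Rock_music']
```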
Instead of using Wikipedia directly, our NED algorithm uses a Knowledge Graph (KG) created from DBpedia. DBpedia represents entities with a schema, and divides them into different types. For example, Paris is an entity of type PopulatedPlace, and has properties such as area, country and mayor.
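In DBpedia this structure is expressed as RDF triples. The snippet below shows a handful of simplified triples for Paris and groups them by subject; the prefixes are shortened (dbr:/dbo:) and the property values are only indicative, not copied from the real dumps.

```python
from collections import defaultdict

# A few DBpedia-style triples for the Paris example (simplified; real DBpedia
# dumps are N-Triples files with full URIs such as http://dbpedia.org/resource/Paris).
triples = [
    ("dbr:Paris", "rdf:type",      "dbo:PopulatedPlace"),
    ("dbr:Paris", "dbo:country",   "dbr:France"),
    ("dbr:Paris", "dbo:mayor",     "dbr:Anne_Hidalgo"),
    ("dbr:Paris", "dbo:areaTotal", '"105.4 km2"'),
]

# Grouping triples by subject gives the schema-like view described above.
entity = defaultdict(list)
for subject, predicate, obj in triples:
    entity[subject].append((predicate, obj))

print(entity["dbr:Paris"])
```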
From a Knowledge Base represented with the Resource Description Framework (RDF), such as DBpedia, it is natural to create a Knowledge Graph, so that relations between entities can be leveraged for NED. Note that there is a 1-to-1 mapping between entities in DBpedia and Wikipedia, so it’s easy to move back and forth between the two.
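Because every DBpedia resource is derived from a Wikipedia page with the same title, moving between the two is a simple URI rewrite. The helper names below are ours, shown only for illustration:

```python
def dbpedia_to_wikipedia(uri: str) -> str:
    """Turn a DBpedia resource URI into the corresponding English Wikipedia URL."""
    return uri.replace("http://dbpedia.org/resource/", "https://en.wikipedia.org/wiki/")

def wikipedia_to_dbpedia(url: str) -> str:
    """Turn an English Wikipedia URL into the corresponding DBpedia resource URI."""
    return url.replace("https://en.wikipedia.org/wiki/", "http://dbpedia.org/resource/")

print(dbpedia_to_wikipedia("http://dbpedia.org/resource/Indiana_Pacers"))
# -> https://en.wikipedia.org/wiki/Indiana_Pacers
```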
Even though DBpedia links are directed, we treat our graph as undirected, as the information contained in DBpedia links is inherently bidirectional (with the exception of redirection and disambiguation links). We employ most of the files that compose the English DBpedia, and create a graph with around 12 million vertices and 170 million (bidirectional) edges.
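A minimal sketch of how such an undirected edge list could be assembled from a DBpedia dump is shown below. The file name and the regular-expression parsing are simplifications (real dumps span several files and deserve a proper N-Triples parser):

```python
import re

edges = set()
# Match N-Triples lines whose object is another entity (literals are skipped).
pattern = re.compile(r"<([^>]+)>\s+<[^>]+>\s+<([^>]+)>\s*\.")

with open("mappingbased_objects_en.ttl", encoding="utf-8") as f:  # hypothetical single dump file
    for line in f:
        match = pattern.match(line)
        if match:
            subject, obj = match.groups()
            # Store each link once, regardless of direction.
            edges.add(frozenset((subject, obj)))

print(f"{len(edges)} undirected edges")
```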
DBpedia links that perform redirections (e.g. NY) or are outlinks of disambiguation pages (e.g. Paris) define a Directed Acyclic Graph (DAG) that is used to enrich the candidate set of each mention by adding redirection and disambiguation targets to the set.
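Here is a sketch of this enrichment step, with toy redirect and disambiguation tables standing in for the ones extracted from the DBpedia redirect and disambiguation files:

```python
# Toy lookup tables (illustrative; the real ones are built from DBpedia dumps).
redirects = {"NY": "New_York", "NYC": "New_York_City"}
disambiguations = {
    "Paris": ["Paris", "Paris,_Texas", "Paris_(mythology)"],
    "Rock": ["Rock_music", "Rock_(geology)", "Dwayne_Johnson"],
}

def enrich(candidates):
    """Add redirection and disambiguation targets to a mention's candidate set."""
    enriched = set(candidates)
    for entity in candidates:
        if entity in redirects:
            enriched.add(redirects[entity])
        enriched.update(disambiguations.get(entity, []))
    return enriched

print(enrich({"NY", "Paris"}))
# e.g. {'NY', 'New_York', 'Paris', 'Paris,_Texas', 'Paris_(mythology)'}
```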
In the next blog post, we’ll see how to leverage Graph Embeddings created from our Knowledge Graph to perform real-time Named Entity Disambiguation with state-of-the-art accuracy.