X

Oracle Artificial Intelligence Blog

Named Entity Disambiguation with Knowledge Graphs

Introduction to Named Entity Disambiguation

You might have noticed over the years how search engines and social media recommender systems have gotten smarter and more accurate. Search engines are now capable of parsing ambiguous Natural Language queries, such as “Where do the Warriors play?” (i.e. Oracle Arena, San Francisco), and social media platforms are able to recommend posts which are related to comments you have written, or pages you have visited.

The accuracy of these predictions looks fairly magical at times, but there's a lot under the hood to make it happen.

One of the most interesting pieces of this puzzle is called Named Entity Disambiguation (NED), or Entity Linking (EL).

Named Entity Disambiguation is the task of mapping words of interest, such as names of persons, locations and companies, from an input text document to corresponding unique entities in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The target KB depends on the application, but for generic Named Entity Disambiguation systems a common choice is Wikipedia. Usually Named Entity Disambiguation does not employ Wikipedia directly, but they exploit databases which contain structured versions of it, such as DBpedia or Wikidata.

For example, if we have a sentence like "The Indiana Pacers and Miami Heat meet at Miami’s American Airlines Arena”, we can link each Named Entity to its Wikipedia page (e.g. en.wikipedia.org/wiki/Indiana_Pacers).

https://lh5.googleusercontent.com/fzAtIal2qljvCLpwL2G_6Fp6M6ne05MvJjKrfe-Nlc1WOz_rvJ1dFubXD0atWYH7SJzDVFoqg-sWj-jtbgPXJJEIKXQaFzwVIoaF7uoqJb_iXTefYvygh9sAL0eiWzEIXsTAaXYK

In this article, explained over two posts, we will present a new Named Entity Disambiguation algorithm that we have developed, which is able to perform real-time predictions with state-of-the-art accuracy by leveraging graph embeddings. In this first post, we’ll present a brief overview of Named Entity Disambiguation, and how we can construct a powerful Knowledge Graph from Wikipedia. In the next post, we’ll show how to create embeddings from this graph, and how to use them to perform Named Entity Disambiguation.

Extracting Knowledge From Text

Named Entity Disambiguation is critical in many fields of application, such as text analysis, recommender systems, semantic search and chatbots. All of these fields benefit from high-level representations of the text, in which concepts relevant to the application are separated from text and other non-meaningful data. NED can also be used to quickly extract information from social media posts or news articles, to understand which topics are trending. To get a clearer understanding of the benefits given by Named Entity Disambiguation, consider again the initial example query, “Where do the Warriors play?”. One common task that search engines perform is to find documents that are similar to the one given as input, or to find additional information about the entities that are mentioned in it.

Without Named Entity Disambiguation, the search engine only looks for documents and sources of information that mention “Warriors”. This implies that the search engine wouldn't retrieve documents which, for example, mention “Golden State”, or “The basketball team of San Francisco” (i.e. the Golden State Warriors), leading to so-called False Negative (FN). Even worse, the search engine would produce spurious matches (or False Positive (FP)), such as retrieving documents about the 1979 movie “The Warriors”.

Obviously, NED alone is not enough to build a powerful search engine or recommender system, but it still plays a very important role in any modern Natural Language Processing pipeline.

Challenges in Named Entity Disambiguation

Performing NED on real text data is often not trivial, and a robust NED algorithm must deal with a number of different challenges. Let’s consider another complex example, taken from a tweet: “Floyd revolutionized rock with the Wall”. Here, we have 3 mentions (FloydRock, and Wall), but none of them has a simple unique link to a Wikipedia page. “Rock” could be referring to physical rocks, to Rock music, or even to the wrestler Dwayne Johnson, known as “The Rock”. Similarly, “Wall” could be referring to different things, from actual walls (which could be made of rocks) to Pink Floyd’s album The Wall. To correctly perform NED, we need to consider all 3 mentions at the same time, and thanks to their context infer that the sentence is about Pink Floyd’s The Wall, and not about generic rocks and walls.

Named Entity Disambiguation With Knowledge Graphs Example

This tweet gives a concrete example of many common challenges of NED:

  • Name Variations: the same entity might appear with different forms. Name variations might be caused by abbreviations (New York, NY), aliases (New York, Big Apple), or spelling variations and errors (New yokr).
  • Ambiguity: the same mention often refers to many different entities, depending on the context, as many entity names tend to be polysemous (i.e. have multiple meanings). In the previous example, all 3 mentions show a certain degree of ambiguity (e.g. The Wall)
  • Incomplete Information: a robust NED system should cope with limited amount of information and context, for example when processing short documents that contain only a few mentions.
  • Scalability and Speed: an industrial NED system must provide results in a reasonable time, and often in real-time. This requirement is critical for search engines, chat-bots and for NED systems offered by data-analytics platforms. Ensuring low execution time can be challenging when using large KBs or when processing large documents. For example, the English Wikipedia contains nearly 9 million entities and more than 170 million relationships among them.
  • Training: in this field, it is often difficult to train algorithm in a supervised fashion. Available datasets contain at most a few hundred sentences each, and building new data is expensive and time-consuming. Hence, unsupervised NED systems can provide greater flexibility and often better accuracy.

Leveraging Knowledge Graphs

The DBpedia Knowledge Base

As seen in the previous section, context plays a big role in performing correct NED. Thankfully, Wikipedia pages are linked to each other, and pages belonging to the same “context” are often linked to the same pages. We can very easily imagine Wikipedia as a large graph, where each page is a vertex, and each link between pages is an edge between vertices. Edges/links can express different concepts, such as “birth date”, or “grandfather of”. There are also “special” links, called “redirect” and “disambiguation” links. Redirect links are used to link entities which might appear with different names, such as New York and NYC. In Wikipedia, opening the page of NYC automatically redirects to New York, and we can take advantage of this to create a more robust NED algorithm. Similarly, disambiguation links are used to link pages with very generic or ambiguous names: looking for “Rock” brings you to a page with different options, such as geological rocksRock music, and Dwayne Johnson. Once again, using these links will benefit the accuracy of our algorithm.

https://lh3.googleusercontent.com/ZwygGnzXP8jvk9NLXL0YagnaGqzfQHG23QWZddOvxayn6d_Qlj43G8zYhMCLk2hxtO4JIm1msh8Q4B3sz2uugsJ29jH799SQb4eALvzwqmxbB-X7a6M8pfwE3fOGHVSfTK33VuQI

Creating our Knowledge Graph

Instead of using Wikipedia directly, our NED uses a Knowledge Graph (KG) created from DBpedia. DBpedia represents entities with a schema, and divides them into different types. For example, Paris is an entity of type PopulatedPlace, and has properties such as area, country and mayor.

From a Knowledge Base represented with Resource Description Framework (RDF) such as DBpedia, it comes natural to create a Knowledge Graph, so that relations between entities can be leveraged for NED. Note that there is a 1-to-1 mapping between entities in DBpedia and Wikipedia, so it’s easy to move back and forth between the two.

Even though DBpedia links are directed, we consider our graph as undirected, as the information contained in DBpedia links is inherently bidirectional (with the exception of redirection and disambiguation links). We employ most of the files that compose the English DBpedia, and create a graph with around 12 million vertices and 170 million (bidirectional) edges.

DBpedia links that perform redirections (e.g. NY) or are outlinks of disambiguation pages (e.g. Paris) define a Directed Acyclic Graph (DAG) that is used to enrich the candidate set of each mention, by adding redirection and disambiguation targets to the set.

Learn more

About / Contacts

In the next blog post, we’ll see how to leverage Graph Embeddings created from our Knowledge Graph to perform real-time Named Entity Disambiguation with state-of-the-art accuracy. 

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.