In the previous post, we saw what Named Entity Disambiguation (NED) is, and how we can build a Knowledge Graph from Wikipedia (or DBpedia, in our case) to link textual Named Entities to Wikipedia entities. In this follow-up post, we'll see how our approach works in practice, and what results it is able to achieve.
The core of our algorithm is the use of vertex embeddings created from our Knowledge Graph. Representing vertices with embeddings lets us encode, in a low-dimensional space, the complex topological relationships that exist among entities in the graph. Moreover, heterogeneous properties, such as textual labels, statistical measures, and other handcrafted features, can be encoded in a unified way. While text-based embeddings have shown good results [1], vertex embeddings offer more flexibility with respect to the field of application, as no large text corpus is required to create them. Being content-agnostic, vertex embeddings are more practical for large-scale deployments and can be used on any kind of domain-specific KG, even when no suitable text corpus is available.
To the best of our knowledge, we are the first to use vertex embeddings in an unsupervised Named Entity Disambiguation algorithm, and thanks to our approach, we can compute real-time predictions with state-of-the-art accuracy.
Our algorithm is composed of 5 building blocks, grouped into 2 main phases: Graph Learning and NLP Execution.
Graph Learning:
1. Knowledge Graph Creation
2. Vertex Embedding Creation
NLP Execution:
3. Named Entity Recognition
4. Candidate Finder
5. Collective Disambiguation
We have already seen the Knowledge Graph creation step in the first post of this article: starting from DBpedia raw data, we build a large Knowledge Graph with nearly 9 million entities and more than 170 million relationships among them.
After creating our Knowledge Graph, we compute the embeddings of each vertex in it. A vertex embedding is an n-dimensional vector associated with a vertex of a graph, created to encode the information of that vertex in a compact way. This information could relate to the topology of the graph (e.g. information about the connectivity of a vertex or about its neighborhood) or encode additional features of the vertex (e.g. the category of a Wikipedia page). The key idea behind embedding algorithms is that vertices that are similar or related in some way should also have similar embeddings. Vertices that are more strongly related have more similar embeddings, which in turn leads to better disambiguation.
In our algorithm, we use a custom implementation of DeepWalk [2] applied to our KG. DeepWalk generates random walks over the graph; as these walks are vertex sequences, we can treat them like sentences and learn embeddings with word2vec [3], using negative sampling to approximate the probability of encountering a given vertex. We use an embedding size of 160, a random walk length of 8, 12 random walks for each vertex, and 6 epochs of training.
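The random-walk corpus at the heart of DeepWalk can be sketched in a few lines. This is a minimal illustration, not our actual implementation: the toy adjacency list and the `random_walks` helper are hypothetical, and in practice the resulting walks would be fed to a word2vec trainer (e.g. gensim's `Word2Vec` with negative sampling) using the hyperparameters listed above.

```python
import random

def random_walks(graph, walk_len=8, walks_per_vertex=12, seed=42):
    """Generate DeepWalk-style random walks; each walk is a vertex
    sequence that word2vec can later treat as a 'sentence'."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_vertex):
        vertices = list(graph)
        rng.shuffle(vertices)  # one shuffled pass over all vertices
        for start in vertices:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break  # dead end: stop the walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy KG as an adjacency list (hypothetical subset of our graph)
kg = {
    "Pink_Floyd": ["The_Wall", "Rock_music"],
    "The_Wall": ["Pink_Floyd"],
    "Rock_music": ["Pink_Floyd"],
}
walks = random_walks(kg, walk_len=8, walks_per_vertex=12)
# `walks` now holds 12 walks per vertex, each a list of vertex names
```

With the settings from the text (walk length 8, 12 walks per vertex), a graph with |V| vertices yields 12·|V| "sentences" of up to 8 tokens each.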
An NER system extracts mentions to be linked, e.g. names of persons, locations, or companies. Plenty of accurate NER systems exist, and we can simply use an existing tool for this step, such as the Stanford NER [4]. As is commonly done in the literature, we assume that mentions have already been extracted, to avoid propagating errors made by the NER algorithm into the evaluation of the Named Entity Disambiguation step.
Using again our running example "Floyd revolutionized rock with the Wall," a good NER system will find the following 3 Named Entities: "Floyd," "rock," and "Wall."
These Named Entities are highly ambiguous, and we need to perform Named Entity Disambiguation before they can be used in further analyses.
The goal of the candidate finder is to reduce the dimensionality of the problem by obtaining, for each mention found by NER, a small set of entities. These entities are represented by vertices of the graph, and the correct entity should be present among them. Each candidate vertex is assigned a score in [0, 1] that represents the similarity of its identifier (i.e. the DBpedia entity name) to the surface form of the textual mention. To avoid a linear scan of all the vertices in the KG, which has complexity O(|V|) and is unsuitable for real-time usage, we use an index-based string search system based on a straightforward combination of 2-gram and 3-gram similarity, which is extremely fast to compute thanks to our indexing approach.
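The scoring function itself can be sketched as follows. This is a simplified illustration of the idea, assuming a Jaccard overlap of 2-grams and 3-grams averaged into a single score; our actual system precomputes an n-gram index over all vertex identifiers so that only candidates sharing at least one n-gram with the mention are ever scored.

```python
def ngrams(s, n):
    """Set of character n-grams of s, case-folded."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(mention, candidate):
    """Score in [0, 1]: Jaccard overlap of 2-grams and 3-grams,
    averaged over the two n-gram sizes."""
    score = 0.0
    for n in (2, 3):
        a, b = ngrams(mention, n), ngrams(candidate, n)
        if a or b:
            score += len(a & b) / len(a | b)
    return score / 2

# The mention "Floyd" shares most of its n-grams with "Pink_Floyd"
score = ngram_similarity("Floyd", "Pink_Floyd")
```

An identical mention and identifier score 1.0, while unrelated strings score close to 0, which matches the [0, 1] range described above.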
Often, entities appear under many different surface forms: for example, "New York" could also be mentioned as "NY" or "Big Apple." By leveraging redirection links, we can match the surface form "NY" to a vertex named "NY," and then follow the redirection link from vertex "NY" to vertex "New_York."
Another ambiguous scenario involves very generic surface forms that map to many different vertices: "Paris" could be the French city, but also the name of many different movies. Here we employ disambiguation pages and links: if a surface form matches a disambiguation page, we add all the destinations of that page to the list of candidates. As redirection and disambiguation links might follow each other, we propagate scores using a Breadth-First Visit, starting from the initial set of candidates found with string similarity.
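The propagation step can be sketched as a plain breadth-first visit over the redirection/disambiguation links. This is a minimal sketch, not our production code: the `decay` factor, which mildly penalizes candidates reached only indirectly, is a hypothetical parameter added for illustration.

```python
from collections import deque

def propagate_scores(initial, links, decay=0.9):
    """Propagate string-similarity scores from the initial candidates
    across redirection/disambiguation links, breadth-first.
    `links` maps a vertex to the vertices it redirects or
    disambiguates to; `decay` is a hypothetical damping factor."""
    scores = dict(initial)
    queue = deque(initial)
    while queue:
        v = queue.popleft()
        for w in links.get(v, []):
            new_score = scores[v] * decay
            if new_score > scores.get(w, 0.0):
                scores[w] = new_score
                queue.append(w)
    return scores

# "NY" is matched by string similarity; the redirection link
# carries its score over to "New_York"
scores = propagate_scores({"NY": 1.0}, {"NY": ["New_York"]})
```

Because the visit only enqueues a vertex when its score improves, chains of redirects terminate even in the presence of cycles.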
In this final step, we pick the best candidate vertex for each mention and provide the final linked entities for the given document. The idea behind this step is to look for candidates referring to related concepts and belonging to the same context. In a good solution, candidates are related to each other, like “Pink Floyd,” “Rock Music,” and “The Wall.”
If two vertices are similar, their embeddings should also be similar: we can therefore use vertex embeddings to find the best candidate for each mention.
Given a tuple of candidates (i.e. one candidate picked for each mention), we can score how good this tuple is by measuring the sum of the cosine similarities of the candidates' embeddings to the mean vector of the tuple's embeddings. To avoid degenerate solutions (i.e. tuples of candidates which are very strongly related to each other, but have no similarity to our input mentions), we combine the score obtained from embedding similarity (called Global Score) with the original string similarity measured by the candidate finder (the Local Score).
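The scoring just described can be sketched as follows. This is an illustrative sketch: the 2-dimensional embeddings are made up for readability (ours have 160 dimensions), the global score is averaged rather than summed so it stays in [0, 1], and the weight `alpha` balancing Global and Local Score is a hypothetical parameter.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def tuple_score(embeddings, local_scores, alpha=0.5):
    """Global Score: mean cosine similarity of each candidate's
    embedding to the tuple's mean vector. Combined with the
    candidate finder's string-similarity (Local) scores."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings)
            for i in range(dim)]
    global_score = sum(cosine(e, mean) for e in embeddings) / len(embeddings)
    local_score = sum(local_scores) / len(local_scores)
    return alpha * global_score + (1 - alpha) * local_score

# Hypothetical 2-d embeddings for the candidates of our running example
emb = {"Pink_Floyd": [0.9, 0.1], "Rock_music": [0.8, 0.3], "The_Wall": [0.7, 0.2]}
score = tuple_score(list(emb.values()), [0.6, 0.9, 0.5])
```

A tuple of mutually related candidates pulls the mean vector close to every embedding, so the Global Score rewards exactly the coherent solutions described above.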
Evaluating all combinations of candidates is not feasible for documents with more than a handful of mentions, due to the exponential number of possibilities (e.g. 10 mentions with 10 candidates each would result in 10^{10} different tuples). To overcome this issue, we devised a heuristic optimization algorithm based on state-space search that converges to good solutions (without guaranteeing the global optimum) in a very small number of iterations.
The basic idea of the optimization algorithm is to iteratively improve an existing candidate tuple. Starting from an initial tuple (created at random, or with a smarter heuristic), at each step the algorithm creates a new tuple with a better score. The algorithm terminates after a given number of steps, or when an early-stopping criterion is met (e.g. the score hasn't improved in the last N iterations).
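A greedy variant of this loop can be sketched as follows. This is a simplified stand-in for our actual search, shown only to make the iterate-and-improve structure concrete: the single-swap neighborhood, the `steps` budget, and the `patience` early-stop threshold are hypothetical choices.

```python
import random

def optimize(candidates, score_fn, steps=100, patience=10, seed=0):
    """Iteratively improve a candidate tuple: at each step, try
    swapping one mention's candidate and keep the swap if the
    score improves. Stops after `steps` iterations or `patience`
    iterations without improvement."""
    rng = random.Random(seed)
    current = [rng.choice(c) for c in candidates]  # random initial tuple
    best_score = score_fn(current)
    since_improvement = 0
    for _ in range(steps):
        i = rng.randrange(len(candidates))         # pick a mention
        alternative = rng.choice(candidates[i])    # try another candidate
        proposal = current[:i] + [alternative] + current[i + 1:]
        new_score = score_fn(proposal)
        if new_score > best_score:
            current, best_score = proposal, new_score
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:      # early stop
                break
    return current, best_score

# Toy usage: two mentions, two candidates each, with a made-up
# score function standing in for the Global + Local Score
cands = [["Pink_Floyd", "Floyd_Mayweather"], ["The_Wall", "Wall_Street"]]
best_tuple, best_score = optimize(cands, lambda t: sum(len(c) for c in t))
```

Because only improving swaps are accepted, the score is non-decreasing across iterations, which is what lets a small iteration budget still reach good solutions.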
To evaluate the performance of our algorithm, we tested it against multiple state-of-the-art Named Entity Disambiguation algorithms on a variety of datasets. We measured the Micro-averaged Bag-of-Concept F1 Score of each algorithm, a metric which combines Precision and Recall and provides a good estimate of how the algorithm will fare in real-life applications. Note that most of the other algorithms, including the best performing ones, are trained in a supervised fashion. Our algorithm, despite being unsupervised, is able to match or exceed the performance of most other approaches.
The focus of our NED algorithm is not just on the accuracy of its predictions, but also on providing good results in a very short, and possibly real-time, time-frame (here, we consider the algorithm to be real-time if the end-to-end analysis takes less than one second). Moreover, users should be able to tune the algorithm to their needs, depending on whether they require higher accuracy, higher throughput, or lower latency.
The execution time of our algorithm can be influenced through different settings, and a minimal drop in accuracy can translate to a significant reduction in document analysis latency. For example, in the low-latency configuration, we lower the number of candidates per mention and the number of optimization steps: the execution time improves by almost a factor of 6x compared to the default configuration, with a loss of only 2 absolute percentage points of F1 score.
In conclusion, we presented a novel unsupervised Named Entity Disambiguation algorithm that leverages vertex neural embeddings to provide high accuracy and performance, along with an unmatched degree of modularity and flexibility. Our approach can rival and exceed multiple supervised state-of-the-art algorithms, and provide real-time analyses without significant sacrifices in terms of accuracy.
[1] Stefan Zwicklbauer, Christin Seifert, and Michael Granitzer. "Robust and collective entity disambiguation through semantic embeddings." Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2016, pp. 425-434.
[2] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.
[3] Tomas Mikolov, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.
[4] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. "Incorporating non-local information into information extraction systems by Gibbs sampling." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf