
Introduction to Embedding in Natural Language Processing

This post introduces word embeddings, discusses the challenges associated with them, and covers embeddings for other artifacts such as n-grams, sentences, paragraphs, documents, and knowledge graphs.

Introducing Word Embeddings

A word is a basic unit of language that conveys meaning of its own. With the help of words and language rules, an infinite set of concepts can be expressed. Machine learning approaches to NLP require words to be expressed in vector form. Word embeddings, proposed in 1986 [4], are a feature engineering technique in which each word is represented as a vector.

Embeddings are designed for specific tasks. Let's take a simple way to represent a word in vector space: each word is uniquely mapped onto a series of zeros and a one, with the location of the one corresponding to the index of the word in the vocabulary. This technique is referred to as one-hot encoding. As an example, take the sentence "Go Intelligent Bot Service Artificial Intelligence, Oracle": 

word          go  intelligent  bot  service  artificial  intelligence  oracle
Go             1       0        0      0         0            0           0
Intelligent    0       1        0      0         0            0           0
...          ...     ...      ...    ...       ...          ...         ...
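For clarity, here is a minimal Python sketch of the one-hot encoding above (the variable and function names are illustrative, not part of the original example):

import numpy as np

# Vocabulary from the example sentence, in order of first appearance.
vocab = ["go", "intelligent", "bot", "service", "artificial", "intelligence", "oracle"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the vocabulary."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("go"))           # [1 0 0 0 0 0 0]
print(one_hot("intelligent"))  # [0 1 0 0 0 0 0]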

Sentence vector representations can be built using these vector representations of the words.

If the downstream task requires word vectors that capture synonyms in a low-dimensional representation of the dictionary, the embedding technique can be designed to encode this information. The one-hot representation above, however, can be improved upon by capturing word co-occurrence statistics.

Word Embeddings Based on Co-occurrence Statistics of Words in a Corpus

As a first step towards generating a co-occurrence matrix, consider adjacent word pairs (word and next word). In our example above, a few valid pairs would be: "go" and "intelligent," "intelligent" and "bot," and "bot" and "service." Keeping this vocabulary constant, a co-occurrence matrix is created by incrementing the cell at (word's row index, next word's column index). Following through with this algorithm, Table 1 is generated. Adding further utterances such as "go bot," "go oracle," "go bot artificial intelligence," "intelligence service," and "bot go" gives rise to Table 2.

Table 1: Co-occurrence matrix for "go intelligent bot service artificial intelligence oracle"

               go  intelligent  bot  service  artificial  intelligence  oracle
go              0       1        0      0         0            0           0
intelligent     0       0        1      0         0            0           0
bot             0       0        0      1         0            0           0
service         0       0        0      0         1            0           0
artificial      0       0        0      0         0            1           0
intelligence    0       0        0      0         0            0           1
oracle          0       0        0      0         0            0           0

Table 2: Co-occurrence matrix after adding further sentences

               go  intelligent  bot  service  artificial  intelligence  oracle
go              0       1        2      0         0            0           1
intelligent     0       0        1      0         0            0           0
bot             1       0        0      1         1            0           0
service         0       0        0      0         1            0           0
artificial      0       0        0      0         0            2           0
intelligence    0       0        0      1         0            0           1
oracle          0       0        0      0         0            0           0

Using the co-occurrence statistics in Table 2, there is a 50% probability that the next word after "go" is "bot": (number of times "bot" occurs in the row for "go") / (sum total of the row for "go"). Similarly, there is a 100% probability that the next word after "artificial" is "intelligence." Though simple and straightforward, a co-occurrence matrix stores more information than one-hot encoding: it also captures aspects of the semantic meanings of words and of syntactic structure.
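As an illustrative sketch (not production code), the co-occurrence counts and next-word probabilities above can be computed directly from the toy utterances:

import numpy as np

vocab = ["go", "intelligent", "bot", "service", "artificial", "intelligence", "oracle"]
index = {w: i for i, w in enumerate(vocab)}

utterances = [
    "go intelligent bot service artificial intelligence oracle",
    "go bot",
    "go oracle",
    "go bot artificial intelligence",
    "intelligence service",
    "bot go",
]

# Build the co-occurrence matrix by counting adjacent (word, next word) pairs.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for utterance in utterances:
    tokens = utterance.split()
    for word, next_word in zip(tokens, tokens[1:]):
        cooc[index[word], index[next_word]] += 1

def next_word_probability(word, next_word):
    """P(next_word | word) estimated from the co-occurrence counts."""
    row = cooc[index[word]]
    return row[index[next_word]] / row.sum()

print(next_word_probability("go", "bot"))                   # 0.5
print(next_word_probability("artificial", "intelligence"))  # 1.0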

While it is possible to generate word vectors from a small corpus, word vectors based on a large corpus capture syntactic and semantic relationships between words more effectively. When represented as sparse vectors (where most values are zero), a co-occurrence matrix built for a large vocabulary becomes computationally unwieldy. This necessitates a dense representation in lower dimensions with fewer non-zero elements. With lower dimensions, the models inherently capture a distributed representation, where relationships are spread across the vector dimensions. Dimensionality reduction [16] and neural networks [18] are popular mechanisms for generating word vectors with reduced dimensions.
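A hedged sketch of one such reduction, applying truncated SVD from scikit-learn to the toy matrix of Table 2 (scikit-learn is an assumption here; any dimensionality-reduction routine would serve the same purpose):

import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy co-occurrence matrix from Table 2 (rows/columns:
# go, intelligent, bot, service, artificial, intelligence, oracle).
cooc = np.array([
    [0, 1, 2, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
])

# Project each word's 7-dimensional count vector into 2 dense dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
dense_vectors = svd.fit_transform(cooc)
print(dense_vectors.shape)  # (7, 2)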

Application of Word Embeddings

The following figures have been generated using Google's pre-trained word and phrase vectors [7]. The pre-trained model was built on the Google News dataset of about 100 billion words, with a vocabulary of 3 million words and phrases. The graphs depict the semantic similarity among the animal, food, and vehicle categories.
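Assuming the pre-trained Google News vectors [7] have been downloaded and the gensim library is available (both assumptions, not part of the original post), similarities like those plotted below can be explored with a few lines of Python:

from gensim.models import KeyedVectors

# The file name below is the commonly distributed one; adjust the path as needed.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(vectors.similarity("dog", "cat"))    # animals sit close to each other
print(vectors.similarity("dog", "pizza"))  # noticeably lower similarity
print(vectors.most_similar("cow", topn=5)) # nearest neighbours in the space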


Figure 1: Distinct notion of domestic animals captured.


Figure 2: Distinct notion of farm animals captured; rabbit seems closer to farm and domestic animals, but farther from wild animals.


Figure 3: Additional food and vehicle categories. While distinctions between wild, domestic, and farm animals are made within the animal category, the additional categories show that wild animals are closer to other animals than to vehicles or food.


Figure 4: Category headers for Animal (point marked as A), Food (point marked as F), and Vehicle (point marked as V) demonstrate the distinct notions.

In the following examples, a notion of plurality can be observed by plotting the singular form, the plural form, and the difference vector between the two.


Figure 5: Concept of plurality for the word "cat."


Figure 6: Concept of plurality for the word "dog."


Figure 7: Concept of plurality for the word "computer."


Figure 8: Concept of plurality for the word "dance."
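The plurality relationship illustrated in Figures 5 through 8 can also be checked numerically. The sketch below, again assuming gensim and the pre-trained vectors [7], compares difference vectors for a few singular/plural pairs:

import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def plural_direction(singular, plural):
    """Difference vector that (roughly) encodes plurality."""
    return vectors[plural] - vectors[singular]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# If plurality is encoded consistently, these directions should be similar.
print(cosine(plural_direction("cat", "cats"), plural_direction("dog", "dogs")))
print(cosine(plural_direction("cat", "cats"), plural_direction("computer", "computers")))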

 

Challenges in Word Vectors

Multi-sense Words and Semantic Change of Words

Words have multiple meanings: "lion" can refer to a) an animal, b) movies named Lion, c) a surname, or other things. Further, a word's meaning changes over time. These factors affect relationships among words in the corpus and thereby impact the word vectors built from it.

Out of Vocabulary (OOV) words

New words are frequently introduced into a language's vocabulary, even more so with social media. Combined with the practice of eliminating infrequent words to reduce noise during training, this means some words encountered later will not be represented as vectors. Techniques for handling OOV words need to be considered, along with downstream processes to identify, capture, and handle them effectively.

"Multi-word" Terms

A phrase can refer to a distinct entity separate from the words it is composed of. For example, a mountain lion is a different species than a lion, but the phrase might not be represented in the vocabulary. By incorporating "mountain lion" as a single phrase at training time, word embeddings can capture its specific context aptly. Even so, the average of the two word vectors ("mountain," "lion") is closer to animals than to vehicles, as the sketch below suggests.
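A rough way to approximate this averaging with gensim (an assumption; the post's figures were not necessarily produced this way) is to pass both words as positive terms to most_similar, which averages their vectors before ranking neighbors:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Averaging "mountain" and "lion" stands in for a missing "mountain lion" vector.
print(vectors.most_similar(positive=["mountain", "lion"], topn=10))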


Figure 9: Notion of animals, food, and vehicles along with their categories.


Figure 10: "Mountain lion" does not exist in the vector space, but "mountain" and "lion" do. Using the average vector generated from the vector of two words, "mountain lion" (point marked as M) is closer to the animal category than "vehicle" or "food." Upon closer inspection, "mountain lion" is closer to "dog" than "tiger." vectors-11

Figure 11: "Cougar" (another name for "mountain lion") brings out the nuance that a cougar is closer to a tiger than a dog. While using average vectors allows for broad categorical differences to be captured (i.e., between vehicle and animal), having "multi-word" vectors in the embedding space captures this better.


Figure 12: More examples of average vectors for "fighter plane," "jack rabbit," and "red tailed hawk."

Word vectors for Resource Constrained Environments

Embedding word vectors into web applications requires non-trivial resources, especially in compute-, disk-, and memory-constrained environments. Approaches that shrink the vocabulary by restricting the domain and/or reduce the vector dimensions help.

Rounding off vectors after training for the inference stage is one way to reduce the requirements, but it introduces quantization error. Low-precision training [3, 10] attempts to generate lower-precision vectors during training, so that the quantization error is accounted for and minimized. In product quantization (PQ) [9], each vector is expressed as a combination of multiple sub-vectors to reduce the memory footprint. Using quantization and other techniques [2], models at roughly 10% of the original size have been created with little impact on accuracy on standard benchmarking tasks.
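As a minimal sketch of the post-training rounding idea only (not the PQ or fastText.zip methods cited above), the vectors can be quantized to one byte per value and restored at inference time; the array sizes are illustrative:

import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10000, 300)).astype(np.float32)  # stand-in for trained vectors

# Post-training quantization: map float32 values to uint8 with a shared scale.
lo, hi = embeddings.min(), embeddings.max()
scale = (hi - lo) / 255.0
quantized = np.round((embeddings - lo) / scale).astype(np.uint8)  # 1 byte per value
restored = quantized.astype(np.float32) * scale + lo              # used at inference time

print(embeddings.nbytes / quantized.nbytes)   # ~4x smaller
print(np.abs(embeddings - restored).max())    # worst-case quantization error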

Multi-lingual and Cross-lingual Word Embeddings

Building monolingual word embeddings works similarly for all languages, given a suitable tokenization scheme (particularly for languages that do not separate words with spaces). Due to random initialization during training, the vector for a word and the vector for its counterpart in another language will not be the same. If a large corpus containing rich interactions in the desired languages is available, embedding them in the same vector space is straightforward. When such a corpus is not available, other popular techniques have been proposed.

Given word vectors generated in two languages, relationships between words are expected to be similar, barring variations owing to 1) training corpus sizes, 2) relative word frequencies, 3) general relationships between entities, or 4) errors in tokenization. For example, while a word in the first language may have a single semantic meaning, its counterpart in the second language could have multiple semantic meanings. This does happen; however, it can be mitigated by considering a large set of translated word pairs. The following figures show some examples.


Figure 13: Concepts of animal, vehicle, food, and knowledge in English.


Figure 14: Addition of music and dance categories shows that they are closer to animal and food in English.


Figure 15: Concepts of animal, vehicle, food, and knowledge in Spanish (labels are in English). Specific Spanish words used: "vehículo," "animal," "conocimiento," and "comida." The categories show a structure similar to English.


Figure 16: Addition of music and dance categories show that they are closer to animal and food (labels are in English). Specific Spanish words used: "vehículo," "animal," "conocimiento," "música," "danza," and "comida." 


Figure 17: Concepts of animal, vehicle, food, and knowledge in German (labels are in English). Specific German words used: "fahrzeug," "tier," "wissen," and "lebensmittel." Note the difference between German and other languages.


Figure 18: Addition of music and dance categories in German show that they are closer to knowledge and food (labels are in English). Specific German words used: "fahrzeug," "tier," "wissen," "musik," "tanzen," and "lebensmittel." Note the difference between German and other languages.

Since relationships are maintained, it is possible to translate the vector space of one language into another. Using a few thousand pairs of translated words between languages A and B, a translation matrix T can be identified such that, when multiplied with a vector from B, the result lies in the same space as A. Oracle Labs [11] describes an approach that augments the training data by replacing a word in language A with its translation in language B; after jointly training the word vectors, the vector spaces for these languages coincide.
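A small sketch of fitting such a translation matrix with ordinary least squares; random arrays stand in for the paired word vectors (in practice the rows would come from a bilingual dictionary of translated pairs):

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the vectors of a few thousand translated word pairs:
# rows of B_vecs are words in language B, rows of A_vecs their translations in A.
B_vecs = rng.normal(size=(5000, 300))
ground_truth = rng.normal(size=(300, 300))
A_vecs = B_vecs @ ground_truth + 0.01 * rng.normal(size=(5000, 300))

# Least-squares fit of a translation matrix T with B_vecs @ T ~ A_vecs.
T, *_ = np.linalg.lstsq(B_vecs, A_vecs, rcond=None)

# Any new vector from language B can now be mapped into A's space.
mapped = B_vecs[0] @ T
print(mapped.shape)  # (300,)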

Beyond Word Vectors 

In many cases, vector representations of entities more complex than words are required. Vectors for sentences, paragraphs, and documents are discussed below, with a focus on text classification tasks.

Sentence, Paragraph and Document Vectors

Consider the following corpus containing classes and example utterances: 

Class: Balances
Whats the current value of my savings account?
Whats the current balance on my cc?
Do I have enough money?
What do I owe?
How much money do I have in all of my accounts?
How much money do I have in checking?
tell me my balance
How much do I owe on all my credit cards?
Whats my balance?
Whats the current balance
How much money did I save last year?
How much do I have available?
Whats my available credit on my Visa?

Class: Transactions
Show my transactions for last weekend.
Show me all the checks I sent out last month
Show my last 5 checks
Show me my Checking account transactions from January to May
Show me my last 10 Saving account transactions
Show my largest transaction in May
How much did I pay Andy last week?
How much did I deposit in April?

The average word vector is a simple design for a sentence vector, as it captures the general semantic meaning of the sentence. The following figures plot the average word vectors of the sentences in the dataset above.
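A minimal sketch of this averaging, assuming gensim and the pre-trained vectors [7]; the helper function below is illustrative rather than the exact procedure used for the figures:

import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(sentence):
    """Average of the vectors of the in-vocabulary words of the sentence."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

balance = sentence_vector("tell me my balance")
transaction = sentence_vector("show my transactions for last weekend")
print(balance.shape)  # (300,)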


Figure 19: Visualizing the sentence vectors generated by averaging word vectors for the above dataset containing the two classes.


Figure 20: Visualizing the sentence vectors after removing one sentence ("what do I owe"). The two classes clearly form distinct groups.

Using the word vector average as a sentence vector is advantageous: it is computationally efficient and, combined with linear models, lends itself to fast classification. While this works well, there are edge cases where average vectors break down. For example, consider "please order no pizza" and "no, please order pizza." Though these mean two different things, averaging leads to the same sentence vector.

Paragraph2Vec, or Doc2Vec, is another mechanism to represent a document or paragraph [15]. This technique extends the CBOW and skip-gram models by training a unique vector for each paragraph/document alongside the word vectors; the paragraph vector is combined with the word vectors when predicting words from their context. This way of representing sentence vectors leads to lower error rates and better text classification.
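A hedged sketch of training such paragraph vectors with gensim's Doc2Vec implementation (gensim 4.x API assumed; the tiny corpus below is only for illustration):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

utterances = [
    ("tell me my balance", "Balances"),
    ("whats my available credit on my visa", "Balances"),
    ("show my transactions for last weekend", "Transactions"),
    ("how much did i deposit in april", "Transactions"),
]

# Each utterance becomes a TaggedDocument with a unique tag (its index here).
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, (text, _label) in enumerate(utterances)]

model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Vector trained for document 0, and an inferred vector for unseen text.
print(model.dv[0].shape)                                     # (50,)
print(model.infer_vector("whats my balance".split()).shape)  # (50,)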

Character n-gram vectors

As mentioned in the challenges section, word vectors, while state-of-the-art, do suffer from out-of-vocabulary words. Character n-gram vectors [14] attempt to solve this by combining subword information with the skip-gram model. For example, while "chocolateess" (a misspelling of "chocolate") might not be available in the word embeddings, a combination of 3-gram vectors such as "cho" + "col" + "ate" + "ess" captures some of the context of "chocolate." Such models have been demonstrated to handle out-of-vocabulary words better.
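A sketch of this behavior using gensim's FastText implementation on a toy corpus (parameters and corpus are illustrative); a vector can still be composed for the out-of-vocabulary "chocolateess" from its character n-grams:

from gensim.models import FastText

sentences = [
    ["i", "love", "chocolate", "cake"],
    ["order", "a", "chocolate", "pizza"],
    ["chocolate", "is", "my", "favorite", "dessert"],
]

# Subword (character n-gram) information is learned alongside the word vectors.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=20)

# "chocolateess" never appears in the corpus, but its character n-grams
# overlap heavily with "chocolate", so a vector can still be composed.
oov_vector = model.wv["chocolateess"]
print(model.wv.similarity("chocolate", "chocolateess"))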

Convolution based Sentence vector

In computer vision, a convolution layer is a set of filters, each containing learned weights. The filters enhance certain features in the input, while a pooling layer reduces the dimensions. These techniques have been applied to NLP as well. Given the sentence "order the pizza" and a simple filter [0.6, 0.1, 0.4] that places higher weight on the first and last positions, importance is placed on the words "order" and "pizza." By defining multiple filters of varying sizes and weights, followed by pooling layers, the top features of the input utterance emerge. As of 2014 [22], these architectures were state-of-the-art on several NLP tasks.
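A minimal numpy sketch of the convolution-plus-pooling mechanics described above (random vectors and weights stand in for trained ones; this is not the exact architecture of [22]):

import numpy as np

rng = np.random.default_rng(0)

# A sentence as a sequence of word vectors (random stand-ins: 5 words x 50 dims).
sentence = rng.normal(size=(5, 50))

# One convolution filter spanning a window of 3 consecutive words.
filter_weights = rng.normal(size=(3, 50))

# Slide the filter over the sentence: one activation per window position.
activations = np.array([
    np.sum(sentence[i:i + 3] * filter_weights)
    for i in range(sentence.shape[0] - 3 + 1)
])

# Max pooling keeps the strongest response of this filter over the sentence.
feature = activations.max()
print(activations.shape, feature)  # (3,) and a single scalar feature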

Knowledge Graph Embeddings

A knowledge graph is a representation of a knowledge base as a graph containing entities and the relationships between them. For example, DBpedia [17] is a large-scale open-source knowledge graph containing millions of entities [5]. A quick look at the DBpedia entry for the Oracle Corporation entity [6] shows a property called "foundedBy," which links to the "person" entities that founded Oracle Corporation. Knowledge graph embeddings [1, 23] embed entities and their relationships into low-dimensional spaces. Consider two entities (entity_1, entity_2) connected by an edge. Intuitively, entity_1's vector plus the edge's vector should be approximately equal to entity_2's vector. Knowledge graph embeddings can be augmented with word embeddings to improve text classification in large open-domain problems.
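A toy sketch of the translation-based scoring used by models such as TransE [1]; the embeddings here are random stand-ins, so only the form of the score function is meaningful (with trained embeddings, true triples would score lower than false ones):

import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy embeddings for entities and one relation (random stand-ins for trained ones).
entity_vectors = {name: rng.normal(size=dim) for name in
                  ["Oracle_Corporation", "Larry_Ellison", "Some_Person"]}
relation_vectors = {"foundedBy": rng.normal(size=dim)}

def transe_score(head, relation, tail):
    """Lower is better: ||head + relation - tail|| in translation-based models."""
    return np.linalg.norm(entity_vectors[head] + relation_vectors[relation]
                          - entity_vectors[tail])

print(transe_score("Oracle_Corporation", "foundedBy", "Larry_Ellison"))
print(transe_score("Oracle_Corporation", "foundedBy", "Some_Person"))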

Conclusion

This blog post provides an introduction to techniques that embed high-dimensional artifacts such as words, sentences, documents, and knowledge graphs into low-dimensional vectors. Links to the relevant research papers are listed below.

Acknowledgements

We would like to thank Sri Gadde and Edison Zhao for their invaluable suggestions.

References

[1] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), Vol. 2. Curran Associates Inc., USA, 2787-2795.

[2] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

[3] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. 2018. High-Accuracy Low-Precision Training. arXiv preprint arXiv:1803.03383 (2018)

[4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[5] DBpedia Facts and Figures, https://wiki.dbpedia.org/about/facts-figures

[6] DBpedia, http://dbpedia.org/page/Oracle_Corporation

[7] Google News Word Vectors https://code.google.com/archive/p/word2vec/

[8] Harold Hotelling. 1936. Relations Between Two Sets of Variates. Biometrika. 28 (3–4): 321–377. doi:10.1093/biomet/28.3-4.321. JSTOR 2333955.

[9] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell.33, 1 (January 2011), 117-128. DOI: https://doi.org/10.1109/TPAMI.2010.57

[10] Matthieu Courbariaux,  Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications.  arXiv preprint arXiv:1412.7024, 2014

[11] Michael Wick, Pallika Kanani, and Adam Pocock. Minimally-constrained multilingual embeddings via artificial code-switching. In Thirtieth AAAI conference on Artificial Intelligence (AAAI).

[12] Omer Levy, and Yoav Goldberg. 2014. Neural Word Embedding as Implicit Matrix Factorization (PDF). NIPS.

[13] Paramveer Dhillon, Jordan Rodu, Dean P. Foster, and Lyle H. Ungar. Two Step CCA: A new spectral method for estimating vector models of words. In Proceedings of ICML.

[14] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5:135–146

[15] Quoc Le, and Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.

[16] Rémi Lebret, and Ronan Collobert. Word Embeddings through Hellinger PCA. Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014. arXiv:1312.5542.

[17] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: a nucleus for a web of open data. Lecture Notes in Computer Science. 6. 722-735. 10.1007/978-3-540-76298-0_52.

[18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546 [cs.CL].

[19] Van der Maaten, L. J. P. & Hinton, G. E.Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)

[20] Wolfgang Härdle, and Léopold Simar. 2007. Canonical Correlation Analysis. Applied Multivariate Statistical Analysis. pp. 321–330. doi:10.1007/978-3-540-72244-1_14. ISBN 978-3-540-72243-4.

[21] Yitan Li, and Linli Xu. 2015. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective (PDF). Int'l J. Conf. on Artificial Intelligence (IJCAI).

[22] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882

[23] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI'14). AAAI Press 1112-1119.
