Here in the IRML group at Oracle Labs we do a lot of research on Natural Language Processing (NLP). Extracting information from unstructured text is one of the core problems with big data, and the NLP community is making definite progress. However, much of this progress is only occurring in English, because the Machine Learning approaches that are the workhorses of modern NLP require lots of training data and lexical resources, and these are frequently only available in English. Our collaborators in product groups often want to use the NLP tools we develop in multiple languages, which is a problem because we don't have labelled training data in those languages.
One thing we do have is lots of unlabelled text, in multiple languages, from sources like Wikipedia. Like the rest of the NLP community, we're trying to find ways to leverage this unlabelled data to improve our supervised tasks. One common approach is to train a word embedding model such as word2vec or GloVe on the unlabelled data, and use these word embeddings in the supervised task. This lets the supervised task use the semantic and syntactic knowledge encoded in the word vectors, which should improve its performance.
The reason we want to use an embedding approach in multiple languages is that once we've mapped every word into a vector, our predictor doesn't need to know anything about languages, only vectors. This means we can train it on word vectors from our English training data, and test it on word vectors from French test data, without the predictor ever having seen any French words at training time. Of course this only works if we managed to put French words "close" to their English equivalents in the vector space (i.e. we would put "bon" close to "good").
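To make the "only vectors" point concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the predictor from our paper: the 2-dimensional vectors in `shared_embeddings` are hand-made toy values, the nearest-centroid classifier is a stand-in for a real model, and all function names are hypothetical. The predictor is fitted on English sentences only, and is then applied to French sentences through the same shared vector space.

```python
def sentence_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token in the sentence has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def train_centroids(examples, embeddings):
    """Compute one mean sentence vector (centroid) per label."""
    sums, counts = {}, {}
    for tokens, label in examples:
        vec = sentence_vector(tokens, embeddings)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(tokens, centroids, embeddings):
    """Return the label whose centroid is closest to the sentence vector."""
    vec = sentence_vector(tokens, embeddings)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy shared space (assumed values): French words sit near their English
# translations, which is the property a multilingual embedding must provide.
shared_embeddings = {
    "good": [1.0, 0.1], "bon": [0.9, 0.2],
    "bad": [-1.0, 0.1], "mauvais": [-0.9, 0.2],
    "film": [0.0, 1.0],
}

# Training uses English sentences only.
english_train = [(["good", "film"], "positive"), (["bad", "film"], "negative")]
centroids = train_centroids(english_train, shared_embeddings)

# Prediction on French sentences, which were never seen during training.
print(predict(["bon", "film"], centroids, shared_embeddings))
print(predict(["mauvais", "film"], centroids, shared_embeddings))
```

If "bon" were placed far from "good" in the vector space, the same code would still run, but the French predictions would be no better than chance, which is why the alignment of the embedding matters so much.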
Many researchers are working on extending these embedding techniques to multiple languages. This research usually focuses on using "parallel corpora", where each word in a document is linked to the equivalent word in a different language version of the same document. Unfortunately, this defeats the point of using unlabelled data, as you now need a large labelled parallel corpus. In our AAAI 2016 paper we present a way of generating a multilingual embedding which only requires a few thousand single word translations (as supplied by a bilingual dictionary), in addition to unlabelled text for each language.
We call our main approach to producing these embeddings "Artificial Code Switching" (ACS), by analogy to the "code switching" done by multilingual people as they change languages within a single sentence. We found that by probabilistically replacing words with their translations in another language as we feed a sentence to an embedding algorithm such as CBOW, we can create the right kind of multilingual alignment. This approach moves the word "bon" closer to "good", but it also moves French words which are close to "bon", and whose translations are unknown, closer to "good". We also tested an approach which directly forces two translated words to be close, but the gradient from this constraint is hard to balance against the normal CBOW gradient. You can see the different kinds of update in the figure below:
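The word replacement step at the heart of ACS can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the paper: the function name `code_switch`, the `switch_prob` parameter and its values, and the toy dictionary are all assumptions made for the example. Each sentence would be passed through a function like this before being handed to the embedding algorithm.

```python
import random


def code_switch(tokens, dictionary, switch_prob=0.25, rng=None):
    """Return a copy of `tokens` in which each word that has a known
    translation is replaced by one of its translations with probability
    `switch_prob`. Words without a dictionary entry are left unchanged."""
    rng = rng or random.Random()
    switched = []
    for token in tokens:
        translations = dictionary.get(token)
        if translations and rng.random() < switch_prob:
            switched.append(rng.choice(translations))
        else:
            switched.append(token)
    return switched


# Hypothetical toy bilingual dictionary, listing both directions.
toy_dictionary = {
    "good": ["bon"], "bon": ["good"],
    "cat": ["chat"], "chat": ["cat"],
}

sentence = "the cat is good".split()
mixed = code_switch(sentence, toy_dictionary, switch_prob=0.5, rng=random.Random(0))
print(mixed)  # a mixed-language version of the sentence
```

Because the replacement is random, the same sentence produces different mixtures on different passes over the data, so over time "good" and "bon" are seen in each other's contexts and their vectors are pulled together.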
We found the ACS update to work surprisingly well for such a simple procedure. It scales across multiple languages, and our embeddings trained on 5 languages were just as good as the ones trained on 2 languages. It even allows the use of multilingual analogies, where half the analogy is in one language and half is in another. For example, the standard word2vec analogy vec(king) - vec(man) + vec(woman) ≈ vec(queen) can be expressed as vec(king) - vec(man) + vec(femme) ≈ vec(reine). To show this we translated the single word analogies supplied with word2vec into 4 different languages (French, Spanish, German and Danish), and generated mixed-language analogies going from English to each of the other languages. The multilingual embedding isn't quite as good as a monolingual one at the purely English analogies, but it improves on the multilingual analogies, and even improves performance on the purely French analogies (we think that's due to more training data and a small amount of vocabulary overlap between English and French). We also showed that ACS helps with our original task of generating French and Spanish prediction systems (in this case for Sentiment Analysis), even when we only train them on English training data. More details on the algorithm and our experiments are available in the paper.
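The mixed-language analogy described above can be tried out with a short Python sketch. The 2-dimensional vectors in `toy_embeddings` are hand-made values chosen for illustration (they are not taken from our trained embeddings), and the `analogy` function is a hypothetical helper which ranks candidate words by cosine similarity, a common way of answering word2vec style analogy questions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(embeddings, a, b, c, top_k=1):
    """Rank words by similarity to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    scored = [(word, cosine(target, vec))
              for word, vec in embeddings.items() if word not in {a, b, c}]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_k]]


# Toy shared space (assumed values): dimension 0 is roughly "royalty",
# dimension 1 is roughly "gender", and translations sit close together.
toy_embeddings = {
    "king": [1.0, 1.0], "roi": [1.0, 0.9],
    "queen": [1.0, -1.0], "reine": [1.0, -0.9],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0], "femme": [0.0, -1.0],
}

# Half the analogy is in English, half is in French.
print(analogy(toy_embeddings, "king", "man", "femme", top_k=2))
```

In a well aligned multilingual embedding both "queen" and "reine" should rank near the top for this query, since the two words occupy nearly the same point in the space.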
Michael Wick from the IRML group will be presenting this paper at AAAI 2016 on Sunday 14th February, so if you're at the conference, say hello. The paper is available here, and we're working on a place to host the translated analogies so other groups can test out their approaches. I'll edit this post with the link when we have it.
Paper: Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching, Michael Wick, Pallika Kanani, Adam Pocock, AAAI 2016. PDF
Adam (adam dot pocock at oracle dot com)