Papers, research and other insights from Oracle Labs' Machine Learning Research Group

Recent Posts

Machine Learning

MLRG is hiring: Machine Learning Researcher

The Mission of Oracle Labs is straightforward: Identify, explore, and transfer new technologies that have the potential to substantially improve Oracle's business. The Machine Learning Research Group (MLRG, formerly the Information Retrieval and Machine Learning group, IRML) in Oracle Labs is working to develop, scale and deploy Machine Learning throughout all of Oracle's Products and Services. Our team conducts research related to the development and application of techniques and technologies for Information Retrieval, Machine Learning and Statistical Natural Language Processing (NLP) in a variety of topic areas of interest to Oracle and its customers. Along with developing new techniques, researchers in the IRML team are afforded the opportunity to work with Oracle product groups to transfer them into commercial products. Oracle Labs encourages the publication of results in academic conferences and journals. At Oracle Labs, you won't get lost in the crowd and you'll have a chance to drive your own research agenda. The IRML team has Research Scientist positions available in Burlington, Massachusetts.

Research areas of interest to the group include:
- Statistical NLP including named entity recognition, entity linking, and relationship extraction in a variety of domains (general text, medical, e-commerce)
- Classification techniques, especially text classification. We deal with many problems where it is difficult or expensive to get training data and where there can be large class imbalances or large numbers of classes
- Feature selection and structure learning
- Deep learning, especially for image classification in commercial domains and for statistical NLP
- Model interpretation and visualization. How do we help researchers, data scientists, and product developers understand the decisions that a model is making?
- Privacy-preserving machine learning
- Scalable machine learning

Candidates must have a Ph.D. degree in Computer Science, Machine Learning, or related technical field. Education and/or 3-5 years experience in the following areas is required:
- Graduate-level research experience in one of the areas described above
- Using standard machine learning and/or NLP toolkits
- Proficiency in modern programming and scripting languages, such as Java, Scala, C++, or Python
- Proficiency in relevant statistical mathematics
- Ability to develop novel ML and/or NLP techniques and to apply them to real-world problems faced by Oracle's product groups and customers

In addition, the following areas of education or experience are preferred:
- Applying existing machine learning and natural language processing techniques and technologies to real-world problems
- Database technologies such as Oracle and MySQL
- Big Data platforms such as Hadoop, Spark, or MPI
- Familiarity with special-purpose computing architectures as applied to machine learning

Researchers in Oracle Labs are expected to work closely with Oracle product groups to transfer new approaches and technologies into Oracle's products. Researchers are expected to innovate and create novel technologies, patent new IP, and publish research results in academic conferences and journals. You can learn more about the Machine Learning Research Group at our project page. Applications for the position are available through the Oracle careers site; search for requisition number 180002UE.

MLRG is hiring: Research Software Engineer

The Mission of Oracle Labs is straightforward: Identify, explore, and transfer new technologies that have the potential to substantially improve Oracle's business. The Machine Learning Research Group (MLRG, formerly the Information Retrieval and Machine Learning group) in Oracle Labs is working to develop, scale and deploy Machine Learning throughout all of Oracle's Products and Services. As our group grows and tackles new problems, we have more and more need for custom research and data wrangling platforms. We are looking for engineers with experience developing tools in the big data space to join our team in Burlington, Massachusetts. She or he will enjoy defining and developing creative solutions to meet the varying needs of researchers and data scientists as they work with and develop prototypes for product groups inside the company. This position may touch on tools for all aspects of large data systems, ranging from storage to analytics to artifact tracking and data science notebooks. As an example, one of the systems the group currently maintains is a distributed data store, called AURA, that scales to billions of objects, can run custom code right next to the data, and handles full-text queries in milliseconds. It has been used to support tasks such as candidate generation and random sampling. Candidates must have a BS or MS in Computer Science and at least 5 years of experience developing solid, maintainable, object-oriented code. 
Required Skills and Experience:
- Define and develop creative solutions to meet the varying needs of researchers and data scientists as they work with and develop prototypes for product groups inside the company
- Strong Java skills, or similar language with willingness to work in Java
- Solid grasp of data structures and algorithms, finding the right tool for the job
- Feels at home working in a Unix/Unix-like environment and with Unix tools
- Working knowledge of full stack development from back end to interfaces
- Desire to find the right solution to a problem, not just the most expedient one
- Willingness to solve the hard problems so that your users don't have to
- Excellent written and oral communication skills
- Understanding of machine learning and data science workflows

Preferred Skills and Experience:
- Experience building scalable systems that run on distributed platforms
- Familiarity with distributed computation frameworks such as Hadoop, Spark, or MPI
- Knowledge of Python and scikit-learn

You can learn more about the Machine Learning Research Group at our project page. Applications for the position are available through the Oracle careers site; search for requisition number 17001966.

Summer 2017 Internship positions at IRML

The Information Retrieval and Machine Learning (IRML) group at Oracle Labs is looking for several highly motivated interns for Summer 2017. The candidates should be proficient in one or more areas of Machine Learning, including Classification, Statistical NLP, and Computer Vision. Our current focus is in the areas of information extraction, structured prediction, topic models, causal inference, and deep learning. Research experience in any of these areas is strongly preferred. The candidates should also have good implementation skills in Java, Scala, Python or other common language of choice. We are open to both Masters and PhD candidates, but PhD candidates are preferred. The Information Retrieval and Machine Learning Group is located at Oracle Labs in Burlington, MA. The candidate would need to relocate to this area. Please send your CV to pallika.kanani@oracle.com for applications or questions about the positions. Oracle is an equal opportunity employer. Oracle Labs will have a booth at NIPS 2016, where you can drop by and talk to Pallika or other members of the Labs. About IRML: The group is tasked with developing core Information Retrieval, statistical Natural Language Processing and Machine Learning technologies in order to help solve complex and challenging business problems. We collaborate with a number of Oracle product groups, working on projects like classification, search relevance, feature selection, Bayesian inference, sentiment analysis, named entity recognition, entity linking, and product attribute extraction. We also publish our research at top conferences. We're also looking for people to join our team for full-time positions. You can learn more about our research on our project page or by reading other entries on this blog. About Oracle Labs: The Mission of Oracle Labs is straightforward: Identify, explore, and transfer new technologies that have the potential to substantially improve Oracle's business. 
Oracle Labs grew out of Sun Microsystems Labs after Oracle's acquisition of Sun Microsystems. While many product development organizations within Oracle develop leading edge technologies, Oracle Labs is devoted exclusively to research. The Labs has a wide range of projects, from research into fast VMs for dynamic languages like R and Javascript, to developing state of the art inference algorithms for clusters of GPUs.

Machine Learning

Exponential stochastic cellular automata: How we achieved more than 1 billion tokens per second for LDA

This blog post is about our recent 2016 AISTATS paper on scaling inference for LDA and its ilk, which you can download here. Manzil Zaheer, who joined us from CMU for the summer of 2015, contributed to this work during his internship.

As computers become more parallel rather than faster, Bayesian inference is doomed unless it can exploit new computational resources like multicores and GPUs. While stochastic gradient descent (SGD) has been crucial in enabling other areas of machine learning to scale (parameter estimation in Markov random fields, deep learning, training classifiers, etc.), unfortunately it is difficult to apply to Bayesian inference. Stochastic variational inference (SVI) and Langevin dynamics have shown some promise, but they do not always work in high dimensions. Further, the latter completely breaks down in the context of batch-based SGD as the noise dominates the gradient.

Might maximum a posteriori (MAP) inference in a Bayesian model suffice instead? Sometimes, this might be appropriate, especially for large datasets for which we can appeal to the concentration of measure. So if we're willing to accept MAP in lieu of the full posterior then we have some options for scalable inference.

Suppose, for example, that we want to scale a model such as latent Dirichlet allocation (LDA) to a large dataset. There are several commonly used tools available, including Spark's MLLib and YahooLDA, which aim to scale up LDA. MLLib has two algorithms available: online variational inference and expectation maximization (EM), while YahooLDA employs an approximate collapsed Gibbs sampler (CGS). All of these algorithms are perfectly reasonable, but they each have their own unique set of computational properties.

In CGS we first integrate out some of the parameters (hence "collapsed") and then perform Gibbs sampling on the remaining latent variables.
That is, the sampler visits each latent variable in turn, sampling a new assignment to the variable and then updating the appropriate sufficient statistics before moving on to the next variable. Note that during this process we only need to keep track of a single assignment for each latent variable. Thus, CGS is appealing because it enables the use of sparse data structures which reduce memory and communication bottlenecks in distributed environments. However, because the process of collapsing usually causes dependencies between all of the latent variables, CGS is inherently sequential and thus difficult to parallelize.

EM alternates between computing the expectation of the latent variables and updating the sufficient statistics for the parameters. Since the latent variables only depend on each other through the parameters, they are conditionally independent during the E-step. Thus, we can compute all of these variables in parallel. For latent Dirichlet allocation (LDA), this means that we could parallelize at the granularity of words! Unfortunately, EM requires estimating expectations for each latent variable, which requires dense data structures that put tremendous pressure on memory and communication bandwidth. So while CGS is easy to distribute and hard to parallelize, EM is easy to parallelize but hard to distribute.

This got us thinking: is there a way to combine the parallel nature of EM with the sparsity of a collapsed Gibbs sampler? Motivated by this purely computational goal, we temporarily threw everything we knew about statistics out the window, and settled upon an algorithm that would satisfy both the sparsity and massively parallel desiderata. In essence, we have a stochastic EM (SEM). The idea is to insert an extra stochastic step (S-step) after the E-step in EM. That is, once we have computed the expectations for the latent variables, we estimate that expectation with a single sample from its distribution.
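To make the dense-versus-sparse contrast concrete, here is a minimal Python sketch of the two kinds of update for a single word. The function names and the four-topic distribution are hypothetical, chosen only for illustration; this is not code from the paper.

```python
import random

def e_step_update(counts, probs):
    """Plain EM: add the expected (fractional) count for every topic."""
    for k, p in enumerate(probs):
        counts[k] = counts.get(k, 0.0) + p

def s_step_update(counts, probs, rng):
    """Stochastic EM: stand in for the expectation with one draw, so only
    a single entry of the statistics is touched."""
    k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    counts[k] = counts.get(k, 0) + 1
    return k

# Hypothetical topic distribution for one word (an assumption for the demo).
probs = [0.6, 0.2, 0.1, 0.1]
dense, sparse = {}, {}
e_step_update(dense, probs)                       # touches all four topics
s_step_update(sparse, probs, random.Random(0))    # touches exactly one topic
```

With K topics the EM-style update writes K entries per word, while the sampled update writes one, which is where the sparsity comes from.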
Thus, like collapsed Gibbs sampling it employs sparse data structures, yet like EM, all the variables can be processed in parallel. Here is the resulting algorithm that we recently presented at AISTATS 2016.

Alternate the following steps:
- S-step: in parallel, compute the expectations of the latent variables and estimate the expectation with a single draw from the variable's conditional probability distribution.
- M-step: update the sufficient statistics.

An implementation can have two copies of the data structures that store the sufficient statistics. During one round it reads the sufficient statistics from copy A and increments the counts in copy B (as it imputes values in the S-step). Then, in the next round, it reads the fresh sufficient statistics from copy B and increments the counts in copy A. This process is similar to double-buffering in computer graphics.

Note that for a model such as LDA, the algorithm has a number of desirable computational properties:
- The memory footprint is minimal since we only store the data and sufficient statistics. In contrast to MCMC methods, we do not store the assignments to the per-word latent variables z. In contrast to variational methods, we do not store the variational parameters.
- Further, variational methods require K memory accesses (one for each topic) per word. In contrast, the S-step ensures we only have a single access (for the sampled topic) per word. Such reduced pressure on the memory bandwidth can improve performance significantly for highly parallel applications.
- We can further reduce the memory footprint by compressing the sufficient statistics with approximate counters. This is possible because updating the sufficient statistics only requires increments as in Mean-for-Mode (an earlier approach to running LDA on GPUs developed last year). In contrast, CGS decrements counts, preventing the use of approximate counters.
- An implementation can be lock-free (in that it does not use locks, but assumes atomic increments) because the double buffering ensures we always read from one data structure and write to another.
- Finally, the algorithm is able to fully benefit from Vose's alias method for constant time random sampling because the cost of constructing the table can be amortized over all the parallel computation in a single S-step.

So yes, the algorithm has nice computational properties, but will it work from a statistical standpoint? On the one hand, the algorithm might work because it resembles EM and statistical dependencies could be captured through the time lag. On the other hand, the algorithm resembles a parallel Gibbs sampler or stochastic cellular automaton (SCA), which are known in the general case to converge to the wrong stationary distribution.

With this in mind, we were eager to see if the algorithm would actually work, so we implemented it for latent Dirichlet allocation (LDA) and compared it to collapsed Gibbs sampling. Of course, there were two possible outcomes to our little experiment. Either (a) the experiment would be successful because the computational efficiency would dominate the statistical accuracy of the CGS, and so although the final model log-likelihood might be slightly worse, it would converge to that slightly worse log-likelihood much quicker, or (b) the experiment would be a complete failure because the statistical error is so large that no amount of computational efficiency could salvage the model.

But in fact there was a third possibility that we had not considered: ESCA (the exponential stochastic cellular automaton of the title, i.e. our algorithm) and CGS actually converged to the same likelihood! Even when we did a massive sweep over the hyperparameters, each instance of the algorithm converged to a likelihood similar to that of the corresponding CGS. Delighted, we were eager to understand why it worked so well. Is it actually converging to the correct stationary distribution after all?
If so, what properties of LDA allow this to happen? Do these properties apply to other models as well? To what class of models can we generalize? We spent the next several months investigating this line of questioning. Our initial results on this problem form the bulk of the AISTATS paper (authored by Manzil Zaheer, Michael Wick, Jean-Baptiste Tristan, Alex Smola, and Guy Steele). For example, there are certain conditions under which SEM is known to converge and we were able to show that under reasonable assumptions, LDA (and a broader class of Bayesian models, those with an exponential family likelihood) satisfies these conditions. We also found an interesting connection between stochastic EM and stochastic gradient descent with a Frank-Wolfe update. However, more work needs to be done to further establish the statistical validity of the algorithm. For this, we seek the collective wisdom of the machine learning community. Michael
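As a postscript, the S-step and M-step alternation with double buffering described in this post can be sketched in a few lines of Python. This is a sequential toy under stated assumptions, not the implementation from the paper: the conditional used for sampling is the usual smoothed LDA form (an assumption on my part), the function names, hyperparameter values and toy corpus are all hypothetical, and a fresh write buffer per round stands in for clearing and reusing copies A and B.

```python
import random
from collections import defaultdict

def new_stats():
    # Sufficient statistics: doc-topic, topic-word and per-topic totals.
    return {"dk": defaultdict(int), "kw": defaultdict(int), "k": defaultdict(int)}

def increment(stats, d, k, w):
    # Increments only, never decrements (unlike collapsed Gibbs sampling).
    stats["dk"][(d, k)] += 1
    stats["kw"][(k, w)] += 1
    stats["k"][k] += 1

def esca_round(docs, read, num_topics, vocab_size, alpha, beta, rng):
    """One round: every token reads only `read` and writes only `write`,
    so the token loop could run in parallel given atomic increments."""
    write = new_stats()
    for d, doc in enumerate(docs):
        for w in doc:
            # Assumed conditional: smoothed doc-topic times topic-word weight.
            weights = [
                (read["dk"].get((d, k), 0) + alpha)
                * (read["kw"].get((k, w), 0) + beta)
                / (read["k"].get(k, 0) + vocab_size * beta)
                for k in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights, k=1)[0]
            increment(write, d, k, w)  # no per-token assignment is stored
    return write

def run(docs, num_topics, vocab_size, rounds=20, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    stats = new_stats()
    for d, doc in enumerate(docs):  # random initialisation of the counts
        for w in doc:
            increment(stats, d, rng.randrange(num_topics), w)
    for _ in range(rounds):
        stats = esca_round(docs, stats, num_topics, vocab_size, alpha, beta, rng)
    return stats

docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]  # toy corpus of word ids
stats = run(docs, num_topics=2, vocab_size=4)
```

Note that nothing per-token survives between rounds: only the corpus and the count tables are kept, which is the memory property discussed above.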

Natural Language Processing

Nerd2Vec: Jointly embedding Star Trek, Star Wars and Doctor Who Wikias

As a result of our group's work on multilingual word embeddings, we built infrastructure for processing MediaWiki dumps and a Java implementation of SkipGram and CBOW. Developing the embedding code required a lot of testing and so I chose a smaller corpus which would train more quickly. As I'm a massive nerd, the logical choice was the Star Wars Wikia, Wookieepedia. Our end goal was to build a system that could take in multiple MediaWiki dumps and perform our Artificial Code Switching (ACS) algorithm on the data as we fed it into CBOW or SkipGram, and you can see the results of that in our AAAI 2016 paper. However, there is nothing in the system that requires the MediaWiki dumps to be in different languages, and I can turn ACS off, leaving a standard implementation of SkipGram and CBOW. So I looked around for fun things to do with all this infrastructure. In the end I decided to use more Wikias to make a bigger, nerdier embedding. I chose Wookieepedia, Memory Alpha - the Star Trek Wikia, and TARDIS - the Doctor Who Wikia. The training parameters, other experimental details and a download link for the embedding can be found at the end of this post. First I'll run through a brief explanation of what I mean when I say a "word embedding", and then on with the nerdery.

What are word embeddings?

Words are tricky things for machine learning systems to deal with. There are a huge number of them, and character level similarity doesn't mean a lot (e.g. "through" and "though" have very different meanings). This means there is a lot of research into finding a representation for words which is useful to a machine learning system, which usually means replacing each word with a vector of a few hundred floating point numbers. There are many approaches to this, from Latent Semantic Analysis through to modern neural network based approaches like word2vec. We refer to a system that converts a word into a vector as an embedding, as it embeds the words in a lower dimensional vector space.
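A tiny Python sketch of that idea, with made-up numbers: the three-dimensional vectors below are invented purely for illustration (a real embedding has a few hundred learned dimensions), and the comparison word "via" is my own addition.

```python
import math
from difflib import SequenceMatcher

# Hypothetical 3-dimensional vectors, invented for illustration only.
embedding = {
    "through": [0.9, 0.1, 0.0],
    "via":     [0.8, 0.2, 0.1],
    "though":  [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Character-level similarity says "through" and "though" are nearly identical...
char_sim = SequenceMatcher(None, "through", "though").ratio()

# ...while in the (toy) vector space "through" sits next to "via" instead.
vec_sim_near = cosine(embedding["through"], embedding["via"])
vec_sim_far = cosine(embedding["through"], embedding["though"])
```

The point is only that similarity is measured between vectors rather than between spellings; where the vectors come from is the job of the training algorithm.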
Much of the recent hype around word embeddings comes from two algorithms developed by Tomas Mikolov and a team at Google, called CBOW and SkipGram. For more information on this topic Google have a good writeup in their TensorFlow tutorial. These modern embedding algorithms are trained on large corpora of unlabelled text, like all of English Wikipedia, or millions of sentences extracted from Google News. They are very popular because embeddings created using CBOW and SkipGram have an extremely cool property, which is they can learn analogies. Teaching a computer that "man is to king as woman is to queen" is pretty hard, but CBOW and SkipGram learn that the vector for "queen" is close to the vector generated by subtracting "man" from "king" and adding "woman", and they do this without being told anything about analogies. When these results came out in a paper at NIPS 2013, there was an explosion of research in the field, finding applications for these embeddings and developing new ways to create them which improve their analogical reasoning.

And now I'm going to take all that research and use it to settle an extremely important argument once and for all. Who in Star Trek: The Next Generation is more like Captain Kirk? Is it Riker or Picard?

Examining the nerdy embedding

After downloading and processing the three corpora, I trained the system to create the nerdiest set of word embeddings possible. This gave me a mapping for each of 42,000 words to a 300 dimensional vector. Let's do a few quick sanity checks to see if everything is working. First thing to check is that it has the usual English analogy relationships. Running it through the standard analogy dataset from word2vec it gets 56% accuracy for the top 5, but two thirds of the test set is out of vocabulary (turns out there isn't a page on Paris in Wookieepedia), and so I'm not counting those. Let's try out a specific analogy, say "look : looking :: move : ?"
    embedding sh% thisplusthat looking - look + move
    Similarity to looking - look + move
    Similarity: (looking - look + move : moving) = 0.5875
    Similarity: (looking - look + move : moved) = 0.4681
    Similarity: (looking - look + move : waiting) = 0.4650

and we get the expected answer "moving". Of course we should check the analogy that appears in all the papers, "man : king :: woman : ?"

    embedding sh% thisplusthat king - man + woman
    Similarity to king - man + woman
    Similarity: (king - man + woman : queen) = 0.5390
    Similarity: (king - man + woman : monarch) = 0.4281
    Similarity: (king - man + woman : princess) = 0.4102

and we see the top result ranked by cosine similarity is "queen".

These results are from a little browser I put together. "thisplusthat" is the command that takes an algebraic expression for vectors and finds the closest vector in terms of cosine similarity. It's like the word-analogy program supplied with word2vec, just wrapped in a real shell. Later we'll also use "query", which takes a single word and returns the list of the closest vectors to the vector for that word, measured by cosine similarity. I'll continue to show the top 3 results, but there are other interesting results further down these lists that I encourage you to find for yourself.

Let's check a few more involved properties of the embedding: does it know how to make plurals?
    embedding sh% thisplusthat daleks - dalek + cyberman
    Similarity to daleks - dalek + cyberman
    Similarity: (daleks - dalek + cyberman : cybermen) = 0.7073
    Similarity: (daleks - dalek + cyberman : autons) = 0.5488
    Similarity: (daleks - dalek + cyberman : sontarans) = 0.4915

    embedding sh% thisplusthat vulcans - vulcan + klingon
    Similarity to vulcans - vulcan + klingon
    Similarity: (vulcans - vulcan + klingon : klingons) = 0.7117
    Similarity: (vulcans - vulcan + klingon : romulans) = 0.6225
    Similarity: (vulcans - vulcan + klingon : cardassians) = 0.5662

It looks like the similarities are still based on the source material: Cybermen are similar to Autons and Sontarans as they are all baddies from Doctor Who. Ditto for Klingons, Romulans and Cardassians for Star Trek. It's definitely figured out plurals within a Wikia, but is "plural" the same direction in Doctor Who as it is in Star Trek? Or Star Wars? Does it know that klingons - klingon + wookiee = wookiees?

    embedding sh% thisplusthat klingons - klingon + wookiee
    Similarity to klingons - klingon + wookiee
    Similarity: (klingons - klingon + wookiee : wookiees) = 0.6586
    Similarity: (klingons - klingon + wookiee : cardassians) = 0.5208
    Similarity: (klingons - klingon + wookiee : ewoks) = 0.5191

So plurals are the same throughout the embedding, though it does add a little flavour of the source material. I think Ewoks are a little more similar to a Wookiee than Cardassians, but CBOW disagrees.

Structure within a Wikia

Let's take a look at some of the structure CBOW learned within a single Wikia, to see how it works when the entities overlap in the text.

Wookieepedia (Star Wars)

First let's find out what it knows about the main characters in Star Wars: Luke, Leia and Han.
    embedding sh% query luke
    Similarity to luke
    Similarity: (luke : anakin) = 0.7211
    Similarity: (luke : leia) = 0.6500
    Similarity: (luke : cade) = 0.6418

    embedding sh% query leia
    Similarity to leia
    Similarity: (leia : luke) = 0.6500
    Similarity: (leia : solo) = 0.6452
    Similarity: (leia : han) = 0.6180

    embedding sh% query han
    Similarity to han
    Similarity: (han : chewbacca) = 0.6565
    Similarity: (han : organa) = 0.6392
    Similarity: (han : leia) = 0.6180

So it knows that Luke and Anakin Skywalker are closely related (in a plot sense), that Luke & Leia are related, and that Han Solo spends a lot of time with a walking carpet. Looking at the Dark side, we can see that the embedding has uncovered the big mystery in the prequels, namely that Senator Palpatine is actually a Sith Lord.

    embedding sh% query palpatine
    Similarity to palpatine
    Similarity: (palpatine : sidious) = 0.7394
    Similarity: (palpatine : vader) = 0.6536
    Similarity: (palpatine : dooku) = 0.6372

Now we've established that the queries look sensible, let's see how well we can move around the space with linear operations. If the Empire is run by the Sith, what organisation has the same relationship to the Jedi? That is, "sith : empire :: jedi : ?"

    embedding sh% thisplusthat empire - sith + jedi
    Similarity to empire - sith + jedi
    Similarity: (empire - sith + jedi : alliance) = 0.5323
    Similarity: (empire - sith + jedi : rebellion) = 0.5182
    Similarity: (empire - sith + jedi : republic) = 0.5000

All three of these answers work fine: the Rebel or Galactic Alliances, the Rebellion, and the Old/New Republics all look to the Jedi the way the Empire looked to the Sith. We can ask who is the most like Luke Skywalker in the Republic.
    embedding sh% thisplusthat luke - rebellion + republic
    Similarity to luke - rebellion + republic
    Similarity: (luke - rebellion + republic : kenobi) = 0.4510
    Similarity: (luke - rebellion + republic : anakin) = 0.4452
    Similarity: (luke - rebellion + republic : jacen) = 0.4397

Obi-Wan is a good choice for this as he's the most notable good Jedi in the prequels. Anakin Skywalker fits as the main character in the prequels, even if he does fall to the dark side. Anakin could also refer to Anakin Solo, who with Jacen comes from the New Republic in the now de-canonised Expanded Universe. They are the male children of Han & Leia, and they are similarly heroic for most of the New Jedi Order. Finally, did we learn who are the masters of the Dark side of the Force?

    embedding sh% thisplusthat jedi - light + dark
    Similarity to jedi - light + dark
    Similarity: (jedi - light + dark : sith) = 0.6827
    Similarity: (jedi - light + dark : masters) = 0.4429
    Similarity: (jedi - light + dark : apprentice) = 0.4206

Well, that seems to have worked: Sith is far and away the most similar answer to the query "jedi - light + dark".

Memory Alpha (Star Trek)

Let's start off with a few queries to see what we've learned about Star Trek.

    embedding sh% query kirk
    Similarity to kirk
    Similarity: (kirk : spock) = 0.7800
    Similarity: (kirk : mccoy) = 0.6923
    Similarity: (kirk : riker) = 0.6893
    Similarity: (kirk : picard) = 0.6823

    embedding sh% query picard
    Similarity to picard
    Similarity: (picard : riker) = 0.7958
    Similarity: (picard : worf) = 0.7508
    Similarity: (picard : troi) = 0.7214

We can see it's figured out who appears in which series, though oddly Riker is closer to Kirk than Picard is (we'll see this again in a moment). It has Spock and McCoy pegged as closely related to Kirk, and Riker, Worf and Troi related to Picard. If you expand the number of matches, most of the bridge crew for the respective series turn up when you query for their captain.
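For readers who want to play along without the shell, the two commands used so far can be approximated in a few lines of Python. This is a sketch under assumptions, not the actual browser: the function signatures are my own, the input words are excluded from the results (as the word-analogy program in word2vec does, though I am assuming that here), and the three-dimensional toy vectors are invented so that the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(embedding, target, exclude, top_n):
    """Rank all words (except `exclude`) by cosine similarity to `target`."""
    scored = [(w, cosine(target, vec)) for w, vec in embedding.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

def query(embedding, word, top_n=3):
    """Closest vectors to the vector for a single word."""
    return nearest(embedding, embedding[word], {word}, top_n)

def thisplusthat(embedding, plus, minus, top_n=3):
    """Closest vectors to sum(plus) - sum(minus); input words are excluded."""
    dim = len(next(iter(embedding.values())))
    target = [0.0] * dim
    for word in plus:
        target = [t + x for t, x in zip(target, embedding[word])]
    for word in minus:
        target = [t - x for t, x in zip(target, embedding[word])]
    return nearest(embedding, target, set(plus) | set(minus), top_n)

# Invented toy vectors: dimensions are roughly (male, female, royal).
toy = {
    "man":     [1.0, 0.0, 0.0],
    "woman":   [0.0, 1.0, 0.0],
    "king":    [1.0, 0.0, 1.0],
    "queen":   [0.0, 1.0, 1.0],
    "monarch": [0.5, 0.5, 1.0],
}

analogy = thisplusthat(toy, plus=["king", "woman"], minus=["man"])
neighbours = query(toy, "king")
```

In this toy space "king - man + woman" lands exactly on "queen", and the nearest neighbour of "king" is "monarch"; a trained embedding is far noisier, which is why the similarities in the transcripts sit well below 1.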
Star Trek has a lot more structure in it than the other Wikias as there are five fairly self-contained series, with few crossovers between them. In Memory Alpha each series is referred to by a three letter acronym, so Star Trek -> TOS, Star Trek: The Next Generation -> TNG, Deep Space 9 -> DS9 etc. So let's see if we can find out who the captain is in Deep Space 9, given we know Kirk is the captain in TOS.

    embedding sh% thisplusthat kirk - tos + ds9
    Similarity to kirk - tos + ds9
    Similarity: (kirk - tos + ds9 : sisko) = 0.6546
    Similarity: (kirk - tos + ds9 : picard) = 0.6200
    Similarity: (kirk - tos + ds9 : riker) = 0.6081

That worked pretty well, even though Sisko is a Commander for half the series. Let's try again with TNG.

    embedding sh% thisplusthat kirk - tos + tng
    Similarity to kirk - tos + tng
    Similarity: (kirk - tos + tng : riker) = 0.6801
    Similarity: (kirk - tos + tng : picard) = 0.6776
    Similarity: (kirk - tos + tng : spock) = 0.6003

Controversial choice here by CBOW. Is it Riker's tendency to sleep with all alien females that makes him the Kirk of the Next Generation? Or that he gets into all the fights? Or even the way he sits in chairs?

Also, in Star Trek each person has a job and a rank, so Spock is the science officer, McCoy is the doctor etc. This is in contrast to Star Wars, for example, where Luke is a farmer, then a pilot, then someone who camps in a swamp, then a Jedi knight, then the only Jedi master, then someone who keeps turning students to the Dark side. Anyway, we should see if the rank and job structure made it into the embedding. For example, does it know the right answer to "doctor : McCoy :: captain : ?"
    embedding sh% thisplusthat mccoy - doctor + captain
    Similarity to mccoy - doctor + captain
    Similarity: (mccoy - doctor + captain : kirk) = 0.4339
    Similarity: (mccoy - doctor + captain : spock) = 0.4078
    Similarity: (mccoy - doctor + captain : pike) = 0.4006

That seemed to work OK, but Kirk has a very strong signal in Memory Alpha (he turns up everywhere), so let's try to go from Kirk to his science officer.

    embedding sh% thisplusthat kirk - captain + science
    Similarity to kirk - captain + science
    Similarity: (kirk - captain + science : spock) = 0.4196
    Similarity: (kirk - captain + science : pol) = 0.3840
    Similarity: (kirk - captain + science : mccoy) = 0.3820

We got Spock back, which is the right answer, and clearly science has a strong signal as it brought up T'Pol, the science officer in Star Trek: Enterprise.

TARDIS (Doctor Who)

First let's find out what the embedding thinks of the Doctor. Unfortunately there is a lot of noise here as we lowercased all the words, and "doctor" appears in both the other Wikias in completely different contexts.

    embedding sh% query doctor
    Similarity to doctor
    Similarity: (doctor : doctors) = 0.6063
    Similarity: (doctor : brigadier) = 0.5916
    Similarity: (doctor : tardis) = 0.5432
    Similarity: (doctor : man) = 0.5416
    Similarity: (doctor : monk) = 0.5045
    Similarity: (doctor : rani) = 0.4993

I left a few extra rows in here to show something interesting (to me at least). The similarity between "doctor" and "monk" presumably comes from the two stories in the 60s where the Doctor faced off against the Meddling Monk, trying to prevent him altering history. From this relatively small overlap, where they both have TARDISes and are Time Lords from Gallifrey (though both Time Lords and Gallifrey aren't named for at least another 3 years), it picked him out as a similar word. Immediately below "monk" in the ranking is "rani", yet another renegade Time Lord who turns up in a few stories to frustrate the Doctor.
It's strange that the Master doesn't appear in this list, as he's the most similar to the Doctor throughout the series, but presumably his name is too overloaded, what with all the Jedi masters in the rest of the corpus. Let's try one of the recent companions, Rory:

    embedding sh% query rory
    Similarity to rory
    Similarity: (rory : amy) = 0.8353
    Similarity: (rory : clara) = 0.6021
    Similarity: (rory : donna) = 0.5629

It's found Rory's wife, Amy, and the companions before and after them. Looking further into the past, how well does it know old-school Who?

    embedding sh% query romana
    Similarity to romana
    Similarity: (romana : k9) = 0.5310
    Similarity: (romana : leela) = 0.5107
    Similarity: (romana : adric) = 0.4576

Romana travelled with both K9 and Adric, and was the Doctor's companion after Leela, so that seems pretty good. Now what does it know about Gallifrey, home of the Time Lords?

    embedding sh% query gallifrey
    Similarity to gallifrey
    Similarity: (gallifrey : earth) = 0.4910
    Similarity: (gallifrey : skaro) = 0.4761
    Similarity: (gallifrey : leela) = 0.4251
    Similarity: (gallifrey : korriban) = 0.4191

Earth and Skaro are, along with Gallifrey, the most frequently mentioned planets in Doctor Who. Leela is an odd choice, as she's one of the Fourth Doctor's companions, and not a Time Lord at that, though there is some link between the two. She was inexplicably married off to a Time Lord guard, and left on Gallifrey, in the time-honoured Doctor Who tradition of dumping companions wherever the actor's last story was set. Korriban is also a strange connection as it's the home of the Sith Empire in Star Wars, and Gallifrey is not known for its links with evil galaxy-conquering empires of space wizards. Though occasionally the Doctor does seem to have the force.

We should be able to move around the semantic space in the same way as the other Wikias. For example, do we know where the Thals come from?
    embedding sh% thisplusthat gallifrey - lord + thal
    Similarity to gallifrey - lord + thal
    Similarity: (gallifrey - lord + thal : skaro) = 0.3549
    Similarity: (gallifrey - lord + thal : earth) = 0.3400
    Similarity: (gallifrey - lord + thal : kembel) = 0.3313

Lord is slightly overloaded in this corpus as there are Sith Lords and Time Lords. Here it's kept it in the realm of Doctor Who, as all three are Doctor Who references (Kembel is where the First Doctor encountered the Daleks' Master Plan, so it's at least tangentially related to the Thals). Oddly it thinks "lord : gallifrey :: dalek : ?" is Earth, when it really should be Skaro, though "lord : gallifrey :: kaled : ?" gives the right answer.

Structure across Wikias, or "who is the Han Solo of Doctor Who?"

Now we can answer the big questions, like which captain of the Enterprise is most like Han Solo: Kirk or Picard?

    embedding sh% thisplusthat han - falcon + enterprise
    Similarity to han - falcon + enterprise)
    Similarity: (han - falcon + enterprise : picard) = 0.5163
    Similarity: (han - falcon + enterprise : archer) = 0.4585
    Similarity: (han - falcon + enterprise : kirk) = 0.4535

It would appear Picard is the winner here, though unfortunately we can't ask the embedding why. Next, we can ask who is the Han Solo of Doctor Who?

    embedding sh% thisplusthat han - wars + doctor
    Similarity to han - wars + doctor)
    Similarity: (han - wars + doctor : rory) = 0.4402
    Similarity: (han - wars + doctor : amy) = 0.4252
    Similarity: (han - wars + doctor : jamie) = 0.4120

In a completely unexpected answer it seems Rory is the Han Solo of Doctor Who. I'm not really sure I buy this; I think Captain Jack is a much better fit from the new series, and the old series doesn't really have any people who shoot first and ask questions later. Still, who am I to question our AI overlords?

One bit of cross-pollination I found amusing when writing this post was when querying the Borg.
    embedding sh% query borg
    Similarity to borg
    Similarity: (borg : romulans) = 0.5430
    Similarity: (borg : cube) = 0.5424
    Similarity: (borg : drone) = 0.5327
    Similarity: (borg : klingons) = 0.5320
    Similarity: (borg : collective) = 0.5185
    Similarity: (borg : dominion) = 0.4996
    Similarity: (borg : federation) = 0.4936
    Similarity: (borg : drones) = 0.4918
    Similarity: (borg : cardassians) = 0.4910
    Similarity: (borg : cybermen) = 0.4888

There down in 10th place are the Cybermen, also known as when Doctor Who did this idea 23 years before Star Trek. I like to think this link occurs because CBOW learned something really fundamental about the relationship. Unfortunately it's probably due to a comic book called Assimilation, which is much less interesting.

These Wikias also contain lots of information about the making of the different shows, so we can ask questions like "Who is the George Lucas of Star Trek?"

    embedding sh% thisplusthat lucas - wars + trek
    Similarity to lucas - wars + trek)
    Similarity: (lucas - wars + trek : roddenberry) = 0.4987
    Similarity: (lucas - wars + trek : okuda) = 0.4207
    Similarity: (lucas - wars + trek : berman) = 0.4157

Those answers seem to work: Gene Roddenberry created TOS and TNG, and Rick Berman took over for DS9, VOY and ENT. The embedding is, though, pretty fond of Roddenberry as an answer for all questions like someone - wars + trek = ?

    embedding sh% thisplusthat han - wars + trek
    Similarity to han - wars + trek)
    Similarity: (han - wars + trek : roddenberry) = 0.3599
    Similarity: (han - wars + trek : shatner) = 0.3577
    Similarity: (han - wars + trek : sulu) = 0.3451

The difference between Han and Lucas introduces some confusion: with Lucas it's clearly a list of people who worked on Star Trek, whereas with Han it's a mixture of the real world and the Star Trek world.

Let's end with a larger question, who is the Federation of Star Wars?
    embedding sh% thisplusthat federation - trek + wars
    Similarity to federation - trek + wars)
    Similarity: (federation - trek + wars : republic) = 0.5472
    Similarity: (federation - trek + wars : empire) = 0.4930
    Similarity: (federation - trek + wars : alliance) = 0.4747

We get a strong preference for the Republic, rather than the evil Empire, or the much smaller and not really a government Rebel Alliance. I still find it amazing that this link appears. There is no overlap in the source material, and the Federation is run by a President and a Council, whereas the Republic was governed by a Chancellor and a Senate, so it's not even that there are mentions like "Republic President" & "Federation Chancellor" to link the two.

Summary

Neural word embeddings have an awful lot of unexpected power. They help improve the performance of many NLP systems, and have lots of interesting properties. But as we've seen here, they also are pretty good for starting arguments on the internet (the pro-Rory section of Doctor Who fandom needs all the help it can get). I find it surprising how well the different source materials lined up, and how many links the embedding created between them. The phrase "the Republic is the Federation of Star Wars" doesn't appear in the Wikias at all, but the embedding has definitely learned that, even though there is no text overlap between the entities. I hope you enjoyed our little look through Wikia word embeddings, and you should definitely download the embedding and have a poke around. Let me know in the comments if you find other interesting relationships, or use it in a cool project.

Adam (adam dot pocock at oracle dot com)

Experimental Setup

I downloaded the MediaWiki dumps in late November 2015, using the current dump at that time (thus Wookieepedia doesn't contain anything from The Force Awakens). Each MediaWiki dump was split into documents, one per page, which were passed through a lowercasing Lucene tokenizer.
I made a Java 8 stream out of each list of documents, and mixed the streams so that the final document stream had a uniformly distributed mixture of the different MediaWiki dumps (so it doesn't run out of one kind of document before the end of an iteration). All tokens that occur fewer than 10 times in a single dump were removed, and it only trained on unigrams (i.e. single words not phrases), leaving 41,933 unique words in the embedding. I used CBOW with the following parameters: window size = 5, negative samples = 5, dimension = 300, learning rate = 0.05, no subsampling of frequent terms, iterations = 50. I also tried a model with 10 iterations but the embedding wasn't as good. It was trained Hogwild using 6 threads (stream.parallel() FTW), which took a couple of hours on my 2012 retina MacBook Pro.

Download

You can download a zipped file of the final embedding in word2vec text format here. This can be read by word2vec, and presumably other packages like GenSim or DeepLearning4J (though I have only tested word2vec).
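If you want to poke around the embedding without the shell used in this post, a query like thisplusthat is easy to approximate. The sketch below is not the code behind the post (that was written in Java); it is a minimal pure-Python version that assumes the file follows the usual word2vec text layout (a "vocab_size dimension" header line, then one word and its vector per line), ranks candidates by cosine similarity, and leaves the query words out of the results. The function names and the tiny 2-dimensional toy vectors are made up for illustration.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse word2vec text format: a 'vocab_size dimension' header line,
    then one 'word v1 v2 ... vd' line per word."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def this_plus_that(vectors, start, minus, plus, top_n=3):
    """Rank words by cosine similarity to vec(start) - vec(minus) + vec(plus).

    Excluding the three query words from the ranking is an assumption here,
    not something the post states about its own shell.
    """
    target = [s - m + p for s, m, p in
              zip(vectors[start], vectors[minus], vectors[plus])]
    exclude = {start, minus, plus}
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:top_n]]


# Toy 2-dimensional embedding, invented purely to exercise the code.
TOY = io.StringIO(
    "5 2\n"
    "kirk 1.0 1.0\n"
    "tos 1.0 0.0\n"
    "ds9 0.0 1.0\n"
    "sisko 0.1 2.0\n"
    "picard 1.0 0.2\n"
)
toy_vectors = load_word2vec_text(TOY)
results = this_plus_that(toy_vectors, "kirk", "tos", "ds9")
for word, score in results:
    print(f"Similarity: (kirk - tos + ds9 : {word}) = {score:.4f}")
```

To try it on the real download, pass an open file handle (for example open(path, encoding="utf-8"), where path is wherever you unzipped the file) in place of the toy StringIO. With 41,933 words and 300 dimensions a pure-Python scan is slow but workable; a numerical library would be the obvious next step.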


Natural Language Processing

Minimally Constrained Word Embeddings via Artificial Code Switching (AAAI 2016)

Here in the IRML group at Oracle Labs we do a lot of research on Natural Language Processing (NLP). Extracting information from unstructured text is one of the core problems with big data, and the NLP community is making definite progress. However, much of this progress is only occurring in English, because the Machine Learning approaches that are the workhorses in modern NLP require lots of training data and lexical resources, and these are frequently only available in English. We often find that our collaborators in product groups want to use the NLP tools we develop in multiple languages, which causes a problem because we don't have labelled training data in those languages.

One thing we do have is lots of unlabelled text, in multiple languages, from sources like Wikipedia. Like the rest of the NLP community we're trying to find ways to leverage this unlabelled data to improve our supervised task. One common approach has been to train a word embedding model such as word2vec or GloVe on the unlabelled data, and use these word embeddings with our supervised task. This lets our supervised task use the semantic and syntactic knowledge encoded in the word vectors, which should improve our performance.

The reason we want to use an embedding approach in multiple languages is that once we've mapped every word into a vector, our predictor doesn't need to know anything about languages, only vectors. This means we can train it on word vectors from our English training data, and test it on word vectors from French test data, without the predictor ever having seen any French words at training time. Of course this only works if we managed to put French words "close" to their English equivalents in the vector space (i.e. we would put "bon" close to "good"). Many researchers are working on extending these embedding techniques to multiple languages.
This research usually focuses on using "parallel corpora" where each word in a document is linked to the equivalent word in a different language version of the same document. Unfortunately this defeats the point of using unlabelled data, as now you need a large labelled parallel corpus. In our AAAI 2016 paper we present a way of generating a multilingual embedding which only requires a few thousand single word translations (as supplied by a bilingual dictionary), in addition to unlabelled text for each language.

We call our main approach to producing these embeddings "Artificial Code Switching" (ACS), by analogy to the "code switching" done by multilingual people as they change languages within a single sentence. We found that by probabilistically replacing words with their translations in another language as we feed a sentence to an embedding algorithm such as CBOW, we can create the right kind of multilingual alignment. This approach moves the word "bon" closer to "good", but it also moves French words with unknown translations which are close to "bon" closer to "good". We also tested an approach which forces two translated words to be close, but this constraint gradient is hard to balance with the normal CBOW gradient.

You can see the different kinds of update in the figure below: The black arrows are the standard CBOW updates, moving a word closer to its context and the context closer to the word. The red arrows are the ACS updates, moving a word closer to the context of a translation, and the translated context closer to the word. The blue arrows are the constraint updates, moving a word closer to its translation.

We found the ACS update to work surprisingly well for such a simple procedure. It scales across multiple languages, and our embeddings trained on 5 languages were just as good as the ones trained on 2 languages. It even allows the use of multilingual analogies, where half the analogy is in one language, and half in another.
For example, the standard word2vec analogy vec(king) - vec(man) + vec(woman) ~ vec(queen), can be expressed as vec(king) - vec(man) + vec(femme) ~ vec(reine). To show this we translated the single word analogies supplied with word2vec into 4 different languages (French, Spanish, German and Danish), and generated multiple language analogies going from English to each of the other languages. The multilingual embedding isn't quite as good at the purely English analogies, but it improves on the multilingual analogies, and even improves performance on the purely French analogies (we think that's due to more training data and a small amount of vocabulary overlap between English and French). We also showed that ACS helps with our original task, of generating French & Spanish prediction systems (in this case Sentiment Analysis), even when we only train them on English training data. More details on the algorithm and our experiments are available in the paper.

Michael Wick from the IRML group will be presenting this paper at AAAI 2016 on Sunday 14th February, so if you're at the conference say hello. The paper is available here, and we're working on a place to host the translated analogies so other groups can test out their approaches. I'll edit this post with the link when we have it.

Paper: Minimally Constrained Multilingual Word Embeddings via Artificial Code Switching, Michael Wick, Pallika Kanani, Adam Pocock, AAAI 2016. PDF

Adam (adam dot pocock at oracle dot com)
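To make the "probabilistically replacing words with their translations" step concrete, here is a rough sketch of the preprocessing it describes. It is an illustration under assumptions rather than the implementation from the paper: the function name, the switch probability values, and the three dictionary entries are all hypothetical, and the real system may choose which words to switch differently. The idea is simply that each sentence is code-switched on the fly before being handed to an embedding trainer such as CBOW.

```python
import random


def artificial_code_switch(tokens, dictionary, switch_prob, rng):
    """Return a copy of tokens where each word that has a dictionary entry
    is swapped for one of its translations with probability switch_prob."""
    switched = []
    for token in tokens:
        translations = dictionary.get(token)
        if translations and rng.random() < switch_prob:
            # Pick one of the known translations uniformly at random.
            switched.append(rng.choice(translations))
        else:
            switched.append(token)
    return switched


# Hypothetical English -> French dictionary entries, for illustration only.
EN_FR = {"good": ["bon"], "king": ["roi"], "woman": ["femme"]}

sentence = "the king is a good man".split()
rng = random.Random(0)  # seeded so repeated runs give the same switches

never = artificial_code_switch(sentence, EN_FR, 0.0, rng)
always = artificial_code_switch(sentence, EN_FR, 1.0, rng)
sometimes = artificial_code_switch(sentence, EN_FR, 0.5, rng)

print(" ".join(never))
print(" ".join(always))
print(" ".join(sometimes))
```

Because a switched word keeps its original neighbours, a trainer that sees "roi" in the contexts where "king" normally appears gets pulled towards putting the two close together, and words with no dictionary entry are dragged along through the contexts they share.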


Hello World

This blog is a place for members of the Information Retrieval and Machine Learning (IRML) group in Oracle Labs to write about what we're up to. We work on the research and development of information retrieval, natural language processing, and machine learning systems to help solve difficult business problems. We're going to talk about our papers, what conferences we're attending, and cool things we're doing with our research.

People

Stephen Green (Principal Investigator)
Steve is a researcher in Information Retrieval, like his father before him. He's been developing search systems for more than 20 years, both in research and as shipping products. He's been running the IRML group in one form or another for more than 10 years, doing work on passage retrieval, document classification, recommendation (for music in particular), and statistical NLP.

Jeffrey Alexander
Jeff has been working on Information Retrieval systems for more than 10 years. He has worked both in the inner reaches of an industrial strength research search engine and in creating abstract frameworks for pushing data in and out of many different engines. When possible, he combines his interest in IR with his interests around scalable and distributed systems, building highly scalable distributed systems for search-related tasks.

Pallika Kanani
Pallika works at the intersection of NLP and Machine Learning. She is interested in information extraction, semi-supervised learning, transfer learning, and active information acquisition. She works closely with various product groups at Oracle on real world applications of Machine Learning, and has worked extensively with social data. She did her PhD at UMass, Amherst, under Prof. Andrew McCallum. Along the way, she interned for the Watson Jeopardy! project at IBM, worked on Bing at Microsoft Research, played with algorithmic trading at Deutsche Bank, analyzed babies' growth in a psychology lab, tutored students for GRE, and worked in the family chemical manufacturing business. She also serves as a senior board member for Women in Machine Learning (WiML).

Philip Ogren
Philip has nearly 20 years of software engineering experience, which includes four years on an Oracle product team and work on a variety of NLP open source projects during his PhD at the University of Colorado. He enjoys working on a variety of software engineering problems related to NLP including frameworks, data structures, and algorithms. His current interests include string similarity search and named entity recognition and linking.

Adam Pocock
Adam is a Machine Learning researcher, who finished his PhD in Information Theory and feature selection in 2012. His thesis won the British Computer Society Distinguished Dissertation award in 2013. He's interested in distributed machine learning, Bayesian inference, and structure learning, and in writing code that requires as many GPUs as possible, because he enjoys building shiny computers.

Jean-Baptiste Tristan
John has been tempted over to Machine Learning from his first calling in Programming Languages research. His recent work is on scaling Bayesian inference across clusters of GPUs or CPUs, while maintaining guarantees on the statistical quality of the result. During his PhD he helped develop the CompCert compiler, the first provably correct optimising C compiler. This work won him the 2011 La Recherche award with his advisor and the CompCert research group.

Michael Wick
Michael works at the intersection of Machine Learning and NLP. He received his PhD in Computer Science from the University of Massachusetts, Amherst, advised by Prof. Andrew McCallum. He has co-organized machine learning workshops and has dozens of machine learning and NLP papers in top conferences. In 2009 he received the Yahoo! Award for Excellence in Search and Mining and in 2010 he received the Yahoo! Key Scientific Challenges award. Recently, a hierarchical coreference algorithm he created won an international competition held by the U.S. Patent Office (USPTO). The algorithm is both the fastest and most accurate at disambiguating inventors, and will soon drive patent search for the USPTO. His research interests are learning and inference in graphical models, and structured prediction in NLP (e.g. coreference).

Research

Our recent published research has focused on a couple of different areas within ML and NLP.

Speeding up Bayesian inference by using clusters of GPUs or CPUs, whilst maintaining statistical guarantees about the stationary distribution. We have several ML papers on this topic, published in NIPS 2014, ICML 2015, AISTATS 2016 and a paper on approximate counters in PPoPP 2016.

Extracting information from noisy, poorly written text. We had a paper in a NIPS 2015 workshop, and several are in preparation.

Extending NLP systems into multiple languages. Our first paper in this area appears at AAAI 2016.

There are a few other areas we are interested in:
Scalable coreference and entity linking.
Improving search results with learning to rank.
Applying Recurrent Neural Networks to grammatical inference and program induction.
Deep learning, just like everyone else.

Opportunities

Oracle Labs has a summer internship program for talented graduate students who want to work in industry during their studies. In IRML we take a few interns each year to work on our research goals in IR, NLP and ML. We're based near Boston, MA. Note: our lunchtime sessions of Mario Kart are entirely optional, but any trash talking must be backed up by results.

The IRML team
