
Papers, research and other insights from Oracle Labs' Machine Learning Research Group

Nerd2Vec: Jointly embedding Star Trek, Star Wars and Doctor Who Wikias

Adam Pocock
Principal Member of Technical Staff
As a result of our group's work on multilingual word embeddings, we built infrastructure for processing MediaWiki dumps and a Java implementation of SkipGram and CBOW. Developing the embedding code required a lot of testing, so I chose a smaller corpus which would train more quickly. As I'm a massive nerd, the logical choice was Wookieepedia, the Star Wars Wikia. Our end goal was to build a system that could take in multiple MediaWiki dumps and perform our Artificial Code Switching (ACS) algorithm on the data as we fed it into CBOW or SkipGram; you can see the results of that in our AAAI 2016 paper. However, there is nothing in the system that requires the MediaWiki dumps to be in different languages, and I can turn ACS off, leaving a standard implementation of SkipGram and CBOW. So I looked around for fun things to do with all this infrastructure, and in the end I decided to use more Wikias to make a bigger, nerdier embedding. I chose Wookieepedia, Memory Alpha - the Star Trek Wikia, and TARDIS - the Doctor Who Wikia. The training parameters, other experimental details and a download link for the embedding can be found at the end of this post. First I'll run through a brief explanation of what I mean when I say a "word embedding", and then on with the nerdery.

What are word embeddings?

Words are tricky things for machine learning systems to deal with. There are a huge number of them, and character-level similarity doesn't mean a lot (e.g. "through" and "though" have very different meanings). As a result, there is a lot of research into finding representations for words that are useful to machine learning systems, which usually means replacing each word with a vector of a few hundred floating point numbers. There are many approaches to this, from Latent Semantic Analysis through to modern neural-network-based approaches like word2vec. We refer to a system that converts a word into a vector as an embedding, as it embeds the words in a lower dimensional vector space. Much of the recent hype around word embeddings comes from two algorithms developed by Tomas Mikolov and a team at Google, called CBOW and SkipGram. For more information on this topic, Google have a good writeup in their TensorFlow tutorial.

These modern embedding algorithms are trained on large corpora of unlabelled text, like all of English Wikipedia, or millions of sentences extracted from Google News. They are very popular because embeddings created using CBOW and SkipGram have an extremely cool property, which is that they can learn analogies. Teaching a computer that "man is to king as woman is to queen" is pretty hard, but CBOW and SkipGram learn that the vector for "queen" is close to the vector generated by subtracting "man" from "king" and adding "woman", and they do this without being told anything about analogies. When these results came out in a paper at NIPS 2013, there was an explosion of research in the field, finding applications for these embeddings and developing new ways to create them which improve their analogical reasoning. And now I'm going to take all that research and use it to settle an extremely important argument once and for all. Who in Star Trek: The Next Generation is more like Captain Kirk? Is it Riker or Picard?
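
To make that vector arithmetic concrete, here is a toy sketch in Python. The numbers are made up purely for illustration; a real embedding has a few hundred learned dimensions per word rather than three hand-picked ones.

import numpy as np

# Toy vectors, purely illustrative; real CBOW/SkipGram vectors are learned
# from text and have ~300 dimensions.
vectors = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.7, 0.9, 0.1]),
    "man":   np.array([0.6, 0.2, 0.0]),
    "woman": np.array([0.5, 0.8, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "man is to king as woman is to ?": form king - man + woman, then return the
# nearest remaining word by cosine similarity.
target = vectors["king"] - vectors["man"] + vectors["woman"]
scores = {w: cosine(target, v) for w, v in vectors.items()
          if w not in {"king", "man", "woman"}}
print(max(scores, key=scores.get))  # prints "queen" with these toy numbers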

Examining the nerdy embedding

After downloading and processing the three corpora, I trained the system to create the nerdiest set of word embeddings possible. This gave me a mapping from each of roughly 42,000 words to a 300-dimensional vector. Let's do a few quick sanity checks to see if everything is working. The first thing to check is that it has the usual English analogy relationships. Running it through the standard analogy dataset from word2vec, it gets 56% top-5 accuracy, though two thirds of the test set is out of vocabulary (it turns out there isn't a page on Paris in Wookieepedia), so I'm not counting those questions.
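
For the curious, that evaluation looks roughly like the sketch below. This is just an illustration using gensim, not the tooling behind the numbers above; it assumes the embedding has been loaded from the file in the Download section and that the word2vec questions file has already been parsed into (a, b, c, expected) tuples.

from gensim.models import KeyedVectors

# Placeholder file name; see the Download section at the end of the post.
kv = KeyedVectors.load_word2vec_format("nerd2vec.txt", binary=False)

def top5_analogy_accuracy(kv, questions):
    """questions is a list of (a, b, c, expected) tuples, e.g. ("man", "king", "woman", "queen")."""
    correct = evaluated = 0
    for a, b, c, expected in questions:
        # Skip any analogy containing an out-of-vocabulary word, as described above.
        if any(w not in kv for w in (a, b, c, expected)):
            continue
        evaluated += 1
        top5 = [w for w, _ in kv.most_similar(positive=[b, c], negative=[a], topn=5)]
        if expected in top5:
            correct += 1
    return correct / evaluated if evaluated else 0.0

Back to the embedding itself. Let's try out a specific analogy, say "look : looking :: move : ?"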

embedding sh% thisplusthat looking - look + move
Similarity to looking - look + move
Similarity: (looking - look + move : moving) = 0.5875
Similarity: (looking - look + move : moved) = 0.4681
Similarity: (looking - look + move : waiting) = 0.4650

and we get the expected answer, "moving". Of course, we should check the analogy that appears in all the papers, "man : king :: woman : ?"

embedding sh% thisplusthat king - man + woman
Similarity to king - man + woman
Similarity: (king - man + woman : queen) = 0.5390
Similarity: (king - man + woman : monarch) = 0.4281
Similarity: (king - man + woman : princess) = 0.4102

and we see that the top result, ranked by cosine similarity, is "queen". These results are from a little browser I put together; "thisplusthat" is the command that takes an algebraic expression over word vectors and finds the closest vectors by cosine similarity. It's like the word-analogy program supplied with word2vec, just wrapped in a real shell. Later we'll also use "query", which takes a single word and returns the closest vectors to that word's vector, again measured by cosine similarity. I'll continue to show the top 3 results, but there are other interesting results further down these lists that I encourage you to find for yourself.
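
If you want to follow along at home without the browser, something close to these two commands can be reproduced with gensim's most_similar, which also ranks candidates by cosine similarity. This is a rough sketch, not the actual browser code, and the file name is a placeholder for the download linked at the end of the post.

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("nerd2vec.txt", binary=False)  # placeholder file name

def thisplusthat(positive, negative, topn=3):
    # Roughly "thisplusthat a - b + c": add the positive vectors, subtract the
    # negative ones and return the nearest words by cosine similarity.
    return kv.most_similar(positive=positive, negative=negative, topn=topn)

def query(word, topn=3):
    # Roughly "query w": nearest neighbours of a single word's vector.
    return kv.most_similar(positive=[word], topn=topn)

print(thisplusthat(positive=["king", "woman"], negative=["man"]))  # "queen" should be near the top
print(query("luke"))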

Let's check a few more involved properties of the embedding: does it know how to make plurals?

embedding sh% thisplusthat daleks - dalek + cyberman
Similarity to daleks - dalek + cyberman:
Similarity: (daleks - dalek + cyberman : cybermen) = 0.7073
Similarity: (daleks - dalek + cyberman : autons) = 0.5488
Similarity: (daleks - dalek + cyberman : sontarans) = 0.4915
embedding sh% thisplusthat vulcans - vulcan + klingon
Similarity to vulcans - vulcan + klingon
Similarity: (vulcans - vulcan + klingon : klingons) = 0.7117
Similarity: (vulcans - vulcan + klingon : romulans) = 0.6225
Similarity: (vulcans - vulcan + klingon : cardassians) = 0.5662

It looks like the similarities are still grounded in the source material: Cybermen are similar to Autons and Sontarans, as they are all baddies from Doctor Who. Ditto for Klingons, Romulans and Cardassians in Star Trek. It's definitely figured out plurals within a Wikia, but is "plural" the same direction in Doctor Who as it is in Star Trek? Or Star Wars? Does it know that klingons - klingon + wookiee = wookiees?

embedding sh% thisplusthat klingons - klingon + wookiee
Similarity to klingons - klingon + wookiee
Similarity: (klingons - klingon + wookiee : wookiees) = 0.6586
Similarity: (klingons - klingon + wookiee : cardassians) = 0.5208
Similarity: (klingons - klingon + wookiee : ewoks) = 0.5191

So the plural direction is the same throughout the embedding, though it does add a little flavour from the source material. I think Ewoks are a little more similar to a Wookiee than Cardassians are, but CBOW disagrees.

Structure within a Wikia

Let's take a look at some of the structure CBOW learned within a single Wikia, to see how it works when the entities overlap in the text.

Wookieepedia (Star Wars)

First let's find out what it knows about the main characters in Star Wars: Luke, Leia and Han.

embedding sh% query luke
Similarity to luke
Similarity: (luke : anakin) = 0.7211
Similarity: (luke : leia) = 0.6500
Similarity: (luke : cade) = 0.6418
embedding sh% query leia
Similarity to leia
Similarity: (leia : luke) = 0.6500
Similarity: (leia : solo) = 0.6452
Similarity: (leia : han) = 0.6180
embedding sh% query han
Similarity to han
Similarity: (han : chewbacca) = 0.6565
Similarity: (han : organa) = 0.6392
Similarity: (han : leia) = 0.6180

So it knows that Luke and Anakin Skywalker are closely related (in a plot sense), that Luke & Leia are related, and that Han Solo spends a lot of time with a walking carpet. Looking at the Dark side, we can see that the embedding has uncovered the big mystery in the prequels, namely that Senator Palpatine is actually a Sith Lord.

embedding sh% query palpatine
Similarity to palpatine
Similarity: (palpatine : sidious) = 0.7394
Similarity: (palpatine : vader) = 0.6536
Similarity: (palpatine : dooku) = 0.6372

Now that we've established the queries look sensible, let's see how well we can move around the space with linear operations. If the Empire is run by the Sith, what organisation has the same relationship to the Jedi? That is, "sith : empire :: jedi : ?"

embedding sh% thisplusthat empire - sith + jedi
Similarity to empire - sith + jedi
Similarity: (empire - sith + jedi : alliance) = 0.5323
Similarity: (empire - sith + jedi : rebellion) = 0.5182
Similarity: (empire - sith + jedi : republic) = 0.5000

All three of these answers work: the Rebel or Galactic Alliance, the Rebellion, and the Old or New Republic all look to the Jedi the way the Empire looked to the Sith.

We can ask who is most like Luke Skywalker in the Republic.

embedding sh% thisplusthat luke - rebellion + republic
Similarity to luke - rebellion + republic
Similarity: (luke - rebellion + republic : kenobi) = 0.4510
Similarity: (luke - rebellion + republic : anakin) = 0.4452
Similarity: (luke - rebellion + republic : jacen) = 0.4397

Obi-Wan is a good choice for this, as he's the most notable good Jedi in the prequels. Anakin Skywalker fits as the main character in the prequels, even if he does fall to the dark side. Anakin could also refer to Anakin Solo, who, along with Jacen, comes from the New Republic era of the now de-canonised Expanded Universe; they are the sons of Han & Leia, and they are similarly heroic for most of the New Jedi Order.

Finally, did we learn who the masters of the Dark side of the Force are?

embedding sh% thisplusthat jedi - light + dark
Similarity to jedi - light + dark
Similarity: (jedi - light + dark : sith) = 0.6827
Similarity: (jedi - light + dark : masters) = 0.4429
Similarity: (jedi - light + dark : apprentice) = 0.4206

Well, that seems to have worked: Sith is far and away the most similar answer to the query "jedi - light + dark".

Memory Alpha (Star Trek)

Let's start off with a few queries to see what we've learned about Star Trek.

embedding sh% query kirk
Similarity to kirk
Similarity: (kirk : spock) = 0.7800
Similarity: (kirk : mccoy) = 0.6923
Similarity: (kirk : riker) = 0.6893
Similarity: (kirk : picard) = 0.6823
embedding sh% query picard
Similarity to picard
Similarity: (picard : riker) = 0.7958
Similarity: (picard : worf) = 0.7508
Similarity: (picard : troi) = 0.7214

We can see it's figured out who appears in which series, though oddly Riker is closer to Kirk than Picard is (we'll see this again in a moment). It has Spock and McCoy pegged as closely related to Kirk, and Riker, Worf and Troi related to Picard. If you expand the number of matches, most of the bridge crew for the respective series turn up when you query for their captain.

Star Trek has a lot more structure in it than the other Wikias, as there are five fairly self-contained series with few crossovers between them. In Memory Alpha each series is referred to by a three-letter acronym, so Star Trek -> TOS, Star Trek: The Next Generation -> TNG, Deep Space 9 -> DS9, etc. So let's see if we can find out who the captain is in Deep Space 9, given we know Kirk is the captain in TOS.

embedding sh% thisplusthat kirk - tos + ds9
Similarity to kirk - tos + ds9
Similarity: (kirk - tos + ds9 : sisko) = 0.6546
Similarity: (kirk - tos + ds9 : picard) = 0.6200
Similarity: (kirk - tos + ds9 : riker) = 0.6081

That worked pretty well, even though Sisko is a Commander for half the series. Let's try again with TNG.

embedding sh% thisplusthat kirk - tos + tng
Similarity to kirk - tos + tng
Similarity: (kirk - tos + tng : riker) = 0.6801
Similarity: (kirk - tos + tng : picard) = 0.6776
Similarity: (kirk - tos + tng : spock) = 0.6003

A controversial choice here by CBOW: is it Riker's tendency to sleep with every alien female that makes him the Kirk of The Next Generation? Or that he gets into all the fights? Or even the way he sits in chairs?

Also, in Star Trek each person has a job and a rank, so Spock is the science officer, McCoy is the doctor, etc. This is in contrast to Star Wars, for example, where Luke is a farmer, then a pilot, then someone who camps in a swamp, then a Jedi Knight, then the only Jedi Master, then someone who keeps turning students to the Dark side. Anyway, we should see if the rank and job structure made it into the embedding. For example, does it know the right answer to "doctor : McCoy :: captain : ?"

embedding sh% thisplusthat mccoy - doctor + captain
Similarity to mccoy - doctor + captain
Similarity: (mccoy - doctor + captain : kirk) = 0.4339
Similarity: (mccoy - doctor + captain : spock) = 0.4078
Similarity: (mccoy - doctor + captain : pike) = 0.4006

That seemed to work OK, but Kirk has a very strong signal in Memory Alpha (he turns up everywhere), so let's try to go from Kirk to his science officer.

embedding sh% thisplusthat kirk - captain + science
Similarity to kirk - captain + science
Similarity: (kirk - captain + science : spock) = 0.4196
Similarity: (kirk - captain + science : pol) = 0.3840
Similarity: (kirk - captain + science : mccoy) = 0.3820

We got Spock back, which is the right answer, and clearly science has a strong signal as it brought up T'Pol, the science officer in Star Trek: Enterprise.

TARDIS (Doctor Who)

First, let's find out what the embedding thinks of the Doctor. Unfortunately, there is a lot of noise here, as we lowercased all the words and "doctor" appears in both of the other Wikias in completely different contexts.

embedding sh% query doctor
Similarity to doctor
Similarity: (doctor : doctors) = 0.6063
Similarity: (doctor : brigadier) = 0.5916
Similarity: (doctor : tardis) = 0.5432
Similarity: (doctor : man) = 0.5416
Similarity: (doctor : monk) = 0.5045
Similarity: (doctor : rani) = 0.4993

I left a few extra rows in here to show something interesting (to me at least). The similarity between "doctor" and "monk" presumably comes from the two stories in the 1960s where the Doctor faced off against the Meddling Monk, trying to prevent him from altering history. From this relatively small overlap, where they both have TARDISes and are Time Lords from Gallifrey (though neither Time Lords nor Gallifrey are named for at least another three years), it picked him out as a similar word. Immediately below "monk" in the ranking is "rani", yet another renegade Time Lord who turns up in a few stories to frustrate the Doctor. It's strange that the Master doesn't appear in this list, as he's the character most similar to the Doctor throughout the series, but presumably his name is too overloaded, what with all the Jedi masters in the rest of the corpus.

Let's try one of the recent companions, Rory:

embedding sh% query rory
Similarity to rory
Similarity: (rory : amy) = 0.8353
Similarity: (rory : clara) = 0.6021
Similarity: (rory : donna) = 0.5629

It's found Rory's wife, Amy, and the companions before and after them. Looking further into the past, how well does it know old-school Who?

embedding sh% query romana
Similarity to romana
Similarity: (romana : k9) = 0.5310
Similarity: (romana : leela) = 0.5107
Similarity: (romana : adric) = 0.4576

Romana travelled with both K9 and Adric, and was the Doctor's companion after Leela, so that seems pretty good. Now what does it know about Gallifrey, home of the Time Lords?

embedding sh% query gallifrey
Similarity to gallifrey
Similarity: (gallifrey : earth) = 0.4910
Similarity: (gallifrey : skaro) = 0.4761
Similarity: (gallifrey : leela) = 0.4251
Similarity: (gallifrey : korriban) = 0.4191

Earth and Skaro are, along with Gallifrey itself, the most frequently mentioned planets in Doctor Who. Leela is an odd choice, as she's one of the Fourth Doctor's companions, and not a Time Lord at that, though there is some link between the two: she was inexplicably married off to a Time Lord guard and left on Gallifrey, in the time-honoured Doctor Who tradition of dumping companions wherever the actor's last story was set. Korriban is also a strange connection, as it's the home of the Sith Empire in Star Wars, and Gallifrey is not known for its links with evil, galaxy-conquering empires of space wizards. Though occasionally the Doctor does seem to have the Force.

We should be able to move around the semantic space in the same way as in the other Wikias. For example, do we know where the Thals come from?

embedding sh% thisplusthat gallifrey - lord + thal
Similarity to gallifrey - lord + thal
Similarity: (gallifrey - lord + thal : skaro) = 0.3549
Similarity: (gallifrey - lord + thal : earth) = 0.3400
Similarity: (gallifrey - lord + thal : kembel) = 0.3313

"Lord" is slightly overloaded in this corpus, as there are Sith Lords and Time Lords. Here it's kept things in the realm of Doctor Who, as all three results are Doctor Who references (Kembel is where the First Doctor encountered the Daleks' Master Plan, so it's at least tangentially related to the Thals). Oddly, it thinks the answer to "lord : gallifrey :: dalek : ?" is Earth, when it really should be Skaro, though "lord : gallifrey :: kaled : ?" gives the right answer.

Structure across Wikias, or "who is the Han Solo of Doctor Who?"

Now we can answer the big questions, like which captain of the Enterprise is most like Han Solo: Kirk or Picard?

embedding sh% thisplusthat han - falcon + enterprise
Similarity to han - falcon + enterprise
Similarity: (han - falcon + enterprise : picard) = 0.5163
Similarity: (han - falcon + enterprise : archer) = 0.4585
Similarity: (han - falcon + enterprise : kirk) = 0.4535

It would appear Picard is the winner here, though unfortunately we can't ask the embedding why. Next, we can ask: who is the Han Solo of Doctor Who?

embedding sh% thisplusthat han - wars + doctor
Similarity to han - wars + doctor
Similarity: (han - wars + doctor : rory) = 0.4402
Similarity: (han - wars + doctor : amy) = 0.4252
Similarity: (han - wars + doctor : jamie) = 0.4120

In a completely unexpected answer, it seems Rory is the Han Solo of Doctor Who. I'm not really sure I buy this: I think Captain Jack is a much better fit from the new series, and the old series doesn't really have anyone who shoots first and asks questions later. Still, who am I to question our AI overlords?

One bit of cross-pollination I found amusing while writing this post came from querying the Borg.

embedding sh% query borg
Similarity to borg
Similarity: (borg : romulans) = 0.5430
Similarity: (borg : cube) = 0.5424
Similarity: (borg : drone) = 0.5327
Similarity: (borg : klingons) = 0.5320
Similarity: (borg : collective) = 0.5185
Similarity: (borg : dominion) = 0.4996
Similarity: (borg : federation) = 0.4936
Similarity: (borg : drones) = 0.4918
Similarity: (borg : cardassians) = 0.4910
Similarity: (borg : cybermen) = 0.4888

There, down in 10th place, are the Cybermen, also known as the time Doctor Who did this idea 23 years before Star Trek. I like to think this link occurs because CBOW learned something really fundamental about the relationship. Unfortunately, it's probably due to a comic book called Assimilation, which is much less interesting.

These Wikias also contain lots of information about the making of the different shows, so we can ask questions like "Who is the George Lucas of Star Trek?"

embedding sh% thisplusthat lucas - wars + trek
Similarity to lucas - wars + trek
Similarity: (lucas - wars + trek : roddenberry) = 0.4987
Similarity: (lucas - wars + trek : okuda) = 0.4207
Similarity: (lucas - wars + trek : berman) = 0.4157

Those answers seem to work: Gene Roddenberry created TOS and TNG, and Rick Berman took over for DS9, VOY and ENT. Though the embedding is pretty fond of Roddenberry as an answer to any question of the form someone - wars + trek = ?

embedding sh% thisplusthat han - wars + trek
Similarity to han - wars + trek
Similarity: (han - wars + trek : roddenberry) = 0.3599
Similarity: (han - wars + trek : shatner) = 0.3577
Similarity: (han - wars + trek : sulu) = 0.3451

The difference between Han and Lucas introduces some confusion: with Lucas, the results are clearly a list of people who worked on Star Trek, whereas with Han they are a mixture of the real world and the Star Trek universe.

Let's end with a larger question: who is the Federation of Star Wars?

embedding sh% thisplusthat federation - trek + wars
Similarity to federation - trek + wars
Similarity: (federation - trek + wars : republic) = 0.5472
Similarity: (federation - trek + wars : empire) = 0.4930
Similarity: (federation - trek + wars : alliance) = 0.4747

We get a strong preference for the Republic, rather than the evil Empire or the much smaller (and not really a government) Rebel Alliance. I still find it amazing that this link appears: there is no overlap in the source material, and the Federation is run by a President and a Council whereas the Republic was governed by a Chancellor and a Senate, so it's not even that phrases like "Republic President" and "Federation Chancellor" link the two.

Summary

Neural word embeddings have an awful lot of unexpected power. They help improve the performance of many NLP systems and have lots of interesting properties. But as we've seen here, they are also pretty good for starting arguments on the internet (the pro-Rory section of Doctor Who fandom needs all the help it can get). I find it surprising how well the different source materials lined up, and how many links the embedding created between entities that never share any text: the phrase "the Republic is the Federation of Star Wars" doesn't appear in the Wikias at all, but it has definitely learned that.

I hope you enjoyed our little look through Wikia word embeddings, and you should definitely download the embedding and have a poke around. Let me know in the comments if you find other interesting relationships, or use it in a cool project.

Adam (adam dot pocock at oracle dot com)

Experimental Setup

I downloaded the MediaWiki dumps in late November 2015, using the current dump at that time (so Wookieepedia doesn't contain anything from The Force Awakens). Each MediaWiki dump was split into documents, one per page, which were passed through a lowercasing Lucene tokenizer. I made a Java 8 stream out of each list of documents, and we mixed the streams so that the final document stream had a uniformly distributed mixture of the different MediaWiki dumps (so it doesn't run out of one kind of document before the end of an iteration). All tokens that occur fewer than 10 times in a single dump were removed, and it only trained on unigrams (i.e. single words, not phrases), leaving 41,933 unique words in the embedding. I used CBOW with the following parameters: window size = 5, negative samples = 5, dimension = 300, learning rate = 0.05, no subsampling of frequent terms, iterations = 50. I also tried a model with 10 iterations, but the embedding wasn't as good. It was trained Hogwild-style using 6 threads (stream.parallel() FTW), which took a couple of hours on my 2012 retina MacBook Pro.
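
For anyone who wants to train something similar without our Java code, a roughly equivalent configuration in gensim would look like the sketch below. It is only an approximation: documents stands in for the iterable of lowercased token lists described above, gensim applies the minimum count globally rather than per dump, and there is no stream mixing or ACS.

from gensim.models import Word2Vec

# `documents` is assumed to be the iterable of lowercased, tokenised pages
# described above; it is not defined in this sketch.
# Parameter names follow gensim 4.x.
model = Word2Vec(
    sentences=documents,
    sg=0,             # CBOW rather than SkipGram
    vector_size=300,  # dimension = 300
    window=5,         # window size = 5
    negative=5,       # negative samples = 5
    alpha=0.05,       # learning rate = 0.05
    sample=0,         # no subsampling of frequent terms
    min_count=10,     # drop rare tokens (counted globally, not per dump)
    epochs=50,        # iterations = 50
    workers=6,        # 6 training threads, Hogwild-style updates
)
model.wv.save_word2vec_format("nerd2vec.txt", binary=False)  # placeholder output name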

Download

You can download a zipped file of the final embedding in word2vec text format here. This can be read by word2vec, and presumably other packages like GenSim or DeepLearning4J (though I have only tested word2vec).
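
If you'd rather not pull in one of those packages, the text format is simple enough to parse by hand. A minimal sketch (with a placeholder local file name) looks like this:

import numpy as np

def load_word2vec_text(path):
    # word2vec text format: a "vocab_size dimension" header line, then one
    # "word v1 v2 ... v300" line per word.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            assert len(values) == dim
            vectors[word] = np.array(values, dtype=np.float32)
    assert len(vectors) == vocab_size
    return vectors

embedding = load_word2vec_text("nerd2vec.txt")  # placeholder for the downloaded file
print(len(embedding), "words,", len(embedding["kirk"]), "dimensions")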
