Words are tricky things for machine learning systems to deal with. There are a huge number of them, and character-level similarity doesn't mean a lot (e.g. "through" and "though" have very different meanings). This means there is a lot of research into finding a representation for words which is useful to a machine learning system, which usually means replacing each word with a vector of a few hundred floating point numbers. There are many approaches to this, from Latent Semantic Analysis through to modern neural network based approaches like word2vec. We refer to a system that converts a word into a vector as an embedding, as it embeds the words in a lower-dimensional vector space. Much of the recent hype around word embeddings comes from two algorithms developed by Tomas Mikolov and a team at Google, called CBOW and SkipGram. For more information on this topic Google have a good writeup in their TensorFlow tutorial.
These modern embedding algorithms are trained on large corpora of unlabelled text, like all of English Wikipedia, or millions of sentences extracted from Google News. They are very popular because embeddings created using CBOW and SkipGram have an extremely cool property, which is they can learn analogies. Teaching a computer that "man is to king as woman is to queen" is pretty hard, but CBOW and SkipGram learn that the vector for "queen" is close to the vector generated by subtracting "man" from "king" and adding "woman", and they do this without being told anything about analogies. When these results came out in a paper at NIPS 2013, there was an explosion of research in the field, finding applications for these embeddings and developing new ways to create them which improve their analogical reasoning. And now I'm going to take all that research and use it to settle an extremely important argument once and for all. Who in Star Trek: The Next Generation is more like Captain Kirk? Is it Riker or Picard?
After downloading and processing the three corpora, I trained the system to create the nerdiest set of word embeddings possible. This gave me a mapping for each of 42,000 words to a 300 dimensional vector. Let's do a few quick sanity checks to see if everything is working. First thing to check is that it has the usual English analogy relationships. Running it through the standard analogy dataset from word2vec it gets 56% accuracy for the top 5, but two thirds of the test set is out of vocabulary (turns out there isn't a page on Paris in Wookieepedia), and so I'm not counting those. Let's try out a specific analogy, say "look : looking :: move : ?"
embedding sh% thisplusthat looking - look + move
Similarity to looking - look + move
Similarity: (looking - look + move : moving) = 0.5875
Similarity: (looking - look + move : moved) = 0.4681
Similarity: (looking - look + move : waiting) = 0.4650
and we get the expected answer "moving". Of course we should check the analogy that appears in all the papers, "man : king :: woman : ?"
embedding sh% thisplusthat king - man + woman
Similarity to king - man + woman
Similarity: (king - man + woman : queen) = 0.5390
Similarity: (king - man + woman : monarch) = 0.4281
Similarity: (king - man + woman : princess) = 0.4102
and we see the top result ranked by cosine similarity is "queen". These results are from a little browser I put together, "thisplusthat" is the command that takes an algebraic expression for vectors and finds the closest vector in terms of cosine similarity. It's like the word-analogy program supplied with word2vec, just wrapped in a real shell. Later we'll also use "query" which takes a single word and returns the list of the closest vectors to the vector for that word measured by cosine similarity. I'll continue to show the top 3 results, but there are other interesting results further down these lists that I encourage you to find for yourself.
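Under the hood, both commands boil down to vector arithmetic followed by a cosine-similarity ranking over the vocabulary (excluding the query words themselves). Here's a minimal sketch of the idea in Python; the toy embedding, its vectors, and the function names are all made up for illustration, not taken from the real browser:

```python
import math

# Toy 3-dimensional embedding. The real embedding maps ~42,000 words
# to 300 dimensions; these words and numbers are invented.
embedding = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.0],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.95],
    "apple": [0.0, 0.9, 0.3],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def thisplusthat(pos, neg, top_n=3):
    """Sum the positive word vectors, subtract the negative ones, then
    rank every other word in the vocabulary by cosine similarity."""
    dim = len(next(iter(embedding.values())))
    target = [0.0] * dim
    for w in pos:
        target = [t + v for t, v in zip(target, embedding[w])]
    for w in neg:
        target = [t - v for t, v in zip(target, embedding[w])]
    exclude = set(pos) | set(neg)  # don't return the query words
    scores = [(w, cosine(target, v))
              for w, v in embedding.items() if w not in exclude]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:top_n]

# king - man + woman lands closest to queen in this toy space
print(thisplusthat(["king", "woman"], ["man"]))
```

A "query" for a single word is just the same ranking with the target set to that word's vector.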
Let's check a few more involved properties of the embedding, does it know how to make plurals?
embedding sh% thisplusthat daleks - dalek + cyberman
Similarity to daleks - dalek + cyberman
Similarity: (daleks - dalek + cyberman : cybermen) = 0.7073
Similarity: (daleks - dalek + cyberman : autons) = 0.5488
Similarity: (daleks - dalek + cyberman : sontarans) = 0.4915
embedding sh% thisplusthat vulcans - vulcan + klingon
Similarity to vulcans - vulcan + klingon
Similarity: (vulcans - vulcan + klingon : klingons) = 0.7117
Similarity: (vulcans - vulcan + klingon : romulans) = 0.6225
Similarity: (vulcans - vulcan + klingon : cardassians) = 0.5662
It looks like the similarities are still based on the source material, Cybermen are similar to Autons and Sontarans as they are all baddies from Doctor Who. Ditto for Klingons, Romulans and Cardassians for Star Trek. It's definitely figured out plurals within a Wikia, but is "plural" the same direction in Doctor Who as it is in Star Trek? Or Star Wars? Does it know that klingons - klingon + wookiee = wookiees?
embedding sh% thisplusthat klingons - klingon + wookiee
Similarity to klingons - klingon + wookiee
Similarity: (klingons - klingon + wookiee : wookiees) = 0.6586
Similarity: (klingons - klingon + wookiee : cardassians) = 0.5208
Similarity: (klingons - klingon + wookiee : ewoks) = 0.5191
So plurals are the same throughout the embedding, though it does add a little flavour of the source material. I think Ewoks are a little more similar to a Wookiee than Cardassians, but CBOW disagrees.
Let's take a look at some of the structure CBOW learned within a single Wikia, to see how it works when the entities overlap in the text.
First let's find out what it knows about the main characters in Star Wars: Luke, Leia and Han.
embedding sh% query luke
Similarity to luke
Similarity: (luke : anakin) = 0.7211
Similarity: (luke : leia) = 0.6500
Similarity: (luke : cade) = 0.6418
embedding sh% query leia
Similarity to leia
Similarity: (leia : luke) = 0.6500
Similarity: (leia : solo) = 0.6452
Similarity: (leia : han) = 0.6180
embedding sh% query han
Similarity to han
Similarity: (han : chewbacca) = 0.6565
Similarity: (han : organa) = 0.6392
Similarity: (han : leia) = 0.6180
So it knows that Luke and Anakin Skywalker are closely related (in a plot sense), that Luke & Leia are related, and that Han Solo spends a lot of time with a walking carpet. Looking at the Dark side, we can see that the embedding has uncovered the big mystery in the prequels, namely that Senator Palpatine is actually a Sith Lord.
embedding sh% query palpatine
Similarity to palpatine
Similarity: (palpatine : sidious) = 0.7394
Similarity: (palpatine : vader) = 0.6536
Similarity: (palpatine : dooku) = 0.6372
Now that we've established that the queries look sensible, let's see how well we can move around the space with linear operations. If the Empire is run by the Sith, what organisation has the same relationship to the Jedi? That is, "sith : empire :: jedi : ?"
embedding sh% thisplusthat empire - sith + jedi
Similarity to empire - sith + jedi
Similarity: (empire - sith + jedi : alliance) = 0.5323
Similarity: (empire - sith + jedi : rebellion) = 0.5182
Similarity: (empire - sith + jedi : republic) = 0.5000
All three of these answers work fine: the Rebel or Galactic Alliances, the Rebellion, and the Old/New Republics all look to the Jedi the way the Empire looked to the Sith.
We can ask who is the most like Luke Skywalker in the Republic.
embedding sh% thisplusthat luke - rebellion + republic
Similarity to luke - rebellion + republic
Similarity: (luke - rebellion + republic : kenobi) = 0.4510
Similarity: (luke - rebellion + republic : anakin) = 0.4452
Similarity: (luke - rebellion + republic : jacen) = 0.4397
Obi-Wan is a good choice for this as he's the most notable good Jedi in the prequels. Anakin Skywalker fits as the main character in the prequels, even if he does fall to the dark side. Anakin could also refer to Anakin Solo, who along with Jacen comes from the New Republic in the now de-canonised Expanded Universe; they are the male children of Han & Leia, and similarly heroic for most of the New Jedi Order.
Finally did we learn who are the masters of the Dark side of the Force?
embedding sh% thisplusthat jedi - light + dark
Similarity to jedi - light + dark
Similarity: (jedi - light + dark : sith) = 0.6827
Similarity: (jedi - light + dark : masters) = 0.4429
Similarity: (jedi - light + dark : apprentice) = 0.4206
Well, that seems to have worked: Sith is far and away the most similar answer to the query "jedi - light + dark".
Let's start off with a few queries to see what we've learned about Star Trek.
embedding sh% query kirk
Similarity to kirk
Similarity: (kirk : spock) = 0.7800
Similarity: (kirk : mccoy) = 0.6923
Similarity: (kirk : riker) = 0.6893
Similarity: (kirk : picard) = 0.6823
embedding sh% query picard
Similarity to picard
Similarity: (picard : riker) = 0.7958
Similarity: (picard : worf) = 0.7508
Similarity: (picard : troi) = 0.7214
We can see it's figured out who appears in which series, though oddly Riker is closer to Kirk than Picard is (we'll see this again in a moment). It has Spock and McCoy pegged as closely related to Kirk, and Riker, Worf and Troi related to Picard. If you expand the number of matches, most of the bridge crew for the respective series turn up when you query for their captain.
Star Trek has a lot more structure in it than the other Wikias, as there are 5 fairly self-contained series, with few crossovers between them. In Memory Alpha each series is referred to by a three letter acronym, so Star Trek -> TOS, Star Trek: The Next Generation -> TNG, Deep Space 9 -> DS9, etc. So let's see if we can find out who the captain is in Deep Space 9, given that we know Kirk is the captain in TOS.
embedding sh% thisplusthat kirk - tos + ds9
Similarity to kirk - tos + ds9
Similarity: (kirk - tos + ds9 : sisko) = 0.6546
Similarity: (kirk - tos + ds9 : picard) = 0.6200
Similarity: (kirk - tos + ds9 : riker) = 0.6081
That worked pretty well, even though Sisko is a Commander for half the series. Let's try again with TNG.
embedding sh% thisplusthat kirk - tos + tng
Similarity to kirk - tos + tng
Similarity: (kirk - tos + tng : riker) = 0.6801
Similarity: (kirk - tos + tng : picard) = 0.6776
Similarity: (kirk - tos + tng : spock) = 0.6003
Controversial choice here by CBOW. Is it Riker's tendency to sleep with all alien females that makes him the Kirk of the Next Generation? Or that he gets into all the fights? Or even the way he sits in chairs?
Also, in Star Trek each person has a job and a rank, so Spock is the science officer, McCoy is the doctor, etc. This is in contrast to Star Wars, for example, where Luke is a farmer, then a pilot, then someone who camps in a swamp, then a Jedi knight, then the only Jedi master, then someone who keeps turning students to the Dark side. Anyway, we should see if the rank and job structure made it into the embedding. For example, does it know the right answer to "doctor : McCoy :: captain : ?"
embedding sh% thisplusthat mccoy - doctor + captain
Similarity to mccoy - doctor + captain
Similarity: (mccoy - doctor + captain : kirk) = 0.4339
Similarity: (mccoy - doctor + captain : spock) = 0.4078
Similarity: (mccoy - doctor + captain : pike) = 0.4006
That seemed to work OK, but Kirk has a very strong signal in Memory Alpha (he turns up everywhere), so let's try to go from Kirk to his science officer.
embedding sh% thisplusthat kirk - captain + science
Similarity to kirk - captain + science
Similarity: (kirk - captain + science : spock) = 0.4196
Similarity: (kirk - captain + science : pol) = 0.3840
Similarity: (kirk - captain + science : mccoy) = 0.3820
We got Spock back, which is the right answer, and clearly science has a strong signal as it brought up T'Pol, the science officer in Star Trek: Enterprise.
First let's find out what the embedding thinks of the Doctor. Unfortunately there is a lot of noise here, as we lowercased all the words and "doctor" appears in both the other Wikias in completely different contexts.
embedding sh% query doctor
Similarity to doctor
Similarity: (doctor : doctors) = 0.6063
Similarity: (doctor : brigadier) = 0.5916
Similarity: (doctor : tardis) = 0.5432
Similarity: (doctor : man) = 0.5416
Similarity: (doctor : monk) = 0.5045
Similarity: (doctor : rani) = 0.4993
I left a few extra rows in here to show something interesting (to me at least). The similarity between "doctor" and "monk" presumably comes from the two stories in the 60s where the Doctor faced off against the Meddling Monk, trying to prevent him altering history. From this relatively small overlap, where they both have TARDISes and are Time Lords from Gallifrey (though neither Time Lords nor Gallifrey would be named for at least another 3 years), it picked him out as a similar word. Immediately below "monk" in the ranking is "rani", yet another renegade Time Lord who turns up in a few stories to frustrate the Doctor. It's strange that the Master doesn't appear in this list, as he's the most similar to the Doctor throughout the series, but presumably his name is too overloaded, what with all the Jedi masters in the rest of the corpus.
Let's try one of the recent companions, Rory:
embedding sh% query rory
Similarity to rory
Similarity: (rory : amy) = 0.8353
Similarity: (rory : clara) = 0.6021
Similarity: (rory : donna) = 0.5629
It's found Rory's wife, Amy, and the companions before and after them. Looking further into the past, how well does it know old-school Who?
embedding sh% query romana
Similarity to romana
Similarity: (romana : k9) = 0.5310
Similarity: (romana : leela) = 0.5107
Similarity: (romana : adric) = 0.4576
Romana travelled with both K9 and Adric, and was the Doctor's companion after Leela, so that seems pretty good. Now what does it know about Gallifrey, home of the Time Lords?
embedding sh% query gallifrey
Similarity to gallifrey
Similarity: (gallifrey : earth) = 0.4910
Similarity: (gallifrey : skaro) = 0.4761
Similarity: (gallifrey : leela) = 0.4251
Similarity: (gallifrey : korriban) = 0.4191
Along with Earth and Skaro these are the most frequently mentioned planets in Doctor Who. Leela is an odd choice, as she's one of the Fourth Doctor's companions, and not a Time Lord at that, though there is some link between the two. She was inexplicably married off to a Time Lord guard and left on Gallifrey, in the time-honoured Doctor Who tradition of dumping companions wherever the actor's last story was set. Korriban is also a strange connection, as it's the home of the Sith Empire in Star Wars, and Gallifrey is not known for its links with evil galaxy-conquering empires of space wizards. Though occasionally the Doctor does seem to have the Force.
We should be able to move around the semantic space in the same way as the other Wikias, for example, do we know where the Thals come from?
embedding sh% thisplusthat gallifrey - lord + thal
Similarity to gallifrey - lord + thal
Similarity: (gallifrey - lord + thal : skaro) = 0.3549
Similarity: (gallifrey - lord + thal : earth) = 0.3400
Similarity: (gallifrey - lord + thal : kembel) = 0.3313
Lord is slightly overloaded in this corpus, as there are Sith Lords and Time Lords. Here it's kept things in the realm of Doctor Who, as all three are Doctor Who references (Kembel is where the First Doctor encountered the Daleks' Master Plan, so it's at least tangentially related to the Thals). Oddly it thinks "lord : gallifrey :: dalek : ?" is Earth, when it really should be Skaro, though "lord : gallifrey :: kaled : ?" gives the right answer.
Now we can answer the big questions, like which captain of the Enterprise is most like Han Solo: Kirk or Picard?
embedding sh% thisplusthat han - falcon + enterprise
Similarity to han - falcon + enterprise
Similarity: (han - falcon + enterprise : picard) = 0.5163
Similarity: (han - falcon + enterprise : archer) = 0.4585
Similarity: (han - falcon + enterprise : kirk) = 0.4535
It would appear Picard is the winner here, though unfortunately we can't ask the embedding why. Next, we can ask who is the Han Solo of Doctor Who?
embedding sh% thisplusthat han - wars + doctor
Similarity to han - wars + doctor
Similarity: (han - wars + doctor : rory) = 0.4402
Similarity: (han - wars + doctor : amy) = 0.4252
Similarity: (han - wars + doctor : jamie) = 0.4120
In a completely unexpected answer, it seems Rory is the Han Solo of Doctor Who. I'm not really sure I buy this; I think Captain Jack is a much better fit from the new series, and the old series doesn't really have any people who shoot first and ask questions later. Still, who am I to question our AI overlords?
One bit of cross-pollination I found amusing when writing this post was when querying the Borg.
embedding sh% query borg
Similarity to borg
Similarity: (borg : romulans) = 0.5430
Similarity: (borg : cube) = 0.5424
Similarity: (borg : drone) = 0.5327
Similarity: (borg : klingons) = 0.5320
Similarity: (borg : collective) = 0.5185
Similarity: (borg : dominion) = 0.4996
Similarity: (borg : federation) = 0.4936
Similarity: (borg : drones) = 0.4918
Similarity: (borg : cardassians) = 0.4910
Similarity: (borg : cybermen) = 0.4888
There, down in 10th place, are the Cybermen, also known as the time Doctor Who did this idea 23 years before Star Trek. I like to think this link occurs because CBOW learned something really fundamental about the relationship. Unfortunately it's probably due to a comic book called Assimilation, which is much less interesting.
These Wikias also contain lots of information about the making of the different shows, so we can ask questions like "Who is the George Lucas of Star Trek?"
embedding sh% thisplusthat lucas - wars + trek
Similarity to lucas - wars + trek
Similarity: (lucas - wars + trek : roddenberry) = 0.4987
Similarity: (lucas - wars + trek : okuda) = 0.4207
Similarity: (lucas - wars + trek : berman) = 0.4157
Those answers seem to work: Gene Roddenberry created TOS and TNG, and Rick Berman took over for DS9, VOY and ENT. Though the embedding is pretty fond of Roddenberry as an answer for all questions of the form someone - wars + trek = ?
embedding sh% thisplusthat han - wars + trek
Similarity to han - wars + trek
Similarity: (han - wars + trek : roddenberry) = 0.3599
Similarity: (han - wars + trek : shatner) = 0.3577
Similarity: (han - wars + trek : sulu) = 0.3451
The difference between Han and Lucas introduces some confusion: with Lucas it's clearly a list of people who worked on Star Trek, whereas with Han it's a mixture of the real world and the Star Trek world.
Let's end with a larger question, who is the Federation of Star Wars?
embedding sh% thisplusthat federation - trek + wars
Similarity to federation - trek + wars
Similarity: (federation - trek + wars : republic) = 0.5472
Similarity: (federation - trek + wars : empire) = 0.4930
Similarity: (federation - trek + wars : alliance) = 0.4747
We get a strong preference for the Republic, rather than the evil Empire or the much smaller, not-really-a-government Rebel Alliance. I still find it amazing that this link appears. There is no overlap in the source material, and the Federation is run by a President and a Council whereas the Republic was governed by a Chancellor and a Senate, so it's not even that there are mentions like "Republic President" & "Federation Chancellor" to link the two.
Neural word embeddings have an awful lot of unexpected power. They help improve the performance of many NLP systems, and have lots of interesting properties. But as we've seen here, they're also pretty good for starting arguments on the internet (the pro-Rory section of Doctor Who fandom needs all the help it can get). I find it surprising how well the different source materials lined up, and how strong the links it created between them are, even when there is no text overlap between the entities. The phrase "the Republic is the Federation of Star Wars" doesn't appear in the Wikias at all, but it's definitely learned that.
I hope you enjoyed our little look through Wikia word embeddings, and you should definitely download the embedding and have a poke around. Let me know in the comments if you find other interesting relationships, or use it in a cool project.
Adam (adam dot pocock at oracle dot com)
I downloaded the MediaWiki dumps in late November 2015, using the current dump at that time (thus Wookieepedia doesn't contain anything from The Force Awakens). Each MediaWiki dump was split into documents, one per page, which were passed through a lowercasing Lucene tokenizer. I made a Java 8 stream out of each list of documents, and mixed the streams so that the final document stream had a uniformly distributed mixture of the different MediaWiki dumps (so it doesn't run out of one kind of document before the end of an iteration). All tokens that occur fewer than 10 times in a single dump were removed, and it only trained on unigrams (i.e. single words, not phrases), leaving 41,933 unique words in the embedding. I used CBOW with the following parameters: window size = 5, negative samples = 5, dimension = 300, learning rate = 0.05, no subsampling of frequent terms, iterations = 50. I also tried a model with 10 iterations but the embedding wasn't as good. It was trained Hogwild-style using 6 threads (stream.parallel() FTW), which took a couple of hours on my 2012 retina MacBook Pro.
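The rare-token filtering step is simple to sketch. Here's roughly what it might look like in Python for a single dump (the real pipeline was Java 8 streams, and the function name and example data here are mine; the actual filtering was applied per dump before the streams were mixed):

```python
from collections import Counter

def filter_rare_tokens(documents, min_count=10):
    """Drop every token that appears fewer than min_count times across
    the dump, mirroring the preprocessing described above."""
    counts = Counter(tok for doc in documents for tok in doc)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    return [[tok for tok in doc if tok in vocab] for doc in documents]

# "jedi" occurs 12 times and survives; "bantha" occurs twice and is dropped
docs = [["jedi"] * 6 + ["bantha"], ["jedi"] * 6 + ["bantha"]]
print(filter_rare_tokens(docs, min_count=10))
```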
You can download a zipped file of the final embedding in word2vec text format here. This can be read by word2vec, and presumably other packages like GenSim or DeepLearning4J (though I have only tested word2vec).
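If you want to poke around without installing anything, the word2vec text format is easy to parse by hand: a header line containing the vocabulary size and dimension, then one space-separated line per word. A minimal reader might look like this (a sketch for the standard format, not a robust parser):

```python
def load_word2vec_text(lines):
    """Parse the word2vec text format: a 'vocab_size dim' header line,
    then one 'word v1 v2 ... v_dim' line per word."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, "malformed vector line"
        vectors[word] = values
    assert len(vectors) == vocab_size, "header/vocab mismatch"
    return vectors

# Tiny in-memory example in the same format (made-up numbers)
sample = ["2 3", "kirk 0.1 0.2 0.3", "picard 0.4 0.5 0.6"]
print(load_word2vec_text(sample))
```

In practice you'd pass in the lines of the unzipped embedding file instead of the toy sample.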