By searchguy on Feb 24, 2009
I was working on computing tag-tag similarities for last.fm tags for Frank for a 50K artist crawl that we did last month using an Aura instance running on EC2.
I wrote a quick program to pull the document vectors for the tags (the tags here are "documents" and the artists to whom the tags have been applied are the "words" in those documents.)
Given the vectors, it was easy to compute the complete similarity table for the tags and output the similarities for each tag in decreasing similarity order.
For 1500 tags pulled from an index of 1.8 million documents, this takes about 55 seconds to run.
I wanted to make sure that it was doing something reasonable, so I had it dump the top 10 similar tags for each tag as it was running, and I found this:
19/1510 artist-tag:8bit computing 1510 similarities
Most similar: ["<artist-tag:8bit, 1.000>", "<artist-tag:chiptune, 0.795>", "<artist-tag:chiptunes, 0.726>", "<artist-tag:bitpop, 0.584>", "<artist-tag:chipmusic, 0.341>", "<artist-tag:blipblop, 0.258>", "<artist-tag:nintendocore, 0.194>", "<artist-tag:nintendo, 0.189>", "<artist-tag:c64, 0.174>", "<artist-tag:vgm, 0.166>"]
Can you tell what caught my eye? Nintendocore? Awesome!