Friday Jan 08, 2010

Leaving Sun

Seems like I've been seeing a lot of these posts lately, and now I guess I get to write my own. Today is my last day at Sun. I'm moving on to a new opportunity. It's not really anything to do with the pending acquisition, it's more that it's a good time for me to make the move.

Before I go, though, I wanted to say thanks to some people who made my ten years (and two months!) at Sun really great. Bill Woods (now at ITA Software) and Bob Sproull took a chance on a kid (well, I was 31; grad school takes a looooong time) who'd never had a real job and I really do appreciate it. I learned a lot working here in Sun Labs. I've had a great time working with Jeff Alexander, who started at Sun Labs to work on the Advanced Search Technologies project and the Minion search engine, and went on to do a great job of designing and implementing the Data Store for the AURA Project. Thanks to Bill, Geoff, Karl, Meg, and Miriam (and then Karl again) for managing (or at least, attempting to manage) me over the years. Thanks to Josh Simons for being my SEED mentor and for lots of great discussions over the years. Thanks, too, to Jim Waldo, who was always willing to give me the straight dope. This one's for you, Jim!

I don't think any of them are left any more, but I owe the Portal Server team a real debt of gratitude. They were also willing to take a chance: on a search engine written by some weird labs guy. Thanks Mark, Ashok and everyone else.

All in all, it's hard to overstate what a great group of people Sun Labs is, especially the Labs here in Burlington ("Large enough to be interesting, small enough to be fun"). I mean, you give a guy an earworm from 30 Rock and he actually makes Cheesy Blasters:

That's pretty amazing right there. Also, Cheesy Blasters are awesome. I'm going to continue blogging (perhaps less sporadically) over at the new blog, if you would like to continue to read about search stuff.

Tuesday Aug 11, 2009

Homebrew Hybrid Storage Pool

I had a bit of trouble with a slow iscsi connection to my downstairs Solaris box, and so I tried something crazy. I used a RAM disk for a ZIL. This is fine, as long as your machine never, ever (ever!) goes down. I made things slightly less crazy by replacing the RAM disk with a file vdev. This allows me to power down the machine, but the iscsi performance goes back to being pretty terrible.

The answer is to use an SSD as a ZIL vdev, so I talked the wife into letting me get a 16GB SATA II drive, which I just put into the machine. There was a bit of a tense moment (note: this is what we in the business call an understatement) when the file systems on my big pool didn't appear right away (me: ls /tank, machine: I got nothing, me: WTF!?), but they appeared eventually and all I had to do was svcadm clear a couple of services that depended on them. Note to self: make those services dependent on the ZFS file systems being available.

The SSD was all the way off on c11d0 for some reason, but ZFS was happy to replace my ZIL with the new vdev, so now I'm sitting here watching this:

root@blue:~# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h21m, 57.62% done, 0h15m to go

	NAME                     STATE     READ WRITE CKSUM
	tank                     ONLINE       0     0     0
	  raidz1                 ONLINE       0     0     0
	    c4d0                 ONLINE       0     0     0
	    c4d1                 ONLINE       0     0     0
	    c5d0                 ONLINE       0     0     0
	    c5d1                 ONLINE       0     0     0
	  replacing              ONLINE       0     0     0
	    /rpool/slog/current  ONLINE       0     0     0
	    c11d0                ONLINE       0     0     0

errors: No known data errors

Yes, I called my big pool tank. I'm a ZFS nerd, I guess. Tomorrow I'll plug the MacBook into the gigE and see how Time Machine does over iscsi. I'm hoping for big numbers.

Update: I'm glad that I didn't stay up until it finished:

root@blue:~# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: resilver completed after 2h37m with 0 errors on Wed Aug 12 01:27:18 2009

	tank        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    c4d0    ONLINE       0     0     0
	    c4d1    ONLINE       0     0     0
	    c5d0    ONLINE       0     0     0
	    c5d1    ONLINE       0     0     0
	  c11d0     ONLINE       0     0     0

errors: No known data errors

Saturday Jul 25, 2009

Dear ZFS and Time Slider teams: Will you marry me?

I'm sure my wife and your wives (or husbands) and children will understand.

You see, I was working on my home system this afternoon, writing code instead of enjoying the summer weather, when I hit the following:

stgreen@blue:~/Projects/silv/work$ hg verify
** unknown exception encountered, details follow
** report bug details to
** or
** Mercurial Distributed SCM (version 1.1.2)
** Extensions loaded: 
Traceback (most recent call last):
  File "/usr/bin/hg", line 20, in ?
  File "/usr/lib/python2.4/vendor-packages/mercurial/", line 379, in parseindex
    index, nodemap, cache = parsers.parse_index(data, inline)
ValueError: corrupt index file

This made me, shall we say, unhappy. This made me realize that I hadn't done a push to the "main" hg repository since I started the new project that I had just recently gotten working, so I was looking at losing more than a thousand lines of code.

But there you were, ZFS and Time Slider, ready to pick me up and get me back in the game:

You are the wind beneath my wings:

stgreen@blue:~/Desktop/work$ hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
3209 files, 266 changesets, 3862 total revisions

I only lost 15 minutes of work, and those 15 minutes don't even matter, because I spent them screwing around in a virtualized Ubuntu getting GWT hosted mode working. I only lost two small changes.

I have no idea what caused this problem. ZFS isn't reporting any errors on the drive, but the hg and virtualbox forums suggest that the vboxsf filesystem might be corrupting files. So, note to self: push to the main hg repository before cloning to the virtualized Ubuntu!

And I'm 100% serious about that marriage thing.

Thursday Apr 09, 2009

Compare and contrast: The Tragically Hip edition

A new album from The Tragically Hip came out this week. I'm listening to it now. I bought it from Amazon, and I just went to Amazon to see if the fine folks there could recommend any related artists for me. Today, after I searched for the band, Amazon offered a link to the "Tragically Hip Store", where "a fan can be a fan." I'm a fan, and I sure do want a place to be a fan, so let's check it out!

Here's The Tragically Hip Store at Amazon:

Now, don't get me wrong, I like Amazon (it's where I bought the new album, after all), but I'm very surprised to see that there are no recommendations on the page. There are links to the albums, but wouldn't a fan who's looking for a place to be a fan already have the albums and possibly be looking for other music?

Over there is The Tragically Hip on the Explaura, where it's all about the recommendations, and a fan can find other things to be a fan of.

Monday Apr 06, 2009

The Aussie Long Tail

As a Canadian, I grew up in a culture that was dominated to a pretty large extent by American culture (e.g., watch any good Canadian movies lately?) The one area where this was not as much of an issue was music, mostly due to the CanCon regulations that required radio stations to air a certain percentage of Canadian music. This could lead to some bad results (hearing enough Celine Dion on the radio, yet? No? Well, here's some more!), but also to some good results.

I did my postdoc work at Macquarie University in Sydney. I was already a fan of a couple of Aussie bands (most notably Midnight Oil) but I was pleasantly surprised to discover a lot of other great Aussie bands (e.g., Regurgitator and Spiderbait — you gotta love a band that can put together a two minute song that leaves you wanting more) mostly by listening to Triple M and Triple J.

A bug report for the Explaura dropped me right back into this Aussie music. I must admit that I had forgotten how much I liked some of this music. To a first approximation you would never hear about these bands in the States, so that's one in the win column for the Explaura.

Friday Apr 03, 2009

The AURA Data Store: An Overview

Paul said some pretty nice things about The Explaura today. He mentioned that I hadn't talked much about the backend system architecture, which we call the AURA Data Store (or just "the data store"). This is something that I will be talking about in future posts, but since he mentioned it, I wanted to give you an idea of what the AURA data store is like.

Here's a snapshot of our data store browsing and visualization tool (click for the full sized version):

You can see that the data store has almost 2 million items in it. These are the actual artists along with items for the photos, videos, albums and so on. Along with that are about 1.2 million attentions. An attention is a triple consisting of a user, an item, and an attention type. So, for example, if we know that a particular user listened to a particular artist then we would add an attention for that user with the item corresponding to that artist, and an attention type of LISTEN.
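As a rough illustration of the attention triple described above, here is a minimal Python sketch. Only the LISTEN type comes from this post; the other attention types, the key formats, and the helper function are made up for the example and are not the AURA API.

```python
from collections import Counter
from enum import Enum
from typing import NamedTuple


class AttentionType(Enum):
    # LISTEN is mentioned in the post; VIEW and TAG are hypothetical.
    LISTEN = "LISTEN"
    VIEW = "VIEW"
    TAG = "TAG"


class Attention(NamedTuple):
    user: str                      # key of the user who paid the attention
    item: str                      # key of the item (artist, album, photo, ...)
    attention_type: AttentionType  # what kind of attention it was


def attention_counts(attentions, item):
    """Count the attentions of each type recorded for a single item."""
    return Counter(a.attention_type for a in attentions if a.item == item)


# Hypothetical keys, for illustration only.
attentions = [
    Attention("user:steve", "artist:the-tragically-hip", AttentionType.LISTEN),
    Attention("user:steve", "artist:sloan", AttentionType.LISTEN),
    Attention("user:paul", "artist:the-tragically-hip", AttentionType.LISTEN),
    Attention("user:paul", "artist:the-tragically-hip", AttentionType.VIEW),
]

counts = attention_counts(attentions, "artist:the-tragically-hip")
print(counts[AttentionType.LISTEN], counts[AttentionType.VIEW])
```

Keeping the triple this small is what makes it cheap to store a lot of them next to the items they refer to.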

The layout of the visualization tool gives you a pretty good idea of the layout of the data store. At the top level there are data store heads, which is what clients of the data store will use for adding and fetching items and attention and for doing searches in the data store.

The data store heads talk to the second level of the data store, which is composed of a number of partition clusters. Each of the partition clusters manages a number of replicants (currently that number is 1, but stay tuned throughout the summer), which is meant to provide for scalability and redundancy in the data store.

Each replicant is composed of a BDB key-value store and a Minion search index. BDB is responsible for storing and retrieving the item and attention data and Minion is responsible for almost all of the searching and all of the similarity computations.

The addition of the Minion search index means that the AURA data store is a little different from typical key value stores because it makes it possible to ask the kinds of queries that a search engine can handle (e.g., find the artists that contain the word progressive in their Wikipedia bios, find artists whose social tags are similar to a particular artist) as well as the kind of queries that a traditional key-value store can handle (e.g., fetch me the item with this key.)

We refer to this data store as a 16-way store, because there are 16 partition clusters and 16 replicants. Each of the boxes that you see in the visualization represents a JVM running a chunk of the data store code. We use Jini in combination with our configuration toolkit to do service registration and discovery. It's all running on the Project Caroline infrastructure, and we'll be migrating it to the new Sun Cloud offering as soon as we can.
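To make the layering concrete, here is a toy Python sketch of a head in front of 16 partitions. It is not the AURA code: the class names, the hash-based routing rule, and the word-count "search" are all assumptions for illustration. The real store uses BDB and Minion inside each replicant and Jini for discovery, and this post doesn't say how keys are assigned to partitions.

```python
import heapq
import zlib

NUM_PARTITIONS = 16  # the "16-way" store from the post


class Partition:
    """Stand-in for one partition cluster with a single replicant."""

    def __init__(self):
        self.items = {}  # key -> item data (the key-value role)

    def put(self, key, item):
        self.items[key] = item

    def get(self, key):
        return self.items.get(key)

    def search(self, word):
        # Toy "search index": the score is how often the word occurs in the bio.
        hits = []
        for key, item in self.items.items():
            score = item["bio"].lower().split().count(word)
            if score:
                hits.append((score, key))
        return hits


class DataStoreHead:
    """What clients talk to; routes each request to the partitions."""

    def __init__(self, n=NUM_PARTITIONS):
        self.partitions = [Partition() for _ in range(n)]

    def _partition_for(self, key):
        # Assumed routing rule: a stable hash of the key, modulo the partition count.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def put(self, key, item):
        self._partition_for(key).put(key, item)

    def get(self, key):
        # Key-value style query: exactly one partition is involved.
        return self._partition_for(key).get(key)

    def search(self, word, n=10):
        # Search style query: fan out to every partition, merge into a global top n.
        all_hits = (hit for p in self.partitions for hit in p.search(word))
        return heapq.nlargest(n, all_hits)


head = DataStoreHead()
head.put("artist:rush", {"bio": "Rush is a progressive rock band progressive to the core"})
head.put("artist:sloan", {"bio": "Sloan is a rock band from Halifax"})
head.put("artist:yes", {"bio": "Yes is a progressive rock band"})

print(head.get("artist:sloan")["bio"])
print(head.search("progressive"))
```

The point of the sketch is the difference between the two query shapes: a fetch by key touches one partition, while a search has to ask all of them and merge the results.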

In our load tests, a data store like this one was capable of serving about 14,000 concurrent users doing typical recommendation tasks (looking at items, finding similar items, adding attention data) with sub-500ms response time at the client.

Wednesday Apr 01, 2009

The Music Explaura

Today I'm (finally!) announcing the first offering from the AURA Project: The Music Explaura. The Explaura is a way for you to explore musical artists and find new ones that you might like, based on the words that people have used to describe the artists. We call the set of words used to describe an artist the textual aura for that artist.

You start out by searching for an artist that you know, say one of your favorite bands. The data store contains information for about 30,000 artists. Over on the left, you can see what the Explaura knows about one of my favorite bands, The Tragically Hip.

It's a bit hard to see (embiggened version), but this gives you some idea of the information that the Explaura collects for each band. There's a tag cloud (more on that in a bit), the artist's bio from Wikipedia, videos from YouTube, photos from Flickr, album covers from Amazon and upcoming events from Upcoming. You can click on the play icon to listen to that artist's radio at

On the left of the artist page, you see the list of similar artists generated by the AURA recommenders. This list of artists is generated using a technique that's quite a bit different than you're probably used to. Rather than relying on the wisdom of the crowds via a technique like collaborative filtering, the AURA system computes the similarity between artists by computing the similarity between their textual auras.

The tag cloud that the Explaura displays for an artist is a portion of the textual aura that the system uses to compute the similarity between two artists (in this case, it's social tags collected from This cloud is a little different than the tag clouds that you typically see: here the size of a tag is not proportional to its frequency, but rather to its importance for this artist. Here's a better view of the cloud for The Tragically Hip:

As you can see, The Hip are a Canadian band that plays energetic, indie rock. How do we compute the importance of a particular tag in the cloud? Using our good friend from the information retrieval world, TFIDF. The idea is that a tag is important for an artist if it is applied frequently to that artist and infrequently to other artists (i.e., it does a good job of distinguishing this artist from others.)
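Here is a small Python sketch of that idea. The tag counts are invented, and the weighting (raw count times log of N over document frequency) is just one common TFIDF variant; this post doesn't say which variant the Explaura uses.

```python
import math


def tag_importance(tag_counts):
    """tag_counts maps artist -> {tag: times applied}; returns artist -> {tag: weight}."""
    n_artists = len(tag_counts)
    # Document frequency: the number of artists each tag has been applied to.
    df = {}
    for tags in tag_counts.values():
        for tag in tags:
            df[tag] = df.get(tag, 0) + 1
    weights = {}
    for artist, tags in tag_counts.items():
        weights[artist] = {
            tag: count * math.log(n_artists / df[tag]) for tag, count in tags.items()
        }
    return weights


# Made-up counts, for illustration only.
tag_counts = {
    "The Tragically Hip": {"rock": 100, "canadian": 80, "indie": 30},
    "Sloan": {"rock": 60, "canadian": 50, "power pop": 40},
    "Midnight Oil": {"rock": 90, "australian": 70},
    "Regurgitator": {"rock": 20, "australian": 30, "indie": 10},
}

hip = tag_importance(tag_counts)["The Tragically Hip"]
for tag in sorted(hip, key=hip.get, reverse=True):
    print(tag, round(hip[tag], 2))
```

In this toy data, rock is the most frequently applied tag for The Hip, but it ends up with a weight of zero because every artist has it, so it does nothing to distinguish one artist from another. The canadian tag comes out on top.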

Because we're using the textual aura to compute the similarity, it's easy to generate a set of words that explain the similarity between two artists. If you click on the "Why?" link next to one of the recommended artists, you'll be shown the overlap tag cloud for these artists. Here's the overlap cloud for The Tragically Hip and Sloan:

In this tag cloud, the size of a tag is related to how much that particular tag contributed to the similarity between the artists. So the fact that both The Hip and Sloan are Canadian played a pretty big part in their similarity, along with the fact that they're both literate indie rock outfits.
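One way to get per-tag contributions like this is to split a cosine similarity into its terms. The sketch below assumes cosine similarity over tag-weight vectors and uses invented weights; this post doesn't spell out the exact similarity measure, so treat it as an illustration rather than the AURA implementation.

```python
import math


def overlap_cloud(a, b):
    """Return (similarity, {tag: contribution}) for two tag-weight vectors."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    norm = norm_a * norm_b
    if norm == 0:
        return 0.0, {}
    # Only tags that both artists share contribute to the dot product.
    contributions = {tag: a[tag] * b[tag] / norm for tag in a.keys() & b.keys()}
    return sum(contributions.values()), contributions


# Made-up tag weights, for illustration only.
hip = {"canadian": 3.0, "indie": 2.0, "literate": 2.0, "energetic": 1.0}
sloan = {"canadian": 3.0, "indie": 2.0, "literate": 1.0, "power pop": 3.0}

similarity, cloud = overlap_cloud(hip, sloan)
for tag in sorted(cloud, key=cloud.get, reverse=True):
    print(tag, round(cloud[tag], 3))
print("similarity", round(similarity, 3))
```

The contributions add up to the similarity itself, so sizing each tag by its contribution shows exactly where the similarity came from.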

One more thing about the artist's tag cloud: if you click on one of the tags in this cloud, you'll be taken to a page for that tag. This page will look a lot like the artist page: it shows information about the tag itself including the artists for whom the tag is important. The tag cloud that is shown on the tag page is built from the tags that are most similar to the tag that you clicked on. Here's the tag page for classic rock:

But what if I want things that are like The Hip, but I don't just want Canadian music? That's where steerability comes into play. Each artist has a little steering wheel icon next to it. When you click on that icon you're taken to the steering interface:

The steering interface starts out with a tag cloud that has the most important tags for the artist. On the left, you see the artists that the AURA system recommended based on their similarity to this steering tag cloud. On the right, you can see a selection of tags from the artist. Clicking on one will add it to the steering cloud. Note that as you add tags, the recommended artists are updated in real time. You're not restricted to the tags that have been applied to that particular artist, either. You can search for tags to add using the handy search box.

The really cool thing here is that the tag cloud is interactive: you can drag a tag to increase or decrease its importance. If you drag upwards on a tag, the tag gets larger and more important. If you drag downwards on a tag, the tag gets smaller and less important. If you drag a tag small enough, it goes negative and is shown with a strike-through. When a tag is negative, no artists with that tag will be recommended.

If we make the canadian tag smaller, then it's less important and we get bands for which canadian is less important. We can add the literate tag (because we like literate music!) and make it bigger, which makes it more important. Again, the recommendations are updated for each change in the cloud, so you get direct feedback as to how your changes are affecting the recommended artists. Here's my new steering page:

And there's a band that I've never seen before: Classic Case. Now I can click on the play button and see if I like their music.

If you don't want a tag in the steering cloud, you can right-click on it and select "Delete" from the menu. If you click on "Sticky" in that menu, then any recommended artists must have that tag in their aura. You can click on "Negative" in this menu to quickly make a tag negative.
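The steering behavior described above (weighted tags, negative tags, sticky tags) can be sketched in a few lines of Python. The scoring here is a plain weighted sum over made-up bands and weights, which is an assumption; the real recommender computes similarity in the data store.

```python
def steer(cloud, artists, sticky=frozenset(), n=5):
    """Rank artists against a steering cloud.

    cloud   maps tag -> weight; a negative weight excludes artists with that tag
    artists maps artist -> {tag: importance}
    sticky  tags that every recommended artist must have
    """
    banned = {tag for tag, weight in cloud.items() if weight < 0}
    scored = []
    for artist, tags in artists.items():
        tag_set = set(tags)
        if banned & tag_set:
            continue  # negative tag: never recommend this artist
        if not set(sticky) <= tag_set:
            continue  # a sticky tag is missing from this artist's aura
        score = sum(w * tags.get(tag, 0.0) for tag, w in cloud.items() if w > 0)
        if score > 0:
            scored.append((score, artist))
    scored.sort(reverse=True)
    return [artist for _, artist in scored[:n]]


# Hypothetical bands and weights, for illustration only.
artists = {
    "Band A": {"canadian": 3.0, "indie": 2.0, "literate": 1.0},
    "Band B": {"canadian": 1.0, "indie": 2.0, "literate": 3.0},
    "Band C": {"indie": 2.0, "literate": 3.0},
    "Band D": {"canadian": 2.0, "grunge": 3.0},
}

# Start with canadian as the big tag.
print(steer({"canadian": 3.0, "indie": 1.0}, artists))
# Drag canadian down and literate up.
print(steer({"canadian": 0.5, "indie": 1.0, "literate": 3.0}, artists))
# Make grunge negative and canadian sticky.
print(steer({"indie": 1.0, "literate": 3.0, "grunge": -1.0}, artists, sticky={"canadian"}))
```

Because the whole ranking is recomputed from the cloud each time, every drag or menu click can be reflected in the list right away.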

It's probably a lot easier to see this in the demo video that Paul made last year:

As you can see, there's been lots of updates since the video was made, but there's still lots more to be done (for example, it's very annoying that canada and canadian are considered to be different tags), but we're pretty proud of how good the recommendations are turning out to be. I've discovered several new bands that I like using the Explaura.

There's a link for email feedback at the bottom of the Explaura interface, so let us know what you think. I'll be posting more about the Explaura and AURA in the future, so stay tuned.

Tuesday Feb 24, 2009

My new favorite kind of music

I was computing tag-tag similarities for Frank, using a 50K artist crawl that we did last month with an Aura instance running on EC2.

I wrote a quick program to pull the document vectors for the tags (the tags here are "documents" and the artists to whom the tags have been applied are the "words" in those documents.)

Given the vectors, it was easy to compute the complete similarity table for the tags and output the similarities for each tag in decreasing similarity order.

For 1500 tags pulled from an index of 1.8 million documents, this takes about 55 seconds to run.
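The approach can be sketched in Python as follows. The data is invented, and the vectors here hold raw counts with cosine similarity, which is an assumption; the actual program pulled its vectors from the Minion index, and this post doesn't say how they were weighted.

```python
import math


def tag_vectors(artist_tags):
    """Invert artist -> {tag: count} into tag -> {artist: count}.

    Each tag becomes a "document" whose "words" are the artists it was applied to.
    """
    vectors = {}
    for artist, tags in artist_tags.items():
        for tag, count in tags.items():
            vectors.setdefault(tag, {})[artist] = count
    return vectors


def cosine(a, b):
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_tags(vectors, tag, n=10):
    """Return the n tags most similar to tag, in decreasing similarity order."""
    sims = [(cosine(vectors[tag], vec), other) for other, vec in vectors.items()]
    sims.sort(reverse=True)
    return [(other, round(sim, 3)) for sim, other in sims[:n]]


# Made-up artists and counts, for illustration only.
artist_tags = {
    "artist-1": {"8bit": 5, "chiptune": 4},
    "artist-2": {"8bit": 3, "chiptune": 3, "nintendocore": 1},
    "artist-3": {"nintendocore": 4, "metalcore": 5},
    "artist-4": {"metalcore": 3},
}

vectors = tag_vectors(artist_tags)
top = similar_tags(vectors, "8bit", n=3)
print(top)
```

A tag is always most similar to itself, which is why the first entry in each list has a similarity of 1.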

I wanted to make sure that it was doing something reasonable, so I had it dump the top 10 similar tags for each tag as it was running, and I found this:

19/1510 artist-tag:8bit computing 1510 similarities
Most similar: ["<artist-tag:8bit, 1.000>", "<artist-tag:chiptune, 0.795>", "<artist-tag:chiptunes, 0.726>", "<artist-tag:bitpop, 0.584>", "<artist-tag:chipmusic, 0.341>", "<artist-tag:blipblop, 0.258>", "<artist-tag:nintendocore, 0.194>", "<artist-tag:nintendo, 0.189>", "<artist-tag:c64, 0.174>", "<artist-tag:vgm, 0.166>"]

Can you tell what caught my eye? Nintendocore? Awesome!

Thursday Sep 18, 2008

New Minion documentation

I've just uploaded some new documentation for Minion. It covers basic and advanced Minion configuration.

Configuration in Minion is done using an offshoot of the configuration system that was developed for the Sphinx-4 speech recognition system. We started out with the original version of the configuration package that Paul developed while he was working on Sphinx-4. Along the way, we synced our local changes to that version with the new annotation-based version that the Sphinx-4 folks have been using.

We've added a few basic capabilities to the configuration system: we added the configuration of lists of strings, the ability to configure a property that holds the value of a Java enum and a few other things. We also added one big capability to the config system, which is the ability to take advantage of a Jini lookup service in configurations.

With this addition you can define a configuration that can import and export components to a lookup service, so that it's easy to set up and configure a distributed system. We're making a lot of use of this capability in the AURA Project and I expect that it'll eventually get used for a distributed version of Minion.

You can find the documentation linked from the Minion project home page.

Friday May 09, 2008

Minion and Lucene: Performance

Otis asked about Minion performance and how it compares to Lucene.

We did some performance comparisons a while ago, and it probably deserves a full post. Part of the problem with performance comparisons is that it's hard to get the same conditions for both systems.

I think it would be a good idea for all of the open source engines to get together, find a nice open document collection (the Apache mailing list archives and their associated searches?) and build a nice set of regression tests and some pooled relevance sets so that we can test retrieval performance without having to rely on the TREC data.

At any rate, we used Lucene 2.0 and a build of Minion to build an indexer for email messages. We used these to build indexes of 1.1 million email messages. We then ran a bunch of queries against the index. The messages and queries were taken from an archive of about 5 million messages that we provide internally. We ran the queries serially and in parallel up to 32 threads on a couple of Sun x64 and Niagara boxes.

The practical upshot was that the Minion indexer was substantially faster than the Lucene indexer (at the cost of more memory used) and the queries were comparable in similar conditions. Of course, indexing speed probably isn't that big a deal in most situations.

Our "standard" config Lucene did a lot better than our default Minion config, which is totally understandable, since Minion was doing a lot more work (e.g., doing query term expansion, processing the case sensitive postings and searching all the fields). When we feed Lucene our expanded terms, and both engines are doing straightforward boolean AND, the difference in query times is about 10% in Lucene's favor. This sounds like a lot, but we're talking about 35ms vs 39ms.

I have no doubt that we could go in and massage the query evaluator (or come up with a more efficient postings format, or ...) to get that 10% back and I'm sure that the Lucene folks could go in and get another 10%, lather, rinse, repeat.

With respect to global IDF, we store a separate dictionary containing global term statistics, so when we're doing weighting, we're using the statistics for the collection as a whole. We actually try to be fairly careful about the term statistics, especially for things like document categorization.
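To show why collection-wide statistics matter, here is a small Python sketch that merges per-partition term statistics and compares local and global IDF. The merge function, the log(N/df) formula, and the numbers are all assumptions for illustration; this post doesn't describe how Minion builds its global dictionary.

```python
import math
from collections import Counter


def merge_term_stats(partitions):
    """Combine per-partition (doc count, {term: document frequency}) pairs."""
    total_docs = 0
    global_df = Counter()
    for n_docs, df in partitions:
        total_docs += n_docs
        global_df.update(df)  # adds the per-partition frequencies together
    return total_docs, global_df


def idf(n_docs, df, term):
    """Plain log(N / df) inverse document frequency; 0.0 for unseen terms."""
    return math.log(n_docs / df[term]) if df.get(term) else 0.0


# Two made-up partitions where "zfs" is common in one and rare in the other.
partitions = [
    (1000, {"zfs": 500, "minion": 2}),
    (1000, {"zfs": 5, "minion": 3}),
]

total_docs, global_df = merge_term_stats(partitions)
local_idfs = [idf(n_docs, df, "zfs") for n_docs, df in partitions]
global_idf = idf(total_docs, global_df, "zfs")
print([round(x, 3) for x in local_idfs], round(global_idf, 3))
```

With only local statistics, the same term would get very different weights depending on which piece of the index a document landed in. Using the merged statistics gives every document the same weight for a term.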

As for distributed search, Minion is like Lucene in that it's a simple indexing and retrieval library. We're working on distributing the search as part of Aura, but I can't really comment on the performance there yet as it's still pretty early days. I will say that it was pretty pleasingly fast on a 7 million document index distributed into 16 pieces — fast enough that I thought it was broken when we ran the first queries. I'll let you know how it goes when we have 100 million or a billion docs :-)

Monday Apr 28, 2008

Not exactly a freakomendation

Paul's been posting freakomendations, which are "unusual recommendations" (that's a bit of an understatement given your examples, Paul!)

John Scalzi posted that Amazon's recommender suggested that he might like The Last Colony, a book he wrote!

This is not necessarily a freakomendation, because it seems pretty likely that John would read books that the people who read The Last Colony had read, and by that measure the Amazon recommender worked pretty well. But, as you can see from the comments for that posting, doing things like this calls the quality of all of the recommendations into question. This is probably unfair to Amazon's recommender, but that's what you're (which is to say "we're") up against when building recommender systems.


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

