Monday Jul 11, 2011

Importance of lively updates

Google+ is giving us an example of how important it is for your interface to immediately reflect changes made by users. When it doesn't, it fails to inspire confidence. My first experience with Google+ was adding a new person to a circle. Her posts showed up immediately (yay!), but in the list of people "In your circle", she was nowhere to be found. When I clicked on her name in one of her posts, I got "Add to circles" as an option. Adding her a second time caused her to show up in the "In your circle" list and changed the behavior of clicking her name: it now brought me to her profile instead of showing me the "Add to circles" list. This seems like a case of different parts of the page pulling data from different parts of Google. The piece that shows me posts knew she was in a circle. The piece that tracks who is in my circles hadn't quite noticed that change yet.

In another case, I clicked over to the profile of a friend who had just signed up. From what I can tell, he entered his gender into his profile but not much else. Fair enough. I know this because his profile says "Gender: Male". Here's what I saw on the screen though.

Google+ says "Michael has not filled out their profile yet.". First of all, Michael clearly has filled in at least something in his profile - you're showing it right there, Google. Second, the one piece of his profile he did submit was his gender. How about "Michael has not filled out his profile yet."? Once again, this looks like a disconnect between different parts of the UI, where some of it has gotten an update (the gender filled in on the profile), and some of it hasn't (the text right above saying there is no profile, not reflecting the one bit of known data).

I'm left feeling like Google doesn't totally have its act together, even though it had a cold open with an invite-only system. On a mostly unrelated note, I've noticed a somewhat similar behavior in Google Calendar - if you import a bunch of ICS data, you can see it if you subscribe to the ICS calendar, but it doesn't show up in the UI for a while. Not all the time, just sometimes. Perhaps this is just a symptom of running a large distributed system. On the other hand, maybe we can do better?

I guess what I'm trying to say is... just what do I call "people in my circles"?

Monday Mar 14, 2011

Blog moving

This blog will be moving. I don't know what the new URL will be yet, but hopefully this one will get redirected. Maybe this move will encourage me to start blogging again? Who can say?

Wednesday Jun 02, 2010

Confidence in Recommendation

Generating confidence in your recommendations is key to getting people to use them. When using the Music Explaura, you're more likely to trust an artist recommendation we give if it is an artist you've heard of. Then again, if it is an artist you've heard of, it hasn't actually expanded your musical horizons. The trick is to include enough popular, obvious recommendations in your results that you can see we're on the right track, then expand into artists you may never have heard of that might match your tastes even better than the ones you already know.

I heard an interesting talk during lunch today from the Oracle folks who do retail science. They create forecasts about how products will sell given different factors. To oversimplify: "you'll sell N units of this product if you discount it by 25% starting on this date". One of the non-technical problems they face is building confidence in their forecasts. Sometimes a prediction (i.e. a recommendation) makes sense to the user - have a sale on flowers before Mother's Day and you'll sell more flowers. But sometimes the model predicts something that doesn't have an obvious explanation, even though the historical data that has been collected indicates that it should be true. The user has to make a judgment call. This, I think, is very similar to the kind of trust a user builds in a recommender system. If the first thing the retail system predicts doesn't make any sense to me (even though it is based on solid facts), my first impression will be that the system isn't going to work. What I really want is for the system to start out by demonstrating that it can predict the "obvious" things (like flowers for Mother's Day or, in our music recommender's case, that I might like Coldplay). Once I'm confident that it can do that, I'll feel more confident in it when it predicts something less obvious.

On a somewhat unrelated note, one of the speakers observed that a product that has only a few items on the shelf isn't going to sell well. Picture a grocery store where there are only a few boxes of couscous left. Maybe you'll pass on those and get something else, thinking "maybe those are the rejects that nobody else wanted". It's an interesting look at human behavior that when you turn that into an online store, you can motivate people to buy those boxes by saying "Only 3 boxes left".

Tuesday May 18, 2010

Pattern for Defining Fields in Minion

While writing various applications that use Minion, I've settled into the habit of using an Enum to declare (and define) the fields I'm using in the search engine. It is a convenient and concise way to combine the declaration of the fields' attributes with a mechanism for always getting the field name right. It is pretty simple and works well. I create an enum with one value per field, using the syntax that lets you specify arguments to a constructor to set each field's attributes. I can then use an accessor method of the enum to get a particular value's field name whenever I need to refer to the field. As a bonus, I can throw in a defineFields method that simply iterates over all the fields in the enum, defining each one in the search engine with the attributes specified. See the example below.

/**
 * An enumeration of the fields available in the index. These values should
 * always be used when referencing fields.
 */
public enum IndexFields {
    /** Email address is saved and searchable */
    EMAIL("email", FieldInfo.Type.STRING,
          EnumSet.of(FieldInfo.Attribute.INDEXED,
                     FieldInfo.Attribute.TOKENIZED,
                     FieldInfo.Attribute.SAVED)),

    /** ID is saved only */
    PERSON_ID("person-id", FieldInfo.Type.STRING,
          EnumSet.of(FieldInfo.Attribute.SAVED)),

    /** Tags are saved and searchable, but not broken into tokens.
        They are vectored for use in document similarity */
    TAG("tag", FieldInfo.Type.STRING,
          EnumSet.of(FieldInfo.Attribute.INDEXED,
                     FieldInfo.Attribute.VECTORED,
                     FieldInfo.Attribute.SAVED)),

    /** Bio is the "body" of the document: indexed, tokenized, and vectored */
    BIO("bio", FieldInfo.Type.NONE,
          EnumSet.of(FieldInfo.Attribute.INDEXED,
                     FieldInfo.Attribute.TOKENIZED,
                     FieldInfo.Attribute.VECTORED));

    /*
     * Each enumerated value will have these three fields
     */
    private final String fieldName;
    private final EnumSet<FieldInfo.Attribute> attrs;
    private final FieldInfo.Type type;

    /**
     * The constructor to create the instances defined above
     */
    IndexFields(String fieldName, FieldInfo.Type type,
                EnumSet<FieldInfo.Attribute> attrs) {
        this.fieldName = fieldName;
        this.attrs = attrs;
        this.type = type;
    }

    /*
     * Public methods to get the field properties:
     */
    public String getFieldName() {
        return fieldName;
    }

    public EnumSet<FieldInfo.Attribute> getAttributes() {
        return attrs;
    }

    public FieldInfo.Type getType() {
        return type;
    }

    @Override
    public String toString() {
        return fieldName;
    }

    /**
     * Defines the fields enumerated in this enum in the provided
     * search engine
     */
    public static void defineFields(SearchEngine engine)
            throws SearchEngineException {
        for (IndexFields i : IndexFields.values()) {
            engine.defineField(new FieldInfo(
                    i.getFieldName(), i.getAttributes(), i.getType()));
        }
    }
}

Wednesday Jan 27, 2010

Did you mean...?

Just an example of how search engines use statistics to help them "understand" words. Until today, iPad wasn't a word (or at least, it wasn't a name). If you go to Google (for now at least) and search for ipad, you see something like this:

iPad appears very infrequently but is similar to a word that appears very frequently - "iPod". iPad appears infrequently enough, and the discrepancy between the two is large enough, that Google assumes I probably made a mistake typing iPad. Google's actual algorithm for spelling corrections is more complicated than this, but this is the basic idea behind spelling correction in search engines. As iPad starts to show up all over the web, Google will stop making that suggestion, since iPad will become more plausible (statistically) as a word. Or maybe somebody at Google will just add iPad to an exception list so it stops making the suggestion.
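To make the statistical intuition concrete, here's a toy sketch in Java (nothing like Google's actual pipeline): suggest a nearby term only when it is dramatically more frequent than the term the user typed. The term frequencies and the 100x threshold are invented for illustration.

```java
import java.util.Map;

/**
 * A toy "did you mean" check: suggest an alternative only when the query
 * term is rare and a term within edit distance 1 is far more frequent.
 */
public class DidYouMean {
    /** Classic Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }

    /**
     * Suggest a correction when some term one edit away is at least 100x
     * more frequent than the query term; otherwise return the query as-is.
     */
    static String suggest(String query, Map<String, Long> termFreqs) {
        long qFreq = termFreqs.getOrDefault(query, 0L);
        String best = query;
        long bestFreq = qFreq;
        for (Map.Entry<String, Long> e : termFreqs.entrySet()) {
            if (editDistance(query, e.getKey()) == 1
                    && e.getValue() > bestFreq
                    && e.getValue() >= 100 * Math.max(qFreq, 1)) {
                best = e.getKey();
                bestFreq = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up counts: "ipod" is everywhere, "ipad" barely exists yet
        Map<String, Long> freqs = Map.of("ipod", 5_000_000L, "ipad", 120L);
        System.out.println(suggest("ipad", freqs)); // rare term near a common one
        System.out.println(suggest("ipod", freqs)); // already common: unchanged
    }
}
```

As "ipad" accumulates occurrences, the 100x gap closes and the suggestion goes away, which is exactly the behavior described above.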

As a side note (no pun intended), the sponsored ads on the side of the page are all for the iPAQ, as various vendors have put in bids to get listed for ipad as a misspelling of ipaq. In this case, I'm fairly sure these are simply companies that asked for their ads to be shown for the word ipad (as well as ipaq).

Friday Jan 22, 2010

Slow Day

Our sysadmin Gary was cleaning out one of our hardware labs and put the old Mac IIfx from Bill Woods' old office on the discard pile. I wasn't in the office, but Seth couldn't bear to see it go and of course rightly assumed that I wouldn't want to see it go either. So Seth put the machine in my office. We couldn't just let it sit there, so we brought it into the game lab, experimented with various cables to get it attached to a monitor, and eventually, 20 years after it was made, booted the machine. Now if only we had a pair of PhoneNet connectors, we could network it to the old Powerbook 520c that has an ethernet transceiver and get the IIfx online!

Tuesday Nov 24, 2009

Handling Transactions in BDB-JE 4

The Replicants (data storage nodes) in Aura are composed of a search engine and a database. While we are closely tied to using our own search engine, the implementation of the database layer is entirely contained in a wrapper class called BerkeleyDataWrapper. This class does all the handling around repeating failed transactions, making sure transactions commit or abort, closing cursors, etc., in case of success or failure. In my first iteration of the code, this handling code was repeated in every method around every transaction. There were frequently small differences in how things were handled, so the code got repeated a lot with minor tweaks here and there. Obviously, this was a maintenance problem whenever I needed to change something in the error handling for all methods.

When the newest version of BDB-JE came out, adding support for High Availability, it further complicated the set of things that could go wrong and needed to be handled with each transaction. Now we could have a whole slew of errors related to networking and to other processes that the previously monolithic database was unaware of. Since the error handling code was about to get even more complicated, this seemed like a good time to refactor the way the BerkeleyDataWrapper was constructed.

In the latest version, I create command objects in each of the database methods and evaluate them all in a single method that handles all the failure scenarios. Now, this pattern isn't exactly earth-shattering, but I think it works fairly well here, so I thought I'd document it. After looking at the variables in my transaction execution, I constructed the following command object:

public abstract class DBCommand<R> {
    /**
     * Returns a configuration for the transaction that should be
     * used when running this command.
     *
     * @return null by default
     */
    public TransactionConfig getTransactionConfig() {
        return null;
    }

    /**
     * Runs the command within the given transaction. The transaction is
     * committed by the invoker method, so commit should not be called here.
     *
     * @param txn the transaction within which the code should be run
     * @return the result of the command
     */
    public abstract R run(Transaction txn) throws AuraException;

    /**
     * Gets a message that should be included upon failure that should
     * include the command name and any extra data (item key, etc.)
     *
     * @return the status message
     */
    public abstract String getDescription();
}
The DBCommand allows you to specify: the configuration for the transaction in which you'd like to run (e.g. defining what the read/write semantics are); the actual code to run within the configured transaction; and a human readable description that could appear in log messages. When creating a concrete instance of DBCommand, you simply override run and use the transaction passed in to perform your application logic. DBCommand is parameterized, allowing you to specify an optional return value from the run method. Here's a simple example of how this is used to put a set of Attention objects (associations between users and items) into the database. Nothing is returned from this, so a type of Void is given.

/**
 * Puts attention into the entry store. Attentions should never be
 * overwritten.
 *
 * @param pas the attention
 */
public void putAttention(final List<PersistentAttention> pas)
        throws AuraException {
    DBCommand<Void> cmd = new DBCommand<Void>() {
        @Override
        public Void run(Transaction txn) throws AuraException {
            for (PersistentAttention pa : pas) {
                allAttn.putNoOverwrite(txn, pa);
            }
            return null;
        }

        @Override
        public TransactionConfig getTransactionConfig() {
            //
            // We want putting attentions to be fast and not necessarily
            // durable or consistent across replicas.
            TransactionConfig txConf = new TransactionConfig();
            Durability dur = new Durability(
                    Durability.SyncPolicy.WRITE_NO_SYNC,
                    Durability.SyncPolicy.WRITE_NO_SYNC,
                    Durability.ReplicaAckPolicy.NONE);
            txConf.setDurability(dur);
            return txConf;
        }

        @Override
        public String getDescription() {
            return "putAttention(" + pas.size() + " attns)";
        }
    };
    invokeCommand(cmd);
}
In Aura, we may be collecting Attention objects at a good clip (e.g. "user X played song Y", or "user X starred article Z"), so we want to relax the transaction semantics as far as we can to keep the speed up while recording Attentions. If we lose some attention data due to a crash, it doesn't hurt that much, so this is a good trade-off. In order to modify the semantics, you can see that I've overridden getTransactionConfig to return a configuration with very little durability. A transaction with that config is created in the command invoker and passed into the run method. Since error handling is done externally, the run method only needs to have the basic application logic in it.

After creating the command object above, the last line of putAttention invokes the command, with the call going into the generic command invocation code.

/**
 * Invoke a database command, passing in a transaction to use for
 * operating on the database. The configuration for the transaction
 * is obtained from the command that is passed in.
 *
 * @param cmd the command object to invoke
 * @param commit whether we should commit after running this transaction
 * @param useCurrentTxn if the thread-local transaction should be used
 *        instead of a new transaction
 * @return the result of the run method
 * @throws AuraException in the event of a failure
 */
protected <R> R invokeCommand(DBCommand<R> cmd,
                              boolean commit,
                              boolean useCurrentTxn)
        throws AuraException {
    int numRetries = 0;
    int sleepTime = 0;
    while (numRetries < MAX_RETRIES) {
        Transaction txn = null;
        CurrentTransaction currTxn = null;
        try {
            //
            // If the write lock on quiesce isn't held, we can continue
            quiesce.readLock().lock();

            //
            // Fetch the transaction config and create a transaction or
            // get the CurrentTransaction
            TransactionConfig tconf = cmd.getTransactionConfig();
            if (useCurrentTxn) {
                currTxn = CurrentTransaction.getInstance(dbEnv);
                txn = currTxn.beginTransaction(tconf);
            } else {
                txn = dbEnv.beginTransaction(null, tconf);
            }

            //
            // Now run the command and commit if necessary.
            R result =;
            if (commit) {
                if (useCurrentTxn) {
                    currTxn.commitTransaction();
                } else {
                    txn.commit();
                }
            }
            return result;
        } catch (InsufficientReplicasException e) {
            //
            // In the event of a write operation that couldn't be sent
            // to a quorum of replicas, wait a bit and try again
            sleepTime = 2 * 1000;
        } catch (InsufficientAcksException e) {
            //
            // We didn't get confirmation from other replicas that the
            // write was accepted. This likely happens when a replica
            // is going down (and when we are requiring acks). For us,
            // this is okay.
        } catch (ReplicaWriteException e) {
            //
            // We tried to write to this node, but this node is a replica.
            throw new AuraReplicantWriteException(
                    "Cannot modify a replica: " + cmd.getDescription());
        } catch (ReplicaConsistencyException e) {
            //
            // We require a higher level of consistency than is currently
            // available on this replica. Wait a bit and try again.
            sleepTime = 1000;
        } catch (LockConflictException e) {
            try {
                if (useCurrentTxn) {
                    currTxn.abortTransaction();
                } else {
                    txn.abort();
                }
                log.finest("Deadlock detected in command "
                        + cmd.getDescription() + ": " + e.getMessage());
                numRetries++;
            } catch (DatabaseException ex) {
                throw new AuraException("Txn abort failed", ex);
            }
        } catch (Throwable t) {
            try {
                if (txn != null) {
                    if (useCurrentTxn) {
                        currTxn.abortTransaction();
                    } else {
                        txn.abort();
                    }
                }
            } catch (DatabaseException ex) {
                //
                // Not much that can be done at this point
            }
            throw new AuraException("Command failed: "
                    + cmd.getDescription(), t);
        } finally {
            quiesce.readLock().unlock();
        }

        //
        // Do we need to sleep before trying again?
        if (sleepTime > 0) {
            try {
                Thread.sleep(sleepTime);
            } catch (InterruptedException e) {
                // Nothing we can do about it.
            }
        }
    }
    throw new AuraException(String.format(
            "Command failed after %d retries: %s",
            numRetries, cmd.getDescription()));
}
Scrolling through the invokeCommand method, you'll see that we use a structure fairly close to the invocation code that is given as an example in the BDB documentation. It all runs in a loop in case we need to retry due to conflicts, and it handles, in different ways, a number of the exceptions that might be thrown. Near the top of the loop is the logic that gets the transaction configuration from the command and uses it either to create a new transaction, or to start the thread's CurrentTransaction (used for BDB methods that don't otherwise take a transaction as a parameter). If all goes well, the code is run and committed and the result is returned. If not, the error handling kicks in.

One of the nice things about running the transactions this way is that it is very easy to modify the semantics around every transaction that runs. For example, I realized that there may be occasions when we'd like to quiesce the database, temporarily pausing all activity. In the previous iteration of the code, I would have needed to modify every method to check whether it was okay to run, but since I had refactored everything, I could add a simple quiesce lock to the invoker. The quiesce instance variable in the code above is a ReentrantReadWriteLock, so if anybody came in and called the quiesce method, they'd request the write lock and no further database commands could run until the lock was released. I could see this approach also being very useful for keeping track of failure rates and for logging in general. Finally, if it turned out that I needed to add a new kind of command that had to further parameterize its execution environment, it would be easy to add another method to the DBCommand interface to get at its values. So all told, I think this works well, and maybe it'll be useful for somebody else too.
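Stripped of the database details, the quiesce pattern is small enough to show on its own. Here's a minimal stand-alone sketch (the class and method names are my own stand-ins, not the actual BerkeleyDataWrapper code): every command takes the read lock, so commands run concurrently, while quiesce takes the write lock, blocking until in-flight commands finish and holding off new ones.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/**
 * Minimal sketch of the quiesce pattern: commands share the read lock,
 * quiesce() holds the write lock to pause all command execution.
 */
public class QuiesceExample {
    private final ReentrantReadWriteLock quiesce = new ReentrantReadWriteLock();

    /** Stand-in for the real invokeCommand: run work under the read lock. */
    public <R> R invokeCommand(Supplier<R> cmd) {
        quiesce.readLock().lock();
        try {
            return cmd.get();
        } finally {
            quiesce.readLock().unlock();
        }
    }

    /** Pause all database activity until unquiesce() is called. */
    public void quiesce() {
        quiesce.writeLock().lock();
    }

    /** Resume normal operation. */
    public void unquiesce() {
        quiesce.writeLock().unlock();
    }
}
```

Because many readers can hold the lock at once, the lock adds almost no contention in the common case; the write lock is only ever taken when someone actually wants to pause the store.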

Wednesday Nov 18, 2009

Replicated Replicants

Oracle has just released a new version of Oracle Berkeley DB Java Edition, including new High Availability features. These allow you to keep multiple database instances in sync (using a single master). Some time ago I was asked if we'd like to help evaluate a pre-release version of the code, and of course I said yes. We've been waiting for HA features to be available in the database to implement our own replication support, so it was a perfect fit for evaluation. Some very insightful people had good things to say about it. Since we had a head start, I already have a working version of the AURA Data Store with replication.

The AURA Data Store has three tiers: the Data Store Head serves as the communication point to the outside world and distributes queries to each of the many Partition Clusters. Partition Clusters each represent a single partition of the data, replicated across a cluster of Replicants. Until recently we only supported a single Replicant, but thanks to BDB-JE/HA we now have support for multiple replicas. If you're following along in the source code, you'll want to update to the "replication" branch to see the work being done. Adding support was fairly straightforward once I got a handle on how the nodes in a replication group are connected to each other. We already had infrastructure that expected to work this way, so integration was smooth. When setting up the group, you specify how closely synchronized the replicas need to be, and when committing transactions you can specify just how far into the disk or network you want the commit to go before returning. So we commit to the master, and within a few seconds we can expect to see the changes in the replicas.

The only catch is that we maintain both a database and a search engine. We haven't put any support into the search engine for HA (although a single index can have multiple readers and writers if we were sharing it). So for the time being, I have a thread that picks up any changed item keys from the master and informs the replicas that they should fetch those items from their respective databases and re-index them. It would be nice if we could get an event stream from the database in the replicas notifying us that a new or updated item just came in. Another option might be to actually store the search engine's data in the database and let it do the replication, but the nature of the inverted file doesn't really lend itself to this (at least, not with any hope of performing well).
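To make the key-relay idea concrete, here's a hypothetical sketch of such a thread built on a BlockingQueue. None of these names come from the actual Aura code, and the real version would call out to the replicas over the network; here the "re-index" step just records the keys so the flow is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a key-relay thread: the master enqueues changed
 * item keys, and a background thread drains them in batches and asks each
 * replica to re-fetch and re-index those items.
 */
public class ReindexRelay implements Runnable {
    private final BlockingQueue<String> changedKeys = new LinkedBlockingQueue<>();
    private final List<String> reindexed = new ArrayList<>(); // stand-in for replica calls
    private volatile boolean running = true;

    /** Called on the master side whenever an item changes. */
    public void keyChanged(String key) {
        changedKeys.add(key);
    }

    @Override
    public void run() {
        // Keep draining until shutdown AND the queue is empty
        while (running || !changedKeys.isEmpty()) {
            try {
                String key = changedKeys.poll(100, TimeUnit.MILLISECONDS);
                if (key != null) {
                    List<String> batch = new ArrayList<>();
                    batch.add(key);
                    changedKeys.drainTo(batch); // pick up anything else waiting
                    reindex(batch);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** In the real system this would tell each replica to re-index the items. */
    private synchronized void reindex(List<String> keys) {
        reindexed.addAll(keys);
    }

    public synchronized List<String> reindexedKeys() {
        return new ArrayList<>(reindexed);
    }

    public void shutdown() { running = false; }
}
```

Batching with drainTo keeps the replicas from being poked once per item when changes arrive in bursts.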

Anyway, the real excitement here was that for the first time, we got to see our data store visualization show us multiple Replicants for each Partition Cluster:

This is a screenshot showing a very small ("two-way") data store. It is running with only one DataStoreHead, and the data is divided across two partitions. Each partition has three Replicants. While the Replicants are drawn inside the Partition Clusters, it should be noted that they are actually separate processes running on separate machines. The grouping is just a logical view. I opened up the overall Item Stats display to show that we only have a small number of items. To make the screenshot more interesting, I'm running a 2,000 user load test that simulates the actual operations users would perform (basically, a weighted mix of getting item details, searching, and getting live recommendations) when using an application such as the Music Explaura.

As you can see in the image, we're distributing query load fairly evenly across all Replicants in each Partition Cluster. The Replicants do most of the heavy lifting in terms of data crunching. In order to benefit from the greater amount of cache afforded by the larger number of processes (and their memory), we distribute what queries we can based on the hash code of the item key, so requests for the same items always go to the same Replicants. The Partition Clusters do a little work in distributing those queries, and the Data Store Head does a little more work in that it needs to combine results from across Partition Clusters.
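The key-hashing trick is tiny but worth spelling out. A minimal sketch, assuming a simple modulo scheme (which may differ from what the Data Store actually does):

```java
/**
 * Minimal sketch of cache-friendly query routing: the same item key always
 * maps to the same replicant, so each replicant's cache stays hot for its
 * share of the keys.
 */
public class ReplicantRouter {
    /** Pick a replicant index in [0, numReplicants) for this item key. */
    public static int replicantFor(String itemKey, int numReplicants) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(itemKey.hashCode(), numReplicants);
    }

    public static void main(String[] args) {
        // Repeated lookups for the same key always hit the same replicant
        System.out.println(replicantFor("artist:weezer", 3));
        System.out.println(replicantFor("artist:weezer", 3));
    }
}
```

The stable key-to-replicant mapping is what makes the per-replicant caches effective; a round-robin scheme would spread each hot item's reads across every cache instead of concentrating them in one.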

I plan to do some more posting about how BDB-JE/HA is integrated into the AURA Data Store, so stay tuned!

Monday Nov 02, 2009

AURA Items and Java Compressor Streams

At the Replicant level of The AURA Project (the base level that actually holds the data), we store our Items in a Berkeley DB Java Edition database. Items have a number of fixed fields that we can retrieve them by, such as unique key, name, creation time, and type. Each type of item may have its own particular data, though, that the Data Store itself considers opaque. For example: a musical artist will have reviews, bios, links to videos, etc.; a user will have a name, email address, gender, and so on. To keep things simple in our current implementation, we store all of the type-specific data in a blob: we serialize the data to a byte array, then store the byte array in the database.
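A minimal sketch of that serialize-to-blob round trip, using plain Java serialization (the class and field names here are illustrative, not the actual Aura types):

```java

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the blob approach: the type-specific fields live in a map that
 * is serialized to a byte array, which the data store treats as opaque.
 */
public class ItemBlob {
    /** Serialize the type-specific data into an opaque byte array. */
    static byte[] toBlob(Map<String, Object> typeSpecificData) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(typeSpecificData);
        }
        return baos.toByteArray();
    }

    /** Reconstitute the map from the stored blob. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> fromBlob(byte[] blob)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (Map<String, Object>) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> artist = new HashMap<>();
        artist.put("bio", "Formed in 1992...");
        byte[] blob = toBlob(artist);
        Map<String, Object> roundTrip = fromBlob(blob);
        System.out.println(roundTrip.get("bio"));
    }
}
```

The byte array is what goes into the BDB record, which is also what makes the compression discussed next easy to bolt on: it only ever sees bytes.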

We run on a very fast network, so we weren't initially concerned with the size of these blobs. But as they got bigger, we started to hit a bug that seems to be caused by some combination of the network cards we're using, the switch between them, and possibly the ethernet driver as well. This bug was causing us to see intermittent retransmits with 400ms backoffs on larger objects. Obviously, this isn't good when we're trying to keep our total request time below 500ms (leaving only 100ms to actually do anything). We weren't in a position to track down the delay (the equipment wasn't ours to tamper with), so we tried to mitigate it by decreasing the size of the items. The simplest way to do this (that didn't involve actually storing the blobs somewhere else or breaking them up) was to compress the data. This is all a long-winded way of saying that I had cause to evaluate a bunch of the compression stream implementations that I found scattered around. Surprisingly, I couldn't find a comparison like this already out there on the interwebs.

I wrote a simple test harness that reads a whole bunch of .class files (as an extremely rough approximation of what live instance data would look like) and compresses and decompresses them and records the size and times for each test. Below are the results of reading around 800 class files and compressing them, checking the compressed size, then decompressing them. The first item, "None", is just straight I/O without any compression.
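For reference, the core of such a harness is just a chunked round trip through a compressor stream. Here's a minimal GZIP-only reconstruction of the idea (my own sketch, not the actual harness used for these tests); the other compressor streams drop in the same way.

```java
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/**
 * Minimal version of the compression test: round-trip a byte buffer through
 * a compressor stream, reading and writing in 4K chunks, and time it.
 */
public class CompressionTest {
    static final int CHUNK = 4096;

    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            for (int off = 0; off < data.length; off += CHUNK) {
                gz.write(data, off, Math.min(CHUNK, data.length - off));
            }
        }
        return baos.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[CHUNK];
            int n;
            while ((n = != -1) {
                baos.write(buf, 0, n);
            }
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = new byte[64 * 1024]; // all zeros: highly compressible
        long start = System.nanoTime();
        byte[] packed = compress(original);
        long compNanos = System.nanoTime() - start;
        byte[] unpacked = decompress(packed);
        System.out.printf("orig=%d comp=%d (%.1f ms), round-trip ok=%b%n",
                original.length, packed.length, compNanos / 1e6,
                Arrays.equals(original, unpacked));
    }
}
```

The real harness read .class files rather than a synthetic buffer, which matters: compression ratios on an all-zero buffer are wildly optimistic compared to real instance data.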

Compressor | Comp Time | Comp Size | Decomp Time

The ZLIB compressors come from the JDK's Deflater zlib implementation with the BEST_SPEED, default, and BEST_COMPRESSION options. GZIP and Zip are from the JDK as well. BZip2 is from the Apache Commons Compress project, and HadoopGZIP is from the Hadoop project using its Java implementation.

Looking at the results, I was initially excited by the ZLIB-Fast option: its compression time is quite good for not much loss in compressed size, but the decompression time leaves a little to be desired. Since, generally speaking, items get written infrequently in our system and read quite frequently, the decompression time (which is spent at the client or, in the case of web apps, in the web server) is the more important of the two. ZLIB-Small did much better at decompressing, but the cost of compressing was fairly high. GZIP does pretty well in compression time, size, and decompression time. Zip sped up compression a bit but took a lot longer to decompress, and BZip2 (as expected) trades time for tighter compression. I was under the impression that Hadoop's GZIP would come out the same as the JDK's (in fact, I thought it was using it), but the numbers are consistently different.

I'm looking for something that helps to reduce the size of the data and doesn't take too much time. So all told, GZIP seems to be the clear winner here. Note that these times are only useful for relative comparison. I'm getting the data from many different files (which are cached) and I'm reading/writing in 4K chunks for my test. I may well do better with a different chunk size, but I suspect that the relative numbers will come out about the same.

Wednesday Aug 19, 2009

The Shawshankr

We've had a joke here in The AURA Project for a while now about a movie recommender that is guaranteed to give a good recommendation. It is a very simple concept: it always recommends "The Shawshank Redemption". "The Shawshank Redemption" is a great, if undervalued, movie. It is a touching story of hope that has fairly universal appeal (thank you, Stephen King; and thank YOU, Morgan Freeman). The joke about the movie turned into a joke about a web site called The Shawshankr. That, of course, turned into an actual web site.

Now, in addition to being ridiculous, it actually serves to make a minor point about recommendation. Recommendation isn't always about recommending something that everybody likes. Without novelty, a recommender isn't very worthwhile. While working on building a recommender system, we try to keep this in mind. Just because everybody read The Da Vinci Code (and it might sell if you suggest it), that doesn't necessarily make it a good recommendation.

Thursday Jul 16, 2009

Developing the Facebook Music Explaura

In order to try to get a little more interest in the Music Explaura, I decided that I'd try making a Facebook app that uses some of the data available on Facebook to get you started with your exploration. Facebook would provide the UI and input, and the Search Inside the Music application built on Aura would provide the data. The live system would run on a grid/compute cloud that was developed as a research project inside Sun. I had hoped to pull in and accumulate some of the social data in order to help with the recommendations, but sadly that turned out to be against Facebook's terms of service. Still, when a user runs an app, I can get access to the user's "Favorite Music" in their profile, and I can get access to their friends and their friends' "Favorite Music". This was enough to go on.

After finding the Java API, it was mostly a matter of learning what you can and can't do with JavaScript inside the Facebook page. Facebook intercepts all traffic between the browser and your site (which gets around the cross-site scripting issue) and rewrites your JavaScript to make sure you're only doing things that are on the up-and-up. They call their JavaScript environment FBJS. Despite the frequent frustrations that I had while bumping my head against the frobbed code, I will give Facebook credit that they've taken a pretty hard problem (trusting third party code to run in your app) and created a workable solution. Facebook does allow AJAX calls (using its fairly easy to use wrappers), so I could include some level of interactivity.

For the interactivity, I needed some servlets that would provide answers to the queries I was running. We've been working on some web services for the music features, but I had some specific needs. In particular, there are certain things that need to be returned in a special format for Facebook to allow them (for example, code for special Facebook markup tags needs to be frobbed by Facebook before it gets back to the JavaScript running in the browser). So custom servlets it was.

The first thing I wanted to do was generate a tag cloud that represented the user's taste. This would involve getting the user's favorite music, finding band names in it, then pulling out and combining the most distinctive tags of those bands. The communication process here is kind of interesting. First, the user loads the Facebook app. Facebook hits our web server(s) and requests the "canvas" page to display the app. As the UI is displayed to the user, an AJAX call is made to fetch the tag cloud. So the client sends the AJAX request to Facebook, and Facebook forwards it to us. To answer the request, we first send a query to Facebook asking for the "Favorite Music" in the user's profile (and some other data that we'll cache short-term for later use). We get a String back, which we can parse and use to search our data store for matching band names. We get the matching bands and their associated distinctive tags, combine their weights, and return a JSON object with term/weight pairs back to the Facebook AJAX proxy. Facebook checks it over to make sure it is legit, and finally returns the JSON to the client. A bit of JavaScript then renders the tags and weights as a cloud and displays it to the user. Finally, a link is added below the tag cloud to load the cloud into the Steering interface in the Music Explaura.
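The tag-combining step in the middle of that pipeline is simple weight summation. A sketch, with invented band names, tags, and weights (the real distinctive-tag data comes from the Aura data store):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the tag-combining step: each matched band contributes its
 * distinctive tags with weights, and weights for the same tag are summed
 * to build one cloud for the user.
 */
public class TagCloudBuilder {
    /** Merge per-band tag weights into a single tag -> weight map. */
    static Map<String, Double> combine(List<Map<String, Double>> perBandTags) {
        Map<String, Double> cloud = new HashMap<>();
        for (Map<String, Double> bandTags : perBandTags) {
            for (Map.Entry<String, Double> e : bandTags.entrySet()) {
                // Tags shared by several bands accumulate weight
                cloud.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return cloud;
    }

    public static void main(String[] args) {
        Map<String, Double> band1 = Map.of("shoegaze", 0.9, "indie", 0.4);
        Map<String, Double> band2 = Map.of("indie", 0.7, "lo-fi", 0.5);
        System.out.println(combine(List.of(band1, band2)));
    }
}
```

The resulting map is what gets serialized as the term/weight JSON that the client-side JavaScript renders as a cloud; tags shared across the user's bands naturally float to the top.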

I wanted to be able to display my tag cloud on my Profile, so next up was the "Add to Profile" button. I figured that when you click the button, Facebook would request the content of the profile box then insert it. Not so. The Add button (inserted with FBML that Facebook controls the rendering of) doesn't render unless there's already a field set in Facebook with the HTML for the profile box. I want the button to show up when the page loads, but Facebook won't render it unless I've already filled in the data, and I can't fill in the data until the page loads! I needed to get Facebook to render the "Add to Profile" button after I fetched the user's cloud (and had a chance to post the profile markup to Facebook). Since Facebook will render markup in your AJAX results, I added a field to the JSON I return to build the cloud that contains the FBML necessary for creating an "Add to Profile" button. Then I just get that data in the client (now rendered by Facebook) and put it into a DIV for display. Hacky? Yes, but it works!

There were some other interesting issues as well. For example, the Facebook Java API (not actually maintained by Facebook) will return JSON objects for some things, but the format isn't specified. Some calls will return a JSONObject and some a JSONArray, depending on the data returned at runtime. This isn't very helpful when you're trying to cast to the correct type to get at your data. I was also a little surprised to see that Facebook doesn't offer more of a developer "sandbox". I'd love to see something that lets me (or more than one co-developer at a time) test a dev version of the app without us each having to create a new application listing. It'd also be great to be able to create ad-hoc users (basically profiles that are only accessible by my app) for testing. This would let me have a few users with different bands and compare their tastes without having to create fake Facebook accounts. As it was, I generally tested by changing my own favorite music.

This has turned into a fairly long-winded post, so I'll leave it at that for now. Try out the Facebook Music Explaura and see what you think!

Monday Jul 13, 2009

Jumping through hoops

Sometimes you find something that sounds like it should be fairly straightforward but ends up being a bit more complicated than you thought.

Our intern Frank commented that he'd really like to have a little more agility in trying out the code that he's working on. You see, we run our data store on a grid. The grid is physically somewhere else, and more importantly, it runs our code in a virtual network (you create a network to run your processes in so everything is isolated). Every process you start runs in a Solaris zone with its own IP address(es). So in order to run some test code, you need to build it, upload it, then restart your process. This doesn't take a lot of time, but it adds several steps between "change a line of code" and "try it out against our live system". To get the results, you need to either look at an output file that your process creates, or you need to connect to some server providing it (probably a web server that can talk to the process you ran). Fairly cumbersome.

This has been a problem for a while, but we hadn't really gotten to a point yet where we couldn't just test our code by running it against a small dataset on a "distributed" instance of our data store that actually runs all in one JVM on our development machines. Frank is working on stuff that really needs "big data" so that isn't going to cut it anymore. The external interface to our live datastore is based on web services, but for heavy computation, we want to speak directly to the store over RMI. So since we seem to be letting our interns dictate what we should work on, I went to work providing a solution.

One of the many nice features of the research grid we're running on is that we can use a VPN client to connect in to our virtual network. Once I figured out how to configure this and allocate the resources for it, it worked like a charm. This should be easy now, right? Just get your client machine on the network and you can use RMI to talk directly to our data store! Except, it didn't work. We use Jini to self-assemble the pieces of our data store and also to discover the "head" nodes that allow client access. Turns out the VPN client wasn't handling the multicast packets needed to do discovery. No problem - I specified the IP address that was assigned to our RMI registry process. Still didn't work. It would attempt to load our services but it would hang and eventually time out. This was getting tricky.

One problem I noticed from the start was that I wasn't getting any name resolution to work. Perhaps Java was trying to look up the names of hosts and timing out. Enough host name timeouts, and our own connection timeout would be reached and it would fail. Now the problem was that while all of the processes running on the grid itself (inside Solaris zones) had access to the name server for the grid, the VPN clients connecting in could not actually route packets to the DNS server. (Remember, everybody gets their own private network, but name service is provided by the grid -- you don't need to (or want to) run your own). Now we could have actually started our own name server, or we could have made a hosts file that has all the names of the machines we use. The problem is, this is a dynamic environment. From one start of the system to another, the actual addresses may change. (Note that we don't care what the addresses are for the system to work -- Jini lets us self-discover all the components.) The solution I chose to go with was to start an on-grid process that would relay DNS requests to the actual name server and back.

Of course, I initially thought that even this would be easy because there are any number of port-forwarding Java programs out there, and I've even written one or two myself for various tasks. I started by deploying one of these and was surprised to discover that it still didn't work. No name resolution. Fortunately, I happened to look at the little network monitor I was running and noticed (with a loud slap of my forehead) that of course name resolution usually uses UDP, not TCP (which is what all these port forwarders are written for). I hoped that maybe I could configure my name resolver to use only TCP (yeah, much less efficient, but still tiny in the grand scheme of things), but no such luck.

So, I looked around and was (I guess not very) surprised to find that there wasn't any Java code out there for running a simple UDP port forwarder. Fortunately, the code turns out to be relatively simple - at least, it was once I reminded myself how UDP works. I wrote a simple process that receives datagrams on port 53 then chooses an unused outgoing port to resend the datagram (after re-writing the source and destination addresses) to the actual DNS. It starts a thread that waits for the response on that port then sends the response datagram packet back (again, with the addresses re-written) to the original requesting machine (the VPNed client machine). I deployed the code as a grid process and sure enough, it worked like a charm!
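The core of that forwarder is small enough to sketch. The actual process is Java (built on DatagramSocket/DatagramPacket); what follows is a minimal Python rendering of the same relay logic, with the well-known port and the real DNS server replaced by placeholders so the sketch stands alone:

```python
import socket
import threading

def relay(sock, upstream_addr, count=1):
    """Forward `count` UDP datagrams arriving on `sock` to
    `upstream_addr`, relaying each reply back to the original sender.
    The real forwarder bound port 53 and looped forever; this sketch
    handles a fixed number of requests so it stays self-contained."""
    for _ in range(count):
        data, client = sock.recvfrom(4096)
        # A fresh socket gets its own ephemeral source port, so the
        # server's reply comes back to this request's socket only.
        up = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        up.settimeout(5)
        up.sendto(data, upstream_addr)
        def answer(up=up, client=client):
            # Wait for the reply, then hand it back to the original
            # requester (the source/destination rewriting is just the
            # two sendto calls using different addresses).
            reply, _ = up.recvfrom(4096)
            sock.sendto(reply, client)
            up.close()
        threading.Thread(target=answer, daemon=True).start()
```

In the real deployment, the listening socket is bound to port 53 inside the on-grid process and `upstream_addr` is the grid's actual name server.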

All told, this probably should have been easier. On the other hand, we can now connect our machines into the network that runs our live system and run code directly against it. What I didn't mention is that Frank isn't actually running Java code. He wants to run Python code that uses a native number crunching library that starts a JVM that allows him to pull Java objects from the data store and pass them into the Python library. He's running that in his virtual Linux box running in VMWare on his Mac (which is connected to the virtual network on the grid via VPN). We don't know yet just how slow this is all going to be.

Thursday Jun 25, 2009

First pass at JSR223 support

With encouragement from Steve, I went ahead and put in the code to allow attention processing with scripts in the data store, as proposed in my last post. To keep things simple, I added a single method for attention and didn't try to expand to other data types yet. The method signature looks like:

public Object processAttention(AttentionConfig ac,
                               String script, String language)
         throws AuraException, RemoteException;

The method is implemented in part in the BerkeleyItemStore class. The AttentionConfig describes the parameters for the attentions to process. In the BerkeleyItemStore, the database is queried to get the matching attentions, then they're passed as a list into the script.

The script is expected to implement two methods. A process method is called by the BerkeleyItemStore with the matching Attentions and returns any object. The results are collected by the DataStoreHead (in a different process) and passed as a list into a collect method. This method also returns an Object which is returned to the caller of processAttention.

So to sum up the play counts for a particular artist, we'd set the AttentionConfig's target to the band's key, and the type to "PLAYED". Using JavaScript, our script would look something like this:

// process takes a list of attention as its input
function process(attns) {
    var attentions = attns.toArray();
    var count = 0;

    for (var x in attentions) {
        count += attentions[x].getNumber() - 0;
    }
    return count;
}

// collect takes a list of the results from above
function collect(results) {
    var counts = results.toArray();
    var count = 0;
    for (var x in counts) {
        count += counts[x] - 0;
    }
    return count;
}

Or in Python like this:

def process(attns):
    return sum([int(x.getNumber()) for x in attns.toArray()])

def collect(results):
    return sum(results.toArray())

The oddity with subtracting zero ensures that the variable is treated as an integer rather than a string. To test the performance, I tried two different counting methods. I selected about 14,000 attention objects to use, representing about 24 million plays (this is unusually high because of a bug in some test code I was running, but the number of plays isn't too important here). In the first method, I pulled the attention objects to the client and added up their counts. On average, this seemed to take about 320ms. For the second method, I used the above script to do the counting deeper in the data store. This method took only about 120ms (including the time to compile the script). I suspect that caching the compiled scripts (since they're likely to run more than once) will save a chunk of that time as well!
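That caching is straightforward to add: memoize compilation keyed by language and script text. Here's a hypothetical sketch of the idea, using Python's own compile() as a stand-in for a JSR223 Compilable engine (the names are illustrative, not the store's actual API):

```python
# Cache compiled scripts so repeated processAttention calls with the
# same script text can skip the compilation step. Python's compile()
# stands in for javax.script.Compilable; names are illustrative only.
_script_cache = {}

def get_compiled(language, source):
    key = (language, source)
    compiled = _script_cache.get(key)
    if compiled is None:
        compiled = compile(source, "<script>", "exec")
        _script_cache[key] = compiled
    return compiled
```

Since the same script is likely to run on every request for a given feature, even this simple cache should eliminate the compile cost from the steady state.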

Tuesday Jun 23, 2009

Trying some map/reducey behavior

The AURA Data Store is more than just a key/value store. As with most distributed data stores, there are "head" nodes (DataStoreHeads in our lingo) that take requests and route them to appropriate "leaf" nodes (Replicants to us). Each head node has knowledge of which leaf node any given value is stored at based on its key. Each leaf node has a database storing the values with a nice big in-memory cache.

But unlike most key/value stores, we have at least some knowledge of the type of data that is stored. Specifically, we have three types: Items are things that can be recommended. They have a name and a map of any serializable data you want. Users are just a type of item that have some user-specific fields well defined. Finally, Attentions are mappings between items (user X played song Y or song Y is on album Z) that have a type, a time, and a little optional metadata. These types are stored in a Berkeley DB (Java Edition, using the DPL) with indexed secondary keys for all the well-defined data (but not for the custom map). Our not-so-secret sauce is that we also have a search engine next to each database that indexes not just the well-defined data, but also any custom data that is stored in the item's map. Applications can build wrappers around Items that store whatever information they want in the Item map.
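As a rough picture of those three types, here's a hypothetical Python rendering (the real classes are Java, and these field names are illustrative rather than the actual API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical sketch of the store's three types; the real classes
# are Java, and these field names are illustrative only.

@dataclass
class Item:
    key: str
    name: str
    data: Dict[str, Any] = field(default_factory=dict)  # app-defined map

@dataclass
class User(Item):
    # Users are Items with some extra well-defined fields.
    nickname: str = ""

@dataclass
class Attention:
    source_key: str     # e.g. a user's key
    target_key: str     # e.g. a song's key
    type: str           # e.g. "PLAYED"
    timestamp: float
    metadata: str = ""  # small optional payload (e.g. a play count)
```

The split matters below: the well-defined fields get secondary indexes in the database, while the free-form `data` map is only reachable through the search engine.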

With all this indexed data (either in the database or the search engine), there are many operations that we know we want to perform by running computation where the data lives and sending back results. This means that the interface to the data store is pretty big. We can't get away with just "get", "put", and "delete". Our methods fall into two categories. The gets, puts, etc. make up the first. This kind of query is for data that lives on a node that we can identify by key. But we also need methods to answer queries such as "get me all the times that user X has played song Y since last Tuesday" or "find me the artists most similar to a user's profile". For these, we need to query all of our data nodes since the answers could be found on any of them. Then we need to collect up the results (probably ordering them in some way) and return them to the client of the data store. We've coded a number of fairly flexible methods that let users of the data store perform a wide range of queries to answer these kinds of problems. But there's always going to be some case where there'll be some application logic that we didn't count on that really wants to run close to the data.
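That second category follows a scatter/gather pattern: send the same query to every leaf, gather the partial results, and merge (possibly ordering) before returning. A toy Python sketch of the pattern, with leaf nodes faked as in-memory lists instead of RMI-reachable Replicants:

```python
from concurrent.futures import ThreadPoolExecutor

def query_all_leaves(leaves, predicate, order_key=None):
    """Run the same query against every leaf node, gather the
    partial results, and merge (optionally ordering) them."""
    def query_leaf(records):
        # Stand-in for the per-Replicant database/search-engine query.
        return [r for r in records if predicate(r)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(query_leaf, leaves))
    merged = [r for part in partials for r in part]
    if order_key is not None:
        merged.sort(key=order_key)
    return merged

# Two fake leaves holding (user, song, time) play records.
leaves = [
    [("x", "y", 10), ("x", "z", 11)],
    [("x", "y", 12), ("w", "y", 13)],
]

# "All the times user X played song Y", ordered by time.
plays = query_all_leaves(
    leaves,
    lambda r: r[0] == "x" and r[1] == "y",
    order_key=lambda r: r[2],
)
```

The flexible query methods mentioned above are essentially parameterizations of this shape; the scripting idea below replaces the fixed predicate and merge with user-supplied code.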

This is where the fun comes in. Our intern, François, wanted to run a query that would have needed specific knowledge about the way an application uses one of our data types. This would violate the isolation the data store has from applications built on it. Specifically, François wanted to get a total count of "listens" between a user and an artist. No problem, except that he wants to store multiple listens per attention object (using the metadata field). We don't want to return all those attention objects back over the wire and require that they get counted by the client, and we don't want to put specific code for that in the data store. What if we wanted to be a little more map-reducey and run some custom code at the replicant level?

Enter JSR223.

By supporting scripting, we can allow applications to run code in the datastore (at the leaf nodes) while keeping the database fairly secure by limiting what objects the scripts have access to. JSR223 makes this possible since there are a large number of scripting languages supported (meaning we don't need to make our own). There are many nice things about this, not the least of which is that we don't have to worry about class loading as we would if we allowed arbitrary compiled Java code to be passed in. Of course, since JSR223 supports Java, there's no reason you couldn't pass Java in too.

To solve the Attention counting problem, we could define a method that takes a query specifying which attention should be considered and also a string containing the script to run (and one for what language it is). At the leaf nodes, the method would invoke a function on the script to do the "map"-like phase of the work. The functions would all return Object. Then up at the head node, the results from each leaf would be collected and another function defined in the script would be invoked, passing in the Objects, to combine the results (the "reduce"-like phase). Not exactly map/reduce, but this would definitely go a long way towards solving the issue of custom attention processing.

Of course, this approach is limited only to the objects we're willing to hand the scripts. An open question exists around how to expose more of the data to the script methods.

Monday Jun 22, 2009


Hi folks.  I'm a researcher in Sun Labs working on The AURA Project.  I am responsible for designing and developing the distributed Data Store.  We're interested both in studying existing data stores, and playing with creating our own.  Making our own has a couple advantages.  First, it can be customized to do things the way we want.  Second, it gives us a good perspective from which to understand and evaluate other distributed data stores.  We have different needs than most distributed key/value stores though.  We don't just want to store values and retrieve them by key.  We're interested in doing some computation on the data that will help us to generate recommendations on the fly.  That computation is best done close to the data, so we have a special-purpose data store that can handle these types of queries and give back results quickly enough that we can build a live service on them.  We do them on the fly so that you can give feedback to the recommender and have it immediately take your preferences into account.

The data store isn't exactly mature, but it is functional and stable.  I'll post to this blog as I have thoughts or interesting (in my opinion) problems regarding the data store.  I may also post observations about our data store and how it compares to others which may not be at all interesting to you but may help me as I build a mental picture of what the realm of existing data stores looks like.

You can get the source code to our project on Kenai, and you can try out our music recommender system built on our data store.


Jeff Alexander is a member of the Information Retrieval and Machine Learning group in Oracle Labs.

