Trying some map/reducey behavior

The AURA Data Store is more than just a key/value store. As with most distributed data stores, there are "head" nodes (DataStoreHeads in our lingo) that take requests and route them to appropriate "leaf" nodes (Replicants to us). Each head node has knowledge of which leaf node any given value is stored at based on its key. Each leaf node has a database storing the values with a nice big in-memory cache.
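To make the routing idea concrete, here is a minimal sketch of how a head node might pick a leaf from a key. The hash-modulo scheme, the class name KeyRoutingSketch, and the replicant names are all assumptions for illustration. The real DataStoreHead may map keys to Replicants quite differently.

```java
import java.util.Arrays;
import java.util.List;

class KeyRoutingSketch {
    /**
     * Picks the leaf responsible for a key.
     * Hash-modulo routing is an assumption made for this sketch only.
     */
    static String leafFor(String key, List<String> leaves) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(key.hashCode(), leaves.size());
        return leaves.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical leaf names; the same key always routes to the same leaf.
        List<String> leaves = Arrays.asList("replicant-0", "replicant-1", "replicant-2", "replicant-3");
        System.out.println("song-y lives on " + leafFor("song-y", leaves));
    }
}
```

The only property that matters here is that every head node computes the same leaf for the same key, so gets and puts for a key always land on the same node.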

But unlike most key/value stores, we have at least some knowledge of the type of data that is stored. Specifically, we have three types. Items are things that can be recommended. They have a name and a map of any serializable data you want. Users are just a type of item that has some well-defined, user-specific fields. Finally, Attentions are mappings between items (user X played song Y or song Y is on album Z) that have a type, a time, and a little optional metadata. These types are stored in a Berkeley DB (Java Edition, using the DPL) with indexed secondary keys for all the well-defined data (but not for the custom map). Our not-so-secret sauce is that we also have a search engine next to each database that indexes not just the well-defined data, but also any custom data that is stored in the item's map. Applications can build wrappers around Items that store whatever information they want in the Item map.
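For readers who think in code, the three types could be pictured roughly like this. This is a simplified sketch and not the actual AURA classes: the field names, the nickname field, and the use of plain strings for keys and metadata are all assumptions, and the DPL persistence annotations are left out.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class DataModelSketch {
    /** Something that can be recommended: a name plus a free-form map. */
    static class Item {
        final String key;
        final String name;
        // Custom application data; in the post, only the search engine indexes this.
        final Map<String, Serializable> data = new HashMap<>();

        Item(String key, String name) {
            this.key = key;
            this.name = name;
        }
    }

    /** A user is an item with some well-defined extra fields (hypothetical one shown). */
    static class User extends Item {
        String nickname;

        User(String key, String name) {
            super(key, name);
        }
    }

    /** A typed, timestamped link between two items, e.g. user X played song Y. */
    static class Attention {
        final String sourceKey;
        final String targetKey;
        final String type;
        final long timeMillis;
        final String metadata; // optional, may be null

        Attention(String sourceKey, String targetKey, String type, long timeMillis, String metadata) {
            this.sourceKey = sourceKey;
            this.targetKey = targetKey;
            this.type = type;
            this.timeMillis = timeMillis;
            this.metadata = metadata;
        }
    }
}
```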

With all this indexed data (either in the database or the search engine), there are many operations that we know we want to perform by running computation where the data lives and sending back results. This means that the interface to the data store is pretty big. We can't get away with just "get", "put", and "delete". Our methods fall into two categories. The gets, puts, etc. make up the first. This kind of query is for data that lives on a node that we can identify by key. But we also need methods to answer queries such as "get me all the times that user X has played song Y since last Tuesday" or "find me the artists most similar to a user's profile". For these, we need to query all of our data nodes since the answers could be found on any of them. Then we need to collect up the results (probably ordering them in some way) and return them to the client of the data store. We've coded a number of fairly flexible methods that let users of the data store perform a wide range of queries to answer these kinds of problems. But there's always going to be some case where an application has logic we didn't count on that really wants to run close to the data.
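The second category of query follows a scatter/gather pattern: ask every leaf, merge, order, trim. Below is a minimal single-process sketch of that pattern. The Leaf interface, the Scored class, and queryAll are hypothetical names invented for the example, and a real head node would presumably call the leaves remotely and in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ScatterGatherSketch {
    /** Hypothetical scored result, e.g. an artist key with a similarity score. */
    static final class Scored {
        final String key;
        final double score;

        Scored(String key, double score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a leaf node: answers from its local data only. */
    interface Leaf {
        List<Scored> query(String query, int n);
    }

    /** Head-node side: ask every leaf, merge the answers, keep the best n. */
    static List<Scored> queryAll(List<Leaf> leaves, String query, int n) {
        List<Scored> merged = new ArrayList<>();
        for (Leaf leaf : leaves) {
            // Each leaf returns at most its own top n, which is enough for a global top n.
            merged.addAll(leaf.query(query, n));
        }
        merged.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }

    public static void main(String[] args) {
        Leaf leafA = (q, n) -> Arrays.asList(new Scored("artist-1", 0.2), new Scored("artist-2", 0.9));
        Leaf leafB = (q, n) -> Arrays.asList(new Scored("artist-3", 0.5));
        for (Scored s : queryAll(Arrays.asList(leafA, leafB), "similar to user X", 2)) {
            System.out.println(s.key + " " + s.score);
        }
    }
}
```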

This is where the fun comes in. Our intern, François, wanted to run a query that would have needed specific knowledge about the way an application uses one of our data types. This would violate the isolation the data store has from applications built on it. Specifically, François wanted to get a total count of "listens" between a user and an artist. No problem, except that he wanted to store multiple listens per attention object (using the metadata field). We don't want to send all those attention objects back over the wire and require the client to count them, and we don't want to put application-specific code for that in the data store. What if we wanted to be a little more map-reducey and run some custom code at the replicant level?

Enter JSR223.

By supporting scripting, we can allow applications to run code in the datastore (at the leaf nodes) while keeping the database fairly secure by limiting what objects the scripts have access to. JSR223 makes this possible since there are a large number of scripting languages supported (meaning we don't need to make our own). There are many nice things about this, not the least of which is that we don't have to worry about class loading as we would if we allowed arbitrary compiled Java code to be passed in. Of course, since JSR223 supports Java, there's no reason you couldn't pass Java in too.
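As a rough illustration of "limiting what objects the scripts have access to", the sketch below evaluates a script against a fresh set of bindings that holds only what the caller chooses to expose. It uses the standard javax.script API. The class and method names are made up for the example, and whether an engine for a given language is installed depends on the JVM (recent JDKs no longer bundle a JavaScript engine). Note also that bindings only control which objects are handed over. Some engines let scripts reach Java classes by other means, so this alone should not be read as a full sandbox.

```java
import java.util.HashMap;
import java.util.Map;
import javax.script.Bindings;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

class ScriptBindingsSketch {
    /**
     * Evaluates a script that can see only the objects in 'exposed'.
     * Hypothetical helper, not part of the data store's interface.
     */
    static Object evalWithBindings(String language, String script, Map<String, Object> exposed) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            throw new IllegalStateException("No JSR 223 engine registered for " + language);
        }
        // Copy into fresh bindings so the script cannot add to the caller's map.
        Bindings bindings = new SimpleBindings(new HashMap<>(exposed));
        try {
            // These bindings act as the engine scope for this one evaluation.
            return engine.eval(script, bindings);
        } catch (ScriptException e) {
            throw new IllegalArgumentException("Script failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> exposed = new HashMap<>();
        exposed.put("count", 21);
        try {
            System.out.println(evalWithBindings("JavaScript", "count * 2", exposed));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```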

To solve the Attention counting problem, we could define a method that takes a query selecting which attentions should be considered, along with a string containing the script to run (and one naming the language it is written in). At the leaf nodes, the method would invoke a function in the script to do the "map"-like phase of the work. These functions would all return Object. Then up at the head node, the results from each leaf would be collected and another function defined in the script would be invoked, passing in the Objects, to combine the results (the "reduce"-like phase). Not exactly map/reduce, but this would definitely go a long way towards solving the issue of custom attention processing.
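The two phases could be wired together roughly as follows, using javax.script.Invocable to call named functions in the script. This is only a sketch of the proposal, not an existing data store method: the function names "map" and "reduce", the class ScriptedAttentionQuery, and its helpers are all assumptions. For simplicity it runs in one process and shares one loaded script, whereas each leaf would really load its own copy.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

class ScriptedAttentionQuery {
    /** Loads a script that is expected to define functions named map and reduce. */
    static Invocable load(String language, String source) throws ScriptException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            throw new IllegalStateException("No JSR 223 engine registered for " + language);
        }
        if (!(engine instanceof Invocable)) {
            throw new IllegalStateException(language + " engine cannot invoke named functions");
        }
        engine.eval(source); // defines the script's functions inside the engine
        return (Invocable) engine;
    }

    /** Leaf side: hand the attentions matched by the query to the script's map function. */
    static Object mapOnLeaf(Invocable script, List<?> matchingAttentions)
            throws ScriptException, NoSuchMethodException {
        return script.invokeFunction("map", matchingAttentions);
    }

    /** Head side: hand the per-leaf results to the script's reduce function. */
    static Object reduceOnHead(Invocable script, List<Object> leafResults)
            throws ScriptException, NoSuchMethodException {
        return script.invokeFunction("reduce", leafResults);
    }

    /** Runs both phases; each inner list stands in for one leaf's matching attentions. */
    static Object run(Invocable script, List<List<?>> attentionsPerLeaf)
            throws ScriptException, NoSuchMethodException {
        List<Object> partials = new ArrayList<>();
        for (List<?> leafAttentions : attentionsPerLeaf) {
            partials.add(mapOnLeaf(script, leafAttentions));
        }
        return reduceOnHead(script, partials);
    }
}
```

For the listen-counting case, the script's map function would sum the listen counts stored in each attention's metadata on one leaf, and reduce would add up the per-leaf sums, so only one number per leaf crosses the wire.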

Of course, this approach is limited to the objects we're willing to hand the scripts. How to expose more of the data to the script methods is still an open question.


Jeff Alexander is a member of the Information Retrieval and Machine Learning group in Oracle Labs.

