The AURA Data Store: An Overview

Paul said some pretty nice things about The Explaura today. He mentioned that I hadn't talked much about the backend system architecture, which we call the AURA Data Store (or just "the data store"). This is something that I will be talking about in future posts, but since he mentioned it, I wanted to give you an idea of what the AURA data store is like.

Here's a snapshot of our data store browsing and visualization tool:

You can see that the data store has almost 2 million items in it. These are the actual artists, along with items for the photos, videos, albums, and so on. Alongside the items are about 1.2 million attentions. An attention is a triple consisting of a user, an item, and an attention type. So, for example, if we know that a particular user listened to a particular artist, then we would add an attention for that user, with the item corresponding to that artist and an attention type of LISTEN.
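A minimal sketch of such an attention triple might look like the following. The class, field, and enum names here are my own illustration, not the actual AURA API:

```java
import java.util.Objects;

// Hypothetical sketch of an attention triple: (user, item, type).
// Names are illustrative; the real AURA classes may differ.
enum AttentionType { LISTEN, VIEW, RATING, TAG }

final class Attention {
    final String userKey;      // key of the user who paid attention
    final String itemKey;      // key of the item attended to (e.g., an artist)
    final AttentionType type;  // what kind of attention it was

    Attention(String userKey, String itemKey, AttentionType type) {
        this.userKey = Objects.requireNonNull(userKey);
        this.itemKey = Objects.requireNonNull(itemKey);
        this.type = Objects.requireNonNull(type);
    }

    @Override public String toString() {
        return "(" + userKey + ", " + itemKey + ", " + type + ")";
    }
}
```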

The layout of the visualization tool gives you a pretty good idea of the layout of the data store. At the top level there are data store heads, which clients of the data store use to add and fetch items and attention, and to run searches against the data store.

The data store heads talk to the second level of the data store, which is composed of a number of partition clusters. Each partition cluster manages a number of replicants (currently that number is 1, but stay tuned throughout the summer); replication is meant to provide scalability and redundancy in the data store.

Each replicant is composed of a BDB key-value store and a Minion search index. BDB is responsible for storing and retrieving the item and attention data and Minion is responsible for almost all of the searching and all of the similarity computations.

The addition of the Minion search index makes the AURA data store a little different from typical key-value stores: it can answer the kinds of queries that a search engine handles (e.g., find the artists whose Wikipedia bios contain the word "progressive", or find artists whose social tags are similar to a particular artist's) as well as the kind that a traditional key-value store handles (e.g., fetch the item with this key).
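To make the two query styles concrete, here is a toy in-memory store exposing both a key-value fetch and a substring search. This is my own illustration of the hybrid idea, not the actual AURA or Minion API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy hybrid store: key-value fetches plus search-style queries.
// Purely illustrative; the real store delegates to BDB and Minion.
final class ToyHybridStore {
    private final Map<String, String> items = new TreeMap<>();

    void put(String key, String text) { items.put(key, text); }

    // Key-value style: fetch the item stored under this key.
    String get(String key) { return items.get(key); }

    // Search-engine style: find keys whose text contains a word.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        String needle = word.toLowerCase();
        for (Map.Entry<String, String> e : items.entrySet()) {
            if (e.getValue().toLowerCase().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```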

We refer to this data store as a 16-way store, because there are 16 partition clusters and 16 replicants. Each of the boxes that you see in the visualization represents a JVM running a chunk of the data store code. We use Jini in combination with our configuration toolkit to do service registration and discovery. It's all running on the Project Caroline infrastructure, and we'll be migrating it to the new Sun Cloud offering as soon as we can.

In our load tests, a data store like this one was capable of serving about 14,000 concurrent users doing typical recommendation tasks (looking at items, finding similar items, adding attention data) with sub-500ms response time at the client.


Wow, that's pretty cool! I don't know if you're allowed to say, but how do you guys distribute the -CPU- load? Hadoop-style? Or just aggregate the data from the data store and do all the CPU work on one box?

Just curious :)

Posted by TheAlchemist on April 08, 2009 at 10:07 AM EDT #

Oh, we're allowed to say. The bulk of the computation that's occurring is taking place in the nodes at the bottom of the data store where the BDB and Minion databases are. Because this is not intended to be a general computation framework, we control what's going to run against the databases.

Mostly this consists of queries against the search index (aura-type = artist <and> aura-name <substring> "tragically"), finding similar documents based on a particular field of an item, and fetching items from the BDB database.
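The fielded query above can be sketched as a composition of predicates over an item's fields. This is just an illustration of how equality and substring clauses combine with <and>; the real Minion query API is different:

```java
import java.util.Map;
import java.util.function.Predicate;

// Sketch of evaluating a fielded query such as
//   (aura-type = artist <and> aura-name <substring> "tragically")
// against an item modeled as a field->value map. Illustrative only.
final class FieldQuery {
    // field = value
    static Predicate<Map<String, String>> eq(String field, String value) {
        return item -> value.equals(item.get(field));
    }

    // field <substring> sub (case-insensitive)
    static Predicate<Map<String, String>> substring(String field, String sub) {
        return item -> {
            String v = item.get(field);
            return v != null && v.toLowerCase().contains(sub.toLowerCase());
        };
    }
}
```

Clauses combine with Predicate.and/or, so the example query becomes eq("aura-type", "artist").and(substring("aura-name", "tragically")).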

We were working on ways to specify a computation that runs against the data in a replicant, but we haven't had a serious need for that yet.

Posted by Stephen Green on April 10, 2009 at 02:29 AM EDT #

Post a Comment:
Comments are closed for this entry.

This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.
