The AURA Data Store: An Overview
By searchguy on Apr 03, 2009
Paul said some pretty nice things about The Explaura today. He mentioned that I hadn't talked much about the backend system architecture, which we call the AURA Data Store (or just "the data store"). This is something that I will be talking about in future posts, but since he mentioned it, I wanted to give you an idea of what the AURA data store is like.
Here's a snapshot of our data store browsing and visualization tool (click for the full sized version):
You can see that the data store has almost 2 million items in it.
These are the actual artists along with items for the photos, videos,
albums and so on. Along with that are about 1.2 million
attentions. An attention is a triple consisting of a user, an
item, and an attention type. So, for example, if we know that a
particular user listened to a particular artist then we would add an
attention for that user with the item corresponding to that artist,
and an attention type of
The layout of the visualization tool gives you a pretty good idea of the layout of the data store. At the top level there are data store heads, which is what clients of the data store will use for adding and fetching items and attention and for doing searches in the data store.
The data store heads talk to the second level of the data store, which is composed of a number of partition clusters. Each of the partition clusters manages a number of replicants (currently that number is 1, but stay tuned throughout the summer), which is meant to provide for scalabilty and redundancy in the data store.
Each replicant is composed of a BDB key-value store and a Minion search index. BDB is responsible for storing and retrieving the item and attention data and Minion is responsible for almost all of the searching and all of the similarity computations.
The addition of the Minion search index means that the AURA data store is a little different from typical key value stores because it makes it possible to ask the kinds of queries that a search engine can handle (e.g., find the artists that contain the word progressive in their Wikipedia bios, find artists whose social tags are similar to a particular artist) as well as the kind of queries that a traditional key-value store can handle (e.g., fetch me the item with this key.)
We refer to this data store as a 16-way store, because there are 16 partition clusters and 16 replicants. Each of the boxes that you see in the visualization represents a JVM running a chunk of the data store code. We use Jini in combination with our configuration toolkit to do service registration and discovery. It's all running on the Project Caroline infrastructure, and we'll be migrating it to the new Sun Cloud offering as soon as we can.
In our load tests, a data store like this one was capable of serving about 14,000 concurrent users doing typical recommendation tasks (looking at items, finding similar items, adding attention data) with sub-500ms response time at the client.