
The Oracle NoSQL Database Blog covers all things Oracle NoSQL Database. On-Prem, Cloud and more.

  • December 9, 2013

Look, I shrunk the keys

Whether we use a relational or a non-relational database to durably persist our data on disk, we are all aware of how big a role indexes play in accessing that data in real time. There is one aspect, though, that most of us tend to overlook while designing the index key: how to size it efficiently.

Not that I want to discount the fact, but in traditional databases, where we used to store from hundreds of thousands to a few million records, sizing the index key didn't come up (that often) as a very high priority. In a NoSQL database, where you are going to persist a few billion to trillions of records, every byte saved in the key goes a long way.

That is exactly what came up this week while working on a POC, and I thought I should share the best practices and tricks of the trade that you can use in developing your own applications. So here is my numero uno recommendation for designing index keys:

  • Keep it Small.

Now there is nothing new there that you didn't know already, right? Right, but I just wanted to highlight it: if there is anything you remember from this post, it should be this one bullet item.

All right, here is what I was dealing with this week: a couple billion records of telematics/spatial data that we needed to capture and query based on the timestamp (of the feed that was received) and the x and y coordinates of the system. To run the kind of queries we wanted to run (on the spatial data), we came up with this index key:

/S/{timestamp}{x-coordinate}{y-coordinate}

How we used the above key structure to run spatial queries is another blog post; for this one I will just say that when we plugged the values into the variables, our key became 24 bytes (1 + 13 + 5 + 5) long. Here's how:

Table prefix => type char = 1 byte (e.g. S)

Timestamp => type long = 13 bytes (e.g. 1386286913165)

X coordinate => type string = 5 bytes (e.g. 120.78 degrees, 31.87 degrees, etc.)

Y coordinate => type string = 5 bytes (e.g. 132.78 degrees, 33.75 degrees, etc.)
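To make that layout concrete, here is a minimal sketch of how such a key could be assembled with the oracle.kv Key API. The single-component concatenation and the decimal-point-free coordinate strings are assumptions about how the POC laid out the key, not the actual POC code:

import java.util.Arrays;
import oracle.kv.Key;

public class SpatialKey {

    /** Builds the "/S/{timestamp}{x}{y}" key described above. */
    public static Key build(long timestamp, String x, String y) {
        // "S" is the table prefix; the timestamp and both coordinates
        // are concatenated into the second major key component.
        return Key.createKey(Arrays.asList("S", timestamp + x + y));
    }

    public static void main(String[] args) {
        // 1 + 13 + 5 + 5 = 24 bytes of key data, matching the math above.
        // Coordinates are assumed stored as 5 digits with the decimal
        // point stripped (120.78 -> "12078", 31.87 -> "03187").
        System.out.println(build(1386286913165L, "12078", "03187"));
    }
}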

With the amount of hardware we had available (for the POC), we could create only a four-shard cluster. So to store two billion records, we needed to store (2B records / 4 shards) 500 million records on each of the four shards. Using the DbCacheSize utility, we calculated that we would need about 32 GB of JE cache on each Replication Node (RN).

$java -d64 -XX:+UseCompressedOops -jar $KVHOME/lib/je.jar DbCacheSize -records 500000000 -key 24

=== Database Cache Size ===
 Minimum Bytes    Maximum Bytes    Description
---------------  ---------------   -----------
 29,110,826,240   32,019,917,056   Internal nodes only

But we knew that if we could shrink the key size (without losing any information), we could save a lot of memory and improve query time (as search is a function of the number of records and the size of each record) as well. So we built a simple encoding program that uses a range of 62 ASCII characters (0-9, a-z, A-Z) to encode any numeric value. You can find the program here or build your own, but what is important to note is that we were able to represent the same information with fewer bytes:

The 13-byte timestamp (e.g. 1386286913165) became 7 bytes (e.g. opc2BTn).

The 5-byte X/Y coordinates (e.g. 132.78) became 3 bytes each (e.g. a9X/6dF).
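For those who would rather roll their own, a minimal base-62 encoder along these lines might look like the sketch below. The alphabet ordering here is an assumption, so the exact output strings may differ from the examples above:

public class Base62 {

    // 62-character alphabet: digits, then lower case, then upper case.
    // Any fixed ordering works as long as encode and decode agree on it.
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /** Encodes a non-negative long as a base-62 string. */
    public static String encode(long value) {
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value % 62)));
            value /= 62;
        }
        return sb.reverse().toString();
    }

    /** Decodes a base-62 string back to a long. */
    public static long decode(String s) {
        long value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * 62 + ALPHABET.indexOf(s.charAt(i));
        }
        return value;
    }

    public static void main(String[] args) {
        // A 13-digit millisecond timestamp shrinks to 7 characters,
        // and a 5-digit coordinate (decimal point stripped) to 3.
        System.out.println(Base62.encode(1386286913165L)); // 7 characters
        System.out.println(Base62.encode(13278L));          // 3 characters
    }
}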

That is a 14-byte encoded key (1 + 7 + 3 + 3). So what's the fuss about shrinking the keys (it's just a 10-byte saving), you might ask? Well, we plugged the numbers into the DbCacheSize utility again, and this time the verdict was that we needed only 20 GB of JE cache to store the same half a billion records on each RN. That's nearly a 40% improvement (12 GB saved per Replication Node) and definitely an impressive start.

$java -d64 -XX:+UseCompressedOops -jar $KVHOME/lib/je.jar DbCacheSize -records 500000000 -key 14

=== Database Cache Size ===
 Minimum Bytes    Maximum Bytes    Description
---------------  ---------------   -----------
 16,929,008,448   19,838,099,264   Internal nodes only
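Putting the two pieces together, the compacted key could be assembled as sketched below, reusing the hypothetical Base62 helper above. The fixed-width zero padding is our assumption; some such alignment is needed so the concatenated components remain unambiguous when the key is parsed back:

import java.util.Arrays;
import oracle.kv.Key;

public class EncodedKeyDemo {

    // Left-pads an encoded component to a fixed width so every key
    // has the same layout (7 + 3 + 3 characters).
    private static String pad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        long ts = 1386286913165L; // 13-digit timestamp -> 7 characters
        long x  = 12078L;         // 120.78, decimal point stripped -> 3 characters
        long y  = 3187L;          // 31.87, decimal point stripped  -> 2, padded to 3

        String component = pad(Base62.encode(ts), 7)
                         + pad(Base62.encode(x), 3)
                         + pad(Base62.encode(y), 3);

        // 1 (prefix) + 7 + 3 + 3 = 14 bytes of key data, down from 24.
        Key key = Key.createKey(Arrays.asList("S", component));
        System.out.println(key + " (" + (1 + component.length()) + " bytes)");
    }
}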

To conclude: you have just seen how a simple encoding technique can save you big time when you are dealing with billions of records. Next time you design an index key, think a little harder about how you can shrink it down!


Comments (2)
  • Fadi (itoctopus) Tuesday, December 10, 2013

    I like the idea of shrinking the keys - this can really be done across the board and for every key imaginable. The thing is, in most cases, keys are just an integer, so there is not much gain in shrinking that key.

    By the way, I'm wondering why you are only allowing alphanumeric characters in your encoding. You can easily have the 7 bytes as 5 bytes if you go with all the characters.


  • guest Tuesday, December 10, 2013

    Hi Fadi,

    First, I am glad you found the post useful. And yes, it can be applied across the board: even ints and longs are 4-8 bytes long and can be shrunk by a byte or so. In NoSQL Database, keys (major/minor) are constructed by concatenating string components, so even if you use integers as keys, they are all going to be represented as string components, and this simple technique can be used to make your keys compact.

    And I agree with your observation: if I used all the characters, the encoding would be much smaller. I just wanted to begin with numerics and alphabets.

    Thanks for your comment.

    -Anuj

