Our customers often ask us: “What is the fastest and most efficient way to insert a large number of records into Oracle NoSQL Database?” Very recently, a shipping company reached out to us with a specific requirement: using Oracle NoSQL Database for their ship management application, which tracks the movements of the container ships that move their cargo from port to port. The cargo ships are all fitted with GPS and other tracking devices, which relay each ship's location to the application every few seconds.
The application is then queried for 1) the locations of all the ships, displayed on a map, and 2) a specific ship's trajectory over a given period of time, also displayed on the map. As the volume of location data grew, the company found it hard to scale the application and is now looking for a back-end system that can ingest this large data set very efficiently.
Historically, we have supported the option to execute a batch of operations for records that share the same shard key, which is what our large airline customer (Airbus) has done. They pre-sort the data by shard key and perform a multi-record insert whenever the shard key changes. Rather than sending and storing one record at a time, they send a large number of records in a single operation. This saves network trips, but they could only batch-insert records that shared the same shard key. With Oracle NoSQL Database release 3.5.2, we have added the ability to do a bulk insert, or bulk put, of records across different shards, allowing application developers to work more effectively with very large data sets.
The BulkPut API is available for both the Table and the Key/Value data model. The API provides significant performance gains over single-row inserts by reducing network round trips and by performing ordered batch inserts, on internally sorted data, across different shards in parallel. This feature is released in a controlled fashion, so there are no Javadocs available for this API with this release, but we encourage you to use it and give us feedback.
KVStore interface: loads Key/Value pairs supplied by special purpose streams into the store.

    public void put(List<EntryStream<KeyValue>> streams,
                    BulkWriteOptions bulkWriteOptions)

TableAPI interface: loads rows supplied by special purpose streams into the store.

    public void put(List<EntryStream<Row>> streams,
                    BulkWriteOptions bulkWriteOptions)

Parameters:

    streams - the streams that supply the rows to be inserted.
    bulkWriteOptions - arguments controlling the behavior of the bulk write operations.

Stream interface:

    public interface EntryStream<E> {
        String name();
        E getNext();
        void completed();
        void keyExists(E entry);
        void catchException(RuntimeException exception, E entry);
    }
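To make the contract above concrete, here is a minimal sketch of an application-side stream implementation. The `EntryStream` interface is reproduced locally so the sketch compiles without the Oracle NoSQL client jar; in a real application you would implement `oracle.kv.EntryStream` instead, and the `ListStream` class, the sample GPS rows, and the handle names in the comments are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BulkPutSketch {

    // Local copy of the EntryStream contract shown above, so this sketch is
    // self-contained; real code implements oracle.kv.EntryStream.
    interface EntryStream<E> {
        String name();            // identifies the stream (e.g. in logs)
        E getNext();              // next entry, or null when exhausted
        void completed();         // invoked once all entries are persisted
        void keyExists(E entry);  // invoked when the key is already present
        void catchException(RuntimeException exception, E entry);
    }

    // A stream that supplies entries from an in-memory list; production code
    // would typically read from a file or a message queue instead.
    static class ListStream<E> implements EntryStream<E> {
        private final String name;
        private final Iterator<E> it;
        int duplicates = 0;

        ListStream(String name, List<E> entries) {
            this.name = name;
            this.it = entries.iterator();
        }
        public String name() { return name; }
        public E getNext() { return it.hasNext() ? it.next() : null; }
        public void completed() { }
        public void keyExists(E entry) { duplicates++; } // skip duplicates
        public void catchException(RuntimeException e, E entry) {
            throw e; // fail fast; could also log and continue
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        rows.add("ship-1|12.97N|77.59E");
        rows.add("ship-2|51.50N|0.12W");

        EntryStream<String> stream = new ListStream<>("gps-feed", rows);

        // With the real client you would now hand one or more such streams
        // to the store in a single call (sketch only; consult the release's
        // documentation for the exact BulkWriteOptions constructor):
        //   tableAPI.put(Collections.singletonList(stream), bulkWriteOptions);
        System.out.println(stream.name() + " -> " + stream.getNext());
        // prints "gps-feed -> ship-1|12.97N|77.59E"
    }
}
```

Supplying several streams in the `put` call is what lets the client sort and route entries to different shards in parallel.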
We ran the YCSB benchmark with the new BulkPut API on a 3x3 (3 shards, each with 3 copies of the data) NoSQL cluster running on bare-metal servers, ingesting 50M records per shard, or 150M records across the store. We used 3 parallel threads per shard, or 9 (3x3) in total for the store, and 6 parallel input streams per storage node (SN), or 54 (6x9) in total across the store. The results of the benchmark run are shown in the graph below.
The graph above compares the throughput (ops/sec) of the Bulk and Simple Put APIs on a NoSQL store with 1000 partitions, under durability settings of None and Simple Majority. As the charts show, there is over a 100% increase in throughput with either durability setting.
We have uploaded a sample program to the GitHub repository that demonstrates how to use the BulkPut API in your application code; refer to the README file for details on running the program.
If you are looking at bulk loading data into Oracle NoSQL Database, the latest BulkPut API provides the fastest and most efficient way (as demonstrated by the YCSB benchmark) to ingest large amounts of data. Check it out now and download the latest version of Oracle NoSQL Database at: www.oracle.com/nosql.
I'd like to thank my colleague Jin Zhao for his inputs on the performance numbers.