Loading Twitter Data into Berkeley DB Java Edition
A recent OTN Forum Post from Alexy Khrabrov talked about his Scala program to load Twitter data into Berkeley DB Java Edition (JE) and the challenges he was facing. Alexy is a researcher at the Thayer School of Engineering at Dartmouth.
Two things about his program are interesting. The first is that it is written in Scala. It is relatively easy to write Scala code that uses JE, and Alexy was also able to use the JE Direct Persistence Layer (DPL) annotations to define his classes. He has three entity classes, each with a primary index and two secondary indices. The second is that he runs his loads on rather large hardware: cache sizes of 20GB are typical. Several of our users run caches that big (some about 3x larger), but caches of that size still raise GC tuning issues. With some minor tuning we were able to cut the load time for one day's worth of data from 150 minutes to 25-35 minutes. Parsing the data accounts for approximately 10 of those 25-35 minutes.
To improve load performance, we disabled the cleaner and checkpointer, increased the log file size to 500MB, and used non-transactional, no-sync writes.
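As a rough illustration of those settings, here is a minimal sketch that collects the corresponding JE parameter names in a plain java.util.Properties object. It is not Alexy's actual code: the class and method names are hypothetical, and reading "500MB" as 500 * 1024 * 1024 bytes is an assumption. How his program requested no-sync writes is not stated in the post, so that part is left out.

```java
import java.util.Properties;

// Hypothetical helper, not from the original program: gathers the JE
// parameters that correspond to the load tuning described above.
final class JeLoadSettings {

    // Assumption: "500MB" is taken as 500 * 1024 * 1024 bytes.
    // je.log.fileMax is specified in bytes.
    static final long LOG_FILE_MAX_BYTES = 500L * 1024 * 1024;

    static Properties bulkLoadProperties() {
        Properties props = new Properties();
        // Turn off the log cleaner daemon during the load.
        props.setProperty("je.env.runCleaner", "false");
        // Turn off the checkpointer daemon during the load.
        props.setProperty("je.env.runCheckpointer", "false");
        // Use larger log files than the default.
        props.setProperty("je.log.fileMax", Long.toString(LOG_FILE_MAX_BYTES));
        // Open the environment without transaction support.
        props.setProperty("je.env.isTransactional", "false");
        return props;
    }
}
```

With JE on the classpath, the resulting object could be passed to the EnvironmentConfig(Properties) constructor, or the same keys could be placed in a je.properties file in the environment home directory. The cleaner and checkpointer would normally be re-enabled once the load is finished.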
For garbage collection, he uses the Concurrent Mark Sweep (CMS) collector.
His results have been good. He is seeing consistent load times of 25-35 minutes per day's worth of data, even after loading 13 days of data (16GB). Although this is by far not the largest database we have seen, it is gratifying to help a user achieve such a drastic improvement in performance.