Berkeley DB Java Edition: Why Database.preload() doesn't always help
By Charles Lamb on Oct 15, 2009
A customer sent us a simple program and an environment with data. The program opened the environment (approx. 2GB) and scanned the records of one of the databases in primary key order. The records had been inserted in random (i.e., non-key-sequential) order, and this caused lots of random IO during the scan. The customer wanted to know how to make the scan go faster. We suggested using Database.preload(), since that would sort the LSNs of all of the records in the database and then load the cache by reading the records in LSN order rather than key order. The customer's program set the cache size to a fixed 1200 * 1024 * 1024 bytes. Interestingly enough, the call to preload() made the overall time longer than just doing the scan and taking the hit from the random IO.
The reason is that preload() stops when it has filled the cache. In this case, 1.2GB was not large enough to hold all of the records in the database. Once preload() had filled the cache, it returned a status of PreloadStatus.FILLED_CACHE, after which the scan commenced. Whereas preload() had read the records in LSN order, the scan was reading them in key order (effectively random LSN order). Since the cache had already been filled by preload(), any cache miss by the scan caused something to be evicted from the cache, and if the evicted record had not already been used by the scan, the work done by preload() to load that record was wasted. So with too small a cache, some of the IO done by preload() was inevitably wasted, thereby lowering throughput.
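The effect can be reproduced with a toy model. The sketch below is not JE code and makes several assumptions that are not in the original report: the cache is a strict LRU over whole records, every record is the same size, and cost is measured only as a count of disk reads (it does not model sequential reads being cheaper than random ones). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class PreloadSimulation {

    /** Fixed-capacity LRU set of record keys; a simplified stand-in for the cache. */
    static final class LruCache extends LinkedHashMap<Integer, Boolean> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true: iteration order is least recently used first
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
            return size() > capacity; // evict the least recently used record when over capacity
        }
    }

    /** Total simulated disk reads for an optional preload followed by a full key-order scan. */
    static long totalReads(int recordCount, int cacheCapacity, boolean preload, long seed) {
        // keysInLsnOrder.get(i) is the key of the record at the i-th position on disk.
        // Shuffling models records that were inserted in random key order.
        List<Integer> keysInLsnOrder = new ArrayList<>();
        for (int key = 0; key < recordCount; key++) {
            keysInLsnOrder.add(key);
        }
        Collections.shuffle(keysInLsnOrder, new Random(seed));

        LruCache cache = new LruCache(cacheCapacity);
        long reads = 0;

        if (preload) {
            // Read records in LSN order and stop as soon as the cache is full.
            for (int key : keysInLsnOrder) {
                if (cache.size() >= cacheCapacity) {
                    break;
                }
                cache.put(key, Boolean.TRUE);
                reads++;
            }
        }

        // Scan in key order; every miss costs one read and may evict a preloaded record.
        for (int key = 0; key < recordCount; key++) {
            if (cache.get(key) == null) {     // get() also marks a hit as recently used
                cache.put(key, Boolean.TRUE);
                reads++;
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        int records = 1000;
        System.out.println("scan only:              " + totalReads(records, 600, false, 42L));
        System.out.println("preload, cache too small: " + totalReads(records, 600, true, 42L));
        System.out.println("preload, cache fits all:  " + totalReads(records, 1000, true, 42L));
    }
}
```

In this model a scan without preload reads each record exactly once. With preload and a cache that is too small, the total exceeds the record count, and the excess is exactly the number of preloaded records that were evicted before the scan reached them. When the cache holds everything, the total drops back to one read per record, and all of those reads happen in LSN order.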
Increasing the cache size to a level where preload() could load all of the records without filling the cache first resulted in a significant speed-up.
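A rough sizing check makes the mismatch visible before any IO is done. The sketch below is only back-of-the-envelope arithmetic under a loud assumption: it treats the "approx. 2GB" environment size as the amount of data to cache, whereas the real in-cache footprint of a single database can differ from its on-disk size. The helper name is hypothetical.

```java
class CacheFitCheck {

    /** Fraction of the data that could be cached at once, capped at 1.0. */
    static double fractionThatFits(long cacheBytes, long dataBytes) {
        if (dataBytes <= 0) {
            return 1.0; // nothing to load, so everything fits
        }
        return Math.min(1.0, (double) cacheBytes / dataBytes);
    }

    public static void main(String[] args) {
        long cacheBytes = 1200L * 1024 * 1024;     // the customer's setting: 1,258,291,200 bytes
        long dataBytes = 2L * 1024 * 1024 * 1024;  // assumption: "approx. 2GB" taken as the data size

        double fits = fractionThatFits(cacheBytes, dataBytes);
        System.out.println("fraction of data that fits in cache: " + fits);
        if (fits < 1.0) {
            System.out.println("preload would stop early; consider a larger cache or skipping preload");
        }
    }
}
```

Under that assumption the cache could hold well under two thirds of the data, which is consistent with preload() stopping early in the customer's run.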