jeudi sept. 17, 2009

Synchronous write bias property

With the release of 2009.Q3 release of fishworks along with a new iSCSI implemtation we're coming up with a very significant new feature for managing performance of Oracle database : the new dataset Synchronous write bias property or logbias for short. In a nutshell, this property takes the default value of Latency signifying that the storage should handle synchronous writes in urgency, the historical default handling. See Brendan's comprehensive blog entry on the Separate Intent Log and synchronous writes. However for datasets holding Oracle Datafiles, the logbias property can be set to Throughput signifying that the storage should avoid using log devices acceleration instead trying to optimize the workload's throughput and efficiency. We definitely expect to see a good boost to Oracle performance from this feature for many types of workloads and configs; workloads that generate 10s of MB/sec of DB writer traffic and have no more than 1 logzilla per tray/JBOD.

The property is set in the Share Properties just above database recordsize. You might need to unset the Inherit from projet checkbox in order to modify the settings on a particular share:

The logbias property addresses a peculiar aspect of Oracle workloads : namely that DB writers are issuing a large number of concurrent synchronous writes to Oracle datafiles, writes which individually are not particularly urgent. In contrast to other types of synchronous writes workloads, the more important metrics for DB Writers is not about individual latency. The important metric is that the storage keep up with the throughput demand in order to have database buffers always available for recycling. This is unlike redo log writes which are critically sensitive to latency as they are holding up individual transactions and thus users.

ZFS and the ZIL

A little background; with ZFS, synchronous writes are managed by the ZFS Intent Log ZIL. Because synchronous writes are typically holding up applications, it's important to handle those writes with some level of urgency and the ZIL does an admirable job at that.

In the Openstorage hybrid storage pool the ZIL itself is speeded up using low latency write-optimized SSD devices : the logzillas. Those devices are used to commit a copy of the in-memory ZIL transaction and retain the data until an upcoming transaction group commits the in-memory state to the on-disk pooled storage (Dynamics of ZFS, The New ZFS write throttle).

So while the ZIL speeds up synchronous writes, logzillas speeds up the zil. Now SSDs can serve IOPS at a blazing 100μs but also have their own throughput limits: currently around 110MB/sec per device. At that throughput, committing, for example, 40K of data will need minimally 360μs. The more data we can divert away from log devices, the lower the latency response of those devices will be.

It's interesting to note that other types of raid controllers will be hostage of their NVRAM and require, for consistency, that data be committed through some form of acceleration in order to avoid the Raid Write Hole (Bonwick on Raid-Z). ZFS, however, does not require that data passes through its SSD commit accelerator and it can manage consistency of commits either using disk or using SSDs.

Synchronous write bias : Throughput

With this newfound ability of storage administrators to signify to ZFS that some datasets will be subject to highly threaded synchronous writes for which global throughput is more critical than individual write latency, we can enable a different handling mode. By setting Logbias=Throughput ZFS is able to divert writes away from the Logzillas which are then preserved for servicing low latency sensitive operations (e.g. redo log operations).

  • A setting of Synchronous write bias : Throughput for a dataset allows synchronous writes to files in other datasets to have lower latency access to SSD log devices.
But that's not all. Data flowing through a logbias=Throughput dataset is still served by the ZIL. It turns out that the ZIL has different internal options in the way it can commit transactions one of which being tagged WR_INDIRECT. WR_INDIRECT commits issue an I/O for the modified file record and record a pointer to it in the zil chain. (see WR_INDIRECT in zil.c, zvol.c, zfs_log.c ).

ZIL transaction of type WR_INDIRECT might use more disk I/Os and slightly higher latency immediately but less I/Os and less total bytes during the upcoming transaction group update. Up to this point, the heuristics that lead to using WR_INDIRECT transactions, were not triggered by DB writer workloads. But armed with the knowledge that comes with the new logbias property, we're now less concerned about the slight latency increase that WR_INDIRECT can have. So from efficiency consideration the logbias=Throughput datasets are now set to use this mode leading to more leveled latency distributions of Transactions.

  • Synchronous write bias : Throughput is a dataset mode that reduces the number of I/Os that need to be issued on behalf of this dataset during the regular transaction group updates leading to more leveled response time.
A reminder that such kind of improvements sometimes can go unnoticed in sustained benchmarks if the downstream Transaction group destage is not given enough resources. Make sure you have enough spindles (or total disk KRPM) to sustain the level of performance you need. A pool with 2 logzillas and a single JBOD, might have enough SSD throughput to absorb DB writer workloads without adversely affecting redo log latency and so would not benefit from the special logbias settings, however for 1 logzillas per JBOD the situation might be reversed.

While the DB Record Size property is inherited by files in a dataset and is immutable, the logbias setting is totally dynamic and can be toggled on the fly during operations. For instance, during database creation or some lightly threaded write operations to Datafiles, it's expected that logbias=Latency should perform better.

Logbias deployments for Oracle

As of the 2009.Q3 release of fishworks, the current wisdom around deploying Oracle DB an Openstorage system with SSD acceleration, is to segregate, at the filesystem/dataset level, but within the single storage pool, Oracle datafiles, index files and redo Log files. Having each type of files in different dataset allows better observability into each one using the great analytics tool. But also, each dataset can then be tuned independantly to deliver the most stable performance characteristics. The most important parameter to consider is the ZFS internal recordsize used to manage the files. For Oracle datafiles the established (ZFS Best Practice) is to match the recordsize and the DB block size. For redo log files using default 128K records means that fewer file updates will be stradling multiple filesystem records. With 128K records we expect to have fewer transaction needing to wait for redo log input I/Os leading more leveled latency distribution for transactions. As for Index files, using smaller blocks of 8K offers better cacheability feature for both the primary and secondary caches (only cache what is used from indexes), but using larger blocks offers better index-scan performance. Experimenting is in order, depending on your use case, but an intermediate block size of maybe 32K might also be considered for mixed usage scenario.

For Oracle datafiles specifically, using the new setting of Synchronous write bias : Throughput has potential to deliver more stable performance in general and higher performance for redo log sensitive workloads.

Dataset Recordsize Logbias
Datafiles 8K Throughput
Redo Logs 128K(default)Latency(default)
Index 8K-32K?Latency(default)

Following these guidelines yielded a 40% boost in our Transaction processing testing in which we had 1 logzillas for a 40 disk pool.




« avril 2014

No bookmarks in folder