Monday Mar 18, 2013

Optimized Storage for DB Workloads

Today’s Enterprise database workloads require storage devices that must embed a variety of features so that the optimal level of performance, manageability and operational agility can be achieved. Even if most Enterprise Database products have the ability to rely on their own file system to optimize access to storage, there is much more to do on the storage side in order to bring the database workloads at the level that is expected from today’s demanding applications and users.

A possible approach to this problem is to fully integrate the storage devices into an engineered stack, which includes all the components required to build a high performance database system (software, servers, network and storage). This is exactly what Oracle does with the Exadata Engineered System which embeds dedicated and highly optimized storage nodes that actually take in charge a part of the database workload in a way that would not be possible with a standalone storage device sitting next to the database engine. As a result, the performance obtained cannot be beaten by a traditional architecture.

However, one size does not fit all, and there is still a need to deliver storage systems that can be used as primary storage for databases while guaranteeing best in class performance and ease of integration. There are several requirements that must be satisfied in order to achieve that goal:

  • The Storage System must be able to meet the high IO requirements of the database in the most efficient manner;
  • High throughput must also be guaranteed for sequential access to the database as well as data protection tasks;
  • Best in class compression capabilities is a must if we consider the current growth of structured data in the IT centers. While traditional compression algorithms typically found in disk subsystem controllers can do a good job, a more specialized compression mechanism aligned with the database structure can go much further in terms of data reduction.
  • Snapshot and cloning at the storage level are absolutely mandatory features these days so that data protection, disaster recovery and database quick duplication tasks can be expedited as efficiently as possible;
  • A versatile connectivity must be provided by the storage system in order to guarantee a quick and easy integration in various production environments. At the top of that, protocols that have the ability to transport specific database metadata to and from the storage can benefit from an additional performance increase compared to non-optimized storage devices;
  • Finally, the “observability” of the storage system, i.e. the ability an administrator has to monitor and diagnose the storage device, becomes of critical importance if we consider the consolidation trend in the data center that make the management of the database storage system always more complex and resource consuming.

Oracle’s Sun ZFS Storage Appliance is a storage subsystem that brings very convincing answers to all these requirements. Let’s detail step by step why the ZFS-SA constitutes an excellent device to host the data of an Oracle Enterprise Database.

An enterprise database workload actually requires very different IO profiles in order to provide a consistent performance for all the different data exchanges that take place between the database engine and the storage. The table below summarizes the different data types in a database and their IO profile in terms of IO throughput (IOPS), IO bandwidth (MB/sec), access type (random, sequential or mixed) and IO bias (read, write or mixed).

Data Type IO Throughput IO Bandwidth Access Type IO Bias 
Control Files High  Moderate  Mixed  Mixed 
Database Index Very High  Moderate to High  Mixed  Mixed 
Database Tables Moderate  Possibly High  Mixed  Mixed 
Temporary Files Moderate Medium Write  Write 
Redo Log Files Quite High High Sequential  Write 
Archive Log Files Moderate High  Sequential  Write 

It is clear that it is a challenging task for a given storage subsystem to provide optimal performances for all these different profiles at the same time. Let’s now try to see how the ZFS-SA’s controller architecture would behave when having to cope with two typical database workloads, OLTP and Datawarehouse.

In an OLTP environment, we can expect many random reads which are repeated on subsets of data (within index and hot tables) as well as a bunch of synchronous writes (for table updates) that need to be quickly acknowledged. At the top of that, the redo logs might experience bursts of activity that need to be efficiently processed as sequential writes by the storage subsystem. The ZFS Storage Appliance architecture brings optimal answers to these concurrent requirements. This is in large part due to the Hybrid Storage Pool (HSP), a read and write caching mechanism based on a mix of DRAM, read-oriented SSD, write-oriented SSD and HDD which optimize data placement and retrieval. The HSP dynamically and intelligently caches active data over multiple tiers of storage media, without the need for administrator intervention. As a result, the HSP provides unprecedented caching benefits for both reads and writes without the traditional need for lots of Hard Disk Drive (HDD) spindles demonstrated by traditional vendors.

This diagram on the left shows the HSP in action with in blue the read activity, in red the update write activity and in orange the redo log writes. Assuming a large DRAM is available, we expect that many reads will result in cache hits at ARC level. ARC (ZFS Adaptive Replacement Cache) is the first level of read cache sitting in DRAM, and it must be maximized for high OLTP workloads in order to provide exceptional performance. We typically see 80% cache hits in ARC in normal workloads and even on benchmarks designed to be cache unfriendly, we still see 70% on average.

If data is not present in cache, it can be in the second level cache (L2ARC) on read-oriented SSD called ReadZillas.

For some OLTP workloads, it might be a good idea not to use ReadZilla and leave all DRAM for the first level cache (since using ReadZilla will consume some of the RAM to manage the second level cache). OLTP update writes on the ZFS Storage Appliance are always synchronous writes, which means that they must be stored in non-volatile memory before being acknowledged. This is what the ZIL (ZFS Intent Log) can achieve very efficiently when it is set to reside on write-oriented SSD – ZIL on SSD is called LogZilla, and it is actually more than a simple SSD. It contains a DRAM buffer, which is secured by a Super Capacitor, as well as a controller and SLC Flash memory. For that type of write, the share holding the database table spaces in the ZFS-SA must have the Synchronous Write Bias parameter set to “latency” in order to guarantee that the ZIL is resides in the LogZilla SSD.

As show on the diagram, writes are performed simultaneously in the ARC and ZIL (step #1 in the diagram), and this is only when the write is securely performed into the ZIL (at the speed of the SSD flash memory) that the acknowledgement is returned to the application. In a second time, bunches of writes are flushed to the HDD pool on a regular basis. This is marked as step #2 in the diagram. The net result is that the database update writes can be very efficiently performed since they do not have to go to rotating disks before being acknowledged, resulting in a spectacular performance improvement without the risk of a non-committed write in DRAM cache. For OLTP workloads with important update activity, it is therefore advised to maximize the LogZilla size.

Then comes the question of the Redo Logs which is a totally different beast… Redo Logs need sequential writes with a high IO bandwidth, and on that aspect this is a totally different IO profile compared to the update writes we have just reviewed. Fortunately, the ZFS-SA brings an elegant and easy solution: all what we have to do is to store the Redo Logs on a share which has the Synchronous Write Bias parameter set to “throughput”. This will result in the ZIL to reside directly in the HDD pool, and not on SSD, bringing the throughput that we need to achieve efficient Redo Log handling. It must be noted that with our latest generation of SSD, we observe excellent performance for both random and sequential IO profiles, and, in many cases, it might not even be necessary to set the ZIL on HDD to achieve the required bandwidth.

Let’s see now how the ZFS-SA Hybrid Storage Pool would behave with a Datawarehouse workload where we can expect large sequential table reads as well as bulk loads on entire table partitions.

The diagram on the left shows in blue the read activity, in red the bulk load write activity and in orange the redo log writes. It is clear that in the case of table scans, level 2 read caching is not going to be useful since data is rarely accessed twice. However, level 1 caching in the DRAM ARC continues to be important since the ZFS-SA is able to perform “prefetching” of the data on disk, resulting in optimized reads on table rows during scans. Reads are also optimized when HCC compression is activated because of the decrease in IO activity it implies on the HDD pool side.

These two optimizations make the ZFS-SA an efficient device for a typical table scan workload which is extremely common in a Datawarehouse environment.

Let’s review now how the writes are going to be handled in a typical Datawarehouse workload. The bulk load writes are still performed as synchronous writes, and in that case there is no interest in having the ZIL on SSD since we are going to require a high throughput in MB/sec rather than a high IO bandwidth in IO/sec. All what we have to do is to set the Synchronous Write Bias parameter to “throughput” for both shares supporting the data files and the Redo Logs in order to guarantee the best performance during bulk loads. All the writes (intent log, data flushed from ARC and Redo Logs) will therefore be performed directly in the HDD pool. It must be noted that for the ZFS-SA, sequential writes on rotating disks in the HDD pool are always faster than sequential reads on HDD since a write will always use the closest block while a read will follow the block allocation from the copy-on-write mechanism.

Regarding RAID levels, the ZFS-SA offers a wide range of possibilities: mirrored, triple mirrored, striped, single parity RAID with narrow stripes, double parity RAID and triple parity RAID wide stripes. For a demanding OLTP or Datawarehouse workload, it is always advised to use mirrored or triple mirrored storage profiles in order to maximize performance and availability.

More to come in a further blog...