Wednesday Jul 12, 2006

ZFS and Directio









In view of the great performance gains that UFS gets out of the 'Directio' (DIO) feature, it is interesting to ask where exactly those gains come from and whether ZFS can be tweaked to benefit from them in the same way.


UFS Directio


UFS Directio is actually a set of things bundled together that improve the performance of very specific workloads, most notably databases. Directio is a performance hint to the filesystem and, apart from relaxing POSIX requirements, does not carry any change in filesystem semantics. Users of directio assert the condition at the level of the full filesystem or of an individual file, and the filesystem code is then given extra freedom to run, or not, the tuned DIO code path.


What does that tuned code path get us? A few things:

	- Output goes directly from application buffer to disk,
	  bypassing the filesystem core memory cache.

	- The FS is no longer constrained to strictly obey the POSIX
	  write ordering. The FS is thus able to allow multiple threads
	  to concurrently issue I/Os to a single file.

	- On input, UFS DIO refrains from doing any form of readahead.

In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os.


People often consider that the great gains that DIO provides come from avoiding the CPU cost of the copy into system caches and from avoiding the double buffering, once in the DB and once in the FS, that one gets in the non-directio case.


I would argue that while the CPU cost associated with a copy certainly does exist, the copy will run very quickly compared to the time the ensuing I/O takes. So the impact of the copy would only appear on systems whose CPUs are quite saturated, notably for industry standard benchmarks. However, real systems, which are more likely to be I/O constrained than CPU constrained, should not pay a huge toll to this effect.


As for double buffering, I note that databases (or applications in general) are normally set up to consume a given amount of memory and the FS operates using the remaining portion. Filesystems cache data in memory for lack of a better use of that memory, and they give up their hold whenever necessary. So the data is not so much double buffered as held by otherwise 'free' memory that keeps recently issued I/O around. Buffering data in 2 locations does not look like a performance issue to me.


Anything for ZFS ?


So what does that leave us with? Why is DIO so good? This tells me that we gain a lot from those 2 mantras:

		don't do any more I/O than requested

		allow multiple concurrent I/Os to a file.

I note that UFS readahead is particularly bad for certain usage: when UFS sees access to 2 consecutive pages, it will read a full cluster, and those are typically 1MB in size today. So avoiding UFS readahead has probably contributed greatly to the success of DIO. As for ZFS, there are 2 levels of readahead (a.k.a. prefetching): one that is file based and one that is device based. Both are being reworked at this stage. I note that the file-based readahead code has not behaved and will not behave like UFS. On the other hand, device-level prefetching is probably overly aggressive for DB type loads and should be avoided there. While I have not given up hope that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior.


DIO for input does not otherwise appear an interesting proposition since if the data is cached, I don't really see the gains in bypassing it (apart from slowing down the reads).


As for writes, ZFS, out of the box, does not suffer from the single writer lock that UFS needs to implement the POSIX ordering rules. The transaction groups (TXG) are sufficient for that purpose (see The Dynamics of ZFS).



This leaves us with the amount of I/O needed by the 2 filesystems when running many concurrent O_DSYNC writers issuing small writes to random file offsets.


UFS handles this load by overwriting the data in its preallocated disk locations. Every 8K page is associated with a set place on the storage, and a write to that location means a disk head movement and an 8K output I/O. This load should scale well with the number of disks in the storage and the 'random' IOPS capability of each drive. If a drive handles 150 random IOPS, then we can handle about 1MB/s/drive of output (150 x 8K is roughly 1.2MB/s).
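As a rough check of that per-drive figure, here is a minimal sketch of the arithmetic. It assumes exactly one 8K output I/O per application write and ignores any metadata traffic; the function name is mine and is not a UFS interface.

```python
def random_write_throughput_mb(iops_per_drive, io_size_bytes=8 * 1024, drives=1):
    """Estimate output bandwidth in MB/s when every write costs one random I/O."""
    return iops_per_drive * io_size_bytes * drives / 1e6


per_drive = random_write_throughput_mb(150)
ten_drives = random_write_throughput_mb(150, drives=10)
print(f"{per_drive:.2f} MB/s per drive, {ten_drives:.1f} MB/s for 10 drives")
```

The point of the sketch is the scaling: under this model throughput grows linearly with the number of drives but stays tied to their random IOPS rating, not to their streaming bandwidth.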


Now ZFS will behave quite differently. ZFS does not have preallocation of file blocks and will not, ever, overwrite live data. The handling of the O_DSYNC writes in ZFS will occur in 2 stages.


The 2 stages of ZFS


First at the ZFS Intent Log (ZIL) level, where we need to I/O the data in order to release the application blocked in a write call. Here the ZIL has the ability to aggregate data from multiple writes and issue fewer, larger I/Os than UFS would. Given the ZFS strategy of block allocation, we also expect those I/Os to be able to stream to the disk at high speed. We don't expect to be restrained by the random IOPS capabilities of the disks but rather by their streaming performance.


Next at the TXG level, we clean up the state of the filesystem, and here again the block allocation should allow a high rate of data transfer. At this stage there are 2 things we have to care about.


With the current state of things, we will probably see the data sent to disk twice, once to the ZIL and once to the pool. While this appears suboptimal at first, the aggregation and streaming characteristics of ZFS probably make the current situation already better than what UFS can achieve. We're also looking to see if we can make this even better by avoiding the 2 copies while preserving the full streaming performance characteristics.


For pool level I/O we must take care not to inflate the amount of data sent to disk, which could eventually cause early storage saturation. ZFS works out of the box with 128K records for large files. However, for DB workloads, we expect this to be tuned such that the ZFS recordsize matches the DB block size. We also expect the DB block size to be at least 8K. Matching the ZFS recordsize to the DB block size is a recommendation that is in line with what UFS DIO has taught us: don't do any more I/O than necessary.


Note also that with ZFS, because we don't overwrite live data, every block output needs to bubble up into metadata block updates, and so on. So there is some extra I/O that ZFS has to do, and depending on the exact test conditions the gains of ZFS can be offset by the extra metadata I/Os.


ZFS Performance and DB


Despite all the advantages of ZFS, the reason that performance data has been hard to come by is that we have to clear the road and bypass the few side issues that currently affect performance on large DB loads. At this stage, we do have to spend some time and apply magic recipes to get ZFS performance on databases to behave the way it's intended to.


But when the dust settles, we should be right up there in terms of performance compared to UFS/DIO, and improvement ideas are still plentiful. If you have some more, I'm interested...

Wednesday Jun 21, 2006

The Dynamics of ZFS






ZFS has a number of identified components that govern its performance. We review the major ones here.

Introducing ZFS


A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices, exporting the collection as a storage volume that behaves as a simple block device.


A filesystem is a layer that manages such a block device, using a subset of system memory, in order to provide filesystem operations (including POSIX semantics) to applications and a hierarchical namespace for the stored objects: files. Applications issue reads and writes to the filesystem, and the filesystem issues Input and Output (I/O) operations to the storage/block device.


ZFS implements those 2 functions at once. It thus typically manages sets of block devices (leaf vdevs), possibly grouping them into protected devices (RAID-Z or N-way mirror) and aggregating those top level vdevs into a pool. Top level vdevs can be added to a pool at any time. Objects that are stored onto a pool will be dynamically striped onto the available vdevs.


Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. Properties of a filesystem include the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it is not statically associated with any physical disk blocks and any of its settable properties can simply be changed dynamically.


Recordsize


The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length, in multiples of a disk sector (512 Bytes). Larger files are stored using multiple FSBs, each of recordsize bytes, with a default value of 128K.
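The storage rule just described can be sketched as a small function. This is only an illustration of the rule as stated here, assuming 512 byte sectors and a non-empty file; `block_layout` is a hypothetical helper, not a ZFS interface.

```python
import math

SECTOR = 512  # bytes, the sector size quoted in the text


def block_layout(file_size, recordsize=128 * 1024):
    """Return (number_of_blocks, block_size_in_bytes) for a file of file_size bytes.

    Files up to the recordsize use one block rounded up to whole sectors;
    larger files use as many recordsize blocks as needed.
    """
    if file_size <= recordsize:
        return 1, math.ceil(file_size / SECTOR) * SECTOR
    return math.ceil(file_size / recordsize), recordsize


print(block_layout(1000))                         # small file, one 1K block
print(block_layout(1_000_000))                    # large file, default recordsize
print(block_layout(1_000_000, recordsize=8192))   # same file, 8K recordsize
```

With the default recordsize the 1,000,000 byte file uses 8 blocks of 128K, while with an 8K recordsize it uses 123 blocks of 8K.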


The FSB is the basic file unit managed by ZFS and to which a checksum is applied. After a file grows to be larger than the recordsize (and gets to be stored with multiple FSBs), changing the filesystem's recordsize property will not impact the file in question. A copy of the file will inherit the tuned recordsize value. An FSB can be mirrored onto a vdev or spread across a RAID-Z device.


The recordsize is currently the only performance tunable of ZFS. The default recordsize may lead to early storage saturation: for many small updates (much smaller than 128K) to large files (bigger than 128K), the default value can cause an extra strain on the physical storage or on the data channel (such as a fibre channel link) connecting it to the host. For those loads, if one notices a saturated I/O channel, then tuning the recordsize to smaller values should be investigated.


Transaction Groups


The basic mode of operation for write operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc.) is that ZFS will absorb the operation in a per-host system cache called the Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a single ARC.


Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At a regular interval (default of txg_time = 5 seconds) the active TXG will shut down and the pool will issue a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens and file modifications then associate with the new active TXG.


If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes. In this situation we are syncing TXG N, while TXG N+1 is closed (due to memory limitations or the 5 second clock) and is waiting to sync itself; applications are throttled waiting to write to TXG N+2. We need sustained saturation of the storage or a memory constraint in order to throttle applications.
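The throttling rule in this paragraph can be written down as a toy state machine. This is a sketch of the description only, not of the ZFS code: the class and method names are hypothetical, and the model simply assumes one syncing TXG, at most one closed TXG waiting behind it, and writers blocked while one is waiting.

```python
class TxgModel:
    """Toy model of transaction group states as described in the text."""

    def __init__(self):
        self.open_txg = 1     # TXG currently accepting file modifications
        self.waiting = None   # closed TXG waiting for its turn to sync
        self.syncing = None   # TXG currently being written to the pool

    @property
    def writers_throttled(self):
        # Writers block while a closed TXG waits behind a running sync.
        return self.waiting is not None

    def close_open_txg(self):
        """Triggered by the 5 second clock or by too much dirty memory."""
        if self.writers_throttled:
            return  # nothing new can accumulate while writers are blocked
        if self.syncing is None:
            self.syncing = self.open_txg
        else:
            self.waiting = self.open_txg
        self.open_txg += 1

    def sync_done(self):
        """The running sync completed: the waiting TXG (if any) starts syncing."""
        self.syncing, self.waiting = self.waiting, None


m = TxgModel()
m.close_open_txg()            # TXG 1 starts syncing, writers continue in TXG 2
m.close_open_txg()            # TXG 2 closes while TXG 1 still syncs
print(m.syncing, m.waiting, m.open_txg, m.writers_throttled)
m.sync_done()                 # TXG 1 done, TXG 2 starts syncing
print(m.syncing, m.waiting, m.open_txg, m.writers_throttled)
```

In the model, throttling only appears when a second TXG closes before the previous sync has finished, which matches the remark that sustained storage saturation or a memory constraint is needed to throttle applications.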


A sync of the storage pool will involve sending all level 0 data blocks to disk, then, when done, all level 1 indirect blocks, and so on until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the uberblock to point to the new consistent state of the storage pool.


ZFS Intent Log (ZIL)


For file modifications that come with some immediate data integrity constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent log or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for the operation (such as an fsync), the ZIL will output blocks up to the sequence number. When the ZIL is in the process of committing data, further commit operations will wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones, thus performing commits using fewer, larger I/Os.
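The aggregation effect can be illustrated with a toy group-commit model. It is a sketch under simplifying assumptions (a fixed commit latency, one commit in flight at a time, every request that arrived while waiting goes into the next commit) and does not reproduce the ZIL code; `group_commit` is a hypothetical name.

```python
def group_commit(arrivals, commit_time):
    """Batch synchronous requests the way a single-commit-at-a-time log would.

    arrivals: sorted arrival times of synchronous writes (seconds).
    commit_time: assumed duration of one log commit (seconds).
    Returns the list of batches, one per log I/O.
    """
    batches = []
    now = 0.0   # time at which the log is free to start another commit
    i = 0
    while i < len(arrivals):
        start = max(now, arrivals[i])
        j = i
        # Everything that arrived before this commit starts rides along with it.
        while j < len(arrivals) and arrivals[j] <= start:
            j += 1
        batches.append(arrivals[i:j])
        now = start + commit_time
        i = j
    return batches


batches = group_commit([0.0, 0.1, 0.2, 0.3, 5.0], commit_time=1.0)
print(len(batches), "log I/Os for 5 writes:", batches)
```

Here the three writes that arrive while the first commit is in flight are absorbed by a single second commit, so 5 synchronous writes cost 3 log I/Os in this model.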


The ZIL works by issuing all the required I/Os and then flushing the write caches if those are enabled. This use of the disk write cache does not artificially improve a disk's commit latency because ZFS ensures that data is physically committed to storage before returning. However, the write cache allows a disk to hold multiple concurrent I/O transactions, and this acts as a good substitute for drives that do not implement tag queues.


CAVEAT: The current state of the ZIL is such that if there is a lot of pending data in a filesystem (written to the FS, not yet output to disk) and a process issues an fsync() for one of its files, then all pending operations will have to be sent to disk before the synchronous command can complete. This can lead to unexpected performance characteristics. Code is under review.


I/O Scheduler and Priorities


ZFS keeps track of pending I/Os but only issues a certain number of them to the disk controllers (35 by default). This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler then decides the next most important one to issue. The priority scheme is time based; so, for instance, an input I/O servicing a read call will be prioritized over any regular output I/O issued in the last ~0.5 seconds.
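One way to picture a bounded queue with time-based priorities is a deadline-ordered heap. The sketch below is my own illustration of the behavior described here: the deadline rule (reads are due immediately, regular writes 0.5 seconds after submission) is an assumption chosen to reproduce the stated effect, and `DeviceQueue` is not a ZFS structure.

```python
import heapq
import itertools

MAX_PENDING = 35    # I/Os outstanding at the device, per the text
WRITE_DELAY = 0.5   # seconds; assumed head start given to reads over writes


class DeviceQueue:
    """Toy per-device scheduler: bounded in-flight set plus a deadline heap."""

    def __init__(self):
        self._heap = []                 # (deadline, sequence, label)
        self._seq = itertools.count()   # tie-breaker keeps FIFO order
        self.in_flight = []             # labels of I/Os issued to the device

    def submit(self, kind, now, label):
        deadline = now if kind == "read" else now + WRITE_DELAY
        heapq.heappush(self._heap, (deadline, next(self._seq), label))
        self._issue()

    def complete(self, label):
        self.in_flight.remove(label)
        self._issue()

    def _issue(self):
        # Fill free device slots with the most urgent pending I/Os.
        while self._heap and len(self.in_flight) < MAX_PENDING:
            _, _, label = heapq.heappop(self._heap)
            self.in_flight.append(label)


q = DeviceQueue()
for n in range(40):
    q.submit("write", 0.0, f"w{n}")   # 35 issued, 5 left pending
q.submit("read", 0.2, "r0")           # queued behind a full device
q.complete("w0")                      # the freed slot goes to the read
print("r0" in q.in_flight, "w35" in q.in_flight)
```

A read submitted at t = 0.6 would carry a later deadline than the writes queued at t = 0, so those older writes would go first; that is the same mechanism by which a long-standing output backlog eventually regains priority.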


The fact that ZFS limits each leaf device's I/O queue to 35 is one of the reasons that suggest that a zpool should be built using vdevs that are individual disks, or at least volumes that map to a small number of disks. Otherwise this self-imposed limit could become an artificial performance throttle.


Read Syscalls


If a read cannot be serviced from the ARC cache, ZFS will issue a 'prioritized' I/O for the data. So even if the storage is handling a heavy output load, there are only 35 I/Os outstanding, all with reasonable service times. As soon as one of the 35 I/Os completes, the I/O scheduler will issue the read I/O to the controller. This ensures good service times for read operations in general.


However, to avoid starvation, when there is a long-standing backlog of output I/Os, those eventually regain priority over the input I/Os. ZIL synchronous I/Os have the same priority as synchronous reads.


Prefetch


The prefetch code, which allows ZFS to detect sequential or strided access to a file and issue I/O ahead of the application, is currently under review. To quote the developer: "ZFS prefetching needs some love".


Write Syscalls


ZFS never overwrites live data on-disk and will always output full records validated by a checksum. So in order to partially overwrite a file record, ZFS first has to have the corresponding data in memory. If the data is not yet cached, ZFS will issue an input I/O before allowing the write(2) to partially modify the file record. With the data now in cache, more writes can target the blocks. On output ZFS will checksum data before sending to disk. For full record overwrite the input phase is not necessary.
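The rule for when a write needs a prior input I/O can be sketched as follows. This is an illustration only: it assumes every touched record already exists on disk, treats the cache as a simple set of record indexes, and uses a hypothetical function name.

```python
def records_needing_input(offset, length, recordsize=128 * 1024, cached=frozenset()):
    """List the record indexes that must be read before a write can proceed.

    A record needs an input I/O when the write covers it only partially
    and it is not already in the cache.
    """
    end = offset + length
    first = offset // recordsize
    last = (end - 1) // recordsize
    need = []
    for rec in range(first, last + 1):
        rec_start = rec * recordsize
        rec_end = rec_start + recordsize
        fully_covered = offset <= rec_start and end >= rec_end
        if not fully_covered and rec not in cached:
            need.append(rec)
    return need


print(records_needing_input(0, 8192))                    # 8K into a 128K record
print(records_needing_input(0, 8192, recordsize=8192))   # aligned full record
print(records_needing_input(4096, 8192, recordsize=8192))  # misaligned write
```

An aligned 8K write into 8K records needs no input at all, while the same write shifted by 4K touches two records partially and needs both.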


CAVEAT: Simple write calls (not O_DSYNC) are normally absorbed by the ARC cache and so proceed very quickly. A sustained dd(1)-like load of this kind can quickly overrun a large amount of system memory and cause transaction groups to eventually throttle all applications for long periods of time (tens of seconds). This is probably what underlies the notion that ZFS needs more RAM (it does not). Write throttling code is under review.


Soft Track Buffer


An input I/O is serious business. While a Filesystem can decide where to write stuff out on disk, the Inputs are requested by applications. This means a necessary head seek to the location of the data. The time to issue a small read will be totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations and so, for uncached reads, ZFS normally will issue a fairly large Input I/O (64K by default). This will help loads that input data using similar access pattern to the output phase. The data goes into a per device cache holding 20MB.


This cache can be invaluable in reducing the I/Os necessary to read in data. But just like the recordsize, if the inflated I/Os cause storage channel saturation, the Soft Track Buffer can act as a performance throttle.


The ARC Cache


The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache and is inspired by a paper of Megiddo/Modha presented at FAST'03 Usenix conference.


The ARC manages its data keeping a notion of Most Frequently Used (MFU) and Most Recently Used (MRU), balancing intelligently between the two. One of its very interesting properties is that a large scan of a file will not destroy most of the cached data.


On a system with free memory, the ARC will grow as it starts to cache data. Under memory pressure the ARC will return some of its memory to the kernel until low memory conditions are relieved.


We note that while ZFS has behaved rather well under 'normal' memory pressure, it does not appear to behave satisfactorily under swap shortage. The memory usage pattern of ZFS is very different from that of other filesystems such as UFS, and so it exposes VM layer issues in a number of corner cases. For instance, a number of kernel operations fail with ENOMEM without even attempting a reclaim operation. If they did, ZFS would respond by releasing some of its own buffers, allowing the initial operation to then succeed.


The fact that ZFS caches data in the kernel address space does mean that the kernel size will be bigger than when using traditional filesystems. For heavy duty usage it is recommended to use a 64-bit kernel, i.e. any SPARC system or an AMD system configured in 64-bit mode. Some systems that have managed in the past to run without any swap configured should probably start to configure some.


The behavior of the ARC in response to memory pressure is under review.


CPU Consumption


Recent enhancements to ZFS have improved its CPU efficiency by a large factor. We don't expect to deviate much from other filesystems in terms of cycles per operation. ZFS checksums all disk blocks, but this has not proven to be costly at all in terms of CPU consumption.


ZFS can be configured to compress on-disk blocks. We do expect to see some extra CPU consumption from that compression. While it is possible that compression could lead to some performance gain due to reduced I/O load, the emphasis of compression should be to save on-disk space not performance.


What About Your Test ?


This is what I know about the ZFS performance model today. My performance comparison on different types of modelled workloads made last fall already had ZFS ahead on many of them; we have improved the biggest issues highlighted then and there are further performance improvements in the pipeline (based on UFS, we know this will never end). Best Practices are being spelled out.
You can contribute by comparing your actual usage and workload pattern with the simulated workloads. But nothing will beat having reports from real workloads at this stage; your results are therefore of great interest to us. And watch this space for updates...


Wednesday Jun 07, 2006

Tuning ZFS recordsize

One important performance parameter of ZFS is the recordsize, which governs the size of filesystem blocks for large files. This is the unit that ZFS validates through checksums. Filesystem blocks are dynamically striped onto the pooled storage, on a block to virtual device (vdev) basis.

It is expected that for some loads, tuning the recordsize will be required. Note that in traditional filesystems such a tunable would govern the behavior of all of the underlying storage. With ZFS, tuning this parameter only affects the tuned filesystem instance; it will apply to newly created files. The tuning is achieved using

zfs set recordsize=64k mypool/myfs

In ZFS all files are stored either as a single block of varying size (up to the recordsize) or using multiple recordsize blocks. Once a file grows to be multiple blocks, its blocksize is definitively set to the FS recordsize at the time.

Some more experience will be required with the recordsize tuning. Here are some elements to guide along the way.

If one considers the input of an FS block, typically in response to an application read, the size of the I/O in question will not impact the latency by much. So, as a first approximation, the recordsize does not matter (I'll come back to that) to read-type workloads.

FS block outputs, those that are governed by the recordsize, actually occur mostly asynchronously with the application; and since applications are not commonly held up by those outputs, the delivered throughput is, as for read-type loads, not impacted by the recordsize.

So the first approximation is that recordsize does not impact performance much. To service loads that are transient in nature with short I/O bursts (< 5 seconds) we do not expect recordsize tuning to be necessary. The same can be said for sequential type loads.

So what about the second approximation? A problem that can occur with using an inflated recordsize (128K) compared to application read/write sizes is early storage saturation. If an application requests 64K of data, then providing a 128K record doesn't change the latency that the application sees by much. However, if the extra data is discarded from the cache before ever being read, the data channel was occupied by that extra data for no good reason. If a limiting factor to the storage is, for instance, a 100MB/sec channel, I can handle roughly 700 128K records per second on that channel. If I halve the recordsize, that should double the number of small records I can input.
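A quick sketch of the channel arithmetic, assuming an idealized 100 MB/s channel (taken here as 10^8 bytes per second) that carries nothing but record payload. Real channels also carry protocol overhead, so achievable rates are somewhat lower than this upper bound; the function name is mine.

```python
def records_per_second(channel_bytes_per_sec, recordsize):
    """Upper bound on records per second that fit through the channel."""
    return channel_bytes_per_sec / recordsize


CHANNEL = 100_000_000  # assumed: 100 MB/s expressed in bytes per second

r128 = records_per_second(CHANNEL, 128 * 1024)
r64 = records_per_second(CHANNEL, 64 * 1024)
print(f"128K records: {r128:.0f}/s, 64K records: {r64:.0f}/s")
```

The absolute numbers matter less than the ratio: for an application that only wants a small part of each record, halving the recordsize doubles the number of useful records the same channel can deliver.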

On small record output loads, the system memory creates a buffer that defers the direct impact on applications. For output, if the storage is saturated this way for tens of seconds, ZFS will eventually throttle applications. This means that, in the end, when the recordsize leads to sustained storage overload on output, there will be an impact as well.

There is another aspect to the recordsize. A partial write to an uncached FS block (a write syscall of size smaller than the recordsize) will have to first input the corresponding data. Conversely, when individual writes are such that they cover full filesystem recordsize blocks, those writes can be handled without the need to input the associated FS blocks. Other considerations (metadata overhead, caching) dictate however that the recordsize not be reduced below a certain point (16K to 64K; do send in your experience).

So, one piece of advice is to keep an eye on the channel throughput and tune the recordsize for random access workloads that saturate the storage. Sequential type workloads should work quite well with the current default recordsize. If the applications' read/write sizes can be increased, that should also be considered. For non-cached workloads that overwrite file data in small aligned chunks, matching the recordsize with the write access size may bring some performance gains.



Tuesday Jun 06, 2006

DOES ZFS REALLY USE MORE RAM ?




I'll touch on 3 aspects of that question here:

- reported freemem

- syscall writes to mmap pages

- application write throttling

Reported freemem will be lower when running with ZFS than with, say, UFS. The UFS page cache is counted as freemem. ZFS will return its 'cache' only when memory is needed. So you will operate with lower freemem but won't normally suffer from this.

It's been wrongly feared that this mode of operation puts us back to the days of Solaris 2.6 and 7, where we saw a roller coaster effect on freemem leading to sub-par application performance. We actually DO NOT have this problem with ZFS. The old problem came about because the memory reaper could not distinguish between a useful application page and a UFS cached page. That was bad. ZFS frees up its cache in a way that does not cause this problem.

ZFS is designed to release some of its memory when kernel modules exert back pressure onto the kmem subsystem. Some kernel code that did not properly exert that pressure was recently fixed (short description here: 4034947).

There is one peculiar workload that does lead ZFS to consume more memory: writing (using syscalls) to pages that are also mmapped. ZFS does not use the regular paging system to manage data that passes through read and write syscalls. However, mmapped I/O, which is closely tied to the Virtual Memory subsystem, still goes through the regular paging code. So syscall writing to mmapped pages means we will keep 2 copies of the associated data, at least until we manage to get the data to disk. We don't expect that type of load to commonly use large amounts of RAM.

Finally, one area where ZFS will behave quite differently from UFS is in throttling writers. With UFS, until not long ago, we throttled a process trying to write to a file as soon as that file had 0.5 MB of I/O pending associated with it. This limit has recently been upped to 16 MB. The gain of such throttling is that we prevent an application working on a single file from consuming an inordinate amount of system memory. The downside is that we throttle an application, possibly unnecessarily, when memory is plentiful.

ZFS will not throttle individual apps like this. The scheme is shared between all writers: when the global load of application data overflows the I/O subsystem for 5 to 10 seconds, then we throttle the applications, allowing the I/O to catch up. Applications thus have a lot more RAM to play with before being throttled.

This is probably what's behind the notion that ZFS likes more RAM. By and large, to cache some data, ZFS just needs the equivalent amount of RAM as any other filesystem. But currently, ZFS lets applications run a lot more decoupled from the I/O subsystem. This can speed up some loads by very large factor, but at times, will appear as extra memory consumption.

Wednesday May 31, 2006

WHEN TO (AND NOT TO) USE RAID-Z




RAID-Z is the technology  used by ZFS  to implement a data-protection  scheme
which is less  costly  than  mirroring  in  terms  of  block
overhead.

Here,  I'd  like  to go  over,    from a theoretical standpoint,   the
performance implication of using RAID-Z.   The goal of this technology
is to allow a storage subsystem to be able  to deliver the stored data
in  the face of one  or more disk   failures.  This is accomplished by
joining  multiple disks into  a  N-way RAID-Z  group. Multiple  RAID-Z
groups can be dynamically striped to form a larger storage pool.

To store file data onto  a RAID-Z group, ZFS  will spread a filesystem
(FS) block onto the N devices that make up the  group.  So for each FS
block,  (N - 1) devices  will  hold file  data  and 1 device will hold
parity  information.   This information  would eventually   be used to
reconstruct (or  resilver) data in the face  of any device failure. We
thus  have 1 / N  of the available disk  blocks that are used to store
the parity  information.   A 10-disk  RAID-Z group  has 9/10th of  the
blocks effectively available to applications.

A common alternative for data protection, is  the use of mirroring. In
this technology, a filesystem block is  stored onto 2 (or more) mirror
copies.  Here again,  the system will  survive single disk failure (or
more with N-way mirroring).  So 2-way mirror actually delivers similar
data-protection at   the expense of   providing applications access to
only one half of the disk blocks.

Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices. That means that a single ZFS block I/O must be
converted to N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input, as the parity data need not generally be read in.

Now after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued. At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones. Because of the ZFS
Copy-On-Write (COW) design, we actually do expect this reduction in
the number of device level I/Os to work extremely well for just about
any write intensive workload. We also expect it to help streaming input
loads significantly. The situation of random inputs is one that needs
special attention when considering RAID-Z.

Effectively,  as  a first approximation,  an  N-disk RAID-Z group will
behave as   a single   device in  terms  of  delivered    random input
IOPS. Thus  a 10-disk group of devices  each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group.  This is the price to
pay to achieve proper data  protection without  the 2X block  overhead
associated with mirroring.

With 2-way mirroring, each FS block output must  be sent to 2 devices.
Half of the available IOPS  are thus lost  to mirroring.  However, for
Inputs each side of a mirror can service read calls independently from
one another  since each  side   holds the full information.    Given a
proper software implementation that balances  the inputs between sides
of a mirror, the  FS blocks delivered by a  mirrored group is actually
no less than what a simple non-protected RAID-0 stripe would give.

So, looking at random access input load and the number of FS blocks per
second (FSBPS): given N devices to be grouped either in RAID-Z, 2-way
mirrored or simply striped (a.k.a. RAID-0, no data protection!), the
equation would be (where dev represents the capacity of a single
device, in terms of blocks or of IOPS):

                                        Random
                Blocks Available        FS Blocks / sec
                ----------------        ---------------
RAID-Z          (N - 1) * dev           1 * dev
Mirror          (N / 2) * dev           N * dev
Stripe          N * dev                 N * dev


Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below the
configuration labeled:
	
	"Z 5 x (19+1)"

refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disk + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.


                                                Random
         Config           Blocks Available      FS Blocks / sec
         -------------    ----------------      ---------------
         Z  1 x (99+1)         9900 GB                 200
         Z  2 x (49+1)         9800 GB                 400
         Z  5 x (19+1)         9500 GB                1000
         Z 10 x (9+1)          9000 GB                2000
         Z 20 x (4+1)          8000 GB                4000
         Z 33 x (2+1)          6600 GB                6600

         M  2 x (50)           5000 GB               20000
         S  1 x (100)         10000 GB               20000


So RAID-Z gives you at most 2X the number of blocks that mirroring
provides but hits you with far fewer delivered IOPS. That means
that, as the number of devices N in a group increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (the group delivers between N/2
and N times fewer IOPS).
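The table entries follow directly from the first-approximation model above, and a small sketch makes that easy to replay with other disk counts. The helper names are mine; the defaults (100 GB, 200 IOPS per disk) are the ones used in this example, and the model ignores everything except random input load.

```python
def raidz(groups, data_disks, disk_gb=100, disk_iops=200):
    """(usable GB, random-read FS blocks/sec) for striped RAID-Z groups.

    Each group has data_disks + 1 devices and delivers one device's IOPS.
    """
    return groups * data_disks * disk_gb, groups * disk_iops


def mirror2(disks, disk_gb=100, disk_iops=200):
    """2-way mirroring: half the blocks, every disk can serve reads."""
    return (disks // 2) * disk_gb, disks * disk_iops


def stripe(disks, disk_gb=100, disk_iops=200):
    """Plain dynamic stripe, no data protection."""
    return disks * disk_gb, disks * disk_iops


configs = {
    "Z 10 x (9+1)": raidz(10, 9),
    "Z 33 x (2+1)": raidz(33, 2),
    "M  2 x (50)": mirror2(100),
    "S  1 x (100)": stripe(100),
}
for name, (gb, fsbps) in configs.items():
    print(f"{name:14s} {gb:6d} GB {fsbps:6d} FS blocks/sec")
```

Looking at a single group of 10 disks under the same model, RAID-Z offers 900 GB against 500 GB for mirroring (1.8X the blocks) but 200 against 2000 random FS blocks per second (10X fewer), which is the bounded-gain, unbounded-cost trade-off described above.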

Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of devices (typically 512 Bytes) and dynamically adjusts
the effective number of columns in a stripe. So even if you request a
99+1 configuration, the actual data will probably be stored on far
fewer data columns than that. Hopefully this article will contribute
to steering deployments away from those types of configuration.

In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups should be kept small and one must accept some level of
disk block overhead.

When performance matters most, mirroring should be highly favored.  If
mirroring  is considered too   costly but performance  is nevertheless
required, one could proceed like this:

	Given N devices each capable of X IOPS.

	Given a target of delivered  Y FS blocks per second
	for the storage pool.

	Build your storage using (Y / X) dynamically striped RAID-Z
	groups, each made of N / (Y / X) devices.

For instance:

	Given 50 devices each capable of 200 IOPS.

	Given a target of delivered 1000 FS blocks per second
	for the storage pool.

	Build your storage using (1000 / 200) = 5 dynamically striped
	RAID-Z groups, each made of 50 / 5 = 10 devices.

In that system we would then have 10% block overhead lost to maintain
RAID-Z level parity.
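
To check a candidate layout against that target, here is a small
sketch using the same simple model (one device's worth of random FS
blocks per second per RAID-Z group, single parity). The helper names
are hypothetical and only serve the illustration.

```python
def evaluate_layout(n_devices, group_size, dev_iops):
    """Return (groups, random FS blocks/sec, parity overhead) for a pool."""
    groups = n_devices // group_size
    fsbps = groups * dev_iops          # one device's worth of IOPS per group
    parity_overhead = 1 / group_size   # one parity disk per group
    return groups, fsbps, parity_overhead


def widest_group_for_target(n_devices, dev_iops, target_fsbps):
    """Largest group size that still delivers target_fsbps."""
    groups_needed = -(-target_fsbps // dev_iops)   # ceiling division
    return n_devices // groups_needed


# 50 devices of 200 IOPS each, target of 1000 FS blocks per second.
print(widest_group_for_target(50, 200, 1000))
for size in (5, 10):
    print(size, evaluate_layout(50, size, 200))
```

Any group size no wider than the value returned by
widest_group_for_target() meets the target; narrower groups deliver
more IOPS at the price of more blocks spent on parity.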

RAID-Z is a great technology not only when disk blocks are your most
precious resource but also when your available IOPS far exceed your
expected needs. But beware that if you end up with fewer, very
large disks, the IOPS capacity can easily become your most precious
resource. Under those conditions, mirroring should be strongly favored,
or alternatively a dynamic stripe of RAID-Z groups, each made up of a
small number of devices.


Tuesday May 16, 2006

128K Suffice

I argue that a 128K I/O size is sufficient to extract the most out of a disk, given enough concurrent I/Os.

Thursday Dec. 15, 2005

Beware of the Performance of RW Locks

In my naive little mind, a RW lock would represent a performant, scalable construct inasmuch as WRITERS do not hold the lock for a significant amount of time. One figures that the lock would be held for short WRITER times followed by concurrent execution of RW_READERS.

What I recently found out is quite probably well known to seasoned kernel engineers, but it was new to me, so I figured it could be of interest to others.

The SETUP



So Reader/Writer locks (RW) can be used in kernel and user level code to allow multiple READERS of, for instance, a data structure, to access the structure while allowing only a single WRITER at a time within the bounds of the rwlock().

A RW lock (rwlock(9F), rwlock(3THR)) is more complex than a simple mutex, so acquiring such a lock is more expensive. This means that if the expected hold time of a lock is quite small (say, to update or read 1 or 2 fields of a structure), then regular mutexes can usually do that job very well. A common programming mistake is to expect faster execution from RW locks in those cases.

However, when READ hold times need to be fairly long, RW locks represent an alternative construct. With those locks we expect to have multiple READERS executing concurrently, thus leading to performant code that scales to large numbers of threads. As I said, if WRITERS are just quick updates to the structure, we can naively believe that our code will scale well.

Not So



Let's see how it goes. A WRITER cannot get into the protected code while READERS are executing it. The WRITER must then wait at the door until the READERS release their hold. If the implementation of RW locks didn't pay attention, there would be cases in which at least one READER is always present within the protected code and WRITERS would be starved of access. To prevent such starvation, a RW lock must block new READERS as soon as a WRITER has requested access. But no matter: our WRITERS will quickly update the structure and we will get concurrent execution most of the time. Won't we?

Well, not quite. As just stated, a RW lock will block readers as soon as a WRITER has hit the door. This means that the construct does not allow parallel execution from that point on. Moreover, the WRITER will stay at the door while the current READERS are executing. So the construct stays fully serializing from the time a WRITER hits the lock until all current READERS are done, plus the WRITER's own hold time.

For Instance:

	- a RW_READER gets in and holds for a long time. ---|
	- a RW_WRITER hits the lock; is put on hold.        |
	- other RW_READERS now also block.                  |
	.... time passes                                    |
	- the long RW_READER releases      <----------------|
	- the RW_WRITER gets the lock; works; releases
	- other RW_READERS now work concurrently.


Pretty obvious once you think about it. So to assess the capacity of a RW lock to allow parallel execution, one must consider not only the average hold time as a READER but also the frequency of access as a WRITER. The construct becomes efficient and scalable to N threads if and only if:

(avg interval between writers) >> (N * avg read hold time).
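
To put numbers on the timeline and on the condition above, here is a toy model in Python. It is not the Solaris implementation: it assumes a single WRITER, a writer-preferring lock, and that the long READER is the last one holding the lock when the WRITER arrives. The function names and the factor of 10 standing in for ">>" are my own assumptions.

```python
def reader_start_times(long_reader_release, writer_arrival, write_hold,
                       reader_arrivals):
    """When does each later READER actually get the lock?

    All times share one unit (say milliseconds). The long READER holds
    the lock from time 0 until long_reader_release.
    """
    # The WRITER waits for the long READER, then holds for write_hold.
    writer_done = max(long_reader_release, writer_arrival) + write_hold
    starts = []
    for t in reader_arrivals:
        if t < writer_arrival:
            starts.append(t)                    # no WRITER at the door yet
        else:
            starts.append(max(t, writer_done))  # queued behind the WRITER
    return starts


def scales_to(n_threads, writer_interval, read_hold, margin=10):
    """Rule of thumb: writer interval >> N * read hold (margin is assumed)."""
    return writer_interval >= margin * n_threads * read_hold


# Long READER releases at t=100, WRITER arrives at t=10 and needs 1 unit.
print(reader_start_times(100, 10, 1, [5, 20, 60, 150]))
print(scales_to(8, writer_interval=10_000, read_hold=10))
print(scales_to(8, writer_interval=100, read_hold=10))
```

In that example the READERS arriving at t=20 and t=60 only start at t=101, even though the WRITER itself needs a single time unit.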

Roundup

In the end, from a performance point of view, RW locks should be used only when the average hold time is significant enough to justify the use of this more complex type of lock: for instance, calling a function of unknown latency or issuing an I/O while holding the lock are good candidates. But the construct will be scalable to N threads if and only if WRITERS are very infrequent.


Tuesday Dec. 06, 2005

Showcasing UltraSPARC T1 with Directory Server's searches

So my friend Gilles Bellaton, a developer of Sun's Directory Server (DS), recently got his hands on an early access Niagara (UltraSPARC T1) system; something akin to a Sun Fire(TM) T2000.

The chip in the system only had 7 active cores and thus 28 hardware threads (a.k.a. strands), but we wanted to check how well it would perform on DS. The results here are a little anecdotal: we just ran a few quick tests with the aim of showcasing Niagara, but nevertheless the results were beyond expectations.

If you consider the Throughput Engine architecture that Niagara provides (what the Inquirer says), we can expect it to perform well in highly multithreaded loads such as a directory search test. Since we had limited disk space on the system, the slapd instance was created on /tmp. We realize that these are not at all proper deployment conditions; however, the nature of the test is such that we would expect the system to operate mostly from memory (database fully cached). The only data that would need to go to disk on a real deployment would be the 'access log', and this is typically not a throughput-limiting subsystem.

So we can prudently expect that a real on-disk deployment of a read-mostly workload in which the DB can be fully cached could perform close to our findings. This showcase test is a base search over a tiny 1000-entry database using a 50-thread slapd. Slapd was not tuned in any way before the test. For simplicity, the client was run on the same system as the server. This means that, on the one hand, the client is taking some CPU away from the server, but on the other hand it reduces the need to run the network adapter driver code. All in all, this was not designed as a realistic DS test but only to see, in a few hours of access time to the system, if DS was running acceptably well on this cool new hardware.

The results were obtained with Gilles' workspace of the DS 6.0 optimized build of August 29th 2005. The number of CPUs was adjusted by creating processor sets (psrset).


Number of Strands               Search/sec      Ratio
 1                                   920         1    X
 3 (1 core;  3 str/core)            2260         2.45 X
 4 (1 core;  4 str/core)            2650         2.88 X
 4 (4 cores; 1 str/core)            4100         4.45 X
14 (7 cores; 2 str/core)           12500        13.59 X
21 (7 cores; 3 str/core)           16100        17.5  X
28 (7 cores; 4 str/core)           18200        19.8  X



Those are pretty good scaling numbers straight out of the box. While other, more realistic investigations will be produced, this test at least showed us early that Niagara-based systems were not suffering from any flagrant deficiencies when running DS searches.
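
For the curious, the Ratio column is simply each search rate divided by the single-strand rate. The short sketch below recomputes it and adds a per-strand efficiency figure (ratio divided by the number of strands), which is my own addition and not something we measured separately.

```python
# (configuration, strands, searches/sec) taken from the table above.
RESULTS = [
    ("1 strand", 1, 920),
    ("1 core, 3 str/core", 3, 2260),
    ("1 core, 4 str/core", 4, 2650),
    ("4 cores, 1 str/core", 4, 4100),
    ("7 cores, 2 str/core", 14, 12500),
    ("7 cores, 3 str/core", 21, 16100),
    ("7 cores, 4 str/core", 28, 18200),
]


def scaling(results):
    """Return (label, ratio to the first row, ratio per strand) per row."""
    baseline = results[0][2]
    rows = []
    for label, strands, rate in results:
        ratio = rate / baseline
        rows.append((label, ratio, ratio / strands))
    return rows


for label, ratio, per_strand in scaling(RESULTS):
    print(f"{label:22s} {ratio:6.2f} X   {per_strand:.2f} per strand")
```

Note how 4 strands spread over 4 cores deliver noticeably more than 4 strands sharing a single core, as one would expect since strands of a core share its pipeline.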


Wednesday Nov. 16, 2005

ZFS to UFS Performance Comparison on Day 1


With special thanks to Chaoyue Xiong for her help in this work.
        
In this paper I'd like to review the performance data we have gathered
comparing this initial release of ZFS (Nov 16 2005) with the Solaris
legacy, optimized beyond reason, UFS filesystem. The data we will be
reviewing is based on 14 unit tests that were designed to stress some
specific usage patterns of filesystem operations. Working with these
well-contained usage scenarios greatly facilitates subsequent
performance engineering analysis.

Our focus was to make a fair head-to-head comparison between UFS and
ZFS, not to try to produce the biggest, meanest marketing numbers.
Since ZFS is also a Volume Manager, we actually compared ZFS to a
UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted
to figure out why and how to improve ZFS.

We are currently focusing on data-intensive operations. Metadata-
intensive tests are being developed and we will report on those in a
later study.

Looking ahead to our results, we find that of our 12 filesystem unit
tests that were successfully run:

  •     ZFS outpaces UFS in 6 tests by a mean factor of 3.4
  •     UFS outpaces ZFS in 4 tests by a mean factor of 3.0
  •     ZFS equals UFS in 2 tests.


In this paper, we will be taking a closer look at the tests where UFS
is ahead and try to make propositions toward improving those numbers.


THE SYSTEM UNDER TEST

Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At
this point we are not yet monitoring the CPU utilization of the
different tests, although we plan to do so in the future. The storage
is an insanely large 300-disk array; the disks are rather old
technology, small and slow 9 GB disks. None of the tests currently
stresses the array very much; the idea was mostly to take the storage
configuration out of the equation. Working with old technology disks,
the absolute throughput numbers are not necessarily of interest; they
are presented in an appendix.

Every disk in our configuration is partitioned into 2 slices and a
simple SVM or zpool striped volume is made across all spindles. We
then build a filesystem on top of the volume. All commands are run
with default parameters. Both filesystems are mounted and we can run
our test suite on either one.

Every test is rerun multiple times in succession; the tests are
defined and developed to avoid variability between instances. Some of
the current test definitions require that file data not be present in
the filesystem cache. Since we currently do not have a convenient way
to control this for ZFS, the results for those tests are omitted from
this report.


THE FILESYSTEM UNIT TESTS


Here is the definition of the 14 data-intensive tests we have
currently identified. Note that we are very open to new test
definitions; if you know of a data-intensive application that uses a
filesystem in a very different pattern, and there must be tons of
them, we would dearly like to hear from you.


Test 1


This is the simplest way  to create a  file; we open/creat a file then
issue 1MB writes until the filesize reaches 128 MB; we then close the file.

Test 2


In this test, we also create a new file, although here we work with a
file opened with the O_DSYNC flag. We work with 128K write system
calls. This maps to some database file creation scheme.
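
Our tests are C programs that we cannot share, but the access pattern
of Test 2 can be sketched in a few lines of Python. This is only an
illustration: the function name is made up, the 5 MB file size is the
one quoted in the results table, and where the platform does not
expose O_DSYNC the sketch silently falls back to a plain open.

```python
import os

# O_DSYNC is not exposed on every platform; 0 means "no extra flag".
O_DSYNC = getattr(os, "O_DSYNC", 0)


def create_dsync_file(path, file_size=5 * 1024 * 1024, write_size=128 * 1024):
    """Allocate a new file with synchronous writes of write_size bytes."""
    buf = b"\0" * write_size
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | O_DSYNC
    fd = os.open(path, flags, 0o644)
    written = 0
    try:
        while written < file_size:
            # Each write() returns only once the data is on stable storage.
            written += os.write(fd, buf[: file_size - written])
    finally:
        os.close(fd)
    return written
```
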

Test 3


This test also relates to file creation but with writes that are
much smaller and of varying sizes. In this test, we create a 50MB file
using writes of a size picked randomly in [1K,8K]. The file is opened
with default flags (no O_*SYNC) but every 10 MB of written data we
issue an fsync() call for the whole file. This form of access can be
used for log files that have data integrity requirements.
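
As a rough Python sketch of that pattern (not our actual C test; the
function name, the fixed random seed and the payload byte are
assumptions made for the illustration):

```python
import os
import random

MB = 1024 * 1024


def write_log_like(path, file_size=50 * MB, sync_every=10 * MB, seed=0):
    """Create a file with 1K..8K writes, calling fsync() every sync_every bytes."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    syncs = 0
    next_sync = sync_every
    try:
        while written < file_size:
            size = min(rng.randint(1024, 8192), file_size - written)
            written += os.write(fd, b"x" * size)
            if written >= next_sync:
                os.fsync(fd)          # all data so far must reach stable storage
                syncs += 1
                next_sync += sync_every
    finally:
        os.close(fd)
    return written, syncs
```

With the default parameters this issues 5 fsync() calls over the life
of the file; what happens between two of them is up to the filesystem,
which is exactly what the test exercises.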

Test 4


Moving now to a read test; we read a 1 GB file (assumed in cache) with
32K read system calls. This is a rather simple test to keep everybody
honest.

Test 5


This is the same test as Test 4 but with the file assumed not present
in the filesystem cache. We currently have no control over this on ZFS
and so we will not be reporting performance numbers for this test.
This is a basic streaming read sequence that should test the readahead
capacity of a filesystem.

Test 6


Our previous write tests were allocating writes. In this test we
verify the ability of a filesystem to rewrite over an existing file.
We look at 32K writes to a file opened with O_DSYNC.

Test 7


Here we also test the ability to rewrite existing files. The write
sizes are randomly picked in the [1K,8K] range. No special control
over data integrity (no O_*SYNC, no fsync()).

Test 8


In this test we create a very large file (10 GB) with 1MB writes
followed by 2 full-pass sequential reads. This test is still evolving
but we want to verify the ability of the filesystem to work with files
whose size is close to or larger than the available free memory.

Test 9


In this test, we issue 8K writes at random 8K aligned offsets in a 1 GB
file. When 128 MB of data is written we issue an fsync().

Test 10


Here,  we issue  2K writes at   random (unaligned) offsets  to  a file
opened O_DSYNC.


Test 11


Same test   as 10 but using 4   cooperating threads all  working  on a
single file.

Test 12


Here we attempt to  simulate a mixed  read/write pattern. Working with
an existing file, we loop through  a pattern of 3  reads at 3 randomly
selected 8K aligned  offsets followed by an  8K write to the last read
block.
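
A Python sketch of that loop follows. It is an illustration only: the
function name is made up, and I am assuming here that the amount of
data "worth of read&write" counts both the reads and the writes.

```python
import os
import random

BLOCK = 8 * 1024


def mixed_read_write(fd, file_size, total_bytes, seed=0):
    """Loop: 3 reads at random 8K-aligned offsets, then rewrite the last block."""
    rng = random.Random(seed)
    n_blocks = file_size // BLOCK
    moved = 0
    while moved < total_bytes:
        for _ in range(3):
            offset = rng.randrange(n_blocks) * BLOCK
            data = os.pread(fd, BLOCK, offset)
            moved += len(data)
        # Write back over the block we read last.
        moved += os.pwrite(fd, data, offset)
    return moved
```

The file descriptor must be opened read/write (os.O_RDWR) on an
existing file of at least one 8K block.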
 

Test 13

In this test we issue 2K pread() calls (to a random unaligned
offset). The file is asserted to not be in the cache. Since we currently
have no such control, we won't report data for this test.

Test 14

We have 4 cooperating threads (working on a single file) issuing 2K
pread() calls to random unaligned offsets. The file is present in the
cache.
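
The shape of that test can be sketched with Python threads sharing one
file descriptor. Again this is not our C code: the names are made up,
the file is assumed to be at least 2K long, and the 5 MB per thread is
the figure quoted in the results table.

```python
import os
import random
import threading

MB = 1024 * 1024
READ_SIZE = 2 * 1024


def pread_worker(fd, file_size, total_bytes, seed, counts, idx):
    """Issue 2K pread() calls at random unaligned offsets."""
    rng = random.Random(seed)
    done = 0
    while done < total_bytes:
        offset = rng.randrange(file_size - READ_SIZE + 1)
        done += len(os.pread(fd, READ_SIZE, offset))
    counts[idx] = done


def run_shared_readers(path, n_threads=4, per_thread=5 * MB):
    """Run n_threads readers against a single shared file."""
    fd = os.open(path, os.O_RDONLY)
    counts = [0] * n_threads
    try:
        size = os.fstat(fd).st_size
        threads = [
            threading.Thread(target=pread_worker,
                             args=(fd, size, per_thread, i, counts, i))
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        os.close(fd)
    return counts
```

pread() takes an explicit offset, so the threads can share the
descriptor without fighting over a common file position.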




THE RESULTS

We have a common testing framework to generate the performance data.
Each test is written as a simple C program and the framework is
responsible for creating threads and files, timing the runs, and
reporting. We are currently discussing merging this test framework
with the Filebench suite. We regret that we cannot easily share the
test code; however, the above descriptions should be sufficiently
precise to allow someone to reproduce our data. In my mind a simple
10 to 20 disk array and any small server should be enough to generate
similar numbers. If anyone finds very different results, I would be
very interested in knowing about it.

Our framework reports all timing results as a throughput
measure. Absolute values of throughput are highly test case dependent.
A 2K O_DSYNC write will not have the same throughput as a 1MB cached
read. Some tests would be better described in terms of operations per
second. However, since our focus is a relative ZFS to UFS/SVM
comparison, we will focus here on the delta in throughput between the
2 filesystems (for the curious, the full throughput data is posted in
the appendix).
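
To make the reporting convention concrete, here is a small sketch of
how a throughput figure and a winner/delta pair can be derived. The
function names, the 1 MB = 2^20 bytes convention and the 5% band used
to call a draw are assumptions of mine, not a description of our
framework.

```python
def throughput_mb_s(nbytes, seconds):
    """Throughput in MB/s, taking 1 MB as 2**20 bytes."""
    return nbytes / (1024 * 1024) / seconds


def relative_delta(zfs_mb_s, ufs_mb_s, draw_band=0.05):
    """Return (winner, factor) from the two throughput figures."""
    ratio = max(zfs_mb_s, ufs_mb_s) / min(zfs_mb_s, ufs_mb_s)
    if ratio - 1.0 < draw_band:
        return "draw", 1.0
    winner = "ZFS" if zfs_mb_s > ufs_mb_s else "UFS"
    return winner, round(ratio, 1)


# Throughput figures for Tests 2, 14 and 10, taken from the appendix.
print(relative_delta(4.5637, 0.86565))
print(relative_delta(56.38486, 386.70305))
print(relative_delta(0.117925, 0.116375))
```
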


Drumroll....


Task ID  Description                          Winning FS / Performance Delta
-------  -----------------------------------  ------------------------------

1        open() and allocation of a           ZFS / 3.4X
         128.00 MB file with
         write(1024K) then close().

2        open(O_DSYNC) and                    ZFS / 5.3X
         allocation of a
         5.00 MB file with
         write(128K) then close().

3        open() and allocation of a           UFS / 1.8X
         50.00 MB file with write() of
         size picked uniformly in
         [1K,8K] issuing fsync()
         every 10.00 MB

4        Sequential read(32K) of a            ZFS / 1.1X
         1024.00 MB file, cached.

5        Sequential read(32K) of a            no data
         1024.00 MB file, uncached.

6        Sequential rewrite(32K) of a         ZFS / 2.6X
         10.00 MB file, O_DSYNC,
         uncached

7        Sequential rewrite() of a 1000.00    UFS / 1.3X
         MB cached file, size picked
         uniformly in the [1K,8K]
         range, then close().

8        create a file of size 1/2 of         ZFS / 2.3X
         freemem using write(1MB)
         followed by 2 full-pass
         sequential read(1MB). No
         special cache manipulation.

9        128.00 MB worth of random            UFS / 2.3X
         8K-aligned write to a
         1024.00 MB file; followed
         by fsync(); cached.

10       1.00 MB worth of 2K write to         draw (UFS == ZFS)
         100.00 MB file, O_DSYNC,
         random offset, cached.

11       1.00 MB worth of 2K write to         ZFS / 5.8X
         100.00 MB file, O_DSYNC,
         random offset, uncached. 4
         cooperating threads each
         writing 1 MB

12       128.00 MB worth of 8K aligned        draw (UFS == ZFS)
         read&write to 1024.00 MB
         file, pattern of 3 X read,
         then write to last read
         page, random offset,
         cached.

13       5.00 MB worth of pread(2K) per       no data
         thread within a shared
         1024.00 MB file, random
         offset, uncached

14       5.00 MB worth of pread(2K) per       UFS / 6.9X
         thread within a shared
         1024.00 MB file, random
         offset, cached 4 threads.


As stated at the outset:

  •     ZFS outpaces UFS in 6 tests by a mean factor of 3.4
  •     UFS outpaces ZFS in 4 tests by a mean factor of 3.0
  •     ZFS equals UFS in 2 tests.

The performance differences can be sizable; let's have a closer look
at some of them.





PERFORMANCE DEBRIEF

Let's look at each test to try to understand the cause of the
performance differences.

Test 1 (ZFS 3.4X)

     open() and allocation  of a
    128.00 MB file with
    write(1024K) then close().                

This  test is not fully  analyzed. We note  that in this situation UFS
will regularly kick off some I/O from the  context of the write system
call.  This would occur  whenever a  cluster  of writes (typically  of
size  128K or 1MB)  has completed. The initiation  of I/O by UFS slows
down the process.  On the other hand ZFS  can zoom through the test at
a rate much closer to  a memcopy.  The  ZFS I/Os to disks are actually
generated internally by the ZFS  transaction group mechanism: every few
seconds a transaction group will come and flush the dirty data to disk
and this occurs without throttling the test.

Test 2 (ZFS 5.3X)

     open(O_DSYNC) and
    allocation of a
    5.00 MB file with
    write(128K) then close().              

Here ZFS shows an even bigger advantage. Because of its design and
complexity, UFS is actually somewhat limited in its capacity to write
allocate files in O_DSYNC mode. Every new UFS write requires some
disk block allocation, which must occur one block at a time when
O_DSYNC is set. ZFS can easily outperform UFS for this test.

Test 3 (UFS 1.8X)

     open()  and allocation of a
    50.00 MB file with write() of
    size picked uniformly  in
    [1K,8K] issuing fsync()                          
    every 10.00 MB


Here ZFS pays for the advantage it had in Test 1. In this test, we issue
very many writes to a file. Those are cached as the process is racing
along. When the fsync() hits (every 10 MB of outstanding data per the
test definition) the FS must now guarantee that all the data is sent to
stable storage. Since UFS kicks off I/O more regularly, when the
fsync() hits UFS has a smaller amount of data left to sync up. What
saves the day for ZFS is that, for that leftover data, UFS slows down to
a crawl. ZFS, on the other hand, has accumulated a large amount of data
in the cache when the fsync() hits. Fortunately ZFS is able to issue
much larger I/Os to disk and catches up on some of the lag that has
built up. But the final result shows that UFS wins the horse race
(at least in this specific test); details of the test will influence
the final result here.

However, the ZFS team is working on ways to make fsync() much
better. We actually have 2 possible avenues of improvement. We can
borrow from the UFS behavior and kick off some I/Os when too much
outstanding data is cached. UFS does this at a very regular interval,
which does not look right either. But clearly, if a file has many MB
of outstanding dirty data, sending them off to disk might be
beneficial. On the other hand, keeping the data in cache is
interesting when the pattern of writing is such that the same file
offsets are written and rewritten over and over again. Sending the
data to disk is wasteful if the data is rewritten shortly
after. Basically the FS must place a bet on whether a future fsync()
will occur before a new write to the block. We cannot win this bet
on all tests all the time.

Given that fsync() performance is important, I would like to see us
asynchronously kick off I/O when we reach many MB of outstanding
data to a file. This is nevertheless debatable.

Even if we don't do this, we have another area of improvement that the
ZFS team is looking into. When the fsync() finally hits the fan, even
with a lot of outstanding data, the current implementation does not
issue disk I/Os very efficiently. The proper way to do this is to
kick off all required I/Os and then wait for them all to complete.
Currently, in the intricacies of the code, some I/Os are issued and
waited upon one after the other. This is not yet optimal but we
certainly should see improvements coming in the future and I truly
expect ZFS fsync() performance to be ahead all the time.
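
The difference between the two ways of issuing I/O is easy to show with
a toy model. The Python sketch below is not ZFS code: each "I/O" is
just a simulated latency (asyncio.sleep), and all names are made up for
the illustration.

```python
import asyncio
import time


async def fake_io(latency):
    """Stand-in for one disk I/O that takes `latency` seconds."""
    await asyncio.sleep(latency)


async def flush_one_by_one(n, latency):
    # Issue an I/O, wait for it, then issue the next one.
    for _ in range(n):
        await fake_io(latency)


async def flush_all_then_wait(n, latency):
    # Kick off every I/O first, then wait for all of them to complete.
    await asyncio.gather(*(fake_io(latency) for _ in range(n)))


def timed(flush, n, latency):
    """Run one flush strategy and return the elapsed wall-clock time."""
    start = time.perf_counter()
    asyncio.run(flush(n, latency))
    return time.perf_counter() - start


if __name__ == "__main__":
    print("one by one    :", timed(flush_one_by_one, 5, 0.02))
    print("all then wait :", timed(flush_all_then_wait, 5, 0.02))
```

With independent I/Os, the first form costs roughly the sum of the
latencies while the second costs roughly the longest one.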
   
Test 4 (ZFS 1.1X)

     Sequential read(32K) of a 1024.00
    MB file, cached.

Rather simple  test, mostly    close  to memcopy  speed  between   the
Filesystem  cache and the  user buffer. Contest is  almost a wash with
ZFS slightly on top. Not yet analyzed.


Test 5 (N/A)

     Sequential read(32K) of a 1024.00
    MB file, uncached.

No results due to lack of control over the ZFS file-level caching.


Test 6 (ZFS 2.6X)

      Sequential rewrite(32K) of a
    10.00   MB  file,  O_DSYNC,
    uncached                      

Due to the WAFL-style (Write Anywhere File Layout) design of ZFS, a
rewrite is not very different from an initial write and it seems to
perform very well on this test. Presumably UFS performance is hindered
by the need to synchronize the cached data. Result not yet analyzed.

Test 7 (UFS 1.3X)

     Sequential rewrite() of a 1000.00
    MB cached file, size picked
    uniformly    in the [1K,8K]                  
    range, then close().


In this test we are not timing any of the disk I/O. This is merely a
test about unrolling the filesystem code for 1K to 8K cached writes.
The UFS codepath wins in simplicity and years of performance tuning.
The ZFS codepath here somewhat suffers from its youth. Understandably,
the current ZFS implementation is very well layered and we can easily
imagine that the locking strategies of the different layers are
independent of one another. We have found (thanks, dtrace) that a small
ZFS cached write uses about 3 times as many lock acquisitions as
an equivalent UFS call. Mutex rationalization within or between
layers certainly seems to be an area of potential improvement for ZFS
that would help this particular test. We also realised that the very
clean and layered code implementation is causing the callstack to
take very many elevator rides up and down between layers. On a SPARC
CPU, going up and down 6 or 7 layers deep in the callstack causes a
spill/fill trap, and one additional trap for every additional floor
travelled. Fortunately there are very many areas where ZFS will be
able to merge different functions into a single one or possibly exploit
the technique of tail calls to regain some of the lost performance.
All in all, we find that the performance difference is small enough to
not be worrisome at this point, especially in view of the possible
improvements we have already identified.


Test 8 (ZFS 2.3X)

      create  a file   of size 1/2  of
    freemem   using  write(1MB)
    followed  by 2    full-pass
    sequential   read(1MB).  No              
    special cache manipulation.

This test needs to be analyzed further. We note that UFS will
proactively freebehind read blocks. While this is a very responsible
use of memory (give it back after use), it potentially impacts the
UFS re-read performance. While we're happy to see ZFS performance on
top, some investigation is warranted to make sure that ZFS does not
overconsume memory in some situations.

Test 9 (UFS 2.3X)

       128.00  MB  worth  of random  8
    K-aligned write       to  a
    1024.00  MB  file; followed                  
    by fsync(); cached.


In this test we expect a rationale similar to the one of Test 3 to take
effect. The same cure should also apply.

Test 10 (draw)

      1.00  MB worth of   2K write to
    100.00   MB file,  O_DSYNC,
    random offset, cached.

Both FS must issue and wait  for a 2K I/O on  each write. They both do
this as efficiently as possible.


Test 11 (ZFS 5.8X)

     1.00  MB worth  of  2K write  to
    100.00 MB    file, O_DSYNC,
    random offset, uncached.  4
    cooperating  threads   each             
    writing 1 MB

This test is similar to the previous one except for the 4 cooperating
threads. ZFS being on top highlights a key feature of ZFS: the lack of
a single-writer lock. UFS can only allow a single write thread working
per file. The only exception is when directio is enabled, and then
only under rather restrictive conditions. UFS with directio allows
concurrent writers with the implied restriction that it does not honor
full POSIX semantics regarding write atomicity. ZFS, out of the box,
is able to allow concurrent writers without requiring any special
setup or giving up full POSIX semantics. All great news for
simplicity of deployment and great database performance.

Test 12 (draw)

    128.00 MB  worth of  8K aligned
    read&write   to  1024.00 MB
    file, pattern  of 3 X read,
    then write to   last   read
    page,     random    offset,
    cached.

Both filesystems perform appropriately. The test still requires analysis.

Test 13 (N/A)

      5.00  MB worth of pread(2K) per       
    thread   within   a  shared
    1024.00  MB    file, random
    offset, uncached

No results due to lack of control over the ZFS file-level caching.

Test 14 (UFS 6.9X)

     5.00 MB  worth of  pread(2K) per   
    thread within a shared
    1024.00 MB file, random                         
    offset, cached 4 threads.

This test inexplicably shows UFS on top. The UFS code can perform
rather well given that the FS cache is stored in the page cache.
Servicing reads from cache can be made very scalable. We are just
starting our analysis of the performance characteristics of ZFS for
this test. We have identified some serialization constructs in the
buffer management code, where we find that reclaiming the buffers into
which to put the cached data is acting as a serial throttle. This is
truly the only test where the ZFS performance disappoints, although
there is no doubt that we will be finding a cure to this
implementation issue.




THE TAKEAWAY


ZFS is on top on very many of our tests, often by a significant
factor. Where UFS is ahead we have a clear view on how to improve the
ZFS implementation. The case of shared readers to a single file will
be the test that requires special attention.

Given the youth of the ZFS implementation, the performance outline
presented in this paper shows that the ZFS design decisions are totally
validated from a performance perspective.


FUTURE DIRECTIONS

Clearly, we should now expand the unit test coverage. We would like
to study more metadata-intensive workloads. We also would like to see
how ZFS features such as compression and RAID-Z perform. Other
interesting studies could focus on CPU consumption and memory
efficiency. We also need to find a solution to running the existing
unit tests that require the files to not be cached in the filesystem.







APPENDIX: THROUGHPUT MEASURE

Here are the raw throughput measures for each of the 14 unit tests.

Task ID  Description                          ZFS latest+nv25  UFS+nv25    Delta
                                              (MB/s)           (MB/s)
-------  -----------------------------------  ---------------  ----------  --------

1        open() and allocation of a           486.01572        145.94098   ZFS 3.4X
         128.00 MB file with
         write(1024K) then close().

2        open(O_DSYNC) and                    4.5637           0.86565     ZFS 5.3X
         allocation of a
         5.00 MB file with
         write(128K) then close().

3        open() and allocation of a           27.3327          50.09027    1.8X UFS
         50.00 MB file with write() of
         size picked uniformly in
         [1K,8K] issuing fsync()
         every 10.00 MB

4        Sequential read(32K) of a 1024.00    674.77396        612.92737   ZFS 1.1X
         MB file, cached.

5        Sequential read(32K) of a 1024.00    1756.57637       17.53705    n/a (cache
         MB file, uncached.                                                not controlled)

6        Sequential rewrite(32K) of a         2.20641          0.85497     ZFS 2.6X
         10.00 MB file, O_DSYNC,
         uncached

7        Sequential rewrite() of a 1000.00    204.31557        257.22829   1.3X UFS
         MB cached file, size picked
         uniformly in the [1K,8K]
         range, then close().

8        create a file of size 1/2 of         698.18182        298.25243   ZFS 2.3X
         freemem using write(1MB)
         followed by 2 full-pass
         sequential read(1MB). No
         special cache manipulation.

9        128.00 MB worth of random            42.75208         100.35258   2.3X UFS
         8K-aligned write to a
         1024.00 MB file; followed
         by fsync(); cached.

10       1.00 MB worth of 2K write to         0.117925         0.116375    draw
         100.00 MB file, O_DSYNC,
         random offset, cached.

11       1.00 MB worth of 2K write to         0.42673          0.07391     ZFS 5.8X
         100.00 MB file, O_DSYNC,
         random offset, uncached. 4
         cooperating threads each
         writing 1 MB

12       128.00 MB worth of 8K aligned        264.84151        266.78044   draw
         read&write to 1024.00 MB
         file, pattern of 3 X read,
         then write to last read
         page, random offset,
         cached.

13       5.00 MB worth of pread(2K) per       75.98432         0.11684     n/a (cache
         thread within a shared                                            not controlled)
         1024.00 MB file, random
         offset, uncached

14       5.00 MB worth of pread(2K) per       56.38486         386.70305   6.9X UFS
         thread within a shared
         1024.00 MB file, random
         offset, cached 4 threads.





Monday June 13, 2005

Bonjour Monde

That's "Hello World" in French, but one wouldn't say it that way anyway. Maybe one would say "Bonjour tout le monde", meaning "Hello All", which you might say, for example, when entering a room filled with people (especially if, like me, you don't care much about greeting everyone individually). So that's your first hint: I'm a geeky sociopath more likely to communicate through a weblog than in real life. The next hint is that I master the French language, as you might expect from someone who lives in France. However, I've lived in France for only about 15 years, and that should allow you to guess that I was not born in France (French law prohibits child labor). And I've been working for Sun since 1997. The reason I master the French language is probably that both my parents spoke no other language. That was in Quebec, a part of Canada filled with people who speak English with a funny semi-French accent. So bear with me: my writing will also have this accent. In summary: Canadian, lives in France, has worked for Sun for 8 years. I do performance engineering, which to me means: I take a performance number, I explain why it is what it is, and I propose what needs to be done to improve it. Welcome to my blog, and your name is?