Thursday Sep 18, 2008

time for new challenges

It's been a great eight years at Sun, but a new opportunity has piqued my interest. So as of today, i'm moving on.

Where to?

Well i'm going to be helping out some friends at a startup:
http://www.lumosity.com/

Best wishes to team ZFS (including those outside of Sun).

eric

Wednesday Apr 23, 2008

zones and ZFS file systems

Starting off with a freshly created pool, let's see the steps to create a zone based on a ZFS file system. Here we see our new pool with only one file system:

fsh-sole# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
kwame   160K  7.63G    18K  /kwame
fsh-sole#

Now, we'll create and configure a local zone "ejkzone". Note that we set the zonepath to a location within the ZFS pool:

fsh-sole# zonecfg -z ejkzone
ejkzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:ejkzone> create
zonecfg:ejkzone> set zonepath=/kwame/kilpatrick
zonecfg:ejkzone> commit
zonecfg:ejkzone> exit
fsh-sole#

Now we install zone "ejkzone" and notice that the installation tells us that it will automatically create a ZFS file system for us:

fsh-sole# zoneadm -z ejkzone install
A ZFS file system has been created for this zone.
Preparing to install zone <ejkzone>.
Creating list of files to copy from the global zone.
Copying <10116> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1198> packages on the zone.
Initialized <1198> packages on zone.
Zone <ejkzone> is initialized.
The file </kwame/kilpatrick/root/var/sadm/system/logs/install_log> contains a log of the zone installation.
fsh-sole#

Now we can boot the zone to use it, and can also see that the file system kwame/kilpatrick was automatically created for us:

fsh-sole# zoneadm -z ejkzone boot   
fsh-sole# zoneadm list
global
ejkzone
fsh-sole# zoneadm -z ejkzone list -v
  ID NAME             STATUS     PATH                           BRAND    IP    
   3 ejkzone          running    /kwame/kilpatrick              native   shared
fsh-sole# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
kwame              517M  7.12G    20K  /kwame
kwame/kilpatrick   517M  7.12G   517M  /kwame/kilpatrick
fsh-sole# 

Now if we log in to the zone via 'zlogin -C ejkzone', we notice that the local zone cannot see any ZFS file systems (only the global zone can):

ejkzone# zfs list
no datasets available
ejkzone# 

If we then want to create and delegate some ZFS file systems to the local zone "ejkzone" so that "ejkzone" has administrative control over the file systems, we can do that. From the global zone, we do:

fsh-sole# zfs create kwame/textme
fsh-sole# zonecfg -z ejkzone
zonecfg:ejkzone> add dataset
zonecfg:ejkzone:dataset> set name=kwame/textme
zonecfg:ejkzone:dataset> end
zonecfg:ejkzone> exit
fsh-sole#

Now, we can get the "zoned" property of the newly created file system:

fsh-sole# zfs get zoned kwame/textme 
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  zoned     off           default
fsh-sole# 

Huh, it says "off". But we delegated it to a local zone. Why is that? Well, in order for the delegation to take effect, we have to reboot the local zone.
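
Rebooting the zone from the global zone is a one-liner:

fsh-sole# zoneadm -z ejkzone reboot

After the reboot, we can see from the global zone: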

fsh-sole# zfs get zoned kwame/textme
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  zoned     on            local
fsh-sole# 

And from the local zone "ejkzone":

ejkzone# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
kwame          595M  7.05G    20K  /kwame
kwame/textme    18K  7.05G    18K  /kwame/textme
ejkzone# 

And now we have administrative control over the file system via the local zone:

ejkzone# zfs get copies kwame/textme 
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  copies    1             default
ejkzone# zfs set copies=2 kwame/textme
ejkzone# zfs get copies kwame/textme  
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  copies    2             local
ejkzone# 

Double checking on the global zone:

fsh-sole# zfs get copies kwame/textme
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  copies    2             local
fsh-sole# zpool history -l
History for 'kwame':
2008-04-23.16:01:17 zpool create -f kwame c1d0s3 [user root on fsh-sole:global]
2008-04-23.16:29:42 zfs create kwame/textme [user root on fsh-sole:global]
2008-04-23.16:36:45 zfs set copies=2 kwame/textme [user root on fsh-sole:ejkzone]

fsh-sole# 
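
One more note: delegation gives more than property control - the local zone can also create its own descendant file systems under the delegated dataset. A quick sketch (the child names here are made up):

ejkzone# zfs create kwame/textme/photos
ejkzone# zfs set compression=on kwame/textme/photos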

Happy zoning

Wednesday Mar 19, 2008

how dedupalicious is your pool?

With the putback of 6656655 zdb should be able to display blkptr signatures, we can now get the "signature" of the block pointers in a pool. To see an example, let's first put some content into an empty pool:

heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
bigIO   928G  95.5K   928G     0%  ONLINE  -
heavy# mkfile 1m /bigIO/1m.txt
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt

Now let's run zdb with the new option "-S". We pass in "user:all", where "user" tells zdb that we only want user data blocks (as opposed to both user and metadata) and "all" tells zdb to print out all blocks (skipping any checksum algorithm strength comparisons).

heavy# zdb -L -S user:all bigIO
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
0       512     1       ZFS plain file  fletcher2       uncompressed    7075646564206f6e:a:eac8cac840dedc0:140
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
heavy# 

This displays the signature of each block pointer, where the columns are: level, physical size, number of DVAs, object type, checksum type, compression type, and finally the actual checksum of the block.
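
Since the checksum is the last of the tab-separated fields, a quick way to eyeball duplicates before scripting anything is to count repeated checksums with standard tools (a sketch; cut(1) splits on tabs by default):

heavy# zdb -L -S user:all bigIO | cut -f7 | sort | uniq -c | sort -rn | head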

So this is interesting, but what can we do with this information? Well, one thing we can do is figure out how much your pool could benefit from dedup. Let's assume that the dedup implementation matches on the actual checksum, and that any checksum algorithm is strong enough (in reality, we'd need sha256 or stronger). So starting with the above pool and using a simple perl script 'line_by_line_process.pl' (shown at the end of this blog), we find:

heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt 
total PSIZE:               0t1050624
total unique PSIZE:        0t132096
total that can be duped:   0t918528
percent that can be duped  87.4269005847953%
heavy#   

In our trivial case, we can see that we could get a huge win - 87% of the pool can be dedup'd! Upon closer examination, though, we notice that mkfile writes out all-zero blocks; if you had compression enabled, there wouldn't be any actual blocks for this file at all (there's a quick sketch to verify this after the next example). So let's look at a case where just the "ejk.txt" contents are getting dedup'd:

heavy# zpool destroy bigIO
heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# dd if=/dev/random of=/bigIO/1m.txt bs=1024 count=5
5+0 records in
5+0 records out
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt
heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt       
total PSIZE:               0t7168
total unique PSIZE:        0t6144
total that can be duped:   0t1024
percent that can be duped  14.2857142857143%
heavy# 

Ok, in this different setup we can see that ~14% of the capacity can actually be dedup'd - still a nice savings.
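
By the way, the mkfile observation is easy to verify: recreate the pool with compression enabled, and the 1MB of zeros should produce no data blocks at all in the zdb output, since ZFS turns all-zero blocks into holes when compressing. A sketch (same disks as above):

heavy# zpool destroy bigIO
heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# zfs set compression=on bigIO
heavy# mkfile 1m /bigIO/1m.txt
heavy# sync
heavy# zdb -L -S user:all bigIO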

So the question becomes - how dedupalicious is your pool?

ps: here is the simple perl script 'line_by_line_process.pl':

#!/usr/bin/perl

# Run this script as:
#  % line_by_line_process.pl <sorted zdb -S output file>


# total PSIZE
$totalps = 0;

# total unique PSIZE
$totalups = 0;

$last_cksum = -1;

$path = $ARGV[0];
open(FH, $path) or die "Can't open $path: $!";

while (<FH>) {
        my $line = $_;
        ($level, $psize, $ndvas, $type, $cksum_alg, $compress, $cksum) = split /\t/, $line, 7;
        if ($cksum ne $last_cksum) {
                $totalups += $psize;
        }
        $last_cksum = $cksum;
        $totalps += $psize;
}
close(FH);

print "total PSIZE:               0t".$totalps."\\n";
print "total unique PSIZE:        0t".$totalups."\\n";
print "total that can be duped:   0t".($totalps - $totalups)."\\n";
print "percent that can be duped  ".($totalps - $totalups) / $totalps \* 100 ."%\\n";

Tuesday Dec 18, 2007

Complete Sourceforge build for FileBench

A while back, we gave FileBench a much needed facelift. After its botox injection, we updated the sourceforge site's source code and integrated FileBench into OpenSolaris.

Now i'm happy to report that we have updated the build process on the sourceforge site so that anyone can do a complete build with the updated source (as in the freshest FileBench source in the world). This complete build not only builds the open source parts (go_filebench, filebench(1), workloads, .prof files, scripts), but also includes the closed source binaries (such as davenet and statit). So yes, xanadu is back! For those curious, the reason davenet and statit are closed binaries is that i don't happen to have the source code for them.

This update also includes pre-built packages for x86 and sparc (Solaris 10/OpenSolaris only). As i'm not familiar with creating packages for OSX, *BSD, or linux, if someone from that knowledge set wants to help out and automate the process to build packages for non-OpenSolaris platforms, we'd be much obliged.

"file"s merged into "fileset"s

Drew just putback 6601818 Turn FileBench "files" into filesets with 1 entry, which was a nice cleanup that merged the implementation of files into the fileset implementation. In FileBench News (which comes out sometimes quarterly, sometimes bi-monthly), you can read Drew's take on the implications of the changes.

This is a very nice simplification of the code and something that has been on the "todo" list for over two years. This was a major change, so FileBench has been updated to version 1.1.0 (from the previous 1.0.1). You can find these changes in OpenSolaris build snv_81 and immediately on sourceforge.

More goodness in the works...

FileBench Source and Bug/RFE Info

You can now easily browse the source code for FileBench in OpenSolaris using OpenGrok. I find OpenGrok much friendlier to use than what sourceforge offers.

For a basic breakdown of the source, the *.c's, *.h's, *.l, and *.y that construct the C binary 'go_filebench' can be found here. You can also browse the workloads, the main perl script filebench(1), the .prof files, and the scripts to compare results and to flush file system caches.

You can now also query our bug database for FileBench bugs found and, perhaps more interestingly, RFEs requested on OpenSolaris.

Thursday Oct 04, 2007

FileBench : a New Era in FS Performance

I'm happy to report that FileBench has gone through a significant overhaul, and we're happy to release the updated version. Bits will be posted to sourceforge tonight. I'm also happy to report that FileBench is now included in OpenSolaris. You can find it in our new "/usr/benchmarks" path.

Ok, great - just what the industry needed, another simple file system benchmark. Right? Nope.

First let me give you, dear reader, a taste of what we get internally and externally here at the ZFS team:

"I ran bonnie, dd, and tar'd up myfav_linuxkernel_tarball.tar.  Your file system sucks."

Though sometimes i'm happy to note we get:

"I ran bonnie, dd, and tar'd up myfav_linuxkernel_tarball.tar.  Your file system rulz!."

What is FileBench?

It is nice to hear that your file system does in fact "rule", but the problem with the above is that bonnie, dd, and tar are (obviously) not a comprehensive set of applications that can completely measure a file system. IOzone is quite nice, but it only tests basic I/O patterns. And there are many other file system benchmarks (mkfile, fsstress, fsrandom, mongo, iometer, etc). The problem with all of these benchmarks is that they only measure a specific workload (or a small set of workloads). None of them actually let you measure what a real application does. Yes, part of what Oracle does is random, aligned reads/writes (which many of the aforementioned benchmarks can measure), but the key is how the random reads/writes interact with each other *and* how they interact with the other parts of what Oracle does (the log writer being a major example). None of the aforementioned benchmarks can do that.

Enter FileBench.

So how does FileBench differ? FileBench is a framework of file system workloads for measuring and comparing file system performance. The key is in the workloads. FileBench has a simple .f language that allows you to describe and build workloads to simulate applications. You can create workloads to replace all the aforementioned benchmarks. But more importantly, you can create workloads to simulate complex applications such as a database. For instance, i didn't have to buy an Oracle license nor figure out how to install it on my system to find out if my changes to the vdev cache for ZFS helped database performance or not. I just used FileBench and its 'oltp.f' workload.
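
To give a flavor of the .f language, here's a sketch of a trivial sequential-read workload. Treat it as illustrative rather than copy-paste - attribute names vary a bit across FileBench versions:

define fileset name=bigfileset,path=/tmp,size=100m,entries=1,prealloc=100

define process name=filereader,instances=1
{
  thread name=filereaderthread,memsize=10m,instances=1
  {
    flowop openfile name=openfile1,filesetname=bigfileset,fd=1
    flowop readwholefile name=readfile1,fd=1,iosize=8k
    flowop closefile name=closefile1,fd=1
  }
}

run 60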

How do i Start using FileBench?

The best place to start is the quick start guide. You can find out lots more at our wiki - lots of good information and a great place to contribute. For troubleshooting, see the gotchas section.

How do i Contribute to FileBench?

If you'd like to write your own workloads, check out the very nice documentation Drew Wilson wrote up. This is actually where we (and by "we" i mean the FileBench community, not just us Sun people) would love the most contribution. We would really like people to verify the workloads we have today and build the workloads for tomorrow. This is a great opportunity for industry types and academics alike. We very much plan to incorporate new workloads into FileBench.

If you would like to help with the actual code of the filebench binaries, find us at perf-discuss@opensolaris.org.

FileBench is done, right?

Um, nope. There's actually lots of interesting work up ahead for FileBench. Besides building new workloads, the two other major focuses that need help are: multi-client support and plug-ins. The first is pretty obvious - we need to have support for multiple clients to benchmark NFS and CIFS. And we need that to work on multiple platforms (OpenSolaris, linux, *BSD, OSX, etc). The second is where experts in specific communities can help out. Currently, FileBench goes through whatever client/initiator implementation you have on your machine. But if you wanted to just do a comparison of server/target implementations, then you need a plug-in built into FileBench that both systems utilize (even if they are different operating systems). We started prototyping a plug-in for NFSv3. We've also thought about NFSv4, CIFS, and iSCSI. A community member suggested XAM. This is a very interesting space to explore.

So What does this all Mean?

If you need to do file system benchmarking, try out FileBench. Let us know what you like and what needs some love.

If you're thinking about developing a new file system benchmark, consider creating a new workload for FileBench instead. If that works out for you, please share your work. If for some reason it doesn't, please let us know why.

We really believe in the architecture of FileBench and really want it to succeed industry-wide (file system and hence storage). We know it works quite well on OpenSolaris and would love other developers to make sure it works just as well on their platforms (linux, *BSD, OSX, etc.).

Happy Benchmarking and long live RMC!

Tuesday Sep 11, 2007

NCQ sata_max_queue_depth tunable

Previously, i did some analysis on NCQ in OpenSolaris. It turned out that to get good multi-stream sequential read performance, you had to disable NCQ via the tunable 'sata_func_enable'. Disabling NCQ actually does two things: 1) it sets the number of concurrent I/Os to 1, and 2) it changes what you send down protocol-wise.

Turns out, the first is all we really need to get good performance for the multi-stream sequential read case, and doing the second actually exposes a bug in the firmware of certain disks. So i highly recommend the newly added 'sata_max_queue_depth' tunable instead of 'sata_func_enable'. As a reminder, put the following in /etc/system and reboot:

set sata:sata_max_queue_depth = 0x1

An admin command to allow you to do this on the fly without rebooting would be another step forward, but no official plans on that just yet.
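
In the meantime, kernel tunables like this can often be poked at runtime with mdb(1). A hedged sketch - this assumes sata_max_queue_depth is a 32-bit int and that the sata module consults it for newly issued commands, so test on a non-critical box first:

# echo 'sata_max_queue_depth/W 0t1' | mdb -kw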

Thanks Pawel!

Wednesday Aug 01, 2007

how to get the spa_history's bonus buffer

Here's a quick debugging note on how to get the offsets of the pool's history log:

> ::spa
ADDR                 STATE NAME                                                
fffffffec529d5c0    ACTIVE d
> fffffffec529d5c0::print spa_t spa_history
spa_history = 0xd
> ::dbufs -n mos -o 0xd -b bonus
0xffffffff39948990
> 0xffffffff39948990::print dmu_buf_t db_data | ::print spa_history_phys_t
{
    sh_pool_create_len = 0x10c
    sh_phys_max_off = 0x2000000
    sh_bof = 0x10c
    sh_eof = 0x30c
    sh_records_lost = 0
}
> 
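
And of course, if you want the records themselves rather than their offsets, the user-level view of the same log is simply:

# zpool history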

Wednesday Jul 18, 2007

vdev cache and ZFS database performance

A little while back, Neel did some nice evaluation of DB-on-ZFS performance. One issue that he correctly noted was: 6437054 vdev_cache wises up: increase DB performance by 16%.

The major issue for DB performance was that the vdev cache would inflate reads under 16KB (zfs_vdev_cache_max) to 64KB (1 << zfs_vdev_cache_bshift).

As you can guess, this inflation would really hurt typical databases, as they do lots of record-aligned random I/Os - and the random I/Os are typically under 16KB (Oracle and Postgres are usually configured with 8KB, JavaDB with 4KB, etc.). So why do we have this inflation in the first place? Turns out it's really important for pre-fetching metadata. One workload that demonstrates this is the multi-stream sequential read workload of FileBench. We can also use the oltp workload of FileBench to test database performance.

What we changed in order to fix 6437054 was to make the vdev cache only inflate I/Os for *metadata* - not *user* data. You can now see that logic in vdev_cache_read(). This logically makes sense, as we can now rely on zfetch to correctly pre-fetch user data (which depends more on what the application is doing), and the vdev cache to pre-fetch metadata (which depends more on where it was located on disk).

Ok, yeah, theory is nice, but let's see some measurements...

OLTP workload

Below are the results from using this profile (named 'db.prof'). This was on a thumper, non-debug bits, ZFS configured in a 46-disk RAID-0, and the recordsize set to 8KB.

OpenSolaris results without the fix for 6437054 (onnv-gate:2007-07-11)

diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/thisrun.f
 1698: 0.013: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
 1698: 0.014: Creating/pre-allocating files
 1698: 0.014: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
 1698: 0.014: Creating fileset logfile...
 1698: 0.118: Preallocated 1 of 1 of fileset logfile in 1 seconds
 1698: 0.118: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
 1698: 0.118: Creating fileset datafiles...
 1698: 341.433: Preallocated 10 of 10 of fileset datafiles in 342 seconds
 1698: 341.434: Starting 200 shadow instances
...

 1698: 345.768: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 1698: 345.774: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k
 1698: 345.776: Running...
 1698: 466.858: Run took 120 seconds...
 1698: 466.913: Per-Operation Breakdown
random-rate                 0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
shadow-post-dbwr         4237ops/s   0.0mb/s      0.5ms/op       37us/op-cpu
shadow-post-lg           4237ops/s   0.0mb/s      0.0ms/op        7us/op-cpu
shadowhog                4237ops/s   0.0mb/s      0.6ms/op      257us/op-cpu
shadowread               4255ops/s  33.2mb/s     45.7ms/op       67us/op-cpu
dbwr-aiowait               42ops/s   0.0mb/s      0.6ms/op       83us/op-cpu
dbwr-block                 42ops/s   0.0mb/s    215.7ms/op     2259us/op-cpu
dbwr-hog                   42ops/s   0.0mb/s      0.0ms/op       15us/op-cpu
dbwrite-a                4244ops/s  33.2mb/s      0.1ms/op        9us/op-cpu
lg-block                    1ops/s   0.0mb/s    745.3ms/op     6225us/op-cpu
lg-aiowait                  1ops/s   0.0mb/s      8.0ms/op       25us/op-cpu
lg-write                    1ops/s   0.3mb/s      0.3ms/op       26us/op-cpu

 1698: 466.913: 
IO Summary:      1034439 ops 8543.3 ops/s, (4255/4245 r/w)  66.7mb/s,    188us cpu/op,  45.7ms latency
 1698: 466.913: Stats dump to file 'stats.large_db_oltp_8k.out'
 1698: 466.913: in statsdump stats.large_db_oltp_8k.out
 1698: 466.922: Shutting down processes
^C 1698: 601.568: Aborting...
 1698: 601.568: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/stats.large_db_oltp_8k.out

diskmonster# 

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18)

diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/thisrun.f
 1083: 0.037: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
 1083: 0.037: Creating/pre-allocating files
 1083: 0.057: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
 1083: 0.057: Creating fileset logfile...
 1083: 0.194: Preallocated 1 of 1 of fileset logfile in 1 seconds
 1083: 0.194: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
 1083: 0.194: Creating fileset datafiles...
 1083: 335.203: Preallocated 10 of 10 of fileset datafiles in 336 seconds
 1083: 335.203: Starting 200 shadow instances
...
 1083: 339.484: Creating 221249536 bytes of ISM Shared Memory...
 1083: 339.649: Allocated 221249536 bytes of ISM Shared Memory... at fffffd7f8f600000
 1083: 339.650: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 1083: 339.725: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k
 1083: 339.729: Running...
 1083: 460.683: Run took 120 seconds...
 1083: 460.724: Per-Operation Breakdown
random-rate                 0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
shadow-post-dbwr         5224ops/s   0.0mb/s      0.8ms/op       30us/op-cpu
shadow-post-lg           5224ops/s   0.0mb/s      0.1ms/op        6us/op-cpu
shadowhog                5223ops/s   0.0mb/s      0.7ms/op      255us/op-cpu
shadowread               5240ops/s  40.9mb/s     36.3ms/op       56us/op-cpu
dbwr-aiowait               52ops/s   0.0mb/s      0.9ms/op       86us/op-cpu
dbwr-block                 52ops/s   0.0mb/s    167.6ms/op     1605us/op-cpu
dbwr-hog                   52ops/s   0.0mb/s      0.0ms/op       15us/op-cpu
dbwrite-a                5226ops/s  40.8mb/s      0.2ms/op        9us/op-cpu
lg-block                    2ops/s   0.0mb/s    600.5ms/op     4580us/op-cpu
lg-aiowait                  2ops/s   0.0mb/s      4.0ms/op       23us/op-cpu
lg-write                    2ops/s   0.4mb/s      1.2ms/op       24us/op-cpu

 1083: 460.724: 
IO Summary:      1272557 ops 10520.9 ops/s, (5240/5228 r/w)  82.2mb/s,    156us cpu/op,  36.3ms latency
 1083: 460.724: Stats dump to file 'stats.large_db_oltp_8k.out'
 1083: 460.724: in statsdump stats.large_db_oltp_8k.out
 1083: 460.731: Shutting down processes
 1083: 1060.645: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/stats.large_db_oltp_8k.out

diskmonster# 

10520.9 ops/s vs. 8543.3 ops/s - a 23% improvement! That's a nice out-of-the-box win!

Multi-Stream Sequential Read workload

A workaround previously mentioned to get better DB performance was to set 'zfs_vdev_cache_max' to 1 byte (which essentially disables the vdev cache, as the random I/Os are never going to be smaller than that). The problem with this approach is that it really hurts other workloads, such as the multi-stream sequential read workload. Below are the results using the same thumper, non-debug bits, ZFS in a 46-disk RAID-0, checksums turned off, NCQ disabled via 'set sata:sata_func_enable = 0x5' in /etc/system, and using this profile (named 'sqread.prof').
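
For reference, that workaround was applied by putting the following in /etc/system and rebooting (the value is in bytes):

set zfs:zfs_vdev_cache_max = 0x1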

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' left as its default value

diskmonster# filebench sqread
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/thisrun.f
 2135: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
 2135: 0.005: Creating/pre-allocating files
 2135: 55.235: Pre-allocated file /bigIO/largefile4
 2135: 118.147: Pre-allocated file /bigIO/largefile3
 2135: 184.602: Pre-allocated file /bigIO/largefile2
 2135: 251.991: Pre-allocated file /bigIO/largefile1
 2135: 263.341: Starting 1 seqread instances
 2136: 264.348: Starting 1 seqread4 threads
 2136: 264.348: Starting 1 seqread3 threads
 2136: 264.348: Starting 1 seqread2 threads
 2136: 264.348: Starting 1 seqread1 threads
 2135: 267.358: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 2135: 267.362: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m
 2135: 267.362: Running...
 2135: 388.128: Run took 120 seconds...
 2135: 388.130: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread4                  469ops/s 468.7mb/s      2.1ms/op     1391us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread3                  454ops/s 454.1mb/s      2.2ms/op     1412us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread2                  444ops/s 443.8mb/s      2.2ms/op     1400us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread1                  421ops/s 421.0mb/s      2.4ms/op     1414us/op-cpu

 2135: 388.130: 
IO Summary:      215878 ops 1787.6 ops/s, (1788/0 r/w) 1787.6mb/s,   1638us cpu/op,   2.2ms latency
 2135: 388.130: Stats dump to file 'stats.seqread1m.out'
 2135: 388.130: in statsdump stats.seqread1m.out
 2135: 388.136: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/stats.seqread1m.out

diskmonster# 

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' set to 1 (disabled vdev cache)

diskmonster# ./do_sqread 
cannot open 'bigIO': no such pool
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/thisrun.f
 4110: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
 4110: 0.005: Creating/pre-allocating files
 4110: 55.681: Pre-allocated file /bigIO/largefile4
 4110: 119.324: Pre-allocated file /bigIO/largefile3
 4110: 182.188: Pre-allocated file /bigIO/largefile2
 4110: 245.260: Pre-allocated file /bigIO/largefile1
 4110: 255.216: Starting 1 seqread instances
 4113: 256.222: Starting 1 seqread4 threads
 4113: 256.222: Starting 1 seqread3 threads
 4113: 256.222: Starting 1 seqread2 threads
 4113: 256.222: Starting 1 seqread1 threads
 4110: 259.232: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 4110: 259.236: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m
 4110: 259.236: Running...
 4110: 380.112: Run took 120 seconds...
 4110: 380.115: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread4                  369ops/s 369.5mb/s      2.7ms/op     1034us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread3                  375ops/s 375.2mb/s      2.7ms/op     1047us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread2                  369ops/s 369.2mb/s      2.7ms/op     1042us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread1                  296ops/s 296.2mb/s      3.4ms/op     1066us/op-cpu

 4110: 380.115: 
IO Summary:      170443 ops 1410.1 ops/s, (1410/0 r/w) 1410.1mb/s,   1325us cpu/op,   2.8ms latency
 4110: 380.115: Stats dump to file 'stats.seqread1m.out'
 4110: 380.115: in statsdump stats.seqread1m.out
 4110: 380.121: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/stats.seqread1m.out

diskmonster#

So by disabling the vdev cache, the throughput drops from 1787MB/s to 1410MB/s - about a 21% regression; disabling the vdev cache really hurts here. The nice thing is that with the fix for 6437054, we don't have to - and we get great DB performance too. My cake is tasty.

Future Work

Future work to increase DB on ZFS performance includes:
6457709 vdev_knob values should be determined dynamically
6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops

ps: If you want the oltp workload of filebench to run correctly, you'll need this minor fix to 'flowop_library.c':

@@ -600,10 +600,11 @@ 
                 aiocb->aio_fildes = filedesc;
                 aiocb->aio_buf = threadflow->tf_mem + memoffset;
                 aiocb->aio_nbytes = *flowop->fo_iosize;
                 aiocb->aio_offset = offset;
+                aiocb->aio_reqprio = 0;
 
 
                 filebench_log(LOG_DEBUG_IMPL,
                                 "aio fd=%d, bytes=%lld, offset=%lld",
                                 filedesc, *flowop->fo_iosize, offset);


happy databasing!

Tuesday Jun 12, 2007

ZFS on a laptop?

Sun is known for servers, not laptops. So a filesystem designed by Sun would surely be too powerful and too "heavy" for laptops - the features of a "datacenter" filesystem just wouldn't fit on a laptop. Right? Actually... no. As it turns out, ZFS is a great match for laptops.

Backup

One of the most important things a user needs to do on a laptop is back up their data. Copying your data to DVD or an external drive is one way. ZFS snapshots with 'zfs send' and 'zfs recv' are a better way. Due to its architecture, snapshots in ZFS are very fast and only take up as much space as the amount of data that has changed. For a typical user, taking a snapshot every day will only consume a small amount of capacity.

So let's start off with a ZFS pool called 'swim' and two filesystems: 'Music' and 'Pictures':

fsh-mullet# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
swim            157K  9.60G    21K  /swim
swim/Music       18K  9.60G    18K  /swim/Music
swim/Pictures    19K  9.60G    19K  /swim/Pictures
fsh-mullet# ls /swim/Pictures 
bday.jpg        good_times.jpg

Taking a snapshot 'today' of Pictures is this easy:

fsh-mullet# zfs snapshot swim/Pictures@today

And now we can see the contents of snapshot 'today' via the '.zfs/snapshot' directory:

fsh-mullet# ls /swim/Pictures/.zfs/snapshot/today 
bday.jpg        good_times.jpg
fsh-mullet# 

If you want to take a snapshot of all your filesystems, then you can do:

fsh-mullet# zfs snapshot -r swim@today      
fsh-mullet# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
swim                  100M  9.50G    21K  /swim
swim@today               0      -    21K  -
swim/Music            100M  9.50G   100M  /swim/Music
swim/Music@today         0      -   100M  -
swim/Pictures          19K  9.50G    19K  /swim/Pictures
swim/Pictures@today      0      -    19K  -
fsh-mullet# 

Now that you have snapshots, you can use the built-in 'zfs send' and 'zfs recv' to back up your data - even to another machine.

fsh-mullet# zfs send swim/Pictures@today | ssh host2 zfs recv -d backupswim

After you've sent over the first full snapshot via 'zfs send', you can then send incremental streams between that snapshot and any later one (here, a new snapshot 'tomorrow' of the same file system):

fsh-mullet# zfs send -i swim/Pictures@today swim/Pictures@tomorrow | ssh host2 zfs recv -d backupswim

Now let's look at the backup ZFS pool 'backupswim' on host 'host2':

host2# zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
backupswim                  100M  9.50G    21K  /backupswim
backupswim/Music            100M  9.50G   100M  /backupswim/Music
backupswim/Music@today         0      -   100M  -
backupswim/Pictures          18K  9.50G    18K  /backupswim/Pictures
backupswim/Pictures@today      0      -    18K  -

What's really nice about using ZFS's snapshots is that you only need to send over (and store) the differences between snapshots. So if you're doing video editing on your laptop and have a giant 10GB file, but only change, say, 1KB of data on a given day, with ZFS you only have to send over about 1KB of data - not the entire 10GB file. This also means you don't have to store multiple 10GB versions (one per snapshot) of the file on your backup device.
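
If you want to see this for yourself, here's a sketch (the 'Video' file system and file name are made up): snapshot, overwrite a single 1KB region in place, snapshot again, and check the size of the incremental stream - it should be close to 1KB plus a little stream metadata:

fsh-mullet# zfs snapshot swim/Video@before
fsh-mullet# dd if=/dev/urandom of=/swim/Video/big.mov bs=1k count=1 conv=notrunc
fsh-mullet# zfs snapshot swim/Video@after
fsh-mullet# zfs send -i swim/Video@before swim/Video@after > /tmp/incr
fsh-mullet# ls -lh /tmp/incr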

You can also back up to an external hard drive: create a backup pool on the second drive, and just 'zfs send/recv' your nightly snapshots.
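
A minimal sketch, assuming the external drive shows up as c8t0d0 (the device and pool names are illustrative):

fsh-mullet# zpool create backupswim c8t0d0
fsh-mullet# zfs snapshot -r swim@nightly
fsh-mullet# zfs send swim/Pictures@nightly | zfs recv -d backupswim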

Reliability

Since laptops (typically) only have one disk, handling disk errors is very important. Bill introduced ditto blocks to handle partial disk failures. With typical filesystems, if part of the disk is corrupted/failing and that part of the disk stores your metadata, you're screwed - there's no way to access the data associated with the inaccessible metadata short of restoring from backup. With ditto blocks, ZFS stores multiple copies of the metadata in the pool. In the single-disk case, we strategically store the copies at different locations on disk (such as at the front and the back of the disk). A subtle partial disk failure can render other filesystems useless, whereas ZFS can survive it.

Matt took ditto blocks one step further and allowed the user to apply them to any filesystem's data. What this means is that you can make your more important data more reliable by stashing away multiple copies of it (without muddying your namespace). Here's how you store two copies of your pictures:

fsh-mullet# zfs set copies=2 swim/Pictures
fsh-mullet# zfs get copies swim/Pictures
NAME           PROPERTY  VALUE          SOURCE
swim/Pictures  copies    2              local
fsh-mullet# 

Note: the 'copies' property only affects future writes (not existing data), so i recommend setting it at filesystem creation time:

fsh-mullet# zfs create -o copies=2 swim/Music
fsh-mullet# zfs get copies swim/Music
NAME        PROPERTY  VALUE       SOURCE
swim/Music  copies    2           local
fsh-mullet# 

Built-in Compression

With ZFS, compression comes built-in. The current algorithms are lzjb (based on Lempel-Ziv) and gzip. Now it's true that your jpegs and mp4s are already compressed quite nicely, but if you want to save capacity on other filesystems, all you have to do is:

fsh-mullet# zfs set compression=on swim/Documents
fsh-mullet# zfs get compression swim/Documents
NAME            PROPERTY     VALUE           SOURCE
swim/Documents  compression  on              local
fsh-mullet# 

The default compression algorithm is lzjb. If you want to use gzip, then do:

fsh-mullet# zfs set compression=gzip swim/Documents
fsh-mullet# zfs get compression swim/Documents     
NAME            PROPERTY     VALUE           SOURCE
swim/Documents  compression  gzip            local
fsh-mullet# 

That single disk stickiness

A major problem with laptops today is the single point of failure: the single disk. It makes complete sense today that laptops are designed this way, given the physical space and power issues. But looking forward, as, say, flash gets cheaper and cheaper as well as more reliable, it becomes more and more of a possibility to replace the single disk in laptops. And once you save physical space, you can actually fit more than one flash device in the laptop. Wouldn't it be really cool if you could then build RAID on top of the multiple devices? Introducing some hardware RAID controller doesn't make any sense - but software RAID does.

ZFS allows you to do mirroring as well as RAID-Z (ZFS's unique form of RAID-5) - in software.

Creating a mirrored pool is easy:

diskmonster# zpool create swim mirror c7t0d0 c7t1d0
diskmonster# zpool status
  pool: swim
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        swim        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0

errors: No known data errors
diskmonster# 

Similarly, creating a RAID-Z is also easy:

diskmonster# zpool create swim raidz c7t0d0 c7t1d0 c7t2d0 c7t5d0
diskmonster# zpool status
  pool: swim
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        swim        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0

errors: No known data errors
diskmonster# 

With either of these configurations, your laptop can now handle a whole device failure.

ZFS on a laptop - a perfect fit.

Tuesday May 29, 2007

NCQ performance analysis

hi

Tuesday Apr 10, 2007

Poor Man's Cluster - end the corruption

The putback of 6282725 hostname/hostid should be stored in the label introduces hostid checking when importing a pool. If the pool was last accessed by another system, then the import is denied (though of course this can be overridden with the '-f' flag).

This is especially important to people rolling their own clusters - the so-called poor man's cluster. What people were finding is:

1) clientA creates the pool (using shared storage)
2) clientA reboots/panics
3) clientB forcibily imports the pool
4) clientA comes back up
5) clientA automatically imports the pool via /etc/zfs/zpool.cache

At this point, both clientA and clientB have the same pool imported and both can write to it - however, ZFS is not designed to have multiple writers (yet), so both clients will quickly corrupt the pool as each has a different view of the pool's state.

Now that we store the hostid in the label and verify that the system importing the pool was the last one to access it, the poor man's cluster corruption scenario mentioned above can no longer happen. Below is an example using shared storage over iSCSI. In the example, clientA is 'fsh-weakfish' and clientB is 'fsh-mullet'.

First, let's create the pool on clientA (assume both clients are already setup for iSCSI):

fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-weakfish# zfs create i/wombat
fsh-weakfish# zfs create i/hulio 
fsh-weakfish# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          154K  9.78G    19K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-weakfish#

Note the enhanced information 'zpool import' reports on clientB:

fsh-mullet# zpool import
  pool: i
    id: 8574825092618243264
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        i                                        ONLINE
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE
fsh-mullet# zpool import i
cannot import 'i': pool may be in use from other system, it was last accessed by
fsh-weakfish (hostid: 0x4ab08c2) on Tue Apr 10 09:33:07 2007
use '-f' to import anyway
fsh-mullet#

Ok, we don't want to forcibly import the pool until clientA is down. So after clientA (fsh-weakfish) has rebooted, forcibly import the pool on clientB (fsh-mullet):

fsh-weakfish# reboot
....

fsh-mullet# zpool import -f i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet#

After clientA comes back up, we'll see this message via syslog:

WARNING: pool 'i' could not be loaded as it was last accessed by another system
(host: fsh-mullet hostid: 0x8373b35b).  See: http://www.sun.com/msg/ZFS-8000-EY

And just to double check to make sure that pool 'i' is in fact not loaded:

fsh-weakfish# zpool list
no pools available
fsh-weakfish# 

And to verify the pool has not been corrupted from clientB's view of the world, we see:

fsh-mullet# zpool scrub i
fsh-mullet# zpool status
  pool: i
 state: ONLINE
 scrub: scrub completed with 0 errors on Tue Apr 10 10:28:03 2007
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-mullet# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
i          156K  9.78G    21K  /i
i/hulio     18K  9.78G    18K  /i/hulio
i/wombat    18K  9.78G    18K  /i/wombat
fsh-mullet# 

See you never again, poor man's cluster corruption.

One detail i'd like to point out is that you have to be careful about *when* you forcibly import a pool. For instance, if you forcibly import the pool on clientB *before* you reboot clientA, then corruption can still happen. This is because the reboot(1M) command cleanly takes down the machine, which means it unmounts all filesystems, and unmounting a filesystem writes a bit of data to the pool.

To see the new information on the label, you can use zdb(1M):

fsh-mullet# zdb -l /dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=6
    name='i'
    state=0
    txg=665
    pool_guid=8574825092618243264
    hostid=2205397851
    hostname='fsh-mullet'
    top_guid=5676430250453749577
    guid=5676430250453749577
    vdev_tree
        type='disk'
        id=0
        guid=5676430250453749577
        path='/dev/dsk/c2t01000003BAAAE84F00002A0045F86E49d0s0'
        devid='id1,ssd@x01000003baaae84f00002a0045f86e49/a'
        whole_disk=1
        metaslab_array=14
        metaslab_shift=26
        ashift=9
        asize=10724048896
        DTL=30
--------------------------------------------
LABEL 1
--------------------------------------------
...

Wednesday Mar 28, 2007

sharemgr - a new way to share

Back in Nov. of 2006 (snv_53), Doug introduced sharemgr(1M), a new way of managing your NFS shares in Solaris.

He also has a blog entry on how sharemgr currently interacts with ZFS.

happy sharing,
eric

Wednesday Mar 14, 2007

iSCSI storage with zvols

Now that iSCSI support is built into ZFS, let's see how to set up some storage with zvols.

On the server, we create a pool, a zvol, and share the zvol over iSCSI:

fsh-suzuki# zpool create iscsistore c0t1d0
fsh-suzuki# zfs create -s -V 10gb iscsistore/zvol
fsh-suzuki# zfs set shareiscsi=on iscsistore/zvol
fsh-suzuki# iscsitadm list target -v
Target: iscsistore/zvol
    iSCSI Name: iqn.1986-03.com.sun:02:a7f19760-5d17-ee50-f011-c4c749add692
    Alias: iscsistore/zvol
    Connections: 0
    ACL list:
    TPGT list:
    LUN information:
        LUN: 0
            GUID: 0x0
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size:   10G
            Backing store: /dev/zvol/rdsk/iscsistore/zvol
            Status: online
fsh-suzuki# 

Now on the client, we need to discover the iSCSI share (192.168.16.135 is the IP of the server):

fsh-weakfish# iscsiadm list discovery
Discovery:
        Static: disabled
        Send Targets: enabled
        iSNS: disabled
fsh-weakfish# iscsiadm modify discovery --sendtargets enable
fsh-weakfish# iscsiadm add discovery-address 192.168.16.135
fsh-weakfish# svcadm enable network/iscsi_initiator
fsh-weakfish# iscsiadm list target
Target: iqn.1986-03.com.sun:02:a7f19760-5d17-ee50-f011-c4c749add692
        Alias: iscsistore/zvol
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
fsh-weakfish# 

Now we can create a pool on the client using the iSCSI device:

fsh-weakfish# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c1t1d0 
          /pci@0,0/pci1022,7450@a/pci17c2,10@4/sd@1,0
       1. c2t01000003BAAAE84F00002A0045F86E49d0 
          /scsi_vhci/disk@g01000003baaae84f00002a0045f86e49
Specify disk (enter its number): ^C
fsh-weakfish# zpool create i c2t01000003BAAAE84F00002A0045F86E49d0
fsh-weakfish# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
i                      9.94G     89K   9.94G     0%  ONLINE     -
fsh-weakfish# zpool status
  pool: i
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        i                                        ONLINE       0     0     0
          c2t01000003BAAAE84F00002A0045F86E49d0  ONLINE       0     0     0

errors: No known data errors
fsh-weakfish# 

Yep, that's it!
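
And when you're done, tearing it down is just as short - a sketch, in reverse order (export on the client, then stop sharing on the server):

fsh-weakfish# zpool export i
fsh-weakfish# iscsiadm remove discovery-address 192.168.16.135
fsh-suzuki# zfs set shareiscsi=off iscsistore/zvol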
