Wednesday Apr 23, 2008

zones and ZFS file systems

Starting off with a freshly created pool, let's see the steps to create a zone based on a ZFS file system. Here we see our new pool with only one file system:

fsh-sole# zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
kwame   160K  7.63G    18K  /kwame
fsh-sole#

Now, we'll create and configure a local zone "ejkzone". Note that we set the zonepath under the ZFS pool's mountpoint:

fsh-sole# zonecfg -z ejkzone
ejkzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:ejkzone> create
zonecfg:ejkzone> set zonepath=/kwame/kilpatrick
zonecfg:ejkzone> commit
zonecfg:ejkzone> exit
fsh-sole#
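
If you want to double-check the configuration before installing, zonecfg can print it back for you (a quick sanity check; this wasn't part of the captured session):

fsh-sole# zonecfg -z ejkzone info zonepath
fsh-sole# zonecfg -z ejkzone export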

Now we install zone "ejkzone" and notice that the installation tells us that it will automatically create a ZFS file system for us:

fsh-sole# zoneadm -z ejkzone install
A ZFS file system has been created for this zone.
Preparing to install zone <ejkzone>.
Creating list of files to copy from the global zone.
Copying <10116> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <1198> packages on the zone.
Initialized <1198> packages on zone.                                 
Zone <ejkzone> is initialized.
The file  contains a log of the zone installation.
fsh-sole#

Now we can boot the zone to use it, and can also see that the file system kwame/kilpatrick was automatically created for us:

fsh-sole# zoneadm -z ejkzone boot   
fsh-sole# zoneadm list
global
ejkzone
fsh-sole# zoneadm -z ejkzone list -v
  ID NAME             STATUS     PATH                           BRAND    IP    
   3 ejkzone          running    /kwame/kilpatrick              native   shared
fsh-sole# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
kwame              517M  7.12G    20K  /kwame
kwame/kilpatrick   517M  7.12G   517M  /kwame/kilpatrick
fsh-sole# 

Now if we log in to the zone via 'zlogin -C ejkzone', we notice that the local zone cannot see any ZFS file systems (only the global zone can):

ejkzone# zfs list
no datasets available
ejkzone# 

We can also create ZFS file systems and delegate them to the local zone "ejkzone", giving "ejkzone" administrative control over those file systems. From the global zone, we do:

fsh-sole# zfs create kwame/textme
fsh-sole# zonecfg -z ejkzone
zonecfg:ejkzone> add dataset
zonecfg:ejkzone:dataset> set name=kwame/textme
zonecfg:ejkzone:dataset> end
zonecfg:ejkzone> exit
fsh-sole#
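
To confirm the dataset resource was recorded, you can ask zonecfg for it (again, a quick check that wasn't captured in the session above):

fsh-sole# zonecfg -z ejkzone info dataset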

Now, we can get the "zoned" property of the newly created file system:

fsh-sole# zfs get zoned kwame/textme 
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  zoned     off           default
fsh-sole# 

Huh, it says "off". But we delegated it to a local zone. Why is that? Well, for the delegation to take effect, we have to reboot the local zone.
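Rebooting from the global zone is a one-liner (not part of the captured transcript):

fsh-sole# zoneadm -z ejkzone reboot

After doing that, we can see from the global zone: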

fsh-sole# zfs get zoned kwame/textme
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  zoned     on            local
fsh-sole# 

And from the local zone "ejkzone":

ejkzone# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
kwame          595M  7.05G    20K  /kwame
kwame/textme    18K  7.05G    18K  /kwame/textme
ejkzone# 

And now we have administrative control over the file system via the local zone:

ejkzone# zfs get copies kwame/textme 
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  copies    1             default
ejkzone# zfs set copies=2 kwame/textme
ejkzone# zfs get copies kwame/textme  
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  copies    2             local
ejkzone# 
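
Because the whole dataset is delegated, the zone administrator can also create descendant file systems and snapshots under it, for example (hypothetical names, not from the session above):

ejkzone# zfs create kwame/textme/mail
ejkzone# zfs snapshot kwame/textme@today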

Double-checking the "copies" setting from the global zone:

fsh-sole# zfs get copies kwame/textme
NAME          PROPERTY  VALUE         SOURCE
kwame/textme  copies    2             local
fsh-sole# zpool history -l
History for 'kwame':
2008-04-23.16:01:17 zpool create -f kwame c1d0s3 [user root on fsh-sole:global]
2008-04-23.16:29:42 zfs create kwame/textme [user root on fsh-sole:global]
2008-04-23.16:36:45 zfs set copies=2 kwame/textme [user root on fsh-sole:ejkzone]

fsh-sole# 
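
One last note: even though "ejkzone" now administers kwame/textme, the global administrator can still cap how much of the pool it consumes, for example with a quota on the delegated dataset (a hypothetical follow-on, not part of the session above):

fsh-sole# zfs set quota=2g kwame/textme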

Happy zoning

Wednesday Mar 19, 2008

how dedupalicious is your pool?

With the putback of 6656655 "zdb should be able to display blkptr signatures", we can now get the "signature" of the block pointers in a pool. To see an example, let's first put some content into an empty pool:

heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
bigIO   928G  95.5K   928G     0%  ONLINE  -
heavy# mkfile 1m /bigIO/1m.txt
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt

Now let's run zdb with the new option "-S". We pass in "user:all", where "user" tells zdb that we only want user data blocks (as opposed to both user and metadata) and "all" tells zdb to print out all blocks (skipping any checksum algorithm strength comparisons).

heavy# zdb -L -S user:all bigIO
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       131072  1       ZFS plain file  fletcher2       uncompressed    0:0:0:0
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
0       512     1       ZFS plain file  fletcher2       uncompressed    7075646564206f6e:a:eac8cac840dedc0:140
0       512     1       ZFS plain file  fletcher2       uncompressed    656d207075646564:a:ada40e0eac8cac80:140
heavy# 

This displays the signature of each block pointer, where the columns are: level, physical size, number of DVAs, object type, checksum type, compression type, and finally the actual checksum of the block.

So this is interesting, but what could we do with this information? Well, one thing we could do is figure out how much your pool could take advantage of dedup. Let's assume that the dedup implementation matches blocks based on the actual checksum, and that any checksum algorithm is strong enough (in reality, we'd need sha256 or stronger). So starting with the above pool and using a simple perl script 'line_by_line_process.pl' (shown at the end of this blog), we find:

heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt 
total PSIZE:               0t1050624
total unique PSIZE:        0t132096
total that can be duped:   0t918528
percent that can be duped  87.4269005847953%
heavy#   
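
Working through the zdb output by hand: the eight 131072-byte blocks are mkfile's all-zero blocks and share the checksum 0:0:0:0, and three of the four 512-byte blocks share the "dedup me" checksum. So the total PSIZE is 8*131072 + 4*512 = 1050624 bytes, the unique PSIZE is 131072 + 2*512 = 132096 bytes, and the difference, 918528 bytes, is what dedup could reclaim.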

In our trivial case, we can see that we could get a huge win - 87% of the pool can be dedup'd! Upon closer examination, we notice that mkfile writes out all-zero blocks. If you had compression enabled, there wouldn't be any actual blocks for this file at all. So let's look at a case where just the "ejk.txt" contents are getting dedup'd:

heavy# zpool destroy bigIO
heavy# zpool create bigIO c0t0d0 c0t1d0
heavy# dd if=/dev/random of=/bigIO/1m.txt bs=1024 count=5
5+0 records in
5+0 records out
heavy# echo "dedup me" > /bigIO/ejk.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk2.txt
heavy# echo "no dedup" > /bigIO/nope.txt
heavy# cp /bigIO/ejk.txt /bigIO/ejk3.txt
heavy# zdb -L -S user:all bigIO > /tmp/zdb_out.txt
heavy# sort -k 7 -t "`/bin/echo '\t'`" /tmp/zdb_out.txt > /tmp/zdb_out_sorted.txt
heavy# ./line_by_line_process.pl /tmp/zdb_out_sorted.txt       
total PSIZE:               0t7168
total unique PSIZE:        0t6144
total that can be duped:   0t1024
percent that can be duped  14.2857142857143%
heavy# 
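
This time the dd data accounts for 5120 bytes of unique blocks and the four small text files for 512 bytes each (7168 bytes total), so only the two extra copies of "dedup me" (1024 bytes) are duplicates.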

Ok, in this different setup we see that ~14% of the capacity can actually be dedup'd - still a nice savings.

So the question becomes - how dedupalicious is your pool?




ps: here is the simple perl script 'line_by_line_process.pl':

#!/usr/bin/perl

# Run this script as:
#  % line_by_line_process.pl /tmp/zdb_out_sorted.txt
#
# The input must already be sorted by the checksum column (column 7),
# so duplicate blocks end up on adjacent lines.

# total PSIZE of all blocks
$totalps = 0;

# total PSIZE of unique blocks
$totalups = 0;

$last_cksum = -1;

$path = $ARGV[0];
open(FH, $path) or die "Can't open $path: $!";

while (<FH>) {
        my $line = $_;
        ($level, $psize, $ndvas, $type, $cksum_alg, $compress, $cksum) = split /\t/, $line, 7;
        # only count a block's size as "unique" the first time we see its checksum
        if ($cksum ne $last_cksum) {
                $totalups += $psize;
        }
        $last_cksum = $cksum;
        $totalps += $psize;
}
close(FH);

print "total PSIZE:               0t".$totalps."\n";
print "total unique PSIZE:        0t".$totalups."\n";
print "total that can be duped:   0t".($totalps - $totalups)."\n";
print "percent that can be duped  ".($totalps - $totalups) / $totalps * 100 ."%\n";

Tuesday Dec 18, 2007

FileBench Source and Bug/RFE Info

You can now easily browse the source code for FileBench in OpenSolaris using OpenGrok. I find OpenGrok much friendlier to use than what sourceforge offers.

For a basic breakdown of the source, the *.c's, *.h's, *.l, and *.y files that construct the C binary 'go_filebench' can be found here. You can also browse the workloads, the main perl script filebench(1), the .prof files, and the scripts to compare results and to flush file system caches.

You can now also query our bug database for FileBench bugs and, perhaps more interestingly, RFEs requested on OpenSolaris.

Thursday Oct 04, 2007

FileBench : a New Era in FS Performance

I'm happy to report that FileBench has gone through a significant overhaul, and we're releasing that updated version. Bits will be posted to sourceforge tonight. I'm also happy to report that FileBench is now included in OpenSolaris. You can find it in our new "/usr/benchmarks" path.

Ok, great, just what the industry needed - FileBench is just another simple benchmark, right? Nope.

First let me give you, dear reader, a taste of what we get internally and externally here at the ZFS team:

"I ran bonnie, dd, and tar'd up myfav_linuxkernel_tarball.tar.  Your file system sucks."

Though sometimes i'm happy to note we get:

"I ran bonnie, dd, and tar'd up myfav_linuxkernel_tarball.tar.  Your file system rulz!."

What is FileBench?

It is nice to hear that your file system does in fact "rule", but the problem with the above is that bonnie, dd, and tar are (obviously) not a comprehensive set of applications that can completely measure a file system. IOzone is quite nice, but it only tests basic I/O patterns. And there are many other file system benchmarks (mkfile, fsstress, fsrandom, mongo, iometer, etc.). The problem with all of these benchmarks is that they only measure a specific workload (or a small set of workloads). None of them actually lets you measure what a real application does. Yes, part of what Oracle does is random aligned reads/writes (which many of the aforementioned benchmarks can measure), but the key is how the random reads/writes interact with each other *and* how they interact with the other parts of what Oracle does (the log writer being a major example). None of the aforementioned benchmarks can do that.

Enter FileBench.

So how does FileBench differ? FileBench is a framework of file system workloads for measuring and comparing file system performance. The key is in the workloads. FileBench has a simple .f language that allows you to describe and build workloads to simulate applications. You can create workloads to replace all the pre-mentioned benchmarks. But more importantly, you can create workloads to simulate complex applications such as a database. For instance, i didn't have to buy an Oracle license nor figure out how to install it on my system to find out if my changes to the vdev cache for ZFS helped database performance or not. I just used FileBench and its 'oltp.f' workload.

How do i Start using FileBench?

The best place to start is the quick start guide. You can find lots more at our wiki, which has lots of good information and is a great place to contribute. For troubleshooting, see the gotchas section.

How do i Contribute to FileBench?

If you'd like to write your own workloads, check out the very nice documentation Drew Wilson wrote up. This is actually where we (where "we" is the FileBench community, not just us Sun people) would love the most contribution. We would really like to have people verify the workloads we have today and build the workloads for tomorrow. This is a great opportunity for industry types and academics alike. We very much plan to incorporate new workloads into FileBench.

If you would like to help with the actual code of the filebench binaries, find us at perf-discuss@opensolaris.org.

FileBench is done, right?

Um, nope. There's actually lots of interesting work ahead for FileBench. Besides building new workloads, the two other major areas that need help are multi-client support and plug-ins. The first is pretty obvious - we need support for multiple clients in order to benchmark NFS and CIFS, and we need that to work across platforms (OpenSolaris, linux, *BSD, OSX, etc.). The second is where experts in specific communities can help out. Currently, FileBench goes through whatever client/initiator implementation you have on your machine. But if you wanted to do a comparison of just server/target implementations, then you need a plug-in built into FileBench that both systems utilize (even if they are different operating systems). We started prototyping a plug-in for NFSv3. We've also thought about NFSv4, CIFS, and iSCSI. A community member suggested XAM. This is a very interesting space to explore.

So What does this all Mean?

If you need to do file system benchmarking, try out FileBench. Let us know what you like and what needs some love.

If you're thinking about developing a new file system benchmark, consider creating a new workload for FileBench instead. If that works out for you, please share your work. If for some reason it doesn't, please let us know why.

We really believe in the architecture of FileBench and really want it to succeed industry-wide (file system and hence storage). We know it works quite well on OpenSolaris and would love other developers to make sure it works just as well on their platforms (linux, *BSD, OSX, etc.).

Happy Benchmarking and long live RMC!

Tuesday Sep 11, 2007

NCQ sata_max_queue_depth tunable

Previously, i did some analysis on NCQ in OpenSolaris. It turned out that to get good multi-stream sequential read performance, you had to disable NCQ via the tunable 'sata_func_enable'. Disabling NCQ actually does two things: 1) it sets the number of concurrent I/Os to 1, and 2) it changes what you send down protocol-wise.

It turns out the first is all we really need to get good performance in the multi-stream sequential read case, and doing the second actually exposes a bug in the firmware of certain disks. So i highly recommend the newly added 'sata_max_queue_depth' tunable instead of 'sata_func_enable'. As a reminder, put the following in /etc/system and reboot:

set sata:sata_max_queue_depth = 0x1
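
After the reboot, you can verify the value stuck by reading the variable with mdb (read-only; this assumes the sata module is loaded so the symbol resolves, and it wasn't part of my original testing):

# echo 'sata`sata_max_queue_depth/D' | mdb -k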

An admin command to allow you to do this on the fly without rebooting would be another step forward, but there are no official plans for that just yet.

Thanks Pawel!

Tuesday May 29, 2007

NCQ performance analysis

hi