Wednesday Mar 07, 2007

ASCII Bar graph using Ruby

I am a big Ruby fan. In the spirit of sharing, here is a small ruby script that draws ASCII bar graphs.

#!/bin/env ruby -w
# Given an array, print it as a bar graph
# Use it at your own risk

class HorizBar
  WIDTH = 72
  HEIGHT = 16
  def initialize(array)
    @values = array
  end
  def draw
    #Adjust X axis when there are more than WIDTH cols
    if @values.length > WIDTH then
      old_values = @values;
      @values = []
      0.upto(WIDTH - 1){ |i| @values << old_values[i*old_values.length/WIDTH]}
    end
    max = @values.max
    # initialize display with blanks
    display = Array.new(HEIGHT).collect { Array.new(WIDTH, ' ') }
    @values.each_with_index do |e, i|
      num = e*HEIGHT/max
      (HEIGHT - 1).downto(HEIGHT - 1 - num){|j| display[j][i] = '|'}
    end    
    display.each{|ar| ar.each{|e| putc e}; puts "\n"} #now print
  end
end

# Sample usage 1
sample = [28829, 29095, 29301, 31827, 43478, 52937, 62969]
HorizBar.new(sample).draw

# Another Sample usage
a = []
100.times { a << rand(100)}
HorizBar.new(a).draw

It produces output similar to



 |  |  |             |               |                |      |
 |  |  |             |      |   |    |     |   |      |     ||
 |  |  |             |      ||  |    |    ||   |  |   |     ||
 |  |  |           | |      ||  |    | |  ||   |  |   |   | ||
|| ||  |           | |      ||| |    | |  |||  || |   |   | ||       |
|| ||  |      |   || |      ||| |    | |  |||  || |   |   | ||    |  |
|| ||  |   |  |   || ||   | ||| |    | |  |||  || |   |   | ||    |  |
|| ||  |   |  |   || ||   | ||| | |  | |  |||  || |   |   | ||    |  |
|||||  |   |  |   || |||  | ||| | |  |||  |||  || |   |   | ||    | ||
|||||  |   |  |   || |||  | ||| | | |||||||||  || |  ||   | ||    | ||
|||||  || ||  | | || ||| || ||| ||| |||||||||  || |  ||   ||||    | ||
|||||  || ||  | | || ||| || ||| ||| |||||||||| || || || | ||||    | |||
||||| |||||| || | || |||||| ||| ||| |||||||||| || || || |||||| | ||||||
||||| |||||| ||||||| |||||| ||| |||||||||||||| |||||||| |||||| | ||||||
||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Rudimentary, but quite useful if you want to look at some data quickly.

Thursday Feb 08, 2007

ZFS and OLTP workloads: Time for some numbers

My last entry provided some recommendations regarding the use of ZFS with databases. Time now to share some updated numbers.

Before we go to the numbers, it is important to note that these results are for the OLTP/Net workload, which may or may not represent your workload. These results are also specific to our system configuration, and may not be true for all system configurations. Please test your own workload before drawing any conclusions. That said, OLTP/Net is based on well known standard benchmarks, and we use it quite extensively to study performance on our rigs.

Filesystem      FS Checksum   Database Checksum [1]   Normalized Throughput [2]
UFS Directio    N/A           No                      1.12
UFS Directio    N/A           Yes                     1.00
ZFS             Yes           No                      0.94

[1] Both block checksumming as well as block checking
[2] Bigger is better

Databases usually checksum their blocks to maintain data integrity. Oracle, for example, uses a per-block checksum, and checksum checking is on by default. This is typically recommended because most filesystems do not have a checksumming feature. With ZFS, checksums are enabled by default. When the database does the checksumming (it is not tightly integrated with the filesystem/volume manager), a checksum error has to be handled by the database itself. Since ZFS includes volume manager functionality, a checksum error is handled transparently by ZFS (if you have some kind of redundancy like mirroring or raidz), and the situation is corrected before a read error is ever returned to the database. Moreover, ZFS will repair corrupted blocks via self-healing. While RAS experts will note that end-to-end checksumming at the database level is slightly better than end-to-end checksumming at the ZFS level, ZFS checksums give you unique advantages while providing almost the same level of RAS.

If you do not penalize ZFS with double checksums, we are within 6% of our best UFS number. So 6% gives you provable data integrity, unlimited snapshots, no fsck, and all the other good features. Quite good in my book :-) Of course, this number is only going to get better as more performance enhancements make it into the ZFS code.
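If you want to avoid paying for checksums twice, there are two ways to go about it. A hedged sketch of both (the dbpool/oradata dataset name and the Oracle parameter usage are examples, not our exact setup):

  # Option 1 (keeps the ZFS advantages): leave ZFS checksums on and turn off
  # the database-level block checksum; in Oracle this is db_block_checksum.
  #   SQL> ALTER SYSTEM SET db_block_checksum = FALSE SCOPE = SPFILE;

  # Option 2: keep the database checksum and disable ZFS checksums on the
  # filesystem holding the data files.
  zfs set checksum=off dbpool/oradata
  zfs get checksum dbpool/oradata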

More about the workload.
The tests were done with OLTP/Net with a 72 CPU Sun Fire E25K connected to 288 15k rpm spindles. We ran the test with around 50% idle time to simulate real customers. The test was done on Solaris Nevada build 46. Watch this space for numbers with the latest build of Nevada.

Monday Sep 25, 2006

ZFS and Databases

Databases and ZFS

Comparing UFS and ZFS out of the box, we find that ZFS performs slightly better than buffered UFS. We also demonstrate that it is possible to get performance improvements with ZFS by following a small set of recommendations. We have also identified a couple of tunings that help performance; these tunings will be on by default in future releases of ZFS.


We (PAE - Performance Engineering) recently completed a study to understand database performance with ZFS. Read on for more details and recommendations. You can also read Roch's blog on the same study.

Databases stress the filesystem in unique ways. Depending on the workload and configuration, you can have thousands of IO operations per second. These IOs are usually small (the database block size). All the writes are synchronous writes. Reads can be random or sequential. Some writes are also more critical than others. Depending on the configuration, reads are cached by the database program or by the filesystem (if supported/requested). In many cases where filesystems are used, the IO is spread over a few files. This causes the single writer lock to be very hot under certain configurations like buffered UFS.

Since IO is so important for databases, it is not surprising that there are a lot of heavyweight players in this arena. UFS, QFS, and VxFS are quite popular with customers as the underlying filesystem. So how does the new kid on the block (ZFS) do?

We used an internally developed benchmark called OLTP/Net to study database performance with ZFS. OLTP/Net (O-L-T-P slash Net) is an OLTP benchmark that simulates an online store. The major feature of the benchmark is that it has a bunch of tuning knobs that control the ratio of network IO to disk IO, the read/write nature of the transactions, the number of new connects/disconnects to the database, etc. This makes it quite easy to simulate customer situations in our labs. We use it quite extensively inside Sun to model real-world database performance, and have found/fixed quite a few performance issues using this workload.

For our ZFS study, we used the default settings for OLTP/Net. In this scenario, we have a read/write ratio of 2:1 and a network/disk IO ratio of 10:1. Since our goal is to run like most customers, we controlled the number of users (load generators) such that the box was 60% utilized.

The hardware configuration consisted of a T2000 with 32 x 1200MHz CPUs and 32GB RAM connected to 140 Fibre Channel JBODs. We used both Solaris 10 Update 2 and Solaris Nevada build 43 to do the analysis. We created one big dynamically striped pool with all the disks and set the recordsize of this pool to 8k. Each disk was divided into 2 slices. These slices were allocated to UFS and ZFS in round-robin fashion to ensure that each filesystem got an equal number of inner and outer slices.

Normally for OLTP benchmark situations, we try to use the smallest database block size for best performance. When we started out with our study, we used a block size of 2048 as that gives us the best performance with other filesystems. But since we are trying to do what most customers might do, we switched over to a block size of 8192. We did two kinds of tests, a cached database as well as a large (not cached) database. Details follow in the sections below.

Recommendations for ZFS and Databases

Most customers use buffered UFS filesystems, and ZFS already performs better than buffered UFS! Since we want to test performance, and we want ZFS to be super fast, we decided to compare ZFS with UFS directio. We noticed that UFS directio performs better than what we get with ZFS out of the box. With ZFS, not only was the throughput much lower, but we used more than twice the amount of CPU per transaction, and we were doing 2x the IO. The disks were also more heavily utilized.
We noticed that we were not only reading in more data, but we were also doing more IO operations than needed. A little bit of dtracing quickly revealed that these reads were originating from the write code path! More dtracing showed that these are level 0 blocks, and are being read in for the read-modify-write cycle. This led us to the FIRST recommendation
Match the database block size with ZFS record size.
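For an 8k database block size, a hedged sketch of what that looks like (the dataset name is made up; note that recordsize only affects files created after it is set, so set it before creating the data files):

  zfs set recordsize=8k dbpool/oradata
  zfs get recordsize dbpool/oradata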
A look at the DBMS statistics showed that "log file sync" was one of the biggest wait events. Since the log files were in the same filesystem as the data, we noticed higher latency for log file writes. We then created a different filesystem (in the same pool), but set the record size to 128K as log writes are typically large. We noticed a slight improvement in our numbers, but not the dramatic improvement we wanted to achieve. We then created a separate pool and used that pool for the database log files. We got quite a big boost in performance. This performance boost could be attributed to the decrease in write latency. Latency of database log writes is critical for OLTP performance. When we used one pool, the extra IOs to the disks increased the latency of the database log writes, and thus impacted performance. Moving the logs to a dedicated pool improved the latency of the writes, giving a performance boost. This leads us to our SECOND recommendation
If you have a write-heavy workload, you are better off separating the log files onto a separate pool
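A hedged sketch of that layout, with made-up device and dataset names: a dedicated pool for the database log files, with a larger recordsize to match the large sequential log writes:

  # data files stay in the main pool (recordsize=8k); logs get their own pool
  zpool create logpool c2t0d0 c2t1d0
  zfs create logpool/oralog
  zfs set recordsize=128k logpool/oralog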
Looking at the extra IO being generated by ZFS, we noticed that the reads from disk were 64K in size. This was puzzling as the ZFS recordsize is 8K. More dtracing, and we figured out that the vdev_cache (or software track buffer) reads in quite a bit more than what we request. The default size of the read is 64k (8x more than what we request). Not surprisingly, the ZFS team is aware of this, and there are quite a few change requests (CRs) on this issue:

4933977: vdev_cache could be smarter about prefetching
6437054: vdev_cache: wise up or die
6457709: vdev_knob values should be determined dynamically

Tuning the vdev_cache to read in only 8K at a time decreased the amount of extra IO by a big factor, and more importantly improved the latency of the reads too. This leads to our THIRD recommendation
Tune down the vdev_cache using ztune.sh[1] until 6437054 is fixed
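On our rig this was patched with ztune.sh[1]. On builds where the vdev_cache read size is exposed as the zfs_vdev_cache_bshift kernel global (an assumption about your build; check that the symbol exists first), a hedged mdb sketch of dropping the read size from 64k (2^16) to 8k (2^13):

  # print the current value, then patch the running kernel
  echo 'zfs_vdev_cache_bshift/D' | mdb -k
  echo 'zfs_vdev_cache_bshift/W 0t13' | mdb -kw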
OK, we have achieved quite a big boost from all the above tunings, but we are still seeing high latency for our IOs. We see that the disks are busier during spa_sync time. Having read Eric Kustarz's blog about 'vq_max_pending', we tried playing with that value. We found that setting it to 5 gives us the best performance (for our disks and our workload). Finding the optimal value involves testing multiple values -- a time-consuming affair. Luckily the fix is in the works

6457709: vdev_knob values should be determined dynamically

So, future releases of ZFS will have this auto-tuned. This leads us to our FOURTH recommendation
Tune vq_max_pending using ztune.sh[1] until 6457709 is fixed
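Again, we set this with ztune.sh[1]. On later builds that expose the knob as the zfs_vdev_max_pending global (an assumption about your build), a hedged /etc/system sketch for the value of 5 that worked best for our disks and our workload:

  * /etc/system (takes effect after a reboot)
  set zfs:zfs_vdev_max_pending = 5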
We tried various other things. For example, we tried changing the frequency of the spa_sync. The default is once every 5 seconds. We tried once every second, or once every 30 seconds, and even once every hour. While in some cases we saw marginal improvement, we noticed higher CPU utilization, or high spin on mutexes. Our belief is that this is something that is good out of the box, and we recommend you do not change it. We also tried changing the behaviour of the ZIL by modifying the zfs_immediate_write_sz value. Again, we did not see improvements. This leads to our FINAL recommendation

Let ZFS auto-tune. It knows best. In cases where tuning helps, expect ZFS to incorporate that fix in future releases of ZFS

In conclusion, you can improve the out-of-the-box performance of databases with ZFS by doing simple things. We have demonstrated that it is possible to run high-throughput workloads with the current release of ZFS. We have also shown that it is quite possible to get huge improvements in performance for databases in future versions of ZFS. Given the fact that ZFS is around a year old, this is amazing!!

[1] ztune.sh: Roch's script

Wednesday Aug 09, 2006

Nanotechnology and Renewable Energy

I attended a talk titled "Nanotechnology and Renewable Energy" by Prof. Paul Alivisatos, Professor of Chemistry at the University of California, Berkeley, and Associate Laboratory Director for Physical Sciences at Lawrence Berkeley National Laboratory. It was a very nice talk. Do not miss out on a chance to attend his talks.

The efficiency of solar cells varies from 2% to 35%. As the efficiency increases, the cost increases nonlinearly. It is interesting to note that the amount of energy consumed by the United States in one year is roughly equal to the amount of solar energy received by the earth in one hour! To generate 3TW (the current US power usage) using a 3% efficient solar technology would require solar cells to cover an area roughly equal to the size of Texas! The cost per kilowatt will be much higher too.

Disclaimer: I did not take notes during the talk, so my numbers may be slightly off. But I guess you get the big picture

Wednesday Aug 02, 2006

Solaris Internals

There are very few books that let people understand and admire the complexities of Solaris. Richard McDougall and Jim Mauro have written two such masterpieces, titled Solaris Internals 2nd Edition and Solaris Performance and Tools. I highly recommend you get your copy fast. Both Richard and Jim are colleagues of mine at PAE, so I am sure to get my book autographed!

Monday Jul 31, 2006

Real-World Performance

Performance for the real-world, where it matters the most.

A major portion of my job (@ PAE) is spent trying to optimize Solaris for real customer workloads. We tend to focus on databases, but work with other applications too. We have tons (both weight-wise and dollar-wise :-)) of equipment in our labs, where we try to replicate a real enterprise data center. Of course, the term "real customer workload" is a loaded term. Since most big customers are rarely willing to share their workloads, we have to simulate them or write something close to them in house. Trying to rewrite every customer's workload is not a scalable approach. Hence we have developed a workload called OLTP/Net that can be retrofitted to fit most customer workloads. Using several tuning knobs we can control the amount of reads, writes, network packets per transaction, connects, disconnects, etc. Think of it like a super workload! We have used it quite effectively to simulate several customer workloads.

There is a big difference between trying to get the best numbers for a benchmark and replicating a customer's setup. PAE has traditionally focused on getting the most out of the system. Our machines typically run at 100% utilization, run the latest and greatest Solaris builds, and have lots of tunings applied. We believe fully in Cary Millsap's statement

"Each CPU cycle that passes by unused is a cycle that you will never have a chance to use again; it is wasted forever. Time marches irrevocably onward."
(Performance Management: Myths & Facts, Cary V. Millsap, Oracle Corp, June 28, 1999)

However, many customers run their machines at less than 100% utilization to leave enough headroom for growth. When machines are not running at 100% utilization, things like idle loop performance matter a lot. If you have followed Solaris releases closely, you will have noticed several enhancements to idle loop performance that increase the efficiency of lightly loaded systems by quite a bit. Similarly, we have seen quite a few UFS + database performance enhancements over the past few releases of Solaris.

So while benchmark numbers do matter, real performance also matters, and we are working on it!

Monday Dec 12, 2005

Six OS's on a disk? Wait I can do seven!!

Update: In my previous blog I showed how to install 6 OS's on a disk. Well, actually you can have seven (7). Disk partitions are numbered from 0 to 7. Ignoring slice 2, that leaves us with 7 free slices on which to install our OS. I have yet to log on to a machine with 7 OS's on a disk though!!

Richard Elling pointed out that you could also use slice 2 (the loopback/backup/overlap slice). So that's 8. He also mentions that some SCSI devices support 16 slices, so you could fit quite a few more OS installations! Maybe we should have a competition for how many OS's you have installed on a single disk :-) My personal best is 6.

Friday Dec 02, 2005

Six OS's in one disk? Yes it is possible

Six (6) OS's in one disk

Do you want to install 6 OS's on a single disk? If so, read on.

The goal is to have 6 bootable OS's on a single disk. Why would one do it? Better sharing, more reliability, easier comparisons between OS versions, quicker recovery, ... BTW, I have only tried this on SPARC.

Although I am sure that people have been doing this for ages, I first heard it from Charles Suresh, who encouraged me to go ahead and give it a try.

Create Partitions

Disk partitions usually run from 0 - 7, with 2 being the overlap. For our experiment, we set 1 to be the swap. We sized the other partitions equally, with 0 being a little smaller than the others. On my 36GB disk, the partition table looks like the following

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm    2178 -  5655        4.79GB    (3478/0/0)  10047942
  1       swap    wu       0 -  2177        3.00GB    (2178/0/0)   6292242
  2     backup    wm       0 - 24619       33.92GB    (24620/0/0) 71127180
  3       root    wm    5656 -  9285        5.00GB    (3630/0/0)  10487070
  4       root    wm    9286 - 12915        5.00GB    (3630/0/0)  10487070
  5       root    wm   12916 - 16545        5.00GB    (3630/0/0)  10487070
  6       root    wm   16546 - 20175        5.00GB    (3630/0/0)  10487070
  7       root    wm   20176 - 24619        6.12GB    (4444/0/0)  12838716

Install The OS

Install Solaris from any source. I typically download the images from nana.eng, and use my jumpstart server. You can also install from CD, DVD, etc. Once you install on a slice, you can dd(1) it to the other slices and fix /etc/vfstab, as sketched below. This is the fastest way of installing multiple Solaris instances on a disk. If you want another version, or a different build, bfu is your friend. You can also save off these slices to some /net/... place and restore an OS at will (again using dd both ways, since you need to preserve the boot blocks). If you slice multiple machines this way, you can even copy slices across machines (assuming the same architecture etc.) - more scripts are needed to change /etc/hosts, hostname, net/*/hosts, etc.
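A hedged sketch of the copy step, using the disk layout above (adjust the device names for your system), followed by mounting the copy to point its vfstab at the new slice:

  # copy the freshly installed root on slice 0 to slice 3
  dd if=/dev/rdsk/c1t1d0s0 of=/dev/rdsk/c1t1d0s3 bs=1024k

  # mount the copy and edit its vfstab so the / entry references s3 instead of s0
  mount /dev/dsk/c1t1d0s3 /mnt
  vi /mnt/etc/vfstab
  umount /mnt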

Install via Jumpstart: Setup Profile

If you like things automated, you could perform a hands-off install via custom jumpstart. The first step is to setup the profile for your server. Since you want to preserve the existing partitions, you have to use the preserve keyword. The profile for my machine looks like the following
$cat zeeroh_class
install_type    initial_install
system_type    server
partitioning    explicit
dontuse        c1t0d0
filesys        c1t1d0s0 existing /
filesys        c1t1d0s1 existing swap
filesys        c1t1d0s3 existing /s3 preserve
filesys        c1t1d0s4 existing /s4 preserve
filesys        c1t1d0s5 existing /s5 preserve
filesys        c1t1d0s6 existing /s6 preserve
filesys        c1t1d0s7 existing /s7 preserve
cluster        SUNWCall

To install an OS on another slice, just change the root slice (c1t1d0s0 above).

Make sure that the directory where the profiles are stored is shared read-only.

Also ensure that you have a sysidcfg file set up correctly.
[neel@slc-olympics] config > cat sysidcfg
name_service=NIS
{domain_name=xxx.yyy.sun.com}
root_password=XXXXXXXXXX
security_policy=NONE
system_locale=en_US
terminal=vt100
timezone=US/Pacific
timeserver=localhost
network_interface=PRIMARY{protocol_ipv6=no}
[neel@slc-olympics] config >

Run the check script.

Note that these profiles can be stored on any server. That machine does not need to have anything special installed. You only need to make sure that the location of the profile, and other custom jumpstart scripts are shared via NFS in a "read-only" mode.

Jumpstart

On the jumpstart server (abc.yyy in my case), we added our machine to the list of clients as follows

./add_install_client -i bbb.aaa.xxx.xxx -e a:b:c:d:e:f -c slc-olympics:/export/config -p slc-olympics:/export/config zorrah sun4u

Now reboot your machine as follows

$ reboot -- net - install

Booting via multiple disks/partitions


  1. Find the path (ls -l /dev/rdsk/..)
  2. At the ok prompt, type show-disks and select disk
  3. Type nvalias diskX # this pastes the selected path (see the example session below)
  4. init 0
  5. boot diskX
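A hedged example of what this looks like at the ok prompt; the device path and alias name below are made up, and the path is pasted with Ctrl-Y after selecting it from the show-disks menu:

  ok show-disks
    a) /pci@1f,0/pci@1/scsi@8/disk
    ...
  ok nvalias disk3 /pci@1f,0/pci@1/scsi@8/disk@1,0:d
  ok boot disk3

The :d suffix selects slice 3 (the letters a through h map to slices 0 through 7).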


Monday Nov 28, 2005

ZIL block size selection policy

As I mentioned in my previous blog entry, the ZIL (ZFS Intent Log) operates with block sizes between ZIL_MIN_BLKSZ(4K) and ZIL_MAX_BLKSZ(128k).  Let us take a closer look at this.

The ZIL has to allocate a new zil block before it commits the current zil block. This is because the zil block being committed has to have a link to the next zil block. If you did not preallocate, you would have to update the next pointer in the previous block whenever you write a new zil block. This means you would have to read in the previous block, update the next pointer, and rewrite it. Obviously this is quite expensive (and quite complicated).

The current block size selection strategy is to choose either the combined size of all outstanding zil blocks or, if there are no outstanding zil blocks, the size of the last zil block that was committed. If the combined size is greater than 128k, it is capped at 128k.

The above strategy works in most cases, but behaves badly for certain edge cases.

Let us examine the zil block size for the set of actions described below
(dtrace -n ::zil_lwb_commit:entry'{@[1]=quantize((lwb_t*)args[2]->lwb_sz);}')

  1. Bunch of 2k O_DSYNC writes -- zil block size: 4k (ZIL_MIN_BLKSZ)
  2. Bunch of 128-byte O_DSYNC writes -- zil block size was 4k (ZIL_MIN_BLKSZ)
  3. Bunch of non-O_DSYNC writes ... No zil blocks written
  4. Bunch of 128 byte O_DSYNC writes -- zil block size was 64k !!
Oops! Why did the zil block size suddenly jump up to 64k above?

When the first O_DSYNC write was initiated in (4), the zil coalesced all outstanding log operations into a big block (in my case a 128k block and a 64k block) and then did a zil_commit. The next O_DSYNC write then chose 64k as the zil block size as that was the size of the last zil_commit. The following O_DSYNC writes then continued to use 64K as the zil block size.

Neil Perrin filed CR 6354547: sticky log buf size to fix this issue. His proposed fix is to use the size of the last block as the basis for the size of the new block. This should work optimally for most cases, but there is a possibility of empty log writes. We need to investigate this issue with "real" workloads.

Wednesday Nov 16, 2005

The ZFS Intent Log

A quick guide to the ZFS Intent Log (ZIL)

I am not a ZFS developer. However, I am interested in ZFS performance, and am intrigued by ZFS logging. I figure a good way to learn about something is to blog about it ;-). What follows are my notes as I made my way through the ZIL.

Introduction

Most modern file systems include a logging feature to ensure faster writes and faster crash recovery (no lengthy fsck). UFS has supported logging since Solaris 2.7, and logging is the default on Solaris 10. Our internal tests have shown that logging file systems perform as well as (sometimes even better than) non-logging file systems.

In ZFS, logging is implemented by the ZFS Intent Log (ZIL) module, which lives in the zil.c file. Here is a brief walk through of the logging implementation in ZFS. All of this knowledge can be found in the zil.[c|h] files in the ZFS source code. I also recommend you check out Neil's blog -- he is one of the ZFS developers who works on the ZIL.

All file system related system calls are logged as transaction records by the ZIL. These transaction records contain sufficient information to replay them back in the event of a system crash.

ZFS operations are always a part of a DMU (Data Management Unit) transaction. When a DMU transaction is opened, there is also a ZIL transaction that is opened. This ZIL transaction is associated with the DMU transaction, and in most cases discarded when the DMU transaction commits. These transactions accumulate in memory until an fsync or O_DSYNC write happens in which case they are committed to stable storage. For committed DMU transactions, the ZIL transactions are discarded (from memory or stable storage).

The ZIL consists of a zil header, zil blocks, and a zil trailer. The zil header points to a list of records. Each of these log records is a variable-sized structure whose format depends on the transaction type. Each log record structure consists of a common structure of type lr_t followed by multiple structures/fields that are specific to each transaction. These log records can reside either in memory or on disk. The on-disk format is described in zil.h. ZIL records are written to disk in variable-sized blocks. The minimum block size is defined as ZIL_MIN_BLKSZ and is currently 4096 (4k) bytes. The maximum block size is defined as ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128KB). The zil block size written to disk is chosen to be either the size of all outstanding zil blocks (with a maximum of ZIL_MAX_BLKSZ) or, if there are no outstanding ZIL transactions, the size of the last zil block that was committed.

ZIL and write(2)
The ZIL behaves differently for different write sizes. For small writes, the data is stored as part of the log record. For writes greater than zfs_immediate_write_sz (64KB), the ZIL does not store a copy of the write; instead it syncs the write to disk and stores only a pointer to the synced data in the log record. We can examine the write(2) system call on ZFS using dtrace.


230  -> zfs_write                                         21684
230    -> zfs_prefault_write                              28005
230    <- zfs_prefault_write                              35446
230    -> zfs_time_stamper                                69932
230      -> zfs_time_stamper_locked                       72893
230      <- zfs_time_stamper_locked                       74813
230    <- zfs_time_stamper                                76893
230    -> zfs_log_write                                   81054
230    <- zfs_log_write                                   89855
230  <- zfs_write                                         96257
230  <= write


As you can see, there is a log entry associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes are supposed to be synchronous. For synchronous writes, the ZIL has to commit the zil transaction to stable storage before returning. For non-synchronous writes, the ZIL holds on to the transaction in memory until the DMU transaction commits or there is an fsync or an O_DSYNC write.
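The flow trace above came from a flow-indented DTrace script. A hedged sketch of one that produces similar output; the probes are real fbt/syscall probes, but the script name and the execname argument are made up (run it as ./zfswrite.d <appname>):

#!/usr/sbin/dtrace -s
#pragma D option flowindent

/* start timing when the application of interest enters write(2) */
syscall::write:entry
/execname == $$1/
{
        self->start = timestamp;
}

/* print a relative timestamp for every zfs function entered or returned from */
fbt:zfs::entry,
fbt:zfs::return
/self->start/
{
        trace(timestamp - self->start);
}

syscall::write:return
/self->start/
{
        trace(timestamp - self->start);
        self->start = 0;
}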

zil.c walk through

There are several zil functions that operate on zil records. What follows is a very brief description of their functionality.

  • zil_create() creates a dmu transaction and allocates a first log block and commits it.
  • zil_itx_create() is used to create a new zil transaction.
  • zil_itx_assign() is used to associate this intent log transaction with a dmu transaction.
  • zil_itx_clean() is used to clean up all in memory log transactions. Clearing in memory zil transactions implies that these are not flushed to disk. zil_itx_clean() is called via the zil_clean() function which dispatches a work request to a dispatch thread.
  • zil_commit() is used to commit zil transactions to stable storage.
  • zil_sync() ZIL transactions are cleaned (or deleted) in the zil_sync routine when the DMU transactions that they are assigned to are committed to disk (maybe as a result of an fsync). It is mostly called from the txg_sync_thread every txg_time (5 seconds) via this code path:
 

             zfs`dmu_objset_sync+0x6c
             zfs`dsl_pool_sync+0x108
             zfs`spa_sync+0xac
             zfs`txg_sync_thread+0x130
             unix`thread_start+0x4

ZFS Mount

During file system mount, ZFS checks to see if there is an intent log. If there is an intent log, this implies that the system crashed (as the ZIL is deleted at umount(2) time). The intent log is converted to a replay log and is replayed to bring the file system to a stable state. If both a replay log and an intent log are present, it implies that the system crashed while replaying the replay log, in which case it is OK to ignore/delete the replay log and replay the intent log.

ZIL Tunables
I am almost tempted to mention some tunables here but the truth is that ZFS is intended to not require any tuning.  ZFS should (and will) perform optimally "Out of the Box". You might find some switches in the code, but they are only for internal development and will be yanked out soon!

ZIL Performance
As you must have figured out by now, ZIL performance is critical for the performance of synchronous writes. A common application that issues synchronous writes is a database, which means that all of those writes run at the speed of the ZIL. The ZIL is already quite optimized, and I am sure ongoing efforts will optimize this code path even further. As Neil mentions, using NVRAM/solid state disks for the log would make it scream! I also recommend that you check out Roch's work on ZFS performance for details of other performance studies in progress.
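If you want to see how much time your workload spends committing to the ZIL, a hedged dtrace sketch that quantizes the time spent in zil_commit() (the probes are real fbt probes; interpret the numbers with the usual caveats):

dtrace -n 'fbt:zfs:zil_commit:entry{self->ts = timestamp;}' \
       -n 'fbt:zfs:zil_commit:return/self->ts/{@["ns in zil_commit"] = quantize(timestamp - self->ts); self->ts = 0;}'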


Dtrace scripts for use with zil

  • To see ZIL activity 
    • dtrace -n zil\*:entry'{@[probefunc]=count();}' -n zio_free_blk:entry'{@[probefunc]=count();}'
  • To see blocksize of log writes
    • dtrace -n ::zil_lwb_commit:entry'{@[1] = quantize((lwb_t*)args[2]->lwb_sz);}'

Finally

Congratulations to the ZFS team for delivering such a world class product. You folks rock!


Tuesday Nov 15, 2005

Introduction

I guess an introduction is necessary!

I am Neelakanth Nadgir and I am a part of the PA2E (Performance Architecture and Availability Organization) group. I work out of Menlo Park, CA. My professional interests include scalability, networking, filesystems, and distributed systems.

Before joining PA2E, I worked at Sun's Market Development Engineering, where I spent 4 years working on Performance tuning, Porting, Sizing, and ISV account management.

I was also involved with several open source projects. I am an active member of the JXTA community and jointly started two projects, viz. the Ezel Project and the JNGI Project. I also served as a webmaster for the GNU project for 2 years, and contributed to the Mozilla project in the past by providing SPARC binaries and miscellaneous performance fixes.

Before working at Sun, I graduated with a master's degree in Computer Science from Texas Tech University in Lubbock, TX (Go Raiders!). My thesis was on the reliability of distributed systems, where I devised a faster algorithm for calculating minimal file spanning trees. I have a bachelor's degree in Computer Science from Karnatak University, India.

My other interests include cricket and tropical aquarium fish (African cichlids in particular). My favorite fish is Pseudotropheus demasoni. My wife got me hooked on the aquarium hobby after we got married, and before I knew it, we had more than 60 fish in 6 tanks :-)

I plan to use this blog to share the knowledge that I gained from working with lots of cool people here at Sun. Keep tuned for more insights!
