fsync performance

Not the most exciting title, I know, but here goes anyway... fsync() is used to ensure that dirty pages that have been written to a file actually go down to disk. The same sort of thing can be done by opening the file with one of the O_SYNC options. fsync(), however, allows greater flexibility, since the programmer can specify when the synchronisation to disk takes place - perhaps in a separate thread. Generally fsync() is goodness - and 'cheap', since it only syncs the data that is dirty. So far so good.

However, there is (or rather was) a subtle problem that shows up when very large files are mapped into the memory of systems with reasonable amounts of memory. The problem is not to do with large-memory systems as such; it is just that you need a lot of memory to really cache a large file. The problem is that the file's pages are searched linearly, from the first page that is mapped in right through to the last, and this can take quite a lot of time. Given a big enough file and a big enough physical memory, the time taken can be measured in seconds (yes, really!). Since many developers think of fsync() as a 'free' system call, it is often called quite indiscriminately, so fsync() performance can really make a BIG difference (see the test results below for a pathological case).
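
As a quick aside, a minimal C sketch of the two approaches mentioned above - O_SYNC at open time versus an explicit fsync() - might look like this (my own illustration, not code from this post; the file name and buffer are placeholders and error handling is omitted):

#include <fcntl.h>
#include <unistd.h>

/* Every write() on an O_SYNC descriptor is synchronous. */
void sync_on_every_write(const char *buf, size_t len)
{
    int fd = open("datafile", O_WRONLY | O_SYNC);
    write(fd, buf, len);
    close(fd);
}

/* With fsync() the application chooses when to pay the cost. */
void sync_when_convenient(const char *buf, size_t len)
{
    int fd = open("datafile", O_WRONLY);
    write(fd, buf, len);      /* lands in the page cache       */
    /* ... more writes ... */
    fsync(fd);                /* flush the dirty pages now     */
    close(fd);
}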

The good news is that this behavior is changed in Solaris 9 (I think of this as the version of Solaris designed for large systems like the 15/12/25K StarCat range): all the dirty pages are put at the head of the list, so we need only search the list until we find the first non-dirty page. This is logged as CR 4336082, fixed in patch 112233-09 and later.
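
Conceptually, the change looks something like this (an illustrative sketch only, not Solaris kernel source; the page structure and the flush() stand-in are made up for the example):

#include <stddef.h>

struct page { int dirty; struct page *next; };

static void flush(struct page *p) { p->dirty = 0; }   /* stand-in for writing the page out */

/* Old behaviour: fsync() walks every page cached for the file,
 * even when almost none of them are dirty. */
static void fsync_scan_old(struct page *head)
{
    for (struct page *p = head; p != NULL; p = p->next)
        if (p->dirty)
            flush(p);
}

/* New behaviour (CR 4336082): dirty pages are kept at the head of the
 * list, so the scan stops at the first clean page. */
static void fsync_scan_new(struct page *head)
{
    for (struct page *p = head; p != NULL && p->dirty; p = p->next)
        flush(p);
}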

So how can you tell if your application is suffering from this problem? I would use truss -c against a running process and see whether fdsync()* is using an appreciable amount of CPU time. Note that this is 'real' CPU time spent examining pages, rather than 'sleeping', or time spent waiting for the actual pages to be written to disk.

In the experiments below I used Solaris 9 with the latest recommended patch set, testing the same file on UFS and on VxFS on the same system; the results are quite dramatic. VxFS at the time of writing does not incorporate the fixes that are in UFS, so it serves as a good counter-example to the speed of UFS. The test was performed on a 16GB file on a system with 16GB of RAM.
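
The write_random_fsync_loopsome program itself is not reproduced here, but in spirit it reads and then rewrites a number of randomly chosen blocks, calling fsync() after each write. A rough C sketch of that pattern (my reconstruction, not the original source; the argument handling and random block placement are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 4) {
        fprintf(stderr, "usage: %s file blocksize count\n", argv[0]);
        return 1;
    }

    size_t bs = (size_t)atol(argv[2]);
    int count = atoi(argv[3]);
    char *buf = malloc(bs);
    struct stat st;

    int fd = open(argv[1], O_RDWR);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }
    off_t nblocks = st.st_size / (off_t)bs;

    for (int i = 0; i < count; i++) {
        off_t off = (off_t)(drand48() * nblocks) * (off_t)bs;
        lseek(fd, off, SEEK_SET);
        read(fd, buf, bs);              /* read a random block            */
        lseek(fd, off, SEEK_SET);
        write(fd, buf, bs);             /* dirty it again                 */
        fsync(fd);                      /* shows up as fdsync in truss -c */
    }

    close(fd);
    free(buf);
    return 0;
}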

The initial run shows a fast time, since there are few pages mapped into the page cache. After the first run we read in the file using 'dd', making sure to keep the block size below the threshold at which VxFS does 'discovered direct I/O', since discovered direct I/O - as you would expect - does not populate the cache.
First we run the test program (read then write 100 blocks at a size of 4KB):

# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocks blocksize = 4096

real        0.91
user        0.00
sys         0.10
Then we read in the entire file in blocks of 4MB:
# dd if=./testfile of=/dev/null bs=4096k
4000+1 records in
4000+1 records out
We see that the performance has not changed, since the cache was not populated - we read the file in blocks large enough to trigger VxFS discovered direct I/O:
# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096

real        0.92
user        0.00
sys         0.07
Next we read the file in blocks of 4KB (rather than 4MB):
# dd if=./testfile of=/dev/null bs=4096
4096001+0 records in
4096001+0 records out
Now we see the problem - note that we are still only writing 100 blocks of 4KB, just as before. Note also that the time is accurately attributed to sys, as expected:
#  timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096

real     6:47.77
user        0.00
sys      6:45.85
Using truss -c we can see where the time has gone.
# truss -c /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile
The size of the file is -402649088 bytes -98303 application blocks blocksize = 4096
syscall               seconds   calls  errors
_exit                    .000       1
read                     .012     101
write                    .018     106
open                     .000       5      1
close                    .000       4
brk                      .000       2
stat                     .000       4
lseek                    .007     202
fstat                    .000       3
ioctl                    .000       1
fdsync                403.329     101  <--- Here's where all the time went, in fsync() as expected
execve                   .000       1
getcontext               .000       1
evsys                    .000       1
evtrapret                .000       1
mmap                     .000      11
munmap                   .000       1
getrlimit                .000       1
memcntl                  .000       1
resolvepath              .000       5
                     --------  ------   ----
sys totals:           403.371     553      1
usr time:                .006
elapsed:              404.210
In the truss -c output above, each fdsync() takes around 4 SECONDS to complete (403.329 seconds across 101 calls).

Now the same test on UFS, having used 'dd' to read the file into the cache:
# dd if=testfile of=/dev/null bs=4096

4096001+0 records in
4096001+0 records out

# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096

real        1.33
user        0.00
sys         0.06
#  truss -c  /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open  testfile blocksize = 4096
syscall               seconds   calls  errors
_exit                    .000       1
read                     .005     101
write                    .006     106
open                     .000       5      1
close                    .000       4
brk                      .000       2
stat                     .000       4
lseek                    .006     202
fstat                    .000       3
ioctl                    .000       1
fdsync                   .094     101
execve                   .000       1
getcontext               .000       1
evsys                    .000       1
evtrapret                .000       1
mmap                     .000      11
munmap                   .000       1
getrlimit                .000       1
memcntl                  .000       1
resolvepath              .000       5
                     --------  ------   ----
sys totals:              .115     553      1
usr time:                .005
elapsed:                1.370
The fsync() calls above are still super-quick, despite a lot of the file being cached in RAM.

*fsync() is the call made from C, but it shows up in truss as fdsync().

I tried reducing discovered_direct_iosz to 2k, and it seems to follow that if you allow VxFS to do direct I/O then the fsync issue is not hit. However, on the read side you will not get any benefit from caching, whereas on UFS you get both cached data and fast fsyncs.

Comments:

Do you have the script/code that you used for the write_random_fsync_loopsome? We are having a similar issue on Solaris 9 running off of VxFS file systems:

syscall               seconds   calls  errors
read                     .023     572
write                    .190    9584
open                     .001      46      34
close                    .002     216     198
time                     .003     294
times                    .002     173
ioctl                    .001      63
fdsync                 34.900     322
fcntl                    .000       5
poll                     .023     519
priocntlsys              .003     190
mmap                     .000       4
munmap                   .000       4
writev                   .007      83
lwp_suspend              .000      14
lwp_continue             .001      14
lwp_self                 .000       2
lwp_mutex_wakeup         .027    1728
lwp_mutex_lock           .042    2262
lwp_cond_wait            .029     952     308
lwp_cond_signal          .000       2
lwp_cond_broadcast       .012     645
llseek                   .013     775
lwp_schedctl             .000       2
resolvepath              .000       4
stat64                   .009     204     160
fstat64                  .000      13
recv                     .005     234     136
send                     .017     240
                     --------  ------   ----
sys totals:            35.320   19166     836
usr time:               3.043
elapsed:               22.110

Posted by Andy on November 05, 2006 at 02:37 PM GMT #

Interesting article. I have a client who I think is suffering from this problem -- do you have the source for your write_random_fsync_loopsome test that I could get to try out on their system?

Posted by Philip Peake on November 16, 2006 at 08:13 AM GMT #
