I've been studying the popular Bonnie++ load generator to see if it was a suitable benchmark to use with Network attached storage such as Sun Storage 7000
line. At this stage I've looked at the single client runs, and it doesn't appear that Bonnie++ is an appropriate tool in this environment because as we'll see here, for many of the tests, it either stresses the networking environment or the strength of client side cpu.
The first interesting thing to note is that Bonnie will work on a data set that is double the client's memory. This does address some of the client side caching concern one could otherwise have. In a NAS environment the amount of memory present on the server is not considered by a default bonnie++ run. My client had 4GB leading to a working set was then 8GB while the server had 128GB of memory. The Bonnie++'s output looks like :
Writing with putc()...done Method
Reading with getc()...done
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03d ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
v2c01 8G 81160 92 109588 38 89987 67 69763 88 113613 36 2636 67
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 687 10 +++++ +++ 1517 9 647 10 +++++ +++ 1569 8
I have used a combination of Solaris truss(1), reading Bonnie++ code, looking at AmberRoad's Analytics
data , as well as a custom Bonnie d-script
in order to understand how each test triggered system calls on the client and how those translated into a NAS server load. In the d-script, I characterise the system calls by the average elapse time as well as by the time spent waiting for a response from the NAS server. The time spent waiting is the operational latency that one should be interested in when characterising a NAS, while the additional time relates to the client CPU strength along with the client NFS implementation. Here is what I found trying to explain how performant each test was. Writing with putc()
So easy enough, that test creates a file using single character putc stdio library call.
This test is clearly a client CPU test with most of the time spent in user space running putc(). Every 8192 putc, stdio library will issue a write(2)
system call. That syscall is still a client CPU test since the data is absorbed on the client cache. What we test here is the client single CPU performance and the client NFS implementation. On a 2 CPU/ 4GB V20z running Solaris, we observed on the server using analytics a network transfer rate of 87 MB/sec.
Results : 87 MB/sec of writes. Limited by single CPU speed. Writing intelligently...done
Here it's more clever since it writes a file using sequential 8K write system calls.
In this test the CPU is much relieved. So here the application is running 8K write system call to client NFS. This is absorbed by memory on the client. With an Opensolaris client, no over the wire request is sent for such an 8K write. However after 4 such 8K writes we reach the natural 32K chunk advertised by the server and that will cause the client to asynchronously issue a write request to the server. The asynchronous nature means that this will not cause the application to wait for the response and the test will keep going on CPU. The process will now race ahead generating more 8K writes and 32K asynchronous NFS requests. If we manage to generate such request at a greater rate than responses, we will consume all allocated aysnchronous threads. On Solaris this maps to nfs4_max_threads (8) threads. When all 8 asynchronous threads are waiting for a response, then the application will finally block waiting for a previously issued request to get a response and free an async thread.
Since generating 8K write systems to fill the client cache is faster than the network connection between the client and the server we will eventually reach this point. The steady state of this test is that Bonnie++ is waiting for data to transfer to the server. This happens at the speed of a single NFS connection which for us saturated the 1Gbps link we had. We observed 113MB/sec which is network line rate considering protocol overheads.
To get more through on this test, one could use Jumbo Frame ethernet instead of the 1500 Byte default frame size used as this would reduce the protocol overhead slightly. One could also configure the server and client to use 10Gbps ethernet links.
One could also use LACP link aggregation of 1Gbps network ports to increase the throughput. LACP increases throughput of multiple network connections but not single socket protocol. By default a Solaris client will establish a single connection (clnt_max_conns = 1) to a server (1 connections per target IP). So using multiple aggregated links _and_ tuning clnt_max_conns could yield extra throughput here.
Using single connection one could use a faster network between client and server links to reach additional throughput.
More commonly, we expect to saturate the client 1Gbps connectivity here, not much of a stress for a Sun Storage 7000 server.
Results : 113 MB/sec of writes. Network limited. Rewriting...done
This gets a little interesting. It actually reads 8K, lseek back to the start of the block, overwrites the 8K with new data and loops.
So here we read, lseek back, overwrite . For the NFS protocol lseek is a noop since every over the wire write is tagged with the target offset. In this test we are effectively stream reading the file from the server and stream writing the file back to the server. The stream write behavior will be much like the previous test. We never need to block the process unless we consume all 8 asynchronous threads. Similarly 8K sequential reads will be recognised by our client NFS as streaming access which will deploy asynchronous readahead requests. We will use 4 (nfs4_nra) request for 32K blocks ahead of the point being currently read. What we observed here was that of 88 second of elapse time, 15 was spent in write and 20 in reads. However a small portion of that was spent waiting for response. It was mostly all spent on CPU time to interact with the client NFS. This implies that readhead and asynchronous writeback was behaving without becoming bottlenecks. The Bonnie++ process took 50 sec of the 88 sec and a big chunk of this, 27 sec, was spent waiting off cpu. I struggle somewhat in this interpretation but I do know from the Analytics data on the server that the network is seeing 100 MB/sec of data flowing in each direction. This must also be close to network saturation. The wait time attributed to Bonnie++ in this test seems be related to kernel preemption. As Bonnie++ is coming out of its system calls we see such events in dtrace.
This must be to service the kernel threads of higher priority, likely the asynchronous threads being spawned by the reads and writes.
This test is then a stress test of bidirectional flow of 32K data transfers. Just like the previous test, to improve the numbers one would need to improve the network connection throughput between the client and server. It also potentially could then benefit from faster and more client CPUs.
Results : 100MB/sec in each direction, network limited. Reading with getc()...done
Reads the file one character at a time.
Back to a test of the client CPU much like the first one. We see that the readahead are working great since little time is spent waiting (0.4 of 114 seconds). Given that this test does 1 million reads in 114 seconds, the average latency could be evaluated to be 114 usec.
Results : 73MB/sec, single CPU limited on the client. Reading intelligently...done
Reads with 8k system calls, sequential.
This test seems to be using 3 spawned bonnie process to read files. The reads are of size 8K and we needed 1M of them to read our 8GB working set. We observed with analytics no I/O done on the server since it had 128GB of cache available to it. The network on the other hand is saturated at 118 MB/sec.
The dtrace script shows that the 1M read calls collectively spend 64 seconds waiting (most of that NFS response). So that implies a 64 usec read response time for this sequential workload.
Results : 118MB/sec, limited by Network environment. start 'em...done...done...done...
Here is seems that Bonnie starts 3 helper processes used to read the files in the "Reading Intelligently" test. Create files in sequential order...done.
Here we see 16K files being created (with creat(2)
) then closed.
This test will create and close 16K files and took 22 seconds in our environment. 19 seconds were used for the creates, 17.5 waiting for responses. That means a 1ms response time for file creates. The test seems single threaded. Using analytics we observe 13500 NFS ops per second to handle those file create. We do see some activity on the Write bias SSD although very modest at 2.64MB /sec. Given that the test is single threaded we can't estimate if this metric is representative of the NAS server capability. More likely this is representative the single thread capability of the whole environment made of : client CPU, client NFS implementation, client network driver and configuration, network envinronment including switches, and the NAS server.
Results : 744 filecreate per second per thread. Limited by operational latency.
Here is the analytics view captured for the this tests and the following 5 tests. Stat files in sequential order...done.
Test was too elusive possibly working against cached stat information. Delete files in sequential order...done.
Here we unlink(2)
the 16K files.
Here we call the unlink system call for the 16K files. The run takes 10.294 seconds showing a 1591 unlink per second. Each call goes off cpu, waiting for a server response for 600 usec.
Much like the create file test above, while we get information about the single threaded unlink time present in the environment it's obviously not representative of the server's capabilities.'
Results : 1591 unlink per second per thread, Limited by operational latency. Create files in random order...done.
We recreate 16K files, closing each one but also running a stat() system call on each. Stat files in random order...done.
Elusive as above. Delete files in random order...done.
We remove the 16K files.
I could not discern in the "random order" test any meaninful differences to the sequential order ones. Analytics screenshot of Bonnie++ run
Here is the full screen shot from analytics including Disk and CPU data
The takeway here is that single instance bonnie++ does not generally stress one Sun Storage 7000 NAS server but will stress the client CPU and 1Gbps network connectivity. There is no multi-client support in Bonnie++ (that I could find).
One can certainly start multiple clients simultaneously, but since the different tests would not be synchronized the output of bonnie++ would be very questionable. Bonnie++ does have a multi-instance synchronisation mode that is based on semaphore which will only work if all instances are running within the same OS environment.
So in a multi client test, Only the total elapsed time would be of interest here and that would be dominated by the streaming performance as each client would read and write its working set 3 times over the wire. Filecreate and unlink times would also contribute to the total elapsed time of such a test.
For a single node multi-instance bonnie++ run, one would need to have a large client, with at least 16 x 2Ghz CPUS, and about 10Gbps worth of network capabilities in order to properly test one Sun Storage 7410 server. Otherwise, Bonnie++ is more likely to show client and network limits, not server ones. As for unlink capabilities, the topic is a pretty complex and important one that certainly cannot be captured with simple commands. The interaction with snapshots and the I/O load generated on the server during large unlink storms needs to be studied carefully in order to understand the competitive merits of different solutions.
In Summary, here is what governs the performance of the individual Bonnie++ tests :
|Writing with putc()... ||87 MB/sec ||Limited by client's single CPU speed |
|Writing intelligently...||113 MB/sec ||Limited by Network conditions |
|Rewriting...||100MB/sec ||Limited by Network conditions |
|Reading with getc()...||73MB/sec ||Limited by client's single CPU speed |
|Reading intelligently...||118MB/sec ||Limited by Network conditions |
|start 'em...done...done...done... |
|Create files in sequential order...||744 create/s||Limited by operational latency |
|Stat files in sequential order...||not observable |
|Delete files in sequential order...||1591 unlink/s||Limited by operational latency |
|Create files in random order...||same as sequential |
|Stat files in random order...||same as sequential |
|Delete files in random order...||same as sequential |
So Bonnie++ won't tell you much about our server's capabilities. Unfortunately, the clustered mode of Bonnie++ won't coordinate multiple clients systems and so cannot be used to stress a server. Bonnie++ could be used to stress a NAS server using a single large multi-core client with very strong networking capabilities. In the end though I don't expect to learn much about our servers over and above what is already known. For that please check out our links here : Low Level Performance of Sun Storage Analyzing the Sun Storage 7000 Designing Performance Metrics... Sun Storage 7xxx Performance Invariants
Here is the bonnie.d
d-script used and the output generated bonnie.out