By user13278091 on nov. 10, 2008
The SFS benchmark is largely about "cache busting" the server. That is interesting, but at Sun we think that caches are actually helpful in real scenarios. Data goes through cycles in which it becomes hot at times. Retaining that data in cache layers allows much lower latency access and much better human interaction with storage engines. Being a cache-busting benchmark, SFS numbers end up as a measure of the number of disk rotations attached to the NAS server. So a good SFS result requires hundreds or thousands of expensive, energy-hungry 15K RPM spindles. To get good IOPS, layers of caching are more important to the end-user experience and cost efficiency of the solution.
So we needed another way to talk about performance. Benchmarks tend to test the system in peculiar ways that do not necessarily reflect the workloads each customer is actually facing. There are very many workload generators for I/O, but one interesting one that is open source and extensible is Filebench, available in source.
So we used Filebench to gather basic performance information about our system, with the hope that customers will then use Filebench to generate profiles that map to their own workloads. That way, different storage options can be evaluated with hopefully more meaningful tests than benchmarks.
Another challenge is that a NAS server interacts with client systems that themselves keep a cache of the data. Given that we wanted to understand the back-end storage, we had to set up the tests to avoid client-side caching as much as possible. So, for instance, between the phase of file creation and the phase of actually running the tests, we needed to clear the client caches and at times the server caches as well. These possibilities are not readily accessible with the simplest load generators, and we had to do this in a rather ad hoc fashion. One validation of our runs was to ensure that the amount of data transferred over the wire, as observed with Analytics, was compatible with the aggregate throughput measured at the clients.
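The wire-versus-client validation described above can be sketched in a few lines of Python. This is only an illustration, not the tooling we used: the function name, the units (MB/s) and the 10% tolerance are all assumptions chosen for the example.

```python
def wire_matches_clients(client_mb_per_s, wire_mb_per_s, tolerance=0.10):
    """Check that server-side wire throughput agrees with the clients' sum.

    client_mb_per_s: per-client throughput as reported by the load generator.
    wire_mb_per_s:   throughput observed on the server's network interfaces.
    tolerance:       allowed relative difference (hypothetical default of 10%).
    """
    aggregate = sum(client_mb_per_s)
    if aggregate == 0:
        # Nothing reported by the clients: only consistent if the wire is idle.
        return wire_mb_per_s == 0
    return abs(wire_mb_per_s - aggregate) / aggregate <= tolerance


# Clients report 305 MB/s in total and the wire shows 300 MB/s: consistent.
print(wire_matches_clients([100, 110, 95], 300))
# The wire shows only 150 MB/s: the clients are probably hitting their caches.
print(wire_matches_clients([100, 110, 95], 150))
```

A large gap in which the clients report more than the wire carried is the typical signature of reads being served from the client cache rather than from the server.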
Still another challenge was that we needed to test a storage system designed to interact with a large number of clients. Again, load generators are not readily set up to coordinate multiple clients and gather global metrics. During the course of the effort, Filebench did come up with a clustered mode of operation, but we were already too far along our path to take advantage of it.
This coordination of clients is important because the performance information we want to report is the one that is actually delivered to the clients. Each client reports its own value for a given test and our tool sums up the numbers, but such a sum is only valid inasmuch as the tests ran on the clients in the same timeframe. The possibility of skew between tests is something that needs to be monitored by the person running the investigation.
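To make the skew concern concrete, here is a minimal Python sketch of summing per-client results while checking how much the runs overlapped in time. It is not our actual tool: the `ClientRun` record, the 90% overlap threshold and the assumption that client clocks are synchronized (for example through NTP) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClientRun:
    name: str
    start: float  # seconds, from clocks assumed to be synchronized
    end: float
    value: float  # per-client result, e.g. MB/s or IOPS


def aggregate(runs, min_overlap=0.9):
    """Return (total, overlap_fraction, valid) for one test across clients.

    overlap_fraction is the window common to all clients divided by the
    longest individual run; the sum is trusted only above min_overlap.
    """
    common = min(r.end for r in runs) - max(r.start for r in runs)
    longest = max(r.end - r.start for r in runs)
    overlap = max(common, 0.0) / longest if longest > 0 else 0.0
    total = sum(r.value for r in runs)
    return total, overlap, overlap >= min_overlap


runs = [ClientRun("c1", 0.0, 30.0, 100.0), ClientRun("c2", 1.0, 31.0, 120.0)]
total, overlap, valid = aggregate(runs)
print(total, round(overlap, 3), valid)
```

With a one-second offset on a 30-second test the clients overlap for 29 seconds, so the sum is accepted; if the second client had started ten seconds late, the same sum would be flagged as skewed.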
One way that we improved this coordination was to divide our tests into 2 categories: those that required precreated files, and those that created files during the timed portion of the runs. If not handled properly, file creation would cause significant skew in the results. The option we pursued here was to have a file pre-creation phase that was done once. From that point, our full set of metrics could be run and repeated many times with much less human monitoring, leading to better reproducibility of results.
Another goal of this effort was to be able to run our standard set of metrics in a relatively short time, say less than 1 hour. In the end we got that down to about 30 minutes per run to gather 10 metrics. Keeping the run short is important because there are lots of ways that such tests can be misrun. Having someone watch over the runs is critical to the value of the output and to its reproducibility. So after having run the pre-creation of files offline, one could run many repeated instances of the tests, validating the runs with Analytics and, through general observation of the system, gaining some insight into the meaning of the output.
At this point we were ready to define our metrics.
Obviously we needed streaming reads and writes. We needed random reads. We needed small synchronous writes, which are important to database workloads and to the NFS protocol. Finally, small file creation and stat operations completed the mix. For random reads we also needed to distinguish between operating from disks and from storage-side caches, an important aspect of our architecture.
Another thing that was on my mind was that this is not a benchmark. That means we would not be trying to fine-tune the metrics in order to find out exactly which number of threads and which request size lead to the best possible performance from the server. This is not the way your workload is set up. The number of client threads you run is not elastic at will. Your workload is what it is (threading included); the question is how fast it is being serviced.
So we defined precise per-client workloads with a preset number of threads running the operations. We came up with this set just as an illustration of what could be representative loads:
1- 1 thread streaming reads from 20G uncached set, 30 sec.
2- 1 thread streaming reads from same set, 30 sec.
3- 20 threads streaming reads from 20G uncached set, 30 sec.
4- 10 threads streaming reads from same set, 30 sec.
5- 20 threads 8K random read from 20G uncached set, 30 sec.
6- 128 threads 8K random read from same set, 30 sec.
7- 1 thread streaming write, 120 sec.
8- 20 threads streaming write, 120 sec.
9- 128 threads 8K synchronous writes to 20G set, 120 sec.
10- 20 threads metadata (fstat) IOPS from pool of 400k files, 120 sec.
11- 8 threads 8K file create IOPS, 120 sec.
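As a rough sketch, the per-client test list above can be written down as data, which makes it easy to check how long the timed portions take. The `Test` record and the helper name are hypothetical; only the thread counts, operations and durations come from the list.

```python
from collections import namedtuple

# Hypothetical record type; the values are copied from the list above.
Test = namedtuple("Test", "number threads operation seconds")

TESTS = [
    Test(1, 1, "streaming read, 20G uncached set", 30),
    Test(2, 1, "streaming read, same set", 30),
    Test(3, 20, "streaming read, 20G uncached set", 30),
    Test(4, 10, "streaming read, same set", 30),
    Test(5, 20, "8K random read, 20G uncached set", 30),
    Test(6, 128, "8K random read, same set", 30),
    Test(7, 1, "streaming write", 120),
    Test(8, 20, "streaming write", 120),
    Test(9, 128, "8K synchronous write, 20G set", 120),
    Test(10, 20, "metadata (fstat), 400k files", 120),
    Test(11, 8, "8K file create", 120),
]


def timed_seconds(tests, skip=()):
    """Sum the timed portions, optionally leaving out some test numbers."""
    return sum(t.seconds for t in tests if t.number not in skip)


print(timed_seconds(TESTS))             # 780 seconds, i.e. 13 minutes
print(timed_seconds(TESTS, skip={10}))  # 660 seconds without the fstat test
```

The timed portions alone add up to 13 minutes; whatever time is spent between tests, for instance clearing caches, comes on top of that.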
For each of the 11 metrics, we could propose a mapping to relevant industries:
1- Backups, Database restoration (source), Data Mining, HPC
2- Financial/Risk Analysis, Video Editing, HPC
3- Media Streaming, HPC
4- Video Editing
5- DB consolidation, Mailserver, generic fileserving, Software Development
6- DB consolidation, Mailserver, generic fileserving, Software Development
7- User data Restore (destination)
8- Financial/Risk Analysis, backup server
9- Database/OLTP
10- Web 2.0, Mailserver/Mailstore, Software Development
11- Web 2.0, Mailserver/Mailstore, Software Development
We managed to get all these tests running except the fstat one (test 10), due to a technicality in Filebench. Filebench insisted on creating the files up front, and this test required thousands of them; moreover, Filebench used a method that ended up single threaded to do so, and in the end the stat information was mostly cached on the client. While we could have plowed through some of the issues, the conjunction of all these made us put the fstat test to the side for now.
Concerning thread counts, we figured that a single-stream read test was at times critical (for administrative purposes) and an interesting measure of latency. Tests 1 and 2 were defined this way, with test 1 starting with cold client and server caches and test 2 continuing the runs after having cleared the client cache (but not the server cache), thus showing the boost from server-side caching. Tests 3 and 4 are similarly defined with more threads involved, for instance to mimic a media server. Tests 5 and 6 did random reads, again with test 5 starting with a cold server cache and test 6 continuing with some of the data precached from test 5. Here, we did have to deal with client caches, trying to ensure that we did not hit in the client cache too much as the run progressed. Tests 7 and 8 showcased streaming writes for single and 20 streams (per client). Reproducibility of tests 7 and 8 is more difficult, we believe, because of client-side fsflush issues. We found that we could get more stable results by tuning fsflush on the clients. Test 9 is the all-important synchronous write case (for instance a database). This test truly showcases the benefit of our write-side SSD and also shows why tuning the recordsize to match ZFS records with DB accesses is important. Test 10 was inoperative as mentioned above, and test 11, file create, completes the set.
Given that those were predefined test definitions, we're very happy to see that our numbers actually came out really well with these tests, particularly for the mirrored configs with write-optimized SSDs. See for instance the results obtained by Amitabha Banerjee.
I should add that these can now be used to give a ballpark estimate of the capability of the servers. They were not designed to deliver the topmost numbers from any one config. The variability of the runs is at times greater than we'd wish, and so your mileage will vary. Using Analytics to observe the running system can be quite informative and a nice way to actually demo that capability. So use the output with caution and use your own judgment when it comes to performance issues.