As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:
#define SIZE 1024
int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
Running this gave the following results:
Time per iteration 0.000065606310 MB/s
Time per iteration 2.709711563906 MB/s
Time per iteration 0.178590114758 MB/s
Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second; which is about 65 - pretty much what I'd expect for this system.
It's also interesting to examine at the profiles for the two cases. When the write() was trapping into the OS the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication how to interpret profiles from apps doing I/O. It's the sleep time that indicates disk activity.