Catching disk latency in the act

Today, Brendan made a very interesting discovery about the potential sources of disk latency in the datacenter. Here's a video we made of Brendan explaining (and demonstrating) his discovery:



This may seem silly, but it's not farfetched: Brendan actually made this discovery while exploring drive latency that he had seen in a lab machine due to a missing screw on a drive bracket. (!) Brendan has more details on the discovery, demonstrating how he used the Fishworks analytics to understand and visualize it.

If this has piqued your curiosity about the nature of disk mechanics, I encourage you to read Jon Elerath's excellent ACM Queue article, Hard disk drives: the good, the bad and the ugly! As Jon notes, noise is a known cause of what is called a non-repeatable runout (NRRO) -- though it's unclear if Brendan's shouting is exactly the kind of noise-induced NRRO that Jon had in mind...

Comments:

Well, it was bound to happen sometime: Bryan's incredible energy and hyper-speed spoken output have finally driven his co-workers crazy. ;)

That's interesting--exactly how much performance degradation were you seeing? And I mean in transactions or megabytes per second, not time lost due to co-workers rolling around the floor laughing...

Posted by Derek on December 31, 2008 at 11:06 AM PST #

Derek,

Check out Brendan's blog entry where he has the graphs posted; the hit to throughput is tremendous. (Throughput drops from ~1 GB/sec to practically nothing while Brendan is shouting.) The next experiment is obviously to take the biggest amp we can find and see if sustained loud noise will induce sustained high latency and low bandwidth. Another question we don't know the answer to: is this due to frequency or volume or some combination? Science demands answers! I'm only half kidding, as there is one question which is legitimately on my mind: can the high noise levels found in most data centers be potentially responsible for NRROs? And can high noise levels shorten drive life? If so, are there ways to configure a datacenter such that this issue is either exacerbated or eliminated? Or is there just something magical about Brendan's primal scream?

Posted by Bryan Cantrill on December 31, 2008 at 11:40 AM PST #

Interesting... I wonder if shouting at a single disk would result in as dramatic a drop in performance? (Now I'm going to have to try that.) Also, it looks like Brendan actually touches the drive brackets when he's shouting. There's another whole branch of research right there!

>>can the high noise levels found in most data centers be potentially responsible for NRROs?

There's a thought, although I'd guess (as a complete failure at high school physics) that Brendan's screaming is more intense and focused than the overall hum in a datacenter. The drives are already stabilized to some degree by way of the bracket and the chassis, so how much further can one stabilize a drive without burying it in bricks?

Do you folks ever sleep? ;) Keep us posted if you discover anything and have a great (if hoarse) new year.

Posted by Derek on December 31, 2008 at 01:21 PM PST #

As a corollary, can you increase the performance of your JBODs by playing them some gentle Mozart piano sonatas? ;-) Happy New Year!

Posted by Kevin Hutchinson on December 31, 2008 at 02:58 PM PST #

That's awesome! :)

Posted by benr on December 31, 2008 at 03:08 PM PST #

So does this mean I should go out and buy some noise cancellation headphones for our storage?

A more serious question: What would be the effect of vibrations of a datacentre next to a major highway or railway when traffic shakes the datacentre/racks? Or when a bus, truck or car hits a manhole cover next to the datacentre?

Posted by Karl Rossing on December 31, 2008 at 03:25 PM PST #

Wow, I think why this amazes me so much is that it makes sense when you think about it, but to actually have the instruments to measure it... wow. Nice work guys.

Posted by Jacob Becker on December 31, 2008 at 04:02 PM PST #

Karl ... depends on the underground ... there is a reason why chip fabs can't be built everywhere ... there are examples of defect numbers correlated with the time of day, because of an urban train near a fab.

I would assume that you could measure the effect in hard disks as well. But Brendan's scream seems to be very effective.

Maybe three hard disks make a formidable seismometer ;)

Posted by Joerg M. on December 31, 2008 at 10:10 PM PST #

Now that you have found this on a JBOD, have you tested it on other types of arrays? If so, which ones?

Just a suggestion:

1. Use a dB meter to measure your scream.
2. Use a dB meter to measure the sound of your systems.

Does the location of the storage device matter, for example whether it sits between other servers or between other storage devices?

Does placing storage near other noisy datacenter infrastructure, such as diesel generators, impact it the way you have shown here?

It sounds to me like we need to keep storage devices away from exterior vibrations that could cause data loss. So the question is: should we be thinking about how we lay out our datacenters when exterior noise can cause data disruption?

Posted by Phillip Bruce on January 01, 2009 at 04:25 AM PST #

> A more serious question: What would be the effect of vibrations of a datacentre next to a
> major highway or railway when traffic shakes the datacentre/racks? Or when a bus, truck or
> car hits a manhole cover next to the datacentre?

I don't think it's a concern, for two reasons: the vibration won't be focused along one axis like it was in this case, and, more importantly, the energy level of the vibration experienced by the drive would be a *lot* less, particularly at the frequencies that are likely to cause problems.

I'll let you in on a secret - the Fishworks lab is directly on the corner of two busy streets, with bus stops below. It also has a large bus terminus on the opposite side of the street! It's not an issue.

> It sounds to me like we need to keep storage devices away from exterior vibrations
> that could cause data loss. So the question is: should we be thinking about how we lay out
> our datacenters when exterior noise can cause data disruption?

Keep in perspective the amount of energy and frequency Brendan was directing (with cupped hands touching the drive) versus the energy level experienced in any typical building due to vibration - it's not a problem unless your disk drive is in front of the speaker stack at a Van Halen concert.

Posted by Greg Price on January 01, 2009 at 05:17 AM PST #

Can you make the video available for direct download? I'd like to show coworkers, but YouTube is blocked.

Thanks!

Posted by Greg on January 01, 2009 at 05:28 AM PST #

Yes, a Van Halen or any hard rock concert certainly could pose a similar problem.
Even Jimi Hendrix's "Purple Haze" will send some good vibrations. :)

Military environments would certainly need to understand that kind of impact.

Besides, datacenters are noisy enough, and the video is proof of that, even without the extra screaming at your disks. :)

Also, if you do download that video you need a VLC-compliant player. You can get one from http://download.videolan.org/pub/videolan/vlc/0.9.8a/vlc-0.9.8a.tar.bz2

Then you can use the YouTube Video Download Tool to get the video.

http://www.downloadyoutubevideos.com/

Phillip

Posted by Phillip Bruce on January 01, 2009 at 07:46 AM PST #

Does it mean the existing noise in the data center already causes some disk latency? Or am I asking a stupid question?

Posted by Saravanan on January 01, 2009 at 02:07 PM PST #

Derek, there is a throughput drop for one second - but that's for the disk subsystem from ZFS, not the delivered performance over NFS. Since this is a heavy streaming write test, ZFS is asynchronously flushing data from DRAM to disk, but the clients don't wait for that to complete. So whether that takes longer may not affect the client application performance at all (it can a little in this case, as it is a constant streaming write.) As for synchronous writes - the 7000 series supports Logzilla, which is SSD and should be immune to vibration (I assume - I've never shouted at an SSD to find out. :)

It's also worth noting that we believe that disks are more vulnerable to vibration during writes than reads, since for writes the disk must write the data properly - for reads the data must just pass the sector CRC.

We doubt that data center noise can cause this - the video is shot in a very noisy data center such that I needed to shout the entire time! And we never notice the tell-tale outlier disk latency caused by vibration just from our data center alone (even when the blade server in the neighboring rack is doing POST, which sounds like a jet aircraft.) We only think this happens if you cup your hands to disks and shout very loudly, as they are doing a heavy write workload.

Still, I'd rather have Analytics to confirm if vibration is an issue or not - which is what the video is about. People may have extreme circumstances where vibration is an issue, but lack the tools to identify it.

Posted by Brendan Gregg on January 01, 2009 at 03:30 PM PST #
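
As a rough illustration of what "tell-tale outlier disk latency" means, here is a minimal Python sketch that flags I/Os whose latency is far above the median. This is not how Analytics works; the function name, the factor of 10 and the sample numbers are all made up for the example.

```python
import statistics


def latency_outliers(latencies_ms, factor=10.0):
    """Return (index, latency) pairs whose latency exceeds factor * median.

    `factor` is an arbitrary, hypothetical cut-off; tune it for real data.
    """
    if not latencies_ms:
        return []
    median = statistics.median(latencies_ms)
    threshold = factor * median
    return [(i, ms) for i, ms in enumerate(latencies_ms) if ms > threshold]


if __name__ == "__main__":
    # Invented per-I/O latencies in milliseconds, with one obvious spike.
    samples = [5, 6, 5, 7, 6, 520, 5, 6]
    print(latency_outliers(samples))
```

A median-based threshold is used here because a handful of very slow I/Os would drag an average upward and hide themselves.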

Brendan,

I would think SSD would never be an issue with this because it has no moving parts. Still, you can have I/O so heavy that caching no longer makes sense to use. I have seen folks turn OFF caching simply because of this.

So with caching turned on, I would think you get the vibration issue regardless, versus having a drive that is ENTIRELY SSD, where vibration issues should never occur.

The only thing you have to worry about with vibrations is how well secured the memory is in the drive unit itself. Why? It would depend on the position of the memory SIMMs in the drives. Example: memory placed flat, as on motherboards, versus placed vertically in a daughterboard configuration. Or a configuration with no sockets at all, with everything soldered to the motherboard, might be a better solution. Most SSDs are still using 200- to 240-pin DIMM sockets. Simply put, if the modules are not secure enough, for instance because a tech didn't lock them down, that can be a reason for memory errors to occur in other, NON-data center environments.

I tend to agree with you about the testing. I don't know what I/O tool you're using to generate the I/O; it could be dd, bonnie, Medusa tools, iometer, vdbench or others that are available. Another tool could better test the drive and show whether you get the same results.

Posted by Phillip Bruce on January 02, 2009 at 04:41 AM PST #

Try generating a simple sine wave into a .wav file and playing that out of your laptop into an iPod boombox... it will save Brendan's voice, and permit more reproducible experiments. I'm curious as to which frequencies cause the problem; the acceleration due to sound waves is clearly preventing the heads from settling...

- Bart

Posted by Bart on January 02, 2009 at 12:16 PM PST #
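
A minimal sketch of Bart's suggestion, assuming Python and only its standard library: it writes a pure sine tone to a 16-bit mono .wav file that can then be played through any speaker. The function name, file names, amplitude and frequencies are illustrative choices, not anything used in the original experiment.

```python
import math
import struct
import wave


def write_sine_wav(path, freq_hz=440.0, seconds=2.0, rate=44100, amplitude=0.8):
    """Write a mono 16-bit PCM sine tone to `path`; return the frame count."""
    n_frames = int(seconds * rate)
    peak = int(amplitude * 32767)  # scale to the signed 16-bit sample range
    frames = bytearray()
    for i in range(n_frames):
        sample = int(peak * math.sin(2.0 * math.pi * freq_hz * i / rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    # Hypothetical sweep: one file per test frequency, to be played back in turn.
    for freq in (100, 500, 1000, 5000):
        write_sine_wav("tone_%dhz.wav" % freq, freq_hz=freq)
```

Playing one file per frequency while watching disk latency would separate the effect of frequency from that of volume, which the shouting experiment cannot do.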

I've seen prototype disk arrays where the disks next to the fan had worse performance than the other disks due to vibration issues, too. Took a while to figure out the details there, a pity we didn't have Fishworks then.

Posted by David Carlton on January 02, 2009 at 01:25 PM PST #

Does apologizing and offering it flowers and RAM fix the problem?

Posted by Liz on January 05, 2009 at 11:16 PM PST #
