Intro to Mean Time Between Service (MTBS) Analysis

Mean Time Between Service (MTBS)

Systems that include redundancy can often have very high reliability and availability. When redundant systems reach very high reliability and availability, such as reliability measured in decades or availability above 99%, we need to look at other measurements to differentiate design choices. One such measurement is the Mean Time Between Service (MTBS). For example, a 2-way mirror RAID system will offer data redundancy, a high Mean Time Between System Interruption (MTBSI), and good availability. Adding a hot spare disk will decrease the Mean Time To Repair (MTTR) and thus improve MTBSI and availability. But what about adding two hot spare disks? The answer is intuitively obvious: more spare disks mean that the system should be able to survive even longer in the face of failures. In general, adding more hot spares is better. When comparing two different RAID configurations, it can be helpful to know the MTBS as well as the other Reliability, Availability, and Serviceability (RAS) metrics.

We define MTBS as the mean time between failures of a system which would lead to a service call. For a non-redundant system, a service call would occur immediately after a system interruption. For redundant systems, a service call may be scheduled to replace a failed component, even if an interruption did not occur. Intuitively, if we can defer service calls, then the system is easier to manage. In the RAID example, more hot spares mean that we can defer the scheduled replacement service in the event that one disk fails. The more hot spares we have, the longer we can defer the scheduled replacement service.

Mathematical Details

To calculate MTBS we need to define a fixed time, T, as an interval for the analysis. At Sun, we often use 5 years for disk drives.

T = 5 years = 43,800 hours

Next, we need to calculate the reliability of the device over that period of time. For the general case, we use an exponential distribution (constant failure rate). For a disk, we might see:

MTBF = 800,000 hours

From this, we calculate the reliability over time T as:

Rdisk(T) = e^(-T/MTBF) = e^(-43,800/800,000) = 0.9467

The average failure rate for time T is:

FRdisk(T) = (1 - Rdisk(T))/T = (1 - 0.9467)/43,800 = 0.000001217 per hour

For a single disk, which requires a service call when it fails, the MTBS is simply the reciprocal of the failure rate:

MTBS = 1/FRdisk(T) = 1/0.000001217 = 821,764 hours
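
As a sanity check, these three steps can be reproduced in a few lines of Python. This is just a sketch under the same assumptions as above (exponential reliability, T = 43,800 hours, MTBF = 800,000 hours); the variable names are mine:

    import math

    T = 5 * 8760        # analysis interval: 5 years = 43,800 hours
    MTBF = 800_000      # quoted disk MTBF, in hours

    r_disk = math.exp(-T / MTBF)    # reliability over T, ~0.9467
    fr_disk = (1 - r_disk) / T      # average failure rate over T, ~1.217e-6 per hour
    mtbs = 1 / fr_disk              # ~821,764 hours for a single disk
    print(r_disk, fr_disk, mtbs)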

For this simple case, you can see that MTBS is simply the time-bounded evaluation of the infinite-time MTBF and isn't very interesting. However, when we add more disks, the calculations become more interesting. We can use the binomial distribution to determine the probability that multiple disks have failed. Consider the case where you have two identical disks with the same expected MTBF as above. The possible combinations of functional and failed disks follow the binomial distribution. First, we calculate the probability of being in a state where exactly K of the N drives have failed, which we'll call Pfailed:

Pfailed(K) = (N choose K) * Rdisk(T)^(N-K) * (1 - Rdisk(T))^K

This is obviously a good job for a spreadsheet or other automation; a small code sketch follows the table. The results for 2 disks (N=2) are shown in the table below:

Failed Disks (K)   (N choose K)   Pfailed   Cumulative Pfailed   FR(T), per hour   MTBS (hours)
0                  1              0.8963    0.8963               2.368e-6          422,300
1                  2              0.1009    0.9972               6.481e-8          15,430,323
2                  1              0.0028    1.0000               0                 -

If we have a service policy such that we immediately replace a failed drive, then the MTBS is 422,300 hours or approximately MTBF/2, as you would expect.
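
For reference, here is a minimal Python sketch of that spreadsheet exercise. The function name mtbs_table and the row layout are my own; it assumes the same exponential model, T = 43,800 hours, and MTBF = 800,000 hours used above:

    import math

    def mtbs_table(n_disks, mtbf_hours=800_000, t_hours=5 * 8760):
        """For each number of failed disks K, return (K, N choose K, Pfailed,
        cumulative Pfailed, FR(T) for more than K failures, MTBS)."""
        r = math.exp(-t_hours / mtbf_hours)   # single-disk reliability over T
        rows, cumulative = [], 0.0
        for k in range(n_disks + 1):
            p_failed = math.comb(n_disks, k) * r ** (n_disks - k) * (1 - r) ** k
            cumulative += p_failed
            tail = max(0.0, 1.0 - cumulative)   # P(more than K disks failed)
            fr = tail / t_hours
            mtbs = 1 / fr if fr > 0 else float("inf")
            rows.append((k, math.comb(n_disks, k), p_failed, cumulative, fr, mtbs))
        return rows

    # Reproduce the 2-disk table above.
    for k, n_choose_k, p, cum, fr, mtbs in mtbs_table(2):
        print(f"{k}  {n_choose_k}  {p:.4f}  {cum:.4f}  {fr:.3e}  {mtbs:,.0f}")

The same function generates the 3-disk and 10-disk tables that follow by changing n_disks.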

Suppose we have a mirrored pair and one hot spare (N=3); then the table looks like this:

Failed Disks (K)   (N choose K)   Pfailed   Cumulative Pfailed   FR(T), per hour   MTBS (hours)
0                  1              0.8485    0.8485               3.458e-6          289,166
1                  3              0.1433    0.9918               1.875e-7          5,332,858
2                  3              0.0081    0.9998               3.453e-9          289,617,933
3                  1              0.0002    1.0000               0                 -

Here we can see that more disks mean the service rate increases and the MTBS for immediate replacement is lower. However, we can now implement a service policy in which we only replace a disk when we get down to no redundancy; with that policy the MTBS is larger than for the 2-disk case with the same policy, 289,617,933 versus 15,430,323 hours. A side benefit is that the MTTR of a hot spare does not include a human reaction component, so it is as low as can reasonably be expected. Low MTTR leads to higher availability. Clearly, adding a hot spare is a good thing from a RAS perspective.
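
To read those two policies off the table, index the rows by K. A purely illustrative snippet, assuming the mtbs_table() sketch above is in scope:

    # With one hot spare (N=3), scheduled replacement can be deferred until
    # K=2 disks have failed; compare against the 2-disk case at K=1.
    rows_n3 = mtbs_table(3)
    rows_n2 = mtbs_table(2)
    print(f"{rows_n3[2][5]:,.0f}")   # ~289,617,933 hours
    print(f"{rows_n2[1][5]:,.0f}")   # ~15,430,323 hours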

As we add disks, we tend to prefer a service policy such that we don't have to risk getting down to a non-redundant state before we take corrective action. In other words, we may want to look at the MTBS value where the number of failed disks (K) equals the number of hot spares. This isn't at all interesting in the 2-disk case, and only mildly interesting in the 3-disk case. Suppose we have 10 disks; then the table looks like this:

Failed Disks (K)   (N choose K)   Pfailed     Cumulative Pfailed   FR(T), per hour   MTBS (hours)
0                  1              0.5784      0.5784               9.626e-6          103,888
1                  10             0.3255      0.9039               2.194e-6          455,747
2                  45             0.0824      0.9863               3.122e-7          3,202,918
3                  120            0.0124      0.9987               2.978e-8          33,574,752
4                  210            0.0012      0.9999               1.969e-9          507,775,096
5                  252            8.227e-5    1.0000               9.098e-11         10,990,887,270
6                  210            3.3858e-6   1.0000               2.893e-12         345,617,834,855
7                  120            1.241e-7    1.0000               6.054e-14         16,519,338,761,278
8                  45             2.619e-9    1.0000               7.519e-16         1,330,049,617,377,480
9                  10             3.275e-11   1.0000               4.208e-18         237,659,835,757,624,000
10                 1              1.843e-13   1.0000               0                 -

Now you can begin to see the trade-off more clearly as you compare the 3-disk scenario with the 10-disk scenario. If we have a service policy which says to replace each disk as it fails, then we will be doing so much more frequently for the 10-disk scenario than for the 3-disk scenario. Again, this should make intuitive sense. But the case can also be made that if we only want to schedule service at an effective MTBS of at least 3,000,000 hours, then we'd need at least two hot spares for the 10-disk scenario versus only one hot spare for the 3-disk scenario. Which is better? Well, it depends on other factors you need to consider when designing such systems. The MTBS analysis provides just one view into the problem.
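
That hot-spare sizing can be automated with a small helper built on the mtbs_table() sketch above. The name spares_needed and the policy of scheduling service once K equals the number of hot spares are my own framing of the trade-off described here:

    def spares_needed(n_disks, target_mtbs_hours):
        """Smallest number of hot spares such that deferring service until
        K = (number of hot spares) disks have failed still meets the target."""
        rows = mtbs_table(n_disks)
        for k in range(1, n_disks):
            if rows[k][5] >= target_mtbs_hours:
                return k
        return None

    print(spares_needed(10, 3_000_000))   # 2 (MTBS ~3.2M hours at K=2)
    print(spares_needed(3, 3_000_000))    # 1 (MTBS ~5.3M hours at K=1)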

Caveat Emptor

When we look at failure rates, MTBF, and the MTBS derived from them, people are often shocked by the magnitude of the values. 800,000 hours is 91 years. 91 years ago disk drives did not exist, and any current disk drive is unlikely to function 91 years from now. Most disk drives have an expected lifetime of 5 years or so. The effect this has on the math is that the failure rate is not constant and changes over time: as the disk drive gets older, the failure rate increases. From an academic perspective, it would be nice to know the shape of that curve and adjust the algorithms accordingly. From a practical perspective, such data is very rarely publicly available. You can also insulate your design from this by planning for a technology refresh every 5 years or so. The accuracy of the magnitude does not have a practical impact on system analysis; it is more important to judge the relative merits of competing designs and configurations. For example, should you use a 2-way RAID-1 with a spare versus a 3-way RAID-5? The analysis described here can help answer that question by evaluating the impact of the decisions relative to each other. You should interpret the results accordingly.


