Mean Time Between Service (MTBS)
Systems that include redundancy can often achieve very high
reliability and availability. When systems reach high levels of
redundancy, with reliability measured in decades or availability
above 99%, we need to look at other measurements to
differentiate design choices. One such measurement is the Mean Time
Between Service (MTBS). For example, a 2-way mirror RAID system will
offer data redundancy, high Mean Time Between System Interruption
(MTBSI), and good availability. Adding a hot spare disk will decrease
the Mean Time To Repair (MTTR) and thus improve MTBSI and
availability. But what about adding two hot spare disks? The answer
is intuitively obvious: more spare disks mean that the system should
be able to survive even longer in the face of failures. In general,
adding more hot spares is better. When comparing two different RAID
configurations, it can be helpful to know the MTBS as well as other Reliability, Availability, and Serviceability (RAS) metrics.
We define MTBS as the mean time between failures of a system which
would lead to a service call. For a non-redundant system, a service
call would occur immediately after a system interruption. For
redundant systems, a service call may be scheduled to replace a
failed component, even if an interruption did not occur. Intuitively,
if we can defer service calls, then the system is easier to manage. In
the RAID example, more hot spares mean that we can defer the scheduled
replacement service when one disk fails. The more hot
spares we have, the longer we can defer the scheduled replacement
service.
Mathematical Details
To calculate MTBS we need to define a fixed time, T, as an
interval for the analysis. At Sun, we often use 5 years for disk
drives.
T = 5 years = 43,800 hours
Next, we need to calculate the reliability of the device over that
period of time. For the general case, we use an exponential
distribution. For a disk, we might see:
MTBF = 800,000 hours
From this, we calculate the reliability for time T as:
R_{disk}(T) = e^{-T/MTBF} = e^{-(43,800/800,000)} = 0.9467
The average failure rate for time T is:
FR_{disk}(T) = (1 - R_{disk}(T))/T = (1 - 0.9467)/43,800 = 0.000001217 per hour
For a single disk, which requires a service call when it fails,
the MTBS is simply the reciprocal of the failure rate:
MTBS = 1/FR_{disk}(T) = 1/0.000001217 = 821,764 hours
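If you want to reproduce these single-disk numbers, the arithmetic is easy to script. Here is a minimal Python sketch (the variable names are mine, not from any particular tool) that follows the same steps:

```python
import math

T = 5 * 8760       # analysis interval: 5 years expressed in hours (43,800)
MTBF = 800_000     # assumed per-disk MTBF in hours

# Reliability over the interval, assuming an exponential distribution
R_disk = math.exp(-T / MTBF)      # ~0.9467

# Average failure rate over the interval (failures per hour)
FR_disk = (1 - R_disk) / T        # ~1.217e-6

# For a single, non-redundant disk every failure is a service call
MTBS = 1 / FR_disk                # roughly 820,000 hours

print(f"R(T) = {R_disk:.4f}  FR(T) = {FR_disk:.3e}/hr  MTBS = {MTBS:,.0f} hours")
```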
For this simple case, you can see that MTBS is simply the
time-bounded evaluation of the infinite-time MTBF and isn't very
interesting. However, when we add more disks, the calculations become
more interesting. We can use a binomial distribution analysis to
determine the probability that multiple disks have failed. Consider
the case where you have two identical disks with the same expected
MTBF as above. The possible combinations of functional and failed
disks follow the binomial distribution. First, we calculate the
probability of being in a state where exactly K of N drives have
failed, which we'll call P_{failed}:
P_{failed} = (N choose K) * R_{disk}(T)^{N-K} * (1 - R_{disk}(T))^{K}
This is obviously a good job for a spreadsheet or other
automation. The results can be easily seen in the table below, where
we have 2 disks, N=2:
| Number of Failed Disks (K) | Binomial Distribution (N choose K) | P_{failed} | Cumulative P_{failed} | FR(T) (per hour) | MTBS (hours) |
|---|---|---|---|---|---|
| 0 | 1 | 0.8963 | 0.8963 | 2.368e-6 | 422,300 |
| 1 | 2 | 0.1009 | 0.9972 | 6.481e-8 | 15,430,323 |
| 2 | 1 | 0.0028 | 1.0000 | 0 | ∞ |
If we have a service policy such that we immediately replace a
failed drive, then the MTBS is 422,300 hours or approximately MTBF/2,
as you would expect.
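As noted above, this is a natural job for a spreadsheet or a few lines of code. The following Python sketch (the helper name mtbs_table and its layout are my own, hypothetical choices) generates the same columns as the table for any number of disks:

```python
import math

T = 5 * 8760                  # 5-year analysis interval in hours
MTBF = 800_000                # assumed per-disk MTBF in hours
R = math.exp(-T / MTBF)       # per-disk reliability over T

def mtbs_table(n_disks):
    """Rows of (K, N choose K, P_failed, cumulative P_failed, FR(T), MTBS)."""
    rows, cumulative = [], 0.0
    for k in range(n_disks + 1):
        combos = math.comb(n_disks, k)
        p_failed = combos * R ** (n_disks - k) * (1 - R) ** k
        cumulative += p_failed
        fr = max(0.0, 1.0 - cumulative) / T   # rate of exceeding K failures
        mtbs = 1 / fr if fr > 0 else math.inf
        rows.append((k, combos, p_failed, cumulative, fr, mtbs))
    return rows

# The 2-disk mirror case shown in the table above
for k, combos, p, cum, fr, mtbs in mtbs_table(2):
    print(f"K={k}  C={combos}  P={p:.4f}  cum={cum:.4f}  FR={fr:.3e}  MTBS={mtbs:,.0f}")
```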
Suppose we have a mirrored pair and one hot spare (N=3), then the
table looks like:
| Number of Failed Disks (K) | Binomial Distribution (N choose K) | P_{failed} | Cumulative P_{failed} | FR(T) (per hour) | MTBS (hours) |
|---|---|---|---|---|---|
| 0 | 1 | 0.8485 | 0.8485 | 3.458e-6 | 289,166 |
| 1 | 3 | 0.1433 | 0.9918 | 1.875e-7 | 5,332,858 |
| 2 | 3 | 0.0081 | 0.9998 | 3.453e-9 | 289,617,933 |
| 3 | 1 | 0.0002 | 1.0000 | 0 | ∞ |
Here we can see that more disks mean the service rate increases and
the MTBS (at K=0) is lower. However, we can now implement a service
policy in which we only replace a disk when we get down to no
redundancy; under that policy the MTBS is larger than for the 2-disk
case, 289,617,933 vs. 15,430,323 hours. A side benefit is that the
MTTR of a hot spare does not include a human reaction component, so it
is as low as can reasonably be expected, and low MTTR leads to higher
availability. Clearly, adding a hot spare is a good thing from a RAS
perspective.
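Reusing the hypothetical mtbs_table() helper from the earlier sketch, the deferred-service comparison above can be read straight out of the rows:

```python
# Continues the earlier sketch; assumes mtbs_table() is already defined.
two_disk   = mtbs_table(2)   # mirror only: no redundancy left after 1 failure
three_disk = mtbs_table(3)   # mirror plus hot spare: redundancy lost after 2 failures

# MTBS when we defer service until redundancy is exhausted
print(f"2-disk, service at K=1: {two_disk[1][5]:,.0f} hours")    # ~15.4 million
print(f"3-disk, service at K=2: {three_disk[2][5]:,.0f} hours")  # ~290 million
```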
As we add disks, we tend to prefer a service policy in which we
don't risk getting down to a non-redundant state before we
take corrective action. In other words, we may want to look at the
MTBS value when the number of failed disks (K) is equal to the number
of hot spares. This isn't at all interesting in the 2-disk case, and
only mildly interesting in the 3-disk case. Suppose we have 10
disks; then the table looks like:
| Number of Failed Disks (K) | Binomial Distribution (N choose K) | P_{failed} | Cumulative P_{failed} | FR(T) (per hour) | MTBS (hours) |
|---|---|---|---|---|---|
| 0 | 1 | 0.5784 | 0.5784 | 9.626e-6 | 103,888 |
| 1 | 10 | 0.3255 | 0.9039 | 2.194e-6 | 455,747 |
| 2 | 45 | 0.0824 | 0.9863 | 3.122e-7 | 3,202,918 |
| 3 | 120 | 0.0124 | 0.9987 | 2.978e-8 | 33,574,752 |
| 4 | 210 | 0.0012 | 0.9999 | 1.969e-9 | 507,775,096 |
| 5 | 252 | 8.227e-5 | 1.0000 | 9.098e-11 | 10,990,887,270 |
| 6 | 210 | 3.858e-6 | 1.0000 | 2.893e-12 | 345,617,834,855 |
| 7 | 120 | 1.241e-7 | 1.0000 | 6.054e-14 | 16,519,338,761,278 |
| 8 | 45 | 2.619e-9 | 1.0000 | 7.519e-16 | 1,330,049,617,377,480 |
| 9 | 10 | 3.275e-11 | 1.0000 | 4.208e-18 | 237,659,835,757,624,000 |
| 10 | 1 | 1.843e-13 | 1.0000 | 0 | ∞ |
Now you can begin to see the tradeoff more clearly as you compare
the 3-disk scenario with the 10-disk scenario. If we have a service
policy which says to replace each disk as it fails, then we will be
doing so much more frequently for the 10-disk scenario than for the
3-disk scenario. Again, this should make intuitive sense. But the
case can also be made that if we only want to service the system at
an effective rate of once per 3,000,000 hours, then we'd need
to have at least two hot spares for the 10-disk scenario versus only
one hot spare for the 3-disk scenario. Which is better? Well, it
depends on other factors you need to consider when designing such
systems. The MTBS analysis provides just one view into the problem.
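The "how many hot spares?" question can be answered from the same table. A small sketch, again building on the hypothetical mtbs_table() helper from the earlier sketch, finds the smallest spare count that keeps the deferred-service MTBS at or above a target (3,000,000 hours is the figure used above):

```python
def spares_needed(n_disks, target_mtbs_hours):
    """Smallest K such that servicing only after K disk failures still
    keeps the effective MTBS at or above the target."""
    for k, _, _, _, _, mtbs in mtbs_table(n_disks):
        if mtbs >= target_mtbs_hours:
            return k
    return None

print(spares_needed(3, 3_000_000))    # -> 1 hot spare for the 3-disk scenario
print(spares_needed(10, 3_000_000))   # -> 2 hot spares for the 10-disk scenario
```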
Caveat Emptor
When we look at failure rates, MTBF, and the MTBS derived from
them, people are often shocked by the magnitude of the values.
800,000 hours is 91 years. 91 years ago disk drives did not exist and
any current disk drive is unlikely to function 91 years from now.
Most disk drives have an expected lifetime of 5 years or so. The
effect this has on the math is that the failure rate is not constant:
as a disk drive gets older, its failure rate increases. From an
academic perspective, it would be nice to know the shape of that
curve and adjust the algorithms accordingly; from a practical
perspective, such data is very rarely publicly available. You can
also insulate your design from this by planning for a technology
refresh every 5 years or so. The accuracy of the absolute magnitudes
does not have a practical impact on system analysis. It is more
important to judge the relative merits of competing designs and
configurations. For example, should you use a 2-way RAID-1 mirror
with a spare versus a 3-way RAID-5 set? The analysis described here
can help answer that question by evaluating the impact of the
decisions relative to each other. You should interpret the results
accordingly.