In my previous post on RAID-5, I talked about how we can recover the data using simple additions and subtractions.
We will discuss RAID-6 in this post. Unfortunately, you need a bit more than additions and subtractions to understand how we recover data in RAID-6. Toward the end of this post, I will touch briefly upon the "finite field," but I will refrain from being too mathematical.
I mentioned in my last post that we have no way to tell if someone swapped the drives in RAID-5 array (mathematically).
When the equation (a) is true, the equation (b) will also be true. Therefore, we are unable to find out the swapped content in the drive (x and y) from this equation.
1 + x + y = 6 ... (a)
1 + y + x = 6 ... (b)
To find the values of two unknown variables, we need at least two equations that describe the relationship between the variables. Yes, we need to use a "system of equation." Now the question is, how does another equation look?
If (a) is true, can another equation be something like below?
x + y = 5 ... (c)
If you think about it, it won't help us identify the "location, " so we need some twist.
In the previous post, in RAID-5, we learned that we calculated the sum of data by using exclusive OR (XOR) and use it as a parity. When a drive breaks, we can use the sum of the data from surviving drives and subtract from the parity to calculate missing data. To finish recovery, we need to read all the data and parity we need, recalculate, and write to the new drive. Because of this, the reconstruction of data takes time.
Any drives, regardless of HDDs or SSDs, have their lifetime. It will eventually break. We lose data if we lose another drive while recovering the data in RAID-5. Therefore, we need protection that is tolerant of a "two-drive failure, " such as RAID-6 or RAID-6 equivalent mechanism.
Having two drive failures and trying to recover data is like solving the equation below.
1 + x + y = 6 … (1) （Equation (7) from the previous post）
This equation has two problems.
Having only parity P won't help us in this case. We need another formula that allows us to solve x and y, while we can also identify the location of the data.
A byte of data in each drive is a number. So what if we simplify, for now, assume that we only can store an integer between 0 to 9 in each device. Also, suppose that we have 5 data drives and number them d_{0}, d_{1}, d_{2}, d_{3,} and d_{4}. We also have P and Q drives for parities. Let's say, for the sake of explanation, P and Q can have any positive integer. Because it is easier to calculate and explain later, I will place the drives in the following order: d_{4}, d_{3}, d_{2}, d_{1}, d_{0,} P, and Q. If you XOR from d_{4} through d_{0}, P is equal to 1.
Figure 1: Normal Data with Calculated P
Now, if you keep looking at the five numbers from d_{4} through d_{0}, do they start looking like a single number to you? If so, then, how about we assign a "digit" for each drive? If we calculate Q by adding each digit by using XOR, we end up calculating R. So let's put weight for each digit, such as 1 for d_{0}, 10 for d_{1}, and so on, up to 10000 for d_{4}.
The above example will produce the following Q.
5 x 10000 + 4 x 1000 + 3 x 100 + 2 x 10 + 1 = 54321 ... (2)
Now, let's replace 5, 4, 3, 2 and 1 with d_{4} through d_{0}. This will make (2) into below. Since 54321 is the value of Q:
Q = d_{4} x 10000 + d_{3} x 1000 + d_{2} x 100 + d_{1} x 10 + d_{0 }... (3)
If you have noticed, 10, 100, 1000, and 10000 are products of 10, and 10's power. We don't write explicitly in our daily lives, but d_{0} is actually d_{0} x 1, and 1 is 10^{0}. So if we write using 10's n^{th} power, (3) will be:
Q = 10^{4}d_{4} + 10^{3}d_{3} + 10^{2}d2 + 10^{1}d_{1} + 1 x d_{0 }
= 10^{4}d_{4} + 10^{3}d_{3} + 10^{2}d2 + 10^{1}d_{1} + 10^{0}d_{0 }... (4)
We used "10" since we know the decimal number well. But it does not have to be 10, but any generator g. (4) can be rewritten in the general equation such as below.
Q = g^{4}d_{4} + g^{3}d_{3 }+ g^{2}d_{2}+ g^{1}d_{1} + g^{0}d_{0} ...(5)
It now looks a bit mathematical. But remember that we are only doing addition and multiplication. Let's use some "math" signs, so it looks even more "mathematical." :-)
It looks hard on the surface, but it isn't.
I changed the number of drives from "fixed" to m drives. Since we numbered the drives from 0 (zero), the index would be between 0 and m-1 for the data drives. Each data drive has data d_{k}, and we multiply by g^{k}. Q is the sum of d_{k} multiplied by g^{k}.
If you prefer following the steps to calculate, it is below:
It may be tedious, but you can follow it.
You came all the way through adding, multiplying, and transforming equations from one form to another. You calculated Q. (*1)(*2)
Congratulations, Q is Reed-Solomon code!
CDs, DVDs, and BlueRay discs use Reed-Solomon code for their error correction.
Well, let me ask you -- have you seen the equations that look like (4) or (5)? That's right -- it is based on simple polynomials.
(*1) A little more precisely, they manipulate the number to be in a specific range for error correction. They use a finite field or Galois field to accomplish this. However, the basic idea is the same.
(*2) I walked through very detailed steps because I am not good at math. ;-)
Let's deal with the simple scenarios first.
In RAID-6, we use different methods for different failure scenarios. When I say "easy" in this post, it means that we need calculation as much as RAID-5 recovery. In the following explanation, when I say "data drive," it means drives with actual data, not parities.
We do the same thing as RAID-5 recovery. We use information from surviving drives and use XOR to calculate missing drive data.
When a data drive and Q parity drive break, we recover the data drive first using RAID-5 recovery and then recalculate Q.
One of the RAID-6 unique recovery scenarios is when both the parity drive and data drive x fail. This means that you can't use P to recover the data. In this case, you need to calculate a new Q or Q_{x} from surviving data drives and subtract Q_{x} from Q to calculate the data in the drive x (D_{x}). D_{x}, however, will be weighed; therefore, we need to divide by the weight after we know D_{x}.
In our example below, Q is 54321. (Note that we can store any positive decimal number in P and Q for the sake of explanation.)
Figure 2: 5 Data Drives with Calculated P and Q
Suppose if d_{3 }is broken and we don't know the P. Q_{x} and weighed d_{3} are as follows.
Q_{x} = 50000 + 0 + 300 + 20 + 1 = 50321
Q - Q_{x} = 54321 (Q) - 50321 (Q_{x}) = 4000 (d_{3} with weight)
Because we have weight of 10^{3 }for d_{3}, we need to divide the answer by 1000. This will recover 4. Then we can recalculate P to double-check. It will be 1 after recovery of d_{3}.
Suppose there are two failed data drives. Let's say, in order to represent the data in each failed drive, we call them D_{x} and D_{y}. We can calculate P parity from the surviving drives. Let's call this P_{xy}. We can also calculate Q parity from the surviving drives. Let's call this Q_{xy}.
Now we have the following simultaneous equations.
By transforming the equation for P, we can calculate the sum of D_{x} and D_{y}. (Equation (10)) This is just a sum, so with this equation, we would know neither values. So, let's transform the equation for Q as well. (Equation 11)
The equation 11 is telling us the difference between Q and Q_{xy} is equal to the sum of the weighted data from D_{x} and D_{y}.
Let's recall Figure 2, and go through this with actual numbers.
Figure 2: 5 Data Drives with Calculated P and Q
Suppose both d2 and d3 are broken.
Each Q_{xy} and Q - Q_{xy} are as follows.
Q = 54321
Q_{xy} = 50000 + 0 + 0 + 20 + 1 = 50021
Q - Qxy = 430
Now, in order for us to come up with D_{y, }we can divide the number by weight for D_{y }and get rid of the number below the floating point.
D_{y} = 4300 ÷ 1000 = 4.3 … (12)
The number above the floating point is 4, so we have D_{y} recovered.
(By the way, P_{xy }is 6, and P xor P_{xy} is 7. So, D_{x} xor D_{y} has to be 7.)
We can calculate D_{x} by subtracting D_{y} from the answer of Q-Q_{y}.
Let's transform equation 11 so that it is easier to calculate.
The answer would be 300. d_{2} had a weight of 100, so 300 divided by 100 makes 3. Now we have the value in d_{2}.
All looks good.
All looks good, doesn't it?
Unfortunately not.
There are two problems.
The first is the condition we set to explain the whole thing. Each drive could have a decimal number between 0 and 9. In the computer world, the information is handled by the units of bits and bytes. A bit is a digit in the binary system and a byte is 8 bits. This means that a byte is worth a number of between 0 and 255 in decimal. It does not necessarily fit our favorite boundary of 10. We need a different system without breaking the equations we discussed so far.
Another issue is "floating-point." We don't have a floating-point in the binary system. We simulate it in the code so that we can calculate numbers with floating points. However, in the end, but we need to round the number to the boundary of 2's power. So there can be divergence when we calculate the number that can't be divided cleanly in the binary system. Getting the different numbers than ones we stored means wrong data unless we are aware. This is critical for the storage systems because we take it granted that storage stores data correctly and replays correctly without any "rounded-up" or "rounded-down" numbers.
At the beginning of the discussion, we decided to use only integers but when we need to divide, we may get a non-integer.
To sum up, here are some conditions we need for calculating RAID-6 data.
We would like to be able to:
In mathematics, there is something called "Number Theory." It is the field of research about the set of integers and their characteristic.
Within Number Theory, there is a concept called "field." A "field" is a set of numbers that you can add, subtract, multiply and divide freely, other than divide by zero. For instance, a set of real numbers is a field. A set of integers and fractions forms a field. When there is an infinite number of elements in the field, it is called an "infinite field." You are using different infinite fields every day without knowing the name of them.
When the number of elements is limited, we call it a "finite" field. We also call it Galois Field because it was found by Évariste Galois. If we have a finite field that we can use integers between 0 and 255 in decimal, it would be convenient.
We use the binary system in computers. Lucky for us, the smallest finite field is a binary digit, that they denote as F_{2}.
(Furthermore, we can treat addition and subtraction in the same manner. Also, multiplication and division can easily be done by bit-shifting.)
Thanks to the effort of countless researchers, we know that the below polynomial does not fail in the binary system. (I am presenting the equation (5) again)
Q = g^{4}d_{4} + g^{3}d_{3 }+ g^{2}d_{2}+ g^{1}d_{1} + g^{0}d_{0} ...(5)
In the real world, it looks that we use 2^{8} for defining the field probably because the papers like A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems used it. We use a finite field in Reed-Solomon Coding, so if you have ever used CDs, DVDs or BlueRay, you used Galois Field without knowing it.
We recently hear triple parity. (RAID-Z3 is triple parity equivalent.) In case we have triple parity, we can recover from a "triple data drive" failure. The idea is the same we discussed so far, but we have P, Q, and R parity to protect data in data drives. Calculating R is not so difficult. (I am presenting the equation (5) again)
Q = g^{4}d_{4} + g^{3}d_{3 }+ g^{2}d_{2}+ g^{1}d_{1} + g^{0}d_{0} ...(5)
When g=4, it is the same equation calculates R. I will not go through the steps to recovery from triple faults, but we know the equation is closed in the binary system.
Lastly, but not least, there are other means to calculate P/Q parities. If you are interested, you should Google them with keywords like "Even-Odd" or "diagonal parity."
Thank you for reading my rather lengthy blog post. If this post helps you in any way, it is my pleasure. I appreciate corrections and comments. The mistakes here are all mine.
<For Those Interested>
P, Q, and R represent the set whose generators are 1, 2 and 4 respectively. When the generator is 1, it is equivalent to XOR, and 2 and 4 give us polynomials with linearly independent coefficients. According to the llumos source code, there is no "g" other than 1, 2 and 4 with this property, therefore, a paper by James S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, will fail. Mr. Plank corrected his paper in 2003 with his co-author.
If we can take only 1, 2, and 4 as a value of g, the same method can't be applied to calculate the 4th parity and beyond. This also means that using this method will prevent us to make the protection mechanism when more than or equal to 4 drives fail at the same time.