Tuesday Jan 17, 2012

A place for everything and everything in its place

At Sun we tended to publish whatever we wanted, and put it wherever we wanted. While that felt liberating, it could be quite difficult for our employees, customers, and partners to find the things we published. One needed to market one's document or tool on all kinds of mailing lists and forums in order for people to find it. That was of course how the size calculator for the ZFS Storage Appliance was first published: to a blog and a wiki back in 2008.

Oracle (as you might imagine for a company that produces software to manage and share data) likes to keep things a little more organized. To that end, I'll be sending the next release of the size calculator through some paperwork and process to get it posted to OTN (Oracle Technology Network). On OTN it will be easier for people who haven't bookmarked my blog to find it, and it will be centrally located with other ZFS Storage Appliance tools and documents, making it easier for people using the tool to find other interesting and relevant information.

In the short term this means that the tool won't be available (on my blog or anywhere else), but that won't be permanent. I'll be sure to share the OTN link to the next release once it is available. I'm sorry if this is an inconvenience.

EOF 

Wednesday Sep 21, 2011

Capacity Sizing for 15K Disks

On September 13, 2011 Oracle announced the availability of 300GB and 600GB 15K RPM disks for the ZFS Storage Appliance lineup. These disks are intended to make the entry capacity more realistic as the industry shifts away from 1TB disks in favour of 2TB disks today, and up to 3 and 4TB disks in the future. In the past week and a bit I've had a few people ask when the ZFS Storage Appliance Size Calculator would be updated to handle these new disks. The answer, surprisingly, is that the latest release (here) already supports arbitrary disk sizes. It works using the same "size" keyword that supported 1 and 2TB disks before. Take a look:

./sizecalc-2010Q3.py 10.10.10.10 *** size 300G 24
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
  -----------------------------------------
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           22      3.30          2.95
mirror3      False      3      3           21      2.10          1.88
raidz1       False      4      4           20      4.50          4.03
raidz2       False     11      2           22      5.40          4.83
raidz3 wide  False     23      1           23      6.00          5.37
stripe       False      0      0           24      7.20          6.45

This example shows a single 24-disk shelf of 300GB disks. We can also mix and match sizes as before:

./sizecalc-2010Q3.py 10.10.10.10 *** size 300G 24 add size 600G 24
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
  -----------------------------------------
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
  -----------------------------------------
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      4           44      9.90          8.86
mirror3      False      3      6           42      6.30          5.64
raidz1       False      4      8           40     13.50         12.09
raidz2       False     11      4           44     16.20         14.50
raidz3 wide  False     23      2           46     18.00         16.12
stripe       False      0      0           48     21.60         19.34

In the example above we create a pool from 300GB disks, and then add a new shelf of 600GB disks to that pool. One final example for fun:

./sizecalc-2010Q3.py 10.10.10.10 *** size 875G 24
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
  -----------------------------------------
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           22      9.62          8.62
mirror3      False      3      3           21      6.12          5.48
raidz1       False      4      4           20     13.12         11.75
raidz2       False     11      2           22     15.75         14.10
raidz3 wide  False     23      1           23     17.50         15.67
stripe       False      0      0           24     21.00         18.80

This final example shows that one can predict the capacity of a pool for which the real disk doesn't even exist. We've never seen an 875GB disk, but if one were to exist, a shelf of 24 would deliver an 8.62TiB mirrored pool.
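If you're curious where those numbers come from, the arithmetic is simple enough to sketch in a few lines of Python. This is a back-of-the-envelope check rather than the tool's actual code; in particular, the ~1.5% filesystem metadata reserve is an assumption I've chosen to make the numbers line up with the output above:

# A back-of-the-envelope check on the mirror row above -- not the
# tool's actual code. The metadata reserve is an assumed value.
DISK_BYTES = 875 * 10**9     # an 875GB disk, as a manufacturer counts it
DISKS = 24
SPARES = 2                   # per the mirror row in the table
RESERVE = 0.015              # assumed filesystem metadata reserve

data_drives = DISKS - SPARES               # 22
mirrors = data_drives // 2                 # 11 two-way mirrors
raw_tb = mirrors * DISK_BYTES / 1e12
usable_tib = mirrors * DISK_BYTES * (1 - RESERVE) / float(2**40)
print("raw: %.2f TB, usable: %.2f TiB" % (raw_tb, usable_tib))
# -> roughly: raw 9.62 TB, usable 8.62 TiB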

 Happy sizing!

EOF 

Thursday Oct 21, 2010

Capacity Sizing on 7x20

At Oracle Open World we announced a new line of storage controllers in the 7000 Series (now called ZFS Storage Appliances). These models are signified by a '20' at the end of their model name (i.e., 7120). On Tuesday, three of these new controllers began shipping (7120, 7320 and 7420). I have enhanced the size calculator to accommodate some of the changes in the product line, and to make it a little more usable. A summary of the enhancements is below, along with some examples.

  1. sizecalc now uses strict mode by default; the -s command line option disables it if desired
  2. A new -n option allows you to run sizecalc without the ASCII art. I noticed that the art was useful for reports, but became annoying when trying to compare the results from different configurations
  3. Added full support for the 7120 and 7720 platforms. The platform type is detected by the number of disks in the first shelf (12 for 7120, 30 for 7720)
  4. Limited validation for hostnames and IP addresses (see the sketch after this list). This helps eliminate some problems where the wrong parameters were supplied. It's not perfect - 'blowfish' is a valid password and a valid hostname, but '24' is not a valid hostname, for example
  5. Added a proper error message when an invalid password for the 7000 is supplied
  6. Disk sizes can now be specified in upper case (size 1T) or lower case (size 1t) to help alleviate usage problems
  7. Dynamic Storage Profile Detection: sizecalc now determines the list of valid storage profiles by querying the appliance rather than having them hard coded. This ensures that invalid profiles are not displayed.
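
To give a flavour of the validation in item 4, here is a minimal sketch of that kind of sanity check. It illustrates the idea only; it is not the calculator's actual validation code:

import re
import socket

def looks_like_target(arg):
    # Accept dotted-quad IPs and plausible hostnames; reject bare numbers.
    try:
        socket.inet_aton(arg)          # parses as an IPv4 address?
        return arg.count('.') == 3     # inet_aton also accepts '24'!
    except socket.error:
        pass
    # Hostname labels: letters, digits, and interior hyphens.
    host = re.match(r'^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'
                    r'(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)*$', arg)
    return bool(host) and not arg.isdigit()

print(looks_like_target('blowfish'))      # True: a valid hostname
print(looks_like_target('10.156.83.42'))  # True: a valid IP address
print(looks_like_target('24'))            # False: not a valid hostname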

So, on to some examples of the new tool. Here is an example using a 7120 configuration. You can see that by specifying 12 disks in the first shelf, the system detects (and draws) the 7120 Controller Chassis first, and then the JBOD shelves (up to 2 x 24 disks using 1 or 2TB drives).

./sizecalc-2010Q3.py 10.156.83.42 ebi 12 24
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
       7120 Controller Chassis
  -----------------------------------------
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
  -----------------------------------------
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
 2| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           34     34.00         30.44
mirror        True      2     12           24     24.00         21.49
mirror3      False      3      3           33     22.00         19.70
mirror3       True      3      3           33     22.00         19.70
raidz1       False      4      4           32     48.00         42.97
raidz2       False     11      3           33     54.00         48.35
raidz3 wide  False     35      1           35     64.00         57.30
stripe       False      0      0           36     72.00         64.46

Here is another 7120 example, this one specifying an invalid configuration (a disk shelf with room for Log devices - not allowed on the 7120, whose log devices are built into the controller) to show how strict mode can ensure that you use valid configurations:

 ./sizecalc-2010Q3.py -n 10.156.83.42 ebi 12 20
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
Error: 7120 Expansion Modules require 24 disks

Now for some examples using the 7720, the new high-density chassis. You can see the artwork showing the front and back of the chassis, and the disks populating each cage:

./sizecalc-2010Q3.py 10.156.83.42 ebi 30 30 30 30
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
                 FRONT                                   BACK
 ------------------------------------- 	-------------------------------------
 | | | | | | | | | | | | | | | | |L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
5|---------------|C|C|---------------|  |---------------|C|C|---------------|11
 | | | | | | | | |C|C| | | | | | | | |  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 | | | | | | | | | | | | | | | | |L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
4|---------------|C|C|---------------|  |---------------|C|C|---------------|10
 | | | | | | | | |C|C| | | | | | | | |  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
3|---------------|C|C|---------------|  |---------------|C|C|---------------|9
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
2|---------------|C|C|---------------|  |---------------|C|C|---------------|8
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
1|---------------|C|C|---------------|  |---------------|C|C|---------------|7
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
0|---------------|C|C|---------------|  |---------------|C|C|---------------|6
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      4          116    116.00        103.85
mirror        True      2      4          116    116.00        103.85
mirror3      False      3      6          114     76.00         68.04
mirror3       True      3      6          114     76.00         68.04
raidz1       False      4      4          116    174.00        155.78
raidz2       False     14      8          112    192.00        171.89
raidz2        True      8      8          112    168.00        150.41
raidz3 wide  False     58      4          116    220.00        196.96
raidz3 wide   True     12     12          108    162.00        145.04
stripe       False      0      0          120    240.00        214.87

The next example shows some more configuration rule enforcement; the 7720 only supports 2TB disks:

./sizecalc-2010Q3.py 10.156.83.42 ebi size 1t 30 30 30 30 30
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
Warning: 7720 only supports 2TB disks. Disk size reset to 2TB
                 FRONT                                   BACK
 ------------------------------------- 	-------------------------------------
 | | | | | | | | | | | | | | | | |L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
5|---------------|C|C|---------------|  |---------------|C|C|---------------|11
 | | | | | | | | |C|C| | | | | | | | |  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
4|---------------|C|C|---------------|  |---------------|C|C|---------------|10
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
3|---------------|C|C|---------------|  |---------------|C|C|---------------|9
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
2|---------------|C|C|---------------|  |---------------|C|C|---------------|8
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
1|---------------|C|C|---------------|  |---------------|C|C|---------------|7
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
 |1|1|1|1|1|1|1|1| | |1|1|1|1|1|1|L|L|  | | | | | | | | | | | | | | | | |L|L|
 | | | | | | | | |L|L| | | | | | |Z|Z|  | | | | | | | | |L|L| | | | | | |Z|Z|
0|---------------|C|C|---------------|  |---------------|C|C|---------------|6
 |1|1|1|1|1|1|1|1|C|C|1|1|1|1|1|1|1|1|  | | | | | | | | |C|C| | | | | | | | |
 | | | | | | | | | | | | | | | | | | |  | | | | | | | | | | | | | | | | | | |
 ------------------------------------- 	-------------------------------------
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      4          146    146.00        130.71
mirror        True      2      4          146    146.00        130.71
mirror3      False      3      6          144     96.00         85.95
mirror3       True      3      6          144     96.00         85.95
raidz1       False      4      6          144    216.00        193.38
raidz2       False     13      7          143    242.00        216.66
raidz2        True      9      6          144    224.00        200.54
raidz3 wide  False     48      6          144    270.00        241.73
raidz3 wide   True     13      7          143    220.00        196.96
stripe       False      0      0          150    300.00        268.59

And here's a 7720 example with no artwork, to save some screen real-estate:

./sizecalc-2010Q3.py -n 10.156.83.42 ebi 30 30 30 30 30 30 30 30 30 30 30 30
Oracle ZFS Storage Appliance Size Calculator for SAS disks Version 2010.Q3
Enumerating Available Storage Profiles... done.
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      4          356    356.00        318.72
mirror        True      2      4          356    356.00        318.72
mirror3      False      3      6          354    236.00        211.29
mirror3       True      3      6          354    236.00        211.29
raidz1       False      4      4          356    534.00        478.08
raidz2       False     14     10          350    600.00        537.17
raidz2        True     14     10          350    600.00        537.17
raidz3 wide  False     59      6          354    672.00        601.63
raidz3 wide   True     35     10          350    640.00        572.98
stripe       False      0      0          360    720.00        644.60

UPDATE:

The size calculator is currently not available. It is moving to a new home; you can read more here.

For more detailed help on the features of the original sizecalc for SAS disks, run the tool with no arguments to read the help message, or read my previous blog entry here.

 Happy Sizing!

 EOF

Friday Aug 27, 2010

What is a Terabyte?

While this may be surprising to some people, one discussion I seem to have more frequently than any other is one surrounding usable capacity. Not surprisingly, capacity is a key metric by which customers buy storage, but what may be surprising is how poorly it is understood. I'll attempt to demystify things a little.

Before we get too far I think it's worth explaining what a Terabyte is. I often get the question "Isn't 1TB just one trillion bytes?" That is a perfectly reasonable conclusion to draw - tera is the standard prefix for trillion in base 10. The thing we need to remember is that computers don't do decimal (base 10) math like we do; they do binary (base 2) math. As a computer understands it, 1TB is 2^40, or 1,099,511,627,776 bytes. Similarly, 1GB is actually 2^30, or 1,073,741,824 bytes.

Surely then when we buy a 1TB drive in our storage array, or from the local electronics store, it must contain 1,099,511,627,776 bytes right? Wrong. If you refer to the specifications of nearly any disk on the market today you'll see the capacity footnoted with text something like this (found on a Seagate specification sheet) "1 One gigabyte, or GB, equals one billion bytes and one terabyte, or TB, equals one trillion bytes when referring to drive capacity". Technically this is correct, since the prefixes 'giga' and 'tera' do describe billions and trillions, but it leaves a little to be desired when we're talking in terms the computer understands. Update: It was pointed out to me that I should highlight the fact that RAM always comes in capacities based on a power of 2 as a result of the way it is addressed, meaning that this capacity difference will only ever apply to disks. 

I can't honestly pinpoint when this footnoting of capacity started happening without going over a lot of old spec sheets, but I would guess it would date back to the time when drives in the gigabyte range started to become available. Previous to that I can recall looking at the spec sheets for drives in which the number of megabytes of capacity was well documented. In fact, the geometry of the disk was described in painstaking detail including the number of spare sectors per cylinder, and the number of spare cylinders. But then, those were the days when you needed that information to use the drive.  

In any case, it seems as though at some point marketing decided that bigger, rounder units were convenient and sexy, and as consumers we accepted this as 'close enough'. It may be due to the fact that in those days the difference was smaller (1 billion bytes vs 1GiB works out to a difference of only about 70MiB).

So now let's see how the gap magnifies as the scale gets bigger. The table below, borrowed from a Wikipedia article here, compares SI units (the standard base 10 mega, giga, tera, and peta prefixes) with binary units (base 2). What we can see from the table is that the gap between SI and binary units grows as the units get bigger. A 1TB disk that you buy today actually contains a little over 931GiB of space. If we could manufacture a 1PB disk, it would contain something like 909TiB of space.

Multiples of bytes

Name (Symbol)    Standard SI  Binary usage  Ratio SI/Binary  IEC Name (Symbol)  Value
kilobyte (kB)    10^3         2^10          0.9766           kibibyte (KiB)     2^10
megabyte (MB)    10^6         2^20          0.9537           mebibyte (MiB)     2^20
gigabyte (GB)    10^9         2^30          0.9313           gibibyte (GiB)     2^30
terabyte (TB)    10^12        2^40          0.9095           tebibyte (TiB)     2^40
petabyte (PB)    10^15        2^50          0.8882           pebibyte (PiB)     2^50
exabyte (EB)     10^18        2^60          0.8674           exbibyte (EiB)     2^60
zettabyte (ZB)   10^21        2^70          0.8470           zebibyte (ZiB)     2^70
yottabyte (YB)   10^24        2^80          0.8272           yobibyte (YiB)     2^80
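
The ratio column is easy to verify for yourself; these few lines of Python reproduce it:

# Reproduce the SI/binary ratio column from the table above.
prefixes = ['kilo', 'mega', 'giga', 'tera', 'peta', 'exa', 'zetta', 'yotta']
for i, name in enumerate(prefixes):
    power = i + 1
    si = 10.0 ** (3 * power)      # e.g. 10^12 for tera
    binary = 2.0 ** (10 * power)  # e.g. 2^40 for tebi
    print("%sbyte: %.4f" % (name, si / binary))
# terabyte: 0.9095 -- which is why your '1TB' disk holds only ~931GiB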

As you can see above, a new naming convention has been created to describe storage capacity in binary terms. So a terabyte is actually not a terabyte, but rather a tebibyte or TiB. The problem is that I feel like the only guy using the term; that may be why I feel like I'm having this conversation all the time.

To make matters worse, in enterprise storage systems, there is additional overhead consumed by things like storage system metadata. In some cases this additional overhead can consume more than 10% of the purchased capacity in TiB. I doubt there's any changing the disk industry at this point, but you can challenge your enterprise storage system suppliers to tell you about the usable capacity of their system in tebibytes - and when they ask what that is (because they likely will), send them here. I've helped my customers decode the real capacity of my competitors' systems while those competitors struggled to accurately describe a terabyte.

Update: One of my coworkers added this colourful (yes, I'm Canadian and need to spell colour with a 'u') anecdote which illustrates a number of the things (overzealous marketing, base 10 vs base 2, and system metadata overhead) that I describe, and shows that, as always, what's old will be new again.

"The practice goes back to at least the 1980s, when marketing folks tried as hard as they could to make the disks sound as big as possible. In addition to the 1000 vs 1024 malarky, we used to have "unformatted capacity" about half of the time. I remember losing bids to the competition in the 80s because we (Sun) listed our drive as 669MB while they listed what turned out to be the identical drive as 760MB. If you actually formatted the drive for use, you'd see 669 million bytes, or 638 megabytes. And then when you did newfs on top of it to put a UFS file system on it, you would run out of space at around 555 MB, because statically allocated inodes consumed ~25 MB and then we also walled off a 10% reserve.

People were really annoyed that the 760MB disk actually allowed them to store 555MB"

If you'd prefer to buy storage from a company that's transparent, look no further than Oracle. The size calculator that I maintain (the latest release as of this writing is here) for our 7000 series tells you *exactly* how much capacity you will be able to use when you power up the system including all overhead, and allows you to see how much capacity you will have later when you expand your system.

EOF 

Friday Jul 02, 2010

Sizing J4410

I'm extremely excited about this week's quiet release of the J4410 expansion module for 7000 Series storage appliances. On the surface, it seems like they changed the look and feel and upgraded the back end to SAS-2, but it signifies much more. Inside these new modules are new disks that we affectionately call 'Fat SAS' - 1 or 2TB in size, spinning at 7200 RPM but with a SAS interface instead of SATA. This change from SATA to SAS interfaces has a lot of positive impact inside the 7000 series appliances, but the most relevant change to this post is the improved granularity with which one can configure storage pools...

Systems configured using J4410 expansion modules are able to provision storage at the granularity of a single disk, where previously the smallest unit of configurable storage was a diskset, a group of 12 disks. As a result of this, and the 2010.Q1 enhancement which allows multiple pools per controller, I decided to rework the 7000 Size Calculator. In order to simplify what was rapidly becoming a convoluted input language, we decided to create a new edition that will only handle the new hardware generation. Of course, you can continue to use the 2009.Q3 release to model configurations using the original J4400.

So let's take a look at how we model J4410 based configurations. There are two potential ways to buy J4410 modules: full, and with room for Log devices. The 'with room' configuration (I feel like I'm ordering at Starbucks every time I say that) has four open slots in the top row to be used with Log devices. Within each module, we can choose to allocate some of the disks to one or more pools. The new input language looks something like this:

./sizecalc-sas.py 10.156.83.42 *** 8/8/8

In the sample above, we are specifying a single expansion module configured with three pools, with 8 disks allocated to each pool. To specify two expansion modules, we add another layout statement:

./sizecalc-sas.py 10.156.83.42 *** 8/8/8 12/2/10

In this example, we are now allocating a total of 20 drives to pool 1, 10 to pool 2, and 18 to pool 3, with disks spread across both modules. Another new feature is strict mode, which ensures that the number of disks specified for a given expansion module adds up to either 20 (for the 'with room' configuration) or 24. It is enabled simply with the '-s' argument:

./sizecalc-sas.py -s 10.156.83.42 *** 8/8/8 12/2/10

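To make the layout language and the strict check concrete, here is a sketch of how a token like '12/2/10' might be parsed and validated. The logic and names are my own illustration, not the tool's actual parser:

# Parse layout tokens and apply a strict-mode check -- an illustration,
# not sizecalc's real code. 20 and 24 are the valid module sizes from
# the rules described above.
def parse_layout(token):
    # '12/2/10' -> [12, 2, 10]: disks allocated to pools 1..3
    return [int(n) for n in token.split('/')]

def check_strict(counts):
    total = sum(counts)
    if total not in (20, 24):
        raise ValueError('module holds %d disks; expected 20 or 24' % total)

modules = [parse_layout(t) for t in ['8/8/8', '12/2/10']]
for counts in modules:
    check_strict(counts)               # both sum to 24, so this passes

pool_totals = {}
for counts in modules:
    for pool, disks in enumerate(counts):
        pool_totals[pool + 1] = pool_totals.get(pool + 1, 0) + disks
print(pool_totals)                     # {1: 20, 2: 10, 3: 18}
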
Let's see the output for that command:

./sizecalc-sas.py -s 10.156.83.42 *** 8/8/8 12/2/10
  Sun Storage 7000 Size Calculator for SAS disks Version 2010.Q1
  -----------------------------------------
  | |    3   |    3   |    3   |    3   | |
  | |-----------------------------------| |
  | |    3   |    3   |    3   |    3   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
 1| |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
  -----------------------------------------
  | |    3   |    3   |    3   |    3   | |
  | |-----------------------------------| |
  | |    3   |    3   |    3   |    3   | |
  | |-----------------------------------| |
  | |    2   |    2   |    3   |    3   | |
 2| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           18     18.00         16.12
mirror        True      2      4           16     16.00         14.32
mirror3      False      3      2           18     12.00         10.74
mirror3       True      3      2           18     12.00         10.74
raidz1       False      4      4           16     24.00         21.49
raidz2       False      9      2           18     28.00         25.07
raidz3 wide  False     19      1           19     32.00         28.65
stripe       False      0      0           20     40.00         35.81
Pool 2
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2            8      8.00          7.16
mirror        True      2      6            4      4.00          3.58
mirror3      False      3      1            9      6.00          5.37
mirror3       True      3      4            6      4.00          3.58
raidz1       False      4      2            8     12.00         10.74
raidz2       False      9      1            9     14.00         12.53
raidz3 wide  False      9      1            9     12.00         10.74
stripe       False      0      0           10     20.00         17.91
Pool 3
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           16     16.00         14.32
mirror        True      2      2           16     16.00         14.32
mirror3      False      3      3           15     10.00          8.95
mirror3       True      3      3           15     10.00          8.95
raidz1       False      4      2           16     24.00         21.49
raidz2       False      8      2           16     24.00         21.49
raidz3 wide  False     17      1           17     28.00         25.07
stripe       False      0      0           18     36.00         32.23

Oh, did I forget to mention that the new version draws ASCII art to show you which disks are in which pools? Beyond the visual representation, however, you can also see three separate size tables, one per pool. Let's look at a configuration that includes a 'with room' module:

./sizecalc-sas.py -s 10.156.83.42 *** 10/10 12/12
Sun Storage 7000 Size Calculator for SAS disks Version 2010.Q1
  -----------------------------------------
  | |   LZ   |   LZ   |   LZ   |   LZ   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    2   |    2   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
  -----------------------------------------
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
 2| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           20     20.00         17.91
mirror        True      2      2           20     20.00         17.91
mirror3      False      3      1           21     14.00         12.53
mirror3       True      3      1           21     14.00         12.53
raidz1       False      4      2           20     30.00         26.86
raidz2       False     10      2           20     32.00         28.65
raidz3 wide  False     21      1           21     36.00         32.23
stripe       False      0      0           22     44.00         39.39
Pool 2
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      2           20     20.00         17.91
mirror        True      2      2           20     20.00         17.91
mirror3      False      3      1           21     14.00         12.53
mirror3       True      3      1           21     14.00         12.53
raidz1       False      4      2           20     30.00         26.86
raidz2       False     10      2           20     32.00         28.65
raidz3 wide  False     21      1           21     36.00         32.23
stripe       False      0      0           22     44.00         39.39 

Here you can see the first row of the first module identifies the Log slots with 'LZ'. You may also have noticed that the default disk size is now 2TB, as that is the flagship size. If you need to model 1TB configurations, you can still use the 'size' keyword from previous releases. The other command which carried forward is the 'add' command, which allows you to model the change in configurations over time. Here's an example that shows both in use:

./sizecalc-sas.py -s 10.156.83.42 *** size 1T 12/12 add size 2T 12/12
Sun Storage 7000 Size Calculator for SAS disks Version 2010.Q1
  -----------------------------------------
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
  -----------------------------------------
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
  | |-----------------------------------| |
  | |    2   |    2   |    2   |    2   | |
 1| |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  | |-----------------------------------| |
  | |    1   |    1   |    1   |    1   | |
  -----------------------------------------
============================================
Pool 1
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      4           32     15.00         13.43
mirror3      False      3      6           30      9.00          8.06
raidz1       False      4      8           28     18.00         16.12
raidz2       False     11      2           34     27.00         24.17
raidz3 wide  False     11      2           34     24.00         21.49
stripe       False      0      0           36     36.00         32.23
Pool 2
type          NSPF  width spares  data drives  raw (TB)  usable (TiB)
mirror       False      2      4           32     15.00         13.43
mirror3      False      3      6           30      9.00          8.06
raidz1       False      4      8           28     18.00         16.12
raidz2       False     11      2           34     27.00         24.17
raidz3 wide  False     11      2           34     24.00         21.49
stripe       False      0      0           36     36.00         32.23  

In this example, we specify that the first module uses 1TB drives, while the second uses 2TB drives. The double line separates the first configuration visually from the second that we added with the 'add' keyword.

 So after all of that, you should be armed and dangerous... Download the new tool here, and have fun modelling the new system!

Update July 7, 2010 - This is embarrassing...

Thanks to Darius Zanganeh, an Oracle Storage Sales Consultant out of the Denver area, for finding a couple of minor issues in the 2010.Q1 version. I made some last minute changes that I didn't think needed much testing, but it turns out they had a bigger effect than I thought. In the initial 2010.Q1 release the number of data drives was wrong (I feel really silly for not catching that as I wrote the blog entry, which clearly shows the wrong number of data disks), and the use of the 'add' keyword with configurations that would generate different stripe widths would cause the calculator to crash. The revised version, which I call 2010.Q1a, has been uploaded in the same location as the original, and doesn't exhibit these problems.

Update: The size calculator is currently unavailable for download. You can read more here

 EOF

Wednesday Sep 16, 2009

2009.Q3 Update for Size Calculator

With today's release of the 2009.Q3 update for 7000 Series Unified Storage Appliances there are a few enhancements to storage profiles that we can take advantage of, and the trusty size calculator tool was in need of modification to use them, so I've spent some time putting the changes together. The resulting tool can be downloaded here (Update: The size calculator is currently unavailable; you can read more here). Details about what has changed follow below.

The most obvious change from a size calculator perspective with the 2009.Q3 release is that the appliance has some new storage profiles (namely Triple Parity RAID with wide stripes, and Three-Way Mirroring). When run against an updated appliance or simulator, the tool will show these new profiles:

$ ./sizecalc.py 172.16.131.131 *** 12
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      4          284         142.00         127.13
mirror        True      2      4          284         142.00         127.13
mirror3      False      3      6          282          94.00          84.16
mirror3       True      3      6          282          94.00          84.16
raidz1       False      4      4          284         213.00         190.70
raidz1        True      4      4          284         213.00         190.70
raidz2       False     14      8          280         240.00         214.87
raidz2        True     14      8          280         240.00         214.87
raidz2 wide  False     47      6          282         270.00         241.73
raidz2 wide   True     20      8          280         252.00         225.61
raidz3 wide  False     56      8          280         265.00         237.25
raidz3 wide   True     35      8          280         256.00         229.19
stripe       False      0      0          288         288.00         257.84

** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.

In addition to supporting these new storage profiles, I wanted to enhance the calculator to better handle the existing variety of drives we support in the 7210 (the 2009.Q2 release could only model 1TB disks), and to prepare to model configurations with drives larger than 1TB in the 7310 and 7410.

The revised help message explains how to declare different disk sizes in more detail (run sizecalc.py without any arguments to read it), but the examples below should help to highlight the new capability and how it ties in with the previous enhancements. Here is an example modelling a 7210 configuration using 500GB disks; you can see that the size keyword and argument are prepended to the JBOD layout:

$ ./sizecalc.py 172.16.131.131 *** size 500G 1 t1
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      3           42          10.50           9.40
mirror3      False      3      3           42           7.00           6.27
raidz1       False      4      5           40          15.00          13.43
raidz2       False     14      3           42          18.00          16.12
raidz2 wide  False     43      2           43          20.50          18.35
raidz3 wide  False     43      2           43          20.00          17.91
stripe       False      0      0           45          22.50          20.14

** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.

When combined with Eric Schrock's modeling feature, we can specify a different size drive each time we 'add' a JBOD layout (we're not actually selling 2TB drives yet, but the tool will allow us to model that configuration anyway):

./sizecalc.py 172.16.131.131 *** 1 add size 2T 1
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      4           44          33.00          29.54
mirror3      False      3      6           42          21.00          18.80
raidz1       False      4      8           40          45.00          40.29
raidz2       False     11      4           44          54.00          48.35
raidz2 wide  False     23      2           46          63.00          56.40
raidz3 wide  False     23      2           46          60.00          53.72
stripe       False      0      0           48          72.00          64.46

** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.

In this example, we have supplied the configuration "1 add size 2T 1" which tells the calculator to initially model a single JBOD with the default drive size, which is 1TB, and then add a new JBOD with 2TB drives. You can see that the total number of disks is only 48, however the results are based on 72TB of raw capacity.
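
That raw figure is worth a quick sanity check:

# 24 x 1TB plus 24 x 2TB, as specified with the 'size' keyword above.
shelves = [(24, 1), (24, 2)]             # (disks, TB per disk)
print(sum(n * tb for n, tb in shelves))  # 72 -- matches the raw total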

As always, you can find the latest 7000 information in the Fishworks wiki. Happy calculating!

EOF 


Monday Jun 15, 2009

2009.Q2 Update for Size Calculator

After introducing new storage expansion options on the 7210, and shipping a large number of 7000 series Unified Storage Systems, we discovered that the size calculator tool was in need of a couple of improvements.

First, we needed to be able to model 7210 configurations with expansion modules. With the previous software no expansion was possible, so I just maintained a simple table of the shipping configurations, leaving the size calculator to handle the more intricate 7410 configurations. Adapting that approach to the new matrix of potential configurations, however, would have been unwieldy at best.

Second, we wanted to be able to model the growth of systems over time. Customers who bought systems have been coming back and asking us questions like "If I add two more J4400s to my system, how much usable capacity will I have?". Previously, we would have run the calculator with the original configuration, recorded the output, and then run the calculator again with the new configuration and added both capacities together. This was becoming annoying if you were mucking around with configurations to see the impact over time.

So, out of necessity the 2009.Q2 release of the size calculator was born.  Eric Schrock decided to spend a little time to build a modeling feature into the tool to help answer complicated questions, and I built on his modifications to add 7210 support with configuration modeling.  

The revised help message is a little more detailed than this (run sizecalc.py without any arguments to read it), but the examples should highlight the new features clearly. The coolest new feature is Eric's configuration modeling - it allows you to submit a number of configurations together and have the calculator add them together for you. Here is an example of the modeling feature in action:

$ ./sizecalc.py 172.16.131.131 *** 1 h1 add 1 h add 1
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      5           42          21.00          18.80
raidz1       False      4     11           36          27.00          24.17
raidz2       False  10-11      4           43          35.00          31.33
raidz2 wide  False  10-23      3           44          38.00          34.02
stripe       False      0      0           47          47.00          42.08

We have supplied the configuration "1 h1 add 1 h add 1" which tells the size calculator that the initial configuration consists of 1 JBOD which is half full (the h in h1), and has one Log device (the 1 in h1), and that we will then add a second half JBOD (add 1 h), and later, we will add another full JBOD (add 1).  The calculator then computes all three, and adds them together to produce the output shown.  What's interesting is that you can see that the dual parity wide configurations have different widths, but the calculator just summarizes those as '10-23' and adds their contributed capacity to the pool.

When Eric's modeling feature is combined with my 7210 enhancements, we can also model the 7210 like this:

$ ./sizecalc.py 172.16.131.131 *** 1 t1 add 1 t
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width spares  data drives       raw (TB)   usable (TiB)
mirror       False      2      4           89          45.00          40.29
raidz1       False      4      6           87          66.00          59.09
raidz2       False     11      6           87          72.00          64.46
raidz2 wide  False  44-46      4           89          86.00          76.99
stripe       False      0      0           93          94.00          84.16

In this example, we have supplied the configuration "1 t1 add 1 t" which tells the calculator to start with a single 7210 controller containing a single Log device (1 t1) and to add a J4500 expansion module (add 1 t).

As always, you can find the latest 7000 information in the Fishworks wiki. Happy calculating!

Update: The size calculator is currently unavailable. For more detail see here

EOF 

Wednesday Apr 08, 2009

Q&A on Hybrid Storage and SSDs

In a previous post I scratched the surface of how ZFS uses the ZFS Intent Log (ZIL), and how the 7000 Series uses Solid State Disk (SSD) to accelerate its performance. After having presented the Hybrid Storage Pool to more than a hundred customers, I can say that questions around how the 7000 leverages SSDs, and how it handles SSD failure are among the most frequently asked. I hope that I can expand on my previous entry here and explain things in clear detail.  I apologize in advance that my artwork is not nearly what it could be, but I wanted to share the information I have.

Background

Before we can cover the detail of how the file system leverages SSD and handles SSD failure, we need to understand the basic components of ZFS, and how the data flows between them. The ZFS file system is made up of a number of modules and layers. The interfaces that we use to store data run as modules at the top of the stack. In the 7000, we know these as Filesystems and LUNs in the Shares section of the BUI.

Both user level interfaces connect using transactions to a layer called the Data Management Unit (DMU). The DMU manages the storage and retrieval of data independent of its structure (the structure is implemented above by the modules that give us Filesystems and LUNs); it is the coordinator, orchestrating the movement of data between the various components below.

One of the key components it manages is called the Adaptive Replacement Cache (ARC). The ARC is used as a cache for both read and write operations as well as key file system data and metadata. With the exception of the cached copy of the ZIL (more on that later) and the actual write data cache, anything that can live in the ARC can also live in the Level 2 ARC (L2ARC), which is a 'disk'-based extension of the primary cache designed to operate as a second tier in the system storage model. I will cover the L2ARC as it relates to the 7000 in more detail later, but if you're itching for details, check out Brendan's blog entry on it here.

Another component managed by the DMU is the ZIL. As I discussed in my previous entry, the ZIL is the journal that allows the file system to recover from system failures. The ZIL must always exist on non-volatile storage in order to ensure it will be there to recover from. By default, the ZIL is stored inside the storage pool, however it can also be stored on a dedicated disk device called a log device. Regardless of how the system is configured to store the ZIL, it is always cached in system memory while running in order to improve performance.

Below all of the caching tiers is the disk pool itself. It is built from groups of disk devices. In the Hybrid Storage Pool, this is where the data protection happens.

Translating This to the 7000 Series

In the Sun Storage 7000 Series, we use SSD to accelerate some of the components of the storage infrastructure. First, we use Write-Optimized SSDs to store the non-volatile copy of the ZIL. For our use case the devices we are shipping with the system today are capable of about 10,000 operations per second and use a supercapacitor to ensure that the device can stay powered long enough to write all data to the flash chips. Second, we use Read-Optimized SSDs to store the L2ARC. The devices we are shipping with the system today vary in read performance depending on the size of the operation being used, but are somewhere between 16 and 64 times faster than a standard disk device for read operations.

Q: How does data get into the Write Optimized SSD?

A: First, either a filesystem or LUN receives the new data to be written. That module then creates a transaction to add the new data to the currently open transaction group (TXG) in the DMU. As part of the transaction, the data is sent to the ARC while the Write Optimized SSD containing the ZIL is updated to reflect the changes. As new transactions continue to happen, they are logged sequentially to the ZIL.
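
If it helps to see the ordering, here's a toy model of that flow in Python, with lists standing in for the TXG, the ARC, and the SSD-backed ZIL. It illustrates the sequence of events only; it is in no way how ZFS is actually implemented:

# A toy model of the synchronous write path described above.
open_txg = []    # the currently open transaction group
arc = []         # in-memory cache
zil_ssd = []     # the non-volatile log on the write-optimized SSD

def synchronous_write(data):
    tx = {'data': data}
    open_txg.append(tx)    # the transaction joins the open TXG
    arc.append(tx)         # the data is cached in the ARC...
    zil_ssd.append(tx)     # ...while the ZIL is appended sequentially
    return 'ack'           # safe to acknowledge: the ZIL copy is on SSD

synchronous_write('block 0')
synchronous_write('block 1')
print(len(zil_ssd))        # 2 -- both writes logged before acknowledgement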

Q: I've heard that SSDs can "wear out" if you write to them too many times.  How do you prevent that from happening to the Write Optimized SSD?

A: The system treats the SSD as a circular buffer, starting to write at the beginning of the disk, continuing in order until it reaches the end, and then resuming again at the beginning of the disk. This sequential pattern helps to minimize the risk of 'wearing out' the SSD over time. Some people I have explained this to express concern that the system could overwrite data required for recovery in this model; however, the system is very aware of which parts of the disk contain active data and which parts contain inactive data.
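
A toy circular log makes the pattern easier to picture. The key properties are that writes always march forward through the device and wrap at the end, and that a slot is only reused once the record in it is no longer needed for recovery. This is an illustration, not the real ZIL on-disk format:

# A toy circular log -- illustrates the wear-spreading write pattern.
class CircularLog(object):
    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0                      # next slot to write

    def append(self, record):
        current = self.buf[self.head]
        if current is not None and current['live']:
            raise RuntimeError('all slots still hold live records')
        self.buf[self.head] = {'rec': record, 'live': True}
        self.head = (self.head + 1) % len(self.buf)   # wrap at the end

    def txg_commit(self):
        # After a transaction group commit the records become inactive.
        for slot in self.buf:
            if slot is not None:
                slot['live'] = False

log = CircularLog(4)
for i in range(4):
    log.append(i)          # fills slots 0..3 in order
log.txg_commit()           # data now lives in the pool
log.append(4)              # head wrapped to slot 0; safe to overwrite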

Q: So how does the data get from the Write Optimized SSD to disk?

A: Surprisingly, the answer is that it doesn't -- at least not in the way most people think. The trick here is that the ZIL is actually cached in the ARC for performance reasons. So, every few seconds when the system begins a commit cycle for the current transaction group, it reads the copy of the ZIL in memory. This is the point at which the data will be integrated into the pool. If the data requires compression, it will be compressed, and then a checksum for the data is generated. The system decides where the data should live, and then finally the data is synchronized from the ARC to disk.

Q: What happens if a Write Optimized SSD fails?

A: From my previous post:

"If the ZIL is stored on a single SSD, and that device fails, the system has a window to flush the ZIL from memory to disk (the Transaction Group Commit I mentioned earlier). Typically in the 7000 Series, this flush happens every 1-5 seconds, but it can take up to 30 seconds on an extremely busy system. Once the data is flushed from memory to disk, the system will use the disk pool to store the ZIL for the next transaction group. This window is the only time in a 7000 series where there is a chance for data loss. We mitigate this risk by mirroring the Write Optimized SSD's in the system."

Q: How does data get into the Read Optimized SSD?

A: As I mentioned earlier, the Read Optimized SSD is used in the 7000 Series to hold the L2ARC. Since we would prefer to return the most popular data directly from our first level cache in DRAM, we use the L2ARC to hold data that has a history of being useful, but hasn't been accessed as recently or as frequently as other data. As the ARC fills up, the system begins to scan the cache for the data that has been accessed least frequently or recently. After finding enough candidates, it begins to copy those blocks from the ARC to the L2ARC. While this process is happening, the data is still active in the ARC, so if a client requested it, it could still be returned from memory. The process that fills the L2ARC operates in batches, so that it issues a few larger writes rather than many frequent smaller ones, which improves performance.
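
In sketch form, the feed looks something like the code below. The 'coldest first' scoring is a stand-in for the real eviction policy, which is considerably more subtle; the point here is the batching:

# A toy L2ARC feeder: pick the coldest cached blocks and copy them to
# the SSD in one batch write. The scoring is a placeholder.
arc = [{'block': b, 'hits': h}
       for b, h in enumerate([9, 1, 7, 2, 8, 1, 6, 3])]
l2arc = []
BATCH = 3

candidates = sorted(arc, key=lambda e: e['hits'])[:BATCH]
l2arc.extend(candidates)   # one larger write instead of many small ones
# Note the blocks are copied, not moved: they remain live in the ARC,
# so a client request right now would still be served from memory.
print([e['block'] for e in l2arc])   # [1, 5, 3]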

Q: How do you prevent the Read Optimized SSD from "wearing out"?

A: Similar to the ZIL, the system writes to the L2ARC in a circular fashion to reduce the risk of wear over time.

Q: When does the system read from the Read Optimized SSD instead of Memory or Disk? 

A: When the system starts to run out of space in the ARC, it will attempt to evict the data that has been accessed least recently or frequently - the same data we copied to the L2ARC earlier. Once the data has been evicted from the ARC, the lowest latency copy lives in the L2ARC. When the next read request comes for that data, the system will find that the data is no longer available in the ARC, and will check the L2ARC to see if it has a copy. If a copy does exist in the L2ARC, the checksum will be compared to ensure that there has been no corruption, and then the data will be returned at microsecond latencies. If during the checksum comparison the system found that the data had for some reason become corrupt in the L2ARC, it would release that copy of the data and read the correct data from the disk pool.
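
As a toy function, that read path looks something like this. I'm using a CRC as a stand-in for the checksum (ZFS uses its own block checksums), and the structures are mine for illustration:

import zlib

def read_block(key, arc, l2arc, pool):
    if key in arc:
        return arc[key]                        # served from DRAM
    if key in l2arc:
        data, stored = l2arc[key]
        if zlib.crc32(data) == stored:         # verify before returning
            return data                        # SSD hit, microseconds
        del l2arc[key]                         # corrupt copy: discard it
    return pool[key]                           # fall back to the disk pool

data = b'some block'
pool = {'k': data}
l2arc = {'k': (data, zlib.crc32(data))}
print(read_block('k', {}, l2arc, pool))        # served from the L2ARC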

Q: What about the Read Optimized SSD, what happens if it fails?

A: The L2ARC is what we call a clean cache, meaning that all of the data stored in the L2ARC is available somewhere on disk. So if an L2ARC device fails, the system continues to operate, returning read requests that would have been cached by that device directly from disk.

EOF

Wednesday Feb 04, 2009

Enhancing the Size Calculator

Adam Leventhal produced a really useful tool shortly after we launched the 7000 series which, in combination with our Storage VM, allows the user to see the usable capacity of various hardware configurations. You can read Adam's original blog entry here. The tool is fantastic, and I use it all the time, but the usable capacities were in raw TB, as reported by the drive manufacturer, which is not what the 7410 sees... You may have run across this with your own PC when you bought that shiny new 1TB drive and powered up the machine to find your OS asking you to format 931GB.

When I used the tool, I found myself manually converting the usable capacity into binary to find the true usable capacity. With a little effort, I collaborated with Adam to enhance the tool with a new column that gives the true usable capacity in binary and accounts for a small filesystem metadata reserve that the 7410 will hold back. The resulting tool can be downloaded here. Of course, for the 7110 and 7210, you can continue to use the tables posted on the 7000 Series wiki here.

For anyone who cares to know the difference between a base 10 TB and a base 2 TB (labeled in the updated tool as TiB):

One TB (as described by disk manufacturers) is 1000 to the 4th power, or 1,000,000,000,000 bytes
One TiB is 1024 to the 4th power, or 1,099,511,627,776 bytes

Update: The size calculator is currently unavailable. For more detail see here

EOF


Saturday Jan 10, 2009

ZIL, SSD, and Other Fun Acronyms

The ZFS Intent Log (or ZIL) is always written to non-volatile storage. The ZIL allows the file system to recover from crashes without data loss. In a 7000 Series system with Write Optimized SSD, the ZIL is stored on the Write Optimized SSD; otherwise it is stored in the disk pool. Either way, it is also available in system memory. The ZIL flushes to the disk pool every once in a while (this is called a Transaction Group Commit).

In a 7410 cluster, if a failover occurs under normal conditions the pool is imported by the alternate node, the ZIL is replayed against the pool, and the pool is online and ready. You can think of the Write Optimized SSD and ZIL as our NVRAM if that helps, but we don't need batteries.

If the ZIL is stored on a single SSD, and that device fails, the system has a window to flush the ZIL from memory to disk (the Transaction Group Commit I mentioned earlier). Typically in the 7000 Series, this flush happens every 1-5 seconds, but it can take up to 30 seconds on an extremely busy system. Once the data is flushed from memory to disk, the system will use the disk pool to store the ZIL for the next transaction group. This window is the only time in a 7000 series where there is a chance for data loss. We mitigate this risk by mirroring the Write Optimized SSDs in the system.

ZFS performance on asynchronous writes is good, and SSD is not required in those configurations (although it will help improve performance and is recommended); however, in configurations that require synchronous writes (many iSCSI configurations, NFS with O_DSYNC, etc.) Write SSD is almost mandatory.

Write SSD Sizing Rules of Thumb (a quick sketch follows the list):

- Each device supports about 9,000-10,000 Write IOPS (sequential writes stream directly to disk for better performance)
- If devices are mirrored, they only count for 1x Write IOPS (i.e., two devices at 9,000 IOPS each, mirrored together, support 9,000 IOPS total)
- When aiming for No Single Point of Failure configurations, more trays with fewer SSDs per tray will yield higher usable capacities. Clusters only allow SSDs in pairs.
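
Those rules of thumb reduce to a few lines of arithmetic. Here's a sketch using the conservative end of the IOPS range:

import math

PER_DEVICE_IOPS = 9000     # conservative end of the 9,000-10,000 range

def write_ssds_needed(target_iops, mirrored=True):
    # Mirrored pairs deliver the IOPS of a single device.
    devices = int(math.ceil(float(target_iops) / PER_DEVICE_IOPS))
    return devices * 2 if mirrored else devices

print(write_ssds_needed(25000))         # 6 -- three mirrored pairs
print(write_ssds_needed(25000, False))  # 3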

 EOF

About

This is the weblog for Ryan Matthews, a sales consultant at Oracle specializing in the ZFS Storage Appliance. It is the home to information on sizing and much more.
