Wednesday Oct 01, 2008

Hey! Where did my snapshots go? Ahh, new feature...

I've been running Solaris NV b99 for a week or so.  I've also been experimenting with the new automatic snapshot tool, which should arrive in b100 soon. To see which snapshots have been taken, you can use the zfs list subcommand.

# zfs list
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool                   3.63G    12.0G    57.5K   /rpool
rpool@install             17K        -    57.5K   -


This is typical output and shows that my rpool (root pool) file system has a snapshot which was taken at install time: rpool@install

But in NV b99, the snapshots are suddenly no longer listed by default. In general, this is a good thing, because there may be thousands of snapshots and the full listing becomes too long for humans to digest. But what if you really do want to see the snapshots?  A new flag has been added to the zfs list subcommand which will show the snapshots.

# zfs list -t snapshot
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool@install             17K        -    57.5K   -


This should clear things up a bit and make it easier to manage large numbers of snapshots when using the CLI. If you want to see more details on this change, see the heads-up notice for PSARC 2008/469.
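
If you want file systems and snapshots together in a single listing, the -t option takes a comma-separated list of types. There is also a pool-level property that restores the old default behavior; my recollection is that it is called listsnapshots (check zpool(1M) on your build), so roughly:

# zfs list -t filesystem,snapshot
# zpool set listsnapshots=on rpool
# zpool get listsnapshots rpool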

Wednesday Apr 09, 2008

RAS in the T5140 and T5240

Today, Sun introduced two new CMT servers, the Sun SPARC Enterprise T5140 and T5240 servers.

I'm really excited about this next stage of server development. Not only have we effectively doubled the performance capacity of the system, we did so without significantly decreasing the reliability. When we try to predict the reliability of products that are still being designed, we base those predictions on previous-generation systems. At Sun, we make these predictions at the component level. Over the years we have collected detailed failure rate data for a large variety of electronic components as used in the environments often found at our customer sites. We use these component failure rates to determine the failure rate of collections of components. For example, a motherboard may have more than 2,000 components: capacitors, resistors, integrated circuits, etc.

The key to improving motherboard reliability is, quite simply, to reduce the number of components. There is a practical limit, though: we could remove many of the capacitors, but that would compromise signal integrity and performance, which is not a good trade-off. The big difference in the open source UltraSPARC T2 and UltraSPARC T2 Plus processors is the high level of integration onto the chip. They really are systems on a chip, which means that we need very few additional components to complete a server design. Fewer components means better reliability, a win-win situation.

On average, the T5140 and T5240 only add about 12% more components over the T5120 and T5220 designs. But considering that you get two or four times as many disks, twice as many DIMM slots, and twice the computing power, this is a very reasonable trade-off.

Let's take a look at the system block diagram to see where all of the major components live.

You will notice that the two PCI-e switches are peers and not cascaded. This allows good flexibility and fault isolation. Compared to the cascaded switches in the T5120 and T5220 servers, this is a simpler design. Simple is good for RAS.

You will also notice that we use the same LSI1068E SAS/SATA controller with onboard RAID. The T5140 is limited to 4 disk bays, but the T5240 can accommodate 16 disk bays. This gives plenty of disk targets for implementing a number of different RAID schemes. I recommend at least some redundancy, dual parity if possible.
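
As a sketch of what a dual-parity layout might look like on one of these boxes (the device names below are hypothetical; substitute your own from format or cfgadm):

# zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 spare c1t6d0
# zpool status tank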

Some people have commented that the Neptune Ethernet chip, which provides dual 10 Gb Ethernet or quad 1 Gb Ethernet interfaces, is a single point of failure. There is also one quad GbE PHY chip. The reason the Neptune is there to begin with is that when we implemented the coherency links in the UltraSPARC T2 Plus processor, we had to sacrifice the built-in Neptune interface which is available in the UltraSPARC T2 processor. Moore's Law assures us that this is a somewhat temporary condition and soon we'll be able to cram even more transistors onto a chip. This is a case where high integration is apparent in the packaging. Even though all four GbE ports connect to a single package, the electronics inside the package are still isolated. In other words, we don't consider the PHY to be a single point of failure because the failure modes do not cross the isolation boundaries. Of course, if your Ethernet gets struck by lightning, there may be a lot of damage to the server, so there is always the possibility that a single event will create massive damage. But for the more common cabling problems, the system offers suitable isolation. If you are really paranoid about this, then you can purchase a PCI-e card version of the Neptune and put it in PCI-e slot 1, 2, or 3 to ensure that it uses the other PCI-e switch.

The ILOM service processor is the same as we use in most of our other small servers and has been a very reliable part of our systems. It is connected to the rest of the system through an FPGA which manages all of the service bus connections. This allows the service processor to be the serviceability interface for the entire server.

The server also uses ECC FB-DIMMs with Extended ECC, which is another common theme in Sun servers. We have recently been studying the effects of the Solaris Fault Management Architecture and Extended ECC on systems in the field, and I am happy to report that this combination provides much better system resiliency than either feature does alone. In RAS, the whole can be much better than the sum of the parts.

For more information on the RAS features of the new T5140 and T5240 servers, see the white paper, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140 and T5240 Servers. The white paper has results of our RAS benchmarks as well as some performability calculations.

Monday Jul 30, 2007

Solaris Cluster Express is now available

As you have probably already heard, we have begun to release Solaris Cluster source at the OpenSolaris website. Now we are also releasing a binary version for use with Solaris Express. You can download the bits from the download center.

Share and enjoy!

Monday Jul 16, 2007

San Diego OpenSolaris User's Group meeting this week

Meeting Time: Wednesday, July 18 at 6:00pm
Sun Microsystems
9515 Towne Centre Drive
San Diego, CA 92121
Building SAN10 - 2nd floor - Gas Lamp Conference Room
Map to Sun San Diego

On July 18, Ryan Scott will be presenting "Building and Deploying OpenSolaris." Ryan will demonstrate how to download the OpenSolaris source code, how to make source code changes, and how to build and install these changes.

Ryan is a kernel engineer in the Solaris Core Technology Group, working on implementing Solaris on the Xen Hypervisor. In previous work at Sun, he worked on Predictive Self Healing, for which he received a Chairman's Award for Innovation. He has also worked on error recovery and SPARC platform sustaining. Ryan joined Sun in 2001 after receiving a BSE in Computer Engineering from Purdue University.

Information about past meetings is here.

Monday Jun 25, 2007

Brain surgery: /usr/ccs tumor removed

Sometimes it just takes way too long to remove brain damage. Way back when Solaris 2.0 was forming, someone had the bright idea to move the  C compilation system from /usr to /usr/ccs. I suppose the idea was that since you no longer had to compile the kernel, the C compiler no longer needed to be in the default user environment.  I think the same gremlin also removed /usr/games entirely, another long-time staple.  This move also coincided with the "planetization of Sun" idea, so the compilers were split off to become their own profit and loss center.  IMHO, this is the single biggest reason gcc ever got any traction beyond VAXen.  But I digress...

No matter what the real reasons were, and who was responsible (this was long before I started working at Sun), I am pleased to see that /usr/ccs is being removed. I've long been an advocate of the thought that useful applications should be in the default user environment. We should never expose our company organization structure in products, especially since we're apt to reorganize far more often than products change. IMHO, the /usr/ccs fiasco was exposing our customers to pain because of our organizational structure.  Brain damage. Cancer. A bad thing.

I performed a study of field installed software a few years ago. It seems that Sun makes all sorts of software which nobody knows about because it is not installed by default, or is not installed into the default user environment. I'm very happy to see all of the positive activity in the OpenSolaris community to rectify this situation and make Solaris a better out of the box experience.  We still have more work to do, but removing the cancerous brain damage that was /usr/ccs is a very good sign that we are moving in the right direction.

Monday Apr 23, 2007

Mainframe inspired RAS features in new SPARC Enterprise Servers

My colleague, Gary Combs, put together a podcast describing the new RAS features found in the Sun SPARC Enterprise Servers. The M4000, M5000, M8000, and M9000 servers have very advanced RAS features, which put them head and shoulders above the competition. Here is my list of favorites, in no particular order:

  1. Memory mirroring. This is like RAID-1 for main memory. As I've said many times, there are 4 types of components which tend to break most often: disks, DIMMs (memory), fans, and power supplies. Memory mirroring brings the fully redundant reliability techniques often used for disks, fans, and power supplies to DIMMs.
  2. Extended ECC for main memory.  Full chip failures on a DIMM can be tolerated.
  3. Instruction retry. The processor can detect faulty operation and retry instructions. This feature has been available on mainframes, and is now available for the general purpose computing markets.
  4. Improved data path protection. Many improvements here, along the entire data path.  ECC protection is provided for all of the on-processor memory.
  5. Reduced part count from the older generation Sun Fire E25K.  Better integration allows us to do more with fewer parts while simultaneously improving the error detection and correction capabilities of the subsystems.
  6. Open-source Solaris Fault Management Architecture (FMA) integration. This allows system administrators to see what faults the system has detected, and the system will automatically heal itself.
  7. Enhanced dynamic reconfiguration.  Dynamic reconfiguration can be done at the processor, DIMM (bank), and PCI-E (pairs) level of granularity.
  8. Solaris Cluster support.  Of course Solaris Cluster is supported, including clustering between Solaris Containers, dynamic system domains, or chassis.
  9. Comprehensive service processor. The service processor monitors the health of the system and controls system operation and reconfiguration. This is the most advanced service processor we've developed. Another welcome feature is the ability to delegate responsibilities to different system administrators with restrictions so that they cannot control the entire chassis.  This will be greatly appreciated in large organizations where multiple groups need computing resources.
  10. Dual power grid. You can connect the power supplies to two different power grids. Many people do not have the luxury of access to two different power grids, but those who have been bitten by a grid outage will really appreciate this feature.  Think of this as RAID-1 for your power source.

I don't think you'll see anything revolutionary in my favorites list. This is due to the continuous improvements in the RAS technologies.  The older Sun Fire servers were already very reliable, and it is hard to create a revolutionary change for mature technologies.  We have goals to make every generation better, and we've made many advances with this new generation.  If the RAS guys do their job right, you won't notice it - things will just keep working.

Tuesday Feb 13, 2007

Solaris Express Developer Edition solves "support" problem

Last month, I blogged about open source Solaris Nevada build 55. I normally don't single out specific Nevada builds as worthy of a blog entry -- not because I think they are unworthy, but because they come out every two weeks which gets to be a lot of regular work to blog about.

So, why is build 55 so special? It is the first build which has become known as the Solaris Express Developer Edition. This is a milestone release because it is available with a support contract. Many people don't care about support, and you will often hear them complaining about the fact that Solaris 10 doesn't have some version of an application which was just put up on the web last week. The quality process involved in producing, distributing, and supporting a large software product, such as Solaris, takes time to crank through. If you really want the latest, greatest, and riskiest version of Solaris, then you need to be on a Solaris Express release. The problem is that we couldn't provide a support contract for Solaris Express. That problem is now solved. Those who demand support and want to be closer to the leading edge can now get both in the Solaris Express Developer Edition.

Go ahead, give it a try. Then hang out with the Solaris community at the website, where there are always interesting discussions going on.


Wednesday Feb 07, 2007

New white paper on Solaris Cluster and Oracle RAC

My colleagues Tim Read, Gia-Khan Nguyen, and Bob Bart have recently released an excellent white paper which cuts through the fog surrounding the complementary functions of Oracle Clusterware, RAC, and Solaris Cluster. The paper is titled Sun™ Cluster 3.2 Software: Making Oracle Database 10g R2 RAC even More “Unbreakable.”

We have people ask us why they should use Solaris Cluster when Oracle says they could just use Clusterware. Daily. Sometimes several times per day. The answer is not always clear, and we view the products as complementary rather than exclusionary. This white paper shows why you should consider both products, which work in concert to provide very highly available services. I would add that Sun and Oracle work very closely together to make sure that the Solaris platform running Oracle RAC is the best Oracle platform in the world. There is an incredible team of experts working together to make this happen.

Kudos to Tim, Gia-Khan, and Bob for making this readily available for Sun and Oracle customers. 


Wednesday Jan 31, 2007

ZFS RAID recommendations: space, performance, and MTTDL all-in

Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using.  Here it is:

All-in graph 

This shows the data I've previously blogged about, all on one scale. You can easily see that for MTTDL, double parity protection is better than single parity protection, which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for MTTDL and small, random read IOPS. I call this the "all-in" slide because, in a sense, it puts everything in one pot.
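
For reference, the simple first-order models behind curves like these usually look something like the following, per set of N disks with per-disk mean time between failures MTBF and repair time MTTR. This is the textbook approximation, ignoring spares and unrecoverable reads, and not necessarily the exact model behind my graph:

\[ \mathrm{MTTDL}_{\mathrm{stripe}} \approx \frac{\mathrm{MTBF}}{N} \qquad \mathrm{MTTDL}_{\mathrm{raidz}} \approx \frac{\mathrm{MTBF}^2}{N(N-1)\,\mathrm{MTTR}} \qquad \mathrm{MTTDL}_{\mathrm{raidz2}} \approx \frac{\mathrm{MTBF}^3}{N(N-1)(N-2)\,\mathrm{MTTR}^2} \]

Each extra level of parity multiplies MTTDL by roughly MTBF/(N x MTTR), which is why the parity levels separate so cleanly on the graph.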

While this sort of analysis is useful, the difficulty is that the problem has more dimensions than one graph can show. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph.  I'll try my best...

Thursday Jan 18, 2007

Solaris Cluster (nee Sun Cluster)

Just a quick note, to those who might get confused.  It seems that marketing has decided to rename the technologies formerly known as "Sun Cluster" to "Solaris Cluster."  Old habits die hard, so forgive me if I occasionally use the former name.

Branding is very important and, as you've probably seen over the years, not what I would consider to be Sun's greatest strength. But in the long run, I think this is a good change.

Friday Jan 12, 2007

ZFS RAID recommendations: space vs U_MTBSI

Some people get all wrapped around the axle worrying about disk controllers.  These same people often criticize innovative products like the Sun Fire X4500 (aka Thumper) server because it only has 6 SATA controllers for 48 disk drives (8 disks/controller). In this blog, I'll take a look at the various possible RAID configurations for an X4500 and see how they affect the Unscheduled Mean Time Between System Interruptions (U_MTBSI).

Get off the bus

I think this overabundance of worry originated years ago, when controllers were very expensive and the available I/O slots for a computer were quite limited. If you look at the old-school technologies such as parallel SCSI or IDE interfaces, which were buses, then the concerns were valid. Indeed, if you look at any buses, you'd see many of the same opportunities for problems: S100, VME, DBus, etc. What we fear the most about a parallel bus is that some device will grab hold of it and not let go, thus wedging everything on the bus. For storage devices, this could mean that a single hardware fault could disrupt access to data and its redundancy, causing an outage. We will often describe this as a single fault zone. The diversity concept encourages distributing the redundant components across different fault zones. The pocketbook concept places the ultimate limits on how many components are available, though.

Another place where bus designs limit our choices is in performance. In general, only one device can be talking on a bus at any given time. So, if you have a bunch of devices sharing a bus, then you could have a performance bottleneck to go with your single fault zone. The obvious way to avoid this is to not use buses. (I usually take this opportunity to dis' fibre channel, but I'll spare you this time :-)  Today, we have many opportunities to replace buses with point-to-point technologies: parallel SCSI replaced by serial attached SCSI (SAS), IDE/ATA replaced by Serial ATA (SATA), DBus replaced by Safari, front side bus (FSB) replaced by HyperTransport, Ethernet hubs replaced by Ethernet switches, etc.

From a RAS perspective, point-to-point technologies are very cool, largely because there are more fault zones, one per pair, but any single fault zone only affects one pair. This has numerous advantages because we can build highly reliable protocols into the links and not have to deal with sharing. Basically, if I have only two devices, then I can construct the link such that each device logically has one transmit and one receive interface.  Simple. Simple is good. Simple allows us to do things like automatically knowing when a device is on the other end of the link (we hear something), as opposed to a shared bus, where you have to place a request for an address on the bus and hope that a device responds, which it can't if the bus is wedged by some other failed device. In other parts of the system, going point-to-point has allowed even more RAS improvements. For example, we use point-to-point connections between the system boards in a Sun Fire E25K server, which is why we can implement dynamic reconfiguration.

We also gain from Moore's Law. A curious thing happens when you integrate more functions onto a single chip -- the failure rate tends to remain more or less constant. In other words, if you take the functions performed by 4 different chips and integrate them onto one chip, then you get a 4:1 parts count reduction, the per-part failure rate stays roughly the same, so you get a 4x increase in reliability. Putting this together, consider replacing 4 parallel SCSI buses, each with a single controller and drive (4 fault zones), with a single 4-port SAS controller. At first glance you'd say that you replaced 4 fault zones with one fault zone. But in order to analyze such a system, you must take into account the reliability of the components. The single SAS controller will have approximately the same failure rate as each of the parallel SCSI controllers, so we get a 4x increase in controller reliability. Now, the answer to the question of which is better isn't so clear. And the math can get rather complicated. Naturally, we developed a tool to perform such analyses, RASCAD.
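
To put rough numbers on that reasoning (illustrative only, not measured field data): if each controller chip has a constant failure rate \(\lambda\), four separate controllers contribute about \(4\lambda\) to the parts-count failure rate, while one integrated 4-port controller at roughly the same per-chip rate contributes only \(\lambda\). In MTBF terms, the integrated design is about 4x better than the aggregate of the four discrete controllers:

\[ \lambda_{\mathrm{discrete}} \approx 4\lambda, \quad \lambda_{\mathrm{integrated}} \approx \lambda \quad \Rightarrow \quad \mathrm{MTBF}_{\mathrm{integrated}} \approx 4 \times \mathrm{MTBF}_{\mathrm{discrete}} \]

Whether that nets out to a win depends on what a single fault can take down, which is exactly the kind of trade-off RASCAD was built to evaluate.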


RASCAD is an industry-leading tool we developed and use at Sun to easily answer design questions regarding reliability, availability, and serviceability (RAS) of complex systems. For the computationally intrigued, we build hierarchical Markov models. Very cool.

At Sun, we evaluate systems for their Mean Time Between System Interruptions (MTBSI). Very simply, a system interruption occurs when a component failure causes the system to reboot. This includes service events to repair the component. For example, if a component fails, the system reboots in a degraded state, and repairing the component requires another reboot, then the failure causes two interruptions (e.g. CPUs). Obviously, some components can often be repaired without causing a second service outage (e.g. disks). MTBSI does include a serviceability component, which can be mitigated using planned outages and other processes. So, we often will try to stick with the unscheduled outages, U_MTBSI, which is an indicator of pain. Pain is not good, so we try to increase the time between painful events, U_MTBSI, whenever possible.

Analysis of X4500 and ZFS U_MTBSI

Previously, I discussed space versus MTTDL for an X4500. Given the 6 SATA controller configuration of the 48 available disks, how is the U_MTBSI affected by the various possible RAID configurations in ZFS? A dandy question. I took the same configurations and computed the U_MTBSI under the condition that I would strive for the best possible controller diversity. This gets a little tricky when you have only 6 controllers. For example, if you have a 7-disk raidz set, then at least two of the disks will share a controller. If you had a 6-disk raidz set, then you could place each drive in the set on a different controller. For the X4500, this gets a little more difficult because the two drives that you can boot from are on the same controller (a BIOS limitation?). Also, what about spares? What happens when a spare shares the same controller as a data disk? Should I mirror adjacent or in opposition? Will the sun rise again tomorrow? Anyway, you can see how you can easily begin to get wrapped around the axle worrying about this stuff. Let's see what the analysis shows.

Plot of space vs U_MTBSI

The first thing you notice is that the same statement I made last time still applies - friends don't let friends use RAID-0!

The second thing you'll notice is that the RAID types are clumping again. But the clumping is not as diverse in the U_MTBSI axis as you might expect. This is because the reliability of the controller is much higher than the reliability of a single disk.  If you think about it, you are worried about the reliability of one chip versus a pile of chips, motors, amplifiers, heads, media, connectors, wires, and other stuff.  Since the reliability of the controller is much higher than the reliability of the disks (controller MTBF >> disk MTBF), the disk reliability dominates. This is also why we do see a significant difference between the RAID types used.

The third thing you'll notice is that once again I haven't labeled the U_MTBSI axis. If I had labeled it, then it would be a rat hole opportunity with a high probability of entrance.  In this case, all of the components are identical, the only change is the RAID configuration. So you could even consider the results normalized to some value and gain the same insights.

The explanation I'll offer for why RAID-Z (RAID-5) is worse than mirroring (RAID-1) or double parity RAID-Z2 (RAID-6) is that the probability of two disks failing remains the same, but the probability that those two disk failures cause an interruption is very different. I think this clearly shows what David Patterson alluded to in a presentation he gave at the Sun Technical Conference once: single parity RAID-5 just doesn't give enough protection; double parity (e.g. RAID-Z2) would have been a better choice. He also mentioned that people hassle him because they've lost data with RAID-5. Needless to say, I'm not a big fan of RAID-5.
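
A back-of-the-envelope way to see it (a sketch, not the RASCAD model): suppose one disk in a pool of N is already down and a second disk fails during the repair window. A single raidz set is interrupted by any second failure, a pool of 2-way mirrors is interrupted only if the second failure lands on the partner of the first (roughly a 1-in-(N-1) chance), and raidz2 keeps going until a third failure:

\[ P(\mathrm{interruption} \mid 2\ \mathrm{failures}) \approx \begin{cases} 1 & \text{single raidz set} \\ \frac{1}{N-1} & \text{2-way mirrors} \\ 0 & \text{raidz2 (a third failure is needed)} \end{cases} \]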

You can see that controller diversity doesn't make as big an impact on the U_MTBSI as the RAID type. For example, the U_MTBSI for raidz with 5+1 (one column per controller) is not that much different than for 6+1 (more than one column per controller). Similarly, the use of hot spares doesn't seem to make a big difference. You can more easily see the advantage of hot sparing when you look at MTTDL or Mean Time Between Service (MTBS, more on that later...)
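
To make "one column per controller" concrete, here is a hypothetical sketch of laying out 5+1 raidz sets across the six controllers of an X4500 (the cXtYdZ names below are illustrative; map your real devices with format or cfgadm):

# zpool create tank raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0
# zpool add tank raidz c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0

Each six-disk set touches each controller exactly once; a 6+1 set, by contrast, has to double up on some controller. As the analysis shows, that diversity is a nice property to have, but not one worth losing sleep over.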

I hope that this view of the trade-off between RAS and space will help you make better design decisions. There are other trade-offs to be considered as well, so I encourage you to look at all of the requirements and see how they can be satisfied. And don't get all wrapped around the axle worried about SAS/SATA controller diversity. Be diverse if you can, but don't worry too much if you can't.

Monday Jan 01, 2007

Nevada build 55 -- nice!

Over the holiday break I installed Solaris Nevada build 55 on a few machines.  This is an important build and I think you'll see lots of people talking and blogging about it.  Here are a few reasons why I think it is cool:

  • StarOffice 8 is now integrated. Previous Nevada builds had StarOffice 7, which meant that I had to install and patch StarOffice 8 separately.  This significantly reduces the amount of time I needed to get my desktop up to snuff.
  • Studio 11 is now integrated.  This is another time saver.
  • NetBeans 5.5 is also integrated.  I use NetBeans quite a bit for Java programming. I fell in love with refactoring and the debugging environment.  I also do quite a bit of GUI work, and NetBeans has saved me many hundreds of hours of drudgery.
  • NVidia 3-D drivers are integrated.  Still more installation time savings.

I do install every other build of Nevada, and these changes have significantly reduced the time it takes for me to go from install boot to being ready as my primary desktop.  More importantly, it shows that once we can get past some of the (internal political) roadblocks, we can create a richly featured, easy-to-install, open source Solaris environment that showcases some of the best of Sun's technologies.

BTW, one of the cool things I've been playing with over the break is the integration of StarOffice with NetBeans. There are NetBeans wizards which you can use to create add-ons for StarOffice. Check out how easy it is to add a function to StarOffice Calc!



