Turning the corner

It's a little hard to believe that it's been only fifteen months since we shipped our first product. It's been a hell of a ride; there is nothing as exhilarating nor as exhausting as having a newly developed product that is both intricate and wildly popular. Especially in the domain of enterprise storage -- where perfection is not just the standard but (entirely reasonably) the expectation -- this makes for some seriously spiked punch.

For my own part, I have had my head down for the last six months as the Technical Lead for our latest software release, 2010.Q1, which is publicly available as of today. In my experience, I have found that in software (if not in life), one may only ever pick two of quality, features and schedule -- and for 2010.Q1, we very much picked quality and features. (As for schedule, let it be only said that this release was once known as "2009.Q4"...)

2010.Q1 Quality

You don't often see enterprise storage vendors touting quality improvements for a very simple reason: if the product was perfect when you sold it to me, why are you talking about how much you've improved it? So I'm going to break a little bit with established tradition and acknowledge that the product has not been perfect, though not without good reason. With our initial development of the product, we were pushing many new technologies very aggressively: not only did we seek to build enterprise-grade storage on commodity components (a deceptively daunting challenge in its own right), we were also building on entirely new elements like flash -- and then topped it all off with an ambitious, from-scratch management stack. What were we possibly thinking by making so many bets at once? We made these bets not out of recklessness, but rather because they were essential elements of our Big Bet: that customers were sick of paying monopoly rents for enterprise storage, and that we could deliver a quantum leap in price-performance. (And if nothing else, let it be said that we got that one very, very right -- seemingly too right, at times.) As for the specific technology bets, some have proven to be unblemished winners, while others have been more of a struggle. Sometimes the struggle was because the problem was hard, sometimes it was because the software was immature, and sometimes it was because a component that was assumed to have known failure modes had several (or many) unanticipated (or byzantine) failure modes. And in the worst cases, of course, it was all three...

I'm pleased to report that in 2010.Q1, we turned the corner on all fronts: in addition to just fixing a boatload of bugs in key areas like clustering and networking, we engaged in fundamental work like Dave's rearchitecture of remote replication, adapted to new device failure modes as with Greg's rearchitecture around resilience to HBA logic failure, and -- perhaps most importantly -- integrated critical firmware upgrades to each of the essential components of the I/O path (HBAs, SIM cards and disks). Also in 2010.Q1, we changed the way that we run the evaluation of the software, opening the door to many in our rapidly growing customer base. As a result, this release is already running on more customer production systems than any of its predecessors were at the time that they shipped -- and on many more eval and production machines within our own walls.

2010.Q1 Features

But as important as quality is to this release, it's not the full story: the release is also packed with major features like deduplication, iSER/SRP support, Kerberized NFS support and Fibre Channel support. Of these, the last is of particular interest to me because, in addition to my role as the Technical Lead for 2010.Q1, I was also responsible for the integration of FC support into the product. There was a lot of hard work here, but much of it was borne by John Forte and his COMSTAR team, who did a terrific job not only on the SCSI Target Management facility (STMF) but also on the base ALUA support necessary to allow proper FC operation in a cluster. As for my role, it was fun to cut the code to make all of this stuff work. Thanks to some great design work by Todd Patrick, along with some helpful feedback from field-facing colleagues like Ryan Matthews, I think we came up with a clean, functional interface. And working closely with both John and our test team, we have developed a rock-solid FC product. But of course (and as one might imagine), for me personally, the really gratifying bit was adding FC support to analytics. With just a pinch of DTrace and a bit of glue code, we now have visibility into FC operations by LUN, by project, by target, by initiator, by operation, by SCSI command, by size, by offset and by latency -- and by any combination thereof.
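To make "by any combination thereof" concrete, here is a minimal sketch of the idea in Python. It is not the appliance's DTrace or glue code; it only illustrates counting operations keyed by an arbitrary tuple of dimensions. The record layout, the field names and the port alias names are all hypothetical.

```python
from collections import Counter

def breakdown(ops, *dimensions):
    """Count operations grouped by any combination of dimensions.

    `ops` is an iterable of dicts (a hypothetical record format);
    `dimensions` names the fields to group by, in order.
    """
    counts = Counter()
    for op in ops:
        # The key is the tuple of the requested fields, so any
        # combination of dimensions yields its own breakdown.
        counts[tuple(op[d] for d in dimensions)] += 1
    return counts

# Hypothetical sample records; the alias names are made up for illustration.
sample_ops = [
    {"lun": "thicktail-bench", "initiator": "port-a", "op": "write"},
    {"lun": "thicktail-bench", "initiator": "port-b", "op": "write"},
    {"lun": "thicktail-bench", "initiator": "port-a", "op": "read"},
    {"lun": "thicktail-bench", "initiator": "port-a", "op": "write"},
]

by_initiator = breakdown(sample_ops, "initiator")
by_initiator_and_op = breakdown(sample_ops, "initiator", "op")
```

Grouping by a tuple means that adding a dimension only changes the key, not the counting logic.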

As I was developing FC analytics, I would use as my source of load a silly disk benchmark I wrote back in the day when Adam and I were evaluating SSDs. Here for example, is that benchmark running against a LUN that I named "thicktail-bench":


The initiator here is the machine "thicktail"; it's interesting to break down by initiator and see the paths by which thicktail is accessing the LUN:


(These names are human readable because I have added aliases for each of thicktail's two HBA ports. Had I not added those aliases, we would see WWNs here.) The above shows us that thicktail is accessing the LUN through both of its paths, which is what we would expect (but good to visually confirm). Let's see how it's accessing the LUN in terms of operations:


Nothing too surprising here -- this is the write phase of the benchmark and we have no log devices on this system, so we fully expect this. But let's break down by offset:


The first time I saw this, I was surprised. Not because of what it shows -- I wrote this benchmark, and I know what it does -- but rather because it was so eye-popping to really see its behavior for the first time. In particular, this captures an odd phase I added to this benchmark: it does random writes across an increasingly large range. I did this because we had discovered that some SSDs did fine when the writes were confined to a small logical region, but broke down -- badly -- when the writes were over a larger region. And no, I don't know why this was the case (presumably the firmware was in fragmented/wear-leveling/cache-busting hell); all I know is that we rejected any further exploration once the writes to the SSD were of a higher latency than that of my first hard drive: the IBM PC XT's 10 MB ST-412, which had roughly 85 ms writes! (We felt that expecting an SSD to have better write latency than a hard drive from the first Reagan Administration was tough but fair...)
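For the curious, the random-write phase described above can be sketched roughly as follows. This is not the actual benchmark; it is a minimal Python illustration that only generates the write offsets, and it assumes the range widens linearly per phase. The function name and every parameter are hypothetical.

```python
import random

def widening_random_offsets(device_size, block_size, phases,
                            writes_per_phase, seed=0):
    """Yield (phase, offset) pairs for random writes over a growing range.

    Assumption: the range grows linearly, so phase k of n covers the
    first k/n of the device, and the last phase covers all of it.
    """
    rng = random.Random(seed)
    total_blocks = device_size // block_size
    for phase in range(1, phases + 1):
        # Number of blocks eligible for writes in this phase.
        span_blocks = max(1, total_blocks * phase // phases)
        for _ in range(writes_per_phase):
            # Block-aligned offset chosen uniformly within the current span.
            yield phase, rng.randrange(span_blocks) * block_size

# Example: a 1 MB region, 4 KB blocks, 4 phases of 100 writes each.
offsets = list(widening_random_offsets(1 << 20, 4096, 4, 100))
```

A real run would issue a synchronous write at each offset and record its latency. Plotting latency against phase then shows where a device stops coping with the widening range.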

What now?

As part of our ongoing maturity as a product, we have developed a new role here at Fishworks: starting in 2010.Q1, the Technical Lead for the release will, as the release ships, transition to become the full-time Support Lead for that release in the field. This means many things for the way we support the product, but for our customers, it means that if and when you do have an issue on 2010.Q1, you should know that the buck on your support call will ultimately stop with me. We are establishing an unprecedented level of engineering integration with our support teams, and we believe that it will show in the support experience. So welcome to 2010.Q1 -- and happy upgrading!

Comments:

Glad to hear something from the core OS guys at Sun. :-) All the best in your new home.

Posted by Derek on March 10, 2010 at 07:29 AM PST #

So when can we actually get our hands on the 2010.Q1 release? It's still not on the download sites. The release notes have been out for a week plus to taunt us, but no bits.

Posted by John on March 10, 2010 at 07:53 AM PST #

Bryan, As a fellow hacker, I know how tempting it is to work on the next set of new ideas and features. I admire that you are willing to put yourself on the line and take responsibility for the product release by being directly responsible for supporting it. It is the persistence of a team to succeed that makes a successful product. Good luck.

Posted by JAF on March 10, 2010 at 08:36 AM PST #

Sorry for all the teasing John -- it's there now. Derek: yes, we're here -- and busier than ever. And JAF: totally agree about persistence winning the day. Indeed, in my experience, persistence is the single greatest determinant of success in a software engineer -- though I still haven't really figured out how to interview for it...

Posted by Bryan Cantrill on March 10, 2010 at 08:55 AM PST #

We are very glad that this release is out and will be upgrading our systems soon.

The creation of the full-time Support Lead position is also very exciting as well as all the beta-testing work that was done with customers. Congrats for another release.

Posted by Giovanni on March 10, 2010 at 09:45 AM PST #

I understand wanting to release new features, however it would be really nice if Sun actually tried to address issues in cases that have been open since June 2009.
And furthermore the support that I have encountered has not been in the least bit helpful, especially considering we are paying for "Gold" level support.

Posted by CT on March 10, 2010 at 10:02 AM PST #

CT,

I would be interested to know the details; can you give or mail me a case number and I'll try to track it down? (My e-mail address is my first name dot my last name at sun dot com.) Anyway, I definitely hear what you're saying, and hopefully you see the creation of the Support Lead as a step in the right direction. Anyway, contact me and let's get it figured out.

Posted by Bryan Cantrill on March 10, 2010 at 12:34 PM PST #

Hey Bryan,

Have you guys done anything to address the ZFS de-duplication issues that have been brought up on zfs-discuss? We would love to use ZFS de-duplication, but the myriad of issues pointed out on the list make me a bit hesitant. Curious to get your thoughts on this. Congrats on the new release!

- Ryan

Posted by Matty on March 10, 2010 at 08:45 PM PST #

Interesting analysis re: SSD performance. I wonder if improved performance visualisation tools such as these will help to convince manufacturers that OS people need to talk to SSDs directly and not just via a legacy emulation layer.

I'm looking forward to trying this release out, it's been a long time coming! Well done.

Posted by Jon on March 10, 2010 at 09:14 PM PST #

Great post. I think there's a typo there - you wrote "over a larger reason" instead of "over a larger region".

So what was the SSD drive model(s) with awful write latency?

Posted by Igor on March 11, 2010 at 12:30 AM PST #

Igor,

Corrected the typo -- thank you! As for that SSD, I can't remember the manufacturer, only the nickname we gave to it: Mr. Shitbag.

Posted by Bryan Cantrill on March 11, 2010 at 01:47 AM PST #

As a reseller accredited selling the first 7000 cluster globally (Thx to Phil L @SunUK) I've seen huge change with the product and now feel it's on track to be the perfect storage appliance.

As for support I think the traditional Sun storage team have struggled to look after the product due to the large support stack: OpenSolaris, iSCSI, CIFS, AD, LDAP, Kernel, LACP, IPMP, ZFS, NDMP, ZFS TX/RX etc.
But recently this has changed, I've seen a "Tiger Team" like approach to some recent support issues and the customers have been extremely satisfied with this new style.

Thanks Fishguys! You're making my life easier with every new release, to sell and support!

Posted by Andy Paton on March 11, 2010 at 03:19 AM PST #

Congratulations Bryan, great work. You definitely have one of the best and most advanced products on the market, and we love using it for our VMware environment. But I have to agree with CT, product support is definitely the worst I've experienced with any product, and we're on Gold level too. I'm glad you're making changes to support and hopefully this will address the problems. Support is the only problem that prevents me from recommending the product to my peers. Keep up the great work, and strengthen support and you'll have the absolute best product on the market.

Posted by Tariq Ali on March 11, 2010 at 05:17 AM PST #

Thanks Bryan, I'm really glad that support is being improved.
And I agree with Tariq, with the support piece of the puzzle in place, it will be the best product on the market hands down.

Posted by CT on March 11, 2010 at 06:14 AM PST #

"starting in 2010.Q1, the Technical Lead for the release will, as the release ships, transition to become the full-time Support Lead for that release in the field."

Bravo !
fx: applause until the welkin rings :fx

I've worked on support both ways - where the lead developers look at the problems, and where support problems are foisted off on some unfortunate in 'third-level' or QA or elsewhere in cubicle hell. The first model works better in many ways: the customer gets solutions faster, I learn faster and can consequently both support the customers better and insulate the developers better, and on.

Posted by Douglas Kretzmann on March 11, 2010 at 08:02 AM PST #

Congrats on the release! I'm hoping that this release will resolve some of the problems we've had with our 7310 cluster so far - so I have a couple of questions:

Is it now possible to re-configure network interfaces without causing service unavailability?

Our cluster fail-over time with 2009.09.01.4.1,1-1.13 was in the 10-15 minutes range, supposedly due to network interface plumbing... Has this been improved?

ALUA sounds really nice - how does this work? How will the standby node be able to process I/O, when there's no high-speed interconnect between itself and the pool owning node? Is it possible to use ALUA/MPIO with iSCSI too?

Tore

Posted by Tore Anderson on March 11, 2010 at 03:29 PM PST #

Bryan, when do you think we'll see either an API or a transparent data structure for the config? Coding expects is so much less efficient/uglier.

Posted by steve on April 05, 2010 at 06:22 AM PDT #

@Tore
Can't answer you regarding network plumbing improvements, but can put a face on ALUA. First, it's important to understand that only one cluster controller owns any given pool of storage at any given time. Thus, for FC, the paths to the standby head are in, well, standby. There is zero I/O activity to them unless and until the pool - and therefore LUNs - associated with the target group are failed over or taken over by the standby controller.

However, those paths must still be made visible to the initiator so that the client (MPxIO, MPIO, etc.) is aware that they can be used when they do become active.
ALUA doesn't really come into play with iSCSI because in that case, when the pool fails over, so do the network resources associated with it. The IP addresses remain the same. Compare that with use of FC HBAs where, of necessity, the port WWNs will be different on the standby controller head's HBAs.

Hope that helps. I'm sure Bryan will correct me if I've got any of this wrong.

Posted by Cecil on April 08, 2010 at 12:44 AM PDT #

@Cecil:

You're describing an active/passive mode, while as far as I understand, ALUA mode is basically a "fake" active/active mode. Both controllers will be able to service I/O, while only one is controlling the disks. If an I/O gets sent to the controller that also owns the disk, it is using the "optimal" path. If it is sent to the other controller, the I/O will still be serviced, only that the non-disk-owning controller will pass it on to its peer controller (via some kind of a high-speed interconnect between them), which will then process the request and then return it back the way it came in. This is a "non-optimal" path.

The host-based multipath software can then make an inquiry to determine which paths are "optimal" and which are not, and focus the I/O to the optimal one(s). However I/O to the non-optimal one will _not_ fail, which is especially useful when a server boots for instance - partition scanning and such will not cause I/O errors.

There's no reason why this mode should apply only to FC, by the way, the principle would be exactly the same with iSCSI targets on both controllers as with FC targets.

I've got some EMC CLARiiONs in my data centre too, and the active/passive mode you're describing is called "PNR" (passive not ready), while the ALUA mode works just like I've described. ALUA is much, much more trouble-free than PNR.

But I don't see how the Unified Storage cluster can possibly work in ALUA mode, as there's no high speed interconnect between the two controllers. I wish it were though, one of my main grievances with the Unified Storage is exactly that the network interface failover is coupled to the ZFS pool failover. It would have been so much nicer if those were completely independent.

Tore

Posted by Tore Anderson on April 08, 2010 at 01:51 AM PDT #

I got ALUA working in vmware vSphere. There is one drawback though, when I enable deduplication and compression on the LUN with the Virtual Machines on it and failover the clusterhead a few times, it eventually times out and becomes inaccessible. Any ideas?

Posted by Roger on May 03, 2010 at 01:34 AM PDT #
