Monday May 05, 2008

Busy day today

It's been a busy day today: OpenSolaris was announced and the downloads started coming in fast and furious. It overwhelmed our bandwidth locally, so we shifted our traffic to one of our CDN providers. I'm happy that we've built into most of our systems the ability to shift traffic dynamically. It allows me to balance budget vs. performance in near real-time. Ok, it's still not where I want it (where I can just tweak some knobs and have things magically work), but it's really dang close.

Rama asked me after we fixed the problem "How did you know it was bandwidth?", to which I replied - "Experience". Ok, it was experience and paranoia - I watched the server dashboard light up (in a bad way) and I was thinking "gee, I wonder if OpenSolaris is going to get hit hard this AM..." Checked the network bandwidth graphs and sure enough, we were seeing some ugly flat spots, way up at 961mb/s. Yep, nearly a gigabit/second was being driven through the infrastructure.

Good news is the firewalls, load balancers, et al. were doing fine; they just couldn't go any faster. So Joe shifted some traffic to our CDNs and presto, we're back in operating territory.
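Those "ugly flat spots" are easy to describe in code. Here's a minimal Python sketch of the idea, not the tooling we actually run: it assumes throughput samples in megabits/second, a 1000 megabit/second link, and a made-up rule that three or more consecutive samples above 95% of capacity count as a flat spot. The function name, threshold and run length are all hypothetical.

```python
from typing import List, Tuple


def find_flat_spots(
    samples_mbps: List[float],
    capacity_mbps: float = 1000.0,  # assumed link capacity
    threshold: float = 0.95,        # assumed "pinned" fraction of capacity
    min_run: int = 3,               # assumed minimum run length
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs where throughput stays near capacity."""
    ceiling = capacity_mbps * threshold
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(samples_mbps):
        if value >= ceiling:
            if start is None:
                start = i  # a possible flat spot begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that lasts until the final sample.
    if start is not None and len(samples_mbps) - start >= min_run:
        runs.append((start, len(samples_mbps) - 1))
    return runs


if __name__ == "__main__":
    graph = [400, 700, 961, 960, 961, 959, 500, 970, 300]
    print(find_flat_spots(graph))
```

A healthy link has peaks; a saturated one has plateaus, and that's what this looks for.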


Saturday Aug 11, 2007

Downtime and consequences

Datacenter power outage affects

[Read More]

Wednesday May 23, 2007

Sometimes, even google can't help

Sometimes, even google can't help.

I got a new wireless router, and simultaneously decided to switch from DSL to Cable. Yay! Sort of... DSL has been a nightmare for me lately - I've had an incredible amount of downtime, resulting in a working connection at *128k* downlink. Great. ISDN was faster. And cheaper! Really a sore point for me, and now the DSL company wants me to spend another 6 hours debugging their stupid line. Heck, I'll pay the cancellation fee.

So I switched to cable. One thing I'd like to say is that it was incredibly easy to switch from DSL. Well, other than waiting a week from the install request to the actual install, and the guy showing up on my doorstep at *noon* for an 8am-noon appt window. Other than that, WOW. Installed and running in 20 minutes, including running a new cable under my house, drilling a hole in the floor and figuring out the rat's nest of wiring under my desk... That went so well, I added in a new DLINK DIR-655 Extreme-N router/wireless hub.

Everything was dandy. I dutifully read their docs, put the CD in the drive and like a fool installed Network Magic. Uh oh. Should have known better. Never read the directions, never follow instructions... You fool!

I ended up not being able to see the router, and when I did, I got incomplete web pages. Tried googling. Nothing. Seriously, one other person had a similar error. Reset the modem. Twice. Even called tech support (oh the shame!).

Removed all wiring and software, changed ethernet ports on the PC (don't ask), plugged the PC back into the router directly. Restarted everything. Turned off windoze firewall (don't ask I said!) and presto. I got to the router. Updated the firmware, did all configuration, messed around a bit more and then plugged in the cable modem. Still good.

Just goes to show you (a) google doesn't have all the answers and (b) never, ever follow documentation for consumer level products. Enterprise level stuff, well that's another matter for another blog entry...

Wednesday Aug 02, 2006

Having fun with WiFi (hacking)

Have fun with someone stealing your wifi! [Read More]

Friday Feb 25, 2005

A few poorly designed websites will cost you serious money

So over the past few weeks we launched a new site at Sun. During the evaluation process, engineering suggested they trim the video on the homepage (1MB video, single file). The marketing folks didn't like our suggestion so they left it that way and we launched.

Shame on me for not testing the site a bit more - it took us a while to figure out what was going on, but it turns out that handy little 1MB file doesn't get cached. It's not cacheable by browsers. That's right, if you stay on the home page of this site, you get that file downloaded every 15 seconds or so. Good times.

We spiked our traffic 10 fold for this one site - from 30mb/s to over 130mb/s. The good news is that I get to charge back the group that owns the site. The bad news is it put some stress on my infrastructure that I didn't need. It's fun to watch the traffic graphs for our cage now - it's the first time we've spiked over 300mb/s. Of course, like I said, that costs real money. Serious money too (although if you buy bandwidth in bulk you get a MASSIVE discount :)
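For a rough sense of scale, here's the back-of-the-envelope math as a Python sketch. It assumes "mb/s" means megabits per second, that the file is exactly 1 MB, and that every idle visitor re-fetches it every 15 seconds; the function names are made up for illustration.

```python
def refetch_bandwidth_mbps(file_megabytes: float, interval_seconds: float, viewers: int) -> float:
    """Megabits/second used when every viewer re-downloads the file once per interval."""
    megabits_per_fetch = file_megabytes * 8  # 1 byte = 8 bits
    return viewers * megabits_per_fetch / interval_seconds


def viewers_for_bandwidth(extra_mbps: float, file_megabytes: float, interval_seconds: float) -> float:
    """How many idle viewers it takes to generate extra_mbps of re-download traffic."""
    return extra_mbps * interval_seconds / (file_megabytes * 8)


if __name__ == "__main__":
    # One idle visitor on the home page.
    print(refetch_bandwidth_mbps(1, 15, 1))
    # Idle visitors needed to add 100 megabits/second (30 -> 130).
    print(viewers_for_bandwidth(100, 1, 15))
```

Under those assumptions each idle visitor costs a bit over half a megabit/second, and fewer than 200 of them parked on the home page would account for the whole jump.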

Lessons Learned: test pages for caching before release; don't allow video on the home page of a popular site; and don't ever allow creative freedom to marketing - now they want video everywhere on the site...
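"Test pages for caching" can be as simple as looking at the response headers before launch. The sketch below is a deliberately simplified Python check, not a full implementation of the HTTP caching rules: it takes a dict of response headers (however you fetched them) and guesses whether a browser could reuse the response. The function name is hypothetical, and treating `no-cache` as "not cacheable" and ignoring `Expires` are simplifying assumptions.

```python
from typing import Mapping


def is_browser_cacheable(headers: Mapping[str, str]) -> bool:
    """Rough guess: do these response headers let a browser reuse the response?"""
    normalized = {name.lower(): value for name, value in headers.items()}

    # Parse Cache-Control into a {directive: value} dict.
    directives = {}
    for part in normalized.get("cache-control", "").split(","):
        part = part.strip().lower()
        if not part:
            continue
        key, _, value = part.partition("=")
        directives[key.strip()] = value.strip().strip('"')

    # Simplification: treat both of these as "will be fetched again".
    if "no-store" in directives or "no-cache" in directives:
        return False
    if "max-age" in directives:
        try:
            return int(directives["max-age"]) > 0
        except ValueError:
            return False
    if normalized.get("pragma", "").strip().lower() == "no-cache":
        return False
    # No explicit lifetime: validators at least allow cheap revalidation.
    return "etag" in normalized or "last-modified" in normalized


if __name__ == "__main__":
    print(is_browser_cacheable({"Cache-Control": "max-age=3600"}))
    print(is_browser_cacheable({"Cache-Control": "no-store"}))
```

Run something like this against every large asset on the home page, and a file that gets re-downloaded every 15 seconds shows up before the bandwidth bill does.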

We were going to wait to upgrade the net until the big move. Now I've got to upgrade several parts of the network right now instead of the more leisurely pace I prefer (ok, leisurely is a misnomer, controlled, tested, etc. is a better way to phrase it :)

Tuesday Dec 14, 2004

Do you JASS?

Maintaining more than 350 machines of varying types with 3 ops people is challenging - but doable if you have the right tools. JASS, the Solaris Security Toolkit, is one of the tools we use all the time. With some up-front configuration, you can rebuild machines completely hands off, and ensure that you have a secure build. If you haven't looked at JASS yet, take a peek, take a test drive and enjoy. It's allowed my organization to be one of the most efficient ones in Sun - deploying servers fully configured, fully secure in less than an hour (4 if you include unboxing, transporting them to the cage, racking them, wiring them, routing the cables & stopping at Starbucks).

Wednesday Aug 18, 2004

BGP storms and subtle changes

Subtle changes to networking infrastructure can have serious side effects. Our ISP decided to upgrade their edge routers from Cisco to Juniper Networks gear (don't ask me why, I don't know and don't care - Cisco is a big partner of ours, and I like their gear). Seemed like they were very similar, so no worries, right? We made the switch and saw an interruption of a few seconds, then everything stabilized, or so we thought.

Later that afternoon, I started to get calls - some websites were funky; sometimes they'd work, sometimes not, and most times very slow. That's odd because we have more bandwidth to our cage than all the rest of Sun has combined (2Gb/s).

We start investigating and things get progressively worse, fast. Suddenly the whole site drops off the air. 20 minutes later we get things back, but we're all sweating (ok the ops guys are sweating, I'm popping Tums and waiting for evil phone calls from above - you know - when the COO or the CEO calls you?) Needless to say, we're working on it. And working... Late that evening we have a revelation - everything works fine until we fail over a front-end switch. Our front-end switches are supposed to switch all traffic when anything goes wrong. Worked fine with the Ciscos, but with the Junipers it touched off a BGP storm.

This had to get fixed and quick. We regularly make changes to the frontend load balancers, which we test by flopping switches, and if it's good we update 'em both and away we go. Otherwise we can switch back with little perceived downtime (like a few seconds at most.)

We had in place all the right things to snoop ALL traffic at all levels of the network. Doing so showed us a few interesting things. The Juniper switches advertise the same MAC address on multiple VLANs. Whoops - the Cisco didn't do that. Due to the way the network is set up, that meant that our front-end switches couldn't determine how to switch the network. Same MAC address on multiple interfaces... everything looks the same... barf, die, cause Will, Warren and Quoc ulcers. Ah, but this is what we live for, right? Tough problems are fun. Easy problems are boring. Probably why I don't clean my desk - too easy (until it gets really scary...)

That subtle difference - advertising the same MAC address on multiple VLANs - was enough to cause 2 days of pain. We had our ISP make a manual change to the Junipers, setting unique MAC addresses on each VLAN, and we were stable again.
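If you have the snooped traffic boiled down to (VLAN, source MAC) pairs, spotting this condition is a few lines of code. This is a minimal Python sketch under that assumption; the function name and the addresses in the example are made up, and it says nothing about how the pairs get captured in the first place.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def macs_on_multiple_vlans(observations: Iterable[Tuple[int, str]]) -> Dict[str, Set[int]]:
    """Map each MAC address seen on more than one VLAN to the set of VLAN IDs it appeared on."""
    vlans_by_mac: Dict[str, Set[int]] = defaultdict(set)
    for vlan_id, mac in observations:
        # Normalize case and separator so the same address always compares equal.
        vlans_by_mac[mac.lower().replace("-", ":")].add(vlan_id)
    return {mac: vlans for mac, vlans in vlans_by_mac.items() if len(vlans) > 1}


if __name__ == "__main__":
    seen = [
        (10, "00:11:22:33:44:55"),  # hypothetical router MAC on VLAN 10
        (20, "00:11:22:33:44:55"),  # the same MAC again on VLAN 20
        (10, "AA:BB:CC:DD:EE:01"),  # an ordinary host, one VLAN only
    ]
    print(macs_on_multiple_vlans(seen))
```

Anything this returns is a candidate for the "everything looks the same" problem described above.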

The complexity of high performance, secure networks makes finding these kinds of things in advance VERY challenging. We thought we'd tested everything in the world, but here was a case where we missed one. And paid the price. You can be too cautious as well, never changing anything for fear of breaking something. On the web, that seems to be the worst choice you can make.

Monday Aug 09, 2004

Gah, shows you what getting popular gets you...

I should have listened. And I typically do. Operations wise, that is. Other things, not so much...

A couple of years ago I attended a PARC forum done by one of the top engineering guys at Yahoo (dang, I think he was the CTO, but looking through the PARC video archives, I can't find him :(. Anyway, he had some really interesting things to say about watching the search and click trends of millions of people daily and trying to discover patterns. He was trying to see if he could find epidemics early via what people were doing on the web. Turned out it seemed to be working, until he realized Britney Spears (or some other teen idol) had just mentioned the term.

One of the most interesting things he said, and the reason for this post, was that he never ever deployed anything at Yahoo on less than 2 servers. They just didn't do it.

That recently bit me in the hind end... As you may have noticed, has been a bit twitchy... Way too much downtime for my liking. Well, as I mentioned we kind of put the server up in a week - from scratch, not knowing the software or anything. I was in a meeting with the heads of blogging at Sun - Pat Chanezon, Tim Bray, Simon Phipps, Danese Cooper and two VIPs - Jonathan Schwartz and John Fowler, when Sun took the official first steps toward blogging. Schwartz and Fowler said "go". So I took the nearest machine and my blogging/wiki expert Hoffie and off we went. Note: there were LOTS of people at that meeting, please forgive me for only focusing on a few.

Usually I'll deploy in a multi-tier architecture, but for several reasons that wasn't going to happen. We were near the end of the quarter, so we didn't have much time; we wanted to take advantage of "top of mind" of the VIPs; and most importantly, the software wasn't written to be load balanced.

We dropped a single machine out there, and I figured that we'd be ok for a little while. Next thing I know, we have Schwartz blogging and Business Week writing articles about it! Holy moly!

Traffic spikes (seriously) like you can't believe. The only thing I believe at this point is that my blackberry couldn't be more annoying at 3 in the morning. Yeah, I was on call for the server. And I didn't do a great job at catching all the faults. Hoffie was out having a 5th kid... Ok, his wife was having the kid, but dang!

Finally last week I got the other machines in rotation, so if the primary server drops, we'll be able to survive for a while.

Sorry for all the downtime; realize that it was probably more painful for me than you (heck, you weren't getting paged at 3AM). Doc Searls, please know that it was simple downtime and we've disallowed pinging to our servers, so that's not a really good measure of whether a server is up (but a really good try!) Hey folks, if you want to read an informative blog, I'd suggest Doc's.

Operations notes: we took the machine down Friday afternoon for approx. 20 min. because we noticed it was down to 200k free memory. We could have just added swap space, but we had been noticing a steady decrease in free memory that stopping/restarting the webservers wasn't clearing. It took a reboot to get the memory back. Went from 200k to 7.5GB!!! The rest of the changes included tuning MySQL and the JVM for the application we use - rollerweblogger.
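A steady decrease like that can be caught earlier by fitting a line to the free-memory readings and asking when it hits zero. Here's a small Python sketch of that idea. It assumes you already collect (hours, free megabytes) samples somewhere; the function name and sample values are hypothetical, and a straight-line fit is only a crude stand-in for how a real leak behaves.

```python
from typing import List, Optional, Tuple


def hours_until_exhausted(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Least-squares fit of free memory over time.

    samples: (hours_since_start, free_megabytes) pairs.
    Returns the estimated hours after the last sample until free memory reaches
    zero, or None when there is no downward trend to extrapolate.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples taken at the same instant
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope >= 0:
        return None  # memory is flat or recovering
    intercept = mean_m - slope * mean_t
    zero_at = -intercept / slope  # time at which the fitted line crosses zero
    return max(0.0, zero_at - samples[-1][0])


if __name__ == "__main__":
    readings = [(0, 1000), (1, 900), (2, 800), (3, 700)]  # made-up data
    print(hours_until_exhausted(readings))
```

If restarting the webservers doesn't change the slope, that's the hint that the leak lives somewhere a restart doesn't reach.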

Bork, bork, bork, bork

So a colleague of mine recently said something along the lines of "the content is fine in a browser but borked in the RSS feed". I couldn't for the life of me remember where "bork" came from. So of course I googled, and lo and behold, I found Google Bork, and the king of all sites, Ze sveedish chef page. You know, the Muppets, Sesame Street, etc.

This is what I get for borking up the site.

Bork is almost as useful as another four letter word, but I will refrain from linking to the site that describes the wonders of that word (it's a noun, a verb, an adjective -- heck, all parts of speech!)

Thursday Jul 29, 2004

Lorem ipsum...

I was wandering about our site, looking for dead blogs (blogs with no entries) and found someone had posted the infamous "Lorem ipsum..." in their blog. That text is commonly seen on website comps when you're designing a website.

So I set about sending out a note to the bloggers at Sun saying it's not cool to post junk text. Then I got a little worried - I don't read latin, heck it might be some really interesting post on antigravity or black holes. Ok, I doubt it, but you never know...

Googled the term, and lo and behold, an interesting article on the subject. Cicero no less. Ok, whacked Cicero, but wow! And a pretty interesting quote as well, to wit:

    "There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain . . ."

A few words left out, a few others truncated... Nonetheless, interesting, huh? Yeah, my kid woke me up at 4:30AM and I was surfing the net.

Geoff Arnold just pointed me to another interesting site and some cool mac software on this very subject.

Sunday Jul 04, 2004

Does it work this way at every company?

At Sun, it appears to be a common occurrence that once you've written something you end up owning it forever. Even changing jobs won't get you out of supporting a tool you wrote 5+ years prior (yes, I've worked at Sun forever). I'm happy to say that I don't get called very often anymore about SunSolve, but I do know they're still running some of my code. What a scary thought. I wrote it originally intending to change over to something else when the technology improved. Apparently it's not changed enough or improved enough to change to a new technology.

Which brings up another point - when does it make sense to change? Customers - internal and external - are loath to change. This goes double for customers trying to solve mission critical problems. SunSolve is a vast repository of information (and I mean vast). Power users know how it works, what the dataset is like and how to get the best results. If you change the interface or query engine on these guys you get kicked around the schoolyard for a while. Internal support folks have their little tricks as well. How do you introduce new technology to them? It's an age old question and really tough to solve.

Oh well, it's not really my problem any more - I have to just deal with the customers...

Thursday Jun 24, 2004

JavaOne, and goofy comments

So JavaOne is really gearing up - the week before and it's a madhouse at the office; content that absolutely has to be updated/live, new stuff rolling out, end of the quarter, end of our fiscal year, gee, this is a *great* time for Sun's biggest conference.

Be that as it may, I love the chaos. Bring it on is what I say. Yesterday we got a nice note from a concerned citizen who mentioned that he'd been hacked and his hacked box (Win2k server running IIS) was trying to hack us. We'd been seeing some low level SYN flood attack, but due to some recent network changes, it wasn't really that much of a bother. Did I mention how much I like our new network "stuff"? Stuff is a technical term for completely new design, hardware, connection to the net, etc.?

This should in no way be construed as an invitation to hack us, just that we have some cool new kit, and it works well so far. It took one small DOS attack and thwarted it.

My operations group delivered a completely new network, and moved over 100 machines with very little downtime (almost nothing could be perceived by customers). It took a while longer than I thought, but considering the fact that it was all new network gear, new network topology, and had to be done with the aforementioned little downtime, it was impressive as hell.

And now for the goofy comment: "Every quick hack deserves a place in history, and a lifetime of support"... Said earlier today - way too early - Liam woke me at the shiny hour of 5:30... I'm seeing a lot of sunrises these days... Anyway, I was telling this to Pat Chanezon who wrote an account creation tool for - internal employees can create their own accounts, which is a good thing, but now they want the tool to polish their shoes, check for errors, etc. Heck, I think he whipped it out in a day or so... Back in the day, I wrote several tools like this, and you end up supporting them forever. Heck, some of my original code from *1994* still runs SunSolve... Geez.

Wednesday Jun 02, 2004

Hackers and Painters

Wow, what a great article - Hackers and Painters by Paul Graham.

Paul says a lot of things that I strongly believe in. Some of the comments on forgetting the theory of computation 3 weeks after the final hit home rather directly - of course looking at my grades, you'd think that I forgot the theory 3 weeks before the final...

I work for Sun Microsystems, and if we're not a big company, I don't know what is... Even so, I still follow the belief that I only hire folks that do work on the side. Hiring programmers that LOVE to hack is critical. I also strongly believe that separating design from implementation leads to IT systems - it works according to spec, but if you need a minor change, you're screwed. Not saying all IT is bad, just that there are processes and bureaucracy in place that pretty much prevent the elegant designs we all strive for. I do hope that we all strive for elegant design (even though I'm frequently proven wrong on this).

Paul also has another article on taste that talks about this aspect... One thing I like about working on the web is that it is completely acceptable to hack things on the web - it's the land of eternal beta test. If it doesn't work today, change it tomorrow. People that don't understand this quality should work on financial systems or manufacturing ERP systems or something other than the web. If you're not regularly changing the web systems, you'll quickly lose pace with the rest of the industry. The way people interact with websites is continuing to change, and there is much that can be done to improve the experience. I like being part of that change.

Monday May 10, 2004

I have a psychic connection to my webservers...

I swear I do. I woke up at 4am and couldn't get back to sleep. I'm thinking it's the wind noise making something rattle around outside or some animal. So I get up, get some water, wander around the house, and try to go back to bed. 4:30AM, pager goes off, followed by cell phone. Wife punts me out of bed and I trundle off to answer the phone. is seeing "variable" performance.

That's special. Something you always want to hear, "variable" on the most visible site at Sun. Yippee. I spend the next few minutes logging in, confirming what I've been told, starting IM to talk to the on-call folks in the UK. Yep, it's variable. Seems that with varying traffic levels, we're dropping connections from certain locations in the world.

We've just swapped one of our switches last week, and I'm thinking it might be the new switch or the other one is having issues. I'm a hands-on kinda guy, and so I poke and prod a bit, checking connectivity from several locations (I run a few personal websites). I finally call my lead network engineer, and he switches over to the backup switch. Everything is resolved, for now. Of course, the predawn light hits my eyes and now I'm up for good. Another early start to my day.

The real kicker here is that we've built out a new network infrastructure that has significantly more capacity and performance. We started moving machines already, but as the site is live, you have to move things in a very studied fashion. Slowly and carefully and with lots of testing. The old network is running some network gear that is old and starting to feel its age. I'm now *very* motivated to move stuff to the new network faster...

Thursday Apr 29, 2004

DOS attacks suck

Particularly when you're having a major content push.

The other day we were having our earnings announcement (can you get more visible?) and I walked into work at around 8:30am. I'd been getting paged every once in a while from our site monitors that something was occasionally failing or slow.

We looked around (my ops engineers and I) and saw a significant jump in incoming traffic - something north of 50mb/s - when we normally run about 3-5mb/s. The servers were still fine, as the switches were handling the attacks. Unfortunately it was only 8:30am, with our peak traffic loads hitting at around 10am. Looking at the switches we were running about 98% of capacity. The big question was did we have the capacity to live through the earnings announcement?

As the load goes up, we saw more pages from the monitors that some places were slow or couldn't connect... Earnings announcement is at 2pm PST. By noon we're pretty well maxed out on the switches, but we don't want to take downtime due to the earnings...

1pm. We attempt a content push - and it fails. This is the content that will be used for the announcement at 2. Great. We then attempt to reboot one switch, hoping against hope we can bring it back before the announcement. The good news, it boots pretty fast (10 min). The bad news - the other switch succumbs to the load and dies (reboots). Switch 1 has some issues because switch 2 never really lets go. Great.

1:30pm I'm getting calls from all kinds of people - VPs, you name it. If we blow this, we're in serious trouble. One nice thing is the incoming traffic has started to fall off, the DOS is blowing over (but we're still semi-offline - some of the sites are fine, others are not so fine). We finally restart both switches and get them to split the load up as they're supposed to...

1:45pm: attempt the content push again. It finally goes through. Lots of sweat and hard work to get everything restarted, but in the end we get it all out there at *1:56pm*. Yikes. Way too close for me. I go back to my office and finalize the plans to move to the new network. We've tested that one to much higher levels of attacks and it's proven to be more resilient. Gotta get there soon, before I blow a blood vessel.

Sometimes I think network hardware providers are behind DOS attacks - I've been hit 3 times, and each one caused me to buy new network gear (newer stuff handles attacks better - there's no doubt about it)


I run the engineering group responsible for and the high volume websites at Sun.

Will Snow
Sr. Engineering Director

