BGP storms and subtle changes
By was on Aug 18, 2004
Later that afternoon, I started to get calls - some websites were funky; sometimes they'd work, sometimes not, and most times very slow. That's odd because we have more bandwidth to our cage than all the rest of Sun has combined (2Gb/s).
We start investigating and things get progressively worse, fast. Suddenly the whole site drops off the air. 20 minutes later we get things back, but we're all sweating (ok, the ops guys are sweating, I'm popping Tums and waiting for evil phone calls from above - you know - when the COO or the CEO calls you?). Needless to say, we're working on it. And working... Late that evening we have a revelation - everything works fine until we fail over a front-end switch. Our front-end switches are supposed to switch all traffic when anything goes wrong. That worked fine with the Ciscos, but with the Junipers it touched off a BGP storm.
This had to get fixed, and quick. We regularly make changes to the front-end load balancers, which we test by flopping switches, and if it's good we update 'em both and away we go. Otherwise we can switch back with little perceived downtime (like a few seconds at most).
We had in place all the right things to snoop ALL traffic at all levels of the network. Doing so showed us a few interesting things. The Juniper switches advertise the same MAC address on multiple VLANs. Whoops - the Cisco didn't do that. Due to the way the network is set up, that meant that our front-end switches couldn't determine how to switch the network. Same MAC address on multiple interfaces... everything looks the same... barf, die, cause Will, Warren and Quoc ulcers. Ah, but this is what we live for, right? Tough problems are fun. Easy problems are boring. Probably why I don't clean my desk - too easy (until it gets really scary...)
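For the curious, here's roughly the kind of check that would have flagged this from a capture. It's a minimal sketch, not what we actually ran: the function name, the (MAC, VLAN) tuple format and the sample addresses are all made up for illustration, and it assumes you've already boiled your snoop output down to one (source MAC, VLAN ID) pair per frame.

```python
from collections import defaultdict

def macs_on_multiple_vlans(observations):
    """Return {mac: sorted list of VLAN IDs} for every MAC seen on more than one VLAN.

    `observations` is an iterable of (mac, vlan_id) pairs. This input format
    is an assumption for illustration, not the output of any particular tool.
    """
    vlans_by_mac = defaultdict(set)
    for mac, vlan in observations:
        # Normalize case so "AA:BB" and "aa:bb" count as the same address.
        vlans_by_mac[mac.lower()].add(vlan)
    return {
        mac: sorted(vlans)
        for mac, vlans in vlans_by_mac.items()
        if len(vlans) > 1
    }

# Hypothetical sample data: one MAC shows up on VLANs 10 and 20.
observations = [
    ("00:05:85:aa:bb:cc", 10),
    ("00:05:85:AA:BB:CC", 20),
    ("00:0c:29:11:22:33", 10),
]
duplicates = macs_on_multiple_vlans(observations)
print(duplicates)
```

Anything this returns is worth a second look before you flop a switch, though whether it's actually a problem depends on how the gear downstream learns addresses.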
That subtle difference - advertising the same MAC address on multiple VLANs - was enough to cause 2 days of pain. We had our ISP make a manual change to the Junipers, setting unique MAC addresses on each VLAN, and we were stable again.
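To give a feel for why unique MACs calm things down, here's a toy model. Big caveat: it assumes the confused device keeps one learning table keyed by MAC address alone, which is a guess for illustration and not something we confirmed about our front-end switches. The function, port names and addresses are all hypothetical.

```python
def count_moves(frames, per_vlan):
    """Count how often a learned address 'moves' to a different port.

    `frames` is a list of (mac, vlan_id, ingress_port) tuples.
    With per_vlan=False the table is keyed by MAC only (the assumed
    problem case); with per_vlan=True it is keyed by (VLAN, MAC).
    """
    table = {}
    moves = 0
    for mac, vlan, port in frames:
        key = (vlan, mac) if per_vlan else mac
        if key in table and table[key] != port:
            moves += 1  # same key seen on a new port: the entry flaps
        table[key] = port
    return moves

# Same MAC advertised on two VLANs, arriving on two different ports.
shared_mac = [("aa", 10, "port1"), ("aa", 20, "port2")] * 3
# After the fix: a unique MAC per VLAN.
unique_macs = [("aa", 10, "port1"), ("ab", 20, "port2")] * 3

print(count_moves(shared_mac, per_vlan=False))   # 5: the entry flaps on every frame after the first
print(count_moves(shared_mac, per_vlan=True))    # 0: VLAN-aware keying would not care
print(count_moves(unique_macs, per_vlan=False))  # 0: unique MACs fix it even for a MAC-only table
```

In this model, either a VLAN-aware table or unique MACs per VLAN stops the flapping. We got the second one, courtesy of the ISP.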
The complexity of high-performance, secure networks makes finding these kinds of things in advance VERY challenging. We thought we'd tested everything in the world, but here was a case where we missed one. And paid the price. You can be too cautious as well, never changing anything for fear of breaking something. On the web, that seems to be the worst choice you can make.