BGP storms and subtle changes

Subtle changes to networking infrastructure can have serious side effects. Our ISP decided to upgrade their edge routers from Cisco to Juniper Networks gear (don't ask me why, I don't know and don't care - Cisco is a big partner of ours, and I like their gear). Seemed like they were very similar, so no worries, right? We made the switch and saw an interruption of a few seconds, then everything stabilized, or so we thought.

Later that afternoon, I started to get calls - some websites were funky; sometimes they'd work, sometimes not, and most times very slow. That's odd because we have more bandwidth to our cage than all the rest of Sun has combined (2 Gb/s).

We start investigating and things get progressively worse, fast. Suddenly the whole site drops off the air. Twenty minutes later we get things back, but we're all sweating (OK, the ops guys are sweating; I'm popping Tums and waiting for evil phone calls from above - you know - when the COO or the CEO calls you?). Needless to say, we're working on it. And working... Late that evening we have a revelation - everything works fine until we fail over a front-end switch. Our front-end switches are supposed to switch all traffic when anything goes wrong. Worked fine with the Ciscos, but with the Junipers it touched off a BGP storm.

This had to get fixed, and quick. We regularly make changes to the front-end load balancers, which we test by flopping switches, and if it's good we update 'em both and away we go. Otherwise we can switch back with little perceived downtime (like a few seconds at most).

We had in place all the right things to snoop ALL traffic at all levels of the network. Doing so showed us a few interesting things. The Juniper switches advertise the same MAC address on multiple VLANs. Whoops - the Ciscos didn't do that. Due to the way the network is set up, that meant that our front-end switches couldn't determine how to switch the traffic. Same MAC address on multiple interfaces... everything looks the same... barf, die, cause Will, Warren and Quoc ulcers. Ah, but this is what we live for, right? Tough problems are fun. Easy problems are boring. Probably why I don't clean my desk - too easy (until it gets really scary...)

That subtle difference - advertising the same MAC address on multiple VLANs - was enough to cause two days of pain. We had our ISP make a manual change to the Junipers, setting unique MAC addresses on each VLAN, and we were stable again.
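For the curious, here is a minimal sketch of the kind of check that would flag this condition from snooped traffic. It is not the tooling we actually used; it assumes you have already boiled a capture down to (source MAC, VLAN ID) pairs, and the function name and the sample addresses are made up for illustration.

```python
from collections import defaultdict


def macs_on_multiple_vlans(observations):
    """Return {mac: sorted VLAN IDs} for every MAC seen on more than one VLAN.

    `observations` is an iterable of (mac, vlan_id) pairs, assumed to have
    been extracted from a packet capture by some other tool.
    """
    vlans_by_mac = defaultdict(set)
    for mac, vlan in observations:
        # Normalize case so "AA:BB" and "aa:bb" count as the same address.
        vlans_by_mac[mac.lower()].add(vlan)
    return {
        mac: sorted(vlans)
        for mac, vlans in vlans_by_mac.items()
        if len(vlans) > 1
    }


# Hypothetical sample data: one MAC shows up on VLANs 10 and 20.
sample = [
    ("00:05:85:AA:BB:01", 10),
    ("00:05:85:aa:bb:01", 20),
    ("00:0c:29:11:22:33", 10),
]

print(macs_on_multiple_vlans(sample))
# prints {'00:05:85:aa:bb:01': [10, 20]}
```

Any address that comes back from a check like this is a candidate for the ambiguity described above, though whether it actually causes trouble depends on how the switches in between are set up.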

The complexity of high performance, secure networks makes finding these kinds of things in advance VERY challenging. We thought we'd tested everything in the world, but here was a case where we missed one. And paid the price. You can be too cautious as well, never changing anything for fear of breaking something. On the web, that seems to be the worst choice you can make.

Comments:

Great story, and very well told. I certainly learned something from it . . . The problems that could arise from flooding MAC address advertisements are also mentioned by Radia Perlman in her book on networking.

Posted by M. Mortazavi on August 18, 2004 at 07:47 AM PDT #

Great story, and I certainly agree with the last part about complexity. It's not just a matter of finding problems in advance: when something goes wrong, as it is bound to do sooner or later, finding the cause can be very difficult.

Posted by K. Gullberg on August 19, 2004 at 07:19 AM PDT #

About

I run the engineering group responsible for Sun.com and the high volume websites at Sun.

Will Snow
Sr. Engineering Director
