For a little more than 90 minutes yesterday, internet service for millions of users in the U.S. and around the world slowed to a crawl. Was this widespread service degradation caused by the latest botnet threat? Not this time. The cause was yet another BGP routing leak — a router misconfiguration directing internet traffic from its intended path to somewhere else.
On Nov. 6, our network experienced a disruption affecting some IP customers due to a configuration error. All are restored.
— Level 3 Network Ops (@Level3NOC) November 6, 2017
While not a day goes by without a routing leak or misconfiguration of some sort on the internet, it is an entirely different matter when the error is committed by the largest telecommunications network in the world.
In this blog post, I’ll describe what happened in this routing leak and some of the impacts. Unfortunately, there is no silver bullet to completely remove the possibility of these occurring in the future. As long as we have humans configuring routers, mistakes will take place.
At 17:47:05 UTC yesterday (6 November 2017), Level 3 (AS3356) began globally announcing thousands of BGP routes that had been learned from customers and peers and that were intended to stay internal to Level 3. As a result, internet traffic to large eyeball networks like Comcast and Bell Canada, as well as major content providers like Netflix, was mistakenly sent through Level 3’s misconfigured routers. Traffic engineering is a delicate process, so sending a large amount of traffic down an unexpected path is a recipe for service degradation. Unfortunately, many of these leaked routes stayed in circulation until 19:24 UTC, leading to over 90 minutes of problems on the internet.
Bell Canada (AS577)
Bell Canada (AS577) typically sends Level 3 a little more than 2,400 prefixes for circulation into Level 3’s customer cone. During the routing leak yesterday, that number jumped up to 6,459 prefixes – most of which were more-specifics of existing routes and, equally important, were announced to Level 3’s Tier 1 peers like NTT (AS2914) and XO (AS2828, now a part of Verizon).
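To make "more-specifics of existing routes" concrete, here is a minimal Python sketch of how one might pick out newly seen prefixes that fall inside an already-announced route. The function name `find_more_specifics` and the prefixes are hypothetical placeholders chosen for illustration; they are not taken from the leak data, and this is not a description of how our own analysis was performed.

```python
import ipaddress

def find_more_specifics(baseline, observed):
    """Return observed prefixes that are new and strictly inside a baseline prefix.

    Hypothetical helper for illustration only.
    """
    base_nets = [ipaddress.ip_network(p) for p in baseline]
    known = set(base_nets)
    result = []
    for p in observed:
        net = ipaddress.ip_network(p)
        if net in known:
            continue  # already announced before the event, so not new
        for b in base_nets:
            # Compare only within the same IP version; subnet_of() raises otherwise.
            if (net.version == b.version
                    and net.prefixlen > b.prefixlen
                    and net.subnet_of(b)):
                result.append(str(net))
                break
    return result

# Placeholder prefixes (private and documentation ranges), not real leaked routes.
baseline = ["10.20.0.0/16", "192.0.2.0/24"]
observed = ["10.20.0.0/16", "10.20.4.0/22", "10.20.8.0/22", "198.51.100.0/24"]

leaked = find_more_specifics(baseline, observed)
print(leaked)
```

In this toy input, the two /22s are flagged because they sit inside the baseline /16, while the unrelated /24 is new but not a more-specific of anything in the baseline.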
Below is a visualization of the latency impact of the routing leak.
Next is the propagation profile of just one of those Bell Canada routes leaked by Level 3. 22.214.171.124/22, for example, is not normally in the global routing table. That address space is covered by 126.96.36.199/16, a less-specific route. During the leak, this route (along with about 4,000 others) appeared in the global routing table as originated by AS577 and transited by AS3356. About 40% of our BGP sources had these leaked routes in their routing tables and most chose NTT (AS2914) to reach AS3356 en route to AS577 (below right).
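The pattern described above, a route originated by AS577 but transited by AS3356, can be spotted from the AS path alone. The sketch below assumes AS paths arrive as space-separated strings of AS numbers, as in many BGP dump formats; the function name `is_leak_candidate` and the sample paths are hypothetical (AS64500 is a documentation ASN), and AS sets are not handled.

```python
def is_leak_candidate(as_path, origin_asn=577, transit_asn=3356):
    """True if the path ends at origin_asn and passes through transit_asn.

    Assumes a space-separated string of plain AS numbers (no AS sets).
    Hypothetical helper for illustration only.
    """
    hops = [int(x) for x in as_path.split()]
    if not hops or hops[-1] != origin_asn:
        return False  # the route is not originated by the expected AS
    return transit_asn in hops[:-1]

# Sample paths as seen from an imaginary vantage point, AS64500.
paths = [
    "64500 2914 3356 577",  # reaches AS577 via NTT and Level 3
    "64500 577",            # reaches AS577 directly
    "64500 3356 6453",      # transits Level 3 but has a different origin
]

flags = [is_leak_candidate(p) for p in paths]
share = sum(flags) / len(flags)  # fraction of sample paths matching the pattern
print(flags, share)
```

A match on this test only marks a path as worth a closer look. Whether it is actually a leak depends on whether that origin normally reaches the vantage point through that transit network.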
Comcast (various ASNs)
Comcast, the largest internet service provider in the United States, was also directly impacted by yesterday’s routing leak.
Comcast uses numerous ASNs to operate their network and Level 3 leaked prefixes from quite a few of them, diverting and slowing internet traffic bound for Comcast. According to our data, Level 3 leaked over 3000 prefixes from 18 of Comcast’s ASNs listed below.
Our traceroute measurements into Comcast reveal the impact of the leak from a performance standpoint. The two visualizations below show a bulge of internet traffic headed for the leaked IP address space diverted through Level 3, and the increase in observed latency.
Level 3 leaked 81 prefixes from RCN, which appeared to pull the plug on its Level 3 connection at 18:34 UTC, once it figured out what was causing a slowdown in its network.
Level 3 leaked 97 prefixes from Netflix (AS2906) including the following:
Impacts were not limited to the United States. Networks in Brazil, Argentina and the UAE also had routes leaked by Level 3 yesterday. Below are example routes leaked from Giga Provedor de Internet Ltda (AS52610, 42 leaked prefixes), Cablevision S.A. (AS10481, 365 leaked prefixes), and even the Weill Cornell Medical College in Qatar (AS32539, 3 leaked prefixes):
It is important to keep in mind that the internet is still a best-effort endeavor, held together by a community of technicians in constant coordination. In this particular case, initial clues as to the origin of this incident were first reported in a technical forum (the outages list) when Job Snijders astutely observed new prefixes being routed between Comcast and Level 3 yesterday.
Peer leaks are a continuing risk to the internet without any silver bullet solution. We previously suggested using protection when peering promiscuously, but even a well-run network like Google has been both the leaker and the leaked.
Networks share more-specific routes to a peer in order to ensure that return traffic comes directly back over the peering link. But there is always the risk that the peer could leak those routes and adversely affect your network. When the leaker is the biggest telecom in the world (and only getting bigger), the impact is likely to be significant.
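Leaked more-specifics pull traffic so effectively because routers forward on the longest matching prefix, regardless of how attractive the covering route's path is. Here is a minimal Python sketch of that selection rule; the routing table, path labels, and the `best_route` helper are all hypothetical and use placeholder address space.

```python
import ipaddress

def best_route(table, destination):
    """Pick the longest matching prefix for a destination address.

    `table` maps prefix strings to a label describing the path.
    Hypothetical helper for illustration only.
    """
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, path_label in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net:
            matches.append((net, path_label))
    if not matches:
        return None
    net, path_label = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), path_label

# Placeholder table: a covering route plus one leaked more-specific.
table = {
    "10.20.0.0/16": "normal path",
    "10.20.4.0/22": "path via the leaking network",
}

inside = best_route(table, "10.20.5.1")     # covered by the leaked /22
outside = best_route(table, "10.20.200.1")  # covered only by the /16
print(inside, outside)
```

Only destinations inside the leaked /22 are diverted; the rest of the /16 keeps following the covering route. That is why a leak of more-specifics can redirect traffic for part of a network while other parts look unaffected.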