VolumeDrive is a Pennsylvania-based hosting company that uses Cogent and (since late May of this year) Atrato for Internet transit. A routing leak this morning by VolumeDrive was passed on to the global Internet by Atrato, causing disruptions to traffic in places as far-flung from the USA as Pakistan and Bulgaria.
The way Internet transit is supposed to work in BGP is that a provider announces the global routing table to its customers (i.e., a large number of routes). Then, in turn, the customers announce local routes to their respective providers (generally a small number of routes). Each customer selects the routes it prefers from the options it receives. When a transit customer accidentally announces the global routing table back to one of its providers, things get messy. This is what happened earlier today and it had far-reaching consequences.
At 06:49 UTC this morning (18-September), VolumeDrive (AS46664) began announcing to Atrato (AS5580) nearly all the BGP routes it learned from Cogent (AS174). The resulting AS paths were of the following format:
... 5580 46664 174 ...
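To make the pattern concrete, here is a minimal Python sketch of how one might flag such a path. It assumes a small, hand-built table of customer-to-provider relationships (the PROVIDERS_OF dictionary is hypothetical and contains only the relationships described above), and it uses the private-use ASNs 64500 and 64501 as placeholders for the elided ends of the path. It is an illustration of the idea, not a description of any production leak detector.

```python
# Hypothetical relationship table built from the article's description.
# A real table would have to come from routing data or a registry.
PROVIDERS_OF = {
    46664: {174, 5580},  # VolumeDrive buys transit from Cogent and Atrato
}


def find_leak(as_path):
    """Return (receiver, leaker, source) for the first place where a customer
    passes one provider's routes to another of its providers, else None.

    as_path is ordered as it appears in BGP: nearest AS first, origin last.
    """
    for i in range(1, len(as_path) - 1):
        receiver, middle, source = as_path[i - 1], as_path[i], as_path[i + 1]
        providers = PROVIDERS_OF.get(middle, set())
        # A customer sitting between two of its own providers is the leak signature.
        if receiver in providers and source in providers and receiver != source:
            return (receiver, middle, source)
    return None


# 64500 and 64501 are placeholder ASNs standing in for the "..." in the path.
leaked = find_leak([64500, 5580, 46664, 174, 64501])
normal = find_leak([64500, 5580, 46664])
```

With this table, the leaked path is flagged as (5580, 46664, 174), while a path that simply ends at VolumeDrive is not flagged.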
Normally, VolumeDrive announces 39 prefixes (networks) to Atrato: 27 it originates itself and 12 it transits for two of its downstream customers, Visperad Networks (AS15351) and DataWagon (AS27176). However, during this leak, Atrato propagated over 400,000 routes learned from VolumeDrive, or nearly the entire global routing table. A full table is currently hovering at around 500,000 routes, a figure on the minds of many due to the 512k limit on many older routers. (Note that this particular routing leak resulted in no new routes and therefore didn't increase the size of the global routing table.)
The following graphic shows how the Internet has reached VolumeDrive over the past couple of days, using either Atrato or Cogent. VolumeDrive experienced a brief outage earlier in the week. Then just before midnight UTC, Atrato dropped out entirely as a VolumeDrive provider.
When Atrato briefly returned as a provider at 06:49 UTC, VolumeDrive passed them nearly all the routes it learned from Cogent. Evidently Atrato did not have a circuit breaker on the quantity of routes it would accept from VolumeDrive (MAXPREF), because it in turn announced these routes to the rest of the world. ISPs will often use such a limit to avoid propagating an erroneous flood of routes, electing instead to shut down the link to contain the potential damage. That's a good engineering practice, as there are probably no legitimate circumstances where Atrato should receive 400,000 BGP routes from one of its customers. In fact, for a link that normally transits 39 prefixes, seeing a few hundred routes should be enough to trigger an automated response.
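The logic of such a circuit breaker is simple enough to sketch in a few lines of Python. This is a toy model, not router configuration: the CustomerSession class, the threshold of 200 and the placeholder prefix names are all assumptions chosen for illustration, and real implementations differ in details such as warning thresholds and automatic restart timers.

```python
class CustomerSession:
    """Toy model of a BGP session guarded by a maximum-prefix limit."""

    def __init__(self, max_prefixes):
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.up = True

    def receive(self, prefix):
        """Accept a route, or shut the session down once the limit is exceeded."""
        if not self.up:
            return False
        self.accepted.add(prefix)
        if len(self.accepted) > self.max_prefixes:
            # Limit exceeded: discard everything learned and take the session down.
            self.accepted.clear()
            self.up = False
            return False
        return True


# Illustrative threshold for a link that normally carries 39 prefixes.
session = CustomerSession(max_prefixes=200)
for n in range(400_000):
    if not session.receive(f"prefix-{n}"):  # placeholder prefix names
        break
```

In this model the session goes down on the 201st route, so none of the flood is retained or passed along, whereas the usual 39 prefixes would be accepted without incident.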
To recap, a major routing leak occurred, one that was entirely preventable with some common-sense limits. So what? How much impact could this small Pennsylvania hosting company have on the global Internet? Well, quite a lot in fact, such is the nature of our trust-based Internet routing. Pretty much anyone can mess it up.
According to Dyn's IP Transit Intelligence tool (shown below), Atrato transits around 5,000 prefixes and has over 600 peering connections (not simply BGP adjacencies, but "peering" as opposed to "transit") — at least they did at the start of the day. That much peering can act as a very loud amplifier for leaked routes.
Peering relationships are established between providers that wish to exchange traffic, often to avoid paying their upstream providers to carry their customers' traffic. Routes learned from peers are generally prioritized over routes learned from a transit provider because peers often exchange traffic for free (settlement-free). Thus, when bad routes are propagated by a provider with a lot of peers, those routes can travel far and wide.
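A short Python sketch shows why a leaked route learned from a peer can win even when it is much longer. The local-preference numbers and the best_route helper are hypothetical (operators pick their own values and policies), and the ASNs 64500, 64510 and 64511 are placeholders. The only behavior assumed here is the standard one: BGP compares local preference before AS path length.

```python
# Hypothetical local-preference values; operators choose their own numbers.
LOCAL_PREF = {"customer": 300, "peer": 200, "transit": 100}


def best_route(routes):
    """Pick the route with the highest local preference, breaking ties
    by the shortest AS path.

    Each route is a (relationship, as_path) tuple.
    """
    return max(routes, key=lambda r: (LOCAL_PREF[r[0]], -len(r[1])))


routes = [
    ("transit", [64500, 64510]),                 # short path via a paid upstream
    ("peer", [5580, 46664, 174, 64511, 64510]),  # longer leaked path via a peer
]
chosen = best_route(routes)
```

Here the five-hop path through the peer is selected over the two-hop transit path, because path length is only consulted when the local preferences are equal.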
The following graphics depict the number of our traceroute measurements completing from both Islamabad, Pakistan and Los Angeles to China Unicom, China's second largest ISP after China Telecom. The dips in completion rate begin at 06:49 UTC when, instead of going through Singtel (in the case of Islamabad) or Telia (in the case of Los Angeles), traffic was diverted to Atrato. Many of these traces never reached their intended destinations.
These graphics are generated from thousands of measurements; however, examining individual traceroutes reveals the exact details of the path changes during this incident. The traceroute shown below was performed yesterday and takes a path from Islamabad to Karachi, where it boards a submarine cable en route to Singapore before finally reaching Zhengzhou, China. Geographically, it's a reasonable route even if the recorded latencies are quite high.
trace from Islamabad, Pakistan to China Unicom Henan Province at 09:08 Sep 17, 2014
2 188.8.131.52 (PTCL, Islamabad, PK) 0.66ms
3 184.108.40.206 s10-0-3-0.rwp44d1.pie.net.pk 0.523ms
4 220.127.116.11 (ITI, Rawalpindi, PK) 3.907ms
5 18.104.22.168 (ITI, Karachi, PK) 30.205ms
6 22.214.171.124 (PTCL, Karachi, PK) 26.221ms
7 126.96.36.199 (SingTel IX, Singapore) 226.13ms
8 188.8.131.52 (Singtel, Singapore) 521.436ms
9 184.108.40.206 (China Unicom, China) 522.932ms
10 220.127.116.11 (China Unicom, China) 472.701ms
11 18.104.22.168 (Backbone of China Unicom) 519.328ms
12 22.214.171.124 (China Unicom Henan province) 500.175ms
13 126.96.36.199 (China Unicom, Zhengzhou) 484.009ms
14 188.8.131.52 (China Unicom, Zhengzhou) 494.929ms
Next is a traceroute from the same server to the same IP in Zhengzhou during the routing leak. Instead of passing through Singapore en route to China, the path goes first to Atrato in Amsterdam (Atrato peers with PTCL at AMS-IX) and then on to Telia, which carries it to San Jose, California before it finally arrives in China. By exporting the leaked routes from VolumeDrive, Atrato, in effect, inserted itself into the path between Pakistan and China! Atrato could easily have overwhelmed its own capacity at this time, as many of our measurements did not reach their intended destinations.
trace from Islamabad, Pakistan to China Unicom Henan Province at 06:59 Sep 18, 2014
2 184.108.40.206 (PTCL, Islamabad, PK) 0.484ms
3 220.127.116.11 s10-0-3-0.rwp44d1.pie.net.pk 0.567ms
4 18.104.22.168 (ITI, Rawalpindi, PK) 2.838ms
5 22.214.171.124 (ITI, Karachi, PK) 27.867ms
6 126.96.36.199 (PTCL, Karachi, PK) 29.401ms
7 188.8.131.52 khi77.pie.net.pk 164.01ms
8 184.108.40.206 eth15-2.r1.ams2.nl.atrato.net 165.093ms
9 220.127.116.11 eth1-1.core1.ams2.nl.as5580.net 170.78ms
10 18.104.22.168 eth1-7.core1.ams1.nl.as5580.net 172.437ms
11 22.214.171.124 (Atrato, Amsterdam, NL) 159.247ms
12 126.96.36.199 adm-b5-link.telia.net 252.254ms
13 188.8.131.52 adm-bb3-link.telia.net 237.843ms
14 184.108.40.206 ldn-bb1-link.telia.net 243.867ms
15 220.127.116.11 nyk-bb1-link.telia.net 246.865ms
16 18.104.22.168 sjo-bb1-link.telia.net 316.894ms
17 22.214.171.124 chinaunicom-ic-141282-sjo-bb1.c.telia.net 356.87ms
18 126.96.36.199 (China Unicom, China) 355.425ms
19 188.8.131.52 (China Unicom, China) 350.21ms
20 184.108.40.206 (Backbone of China Unicom) 356.632ms
21 220.127.116.11 (Backbone of China Unicom) 644.495ms
22 18.104.22.168 (China Unicom Henan province) 590.476ms
23 22.214.171.124 (China Unicom, Zhengzhou) 537.887ms
24 126.96.36.199 (China Unicom, Zhengzhou) 543.934ms
In the newly released Dyn Internet Intelligence tool, such impairments in the flow of traffic show up as gaps in completed latency measurements, illustrated below:
As an example of good Internet hygiene, SK Broadband (formerly Hanaro) apparently shut down its connection to Atrato during the leak. Traceroutes from locations around the world (ranging from Denver to Bulgaria) reveal that traffic suddenly shifted from Atrato to NTT en route to SK Broadband. It appears that SK Broadband automatically turned down its connection to Atrato when faced with a flood of bogus routes, as illustrated below.
Not all routing leaks are origination leaks, as in the Indosat leak earlier this year or the China Telecom leak of 2010. In that scenario, a provider announces the global routing table claiming that it is the "origin" (and therefore the destination) for every single routed network in the Internet. Routing leaks can also occur when routes are simply passed in the wrong direction between providers.
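The difference between the two kinds of leak shows up at the right-hand end of the AS path, and a rough Python sketch can make that explicit. The classify_leak function and its labels are hypothetical, the ASNs 64500, 64510 and 64520 are placeholders, and the check assumes the caller already knows both the rightful origin of the prefix and that the suspect AS has no business transiting it.

```python
def classify_leak(as_path, rightful_origin, suspect):
    """Label an AS path (nearest AS first, origin last) relative to a suspect AS.

    Assumes the suspect is not a legitimate transit provider for this prefix.
    """
    origin = as_path[-1]
    if origin == suspect and origin != rightful_origin:
        # The suspect claims to be the destination itself.
        return "origination leak"
    if suspect in as_path[:-1] and origin == rightful_origin:
        # The origin is intact; the suspect merely passed the route along.
        return "path leak"
    return "no leak attributed"


# Today's incident: the origin is unchanged, VolumeDrive sits mid-path.
today = classify_leak([64500, 5580, 46664, 174, 64510], 64510, 46664)
# An origination leak: a placeholder AS 64520 claims someone else's prefix.
other = classify_leak([64500, 64520], 64510, 64520)
```

Under these assumptions, today's path is labeled a path leak because the origin is still correct, while the second example is labeled an origination leak.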
While basic route hygiene (e.g. using MAXPREF to limit the number of routes accepted from a customer or peer) could have prevented this incident, it underscores a larger point: we're all in this together. The Internet is our electronic commons and its proper functioning depends on everyone in control of an Internet router.
Routing goof-ups like this don't need to involve your IP address space to impact you. If they impact the routes of someone you are trying to communicate with, or one of the ISPs along the way, then it's your problem too. By understanding how the Internet works and gathering real-time Internet Intelligence on your assets and those of your customers or suppliers, you can work with your providers to mitigate the damage caused by the inevitable mistakes, even when the source is on the other side of the world.