Today we’re proud to launch our third tool, Oracle Internet Routing 3D Visualization, under a new initiative we are calling Oracle’s Safer Internet Initiative.
The Safer Internet Initiative already comprises our Internet Intelligence Map and IXP Filter Check. The mission is simple: help make the Internet safer so enterprise customers, and all users, can feel confident moving sensitive data to the cloud and operating critical workloads in an online environment. We recognize that Oracle Cloud customers don’t just need to trust the security of our cloud; they also need to be able to rely on the public internet for critical business operations.
Oracle Internet Routing 3D Visualization is a service that aims to increase the public’s understanding of internet routing events such as BGP leaks and hijacks through novel visualizations. This new service will be free to the public.
In this post, we describe a new multi-dimensional BGP analysis technique that aims to better explore and understand propagation during routing leaks. And, with our new 3D Visualization tool launched today, we use this technique to revisit and re-analyze nearly 100 routing leak incidents from recent years.
Oracle Internet Routing 3D Visualization offers a unique perspective of these disruptive occurrences and helps to answer questions about which networks and geographies were affected the most over the course of any given incident.
Our hope is that a better understanding of route-leak propagation will help inform and assess the efficacy of technologies such as RPKI and Peer-lock that aim to mitigate the effects of routing leaks. Additionally, we intend to add future analyses if and when major routing leaks occur.
Large disruptive routing leaks have plagued the internet for years. Typically caused by human error and compounded by technical misconfiguration, route leaks are defined in RFC 7908 as “the propagation of routing announcement(s) beyond their intended scope”; or, in other words, sending BGP announcements where they shouldn’t go.
Every year, multiple large routing leaks disrupt internet connections, and 2019 has been no exception. In June alone, there were two noteworthy incidents: the leak from AS21217 to AS4134 on 6 June and the leak from AS396531 to AS701 on 24 June.
Press coverage described the scale of these incidents using prefix (route) count: “more than 70,000 Internet routes” Ars Technica reported for the first incident and “more than 20,000 IP address prefixes” according to The Register for the second. We’re as guilty as anyone of using this shorthand in the write-ups of leak events over the years, but prefix count is insufficient for capturing the nuance of these incidents.
Prefix count as a routing leak metric lacks nuance
In the analysis following a major routing leak, we often gauge the impact in terms of a single number: the count of unique prefixes mistakenly announced. However, this one-dimensional view of a complex incident obscures the fact that not every leaked route is in circulation for the same amount of time or propagated by the same number of autonomous systems (networks).
When we include these additional dimensions (duration and propagation) in the picture, we can see that leaks often exhibit a long tail of leaked prefixes that are picked up by very few ASes. This long tail likely serves to inflate, and potentially overstate, the impact of any particular major incident, which makes the overall count of unique leaked prefixes a problematic metric.
For example, different analyses of the same incident have arrived at dramatically different estimates of the size of a leak - sometimes differing by thousands of prefixes. Take the AS4788 leak from 2015: BGPmon reported that the leak involved 176,000 prefixes, while our analysis observed 260,000 leaked prefixes. This seemingly large difference can be explained by subtle differences in the BGP sources used in the analyses. Generally, the prefixes that contribute to these differences were accepted by very few ASes and thus had very little operational impact on the internet. This type of discrepancy illustrates the limitation of using prefix count as the sole metric for assessing the impact of a routing leak.
There has to be a better way: 3D Visualization
To better understand our multi-dimensional approach, it is helpful to visualize it in 3-dimensional space. On the z-axis, we use “peer percentage” as our measurement of route propagation. Peer percentage is the proportion of our peers (BGP sources carrying full routing tables) that accepted a route at a given moment in time. The x-axis is simply the set of unique prefixes announced in the leak, sorted in descending order of their aggregate peer percentage values. Finally, the y-axis is time in 1-minute intervals.
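The axes described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's actual implementation: the input format (one record per minute, prefix, and peer carrying the leaked route), the function name `peer_percentage_surface`, and the sample prefixes are all assumptions made for the example.

```python
from collections import defaultdict

def peer_percentage_surface(observations, total_peers):
    """Build {prefix: {minute: peer_percentage}} from (minute, prefix, peer_id) records.

    Each record is assumed to mean: during this 1-minute interval, this
    full-table peer carried the leaked route for this prefix.
    """
    peers_seen = defaultdict(lambda: defaultdict(set))
    for minute, prefix, peer_id in observations:
        peers_seen[prefix][minute].add(peer_id)

    # z-axis: percentage of peers carrying the route in each (prefix, minute) cell
    surface = {
        prefix: {m: 100.0 * len(peers) / total_peers for m, peers in by_minute.items()}
        for prefix, by_minute in peers_seen.items()
    }
    # x-axis: prefixes sorted by aggregate peer percentage, highest first
    order = sorted(surface, key=lambda pfx: sum(surface[pfx].values()), reverse=True)
    return surface, order

# Hypothetical sample data using documentation prefixes
observations = [
    (0, "192.0.2.0/24", "peer-a"),
    (0, "192.0.2.0/24", "peer-b"),
    (1, "192.0.2.0/24", "peer-a"),
    (0, "198.51.100.0/24", "peer-a"),
]
surface, order = peer_percentage_surface(observations, total_peers=4)
```

With four peers in total, the first prefix reaches 50% of peers in minute 0 and 25% in minute 1, so it sorts ahead of the second prefix on the x-axis.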
Above is a 3D visualization of the AS396531 leak from 24 June 2019, which began at 10:37 UTC. We observed that the number of unique prefixes involved in this leak was over 29,000. However, as can be seen, the vast majority of those prefixes didn’t propagate to many ASes. In fact, fewer than 500 prefixes were widely circulated at all (see the raised surface in red towards the left of the graph), a point we examine further below.
Of course, there are some assumptions inherent in this approach. We assume that all prefixes are equal and independent of one another, and likewise that all ASes are equal and independent. While it is true that not all prefixes and ASes are equivalent, these assumptions are necessary for an analysis focused on internet-wide route propagation.
The New Analysis Tool
Within this site, multi-dimensional analyses of major routing leaks are available for study. We’ve included routing leaks that involved at least 100 prefixes and were seen by at least 10% of our BGP sources. If there are routing incidents that you think we missed, please let us know and we can add them to the corpus of events.
At the top level, the tool lists the routing leaks that have been analyzed, including the ASNs involved, start time, duration, and prefix counts. The prefix counts are split into two numbers: “All” is the count of leaked prefixes regardless of BGP source count, while “Significant” is the count of leaked prefixes observed by more than 1% of our BGP sources.
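The split between “All” and “Significant” can be illustrated with a short Python sketch. It assumes each leaked prefix has already been summarized by the peak percentage of BGP sources that carried it; that summary statistic, the function name `prefix_counts`, and the sample values are assumptions for illustration, not a description of how the tool computes its numbers.

```python
def prefix_counts(peak_peer_pct, threshold_pct=1.0):
    """Return (all_count, significant_count) for one incident.

    peak_peer_pct maps each leaked prefix to the highest percentage of BGP
    sources that carried it at any point (an assumed summary statistic).
    """
    all_count = len(peak_peer_pct)
    # "Significant" means observed by strictly more than the threshold
    significant = sum(1 for pct in peak_peer_pct.values() if pct > threshold_pct)
    return all_count, significant

# Hypothetical sample: one widely seen prefix, two in the long tail
peaks = {"192.0.2.0/24": 42.0, "198.51.100.0/24": 0.4, "203.0.113.0/24": 1.0}
counts = prefix_counts(peaks)
```

Here the prefix seen by exactly 1% of sources is not counted as significant, because the threshold is "more than 1%".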
An example is shown below:
For each incident, an interactive visualization allows the user to set filters based on the BGP origins or the geolocation of the leaked routes. By setting a filter, the user can compare the propagation of affected routes between different origins or countries, as shown below.
The tool lists the top origins and countries by impact; we calculate aggregate propagation, the sum of the area under the surface plot, and use this as a measure of impact.
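As a rough illustration of aggregate propagation, the sketch below sums peer percentage over every prefix and every 1-minute interval, then ranks origins by their share of that total. The data layout, the function names, and the private-use ASNs are assumptions made for the example; the tool's exact calculation may differ.

```python
from collections import defaultdict

def aggregate_propagation(surface):
    """Sum peer percentage over every (prefix, minute) cell of the surface."""
    return sum(sum(by_minute.values()) for by_minute in surface.values())

def impact_by_origin(surface, origin_of):
    """Rank origin ASNs by the aggregate propagation of their leaked prefixes."""
    totals = defaultdict(float)
    for prefix, by_minute in surface.items():
        totals[origin_of[prefix]] += sum(by_minute.values())
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical surface: {prefix: {minute: peer_percentage}}
surface = {
    "192.0.2.0/24": {0: 50.0, 1: 25.0},
    "198.51.100.0/24": {0: 25.0},
    "203.0.113.0/24": {0: 10.0, 1: 10.0},
}
origin_of = {"192.0.2.0/24": 64500, "198.51.100.0/24": 64501, "203.0.113.0/24": 64501}

total = aggregate_propagation(surface)
ranking = impact_by_origin(surface, origin_of)
```

In this toy data the origin with a single widely propagated prefix outranks the origin with two barely propagated ones, which is the distinction that a raw prefix count hides.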
Recent Routing Leaks Revisited in Oracle Internet Routing 3D Visualization
4134_21217 leak (6 June 2019)
As reported in June, AS21217 leaked over 70,000 routes to AS4134 beginning at 09:43 UTC until around 12:20 UTC. This incident, depicted below in Oracle Internet Routing 3D Visualization, reveals that a large portion of that unique prefix count had routes that didn’t propagate very far at all and likely had little operational impact on the internet.
If we were to set a threshold to include only leaked prefixes that were seen by at least 1% of our peering base, the prefix count would drop from 78,252 prefixes to 15,373. Such a conservative threshold would simply remove the leaked prefixes that were observed by only a few peering sessions and that arguably inflate the overall prefix count.
Alternatively, for those few thousand leaked routes that were widely circulated, nearly all of these were routes that normally had limited propagation by design. Sometimes telecoms and content providers will use “regional routes,” which only appear in the routing tables of a subset of global ASes, typically limited to a geographic region. These routes are intended to offer special handling for traffic from a certain part of the world - for everyone else there is a less-specific route.
When these regional routes are leaked, they can propagate widely. Since the rest of the internet doesn’t carry the legitimate route, it gladly accepts the leaked route as there is nothing for it to compete against. The leaked version fills the void left by the intentionally limited propagation of a regional route. This is perhaps an underappreciated risk of the use of regional routes.
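The mechanism can be sketched with a toy longest-prefix match in Python. The prefixes come from a benchmarking range and the path labels are placeholders; both are assumptions for illustration and do not correspond to any route involved in the incident.

```python
import ipaddress

def best_route(table, destination):
    """Longest-prefix match: return the most specific route covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, path) for net, path in table.items() if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen) if matches else None

less_specific = ipaddress.ip_network("198.18.0.0/15")
regional = ipaddress.ip_network("198.18.0.0/16")

# An AS outside the region normally carries only the less-specific route.
table = {less_specific: "legitimate path"}
before = best_route(table, "198.18.5.1")

# The leaked more-specific has no legitimate /16 to compete against in this
# table, so it is accepted and wins the longest-prefix match.
table[regional] = "leaked path"
after = best_route(table, "198.18.5.1")
```

Before the leak, traffic follows the legitimate less-specific; once the leaked /16 is in the table, the same destination is forwarded along the leaked path.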
That is what is happening in the graphics included in our write-up of the incident from June. In the example below, 220.127.116.11/16 was a regionally announced more-specific route of 18.104.22.168/15. During the event, the leaked /16 propagated farther than the legitimate /16 which had its propagation intentionally limited.
701_396531 leak (24 June 2019)
Only 18 days after the previous incident, another large disruptive routing leak occurred. At 10:35 UTC on 24 June 2019, AS396531 leaked over 29,000 routes to AS701 which were carried on to the internet. As illustrated below, like the previous incident, only a subset of the leaked routes was actually widely circulated.
As noted in several analyses of the event, this leak was exacerbated by the use of a BGP route-optimizer. Route-optimizers often conduct their traffic engineering through the use of more-specific routes intended to remain within an AS. These more-specifics can wreak havoc when leaked out into the global routing table.
The appearance of more-specifics in a route leak can be an indicator of the presence of a route-optimizer, but this isn’t always the case. In another major routing leak that we analyzed in November 2017, more-specifics had been used to influence the return path of traffic from a Tier-1 provider. In that case, a routing leak introduced more-specific routes into the global routing table, but it was not caused by a route-optimizer.
While more-specifics might be a weak indicator of the presence of a route-optimizer, what really gives the route-optimizer away is the complete lack of prepending visible in the leaked routes. As our recent analysis showed, prepending (the use of repeated ASNs in AS paths for traffic engineering) is so over-used that it can be considered a widespread, self-inflicted routing vulnerability. Finding a large batch of routing announcements with absolutely no prepending is unnatural and could only be caused by a route-optimizer stripping out the unneeded traffic engineering technique.
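A simple way to check a batch of announcements for prepending is to look for an ASN repeated in consecutive positions of the AS path. The sketch below assumes AS paths are available as lists of integers; the function names and sample paths are hypothetical, and a share of zero is only a heuristic signal, not proof of a route-optimizer.

```python
def has_prepending(as_path):
    """True if any ASN appears twice in a row in the AS path."""
    return any(a == b for a, b in zip(as_path, as_path[1:]))

def prepending_share(as_paths):
    """Fraction of AS paths that contain prepending."""
    if not as_paths:
        return 0.0
    return sum(has_prepending(p) for p in as_paths) / len(as_paths)

# Hypothetical AS paths using private-use ASNs
paths = [
    [64500, 64501, 64502],
    [64500, 64501, 64502, 64502, 64502],  # origin prepended twice
    [64500, 64503],
    [64500, 64504, 64504],                # origin prepended once
]
share = prepending_share(paths)
```

Applied to a large set of leaked routes, a share at or near zero would match the pattern described above.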
What if RPKI invalids had been dropped?
Getting back to the leak at hand, out of more than 29,000 unique prefixes touched by the leak, only about 450 were widely circulated. Of these 450 prefixes, only 263 were more-specifics likely introduced by a route-optimizer and the rest were regional routes that, as in the previous routing leak incident, normally had limited circulation.
Had AS701 been dropping RPKI invalids, it would have dropped 158 routes (0.5% of the total prefix count of the incident). Of the more than 29,000 prefixes that we observed leaked, 26,873 had no ROA and would be classified as UNKNOWN to an RPKI filter. The remaining 2,145 prefixes had ROAs and would have been VALID, because a route-optimizer preserves the origin in the AS path.
Of those invalid routes, 130 would have been INVALID_LENGTH and 28 would have been INVALID_ASN. It is in those 130 INVALID_LENGTH routes that RPKI could have made a difference – those that were INVALID_ASN have misconfigured ROAs and are always invalid. These included a sizable portion of more-specifics generated by the route-optimizer and included more-specifics of Cloudflare routes with ROAs. This relatively small number of routes constituted the largest contributor to connectivity disruption and underscores the idea that when it comes to operational internet impact, not all prefixes are equal.
Additionally, if a network hopes to use RPKI to fend off a routing leak by a route-optimizer, it will need to set its maxLength for each ROA to be an exact match of the prefix length of the routed prefix. Anything greater leaves room for the route optimizer to announce an RPKI-valid more-specific.
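The effect of maxLength can be seen in a small origin-validation sketch. It follows the general shape of RPKI route origin validation and reuses the labels from this post (VALID, INVALID_LENGTH, INVALID_ASN, UNKNOWN), but the function name, the ROA tuple layout, and the sample prefixes and ASNs are assumptions for illustration, not a production validator.

```python
import ipaddress

def validate(route_prefix, origin_asn, roas):
    """Classify a route against ROAs given as (prefix, max_length, asn) tuples."""
    route = ipaddress.ip_network(route_prefix)
    covering = []
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        # A ROA is relevant only if its prefix covers the announced route.
        if roa_net.version == route.version and route.subnet_of(roa_net):
            covering.append((max_length, asn))
    if not covering:
        return "UNKNOWN"
    if any(asn == origin_asn and route.prefixlen <= max_length
           for max_length, asn in covering):
        return "VALID"
    if any(asn == origin_asn for _, asn in covering):
        return "INVALID_LENGTH"  # right origin, but more specific than allowed
    return "INVALID_ASN"

# Hypothetical ROAs for the same /15: one exact-match, one with loose maxLength
exact_roa = [("198.18.0.0/15", 15, 64500)]
loose_roa = [("198.18.0.0/15", 24, 64500)]
more_specific = "198.18.4.0/24"  # e.g. generated by a route-optimizer

results = {
    "exact": validate(more_specific, 64500, exact_roa),
    "loose": validate(more_specific, 64500, loose_roa),
    "other_origin": validate(more_specific, 64999, exact_roa),
    "no_roa": validate("203.0.113.0/24", 64500, exact_roa),
}
```

Because the route-optimizer keeps the original origin ASN, the more-specific is rejected only under the exact-match ROA; with the loose maxLength it validates and would pass an RPKI filter.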
We need better metrics than simple prefix count to convey the scale of a major routing leak. As discussed above, prefix count can overstate the number of prefixes that experienced operational impact and can vary widely based on the BGP sources used. Alternatively, a multi-dimensional approach like the one we’ve developed, which takes the extent of propagation and duration of each leaked prefix into consideration, offers a richer and more nuanced look at each incident.
If we are going to use prefix count to convey the size of a routing leak, then maybe it should be required that we simply state the number of prefixes seen by more than 1% of one’s BGP sources, or use another similarly conservative threshold. A more nuanced option might be to use aggregate propagation, the area under the surface plot as we’ve defined it, as a kind of Richter scale for seismic events in the global routing table.
We hope that these interactive routing leak autopsies will help inform discussion around routing leaks and that this transparency will lead to greater preventative action and adoption of best practices. Please take a look at them and tell us what you think! And stay tuned for more announcements from our Safer Internet Initiative later this year.