The Oracle Dyn team behind this blog have frequently covered 'network availability' in our blog posts and Twitter updates, and it has become a common topic of discussion after natural disasters (like hurricanes), man-made problems (including fiber cuts), and political instability (such as the Arab Spring protests). But what does it really mean for the Internet to be "available"? Since the Internet is defined as a network of networks, there are various levels of availability that need to be considered. How does the (un)availability of various networks impact an end user's experience, and their ability to access the content or applications that they are interested in? How can this availability be measured and monitored?
Deriving Insight From BGP Data
Many Tweets from @DynResearch feature graphs similar to this one, which was included in a September 20 post that noted "Internet connectivity in #PuertoRico hangs by a thread due to effects of #HurricaneMaria."
Two graphs are shown, "Unstable Networks" and "Number of Available Networks", and the underlying source of information for both is noted to be BGP data. The Internet analysis team at Oracle Dyn collects routing information in over 700 locations around the world, giving us an extensive picture of how the networks that make up the Internet are interconnected with one another. Using a mix of commercial tools and proprietary enhancements, we are also able to geolocate the IP address (network) blocks that are part of these routing announcements; that is, we know with a high degree of certainty whether a given network block is associated with Puerto Rico, Portugal, or Pakistan. With that insight, we can then determine the number of networks that are generally associated with a given geography. The lower "Number of Available Networks" graph shows the number of networks (IP address blocks, also known as "prefixes") that we have geolocated to that particular geography. This number declines when paths to those networks are no longer present in routing announcements (are "withdrawn"), and increases when paths to those networks become available again. The upper "Unstable Networks" graph represents the number of networks that have recently exhibited route instability: when we see a flurry of messages about a network, we consider it to be unstable.
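To make the two graphs concrete, here is a minimal Python sketch of how such counts could be derived from a stream of BGP updates for prefixes already geolocated to one geography. It is not Oracle Dyn's actual pipeline: the `PrefixTracker` class, the five-minute window, and the five-update threshold for "a flurry of messages" are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PrefixTracker:
    """Tracks which prefixes have a path and which look unstable.

    window_seconds and unstable_threshold are illustrative assumptions,
    not values taken from the original post.
    """
    window_seconds: float = 300.0
    unstable_threshold: int = 5
    paths: dict = field(default_factory=dict)   # prefix -> set of peers with a path
    recent: dict = field(default_factory=dict)  # prefix -> deque of update timestamps

    def update(self, ts, peer, prefix, kind):
        """Apply one BGP update; kind is 'announce' or 'withdraw'."""
        peers = self.paths.setdefault(prefix, set())
        if kind == "announce":
            peers.add(peer)
        elif kind == "withdraw":
            peers.discard(peer)
        else:
            raise ValueError(f"unknown update kind: {kind!r}")
        # Every message, announce or withdraw, counts toward instability.
        self.recent.setdefault(prefix, deque()).append(ts)

    def available_count(self):
        """Number of prefixes that at least one peer still has a path to."""
        return sum(1 for peers in self.paths.values() if peers)

    def unstable_count(self, now):
        """Number of prefixes with many updates inside the recent window."""
        cutoff = now - self.window_seconds
        count = 0
        for stamps in self.recent.values():
            while stamps and stamps[0] < cutoff:
                stamps.popleft()  # forget updates older than the window
            if len(stamps) >= self.unstable_threshold:
                count += 1
        return count
```

Feeding each update through `update()` and sampling `available_count()` and `unstable_count(now)` at regular intervals would give two time series shaped like the lower and upper graphs. A prefix only stops counting as available once every peer has withdrawn its path, which matches the idea that the number declines when paths "are no longer present in routing announcements".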
Necessary But Not Sufficient
However, as we mentioned in a previous blog post, "It is worth keeping in mind that core network availability is a necessary, but not sufficient, condition for Internet access. Just because a core network is up does not mean that users have Internet access—but if it is not up, then users definitely do not have access." In other words, if a network (prefix) is being announced, that announcement may be coming from a router in a hardened data center, likely on an uninterruptible power supply (and maybe a generator). Just because the routes (paths to the network prefixes) are seen as being available, it does not necessarily mean that those routes are usable, since the last mile network infrastructure behind them may still be damaged and unavailable.
These "last mile" network connections to your house, your cell phone, or your local coffee shop, library, or place of business are critical links for end user access. When these networks are unavailable, then it becomes hard, if not impossible, for end users to access the Internet. More specifically, the components of the local networks in your house or coffee shop/library/business need to be functional—the routers/modems need to have power, and be connected to the last mile networks. Because of the power issues and physical damage (downed or broken power/phone/cable lines, impaired cell towers) that often accompany natural disasters, these local and last mile networks are arguably the most vulnerable critical links for Internet access.
Determining Last Mile Network Availability
While network availability can be measured at least in part by monitoring updates to routing announcements, last mile network availability can be determined both through reachability testing and by observing traffic originating in those networks. On the latter point, our best perspective is currently provided by requests to Oracle Dyn's Internet Guide, an open recursive DNS resolution service. With this service, end user systems are configured to make DNS requests directly to the Internet Guide DNS resolvers, rather than to the recursive resolvers run by their Internet Service Provider. (Users often do this for performance or privacy reasons, though some ISPs will simply have their users default to using a third-party resolver instead of running their own.) Using the same IP address geolocation tools described above, we can determine where the users appear to be connecting from. Looking at the graph below, we can see a roughly diurnal pattern in DNS traffic in the days before Hurricane Maria made landfall in Puerto Rico. (It is interesting to note that the peaks increase significantly as the hurricane approaches.) However, the rate of queries drops sharply, reaching a near-zero level, at 11:30 UTC on September 20, about an hour and a half after Maria initially made landfall, due to damage caused to local power and Internet infrastructure.
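Because DNS traffic follows a diurnal pattern, a drop is easiest to spot by comparing each hour against the usual volume for that same hour of day. The Python sketch below shows one simple way to do that. The function names, the use of a median baseline, and the 20 percent cutoff are assumptions for illustration only, and any query counts passed in would be hypothetical rather than Internet Guide data.

```python
from statistics import median


def hourly_baseline(history):
    """Median query count for each hour of day (0-23).

    history: iterable of (hour_of_day, query_count) pairs from earlier days.
    """
    by_hour = {}
    for hour, count in history:
        by_hour.setdefault(hour, []).append(count)
    return {hour: median(counts) for hour, counts in by_hour.items()}


def find_drops(observations, baseline, ratio=0.2):
    """Return labels of observations that fall below ratio * baseline.

    observations: iterable of (label, hour_of_day, query_count).
    Hours with no baseline, or a zero baseline, are skipped.
    The default ratio of 0.2 is an arbitrary illustrative cutoff.
    """
    drops = []
    for label, hour, count in observations:
        expected = baseline.get(hour)
        if expected and count < ratio * expected:
            drops.append(label)
    return drops
```

Comparing against the same hour on previous days, rather than against the previous hour, avoids flagging the normal overnight trough as an outage.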
On the former point, regarding reachability testing, this insight can be gathered from the millions of daily traceroutes done to endpoints around the globe. Because the Oracle Dyn team has been actively gathering these traceroutes for nearly a decade, they have been able to identify endpoints across network providers that are reliably reachable, and can serve as a proxy for that network's availability. The graph below illustrates the results of regular traceroutes to an endpoint in Liberty Puerto Rico, a local telecom provider. It shows that traceroutes to IP addresses announced by Liberty PR generally traverse networks including San Juan Cable, AT&T, and AT&T Mobility Puerto Rico. These networks are some of Liberty PR's "upstream providers", connecting it to the rest of the Internet. It is clear that the number of responding targets (of these traceroutes) drops sharply just before mid-day (UTC) on September 20, and further degrades over the next 15 hours or so, reaching zero just after midnight. These endpoints presumably became unreachable as power was lost around the island, copper and fiber lines were damaged, etc.
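The "number of responding targets" in a graph like this can be thought of as a per-interval count of distinct endpoints that answered. A minimal Python sketch of that aggregation follows. The input tuple format, the function name, and the one-hour bin size are assumptions, not a description of how Oracle Dyn stores or processes its traceroute data.

```python
from collections import defaultdict


def responding_targets(results, bin_seconds=3600):
    """Count distinct targets that answered in each time bin.

    results: iterable of (timestamp, target, reached) tuples, where reached
    is True when the traceroute got a reply from the target itself.
    Returns a dict mapping bin start time -> number of responding targets.
    Bins that were probed but got no replies map to 0.
    """
    seen = defaultdict(set)
    for ts, target, reached in results:
        bin_start = int(ts // bin_seconds) * bin_seconds
        bucket = seen[bin_start]  # create the bin even if nothing replied
        if reached:
            bucket.add(target)
    return {start: len(targets) for start, targets in sorted(seen.items())}
```

Keeping the empty bins matters here: a bin with probes but zero replies is the signal of interest, and it should not be confused with a bin in which no measurements were taken at all.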
Above, we have examined the various ways that Oracle monitors and measures network availability in the face of disaster-caused damage. However, there is another common cause of Internet outages: government-ordered shutdowns. In the past several years, we have seen Iraq shut down Internet access to prevent cheating on exams, and Syria has taken similar steps, as shown in the graph below. We have also seen countries such as Egypt shut down access to the global Internet in response to widespread protests against the government. In countries where such actions occur, the core networks often connect to the global Internet through a state-owned/controlled telecommunications provider and/or through a limited number of network providers at their international border. This situation was examined in more detail in a blog post published nearly five years ago by former Dyn Chief Scientist Jim Cowie. The post, entitled "Could It Happen In Your Country?", examines the diversity of Internet infrastructure at national borders, classifying the risk potential for Internet disconnection.
In these cases, our measurements will see the number of available networks decline, often to zero, because all routes to the country's networks have been withdrawn. In other words, the networks within the country may still be up and functional, but other Internet network providers elsewhere in the world have no way of reaching these in-country networks because paths to them are no longer present within the global routing tables.
In order for Internet access to be "available" to end users, international connectivity, core network infrastructure, and last mile networks must all be up, available, and interconnected. Availability of these networks can be measured and monitored through the analysis of several different data sets, including BGP routing announcements, recursive DNS traffic, and traceroute paths, and further refined through the analysis of Web traffic and EDNS Client Subnet information in authoritative DNS requests.
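One way to picture how these data sets fit together is as a layered check, where routing visibility is tested first and last mile signals only matter once routes are present. The Python sketch below is purely illustrative: the `Signals` fields, the `assess` function, the category labels, and both thresholds are assumptions made for this example and do not reflect a classification that Oracle Dyn publishes.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    routed_prefixes: int     # prefixes currently visible in BGP
    baseline_prefixes: int   # typical number of prefixes for this geography
    responding_targets: int  # traceroute targets that answered
    dns_query_ratio: float   # current DNS query rate / usual rate for this hour


def assess(sig, route_floor=0.5, dns_floor=0.2):
    """Return a rough label for where a disruption appears to sit.

    route_floor and dns_floor are arbitrary illustrative thresholds.
    """
    if sig.baseline_prefixes > 0 and sig.routed_prefixes == 0:
        # All routes withdrawn: in-country networks may still be up,
        # but nothing outside has a path to them.
        return "unrouted"
    if sig.routed_prefixes < route_floor * sig.baseline_prefixes:
        return "core degraded"
    if sig.responding_targets == 0 or sig.dns_query_ratio < dns_floor:
        # Routes exist, but users behind them do not appear to be online.
        return "routed but last mile impaired"
    return "no disruption detected"
```

The ordering of the checks mirrors the "necessary but not sufficient" point made earlier: missing routes settle the question immediately, while present routes still have to be confirmed by traffic or reachability evidence from the networks behind them.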
And as always, we will continue to measure and monitor Internet availability around the world, providing evidence of brief and ongoing/repeated disruptions, whatever the underlying cause.