BGP, DNS, and the fragility of our critical systems

After last month's Facebook outage, Malcolm Heath, Senior Threat Researcher at F5 Labs, looks a little bit deeper into some of the internet technologies that we rely on so heavily.

This is a contributed article by Malcolm Heath, Senior Threat Researcher at F5 Labs.

On October 4th, 2021, Facebook experienced a six-hour global outage that extended to its related properties, including WhatsApp, Instagram, and Oculus VR.

In a blog post, Santosh Janardhan, Facebook’s vice-president of engineering, explained that the outage began when the company’s engineers issued a command that unintentionally disconnected Facebook data centres from the rest of the world.

During a routine maintenance job, a command was issued with the intention of assessing the availability of global backbone capacity. Instead, it took down all the connections in Facebook’s backbone network. Facebook’s systems are designed to audit commands to prevent mistakes like this, but a bug in an audit tool prevented it from stopping the command. The change cut the network connections between Facebook’s data centres and the internet.

Given the magnitude of this event, we thought it would be good to dig a little bit deeper into some of the internet technologies that we so heavily rely on.

It's always DNS

The Domain Name System (DNS) is a single point of failure for internet systems. DNS maps names, such as facebook.com, to IP addresses, allowing users to easily refer to sites by name.

DNS, in effect, provides translation between names and IP addresses, like an address book. When a site’s DNS servers are down, this lookup cannot happen, and people will be unable to reach the site. Keeping DNS servers up, operational, and secure is a critical piece of site reliability.

Except when it’s BGP

Underneath that, there’s another technology that is at least as critical as DNS. This is a routing protocol (one of many) called Border Gateway Protocol (BGP). BGP is the protocol that allows Autonomous Systems (collections of large networks controlled by a single entity) to let other Autonomous Systems know how to reach the networks they control. It doesn't do the routing directly but is the protocol that shares information between routers. Having received this information, routers can make decisions about where to forward data.

Why is BGP important?

As an example, you might type “f5.com” into a web browser. This causes your computer to perform a DNS lookup, and the local DNS server your computer uses will hopefully return an IP address of 107.162.162.40. That’s the address book part.
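The address book idea can be sketched in a few lines of Python. This is a toy stand-in, not a real resolver: the `ADDRESS_BOOK` dictionary, the `resolve` function and its `dns_available` flag are all made up for illustration, and the only entry is the address quoted above.

```python
# Toy "address book": a stand-in for DNS, not a real resolver.
# The single entry reuses the example address from the article.
ADDRESS_BOOK = {"f5.com": "107.162.162.40"}


def resolve(name, dns_available=True):
    """Return the IP address for a name, or None if it cannot be resolved."""
    if not dns_available:
        # The DNS servers are unreachable, so nobody can answer the lookup.
        return None
    return ADDRESS_BOOK.get(name)


print(resolve("f5.com"))                       # 107.162.162.40
print(resolve("f5.com", dns_available=False))  # None
```

The second call shows why DNS matters so much: the site may be perfectly healthy, but without an answer to the lookup the browser has no address to send traffic to.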

Now, however, your computer must be able to send traffic to that IP address. It's important to note that routing decisions are made on a hop-by-hop basis. Each router your data passes through will decide what the next step of the route should be by looking at the destination IP address and consulting its routing table to determine the next place to forward the data.
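As a rough illustration of a single hop, the sketch below shows one router consulting its routing table. It assumes the usual rule that the most specific matching prefix wins. The table contents and next-hop names (`upstream-isp`, `peer-b`) are hypothetical, and `next_hop` is an illustrative helper rather than any real router's API.

```python
import ipaddress

# Hypothetical routing table for one router: (prefix, next hop) pairs.
ROUTING_TABLE = [
    (ipaddress.ip_network("0.0.0.0/0"), "upstream-isp"),  # default route
    (ipaddress.ip_network("107.162.0.0/16"), "peer-b"),
]


def next_hop(destination, table):
    """Pick the next hop using the most specific (longest) matching prefix."""
    dest = ipaddress.ip_address(destination)
    candidates = [(net, hop) for net, hop in table if dest in net]
    if not candidates:
        return None  # no matching route: the packet would be dropped
    _, hop = max(candidates, key=lambda entry: entry[0].prefixlen)
    return hop


print(next_hop("107.162.162.40", ROUTING_TABLE))  # peer-b
print(next_hop("8.8.8.8", ROUTING_TABLE))         # upstream-isp
```

Each router along the way repeats this same lookup against its own table, which is what "hop-by-hop" means in practice.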

If the router participates in BGP, this routing table is constructed from the announcements it has received from other BGP enabled routers.

This will include information on what networks can be reached by which routers. It will also have information about how close that router is to the destination. Close, in this case, doesn't mean the number of routers the data will have to go through, but rather the number of Autonomous Systems that the data will traverse. There is a complex algorithm used to determine which of the possible routes is best. Best can mean a lot of things too, as factors such as egress policies and transit agreements between ISPs are considered as well.

If a router’s routing table shows two neighbouring routers through which it can reach 107.162.162.40, it will pick one of the two, based on those metrics.
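A much-simplified sketch of that choice is shown below. It assumes only two criteria: a locally configured preference (modelled on BGP's LOCAL_PREF attribute, where higher wins) and then the number of Autonomous Systems in the path (where shorter wins). Real BGP implementations apply a longer list of tie-breakers. The `Route` class, the `best_route` helper, the router names and the AS numbers (taken from the range reserved for documentation) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    next_hop: str          # neighbouring router that announced the route
    as_path: tuple         # Autonomous Systems the traffic would traverse
    local_pref: int = 100  # local policy knob; higher is preferred


def best_route(routes):
    """Prefer the highest local preference, then the shortest AS path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


via_x = Route("router-x", (64500, 64501, 64502))
via_y = Route("router-y", (64510, 64502))
print(best_route([via_x, via_y]).next_hop)  # router-y (fewer ASes)

# Policy can override distance: a transit agreement might favour router-x.
preferred_x = Route("router-x", (64500, 64501, 64502), local_pref=200)
print(best_route([preferred_x, via_y]).next_hop)  # router-x
```

The second case is the point about "best" meaning more than "shortest": a business or policy decision can outweigh the AS count.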

Similar routing decisions are made by each router that receives the data, either forwarding it to another router, or determining that it is directly connected to the 107.162.0.0/16 network and delivering the data to the final destination. The same process will be performed in reverse to route the traffic back through another series of routers, and then to the client.

There are a lot of advantages to this scheme. As long as an eventual final destination router for the traffic is available – and most companies with large internet presences have many such routers – our data should (eventually) end up there. As the information required to serve a site is broken down into many packets, they may even take different routes.

This is a feature – if some intermediate router goes down, the packets that compose our request or response can be re-routed to avoid the issue. This is great if the routing tables are consistent and have good information in them. After all, the internet was originally designed to route around nuclear strikes.

Can you supply an enlightening metaphor?

Imagine you want to get to your friend’s house, but you’ve never been there. You look up their address. That’s like the DNS part. Now, you need to figure out how to get there, so you go to the nearest intersection and ask someone which way you should go. They tell you to turn left. You go along that road until you reach another intersection and ask again. This person tells you to go right.

You continue this process until you reach your destination. It's possible that someone will tell you, "normally, I'd say go over the bridge, but the bridge is out, so go left here and ask at the next intersection." Or they may say, "going left is more direct, but going right and getting on the highway is actually faster."

The route you take won't always be the most direct way to get there, nor even necessarily the fastest, but it will help you avoid roadblocks, collapsed bridges, and washed-out roads. If everyone you ask has good info, you will get to where you're going. The means by which that good info is communicated is BGP. If BGP is providing incorrect information, or no information at all about how to get where you want to go, bad things can happen.

Is BGP bulletproof?

In a word, no. It's very robust, and scales well, which is a critical feature when you're trying to interconnect billions of hosts. But problems can occur.

A route announcement can omit routes it should be providing – meaning that the associated network simply disappears from the internet. No one knows how to get there, and the traffic destined for that network will be dropped.
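The effect of a missing route can be sketched as follows. The example assumes a router with no default route, so a destination that matches no known prefix is simply dropped. The `handle_packet` helper and the set of prefixes are hypothetical; 192.0.2.0/24 is a documentation-only range included as filler.

```python
import ipaddress


def handle_packet(destination, known_prefixes):
    """Say what a router with no default route would do with a packet."""
    dest = ipaddress.ip_address(destination)
    for prefix in known_prefixes:
        if dest in ipaddress.ip_network(prefix):
            return "forward"
    return "drop"  # nobody has told this router how to reach the destination


known = {"107.162.0.0/16", "192.0.2.0/24"}
print(handle_packet("107.162.162.40", known))  # forward

# The route is withdrawn, or is simply missing from the next announcement.
known.discard("107.162.0.0/16")
print(handle_packet("107.162.162.40", known))  # drop
```

Nothing about the destination itself has changed between the two calls. Only the router's knowledge of how to reach it has gone.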

This is sometimes done intentionally – it’s called blackholing a route, and it’s typically done to block connections to or from a given network. There are a variety of cases: for example, blocking DDoS traffic from a hostile network or, in some circumstances, removing an entire country from the internet during a time of civil crisis. The result is that the network traffic is simply dropped, often with no notification back to the sender. The network being blackholed will receive no traffic and will effectively be cut off from the (digital) world.

A route can be announced incorrectly as well. A misconfiguration on the part of an Autonomous System can make it appear as if it can route traffic to networks it does not control. Done intentionally, this is called BGP hijacking. While there are defences against it, it has happened many times, causing large amounts of traffic to be routed to very strange places, perhaps in an attempt to capture and inspect the traffic for purposes of espionage.
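One way an incorrect announcement can attract traffic, offered here as an illustration rather than a description of any particular incident, is by covering a smaller, more specific block of addresses than the legitimate route, because routers generally prefer the most specific match. The sketch below assumes receivers accept announcements unfiltered unless a check is applied. The AS numbers come from the range reserved for documentation, and `origin_for`, `is_authorised` and the `AUTHORISED` list are hypothetical, loosely modelled on the idea of validating which AS may originate a prefix.

```python
import ipaddress

# (prefix, origin AS) pairs; the AS numbers are hypothetical.
LEGITIMATE = (ipaddress.ip_network("107.162.0.0/16"), 64500)
BOGUS = (ipaddress.ip_network("107.162.162.0/24"), 64511)

# Hypothetical record of who is allowed to originate which address block.
AUTHORISED = [LEGITIMATE]


def origin_for(destination, announcements):
    """Return the origin AS that would receive traffic for a destination."""
    dest = ipaddress.ip_address(destination)
    covering = [(net, asn) for net, asn in announcements if dest in net]
    if not covering:
        return None
    # The most specific (longest) prefix wins.
    return max(covering, key=lambda entry: entry[0].prefixlen)[1]


def is_authorised(announcement, authorised):
    """Accept only prefixes inside an authorised block with the same origin."""
    net, asn = announcement
    return any(net.subnet_of(block) and asn == owner
               for block, owner in authorised)


heard = [LEGITIMATE, BOGUS]
print(origin_for("107.162.162.40", heard))     # 64511 (traffic is diverted)

filtered = [a for a in heard if is_authorised(a, AUTHORISED)]
print(origin_for("107.162.162.40", filtered))  # 64500 (bogus route rejected)
```

The filter only helps if the list of authorised origins is accurate and if networks actually apply it, which is part of why such incidents still occur.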

Accidents are far more common. For example, a network operator or automated system misconfigures something. The necessary route either disappears entirely, or the misconfiguration ends up creating a routing loop (where traffic is forwarded back and forth between two routers endlessly), or it sends the traffic to a router that doesn’t know anything about the route, which then drops it.
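The three outcomes can be sketched with a toy forwarding simulation. It relies on one real mechanism: every IP packet carries a time-to-live (TTL) counter that is reduced at each hop, so traffic caught in a loop is eventually discarded rather than circulating forever. The router names, the `deliver` function and the small TTL value are hypothetical.

```python
def deliver(start, next_hops, ttl=8):
    """Follow next-hop entries until delivery, a dead end, or TTL expiry."""
    router = start
    path = [router]
    while ttl > 0:
        hop = next_hops.get(router)
        if hop is None:
            return path, "dropped: no route"
        if hop == "DESTINATION":
            return path, "delivered"
        router = hop
        path.append(router)
        ttl -= 1  # each hop uses up one unit of the packet's time-to-live
    return path, "dropped: TTL expired"


healthy = {"A": "B", "B": "DESTINATION"}
looping = {"A": "B", "B": "A"}  # misconfiguration: each points at the other
dead_end = {"A": "B"}           # B knows nothing about the route

print(deliver("A", healthy))         # (['A', 'B'], 'delivered')
print(deliver("A", looping, ttl=4))  # (['A', 'B', 'A', 'B', 'A'], 'dropped: TTL expired')
print(deliver("A", dead_end))        # (['A', 'B'], 'dropped: no route')
```

In all three cases each router is behaving correctly given what it has been told. The failure lies in the information, not in the forwarding.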

A wake-up call

The outage was an unexpected incident for Facebook. However, it is proof that any organisation – no matter how big or small – can be impacted by outages.

The story is a good reminder for us all to pay a bit more attention to this lesser-known but critically important part of the internet's plumbing, and how it helps us get all those cat videos to our browsers in one (eventual) piece.

Malcolm Heath is a Senior Threat Researcher at F5 Labs. His career has included incident response, program management, penetration testing, code auditing, vulnerability research, and exploit development at both large and small organisations. Prior to joining F5 Labs, he was a Senior Security Engineer with the F5 SIRT.