And that’s not all! CDNs don’t just keep content closer to the devices that want it. They also help route your traffic through the internet. “It’s like orchestrating the flow of traffic on a massive road system,” says Ramesh Sitaraman, a computer scientist at the University of Massachusetts Amherst who helped create the first major CDN as a principal architect at Akamai. “If any link on the internet fails or gets congested, CDN algorithms quickly find an alternate route to the destination.”
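The rerouting Sitaraman describes can be pictured as a shortest-path search that simply skips broken links. Here is a toy sketch (not Akamai's or any real CDN's algorithm; the node names and the `find_route` helper are made up for illustration):

```python
# Toy illustration of routing around a failed link using breadth-first search.
# The network graph and names here are hypothetical, for illustration only.
from collections import deque

links = {  # node -> reachable neighbors
    "client": ["pop-a", "pop-b"],
    "pop-a": ["origin"],
    "pop-b": ["origin"],
    "origin": [],
}

def find_route(src, dst, failed=frozenset()):
    # Breadth-first search over the graph, skipping any link marked failed.
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links.get(path[-1], []):
            if (path[-1], nxt) not in failed:
                queue.append(path + [nxt])
    return None  # no healthy path at all

# Normally traffic flows through pop-a; when that link fails,
# the search transparently finds the alternate route via pop-b.
print(find_route("client", "origin", failed={("pop-a", "origin")}))
```

A real CDN weighs latency, congestion, and cost rather than hop count, but the principle is the same: when a link drops out, the path computation simply routes around it.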
So you can start to see how, when a CDN goes down, it can take portions of the internet with it. But that alone doesn’t fully explain why Tuesday’s impact was so far-reaching, especially given how many redundancies are built into these systems. Or at least, should be.
For much of Tuesday morning, it wasn’t entirely clear exactly what had happened at Fastly. “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration,” a company spokesperson said in a statement that morning. “Our global network is coming back online.”
Late Tuesday, the company offered more specifics in a blog post detailing the incident. The root cause actually dates back to May 12, when the company inadvertently introduced a bug as part of a broad software deployment. Like a rune that unlocks its evil powers only under a certain incantation, the bug lay dormant unless and until a Fastly customer configured their deployment in a specific way. Which, nearly a month later, one of them did.
The global disruption began at 5:47 am ET; Fastly noticed it within a minute. It took a bit longer, until 6:27 am ET, to identify the customer configuration that triggered the bug that caused the failure. At that point, 85 percent of Fastly’s network was returning errors; every continent except Antarctica felt the impact. Services began coming back online at 6:36 am ET, and things were mostly back to normal by the top of the hour.
Even after Fastly fixed the underlying problem, it warned that users might still see a lower “cache hit ratio” (how often the content you’re looking for is already stored on a nearby server) and “increased origin load,” which refers to the process of going back to the source for uncached items. In other words, the cupboards were still fairly bare. And it wasn’t until they were restocked around the world that Fastly tackled the underlying bug itself, finally deploying a “permanent fix” several hours later, around lunchtime on the East Coast.
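The relationship between the cache hit ratio and origin load can be sketched in a few lines. This is a generic illustration of edge caching, not Fastly's implementation; the `fetch_from_origin` helper and the paths are invented for the example:

```python
# Minimal sketch of edge caching with origin fallback (illustrative only;
# fetch_from_origin and the request paths are hypothetical).
cache = {}
hits = misses = 0

def fetch_from_origin(path):
    # Stand-in for a slow round trip back to the origin server.
    return f"content of {path}"

def get(path):
    global hits, misses
    if path in cache:        # cache hit: served from the nearby edge server
        hits += 1
        return cache[path]
    misses += 1              # cache miss: this is the "increased origin load"
    content = fetch_from_origin(path)
    cache[path] = content    # restock the cupboard for the next request
    return content

for path in ["/home", "/news", "/home", "/news", "/home"]:
    get(path)

print(f"cache hit ratio: {hits / (hits + misses):.0%}")  # 3 hits, 2 misses: 60%
```

After an outage flushes the caches, nearly every request is a miss, so the origin servers absorb traffic the edge would normally have handled, which is why Fastly flagged it even after service was restored.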
That outage was all the more surprising because CDNs are typically designed to withstand these sorts of storms. “In principle, there is massive redundancy,” says Sitaraman, speaking generally about CDNs. “If one server fails, other servers can take over the load. If an entire data center fails, the load can be moved to other data centers. If things worked perfectly, you could have several network outages, data center problems, and server failures, and the CDN’s resilience mechanisms would ensure that users never see any degradation.”
When things do go wrong, Sitaraman says, it’s typically because of a software bug or a configuration error that hits many servers at once.
Even then, sites and services that use CDNs typically have their own redundancies in place. Or at least, they should. In fact, you could see hints of how well prepared various services were in the speed of their responses Tuesday morning, Medina says. It took Amazon only about 20 minutes to get back up and running, likely because it could divert traffic to other CDN providers. Anyone who relied solely on Fastly, or who lacked automated systems to handle the disruption, simply had to wait it out.