The Internet Report
27 minutes | Oct 6, 2021
Ep. 45: The Facebook Outage, Explained (10/4/21)
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. 00:15 Headlines: Today we’re going to do a thorough analysis of the major Facebook outage that took place yesterday, Monday, October 4. I’m joined by Gustavo Ramos, ThousandEyes’ in-house expert on network engineering. ThousandEyes Blog: https://www.thousandeyes.com/blog/facebook-outage-analysis Analysis from Facebook: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ 1:17 Under the Hood: We'll walk through the sequence of events that led to this outage, look at what went wrong (and what actions may have made the situation worse), and draw out what lessons we can all learn from it. 25:40 Outro: We've been on a bit of a break for the past several months as things were relatively quiet on the Internet front. For the foreseeable future we'll be a bit more reactive with our episodes; when something major happens, trust that we'll be here. Questions? Feedback? Have an idea for a guest? Send us an email at firstname.lastname@example.org
18 minutes | Aug 3, 2021
Ep. 44 When BGP Routes Accidentally Get Hijacked: A Lesson In Internet Vulnerability
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. 00:08 Headlines: Today, Mike Hicks (Principal Solutions Analyst, ThousandEyes) and I discuss a recent BGP routing incident that had intermittent impacts on Amazon’s services, including Amazon.com and AWS compute resources, during a five-hour period on July 12. 01:04 Under the Hood: When we look into BGP routing at the time, we can see multiple BGP path changes due to a service provider erroneously inserting themselves into the path for a large number of Amazon routes. Watch this episode to see how the BGP incident led to significant packet loss, resulting in service disruption for some Amazon and AWS users. We also discuss why enterprises need to have continuous oversight of the paths their traffic takes over the Internet. 17:58 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at email@example.com
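The episode above describes a provider erroneously inserting itself into the path for Amazon routes. As a rough illustration of the mechanism (a toy sketch with hypothetical AS numbers and a documentation prefix — real BGP best-path selection has many more tie-breakers than shown here), a leaked announcement that appears to offer a shorter AS path can be preferred over the legitimate route:

```python
# Toy BGP-style best-path comparison: higher local-pref wins first,
# then the shorter AS path. AS 64500 and the prefix below are
# illustrative placeholders, not a real network's routes.

def best_path(routes):
    """Return the preferred route among candidates for the same prefix."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))

legitimate = {"prefix": "198.51.100.0/24", "as_path": [3356, 16509], "local_pref": 100}
# A network that leaks or re-originates the route can appear "closer",
# pulling traffic through a path that was never meant to carry it.
leaked     = {"prefix": "198.51.100.0/24", "as_path": [64500],       "local_pref": 100}

chosen = best_path([legitimate, leaked])
print(chosen["as_path"])  # the shorter, leaked path is selected
```

This is why a single misconfigured provider can divert traffic for many networks at once, and why monitoring the actual path your traffic takes (not just reachability) matters.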
18 minutes | Jul 24, 2021
Ep. 43: The Akamai DNS Outage and the Case for CDN Redundancy (July 1-23, 2021)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. I’m joined today by Mike Hicks, principal solutions analyst here at ThousandEyes, to cover the outage of Akamai’s DNS service. The outage, which occurred on July 22nd around 3:38 PM UTC (8:38 AM PT), struck during business hours in Europe and North America, resulting in widespread impacts to applications and services hosted on Akamai servers. The outage itself was short-lived, resolved roughly one hour after it began. In this episode, we examine the customer impact, the relationship between DNS and CDNs, and what enterprises should take away from the incident. We also discuss the question of when it might make sense to invest in DNS or CDN redundancy—and when it is, frankly, overkill. Watch this week’s episode to hear our take, and as always let us know on Twitter what you think.
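The relationship between DNS and CDNs discussed in this episode can be sketched in a few lines. Sites reach a CDN by following a CNAME chain to an edge IP, so when the CDN's authoritative DNS can't answer, the site becomes unreachable even if the edge servers themselves are healthy. The records below are hypothetical placeholders, not Akamai's real zone data:

```python
# Toy resolver following a CNAME chain to an A record.
# Hostnames and the IP are illustrative, not real records.

RECORDS = {
    "www.example.com":             ("CNAME", "www.example.com.edgekey.net"),
    "www.example.com.edgekey.net": ("CNAME", "e1234.a.akamaiedge.net"),
    "e1234.a.akamaiedge.net":      ("A", "203.0.113.10"),
}

def resolve(name, dns_up=True):
    """Follow CNAMEs to an edge IP; return None if authoritative DNS is down."""
    while dns_up:
        rtype, value = RECORDS[name]
        if rtype == "A":
            return value      # reached an edge server address
        name = value          # follow the CNAME one hop
    return None               # DNS outage: no answer, site unreachable

print(resolve("www.example.com"))                # 203.0.113.10
print(resolve("www.example.com", dns_up=False))  # None
```

The failure mode is all-or-nothing, which is the core of the episode's redundancy question: a second DNS or CDN provider only helps if clients can fail over to a different chain.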
21 minutes | Jul 2, 2021
Ep. 42: BGP Routing Incident Shows Why the Shortest Path Isn’t Always the Chosen Path
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. 00:13 Headlines: Today, Kemal and I unpack an interesting BGP incident, in which a large-scale route leak briefly altered traffic patterns across the Internet. 00:58 Under the Hood: The incident began on Thursday, June 3rd at around 10:24 UTC, and resulted in a significant spike in packet loss that was noticeable in ThousandEyes tests. While this packet loss resolved within the hour (at around 10:48 UTC), we observed some interesting routing changes during this window—as traffic was diverted to a Russian telecom provider that had not previously been in the path. Watch this episode as we explore how this network provider managed to get itself into the routing paths of many major services, and why network visibility is so important for recognizing these types of incidents, in which your site may still be reachable but your traffic is being sent through an unexpected network. 20:45 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at firstname.lastname@example.org
26 minutes | Jun 25, 2021
Ep. 41 Akamai Prolexic Outage Analysis + Takeaways (Week of June 9-17 2021)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. I’m joined by ThousandEyes’ BGP expert, Kemal Sanjta, to review the June 16th outage of Prolexic Routed, a DDoS mitigation service operated by Akamai. According to a statement from Akamai, the outage was not due to a DDoS attack or system update, but instead a routing table limitation that was inadvertently exceeded. In this episode, Kemal and I analyze what happened and how customers of Akamai Prolexic who had automated failover mechanisms in place were able to recover more quickly than those who had to manually switch over to other providers. Watch this episode to learn more about this outage, and how different operational processes resulted in very different service outcomes.
40 minutes | Jun 9, 2021
Ep. 40 Fastly’s Outage and Why CDN Redundancy Matters (Week of June 3-8)
00:00 Welcome: This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. 00:12 Headlines: Today, I’m joined by Hans Ashlock, Director of Technology & Innovation at ThousandEyes, to unpack today’s major outage at Fastly, a popular CDN provider. 3:46 Under the Hood: The widespread outage occurred around 9:50 UTC (about 5:50 am ET) and, due to the timing, mostly impacted users across Europe and Asia. The outage lasted approximately one hour, until 10:50 UTC, yet residual impacts were felt beyond that. Today’s outage is a good example of the importance of having outside-in visibility not just across your app, but also into your app’s edge and all its dependent services. 39:05 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at email@example.com
32 minutes | Jun 4, 2021
Ep. 39: Bitcoin Dive Sparks Outage at a Popular Crypto Exchange (Weeks of May 17-June 2)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. I’m joined today by Mike Hicks, Principal Solutions Analyst at ThousandEyes, to cover two recent application-related outages. The first occurred on May 19th around 12:50 UTC at Coinbase—a well-known cryptocurrency exchange. Around the time that news broke saying that the Chinese government would be imposing strict regulation on cryptocurrencies, users attempting to execute transactions were unable to access the application. From the ThousandEyes platform we were able to see a drop in availability around this time as well as increased load times (which in some cases resulted in timeout errors). The second outage happened on May 20th around 17:35 UTC at Slack—an enterprise collaboration platform. While the outage was resolved within 90 minutes, it occurred during normal US business hours, making it particularly disruptive to users attempting to reach the application. These instances remind us that applications, much like the underlying networks they run on, can experience outages, and effective troubleshooting requires end-to-end visibility into both.
32 minutes | May 21, 2021
Ep. 38: DNS and BGP and DDoS Attacks—Oh, My! (May 11-17 2021)
00:00 Welcome 00:14 Headlines: DNS and BGP and DDoS Attacks—Oh, My! This week we cover a couple of recent service degradation incidents involving DNS providers. 2:19 Under the Hood: Kemal Sanjta, ThousandEyes’ resident BGP expert, joins us to discuss the May 6th disruption to Neustar’s UltraDNS service, which lasted nearly four hours. We discuss the BGP routing changes we observed during the incident and what they can tell us about the cause of the disruption. We also cover a separate incident involving Quad9, a public recursive resolver service, which the company says was caused by a DDoS attack on May 3rd. 16:19 Expert Spotlight: Michael Batchelder (a.k.a. Binky) is here to discuss the two “Ds” of the Internet: DDoS attacks and the DNS. Questions for Binky? Contact him at firstname.lastname@example.org 31:49 Outro: Questions? Feedback? Have an idea for a guest? Send us an email at email@example.com
10 minutes | May 7, 2021
Ep. 37: Even Magic Can't Stop Internet Outages (April 28-May 3, 2021)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. Today, we focus on an interesting outage that impacted Cloudflare Magic Transit, a relatively new offering from the CDN provider that aims to efficiently route and protect the network traffic of its customers. On May 3rd at approximately 3:00 PM PDT (10:00 PM UTC), ThousandEyes vantage points connecting to sites using Magic Transit began to detect significant packet loss at Cloudflare’s network edge, with the loss continuing at varying levels for approximately two hours. While the outage impacted some Magic Transit customers more significantly than others, we also observed mitigation actions by at least one customer to avoid the outage and restore the availability of their service to their users. This outage reminds us that no provider is immune to outages, even cloud and global CDN providers. However, with proactive visibility, you can respond quickly to reduce outage impact on your users. Watch this week’s episode to hear more about the outage from the ThousandEyes perspective.
18 minutes | Apr 29, 2021
Ep. 36 Microsoft Teams Outage Highlights Need to See Beyond App Front Door (Week of 4/20- 4/27 2021)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. We’re joined this week by Hans Ashlock, Director of Technology & Innovation at ThousandEyes, to discuss Tuesday’s Microsoft Teams outage. On Tuesday, April 27th, ThousandEyes tests began to detect an outage affecting the Teams service starting around 3 AM (PT) and lasting approximately 1.5 hours. While the outage occurred in the overnight hours for much of the Americas, the global nature of the outage resulted in service disruption for users connecting from Asia and Europe. Transaction views within the ThousandEyes platform show that Microsoft’s authentication service appeared to be available, however, the Teams application was unable to initialize, resulting in error responses. Watch this week’s episode to hear more about what ThousandEyes revealed about the nature of this outage—and what we can all learn from the incident.
31 minutes | Apr 22, 2021
Ep. 35: Major BGP Route Leak Disrupts Internet Traffic Globally (April 13-19 2021)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. On today’s episode, we’re thrilled to be joined by Kemal Sanjta, ThousandEyes’ resident expert on BGP. This week, we’re going under the hood on the April 16th BGP leak at Vodafone India, which leaked more than 30,000 prefixes, causing a major disruption of Internet traffic to some services. While some news outlets reported that the incident lasted approximately 10 minutes (starting around 1:50 PM UTC, or 9:50 AM ET), we found that it lasted quite a bit longer—more than an hour in the case of some prefixes. Watch this week’s show to see how it impacted a major CDN provider.
25 minutes | Apr 15, 2021
Ep. 34 Facebook Outage Analysis; Plus, Why Cross-Layer Visibility Is a Must for App Experience
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. We’re back from a short sabbatical to cover an interesting incident at Facebook: what appears to be an application outage compounded by a series of routing issues. On April 8th, for roughly 40 minutes, the Facebook application became unavailable for users around the globe who were attempting to connect to the service. Despite the short-lived nature of the outage, we observed prolonged performance degradation even after the application came back online for users. Suboptimal page load and response times, both of which can impact the user experience, were observed alongside a series of routing changes. This outage reminds us all of the importance of having visibility across network and application layers when troubleshooting and prioritizing issues that are impacting user experience. Catch this week’s episode to hear about the outage from the ThousandEyes perspective.
9 minutes | Feb 3, 2021
Ep. 33: What Happened with Verizon’s Recent Outage (Week of Jan. 25- Feb. 1 2021)
On today’s episode, we discuss the recent outage on Verizon’s network that had widespread impacts on users in the US. ThousandEyes Broadband Agents detected an outage starting around 11:30 am EST that manifested as packet loss across multiple locations, concentrated along the Verizon backbone on the US east coast and midwest. While the outage was resolved approximately an hour later, users connecting from the Verizon network across the US experienced varying degrees of impact, depending on the services they were connecting to. This serves as yet another reminder that the context around an outage directly affects the scope of the disruption. Watch this week’s episode to see what this outage looked like from ThousandEyes vantage points.
34 minutes | Jan 6, 2021
Ep. 32: What Happened with Slack’s Outage; Plus, Talking Cloud Resiliency with Forrest Brazeal of A Cloud Guru (Week of 12/28/20- 01/04/21)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. Despite a quiet last couple of weeks on the Internet, we started off our new year with quite the bang. As droves of mildly-caffeinated workers returned to their home offices on Monday after the holiday break, many were surprised to find that Slack was not available. On today’s episode, we go under the hood of Slack’s Monday outage to see what went wrong and how it was resolved. We’re also excited to be joined by Forrest Brazeal, a cloud architect, writer, speaker and cartoonist, to talk about everyone’s favorite subject: cloud resiliency. Watch this week’s episode to see the interview and hear our outage analysis. Show links: https://forrestbrazeal.com https://acloudguru.com https://cloudirregular.substack.com https://cloudirregular.substack.com/p/the-cold-reality-of-the-kinesis-incident
21 minutes | Dec 16, 2020
Ep. 31 About Monday’s Google Outage; Plus, Talking Holiday Internet Traffic Trends with Fastly (Week of Dec. 7- Dec. 14)
In this week's episode of #TheInternetReport... 00:00 Welcome 00:16 Headlines: About Monday’s Google Outage; Plus, Talking Holiday Internet Traffic Trends with Fastly 00:43 Under the Hood: This week, we go under the hood on a recent outage that took down the availability of several Google applications, including YouTube, Gmail and Google Calendar. Yesterday morning at approximately 6:50 AM EST, users around the world were unable to access several Google services for a span of around 40 minutes. While short-lived, the outage was notable in that it occurred during business hours in Europe and toward the beginning of the school day on the US east coast—so, people noticed, to put it bluntly. Catch this week’s episode to hear about the official RCA and what we saw from a network perspective. 10:18 Expert Spotlight: We’re thrilled to be joined by David Belson, Senior Director of Data Insights at Fastly, to talk about Internet traffic trends related to holiday online shopping and charitable giving. Cyber Five: what we saw during ecommerce's big week- https://www.fastly.com/blog/cyber-five-what-we-saw-during-ecommerces-big-week Decoding the digital divide- https://www.fastly.com/blog/digital-divide 19:14 Outro: We're taking a break for the rest of 2020 but join us on Jan. 05 2021 when we kick off the New Year with Forrest Brazeal: https://forrestbrazeal.com https://cloudirregular.substack.com
15 minutes | Dec 1, 2020
Ep. 30 Major AWS Outage Highlights Dependencies within Cloud Providers (Week of Nov. 23- Nov. 30)
If you’re an AWS customer or rely on services that use AWS, you might have noticed the major, hours-long outage last week. On November 25th, at approximately 5:15 am PST, users of Kinesis, a real-time processor of streaming data, began to experience service interruptions. The issue was not network-related, and AWS later issued a detailed incident post-mortem analysis identifying an existing operating system configuration issue that was triggered by a maintenance event that involved adding server capacity. Over the course of the day, Amazon attempted several mitigation measures, but the outage was not completely resolved until approximately 10:23 pm PST. What was notable about this outage was its blast radius, which extended far beyond AWS’s direct customers. Several AWS services that use Kinesis, including Cognito and CloudWatch, were affected, as were users of applications consuming those services (e.g., Ring, iRobot, Adobe). This is a good reminder of the risk of hidden service dependencies, as well as the need for visibility to understand and communicate with customers when something’s gone wrong.
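The blast radius described above can be made concrete with a small sketch. The dependency graph below is a simplified illustration based on the services named in the episode, not AWS's actual internal topology: anything that transitively depends on the failed service inherits the failure.

```python
# Simplified, illustrative dependency graph (not AWS's real topology):
# each service maps to the services it directly depends on.

DEPENDS_ON = {
    "CloudWatch": ["Kinesis"],
    "Cognito":    ["Kinesis"],
    "Ring":       ["Cognito"],
    "iRobot":     ["Cognito"],
}

def blast_radius(failed, graph):
    """Return every service that transitively depends on the failed one."""
    impacted = set()
    changed = True
    while changed:                      # iterate until no new services are impacted
        changed = False
        for svc, deps in graph.items():
            if svc not in impacted and (failed in deps or impacted & set(deps)):
                impacted.add(svc)
                changed = True
    return impacted

print(sorted(blast_radius("Kinesis", DEPENDS_ON)))
# ['CloudWatch', 'Cognito', 'Ring', 'iRobot']
```

Note that Ring and iRobot never call Kinesis directly; they are impacted only through Cognito, which is exactly the "hidden dependency" risk the episode highlights.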
18 minutes | Nov 9, 2020
Ep. 29: 2020 Election—The Internet Held Strong with a Few App Performance Glitches (Week of Nov. 2- 8)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. This week, we’re pleasantly surprised to say that the network did not break, and there were no major election-night outages to report. However, that’s not to say we didn’t catch performance glitches in the days and weeks around the big night. Watch this week’s episode, as we cover performance issues at a Secretary of State website as well as why CNN’s election map website was so slow to load for many.
15 minutes | Oct 27, 2020
Ep. 28: 2020 Election Special: Going Under the Hood on State Election Websites (Week of Oct. 19-25)
We’ve got an election coming up here in the US, and over the last several weeks, we have been analyzing a dozen or so state election websites to take a closer look at how they’re hosted (e.g., do they use a CDN or are they self-hosted?) and to monitor them for outages. In this episode, we discuss the pros and cons of each hosting method and dive into some examples we’ve seen where election websites have had unexpected performance degradation. Catch this week’s episode to go under the hood on the websites powering the upcoming presidential election—and don’t forget to get out there and vote!
7 minutes | Oct 19, 2020
Ep. 27 No, Twitter Wasn’t Hacked and Zayo Goes Bump in the Night (Week of Oct. 12-18)
In this week’s episode, we discuss two notable outages that happened last week. The first, at Twitter, took place on October 15 around 5:30 pm PST and impacted users’ ability to tweet or retweet. According to Twitter’s official statement, an internal system error was the culprit—putting to bed any theories of another hack. The second outage took place at the transit provider Zayo in the early morning hours of October 13. Although the outage seemed to mostly involve interfaces on the US west coast, in Denver, and in the southwest (as well as a handful of other global locations), its impact was not very severe because it occurred outside of US business hours. Watch this week’s episode to hear more about these two outages.
19 minutes | Oct 12, 2020
Ep. 26 The case of an overloaded database and what happens when a bug bites (Week of Oct. 4-11)
This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet—and why. In this week’s episode, we dive into a recent outage at Slack that caused intermittent issues for its enterprise users (including ourselves) for nearly a full day. The cause, as noted by Slack, was on the backend and related to an overloaded database. Next, we dig into another outage at Microsoft. According to their statement, a bug in an internal update seems to have revoked the routes to a number of devices that were believed to be unhealthy—thereby creating congestion in the rest of their network. This explanation jibes with the increased packet loss we observed during this time period. Don’t miss this week’s episode, where we walk through these outages in depth.
© Stitcher 2021