Here is a technical explanation of the AWS us-east-1 outage caused by a DNS failure.
AWS's own internal control plane is built on top of DynamoDB. It's a hidden dependency. When AWS's internal services could no longer resolve an IP address for the DynamoDB endpoint, the entire management layer collapsed.
Stage 1: DNS Fails. DNS resolution for dynamodb.us-east-1.amazonaws.com stopped working: lookups for the endpoint no longer returned any IP addresses.
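From a client's point of view, Stage 1 looked roughly like the sketch below: the hostname simply stops resolving, so no connection is ever attempted. This is an illustrative probe using Python's standard library, not anything AWS-specific.

```python
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the same DNS lookup an SDK does before connecting.
    addrs = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    print("resolved:", sorted({a[4][0] for a in addrs}))
except socket.gaierror as exc:
    # During the outage this is roughly what every client saw: the name
    # exists, but no IP address comes back for it.
    print("DNS resolution failed:", exc)
```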
Stage 2: Control Plane Fails. AWS's own services that depend on DynamoDB immediately broke. These included:
IAM (for authentication and session state)
The EC2 instance launch subsystem (which uses DynamoDB for metadata)
Network Load Balancer (NLB) health checks (which, it turns out, write their health state to a DynamoDB table; see the sketch after this list)
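To make the NLB point concrete, here is a minimal sketch of that kind of coupling: a health checker that persists its results to a DynamoDB table. The table name and key schema here are hypothetical (the real NLB internals aren't public), but the failure mode is the point: if DynamoDB, or its DNS name, is unreachable, the checker can't record "healthy" even for perfectly healthy targets.

```python
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("nlb-target-health")  # hypothetical table name

def record_health(target_id: str, healthy: bool) -> bool:
    """Persist one health-check result; return False if the store is down."""
    try:
        table.put_item(Item={
            "target_id": target_id,          # hypothetical key schema
            "healthy": healthy,
            "checked_at": int(time.time()),
        })
        return True
    except (EndpointConnectionError, ClientError):
        # If DynamoDB's endpoint can't be resolved or reached, every write
        # fails, and anything reading this table sees stale or missing state,
        # which is indistinguishable from "unhealthy".
        return False
```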
Stage 3: Circular Dependency. This is the crazy part. When the NLB health checks failed (because they couldn't write to DynamoDB), they caused more network connectivity issues, which in turn impacted the (already struggling) DynamoDB service itself. The result was a vicious feedback loop.
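Here is a deliberately crude toy model of that loop, with made-up numbers, just to show its shape: writes that fail get retried, retries add load, and the extra load makes the next round of writes fail even harder.

```python
def simulate_feedback_loop(targets: int = 100, store_capacity: float = 60.0,
                           rounds: int = 5) -> None:
    """Toy model: failed health-state writes come back as retries, and the
    retry load snowballs against a store whose capacity hasn't grown."""
    backlog = 0  # failed writes waiting to be retried
    for r in range(1, rounds + 1):
        write_load = targets + backlog           # fresh checks + retries
        success_ratio = min(1.0, store_capacity / write_load)
        failed = round(write_load * (1 - success_ratio))
        backlog = failed                         # failures retry next round
        # Targets whose writes failed get treated as unhealthy, even though
        # the targets themselves are fine.
        apparently_healthy = targets - min(targets, failed)
        print(f"round {r}: offered load={write_load}, "
              f"apparently healthy={apparently_healthy}, backlog={backlog}")

simulate_feedback_loop()
```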
Why it lasted 15+ hours (The UDP problem)
Fixing the DNS issue itself only took a couple of hours. The reason full recovery took so much longer was twofold:
The Retry Storm: DNS queries use UDP, which is a stateless, "fire and forget" protocol. When the DNS queries failed, millions of clients (SDKs, Lambda functions, other AWS services) didn't get an immediate "connection refused" (like with TCP). They just timed out after 5+ seconds and then retried. This created a "retry storm" (or thundering herd) of millions of requests that hammered the DNS servers and caches, preventing them from recovering even after the initial fix was in.
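The two retry styles below sketch the difference, under the assumption of a generic `call` that raises `TimeoutError` when a query is lost. The function names are illustrative; the official AWS SDKs already do backoff with jitter, but plenty of bespoke clients retried in near lock-step.

```python
import random
import time

def naive_retry(call, attempts=50, timeout=5.0):
    """Retry immediately after every timeout: the 'retry storm' pattern.
    Millions of clients doing this re-fire in near lock-step."""
    for _ in range(attempts):
        try:
            return call(timeout=timeout)
        except TimeoutError:
            continue  # no delay: instantly adds another query to the pile
    raise TimeoutError("gave up")

def retry_with_backoff(call, attempts=8, base=0.2, cap=20.0, timeout=5.0):
    """Exponential backoff with full jitter: spreads retries out over time,
    giving overloaded DNS servers and caches room to recover."""
    for attempt in range(attempts):
        try:
            return call(timeout=timeout)
        except TimeoutError:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise TimeoutError("gave up")
```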
The Global Control Plane: Many of AWS's core control plane services (like IAM) are centralized in us-east-1. Even if your app was running in eu-west-1, if it needed to authenticate or launch an instance, that control plane operation was routed through us-east-1 and failed.
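One practical takeaway, sketched under the assumption that you control your own client configuration, is to keep authentication traffic regional rather than defaulting to the global, us-east-1-backed STS endpoint. The `AWS_STS_REGIONAL_ENDPOINTS` setting and the explicit `endpoint_url` below are real SDK knobs, but verify them against your SDK version; and note this only helps calls that have regional endpoints, not control plane operations that genuinely live in us-east-1.

```python
import os

import boto3

# Option 1: environment-level switch honored by recent AWS SDKs, so STS calls
# go to the caller's region instead of the global endpoint.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

# Option 2: pin the endpoint explicitly so the credential call never leaves
# eu-west-1, even if defaults change underneath you.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)
print(sts.get_caller_identity()["Arn"])
```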