Here is a technical explanation of the AWS us-east-1 outage caused by a DNS failure.
AWS's own internal control plane is built on top of DynamoDB. It's a hidden dependency. When AWS's internal services could no longer resolve an IP address for the DynamoDB endpoint, the entire management layer collapsed.
Stage 1: DNS Fails. DNS resolution for dynamodb.us-east-1.amazonaws.com stopped working: lookups for the endpoint no longer returned any IP addresses.
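From a client's point of view, Stage 1 looked roughly like the sketch below: the hostname simply stops resolving, so no connection is ever attempted. This is an illustrative probe using Python's standard library, not anything AWS-specific.

```python
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the same DNS lookup an SDK does before connecting.
    addrs = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    print("resolved:", sorted({a[4][0] for a in addrs}))
except socket.gaierror as exc:
    # During the outage this is roughly what every client saw: the name
    # exists, but no IP address comes back for it.
    print("DNS resolution failed:", exc)
```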
Stage 2: Control Plane Fails. AWS's own services that depend on DynamoDB immediately broke. These included:
IAM (for authentication and session state)
The EC2 instance launch subsystem (which uses DynamoDB for metadata)
Network Load Balancer (NLB) health checks (which, it turns out, write their health state to a DynamoDB table; see the sketch after this list)
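To make the NLB point concrete, here is a minimal sketch of that kind of coupling: a health checker that persists its results to a DynamoDB table. The table name and key schema here are hypothetical (the real NLB internals aren't public), but the failure mode is the point: if DynamoDB, or its DNS name, is unreachable, the checker can't record "healthy" even for perfectly healthy targets.

```python
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("nlb-target-health")  # hypothetical table name

def record_health(target_id: str, healthy: bool) -> bool:
    """Persist one health-check result; return False if the store is down."""
    try:
        table.put_item(Item={
            "target_id": target_id,          # hypothetical key schema
            "healthy": healthy,
            "checked_at": int(time.time()),
        })
        return True
    except (EndpointConnectionError, ClientError):
        # If DynamoDB's endpoint can't be resolved or reached, every write
        # fails, and anything reading this table sees stale or missing state,
        # which is indistinguishable from "unhealthy".
        return False
```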
Stage 3: Circular Dependency. This is the crazy part. When the NLB health checks failed (because they couldn't write to DynamoDB), they caused more network connectivity issues, which in turn impacted the (already struggling) DynamoDB service itself. The result was a vicious feedback loop.
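Here is a deliberately crude toy model of that loop, with made-up numbers, just to show its shape: writes that fail get retried, retries add load, and the extra load makes the next round of writes fail even harder.

```python
def simulate_feedback_loop(targets: int = 100, store_capacity: float = 60.0,
                           rounds: int = 5) -> None:
    """Toy model: failed health-state writes come back as retries, and the
    retry load snowballs against a store whose capacity hasn't grown."""
    backlog = 0  # failed writes waiting to be retried
    for r in range(1, rounds + 1):
        write_load = targets + backlog           # fresh checks + retries
        success_ratio = min(1.0, store_capacity / write_load)
        failed = round(write_load * (1 - success_ratio))
        backlog = failed                         # failures retry next round
        # Targets whose writes failed get treated as unhealthy, even though
        # the targets themselves are fine.
        apparently_healthy = targets - min(targets, failed)
        print(f"round {r}: offered load={write_load}, "
              f"apparently healthy={apparently_healthy}, backlog={backlog}")

simulate_feedback_loop()
```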
Why it lasted 15+ hours (The UDP problem)
Fixing the DNS issue itself only took a couple of hours. The reason full recovery took so much longer was twofold:
The Retry Storm: DNS queries use UDP, which is a stateless, "fire and forget" protocol. When the DNS queries failed, millions of clients (SDKs, Lambda functions, other AWS services) didn't get an immediate "connection refused" (like with TCP). They just timed out after 5+ seconds and then retried. This created a "retry storm" (or thundering herd) of millions of requests that hammered the DNS servers and caches, preventing them from recovering even after the initial fix was in.
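The two retry styles below sketch the difference, under the assumption of a generic `call` that raises `TimeoutError` when a query is lost. The function names are illustrative; the official AWS SDKs already do backoff with jitter, but plenty of bespoke clients retried in near lock-step.

```python
import random
import time

def naive_retry(call, attempts=50, timeout=5.0):
    """Retry immediately after every timeout: the 'retry storm' pattern.
    Millions of clients doing this re-fire in near lock-step."""
    for _ in range(attempts):
        try:
            return call(timeout=timeout)
        except TimeoutError:
            continue  # no delay: instantly adds another query to the pile
    raise TimeoutError("gave up")

def retry_with_backoff(call, attempts=8, base=0.2, cap=20.0, timeout=5.0):
    """Exponential backoff with full jitter: spreads retries out over time,
    giving overloaded DNS servers and caches room to recover."""
    for attempt in range(attempts):
        try:
            return call(timeout=timeout)
        except TimeoutError:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise TimeoutError("gave up")
```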
The Global Control Plane: Many of AWS's core control plane services (like IAM) are centralized in us-east-1. Even if your app was running in eu-west-1, if it needed to authenticate or launch an instance, that control plane operation was routed through us-east-1 and failed.
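One practical takeaway, sketched under the assumption that you control your own client configuration, is to keep authentication traffic regional rather than defaulting to the global, us-east-1-backed STS endpoint. The `AWS_STS_REGIONAL_ENDPOINTS` setting and the explicit `endpoint_url` below are real SDK knobs, but verify them against your SDK version; and note this only helps calls that have regional endpoints, not control plane operations that genuinely live in us-east-1.

```python
import os

import boto3

# Option 1: environment-level switch honored by recent AWS SDKs, so STS calls
# go to the caller's region instead of the global endpoint.
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

# Option 2: pin the endpoint explicitly so the credential call never leaves
# eu-west-1, even if defaults change underneath you.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)
print(sts.get_caller_identity()["Arn"])
```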