Amazon Web Services (AWS) has explained the cause of last Wednesday’s widespread outage, which impacted thousands of third-party online services for several hours.
While dozens of AWS services were affected, AWS says the outage occurred in its Northern Virginia (US-East-1) region. It happened after a “small addition of capacity” to its front-end fleet of Kinesis servers.
Kinesis is used by developers, as well as other AWS services like CloudWatch and Cognito authentication, to capture data and video streams and run them through AWS machine-learning platforms.
The Kinesis service’s front-end handles authentication and throttling, and distributes workloads to its back-end “workhorse” cluster via a database mechanism called sharding.
As AWS notes in a lengthy summary of the outage, the addition of capacity triggered the outage but wasn’t its root cause. AWS was adding capacity for an hour after 2:44am PST, and after that all the servers in the Kinesis front-end fleet began to exceed the maximum number of threads allowed by its current operating system configuration.
The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST.
Amazon explains how the front-end servers distribute data across its Kinesis back-end: “Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map.”
According to AWS, that information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers.
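To make the idea of a shard-map concrete, here is a minimal sketch of a front-end cache that maps shards to the back-end hosts that own them. It is an illustration only, not AWS's implementation: the `ShardMap` class, its methods, the shard IDs, and the host names are all hypothetical, and the real cache is populated from the sources AWS describes rather than by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ShardMap:
    """Hypothetical front-end cache: shard id -> owning back-end host."""
    owners: Dict[str, str] = field(default_factory=dict)

    def update(self, shard_id: str, backend_host: str) -> None:
        # Assumption: in the real system this data comes from a membership
        # microservice, DynamoDB, and messages from peer front-end servers.
        # Here it is simply set by hand.
        self.owners[shard_id] = backend_host

    def route(self, shard_id: str) -> Optional[str]:
        # Return the back-end host that owns the shard, or None when the
        # cached map has no entry and the request cannot be routed.
        return self.owners.get(shard_id)


shard_map = ShardMap()
shard_map.update("shard-0001", "backend-a.example.internal")

print(shard_map.route("shard-0001"))  # known shard: routable
print(shard_map.route("shard-0002"))  # unknown shard: None, cannot be routed
```

A front-end server whose map is stale or incomplete behaves like the second lookup: it has nowhere to send the request.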
“For [Kinesis] communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants.”
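The thread-per-peer design AWS describes means every server's thread count grows with the size of the fleet. The sketch below shows that arithmetic under stated assumptions: the function names are illustrative, and the fleet sizes, baseline thread count, and thread limit are invented numbers, since AWS did not publish the real figures.

```python
def threads_per_server(fleet_size: int) -> int:
    # One OS thread for each of the *other* servers in the front-end fleet.
    return max(fleet_size - 1, 0)


def total_peer_threads(fleet_size: int) -> int:
    # Every server holds a thread per peer, so the fleet-wide total
    # grows roughly with the square of the fleet size.
    return fleet_size * threads_per_server(fleet_size)


def exceeds_limit(fleet_size: int, base_threads: int, thread_limit: int) -> bool:
    # base_threads stands in for threads used for everything other than
    # peer communication (a hypothetical figure).
    return base_threads + threads_per_server(fleet_size) > thread_limit


# Hypothetical numbers, chosen only to illustrate the failure mode.
BASE_THREADS = 200
THREAD_LIMIT = 1200

for fleet_size in (1000, 1050):
    over = exceeds_limit(fleet_size, BASE_THREADS, THREAD_LIMIT)
    print(fleet_size, threads_per_server(fleet_size), over)
```

With these made-up values, a fleet of 1,000 servers stays just under the limit, while adding 50 servers pushes every existing server over it at once, which is why a small capacity addition can affect the whole fleet rather than only the new machines.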
As the number of threads exceeded the OS configuration, the front-end servers ended up with “useless shard-maps” and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about boosting the thread limit in case it delayed the recovery.
As a first step, AWS has moved to servers with more CPU and memory, which reduces the total number of servers and therefore the number of threads each server needs to communicate across the fleet.
It’s also testing an increase in thread count limits in its operating system configuration and working to “radically improve the cold-start time for the front-end fleet”.
CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. It’s also working on a broader project to keep failures in one service from affecting other services.
AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was affected by the outage. During that time it was updating customers via the Personal Health Dashboard.
“With an event such as this one, we typically post to the Service Health Dashboard. During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event,” AWS said.
“We want to apologize for the impact this event caused for our customers.”