Cloud storm, EC2 startup takedown
I couldn’t get on anyone’s about.me page earlier today, so I checked their compeititor flavors.me, they were also down, seemed like a strange coincidence. I did a bit of Twitter snooping and it appeared that Amazon’s EC2 hosting service, specifically their Elastic Compute Cloud and Relational Database Service in Northern Virginia were having issues. You can find technical details at Amazon Web Services Health Dashboard. Mike Butcher at TechCrunch Europe was one of the first to write about the downtime.
Below are some of the services that have been facing downtime for (in many cases) over 7 hours.
It appears that all these startups can do is sit and wait for Amazon to sort things out. This will inevitably lead to further questions over ‘the cloud’. Servers have always gone down, there’s rarely a 100% solution, but it’s all the more exciting when a chunk of the social web gets taken down simultaneously.
Full details on Compute Cloud,
1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.
2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.
2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.
3:20 AM PDT Delayed EC2 instance launches and EBS API error rates are recovering. We’re continuing to work towards full resolution.
4:09 AM PDT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55AM PDT, began recovering at 2:55am PDT
5:02 AM PDT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone.
6:09 AM PDT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.
6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.
7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.
UPDATES SINCE POSTED (Compute Cloud),
8:54 AM PDT We’d like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.
11:09 AM PDT A number of people have asked us for an ETA on when we’ll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We will update the community as we have more information.
12:30 PM PDT We have observed successful new launches of EBS backed instances for the past 15 minutes in all but one of the availability zones in the US-EAST-1 Region. The team is continuing to work to recover the unavailable EBS volumes as quickly as possible.
1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
6:18 PM PDT Earlier today we shared our high level ETA for a full recovery. At this point, all Availability Zones except one have been functioning normally for the past 5 hours. We have stabilized the remaining Availability Zone, but recovery is taking longer than we originally expected. We have been working hard to add the capacity that will enable us to safely re-mirror the stuck volumes. We expect to incrementally recover stuck volumes over the coming hours, but believe it will likely be several more hours until a significant number of volumes fully recover and customers are able to create new EBS-backed instances in the affected Availability Zone. We will be providing more information here as soon as we have it.
Here are a couple of things that customers can do in the short term to work around these problems. Customers having problems contacting EC2 instances or with instances stuck shutting down/stopping can launch a replacement instance without targeting a specific Availability Zone. If you have EBS volumes stuck detaching/attaching and have taken snapshots, you can create new volumes from snapshots in one of the other Availability Zones. Customers with instances and/or volumes that appear to be unavailable should not try to recover them by rebooting, stopping, or detaching, as these actions will not currently work on resources in the affected zone.
10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It’s taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
2:41 AM PDT We continue to make progress in restoring volumes but don’t yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.
6:18 AM PDT We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that well reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we’ll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.
Full details on Relational Database,
1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.
2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.
3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.
4:03 AM PDT We are making progress on failovers for Multi AZ instances and restore access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution.
5:06 AM PDT IO latency issues have recovered in one of the two impacted Availability Zones in US-EAST-1. We continue to make progress on restoring access and resolving IO latency issues for remaining affected RDS database instances.
6:29 AM PDT We continue to work on restoring access to the affected Multi AZ instances and resolving the IO latency issues impacting RDS instances in the single availability zone.
8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in US-EAST-1 region.
UPDATES SINCE POSTED (Relational Database),
10:35 AM PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.
2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.
11:42 PM PDT In line with the most recent Amazon EC2 update, we wanted to let you know that the team continues to be all-hands on deck working on the remaining database instances in the single affected Availability Zone. It’s taking us longer than we anticipated. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.