Acquia has detected an interruption in services for applications in the U.S. East region

Incident Report for Acquia, Inc.

Postmortem

Purpose of This Report

This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what is required to prevent a future occurrence if at all possible. Any remaining issues or risks are identified, as are recommended or pending actions.

Executive Summary

On 27 September, between approximately 02:20 and 10:40 UTC, some Acquia applications and services were impacted by a service disruption at an Acquia infrastructure provider that affected EBS services. For most affected applications this resulted in degraded performance, but a small number of applications may have experienced an interruption of service when the remaining servers were unable to handle current resource requirements. The incident was resolved when Acquia’s infrastructure provider reverted to a previous version of the EBS Metadata Service. Further remediations by our vendor include a number of revisions to the system that had been in the process of being released, enhanced monitoring and alerting, and better notification of system changes of this scope. Acquia will continue to follow all best practices provided by our vendor to limit the impact of incidents that affect a single availability zone (for example, all highly available server pairs are currently provisioned across multiple availability zones to prevent single-AZ events from causing a total outage).

Event Summary

On 27 September, between approximately 02:20 and 10:40 UTC, some Acquia applications and services were impacted by a service disruption that occurred at an Acquia infrastructure provider. The disruption specifically affected Elastic Block Store (EBS) resources, causing a general degradation of I/O performance across one availability zone within the U.S. East hosting region.

Acquia Actions

02:23 UTC - Acquia Support received initial notification from customers experiencing degradation of performance for some actions and began investigation.
03:11 UTC - As Acquia’s investigation continued, Acquia’s infrastructure vendor made an initial public post indicating that some U.S. East based services were impacted and that an investigation was ongoing.
04:01 UTC - As more systems affected by the vendor issue were identified, Acquia posted public messaging to status.acquia.com to make customers generally aware of the impact to services.
04:47 UTC - Acquia Operations identified that rebooting the members of highly available server pairs located outside of the impacted availability zone and performing a file system remount could partially restore service for affected applications. These actions did not restore highly available service, but they allowed servers outside of the affected availability zone to operate normally. Acquia prioritized this partial fix based on prior notifications from impacted customers as well as available internal monitoring information.
10:12 UTC - Acquia’s infrastructure provider resolved the issue affecting EBS resources, restoring HA service for the applications that had been impacted.
10:52 UTC - Final instance reboots and file system remounts were completed by Acquia across all affected systems.

Identified Root Cause

Beginning at 02:22 on 27 September, the EBS Metadata Service began to experience elevated error rates when processing new requests from the EBS storage servers for EBS volume state change operations. These error rates were caused by an unexpected increase in request volume from the EBS storage servers to the EBS Metadata Service, which led to resource contention and increased request latencies within the EBS Metadata Service. This in turn led to request timeouts from the EBS storage servers. The timeouts triggered retries, which further increased the connection load on the EBS Metadata Service. The impact was resolved when Acquia’s infrastructure vendor reverted to a previous version of the EBS Metadata Service.
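The feedback loop described above (timeouts trigger retries, and retries add load) is a common failure pattern. One widely used client-side mitigation is capped exponential backoff with jitter, which spreads retries out over time instead of sending them all at once. The sketch below is a general illustration only and is not taken from the vendor's RCA or from the EBS implementation. The function name call_with_retries and its parameters are hypothetical.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=None):
    """Call operation(), retrying on TimeoutError with capped exponential
    backoff and full jitter.

    All names and default values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            # Give up after the final attempt instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            # The ceiling doubles on each attempt, up to `cap` seconds.
            ceiling = min(cap, base * (2 ** attempt))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients do not retry in lockstep.
            sleep(rng.uniform(0, ceiling))
```

Bounding the number of attempts and randomizing the wait are the two properties that keep a large fleet of clients from amplifying load on a service that is already slow.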

Corrective Actions

Acquia’s infrastructure vendor has provided an RCA which includes remediations to prevent a recurrence of this issue. These include:
1. Discontinuing use of the new EBS metadata services until further work can be done to eliminate all issues discovered during this incident.
2. Improved monitoring and alerting to better identify and act on any issue originating from these systems.
3. Improved proactive messaging to infrastructure service customers to provide more notification of when system changes take place.
Acquia continues to follow best practices from our infrastructure vendor including the provisioning of highly available servers across multiple availability zones to limit the scope of impact from any incident affecting a single AZ.
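The effect of provisioning highly available pairs across availability zones can be illustrated with a small sketch. It classifies each pair by how many of its members sit in an impaired zone, which mirrors the outcome described in this report: pairs with one member in the affected zone kept serving in a degraded state. The inventory, pair names, zone names, and the classify_pairs function are all hypothetical and do not reflect Acquia's actual tooling or topology.

```python
def classify_pairs(pairs, impaired_az):
    """Map each HA pair name to 'unaffected', 'degraded', or 'down'.

    `pairs` maps a pair name to the list of availability zones that host
    its members. All names are illustrative assumptions.
    """
    status = {}
    for name, zones in pairs.items():
        healthy = [zone for zone in zones if zone != impaired_az]
        if len(healthy) == len(zones):
            status[name] = "unaffected"   # no member in the impaired zone
        elif healthy:
            status[name] = "degraded"     # still serving, but without HA
        else:
            status[name] = "down"         # every member in the impaired zone
    return status


# Hypothetical inventory: two pairs span zones, one does not.
inventory = {
    "web-pair-1": ["zone-a", "zone-b"],
    "web-pair-2": ["zone-b", "zone-c"],
    "web-pair-3": ["zone-a", "zone-a"],
}
result = classify_pairs(inventory, impaired_az="zone-a")
```

In this hypothetical inventory, only the pair placed entirely in the impaired zone is classified as down, which is the case that multi-AZ provisioning is meant to avoid.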

Posted Oct 29, 2021 - 22:58 UTC

Resolved

The underlying cause of this service interruption has been addressed. All affected Acquia Cloud interface services have been restored. All services are operational at this time.

Posted Sep 27, 2021 - 12:08 UTC

Investigating

We continue to experience an interruption affecting some applications in the U.S. East region, along with the Acquia Cloud UI and Acquia Cloud API.
We are currently working to resolve the issue for affected services. We will provide further information as available.

Posted Sep 27, 2021 - 08:53 UTC

Monitoring

The underlying cause of the service interruption affecting some applications in the U.S. East region has been addressed.
We are currently monitoring the degraded performance of Acquia Cloud interface services. We will provide further information as soon as this issue has been fully resolved.

Posted Sep 27, 2021 - 08:29 UTC

Update

Acquia Cloud interface services are currently degraded due to the ongoing incident. We are working to resolve it at this time. We will provide additional updates when services have been fully restored.

Posted Sep 27, 2021 - 07:03 UTC

Identified

We are continuing to monitor the earlier interruption of service affecting some applications in the U.S. East region. We are working with AWS to resolve this as quickly as possible. We will provide further information as soon as this issue has been fully resolved.

Posted Sep 27, 2021 - 06:01 UTC

Investigating

We are currently investigating an interruption of service for Acquia Cloud affecting some applications in the U.S. East region. We will provide additional information as it becomes available.

Posted Sep 27, 2021 - 04:01 UTC

This incident affected: Cloud Platform Enterprise, Cloud Platform Professional, Acquia Site Factory, and Drupal Cloud UI.