Purpose of This Report
This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what is required to prevent a future occurrence if at all possible. Any remaining issues or risks are identified, as are recommended or pending actions.
On 27 September, between approximately 02:20 and 10:40 UTC, some Acquia applications and services were impacted by a service disruption at an Acquia infrastructure provider that affected EBS services. This generally resulted in degraded performance for affected applications, but may have resulted in an interruption of service for a small number of applications when the remaining servers were unable to handle current resource requirements. The incident was resolved when Acquia’s infrastructure provider reverted to a previous version of the EBS Metadata Service. Further remediations by our vendor include a number of revisions to the system that were already in the process of being released, enhanced monitoring and alerting, and better notification of system changes of this scope. Acquia will continue to follow all best practices provided by its vendor to limit the impact of incidents that affect a single availability zone (e.g. all highly available server pairs are currently provisioned across multiple availability zones to prevent single-AZ events from causing a total outage).
On 27 September, between approximately 02:20 and 10:40 UTC, some Acquia applications and services were impacted by a service disruption that occurred at an Acquia infrastructure provider. The disruption specifically affected the performance of Elastic Block Store (EBS) resources, causing a general degradation of I/O performance across one availability zone within the US East hosting region.
Beginning at 02:22 on 27 September, the EBS Metadata Service began to experience elevated error rates when processing new requests from the EBS storage servers for EBS volume state change operations. These error rates were caused by an unexpected increase in request volume from the EBS storage servers to the EBS Metadata Service, which led to resource contention and increased request latencies within the EBS Metadata Service. This in turn led to request timeouts from the EBS storage servers and then to retries, which further increased the connection load on the EBS Metadata Service. The impact was resolved when Acquia’s infrastructure vendor reverted to a previous version of the EBS Metadata Service.
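The feedback loop described above (timeouts trigger retries, and retries add further load) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions, not the vendor's implementation: the function name, the parameters, the fixed per-tick capacity, and the numbers used are all hypothetical and chosen purely for illustration.

```python
def simulate_retry_feedback(new_per_tick, capacity, max_retries, ticks):
    """Toy model of retry amplification (hypothetical, for illustration only).

    Each tick, `new_per_tick` fresh requests arrive alongside retries of
    requests that timed out on the previous tick. The service can complete at
    most `capacity` requests per tick; the excess times out and is retried,
    up to `max_retries` times, before being abandoned.
    Returns the total offered load (fresh requests plus retries) per tick.
    """
    # attempts[k] = number of requests about to make attempt number k
    attempts = [0.0] * (max_retries + 1)
    offered_history = []
    for _ in range(ticks):
        attempts[0] = float(new_per_tick)
        offered = sum(attempts)
        # Fraction of offered requests the service can complete this tick.
        success_ratio = min(1.0, capacity / offered) if offered > 0 else 1.0
        failed = [a * (1.0 - success_ratio) for a in attempts]
        # A failed attempt k becomes attempt k+1 on the next tick;
        # failures of the final attempt are abandoned.
        attempts = [0.0] + failed[:-1]
        offered_history.append(offered)
    return offered_history


# Below capacity: no timeouts occur, so offered load stays flat.
healthy = simulate_retry_feedback(new_per_tick=100, capacity=120, max_retries=3, ticks=10)

# Above capacity: timed-out requests come back as retries and add to the load.
overloaded = simulate_retry_feedback(new_per_tick=150, capacity=120, max_retries=3, ticks=10)
```

In this model, a fresh request rate below capacity keeps the offered load flat, while a rate above capacity causes the offered load to climb beyond the fresh request rate as retries accumulate. This is the same qualitative pattern as the increased connection load described in the incident timeline.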
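Acquia’s infrastructure vendor has provided an RCA which includes remediations to prevent a recurrence of this issue. These include a number of revisions to the system that were already in the process of being released, enhanced monitoring and alerting, and better notification of system changes of this scope.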
Acquia continues to follow best practices from our infrastructure vendor including the provisioning of highly available servers across multiple availability zones to limit the scope of impact from any incident affecting a single AZ.