Acquia has detected a temporary service interruption affecting multiple products
Incident Report for Acquia, Inc.
Postmortem

Purpose of This Report

This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what is required to prevent a future occurrence if at all possible. Any remaining issues or risks are identified, as are recommended or pending actions.

Executive Summary

Between approximately 16:30 UTC on 7 December and 02:30 UTC on 8 December, a number of Acquia products were affected by a major outage originating from our third-party infrastructure provider. The outage was caused by scaling activities at the provider, which impacted the control plane for communication between parts of the provider’s networks. Our infrastructure provider has provided assurance that all related activities have been paused and will not be resumed until all remediations are in place to prevent this issue from occurring again.

Event Summary

Between 7 December at approximately 16:30 UTC and 8 December 02:30 UTC the following Acquia products and services were impacted, or noted as impacted, by a major incident which occurred with our infrastructure provider:

  • Acquia Cloud Next
  • Acquia Cloud Pipelines
  • Cloud IDE
  • Acquia Platform Email Services
  • Acquia Customer Data Platform
  • Acquia Personalization

 

For Acquia Cloud Next, this event primarily impacted the ability of tasks, such as code deployments, to complete.

For Acquia Cloud Pipelines, access via the Pipelines UI was impacted for a portion of this event, and the ability of code deployment tasks to complete was impacted for the full duration.

For Cloud IDE, IDE environments were non-operational for the duration of the event.  This included new and existing IDEs.

For Acquia Cloud Platform Email, customers would have been unable to register or verify email domains or add new subscriptions to the email service during this event. The ability to send emails through applications that were already configured was not impacted.

For Acquia Customer Data Platform, there was a possibility of workflow failures, and our teams monitored closely so that any failed workflows would be restarted. After review, Acquia did not identify any failures that occurred as a result of this issue.

For Acquia Cloud Personalization, access via the UI for US East-based customers was impaired until 21:29 UTC.

This event did not directly impact the availability of Acquia Cloud hosting environments or the ability of applications to respond to end user requests.  

Identified Root Cause

The root cause of this incident was a change made by Acquia’s third-party infrastructure provider. At 15:30 UTC, an automated activity to scale the capacity of particular services hosted in the main provider network triggered unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main provider network, delaying communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.

This resulted first in cascading DNS errors.  Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 17:28 UTC, the provider team completed this work and DNS resolution errors fully recovered. This change improved the availability of several impacted services by reducing load on the impacted networking devices, but did not fully resolve the AWS service impact or eliminate the congestion.  The infrastructure provider team continued working on a set of remediation actions to reduce congestion on the internal network including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online.  As the provider team continued applying the remediation actions described above, congestion significantly improved by 21:34 UTC, and all network devices fully recovered by 22:22 UTC.  Acquia teams continued to monitor to ensure full recovery until the incident was declared as fully resolved at 02:30 UTC.
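
The feedback loop described in this section, where errors trigger retries that add further load, is the kind of behavior that client-side exponential backoff with jitter is commonly used to dampen. The following is a minimal illustrative sketch only. It is not taken from the provider's remediation or from any Acquia product, and the names `backoff_delays` and `call_with_backoff`, along with all default values, are hypothetical.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Yield one randomized delay (in seconds) per retry attempt.

    Hypothetical helper: `base`, `cap`, and `attempts` are illustrative defaults.
    """
    for attempt in range(attempts):
        # The ceiling doubles with each attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": scale the ceiling by a random factor so that many
        # clients failing at the same moment do not retry at the same moment.
        yield rng() * ceiling


def call_with_backoff(operation, retries=5, sleep=time.sleep, **kwargs):
    """Call `operation`, retrying on ConnectionError with jittered backoff."""
    for delay in backoff_delays(attempts=retries, **kwargs):
        try:
            return operation()
        except ConnectionError:
            # Wait before the next attempt instead of retrying immediately.
            sleep(delay)
    # Final attempt after all retries are used; any error propagates.
    return operation()
```

A caller would wrap a network request, for example `call_with_backoff(fetch_status, retries=3)`, where `fetch_status` is a placeholder for any function that raises `ConnectionError` on failure. Capping the delay and the number of retries bounds the extra load that a single client can add while a dependency is degraded.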

Corrective Actions

  1. Acquia’s infrastructure provider has ceased all scaling activities that were related to this incident.  Sufficient capacity is in place to safely defer further action.
  2. Acquia is in close communication with our infrastructure provider to review and ensure implementation of all remediating actions meant to prevent recurrence of this type of outage.
Posted Dec 17, 2021 - 22:14 UTC

Resolved
All impacted Acquia Cloud products and services have been restored. All services are operational at this time.
Posted Dec 08, 2021 - 02:30 UTC
Update
We are continuing to coordinate with our infrastructure provider regarding this interruption of service. The following products are known to be impacted:

Acquia Cloud Next
Acquia Cloud CD Pipelines
Drupal Cloud IDE

At present these issues are not known to be directly impacting the availability of production applications. This issue primarily impacts the ability of some tasks to complete (e.g. code deployments).

Normal service for Acquia Platform Email services has been restored.

We will provide updates on impacted products and services as we continue to monitor. We will continue coordinating with our infrastructure provider until normal service is restored for all products.
Posted Dec 08, 2021 - 00:22 UTC
Update
We are continuing to coordinate with our infrastructure provider regarding this interruption of service. The following products are known to be impacted:

Acquia Cloud CD Pipelines
Drupal Cloud IDE
Acquia Platform Email Services
Acquia Cloud Next

At present these issues are not known to be directly impacting the availability of production applications.

Normal service for Acquia Customer Data Platform has been restored.

We will provide updates on impacted products and services as we continue to monitor. We will continue coordinating with our infrastructure provider until normal service is restored for all products.
Posted Dec 08, 2021 - 00:05 UTC
Update
We are continuing to coordinate with our infrastructure provider regarding this interruption of service. The following products are known to be impacted:

Acquia Cloud CD Pipelines
Drupal Cloud IDE
Acquia Platform Email Services
Acquia Customer Data Platform
Acquia Cloud Next

At present these issues are not known to be directly impacting the availability of production applications.

Tasks such as code deployments via the Cloud UI or Acquia Pipelines may not complete.
Cloud IDE instances may not function at this time.

Normal service for Acquia Personalization has been restored, and our teams continue to monitor to ensure that recovery remains stable.

We will provide updates on impacted products and services as we continue to monitor. We will continue coordinating with our infrastructure provider until normal service is restored for all products.
Posted Dec 07, 2021 - 21:29 UTC
Update
We are continuing to coordinate with our infrastructure provider regarding this interruption of service. We will provide updates on impacted products and services as we continue to monitor.
Posted Dec 07, 2021 - 21:04 UTC
Update
We are continuing to coordinate with our infrastructure provider regarding this interruption of service.
Posted Dec 07, 2021 - 19:03 UTC
Update
We are continuing to coordinate with our infrastructure provider regarding this interruption of service. The following products are known to be impacted:

Acquia Cloud CD Pipelines
Drupal Cloud IDE
Acquia Platform Email Services
Acquia Customer Data Platform
Acquia Personalization
Acquia Cloud Next

At present these issues are not known to be directly impacting the availability of production applications.

However, user interfaces may be unavailable, and tasks such as code deployments via the Cloud UI or Acquia Pipelines may not complete.

Cloud IDE instances may not function at this time.

We will provide updates on impacted products and services as we continue to monitor. We will continue coordinating with our infrastructure provider until normal service is restored for all products.
Posted Dec 07, 2021 - 18:27 UTC
Update
We have identified an issue originating from Acquia’s infrastructure provider. This impacts the following products and services:

Acquia Cloud CD Pipelines
Drupal Cloud IDE
Acquia Platform Email Services
Acquia Customer Data Platform
Acquia Personalization
Acquia Cloud Next

At present these issues are not known to be directly impacting the availability of production applications. However, user interfaces may be unavailable, and tasks such as code deployments via the Cloud UI or Acquia Pipelines may not complete. New Cloud IDE instances cannot be created at this time.

We will provide updates on impacted products and services as we continue to monitor. We will continue coordinating with our infrastructure provider until normal service is restored for all products.
Posted Dec 07, 2021 - 17:46 UTC
Identified
We have identified an issue originating from Acquia’s infrastructure provider. This impacts the following products and services:

Acquia Cloud CD Pipelines
Drupal Cloud IDE
Acquia Platform Email Services
Acquia Customer Data Platform
Acquia Personalization

We will provide updates on impacted products and services as we continue to monitor. We will continue coordinating with our infrastructure provider until normal service is restored for all products.
Posted Dec 07, 2021 - 17:08 UTC
This incident affected: Cloud IDE and Acquia Cloud CD Pipelines.