Acquia has detected an interruption of service for some Acquia Cloud Enterprise sites
Incident Report for Acquia, Inc.
Postmortem

Purpose of This Report

This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why so there is a common understanding of what is required to prevent a future occurrence if at all possible.

What happened

On February 7, 2019 Acquia was working on the implementation of a temporary change in order to capture specific system diagnostics not part of the standard package. This involved setting up a temporary file system to store the diagnostic data. While working on the setup of the file system, a rule was accidentally deleted from a VPC security group causing site stability issues and outages for multiple customers. The outage lasted from 22:17 to 23:25 UTC.

What we did about it

Acquia restored the missing rule in the security group manually.

Identified Root Cause

Due to human error a rule was incorrectly removed from a security group in a shared VPC in US-East, This rule allows web servers to connect to Memcache servers. Once the web servers were unable to reach Memcache servers, PHP processes were blocked on the web servers causing sites to become unavailable. Acquia’s configuration drift alerting did function appropriately, but it was determined that the processes in place around drift errors were such that the alert was not seen in time to make a material impact to the duration of the outage.

Next Steps

Acquia is working with all involved teams to correct our internal workflow to prevent similar situations in the future.

Acquia is establishing additional measure to alert internal teams faster about configuration drift for VPC security groups.

Posted 5 months ago. Feb 11, 2019 - 23:45 UTC

Resolved
All affected sites have been restored. All services are operational at this time.
Posted 5 months ago. Feb 07, 2019 - 23:41 UTC
Investigating
We are currently investigating an interruption of service for some US-East Acquia Cloud Enterprise sites. We will provide additional information as it becomes available.
Posted 5 months ago. Feb 07, 2019 - 22:46 UTC
This incident affected: Acquia Cloud Enterprise.