This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so that there is a common understanding of what is required to prevent a recurrence, if at all possible.
On February 7, 2019, Acquia was implementing a temporary change to capture specific system diagnostics that are not part of the standard package. This involved setting up a temporary file system to store the diagnostic data. During the setup of the file system, a rule was accidentally deleted from a VPC security group, causing site stability issues and outages for multiple customers. The outage lasted from 22:17 to 23:25 UTC.
Acquia restored the missing rule in the security group manually.
Due to human error, a rule was incorrectly removed from a security group in a shared VPC in US-East. This rule allows web servers to connect to Memcache servers. Once the web servers were unable to reach the Memcache servers, PHP processes on the web servers became blocked, causing sites to become unavailable. Acquia’s configuration drift alerting did function appropriately, but the processes in place around drift alerts were such that the alert was not seen in time to materially reduce the duration of the outage.
Acquia is working with all involved teams to correct our internal workflow to prevent similar situations in the future.
Acquia is establishing additional measures to alert internal teams faster about configuration drift for VPC security groups.
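
As an illustration of what a configuration drift check for security group rules can look like, the following is a minimal sketch and not Acquia's actual tooling. It compares an expected baseline of ingress rules against the rules currently in place and reports any that are missing. The `Rule` fields, the `find_drift` function, and the group names are hypothetical; port 11211 is the default Memcache port, and in practice the live rules would be fetched from the cloud provider rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Rule:
    """One ingress rule; the fields are a simplified, hypothetical model."""
    protocol: str
    port: int
    source: str  # name of the group allowed to connect (illustrative)


def find_drift(expected, actual):
    """Return (missing, unexpected) rules, each as a sorted list."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)
    unexpected = sorted(actual_set - expected_set)
    return missing, unexpected


# Hypothetical baseline: web servers may reach Memcache on its default port.
EXPECTED = [
    Rule("tcp", 11211, "web-servers"),
    Rule("tcp", 22, "bastion"),
]

# Hypothetical live state after the Memcache rule has been deleted.
ACTUAL = [Rule("tcp", 22, "bastion")]

missing, unexpected = find_drift(EXPECTED, ACTUAL)
for rule in missing:
    # A real check would page an on-call engineer instead of printing.
    print(f"ALERT: missing rule {rule.protocol}/{rule.port} from {rule.source}")
```

A check of this kind only shortens an outage if its alert reaches someone who can act on it promptly, which is the gap described above.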