This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what happened and, where possible, what is required to prevent a recurrence. Any remaining issues or risks are identified, as are recommended or pending actions.
On 19 February 2021 at 14:01 UTC, an outage occurred in a single availability zone within the AP-Northeast-1 (Tokyo) datacenter managed by our infrastructure provider. This affected application performance or availability for a small number of Acquia Cloud Enterprise customers hosted in this regional datacenter. By 20:28 UTC most systems had fully recovered, though continued monitoring was recommended. Acquia continued to actively monitor for any additional service failures or related alerts from monitoring systems until approximately 01:00 UTC on 20 February.
For production environments, Acquia Cloud Enterprise and Acquia Cloud Site Factory use highly available server pairs provisioned in separate availability zones, a best practice recommended by our infrastructure provider to mitigate the risk of a service interruption.
Impact to applications varied depending on the type of server affected. For customers whose primary load balancer was affected, Acquia responded to alerts from standard monitoring and failed over to the secondary balancer as needed; some period of service interruption may have occurred during failover. For customers whose primary file or database server was affected, failover was initiated automatically in the majority of cases, with a small number of cases requiring additional manual intervention. Loss of a file or database server may have degraded performance while high availability was reduced, but this redundancy should have prevented an interruption of service for the majority of those affected.
Only Acquia Cloud Enterprise was affected by this outage. No other Acquia products or services were affected.
The identified root cause of the outage was a thermal event affecting a single data center supporting the AWS Tokyo (ap-northeast-1) region.