Acquia has detected an interruption in services for some applications in the Asia Pacific North East region

Incident Report for Acquia Inc

Postmortem

Purpose of This Report

This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what is required to prevent a future occurrence if at all possible. Any remaining issues or risks are identified, as are recommended or pending actions.

What happened

On 19 February 2021 at 14:01 UTC an outage occurred for a single availability zone within the AP-Northeast-1 (Tokyo) datacenter managed by our infrastructure provider. This affected application performance or availability for a small number of Acquia Cloud Enterprise customers with hosting in this regional datacenter. By 20:28 UTC most systems had fully recovered, though continued monitoring was recommended. Acquia continued to actively monitor for any additional service failures or related alerts from monitoring systems until approximately 20 February at 01:00 UTC.

For production environments, Acquia Cloud Enterprise and Acquia Cloud Site Factory use highly available server pairs provisioned in separate availability zones. This is a best practice recommended by our provider to reduce the chance of an interruption of service.

Impact to applications varied depending on the type of server affected. For customers whose primary load balancers were affected, Acquia responded to normal alerts from monitoring and failed over to the secondary balancer as needed; some period of service interruption may have been experienced. For customers whose primary file or database server was affected, failover actions were automatically initiated in the majority of cases, with a small number of cases requiring additional manual intervention. Loss of a file or database server may have resulted in degraded performance while high availability was lost, but this redundancy should have prevented an interruption of service for the majority of those affected.
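
For illustration only, the failover behavior described above can be sketched as a small model. This is a simplified sketch under stated assumptions, not Acquia's actual tooling: the names `Server`, `HAPair`, and `fail_zone`, and the zone labels `zone-a` and `zone-b`, are hypothetical. It shows why losing one availability zone leaves a pair serving traffic in a degraded (non-redundant) state rather than down.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    # Hypothetical model of one member of a highly available pair.
    name: str
    zone: str
    healthy: bool = True


@dataclass
class HAPair:
    # Two servers of the same role placed in separate availability zones.
    primary: Server
    secondary: Server

    def active(self) -> Optional[Server]:
        """Return the server that should take traffic, or None if both are down."""
        if self.primary.healthy:
            return self.primary
        if self.secondary.healthy:
            return self.secondary
        return None

    def degraded(self) -> bool:
        """True when exactly one member is healthy: service is up, redundancy is lost."""
        return self.primary.healthy != self.secondary.healthy


def fail_zone(pairs: List[HAPair], zone: str) -> None:
    """Mark every server in the given zone as unhealthy (simulated zone outage)."""
    for pair in pairs:
        for server in (pair.primary, pair.secondary):
            if server.zone == zone:
                server.healthy = False


# Simulate the loss of a single zone for one database pair.
db_pair = HAPair(Server("db-1", "zone-a"), Server("db-2", "zone-b"))
fail_zone([db_pair], "zone-a")
active_after_outage = db_pair.active()
```

In this model, traffic moves to `db-2` after `zone-a` fails and `db_pair.degraded()` reports the loss of redundancy, which corresponds to the degraded-but-available state described for file and database servers.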

Only Acquia Cloud Enterprise was affected by this outage. No other Acquia products or services were affected.

Identified Root Cause

The identified root cause of the outage was a thermal event affecting a single data center supporting the AWS Tokyo (ap-northeast-1) region.

Next Steps

Acquia has received confirmation from our infrastructure provider of remediations already made, and others in progress, to prevent recurrence of this type of event.
Customers should contact their account manager with any additional questions related to this event.

Posted Mar 12, 2021 - 22:27 UTC

Resolved

The service interruption impacting some applications in the Asia Pacific North East (ap-northeast-1) region has been resolved. If you have any questions or experience any technical difficulties you believe are related to this service interruption, please file a Support ticket.

Posted Feb 20, 2021 - 01:01 UTC

Identified

We have identified and are continuing to monitor the interruption of service affecting some applications in the Asia Pacific North East region. We will provide further information as soon as this issue has been fully resolved.

Posted Feb 19, 2021 - 20:02 UTC

Investigating

We are currently investigating an interruption of service for Acquia Cloud affecting some applications in the Asia Pacific North East region. We will provide additional information as it becomes available.

Posted Feb 19, 2021 - 14:54 UTC

This incident affected: Cloud Platform Enterprise and Acquia Site Factory.