Site Degraded

Incident Report for Acquia Inc

Postmortem

What happened

On August 23, 2018, Acquia released a component update for an internal platform service on the web server tier. Acquia runs a daemon on all web servers for clients with a high site count; on each incoming request, it determines whether the system has enough memory available to serve that request. If PHP is not already running, it first queries this daemon to determine whether it is OK to launch a new process. As part of our ongoing maintenance, we deployed a change to this daemon. We did not observe any problems with this change in our testing or in our pre-production deployments, and we expected no customer impact.
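
The kind of memory check described above can be illustrated with a minimal sketch. This is not Acquia's implementation: the names `MemorySnapshot`, `may_launch_process`, and `reserve_mb`, as well as the specific numbers, are assumptions chosen only to show the idea of admitting a new process when enough memory remains.

```python
from dataclasses import dataclass


@dataclass
class MemorySnapshot:
    """Hypothetical view of system memory, in megabytes."""
    available_mb: int


def may_launch_process(snapshot: MemorySnapshot,
                       estimated_process_mb: int,
                       reserve_mb: int = 256) -> bool:
    """Allow a new process only if it fits without using the reserve.

    All thresholds here are illustrative assumptions, not values
    taken from the incident report.
    """
    remaining_mb = snapshot.available_mb - estimated_process_mb
    return remaining_mb >= reserve_mb


# A server with plenty of free memory admits the new process;
# a nearly exhausted one does not.
roomy = may_launch_process(MemorySnapshot(available_mb=1024), 128)
tight = may_launch_process(MemorySnapshot(available_mb=300), 128)
```

In this sketch the decision is a pure function of a memory snapshot, which keeps the policy easy to test separately from however the daemon actually gathers its measurements.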

Hours after the deployment was complete, Acquia started receiving automated and customer-submitted alerts that some sites were experiencing problems. Initial investigation narrowed the likely culprits down to the daemons responsible for launching new processes in a specific PHP configuration.

What we did about it

None of the initial attempts to restart the daemon had the desired effect, so we decided to turn off the functionality related to the offending PHP configuration. While this change helped some sites, it caused an even bigger problem on some servers running a high number of sites, where the PHP processes did not have enough memory to complete requests as expected.

At this point, we decided to roll back the release that introduced the change to this daemon. During the rollback, a subset of customers experienced problems running their scheduled cron jobs. After the rollback was completed, site availability was restored.

Identified Root Cause

The resource control daemon stopped handling connections from customer applications using a specific PHP configuration. While other failure modes of the resource control daemon are handled automatically, this case was not, and it caused PHP processes to hang or deny requests that would normally be handled.
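
One general way to keep a caller from hanging when a dependency stops answering is to bound the query and fall back to a default decision. The sketch below illustrates that pattern only; the report does not say how Acquia addressed this failure mode, and `admission_with_fallback`, `query_daemon`, and `fallback_allow` are hypothetical names.

```python
from typing import Callable


def admission_with_fallback(query_daemon: Callable[[], bool],
                            fallback_allow: bool = True) -> bool:
    """Ask the daemon for a decision; use a default if it cannot answer.

    `query_daemon` is a hypothetical callable that is expected to apply
    its own timeout and raise an OSError subclass (for example
    TimeoutError or ConnectionRefusedError) when the daemon is
    unreachable or unresponsive.
    """
    try:
        return query_daemon()
    except OSError:
        # The daemon did not answer: return the default instead of hanging.
        return fallback_allow


def unresponsive_daemon() -> bool:
    """Stand-in for a daemon that has stopped handling connections."""
    raise TimeoutError("daemon did not respond")


decision = admission_with_fallback(unresponsive_daemon, fallback_allow=True)
```

The choice of default is a trade-off. As the timeline above shows, bypassing the memory check helped some sites but left servers with a high number of sites short of memory, so a fail-open default is not automatically safe.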

Posted Aug 29, 2018 - 18:16 UTC

Resolved

This incident has been resolved.

Posted Aug 24, 2018 - 21:38 UTC

Investigating

All affected sites have been restored. All services are operational at this time.

Posted Aug 24, 2018 - 21:19 UTC

Update

Some customers are experiencing down or degraded sites. We have identified the cause and are actively working to resolve this incident.

Posted Aug 24, 2018 - 17:04 UTC

Identified

Some customers are experiencing down or degraded sites. The team is actively investigating this incident.

Posted Aug 24, 2018 - 13:00 UTC

This incident affected: Cloud Platform Enterprise and Cloud Platform Professional.