This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what is required to prevent a future occurrence if at all possible. Any remaining issues or risks are identified, as are recommended or pending actions.
On Tuesday, January 7, 2020, at 10:37, Acquia was alerted through its monitoring systems that search cores for a specific region were generating 404 errors. Acquia performed initial troubleshooting of the issue and escalated the issue internally at 11:11. Acquia posted a notice on https://status.acquia.com/ at 11:49 informing customers of the service disruption.
At 11:39, Acquia identified that the scope of the incident was limited to a single Acquia Search cluster. By 15:36, Acquia identified that a configuration update to a single Acquia Search Solr core in that cluster caused an improper configuration to be made to the cluster’s load balancer configuration file. This improper configuration led the load balancer to route requests to the cluster to an invalid upstream, resulting in 404 errors. Acquia corrected the load balancer configuration, resolving the service disruption by 15:45.
Acquia determined that the improper configuration made to the load balancer configuration file was a result of the Acquia Search platform erroneously managing the state of a previously deactivated Solr core. When a customer on the affected Acquia Search cluster applied a new Solr configuration to their Solr core, the Acquia Search platform attempted to re-use the deactivated core. This led to the core’s improper addition to the load balancer configuration file, resulting in the 404 errors.