Acquia has detected a temporary service interruption in Acquia Search services
Incident Report for Acquia, Inc.
Postmortem

Purpose of This Report

This is a summary and analysis of an issue that occurred with the delivery of an Acquia product or service. The purpose of this document is to share details about what happened and why, so there is a common understanding of what is required to prevent a future occurrence if at all possible. Any remaining issues or risks are identified, as are recommended or pending actions.

What happened (all times are UTC unless noted)

On Tuesday, January 7, 2020, at 10:37, Acquia was alerted through its monitoring systems that search cores for a specific region were generating 404 errors. Acquia performed initial troubleshooting of the issue and escalated the issue internally at 11:11. Acquia posted a notice on https://status.acquia.com/ at 11:49 informing customers of the service disruption.

What we did about it

At 11:39, Acquia identified that the scope of the incident was limited to a single Acquia Search cluster. By 15:36, Acquia identified that a configuration update to a single Acquia Search Solr core in that cluster caused an improper configuration to be made to the cluster’s load balancer configuration file. This improper configuration led the load balancer to route requests to the cluster to an invalid upstream, resulting in 404 errors. Acquia corrected the load balancer configuration, resolving the service disruption by 15:45.

Identified Root Cause

Acquia determined that the improper configuration made to the load balancer configuration file was a result of the Acquia Search platform erroneously managing the state of a previously deactivated Solr core. When a customer on the affected Acquia Search cluster applied a new Solr configuration to their Solr core, the Acquia Search platform attempted to re-use the deactivated core. This led to the core’s improper addition to the load balancer configuration file, resulting in the 404 errors.

Next Steps

  1. Acquia will add additional validation to the Acquia Search platform so it accurately maintains the state of deactivated cores to prevent their reactivation by the system.
  2. Customers should reach out to their Account team with any followup questions so that we can address them on a case by case basis.
Posted Jan 13, 2020 - 18:12 UTC

Resolved
The underlying cause of this service interruption has been addressed. All affected Acquia Search services have been restored. All services are operational at this time.
Posted Jan 07, 2020 - 17:55 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 07, 2020 - 16:34 UTC
Identified
We are currently investigating an interruption of service for Acquia Search that is affecting some Acquia Cloud Enterprise and Acquia Cloud Site Factory sites. We will continue to provide updates as they become available.
Posted Jan 07, 2020 - 16:07 UTC
Update
We continue to investigate the interruption of service for Acquia Search that is affecting some Acquia Cloud Enterprise and Acquia Cloud Site Factory sites. We will continue to provide updates as they become available.
Posted Jan 07, 2020 - 15:23 UTC
Investigating
We are currently investigating an interruption of service for Acquia Search services for some Acquia Cloud Enterprise and Acquia Cloud Site Factory sites. We will provide additional information as it becomes available.
Posted Jan 07, 2020 - 11:49 UTC
This incident affected: Acquia Search.