How lessons learned from the September 2024 incident in our Nuremberg Data Center helped us prevent a similar situation in October.
What Happened?
On October 9, 2024, our Nuremberg Data Center briefly lost power from the public grid. At 18:52, our uninterruptible power supply (UPS) detected the outage and took over, keeping all servers and networking devices running without interruption. The power outage also affected our cooling system.
When power returned a few seconds later, not all components of the cooling system came back online automatically. One of the six cooling system pumps got stuck in error mode. An alarm was triggered, and a Data Center Technician rushed to fix the issue.
The pump required a manual reset. We performed the reset before the increased temperature impacted any servers inside the Data Center.
All in all, it took us less than 60 minutes to investigate, mitigate, and resolve the incident.
This was possible thanks to the lessons learned from the September incident in the same Data Center. With new procedures in place, our onsite team was able to quickly identify the affected device and manually reset the cooling pump without relying on external support.
Timeline of the Event
- 18:52: The UPS detected a power outage; all servers remained online
- 19:24: Internal alarms were triggered by the cooling system failing to restart after the power outage
- 19:30: A Data Center Technician started the investigation
- 20:14: The Data Center Technician identified one of the cooling pumps as stuck in reset mode, manually reset it, and put it back in working mode
- 20:15: Incident closed; no customers affected
No Impact on Migration to Hub Europe
This incident does not change our commitment to migrate all servers from Nuremberg to our Hub Europe Data Center free of charge for all Nuremberg customers. The migration is progressing swiftly, and over 20,000 servers have already been migrated.
Conclusions
You might wonder why we are bothering you with a situation that, in the end, didn’t impact any of your servers. We believe it’s important to show you that when we said we would learn from the September outage, we really meant it. We acted faster, and our processes were better.
The October situation could have affected our customers much as the September outage did, but thanks to the changes we made, not a single customer was affected. We still have a long way to go, but we will keep improving our infrastructure’s stability and our customers’ experience.