
How a data centre flat tyre disrupted Google services

Written by Tue 17 Mar 2020

Crushed wheels tipped data centre rack forward, misaligning liquid coolant levels

Google has revealed that a set of wheels supporting a rack in one of its data centres buckled, setting off a chain of events that caused some CPUs to overheat and disrupted Search, Gmail, and other services for some users.

The unusual episode was discovered after a site reliability engineer on the company’s traffic and load balancing team was alerted that Google services being supported by its edge network were producing an abnormally high number of errors.

After stopping the affected machines from serving customers, the team began a root cause analysis.

Initially, a router on a rack appeared to be at fault. On further inspection, however, the nature of the router error pointed to the machines rather than the router itself. Unusually, only machines on a single rack were affected.
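The narrowing-down step described above, noticing that the errors cluster on one rack, can be illustrated with a short sketch. It is not Google's tooling: the function name, the machine-to-rack mapping, the sample events, and the 80% threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def racks_with_anomalous_errors(error_events, machine_to_rack, min_share=0.8):
    """Return racks that account for at least `min_share` of all errors.

    `error_events` is a list of machine names, one entry per observed error.
    `machine_to_rack` maps each machine name to the rack that houses it.
    Both inputs and the threshold are assumptions made for this example.
    """
    per_rack = Counter(machine_to_rack[machine] for machine in error_events)
    total = sum(per_rack.values())
    if total == 0:
        return []
    return [rack for rack, count in per_rack.items() if count / total >= min_share]


# Hypothetical inventory and error stream.
machine_to_rack = {"m1": "rack-a", "m2": "rack-a", "m3": "rack-b", "m4": "rack-c"}
error_events = ["m1", "m2", "m1", "m2", "m1", "m3"]

suspects = racks_with_anomalous_errors(error_events, machine_to_rack)
print(suspects)
```

Grouping by physical location rather than by service or router is what makes a hardware cause stand out: if one rack produces nearly all the errors while its neighbours stay quiet, the fault is likely local to that rack.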

“Why would a single rack be overheating to the point of CPU throttling when its neighbors were totally unaffected?” wrote Steve McGhee, solutions architect, Google Cloud, in an online post-mortem of the incident.

“What is it about the physical support for machines that would cause kernel errors? It didn’t add up,” he added.

To find out what was going on, Google’s on-site hardware and operations management team was asked to inspect the overheating machines.

When they reached the rack in question, the problem became clear:

“Hello, we have inspected the rack. The casters on the rear wheels have failed and the machines are overheating as a consequence of being tilted,” the team wrote in a message to the site reliability engineer.

That’s right. The wheels supporting the rack had completely buckled, causing it to tilt forward and disrupt the flow of liquid coolant that was supposed to be keeping CPU temperature under control.

The hardware team duly fixed the wheel array and realigned the data centre rack. But their efforts didn’t stop there. They then systematically replaced all racks susceptible to the issue to prevent a similar incident occurring in the future.

“[A] phrase we commonly use here on SRE teams is “All incidents should be novel”—they should never occur more than once,” wrote McGhee. “In this case, the SREs and hardware operation teams worked together to ensure that this class of failure would never happen again.”

“This level of rigorous analysis and persistence is a great example of incident response using deep and broad monitoring and the culture of responsibility that keeps Google running 24×7,” he added.


