Software update glitch leads to Google Cloud outage
Thu 25 Aug 2016
Google has officially apologised to its U.S.-Central cloud customers after a software update glitch led to high error rates and latency.
The incident, which caused problems for almost two hours, occurred when a software update to Google's traffic routers coincided with the migration of applications between data centres.
Google Cloud’s status page noted: ‘The incident was triggered by a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly.
‘As part of this procedure, we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers.
‘During this procedure, a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity.’
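To illustrate why a rolling restart matters in this situation, the sketch below computes the share of router capacity still serving traffic while one batch of routers restarts. The function name, fleet size and batch size are hypothetical; Google has not published the actual figures.

```python
def capacity_during_rolling_restart(total_routers: int, batch_size: int) -> float:
    """Return the fraction of routers still serving while one batch restarts.

    Both arguments are illustrative assumptions, not figures from Google.
    """
    if total_routers <= 0:
        raise ValueError("total_routers must be positive")
    if not 0 <= batch_size <= total_routers:
        raise ValueError("batch_size must be between 0 and total_routers")
    return (total_routers - batch_size) / total_routers


# Hypothetical fleet: 20 routers, restarted 4 at a time.
remaining = capacity_during_rolling_restart(20, 4)
print(f"{remaining:.0%} of router capacity available")
```

In this made-up case a fifth of the routers are offline at any moment, so any extra load arriving during the restart has to be absorbed by the remaining 80%.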
Some 21% of applications in Google App Engine hosted in the U.S.-Central region experienced error rates above 10%, and a further 16% saw lower, but still elevated, error rates. The problem lasted from 13:13 to 15:00 PDT on Thursday 11th August.
Google explained that its App Engine had automatically redirected requests to other data centres to reduce the overload. It added that engineers had also manually redirected traffic at 13:56 PDT.
The company said services were fully restored once engineers fixed a configuration error that had caused a traffic imbalance at the other facilities.
Google confirmed that it has increased its traffic routing capacity to prevent a recurrence. ‘We will also change how applications are rescheduled so that the traffic routers are not called and also modify the system’s retry behavior so that it cannot trigger this type of failure.
‘We know that you rely on our infrastructure to run your important workloads and that this incident does not meet our bar for reliability. For that, we apologize.’
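Google's statement does not say how the retry behaviour will change. One common way to stop retries from piling load onto a shared component is capped exponential backoff with random jitter. The sketch below shows that general technique only; the function name and parameter values are hypothetical and it is not Google's implementation.

```python
import random


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=None):
    """Return the wait in seconds before each retry attempt.

    Uses capped exponential backoff with full jitter: the upper bound
    doubles on every attempt up to `cap`, and the actual wait is drawn
    uniformly between 0 and that bound. All defaults are assumptions
    chosen for illustration.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


# Example: waits for up to six retries of a slow startup request.
for i, delay in enumerate(backoff_delays(6), start=1):
    print(f"retry {i}: wait {delay:.2f}s")
```

Spreading retries out in this way means that many slow operations retried at once are less likely to hit the same component at the same moment.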