Information Technology News.


Trying to do two updates at the same time isn't a good idea




August 24, 2016

Trying to do two major software updates at once isn't a good idea, and Google will tell you that after its recent mishap, which was entirely of its own making.

Google has explained an August 11th update issue on its cloud as a self-inflicted wound of some sort.

At the time of the mishap, Google said its App Engine APIs were unavailable for a time, but without providing any more information.

It's now saying the two-hour incident meant that “about 18 percent of applications hosted in the US-CENTRAL region experienced error rates between 10 and 50 percent, and that about 3 percent of applications experienced error rates in excess of 50 percent.”

“Additionally, 14 percent experienced error rates between 1 and 10 percent, and 2 percent experienced error rate below 1 percent but above baseline levels.”

Users also experienced “a median latency increase of just under 0.8 seconds per request.” Google has now revealed the root cause of the incident, which started with “a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly.”

When Google does this sort of thing “we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim some resources. The applications running on the drained servers are automatically rescheduled onto different servers.”
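The drain-and-reschedule procedure Google describes can be modelled in a few lines. This is a minimal sketch, not Google's implementation; the server and app names are hypothetical, and the "least-loaded server" placement rule is an assumption for illustration:

```python
def drain(servers, fraction):
    """Gracefully drain a fraction of servers: each app hosted on a
    drained server is rescheduled onto the least-loaded remaining
    server (hypothetical placement policy)."""
    n_drain = int(len(servers) * fraction)
    drained, remaining = servers[:n_drain], servers[n_drain:]
    for server in drained:
        for app in server["apps"]:
            # Pick the remaining server currently hosting the fewest apps.
            target = min(remaining, key=lambda s: len(s["apps"]))
            target["apps"].append(app)
        server["apps"] = []  # server is now empty and can be reclaimed
    return remaining

# Ten servers, four apps each; drain 30% of them.
servers = [{"id": i, "apps": [f"app-{i}-{j}" for j in range(4)]}
           for i in range(10)]
live = drain(servers, 0.3)
total_apps = sum(len(s["apps"]) for s in live)  # all 40 apps survive
```

The point of the model is that no apps are lost during a drain: capacity shrinks, but every application is restarted elsewhere, which is exactly the rescheduling step that later interacted badly with the router update.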

But things started getting out of hand when Google was draining the pool on this occasion, “and a software update on the traffic routers was also in progress at the same time, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity.”

“The 'server drain' resulted in rescheduling of multiple instances of manually-scaled applications. App Engine creates new instances of manually-scaled applications by sending a startup request via the traffic routers to the server hosting the new instance,” Google explained.

Some of those manually-scaled instances started up slowly, “resulting in the App Engine system retrying the start requests several times, which caused a large spike in CPU load on the traffic routers. The overloaded traffic routers dropped some incoming requests as a result.”
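The amplification mechanism is worth spelling out: when each slow startup is retried immediately, the routers see the original request plus every retry, so total traffic is a multiple of the real demand. A back-of-the-envelope model, with made-up numbers (the request counts and capacity here are illustrative, not Google's):

```python
def router_load(start_requests, retries, capacity):
    """Each slow start request is retried `retries` more times, so the
    routers see (1 + retries) requests per app. Anything beyond
    capacity is dropped."""
    total = start_requests * (1 + retries)
    dropped = max(0, total - capacity)
    return total, dropped

# 1,000 start requests, 5 immediate retries each, routers sized for 3,000:
total, dropped = router_load(1000, 5, 3000)  # 6,000 offered, 3,000 dropped
```

With retries the offered load doubles past what the routers were provisioned for, even though the underlying demand (1,000 starts) was well within capacity, which matches Google's statement that it had enough routing capacity for the load but not for the retry spike.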

Google added that it had enough routing capacity to handle the load, but that the routers weren't provisioned for all those retry requests. And so its cloud experienced a partial outage as a result.

Google managed to roll back the update and restore its services, and now promises that “In order to prevent a recurrence of this type of incident, we have added more traffic routing capacity in order to create more buffering when draining servers in this region of our system.”

“We will also modify how applications are rescheduled so that the traffic routers are not called, and modify the system's retry behavior so that it cannot trigger this type of failure.”
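Google doesn't describe the retry change in detail. A common way to make retries safe against exactly this kind of cascade is exponential backoff with jitter; the sketch below shows that pattern, on the assumption (not confirmed by Google) that something of this shape is what "cannot trigger this type of failure" means:

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: before retry number i,
    wait a random time in [0, min(cap, base * 2**i)] rather than
    retrying immediately, spreading retries out over time."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(random.uniform(0, ceiling))
    return delays

delays = backoff_delays(6)  # e.g. waits before each of 6 retries
```

Because each retry waits longer on average than the last, and the jitter desynchronizes clients, a burst of slow startups produces a gentle tail of retries instead of a simultaneous spike that overloads the routers.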

But there's still no mention of trying to schedule upgrades so that it's only doing one at a time. We'll keep you updated.

Source: Google.






       © IT Direction. All rights reserved.