Information Technology News.


Google suffered latency and write errors for 211 minutes on June 28


July 12, 2016

For at least the second time in the last three months, Google has admitted to breaking its own cloud. Understandably, you could sense a bit of embarrassment coming from the company.

The most recent 'cascading snafu' occurred on June 28, when Google Compute Engine's SSD Persistent Disk system in the U.S. experienced elevated write latency and errors in one zone for 211 minutes.

The screwup meant that affected disks stopped accepting writes, and instances that used those SSDs as their root partition likely hung.

Google is usually good about revealing just why things go bad in its cloud. This time around it said: “Two concurrent routine maintenance events triggered a rebalancing of data by the distributed storage system underlying Persistent Disk.”

Google told people not to worry too much since “this rebalancing is designed to make maintenance events invisible to the user, by redistributing data evenly around unavailable storage devices and machines.”
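In plain terms, rebalancing means moving the data that lived on devices taken down for maintenance onto the devices still in service. As a rough illustration only (the device names, block IDs, and greedy placement policy below are our own invention, not Google's actual algorithm), it might look like this:

```python
def rebalance(placement, unavailable):
    """Move blocks off unavailable devices, spreading them evenly
    across the devices that remain in service."""
    available = [d for d in placement if d not in unavailable]
    moved = [b for d in unavailable for b in placement[d]]
    new_placement = {d: list(placement[d]) for d in available}
    for block in moved:
        # Greedy policy: place each block on the least-loaded device.
        target = min(new_placement, key=lambda d: len(new_placement[d]))
        new_placement[target].append(block)
    return new_placement

# Four devices; two are taken down by concurrent maintenance events.
placement = {"dev0": [0, 1], "dev1": [2, 3], "dev2": [4, 5], "dev3": [6, 7]}
after = rebalance(placement, unavailable={"dev2", "dev3"})
print(after)  # all eight blocks now live on dev0 and dev1
```

Done right, users never notice: every block stays reachable, just on different hardware.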

Which is just how a cloud should behave: lots of moving parts at the back end, invisible to users, who just keep getting well-behaved servers.

But on this occasion, “a previously unseen software bug, triggered by the two concurrent maintenance events, meant that disk blocks which became unused as a result of the rebalancing act were not freed up for subsequent reuse, depleting the available SSD space in the zone until writes were rejected.”
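The failure mode is a classic resource leak, just at data-center scale: space vacated by rebalancing was never handed back to the allocator, so each maintenance cycle shrank the zone's usable capacity until writes failed. A toy model (the class, numbers, and leak flag below are hypothetical, purely to illustrate the reported mechanism):

```python
class SsdZone:
    """Toy model of a zone's SSD capacity, counted in blocks."""

    def __init__(self, capacity, leak_on_rebalance=False):
        self.free = capacity
        self.leaked = 0
        self.leak_on_rebalance = leak_on_rebalance

    def write(self, blocks):
        if blocks > self.free:
            raise IOError("zone out of SSD space: write rejected")
        self.free -= blocks

    def rebalance_release(self, blocks):
        # Blocks vacated by rebalancing should rejoin the free pool.
        if self.leak_on_rebalance:
            self.leaked += blocks  # the bug: space is never reclaimed
        else:
            self.free += blocks

healthy = SsdZone(capacity=100)
buggy = SsdZone(capacity=100, leak_on_rebalance=True)
for zone in (healthy, buggy):
    for _ in range(10):          # repeated maintenance-driven churn
        zone.write(10)
        zone.rebalance_release(10)

print(healthy.free)  # 100 -- space reclaimed after every cycle
print(buggy.free)    # 0 -- each cycle leaked 10 blocks; next write fails
```

Each individual cycle looks harmless; only the cumulative drain over concurrent maintenance events exhausts the zone, which is presumably why the bug was "previously unseen."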

Yes, that's a major snafu, we agree. And once the disks thought they'd run out of space, no amount of clever last-ditch attempts could reasonably compensate for the 211 minutes it took Google to determine what was going on and then fix it.

Google has pledged to do better in the future and says its “engineers are refining automated monitoring such that, if this issue were to come back, [they] would be alerted before users saw any impact.”

“We are also improving our automation to better coordinate different maintenance operations on the same zone to reduce the time it takes to revert such operations if necessary,” the company added.

As we noted above, Google is more candid than its competition when it discloses service outages and their causes. But it also appears to have more outages to disclose: we monitor the big three clouds' outage notifications, and Google announces issues more often than either AWS or Microsoft, both of which run larger clouds with more services.

The Alphabet subsidiary's new cloud chief, Diane Greene, has quite a job ahead of her. In an age when customers expect 'six nines or more' of availability, things need to get better.
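Some back-of-envelope arithmetic shows how far a 211-minute outage is from that bar. Six nines (99.9999% availability) allows roughly half a minute of downtime per year; this single incident, for affected users in that zone, would by itself cap the year below four nines:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget(nines):
    """Allowed downtime per year, in minutes, for N nines of availability."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5, 6):
    print(f"{n} nines: {downtime_budget(n):.2f} min/year allowed")

# Availability implied by one 211-minute outage in an otherwise perfect year:
print(1 - 211 / MINUTES_PER_YEAR)  # about 0.9996, i.e. short of four nines
```

The gap is stark: six nines permits about 0.53 minutes of downtime a year, and June 28 burned roughly 400 years' worth of that budget in one afternoon.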

Source: Google Cloud.
