Information Technology News.

Amazon Web Services suffers massive EC2 service outage in Sydney, Australia

June 9, 2016

Amazon Web Services has explained, as best it could, the extended service outage it suffered at its Sydney, Australia data center last weekend.

Amazon attributed the extended downtime to a combination of power issues and a latent bug in its instance management software.

The Sydney area recorded over 150 mm of rain last weekend. On Sunday alone, about 20 percent of the city got 93 mm of rain, plus winds gusting to 96 km/h.

Amazon says the bad weather was the trigger: “At 10:25 PM PDT on June 4th (mid-afternoon Sunday in Sydney) our utility provider suffered a loss of power at a regional substation as a result of severe weather in the area. This failure resulted in a total loss of utility power to multiple AWS facilities.”

AWS claims that it has two backup power systems, but both failed on the night in question. That is highly unusual, and for a company the size of Amazon it is unacceptable.

Amazon's explanation says that its backups employ a “diesel rotary uninterruptable power supply (DRUPS), which integrates a diesel generator and a mechanical UPS.”

“Under normal operation, the DRUPS uses utility power to spin a flywheel which stores energy. If utility power is interrupted, the DRUPS uses this stored energy to continue to provide power to the datacenter while the integrated generator is turned on to continue to provide power until utility power is restored.”

Last weekend, however, “a set of breakers responsible for isolating the DRUPS from utility power failed to open quickly enough.” That was bad because these breakers should “assure that the DRUPS reserve power is used to support the datacenter load during the transition to generator power.”

“Instead, the DRUPS system’s energy reserve quickly drained into the degraded power grid.”
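The sequence AWS describes can be illustrated with a toy energy budget. The sketch below is not AWS's model: the function name `rides_through` and every figure passed to it are made-up assumptions, chosen only to show why a breaker that opens late can exhaust the flywheel reserve before the generator takes over.

```python
def rides_through(reserve_kj, load_kw, grid_drain_kw, breaker_open_s, generator_start_s):
    """Return True if the flywheel reserve lasts until the generator is online.

    Toy model with hypothetical inputs: kW multiplied by seconds gives kJ.
    """
    # Until the breaker opens, the flywheel feeds the load AND the degraded grid.
    exposed_s = min(breaker_open_s, generator_start_s)
    used_kj = exposed_s * (load_kw + grid_drain_kw)
    # Once isolated, the flywheel only has to carry the datacenter load.
    used_kj += (generator_start_s - exposed_s) * load_kw
    return used_kj <= reserve_kj


# Hypothetical figures, not taken from AWS's report.
fast = rides_through(15_000, 1_000, 20_000, breaker_open_s=0.1, generator_start_s=10)
slow = rides_through(15_000, 1_000, 20_000, breaker_open_s=2.0, generator_start_s=10)
print(fast, slow)  # True False
```

With these invented numbers, a breaker that opens in a tenth of a second leaves plenty of reserve, while one that takes two seconds lets the grid drain the flywheel long before the generator is ready.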

That failure meant the diesel generators could not deliver any power to the data center, which promptly fell over.

AWS technicians got things running again at 11:46 PM PDT and by 1:00 AM PDT on the 5th, “about 80 percent of the impacted customer instances and volumes were back online and operational.”
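Since AWS reports every time in PDT, it helps to put the timeline on Sydney's clock. The following is a minimal sketch using fixed UTC offsets (PDT is UTC-7; Sydney observes AEST, UTC+10, in June). The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")    # Pacific Daylight Time, UTC-7
AEST = timezone(timedelta(hours=10), "AEST")  # Sydney standard time in June, UTC+10

# Times as quoted in the article, all in PDT.
power_lost = datetime(2016, 6, 4, 22, 25, tzinfo=PDT)
running_again = datetime(2016, 6, 4, 23, 46, tzinfo=PDT)
mostly_recovered = datetime(2016, 6, 5, 1, 0, tzinfo=PDT)

for label, moment in [
    ("power lost", power_lost),
    ("running again", running_again),
    ("about 80 percent back", mostly_recovered),
]:
    print(label, moment.astimezone(AEST).strftime("%Y-%m-%d %H:%M %Z"))

# Elapsed whole minutes since the loss of utility power.
minutes_until_running = (running_again - power_lost) // timedelta(minutes=1)
minutes_until_80_percent = (mostly_recovered - power_lost) // timedelta(minutes=1)
print(minutes_until_running, minutes_until_80_percent)  # 81 155
```

In Sydney time the power was lost at 15:25 on Sunday, June 5. Things were running again at 16:46, 81 minutes later, and about 80 percent of affected instances were back by 18:00.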

However, some workloads were slower to recover thanks to what AWS calls “DNS resolution failures as the internal DNS hosts for that Availability Zone were brought back online and handled the recovery load.”

But some instances didn't come back. AWS now says that was due to “A latent bug in our instance management software” that meant some instances needed to be restored manually. AWS hasn't explained the nature of that bug.

Other instances were hit by failed disks, which left their data unavailable until it was restored manually. If this sounds like a big mess, it's because it is.

As is always the case after such disasters, AWS has promised to harden the designs that failed.

“While we have experienced excellent operational performance from the power configuration used in this facility,” the mea culpa says, “it is apparent that we need to enhance this particular design to prevent similar power issues from affecting our power delivery infrastructure.”

More breakers are the order of the day, “to assure that we more quickly break connections to degraded utility power to allow our generators to activate before the UPS systems are depleted.”

Software improvements are also planned, including “changes that will assure our APIs are even more resilient to failure” so that those using multiple AWS regions can rely on failover between bit barns.
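For readers wondering what failing over between regions looks like from the customer side, here is a minimal, generic sketch. It does not use any AWS SDK: `call_with_failover` and `fake_request` are hypothetical helpers, and the outage is simulated. The only real names are the region identifiers `ap-southeast-2` (Sydney) and `us-west-2` (Oregon).

```python
def call_with_failover(regions, request):
    """Try request(region) for each region in order; return the first success."""
    errors = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:  # a real client would catch narrower error types
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")


def fake_request(region):
    """Stand-in for a real API call; pretends the Sydney region is down."""
    if region == "ap-southeast-2":
        raise ConnectionError("simulated outage")
    return f"served from {region}"


print(call_with_failover(["ap-southeast-2", "us-west-2"], fake_request))
# ('us-west-2', 'served from us-west-2')
```

A scheme like this only helps if the APIs in the healthy region keep answering while another region is in trouble, which is the resilience AWS says it intends to improve.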

Those changes should land in the Sydney region sometime in July, AWS claims. The company is far from alone in suffering physical or software problems with its cloud. Salesforce has also had problems with circuit breakers.

Google, for its part, has broken its own cloud with a bug and, in a separate incident, lost data after a lightning strike.

Source: Amazon Web Services.

       © IT Direction. All rights reserved.