Human error at cloud operator Joyent causes system crash
May 28, 2014
Cloud systems operator Joyent went through a catastrophic failure late yesterday when an absent-minded administrator
brought down an entire data center's computing assets.
The cloud services provider began reporting "transient availability issues" for its US-East-1 data center at
around 6:30 PM EST.
"Due to an internal operator error, all computing nodes in our US-East-1 data center were simultaneously rebooted,"
the company said.
"Some computing nodes are already back up, but due to the very high loads on the control plane, this is taking some
time to reboot the whole system. We are dedicating all operational and engineering resources to getting this issue
resolved, and will be providing a full report on this failure once every computing node and customer virtual machine
is back online and operational," the company added.
Some of the issues were fixed an hour or so later. A datacenter-wide forced reboot of all servers is just about
the worst thing that can happen to a provider, aside from the deletion of customer data or multiple data centers going
down.
"While the immediate cause was operator error, there are broader systemic issues that allowed a 'fat finger' to
take down a datacenter," explained Joyent's chief technology officer Bryan Cantrill.
"As soon as we reasonably can, we will analyze how this was architecturally possible, what exactly happened, how
the system recovered, and what improvements we will be making to both the software and to operational procedures to
assure that this doesn't happen again in the future," added Cantrill.
Joyent has service-level agreements in place that will compensate customers for downtime. In going through such a
stomach-churning fault, Joyent has joined an illustrious group of service providers that includes Rackspace, Microsoft,
Google, and Amazon, which have all had similarly catastrophic failures.
"Anything that allows you to administer many servers and VMs will allow you to do this," Cantrill added. "There
was a silver lining here in the sense that it was an opportunity to see how the system behaved. There are lots of ways
it could have been much worse."
"The system admin that made the error is mortified, there is nothing we could do or say for that operator that
is going to make it any worse, frankly," Cantrill said.
The goal for the company is to learn from the problem and get better. "You don't teach dolphins with a shock collar," Cantrill
explained. What will happen to that system admin is anybody's guess for now.
In other IT news
Facebook's wish that its Open Compute Project (OCP) could bring hyperscale-style innovation
to the internet community is somewhat bearing fruit, with an Australian company revealing a
range of converged infrastructure and virtual SAN products using its server designs.
The company in question, Infrx, is a very small business with just four people. But that
hasn't stopped it from working with Facebook, striking up a relationship with server makers Quanta
and Wiwynn and releasing a range of products.
They include a pre-configured SAN based on VMware's vSAN, plus stack-in-a-box rigs running
either Hyper-V, OpenStack or Hadoop, with Cumulus in the background handling software-defined
networking.
The products have been designed in close collaboration with software vendors: Whithouse
said senior VMware staff assisted with the design of the vSAN equipment while Infrx's Metacloud
offering is based on templates used to deploy the stack at Disney and Australian telco Telstra.
Founder Mark Whithouse says that users in Australia know of OCP, appreciate the low acquisition
and operating costs it offers and feel it represents a chance to improve their operations.
Price is, of course, a big factor. Whitehouse is a veteran of a few enterprise
storage vendors and says in his experience, Australian companies pay $1.98 per gigabyte. Infrx
can deliver at 30 cents a gigabyte, he claims.
Whithouse says he expects a couple of sales in the next week, although there have been none so far.
The prospects operate at substantial, but not hyperscale, size, reflecting Infrx's belief that OCP equipment can
make it into smaller data centres.
Perhaps as interesting as Infrx's offerings is Whitehouse's trip to visit Facebook to research
On that trip he says he saw an assembly facility where Intel personnel told him they were
installing 1,000 CPU sockets each week to feed Facebook's server farms.
Infrx is selling in Australia and New Zealand for now, but Whitehouse says the company's
links in the OCP community mean sales beyond the South Pacific may be possible.
The company is currently financed from the founders' pockets and while discussions with
investors are welcome, they're not being actively pursued as Whitehouse and his colleagues
feel that OCP-based infrastructure's main attraction is price.
If investors become involved, he fears they'll force higher prices on the company and destroy
the advantages OCP confers.
Infrx is not alone in offering stack-in-a-box products: NetApp and Cisco's FlexPod, Oracle's engineered
systems and VCE all have similar products.
The likes of Scale Computing do likewise with a 'white (no name) server' offering. We'll keep you posted
on this and other stories.
In other IT news
According to various reports we've seen in the blogosphere this morning, IBM will
end its contract agreement with NetApp.
Citing an internal memo reviewed by Bloomberg, the newswire says IBM has simply decided to
offer enterprise customers its own solutions rather than continuing to resell the N-series
network attached storage devices it gets from NetApp.
IBM's data storage sales aren't exactly that high, so it makes sense for the company to
concentrate on shifting its own infrastructure and taking as much profit margin as it can rather
than outsource with NetApp.
That its own V3500, Storwize V5000 and Storwize V7000 Unified are reasonable replacements for
the N3000 Express, N6000 and N7000 it gets from NetApp means that the decision can't have been
all that difficult in the first place.
Understandably, that doesn't make the decision any easier for NetApp, which draws about two
percent of its revenue and more than a little credibility from its IBM alliance.
With its balance sheet challenged in several ways, losing some easy revenue is perhaps the
last thing it needs.
It's also worth reflecting on what IBM's move says about the NAS market in general. The
likes of Dropbox for business offer NAS-like functions (as end-users perceive them) without all
the hassle of maintaining a device.
Naturally, such services aren't going to become less sophisticated any time soon, so they pose
a real threat to NAS sales among customers who only need file access.
To be sure, SaaS (Software-as-a-Service) poses a parallel threat by removing the need
for data storage solutions capable of serving the transactional needs of on-premises applications.
In fact, those new and emerging technologies threaten both IBM and NetApp at the same time.
IBM at least has a cloud offering that it can use to compensate.
But NetApp looks instead like less of a good catch after years of acquisition speculation. Only time
will tell how this will pan out in the next year or two.
In other IT news
Workers at Microsoft Research (MSR) have implemented a new method to automatically check
code for compliance with privacy laws, and Microsoft claims that it's simple to use.
Legalease is a language for specifying restrictions on how data is handled. One of the main drivers behind
its development was that software developers and those setting companies' privacy policies
don't share a common language.
As an example, MSR says that more than 20 percent of the code in its Bing search engine changes
on a daily basis, with changes made by thousands of programmers.
Even some small changes in code might affect how data is used or who views it, potentially
violating company, government or regulatory privacy policies.
Keeping tabs on changes in very large systems, like the Bing search engine, using manual audits
is difficult and very time consuming.
According to MSR, automated testing is the best way to verify compliance with privacy rules
and laws on the massive scale demanded in environments like Bing.
like the U.S. Health Insurance Portability and Accountability Act (HIPAA).
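To make the idea of an automated compliance check concrete, here is a minimal sketch in Python. It is not Legalease syntax and not Microsoft's checker: the DenyRule and DataFlow types, the find_violations function, and the job and data-type names are all hypothetical, chosen only to show how a machine-readable policy can be tested against a list of declared data uses.

```python
from dataclasses import dataclass

# Hypothetical illustration only: NOT Legalease syntax or Microsoft's tooling.

@dataclass(frozen=True)
class DenyRule:
    data_type: str   # kind of data the rule restricts, e.g. "ip_address"
    purpose: str     # use that is forbidden for that data, e.g. "advertising"

@dataclass(frozen=True)
class DataFlow:
    job: str         # name of the code or job that touches the data
    data_type: str
    purpose: str

def find_violations(flows, rules):
    """Return every flow whose (data_type, purpose) pair is denied by a rule."""
    denied = {(r.data_type, r.purpose) for r in rules}
    return [f for f in flows if (f.data_type, f.purpose) in denied]

# Made-up policy: IP addresses must not be used for advertising.
policy = [DenyRule("ip_address", "advertising")]

# Made-up data flows, as an annotation step might report them.
flows = [
    DataFlow("abuse_detection", "ip_address", "security"),
    DataFlow("ad_targeting", "ip_address", "advertising"),
]

violations = find_violations(flows, policy)
for v in violations:
    print(f"policy violation: job {v.job} uses {v.data_type} for {v.purpose}")
```

Because the check is just a function over data, it could in principle be rerun whenever the code changes, which is the kind of task that manual audits struggle to keep up with.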
Grok, meanwhile, annotates existing code using a system that cross-references information
from different sources, based on varying levels of confidence.
According to Microsoft, pattern-matching to column names across a database results in a
low-confidence score, while annotations made manually by developers are deemed to be more
trustworthy and thus get a high-confidence score.
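The confidence idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not Grok itself: the Confidence levels, the NAME_PATTERNS table, the annotate function and the column names are all hypothetical. It only shows a label guessed from a column name being recorded with low confidence, and a developer-supplied label taking precedence with high confidence.

```python
import re
from enum import IntEnum

# Hypothetical illustration only: NOT Grok's actual design or API.

class Confidence(IntEnum):
    LOW = 1    # label guessed by pattern-matching on a column name
    HIGH = 2   # label supplied manually by a developer

# Made-up patterns mapping column-name fragments to data labels.
NAME_PATTERNS = [
    (re.compile(r"(^|_)ip(_|$)", re.IGNORECASE), "ip_address"),
    (re.compile(r"query", re.IGNORECASE), "search_query"),
]

def guess_from_name(column):
    """Return a low-confidence label guessed from the column name, or None."""
    for pattern, label in NAME_PATTERNS:
        if pattern.search(column):
            return (label, Confidence.LOW)
    return None

def annotate(columns, manual_labels):
    """Label each column, preferring manual labels over name-based guesses."""
    result = {}
    for column in columns:
        if column in manual_labels:
            result[column] = (manual_labels[column], Confidence.HIGH)
        else:
            guess = guess_from_name(column)
            if guess is not None:
                result[column] = guess
    return result

labels = annotate(
    ["client_ip", "search_query", "ts"],
    manual_labels={"ts": "timestamp"},
)
for column, (label, confidence) in labels.items():
    print(f"{column}: {label} ({confidence.name})")
```

Columns that match no pattern and have no manual label are simply left unlabeled in this sketch; a real system would need a policy for those.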
MSR says it developed Grok for use on Bing but found writing suitable policies very difficult,
and this was what led to Legalease. Both were tested on Bing and are now running on the data analytics
MSR presented Legalease and Grok at the 35th IEEE Symposium on Security and Privacy in San Jose, California.
Source: Joyent Inc.