Information Technology News.

Learning from IT failures and preventing them from recurring

March 8, 2016

At the QCon Developer Conference this week in London, R&D engineering manager Gavin Stevenson told conference participants that they should learn from their previous IT failures and take steps to prevent them from recurring.

QCon is a vendor-neutral annual conference focused on large-scale software development and architecture, and is relatively hype-free when compared to similar events.

Stevenson's main message was that examining how an application fails under stress is more illuminating than simply observing it while it works normally.

When your software fails, it exposes the limitations of the overall system. That said, his team works in R&D rather than on production systems, where the real goal is to avoid failure at all costs. Hardware limits are one thing, but software has limits too, asserts Stevenson.

He works for William Hill, a Java-centric organization. However, its next-generation system under development is written in Erlang, which is designed for concurrency.

"The syntax is simple," said Stevenson, "and the supervisor hierarchy makes it really nice to work with."

Supervisors, part of Erlang's OTP (Open Telecom Platform) library, manage child processes and restart them when necessary, which adds resilience, he says.
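The supervisor idea can be illustrated with a minimal sketch. It is written in Python rather than Erlang and is not William Hill's code: the Supervisor class, the make_flaky_worker helper and the max_restarts limit are all hypothetical names chosen for illustration. Real OTP supervisors run children as separate processes and offer restart strategies and restart-intensity limits that this sketch does not model.

```python
from typing import Callable


class Supervisor:
    """Runs a child task and restarts it when it crashes (illustrative only)."""

    def __init__(self, child: Callable[[], str], max_restarts: int = 3) -> None:
        self.child = child
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self) -> str:
        while True:
            try:
                return self.child()
            except Exception:
                # Give up and re-raise once the restart budget is used up.
                if self.restarts >= self.max_restarts:
                    raise
                self.restarts += 1


def make_flaky_worker(failures: int) -> Callable[[], str]:
    """Hypothetical worker that crashes `failures` times before succeeding."""
    remaining = {"n": failures}

    def worker() -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("worker crashed")
        return "bet settled"

    return worker


supervisor = Supervisor(make_flaky_worker(failures=2), max_restarts=3)
result = supervisor.run()
print(result, "after", supervisor.restarts, "restarts")
```

The point of the pattern is that the worker contains no recovery logic of its own; it simply crashes, and the supervisor decides whether to restart it or escalate the failure.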

Stevenson's team decided to use an in-memory database for added performance. They tested the system by using a logfile of all the bets placed for last year's Grand National, over 6.2 million, and replaying them as fast as possible.
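A replay test of this kind might look roughly like the following Python sketch. The log format (one "bet_id,stake" pair per line), the replay and place_bet functions and the in-memory ledger are assumptions made for illustration; the article does not describe the team's actual log format or test harness.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def replay(lines: Iterable[str],
           place_bet: Callable[[str, float], None]) -> Tuple[int, float]:
    """Feed every logged bet to place_bet with no pacing; return (count, seconds)."""
    start = time.perf_counter()
    count = 0
    for line in lines:
        bet_id, stake = line.strip().split(",")
        place_bet(bet_id, float(stake))
        count += 1
    return count, time.perf_counter() - start


# Stand-in for the system under test: a plain in-memory table of stakes.
ledger: Dict[str, float] = {}


def place_bet(bet_id: str, stake: float) -> None:
    ledger[bet_id] = stake


# Stand-in for the real logfile: 1,000 synthetic bets in the assumed format.
log_lines = [f"bet-{i},{(i % 10) + 1}.0" for i in range(1000)]

count, elapsed = replay(log_lines, place_bet)
print(f"replayed {count} bets in {elapsed:.4f}s")
```

Replaying without any pacing is deliberate: the aim is to push the system past its normal load until something breaks, not to reproduce the original timing of the bets.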

"Our application failed miserably, which was brilliant," he said. There was "massive contention" in the database and excessive memory consumption, over 50 GB!

A redesign followed, using sharding (a technique for partitioning the data), load-balanced supervisors, distributed logging with Apache Kafka and multiple betting engines. The new design avoids creating a new Erlang process for every bet.
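Sharding can be sketched in a few lines of Python. This is a generic hash-based scheme with hypothetical names (ShardedStore, shard_for), not the partitioning William Hill used, which the talk summary does not detail. Each key is hashed to pick one of several independent tables, so writes for different keys tend to land on different shards instead of contending for a single structure.

```python
import zlib
from typing import Dict, List


class ShardedStore:
    """Key-value store split across several independent in-memory shards."""

    def __init__(self, shard_count: int) -> None:
        self.shards: List[Dict[str, float]] = [{} for _ in range(shard_count)]

    def shard_for(self, key: str) -> int:
        # crc32 gives the same result on every run, unlike Python's
        # built-in hash() for strings, which is randomized per process.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def put(self, key: str, value: float) -> None:
        self.shards[self.shard_for(key)][key] = value

    def get(self, key: str) -> float:
        return self.shards[self.shard_for(key)][key]


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.put(f"bet-{i}", float(i))

print([len(shard) for shard in store.shards])  # bets held by each shard
print(store.get("bet-42"))
```

A simple modulo scheme like this reshuffles most keys when the shard count changes; systems that need to add shards while running usually reach for something like consistent hashing instead.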

All those elements resulted in a resilient, smooth and scalable system that could process 6 million bets in about twenty minutes.
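To put that figure in perspective, a quick back-of-the-envelope calculation, taking the rounded numbers above at face value:

```python
bets = 6_000_000       # "6 million bets"
seconds = 20 * 60      # "about twenty minutes"

bets_per_second = bets / seconds
print(bets_per_second)  # 5000.0
```

That is a sustained rate of roughly 5,000 bets per second, bearing in mind that both inputs are approximations.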

Stevenson's team also relies on Docker containers for deployment. "Everything we do in R&D, it's with Docker," he said, though they have struggled with container load-balancing and orchestration in the recent past.

"There isn't a brilliant solution," he added, though they are looking at Docker Swarm, a tool for clustering multiple Docker engines.

Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools and system libraries, basically anything you can install on a server.

The aim is for the software to run in the same manner regardless of the environment it is operating in.

"It's a reactive microservice-based architecture," said Stevenson. "Making your application fail is a handy tool for application development, but only one small piece in the wider task of designing resilient and well-performing systems," he added.

At an open space discussion that followed Stevenson's presentation, however, William Hill's story seemed remote from the reality facing many businesses.

One attendee, in the financial services industry, lamented the many dependencies in the system he managed, any one of which could stop things from working altogether.

The core issue was a legacy back-end system including IBM's WebSphere MQ, SOAP web services and numerous JDBC (Java Database Connectivity) calls.

"Incredibly, it's over 30 long years of legacy," he said. "When will we get the budget to fix it? Not in my lifetime," he added.

Nor is today's rush towards microservices architecture a complete solution. Each microservice is a dependency in and of itself, and what happens when one breaks? "You have to dig into why it doesn't work anymore, how do you react quickly?" asked an attendee.

Even if you think you know how it should be done, implementing today's best practice in the real IT world is a huge challenge. Regrettably, Stevenson didn't answer that query.

Source: The 2016 QCon Developer Conference, London U.K.

All logos, trade marks or service marks on this site are the property of their respective owners.

       © IT Direction. All rights reserved.