Hadoop community works to bring Docker into the fold
May 3, 2014
The Hadoop community said late yesterday that it is working on a set of new patches to bring Docker into the data management system, and independent benchmarks already show the container technology running considerably faster than traditional server virtualization.
Docker is an open source Linux containerization technology that uses kernel features such as namespaces and cgroups (driven through the LXC toolset) to let a system administrator run multiple applications, each with its dependencies, in isolated sandboxes on the same underlying Linux operating system. That makes it an attractive alternative to server virtualization, which bundles a copy of the OS with each application.
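To make the isolation model concrete, here is a minimal sketch using the Docker SDK for Python (the SDK, image names, and commands are our own assumptions for illustration; the article itself includes no code). Each container gets its own filesystem and process namespaces while sharing the host kernel:

    # Minimal sketch: run two applications side by side in isolated
    # containers on one Linux host, each carrying its own dependencies.
    # Assumes a local Docker daemon and the "docker" Python package.
    import docker

    client = docker.from_env()  # connect to the local Docker daemon

    for image, cmd in [("python:3", "python --version"),
                       ("openjdk:8", "java -version")]:
        # remove=True discards the container after it exits;
        # stderr=True captures tools that print version info to stderr.
        output = client.containers.run(image, cmd, remove=True, stderr=True)
        print(image, "->", output.decode().strip())

No guest operating system is booted at any point, which is the crux of the density argument made below.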
In a set of benchmarks that an IBM engineer released on Thursday, Big Blue demonstrated that Docker containerization holds significant advantages over the KVM hypervisor from an overall performance perspective.
Alongside this, we also discovered some impressive work by the Hadoop community to bring the technology into its eponymous data analysis and management engine.
This adds weight to the idea that Docker could become an eventual replacement for traditional server virtualization, granting businesses substantial benefits from an open source technology.
To start with, benchmarks conducted by IBM show that Docker has a number of performance advantages over the KVM hypervisor when
running on the open source cloud infrastructure tool OpenStack.
An informative post published by IBM's Boden Russell goes into further detail about the results. "From an OpenStack Cloudy operational time perspective (boot, reboot, delete, snapshot, etc.) docker LXC outperformed KVM ranging from 1.09x (delete) to 49x (reboot)," Russell wrote.
"Based on the compute node resource usage metrics during the serial VM packing test, Docker LXC CPU growth is approximately 26 times lower than KVM. On the surface, this indicates a 26x density potential increase from a CPU point of view using docker LXC vs a traditional hypervisor. Docker LXC memory growth is approximately 3 times lower than KVM. On the surface, this indicates a 3x density potential increase from a memory point of view using docker LXC vs a traditional hypervisor," he added.
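For readers who want to sanity-check the density claim, the arithmetic is straightforward (the capacity and per-instance figures below are invented for illustration; only the 26x and 3x ratios come from Russell's post):

    # Back-of-the-envelope density estimate. Only the 26x ratio is
    # from the IBM post; the capacity numbers are hypothetical.
    host_cpu_budget = 100.0              # arbitrary CPU capacity units
    kvm_cpu_cost = 2.6                   # assumed CPU growth per KVM guest
    docker_cpu_cost = kvm_cpu_cost / 26  # "approximately 26 times lower"

    print(int(host_cpu_budget // kvm_cpu_cost))     # ~38 KVM guests
    print(int(host_cpu_budget // docker_cpu_cost))  # ~1000 containers, i.e. 26x

The same reasoning applied to the 3x memory figure yields the 3x memory-side density potential Russell describes.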
Not only does Docker have desirable resource-usage characteristics, but the way it allows developers to package applications has attracted
attention from the open source Hadoop community.
Recently, we learned that developers are diligently working to add Docker support to a crucial component of Apache Hadoop 2.0 named YARN, with the goal of increasing the usefulness of both technologies.
YARN was introduced in version two of Apache Hadoop, and it lets the software run multiple types of applications within Hadoop rather than purely MapReduce batch jobs. Thanks to this, YARN is helping to transform Hadoop from a batch processing and storage system into a more general tool for manipulating and storing data.
By combining YARN with Docker, the community hopes to make it trivial for developers to package an application in a Docker container and then deploy it onto YARN as part of a larger Hadoop installation.
Altiscale, the company behind the code contributions that make this possible, was kind enough to answer some of our questions about why
this could be useful.
"As a company building Hadoop as a Service (HaaS) platform, we are particularly interested in YARN as it allows Hadoop to move
beyond map-reduce to a much more diverse variety of applications," explained the company's chief executive Raymie Stata.
"One of the key components of YARN that make this possible are containers. The existing YARN container implementation does not
adequately provide all the types of isolation required to address a scenario we are noticing with our larger customers-– multiple,
independent groups in the same organization with different software requirements."
By adding Docker support, Altiscale hopes to lower some of the barriers that stand between enterprise developers and greater utilization of Hadoop.
"For example, a common issue for users is software dependency management," Stata explained. "Docker provides an intriguing
approach to solving that problem by allowing users to upload prepackaged environments or images into repositories which can then
easily be downloaded and run in isolation".
"For instance, there are public repositories in the Docker community called Docker Registries which provide a variety of language
environments such as Java and Ruby. There is also support for private repositories where containers with more specialized environments
can be placed," he added.
Other members of the Hadoop community are keen on the addition of Docker as well. "Where Docker makes perfect sense for YARN is that we
can use Docker Images to fully describe the entire Unix filesystem image for any YARN container," explained Arun Murthy, a founder and
system architect at Hortonworks.
"In this manner, instead of forcing a user to deal with individual files or binaries as things stand today, we can allow the application
to package the entire Unix filesystem image it needs as a Docker image and then get perfect predictability from an environment perspective at
"This is where Docker has the most amount of interest to the YARN/Hadoop community, particularly for users packaging complex applications
which need their own version of Perl, Python, Java, Libc etc, that is hard to manage on YARN currently," he said.
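To illustrate Murthy's point about an application carrying its whole environment, here is a hedged sketch that bakes specific interpreter versions into an image (the Dockerfile contents and tag are hypothetical, and the build again goes through the Docker SDK for Python):

    import io
    import docker

    # Hypothetical Dockerfile: the application ships its own Perl and
    # Python, so worker nodes need no matching packages preinstalled.
    dockerfile = b"""
    FROM ubuntu:14.04
    RUN apt-get update && apt-get install -y perl python2.7
    CMD ["python2.7", "--version"]
    """

    client = docker.from_env()
    image, _ = client.images.build(fileobj=io.BytesIO(dockerfile),
                                   tag="yarn-app:demo")
    print("Built", image.tags)  # the image now fully describes the runtime

Once such an image is published to a registry, every node that runs the container sees exactly the same filesystem, which is the "perfect predictability" Murthy refers to.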
Docker support in YARN looks like a potentially useful addition, and it is another example of the enthusiasm with which Silicon Valley has adopted the young open source technology.
This follows Red Hat announcing broad support for Docker in its eponymous Linux distribution and launching a project named "Atomic" built around the technology. Amazon also recently added Docker support to its "Elastic Beanstalk" platform-as-a-service cloud.
In other IT and open source news
System admins in Microsoft's cloud data centre for Western Europe have spent the morning battling severe issues with the server equipment
that supports the company's main cloud service.
Problems with the core Compute and Storage components were first reported at 9:39 am, according to the Windows Azure Status Dashboard, when Microsoft said it had received an alert for SQL Databases, Compute, and Storage in West Europe.
Microsoft later admitted that the issues meant that customers could "experience issues accessing services". It described the
Compute problems as a "Partial Service Interruption Limited Impact", which is a Microsoft euphemism for a fraction of its technology
not working correctly.
Storage was termed a "Partial Service Interruption" without the limiting qualifier, so it's likely that more customers were hit by the storage problems.
As Compute, Storage, and SQL Databases are fundamental building blocks for any cloud infrastructure, this is a severe problem.
As of 2:54 pm Microsoft said it had "partially restored the services and continue to see improvements to Storage availability".
It indicated that the Compute services were mostly fixed, noting: "We have confirmed recovery for Compute availability. A very small subset of IaaS Virtual Machines may be affected. We are validating the restoration steps."
In other IT news
The University of Toronto has criticised Canada's Internet service providers for unnecessarily routing user traffic via the United States, even when both the origin and destination of the traffic are within Canada.
In a study that mirrors European concerns about why traffic should traverse the U.S. when it doesn't need to, the Canadian researchers blame an unwillingness to peer domestically for sending traffic within reach of the NSA.
The University of Toronto's Andrew Clement and Jonathan Obar have put together the report along with an interactive map, in which they rate
Canadian ISPs on various transparency characteristics.
The ratings, the report says, are based on how easily users can find information including an ISP's compliance with data privacy legislation, how it reports data access requests, how well it defines personal information, and where user data is stored.
Against the 10 criteria used in the assessment, nobody scored highly: the best was Teksavvy, with just 3.5 stars out of a potential ten, followed by Primus on just three stars.
None of the ISPs tested provide transparency reporting, and the researchers say none of the 20 carriers they examined are in full
compliance with Canada's PIPEDA (Personal Information Protection and Electronic Documents Act) privacy law.
About routing, the report states: "Fewer than half of the ISP privacy policies refer to the location and jurisdiction for the information they store. Only one (Hurricane) gives an indication of where it routes customer data and none make explicit that they may route data via the US where it is subject to NSA surveillance."
Source: The Hadoop Development Team.