Cloudera wants to help the Hadoop community

June 30, 2014

Cloudera has rallied four major IT companies behind an initiative to link two open source projects for the good of the Hadoop community.

The newly proposed partnership among IBM, Intel, Databricks, MapR and Cloudera to port Apache Hive onto Apache Spark is due to be announced this week at the Spark Summit in San Francisco.

We heard rumors of this last week after stumbling across a Cloudera proposal to run Hive on Spark.

For those not familiar with the long list of project names in the Hadoop world, Spark is a general-purpose cluster computing system originally developed at the University of California, Berkeley, that can process data stored in the Hadoop Distributed File System (HDFS).

It can be used as an alternative data processor to Hadoop MapReduce and is said to be about 100 times faster than MapReduce when running in memory or ten times faster when running on disk.

Hive, meanwhile, is data warehouse software that uses a SQL-like language to query data stored in Hadoop.

Both projects are important, with Spark seen by many as a potential successor to MapReduce and Hive viewed as a likely candidate for accomplishing SQL-on-Hadoop work.

By porting Hive to Spark, Cloudera is hoping to force some consolidation in the Hadoop ecosystem, and in doing so is placing less emphasis on one of its own projects: Impala.

Justin Erickson, Cloudera's director of product management, said the company has decided to push Hive because it wants to "go and combine the forces of the Spark community with the Hive community to make batch processing in Hadoop faster."

"Overall, Hive is the standard choice for doing batch processing jobs on Hadoop right now," said Matt Brandwein, the company's head of product marketing.

"We want to cut the fragmentation in the community. People are getting a bit aware of the fact there are so many options for so many different objects. Spark is the successor," he added.

The move has big ramifications for the Hadoop ecosystem and for Cloudera itself. In the recent past, Cloudera has been skeptical of the value of Hive.

In a blog post late last year, Mike Olson, the company's chief strategy officer, wrote: "Decades of experience had taught people to expect real-time responses from their databases. Hive, built on MapReduce, simply couldn't deliver, and that was an issue for us."

To address the perceived shortcomings of Hive, Cloudera built its own software, Impala. With the new partnership among Cloudera, IBM, MapR, Databricks and Intel, however, Cloudera appears to have warmed to Hive and will use the technology as its main way of engaging with the wider Hadoop community, while continuing to develop Impala as a way to generate some revenue.

Another little complication in this story is that there already is a Hive-on-Spark project called Shark. But Cloudera feels that Shark has diverged too much from the mainstream Hive.

"To be sure, Shark took an approach of replacing several key components of Hive, including the query planner and other elements of Hive," Cloudera said.

The result, according to Cloudera, was that maintaining compatibility with Hive became very difficult, because changes to Hive cannot be transparently back-ported to Shark.

"With the Hive-on-Spark approach, we are making a much more limited change to only the physical query planner, which means that the Hive community can make changes and add new functionality to Hive and have this transparently work with either Spark or MapReduce or Tez," the company said.

As a result, Cloudera expects the maintenance burden for Hive on Spark to be much lower, and the work to be more deeply integrated with the core Hive community.
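
The idea of changing only the physical planner can be illustrated with a minimal Python sketch. It is not Hive code: the `LogicalPlan` class, the three planner functions, and the plan strings are hypothetical names invented for this example. It only shows how one shared, backend-independent plan can be handed to any of several execution backends.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class LogicalPlan:
    """Backend-independent description of a query (hypothetical, for illustration)."""
    table: str
    columns: Tuple[str, ...]


def _describe(engine: str, plan: LogicalPlan) -> str:
    # Stand-in for real physical planning: just render the plan as text.
    return f"{engine}: scan {plan.table} -> project {', '.join(plan.columns)}"


def plan_for_mapreduce(plan: LogicalPlan) -> str:
    return _describe("mapreduce", plan)


def plan_for_tez(plan: LogicalPlan) -> str:
    return _describe("tez", plan)


def plan_for_spark(plan: LogicalPlan) -> str:
    return _describe("spark", plan)


# Only this table knows about the backends; everything upstream of it
# (parsing, optimization, new language features) is shared by all three.
PHYSICAL_PLANNERS: Dict[str, Callable[[LogicalPlan], str]] = {
    "mapreduce": plan_for_mapreduce,
    "tez": plan_for_tez,
    "spark": plan_for_spark,
}


def compile_query(plan: LogicalPlan, engine: str) -> str:
    """Turn the shared logical plan into a plan for the chosen backend."""
    try:
        planner = PHYSICAL_PLANNERS[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine}") from None
    return planner(plan)


if __name__ == "__main__":
    query = LogicalPlan(table="logs", columns=("user", "ts"))
    for engine in PHYSICAL_PLANNERS:
        print(compile_query(query, engine))
```

In this toy version, adding a field to `LogicalPlan` is a single change that every backend receives, which is the maintenance argument Cloudera makes against replacing larger parts of Hive.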

Speaking of Tez, Cloudera's move also puts pressure on Hortonworks, which helped develop the competing data-processing framework.

But Cloudera says Spark, like Tez, is merely an option. As the company explains in a FAQ document, "It is not a goal for the Spark execution backend to replace Tez or MapReduce. It is healthy for the Hive project for multiple backends to coexist. Users have a choice whether to use Tez, Spark or MapReduce. Each has different strengths depending on the use case. And the success of Hive does not completely depend on the success of either Tez or Spark."

When contacted for comment, Hortonworks said the decision to pour development resources into Hive on Spark is broadly a good thing.

"It's an admission that the open source community driven model is the right one," said Shaun Connolly, the company's vice president of strategy.

With this new initiative, Cloudera can develop a better understanding of the future direction of the software and more carefully hone its business to reap the benefits of its growing user base.

In other IT news

Research scientists at MIT say they have built a new 36-core processor that uses an internal networking system to get maximum data throughput from all the processing cores.

Unveiled at the International Symposium on Computer Architecture, the chip design gets around some common problems with multicore processors, notably bus sharing between cores and maintaining cache coherence.

Most conventional designs use a single bus to connect the cores, so when two cores communicate they typically occupy the entire bus and leave the other cores waiting.

The MIT design borrows a concept from the internet: each core has its own router and uses it to share data with its neighbors.

"Typically, you can reach your neighbors really quickly," said Bhavya Daya, an MIT graduate student in electrical engineering and computer science, and the first author on the new paper.

"You can also have multiple paths to your destination. So if you're going way across, rather than having one congested path, you could have multiple ones," Daya added.
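
The point about quick neighbors and multiple paths can be sketched in a few lines of Python. This is an illustration only: the 6x6 grid layout, the coordinate scheme, and the function names are assumptions made for the example, not details taken from the MIT design. It counts the router hops between two cores and the number of distinct shortest routes between them.

```python
from math import comb
from typing import List, Tuple

SIDE = 6  # assumed layout: 36 cores arranged as a 6x6 grid

Core = Tuple[int, int]  # (row, column)


def neighbors(core: Core) -> List[Core]:
    """Cores one router hop away in the assumed grid."""
    row, col = core
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < SIDE and 0 <= c < SIDE]


def hops(src: Core, dst: Core) -> int:
    """Minimum number of router hops: the Manhattan distance between two cores."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def shortest_routes(src: Core, dst: Core) -> int:
    """Number of distinct minimum-length routes between two cores."""
    rows = abs(src[0] - dst[0])
    cols = abs(src[1] - dst[1])
    # Choose which of the (rows + cols) hops are the vertical ones.
    return comb(rows + cols, rows)


if __name__ == "__main__":
    # Adjacent cores: one hop, one route.
    print(hops((0, 0), (0, 1)), shortest_routes((0, 0), (0, 1)))
    # Opposite corners: ten hops, but many equally short routes.
    print(hops((0, 0), (5, 5)), shortest_routes((0, 0), (5, 5)))
```

Under these assumptions, traffic between opposite corners can be spread over many equally short routes, whereas on a single shared bus every transfer competes for the same link.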

The network is also used to move data between the cores' caches without having to shift it too far, potentially speeding up the system even further.

"Their contribution is an interesting one: They're saying, 'Let's get rid of a lot of the complexity that's already in existing networks. That will create more avenues for communication, and our clever communication protocol will sort out all the details'," said Todd Austin, a professor of electrical engineering and computer science at the University of Michigan.

"It's a simpler approach but a much faster approach at the same time. It's a really clever idea of doing things."

The blueprints for the new chip design are not being released yet, since the team first wants to develop an operating system capable of using it to its full advantage.

The team is now adapting a version of Linux to use the new chip before releasing the designs.

In other IT news

The president of the Open Data Centre Alliance (ODCA) has given what some would call smart advice to CIOs contemplating how to migrate their legacy platforms to the cloud: forget it and just dump your old code!

However, Correy Voo, whose day job is infrastructure CTO at UBS, added that this was likely a temporary dilemma, as the coming wave of technology bosses, who’ve grown up with theoretically unlimited resources at their fingertips, will not even contemplate such a move.

“Don't try to lift a legacy platform and make it a cloud process. There are some things that just will not fit a cloud environment,” Voo told an audience at the Cloud World Forum event in London last week.

An application developed to deal with the constraints of the 80s and 90s, or even earlier, would never take full advantage of the new platform, he argued.

Banks and other financial institutions, many of which are members of the ODCA, will have stacks of such apps languishing on mainframes or other aging platforms.

“I’m not saying it’s impossible. It’s certainly possible if you want to throw a lot of time and effort at it. But if I want to buy something brand new like a $3,000 TV I don’t want to play low definition material on it.”

“In my view, the users in some ways are being cheated because they don’t get the full capability of that investment.” He cited mobility as one example of this mismatch.

The discussion about whether to migrate applications designed to cope with the hardware and infrastructure constraints of the 80s and 90s has a parallel in the differing attitudes towards infrastructure between younger and older tech professionals.

“That discussion is coming to a head in some respects,” Voo added. “Volume wise we’re now at the stage where there’s sufficient volume of new people in the industry who think differently to the old people.”

“It’ll be very fractious early on and there’ll be a lightbulb moment where everybody will realise that’s probably the right way of doing it.”

“When it becomes easy to consume cloud technology without any of the complication, any of the cost implications, when that happens it’s a no-brainer: you simply move on to the next issue.”

As for what that next issue is, Voo said it will be about how we manage and access data.

“Just because we have the ability to use a piece of data, doesn’t necessarily mean we should ethically, so I think the IT industry and business in general have a number of questions to resolve in relation to data growth and the type of data that’s out there.”

Source: Cloudera Inc.
