Archive for the ‘agpl’ Category

HPCC vs Hadoop at a glance

June 18th, 2011
Yesterday I noticed a tweet by Andrei Savu. It prompted me to read the related GigaOM article and then check out the HPCC Systems website.

If you're too lazy to read the article or visit that website:
HPCC (High Performance Computing Cluster) is a massive parallel-processing computing platform that solves Big Data problems. The platform is now Open Source!


HPCC Systems compares itself to Hadoop, which I think is completely justified in terms of functionality. Its product originated as a homegrown solution at LexisNexis Risk Solutions, allowing its customers (banks, insurance companies, law enforcement and federal government) to quickly analyze billions of records, and as such it has been in use for a decade or so. It is now open sourced, and I have already heard an announcement that Pentaho is its major Business Intelligence partner.

Based on the limited information, I made a quick analysis, which I emailed to the HPCC Systems CTO, Armando Escalante. My friend Jos van Dongen said it was a good analysis and told me I should post it. Now, I don't really have time to make a nice blog post out of it, but I figured it can't hurt to just repeat what I said in my emails. So here goes:

Just going by the documentation, I see two real unique selling points in HPCC Systems as compared to Hadoop:

  • Real-time query performance (as opposed to only analytic jobs). HPCC offers two different setups, labelled Thor and Roxie. Functionality-wise, Thor should be compared to a Map/Reduce cluster like Hadoop: it's good for doing fairly long-running analyses on large volumes of data. Roxie is a different beast, designed to offer fast data access and to support ad-hoc real-time queries.
  • Integrated toolset (as opposed to a hodgepodge of third-party tools). We're talking about an IDE, job monitoring, code repository, scheduler, configuration manager, and whatnot. These really look like big productivity boosters, which may make Big Data processing a lot more accessible to companies that don't have the kind of development teams required to work with Hadoop.

(there may be many more benefits, but these are just the ones I could clearly distill from the press release and the website)

Especially for Business Intelligence, Roxie may be a big thing. If real-time Big Data queries could be integrated with Business Intelligence OLAP and reporting tools, that would certainly be a major step. I can't disclose the details, but I have trustworthy information that integration with Pentaho's Analysis Engine, the Mondrian ROLAP engine, is underway and will be available as an Enterprise feature.

A few things that look different but which may not matter too much when looking at HPCC and Hadoop from a distance:
  • ECL, the "Enterprise Control Language", which is a declarative query language (as opposed to just Map/Reduce). This initially seems like a big difference, but Hadoop has tools like Pig, Sqoop and Hive. Now, it could be that ECL is vastly superior to these Hadoop tools, but my hunch is you'd have to be careful in how you position that. If you choose a head-on strategy in promoting ECL as opposed to Pig, then chances are that people will just spend their energy discovering the things that Pig can do and ECL cannot (I'm not sure if those features actually exist, but that is what Hadoop fanboys will look for). In addition, the Pig developers might simply clone the unique ECL features, and the leveling of that playing field will just be a matter of time. This does not mean you shouldn't promote ECL - on the contrary, if you feel it is a more productive language than Pig or any other Hadoop tool, then by all means let your customers and prospects know. Just be careful and avoid downplaying the Hadoop equivalents, because that strategy could backfire.

  • Windows support. It's really nice that HPCC Systems is available for Microsoft Windows, since that makes adoption a lot easier for Microsoft shops (and there are a lot of them). That said, customers that really have a big-data problem will solve it no matter what their internal software policies are. So they'd happily start running Hadoop on Linux if that solves their problems.
  • Maturity. On paper HPCC looks more mature than Hadoop. It's hard to tell how much that matters though, because Hadoop has all the momentum. People might choose Hadoop because they anticipate that the maturity will come thanks to the sheer number of developers committing to that platform.


The only areas I can think of where HPCC looks like it has a disadvantage compared to Hadoop are adoption rate and licensing. I hope these will prove not to be significant hurdles for HPCC, but I think they might be bigger problems than they seem. Especially the AGPL licensing seems problematic to me.

The AGPL is not well regarded by anyone I know, at least not in the open source world. The general idea seems to be that, even more than plain GPL3, it restricts how the software may be used. If the goal of open sourcing HPCC is to gain mindshare and a developer community (something that Hadoop has done and is doing extremely well), then a more permissive license is really the way to go.

If you look at products like MySQL but also Pentaho, they are both very strongly corporately led products. They have a good number of users, but few contributions from outside the company, and this is probably due to a combination of GPL licensing and the additional requirement for handing over the copyright of any contributions to the company. Hence these products don't really benefit from an open source development model (or at least not as much as they could). For these companies, open source may help initially to gain a lot of users, but the majority of those users just want a free ride: conversion rates to enterprise edition customers are quite low. It might be enough to make a decent buck, but eventually you'll hit a cap on how far you can grow. I'm not saying this is bad - you only need to grow as much as you have to - but it is something to be aware of.

Contrast this with Hadoop. It has a permissive Apache 2.0 license, and this results in many individuals but also companies contributing to the project. And there are still companies like Cloudera that manage to make a good living off of the services around their distribution of Hadoop. You don't lose the ability to develop add-ons with this model either - Apache 2.0 allows all that. The difference with GPL (and AGPL), of course, is that it also allows this to other users and companies. So the trick to staying on top in this model is to simply offer the best product (as opposed to being the sole holder of the copyright to the code).

Anyway - that is it for now - I hope this is helpful.


Cloud openness contemplated

April 15th, 2010

I caught some of the keynotes and discussion at the Linux Foundation Collaboration Summit today, and was particularly interested in the panel discussion on open source and cloud computing. While we are used to hearing and talking about how important open source software is to cloud computing (open source giving to cloud computing), moderator John Mark Walker posed the question of whether cloud computing gives back. The discussion also rightfully focused on openness in cloud computing, how open source might or might not translate to cloud openness, and the importance of data being open as well.

The discussion also centered on some issues regarding open standards and how open is open enough for cloud computing? It may depend on who you ask, but I tend to think that the flexibility, interoperability and portability advantages of open source software will dictate its continued use and true openness in the cloud.

However, this is not always the case. When we consider openness in the mobile market, we see that while open source software is going into more and more smartphones and mobile devices, by the time it gets into the product and into the hands of consumers, it ends up closed. This is not necessarily a violation of open source license, either in rule or in spirit, but rather the use, incorporation and reliance on open source alongside proprietary products, strategies and companies, typically under a permissive license. Much of it also has to do with the need, both perceived and real, for control of code in these devices among hardware, software, wireless carrier and other players with a stake.

Another interesting perspective of what open source means, or doesn’t mean, in terms of cloud computing, standards and interoperability comes from the Xen community’s Simon Crosby of Citrix.

One of the most interesting things to watch when considering whether cloud computing gives back to open source is the AGPLv3 license, which is viewed in different ways as both a burden and a boon to network-based, distributed development by various parties. We continue to see vendors, such as mobile software player Funambol, as strong supporters of AGPL while others, such as Google, continue their resistance to it.

The AGPL also came up again in the Linux Foundation Collaboration Summit panel, and while I don’t think the license currently serves as the answer to whether cloud computing gives back to open source, we do see some benefits to open source from cloud computing, in terms of code, projects and communities as well as the commercial vendors leveraging open source software. In terms of code, large users of open source software projects such as Linux, MySQL, Hadoop and Cassandra help to raise the profile and credibility of open source. Whether corporations or university campuses, these large users can also be among the most active community participants (driving features and shaking out bugs) and the most prolific code contributors (creating features and extensions and enlarging the ecosystem). In terms of commercial open source vendors, cloud computing can also mitigate the challenges of balancing and differentiating free community versions and separate paid versions. If the vendor is able to offer support, services or even extensions with the cloud version of its software, that version is easily separated from a community version that may be available for free, but not from the cloud.

Of course, there is more that cloud computing can do for open source, and there is much more that has to be done to ensure true openness in cloud computing, particularly when some existing and emerging de facto standards are anything but open. But for all that open source is to cloud computing, cloud computing seems to be returning the favor to some degree already.

