Archive for the ‘Hadoop’ Category

451 CAOS Links 2011.10.07

October 7th, 2011

OpenStack Foundation. New Pentaho CEO. And more.

# Rackspace announced its intention to form an independent OpenStack Foundation.

# HP has chosen Ubuntu as the lead host and guest operating system for its Public Cloud.

# Pentaho appointed Quentin Gallivan as its new CEO.

# Hortonworks continued the discussion about contributions to Apache Hadoop.

# Bob Bickel explained why CloudBees is not, itself, open source.

# Google announced the limited preview release of Google Cloud SQL.

# Eucalyptus Systems, Nebula and Virtual Bridges joined the Linux Foundation.

# Dave Neary discussed the different types of community in relation to the Tizen project.

# Akamai joined the OpenStack community.

# Daniel Abadi provided his perspective on Oracle’s NoSQL Database.

# One more thing…
Apple’s relationship with open source may be somewhat tenuous – Paul Rooney provides some background – but given the impact Steve Jobs has made on the industry as a whole it seems wrong not to mark his passing in some way. We’ll leave the words to the company he created.



Webinar: NoSQL, NewSQL, Hadoop and the future of Big Data management

October 6th, 2011

Join me for a webinar where I discuss how the recent changes and trends in big data management affect the enterprise. This event is sponsored by Red Rock and RockSolid.

Overview:

It is an exciting and interesting time to be involved in data. More influential change has occurred in database management in the last 18 months than in the last 18 years. New technologies such as NoSQL and Hadoop, and radical redesigns of existing technologies, like NewSQL, will dramatically change how we manage data moving forward.

These technologies bring with them possibilities both in terms of the scale of data retained and in how this data can be utilized as an information asset. The ability to leverage Big Data to drive deep insights will become a key competitive advantage for many organisations in the future.

Join Tony Bain as he takes us through the high level drivers for the changes in technology, how these are relevant to the enterprise, and an overview of the possibilities a Big Data strategy can start to unlock.

https://redrockevents.webex.com/redrockevents/onstage/g.php?t=a&d=869100422

 



451 CAOS Links 2011.09.23

September 23rd, 2011

Red Hat revenue up 28% in Q2. Funding for NoSQL vendors. And more.

# Red Hat reported net income of $40m in the second quarter on revenue up 28% to $281.3m.

# 10gen raised $20m in funding, and DataStax closed an $11m series B round while also releasing its DataStax Enterprise and Community products. Additionally, Neo Technology raised $10.6m in series A funding.

# Oracle announced the addition of new extended capabilities in MySQL Enterprise Edition. The move confirmed the adoption of the open core licensing strategy, and was both welcomed and derided.

# BonitaSoft announced an $11m series B funding round.

# Platfora raised $5.7m in series A funding to accelerate development of its BI and analytics platform for data stored in Hadoop.

# EMC launched its EMC Greenplum Modular Data Computing Appliance, which includes both the Greenplum Database and Greenplum HD (Hadoop), and introduced the Greenplum Analytics Workbench, a test bed cluster for integration testing Apache Hadoop.

# Oracle acquired GoAhead Software, which offers a commercial distribution of OpenSAF.

# Ingres changed its name to Actian and launched its Action Apps and Cloud Action Platform.

# Richard Stallman asked ‘Is Android really free software?’. Predictably enough the answer is ‘no’. Carlo Daffara called FUD.

# LexisNexis Risk Solutions’ HPCC Systems released the source code for its HPCC Systems platform, and introduced a covenant to keep contributed code open source for three years.

# OpenStack released Diablo, the fourth version of its open source cloud software.

# The PostgreSQL Global Development Group announced the release of PostgreSQL 9.1.

# VoltDB announced the general availability of VoltDB version 2.0.

# Samsung is reportedly planning to release its Bada mobile operating system under an open source license.

# Karmasphere updated its Karmasphere Analyst Big Data analytics product with new workflow capabilities for Apache Hadoop.

# The Open Virtualization Alliance now has more than 200 members.

# The Outercurve Foundation announced the acceptance of the GADS open source project into its Data, Language and System Interoperability Gallery.

# Openbravo announced that customer deployments of its ERP product on Amazon have increased over 187% in the last 12 months.

# The Apache Software Foundation confirmed Apache Whirr as a top-level project.

# Qt gained more independence from Nokia.

# SUSE Linux Enterprise Server has been selected for use with SAP HANA.

# Red Hat Enterprise Linux 6 was certified by SAP to run SAP business applications, and support was announced for SAP running on Red Hat Enterprise Linux on Amazon EC2.

# 10gen’s MongoDB was chosen by SAP as a core component of SAP’s platform-as-a-service (PaaS) offering.

# Puppet Labs announced Puppet Enterprise 2.0.

# Microsoft added Casio to its list of Linux-related patent agreement signees.

# Dries Buytaert explained why Acquia acquired Cyrve and GVS and addressed concern that Acquia is sucking up all the Drupal talent.

# Medsphere Systems announced the general availability of the enhanced OpenVista electronic health record (EHR) platform.

# Stormy Peters asked whether open source is excluding high context cultures.

# OpenIndiana’s fork of OpenSolaris added support for the Illumos kernel.

# Cenatic released the results of its research into public administration involvement in open source communities.

# Spring Roo is shifting to be 100% Apache licensed.

# VLC developers are looking for anyone who has contributed to libVLC so that they can approve the change in licence from GPLv2 to LGPLv2.

# Virtual Bridges joined OpenStack.

# Github now has over one million users.

# Splunk open sourced the code for docs.splunk.com.



What is the biggest challenge for Big Data?

September 9th, 2011

Often I think about challenges that organizations face with “Big Data”. While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data. It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

It may be surprising what I think are the near term challenges. Largely I don’t think these are purely technical. There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at a pace in line with the increase in data volume. Sure, there will continue to be lots of interesting innovation with technology, but when organizations like Google are doing 10PB sorts on 8000 machines in just over 6 hours – we know the technical scope for Big Data exists and eventually will flow down to the masses, and such scale will likely be achievable by most organizations in the next decade.

Instead I think the core problem that needs to be addressed relates to people and skills. There are lots of technical engineers who can build distributed systems, orders of magnitude more who can operate them and fill them to the brim with captured data. But where I think we are lacking skills is with people who know what to do with the data. People who know how to make it actually useful. Sure, a BI industry exists today but I think this is currently more focused on the engineering challenges of providing an organization with faster/easier access to their existing knowledge rather than reaching out into the distance and discovering new knowledge. The people with pure data analysis and knowledge discovery skills are much harder to find, and these are the people who are going to be front and center driving the big data revolution. People who you can give a few PB of data to and they can provide you back information, discoveries, trends, factoids, patterns, beautiful visualizations and needles you didn’t even know were in the haystack.

These are people who can make a real and significant impact on an organization’s bottom line, or help solve some of the world’s problems when applied to R&D. Data Geeks are the people to be revered in the future and hopefully we see a steady increase in people wanting to grow up to be Data Scientists.



NSA, Accumulo & Hadoop

September 8th, 2011

I read yesterday that the NSA has submitted a proposal to Apache to incubate their Accumulo platform. This, according to the description, is a key/value store built over Hadoop which appears to provide similar functionality to HBase, except that it provides “cell level access labels” to allow fine grained access control. This is something you would expect as a requirement for many applications built at government agencies like the NSA. But it is also very important for organizations in health care, law enforcement and similar fields, where strict control is required over access to large volumes of privacy sensitive data.
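To make the idea of cell level access labels concrete, here is a minimal sketch in Python. It is not Accumulo's API: the `LabeledStore` class, its `put` and `scan` methods and the label names are all hypothetical, and it assumes the simplest possible rule, namely that a reader sees a cell only if they hold every label attached to it. The real platform may well support richer label expressions.

```python
class LabeledStore:
    """Toy key/value store where every cell carries its own access labels.

    Hypothetical illustration only; this is not the Accumulo API.
    """

    def __init__(self):
        # (row, column) -> (value, frozenset of labels required to read it)
        self._cells = {}

    def put(self, row, column, value, labels=()):
        self._cells[(row, column)] = (value, frozenset(labels))

    def scan(self, authorizations):
        """Return only the cells whose labels are all held by the reader."""
        auths = frozenset(authorizations)
        return {
            key: value
            for key, (value, labels) in self._cells.items()
            if labels <= auths  # subset test: reader holds every label
        }


store = LabeledStore()
store.put("patient1", "name", "Alice", labels=["staff"])
store.put("patient1", "diagnosis", "flu", labels=["staff", "doctor"])

nurse_view = store.scan(["staff"])             # sees the name only
doctor_view = store.scan(["staff", "doctor"])  # sees both cells
print(nurse_view)
print(doctor_view)
```

The point of the sketch is that the filtering happens per cell rather than per table or per row, so two readers scanning the same row can legitimately get different answers.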

An interesting part of this is how it highlights the acceptance of Hadoop.  Hadoop is no longer just a new technology scratching at the edges of the traditional database market.  Hadoop is no longer just used by startups and web companies.  This is highlighted by outputs like this from organizations such as the NSA.  This is also further highlighted by the amount of research and focus on Hadoop by the data community at large (such as last week at VLDB).  No, Hadoop has become a proven and trusted platform and is now being used by traditional and conservative segments of the market.  

 



Hadoops Everywhere

September 2nd, 2011

We don’t pay enough attention to Hadoop.

By “we” I mean DBAs; the rest of the world is paying plenty of attention to Hadoop. Recently, I started asking my customers and fellow DBAs about Hadoop adoption in their company. Turns out that many of them have Hadoop. Hadoop shows up in large companies and small ones, in established industries and in startups. It’s everywhere.

The way Hadoop shows up in all companies, and the way DBAs don’t pay Hadoop much attention, reminds me a lot of how MySQL started showing up in the enterprise. It didn’t start by DBAs showing up one morning and telling their managers:
“There’s this new open source database. It’s not as stable as Oracle and it doesn’t have all the features we need, but man – it’s going to save us tons of money, and it’s pretty simple to manage.”

Nope, this never happened. What happened instead is that developers learned about MySQL, and it seemed to them like an excellent way to go around this whole DBA thing. They could install it themselves, learn how to use it in a week and become happy and productive. Without ever having to discuss their schema, data model, requirements, capacity planning, availability, backups and all the other things that DBAs want to talk about.

By the time the application came out of development and had to be deployed in production, MySQL was a done deal. No one is going to re-write the app just because the DBAs don’t know MySQL. Sometimes the Oracle DBAs were forced to learn and admin MySQL, but more often it was considered “not a database” and left for the sysadmins to manage, while the DBAs continued to pretend that the entire world is written by Oracle.

So that’s what Hadoop adoption looks like now – it’s usually introduced by the developers and administered by sysadmins, while DBAs continue to pretend it doesn’t exist or doesn’t matter. When pressed, some DBAs will even insist that all this “big data” thing can and should be done in a database, but the developers are too ignorant or lazy to work with a proper RDBMS.

I think the day has arrived when, just like DBAs can no longer ignore MySQL, we can no longer ignore Hadoop either. So let’s talk about it.

What is Hadoop?

First, Hadoop is not a database. It’s infrastructure, almost an operating system.

Hadoop was developed to ease the management of “big data”. In this context big data is too much data to fit on the hard drive of a single machine, so the data and the analysis of the data have to be distributed over a large cluster.

The idea is that you install Hadoop on a large number of normal servers and use their hard drives to store the data, so the data is distributed across many separate machines and disks, and then use their CPUs to process the data. It’s a shared-nothing architecture built with commodity hardware.

Hadoop consists of two parts – a distributed file system (HDFS) and a programming model with a job scheduling system (Map Reduce). Hadoop’s file system is different from your mother’s file system in two important aspects:

  1. It was built to support large files, so the default block size is 64MB. This makes the disk seek time a small percentage of the time it takes to retrieve the data. You can store smaller files in HDFS, but you can’t store too many small files – one of the HDFS servers has to keep the entire file list in memory, and too many files will result in this server running out of memory. You can configure a larger block size, but you need to be careful – data is processed in blocks. If you have fewer blocks than processing machines, you use fewer CPUs than you could to process the data, and miss out on some of the performance benefits.
  2. Each block is replicated on several servers, so if any single server fails, the data is not lost and processing can continue. You can configure the number of servers each block is replicated on.
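
The trade-off between block size, replication and parallelism can be checked with some quick arithmetic. The sketch below is only an illustration: the function names are made up, the replication factor of 3 and the 100-machine cluster are assumptions chosen for the example, and it assumes at most one machine works on a given block at a time.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, raw MB stored across the whole cluster).

    replication=3 is an assumed setting for this example.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


def busy_workers(blocks, workers):
    """Assume at most one worker processes a given block at a time."""
    return min(blocks, workers)


ten_gb = 10 * 1024  # a 10 GB file, expressed in MB

blocks_64, raw_mb = hdfs_footprint(ten_gb)                  # 64 MB blocks
blocks_1g, _ = hdfs_footprint(ten_gb, block_size_mb=1024)   # 1 GB blocks

print(blocks_64, raw_mb)             # 160 blocks, 30720 MB of raw disk
print(busy_workers(blocks_64, 100))  # all 100 machines have work
print(busy_workers(blocks_1g, 100))  # only 10 machines have work
```

With 64MB blocks the 10 GB file keeps every machine in the hypothetical 100-node cluster busy; with 1 GB blocks the same file can occupy only 10 of them.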

Map-Reduce is a parallel job-processing framework.
Each map-reduce job splits the data into independent chunks (usually block-sized), each chunk is processed by a map task in parallel to all other tasks. Map tasks usually do independent transformations and filtering of the data. The output of the map tasks is the input of reduce tasks – reduce tasks aggregate the data and generate the final output. The results of the tasks are stored in the HDFS filesystem, and the map-reduce framework keeps track of all the jobs.
This includes placing the tasks on the server that contains the data each task processes, to reduce network utilization, and tracking the tasks so if a task hangs or stalls on one server, it can be started on an additional server to speed up processing.
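
The flow of a map-reduce job is easier to see in a small simulation. The sketch below runs in a single Python process and does not use Hadoop at all: the function names, the log format and the sample data are invented for the example, and the chunks are processed one after another where Hadoop would run the map tasks in parallel on different servers.

```python
from collections import defaultdict


def map_task(chunk):
    """Map: transform one chunk independently of all the others.

    Each (made-up) log line ends with an HTTP status code; emit (status, 1).
    """
    for line in chunk:
        status = line.split()[-1]
        yield status, 1


def reduce_task(key, values):
    """Reduce: aggregate every value emitted for one key."""
    return key, sum(values)


def run_job(chunks):
    """Run all map tasks, group their output by key, then reduce each group."""
    grouped = defaultdict(list)
    for chunk in chunks:  # Hadoop would run these map tasks in parallel
        for key, value in map_task(chunk):
            grouped[key].append(value)
    return dict(reduce_task(key, values) for key, values in grouped.items())


# Two chunks stand in for two HDFS blocks of a web server log.
chunks = [
    ["GET /index.html 200", "GET /missing 404"],
    ["GET /about.html 200", "POST /login 500"],
]
result = run_job(chunks)
print(result)  # {'200': 2, '404': 1, '500': 1}
```

Because each map task only looks at its own chunk, the work can be spread over as many servers as there are chunks; only the grouping step in the middle needs to move data between them.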

Using Hadoop consists of loading data files into the HDFS filesystem, and then writing map-reduce jobs to analyse the data. One of the major drawbacks of using Hadoop is that Map-Reduce, while it makes developing distributed software easier, is still much more difficult to use than SQL. There are tools that make developing ad-hoc queries for Hadoop easier. I describe a few of them below.

Where would I want to use Hadoop?

Hadoop was developed for analyses that load data once, rarely modify it, and run batch operations that scan most of the data set. Note that this is very different from normal use of a database, where traditionally we want to access a small fraction of the data (using indexes to locate precisely the data we want) and to constantly modify the data.

By far the most common Hadoop use-case is ETL: transforming various logs collected throughout the organization into information that can be added to the corporate data-warehouse and analysed by traditional BI tools.
This might be mining web server logs to discover usage patterns of websites and web applications, or an ISP analysing mail server logs to find the location of users and decide which locations require additional mail servers. The other part is analysing load and failure data to improve internal IT operations.

There are other exciting use cases for Hadoop:

  1. The New York Times famously used a 100-server Hadoop cluster hosted by Amazon to transform 4TB of old images into 11 million PDF files. They did it in 24 hours and for a total cost of $240.
  2. Yahoo uses Hadoop to create web indexes and power its search engine.
  3. Autodesk uses Hadoop to track the most popular products in product catalogs and sell this information back to their customers.
  4. eBay uses Hadoop to optimize its product search.
  5. AOL-Advertising uses Hadoop to optimize its ad-placement. Facebook are doing the same. Facebook are also using Hadoop to mine user behavior data and use this information to make product marketing decisions.
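
The New York Times figures in the first item are worth a moment of arithmetic. The sketch below uses only the numbers quoted above and assumes, purely for the calculation, that all 100 servers ran for the full 24 hours.

```python
servers = 100
hours = 24
total_cost_usd = 240
pdfs = 11_000_000

# Assumption: every server was running for the whole 24 hours.
instance_hours = servers * hours
cost_per_instance_hour = total_cost_usd / instance_hours
pdfs_per_instance_hour = pdfs / instance_hours

print(instance_hours)                 # 2400 server-hours in total
print(cost_per_instance_hour)         # 0.1 dollars per server-hour
print(round(pdfs_per_instance_hour))  # about 4583 PDFs per server-hour
```

Under that assumption the job works out to about ten cents per server-hour, which is what makes renting a cluster for a one-off batch job so attractive.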

The list goes on and on. Almost every company has a lot of data, a lot of it outside the relational database. Almost every company can optimize its business operations or even drive completely new products by analysing and mining this data. Hadoop is a tool to mine large amounts of non-relational data.

Our business analysts won’t write map-reduce jobs

There are two solutions to this problem and most companies use both:

  1. Load the results of Hadoop processing into the data-warehouse that is already in use by the business analysts (usually through their BI tools). This can be done only when there are definite requirements on how the data will be used.
  2. Use tools such as HBase, Hive or HUE as a front-end to Hadoop. Hive, for example, provides a language similar to SQL (HiveQL) that will be more familiar to business analysts and will allow them to learn how to use Hadoop for ad-hoc queries, while HUE provides a web interface. In addition, Pentaho has a BI product that can integrate directly with Hadoop.

Who is offering support and products for Hadoop?

This is definitely an area that showed large and unexpected growth in the last year. As more large companies adopt Hadoop, more vendors rush to support it, and as enterprise support for Hadoop grows, more companies are ready to adopt it. This growth spiral was very exciting to watch.

Hadoop is an open-source product. If you need support, training and all kinds of enterprise services, you’ll need to find a company to support you.

The most well known company in this space is Cloudera, who deserve tons of credit for making Hadoop what it is today. The founders of Cloudera are the early Hadoop developers from Yahoo, so they definitely have the technical chops to support it. They also hired a top-notch training team from MySQL after the Oracle acquisition.

While Cloudera sells Hadoop professional services, it appears that they do not sell 24/7 production support services or integration services. I’ve heard from my customers that even a small 24-node cluster requires a full time employee to support it. Figuring out the fastest way to load terabytes of data into HDFS also remains the problem of the development teams.

In addition to support and services, Cloudera also sell their own Hadoop distribution with some enterprise-ready extensions such as a management suite.

EMC, the storage giant, created their own Hadoop distribution, which they support. It is called “Greenplum HD Enterprise Edition”. Their distribution includes snapshots, WAN replication and cluster management capabilities.
EMC also have a Hadoop data appliance that is claimed to run Greenplum database and Hadoop in the same device. All with hardware optimized for Hadoop processing and a unified interface of some kind. It sounds nice. I’m still waiting to run into one of those “in the wild”.
The device was announced in May, and I have been kind of expecting Oracle to announce their own Hadoop Exadata ever since. The story of unified structured and unstructured data in the same device sounds like something that Oracle won’t be able to ignore. Maybe this year at OpenWorld?

Netapp announced their own Hadoopler around the same time that EMC announced their device. The Hadoopler is not a complete Hadoop stack – it’s just high-performance storage running HDFS. It’s not a NAS/SAN system – the computation nodes (which Netapp does not provide) are expected to connect directly to the disks on Netapp shelves. This entire thing is based on the Netapp E-series (AKA Engenio). It is supposed to improve disk-failure recovery and high availability of HDFS.
Netapp has a partnership with Cloudera competitor Hortonworks to provide Hadoop support.

May was a busy month indeed because at the same time IBM announced its own Hadoop distribution “InfoSphere BigInsights”, which is once again Hadoop with enterprise features. It seems to be software-only. Support will be provided by IBM.

So, IBM, EMC and Netapp are joining the Hadoop fun. You’d better believe that this is not just a toy for web startups, but a tool that is expected to have significant use in the enterprise.

