Archive for the ‘Web/Tech’ Category

IA Ventures — Jobs shout out

August 4th, 2011

My friends over at IA Ventures are looking for both an Analyst and an Associate to join their team.  If Big Data, New York and start-ups are in your blood, then I can’t think of a better VC to be involved with.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize community events, and work with portfolio companies—basically, you can take on as much responsibility as you can handle."

Roger, Brad and the team continue to impress with their focus on Big Data, their strategic investments in monetizing data and knowledge of the industry in general.



Who/What to acquire next

March 18th, 2011

Well, as predicted, with Aster Data recently being picked up by Teradata, most of the key new-generation MPP distributed analytics vendors have now been acquired (Aster Data, Vertica, Netezza & Greenplum).  This had to happen and was expected to happen.  The MPP Analytics startup “revolution” is over and these technologies will now be integrated into the mainstream.

So what’s next?  As we know, if you are a massive multi-national software company it is a lot less risky to innovate incrementally and leave the development of “game changing” technologies to startups that can be acquired after they prove both the tech and the market.  So what follows MPP?

NoSQL technologies seem the only likely candidate at the moment, although I think it is a few years too early for any major acquisitions to occur.  A key issue that would need to be worked through is what exactly is being acquired, as most NoSQL platforms are open source / free (most MPP platforms were proprietary).  But nonetheless, as the market grows and starts to eat noticeably into the existing RDBMS market, the major vendors will want a piece of that action and the frenzy will start again.  But this is still quite a while away yet.

 



The problem with a full box of big data tools

October 7th, 2010

“NoSQL”, for lack of a better name, is a generic term that describes any data management system that does not use SQL as a query interface.  Generally this means any data management system that is non-relational, but the term has also been stretched to include systems at the boundary of what constitutes a data management system at all (such as Hadoop).

Early on (a couple of years back in NoSQL time), when the term was coined, I think the positioning was much more aggressive, but more recently this has been softened so that NoSQL is now commonly quoted as meaning “Not only SQL” or “next generation databases” (whatever that means).  The common message you get now is something along the lines of: NoSQL systems are more “specialized”, each being designed to solve a smaller number of problems than the generic RDBMS sets out to.  NoSQL is another tool in your toolbox.  A better option in certain cases where the RDBMS doesn’t fit well.  A different hammer for a different type of nail.  It all makes sense in theory, but in reality this brings its own set of troubles.

There are now dozens of NoSQL systems available for a developer to choose from: MongoDB, Cassandra, Voldemort, HBase, CouchDB, Riak, Neo4J, HamsterDB and so on.  And there are several different orientations of NoSQL system, including document, key/value and graph.  It seems the same energy we saw open-source hackers putting into MySQL 10 years ago has now been transferred into a myriad of NoSQL systems.  Again the argument is: more choice, better for everyone.

The problem, and I am putting it out there as a problem so we can think of ways to fix it, is that while that is fine in theory, in practice many choices also create difficulties.  Real world development projects have certain skills bases they draw on, with experience and an ability to “make things work” based on years of hard slog cobbling things together.  And there are very few surprises left when deploying an application on a mainstream RDBMS (of course it will, like any software, still have issues from time to time).

One of the key reasons the RDBMS has been so dominant is the fact that you could use it for pretty much any requirement.  And using it for any requirement meant that your developers had lots of experience building applications on it and your DBAs had lots of experience running it.  But you also knew that you could almost always make any requirement work “good enough” by buying extra hardware and/or indexing the heck out of it etc.  Regardless of whether it was technically the best fit or not, when all things were considered the RDBMS was a stable constant given short project timeframes and limited development budgets.  It was exactly its generic nature, its ability to do most things good enough, that has led the RDBMS to become the default option for any new development project (with the choice between the various flavors, MySQL, Oracle, DB2, SQL Server, being less relevant).

As humans, we all have limited brain capacities and most of us can only be experts in a small number of things.  And our expertise typically comes from our history: making mistakes and learning what works and what doesn’t through the hard yards of experience.  So given a buffet choice of specialized NoSQL systems, how on earth do we choose the most appropriate tool for the job, while at the same time dealing with the lack of expertise we will invariably have?  Also, what will be the impact on development projects of choosing the wrong tool for the job?  The RDBMS is very forgiving of poor design, poor implementation and the subsequent addition of unforeseen application requirements (you want to run OLAP now we have built you a busy OLTP database – sure, but do it overnight).  Will a specialist NoSQL system have the same tolerance for our incompetence?

So now I return to the point that is really the keystone of the NoSQL motivation: “there are requirements which an RDBMS doesn’t work at all well for”.  I agree with this, but I have yet to see any quantification of what this actually means.  Is it 5% or 10% of current development projects?  And should the question really be “what percentage of development projects is the RDBMS unusable for”?  Technical purity, and even reduced license costs, need to be balanced against one of the largest costs: re-skilling development and production teams to understand this new data platform.

There are some clear cases, the Googles, Twitters, Facebooks etc., where scale alone is clearly outside the boundaries of what is possible on today’s RDBMS platforms.  But what percentage of today’s development projects have these scalability requirements?  1%?  Less?  Sure, we are going through somewhat of a data explosion and by all accounts the volume of data we collect and manage in our databases is growing at an alarming rate.  So the demand for scale will continue, but let’s also not forget that the big RDBMS vendors are very market driven, and as the market changes their products will continue to change with it.  It is very unlikely they will be asleep at the wheel and lose their dominant share of the ~$30b market without a fight.

Contrary to how it may appear, I am actually supportive of a number of NoSQL initiatives and I am even hands on with a few.  But I do have concerns about how we quantify the market and how we ensure that people are making the right decisions in choosing a NoSQL platform.  And also how we bridge the gap in skill sets and experience for developers who will have years upon years of RDBMS experience but, by nature, only have exposure to NoSQL systems periodically, based on certain application requirements.

 



Was Stonebraker right?

September 15th, 2010

Back in 2008 Stonebraker & DeWitt published a paper and associated blog post titled “MapReduce: A major step backwards”.  Their key points were that Map/Reduce is:


  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on


This turned out to be one of the most contentious postings in the DBMS community at the time, drawing widespread criticism.  The charge was that the “old men of DBMS” didn’t get that a database was not the solution for every problem and that some problems just required a different type of mallet.  Even Vertica (which Stonebraker founded) seemed to distance themselves from the comments a little, issuing a post affirming their commitment to Map/Reduce.

If you read through the comments on the original Stonebraker/DeWitt post and the follow-on post you will see how vigorously people were defending Map/Reduce.


The key example quoted when hailing the benefits of Map/Reduce was that of the company which popularized it in the first place, Google.  Google used Map/Reduce to build its search indexes, processing the immense volumes of data in batch fashion using MR jobs run across thousands of nodes.  No matter how the arguments for MR broke down, the final word could always be “Google does it”, for which there wasn’t a great comeback.
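To make the batch style concrete, here is a minimal single-process sketch in Python of the map, shuffle and reduce phases used to build an inverted index.  The sample documents and the function names (map_phase, shuffle, reduce_phase, build_index) are made up for illustration; this is not Google's implementation, only the general shape of such a job, and a real deployment would spread each phase across many nodes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word occurrence in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Collapse all doc IDs for one word into a sorted, de-duplicated posting list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run the whole batch: every document is re-read on every run."""
    pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample corpus, for illustration only.
docs = {
    "d1": "big data tools",
    "d2": "big table",
}
index = build_index(docs)
```

In this sketch, picking up a single new document means running build_index over the whole corpus again.  That brute-force rescan is what critics contrasted with an indexed, incrementally updated store.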


Now, however, things have changed.  It has been reported that Google has moved away from Map/Reduce for search indexing due to time constraints in processing updates to the index and has instead opted for (or reverted to), wait for it, a DBMS-centric approach to the problem (Google Caffeine).  Let me quickly point out that this DBMS is not an RDBMS but instead is their own BigTable distributed database (over GFS).


So, some questions are begging to be asked.  


Firstly, were Stonebraker and DeWitt right?  Is it red-faced time for those who came out and aggressively defended the Map/Reduce architecture?


And secondly, what impact does this have on the future of Map/Reduce now that those responsible for its popularity seem to have migrated their key use case?  Is the proposition for Map/Reduce today still just as good now that Google doesn’t do it?  (Yes, I am sure Google still uses Map/Reduce extensively and this is a bit tongue in cheek.  But the primary quoted example relates to building the search index, which is what, reportedly, has been moved away from MR.)


Finally, this will no doubt provide a shot in the arm for BigTable-like open source implementations such as HBase and Cassandra.




VLDB 2010

September 6th, 2010


I will be at VLDB 2010 next week.  If anyone on this blog is attending and wants to catch up to discuss start ups and innovation in DB, NoSQL, Big Data etc drop me a line and I will try to meet up.



Riptano for Cassandra

May 3rd, 2010


Cassandra is one of the most interesting NoSQL platforms at the moment.  And by most interesting, what I really mean is the most clearly justifiable.  Some NoSQL platforms offer new data models, improved query interfaces and/or good single node performance through relaxed consistency models.  As a database guy, however, I still find the justification for throwing out the RDBMS baby and bathwater difficult at this point, as NoSQL platforms tend to be highly focused on one aspect of data management and very immature in all other areas.  Cassandra is somewhat different as it is more mature in a number of key areas (albeit still immature in others), areas that can make Cassandra more justifiable for the right project when compared with a more traditional RDBMS-based solution.  This is because Cassandra’s primary capabilities can’t easily be replicated on those traditional mainstream platforms.

Cassandra’s primary focus is on scalability.  More specifically, that is scalability combined with reasonable functionality, performance and availability when at scale.  While some other platforms are trying to bolt scalability/availability onto their functionality-rich data engines, Cassandra already has proven real life examples running 150 node clusters.  Notable users of Cassandra include Digg, Facebook, Twitter, Reddit & Rackspace.  And the feedback from these sites is very good; Cassandra has commonly been described as the hands-down winner for transaction processing performance at scale.

One of the key contributors to Cassandra has been Jonathan Ellis, and until recently he had been working on Cassandra while employed by Rackspace.  But I was pleased to hear that Jonathan and business partner Matt Pfeil have taken the step of setting up their own Cassandra-focused company, Riptano.

Riptano are providing the commercialized support services around the open source Cassandra that are necessary for the platform to survive and grow.  While such services may be less important for adoption from the techie rich Web 2.0 crowd, for any platform to become mainstream there needs to be an escalation path for companies uninterested or unable to tinker with the code themselves.  Riptano provides those services which can allow Cassandra use to start to grow further.

Just as importantly, this move gives representation to Cassandra and provides an entity whose best interests will be served through advocacy of the platform.  While Jonathan and others had been doing a fine job of this to date personally, another corporation investing commercial dollars into advocacy will be important to ensure Cassandra’s message isn’t drowned out by more highly funded alternatives.

Riptano has received some early funding from Rackspace and I believe already has a few customers signed for their support services.  Best of luck, Jonathan & Matt.


Ingres Vectorwise smokes it!

May 1st, 2010

I work in all markets of the database industry, from web & startup through the largest and most established enterprises.  And to be completely honest, the name Ingres has not come up in conversation very much at all.  10 years ago maybe more often, but recently not all that much.  But Ingres has been quietly ticking away.  Despite being largely off the radar, they still have a sizable and loyal customer base, global offices and a focused & dedicated management team.  And importantly they have an open source business model which actually appears to be working.

I wrote last year that their "behind the scenes" status had the potential to change.  Ingres had been very clever and worked out a partnership with Peter Boncz’s Vectorwise.  And that relationship was promising big things for data analytics from a price/performance perspective.  But at the time it was all promise and little in the way of substance had been produced.

But that has been changing.  A month or two back Ingres somewhat quietly launched the Beta program for the Ingres Vectorwise technology.  This technology, if you have not read about it before, combines an analytical column store and “vectorized processing” to give much greater throughput rates than previously possible on your existing hardware (Vectorwise is a single node solution, i.e. not MPP).
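For readers who have not met these two ideas, the following is a small Python sketch of a column layout queried one batch ("vector") of values at a time.  Everything here is an assumption for illustration: the sample rows, the names to_columns and sum_where, and the tiny vector size.  It says nothing about how Vectorwise is actually built, and pure Python will not show the speed-up; real engines get their gains by running tight loops over contiguous arrays that fit in the CPU cache.

```python
# Row layout: each record is kept together (hypothetical sample data).
rows = [
    {"region": "EU", "amount": 10.0},
    {"region": "US", "amount": 25.0},
    {"region": "EU", "amount": 5.0},
    {"region": "US", "amount": 40.0},
]


def to_columns(rows):
    """Pivot rows into a column layout: one contiguous list per attribute."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_where(columns, filter_col, wanted, value_col, vector_size=2):
    """Sum value_col where filter_col == wanted, one vector (batch) at a time.

    Only the two columns the query touches are read, and each step works
    on a whole batch of values rather than on a single row.
    """
    keys, values = columns[filter_col], columns[value_col]
    total = 0.0
    for start in range(0, len(keys), vector_size):
        key_vec = keys[start:start + vector_size]
        val_vec = values[start:start + vector_size]
        # Selection step: positions in this vector that pass the filter.
        selected = [i for i, key in enumerate(key_vec) if key == wanted]
        # Aggregation step: runs over the selected positions only.
        total += sum(val_vec[i] for i in selected)
    return total


columns = to_columns(rows)
eu_total = sum_where(columns, "region", "EU", "amount")
```

The design point is that an analytical query such as "total amount for one region" never has to touch the other attributes of each record, and the work per batch is a simple loop over plain values.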

And I have started hearing feedback, and it is good.  Very good.  While Ingres Vectorwise isn’t fully baked yet, I have heard it is producing astounding performance results in early testing.  In one case I heard of a <10TB real life production comparison test in which Ingres Vectorwise smoked everything else they had tested.  And they have tested a lot of different market leading analytical platforms.

So I think this is the start of an Ingres comeback.  Certainly anyone looking at <10TB analytical platforms will be getting the recommendation from me that they at least look at Ingres Vectorwise.  I am looking forward to seeing what 2010/2011 brings for them.


NoSQL Buzz

April 14th, 2010

I have noticed a definite increase in NoSQL buzz over the last few months.  This is partly confirmed by Google Trends, a service that shows data relating to how search topics rank:

[Image: Google Trends chart for NoSQL]

The last couple of months have seen a dramatic rise in both the number of searches and the number of news items relating to NoSQL.

But the traditionalists need not fret yet: interest in NoSQL is still but a blip on the data management radar, as demonstrated by this comparison between NoSQL and MySQL search rankings:

[Image: Google Trends chart comparing NoSQL and MySQL]

It will be interesting to see how the dynamics of this change throughout 2010 though.


What is Big Data?

January 31st, 2010


One of my favorite terms at the moment is “Big Data”.  While all terms are by nature subjective, in this post I will try and explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges.  Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume, this isn’t the case.  Many existing technologies have little problem physically handling large volumes (TB or PB) of data.  Instead the Big Data challenges result from the combination of volume and our usage demands on that data.  And those usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes.  The exact definitions are of course relative & constantly changing; right now we are somewhere along the path towards the end goal, which is of course the ability to handle an unlimited volume of data, processing all requests in real time.

So what are Big Data technologies?

More than at any point in the past, data related technologies are the focus of research & innovation.  But Big Data challenges won’t be solved anytime soon by a single approach.  Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile), combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for the eclectic mix of needs.  Instead, a broad set of technologies that are each focused on meeting a specific set of needs is improving our ability to manage data at scale.

A few common areas of innovation that I describe as Big Data technologies include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases and Distributed Transaction Processing.

So what is the point of Big Data?

Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”.  While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field does result in achievements more significant than just these.

When describing the point of Big Data I like to think about how the Internet has changed my life in general.  By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before.  However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form.  To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives.  This is what those working in Big Data are setting out to achieve.



DBMS Links of the Week

September 26th, 2009


The following is a list of interesting DBMS related links for the week:


