Archive for the ‘Cloud Databases’ Category

Big Data innovation marches on

September 21st, 2010


With IBM intending to acquire Netezza, the predicted consolidation in the distributed analytics market is well underway. Recent deals include EMC/Greenplum, Teradata/Kickfire and now IBM/Netezza. A good breakdown of this deal is on Curt’s blog. There is still more to go of course, with one of the crown jewels, Vertica, still ripe for the picking.

What this indicates is that MPP analytics has moved from the innovative edge into the mainstream market, and the more risk-averse large caps are now willing to invest substantially in growing this market. Interestingly, Microsoft made this move early with the acquisition of DATAllegro in 2008. I doubt this has paid dividends yet, but 5 years out this might be a different story as the explosive growth of machine-generated data continues.

While it is probably a bad time to start building another MPP query processor, innovation in big data core technology continues to be strong. Key areas of innovation relate to Flash/SSD optimization & caching, graph databases, stream processing & CEP, Hadoop optimization, massive shared-nothing (cloud) scalability & SQL/NoSQL convergence. These technologies will come to market in a variety of different product forms, some of which will later be picked up by the large caps. Rinse, repeat.



Was Stonebraker right?

September 15th, 2010

Back in 2008 Stonebraker & DeWitt published a paper and associated blog post titled “MapReduce: A major step backwards”. Their key points were that Map/Reduce is:


  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on


This turned out to be one of the most contentious postings in the DBMS community at the time, drawing widespread criticism. The criticism was that the “old men of DBMS” didn’t get that a database was not the solution for every problem and that some problems just required a different type of mallet. Even Vertica (which Stonebraker founded) seemed to distance itself from the comments a little, issuing a post affirming its commitment to Map/Reduce.

If you read through the comments on the original Stonebraker/DeWitt post and the follow-on post, you will see how vigorously people were defending Map/Reduce.


The key example quoted when hailing the benefits of Map/Reduce was that of the company which popularized it in the first place, Google. Google used Map/Reduce to build its search indexes, processing immense volumes of data in batch fashion using MR jobs run across thousands of nodes. No matter how the arguments for MR broke down, the final word could always be “Google does it”, for which there wasn’t a great comeback.
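To make the batch index-building pattern concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps applied to an inverted index. It is only an illustration of the programming model: the function names and the toy corpus are hypothetical, and nothing here reflects how Google's actual jobs were written or distributed.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Collapse the postings for one word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_index(documents):
    """Run the whole batch: map every document, shuffle, then reduce per word."""
    pairs = (pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical toy corpus, purely for illustration.
documents = {
    "d1": "big data marches on",
    "d2": "big table over gfs",
}
index = build_index(documents)
```

In this batch shape, a change to any one document only shows up in `index` after `build_index` has been run again over the input.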


Now, however, things have changed. It has been reported that Google has moved away from Map/Reduce for search indexing due to time constraints in processing updates to the index, and has instead opted for (or reverted to), wait for it, a DBMS-centric approach to the problem (Google Caffeine). Let me quickly point out that this DBMS is not an RDBMS but their own BigTable distributed database (over GFS).


So, some questions are begging to be asked.  


Firstly, were Stonebraker and DeWitt right? Is it red-faced time for those who came out and aggressively defended the Map/Reduce architecture?


And secondly, what impact does this have on the future of Map/Reduce now that those responsible for its popularity seem to have migrated their key use case? Is the proposition for Map/Reduce today still just as good now that Google doesn’t do it? (Yes, I am sure Google still uses Map/Reduce extensively, and this is a bit tongue in cheek. But the primary quoted example relates to building the search index, which is what, reportedly, has been moved away from MR.)


Finally, this will no doubt provide a shot in the arm for BigTable-like open source implementations such as HBase and Cassandra.




VLDB 2010

September 6th, 2010


I will be at VLDB 2010 next week. If anyone reading this blog is attending and wants to catch up to discuss startups and innovation in DB, NoSQL, Big Data, etc., drop me a line and I will try to meet up.



Riptano for Cassandra

May 3rd, 2010


Cassandra is one of the most interesting NoSQL platforms at the moment. And by most interesting, what I really mean is the most clearly justifiable. Some NoSQL platforms offer new data models, improved query interfaces and/or good single-node performance through relaxed consistency models. As a database guy, however, I still find the justification for throwing out the RDBMS baby and bathwater difficult at this point, as NoSQL platforms tend to be highly focused on one aspect of data management and very immature in all other areas. Cassandra is somewhat different, as it is more mature in a number of key areas (albeit still immature in others). These are areas that can make Cassandra more justifiable for the right project when compared with a more traditional RDBMS-based solution, because Cassandra’s primary capabilities can’t easily be replicated on those traditional mainstream platforms.

Cassandra’s primary focus is on scalability. More specifically, that is scalability combined with reasonable functionality, performance & availability when at scale. While some other platforms are trying to bolt scalability/availability onto their functionality-rich data engines, Cassandra already has proven real-life examples running 150-node clusters. Notable users of Cassandra include Digg, Facebook, Twitter, Reddit & Rackspace. And the feedback from these sites is very good; Cassandra is commonly described as the hands-down winner for transaction processing performance at scale.

One of the key contributors to Cassandra has been Jonathan Ellis, who until recently worked on Cassandra while employed by Rackspace. I was pleased to hear that Jonathan and business partner Matt Pfeil have taken the step of setting up their own Cassandra-focused company, Riptano.

Riptano is providing the commercial support services around the open source Cassandra that are necessary for the platform to survive and grow. While such services may be less important for adoption by the techie-rich Web 2.0 crowd, for any platform to become mainstream there needs to be an escalation path for companies uninterested in or unable to tinker with the code themselves. Riptano provides those services, which can allow Cassandra use to grow further.

Just as importantly, this move gives representation to Cassandra and provides an entity whose best interests will be served through advocacy of the platform.  While Jonathan and others had been doing a fine job of this to date personally, another corporation investing commercial dollars into advocacy will be important to ensure Cassandra’s message isn’t drowned out by more highly funded alternatives.

Riptano has received some early funding from Rackspace and I believe already has a few customers signed for its support services. Best of luck, Jonathan & Matt.


Ingres Vectorwise smokes it!

May 1st, 2010

I work in all markets of the database industry, from web & startup through the largest and most established enterprises.  And to be completely honest, the name Ingres has not come up in conversation very much at all.  10 years ago maybe more often, but recently not all that much.  But Ingres has been quietly ticking away.  Despite being largely off the radar, they still have a sizable and loyal customer base, global offices and a focused & dedicated management team.  And importantly they have an open source business model which actually appears to be working.

I wrote last year that their “behind the scenes” status had the potential to change. Ingres had been very clever and worked out a partnership with Peter Boncz’s Vectorwise. That relationship was promising big things for data analytics from a price/performance perspective. But at the time it was all promise, and little in the way of substance had been produced.

But that has been changing. A month or two back Ingres somewhat quietly launched the beta program for the Ingres Vectorwise technology. This technology, if you have not read about it before, combines an analytical column store and “vectorized processing” to give much greater throughput rates than previously possible on your existing hardware (Vectorwise is a single-node solution, i.e. not MPP).
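As a rough illustration of the two ideas named above, the sketch below stores each column as its own list and has the query operator work on fixed-size vectors of values rather than one row at a time. It is a toy in pure Python under assumed names (`columns`, `batches`, `revenue_vectorized`) and says nothing about how Vectorwise itself is implemented or how fast it is.

```python
from itertools import islice

# Hypothetical column store: each column is held as its own contiguous list.
columns = {
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "quantity": [1, 2, 3, 4, 5],
    "comment": ["a", "b", "c", "d", "e"],  # never read by the query below
}


def batches(values, size):
    """Yield successive fixed-size slices (vectors) of a single column."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def revenue_vectorized(cols, vector_size=2):
    """Sum price * quantity, reading only the two columns the query needs."""
    total = 0.0
    price_vectors = batches(cols["price"], vector_size)
    quantity_vectors = batches(cols["quantity"], vector_size)
    for prices, quantities in zip(price_vectors, quantity_vectors):
        # Each pass through this loop handles a whole vector of values.
        total += sum(p * q for p, q in zip(prices, quantities))
    return total
```

The result does not depend on `vector_size`; the vector size only changes how many values each operator step handles at once.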

And I have started hearing feedback, and it is good. Very good. While Ingres Vectorwise isn’t fully baked yet, I have heard it is producing astounding performance results in early testing. In one case I heard of a <10TB real-life production comparison test in which Ingres Vectorwise smoked everything else they had tested. And they have tested a lot of different market-leading analytical platforms.

So I think this is the start of an Ingres comeback. Certainly anyone looking at <10TB analytical platforms will be getting the recommendation from me that they at least look at Ingres Vectorwise. I am looking forward to seeing what 2010/2011 brings for them.


NoSQL Buzz

April 14th, 2010

I have noticed a definite increase in NoSQL buzz over the last few months. This is partly confirmed by Google Trends, a service that shows data on how search topics rank:

[Image: Google Trends chart for NoSQL]

The last couple of months have seen a dramatic rise in both the number of searches and the number of news items relating to NoSQL.

But the traditionalists need not fret yet. Interest in NoSQL is still but a blip on the data management radar, as demonstrated by this comparison between NoSQL and MySQL search rankings:

[Image: Google Trends chart comparing NoSQL and MySQL]

It will be interesting to see how the dynamics of this change throughout 2010 though.


What is Big Data?

January 31st, 2010


One of my favorite terms at the moment is “Big Data”.  While all terms are by nature subjective, in this post I will try and explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges. Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume, this isn’t the case. Many existing technologies have little problem physically handling large volumes (TB or PB) of data. Instead, the Big Data challenges result from the combination of volume and our usage demands on that data. And those usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes. The exact definitions are of course relative & constantly changing; right now we are somewhere along the path towards the end goal, which is the ability to handle an unlimited volume of data while processing all requests in real time.

So what are Big Data technologies?

More than at any point in the past, data-related technologies are the focus of research & innovation. But Big Data challenges won’t be solved anytime soon by a single approach. Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile), combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for the eclectic mix of needs. Instead, a broad set of technologies, each focused on meeting a specific set of needs, is improving our ability to manage data at scale.

A few common areas of innovation that I describe as Big Data technologies include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases and Distributed Transaction Processing.

So what is the point of Big Data?

Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”. While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field does result in achievements more significant than just these.

When describing the point of Big Data I like to think about how the Internet has changed my life in general.  By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before.  However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form.  To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives.  This is what those working in Big Data are setting out to achieve.



DBMS Links of the Week

September 26th, 2009


The following is a list of interesting DBMS related links for the week:




Is the RDBMS doomed (yada yada yada)?

September 22nd, 2009


I was speaking with Michael Stonebraker this morning. I mentioned that lately many have been referencing comments he has made over the last couple of years, and that many had interpreted them as implying the RDBMS is “doomed”. Mike has been saying the same thing for years, but the current NoSQL movement seems to have picked up on this and is highlighting that one of the RDBMS’s own pioneers is predicting its demise.

I asked Mike to clarify this. My interpretation of his response is as follows. I understand that he doesn’t believe the relational database itself is doomed. Instead, the current general purpose (GP) implementations, or “elephants” to use his word, are out of date. By moving away from a historical GP function into something more specific in focus, either in transaction processing or analytics, you can easily get a 50x performance improvement over a GP RDBMS. This doesn’t necessarily mean moving away from the “relational” nature, but instead changing some core design principles for how an RDBMS is implemented. It is this improvement factor that will see “new” specialist platforms overtake “old” general purpose platforms, gradually, over time. However, Mike also mentioned that the relational data model doesn’t make sense in a number of disciplines, particularly in the sciences, and that alternative modeling paradigms will offer benefits to this market (hence his focus on SciDB). So while relational is a valid data model, other data models are also needed.

I have a similar position to Mike, but perhaps with a few differences. 

- Firstly, I agree with the mantra that current GP RDBMS platforms provide only a “middle of the road” capability, and that we have gone too far in using a GP RDBMS for everything. However, I do believe there is a long-term future for the GP RDBMS. A general purpose application requirement will continue to be well suited to a general purpose platform. With a specialist-only approach, a general purpose requirement may need both a specialist OLTP platform and a specialist analytics platform to provide the same capability.

- I agree that with an extreme requirement, either analytics or transaction processing, a specialist platform is well suited. But I don’t see the choices of just MPP or memory-resident RDBMS as being a broad enough set. Apps that use a db just as a persistence cache will benefit from a high-performing, scalable database platform with much tighter integration with the object model. I am not sure any of the current NoSQL platforms have it quite right yet, but when these guys eventually get together with the database guys and work on these things together they may get there.

- I don’t think a 50x performance speed-up on its own is enough to drive change in OLTP. I have written before about how difficult it is to get into this market and how tightly Oracle, Microsoft & IBM have it sewn up. But I don’t believe it is impossible; I think you just need to bring slam dunks on multiple fronts (performance being just one of them).

Anyway I feel like I am a bit of a broken record at the moment.  I have been addressing the “is the RDBMS doomed” question a couple of times a day for some time. Time to focus on something else for a bit.



Some Initial Thoughts on Oracle Exadata V2

September 16th, 2009


There will be plenty of detailed coverage of Exadata V2, so I won’t attempt to replicate that. However, I do have a couple of initial thoughts which I would like to share. For those who missed it, Oracle has just announced Exadata V2 (which is their pre-built “machine”). Exadata V1 was built using HP equipment; Exadata V2 is using Sun. The main addition in Exadata V2 seems to be an extra tier in the memory hierarchy, a flash cache. Oracle is very quick to point out this is not flash disk but flash memory, Sun’s FlashFire technology. (Flash disk, or SSD, was always going to be a transition technology: flash memory doesn’t have the physical constraints of a disk with moving parts, so the whole “disk” concept for flash doesn’t make too much sense, other than that it fits easily with current architectures.)

The new memory layer (Processor Caches -> DRAM -> Flash Cache -> Disk), coupled with Oracle’s algorithms to effectively use the Flash Cache layer, brings a performance benefit to the solution (plus all the other improvements that 12 months of hardware innovation brings: faster CPUs, more memory, etc.).
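To show what an extra tier between DRAM and disk means for a read, here is a toy sketch of a tiered lookup with LRU eviction, where blocks pushed out of DRAM fall back into a flash tier instead of being dropped. The class name, tier sizes and eviction policy are all assumptions made for illustration; Oracle's actual caching algorithms are not described in this post and are not represented here.

```python
from collections import OrderedDict


class TieredStore:
    """Toy read path: DRAM -> flash cache -> disk, with LRU eviction per tier."""

    def __init__(self, disk, dram_size=2, flash_size=4):
        self.disk = disk              # plain dict standing in for disk blocks
        self.dram = OrderedDict()     # smallest, fastest tier
        self.flash = OrderedDict()    # larger middle tier
        self.dram_size = dram_size
        self.flash_size = flash_size

    def _put(self, tier, capacity, key, value):
        """Insert into a tier; return the evicted (key, value) pair, if any."""
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > capacity:
            return tier.popitem(last=False)  # least recently used entry
        return None

    def read(self, key):
        """Return (value, name of the tier that served the read)."""
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key], "dram"
        if key in self.flash:
            value, source = self.flash.pop(key), "flash"
        else:
            value, source = self.disk[key], "disk"
        # Promote the block to DRAM; whatever DRAM evicts drops to flash.
        evicted = self._put(self.dram, self.dram_size, key, value)
        if evicted is not None:
            self._put(self.flash, self.flash_size, *evicted)
        return value, source
```

With a two-block DRAM tier, reading blocks 0, 1 and 2 in turn pushes block 0 out of DRAM, so the next read of block 0 is served from the flash tier and does not go back to disk.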

My initial thoughts are:

  • Kudos to Oracle. They are the first vendor to really bring a bunch of this leading-edge technology together in a semi-mainstream way. Flash cache, InfiniBand interconnects and DBMS optimizations using flash haven’t really surfaced anywhere outside of startups yet.
  • So what happens to Exadata V1 customers using the HP solution? This is only about a year old. Some analysts are suggesting there have only been minor sales of Exadata V1 (I am not an analyst so don’t really know). So why would HP continue to support a platform where no new sales will be created, when potentially only a limited number of customers have it today? Possibly Oracle will offer attractive terms to move existing HP Exadata V1 customers to Sun Exadata V2.
  • It is a preconfigured solution that you buy in certain size configurations: small, half rack, full rack, multiple racks. I think Larry said that 3 racks will give you a petabyte of storage capacity. This is fine, except they are targeting it for use with both OLTP and data warehousing workloads. It seems odd that to get very high computational resources for transaction processing, you would also get massive volumes of potentially unnecessary storage capacity. It will be interesting to see if they allow the balance between processing & storage to be modified as part of configuration.


I have had some questions along the lines of “isn’t this back to the one size fits all approach?” Well yes it is, but Oracle never really moved away from this in terms of the core DBMS. It is my understanding that Oracle Exadata was still the general purpose Oracle DBMS & RAC, but on a hardware platform optimized for accessing large data sets (making it a data warehousing solution). Using FlashFire, the hardware can now do high levels of random I/O (I think 1m random I/Os was quoted), which makes the hardware platform general purpose as well.

One interesting question will be whether, under Oracle, other vendors can buy the exact same hardware configuration from Sun and optimize their DBMS for flash as well. If so, it may be difficult for them to do this in a way that is price competitive. And will competing DBMS vendors really want to help fill Oracle’s pockets further?

If we expect to see more of this hardware alignment between DBMS vendors, where does that leave Microsoft? Maybe HP is already peeling the Exadata V1 logos off their racks and sticking Microsoft Madison logos in their place?

UPDATE:

Oracle has put out a FAQ which partly answers some of the questions.

