Archive for the ‘Relational DB’ Category

What is the biggest challenge for Big Data?

September 9th, 2011

Often I think about the challenges that organizations face with “Big Data”.  While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data.  It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

It may be surprising what I think the near-term challenges are.  Largely I don’t think these are purely technical.  There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at a pace in line with the increase in data volume.  Sure, there will continue to be lots of interesting innovation with technology, but when organizations like Google are doing 10PB sorts on 8000 machines in just over 6 hours – we know the technical scope for Big Data exists and eventually will flow down to the masses, and such scale will likely be achievable by most organizations in the next decade.

Instead I think the core problem that needs to be addressed relates to people and skills.  There are lots of technical engineers who can build distributed systems, and orders of magnitude more who can operate them and fill them to the brim with captured data.  But where I think we are lacking skills is with people who know what to do with the data.  People who know how to make it actually useful.  Sure, a BI industry exists today but I think this is currently more focused on the engineering challenges of providing an organization with faster/easier access to its existing knowledge rather than reaching out into the distance and discovering new knowledge.  The people with pure data analysis and knowledge discovery skills are much harder to find, and these are the people who are going to be front and center driving the big data revolution.  People who you can give a few PB of data to and they can provide you back information, discoveries, trends, factoids, patterns, beautiful visualizations and needles you didn’t even know were in the haystack.

These are people who can make a real and significant impact on an organization’s bottom line, or help solve some of the world’s problems when applied to R&D.  Data Geeks are the people to be revered in the future and hopefully we see a steady increase in people wanting to grow up to be Data Scientists.



Reply to The Future of the NoSQL, SQL, and RDBMS Markets

August 12th, 2011

Conor O'Mahony over at IBM wrote a good post on a favorite topic of mine “The Future of the NoSQL, SQL, and RDBMS Markets”.  If this is of interest to you then I suggest you read his original post.  I replied in the comments but thought I would also repost my reply here.

-----------------------------------------------------------------------------------------------

Hi Conor, I wish it was as simple as SQL & RDBMS is good for this and NoSQL is good for that.  For me at least, the waters are much muddier than that.

The benefit of SQL & RDBMS is that its general purpose nature has meant it can be applied to a lot of problems, and because of its applicability it has become mainstream to the point every developer on the planet can probably write basic SQL.  And it is justified, there aren’t many data problems you can’t throw an RDBMS at and solve.

The problems with SQL & RDBMS, well essentially I see two.  Firstly, distributed scale is a problem in a small number of cases.  This can be solved by losing some of the generic nature of RDBMS and keeping SQL, such as with MPP or attempts like Stonebraker’s NewSQL.  The other way is to lose RDBMS and SQL altogether to achieve scale with alternative key/value methods such as Cassandra, HBase etc.  But these NoSQL databases don’t seem to be the ones gaining the most traction.  From my perspective, the most “popular” and fastest growing NoSQL databases tend to be those which aren’t entirely focused on pure scale but instead focus first on the development model, such as Couch and MongoDB.  Which brings me to my second issue with SQL & RDBMS.

Without a doubt the way in which we build applications has changed dramatically over the last 20 years.  We now see much greater application volumes, much smaller developer teams, shorter development timeframes and faster changing requirements.  Much of what the RDBMS has offered developers – such as strong normalization, enforced integrity, strong data definition, documented schemas – has become less relevant to applications and developers.  Today I would suspect most applications use a SQL database purely as an application-specific dumb datastore.  Usually there aren’t multiple applications accessing that database, there aren’t lots of direct data imports/exports into other applications, no third party application reporting, no ad-hoc user queries and the data store is just a repository for a single application to retain data purely for the purpose of making that application function.  Even several major ERP applications have fairly generic databases with soft schemas without any form of constraints or referential integrity.  This is just handled better, from a development perspective, in the code that populates it.

Now of course the RDBMS can meet this requirement – but the issue is the cost of doing this is higher than what it needs to be.  People write code with classes, RDBMS uses SQL.  The translation between these two structures, the plumbing code, can in some cases be 50% or more of an application’s code base (be it hand-written code or code generated automatically by a modeling tool).  Why write queries if you are just retrieving an entire row based on a key?  Why have a strict data model if you are the only application using it and you maintain integrity in the code?  Why should a change in requirements require you to go through the process of building a schema change script/process that has to be deployed in sync with the application version?  Why have cost based optimization when all the data access paths are 100% known at the time of code compilation?
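To make the “plumbing” point concrete, here is a minimal sketch in Python contrasting the two routes for a single key-based save and load.  The `Customer` class, the table layout and the function names are all hypothetical, SQLite stands in for any RDBMS and a plain dictionary stands in for a key/value store, so treat it as an illustration of the shape of the code rather than a measurement of either approach.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class Customer:  # hypothetical application class
    id: int
    name: str
    email: str


# Relational route: a schema, SQL statements and hand-written mapping code.
def make_sql_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)"
    )
    return conn


def sql_save(conn: sqlite3.Connection, c: Customer) -> None:
    # Every field is listed again here, separately from the class definition.
    conn.execute(
        "INSERT OR REPLACE INTO customer (id, name, email) VALUES (?, ?, ?)",
        (c.id, c.name, c.email),
    )


def sql_load(conn: sqlite3.Connection, customer_id: int) -> Optional[Customer]:
    row = conn.execute(
        "SELECT id, name, email FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return Customer(*row) if row else None


# Key/value route: serialize the whole object and store it under its key.
def kv_save(store: Dict[str, str], c: Customer) -> None:
    store[f"customer:{c.id}"] = json.dumps(asdict(c))


def kv_load(store: Dict[str, str], customer_id: int) -> Optional[Customer]:
    raw = store.get(f"customer:{customer_id}")
    return Customer(**json.loads(raw)) if raw is not None else None
```

Adding a field to `Customer` touches the class, the table definition and both SQL statements on the relational route, but only the class on the key/value route, which is the development-cost argument in miniature.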

Now I am still largely undecided on all of this.  I get why NoSQL can be appealing.  I get how it fits with today’s requirements; what I am unsure about is whether it is all very short sighted.  Applications being built today with NoSQL will themselves grow over time.  What may start off today as simple gets/puts within a soft schema’d datastore may over time gain certain reporting or analytics requirements unexpected when initial development began.  What might have taken a simple SQL query to meet such a requirement in an RDBMS now might require data being extracted into something else, maybe Hadoop or MPP or maybe just a simple SQL RDBMS – where it can be processed and re-extracted back into the NoSQL store in a processed form.  It might make sense if you have huge volumes of data but for the small scale web app, this could be a lot of cost and overhead to summarize data for simple reporting needs.
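A small sketch of that late-arriving reporting requirement, again in Python with made-up order data: in the relational store the report is one `GROUP BY` query, while in a plain key/value store (a dictionary here, as an assumption) the application has to scan every value and aggregate by hand.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical sample data, used only for illustration.
ORDERS = [
    {"id": 1, "country": "NZ", "total": 120.0},
    {"id": 2, "country": "US", "total": 80.0},
    {"id": 3, "country": "NZ", "total": 30.0},
]


def report_with_sql(orders):
    """Revenue per country: the database does the grouping."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:id, :country, :total)", orders
    )
    rows = conn.execute(
        "SELECT country, SUM(total) FROM orders GROUP BY country"
    ).fetchall()
    return dict(rows)


def report_with_kv(store):
    """Same report over a key/value store: full scan plus manual aggregation."""
    totals = defaultdict(float)
    for raw in store.values():
        order = json.loads(raw)
        totals[order["country"]] += order["total"]
    return dict(totals)
```

With three orders the difference is trivial; the concern above is what the second function turns into once the store is large, distributed and has no secondary indexes.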

Of course this is all still evolving.  And RDBMS vendors and NoSQL are both on some form of convergence path.  We have already started hearing noises about RDBMS looking to offer more NoSQL like interfaces to the underlying data stores as well as the NoSQL looking to offer more SQL like interfaces to their repositories.  They will meet up eventually, but by then we will all be talking about something new like stream processing :)

Thanks Conor for the thought provoking post.

 



IA Ventures — Jobs shout out

August 4th, 2011

My friends over at IA Ventures are looking for both an Analyst and an Associate to join their team.  If Big Data, New York and start-ups are in your blood then I can’t think of a better VC to be involved in.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize community events, and work with portfolio companies—basically, you can take on as much responsibility as you can handle."

Roger, Brad and the team continue to impress with their focus on Big Data, their strategic investments in monetizing data and knowledge of the industry in general.



What Scales Best?

July 29th, 2011

It is a constant, yet interesting debate in the world of big data.  What scales best?  OldSQL, NoSQL, NewSQL?

I have a longer post coming on this soon.  But for now, let me make the following comments.  Generally, most data technologies can be made to scale - somehow.  Scaling up tends not to be too much of an issue, scaling out is where the difficulties begin.  Yet, most data technologies can be scaled in one form or another to meet a data challenge even if the result isn’t pretty. 

What is best?  Well that comes down to the resulting complexity, cost, performance and other trade-offs.  Trade-offs are key as there are almost always significant concessions to be made as you scale up.

As a recent example of mine, I was looking at scalability aspects of MySQL.  In particular, MySQL Cluster.  It is actually pretty easy to make it scale.  A 5 node cluster on AWS was able to process a sustained rate of 371,000 insert transactions per second.  Good scalability yes, but there were many trade-offs made around availability, recoverability and non-insert query performance to achieve it.  But for the particular requirement I was looking at, it fitted very well.

So what is this all about?  Well, if a Social Network is running MySQL in a sharded cluster to achieve the scale necessary to support its many millions of users, the fact that database technology x or database technology y can also scale with different “costs” or trade-offs doesn’t necessarily make it any better – for them.  If you, for example, have some of the smartest and most talented MySQL developers on your team and can alter the code at a moment’s notice to meet a new requirement – that alone might make your choice of MySQL “better” than using NoSQL database xyz from a proprietary vendor where there may be a loss of flexibility and control from soup to nuts.

So what is my point?  Well I guess what I am saying is physical scalability is of course an important consideration in determining what is best.  But it is only one side of the coin.  What it “costs” you in terms of complexity, actual dollars, performance, flexibility, availability, consistency etc, etc are all important too.  And these are often relative, what is complex for you may not be complex for someone else.

 



Who/What to acquire next

March 18th, 2011

Well as predicted, with Aster Data recently being picked up by Teradata most of the key new generation MPP distributed analytics vendors have been acquired (Aster Data, Vertica, Netezza & Greenplum).  This had to happen and was expected to happen.  The MPP Analytics startup “revolution” is over and these technologies will now be integrated into the mainstream.

So what’s next?  As we know, if you are a massive multi-national software company it is a lot less risky to incrementally innovate and leave the development of “game changing” technologies to startups that can be acquired after they prove both the tech and the market.  So what follows MPP?

NoSQL technologies seem the only likely candidate at the moment, although I think it is a few years too early for any major acquisitions to occur.  A key issue that would need to be worked through is what exactly is being acquired as most NoSQL platforms are open source / free (most MPP platforms were proprietary).  But nonetheless, as the market grows and starts to eat away at some noticeable level from the existing RDBMS market the major vendors will want a piece of that action and the frenzy will start again.  But this is still quite a while away yet.

 



Some NoSQL Myths

October 19th, 2010

I have been busy travelling recently but thought I would jot down a couple of NoSQL myths that are fresh in my head from my recent discussions.

  • Twitter use Cassandra internally but have not migrated their tweet store, despite their earlier plans to.  For now tweets are still stored in MySQL.
  • Despite the widely accepted view that the use of Cassandra led to Digg’s issues, a couple of Digg engineers have apparently discounted this.
  • Despite the widely accepted view that NoSQL databases all use eventual consistency, this is not so.  HBase, for example, offers full consistency.
  • Despite the widely accepted view that NoSQL is only about unlimited distributed scalability, this is also not so.  Some of the most popular NoSQL platforms have fairly rudimentary (traditional RDBMS like) scalability options, such as CouchDB and MongoDB which use sharding + replication to achieve scale.
  • Despite NoSQL being commonly reported as “easy to install” or “easy to use”, the benefits of a document object model are much more significant.  Why did we spend so much time during the 90s trying to build ORDBMS?  Because the object-relational impedance mismatch is major and this translates into significant development overhead.  It is not uncommon to see 30-60% of all code in some applications being purely “plumbing” to deal with mapping data to and from the RDBMS.  This is not something necessarily well understood by DBAs or even some long time database designers, something I will write a follow up post on.
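For readers unfamiliar with the “sharding + replication” pattern mentioned in the list above, here is a toy sketch in Python.  It is not how CouchDB or MongoDB implement it; the `ShardedStore` class, the hash-based routing and the synchronous copy to every replica are all simplifying assumptions chosen to show the general idea: a key is routed to exactly one shard, and each shard keeps several copies of its data.

```python
import hashlib


class ShardedStore:
    """Toy key/value store: hash-routed shards, each with N replicas."""

    def __init__(self, num_shards, replicas_per_shard=2):
        # Each shard is a list of replica dictionaries; index 0 acts as primary.
        self.shards = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]

    def _shard_for(self, key):
        # A stable hash so the same key always routes to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        # Write to every replica of the owning shard (synchronous, for simplicity).
        for replica in self.shards[self._shard_for(key)]:
            replica[key] = value

    def get(self, key, replica=0):
        # Read from the chosen replica of the owning shard.
        return self.shards[self._shard_for(key)][replica].get(key)
```

Adding shards spreads the keys (scale), adding replicas adds copies (availability and read capacity), and anything that needs to look at more than one key at a time now has to visit more than one shard.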

 



The problem with a full box of big data tools

October 7th, 2010

“NoSQL”, for lack of a better name, is a generic term that describes any data management system that does not use SQL as a query interface.  Generally this means any data management system that is non-relational, but the term has also been stretched far enough to test the boundaries of what constitutes a data management system at all (such as Hadoop).

Early on (a couple of years back in NoSQL time) when the term was coined I think the positioning was much more aggressive, but more recently this has been softened so now NoSQL is commonly quoted as meaning “Not only SQL” or “next generation databases” (whatever that means).  The common message you get now is something along the lines of NoSQL systems are more “specialized”, each being designed to solve a smaller number of problems than the generic RDBMS sets out to.  NoSQL is another tool in your toolbox.  A better option in certain cases where the RDBMS doesn’t fit well.  A different hammer for a different type of nail.  All makes sense in theory, but in reality this brings its own set of troubles.

There are now dozens of NoSQL systems available for a developer to choose from: MongoDB, Cassandra, Voldemort, HBase, CouchDB, Riak, Neo4J, HamsterDB and so on.  And there are several different orientations of NoSQL system including document, key/value and graph.  It seems the same energy we saw open-source hackers 10 years ago putting into MySQL has now been transferred into a myriad of NoSQL systems.  Again the argument: more choice, better for everyone.

The problem, and I am putting it out there as a problem so we can think of ways to fix it, is that while that is fine in theory, in practice having many choices also creates difficulties.  Real world development projects have certain skills bases they draw on, with experience and ability to “make things work” based on years of hard slog cobbling things together.  And there are very few surprises left when deploying an application on a mainstream RDBMS (of course they will, like any software, still have issues from time to time).

One of the key reasons the RDBMS has been so dominant is the fact that you could use it pretty much for any requirement.  And using it for any requirement meant that your developers had lots of experience building applications and your DBAs had lots of experience running it.  But also you knew that you could almost always make any requirement work “good enough” by buying extra hardware and/or indexing the heck out of it etc.  Regardless of whether it was technically the best fit or not, when all things were considered the RDBMS was a stable constant given short project timeframes and limited development budgets.  It was exactly its generic nature, its ability to do most things good enough, that has led the RDBMS to become the default option for any new development project (with the various flavors of MySQL, Oracle, DB2, SQL Server being less relevant).

As humans, we all have limited brain capacities and most of us can only be experts in a small number of things.  And our expertise typically comes from our history, making mistakes and learning what works and what doesn’t through the hard yards of experience.  So given a buffet choice of specialized NoSQL systems how on earth do we choose the most appropriate tool for the job, while at the same time dealing with the lack of expertise we will invariably have?  Also what will be the impact to development projects in choosing the wrong tool for the job?  The RDBMS is very very forgiving to poor design, poor implementation and the subsequent addition of unforeseen application requirements (you want to run OLAP now we have built you a busy OLTP database – sure but do it overnight).  Will a specialist NoSQL system have the same tolerance for our incompetence?

So now I return to the point that is really the keystone of the NoSQL motivation, “there are requirements which an RDBMS doesn’t work at all well for”.  I agree with this, but I have yet to see any quantification of what this actually means.  Is it 5% or 10% of current development projects?  And should the question really be “what percentage of development projects is the RDBMS unusable for”?  Technical purity, and even reducing license costs, need to be balanced against one of the largest costs: re-skilling development and production teams to understand this new data platform.

There are some clear cases, the Googles, Twitters, Facebooks etc where scale alone is clearly outside the boundaries of what is possible on today’s RDBMS platforms.  But in terms of today’s development projects, what percentage have these scalability requirements?  1%?  Less?  Sure, we are going through somewhat of a data explosion and by all accounts the volume of data we collect and manage in our databases is growing at an alarming rate.  So the demand for scale will continue, but let’s also not forget that the big RDBMS vendors are very market driven, and as the market changes their products will also continue to change with it.  It is very unlikely they will be asleep at the wheel and lose their dominant share of the ~$30b market without a fight.

Contrary to how it may appear, I am actually supportive of a number of NoSQL initiatives and I am even hands on with a few.  But I do have concerns about how we quantify the market and how we ensure that people are making the right decisions in choosing a NoSQL platform.  And also how do we bridge the gap in skill sets and experience for developers who will have years upon years of RDBMS experience but, by nature, only have exposure to NoSQL systems periodically based on certain application requirements?

 



Big Data innovation marches on

September 21st, 2010


With IBM intending to acquire Netezza, the predicted consolidation in the distributed analytics market is well underway.  Recent deals include EMC/Greenplum, Teradata/Kickfire and now IBM/Netezza.  A good breakdown of this deal is on Curt’s blog.  There is still more to go of course with one of the crown jewels, Vertica, still ripe for the picking.

What this indicates is that MPP analytics has moved from the innovative edge into the mainstream market and the more risk-averse large caps are now willing to invest substantially in growing this market.  Interestingly Microsoft made this move early with the acquisition of DATAllegro in 2008.  I doubt this has paid dividends yet but 5 years out this might be a different story as the explosive growth of machine generated data continues.

While it is probably a bad time to start building another MPP query processor, innovation in big data core technology of course continues to be strong.  Key areas of innovation relate to Flash/SSD optimization & caching, Graph databases, stream processing & CEP, Hadoop optimization, massive shared nothing (cloud) scalability & SQL/NoSQL convergence.  These technologies will come to market in a variety of different product forms, some of which will later be picked up by the large caps.  Rinse, repeat.



Was Stonebraker right?

September 15th, 2010

Back in 2008 Stonebraker & DeWitt published a paper and associated blog post titled “MapReduce: A major step backwards”.  Their key points were that MapReduce is:


  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on


This turned out to be one of the most contentious postings in the DBMS community at the time, drawing widespread criticism.  The “old men of DBMS” didn’t get that a database was not the solution for every problem and some problems just required a different type of mallet.  Even Vertica (which Stonebraker founded) seemed to distance itself from the comments a little, issuing a post affirming its commitment to Map/Reduce.

If you read through the comments of the original Stonebraker/DeWitt post and the follow on post you will see how vigorously people were defending Map/Reduce.


The key example quoted when hailing the benefits of Map/Reduce was that of the company which popularized it in the first place, Google.  Google used Map/Reduce to build its search indexes, processing the immense volumes of data in batch fashion using MR jobs run across thousands of nodes.  No matter how the arguments for MR broke down the final word could always be – “Google does it” for which there wasn’t a great comeback.
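As a reminder of what the batch model looks like, here is a minimal single-process sketch in Python of a Map/Reduce style job that builds a tiny inverted index (word to list of documents).  The function names and sample documents are hypothetical, and this is in no way Google’s implementation; a real job would run the map and reduce phases in parallel across many nodes, with the framework performing the shuffle.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, doc_id) pair for every word in one document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, doc_ids):
    """Reduce: collapse one word's postings into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def build_index(docs):
    """Run the whole job over a dict of {doc_id: text}."""
    pairs = (
        pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())
```

Note that `build_index` reprocesses every document each time it runs, which is the brute-force, batch-oriented behaviour the Stonebraker/DeWitt list criticizes.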


Now however things have changed.  It has been reported that Google has moved away from Map/Reduce for search indexing due to time constraints in processing updates to the index and instead has opted/reverted to a, wait for it, DBMS centric approach to the problem (Google Caffeine).  Let me quickly point out that this DBMS is not an RDBMS but instead is their own BigTable distributed database (over GFS).


So, some questions are begging to be asked.  


Firstly, were Stonebraker and DeWitt right?  Is it red-faced time for those who came out and aggressively defended the Map/Reduce architecture?


And secondly, what impact does this have on the future of Map/Reduce now that those responsible for its popularity seem to have migrated their key use case?  Is the proposition for Map/Reduce today still just as good now that Google doesn’t do it?  (Yes I am sure Google still uses Map/Reduce extensively and this is a bit tongue in cheek.  But the primary quoted example relates to building the search index which is what, reportedly, has been moved away from MR.)


Finally, this no doubt will provide a shot in the arm for BigTable like open source implementations such as HBase and Cassandra.




VLDB 2010

September 6th, 2010


I will be at VLDB 2010 next week.  If anyone reading this blog is attending and wants to catch up to discuss start-ups and innovation in DB, NoSQL, Big Data etc, drop me a line and I will try to meet up.

