Archive for the ‘Relational DB’ Category

Riptano for Cassandra

May 3rd, 2010

Cassandra is one of the most interesting NoSQL platforms at the moment.  And by most interesting what I really mean is the most clearly justifiable.  Some NoSQL platforms offer new data models, improved query interfaces and/or good single node performance through relaxed consistency models.  As a database guy, however, I still find it difficult to justify throwing out the RDBMS baby with the bathwater, as NoSQL platforms tend to be highly focused on one aspect of data management and very immature in all other areas.  Cassandra is somewhat different as it is more mature in a number of key areas (albeit still immature in others), and those areas can make Cassandra more justifiable for the right project when compared with a more traditional RDBMS based solution.  This is because Cassandra’s primary capabilities can’t easily be replicated on those traditional mainstream platforms.

Cassandra’s primary focus is on scalability.  More specifically, that is scalability combined with reasonable functionality, performance & availability when at scale.  While some other platforms are trying to bolt scalability/availability on to their functionality rich data engines, Cassandra already has proven real life examples running 150 node clusters.  Notable users of Cassandra include Digg, Facebook, Twitter, Reddit & Rackspace.  And the feedback from these sites is very good; Cassandra is commonly described as the hands down winner for transaction processing performance at scale.

One of the key contributors to Cassandra has been Jonathan Ellis, and until recently he has been working on Cassandra while employed by Rackspace.  I was pleased to hear that Jonathan and his business partner Matt Pfeil have taken the step of setting up their own Cassandra focused company, Riptano.

Riptano are providing the commercialized support services around the open source Cassandra that are necessary for the platform to survive and grow.  While such services may be less important for adoption from the techie rich Web 2.0 crowd, for any platform to become mainstream there needs to be an escalation path for companies uninterested or unable to tinker with the code themselves.  Riptano provides those services which can allow Cassandra use to start to grow further.

Just as importantly, this move gives representation to Cassandra and provides an entity whose best interests will be served through advocacy of the platform.  While Jonathan and others had been doing a fine job of this to date personally, another corporation investing commercial dollars into advocacy will be important to ensure Cassandra’s message isn’t drowned out by more highly funded alternatives.

Riptano has received some early funding from Rackspace and I believe already has a few customers signed for their support services.  Best of luck Jonathan & Matt.


Ingres Vectorwise smokes it!

May 1st, 2010

I work in all markets of the database industry, from web & startup through the largest and most established enterprises.  And to be completely honest, the name Ingres has not come up in conversation very much at all.  10 years ago maybe more often, but recently not all that much.  But Ingres has been quietly ticking away.  Despite being largely off the radar, they still have a sizable and loyal customer base, global offices and a focused & dedicated management team.  And importantly they have an open source business model which actually appears to be working.

I wrote last year that their "behind the scenes" status had the potential to change.  Ingres had been very clever and worked out a partnership with Peter Boncz’s Vectorwise.  And that relationship was promising big things for data analytics from a price/performance perspective.  But at the time it was all promise, and little in the way of substance had been produced.

But that has been changing.  A month or two back Ingres somewhat quietly launched their Beta program for the Ingres Vectorwise technology.  This technology, if you have not read about it before, combines an analytical column store and “vectorized processing” to give much greater throughput rates than previously possible on your existing hardware (Vectorwise is a single node solution, i.e. not MPP).

And I have started hearing feedback, and it is good.  Very good.  While Ingres Vectorwise isn’t fully baked yet, I have heard it is producing astounding performance results in early testing.  In one case I heard of a <10TB real life production comparison test in which Ingres Vectorwise smoked everything else they had tested.  And they have tested a lot of different market leading analytical platforms.

So I think this is the start of an Ingres comeback.  Certainly anyone looking at <10TB analytical platforms will be getting the recommendation from me that they at least look at Ingres Vectorwise.  I am looking forward to seeing what 2010/2011 brings for them.


What is Big Data?

January 31st, 2010

One of my favorite terms at the moment is “Big Data”.  While all terms are by nature subjective, in this post I will try and explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges.  Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume, this isn’t the case.  Many existing technologies have little problem physically handling large volumes (TB or PB) of data.  Instead the Big Data challenges result from the combination of volume and our usage demands on that data.  And those usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within “modern” timeframes.  The exact definitions are of course relative & constantly changing; however, right now this is somewhere along the path towards the end goal.  That end goal is of course the ability to handle an unlimited volume of data, processing all requests in real time.

So what are Big Data technologies?

More than at any point in the past, data related technologies are the focus of research & innovation.  But Big Data challenges won’t be solved anytime soon by a single approach.  Keeping in mind all the different platforms that Big Data is having an impact on (web, cloud, enterprise, mobile) combined with all the Big Data domain challenges (transaction processing, analytics, data mining, visualization) as well as many of the Big Data characteristic requirements (volume, timeliness, availability, consistency), it is easy to see how no single technology will provide a cover-all solution for the eclectic mix of needs.  Instead a broad set of technologies, each focused on meeting a specific set of needs, is improving our ability to manage data at scale.

A few common areas of innovation that I describe as Big Data technologies include: MPP Analytics, Cloud Data Services, Hadoop & Map/Reduce (and associated technologies such as HBase, Pig & Hive), In-Memory Databases and Distributed Transaction Processing.

So what is the point of Big Data?

Someone asked me if Big Data was just tools to “try and sell them more relevant crap they don’t want”.  While up-sell & targeted advertising are two major uses of Big Data technologies, I hope that my work and that of others in this field does result in achievements more significant than just these.

When describing the point of Big Data I like to think about how the Internet has changed my life in general.  By having unlimited & timely access to information we are now better informed in all areas of our existence than ever before.  However, we are now facing the problem that there is fast becoming too much data for us to digest in its raw form.  To move forward in our understanding we will need to rely on technology to provide timely, summarized & relevant data across all aspects of our lives.  This is what those working in Big Data are setting out to achieve.


DBMS Links of the Week

September 26th, 2009

The following is a list of interesting DBMS related links for the week:




Is the RDBMS doomed (yada yada yada) ?

September 22nd, 2009

I was speaking with Michael Stonebraker this morning.  I mentioned that lately many have been referencing comments he has made over the last couple of years.  And I also mentioned that many had interpreted them as implying the RDBMS is “doomed”.  Mike has been saying the same thing for years, but the current NoSQL movement seems to have picked up on this and is highlighting that one of the RDBMS's own pioneers is predicting its demise.

I asked Mike to clarify this.  My interpretation of his response is as follows.  I understand that he doesn’t believe the relational database itself is doomed.  Instead it is the current general purpose implementations, or “elephants” to use his words, that are out of date.  By moving away from a historical GP function into something more specific in focus, either in transaction processing or analytics, you can easily get a 50x performance improvement over a GP RDBMS.  This doesn’t necessarily mean moving away from the “relational” nature, but instead changing some core design principles for how an RDBMS is implemented.  It is this improvement factor that will see “new” specialist platforms overtake “old” general purpose platforms, gradually, over time.  However Mike also mentioned the relational data model doesn’t make sense in a number of disciplines, particularly in the sciences, and alternative modeling paradigms will offer benefits to this market (hence his focus on SciDB).  So while relational is a valid data model, other data models are also needed.

I have a similar position to Mike, but perhaps with a few differences. 

- Firstly I agree with the mantra that current GP RDBMS platforms provide only a “middle of the road” capability, and we have gone too far in using a GP RDBMS for everything.  However I do believe there is a long term future for the GP RDBMS.  A general purpose application requirement will continue to be well suited to a general purpose platform.  With a specialist only approach, a general purpose requirement may need both a specialist OLTP platform and a specialist Analytics platform to provide the same capability.

- I agree that with an extreme requirement, either analytics or transaction processing, a specialist platform is well suited.  But I don’t see the choices of just MPP or memory resident RDBMS as being a broad enough set.  Apps that use a db just as a persistence cache will benefit from a high performing, scalable database platform with much tighter integration with the object model.  I am not sure any of the current NoSQL platforms have it quite right yet, but when these guys eventually get together with the database guys and work on these things together they may get there.

- I don’t think a 50x performance speed up on its own is enough to drive change in OLTP.  I have written before how difficult it is to get into this market and how tight Oracle, Microsoft & IBM have this sewn up.  But I don’t believe it is impossible, I think you just need to bring slam dunks on multiple fronts (performance just being one of them).

Anyway I feel like I am a bit of a broken record at the moment.  I have been addressing the “is the RDBMS doomed” question a couple of times a day for some time. Time to focus on something else for a bit.


Some Initial Thoughts on Oracle Exadata V2

September 16th, 2009

There will be plenty of detailed coverage on Exadata V2 so I won’t attempt to replicate that.  However I do have a couple of initial thoughts which I would like to share.  For those who missed it, Oracle has just announced Exadata V2 (which is their pre-built “machine”).  Exadata V1 was built using HP equipment; Exadata V2 is using Sun.  The main addition to Exadata V2 seems to be an extra tier in the memory hierarchy, a flash cache.  Oracle is very quick to point out this is not flash disk but flash memory, Sun’s FlashFire technology.  (Flash disks or SSDs were always going to be a transition technology.  Flash memory doesn’t have the physical constraints of a disk with moving parts, so the whole “disk” concept for flash doesn’t make much sense other than that it fits easily with current architectures.)

The new memory layer (Processor Caches -> DRAM -> Flash Cache -> Disk), coupled with Oracle’s algorithms to effectively use the Flash Cache layer, brings a performance benefit to the solution (plus all the other improvements 12 months of hardware innovation brings: faster CPUs, more memory etc).
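
To make the idea of an extra tier concrete, here is a minimal sketch of a read path that checks DRAM, then a flash cache, then disk.  It is only an illustration of the hierarchy described above: the class names, the LRU eviction policy and the "demote from DRAM into flash" rule are all my own assumptions and say nothing about how Oracle's actual Flash Cache algorithms work.

```python
from collections import OrderedDict


class LRUTier:
    """A fixed-capacity cache tier with least-recently-used eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        return None

    def put(self, key, value):
        """Store a block; return the evicted (key, value) pair, if any."""
        self.blocks[key] = value
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)
        return None


class TieredStore:
    """Hypothetical DRAM -> flash cache -> disk read path."""

    def __init__(self, disk, dram_capacity, flash_capacity):
        self.disk = disk  # a plain dict stands in for the slowest tier
        self.dram = LRUTier(dram_capacity)
        self.flash = LRUTier(flash_capacity)

    def read(self, key):
        """Return (value, tier that served the read)."""
        value = self.dram.get(key)
        if value is not None:
            return value, "dram"
        value = self.flash.get(key)
        source = "flash"
        if value is None:
            value = self.disk[key]
            source = "disk"
        # Promote the block into DRAM; whatever DRAM evicts is demoted to flash.
        evicted = self.dram.put(key, value)
        if evicted is not None:
            self.flash.put(*evicted)
        return value, source


disk = {k: f"block-{k}" for k in range(5)}
store = TieredStore(disk, dram_capacity=1, flash_capacity=2)
sources = [store.read(k)[1] for k in (0, 1, 0, 0)]
```

With a one-block DRAM tier, the second read of block 0 is served from flash rather than disk, which is the kind of saving the extra tier is meant to provide.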

My initial thoughts are:

  • Kudos to Oracle.  They are the first vendor to really bring a bunch of this leading edge technology together in a semi-mainstream way.  Flash Cache, InfiniBand interconnects and DBMS optimizations using flash haven't really surfaced anywhere outside of startups yet.
  • So what happens to Exadata V1 customers using the HP solution?  This is only about a year old.  Some analysts are suggesting there have only been minor sales of Exadata V1 (I am not an analyst so don’t really know).  So why would HP continue to support a platform where no new sales will be created, when potentially only a limited number of customers have it today?  Possibly Oracle will offer attractive terms to move existing HP Exadata V1 customers to Sun Exadata V2.
  • It is a preconfigured solution that you buy in certain size configurations.  Small, half rack, full rack, multiple racks.  I think Larry said that 3 racks will give you a PetaByte of storage capacity.  This is fine, except they are targeting it for use with OLTP and data warehousing workloads.  It seems odd that to get very high computational resources for transaction processing, you would also get massive volumes of potentially unnecessary storage capacity.  It will be interesting to see if they allow the balance between processing & storage to be modified as part of configuration.


I have had some questions along the lines of “isn’t this back to the one size fits all approach?”  Well yes it is, but Oracle never really moved away from this in terms of the core DBMS.  It is my understanding that Oracle Exadata was still the general purpose Oracle DBMS & RAC but on a hardware platform optimized for accessing large data sets (making it a data warehousing solution).  Using FlashFire, the hardware can now do high levels of random I/O (I think 1m random I/Os was quoted) which makes the hardware platform general purpose as well.

One interesting question will be if, under Oracle, other vendors can buy the exact same hardware configuration from Sun and optimize their DBMS for Flash also?   If so, it may be difficult for them to do this in a way that is price competitive.  And will competitive DBMS vendors really want to help fill Oracle’s pockets further?

If we expect to see more of this hardware alignment between DBMS vendors, where does that leave Microsoft?  Maybe HP is already peeling the Exadata V1 logos off their racks and sticking Microsoft Madison logos in their place?

UPDATE:

Oracle has put out a FAQ which partly answers some of the questions.


OLTP back into focus

September 14th, 2009

I haven’t blogged in over a month now.  This is for a number of reasons.  Firstly I have been flat out with various activities.  This included a trip to VLDB in Lyon mid month.  Secondly, a lot of the companies I have spoken with this month aren’t ready to speak publicly, hence no blog posts resulting from these sorts of discussions.

However there has been a whiff of a change in the air in terms of focus that is interesting and worth highlighting.  After years of lots of innovation around data analytics, OLTP is starting to make a comeback in terms of reclaiming some of the limelight.  Much more on this between now and the end of the year, but a couple of things to watch:


VectorWise

August 1st, 2009


I was fortunate enough to speak with Marcin Zukowski earlier about VectorWise.  If you missed it, VectorWise came out of stealth mode a day or two ago.  They have announced a joint partnership with Ingres and essentially are claiming impressive analytic RDBMS performance gains on conventional hardware.

To start with, a key message that I think needs to be communicated here is that this is not a product announcement.  Ingres and VectorWise have announced a partnership in which they of course plan to build products together, but today those products are still in the works.

VectorWise is a spin out of CWI based on research that was undertaken by Marcin and others, research that centered on MonetDB.  Explaining the essence of VectorWise is difficult because it is largely internal DBMS data storage & processing logic, but I will have a go.

The modern RDBMS is based around design principles that stem from general purpose OLTP roots and historical hardware architectures (this is partially true even for some of the newest analytic platforms).  These design principles in a nutshell focus on the fact that disk is slow & CPU is fast.  Data is seeked or partially scanned off disk and cached.  Row-by-row (tuple-by-tuple) operators process that data, passing the outcome of each operator to the next as part of a query’s execution plan until ultimately producing the result.
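
The tuple-by-tuple style described above can be sketched with a few chained generators, each pulling one row at a time from its child.  This is only a toy illustration of the general pattern, not any vendor's engine; the operator names, the `orders` table and its columns are made up for the example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]


def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hand rows to the parent one at a time."""
    for row in table:
        yield row


def select(child: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Filter operator: pull one row from the child, test it, pass it on."""
    for row in child:
        if predicate(row):
            yield row


def sum_column(child: Iterator[Row], column: str) -> float:
    """Root operator: consume rows one by one and accumulate a total."""
    total = 0.0
    for row in child:
        total += row[column]
    return total


# Hypothetical data, purely for illustration.
orders = [
    {"region": "EU", "amount": 10.0},
    {"region": "US", "amount": 25.0},
    {"region": "EU", "amount": 5.5},
]

# The "execution plan" is a chain of operators; every row travels through
# each operator individually, so per-row call overhead is paid at every step.
plan = select(scan(orders), lambda r: r["region"] == "EU")
eu_total = sum_column(plan, "amount")
```

The point to notice is that the work per row is tiny while the bookkeeping around each row (function calls, predicate dispatch) is paid once per tuple per operator.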

Traditionally I/O is the main bottleneck, so to make the database faster you add more I/O bandwidth.  Today, disk requirements may be up to 100x the actual capacity needs, so many disks are necessary to achieve the I/O bandwidth to provide performance for an analytical RDBMS implementation.  Even though an RDBMS may parallelize query operators across cores, this typically works by partitioning data between cores, yet each core is still processing on a tuple-by-tuple basis.

Conventional wisdom?  Well maybe.  You see disk is only really “slow” when it is doing random seeks.  Give a disk something sequential to do on the other hand and things are very different.  Modern disks are able to sequentially scan in the range of 150MB per second.  An array of 10 disks should therefore be able to return sequentially read data in the range of 1GB per second. 

When it comes to databases, column based storage has been found to effectively structure data for a) high levels of compression and b) sequential access.  VectorWise makes use of both of these technologies to help it achieve high levels of sequential I/O.  The problem now however is that disk may no longer be the bottleneck.  While we can get 1GB a second sequentially off disk relatively easily & cheaply, processing tuple-by-tuple at this rate is very difficult.  As it turns out, an RDBMS may only achieve a data processing rate of 50MB a second per CPU core.  This makes the CPU processing limitations a big bottleneck for analytic data sets: assuming the above figures, we would need over 20 cores to keep up with 10 disks (and of course CPU cores don’t scale linearly).
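
The back-of-envelope arithmetic here is easy to check.  The sketch below uses only the figures quoted in this post (150MB per second per disk, 10 disks, 50MB per second per core) and assumes perfectly linear scaling across cores, which as noted is optimistic; the function and constant names are my own.

```python
import math

DISK_SEQ_MB_PER_SEC = 150   # sequential scan rate per disk, as quoted above
DISKS = 10
CORE_MB_PER_SEC = 50        # tuple-by-tuple processing rate per core, as quoted above


def cores_needed(io_mb_per_sec: float, core_mb_per_sec: float) -> int:
    """Cores required to consume data as fast as the disks deliver it,
    assuming (optimistically) that cores scale linearly."""
    return math.ceil(io_mb_per_sec / core_mb_per_sec)


raw_io = DISK_SEQ_MB_PER_SEC * DISKS   # unrounded array throughput in MB/s
rounded_io = 1000                      # the "range of 1GB per second" used in the text

cores_at_rounded_rate = cores_needed(rounded_io, CORE_MB_PER_SEC)
cores_at_raw_rate = cores_needed(raw_io, CORE_MB_PER_SEC)
```

At the rounded 1GB per second the answer is 20 cores, and at the unrounded 1.5GB per second it is 30, so "over 20" is if anything a conservative way of putting it.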

If we step out of the database world for the moment into the world of high end computer games, or high end scientific processing, we find their use of current CPU technology is much more advanced than what we are used to.  They are using new CPU extensions (MMX, SSE, SSE2, Prescott etc) to parallelize & pipeline computation within a CPU’s core, meaning they are processing orders of magnitude more instructions per core than what a traditional RDBMS typically has been able to.  The exact details are too low level to discuss here (many of the research papers are available online) but it is fair to say modern CPU architectures contain advanced features that to date haven’t effectively been exploited by database vendors.

Enter VectorWise.  Their aim is to marry storage technologies which allow high levels of sequential I/O to occur with query processing logic which is designed for modern CPU architectures.  Rather than process tuple-by-tuple they are processing “vectors”, groups of tuples, leveraging modern CPU extensions and high levels of on-chip cache to allow the CPU to carry out higher data processing throughput.  The result is that instead of the 50MB a second of a tuple-by-tuple approach, VectorWise are able to achieve processing rates in the range of 500MB-1GB a second per core in some situations.  This means processing rates of 8GB a second or more could be possible with relatively low end hardware.
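
As a rough illustration of the "vectors instead of tuples" idea, the sketch below has each operator receive and return a whole batch of column values per call.  Python cannot show the CPU-level effects (SIMD extensions, cache residency) that the real gains come from, so this only shows the control flow; the batch size of 1024 and all function names are assumptions of mine, not VectorWise's design.

```python
from typing import Iterator, List, Sequence

VECTOR_SIZE = 1024  # hypothetical batch size, chosen only for the example


def scan_vectors(column: Sequence[float],
                 vector_size: int = VECTOR_SIZE) -> Iterator[Sequence[float]]:
    """Leaf operator: hand the parent a slice of the column per call."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def select_greater(vectors: Iterator[Sequence[float]],
                   threshold: float) -> Iterator[List[float]]:
    """Filter operator: one call handles a whole vector in a tight inner loop."""
    for vec in vectors:
        yield [v for v in vec if v > threshold]


def sum_vectors(vectors: Iterator[Sequence[float]]) -> float:
    """Root operator: add up each vector, then combine the partial sums."""
    total = 0.0
    for vec in vectors:
        total += sum(vec)
    return total


amounts = [float(i) for i in range(10_000)]  # a single made-up column
vector_count = len(list(scan_vectors(amounts)))
total = sum_vectors(select_greater(scan_vectors(amounts), 9_989.5))
```

Here 10,000 values cross each operator boundary in 10 calls rather than 10,000, and the per-value work sits in a simple loop over contiguous data, which is the shape of code that compilers and CPUs can pipeline well.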

“In some situations” is the key point to stress here; this obviously isn’t a blanket gain that applies to all analytic data sets, workloads and query requirements.  Just what those situations are, and how well the technology actually applies to real world data sets and queries, will be the key to its success.  I wouldn’t expect to see too many specific examples of this until a product beta appears.  But the theory is VectorWise can offer high levels of processing capability with existing mainstream hardware.  At this point VectorWise isn’t even focusing on MPP; instead they are single node focused.  If their scalability claims pan out you can imagine how this could allow a single node solution to be competitive with existing low to mid scale MPP solutions that are based on a more conventional query processing architecture.

This isn’t the only trick VectorWise have up their sleeve.  They are also leveraging research around column based storage, compression, piggy-backed (shared) scans and so on.  Much of the research that has been adopted by VectorWise is referenced from their web site.

So VectorWise have impressive technology, so why then partner with Ingres rather than a larger vendor (or going at it alone)?  Marcin offers a few reasons.  Firstly, as academics they feel strongly that open source is cool so this path was greatly preferred over a relationship with a non-open vendor.  Secondly Ingres will allow them to deliver their technology in an uncompromised fashion.  Marcin mentioned that if they had partnered with one of the big three vendors, that vendor’s existing product strategies and investments would have likely meant their ideas could have only been implemented in partial form.  Ingres on the other hand is going to allow them more of a green field.  And of course, a partnership with Ingres makes sense from a go to market perspective as Ingres already has a worldwide reputation, a global customer base, sales & marketing capabilities etc.

Marcin confirmed that Ingres have an exclusive license to their technology, and first option to acquire them for a certain period of time.  This allows Ingres to really invest in the relationship without the fear of the carpet being pulled out from under them. 

VectorWise clearly are applying innovative research to analytical RDBMS requirements.  But as interesting as the technology sounds, the proof of the pudding will be how well these design principles translate to real-world analytical processing requirements in mainstream product form.  This remains to be seen, but Ingres and their community clearly have high hopes.

VectorWise is clearly differentiated when compared with a traditional mainstream RDBMS running on mainstream hardware.  However in this current market we have lots of different approaches to the problems described.  Kickfire for example use their own SQL Chip processor to increase data processing rates, and other appliance vendors are using FPGAs etc for similar purposes.  The comparison of these different approaches and the relative effectiveness of each approach still need to be examined; however, a mainstream hardware approach has obvious benefits.


Maria Update

July 30th, 2009

I had a quick chat with Michael Widenius today.  He is on vacation so I tried to keep the call short.  We essentially spoke about two topics: Oracle & an update on Maria.

The Monty Program has 15 staff now.  Their focus is getting the MariaDB branch of MySQL ready for release; I understand they have a target of next month (August) for this release.  The Maria storage engine has been delayed for the time being, with the focus being on the branch release instead.  PBXT and XtraDB will be two of the storage engines included in this release.

The Open Database Alliance is a key initiative of the Monty Program, started in conjunction with Percona.  Essentially the ODA is a network of third party MySQL services organizations with an operating agreement between them.  The idea is to build a credible global support capability for MySQL outside of Oracle/Sun, but also to provide the same structure to other open source database initiatives, such as Postgres.  If you are in the business of providing MySQL services, or services to other open source database platforms, I suggest you check it out.

I think we shouldn’t have to wait too much longer before Oracle is in a position where it can start talking about its MySQL plans.  At this point however I would think that MySQL's path forward could be quite different to its recent past.  I originally planned to expand my thoughts on this here, but as this is and can only be speculation (as Oracle hasn’t made any official comments yet) maybe it is just better to wait and see.


The NoSQL community needs to engage the DBA’s

July 30th, 2009

The NoSQL movement has been gaining some steam lately, with discussion forums and mailing lists popping up all around the web.  Despite having a career that has been centered on the RDBMS, I have made no secret that I think we have gone too far with our “RDBMS for everything” mindset.  I think we need to add a few more tools back into our data toolbox.

Today, 99.5% of new data centric developments started will use an RDBMS by default.  Maybe 0.5% will consider using something as obtuse as a NoSQL platform.  By experience I know the majority of people discussing NoSQL platforms today are web developers.  In fact there is almost a sense of trying to keep this under the radar of DBAs.  If we don’t talk to the DBAs about this stuff then they won’t bother us with all that jabber about consistency, data integrity, robustness and recovery.

Actually, many of the NoSQL projects are touting as one of the key benefits of a NoSQL platform that you can do big data without the need for a costly DBA.

Baloney.

This shows me that the people making those comments have no idea what DBAs do and what happens with critical data applications post deployment.

A NoSQL data platform may have a different approach to operational management than a RDBMS, but a large part of the requirement will be the same.  It doesn’t matter if you have 10, 100 or 1000GB of data deployed on a NoSQL platform or an RDBMS.  Someone still needs to be thinking about backups & recovery, availability, capacity planning, performance monitoring, import/export, data integration, tuning & optimization, replication latency and so on.  Also, I have never come across any technology that works perfectly 100% of the time, so when things don’t work as expected and nodes are out of sync or partial data corruption occurs at 2am, someone will still need to fix it.  Guess who that is going to be.

DBAs are critical to any wide scale success with NoSQL platforms.  They need to be engaged and educated.  Sure they are going to be really annoying for quite a while, ripping into common NoSQL limitations such as lack of transaction support, eventual consistency, data duplication & application controlled data integrity.  However over time they will start to see the positive aspects as well, and learn that sometimes a mallet isn’t the only tool required.
