Archive for the ‘Database Management’ Category

Got open source cloud storage? Red Hat buys Gluster

October 6th, 2011

Red Hat’s $136m acquisition of open source storage vendor Gluster marks Red Hat’s biggest buy since JBoss and starts the fourth quarter with a very interesting deal. The acquisition is definitely good for Red Hat, since it bolsters its CloudForms IaaS and OpenShift PaaS technology and strategy with storage, which is often the starting point for enterprise and service provider cloud computing deployments. The acquisition also gives Red Hat another weapon in its fight against VMware, Microsoft and others, including OpenStack, of which Gluster is a member (more on that further down). The deal is also good for Gluster, both for the sizeable price Red Hat is paying for the provider of open source, software-based, scale-out storage for unstructured data and as validation of both open source and software in today’s IT and cloud computing storage.

This is exactly the kind of disruption we’ve been seeing and expecting as Linux vendors compete with new rivals in virtualization, cloud computing and different layers of the stack, including storage (VMware, Microsoft, OpenStack, Oracle, Amazon and others), as covered in our recent special report, The Changing Linux Landscape.

While the deal makes perfect sense for both Red Hat and Gluster, it also has implications for the white-hot open source cloud computing project OpenStack. There was no mention of OpenStack in Red Hat’s FAQ on the deal, but there was a reference to ongoing support for Gluster partners, many of which are fellow OpenStack members. OpenStack was also highlighted among Gluster’s key open standards participation, along with the Linux Foundation and the Red Hat-led Open Virtualization Alliance oriented around KVM. Sources at both Gluster and Red Hat, who point to OpenStack support being bundled into Red Hat’s coming Fedora 16, also reiterated to me that Red Hat is indeed planning to continue involvement with OpenStack around the Gluster technologies. I suspect Red Hat is looking to leverage Gluster more for its own purposes than for OpenStack’s, but I must also acknowledge Red Hat’s understanding of the value of openness, community and compatibility. Taking that idea a step further, Gluster may represent a way that Red Hat can integrate with and tap into the OpenStack community by blending it with its own community around Fedora, RHEL, JBoss, RHEV, CloudForms and OpenShift.

The deal also leads many to wonder what may be next for Red Hat in terms of acquisitions. We’ve long thought database and data management technologies were areas where we might see Red Hat building out. This was also the subject of renewed rumors recently, and we believe it might still be an attractive piece for Red Hat given the open source opportunities and targets around NoSQL technologies such as the Apache Hadoop distributed data management framework and the Cassandra distributed database management software. We’ve also believed systems management to be a potential place for Red Hat to further expand. Given its need to largely stay within open source, we would expect targets in this area to include GroundWork Open Source, which joins Linux and Windows systems in its monitoring and management, and Zenoss, which works with Cisco and Red Hat rival VMware in monitoring and managing systems with its open source software. Another potential target that would increase Red Hat’s depth in open source virtualization and cloud computing is Convirture, which might also be an avenue for Red Hat to reach out to midmarket and SMB customers and channel players. Red Hat was among the non-OpenStack members we listed as potential acquirers when considering the M&A possibilities (451 subscribers) out of OpenStack.

Given its recent quarterly earnings report and its topping of the $1 billion annual revenue mark, Red Hat seems again to be bucking the bad economy. We’ve written before, in 2008 and more recently, about how bad economic conditions can be good for open source software. Red Hat is atop the list of open source vendors that suffer as traditional enterprise IT customers such as banks freeze spending or, worse, fail. However, the company’s deal for Gluster is yet another sign it is thriving and expanding despite economic difficulty and uncertainty.

You don’t have to just look at Red Hat’s earnings or take our word for it. On Jim Cramer’s ‘Mad Money’ this week, we heard Red Hat CEO Jim Whitehurst praised for Red Hat’s performance and traction in the very areas where most companies and many economists are throwing the blame: financial services, government and Europe. Cramer credited Red Hat for a ‘spectacular quarter’ and allowed Whitehurst to tout the benefits of the Gluster technology and acquisition, particularly Gluster’s software-based storage technology that matches cloud computing. It was quite a contrast to the news out of Oracle OpenWorld, where hardware was a focal point.



What is the biggest challenge for Big Data?

September 9th, 2011

Often I think about the challenges that organizations face with “Big Data”. While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data. It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

What I think the near-term challenges are may be surprising. Largely, I don’t think they are purely technical. There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at a pace in line with the increase in data volume. Sure, there will continue to be lots of interesting innovation in technology, but when organizations like Google are doing 10PB sorts on 8000 machines in just over 6 hours – we know the technical scope for Big Data exists and will eventually flow down to the masses, and such scale will likely be achievable by most organizations in the next decade.

Instead I think the core problem that needs to be addressed relates to people and skills. There are lots of technical engineers who can build distributed systems, and orders of magnitude more who can operate them and fill them to the brim with captured data. But where I think we are lacking skills is with people who know what to do with the data. People who know how to make it actually useful. Sure, a BI industry exists today, but I think it is currently more focused on the engineering challenges of providing an organization with faster/easier access to its existing knowledge than on reaching out into the distance and discovering new knowledge. The people with pure data analysis and knowledge discovery skills are much harder to find, and these are the people who are going to be front and center driving the big data revolution. People to whom you can give a few PB of data and who can give you back information, discoveries, trends, factoids, patterns, beautiful visualizations and needles you didn’t even know were in the haystack.

These are people who can make a real and significant impact on an organization’s bottom line, or help solve some of the world’s problems when applied to R&D. Data Geeks are the people to be revered in the future, and hopefully we will see a steady increase in people wanting to grow up to be Data Scientists.



NSA, Accumulo & Hadoop

September 8th, 2011

I read yesterday that the NSA has submitted a proposal to Apache to incubate its Accumulo platform. This, according to the description, is a key/value store built over Hadoop that appears to provide similar functionality to HBase, except that it provides “cell level access labels” to allow fine-grained access control. This is something you would expect as a requirement for many applications built at government agencies like the NSA. But it is also very important for organizations in areas such as health care and law enforcement, where strict control is required over large volumes of privacy-sensitive data.

An interesting part of this is how it highlights the acceptance of Hadoop.  Hadoop is no longer just a new technology scratching at the edges of the traditional database market.  Hadoop is no longer just used by startups and web companies.  This is highlighted by outputs like this from organizations such as the NSA.  This is also further highlighted by the amount of research and focus on Hadoop by the data community at large (such as last week at VLDB).  No, Hadoop has become a proven and trusted platform and is now being used by traditional and conservative segments of the market.  

 



Apache and MySQL Logging with Syslog-ng

September 5th, 2011

Apache and syslog-ng

While logging to a database back-end has its benefits, the setup as it stands leaves us wanting. Some applications, such as Apache, do not log via syslog-ng by default. The good news is that this can be easily remedied, and there are a couple of different ways of doing this. First, the less good way:

Method #1: Changing the Apache configuration file.

First, we need to set up syslog-ng appropriately by creating a new source for Apache, such as the following:

source s_apache {
    unix-stream("/var/log/apache2/apache_log.socket"
        max-connections(512)
        keep-alive(yes));
};

log { source(s_apache); destination(d_pgsql); };

This reuses the original PostgreSQL destination. When syslog-ng is restarted, it will create the socket /var/log/apache2/apache_log.socket, which now needs to be referenced in httpd.conf:

CustomLog "|/usr/bin/logger -s -t 'Apache' -p info -u /var/log/apache2/apache_log.socket" Combined
 ErrorLog "|/usr/bin/logger -s -t 'Apache' -p err -u /var/log/apache2/apache_log.socket"

The “-t ‘Apache’” portion of the above lines will act as the $PROGRAM value defined in the d_pgsql destination above, and may be tailored to suit your preferences. After restarting Apache, your logs should now be sent to the PostgreSQL database along with other system logs.

The problem with this method is that the services must be started in a specific order, syslog-ng first, then Apache. If syslog-ng is restarted for any reason and Apache is not started again afterwards, no Apache logs will be sent to syslog-ng. This is why I prefer the next method:

Method #2: file();

This time, let’s leave Apache alone (no changes to your httpd.conf), and just adjust the syslog-ng configuration. First, a source needs to be made, like before, but employing a different method for gathering the logs. Let’s call it “s_apache” again:

source s_apache {
    file("/var/log/apache2/access_log");
    file("/var/log/apache2/error_log");
};

Also, a new destination, “d_pgsql_apache”:

destination d_pgsql_apache {
    sql(type(pgsql)
        host("ip.of.your.host") username("logwriter") password("logwriterpassword") port("5432")
        database("syslog")
        table("logs_${HOST}_${R_YEAR}${R_MONTH}${R_DAY}")
        columns("datetime varchar(16)", "host varchar(32)", "program varchar(20)", "pid varchar(10)", "message varchar(800)")
        # The program column is hard-coded to "Apache" instead of $PROGRAM.
        values("$R_DATE", "$HOST", "Apache", "$PID", "$MSG")
        indexes("datetime", "host", "program", "pid", "message"));
};

The old destination could be reused; however, the program name would not be entered correctly. Here, the variable $PROGRAM is replaced with “Apache”, so that we always know what program is producing the log. Finally, we need a new log line:

log { source(s_apache); destination(d_pgsql_apache); };

MySQL and syslog-ng

The same format can be applied for MySQL logs:

source s_mysql {
    file("/var/log/mysql/mysqld.sql");
};
destination d_pgsql_mysql {
    sql(type(pgsql)
        host("ip.of.your.host") username("logwriter") password("logwriterpassword") port("5432")
        database("syslog")
        table("logs_${HOST}_${R_YEAR}${R_MONTH}${R_DAY}")
        columns("datetime varchar(16)", "host varchar(32)", "program varchar(20)", "pid varchar(10)", "message varchar(800)")
        # The program column is hard-coded to "MySQL" instead of $PROGRAM.
        values("$R_DATE", "$HOST", "MySQL", "$PID", "$MSG")
        indexes("datetime", "host", "program", "pid", "message"));
};

log { source(s_mysql); destination(d_pgsql_mysql); };

Please notice that again, the $PROGRAM variable has been changed, this time to “MySQL” to make our lives easier.

Viewing Logs

So, obviously we could run a simple select statement for “Apache” or “MySQL”, and it would probably give us way more information than we really want to see. Let’s say though, that we are interested in seeing all database connections from all users, in the event that we suspect an account has been compromised. I’ll use my trusty server “Louis” as an example. A simple query like this would do the trick:

select * from logs_louis_20110824 where program = 'MySQL' and message like '%Connect%';

That’s quite a bit of activity in such a short amount of time, but a useful query all the same. Now that we can monitor Apache and MySQL logs for all of our servers from a central location using PostgreSQL, let’s take a look at syslog-ng’s other capabilities for remote logging. Take a look at Logging to a Remote Host with Syslog-ng for more information on logging over TCP and UDP.
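Because the destinations above write to one table per host per day, a query has to target the right table name first. The following is a minimal Python sketch of that step, not part of the original setup: the function names are hypothetical, the host is assumed to be lower-cased in the table name (as in the “Louis” example), and the `%s` placeholders assume a PostgreSQL driver that uses that parameter style, such as psycopg2.

```python
from datetime import date


def daily_log_table(host: str, day: date) -> str:
    """Build a table name matching logs_${HOST}_${R_YEAR}${R_MONTH}${R_DAY}.

    Assumption: the host appears lower-cased in the table name. The name is
    validated because table names cannot be passed as query parameters.
    """
    safe_host = host.lower()
    if not safe_host.replace("_", "").isalnum():
        raise ValueError(f"unexpected characters in host name: {host!r}")
    return f"logs_{safe_host}_{day:%Y%m%d}"


def connect_query(host: str, day: date):
    """Return (sql, params) for listing MySQL connection events on one day."""
    table = daily_log_table(host, day)
    sql = (
        f"SELECT datetime, host, pid, message FROM {table} "
        "WHERE program = %s AND message LIKE %s"
    )
    return sql, ("MySQL", "%Connect%")


if __name__ == "__main__":
    sql, params = connect_query("Louis", date(2011, 8, 24))
    print(sql)
    print(params)
```

The returned pair can be handed to a cursor’s `execute(sql, params)` call; keeping the program name and the LIKE pattern as parameters avoids quoting mistakes in the query text.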

Relevant posts:

Basic Apache and MySQL Performance Tuning: Part 1: Apache

Basic Apache and MySQL Performance Tuning: Part 2: MySQL

101 Tips to MySQL Tuning and Optimization

LAMP Security: 21 Tips for Apache

25 Apache Performance Tuning Tips

Integrate Apache Monitoring into Monitis.com


Reply to The Future of the NoSQL, SQL, and RDBMS Markets

August 12th, 2011

Conor O'Mahony over at IBM wrote a good post on a favorite topic of mine “The Future of the NoSQL, SQL, and RDBMS Markets”.  If this is of interest to you then I suggest you read his original post.  I replied in the comments but thought I would also repost my reply here.

-----------------------------------------------------------------------------------------------

Hi Conor, I wish it was as simple as SQL & RDBMS is good for this and NoSQL is good for that. For me at least, the waters are much muddier than that.

The benefit of SQL & RDBMS is that its general-purpose nature has meant it can be applied to a lot of problems, and because of its applicability it has become mainstream to the point that every developer on the planet can probably write basic SQL. And it is justified: there aren’t many data problems you can’t throw an RDBMS at and solve.

The problem with SQL & RDBMS, well essentially I see two.  Firstly, distributed scale is a problem in a small number of cases.  This can be solved by losing some of the generic nature of RDBMS and keeping SQL such as with MPP or attempts like Stonebraker’s NewSQL.  The other way is to lose RDBMS and SQL altogether to achieve scale with alternative key/value methods such as Cassandra, HBase etc.  But these NoSQL databases don’t seem to be the ones gaining the most traction.  From my perspective, the most “popular” and fastest growing NoSQL databases tend to be those which aren’t entirely focused on pure scale but instead focus first on the development model, such as Couch and MongoDB.  Which brings me to my second issue with SQL & RDBMS.

Without a doubt the way in which we build applications has changed dramatically over the last 20 years. We now see much greater application volumes, much smaller developer teams, shorter development timeframes and faster-changing requirements. Much of what the RDBMS has offered developers – such as strong normalization, enforced integrity, strong data definition, documented schemas – has become less relevant to applications and developers. Today I would suspect most applications use a SQL database purely as an application-specific dumb datastore. Usually there aren’t multiple applications accessing that database, there aren’t lots of direct data imports/exports into other applications, there is no third-party application reporting, there are no ad-hoc user queries, and the data store is just a repository for a single application to retain data purely for the purpose of making that application function. Even several major ERP applications have fairly generic databases with soft schemas without any form of constraints or referential integrity. This is just handled better, from a development perspective, in the code that populates it.

Now of course the RDBMS can meet this requirement – but the issue is that the cost of doing so is higher than it needs to be. People write code with classes; the RDBMS uses SQL. The translation between these two structures, the plumbing code, can in some cases be 50% or more of an application’s code base (be it hand-written code or code generated automatically by a modeling tool). Why write queries if you are just retrieving an entire row based on a key? Why have a strict data model if you are the only application using it and you maintain integrity in the code? Why should a change in requirements require you to go through the process of building a schema change script/process that has to be deployed in sync with the application version? Why have cost-based optimization when all the data access paths are 100% known at the time of code compilation?

Now I am still largely undecided on all of this. I get why NoSQL can be appealing. I get how it fits with today’s requirements; what I am unsure about is whether it is all very short-sighted. Applications being built today with NoSQL will themselves grow over time. What may start off today as simple gets/puts within a soft-schema datastore may over time gain reporting or analytics requirements that were unexpected when initial development began. What might have taken a simple SQL query to meet such a requirement in an RDBMS might now require data to be extracted into something else – maybe Hadoop or MPP, or maybe just a simple SQL RDBMS – where it can be processed and re-extracted back into the NoSQL store in a processed form. It might make sense if you have huge volumes of data, but for the small-scale web app this could be a lot of cost and overhead to summarize data for simple reporting needs.

Of course this is all still evolving. RDBMS and NoSQL vendors are both on some form of convergence path. We have already started hearing noises about RDBMS vendors looking to offer more NoSQL-like interfaces to the underlying data stores, as well as NoSQL vendors looking to offer more SQL-like interfaces to their repositories. They will meet up eventually, but by then we will all be talking about something new like stream processing :)

Thanks Conor for the thought-provoking post.

 



IA Ventures — Jobs shout out

August 4th, 2011

My friends over at IA Ventures are looking to add both an Analyst and an Associate to their team. If Big Data, New York and start-ups are in your blood then I can’t think of a better VC to be involved with.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize community events, and work with portfolio companies—basically, you can take on as much responsibility as you can handle."

Roger, Brad and the team continue to impress with their focus on Big Data, their strategic investments in monetizing data and knowledge of the industry in general.



Realtime Data Pipelines

August 1st, 2011

In life there are really two major types of data analytics.  Firstly, we don’t know what we want to know – so we need analytics to tell us what is interesting.  This is broadly called discovery.  Secondly, we already know what we want to know – we just need analytics to tell us this information, often repeatedly and as quickly as possible.  This is called anything from reporting or dashboarding through more general data transformation and so on.

Typically we use the same techniques to achieve both. We shove lots of data into a repository of some form (SQL, MPP SQL, NoSQL, HDFS etc.) and then run queries/jobs/processes across that data to retrieve the information we care about.

Now this makes sense for data discovery. If we don’t know what we want to know, having lots of data in a big pile that we can slice and dice in interesting ways is good. But when we already know what we want to know, continued batch-based processing across mounds of data to produce “updated” results from data that is often changing constantly can be highly inefficient.

Enter Realtime Data Pipelines.  Data is fed in one end, results are computed in real time as data flows down the pipeline and come out the other end whenever relevant changes we care about occur.  Data Pipelines / workflow / streams are becoming much more relevant for processing massive amounts of data with real time results.  Moving relevant forms of analytics out of large repositories into the actual data flow from producer to consumer, I believe, will be a fundamental step forward in big data management.
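To make the contrast with batch processing concrete, here is a minimal Python sketch of the pipeline idea. Everything in it is an assumption chosen for illustration: the event shape `(key, amount)`, the threshold rule, and the function name `running_totals` are hypothetical and do not describe any particular product.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def running_totals(
    events: Iterable[Tuple[str, int]], threshold: int
) -> Iterator[Tuple[str, int]]:
    """Consume (key, amount) events one at a time.

    State is updated incrementally as each event flows through, and a result
    is emitted only when a change we care about occurs: here, the first time
    a key's running total reaches the threshold. No stored data is re-scanned.
    """
    totals: Dict[str, int] = defaultdict(int)
    alerted: Set[str] = set()
    for key, amount in events:
        totals[key] += amount
        if totals[key] >= threshold and key not in alerted:
            alerted.add(key)
            yield key, totals[key]


if __name__ == "__main__":
    stream = [("a", 3), ("b", 5), ("a", 4), ("b", 6), ("a", 5)]
    for key, total in running_totals(stream, threshold=7):
        print(key, total)
```

A batch equivalent would re-aggregate the whole repository on every run to find the same answer; the generator keeps only the small running state it needs and can sit directly in the flow between producer and consumer.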

There are some emerging technologies looking to address this, more details to follow.

 



What Scales Best?

July 29th, 2011

It is a constant, yet interesting debate in the world of big data.  What scales best?  OldSQL, NoSQL, NewSQL?

I have a longer post coming on this soon.  But for now, let me make the following comments.  Generally, most data technologies can be made to scale - somehow.  Scaling up tends not to be too much of an issue, scaling out is where the difficulties begin.  Yet, most data technologies can be scaled in one form or another to meet a data challenge even if the result isn’t pretty. 

What is best?  Well that comes down to the resulting complexity, cost, performance and other trade-offs.  Trade-offs are key as there are almost always significant concessions to be made as you scale up.

A recent example of mine, I was looking at scalability aspects of MySQL.  In particular, MySQL Cluster.  It is actually pretty easy to make it scale.  A 5 node cluster on AWS was able to scale to process a sustained transaction rate of 371,000 insert transactions – per second.   Good scalability yes, but there were many trade-offs made around availability, recoverability and non-insert query performance to achieve it.  But for the particular requirement I was looking at, it fitted very well.

So what is this all about? Well, if a Social Network is running MySQL in a sharded cluster to achieve the scale necessary to support its many millions of users, the fact that database technology x or database technology y can also scale with different “costs” or trade-offs doesn’t necessarily make it any better – for them. If you, for example, have some of the smartest and most talented MySQL developers on your team and can alter the code at a moment’s notice to meet a new requirement – that alone might make your choice of MySQL “better” than using NoSQL database xyz from a proprietary vendor, where there may be a loss of flexibility and control from soup to nuts.

So what is my point?  Well I guess what I am saying is physical scalability is of course an important consideration in determining what is best.  But it is only one side of the coin.  What it “costs” you in terms of complexity, actual dollars, performance, flexibility, availability, consistency etc, etc are all important too.  And these are often relative, what is complex for you may not be complex for someone else.

 



Who/What to acquire next

March 18th, 2011

Well as predicted, with Aster Data recently being picked up by Teradata most of the key new generation MPP distributed analytics vendors have been acquired (Aster Data, Vertica, Netezza & Greenplum).  This had to happen and was expected to happen.  The MPP Analytics startup “revolution” is over and these technologies will now be integrated into the mainstream.

So what’s next? As we know, if you are a massive multi-national software company it is a lot less risky to innovate incrementally and leave the development of “game changing” technologies to startups that can be acquired after they prove both the tech and the market. So what follows MPP?

NoSQL technologies seem the only likely candidate at the moment, although I think it is a few years too early for any major acquisitions to occur.  A key issue that would need to be worked through is what exactly is being acquired as most NoSQL platforms are open source / free (most MPP platforms were proprietary).  But nonetheless, as the market grows and starts to eat away at some noticeable level from the existing RDBMS market the major vendors will want a piece of that action and the frenzy will start again.  But this is still quite a while away yet.

 



Some NoSQL Myths

October 19th, 2010

I have been busy travelling recently but thought I would jot down a couple of NoSQL myths that are fresh in my head from my recent discussions.

  • Twitter use Cassandra internally but have not migrated their tweet store, despite their earlier plans to.  For now tweets are still stored in MySQL.
  • Despite the widely accepted view that the use of Cassandra led to Digg’s issues, a couple of Digg engineers have apparently discounted this.
  • Despite the widely accepted view that NoSQL databases all use eventual consistency this is not so.  HBase, for example, offers full consistency.
  • Despite the widely accepted view that NoSQL is only about unlimited distributed scalability this is also not so.  Some of the most popular NoSQL platforms have fairly rudimentary (traditional RDBMS like) scalability options.  Such as CouchDB and MongoDB which use sharding + replication to achieve scale.
  • NoSQL document stores are commonly described as merely “easy to install” or “easy to use”, but the benefits of a document object model are much more significant. Why did we spend so much time during the ’90s trying to build ORDBMS? Because the object-relational impedance mismatch is major and this translates into significant development overhead. It is not uncommon to see 30-60% of all code in some applications purely as “plumbing” to deal with mapping data to and from the RDBMS. This is not something necessarily well understood by DBAs or even some long-time database designers, something I will write a follow-up post on.
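The “plumbing” point in the last item can be illustrated with a small Python sketch. It is an illustration under stated assumptions only: the `Order` class, the two-table layout and the dictionary standing in for a document store are all hypothetical, and SQLite is used simply because it ships with Python.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Order:
    order_id: int
    customer: str
    items: List[str] = field(default_factory=list)


def make_db() -> sqlite3.Connection:
    """Create the relational schema the object has to be mapped onto."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    db.execute("CREATE TABLE order_items (order_id INTEGER, position INTEGER, item TEXT)")
    return db


def save_relational(db: sqlite3.Connection, order: Order) -> None:
    """Plumbing: split one object across two tables."""
    db.execute(
        "INSERT INTO orders (order_id, customer) VALUES (?, ?)",
        (order.order_id, order.customer),
    )
    db.executemany(
        "INSERT INTO order_items (order_id, position, item) VALUES (?, ?, ?)",
        [(order.order_id, i, item) for i, item in enumerate(order.items)],
    )


def load_relational(db: sqlite3.Connection, order_id: int) -> Order:
    """Plumbing: reassemble the object from two queries."""
    row = db.execute(
        "SELECT order_id, customer FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    items = [
        r[0]
        for r in db.execute(
            "SELECT item FROM order_items WHERE order_id = ? ORDER BY position",
            (order_id,),
        )
    ]
    return Order(row[0], row[1], items)


def save_document(store: Dict[int, str], order: Order) -> None:
    """Document style: the whole object is stored under its key."""
    store[order.order_id] = json.dumps(asdict(order))


def load_document(store: Dict[int, str], order_id: int) -> Order:
    return Order(**json.loads(store[order_id]))
```

Both pairs of functions round-trip the same object; the relational pair needs a schema, two statements per direction and explicit ordering, which is the kind of mapping code the percentages above refer to. The sketch says nothing about the querying and integrity benefits the relational layout buys in return.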

 

