Selena and I gave a talk on the various issues of running databases “in the cloud” at the recent linux.conf.au in Ballarat. Video is up, embedded below:
It’s almost the end of the year – that means holiday cards, shopping, cooking, parties, and the inevitable year-end top lists (including gems like this one).
In the spirit of end of year list making, we fed our 60+ blogs this year through Google Analytics to find out what our own top ten blogs were (outside of product announcements). So if you missed an episode of the View (TokuView that is) we’ve got a Tokutek Top Ten for you (spoiler alert – they are mostly technical):
10. Cage Match: OldSQL, NoSQL and NewSQL – References to mud wrestling priests and Lady Gaga heat up the debate over MySQL and its variations and alternatives.
9. Indexing, the Director’s Cut – Zardosht Kasheff took his indexing talk on the road this year to the Boston, SF, and NY MySQL meetups. This was from the SF meetup.
8. The Challenges of Big Databases with MySQL: OOW11 Presentation – A popular conference talk from Tokutek co-founder and Chief Architect Bradley Kuszmaul.
7. From Under the Desk to The Cloud – A Review of the O’Reilly Strata Making Data Work Conference.
6. MySQL Partitioning: A Flow Chart – There are almost always better (higher performing, more robust, lower maintenance) alternatives to partitioning.
5. A Case for Write Optimizations in MySQL – Suggested API improvements to increase the performance of writes, and more specifically, updates.
4. Compression Benchmarking: Size Vs. Speed (I Want Both) – TokuDB achieves the highest level of compression while out-performing InnoDB.
3. Write Optimization: Myths, Comparison, Clarifications – Explains how write optimization is the best read optimization.
Tied for First. The top two weren’t even written by the Tokutek team…
Alter Table Engine TokuDB – A blog with test results by Stephane Varoqui, Principal Consultant at SkySQL.
Are You Forcing MySQL to Do Twice as Many JOINs as Necessary? – A guest blog from Baron Schwartz, Chief Performance Architect, Percona.
Of course, Tokutek blogs only make up a small fraction of all the great blogs and news out there in 2011 on MySQL. If folks have other good MySQL year-end “top ten” lists to share, let us know.
In the meantime, have a great end of year and happy holidays!
Here are some of the fake O'Reilly book covers I mentioned in a prior post. These have been optimized for use as black & white Kindle screensaver wallpaper images. If you haven't done so already, you can install a Kindle screensaver hack with a couple of downloads.
Update: I've embedded a slideshow from PicasaWeb, but it requires Flash. If you don't see it you can click on the links below to go directly to PicasaWeb.
Issue addressed: Managing metadata at exabyte scale
The Company: Founded in 2001, Limelight Networks, Inc (NASDAQ: LLNW) is an Internet platform and services company that integrates the most business-critical parts of the online content value chain. Limelight’s cloud-based services enable customers to profit from the shift of content and advertising to the online world, from the explosive growth of mobile and connected devices, and from the migration of IT applications and services to the cloud. More than 1,800 customers worldwide use Limelight’s massively scalable services to better engage audiences, optimize advertising, manage and monetize digital assets and build stronger customer relationships.
The Challenge: Limelight designed a unique high-availability Agile Storage cloud service, which gives users control over how and where their content is stored by offering massive storage capacity, extreme flexibility for setting business rules and replication policies, with localized ingest and content access around the globe. The service provides vast storage volumes for large libraries of any type of digital asset.
The system was designed for a total capacity on the order of exabytes worldwide and is presently capable of supporting over 100 billion assets. To succeed with the platform, Limelight needed a storage engine that could sustain insertion and query performance on large tables and scale as the database grew, and it needed to do so cost-effectively. “This vast amount of information brings with it a rich and large amount of metadata around policies, file names, storage pointers, asset registries, users, and groups,” according to Wylie Swanson, VP Technology, Cloud Services at Limelight. “Ensuring the metadata could be managed in an efficient and flexible way was critical to the design of the offering.”
A number of options Limelight had considered were insufficient. These included:
InnoDB – Despite familiarity with the MySQL storage engine InnoDB, Limelight found that it didn’t meet the project’s requirements. According to Swanson “the minute you run out of RAM for indexing, InnoDB performance starts to fall apart. We were seeing this occur at 50M – 100M rows. You can shard content, of course, but that feeds back into application and management complexity. Moreover, not all of our database schema is amenable to simple sharding methods.”
RAM Expansion – “While high powered servers and more RAM can somewhat extend the size of a database that InnoDB can handle, doing so is ultimately cost prohibitive” according to Swanson. “To support our system using more traditional database technology, we would have had to purchase terabytes of RAM for our servers.”
Schooner – “Schooner offered performance improvements, but was too expensive. In addition, it didn’t look like it could achieve the performance levels of our commodity servers using TokuDB in our application” according to Swanson.
The Solution: Limelight Agile Storage uses TokuDB for metadata management
Limelight needed a system that could access the database remotely with high availability, flexibility, performance and capacity. Limelight chose MariaDB for components of the platform. To satisfy the need for high availability, the Agile Storage Service uses a high availability Linux cluster to manage the metadata.
For the requirements of flexibility, performance and capacity, TokuDB was an unparalleled choice. “TokuDB provides incredible scaling, keeping a high insert rate throughout as the metadata repository continues to grow” noted Swanson. “This is crucial to keeping up with high-ingest points that are spread all around the world. TokuDB also provides the underpinning for a system that supports arbitrary queries – for example which policies are expired on which assets.”
In addition, Limelight benefited from other TokuDB features such as high data compression yielding a savings of 65% of disk capacity for the meta-directory components.
The Benefits:
Scale: The Agile Storage platform was designed to scale to exabytes of data. Cost effectively scaling compute power, storage, and software was critical to the design. “We don’t know how we could have gotten to our required scale and price points for our meta-directory components without TokuDB” according to Swanson.
Ease of Implementation: Swanson noted that “TokuDB worked seamlessly from the start with MariaDB. Installing it was quick and simple, and we were up and running in a few hours and it worked out-of-the-box with default settings, so that we could focus on maximizing the performance of our platform, not our databases.”
Compression: In addition to fast insertion rates, TokuDB provides data compression levels that are much higher than InnoDB’s. TokuDB’s advanced compression technology reduced Limelight’s disk space requirements by roughly 3x, from over 1 TB down to about 350 GB.
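The "roughly 3x" figure here and the 65% disk savings quoted earlier describe the same reduction in two different ways. A minimal sketch of the arithmetic follows, assuming "over 1 TB" is rounded to 1,000 GB (the case study gives only approximate sizes, and the function names are illustrative):

```python
# Approximate figures from the case study. Treating "over 1 TB" as
# exactly 1,000 GB is an assumption made for this illustration.
uncompressed_gb = 1000.0
compressed_gb = 350.0


def compression_ratio(before_gb, after_gb):
    """How many times smaller the data is after compression."""
    return before_gb / after_gb


def space_savings(before_gb, after_gb):
    """Fraction of the original disk space that is no longer needed."""
    return 1.0 - after_gb / before_gb


ratio = compression_ratio(uncompressed_gb, compressed_gb)
savings = space_savings(uncompressed_gb, compressed_gb)

print(f"compression ratio: {ratio:.2f}x")
print(f"disk space saved: {savings:.0%}")
```

Under that rounding the ratio comes out just under 3x and the savings at 65%, so the two numbers in the case study are consistent with each other.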
Review of Thursday’s Cloud Events in Boston
Everyone is well aware by now of the EC2 outage that Amazon had back in April, and it would have surprised no one if such a high-profile failure had put a damper on cloud adoption. But judging by what we heard yesterday at Boston’s two cloud events (MassTLC’s Cloud Computing Summit and Vilna’s Moving Your Data to the Cloud Panel), cloud solutions can work just fine. For example, there was the customer story told by Douglas Kim, Managing Director, Global Head, PaaS & Cloud Computing at Pegasystems. Pegasystems is a Boston tech company that started offering cloud versions of its BPM services to conservative Fortune 500 customers in regulation-laden fields such as healthcare and finance. After migrating a major healthcare customer to the cloud, Kim asked the COO how they had internally overcome concerns about complying with HIPAA requirements as they considered the cloud. The COO admitted they were actually already facing $120M in HIPAA violations in the past year – from using their in-house solution! In other words, before throwing too many stones at Amazon (or other cloud providers), ask if you can really do better.
MassTLC: The monkey is not in the cloud
So, with two events focused on the cloud yesterday, we know bloggers, analysts, VCs and press are hot on the topic – how about actual adoption? Bruce Guptill, Senior Vice President and Head of Research at Saugatuck Technology, noted that cloud adoption levels were about one in three for new IT applications last year, but that we are heading to a one-in-two tipping point in 2014. He further noted that buyers are demanding cloud offerings, which in turn is driving 90% of ISVs to some sort of cloud-based presence.
Michael Skok, a General Partner at North Bridge Venture Partners, noted that the biggest drivers for cloud adoption are agility, scalability and cost, based on a study his firm completed in conjunction with several research houses. That will continue to evolve. In five years, Skok thinks the drivers will hinge on innovation, mobility, APIs and competitive pressures. In terms of inhibitors, while security is always a top concern, Bob Shinn, Founder and Senior Managing Partner, Cloud Silver Lining noted that “even the CIA is in the cloud.” Portability was also observed to be a key item to address. Chris Brookins, VP of Engineering, Acquia noted that by resolving the portability issue (i.e., vendor lock-in) they have been able to grow to 60k users.
Business models were also up for debate. Dan Pelton, CIO, Enterasys Networks claimed that companies selling cloud applications should really follow the Google model to make money (subscription price based on number of users) because it is scalable and easy to understand. Kim argued for going even further with enterprise verticals to find ways to connect pricing to actual business value. This includes, as an example from the insurance industry, charging pennies per claim, which can be directly tied to the bottom line.
Vilna: Does anyone have a prayer to beat Google?
Alright, so how do we spur adoption, especially in the wake of public outages? Shinn noted that even though FUD still exists, the reality is that “most people’s data centers are not close to the security and quality of Google.” Still have doubts? Kim suggests jumping in with both feet, but not blindly. For how to do this, he pointed to Netflix. They avoided catastrophic failures with Amazon because they implemented across zones and because they were constantly testing for failure, with what they call their “Chaos Monkey.” This is a program that randomly terminates computer processes and services in Netflix’s IT infrastructure to continuously test resiliency. In the end, when it comes to public clouds, perhaps it is best to “trust but verify.”
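To make the idea concrete, here is a toy, in-memory sketch of that kind of failure testing. It is not Netflix's tool: the fleet layout, the zone names and every function below are hypothetical, and "terminating" an instance just removes it from a dictionary.

```python
import random


def make_fleet(zones, instances_per_zone):
    """Build a hypothetical fleet: instance name -> zone it runs in."""
    return {
        f"{zone}-{i}": zone
        for zone in zones
        for i in range(instances_per_zone)
    }


def terminate_random(fleet, rng):
    """Pick one instance at random and return (victim, surviving fleet)."""
    victim = rng.choice(sorted(fleet))
    survivors = {name: zone for name, zone in fleet.items() if name != victim}
    return victim, survivors


def service_is_up(fleet, min_instances=1):
    """The toy service counts as up while enough instances remain."""
    return len(fleet) >= min_instances


def run_chaos(fleet, rounds, rng):
    """Terminate one random instance per round, recording service health."""
    log = []
    for _ in range(rounds):
        if not fleet:
            break
        victim, fleet = terminate_random(fleet, rng)
        log.append((victim, service_is_up(fleet)))
    return log, fleet


rng = random.Random(42)  # seeded so repeated runs pick the same victims
fleet = make_fleet(["zone-a", "zone-b"], instances_per_zone=2)
log, survivors = run_chaos(fleet, rounds=3, rng=rng)

for victim, still_up in log:
    print(f"terminated {victim}: service up = {still_up}")
print(f"{len(survivors)} instance(s) left")
```

With four instances spread over two zones, the service survives three random terminations. Running the same loop against a single-instance fleet shows the service going down on the first round, which is the kind of weakness this style of testing is meant to surface early.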
Review of the O’Reilly Strata Making Data Work Conference
Monica Rogati of LinkedIn told a story of the early days at the firm, when the reporting system consisted of a single server under someone’s desk. One day, someone needed an Ethernet cable and unplugged the machine from the data outlet in the wall. LinkedIn’s data reporting, its life blood, instantly came to a screeching halt.
The Push to the Cloud
LinkedIn, like many other social network sites, would eventually face enormous growth and have to develop new processes and procedures that would allow it to be an effective cloud repository for people’s work contacts and resumes. The quantity of data that social sites have to contend with is staggering. Monica summed it up well in the title of her talk: “1M. 10M. 100M. Data!” And LinkedIn is far from alone – others spoke of similar increases. Peter Sirota from Amazon Web Services noted in his talk how Yelp generates close to 400 GB of compressed logs per day and that Foursquare has to track over 1M members and 15M venues.
So why is big data becoming so big as of late (and spawning so many conferences)? Richard McDougall of VMware summed up some of the driving forces:
Richard went on to state why the cloud is performing so well here. The cloud
Of course, big name vendors are rushing across the stack to fill in offerings. Peter claimed that Amazon’s EC2 lowers the cost of operating a distributed system for data processing. Chris Schalk of Google noted that customers should “focus on building your apps and let Amazon wear the pagers”, given the release of their Google Apps toolset.
Implications and Benefits
So when companies get it right, what are the implications and benefits of big data in the cloud? Success in the cloud, according to Peter, is leading to better analysis and recommendations, just to name a few key areas. And it’s not just the commercial space benefiting. The conference was also great at showcasing how big data availability is shaping areas outside of traditional consumer tech. NYC is making its data publicly available for people to explore and work on. Nonprofits are also following suit. Data without Borders spoke of an upcoming Datadive weekend for nonprofits who can’t afford data scientists. At the event, volunteer data scientists and enthusiasts will be given access to the data for a crowd-sourced approach to finding new insights. Even the biggest names in foundations are seeing the value in big data. Alastair Dant of The Guardian newspaper noted how the Bill and Melinda Gates Foundation is teaming up with The Guardian to make a public data store of information available on world development statistics.
Don’t Let Your Kitten Crash
So, how well is your business prepared for growth? Hilary Mason of bit.ly noted that much of big data either comes from “secret US government scale” or “kittens on the internet scale.” With the former, there is often much advance planning. With the latter, just like the surprise someone gets when Fluffy goes viral, people are often caught off guard when their business volume grows dramatically. That means plan early. Design ahead. Make sure that your infrastructure can take you to the next level of growth. Importantly, consider whether the agility, enabled by the cloud, makes the most sense for you and make sure you are monitoring the right growth parameters in your business. In other words, don’t let your kitten crash.