Archive for the ‘storage engine’ Category

OblakSoft Cloud Storage Newsletter, March 2012

March 9th, 2012

ClouSE version 1.0b.1.0 released

OblakSoft is pleased to announce the release of ClouSE version 1.0b.1.0. This release addresses reliability and performance issues encountered by our early adopters and adds functionality that better matches real-life usage scenarios.

  • The weblob URLs now support user-defined names.
  • Weblobs now have to be of the LONGBLOB type.
  • AUTO_INCREMENT fields now have semantics similar to InnoDB’s “interleaved” lock mode.

Thank you all for your feedback!

  • The weblob URLs now support user-defined names.

We received a lot of feedback that auto-generated names are not SEO-friendly, so instead of specifying only an extension (like jpg, gif, etc.), the user can now specify a full name (like creative-thinking.jpg).

  • Weblobs now have to be of the LONGBLOB type.

The BLOB type only supports up to 64 KB of content. Users who followed the examples in our documentation ran into what looked like data corruption: uploading more than 64 KB with streaming worked well, direct access by URL worked well, but getting the content from the SELECT query resulted in content truncation. To avoid confusion, ClouSE now enforces LONGBLOB type for weblob fields.
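To make the size limit concrete, here is a small Python sketch of the capacities of MySQL's BLOB family. The limits are the standard MySQL ones; the helper names are made up for illustration and are not part of ClouSE or MySQL.

```python
# Maximum payload sizes of MySQL's BLOB family, in bytes.
BLOB_LIMITS = [
    ("TINYBLOB", 2**8 - 1),
    ("BLOB", 2**16 - 1),        # just under 64 KB
    ("MEDIUMBLOB", 2**24 - 1),  # just under 16 MB
    ("LONGBLOB", 2**32 - 1),    # just under 4 GB
]


def smallest_blob_type(num_bytes):
    """Return the smallest BLOB type able to hold num_bytes (hypothetical helper)."""
    for name, limit in BLOB_LIMITS:
        if num_bytes <= limit:
            return name
    raise ValueError("payload exceeds LONGBLOB capacity")


def fits_in_blob(num_bytes):
    """True if a payload of num_bytes fits in a plain BLOB column."""
    return num_bytes <= 2**16 - 1


print(smallest_blob_type(100 * 1024))  # a 100 KB image needs more than BLOB
print(fits_in_blob(100 * 1024))
```

Anything over 65,535 bytes does not fit a plain BLOB, which is why a SELECT on such a column returned truncated content.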

  • AUTO_INCREMENT fields now have semantics similar to InnoDB’s “interleaved” lock mode.

Previously, AUTO_INCREMENT was implemented as a global monotonically increasing counter; the existing values in the table were not taken into account. The old implementation did not work well in cases like restoring tables from a backup taken by mysqldump: the values from the backup could conflict with the new values produced by ClouSE’s global counter. The new implementation takes existing values into account when generating new auto-incremented values, so that the new values are higher than any existing value in the table.
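The difference between the two behaviors can be illustrated with a toy Python model. This is only a sketch of the semantics described above, not ClouSE's code; the class and method names are invented for illustration.

```python
class GlobalCounter:
    """Old behavior: an ever-increasing counter that ignores table contents."""

    def __init__(self):
        self.counter = 0

    def next_id(self, existing_ids):
        self.counter += 1
        return self.counter


class TableAwareCounter:
    """New behavior: the next value is higher than any value already in the table."""

    def __init__(self):
        self.counter = 0

    def next_id(self, existing_ids):
        highest = max(existing_ids, default=0)
        self.counter = max(self.counter, highest) + 1
        return self.counter


# Rows restored from a mysqldump backup carry explicit ids.
restored = {1, 2, 3}

old_id = GlobalCounter().next_id(restored)      # collides with a restored row
new_id = TableAwareCounter().next_id(restored)  # lands above the restored rows

print(old_id in restored, new_id in restored)
```

In this model the old counter hands out an id that already exists in the restored table, while the table-aware counter skips past the restored rows.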

The full list of changes included in ClouSE version 1.0b.1.0 can be found here.

Download ClouSE now for FREE at  http://www.oblaksoft.com/downloads/.



Tokutek Selected as a Finalist for O’Reilly Strata Conference

February 9th, 2012

We are excited to announce that we’ve been named as one of ten finalists selected for the startup showcase at the O’Reilly Strata “Making Data Work” Conference at the end of this month in Santa Clara, California. The startup showcase will be held on February 29th, starting at 6:30 pm.

The conference offers a great overview of the big data space, with tracks on Data Science, Business and Industry, Visualization and Interfaces, Hadoop Applied, Hadoop Tech, Policy and Privacy, and Domain Data. With all of the “NoSQL” buzz and sessions at the show (Hadoop gets two tracks!), we are glad to be able to attend as a representative of the “NewSQL” community. We’ll be showing just how much MySQL, with the right storage engine, can scale to take on Big Data while giving up none of the power of ACID, familiar SQL interfaces, rich indexes, high insertion rates, and flexible schema.

If you will be there, please stop by to say hello! And please vote for us too (what can we say, it’s an election year all around).



1 Billion Insertions – The Wait is Over!

January 26th, 2012

iiBench measures the rate at which a database can insert new rows while maintaining several secondary indexes. We ran this for 1 billion rows with TokuDB and InnoDB starting last week, right after we launched TokuDB v5.2. While TokuDB completed it in 15 hours, InnoDB took 7 days.

The results are shown below. At the end of the test, TokuDB’s insertion rate remained at 17,028 inserts/second whereas InnoDB had dropped to 1,050 inserts/second. That is a difference of over 16x. Our complete set of benchmarks for TokuDB v5.2 can be found here.

Benchmark Details: Ubuntu 10.10; 2x Xeon X5460; 16GB RAM; 8x 146GB 10k SAS in RAID10. Each data point is the average insertion rate for the last 2 million rows. 

We developed the iiBench benchmark to measure performance for a use case that occurs commonly in production applications, such as online advertising, social media, and network management.

iiBench simulates a pattern of usage for always-on applications that:

  • Require fast query performance and hence require indexes
  • Have high data insert rates
  • Cannot wait for offline batch processing and hence require the indexes be maintained as data comes in

Note that iiBench was created as an open-source benchmark, which allows others to freely use it, extend it, and contribute their changes back. We originally unveiled the benchmark in the context of a challenge issued at the 2008 OpenSQL camp. Since then, iiBench has been downloaded and used many times, and ported by the community (in this case, Mark Callaghan) to a Python script.

Please let us know any feedback you have on iiBench. For additional information on…

  • iibench overview click here
  • TokuDB version 5.2 Overview click here
  • TokuDB version 5.2 Performance, including iibench, SysBench, Compression, and TPCC-like, click here


Fractal Tree Indexes and Mead – MySQL Meetup

January 11th, 2012

 
Thanks again to Sheeri Cabral for having me at the Boston MySQL Meetup on Monday for the talk on “Fractal Tree® Indexes – Theoretical Overview and Customer Use Cases.” The crowd was very interactive, and I appreciated that over 50 people signed up for the event and left some very positive comments and reviews.

In addition, the conversation spilled over late into the night as we made our way over to nearby Mead Hall afterwards for a few drinks, some food, and to continue the discussion.

The presentation is available here.

As a brief overview – most databases employ B-trees to achieve a good tradeoff between the ability to update data quickly and to search it quickly. It turns out that B-trees are far from the optimum in this tradeoff space. This led to the development at MIT, Rutgers and Stony Brook of Fractal Tree indexes. Fractal Tree indexes improve MySQL® scalability and query performance by allowing greater insertion rates, supporting rich indexing and offering efficient compression. They can also eliminate operational headaches such as dump/reloads, inflexible schemas and partitions.

The presentation provides an overview on how Fractal Tree indexes work, and then gets into some specific product features, benchmarks, and customer use cases that show where people have deployed Fractal Tree indexes via the TokuDB® storage engine.
 



Setting up XFS on Hardware RAID — the simple edition

December 16th, 2011

There are about a gazillion FAQs and HOWTOs out there that talk about XFS configuration, RAID IO alignment, and mount point options.  I wanted to try to put some of that information together in a condensed and simplified format that will work for the majority of use cases.  This is not meant to cover every single tuning option, but rather to cover the important bases in a simple and easy to understand way.

Let’s say you have a server with a standard hardware RAID setup running conventional HDDs.

RAID setup

For the sake of simplicity you create one single RAID logical volume that covers all your available drives.  This is the easiest setup to configure and maintain and is the best choice for operability in the majority of normal configurations.  Are there ways to squeeze more performance out of a server by dividing the logical volumes?  Perhaps, but it requires a lot of fiddling and custom tuning to accomplish.

There are plenty of other posts out there that discuss RAID minutiae.  Make sure you cover the following:

  • RAID type (usually 5 or 1+0)
  • RAID stripe size
  • BBU enabled with Write-back cache only
  • No read cache or read-ahead
  • No drive write cache enabled

Partitioning

You want to run only MySQL on this box, and you want to ensure your MySQL datadir is separated from the OS in case you ever want to upgrade the OS, but otherwise keep it simple.  My suggestion?  Plan on allocating partitions roughly as follows, based on your available drive space and keeping in mind future growth.

  • 8-16G for swap
  • 10-20G for the OS (/)
  • Possibly 10G+ for /tmp  (note you could also point mysql’s tmpdir elsewhere)
  • Everything else for MySQL (/mnt/data or similar); symlink /var/lib/mysql into here when you set up mysql

Are there alternatives?  Yes.  Can you have separate partitions for InnoDB log volumes, etc.?  Sure.  Is it worth doing much more than this most of the time?  I’d argue not until you’re sure you are I/O bound and need to squeeze every last ounce of performance from the box.  Fiddling with how to allocate drives and drive space from partition to partition is a lot of operational work which should be spent only when needed.

Aligning the Partitions

Once you have the partitions, it could look something like this:
# fdisk -ul

Disk /dev/sda: 438.5 GB, 438489317376 bytes
255 heads, 63 sectors/track, 53309 cylinders, total 856424448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00051fe9

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     7813119     3905536   82  Linux swap / Solaris
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     7813120    27344895     9765888   83  Linux
/dev/sda3        27344896   856422399   414538752   83  Linux

Several months ago my colleague Aurimas posted two excellent blog posts, one on the theory of aligning IO on hardware RAID and one with benchmarks that emphasize the point; go read those if you need the theory.  Is it common on modern Linux systems for alignment to be off?  Maybe not, but here’s how you check.

We want to use mysql on /dev/sda3, but how can we ensure that it is aligned with the RAID stripes?  It takes a small amount of math:
  • Start with your RAID stripe size.  Let’s use 64k which is a common default.  In this case 64K = 2^16 = 65536 bytes.
  • Get your sector size from fdisk.  In this case 512 bytes.
  • Calculate how many sectors fit in a RAID stripe.   65536 / 512 = 128 sectors per stripe.
  • Get start boundary of our mysql partition from fdisk: 27344896.
  • See if the Start boundary for our mysql partition falls on a stripe boundary by dividing the start sector of the partition by the sectors per stripe:  27344896 / 128 = 213632.  This is a whole number, so we are good.  If it had a remainder, then our partition would not start on a RAID stripe boundary.
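
The same check can be written as a few lines of Python. This is just a sketch of the arithmetic above, using the 64k stripe size and the start sectors from the fdisk output; the function names are my own.

```python
def sectors_per_stripe(stripe_bytes, sector_bytes=512):
    """How many disk sectors fit into one RAID stripe."""
    return stripe_bytes // sector_bytes


def is_aligned(start_sector, stripe_bytes, sector_bytes=512):
    """True if a partition's start sector falls on a RAID stripe boundary."""
    return start_sector % sectors_per_stripe(stripe_bytes, sector_bytes) == 0


stripe = 64 * 1024  # 64k stripe, the example value used above

# Start sectors taken from the fdisk listing above.
partitions = {"/dev/sda1": 2048, "/dev/sda2": 7813120, "/dev/sda3": 27344896}

for device, start in partitions.items():
    print(device, "aligned" if is_aligned(start, stripe) else "NOT aligned")
```

A start sector that leaves a remainder, for example the old DOS default of 63, would be reported as not aligned.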

Create the Filesystem

XFS requires a little massaging (or a lot).  For a standard server, it’s fairly simple.  We need to know two things:

  • RAID stripe size
  • Number of unique, utilized disks in the RAID.  This turns out to be the same as the size formulas I gave above:
    • RAID 1+0:  is a set of mirrored drives, so the number here is num drives / 2.
    • RAID 5: is striped drives plus one full drive of parity, so the number here is num drives – 1.
In our case, it is RAID 1+0 64k stripe with 8 drives.  Since those drives each have a mirror, there are really 4 sets of unique drives that are striped over the top.  Using these numbers, we set the ‘su’ and ‘sw’ options in mkfs.xfs with those two values respectively.
# mkfs.xfs -d su=64k,sw=4 /dev/sda3
meta-data=/dev/sda3              isize=256    agcount=4, agsize=25908656 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=103634624, imaxpct=25
         =                       sunit=16     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=50608, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The XFS FAQ is a good place to check out for more details.
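
As a quick cross-check of the numbers above, here is a Python sketch that derives the ‘sw’ value from the RAID level and drive count and builds the mkfs.xfs command line. It only covers the two RAID levels discussed here, and the helper names are made up for illustration.

```python
def data_disks(raid_level, num_drives):
    """Number of disks that hold unique data (the 'sw' value)."""
    if raid_level == "10":
        return num_drives // 2   # every drive has a mirror
    if raid_level == "5":
        return num_drives - 1    # one drive's worth of parity
    raise ValueError(f"unsupported RAID level: {raid_level}")


def mkfs_xfs_command(device, stripe_kb, raid_level, num_drives):
    """Build the mkfs.xfs command line with su/sw filled in."""
    sw = data_disks(raid_level, num_drives)
    return f"mkfs.xfs -d su={stripe_kb}k,sw={sw} {device}"


def reported_geometry(stripe_kb, sw, block_bytes=4096):
    """sunit and swidth as mkfs.xfs prints them, in filesystem blocks."""
    sunit = stripe_kb * 1024 // block_bytes
    return sunit, sunit * sw


print(mkfs_xfs_command("/dev/sda3", 64, "10", 8))
print(reported_geometry(64, 4))
```

The second call explains the sunit=16 and swidth=64 values in the mkfs.xfs output: they are expressed in 4096-byte filesystem blocks, so a 64k stripe unit is 16 blocks and four data disks make a stripe width of 64 blocks.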

Mount the filesystem

Again, there are many options to use here, but let’s use some simple ones:

/dev/sda3     /var/lib/mysql           xfs     nobarrier,noatime,nodiratime     0 0

Setting the IO scheduler

This is a commonly missed step related to getting the IO setup properly.  The best choices here are between ‘deadline’ and ‘noop’.   Deadline is an active scheduler, and noop simply means IO will be handled without rescheduling.  Which is best is workload dependent, but in the simple case you would be well-served by either.  Two steps here:

echo noop > /sys/block/sda/queue/scheduler   # update the scheduler in realtime

And to make it permanent, add ‘elevator=<your choice>’ in your grub.conf at the end of the kernel line:

kernel /boot/vmlinuz-2.6.18-53.el5 ro root=LABEL=/ noapic acpi=off rhgb quiet notsc elevator=noop
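
To confirm which scheduler is actually active, you can read the same sysfs file back. The file lists the available schedulers on one line with the active one in square brackets. The Python sketch below parses that format; the function names are my own, and the device name sda is the one used in this example.

```python
def parse_scheduler(line):
    """Split a sysfs scheduler line into (active, available).

    Example input: "noop anticipatory deadline [cfq]"
    """
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device="sda"):
    """Read and parse the scheduler setting for one block device."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return parse_scheduler(f.read())


print(parse_scheduler("noop anticipatory [deadline] cfq"))
```

Calling read_scheduler("sda") on the server itself after the echo command should report noop as the active scheduler.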

 

This is a complicated topic, and I’ve tried to temper the complexity with what will provide the most benefit.  What has made the most improvement for you that could be added without much complexity?



Limelight Networks Chooses TokuDB for New Cloud Storage Service

December 14th, 2011

Limelight Networks

Issue addressed: Managing metadata at exabyte scale

Delivering Agile Storage in the Cloud with Billions of Assets

The Company: Founded in 2001, Limelight Networks, Inc (NASDAQ: LLNW) is an Internet platform and services company that integrates the most business-critical parts of the online content value chain. Limelight’s cloud-based services enable customers to profit from the shift of content and advertising to the online world, from the explosive growth of mobile and connected devices, and from the migration of IT applications and services to the cloud. More than 1,800 customers worldwide use Limelight’s massively scalable services to better engage audiences, optimize advertising, manage and monetize digital assets and build stronger customer relationships.

The Challenge: Limelight designed a unique high-availability Agile Storage cloud service, which gives users control over how and where their content is stored by offering massive storage capacity, extreme flexibility for setting business rules and replication policies, with localized ingest and content access around the globe. The service provides vast storage volumes for large libraries of any type of digital asset.

The system was designed for a total capacity on the order of exabytes worldwide and is presently capable of supporting over 100 billion assets. To succeed with the platform, Limelight needed a storage engine that could handle insertion and query performance on large tables and scale as the database grew, and it needed to accomplish this in a cost effective manner. “This vast amount of information brings with it a rich and large amount of metadata around policies, file names, storage pointers, asset registries, users, and groups” according to Wylie Swanson, VP Technology, Cloud Services at Limelight. “Ensuring the metadata could be managed in an efficient and flexible way was critical to the design of the offering.”

A number of options Limelight had considered were insufficient. These included:

InnoDB – Despite familiarity with the MySQL storage engine InnoDB, Limelight found that it didn’t meet the project’s requirements. According to Swanson “the minute you run out of RAM for indexing, InnoDB performance starts to fall apart. We were seeing this occur at 50M – 100M rows. You can shard content, of course, but that feeds back into application and management complexity. Moreover, not all of our database schema is amenable to simple sharding methods.”

RAM Expansion – “While high powered servers and more RAM can somewhat extend the size of a database that InnoDB can handle, doing so is ultimately cost prohibitive” according to Swanson. “To support our system using more traditional database technology, we would have had to purchase terabytes of RAM for our servers.”

Schooner – “Schooner offered performance improvements, but was too expensive. In addition, it didn’t look like it could achieve the performance levels of our commodity servers using TokuDB in our application” according to Swanson.

The Solution: Limelight Agile Storage uses TokuDB for metadata management

Limelight needed a system that could access the database remotely with high availability, flexibility, performance and capacity. Limelight chose MariaDB for components of the platform. To satisfy the need for high availability, the Agile Storage Service uses a high availability Linux cluster to manage the metadata.

For the requirements of flexibility, performance and capacity, TokuDB was an unparalleled choice. “TokuDB provides incredible scaling, keeping a high insert rate throughout as the metadata repository continues to grow” noted Swanson. “This is crucial to keeping up with high-ingest points that are spread all around the world. TokuDB also provides the underpinning for a system that supports arbitrary queries – for example which policies are expired on which assets.”

In addition, Limelight benefited from other TokuDB features such as high data compression yielding a savings of 65% of disk capacity for the meta-directory components.

The Benefits:

Scale: The Agile Storage platform was designed to scale to exabytes of data. Cost effectively scaling compute power, storage, and software was critical to the design. “We don’t know how we could have gotten to our required scale and price points for our meta-directory components without TokuDB” according to Swanson.

Ease of Implementation: Swanson noted that “TokuDB worked seamlessly from the start with MariaDB. Installing it was quick and simple, and we were up and running in a few hours and it worked out-of-the-box with default settings, so that we could focus on maximizing the performance of our platform, not our databases.”

Compression: In addition to fast insertion rates, TokuDB provides data compression levels that are much higher than InnoDB’s. TokuDB’s advanced compression technology reduced Limelight’s disk space requirements by roughly 3x, from over 1 TB down to about 350 GB.

 

 

 



Fractal Tree Indexes – MySQL Meetup

December 5th, 2011

At next month’s Boston MySQL Meetup, I will give a talk: “Fractal Tree Indexes – Theoretical Overview and Customer Use Cases.” The meetup is 7 pm Monday, January 9th, 2012, and will be held at MIT Building E51 Room 337e (corner of Ames & Amherst St, Cambridge, MA). Thanks to host Sheeri Cabral for the invitation.

Most databases employ B-trees to achieve a good tradeoff between the ability to update data quickly and to search it quickly. It turns out that B-trees are far from the optimum in this tradeoff space. This led to the development at MIT, Rutgers and Stony Brook of Fractal Tree® indexes. Fractal Tree indexes improve MySQL® scalability and query performance by allowing greater insertion rates, supporting rich indexing and offering efficient compression. They can also eliminate operational headaches such as dump/reloads, inflexible schemas and partitions.

I’ll give an overview on how Fractal Tree indexes work, and then get into some specific product features, benchmarks, and customer use cases that show where people have deployed Fractal Tree indexes via the TokuDB® storage engine.

I hope to see you there!



A Case for Write Optimizations in MySQL

November 21st, 2011

As a storage engine developer, I am excited for MySQL 5.6. Looking at http://dev.mysql.com/tech-resources/articles/whats-new-in-mysql-5.6.html, there has been plenty of work done to improve the performance of reads in MySQL for all storage engines (provided they take advantage of the new APIs).

What would be great to add is API improvements to increase the performance of writes, and more specifically, updates. For many applications that perform updates, such as applications that do click counting or impression counting, there are significant opportunities for improving write performance.

Take the following example of click counting (or impression counting). You have a website and want to save the number of times links on your website have been clicked. Your table may look something like:


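create table num_clicks (link_id int primary key, num_clicks int);

To update the number of clicks, you do something like the following, with LINK_ID standing in for the id of the clicked link:


insert into num_clicks (link_id, num_clicks) values (LINK_ID, 1) on duplicate key update num_clicks = num_clicks + 1;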

With MySQL as it currently works, this is slower than it needs to be, as I explained here. At a high level, the reason is that MySQL forces the storage engine to check in the table if a value exists for LINK_ID. If a row is returned, MySQL performs the increment away from the storage engine, and passes a new row to the storage engine for an update. The check incurs a disk seek, which is very costly in terms of latency. Disks can do only hundreds of seeks per second. Furthermore, NoSQL solutions based on B-trees are similarly limited and can’t be significantly accelerated because updates incur disk I/O.

However, with some changes to MySQL, a storage engine can take advantage of this knowledge to improve its algorithms. All that’s needed is for the storage engine to know that the user wants to perform an insert or to perform this particular update, as opposed to getting individual handler calls of write_row, index_read, and update_row (which is the current design). Hence, what’s needed is a way for the storage engine layer to be able to apply updates on its own.
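
To illustrate the difference, here is a toy Python model of the two call patterns. It is a deliberately simplified sketch: the real handler interface is C++ and more involved, upsert_increment is a hypothetical call that does not exist in MySQL, and the deferred "pending" list is just one way an engine might choose to apply such an operation.

```python
class ToyEngine:
    """A stand-in storage engine that counts point reads (think: disk seeks)."""

    def __init__(self):
        self.rows = {}
        self.point_reads = 0
        self.pending = []

    # Calls in the style of the current handler interface.
    def index_read(self, key):
        self.point_reads += 1
        return self.rows.get(key)

    def write_row(self, key, value):
        self.rows[key] = value

    def update_row(self, key, value):
        self.rows[key] = value

    # Hypothetical single call that hands the whole operation to the engine.
    def upsert_increment(self, key, initial, delta):
        self.pending.append((key, initial, delta))

    def flush(self):
        """Apply buffered operations in one batch, whenever the engine chooses."""
        for key, initial, delta in self.pending:
            if key in self.rows:
                self.rows[key] += delta
            else:
                self.rows[key] = initial
        self.pending.clear()


def click_via_server(engine, link_id):
    """Current pattern: the server reads, increments, and writes back."""
    current = engine.index_read(link_id)
    if current is None:
        engine.write_row(link_id, 1)
    else:
        engine.update_row(link_id, current + 1)


def click_via_engine(engine, link_id):
    """Proposed pattern: the engine is told what the user wants."""
    engine.upsert_increment(link_id, initial=1, delta=1)


a, b = ToyEngine(), ToyEngine()
for _ in range(1000):
    click_via_server(a, 7)
    click_via_engine(b, 7)
b.flush()

print(a.rows[7], a.point_reads)
print(b.rows[7], b.point_reads)
```

Both engines end up with the same count, but only the first one had to do a point read per click. How an actual engine would apply the buffered operations is up to the engine.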

This change can help all storage engines. Although I am not an expert in MySQL Cluster, I imagine reducing these individual handler calls also helps MySQL Cluster avoid network hops to retrieve information. For in-memory databases, performance may increase due to reducing the number of calls made by the handler. InnoDB can potentially use its insertion buffer to store the “insert … on duplicate key update” operation, thereby giving the operation the same boost insertions into secondary keys get. For TokuDB, we estimate that these types of updates aided by this additional information could run much faster. In future posts, I will expand on how we think TokuDB can do this.



Challenges of Big Databases with MySQL – OOW11 Presentation

October 25th, 2011

Many database management tasks become difficult as you move from millions of rows and gigabytes of data to billions of rows and terabytes of data. Such tasks include ingesting data while maintaining indexes; changing schemas without downtime; and supporting connections, replication, and backup. For some scaling problems (connections and replication), MySQL® is better than most of the competition. For others, such as indexing, schema changes, and backup, MySQL has typically been harder to use. Fortunately, the tasks MySQL does well are in its core, whereas the tasks that are more difficult can be solved with storage engine plug-ins.

I recently gave a talk at Oracle Open World 11, a copy of which can be found here. This presentation discusses how MySQL’s storage engines have recently made dramatic progress in large database manageability.

A list of other MySQL talks can be found in a handy list that Ronald Bradford put together here. For those who want to learn more about TokuDB®, we have an upcoming webinar here.

