Archive for the ‘Blogging’ Category

Welcome googleCL

Июнь 19th, 2010
I am writing this blog post with Vim, my favorite editor, instead of using the online editor offered by blogger. And I am uploading this post to my Blogger account using Google CL a tool that lets you use Google services from the command line.
I am a command line geek, and as soon as I saw the announcement, I installed it in my laptop. The mere fact that you are reading this blog post shows that it works.

GoogleCL is an apparently simple application. If you install it on Mac using macports you realize how many dependencies it has and how much complexity it gives under the hood.
Using an easy to understand syntax, it allows you to access your blog, pictures, calendar, contacts, videos, and online documents at your fingertips.
For example, let's query my blog for partitioning:

$ google blogger --blog="The Data Charmer" --title=partitioning list "title,url"

Hmm. No results. The manual doesn't help much, but something happened during this query. The first thing ist that I was asked to authorize the script to access my blog, and that was done by activating a key that I got in the command line. So far, so good. The second thing was a message informing me that a default configuration file was created in my home directory. Looking at that file, I saw an option saying "regex = True". Aha! So the title supports regular expressions. Let's try:

$ google blogger --blog="The Data Charmer" --title=".*partitioning" list "title"
Holiday gift - A deep look at MySQL 5.5 partitioning enhancements
The partition helper - Improving usability with MySQL 5.1 partitioning
A quick usability hack with partitioning
MySQL 5.1 Improving ARCHIVE performance with partitioning

OK. This gives me everything with the word "partitioning" in the title. But I know that some titles are missing. Comparing with the results that I get online, I see that the titles where "partitioning" is capitalized are not reported. So the search is case sensitive. What I need to do is to tell the regular expression that I want a case insensitive search. Fortunately, I know how to speak regular expressions. Let's try again.

$ google blogger --blog="The Data Charmer" --title="(?i).*partitioning.*" list "title"
Holiday gift - A deep look at MySQL 5.5 partitioning enhancements
Partitioning with non integer values using triggers
Tutorial on Partitioning at the MySQL Users Conference 2009
The partition helper - Improving usability with MySQL 5.1 partitioning
A quick usability hack with partitioning
MySQL 5.1 Improving ARCHIVE performance with partitioning

Now I feel confident enough to do some changes to my online contents.
To create this blog post, I used some of googlecl capabilities. After I created an image, I uploaded it to my Picasa album using this command:

$google picasa post -n "Blogger Pictures" -t googlecl ~/Desktop/google_cl.png

Then I asked Picasa to give me the URL of the image:

$ google picasa list -n "Blogger Pictures" --query googlecl title,url_direct
google_cl.png,http://lh6.ggpht.com/_gVfZHGgf5LA/TBzjaKiJJvI/AAAAAAAAA74/dthDDhybsmc/google_cl.jpg

And then I inserted that URL in this blog post. Finally, I uploaded the blog post with this command:

google blogger --blog="The Data Charmer" --draft --title "Welcome googleCL" --tags="google,mysql,partitioning,command line,blogging" post ~/blog/welcome_googlecl.html


(Now writing online) And after I checked that the post was looking as I wanted it, I hit the "PUBLICH POST" button.
Welcome, GoogleCL!

PlanetMySQL Voting: Vote UP / Vote DOWN

Using Gearman for Nightly Build and Test

Октябрь 17th, 2009

At Tokutek, Rich Prohaska used
Gearman to automate our nightly build and
test process for TokuDB for MySQL.  Rich is busy working on TokuDB, so I’m
writing up an overview of the build and test architecture on his behalf.


Nightly Build and Test Architecture

Build and Test Process

Rich created a script, nightly.bash, that gets kicked off every night as a cron
job.  Nightly.bash creates a separate Gearman job for each build target.
We have a separate build target (unique binary) for each combination of
operating system (e.g. Linux, Windows, etc.) and HW architecture (e.g.
i686, x86_64) supported by TokuDB.  As we support more operating
systems over time, the number of build targets grows quickly so we needed
a build and test architecture that scales, and Gearman makes it easy.

Gearman then automatically distributes the build jobs to a set of systems
set up as “Build Workers” with each available worker running a build
for the specified build in parallel.  For each build that completes,
successfully, the resulting
binary is stored in an Amazon S3 bucket and a regression test job is
submitted to the Gearman job scheduler.  Storing binaries in S3 costs less
than checking them in to our hosted svn repository.

Test jobs are then distributed by Gearman to a set of “Test Workers” where
the appropriate binary is read from Amazon S3 and regression tests are run.
We currrently run mysql tests, sql bench tests, and TokuDB specific
tests. New tests can easily be added by modifying scripts.
Jobs submitted to Gearman specify a “function,” and workers specify
supported functions when registering with Gearman.  Functions are used
to distribute build and test jobs to servers with the correct
operating system and HW architecture.


Looking Ahead

Using Gearman, we built a highly flexible build and test
infrastructure that easily scales to support more build targets and
additional tests.  Build and test workers can be added seamlessly to increase
capacity and reduce the elapsed time required to finish the build and
test cycle.  Currently, everything runs on physical servers inside Tokutek
but we designed the architecture to run in the cloud where virtual machines
can be created on demand to support a large matrix of build targets and to
complete large regression cycles quickly.  All we need now is SSL support
in Gearman to deploy on Amazon EC3…


PlanetMySQL Voting: Vote UP / Vote DOWN

"Idle"

Сентябрь 24th, 2009

For those who wonder why my blogging is so low these days (apart from today) .. I`m actually writing more Lines of Code than Blog Entries the last couple of weeks:)

And when I`m not writing code I`m reading :) Either proofreading an upcoming book on Zabbix or reading some of the other books Packt sent me.

Next to that I`m busy preparing my T-Dose presentation

Oh and did I mention a 40 something questions questionnaire about some merger ?

Trackback URL for this post:

http://www.krisbuytaert.be/blog/trackback/940

PlanetMySQL Voting: Vote UP / Vote DOWN

Attempting to Quantify Fragmentation Effects

Сентябрь 19th, 2009

We often hear from customers and MySQL experts that fragmentation causes problems such as wasting disk space, increasing backup times, and degrading performance. Typical remedies include periodic "optimize table" or dump and re-load (for example, see Project Golden Gate). Unfortunately, these techniques impact database availability and/or require additional administrative cost and complexity. Tokutek's Fractal Tree algorithms do not not cause fragmentation, and we're looking for ways to measure the effects of fragmentation to quantify TokuDB's benefits.

I ran some tests using the iiBench benchmark as an experiment to try and quantify the impact of fragmentation, and observed some interesting results.

Initial Load – 50M Rows

I created an iiBench table with 50M rows, and recorded how long it took to complete along with the amount of disk space used by the data and indexes; log files are *not* included in the reported disk use.

$ iibench.py --setup --insert_only --engine=[innodb | tokudb] \
   --max_rows=50000000 --table_name=t1
OperationTime
TokuDB
Time
InnoDB
Disk Use
TokuDB
Disk Use
InnoDB
Insert 50M Rows3,349s28,821s 3.3GiB11GiB
select count(*) 56s643s3.4GiB11GiB
select count(*) 19.0s14.4s3.5GiB11GiB
optimize table 153s28,456s4.5GiB20GiB

select count(*) was run twice to see how fast it ran with the cache warmed up (the query cache was off).

When inserting into TokuDB, some data may remain in internal data structures, so optimize table was run to push all of the data to disk for an accurate measure of disk use. Initially, I thought running optimize table would not be necessary for InnoDB, but I ran it to ensure a consistent procedure on both engines. Surprisingly, InnoDB used much more disk space after running optimize table. According to this post, InnoDB's insert buffer is stored in the tablespace, so data in the insert buffer should have been included in the measured disk use before optimizing the table. I'm not sure what caused such a large increase in disk use.

Deleting 10M Rows

    mysql> delete from t1 where transactionid <= 10000000;
OperationTime
TokuDB
Time
InnoDB
Disk Use
TokuDB
Disk Use
InnoDB
Delete 10M rows 1,050s169,238s4.7GiB20GiB
select count(*) 42s772s4.7GiB20GiB
select count(*) 15.6s809s4.7GiB20GiB
select count(*) 14.7s802s4.7GiB20GiB
optimize table 129s28,539s4.6GiB20GiB
select count(*) 16.2s372s4.6GiB 20GiB
select count(*) 13.8s11.6s4.6GiB20GiB

Deleting 10M rows took about 17 minutes on TokuDB, and over 47 hours on InnoDB. After deleting the rows, select count(*) on InnoDB ran slow, even after attempting to warm up the cache. I ran it three times, and it was consistently slow. I ran optimize table, and select count(*) ran much faster, suggesting that the problem may have been caused by fragmentation. TokuDB does not fragment, so select count(*) ran fast on TokuDB after deleting rows, with no need to optimize.

Summary

Deleting 10M rows from a 50M row table caused the time to run select count(*) on InnoDB to increase by a factor of about 55. Running optimize table solved the problem, but it took almost 8 hours to complete. Dumping and reloading is likely to be faster than optimize or alter, but it still takes time and effort.

After deleting 10M rows on TokuDB, the time to run select count(*) decreased from about 19s to about 15s (proportional to the decrease from 50M to 40M rows in the table), without a need for optimizing or dumping and reloading.

Going Further

Posts from Mark Callaghan and Bradley C. Kuszmaul show that iiBench's linear distribution of data does not provide a good model of some real world data sets, and a Zipfian distribution is probably a better model. It would be interesting to re-run the experiment with an updated version of iiBench using a Zipfian distribution. Running similar delete experiments on large real world data sets would be interesting as well.

Additional Details

I ran the tests on a machine with a modest amount of memory and slow cores by today's standards.

  • CentOS 5.1
  • Dell PowerEdge 2950
  • 2 Socket, Quad Core Intel Xeon 1.6GHz
  • 4GB Main Memory (2GB InnoDB Buffer Pool, 2GB TokuDB Cache)
  • 5 disk SW RAID5 1TB SATA
  • ext3 Filesystem

I used default parameters for TokuDB and the InnoDB parameters are shown in the my.cnf file below. By default, TokuDB uses 1/2 of physical memory (2GB) for it's cache size, so the InnoDB buffer pool was set to 2GB for a fair comparison. It may be possible to achieve better InnoDB results through tuning, but the goal of this exercise was to search for ways to quantify the impacts of fragmentation.

[mysqld]

innodb_log_buffer_size=4M
innodb_thread_concurrency=8
innodb_log_files_in_group=3
innodb_log_file_size=1300M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_buffer_pool_size=2G
innodb_max_dirty_pages_pct=90

PlanetMySQL Voting: Vote UP / Vote DOWN

Cache Miss Rate as a function of Cache Size

Сентябрь 13th, 2009

I saw Mark Callaghan’s post, and his graph showing miss rate as a function of cache size for InnoDB running MySQL.  He plots miss rate against cache size and compares it to two simple models:


  • A linear model where the miss rate is (1-C/D)/50, and

  • A inverse-proportional model where the miss rate is D/(1000C).
He seemed happy (and maybe surprised) that that the linear model is a bad match and that inverse-proportional model is a good match.  The linear model is the one that would make sense if every page were equally likely to have a hit.

I’ll argue here that it’s not so surprising.  Suppose that miss rate has a heavy-tailed distribution, such as Zipf’s law. An example of a Zipf’s-law distribution would be if the most frequently accessed cache block accounts for X of the accesses, the second most frequent accounts for X/2 of the accesses, the third most frequent accounts for X/3 of the accesses, and so forth with the ith most frequently accessed block accountig for X/i of the accesses.

What miss rate should we expect?  Essentially in this distribution, if you look at the number of misses in the first N blocks, it’s half the number of misses found in the next N blocks.  Thus, we would expect the miss rate to be proportional to 1/C, where C is the cache size.  That matches Mark’s experiment.

This simple heavy-tailed distribution shows up all over the place, and is are often a good model for this kind of system.  For example, I would expect one were to collect the top page returned for every google query, the frequency of page hits follows this distribution. A more frequently cited example is the frequency distribution of words in a natural language Such a distribution probably controls the query cache too.

One shortcoming of iiBench is that iiBench assumes that all inserted index values are equally likely, leading to a linear miss-rate model.  Since TokuDB’s advantage on insertions is related to the cache miss rate, reducing the miss rate will tend to make InnoDB look better, reducing TokuDB’s advantage.  Thus InnoDB’s miss rate probably isn’t as bad on real data as iiBench suggests.  That probably explains why iiBench shows a 200x advantage for TokuDB, but when talking to customers the advantage is often more like 10x or 20x.  It seems clear to me that iiBench would serve us better if it had the option of generating data according to Zipf’s law.  (Since I designed iiBench, I have no qualms about criticizing it.)

Here’s a little game that helps show why heavy-tailed distributions are fun to think about.  This game comes from the St. Petersberg Paradox.  The game is a lottery in which I’m going to pay you some money, and the amount of money is a function of a random variable. The payoff schedule is that I’ll pay you one dollar half the time, I’ll pay you two dollars a quarter of the time, I’ll pay you $4 with probability 1/8, and in general I’ll pay 2i dollars with probability 2-i-1.  How much would you pay to buy one of these lottery tickets?  (For an analysis, see Wikipedia).



PlanetMySQL Voting: Vote UP / Vote DOWN

Sponsoring OpenSQL Camp 2009

Сентябрь 11th, 2009

We’re supporting the OpenSQL Camp, which will be held in Portland on November 14. 

One of my objectives for the cam[ is to make progress on a universal storage engine API, to make it possible to use the same storage engines in MySQL, PostgreSQL, Ingres, or any other database.  I’m also looking forward to hearing other people’s great ideas.

After OpenSQLcamp, I’ll be attending Supercomputing’09.  Supercomputing and database hardware technology seems to be converging.  Many of the fastest databases today look like a supercomputer with disks attached.  Will there be other kinds of convergence?  For example, what kind of convergence will we see between multicore computing and cluster computing?  Today we program multicore machines very differently from clusters.  I think in the future that difference will vanish.


PlanetMySQL Voting: Vote UP / Vote DOWN

Sorting a Terabyte in 197 seconds

Август 18th, 2009

Sorting a Terabyte in 197 seconds

I just returned from The 21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), held in Calgary, where I gave a talk about my entry to the sorting contest.  I sorted 1TB in 197s on a 400-node machine at MIT Lincoln Laboratory, a record which still stands today.  (And it will likely remain standing, since terabyte sorting is now deprecated because it’s too fast.  Now the challenge is to sort 100TB.)

For many years Jim Gray ran a sorting contest to see how fast anyone could sort a terabtye worth of 100-byte records, how much data could be sorted in one minute, and how much data could be sorted for a penny.  After Jim’s disappearance at sea in January 2007, a committee formed to continue the contest.

I entered in 2007, and this week I finally got around to talking about it at a conference.  My sorting algorithm is a variant of the SampleSort Algorithm by Blelloch, Plaxton, Leiseron et al. To learn more about TokuSampleSort, take a look at the long version of the paper or the slides from the SPAA talk.

TokuSampleSort is not directly related to the Tokutek’s TokuDB storage engine for MySQL.  One is a sort, and the other maintains indexes.  The difference between sorting and indexing is that an index is a dynamically sorted collection of data.  Sorting, however, operates on a static set of data.  Sorting and indexing both require attention to how to achieve high performance from processors and disks.  It can be a little bit surprising that sorting is easier than indexing.  Sorting is easier because all the data is available at the beginning of the calculation.  In contrast, when indexing, data may arrive a little bit at a time, and at any time a user might query a range of the data.  So indexing requires maintaining data in sorted order as the data arrives (and thus produces many sorted data sets, if you count each intermediate state), whereas a sort produces a single sorted answer.

Many storage engines (including, for example, MyISAM) use a sorting algorithm when creating an index (e.g., with CREATE INDEX or ALTER TABLE, and then use B-trees to maintain the index after it’s been created.  It’s difficult to maintain a B-tree index if records arrive at high speed, however.  TokuDB uses Fractal Tree indexes to maintain indexes orders of magnitude faster than B-trees.

How good would a fractal tree index be for sorting?  TokuSampleSort sorts at a rate of about 127,000 records per processor per second.  TokuDB can index random data at about 30,000 records per processor per second.  Fractal Tree indexes are not as fast as a sort, but they are surprisingly close.



PlanetMySQL Voting: Vote UP / Vote DOWN

Announcing TokuDB 2.1.0

Август 7th, 2009

Tokutek® announces the release the release of the TokuDB storage engine for MySQL®, version 2.1.0.  This release offers the following improvements over our previous release:

  • Faster indexing of sequential keys.
  • Faster bulk loads on tables with auto-increment fields.
  • Faster range queries in some circumstances.
  • Added support for InnoDB.
  • Upgraded from MySQL 5.1.30 to 5.1.36.
  • Fixed all known bugs.

About TokuDB

TokuDB for MySQL is a storage engine built with Tokutek’s Fractal Tree™ technology. TokuDB provides near seamless compatibility for MySQL applications. Tables can be individually defined to use TokuDB, MyISAM, InnoDB® or other MySQL-compliant storage engines. Data is loaded, inserted, and queried using standard MySQL commands, with no restrictions or special requirements. Fractal Trees can index data up to 50 times faster than traditional database technologies, enabling near real time analysis on large volumes of rapidly arriving data.

Notice: Tokutek and Fractal Tree are trademarks or registered trademarks of Tokutek, Inc.  MySQL is a registered trademark of Sun Microsystems, Inc.  InnoDB is a registered trademark of Oracle Corporation.



PlanetMySQL Voting: Vote UP / Vote DOWN

Autoincrement Semantics

Июль 30th, 2009

In this post I’m going to talk about how TokuDB’s implementation of auto increment works, and contrast it to the behavior of MyISAM and InnoDB.  We feel that the TokuDB behavior is easier to understand, more standard-compliant and offers higher performance (especially when implemented with Fractal Tree indexes).

In TokuDB, each table can have an auto-increment column.  That column can be used as any part of a key, but it doesn’t have to be part of any key.  The value produced by auto incrementing is always greater than the previous maximum value for that column. There are some cases where auto-incremented values are skipped, such as when a transaction aborts, which “uses up” auto-incremented values.

This behavior is close to that required for SQL:2003 (see SQL:2003 at wikipedia), which specifies that each table provides one unnamed sequence which behaves essentially in the way we implemented auto increment.  The SQL standard permits but doesn’t require auto-incremented values to be used up when a transaction aborts.

The semantics provided by TokuDB is different from MyISAM and InnoDB.  The value you get in an auto-increment field depends on the exact type and order of the secondary indexes (which is described in the MySQL manual).  Those relatively complex semantics seem to be viewed as a feature by MyISAM users.  Also, for MyISAM, there are no transactions, and so aborting a transaction doesn’t use up auto-increment values.

InnoDB provides yet another semantics: An auto-increment field must be the first field in some key.

For many users, the difference in semantics does not seem important.

Part of the reason we implemented a different behavior than the other storage engines is that the other engines’ behaviors don’t play well with fractal tree indexing.  If rows can be inserted into a table or index without fetching an existing row, then fractal trees are two orders of magnitude faster than B-trees. If we had to fetch an existing row (to find out what the previous maximum auto-increment field was), the performance advantage would be less in many cases.  By using the moral equivalent of a SQL:2003 sequence, we can generate auto-increment values without fetching rows, and so indexes can be maintained at high data-arrival rates.

Thus TokuDB’s auto-increment fields have several advantages over those of MyISAM and InnoDB.  TokuDB’s are simpler to use, they act more like SQL:2003 standard sequences, and (perhaps most importantly to us) they admit higher performance.