Archive for the ‘benchmarks’ Category

Benchmarks of new innodb_flush_neighbor_pages

Январь 18th, 2012

In our recent release of Percona Server 5.5.19 we introduced new value for innodb_flush_neighbor_pages=cont.
This way we are trying to deal with the problem of InnoDB flushing.

Actually there is also the second fix to what we think is bug in InnoDB, where it blocks queries while it is not needed (I will refer to it as “sync fix”). In this post I however will focus on innodb_flush_neighbor_pages.

By default InnoDB flushes so named neighbor pages, which really are not neighbors.
Say we want to flush page P. InnoDB is looking in an area of 128 pages around page P, and flushes all the pages in that area that are dirty. To illustrate, say we have an area of memory like this: ...D...D...D....P....D....D...D....D where each dot is a page that does not need flushing, each “D” is a dirty page that InnoDB will flush, and P is our page.
So, as the result of how it works, instead of performing 1 random write, InnoDB will perform 8 random writes.
This is quite far from original intention to flush as many pages as possible in singe sequential write.

So we added new innodb_flush_neighbor_pages=cont method, with it, only really sequential write will be performed
That is case ...D...D...D..DDDPD....D....D...D....D only following pages will be flushed:
...D...D...D..FFFFF....D....D...D....D (marked as “F”)

Beside “cont”, in Percona Server 5.5.19 innodb_flush_neighbor_pages also accepts values “area” (default) and “none” (recommended for SSD).

What kind of effect does it have ? Let’s run some benchmarks.

We repeated the same benchmark I ran in Disaster MySQL 5.5 flushing, but now we used two servers: Cisco UCS C250 and HP ProLiant DL380 G6

First results from HP ProLiant.

Throughput graph:

Response time graph (axe y has logarithmic scale):

As you see with “cont” we are able to get stable line. And even with default innodb_flush_neighbor_pages, Percona Server has smaller dips than MySQL.

So this is to show effect of “sync fix”, let’s compare Percona Server 5.5.18 (without fix) and 5.5.19 (with fix).

You see that the fix helps to have queries running in cases when before it was “hard” stop, and no
transaction processed.

The previous result may give you impression that “cont” guarantees stable line, but unfortunately this is not always the case.

There are results ( throughput and response time) from Cisco UCS 250 server:

You see, on this server we have longer and deeper periods when MySQL stuck in flushing, and in such cases, the
innodb_flush_neighbor_pages=cont only helps to relief the problem, not completely solving it.
Which, I believe, is still better than complete stop for significant amount of time.

The raw results, scripts and different CPU/IO metrics are available from our Benchmarks Launchpad


PlanetMySQL Voting: Vote UP / Vote DOWN

Percona testing: Quick test clusters with kewpie!

Январь 13th, 2012

The announcement of Percona XtraDB Cluster seems to have generated a fair bit of interest : )

Although the documentation contains more formal instructions for setting up a test cluster, I wanted to share a quick way to set up an ad-hoc cluster on a single machine to help people play with this (imho) rather amazing bit of software.

To do this, you will need kewpie (PXC will have kewpie in-tree soon)
cd basedir;
bzr branch lp:kewpie

edit the file kewpie.py like so:

=== modified file 'kewpie.py'
--- kewpie.py    2012-01-09 21:17:09 +0000
+++ kewpie.py    2012-01-11 18:32:17 +0000
@@ -49,9 +49,9 @@ from lib.test_mgmt.execution_management
# We base / look for a lot of things based on the location of
# the kewpie.py file
qp_rootdir = os.path.dirname(os.path.abspath(sys.argv[0]))
-#project_name = 'percona-xtradb-cluster'
+project_name = 'percona-xtradb-cluster'
#project_name = 'xtrabackup'
-project_name = None
+#project_name = None
defaults = get_defaults(qp_rootdir,project_name)
variables = test_run_options.parse_qp_options(defaults)
variables['qp_root'] = qp_rootdir

Or you may branch kewpie anywhere and simply pass appropriate –basedir and –wsrep-provider-path instructions and use –default-server-type=galera

Run the tests:
./kewpie.py  –start-and-exit
This will start up 3 nodes and join them into a cluster:

percona-xtradb-cluster/kewpie$ ./kewpie.py --start-and-exit
Setting --no-secure-file-priv=True for randgen usage...
20120113-125552 INFO Using --no-shm, will not link workdir to shm
20120113-125552 INFO Using mysql source tree:
20120113-125552 INFO basedir: /percona-xtradb-cluster
20120113-125552 INFO clientbindir: /percona-xtradb-cluster/client
20120113-125552 INFO testdir: /percona-xtradb-cluster/kewpie
20120113-125552 INFO server_version: 5.5.17
20120113-125552 INFO server_compile_os: Linux
20120113-125552 INFO server_platform: x86_64
20120113-125552 INFO server_comment: (Source distribution wsrep_22.3.r3683)
20120113-125552 INFO Using default-storage-engine: innodb
20120113-125552 INFO Using testing mode: native
20120113-125552 INFO Processing test suites...
20120113-125552 INFO Found 35 test(s) for execution
20120113-125552 INFO Creating 1 bot(s)
20120113-125604 INFO Taking clean db snapshot...
20120113-125610 INFO Taking clean db snapshot...
20120113-125616 INFO Taking clean db snapshot...
20120113-125621 INFO bot0 server:
20120113-125621 INFO NAME: s0
20120113-125621 INFO MASTER_PORT: 9317
20120113-125621 INFO GALERA_LISTEN_PORT: 9318
20120113-125621 INFO GALERA_RECV_PORT: 9319
20120113-125621 INFO SOCKET_FILE: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s0/my.sock
20120113-125621 INFO VARDIR: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s0
20120113-125621 INFO STATUS: 1
20120113-125621 INFO bot0 server:
20120113-125621 INFO NAME: s1
20120113-125621 INFO MASTER_PORT: 9320
20120113-125621 INFO GALERA_LISTEN_PORT: 9321
20120113-125621 INFO GALERA_RECV_PORT: 9322
20120113-125621 INFO SOCKET_FILE: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s1/my.sock
20120113-125621 INFO VARDIR: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s1
20120113-125621 INFO STATUS: 1
20120113-125621 INFO bot0 server:
20120113-125621 INFO NAME: s2
20120113-125621 INFO MASTER_PORT: 9323
20120113-125621 INFO GALERA_LISTEN_PORT: 9324
20120113-125621 INFO GALERA_RECV_PORT: 9325
20120113-125621 INFO SOCKET_FILE: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s2/my.sock
20120113-125621 INFO VARDIR: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s2
20120113-125621 INFO STATUS: 1
20120113-125621 INFO User specified --start-and-exit.  kewpie.py exiting and leaving servers running...

Now for some play:

$ mysql -uroot --protocol=tcp --port=9317 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.5.17-log Source distribution wsrep_22.3.r3683

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create table t1 (a int not null auto_increment, primary key(a));
Query OK, 0 rows affected (0.11 sec)

mysql> insert into t1 values (),(),(),(),();
Query OK, 5 rows affected (0.06 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from t1;
+----+
| a  |
+----+
|  1 |
|  4 |
|  7 |
| 10 |
| 13 |
+----+
5 rows in set (0.00 sec)

mysql> exit;
Bye
$ mysql -uroot --protocol=tcp --port=9320 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 5.5.17-log Source distribution wsrep_22.3.r3683

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from t1;
+----+
| a  |
+----+
|  1 |
|  4 |
|  7 |
| 10 |
| 13 |
+----+
5 rows in set (0.00 sec)

mysql> exit
Bye
$ mysql -uroot --protocol=tcp --port=9323 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 5.5.17-log Source distribution wsrep_22.3.r3683

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from t1;
+----+
| a  |
+----+
|  1 |
|  4 |
|  7 |
| 10 |
| 13 |
+----+
5 rows in set (0.00 sec)

mysql> exit
Bye

Should you wish to alter the number of nodes or their configuration, you can edit the percona_tests/cluster_basic/suite_config.py file:

server_requirements = [[],[],[]]
server_requests = {'join_cluster':[(0,1), (0,2)]}
servers = []

Each ‘[]‘ in the server_requirements list is a server.  You can add new servers by adding a new list.  If you want specific options, put them into the list representing the server:
[['--innodb-file-per-table']]

You will need to add an entry into the server_requests dictionary as well.  If you added a new node and want it in the cluster you would simply change it as follows:
server_requests = {‘join_cluster’:[(0,1), (0,2), (0,3)]}

When you are done, you may use mode=cleanup to kill off any servers:

./kewpie.py --mode=cleanup
Setting --no-secure-file-priv=True for randgen usage...
Setting --start-dirty=True for cleanup mode...
20120113-132229 INFO Using --start-dirty, not attempting to touch directories
20120113-132229 INFO Using mysql source tree:
20120113-132229 INFO basedir: /percona-xtradb-cluster
20120113-132229 INFO clientbindir: /percona-xtradb-cluster/client
20120113-132229 INFO testdir: /percona-xtradb-cluster/kewpie
20120113-132229 INFO server_version: 5.5.17
20120113-132229 INFO server_compile_os: Linux
20120113-132229 INFO server_platform: x86_64
20120113-132229 INFO server_comment: (Source distribution wsrep_22.3.r3683)
20120113-132229 INFO Using default-storage-engine: innodb
20120113-132229 INFO Using testing mode: cleanup
20120113-132229 INFO Killing pid 17040 from /percona-xtradb-cluster/kewpie/workdir/bot0/var_s0/run/my.pid
20120113-132229 INFO Killing pid 17096 from /percona-xtradb-cluster/kewpie/workdir/bot0/var_s2/run/my.pid
20120113-132229 INFO Killing pid 17070 from /percona-xtradb-cluster/kewpie/workdir/bot0/var_s1/run/my.pid
20120113-132229 INFO Stopping all running servers...

Alternately, you can just let the tests run to ensure some basic functionality.  I’ll be writing more about these tests and other testing efforts soon, but I wanted to help people get started with their own explorations.

Happy testing and I hope you dig Percona XtraDB Cluster as much as we do : )


PlanetMySQL Voting: Vote UP / Vote DOWN

SAN vs Local-disk :: innodb_flush_method performance benchmarks

Январь 6th, 2012

If you’ve been tuning your MySQL database and have wondered what effect the innodb_flush_method settings have on write performance, then this information might help. I’ve recently been doing a lot of baseline load tests to show performance differences between localdisk and the new SAN we’re deploying. Since we run InnoDB for everything in production, and writes are very heavy, I decided to run comparison tests between two identical servers to find the best setting for innodb_flush_method. We have the following specs for the hardware:

  • Dell R610
  • 24 core Intel Xeon X5670 @ 2.93ghz
  • 72GB ECC RAM
  • Brocade 825 HBA
  • Local disk: RAID-10 15K SAS Ext3 (ugh)
  • SAN: Oracle 7420 with four Intel Xeon X7550 @ 2.00GHz, 512GB RAM, 2TB read-cache(SLC-SSD), 36GB write cache (MLC-SSD), 3 disk shelves populated with 60x2TB 7200RM SATA drives setup in mirrored format with striped logs, dual 8Gb FC links to redundant fabric, connected to Brocade DCX 8510-4.
  • The my.cnf file being used for the tests: click-click

I’m using the following sysbench command to run the tests. On each server the same commands are used. I ran a 1B row prepare prior to the 1B row test.

sysbench –db-driver=mysql –num-threads=64 –max-requests=1000000000 –max-time=3600 –test=oltp –verbosity=3 –validate=off –oltp-test-mode=complex –oltp-read-only=off –oltp-table-name=sbtest –oltp-table-size=1000000000 –oltp-dist-type=special –mysql-host=localhost –mysql-port=3306  –mysql-table-engine=innodb run

On the server that is utilizing SAN paths there are two LUNS presented for MySQL use. /db/data01 for InnoDB data files, /db/logs01 for InnoDB logs. These filesystems are both formatted as XFS. The server running local-disk tests is running Ext3. I might run some more tests later with the local-disk setup as XFS if time allows.

Here are the results. Clearly a well designed SAN infrastructure is superior to even RAID-10 15K SAS drives. And of course you can see the different performance values from using O_DIRECT for the innodb_flush_method for the different data storage mediums.

1B Row Complex Transactional Test, 64 threads

  • SAN O_DIRECT: read/write requests: 31560140 (8766.61 per sec.)
  • SAN O_DSYNC: read/write requests: 5179457 (1438.52 per sec.)
  • SAN fdatasync: read/write requests: 9445774 (2623.66 per sec.)
  • Local-disk O_DIRECT: read/write requests: 3258595 (905.06 per sec.)
  • Local-disk O_DSYNC: read/write requests: 3494632 (970.65 per sec.)
  • Local-disk fdatasync: read/write requests: 4223757 (1173.04 per sec.)

PlanetMySQL Voting: Vote UP / Vote DOWN

SAN vs Local-disk :: innodb_flush_method performance benchmarks

Январь 6th, 2012

If you’ve been tuning your MySQL database and have wondered what effect the innodb_flush_method settings have on write performance, then this information might help. I’ve recently been doing a lot of baseline load tests to show performance differences between localdisk and the new SAN we’re deploying. Since we run InnoDB for everything in production, and writes are very heavy, I decided to run comparison tests between two identical servers to find the best setting for innodb_flush_method. We have the following specs for the hardware:

  • Dell R610
  • 24 core Intel Xeon X5670 @ 2.93ghz
  • 72GB ECC RAM
  • Brocade 825 HBA
  • Local disk: RAID-10 15K SAS Ext3 (ugh)
  • SAN: Oracle 7420 with four Intel Xeon X7550 @ 2.00GHz, 512GB RAM, 2TB read-cache(SLC-SSD), 36GB write cache (MLC-SSD), 3 disk shelves populated with 60x2TB 7200RM SATA drives setup in mirrored format with striped logs, dual 8Gb FC links to redundant fabric, connected to Brocade DCX 8510-4.
  • The my.cnf file being used for the tests: click-click

I’m using the following sysbench command to run the tests. On each server the same commands are used. I ran a 1B row prepare prior to the 1B row test.

sysbench –db-driver=mysql –num-threads=64 –max-requests=1000000000 –max-time=3600 –test=oltp –verbosity=3 –validate=off –oltp-test-mode=complex –oltp-read-only=off –oltp-table-name=sbtest –oltp-table-size=1000000000 –oltp-dist-type=special –mysql-host=localhost –mysql-port=3306  –mysql-table-engine=innodb run

On the server that is utilizing SAN paths there are two LUNS presented for MySQL use. /db/data01 for InnoDB data files, /db/logs01 for InnoDB logs. These filesystems are both formatted as XFS. The server running local-disk tests is running Ext3. I might run some more tests later with the local-disk setup as XFS if time allows.

Here are the results. Clearly a well designed SAN infrastructure is superior to even RAID-10 15K SAS drives. And of course you can see the different performance values from using O_DIRECT for the innodb_flush_method for the different data storage mediums.

1B Row Complex Transactional Test, 64 threads

  • SAN O_DIRECT: read/write requests: 31560140 (8766.61 per sec.)
  • SAN O_DSYNC: read/write requests: 5179457 (1438.52 per sec.)
  • SAN fdatasync: read/write requests: 9445774 (2623.66 per sec.)
  • Local-disk O_DIRECT: read/write requests: 3258595 (905.06 per sec.)
  • Local-disk O_DSYNC: read/write requests: 3494632 (970.65 per sec.)
  • Local-disk fdatasync: read/write requests: 4223757 (1173.04 per sec.)

PlanetMySQL Voting: Vote UP / Vote DOWN

kernel_mutex problem cont. Or triple your throughput

Декабрь 3rd, 2011

This is to follow up my previous post with kernel_mutex problem.

First, I may have an explanation why the performance degrades to significantly and why innodb_sync_spin_loops may fix it.
Second, if that is correct ( or not, but we can try anyway), than playing with innodb_thread_concurrency also may help. So I ran some benchmarks with innodb_thread_concurrency.

My explanation on the performance degradation is following:
InnoDB still uses some strange mutex implementation, based on sync_arrays (hello 1990ies), I do not have a good reason why it is not yet replaced.
Sync_array internally uses pthread_cond_wait / pthread_cond_broadcast construction, and on pthread_cond_broadcast call, all threads, competing on mutex, wake up and start racing.
This effect has name thundering herd.

Davi Arnaut does not agree with me, where I do not agree with him either. This is the healthy discussion, and it is possible only because InnoDB is still Open Source and we all can check source code. If the problem were in the closed extension Thread Pool I could not participate in it.

We will probably argue more on that topic, but that does not stop us from trying different
innodb_thread_concurrency ( 0 by default, that is no restrictions).

This variable has a complex fate. Once it was one solution for poor InnoDB scalability, then it changed default value, then it even was named useless.

There is results for workload as in previous post, 256 threads and
with innodb_thread_concurrency=0,4,8,16,32,64

innodb_thread_concurrency Throughput
0 68369.02
4 137999.96
8 194537.48
16 161985.59
32 158296.21
64 153889.72

Wow, this is something. I expected improvement, but not almost 3x times ( 194537÷68369 = 2.8).
The best throughput is with innodb_thread_concurrency=8.

So now let’s compare results for innodb_thread_concurrency= 0 vs 8 for all range of threads:

Threads innodb concurrency=0 innodb concurrency=8
1 11178.34
2 27741.06
4 53364.52
8 92546.73 88046.72
16 144619.58 141781.00
32 164884.03 168360.95
64 154235.73 186167.15
128 147456.33 199260.97
256 68369.02 194357.78
512 40509.67 194639.51
1024 22166.94 183524.16

So innodb_thread_concurrency is even more helpful innodb_sync_spin_loops, and allows to get stable result even with 1024 threads. It is yet early to say it useless, and you may play with it.



PlanetMySQL Voting: Vote UP / Vote DOWN

kernel_mutex problem. Or double throughput with single variable

Декабрь 2nd, 2011

Problem with kernel_mutex in MySQL 5.1 and MySQL 5.5 is known: Bug report. In fact in MySQL 5.6 there are some fixes that suppose to provide a solution, but MySQL 5.6 yet has long way ahead before production, and it is also not clear if the problem is really fixed.

Meantime the problem with kernel_mutex is raising, I had three customer problems related to performance drops during the last month.

So what can be done there ? Let’s run some benchmarks.

But some theory before benchmarks. InnoDB uses kernel_mutex when it starts/stop transactions, and when InnoDB starts the transaction, usually there is loop through ALL active transactions, and this loop is inside kernel_mutex. That is to see kernel_mutex in action, we need many concurrent but short transactions.

For this we will take sysbench running only simple select PK queries against 48 tables, 5,000,000 rows each.

Hardware is Cisco UCS C250 server. The workload is read-only and fully in memory.

There is the result for different threads (against Percona Server 5.5.17):

Threads Throughput, q/s
1 11178.34
2 27741.06
4 53364.52
8 92546.73
16 144619.58
32 164884.03
64 154235.73
128 147456.33
256 68369.02
512 40509.67
1024 22166.94

The peak throughput is 164884 q/s for 32 threads, and it declines to 68369 q/s for 256 threads, that is 2.4x times drop.

The reason, as you may guess, is kernel_mutex. How you can see it ? It is easy. In SHOW ENGINE INNODB STATUS\G you will see a lot of lines like:

--Thread 140370743510784 has waited at trx0trx.c line 1184 for 0.0000 seconds the semaphore:
Mutex at 0x2b0ccc8 '&kernel_mutex', lock var 1
waiters flag 0
--Thread 140370752542464 has waited at trx0trx.c line 1772 for 0.0000 seconds the semaphore:
Mutex at 0x2b0ccc8 '&kernel_mutex', lock var 1
waiters flag 0
--Thread 140088222295808 has waited at trx0trx.c line 1184 for 0.0000 seconds the semaphore:
Mutex at 0x2b0ccc8 '&kernel_mutex', lock var 1
waiters flag 0
--Thread 140370746922752 has waited at trx0trx.c line 1184 for 0.0000 seconds the semaphore:
Mutex at 0x2b0ccc8 '&kernel_mutex', lock var 1
waiters flag 0
--Thread 140088223500032 has waited at trx0trx.c line 1184 for 0.0000 seconds the semaphore:
Mutex at 0x2b0ccc8 '&kernel_mutex', lock var 1
waiters flag 0
--Thread 140088231528192 has waited at trx0trx.c line 795 for 0.0000 seconds the semaphore:
Mutex at 0x2b0ccc8 '&kernel_mutex', lock var 1
waiters flag 0
...

This problem is actually quite serious. In the real workloads I saw this happening with less than 256 threads, and not all production systems can tolerate 2x times drop of throughput in the peak times.

So what can be done there ?

In the first try, let’s recall that kernel_mutex (and all InnoDB mutexes) has complex handling with spin loops, and there are two variables that affects mutex loops: innodb_sync_spin_loops and innodb_spin_wait_delay. I actually think that tuning system with these variable is something closer to dance with drum than to scientific method, but nothing else helps, why not to try.

There we vary innodb_sync_spin_loops from 0 to 100 (default is 30):

Threads Throughput NA
1 11178.34
2 27741.06
4 53364.52
8 92546.73
16 144619.58
32 164884.03
64 154235.73
128 147456.33
256 68369.02
512 40509.67
1024 22166.94

I was surprised to see that with innodb_sync_spin_loops=100 we can improve to 145324 q/s , almost to peak throughput from first experiment.

With innodb_sync_spin_loops=100 the kernel_mutex is still the main point of contention, but InnoDB tries to prevent the current thread from pausing, and that seems helping.

Further experiments showed that 100 is not enough for 512 threads, and it should be increased to 200.

So there is final results with innodb_sync_spin_loops=200 for 1-1024 threads.

Threads Throughput Throughput spin 200
1 11178.34 11288.42
2 27741.06 28387.62
4 53364.52 53575.52
8 92546.73 92184.65
16 144619.58 143688.91
32 164884.03 164392.94
64 154235.73 154022.57
128 147456.33 152280.84
256 68369.02 150089.31
512 40509.67 127680.65
1024 22166.94 61507.08

So playing with this variable we can double throughput to the level with 32-64 threads.
I am not really can explain how it does work internally, but I wanted to show one of possible ways
to deal with problem when you hit by kernel_mutex problem.

Further direction I want to try to limit innodb_thread_concurrency and also bind mysqld to less CPUs, and also it is interesting to see if MySQL 5.6.3 really fixes this problem.



PlanetMySQL Voting: Vote UP / Vote DOWN

Virident FlashMAX MLC in tpcc-mysql workload

Ноябрь 29th, 2011

As I mentioned in previous post on Virident FlashMAX MLC, beside sysbench benchmark, I also run tpcc-mysql (to compare performance Virident FlashMAX vs Fusion-io ioDrive Duo)

The report with results is there: http://www.percona.com/files/white-papers/virident-mlc-tpcc.pdf

The graphical result for tpcc-mysql 5000W:

My conclusions from this benchmark:

  • Virident FlashMAX provides stability of performance and reveals a denser throughput.
  • In addition to stability, in many cases there is also a better throughput in MySQL (up to 40\%) using the Virident FlashMAX card.

DISCLOSURE: This benchmark was done as part of our consulting practice for which we compensated by Virident. However, this benchmark was run independently of Virident, and reflects our opinion of this product.



PlanetMySQL Voting: Vote UP / Vote DOWN

Fishing with dynamite, brought to you by the randgen and dbqp

Ноябрь 16th, 2011

I tend to speak highly of the random query generator as a testing tool and thought I would share a story that shows how it can really shine. At our recent dev team meeting, we spent approximately 30 minutes of hack time to produce test cases for 3 rather hard to duplicate bugs. Of course, I would also like to think that the way we have packaged our randgen tests into unittest format for dbqp played some small part, but I might be mildly biased.

The best description of the randgen’s power comes courtesy of Andrew Hutchings – “fishing with dynamite“. This is a very apt metaphor for how the tool works – it can be quite effective for stressing a server and finding bugs, but it can also be quite messy, possibly even fatal if one is careless. ; ) However, I am not writing this to share any horror stories, but glorious tales of bug hunting!

The randgen uses yacc-style grammar files that define a realm of possible queries (provided you did it right…the zen of grammar writing is a topic for another day). Doing this allows us to produce high volumes of queries that are hopefully interesting (see previous comment about grammar-writing-zen).

It takes a certain amount of care to produce a grammar that is useful and interesting, but the gamble is that this effort will produce more interesting effects on the database than the hand-written queries that could be produced in similar time. This is especially useful when you aren’t quite sure where a problem is and are just trying to see what shakes out under a certain type of stress. Another win is that a well-crafted grammar can be used for a variety of scenarios. The transactional grammars that were originally written for testing Drizzle’s replication system have been reused many times (including for two of these bugs!)

This brings us to our first bug:
mysql process crashes after setting innodb_dict_size

The basics of this were that the server was crashing under load when innodb_dict_size_limit was set to a smaller value. In order to simulate the situation, Stewart suggested we use a transactional load against a large number of tables. We were able to make this happen in 4 easy steps:
1) Create a test case module that we can execute. All of the randgen test cases are structured similarly, so all we had to do was copy an existing test case and tweak our server options and randgen command line as needed.

2) Make an altered copy of the general, percona.zz gendata file. This file is used by the randgen to determine the number, composition, and population of any test tables we want to use and generate them for us. As the original reporter indicated they had a fair number of tables:

$tables = {
rows => [1..50],
partitions => [ undef ]
};

The value in the ‘rows’ section tells the data generator to produce 50 tables, with sizes from 1 row to 50 rows.

3) Specify the server options. We wanted the server to hit similar limits as the original bug reporter, but we were working on a smaller scale.
To make this happen, we set the following options in the test case:

server_requirements = [["--innodb-dict-size-limit=200k --table-open-cache=10"]]

Granted, these are insanely small values, but this is a test and we’re trying to do horrible things to the server ; )

4) Set up our test_* method in our testcase class. This is all we need to specify in our test case:

def test_bug758788(self):
test_cmd = ("./gentest.pl "
            "--gendata=conf/percona/innodb_dict_size_limit.zz "
            "--grammar=conf/percona/translog_concurrent1.yy "
            "--queries=1000 "
            "--threads=1")
retcode, output = execute_randgen(test_cmd, test_executor, servers)
self.assertTrue(retcode==0, output)

The test is simply to ensure that the server remains up and running under a basic transactional load

From there, we only need to use the following command to execute the test:
./dbqp.py –default-server-type=mysql –basedir=/path/to/Percona-Server –suite=randgen_basic innodbDictSizeLimit_test
This enabled us to reproduce the crash within 5 seconds.

The reason I think this is interesting is that we were unable to duplicate this bug otherwise. The combination of the randgen’s power and dbqp’s organization helped us knock this out with about 15 minutes of tinkering.

Once we had a bead on this bug, we went on to try a couple of other bugs:

Crash when query_cache_strip_comments enabled

For this one, we only modified the grammar file to include this as a possible WHERE clause for SELECT queries:

WHERE X . char_field_name != 'If you need to translate Views labels into other languages, consider installing the <a href=\" !path\">Internationalization</a> package\'s Views translation module.'

The test value was taken from the original bug report.
Similar creation of a test case file + modifications resulted in another easily reproduced crash.
I will admit that there may be other ways to go about hitting that particular bug, but we *were* practicing with new tools and playing with dynamite can be quite exhilarating ; )
parallel option breaks backups and restores

For this bug, we needed to ensure that the server used –innodb_file_per_table and that we used Xtrabackup‘s –parallel option. I also wanted to create multiple schemas and we did via a little randgen / python magic:

# populate our server with a test bed
test_cmd = "./gentest.pl --gendata=conf/percona/bug826632.zz "
retcode, output = execute_randgen(test_cmd, test_executor, servers)
# create additional schemas for backup
schema_basename='test'
for i in range(6):
    schema = schema_basename+str(i)
    query = "CREATE SCHEMA %s" %(schema)
    retcode, result_set = execute_query(query, master_server)
    self.assertEquals(retcode,0, msg=result_set)
    retcode, output = execute_randgen(test_cmd, test_executor, servers, schema)

This gave us 7 schemas, all with 100 tables per schema (with rows 1-100). From here we take a backup with –parallel=50 and then try to restore it. These are basically the same steps we use in our basic_test from the xtrabackup suite. We just copied and modified the test case to suit our needs for this bug. With this setup, we need a crash / failure during the prepare phase of the backup. Interestingly this only happens with this number of tables, schemas, and –parallel threads.

Not too shabby for about 30 minutes of hacking + explaining things, if I do say so myself. One of the biggest difficulties in fixing bugs comes from being able to recreate them reliably and easily. Between the randgen’s brutal ability to produce test data and queries and dbqp’s efficient test organization, we are now able to quickly produce complicated test scenarios and reproduce more bugs so our amazing dev team can fix them into oblivion : )


PlanetMySQL Voting: Vote UP / Vote DOWN

Side load may massively impact your MySQL Performance

Ноябрь 14th, 2011

When we’re looking at benchmarks we typically run some stable workload and we run it in isolation – nothing else is happening on the system. This is not however how things happen in real world when we have significant variance in the load and many things can be happening concurrently.

It is very typical to hear complains about MySQL interactive performance – serving simple standard web traffic is drastically impacted when some heavy queries are ran in background or backup is done with mysqldump – a lot more than you would expect from simple resource competition. I finally found some time to look further in this problem and see what can be done to remedy it.

We designed the benchmark the following way – there is a small table (200MB) which completely fits in the Innodb Buffer Pool (512MB). We also have larger table 4GB which does not fit in the buffer pool. We’re running uniform sysbench OLTP on the small table and mysqldump on the second table. First we run tests individually and when concurrently.

In the perfect world what we would like to see is performance is staying about the same when we run tests concurrently because Sysbench should run completely in memory and use a lot of CPU resources but none of disk IO and mysqldump should have relatively little CPU needs and be bound by disk. Also these are just 2 “threads” running on 4 core system so there should be plenty CPU to spare.

We’re using Percona Server 5.5.15 for this test with buffer pool size of 512MB and innodb_flush_method=O_DIRECT

Test Setup:

[root@localhost msb_ps_5_5_15]# sysbench –test=oltp –db-driver=mysql –mysql-host=localhost –mysql-table-engine=innodb –mysql-db=test –oltp-table-name=md_cache_test_small –oltp-table-size=1100000 –mysql-user=msandbox –mysql-password=msandbox –mysql-socket=/tmp/mysql_sandbox5516.sock prepare

[root@localhost msb_ps_5_5_15]# sysbench –test=oltp –db-driver=mysql –mysql-host=localhost –mysql-table-engine=innodb –mysql-db=test –oltp-table-name=md_cache_test_big –oltp-table-size=17600000 –mysql-user=msandbox –mysql-password=msandbox –mysql-socket=/tmp/mysql_sandbox5516.sock prepare

Running Sysbench and MySQLDump. Note we run them in the loop to see how result stabilizes.

[root@localhost msb_ps_5_5_15]# sysbench –test=oltp –db-driver=mysql –num-threads=1 –max-requests=0 –oltp-dist-type=uniform –max-time=180 –oltp-read-only –mysql-host=localhost –mysql-table-engine=innodb –mysql-db=test –oltp-table-name=md_cache_test_small –oltp-table-size=1100000 –mysql-user=msandbox –mysql-password=msandbox –mysql-socket=/tmp/mysql_sandbox5516.sock run

[root@localhost msb_ps_5_5_15]# time mysqldump –defaults-file=my.sandbox.cnf test md_cache_test_big > /dev/null

Baseline Run:
When we run the tests individually Sysbench gives about 330 req/sec and mysqldump for large table completes in about 95 seconds.
If we run them concurrently after system reaches steady state we get about 2 req/sec and mysqldump takes about 180 seconds.

Yes you get it right. Performance of sysbench OLTP on small table drops more than 150 times when heavy mysqldump is running concurrently. mysqldump itself also slows down
about 2x.

What is going on here ? To understand it we should take a look at the buffer pool contents.

mysql [localhost] {msandbox} (information_schema) > select t.schema, t.name, t.table_id, i.index_id, i.name, sum(data_size)/1024/1024 as data_size_mb from innodb_sys_tables as t inner join innodb_sys_indexes as i using(table_id) inner join innodb_buffer_pool_pages_index as p using(index_id) where t.schema=’test’ group by i.index_id \G
INDEX_NAME DATA_SIZE_MB
test.md_cache_test_small.PRIMARY 216.31397057
test.md_cache_test_small.k 2.66948509
test.md_cache_test_big.PRIMARY 250.76164627

…..

INDEX_NAME DATA_SIZE_MB
test.md_cache_test_small.PRIMARY 12.10487175
test.md_cache_test_big.PRIMARY 457.70432472

When we’re running sysbench OLTP on its own we have the primary key of the table fit completely in the buffer pool. However when mysqldump is ran concurrently it reads so many pages from the disk it pushes out most of the smaller table from the buffer pool with only 12MB remaining. This makes workload extremely IO bound hence such drop in performance.

The performance of mysqldump is impacted too because we now have 2 threads competing for what is single hard drive on this test system.

It is worth to note MySQL actually uses midpoint insertion for its buffer pool replacement policy . Unfortunately by default it is configured in a way it is quite useless. The blocks are indeed first placed in the head of “old” sublist which mean they should not push any hot data which is in “young” sublist. However when you’re doing mysqldump (or running some complex batch job query) you are likely going to have multiple accesses to the data on the same page before being done with it for good. Because there are several accesses page really gets immediately moved to the young sublist and as such placing high pressure on buffer pool.

There is ingenious feature though to deal with this problem, it is just you have to enable it separately. There is a variable innodb_old_blocks_time which specifies amount of milliseconds which needs to pass before table can be moved to the young sublist. In typical cases like mysqldump all accesses to the majority of pages will be concentrated within very small period of time so setting innodb_old_blocks_time variable to some value will prevent important data to be pushed out of buffer pool.

Lets repeat the benchmark with innodb_old_blocks_time=1000 which will correspond to 1 sec.

Separate Sysbench gives about 330 req/sec and mysqldump about 95 seconds which is the same. Note we ran test on virtualized system in this case so we would not be able to measure small variances in performance reliably.

Running Sysbench and MySQLDump convurrently gives about 325 req/sec for sysbench and some 100 seconds for mysqldump which is a dramatic improvement of over
150x for sysbench and results now going inline with what you would expect.

Lets see what is going on with buffer pool contents:

INDEX_NAME DATA_SIZE_MB
test.md_cache_test_small.PRIMARY 216.35031509
test.md_cache_test_small.k 0.13414192
test.md_cache_test_big.PRIMARY 253.21095276
test.md_cache_test_big.k 0.01491451

…..
INDEX_NAME DATA_SIZE_MB
test.md_cache_test_small.PRIMARY 216.35031509
test.md_cache_test_big.PRIMARY 253.19661140
test.md_cache_test_big.k 0.01491451

As you can see now the small table PRIMARY KEY (which is what used by benchmark) is not pushed from buffer pool at all.

For advanced tuning you might also look into changing how buffer pool is split into young and old sublists via innodb_old_blocks_pct variable though we did not need to do it in this case.

I’m not sure if there are any bad side effects from setting innodb_old_blocks_time to non zero value, if not I would strongly suggest changing default from zero in MySQL 5.6 as it would offer much better “out of box” user experience.

Summary
As we can see in default configuration MySQL has buffer pool which can be easily washed away by large table scans or heavy batch jobs. If this happen the workload which is normally in memory becomes disk IO bound which can slow it down more than 100 times. The solution is rather easy though. Setting innodb_old_blocks_time to 1000 or other meaningful number is an easy remedy for this problem.

I want to thank Ovais Tariq for doing a lot of heavy lifting running benchmarks for this post.


PlanetMySQL Voting: Vote UP / Vote DOWN

Review of Virident FlashMAX MLC cards

Ноябрь 10th, 2011

I have been following Virident for a long time (e.g. http://www.mysqlperformanceblog.com/2010/06/15/virident-tachion-new-player-on-flash-pci-e-cards-market/). They have great PCIe Flash cards based on SLC NAND.
I always thought that Virident needed to come up with an MLC card, and I am happy to see they have finally done so.

At Virident’s request, I performed an evaluation of their MLC card to assess how it handles MySQL workload. As I am very satisfied with the results, I wish to share my findings in this post.

But first, I wish to offer an overview of the card.

Virident FlashMax Cards are available in 1TB and 1.4TB usable capacities (the models names are M1000 and M1400)
These specified sizes are already available for end users.
I evaluated M1400 (1.4TB size) model, which I will discuss:

Because Virident has competition in the SSD market, they have stated their goals to distinguish themselves from their competitors:

  • Stability of performance: That is to minimize variations in throughput
  • Better response times: This is very important for database performance and I appreciate that Virident has made this a priority.
  • Performance at full capacity: As we know, SSD-based cards have special characteristics; the throughput declines when space utilization increases. Virident’s design/programming minimizes this decline.
  • RAID5 on the card: The card comes with RAID5 support on the card to give better protection.

To deal with a throughput decline, all Flash cards have reserved space. The 1.4TB card, that I have, internally holds 2TB worth of space.

This additional space is used for two purposes:

      1. To amortize write-intensive workloads, by using additional space.
      2. To have replacements for failed MLC modules. When one MLC module fails, it is marked as unused, and gets replaced by one from the pool of reserved modules.

Internally, Virident uses 25nm Intel NAND Flash MLC modules, this is the same technology that Intel uses for the Intel SSD 320 cards. 25nm modules allow the user a greater capacity, Physically you can place
more GBs into a given area. However, the drawback is that 25nm has worse reading and writing latencies, compared to previous generations. However, I have yet to determine how this affects MySQL workloads.

Virident has provided the following price list:

  • M1000 (1000GB Usable) – $13,000
  • M1400 (1400GB Usable) – $18,200
  • This amounts to $13/GB

Second, it is important to compare the performance of Virident FlashMAX MLC with available competing solutions.
It is fair to say Fusion-io ioDrive Duo 1.28TB MLC is the most well-known and most advanced competitor in the market.
I had a chance to administer a head-to-head comparison of sysbench and tpcc-mysql workloads between FlashMAX 1.4TB and ioDrive Duo 1.28TB.

It is important to highlight that Fusion-io ioDrive Duo is based on 34nm NAND technology, which is a full generation behind the 25nm NAND. However at this point, I have no access to Fusion-io ioDrive2, which is based on 25nm NAND.
Another important factor is that ioDrive Duo is actually two cards visible in the OS, and the user needs to use a software RAID. For Virident all 1400GB shows up as one single drive so no software RAID is necessary.

To compare performances I ran sysbench oltp and tpcc-mysql benchmarks. I will present the results
for sysbench oltp (with full report available later) below, and the results for tpcc-mysql in a followup post.

For sysbench, I used our multi-tables sysbench implementation with 256 tables and 10,000,000 rows each. This is a total of around 630GB of data, which allows one to adequately fill both cards in comparison.

Some hardware used in benchmarks include:

  • Server: Cisco UCS C250, running Oracle Linux 6.1 and Percona Server 5.5.15
  • Client: HP ProLiant DL380 G6, sysbench v5

Of course, our Percona Server was optimized for Flash cards, with variations for two settings.
I tested combinations of innodb_buffer_pool_size=120GB, 174GB and innodb_flush_log_at_trx_commit=1, 2.

The results in this post are for case innodb_buffer_pool_size=174GB and innodb_flush_log_at_trx_commit=1

As in all my recent benchmarks, I use long runs of 1 hour each with measurements every 10 seconds. This methodology allows me to observe trends and the stability of the performance on graphs.

The first graph represents throughput in transactions per second for different amounts of user threads (more is better). More concentrated dots represent less variance and better stability of throughput.

A tabular format, for throughput I use a median of measurements for last 1800 seconds in each run:

Card / Threads / tps 1 2 4 8 16 32 64 128 256 512 1024
1 Fusion-io ioDrive Duo 83.00 177.00 322.00 523.00 644.00 740.00 801.00 798.00 761.00 784.00 162.00
2 Virident FlashMAX 96.00 179.00 357.00 607.00 821.00 975.00 1083.00 1156.00 1064.00 1091.00 465.00

In order to examine the details of how throughput varies we have taken 32 threads and examined the timeline graph for each one:

While you can see that with Virident FlashMAX we have a pretty stable line of around 975 tps, the Fusion-io ioDrive Duo has a variance of 700-800 tps.

My conclusions are as follows:

  • It is great to see another player on MLC Flash cards market.
  • It is also great that Virident focuses on stability of performance for competitive advantage.
  • Beside stability, we also see better throughput in MySQL using the Virident FlashMAX card for every thread count. On 32-64 threads we have about a 35-40% advantage of using Virident FlashMAX.

DISCLOSURE: This review was done as part of our consulting practice for which we compensated by Virident. However, this review was written independently of Virident, and reflects our opinion of this product.


PlanetMySQL Voting: Vote UP / Vote DOWN