Archive for the ‘Tools/ dbSTRESS’ Category

MySQL Performance: InnoDB IO Capacity & Flushing

Июль 12th, 2010

This article was initially inspired by Vadim's post on mysqlperformanceblog.com comparing MySQL 5.5.4 and XtraDB under TPCC-like workload. Vadim's post has opened an interesting discussion about InnoDB I/O capacity feature in general, and particularly - tunning / settings regarding 5.5.4 and XtraDB. However, going more in depth I've discovered more interesting things than I've expected initially :-) and here is a kind of summary of my observations..

The following stuff is mainly covering heavy Read+Write workloads as I/O capacity was added to InnoDB to improve page writes rather reading ;-)

First of all - why I/O capacity in InnoDB?..

InnoDB I/O Capacity

Initially InnoDB was designed to flush in background only 100 pages per second. The value of 100 pages was fixed within a code and was not configurable. So, if you have got a storage subsystem able to execute more than 100 writes/sec you were not limited by the storage anymore, but by InnoDB itself..

Google team introduced the "I/O capacity" feature within their performance patches giving a way to adapt InnoDB I/O activity according to a given storage array capabilities, etc. This value of IO capacity entered also in many other maintenance tasks when database needs to decide how many pages should be flushed in background. This feature changed many things and bring some new problems as well :-))

Supposing you've procured a storage array able easily to execute 1000 random writes/sec, so how you can improve your performance now?..

  • Well, before with a fixed 100 writes value the difference was seen only during a "furious flushing" period (will talk about later)

  • Now, with a new storage you're able to say you can keep 1000 writes/sec, and set innodb_io_capacity=1000. What it'll change? - on the time of the Google patch this value was mainly used to size dirty pages flushing. So every time InnoDB reached the max dirty page percentage setting it started a flush batch with a number of pages equal to IO capacity setting (as well in other background flushes too but with a much smaller impact)..

  • So far, with IO capacity any user was able finally to force InnoDB to process a more or less aggressive flushing of modified data and better use the available storage subsystem.. But now - what about performance?.. ;-)

  • From the performance perspective - less you're doing I/O operations faster you're going :-)

  • On another hand - you cannot delay I/O operations infinitively because earlier or later you'll need to write anyway and write a huge amount of pages will be always harder then flush them little bit but constantly ;-) - of course, your workload may vary and if you know that after 5 min of hot activity you'll get 20 min of quiet period - you may probably delay your writes and flush pages during the quiet period.. - but usually such a programmable situation is not happening often :-)

  • So, in production environment you know that to avoid write bursts you need to flush constantly.. - but how much and since which event?.. ;-)

  • If you're flushing too much you'll make your storage busier and your overall performance lower..

  • But which amount of dirty pages you're really need to flush?.. And is there any danger if you're flushing less?.. ;-)

First - does it really matter how many dirty pages you have in your buffer pool?.. ;-)

All changes are reflected in the redo log, so in case of crash you should recover them (but well, if you're crashing too often you probably need to change your HW or OS vendor ;-)) But supposing you've got a solid HW, stable OS, huge RAM, fast storage and don't afraid to drive fast :-) So where is a real danger?..

Usually when you're optimizing a database performance you're not really looking to reach the highest pick result, but rather a stable performance level.. So the main danger will be always to see performance drops during your workload. And when it happens due really growing load - it's one story, but when it happens due wrong configuration or design problems - it's another story ;-)

So far, there are at least 2 most critical situations you may reach "by design" :-)

  • #1) Your buffer pool is small and out of space - most of pages are dirty and you absolutely need to flush to make a room for other data.. - Well, such case normally should not arrive because InnoDB is self protecting to never out pass 70% of dirty pages percentage, but if your IO capacity is set too low you're not protected ;-) And once it happens InnoDB has to flush at any price to avoid a global freeze.. and it'll flush without regarding IO capacity setting - it'll simply involve a full flush, so it'll be written on the highest write speed as your storage able to sustain.. - So anyway, it'll not be a total freeze, but not far from that, as under so aggressive writes your database server will work very slowly. I'm calling it "furious flushing" :-))

  • #2) But if the buffer pool is big enough then the furious flushing may happen again when you're out of free space in your redo logs.. - Having bigger buffer pool will give you a better performance, and for the same dirty pages percentage you'll be able to keep more dirty pages and dispose a bigger gap of buffer pool space.. BUT the max redo log space is still currently limited to 4GB! - and every redo log record referencing a dirty page cannot be recycled until this dirty page remains unflushed.. So once the amount of your changes reached 7/8 of 4GB, InnoDB has no choice - it'll involve a full flush too to avoid a freeze.. - And again, 4GB means you've created your redo log files with a max allowed size (4GB in total) - but with a smaller size furious flushing will arrive much more earlier ;-)

If you still follow me, you'll see that the main problem is coming from redo logs ;-)

There may also arrive another issue due flushing from LRU, but I think its impact will be much less important (or it may be the next issue to fix? - so I'll ignore it for the moment :-))

But well, let's summarize now and go by example:

  • Let's start with a server configured with a buffer pool=500M and redo log=128M - this configuration will work.. And you'll have all problems at once, but not really feel them because every full flash will be still short ;-) So you'll have a low average performance level but quite stable.. And still feeling you may better use your HW :-)

  • As you have, say, 16GB of RAM on machine, it'll be pity to use only 500M for your buffer pool :-) And once you'll configure buffer pool=12000M you'll see a huge performance improvement, but mainly on read operations.. - On the Read+Write workloads you'll only go faster due faster cached reads, but writes will not go faster.. - Why? - you'll be still limited by the size of your redo log :-) 128M is too short and you'll be constantly doing a full flush..

  • Then when you'll try a 3000M redo log you'll see a real improvement (50% or more)! - but nothing is free.. - with a bigger redo logs once you're out of redo space your full flushing will be really furious :-))

  • On the same time it'll be cool to keep the reached performance level all the time and without performance drops, no? ;-)

Sorry if I'm repeating myself, but a small picture says more - here is a performance level on dbSTRESS with 128MB redo logs:

And here is the same workload but with 1024MB redo logs:

As you see, 8.000 TPS is way better than 6.000 TPS ;-)

BUT - there are huge periodic drops.. And the origin of these drops is out of space in redo logs.

Can it be fixed and how easily ?..

  • You may lower your max dirty page percentage setting and force InnoDB to flush the modified pages before the redo logs will be out of space, and IO capacity setting will help you here to adapt your I/O activity according your workload. BUT - the problem is that there is no direct dependency between a number of dirty pages and a number of records in redo logs!.. - So your setting may be still be not low enough to cover any kind of activity, then by setting max dirty pages percentage too low you're starting to write too often and slow down your general performance..

  • The first real solution for this problem was proposed by Percona team by introducing Adaptive Checkpoint feature. The initial implementation of Adaptive Checkpoint was released as progressive flushing: depending on the percentage of redo log occupancy, the dirty pages flushing was adapted with a relative IO capacity percentage - less free space is left in redo logs, higher percentage of IO capacity is used for dirty pages flushing. This solution is still integrated within Percona's XtraDB and called "reflex".

  • InnoDB then introduced a similar feature which is called Adaptive Flushing. The solution is based on calculation of estimated speed of necessary flushing according the dirty pages level and redo log space occupancy. The estimation obtains finally a number of pages to flush and this number is compared to the IO capacity setting: the lower of two values is used then.

  • Then, if your workload become really heavy on writes, you'll need a Purge Thread (otherwise either Adaptive Checkpoint or Flushing will not be able to help you because their code will be never reached, so never executed.. - I'll skip details here, but if you want to know why you may read this post ). Just keep in mind that you may need a Purge Thread feature to be sure all things are working right.

So by tuning the IO capacity setting according your workload you may adapt I/O activity on your systems: more or less aggressive writes will be involved on your storage subsystem. As well it's also very useful when your goal is to run several MySQL servers in parallel on the same host (for virtualization or other reasons) and you expect to guaranty a reasonable storage sharing between your MySQL instances. By IO capacity setting you may attribute a sort of quotas for each MySQL server on write activity, etc.

Now, once the preface is finished, let's go back to the Vadim's article :-)

XtraDB & IO capacity

Vadim in his post pointed on a quite curious situation observed on Percona's TPCC-like workload:

  • On the same workload and similar my.conf settings MySQL 5.5.4 performed less stable comparing to XtraDB..

  • Further investigations showed that 5.5.4 become more stable with a higher innodb IO capacity setting (and a bigger buffer pool as well)..

So it means that Adaptive Flushing estimation obtains a higher number of pages to flush comparing to the initially used IO capacity setting... So if the IO capacity is configured lower than it should be for a given workload then InnoDB will not flush fast enough to keep a room of free space in the redo logs and will meet a "furious flushing" every time..

To understand if your IO capacity settings is good enough you may just monitor your checkpoint age over a time: if it remains low, stable and not increasing to reach the critical level (ex: stays under 80% of your redo space MB) - it's ok. Otherwise you should increase your IO capacity value.

The problem also that you cannot change it live for the moment.. So if you discover you have to adjust it within your production workload - it may be a real problem if you have to restart your MySQL server.. - except if your familiar with GDB hacking and feel ok to change a global variable of the currently running MySQL process :-) But seems everybody is agree that IO capacity should be a dynamic variable, so I hope we'll see it dynamic very soon :-))

Another helpful features will be also nice to have:

  • A counter of "furious flushing" events (e.g. every time when the full flush is involved due missing free space in redo logs) - it'll simplify InnoDB monitoring and give you a needed information about full flushes without needing to monitor checkpoint age..

  • Print a current estimated IO capacity value within InnoDB status output to see if there are gaps between estimated and configured IO capacity values..

Seems it'll be easy to add.

However, my attention here was still mainly focused on the one and the same question I've asked myself after Vadim's article: Why having exactly the same (low) IO capacity settings XtraDB is not meeting the same problems as MySQL 5.5.4 ?...

The real answer as usual is coming from the source code reading ;-) - after scanning and tracing XtraDB code under my workloads I've finally discovered that the default Adaptive Checkpoint setting used currently by XtraDB simply don't use IO capacity setting in its formula - it'll simply write as many pages as it estimated! :-)) So it's normal it does not meet any penalties due lower IO capacity setting :-)) The default mode in XtraDB Adaptive Checkpoint is "estimate", and it's implemented currently to write pages without regarding any IO capacity limits/settings (while previously default "reflex" mode is using IO capacity).. - Well, of course the point about how many writes should be involved by InnoDB and should they be limited or remain free of limits is a subject of long discussions (preferable around a beer :-)) But I afraid there will be as many opinions as people :-)) and the most optimal will be probably leave it as an option on configuration setting:

  • #1) - writes are limited by IO capacity
  • #2) - writes are free of limits and involved depending on activity
  • #3) - writes are limited by IO capacity, but IO capacity is auto-incremented by InnoDB in case of high write requests

And the last option (#3) I'll try to use during the following tests :-)

Well, if it became less or more clear with IO capacity impact, there were still several questions regarding MySQL 5.5.4 which did not stop to run in my head and needed to find their answers..

MySQL 5.5.4 and InnoDB Flushing

When we firstly implemented the Purge Thread feature yet in MySQL 5.4 it did not really bring any performance degradation, but only stability improvements:

MySQL 5.4 default:

MySQL 5.4 with a Purge Thread:

(Ref: MySQL Performance: Final fix for Ahead Flushing & Purge Lag )

Then there are so many changes were introduced within InnoDB code that when the Purge Thread was officially available in MySQL 5.5.4 the things were changed: we started to see decreased performance on the Purge Thread is activated:

(Ref.: Why Purge Thread in InnoDB?.. )

While I can understand there is a performance drop due bigger processing done by InnoDB once it's doing things right ;-)) but looking on the Checkpoint Age graph I still have impression we're probably flushing too much and observed performance gap may still be reduced...

Then, analyzing my tests with XtraDB I was also surprised that this performance drop in 5.5.4 is significantly higher then in XtraDB:

Having 17.000 TPS and 18.000 TPS makes an important difference, specially that having only one purge thread is not yet removing completely a purge lag, and you may need to run two purge threads or have a purge lag fix applied to reach a real stability, but this will also reduce your performance little bit because purging has a big cost! (Ref.: MySQL Performance: Improving Stability for more details)..

And again, looking on Checkpoint Age, there are 2 things coming in mind:

  • Seems we're flushing more than really needed comparing to XtraDB

  • Slowly growing Checkpoint Age over a time I don't like at all: it still means there is a danger to not flush enough and meet "furious flushing" (and on more long tests I've observed it by myself, so I'm not presenting other graphs, but just believe me :-))

All these observations make me think there is something wrong in the way we're flushing dirty pages.. But on the same time the code logic in InnoDB don't make me think something was missed.. Which bring me on idea that we simply don't focus on the right direction...

Initially I've started to trace estimation values reported by InnoDB Adaptive Flushing - and it's true, sometimes they were looking too high or too low.. So, I've replaced estimation function by my own formula similar that I've used in 5.4 tests - and it was possible to make InnoDB performing better with a such simple formula, but it make me worry on the same time that such a tunning will be very workload oriented rather universal..

And then getting all things presented above together I've started to look in the root of the problem:

  • Currently what we're trying in Adaptive Flushing is to find a right number of pages to flush - but is it a right way?...

  • On the same time for some reasons even by flushing an estimated number of pages we're still not protected to get out of free space in redo logs and meet the famous "furious flushing"...

  • And then if I monitor my redo log write activity I have only 10-12MB/sec on write activity, which is giving me on x3 of 1GB logs at least 200 sec before there will be no free space in my redo logs!

  • So what I'm doing wrong during these 200 sec to arrive into a bad situation when I don't have a free space anymore?..

Looking on all of these points make me thing that it doesn't really matter how many dirty pages we've estimated to flush.. What is really important - we have just to follow a redo log activity to be sure there is always a free space in redo logs! And all we need is just flush enough to free the tail of the redo logs on the same speed as the redo log is filled :-))

To implement such a solution we have to get a look on how the batch flush function is called within InnoDB:

  • Initially it was buf_flush_batch() function, and since 5.5.4 it's buf_flush_list() function which has only 2 arguments: number of pages to flush and the max age of pages to flush..

  • Curiously that over all history of InnoDB the second argument was always equal to the MAX possible value (IB_ULONGLONG_MAX) - which means that when we involve the flush batch there is no guaranty the most oldest dirty pages will be flushed first! And by flushing an estimated amount of pages we may still have a tail of redo logs not freed (well, correct me if I'm wrong, but it's an impression I've got..)

  • So the idea is simple: instead of IB_ULONGLONG_MAX as max page age argument give an age based on the redo log tail age ! :-))

  • NOTE: going ahead, XtraDB is already using a different value instead of IB_ULONGLONG_MAX within "estimate" option for Adaptive Checkpoint, but on the time I've scanned the XtraDB code I did not really catch this part... - and once my own code worked I was finally able to appreciate the XtraDB solution too! Well done, Percona! ;-))

So far, the Implementation logic looks like this:

  • On every 1sec loop of Master thread the local old/previous LSN is kept

  • On every 10sec the global old LSN is kept and the Redo Log write speed is recalculated as:
    LSN_speed = (LSN_speed + ( LSN_current - LSN_old ) / time_spent ) / 2

  • Then when Adaptive Flushing is involved: if redo log is filled in less then 10% - do nothing; otherwise the flush batch is involved

  • The flush list function is called as buf_flush_list( IO_capacity, flush_age ) where:
    - IO capacity: is simply all 100% of IO capacity setting
    - flush_age: oldest page modification + (LSN_speed + (lsn_curr - lsn_old)) / 2

  • So the Flush Age is always adjusted according the avg redo log activity observed over a time; then the IO capacity settings is limiting the number of pages to flush in case there are too much corresponding to the given age limit..

  • Optionally, the IO capacity setting is increased by +100 every time when there was met 10 times the redo log occupancy at 75%..

So what about the TPS results now? ;-)) - I've got even slightly better result comparing to XtraDB, but again - single purge thread is not enough to reach a complete stability, and as mentioned in the previous post , you'll need to have a second purge thread or a purge lag fixed. Solution with a Purge Lag still gives a better result, and by setting innodb_max_purge_lag=400000 we obtaining the following result with XtraDB:

As you see, limiting Purge Lag to 400K is reducing performance (as expected): from ~18.000 TPS we're going to 17.000-17.500 TPS.. However there is a huge changes on the History List length - if before after 40min of activity it grew up to 10M(!!), then now it remains stable at 400K, which is way better for production environment and general workload stability as well..

As before 5.5.4 was worse than XtraDB when the Purge Thread is activated, it was also worse with a Purge Lag limit too.. But now, when the gap is removed, what is the sable result with improved 5.5.4 ?.. ;-))

As you can see, over a time 5.5.4 is now the same or even slightly higher then XtraDB! So the proposed solution seems to work better than current flushing estimation in InnoDB.. - but more testing and also on another workloads will be need to find an optimal way ;-))

I've supposed to finish here, but then entered into another mess...

Single User Performance

To get the final results with a growing number of users I've started a classic workload from 1 to 256 users, and observed absolutely strange things having just a one active user:

As you can see, the proposed (patched) solution brings a total disorder on workload stability when there is not 32 but only one user is running in the database...

Looking on the Checkpoint Age graph - it's clear we're flushing too much here.. But WHY?...

Once again, the answers are coming from the source code:

  • For some reasons, Master thread supposing there was no activity during the last period and going to the background loop(!) while there is still one user non-stop bombarding the database with its queries...
  • In the background loop Master thread flushing pages fast supposing there is no user activity and killing my single user performance...

Solution:

  • Don't know if I'm right, but it looks like there are several bugs/improvements to fix/make here..

  • First of all the IO capacity is not used within a background flush loop, so it may be a source of performance impact when the storage is shared between several MySQL instances...

  • Then - the user activity within a Master thread is verified via srv_activity_count variable, and seems in some cases it remains not increased (or probably my changes involved such a situation), so by increasing its value inside of the 1sec loop if there were changes in redo log fixed the problem!

  • Finally - in many places operations with a server activity counter are protected by the kernel_mutex, while all information we need about is to know if there was any changes or no, so doesn't matter how much it was increased, it's just compared to its previous value, and only if it's the same - Master thread is going to the background loop... - So WHY protect it by the kernel_mutex while it's already so hot?... :-))

Applying all these changes, I've got the following results:

As you see, it's way better!! :-))

And curiously it's even better now comparing to XtraDB:

Auto Adaptive IO Capacity

There is still possible that even well tuned IO capacity configuration may meet cases when the IO capacity setting may be lower than needed.. And as I said, it should be configurable and up to user decision if writes should be free of limit or not.. In my case I've tested an auto adaptive IO capacity which is incremented progressively every time the redo log free space become close to critical levels..

Here is a test I've made to see if over the same period of non-stop Read+Write activity 5.5.4+ will still be able to keep the same performance level if I'll disable max purge lag limit (but purge thread active) - if you remember, the result initially for 5.5.4 was 17.000 TPS, and 18.000 TPS for XtraDB within the same conditions - and now for modified 5.5.4+ :

As you can see:

  • Performance level remains stable and reaches now higher TPS numbers even than XtraDB :-) and we're not far from the results obtained without using Purge Thread! (similar to what we saw before with MySQL 5.4)
  • Auto adaptive IO capacity works well to keep a balance an avoid entering a critical redo log occupancy - BTW, at the end of the test it reached 3600(!) in max - while usually 1500 was enough :-)
  • The History Length is reaching 14M(!) at the end of the test, and all these background problems are coming from here - there is more and more unpurged data and every operation become slower and slower.. - Curiously there is not too much more data written to the disks over a time, but InnoDB has to write more over one flush batch to keep enough free space in redo logs...

So far, the work continues, and it's time for vacations now :-))

Any comments are welcome!


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: Using Performance Schema

Май 20th, 2010

Two weeks ago I was very lucky to meet in Paris Marc Allf (one of developers of MySQL Performance Schema) and get all details about Performance Schema directly from the source :-)

I would say I'm very excited about this feature giving so enormous potential to observe any activity inside of MySQL, as well by a great flexibility of the proposed solution! :-)

However, discussing with Marc and going over a day more and more in depth, I've got some remarks that I needed to group, summarize and then express (and vacation time helping here a lot ;-)) My points will not be about adding some new extensions, but rather fixing some core issues which I consider important (and, please, correct me if you're not agree, express your ideas, etc - it's one of the cases where a quantity of opinions will be easily transformed into the quality solution for sure :-))

The following stuff is not a HOWTO, rather just some ideas based on observations during a test workload. For more information about Performance Schema don't miss an excellent documentation published on MySQL site, as well a really nice example with InnoDB was posted by Jimmy Yang. So, sorry if I'll miss/skip some details - it's already hard to make it short :-)


Performance Impact

First of all I was curious to see the performance impact of using Performance Schema on a heavy database workload. So for my experiments I've used dbSTRESS scenarios with a constant Read-Only or Read+Write workload with 32 concurrent users on the server with 16 cores. InnoDB storage engine was used for all tables during the tests.

Following cases were tested as a start point:

  • MySQL server was even not compiled with Performance Schema (None)
  • Performance Schema was set OFF (performance_schema=0 (default))
  • Performance Schema was set ON (performance_schema=1)

The following graphs are representing observed TPS levels on dbSTRESS corresponding to each case (tests were executed sequentially, one after other).

Read-Only:

As you can see:

  • there is no impact on Read-Only until the performance_schema is not set to 1
  • once activated, the impact on the Read-Only workload is still minimal - from 30.000 TPS we're going to 28.000 TPS, which is representing 7% (well, I'd prefer less, but 7% is still not too bad :-))

Read+Write:

Thing are changing on Read+Write:

  • still no impact until Performance Schema is not enabled
  • once activated, the impact here is more important - from 17.000 TPS we're going to 13.500-14.000 TPS, which is near 20% (!) degradation and not good at all..
  • on the same time it's easy to understand - the internal MySQL/innoDB contention during Read+Write workload is already very high, so once you add an additional instructions within a critical parts of code it may only become more higher..

Let's see now if a such performance impact can be reduced..

Performance Schema has several "SETUP" tables giving an easy way to change default configuration and adapt accounting setup conditions/rules according your need. One of such tables is "SETUP_CONSUMERS". The default setting (contents) is:

mysql> select * from SETUP_CONSUMERS;
+----------------------------------------------+---------+
| NAME                                         | ENABLED |
+----------------------------------------------+---------+
| events_waits_current                         | YES     |
| events_waits_history                         | YES     |
| events_waits_history_long                    | YES     |
| events_waits_summary_by_thread_by_event_name | YES     |
| events_waits_summary_by_event_name           | YES     |
| events_waits_summary_by_instance             | YES     |
| file_summary_by_event_name                   | YES     |
| file_summary_by_instance                     | YES     |
+----------------------------------------------+---------+
8 rows in set (0.00 sec)

Currently there are 3 account categories:

  • high level "summary" tables containing aggregated information by instance, event name, etc.
  • middle level "history" tables keeping last series of wait events
  • and the lowest level "current" table keeping wait events happening right now

By changing the value of "ENABLED" column to YES or NO you may enable or disable corresponding accounting within Performance Schema. By disabling "events_waits_current" you'll disable a whole accounting (equivalent to disabling Performance Schema). However having "history" accounting disabled makes still possible to account in "summary" aggregates. So the "events_waits_current" is the lowest possible level of accounting and playing the main role for all others.

Now, how the performance impact will be changed if I'll enable only the "events_waits_current" accounting?

mysql> update SETUP_CONSUMERS set ENABLED= 'NO' where NAME != 'events_waits_current';
Query OK, 7 rows affected (0.00 sec)
Rows matched: 7  Changed: 7  Warnings: 0

mysql> select * from SETUP_CONSUMERS;
+----------------------------------------------+---------+
| NAME                                         | ENABLED |
+----------------------------------------------+---------+
| events_waits_current                         | YES     |
| events_waits_history                         | NO      |
| events_waits_history_long                    | NO      |
| events_waits_summary_by_thread_by_event_name | NO      |
| events_waits_summary_by_event_name           | NO      |
| events_waits_summary_by_instance             | NO      |
| file_summary_by_event_name                   | NO      |
| file_summary_by_instance                     | NO      |
+----------------------------------------------+---------+
8 rows in set (0.00 sec)

mysql>


Read+Write (results including a "current-only" accounting):

Observations:

  • TPS level is now around of 15.000(!) - it's better, but still 12% degradation comparing to 17.000 TPS
  • I've added a mutex waits graph here based on information reported by InnoDB (show mutex)
  • As you may see in all cases the workload is starting by jump of waits on index mutex, then drop down, then in few minutes waits become stable...

Let's get a more close look during a "stable" period:

Observations:

  • Comparing to the "normal" state, we may see many mutex waits increased once Performance Schema is enabled
  • kernel mutex waits are increased by x3 from ~1000 to ~3000 (srv/srv0srv.c)
  • buffer mutexes waits are also increased, but the kernel mutex waits should be the main factor here (and seems the index mutex waits were decreased due contention moved to the kernel mutex)..
  • using only "current" accounting reducing mutex waits, but still has %12 overhead..
  • again - on the lowest possible instrumentation level there is already %12 performance degradation!

So, let's see more in details what's going here with those mutexes ;-)

Having only "current" accounting is probably useful for live debugging, but for a general observation I'll need to activate at least "events_waits_summary_by_event_name":

mysql> update SETUP_CONSUMERS set ENABLED= 'YES' where NAME = 'events_waits_summary_by_event_name';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from SETUP_CONSUMERS;
+----------------------------------------------+---------+
| NAME                                         | ENABLED |
+----------------------------------------------+---------+
| events_waits_current                         | YES     |
| events_waits_history                         | NO      |
| events_waits_history_long                    | NO      |
| events_waits_summary_by_thread_by_event_name | NO      |
| events_waits_summary_by_event_name           | YES     |
| events_waits_summary_by_instance             | NO      |
| file_summary_by_event_name                   | NO      |
| file_summary_by_instance                     | NO      |
+----------------------------------------------+---------+
8 rows in set (0.00 sec)

mysql>     

Then reset the summary per name accounting (the table corresponding to this data is called "EVENTS_WAITS_SUMMARY_BY_EVENT_NAME", so to reset you just need to truncate this table):

mysql> truncate table EVENTS_WAITS_SUMMARY_BY_EVENT_NAME;
Query OK, 0 rows affected (0.00 sec)

Then restart my test again ;-)

Observations:

  • Still 15.000 TPS!
  • means by wait name aggregation is not decreasing performance!

But what can we discover now via Performance Schema?..

Let's see the top 7 wait events during Read+Write workload:

mysql> select EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT 
from EVENTS_WAITS_SUMMARY_BY_EVENT_NAME 
where count_star > 0 
order by SUM_TIMER_WAIT desc limit 7; 
+--------------------------------------------+------------+------------------+
| EVENT_NAME                                 | COUNT_STAR | SUM_TIMER_WAIT   |
+--------------------------------------------+------------+------------------+
| wait/synch/mutex/innodb/kernel_mutex       |  158217608 | 1232782089422208 |
| wait/io/file/innodb/innodb_log_file        |   18585756 |  321430981380938 |
| wait/synch/mutex/innodb/buf_pool_mutex     |   81590186 |  110569541099368 |
| wait/synch/mutex/sql/MDL_map::mutex        |   36280653 |  101617775779338 |
| wait/synch/mutex/sql/LOCK_open             |   71053737 |   39460113468230 |
| wait/synch/mutex/innodb/buffer_block_mutex |  149481944 |   21384268657882 |
| wait/synch/mutex/sql/LOCK_global_read_lock |   52214671 |   14988152509694 |
+--------------------------------------------+------------+------------------+
7 rows in set (0.00 sec)

That becomes very interesting:

  • As I already supposed, having the highest number of waits doesn't yet mean the lock is having the hottest contention.. - but the waited time does! :-)
  • "kernel_mutex" and "buffer_block_mutex" are having a similar number of waits, however a time spent on the "kernel_mutex" waits is over x100 times higher!
  • BTW, we even don't see index mutex int the top 7 wait times ;-)
  • So, the "kernel_mutex" is keeping here the first place
  • While the second one is on the redo log file writes! - however still 10 times lower than kernel_mutex :-)
  • Then buffer pool mutex is sharing the third place with MDL_map mutex waits..
  • Whet is interesting here also: the contention on LOCK_open was resolved by introducing of MDL in 5.5.4, however according to what I can see here, the contention is moved now in MDL :-) means still need to be fixed..

Now, can I believe to what I see here?.. - the kernel mutex seems to be the hottest, but how can we be sure it's not due instrumentation code?..

Let's see what will be the result if we'll disable any mutex accounting. To do this we need to operate first with the SETUP_INSTRUMENTS table (for every NAME it keeps the setting state for ENABLED and TIMED):

mysql> update SETUP_INSTRUMENTS set ENABLED = 'NO', TIMED = 'NO' where NAME like '%mutex%';
Query OK, 121 rows affected (0.00 sec)
Rows matched: 121  Changed: 121  Warnings: 0

The new result is here:

Observations:

  • We're now near 16.000 TPS! - means only 6% in performance degradation (which is way better, but we're missing all information about mutexes)..
  • It also demonstrates the impact of mutex instrumentation within Performance Schema - there is something to optimize for sure..

Let's see what are the top 7 waits now:

mysql> select EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT 
from EVENTS_WAITS_SUMMARY_BY_EVENT_NAME 
where count_star > 0 order by SUM_TIMER_WAIT desc limit 7; 
+------------------------------------------------------------+------------+-----------------+
| EVENT_NAME                                                 | COUNT_STAR | SUM_TIMER_WAIT  |
+------------------------------------------------------------+------------+-----------------+
| wait/io/file/innodb/innodb_log_file                        |   19467181 | 337391150792878 |
| wait/synch/rwlock/sql/LOCK_grant                           |   26426964 |   5980185397914 |
| wait/synch/cond/sql/Query_cache::COND_cache_status_changed |     182386 |   5504284262672 |
| wait/io/file/innodb/innodb_data_file                       |      24829 |   3915749095700 |
| wait/synch/rwlock/innodb/buf_block_lock                    |     286027 |     23545903550 |
| wait/synch/rwlock/sql/LOCK_system_variables_hash           |        128 |        23642138 |
| wait/synch/rwlock/sql/LOCK_dboptions                       |         32 |         8600878 |
+------------------------------------------------------------+------------+-----------------+
7 rows in set (0.00 sec)

Observations:

  • Waits on redo log writes now on the first position
  • LOCK_grant on the second (is it normal it's called so often?..)
  • And I was really surprised to see waits related to query cache while it's disabled! - is it normal too?..
  • And only then we're hitting waits on the data files writing... - hmm.. very interesting..

So, I was curious to see if TPS wil be better if I'll disable now the accounting on "rwlock" names too?..

mysql> update SETUP_INSTRUMENTS set ENABLED = 'NO', TIMED = 'NO' where NAME like '%rwlock%';
Query OK, 20 rows affected (0.00 sec)
Rows matched: 20 Changed: 20 Warnings: 0

The new result:

Still the same 16.000 TPS. And the top 7 waits are now:

mysql> select EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT 
from EVENTS_WAITS_SUMMARY_BY_EVENT_NAME 
where count_star > 0 order by SUM_TIMER_WAIT desc limit 7; 
+------------------------------------------------------------+------------+-----------------+
| EVENT_NAME                                                 | COUNT_STAR | SUM_TIMER_WAIT  |
+------------------------------------------------------------+------------+-----------------+
| wait/io/file/innodb/innodb_log_file                        |   19324916 | 334853972005878 |
| wait/synch/cond/sql/Query_cache::COND_cache_status_changed |     181392 |   5652267868910 |
| wait/io/file/innodb/innodb_data_file                       |      24876 |   3727374241032 |
+------------------------------------------------------------+------------+-----------------+
3 rows in set (0.00 sec)
  

Observations:

  • Waits on "rwlock" names are gone as expected
  • It did not make any big changes on other waits, so we can be sure they were not provoked by "rwlock" instrumentation background effects :-)
  • Once again, strange to see a query cache related waits.. :-)

To be sure now the problem is not coming from TIMED accounting, I'll try a test with all NAME accounting enabled but with disabled time accounting:

mysql> update SETUP_INSTRUMENTS set ENABLED = 'YES', TIMED = 'NO';
Query OK, 217 rows affected (0.00 sec)
Rows matched: 217 Changed: 217 Warnings: 0

The new result:

Hmm.. - We're back to 15.000 TPS again?...

And what about top 7 wait numbers now? -

mysql> select EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT 
from EVENTS_WAITS_SUMMARY_BY_EVENT_NAME 
where count_star > 0 order by 2 desc limit 7; 
+---------------------------------------------------------+------------+------------------+
| EVENT_NAME                                              | COUNT_STAR | SUM_TIMER_WAIT   |
+---------------------------------------------------------+------------+------------------+
| wait/synch/mutex/innodb/kernel_mutex                    |  155087534 |  126763742418976 |
| wait/synch/mutex/innodb/buffer_block_mutex              |  146812631 |  165945812893432 |
| wait/synch/mutex/innodb/buf_pool_mutex                  |   85353157 | 2081293853476058 |
| wait/synch/mutex/sql/LOCK_open                          |   69641679 |   53903993714220 |
| wait/synch/mutex/sql/LOCK_global_read_lock              |   51163779 |   39601211521560 |
| wait/synch/mutex/sql/MDL_map::mutex                     |   35567564 |   27529970557804 |
| wait/synch/mutex/sql/Query_cache::structure_guard_mutex |   25581890 |   19800605760780 |
+---------------------------------------------------------+------------+------------------+
7 rows in set (0.00 sec)

Observations:

  • First of all I'm surprised to not see a zero value in the SUM_TIMER_WAIT column! - is it a trash and should be ignored? or is it a bug and there is still some time accounting is happening even when it's disabled?..
  • There is definitively something going not optimal - the accounting of number of waits should be the most light weight and should not bring a such important performance degradation!.. - it's seen with a similar "show mutex" within InnoDB and there is no reason why it cannot be done within Performance Schema too..
  • However, if I can believe what I see - the wait numbers are very interesting here :-) Because LOCK_open & MDL are still here ;-) and curiously disabled query cache too ;-)

Now.. - What if the main problem is coming from the hottest "kernel mutex"?.. Let's disable accounting just on the "kernel mutex" and see how it'll change things:

    
mysql> update SETUP_INSTRUMENTS set ENABLED = 'YES', TIMED = 'YES';
Query OK, 217 rows affected (0.00 sec)
Rows matched: 217  Changed: 217  Warnings: 0

mysql> update SETUP_INSTRUMENTS set ENABLED = 'NO', TIMED = 'NO' where NAME like '%kernel_mutex%';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

The new result is here:

Observations:

  • The result is not really better (or just slightly better) - still around of 15.000 TPS
  • That means the performance impact is somewhere inside of the mutex instrumentation globally within Performance Schema..

And what are the top 7 waits now?..

    
mysql> select EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT 
from EVENTS_WAITS_SUMMARY_BY_EVENT_NAME 
where count_star > 0 order by 3 desc limit 7; 
+---------------------------------------------------------+------------+-----------------+
| EVENT_NAME                                              | COUNT_STAR | SUM_TIMER_WAIT  |
+---------------------------------------------------------+------------+-----------------+
| wait/io/file/innodb/innodb_log_file                     |   18687653 | 328447818062000 |
| wait/synch/mutex/innodb/buf_pool_mutex                  |   82809405 | 122379714665884 |
| wait/synch/mutex/sql/MDL_map::mutex                     |   35427655 | 100051628180218 |
| wait/synch/mutex/sql/LOCK_open                          |   69335282 |  38925098359264 |
| wait/synch/mutex/innodb/buffer_block_mutex              |  148084516 |  21364777588960 |
| wait/synch/mutex/sql/LOCK_global_read_lock              |   50918129 |  14015063815792 |
| wait/synch/mutex/sql/Query_cache::structure_guard_mutex |   25459066 |  10364875712542 |
+---------------------------------------------------------+------------+-----------------+
7 rows in set (0.00 sec)

Observations:

  • The top wait time now is on redo log writes (and I'm curious to understand if it's really true.. - because I've used innodb_flush_log_at_trx_commit=2 setting and my redo log files are placed on SSD volume.. - will be interesting to observe if it'll be still similar if I'll use a RAM disk instead ;-))
  • Curiously MDL+LOCK_open wait time is not far from buffer pool mutexes - probably it's still need to be improved..
  • Anyway, it's quite positive to bring to the light other waits than InnoDB mutexes - it's the part of puzzle which was always missed before, and now we may compare servers as apples to apples without platform dependency! ;-)

So far, what can be improved here?..

Performance and Instrumentation

Usually performance analyzing and application tracing/instrumentation are very often going together. Because if you really need to understand what's going wrong inside of your application you need to trace it. From the other hand, if your tracing solution has an important impact on performance - very quickly you'll no more able to confirm you're tracing a real problem, or a problem may just may be gone due tracing as some critical parts will run slower and not making contention anymore, etc...

Let's start with a dumb example :-)

Here is a dumb C code I'm using usually to explain a potential impact of DTrace while tracing an unknown binary:

#include <stdio.h>

#define MAX  20000
#define LOOP 100000

main( int argc, char *argv[])
{
 int i;
 long t1, t2;

 printf( "My PID: %d\n", getpid() );
 sleep( 15 );
 puts( "Start.." );

 time( &t1 );
 for( i= 0; i < MAX; i++ )  fun();
 time( &t2 );

 printf( "Time: %d sec.\n", t2 - t1 );
}

fun()
{
 f1(); f2(); f3(); f4(); f5();
}

f1() { int i,n; for( n= 0; n < LOOP; n++ ) i*= n; }
f2() { int i,n; for( n= 0; n < LOOP; n++ ) i*= n; }
f3() { int i,n; for( n= 0; n < LOOP; n++ ) i*= n; }
f4() { int i,n; for( n= 0; n < LOOP; n++ ) i*= n; }
f5() { int i,n; for( n= 0; n < LOOP; n++ ) i*= n; }

This program has no real goal :-) It just execute function "fun()" MAX times. Then function "fun()" executes functions f1,f2..f5 - each of them is just doing LOOP times a loop on multiplication. So int total there will be executed a number of loops equal to: MAX * 5 * LOOP. And from the application point of view nothing will be changed if I'll increase the MAX by 10 and decrease the LOOP value by 10 too - there will be still the same number of loops executed, and the execution time will not vary to much due such changes..

Now let's see what will happens if I'll trace this program with DTrace supposing I have no idea what this program binary is doing and just want to discover what kind of functions are executed inside, how often, and how much time we spent inside..

I'll use the following (classic) DTrace script here:

#!/usr/sbin/dtrace -s

pid$1:a.out::entry
{
 self->t[probefunc]= timestamp;
}

pid$1:a.out::return
/self->t[probefunc]/
{
 elapsed= timestamp - self->t[probefunc];
 @s[probefunc]= sum(elapsed);
 @c[probefunc]= count();
 self->t[probefunc]= 0;
}

Script is just taking a PID as argument, and then by Control-C is printing the list of functions called (how many times, and a time spent within each function). The program is printing its PID on the start and waits 15 sec to leave me a time to start DTrace script before it will start looping :-)

So far, without tracing, the program is taking 31 sec to execute on my old AMD box. And 32 sec under DTrace script tracing. Well, 32 vs 31 sec is not too much, right? Let's see now how the time will be changed if I'll change MAX and LOOP values:

MAX LOOP Normal Time Under tracing
20000 100000 31 sec 32 sec
200000 10000 31 sec 37 sec
2000000 1000 32 sec 73 sec
20000000 100 32 sec 372 (!) sec

As you cant see, in worse cases the binary took over x2 times, then over x10 (!) times more to do the same work!... - Why?.. - just because the code added on the fly by DTrace instrumentation has a cost! - and by reducing the time we stay inside of the each function we amplified it by so much!..

Of course, we may also simplify the trace script and trace only call numbers:

#!/usr/sbin/dtrace -s

pid$1:a.out::entry
{
 @c[probefunc]= count();
}

But the thing will still not be too much better - in the worst case instead of 372 sec we'll get 122 sec, so still near x4 times slower than a normal execution time of the same program!

While DTrace is an absolutely great tool, in the current case you have to understand what you're doing - you're introducing additional instructions into your code, and more short the portion of your code will be - more important performance impact you'll see! Dtrace has no idea about your binary. But if you have - you may do it in right way :-)

For example here a simple adding of counter++ inside of each function should not bring any important slow down on execution :-))

Now what about MySQL and Performance Schema?..

MySQL is not an unknown binary for Performance Schema instrumentation :-) so it can be done in much more lightweight way comparing to the previously presented observations!

For example if we take "kernel_mutex" - in many places it's called just to protect one or few instructions! (well, probably the usage of this mutex should be also optimized :-)) but we're speaking about tracing and instrumentation for the moment :-))

So, how the things may be changed?..

Counters - The first step should be done on accounting of the number of events - there is nothing more lightweight than a simple "count++" :-)) and there are many currently already introduced here and there inside of the MySQL code as well every storage engine. These all counters should be remapped to the same place to avoid to count the same information twice and should be always accounted, without regarding if Performance Schema is enabled or not! Keep in mind that operation:

if( tracing ) count++;

is more costly rather simple:

count++;

That's why counters should be always available without any check condition - it gives a lower performance impact :-))

Another point - it'll be also fine to have not only a number of waits, but also a number of access/calls (it'll be probably just another count++ in another place, but will give a great overview of many other issues: which file/table is accessed more than others, which mutexes are the most involved, etc.)...

Time accounting - as you saw previously, having a high number of some wait events is not yet meaning we found the source of performance problem.. - some other events may have lower wait numbers, but way more high summary wait time! - and in many cases timing is the only way to find the real source of the problem. From the other hand, time measurement is much more costly than a simple counter. So this operation should be checked first by the "IF" condition (as it's done currently) and executed on demand for each event if its TIMED setting is set to YES (as it's done currently)..

Aggregation - is very useful, but may be very costly as well :-) so should be re-viewed and re-optimized again.. - in many cases aggregation may be done on demand and during the query execution (for ex. for permanently presented objects like mutexes/ files/ etc)..

Performance Schema access - currently we may access performance data and execute an SQL query only via a normal MySQL session.. - means we're in the same wait queue with other sessions to execute our query. I'll be much better to have say a dedicated port for Perf queries to access data in priority, or via SHM (as it made in Oracle for ex.) and access perf data live without introducing any activity into MySQL server processing (it'll be the most clean solution).

Call to action :-)

I think currently we're missing feedbacks from real tests / workloads whenever it's possible:

  • What kind of performance impact do you observe on your workload when Performance Schema is enabled?..
  • Are the numbers you're observing via Performance Schema reflecting your activity right?..
  • What kind of information you're missing?..
  • Etc. etc. etc. :-)

Any comments are welcome! :-)


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: Improving Stability

Май 3rd, 2010

Huge performance improvement was came with MySQL 5.5.4, and looking on the following picture it's very easy to see why:

It's a one hour Read+Write dbSTRESS workload with 32 sessions running non-stop on 16 cores server. The left part of the picture corresponds to MySQL 5.1 activity, and the right one to MySQL 5.5.4. The first graph represents TPS levels, and the second one - the mutex waits observed during each test.

Even without going in details you may see how dramatically were reduced mutex waits! And I would say one of the greatest InnoDB features was a compact presentation of the "show innodb mutex" output which gave the way to monitor InnoDB more in depth and understand its contentions on each workload! - it's still not perfect, but anyway a big step ahead :-) and helped a lot to improve 5.5.4.

From the mutex waits graph you may see that the most hot "visible" contention now is on the index mutex (well, the output "file:line" is probably better for debugging rather a normal use - it's one of the things to improve here, just to be more human friendly and show a mutex name instead, as well the code is already there and require a small format changes)..

From the other side, purge lagging is another problem - you may see the same (apples to apples) workload on MySQL 5.1, MySQL 5.5.4 and XtraDB 9.1:

The first graph is still representing the TPS level, and it's interesting to see the potential performance gap between 5.5.4 and XtraDB during the first 10-15 min of test (and if we stop the test here we'll simply miss the next problem and will not see decreasing performance with a time on both engines due purge lagging..). The second graph represents the InnoDB history length, and as you may see it's growing now even faster than before! :-) And the problem is supposed to be fixed with a purge thread.. But let's get a look on one problem at a time :-)

Index Mutex Contention

It'll be great just to get it fixed ;-) But as you know, an simple way to reduce contention on a single mutex is to split this mutex into several ones (so you'll split your contention too, and instead of having say a one single session working at a time - you'll have several sessions working, etc.). And currently MySQL has a nice feature - PARTITIONS! :-) By using partitions on a table having index mutex contentions will simply split this contention by number of partitions! (for example using 3 partitions will physically split your table into 3 tables (even you'll still see it as a single one), and instead of one index mutex you'll have 3 mutexes as well dividing your contention by 3 probably too :-))

However on the next picture you may see the performance impact on 5.5.4 by using 3 partitions on the hottest table during the same Read+Write workload as before:

As you may see, curiously TPS level did not change... And we may suppose that partitions would not help here if we did not have improved mutex status within InnoDB and did not see there is another mutex waits jumped to the top and become more hot once we used 3 partitions! (while index mutex was really readuced! - that's why monitoring is so important :-))

What about this new mutex waits? - this is a dictionary lock on stats update. The fix was already proposed by Percona, and is not yet present in the current 5.5.4 tree - but as it's just a one line changes I've tried to see what if it'll be done on 5.5.4:

As you can see, the TPS level is improved! (over 20% gain!), and the main "visible" mutex waits are now on the redo logs - which may probably be considered as expected contention for a database in general :-)

But where we're with a history length now?..

Purge Lagging

Now let's get a look on the same workload but from the stability perspective.. Having constantly growing history length is not normal and quite dangerous! As well having periodic performance drops is not better too :-) And the Purge Thread feature is the answer here. But let's look in details how it'll help - the following picture represents the same workload tested on:

  • MySQL 5.5.4
  • MySQL 5.5.4 +purge thread
  • MySQL 5.5.4 +3 partitions
  • MySQL 5.5.4 +3 partitions +purge thread

The first graph is TPS, the second one is History length:

As you may see, Purge Thread brings an important stability to the workload. However a single purge thread is no more enough to follow a purge demand here!..

Currently 5.5.4 implementation allows only one purge thread for the moment. But to analyze a performance impact we may try XtraDB which already allows several purge threads. The following picture represents the same Read+Write workload running on XtraDB 9.1 using 3 partitions and tested:

  • without purge thread
  • with 1 purge thread
  • with 2 purge threads

The first graph represents TPS level, and the second one History length:

As you may see having 2 purge threads are completely removing purge lagging here! However it's also significantly reducing performance..

Can do we better here?.. - let's try ;-)

For a long time InnoDB has a purge lag configuration option - you can limit purge lagging with innodb_max_purge_lag by setting it to some acceptable level of History length, then once this limit will be reached InnoDB will slightly slow down all arriving DML queries by doing a short sleep (few ms) before processing a query statement. The feature itself is not bad, and many file systems are having a similar solution to throttle a high demand write requests. The only problem with InnoDB purge lag feature that it's broken seems to me, and I've explained why and how to fix it . Now if I apply the fix to XtraDB and see the result ;-)

What I'm interesting here is to compare:

  • single purge thread + fixed purge lag
  • two purge threads

Of course fixed purge lag without purge thread cannot give us a stable result due all missed maintenance work missed by Master thread (as explained in the previous post ). However, having one purge thread and throttling writes via max purge lag setting may still give a better result than having 2 purge threads..

On the following picture you may see the same Read+Write workload running on XtraDB 9.1 and now comparing:

  • XtraDB without purge thread
  • XtraDB with a single purge thread
  • XtraDB with 2 purge threads
  • XtraDB with a single purge thread + fixed max purge lag

The innodb_max_purge_lag was set to 400K here:

As you see, the tandem of single purge thread + fixed purge lag gives a better result here than 2 purge threads - TPS level is higher while History length is constantly kept under 400K! :-)

Why on my point of view such a solution will be more optimal?

  • in many cases one purge thread will be probably just enough
  • however when it'll be not enough - your server will still be protected by the max purge lag setting!

As always, many things depends on workload, and your experience may be different of mine.. - but tests and observations will be done, better MySQL performance will be with a time :-)

Any comments are welcome! ;-)


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: 5.5.4 @dbSTRESS

Апрель 13th, 2010

Today we have something to celebrate - the MySQL 5.5.4 was announced! And a huge number of performance improvement made in this version make me very happy :-))

I've got an immense pleasure to participate in this story and seek for the most optimal solutions together with all these talented engineers from MySQL/Sun/InnoDB/Oracle teams (and in few months all Oracle :-)) - and I think it was a huge step ahead to see how well we may all work together :-))

And now I'm happy to share with you my results obtained with MySQL 5.5.4 on dbSTRESS benchmark. All tests were executed on the 32cores Intel server running Fedora 10 and having 128GB RAM and absolutely fast internal SSD drives (and if from the memory I've used only 16GB for the buffer pools, the SSD drives helped a lot to reduce any dependency on the I/O activity).

Test scenario :

  • Workload(s): Read-Only, Read-Write
  • Sessions: 1, 2, 4, 8, 16, 32, 64, 128, 256
  • Think time: 0 sec. (non-stop)

Tested InnoDB engine implementations :

  • MySQL 5.5.4
  • MySQL 5.1.45 with innodb plugin 1.0.6 (I did not test the integrated innodb because we've already demonstrated it last year  that 5.1 with integrated innodb is way slower comparing to any other InnoDB engines implementation (including innodb plugin :-))
  • XtraDB-9.1

Abbreviations used in graphs :

  • concurrency: innodb_thread_concurrency settings
  • suffix -prg1: means engine was started with a one purge thread (to see why separated purge thread is needed and what are the benefits you may read my benchmark report when we were the first to propose and published the purge thread patch.
  • X axis: number of sessions
  • Y axis: number of transactions per second (TPS)

Read-Only Workload

Read+Write Workload

Relaxed Read+Write Workload (10 Reads per Write)

I'll not go too much in details for the moment, but you may already see that:

  • MySQL 5.5.4 is reaching higher TPS levels than others
  • MySQL 5.5.4 is better prepared now to scale up to 32 cores

Of course it's not perfect yet, but there is already a huge progress was made! Even get a look on the last year's graphs and you'll see what I mean ;-)

More data will come, so stay tuned ;-)


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: Current performance levels of 5.4, XtraDB-8 and Innodb-1.0.4 @dbSTRESS

Октябрь 17th, 2009

This is a short post just to make a point on the current performance level of available InnoDB engine implementations. I've made the current testing not in the goal to isolate performance issues, but to see how well the code works now when it meets the business constrain to flush on every transaction!..

The problem is that I've tested recently Oracle 10g with dbSTRESS during several customer's demands and was curious how well MySQL competes now with Oracle :-)) Due Oracle license restrictions I cannot present any Oracle results here (sorry folks, I still need my job :-)) But I've put MySQL into the similar conditions and was curious to see what is the current performance level we may expect on the same platform..

Until now I've always used "innodb_flush_log_at_trx_commit = 2" setting as I expected to isolate internal bottlenecks rather my storage array - but now it'll be "innodb_flush_log_at_trx_commit = 1". Bin log is not used (let's suppose we have a very secure storage array :-))

Setting innodb_flush_log_at_trx_commit = 1 changing a lot of thing - the workload become fully I/O-bound! - all will depends now how fast redo log is able to flush to disk!

dbSTRESS Scenario :

  • 32 user sessions
  • Read+Write workload (RW=1), no think time
  • Read transaction contains 2 selects
  • Write transaction contains 3 independent auto-commint orders: DELETE, INSERT, UPDATE
  • Database contains 10M objects, 200M history records
  • Test duration: 1 hour

H/W configuration :

  • M5000 splitted in 2 zones: each zone attached to 16 own CPU cores, one zone runs MySQL, another dbSTRESS
  • Storage array: ST6140 - splitted into 2 RAID1 LUNs, one keeps /DATA, another /LOG, cache on controller helps a lot!

MySQL configuration :

[mysqld]
 table_open_cache = 8000

# files innodb_file_per_table innodb_log_group_home_dir=/LOG innodb_log_file_size=1024M innodb_log_files_in_group = 3
# buffers innodb_buffer_pool_size=12000M innodb_additional_mem_pool_size=20M innodb_log_buffer_size=8M
# tune innodb_checksums=0 innodb_doublewrite=0 innodb_support_xa=0 innodb_thread_concurrency = 0 innodb_flush_log_at_trx_commit=1 innodb_flush_method= O_DIRECT innodb_max_dirty_pages_pct=40
# perf special innodb_adaptive_checkpoint / innodb_adaptive_flushing = 1 innodb_read_io_threads = 16 innodb_write_io_threads = 16 innodb_io_capacity = 2000

Notes:

  • On XtraDB you have to disable innodb_adaptive_flushing to be able to use innodb_adaptive_checkpoint
  • The innodb_thread_concurrency is set to zero as the current thread concurrency model is not taking care if thread is doing an I/O operation or not (see InnoDB concurrency models for more details)..

Results

the same test was executed for:

  • InnoDB-plugin-1.0.4 (Innodb-4 on the graph
  • XtraDB-8
  • MySQL 5.4 default (5.4-def)
  • MySQL 5.4 patched (5.4-patched)

Observations:

  • Of course we are far here from previously observed 12.000 TPS :-) now we're flushing on each commit!
  • INSERT and UPDATE response times are making the difference
  • Surprisingly recently announced XtraDB-8 was performed slower rather InnoDB-plugin-1.0.4 - I think there should be something probably missed within the current code (specially when you discovering all new features prepared by Percona! :-)), but well, the gap is small, but present..
  • MySQL 5.4 default version performs slower too, and even if it jumps to the higher TPS number it's lacking of stability..
  • MySQL 5.4-patched - is the default version with separated purge thread patch applied (see the full report for more details) - it performs better than InnoDB-plugin-1.0.4 as well demonstrating a very stable performance due separated and non-blocking purge activity.

I think there are the last results with these versions, but I'll keep them as the reference for next new coming changes :-)

Any comments are welcome! :-)
(but, please, don't ask me about Oracle results :-))


PlanetMySQL Voting: Vote UP / Vote DOWN

Performance Trainings & Conferences in Kiev

Октябрь 2nd, 2009

Last week I've participated in several Performance Conferences and Trainings in Kiev. Events mainly were dedicated to the latest improvements within MySQL performance, as well Solaris and Oracle tuning / performance analyzing and best practices. The week was very hot :-) but I've enjoyed every day discussing with creative and smart people! People hungry on knowledge and ready to stay 2 hours more overnight outpassing all allowed timing - it was a real pleasure to share all I know! :-))

So, I'd like to thank all participants! As well Sun@Kiev and i-Klass teams who organized and prepared all these events! I may only wish to see such events more and more often, and don't forget to invite me again! :-))

NOTE : similar conference about MySQL performance improvement will be organized soon in Paris - so if you're interesting in (and speaking French :-)) - please, contact me ahead, we'll need to know the size of the auditorium to be able to have seats for everybody :-))

After all that said, here are few images from the "road notes" and my slides or other materials used by me during presentations (as promised)...

Kiev Polytechnic Institute

25 years ago I came here expecting to be accepted as a new student.. And spent 6 years of my life to study maths, computers, programming, as well many other topics (even if some of them were here only to develop your brain and did not have any real usage :-)) I was invited to discuss about of databases scalability and the latest improvement made in MySQL performance. The conference was prepared to celebrate the Software Freedom Day in KPI ...


After so many years, entering the Main campus building was so impressive.. The Main Auditorium in this building should still remember so many famous names ... Did I expect one day to find myself on the podium and presenting my work to others?.. :-)) Initial stress was very high.. but discussing and presenting to students was so fun - that in few minutes I've got a feeling I'm speaking with old friends, and everything goes just fine :-))

Presentations :

MySQL Performance conference @i-Klass

i-Klass is very well placed in Kiev near and face to the one of the most famous and beautiful Church in Kiev - Kiev Pechersk Lavra - simply must to visit if you're traveling in Kiev :-)) BTW, there is also a famous museum of micro miniatures (the world's smallest chess board you may see on following photo). Also, did I say you the food is absolutely excellent in Kiev?.. ;-)

i-Klass is a very good partner of Sun, and for such event they prepared their bigger conference hall accepting near 350 persons. The event itself was very successful as nobody left till the end :-)) as well a big part of participants preferred to stay one or two hours more again but do not miss the last uncovered topics :-) As well I was surprised to discover Percona presence even in Kiev! :-)) And I'm still waiting for photos from this event :-))

Presentations :

Solaris Performance Monitoring & Analyze @i-Klass

Two next days covered Solaris performance and Oracle-oriented system tuning. We expected 20 persons for each event as practice lab supposed to have a Solaris desktop on the desk and there was not too much place in the room. But finally we got 45 persons every day and most of people came with their own laptop running Solaris! :-))

I've enjoyed all the time during these trainings and was happy to see so many strong engineers loving Sun technology and proud of their choices! :-) Some partners and customers I've already met during their benchmarks in Paris :-) All other are always welcome to test and improve performance of their applications in our Sun Solution Centers !!!

Also, one of the fun events was a direct connection with our Paris office and presenting live (with my colleague Matthieu) the demo of Automated Dynamic Reconfiguration with LDOMs! It was absolutely fantastic seating in two different locations and work all together as a one single team! Seems we made a new step in our activity :-))

Note : Last day due Internet connection problems we did not finish our stress testing with Oracle in Solaris zones - all who still want to participate in the remote testing please contact me before next Tuesday, because on Tuesday we'll replay live all programmed tests! :-))

Presentations :

Bye, bye...

Thanks again to all participants and see you next time! :-))



PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: I/O Optimization & InnoDB

Сентябрь 9th, 2009

After my previous post about InnoDB Doublewrite Buffer impact I've received several feedbacks where people are claiming to see an important performance degradation when they enabling doublewrite... Discussing with them about their wokloads I've tried to reproduce a similar activity with dbSTRESS tool. And finally I was happy (or unhappy - all depends on the side :-))

  • Happy because I was able to reproduce the problem :-)
  • And unhappy because I've got 30% performance degradation by activating doublewrite buffer!

And now I'll tell you a story how to fix it :-))

Sorry, it'll be little bit long (but it was hard to make it shorter :-))
First of all let's get a look on the I/O activity and possible solutions to make it running more optimally...


I/O Requests


Generally we may classify I/O requests by:
  • I/O operation: Read or Write
  • I/O nature: Sequential or Random
  • I/O mode: Buffered or Synced (opened with O_SYNC / D_SYNC or followed by fsync()/fdatasync())
  • etc..

And each case has its own solution:

Sequential Write (SW) :

  • if your goal is throughput - main rule is to use big I/O blocks (or group several small writes into a big one - may be done by application /filesystem /storage) - latency will be less important here, so even with a simple disk array you still may obtain a very decent performance
  • if your goal is operations/sec - use the smallest possible I/O blocks and having a write cache on your storage will help even more (latency will have a big role here and usually it's the only way to have a write response time less than 0.1ms)
  • in any cases having write cache on your storage box will help a lot (specially if you also have a smart controller chip which may group and execute several writes in parallel, etc.)

Sequential Read (SR) :

  • if your goal is throughput - use bigger I/O blocks
  • in any case to speed-up your SR activity you'll need a kind of "prefetch" available on your I/O subsystem - it means that next blocks will be read ahead and placed in buffer even before application requests them (it may be done by application itself, by filesystem, as well by storage box)

Random Write (RW) :

  • if your goal is throughput - something wrong with your application design :-))
  • using smaller I/O blocks giving better performance
  • grouping when it possible several writes into a single one helps too
  • having write cache on the storage box helps a lot - it may delay a real write to disk operation, group and transform several incoming I/O requests into fewer operations, as well process several writes in parallel (depending to controller), etc.

Random Read (RR) :

  • this is the worse case because in most of cases you really need to read a data block , means 5ms penalty!
  • prefetch is not possible here due a random nature of I/O operations..
  • so the only way you have to speed-up RR requests is to keep them caches as much as possible !
  • in many cases there is no way to cache all needed data, so 5ms penalty in this case is inevitable...
  • RR latency may be greatly reducing by using SSD arrays, but it's still remain very costly (but cheaper then RAM :-))

As you may see, the Random Read case is the most problematic and should be avoided whenever possible.

Time to time I'm giving Performance Trainings to Sun engineers as well to Sun customers and partners, and I give them an example of RR impact observed during one of customers benchmarks:

  • imagine you have an I/O subsystem giving you 30.000 random writes/sec
  • how much you may estimate a performance drop on I/O operations/sec if %5 of your writes will now become random reads?...

What do you think?..

When I'm saying what I've observed was x3 times (!) worse (only 10.000 I/O op/sec) - it's hard to believe.. However it's easy to count:

  • due excellent work of controller + write cache pair a random write operation was executed within 0.1ms
  • so 100 I/O operations (writes) were executed within 100 x 0.1 = 10ms
  • now 5 of them become random reads, and each RR costing ~5ms...
  • 100 I/O operations will take now: 5 x 5 + 95 x 0.1 = 34.5ms

Now, let's come back to the issue with doublewrite buffer...


Doublewrite Buffer


So, what may go wrong when doublewrite buffer was activated?..

Workload described in my previous post was prepared mainly to test MySQL code scalability - it avoiding to be much depending on I/O slowness and sized enough to keep a whole data workset cached (and again there are some other issues, but I'll investigate them another day). What's important here - after some time (once data are well sitting in cache) there is no more I/O read requests involved by database!

So what's the I/O activity during this workload:

  • there are redo log writes - SW nature, which are very well cached by my storage box which give me 0.1ms service time per write operation
  • there are buffer flush list writes - RW nature, but even having configured 16 write threads for InnoDB I never getting more than 2 write operations on the same time (seems to me lock contentions are still quite hot inside of InnoDB) - so my storage box performs pretty well here keepping reasonable service time and writes/sec level
  • database performance keeps stable ~11.000 TPS

What's changing once I've enabled a doublewrite?

  • same amount of pages flushing from the buffer pool will be also written in parallel into doublewrite buffer
  • doublewrite buffer is written to disk sequentially(!) - which means near 0.1ms in my case and should go much more faster comparing to buffer flush writes which are random
  • so fast sequential writes going in background of random writes should not decrease performance too much
  • and it's exactly what I've observed! :-))

Testing more aggressive workload

And now - I'm changing the workload conditions. I'm increasing the working set limits in way that clients accessing more wide object references and finally only 10% of data (in better case) may be kept in the buffer pool (even it's quite big). The workload become more close to realistic and more I/O-bound - it's probably more common case for production environment :-) It'll not help to understand scalability issues anymore, but will help to understand InnoDB I/O activity! :-))

So well, what is changed?..

  • Database is starting with an empty buffer pool and should really read hard to get some data cached
  • The starting performance level is ~1500 TPS now, but it's growing with time as anyway the probability to find data in cache is growing from 0% to 10% anyway :-))
  • Random Read remain the main I/O activity during whole test (and the main bottleneck as well)
  • Once there is no more free pages in the buffer pool performance is not increasing anymore, but keeping stable...
  • Final performance level is around of 3500 TPS (whole test duration is one hour)
  • So due RR impact workload is running x3 times slower - which is not surprising...

And if we replay the same test with doublewrite enabled?..

  • Due doublewrite I/O activity is more high in volume - but that is as expected :-)
  • But performance level reaching only 2500 TPS - this was not expected.. At least I did not expect :-)) I expected to see the same TPS numbers! :-))

So, what's going wrong?..

Looking more in details..

  • We may see that workload is already I/O-bound on random reads
  • There are random reads! - means you may estimate 5ms cost per I/O operation
  • And now between these 5ms reads we're trying to insert our double writes... ;-)
  • And even writes are fast, but they are entering in concurrency with RR on storage box controller
  • And as I said before - RR data should be really read from disk which is increasing overall latency, and our double writes become not as fast as expected which is creating additional penalty on performance

Any way to fix it?..

My workaround to fix it:

  • Initially during all my tests I've always separated InnoDB datafiles (/DATA) from redo logs (/LOG) to the different storage volumes (if not different controllers, but at least different disks) - I've made it for all reasons I've explained before (to avoid RR/RW of data pages add any latency to redo log sequential writes)
  • As well I'm using a file per table for InnoDB to split file access (POSIX standard imposing for filesystems to accept only one write operation at any given time for any given file - so it's not really a good idea to keep all InnoDB tables within a single datafile ;-))
  • Now - when we activated a doublewrite: where are double writes are going?.. - they are all going into the system datafile!!
  • And when is system datafile is placed?.. - right, it's created within the same /DATA filesystem ;-)
  • So, what if we move it now to the /LOG filesystem and instead to mix it with RR pages will just add its sequential write activity to the redo log writes? - as storage write cache performed really well for SR operations until now, it should not be a problem ;-)
    $ mv /DATA/ibdata1 /LOG
    $ ln -s /LOG/ibdata1 /DATA

Observations :

  • I/O write volume is really doubled now
  • But double writes are not mixed anymore to page I/O activity and not lowered redo log writes either!
  • And you know what? - I've got back my 3500 TPS ! :-))

Going more far..

  • The same workaround may help to improve other SW activity involved within InnoDB
  • Datafiles are almost accessed in Read and/or Write, random or sequential - we may help here only by using a powerful enough storage
  • However doublewrite and binlog are very good candidates for improvement! (and probably some other?)
  • Why I call it workaround for the moment? - I prefer to be able to do all this stuff directly from InnoDB - and after that I'll call it solution ;-)


Binlog


And what about binlog?..
  • Binlog is using sequential writes
  • Each write operation is synced to disk (or not - depends if you're expecting to recover all your data or not :-))
  • Binlog files are created and growing dynamically until they reach max_binlog_size

Potential problems:

  • By default binlog files will be created within the same directory where your datafiles are ;-)
  • Why it's bad?.. - first of all it's bad because if your /DATA storage is crashed you'll be unable to replay your binlogs on your latest backup :-) - and the second reason is the same I/O impact as we saw previously with doublewrite :-)
  • Then binlog file is never created ahead...
  • And the problem here is that on many filesystems each single write() increasing file size will be transformed to 2(!) physical writes! (one for data and one for ~infrastructure~) - so very easily you may do 2 times more writes on your storage than you should! (unless your filesystem is managing blocks by extents like ZFS, VxFS, QFS, etc..)
  • But in any case - why not to create binlogs ahead??? - space may be allocated in background by sequential writing of big blocks - it should be very fast! - then binlog writes will run faster for sure whatever filesystem you're using!

Suggestions


To summarize, here is the current list of my suggestions:
  • Have a separated file for doublewrite! - and then place it according a user setting (for ex. use some kind of innodb_doublewrite_file option) - and it'll be a real solution! :-)
  • Create binlog files ahead! - and probably add an option how many files should be created ahead (it's much more easier to create a 1GB binlog before it'll be used as there will be no any kind of sync impact!)
  • Probably there are also other InnoDB sequential write activities which are also meriting to be placed into the separated file(s) - what do you think?..

Any comments are welcome! :-)


Few Graphs...


Just few graphs at the end comparing InnoDB activity when Doublewrite Buffer is set OFF and then ON.

Few comments :

  • I've increased a buffer pool that time to be not out of free pages within one hour :-)

  • After one hour of activity performance level reached ~5000 TPS

  • Similar performance level is also observed when doublewrite buffer is ON

  • I/O write amount is 2 times higher when doublewrite buffer is ON, however the I/O write operations are remaining the same! (so Sarah was right :-))

  • With a growing performance level, purge activity become more and more important, and finally there is no more sleeps within a purge loop - so having a separated purge thread is THE MUST! :-) (see here for more details why)

  • Current I/O-bound workload performing better if dirty pages are flushed when reaching a dirty pages percentage limit and NOT by adaptive/ahead flushing/checkpoint - Why? - because in this case it involves a single flush batch with a full I/O capacity - and as I/O subsystem is already very hot this batch will be unable to create a burst activity - it'll slowly write pages in background, then max dirty percentage will be reached again and so on.. - what's important it'll call flush batch much more less times then any other of available (ahead) methods - but again, it's only possible if your purge processing is splitted into another thread! :-))

TPS level and Response times


Dirty pages flushing


Purge activity


I/O Activity and Checkpoint Age


InnoDB Buffer Pool, History list length and Log writes/sec

Any comments are welcome! :-)

PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: Final fix for Ahead Flushing & Purge Lag

Август 17th, 2009

The goal of this post is to tell you a final story about performance study looking to fix the Ahead FLushing and Purge Lag issues.. (I've tried to make it short, but if you feel you'll need more details or just have some more time for reading - go directly to the full report: http://dimitrik.free.fr/db_STRESS_MySQL_540_Purge_Lag_and_Ahead_Flushing_Fixed_Aug2009.html  :-))

Before to tell you the story about the final solution, let me show you again the TPS level graph obtained during my last tests:

 

The read+write workload is executed during one hour non-stop on 4 engines (one by one):

  • MySQL 5.4
  • InnoDB plugin-1.0.4
  • MySQL 5.Perf build #45
  • XtraDB-6

As you may see from graph, all engines are getting a periodic performance drops!!

NOTE :

  • InnoDB plugin-1.0.4 is using a recently added "Adaptive Flushing" feature !
  • XtraDB-6 is using the famous Percona's "Adaptive Checkpoint" !

So, WHY do they also getting performance drops???...

And let me repeat again - just because during a heavy read+write workload and under the current InnoDB design the Master thread is never leaving the purge loop! and whatever "Adaptive" flushing or checkpoint code, as well dirty page limit check is never reached within a Master thread! - (for more details see my initial report and next story )..

(To understand better all these critical points I've instrumented InnoDB with light counters and was able to monitor it live during workload activity (see Extention of InnoDB Status command for more details...))

Initially I've tried to extend a purge loop with ahead flushing to avoid a "furious flushing". Modification is light, but cannot be a true fix because all other Master code should be still executed too...

And then an absolute radical idea changed everything ! :-)

  • Tim Cook came with idea: if it's so, WHY NOT isolate a purge processing withing a separated thread ?..
  • and Vince Carbone made an initial code split with separated Master and Purge threads!!!

Why having separated purge thread is absolutely great ?! ;-)

  • because after that at least all Master thread code will work as expected in any situation!! :-))
  • we may leave purging alone looping forever and not take care anymore! :-))
  • Master thread activity become way simpler! :-))

It's for what I always loved Sun - it's a place of innovation !! :-))

This idea inspired me to adapt all my changes to the splitted threads and its current logic is working as the following:

Purge :

  • it's a separated thread now, just looping alone on trx_purge() call and doing nothing else ! :-)
  • initially by design purge should be involved once per 10 sec, but I've made this interval auto-adaptive: if no work was done during the last loop the sleep timeout is increasing (still respecting max 10sec), and if there was some purge work done - sleep timeout is decreasing (still respecting min 10ms) - works just well and adapt itself to the active workload :-)

Master :

  • no more purge! :-)
  • no more checks for redo log flush! :-)
  • flushing redo log within a 1sec loop (as designed)
  • do all other stuff within a 1sec or 10sec loop (as designed)
  • for the first time on my workload I saw Master thread checking dirty page percentage limit!!! :-)))

Wow! that's is really great!!! :-))


Ahead Flushing

Now, when a dirty percentage limit is really checked, do we still need Ahead Flushing? ;-))

And I will say you YES! :-)

Why?..

  • currently when a dirty page limit is reached there will be a burst buffer flush with 100% of I/O capacity setting, which may be very heavy; and setting lower I/O capacity than real will be not good as it may be too low to flush fast enough (because the goal is to avoid a critical checkpoint age level)

  • it's quite painful to calculate everytime if dirty page limit is set correctly according redolog and buffer pool sizes - it'll be much more better to leave a database engine to take care about, no? ;-)

  • there is no a half-force solution: it does not flushing at all, or it's flushing on 100% I/O capacity :-)

So, yes, we need it!

And with a current model with separated threads both Percona's Adaptive Checkpoint and InnoDB's freshly available Adaptive Flushing are entering perfectly in the game now! :-)) However I wanted absolutely to test a final solution but both "Adaptive" codes were not directly adaptable for MySQL 5.4... So I've made mine :-) with a small personal touch :-))

You may find all details about within a final report , but I'll just show here that from the previous situation:

we moved to the following one:

Few comments:

  • no more periodic performance drops!
  • checkpoint age is staying under 1.2 GB - probably a shorter time in case of recovery? ;-)
  • critical checkpoint age is never reached! :-)
  • mainly the buffer flushing is involved from Ahead Flushing code
  • and still no more than one flush per second is really executed..

Everything goes pretty well now.. except a constantly growing "History len" value...


Purge Lag

Observing my workload graphs I was surprised by seeing a constantly growing "History len" value - this number is represiting the current number of un-purged pages..

Why it happens and where is a danger you may find from MySQL manual:

In the InnoDB multi-versioning scheme, a row is not physically removed from the database immediately when you delete it with an SQL statement. Only when InnoDB can discard the update undo log record written for the deletion can it also physically remove the corresponding row and its index records from the database. This removal operation is called a purge, and it is quite fast, usually taking the same order of time as the SQL statement that did the deletion.

In a scenario where the user inserts and deletes rows in smallish batches at about the same rate in the table, it is possible that the purge thread starts to lag behind, and the table grows bigger and bigger, making everything disk-bound and very slow. Even if the table carries just 10MB of useful data, it may grow to occupy 10GB with all the "dead" rows . In such a case, it would be good to throttle new row operations and allocate more resources to the purge thread. The innodb_max_purge_lag system variable exists for exactly this purpose.

(see http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.htmlfor more details)

In my case every time when the test is finished it still took 15-20 minutes for InnoDB to flush all dirty pages!

Observations:

  • History len is outpassing 6 millions
  • once the workload is finished, InnoDB is starting to decrease a purge gap
  • dirty pages level cannot reach zero value until history len is not become zero too
  • it took 17 minutes to free all dirty pages!

This workload was kept during one hour. But what will be after a whole day activity?..

The problem seems to come due the fact that purge thread cannot follow the workload activity - there are too much pages to purge...

What kind of solution may be used here?

  • speed-up the purge processing (even it may be quite costly - there are many user threads modifying data, and only one purge thread cleaning removed rows - so there will be need to have several purge threads, and it leave less CPU power for a useful work (and finally you'll probably not win anything for your workload throughput) - however even single thread currently is not going on its full speed as it's crossing some common locks with other threads during its work..) - but with time I think this direction will get the main priority!..

  • add a kind of throttling for writing operations to keep their activity on the same level as your maximum possible purge throughput - it'll slow down little bit your transactions, but avoid a big potential disaster!..

The second solution is already implemented within InnoDB - you may set an innodb_max_purge_lag parameter to say InnoDB to keep purge gap under this limit. But! The only problem - it doesn't work here..

WHY?..

As I explained in my previous report , the innodb_max_purge_lag condition is ignored until InnoDB considering there is a consistent read view which may need see rows to be purged. The problem is I don't have a feeling it works properly, because until I have any SELECT within my workload InnoDB considering them as consistent reads, even they are started way after when the last DML statement was committed. Probably keeping a track of an oldest LSN for SELECT may help?.. But well, let's back to the innodb_max_purge_lag setting:

This variable controls how to delay INSERT, UPDATE, and DELETE operations when purge operations are lagging (see Section 13.2.9, ?InnoDB Multi-Versioning?). The default value 0 (no delays). The InnoDB transaction system maintains a list of transactions that have delete-marked index records by UPDATE or DELETE operations. Let the length of this list be purge_lag. When purge_lag exceeds innodb_max_purge_lag, each INSERT, UPDATE, and DELETE operation is delayed by ((purge_lag/innodb_max_purge_lag)×10)?5 milliseconds. The delay is computed in the beginning of a purge batch, every ten seconds. The operations are not delayed if purge cannot run because of an old consistent read view that could see the rows to be purged.

(see http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html#sysvar_innodb_max_purge_lag)

But honestly - what is the problem with DML query delay and consistent view?.. If the innodb_max_purge_lag is set it means we want to avoid any purge lag higher than this value! And if it become higher and there are new insert/delete/updates arriving - what else can we do here if we will not delay them just little bit? The same thing may be made within a customer application (if things going slowly - add few ms sleep between transaction orders) - will it be broken after that? ;-)) The same logic is implemented within many file systems (and recently in ZFS too)..

So, my fix is simple here - by removing of consistency read check within trx_purge function! :-)

In short the:


if (srv_max_purge_lag > 0 && !UT_LIST_GET_LAST(trx_sys->view_list)) {

become:


if (srv_max_purge_lag > 0 ) {

As well I limit the max potential delay value to 50ms.

The result

And now let me present you the result - following test was executed with Purge lag fix applied and innodb_max_purge_lag = 200000 limit:

Observations :

  • TPS level is slightly lower - from 8,200 TPS it moved to 7,800 (400TPS less, 5% performance loss)
  • however, everything else looks just fine! :-)
  • history len is stable on 200000
  • max DML delay did not outpass 5ms during all test duration
  • checkpoint age and dirty pages level are rock stable
  • and the most fun: once the test is finished, instead of 17 minutes all dirty pages were freed within 30 seconds ! :-)

More details you may find from the full report: http://dimitrik.free.fr/db_STRESS_MySQL_540_Purge_Lag_and_Ahead_Flushing_Fixed_Aug2009.html 

Any comments are welcome! :-)


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Performance: InnoDB plugin-1.0.4 & others @dbSTRESS

Август 13th, 2009

This post is an update of my previous one about XtraDB-6 performance - as InnoDB plugin-1.0.4 announce came the same day I did not have yet any benchmark results on that time :-)

To be short, the new InnoDB plugin looks very positive and have several very valuable improvement (and of course we expected to see them much more earlier, no? ;-) on the same time analyzing all latest updates - probably it's the first sign that things will go much more faster in the near future? ;-)

Anyway, what I liked with this release:

  • group commit is back! (and we should thank a lot Percona team for their efforts to get it fixed! ;-)
  • configurable number of I/O threads and I/O capacity (aligned now with XtraDB, Google patched and MySQL 5.4)
  • adaptive flushing (idea is similar to Percona's Adaptive Checkpoint, but more elegant)
  • read ahead algorithm changes
  • etc..

Full list of changes you may find from InnoDB changelog , (as well Mark Callaghan wrote a very good summary about, and hope you did not miss Yoshinori's post about his group commit tests ).

InnoDB plugin is missing a timer concurrency model for the moment. Probably it will come with the next release? ;-)

But what about results on dbSTRESS?..

Read-Only Workload

Observations :

  • with the latest Solaris 10 update7, MySQL is scaling better now on the read-only workload even with setting concurrency to zero!
  • InnoDB plugin looks very stable!

Observations :

  • while concurrency setting is not zero, only engines using a timer based model continue to keep workload
  • we may also ask as well: why use concurrency setting if the result is alredy "good enough"?.. - well, if you have only reads - it's ok.. But what if you also have writes? ;-)


Read+Write Workload

Observations :

  • as you may see, limiting thread concurrency on the read+write workload helps a lot to keep a growing load!
  • depending on how many active & concurrent sessions you have on your workload you may choose your optimal setting..

All other results as well all other details about this testing you may find from my full report: http://dimitrik.free.fr/db_STRESS_XtraDB_6_and_InnoDB_plugin_4_on_M5000_Aug2009.html


Long duration Read+Write Workload

I've also mentioned before the test is not log enough to reach the highest TPS level of each engine. For my other performance analyzing I've set up a long (1 hour) non-stop Read+Write test:

  • 5min Read-Only warm-up
  • 60min non-stop Read+Write
  • 32 concurrent sessions
  • 16 CPU cores
  • 50GB database

So, let me show a small graph:

The blue curve is representing a TPS level reached during this workload. There were all 4 engines tested one after other: MySQL 5.4, then InnoDB plugin-4, then MySQL.Perf build #45, and finally XtraDB-6. As you may see the throughput still continues to grow within 10-15 minutes, and only then reaches its higher level and become stable. So, the max (and stable!) Read+Write throughput will be:

Engine TPS
MySQL 5.4 8,200
InnoDB plugin-1.0.4 8,300
XtraDB-6 9,500
MySQL Perf build #45 11,500


Curiously I've already observed 12,500 TPS before with Perf build #5... (need to check if we did not break something with a time here..). But performance improvements made within the latest releases of XtraDB and InnoDB plugin are very impressive!

I may only say: Please, DON'T STOP! :-)

Any comments are welcome! :-)


PlanetMySQL Voting: Vote UP / Vote DOWN