Archive for the ‘tuning’ Category

Momentum MTA Performance Tuning Tips

Январь 7th, 2012

About 2 months ago I’ve joined LivingSocial technical operations team and one of my first tasks there was to figure out a way to make our MTAs perform better and deliver faster. We use a really great product called Momentum MTA (former Ecelerity) and it is really fast, but it is always good to be able to squeeze as much performance as possible so I’ve started looking for a ways to make our system faster.

While working on it I’ve created a set of scripts to integrate Momentum with Graphite for all kinds of crazy stats graphing, those scripts will be opensourced soon, but for now I’ve decided to share a few tips about performance-related changes we’ve made to improve our performance at least 2x:

  1. Use EXT2 Filesystem for the spool storage – After a lot of benchmarking we’ve noticed that amounts of I/O we’ve been doing was way too high compared to our throughput. Some investigation showed that EXT3 filesystem we were using for the spool partition had way too high metadata update overhead because of the fact that the spool storage uses a lot of really small files. Switching to EXT2 helped us gain at least 50-75% additional performance. Additional performance gain was caused by turning on noatime option for our spool.

    There are some sources that claim using XFS for spool directories is a better option, but we’ve decided to stick with EXT2 for now.

  2. Do not use %h{X} macro in your custom logs – Custom logs is an awesome feature of momentum and we use it to log our bounces along with some information from mail headers. Unfortunately the most straighforward thing to do (using %h{X} macro) was not the best option for I/O loaded servers because every time Momentum needs to log a bounce it needs to swap message body in from the disk and parse it to get you the header value.

    To solve this issue we’ve created a Sieve+ policy script that would extract the headers we need from a message during initial spooling phase (when the message is still in memory) and put those values to the message metadata. This way when we need to log those values we wouldn’t need to swap message body in from the disk. Here is the Sieve script to extract header value:

    1
    2
    3
    4
    5
    6
    require [ "ec_header_get", "vctx_mess_set", "ec_log" ];

    # Extract x-ls-send-id header to LsSendId context variable
    # (later used in deliver log)
    ($send_id) = ec_header_get "x-ls-send-id";
    vctx_mess_set "LsSendId" $send_id;

    After this we could use it in a custom logger like this:

    1
    2
    3
    4
    5
    6
    custom_logger "custom_logger1"
    {
      delivery_logfile = "cluster:///var/log/ecelerity/ls-delivery_log.cluster=>master"
      delivery_format = "%t@%BI@%i@%CI@D@%r@%R@%m@%M@%H@%p@%g@%b@%vctx_mess{LsSendId}"
      delivery_log_mode = 0664
    }
  3. Give more RAM to Momentum – When Momentum receives a message, it stores it to the disk (as required by SMTP standard) and then tries to deliver the copy it has in memory, if delivery succeeds, on-disk copy is unliked. The problem with a really have outbound traffic load is that momentum needs to keep tons of emails in memory, but by default it could only hold 250 messages. With a load of 250-500 messages a second this is just too small.

    To change this limit we’ve increased Max_Resident_Active_Queue parameter and changed it to 1000000 (of course we made sure have enough RAM to hold that many messages if needed) and Max_Resident_Messages to 0 (which means unlimited). This allows Momentum keep as many messages resident as possible and reduce the load caused by swap-in operations required for re-delivery attempts, etc.

  4. Choose a proper size for your I/O-related thread pools – in default Momentum config they set SwapIn and SwapOut thread pool sizes to 20. Under really high load even on our 4xSAS15k RAID10 this tends to be too high value. We’ve switched those pools to 8 threads each and it helped to reduce I/O contention and overall I/O throughput.

As a summary, I’d like to note, that as with any optimizations, before tuning your system it really helps to set up as much monitoring for your MTA servers as possible: cacti graphs, graphite, ganglia or something else – does not matter. Just make sure you see all the aspects of your system performance and understand what is going on with your system before changing any performance-related settings.



PlanetMySQL Voting: Vote UP / Vote DOWN

Setting up XFS on Hardware RAID — the simple edition

Декабрь 16th, 2011

There are about a gazillion FAQs and HOWTOs out there that talk about XFS configuration, RAID IO alignment, and mount point options.  I wanted to try to put some of that information together in a condensed and simplified format that will work for the majority of use cases.  This is not meant to cover every single tuning option, but rather to cover the important bases in a simple and easy to understand way.

Let’s say you have a server with standard hardware RAID setup running conventional HDDs.

RAID setup

For the sake of simplicity you create one single RAID logical volume that covers all your available drives.  This is the easiest setup to configure and maintain and is the best choice for operability in the majority of normal configurations.  Are there ways to squeeze more performance out of a server by dividing the logical volumes: perhaps, but it requires a lot of fiddling and custom tuning to accomplish.

There are plenty of other posts out there that discuss RAID minutia.  Make sure you cover the following:

  • RAID type (usually 5 or 1+0)
  • RAID stripe size
  • BBU enabled with Write-back cache only
  • No read cache or read-ahead
  • No drive write cache enabled

Partitioning

You want to run only MySQL on this box, and you want to ensure your MySQL datadir is separated from the OS in case you ever want to upgrade the OS, but otherwise keep it simple.  My suggestion?  Plan on allocating partitions roughly as follows, based on your available drive space and keeping in mind future growth.

  • 8-16G for Swap –
  • 10-20G for the OS (/)
  • Possibly 10G+ for /tmp  (note you could also point mysql’s tmpdir elsewhere)
  • Everything else for MySQL (/mnt/data or similar):  (sym-link /var/lib/mysql into here when you setup mysql)

Are there alternatives?  Yes.  Can you have separate partitions for Innodb log volumes, etc.?  Sure.  Is it work doing much more than this most of the time?  I’d argue not until you’re sure you are I/O bound and need to squeeze every last ounce of performance from the box.  Fiddling with how to allocate drives and drive space from partition to partition is a lot of operational work which should be spent only when needed.

Aligning the Partitions

Once you have the partitions, it could look something like this:
#fdisk -ul

Disk /dev/sda: 438.5 GB, 438489317376 bytes
255 heads, 63 sectors/track, 53309 cylinders, total 856424448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00051fe9

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     7813119     3905536   82  Linux swap / Solaris
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     7813120    27344895     9765888   83  Linux
/dev/sda3        27344896   856422399   414538752   83  Linux
 Several months ago my colleague Aurimas posted two excellent blogs on both the theory of Aligning IO on hardware RAID and some good benchmarks to emphasize the point, go read those if you need the theory here.  Is it common on modern Linux systems for this to be off?  Maybe not, but here’s how you check.
  We want to use mysql on /dev/sda3, but how can we ensure that it is aligned with the RAID stripes?  It takes a small amount of math:
  • Start with your RAID stripe size.  Let’s use 64k which is a common default.  In this case 64K = 2^16 = 65536 bytes.
  • Get your sector size from fdisk.  In this case 512 bytes.
  • Calculate how many sectors fit in a RAID stripe.   65536 / 512 = 128 sectors per stripe.
  • Get start boundary of our mysql partition from fdisk: 27344896.
  • See if the Start boundary for our mysql partition falls on a stripe boundary by dividing the start sector of the partition by the sectors per stripe:  27344896 / 128 = 213632.  This is a whole number, so we are good.  If it had a remainder, then our partition would not start on a RAID stripe boundary.

Create the Filesystem

XFS requires a little massaging (or a lot).  For a standard server, it’s fairly simple.  We need to know two things:

  • RAID stripe size
  • Number of unique, utilized disks in the RAID.  This turns out to be the same as the size formulas I gave above:
    • RAID 1+0:  is a set of mirrored drives, so the number here is num drives / 2.
    • RAID 5: is striped drives plus one full drive of parity, so the number here is num drives – 1.
In our case, it is RAID 1+0 64k stripe with 8 drives.  Since those drives each have a mirror, there are really 4 sets of unique drives that are striped over the top.  Using these numbers, we set the ‘su’ and ‘sw’ options in mkfs.xfs with those two values respectively.
# mkfs.xfs -d su=64k,sw=4 /dev/sda3
meta-data=/dev/sda3              isize=256    agcount=4, agsize=25908656 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=103634624, imaxpct=25
         =                       sunit=16     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=50608, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The XFS FAQ is a good place to check out for more details.

Mount the filesystem

Again, there are many options to use here, but let’s use some simple ones:

/var/lib/mysql           xfs     nobarrier,noatime,nodiratime

Setting the IO scheduler

This is a commonly missed step related to getting the IO setup properly.  The best choices here are between ‘deadline’ and ‘noop’.   Deadline is an active scheduler, and noop simply means IO will be handled without rescheduling.  Which is best is workload dependent, but in the simple case you would be well-served by either.  Two steps here:

echo noop > /sys/block/sda/queue/scheduler   # update the scheduler in realtime

And to make it permanent, add ‘elevator=<your choice>’ in your grub.conf at the end of the kernel line:

kernel /boot/vmlinuz-2.6.18-53.el5 ro root=LABEL=/ noapic acpi=off rhgb quiet notsc elevator=noop

 

This is a complicated topic, and I’ve tried to temper the complexity with what will provide the most benefit.  What has made most improvement for you that could be added without much complexity?


PlanetMySQL Voting: Vote UP / Vote DOWN

Performance schema, overhead tuning

Июнь 27th, 2011

Performance schema performance tuning

One of the most important question users ask before deciding to use the performance schema is: what is the overhead ? The underlying concern of course is to make sure deploying the performance schema does not negatively impact production ("First, do no harm").
The question is simple, and yet the answer is not so simple, as it depends on so many things.
As seen in some benchmarks done already, the overhead when using the performance schema can vary a lot, and in some cases really causes degradations.
A lot of time has been spent investigating the root causes (note the plural form) for the performance schema overhead in general, so I think it is a good time to share current findings.

Server workload

Analysing performances, and testing the effects of different code or configuration changes, can hardly be done in a vacuum. That's where benchmarks come in.
The problem with benchmarks in general is that different benchmarks will produce different loads, and that a given code may perform very well under some conditions but poorly under others ...
One particular property of the performance schema is that, since it instruments the server itself, the load and behavior of the server:
  • is there capacity left or is it saturated ?
  • is it CPU bound or disk io bound ?,
  • paging ?
  • swaping ?
  • contended in a specific place (query cache, table cache, innodb buffer pool, [insert your favorite here]) ?
directly influences the load put on the performance schema code:
  • how many mutexes / rwlocks / cond events are detected ?
  • how many file io events are detected ?
  • how many table io / table lock events are detected ?
  • how many stages / statements are detected ?
Given this load as input, the user configuration of the performance schema:
  • the sizing of internal performance schema buffers,
  • which instrument is enabled ?
  • which instrument is timed ?
  • which timer is used ?
  • which consumer is enabled ?
  • is there some filtering per user ? per host ? (table setup_actors)
  • is there some filtering per object? (table setup_objects)
will decide what to do with the events seen:
  • discard them,
  • count them,
  • time them,
  • aggregate them, possibly with fancy aggregations,
all of which affects overhead.
To summarize, the overall system can be modeled as giant pipe, which consists of, in order:
  • the client application and the generated application workload,
  • the mysql server and the generated instrumentation workload,
  • the performance schema and the generated statistics / overhead.
It is critical to understand how that pipe works to interpret results.
For example, if a client application is sending very few queries to the server, generating a very "small" load, a benchmark with or without the performance schema is not going to show any differences. To be meaningful, a benchmark (of the database) has to put a lot of pressure (on the database). This is hardly a surprise, but here comes the interesting part.
Next, if a server is poorly configured for the application workload received, and if the server itself does not perform well for that load, the instrumentation workload on the performance schema will be "small". A properly configured server will execute more queries, transactions, etc, which will generate more events to instrument. To be meaningful, a benchmark (of the performance schema) has to put a lot of pressure (on the performance schema).
An immediate consequence is that benchmarks of the performance schema should disable the query cache: there are currently 474 instrumentation points in the server. If most executions paths bounce off of the query cache, stress will be on "wait/synch/mutex/sql/Query_cache::structure_guard_mutex" only, cutting off the load generated from the lower layers on the server code, such as the table io instrumentation, or the storage engines instrumentation.
A more subtle consequence is that every time a performance bottleneck in the server code is removed, either by a bug fix or by a better server tuning, server efficiency goes up, which increases the number of statements / transactions / whatever the benchmark metric is counting, that are executed. While this is good news for the server, it actually increases the instrumentation workload for the performance schema.
In conclusion, and in my opinion, to interpret a performance schema benchmark, results should describe not only the overhead observed, but also the load seen that caused the overhead. By load seen, I mean here the instrumentation load seen at the server / performance schema interface, since this is the real stress put on the performance schema code. The more details are available for each stage of the pipe, the better.
Time to dive into numbers then ...

The trunk benchmark

The load used for this analysis is sysbench, and more specifically the sysbench oltp read-only benchmark.
The reason this particular case was picked is not because I like the results on this one better.
The reasons are:
  • sysbench oltp was the test used in previous benchmarks, and to reproduce the results (and analyse the reasons for overhead), it helps to be in similar conditions,
  • a read only, in memory only load eliminates a lot of noise from measurements, making analysis easier: the goal here is not to get a number (by "just" running a benchmark), it is to get an explanation for the number, so that the code can be improved,
  • the performance schema code itself is "CPU only, user land only" code, as it does not make any io or any system call. A benchmark that shares the same characteristics seems more likely to provide interesting insight.
Time permitting, I hope to get to do the same analysis for other cases (sysbench read-write, dbt2, etc).
The test machine used is a 8 CPU / 32 GB Linux box, so the test configuration is sized accordingly:
  • for sysbench: --oltp-table-size=1000000
  • for mysql my.cnf: innodb_buffer_pool_size = 12G
Historical note: testing done for this post started on a 8 CPU 16 GB machine, then moved to another lab. The original configuration was not changed, hence the 12G buffer pool only.

The trunk results

All results are for mysql-trunk (revno 3195, 2011-06-13), compiled in Release configuration. This corresponds to the current code for mysql 5.6, which is still in development.

Tests without the performance schema

With this build, the performance instrumentation is not compiled. All the computation of overhead expressed as percentages are relative to this reference, which is used as the ultimate baseline.
Connections TPS CPU User (%) CPU System (%) CPU Idle (%)
1 807 10 2 88
2 1439 20 5 75
4 2610 40 10 50
8 5086 84 14 2
12 5037 84 15 1
16 5038 85 15 0
32 4854 85 15 0
48 4762 85 15 0
64 4118 (*) 60 32 8
128 3538 (*) 57 35 8
256 1746 (*) 57 35 8
The overall CPU statistics are collected by vmstat, and helps to qualify in which mode the server is operating.
Looking at the overall shape of the performance curve, three different operating modes can be observed:
  • Ramping up (1 to 8 connections)
  • Peek efficiency (8 to 48 connections)
  • Server progressive degradation (48 to infinity connections)
For a 8 core machine, this looks reasonable.
(*) A special note about results for 64 / 128 / 256 connections: the performance measured in this cases is not stable and can vary a lot (50 percent) between runs. This is due to the machine saturation: at 256 connections, 256 threads are running in sysbench, plus 256+16 in mysqld, which makes 512+ active threads on a 8 core. Numbers are posted for this case as well, to include this mode of operation in the analysis.

Tests with the performance schema

Now that we have established what the baseline looks like, executing the same workload on the same hardware, under the same conditions gives the following results. The CPU user / system / idle statistics from vmstat are not repeated (the overall profile is the same) for brevity.
With performance schema disabled off global (timed) thread (timed)
Connections TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%)
1 796 1.36 775 3.97 684 15.24 681 15.61
2 1409 2.08 1349 6.25 1198 16.75 1191 17.23
4 2544 2.53 4634 5.86 2115 18.97 2114 19.00
8 4844 4.76 4634 8.89 3974 21.86 3913 23.06
12 4850 3.71 4662 7.44 4020 20.19 3979 21.00
16 4847 3.79 4676 7.19 4051 19.59 4020 20.21
32 4676 3.61 4485 7.60 3851 20.66 3789 21.94
48 4582 3.78 4384 7.94 3759 21.06 3692 22.47
64 (*) 2548 38.13 3048 25.98 3392 17.63 3286 20.20
128 (*) 2035 42.48 2007 43.27 1938 45.22 1917 45.82
256 (*) 1745 0.06 1697 2.81 1662 4.81 1624 6.99
(*) The same comment about results stability applies here. Because the results are unstable with great variations, the overhead computed after normalization on a unstable baseline gives meaningless results.

Analysis

Overall, the overhead in the most basic configuration, with all instruments enabled and only the global instrumentation consumer enabled, is at 20 percent already.
This is a lot. Looking in more details at these results, it appears that:
  • Adding the thread instrumentation consumer (to get statistics per thread) does only marginally increase overhead: at 8 connections, the overhead goes from 21.86% to 23.06%.
  • The overhead depends on the number of connections. It is progressive in the 1-8 connections range, peaks at 8 connections, then is both stable and significant at 8-48 connections.
  • Peak TPS without the performance schema was at 8 connections. With the performance schema, it seems to move to 12 or even 16 connections.
To understand the root cause for this 20% toll, the following measurements are also instructive:
  • With the performance schema enabled but all instruments disabled (configuration "off"), the overhead is at ~8%. For code doing nothing, this is actually quite high.
  • When the performance schema is disabled at startup time (configuration "disabled"), the overhead is at ~4%. This configuration is not very useful from a functional point of view, but from a technical point of view it executes a code path which is a subset of the configuration "off". That makes the numbers collected useful.
Based on this, overhead seems to originate from, roughly:
  • compiling with the performance schema (configuration "disabled"): ~4% cost,
  • starting with the performance schema (configuration "off"): additional ~4% cost,
  • enabling the global consumer only (configuration "global (timed)"): additional ~12% cost,
  • adding the thread consumer only (configuration "global (timed)"): additional ~1% cost
To understand better which code is executed, and how often, let's use the performance schema itself to get some insight on the server execution. After all, that is precisely what the performance schema is for.
Using the "thread (timed)" configuration, with the assumption that the instrumentation does not distort the workload, and looking at details for 8 connections, which is chosen because it corresponds to the peak performance without the performance schema, statistics collected are as follows below.

Total number of wait events.

mysql> select sum(count_star), sum(sum_timer_wait)
    ->   from performance_schema.events_waits_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
|      1098658477 |     797822128662705 |
+-----------------+---------------------+
1 row in set (1.79 sec)

Top waits, by count

mysql> select * from performance_schema.events_waits_summary_global_by_event_name
    ->   where count_star>0 order by count_star desc limit 25;
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                                    | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| wait/io/table/sql/handler                     |  482492695 | 583215424951365 |          42210 |        1208680 |     7913678535 |
| wait/synch/mutex/innodb/srv_conc_mutex        |  265080324 | 142377797253015 |          30150 |         537005 |     1862766495 |
| wait/synch/mutex/sql/THD::LOCK_thd_data       |  122077261 |   6519294560295 |          30150 |          53265 |     1072076715 |
| wait/synch/rwlock/innodb/hash table locks     |   38409624 |   5420427797475 |          27135 |         141035 |      360901530 |
| wait/synch/mutex/innodb/read_view_mutex       |   32554195 |   6146001298890 |          24120 |         188605 |     1011083265 |
| wait/synch/mutex/sql/LOCK_open                |   32553870 |   9359193544965 |          30150 |         287430 |     1080446355 |
| wait/lock/table/sql/handler                   |   32553724 |  20576707870950 |          99495 |         631810 |     1083789990 |
| wait/synch/mutex/mysys/THR_LOCK::mutex        |   32553724 |   6964485402105 |          30150 |         213730 |     3022857090 |
| wait/synch/rwlock/innodb/trx_sys_lock         |   18592602 |   5270410352025 |          42210 |         283410 |     1622682045 |
| wait/synch/rwlock/sql/LOCK_grant              |   16276408 |   3502306526580 |          51255 |         215070 |      892225935 |
| wait/synch/rwlock/innodb/btr_search_latch     |   16198467 |   4144473779535 |          27135 |         255605 |     1048297410 |
| wait/synch/rwlock/sql/MDL_lock::rwlock        |    2325412 |    524010714480 |          54270 |         225120 |       36433260 |
| wait/synch/mutex/mysys/BITMAP::mutex          |    2325268 |    409667867595 |          42210 |         175875 |      908214480 |
| wait/synch/mutex/innodb/trx_mutex             |    1163261 |     32488536510 |          24120 |          27805 |        7419915 |
| wait/synch/mutex/sql/MDL_map::mutex           |    1162780 |    274705866855 |          36180 |         236175 |      133772535 |
| wait/synch/mutex/sql/LOCK_plugin              |    1162673 |    212769433395 |          33165 |         182910 |      791328960 |
| wait/synch/mutex/innodb/lock_mutex            |    1162633 |    145945057815 |          39195 |         125290 |       14173515 |
| wait/synch/mutex/innodb/log_sys_mutex         |       7757 |      2441752020 |          27135 |         314565 |     1239502680 |
| wait/synch/mutex/innodb/os_mutex              |       2490 |       271684665 |          30150 |         108875 |       17423685 |
| wait/synch/mutex/innodb/fil_system_mutex      |       1402 |       146559150 |          24120 |         104520 |        5538555 |
| wait/synch/mutex/innodb/flush_list_mutex      |        607 |        96362415 |          33165 |         158455 |         346725 |
| wait/io/file/innodb/innodb_log_file           |        600 |   2693980326195 |       23094900 |     4489967145 |   819050374485 |
| wait/synch/mutex/innodb/buf_pool_mutex        |        516 |        40862295 |          27135 |          79060 |         521595 |
| wait/synch/mutex/innodb/log_flush_order_mutex |        473 |        49503285 |          33165 |         104520 |         397980 |
| wait/synch/mutex/innodb/lock_wait_mutex       |        428 |        71949960 |          63315 |         167835 |        1706490 |
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
25 rows in set (1.79 sec)

Top waits, by time

mysql> select * from performance_schema.events_waits_summary_global_by_event_name
    ->   where count_star>0 order by sum_timer_wait desc limit 25;
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                                | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| wait/io/table/sql/handler                 |  482492695 | 583215424951365 |          42210 |        1208680 |     7913678535 |
| wait/synch/mutex/innodb/srv_conc_mutex    |  265080324 | 142377797253015 |          30150 |         537005 |     1862766495 |
| wait/lock/table/sql/handler               |   32553724 |  20576707870950 |          99495 |         631810 |     1083789990 |
| wait/synch/mutex/sql/LOCK_open            |   32553870 |   9359193544965 |          30150 |         287430 |     1080446355 |
| wait/synch/mutex/mysys/THR_LOCK::mutex    |   32553724 |   6964485402105 |          30150 |         213730 |     3022857090 |
| wait/synch/mutex/sql/THD::LOCK_thd_data   |  122077261 |   6519294560295 |          30150 |          53265 |     1072076715 |
| wait/synch/mutex/innodb/read_view_mutex   |   32554342 |   6146010024300 |          24120 |         188605 |     1011083265 |
| wait/synch/rwlock/innodb/hash table locks |   38409624 |   5420427797475 |          27135 |         141035 |      360901530 |
| wait/synch/rwlock/innodb/trx_sys_lock     |   18592651 |   5270414135850 |          42210 |         283410 |     1622682045 |
| wait/synch/rwlock/innodb/btr_search_latch |   16198467 |   4144473779535 |          27135 |         255605 |     1048297410 |
| wait/synch/rwlock/sql/LOCK_grant          |   16276408 |   3502306526580 |          51255 |         215070 |      892225935 |
| wait/io/file/innodb/innodb_log_file       |        600 |   2693980326195 |       23094900 |     4489967145 |   819050374485 |
| wait/synch/rwlock/sql/MDL_lock::rwlock    |    2325412 |    524010714480 |          54270 |         225120 |       36433260 |
| wait/synch/mutex/mysys/BITMAP::mutex      |    2325268 |    409667867595 |          42210 |         175875 |      908214480 |
| wait/synch/mutex/sql/MDL_map::mutex       |    1162780 |    274705866855 |          36180 |         236175 |      133772535 |
| wait/synch/mutex/sql/LOCK_plugin          |    1162673 |    212769433395 |          33165 |         182910 |      791328960 |
| wait/synch/mutex/innodb/lock_mutex        |    1162633 |    145945057815 |          39195 |         125290 |       14173515 |
| wait/synch/mutex/innodb/trx_mutex         |    1163457 |     32497638795 |          24120 |          27805 |        7419915 |
| wait/io/file/innodb/innodb_data_file      |          8 |     23031533745 |       54936315 |     2878941425 |     7430889600 |
| wait/synch/rwlock/innodb/checkpoint_lock  |          4 |      5048894880 |         250245 |     1262223720 |     2745088155 |
| wait/synch/mutex/innodb/log_sys_mutex     |       8095 |      2471021640 |          27135 |         305185 |     1239502680 |
| wait/synch/mutex/sql/LOCK_thread_count    |         31 |       830150100 |          30150 |       26778895 |      235748880 |
| wait/synch/mutex/innodb/os_mutex          |       2539 |       276339825 |          30150 |         108540 |       17423685 |
| wait/synch/mutex/innodb/fil_system_mutex  |       1451 |       149064615 |          24120 |         102510 |        5538555 |
| wait/synch/mutex/innodb/flush_list_mutex  |        704 |       111835395 |          33165 |         158790 |         346725 |
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
25 rows in set (1.79 sec)

Total number of stage events

mysql> select sum(count_star), sum(sum_timer_wait)
    ->   from performance_schema.events_stages_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
|       244153447 |    2263079088326000 |
+-----------------+---------------------+
1 row in set (0.00 sec)

Top stages, by count

mysql> select * from performance_schema.events_stages_summary_global_by_event_name
    ->   where count_star>0 order by count_star desc limit 25;
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                     | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| stage/sql/init                 |   34879169 | 157347558824000 |        2374000 |        4511000 |     1685373000 |
| stage/sql/cleaning up          |   18602306 |  25263461317000 |         628000 |        1358000 |     2278265000 |
| stage/sql/closing tables       |   18602129 | 384870074281000 |         628000 |       20689000 |     4374172000 |
| stage/sql/query end            |   18602129 |  23815226118000 |         698000 |        1280000 |     1674268000 |
| stage/sql/checking permissions |   16276937 |  37278853417000 |        1327000 |        2290000 |     1734543000 |
| stage/sql/Opening tables       |   16276936 |  91344126412000 |        3352000 |        5611000 |     1641511000 |
| stage/sql/executing            |   16276863 |  15264844128000 |         628000 |         937000 |     1097577000 |
| stage/sql/statistics           |   16276863 | 325924010836000 |        6495000 |       20023000 |     3296152000 |
| stage/sql/Sending data         |   16276863 | 548279995436000 |        2794000 |       33684000 |     6805393000 |
| stage/sql/System lock          |   16276863 |  68298443425000 |        2444000 |        4196000 |     1102675000 |
| stage/sql/preparing            |   16276863 |  40997524468000 |        1676000 |        2518000 |     1090453000 |
| stage/sql/optimizing           |   16276863 |  31599629407000 |        1606000 |        1941000 |     1093176000 |
| stage/sql/end                  |   16276863 |  17257774496000 |         768000 |        1060000 |     1070408000 |
| stage/sql/Sorting result       |    2325266 | 270070871391000 |        7264000 |      116146000 |     3214855000 |
| stage/sql/removing tmp table   |    1162634 |  15392269553000 |        9359000 |       13239000 |      904741000 |
| stage/sql/freeing items        |    1162634 |  15041751826000 |        3143000 |       12937000 |     1106237000 |
| stage/sql/Creating tmp table   |    1162633 |  14681955665000 |        9568000 |       12628000 |     1092898000 |
| stage/sql/Copying to tmp table |    1162633 | 180350717326000 |      116008000 |      155122000 |     8074224000 |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
18 rows in set (0.00 sec)

Top stages, by time

mysql> select * from performance_schema.events_stages_summary_global_by_event_name
    ->   where count_star>0 order by sum_timer_wait desc limit 25;
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                     | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| stage/sql/Sending data         |   16276863 | 548279995436000 |        2794000 |       33684000 |     6805393000 |
| stage/sql/closing tables       |   18602129 | 384870074281000 |         628000 |       20689000 |     4374172000 |
| stage/sql/statistics           |   16276863 | 325924010836000 |        6495000 |       20023000 |     3296152000 |
| stage/sql/Sorting result       |    2325266 | 270070871391000 |        7264000 |      116146000 |     3214855000 |
| stage/sql/Copying to tmp table |    1162633 | 180350717326000 |      116008000 |      155122000 |     8074224000 |
| stage/sql/init                 |   34879169 | 157347558824000 |        2374000 |        4511000 |     1685373000 |
| stage/sql/Opening tables       |   16276936 |  91344126412000 |        3352000 |        5611000 |     1641511000 |
| stage/sql/System lock          |   16276863 |  68298443425000 |        2444000 |        4196000 |     1102675000 |
| stage/sql/preparing            |   16276863 |  40997524468000 |        1676000 |        2518000 |     1090453000 |
| stage/sql/checking permissions |   16276937 |  37278853417000 |        1327000 |        2290000 |     1734543000 |
| stage/sql/optimizing           |   16276863 |  31599629407000 |        1606000 |        1941000 |     1093176000 |
| stage/sql/cleaning up          |   18602306 |  25263461317000 |         628000 |        1358000 |     2278265000 |
| stage/sql/query end            |   18602129 |  23815226118000 |         698000 |        1280000 |     1674268000 |
| stage/sql/end                  |   16276863 |  17257774496000 |         768000 |        1060000 |     1070408000 |
| stage/sql/removing tmp table   |    1162634 |  15392269553000 |        9359000 |       13239000 |      904741000 |
| stage/sql/executing            |   16276863 |  15264844128000 |         628000 |         937000 |     1097577000 |
| stage/sql/freeing items        |    1162634 |  15041751826000 |        3143000 |       12937000 |     1106237000 |
| stage/sql/Creating tmp table   |    1162633 |  14681955665000 |        9568000 |       12628000 |     1092898000 |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
18 rows in set (0.00 sec)

Total number of statement events

mysql> select sum(count_star), sum(sum_timer_wait)
    ->   from performance_schema.events_statements_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
|        18602306 |    2293686457711000 |
+-----------------+---------------------+
1 row in set (0.00 sec)

Top statements, by count

mysql> select event_name, count_star, sum_timer_wait
         from performance_schema.events_statements_summary_global_by_event_name
         where count_star>0 order by count_star desc limit 25;
+---------------------------------+------------+------------------+
| event_name                      | count_star | sum_timer_wait   |
+---------------------------------+------------+------------------+
| statement/com/Execute           |   17439495 | 2267764442472000 |
| statement/sql/begin             |    1162633 |   25914627564000 |
| statement/com/Prepare           |         88 |       3041856000 |
| statement/com/Close stmt        |         80 |        729649000 |
| statement/com/Quit              |          9 |         47910000 |
| statement/sql/show_table_status |          1 |       3568260000 |
+---------------------------------+------------+------------------+
6 rows in set (0.00 sec)

Top statements, by time

mysql> select event_name, count_star, sum_timer_wait
         from performance_schema.events_statements_summary_global_by_event_name
         where count_star>0 order by sum_timer_wait desc limit 25;
+---------------------------------+------------+------------------+
| event_name                      | count_star | sum_timer_wait   |
+---------------------------------+------------+------------------+
| statement/com/Execute           |   17439495 | 2267764442472000 |
| statement/sql/begin             |    1162633 |   25914627564000 |
| statement/sql/show_table_status |          1 |       3568260000 |
| statement/com/Prepare           |         88 |       3041856000 |
| statement/com/Close stmt        |         80 |        729649000 |
| statement/com/Quit              |          9 |         47910000 |
+---------------------------------+------------+------------------+
6 rows in set (0.00 sec)

Timer statistics

mysql> select * from performance_schema.setup_timers;
+-----------+------------+
| NAME      | TIMER_NAME |
+-----------+------------+
| wait      | CYCLE      |
| stage     | NANOSECOND |
| statement | NANOSECOND |
+-----------+------------+
3 rows in set (0.00 sec)

mysql> select * from performance_schema.performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2982179104 |                1 |             36 |
| NANOSECOND  |      1000000000 |                1 |           1404 |
| MICROSECOND |         1000000 |                1 |           1422 |
| MILLISECOND |            1034 |                1 |           1431 |
| TICK        |             104 |                1 |            774 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)

Interpreting results

Let's do some basic math to understand what this means ...
The total number of wait events collected is "1098658477". That's a big number. In a more readable form, it reads "1,098,658,477". That still is a big number. In plain English, it is 1 Billion, with a B. This was for a 5 minutes (300 sec) run, and all the statistics were truncated just before the sysbench load, so only the load itself is counted here.
Events collection rates (in number or events per seconds) are then:
  • 3662194 waits/sec, or 3.6 M waits/sec
  • 813844 stages/sec, or 813 K stages/sec
  • 62007 statements/sec, or 62 K statements/sec
These numbers are actually mind blowing. Keep in mind that there is no sampling involved. When each and every one of these events is collected, the performance schema time the event, and add statistics to the proper counter for the object and the thread involved, so that statistics per object, per table, per thread, etc, are available.
For what it is worth, we also have 13 stages per statements, and 59 waits per statements, on average.
Based on the top wait events, we can verify that this workload is indeed CPU bound and read only, since no file io are involved
An important point to note is that "wait/io/table" here represents table IO, as seen at the SQL layer / storage engine interface. A table IO event will time the code executed overall inside the storage engine, so it basically is the sum of the CPU time spent inside the engine and the waits blocking the storage engine. Having a big number reported in table io is expected.
Given the rate of collection for this load, 3.6 M waits/sec, the overhead in the performance schema is not caused by a big, rarely invoked, slow function, but rather by very frequent calls to very short functions.
With such a number of events, timing events cause a huge number of timer invocations.
For waits, the timer used is the cycle timer (see table performance_schema.setup_timers). For the CYCLE timer, the timer overhead alone, expressed in cycles, is 36 for this machine (see table performance_schema.performance_timers).
With 1098658477 events, 2 timer invocation per event, that makes 79103410344 cycles spent timing waits.
Given a CPU frequency of 2982179104 or 2.9 GHz, this represents 26.5224 seconds of CPU in a 300 seconds benchmark.
During this 300 sec benchmark, top reports that mysqld (the server process) consumes 723 percents of CPU (this is a 8 core, but sysbench alone eats 70 percent), so the overall CPU used by the server is 2169 sec. An overhead of 26 sec / 2169 sec represents 1.2 percent of CPU power, just timing waits.
Let's do the same math for stages and statements. They both use the same timer, nanoseconds.
With 244153447 stages, 2 timers per stage, 1404 cycles per timer, we have 685582879176 cycles or 229.8933 sec of CPU.
With 18602306 statements, 2 timers per stage, 1404 cycles per timer, we have 52235275248 cycles or 17.5158 sec of CPU.
Ouch ! That makes 229.8933+17.5158 or 247.4091 seconds of CPU calling the nanosecond timer. In terms of overhead, 247 seconds taken from 2169 represents 11.4 percent of mysqld spent for the nanosecond timer alone.
The bottom line is, this usage of the nanosecond timer is killing the performance. Now, why ? Intuitively, one would think that collecting and timing waits at a rate of 3.6 M / sec is more consuming than doing the same thing for stages and statements, at a rate of 813 K / sec and 62 K / sec respectively.
The root cause for this bad performances is hidden in plain sight, exposed by the performance_timers table.
mysql> select * from performance_schema.performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2982179104 |                1 |             36 |
| NANOSECOND  |      1000000000 |                1 |           1404 | <-- Really ? This is HUGE !
| MICROSECOND |         1000000 |                1 |           1422 |
| MILLISECOND |            1034 |                1 |           1431 |
| TICK        |             104 |                1 |            774 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
This is actually the first time I find a machine where the overhead for the NANOSECOND, MICROSECOND and MILLISECOND timer is so high
Some details about the machine used ...
malff@tyr03:~>more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 2992.227
cache size      : 6144 KB
...
There are 8 cores total, all identical.
Timer details
malff@tyr03:withpfs>./unittest/mysys/my_rdtsc-t
1..11
# ----- Routine ---------------
# myt.cycles.routine          :             2 <-- RDTSC
# myt.nanoseconds.routine     :            11 <-- CLOCK_GETTIME
# myt.microseconds.routine    :            13
# myt.milliseconds.routine    :            18
# myt.ticks.routine           :            17
# ----- Frequency -------------
# myt.cycles.frequency        :    2986029850
# myt.nanoseconds.frequency   :    1000000000
# myt.microseconds.frequency  :       1000000
# myt.milliseconds.frequency  :          1035
# myt.ticks.frequency         :           104
# ----- Resolution ------------
# myt.cycles.resolution       :             1
# myt.nanoseconds.resolution  :             1
# myt.microseconds.resolution :             1
# myt.milliseconds.resolution :             1
# myt.ticks.resolution        :             1
# ----- Overhead --------------
# myt.cycles.overhead         :            36
# myt.nanoseconds.overhead    :          1404 <-- Ouch
# myt.microseconds.overhead   :          1422
# myt.milliseconds.overhead   :          1431
# myt.ticks.overhead          :           765
...
malff@tyr03:~>uname -a
Linux tyr03 2.6.32-100.28.17.el6.x86_64 #1 SMP Mon May 23 04:37:16 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
To have a point of comparison, the following are taken from a 4 core 8 Gb machine (my now several years old desktop)
mysql> select * from performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2831810945 |                1 |             34 | <-- Marc desktop
| NANOSECOND  |      1000000000 |                1 |            102 | <-- Marc desktop
| MICROSECOND |         1000000 |                1 |            111 |
| MILLISECOND |            1037 |                1 |            127 |
| TICK        |             105 |                1 |            493 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
malff@linux-8edv:~> more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
stepping        : 10
cpu MHz         : 2833.000
cache size      : 6144 KB
...
malff@linux-8edv:~> uname -a
Linux linux-8edv 2.6.37.6-0.5-desktop #1 SMP PREEMPT 2011-04-25 21:48:33 +0200 x86_64 x86_64 x86_64 GNU/Linux
Both the lab machine (tyr03) and my desktop have an Intel processor, and they happen to even be of the same family (6) and model (23), only the model name differ (Xeon E5450 vs Core 2 Quad Q9550).
The CYCLE timer is implemented using RDTSC in both cases, and the overheads measured are very similar, with 36 and 34 cycles.
Implementation of the CYCLE timer is in my_timer_cycles() in mysys.
Details about the RDTSC instruction can be found in the processor spec, page C-32 (722). Family 6 model 23 is CPUID 06_17H.
For the nanoseconds timer, it is implemented in my_timer_nanoseconds() in both cases with a call to clock_gettime(CLOCK_REALTIME), in both cases on Linux 2.6.
The lab machine has a Linux 2.6.32 kernel, while the desktop has Linux 2.6.37.
The exact root cause of this huge difference in overhead (1404 vs 102) has not been explained yet.
In the mean time, given the cost of the nanosecond timer, maybe the performance schema overhead could be reduced by not timing events or timing them with the cycle timer.
This is worth a new set of measurements, still for mysql-trunk ...

The trunk benchmark, revisited

With performance schema global (not timed) thread (not timed) thread (all cycles)
Connections TPS Overhead (%) TPS Overhead (%) TPS Overhead (%)
1 757 6.20 752 6.82 739 8.43
2 1362 5.35 1327 7.78 1293 10.15
4 2441 6.48 2385 8.62 2328 10.80
8 4490 11.72 4384 13.80 4165 18.11
12 4489 10.88 4411 12.43 4220 16.22
16 4489 10.90 4436 11.95 4284 14.97
32 4336 10.67 4232 12.81 3992 17.76
48 4244 10.88 4111 13.67 3876 18.61
64 (*) 3977 3.42 3040 26.18 2527 38.64
128 (*) 2830 20.01 1972 44.26 1897 46.38
256 (*) 1683 3.61 1671 4.30 1629 6.70
(*) Noise at 64/128/256 still makes results unstable.
These new results, using a different configuration but still the same code in mysql-trunk, are interesting for two reasons.
First, it confirms the effect of timing alone. Timing a lot or events takes time, and timing a lot of events with a bad timer makes things really worse. Using the CYCLE timer for waits, stages and statements here, in the "thread (all cycle)" configuration, helped to regain 5 percent.
Second, it shows that once all timers are removed from the code path, the overhead is still in the ~11% range for the global only consumer. The code probably has room for optimizations, see the next section.

The tuning benchmark

The current code in mysql-trunk (at time of writing) can be improved and optimized, in an effort to reduce the overhead even more. A long list of different tweaks, changes, fixes ... has been the topic of a lot of research and tuning lately. Instead of going through each change, the cumulated changes are available in a different branch, in mysql-trunk-pfs-tuning.
This branch is available on launchpad for the technically curious. A bit of warning: it is used only for prototyping and benchmarking to see what code does (or does not) work better. It may not even build for your platform. Do not use this in production.
mysql-trunk-pfs-tuning without pfs (baseline) disabled off global (not timed) global (timed) thread (not timed) thread (timed) thread (all cycles)
Connections TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%)
1 819 0 788 3.79 778 5.01 771 5.86 684 16.48 755 7.81 678 17.22 744 9.16
2 1431 0 1405 1.82 1383 3.35 1354 5.38 1212 15.30 1329 7.13 1199 16.21 1307 8.67
4 2598 0 2567 1.19 2522 2.93 2452 5.62 2163 16.74 2394 7.85 2141 17.59 2371 8.74
8 4949 0 4980 -0.63 4799 3.03 4635 6.34 4105 17.05 4445 10.18 4061 17.94 4367 11.76
12 4948 0 4936 0.24 4759 3.82 4616 6.71 4112 16.90 4478 9.50 4076 17.62 4390 11.28
16 4944 0 4931 0.26 4754 3.84 4601 6.94 4131 16.44 4476 9.53 4077 17.54 4384 11.33
32 4764 0 4753 0.23 4596 3.53 4455 6.49 3984 16.37 4286 10.03 3945 17.19 4200 11.84
48 4682 0 4681 0.02 4519 3.48 4377 6.51 3908 16.53 4202 10.25 3860 17.56 4111 12.20
64 (*) 4148 0 3881 6.44 2583 37.73 3653 11.93 2916 29.70 3082 25.70 2863 30.98 3496 15.72
128 (*) 3659 0 2034 44.41 2024 44.68 1983 45.80 1924 47.42 2011 45.04 1886 48.46 1939 47.01
256 (*) 2514 0 1746 30.55 1732 31.11 1690 32.78 1632 35.08 1724 31.42 1640 34.77 1646 34.53
We can see some progress:
  • When all consumers are off, the overhead is in the 3% to 4% range, which is much better compared to ~8%.
  • The "global (timed)" and "thread (timed)" configurations still have an important overhead because they use the nanosecond timer, but this overhead decreased, which shows the benefits of the code optimizations.
  • The "thread (all cycles)" overhead is in the ~11% range.
Now that's better. With every instruments enabled and timed, for all waits stages and statements, and with statistics collected globally and per thread, 11% for this wealth of information is not so bad. Especially knowing that the overhead can be changed back to 3-4% at anytime, since all this configuration can be changed dynamically.

Take aways

The most important facts found here:
  • Timer performances do vary between platforms, and have a great impact. Check the timer performance itself with table performance_schema.performance_timers before deciding to time events.
  • The performance schema code itselt can be improved. Current optimizations are available in mysql-trunk-pfs-tuning, as a preview.
  • Overall, the performance schema captures instrumentation events at a very high rate. Limiting the choice of consumers and instruments to use is a good way to cut down the overhead.
  • There is no magic formula that says "overhead is X". Overhead depends so much on the work load, the hardware, the configuration, etc, that the best answer is "it depends".
The real question about overhead, given that it depends so much, really boils down to "What is the overhead for my application ?". The best way to find that is to try it ...
-- Marc Alff

PlanetMySQL Voting: Vote UP / Vote DOWN

Slides from DrupalDownUnder2011 on Tuning for Drupal

Июнь 16th, 2011

By popular request, here’s the PDF of the slides of this talk as presented in January 2011 in brisbane; it’s fairly self-explanatory. Note that it’s not really extensive “tuning”, it just fixes up a few things that are usually “wrong” in default installs, creating a more sane baseline. If you want to get to optimal correctness and more performance, other things do need to be done as well.


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Query Optimization – Tip # 1 – Avoid using wildcard character at the start of a LIKE pattern.

Ноябрь 23rd, 2010
The more I go through others SQL, there are some common mistakes that I see developers making over and over again, so I thought why not start a series of tips that can help developers optimize their queries and avoid common pitfalls. So this post is a part of that series of tips, and this is the first tip "Avoid using a wild card character at the start of a LIKE pattern".
PlanetMySQL Voting: Vote UP / Vote DOWN

DiskPageBufferMemory tuning and disk data statistics in MySQL Cluster 7.1.9

Ноябрь 3rd, 2010
From MySQL Cluster 7.1.9 it is possible to get better stats on disk data tables. In fact, the statistics makes it possible to tune the DiskPageBufferMemory parameter (similar to innodb_bufferpool), in order to avoid disk seeks. It is much (understatement) faster to fetch data from the DiskPageBufferMemory than disk.

Here is an example/tutorial how to use this information and how to check the hit ratio of the DiskPageBufferMemory. Next time, I will explain about other counters you can get from ndbinfo.diskpagebuffer.

Finally, so no more guess work and educated guesses are needed.

Let's take an example.

I have a table t1 with 650000 record
CREATE TABLE `t1` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`data1` varchar(512) DEFAULT NULL,
`data2` varchar(512) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=ndbcluster TABLESPACE ts_1 STORAGE DISK ENGINE=ndbcluster
Then doing Random reads on this table with bencher:
 src/bencher -r 100000 -t 1 -q \
"SET @x=FLOOR(1 + (RAND() * 650000)); SELECT * FROM t1 WHERE id=@x"
DiskPageBufferMemory=64M
The initial performance after 340 seconds into the run:
Thread 0 - 274 tps (average tps measured after 335 secs)
Thread 0 - 273 tps (average tps measured after 340 secs)
and the hit ratio:
mysql> SELECT
node_id,
100*(sum(page_requests_direct_return)/
(sum(page_requests_direct_return)+sum(page_requests_wait_io))) as hit_ratio_pct
FROM diskpagebuffer GROUP BY node_id;
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
| 3 | 93.3577 |
| 4 | 93.6565 |
+---------+---------------+

2 rows in set (0.01 sec
A hit ratio of ~93% - not so great.

So let's increase the DiskPageBufferMemory ( i am using the configurator scripts for this)
# change DiskPageBufferMemory to 512M in config.ini
vi ../config/config.ini
# do a rolling restart of the data nodes
./rolling-restart.sh --layer=storage
## the sit back and wait for the data nodes to be rolled..
hostx: (324s elapsed) Node 3: starting (Last completed phase 100) (mysql-5.1.51 ndb-7.1.9)
hostx: Data node 3 is STARTED
## then wait for node 4..
hosty: (98s elapsed) Node 4: starting (Last completed phase 4) (mysql-5.1.51 ndb-7.1.9)

Increasing DiskPageBufferMemory=512M
Of course, while doing the rolling restart, the same traffic is still happening on the cluster. Let's check what the hit_ratio is now (while restarting node 3).

695 seconds into the run we have:
mysql>  "see above"
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
| 4 | 91.2639 |
+---------+---------------+
1 row in set (0.00 sec)
and perf is...
Thread 0 - 200 tps (average tps measured after 690 secs)
Thread 0 - 200 tps (average tps measured after 695 secs)
not good at all.

After increase to DiskPageBufferMemory=512M

When the rolling restart has completed the ./rolling-restart prints out:
Cluster: Cluster has performed a rolling restart  
Let's check what hit ratio we got:
mysql>  "see above"
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
| 3 | 98.2487 |
| 4 | 98.6386 |
+---------+---------------+
2 rows in set (0.01 sec)
Hit ratio is picking up and performance as well:
Thread 0 - 1141 tps (average tps measured after 1310 secs)
Thread 0 - 1149 tps (average tps measured after 1315 secs)
Now, we just have to wait for a while and let the system enter a steady state.
Then we can check back on the hit ratio and perhaps increase the DiskPageBufferMemory even more.

If you have 99.5 -- 99.9% hit ratio i would say you are good.

After 1780 seconds into the run we have:
+---------+---------------+
| node_id | hit_ratio_pct |
+---------+---------------+
| 3 | 99.2035 |
| 4 | 99.5251 |
+---------+---------------+
2 rows in set (0.00 sec)

and
Thread 0 - 1669 tps (average tps measured after 1775 secs)
Thread 0 - 1674 tps (average tps measured after 1780 secs)
... wow .. tuning the DiskPageBufferMemory really helped. And 2000 sec into the run, perf and hit ratio is still climbing.

PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Paginated displays – How to kill performance vs How to improve performance!

Октябрь 30th, 2010
Pagination is used very frequently in many websites, be it search results or most popular posts they are seen everywhere. But the way how it is typically implemented is naive and prone to performance degradation. In this article I attempt on explaining the performance implications of poorly designed pagination implementation. I have also analyzed how Google, Yahoo and Facebook handle pagination implementation. Then finally i present my suggestion which will greatly improve the performance related to pagination.
PlanetMySQL Voting: Vote UP / Vote DOWN

Impact of the sort buffer size in MySQL

Октябрь 25th, 2010

The parameter sort_buffer_size is one the MySQL parameters that is far from obvious to adjust. It is a per session buffer that is allocated every time it is needed. The problem with the sort buffer comes from the way Linux allocates memory. Monty Taylor (here) have described the underlying issue in detail, but basically above 256kB the behavior changes and becomes slower. After reading a post from Ronald Bradford (here), I decide to verify and benchmark performance while varying the size of the sort_buffer. It is my understanding that the sort_buffer is used when no index are available to help the sorting so I created a MyISAM table with one char column without an index:

CODE:
  1. CREATE TABLE `sorttest` (
  2.   `data` char(30) DEFAULT NULL
  3. ) ENGINE=MyISAM DEFAULT CHARSET=latin1

and I inserted 100k rows with this simple script:

CODE:
  1. #!/bin/bash
  2.  
  3. NUMROW=100000
  4. COUNT=0
  5. while [ "$NUMROW" -gt "$COUNT" ]
  6. do
  7.     UUID=`uuidgen`
  8.     mysql test -e "insert into sorttest value ('$UUID');"
  9.     let "COUNT=COUNT+1"
  10. done

I know, I could have used the uuid() function of MySQL. For the benchmark, I used an old PII 350 MHz computer, I think for such CPU bound benchmarks, an old computer is better, if there are small differences, they'll be easier to observe. I varied the sort_buffer_size by steps of 32 KB and I recorded the time required to perform 12 queries like 'select * from sorttest order by data limit 78000,1' with, of course, the query cache disabled. I also verified that during the whole process, the computer never swapped and I pre-warmed the file cache before the benchmark by doing "alter table sorttest engine=myisam;". The script used for the benchmark is the following:

CODE:
  1. #!/bin/bash
  2.  
  3. for i in `seq 1 1000`
  4. do
  5.         START=`date +%s.%N`
  6.         OUT=`mysql -e "set session sort_buffer_size=32*1024*$i;select * from sorttest order by data limit 78000,1;show session status like 'Sort_merge_passes';select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;select * from sorttest order by data limit 78000,1;" test`
  7.         END=`date +%s.%N`
  8.         MERGE=`echo $OUT | cut -d' ' -f6`
  9.         TIME=`echo "$END - $START" | bc`
  10.         echo "$i $MERGE $TIME"
  11. done

which in addition to output the total time, output the number of Sort_merge_passes, which will be useful to interpret the results. The figure below shows a graphical representation of the results.

The first we can notice by looking at the graph is that the expected correspondence between the time for the queries and the number of sort merge passes. For the small values of the sort buffer size, below 440KB, there are many sort merge passes and the time for the queries hover around 18s. Above 440K, as the sort merge passes drops to 1, there is a large drop of the time for the queries below 14s. Then, as the sort buffer size is further risen, the performance gain is negative up to the point, around 6.4MB where no more sort merge pass are required and then, the time for the queries loses all dependency over the sort buffer size. I am still trying to figure out why the number of sort merge passes felt to zero at 6.4MB since the total size of the table is less than 3MB. That seems to be a pretty high overhead per row (~37 bytes) but if the structure has a few pointers, that can add up to such amount of bytes pretty quickly.

The important point here, at least for the Linux, glibc and MySQL versions I used and for the test I did, there doesn't seem to be an observable negative impact of the glibc memory allocation threshold at 256KB. I'll try to find ways to repeat this little experiment with the other per session buffers just to confirm the findings.

OS: Ubuntu 10.04 LTS, Linux test1 2.6.32-21-generic-pae #32-Ubuntu SMP Fri Apr 16 09:39:35 UTC 2010 i686 GNU/Linux
MySQL: 5.1.41-3ubuntu12.6-log

P.S.: Gnumerics for graphs is so much better than OpenOffice Calc.


Entry posted by Yves Trudeau | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Tuning MySQL Server Settings

Сентябрь 30th, 2010
The default configuration file for MySQL is intended not to use many resources, because its a general purpose sort of a configuration file. The default configuration does enough to have MySQL running happily with limited resources and catering to simple queries and small data-sets. The configuration file would most definitely need to be customized if you intend on using complex queries and when your good amount of data. Getting started Before you start changing configuration settings, you should keep in mind the fact that “more is not always better”. You might have 16-20 GB of RAM, but that doesn’t mean...
PlanetMySQL Voting: Vote UP / Vote DOWN

Cluster — spins/contentions and thread stuck in..

Сентябрь 2nd, 2010
I get a number of question about contentions/"stuck in..". So here comes some explanation to:
  • Contention
  • Thread Stuck in
  • What you can do about it
In 99% of the cases the contentions written out in the out file of the data nodes (ndb_X_out.log) is nothing to pay attention to.

sendbufferpool waiting for lock, contentions: 6000 spins: 489200
sendbufferpool waiting for lock, contentions: 6200 spins: 494721


Each spin is read from the L1 cache (4 cycles on a Nehalem (3.2GHz), so about a nanosecond).
1 spin = 1.25E-09 seconds (1.25ns)

In the above we have:
(494721-489200)/(6200-6000)= 27 spins/contention
Time spent on a contention=27 x 1.25E-09=3.375E-08 seconds (0.03375 us)

So we don't have a problem..

Another example (here is a lock guarding a job buffer (JBA = JobBuffer A, in short it handles signals for heartbeats and some other small things, all traffic goes over JobBuffer B).

jbalock thr: 1 waiting for lock, contentions: 145000 spins: 3280892543
jbalock thr: 1 waiting for lock, contentions: 150000 spins: 3403539479


(3403539479-3280892543)/(150000-145000)=24529 spins/contention
Time spent on a contention: 3.06613E-05 seconds (30.66us )

This is a bit higher than I would have expected and I think more analysis is needed. However, i tend not to get these contentions on a busy system.

Ndb kernel thread X is stuck in ..

Ndb kernel thread 4 is stuck in: Job Handling elapsed=100 Watchdog: User time: 82 System time: 667
Ndb kernel thread 4 is stuck in: Job Handling elapsed=200

Watchdog: User time: 82 System time: 668
Ndb kernel thread 4 is stuck in: Job Handling elapsed=300
Watchdog: User time: 82 System time: 669
Ndb kernel thread 4 is stuck in: Job Handling elapsed=400
Watchdog: User time: 82 System time: 670

Here the important is to look at how User time and System time behaves.
If User time is constant (as it is here - 82ms), but the System time is growing (667, 668 etc) which indicates that the OS kernel is busy.
Slow network? Sub-optimal kernel version? NIC drivers? swapping? some kernel process using too much cpu?

If User time is growing it is probably because the ndb kernel is overloaded.

What can you do about this?
  • In config.ini:
    RealtimeScheduler=1

    LockExecThreadToCPU=[cpuids]
  • check that cpuspeed is not running ( yum remove cpuspeed )
  • .. and finally ask us to optimize more!
Also, pay attention if you get the contentions on an idle Cluster or a busy Cluster.

PlanetMySQL Voting: Vote UP / Vote DOWN