Performance schema performance tuning
One of the most important question users ask before deciding to use the performance schema is: what is the overhead ? The underlying concern of course is to make sure deploying the performance schema does not negatively impact production ("First, do no harm").
The question is simple, and yet the answer is not so simple, as it depends on so many things.
As seen in some benchmarks done already, the overhead when using the performance schema can vary a lot, and in some cases really causes degradations.
A lot of time has been spent investigating the root causes (note the plural form) for the performance schema overhead in general, so I think it is a good time to share current findings.
Server workload
Analysing performances, and testing the effects of different code or configuration changes, can hardly be done in a vacuum. That's where benchmarks come in.
The problem with benchmarks in general is that different benchmarks will produce different loads, and that a given code may perform very well under some conditions but poorly under others ...
One particular property of the performance schema is that, since it instruments the server itself, the load and behavior of the server:
- is there capacity left or is it saturated ?
- is it CPU bound or disk io bound ?,
- paging ?
- swaping ?
- contended in a specific place (query cache, table cache, innodb buffer pool, [insert your
favorite here]) ?
directly influences the load put on the performance schema code:
- how many mutexes / rwlocks / cond events are detected ?
- how many file io events are detected ?
- how many table io / table lock events are detected ?
- how many stages / statements are detected ?
Given this load as input, the user configuration of the performance schema:
- the sizing of internal performance schema buffers,
- which instrument is enabled ?
- which instrument is timed ?
- which timer is used ?
- which consumer is enabled ?
- is there some filtering per user ? per host ? (table setup_actors)
- is there some filtering per object? (table setup_objects)
will decide what to do with the events seen:
- discard them,
- count them,
- time them,
- aggregate them, possibly with fancy aggregations,
all of which affects overhead.
To summarize, the overall system can be modeled as giant pipe, which consists of, in order:
- the client application and the generated application workload,
- the mysql server and the generated instrumentation workload,
- the performance schema and the generated statistics / overhead.
It is critical to understand how that pipe works to interpret results.
For example, if a client application is sending very few queries to the server, generating a very "small" load, a benchmark with or without the performance schema is not going to show any differences. To be meaningful, a benchmark (of the database) has to put a lot of pressure (on the database). This is hardly a surprise, but here comes the interesting part.
Next, if a server is poorly configured for the application workload received, and if the server itself does not perform well for that load, the instrumentation workload on the performance schema will be "small". A properly configured server will execute more queries, transactions, etc, which will generate more events to instrument. To be meaningful, a benchmark (of the performance schema) has to put a lot of pressure (on the performance schema).
An immediate consequence is that benchmarks of the performance schema should disable the query cache: there are currently 474 instrumentation points in the server. If most executions paths bounce off of the query cache, stress will be on "wait/synch/mutex/sql/Query_cache::structure_guard_mutex" only, cutting off the load generated from the lower layers on the server code, such as the table io instrumentation, or the storage engines instrumentation.
A more subtle consequence is that every time a performance bottleneck in the server code is removed, either by a bug fix or by a better server tuning, server efficiency goes up, which increases the number of statements / transactions / whatever the benchmark metric is counting, that are executed. While this is good news for the server, it actually increases the
instrumentation workload for the performance schema.
In conclusion, and in my opinion, to interpret a performance schema benchmark, results should describe not only the overhead observed, but also the load seen that caused the overhead. By load seen, I mean here the instrumentation load seen at the server / performance schema interface, since this is the real stress put on the performance schema code. The more details are available for each stage of the pipe, the better.
Time to dive into numbers then ...
The trunk benchmark
The load used for this analysis is sysbench, and more specifically the sysbench oltp read-only benchmark.
The reason this particular case was picked is not because I like the results on this one better.
The reasons are:
- sysbench oltp was the test used in previous benchmarks, and to reproduce the results
(and analyse the reasons for overhead), it helps to be in similar
conditions,
- a read only, in memory only load eliminates a lot of noise from
measurements, making analysis easier: the goal here is not to get a number
(by "just" running a benchmark), it is to get an explanation for the
number, so that the code can be improved,
- the performance schema code itself is "CPU only, user land only" code,
as it does not make any io or any system call. A benchmark that shares the
same characteristics seems more likely to provide interesting
insight.
Time permitting, I hope to get to do the same analysis for other cases (sysbench read-write, dbt2, etc).
The test machine used is a 8 CPU / 32 GB Linux box,
so the test configuration is sized accordingly:
- for sysbench: --oltp-table-size=1000000
- for mysql my.cnf: innodb_buffer_pool_size = 12G
Historical note: testing done for this post started on a 8 CPU 16 GB machine, then moved to
another lab. The original configuration was not changed, hence the 12G
buffer pool only.
The trunk results
All results are for mysql-trunk (revno 3195, 2011-06-13), compiled in Release
configuration. This corresponds to the current code for mysql 5.6, which is still in development.
Tests without the performance schema
With this build, the performance instrumentation is not compiled.
All the computation of overhead expressed as percentages are relative to
this reference, which is used as the ultimate baseline.
| Connections | TPS | CPU User (%) | CPU System (%) | CPU Idle (%) |
| 1 | 807 | 10 | 2 | 88 |
| 2 | 1439 | 20 | 5 | 75 |
| 4 | 2610 | 40 | 10 | 50 |
| 8 | 5086 | 84 | 14 | 2 |
| 12 | 5037 | 84 | 15 | 1 |
| 16 | 5038 | 85 | 15 | 0 |
| 32 | 4854 | 85 | 15 | 0 |
| 48 | 4762 | 85 | 15 | 0 |
| 64 | 4118 (*) | 60 | 32 | 8 |
| 128 | 3538 (*) | 57 | 35 | 8 |
| 256 | 1746 (*) | 57 | 35 | 8 |
The overall CPU statistics are collected by vmstat, and helps to qualify in
which mode the server is operating.
Looking at the overall shape of the performance curve, three different
operating modes can be observed:
- Ramping up (1 to 8 connections)
- Peek efficiency (8 to 48 connections)
- Server progressive degradation (48 to infinity connections)
For a 8 core machine, this looks reasonable.
(*) A special note about results for 64 / 128 / 256 connections: the performance measured in this cases is not stable and can vary a lot (50 percent) between runs.
This is due to the machine saturation: at 256 connections, 256 threads are running in sysbench, plus 256+16 in mysqld, which makes 512+ active threads on a 8 core.
Numbers are posted for this case as well, to include this mode of operation in the analysis.
Tests with the performance schema
Now that we have established what the baseline looks like, executing the
same workload on the same hardware, under the same conditions gives the
following results. The CPU user / system / idle statistics from vmstat are
not repeated (the overall profile is the same) for brevity.
| With performance schema |
disabled |
off |
global (timed) |
thread (timed) |
| Connections |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
| 1 |
796 | 1.36 |
775 | 3.97 |
684 | 15.24 |
681 | 15.61 |
| 2 |
1409 | 2.08 |
1349 | 6.25 |
1198 | 16.75 |
1191 | 17.23 |
| 4 |
2544 | 2.53 |
4634 | 5.86 |
2115 | 18.97 |
2114 | 19.00 |
| 8 |
4844 | 4.76 |
4634 | 8.89 |
3974 | 21.86 |
3913 | 23.06 |
| 12 |
4850 | 3.71 |
4662 | 7.44 |
4020 | 20.19 |
3979 | 21.00 |
| 16 |
4847 | 3.79 |
4676 | 7.19 |
4051 | 19.59 |
4020 | 20.21 |
| 32 |
4676 | 3.61 |
4485 | 7.60 |
3851 | 20.66 |
3789 | 21.94 |
| 48 |
4582 | 3.78 |
4384 | 7.94 |
3759 | 21.06 |
3692 | 22.47 |
| 64 (*) |
2548 | 38.13 |
3048 | 25.98 |
3392 | 17.63 |
3286 | 20.20 |
| 128 (*) |
2035 | 42.48 |
2007 | 43.27 |
1938 | 45.22 |
1917 | 45.82 |
| 256 (*) |
1745 | 0.06 |
1697 | 2.81 |
1662 | 4.81 |
1624 | 6.99 |
(*) The same comment about results stability applies here. Because the results are unstable with great variations, the overhead computed after normalization on a unstable baseline gives meaningless results.
Analysis
Overall, the overhead in the most basic configuration, with all instruments enabled and only the global instrumentation consumer enabled, is at 20 percent already.
This is a lot. Looking in more details at these results, it appears that:
- Adding the thread instrumentation consumer (to get statistics per thread) does only marginally increase overhead:
at 8 connections, the overhead goes from 21.86% to 23.06%.
- The overhead depends on the number of connections. It is progressive in the 1-8 connections range, peaks at 8 connections, then is both stable and significant at 8-48 connections.
- Peak TPS without the performance schema was at 8 connections. With the performance schema, it seems to move to 12 or even 16 connections.
To understand the root cause for this 20% toll, the following measurements are also instructive:
- With the performance schema enabled but all instruments disabled (configuration "off"), the overhead is at ~8%. For code doing nothing, this is actually quite high.
- When the performance schema is disabled at startup time (configuration "disabled"), the overhead is at ~4%. This configuration is not very useful from a functional point of view, but from a technical point of view it executes a code path which is a subset of the configuration "off". That makes the numbers collected useful.
Based on this, overhead seems to originate from, roughly:
- compiling with the performance schema (configuration "disabled"): ~4% cost,
- starting with the performance schema (configuration "off"): additional ~4% cost,
- enabling the global consumer only (configuration "global (timed)"): additional ~12% cost,
- adding the thread consumer only (configuration "global (timed)"): additional ~1% cost
To understand better which code is executed, and how often, let's use the performance schema itself to get some insight on the server execution. After all, that is precisely what the performance schema is for.
Using the "thread (timed)" configuration, with the assumption that the instrumentation does not distort the workload, and looking at details for 8 connections, which is chosen because it corresponds to the peak performance without the performance schema, statistics collected are as follows below.
Total number of wait events.
mysql> select sum(count_star), sum(sum_timer_wait)
-> from performance_schema.events_waits_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
| 1098658477 | 797822128662705 |
+-----------------+---------------------+
1 row in set (1.79 sec)
Top waits, by count
mysql> select * from performance_schema.events_waits_summary_global_by_event_name
-> where count_star>0 order by count_star desc limit 25;
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| wait/io/table/sql/handler | 482492695 | 583215424951365 | 42210 | 1208680 | 7913678535 |
| wait/synch/mutex/innodb/srv_conc_mutex | 265080324 | 142377797253015 | 30150 | 537005 | 1862766495 |
| wait/synch/mutex/sql/THD::LOCK_thd_data | 122077261 | 6519294560295 | 30150 | 53265 | 1072076715 |
| wait/synch/rwlock/innodb/hash table locks | 38409624 | 5420427797475 | 27135 | 141035 | 360901530 |
| wait/synch/mutex/innodb/read_view_mutex | 32554195 | 6146001298890 | 24120 | 188605 | 1011083265 |
| wait/synch/mutex/sql/LOCK_open | 32553870 | 9359193544965 | 30150 | 287430 | 1080446355 |
| wait/lock/table/sql/handler | 32553724 | 20576707870950 | 99495 | 631810 | 1083789990 |
| wait/synch/mutex/mysys/THR_LOCK::mutex | 32553724 | 6964485402105 | 30150 | 213730 | 3022857090 |
| wait/synch/rwlock/innodb/trx_sys_lock | 18592602 | 5270410352025 | 42210 | 283410 | 1622682045 |
| wait/synch/rwlock/sql/LOCK_grant | 16276408 | 3502306526580 | 51255 | 215070 | 892225935 |
| wait/synch/rwlock/innodb/btr_search_latch | 16198467 | 4144473779535 | 27135 | 255605 | 1048297410 |
| wait/synch/rwlock/sql/MDL_lock::rwlock | 2325412 | 524010714480 | 54270 | 225120 | 36433260 |
| wait/synch/mutex/mysys/BITMAP::mutex | 2325268 | 409667867595 | 42210 | 175875 | 908214480 |
| wait/synch/mutex/innodb/trx_mutex | 1163261 | 32488536510 | 24120 | 27805 | 7419915 |
| wait/synch/mutex/sql/MDL_map::mutex | 1162780 | 274705866855 | 36180 | 236175 | 133772535 |
| wait/synch/mutex/sql/LOCK_plugin | 1162673 | 212769433395 | 33165 | 182910 | 791328960 |
| wait/synch/mutex/innodb/lock_mutex | 1162633 | 145945057815 | 39195 | 125290 | 14173515 |
| wait/synch/mutex/innodb/log_sys_mutex | 7757 | 2441752020 | 27135 | 314565 | 1239502680 |
| wait/synch/mutex/innodb/os_mutex | 2490 | 271684665 | 30150 | 108875 | 17423685 |
| wait/synch/mutex/innodb/fil_system_mutex | 1402 | 146559150 | 24120 | 104520 | 5538555 |
| wait/synch/mutex/innodb/flush_list_mutex | 607 | 96362415 | 33165 | 158455 | 346725 |
| wait/io/file/innodb/innodb_log_file | 600 | 2693980326195 | 23094900 | 4489967145 | 819050374485 |
| wait/synch/mutex/innodb/buf_pool_mutex | 516 | 40862295 | 27135 | 79060 | 521595 |
| wait/synch/mutex/innodb/log_flush_order_mutex | 473 | 49503285 | 33165 | 104520 | 397980 |
| wait/synch/mutex/innodb/lock_wait_mutex | 428 | 71949960 | 63315 | 167835 | 1706490 |
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
25 rows in set (1.79 sec)
Top waits, by time
mysql> select * from performance_schema.events_waits_summary_global_by_event_name
-> where count_star>0 order by sum_timer_wait desc limit 25;
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| wait/io/table/sql/handler | 482492695 | 583215424951365 | 42210 | 1208680 | 7913678535 |
| wait/synch/mutex/innodb/srv_conc_mutex | 265080324 | 142377797253015 | 30150 | 537005 | 1862766495 |
| wait/lock/table/sql/handler | 32553724 | 20576707870950 | 99495 | 631810 | 1083789990 |
| wait/synch/mutex/sql/LOCK_open | 32553870 | 9359193544965 | 30150 | 287430 | 1080446355 |
| wait/synch/mutex/mysys/THR_LOCK::mutex | 32553724 | 6964485402105 | 30150 | 213730 | 3022857090 |
| wait/synch/mutex/sql/THD::LOCK_thd_data | 122077261 | 6519294560295 | 30150 | 53265 | 1072076715 |
| wait/synch/mutex/innodb/read_view_mutex | 32554342 | 6146010024300 | 24120 | 188605 | 1011083265 |
| wait/synch/rwlock/innodb/hash table locks | 38409624 | 5420427797475 | 27135 | 141035 | 360901530 |
| wait/synch/rwlock/innodb/trx_sys_lock | 18592651 | 5270414135850 | 42210 | 283410 | 1622682045 |
| wait/synch/rwlock/innodb/btr_search_latch | 16198467 | 4144473779535 | 27135 | 255605 | 1048297410 |
| wait/synch/rwlock/sql/LOCK_grant | 16276408 | 3502306526580 | 51255 | 215070 | 892225935 |
| wait/io/file/innodb/innodb_log_file | 600 | 2693980326195 | 23094900 | 4489967145 | 819050374485 |
| wait/synch/rwlock/sql/MDL_lock::rwlock | 2325412 | 524010714480 | 54270 | 225120 | 36433260 |
| wait/synch/mutex/mysys/BITMAP::mutex | 2325268 | 409667867595 | 42210 | 175875 | 908214480 |
| wait/synch/mutex/sql/MDL_map::mutex | 1162780 | 274705866855 | 36180 | 236175 | 133772535 |
| wait/synch/mutex/sql/LOCK_plugin | 1162673 | 212769433395 | 33165 | 182910 | 791328960 |
| wait/synch/mutex/innodb/lock_mutex | 1162633 | 145945057815 | 39195 | 125290 | 14173515 |
| wait/synch/mutex/innodb/trx_mutex | 1163457 | 32497638795 | 24120 | 27805 | 7419915 |
| wait/io/file/innodb/innodb_data_file | 8 | 23031533745 | 54936315 | 2878941425 | 7430889600 |
| wait/synch/rwlock/innodb/checkpoint_lock | 4 | 5048894880 | 250245 | 1262223720 | 2745088155 |
| wait/synch/mutex/innodb/log_sys_mutex | 8095 | 2471021640 | 27135 | 305185 | 1239502680 |
| wait/synch/mutex/sql/LOCK_thread_count | 31 | 830150100 | 30150 | 26778895 | 235748880 |
| wait/synch/mutex/innodb/os_mutex | 2539 | 276339825 | 30150 | 108540 | 17423685 |
| wait/synch/mutex/innodb/fil_system_mutex | 1451 | 149064615 | 24120 | 102510 | 5538555 |
| wait/synch/mutex/innodb/flush_list_mutex | 704 | 111835395 | 33165 | 158790 | 346725 |
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
25 rows in set (1.79 sec)
Total number of stage events
mysql> select sum(count_star), sum(sum_timer_wait)
-> from performance_schema.events_stages_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
| 244153447 | 2263079088326000 |
+-----------------+---------------------+
1 row in set (0.00 sec)
Top stages, by count
mysql> select * from performance_schema.events_stages_summary_global_by_event_name
-> where count_star>0 order by count_star desc limit 25;
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| stage/sql/init | 34879169 | 157347558824000 | 2374000 | 4511000 | 1685373000 |
| stage/sql/cleaning up | 18602306 | 25263461317000 | 628000 | 1358000 | 2278265000 |
| stage/sql/closing tables | 18602129 | 384870074281000 | 628000 | 20689000 | 4374172000 |
| stage/sql/query end | 18602129 | 23815226118000 | 698000 | 1280000 | 1674268000 |
| stage/sql/checking permissions | 16276937 | 37278853417000 | 1327000 | 2290000 | 1734543000 |
| stage/sql/Opening tables | 16276936 | 91344126412000 | 3352000 | 5611000 | 1641511000 |
| stage/sql/executing | 16276863 | 15264844128000 | 628000 | 937000 | 1097577000 |
| stage/sql/statistics | 16276863 | 325924010836000 | 6495000 | 20023000 | 3296152000 |
| stage/sql/Sending data | 16276863 | 548279995436000 | 2794000 | 33684000 | 6805393000 |
| stage/sql/System lock | 16276863 | 68298443425000 | 2444000 | 4196000 | 1102675000 |
| stage/sql/preparing | 16276863 | 40997524468000 | 1676000 | 2518000 | 1090453000 |
| stage/sql/optimizing | 16276863 | 31599629407000 | 1606000 | 1941000 | 1093176000 |
| stage/sql/end | 16276863 | 17257774496000 | 768000 | 1060000 | 1070408000 |
| stage/sql/Sorting result | 2325266 | 270070871391000 | 7264000 | 116146000 | 3214855000 |
| stage/sql/removing tmp table | 1162634 | 15392269553000 | 9359000 | 13239000 | 904741000 |
| stage/sql/freeing items | 1162634 | 15041751826000 | 3143000 | 12937000 | 1106237000 |
| stage/sql/Creating tmp table | 1162633 | 14681955665000 | 9568000 | 12628000 | 1092898000 |
| stage/sql/Copying to tmp table | 1162633 | 180350717326000 | 116008000 | 155122000 | 8074224000 |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
18 rows in set (0.00 sec)
Top stages, by time
mysql> select * from performance_schema.events_stages_summary_global_by_event_name
-> where count_star>0 order by sum_timer_wait desc limit 25;
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME | COUNT_STAR | SUM_TIMER_WAIT | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| stage/sql/Sending data | 16276863 | 548279995436000 | 2794000 | 33684000 | 6805393000 |
| stage/sql/closing tables | 18602129 | 384870074281000 | 628000 | 20689000 | 4374172000 |
| stage/sql/statistics | 16276863 | 325924010836000 | 6495000 | 20023000 | 3296152000 |
| stage/sql/Sorting result | 2325266 | 270070871391000 | 7264000 | 116146000 | 3214855000 |
| stage/sql/Copying to tmp table | 1162633 | 180350717326000 | 116008000 | 155122000 | 8074224000 |
| stage/sql/init | 34879169 | 157347558824000 | 2374000 | 4511000 | 1685373000 |
| stage/sql/Opening tables | 16276936 | 91344126412000 | 3352000 | 5611000 | 1641511000 |
| stage/sql/System lock | 16276863 | 68298443425000 | 2444000 | 4196000 | 1102675000 |
| stage/sql/preparing | 16276863 | 40997524468000 | 1676000 | 2518000 | 1090453000 |
| stage/sql/checking permissions | 16276937 | 37278853417000 | 1327000 | 2290000 | 1734543000 |
| stage/sql/optimizing | 16276863 | 31599629407000 | 1606000 | 1941000 | 1093176000 |
| stage/sql/cleaning up | 18602306 | 25263461317000 | 628000 | 1358000 | 2278265000 |
| stage/sql/query end | 18602129 | 23815226118000 | 698000 | 1280000 | 1674268000 |
| stage/sql/end | 16276863 | 17257774496000 | 768000 | 1060000 | 1070408000 |
| stage/sql/removing tmp table | 1162634 | 15392269553000 | 9359000 | 13239000 | 904741000 |
| stage/sql/executing | 16276863 | 15264844128000 | 628000 | 937000 | 1097577000 |
| stage/sql/freeing items | 1162634 | 15041751826000 | 3143000 | 12937000 | 1106237000 |
| stage/sql/Creating tmp table | 1162633 | 14681955665000 | 9568000 | 12628000 | 1092898000 |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
18 rows in set (0.00 sec)
Total number of statement events
mysql> select sum(count_star), sum(sum_timer_wait)
-> from performance_schema.events_statements_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
| 18602306 | 2293686457711000 |
+-----------------+---------------------+
1 row in set (0.00 sec)
Top statements, by count
mysql> select event_name, count_star, sum_timer_wait
from performance_schema.events_statements_summary_global_by_event_name
where count_star>0 order by count_star desc limit 25;
+---------------------------------+------------+------------------+
| event_name | count_star | sum_timer_wait |
+---------------------------------+------------+------------------+
| statement/com/Execute | 17439495 | 2267764442472000 |
| statement/sql/begin | 1162633 | 25914627564000 |
| statement/com/Prepare | 88 | 3041856000 |
| statement/com/Close stmt | 80 | 729649000 |
| statement/com/Quit | 9 | 47910000 |
| statement/sql/show_table_status | 1 | 3568260000 |
+---------------------------------+------------+------------------+
6 rows in set (0.00 sec)
Top statements, by time
mysql> select event_name, count_star, sum_timer_wait
from performance_schema.events_statements_summary_global_by_event_name
where count_star>0 order by sum_timer_wait desc limit 25;
+---------------------------------+------------+------------------+
| event_name | count_star | sum_timer_wait |
+---------------------------------+------------+------------------+
| statement/com/Execute | 17439495 | 2267764442472000 |
| statement/sql/begin | 1162633 | 25914627564000 |
| statement/sql/show_table_status | 1 | 3568260000 |
| statement/com/Prepare | 88 | 3041856000 |
| statement/com/Close stmt | 80 | 729649000 |
| statement/com/Quit | 9 | 47910000 |
+---------------------------------+------------+------------------+
6 rows in set (0.00 sec)
Timer statistics
mysql> select * from performance_schema.setup_timers;
+-----------+------------+
| NAME | TIMER_NAME |
+-----------+------------+
| wait | CYCLE |
| stage | NANOSECOND |
| statement | NANOSECOND |
+-----------+------------+
3 rows in set (0.00 sec)
mysql> select * from performance_schema.performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE | 2982179104 | 1 | 36 |
| NANOSECOND | 1000000000 | 1 | 1404 |
| MICROSECOND | 1000000 | 1 | 1422 |
| MILLISECOND | 1034 | 1 | 1431 |
| TICK | 104 | 1 | 774 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
Interpreting results
Let's do some basic math to understand what this means ...
The total number of wait events collected is "1098658477". That's a big number. In a more readable form, it reads "1,098,658,477".
That still is a big number. In plain English, it is
1 Billion, with a
B. This was for a 5 minutes (300 sec) run, and all the statistics were truncated just before the sysbench load, so only the load itself is counted here.
Events collection rates (in number or events per seconds) are then:
- 3662194 waits/sec, or 3.6 M waits/sec
- 813844 stages/sec, or 813 K stages/sec
- 62007 statements/sec, or 62 K statements/sec
These numbers are actually mind blowing. Keep in mind that there is no sampling involved. When each and every one of these events is collected, the performance schema time the event, and add statistics to the proper counter for the object and the thread involved, so that statistics per object, per table, per thread, etc, are available.
For what it is worth, we also have 13 stages per statements, and 59 waits per statements, on average.
Based on the top wait events, we can verify that this workload is indeed CPU bound and read only, since no file io are involved
An important point to note is that "wait/io/table" here represents table IO, as seen at the SQL layer / storage engine interface. A table IO event will time the code executed overall inside the storage engine, so it basically is the sum of the CPU time spent inside the engine and the waits blocking the storage engine. Having a big number reported in table io is expected.
Given the rate of collection for this load, 3.6 M waits/sec, the overhead in the performance schema is not caused by a big, rarely invoked, slow function, but rather by very frequent calls to very short functions.
With such a number of events, timing events cause a huge number of timer invocations.
For waits, the timer used is the cycle timer (see table performance_schema.setup_timers). For the CYCLE timer, the timer overhead alone, expressed in cycles, is 36 for this machine (see table performance_schema.performance_timers).
With 1098658477 events, 2 timer invocation per event, that makes 79103410344 cycles spent timing waits.
Given a CPU frequency of 2982179104 or 2.9 GHz, this represents 26.5224 seconds of CPU in a 300 seconds benchmark.
During this 300 sec benchmark, top reports that mysqld (the server process) consumes 723 percents of CPU (this is a 8 core, but sysbench alone eats 70 percent), so the overall CPU used by the server is 2169 sec. An overhead of 26 sec / 2169 sec represents 1.2 percent of CPU power, just timing waits.
Let's do the same math for stages and statements. They both use the same timer, nanoseconds.
With 244153447 stages, 2 timers per stage, 1404 cycles per timer, we have 685582879176 cycles or 229.8933 sec of CPU.
With 18602306 statements, 2 timers per stage, 1404 cycles per timer, we have 52235275248 cycles or 17.5158 sec of CPU.
Ouch ! That makes 229.8933+17.5158 or 247.4091 seconds of CPU calling the nanosecond timer. In terms of overhead, 247 seconds taken from 2169 represents 11.4 percent of mysqld spent for the nanosecond timer alone.
The bottom line is, this usage of the nanosecond timer is killing the performance. Now, why ? Intuitively, one would think that collecting and timing waits at a rate of 3.6 M / sec is more consuming than doing the same thing for stages and statements, at a rate of 813 K / sec and 62 K / sec respectively.
The root cause for this bad performances is hidden in plain sight, exposed by the performance_timers table.
mysql> select * from performance_schema.performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE | 2982179104 | 1 | 36 |
| NANOSECOND | 1000000000 | 1 | 1404 | <-- Really ? This is HUGE !
| MICROSECOND | 1000000 | 1 | 1422 |
| MILLISECOND | 1034 | 1 | 1431 |
| TICK | 104 | 1 | 774 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
This is actually the first time I find a machine where the overhead for the NANOSECOND, MICROSECOND and MILLISECOND timer is so high
Some details about the machine used ...
malff@tyr03:~>more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
stepping : 6
cpu MHz : 2992.227
cache size : 6144 KB
...
There are 8 cores total, all identical.
Timer details
malff@tyr03:withpfs>./unittest/mysys/my_rdtsc-t
1..11
# ----- Routine ---------------
# myt.cycles.routine : 2 <-- RDTSC
# myt.nanoseconds.routine : 11 <-- CLOCK_GETTIME
# myt.microseconds.routine : 13
# myt.milliseconds.routine : 18
# myt.ticks.routine : 17
# ----- Frequency -------------
# myt.cycles.frequency : 2986029850
# myt.nanoseconds.frequency : 1000000000
# myt.microseconds.frequency : 1000000
# myt.milliseconds.frequency : 1035
# myt.ticks.frequency : 104
# ----- Resolution ------------
# myt.cycles.resolution : 1
# myt.nanoseconds.resolution : 1
# myt.microseconds.resolution : 1
# myt.milliseconds.resolution : 1
# myt.ticks.resolution : 1
# ----- Overhead --------------
# myt.cycles.overhead : 36
# myt.nanoseconds.overhead : 1404 <-- Ouch
# myt.microseconds.overhead : 1422
# myt.milliseconds.overhead : 1431
# myt.ticks.overhead : 765
...
malff@tyr03:~>uname -a
Linux tyr03 2.6.32-100.28.17.el6.x86_64 #1 SMP Mon May 23 04:37:16 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
To have a point of comparison, the following are taken from a 4 core 8 Gb machine (my now several years old desktop)
mysql> select * from performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE | 2831810945 | 1 | 34 | <-- Marc desktop
| NANOSECOND | 1000000000 | 1 | 102 | <-- Marc desktop
| MICROSECOND | 1000000 | 1 | 111 |
| MILLISECOND | 1037 | 1 | 127 |
| TICK | 105 | 1 | 493 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
malff@linux-8edv:~> more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
stepping : 10
cpu MHz : 2833.000
cache size : 6144 KB
...
malff@linux-8edv:~> uname -a
Linux linux-8edv 2.6.37.6-0.5-desktop #1 SMP PREEMPT 2011-04-25 21:48:33 +0200 x86_64 x86_64 x86_64 GNU/Linux
Both the lab machine (tyr03) and my desktop have an Intel processor, and they happen to even be of the same family (6) and model (23),
only the model name differ (Xeon E5450 vs Core 2 Quad Q9550).
The CYCLE timer is implemented using RDTSC in both cases, and the overheads measured are very similar, with 36 and 34 cycles.
Implementation of the CYCLE timer is in
my_timer_cycles() in mysys.
Details about the RDTSC instruction can be found in the
processor spec, page C-32 (722). Family 6 model 23 is CPUID 06_17H.
For the nanoseconds timer, it is implemented in
my_timer_nanoseconds() in both cases with a call to clock_gettime(CLOCK_REALTIME), in both cases on Linux 2.6.
The lab machine has a Linux 2.6.32 kernel, while the desktop has Linux 2.6.37.
The exact root cause of this huge difference in overhead (1404 vs 102) has not been explained yet.
In the mean time, given the cost of the nanosecond timer, maybe the performance schema overhead could be reduced by not timing events or timing them with the cycle timer.
This is worth a new set of measurements, still for mysql-trunk ...
The trunk benchmark, revisited
| With performance schema |
global (not timed) |
thread (not timed) |
thread (all cycles) |
| Connections |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
| 1 |
757 | 6.20 |
752 | 6.82 |
739 | 8.43 |
| 2 |
1362 | 5.35 |
1327 | 7.78 |
1293 | 10.15 |
| 4 |
2441 | 6.48 |
2385 | 8.62 |
2328 | 10.80 |
| 8 |
4490 | 11.72 |
4384 | 13.80 |
4165 | 18.11 |
| 12 |
4489 | 10.88 |
4411 | 12.43 |
4220 | 16.22 |
| 16 |
4489 | 10.90 |
4436 | 11.95 |
4284 | 14.97 |
| 32 |
4336 | 10.67 |
4232 | 12.81 |
3992 | 17.76 |
| 48 |
4244 | 10.88 |
4111 | 13.67 |
3876 | 18.61 |
| 64 (*) |
3977 | 3.42 |
3040 | 26.18 |
2527 | 38.64 |
| 128 (*) |
2830 | 20.01 |
1972 | 44.26 |
1897 | 46.38 |
| 256 (*) |
1683 | 3.61 |
1671 | 4.30 |
1629 | 6.70 |
(*) Noise at 64/128/256 still makes results unstable.
These new results, using a different configuration but still the same code in mysql-trunk, are interesting for two reasons.
First, it confirms the effect of timing alone. Timing a
lot or events takes time, and timing a lot of events with a
bad timer makes things really worse. Using the CYCLE timer for waits, stages and statements here, in the "thread (all cycle)" configuration, helped to regain 5 percent.
Second, it shows that once all timers are removed from the code path, the overhead is still in the ~11% range for the global only consumer. The code probably has room for optimizations, see the next section.
The tuning benchmark
The current code in mysql-trunk (at time of writing) can be improved and optimized, in an effort to reduce the overhead even more. A long list of different tweaks, changes, fixes ... has been the topic of a lot of research and tuning lately. Instead of going through each change, the cumulated changes are available in a different branch, in mysql-trunk-pfs-tuning.
This branch is
available on launchpad for the technically curious. A bit of warning: it is used only for prototyping and benchmarking to see what code does (or does not) work better. It may not even build for your platform. Do not use this in production.
| mysql-trunk-pfs-tuning |
without pfs (baseline) |
disabled |
off |
global (not timed) |
global (timed) |
thread (not timed) |
thread (timed) |
thread (all cycles) |
| Connections |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
TPS | Overhead (%) |
| 1 |
819 | 0 |
788 | 3.79 |
778 | 5.01 |
771 | 5.86 |
684 | 16.48 |
755 | 7.81 |
678 | 17.22 |
744 | 9.16 |
| 2 |
1431 | 0 |
1405 | 1.82 |
1383 | 3.35 |
1354 | 5.38 |
1212 | 15.30 |
1329 | 7.13 |
1199 | 16.21 |
1307 | 8.67 |
| 4 |
2598 | 0 |
2567 | 1.19 |
2522 | 2.93 |
2452 | 5.62 |
2163 | 16.74 |
2394 | 7.85 |
2141 | 17.59 |
2371 | 8.74 |
| 8 |
4949 | 0 |
4980 | -0.63 |
4799 | 3.03 |
4635 | 6.34 |
4105 | 17.05 |
4445 | 10.18 |
4061 | 17.94 |
4367 | 11.76 |
| 12 |
4948 | 0 |
4936 | 0.24 |
4759 | 3.82 |
4616 | 6.71 |
4112 | 16.90 |
4478 | 9.50 |
4076 | 17.62 |
4390 | 11.28 |
| 16 |
4944 | 0 |
4931 | 0.26 |
4754 | 3.84 |
4601 | 6.94 |
4131 | 16.44 |
4476 | 9.53 |
4077 | 17.54 |
4384 | 11.33 |
| 32 |
4764 | 0 |
4753 | 0.23 |
4596 | 3.53 |
4455 | 6.49 |
3984 | 16.37 |
4286 | 10.03 |
3945 | 17.19 |
4200 | 11.84 |
| 48 |
4682 | 0 |
4681 | 0.02 |
4519 | 3.48 |
4377 | 6.51 |
3908 | 16.53 |
4202 | 10.25 |
3860 | 17.56 |
4111 | 12.20 |
| 64 (*) |
4148 | 0 |
3881 | 6.44 |
2583 | 37.73 |
3653 | 11.93 |
2916 | 29.70 |
3082 | 25.70 |
2863 | 30.98 |
3496 | 15.72 |
| 128 (*) |
3659 | 0 |
2034 | 44.41 |
2024 | 44.68 |
1983 | 45.80 |
1924 | 47.42 |
2011 | 45.04 |
1886 | 48.46 |
1939 | 47.01 |
| 256 (*) |
2514 | 0 |
1746 | 30.55 |
1732 | 31.11 |
1690 | 32.78 |
1632 | 35.08 |
1724 | 31.42 |
1640 | 34.77 |
1646 | 34.53 |
We can see some progress:
- When all consumers are off, the overhead is in the 3% to 4% range, which is much better compared to ~8%.
- The "global (timed)" and "thread (timed)" configurations still have an important overhead because they use the nanosecond timer, but this overhead decreased, which shows the benefits of the code optimizations.
- The "thread (all cycles)" overhead is in the ~11% range.
Now that's better. With every instruments enabled and timed, for all waits stages and statements, and with statistics collected globally and per thread, 11% for this wealth of information is not so bad. Especially knowing that the overhead can be changed back to 3-4% at anytime, since all this configuration can be changed dynamically.
Take aways
The most important facts found here:
- Timer performances do vary between platforms, and have a great impact. Check the timer performance itself with table performance_schema.performance_timers before deciding to time events.
- The performance schema code itselt can be improved. Current optimizations are available in mysql-trunk-pfs-tuning, as a preview.
- Overall, the performance schema captures instrumentation events at a very high rate. Limiting the choice of consumers and instruments to use is a good way to cut down the overhead.
- There is no magic formula that says "overhead is X". Overhead depends so much on the work load, the hardware, the configuration, etc, that the best answer is "it depends".
The real question about overhead, given that it depends so much, really boils down to "What is the overhead for my application ?". The best way to find that is to try it ...
-- Marc Alff
PlanetMySQL Voting:
Vote UP /
Vote DOWN