Archive for the ‘production’ Category

Estimating Replication Capacity

Июль 21st, 2010

It is easy for MySQL replication to become bottleneck when Master server is not seriously loaded and the more cores and hard drives the get the larger the difference becomes, as long as replication
remains single thread process. At the same time it is a lot easier to optimize your system when your replication runs normally - if you need to add/remove indexes and do other schema changes you probably would be looking at some methods involving replication if you can't take your system down. So here comes the catch in many systems - we find system is in need for optimization when replication can't catch up but yet optimization process we're going to use relays on replication being functional and being able to catch up quickly.

So the question becomes how can we estimate replication capacity, so we can deal with replication load before slave is unable to catch up.

Need to replication capacity is not only needed in case you're planning to use replication to perform system optimization, it is also needed on other cases. For example in sharded environment you may need to schedule downtime or set object read only to move it to another shard. It is much nicer if it can be planned in advance rather than done on emergency basics when slave(s) are unable to catch up and application is suffering because of stale data. This especially applies to Software as Service providers which often have very strict SLA agreements with their customers and which can have a lot of data per customer so move can take considerable amount of time.

So what is replication capacity I call replication capacity the ability to replicate the master load. If replication is able to replicate 3 times the write load from the master without falling behind I will call it replication capacity of 3. When used with context of applying binary logs (for example point in time recovery from backup) replication capacity of 1 will mean you can reply 1 hour worth of binary logs within 1 hour. I will call "replication load" the inverse of replication capacity - this is basically what percentage of time the replication thread was busy replicating events vs staying idle.

Note you can speak about idle replication capacity, when box does not do anything else as well as loaded replication capacity when the box serves the normal load. Both are important. You care about idle replication capacity when you have no load on the slave and need it to catch up or when restoring from backup, the loaded replication capacity matters during normal operation.

So we defined what replication capacity is. There is however no tools which can tell us straight what replication capacity is for the given system. It also tends to float depending on the load similar as loadavg metrics. Here are some of the ways to measure it:

1) Use "UserStats" functionality from Google patches, which is now available in Percona Server and MariaDB. This is the probably the easiest and most accurate approach but it
does not work in Oracle MySQL Server. set userstat_running=1 and run following query:

SQL:
  1. mysql> SELECT * FROM information_schema.user_statistics WHERE user="#mysql_system#" \G
  2. *************************** 1. row ***************************
  3. USER: #mysql_system#
  4. TOTAL_CONNECTIONS: 1
  5. CONCURRENT_CONNECTIONS: 0
  6. CONNECTED_TIME: 446
  7. BUSY_TIME: 74
  8. CPU_TIME: 0
  9. BYTES_RECEIVED: 0
  10. BYTES_SENT: 63
  11. BINLOG_BYTES_WRITTEN: 0
  12. ROWS_FETCHED: 0
  13. ROWS_UPDATED: 127576
  14. TABLE_ROWS_READ: 4085689
  15. SELECT_COMMANDS: 0
  16. UPDATE_COMMANDS: 119127
  17. OTHER_COMMANDS: 89557
  18. COMMIT_TRANSACTIONS: 90259
  19. ROLLBACK_TRANSACTIONS: 0
  20. DENIED_CONNECTIONS: 1
  21. LOST_CONNECTIONS: 0
  22. ACCESS_DENIED: 0
  23. EMPTY_QUERIES: 0
  24. 1 row IN SET (0.00 sec)

In this case CONNECTED_TIME is 446 second, out of this replication thread was busy (BUSY_TIME) 74 seconds which means replication capacity is 446/74 = 6
You normally would not like to measure it from the start but rather take the difference in these counters every 5 minutes or other interval of your choice.

2) Use full slow query log and mk-query-digest. This method is great for one time execution especially as it comes together with giving you the list of queries which load replication
the most. It however works only with statement level replication. You need to set long_query_time=0 and log_slave_slow_statements=1 for this method to work.
Get the log file which will include all queries MySQL server ran with their times and run mk-query-digest with filter to only check queries from replication thread:

mk-query-digest slow-log --filter '($event->{user} || "") =~ m/[SLAVE_THREAD]/' > /tmp/report-slave.txt

In the report you will see something like this as a header:

SQL:
  1. # 475s user time, 1.2s system time, 80.41M rss, 170.38M vsz
  2. # Current date: Mon Jul 19 15:12:24 2010
  3. # Files: slow-log
  4. # Overall: 1.22M total, 1.27k unique, 558.56 QPS, 0.37x concurrency ______
  5. # total min max avg 95% stddev median
  6. # Exec time 819s 1us 92s 669us 260us 120ms 93us
  7. # Lock time 28s 0 166ms 23us 49us 192us 25us
  8. # Rows sent 4.27k 0 325 0.00 0 1.04 0
  9. # Rows exam 30.88M 0 1.28M 26.48 0 3.07k 0
  10. # Time range 2010-07-19 14:35:53 to 2010-07-19 15:12:22
  11. # bytes 350.99M 5 1022.34k 301.01 719.66 5.75k 124.25
  12. # Bytes sen 1.94M 0 9.42k 1.67 0 110.38 0
  13. # Killed 0 0 0 0 0 0 0
  14. # Last errn 34.11M 0 1.55k 29.26 0 185.83 0
  15. # Merge pas 0 0 0 0 0 0 0
  16. # Rows affe 875.19k 0 17.55k 0.73 0.99 25.61 0.99
  17. # Rows read 2.20M 0 14.83k 1.88 1.96 24.68 1.96
  18. # Tmp disk 4.15k 0 1 0.00 0 0.06 0
  19. # Tmp table 14.19k 0 2 0.01 0 0.14 0
  20. # Tmp table 8.30G 0 2.01M 7.12k 0 117.75k 0
  21. # 0% (5k) Filesort
  22. # 0% (5k) Full_join
  23. # 0% (7k) Full_scan
  24. # 0% (10k) Tmp_table
  25. # 0% (4k) Tmp_table_on_disk

There is a lot of interesting you can find out from this header but in relation to replication capacity - you can get replication load, which is same as "concurrency" figure (0.37x) The concurrency as reported by mk-query-digest is sum of query execution time vs time range the log file covers. In this case as we know there is only one replication thread it will be same as replication load. This gives us replication capacity of 1/0.37 = 2.70

This method should work with original MySQL Server in theory, though I have not tested it. Some versions had log_slave_slow_statements unreliable and also you may need to adjust regular expression for finding users replication thread uses.

3) Processlist Pooling This method is simple - the Slave thread has different status in Show Processlist depending on if it processes query or simply waiting. By pooling processlist frequently (for example 10 times a second) we can compute the approximate percentage the thread was busy vs idle. Of course running processlist very aggressively can be an overhead especially if it is busy system with a lot of connections

SQL:
  1. mysql> SHOW processlist;
  2. +--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
  3. | Id | User | Host | db | Command | Time | State | Info |
  4. +--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
  5. | 801812 | system user | | NULL | Connect | 2665 | Waiting FOR master TO send event | NULL |
  6. | 801813 | system user | | NULL | Connect | 0 | Has READ ALL relay log; waiting FOR the slave I/O thread TO UPDATE it | NULL |
  7. | 802354 | root | localhost | NULL | Query | 0 | NULL | SHOW processlist |
  8. +--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
  9. 3 rows IN SET (0.00 sec)

4) Slave Catchup/Binlog Application method. We can just get the spare server with backups restored on it and apply binary log to it. If 1 hour worth of binary logs applies for 10 minutes we have replication capacity of 6. The challenge of course having spare server around and it is quite labor intensive. At the same time it can be good measurement to take during backup recovery trials when you're doing this activity anyway. Using this way you can also measure "cold" vs "hot" replication capacity as well as how long replication warmup takes. It is very typical for servers with cold cache to perform a lot slower then they are warmed up. Measuring times for each binary log separately should give you these numbers.

The less intrusive process which can be done in production (especially if you have slave which is used for backups/reporting etc) is to stop the replication for some time and when see how long it takes to catch up. If you paused replication for 10 minutes and it took 5 minutes to catch up your replication capacity will be 3 (not 2) because you not only had to process the events for outstanding 10 minutes but also for these 5 minutes it took to catch up. The formula is (Time_Replication_Paused+Time_Took_To_Catchup)/Time_Took_To_Catchup.

So how much of replication capacity do you need in the healthy system ? It depends a lot on many things including how fast do you need to be able to recover from backups and how much your load variance is. A lot of systems have special requirements on the time it takes to warmup too (there are different things you can do about it too). First I would measure replication capacity on 5 minute intervals (or something similar) because it tends to vary a lot. When I would suggest to ensure the loaded replication capacity is at least 3 during the peak load and 5 during the normal load. This applies to normal operational load - if you push heavy ALTER TABLE through replication they will surely get your replication capacity down for their duration.

One more thing about these methods - methods 1,2,3 work well only if replication capacity is above 1, so system is caught up. If it is less than 1, so the master writes more binary logs than slave can process they will show number close to 1. the method 4 however with work even if replication can't ever catch up - If 1 hour worth of binary logs takes 2 hours to apply, your replication capacity is 0.5.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Scaling: Consider both Size and Load

Июль 14th, 2010

So lets imagine you have the server handling 100.000 user accounts. You can see the CPU,IO and Network usage is below 10% of capacity – does it mean you can count on server being able to
handle 1.000.000 of accounts ? Not really, and there are few reasons why, I’ll name most important of them:

Contention – This is probably the most obvious one. MySQL (and systems in general) do not scale perfectly with numbers of CPUs and number of concurrent requests. Reduced efficiency of CPU cache, Mutex contention and database lock contention all come here. Some of them are preventable and can be reduced by code changes, such as there have been a lot of advanced in scalability of MySQL by improving locking code design, others, such as row level locks would require application changes to allow more concurrent process. The scalability numbers depend a lot on the system scale, software and workload.

Data size impact There are different type of applications out of there. Some (minority?) vary the load independently or almost independently of data size. You can think about Google search engine – the data size on which search operation is performed is constant, no matter if you’re serving 10 queries a day or a billion. True you probably would need to hold many copies of data to support high load but this is a scaling through copying not the amount of data you have in the system. Wikipedia is a similar case – the data size does not depend on the amount of readers, though writers contribute to the data size by creating new articles and increasing amount of versions in system. Applications such as Facebook, Flickr or Twitter have a very clear correlation between traffic and users.
Each registered user will in average have N MB of data stored in database, and traffic system is getting is somewhat proportional (though often not linear) to amount of users.

For systems of the first type the data size grows independently of traffic so it is fine to measure system capacity in Transactions per second. If system can handle twice amount of transactions per second it may be able to handle double the load. For systems of the second type you better use Transactions/Second/User or Transactions/Second/MB (which is similar measures as users in average have certain amount of data each). Doubling traffic for system of such type means handling twice amount of transactions on the twice amount of data.

Increasing amount of data is very serious implication for system performance. Some queries have relatively small impact (having LOG(N) scalability), others may have linear or even square complexity which
means increasing data size puts a very serious strain on the system. What is also very important and often forgotten is caching. Having twice amount of data means having half the cache – if you previously had 20% of data fits in memory, now it is only 10%. The impact of cache on performance is very application dependent as well and may vary from insignificant to dramatic.

You’re in the highest danger if you have very high portion of your database (or working) set fits in memory, hence having CPU bound workload. As your data growths you may frequently find load becoming CPU bound and hence things becoming 10x slower (or more) sometimes with very modest size increase. I’ve seen things slowing down about 10x from less than 50% increase in the data size.

I see the data size impact often omitted in “consolidation” tests – when you would get a new server and see it can handle 5x of the load of the old one, so you would consider you can put 5 “shards” on it.
5 shards surely come with 5x more data which you need to carefully take into account.

Design Limits This is the brother of contention but I decided to put it separately. There are more things than contention which can limit the perfect scalability. The Replication is perfect example in MySQL world. Slave executes replication stream in single thread which means it replication can’t scale for large amount of writes. The lack of parallel query execution is similar issue – you may have a lot of resources in terms of CPUs and disk but it can’t help to reduce response time of the single query.

Response Time Do not forget you do not only be able to handle number of users in terms of capacity you also need to have response time to be within certain range in majority of cases. Some may look at 99% response time, some 95% but neither the less you want users to get response fast. This means you can’t plan on loading system 100%. There is a nice paper by Cary Millsap explaining this in more details. Depending on the system and workload you may want to keep your system loaded no more than 80% in peak times, though applications which need to accommodate for larger traffic variance need to have a lot more spare room.

So in the end the math to scale your system may not be as straightforward as you think – you need to take number of things into account and I’d always suggest to confirm your modeling with benchmarks/performance evaluation if have a chance.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Color code your performance numbers

Июнь 2nd, 2010

When analyzing how good or bad response time is it is not handy to look at the averages, min or max times - something what is easily computed using built in aggregate functions. We most likely would like to see some percentile numbers - 95 percentile or 99 percentile. The problem is computing these in SQL is ugly and resource consuming. There is however a simple way to get similar data, looking at it from the different point of view.

When we're speaking about application we may not always care about exact value of response time but rather we want to see response time to be within certain range. For example if we define page feels good if response time is below 50ms it is not as important if response was 40ms or 44ms - it is much more important how frequently this goal was reached.

In fact I prefer to define 2 performance level. One what users would consider good performance, another what is acceptable, and basically the third level is unacceptable. We can also set classical color codes to them - green, yellow, red.

For example we can define good performance as response time for search pages within 1 second, acceptable as response within 3 seconds and anything over that will be considered unacceptable:

SQL:
  1. mysql> SELECT sum(wtime<1.0)/count(*)*100 green_pct, sum(wtime>=1.0 AND wtime<3.0)/count(*)*100 yellow_pct, sum(wtime>=3.)/count(*)*100 red_pct FROM performance_log_100601 WHERE page_type='search';
  2. +-----------+------------+---------+
  3. | green_pct | yellow_pct | red_pct |
  4. +-----------+------------+---------+
  5. |   99.6701 |     0.30060.0293 |
  6. +-----------+------------+---------+
  7. 1 row IN SET (24.87 sec)

These are great numbers to look after. In this case we can say system has "two nines" in terms of performance - it responds with what we consider "good" performance in over 99% cases. I consider this a lot more meaningful number for business than speaking about 99% response time, especially as it also clearly clarifies how bad the problem is. If your goal is 99% response time is 1sec and you're seeing it 2 seconds what you really do not know is how many people fall in this 1-sec to 2 sec range. It could be very small number and so your performance is not that bad or it could be majority of response times are withing 1 to 2 sec range.

Next thing I would add is the average response time for our good requests. This eliminates slow outliers which we just count by their number and allows to see how good our performance really is. I consider it secondary value as our major goal is eliminating variance as defined by Six Sigma practices. However given variance is the same better average response time for good requests helps us to track performance changes closely and act before we get large number of yellow or red requests:

SQL:
  1. mysql> SELECT avg(IF(wtime<1.0,wtime,NULL)) avg_green,sum(wtime<1.0)/count(*)*100 green_pct, sum(wtime>=1.0 AND wtime<3.0)/count(*)*100 yellow_pct, sum(wtime>=3.0)/count(*)*100 red_pct FROM performance_log_100601 WHERE page_type='search';
  2. +-------------------+-----------+------------+---------+
  3. | avg_green         | green_pct | yellow_pct | red_pct |
  4. +-------------------+-----------+------------+---------+
  5. | 0.448726461517477 |   99.6701 |     0.30060.0293 |
  6. +-------------------+-----------+------------+---------+
  7. 1 row IN SET (6.84 sec)

Finally I find it helpful to track it among time period. For example in this case we're looking for data grouped by 15 min interval though you may be looking at different intervals too:

SQL:
  1. mysql> SELECT from_unixtime(unix_timestamp(logged) div 900 * 900) period, avg(IF(wtime<1.0,wtime,NULL)) avg_green,sum(wtime<1.0)/count(*)*100 green_pct, sum(wtime>=1.0 AND wtime<3.0)/count(*)*100 yellow_pct, sum(wtime>=3.0)/count(*)*100 red_pct FROM performance_log_100601 WHERE page_type='search' GROUP BY 1;
  2. +---------------------+-------------------+-----------+------------+---------+
  3. | period              | avg_green         | green_pct | yellow_pct | red_pct |
  4. +---------------------+-------------------+-----------+------------+---------+
  5. | 2010-05-31 20:00:00 | 0.454738099705166100.0000 |     0.00000.0000 |
  6. | 2010-05-31 20:15:00 | 0.477789820622135100.0000 |     0.00000.0000 |
  7. | 2010-05-31 20:30:00 | 0.463134796179216100.0000 |     0.00000.0000 |
  8. | 2010-05-31 20:45:00 | 0.397884420995359 |   99.4595 |     0.54050.0000 |
  9. | 2010-05-31 21:00:00 | 0.438939880065877100.0000 |     0.00000.0000 |
  10. | 2010-05-31 21:15:000.48698966072074100.0000 |     0.00000.0000 |
  11. | 2010-05-31 21:30:00 | 0.432073892663709 |   93.6364 |     6.36360.0000 |
  12. | 2010-05-31 21:45:00 | 0.483174517098134100.0000 |     0.00000.0000 |
  13. | 2010-05-31 22:00:00 | 0.449507013872618100.0000 |     0.00000.0000 |
  14. | 2010-05-31 22:15:00 | 0.327130372130607100.0000 |     0.00000.0000 |
  15. | 2010-05-31 22:30:00 | 0.431361319540545100.0000 |     0.00000.0000 |
  16. | 2010-05-31 22:45:00 | 0.480885540467242100.0000 |     0.00000.0000 |

Note response time does not always correlates with number of bad requests. For example we can see at one interval there were over 6% requests in yellow but average response time for good requests actually was better than in the other intervals.

The queries which produce this data may not be overly beautiful but they are pretty fast requiring simple scan of the data range.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Color code your performance numbers

Июнь 2nd, 2010

When analyzing how good or bad response time is it is not handy to look at the averages, min or max times - something what is easily computed using built in aggregate functions. We most likely would like to see some percentile numbers - 95 percentile or 99 percentile. The problem is computing these in SQL is ugly and resource consuming. There is however a simple way to get similar data, looking at it from the different point of view.

When we're speaking about application we may not always care about exact value of response time but rather we want to see response time to be within certain range. For example if we define page feels good if response time is below 50ms it is not as important if response was 40ms or 44ms - it is much more important how frequently this goal was reached.

In fact I prefer to define 2 performance level. One what users would consider good performance, another what is acceptable, and basically the third level is unacceptable. We can also set classical color codes to them - green, yellow, red.

For example we can define good performance as response time for search pages within 1 second, acceptable as response within 3 seconds and anything over that will be considered unacceptable:

SQL:
  1. mysql> SELECT sum(wtime<1.0)/count(*)*100 green_pct, sum(wtime>=1.0 AND wtime<3.0)/count(*)*100 yellow_pct, sum(wtime>=3.)/count(*)*100 red_pct FROM performance_log_100601 WHERE page_type='search';
  2. +-----------+------------+---------+
  3. | green_pct | yellow_pct | red_pct |
  4. +-----------+------------+---------+
  5. |   99.6701 |     0.30060.0293 |
  6. +-----------+------------+---------+
  7. 1 row IN SET (24.87 sec)

These are great numbers to look after. In this case we can say system has "two nines" in terms of performance - it responds with what we consider "good" performance in over 99% cases. I consider this a lot more meaningful number for business than speaking about 99% response time, especially as it also clearly clarifies how bad the problem is. If your goal is 99% response time is 1sec and you're seeing it 2 seconds what you really do not know is how many people fall in this 1-sec to 2 sec range. It could be very small number and so your performance is not that bad or it could be majority of response times are withing 1 to 2 sec range.

Next thing I would add is the average response time for our good requests. This eliminates slow outliers which we just count by their number and allows to see how good our performance really is. I consider it secondary value as our major goal is eliminating variance as defined by Six Sigma practices. However given variance is the same better average response time for good requests helps us to track performance changes closely and act before we get large number of yellow or red requests:

SQL:
  1. mysql> SELECT avg(IF(wtime<1.0,wtime,NULL)) avg_green,sum(wtime<1.0)/count(*)*100 green_pct, sum(wtime>=1.0 AND wtime<3.0)/count(*)*100 yellow_pct, sum(wtime>=3.0)/count(*)*100 red_pct FROM performance_log_100601 WHERE page_type='search';
  2. +-------------------+-----------+------------+---------+
  3. | avg_green         | green_pct | yellow_pct | red_pct |
  4. +-------------------+-----------+------------+---------+
  5. | 0.448726461517477 |   99.6701 |     0.30060.0293 |
  6. +-------------------+-----------+------------+---------+
  7. 1 row IN SET (6.84 sec)

Finally I find it helpful to track it among time period. For example in this case we're looking for data grouped by 15 min interval though you may be looking at different intervals too:

SQL:
  1. mysql> SELECT from_unixtime(unix_timestamp(logged) div 900 * 900) period, avg(IF(wtime<1.0,wtime,NULL)) avg_green,sum(wtime<1.0)/count(*)*100 green_pct, sum(wtime>=1.0 AND wtime<3.0)/count(*)*100 yellow_pct, sum(wtime>=3.0)/count(*)*100 red_pct FROM performance_log_100601 WHERE page_type='search' GROUP BY 1;
  2. +---------------------+-------------------+-----------+------------+---------+
  3. | period              | avg_green         | green_pct | yellow_pct | red_pct |
  4. +---------------------+-------------------+-----------+------------+---------+
  5. | 2010-05-31 20:00:00 | 0.454738099705166100.0000 |     0.00000.0000 |
  6. | 2010-05-31 20:15:00 | 0.477789820622135100.0000 |     0.00000.0000 |
  7. | 2010-05-31 20:30:00 | 0.463134796179216100.0000 |     0.00000.0000 |
  8. | 2010-05-31 20:45:00 | 0.397884420995359 |   99.4595 |     0.54050.0000 |
  9. | 2010-05-31 21:00:00 | 0.438939880065877100.0000 |     0.00000.0000 |
  10. | 2010-05-31 21:15:000.48698966072074100.0000 |     0.00000.0000 |
  11. | 2010-05-31 21:30:00 | 0.432073892663709 |   93.6364 |     6.36360.0000 |
  12. | 2010-05-31 21:45:00 | 0.483174517098134100.0000 |     0.00000.0000 |
  13. | 2010-05-31 22:00:00 | 0.449507013872618100.0000 |     0.00000.0000 |
  14. | 2010-05-31 22:15:00 | 0.327130372130607100.0000 |     0.00000.0000 |
  15. | 2010-05-31 22:30:00 | 0.431361319540545100.0000 |     0.00000.0000 |
  16. | 2010-05-31 22:45:00 | 0.480885540467242100.0000 |     0.00000.0000 |

Note response time does not always correlates with number of bad requests. For example we can see at one interval there were over 6% requests in yellow but average response time for good requests actually was better than in the other intervals.

The queries which produce this data may not be overly beautiful but they are pretty fast requiring simple scan of the data range.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

On Good Instrumentation

Июнь 1st, 2010

In so many cases troubleshooting applications I keep thinking how much more efficient things could be going if only there would be a good instrumentation available. Most of applications out there have very little code to help understand what is going on and if it is there it is frequently looking at some metrics which are not very helpful.

If you look at the system from bird eye view – system needs to process transactions and you want it to successfully complete large number of transactions it gets (this is what called availability) and we want it to serve them with certain response time, which is what is called performance. There could be many variables in environment which change – load, number of concurrent users, database, the way users use the system but in the nutshell all what you really care is having predictable response time within certain range. So if we care about response time – this is exactly what our instrumentation should measure

Response Time Summary We want to understand where exactly response time comes from. For example if we define transaction as the time it took to generate HTML page we want to understand how much time was spent waiting on the database, memcache, other external services, as well as how much CPU time it consumed.

Now what is important we need this information for individual transactions. It may be every transaction which is best and easily achievable for small-medium systems or at least for large enough sample. It is very important this information is available for individual transactions not the average. Average is useless because 100 transactions taking 1 sec and 99 transactions having 1ms and 1 taking 99.1 sec will have the same average while for sake of performance analyzes these are completely different. When you have transaction sample make sure it contains fair population of transaction – getting only transactions which are slow is not helpful as we might want to compare them to the fast transactions to understand why they are slow.

What kind of components do you need to have in response time summary – all components which are significant enough. If your instrumentation has 95% of response time unaccounted for it is useless. You also want blocks not to mix apples with oranges. For example “mysql and memcache” block would not be helpful. Even further I would prefer to split “mysql time” in the “connect time” and “query time” as there are situations when one but not other would be affected.

In is important for response time summary stored in the logs which are easy to query so you can analyze data in a lot of various ways. Sometimes you may find the response time is impacted by queries from certain user, in others it may be attributed to different application/web server.

The goal for Response Time Summary is to quickly point direction where problem happens. Whenever you have spike in response time or it is bad response time for certain kind of request you can quickly understand where does it come from ? Is it wasted CPU time Slow response from MySQL or Memcache.

I also like to see numbers of calls stored together with attributed response time. For example I’d like to see number of mysql calls in addition to MySQL response time. This helps to understand if it is the issue with number of queries or their performance. If I see 2 queries taking 30 seconds it is clearly slow queries. If it is 10.000 queries executed and total response time is 4 sec I know it is pretty much as good as it gets with standard Ethernet network and finding a ways to reduce number of queries is going to be more helpful.

The Glue Our applications involve multiple layers and typically higher layer can only report response time it took to call lower layer, but not the reason for that response time. For example we can report time it took to execute SQL query from PHP application side, but we can’t say why it has taken so long. Was it row level lock ? waiting on disk IO or was it simply question of burning a lot of CPU. On the other hand this information may be available in the instrumentation stats from that lower layer – for example in MySQL Query Log. What is important is however to be able to connect the data from these logs – glue them together. The easy way to do it is to provide an unique identifier to all requests and put it in the logs with request of the lower levels. With mySQL the simple way to do it is to put it in the comments for queries you execute.

Optional Tracing The information in lower layers logs is very helpful however it typically have two problems. First not every layer has good logs. For example if you’re running memcached you probably do not have the logs detailing all requests and their response time. Second – the lower layer may only know response time from its vantage point, which in many cases does not include network communication time which can be very important.

Tracing should be optional and normally applied to the small sample of requests, though it needs to be detailed. Typically you would include the calls to the lower level services together with timestamp, the response summary with timestamp again. The information about request has to be complete enough to identify target action and response completely. For example if I’m speaking about memcache I’d like to know which server:port request was issued to which key was requested, and on response I’d like to know if it was hit, miss or error.

The way I use it may be as follows. I see the increased response time for given kind of request. I see response time is coming from MySQL. I check the number of MySQL Queries and it is 5x when it usually is for this kind of request. Looking at memcache stats I can see high number of misses. Looking at some available traces shows the server memcache01 has very high miss ratio. Checking what is going on with memcache01 shows it just was restarted (and hence has almost empty cache). This is important example as it shows your increased response time from MySQL may not have anything to do with MySQL itself but you would not know unless you’re capturing the right data.

If you’re looking for nice example framework for instrumentation, check out instrumentation for PHP – It has everything mentioned by tracing which is trivial to add.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

On Good Instrumentation

Июнь 1st, 2010

In so many cases troubleshooting applications I keep thinking how much more efficient things could be going if only there would be a good instrumentation available. Most of applications out there have very little code to help understand what is going on and if it is there it is frequently looking at some metrics which are not very helpful.

If you look at the system from bird eye view – system needs to process transactions and you want it to successfully complete large number of transactions it gets (this is what called availability) and we want it to serve them with certain response time, which is what is called performance. There could be many variables in environment which change – load, number of concurrent users, database, the way users use the system but in the nutshell all what you really care is having predictable response time within certain range. So if we care about response time – this is exactly what our instrumentation should measure

Response Time Summary We want to understand where exactly response time comes from. For example if we define transaction as the time it took to generate HTML page we want to understand how much time was spent waiting on the database, memcache, other external services, as well as how much CPU time it consumed.

Now what is important we need this information for individual transactions. It may be every transaction which is best and easily achievable for small-medium systems or at least for large enough sample. It is very important this information is available for individual transactions not the average. Average is useless because 100 transactions taking 1 sec and 99 transactions having 1ms and 1 taking 99.1 sec will have the same average while for sake of performance analyzes these are completely different. When you have transaction sample make sure it contains fair population of transaction – getting only transactions which are slow is not helpful as we might want to compare them to the fast transactions to understand why they are slow.

What kind of components do you need to have in response time summary – all components which are significant enough. If your instrumentation has 95% of response time unaccounted for it is useless. You also want blocks not to mix apples with oranges. For example “mysql and memcache” block would not be helpful. Even further I would prefer to split “mysql time” in the “connect time” and “query time” as there are situations when one but not other would be affected.

In is important for response time summary stored in the logs which are easy to query so you can analyze data in a lot of various ways. Sometimes you may find the response time is impacted by queries from certain user, in others it may be attributed to different application/web server.

The goal for Response Time Summary is to quickly point direction where problem happens. Whenever you have spike in response time or it is bad response time for certain kind of request you can quickly understand where does it come from ? Is it wasted CPU time Slow response from MySQL or Memcache.

I also like to see numbers of calls stored together with attributed response time. For example I’d like to see number of mysql calls in addition to MySQL response time. This helps to understand if it is the issue with number of queries or their performance. If I see 2 queries taking 30 seconds it is clearly slow queries. If it is 10.000 queries executed and total response time is 4 sec I know it is pretty much as good as it gets with standard Ethernet network and finding a ways to reduce number of queries is going to be more helpful.

The Glue Our applications involve multiple layers and typically higher layer can only report response time it took to call lower layer, but not the reason for that response time. For example we can report time it took to execute SQL query from PHP application side, but we can’t say why it has taken so long. Was it row level lock ? waiting on disk IO or was it simply question of burning a lot of CPU. On the other hand this information may be available in the instrumentation stats from that lower layer – for example in MySQL Query Log. What is important is however to be able to connect the data from these logs – glue them together. The easy way to do it is to provide an unique identifier to all requests and put it in the logs with request of the lower levels. With mySQL the simple way to do it is to put it in the comments for queries you execute.

Optional Tracing The information in lower layers logs is very helpful however it typically have two problems. First not every layer has good logs. For example if you’re running memcached you probably do not have the logs detailing all requests and their response time. Second – the lower layer may only know response time from its vantage point, which in many cases does not include network communication time which can be very important.

Tracing should be optional and normally applied to the small sample of requests, though it needs to be detailed. Typically you would include the calls to the lower level services together with timestamp, the response summary with timestamp again. The information about request has to be complete enough to identify target action and response completely. For example if I’m speaking about memcache I’d like to know which server:port request was issued to which key was requested, and on response I’d like to know if it was hit, miss or error.

The way I use it may be as follows. I see the increased response time for given kind of request. I see response time is coming from MySQL. I check the number of MySQL Queries and it is 5x when it usually is for this kind of request. Looking at memcache stats I can see high number of misses. Looking at some available traces shows the server memcache01 has very high miss ratio. Checking what is going on with memcache01 shows it just was restarted (and hence has almost empty cache). This is important example as it shows your increased response time from MySQL may not have anything to do with MySQL itself but you would not know unless you’re capturing the right data.

If you’re looking for nice example framework for instrumentation, check out instrumentation for PHP – It has everything mentioned by tracing which is trivial to add.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Should I buy a Fast SSD or more memory?

Апрель 6th, 2010

While a scale-out solution has traditionally been popular for MySQL, it’s interesting to see what room we now have to scale up – cheap memory, fast storage, better power efficiency.  There certainly are a lot of options now – I’ve been meeting about a customer/week using Fusion-IO cards.  One interesting choice I’ve seen people make however, is buying an SSD when they still have a lot of pages read/second – I would have preferred to buy memory instead, and use the storage device for writes.

Here’s the benchmark I came up with to confirm if this is the case:

  • Percona-XtraDB-9.1 release
  • Sysbench OLTP workload with 80 million rows (about 18GB worth of data+indexes)
  • XFS Filesystem mounted with nobarrier option.
  • Tests run with:
    • RAID10 with BBU over 8 disks
    • Intel SSD X25-E 32GB
    • FusionIO 320GB MLC
  • For each test, run with a buffer pool of between 2G and 22G (to test performance compared to memory fit).
  • Hardware was our Dell 900 (specs here).

To start with, we have a test on the RAID10 storage to establish a baseline.  The Y axis is transactions/second (more is better), the X axis is the size of innodb_buffer_pool_size:

Let me point out three interesting characteristics about this benchmark:

  • The A arrow is when data fits completely in the buffer pool (best performance). It’s important to point out that once you hit this point, a further increase in memory at all.
  • The B arrow is where the data just started to exceed the size of the buffer pool.  This is the most painful point for many customers – because while memory decreased by only ~10% the performance dropped by 2.6 times!  In production this usually matches the description of “Last week everything was fine.. but it’s just getting slower and slower!”.  I would suggest that adding memory is by far the best thing to do here.
  • The C arrow shows where data is approximately three times the buffer pool.  This is an interesting point to zoom in on – since you may not be able to justify the cost of the memory, but an SSD might be a good fit:

Where the C arrow was, in this graph a Fusion-IO card improves performance by about five times (or 2x with an Intel SSD).  To get the same improvement with memory, you would have needed to add 60% more memory -or- 260% more memory for a 5x improvement.  Imagine a situation where your C point is when you have 32GB of RAM and 100GB of data.  Than it gets interesting:

  • Can you easily add another 32G RAM (are your memory slots already filled?)
  • Does your budget allow to install SSD cards? (You may still need more than one, since they are all relatively small.  There are already appliances on the market which use 8 Intel SSD devices).
  • Is a 2x or 5x improvement enough?  There are more wins to be had if you can afford to buy all the memory that is required.

The workload here is designed to keep as much of the data hot as possible, but I guess the main lesson here is not to underestimate the size of your “active set” of data.  For some people who just append data to some sort of logging table it may only need to be a small percentage – but in other cases it can be considerably higher.  If you don’t know what your working set is – ask us!

Important note: This graph and these results are valid only for sysbench uniform. In your particular workload the points B and C may be located in differently.

Raw results:

Buffer pool, GB FusionIO Intel SSD RAID 10
2 450.3 186.33 80.67
4 538.19 230.35 99.73
6 608.15 268.18 121.71
8 679.44 324.03 201.74
10 769.44 407.56 252.84
12 855.89 511.49 324.38
14 976.74 664.38 429.15
16 1127.23 836.17 579.29
18 1471.98 1236.9 934.78
20 2536.16 2485.63 2486.88
22 2433.13 2492.06 2448.88

Entry posted by Vadim | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Missleading Innodb message on recovery

Февраль 16th, 2010

As I wrote about 2 years ago the feature of Innodb to store copy of master’s position in Slave’s Innodb tablespace got broken. There is a lot of discussions at the corresponding bug report while outcome of the fix remained uncertain for me (the bug is market duplicate while the bugs it seems to be duplicate for describe different issues).
Anyway. The customer came to me today having the following message in the error log after Slave crash while running MySQL 5.1.41 (extra virgin version without XtraDB or Percona Patches)

InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 10000000, file name mysql-bin.005000
InnoDB: Last MySQL binlog file position 0 20000000, file name ./mysql-bin.003000

He has tried to restart replication from mysql-bin.005000 position 10000000 which failed with “Could not find first log file name in binary log index file” error message

Looking at the Master for this slave I could see its binary log files going only to about 2000, so the binary log file mentioned did not ever exist on this master. What is going on ?
The thing is Innodb does not update this information any more however if it is stored in the tablespace the code is still there to print it. This database was running older MySQL version a while back which was updated to MySQL 5.1 months ago, moved to the new hardware by physical copy and the log file numbers restarted back from 1 but tablespace still contained the ancient data.

I reported the bug on this which should be easy to fix. Otherwise it is easy mistake to make. We also have a patch which restores this information and uses it on slave crash recovery.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Why Swapping is bad for MySQL Performance ?

Январь 19th, 2010

So you get MySQL or other applications using too much memory on the box or OS behaving funny and using more memory for cache and pushing application to the swap. This causes swapping and causes performance problems. This much is obvious. But how bad is it ? Should you count it same as normal Disk IO as the box is having or is it worse than that ?

Swapping is going to impact your performance more than just normal IO and here are 3 reasons why. If you know more please let me know, for my taste these 3 are bad enough so I have not been looking for more.

Cache in the Swap File will multiply IO compared to just having less cache. What happens when page in cache is replaced which is swapped out itself ? First you have to find space to swap in the page (we’re speaking about memory pressure right?) which means swapping out some page. This would normally happen in background but still it has to be done. When the pages is swapped in which is second IO and finally you get page being cached read from the hard drive to the cache. This gives us 3 IOs instead of one. Nice :)
Flushing dirty pages or even discarding the page will cause extra IO slowing things down.

Skewing up all Algorithms The database internals algorithms are tuned for things being in memory and if they start dealing with data which is on disk they just often stop working with any reasonable level of efficiently – when database deals with on disk data it often uses different set of algorithms which are optimized to limit number of IOs or make them more sequential. Most of these were designed before SSD era. For example Insert Buffer in MySQL makes a special effort to avoid (delay) IO. If it happens to go to swap file it will more than defeat its purpose. Number of background threads are designed with assumptions they can check page statuses in buffer pool very efficiently which again stops working as soon as page accesses cause disk IO.

Escalated Locking/Latching If breaking of internal operations is not bad enough on its own lets see what swapping does for concurrent (multi CPU, multi client) processing. Database Locks/Latches are typically designed to be held for as short time as possible. The less portion of execution time thread spends holding exclusive locks the better system will scale. It is a big no-no to hold any critical locks while you’re doing disk IO as IOs take a long time. When system is swapping all of these gets messed up – when database is thinking it is taking the locks for few instructions only it can be a long while while IO completes – if this is critical lock it is possible to see everything in the system waiting on this IO, even transactions which work with data which is not in the swap file.

The bottom line: You should configure system so no swapping activity is going on during normal operations. The swap file itself may be justified – if you have some unexpected memory consumption spike you may prefer to see slowdown instead of MySQL being killed because of out of memory but do react on them promptly and do not treat such situation as normal.


Entry posted by peter | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Production scripts: sign me up for obfuscation

Январь 2nd, 2010
If there’s one thing that will always make me angry, it’s people that should not be editing my code going and editing my code. If you want to change something on the server and you have sudo privileges please let the real admin know beforehand. I don’t mind people improving processes or scripts but if [...]
PlanetMySQL Voting: Vote UP / Vote DOWN