Archive for the ‘InnoDB performance’ Category

The relationship between Innodb Log checkpointing and dirty Buffer pool pages

Февраль 18th, 2012

This is a time-honored topic, and there’s no shortage of articles on the topic on this blog. I wanted to write a post trying to condense and clarify those posts, as it has taken me a while to really understand this relationship.

Some basic facts

  • Most of us know that writing into Innodb updates buffer pool pages in memory and records page operations in the transaction (redo) log.
  • Behind the scenes those updated (dirty) buffer pool pages are flushed down the to the tablespace.
  • If Innodb stops (read: crashes) with dirty buffer pool pages, Innodb recovery must be done to rebuild the last consistent picture of the database.
  • Recovery uses the transaction log by redoing (hence the name ‘redo log’) the page operations in the log that had not already been flushed to the tablespaces.

Ultimately this mechanism was an optimization for slow drives:  if you can sequentially write all the changes into a log, it will be faster to do on the fly as transactions come in than trying to randomly write the changes across the tablespaces.  Sequential IO trumps Random IO.

However, even today in our modern flash storage world where random IO is significantly less expensive (from a latency perspective, not dollars), this is still an optimization because the longer we delay updating the tablespace, the more IOPs we can potentially conserve, condense, merge, etc.  This is because:

  • The same row may be written multiple times before the page is flushed
  • Multiple rows within the same page can be written before the page is flushed

Innodb Log Checkpointing

So, first of all, what can we see about Innodb log checkpointing and what does it tell us?

mysql> SHOW ENGINE INNODB STATUS\G
---
LOG
---
Log sequence number 9682004056
Log flushed up to   9682004056
Last checkpoint at  9682002296

This shows us the virtual head of our log (Log sequence Number), the last place the log was flushed to disk (Log flushed up to), and our last Checkpoint.  The LSN grows forever, while the actual locations inside the transaction logs are reused in a circular fashion.    Based on these numbers, we can determine how many bytes back in the transaction log our oldest uncheckpointed transaction is by subtracting our ‘Log sequence number’ from the ‘Last checkpoint at’ value.  More on what a Checkpoint is in a minute.    If you use Percona server, it does the math for you by including some more output:

---
LOG
---
Log sequence number 9682004056
Log flushed up to   9682004056
Last checkpoint at  9682002296
Max checkpoint age    108005254
Checkpoint age target 104630090
Modified age          1760
Checkpoint age        1760

Probably most interesting here is the Checkpoint age, which is the subtraction I described above.  I think of the Max checkpoint age as roughly the furthest back Innodb will allow us to go in the transaction logs; our Checkpoint age cannot exceed this without blocking client operations in Innodb to flush dirty buffers.  Max checkpoint age appears to be approximately 80% of the total number of bytes in all the transaction logs, but I’m unsure if that’s always the case.

Remember our transaction logs are circular, and the checkpoint age represents how far back the oldest unflushed transaction is in the log.  We cannot overwrite that without potentially losing data on a crash, so Innodb does not permit such an operation and will block incoming writes until the space is available to continue (safely) writing in the log.

Dirty Buffer Pool Pages

On the other side, we have dirty buffers.  These two numbers are relevant from the BUFFER POOL AND MEMORY section of SHOW ENGINE INNODB STATUS:

Database pages          65530
...
Modified db pages       3

So we have 3 pages that have modified data in them, and that (in this case) is a very small percentage of the total buffer pool.  A page in Innodb contains rows, indexes, etc., while a transaction may modify 1 or millions of rows.  Also realize that a single modified page in the buffer pool may contain modified data from multiple transactions in the transaction log.

As I said before, dirty pages are flushed to disk in the background.  The order in which they are flushed really has little to nothing to do with the transaction they are associated with, nor with the position associated with their modification in the transaction log.    The effect of this is that as the thread managing the dirty page flushing goes about its business, it is not necessarily flushing to optimize the Checkpoint age, it is flushing to try to optimize IO and to obey the LRU in the buffer pool.

Since buffers can and will be flushed out of order, it may be the case that there are a lot of transactions in the transaction log that are fully flushed to disk (i.e., all the pages associated with said transaction are clean), but there still could be  older transactions that are not flushed.  This, in essence, is what fuzzy checkpointing is.

The checkpoint process is really a logical operation.  It occasionally  (as chunks of dirty pages get flushed) has a look through the dirty pages in the buffer pool to find the one with the oldest LSN, and that’s the Checkpoint.  Everything older must be fully flushed.

The main reason this is important is if the Checkpoint Age is not a factor in dirty buffer flushing, it can get too big and cause stalls in client operations:  the algorithm that decides which dirty pages to flush does not optimize for this [well] and sometimes it is not good enough on its own.

So, how can we optimize here?  The short of it is: make innodb flush more dirty pages.  However, I can’t help but wonder if some tweaks could be made to the page flushing algorithm to be more effective there in choosing older dirty pages.   It is clear how that algorithm works without reading the source code.

There are a lot of ways to tune this, here is a list of the most signficant, roughly ordered from oldest to newest, and simultaneously listed from least effective to most effective:

  • innodb_max_dirty_pages_pct:  This attempts to keep the percentage of dirty pages under control, and before the Innodb plugin this was really the only way to tune dirty buffer flushing.  However, I have seen servers with 3% dirty buffers and they are hitting their max checkpoint age.  The way this increases dirty buffer flushing also doesn’t scale well on high io subsystems, it effectively just doubles the dirty buffer flushing per second when the % dirty pages exceeds this amount.
  • innodb_io_capacity: This setting, in spite of all our grand hopes that it would allow Innodb to make better use of our IO in all operations, simply controls the amount of dirty page flushing per second (and other background tasks like read-ahead).  Make this bigger, you flush more per second.  This does not adapt, it simply does that many iops every second if there are dirty buffers to flush.  It will effectively eliminate any optimization of IO consolidation if you have a low enough write workload (that is, dirty pages get flushed almost immediately, we might be better off without a transaction log in this case).  It also can quickly starve data reads and writes to the transaction log if you set this too high.
  • innodb_write_io_threads: Controls how many threads will have writes in progress to the disk.   I’m not sure why this is still useful if you can use Linux native AIO.  These can also be rendered useless by filesystems that don’t allow parallel writing to the same file by more than one thread (particularly if you have relatively few tables and/or use the global tablespaces) *cough ext3 cough*.
  • innodb_adaptive_flushing: An Innodb plugin/5.5 setting that tries to be smarter about flushing more aggressively based on the number of dirty pages and the rate of transaction log growth.
  • innodb_adaptive_flushing_method:  (Percona Server only)  This adds a few new different algorithms, but the more effective ones adjust the amount of dirty page flushing using a formula that considers the Checkpoint age and the Checkpoint age target (something you can set in Percona Server, otherwise it is effectively just a hair under the Max Checkpoint age).  The two main methods here would be ‘estimate’ (good default) and ‘keep_average’ designed for SSD type storage.  Running Percona Server and using this method is my go-to setting for managing Checkpoint age.
To be clear, the best we have today (IMO) is using innodb_adaptive_flushing_method in Percona server.
In any case, if you run any kind of production MySQL server, you should be monitoring your Checkpoint age and your Innodb dirty pages and try to see the relationship with those values and your write operations on disk.  The additional controls in 5.5 and Percona server are excellent reasons to consider an upgrade.

PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL 5.5 – A Community Winner

Апрель 16th, 2010

Ever since MySQL 5.5 beta has been announced by Edward Screven, Oracle’s chief corporate architect; there is lot of positive buzz (here, here, …) about the performance and scalability improvements added in this release. We should all be thankful to Michael Ronstrom (as most of the key developers are already working on different forks), who did a great job in the improvements especially scalability related that allows to scale beyond 16 cores by improving the performance by 2-5X in most common workloads. Not to forget about numerous improvements to replication by replication team.

Even though 5.5 has lot of new improvements officially from Sun/Oracle; but some of the changes are actually driven by community (yet another thanks to Google, Mark Callaghan and his team, Percona and his team, Facebook etc) and most of the ideas or patches were already floating for a while and they were used in the production as well (5.0 or 5.1). This is actually a good sign that community can look forward for 5.5 GA instead of worrying about what patches and builds to use.

This is a clear indication that 5.5 performance and scalability improvements were actually driven by community.

Key improvements in 5.5:

  1. InnoDB changes in 1.1
    • Multiple buffer pools (controlled by innodb_buffer_pool_instances)
    • Multiple rollback segments
    • Splitting of purge operation from main background thread (controlled by innodb_purge_threads)
    • New log_buf mutex now controls the mini transaction writes in buffer pool instead of shared log_sys, reduces the contention on buffer pool
    • Separate mutex for flush list handling, reduces the contention on buffer pool
    • Improved recovery time
  2. Rest of the changes as part of InnoDB plugin 1.0.x
  3. Numerous replication related changes

Even though they announced InnoDB as the default storage engine in 5.5; but the latest build still has MyISAM as the default

Server version: 5.5.4-m3 MySQL Community Server (GPL)
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
mysql> show engines;
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine             | Support | Comment                                                        | Transactions | XA   | Savepoints |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| InnoDB             | YES     | Supports transactions, row-level locking, and foreign keys     | YES          | YES  | YES        |
| MRG_MYISAM         | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| MEMORY             | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| BLACKHOLE          | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| CSV                | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| MyISAM             | DEFAULT | Default engine as of MySQL 3.23 with great performance         | NO           | NO   | NO         |
| ARCHIVE            | YES     | Archive storage engine                                         | NO           | NO   | NO         |
| FEDERATED          | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
| PERFORMANCE_SCHEMA | YES     | Performance Schema                                             | NO           | NO   | NO         |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
9 rows in set (0.00 sec)

PlanetMySQL Voting: Vote UP / Vote DOWN