Proofs of Concept
Last week at the MySQL conference in Santa Clara, I presented some slides on our work towards group commit on the MySQL binlog. We examined the effects of not holding the prepare_commit_mutex across the binlog fsync, combining this with a timed wait on a condition variable to enable binlog group commits, and then we explored the effect of releasing row locks during the prepare step instead of in the commit step. The proof-of-concept code I worked on up to that point made me quite familiar with the parts of the MySQL codebase that these changes are in. Energized by the MySQL conference, I got to work on writing some production-quality patches that are taking Facebook's MySQL 5.1 towards the performance gains we discussed at the conference.
Patches for Production
The first step was to set up some new performance monitoring so we could understand the effects of the changes as we made them. The first patch keeps track of the number of binlog fsyncs MySQL performs and uses Ryan Mack's fast timers to count the total time spent by binlog fsyncs. It also keeps track of grouped fsyncs, but that counter hasn't been incremented yet. Next, I added a variable to control whether prepare_commit_mutex is locked in innobase_xa_prepare(). This variable can be set in my.cnf or changed dynamically by issuing the command SET GLOBAL innodb_prepare_commit_mutex=[1|0];. This variable defaults to 1 to maintain the same default behavior.
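The fsync counters described above could look roughly like the following Python sketch. It is an illustration only, not the patch itself: the class and function names are hypothetical, time.perf_counter() stands in for the fast timers, and a temporary file stands in for the binlog.

```python
import os
import tempfile
import time


class BinlogFsyncStats:
    """Hypothetical counters modeled on the ones described in the text."""

    def __init__(self):
        self.fsync_count = 0          # number of binlog fsyncs performed
        self.fsync_seconds = 0.0      # total wall-clock time spent in fsync
        self.grouped_fsync_count = 0  # reserved for group commit; never incremented here

    def timed_fsync(self, fd):
        # Assumption: perf_counter() is a stand-in for the patch's fast timers.
        start = time.perf_counter()
        os.fsync(fd)
        self.fsync_seconds += time.perf_counter() - start
        self.fsync_count += 1


def demo_fsync_stats(n_events=3):
    """Write a few fake events to a temporary file, fsyncing after each one."""
    stats = BinlogFsyncStats()
    with tempfile.TemporaryFile() as fake_binlog:
        for i in range(n_events):
            fake_binlog.write(b"event %d\n" % i)
            fake_binlog.flush()  # push Python's buffer to the OS before fsync
            stats.timed_fsync(fake_binlog.fileno())
    return stats
```

Dividing fsync_seconds by fsync_count gives the average cost of one binlog fsync, which is the number group commit is meant to amortize across transactions.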
When prepare_commit_mutex is not taken, the binlog and innodb can have different commit orders. Since we are still holding row locks, this won't break replication -- yet. However, our plan is to release row locks before writing to the binlog. If we do this, how can we ensure replication remains correct?
The Importance of Order
Consider two transactions on an innodb table test (k INT PRIMARY KEY, a INT, b INT, c INT):
-- Transaction 1
BEGIN;
UPDATE test SET a=a+1 WHERE k=1;
UPDATE test SET b=a WHERE k=1;
COMMIT;
-- Transaction 2
BEGIN;
UPDATE test SET a=a+1 WHERE k=1;
UPDATE test SET c=a WHERE k=1;
COMMIT;
If these occur in different orders on the master and slave, b and c will have different values on each, and replication will have silently broken. Once we release row locks before the binlog write, transactions that rely on each other can get out of order. To prevent this, but still reap the concurrency benefits of not holding the prepare_commit_mutex, our proof-of-concept used innodb's UT_LIST linked lists as queues to ensure that innodb and the binlog committed in the same order.
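To make the divergence concrete, here is a small Python simulation of the two transactions applied in both orders. It is only an illustration: the row is modeled as a dictionary, the starting values of 0 are an assumption, and the function names are hypothetical.

```python
def txn_1(row):
    # UPDATE test SET a=a+1 WHERE k=1; UPDATE test SET b=a WHERE k=1;
    row["a"] += 1
    row["b"] = row["a"]


def txn_2(row):
    # UPDATE test SET a=a+1 WHERE k=1; UPDATE test SET c=a WHERE k=1;
    row["a"] += 1
    row["c"] = row["a"]


def apply_in_order(transactions):
    """Apply the transactions one after another to a fresh copy of the row."""
    row = {"k": 1, "a": 0, "b": 0, "c": 0}  # assumed starting values
    for txn in transactions:
        txn(row)
    return row


master = apply_in_order([txn_1, txn_2])  # {'k': 1, 'a': 2, 'b': 1, 'c': 2}
slave = apply_in_order([txn_2, txn_1])   # {'k': 1, 'a': 2, 'b': 2, 'c': 1}
```

Both servers end with the same value of a, so a check on that column alone would not notice that b and c have swapped.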
Mark Callaghan noted that we can achieve the same effect by using a ticket system. As each thread enters innobase_xa_prepare(), it is given a "ticket" that is one higher than the previous ticket given out. As threads get ready to commit in the binlog, they must wait for their ticket number to come up. After a thread commits to the binlog, it increments the current ticket so that the thread with the next ticket can commit.
The third patch implemented this ticket system, using atomic operations instead of locks where possible to increase concurrency. Again, I implemented a dynamic configuration variable, force_binlog_order, that enables or disables this ticket-based queuing.
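The ticket scheme can be sketched in a few lines of Python. This is a simplified illustration, not the patch: it uses a condition variable where the patch uses atomic operations, and every name in it (TicketQueue, take_ticket, wait_for_turn, done) is hypothetical.

```python
import threading


class TicketQueue:
    """Hands out increasing tickets and admits holders strictly in ticket order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # ticket given to the next thread that prepares
        self._now_serving = 0   # ticket currently allowed to commit to the binlog

    def take_ticket(self):
        # Called at prepare time: each ticket is one higher than the previous.
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def wait_for_turn(self, ticket):
        # Called before the binlog commit: block until this ticket comes up.
        with self._cond:
            while self._now_serving != ticket:
                self._cond.wait()

    def done(self):
        # Called after the binlog commit: let the next ticket holder proceed.
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()


def demo_ticket_order(n_threads=8):
    """Prepare in order 0..n-1, start committers in reverse, record binlog order."""
    queue = TicketQueue()
    binlog_order = []

    def commit(txn_id, ticket):
        queue.wait_for_turn(ticket)
        binlog_order.append(txn_id)  # stands in for the binlog write
        queue.done()

    prepared = [(txn_id, queue.take_ticket()) for txn_id in range(n_threads)]
    threads = [threading.Thread(target=commit, args=p) for p in reversed(prepared)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [txn_id for txn_id, _ in prepared], binlog_order
```

Even though the committing threads are started in reverse, the binlog order returned by demo_ticket_order() matches the prepare order, which is the property the ticket system is meant to guarantee.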
Unexpected Performance
During performance testing, we noticed an unexpected performance increase when prepare_commit_mutex was disabled and force_binlog_order was enabled. The increase appeared with the prepare_commit_mutex=0, force_binlog_order=1 combination at greater than 32 threads.
My speculation at this point is that this change is managing to exploit sequential writes to the disk, but it isn't clear to me at all why this performance increase isn't seen when force_binlog_order is disabled. What is keeping the performance down to base levels when prepare_commit_mutex isn't taken and order isn't being enforced?
Coming Soon
Now that the foundation has been laid, the next steps towards achieving the performance results we think are possible are as follows:
- Develop tests to ensure correctness in the absence of holding prepare_commit_mutex. The test synchronization techniques available in MySQL should be useful in making these tests give us confidence in our changes so far.
- Implement the binlog group commit patch. Our proof-of-concept used a timed wait on a condition variable with a dynamically-configurable wait time in milliseconds. The next patch will probably use microseconds for more fine-grained tuning and may introduce concepts such as a maximum number of waiters and some sort of automatic wait time tuning.
- Release row locks early. In situations where a single row lock is the main performance bottleneck on a server, our proof-of-concept showed that releasing the row locks in innobase_xa_prepare() can lead to huge performance wins -- in some cases, 20x-30x speedups when combined with the group commit changes.
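As a rough illustration of the group commit idea in the list above, the following Python sketch has the first committer of a group do a timed wait on a condition variable while later committers join, then perform one fsync for everyone. It is a sketch under stated assumptions, not the proof-of-concept code: the class and function names are hypothetical, the fsync is a callback, and there is no maximum number of waiters or automatic tuning.

```python
import threading
import time


class _Group:
    def __init__(self):
        self.size = 1                    # the leader counts as the first member
        self.synced = threading.Event()  # set once the group's fsync is done


class GroupCommitter:
    def __init__(self, fsync, wait_seconds):
        self._fsync = fsync              # callback standing in for the binlog fsync
        self._wait_seconds = wait_seconds
        self._cond = threading.Condition()
        self._open_group = None          # group currently accepting members

    def commit(self):
        """Block until this commit is covered by an fsync; return its group size."""
        with self._cond:
            group = self._open_group
            leader = group is None
            if leader:
                group = self._open_group = _Group()
                # Timed wait: releases the lock so followers can join the group.
                deadline = time.monotonic() + self._wait_seconds
                remaining = self._wait_seconds
                while remaining > 0:
                    self._cond.wait(remaining)
                    remaining = deadline - time.monotonic()
                self._open_group = None  # close the group to new members
            else:
                group.size += 1
        if leader:
            self._fsync()                # one fsync on behalf of the whole group
            group.synced.set()
        else:
            group.synced.wait()
        return group.size


def demo_group_commit(n_threads=4, wait_seconds=0.5):
    fsyncs, sizes = [], []
    committer = GroupCommitter(lambda: fsyncs.append(time.monotonic()), wait_seconds)
    threads = [threading.Thread(target=lambda: sizes.append(committer.commit()))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(fsyncs), sizes
```

The wait time is the tuning knob the list item refers to: a longer wait collects more commits per fsync but adds latency to each one, which is why finer-grained units and automatic tuning are attractive.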