Archive for the ‘Новости’ Category

MySQL: Kill sleeping connections

Сентябрь 1st, 2010
Platform: MySQL 5.x

Most of the time it would be handy to have a native MySQL script which would allow one to kill the sleeping connections which are in sleep state for more than 180 sec..  On the other hand DBA's can use  "wait_timeout" etc parameters to control this..

[code]

 # -- Make sure you are logged as MySQL 'root' user or any user who have got super privileges 



DELIMITER $$
DROP PROCEDURE IF EXISTS `uKillSleepingSessions`$$


CREATE PROCEDURE `uKillSleepingSessions`()
COMMENT 'This routne is used to kill idle sessions'
READS SQL DATA


BEGIN


DECLARE no_more_rows BOOLEAN;
DECLARE loop_cntr INT DEFAULT 0;
DECLARE num_rows INT DEFAULT 0;
DECLARE uID bigint(4);

DECLARE my_cur CURSOR FOR
SELECT ID
  FROM information_schema.PROCESSLIST PL
WHERE PL.COMMAND='Sleep' AND PL.TIME > 180;


DECLARE CONTINUE HANDLER FOR NOT FOUND
SET no_more_rows = TRUE;


OPEN my_cur;
select FOUND_ROWS() into num_rows;

the_loop: LOOP


FETCH my_cur
INTO uID;


IF no_more_rows THEN
    CLOSE my_cur;
    LEAVE the_loop;
END IF;


SET @tmp_sql= CONCAT("KILL ",uID);
PREPARE s1 FROM @tmp_sql;
EXECUTE s1;


DEALLOCATE PREPARE s1;
SET loop_cntr = loop_cntr + 1;
END LOOP the_loop;


END$$
DELIMITER ;



# Usage from mysql prompt/gui client
# Once called , it would cleanup idle connections which are sleeping for more than 180 sec
call uKillSleepingSessions();

[/code]



PlanetMySQL Voting: Vote UP / Vote DOWN

Cloud Insight: HP, Dell, 3PAR, VMWare & ScaleDB

Август 31st, 2010
The bidding war between HP and Dell for 3PAR has created great theater. The rationale is simple, both HP and Dell want a complete set of products to sell into the new cloud space and 3PAR is the only bitsized morsel among EMC, IBM and Hitachi that addresses this space. What is the compelling advantage they offer in storage? Elasticity. 3PAR provides the ability for companies to add/remove storage in thin slices (AKA thin provisioning). How does this relate to ScaleDB? We do the exact same thing for databases in the cloud and we do it for the most popular database in the cloud, MySQL.

How does VMWare play into this? Their CEO Paul Maritz was on CNBC talking about the hybrid cloud and how companies want to run core cloud capabilities on premise and then use the public cloud providers to handle compute overflow during peak usage. This means that public cloud value to corporations, assuming Maritz is correct, is based largely on their ability to provide elasticity. It will no longer be sufficient for public cloud companies to provide reserved servers, because the reserved servers will be run in the company’s data center. The public cloud will add/remove servers to handle peaks in usage. So elasticity is EVERYTHING. ScaleDB is all about elasticity for the database.

It is also interesting to note from the Maritz interview that he sees the next wave of cloud (and hence the next wave of cloud consolidation) coming from the software sector. More specifically, the ability to take existing applications and make them run on the cloud. In other words, to make them elastic. Again, this is exactly what ScaleDB does. We take existing MySQL applications and make them elastic.

It is also interesting to note that HP and Dell have decimated their own R&D and are now looking to acquire that expertise from outside, and they are willing to pay for the expertise.

Another theme playing out in the background makes this situation even more interesting. Oracle has adopted a systems approach, where they combine their hardware and software:

“The heart of the interview focused on Oracle's interest in Sun. By combining Sun's expertise in hardware with Oracle's software, Ellison suggested, the combined company can become a powerful "systems" company that sells solutions to businesses. The competitor that Ellison wants to beat: IBM.”

Summary: Cloud is the next battle ground. It all starts with the hardware/infrastructure (e.g. 3PAR) and then moves upstream to software. Oracle will be focused on selling complete systems, alienating HP & Dell, among others. This is compounded by the fact that HP and Dell have decimated their R&D, so they are forced to partner/acquire. At the same time, if Maritz’s vision of public clouds becoming effectively excess capacity for handling peaks from corporations is realized, then elasticity in the cloud will become critical as well. This obviously plays to ScaleDB’s strengths.

PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Community Server 5.6

Август 30th, 2010
MySQL Community Server 5.6 (5.6.0 m4, published on Monday, 30 Aug 2010)
PlanetMySQL Voting: Vote UP / Vote DOWN

テストケースの実行にあわせて Apache を起動・終了する方法

Август 27th, 2010

ウェブアプリケーションやライブラリの結合テストを行う段階になると、実際に Apache を起動してテストを実行したくなります。しかし、そのためにいちいち Apache の設定ファイルを修正して httpd を再起動して、とやっていては面倒です。特に複数のプログラムを同時に開発していると、あっちをテストしたらこっちが動かなくなって… なんてなったりして嫌気がさしてきます。

そこで、テストを実行する際に、環境毎に異なる以下のような問題を吸収しつつ、テスト専用に設定された Apache を自動的に起動終了してくれる Perl モジュール:Test::Httpd::Apache2 を書きました。

  • 環境によって、インストールパスが違う (/usr/local/apache/bin だったり /usr/sbin だったり)
  • 環境によって LoadModule の要不要や、ロードするパスが違う
  • 環境によってプログラム名が違う (Debian 系では httpd ではなく apache2)
  • Apache/2.0 と 2.2 でモジュール名が変わっているケースがある (mod_access と mod_authz_host)

使い方は簡単。テストコードの先頭で、Test::Httpd::Apache2 のインスタンスを生成すると、自動的に TCP の空きポートを探して Apache が起動するので、そのアドレスに HTTP クライアントからアクセスするだけです。起動された Apache は、Test::Httpd::Apache2 のインスタンスが解放されるタイミングで終了され、一時ファイルも自動的に消去されます。

use Test::Httpd::Apache2;

my $httpd = Test::Httpd::Apache2->new(
    required_modules => [ qw(perl) ],  # mod_perl をロード
    custom_conf => << 'EOT',
<Location />
  SetHandler perl-script
  ...
</Location>
EOT
);

my $root_url = "http://" . $httpd->listen . "/"; # 起動した host:port から URL 生成

# あとはテスト
...

そして、テスト本体が Perl のコードである必要はないので、PHP や他の言語のテストを実行するためのラッパーとして使うこともできます。

use Test::Httpd::Apache2;

my $httpd = Test::Httpd::Apache2->new(
    required_modules => [ qw(php5) ],  # mod_php をロード
    custom_conf => << 'EOT',
DocumentRoot "docroot"
SetHandler php5-script
EOT
);

# 起動した host:port から URL を生成して、環境変数に登録
$ENV{ROOT_URL} = "http://" . $httpd->listen . "/";

# my_test.php を実行
system('php', 'my_test.php') == 0
    or die "my_test.php failed: $?";

Test::mysqldTest::postgresql と組み合わせると、データベースを使うウェブアプリの結合テストが気軽に書けそうで夢が広がりまくりんぐ。

まだ荒削りなモジュールなので、改善提案やパッチ等お待ちしております。インストールは cpan -i Test::Httpd::Apache2、リポジトリは github.com/kazuho/p5-test-httpd-apache2 にあります。

それでは、have fun!


PlanetMySQL Voting: Vote UP / Vote DOWN

What concerns you most about your current MySQL backup solution?

Август 27th, 2010

PlanetMySQL Voting: Vote UP / Vote DOWN

What concerns you most about your current MySQL backup solution?

Август 27th, 2010

PlanetMySQL Voting: Vote UP / Vote DOWN

Querying for InnoDB Lock contention

Август 25th, 2010
In the previous post, I showed how the "virtual tables" that the InnoDB Plugin use for finding lock contention works, and what they look like. I also showed what an InnoDB lock graph looks will look like when monitored from MyQuery 3.3, which is not yet, but soon, released.

So, we are looking at three tables, the transaction table, where each transaction has no, one or more locks. For our intents and purposes, we will disregard transactions without locks, so we can join transactions table to the locks table. Then we need to link up the locks table to the lock waiters, to see if the lock in question is waiting on another lock, Each lock is either not waiting on one, or, as we have seen, more, locks, or it is not waiting at all. As a lock might not be waiting at all, we need an OUTER join here.

So the query I am using, the result of which is massaged a fair bit before I shown the lock tree, but this is the query I use to get the data:
SELECT t.trx_id, l.lock_id, w.blocking_lock_id, l.lock_table, l.lock_index,
t.trx_query, l.lock_data

FROM information_schema.innodb_trx AS t
JOIN information_schema.innodb_locks AS l ON t.trx_id = l.lock_trx_id
LEFT OUTER JOIN information_schema.innodb_lock_waits AS w
ON l.lock_id = w.requested_lock_id
/Karlsson

PlanetMySQL Voting: Vote UP / Vote DOWN

Connector/ODBC 5.1

Август 24th, 2010
Connector/ODBC 5.1 (5.1.7 GA, published on Tuesday, 24 Aug 2010)
PlanetMySQL Voting: Vote UP / Vote DOWN

Review of MySQL High Availability by Charles Bell, Mats Kindahl and Lars Thalmann

Август 19th, 2010

The kind folks at O'Reilly sent me MySQL High Availability by Charles Bell, Mats Kindahl and Lars Thalmann . In summary, the book is awesome. Personally I didn't think that Replication was enough of a feature in mySQL to fill up 500+ pages about it, BUT, they did and not with fillers or info that is NOT actually important to you as the end designer. In Mark Callaghan's forward he states that this book "adequately" explains MySQL replication, this is a huge low-ball, I dare say if you did not look at the replication code (prior to row-based replication there was not a lot of it) your questions can be answered with this book. There are three parts to this book: Replication (roughly 50% of the book), Monitoring and Disaster Recovery, then finally High Availability Environments. Each part is well written and accurate. I have been using mySQL before Replication existed. When it was first released I have been using it and been to many talk about it, from the original author talks to Brian Aker’s talks to asking Jeremy Zawodny about it to giving my own talks about Scale-Out, replication, etc.
My experience with mySQL Replication is rather robust. I broke holes in it, found bugs with it, and even wrote a quick C client (when it was still statement based replication only on the 4.1 branch) that would act as a proxy to gather all the replication feeds of all boxes onto a single box (it was a proof of concept). The book goes into clear detail about the common approaches and use cases of replication scaling out the apps reads. It even talks about data sharding that I must say I am an expert at (done it for Flickr/RockYou, for friends etc). The examples are clear although I would not recommend deploying their examples verbatim. The reason you never want to mod based on the number of shards in your system, because if you add more shards you are going to have to move your data all around. Central dictionaries are perfect to control balance. They go into this, but not how to lock a user to migrate to a lightly loaded shard. The example that they have creates holes in the flow, race conditions as well as downtime to move data, this is a nitpick, the book is good.

I do not want to give away the book, it is a good read and the quality is what you expect from O’Reilly. If you want to know about various replication techniques, replication shortcomings, building fail-over systems, and exploring other technologies that are comparable to replication, this is a great book for you. Oh and of note, they even go into exotic features that I rarely use in production and how that effects replication. Awesome.

PlanetMySQL Voting: Vote UP / Vote DOWN

Binary Log Group Commit – Recovery

Август 18th, 2010
It was a while since I wrote the previous article, but the merging of Oracle and Sun here resulted in quite a lot of time having to be spent on attending various events and courses for legal reason (one of the reasons I prefer working for smaller companies) and together with a summer vacation spent on looking over the house, there were little time for anything else. This is the second post of three, and in the last one I will cover some optimizations that improves performance significantly.

In the previous article, an approach was outlined to handle the binary log group commit. The basic idea is to use the binary log as a ticketing system by reserving space in it for the transactions that are going to be written. This will provide an order on the transactions as well as allowing writing the transactions in parallel to the binary log, thereby boosting performance. As noted in the previous post, a crash while writing transactions to the binary log requires recovery. To understand what needs to be changed, it is necessary to understand how the structure of the binary log as well as how recovery after a crash works currently together with the implementation of 2-phase commit that MySQL uses.

Figure 1. Binlog file structure

A quick intro to the structure of the binary log

Figure 1 gives the rough structure of the binary log with a set of binlog files and an binlog index file. The binlog index file just list the binlog files that makes up the binary log, while each binlog file have the real contents of the binary log that you can see when executing a SHOW BINLOG EVENTS.

Each binlog file consists of a sequence of binlog events, where the most important events from our perspective is the Format description event. In addition, each binlog file is also normally terminated by a Rotate event that refers to the next binlog file in the sequence.

The Format description event is used to describe the contents of the binlog file and therefore contain a a lot of information about the binlog file. In this case we are interested in a special flag called LOG_EVENT_BINLOG_IN_USE_F, which is used to tell if the binlog is actively being written by the server. When the server opens a new binlog file, this flag is set to indicate that the file is in use, and when the binary log is rotated and a new binlog file created, this flag is cleared when closing the old binlog file.

In the event of a crash, the flag will therefore be set and the server can see that the file was not closed properly and start with performing recovery.

Recovery and the binary log

When recovering, the server has to find all transactions that were partially executed and decide if they are going to be rolled back or committed properly. The deciding point when a transaction will be committed instead of rolled back is when the transaction has been written to the binary log. To do this, the server has to find all transactions that were written to the binary log and tell all storage engines to commit these transactions.

The recovery procedure is executed when the binary log is opened—which the server does calling TC_LOG_BINLOG::open during startup. When the binary log is opened, recovery is done if the last open binlog file was not closed properly. An outline of the procedure executed is:

  1. Open the binlog index file and go through it to find the last binlog file mentioned there [TC_LOG_BINLOG::open]
  2. Open this binlog file and check if the LOG_EVENT_BINLOG_IN_USE_F flag is set
  3. If the flag was clear, then the server stopped properly and no recovery is necessary. Otherwise, the server did not stop properly and recovery starts by calling.
  4. The last binlog file is now open, so the entire binlog file is scanned and the XID of each each Xid event is recorded. These XIDs denote the transactions that were properly written to the binary log—that is, the transactions that shall be committed [TC_LOG_BINLOG::recover].
  5. Each storage engine is handed the list of XIDs of transactions to commit through the handlerton::recover interface function [ha_recover].
  6. The storage engine will then commit each transaction in the list and roll back all the others.
Figure 2. Parallel binary log group commit

So, what's the problem?

The procedure above works fine, so what are the problems we have to solve to implement the procedure described in the previous article? If you look in Figure 2, you have a hint to what is the problem.

Now, assume that thread 1, 2, and 3 in Figure 2 is writing transactions to disk (starting at positions Trans_Pos1, Trans_Pos2, and Trans_Pos3 respectively) and that a preceding thread (a thread that got a binlog position before Last_Complete) decides that it is time to call fsync to group commit the state this far. The binlog file will then be written in this state—where some transactions are partially written—and Last_Committed will be set to the value of Last_Complete, leading to the situation depicted in Figure 2.

As you can see in the figure, thread 2 has already finished writing data to the binary log and is therefore written to durable storage. Since thread 1—which precedes thread 2 in the binary log—has not completed yet, thread 2 has not yet committed and is still waiting for all the preceding transactions to complete. If a crash occurs in this situation, it is necessary to somehow find the XID of all transactions that have committed—excluding the transaction that thread 2 has completed—and commit them to the storage engine when recovering.

A proposal for a new recovery algorithm

In the original algorithm, the scan of the binlog file stopped when the file ended, but since there can be partially written events in the binlog file after the "real" end of the file (the binlog file ends logically at Last_Committed/Last_Complete), so we have to find some other way to detect the logical end of the file.

To handle this, it is necessary to somehow mark events that are not yet committed so that the recovery algorithm can find the correct position where the binlog file ends. The same problem occurs if one wants to persist the end of the binlog file preallocating the binlog file. There are basically three ways to handle this:

  • Write the end of the binlog file in the binlog file header (that is, the Format description log event).
  • Mark each event by zeroing out a field that cannot be zero—for example, the length, the event type, or event position—before writing the event to the binary log. Then write this field with the correct value after the entire event has been written.
  • Checksum the events and find the end of the worklog by scanning for the first event with an incorrect checksum.
Write the length in the binlog file header
Finding the length of the binlog in this case is easy: just inspect the header and find the length of the binlog file there. In this case, it is necessary to update the length after the event has been written since there may be an fsync call at any time between starting to write the event data and finishing writing the event. Normally, this means updating two block of the file for each event written, which can be a problem since it requires at least the block containing the header and all the blocks that was written since the last group commit to be written when calling fsync. If a large number of events is written between each fsync, this might not impose a large penalty, but if sync-binlog=1 it might become quite expensive. Some experiments done by Yoshinori showed a drop from 15k events/sec to 10k events/sec, which means that we lose one third in performance.

Digression. The measurements that Yoshinori did consisted of one pwrite to write the event, one pwrite to write the length to the header and then a call to fsync. It is, in other word, most similar to using sync_binlog=1. In reality, however, this will not be the case since a user that is using the binary log group commit will have several events written between each call to fsync. Since these writes will be to memory (the file pages are in memory), performance will not drop as much. To evaluate the behavior for a group commit situation better, writing 10 events at a time was compared as well (pretending to be sync_binlog=10). Straight append (using write) gave at that point 110k events/sec and write to the header before calling fsync gave 80k events/sec. This means a performance reduction of 27%, which is an improvement but still a very large overhead.

Use a marker field
The second alternative is to use one of the fields as a marker field. By setting one of the fields that cannot be zero to zero, it is possible to detect that the event is incorrect and stop at the event before that. Good candidates as fields is the length—which cannot be zero for any event and is four bytes—and the event type, which is one byte and where zero denotes an unknown event and never occurs naturally in a binlog file. The technique would be to first blank out the type field of the event, write the event to the binlog file, and then use pwrite to fill in the correct type code after the entire event is written. If an fsync occurs before the event type is written, the event will be marked as unknown and if a crash occurs before the event is completely written (and written to disk), it will be possible to scan the binlog file to find the first event that is marked as unknown. In order for this technique to work, it is necessary to zero the unused part of the binlog file before starting to write anything there (or at least zero out the event type). Otherwise, crash recovery will not be able to correctly detect where the last completely written event is located.

Compared to the previous approach, this does not require writing to locations far apart (except in rare circumstances when the event spans two pages). It also has the advantage of not requiring any change of the binlog format. This technique is likely to be quite efficient. (Note that most of the writes will be to memory, so there will not be any extraneous "seeks" over the disk to zero out parts of the file.)

Checksum on each event
The third alternative is to rely on an event checksum to detect events that are incompletely written. This approach is by far the most efficient of the approaches since the event checksum is naturally written last. It also has the advantage of not requiring the unused parts of the binlog file to be zeroed since it is unlikely that the checksum will be correct for the event unless the event has been fully written. This also makes it a very good candidate for detecting the end of the binlog file when preallocating the binlog file. The disadvantage is, of course, that it requires checksums to be enabled and implemented.
With this in mind, the best approach seems to be to checksum each event and use that to detect the end of the binary log. If necessary, the second approach can be implemented when the binlog is not checksummed.

The next article will wrap up the description by pointing out some efficiency issues and how to solve them to get an efficient implementation.


PlanetMySQL Voting: Vote UP / Vote DOWN