Archive for the ‘recovery’ Category

BlitzDB Crash Safety and Auto Recovery

Июль 22nd, 2010

Crash Safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so called NoSQL world compromises crash safety in return for higher QPS. The argument there is that the availability of the overall system should be accomplished by replication since a database server can’t be rescued if the physical disk breaks. I happen to agree with this philosophy but I am also aware that this isn’t a correct answer for everyone. So, what will I do with BlitzDB?

Several relational database hackers have pointed out that BlitzDB isn’t any safer than MyISAM since it doesn’t guarantee crash safety. This is currently true but I plan on making BlitzDB much safer than MyISAM by providing following features.

  1. Auto Recovery Routine (startup option)
  2. Tokyo Cabinet’s Transaction API (table-specific option)

The second feature above would actually guarantee BlitzDB to be crash safe (especially combined with auto recovery) but I won’t get into depth in this post since this topic deserves a blog post of it’s own. Let me just state that this feature will be provided in a form like this:

CREATE TABLE t1 (
  a int PRIMARY KEY,
  b varchar(256)
) ENGINE = BLITZDB, CRASH_SAFE;

From here on, I’ll cover how I plan on hacking auto recovery in BlitzDB.

Auto Recovery Challenges

As I blogged a while back, recovering Tokyo Cabinet is relatively simple. However, this is not a sufficient solution in BlitzDB since the data file (hash database that actually holds the rows) and the index file(s) are independent from each other. That is, the likelihood of the data file and the index file(s) to be inconsistent is very high after a crash. So, how can we hack on this? Pretty simple.

Indexes aren’t Important at Recovery Phase

Because BlitzDB logically separates the data file and it’s indexes, index files aren’t that important. If a server crash had occurred, BlitzDB could delete the index file(s) and recompute them from the data file. Needless to say, this process would involve a lot of random access and computation but it would not dominate the time space of the system since it’s a one-time cost. This approach however has one flaw in it such that the index files can’t be recomputed if the data file is broken or is unrecoverable.

Therefore to guarantee crash safety, BlitzDB must ensure that the data file is unbreakable. This is precisely where Tokyo Cabinet’s Transaction API comes in. I’m planning on using it to protect the data file from breaking. If the data file is protected, the table can be rescued. Simple!

So, that’s what I have in mind for making BlitzDB a safer engine. Unfortunately I can’t start hacking on it immediately since I have several bugs to fix first. Nevertheless I’m looking forward to start hacking on it. This challenge should be quite fun to tackle.


PlanetMySQL Voting: Vote UP / Vote DOWN

Recover BLOB fields

Июль 1st, 2010

For a long time long types like BLOB, TEXT were not supported by Percona InnoDB Recovery Tool. The reason consists in a special way InnoDB stores BLOBs.

An InnoDB table is stored in a clustered index called PRIMARY. It must exist even if a user hasn't defined the primary index. The PRIMARY index pages are identified by 8-bytes number index_id. The highest 4 bytes are always 0, so index_id is often notated as o:<4 bytes number>, e.g. 0:258. The pages are ordered in a binary tree. Primary index is used as a key. Inside a page records are stored in a linked list.

InnoDB page by default is 16k. Obviously if a record is too long, a single page can't store it. If the total record size is less than UNIV_PAGE_SIZE/2 - 200 (this is roughly 7k) then the full record is stored in the page of PRIMARY index. Let's call it internal. In InnoDB sources they have type FIL_PAGE_INDEX*. If the record is longer than 7k bytes, only first 768 bytes of every BLOB field are stored internally. The rest is stored in external pages. They have type FIL_PAGE_TYPE_BLOB. Page type is stored in a FIL_PAGE_TYPE field of the page header . In an earlier post Peter described in details how BLOBs are stored.

Let me illustrate a record format of the example of the table:

CODE:
  1. CREATE TABLE `t1` (
  2. `ID` int(11) unsigned NOT NULL,
  3. `NAME` varchar(120),
  4. `N_FIELDS` int(10),
  5. PRIMARY KEY (`ID`)
  6. ) ENGINE=InnoDB DEFAULT
  7. CHARSET=latin1

Here COMPACT format is used, which is default in MySQL >= 5.1.

The record consists of four parts:

  1. Offsets. Effectively these are field lengths. Only variable length types have offsets. The offset can be one or two bytes depending on maximum field size. The highest bit of the offset is 1 if the field is stored in external pages (i.e. long BLOB field)
  2. NULL fields. A bit per NULL-able field padded to minimum number of bytes to store all flags.
  3. So called extra bytes. These are 5 bytes where different flags are stored like "record is deleted" flag. The last two bytes are the relative pointer to the next record in the page.
  4. User data. TRX_ID is a transaction id. PTR_ID is a pointer in a rollback segment to the old version of the record.

So if a field is the long one, it has 1) The highest bit of the offset is set to "1", 2) After 768 bytes there are 20 bytes in the end where the following:

  1. BTR_EXTERN_SPACE_ID - space id where the next piece of the field is stored
  2. BTR_EXTERN_PAGE_NO - page id
  3. BTR_EXTERN_OFFSET - offset inside a page. An external page has a header. The similar pointer to the next page is stored in it.
  4. BTR_EXTERN_LEN - length of the next piece.

The external pages are linked until BTR_EXTERN_PAGE_NO is FIL_NULL.

Percona InnoDB Recovery Tool supports now recovery of long fields. It is still in development branch, but should be released after QA tests.

The complexity of BLOB fields brings prerequisites to successfully recover a record with BLOB : all pieces of the BLOB field must be reachable by pointers. That means BTR_EXTERN_PAGE_NO, BTR_EXTERN_OFFSET and BTR_EXTERN_LEN must not be corrupted.

The tool outputs the recovered table in tab-separated values format. BLOBs are printed in a hex form - 0ACD86...

To upload the table back you should utilize UNHEX function:

CODE:
  1. mysql>
  2. LOAD DATA INFILE '/path/to/datafile'
  3. REPLACE INTO TABLE <table_name>
  4. FIELDS TERMINATED BY '\t'
  5. OPTIONALLY ENCLOSED BY '"'
  6. LINES STARTING BY '<table_name>\t'
  7. (id,sessionid,uniqueid,username,nasipaddress,@var1,@var2,etc)
  8. SET
  9.   blobfield = UNHEX(@var1),
  10.   datefield2 = FROM_UNIXTIME(@var2,'%Y %D %M %h:%i:%s %x');

* - there is a typo in Recovery of Lost or Corrupted InnoDB Tables Presentation


Entry posted by Aleksandr Kuzminsky | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Debugging problems with row based replication

Май 7th, 2010

MySQL 5.1 introduces row based binary logging.  In fact, the default binary logging format in GA versions of MySQL 5.1 is 'MIXED' STATEMENT*, which means it is very likely that some of your binary log entries will be written in a row-based fashion instead of the actual statement which changed data.   You can explicitly turn this feature off by using binlog_format=STATEMENT in your my.cnf, but the row-based format does offer advantages particularly if triggers or stored procedures are used, or if non deterministic functions like RAND() are used in DML statements.

A statement based replication slave can get out of sync with the master fairly easily, especially if data is changed on the slave.   It is possible for a statement to execute successfully on a slave even if the data is not 100% in sync, so MySQL doesn't know anything is wrong.  This isn't the case with row-based replication.  Only actual changes are written to the database with row-based logs.  Take as an example a DELETE statement which does not modify any rows.  No binary logging will be performed by this statement in row-based logs, but the DELETE will be written to a statement based binary log.  Since row-based logs only contain information about changed rows, a replication slave which does not find a row referenced in a row-based log will immediately return an error: HA_ERR_KEY_NOT_FOUND.

This is probably best shown by example.  First, lets create a test table on a MySQL master database and populate it with some data.

SQL:
  1. CREATE TABLE `t1` (
  2. `c1` int(11) NOT NULL AUTO_INCREMENT,
  3. PRIMARY KEY (`c1`)
  4. ) ENGINE=InnoDB;
  5. Query OK, 1 row affected (0.02 sec)

Insert a single "seed" row into the table:

SQL:
  1. INSERT INTO repl.t1 VALUES ();
  2. Query OK, 1 row affected (0.00 sec)

INSERT ... SELECT a few times to get some dummy data:

SQL:
  1. INSERT INTO repl.t1 SELECT NULL FROM t1;
  2. Query OK, 1 row affected (0.01 sec)
  3. Records: 1  Duplicates: 0  Warnings: 0
  4.  
  5. INSERT INTO repl.t1 SELECT NULL FROM t1;
  6. Query OK, 2 rows affected (0.01 sec)
  7. Records: 2  Duplicates: 0  Warnings: 0
  8.  
  9. ...
  10.  
  11. INSERT INTO repl.t1 SELECT NULL FROM t1;
  12. Query OK, 32768 rows affected (0.62 sec)
  13. Records: 32768  Duplicates: 0  Warnings: 0

Verify that we have a reasonable amount of data to play with:

SQL:
  1. SELECT count(*) FROM t1;
  2. +----------+
  3. | count(*) |
  4. +----------+
  5. |    65536 |
  6. +----------+
  7. 1 row IN SET (0.02 sec)

My test setup includes a MySQL slave.  After the above setup completes, I am going to change data on the slave by truncating t1.  After truncation:

SQL:
  1. SELECT count(*)
  2. FROM t1 AS empty_table_on_the_slave;
  3. +----------+
  4. | count(*) |
  5. +----------+
  6. |    0     |
  7. +----------+
  8. 1 row IN SET (0.00 sec)

Now, lets make a modification on the master:

SQL:
  1. UPDATE repl.t1 SET c1 = c1 + 100000 WHERE c1 = 1;
  2. Query OK, 1 row affected (0.02 sec)
  3. Rows matched: 1  Changed: 1  Warnings: 0

On the slave, SHOW SLAVE STATUS now reports an error (many fields omitted):

SQL:
  1. SHOW SLAVE STATUS\G
  2. Relay_Log_File: mysql_sandbox25162-relay-bin.000002
  3. Relay_Log_Pos: 340718
  4. Slave_SQL_Running: No
  5. Last_Errno: 1032
  6. Last_Error: Could NOT execute Update_rows event ON TABLE repl.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000001, end_log_pos 340718

This error message is telling us that a row that was modified on the master can not be found on the slave:HA_ERR_KEY_NOT_FOUND.  It doesn't tell us what row it couldn't find though.  In our simple example, we know that all the rows are missing, and the one in particular that the server is trying to find is c1=1, but how can we find that out if we don't know the conditions under which replication went out of sync?

To determine exactly what row is having an issue, we can use the 'mysqlbinlog' utility in combination with the slave relay log.  Examining the output of SHOW SLAVE STATUS, we can see that the slave is currently executing at relay log file: mysql_sandbox25162-relay-bin.000002 at position: 340718.  MySQL 5.1 ships with a new version of mysqlbinlog that includes new options that make decoding RBR statements possible.  These options are "--base64-output" and "-v".  For debugging row based logs, I suggest '--base64-output=decode-rows -v".  This omits any base64 output and instead replaces the output with semi-valid SQL statements.  I say semi-valid because the statements do not include column names.  Column names are not included in the binary log, only positions.

mysqlbinlog mysql_sandbox25162-relay-bin.000005 --base64-output=decode-rows -v > decoded.log

Examine the decoded binary log with the 'less' utility, and search forward for the value 340718:

SQL:
  1. # at 340718
  2. #100506 12:42:56 server id 1  end_log_pos 340637        Query   thread_id=6     exec_time=0     error_code=0
  3. SET TIMESTAMP=1273174976/*!*/;
  4. BEGIN
  5. /*!*/;
  6. # at 340782
  7. # at 340823
  8. #100506 12:42:56 server id 1  end_log_pos 340678        Table_map: `repl`.`t1` mapped to number 15
  9. #100506 12:42:56 server id 1  end_log_pos 340718        Update_rows: table id 15 flags: STMT_END_F
  10. ### UPDATE repl.t1
  11. ### WHERE
  12. ###   @1=1
  13. ### SET
  14. ###   @1=100001
  15. # at 340863
  16. #100506 12:42:56 server id 1  end_log_pos 340745        Xid = 51
  17. COMMIT/*!*/;
  18. # at 340890

As you can see, the RBR entry has been decoded as an update statement.  The OLD value (the row that is missing) is represented in the WHERE clause.  The binary log does not carry column names, so each column is represented by the column position prefixed with @.

There you have it.  You can now find out exactly which row is missing and begin your investigation as to why it is missing. You may have to do a bit more leg work if many rows were modified during a transaction.

You should always set read_only = true on your MySQL slaves to prevent accidental changes in data from happening!

*edit*
The MySQL 5.1 default binary logging format changed back to STATEMENT in MySQL 5.1.29.


Entry posted by Justin Swanhart | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

InnoDB’s tablespace ids and Partitions

Март 22nd, 2010
There are times when what you have is a partially running database and a bunch of backup innodb tablespace files (the .ibd files). If you're using innodb_file_per_table, then you have a separate .ibd file for each InnoDB table.

Now, you have your running database with a bunch of tables, and you want to replace some of them with the backup .ibd files. According to the MySQL docs, you'd do this:
  1. ALTER TABLE foo DISCARD TABLESPACE; (this deletes the current .ibd file)
  2. copy the old .ibd file into your database directory
  3. ALTER TABLE foo IMPORT TABLESPACE;
Assuming your .ibd file was from the same database and you did not drop the table and recreate it sometime between when you made the backup .ibd and now, this should work.Except... if you use partitions. If your table foo uses partitions, ie, its create statement was something like this:
CREATE TABLE foo (
...
) PARTITION BY ... (
PARTITION p0 ...,
);
In this case, you cannot discard the tablespace, and the first alter command throws an error:
mysql> ALTER TABLE foo DISCARD TABLESPACE;

ERROR 1031 (HY000): Table storage engine for 'foo' doesn't have this option
I have not investigated if there are workarounds for this, but I do have a little more information on what's happening.Remember that each .ibd file is a tablespace. For a partitioned table, there are multiple .ibd files, one for each partition. The table's files look like this:
foo.frm
foo.par
foo#P#p0.ibd
foo#P#p1.ibd
...
Where p0, p1, etc. are the partition names that you specified in the create statement. Each partition is a different tablespace and has its own tablespace id. When you create an InnoDB table without partitioning, the internal tablespace id counter is incremented by 1. When you create an InnoDB table with paritions, the internal tablespace id counter is incremented by the number of partitions. The actual tablespace id is stored in each partition's .ibd file somewhere within the first 100 bytes. I have not attempted to find out where exactly though.

PlanetMySQL Voting: Vote UP / Vote DOWN

How PostgreSQL protects against partial page writes and data corruption

Февраль 8th, 2010

I explored two interesting topics today while learning more about Postgres.

Partial page writes

PostgreSQL’s partial page write protection is configured by the following setting, which defaults to “on”:

full_page_writes (boolean)

When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint… Storing the full page image guarantees that the page can be correctly restored, but at a price in increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)

Trying to reduce the cost of full-page writes by increasing the checkpoint interval highlights a compromise. If you decrease the interval, then you’ll be writing full pages to the WAL quite often. This should in theory lead to surges in the number of bytes written to the WAL, immediately following each checkpoint. As pages are revisited over time for further changes, the number of bytes written should taper off gradually until the next checkpoint. Hopefully someone who knows more can confirm this. Does anyone graph the number of bytes written to their WAL? That would be a nice illustration to see how dramatic this surging is.

Decreasing the checkpoint interval seems a bit scary, and is bound to have its own costs, for all the usual reasons. A massive checkpoint once in a while should be really expensive, and would lead to a bad worst-case time for recovery. Does the new bgwriter implementation in 8.3 fix any of this? In theory it could, but I don’t know enough yet to say. I have heard conflicting opinions on this point. I have a lot more to read about it before I form my own opinion.

Storing full pages might not really be that expensive. It could bloat the WAL, but is the cost (in terms of time) really that high? InnoDB (in MySQL) protects against partial page writes with a double-write strategy: a region in the tablespace is called the doublewrite buffer. Page writes are first sent to the doublewrite buffer, then to their actual location in the data file. I don’t remember where, but I’ve seen benchmarks showing that this doesn’t hurt performance, even though it seems counter-intuitive. Modern disks can do a lot of sequential writes, and the way InnoDB writes its data makes a lot of things sequential. I doubt that putting full pages into the PostgreSQL WAL is forced to cost a lot, unless there is an implementation-specific aspect that makes it expensive.

The TODO has some items on the WAL, which look interesting — “Eliminate need to write full pages to WAL before page modification” and a couple more items. I need to understand PostgreSQL’s recovery process better before I know what these really mean.

Detecting data corruption

I was able to verify that the WAL entries have a checksum. It is a CRC32. This is in xlog.c.

However, as far as I can understand, the answer for detecting data corruption in normal data pages is “Postgres doesn’t do that.” I was told on the IRC channel that normal data pages don’t have checksums. I am not sure how to verify that, but if it’s true it seems like a weakness. I’ve seen hardware-induced corruption on InnoDB data many times, and it could sometimes only be detected by page checksums.

What happens when a page is corrupt? It probably depends on where the corruption is. If a few bytes of the user’s data is changed, then I assume you could just get different data out of the database than you inserted into it. But if non-user data is corrupted then do you get bizarre behavior, or do you get a crash or error? I need to learn more about PostgreSQL’s data file layout to understand this. Imagining (I haven’t verified this) that a page has a pointer to the next page, what happens if that pointer is flipped to refer to some other page, say, a page from a different table? If TABLE1 and TABLE2 have identical structures but different data, could SELECT * FROM TABLE1 suddenly start showing rows from TABLE2 partway through the results? Again I need to learn more about this.

Related posts:

  1. The Ma.gnolia data might not be permanently lost I keep rea
  2. What data types does your innovative storage engine NOT support? I’ve
  3. PostgreSQL Conference East 2009, Day Three As I said

Related posts brought to you by Yet Another Related Posts Plugin.


PlanetMySQL Voting: Vote UP / Vote DOWN

How PostgreSQL protects against partial page writes and data corruption

Февраль 8th, 2010

I explored two interesting topics today while learning more about Postgres.

Partial page writes

PostgreSQL’s partial page write protection is configured by the following setting, which defaults to “on”:

full_page_writes (boolean)

When this parameter is on, the PostgreSQL server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint… Storing the full page image guarantees that the page can be correctly restored, but at a price in increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)

Trying to reduce the cost of full-page writes by increasing the checkpoint interval highlights a compromise. If you decrease the interval, then you’ll be writing full pages to the WAL quite often. This should in theory lead to surges in the number of bytes written to the WAL, immediately following each checkpoint. As pages are revisited over time for further changes, the number of bytes written should taper off gradually until the next checkpoint. Hopefully someone who knows more can confirm this. Does anyone graph the number of bytes written to their WAL? That would be a nice illustration to see how dramatic this surging is.

Decreasing the checkpoint interval seems a bit scary, and is bound to have its own costs, for all the usual reasons. A massive checkpoint once in a while should be really expensive, and would lead to a bad worst-case time for recovery. Does the new bgwriter implementation in 8.3 fix any of this? In theory it could, but I don’t know enough yet to say. I have heard conflicting opinions on this point. I have a lot more to read about it before I form my own opinion.

Storing full pages might not really be that expensive. It could bloat the WAL, but is the cost (in terms of time) really that high? InnoDB (in MySQL) protects against partial page writes with a double-write strategy: a region in the tablespace is called the doublewrite buffer. Page writes are first sent to the doublewrite buffer, then to their actual location in the data file. I don’t remember where, but I’ve seen benchmarks showing that this doesn’t hurt performance, even though it seems counter-intuitive. Modern disks can do a lot of sequential writes, and the way InnoDB writes its data makes a lot of things sequential. I doubt that putting full pages into the PostgreSQL WAL is forced to cost a lot, unless there is an implementation-specific aspect that makes it expensive.

The TODO has some items on the WAL, which look interesting — “Eliminate need to write full pages to WAL before page modification” and a couple more items. I need to understand PostgreSQL’s recovery process better before I know what these really mean.

Detecting data corruption

I was able to verify that the WAL entries have a checksum. It is a CRC32. This is in xlog.c.

However, as far as I can understand, the answer for detecting data corruption in normal data pages is “Postgres doesn’t do that.” I was told on the IRC channel that normal data pages don’t have checksums. I am not sure how to verify that, but if it’s true it seems like a weakness. I’ve seen hardware-induced corruption on InnoDB data many times, and it could sometimes only be detected by page checksums.

What happens when a page is corrupt? It probably depends on where the corruption is. If a few bytes of the user’s data is changed, then I assume you could just get different data out of the database than you inserted into it. But if non-user data is corrupted then do you get bizarre behavior, or do you get a crash or error? I need to learn more about PostgreSQL’s data file layout to understand this. Imagining (I haven’t verified this) that a page has a pointer to the next page, what happens if that pointer is flipped to refer to some other page, say, a page from a different table? If TABLE1 and TABLE2 have identical structures but different data, could SELECT * FROM TABLE1 suddenly start showing rows from TABLE2 partway through the results? Again I need to learn more about this.

Related posts:

  1. The Ma.gnolia data might not be permanently lost I keep rea
  2. What data types does your innovative storage engine NOT support? I’ve
  3. PostgreSQL Conference East 2009, Day Three As I said

Related posts brought to you by Yet Another Related Posts Plugin.


PlanetMySQL Voting: Vote UP / Vote DOWN

Tool of the Day: rsnapshot

Сентябрь 17th, 2009

rsnapshot is a filesystem snapshot utility for making backups of local and remote systems, based on rsync. Rather than just doing a complete copy every time, it uses hardlinks to create incrementals (which are from a local perspective a full backup also). You can specify how long to keep old backups, and all the other usual jazz. You’d generally have it connect over ssh. You’ll want/need to run it on a filesystem that supports hardlinks, so that precludes NTFS.

In the context of MySQL, you can’t just do a filesystem copy of your MySQL data/logs, that would be inconsistent and broken. (amazingly, I still see people insisting/arguing on this – but heck it’s your business/data to gamble with, right?)

Anyway, if you do a local mysqldump also, or for instance use XtraBackup to take a binary backup of your InnoDB tablespace/logs, then rsnapshot can be used to automate the transfer of those files to a different geographical location.

Two extra things you need to do:

  • Regularly test your backups. They can fail, and that can be fatal. For XtraBackup, run the prepare command and essentially start up a MySQL instance on it to make sure it’s all happy. Havint this already done also saves time if you need to restore.
  • For restore time, you need to include the time needed to transfer files back to the target server.

PlanetMySQL Voting: Vote UP / Vote DOWN

Tool of the Day: rsnapshot

Сентябрь 17th, 2009

rsnapshot is a filesystem snapshot utility for making backups of local and remote systems, based on rsync. Rather than just doing a complete copy every time, it uses hardlinks to create incrementals (which are from a local perspective a full backup also). You can specify how long to keep old backups, and all the other usual jazz. You’d generally have it connect over ssh. You’ll want/need to run it on a filesystem that supports hardlinks, so that precludes NTFS.

In the context of MySQL, you can’t just do a filesystem copy of your MySQL data/logs, that would be inconsistent and broken. (amazingly, I still see people insisting/arguing on this – but heck it’s your business/data to gamble with, right?)

Anyway, if you do a local mysqldump also, or for instance use XtraBackup to take a binary backup of your InnoDB tablespace/logs, then rsnapshot can be used to automate the transfer of those files to a different geographical location.

Two extra things you need to do:

  • Regularly test your backups. They can fail, and that can be fatal. For XtraBackup, run the prepare command and essentially start up a MySQL instance on it to make sure it’s all happy. Havint this already done also saves time if you need to restore.
  • For restore time, you need to include the time needed to transfer files back to the target server.

PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Cluster Data Node restart times

Июль 30th, 2009

Restarts are required for certain, infrequent maintenance activities. Note that there is no loss of service while a single node restarts.

When a data node restarts, it first attempts to load the data into memory from the local log files and then it will catch up with any subsequent changes by retrieveing them from the surviving node(s) in its node group.

 Based on this, you would expect the time taken to restart a data node to be influenced by:

  1. The amount of data that was stored on the data node before the restart
  2. Rate of updates being made to the data during the restart
  3. Network performance (assuming the data is being updated during recovery)

The times will also be influenced bycertain configuration parameters, performance of the host machine and whether the multi-threaded data node (ndbmtd) is being used.

To provide some insight into how these factors impact restart times, tests have been performed where the following factors are varied:

  • Database size (Each Gbyte is made up of 1,000,000 tuples in each of 5 tables)
  • Whether traffic is running or not (a single thread using the NDB API to send in up to 2K tps (10K updates/second))
  • Whether the 2 data nodes in the node group are on the same host or separated by a Gbit Ethernet network

The following factors are kept constant:

  • Physical hosts: Intel Core 2 Quad Q8200@2.33 GHz; 7.7 GBytes RAM
  • NoOfFragmentLogFiles: 300
  • MaxNoOfExecutionThreads=4

Here are the observed results:

Data Node restart times

Data Node restart times

There are a couple of things to note from these results:

  • Using the multi-threaded data node (ndbmtd) greatly improves the restart time (in this case, 4 threads were available, improvements could be even greater on an 8 core/thread system)
  • Results become less predictable when heavy update traffic is being processed (in this case, up to 10,000 updated rows/second on a single node group). In the tests, no attempt was made to regulate this traffic and the test application was run on the same host as the one of the data nodes. Changes to the rate of updates will vary how long it takes for the restarting node to catch-up as it’s a moving target.

There is another recovery/restart scenario. The measurements shown above assumed that the file system on the data node’s host was intact and could be used to recover the in-memory copy – if that were not the case (or the data nodes were restarted with the “initial” option) then all of the data would have to be recovered from the surviving data node(s) in the same node group. As a comparison restarting a 6 Gbyte data node with the “initial” option took 20 minutes compared to 8 minutes without it (ndbmtd, over Gbit n/w).

Recovery after DROP [ TABLE | DATABASE ]

Июль 20th, 2009

In your recovery practice we often face the problem when data lost by execution DROP TABLE or DROP DATABASE statement. In this case even our InnoDB Data Recovery tool can’t help, as table / directory with files was deleted (if you have innodb-file-per-table). And the same for MyISAM, all .MYD / .MYI / .frm – files are deleted in this case.

So first step after DROP is to restore files, and for ext3 file system there are two utilities which can help of you are fast (and lucky) enough.
First one is ext3grep http://code.google.com/p/ext3grep/, with some instruction on this page http://www.xs4all.nl/~carlo17/howto/undelete_ext3.html.
And also there is TestDisk, with wide list of functionality, one of them is to restore deleted files.

I don’t know such tools for xfs, so in this case DROP will be fatal.

But better advise is do your backups :)


Entry posted by Vadim | 4 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks