Working on a new set of replication APIs in MariaDB, I have
given some thought
to the generation of replication events on the master server.
But there is another side of the equation: to apply the generated events on a
slave server. This is something that most replication setups will need (unless
they replicate to non-MySQL/MariaDB slaves). So it will be good to provide a
generic interface for this, otherwise every binlog-like plugin implementation
will have to re-invent this themselves.
A central idea in the current design for generating events is that we do not
enforce a specific content of events. Instead, the API provides accessors for
a lot of different information related to each event, allowing the plugin
flexibility in choosing what to include in a particular event format. For
example, one plugin may request column names for a row-based UPDATE event;
another plugin may not need them and can avoid any overhead related to column
names simply by not requesting them.
To get the same flexibility on the slave side, the roles of plugin and API are
reversed. Here, the plugin will have a certain pre-determined (by the
particular event format implemented) set of information related to the event
available. And the API must make do with whatever information it is provided
(or fail gracefully if essential information is missing).
My idea is that the event application API will provide corresponding events to
the events in the generation API. Each application event will have "provider"
methods corresponding to the accessor methods of the generator API. So the
plugin that wants to apply an event can obtain an event generator object, call
the appropriate provider methods for all the information available, and
finally ask the API to execute the event with the provided context
information.
This is only an abstract idea at this point; there are lots of details to take
care of to make this idea into a concrete design proposal. And I have not
fully decided if such an API will be part of the replication project or
not. But I like the idea so far.
Understanding how MySQL binlog events are applied on the slave
I wanted to get a better understanding of what is involved in an event
application API like the one described above. So I did a similar exercise to
the one I wrote about in my
last post,
where I went through in detail all the information that the existing MySQL
binlog format includes. This time I went through the details in the code that
applies MySQL binlog events on a slave.
Again, I concentrate on the actual events that change the database, ignoring
(most) details that relate only to the particular binlog format used by MySQL
(and there are quite a few :-).
At the top level, the slave SQL thread
(in exec_relay_log_event()) reads events
(next_event()) from the relay logs and executes
(apply_event_and_update_pos()) them.
There are a number of intricate details here relating to switching to a new
relay log (and purging old ones), and re-winding in the relay log to re-try a
failed transaction (eg. in case of deadlock or the like). This is mostly
specific to the particular binlog implementation.
The actual data changes are mostly done in the do_apply_event()
methods of each event class in sql/log_event.cc. I will go
briefly through this method for the events that are used to change actual data
in current MySQL replication. It is relatively easy to read a particular
detail out of the code, since it is all located in
these do_apply_event() methods.
(One problem however is that there are special cases sprinkled across the code
where special action is taken (or not taken) when running in the slave SQL
thread. I have not so far tried to determine the full list of such special
cases, or access how many there are).
Query_log_event
The main task done here is to set up various context in the THD,
execute the query, and then perform necessary cleanup. If the query throws an
error, there is also some fairly complex logic to handle this error correctly;
for example to ignore certain errors, to require same error as on master (if
the query failed on the master in some particular way), and to re-try the
query/transaction for certain errors (like deadlock).
There are also some hard-coded special cases for NDB (this seems to be a
common theme in the replication code).
The main think that to my eyes make this part of the code complex is the set
of actions taken to prepare context before executing the query, and to clean
up after execution. Each individual step in the code is in fact relatively
easy to follow (and often the commenting is quite good). The problem is that
there are so many individual steps. It is very hard to feel sure that exactly
this set of actions is sufficient (and that none are redundant for that
matter).
For example, code like this (not complete, just a random part of the setup):
thd->set_time((time_t)when);
thd->set_query((char*)query_arg, q_len_arg);
VOID(pthread_mutex_lock(&LOCK_thread_count));
thd->query_id = next_query_id();
VOID(pthread_mutex_unlock(&LOCK_thread_count));
thd->variables.pseudo_thread_id= thread_id;
It seems very easy to forget to assign query_id or whatever if
one was to write this from scratch. That is something I would really like to
improve in an API for replication plugins: it should be possible to understand
completely exactly what setup and cleanup is needed around event execution,
and there should be appropriate methods to achieve such setup/cleanup.
Another thing that is interesting in the code
for Query_log_event::do_apply_event() is that the code
does strcmp() on the query against keywords like COMMIT,
SAVEPOINT, and ROLLBACK, and follows a fairly different execution path for
these. This seems to bypass the SQL parser! But in reality, these particular
queries are generated on the master with special code in the server that
carefully restricts the possible format (eg. no whitespace or comments
etc). So in effect, this is just a way to hack in special event types for
these special queries without actually adding new binlog format events.
Rand_log_event, Intvar_log_event, and User_var_log_event
The action taken for these events is essentially to update
the THD with the information in the event: value of random
seed, LAST_INSERT_ID/INSERT_ID,
or @user_variable.
Xid_log_event
On the slave side, this event is essentially a COMMIT. One thing that springs
to mind is how different the code to handle this event is from the special
code in Query_log_event that handles a query "COMMIT". Again, it
is very hard to tell from the code if this is a bug, or if the
different-looking code is in fact equivalent.
Begin_load_query_log_event, Append_block_log_event, Delete_file_log_event, and Execute_load_query_log_event
This implements the LOAD DATA INFILE query, which needs special
handling as it originally references a file on the master server (or on the
client machine). The actual execution of the query is handled the same way as
for normal queries (Execute_load_query_log_event is a sub-class
of Query_log_event), but some preparation is needed first to
write the data from the events in the relay log into a temporary file on the
slave.
The main thing one notices about this code is how it handles re-writing the
LOAD DATA query to use a new temporary file name on the
slave. Take for example this query:
LOAD DATA CONCURRENT /**/ LOCAL INFILE 'foobar.dat' REPLACE INTO /**/ TABLE ...
The event includes offsets into the query string of the two places
marked /**/ in the example. This part of the query is then
re-written in the slave code (so it is not just replacing the
filename). Again, the code by-passes the SQL parser, it just so happens that
the SQL syntax in this case is sufficiently simple that this is not too
hackish to do. If one were to check, one would probably see that any user comments
in this particular part of the query string disappear in the slave
binlog if --log-slave-updates is enabled.
Table_map_log_event
This just links a struct RPL_TABLE_LIST into a list, containing
information about the tables described by the event. The actual opening of the
table is done when executing the first row event (WRITE/UPDATE/DELETE).
Write_rows_log_event, Update_rows_log_event, and Delete_rows_log_event
These are what handle the application of row-based replication events on a slave.
The first of the row events causes some setup to be done, a partial extract is this:
lex_start(thd);
mysql_reset_thd_for_next_command(thd, 0);
thd->transaction.stmt.modified_non_trans_table= FALSE;
query_cache.invalidate_locked_for_write(rli->tables_to_lock);
(As remarked for Query_log_event, while row event application
setup is somewhat simpler, it still appears a bit magic that exactly these
setups are sufficient and necessary).
The code also switches to row-based binlogging for the following row
operations (this is for --log-slave-updates, as it is
not possible to binlog the application of row-based events as statement-based
events in the slave binlog). This is by the way an interesting challenge for a
generic replication API: how does one handle binlogging of events applied on a
slave, for daisy-chaining replication servers? This of course gets more
interesting if one were to use a different binlogging plugin on the slave than
on the master. I need to think more about it, but this seems to be a pretty
strong argument that a generic event application API is needed, which hooks
into event generators to properly generate all needed events for the updates
done on slaves. Another important aspect is the support of a global
transaction ID, that will identify a transaction uniquely across an entire
replication setup to make migrating slaves to a new master easier. Such a
global transaction ID also needs to be preserved in a slave binlog when
replicating events from a master.
During execution of the first row event, the code also sets the flags for
foreign key checks and unique checks that is included in the event from the master.
And it checks if the event should be skipped
(for --replicate-ignore-db and friends).
A check is made to ensure that the table(s) on the slave are compatible with
the tables on the master (as described in the table map event(s) received just
before the row event(s)). The MySQL row-based replication has a fair bit of
flexibility in terms of allowing differences in tables between master and
slave, such as allowing different column types, different storage engine, or
even extra columns on the slave table. In particular allowing extra columns
raises some issues about default values etc. for these columns, though I did
not really go into details about this.
Again, there are a number of hard-coded differences for NDB.
For Write_rows_log_event, flags need to be set to ensure that
column values for AUTO_INCREMENT and TIMESTAMP
columns are taken from the supplied values, not
auto-generated. If slave_exec_mode=IDEMPOTENT, an INSERT that
fails due to already existing row does not cause replication to fail; instead
an UPDATE is tried, or in some cases (like if there are foreign keys) a DELETE
+ re-tried INSERT. There is also code to hint storage engines about the
approximate number of rows that are part of a bulk insert.
For Update_rows_log_event and Delete_rows_log_event,
the code needs to locate the row to update/delete. This is done by primary key
if one exists on the slave table (the current binlogging always includes every
column in the before image of the row, so the primary key value of the row to
modify is always available). But there is also support for tables with no
primary key, in which the first index on the table is used to locate the row
(if any), or failing that a full table scan. This btw. is a good reminder
to not use row-based replication with primary-key-less tables with
non-trivial amount of rows: every row operation applied will need a
full table scan!
Finally, the values from the event are unpacked into a row buffer in the
format used by MySQL storage engines, and the read_set
and write_set are set up (current replication always includes all
columns in row operations), before the
actual ha_write_row(), ha_update_row(),
or ha_delete_row() call into the storage engine handler is made
to perform the actual update. Note that a single row event can include
multiple rows, which are applied one after the other.
Final words
And that is it! Quite a bit of detail, but again I found it very useful to
create this complete overview; it will make things easier when re-implementing
this in a new replication API.
PlanetMySQL Voting:
Vote UP /
Vote DOWN