Archive for the ‘insert ignore’ Category

On “Replace Into”, “Insert Ignore”, Triggers, and Row Based Replication

Август 11th, 2010

In posts on June 30 and July 6, I explained how implementing the commands “replace into” and “insert ignore” with TokuDB’s fractal trees data structures can be two orders of magnitude faster than implementing them with B-trees. Towards the end of each post, I hinted at that there are some caveats that complicate the story a little. On July 21st I explained one caveat, secondary keys, and on August 3rd, Rich explained another caveat. In this post, I explain the other two caveats: triggers and replication.

First, let’s look at triggers. The key to “replace into” and “insert ignore” being fast is that the uniqueness check is not done when the command is executed. The uniqueness check is deferred to a more opportune time. Triggers do not allow the uniquness check to be deferred. For triggers to properly work, we must know exactly what rows applied what modifications. For “replace into”, we must know which rows resulted in an insertion and which rows resulted in a deletion and insertion. For “insert ignore”, we must know which rows resulted in an insertion and which rows performed no operation. This information is not available unless disk seeks are performed.

Now, let’s look at row based replication. The problem here lies within mysql (as I describe in bug 53561, although that bug only addresses “replace into”, similar issues exist with “insert ignore”). Row based replication is not designed to handle anything other than normal insertions, which return an error if a duplicate key is found. So, a successful execution of “replace into” or “insert ignore” on the master maps to a normal insertion on the slave. Because normal insertions have different semantics than “replace into” and “insert ignore”, row based replication does not work if you use the altered semantics.

The problems with triggers are semantic. Their requirements incur disk seeks, which slows performance. The problem with row based replication is a design issue which will hopefully be fixed one day.


PlanetMySQL Voting: Vote UP / Vote DOWN

TokuDB speeds up “replace” and “insert ignore” operations by relaxing the affected rows constraint

Август 4th, 2010

In posts on June 30 and July 6, we explained how implementing the commands “replace into” and “insert ignore” with TokuDB’s fractal trees data structures can be two orders of magnitude faster than implementing them with B-trees. Towards the end of each post, we hinted at that there are some caveats that complicate the story a little. In this post, we explain one of the complications: the calculation of affected rows.

MySQL returns the number of rows affected by a “replace” or “insert” statement to the client. For the “replace” statement, the number of affected rows is defined to be the sum of the number of rows deleted and the number of rows inserted. For example, if we replace a row whose key does not exist in the table, then the number of affected rows is one. However, if the key does exist in the table, then it must be deleted and then inserted. In this case, the number of affected rows is two.

Executing a “replace” statement on a B-tree with an ad-hoc key may cause disk seeks to read the sub-tree that contains the row. Once the leaf block is in memory, it is easy to compute the number of affected rows, but the cost of the “replace” operation is high due to the disk seeks.

TokuDB fractal trees can avoid the disk seeks, as noted in the previous blogs. A “replace” message is inserted near the root of the fractal tree. No disk seeks are necessary since the blocks near the root of the tree are usually in memory. As a result, fractal trees can achieve much higher performance than B-trees for this operation. Unfortunately, the calculation of affected rows hinders performance. To compute the number of rows deleted, we need to see if the row already exists in the fractal tree, and this may cause disk seeks to read the sub-tree that contains the row. If we want high performance, we need to relax the definition of affected rows for “replace” operations.

Many applications replace a row in the table, and do not look at the value of the affected rows that is returned to them. For these applications, why whould one be willing to lose 2 orders of magnitude of performance computing a value that is not even used? TokuDB uses a session variable, called “tokudb_pk_insert_mode”, to control how the affected rows is computed. We can compute affected rows as defined by MySQL, and be as slow as a B-tree, or we can turn off the computation and be really fast. The application can decide what semantics it wants.

There are similar issues with how the computation of affected rows can destroy “insert ignore” performance. For the “insert ignore” statement, the number of affected rows is defined to be the number of rows inserted. If the row already exists in the table, the number of affected rows is zero. Otherwise, the number of affected rows is one. A B-tree needs to lookup the key in the tree, and for ad-hoc keys, this may result in disk seeks to read the leaf block that may contain the key. If the key does not exist, then the row is inserted. Since the leaf node must be read into memory, the computation of affected rows is easy for B-trees, but the cost of the “insert ignore” operation is high due to the disk seeks.

TokuDB fractal trees can avoid the disk seeks, as noted in the previous blogs. An “insert ignore” message is injected near the root of the fractal tree. No disk seeks are necessary since the blocks near the root of the tree are usually in memory. As a result, fractal trees can achieve much higher performance than B-trees for this operation. Unfortunately, the definition of affected rows for “insert ignore” hinders peformance. For “insert ignore”, the row is inserted only if the key does not already exist in the table. The computation of affected rows requires us to know whether or not the row already exists in the table. This requires a lookup in the tree, and the tree lookup may cost one or more disk seeks to read the sub-tree that may contain the row. TokuDB uses the session variable described previously to give applications control over how the affected rows is computed.

Applications may use “replace” to replace a row in a table, or “insert ignore” to insert a row in a table if it does not already exist in the table. The definition of how the affected rows is computed for these operations may cause hidden disk seeks. These hidden disk seeks may destroy the performance. TokuDB relaxes the affected rows constraint on these operations and as a result gains almost 2 orders of magnitude performance gain. As expected, TokuDB also has a session variable that controls how affected rows is computed for those applications that need compatibility with the specification.B-trees are slow because of the replace or insert disk seeks that have nothing to do with the details of the affected row computation. They are equally slow with either semantics.


PlanetMySQL Voting: Vote UP / Vote DOWN

On “Replace Into”, “Insert Ignore”, and Secondary Keys

Июль 22nd, 2010

In posts on June 30 and July 6, I explained how implementing the commands “replace into” and “insert ignore” with TokuDB’s fractal trees data structures can be two orders of magnitude faster than implementing them with B-trees. Towards the end of each post, I hinted at that there are some caveats that complicate the story a little. In this post, I explain one of the complications: secondary indexes.

Secondary indexes act the same way in TokuDB as they do in InnoDB. They store the defined secondary key, and the primary key as a pointer to the rest of the row. So, say the table foo has the following schema:

create table (a int, b int, c int, primary key (a), key(b));

And we did:

insert into foo values (1,10,100),(2,20,200);

Logically, there is one dictionary that stores all the data (this is the clustered primary key). Let us call it the main dictionary:

key  value
1    10,100
2    20,200

And there is another dictionary for the secondary key that stores the column ‘b’ and the primary key, ‘a’:

key  value
10   1
20   2

For secondary indexes to work properly, there must be a one to one correspondence between elements in the secondary index and in the primary index. If this correspondence is broken, then the table is corrupt.

Now suppose we were to execute:

replace into foo values (1,1000,1000);

This does:


  • in main dictionary, overwrite the value of key ’1′ and value ’10,100′ with key ’1′ and value ’1000,1000′.
  • in secondary dictionary, remove the key ’10′ with value ’1′.
  • in secondary dictionary, insert the key ’1000′ and key ’1′.

Notice that we cannot perform the second step unless we know the content of the existing row that is being replaced. Learning the content of the existing row requires a lookup in the main dictionary, which incurs a disk seek.

So, when executing “replace into” or “insert ignore” on tables with secondary keys, all engines must still incur a disk seek on the primary dictionary to learn where associated elements are in a secondary index, whereas if no secondary keys exist, then TokuDB’s fractal trees can avoid this disk seek.

Even with secondary indexes, fractal tree indexes are preferred. B-trees still incur additional disk seeks on insertions into secondary indexes that fractal trees do not. However, with no secondary indexes, fractal trees can do away with the mandatory disk seek whereas B-trees do not.


PlanetMySQL Voting: Vote UP / Vote DOWN

Why “insert … on duplicate key update” May Be Slow, by Incurring Disk Seeks

Июль 14th, 2010

In my post on June 18th, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. I previously explained why it would be better to use “replace into” or to use “insert ignore” over normal inserts. In this post, I explain why another alternative to normal inserts, “insert … on duplicate key update” is no better in MySQL, because the command incurs disk seeks.

The reason “insert ignore” and “replace into” can be made fast with TokuDB’s fractal trees is that the semantics of what to do in case a duplicate key is found is simple. In one case, you ignore, and in the other, you overwrite. With specific tombstone messages defined for these simple semantics, we defer the uniqueness check to a more opportune time.

The semantics of “insert … on duplicate key update” are not simple:


  • if the primary (or unique) key does not exist, insert the new row
  • if the primary key does exist, perform some update as defined in the SQL statement

The problem is we do not have a way of encoding the SQL update function into a message, the way we are able to encode “replace into” as an ‘i’ and “insert ignore” as an ‘ii’. If we did, we could similarly make “insert … on duplicate key update” fast.

I am not claiming that this is not theoretically possible, just that the storage engine API in MySQL does not allow for the encoding of updates as messages. Instead, what MySQL does is the following:


  • call handler::write_row to attempt an insertion, if it succeeds, we are done
  • if handler::write_row returns an error indicating a duplicate key, outside of the handler, apply the necessary update to the row
  • call handler::update_row to apply the update

The storage engine API does not have any access to the function that applies an update to the existing row. This is why the storage engine has no way of encoding any SQL update function (even some simple ones, such as “increment column a”).

So, in the meantime, to implement these semantics, B-trees and Fractal Tree data structures both:


  • look up the primary (or unique) key to verify existence
  • take the appropriate action based on whether the primary (or unique) key exists

The first step incurs a disk seek on large data sets with an ad-hoc primary (or unique key). And that is why it is slow.

So, the moral of the story is this. In MySQL, “insert … on duplicate key update” is slower than “replace into”. Although the sematics are slightly different in the case where the primary key is found (the former is defined as an update, whereas the latter is defined as a delete followed by an insert), if possible, the simpler semantics of “replace into” allow it to be faster than “insert … on duplicate key update”.


PlanetMySQL Voting: Vote UP / Vote DOWN

Making “Insert Ignore” Fast, by Avoiding Disk Seeks

Июль 7th, 2010

In my post from three weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In my post last week, I explained how the command “replace into” can be fast with TokuDB’s fractal trees. Today, I explain how “insert ignore” can be fast, using a strategy that is very similar to what we do with “replace into”.

The semantics of “insert ignore” are similar to that of “replace into”:


  • if the primary (or unique) key does not exist: insert the new row
  • if the primary (or unique) key does exist: do nothing

B-trees have the same problem with “insert ignore” that they have with “replace into”. They perform a lookup of the primary key, incurring a disk seek. We have already shown how fractal trees do not incur this disk seek for “replace into”, so let’s see how we can avoid disk seeks with “insert ignore”.

The only difference with “replace into” is when the primary (or unique) key exists, instead of overwriting the old row with the new row, we disregard the new row. So, all we need to do is tweak our tombstone messaging scheme (that we use for deletes and “replace into”) so that when “insert ignore” commands do not overwrite old rows with new rows. Similar to deletes and replace into, with this scheme, “insert ignore” can be two orders of magnitude faster than insertions into a B-tree.

Here is what we do. We insert a message into the fractal tree, with a new message “ii”, to signify that we are doing an “insert ignore”. The only difference between this message and the normal “i” message for insertions is what we do on queries and merges. On queries, if the message is an “ii”, then the value in the LOWER node is read, and not the higher node. On merges, if the higher node has a message of “ii”, the value in the LOWER node takes precedence over the value in the higher node.

Let’s look at an example that is similar to what we looked at for “replace into”:

create table foo (a int, b int, primary key (a));

Suppose the fractal tree for this table looks as follows:

- 

- -

- - - -

....

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (1000,1000)) ... (i (2^32, 2^32))

The ‘i’ stands for insertion message. Now suppose we do:

insert ignore into foo values (1000, 1001).

With fractal trees, we insert (ii (1000,1001)) into the top node. The tree then looks as such:

(ii (1000,1001)) 

- -

- - - -

....

(i (1,1)) (i (2,2)) (i (3,3)) (i (4,4)) ... (i (2^32, 2^32))

So upon querying the key ’1000′, a cursor notices that (1000,1001) has a message of “ii”. If it finds another value for the key 1000 in a lower node, it reads that value, otherwise, it reads (1000,1001). Because (1000,1000) is located in a lower node, the cursor returns (1000,1000) to the user. On merges, the message in the lower node, (1000,1000) overwrites the message in the higher node, (1000,1001).

While “insert ignore” can be fast, there are caveats (indexes, triggers, replication), just as there are with “replace into”. In a future posting, I will get into some of them.


PlanetMySQL Voting: Vote UP / Vote DOWN