Archive for the ‘key’ Category

MySQL Partitioning and its Confusing Syntax

Июль 18th, 2011

While looking at partitioning I recently made a mistake which I guess can happen to others. Often this is due to not fully reading the documentation or scanning it too quickly and misunderstanding what’s being said.

So this post is to complain about the MySQL partitioning syntax and to warn others who may easily make the same mistake without realising.

First we probably need to ask why we are partitioning a table in the first place. The main reasons for this are I think:

  • to improve query performance
  • to reduce individual .ibd file sizes for large tables (if using innodb_file_per_table)

In my case I wanted to do both. I had a several tables which store a large number of rows (batches of data) based on an incremental batch number. One of these tables was around 40 GB and had about 500,000,000 rows in it.  When processing data in this table often all the data from a particular batch run would be aggregated with the resultant data being stored elsewhere.

So I knew I wanted to partition and performance of these aggregations would be improved as the time to table scan a batch would be reduced to the time to scan the partition rather than the whole table.

The primary key of these tables was already in the form ( batch_id, other_key ) so I really wanted to just partition by the batch_id key, using in my case 64 partitions. batch_id is defined as int unsigned NOT NULL.

This is where I made the mistake. There are various ways to partition tables in MySQL and they are named: RANGE, LIST, HASH, KEY.  Given the batch_id was a gradually increasing value and I didn’t want to modify the partitioning once it was created RANGE and LIST seemed inappropriate. Of HASH and KEY, I incorrectly assumed that HASH would do some complex hash function of my integer batch_id, and since batch_id was part of the key that KEY would be the right way to partition.

So I incorrectly defined the table like this:

ALTER TABLE xxxxx PARTITION BY KEY ( batch_id ) PARTITIONS 64;

When you read the documentation you see that for HASH-type partitioning you provide a functional value which must be numeric and it actually determines the partition to use by doing MOD( your_functional_value, number_of_partitions ).

KEY-type partitioning works diferently and only allows you to provide column names and then it uses its own internal hashing function to generate the partition id.

So using PARTITION BY KEY ( numeric_column_name ) seems to correct but is likely to be more expensive to calculate. For large tables this is likely to cause additional performance issues.  It looks like I’m going to have to rebuild the tables I’ve just partitioned and that’ll be another weekend of work.

A suggestion to those at Oracle is that:

  • PARTITION BY KEY ( numeric_column_name ) should be modified to behave like PARTITION BY HASH ( numeric_column_name ). However, as this is likely to cause on disk incompatibilities during a version change if it were implemented, I’m guessing it just won’t happen, unless there’s some easy way to distinguish the current behaviour and my proposed new behaviour.
  • The documentation is made a little clearer and mentions this obvious case. What I see with a lot of the MySQL documentation is that it documents technically how things are done rather than documenting the problem and then how to implement the solution. Since it’s likely that many people are going to partition on one of the columns especially if a multi-column primary key is used this use case should be made clearer.

If I had time I’d look at how partitioning is implemented in Oracle database, Sybase, Postgres or DB2 and see whether it’s just MySQL which has chosen these unfortunate keywords for defining their partitioning methods.

However, I’m curious: am I the only one to fall in this trap?


PlanetMySQL Voting: Vote UP / Vote DOWN

Understanding B+tree Indexes and how they Impact Performance

Июль 14th, 2011
Indexes are a very important part of databases and are used frequently to speed up access to particular data item or items. So before working with indexes, it is important to understand how indexes work behind the scene and what is the data structure that is used to store these indexes, because unless you understand the inner working of an index, you will never be able to fully harness its power.
PlanetMySQL Voting: Vote UP / Vote DOWN

The Drizzle (and MySQL) Key tuple format

Апрель 1st, 2010

Here’s something that’s not really documented anywhere (unless you count ha_innodb.cc as a source of server documentation). You may have some idea about the MySQL/Drizzle row buffer format. This is passed around the storage engine interface: in for write_row and update_row and out for the various scan and index read methods.

If you want to see the docs for it that exist in the code, check out store_key_val_for_row in ha_innodb.cc.

However, there is another format that is passed to your engine (and that your engine is expected to understand) and for lack of a better name, I’m going to call it the key tuple format. The first place you’ll probably see this is when implementing the index_read function for a Cursor (or handler in MySQL speak).

You get two things: a pointer to the buffer and the length of the buffer. Since a key can be made up of multiple parts, some of which can be NULL and some of which can be of variable length, this buffer is not (usually) a simple value. If you are starting out in your engine development, you can use this buffer blindly as a single value for non-nullable indexes with only 1 column.

The basic format is this:

  • The buffer is in-order of the index. First column in the index is first in the buffer, second second etc.
  • The buffer must be zero-filled. The server kernel will use memcmp to compare two key values.
  • If the column is NULLable, then the first byte is set to 1 if the column is null. Else, 0 means not-null.
  • From ha_innodb.cc (for BLOBs, which I haven’t put in embedded_innodb yet): If the column is of a BLOB type (it must be a column prefix field in this case), then we put the length of the data in the field to the next 2 bytes, in the little-endian format. If the field is SQL NULL, then these 2 bytes are set to 0. Note that the length of data in the field is <= column prefix length.
  • For fixed length fields (such as int), the next max field length bytes are for that field.
  • For VARCHAR, there is always a 2 byte (in little endian) length. This is different to the row format, which may have 1 or 2 bytes. In the key tuple format it is ALWAYS two bytes.

I’ll discuss the use of this for rnd_pos() and position() in a later post…

This blog post (but not the whole blog) is published under the Creative Commons Attribution-Share Alike License. Attribution is by linking back to this post and mentioning my name (Stewart Smith).


PlanetMySQL Voting: Vote UP / Vote DOWN