Archive for the ‘JOIN Performance’ Category

Joining on range? Wrong!

Май 18th, 2010

The problem I am going to describe is likely to be around since the very beginning of MySQL, however unless you carefully analyse and profile your queries, it might easily go unnoticed. I used it as one of the examples in our talk given at phpDay.it conference last week to demonstrate some pitfalls one may hit when designing schemas and queries, but then I thought it could be a good idea to publish this on the blog as well.

To demonstrate the issue let’s use a typical example – a sales query. Our data is a tiny store directory consisting of three very simple tables:

SQL:
  1. CREATE TABLE `products` (
  2.   `prd_id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  3.   `prd_name` varchar(32) NOT NULL,
  4.   PRIMARY KEY (`prd_id`),
  5.   KEY `name` (`prd_name`)
  6. )
  7.  
  8. CREATE TABLE `tags` (
  9.   `tag_prd_id` int(10) UNSIGNED NOT NULL,
  10.   `tag_name` varchar(32) NOT NULL,
  11.   PRIMARY KEY (`tag_name`, `tag_prd_id`)
  12. )
  13.  
  14. CREATE TABLE `items_ordered` (
  15.   `itm_id` int(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  16.   `itm_prd_id` int(10) UNSIGNED NOT NULL,
  17.   `itm_order_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  18.   PRIMARY KEY (`itm_id`),
  19.   KEY `itm_prd_id__and__itm_order_timestamp` (`itm_prd_id`,`itm_order_timestamp`)
  20. )

"Please excuse the crudity of this model, I didn't have time to build it to scale or to paint it." -- Dr. Emmett Brown

I populated these tables with enough data to serve our purpose.

Our hypothetical sales query could be to figure out how many LCD TVs were sold yesterday.

SQL:
  1. SELECT        COUNT(1)
  2.        FROM   tags t
  3.               JOIN products p
  4.               ON     p.prd_id = t.tag_prd_id
  5.               JOIN items_ordered i
  6.               ON     i.itm_prd_id    = p.prd_id
  7.        WHERE  t.tag_name             = 'lcd'
  8.        AND    i.itm_order_timestamp>= '2010-05-16 00:00:00'
  9.        AND    i.itm_order_timestamp  <'2010-05-17 00:00:00'
  10. +----------+
  11. | COUNT(1) |
  12. +----------+
  13. |     4103 |
  14. +----------+

Seems like a very successful day! :)

When we look at the data structures it looks quite good – there is index on `tag_name` in `tags`, there is index on (`itm_prd_id`, `itm_order_timestamp`) in `items_ordered` and indexes on other columns used in joins. Let’s verify how the query performed in greater detail:

CODE:
  1. SHOW STATUS LIKE 'Handler_read%';                   
  2.  
  3. +-----------------------+--------+
  4. | Variable_name         | Value  |
  5. +-----------------------+--------+
  6. | Handler_read_first    | 0      |
  7. | Handler_read_key      | 3      |
  8. | Handler_read_next     | 118181 |
  9. | Handler_read_prev     | 0      |
  10. | Handler_read_rnd      | 0      |
  11. | Handler_read_rnd_next | 0      |
  12. +-----------------------+--------+

Somehow this does not look as good as the sales numbers. Query matched 4103 rows, but almost 120000 were scanned. And we have proper indexes on all necessary columns! What does EXPLAIN have to say about this?

CODE:
  1. *************************** 1. row ***************************
  2.            id: 1
  3.   select_type: SIMPLE
  4.         table: t
  5.          type: ref
  6. possible_keys: PRIMARY
  7.           key: PRIMARY
  8.       key_len: 98
  9.           ref: const
  10.          rows: 1
  11.         Extra: Using where; Using index
  12. *************************** 2. row ***************************
  13.            id: 1
  14.   select_type: SIMPLE
  15.         table: p
  16.          type: eq_ref
  17. possible_keys: PRIMARY
  18.           key: PRIMARY
  19.       key_len: 4
  20.           ref: example_db.t.tag_prd_id
  21.          rows: 1
  22.         Extra: Using index
  23. *************************** 3. row ***************************
  24.            id: 1
  25.   select_type: SIMPLE
  26.         table: i
  27.          type: ref
  28. possible_keys: itm_prd_id__and__itm_order_timestamp
  29.           key: itm_prd_id__and__itm_order_timestamp
  30.       key_len: 4
  31.           ref: example_db.p.prd_id
  32.          rows: 10325
  33.         Extra: Using where; Using index

To remind - our structure design is:

SQL:
  1. `itm_prd_id` int(10) UNSIGNED NOT NULL
  2.   `itm_order_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
  3.   KEY `itm_prd_id__and__itm_order_timestamp` (`itm_prd_id`,`itm_order_timestamp`)

In 3rd row key_len is only 4 bytes, while the full key length is 4 bytes for `itm_prd_id` plus 4 bytes for `itm_order_timestamp`, so 8 bytes in total! Also ref shows only one column being used by the last join.

How should we understand this then? Database reads all ordered items where tag is 'lcd', which totals to about 120000 rows as shown by the counters in SHOW STATUS output above, and then filters out those not matching the date range. A very inefficient approach! MySQL was unable to optimize those simple conditions to match both product id and date range by index and read only the relevant rows.

This affects joins only. When you use a range condition on the first (or the only) table, it works as expected:

SQL:
  1. EXPLAIN
  2. SELECT        COUNT(1)
  3.        FROM   items_ordered i
  4.        WHERE  i.itm_prd_id           = 5
  5.        AND    i.itm_order_timestamp>= '2010-05-16 00:00:00'
  6.        AND    i.itm_order_timestamp  <'2010-05-17 00:00:00'
  7.  
  8. *************************** 1. row ***************************
  9.            id: 1
  10.   select_type: SIMPLE
  11.         TABLE: i
  12.          type: range
  13. possible_keys: itm_prd_id__and__itm_order_timestamp
  14.           KEY: itm_prd_id__and__itm_order_timestamp
  15.       key_len: 8
  16.           ref: NULL
  17.          rows: 1306
  18.         Extra: USING WHERE; USING INDEX

In this case MySQL does not print ref at all, because there is no join, however you can notice key_len is 8 bytes, so the full index length. It means both index columns will be used to execute the query.

There may be many workarounds to this problem, all depends on the specific case you may need to solve. Essentially it always comes down to removing range condition from join one way or another. For our example query this could mean introducing additional DATE column and using it for filtering instead:

SQL:
  1. ALTER TABLE items_ordered ADD itm_order_date DATE NOT NULL, ADD INDEX itm_prd_id__and__itm_order_date (itm_prd_id, itm_order_date);
  2. UPDATE items_ordered SET itm_order_date = DATE(itm_order_timestamp);

Now the rewritten query:

SQL:
  1. EXPLAIN
  2. SELECT        COUNT(1)
  3.        FROM   tags t
  4.               JOIN products p
  5.               ON     p.prd_id = t.tag_prd_id
  6.               JOIN items_ordered i
  7.               ON     i.itm_prd_id = p.prd_id
  8.        WHERE  t.tag_name          = 'lcd'
  9.        AND    i.itm_order_date    = '2010-05-16'
  10.  
  11. *************************** 1. row ***************************
  12.            id: 1
  13.   select_type: SIMPLE
  14.         TABLE: t
  15.          type: ref
  16. possible_keys: PRIMARY
  17.           KEY: PRIMARY
  18.       key_len: 98
  19.           ref: const
  20.          rows: 1
  21.         Extra: USING WHERE; USING INDEX
  22. *************************** 2. row ***************************
  23.            id: 1
  24.   select_type: SIMPLE
  25.         TABLE: p
  26.          type: eq_ref
  27. possible_keys: PRIMARY
  28.           KEY: PRIMARY
  29.       key_len: 4
  30.           ref: example_db.t.tag_prd_id
  31.          rows: 1
  32.         Extra: USING INDEX
  33. *************************** 3. row ***************************
  34.            id: 1
  35.   select_type: SIMPLE
  36.         TABLE: i
  37.          type: ref
  38. possible_keys: itm_prd_id__and__itm_order_timestamp,itm_prd_id__and__itm_order_date
  39.           KEY: itm_prd_id__and__itm_order_date
  40.       key_len: 7
  41.           ref: example_db.p.prd_id,const
  42.          rows: 206494
  43.         Extra: USING WHERE; USING INDEX

This query uses 7 bytes of `itm_prd_id__and__itm_order_date` index – 4 bytes is `itm_prd_id` and 3 bytes is `itm_order_date` (DATE type uses 3 bytes). Also ref shows two columns used in join.

CODE:
  1. SHOW STATUS LIKE 'Handler_read%';                   
  2. +-----------------------+-------+
  3. | Variable_name         | Value |
  4. +-----------------------+-------+
  5. | Handler_read_first    | 0     |
  6. | Handler_read_key      | 3     |
  7. | Handler_read_next     | 4104  |
  8. | Handler_read_prev     | 0     |
  9. | Handler_read_rnd      | 0     |
  10. | Handler_read_rnd_next | 0     |
  11. +-----------------------+-------+

Statistics also look much better.

But remember - different query will likely need a different solution.

You can find several bug reports regarding this problem (e.g. #8569, #19548). Some replies from MySQL indicate this may be eventually fixed in 6.0 or some future version. Others say "it’s a documented behaviour – deal with it". But in the real world this is a serious bug, not a feature, and it needs fixing.


Entry posted by Maciej Dobrzanski | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN