Archive for the ‘explain’ Category

Understanding the unique_subquery optimization

Ноябрь 30th, 2011
If you use the EXPLAIN SELECT statement to see how your subqueries are treated by MySQL, you may sometimes meet the "unique_subquery" optimization. Here is how the manual describes it:
"unique_subquery: this type replaces ref for some IN subqueries of the following form: value IN (SELECT primary_key FROM single_table WHERE some_expr); unique_subquery is just an index lookup function that replaces the subquery completely for better efficiency".
Few weeks ago, while I was reviewing a patch fixing a bug in unique_subquery, I got a "simplification" pulsion. I told myself that:
  •  unique_subquery is an optimization for a special case of simple subqueries (single inner table, using index, no aggregates);
  • we have a more general system around, used for more complex subqueries, naturally capable of handling simple ones too if we wanted;
  • this general system does not have the bug in question...
Then I wondered: what if we removed the unique_subquery optimization, and let the general system handle this simple subquery? This would certainly simplify code, and thus maintainance...But before removing it, of course, we should check whether unique_subquery brings a significant performance benefit.

So today I'm testing unique_subquery against the DBT3 benchmark. I grab a copy of MySQL 5.6.3, and focus on the sixteenth query of DBT3, which contains a subquery (in red) suitable for handling by unique_subquery:


select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#23'
and p_type not like 'LARGE PLATED%'
and p_size in (43, 1, 25, 5, 35, 12, 42, 40)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)

group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;

This query executes in 0.65 seconds on my Linux box, and EXPLAIN is:

+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
| id | select_type        | table    | type            | possible_keys        | key          | key_len | ref                 | rows   | Extra                                        |
+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
|  1 | PRIMARY            | part     | ALL             | PRIMARY              | NULL         | NULL    | NULL                | 199498 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY            | partsupp | ref             | PRIMARY,i_ps_partkey | i_ps_partkey | 4       | dbt3.part.p_partkey |      2 | Using where; Using index                     |
|  2 | DEPENDENT SUBQUERY | supplier | unique_subquery | PRIMARY              | PRIMARY      | 4       | func                |      1 | Using where                                  |
+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+

When I disable unique_subquery (by modifying MySQL's C++ code), EXPLAIN becomes:

+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
| id | select_type        | table    | type   | possible_keys        | key          | key_len | ref                 | rows   | Extra                                        |
+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
|  1 | PRIMARY            | part     | ALL    | PRIMARY              | NULL         | NULL    | NULL                | 199498 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY            | partsupp | ref    | PRIMARY,i_ps_partkey | i_ps_partkey | 4       | dbt3.part.p_partkey |      2 | Using where; Using index                     |
|  2 | DEPENDENT SUBQUERY | supplier | eq_ref | PRIMARY              | PRIMARY      | 4       | func                |      1 | Using where                                  |
+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+

The only change, as expected, is that "unique_subquery" becomes "eq_ref". The used index is the same (the primary key of the "supplier" table). The optimizer has the same notion of unicity: "unique_subquery" and "eq_ref" both denote that a single lookup is needed, as the index is UNIQUE. Same index, same number of lookups: execution could well be as fast with "eq_ref" as it was with "unique_subquery".
But... no. Query now executes in 0.80 seconds. 23% slower than with unique_subquery!

Finer-grained timing shows that the extra 0.15 seconds are indeed lost in the subquery evaluation code.

To understand this, let's follow the execution in detail, based on EXPLAIN output above.
  • First line of EXPLAIN output: we do a table scan on the "part" table ("type=ALL" means "table scan") . The "rows" column of EXPLAIN suggests that we are going to have 199,498 rows of "part".
  • Second line of EXPLAIN output: for each row from the "part" table, we do an index lookup ("ref") into the "i_ps_partkey" index of the "partsupp" table; apparently such lookup will find two rows ("rows=2").
  • At this point, we have a row made of needed columns of "part" and of "partsupp". An upper estimate of the number of those rows is 199,498 multiplied by 2: 400,000. Actually, the real number is around 120,000 (there has been filtering going on, as the "Using where" indicates).
  • Then we evaluate the WHERE clause and thus the "NOT IN (subquery)" predicate (the "DEPENDENT SUBQUERY"). 120,000 evaluations of such predicate. And that's where the difference is.
EXPLAIN EXTENDED and then SHOW WARNINGS show how the predicate
looks like. Let's start with the case where unique_subquery is disabled:


/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(/* select#2 */ select 1 from `dbt3`.`supplier` where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`

Above, the part in red says that

ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
has been transformed from "IN(non correlated subquery)" to "EXISTS(correlated subquery)", yielding this:


not exists (
        select
            1
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
                        AND s_suppkey = ps_suppkey
        )

or, more exactly (leaving out the NOT operator, for brevity):


<exists>(/* select#2 */ select 1 from `dbt3`.`supplier`
           where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
           and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))

Evaluating this EXISTS() evaluates the new subquery. This means all the subquery evaluation machinery: calls to JOIN::exec(), sub_select(), evaluate_join_record()... Sure, deep down it does an index lookup like unique_subquery does, but all those function calls have a cost, and so has all the logic which is lying around ready to handle any complexity in the subquery, as this is generic subquery evaluation code ("if group_by do this", "if order_by do this", "if left_join do this": none of those if()s are entered, but deciding to enter them or not has a cost). Plus some initialization code. Plus some de-initialization code. This overhead, repeated 120,000 times, amounts to 0.15 seconds...

Now, EXPLAIN EXTENDED when unique_subquery is enabled:


/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`) in supplier on PRIMARY where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`

The optimizer has first done the same transformation (IN to EXISTS) as we saw before, then has done one more transformation, and EXISTS has become, as written in red above:


<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`)
          in supplier on PRIMARY
          where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
          and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))

which is directly an index lookup ("<primary_index_lookup>"), followed by an additional WHERE clause. So the overhead of full-blown subquery evaluation is
avoided. And this overhead is not neglectable, compared to the index lookup (assuming the relevant index pages are already in memory).

So the conclusion of my experiment is that unique_subquery is worth having. I'll have to direct simplification pulsions to some other code!

Note that there also exists a similar "index_subquery" optimization applying to non-unique indices. And it's worth having, for the same reasons.

PlanetMySQL Voting: Vote UP / Vote DOWN

Understanding the unique_subquery optimization

Ноябрь 30th, 2011
If you use the EXPLAIN SELECT statement to see how your subqueries are treated by MySQL, you may sometimes meet the "unique_subquery" optimization. Here is how the manual describes it:
"unique_subquery: this type replaces ref for some IN subqueries of the following form: value IN (SELECT primary_key FROM single_table WHERE some_expr); unique_subquery is just an index lookup function that replaces the subquery completely for better efficiency".
Few weeks ago, while I was reviewing a patch fixing a bug in unique_subquery, I got a "simplification" pulsion. I told myself that:
  •  unique_subquery is an optimization for a special case of simple subqueries (single inner table, using index, no aggregates);
  • we have a more general system around, used for more complex subqueries, naturally capable of handling simple ones too if we wanted;
  • this general system does not have the bug in question...
Then I wondered: what if we removed the unique_subquery optimization, and let the general system handle this simple subquery? This would certainly simplify code, and thus maintainance...But before removing it, of course, we should check whether unique_subquery brings a significant performance benefit.

So today I'm testing unique_subquery against the DBT3 benchmark. I grab a copy of MySQL 5.6.3, and focus on the sixteenth query of DBT3, which contains a subquery (in red) suitable for handling by unique_subquery:


select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#23'
and p_type not like 'LARGE PLATED%'
and p_size in (43, 1, 25, 5, 35, 12, 42, 40)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)

group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;

This query executes in 0.65 seconds on my Linux box, and EXPLAIN is:

+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
| id | select_type        | table    | type            | possible_keys        | key          | key_len | ref                 | rows   | Extra                                        |
+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
|  1 | PRIMARY            | part     | ALL             | PRIMARY              | NULL         | NULL    | NULL                | 199498 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY            | partsupp | ref             | PRIMARY,i_ps_partkey | i_ps_partkey | 4       | dbt3.part.p_partkey |      2 | Using where; Using index                     |
|  2 | DEPENDENT SUBQUERY | supplier | unique_subquery | PRIMARY              | PRIMARY      | 4       | func                |      1 | Using where                                  |
+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+

When I disable unique_subquery (by modifying MySQL's C++ code), EXPLAIN becomes:

+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
| id | select_type        | table    | type   | possible_keys        | key          | key_len | ref                 | rows   | Extra                                        |
+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
|  1 | PRIMARY            | part     | ALL    | PRIMARY              | NULL         | NULL    | NULL                | 199498 | Using where; Using temporary; Using filesort |
|  1 | PRIMARY            | partsupp | ref    | PRIMARY,i_ps_partkey | i_ps_partkey | 4       | dbt3.part.p_partkey |      2 | Using where; Using index                     |
|  2 | DEPENDENT SUBQUERY | supplier | eq_ref | PRIMARY              | PRIMARY      | 4       | func                |      1 | Using where                                  |
+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+

The only change, as expected, is that "unique_subquery" becomes "eq_ref". The used index is the same (the primary key of the "supplier" table). The optimizer has the same notion of unicity: "unique_subquery" and "eq_ref" both denote that a single lookup is needed, as the index is UNIQUE. Same index, same number of lookups: execution could well be as fast with "eq_ref" as it was with "unique_subquery".
But... no. Query now executes in 0.80 seconds. 23% slower than with unique_subquery!

Finer-grained timing shows that the extra 0.15 seconds are indeed lost in the subquery evaluation code.

To understand this, let's follow the execution in detail, based on EXPLAIN output above.
  • First line of EXPLAIN output: we do a table scan on the "part" table ("type=ALL" means "table scan") . The "rows" column of EXPLAIN suggests that we are going to have 199,498 rows of "part".
  • Second line of EXPLAIN output: for each row from the "part" table, we do an index lookup ("ref") into the "i_ps_partkey" index of the "partsupp" table; apparently such lookup will find two rows ("rows=2").
  • At this point, we have a row made of needed columns of "part" and of "partsupp". An upper estimate of the number of those rows is 199,498 multiplied by 2: 400,000. Actually, the real number is around 120,000 (there has been filtering going on, as the "Using where" indicates).
  • Then we evaluate the WHERE clause and thus the "NOT IN (subquery)" predicate (the "DEPENDENT SUBQUERY"). 120,000 evaluations of such predicate. And that's where the difference is.
EXPLAIN EXTENDED and then SHOW WARNINGS show how the predicate
looks like. Let's start with the case where unique_subquery is disabled:


/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(/* select#2 */ select 1 from `dbt3`.`supplier` where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`

Above, the part in red says that

ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
has been transformed from "IN(non correlated subquery)" to "EXISTS(correlated subquery)", yielding this:


not exists (
        select
            1
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
                        AND s_suppkey = ps_suppkey
        )

or, more exactly (leaving out the NOT operator, for brevity):


<exists>(/* select#2 */ select 1 from `dbt3`.`supplier`
           where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
           and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))

Evaluating this EXISTS() evaluates the new subquery. This means all the subquery evaluation machinery: calls to JOIN::exec(), sub_select(), evaluate_join_record()... Sure, deep down it does an index lookup like unique_subquery does, but all those function calls have a cost, and so has all the logic which is lying around ready to handle any complexity in the subquery, as this is generic subquery evaluation code ("if group_by do this", "if order_by do this", "if left_join do this": none of those if()s are entered, but deciding to enter them or not has a cost). Plus some initialization code. Plus some de-initialization code. This overhead, repeated 120,000 times, amounts to 0.15 seconds...

Now, EXPLAIN EXTENDED when unique_subquery is enabled:


/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`) in supplier on PRIMARY where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`

The optimizer has first done the same transformation (IN to EXISTS) as we saw before, then has done one more transformation, and EXISTS has become, as written in red above:


<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`)
          in supplier on PRIMARY
          where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
          and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))

which is directly an index lookup ("<primary_index_lookup>"), followed by an additional WHERE clause. So the overhead of full-blown subquery evaluation is
avoided. And this overhead is not neglectable, compared to the index lookup (assuming the relevant index pages are already in memory).

So the conclusion of my experiment is that unique_subquery is worth having. I'll have to direct simplification pulsions to some other code!

Note that there also exists a similar "index_subquery" optimization applying to non-unique indices. And it's worth having, for the same reasons.

PlanetMySQL Voting: Vote UP / Vote DOWN

Added a Table of Contents

Ноябрь 10th, 2011

Not a big deal, but I just added a “Table of Contents” page to my blog to make finding older articles much easier.

I noticed most of my posts are quite lengthy, and it can take a bit of searching/clicking to find an older entry. So unless you happen to recall the ‘month/year’ it was published, which I don’t even remember that, then hopefully this will help.

Really simple, and looks just like this:

My hopes are that this will aid in making some posts easier to find (such as ones about InnoDB Recovery, Recovery with an Individual .ibd, Proxy-related articles, Error-related articles, How-to posts, and so forth).

You can see the full “table of contents” here:

http://www.chriscalender.com/?page_id=399

Happy reading :)

 
 
 


PlanetMySQL Voting: Vote UP / Vote DOWN

When EXPLAIN estimates can go wrong!

Октябрь 16th, 2011
This is the title of my first blog post on MySQL Performance Blog. It deals with a customer case where the customer was facing a peculiar problem where the rows column in the EXPLAIN output of the query was totally off. The actual number of rows was 18 times more than the number of rows reported by MySQL in the output of EXPLAIN. Now this can be a real pain as MySQL uses “the number of rows” estimation to pick and choose indexes and it could really be picking up a wrong index simply because of the wrong estimate. You...
PlanetMySQL Voting: Vote UP / Vote DOWN

Optimizer tracing: how to configure it

Октябрь 4th, 2011
In this blog post, my colleague Jørgen Løland described a new feature of MySQL 5.6: Optimizer Tracing. I recommend reading his article, as it presents this new feature in a simple, easy-to-read manner.

The Optimizer Tracing feature can help understanding what the Optimizer is doing; it is available since milestone 5.6.3, announced October 3rd at Oracle Open World (here is the changelog). It's good to see it mature now; I remember that Sergey Petrunia did the first prototype back in March 2009!

Today  I will be giving some must-have tips related to handling big traces.

First thing to know, a trace lives in main memory (internally it is allocated on the heap or free store of the MySQL Server). An SQL statement which gives the optimizer a lot of work (for example, by joining many tables) will generate a large trace. Up to gigabytes in some pathological cases! To avoid hogging memory then, a trace will never grow beyond the value of the @@optimizer_trace_max_mem_size session variable. Which has a default of 16 kilobytes; yes, it's a low value, to protect the innocent. I often raise it in my session, to 1 megabyte, which is enough for most queries in my daily work:

mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE=1000000;

If a trace was truncated because exceeding this limit, the column INFORMATION_SCHEMA.OPTIMIZER_TRACE.MISSING_BYTES_BEYOND_MAX_MEM_SIZE shows a non-zero value: the number of bytes which could not be added to the trace.

When I read a trace I like to scroll up and down in it, search for some optimization phase (that means searching for some keyword in the trace); to do this, I set, in the "mysql" command-line client, the pager to "less" (I'm using Unix):

mysql> pager less;

This is very useful. I have all "less" commands under hand: can save to a file, can search for a regular expression match, search forward or backward... One last benefit is that when I have finished reading, quitting "less" makes the trace go away from the terminal, it does not linger on my screen forever, does not occupy the terminal's display history...

When there are many tables to join, as I said above the trace can grow a lot, because the Optimizer evaluates many possible join orders (using an algorithm known as "greedy search"): each evaluated join order is mentioned in the trace. Greedy search ends up being the greatest part of the trace. What if I want to see the trace of what happens in the Optimizer after greedy search is complete? If I set @@optimizer_trace_max_mem_size to a low value, it will trim greedy search and what follows. If I set @@optimizer_trace_max_mem_size to a high value, to see what I want, greedy search will be traced which will possibly exceed the amount of memory I can afford on this machine... It would be nice if I could tell the system: "please do not trace this and that". In my case, it would be "please do not trace greedy search".

As an example, let's consider twenty tables, t1...t20, all similar to this one:

mysql> CREATE TABLE t1 (a INT, INDEX(a)) ENGINE=MYISAM;
mysql> INSERT INTO t1 VALUES(x),(y); # x and y being some integers

and let's run this query, after turning tracing on:
mysql> SET OPTIMIZER_TRACE="ENABLED=ON,END_MARKER=ON";
mysql> EXPLAIN SELECT 1 FROM t1  JOIN t2 ON t2.a=t1.a JOIN t3 ON t3.a=t2.a JOIN t4 ON t4.a=t3.a JOIN t5 ON t5.a=t4.a JOIN t6 ON t6.a=t5.a JOIN t7 ON t7.a=t6.a JOIN t8 ON t8.a=t7.a JOIN t9 ON t9.a=t8.a JOIN t10 ON t10.a=t9.a JOIN t11 ON t11.a=t10.a JOIN t12 ON t12.a=t11.a JOIN t13 ON t13.a=t12.a JOIN t14 ON t14.a=t13.a JOIN t15 ON t15.a=t14.a JOIN t16 ON t16.a=t15.a JOIN t17 ON t17.a=t16.a JOIN t18 ON t18.a=t17.a JOIN t19 ON t19.a=t18.a JOIN t20 ON t20.a=t19.a;
+----+-------------+-------+-------+---------------+------+---------+------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------------+------+--------------------------+
| 1 | SIMPLE | t1 | index | a | a | 5 | NULL | 2 | Using where; Using index |
| 1 | SIMPLE | t2 | ref | a | a | 5 | test.t1.a | 1 | Using where; Using index |
| 1 | SIMPLE | t3 | ref | a | a | 5 | test.t2.a | 1 | Using where; Using index |
| 1 | SIMPLE | t4 | ref | a | a | 5 | test.t1.a | 1 | Using index |
| 1 | SIMPLE | t5 | ref | a | a | 5 | test.t1.a | 1 | Using index |
| 1 | SIMPLE | t6 | ref | a | a | 5 | test.t1.a | 1 | Using index |
| 1 | SIMPLE | t7 | ref | a | a | 5 | test.t1.a | 1 | Using index |
| 1 | SIMPLE | t8 | ref | a | a | 5 | test.t1.a | 1 | Using index |
| 1 | SIMPLE | t9 | ref | a | a | 5 | test.t1.a | 1 | Using where; Using index |
| 1 | SIMPLE | t10 | ref | a | a | 5 | test.t9.a | 1 | Using where; Using index |
| 1 | SIMPLE | t11 | ref | a | a | 5 | test.t1.a | 1 | Using where; Using index |
| 1 | SIMPLE | t12 | ref | a | a | 5 | test.t11.a | 1 | Using where; Using index |
| 1 | SIMPLE | t13 | ref | a | a | 5 | test.t12.a | 1 | Using where; Using index |
| 1 | SIMPLE | t14 | ref | a | a | 5 | test.t12.a | 1 | Using where; Using index |
| 1 | SIMPLE | t15 | ref | a | a | 5 | test.t12.a | 1 | Using where; Using index |
| 1 | SIMPLE | t16 | ref | a | a | 5 | test.t12.a | 1 | Using where; Using index |
| 1 | SIMPLE | t17 | ref | a | a | 5 | test.t12.a | 1 | Using where; Using index |
| 1 | SIMPLE | t18 | ref | a | a | 5 | test.t11.a | 1 | Using where; Using index |
| 1 | SIMPLE | t19 | ref | a | a | 5 | test.t12.a | 1 | Using where; Using index |
| 1 | SIMPLE | t20 | ref | a | a | 5 | test.t11.a | 1 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------------+------+--------------------------+
As optimizer developer, it catches my eye that some tables have "Using where" and others don't. "Using where" tells me that a condition is evaluated after fetching a row from the table, but does not tell me what condition. To know more, I will look at the trace. But I'm not interested in the trace of greedy search, which likely accounts, in this 20-table example, for most of the total trace, and would just make my reading less comfortable, or even hog memory, if I had more than those 20 tables or more indices on them, or more complex conditions. So I do:

mysql> SET OPTIMIZER_TRACE_FEATURES="GREEDY_SEARCH=OFF";
and run the EXPLAIN again. To show the resulting trace, I'm posting a screenshot of how it is displaid by the JsonView Firefox add-on. Using this add-on requires taking two precautions:
  • turning off the end_marker flag of @@optimizer_trace (this flag is good for human-readability but is not JSON-compliant)
  • sending the trace to a file without escaping of newlines (which is why I use INTO DUMPFILE instead of INTO OUTFILE):

mysql> SET OPTIMIZER_TRACE="END_MARKER=OFF";
mysql> SELECT TRACE INTO DUMPFILE "/tmp/trace.json" FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;

Here's the trace now in Firefox:

We see how greedy_search=off has eliminated the trace of greedy search ("considered_execution_plans"), replacing it with just an ellipsis ("...")! Then there is what I wanted to see: the part after greedy search, named "attached_conditions_summary", which describes what condition is behind each "Using where" in EXPLAIN. There are equality conditions of course. Some coming directly from the "JOIN ON" conditions. Some deduced by equality propagation (if t1.a=t2.a and t2.a=t3.a then t1.a=t3.a, a new condition which we see is attached to t3). There are also "IS NOT NULL" conditions; indeed, an equality condition of the form t1.a=t2.a allows us to deduce that both t1.a and t2.a are not NULL, so if rows of t1 are retrieved first they can be filtered with this deduced condition, known as early NULL filtering.

There are other flags in @@optimizer_trace_features; we added a flag for each feature which we think can make the trace grow unreasonably large:
  • greedy search
  • repeated execution of subqueries (one execution per row)
  • repeated range optimization (known as "range checked for each record" in EXPLAIN)
  • range optimization in general.
I think I have shown enough for today. A positive note to finish: in the Optimizer team, this tracing has already helped us to debug some problems, so has started to fulfill its goal!

For the curious: the Optimizer Trace is covered in full detail, including the above, and more, in a chapter of the "MySQL Internals" manual.

PlanetMySQL Voting: Vote UP / Vote DOWN

Optimizer tracing: Query Execution Plan descriptions beyond EXPLAIN

Октябрь 4th, 2011
Understanding why MySQL chooses a particular join order or why table scan is chosen instead of range scan is often very hard even for experienced MySQL users. Two almost identical queries, differing only in constant values, may produce completely different plans. That's why we're introducing a great new feature in 5.6: Optimizer Tracing. The target users of this feature are developers and MySQL users experienced enough to understand the ins and outs of EXPLAIN.

What Optimizer Tracing is
You may already have guessed this, but optimizer tracing is a printout  of important decisions the MySQL optimizer has done during the process of making the Query Execution Plan.

The trace is presented in JSON format which is easy to read both for humans and others.

Currently, the optimizer trace includes short explanations for:
  • Why the join order was chosen.
  • Important query transformations like IN to EXISTS.
  • The access methods applicable and why one of them was chosen.
  • Which conditions are attached to each table (i.e., what hides behind "Using where" in EXPLAIN)
More coverage will be added as we go along.


What Optimizer Tracing is NOT
The feature is not, and never will be, complete in the sense that it does not describe all choices the optimizer does. However, we're going to add more information when we find valuable things to add. We will not guarantee backwards compatibility like we do for EXPLAIN. This gives us the freedom to improve tracing at a high pace without going through multi-release deprecation warnings etc. In other words: optimizer tracing is not a replacement for EXPLAIN.

A quick tour
The best way get a feeling of optimizer tracing is to give it a try. Below is a teaser:
Read more »

PlanetMySQL Voting: Vote UP / Vote DOWN

Tips and tricks: Killer response time for non-overlapping intervals

Сентябрь 27th, 2011
Assume you have a table where you store non-overlapping intervals using two columns, e.g. IP ranges. IP ranges are simple to represent using integer notation:

CREATE TABLE ip_owner (
   owner_id int NOT NULL,
   /* some columns */
   ip_start_int bigint NOT NULL,      /* IP address converted to integer */
   ip_end_int bigint NOT NULL,        /* IP address converted to integer */
   PRIMARY KEY (owner_id),
   INDEX ip_range (ip_start_int, ip_end_int)
) ENGINE=InnoDB;


And then you find yourself in a situation where you want to know who, if anyone, owns the IP address X. This can be done using the following query:

SELECT * FROM ip_owner WHERE ip_start_int <= X AND ip_end_int >= X;

MySQL can resolve this using a range scan, but will unfortunately only be able to use the ip_start_int <= X part of the condition as a range as explained here. Thus, the query will either be resolved by range scan if fairly few records have ip_start_int <= X or table scan otherwise. That means unreliable response time because it will be much quicker to query low-valued IPs than high valued IPs. I inserted 1M records into the table before running the queries below:

Read more »

PlanetMySQL Voting: Vote UP / Vote DOWN

Explain….

Сентябрь 2nd, 2011

Explain.... It is a very simple command that I feel is one of the most overlooked commands by new MySQL users. It is also a very valuable command available for MySQL. I realize I am preaching to the choir for a lot of MySQL users. However, for everyone who uses explain, we are bound to have many who do not. 
The MySQL documentation on this is great and available here and Optimizing Queries with EXPLAIN
Developer and a dba issues will continue for years,  but we can at least start on a level playing field. When writing a query, regardless of what it is, it is a good practice is to start it with explain first. This can achieve a couple things for you.
  • It checks your syntax to help you avoid mistakes.
  • Allows you to display information from the optimizer
Using the world example data for avoiding mistakes via explain. (innodb version)

mysql [localhost] {root} (world) > explain extended SELECT City.CountryCode , City.Name , Country.Name , Country.Capital , Country.Population FROM City , Country WHERE Country.Continent = 'North America' AND City.CountryCode = Country.Code;
+----+-------------+---------+------+---------------+-------------+---------+--------------------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------+---------------+-------------+---------+--------------------+------+----------+-------------+
| 1 | SIMPLE | Country | ALL | PRIMARY | NULL | NULL | NULL | 200 | 100.00 | Using where |
| 1 | SIMPLE | City | ref | CountryCode | CountryCode | 3 | world.Country.Code | 18 | 100.00 | |
+----+-------------+---------+------+---------------+-------------+---------+--------------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

This query does work but I would never write it this way. It works for such a small data set but a join would work better.

mysql [localhost] {root} (world) > explain extended SELECT C.CountryCode , C.Name , Y.Name , Y.Capital , Y.Population
-> FROM City C
-> INNER JOIN Country Y ON Y.Code = C.CountryCode ; WHERE Y.Continent = 'North America' ;
+----+-------------+-------+------+---------------+-------------+---------+--------------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+-------------+---------+--------------+------+----------+-------+
| 1 | SIMPLE | Y | ALL | PRIMARY | NULL | NULL | NULL | 200 | 100.00 | |
| 1 | SIMPLE | C | ref | CountryCode | CountryCode | 3 | world.Y.Code | 18 | 100.00 | |
+----+-------------+-------+------+---------------+-------------+---------+--------------+------+----------+-------+
2 rows in set, 1 warning (0.00 sec)
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE Y.Continent = 'North America'' at line 1
mysql [localhost] {root} (world) >

An error! Explain found this simple typo. While updating query to use a join, I adjusted my table references to aliases and a semicolon was placed before the where. Simple typo, but in the real world it could have been worse. What if your typo, enabled a full table scan across a table with billions of rows of data? Real world issues are usually related to “such an easy query” issues, so it is rushed. Then code gets pushed and nothing worked as planned. We can all write, clean, effective, and optimized queries, if we pay attention and understand what we are executing.

So error fixed .
mysql [localhost] {root} (world) > explain extended SELECT C.CountryCode , C.Name , Y.Name , Y.Capital , Y.Population
-> FROM City C
-> INNER JOIN Country Y ON Y.Code = C.CountryCode WHERE Y.Continent = 'North America' ;
+----+-------------+-------+------+---------------+-------------+---------+--------------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+-------------+---------+--------------+------+----------+-------------+
| 1 | SIMPLE | Y | ALL | PRIMARY | NULL | NULL | NULL | 200 | 100.00 | Using where |
| 1 | SIMPLE | C | ref | CountryCode | CountryCode | 3 | world.Y.Code | 18 | 100.00 | |
+----+-------------+-------+------+---------------+-------------+---------+--------------+------+----------+-------------+
2 rows in set, 1 warning (0.01 sec)

Using the employee example data to display information from the optimizer via explain. (innodb version)

The real use and best use of explain is to “display information from the optimizer about the query execution plan.” (refman) When executing joins across tables you need to understand what your pulling and make sure the best indexes are being used.

mysql [localhost] {root} (employees) > explain SELECT e.first_name , e.last_name , e.gender , e.hire_date , t.title , s.salary
FROM employees e
INNER JOIN titles t ON t.emp_no = e.emp_no AND e.hire_date BETWEEN t.from_date and t.to_date
INNER JOIN salaries s ON s.emp_no = e.emp_no AND e.hire_date BETWEEN s.from_date and s.to_date
WHERE e.gender='M'
ORDER BY e.hire_date ASC;
+----+-------------+-------+------+----------------+---------+---------+--------------------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+---------+---------+--------------------+--------+----------------+
| 1 | SIMPLE | e | ALL | PRIMARY | NULL | NULL | NULL | 300030 | Using filesort |
| 1 | SIMPLE | t | ref | PRIMARY,emp_no | PRIMARY | 4 | employees.e.emp_no | 1 | Using where |
| 1 | SIMPLE | s | ref | PRIMARY,emp_no | PRIMARY | 4 | employees.t.emp_no | 4 | Using where |
+----+-------------+-------+------+----------------+---------+---------+--------------------+--------+----------------+
150291 rows in set (3.88 sec)

300030 rows ? NULL KEY ?  ick.

mysql [localhost] {root} (employees) > ALTER TABLE employees ADD KEY gender (gender);
mysql [localhost] {root} (employees) > explain SELECT e.first_name , e.last_name , e.gender , e.hire_date , t.title , s.salary
FROM employees e
INNER JOIN titles t ON t.emp_no = e.emp_no AND e.hire_date BETWEEN t.from_date and t.to_date
INNER JOIN salaries s ON s.emp_no = e.emp_no AND e.hire_date BETWEEN s.from_date and s.to_date
WHERE e.gender='M'
ORDER BY e.hire_date ASC;
+----+-------------+-------+------+----------------+---------+---------+--------------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+---------+---------+--------------------+--------+-----------------------------+
| 1 | SIMPLE | e | ref | PRIMARY,gender | gender | 1 | const | 150015 | Using where; Using filesort |
| 1 | SIMPLE | t | ref | PRIMARY,emp_no | PRIMARY | 4 | employees.e.emp_no | 1 | Using where |
| 1 | SIMPLE | s | ref | PRIMARY,emp_no | PRIMARY | 4 | employees.t.emp_no | 4 | Using where |
+----+-------------+-------+------+----------------+---------+---------+--------------------+--------+-----------------------------+
3 rows in set (0.00 sec)

Now I am not doing a full table scan but using the index on gender and only across 15k rows.
Explain shows this information so we can adjust when required.

These are simple examples I realize. Adding an index also has to be reviewed and just not put on everything.

Take the time to use explain. Make it a habit.

Other blog posts about explain over the years:


PlanetMySQL Voting: Vote UP / Vote DOWN

The MySQL range access method explained

Август 23rd, 2011
The range access method uses an index to read a subset of rows that form one or multiple continuous index value intervals. The intervals are defined by the query's range predicates, which are comparisons using any of =, <=>, IN(), IS NULL, IS NOT NULL, >, <, >=, <=, BETWEEN, !=, <> or LIKE.

Some examples:
SELECT * FROM blog WHERE author_id IN (1, 7, 8, 10)
SELECT * FROM orders WHERE value > 1000

You know that the range access method is used when EXPLAIN shows type=range.

Naturally, there has to be an index on the column used by the range predicate. Since indexes are ordered, MySQL will, for each interval, dive down the index using the interval start value and read it's way through the index leaves until it reaches the interval end value:



Read more »

PlanetMySQL Voting: Vote UP / Vote DOWN

The MySQL range access method explained

Август 23rd, 2011
The range access method uses an index to read a subset of rows that form one or multiple continuous index value intervals. The intervals are defined by the query's range predicates, which are comparisons using any of =, <=>, IN(), IS NULL, IS NOT NULL, >, <, >=, <=, BETWEEN, !=, <> or LIKE.

Some examples:
SELECT * FROM blog WHERE author_id IN (1, 7, 8, 10)
SELECT * FROM orders WHERE value > 1000

You know that the range access method is used when EXPLAIN shows type=range.

Naturally, there has to be an index on the column used by the range predicate. Since indexes are ordered, MySQL will, for each interval, dive down the index using the interval start value and read it's way through the index leaves until it reaches the interval end value:



Read more »

PlanetMySQL Voting: Vote UP / Vote DOWN