Archive for the ‘Execution plan’ Category

When EXPLAIN estimates can go wrong!

Октябрь 16th, 2011
This is the title of my first blog post on MySQL Performance Blog. It deals with a customer case where the customer was facing a peculiar problem where the rows column in the EXPLAIN output of the query was totally off. The actual number of rows was 18 times more than the number of rows reported by MySQL in the output of EXPLAIN. Now this can be a real pain as MySQL uses “the number of rows” estimation to pick and choose indexes and it could really be picking up a wrong index simply because of the wrong estimate. You...
PlanetMySQL Voting: Vote UP / Vote DOWN

Speaking on Percona Live, London: "Programmatic Queries: things you can code with SQL"

Октябрь 10th, 2011

I'll be speaking at the Percona Live event, held in London, October 24, 25, 2011.

My session is called Programmatic Queries: things you can code with SQL. It's a short 30 minute talk, in which I present underlying knowledge of the programmatic nature of SQL queries within MySQL, and how to take advantage of such knowledge so as to build faster, shorter, and sometimes unexpected queries.

This is not about stored routine programming, a classic programmatic aspect of MySQL, but rather about expected order of execution: of row evaluation, of control flow statements, of table inference, of time issues.

I have far too many examples, some real-world problem solvers, and some less common in daily use, to be able to deliver them all on this session. I will pick up those which seem most interesting to me, or those best presenting the programmatic nature of the query. As time allows I may add more examples, or look into interesting future possibilities.

I hope to see you there.


PlanetMySQL Voting: Vote UP / Vote DOWN

Views: better performance with condition pushdown

Май 20th, 2010

Justin’s A workaround for the performance problems of TEMPTABLE views post on mysqlperformanceblog.com reminded me of a solution I once saw on a customer’s site.

The customer was using nested views structure, up to depth of some 8-9 views. There were a lot of aggregations along the way, and even the simplest query resulted with a LOT of subqueries, temporary tables, and vast amounts of data, even if only to return with a couple of rows.

While we worked to solve this, a developer showed me his own trick. His trick is now impossible to implement, but there’s a hack around this.

Let’s use the world database to illustrate. Look at the following view definition:

CREATE
  ALGORITHM=TEMPTABLE
VIEW country_languages AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON (Country.CODE = CountryLanguage.CountryCode)
  GROUP BY
    Country.CODE;

The view presents with a list of spoken languages per country. The execution plan for querying this view looks like this:

mysql> EXPLAIN SELECT * FROM country_languages;
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref                               | rows | Extra                                        |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>      | ALL    | NULL          | NULL    | NULL    | NULL                              |  233 |                                              |
|  2 | DERIVED     | CountryLanguage | index  | PRIMARY       | PRIMARY | 33      | NULL                              |  984 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | Country         | eq_ref | PRIMARY       | PRIMARY | 3       | world.CountryLanguage.CountryCode |    1 |                                              |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+

And, even if we only want to filter out a single country, we still get the same plan:

mysql> EXPLAIN SELECT * FROM country_languages WHERE Code='USA';
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref                               | rows | Extra                                        |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+
|  1 | PRIMARY     | <derived2>      | ALL    | NULL          | NULL    | NULL    | NULL                              |  233 | Using where                                  |
|  2 | DERIVED     | CountryLanguage | index  | PRIMARY       | PRIMARY | 33      | NULL                              |  984 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | Country         | eq_ref | PRIMARY       | PRIMARY | 3       | world.CountryLanguage.CountryCode |    1 |                                              |
+----+-------------+-----------------+--------+---------------+---------+---------+-----------------------------------+------+----------------------------------------------+

So, we need to scan the entire country_language and country tables in order to return results for just one row.

A non-working solution

The solution offered by the developer was this:

CREATE
  ALGORITHM=MERGE
  VIEW country_languages_non_working AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON
      (Country.CODE = CountryLanguage.CountryCode)
  WHERE
    Country.CODE = @country_code
  GROUP BY Country.CODE;

And follow by:

mysql> SET @country_code='USA';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM country_languages_2;
+------+---------------+----------------------------------------------------------------------------------------------------+
| CODE | country       | languages                                                                                          |
+------+---------------+----------------------------------------------------------------------------------------------------+
| USA  | United States | Chinese,English,French,German,Italian,Japanese,Korean,Polish,Portuguese,Spanish,Tagalog,Vietnamese |
+------+---------------+----------------------------------------------------------------------------------------------------+

So, pushdown a WHERE condition into the view’s definition. The session variable @country_code is used to filter rows. In the above simplified code the value is assumed to be set; tweak it as you see fit (using IFNULL, for example, or OR statements) to allow for full scan in case the variable is undefined.

This doesn’t work. It used to work a couple years back; but today you cannot create a view which uses session variables or parameters. It is a restriction imposed by views.

A workaround

Justin showed a workaround using an additional table. There is another workaround which does not involve tables, but rather stored routines. Now, this is a patch, and an ugly one. It may not work in future versions of MySQL for all I know. But, here it goes:

DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `get_session_country`() RETURNS CHAR(3)
    NO SQL
    DETERMINISTIC
BEGIN
  RETURN @country_code;
END $$
DELIMITER ;

CREATE
  ALGORITHM=MERGE
  VIEW country_languages_2 AS
  SELECT
    Country.CODE, Country.Name AS country,
    GROUP_CONCAT(CountryLanguage.Language) AS languages
  FROM
    world.Country
    JOIN world.CountryLanguage ON
      (Country.CODE = CountryLanguage.CountryCode)
  WHERE
    Country.CODE = get_session_country()
  GROUP BY Country.CODE;

And now:

mysql> SET @country_code='USA';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM country_languages_2;
+------+---------------+----------------------------------------------------------------------------------------------------+
| CODE | country       | languages                                                                                          |
+------+---------------+----------------------------------------------------------------------------------------------------+
| USA  | United States | Chinese,English,French,German,Italian,Japanese,Korean,Polish,Portuguese,Spanish,Tagalog,Vietnamese |
+------+---------------+----------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT * FROM country_languages_2;
+----+-------------+-----------------+--------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+-------------+-----------------+--------+---------------+---------+---------+------+------+--------------------------+
|  1 | PRIMARY     | <derived2>      | system | NULL          | NULL    | NULL    | NULL |    1 |                          |
|  2 | DERIVED     | Country         | const  | PRIMARY       | PRIMARY | 3       |      |    1 |                          |
|  2 | DERIVED     | CountryLanguage | ref    | PRIMARY       | PRIMARY | 3       |      |    8 | Using where; Using index |
+----+-------------+-----------------+--------+---------------+---------+---------+------+------+--------------------------+

Since views are allowed to call stored routines (Justing used this to call upon CONNECTION_ID()), and since stored routines can use session variables, we can take advantage and force the view into filtering out irrelevant rows before these accumulate to temporary tables and big joins.

Back in the customer’s office, we witnessed, what with their real data and multiple views, a reduction of query times from ~30 minutes to a few seconds.

Another kind of use

Eventually we worked to make better view definitions and query splitting, resulting in clearer code and fast queries, but this solution plays nicely into another kind of problem:

Can we force different customers to see different parts of a given table? e.g., only those rows that relate to the customers?

There can be many solutions: different tables; multiple views (one per customer), stored procedures, what have you. The above provides a solution, and I’ve seen it in use.


PlanetMySQL Voting: Vote UP / Vote DOWN

EXPLAIN: missing db info

Май 11th, 2010

I’m further developing a general log hook, which can stream queries from the general log.

A particular direction I’m taking is to filter queries by their type of actions. For example, the tool (oak-hook-general-log) can be instructed to only stream out those queries which involve creation of a temporary table; or those which cause for a filesort, or full index scan, etc.

This is done by evaluating of query execution plans on the fly. I suspect the MySQL query analyzer roughly does the same (as a small part of what it does).

It’s almost nothing one cannot do with sed/awk. However, I bumped into a couple of problems:

  1. The general log (and the mysql.general_log table, in  particular) does not indicate the particular database one is using for the query. Since slow log does indicate this data, I filed a bug on this. I currently solve this by crossing connection id with the process list, where the current database is listed. It’s shaky, but mostly works.
  2. Just realized: there’s no DB info in the EXPLAIN output! It’s weird, since I’ve been EXPLAINing queries for years now. But I’ve always had the advantage of knowing the schema used: either because I was manually executing the query on a known schema, or mk-query-digest was kind enough to let me know.

For example, look at the following imaginary query, involving both the world and sakila databases:

mysql> use test;
Database changed
mysql> EXPLAIN SELECT * FROM world.Country JOIN sakila.city WHERE Country.Capital = city.city_id;
+----+-------------+---------+--------+---------------+---------+---------+-----------------------+------+-------------+
| id | select_type | table   | type   | possible_keys | key     | key_len | ref                   | rows | Extra       |
+----+-------------+---------+--------+---------------+---------+---------+-----------------------+------+-------------+
|  1 | SIMPLE      | Country | ALL    | NULL          | NULL    | NULL    | NULL                  |  239 |             |
|  1 | SIMPLE      | city    | eq_ref | PRIMARY       | PRIMARY | 2       | world.Country.Capital |    1 | Using where |
+----+-------------+---------+--------+---------------+---------+---------+-----------------------+------+-------------+
2 rows in set (0.00 sec)

The query is imaginary, since the tables don’t actually have anything in common. But look at the EXPLAIN result: can you tell where city came from? Country can somehow be parsed from the ref column, but no help on city.

Moreover, table aliases show on the EXPLAIN plan (which is good), but with no reference to the original table.

So, is it back to parsing of the SQL query? I’m lazy reluctant to do that. It’s error prone, and one needs to implement, or use, a good parser, which also accepts MySQL dialect. Haven’t looked into this yet.

I’m currently at a standstill with regard to automated query execution plan evaluation where database cannot be determined.


PlanetMySQL Voting: Vote UP / Vote DOWN

Things to monitor on MySQL, the user’s perspective

Март 10th, 2010

Working on mycheckpoint, I have the intention of adding custom monitoring. That is, letting the user define things to monitor. I have my own thoughts, I would be grateful to get more input!

What would the user want to monitor?

Monitoring for the number of SELECT statements per second, InnoDB locks, slave replication lag etc. is very important, and monitoring utilities provide with this information. But what does that tell the end user? Not much.

The experienced DBA may gain a lot. The user would be more interested in completely other kind of information. In between, some information is relevant to both.

Say we were managing an on-line store. We want to monitor the health of the database. But the health of the database is inseparable from the health of the application. I mean, having little to no disk usage is fine, unless… something is wrong with the application, which leads to no new purchases.

And so a user would be interested in monitoring the number of purchases per hour, or the time passed since last successful purchase. This kind of data can only be generated by a user’s specific query. Looking at the charts, the user would then feel safer and confident in the wellness of his store app.

But let’s dig further. We want the store’s website to provide with good response. In particular, the query which returns the items in a customer’s cart must react quickly. Our user would not only want to see that purchases get along, but also that page load times (as in our example) are quick for those critical parts. And so a user should be able to monitor the time it took to execute a given query.

It can be of further interest to know how many times per second a given query is executed. This part is not easily done on the server side, and requires the user’s cooperation (or else we must analyze the general log, sniff, or set up a proxy). If the user is willing, she can log to some table each time she executes a certain query. Then we’re back to monitoring a regular table, as with the first example.

It is also possible to monitor for a query’s execution plan. Is it full scan? How many rows are expected? But given that we can monitor the time it took to execute a query, I’m not sure this is useful. If everything runs fast enough — who cares about how it executes?

Some of the above can be monitored on an altogether higher level: if  we’re talking about some web application, then we can use our Apache logs to determine load time for pages, or number of requests to our “cart items” page. But not always do we work with web servers, and we may be interested in checking the specific queries behind the scenes.

Summary

Custom monitoring can include:

  • User defined queries (number of concurrent visitors; count of successful operations per second; number of rows per given table or condition; …)
  • Execution time for user defined queries (time it takes to return cart items; find rows matching condition; sort a table; …)
  • Number of executions for a given query, per second.

I intend to incorporate the above into mycheckpoint as part of its standard monitoring scheme.

Please share your thought below.


PlanetMySQL Voting: Vote UP / Vote DOWN

Things to monitor on MySQL, the user’s perspective

Март 10th, 2010

Working on mycheckpoint, I have the intention of adding custom monitoring. That is, letting the user define things to monitor. I have my own thoughts, I would be grateful to get more input!

What would the user want to monitor?

Monitoring for the number of SELECT statements per second, InnoDB locks, slave replication lag etc. is very important, and monitoring utilities provide with this information. But what does that tell the end user? Not much.

The experienced DBA may gain a lot. The user would be more interested in completely other kind of information. In between, some information is relevant to both.

Say we were managing an on-line store. We want to monitor the health of the database. But the health of the database is inseparable from the health of the application. I mean, having little to no disk usage is fine, unless… something is wrong with the application, which leads to no new purchases.

And so a user would be interested in monitoring the number of purchases per hour, or the time passed since last successful purchase. This kind of data can only be generated by a user’s specific query. Looking at the charts, the user would then feel safer and confident in the wellness of his store app.

But let’s dig further. We want the store’s website to provide with good response. In particular, the query which returns the items in a customer’s cart must react quickly. Our user would not only want to see that purchases get along, but also that page load times (as in our example) are quick for those critical parts. And so a user should be able to monitor the time it took to execute a given query.

It can be of further interest to know how many times per second a given query is executed. This part is not easily done on the server side, and requires the user’s cooperation (or else we must analyze the general log, sniff, or set up a proxy). If the user is willing, she can log to some table each time she executes a certain query. Then we’re back to monitoring a regular table, as with the first example.

It is also possible to monitor for a query’s execution plan. Is it full scan? How many rows are expected? But given that we can monitor the time it took to execute a query, I’m not sure this is useful. If everything runs fast enough — who cares about how it executes?

Some of the above can be monitored on an altogether higher level: if  we’re talking about some web application, then we can use our Apache logs to determine load time for pages, or number of requests to our “cart items” page. But not always do we work with web servers, and we may be interested in checking the specific queries behind the scenes.

Summary

Custom monitoring can include:

  • User defined queries (number of concurrent visitors; count of successful operations per second; number of rows per given table or condition; …)
  • Execution time for user defined queries (time it takes to return cart items; find rows matching condition; sort a table; …)
  • Number of executions for a given query, per second.

I intend to incorporate the above into mycheckpoint as part of its standard monitoring scheme.

Please share your thought below.


PlanetMySQL Voting: Vote UP / Vote DOWN

SQL: Ranking without self join

Сентябрь 14th, 2009

The common way of solving the classic SQL problem of ranking, involves a  self join. I wish to present a different solution, which only iterates the table once, and provides the same output.

The ranking problem

Given a table with names and scores (e.g. students exams scores), add rank for each row, such that the rank identifies her position among other rows. Rows with identical scores should receive the same rank (e.g. both contenders got the silver medal).

Consider the following table (download score.sql):

mysql> select * from score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
|        1 | Wallace      |    95 |
|        2 | Gromit       |    97 |
|        3 | Shaun        |    85 |
|        4 | McGraw       |    92 |
|        5 | Preston      |    92 |
+----------+--------------+-------+
5 rows in set (0.00 sec)

We wish to present ranks in some way similar to:

+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        2 | Gromit       |    97 |    1 |
|        1 | Wallace      |    95 |    2 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
|        3 | Shaun        |    85 |    4 |
+----------+--------------+-------+------+

Note that McGraw and Preston got same scores, therefore both share 3rd rank, whereas Shaun gets ranked #4.

The common solution

Following is the SQL for the common solution:

SELECT
  s1.score_id, s1.student_name, s1.score, COUNT(DISTINCT s2.score) AS rank
FROM
  score s1 JOIN score s2 ON (s1.score <= s2.score)
GROUP BY s1.score_id
;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
|        1 | Wallace      |    95 |    2 |
|        2 | Gromit       |    97 |    1 |
|        3 | Shaun        |    85 |    4 |
|        4 | McGraw       |    92 |    3 |
|        5 | Preston      |    92 |    3 |
+----------+--------------+-------+------+

(The above can by ORDERed at will, more on this later)

What I’m suggesting is that this self join is an overkill. It recalculates over and over what we knew a second before: to get the Preston’s rank, we need to count how many students got score >=92. But when we need to find Shaun’s rank, we re-iterate these, and in addition count those with grades 85..91. We’re reading, re-reading, then re-re-reading (you get the point) the same data all over again. It’s a waste of energy.

Offering a new solution

I propose a simpler solution: do a one-time sorting of rows according to score (descending). The first row in the sorted set should obviously get the score “1″. Now we iterate the rows one by one, and keep a rank variable. Whenever the score remains the same, we just keep on iterating. Whenever the score changes (it can only change in the direction of “downwards”, since we sorted by score. descending), we increment the rank.

Actually, the above explanation makes it sound as if we do this with multiple steps. This is not so. We do this all in one step:

SELECT
  score_id, student_name, score,
  @prev := @curr,
  @curr := score,
  @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
FROM
  score,
  (SELECT @curr := null, @prev := null, @rank := 0) sel1
ORDER BY score DESC
;
+----------+--------------+-------+----------------+----------------+------+
| score_id | student_name | score | @prev := @curr | @curr := score | rank |
+----------+--------------+-------+----------------+----------------+------+
|        2 | Gromit       |    97 |           NULL |             97 |    1 |
|        1 | Wallace      |    95 |             97 |             95 |    2 |
|        4 | McGraw       |    92 |             95 |             92 |    3 |
|        5 | Preston      |    92 |             92 |             92 |    3 |
|        3 | Shaun        |    85 |             92 |             85 |    4 |
+----------+--------------+-------+----------------+----------------+------+

Execution plan comparison

Do we have an index on the score column?

If not, I clearly win. The self join (when there’s more than mere 5 rows, of course) will make for repeated full table scans, thereby making for O(n²).

+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+
|  1 | SIMPLE      | s1    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort |
|  1 | SIMPLE      | s2    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where                     |
+----+-------------+-------+------+---------------+------+---------+------+------+---------------------------------+

Whereas the suggested solution requires one filesort. This still means table data can be re-read, but significantly less so: it only takes O(n*log(n)), where the log(n) part is usually very small (and it all depend on sort_buffer_size).

+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra          |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL | NULL    | NULL |    1 | Using filesort |
|  1 | PRIMARY     | score      | ALL    | NULL          | NULL | NULL    | NULL |    5 |                |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+------+---------+------+------+----------------+

Testing on the somewhat larger sakila.film table (1000 rows is all) on my laptop, it takes 47 seconds for the common query to complete, 0.01 seconds to presented solution (repeatedly, no cache issues).

What happens when we do have an index? Again, I win. Testing on sakila.film now, having added an index on location column:

+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys | key    | key_len | ref  | rows | Extra                                        |
+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | s2    | index | length        | length | 3       | NULL | 1140 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | s1    | ALL   | length        | NULL   | NULL    | NULL | 1140 | Using where                                  |
+----+-------------+-------+-------+---------------+--------+---------+------+------+----------------------------------------------+

The above still needs to do index scan. The GROUP BY requires either sorting or utilizing the PRIMARY KEY, and the execution plan with reversed tables ordering does not improve.

Presented solution utilized index for a single pass, with O(n) complexity:

+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+
| id | select_type | table      | type   | possible_keys | key    | key_len | ref  | rows | Extra          |
+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+
|  1 | PRIMARY     | <derived2> | system | NULL          | NULL   | NULL    | NULL |    1 |                |
|  1 | PRIMARY     | film       | index  | NULL          | length | 3       | NULL | 1140 |                |
|  2 | DERIVED     | NULL       | NULL   | NULL          | NULL   | NULL    | NULL | NULL | No tables used |
+----+-------------+------------+--------+---------------+--------+---------+------+------+----------------+

This is done with a single index pass. Again, on my laptop, I get ~41 seconds for 1st query, 0.01 seconds for the proposed solution.

I also argue that in addition, the results would often be required by order of rank, in which case the common solution must eventually sort anyhow.

Conclusion

To be honest, I’ve seen the self-join solution in so many places: textbooks, training material, online tutorials… Maybe it’s just a silly exercise, perhaps not your daily real-world task; but it’s one of those classic SQL problems. The so-often-repeated common solution is ANSI SQL, for sure, but at what cost?


PlanetMySQL Voting: Vote UP / Vote DOWN