Over the weekend I spent a lot of time improving my new Shard-Query tool (
code.google.com/p/shard-query) and the improvements can equate to big performance gains on partitioned data sets versus executing the query directly on MySQL.

I'll explain this graph below, but lower is better (response time) and Shard-Query is the red line.
MySQL understands that queries which access data in only certain partitions don't have to read the rest of the table. This partition elimination works well, but MySQL left a big optimization out of partitioning: getting data in parallel.
In fact, since partition elimination is the only major optimization provided by the partition options it isn't great for scaling access to large data sets when the entire data set must be accessed, but only when smaller parts of a the set are examined.
Since Shard-Query exploits parallelism with Gearman (
http://www.gearman.org) I decided to extend the Shard-Query "optimizer" to support running queries with IN lists in parallel. This makes a query scale much further than it would if there was no parallelism at work.
Consider the table following partitioned fact table:
CREATE TABLE `fact` (
`id` bigint(20) unsigned DEFAULT NULL,
`a_id` bigint(20) unsigned DEFAULT NULL,
`b_id` int(11) NOT NULL,
`c_id` int(11) NOT NULL,
`i1` tinyint(4) DEFAULT NULL,
`qty` smallint(6) DEFAULT NULL,
`score` decimal(10,10) DEFAULT NULL,
`price` decimal(7,3) DEFAULT NULL,
`i2` int(11) DEFAULT NULL,
`i3` int(11) DEFAULT NULL,
`wide_row` char(54) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (i1) PARTITIONS 100 */
The table is partitioned into 100 partitions, and there are 100 distinct values for i1. This means that all the values for a particular i1 are housed in a single partition.
Consider the following query:
select price*qty from fact where i1 in (1,2,3);
This query is semantically equivalent to:
select price*qty from fact where i1 = 1
UNION ALL
select price*qty from fact where i1 = 2
UNION ALL
select price*qty from fact where i1 = 3
Unfortunately, MySQL does not have any intra-query parallelism, so rewriting the query that way is not an effective scaling strategy. However, if you execute all three queries at the same time, and use a temporary table as the UNION ALL, you can actually get parallelism. This is what Shard-Query does. It can take each IN list item and assign it to a worker, and stuff the results back together at the end.
The second thing that can be done to improve performance at the partition level (or the shard level) is to push down aggregation of distributable aggregate functions to the worker. If you've read my blog before you might know that the distributable aggregate functions are SUM and COUNT.
Consider a query very much like the previous query:
select shard_col, sum(price * qty) from t1 where shard_col in (1,2,3) group by shard_col;
This query features aggregation with distributable functions. This query is semantically equivalent to the following:
This query is semantically equivalent to:
SELCT shard_col, SUM(`sum(price*qty)`) as `sum(price*qty)`
from ( select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
UNION ALL
select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
UNION ALL
select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
) GROUP BY shard_col;
Shard-Query can push the aggregation down to the shards and it sends each query which is part of the "UNION ALL" operation in parallel.
I ran a benchmark based on the above table with 40M rows in it. That is 40K rows per shard.
The benchmark queries follow the pattern:
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3,4) group by i1;
...
All the way up to 100 values in the IN list. The table contains 400K rows for each c1 value, and all those rows are stored in a single partition.
Here is that graph again. For this test I used an EC2 "c1.large" instance. That is, 8 cores with 8 GB of CPU. I used a 4GB data set, Percona Server 10.2 and a 1GB buffer pool size. Each partition is approximately 32MB in size.
Since the machine has eight cores, I started eight Gearman workers. The results are, I think, impressive. In the graph, "partitions scanned" is the number of values in the IN list.

This performs very well due to the pushdown of the aggregation. If a non-distributable aggregate function were to be used, then performance would probably be worse than MySQL because all 40M rows would be accessed and copied into a temporary table.
PlanetMySQL Voting:
Vote UP /
Vote DOWN