Archive for the ‘parallelism’ Category

Scale differences between OLTP and Analytics

Май 15th, 2012

In my previous post,http://database-scalability.blogspot.com/2012/05/oltp-vs-analytics.html, I reviewed the differences between OLTP and Analytics databases.

Scale challenges are different between those 2 worlds of databases.



Scale challenges in the Analytics world are with the growing amounts of data. Most solutions have been leveraging those 3 main aspects: Columnar storage, RAM and parallelism.
Columnar storage makes scans and data filtering more precise and focused. After that – it all goes down to the I/O - the faster the I/O is, the faster the query will finish and bring results. Faster disks and also SSD can play good role, but above all: RAM! Specialized Analytics databases (such as Oracle Exadata and Netezza) have TBs of RAM. Then, in order to bring results for queries, data needs to be scanned and filtered, a great fit for parallelism. A big data range is divided into many smaller ranges given to parallel worker threads that each performs his task in parallel, the entire scan will finish in a fraction of the time.

In the OLTP, scale challenges are in the growing transaction concurrency throughput and… growing amounts of data. Again? Didn't we just say growing data is the problem of Analytics? Well, today’s OLTP apps are required to hold more data to provide a larger span online functionality. In the last couple of years OLTP data archiving was changed dramatically. OLTP data now covers years and not just days or weeks. Facebook recently launched its “time line” feature (http://www.facebook.com/about/timeline), can you imagine your timeline ends after 1 week? Facebook’s probably world’s largest OLTP database holds data of a billion users for years back. Today all data is required anywhere anytime, right here, right now, online. Many of today’s OLTP databases go well beyond the 1TB line. And what about transaction concurrency throughput? Applications today are bombarded by millions of users shooting transactions from browsers, smartphones, tablets… I personally checked my bank account 3 times today. Why? Because I can…

What can be done to solve OLTP scale challenges?

In my next post let's start answering this question with understanding why solutions proposed for the Analytics are limited in the OLTP, and start reviewing relevant approaches.

Stay tuned, subscribe, get involved!

PlanetMySQL Voting: Vote UP / Vote DOWN

Scale differences between OLTP and Analytics

Май 15th, 2012

In my previous post,http://database-scalability.blogspot.com/2012/05/oltp-vs-analytics.html, I reviewed the differences between OLTP and Analytics databases.

Scale challenges are different between those 2 worlds of databases.



Scale challenges in the Analytics world are with the growing amounts of data. Most solutions have been leveraging those 3 main aspects: Columnar storage, RAM and parallelism.
Columnar storage makes scans and data filtering more precise and focused. After that – it all goes down to the I/O - the faster the I/O is, the faster the query will finish and bring results. Faster disks and also SSD can play good role, but above all: RAM! Specialized Analytics databases (such as Oracle Exadata and Netezza) have TBs of RAM. Then, in order to bring results for queries, data needs to be scanned and filtered, a great fit for parallelism. A big data range is divided into many smaller ranges given to parallel worker threads that each performs his task in parallel, the entire scan will finish in a fraction of the time.

In the OLTP, scale challenges are in the growing transaction concurrency throughput and… growing amounts of data. Again? Didn't we just say growing data is the problem of Analytics? Well, today’s OLTP apps are required to hold more data to provide a larger span online functionality. In the last couple of years OLTP data archiving was changed dramatically. OLTP data now covers years and not just days or weeks. Facebook recently launched its “time line” feature (http://www.facebook.com/about/timeline), can you imagine your timeline ends after 1 week? Facebook’s probably world’s largest OLTP database holds data of a billion users for years back. Today all data is required anywhere anytime, right here, right now, online. Many of today’s OLTP databases go well beyond the 1TB line. And what about transaction concurrency throughput? Applications today are bombarded by millions of users shooting transactions from browsers, smartphones, tablets… I personally checked my bank account 3 times today. Why? Because I can…

What can be done to solve OLTP scale challenges?

In my next post let's start answering this question with understanding why solutions proposed for the Analytics are limited in the OLTP, and start reviewing relevant approaches.

Stay tuned, subscribe, get involved!

PlanetMySQL Voting: Vote UP / Vote DOWN

Loading Air Traffic Control Data with TokuDB 4.1.1

Август 27th, 2010

TokuDB has a big advantage over B-trees when trickle loading data into existing tables. However, it is possible to preprocess the data when bulk loading into empty tables or when new indexes are created. TokuDB release 4 now uses a parallel algorithm to speed up these types of bulk insertions. How does the parallel loader performance compare with the serial loader? We use the Air Traffic Control (ATC) data and queries described in a Percona blog and also used in an experiment with TokuDB 2.1.0 to gain some insight.

Our ATC data is about 122M rows in size, is stored in a 40GiB CSV file, and can be found in our Amazon S3 public bucket. See the end of this blog for details. We use a table schema with 8 indices to speed up the ATC queries. The loader inserts the data into the primary fractal tree and one fractal tree for each of the 8 keys.

The load was run on two (old) machines:

  • 2 core: Intel Core 2 Duo E8500 3.16GHz, 5GiB RAM, 1 disk SATA 1TB.
  • 8 core: 2 socket quad core Xeon X5460 3.16GHz, 16GiB RAM, 6 disk SAS 1TB RAID0.

Load times:

    Load Times for the ATC Database
    TokuDB 2.1.0 and MySQL 5.1.36 TokuDB 4.1.1 and MySQL 5.1.46 TokuDB Speedup
    2 core 10,974s 5,201s 2.1x
    8 core 11,286s 2,655s 4.2x
  • We use “LOAD DATA INFILE” to load the CSV file into a MySQL table. The CSV file can be found in our Amazon S3 public bucket.
  • We see over 4x load time speed up for this database. This compares favorably with the other loaders. The next steps are to run experiments on machines with a larger number of cores, identify bottlenecks, and remove them.
  • BTW, the ATC load times into MyISAM are roughly the same as seen with the TokuDB 2.1.0 serial loader. Perhaps loads into MyISAM can be made faster with some twiddling of the MySQL system variables.

TokuDB data sizes (including indices):

  • 6.7 GiB, or almost 6 times smaller than the CSV source file. The parallel loader uses the same compression parameters as the serial loader, so no change in size occurred.

Query times:

    Query Times for the ATC Database
    TokuDB 2.1.0 TokuDB 4.1.1
    Q0 22s 28s
    Q1 67s 33s
    Q2 20s 24s
    Q3 54s 29s
    Q4 2s 1s
    Q5 11s 7s
    Q6 35s 29s
    Q7 32s 39s
    Q8.1y 7s 4s
    Q8.2y 42s 32s
    Q8.3y 37s 33s
    Q8.4y 38s 35s
    Q8.10y 139s 56s
    Q9 36s 35s
  • TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
  • BTW, the query times for MyISAM are slightly longer than TokuDB.
  • The queries can be found in our Amazon S3 public bucket.

Future blogs:

  • Loader scalability on systems with larger number of cores. Machines with 48 cores (4 socket, 12 cores per socket) are now reasonably priced.
  • How we implemented the parallel loader: algorithms and parallel runtime.

The ATC CSV data files, the schema, and the queries can be retrieved from our public Amazon S3 bucket called
tokutek-pub.  Here are the Amazon S3 keys:

  • atc_On_Time_Performance.mysql.csv.gz.aa
  • atc_On_Time_Performance.mysql.csv.gz.ab
  • atc_On_Time_Performance.mysql.csv.gz.ac
  • atc_On_Time_Performance.mysql.csv.gz.ad
  • atc_On_Time_Performance.mysql.csv.gz.xml
  • atc_ontime_create_covered.sql
  • atc_ontime_create_covered.sql.xml
  • atc_ontime_queries.tar.gz
  • atc_ontime_queries.tar.gz.xml
  • s3get
  • s3get.xml

PlanetMySQL Voting: Vote UP / Vote DOWN

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 1 (introduction and power)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the difference, which is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places, I still have questions, and I am happy to have folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

I thought that was an interesting way of pointing that out — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.

From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical
models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, Star schema with multiple hierarchies in same fact table.”

Data warehousing has a 3-step approach —

1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer

The foundation layer is usually 3NF the access layer is usually a star or snowflake schema. As for the data sources, they can be varied, you would hope that they are in 3NF (and if they are you can skip the first 2 steps) but they are not always that way.

The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning, parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and use the most efficient joins – so it is not so index focused. This may be based on Oracle’s way of doing joins, I am not so sure if it applies to MySQL as well.

Power The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”

This includes hardware that immediately comes to mind such as # of CPUs/cores, speed of CPU, amount of RAM, speed of disk as well as what we may not think of immediately: speed of network switches, speed of disk controllers, number and speed of host BUS adapters. Notes on host BUS adapters (HBAs): Know the # of HBA ports you have. 4 Gb HBA does 400 Mb/sec. 2 Gb HBA does 200 Mb/sec. Make sure there’s enough HBA capacity to sustain the CPU throughput (ie, make sure HBA isn’t the bottleneck). Also the speed at which it all talks. If you have a 4 Gb machine but a 2 Gb switch, you end up having 2 Gb throughput. Upgrade the network at the same time you upgrade machines.

Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput not number of I/O’s.

I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this is based on the fact that Oracle can do hash joins and MySQL can only do nested loop joins. I wonder, though, if there is indeed no case when using MySQL for which I/O throughput is a more useful metric than iops.

Disk arrays that are expensive are usually sized for iops, not throughput, and because they’re expensive the disk array is shared throughout the company. A DBA needs to ask ‘how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?’

Typical 15k rpm disk can do about 25-35 Mb/sec (per disk) random i/o’s. Disk manufacturers will throw out numbers like 200-300 Mb/sec but that’s sequential I/O and leading edge of the drive. Make sure all your LUNs are not coming off the same set of disks, so that you’re not conflicting on disk seeks.

Continue to part 2, partitioning.


PlanetMySQL Voting: Vote UP / Vote DOWN

Intra-query parallelism for MySQL queries without an appliance or closed source database

Май 25th, 2010
Over the weekend I spent a lot of time improving my new Shard-Query tool (code.google.com/p/shard-query) and the improvements can equate to big performance gains on partitioned data sets versus executing the query directly on MySQL.


I'll explain this graph below, but lower is better (response time) and Shard-Query is the red line.

MySQL understands that queries which access data in only certain partitions don't have to read the rest of the table. This partition elimination works well, but MySQL left a big optimization out of partitioning: getting data in parallel.

In fact, since partition elimination is the only major optimization provided by the partition options it isn't great for scaling access to large data sets when the entire data set must be accessed, but only when smaller parts of a the set are examined.

Since Shard-Query exploits parallelism with Gearman (http://www.gearman.org) I decided to extend the Shard-Query "optimizer" to support running queries with IN lists in parallel. This makes a query scale much further than it would if there was no parallelism at work.

Consider the table following partitioned fact table:
CREATE TABLE `fact` (
  `id` bigint(20) unsigned DEFAULT NULL,
  `a_id` bigint(20) unsigned DEFAULT NULL,
  `b_id` int(11) NOT NULL,
  `c_id` int(11) NOT NULL,
  `i1` tinyint(4) DEFAULT NULL,
  `qty` smallint(6) DEFAULT NULL,
  `score` decimal(10,10) DEFAULT NULL,
  `price` decimal(7,3) DEFAULT NULL,
  `i2` int(11) DEFAULT NULL,
  `i3` int(11) DEFAULT NULL,
  `wide_row` char(54) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (i1) PARTITIONS 100 */

The table is partitioned into 100 partitions, and there are 100 distinct values for i1. This means that all the values for a particular i1 are housed in a single partition.

Consider the following query:
select price*qty from fact where i1 in (1,2,3);


This query is semantically equivalent to:
select price*qty from fact where i1 = 1
UNION ALL
select price*qty from fact where i1 = 2
UNION ALL
select price*qty from fact where i1 = 3


Unfortunately, MySQL does not have any intra-query parallelism, so rewriting the query that way is not an effective scaling strategy. However, if you execute all three queries at the same time, and use a temporary table as the UNION ALL, you can actually get parallelism. This is what Shard-Query does. It can take each IN list item and assign it to a worker, and stuff the results back together at the end.

The second thing that can be done to improve performance at the partition level (or the shard level) is to push down aggregation of distributable aggregate functions to the worker. If you've read my blog before you might know that the distributable aggregate functions are SUM and COUNT.

Consider a query very much like the previous query:
select shard_col, sum(price * qty) from t1 where shard_col in (1,2,3) group by shard_col;


This query features aggregation with distributable functions. This query is semantically equivalent to the following:

This query is semantically equivalent to:
SELCT shard_col, SUM(`sum(price*qty)`) as `sum(price*qty)` 
from ( select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
       UNION ALL
       select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
       UNION ALL
       select shard_col, sum(price*qty) from t1 where shard_col = 1 group by shard_col
) GROUP BY shard_col;


Shard-Query can push the aggregation down to the shards and it sends each query which is part of the "UNION ALL" operation in parallel.

I ran a benchmark based on the above table with 40M rows in it. That is 40K rows per shard.
The benchmark queries follow the pattern:
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3) group by i1;
select i1,sum(qty*price) from SAB_SF1.fact where i1 in(1,2,3,4) group by i1;
...

All the way up to 100 values in the IN list. The table contains 400K rows for each c1 value, and all those rows are stored in a single partition.

Here is that graph again. For this test I used an EC2 "c1.large" instance. That is, 8 cores with 8 GB of CPU. I used a 4GB data set, Percona Server 10.2 and a 1GB buffer pool size. Each partition is approximately 32MB in size.

Since the machine has eight cores, I started eight Gearman workers. The results are, I think, impressive. In the graph, "partitions scanned" is the number of values in the IN list.



This performs very well due to the pushdown of the aggregation. If a non-distributable aggregate function were to be used, then performance would probably be worse than MySQL because all 40M rows would be accessed and copied into a temporary table.
PlanetMySQL Voting: Vote UP / Vote DOWN