Archive for the ‘Pythian’ Category

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 2 (partitioning)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

See part 1 for the introduction and talking about power and hardware. This part will go over the 2nd “P”, partitioning. Learning about Oracle’s partitioning has gotten me more interested in how MySQL’s partitioning works, and I do hope that MySQL partitioning will develop to the level that Oracle partitioning does, because Oracle’s partitioning looks very nice (then again, that’s why it costs so much I guess).

Partition – Larger tables or fact tables can benefit from partitioning because it makes data load easier and can increase join performance and use data elimination. Parallel execution can be done with partitioning due to partition pruning. The degree of parallelism should be a power of 2, because of hash-based algorithm in hash partitioning. To translate this to the MySQL world, if you are using LINEAR HASH partitioning, then you should use a degree of parallelism that is a power of 2 (I checked, and indeed. Otherwise, use a degree of parallelism that makes sense given the number of partitions you have.

One important note that during Pythian’s testing of MySQL partitioning, we found that all partitions were locked when an INSERT occurs, for the duration of the INSERT. Bulk-loading with MySQL partitioning is not as fast as it would be if MySQL allowed partition pruning for INSERTs.

So, what should be partitioned? For the first level of partitioning, the goal is to enable partitioning pruning and simplify data management. The most typical partitioning is range or interval partitioning on a date column. Interval partitioning is you say what the partition is (date, month) and partition is automatically created. MySQL does not have interval partitioning, and I have seen typical first-level partitioning be range or list based on a date or timestamp column. Note that if you use a timestamp field, the partitioning expression is optimized if you use TO_DAYS(timestamp_field) or YEAR(timestamp_field). In my experience, using anything else (such as DATE(timestamp_field)) actually makes partitioning slower than not using partitioning at all. Note that this is based on tests I did a few months ago, and your mileage may vary.

So — how do you decide partitioning strategy? Ask yourself:

  • What range of data do the queries touch – a quarter, a year?
  • What is the data loading frequency?
  • Is an incremental load required?
  • How much data is involved, a day, a week, a month?

The answers to the above questions will tell you about how big your interval needs to be. The best scenario is that all answers are the same, “we load every day, and people query by day.” If the answers are different weight access a higher priority than loading, because most people care more about query performance than performance of ETL.

This is true even if your intervals have different sizes — ie sales per day are much bigger in Dec but that’s OK. However, Maria recommends that the subpartition be as evenly divided as possible.

Easier to look at more partitions than to look at a partition that’s too big. But you don’t want too many partitions, max Oracle allows partitions is 1 million partitions, prior to 11g it was 64,000. “Stick closer to 64,000 than 1 million”. MySQL’s limitation is 1024 per table.

For the second level of partitioning, also called subpartitioning, the goal is to allow for multi-level pruning and improve join performance. In Oracle, the most typical subpartition is hash or list – in MySQL, you can only subpartition by hash or key.

How do you decide subpartitioning strategy?

  • Select the dimension queried most frequently on the fact table OR
  • Pick the common join column

For example, if you want to look at sales per day, per store, you would choose “per day” as the partition and “per store” as the subpartition.

If you do not have a good partition on logical elements (like grouping), then you can subpartition using hash partitioning on common joins — perhaps surrogate keys, or using join key of the largest table involved in the join.

For example, if the sales table is partitioned and another big table is product, you can hash subpartition product_id.

Because there’s overhead in partitions (loading metadata, reading metadata), make sure size of partitions and subpartitions is >20 Mb. So better to have a 30 Mb subpartition than a 15 Mb subpartition. [I have no idea if this is true in MySQL or not -- I think the general concept is true, because there is some overhead, but I have no idea about the 20 Mb figure and why that's true for Oracle, nor do I know what is true in MySQL.]

One easy calculation is double the # of CPUs, round up to nearest power of 2. If you’re executing in parallel, Oracle will use 2x CPUs. (all this advice, by the way, follows 80/20 rule, this is probably good for about 80% of the environments out there). Of course, MySQL does not do parallel execution very well, so this probably does not apply.

Oracle knows it can get partition elimination while it does a join.

If 2 tables have the same degree of parallelism (same # of buckets) and are partitioned in the same way on the join column (say, customer_id in a subpartition of sales and a partition of customer), Oracle will match the partitions when joining:

sales table joined with customer table can change into 4 small joins:
sales sub part 1 joins with customer part 1
sales sub part 2 joins with customer part 2
sales sub part 3 joins with customer part 3
sales sub part 4 joins with customer part 4

And with parallelism, the total time is now reduced to the time it takes to do one of those smaller joins.

This is also why you want to have a power of 2 for buckets – because cores/processors come in powers of 2. Partition-wise joins like this can also be done with range or list, assuming both tables in the join have the same buckets.

I have no idea if MySQL partitioning works this way, but it’s certainly a functionality that makes sense to me.


PlanetMySQL Voting: Vote UP / Vote DOWN

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 1 (introduction and power)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the difference, which is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places, I still have questions, and I am happy to have folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

I thought that was an interesting way of pointing that out — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.

From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical
models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, Star schema with multiple hierarchies in same fact table.”

Data warehousing has a 3-step approach —

1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer

The foundation layer is usually 3NF the access layer is usually a star or snowflake schema. As for the data sources, they can be varied, you would hope that they are in 3NF (and if they are you can skip the first 2 steps) but they are not always that way.

The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning, parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and use the most efficient joins – so it is not so index focused. This may be based on Oracle’s way of doing joins, I am not so sure if it applies to MySQL as well.

Power The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”

This includes hardware that immediately comes to mind such as # of CPUs/cores, speed of CPU, amount of RAM, speed of disk as well as what we may not think of immediately: speed of network switches, speed of disk controllers, number and speed of host BUS adapters. Notes on host BUS adapters (HBAs): Know the # of HBA ports you have. 4 Gb HBA does 400 Mb/sec. 2 Gb HBA does 200 Mb/sec. Make sure there’s enough HBA capacity to sustain the CPU throughput (ie, make sure HBA isn’t the bottleneck). Also the speed at which it all talks. If you have a 4 Gb machine but a 2 Gb switch, you end up having 2 Gb throughput. Upgrade the network at the same time you upgrade machines.

Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput not number of I/O’s.

I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this is based on the fact that Oracle can do hash joins and MySQL can only do nested loop joins. I wonder, though, if there is indeed no case when using MySQL for which I/O throughput is a more useful metric than iops.

Disk arrays that are expensive are usually sized for iops, not throughput, and because they’re expensive the disk array is shared throughout the company. A DBA needs to ask ‘how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?’

Typical 15k rpm disk can do about 25-35 Mb/sec (per disk) random i/o’s. Disk manufacturers will throw out numbers like 200-300 Mb/sec but that’s sequential I/O and leading edge of the drive. Make sure all your LUNs are not coming off the same set of disks, so that you’re not conflicting on disk seeks.

Continue to part 2, partitioning.


PlanetMySQL Voting: Vote UP / Vote DOWN

Determining I/O throughput for a system

Июль 29th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. In that session, there was a section on how to determine I/O throughput for a system, because in data warehousing I/O per second (iops) is less important than I/O throughput (how much actual data goes through, not just how many reads/writes).

The section contained an Oracle-specific in-database tool, and a standalone tool that can be used on many operating systems, regardless of whether or not a database exists:

If Oracle is installed, run DBMS_RESOURCE_MANAGER.CALIBRATE_IO:

SET SERVEROUTPUT ON
DECLARE
lat INTEGER;
iops INTEGER;
mbps INTEGER;
BEGIN
-- DBMS_RESOURCE_MANAGER.CALIBRATE_IO(<DISKS>, <MAX_LATENCY>,iops,mbps,lat);
DBMS_RESOURCE_MANAGER.CALIBRATE_IO (2, 10, iops, mbps, lat);
dbms_output.put_line('max_mbps = ' || mbps);
end;

For us MySQL folks, or even the Drizzle or NoSQL folks, Oracle offers a free standalone tool called Orion. The example given in the slides was:

./orion –run advanced –testname mytest –num_small 0 –size_large 1024 –type rand –simulate contact –write 0 –duration 60 –matrix column

-num_small is 0 because you don’t usually do small transactions in a dw.
-type rand for random I/O’s because data warehouse queries usually don’t do sequential reads
-write 0 – no writes, because you do not write often to the dw, that is what the ETL is for.
-duration is in seconds
-matrix column shows you how much you can sustain

I would be interested to see how other folks measure I/O throughput, and maybe even do a side-by-side comparison of different tools. Orion is available for:

Linux (x86, x86-64, Itanium, Power)
Solaris (SPARC64)
AIX (PPC64)
zLinux
HPUX (PA RISC, Itanium)
Windows

I am working on a larger write-up of the session itself, which had many concise descriptions of data warehousing issues, but I thought that this merited its own post.


PlanetMySQL Voting: Vote UP / Vote DOWN

Online Verification That Master and Slaves are in Sync

Июль 27th, 2010

In October 2008, Baron posted How to Check MySQL Replication Integrity Continually. Here at Pythian we have developed a method based on that post, and added “verifying that masters and slaves are in sync” to our standard battery of tests.

We call it “Continual replication sync checking”. This article will explain how it works, how to test and make the procedure non-blocking, benchmarks from the real world, issues we encountered along the way, and finally Pythian’s procedure of setting up continual replication sync in a new environment.

At the 2010 MySQL User Conference & Expo, my co-worker Danil Zburivsky did a presentation about this, and you can get the slides in ODP (Open Office) format or watch the 46-minute video on YouTube.

How it works

On the master, mk-table-checksum is run. In order to make the checksum operation online, we use the modulo and offset features of mk-table-checksum to checksum only part of the data at a time. The checksum is run (from cron) on the master and replicates to the slave. The results are captured in a result table, and a separate process checks the result table and notifies us of any discrepancies.

Testing resource usage for non-blocking replication sync checking

Each environment has a different data size, different resources, and a different level of what “intrusive” means. We start with a baseline of the size of the database, which we get from:

SELECT SUM(INDEX_LENGTH+DATA_LENGTH)/1024/1024/1024 as sizeGb
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE='BASE TABLE';

I included INDEX_LENGTH because the queries that generate the checksum information may or may not depend on indexes. Note that for InnoDB tables, the metadata is approximate, so this will not necessarily be the exact size of the database, but it will be in the right range. In general, running this query in the INFORMATION_SCHEMA did not consume too many resources, though in many cases the query took a few minutes to complete. On databases with several thousand tables, this query can take hours to complete.

Of course there is the caveat that it may take a VERY long time to run this in your environment, and some folks have reported crashing (this is VERY rare, I personally have not seen it even with the high demand client we have had, but others have reported it). So if you prefer to be on the safe side, you could look at the size of your ibdata files (whether you are innodb_file_per_table or not) plus the sizes of all the tables on disk (MyISAM, CSV, etc). That will give you a sense of how big the data is, although the size of the files on disk also includes any fragmentation. In practice, I have seen fragmentation reduce the size of a large database by 40%, so beware that neither of these methods is perfect.

Once we get the data size in gigabytes, we can come up with a modulo to test. In general, we have found that mk-table-checksum is quick enough that it is non-blocking when we use a modulo value that corresponds to approximately 1 Gb of data, and a chunk-size of 100. We actually started out using a checksum of 1000, but locking 1000 rows to test a checksum proved to be longer than was acceptable for clients with lots of transactions — the problem was not the locking on the master, the problem was that it created a slave lag on the slaves. We tried using the --sleep option to mk-table-checksum, which worked, but changing the chunksize to be smaller caused less slave lag than adding a sleep parameter.

Another issue when using chunk-size is that it requires a numeric (but non-decimal) index. We ran across clients using VARCHAR(32) as a PRIMARY key (yes, on InnoDB, and yes, performance was an issue there), or tables that did not have any indexes at all (such as a logging table). If mk-table-checksum does not find a suitable index it will just do a checksum on the whole table.

To find tables that would be problematic, here is an INFORMATION_SCHEMA query that can be run (again, the usual caveats about INFORMATION_SCHEMA apply):

SELECT CONCAT("SHOW CREATE TABLE ", table_schema,".",table_name,"\\G")
FROM TABLES
LEFT JOIN (
     SELECT distinct table_schema,table_name
     FROM STATISTICS INNER JOIN COLUMNS USING
       (table_schema,table_name,column_name)
     WHERE (data_type like '%int' or data_type like 'date%'
     or data_type like 'time%' or data_type='float') and seq_in_index=1
     ) AS good_for_checksum
USING (table_schema,table_name)
WHERE good_for_checksum.table_name IS NULL
and table_schema not in ('information_schema');

Frequency of checksum runs


The frequency that we run the checksum is also very flexible, so we take the size and translate that into a modulo that is “even” in a time-based way. For example, on a server that reported 113 Gb in size from the INFORMATION_SCHEMA query above, we set the modulo to 120. The checksum took 10 minutes and 6 seconds from the time the master started to the time the slave finished. There was no excessive slave lag caused and other resource checks showed that this test was acceptable, including application response time for non-administrative queries.

Given a modulo of 120 that takes about 10 minutes to run and the environment, we decided to run the checksum 9 times per day (hourly for 9 hours during off-peak time). This resulted in the entire data set being checked during a period of just under 2 weeks (120 parts / 9 times per day = 13.333 days).

This means that if there is a data discrepancy, it is discovered within 2 weeks in this particular environment. Though that is not ideal, it is much better than not discovering data discrepancies at all, which is how replication currently works.

Benchmarks From the Real World

The first row in the table below is the example we just went through. The subsequent rows of the table are results from some of the other production environments we run the checksum in. As you can see, we try to keep the data checked at one time to about 1 Gb.



Total Data SizeModuloTest timeChecksum RateFrequencyPeriod
113 Gb12010 min 6 seconds1.59 Mb / sec9x / day2 weeks
9 Gb2151 seconds8.4 Mb / sec3x / day1 week
29 Gb219 min 16 seconds2.6 Mb / sec3x / day1 week
70 Gb2128 seconds2650 Mb / sec
(data was
freshly defragmented!)
3x / day1 week
5.1 Gb214 min 22 sec0.958 Mb / sec3x / day1 week
314.5 Gb33636 min 3 seconds0.44 Mb / sec16x / day3 weeks

In all of these environments, slave lag was 10 seconds or less at any given point on the slaves.

Issues Encountered


Some of the issues we encountered have workarounds, so I wanted to discuss and explain them here before giving our procedure, which contains the workarounds.

  1. Bug 304 – mk-table-checksum deletes all prior checksum results in the result table on every run. There is no fix for this yet, but if you are using mk-table-checksum only for the procedure described in this article (and in particular are not using the --resume-replication option), you can comment out the following code from mk-table-checksum:

    # Clean out the replication table entry for this table.
    if ( (my $replicate_table = $final_o->get('replicate'))
    && !$final_o->get('explain') ) {
    use_repl_db(%args); # USE the proper replicate db
    my $del_sql = "DELETE FROM $replicate_table WHERE db=? AND tbl=?";
    MKDEBUG && _d($dbh, $del_sql, $db, $table->{table});
    $dbh->do($del_sql, {}, $db, $table->{table});
    }

    It is in different places in different versions, but last I checked, searching for "DELETE FROM" in mk-table-checksum only matched three lines of code, and it was pretty clear (due to the inline comment) which block of code to delete. The block shown above is from lines 5154 – 5161 in mk-table-checksum changeset 6647.

  2. Running the checksum may cause “statement not safe for replication” errors, especially in 5.1.
  3. This is usually OK to ignore, because mk-table-checksum works specifically because you can run the same command on the master and slave and get different results. In MySQL 5.1, CSV tables for the general and slow logs exist by default, even if they are not being written to, and “Statement is not safe to log in statement format” errors show up.

    You will need to redirect stderr and expect to see those statements in the MySQL error log. Note that mk-table-checksum works regardless of whether you are using statement-based, row-based or mixed replication.

  4. mk-table-checksum is not perfect, and sometimes shows false positives and false negatives.
  5. This is a hard to deal with problem, and we encourage making bug reports when they are found. However, I will note that if mk-table-checksum finds even one undetected data integrity issue, then it is useful, because right now there is no other way of detecting issues in an automated fashion. As more people use mk-table-checksum and can help the developers figure out how to fix the false positives/false negatives, I am sure it will be even better.

Pythian’s procedure to set up continual replication sync

  1. Check to make sure all tables have appropriate indexes, as above
  2. . If they do not, the nibbling algorithm can be used, though as a caveat I have not tested nibbling with mk-table-checksum.

  3. Figure out the modulo value based on data size, as above

  4. Decide what database in which to put the tables that mk-table-checksum uses. We either use our monitoring database or a database called “maatkit”. Note that it is important to use a database that actually gets replicated!
  5. Get mk-table-checksum and comment out the lines that always delete from the replicated table, as above
  6. Create and populate the table mk-table-checksum will need for the modulo value:
    CREATE TABLE IF NOT EXISTS `checksum_modulo` (
    `modulo_offset` smallint(5) unsigned NOT NULL default '0' primary key
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8;

    INSERT IGNORE INTO checksum_modulo (modulo_offset) VALUES (0);

  7. Do a test run of mk-table-checksum:

  8. perl mk-table-checksum -u avail --ask-pass --algorithm BIT_XOR \
    --replicate maatkit.checksum --create-replicate-table \
    --modulo 120 --chunk-size 100 \
    --offset 'modulo_offset FROM maatkit.checksum_modulo' localhost

    And update the modulo table:
    update maatkit.checksum_modulo set modulo_offset = ((modulo_offset+1)%21);

    And check the results on the slave:
    SELECT * FROM maatkit.checksum
    WHERE this_crc!=master_crc AND db not in ('maatkit','mysql');

  9. If the test run came out OK, set up a script to run via cron such as:

  10. # run checksum table 9 times per day, during off-peak times:
    30 0,1,2,3,4,5,21,22,23 * * * /home/pythian/bin/checksumtbls.sh >> /home/pythian/logs/checksum.log 2>&1

    And the checksumtbls.sh script looks like:

    #!/bin/sh

    DB=maatkit
    USER=maatkit_user
    PASS=password
    REPLTBL="$DB.checksum"
    MODULO=120
    CHUNKSIZE=100
    OFFSETTBL="checksum_modulo"
    ALGORITHM=BIT_XOR
    LOG=/home/pythian/logs/checksum.log

    /usr/bin/perl /home/pythian/bin/mk-table-checksum -u $USER -p $PASS \
    --modulo $MODULO \ --algorithm $ALGORITHM --chunk-size $CHUNKSIZE \
    --offset "modulo_offset FROM $DB.$OFFSETTBL" \
    --replicate $REPLTBL --create-replicate-table localhost >> $LOG

    /usr/bin/mysql -u $USER -p$PASS -e "update $DB.$OFFSETTBL set modulo_offset=((modulo_offset+1)%$MODULO)" >> $LOG

    And of course, do not forget to periodically check on the slave to see where the issues are:

    SELECT * FROM maatkit.checksum
    WHERE this_crc!=master_crc AND db not in ('maatkit','mysql');

    I hope this helps; It is extremely important to make sure


    PlanetMySQL Voting: Vote UP / Vote DOWN

Webinar: What you need to know for a MySQL 5.0 -> 5.1 upgrade

Июль 26th, 2010

IOUG has a free series of three webinars on upgrading MySQL. Each webinar is an hour long, and it starts with a webinar by me tomorrow at 12 noon Central time (GMT-5) on “Why and How to Upgrade to MySQL 5.1″. The webinar assumes you are upgrading from MySQL 5.0 to MySQL 5.1, and talks a little bit about the new features, server variables, and what you need to know when upgrading to MySQL 5.1.

The software used is GoToWebinar (formerly GoToMeeting), so you will need to install that software. To register, use the links on the IOUG MySQL Upgrade Webinar Series page.

The complete list of webinars in the MySQL Upgrade Series is:
* MySQL 5.1: Why and How to Upgrade
Sheeri Cabral, The Pythian Group
Tuesday, July 27, 12:00 p.m. – 1:00 p.m. CT (GMT-5)

* MySQL Upgrades With No Downtime
Sean Hull, Heavyweight Internet Group
Wednesday, July 28, 12:00 p.m. – 1:00 p.m. CT (GMT-5)

* MySQL Upgrade Best Practices
Matt Yonkovit, Percona
Thursday, July 29, 12:00 p.m. – 1:00 p.m. CT (GMT-5)

(note, I am not sure if it is free for everyone or just free for IOUG members; my apologies if it is the latter)


PlanetMySQL Voting: Vote UP / Vote DOWN

Using MySQL Partitioning Instead of MERGE Tables

Июль 24th, 2010

One common question I get is how to use partitioning instead of MERGE tables. The process I use involves using stored procedures to create and drop partitions. This article will go over the stored procedures I use; special thanks to Roland Bouman for taking a look and giving great feedback to optimize this process.

First, a simple table, not partitioned (yet):

use test;
DROP TABLE IF EXISTS my_part;
CREATE TABLE IF NOT EXISTS my_part (
  id int NOT NULL,
  creationDate datetime NOT NULL,
  PRIMARY KEY (id,creationDate)
) ENGINE=InnoDB;

In real, life there is more to the table than just id and creationDate. The most important part is that the partitioned field(s) need to be part of the primary key.

Now, add the partition definition. This can be done in the CREATE statement, but I have found that it is easier for me to think about what fields and indexes I want first, and then worry about partitioning, so I naturally gravitate towards defining the fields and indexes first and then altering the table to add partitioning. This also helps me think about how to modify an existing table to add partitioning.

ALTER TABLE my_part PARTITION BY RANGE (TO_DAYS(creationDate)) (
partition 2010_07_01 values less than (to_days('2010-07-02')),
partition 2010_07_02 values less than (to_days('2010-07-03')),
partition 2010_07_03 values less than (to_days('2010-07-04')),
partition 2010_07_04 values less than (to_days('2010-07-05'))
);

This makes it pretty clear what is happening — the idea is to give the partition names actual dates that they hold, so that it is easy to see what partitions need to be added and deleted.

Deleting partitions

I find that making stored procedures makes things easy….so I will define a procedure called partition_drop to drop partitions. The partition_drop stored procedure takes in a table name and schema name to drop partitions from, and a date to delete up through and including that date. Here’s the procedure:

DELIMITER ||
DROP PROCEDURE IF EXISTS partition_drop ||

CREATE PROCEDURE partition_drop (IN through_date date, IN tbl varchar(64), IN db varchar(64))
BEGIN
DECLARE delete_me varchar(64);
DECLARE notfound BOOL DEFAULT FALSE;
DECLARE pname CURSOR FOR SELECT PARTITION_NAME
FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME=tbl AND TABLE_SCHEMA=db
AND DATE(PARTITION_NAME)!= 0
AND DATE(PARTITION_NAME) IS NOT NULL
AND DATE(PARTITION_NAME)<=through_date;

DECLARE CONTINUE HANDLER FOR NOT FOUND SET notfound:=TRUE;
OPEN pname;

cursor_loop: LOOP
FETCH pname INTO delete_me;
IF notfound THEN LEAVE cursor_loop; END IF;
SET @alter_stmt:=CONCAT("ALTER TABLE ",db,".",tbl," DROP PARTITION ",delete_me);

# sanity check commented out for production use
# SELECT @alter_stmt;
PREPARE stmt_alter FROM @alter_stmt; EXECUTE stmt_alter; DEALLOCATE PREPARE stmt_alter;

END LOOP;
CLOSE pname;
END ||
DELIMITER ;

Go ahead and run CALL partition_drop('2010-07-02','my_part','test'); to verify that SHOW CREATE TABLE my_part; shows that the desired partitions have been dropped.

Adding partitions

Adding partitions is similar to deleting partitions — using a stored procedure that takes in the date to add partitions up to that date. It will not try to add so many partitions that the table will have more than 1024, and it won’t add any partitions that already exist.

DELIMITER ||
DROP PROCEDURE IF EXISTS partition_add ||

CREATE PROCEDURE partition_add (IN through_date date, IN tbl varchar(64), IN db varchar(64))
BEGIN
DECLARE add_me char(10);
DECLARE max_new_parts,add_cnt smallint unsigned default 0;
SELECT 1024-COUNT(*) AS max_new_parts,
SUM(CASE WHEN
 DATE(PARTITION_NAME)>=through_date then 1 else 0
 END)
INTO max_new_parts, add_cnt
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE  TABLE_SCHEMA = db
 AND TABLE_NAME = tbl;

IF add_cnt=0 THEN
BEGIN
SELECT MAX(DATE(PARTITION_NAME)) INTO add_me
FROM INFORMATION_SCHEMA.PARTITIONS WHERE TABLE_NAME=tbl and TABLE_SCHEMA=db
AND DATE(PARTITION_NAME)<through_date;

# to do: declare handler for exceptions here
IF DATEDIFF(through_date,add_me)+1 < max_new_parts THEN
BEGIN
WHILE add_me<through_date do BEGIN
SET add_me:=DATE_FORMAT(add_me + INTERVAL 1 DAY,"%Y_%m_%d");
SET @alter_stmt:=CONCAT("ALTER TABLE ",db,".",tbl," ADD PARTITION (PARTITION ",add_me," VALUES LESS THAN (TO_DAYS('",add_me+INTERVAL 1 DAY, "')))" );

PREPARE stmt_alter FROM @alter_stmt; EXECUTE stmt_alter; DEALLOCATE PREPARE stmt_alter;

END;
END WHILE;
END;
END IF;
END;
END IF;
END ||
DELIMITER ;

Here’s how to call that stored procedure:
CALL partition_add ('2010_07_10', 'my_part','test');

Caveat: This stored procedure will only add in partitions from the maximum partition name/date until the date you add it in, it will not fill in any gaps. However, it is possible to add in another input parameter to be the “start date” and change the stored procedure to add the partition if it does not exist, from the start date through the end date.

Note: It was pointed out by Roland Bouman that it would be better to change both the add and drop stored procedures to do one ALTER TABLE instead of sequential ones. He wrote:

It’s probably better to generate a single statement to drop / add all partitions.

So the general pattern would be:

- generate and concatenate lines for each partition
- one single sequence of prepare, execute, deallocate to execute one DDL statement.

For the drop partition procedure, this approach would allow you to do away with the cursor. You can simply use GROUP_CONCAT in a single SELECT…INTO statement to generate the entire DDL statement.

Note: Roland also mentioned that these stored procedures could be generalized to use any interval, for example if each table partition held 7 days. The only limitation is that to get the optimal partition performance with dates, use either TO_DAYS(date_field) or YEAR(date_field) as the partitionining function.

Putting it all together

The procedure: daily I would run the following in MySQL, to keep only the previous “x” days:

CALL partition_drop('CURRENT_DATE()-INTERVAL x DAY', 'tbl_name','schema_name');

To add new partitions, I would do:
CALL partition_add('CURRENT_DATE()-INTERVAL 32 DAY', 'tbl_name','schema_name');

It is run daily just in case; the side benefit is that it will catch any gaps if you do not modify the stored procedure to include a start date for when to add. Adding a partition does not take a long time, because there’s no data to reorganize.

I would make a daily MySQL event, and then everything is stored nicely in the database, and backed up with the database. Others may choose to run the stored procedures in a cron script or scheduled task (Windows), but that requires a password to be stored somewhere — either in the script, somewhere the script can read, or in a config file such as .my.cnf.


PlanetMySQL Voting: Vote UP / Vote DOWN

Log Buffer #196, A Carnival of the Vanities for DBAs

Июль 24th, 2010

Welcome to Log Buffer, the weekly roundup of database industry news.

For your reading pleasure this week we have Log Buffer #196:

Charles Hooper blogs about an in-depth investigation on what can cause Oracle to ignore a hint.

Doug Burns reminds his readers that there are only two weeks left to submit papers for UKOUG. The deadline is Aug. 2.

A while back Greg Rahn at Structured Data blog thought that the best way to get results out of Exadata is by changing your application to get the most out of Exadata. He was very happy to see that Pythian thinks the same. On the subject of Exadata and data warehousing, Greg posts this week on the core performance fundamentals of Oracle Data Warehousing – set processing vs row processing.

Jonathan Lewis links to instructions and explanations on how to switch to a different UNDO tablespace. It is trickier than it sounds and Jonathan provided additional traps to watch out for.

Jonathan also continues his fragmentation series with an explanation of table fragmentation and its causes.

Alex Fatkulin explains about ASM mirroring and disk partnership and why you may have less redundancy than you thought.

On the same subject, Jeremy Schneider of Ardent Performance blog explains about hot disks, raid and what it means for ASM mirroring.

Back to blogging after a recent trip to TechInsights 2010, Edwin Sarmiento answers questions on what needs to be done as part of the installation of a SQL Server 2008 R2 Failover Cluster on Windows Server 2008 R2. A number of things related to Windows Clustering need to be considered.

Willie Favero introduces the “IBM zEnterprise System”, on his blog Getting the Most out of DB2 for z/OS and System z.

On Join-fu! the Art of SQL blog, Jay Pipes talks about getting started developing Nova on Linux, as he’s involved in a new OpenStack project.

Peter Zaitsev, on MySQL Performance Blog, posts about estimating replication capacity so that replication load can be dealt with before slave is unable to catch up.

Paul Randal publishes his survey results around the purchase and use of SSIDs.

And, if you happen to be attending Oracle OpenWorld, register before July 30 to take advantage of early bird rates.


PlanetMySQL Voting: Vote UP / Vote DOWN

INFORMATION_SCHEMA tables are case sensitive

Июль 20th, 2010

I wanted to get examples of some of the extra information that the Percona server has in its INFORMATION_SCHEMA metadata, and in doing so, I stumbled across an interesting MySQL bug/feature/point — INFORMATION_SCHEMA tables (which are actually system views) are case sensitive when used in comparisons:

mysql> use information_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql>  select @@version;
+--------------------+
| @@version          |
+--------------------+
| 5.1.36-xtradb6-log |
+--------------------+
1 row in set (0.00 sec)

mysql> use information_schema;
Database changed
mysql> show tables like 'innodb%';
Empty set (0.00 sec)

mysql>  show tables like 'INNODB%';
+----------------------------------------+
| Tables_in_information_schema (INNODB%) |
+----------------------------------------+
| INNODB_BUFFER_POOL_PAGES_INDEX         |
| INNODB_RSEG                            |
| INNODB_LOCKS                           |
| INNODB_BUFFER_POOL_PAGES               |
| INNODB_TRX                             |
| INNODB_INDEX_STATS                     |
| INNODB_LOCK_WAITS                      |
| INNODB_CMP_RESET                       |
| INNODB_CMP                             |
| INNODB_CMPMEM_RESET                    |
| INNODB_BUFFER_POOL_PAGES_BLOB          |
| INNODB_CMPMEM                          |
| INNODB_TABLE_STATS                     |
+----------------------------------------+
13 rows in set (0.00 sec)

It is not just for the new tables Percona has added:

mysql> show tables like 'table%';

Empty set (0.00 sec)

mysql> show tables like 'TABLE%';

+---------------------------------------+
| Tables_in_information_schema (TABLE%) |
+---------------------------------------+
| TABLES                                |
| TABLE_CONSTRAINTS                     |
| TABLE_PRIVILEGES                      |
+---------------------------------------+
3 rows in set (0.00 sec)

And it is not due to the collation:

mysql> show global variables like '%collat%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

mysql> show session variables like '%collat%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | utf8_general_ci   |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

PlanetMySQL Voting: Vote UP / Vote DOWN

Log Buffer #195, A Carnival of The Vanities for DBAs

Июль 16th, 2010

A short post marks Pythian’s 195th edition of Log Buffer, a blog of blogs encapsulating what’s going on in the world of database administration.

Remember if you find a link or interesting blog post that you think Log Buffer should mention, send a note to the editor at Log Buffer and be sure to include the link, and a short note on why you think that others will want to read it too.

Now on to Log Buffer #195. Alex Gorbachev starts us off with his suggested readings and funnily enough, Chen Shapira had many of the same favorites this week.

Jonathan Lewis introducing a new series about fragmentation. In this post he defines what he means by fragmentation. Alex is looking forward to reading the next bits.

Charles Hooper shows how to present performance data in visually pleasing charts.

Mark Calahan, on his High Availability MySQL blog announces a new book on MySQL replication and high availability from Charles, Mat and Lars, stating it’s a good read to save people from some inevitable failures when distributed systems are deployed for enought time over many servers.

Tanel Poder, back blogging full force shows what really happens when you drop a table, and why you can drop tables from a read-only tablespace.

Maria Colgan from Oracle Optimizer Team drills into details of CURSOR_SHARING. By the way, Oracle Optimizer blog has moved to the official Oracle blogs as Doug Burns noted. And he couldn’t resist mentioning another post from Maria Colgan about moving from the old Rule-Based Optimizer to the Cost-based Optimizer. Alex saw Maria speak at ODTUG/Kaleidoscope and since then follows her blog all the time. There are lots of other recently posted materials in the new location so read on.

Randolf Geist, Oracle ACE, and member of the OakTable Network, posted a monstrous (size-wise) article about limitations of index and table compression covering Oracle release from 9i up to the latest 11.2. Very thorough!

Michael Janke complains about the number of severe bugs in Oracle and tells some amusing anecdotes about defective parts in the industry.

Michelle Gutzait suggests a new free eBook available now on how to do performance tuning with Data Management Views in SQL Server.

Joshua Drake notes the call for papers for PgWest 2010 is now open and provides the link to submit your talk. Deadline for submission is September 5, 2010 and speakers will be notified the following week. They’re looking for general PostgreSQL and stack topics and tutorials.

Until next week.


PlanetMySQL Voting: Vote UP / Vote DOWN

The XLDB4 Conference for Very Large Databases

Июль 8th, 2010

Ronald saved me a post by giving his feedback on a few Oracle conferences that now have MySQL content.

My opinion is pretty much a summary of Ronald’s post, so I won’t repeat it here. Instead, I’ll post about a conference he did not, the 4th Extremely Large Databases Conference. I am particularly interested in any MySQL folks planning to attend (I would expect Tokutek to be represented, and maybe even the "http://www.calpont.com/">Calpont folks).

Most of this is directly from an e-mail I received from Jacek Becla, who had a keynote at the 2008 MySQL User Conference and Expo. If you also received this e-mail, please feel free to skip ahead to my viewpoints on the various Oracle conferences (or just skip altogether).

4th Extremely Large Databases (XLDB4) Conference
October 6-7, 2010
SLAC National Accelerator Laboratory
Menlo Park, California
http://www.xldb.org/4

XLDB, now including a conference open to all interested parties, is the gathering place for the community managing and analyzing data at extreme scale.

Leading practitioners from science, industry, and academia will discuss lessons learned and real-world solutions for handling terabytes and petabytes of data. Experts will present emerging trends that will significantly impact how extremely large databases are built and used.

More information, including the program, and online registration are available at http://www.xldb.org/4. Space is limited, so if you would like to attend, make sure to register soon. Early registration ($100) ends on July 31; late registration is $150.

If you are uncertain whether to attend, I would recommend reading the article recently written by Curt Monash:
http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/

For questions, please contact us by email using
xldb-admin at slac.stanford.edu or contact me [Jacek] directly at
becla at slac.stanford.edu.



PlanetMySQL Voting: Vote UP / Vote DOWN