Archive for the ‘performance’ Category

Better Controlling MySQL Memory Usage

Январь 25th, 2012

MySQL, like a lot of other software, has many knobs you can tweak. Most of these knobs may affect behaviour, but more importantly most affect the memory usage of the server, so getting these settings right is very important.

Most of MySQL’s memory is really used just as a cache, in one form or another, information that otherwise is on disk. So ensuring you have as large a cache as possible is important. However, making these memory sizes too large will trigger the server to start swapping and possibly can cause it to crash or cause the kernel to kill the process when it runs out of memory.  So that’s something we want to avoid.

Certain settings affect memory allocation on a per connection/thread basis, being bounded by thread_cache_size and max_connections.  If you configure for the worst behaviour (max_connections) you may end up not actually using all the memory you have available, memory which normally could be used for other purposes.

Recently a server I managed was configured incorrectly with a large sort_buffer_size (4M to 256M) and larger read_buffer_size (4M to 20M).  The change in configuration on first glance looks quite innocent, and not noticing that these are per-connection settings this got rolled out. max_connections on this server was set to 1000, while normally there were ~40 connections of which only a few were active. The mysqld memory footprint on startup looked fine. In fact under normal usage it also worked fine. However spiky load suddenly changed this nice behaviour: as configured mysqld ramped up the thread count and hence memory usage, resulting in swapping and finally server death…

The fault of course was mine for not noticing, but this has been an issue with MySQL forever.  Most other RDBMSes manage memory slightly differently: you define how much memory you want to use (maximum), this is optinally locked into memory, and further requests for different buffers are taken from this global pool.  That is much safer, avoids unexpected swapping, or memory over-allocation, but does raise the question of what happens when a memory request can not be honoured.

Initially I would expect two different behaviours: either

  1. kill a thread whose memory can not be allocated, or
  2. wait a certain time to allocate that memory, after which it gets killed anyway.

Option (2) is probably saner, and possibly some sort of deadlock detection can kick if in all threads are waiting for memory, perhaps killing the younger thread, or thread which has done least work first. Possibly there are other better ways of doing this?

I can imagine that changing MySQL’s current behaviour to do something like this could be quite hard, especially as ideally the engines would also use the same memory management mechanisms, but I see this as being a good thing and would make MySQL more robust, especially under load, which is after all what counts.  Of course this will not happen in today’s 5.5 GA version, or tomorrow’s 5.6 version which is probably likely to appear some time this year. That’s a major change. It would be nice if Oracle look at this for 5.7 as a way of ensuring that when resource usage does come under pressure MySQL does not go heads up, but attempts to use the allocated resources as best as possible.

In the meantime what would help would be:

  • better documentation so we can see clearly how all mysql memory is allocated. There are several web pages commenting ways to calculate this, but certainly no definitive guide.
  • The InnoDB engine’s documentation talks about memory usage and most people think that the innodb_buffer_pool_size is the main setting. yet read further and there’s talk of an overhead of perhaps 1/8th. I have recently been playing with innodb_buffer_pool_instances settings > 1 (using values in the range of 20-40) and am inclined to think that this increases that overhead somewhat more, yet there’s no documentation on this and whether my guess is right or not. Please InnoDB developers improve your documentation, if only to prove me wrong.
  • Ideally some tools to tell you if you server is possibly misconfigured. Coming from a Sybase environment I’d be tempted to suggest a stored procedure in the mysql database which can tell you total memory usage and how it is broken down as doing this with a single SELECT is going to be tricky.

Then once that is done consider adding some extra variable to enable total memory usage to be controlled. I made a feature request for this at http://bugs.mysql.com/?id=64108. If you think this feature might interest you please let Oracle know.


PlanetMySQL Voting: Vote UP / Vote DOWN

Announcing TokuDB v5.2: Improved Multi-Client Scaling and Faster Queries

Январь 19th, 2012

TokuDB® v5.2, the latest version of Tokutek’s flagship storage engine for MySQL and MariaDB, is now available.

This version offers performance enhancements over previous releases, especially for multi-client scale up and point queries, and extends the cases where ALTER TABLE is non-blocking, in particular adding Hot Column Rename.

TokuDB v5.2 maintains all our established advantages: fast trickle load, fast bulk load, fast range queries through clustering indexes, hot schema changes, great compression, no fragmentation, and full MySQL compatibility for ease of installation. See our benchmark page for details.

Multi-client workloads

In TokuDB v5.2, we have reworked our locking scheme to better support mulit-client workloads, and as always, we have focused on large databases. How did we do? Let’s check out some benchmark numbers.

SysBench

This is a SysBench comparison of InnoDB 1.1.8 and TokuDB v5.2. Prior to the run we started the database from a cold back-up (the cache is empty at the beginning of the 1 client thread run) and ran for 1 hour at each number of client threads. The following graph shows a significant performance improvement (10%-60%) at all measured levels of concurrency. The values shown are the average transactions per second for the final 15 minutes of the benchmark.

Additional details on the software settings for Sysbench can also be found in the Appendix at the end of this page.

TPCC

This is a TPCC-like comparison of InnoDB and TokuDB v5.2 on a 5000 warehouse database. The horizontal axis is the number of clients, the vertical axis shows throughput (New Order Transactions/10 seconds). Our multi-client work brings us to parity with InnoDB for this test.

Other key improvements

Both the Sysbench and the TPCC-like benchmarks have strong point-query components. Our improved performance over InnoDB, even with one client, shows that we are now outperforming InnoDB for point queries, at least in these tests. We’ll be blogging more specifically about point query performance so stay tuned. One way we achieve better point query performance is to have a different read-block size and write-block size. I’ll explain what that means in later blog posts, but one consequence is that read-intensive loads on RAIDed disks now perform many fewer I/Os.

In other news, we previously released Hot Indexing (HI) and Hot Column Addition and Deletion (HCAD). In both cases, the downtime of these Alter Table operations goes from hours to seconds.

In v5.2, we have added Hot Column Rename to the suite of online operations we support. You’ll be able to change the name of a column in a matter of seconds, just as you can now add or delete columns. We have also made Optimize Table hot, but it’s important to note that in TokuDB, Optimize Table only flushes background work, such as that produced by a column addition or deletion. It does not rebuild indexes, nor does it need to, because TokuDB indexes don’t fragment.

Summary

TokuDB v5.2 offers great scaling with increasing client thread count, improved point query performance, and Hot Column Rename. In the next couple of weeks, we’ll be posting more performance information, so stay tuned. TokuDB v5.2 is available for download.


Appendix – Configuration Details

Hardware

Centos 5.7; 2x Xeon L5520; 72GB RAM; 8x 300GB 10k SAS in RAID10.
TokuDB (running MySQL 5.1.52) is configured to use 36GB cache,
and InnoDB (running MySQL 5.5.16) with 52GB cache.

The difference is because InnoDB uses direct I/O whereas TokuDB reserves space for the OS cache.

TokuDB MySQL Config File (TokuDB v5.2 on MySQL 5.1.52)

[mysqld]
max_connections=400
table_open_cache=2048

InnoDB MySQL Config File (InnoDB v1.1.8 on MySQL 5.5.16)

[mysqld]
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_log_file_size=1900M
innodb_log_files_in_group=2
innodb_file_per_table=true
innodb_log_buffer_size=16M
innodb_file_format=barracuda
innodb_buffer_pool_size=52G
innodb_flush_log_at_trx_commit=1
max_connections=400
table_open_cache=2048

TPCC
All TPCC-like benchmarks were run with the following command line:

tpcc-mysql/tpcc_start localhost tpcc root 5000 \
         ${num_threads} 10 3600

Sysbench
All sysbench benchmarks were run with the following command line:

sysbench --test sysbench-0.5/sysbench/tests/db/oltp.lua
--oltp_tables_count 16  --oltp-table-size 50000000 --rand-init on
--num-threads ${num_threads} --oltp-read-only off
--report-interval 10 --rand-type uniform --mysql-socket
/tmp/mysql.sock --mysql-table-engine tokudb --max-time 3600
--mysql-user root --mysql-password --mysql-db sbtest
--max-requests 0 --percentile 99 run

PlanetMySQL Voting: Vote UP / Vote DOWN

Momentum MTA Performance Tuning Tips

Январь 7th, 2012

About 2 months ago I’ve joined LivingSocial technical operations team and one of my first tasks there was to figure out a way to make our MTAs perform better and deliver faster. We use a really great product called Momentum MTA (former Ecelerity) and it is really fast, but it is always good to be able to squeeze as much performance as possible so I’ve started looking for a ways to make our system faster.

While working on it I’ve created a set of scripts to integrate Momentum with Graphite for all kinds of crazy stats graphing, those scripts will be opensourced soon, but for now I’ve decided to share a few tips about performance-related changes we’ve made to improve our performance at least 2x:

  1. Use EXT2 Filesystem for the spool storage – After a lot of benchmarking we’ve noticed that amounts of I/O we’ve been doing was way too high compared to our throughput. Some investigation showed that EXT3 filesystem we were using for the spool partition had way too high metadata update overhead because of the fact that the spool storage uses a lot of really small files. Switching to EXT2 helped us gain at least 50-75% additional performance. Additional performance gain was caused by turning on noatime option for our spool.

    There are some sources that claim using XFS for spool directories is a better option, but we’ve decided to stick with EXT2 for now.

  2. Do not use %h{X} macro in your custom logs – Custom logs is an awesome feature of momentum and we use it to log our bounces along with some information from mail headers. Unfortunately the most straighforward thing to do (using %h{X} macro) was not the best option for I/O loaded servers because every time Momentum needs to log a bounce it needs to swap message body in from the disk and parse it to get you the header value.

    To solve this issue we’ve created a Sieve+ policy script that would extract the headers we need from a message during initial spooling phase (when the message is still in memory) and put those values to the message metadata. This way when we need to log those values we wouldn’t need to swap message body in from the disk. Here is the Sieve script to extract header value:

    1
    2
    3
    4
    5
    6
    require [ "ec_header_get", "vctx_mess_set", "ec_log" ];

    # Extract x-ls-send-id header to LsSendId context variable
    # (later used in deliver log)
    ($send_id) = ec_header_get "x-ls-send-id";
    vctx_mess_set "LsSendId" $send_id;

    After this we could use it in a custom logger like this:

    1
    2
    3
    4
    5
    6
    custom_logger "custom_logger1"
    {
      delivery_logfile = "cluster:///var/log/ecelerity/ls-delivery_log.cluster=>master"
      delivery_format = "%t@%BI@%i@%CI@D@%r@%R@%m@%M@%H@%p@%g@%b@%vctx_mess{LsSendId}"
      delivery_log_mode = 0664
    }
  3. Give more RAM to Momentum – When Momentum receives a message, it stores it to the disk (as required by SMTP standard) and then tries to deliver the copy it has in memory, if delivery succeeds, on-disk copy is unliked. The problem with a really have outbound traffic load is that momentum needs to keep tons of emails in memory, but by default it could only hold 250 messages. With a load of 250-500 messages a second this is just too small.

    To change this limit we’ve increased Max_Resident_Active_Queue parameter and changed it to 1000000 (of course we made sure have enough RAM to hold that many messages if needed) and Max_Resident_Messages to 0 (which means unlimited). This allows Momentum keep as many messages resident as possible and reduce the load caused by swap-in operations required for re-delivery attempts, etc.

  4. Choose a proper size for your I/O-related thread pools – in default Momentum config they set SwapIn and SwapOut thread pool sizes to 20. Under really high load even on our 4xSAS15k RAID10 this tends to be too high value. We’ve switched those pools to 8 threads each and it helped to reduce I/O contention and overall I/O throughput.

As a summary, I’d like to note, that as with any optimizations, before tuning your system it really helps to set up as much monitoring for your MTA servers as possible: cacti graphs, graphite, ganglia or something else – does not matter. Just make sure you see all the aspects of your system performance and understand what is going on with your system before changing any performance-related settings.



PlanetMySQL Voting: Vote UP / Vote DOWN

Profiling your slow queries using pt-query-digest and some love from Percona Server

Декабрь 28th, 2011
This guide will get you up and running with how to identify the bottleneck queries using the excellent tool pt-query-digest. You will learn how to use and analyze the output returned by pt-query-digest. You will also learn some differences between slow query logging in various MySQL versions. Later on in the post I will also show you how to make use of the extra diagnostic data available with Percona Server.
PlanetMySQL Voting: Vote UP / Vote DOWN

OurSQL Episode 71: Table Manners, part 2

Декабрь 26th, 2011

This week, we continue our discussion about MySQL and its forks. We discuss the Percona server and MariaDB.

Percona Live comes to Washington, DC on Wednesday, January 11th, 2012 at the Walter E. Washington Convention Center. There is a 50% discount for students, faculty and staff of educational organizations, and a 35% discount for government employees. http://www.percona.com/live/dc-2012/

Percona Server
Percona software

read more


PlanetMySQL Voting: Vote UP / Vote DOWN

Improving InnoDB memory usage continued

Декабрь 23rd, 2011

Continues from Improving InnoDB memory usage.

Here are some numbers from the fixups described in the above article:

The workload consists of 10 partitioned tables, each one containing 1000 partitions. This means 10’000 InnoDB tables. We truncate the tables, then restart mysqld and run:

1. INSERT a single row into each of the 10 tables
2. SELECT * from each table
3. FLUSH TABLES (this causes the tables to be closed and reopened on the next run)
4. wait for 10 seconds

we repeat the above steps 10 times. Here is the total memory consumption by mysqld with 1GB InnoDB buffer pool during the workload:

In the fixed case (green line) you can clearly see the peaks when the commands 1. – 4. are run and the valley bottoms during the 10 seconds waits.

Notice that no memory leaks were fixed in these improvements. This is all about interaction between InnoDB and the system allocator. InnoDB always frees memory it has allocated, but the problem was that the system allocator could not reuse the freed memory optimally due to memory fragmentation. We mainly reduced the number of and the sizes of some of the hottest allocations, mitigating the malloc/free storm.

Notice also that this is somewhat unusual workload – 10’000 InnoDB tables, row size of about 64KB and frequent FLUSH TABLES.


PlanetMySQL Voting: Vote UP / Vote DOWN

Improving InnoDB memory usage

Декабрь 20th, 2011

Last month we did a few improvements in InnoDB memory usage. We solved a challenging issue about how InnoDB uses memory in certain places of the code.

The symptom of the issue was that under a certain workloads the memory used by InnoDB kept growing infinitely, until OOM killer kicked in. It looked like a memory leak, but Valgrind wasn’t reporting any leaks and the issue was not reproducible on FreeBSD – it only happened on Linux (see Bug#57480). Especially the latest fact lead us to think that there is something in the InnoDB memory usage pattern that reveals a nasty side of the otherwise good-natured Linux’s memory manager.

It turned out to be an interesting memory fragmentation caused by a storm of malloc/free calls of various sizes. We had to track and analyze each call to malloc during the workload, including the code path that lead to it. We collected a huge set of analysis data – some code paths were executed many 10’000s of times! A hurricane of allocations and deallocations! We looked at the hottest ones hoping that some of them are not necessary, can be eliminated, avoided, minimized or stuck together. Luckily there were plenty of them!

After an extensive testing we did a numerous improvements, allocating the smallest chunks of the memory from the stack instead of from the heap, grouping allocations together where possible, removing unnecessary allocations altogether, estimating exactly how much memory will be consumed by a given operation and allocating it in advance and others and others and others.

This not only fixed Bug#57480 but improved InnoDB memory usage in general.

Note: the fix is not in the 5.6.4 release.


PlanetMySQL Voting: Vote UP / Vote DOWN

Better scaling of read-only workloads

Декабрь 20th, 2011

The problem and its cause

There have been several complaints over the years about InnoDB’s inability to scale beyond 256 connections. One of the main issues behind this scalability bottleneck was the read view creation that is required for MVCC (Multi Version Concurrency Control) to work. When the user starts a transaction this is what InnoDB does under the hood:

  • Create or reuse a transaction instance – usually it is reused, the transactions are reused from a pool (trx_sys_t::mysql_trx_list).
  • Initialize the transaction start time and assign a rollback segment
  • Append the transaction to an active  transaction list ordered on trx_t::id in descending order

The append to  the trx_sys_t::trx_list and corresponding remove during commit is covered by trx_sys_t::mutex. After the transaction is “started” and if the transaction has an isolation greater than or equal to REPEATABLE-READ then before the first record/row is accessed by the transaction, InnoDB creates a view (snapshot) of the running system state. It does this by examining the transactions that are active at the time of the MVCC snapshot, so that their changes can be excluded from the creating transaction’s read view. This read view creation is also covered by the trx_sys_t::mutex. As the number of active transactions in the system increases this read view creation takes longer and longer. This increases the wait times on the trx_sys_t::mutex (during transaction start and commit) and once threads are forced to wait on a condition variable (in contrast to simply spinning while waiting for the mutex) the system throughput drops dramatically.

The solution

While investigating this problem there were two observations that I made:

  • Read only transactions should not be considered in the MVCC snapshot
  • Auto commit non-locking read-only selects should not be in the trx_sys_t::trx_list at all

For the first to work we need to tag the transactions as READ ONLY when the transaction is started e.g.,

  •       START TRANSACTION READ ONLY;

I will not be discussing this functionality in this blog because the syntax for this doesn’t exist in MySQL (yet). However, the functionality exists in InnoDB to handle this case and is  in the 5.6.4 release. Once the above syntax exists, InnoDB can take advantage of the new syntax trivially. What I want to talk about is the second case. This special case can be detected by InnoDB using existing state information and handled transparently without any syntax change in user applications and is fully functional in the 5.6.4 release.

InnoDB transaction life cycle redesign

Split the trx_sys_t::trx_list (the active transactions list) into two, trx_sys_t::ro_trx_list and trx_sys_t::rw_trx_list. Only transactions that are in the trx_sys_t::rw_trx_list are taken into consideration when creating the MVCC snapshot. For a read-only heavy work load the benefits are obvious,  the smaller size of the RW active transaction list makes the read view creation for MVCC (and purge) very fast. For auto-commit read-only non-locking selects the additional benefit is that we don’t need to acquire the trx_sys_t::mutex at all because we don’t put them on the active list. This removes the bottleneck around the trx_sys_t::mutex and improves concurrency and scalability.

Auto-commit read-only non-locking transactions go from state NOT STARTED -> ACTIVE -> NOT STARTED, in contrast to locking read-only (and read-write) transactions which go from state NOT STARTED -> ACTIVE -> COMMIT TO MEMORY -> NOT STARTED. The additional advantage in skipping the COMMIT TO MEMORY state is that we know that they cannot acquire any locks and therefore it is pointless to acquire the lock_sys_t::mutex and attempt lock release. Also, during COMMIT because they are not on any active transaction list we don’t need to acquire the trx_sys_t::mutex to remove them from the list, improving concurrency and performance further.

Changes to transaction state visibility

Currently (5.6.4) doesn’t display the auto-commit read-only non-locking selects in “SHOW ENGINE INNODB STATUS“, however they are visible in the INFORMATION_SCHEMA.innodb_trx table. The innodb_trx table has two new fields that can be queried to determine whether  a transaction is tagged as a read-only transaction and additionally whether it qualifies for the special handling of auto-commit read-only non-locking select. The new fields are:

  • trx_read_only – 0 or 1 (INT)
  • trx_autocommit_non_locking – 0 or 1 (INT)

Some additional minor tweaks, read-only transactions aren’t assigned a rollback segment and if they are flagged as autocommit non-locking selects then we only set the start time once every 32 times, his reduces the overhead of a system call.

Test results

These tests were run by Alexey Stroganov (a.k.a Ranger) using Sysbench (with –oltp-skip-trx=on), wl6046 refers to the internal worklog number of this performance fix. Note how InnoDB scales easily up to 4K threads on the 24 host and 1.5K threads on the 8 core host, there is very little (negligible) drop in the TPS as the number of threads is cranked up. In particular note how the peak TPS on the 24 core host is nearly double :-)

Test POINT_SELECT
  • QUERY - SELECT c FROM sbtest WHERE id=N


Test SIMPLE_RANGES

  • QUERY – SELECT c FROM sbtest WHERE id BETWEEN N AND M

The same result for SIMPLE RANGES, at lower thread levels 5.6 has less TPS than 5.5 on the 8 core host however that is due to another known and unrelated issue that is currently being worked on actively.

Conclusion

This is a significant step forward in InnoDB read-only performance and will allow InnoDB to scale up to a very high number of concurrent queries and take advantage of greater number of processors with the improved parallelism. Note: mixed workloads with a read-only heavy component will also benefit from this improvement. Currently, only if the read-only transactions are auto-commit non-locking selects. The better news is that with this split of the active transaction list into two, we can now optimize deadlock detection, lock release and transaction scheduling around lock grants and waits, stay tuned!


PlanetMySQL Voting: Vote UP / Vote DOWN

InnoDB 5.6.4 supports databases with 4k and 8k page sizes

Декабрь 20th, 2011

In the 5.6.4 release it is now possible to create an InnoDB database with 4k or 8k page sizes in addition to the original 16k page size. Previously, it could be done by recompiling the engine with a different value for UNIV_PAGE_SIZE_SHIFT and UNIV_PAGE_SIZE. With this release, you can set –innodb-page-size=n when starting mysqld, or put innodb_page_size=n in the configuration file in the [mysqld] section where n can be 4k, 8k, 16k, or 4096, 8192, 16384.

The support of smaller page sizes may be useful for certain storage media such as SSDs. Performance results can vary depending on your data schema, record size, and read/write ratio. But this provides you more options to optimize your performance.

When this new setting is used, the page size is set for all tablespaces used by that InnoDB instance. You can query the current value with;

SHOW VARIABLES LIKE ‘innodb_page_size’;
or
SELECT variable_value FROM information_schema.global_status  WHERE LOWER(variable_name) = ‘innodb_page_size’;

It is a read-only variable while the engine is running since it must be set before InnoDB starts up and creates a new system tablespace. That happens when InnoDB does not find ibdata1 in the data directory. If you start mysqld with a page size other than the standard 16k, the error log will contain something like this;

111214 15:55:05 InnoDB: innodb-page-size has been changed from the default value 16384 to 4096.

If your system tablespace already exists using one page size and innodb-page-size is something else, the engine will not start. There will be a message logged like this;

111214 16:06:51 InnoDB: Error: data file .\ibdata1 uses page size 4096,
111214 16:06:51 InnoDB: but the start-up parameter is innodb-page-size=16384

InnoDB knows the page size used in an existing tablespace created by version 5.6.4 because it stamps that page size in the header page. But this is not readable to older engines, of course. If an older engine opens a database with a page size other than 16k, it will report a corrupted page and quit.

All features in InnoDB work the same with smaller page sizes. But some limits are affected. The maximum record size is proportionately less with smaller pages. Record length limits are calculated within InnoDB based on a variety of factors including row type, column type and length, number of columns, secondary indexes, index prefix lengths, and of course, the page size. The main record is stored in a clustered index and a minimum of two records must fit into each page. So the maximum record length for 8k pages is about half that of 16k pages and the maximum record size with 4k pages is about half that of 8k pages.

In addition to record lengths, the maximum key lengths are proportionately smaller. For 16k pages, MySQL prevents key definitions from containing over 3072 bytes of data. This means that the primary key in the clustered index cannot use more than 3072 bytes of data. Secondary indexes can only contain up to 3072 bytes of data in the key also. But within a secondary index, InnoDB must store the primary key as the record pointer.  So if you have a 3072 byte primary key and a 3072 byte secondary key, each entry in the secondary index contains 6144 bytes of data plus internal record overhead. The amount of internal overhead is dependent upon the column types, but this gets the secondary record close to half the page size which is its natural limit. So this limit of 3072 bytes per key is less than but close to what InnoDB would have to impose on a table by table basis.  Based on that, InnoDB now reports to MySQL that any key defined for 8k page sizes must be less than 1536 bytes.  Likewise, the maximum key length when InnoDB uses 4k page sizes is 768 bytes.

If you have a database schema with any large records or keys defined, you may not be able to use smaller page sizes. Even if your records do barely fit in the clustered index page, it may not be advisable to use these smaller pages because the btree will be a lot deeper.  For example, if only 2 records fit on the page and there are 1,000,000 records, leaf pages are 20 levels deep, meaning InnoDB will need to read 20 pages to find the leaf page.  If that were on 4k pages, then using the same table on 16k pages would give 8 records per page and the leaf pages would only be 7 levels down.

There is a trick to reducing the size of records on the clustered index page;
1) Use Dynamic or Compressed row format.
2) Convert VARCHAR fields to TEXT fields.  (VARBINARY can be converted to BLOB)

There are 4 ROW FORMATS in InnoDB. The first two, Redundant and Compact, which are considered the Antelope file version, store at least 768 bytes of each field in the clustered record.  The Barracuda file version consists of Compact and Dynamic ROW FORMATS. These have the ability to put all of a VARCHAR, VARBINARY, TEXT or BLOB field onto a separate BLOB page for storage.  So one good way to prepare a table structure to use smaller page sizes is to use Dynamic or Compressed row format.

The decision of how big a record will be in the clustered record is made during the INSERT or UPDATE.  Each record is evaluated for its own length.  If the record is too long to fit in half the page, InnoDB will look for the longest actual field in that record and put as much of that field off-page as possible based on the row type.  Then if it is still to long, it will shorten the longest field remaining.  Since this is done when the record is added to the page, different records may have different columns stored off-page when there are multiple long fields in the record.

VARCHAR/VARBINARY fields are treated like TEXT/BLOB fields if they are over 255 bytes long.  If you are using Compressed or Dynamic row format and your record is too long because you have too many VARCHAR fields 255 bytes or less, you can reduce the record length by converting them to TEXT fields. Likewise, VARBINARY fields 255 bytes or less can be converted to BLOB fields to take advantage of this ability to store the whole TEXT or BLOB field on a separate page.

A file extent in InnoDB is 1 Mb independent of the page size. So an extent will hold 64 16k pages, 128 8k pages and 256 4k pages.  This means that the read ahead mechanisms will read more pages with smaller page sizes since they read a whole extent at a time.  The doublewrite buffer, which is based on the size of an extent, will also contain more pages.

If you want to use smaller page sizes with existing data, export the data first with a logical export utility such as mysqldump. Then create the new mysql instance with innodb-page-size=4k or 8k and import the data. Do not use a physical export method such as alter table … discard tablespace.

Summary:

This feature makes it easier to try smaller page sizes in an InnoDB database. And with the 5.6.4 release, those smaller page sizes are fully supported by MySQL. Just export your data, move or delete the system database (ibdata1) and the log files (ib_logfile0 & ib_logfile1), set innodb-page-size to either 4k or 8k, and restart MySQL. A new InnoDB instance will be created with the smaller page size. Then you can import your data and run your tests, all without recompiling InnoDB.


PlanetMySQL Voting: Vote UP / Vote DOWN

InnoDB 5.6.4 supports databases with 4k and 8k page sizes

Декабрь 20th, 2011

In the 5.6.4 release it is now possible to create an InnoDB database with 4k or 8k page sizes in addition to the original 16k page size. Previously, it could be done by recompiling the engine with a different value for UNIV_PAGE_SIZE_SHIFT and UNIV_PAGE_SIZE. With this release, you can set –innodb-page-size=n when starting mysqld, or put innodb_page_size=n in the configuration file in the [mysqld] section where n can be 4k, 8k, 16k, or 4096, 8192, 16384.

The support of smaller page sizes may be useful for certain storage media such as SSDs. Performance results can vary depending on your data schema, record size, and read/write ratio. But this provides you more options to optimize your performance.

When this new setting is used, the page size is set for all tablespaces used by that InnoDB instance. You can query the current value with;

SHOW VARIABLES LIKE ‘innodb_page_size’;
or
SELECT variable_value FROM information_schema.global_status  WHERE LOWER(variable_name) = ‘innodb_page_size’;

It is a read-only variable while the engine is running since it must be set before InnoDB starts up and creates a new system tablespace. That happens when InnoDB does not find ibdata1 in the data directory. If you start mysqld with a page size other than the standard 16k, the error log will contain something like this;

111214 15:55:05 InnoDB: innodb-page-size has been changed from the default value 16384 to 4096.

If your system tablespace already exists using one page size and innodb-page-size is something else, the engine will not start. There will be a message logged like this;

111214 16:06:51 InnoDB: Error: data file .\ibdata1 uses page size 4096,
111214 16:06:51 InnoDB: but the start-up parameter is innodb-page-size=16384

InnoDB knows the page size used in an existing tablespace created by version 5.6.4 because it stamps that page size in the header page. But this is not readable to older engines, of course. If an older engine opens a database with a page size other than 16k, it will report a corrupted page and quit.

All features in InnoDB work the same with smaller page sizes. But some limits are affected. The maximum record size is proportionately less with smaller pages. Record length limits are calculated within InnoDB based on a variety of factors including row type, column type and length, number of columns, secondary indexes, index prefix lengths, and of course, the page size. The main record is stored in a clustered index and a minimum of two records must fit into each page. So the maximum record length for 8k pages is about half that of 16k pages and the maximum record size with 4k pages is about half that of 8k pages.

In addition to record lengths, the maximum key lengths are proportionately smaller. For 16k pages, MySQL prevents key definitions from containing over 3072 bytes of data. This means that the primary key in the clustered index cannot use more than 3072 bytes of data. Secondary indexes can only contain up to 3072 bytes of data in the key also. But within a secondary index, InnoDB must store the primary key as the record pointer.  So if you have a 3072 byte primary key and a 3072 byte secondary key, each entry in the secondary index contains 6144 bytes of data plus internal record overhead. The amount of internal overhead is dependent upon the column types, but this gets the secondary record close to half the page size which is its natural limit. So this limit of 3072 bytes per key is less than but close to what InnoDB would have to impose on a table by table basis.  Based on that, InnoDB now reports to MySQL that any key defined for 8k page sizes must be less than 1536 bytes.  Likewise, the maximum key length when InnoDB uses 4k page sizes is 768 bytes.

If you have a database schema with any large records or keys defined, you may not be able to use smaller page sizes. Even if your records do barely fit in the clustered index page, it may not be advisable to use these smaller pages because the btree will be a lot deeper.  For example, if only 2 records fit on the page and there are 1,000,000 records, leaf pages are 20 levels deep, meaning InnoDB will need to read 20 pages to find the leaf page.  If that were on 4k pages, then using the same table on 16k pages would give 8 records per page and the leaf pages would only be 7 levels down.

There is a trick to reducing the size of records on the clustered index page;
1) Use Dynamic or Compressed row format.
2) Convert VARCHAR fields to TEXT fields.  (VARBINARY can be converted to BLOB)

There are 4 ROW FORMATS in InnoDB. The first two, Redundant and Compact, which are considered the Antelope file version, store at least 768 bytes of each field in the clustered record.  The Barracuda file version consists of Compact and Dynamic ROW FORMATS. These have the ability to put all of a VARCHAR, VARBINARY, TEXT or BLOB field onto a separate BLOB page for storage.  So one good way to prepare a table structure to use smaller page sizes is to use Dynamic or Compressed row format.

The decision of how big a record will be in the clustered record is made during the INSERT or UPDATE.  Each record is evaluated for its own length.  If the record is too long to fit in half the page, InnoDB will look for the longest actual field in that record and put as much of that field off-page as possible based on the row type.  Then if it is still to long, it will shorten the longest field remaining.  Since this is done when the record is added to the page, different records may have different columns stored off-page when there are multiple long fields in the record.

VARCHAR/VARBINARY fields are treated like TEXT/BLOB fields if they are over 255 bytes long.  If you are using Compressed or Dynamic row format and your record is too long because you have too many VARCHAR fields 255 bytes or less, you can reduce the record length by converting them to TEXT fields. Likewise, VARBINARY fields 255 bytes or less can be converted to BLOB fields to take advantage of this ability to store the whole TEXT or BLOB field on a separate page.

A file extent in InnoDB is 1 Mb independent of the page size. So an extent will hold 64 16k pages, 128 8k pages and 256 4k pages.  This means that the read ahead mechanisms will read more pages with smaller page sizes since they read a whole extent at a time.  The doublewrite buffer, which is based on the size of an extent, will also contain more pages.

If you want to use smaller page sizes with existing data, export the data first with a logical export utility such as mysqldump. Then create the new mysql instance with innodb-page-size=4k or 8k and import the data. Do not use a physical export method such as alter table … discard tablespace.

Summary:

This feature makes it easier to try smaller page sizes in an InnoDB database. And with the 5.6.4 release, those smaller page sizes are fully supported by MySQL. Just export your data, move or delete the system database (ibdata1) and the log files (ib_logfile0 & ib_logfile1), set innodb-page-size to either 4k or 8k, and restart MySQL. A new InnoDB instance will be created with the smaller page size. Then you can import your data and run your tests, all without recompiling InnoDB.


PlanetMySQL Voting: Vote UP / Vote DOWN