It was a long time now that I wanted to run some benchmark tests to
understand better the surprises I've met in the past with Linux I/O
performance during MySQL benchmarks, and finally it happened last year,
but I was able to organize and present my results only now..
My
main questions were:
-
what is so different with various I/O schedulers in Linux (cfq, noop,
deadline) ?..
-
what is wrong or right with O_DIRECT on Linux ?..
-
what is making XFS more attractive comparing to EXT3/EXT4 ?..
There were already several posts in the past about impact on MySQL
performance when one or another Linux I/O layer feature was used (for ex.
Domas
about I/O
schedulers, Vadim
regarding
TPCC-like performance, and many other) - but I still did not find any
answer
WHY (for ex.) cfq I/O scheduler is worse than noop, etc,
etc..
So, I'd like to share here some answers to my WHY questions
;-))
(while for today I still have more questions than answers ;-))
Test
PlatformFirst of all, the system I've used for my tests:
-
HW server: 64 cores (Intel), 128GB RAM, running RHEL 5.5
-
the kernel is 2.6.18 - as it was until now the most common Linux
kernel used on Linux boxes hosting MySQL servers
-
installed filesystems: ext3, ext4, XFS
-
Storage: ST6140 (1TB on x16 HDD striped in RAID0, 4GB cache on
controller) - not a monster, but fast enough to see if the bottleneck
is coming from the storage level or not ;-))
Test PlanThen, my initial test plan:
-
see what the max possible Read/Write I/O performance I can obtain from
the given HW on the raw level (just RAW-devices, without any
filesystem, etc.) - mainly I'm interested here on the impact of Linux
I/O scheduler
-
then, based on observed results, setup more optimally each filesystem
(ext3, ext4, XFS) and try to understand their bottlenecks..
-
I/O workload: I'm mainly focusing here on the random reads and random
writes - they are the most problematic for any I/O related performance
(and particularly painful for databases), while sequential read/writes
may be very well optimized on the HW level already and hide any other
problems you have..
-
Test Tool: I'm using here my IObench
tool (at least I know exactly what it's doing ;-))
TESTING RAW DEVICESImplementation of raw devices in Linux
is quite surprising.. - it's simply involving O_DIRECT access to a block
device. So to use a disk in raw mode you have to open() it with O_DIRECT
option (or use "raw" command which will create an alias device in your
system which will always use O_DIRECT flag on any involved open() system
call). Using O_DIRECT flag on a file opening is disabling any I/O
buffering on such a file (or device, as device is also a file in UNIX ;-)
- NOTE: by default all I/O requests on block devices (e.g. hard disk) in
Linux are buffered, so if you'll start a kind of I/O write test on, say,
your /dev/sda1 - you'll obtain a kind of incredible great performance ;-))
as no data probably will not yet even reach your storage and in reality
you'll simply test a speed of your RAM.. ;-))
Now, what is "fun"
with O_DIRECT:
-
all your I/O requests (read, write, etc.) block size should be aligned
to 512 bytes (e.g. be multiplier of 512 bytes), otherwise your I/O
request is simply rejected and you get an error message.. - and
regarding to RAW devices it's quite surprising comparing to Solaris
for ex. where you're simply instead of /dev/dsk/device using
/dev/rdsk/device and may use any block size you want..
-
but it's not all.. - the buffer you're using within your system call
involving I/O request should also be allocated aligned to 512 bytes,
so mainly you have to allocate it via posix_memalign() function,
otherwise you'll also get an error.. (seems that during O_DIRECT
operations there is used some kind of direct memory mapping)
-
then, reading the manual: "The O_DIRECT flag on its own makes at an
effort to transfer data synchronously, but does not give the guarantees
of the O_SYNC that data and necessary metadata are transferred. To
guarantee synchronous I/O the O_SYNC must be used in addition to
O_DIRECT" - quite surprising again..
-
and, finally, you'll be unable to use O_DIRECT within your C code
until you did not declare #define _GNU_SOURCE
Interesting that the man page is also quoting Linus about O_DIRECT:
"The
thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances." Linus
But we
have to live with it ;-))
And if you need an example of C or C++
code, instead to show you the mine, there is a great dev page on
Fusion-io
site.
So far, what about my storage performance on the RAW
devices now?..
Test scenario on RAW devices:
-
I/O Schedulers: cfq, noop, deadline
-
Block size: 1K, 4K, 16K
-
Workload: Random Read, Random Write
NOTE: I'm using here 1K block size as the smallest "useful" size for
databases :-)) then 4K as the most aligned to the Linux page size (4K),
and 16K - as the default InnoDB block size until now.
Following
graphs are representing 9 tests executed one after one: cfq with 3
different block sizes (1K, 4K, 16K), then noop, then deadline. Each test
is running a growing workload of 1, 4, 16, 64 concurrent users (processes)
non-stop bombarding my storage subsystem with I/O requests.
Read-Only@RAW-device:
Observations
:
-
Random Read is scaling well for all Linux I/O Schedulers
-
Reads reported by application (IObench) are matching numbers reported
by the system I/O stats
-
1K reads are running slightly faster than 4K (as expected as it's a
"normal" disks, and transfer of a bigger data volume reducing an
overall performance, which is normal)..
Write-Only @RAW-device:
Observations
:
-
looking on the graph you may easily understand now what is wrong with
"cfq" I/O scheduler.. - it's serializing write operations!
-
while "noop" and "deadline" are continuing to scale with a growing
workload..
-
so, it's clear now WHY performance gains were observed by many people
on MySQL workloads by simply switching from "cfq" to "noop" or
"deadline"
To check which I/O scheduler is used for your storage device:
# cat
/sys/block/{DEVICE-NAME}/queue/scheduler
For ex. for "sda": # cat
/sys/block/sda/queue/scheduler
Then set "deadline" for "sda": #
echo deadline > /sys/block/sda/queue/scheduler
To set "deadline" as
default I/ scheduler for all your storage devices you may boot your system
with "elevator=deadline" boot option. Interesting that by default many
Linux systems used "cfq". All recent Oracle Linux systems are shipped with
"deadline" by default.
TESTING FILESYSTEMSAs
you understand, there is no more reasons to continue any further tests by
using "cfq" I/O scheduler.. - if on the raw level it's already bad, it
cannot be better due any filesystem features ;-)) (While I was also told
that in recent Linux kernels "cfq" I/O scheduler should perform much more
better, let's see)..
Anyway, my filesystem test scenario:
-
Linux I/O Scheduler: deadline
-
Filesystems: ext3, ext4, XFS
-
File flags/options: osync (O_SYNC), direct (O_DIRECT), fsync (fsync()
is involved after each write()), fdatasync (same as fsync, but calling
fdatasync() instead of fsync())
-
Block size: 1k, 4K, 16K
-
Workloads: Random Reads, Random Writes on a single 128GB file -
it's the most critical file access for any database (having a hot
table, or a hot tablespace)
-
NOTE: to avoid most of background effects of caching, I've
limited an available RAM for FS cache to 8GB only! (all other RAM was
allocated to the huge SHM segment with huge pages, so not swappable)..
Also, we have to keep in mind now the highest I/O levels observed on RAW
devices:
-
Random Read: ~4500 op/sec
-
Random Write: ~5000 op/sec
So, if for any reason Read or Write performance will be faster on any of
filesystems - it'll be clear there is some buffering/caching happening on
the SW level ;-))
Now, let me explain what you'll see on the
following graphs:
-
they are already too many, so I've tried to bring more data on each
graph :-))
-
there are 12 tests on each graph (x3 series of x4 tests)
-
each serie of tests is executed by using the same block size (1K, then
4K, then 16K)
-
within a serie of 4 tests there are 4 flags/options are used one after
one (osync, direct, fsync, fdatasync)
-
each test is executed as before with 1, 4, 16, 64 concurrent user
processes (IObench)
-
only one filesystem per graph :-))
So, let's start now with Read-Only results.
Read-Only @EXT3:
Observations
:
-
pretty well scaling, reaching 4500 reads/sec in max
-
on 1K reads: only "direct" reads are really reading 1K blocks, all
other options are involving reading of 4K blocks
-
nothing unexpected finally :-)
Read-Only @EXT4:
Observations
:
-
same as on ext3, nothing unexpected
Read-Only @XFS:
Observations
:
-
no surprise here either..
-
but there were one surprise anyway ;-))
While the results on Random Read workloads are looking exactly the same on
all 3 filesystems, there are still some difference in how the O_DIRECT
feature is implemented on them! ;-))
The following graphs are
representing the same tests, but only corresponding to execution with
O_DIRECT flag (direct). First 3 tests are with EXT3, then 3 with XFS, then
3 with EXT4:
Direct I/O & Direct I/O
Observations
:
-
the most important here the last graph showing here the memory usage
on the system during O_DIRECT tests
-
as you may see, only with XFS the filesystem cache usage is near zero!
-
while EXT3 and EXT4 are still continuing cache buffering.. - may be a
very painful surprise when you're expecting to use this RAM for
something else ;-))
Well, let's see now what is different on the Write Performance.
Write-Only
@EXT3:
Observations
:
-
the most worse performance here is with 1K blocks.. - as default EXT3
block size is 4K, on the 1K writes it involves a read-on-write (it has
to read 4K block first, then change corresponding 1K on changes within
it, and then write the 4K block back with applied changes..)
-
read-on-write is not happening on 1K when O_DIRECT flag is used: we're
really writing 1K here
-
however, O_DIRECT writes are not scaling at all on EXT3! - and it
explains me finally WHY I've always got a worse performance when tried
to use O_DIRECT flush option in InnoDB on EXT3 filesystem! ;-))
-
interesting that the highest performance here is obtained with O_SYNC
flag, and we're not far from 5000 writes/sec for what the storage is
capable..
Write-Only @EXT4:
Observations
:
-
similar to EXT3, but performance is worse comparing to EXT3
-
interesting that only with O_SYNC flag the performance is comparable
with EXT3, while in all other cases it's simply worse..
-
I may suppose here that EXT3 is not flushing on every
fsync() or fdatasync(), and that's why it's performing better with
these options ;-)) need to investigate here.. But anyway, the result
is the result..
What about XFS?..
Write-Only @XFS:
Observations
:
-
XFS results are quite different from those of EXT3 and EXT4
-
I've used a default setup of XFS here, and was curios to not
observe the impact of missed "nobarrier" option which was reported
by Vadim in the past..
-
on 1K block writes only O_DIRECT is working well, but in difference
from EXT3/EXT4 it's also scaling ;-) (other options are giving poor
results due the same read-on-write issue..)
-
4K block writes are scaling well with O_SYNC and O_DIRECT, but still
remaining poor with other options
-
16K writes are reporting some anomalies: while with O_SYNC nothing is
going wrong and it's scaling well, with O_DIRECT there is some kind of
serialization happened on 4 and 16 concurrent user processes.. - and
then on 64 users things then came back to the normal.. Interesting
that is was not observed with 4K block writes.. Which remains me the
last year discussion about page block size in InnoDB for SSD, and the
gain reported by using 4K page size vs 16K.. - just keep in mind that
sometimes it may be not related to SSD at all, but just to some
filesystem's internals ;-))
-
anyway, no doubt - if you have to use O_DIRECT in your MySQL server -
use XFS! :-)
Now, what is the difference between a "default" XFS configuration and
"tuned" ??..
I've recreated XFS with 64MB log size and
mounted with following options:
# mount -t xfs -o
noatime,nodiratime,nobarrier,logbufs=8,logbsize=32k
The results are
following..
Write-Only @XFS-tuned:
Observations
:
-
everything is similar to "default" config, except that there is no
more problem with 16K block size performance
-
and probably this 16K anomaly observed before is something random,
hard to say.. - but at least I saw it, so cannot ignore ;-))
Then, keeping in mind that XFS is so well performing on 1K block size, I
was curious to see if thing will not go even better if I'll create my XFS
filesystem with 1K block size instead of default 4K..
Write-Only
@XFS-1K:
Observations
:
-
when XFS is created with 1K block size there is no more read-on-write
issue on 1K writes..
-
and we're really writing 1K..
-
however, the performance is completely poor.. even on 1K writes with
O_DIRECT !!!
-
why?..
The answer is came from the Random Reads test on the same XFS, created
with 1K block size.
Read-Only @XFS-1K:
Observations
:
-
if you followed me until now, you'll understand everything from the
last graph, reporting RAM usage.. ;-))
-
the previously 8GB free RAM is no more free here..
-
so, XFS is not using O_DIRECT here!
-
and you may see also that for all reads except O_DIRECT, it's reading
4K for every 1K, which is abnormal..
Instead of SUMMARY
-
I'd say the main point here is - "test your I/O subsystem performance
before to deploy your MySQL server" ;-))
-
avoid to use "cfq" I/O scheduler :-)
-
if you've decided to use O_DIRECT flush method in your MySQL server -
deploy your data on XFS..
-
seems to me the main reason why people are using O_DIRECT with MySQL
it's a willing to avoid to deal with various issues of filesystem
cache.. - and there is probably something needs to be improved in the
Linux kernel, no? ;-)
-
could be very interesting to see similar test results on the other
filesystems too..
-
things may look better with a newer Linux kernel..
So far, I've got some answers to my
WHY questions.. Will be fine
now to get a time to test it directly with MySQL ;-)
Any comments
are welcome!
Rgds,
-Dimitri
PlanetMySQL Voting:
Vote UP /
Vote DOWN