Archive for the ‘hardware’ Category
Hardware Components Failures — Survey Results
May 10th, 2012
Should RAID 5 be used in a MySQL server?
April 2nd, 2012
Relearn about your battery
March 15th, 2012
Some days ago I got a call from our on-duty support engineer: MySQL on one of our database servers was lagging more than 1000 seconds behind in replication, and the server had been kicked out of the pool because of the delay. He was unable to find out why, and there was absolutely nothing in the MySQL log files. When I got the call it was still lagging behind, but the lag was slowly decreasing again.
After a quick peek in all our monitoring systems I isolated it to this message:
Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless!
While the battery relearns, the controller falls back from write-back to write-through caching, which is what slows the writes down. We had encountered this situation a couple of times before, but apparently there was no detection for it on this machine.
The relearn cycle happens every 90 days and is first scheduled when the machine is powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Luckily you can use omconfig to postpone the cycle by up to 7 days, but then you obviously need to have detection in place.
Why did nobody come up with the idea of a dual battery-backed cache with alternating relearn cycles? That way the battery could relearn without the controller going back into write-through mode.
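As a rough idea of what such detection could look like, here is a minimal Python sketch that parses a monitoring line shaped like the one quoted above and flags battery states that deserve a closer look. The regular expression is modelled on that single message only, and both the `SUSPECT_STATES` list and the `check_battery` helper are assumptions made for illustration, not part of any Dell or monitoring tool.

```python
import re

# Pattern modelled on the one message quoted in the post; other firmware or
# monitoring versions may word it differently (assumption).
BATTERY_RE = re.compile(
    r"Cache Battery (?P<battery>\d+) in controller (?P<controller>\d+) "
    r"is (?P<state>\w+)(?: \((?P<detail>[^)]*)\))?"
)

# Hypothetical list of states during which the write-back cache may be off.
SUSPECT_STATES = {"Charging", "Learning", "Degraded", "Failed"}


def check_battery(message):
    """Return a dict describing the battery state, or None if no match."""
    match = BATTERY_RE.search(message)
    if match is None:
        return None
    state = match.group("state")
    return {
        "battery": int(match.group("battery")),
        "controller": int(match.group("controller")),
        "state": state,
        "detail": match.group("detail"),
        # Alert on the state itself instead of trusting "probably harmless".
        "alert": state in SUSPECT_STATES,
    }


if __name__ == "__main__":
    line = "Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]"
    print(check_battery(line))
```

Hooking the `alert` flag into whatever pages the on-call engineer would have pointed at the battery long before the replication lag did.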
Tagged: battery relearn, hardware, mysql, omconfig, omreport, poweredge r710, replication
Got open source cloud storage? Red Hat buys Gluster
October 6th, 2011
Red Hat’s $136m acquisition of open source storage vendor Gluster marks Red Hat’s biggest buy since JBoss and starts the fourth quarter with a very interesting deal. The acquisition is definitely good for Red Hat since it bolsters its Cloud Forms IaaS and OpenShift PaaS technology and strategy with storage, which is often the starting point for enterprise and service provider cloud computing deployments. The acquisition also gives Red Hat another weapon in its fight against VMware, Microsoft and others, including OpenStack, of which Gluster is a member (more on that further down). The deal is also good for Gluster, given the sizeable price Red Hat is paying for the provider of open source, software-based, scale-out storage for unstructured data, and also as validation of both open source and software in today’s IT and cloud computing storage.
This is exactly the kind of disruption we’ve been seeing and expecting as Linux vendors compete with new rivals in virtualization, cloud computing and different layers of the stack, including storage (VMware, Microsoft, OpenStack, Oracle, Amazon and others), as covered in our recent special report, The Changing Linux Landscape.
While the deal makes perfect sense for both Red Hat and for Gluster, it also has implications for the white hot open source cloud computing project OpenStack. There was no mention of OpenStack in Red Hat’s FAQ on the deal, but there was a reference to ongoing support for Gluster partners, of which there are many fellow OpenStack members. OpenStack was also highlighted among Gluster’s key open standards participation along with the Linux Foundation and Red Hat-led Open Virtualization Alliance oriented around KVM. Sources at both Gluster and Red Hat, which point to OpenStack support being bundled into Red Hat’s coming Fedora 16, also reiterated to me Red Hat is indeed planning to continue involvement with OpenStack around the Gluster technologies. I suspect Red Hat is looking to leverage Gluster more for its own purposes than for OpenStack’s, but I must also acknowledge Red Hat’s understanding of the value of openness, community and compatibility. Taking that idea a step further, Gluster may represent a way that Red Hat can integrate with and tap into the OpenStack community by blending it with its own community around Fedora, RHEL, JBoss, RHEV and Cloud Forms and OpenShift.
The deal also leads many to wonder what may be next for Red Hat in terms of acquisitions. We’ve long thought database and data management technologies were areas where we might see Red Hat building out. This was also the subject of renewed rumors recently, and we believe it might still be an attractive piece for Red Hat given the open source opportunities and targets around NoSQL technologies such as the Apache Hadoop distributed data management framework and the Cassandra distributed database management software. We’ve also believed systems management to be a potential place for Red Hat to further expand. Given its need to largely stay within open source, we would expect targets in this area to include GroundWork Open Source, which joins Linux and Windows systems in its monitoring and management, and Zenoss, which works with Cisco and Red Hat rival VMware in monitoring and managing systems with its open source software. Another potential target that would increase Red Hat’s depth in open source virtualization and cloud computing is Convirture, which might also be an avenue for Red Hat to reach out to midmarket and SMB customers and channel players. Red Hat was among the non-OpenStack members we listed as potential acquirers when considering the M&A possibilities (451 subscribers) out of OpenStack.
Given its recent quarterly earnings report and topping the $1 billion annual revenue mark, Red Hat seems again to be bucking the bad economy. We’ve written before in 2008 and more recently how bad economic conditions can be good for open source software. Red Hat is atop the list of open source vendors that suffer as traditional, enterprise IT customers such as banks freeze spending or worse, fail. However, the company’s deal for Gluster is yet another sign it is thriving and expanding despite economic difficulty and uncertainty.
You don’t have to just look at Red Hat’s earnings or take our word for it. On Jim Cramer’s ‘Mad Money’ this week, we heard Red Hat CEO Jim Whitehurst praised for Red Hat’s performance and traction in the very areas where most companies and many economists are throwing the blame: financial services, government and Europe. Cramer credited Red Hat for a ‘spectacular quarter’ and allowed Whitehurst to tout the benefits of the Gluster technology and acquisition, particularly Gluster’s software-based storage technology that matches cloud computing. It was quite a contrast to the news out of Oracle Open World, where hardware was a focal point.
CodeBits — An event of competitive innovation
August 15th, 2011
It was my pleasure and privilege to attend Codebits in 2009. As Roland Bouman says, its talk choice method is based on public voting, and therefore everyone can contribute to the schedule.
But that is not the main reason for attending this extraordinary event. It is not just a conference; it's an innovation fest. For a day and a half it's a conference, where the speakers are encouraged to bring their audience the most innovative and inspiring talks. In the afternoon of the second day, the event becomes a competition: the registered teams have 24 hours to bring a project to completion, and they have to start and finish within the allotted time.
The project can be anything, and I have seen quite a lot of exciting stuff rolling live in the huge pavilion. I could hardly ignore robotics, as these little mechanical smurfs were running all over the place and you had to be careful not to squash them when you walked. There were plenty of occasions for planning great projects, together with attempts at improving social relations, and mixing up with big brother. There were projects based on 3D printing, and less broad projects like all-seasons keyboards.
A very popular session, followed by practical workshops, was lock picking. I attended one of them, learned how to pick simple and less simple locks, and brought home some lockpicking tools.
On a more technical level, I was there with Lenz Grimmer and Kai Seidler; we spoke about MySQL and other cool things, and we had lots of fun for three days.
Besides the teams hacking away at their projects, there were several teams showcasing technology developed by winners of previous years, such as 3D television and intelligent phone networks. In short, this was an inspiring event, which I can warmly recommend.
Aligning IO on a hard disk RAID – the Benchmarks
June 9th, 2011
In the first part of this article I showed how I align IO. Now I want to share the results of the benchmark I have been running to see how much benefit we can get from proper IO alignment on a 4-disk RAID1+0 with a 64k stripe element. I haven’t run any benchmarks in a while, so be careful with my results and forgiving of my mistakes.
The environment
Here is the summary of the system I have been running this on (for brevity I have removed some irrelevant information):
# Aspersa System Summary Report ##############################
Platform | Linux
Release | Ubuntu 10.04.2 LTS (lucid)
Kernel | 2.6.32-31-server
Architecture | CPU = 64-bit, OS = 64-bit
# Processor ##################################################
Processors | physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Speeds | 24x1600.000
Models | 24xIntel(R) Xeon(R) CPU X5650 @ 2.67GHz
Caches | 24x12288 KB
# Memory #####################################################
Total | 23.59G
...
Locator Size Speed Form Factor Type Type Detail
========= ======== ================= ============= ============= ===========
DIMM_A1 4096 MB 1333 MHz (0.8 ns) DIMM {OUT OF SPEC} Other
...
# Disk Schedulers And Queue Size #############################
sda | [deadline] 128
# RAID Controller ############################################
Controller | LSI Logic MegaRAID SAS
Model | MegaRAID SAS 8704EM2, PCIE interface, 8 ports
Cache | 128MB Memory, BBU Present
BBU | 100% Charged, Temperature 34C, isSOHGood=
VirtualDev Size RAID Level Disks SpnDpth Stripe Status Cache
========== ========= ========== ===== ======= ====== ======= =========
0(no name) 1.088 TB 1 (1-0-0) 2 2-2 64 Optimal WT, RA
PhysiclDev Type State Errors Vendor Model Size
========== ==== ======= ====== ======= ============ ===========
Hard Disk SAS Online 0/0/0 SEAGATE ST3600057SS 558.911
Hard Disk SAS Online 0/0/0 SEAGATE ST3600057SS 558.911
Hard Disk SAS Online 0/0/0 SEAGATE ST3600057SS 558.911
Hard Disk SAS Online 0/0/0 SEAGATE ST3600057SS 558.911
The report says the controller cache is set to write-through (WT), but in fact I ran every benchmark twice, with (a) write-through and (b) write-back, to see whether write-back cache would minimize the effects of misalignment.
The file system of choice was XFS. Barriers and the physical disk cache were disabled. The tool I used was sysbench 0.4.10, which came with this Ubuntu system. I ran every fileio benchmark and an IO-bound read-write oltp benchmark in autocommit mode.
File IO benchmark
For the FileIO benchmark, I used 64 files totaling 1GB, 4GB and 16GB in size, with 1, 4 and 8 threads. The operations were done in 16kB units to mimic InnoDB pages. I ran into a couple of interesting surprises:
1. After I got (what I thought was) the best configuration, I added LVM on top of it and performance improved by another 20-40%. It took me a while to figure out, but here’s what happened: for the XFS file system on a raw partition I was using the full partition, which was slightly over 1TB in size. When I added LVM on top, however, I made the logical volume slightly smaller than 1TB. Investigating this, I found that 32-bit XFS inodes (which are used by default) have to live in the first terabyte of the device, which seems to have affected performance here (IMO because of where the first data extents were placed in this case). When I mounted the partition with the inode64 option, the effect disappeared and performance without LVM was slightly better than with LVM, as expected. I had to redo all of the benchmarks to get the numbers right.
2. I was running vmstat during one of the tests and my eye caught a spike in OS buffers during the “prepare” phase of sysbench. I found out that sysbench does not honor --file-extra-flags during the “prepare” phase: instead of being created using direct IO, the files were buffered in the OS cache, so writes to the files were serialized until they were fully overwritten and thereby flushed from the OS buffers. Buffers would be flushed within the first few seconds, so the effects of this were marginal. Alexey Kopytov fixed this in the sysbench trunk immediately, but I didn’t want to recompile sysbench on this system, so I used Domas’ uncache after prepare to make sure the caches were clean.
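To make the size of the test matrix concrete, here is a minimal Python sketch that builds sysbench 0.4-style fileio command lines for the combinations described above (64 files, 16kB blocks, three total sizes, three thread counts). The exact flags I passed are not reproduced in this post, so the mode list, the use of `--file-extra-flags=direct` in the run phase and the helper names are assumptions; run duration and request limits are deliberately left out.

```python
from itertools import product

# Values taken from the text above.
TOTAL_SIZES = ["1G", "4G", "16G"]
THREADS = [1, 4, 8]
# Assumed mode list: sequential/random reads and writes, as discussed in the results.
MODES = ["seqrd", "seqwr", "rndrd", "rndwr"]


def fileio_command(stage, size, threads=1, mode="rndrd"):
    """Build one sysbench 0.4 fileio command as an argument list."""
    return [
        "sysbench",
        "--test=fileio",
        "--file-num=64",
        f"--file-total-size={size}",
        "--file-block-size=16384",      # 16kB, to mimic InnoDB pages
        f"--file-test-mode={mode}",
        "--file-extra-flags=direct",    # ignored by "prepare" in 0.4.10, per the post
        f"--num-threads={threads}",
        stage,                          # "prepare", "run" or "cleanup"
    ]


def benchmark_matrix():
    """Yield one "run" command per size/threads/mode combination."""
    for size, threads, mode in product(TOTAL_SIZES, THREADS, MODES):
        yield fileio_command("run", size, threads, mode)


if __name__ == "__main__":
    for cmd in benchmark_matrix():
        print(" ".join(cmd))
```

Each command would still need a matching prepare step per total size, with the OS cache dropped afterwards as described in point 2.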
OLTP benchmark
As the goal was to compare performance with different IO alignment, not different MySQL configurations, I didn’t try out different MySQL versions or settings. Moreover, I was running these benchmarks for a customer, so I just used the settings they would have used anyway. The one thing I did change was the InnoDB buffer pool, which I reduced significantly to make sure the benchmark was IO bound.
That said, the benchmark ran on Percona Server 5.0.92-87 with the following my.cnf configuration:
[mysqld]
datadir=/data/mysql
socket=/var/run/mysqld/mysqld.sock
innodb_file_per_table = true
innodb_data_file_path = ibdata1:10M:autoextend
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 8M
innodb_buffer_pool_size = 128M
innodb_log_file_size = 64M
innodb_log_files_in_group = 2
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_io_capacity = 200
port = 3306
back_log = 50
max_connections = 2500
max_connect_errors = 10
table_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
thread_cache_size = 32
query_cache_size = 0
tmp_table_size = 64M
key_buffer_size = 8M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
skip-grant-tables
The table had 20M rows, transactions were not used (autocommit), and the number of threads was 1, 4, 8, 16 and 32.
Benchmark scenarios
Here are the different settings I ran the same benchmark on. As I mentioned earlier, each of them was run twice: first with the RAID controller cache set to Write-Through and then to Write-Back.
1. Baseline: misalignment in the partition table, no LVM and no alignment settings in the file system. This is what you would often get on RHEL5, Ubuntu 8.04 or similar “older” systems if you did nothing about IO alignment.
2. Misalignment in the partition table, but proper alignment options on the file system. This is what we get when the file system tries to balance writes but is not aware that it is not aligned to the beginning of the stripe element.
3. 1M alignment in the partition table but no options on the file system. You should get this on RHEL6, Ubuntu 10.04 and similar systems if you do nothing about IO alignment yourself. In this case the offset is correct, but the file system is unaware of how to align files properly.
4. Partition table and file system properly aligned; sunit/swidth set during mkfs. No LVM at this point.
5. Partition table aligned properly; sunit/swidth set at mount time but not during mkfs. This is your best option if you have proper alignment in the partition table but did not set alignment options in xfs when creating it, and you don’t want to or can’t reformat the file system. One thing to note, however: files that were written before this was set may still be unaligned, though xfs defragmentation may be able to fix that (not verified).
6. Added LVM on top of the aligned partition table, with proper file system alignment.
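The arithmetic behind these scenarios can be sketched in a few lines of Python. A 4-disk RAID10 has two data-bearing mirror pairs, so with a 64k stripe element the XFS geometry is su=64k, sw=2 at mkfs time, or sunit=128 and swidth=256 (in 512-byte units) at mount time. The helper names are made up for illustration, and the sector-63 start used for the misaligned case is an assumption based on the common legacy default, since the exact offset of the baseline is not stated here.

```python
SECTOR_BYTES = 512


def stripe_geometry(disks, stripe_element_kb):
    """Stripe geometry for a RAID10 array (half the disks carry unique data)."""
    data_disks = disks // 2
    su_bytes = stripe_element_kb * 1024
    sunit = su_bytes // SECTOR_BYTES          # mount option unit: 512-byte sectors
    return {
        "su_bytes": su_bytes,                 # mkfs.xfs -d su=<bytes>
        "sw": data_disks,                     # mkfs.xfs -d sw=<data disks>
        "sunit_sectors": sunit,               # mount -o sunit=<sectors>
        "swidth_sectors": sunit * data_disks, # mount -o swidth=<sectors>
    }


def is_aligned(start_sector, su_bytes):
    """True if a partition starting at start_sector begins on a stripe element."""
    return (start_sector * SECTOR_BYTES) % su_bytes == 0


if __name__ == "__main__":
    geo = stripe_geometry(disks=4, stripe_element_kb=64)
    print(geo)
    # Assumed legacy start (sector 63) versus a 1M offset (sector 2048).
    for start in (63, 2048):
        print(start, "aligned" if is_aligned(start, geo["su_bytes"]) else "misaligned")
```

A partition starting at sector 63 begins 32256 bytes into the device, in the middle of a 64k stripe element, so a 16kB page can straddle two disks; a start at sector 2048 (1M) falls exactly on a stripe element boundary.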
Benchmark results
I had a hard time deciding how best to present the results so that they are not too crowded and are actually interesting. I decided that instead of preparing charts for each benchmark, I’ll describe a few of the less interesting numbers first and then show graphs for the more interesting results. Let me know if you think this was a bad idea.
File IO benchmark results
Sequential read results are, as expected, the least interesting. Read-ahead kicked in immediately, giving ~9,600 iops (~150MB/s) at 1 thread, ~14,500 iops (~230MB/s) at 4 threads and ~16,300 iops (~250MB/s) at 8 threads. Neither IO alignment nor file size made any difference. Adding LVM here reduced single-thread performance by 5-10%.
Sequential write results were a bit more interesting. With WT (write-through) cache enabled, performance was really poor across the board, and there was virtually no difference between 1, 4 and 8 threads. Different file sizes made no difference either. Write-back cache gave an incredible performance boost: up to 33x in the single-threaded workload. File system IO alignment seems to have made a difference of up to 15% with write-back cache enabled. Here’s 1GB seqwr with WT cache:

Here’s same test with WB cache:

And just to show you the difference between sequential writes with WT cache and WB cache:

Random read. This is probably the most interesting number for an OLTP workload, which is usually light on writes (especially if there’s a BBU-protected Write-Back cache) and heavy on random reads. Regardless of the file size, the difference between aligned and misaligned reads was the same, and WT versus WB cache of course made no difference at all. Here are the results:

As you can see IO alignment makes a difference here and improves performance up to 15% in case of 8 threads running concurrently. Because the customer was running a database which was way bigger than 16G, I’ve repeated the random read (and write) benchmark with 8 threads and total size of 256G. While the number of operations per second was slightly lower, the difference was still 15% — 909 iops unaligned -vs- 1049 aligned.
Random write. This is an important metric for write intensive workloads where there’s a lot of data being modified, inserts are done to random positions (not consecutive PK causing page splits) etc. Benchmark results are fairly consistent regardless of file size, let’s look at them. First, results with WT cache:

And here’s with WB cache:

Apparently proper IO alignment in this case gives up to 23% improvement when WB cache is used. With WT cache enabled, the single-thread improvement is marginal. WB cache, however, brings single-thread random write performance close to what 8 threads can do, and IO alignment gives an extra 23% on top of that.
As mentioned, I also ran a single test on larger files (the same one I did for random reads), i.e. an 8-thread random write benchmark on files totaling 256GB. With WB cache enabled, I got 919 iops unaligned and 1127 iops aligned, i.e. the improvement is still 23%.
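For anyone who wants to check the percentages quoted for the 256GB runs, the calculation is just the relative gain of the aligned result over the unaligned one. A tiny Python sketch (the function name is made up for illustration):

```python
def improvement_pct(unaligned_iops, aligned_iops):
    """Relative gain of the aligned result over the unaligned one, in percent."""
    return (aligned_iops - unaligned_iops) / unaligned_iops * 100.0


if __name__ == "__main__":
    # 256GB random read: 909 iops unaligned versus 1049 aligned
    print(round(improvement_pct(909, 1049), 1))   # 15.4
    # 256GB random write with WB cache: 919 iops unaligned versus 1127 aligned
    print(round(improvement_pct(919, 1127), 1))   # 22.6
```

Both figures round to the 15% and 23% quoted in the text.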
OLTP benchmark results
From this benchmark, I only have two graphs to show you. First one is with RAID controller set to WT cache:

The second is with WB cache:

I couldn’t figure out what exactly happened with setting #3 when WB cache was disabled. What I do know, based on the IO stats I was gathering during the benchmarks, is that the cause was in fact a lower number of IO operations and a higher response time, so it seems that in this case misaligned IO had some collateral effects in a mixed read/write environment. Note that the benchmarks were all scripted and the oltp benchmarks started automatically after the file tests, so if there had been an error in the setting, it would have shown up across all other benchmarks for the same setting.
Summary
For the two workloads that are most relevant to databases, random reads and random writes, IO alignment on a 4-disk RAID10 with the standard 64k stripe element size makes a significant difference. When I launched the system I had been benchmarking, I could clearly see the difference in production, as I had another machine running alongside it with the same hardware but with misaligned IO. Here are the diskstats from the two shards running side by side:
Aligned:
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg
{540} dm-0 447.1 34.0 7.4 0% 2.4 5.4 23.4 49.6 0.6 0% 0.0 0.6 85% 0
Misaligned:
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg
{925} dm-0 462.1 34.1 7.7 0% 3.8 8.2 12.1 87.0 0.5 0% 0.0 0.7 93% 0
While the number of operations from the OS perspective is very similar, the response time in the first (aligned) case is significantly better, and the difference shows because of the high concurrency.
It would be interesting, however, to run similar benchmarks on a larger RAID5 system, where alignment should make an even bigger difference on writes. Another interesting setup might be a [mirrored] RAID0 with many more stripes, as not having proper file system alignment should have really interesting effects there. A large stripe, on the other hand, should somewhat reduce the effects of misalignment, though it would definitely be interesting to run benchmarks and verify that. If you have some numbers to share, please leave a comment. Next, I plan to look at IO alignment on Flash cards to see what benefits we can get there from proper alignment.
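The two diskstats samples can be compared programmatically. The Python sketch below parses the header and the two rows shown above, then applies Little's law (concurrency = arrival rate × response time) to the read columns. It assumes rd_rt is reported in milliseconds, which the post does not state; the fact that the computed values land on the reported rd_cnc column suggests the assumption is reasonable. The helper names are made up for illustration.

```python
HEADER = ("#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt "
          "wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg")
ALIGNED = "{540} dm-0 447.1 34.0 7.4 0% 2.4 5.4 23.4 49.6 0.6 0% 0.0 0.6 85% 0"
MISALIGNED = "{925} dm-0 462.1 34.1 7.7 0% 3.8 8.2 12.1 87.0 0.5 0% 0.0 0.7 93% 0"


def parse_row(header, line):
    """Map column names from the header onto the values of one sample line."""
    names = header.lstrip("#").split()
    return dict(zip(names, line.split()))


def implied_read_concurrency(row):
    """Little's law: reads/s times response time (assumed to be in ms)."""
    return float(row["rd_s"]) * float(row["rd_rt"]) / 1000.0


if __name__ == "__main__":
    aligned = parse_row(HEADER, ALIGNED)
    misaligned = parse_row(HEADER, MISALIGNED)
    for name, row in (("aligned", aligned), ("misaligned", misaligned)):
        print(name, row["rd_s"], "reads/s,", row["rd_rt"], "ms,",
              "implied concurrency", round(implied_read_concurrency(row), 1))
    slowdown = float(misaligned["rd_rt"]) / float(aligned["rd_rt"])
    print("read response time ratio:", round(slowdown, 2))
```

With nearly the same read rate (447.1 versus 462.1 reads/s), the misaligned shard needs about 1.5 times as long per read and keeps correspondingly more requests in flight.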
You can find scripts and plain data here on our public wiki.
dbbenchmark.com – now supporting MySQL on OSX 10.6
August 29th, 2010
Just a quick note to let everyone know that our new benchmarking script now supports OSX 10.6 on Intel hardware. That means you can run one simple command and get all of the sequential and random INSERT and SELECT performance statistics for your database. As usual, the script is open source and released under the new BSD license. Give it a try by downloading it now! See the download page for more details.
dbbenchmark.com – Benchmarking script now available
August 28th, 2010
You can download the first release of the benchmarking script here: http://code.google.com/p/dbbenchmark/
Please read the README file or consult the Support page before running the benchmarks.