Archive for the ‘storage’ Category

Running out of disk space on MySQL partition? A quick rescue.

May 7th, 2012

No space left on device – this can happen to anyone. Sooner or later you may face a situation where a database has either already run out of disk space or is only minutes away from doing so. In such cases many people just start looking for semi-random things to remove: perhaps a backup, a few older log files, or pretty much anything that seems redundant. However, this means acting under a lot of stress and without much thinking, so it would be great if there were a way to avoid it. Often there is. And what if there isn’t anything to remove?

While xfs is usually the recommended filesystem for a MySQL data partition on Linux, the extended filesystem family continues to be very popular, as it is the default in all major Linux distributions. There is a feature specific to ext3 and ext4 that can help resolve a full-disk situation.

Unless explicitly changed during filesystem creation, both reserve five percent of a volume’s capacity for the superuser (root) by default. This helps prevent non-privileged processes from filling up the entire disk and leaving no room for system logging or system applications. However, such a reservation only makes sense for system volumes, while MySQL often sits on its own dedicated partition, so there is no real reason to keep any blocks away from it.

So if your database files are stored on a partition formatted with ext3 or ext4 and MySQL runs out of disk space, you may be in luck: there may be some extra capacity the database can use.

How to enable it?

I had a server that ran out of space on the MySQL volume. The system was reporting 5.7M free and MySQL was essentially blocked, waiting for the opportunity to complete its writes:

[root@db4 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_centos-lv_root
                       30G   25G  3.4G  89% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/sda1             243M   47M  183M  21% /boot
/dev/mapper/vg_centos-lv_mysql
                      145G  145G  5.7M 100% /vol/mysql

A quick verification of the filesystem used for that volume:

[root@db4 ~]# mount
[..]
/dev/mapper/vg_centos-lv_mysql on /vol/mysql type ext4 (rw,noatime,nodiratime)

As the next step, I had to verify whether the volume had any reserved blocks that could be freed. I have not seen many servers where the default setting was changed during installation, so in many cases there should be something:

[root@db4 ~]# dumpe2fs /dev/mapper/vg_centos-lv_mysql | grep 'Reserved block count'
dumpe2fs 1.41.12 (17-May-2010)
Reserved block count: 1927884

It turned out that 1927884 blocks of 4KB each were reserved for the superuser, which was exactly five percent of the volume capacity. I was able to free this space and make it available to MySQL:

[root@db4 ~]# tune2fs -m 0 /dev/mapper/vg_centos-lv_mysql
tune2fs 1.41.12 (17-May-2010)
Setting reserved blocks percentage to 0% (0 blocks)

This works instantaneously. Applications simply start to see more disk space available:

[root@db4 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_centos-lv_root
                       30G   25G  3.4G  89% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/sda1             243M   47M  183M  21% /boot
/dev/mapper/vg_centos-lv_mysql
                      145G  145G  7.3G  95% /vol/mysql

Without removing a single file I managed to create over seven gigabytes of free space, which allowed MySQL to resume operations. That didn’t solve the problem entirely, but it bought me a lot of time to figure out a long-term solution.
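As a cross-check on that figure, the reserved block count reported by dumpe2fs can be converted into a size by hand. The sketch below is plain arithmetic only: reserved_mib is a helper name made up for this illustration, and the 4096-byte block size is taken from the text above (on another volume it would come from the "Block size" line of dumpe2fs -h).

```shell
#!/bin/sh
# reserved_mib BLOCKS BLOCK_SIZE
# Hypothetical helper: converts a reserved block count into whole MiB
# (integer division, so the result is rounded down).
reserved_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

# Values from the dumpe2fs output above: 1927884 blocks of 4096 bytes.
reserved_mib 1927884 4096    # prints 7530
```

7530 MiB is roughly 7.35 GiB, which is consistent with the free space df showed once the reservation was removed.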

The method is a quick remedy for the emergency situation when your database runs out of disk space. I have used it numerous times while helping people solve such problems. Of course, it is not a proper solution, but rather something that buys you time to figure out the options. It also only works once in a system’s lifetime: after you remove the entire reservation, there is nothing left to release if the server runs out of space a second time. So make sure you never face this problem twice. Learn your lesson and implement proper monitoring that alerts you early enough.
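What "proper monitoring" looks like depends on your tooling, but even a small cron-driven check is better than nothing. The following is a minimal sketch, not a replacement for a real monitoring system: the function names, the 85 percent threshold and the /vol/mysql mount point in the usage note are assumptions chosen for illustration.

```shell
#!/bin/sh
# usage_pct MOUNTPOINT
# Print the Use% value of a mounted filesystem as a bare number,
# using the POSIX output format of df (-P) so the columns are stable.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold PCT LIMIT
# Succeed when PCT is at or above LIMIT.
over_threshold() {
    [ "$1" -ge "$2" ]
}

# check_mount MOUNTPOINT LIMIT
# Print a warning on stderr when the filesystem is at or above LIMIT percent.
check_mount() {
    pct=$(usage_pct "$1")
    if over_threshold "$pct" "$2"; then
        echo "WARNING: $1 is ${pct}% full (limit ${2}%)" >&2
    fi
}
```

Calling check_mount /vol/mysql 85 from cron every few minutes, and mailing whatever it prints, would have raised the alarm long before the volume reached 100%.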


PlanetMySQL Voting: Vote UP / Vote DOWN

Now available: Slides from Percona Live and Linuxcon Europe

November 1st, 2011

The slides from last week’s talks I (co-)presented at Percona Live and Linuxcon Europe are now available from our web site.

All slides are available entirely free of charge for logged-in users on our web site. To log in, you don’t even need to register — just use your Google Profile, or Google Apps account, or your WordPress account, or anything else that uses OpenID, and you’ll be good to go.

Comments on our slides are, of course, always highly appreciated.




zfs FileSystem and MySQL

August 17th, 2011



ZFS is a new kind of 128-bit file system that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability. ZFS is not an incremental improvement to existing technology; it is a fundamentally new approach to data management. ZFS was first introduced in Solaris in 2004 and is the default filesystem in OpenSolaris. Linux ports are underway, Apple shipped OS X 10.5 Leopard with limited ZFS capability (Apple shut the project down afterward), and it will be included in FreeBSD 7.

ZFS Features:
  • Pooled Storage Model
  • Always consistent on disk
  • Protection from data corruption
  • Live data scrubbing
  • Instantaneous snapshots and clones
  • Portable snapshot streams
  • Highly scalable
  • Built in compression
  • Simplified administration model

Pooled Storage Model: ZFS presents a pooled storage model that completely eliminates the concept of volumes and the associated problems of partitions, provisioning, wasted bandwidth and stranded storage. Thousands of file systems can draw from a common storage pool, each one consuming only as much space as it actually needs. The combined I/O bandwidth of all devices in the pool is available to all file systems at all times.





Always consistent on disk: All operations are copy-on-write transactions, so the on-disk state is always valid. Every block is checksummed to prevent silent data corruption, and the data is self-healing in replicated (mirrored or RAID) configurations. If one copy is damaged, ZFS detects it and uses another copy to repair it.

Protection from data corruption: ZFS introduces a new data replication model called RAID-Z. It is similar to RAID-5 but uses variable stripe width to eliminate the RAID-5 write hole (stripe corruption due to loss of power between data and parity updates). All RAID-Z writes are full-stripe writes. There's no read-modify-write tax, no write hole, and — the best part — no need for NVRAM in hardware. ZFS loves cheap disks.

Live data scrubbing: But cheap disks can fail, so ZFS provides disk scrubbing. Similar to ECC memory scrubbing, all data is read to detect latent errors while they're still correctable. A scrub traverses the entire storage pool to read every data block, validates it against its 256-bit checksum, and repairs it if necessary. All this happens while the storage pool is live and in use.
ZFS has a pipelined I/O engine, similar in concept to CPU pipelines. The pipeline operates on I/O dependency graphs and provides scoreboarding, priority, deadline scheduling, out-of-order issue and I/O aggregation. I/O loads that bring other file systems to their knees are handled with ease by the ZFS I/O pipeline.

Instantaneous snapshots and clones (most important, and useful for huge backups in seconds): ZFS provides 2^64 constant-time snapshots and clones. A snapshot is a read-only point-in-time copy of a file system, while a clone is a writable copy of a snapshot. Clones provide an extremely space-efficient way to store many copies of mostly-shared data such as workspaces, software installations, and diskless clients.

Portable snapshot streams (Important & useful feature): You can snapshot a ZFS file system, and you can also create incremental snapshots. Incremental snapshots are so efficient that they can be used for remote replication, such as transmitting an incremental update every 10 seconds.
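A rough sketch of what one such incremental replication step could look like on the command line. The helper name zfs_incremental_cmd, the snapshot names, the host backuphost and the target backuppool/data are all placeholders invented for this example. The function only prints the commands, so it can be tried safely on a machine without ZFS, and it assumes the receiving side already holds the older snapshot.

```shell
#!/bin/sh
# zfs_incremental_cmd FS OLD_SNAP NEW_SNAP HOST TARGET_FS
# Hypothetical helper: prints (does not run) the commands that take a new
# snapshot and ship only the changes since OLD_SNAP to a remote pool.
zfs_incremental_cmd() {
    fs=$1; old=$2; new=$3; host=$4; target=$5
    # Step 1: take the new point-in-time snapshot.
    printf 'zfs snapshot %s@%s\n' "$fs" "$new"
    # Step 2: send the delta between the two snapshots and apply it remotely.
    printf 'zfs send -i %s@%s %s@%s | ssh %s zfs receive %s\n' \
        "$fs" "$old" "$fs" "$new" "$host" "$target"
}

zfs_incremental_cmd zpool1/data monday tuesday backuphost backuppool/data
```

Printing rather than executing is a deliberate choice here: the output can be reviewed first and then pasted into a root shell, or piped to sh once the names have been adapted.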

Highly scalable (Important & useful feature): There are no arbitrary limits in ZFS. You can have as many files as you want: full 64-bit file offsets, unlimited links, directory entries, and so on.

Built in compression: ZFS provides built-in compression. In addition to reducing space usage by 2-3x, compression also reduces the amount of I/O by 2-3x. For this reason, enabling compression actually makes some workloads go faster.
In addition to file systems, ZFS storage pools can provide volumes for applications that need raw-device semantics. ZFS volumes can be used as swap devices, for example. And if you enable compression on a swap volume, you now have compressed virtual memory.

Simplified administration model: ZFS administration is both simple and powerful. zpool and zfs are the only two commands you need to know. Please see the zpool(1M) and zfs(1M) man pages for more information.
The storage pool is a key abstraction: a pool can consist of many physical devices, and can hold many filesystems. Whenever you add storage to the pool, it becomes available to any filesystem that may need it. To take a newly attached disk and use the whole disk for ZFS storage, you would use the command:

# zpool create zpool1 c2t0d0

Here, zpool1 represents the name of a pool, and c2t0d0 is a disk device.

If you have a disk that has already been formatted – say, with a UFS filesystem on one partition – you can create a storage pool from another free partition:
# zpool create zpool1 c2t0d0s2

You can even use a plain file for storage:
# zpool create zpool1 ~/storage/myzfile

Once you have a storage pool, you can build filesystems on it:
# zfs create zpool1/data
# zfs create zpool1/logs

Later on, if you run out of space, just add another device to the pool, and the filesystem will grow.
# zpool add zpool1 c3t0d0

ZFS and Tablespaces:



innodb_data_file_path = /dbzpool/data/ibdatafile:20G:autoextend

Here is the only innodb_data_file_path that any ZFS system might ever need. You can split this over as many drives as you want, and ZFS will balance the load intelligently. You can stripe it, mirror it, add space when you need room to grow, bring spare disks online, and take faulted disks offline, without ever restarting the database.





OpenDBCamp: Information Lifecycle Architecture

May 7th, 2011
The Open DB Camp in Sardinia 2011 has had a number of sessions on varying topics. Topics range from MySQL over MongoDB to replication and High Availability.

I decided to tap into the database expert resources present here at Sardegna Ricerche by discussing a non-database issue, where one can expect database experts to have insights beyond those of end users. And they did.

The topic was the particular case of information overload many of us suffer from on our hard disks: Too many files, too hard to find.
  • How do we find the bank statement from April 2007 from the more-seldom-used account?
  • What are the ten best work-related pictures from last year?
  • Is this the most current version of the presentation of BlackRay?
  • Are these films from Cagliari already backed up? Also offsite?
It turned out that I am not the only one suffering from a slight chaos on my hard disk. We all have some basic discipline we try to follow to keep things in order, but the consensus seemed to be that disorder on the hard disk is a psychological problem to be solved by good habits, more than a technical problem to be solved by an application. This in itself is a revolutionary insight, to come from a bunch of techies.

Before going into the individual points, let me first share how I had framed the discussion:
Many OpenSQLCamp attendees spend lots of time communicating about our SQL projects, internally and externally. We spend lots of time architecting database systems, and managing the lifecycle of products.

We do little to implement a proper architecture for the non-database information we create and manage, in business and privately. We drown in emails, digital pictures, versions of downloaded PDF documents, video snippets, and attachments sent by colleagues, partners and private friends. Chaos ensues.

Disorder and low productivity are inevitable unless we are very disciplined in following some basic rules for keeping order on our hard disks, pods and pads. But what are those basic rules? And what tools can implement them?

I don't come in with more than a rough first sketch of "an Information Lifecycle Architecture", but I'd like to share ideas, thoughts and attitudes with my fellow OpenSQLCamp attendees. I'll present some slides and guidelines, and will make an attempt at collecting your thoughts into a summary afterwards!
I threw in a couple of basic ideas on how to handle the type of information that we have to manage as individuals, usually on our own hard disks:
  1. Separate /pub from /rep: Store raw information in its original form in one directory tree, the "repository". Store distilled information ready to be consumed in a separate directory tree, the "publications".
  2. Limit the allowed /pub formats: Allow very few formats for publishing (such as .jpg .mov .pdf .mp3 .ogg but not .doc .ppt .xls .cr2 .psd .oo3 or anything even more "exotic").
  3. Delete systematically: Don't save many versions of the same file. Don't save information that isn't needed.
  4. Sync easily: Set up the directories (and configure your software) so that it's very easy to sync the published files with your mobile devices (Androids, iPhones, iPads, iPods, digi frames), regardless if PDFs, JPGs, MOVs or MP3s.
  5. Order files by type: Above /pub and /rep, separate files by rough category: Pictures, Movies, Documents, Music.
  6. Order files by year: Under /pub and /rep, separate most files into directories by year. Month or quarter would be too frequent for most personal information.
  7. Order files by common sense: Under the year (or in exceptional cases directly under /pub or /rep), separate files by placing them into a smart directory structure, which you yourself decide about according to the topic, as opposed to delegating the file structure to the random preferences of some software (like iPhoto).
Beat Vontobel, Liz van Dijk, Markus Popp, Sheeri Kritzer Cabral, Sergei Golubchik, René Cannao and others came with very good ideas and anecdotes. Let me relate some of them here, while they're fresh in my memory:
  1. Blog your notes! Write your personal notes so that they're reusable for others. Publish them on your blog. Then you can use Google to find your own notes. I think this tip is smarter than it sounds at first, i.e. it's applicable in quite a few situations.
  2. Use version control! For some who are familiar with version control anyway, it may make sense to put presentations and various types of other personal information into a version control system.
  3. Use the cloud! Put some of the information onto the cloud, for easy availability across machines, for easy synching, for backup.
  4. Tags for files should be part of the operating system. You could tag expense reports, notes, contacts, pictures, films, documents and emails alike with #opendbcamp. The tagging should ideally work across operating systems.
  5. Order needs discipline. Any good habit of keeping order on the hard disk needs to be backed up by a commitment in time. If you slip once, and twice, and one more time, the discipline is lacking.
  6. Storage is cheap. Or is it? Here I noted two schools of thought. One would rather just tag anything and keep order by sorting. The other school would rather delete as much as possible, so that the remainder is smaller and hence easier to keep ordered. I belong to the latter one.
  7. Bad banks throw important yet unstructured information at you. You can get a bank account statement with a long filename which doesn't denote the year and month or bank account. You yourself have to parse the file, and name it properly. That's a burden even for a geeky OpenDBCamp visitor. Think of the poor average bank customers!
  8. The analog world forced you to have a physical relationship to your data. In order to use your CDs or spices or books, your mental maps of organising them were backed up by some physical structure. This physical structure is missing from digital data. It becomes easier to forget that you even have the information. We end up with a lot of pictures, music and videos we never use.
  9. Use Yojimbo http://www.barebones.com/products/yojimbo/ as an information organiser, if you're a Mac user.
  10. Does technology solve issues or create them? Earlier, we didn't have as many pics, films, CDs or books. Now, we have more of them, in a variety of forms. Does it really make sense to spend tens of hours sorting and otherwise maintaining your collections (of films, music, pictures)? Or is it better to have smaller collections, even of the seemingly "free" items such as digital pictures and films taken by yourself?
On that philosophic observation, let me end my personal notes from the "Information Lifecycle Architecture" session at the Open DB Camp, which I have now published and will be able to find later on by Googling it.


DRBD != fsck != DIX

October 28th, 2010

Every once in a while, we hear of users with corruption in a file system that sits on top of DRBD. That may be easy or tricky to resolve. If you’re lucky, a simple fsck will resolve the corruption. If you’re not quite that lucky, you may have to get out your backups.

But that’s typically not DRBD’s fault. Typically not at all, not in the least bit. DRBD is a block device, and as such it has no idea what rests on top of it. It has no concept of a filesystem, let alone its integrity. That of course is true for any other block device as well. If you have, say, RAID-1, and something corrupts the file system on top of it, then of course that corruption will be happily replicated across both component devices. DRBD is no different, except that its component devices are stored across distinct physical nodes.

And even if everything about your filesystem is logically correct, there’s still the chance that a user fat-fingers rm and nukes all your precious data, and DRBD will happily replicate that too. Just like RAID. In a nutshell: just like RAID, DRBD does not replace backups.

DRBD does bend over backwards in making sure that it is replicating data correctly, catching all sorts of network issues in the process and optionally doing an end-to-end checksum over everything it replicates. It can also immediately detach from a backing device if the latter is acting up in any way and throwing I/O errors. But it can only make sure that it correctly replicates whatever it’s being handed down from above — there is no way for it to second-guess whether that is actually good data.

Likewise, when DRBD reads data, it does so from its underlying block device. And if it happens to be fed garbage from there, there’s nothing it can do about that either (unless the read actually produces an I/O error, in which case we can detach, read transparently from the peer over the network, and all is dandy). So if you have silent data corruption introduced by your controller, or by a disk that’s gone haywire, then it will feed the application garbage. However, and this is a big plus compared to going without DRBD, DRBD gives you the option of switching your application over to another node, with presumably better hardware, where that read corruption does not occur. And you can keep your users happy while you’re fixing the other box with the shot I/O stack.

So no, DRBD does not replace the occasional fsck or whatever other data integrity features your filesystem may come with. DRBD also does not absolve you of adding a BBU (or capacitor-backed flash) to your controller write cache, or of having to turn off your disk write cache (which is always volatile). DRBD also does not protect against dd-ing a bunch of random data somewhere in the middle of the block device causing your filesystem to jump and scream.

Now, if you want complete, end-to-end I/O integrity checking, check out Linux DIX (Data Integrity Extensions), brought to you by a team around Martin Petersen at Oracle. I had the pleasure of sitting in on his talk at LinuxCon this year. It’s in Linux as of 2.6.27; check out the project page for details. What’s nice about this is that it’s a Linux first: no other operating system, at this time, is known to have anything comparable.




LCA Miniconf Call for Papers: Data Storage: Databases, Filesystems, Cloud Storage, SQL and NoSQL

September 29th, 2010

This miniconf aims to cover many of the current methods of data storage and retrieval and attempt to bring order to the universe. We’re aiming to cover what various systems do, what the latest developments are and what you should use for various applications.

We aim for talks from developers of and developers using the software in question.

Aiming for some combination of: PostgreSQL, Drizzle, MySQL, XFS, ext[34], Swift (open source cloud storage, part of OpenStack), memcached, TokyoCabinet, TDB/CTDB, CouchDB, MongoDB, Cassandra, HBase….. and more!

Call for Papers open NOW (Until 22nd October).



ScaleDB Cache Accelerator Server (CAS): A Game Changer for Clustered Databases

September 18th, 2010
ScaleDB and Oracle RAC are both clustered databases that use a shared-disk architecture. As I have mentioned previously, they both actually share data via a shared cache, so it might be more appropriate to call them shared-cache databases.

Whether it is called shared-disk or shared-cache, these databases must orchestrate the sharing of a single set of data amongst multiple nodes. This introduces two challenges: the physical sharing of the data and the logical sharing of the data.

Physical Sharing:
Raw storage is meant to work on a 1:1 basis with a single server. In order to share that data amongst multiple servers, you need either a Network File System (NFS), which shares whole files, or a Cluster File System (CFS), which shares data blocks.

Logical Sharing:
This is specific to databases. A database may request a single block of data from the storage and then it may coordinate multiple sequential changes to that block, with only the final results being written back to the storage. The database can also discriminate between reading the data and writing the data, to facilitate parallelizing these actions.

Databases must control the logical sharing of data, in order to ensure that the database doesn’t become corrupted or inconsistent, and to ensure that it provides good performance. Because logical sharing is very specific to the database, it is something that clustered databases must handle themselves. This function is addressed by a lock manager.

Physical sharing of data requires less integration with the database logic. As such, you can use a general-purpose NFS or a CFS to provide the physical file sharing capabilities. This is what Oracle RAC does: it relies upon Oracle Cluster File System 2 (OCFS2) to provide generic physical file sharing. OCFS2 then relies upon a SAN or NAS that supports multi-attach, since all of the database nodes must share the same physical files. The NAS or SAN then handles the data duplication for high availability and other services such as backup.

ScaleDB takes a different approach. ScaleDB not only handles the logical data sharing—with its lock manager—but it also handles the physical data sharing with its Cache Accelerator Server (CAS). CAS connects directly to the storage and handles the sharing of that data among the database nodes. Because CAS is purpose-built for the ScaleDB database it does not need services such as membership management, which create complexity and overhead in a general purpose CFS. Furthermore, ScaleDB is able to tune the CAS, in conjunction with the lock manager, to extract superior performance.

CAS also offers additional benefits. It provides a scalable shared cache that enables the database nodes to share via the cache, which is much faster than sharing via the disk. Furthermore, since it eliminates the need for an NFS or CFS, it enables you to work with any storage. You can choose to use local storage (inside the CAS), cloud storage, or a SAN or NAS. Many in the MySQL community balk at the high cost of SAN storage with Fibre Channel, switches and high-cost storage. CAS supports low-cost local storage, while providing a seamless path to high-end storage as needed. Furthermore, CAS servers are deployed in pairs, so the data is mirrored. Because the data is mirrored, you have redundant storage, even when using local storage inside the servers running CAS. Because it can operate on commodity hardware and because it works with any storage, CAS is ideal for cloud computing.

In summary, clustered databases like Oracle RAC and ScaleDB must implement their own lock managers to manage the logical sharing of data amongst the database nodes. Providing a purpose-built solution for the physical sharing of the data, while not required, does provide some significant advantages over using a general purpose NFS or CFS.


No, DRBD doesn’t magically make your application crash safe

August 17th, 2010

It is a common misconception that DRBD (or any block-level data replication solution) can magically make an application crash-safe when it intrinsically isn’t. Baron highlights that misconception in a recent blog post.

I want to reiterate and stress that point here: if your application can’t reliably survive a node crash, it won’t successfully fail over on a replicated (or shared, for that matter) data device. But if it can, then DRBD won’t break it. In other words: try pulling the power plug on your machine while your app is running, and power back on. If your application recovers to a consistent state, you’re clear. If it doesn’t, don’t bother adding DRBD until you fix that.

You must fix any layer in your stack that isn’t crash safe, if you even want to start thinking about high availability. ext2, which Baron mentions in his post, isn’t crash safe. MySQL with a database using the MyISAM storage engine isn’t crash safe. KVM with virtual block devices in cache=writeback mode isn’t crash safe. Running on a RAID controller with the write cache enabled when its battery is dead isn’t crash safe.

Thus, if you want high availability, use ext3. Or ext4. Or any journaling file system. Use InnoDB for MySQL. Use cache=none for KVM. And check those batteries. It’s that simple.
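The filesystem part of that checklist is easy to script. The sketch below assumes a Linux host where /proc/mounts is available; the function names fs_type and is_journaled are made up for this example, and the list of journaling filesystems is deliberately limited to ext3, ext4 and xfs, so extend it for whatever else you run.

```shell
#!/bin/sh
# fs_type MOUNTPOINT [MOUNTS_FILE]
# Print the filesystem type of a mount point, read from /proc/mounts by
# default. If a mount point is listed more than once, the last entry wins.
fs_type() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

# is_journaled TYPE
# Succeed for the journaling filesystems this sketch knows about.
is_journaled() {
    case "$1" in
        ext3|ext4|xfs) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_journaled "$(fs_type /vol/mysql)" || echo "not crash safe" would flag an ext2 data volume before you put DRBD underneath it. The other items (storage engine, KVM cache mode, controller battery) need their own checks.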




Shared Cache Tier & Storage Flexibility

August 4th, 2010
Any time you can get two for the price of one (a “2Fer”), you’re ahead of the game. By implementing our shared cache as a separate tier, you get (1) improved performance and (2) storage flexibility…a 2Fer.

What do I mean by storage flexibility? It means you can use enterprise storage, cloud storage or PC-based storage. Other shared-disk cluster databases require high-end enterprise storage like a NAS or SAN. This requirement was driven by the need for:

1. High-performance storage
2. Highly available storage
3. Multi-attach, or sharing data from a single volume or LUN across multiple nodes in the cluster.

Quite simply, you won’t see other shared-disk clustering databases using cloud storage or PC-based storage. However, the vast majority of MySQL users rely on PC-based storage, and most are not willing to pay the big bucks for high-end storage.

ScaleDB’s Cache Accelerator Server (CAS) enables users to choose the storage solution that fits their needs. See the diagram below:

Because all data is mirrored across paired CAS servers, the setup delivers high availability: if one server fails, the other continues running. Built-in recovery completes the HA solution. If you want further reassurance, you can use a third CAS as a hot standby. This means that you can use the internal hard drives of your CAS servers to provide highly available storage.

The next post in this series on CAS will compare ScaleDB CAS, Network File System (NFS) and Cluster File System (CFS).
