Archive for the ‘Replication’ Category

New Replication, Optimizer and High Availability features in MySQL 5.6.5!

Апрель 11th, 2012

As the Product Manager for the MySQL database it is always great to announce when the MySQL Engineering team delivers another great product release.  As a field DBA and developer it is even better when that release contains improvements and innovation that I know will help those currently using MySQL for apps that range from modest intranet sites to the most highly trafficked web sites on the web.  That said, it is my pleasure to take my hat off to MySQL Engineering for today's release of the MySQL 5.6.5 Development Milestone Release ("DMR"). The new highlighted features in MySQL 5.6.5 are discussed here:

New Self-Healing Replication Clusters
The 5.6.5 DMR improves MySQL Replication by adding Global Transaction Ids and automated utilities for self-healing Replication clusters.  Prior to 5.6.5 this has been somewhat of a pain point for MySQL users with most developing custom solutions or looking to costly, complex third-party solutions for these capabilities.  With 5.6.5 these shackles are all but removed by a solution that is included with the GPL version of the database and supporting GPL tools. 

You can learn all about the details of the great, problem solving Replication features in MySQL 5.6 in Mat Keep's Developer Zone article

New Replication Administration and Failover Utilities
As mentioned above, the new Replication features, Global Transaction Ids specifically, are now supported by a set of automated GPL utilities that leverage the new GTIDs to provide administration and manual or auto failover to the most up to date slave (that is the default, but user configurable if needed) in the event of a master failure.

The new utilities, along with links to Engineering related blogs, are discussed in detail in the DevZone Article noted above.

Better Query Optimization and Throughput
The MySQL Optimizer team continues to amaze with the latest round of improvements in 5.6.5. Along with much refactoring of the legacy code base, the Optimizer team has improved complex query optimization and throughput by adding these functional improvements:

  • Subquery Optimizations - Subqueries are now included in the Optimizer path for runtime optimization.  Better throughput of nested queries enables application developers to simplify and consolidate multiple queries and result sets into a single unit or work.
  • Optimizer now uses CURRENT_TIMESTAMP as default for DATETIME columns - For simplification, this eliminates the need for application developers to assign this value when a column of this type is blank by default.
  • Optimizations for Range based queries - Optimizer now uses ready statistics vs Index based scans for queries with multiple range values.
  • Optimizations for queries using filesort and ORDER BY.  Optimization criteria/decision on execution method is done now at optimization vs parsing stage.
  • Print EXPLAIN in JSON format for hierarchical readability and Enterprise tool consumption.

You can learn the details about these new features as well all of the Optimizer based improvements in MySQL 5.6 by following the Optimizer team blog.

You can download and try the MySQL 5.6.5 DMR here. (look under "Development Releases")  Please let us know what you think!  The new HA utilities for Replication Administration and Failover are available as part of the MySQL Workbench Community Edition, which you can download here .

Also New in MySQL Labs
As has become our tradition when announcing DMRs we also like to provide "Early Access" development features to the MySQL Community via the MySQL Labs.  Today is no exception as we are also releasing the following to Labs for you to download, try and let us know your thoughts on where we need to improve:

InnoDB Online Operations
MySQL 5.6 now provides Online ADD Index, FK Drop and Online Column RENAME.  These operations are non-blocking and will continue to evolve in future DMRs.  You can learn the grainy details by following John Russell's blog.

InnoDB data access via Memcached API ("NotOnlySQL") - Improved refresh of an earlier feature release
Similar to Cluster 7.2, MySQL 5.6 provides direct NotOnlySQL access to InnoDB data via the familiar Memcached API. This provides the ultimate in flexibility for developers who need fast, simple key/value access and complex query support commingled within their applications.

Improved Transactional Performance, Scale
The InnoDB Engineering team has once again under promised and over delivered in the area of improved performance and scale.  These improvements are also included in the aggregated Spring 2012 labs release:

InnoDB CPU cache performance improvements for modern, multi-core/CPU systems show great promise with internal tests showing:   

  • 2x throughput improvement for read only activity
  • 6x throughput improvement for SELECT range
  • Read/Write benchmarks are in progress

More details on the above are available here.

You can download all of the above in an aggregated "InnoDB 2012 Spring Labs Release" binary from the MySQL Labs. You can also learn more about these improvements and about related fixes to mysys mutex and hash sort by checking out the InnoDB team blog.

MySQL 5.6.5 is another installment in what we believe will be the best release of the MySQL database ever.  It also serves as a shining example of how the MySQL Engineering team at Oracle leads in MySQL innovation.

You can get the overall Oracle message on the MySQL 5.6.5 DMR and Early Access labs features here.

As always, thanks for your continued support of MySQL, the #1 open source database on the planet!



PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Utilities and Global Transaction Identifiers

Апрель 10th, 2012
The new MySQL 5.6 Development Milestone Release (DMR) includes many new enhancements. One of the most impressive is the use of Global Transaction Identifiers (GTIDs) for replication. With GTIDs enabled, administrators no longer need to keep track of binary log files and positions. In a nutshell, GTIDs simplify the setup and maintenance of replication.

MySQL Utilities has taken this a step further by providing two new utilities that automate two of the most complex replication administration tasks - switchover and failover. Switchover is changing the role of an active, healthy master to one of its slaves whereas failover is the act of promoting a candidate slave to become the new master. Clearly, switchover is an elective operation and failover is performed when there are issues with the master.

The GTID utilities are included in release-1.0.5 of MySQL Utilities. They are included as a plugin for MySQL Workbench or via source download from launchpad (see below). The new GTID utilities are included in Workbench version 5.2.39.

Automatic Failover Utility
The most impressive utility is mysqlfailover, It is an interactive tool used to report replication health, report GTIDs in use, and perform automatic failover. Yes, that’s right - you can setup mysqlfailover to automatically failover to one of a specific set of slaves whenever the master goes offline. It is designed to work with the MySQL 5.6.5 and later versions of the server.

failover-2012-04-3-20-40.png

As you can see in the screenshot, a list is presented including the host, port, role, state, and replication health for each server in the topology. The utility connects to a single master and its slaves. When used in multiple tier environments, users can run an instance for each master. The utility provides the ability to run a failover check and report health at specific intervals in seconds from five seconds and up

To start the utility, users can specify a list of slaves or provide a default user and password to be used in discovering the slaves connected to the master. Discovery of slaves requires the slaves to report the correct host and port when connecting to the master. Along with the list of slaves, the user can specify a list of servers to be used as candidates for selecting a new master when a failover event is detected.

Finally, the user can control how failover occurs with the failover mode. The auto mode tells the utility to failover to the list of candidates first and if none are viable, search the list of slaves for a candidate. The elect mode limits election to the candidate slave list and if none are viable, failover does not occur. The fail mode tells the utility to not perform failover and instead stop execution.

Along with these options are four extension points permitting users to interact with the utility during failover. These extension points permit users to specify a script to run at each of the following events.
  • exec-fail-check - execute a script to determine if failover is needed. This replaces the default downed master detection and allows users to perform application-specific detection for failover.
  • exec-before - execute a script before failover is performed. This can be used to tell the application to cease write attempts while a new master is setup.
  • exec-after - execute a script immediately after failover to a new master. This permits users to inform the application that the new master is ready.
  • exec-post-fail - execute a script after failover is complete and all slaves have been attached to the new master. This can be used to inform applications that use read-level scale out that it is safe to resume reads from the slaves.
The combination of options to control failover, the option to perform automatic failover, and the ability to inform applications of the failover event are powerful features that enable unattended automatic failover for critical replication-based applications.

Replication Administration Utility
The other utility, mysqlrpladmin, is used to perform switchover and failover operations and more on-demand permitting administrators to execute these tasks with a single command. You can use the command to perform one of the following commands.
  • elect - This command is available to only those servers supporting global transaction identifiers (GTIDs), perform best slave election and report best slave to use in the event a switchover or failover is required. Best slave election is simply the first slave to meet the prerequisites.
  • failover - This command is available to only those servers supporting GTIDs. Conduct failover to the best slave. The command will test each candidate slave listed for the prerequisites. Once a candidate slave is elected, it is made a slave of each of the other slaves thereby collecting any transactions executed on other slaves but not the candidate. In this way, the candidate becomes the most up-to-date slave.
  • gtid - This command is available to only those servers supporting GTIDs. It displays the contents of the GTID variables used to report GTIDs in replication. The command also displays universally unique identifiers (UUIDs) for all servers.
  • health - Display the replication health of the topology.
  • reset - Execute the STOP SLAVE and RESET SLAVE commands on all slaves.
  • start - Execute the START SLAVE command on all slaves.
  • stop - Execute the STOP SLAVE command on all slaves.
  • switchover - Perform slave promotion to a specified candidate slave as designated by the --new-master option. This command is available for both gtid-enabled servers and non-gtid-enabled scenarios.
These two utilities raise the bar for replication ease of use making the administrator’s job easier.

How Can I Get MySQL Utilities?
You can check out these new utilities and the entire suite of utilities by either downloading the source code from launchpad or by downloading and installing MySQL Workbench.

You can download MySQL Workbench from:

http://www.mysql.com/downloads/workbench/

You can also download the latest development source code tree for the MySQL Workbench Utilities from:

https://launchpad.net/mysql-utilities

To learn more about all of the great new replication features in MySQL 5.6, check out the developer zone article at:

http://dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html

PlanetMySQL Voting: Vote UP / Vote DOWN

Benchmarking MySQL Replication with Multi-Threaded Slaves

Апрель 10th, 2012

The objective of this benchmark is to measure the performance improvement achieved when enabling the Multi-Threaded Slave enhancement delivered as a part MySQL 5.6.

As the results demonstrate, Multi-Threaded Slaves delivers 5x higher replication performance based on a configuration with 10 databases/schemas. For real-world deployments, higher replication performance directly translates to:

· Improved consistency of reads from slaves (i.e. reduced risk of reading "stale" data)

· Reduced risk of data loss should the master fail before replicating all events in its binary log (binlog)

The multi-threaded slave splits processing between worker threads based on schema, allowing updates to be applied in parallel, rather than sequentially. This delivers benefits to those workloads that isolate application data using databases - e.g. multi-tenant systems deployed in cloud environments.

Multi-Threaded Slaves are just one of many enhancements to replication previewed as part of the MySQL 5.6 Development Release, which include:

· Global Transaction Identifiers coupled with MySQL utilities for automatic failover / switchover and slave promotion

· Crash Safe Slaves and Binlog

· Optimized Row Based Replication

· Replication Event Checksums

· Time Delayed Replication

These and many more are discussed in the “MySQL 5.6 Replication: Enabling the Next Generation of Web & Cloud Services” Developer Zone article 

Back to the benchmark - details are as follows.


Environment
The test environment consisted of two Linux servers:

· one running the replication master

· one running the replication slave.

Only the slave was involved in the actual measurements, and was based on the following configuration:

- Hardware: Oracle Sun Fire X4170 M2 Server

- CPU: 2 sockets, 6 cores with hyper-threading, 2930 MHz.

- OS: 64-bit Oracle Enterprise Linux 6.1
- Memory: 48 GB

Test Procedure
Initial Setup:

Two MySQL servers were started on two different hosts, configured as replication master and slave.

10 sysbench schemas were created, each with a single table:

CREATE TABLE `sbtest` (
   `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
   `k` int(10) unsigned NOT NULL DEFAULT '0',
   `c` char(120) NOT NULL DEFAULT '',
   `pad` char(60) NOT NULL DEFAULT '',
   PRIMARY KEY (`id`),
   KEY `k` (`k`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

10,000 rows were inserted in each of the 10 tables, for a total of 100,000 rows. When the inserts had replicated to the slave, the slave threads were stopped. The slave data directory was copied to a backup location and the slave threads position in the master binlog noted.

10 sysbench clients, each configured with 10 threads, were spawned at the same time to generate a random schema load against each of the 10 schemas on the master. Each sysbench client executed 10,000 "update key" statements:

UPDATE sbtest set k=k+1 WHERE id = <random row>

In total, this generated 100,000 update statements to later replicate during the test itself.

Test Methodology:
The number of slave workers to test with was configured using:

SET GLOBAL slave_parallel_workers=<workers>

Then the slave IO thread was started and the test waited for all the update queries to be copied over to the relay log on the slave.

The benchmark clock was started and then the slave SQL thread was started. The test waited for the slave SQL thread to finish executing the 100k update queries, doing "select master_pos_wait()". When master_pos_wait() returned, the benchmark clock was stopped and the duration calculated.

The calculated duration from the benchmark clock should be close to the time it took for the SQL thread to execute the 100,000 update queries. The 100k queries divided by this duration gave the benchmark metric, reported as Queries Per Second (QPS).

Test Reset:

The test-reset cycle was implemented as follows:

· the slave was stopped

· the slave data directory replaced with the previous backup

· the slave restarted with the slave threads replication pointer repositioned to the point before the update queries in the binlog.

The test could then be repeated with identical set of queries but a different number of slave worker threads, enabling a fair comparison.

The Test-Reset cycle was repeated 3 times for 0-24 number of workers and the QPS metric calculated and averaged for each worker count.

MySQL Configuration
The relevant configuration settings used for MySQL are as follows:

binlog-format=STATEMENT
relay-log-info-repository=TABLE
master-info-repository=TABLE

As described in the test procedure, the
slave_parallel_workers setting was modified as part of the test logic. The consequence of changing this setting is:

0 worker threads:
   - current (i.e. single threaded) sequential mode
   - 1 x IO thread and 1 x SQL thread
   - SQL thread both reads and executes the events

1 worker thread:
   - sequential mode
   - 1 x IO thread, 1 x Coordinator SQL thread and 1 x Worker thread
   - coordinator reads the event and hands it to the worker who executes

2+ worker threads:
   - parallel execution
   - 1 x IO thread, 1 x Coordinator SQL thread and 2+ Worker threads
   - coordinator reads events and hands them to the workers who execute them

Results
Figure 1 below shows that Multi-Threaded Slaves deliver ~5x higher replication performance when configured with 10 worker threads, with the load evenly distributed across our 10 x schemas. This result is compared to the current replication implementation which is based on a single SQL thread only (i.e. zero worker threads).

Figure 1: 5x Higher Performance with Multi-Threaded Slaves

The following figure shows more detailed results, with QPS sampled and reported as the worker threads are incremented.

The raw numbers behind this graph are reported in the Appendix section of this post.



Figure 2: Detailed Results

As the results above show, the configuration does not scale noticably from 5 to 9 worker threads. When configured with 10 worker threads however, scalability increases significantly. The conclusion therefore is that it is desirable to configure the same number of worker threads as schemas.

Other conclusions from the results:

· Running with 1 worker compared to zero workers just introduces overhead without the benefit of parallel execution.

· As expected, having more workers than schemas adds no visible benefit.

Aside from what is shown in the results above, testing also demonstrated that the following settings had a very positive effect on slave performance:


relay-log-info-repository=TABLE
master-info-repository=TABLE

For 5+ workers, it was up to 2.3 times as fast to run with TABLE compared to FILE.

Conclusion

As the results demonstrate, Multi-Threaded Slaves deliver significant performance increases to MySQL replication when handling multiple schemas.

This, and the other replication enhancements introduced in MySQL 5.6 are fully available for you to download and evaluate now from the MySQL Developer site (select Development Release tab).

You can learn more about MySQL 5.6 from the documentation 

Please don’t hesitate to comment on this or other replication blogs with feedback and questions.

Appendix – Detailed Results


PlanetMySQL Voting: Vote UP / Vote DOWN

Announcing TokuDB v6.0: Less Slave Lag and More Compression

Апрель 9th, 2012

We are excited to announce TokuDB® v6.0, the latest version of Tokutek’s flagship storage engine for MySQL and MariaDB.

This version offers feature and performance enhancements over previous releases, support for XA (two-phase transactional commits), better compression, and reduced performance variability associated with checkpointing. This release also brings TokuDB support up to date on MySQL v5.1, MySQL v5.5 and MariaDB v5.2. There’s a lot of great technical stuff under the hood in this release and I’ll be reviewing the improvements one-by-one over the course of this week.

I’ll be posting more details about the new features and performance, so here’s an overview of what’s in store.

Replication Slave Lag
One of the things TokuDB does well is single-threaded insertions, which translates directly into less slave lag. With TokuDB v6.0, we introduce support for XA, which insures for a more robust environment for many replication use cases. High insertion rate and XA support make TokuDB a drop-in replacement for InnoDB in replication environments. In the next blog, I’ll be giving some performance numbers.

Compression
TokuDB has great compression. Starting with TokuDB v6.0, you’ll have a choice between standard compression and aggressive compression. Aggressive compression uses more cores and usually does a significantly better job at compressing. I’ll get into the details of our new compression feature in another post.

Checkpoint variability
TokuDB checkpoints frequently, which makes recovery super-fast. InnoDB checkpoints infrequently, because checkpoints slow InnoDB performance significantly. Our customers tell us they require stable and stall-free performance, even in the face of checkpointing. With TokuDB v6.0 we deliver that and with no drop in throughput. We think you’ll be happy with the results — frequent checkpoints and fast recovery with no performance hit! — details of which forthcoming.

Performance
This release continues our improvements for multi-client scaling and in-memory performance. We’ve made great strides. Numbers for this in an upcoming post.

TokuDB v6.0 maintains all our established advantages: fast trickle load, fast bulk load, fast range queries through clustering indexes, hot schema changes, no fragmentation, and full MySQL compatibility for ease of installation. See our benchmark page for details.

To learn more about TokuDB:

Replication, compression, reduced variability, improved performance and support for MySQL v5.5. Enjoy!


PlanetMySQL Voting:
Vote UP / Vote DOWN

Relearn about your battery

Март 15th, 2012

Some days ago I got a call from our support engineer on duty that MySQL on one of our database servers was lagging more than 1000 seconds behind in replication and the server got kicked out of the pool because of the delay. He was unable to find out why and there was absolutely nothing in the mysql log files. When I got the call it was still lagging behind but the lag was slowly decreasing again.

After a quick peek in all our monitoring systems I isolated it to this message:
Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless! :P

Obviously we did encountered this situation a couple of times before but apparently there was no detection on this machine.

The relearn cycle happens every 90 days and gets first scheduled when the machine gets powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Lucky enough you can use omconfig to reschedule the cycle up to 7 days, but then you obviously need to have detection in place.

Why did nobody come up with the idea to have a dual battery backed up cache with alternating relearn cycles? That way you can have your battery relearn without the controller going back into write-through mode. ;)


Tagged: battery relearn, hardware, mysql, omconfig, omreport, poweredge r710, replication
PlanetMySQL Voting: Vote UP / Vote DOWN

Relearn about your battery

Март 15th, 2012

Some days ago I got a call from our support engineer on duty that MySQL on one of our database servers was lagging more than 1000 seconds behind in replication and the server got kicked out of the pool because of the delay. He was unable to find out why and there was absolutely nothing in the mysql log files. When I got the call it was still lagging behind but the lag was slowly decreasing again.

After a quick peek in all our monitoring systems I isolated it to this message:
Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless! :P

Obviously we did encountered this situation a couple of times before but apparently there was no detection on this machine.

The relearn cycle happens every 90 days and gets first scheduled when the machine gets powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Lucky enough you can use omconfig to reschedule the cycle up to 7 days, but then you obviously need to have detection in place.

Why did nobody come up with the idea to have a dual battery backed up cache with alternating relearn cycles? That way you can have your battery relearn without the controller going back into write-through mode. ;)


Tagged: battery relearn, hardware, mysql, omconfig, omreport, poweredge r710, replication
PlanetMySQL Voting: Vote UP / Vote DOWN

Presenting at Percona Live and SkySQL MariaDB Solutions Day in Santa Clara

Март 13th, 2012
MySQL community conferences are alive and well in 2012.   Percona has taken the initiative to host the yearly MySQL event at the Santa Clara Hyatt; it's now called Percona Live MySQL Conference and Expo.  It runs from 10 through 12 April.  But don't plan on going home Thursday night.  On Friday 13 April you can also attend the SkySQL and MariaDB MySQL Solutions Day in the same location.  And wait, that's not all!  Drizzle Day is also on 13 April and also at the Hyatt, so you can catch up on what the Drizzle folks have been up to for the last 12 months.

Now for some specifics on the conferences where Continuent will be appearing.  Percona did an outstanding job of organizing conferences in 2011 and has a strong line-up for the 2012 Santa Clara event that includes a number of talks related to Tungsten.  Several of my Continuent colleagues, including Ed Archibald, Jeff Mace, and Giuseppe Maxia will be presenting.  I'm also pleased to report that the committee selected two of my proposals:

* Be a Data Management Hero with Good Backups! -- My worst mistake in 35 years of working in IT involved backups.  It's important to get them right.

Boost Your Replication Throughput with Parallel Apply, Prefetch, and Batching (with my colleague Stephane Giron, aka the master of the MySQL binlog)  -- We will cover how Tungsten speeds up replication in multiple ways, including new techniques like automatic batch loading that have not previously been available to MySQL users.

There will likely be a BOF for Tungsten--not confirmed but I am hopeful--on one of the evenings.    Please stop by if you would like to meet the engineers, and talk about Tungsten.  Afterwards we'll buy you a drink.  It comes out of the Continuent marketing budget, which proves that free software sometimes is linked to free beer.

Meanwhile, the SkySQL and MariaDB Solutions Day is new, at least for me.  Giuseppe Maxia and I will be doing a talk on building scalable apps with Tungsten replication and clustering.  It will stress how to build very large MySQL installations using Continuent's products.  I am looking forward to this conference, because we don't see as much on MariaDB here in the US as we do on the Oracle and Percona flavors of MySQL.  This is a great way for US-based folks to catch up. I also like the fact it will cover open source and commercial products like Tungsten Enterprise (our software) that help users take full advantage of MySQL.  Not everything needs to be open source to be cost-effective and useful.

Last year I suggested that we all lay down the fish for a while to work on conferences that build the MySQL community and sustain innovation.  Continuent will support any conference that meets those simple criteria.  I'm delighted that we are able to help sponsor both Percona Live and the SkySQL and MariaDB Solution Day to make Santa Clara worth visiting in April.  I look forward to meeting a lot of old friends and making new ones.  Please stop by and introduce yourself if you attend.

PlanetMySQL Voting: Vote UP / Vote DOWN

Auto caching INFORMATION_SCHEMA tables: seeking input

Март 8th, 2012

The short version

I have it all working. It's kind of magic. But there are issues, and I'm not sure it should even exist, and am looking for input.

The long version

In Auto caching tables I presented with a hack which allows getting cached or fresh results via a simple SELECT queries.

The drive for the above hack was INFORMATION_SCHEMA tables. There are two major problems with INFORMATION_SCHEMA:

  1. Queries on schema-oriented tables such as TABLES, COLUMNS, STATISTICS, etc. are heavyweight. How heavyweight? Enough to make a lockdown of your database. Enough to crash down your database in some cases.
  2. The data is always generated on-the-fly, as you request it. Query the COLUMNS table twice, and risk two lockdowns of your database.

The auto-cache mechanism solves issue #2. I have it working, time based. I have an auto-cache table for each of the INFORMATION_SCHEMA heavyweight tables. Say, every 30 minutes the cache is invalidated. Throughout those 30 minutes, you get a free pass!

The auto-cache mechanism also paves the road to solving issue #1: since it works by invoking a stored routine, I have better control of the way I read INFORMATION_SCHEMA. This, I can take advantage of INFORMATION_SCHEMA optimization. It's tedious, but not complicated.

For example, if I wanted to cache the TABLES table, I don't necessarily read the entire TABLES data in one read. Instead, I can iterate the schemata, get a list of table names per schema, then read full row data for these, table by table. The result? Many many more SELECTs, but more optimized, and no one-big-lock-it-all query.

And the problem is...

I have two burning problems.

  1. INFORMATION_SCHEMA optimization only works that much. It sometimes does not work. In particular, I've noticed that if you have a view which relies on another view (possibly relying on yet another view), things get out of hand. I author a monitoring tool for MySQL called mycheckpoint. It uses some fancy techniques for generating aggregated data, HTML and charts, by means of nested views. There are a few views there I can never query for in COLUMNS. It just crashes down my server. Repeatedly. And it's a good machine with good configuration. Make that 5 machines. They all crash, repeatedly. I just can't trust INFORMATION_SCHEMA!
  2. Replication: any caching table is bound to replicate. Does it make any sense to replicate cache for internal metadata? Does it make sense to query for the cached table on slave, to have it answer for master's data? With plain old INFORMATION_SCHEMA, every server is on its own. Caching kinda works against this. Or is it fair enough, since we would usually expect master/slaves to reflect same schema structure?

I would feel much better if I could read SHOW statements with a SELECT query. Though I've found this nice hack, it can't work from a stored function, only via stored procedure. So it can't be used from within a SELECT query. I've been banging my head for months now, I think I gave up on this one.

Any insights are welcome!


PlanetMySQL Voting: Vote UP / Vote DOWN

The CAP theorem and MySQL Cluster

Март 7th, 2012
tldr; A single MySQL Cluster prioritises Consistency in Network partition events. Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.


I was recently asked about the relationship between MySQL Cluster and the CAP theorem. The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast. You can have any two, but you can't have all three. For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'. CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?

Standard 'my interpretation of CAP' section

Everyone who discusses CAP like to rehash it, and I'm no exception. Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition. This is how I think of CAP. He also discusses related system behaviour in normal operation which I'll return to later.

While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition.

Network replicated database

CAP is only really relevant in the context of a network replicated database (or filesystem or state machine). A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network. Data can be read and updated. Updates are propagated to all nodes with replicas via the network. Database clients connect to database nodes via the network to read data and make updates. Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.

Availability

The network replicated database exists to provide services such as Read and Write on the data it stores. Its availability can be measured as the ability of any client to perform any service on any data item.

This Service Availability can be compromised by :
  • Failure of client nodes
  • Network failures between clients and database nodes
  • Network failures between database nodes
  • Failure of database nodes
Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP. However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP.

Service Availability is not binary, it can partially degrade, perhaps by affecting :
  • A subset of all clients
  • A subset of all stored data
  • A subset of request types

The shades of grey within the definition of availability are responsible for most of the arguments around CAP. If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain. If we take a more flexible approach then some service availabilty can be preserved even with a completely decimated network. In the loosest definition, if any client receives any service on any data, then the system is still available. Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage. Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.

Consistency

For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data. Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result. Where two or more updates to the same data item are submitted simulteneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results. This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.

If one copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously. Over time the data copies may be aligned, but clients accessing the data must beware that reads may not return the results of the most recently accepted writes. This behaviour may be described as eventual consistency. Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications. Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.

Network Partitions

Network partitions isolate subsets of the nodes of a network replicated database. The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :
  1. dead
  2. alive but isolated from clients
  3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency. The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them. If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one copy consistency as we cannot supply those new values in response to a read of our local copy. If those clients are not allowed to make writes then we have degraded service availability for them. Which is it to be? This is the unavoidable choice at the centre of the CAP theorem. Stated this way it seems less of a theorem and more of a fact.

Back to MySQL Cluster - which does it provide?

A single MySQL Cluster prioritises data consistency over availability when network partitions occur.

A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.

So you can have it both ways with MySQL Cluster - Great!

Single MySQL Cluster - CP

Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit. Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes. Where a network partition occurs, live nodes in each partition regroup and decide what to do next :
  • If there are not enough live nodes to serve all of the data stored - shutdown
    Serving a subset of user data (and risking data consistency) is not an option
  • If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
    No other subset of nodes can be isolated from us and serving clients
  • If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
    There could be another subset of nodes regrouped into a viable cluster out there.

Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others. In other words, arbitration occurs to preserve consistency.

Arbitration involves :
  • Database nodes agree on an arbitrator in advance
  • During node or network failure handling, no data writes are committed.
  • When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
  • Each request to the arbitrator will result in either : Yes, No or timeout
  • Anything other than Yes results in node shutdown.
  • The arbitrator only says Yes once per election round (First come first served). Therefore the arbitrator only says yes to one potential cluster in a partitioned network.

Note that arbitration is not the same as achieving a quorum. A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor. The arbitrator role is lightweight as it is not involved in normal traffic. I am surprised that the lightweight arbitrator pattern is not more common.

How does a single MySQL Cluster degrade service availability as a result of network partitions?

Where some subset of data nodes are isolated and shut-down :
  • Those nodes are 100% out of service, until they restart and can rejoin the cluster
    They will attempt to do so automatically
  • Any clients connected only to those nodes are out of service
    By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
  • The remaining live nodes are 100% in-service
  • Clients connected to the remaining live nodes are 100% in service
Where no subset of data nodes is live
  • All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster
    They will attempt to do so automatically.

A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions. It does not sacrifice consistency.

How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?

Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it. Some set of clients continue to receive 100% service. Only double failures in the network can cause a network partition resulting in full service loss.
Furthermore, network partitions are not the only risks to availability, software errors, power failures, upgrades, overloads are other potential sources of downtime which Cluster is designed to overcome.

Asynchronously replicating clusters - AP


Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters. Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not. On the other hand, availability is not compromised by network partitioning of the two clusters. Each cluster can continue to accept read and write requests to all of the data from any connected client.

Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.

How does consistency degrade between replicating MySQL Clusters during a network partition?

This depends on the conflict resolution function chosen, and how detected conflicts are handled. Some details of consistency guarantees provided by NDB$EPOCH et al are described here.

What about normal operation?

Abadi's post introduced his PACELC acronym, standing for something like :

 if (network Partition)
{
trade-off Availability vs Consistency;
}
else
{
trade-off Latency vs Consistency;
}


My first comment has to be that it's bad form to put the common case in an else branch!
However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition. The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency. In theory this normal operation latency-vs-consistency tradeoff could be completely independent to the Network Partitioning availability-vs-consistency tradeoff, e.g. you could have any of :
  1. PA EL (Partition - Availability, Else - Latency minimisation)
  2. PA EC (Partition - Availability, Else - Consistency)
  3. PC EL (Partition - Consistency, Else - Latency minimisation)
  4. PC EC (Partition - Consistency, Else - Consistency)

The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency. Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability. Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.

Examples of systems of each type :
  1. Any eventually consistent system, especially with local-database-node updates + reads
  2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
  3. ???
  4. Always consistent systems (e.g. single database instance, single MySQL Cluster)

I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition. Maybe this category should be called 'repentant systems'?

The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies. It is not enough to say that things will 'eventually' be consistent. It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.

There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database. The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use! A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs). These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.

Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented. Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower. This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable. Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.

At a wider network scale, latencies are naturally higher, implying that bandwidth is lower. The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable. The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable. Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data. For a particular application there may or may not be a point at which an AP system would meet its requirements better.

Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system. Paraphrasing John Mashey, "Bandwidth improves, latency is forever". Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.

Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available. Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links. As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.

Edit 7/3/12 : Fix bad link

PlanetMySQL Voting: Vote UP / Vote DOWN

Cool technology and usability in Tungsten Enterprise

Март 6th, 2012
When I joined Continuent, at the end of 2010, I was fascinated by the technology of its core products. Readers of this blog know that I have had my hands full with Tungsten Replicator, but what really turned me on was the flagship management suite, Tungsten Enterprise.After hammering at it for several months, and always marveling at the beauty of its technology, let me give a tour of the suite, so that you'll understand what's so exciting about it. First off, Tungsten Enterprise is not simply a replication tool. It is based on replication, but it is mostly a data management suite. Its aim is to reduce complexity for the user and to show a database cluster to the user as if it were a single server, always on, no matter what happens.The most amazing things that you will see in Tungsten Enterprise are
  • Automatic failover
  • Cluster console, and One-Command-Operations
  • transparent connections
  • No VIPs !!!
  • Multi site switch and failover

Automatic failover

This is probably the most amazing feature of all. It is a combination of the same efficient replication technology seen in Tungsten Replicator, which uses a global transaction ID to allow a seamless failover, and a management system, made of components that communicate to each other and can replace a failed master within seconds, even under heavy load. All this, without the application having more trouble than a few seconds delay (see transparent connections below). This feature is customizable. If the manager is in "automatic" mode, it will replace a failed master without manual intervention, and it will try to put online every element that goes offline. In "manual" mode, however, it will let the user take control of operations as needed.

Cluster Console, and One-Command-Operations

Tungsten Enterprise comes with a text-based console that gives immediate access to the cluster information, and lets the users perform maintenance without troubling them with the inner knowledge necessary to perform the tasks. Promoting a slave to master (a planned "switch", as opposed to an unplanned "failover") is just one command, even though behind the scenes the Tungsten Manager runs a dozen commands to complete the task safely.Backup and restore are also one command. And so are all the dozens of administrative tasks that the Tungsten Manager allows the user. The console comes with a comprehensive help that explains all commands in detail. The console allows the DBA to perform operations in any text terminal, without additional components such as a desktop application or a web interface.

Tungsten enterprise overview

Transparent connections

The suite includes a component called Tungsten Connector, which is a sort of high performance proxy between the application and the database. Instead of connecting your applications to the DBMS, you connect it to a Tungsten Connector, which looks and feels as a MySQl (or PostgreSQL) database. The difference is that, when the master changes, the connector will get notified by the Tungsten Manager and immediately re-routes the underlying connections to the appropriate server.Depending on how smart is your application, you can use the Tungsten Connector in two ways:
  • Static routing mode: You create one (or more) connector that will always bring you to the master, and use that connection whenever your application needs to write. And you also create one or more connectors that will always give you access to a slave, and use this one whenever your application needs to read.
  • Smart mode: you ask the connector to detect what you are doing and direct your queries to the appropriate server. This mode sends all transactions and updates to the master, and every read query that is not inside a transaction to the slaves. This mode can also guarantee data consistency, by directing the reading of a just saved record to a slave that has already received that record.
The connector can also do more interesting and even amazing things, such as showing you its status as a SQL query (what fun in being a proxy if you don't take advantage of it?) and allowing on-the-fly changes of policy for an existing connector using command line or SQL parameters. The connector plays well with well designed applications, such as the ones that retry a failed transaction rather than failing, and the ones that are replication-aware and can split reads and writes between connections. But it also plays well with applications that have been designed for a single server, without scalability in mind. In most cases, you replace a single server with a cluster of Tungsten Enterprise, and you are in business.

No VIPs !!!

The failover and switch features are not new in the replication arena. There are tools that do something similar and keep an application connected to the same IP using virtual IPs. I don't like virtual IPs, as they are dumb stateless components between two stateful elements, and I am not the only one who dislikes them (See Virtual IP Addresses and Their Discontents for Database Availability). Using Tungsten Connector instead of a dumb virtual IP makes life so much easier. When you do a failover with a VIP, quite often the application hangs, as the client doesn't detect that the server on the other side has gone away, and thus your failover technology has to somehow identify the hanging connections and cut them: a very painful experience. Instead, the Tungsten Connector will either kill the connection immediately or reroute your query, depending on the needs, and your application doesn't get more than a hiccup.

Composite data services

Multi site switch and failover

A recent addition to the suite is the ability of handling whole sites as single servers.The suite can create and maintain a so called composite data service, which is a cluster that is seen and treated as a single server. in a disaster recovery scenario, you want to have a functioning site in one location and a relay site in another location, ready to take over when sudden disaster strikes.Here's an example of what you can get:

cctrl -multi -expert
Tungsten Enterprise 1.5.0 build 426
sjc: session established
[LOGICAL:EXPERT] / > ls
+----------------------------------------------------------------------------+
|DATA SERVICES: |
+----------------------------------------------------------------------------+
great_company
nyc
sjc

[LOGICAL:EXPERT] / > use great_company
[LOGICAL:EXPERT] /great_company > ls

COORDINATOR[qa.tx6.continuent.com:AUTOMATIC:ONLINE]

DATASOURCES:
+----------------------------------------------------------------------------+
|nyc(composite master:ONLINE) |
|STATUS [OK] [2012/03/05 10:59:28 PM CET] |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|sjc(composite slave:ONLINE) |
|STATUS [OK] [2012/03/05 10:59:30 PM CET] |
+----------------------------------------------------------------------------+
In this scenario, there is a data service called "great_company", which contains two sites that loom like regular servers.Inside each site, there is a cluster, which we can examine at will:

[LOGICAL:EXPERT] /great_company > use nyc
nyc: session established
[LOGICAL:EXPERT] /nyc > ls

COORDINATOR[qa.tx2.continuent.com:AUTOMATIC:ONLINE]

DATASOURCES:
+----------------------------------------------------------------------------+
|qa.tx1.continuent.com(master:ONLINE, progress=1397, THL latency=0.857) |
|STATUS [OK] [2012/03/05 10:58:42 PM CET] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=54, active=0) |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|qa.tx2.continuent.com(slave:ONLINE, progress=1397, latency=0.000) |
|STATUS [OK] [2012/03/05 11:02:19 PM CET] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=qa.tx1.continuent.com, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|qa.tx3.continuent.com(slave:ONLINE, progress=1397, latency=0.000) |
|STATUS [OK] [2012/03/05 10:58:31 PM CET] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=qa.tx1.continuent.com, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
There is a master and two slaves. For each server, we can see the vitals at a glance.The relay site offers a similar view, with the distinction that, instead of a master, there is a relay server. All changes coming from the master in the main site will also go to the relay server, and from that to the slaves in the second site.

[LOGICAL:EXPERT] /nyc > use sjc
[LOGICAL:EXPERT] /sjc > ls

COORDINATOR[qa.tx6.continuent.com:AUTOMATIC:ONLINE]

DATASOURCES:
+----------------------------------------------------------------------------+
|qa.tx6.continuent.com(relay:ONLINE, progress=1397, THL latency=4.456) |
|STATUS [OK] [2012/03/05 10:58:32 PM CET] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=relay, master=qa.tx1.continuent.com, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|qa.tx7.continuent.com(slave:ONLINE, progress=1397, latency=0.000) |
|STATUS [OK] [2012/03/05 10:59:03 PM CET] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=qa.tx6.continuent.com, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|qa.tx8.continuent.com(slave:ONLINE, progress=1397, latency=0.000) |
|STATUS [OK] [2012/03/05 10:58:30 PM CET] |
+----------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=qa.tx6.continuent.com, state=ONLINE) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=0, active=0) |
+----------------------------------------------------------------------------+
If you need to bring the relay site as the main one, all you need to do is to run a switch command:

[LOGICAL:EXPERT] /sjc > use great_company
[LOGICAL:EXPERT] /great_company > switch
SELECTED SLAVE: 'sjc@great_company'
FLUSHING TRANSACTIONS THROUGH 'qa.tx1.continuent.com@nyc'
PUT THE NEW MASTER 'sjc@great_company' ONLINE
PUT THE PRIOR MASTER 'nyc@great_company' ONLINE AS A SLAVE
SWITCH TO 'sjc@great_company' WAS SUCCESSFUL
[LOGICAL:EXPERT] /great_company > ls

COORDINATOR[qa.tx6.continuent.com:AUTOMATIC:ONLINE]

DATASOURCES:
+----------------------------------------------------------------------------+
|nyc(composite slave:ONLINE) |
|STATUS [OK] [2012/03/06 09:44:48 AM CET] |
+----------------------------------------------------------------------------+

+----------------------------------------------------------------------------+
|sjc(composite master:ONLINE) |
|STATUS [OK] [2012/03/06 09:44:47 AM CET] |
+----------------------------------------------------------------------------+
If disaster strikes, instead of "switch" you say failover, and then use the relay site transparently. Did I mention that Tungsten Connector can be configured to use a composite data service transparently? It can, and if you switch your operations from one Coast to another, the applications will follow suit without any manual intervention.It's so cool! I am sure any geek must love it!BTW: this is not a toy application. This suite is handling data centers that are huge by any standard you care to use, with 100+ terabyte moved through this technology.

Usability

The experience gained with the installer for Tungsten Replicator has been very useful for the whole team. Using the same technology, we have now created a more advanced and simpler installation tool, which is summarized in the Tungsten Enterprise Cookbook. Installing a complex cluster has never been easier!

PlanetMySQL Voting: Vote UP / Vote DOWN