<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PlanetMysql.ru - information about the MySQL DBMS &#187; Cluster</title>
	<atom:link href="http://planetmysql.ru/category/cluster/feed/" rel="self" type="application/rss+xml" />
	<link>http://planetmysql.ru</link>
	<description>A blog about MySQL, the most popular DBMS</description>
	<lastBuildDate>Sat, 11 Feb 2012 12:38:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Announcing SkySQL™ Enterprise HA for the MariaDB® &amp; MySQL® databases</title>
		<link>http://www.skysql.com/blogs/jean-jerome-schmidt/announcing-skysql-enterprise-ha-mariadb-mysql-databases-0?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=announcing-skysql-enterprise-ha-for-the-mariadb-mysql-databases</link>
		<comments>http://www.skysql.com/blogs/jean-jerome-schmidt/announcing-skysql-enterprise-ha-mariadb-mysql-databases-0#comments</comments>
		<pubDate>Mon, 23 Jan 2012 14:57:52 +0000</pubDate>
		<dc:creator>SkySQL</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[High Availability]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[MySQL Cluster]]></category>
		<category><![CDATA[Replication]]></category>
		<category><![CDATA[SkySQL]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=082bf0edcc7da3ce5ecd195ac2b6a995</guid>
		<description><![CDATA[SkySQL&#8482; today announced the immediate availability of SkySQL&#8482; Enterprise HA, its leading 360&#176; High Availability solution for the MySQL&#174; &#38; MariaDB&#174; databases.
High Availability is the #1 requested enhancement to the MySQL &#38; MariaDB servers, even more popular than scalability and performance.&#160; And with SkySQL&#039;s expertise at hand, it is now easier than ever before for customers to achieve the level of High Availability that they want.
SkySQL&#8482; Enterprise HA is SkySQL&#039;s 360&#176; answer to providing a ready-to-go solution for MySQL &#38; MariaDB High Availability &#8211; in no more than 3 days.
Check out the following resources for more information:
Visit the SkySQL Enterprise HA product page
Including:


		SkySQL&#8482; Enterprise HA Options Table

		SkySQL&#8482; Enterprise HA Statement of Work

Download the SkySQL High Availability whitepaper
Contact your local SkySQL representative to discuss your HA needs
Finally, if you are in New York City today, join Ivan Zoratti, SkySQL CTO, at the MySQL Meetup for a discussion about cool new tools &#38; tricks to achieve High Availability of your MySQL servers!
For more information, visit the New York City MySQL Group webpage.
We look forward to helping you achieve your High Availability objectives for your MySQL &#38; MariaDB databases!]]></description>
			<content:encoded><![CDATA[<p><strong>SkySQL&trade;</strong> today announced the immediate availability of <strong><a href="http://www.skysql.com/services/consulting/mysql-high-availability">SkySQL&trade; Enterprise HA</a></strong>, its leading 360&deg; High Availability solution for the MySQL&reg; &amp; MariaDB&reg; databases.</p>
<p>High Availability is the #1 requested enhancement to the MySQL &amp; MariaDB servers, even more popular than scalability and performance.&nbsp; And with <a href="http://www.skysql.com/services/consulting/mysql-high-availability"><u>SkySQL&#39;s expertise at hand</u></a>, it is now easier than ever before for customers to achieve the level of High Availability that they want.</p>
<p><a href="http://www.skysql.com/services/consulting/mysql-high-availability">SkySQL&trade;</a><a href="http://www.skysql.com/services/consulting/mysql-high-availability"><u> Enterprise HA</u></a> is SkySQL&#39;s 360&deg; answer to providing a ready-to-go solution for MySQL &amp; MariaDB High Availability &ndash; <strong>in no more than 3 days</strong>.</p>
<p>Check out the following resources for more information:</p>
<p><a href="http://www.skysql.com/services/consulting/mysql-high-availability"><u>Visit the SkySQL Enterprise HA product page</u></a></p>
<p>Including:</p>
<ul>
<li>
		SkySQL&trade; Enterprise HA Options Table</li>
<li>
		SkySQL&trade; Enterprise HA Statement of Work</li>
</ul>
<p><a href="http://www.skysql.com/news-and-events/white-papers/high-availability-solutions-mysql-database"><u>Download the SkySQL High Availability whitepaper</u></a></p>
<p><a href="http://www.skysql.com/company/contact"><u>Contact your local SkySQL representative to discuss your HA needs</u></a></p>
<p>Finally, if you are in New York City today, join Ivan Zoratti, SkySQL CTO, at the MySQL Meetup for a discussion about cool new tools &amp; tricks to achieve High Availability of your MySQL servers!</p>
<p><a href="http://www.skysql.com/news-and-events/events/database-month-mysql-high-availability-reloaded">For more information, visit the New York City MySQL Group webpage.</a></p>
<p>We look forward to helping you achieve your High Availability objectives for your MySQL &amp; MariaDB databases!</p>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/01/23/announcing-skysql-enterprise-ha-for-the-mariadb-mysql-databases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2011, A great year for MySQL in review&#8230;</title>
		<link>http://feedproxy.google.com/~r/ItsJustAboutCommunication/~3/5Isa-1JnQjc/2011-great-year-for-mysql-in-review.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=2011-a-great-year-for-mysql-in-review</link>
		<comments>http://feedproxy.google.com/~r/ItsJustAboutCommunication/~3/5Isa-1JnQjc/2011-great-year-for-mysql-in-review.html#comments</comments>
		<pubDate>Thu, 29 Dec 2011 12:31:00 +0000</pubDate>
		<dc:creator>Luca Olivari</dc:creator>
				<category><![CDATA[2011]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[events]]></category>
		<category><![CDATA[marketing]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[opensource]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[windows]]></category>
		<category><![CDATA[workbench]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=75602c8a5ac8a5d4b30226af56776c9b</guid>
		<description><![CDATA[I see so many posts on what happened to company X, product Y and dream Z that I couldn't resist the temptation to summarize this great year for MySQL. At the end of 2010, Oracle made an announcement we were all waiting for:&#160;MySQL 5.5 is GA!&#160;Another year has passed since then and it's time to reflect on what has been done.

I know this is a long post. I tried to rewrite it at least 10 times to make it shorter, but I couldn't condense the list. Hence, I wrote a summary in the beginning for those who don't want to read it all.

I believe that 2011 was an exceptional year for MySQL and I really enjoy being part of this team. I wish all of us a lot of success and fun in the years to come!

Summary:
Oracle released many&#160;MySQL 5.6 and&#160;MySQL Cluster 7.2&#160;DMRs accompanied&#160;by new versions of MySQL Enterprise Monitor, MySQL Enterprise Backup,&#160;MySQL Workbench&#160;(and utilities), MySQL Proxy, MySQL Cluster Manager&#160;and&#160;Connectors.

The MySQL team unveiled new products like the MySQL Installer for Windows and Oracle VM Templates for MySQL. Besides, the&#160;MySQL Enterprise offering has been enriched with new commercial extensions.&#160;MySQL can now be leveraged as one of the Oracle data management solutions with new certifications&#160;and the integration with My Oracle Support&#160;increased the business value of customers' investment in Oracle technologies.

Additionally, MySQL presented at major events across the world and won a few awards.


Long List:
If you're still reading, below you can find a hopefully extensive list of announcements and blogs (in reverse&#160;chronological&#160;order). I've mainly covered product releases, events and awards. Please let me know if I missed something.

Products:&#160;
Dec 26 - MySQL Workbench 5.2.37 Has Been Released
Dec 20 - MySQL 5.6.4 Development Milestone Now Available!
Dec 02 - MySQL Enterprise Monitor 2.3.8 is now GA!
Nov 28 - MySQL 5.5.18 Debian packaging now available
Oct 10 - New MySQL Enterprise Oracle Certifications
Oct 10 - MySQL Utilities 1.0.3
Oct 07 - MySQL Cluster 7.2 (DMR2): NoSQL, Key/Value, Memcached
Oct 03 - More Early Access Features in the MySQL 5.6.3 Development Milestone!
Oct 03 -&#160;New Development Milestone Releases &#38; Certifications!
Sep 15 - New Commercial Extensions for MySQL Enterprise Editions
Sep 09 - MySQL@Oracle OpenWorld
Sep 06 -&#160;Oracle Enhances MySQL Installer and High Availability for Windows
Sep 06 - Oracle Enhances MySQL Manageability on Windows
Aug 19 - MySQL Proxy 0.8.2 Has Been Released
Aug 01 -&#160;More New MySQL 5.6 Early Access Features
Jul 19 -&#160;MySQL Enterprise Backup 3.6 - New backup streaming, integration with Oracle Secure Backup and other common backup media solutions
Jul 18 - Simpler and Safer Clustering: MySQL Cluster Manager Update
Jul 06 - Announced Oracle VM Templates for MySQL
Apr 12 - MySQL Cluster 7.2 Development Milestone Release - NoSQL with Memcached and 20x Higher JOIN Performance
Apr 11 -&#160;Top Features in MySQL 5.6.2 Development Milestone Release
Apr 11 - Introducing the MySQL Installer for Windows
Mar 15 - Oracle Enhances MySQL Enterprise Edition

Events:
Oct 26 - A lot of MySQL Events in Europe
Oct 12 - MySQL Roadshow in Germany
Sep 16 - OTN MySQL Developer Day in London
Aug 08 - OTN Developer Day: MySQL is Coming to Washington, DC
Jul 14 -&#160;New “Meet The MySQL Experts” Podcast Series
May 13 - Upcoming MySQL Events in Europe
Apr 26 -&#160;OTN Developer Day for MySQL - Santa Clara, CA
Mar 25 - MySQL (and Cluster) at Collaborate and O'Reilly MySQL Conference
Mar 14 -&#160;First Ever MySQL on Windows Online Forum - March 16, 2011

Awards:
Dec 15 -&#160;MySQL Wins Best Open Source Product of 2011 Award
Jun 03 - MySQL Wins the php&#124;architect Impact Award for Data Management
Jan 17 - MySQL Makes the Cover of Oracle Magazine

To all MySQL customers, partners, colleagues, developers, users, advocates or aficionados:&#160;Thank you for this terrific year!&#160;Go MySQL!]]></description>
			<content:encoded><![CDATA[I see so many posts on what happened to company X, product Y and dream Z that I couldn't resist the temptation to summarize this great year for MySQL. At the end of 2010, Oracle made an announcement we were all waiting for:&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/mysql_55_is_ga">MySQL 5.5 is GA</a>!&nbsp;Another year has passed since then and it's time to reflect on what has been done.<br />
<br />
I know this is a long post. I tried to rewrite it at least 10 times to make it shorter, but I couldn't condense the list. Hence, I wrote a summary in the beginning for those who don't want to read it all.<br />
<br />
I believe that 2011 was an exceptional year for MySQL and I really enjoy being part of this team. I wish all of us a lot of success and fun in the years to come!<br />
<br />
<b>Summary:</b><br />
<a href="http://www.mysql.com/common/logos/logo-mysql-110x57.png" imageanchor="1"><img border="0" src="http://www.mysql.com/common/logos/logo-mysql-110x57.png" /></a>Oracle released many&nbsp;<a href="http://dev.mysql.com/tech-resources/articles/whats-new-in-mysql-5.6.html">MySQL 5.6 </a>and&nbsp;<a href="http://dev.mysql.com/tech-resources/articles/mysql-cluster-labs-dev-milestone-release.html">MySQL Cluster 7.2</a>&nbsp;DMRs accompanied&nbsp;by new versions of <a href="http://mysql.com/products/enterprise/monitor.html">MySQL Enterprise Monitor</a>, <a href="http://mysql.com/products/enterprise/backup.html">MySQL Enterprise Backup</a>,&nbsp;<a href="http://www.mysql.com/products/workbench/">MySQL Workbench</a>&nbsp;(and <a href="http://drcharlesbell.blogspot.com/2011/10/mysql-utilities-release-103.html">utilities</a>), <a href="http://dev.mysql.com/downloads/mysql-proxy/">MySQL Proxy</a>, <a href="http://www.mysql.com/products/cluster/mcm/">MySQL Cluster Manager</a>&nbsp;and&nbsp;<a href="http://dev.mysql.com/downloads/connector/">Connectors</a>.<br />
<br />
The MySQL team unveiled new products like the <a href="http://dev.mysql.com/tech-resources/articles/mysql-installer-for-windows.html">MySQL Installer</a> for Windows and <a href="http://www.oracle.com/us/corporate/press/421994">Oracle VM Templates for MySQL</a>. Besides, the&nbsp;<a href="http://www.mysql.com/products/enterprise/">MySQL Enterprise</a> offering has been enriched with new <a href="http://blogs.oracle.com/MySQL/entry/new_commercial_extensions_for_mysql">commercial extensions</a>.&nbsp;MySQL can now be leveraged as one of the Oracle data management solutions with new <a href="http://blogs.oracle.com/MySQL/entry/new_mysql_enterprise_oracle_certifications">certifications</a>&nbsp;and the integration with <a href="http://www.oracle.com/us/support/mos-mysql-297243.html">My Oracle Support</a>&nbsp;increased the business value of customers' investment in Oracle technologies.<br />
<br />
Additionally, MySQL presented at major <a href="http://mysql.com/news-and-events/events/">events </a>across the world and won a few <a href="http://www.mysql.com/why-mysql/awards/">awards</a>.<br />
<br />
<a name='more'></a><br />
<b>Long List:</b><br />
If you're still reading, below you can find a hopefully extensive list of announcements and blogs (in reverse&nbsp;chronological&nbsp;order). I've mainly covered product releases, events and awards. Please let me know if I missed something.<br />
<br />
<b>Products:&nbsp;</b><br />
Dec 26 - <a href="http://blogs.oracle.com/mysqlworkbench/entry/mysql_workbench_5_2_37">MySQL Workbench 5.2.37 Has Been Released</a><br />
Dec 20 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_5_6_4_development">MySQL 5.6.4 Development Milestone Now Available!</a><br />
Dec 02 - <a href="http://blogs.oracle.com/mysqlenterprise/entry/mysql_enterprise_monitor_2_34">MySQL Enterprise Monitor 2.3.8 is now GA!</a><br />
Nov 28 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_5_5_18_debian">MySQL 5.5.18 Debian packaging now available</a><br />
Oct 10 - <a href="http://blogs.oracle.com/MySQL/entry/new_mysql_enterprise_oracle_certifications">New MySQL Enterprise Oracle Certifications</a><br />
Oct 10 - <a href="http://drcharlesbell.blogspot.com/2011/10/mysql-utilities-release-103.html">MySQL Utilities 1.0.3</a><br />
Oct 07 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_7_2_dmr2">MySQL Cluster 7.2 (DMR2): NoSQL, Key/Value, Memcached</a><br />
Oct 03 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_7_2_dmr2">More Early Access Features in the MySQL 5.6.3 Development Milestone!</a><br />
Oct 03 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/new_development_milestone_releases_certifications">New Development Milestone Releases &amp; Certifications!</a><br />
Sep 15 - <a href="http://blogs.oracle.com/MySQL/entry/new_commercial_extensions_for_mysql">New Commercial Extensions for MySQL Enterprise Editions</a><br />
Sep 09 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_oracle_openworld">MySQL@Oracle OpenWorld</a><br />
Sep 06 -&nbsp;<a href="http://www.oracle.com/us/corporate/press/485067">Oracle Enhances MySQL Installer and High Availability for Windows</a><br />
Sep 06 - <a href="http://blogs.oracle.com/MySQL/entry/oracle_enhances_mysql_manageability_on">Oracle Enhances MySQL Manageability on Windows</a><br />
Aug 19 - <a href="http://blogs.oracle.com/mysqlenterprise/entry/mysql_proxy_0_8_2">MySQL Proxy 0.8.2 Has Been Released</a><br />
Aug 01 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/more_new_mysql_5_6">More New MySQL 5.6 Early Access Features</a><br />
Jul 19 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/mysql_enterprise_backup_3_6">MySQL Enterprise Backup 3.6 - New backup streaming, integration with Oracle Secure Backup and other common backup media solutions</a><br />
Jul 18 - <a href="http://blogs.oracle.com/MySQL/entry/simpler_and_safer_clustering_mysql">Simpler and Safer Clustering: MySQL Cluster Manager Update</a><br />
Jul 06 - <a href="http://blogs.oracle.com/MySQL/entry/virtualizing_mysql_1_click_kick">Announced Oracle VM Templates for MySQL</a><br />
Apr 12 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_72_development_milestone_release_-_nosql_with_memcached_and_20x_higher_join_performanc">MySQL Cluster 7.2 Development Milestone Release - NoSQL with Memcached and 20x Higher JOIN Performance</a><br />
Apr 11 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/top_features_in_mysql_562_development_milestone_release">Top Features in MySQL 5.6.2 Development Milestone Release</a><br />
Apr 11 - <a href="http://dev.mysql.com/tech-resources/articles/mysql-installer-for-windows.html">Introducing the MySQL Installer for Windows</a><br />
Mar 15 - <a href="http://www.oracle.com/us/corporate/press/339030">Oracle Enhances MySQL Enterprise Edition</a><br />
<br />
<b>Events:</b><br />
Oct 26 - <a href="http://blogs.oracle.com/MySQL/entry/and_more_mysql_events_in">A lot of MySQL Events in Europe</a><br />
Oct 12 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_roadshow_in_germany">MySQL Roadshow in Germany</a><br />
Sep 16 - <a href="http://blogs.oracle.com/MySQL/entry/otn_mysql_developer_day_in">OTN MySQL Developer Day in London</a><br />
Aug 08 - <a href="http://blogs.oracle.com/MySQL/entry/otn_developer_day_mysql_is">OTN Developer Day: MySQL is Coming to Washington, DC</a><br />
Jul 14 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/new_meet_the_mysql_experts">New “Meet The MySQL Experts” Podcast Series</a><br />
May 13 - <a href="http://blogs.oracle.com/MySQL/entry/upcoming_mysql_events_in_europe">Upcoming MySQL Events in Europe</a><br />
Apr 26 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/otn_developer_day_for_mysql_-_santa_clara_ca">OTN Developer Day for MySQL - Santa Clara, CA</a><br />
Mar 25 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_on_the_road_oreilly_mysql_and_collaborate_conferences">MySQL (and Cluster) at Collaborate and O'Reilly MySQL Conference</a><br />
Mar 14 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/first_ever_mysql_on_windows_online_forum_-_march_16_2011">First Ever MySQL on Windows Online Forum - March 16, 2011</a><br />
<br />
<b>Awards:</b><br />
Dec 15 -&nbsp;<a href="http://mysql%20wins%20best%20open%20source%20product%20of%202011%20award/">MySQL Wins Best Open Source Product of 2011 Award</a><br />
Jun 03 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_wins_the_php_architect">MySQL Wins the php|architect Impact Award for Data Management</a><br />
Jan 17 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_makes_the_cover_of_oracle_magazine">MySQL Makes the Cover of Oracle Magazine</a><br />
<br />
To all MySQL customers, partners, colleagues, developers, users, advocates or aficionados:&nbsp;<b>Thank you for this terrific year!&nbsp;Go MySQL!</b>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/29/2011-a-great-year-for-mysql-in-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual Consistency in MySQL Cluster &#8212; implementation part 3</title>
		<link>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-in-mysql-cluster-implementation-part-3</link>
		<comments>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html#comments</comments>
		<pubDate>Thu, 22 Dec 2011 17:36:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=08982a5a78aac34767dc093723414161</guid>
		<description><![CDATA[As promised, this is the final post in a series looking at eventual consistency with MySQL Cluster asynchronous replication. This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.

Transaction based conflict handling with NDB$EPOCH_TRANS
NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described here.

Including user transaction ids in the Binlog
Ndb Binlog epoch transactions contain row events from all the user transactions which committed in an epoch. However, there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'rollback' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The mysqlbinlog --verbose tool can be used to see per-row transaction information in the Binlog.

User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. One interesting possibility is to use the user transaction ids and same-row operation dependencies to sort the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. 
A project for a rainy day perhaps?

NDB$EPOCH_TRANS multiple slave passes
Initially, NDB$EPOCH_TRANS proceeds in the same way as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However if a row conflict is detected, the epoch transaction is rolled back, and reapplied. This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.

In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled-back and a third pass begins.

In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same way as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.

These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. 
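
The pass structure described above can be sketched roughly as follows. This is a toy model of the control flow only, not the MySQLD slave code: apply_epoch, EpochResult and the three callbacks are hypothetical names introduced for illustration, each callback is assumed to roll back its own epoch transaction when it does not commit, and max_rounds is an illustrative safety bound that the post does not describe.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class EpochResult:
    committed: bool
    iterations: int          # rounds of (second pass, third pass) that ran
    rejected_txns: Set[int]  # user transactions handled as conflicts


def apply_epoch(
    first_pass: Callable[[], bool],
    second_pass: Callable[[], Set[int]],
    third_pass: Callable[[Set[int]], bool],
    max_rounds: int = 10,    # assumption: bound added for the sketch only
) -> EpochResult:
    """Toy control flow for applying one epoch transaction.

    first_pass()  returns True if every row applied without a conflict.
    second_pass() returns the conflicting user transaction ids, including
                  transactions that depend on a conflicting one.
    third_pass(s) returns True if applying only the transactions outside
                  s found no further conflicts before commit.
    """
    if first_pass():
        # Optimistic case: commit with no extra passes.
        return EpochResult(True, 0, set())

    for round_no in range(1, max_rounds + 1):
        conflicting = second_pass()
        if third_pass(conflicting):
            return EpochResult(True, round_no, conflicting)
        # Further conflicts appeared during the third pass, so the
        # loop returns to the second pass.

    return EpochResult(False, max_rounds, set())


# Usage: the third pass hits a fresh conflict once, then succeeds.
outcomes = iter([False, True])
result = apply_epoch(lambda: False, lambda: {7, 9}, lambda s: next(outcomes))
print(result.iterations)  # 2
```
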
As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.

Dependency tracking implementation
To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.

Transaction dependency detection hash :
{Table, Primary keys} -&#62; {Transaction id}

The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.

Transaction dependency tracking and conflict marking hash :
{Transaction id} -&#62; {in_conflict, List of dependent transactions}

As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by traversing the dependency tree of the in-conflict transaction.  
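
The two hashes described above might look something like this in miniature. It is a sketch under assumptions rather than the Ndb implementation: DependencyTracker and its method names are hypothetical, the first hash is assumed to keep only the most recent writer of each row, and the in-conflict marker is held in a separate set instead of inside the second hash's value.

```python
from collections import defaultdict


class DependencyTracker:
    """Toy version of the two hashes built while applying an epoch."""

    def __init__(self):
        # {(table, primary key)} -> id of the last transaction seen writing the row
        self.row_writer = {}
        # {transaction id} -> ids of transactions that depend on it
        self.dependents = defaultdict(set)
        # transaction ids currently marked as in-conflict
        self.in_conflict = set()

    def add_row_event(self, txn_id, table, primary_key):
        """Record a row event; a repeated row key creates a dependency."""
        key = (table, primary_key)
        earlier = self.row_writer.get(key)
        if earlier is not None and earlier != txn_id:
            self.dependents[earlier].add(txn_id)
            # A new dependent of an in-conflict transaction is itself
            # in conflict, so it is marked straight away.
            if earlier in self.in_conflict:
                self.mark_conflict(txn_id)
        self.row_writer[key] = txn_id

    def mark_conflict(self, txn_id):
        """Mark txn_id and, transitively, everything that depends on it."""
        pending = [txn_id]
        while pending:
            current = pending.pop()
            if current in self.in_conflict:
                continue
            self.in_conflict.add(current)
            pending.extend(self.dependents[current])


# Usage: txn 2 rewrites a row written by txn 1, txn 3 rewrites a row
# written by txn 2, and txn 4 touches an unrelated row.
tracker = DependencyTracker()
tracker.add_row_event(1, "t1", 10)
tracker.add_row_event(2, "t1", 10)
tracker.add_row_event(2, "t2", 5)
tracker.add_row_event(3, "t2", 5)
tracker.add_row_event(4, "t3", 1)
tracker.mark_conflict(1)
print(sorted(tracker.in_conflict))  # [1, 2, 3]
```

An explicit stack is used for the traversal so that a long dependency chain inside a large epoch cannot hit Python's recursion limit.
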
Due to slave batching, the addition of new dependencies and the marking of conflicting transactions is interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.

After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :

Transaction in-conflict lookup hash :
{Transaction id} -&#62; {in_conflict}

This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.

The size of these hashes, and the complexity of the dependency graph, is bounded by the size of the epoch transaction. There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.

Event counters
Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.

Additional counters have been added which specifically track the different events associated with transactional conflict detection. These can be seen with the usual SHOW GLOBAL STATUS LIKE syntax, or via the INFORMATION_SCHEMA tables.

ndb_conflict_trans_row_conflict_count
This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.

ndb_conflict_trans_row_reject_count
The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction.
A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count

ndb_conflict_trans_reject_count
The number of discrete user transactions detected as in-conflict.

ndb_conflict_trans_conflict_commit_count
The number of epoch transactions which had transactional conflicts detected during application.

ndb_conflict_trans_detect_iter_count
The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.

Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS
I have tried to avoid getting involved in an explanation of Ndb replication in general which would probably fill a terabyte of posts. 
Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?

- Conflict detection logic is pushed down to data nodes for execution
  Minimising extra data transfer + locking
- Slave operation batching is preserved
  Multiple row events are applied together, saving MySQLD &#60;-&#62; data node round trips, using data node parallelism
  For both algorithms, one extra MySQLD &#60;-&#62; data node round-trip is required in the no-conflicts case (best case)
- NDB$EPOCH : One extra MySQLD &#60;-&#62; data node round-trip is required per *batch* in the all-conflicts case (worst case)
- NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.
- NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption
- NDB$EPOCH_TRANS : One extra MySQLD &#60;-&#62; data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)
- NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.
- NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.
- NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case

Current and intrinsic limitations

These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.

Intrinsic limitations

Reads from the Secondary are tentative
Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable.  This is described in more detail in implementation part 2 of this series.

Writes to the Secondary may be rolled back
If this occurs, the fact will be recorded on the Primary. 
Once a committed write is stable it will not be rolled back.

Out-of-band dependencies between transactions are out-of-scope
For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.

Current implementation limitations

Detected transaction dependencies are limited to dependencies between binlogged writes (Insert, Update, Delete)
Reads are not currently included.

Delete vs Delete+Insert conflicts risk data divergence
Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert can result in data divergence.

With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency
With pending multiple, time spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.

Want to try it out?

Andrew Morgan has written a great post showing how to set up NDB$EPOCH_TRANS. He's even included non-ascii art.  This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.

Edit 23/12/11 : Added index]]></description>
			<content:encoded><![CDATA[<a href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"><img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html#mymap" border="0" /><br /></a><br /><map name="mymap"><area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html" /><area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html" /><area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html" /><area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html" /><area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /><area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /></map><br />As promised, this is the final post in a series looking at eventual consistency with MySQL 
Cluster asynchronous replication.  This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.<br /><br /><span>Transaction based conflict handling with NDB$EPOCH_TRANS</span><br /><br />NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described <a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html">here</a>.<br /><br /><span>Including user transaction ids in the Binlog</span><br /><br />Ndb Binlog <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html">epoch transactions</a> contain row events from all the user transactions which committed in an epoch. However there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'rollback' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in-turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The <a href="http://dev.mysql.com/doc/refman/5.1/en/mysqlbinlog.html">mysqlbinlog</a> --verbose tool can be used to see per-row transaction information in the Binlog.<br /><br />User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. 
One interesting possibility is to use the user transaction ids and same-row operation dependencies to <a href="http://en.wikipedia.org/wiki/Topological_sorting">sort</a> the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. A project for a rainy day perhaps?<br /><br /><span>NDB$EPOCH_TRANS multiple slave passes</span><br /><br />Initially, NDB$EPOCH_TRANS proceeds in the same <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html">way</a> as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However if a row conflict is detected, the epoch transaction is rolled back, and reapplied.  This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.<br /><br />In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled-back and a third pass begins.<br /><br />In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">way</a> as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. 
If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.<br /><br />These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.<br /><br /><span>Dependency tracking implementation</span><br /><br />To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.<br /><br />Transaction dependency detection hash :<br /><div>{Table, Primary keys} -&gt; {Transaction id}<br /></div><br />The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. 
By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.<br /><br />Transaction dependency tracking and conflict marking hash :<br /><div>{Transaction id} -&gt; {in_conflict, List}<br /></div><br />As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by a traverse through the dependency tree of the in-conflict transaction.  Due to slave batching, the addition of new dependencies and the marking of conflicting transactions is interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.<br /><br />After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :<br /><br />Transaction in-conflict lookup hash :<br /><div>{Transaction id} -&gt; {in_conflict}<br /></div><br />This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.<br /><br />The size of these hashes, and the complexity of the dependency graph is bounded by the size of the epoch transaction.  There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.<br /><br /><span>Event counters</span><br /><br />Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.<br /><br />Additional counters have been added which specifically track the different events associated with transactional conflict detection.  
These can be seen with the usual SHOW GLOBAL STATUS LIKE <a href="http://dev.mysql.com/doc/refman/5.1/en/show-status.html">syntax</a>, or via the INFORMATION_SCHEMA <a href="http://dev.mysql.com/doc/refman/5.1/en/status-table.html">tables</a>.<br /><br /><ul><li><span>ndb_conflict_trans_row_conflict_count</span><br />This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.</li><li><span>ndb_conflict_trans_row_reject_count</span><br />The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction.<br />A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count</li><li><span>ndb_conflict_trans_reject_count</span><br />The number of discrete user transactions detected as in-conflict.</li><li><span>ndb_conflict_trans_conflict_commit_count</span><br />The number of epoch transactions which had transactional conflicts detected during application.</li><li><span>ndb_conflict_trans_detect_iter_count</span><br />The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. 
So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.<br /></li></ul><br /><br /><span>Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS</span><br /><br />I have tried to avoid getting involved in an explanation of Ndb replication in general which would probably fill a terabyte of posts. Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?<br /><br /><ul><li>Conflict detection logic is pushed down to data nodes for execution<br />Minimising extra data transfer + locking</li><li>Slave operation batching is preserved<br />Multiple row events are applied together, saving MySQLD &lt;-&gt; data node round trips, using data node parallelism<br />For both algorithms, one extra MySQLD &lt;-&gt; data node round-trip is required in the no-conflicts case (best case)</li><li>NDB$EPOCH : One extra MySQLD &lt;-&gt; data node round-trip is required per *batch* in the all-conflicts case (worst case)</li><li>NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.</li><li>NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption</li><li>NDB$EPOCH_TRANS : One extra MySQLD &lt;-&gt; data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)</li><li>NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.</li><li>NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.</li><li>NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case<br /></li></ul><br /><span>Current and intrinsic limitations</span><br /><br />These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. 
Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.<br /><br /><span>Intrinsic limitations</span><br /><ul><li><span>Reads from the Secondary are tentative</span><br />Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable.  This is described in more detail <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">here</a>.</li><li><span>Writes to the Secondary may be rolled back</span><br />If this occurs, the fact will be recorded on the Primary. Once a committed write is <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">stable</a> it will not be rolled back.</li><li><span>Out-of-band dependencies between transactions are out-of-scope</span><br />For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.<br /></li></ul><br /><span>Current implementation limitations</span><br /><br /><ul><li><span>Detected transaction dependencies are limited to dependencies between binlogged writes</span> (Insert, Update, Delete)<br />Reads are not currently included.</li><li><span>Delete vs Delete+Insert conflicts risk data divergence</span><br />Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert can result in data divergence.</li><li><span>With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency</span><br />With pending multiple, time spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.<br /></li></ul><br /><span>Want to try it out?</span><br /><br />Andrew Morgan has written a great <a 
href="http://www.clusterdb.com/mysql-cluster/enhanced-conflict-resolution-with-mysql-cluster-active-active-replication/">post</a> showing how to set up NDB$EPOCH_TRANS. He's even included non-ascii art.  This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.<br /><br /><span>Edit 23/12/11 : Added index</span>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/22/eventual-consistency-in-mysql-cluster-implementation-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual consistency in MySQL Cluster &#8212; implementation part 2</title>
		<link>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-in-mysql-cluster-implementation-part-2</link>
		<comments>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html#comments</comments>
		<pubDate>Mon, 19 Dec 2011 13:30:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=d98d0d71c89256f4c9e1ab4c94fa6c42</guid>
		<description><![CDATA[In previous posts I described how row conflicts are detected using epochs.  In this post I describe how they are handled.

Row based conflict handling with NDB$EPOCH

Once a row conflict is detected, as well as rejecting the row change, row based conflict handling in the Slave will :
- Increment conflict counters
- Optionally insert a row into an exceptions table

For NDB$EPOCH, conflict detection and handling operates on one Cluster in an Active-Active pair designated as the Primary.  When a Slave MySQLD attached to the Primary Cluster detects a conflict between data stored in the Primary and a replicated event from the Secondary, it needs to realign the Secondary to store the same values for the conflicting data.  Realignment involves injecting an event into the Primary Cluster's Binlog which, when applied idempotently on the Secondary Cluster, will force the row on the Secondary Cluster to take the supplied values.  This requires either a WRITE_ROW event, with all columns, or a DELETE_ROW event with just the primary key columns.  These events can be thought of as compensating events used to revert the original effect of the rejected events.

Conflicts are detected by a Slave MySQLD attached to the Primary Cluster, and realignment events must appear in Binlogs recorded by the same MySQLD and/or other Binlogging MySQLDs attached to the Primary Cluster.  
This is achieved using a new NdbApi primary key operation type called refreshTuple.

When a refreshTuple operation is executed it will :
- Lock the affected row/primary key until transaction commit time, even if it does not exist (much as an Insert would).
- Set the affected row's author metacolumn to 0
  The refresh is logically a local change
- On commit
  - Row exists case : Set the row's last committed epoch to the current epoch
  - Cause a WRITE_ROW (row exists case) or DELETE_ROW (no row exists) event to be generated by attached Binlogging MySQLDs.

Locking the row as part of refreshTuple serialises the conflicting epoch transaction with other potentially conflicting local transactions.  Updating the stored epoch and author metacolumns results in the conflicting row conflicting with any further replicated changes occurring while the realignment event is 'in flight'.  The compensating row events are effectively new row changes originating at the Primary cluster which need to be monitored for conflicts in the same way as normal row changes.

It is important that the Slave running at the Secondary Cluster where the realignment events will be applied, is running in idempotent mode, so that it can handle the realignment events correctly.  If this is not the case then WRITE_ROW realignment events may hit 'Row already exists' errors, and DELETE_ROW realignment events may hit 'Row does not exist' errors.

Observations on conflict windows and consistency

When a conflict is detected, the refresh process results in the row's epoch and author metacolumns being modified so that the window of potential conflict is extended, until the epoch in which the refresh operation was recorded has itself been reflected.  If ongoing updates at both clusters continually conflict then refresh operations will continue to be generated, and the conflict window will remain open until a refresh operation manages to propagate with no further conflicts occurring.  
As with any eventually consistent system, consistency is only guaranteed when the system (or at least the data of interest) is quiescent for a period.

From the Primary cluster's point of view, the conflict window length is the time between committing a local transaction in epoch n, and the attached Slave committing a replicated epoch transaction indicating that epoch n has been applied at the Secondary.  Any Secondary-sourced overlapping change applied in this time is in-conflict.

This Cluster conflict window length is comprised of :

Time between commit of transaction, and next Primary Cluster epoch boundary
(Worst = 1 * TimeBetweenEpochs, Best = 0, Avg = 0.5 * TimeBetweenEpochs)

Time required to log event in Primary Cluster's Binlogging MySQLDs Binlog (~negligible)

Time required for Secondary Slave MySQLD IO thread to
  - Minimum : Detect new Binlog data - negligible
  - Maximum : Consume queued Binlog prior to the new data - unbounded
  - Pull new epoch transaction
  - Record in Relay log

Time required for Secondary Slave MySQLD SQL thread to
  - Minimum : Detect new events in relay log
  - Maximum : Consume queued Relay log prior to new data - unbounded
  - Read and apply events
  - Potentially multiple batches.
  - Commit epoch transaction at Secondary

Time between commit of replicated epoch transaction and next Secondary Cluster epoch boundary
(Worst = 1 * TimeBetweenEpochs, Best = 0, Avg = 0.5 * TimeBetweenEpochs)
After this point a Secondary-local commit on the data is possible without conflict

Time required to log event in Secondary Cluster's Binlogging MySQLDs Binlog (~negligible)

Time required for Primary Slave MySQLD IO thread to
  - Minimum : Detect new Binlog data
  - Maximum : Consume queued Binlog data prior to the new data - unbounded
  - Pull new epoch transaction
  - Record in Relay log

Time required for Primary Slave MySQLD SQL thread to
  - Minimum : Detect new events in relay log
  - Maximum : Consume queued Relay log prior to new data - unbounded
  - Read and apply events
  - Potentially multiple batches.
  - For NDB$EPOCH_TRANS, potentially multiple passes
  - Commit epoch transaction
  - Update max replicated epoch to reflect new maximum.
Further Secondary sourced modifications to the rows are now considered not-in-conflict

From the point of view of an external client with access to both Primary and Secondary clusters, the conflict window only extends from the time transaction commit occurs at the Primary to the time the replicated operations are applied at the Secondary, and its commit time Secondary epoch ends. Changes committed at the Secondary after this will clearly appear to the Primary to have occurred after its epoch was applied on the Secondary and therefore are not in-conflict.

Assuming that both Clusters have the same TimeBetweenEpochs, we can simplify the Cluster conflict window to :

  Cluster_conflict_window_length = EpochDelay +
                                   P_Binlog_lag +
                                   S_Relay_lag +
                                   S_Binlog_lag +
                                   P_Relay_lag
  Where
    EpochDelay minimum is 0
    EpochDelay avg     is TimeBetweenEpochs
    EpochDelay maximum is 2 * TimeBetweenEpochs

Substituting the default value of TimeBetweenEpochs of 100 millis, we get :

    EpochDelay minimum is 0
    EpochDelay avg     is 100 millis
    EpochDelay maximum is 200 millis

Note that TimeBetweenEpochs is an epoch-increment trigger delay.  The actual experienced time between epochs can be longer depending on system load.  The various Binlog and Relay log delays can vary from close to zero up to infinity.  
Infinity occurs when replication stops in either direction.

The Cluster conflict window length can be thought of as both
- The time taken to detect a conflict with a Primary transaction
- The time taken for a committed Secondary transaction to become stable or be reverted

We can define a Client conflict window length as either :

Primary-&#62;Secondary
  Client_conflict_window_length = EpochDelay +
                                  P_Binlog_lag +
                                  S_Relay_lag +
                                  EpochDelay
or
Secondary-&#62;Primary
  Client_conflict_window_length = EpochDelay +
                                  S_Binlog_lag +
                                  P_Relay_lag

Where EpochDelay is defined as above.

These definitions are asymmetric.  They represent the time taken by the system to determine that a particular change at one cluster definitely happened-before another change at the other cluster.  The asymmetry is due to the need for the Secondary part of a Primary-&#62;Secondary conflict to be recorded in a different Secondary epoch.  The first definition considers an initial change at the Primary cluster, and a following change at the Secondary.  The second definition is for the inverse case.

An interesting observation is that for a single pair of near-concurrent updates at different clusters, happened-before depends only on latencies in one direction.  For example, an update to the Primary at time Ta, followed by an update to the Secondary at time Tb will not be considered in conflict if:

  Tb - Ta &#62; Client_conflict_window_length(Primary-&#62;Secondary)

Client_conflict_window_length(Primary-&#62;Secondary) depends on the EpochDelay, the P_Binlog_lag and S_Relay_lag, but not on the S_Binlog_lag or P_Relay_lag.  This can mean that high replication latency, or a complete outage in one direction does not always result in increased conflict rates.  
However, in the case of multiple sequences of near-concurrent updates at different sites, it probably will.

A general property of the NDB$EPOCH family is that the conflict rate has some dependency on the replication latency.  Whether two updates to the same row at times Ta and Tb are considered to be in conflict depends on the relationship between those times and the current system replication latencies.  This can remove the need for highly synchronised real-time clocks as recommended for NDB$MAX, but can mean that the observed conflict rate increases when the system is lagging.  This also implies that more work is required to catch up, which could further affect lag.  NDB$MAX requires manual timestamp maintenance, and will not detect incorrect behaviour, but the basic decision on whether two updates are in-conflict is decided at commit time and is independent of the system replication latency.

In summary :
- The Client_conflict_window_length in either direction will on average not be less than the EpochDelay (100 millis by default)
- Clients racing against replication to update both clusters need only beat the current Client_conflict_window_length to cause a conflict
- Replication latencies in either direction are potentially independent
- Detected conflict rates partly depend on replication latencies

Stability of reads from the Primary Cluster

In the case of a conflict, the rows at the Primary Cluster will tentatively have replicated operations applied against them by a Slave MySQLD.   These conflicting operations will fail prior to commit as their interpreted precondition checks will fail, therefore the conflicting rows will not be modified on the Primary.  One effect of this is that a read from the Primary Cluster only ever returns stable data, as conflicting changes are never committed there.  
In contrast, a read from the Secondary Cluster returns data which has been committed, but may be subject to later 'rollback' via refresh operations from the Primary Cluster.

The same stability of reads observation applies to a row change event stream on the Primary Cluster - events received for a single key will be received in the order they were committed, and no later-to-be-rolled-back events will be observed in the stream.

Stability of reads from the Secondary Cluster

If the Secondary Cluster is also receiving reflected applied epoch information back from the Primary then it will know when its epoch x has been applied successfully at the Primary.  Therefore a read of some row y on the Secondary can be considered tentative while Max_Replicated_Epoch(Secondary) &#60; row_epoch(y), but once Max_Replicated_Epoch(Secondary) &#62;= row_epoch(y) then the read can be considered stable.  This is because if the Primary were going to detect a conflict with a Secondary change committed in epoch x, then the refresh events associated with the conflict would be recorded in the same Primary epoch as the notification of the application of epoch x.  So if the Secondary observes the notification of epoch x (and updates Max_Replicated_Epoch accordingly), and row y is not modified in the same epoch transaction, then it is stable.  The time taken to reach stability after a Secondary Cluster commit will be the Cluster conflict window length.

Perhaps some applications can make better use of the potentially transiently inconsistent Secondary data by categorising their reads from the Secondary as either potentially-inconsistent or stable.  To do this, they need to maintain Max_replicated_epoch(Secondary) (By listening to row change events on the ndb_apply_status table) and read the NDB$GCI_64 metacolumn when reading row data.  
A read from the Secondary is stable if all the NDB$GCI_64 values for all rows read are &#60;= the Secondary's Max_Replicated_Epoch.In the next post (final post I promise!) I will describe the implementation of the transaction dependency tracking in NDB$EPOCH_TRANS, and review the implementation of both NDB$EPOCH and NDB$EPOCH_TRANS.Edit 23/12/11 : Added index]]></description>
			<content:encoded><![CDATA[<a href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"><img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html#mymap" border="0" /><br /></a><br /><map name="mymap"><area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html" /><area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html" /><area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html" /><area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html" /><area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /><area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /></map><br />In previous posts I described how row conflicts are detected using epochs.  
In this post I describe how they are handled.<br /><span><br />Row based conflict handling with NDB$EPOCH</span><br /><br />Once a row conflict is detected, as well as rejecting the row change, row based conflict handling in the Slave will :<br /><ul><li>Increment conflict counters</li><li>Optionally insert a row into an exceptions table<br /></li></ul>For NDB$EPOCH, conflict detection and handling operates on one Cluster in an Active-Active pair designated as the Primary.  When a Slave MySQLD attached to the Primary Cluster detects a conflict between data stored in the Primary and a replicated event from the Secondary, it needs to realign the Secondary to store the same values for the conflicting data.  Realignment involves injecting an event into the Primary Cluster's Binlog which, when applied idempotently on the Secondary Cluster, will force the row on the Secondary Cluster to take the supplied values.  This requires either a WRITE_ROW event, with all columns, or a DELETE_ROW event with just the primary key columns.  These events can be thought of as <a href="http://en.wikipedia.org/wiki/Compensating_transaction">compensating</a> events used to revert the original effect of the rejected events.<br /><br />Conflicts are detected by a Slave MySQLD attached to the Primary Cluster, and realignment events must appear in Binlogs recorded by the same MySQLD and/or other Binlogging MySQLDs attached to the Primary Cluster.  
This is achieved using a new <a href="http://dev.mysql.com/doc/ndbapi/en/index.html">NdbApi</a> primary key operation type called <span>refreshTuple</span>.<br /><br />When a refreshTuple operation is executed it will :<br /><ol><li>Lock the affected row/primary key until transaction commit time, even if it does not exist (much as an Insert would).</li><li>Set the affected row's author metacolumn to 0<br />The refresh is logically a local change</li><li>On commit<br />- Row exists case : Set the row's last committed epoch to the current epoch<br />- Cause a WRITE_ROW (row exists case) or DELETE_ROW (no row exists) event to be generated by attached Binlogging MySQLDs.<br /></li></ol><br />Locking the row as part of refreshTuple serialises the conflicting epoch transaction with other potentially conflicting local transactions.  Updating the stored epoch and author metacolumns causes the row to conflict with any further replicated changes that arrive while the realignment event is 'in flight'.  The compensating row events are effectively new row changes originating at the Primary cluster which need to be monitored for conflicts in the same way as normal row changes.<br /><br />It is important that the Slave at the Secondary Cluster, where the realignment events will be applied, is running in idempotent mode so that it can handle the realignment events correctly.  If this is not the case then WRITE_ROW realignment events may hit 'Row already exists' errors, and DELETE_ROW realignment events may hit 'Row does not exist' errors.<br /><br /><span>Observations on conflict windows and consistency</span><br /><br />When a conflict is detected, the refresh process results in the row's epoch and author metacolumns being modified so that the window of potential conflict is extended, until the epoch in which the refresh operation was recorded has itself been reflected.  
If ongoing updates at both clusters continually conflict then refresh operations will continue to be generated, and the conflict window will remain open until a refresh operation manages to propagate with no further conflicts occurring.  As with any eventually consistent system, consistency is only guaranteed when the system (or at least the data of interest) is quiescent for a period.<br /><br />From the Primary cluster's point of view, the <span>conflict window length</span> is the time between committing a local transaction in epoch <span>n</span>, and the attached Slave committing a replicated epoch transaction indicating that epoch <span>n</span> has been applied at the Secondary.  Any Secondary-sourced overlapping change applied in this time is in-conflict.<br /><br />This <span>Cluster conflict window</span> <span>length</span> is comprised of :<br /><br /><ul><li> Time between commit of transaction, and next Primary Cluster epoch boundary<br />(Worst = 1 * <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-timebetweenepochs"><span>TimeBetweenEpochs</span></a>, Best = 0, Avg = 0.5 * <span>TimeBetweenEpochs</span>)</li><li>Time required to log event in Primary Cluster's Binlogging MySQLDs Binlog (~negligible)</li><li>Time required for Secondary Slave MySQLD IO thread to<br />- Minimum : Detect new Binlog data - negligible<br />- Maximum : Consume queued Binlog prior to the new data - unbounded<br />- Pull new epoch transaction<br />- Record in Relay log<br /></li><li>Time required for Secondary Slave MySQLD SQL thread to<br />- Minimum : Detect new events in relay log<br />- Maximum : Consume queued Relay log prior to new data - unbounded<br />- Read and apply events<br />- Potentially multiple batches.<br />- Commit epoch transaction at Secondary</li><li>Time between commit of replicated epoch transaction and next Secondary Cluster epoch boundary<br />(Worst = 1 * <span>TimeBetweenEpochs</span>, Best = 0, Avg = 
0.5 * <span>TimeBetweenEpochs</span>)</li><li><span>After this point a Secondary-local commit on the data is possible without conflict</span></li><li>Time required to log event in Secondary Cluster's Binlogging MySQLDs Binlog (~negligible)</li><li>Time required for Primary Slave MySQLD IO thread to<br />- Minimum : Detect new Binlog data<br />- Maximum : Consume queued Binlog data prior to the new data - unbounded<br />- Pull new epoch transaction<br />- Record in Relay log</li><li>Time required for Primary Slave MySQLD SQL thread to<br />- Minimum : Detect new events in relay log<br />- Maximum : Consume queued Relay log prior to new data - unbounded<br />- Read and apply events<br />- Potentially multiple batches.<br />- For NDB$EPOCH_TRANS, potentially multiple passes<br />- Commit epoch transaction<br />- Update max replicated epoch to reflect new maximum.</li><li>Further Secondary sourced modifications to the rows are now considered not-in-conflict<br /></li></ul><br />From the point of view of an external client with access to both Primary and Secondary clusters, the conflict window only extends from the time transaction commit occurs at the Primary to the time the replicated operations are applied at the Secondary, and its commit time Secondary epoch ends. 
Changes committed at the Secondary after this will clearly appear to the Primary to have occurred after its epoch was applied on the Secondary and therefore are not in-conflict.<br /><br />Assuming that both Clusters have the same <span>TimeBetweenEpochs</span>, we can simplify the Cluster conflict window to :<br /><pre>  Cluster_conflict_window_length = EpochDelay +<br />                                  P_Binlog_lag +<br />                                  S_Relay_lag +<br />                                  S_Binlog_lag +<br />                                  P_Relay_lag<br /><br /> Where<br />    EpochDelay minimum is 0<br />    EpochDelay avg     is TimeBetweenEpochs<br />    EpochDelay maximum is 2 * TimeBetweenEpochs<br /></pre><br /><br />Substituting the default value of <span>TimeBetweenEpochs</span> of 100 millis, we get :<br /><pre>     EpochDelay minimum is 0<br />    EpochDelay avg     is 100 millis<br />    EpochDelay maximum is 200 millis<br /></pre><br /><br />Note that TimeBetweenEpochs is an epoch-increment trigger delay.  The actual experienced time between epochs can be longer depending on system load.  The various Binlog and Relay log delays can vary from close to zero up to infinity.  
Infinity occurs when replication stops in either direction.<br /><br />The <span>Cluster conflict window</span> length can be thought of as both<br /><ul><li>The time taken to detect a conflict with a Primary transaction</li><li>The time taken for a committed Secondary transaction to become stable or be reverted</li></ul><br />We can define a <span>Client conflict window</span> <span>length </span>as either :<br /><pre> Primary-&gt;Secondary<br /><br />  Client_conflict_window_length = EpochDelay +<br />                                  P_Binlog_lag +<br />                                  S_Relay_lag +<br />                                  EpochDelay<br /><br />or<br /><br />Secondary-&gt;Primary<br /><br />  Client_conflict_window_length = EpochDelay +<br />                                  S_Binlog_lag +<br />                                  P_Relay_lag<br /><br />Where EpochDelay is defined as above.<br /></pre><br /><br />These definitions are asymmetric.  They represent the time taken by the system to determine that a particular change at one cluster definitely happened-before another change at the other cluster.  The asymmetry is due to the need for the Secondary part of a Primary-&gt;Secondary conflict to be recorded in a different Secondary epoch.  The first definition considers an initial change at the Primary cluster, and a following change at the Secondary.  The second definition is for the inverse case.<br /><br />An interesting observation is that for a single pair of near-concurrent updates at different clusters, happened-before depends only on latencies in one direction.  
For example, an update to the Primary at time <span>Ta</span>, followed by an update to the Secondary at time <span>Tb</span> will not be considered in conflict if:<br /><br /><pre> Tb - Ta &gt; Client_conflict_window_length(Primary-&gt;Secondary)<br /></pre><br /><br /><span>Client_conflict_window_length(Primary-&gt;Secondary)</span> depends on the <span>EpochDelay</span>, the <span>P_Binlog_lag</span> and <span>S_Relay_lag</span>, but not on the <span>S_Binlog_lag</span> or <span>P_Relay_lag</span>.  This can mean that high replication latency, or a complete outage in one direction does not always result in increased conflict rates.  However, in the case of multiple sequences of near-concurrent updates at different sites, it probably will.<br /><br />A general property of the NDB$EPOCH family is that the conflict rate has some dependency on the replication latency.  Whether two updates to the same row at times <span>Ta</span> and <span>Tb</span> are considered to be in conflict depends on the relationship between those times and the <span>current</span> system replication latencies.  This can remove the need for highly synchronised real-time clocks as recommended for NDB$MAX, but can mean that the observed conflict rate increases when the system is lagging.  This also implies that more work is required to catch up, which could further affect lag.  
NDB$MAX requires manual timestamp maintenance, and will not detect incorrect behaviour, but the basic decision on whether two updates are in-conflict is decided at commit time and is independent of the system replication latency.<br /><br />In summary :<br /><ul><li>The <span>Client_conflict_window_length</span> in either direction will on average not be less than the <span>EpochDelay</span> (100 millis by default)</li><li>Clients racing against replication to update both clusters need only beat the current <span>Client_conflict_window_length</span> to cause a conflict</li><li>Replication latencies in either direction are potentially independent</li><li>Detected conflict rates partly depend on replication latencies</li></ul><br /><span>Stability of reads from the Primary Cluster</span><br /><br />In the case of a conflict, the rows at the Primary Cluster will tentatively have replicated operations applied against them by a Slave MySQLD.   These conflicting operations will fail prior to commit as their interpreted precondition checks will fail, therefore the conflicting rows will not be modified on the Primary.  One effect of this is that a <span>read from the Primary Cluster only ever returns stable data</span>, as conflicting changes are never committed there.  
In contrast, a read from the Secondary Cluster returns data which has been committed, but may be subject to later 'rollback' via refresh operations from the Primary Cluster.<br /><br />The same stability of reads observation applies to a row change event stream on the Primary Cluster - events received for a single key will be received in the order they were committed, and no later-to-be-rolled-back events will be observed in the stream.<br /><br /><span>Stability of reads from the Secondary Cluster<br /></span><br />If the Secondary Cluster is also receiving reflected applied epoch information back from the Primary then it will know when its epoch <span>x</span> has been applied successfully at the Primary.  Therefore a read of some row <span>y</span> on the Secondary can be considered tentative while Max_Replicated_Epoch(Secondary) &lt; row_epoch(<span>y</span>), but once Max_Replicated_Epoch(Secondary) &gt;= row_epoch(<span>y</span>) then the read can be considered stable.  This is because if the Primary were going to detect a conflict with a Secondary change committed in epoch <span>x</span>, then the refresh events associated with the conflict would be recorded in the same Primary epoch as the notification of the application of epoch <span>x</span>.  So if the Secondary observes the notification of epoch <span>x</span> (and updates Max_Replicated_Epoch accordingly), and row <span>y</span> is not modified in the same epoch transaction, then it is stable.  The time taken to reach stability after a Secondary Cluster commit will be the <span>Cluster conflict window length.</span><br /><br />Perhaps some applications can make better use of the potentially transiently inconsistent Secondary data by categorising their reads from the Secondary as either potentially-inconsistent or stable.  
To do this, they need to maintain Max_replicated_epoch(Secondary) (By listening to row change events on the ndb_apply_status table) and read the NDB$GCI_64 metacolumn when reading row data.  A read from the Secondary is stable if all the NDB$GCI_64 values for all rows read are &lt;= the Secondary's Max_Replicated_Epoch.<br /><br />In the next post (final post I promise!) I will describe the implementation of the transaction dependency tracking in NDB$EPOCH_TRANS, and review the implementation of both NDB$EPOCH and NDB$EPOCH_TRANS.<br /><br /><span>Edit 23/12/11 : Added index</span>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/19/eventual-consistency-in-mysql-cluster-implementation-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using MySQL Cluster to Protect &amp; Scale the HDFS Namenode</title>
		<link>http://blogs.oracle.com/MySQL/entry/using_mysql_cluster_to_protect?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-mysql-cluster-to-protect-scale-the-hdfs-namenode</link>
		<comments>http://blogs.oracle.com/MySQL/entry/using_mysql_cluster_to_protect#comments</comments>
		<pubDate>Mon, 19 Dec 2011 09:51:30 +0000</pubDate>
		<dc:creator>MySQL Community</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[evaluation]]></category>
		<category><![CDATA[guide]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[MySQL Cluster]]></category>

		<guid isPermaLink="false">http://blogs.oracle.com/MySQL/entry/using_mysql_cluster_to_protect</guid>
		<description><![CDATA[The MySQL Cluster product team is always interested to see new and innovative uses of the database. Last week, a team of students at the KTH Royal Institute of Technology in Sweden blogged about their use of MySQL Cluster in creating a scalable and highly available HDFS Namenode.

There are many established use cases of MySQL Cluster in the web, cloud/SaaS, telecoms and even flight control systems – you can see those we are allowed to talk about publicly here.

The KTH team has been working on a project to move all of the metadata from the HDFS / Hadoop namenode to MySQL Cluster. Why did they want to do this, you may ask? Well…
  - The namenode is a single point of failure. If it goes down, so too does the file system.
  - As a single server, the namenode becomes a bottleneck within heavily loaded HDFS / Hadoop deployments. As server resources are consumed and write volumes increase, the system can grind to a halt. (And with data volumes growing around 40% per year, this will only become more common!)

So KTH decided to move metadata storage to MySQL Cluster. Why, you may ask? Well…
  - MySQL Cluster already offered them a replicated, shared-nothing database, distributed across commodity hardware.
  - MySQL Cluster is widely deployed with proven stability.
  - The metadata can be distributed across nodes to scale out capacity, while retaining complete consistency to the clients and eliminating any Single Point of Failure.
  - Linear scaling of operations per second across the cluster, as new namenodes are added.

Access to the cluster is via the MySQL Cluster Connector for Java, providing a NoSQL, Java based ORM with very low latency. You can learn more about this ClusterJ API here.

Of course, the work at KTH is on-going with future optimizations planned – which we will follow with interest.

So how can you determine if MySQL Cluster is the right choice for your new project? We have just updated our MySQL Cluster Evaluation Guide (note, this will directly open the pdf).

This update is based around the latest MySQL Cluster 7.2 Development Release which includes a series of enhancements to further broaden the use case of MySQL Cluster, including:
  - 70x higher JOIN performance with Adaptive Query Localization pushing JOIN operations down to MySQL Cluster’s data
  - Native Key-Value Memcached interface to the cluster allowing schema and schemaless storage
  - New cross-data center scalability enhancements

MySQL Cluster is not a fit for every use-case, but by downloading the Evaluation Guide, you’ll get a clear picture of where MySQL Cluster can be useful to you, and best practices in planning and executing your evaluation.

Let us know of other interesting use-cases in the comments below.]]></description>
			<content:encoded><![CDATA[<!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
  DefSemiHidden="true" DefQFormat="false" DefPriority="99"
  LatentStyleCount="276">
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 4"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 4"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 5"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 5"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 5"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 5"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 5"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 6"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 6"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 6"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 6"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 6"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/>
  <w:LsdException Locked="false" Priority="19" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/>
  <w:LsdException Locked="false" Priority="21" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/>
  <w:LsdException Locked="false" Priority="31" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/>
  <w:LsdException Locked="false" Priority="32" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/>
  <w:LsdException Locked="false" Priority="33" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Book Title"/>
  <w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
  <w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
 </w:LatentStyles>
</xml><![endif]--> <!--[if gte mso 10]>

<![endif]--> <!--StartFragment--> 
  <p><span lang="EN-US">The
<a href="http://mysql.com/products/cluster/">MySQL Cluster</a> product team is always interested
to see new and innovative uses of the database. Last week, a team of students
at the <a href="http://www.kth.se/en">KTH Royal Institute of Technology</a> in Sweden <a href="http://lalith.in/2011/12/15/towards-a-scalable-and-highly-available-namenode/%20">blogged about</a> their use of MySQL Cluster in
creating a </span><span>scalable and highly available
HDFS&nbsp;Namenode.<o:p /></span></p> 
  <p><span>There
are many established use cases of MySQL Cluster in the web, cloud/SaaS,
telecoms and even flight control systems – you can see those we are allowed to
talk about publicly <a href="http://mysql.com/customers/cluster/">here</a>.&nbsp;</span></p> 
  <p><span lang="EN-US"><o:p> </o:p></span><span>The
KTH team has been working on a project to move all of the metadata from the
HDFS / Hadoop namenode to MySQL Cluster.</span><span> </span><span>Why did they want to do this, you may ask? Well:</span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span lang="EN-US">The
namenode is a single point of failure. If it goes down, so too does the file
system.<o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span lang="EN-US">As
a single server, the namenode becomes a bottleneck within heavily loaded HDFS /
Hadoop deployments. As server resources are consumed and write volumes
increase, the system can grind to a halt. (And with data volumes growing
around <a href="http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation">40% per year</a>, this will only become more common!)<o:p /></span></p> 
  <p><span lang="EN-US"><o:p> </o:p></span><span>So
KTH decided to move metadata storage to MySQL Cluster.</span><span> </span><span>Why MySQL Cluster in particular?</span></p> 
  <p><!--[if !supportLists]-->-<span> </span><!--[endif]--><span lang="EN-US">MySQL
Cluster already offered them a </span><span>replicated, shared-nothing
database, distributed across commodity hardware.<o:p /></span></p> 
  <p><!--[if !supportLists]-->-<span> </span><!--[endif]--><span>MySQL Cluster is widely deployed with proven stability.<o:p /></span></p> 
  <p><!--[if !supportLists]-->-<span> </span><!--[endif]--><span>The metadata can be distributed across nodes to scale
out capacity, while retaining complete consistency to the clients and
eliminating any Single Point of Failure.<o:p /></span></p> 
  <p><!--[if !supportLists]-->-<span> </span><!--[endif]--><span>Linear scaling of operations per second across the
cluster, as new namenodes are added.<o:p /></span></p> 
  <p><span><o:p> </o:p></span><span>Access to the cluster is via the <a href="http://dev.mysql.com/doc/ndbapi/en/mccj-using-clusterj.html">MySQL Cluster Connector for Java</a>,
providing a NoSQL, Java-based ORM with very low latency.</span><span> </span><span>You can learn more about this <a href="http://mysql.com/why-mysql/white-papers/mysql_wp_cluster_connector_for_java.php">ClusterJ API here</a>.&nbsp;</span></p> 
  <p><span><o:p> </o:p></span><span>Of course, the work at KTH is ongoing, with future optimizations planned,
which we will follow with interest.</span></p> 
  <p><span><o:p> </o:p></span><span>So how can you determine if MySQL Cluster is the right choice for your
new project?</span><span> </span><span>We have just updated our <a href="http://dev.mysql.com/downloads/MySQL_Cluster_72_DMR_EvaluationGuide.pdf">MySQL Cluster Evaluation Guide</a> (note: this link opens the PDF directly</span><span>).</span></p> 
  <p><span><o:p> </o:p></span><span>This update is based around the latest <a href="http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2.html">MySQL Cluster 7.2 Development
Release</a>, </span><span>which includes a series of enhancements to further broaden the use cases of
MySQL Cluster, including:</span></p> 
  <p><!--[if !supportLists]-->-<span> </span><!--[endif]--><span>70x higher JOIN performance with Adaptive Query
Localization pushing JOIN operations down to MySQL Cluster’s data nodes<o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span>Native Key-Value Memcached interface to the cluster
allowing schema and schemaless storage</span><span lang="EN-US"><o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span>New cross-data center scalability enhancements</span><span lang="EN-US"><o:p /></span></p> 
  <p><span>MySQL Cluster is not a fit for every use-case, but by
downloading the Evaluation Guide, you’ll get a clear picture of where MySQL
Cluster can be useful to you, and best practices in planning and executing your
evaluation.</span></p> 
  <p><span>Let us know of other interesting use-cases in the comments below.</span></p> 
<!--EndFragment-->]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/19/using-mysql-cluster-to-protect-scale-the-hdfs-namenode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual consistency in MySQL Cluster &#8212; implementation part 1</title>
		<link>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-in-mysql-cluster-implementation-part-1</link>
		<comments>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html#comments</comments>
		<pubDate>Thu, 08 Dec 2011 00:20:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=8a38fd69c60b4103d906b0086f7acf58</guid>
		<description><![CDATA[The last post described MySQL Cluster epochs and why they provide a good basis for conflict detection, with a few enhancements required.  This post describes the enhancements.

The following four mechanisms are required to implement conflict detection via epochs:

1. Slaves should 'reflect' information about replicated epochs they have applied
Applied epoch numbers should be included in the Slave Binlog events returning to the originating cluster, in a Binlog position corresponding to the commit time of the replicated epoch transaction relative to Slave local transactions.

2. Masters should maintain a maximum replicated epoch
A cluster should use the reflected epoch information to track which of its epochs has been applied by a Slave cluster.  This will be the maximum of all epochs applied by the Slave.

3. Masters should track commit-time epoch per row
To allow per-row detection of conflicts.

4. Masters should track commit-authorship per row
To differentiate recent epochs due to replication or conflicting activity.

'Reflecting' epoch information and maintaining the maximum replicated epoch

Every epoch transaction in the Binlog contains a special WRITE_ROW event on the mysql.ndb_apply_status table which carries the epoch transaction's epoch number.  This is designed to give an atomically consistent way to determine a Slave cluster's position relative to a Master cluster.  Normally these WRITE_ROW events are applied by the Slave but not logged in the Slave's Binlog, even when --log-slave-updates is ON.  A new MySQLD option, --ndb-log-apply-status, causes WRITE_ROW events applied to the mysql.ndb_apply_status table to be binlogged at a Slave, even when --log-slave-updates is OFF.  
These events are logged with the ServerId of the Slave MySQLD, so that they can be applied on the Master, but will not loop infinitely.

Allowing this applied epoch information to propagate through a Slave Cluster has the following effects:

1. Downstream Clusters become aware of their position relative to all upstream Master clusters, not just their immediate Master cluster.
They gain extra mysql.ndb_apply_status entries for all upstream Masters.

2. Circularly replicating clusters become aware of which of their epochs, and epoch transactions, have been applied to all clusters in the circle.
They gain extra mysql.ndb_apply_status entries for all Binlogging MySQLDs in the loop.

Effect 1 is useful for replication failover with more than two replication-chained clusters where an intermediate cluster is being routed-around (A-&#62;B-&#62;C) -&#62; (A-&#62;C).  Cluster C knows the correct Binlog file and position to resume from on A, without consulting B.

Effect 2 could be used to allow clients to wait until their writes have been fully replicated and are globally visible, a kind of synchronous replication.  More relevantly, effect 2 allows us to maintain a maximum replicated epoch value for detecting conflicts.

The visible result of using --ndb-log-apply-status on a Slave is that the mysql.ndb_apply_status table on the Master contains extra entries for the Binlogging MySQLDs attached to its Cluster.  The maximum replicated epoch is the maximum of these epoch values.

     Cluster 1 Epoch transactions in flight in
           a circular configuration
          (Ignoring Cluster 2 epochs)

                             39       38       37
                       -&#62;----&#62;-----&#62;-----&#62;-----&#62;--
                      /                           \ (Queued epochs 36-26)
            Cluster 1                             Cluster 2
(Queued epochs 23,24) \                           /
                       -&#60;---&#60;------&#60;----&#60;----&#60;----
                            25       26       27

Current epoch = 40
Max replicated epoch = 22

A MySQLD acting as a conflict detecting Slave for a cluster needs to know the attached cluster's maximum replicated epoch for conflict detection.  On Slave start, before the Slave starts applying replicated changes to the Ndb storage engine, it scans the mysql.ndb_apply_status table to find the highest reflected epoch value.  Rows in mysql.ndb_apply_status with server ids in the CHANGE MASTER TO IGNORE_SERVER_IDS list are considered to be local servers, as well as the Slave's own server id, and the maximum replicated epoch is the maximum epoch value from these rows.

@ Slave start

   max_replicated_epoch = SELECT MAX(epoch)
                            FROM mysql.ndb_apply_status
                           WHERE server_id IN @@IGNORE_SERVER_IDS;

Once the Max_replicated_epoch has been initialised at slave start, it is updated as each reflected epoch event (WRITE_ROW event to mysql.ndb_apply_status) arrives and is processed by the Slave SQL thread.  The current Max_replicated_epoch can be seen by issuing the command SHOW STATUS LIKE 'Ndb_slave_max_replicated_epoch';  Note that this is really just a cached copy of the current result of the SELECT MAX(epoch) query from above.  
One subtlety is that the max_replicated_epoch is only changed when the Slave commits an epoch transaction, as it is only at this point that we know for sure that any event committed on the other cluster before the replicated epoch was applied has been handled.

Per row last-modified epoch storage

Each row stored in Ndb has a built-in hidden metadata column called NDB$GCI64.  This column stores the epoch number at which the row was last modified.  For normal system recovery purposes, only the top 32 bits of the 64-bit epoch, called the Global Checkpoint Index or GCI, are used.  NDB$EPOCH needs further bits to be stored per-row.  Epoch values only use a few of the bits in the bottom 32 bits of the epoch, so by default 6 extra bits per row are used to enable a full 64-bit epoch to be stored for each row.  The actual number of bits used can be controlled by a parameter to NDB$EPOCH.  Where some epoch is not fully expressible in the number of bits available, the bottom 32 bits are saturated, which again errs on the side of safety, potentially causing false conflicts, but ensuring no real conflicts are missed.  The ndb_select_all tool has a --gci64 option which shows each row's stored epoch value.

A conflict detecting slave detects conflicts between transactions already committed, whose rows have their commit-time epoch numbers, and incoming operations in an epoch transaction, which are considered to have been committed at the epoch given by the current Maximum Replicated Epoch.  An incoming operation is considered to be in-conflict if the row it affects has a last-committed epoch that is greater than the current Maximum Replicated Epoch.

  in_conflict = (ndb_gci64 &#62; max_replicated_epoch)

In other words, at the time the change was committed on the other Cluster, that other Cluster was only aware of our changes as-of our epoch (max_replicated_epoch).  Therefore it was unaware of any changes committed in more recent epochs.  
If the row being changed has been locally modified since that epoch then there have been concurrent modifications and a conflict has been discovered.

Note that this mechanism is purely based on monitoring serialisation of updates to rows.  No semantic understanding of row data, or the meaning of applied changes, is attempted.  Even if both clusters update some row to contain exactly the same value it will be considered to be a conflict, as the updates were not serialised with respect to each other.

Per row hidden Author metacolumn

One advantage of reusing the row's last-modified epoch number for conflict detection is that it is automatically set on every commit.  However the downside is that when a replicated modification is found to not be in conflict, and is applied, the row's epoch is automatically set to the current value at commit time as normal.  By definition, the current epoch value is always greater than the maximum replicated epoch, and so if a further replicated modification to the same row were to arrive, it would find the row's epoch to be higher than the current maximum replicated epoch, and detect a false conflict.

In theory we could consider the current maximum replicated epoch to be the row's commit time epoch, but as the per-row epoch is used for other more critical DB recovery purposes it's not safe to abuse it in this way.  Instead we use the observation that if we found a previous row update from some other cluster to be not-in-conflict, then further updates from it are also not-in-conflict.

To detect this, a new hidden metadata column is introduced called NDB$AUTHOR.  This column is set to zero when a row is modified by any unmodified NdbApi client, including MySQLD, but when a row is modified by the MySQLD Slave SQL thread, it is set to one.  More generally, NDB$AUTHOR could be set to a non-zero identifier of which other cluster sourced an accepted change.  
Just setting to one limits us to having one other cluster originating potentially conflicting changes.  The ndb_select_all tool has a --author option which shows each row's stored Author value.

By extending the conflict detecting function to examine the NDB$AUTHOR value, we avoid the problem of falsely detecting conflicts when applying consecutive replicated changes.

  in_conflict = (ndb$author != change_author) &#38;&#38; (ndb_gci64 &#62; max_replicated_epoch)

We are currently just using 1 to mean 'other author', so this simplifies to:

  in_conflict = (ndb$author != 1) &#38;&#38; (ndb_gci64 &#62; max_replicated_epoch)
              = (ndb$author == 0) &#38;&#38; (ndb_gci64 &#62; max_replicated_epoch)

This conflict detection function is encoded in an Ndb interpreted program and attached to the replicated DELETE and UPDATE NdbApi operations so that it can be quickly and atomically executed at the Ndb data nodes as a predicate prior to applying the operation.

Ndb binlog row event ordering and false conflicts

The happened-before relationship between reflected epoch events (WRITE_ROW to mysql.ndb_apply_status) and incoming row events is used to determine whether a conflict has occurred.  As described in the last post, Ndb offers limited ordering guarantees on the row events within an epoch transaction.  The only guarantee is that multiple changes to the same row will be recorded in the order they committed.  This implies that the relative ordering of the reflected epoch WRITE_ROW event, on some row in mysql.ndb_apply_status, and other row events on other tables sharing the same epoch transaction is meaningless.  The only ordering guarantees between different rows exist at epoch boundaries.

This means that if we see a reflected epoch WRITE_ROW event somewhere in replicated epoch j, then we can only safely assume that this happened before incoming row events in epoch j+1 and later.  
The row events appearing before and after the reflected epoch WRITE_ROW event in epoch j may have committed before or after the reflected epoch event.

The relaxed relative ordering gives us reduced precision in determining happened-before, and to be safe, we must err on the side of assuming that a conflict exists rather than that it does not.  Consider a Master committing a change to row X, recorded in epoch N.  This is then applied on the Slave in Slave epoch S.  If the Slave then commits a local change affecting the same row X in the same epoch S, this will be returned to the Master in the same Slave epoch transaction, and the Master will be unable to determine whether it occurred before or after its original write to X, so must assume that it occurred before and is therefore in conflict.  If the Slave had committed its change in epoch S+1 or later, the happened-before relationship would be clear and the change would not be considered in conflict.

These potential false conflicts are the price paid here for the lack of fine grained event ordering in the Ndb Binlog.

I'm lost

There's been a lot of information, or at least a lot of words.  
Let's summarise how NDB$EPOCH and NDB$EPOCH_TRANS detect row conflicts, step by step:

@Cluster A
- Transactions modify rows, automatically setting their hidden NDB$GCI64 column to the current epoch and their NDB$AUTHOR column to 0.
- Binlogging MySQLDs record modified rows in epoch transactions in their Binlogs, together with MySQLD generated mysql.ndb_apply_status WRITE_ROW events.

@Cluster B
- Slave MySQLDs apply replicated epoch transactions along with their generated mysql.ndb_apply_status WRITE_ROW events.
- Other clients of Cluster B commit transactions against the same data.
- Binlogging MySQLDs 'reflect' the applied-replicated epoch information by recording the mysql.ndb_apply_status WRITE_ROW events in their Binlogs as a result of --ndb-log-apply-status.
- Binlogging MySQLDs also record the row changes made by local clients.

@Cluster A
- Slave MySQLDs track the incoming reflected epoch mysql.ndb_apply_status WRITE_ROW events to maintain their ndb_slave_max_replicated_epoch variables.
- Slave MySQLDs attach NdbApi interpreted programs to UPDATE and DELETE operations as they are applied to the database, comparing the row's stored NDB$GCI64 and NDB$AUTHOR columns with constant values supplied in the program.
- If there are no conflicts, the UPDATE and DELETE operations are applied, and the row's NDB$AUTHOR columns are set to one, indicating a successful Slave modification.
- If there are conflicts then conflict handling for the conflicting rows begins.

Now does that make any sense?  Assuming it does, then next we look at how detected conflicts are handled.

Once again, another wordy endurance test and we're not finished.  Surely the end must be near?]]></description>
			<content:encoded><![CDATA[The last <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html">post</a> described MySQL Cluster epochs and why they provide a good basis for conflict detection, with a few enhancements required.  This post describes the enhancements.<br /><br />The following four mechanisms are required to implement conflict detection via epochs :<br /><ol><li><span>Slaves should 'reflect' information about replicated epochs they have applied</span><br />Applied epoch numbers should be included in the Slave Binlog events returning to the originating cluster, in a Binlog position corresponding to the commit time of the replicated epoch transaction relative to Slave local transactions.<br /></li><li><span>Masters should maintain a maximum replicated epoch</span><br />A cluster should use the reflected epoch information to track which of its epochs has been applied by a Slave cluster.  This will be the maximum of all epochs applied by the Slave.<br /></li><li><span>Masters should track commit-time epoch per row</span><br />To allow per-row detection of conflicts</li><li><span>Masters should track commit-authorship per row</span><br />To differentiate recent epochs due to replication or conflicting activity.<br /></li></ol><br /><span>'Reflecting' epoch information and maintaining the maximum replicated epoch</span><br /><br />Every epoch transaction in the Binlog contains a special WRITE_ROW event on the mysql.ndb_apply_status table which carries the epoch transaction's epoch number.  This is designed to give an atomically consistent way to determine a Slave cluster's position relative to a Master cluster.  Normally these WRITE_ROW events are applied by the Slave but not logged in the Slave's Binlog, even when <a href="http://dev.mysql.com/doc/refman/5.1/en/replication-options-slave.html#option_mysqld_log-slave-updates">--log-slave-updates</a> is ON.  
A new MySQLD option, <a href="http://dev.mysql.com/doc/mysql-cluster-excerpt/5.1/en/mysql-cluster-program-options-mysqld.html">--ndb-log-apply-status</a> causes WRITE_ROW events applied to the mysql.ndb_apply_status table to be binlogged at a Slave, even when --log-slave-updates is OFF.  These events are logged with the ServerId of the Slave MySQLD, so that they can be applied on the Master, but will not loop infinitely.<br /><br />Allowing this applied epoch information to propagate through a Slave Cluster has the following effects :<br /><ol><li>Downstream Clusters become aware of their position relative to all upstream Master clusters, not just their immediate Master cluster.<br /><span>They gain extra mysql.ndb_apply_status entries for all upstream Masters.</span></li><li>Circularly replicating clusters become aware of which of their epochs, and epoch transactions, have been applied to all clusters in the circle.<br /><span>They gain extra mysql.ndb_apply_status entries for all Binlogging MySQLDs in the loop</span><br /></li></ol><br />Effect 1 is useful for replication failover with more than two replication-chained clusters where an intermediate cluster is being routed-around (A-&gt;B-&gt;C) -&gt; (A-&gt;C).   Cluster C knows the correct Binlog file and position to resume from on A, without consulting B.<br /><br />Effect 2 could be used to allow clients to wait until their writes have been fully replicated and are globally visible, a kind of synchronous replication.  More relevantly, effect 2 allows us to maintain a maximum replicated epoch value for detecting conflicts.<br /><br />The visible result of using --ndb-log-apply-status on a Slave is that the mysql.ndb_apply_status table on the Master contains extra entries for the Binlogging MySQLDs attached to its Cluster.  
The maximum replicated epoch is the maximum of these epoch values.<br /><br /><pre> <br />     Cluster 1 Epoch transactions in flight in<br />           a circular configuration<br />          (Ignoring Cluster 2 epochs)<br /><br />                             39       38       37 <br />                       -&gt;----&gt;-----&gt;-----&gt;-----&gt;--<br />                      /                           \ (Queued epochs 36-26)<br />            Cluster 1                             Cluster 2<br />(Queued epochs 23,24) \                           /<br />                       -&lt;---&lt;------&lt;----&lt;----&lt;----<br />                            25       26       27<br /><br />Current epoch = 40<br />Max replicated epoch = 22                <br /></pre><br /><br />A MySQLD acting as a conflict detecting Slave for a cluster needs to know the attached cluster's maximum replicated epoch for conflict detection.  On Slave start, before the Slave starts applying replicated changes to the Ndb storage engine, it scans the <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-schema.html">mysql.ndb_apply_status</a> table to find the highest reflected epoch value.   Rows in mysql.ndb_apply_status with server ids in the <a href="http://dev.mysql.com/doc/refman/5.1/en/change-master-to.html">CHANGE MASTER</a> TO IGNORE_SERVER_IDS list are considered to be local servers, as well as the Slave's own server id, and the maximum replicated epoch is the maximum epoch value from these rows.<br /><br /><pre><br />@ Slave start<br /><br />   max_replicated_epoch = SELECT MAX(epoch)<br />                            FROM mysql.ndb_apply_status<br />                           WHERE server_id IN @@IGNORE_SERVER_IDS;<br /><br /></pre><br /><br />Once the Max_replicated_epoch has been initialised at slave start, it is updated as each reflected epoch event (WRITE_ROW event to mysql.ndb_apply_status) arrives and is processed by the Slave SQL thread.  
The current Max_replicated_epoch can be seen by issuing the command SHOW STATUS LIKE 'Ndb_slave_max_replicated_epoch'.  Note that this is really just a cached copy of the current result of the SELECT MAX(epoch) query from above.  One subtlety is that the max_replicated_epoch is only changed when the Slave commits an epoch transaction, as it is only at this point that we know for sure that any event committed on the other cluster before the replicated epoch was applied has been handled.<br /><br /><span>Per row last-modified epoch storage</span><br /><br />Each row stored in Ndb has a built-in hidden metadata column called NDB$GCI64.  This column stores the epoch number at which the row was last modified.  For normal system recovery purposes, only the top 32 bits of the 64 bit epoch, called the Global Checkpoint Index or GCI, are used.  NDB$EPOCH needs further bits to be stored per-row.  Epoch values only use a few of the bits in the bottom 32 bits of the epoch, so by default 6 extra bits per row are used to enable a full 64 bit epoch to be stored for each row.  The actual number of bits used can be controlled by a parameter to NDB$EPOCH.  Where some epoch is not fully expressible in the number of bits available, the bottom 32 bits are saturated, which again errs on the side of safety, potentially causing false conflicts, but ensuring no real conflicts are missed.  The <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-select-all.html">ndb_select_all</a> tool has a --gci64 option which shows each row's stored epoch value.<br /><br />A conflict detecting slave detects conflicts between transactions already committed, whose rows have their commit-time epoch numbers, and incoming operations in an epoch transaction, which are considered to have been committed at the epoch given by the current Maximum Replicated Epoch.  
An incoming operation is considered to be in-conflict if the row it affects has a last-committed epoch that is greater than the current Maximum Replicated Epoch.<br /><br /><pre>  in_conflict = (ndb_gci64 &gt; max_replicated_epoch)<br /></pre><br /><br />In other words, at the time the change was committed on the other Cluster, that other Cluster was only aware of our changes as-of our epoch (max_replicated_epoch).  Therefore it was unaware of any changes committed in more recent epochs.  If the row being changed has been locally modified since that epoch then there have been concurrent modifications and a conflict has been discovered.<br /><br />Note that this mechanism is purely based on monitoring serialisation of updates to rows.  No semantic understanding of row data, or the meaning of applied changes is attempted.  Even if both clusters update some row to contain exactly the same value it will be considered to be a conflict, as the updates were not serialised with respect to each other.<br /><br /><span>Per row hidden Author metacolumn</span><br /><br />One advantage of reusing the row's last-modified epoch number for conflict detection is that it is automatically set on every commit.  However the downside is that when a replicated modification is found to <span>not</span> be in conflict, and is applied, the row's epoch is automatically set to the current value at commit time as normal.  By definition, the current epoch value is always greater than the maximum replicated epoch, and so if a further replicated modification to the same row were to arrive, it would find the row's epoch to be higher than the current maximum replicated epoch, and detect a false conflict.<br /><br />In theory we could consider the current maximum replicated epoch to be the row's commit time epoch, but as the per-row epoch is used for other more critical DB recovery purposes it's not safe to abuse it in this way.  
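<br /><br />To make the false-conflict problem concrete, here is a small Python sketch of the epoch-only check.  The variable names are hypothetical, and the numbers are borrowed from the circular replication diagram earlier (current epoch 40, maximum replicated epoch 22).<br />
<pre>
def in_conflict(row_epoch, max_replicated_epoch):
    """Epoch-only check: the row was modified locally after the last
    epoch that the other cluster is known to have seen."""
    return row_epoch > max_replicated_epoch


max_replicated_epoch = 22
current_epoch = 40
row_epoch = 20   # last local commit; already replicated to the other cluster

# First replicated update: not in conflict, so it is applied.  Applying it
# stamps the row with the current epoch, exactly like a local write.
first = in_conflict(row_epoch, max_replicated_epoch)
row_epoch = current_epoch

# Second replicated update to the same row: flagged as a conflict even
# though no local client touched the row in between.
second = in_conflict(row_epoch, max_replicated_epoch)
print(first, second)
</pre>
The second result is the false conflict: the only writer since epoch 20 was the Slave itself.<br /><br />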
Instead we use the observation that if we found a previous row update from some other cluster to be not-in-conflict, then further updates from it are also not-in-conflict.<br /><br />To detect this, a new hidden metadata column is introduced called NDB$AUTHOR.  This column is set to zero when a row is modified by any unmodified NdbApi client, including MySQLD, but when a row is modified by the MySQLD Slave SQL thread, it is set to one.  More generally, NDB$AUTHOR could be set to a non-zero identifier of which other cluster sourced an accepted change.  Just setting to one limits us to having one other cluster originating potentially conflicting changes.  The <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-select-all.html">ndb_select_all</a> tool has a --author option which shows each row's stored Author value.<br /><br />By extending the conflict detecting function to examine the NDB$AUTHOR value, we avoid the problem of falsely detecting conflicts when applying consecutive replicated changes.<br /><pre>  in_conflict = (ndb$author != change_author) &amp;&amp; (ndb_gci64 &gt; max_replicated_epoch)<br /></pre><br /><br />We are currently just using 1 to mean 'other author', so this simplifies to :<br /><pre><br />  in_conflict = (ndb$author != 1) &amp;&amp; (ndb_gci64 &gt; max_replicated_epoch)<br /><br />              = (ndb$author == 0) &amp;&amp; (ndb_gci64 &gt; max_replicated_epoch)<br /></pre><br /><br />This conflict detection function is encoded in an <a href="http://dev.mysql.com/doc/ndbapi/en/ndb-ndbinterpretedcode.html">Ndb interpreted program</a> and attached to the replicated DELETE and UPDATE NdbApi operations so that it can be quickly and atomically executed at the Ndb data nodes as a predicate prior to applying the operation.<br /><br /><span>Ndb binlog row event ordering and false conflicts</span><br /><br />The happened-before relationship between reflected epoch events (WRITE_ROW to mysql.ndb_apply_status) and incoming row 
events is used to determine whether a conflict has occurred.   As described in the last <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html">post</a>, Ndb offers limited ordering guarantees on the row events within an epoch transaction.  The only guarantee is that multiple changes to the same row will be recorded in the order they committed.  This implies that the relative ordering of the reflected epoch WRITE_ROW event, on some row in mysql.ndb_apply_status, and other row events on other tables sharing the same epoch transaction is meaningless.  The only ordering guarantees between different rows exist at epoch boundaries.<br /><br />This means that if we see a reflected epoch WRITE_ROW event somewhere in replicated epoch <span>j</span>, then we can only safely assume that this happened before incoming row events in epoch <span>j+1</span> and later.  The row events appearing before and after the reflected epoch WRITE_ROW event in epoch<span> j </span>may have committed before or after the reflected epoch event.<br /><br />The relaxed relative ordering gives us reduced precision in determining happened-before, and to be safe, we must err on the side of assuming that a conflict exists rather than that it does not.  Consider a Master committing a change to row <span>X</span>, recorded in epoch <span>N</span>.  This is then applied on the Slave in Slave epoch <span>S</span>.  If the Slave then commits a local change affecting the same row <span>X</span> in the same epoch <span>S</span>, this will be returned to the Master in the same Slave epoch transaction, and the Master will be unable to determine whether it occurred before or after its original write to <span>X</span>, so must assume that it occurred before and is therefore in conflict.  
If the Slave had committed its change in epoch <span>S+1</span> or later, the happened-before relationship would be clear and the change would not be considered in conflict.<br /><br />These potential false conflicts are the price paid here for the lack of fine grained event ordering in the Ndb Binlog.<br /><br /><span>I'm lost</span><br /><br />There's been a lot of information, or at least a lot of words.  Let's summarise how NDB$EPOCH and NDB$EPOCH_TRANS detect row conflicts by following changes around the replication loop :<br /><br /><ul><li>@Cluster A<br />Transactions modify rows, automatically setting their hidden NDB$GCI64 column to the current epoch and their NDB$AUTHOR column to 0<br /><br />Binlogging MySQLDs record modified rows in epoch transactions in their Binlogs, together with MySQLD generated mysql.ndb_apply_status WRITE_ROW events<br /><br /></li><li>@Cluster B<br />Slave MySQLDs apply replicated epoch transactions along with their generated mysql.ndb_apply_status WRITE_ROW events<br /><br />Other clients of Cluster B commit transactions against the same data.<br /><br />Binlogging MySQLDs 'reflect' the applied-replicated epoch information by recording the mysql.ndb_apply_status WRITE_ROW events in their Binlogs as a result of --ndb-log-apply-status.<br /><br />Binlogging MySQLDs also record the row changes made by local clients.<br /><br /></li><li>@Cluster A<br />Slave MySQLDs track the incoming reflected epoch mysql.ndb_apply_status WRITE_ROW events to maintain their ndb_slave_max_replicated_epoch variables<br /><br />Slave MySQLDs attach NdbApi interpreted programs to UPDATE and DELETE operations as they are applied to the database, comparing the row's stored NDB$GCI64 and NDB$AUTHOR columns with constant values supplied in the program.<br /><br />If there are no conflicts, the UPDATE and DELETE operations are applied, and the rows' NDB$AUTHOR columns are set to one, indicating a successful Slave modification<br /><br />If there are conflicts then conflict handling for the 
conflicting rows begins.<br /></li></ul><br />Now does that make any sense?  Assuming it does, then next we look at how detected conflicts are handled.<br /><br />Once again, another wordy endurance test and we're not finished.  Surely the end must be near?]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/08/eventual-consistency-in-mysql-cluster-implementation-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual Consistency in MySQL Cluster &#8212; using epochs</title>
		<link>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-in-mysql-cluster-using-epochs</link>
		<comments>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html#comments</comments>
		<pubDate>Wed, 07 Dec 2011 14:28:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[parallel]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=5674386841ace9ff691efcc1e11d1f14</guid>
		<description><![CDATA[Before getting to the details of how eventual consistency is implemented, we need to look at epochs.  Ndb Cluster maintains an internal distributed logical clock known as the epoch, represented as a 64 bit number.  This epoch serves a number of internal functions, and is atomically advanced across all data nodes.]]></description>
			<content:encoded><![CDATA[Before getting to the details of how eventual consistency is implemented, we need to look at epochs.  Ndb Cluster maintains an internal distributed logical clock known as the epoch, represented as a 64 bit number.  This epoch serves a number of internal functions, and is atomically advanced across all data nodes.<br /><br /><span>Epochs and consistent distributed state</span><br /><br />Ndb is a parallel database, with multiple internal transaction coordinator components starting, executing and committing transactions against rows stored in different data nodes.  Concurrent transactions only interact where they attempt to lock the same row.  This design minimises unnecessary system-wide synchronisation, enabling linear scalability of reads and writes.<br /><br />The stream of changes made to rows stored at a data node is written to a local Redo log for node and system recovery.  The change stream is also published to NdbApi event listeners, including MySQLD servers recording Binlogs.  
Each node's change stream contains the row changes it was involved in, as committed by multiple transactions, and coordinated by multiple independent transaction coordinators, interleaved in a partial order.<br /><br /><pre>  Incoming independent transactions<br />   affecting multiple rows<br /><br />     T3         T4         T7<br />     T1         T2         T5<br /><br />      |         |          |<br />      V         V          V<br /><br />   --------  --------  --------<br />   |  1   |  |  2   |  |  3   |<br />   |  TC  |  |  TC  |  |  TC  |   Data nodes with multiple<br />   |      |--|      |--|      |   transaction coordinators<br />   |------|  |------|  |------|   acting on data stored in<br />   |      |  |      |  |      |       different nodes<br />   | DATA |  | DATA |  | DATA |<br />   --------  --------  --------<br /><br />      |         |          |<br />      V         V          V<br /><br />     t4        t4          t3<br />     t1        t7          t2<br />     t2        t1          t7<br />               t5<br /><br />   Outgoing row change event<br />    streams by causing<br />       transaction<br /></pre><br /><br />These row event streams are generated independently by each data node in a cluster, but to be useful they need to be correlated together.  For system recovery from a crash, the data nodes need to recover to a cluster-wide consistent state.  A state which contains only whole transactions, and a state which, logically at least, existed at some point in time.  This correlation could be done by an analysis of the transaction ids and row dependencies of each recorded row change to determine a valid order for the merged event streams, but this would add significant overhead. Instead, the Cluster uses a distributed logical clock known as the epoch to group large sets of committed transactions together.<br /><br />Each epoch contains zero or more committed transactions.  Each committed transaction is in only one epoch.  
The epoch clock advances periodically, every 100 milliseconds by <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-timebetweenepochs">default</a>.  When it is time for a new epoch to start, a distributed protocol known as the Global Commit Protocol (GCP) results in all of the transaction coordinators in the Cluster agreeing on a point of time in the flow of committing transactions at which to change epoch.  This epoch boundary, between the commit of the last transaction in epoch <span>n</span>, and the commit of the first transaction in epoch <span>n+1</span>, is a cluster-wide consistent point in time.  Obtaining this consistent point in time requires cluster-wide synchronisation, between all transaction coordinators, but it need only happen periodically.<br /><br />Furthermore, each node ensures that all events for epoch <span>n</span> are published before any events for epoch <span>n+1</span> are published.  Effectively the event streams are sorted by epoch number, and the first time a new epoch is encountered signifies a precise epoch boundary.<br /><br /><pre> Incoming independent transactions<br /><br />     T3         T4         T7<br />     T1         T2         T5<br /><br />      |         |          |<br />      V         V          V<br /><br />   --------  --------  --------<br />   |  1   |  |  2   |  |  3   |<br />   |  TC  |  |  TC  |  |  TC  |   Data nodes with multiple<br />   |      |--|      |--|      |   transaction coordinators<br />   |------|  |------|  |------|   acting on data stored in<br />   |      |  |      |  |      |      different nodes<br />   | DATA |  | DATA |  | DATA |<br />   --------  --------  --------<br /><br />      |         |          |<br />      V         V          V<br /><br />    t4(22)    t4(22)      t3(22)            Epoch 22<br />    ......    ......      
......<br />    t1(23)    t7(23)      t2(23)            Epoch 23<br />    t2(23)    t1(23)      t7(23)<br />              ......<br />              t5(24)                        Epoch 24<br /><br />    Outgoing row change event<br />    streams by causing transaction<br />    with epoch numbers in ()<br /><br /></pre><br /><br />When these independent streams are merge-sorted by epoch number we get a unified change stream.  Multiple possible orderings can result.<br />One Partial ordering is shown here :<br /><br /><pre>      Events      Transactions<br />                 contained in epoch<br /><br />     t4(22)<br />     t4(22)      {T4,T3}<br />     t3(22)<br />   <br />     ......<br /><br />     t1(23)<br />     t2(23)<br />     t7(23)<br />     t1(23)      {T1, T2, T7}<br />     t2(23)<br />     t7(23)<br /><br />     ......<br /><br />     t5(24)      {T5}<br /><br /></pre><br /><br />Note that we can state from this that T4 -&gt; T1 (Happened before), and T1 -&gt; T5.  However we cannot say whether T4 -&gt; T3 or T3 -&gt; T4.  In epoch 23 we see that the row events resulting from T1, T2 and T7 are interleaved.<br /><br />Epoch boundaries act as markers in the flow of row events generated by each node, which are then used as consistent points to recover to.  Epoch boundaries also allow a single system wide unified transaction log to be generated from each node's row change stream, by merge-sorting the per-node row change streams by epoch number.  Note that the order of events within an epoch is still not tightly constrained. 
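<br /><br />The merge-sort by epoch can be sketched with Python's standard heapq.merge.  This is only an illustration: the three lists are the per-node streams from the diagram above written as (epoch, transaction) pairs, and merge_by_epoch is a name invented for the sketch, not anything MySQLD exposes.<br />
<pre>
import heapq

# Per-node row event streams from the diagram, as (epoch, transaction) pairs.
node1 = [(22, "t4"), (23, "t1"), (23, "t2")]
node2 = [(22, "t4"), (23, "t7"), (23, "t1"), (24, "t5")]
node3 = [(22, "t3"), (23, "t2"), (23, "t7")]


def merge_by_epoch(*streams):
    """Merge already epoch-sorted streams, comparing on epoch number only."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


merged = merge_by_epoch(node1, node2, node3)
for epoch, transaction in merged:
    print(epoch, transaction)
</pre>
Because only the epoch number is compared, events within one epoch come out in whatever order the merge happens to pick, which is one valid interleaving among many.<br /><br />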
As concurrent transactions can only interact via row locks, the order of events on a single row (Table and Primary key value) signifies transaction commit order, but there is by definition no order between transactions affecting independent row sets.<br /><br />To record a Binlog of Ndb row changes, MySQLD listens to the row change streams arriving from each data node, and merge-sorts them by epoch into a single, epoch-ordered stream.  When all events for a given epoch have been received, MySQLD records a single Binlog transaction containing all row events for that epoch.  This Binlog transaction is referred to as an 'Epoch transaction' as it describes all row changes that occurred in an epoch.<br /><br /><span>Epoch transactions in the Binlog</span><br /><br />Epoch transactions in the Binlog have some interesting properties :<br /><ul><li><span>Efficiency</span> : They can be considered a kind of Binlog group commit, where multiple user transactions are recorded in one Binlog (epoch) transaction.  
As an epoch normally contains 100 milliseconds of row changes from a cluster, this is a significant amortisation.<br /></li><li><span>Consistency</span> : Each epoch transaction contains the row operations which occurred when moving the cluster from epoch boundary consistent state A to epoch boundary consistent state B<br />Therefore, when applied as a transaction by a slave, the slave will atomically move from consistent state A to consistent state B</li><li><span>Inter-epoch ordering</span> : Any row event recorded in epoch <span>n+1</span> logically happened after every row event in epoch <span>n</span></li><li><span>Intra-epoch disorder</span> : Any two row events recorded in epoch <span>n</span>, affecting different rows, may have happened in any order.</li><li><span>Intra-epoch key-order</span> : Any two row events recorded in epoch <span>n</span>, affecting the same row, happened in the order they are recorded.</li></ul><br />The ordering properties show that epochs give only a partial order, enough to subdivide the row change streams into self-consistent chunks.  Within an epoch, row changes may be interleaved in any way, except that multiple changes to the same row will be recorded in the order they were committed.<br /><br />Each epoch transaction contains the row changes for a particular epoch, and that information is recorded in the epoch transaction itself, as an extra WRITE_ROW event on a system table called <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-schema.html">mysql.ndb_apply_status</a>.  This WRITE_ROW event contains the binlogging MySQLD's server id and the epoch number.  This event is added so that it will be atomically applied by the Slave along with the rest of the row changes in the epoch transaction, giving an atomically reliable indicator of the replication 'position' of the Slave relative to the Master Cluster in terms of epoch number.  
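<br /><br />As a sketch of how an epoch-ordered event stream turns into epoch transactions, the Python below groups events by epoch and adds the mysql.ndb_apply_status marker to each group.  The tuple layout, the function name and the choice to put the marker first are assumptions of this illustration rather than the server's actual data structures.<br />
<pre>
from itertools import groupby


def epoch_transactions(events, server_id):
    """Group an epoch-ordered list of (epoch, row_event) pairs into one
    transaction per epoch, each led by an apply-status marker."""
    for epoch, group in groupby(events, key=lambda event: event[0]):
        marker = ("WRITE_ROW", "mysql.ndb_apply_status", server_id, epoch)
        yield [marker] + [row_event for _, row_event in group]


# Hypothetical events; the strings stand in for real row events.
events = [
    (6998, "WRITE_ROW t1"),
    (6998, "UPDATE_ROW t1"),
    (6999, "DELETE_ROW t2"),
]
transactions = list(epoch_transactions(events, server_id=4))
for transaction in transactions:
    print(transaction)
</pre>
Each resulting list is applied as one unit, so the marker row and the row changes of an epoch become visible together.<br /><br />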
As the epoch number is abstracted from the details of a particular Master MySQLD's binlog files and offsets, it can be used to <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-replication-failover.html">failover</a> to an alternative Master.<br /><br />We can visualise a MySQL Cluster Binlog as looking something like this.  Each Binlog transaction contains one 'artificially generated' WRITE_ROW event at the start, and then RBR row events for all row changes that occurred in that epoch.<br /><br /><pre>    BEGIN<br />    WRITE_ROW mysql.ndb_apply_status server_id=4, epoch=6998<br />    WRITE_ROW ...<br />    UPDATE_ROW ...<br />    DELETE_ROW ...<br />    ...<br />    COMMIT # Consistent state of the database<br /><br />    BEGIN<br />    WRITE_ROW mysql.ndb_apply_status server_id=4, epoch=6999<br />    ...<br />    COMMIT # Consistent state of the database<br /><br />    BEGIN<br />    WRITE_ROW mysql.ndb_apply_status server_id=4, epoch=7000<br />    ...<br />    COMMIT # Consistent state of the database<br />    ...<br /><br /></pre><br />A series of epoch transactions, each with a special WRITE_ROW event for recording the epoch on the Slave.  You can see this structure using the <a href="http://dev.mysql.com/doc/refman/5.1/en/mysqlbinlog.html">mysqlbinlog</a> tool with the --verbose option.<br /><br /><span>Rows tagged with last-commit epoch</span><br /><br />Each row in a MySQL Cluster stores a hidden metadata column which contains the epoch at which a write to the row was last committed.  This information is used internally by the Cluster during node recovery and other operations.  The <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-programs-ndb-select-all.html">ndb_select_all</a> tool can be used to see the epoch numbers for rows in a table by supplying the --gci or --gci64 options.  
Note that the per-row epoch is not a row version, as two updates to a row in reasonably quick succession will have the same commit epoch.<br /><br /><span>Epochs and eventual consistency</span><br /><br />Reviewing epochs from the point of view of my previous posts on eventual consistency we see that :<br /><ul><li>Epochs provide an incrementing logical clock</li><li>Epochs are recorded in the Binlog, and therefore shipped to Slaves</li><li>Epoch boundaries imply happened-before relationships between events before and after them in the Binlog</li></ul><br />The properties mean that epochs are almost perfect for monitoring conflict windows in an active-active circular replication setup, with only a few enhancements required.<br /><br />I'll describe these enhancements in the next post.]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/07/eventual-consistency-in-mysql-cluster-using-epochs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Speaking at Oracle UK User Group conference</title>
		<link>http://messagepassing.blogspot.com/2011/11/speaking-at-oracle-uk-user-group.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=speaking-at-oracle-uk-user-group-conference</link>
		<comments>http://messagepassing.blogspot.com/2011/11/speaking-at-oracle-uk-user-group.html#comments</comments>
		<pubDate>Fri, 25 Nov 2011 12:02:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[talking]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=93b7490550c44fa8ede7ce464ddde0e9</guid>
		<description><![CDATA[I will be speaking in the MySQL track of the UK Oracle User Group conference on 5th December in Birmingham UK.  The title of the session is "Building Highly Available and Scalable, Real Time Services with MySQL Cluster" - full details here.  I'm not a regular conference attendee, never mind speaker.  However I'm looking forward to meeting current and potential MySQL users, and also attending some of the talks in the MySQL and other tracks.  Maybe I can learn something about RAC, or Exadata?  If you are attending and want to talk about MySQL or MySQL Cluster then please track me down and say hello.  Note that this is the first picture I have included in 3 years of posts - maybe I shouldn't wait 3 years for the next one?]]></description>
			<content:encoded><![CDATA[<a href="http://2011.ukoug.org/"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 180px; height: 150px;" src="http://3.bp.blogspot.com/-tY5XvNC2REg/Ts-HPbOAlmI/AAAAAAAAAAQ/e_PHjNNY3LQ/s320/i-am-speaking-at-ukoug-2011-xsmall-copy.gif" alt="" id="BLOGGER_PHOTO_ID_5678906354211788386" border="0" /></a><br />I will be speaking in the MySQL track of the <a href="http://2011.ukoug.org/">UK Oracle User Group conference</a> on 5th December in Birmingham UK.  The title of the session is "Building Highly Available and Scalable, Real Time Services with MySQL Cluster" - full details <a href="http://2011.ukoug.org/default.asp?p=8850&amp;dlgact=shwprs&amp;prs_prsid=6385&amp;day_dayid=56">here</a>.<br /><br />I'm not a regular conference attendee, never mind speaker.  However I'm looking forward to meeting current and potential MySQL users, and also attending some of the talks in the MySQL and other tracks.  Maybe I can learn something about RAC, or Exadata?<br /><br />If you are attending and want to talk about MySQL or MySQL Cluster then please track me down and say hello.<br /><br /><span>Note that this is the first picture I have included in 3 years of posts - maybe I shouldn't wait 3 years for the next one?</span>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/11/25/speaking-at-oracle-uk-user-group-conference/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL Cluster, and NoSQL</title>
		<link>http://blogs.oracle.com/MySQL/entry/mysql_cluster_and_nosql?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=mysql-cluster-and-nosql</link>
		<comments>http://blogs.oracle.com/MySQL/entry/mysql_cluster_and_nosql#comments</comments>
		<pubDate>Wed, 02 Nov 2011 08:28:22 +0000</pubDate>
		<dc:creator>MySQL Community</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[MySQL Cluster]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Podcast]]></category>

		<guid isPermaLink="false">http://blogs.oracle.com/MySQL/entry/mysql_cluster_and_nosql</guid>
		<description><![CDATA[Those are the topics we cover in the latest episode of our “Meet The MySQL Experts” podcast. Mat Keep and Bernd Ocklin talk about new database requirements, and walk us through what's new in the second Development Milestone Release of MySQL Cluster 7.2, including impressive performance improvements, new NoSQL access via memcached, cross data center scalability, and more... Enjoy the podcast!]]></description>
			<content:encoded><![CDATA[
  <p>  <span>Those are the topics we cover in the latest episode
</span> <span>of our <a href="http://feeds.feedburner.com/MeetTheMysqlExperts">“Meet The MySQL Experts”
podcast.</a></span> </p> 
  <p><span>Mat Keep and Bernd Ocklin talk
about new database requirements, and walk us through what's new in the second
Development Milestone Release of MySQL Cluster 7.2, including impressive
performance improvements, new NoSQL access via memcached, cross data center
scalability, and more...</span></p> 
  <p><span><a href="http://streaming.oracle.com/ebn/podcasts/media/11013471_MySQL_110111.mp3">Enjoy
the podcast!</a></span></p> 
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/11/02/mysql-cluster-and-nosql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual Consistency &#8212; detecting conflicts</title>
		<link>http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-detecting-conflicts</link>
		<comments>http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html#comments</comments>
		<pubDate>Thu, 20 Oct 2011 00:05:24 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=8ab1656467084713f4ba3a8e20818f2d</guid>
		<description><![CDATA[In my previous posts I introduced two new conflict detection functions, NDB$EPOCH and NDB$EPOCH_TRANS, without explaining how these functions actually detect conflicts. To simplify the explanation I'll initially consider two circularly replicating MySQL Servers, A and B, rather than two replicating Clusters, but the principles are the same.

Commit ordering

Avoiding conflicts requires that data is only modified on one Server at a time. This can be done by defining Master/Slave roles or Active/Passive partitions etc. Where this is not done, and data can be modified anywhere, there can be conflicts. A conflict occurs when the same data is modified at both Servers concurrently, but what does concurrently mean? On a single server, modifications to the same data are serialised by locking or MVCC mechanisms, so that there is a defined order between them. For example, two modifications MX and MY are committed either in order {MX, MY} or {MY, MX}.

For the purposes of replication, two modifications MX and MY on the same data are concurrent if the order of commit is different at different servers in the system. Each server will choose one order, but if they don't all choose the same order then there is a conflict. Having a different order means that the last modification on each server is different, and therefore the final state of the data can be different on different servers.

One way to avoid conflicts is to get all servers to agree on a commit order before processing an operation - this ensures that all replicas process operations in the same order, waiting if necessary for missing operations to arrive to ensure no commit-order variance.

Note that commit-ordering is only important between modifications affecting the same data - modifications which do not overlap in their data footprint are unrelated and can be committed in any order. A system which totally orders commits may be less efficient than one which only orders conflicting commits.

Happened before

For the NDB$EPOCH asynchronous conflict detection functions, commit orders are monitored to detect when two modifications to the same data have been committed in different orders.

Given two modifications MX and MY to the same data, each server will decide a happened before (denoted -&#62;) relationship between them :

1. MX -&#62; MY  (MX happened before MY)

or

2. MY -&#62; MX  (MY happened before MX)

If all servers agree on order 1, or all servers agree on order 2 then there is no conflict. If there is any disagreement then there is a conflict.

In practice, disagreement arises because the same data is modified at both Server A and Server B before the Server A modification is replicated to B and/or vice-versa.

Sometimes when reading about commit ordering, the reason why commit orders should not diverge is lost - the only reason to care about commit ordering is because it is related to conflicting modifications and the potential for data divergence.

Determining happened before from the Binlog

We assume a steady start state, where both Server A and Server B agree about the state of their data, and no modifications are in-flight. If a client of Server A then commits modification MA1 to row X, then from Server A's point of view, MA1 happened before any future modification to row X.

MA1 -&#62; M*

If a client of Server B commits modification MB1 to row X around the same time (before, or after, or thereabouts), from Server B's point of view, MB1 happened before any future modification to row X.

MB1 -&#62; M*

Both Servers are correct, and content with their world view. Note that in general, when committing a modification Mj, a server naturally asserts that from its point of view the modification happened before any as-yet-unseen modification Mk.

Some time will pass and the replication mechanisms will pull Binlogged changes across and apply them. When Server B pulls and applies Server A's Binlogged changes, modification MA1 will be applied to row X. Server B will then naturally be of the opinion that :

MB1 -&#62; MA1

Independently, Server A will pull Server B's binlogged changes and apply modification MB1 to row X, and will come to the certain opinion that :

MA1 -&#62; MB1

These happened before relationships are contradictory so there is a conflict. If nothing is done then A and B will have diverged, with Server A storing the outcome of MB1, and Server B storing the outcome of MA1.

Note that if the --log-slave-updates server option were on, then Server A's Binlog would have recorded {...MA1...MB1...}, whereas Server B's Binlog would have recorded {...MB1...MA1...}. By recording when the Slave applies replicated updates in the Binlog, we record the commit order of the replicated updates relative to other local updates, and encode the happened before relationship in the relative positions of events in the Binlog.

The Binlog is of course transferred between servers, so in a circular replication setup, Server A can become aware of the happened before information from Server B and vice-versa by examining the received Binlogs. The Slave SQL thread examines Binlogs as it applies them, so it can be extended to extract happened before information, and use it to detect conflicts.

Recall that Server A asserts that its committed modification to row X (MA1) happened before any as-yet-unseen replicated modification :

MA1 -&#62; M*

Therefore, to detect a conflict, Server A only needs to detect the case where the incoming Binlog from Server B implies that some modification MB* to row X happened before Server A's already committed modification MA1.

If Server B Binlog implies MB* -&#62; MA1  then there has been a conflict

This is in essence how the NDB$EPOCH functions work - the Binlog is used to capture happened before relationships which are checked to determine whether conflicting concurrent modifications have occurred.

Conflict Windows

In the previous example, Server A commits MA1 modifying row X, and Server B commits MB1 also modifying row X. From Server A's point of view, as soon as it commits MA1, there is potential for a replicated modification from B such as MB1 to be found in-conflict with MA1. We say that from Server A's point of view a window of potential conflict on row X opened when MA1 was committed. Server A monitors Server B's Binlog as it is applied and when it reaches the point where the commit of MA1 at Server B is recorded, Server A knows that any further MB* recorded in Server B's Binlog after this cannot have happened before MA1, therefore the window of potential conflict on row X has closed.

We define the window of potential conflict on a row X as the time between the commit of a modification M1, and the Slave processing of an event in a replicated Binlog indicating that modification M1 has been applied on the other server(s) in the replication loop.

Any incoming replicated modification M2 also affecting row X while it has an open conflict window is in conflict with M1, as it must appear to have happened-before M1 to the server which committed it.

Observations about the window of potential conflict :

It is defined per committed modification per disjoint data set

It can be extended by further modifications to the same data from the same server. The window does not close until all further modifications have been fully replicated

Window duration is dependent on the replication round-trip delay, which can vary greatly

Once it closes, further modifications to the same data from anywhere are safe, but will each open their own window of potential conflict.

From the point of view of one Server, conflicts can occur at any time until the conflict window is closed

From the point of view of one Server, the duration of the window of potential conflict is similar to

Replication Propagation Delay A to B + Replication Propagation Delay B to A

These delays may not be symmetric.

From the point of view of an external observer/actor, the system will detect two modifications MA1 and MB1 committed at times tMA1 and tMB1 as in-conflict if

tMB1 - tMA1 &#60; Replication Propagation Delay A to B

( A before B, but not by enough to avoid conflict )

or

tMA1 - tMB1 &#60; Replication Propagation Delay B to A

( B before A, but not by enough to avoid conflict )

The window of potential conflict can only be as short as the replication propagation delay between systems, which can tend towards, but never reach zero.

Tracking conflict windows with a logical clock

A row's conflict window opens when a modification is committed to it, and closes when the Slave processes an event indicating that the modification was committed on the other server(s). How can we track all of these independent conflict windows? If only we had a database :)

This is solved by maintaining a per-server logical clock, which increments periodically. Each modification to a row sets a hidden metacolumn of the row to the current value of the server's logical clock. This gives each row a kind of coarse logical timestamp. When the logical clock increments, an event is included in the Binlog to record the transition. Further, all row events for modifications with logical clock value X are stored in the Binlog before any row events for modifications with logical clock value X+1.

 Server A Binlog events    ClockVal stored in DB
                           by Modification

  ...
  MA1                       39
  MA2                       39
  MA3                       39
  ClockVal_A = 40
  MA4                       40
  MA5                       40
  ClockVal_A = 41
  MA6                       41

When a Slave applies the Binlog, the ClockVal events are passed through into its Binlog, and are then made available to the original server in a circular configuration.

 Server B Binlog events

  ...
  MB1
  MB2
  ClockVal_A = 40
  MB3
  MB4
  ClockVal_B = 234
  MB5
  MB6
  ClockVal_A = 41
  MB7
  ...

Using the Binlog ordering, we can see that ClockVal_A = 40 happened before MB3 and MB4 at Server B. This implies that MA1, MA2 and MA3 happened before MB3 and MB4 at server B.

When applying Server B's Binlog to Server A, the Slave at Server A maintains a maximum replicated clock value, which increases as it observes its ClockVal_A events returned. When applying a row event originating from Server B, the affected row's stored clock value is first compared to the maximum replicated clock value to determine whether the row event from B conflicts with the latest committed change to the row at Server A.

The two modifications are in conflict if the stored row's clock value is greater than or equal to the maximum replicated clock value.

  in_conflict = row_clockval &#62;= maximum_replicated_clockval

Using a logical clock to track conflict windows has the following benefits :

Automatic update on commit of row modification, opening conflict window

Automatic extension of conflict window on further modification on row with open conflict window.

Automatic closure of conflict window on maximum replicated clock value exceeding row's stored value

Efficient storage cost per row - one clock value.

Efficient runtime processing cost - inequality comparison between maximum replicated clock value and row's stored clock value.

As you might have guessed, NDB$EPOCH uses the MySQL Cluster epoch values as a logical clock to detect conflicts. The details of this will have to wait for yet another post. In my first two posts on this subject I thought, 'one more post and I can finish describing this', but here I am at three posts and still not finished. Hopefully the next will get more concrete and finally describe the mysterious workings of NDB$EPOCH. We're getting closer, honest.]]></description>
			<content:encoded><![CDATA[In my <a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html">previous</a> <a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html">posts</a> I introduced two new conflict detection functions, NDB$EPOCH and NDB$EPOCH_TRANS, without explaining how these functions actually detect conflicts.  To simplify the explanation I'll initially consider two circularly replicating MySQL Servers, A and B, rather than two replicating Clusters, but the principles are the same.<br /><br /><span>Commit ordering</span><br /><br />Avoiding conflicts requires that data is only modified on one Server at a time.  This can be done by defining Master/Slave roles or Active/Passive partitions etc.  Where this is not done, and data can be modified anywhere, there can be conflicts.  A conflict occurs when the same data is modified at both Servers concurrently, but what does concurrently mean?  On a single server, modifications to the same data are serialised by locking or MVCC mechanisms, so that there is a defined order between them.  For example, two modifications MX and MY are committed either in order {MX, MY} or {MY, MX}.<br /><br />For the purposes of replication, two modifications MX and MY on the same data are concurrent if the order of commit is different at different servers in the system.  Each server will choose one order, but if they don't all choose the same order then there is a conflict.  
Having a different order means that the last modification on each server is different, and therefore the final state of the data can be different on different servers.<br /><br /><a href="http://codership.com/">One way</a> to avoid conflicts is to get all servers to agree on a commit order before processing an operation - this ensures that all replicas process operations in the same order, waiting if necessary for missing operations to arrive to ensure no commit-order variance.<br /><br />Note that commit-ordering is only important between modifications affecting the same data - modifications which do not overlap in their data footprint are unrelated and can be committed in any order.  A system which totally orders commits may be less efficient than one which only orders conflicting commits.<br /><br /><span>Happened before</span><br /><br />For the NDB$EPOCH asynchronous conflict detection functions, commit orders are monitored to detect when two modifications to the same data have been committed in different orders.<br /><br />Given two modifications MX and MY to the same data, each server will decide a <a href="http://en.wikipedia.org/wiki/Happened-before">happened before</a> (denoted -&gt;) relationship between them :<br /><br /><ol><li>MX -&gt; MY  (MX happened before MY)<br /><br />or<br /><br /></li><li>MY -&gt; MX  (MY happened before MX)<br /></li></ol><br />If all servers agree on order 1, or all servers agree on order 2 then there is no conflict.  
If there is any disagreement then there is a conflict.<br /><br />In practice, disagreement arises because the same data is modified at both Server A and Server B before the Server A modification is replicated to B and/or vice-versa.<br /><br />Sometimes when reading about commit ordering, the reason why commit orders should not diverge is lost - the only reason to care about commit ordering is because it is related to conflicting modifications and the potential for data divergence.<br /><span><br />Determining happened before from the Binlog</span><br /><br />We assume a steady start state, where both Server A and Server B agree about the state of their data, and no modifications are in-flight.  If a client of Server A then commits modification MA1 to row X, then from Server A's point of view, MA1 happened before any future modification to row X.<br /><br /><blockquote>MA1 -&gt; M*<br /></blockquote><br />If a client of Server B commits modification MB1 to row X around the same time (before, or after, or thereabouts), from Server B's point of view, MB1 happened before any future modification to row X.<br /><br /><blockquote>MB1 -&gt; M*<br /></blockquote><br />Both Servers are correct, and content with their world view.  Note that in general, when committing a modification Mj, a server naturally asserts that from its point of view the modification happened before any as-yet-unseen modification Mk.<br /><br />Some time will pass and the replication mechanisms will pull Binlogged changes across and apply them.  When Server B pulls and applies Server A's Binlogged changes, modification MA1 will be applied to row X.  
Server B will then naturally be of the opinion that :<br /><br /><blockquote>MB1 -&gt; MA1<br /></blockquote><br />Independently, Server A will pull Server B's binlogged changes and apply modification MB1 to row X, and will come to the certain opinion that :<br /><br /><blockquote>MA1 -&gt; MB1<br /></blockquote><br />These happened before relationships are contradictory so there is a conflict.  If nothing is done then A and B will have diverged, with Server A storing the outcome of MB1, and Server B storing the outcome of MA1.<br /><br />Note that if the <a href="http://dev.mysql.com/doc/refman/5.1/en/replication-options-slave.html">--log-slave-updates</a> server option were on, then Server A's Binlog would have recorded {...MA1...MB1...}, whereas Server B's Binlog would have recorded {...MB1...MA1...}.  By recording when the Slave applies replicated updates in the Binlog, we record the commit order of the replicated updates relative to other local updates, and encode the happened before relationship in the relative positions of events in the Binlog.<br /><br />The Binlog is of course transferred between servers, so in a circular replication setup, Server A can become aware of the happened before information from Server B and vice-versa by examining the received Binlogs.  
The Slave SQL thread examines Binlogs as it applies them, so it can be extended to extract happened before information, and use it to detect conflicts.<br /><br />Recall that Server A asserts that its committed modification to row X (MA1) happened before any as-yet-unseen replicated modification :<br /><br /><blockquote>MA1 -&gt; M*</blockquote><br />Therefore, to detect a conflict, Server A only needs to detect the case where the incoming Binlog from Server B implies that some modification MB* to row X happened before Server A's already committed modification MA1.<br /><br /><blockquote>If Server B Binlog implies MB* -&gt; MA1  then there has been a conflict</blockquote><br />This is in essence how the NDB$EPOCH functions work - the Binlog is used to capture happened before relationships which are checked to determine whether conflicting concurrent modifications have occurred.<br /><br /><br /><span>Conflict Windows<br /><br /></span>In the previous example, Server A commits MA1 modifying row X, and Server B commits MB1 also modifying row X.  From Server A's point of view, as soon as it commits MA1, there is potential for a replicated modification from B such as MB1 to be found in-conflict with MA1.  We say that from Server A's point of view a window of potential conflict on row X opened when MA1 was committed.  
Server A monitors Server B's Binlog as it is applied and when it reaches the point where the commit of MA1 at Server B is recorded, Server A knows that any further MB* recorded in Server B's Binlog after this cannot have happened before MA1, therefore the window of potential conflict on row X has closed.<br /><br />We define the window of potential conflict on a row X as the time between the commit of a modification M1, and the Slave processing of an event in a replicated Binlog indicating that modification M1 has been applied on the other server(s) in the replication loop.<br /><br />Any incoming replicated modification M2 also affecting row X while it has an open conflict window is in conflict with M1, as it must appear to have happened-before M1 to the server which committed it.<br /><br />Observations about the window of potential conflict :<br /><ul><li>It is defined per committed modification per disjoint data set<br /></li><li>It can be extended by further modifications to the same data from the same server<br />The window does not close until all further modifications have been fully replicated</li><li>Window duration is dependent on the replication round-trip delay<br />which can vary greatly<br /></li><li>Once it closes, further modifications to the same data from anywhere are safe, but will each open their own window of potential conflict.<br /></li><li>From the point of view of one Server, conflicts can occur at any time until the conflict window is closed<br /></li><li>From the point of view of one Server, the duration of the window of potential conflict is similar to<br /><br /><span>Replication Propagation Delay A to B</span> + <span>Replication Propagation Delay B to A</span><br /><br />These delays may not be symmetric.</li><li>From the point of view of an external observer/actor, the system will detect two modifications MA1 and MB1 committed at times tMA1 and tMB1 as in-conflict if<br /><br /><span>tMB1 - tMA1 &lt; Replication Propagation Delay A to 
B</span><br /><br />( A before B, but not by enough to avoid conflict )<br /><br />or<br /><br /><span>tMA1 - tMB1 &lt; Replication Propagation Delay B to A</span><br /><br />( B before A, but not by enough to avoid conflict )</li><li>The window of potential conflict can only be as short as the replication propagation delay between systems, which can tend towards, but never reach zero.<br /></li></ul><br /><span>Tracking conflict windows with a logical clock</span><br /><br />A row's conflict window opens when a modification is committed to it, and closes when the Slave processes an event indicating that the modification was committed on the other server(s).  How can we track all of these independent conflict windows?  If only we had a database :)<br /><br />This is solved by maintaining a per-server <a href="http://en.wikipedia.org/wiki/Logical_clock">logical clock</a>, which increments periodically.  Each modification to a row sets a hidden metacolumn of the row to the current value of the server's logical clock.  This gives each row a kind of coarse logical timestamp.  When the logical clock increments, an event is included in the Binlog to record the transition.  
Further, all row events for modifications with logical clock value X are stored in the Binlog before any row events for modifications with logical clock value X+1.<br /><br /><pre> Server A Binlog events    ClockVal stored in DB<br />                           by Modification<br /><br />  ...<br />  MA1                       39<br />  MA2                       39<br />  MA3                       39<br />  ClockVal_A = 40<br />  MA4                       40<br />  MA5                       40<br />  ClockVal_A = 41<br />  MA6                       41<br /></pre><br /><br />When a Slave applies the Binlog, the ClockVal events are passed through into its Binlog, and are then made available to the original server in a circular configuration.<br /><br /><pre> Server B Binlog events<br /><br />  ...<br />  MB1<br />  MB2<br />  ClockVal_A = 40<br />  MB3<br />  MB4<br />  ClockVal_B = 234<br />  MB5<br />  MB6<br />  ClockVal_A = 41<br />  MB7<br />  ...<br /><br /></pre><br /><br />Using the Binlog ordering, we can see that ClockVal_A = 40 happened before MB3 and MB4 at Server B.  This implies that MA1, MA2 and MA3 happened before MB3 and MB4 at server B.<br /><br />When applying Server B's Binlog to Server A, the Slave at Server A maintains a maximum replicated clock value, which increases as it observes its ClockVal_A events returned.  
When applying a row event originating from Server B, the affected row's stored clock value is first compared to the maximum replicated clock value to determine whether the row event from B conflicts with the latest committed change to the row at Server A.<br /><br />The two modifications are in conflict if the stored row's clock value is greater than or equal to the maximum replicated clock value.<br /><br /><blockquote>  in_conflict = row_clockval &gt;= maximum_replicated_clockval</blockquote><br />Using a logical clock to track conflict windows has the following benefits :<br /><ul><li>Automatic update on commit of row modification, opening conflict window</li><li>Automatic extension of conflict window on further modification on row with open conflict window.<br /></li><li>Automatic closure of conflict window on maximum replicated clock value exceeding row's stored value</li><li>Efficient storage cost per row - one clock value.<br /></li><li>Efficient runtime processing cost - inequality comparison between maximum replicated clock value and row's stored clock value.<br /></li></ul><br />As you might have guessed, NDB$EPOCH uses the MySQL Cluster epoch values as a logical clock to detect conflicts.  The details of this will have to wait for yet another post.  In my first two posts on this subject I thought, 'one more post and I can finish describing this', but here I am at three posts and still not finished.  Hopefully the next will get more concrete and finally describe the mysterious workings of NDB$EPOCH.  We're getting closer, honest.]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/10/20/eventual-consistency-detecting-conflicts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

