<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PlanetMysql.ru - information about the MySQL DBMS &#187; Cluster</title>
	<atom:link href="http://planetmysql.ru/category/cluster/feed/" rel="self" type="application/rss+xml" />
	<link>http://planetmysql.ru</link>
	<description>A blog about MySQL, the most popular DBMS</description>
	<lastBuildDate>Thu, 24 May 2012 14:20:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>MySQL Cluster 7.2 &#8212; Unlimited Possibilities</title>
		<link>http://feedproxy.google.com/~r/dataandco/~3/fqwcCXdfBCs/mysql-cluster-72-unlimited.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=mysql-cluster-7-2-unlimited-possibilities</link>
		<comments>http://feedproxy.google.com/~r/dataandco/~3/fqwcCXdfBCs/mysql-cluster-72-unlimited.html#comments</comments>
		<pubDate>Wed, 16 May 2012 12:59:00 +0000</pubDate>
		<dc:creator>Luca Olivari</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[social]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=b4ec202e0531e84db90fa14ad672adeb</guid>
		<description><![CDATA[We've recently seen some great announcements of MySQL Cluster delivering amazing results for both&#160;selects&#160;and&#160;updates. The posts (see related articles below) are full of juicy technical details and proofs, but today I'd like to change the perspective a bit.&#160;Let's compare those figures with real-world data and imagine what could be done. Please note that I'm not using any scientific method here, just dreaming about the unlimited opportunities offered by MySQL Cluster today.



MySQL Cluster 7.2.7 -- 1B+ Writes per Minute

Cluster can deliver&#160;1B+ selects per minute&#160;with 8 nodes and&#160;1B+ updates per minute with 30 nodes.

Our planet is getting quite populated and interconnected.&#160;The world population is 7B+ and 2B+ of us are using the internet.&#160;Let's assume that, due to&#160;time zones, only 1/3 of the total internet population is online at a given time (700M+) and that&#160;a single action generates one update and one select on the database.







What kind of services can we offer then?


With such scalability and performance, MySQL Cluster offers endless opportunities to develop something new that can support the exponential growth of the web and offer always-on services to everyone, for example:


Hellos from the world -- a&#160;website where everyone can say hello to the world, whenever they want. MySQL Cluster can handle the entire online population in less than 1 minute;
Let's shop together -- a global&#160;eCommerce&#160;website selling everything with 100% market share. If everyone bought one item per minute, MySQL Cluster could easily&#160;fulfill&#160;the needs of the entire internet population with 30 nodes;
Like everything you like&#160;-- a like button that can be attached to everything in order to collect statistics on users' favorite things. MySQL Cluster could easily sustain the total online world assuming they'd like 1 thing per minute;


Furthermore, MySQL Cluster could handle updates from all of Zynga's 60M active daily users in 3 seconds or all of&#160;Facebook's 900M+ active users in less than a minute. All of that giving you ACID compliance and
    synchronous replication to ensure no data loss.
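
The back-of-the-envelope figures above can be checked with a few lines of Python. This is only a rough sketch: it assumes a flat rate of exactly 1 billion updates per minute (the post says "1B+") and one update per user, and the function name is made up for illustration.

```python
# Rough estimate only: assumes a flat 1 billion updates per minute
# (the post quotes "1B+") and exactly one update per user.
UPDATES_PER_MINUTE = 1_000_000_000


def seconds_to_serve(users, updates_per_minute=UPDATES_PER_MINUTE):
    """Return the seconds needed to apply one update per user."""
    updates_per_second = updates_per_minute / 60
    return users / updates_per_second


if __name__ == "__main__":
    for label, users in [
        ("Online population (700M)", 700_000_000),
        ("Zynga daily users (60M)", 60_000_000),
        ("Facebook active users (900M)", 900_000_000),
    ]:
        print(f"{label}: {seconds_to_serve(users):.1f} s")
```

Under these assumptions the 60M-user case takes a little over 3 seconds and the other two stay under a minute; any throughput above 1B per minute shortens all three.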

The Oracle MySQL engineering team did a great job with Cluster: let's build the next big thing with it!

Related articles


Where would I use MySQL Cluster?
MySQL Cluster 7.2.7 achieves 1BN update transactions per minute
1 Billion Queries Per Minute - MySQL Cluster 7.2 is GA!
Challenges in reaching 1BN reads and updates per minute for MySQL Cluster 7.2]]></description>
			<content:encoded><![CDATA[
We've recently seen some great announcements of MySQL Cluster delivering amazing results for both&nbsp;selects&nbsp;and&nbsp;updates. The posts (see related articles below) are full of juicy technical details and proofs, but today I'd like to change the perspective a bit.&nbsp;Let's compare those figures with real-world data and imagine what could be done. Please note that I'm not using any scientific method here, just dreaming about the unlimited opportunities offered by MySQL Cluster today.<br />
<br />
<table cellpadding="0" cellspacing="0"><tbody>
<tr><td><a href="http://3.bp.blogspot.com/-eLvTXDjG87c/T7OZlgo-DXI/AAAAAAAAAhs/iODUuhFtB3k/s1600/1bupdates.png" imageanchor="1"><img border="0" height="228" src="http://3.bp.blogspot.com/-eLvTXDjG87c/T7OZlgo-DXI/AAAAAAAAAhs/iODUuhFtB3k/s320/1bupdates.png" width="320" /></a></td></tr>
<tr><td>MySQL Cluster 7.2.7 -- 1B+ Writes per Minute</td></tr>
</tbody></table>
Cluster can deliver&nbsp;<a href="http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2-ga.html">1B+ selects per minute</a>&nbsp;with 8 nodes and&nbsp;<a href="http://mikaelronstrom.blogspot.it/2012/05/mysql-cluster-727-achieves-1bn-update.html">1B+ updates per minute</a> with 30 nodes.<br />
<br />
Our planet is getting quite populated and interconnected.&nbsp;The <a href="http://en.wikipedia.org/wiki/Global_Internet_usage">world population</a> is 7B+ and 2B+ of us are using the internet.&nbsp;Let's assume that, due to&nbsp;time zones, only 1/3 of the total internet population is online at a given time (700M+) and that&nbsp;a single action generates one update and one select on the database.<br />
<div>
<br />
<h4>What kind of services can we offer then?</h4>
</div>
<div>
With such scalability and performance, MySQL Cluster offers endless opportunities to develop something new that can support the exponential growth of the web and offer always-on services to everyone, for example:</div>
<div>
<ul>
<li><b><u>Hellos from the world</u></b> -- a&nbsp;website where everyone can say hello to the world, whenever they want. MySQL Cluster can handle the entire online population in less than 1 minute;</li>
<li><b><u>Let's shop together</u></b> -- a global&nbsp;eCommerce&nbsp;website selling everything with 100% market share. If everyone bought one item per minute, MySQL Cluster could easily&nbsp;fulfill&nbsp;the needs of the entire internet population with 30 nodes;</li>
<li><u><b>Like everything you like</b></u>&nbsp;-- a like button that can be attached to everything in order to collect statistics on users' favorite things. MySQL Cluster could easily sustain the total online world assuming they'd like 1 thing per minute;</li>
</ul>
<div>
Furthermore, MySQL Cluster could handle updates from all of <a href="http://company.zynga.com/about/advertise">Zynga's 60M active daily users</a> in 3 seconds or all of&nbsp;<a href="http://newsroom.fb.com/content/default.aspx?NewsAreaId=22">Facebook's 900M+ active users</a> in less than a minute, all while giving you ACID compliance and
    synchronous replication to ensure no data loss.</div>
<br />
The Oracle MySQL engineering team did a great job with Cluster: let's build the next big thing with it!<br />
<br />
<b>Related articles</b><br />
<br />
<ul>
<li><a href="https://blogs.oracle.com/MySQL/entry/where_would_i_use_mysql">Where would I use MySQL Cluster?</a></li>
<li><a href="http://mikaelronstrom.blogspot.com/2012/05/mysql-cluster-727-achieves-1bn-update.html" >MySQL Cluster 7.2.7 achieves 1BN update transactions per minute</a></li>
<li><a href="http://www.clusterdb.com/mysql-cluster/1-billion-queries-per-minute-mysql-cluster-7-2-is-ga/" >1 Billion Queries Per Minute - MySQL Cluster 7.2 is GA!</a></li>
<li><a href="http://mikaelronstrom.blogspot.com/2012/05/challenges-in-reaching-1bn-reads-and.html" >Challenges in reaching 1BN reads and updates per minute for MySQL Cluster 7.2</a></li>
</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/05/16/mysql-cluster-7-2-unlimited-possibilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Guide to MySQL &amp; NoSQL, Webinar Q&amp;A</title>
		<link>https://blogs.oracle.com/MySQL/entry/guide_to_mysql_nosql_webinar?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=guide-to-mysql-nosql-webinar-qa</link>
		<comments>https://blogs.oracle.com/MySQL/entry/guide_to_mysql_nosql_webinar#comments</comments>
		<pubDate>Fri, 30 Mar 2012 10:47:11 +0000</pubDate>
		<dc:creator>MySQL Community</dc:creator>
				<category><![CDATA[api]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[memcached]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[MySQL Cluster]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">https://blogs.oracle.com/MySQL/entry/guide_to_mysql_nosql_webinar</guid>
		<description><![CDATA[Yesterday we ran a webinar discussing the
demands of next generation web services and how blending the best of relational
and NoSQL technologies enables developers and architects to deliver the
agility, performance and availability needed to be successful. 
    
  Attendees posted a number of great
questions to the MySQL developers, serving to&#160;provide additional insights into areas like auto-sharding and cross-shard JOINs,
replication, performance, client libraries, etc. So I thought it would be useful to post those
below, for the benefit of those unable to attend the webinar. 
    
  Before getting to the Q&#38;A, there are a
couple of other resources that may be useful to those looking at NoSQL
capabilities within MySQL: 
  - On-Demand webinar (coming
soon!) 
  - Slides used during the webinar  
  -  Guide to MySQL and NoSQL
whitepaper&#160; 
  - MySQL Cluster demo, including
NoSQL interfaces, auto-sharding, high availability, etc.&#160; 
    
  So here is the Q&#38;A from the event&#160; 
    
  Q.
Where does MySQL Cluster fit in to the CAP theorem? 
  A. MySQL Cluster is flexible. A single
Cluster will prefer consistency over availability in the presence of network
partitions. A pair of Clusters can be configured to prefer availability over
consistency. A full explanation can be found on the MySQL Cluster &#38; CAP Theorem blog post.&#160; 
    
  Q.
Can you configure the number of replicas? (the slide used a replication factor
of 1)  
  Yes. A cluster is configured by an .ini
file. The option NoOfReplicas sets the number of originals and replicas: 1 = no
data redundancy, 2 = one copy etc. Usually there's no benefit in setting it
&#62;2. 
  Q. Interestingly most (if not all) of the NoSQL databases recommend having 3 copies of data (the replication factor).&#160;&#160;&#160; 
  Yes, with configurable quorum based Reads and writes. MySQL Cluster does not need a quorum of replicas online to provide service. Systems that require a quorum need &#62; 2 replicas to be able to tolerate a single failure. Additionally, many NoSQL systems take liberal inspiration from the original GFS paper which described a 3 replica configuration. MySQL Cluster avoids the need for a quorum by using a lightweight arbitrator. You can configure more than 2 replicas, but this is a tradeoff between incrementally improved availability, and linearly increased cost.  
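
The quorum point in the answer above can be made concrete with a little arithmetic. The sketch below is not MySQL Cluster code; it only models a generic majority-quorum system, and the function names are made up for illustration.

```python
# Sketch of majority-quorum arithmetic for a generic replicated store.
# This does not model MySQL Cluster, which (per the answer above) uses
# an arbitrator instead of requiring a quorum of replicas.


def majority(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def failures_tolerated_with_quorum(replicas):
    """Replica failures a majority-quorum system can survive."""
    return replicas - majority(replicas)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} replicas: majority={majority(n)}, "
              f"tolerates {failures_tolerated_with_quorum(n)} failure(s)")
```

With 2 replicas a majority is still 2, so a quorum-based system cannot lose either one; 3 is the smallest count that tolerates a single failure, which is why quorum-based systems tend to recommend it.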
     
  Q.
Can you have cross node group JOINS? Wouldn't that run into the risk of
flooding the network?  
  MySQL Cluster 7.2 supports cross nodegroup
joins. A full cross-join can require a large amount of data transfer, which may
bottleneck on network bandwidth. However, for more selective joins, typically
seen with OLTP and light analytic applications, cross node-group joins give a
great performance boost and network bandwidth saving over having the MySQL
Server perform the join. 
    
  Q.
Are the details of the benchmark available anywhere? According to my
calculations it results in approx. 350k ops/sec per processor which is the
largest number I've seen lately  
  The details are linked from Mikael
Ronstrom's blog  
  The benchmark uses a benchmarking tool we
call flexAsynch which runs parallel asynchronous transactions. It involved 100
byte reads, of 25 columns each. Regarding the per-processor ops/s, MySQL
Cluster is particularly efficient in terms of throughput/node. It uses
lock-free minimal copy message passing internally, and maximizes ID cache
reuse. Note also that these are in-memory tables, there is no need to read
anything from disk. 
    
  Q. Is
access control (e.g. at table level) planned for the NoSQL access mode?  
  Currently we have not seen much need for
full SQL-like access control (which has always been overkill for web apps and
telco apps). So we have no plans, though especially with memcached it is
certainly possible to turn on connection-level access control. But specifically
table-level controls are not planned. 
  Q. How
does the performance of the memcached API with MySQL compare against memcached+MySQL or any
other object cache, such as Ehcache, with a MySQL DB? 
  With the memcached API we generally see a
memcached response in less than 1 ms, and a small cluster with one memcached
server can handle tens of thousands of operations per second. 
    
  Q.
Can .NET access the memcached API?  
  Yes, just use a .NET memcached client such
as the enyim or BeIT memcache libraries. 
    
  Q. Is
the row level locking applicable when you update a column through memcached API?  
  An update that comes through memcached uses
a row lock and then releases it immediately. Memcached operations like
&#34;INCREMENT&#34; are actually pushed down to the data nodes. In most cases
the locks are not even held long enough for a network round trip. 
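
For readers who have not seen what such an update looks like on the wire, here is a minimal sketch of the standard memcached text protocol in Python. The host, port and key names are assumptions, and how a key maps onto a Cluster table and column depends on the server configuration, which is not shown here.

```python
# Sketch of the memcached text protocol. Host, port and key names are
# assumptions; the key-to-table mapping is configured on the server side.
import socket


def build_incr(key, delta=1):
    """Format a memcached text-protocol 'incr' request."""
    return f"incr {key} {delta}\r\n".encode("ascii")


def build_set(key, value, flags=0, exptime=0):
    """Format a memcached text-protocol 'set' request for a bytes value."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def send(request, host="127.0.0.1", port=11211):
    """Send one request and return the first chunk of the server's reply."""
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall(request)
        return conn.recv(4096)
```

Calling `send(build_incr("hits"))` against a running memcached-compatible server would issue a single increment; any ordinary memcached client library produces equivalent requests.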
    
  Q. Has
anyone published an example using something like PHP?  I am assuming that you just use the PHP
memcached extension to hook into the memcached API. Is that correct?  
  Not that I'm aware of, but you can absolutely
use it with PHP or any of the other drivers. 
    
  Q. For
beginners, we need more examples.  
  Take a look here for a fully worked example 
    
  Q. Can I access MySQL using Cobol (Open Cobol) or C and if so where can I find the coding libraries etc? 
  A. There is a COBOL implementation that works well with MySQL, but I do not think it is Open Cobol. Also, there is a MySQL C client library that is a standard part of every MySQL distribution.  
    
  Q. Is
there a place to go to find help when testing and/or implementing the NoSQL
access?  
  If using Cluster then you can use the
cluster@lists.mysql.com alias or post on the MySQL Cluster forum  
    
  Q. Are there any white papers on this?&#160; 
  Yes - there is more detail in the MySQL Guide to NoSQL whitepaper  
    
  If you have further questions, please don’t
hesitate to use the comments below!]]></description>
			<content:encoded><![CDATA[<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Revision>0</o:Revision>
  <o:TotalTime>0</o:TotalTime>
  <o:Pages>1</o:Pages>
  <o:Words>959</o:Words>
  <o:Characters>5469</o:Characters>
  <o:Company>Homework</o:Company>
  <o:Lines>45</o:Lines>
  <o:Paragraphs>12</o:Paragraphs>
  <o:CharactersWithSpaces>6416</o:CharactersWithSpaces>
  <o:Version>14.0</o:Version>
 </o:DocumentProperties>
 <o:OfficeDocumentSettings>
  <o:AllowPNG/>
 </o:OfficeDocumentSettings>
</xml><![endif]--> <!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <w:Zoom>0</w:Zoom>
  <w:TrackMoves/>
  <w:TrackFormatting/>
  <w:PunctuationKerning/>
  <w:ValidateAgainstSchemas/>
  <w:SaveIfXMLInval>false</w:SaveIfXMLInvalid>
  <w:IgnoreMixedContent>false</w:IgnoreMixedContent>
  <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
  <w:DoNotPromoteQF/>
  <w:LidThemeOther>EN-US</w:LidThemeOther>
  <w:LidThemeAsian>JA</w:LidThemeAsian>
  <w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
  <w:Compatibility>
   <w:BreakWrappedTables/>
   <w:SnapToGridInCell/>
   <w:WrapTextWithPunct/>
   <w:UseAsianBreakRules/>
   <w:DontGrowAutofit/>
   <w:SplitPgBreakAndParaMark/>
   <w:EnableOpenTypeKerning/>
   <w:DontFlipMirrorIndents/>
   <w:OverrideTableStyleHps/>
   <w:UseFELayout/>
  </w:Compatibility>
  <m:mathPr>
   <m:mathFont m:val="Cambria Math"/>
   <m:brkBin m:val="before"/>
   <m:brkBinSub m:val="&#45;-"/>
   <m:smallFrac m:val="off"/>
   <m:dispDef/>
   <m:lMargin m:val="0"/>
   <m:rMargin m:val="0"/>
   <m:defJc m:val="centerGroup"/>
   <m:wrapIndent m:val="1440"/>
   <m:intLim m:val="subSup"/>
   <m:naryLim m:val="undOvr"/>
  </m:mathPr></w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
  DefSemiHidden="true" DefQFormat="false" DefPriority="99"
  LatentStyleCount="276">
  <w:LsdException Locked="false" Priority="0" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Normal"/>
  <w:LsdException Locked="false" Priority="9" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="heading 1"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 2"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 3"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 4"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 5"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 6"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 7"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 8"/>
  <w:LsdException Locked="false" Priority="9" QFormat="true" Name="heading 9"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 1"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 2"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 3"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 4"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 5"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 6"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 7"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 8"/>
  <w:LsdException Locked="false" Priority="39" Name="toc 9"/>
  <w:LsdException Locked="false" Priority="35" QFormat="true" Name="caption"/>
  <w:LsdException Locked="false" Priority="10" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Title"/>
  <w:LsdException Locked="false" Priority="1" Name="Default Paragraph Font"/>
  <w:LsdException Locked="false" Priority="11" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Subtitle"/>
  <w:LsdException Locked="false" Priority="22" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Strong"/>
  <w:LsdException Locked="false" Priority="20" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Emphasis"/>
  <w:LsdException Locked="false" Priority="59" SemiHidden="false"
   UnhideWhenUsed="false" Name="Table Grid"/>
  <w:LsdException Locked="false" UnhideWhenUsed="false" Name="Placeholder Text"/>
  <w:LsdException Locked="false" Priority="1" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="No Spacing"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 1"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 1"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 1"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 1"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 1"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1 Accent 1"/>
  <w:LsdException Locked="false" UnhideWhenUsed="false" Name="Revision"/>
  <w:LsdException Locked="false" Priority="34" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="List Paragraph"/>
  <w:LsdException Locked="false" Priority="29" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Quote"/>
  <w:LsdException Locked="false" Priority="30" SemiHidden="false"
   UnhideWhenUsed="false" QFormat="true" Name="Intense Quote"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2 Accent 1"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1 Accent 1"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2 Accent 1"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3 Accent 1"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 1"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 1"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 1"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 1"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 2"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 2"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 2"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 2"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 2"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1 Accent 2"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2 Accent 2"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1 Accent 2"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2 Accent 2"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3 Accent 2"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 2"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 2"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 2"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 2"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 3"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 3"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 3"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 3"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 3"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1 Accent 3"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2 Accent 3"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1 Accent 3"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2 Accent 3"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3 Accent 3"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 3"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 3"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 3"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 3"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 4"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 4"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 4"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 4"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 4"/>
  <w:LsdException Locked="false" Priority="65" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 1 Accent 4"/>
  <w:LsdException Locked="false" Priority="66" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium List 2 Accent 4"/>
  <w:LsdException Locked="false" Priority="67" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 1 Accent 4"/>
  <w:LsdException Locked="false" Priority="68" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 2 Accent 4"/>
  <w:LsdException Locked="false" Priority="69" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Grid 3 Accent 4"/>
  <w:LsdException Locked="false" Priority="70" SemiHidden="false"
   UnhideWhenUsed="false" Name="Dark List Accent 4"/>
  <w:LsdException Locked="false" Priority="71" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Shading Accent 4"/>
  <w:LsdException Locked="false" Priority="72" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful List Accent 4"/>
  <w:LsdException Locked="false" Priority="73" SemiHidden="false"
   UnhideWhenUsed="false" Name="Colorful Grid Accent 4"/>
  <w:LsdException Locked="false" Priority="60" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Shading Accent 5"/>
  <w:LsdException Locked="false" Priority="61" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light List Accent 5"/>
  <w:LsdException Locked="false" Priority="62" SemiHidden="false"
   UnhideWhenUsed="false" Name="Light Grid Accent 5"/>
  <w:LsdException Locked="false" Priority="63" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 1 Accent 5"/>
  <w:LsdException Locked="false" Priority="64" SemiHidden="false"
   UnhideWhenUsed="false" Name="Medium Shading 2 Accent 5"/>
 </w:LatentStyles>
 </xml><![endif]--> <!--StartFragment--> 
  <p><span lang="EN-US">Yesterday we ran a webinar discussing the
demands of next generation web services and how blending the best of relational
and NoSQL technologies enables developers and architects to deliver the
agility, performance and availability needed to be successful.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><span lang="EN-US">Attendees posted a number of great
questions to the MySQL developers, serving to&nbsp;provide additional insights into areas like auto-sharding and cross-shard JOINs,
replication, performance, client libraries, etc. So I thought it would be useful to post those
below, for the benefit of those unable to attend the webinar.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><span lang="EN-US">Before getting to the Q&amp;A, there are a
couple of other resources that may be useful to those looking at NoSQL
capabilities within MySQL:<o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span lang="EN-US">On-Demand webinar (coming
soon!)<o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span lang="EN-US"><a href="http://www.slideshare.net/matkeep/nosql-and-mysql-webinar-best-of-both-worlds">Slides used during the webinar </a><o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span lang="EN-US"><a href="http://mysql.com/why-mysql/white-papers/mysql-wp-guide-to-nosql.php">Guide to MySQL and NoSQL
whitepaper</a><o:p /></span></p> 
  <p><!--[if !supportLists]--><span lang="EN-US">-<span> </span></span><!--[endif]--><span lang="EN-US"><a href="http://www.oracle.com/pls/ebn/swf_viewer.load?p_shows_id=11464419">MySQL Cluster demo</a>, including
NoSQL interfaces, auto-sharding, high availability, etc.&nbsp;<o:p /></span></p> 
  <p> </p> 
  <p><span lang="EN-US">So here is the Q&amp;A from the event:</span></p> 
  <p> </p> 
  <p><b><span lang="EN-US">Q.
Where does MySQL Cluster fit in to the CAP theorem?<o:p /></span></b></p> 
  <p><span lang="EN-US">A. MySQL Cluster is flexible. A single
Cluster will prefer consistency over availability in the presence of network
partitions. A pair of Clusters can be configured to prefer availability over
consistency. A full explanation can be found on the <a href="http://messagepassing.blogspot.co.uk/2012/03/cap-theorem-and-mysql-cluster.html">MySQL Cluster &amp; CAP Theorem blog post.&nbsp;</a><o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q.
Can you configure the number of replicas? (the slide used a replication factor
of 1) <o:p /></span></b></p> 
  <p><span lang="EN-US">Yes. A cluster is configured by an .ini
file. The NoOfReplicas option sets the total number of copies of the data,
originals included: 1 = no data redundancy, 2 = one additional copy, and so on.
Usually there's no benefit in setting it higher than 2.</span></p> 
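  <p><span lang="EN-US">As a rough illustration of what NoOfReplicas means in practice, the Python sketch below splits data nodes into node groups of NoOfReplicas members each. It assumes nodes are assigned to groups in ascending node ID order and that the node count is a multiple of NoOfReplicas. The function name and the node IDs are hypothetical and exist only for this example; this is not code from MySQL Cluster itself.</span></p> 
<pre>
```python
def node_groups(data_node_ids, no_of_replicas):
    """Split data node IDs into node groups of no_of_replicas nodes each.

    Assumption for this sketch: nodes are assigned to groups in ascending
    node ID order, and each group holds every copy (original plus
    replicas) of the data partitions it owns.
    """
    ids = sorted(data_node_ids)
    if no_of_replicas not in range(1, len(ids) + 1):
        raise ValueError("NoOfReplicas must be between 1 and the node count")
    if len(ids) % no_of_replicas != 0:
        raise ValueError("node count must be a multiple of NoOfReplicas")
    return [ids[i:i + no_of_replicas]
            for i in range(0, len(ids), no_of_replicas)]


# Four hypothetical data nodes with NoOfReplicas=2 form two node groups.
print(node_groups([1, 2, 3, 4], 2))
```
</pre>
  <p><span lang="EN-US">With NoOfReplicas set to 1, each node would form its own group, so losing any single node would make part of the data unavailable.</span></p> 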
  <p><b><span lang="EN-US">Q. Interestingly, most (if not all) of the NoSQL databases recommend having 3 copies of data (the replication factor).<o:p /></span></b></p> 
  <p><span lang="EN-US">Yes, with configurable quorum-based reads and writes. MySQL Cluster does not need a quorum of replicas online to provide service. Systems that require a quorum need more than 2 replicas to be able to tolerate a single failure. Additionally, many NoSQL systems take liberal inspiration from the original GFS paper, which described a 3-replica configuration. MySQL Cluster avoids the need for a quorum by using a lightweight arbitrator. You can configure more than 2 replicas, but this is a tradeoff between incrementally improved availability and linearly increased cost.</span> </p> 
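  <p><span lang="EN-US">The difference between the two approaches can be put into numbers with a small Python sketch. It uses a deliberately simplified model: a quorum system needs a strict majority of replicas online, while an arbitrated system keeps serving as long as one replica survives and can reach the arbitrator. Both function names are hypothetical.</span></p> 
<pre>
```python
def quorum_failures_tolerated(replicas):
    """Replica failures survivable when a strict majority must stay online."""
    majority = replicas // 2 + 1
    return replicas - majority


def arbitrated_failures_tolerated(replicas):
    """Replica failures survivable when one survivor plus a reachable
    arbitrator is enough to keep serving (simplified model)."""
    return replicas - 1


# Columns: replicas, failures tolerated with a quorum, with an arbitrator.
for n in (2, 3):
    print(n, quorum_failures_tolerated(n), arbitrated_failures_tolerated(n))
```
</pre>
  <p><span lang="EN-US">Under this model, two replicas with a quorum tolerate no failures at all, which is why quorum-based systems tend to start at three replicas.</span></p> 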
  <p><span lang="EN-US"> </span> </p> 
  <p><b><span lang="EN-US">Q.
Can you have cross node group JOINS? Wouldn't that run into the risk of
flooding the network? <o:p /></span></b></p> 
  <p><span lang="EN-US">MySQL Cluster 7.2 supports cross node-group
joins. A full cross-join can require a large amount of data transfer, which may
bottleneck on network bandwidth. However, for more selective joins, typically
seen with OLTP and light analytic applications, cross node-group joins give a
great performance boost and save network bandwidth compared with having the MySQL
Server perform the join.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q.
Are the details of the benchmark available anywhere? According to my
calculations it results in approx. 350k ops/sec per processor which is the
largest number I've seen lately <o:p /></span></b></p> 
  <p><span lang="EN-US">The details are linked from <a href="http://mikaelronstrom.blogspot.co.uk/2012/02/105bn-qpm-using-mysql-cluster-72.html">Mikael
Ronstrom's blog</a> <o:p /></span></p> 
  <p><span lang="EN-US">The benchmark uses a benchmarking tool we
call flexAsynch, which runs parallel asynchronous transactions. It involved
100-byte reads of 25 columns each. Regarding the per-processor ops/s, MySQL
Cluster is particularly efficient in terms of throughput per node. It uses
lock-free minimal copy message passing internally, and maximizes ID cache
reuse. Note also that these are in-memory tables, so there is no need to read
anything from disk.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q. Is
access control (such as table-level control) planned to be supported for NoSQL access mode?</span></b><span lang="EN-US"> <o:p /></span></p> 
  <p><span lang="EN-US">Currently we have not seen much need for
full SQL-like access control (which has always been overkill for web apps and
telco apps). So we have no plans, though especially with memcached it is
certainly possible to turn on connection-level access control. But specifically
table-level controls are not planned.</span></p> 
  <p><b><span lang="EN-US">Q. How
is the performance of the memcached API with MySQL against memcached+MySQL or any
other object cache like Ehcache with a MySQL DB?<o:p /></span></b></p> 
  <p><span lang="EN-US">With the memcached API we generally see a
memcached response in less than 1 ms, and a small cluster with one memcached
server can handle tens of thousands of operations per second.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q.
Can .NET access the Memcached API? <o:p /></span></b></p> 
  <p><span lang="EN-US">Yes, just use a .NET memcached client such
as the Enyim or BeIT memcached libraries.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q. Is
row-level locking applicable when you update a column through the memcached API? <o:p /></span></b></p> 
  <p><span lang="EN-US">An update that comes through memcached uses
a row lock and then releases it immediately. Memcached operations like
&quot;INCREMENT&quot; are actually pushed down to the data nodes. In most cases
the locks are not even held long enough for a network round trip.<o:p /></span></p> 
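  <p><span lang="EN-US">To make the INCREMENT case concrete, here is a hedged Python sketch that builds request lines in the standard memcached text protocol. The helper names and the key are hypothetical, and the code only constructs the bytes a client would send; it does not open a connection, and the mapping from key to table row on the Cluster side is configuration that is not shown here.</span></p> 
<pre>
```python
def incr_command(key, delta=1):
    """Build a memcached text-protocol 'incr' request."""
    return "incr {} {}\r\n".format(key, delta).encode("ascii")


def set_command(key, value, flags=0, exptime=0):
    """Build a memcached text-protocol 'set' request for a bytes value."""
    header = "set {} {} {} {}\r\n".format(key, flags, exptime, len(value))
    return header.encode("ascii") + value + b"\r\n"


# Hypothetical counter key: initialise it, then increment it by one.
print(set_command("page:hits", b"0"))
print(incr_command("page:hits"))
```
</pre>
  <p><span lang="EN-US">Because the increment travels as a single request, the client never has to read the value, modify it and write it back.</span></p> 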
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q. Has
anyone published an example using something like PHP?  I am assuming that you just use the PHP
memcached extension to hook into the memcached API. Is that correct? <o:p /></span></b></p> 
  <p><span lang="EN-US">Not that I'm aware of, but you can absolutely
use it with PHP or any of the other memcached client drivers.<o:p /></span></p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q. For
beginners, we need more examples. <o:p /></span></b></p> 
  <p><span lang="EN-US">Take a look <a href="http://clusterdb.com/u/memcached">here</a> for a fully worked example.</span></p> 
  <p> </p> 
  <p><b><span lang="EN-US">Q. Can I access MySQL using Cobol (Open Cobol) or C and if so where can I find the coding libraries etc?<o:p /></span></b></p> 
  <p><span lang="EN-US">A. There is a COBOL implementation that works well with MySQL, but I do not think it is Open Cobol. There is also a MySQL C client library that is a standard part of every MySQL distribution.</span> </p> 
  <p><span lang="EN-US"> </span></p> 
  <p><b><span lang="EN-US">Q. Is
there a place to go to find help when testing and/or implementing the NoSQL
access? <o:p /></span></b></p> 
  <p><span lang="EN-US">If using Cluster then you can use the
cluster@lists.mysql.com alias or post on the <a href="http://forums.mysql.com/list.php?25">MySQL Cluster forum </a><o:p /></span></p> 
  <p> </p> 
  <p><b><span lang="EN-US">Q. Are there any white papers on this?&nbsp;<o:p /></span></b></p> 
  <p>Yes - there is more detail in the <a href="http://mysql.com/why-mysql/white-papers/mysql-wp-guide-to-nosql.php">MySQL Guide to NoSQL whitepaper</a> </p> 
  <p> </p> 
  <p><span lang="EN-US">If you have further questions, please don’t
hesitate to use the comments below!<o:p /></span></p> <!--EndFragment-->]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/03/30/guide-to-mysql-nosql-webinar-qa/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The CAP theorem and MySQL Cluster</title>
		<link>http://messagepassing.blogspot.com/2012/03/cap-theorem-and-mysql-cluster.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-cap-theorem-and-mysql-cluster</link>
		<comments>http://messagepassing.blogspot.com/2012/03/cap-theorem-and-mysql-cluster.html#comments</comments>
		<pubDate>Wed, 07 Mar 2012 00:04:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[message-passing]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[rambling]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=7c0de672e5ba5240973f4ea2fd387d5a</guid>
		<description><![CDATA[tldr; A single MySQL Cluster prioritises Consistency in Network partition events.  Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events. I was recently asked about the relationship between MySQL Cluster and the CAP theorem.  The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast.  You can have any two, but you can't have all three.  For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'.  CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide? Standard 'my interpretation of CAP' section: Everyone who discusses CAP likes to rehash it, and I'm no exception.  Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition.  This is how I think of CAP.  He also discusses related system behaviour in normal operation which I'll return to later. While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition. Network replicated database: CAP is only really relevant in the context of a network replicated database (or filesystem or state machine).   A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network.  Data can be read and updated.  Updates are propagated to all nodes with replicas via the network.  Database clients connect to database nodes via the network to read data and make updates.  Replication may occur to improve availability, to improve request latency, or to improve read bandwidth. Availability: The network replicated database exists to provide services such as Read and Write on the data it stores.  
Its availability can be measured as the ability of any client to perform any service on any data item. This Service Availability can be compromised by: failure of client nodes; network failures between clients and database nodes; network failures between database nodes; failure of database nodes. Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP.  However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP. Service Availability is not binary, it can partially degrade, perhaps by affecting: a subset of all clients; a subset of all stored data; a subset of request types. The shades of grey within the definition of availability are responsible for most of the arguments around CAP.  If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain.  If we take a more flexible approach then some service availability can be preserved even with a completely decimated network.  In the loosest definition, if any client receives any service on any data, then the system is still available.  Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage.  Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately. Consistency: For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data.  Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result.  
Where two or more updates to the same data item are submitted simultaneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results.  This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data. If one-copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously.  Over time the data copies may be aligned, but clients accessing the data must be aware that reads may not return the results of the most recently accepted writes.  This behaviour may be described as eventual consistency.  Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications.  Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available. Network Partitions: Network partitions isolate subsets of the nodes of a network replicated database.  The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are: 1) dead; 2) alive but isolated from clients; 3) alive and reachable by clients but isolated from us. Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency.  The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them.  If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one-copy consistency as we cannot supply those new values in response to a read of our local copy.  If those clients are not allowed to make writes then we have degraded service availability for them.  
Which is it to be?  This is the unavoidable choice at the centre of the CAP theorem.  Stated this way it seems less of a theorem and more of a fact. Back to MySQL Cluster - which does it provide? A single MySQL Cluster prioritises data consistency over availability when network partitions occur. A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur. So you can have it both ways with MySQL Cluster - Great! Single MySQL Cluster - CP: Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit.  Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes.  Where a network partition occurs, live nodes in each partition regroup and decide what to do next: If there are not enough live nodes to serve all of the data stored - shutdown (serving a subset of user data, and risking data consistency, is not an option). If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service (no other subset of nodes can be isolated from us and serving clients). If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate (there could be another subset of nodes regrouped into a viable cluster out there). Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others.  
In other words, arbitration occurs to preserve consistency. Arbitration involves: Database nodes agree on an arbitrator in advance. During node or network failure handling, no data writes are committed. When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service. Each request to the arbitrator will result in either: Yes, No or timeout. Anything other than Yes results in node shutdown. The arbitrator only says Yes once per election round (first come, first served).  Therefore the arbitrator only says yes to one potential cluster in a partitioned network. Note that arbitration is not the same as achieving a quorum.  A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor.  The arbitrator role is lightweight as it is not involved in normal traffic.  I am surprised that the lightweight arbitrator pattern is not more common. How does a single MySQL Cluster degrade service availability as a result of network partitions? Where some subset of data nodes are isolated and shut down: those nodes are 100% out of service, until they restart and can rejoin the cluster (they will attempt to do so automatically); any clients connected only to those nodes are out of service (by default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability); the remaining live nodes are 100% in service; clients connected to the remaining live nodes are 100% in service. Where no subset of data nodes is live: all clients experience 100% service loss, until the data nodes restart and can rejoin the cluster (they will attempt to do so automatically). A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions.  
It does not sacrifice consistency. How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition? Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it.  Some set of clients continue to receive 100% service.  Only double failures in the network can cause a network partition resulting in full service loss. Furthermore, network partitions are not the only risks to availability; software errors, power failures, upgrades and overloads are other potential sources of downtime which Cluster is designed to overcome. Asynchronously replicating clusters - AP: Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters.  Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not.  On the other hand, availability is not compromised by network partitioning of the two clusters.  Each cluster can continue to accept read and write requests to all of the data from any connected client. Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc. How does consistency degrade between replicating MySQL Clusters during a network partition? This depends on the conflict resolution function chosen, and how detected conflicts are handled.  
Some details of consistency guarantees provided by NDB$EPOCH et al are described here. What about normal operation? Abadi's post introduced his PACELC acronym, standing for something like: if (network Partition) { trade-off Availability vs Consistency; } else { trade-off Latency vs Consistency; } My first comment has to be that it's bad form to put the common case in an else branch! However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition.  The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency.  In theory this normal operation latency-vs-consistency tradeoff could be completely independent of the Network Partitioning availability-vs-consistency tradeoff,  e.g. you could have any of: 1) PA EL  (Partition - Availability, Else - Latency minimisation); 2) PA EC  (Partition - Availability, Else - Consistency); 3) PC EL  (Partition - Consistency, Else - Latency minimisation); 4) PC EC  (Partition - Consistency, Else - Consistency). The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency.  Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability.  Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency. Examples of systems of each type: 1) Any eventually consistent system, especially with local-database-node updates + reads; 2) Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication); 3) ???; 4) Always consistent systems (e.g. single database instance, single MySQL Cluster). I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition.  
Maybe this category should be called 'repentant systems'? The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies.  It is not enough to say that things will 'eventually' be consistent.  It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc. There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database.  The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use!  A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs).  These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations. Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented.  Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower.  This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable.  
Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers. At a wider network scale, latencies are naturally higher, implying that bandwidth is lower.  The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable.   The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable.  Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data.  For a particular application there may or may not be a point at which an AP system would meet its requirements better. Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system.  Paraphrasing  John Mashey, "Bandwidth improves, latency is forever".  Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them. Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available.  Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links.  
As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear. Edit 7/3/12: Fix bad link]]></description>
			<content:encoded><![CDATA[<blockquote>tldr; A single MySQL Cluster prioritises Consistency in Network partition events.  Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.</blockquote><br /><br />I was recently asked about the relationship between MySQL Cluster and the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a>.  The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast.  You can have any two, but you can't have all three.  For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'.  CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?<br /><br /><span>Standard 'my interpretation of CAP' section</span><br /><br />Everyone who discusses CAP likes to rehash it, and I'm no exception.  <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">Daniel Abadi</a> has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition.  This is how I think of CAP.  He also discusses related system behaviour in normal operation which I'll return to later.<br /><br />While this reframing clarifies CAP, the terms <span>network partition</span>, <span>availability </span>and<span> consistency</span> also need some definition.<br /><br /><span>Network replicated database<br /></span><br />CAP is only really relevant in the context of a network replicated database (or filesystem or state machine).   A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network.  Data can be read and updated.  Updates are propagated to all nodes with replicas via the network.  
Database clients connect to database nodes via the network to read data and make updates.  Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.<br /><br /><span>Availability<br /></span><br />The network replicated database exists to provide services such as Read and Write on the data it stores.  Its availability can be measured as the ability of any client to perform any service on any data item.<br /><br />This <span>Service Availability</span> can be compromised by :<br /><ul><li>Failure of client nodes</li><li>Network failures between clients and database nodes</li><li>Network failures between database nodes</li><li>Failure of database nodes<br /></li></ul>Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP.  However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service <span>is</span> within the scope of CAP.<br /><br />Service Availability is not binary, it can partially degrade, perhaps by affecting :<br /><ul><li>A subset of all clients</li><li>A subset of all stored data</li><li>A subset of request types<br /></li></ul><br />The shades of grey within the definition of availability are responsible for most of the arguments around CAP.  If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain.  If we take a more flexible approach then some service availability can be preserved even with a completely decimated network.  In the loosest definition, if any client receives any service on any data, then the system is still available.  Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage.  
Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.<br /><br /><span>Consistency<br /></span><br />For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data.  Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result.  Where two or more updates to the same data item are submitted simultaneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results.  This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.<br /><br />If one-copy consistency is relaxed, then different database nodes may <span>observably</span> have different values for the same data item simultaneously.  Over time the data copies may be aligned, but clients accessing the data must be aware that reads may not return the results of the most recently accepted writes.  This behaviour may be described as eventual consistency.  Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications.  Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.<br /><br /><span>Network Partitions<br /></span><br />Network partitions isolate subsets of the nodes of a network replicated database.  
The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :<br /><ol><li>dead</li><li>alive but isolated from clients</li><li>alive and reachable by clients but isolated from us<br /></li></ol>Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency.  The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them.  If those clients are allowed to make writes on data copies stored on those database nodes, then we <span>must</span> lose one copy consistency as we cannot supply those new values in response to a read of our local copy.  If those clients are not allowed to make writes then we have degraded service availability for them.  Which is it to be?  This is the unavoidable choice at the centre of the CAP theorem.  Stated this way it seems less of a theorem and more of a fact.<br /><br /><span>Back to MySQL Cluster - which does it provide?<br /></span><br />A single MySQL Cluster prioritises data consistency over availability when network partitions occur.<br /><br />A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.<br /><br />So you can have it both ways with MySQL Cluster - Great!<br /><br /><span>Single MySQL Cluster - CP</span><br /><br />Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit.  Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes.  
Where a network partition occurs, live nodes in each partition regroup and decide what to do next :<br /><ul><li>If there are not enough live nodes to serve all of the data stored - shutdown<br /><span>Serving a subset of user data (and risking data consistency) is not an option</span></li><li>If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service<br /><span>No other subset of nodes can be isolated from us and serving clients</span></li><li>If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.<br /><span>There could be another subset of nodes regrouped into a viable cluster out there. </span><br /></li></ul><br />Arbitration occurs to avoid the <a href="http://en.wikipedia.org/wiki/Split-brain_(computing)">split brain</a> scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others.  In other words, arbitration occurs to preserve consistency.<br /><br />Arbitration involves :<br /><ul><li>Database nodes agree on an arbitrator in advance</li><li>During node or network failure handling, no data writes are committed.<br /></li><li>When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.</li><li>Each request to the arbitrator will result in either : Yes, No or timeout</li><li>Anything other than Yes results in node shutdown.</li><li>The arbitrator only says Yes once per election round (First come first served).  Therefore the arbitrator only says yes to one potential cluster in a partitioned network.<br /></li></ul><br />Note that arbitration is not the same as achieving a quorum.  A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor.  
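<br /><br />The regrouping rules and the arbitrator's first-come-first-served behaviour described above can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions, not how the data nodes actually implement it: I assume data nodes are arranged in node groups, and that a set of nodes can serve all of the data if it contains at least one node from every group.  The names decide, covers_all_data and Arbitrator are made up for this example.<br /><br /><pre>
```python
from enum import Enum


class Decision(Enum):
    SHUTDOWN = "shutdown"
    CONTINUE = "continue"
    ARBITRATE = "arbitrate"


def covers_all_data(nodes, node_groups):
    """True if nodes includes at least one member of every node group."""
    return all(any(n in nodes for n in group) for group in node_groups)


def decide(live, node_groups):
    """Apply the three regrouping rules to the nodes we can still see."""
    all_nodes = set().union(*node_groups)
    missing = all_nodes - set(live)
    if not covers_all_data(live, node_groups):
        # We cannot serve all of the data ourselves.
        return Decision.SHUTDOWN
    if not covers_all_data(missing, node_groups):
        # Nobody else can be serving all of the data either.
        return Decision.CONTINUE
    # The unreachable nodes could have formed a viable cluster.
    return Decision.ARBITRATE


class Arbitrator:
    """Says yes to the first requester in each election round only."""

    def __init__(self):
        self.granted_rounds = set()

    def request(self, election_round):
        if election_round in self.granted_rounds:
            return False
        self.granted_rounds.add(election_round)
        return True
```
</pre><br />With node groups {3, 4} and {5, 6}, a partition that can see only nodes 3 and 5 must arbitrate, because nodes 4 and 6 could also have regrouped into a viable cluster.  A partition that can see 3, 4 and 5 simply continues, and one that can see only 3 and 4 shuts down.<br /><br />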
The arbitrator role is lightweight as it is not involved in normal traffic.  I am surprised that the lightweight arbitrator pattern is not more common.<br /><br /><span>How does a single MySQL Cluster degrade service availability as a result of network partitions?<br /><br /></span>Where some subset of data nodes are isolated and shut-down :<br /><ul><li>Those nodes are 100% out of service, until they restart and can rejoin the cluster<br /><span>They will attempt to do so automatically</span><br /></li><li>Any clients connected <span>only</span> to those nodes are out of service<br /><span>By default clients attempt to connect to <span>all</span> data nodes, so partial connectivity issues needn't degrade client availability.</span><br /></li><li>The remaining live nodes are 100% in-service</li><li>Clients connected to the remaining live nodes are 100% in service<br /></li></ul>Where no subset of data nodes is live<br /><ul><li>All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster<br /><span>They will attempt to do so automatically.</span><br /></li></ul><br />A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions.  It does not sacrifice consistency.<br /><br /><span>How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?<br /></span><br />Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it.  Some set of clients continue to receive 100% service.  
Only <a href="http://en.wikipedia.org/wiki/Single_point_of_failure"><span>double failures</span></a> in the network can cause a network partition resulting in full service loss.<br />Furthermore, network partitions are not the only risk to availability: software errors, power failures, upgrades and overloads are other potential sources of downtime which Cluster is designed to overcome.<br /><span><br />Asynchronously replicating clusters - AP</span><br /><br />Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters.  Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not.  On the other hand, availability is not compromised by network partitioning of the two clusters.  Each cluster can continue to accept read and write requests to all of the data from any connected client.<br /><br />Eventual consistency between the clusters is possible when using conflict resolution functions such as <a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html">NDB$EPOCH_TRANS</a>, NDB$EPOCH, NDB$MAX etc.<br /><br /><span>How does consistency degrade between replicating MySQL Clusters during a network partition?<br /></span><br />This depends on the conflict resolution function chosen, and how detected conflicts are handled.  
Some details of consistency guarantees provided by NDB$EPOCH et al are described <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">here</a>.<br /><br /><span>What about normal operation?</span><br /><br />Abadi's <a href="http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html">post</a> introduced his <span>PACELC</span> acronym, standing for something like :<br /><br /><pre> if (network Partition)<br />{<br />  trade-off Availability vs Consistency;<br />}<br />else<br />{<br />  trade-off Latency vs Consistency;<br />}<br /></pre><br /><br />My first comment has to be that it's bad form to put the common case in an else branch!<br />However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition.  The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency.  In theory this normal operation <span>latency-vs-consistency </span>tradeoff could be completely independent to the Network Partitioning <span>availability-vs-consistency</span> tradeoff,  e.g. you could have any of :<br /><ol><li>  PA EL  (Partition - Availability, Else - Latency minimisation)</li><li>PA EC  (Partition - Availability, Else - Consistency)</li><li>PC EL  (Partition - Consistency, Else - Latency minimisation)</li><li>PC EC  (Partition - Consistency, Else - Consistency)<br /></li></ol><br />The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency.  Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability.  
Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.<br /><br />Examples of systems of each type :<br /><ol><li>Any eventually consistent system, especially with local-database-node updates + reads</li><li>Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)</li><li>???</li><li>Always consistent systems (e.g. single database instance, single MySQL Cluster)<br /></li></ol><br />I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition.  Maybe this category should be called 'repentant systems'?<br /><br />The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies.  It is not enough to say that things will 'eventually' be consistent.  It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.<br /><br />There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database.  The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use!  A reasonable <a href="http://albcom.lsi.upc.edu/ojs/index.php/beatcs/article/view/78">approach</a> is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs).  
These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.<br /><br />Stonebraker and others have <a href="http://cacm.acm.org/blogs/blog-cacm/83396-errors-in-database-systems-eventual-consistency-and-the-cap-theorem/fulltext">commented</a> on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented.  Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower.  This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable.  Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.<br /><br />At a wider network scale, latencies are naturally higher, implying that bandwidth is lower.  The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable.   The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable.  Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data.  
For a particular application there may or may not be a point at which an AP system would meet its requirements better.<br /><br />Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system.  Paraphrasing  John Mashey, "<span>Bandwidth improves, latency is forever</span>".  Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.<br /><br />Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available.  Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links.  As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.<br /><br /><span>Edit 7/3/12 : Fix bad link</span>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/03/07/the-cap-theorem-and-mysql-cluster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>One billion</title>
		<link>http://messagepassing.blogspot.com/2012/02/one-billion.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=one-billion</link>
		<comments>http://messagepassing.blogspot.com/2012/02/one-billion.html#comments</comments>
		<pubDate>Tue, 21 Feb 2012 01:04:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[latency-hiding]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[parallel]]></category>
		<category><![CDATA[rambling]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=d68a7eb28ad404a6a9196f2f7be0ba7c</guid>
		<description><![CDATA[As always, I am a little late, but I want to jump on the bandwagon and mention the recent MySQL Cluster milestone of passing 1 billion queries per minute.  Apart from echoing the arbitrarily large ransom demand of Dr Evil, what does this mean? Obviously 1 billion is only of interest to us humans as we generally happen to have 10 fingers, and seem to name multiples in steps of 10^3 for some reason. Each processor involved in this benchmark is clocked at several billion cycles per second, so a single billion is not so vast or fast. Measuring over a minute also feels unnatural for a computer performance benchmark - we are used to lots of things happening every second!  A minute is a long time in silicon. What's more, these reads are served from tables stored entirely in memory - and everyone knows that main memory is infinitely fast and scalable and always getting cheaper, right? If we convert to seconds we are left with only 17 million reads per second!  Hardly worth getting out of bed for? On the contrary, I think that achieving 17 million independent random reads per second, each read returning 100 bytes across a network, from a database that also supports arbitrary SQL, row locking, transactions, high availability and all sorts of other stuff, is pretty cool.  I doubt that (m)any other similar databases can match this raw performance, though I look forward to being proved wrong. (Also, don't forget to meet + beat 1.9 million random updates/s, synchronously replicated) Raw performance is good, but not everyone just needs horsepower.  The parallel, independent work on improving join performance (also known as SPJ/AQL) and query optimisation helps more applications harness this power, by improving the efficiency of joins. I wrote a post about SPJ/AQL at the start of last year, when it was still in the early stages.  
Since then much has improved, to the extent that the performance improvement factors have become embarrassingly high on real user queries.  A further post on the technical details of SPJ/AQL is long overdue...  Perhaps the most interesting details are on the integration between the parallel, streaming linked operations and the essentially serialised MySQL Nested Loops join executor.  A linked scan and lookup operation can be considered to be a form of parallel hash join, which the normal MySQL NLJ executor can invoke as part of executing a query.  Who says Nested Loop joins can't scale?]]></description>
			<content:encoded><![CDATA[As always, I am a little late, but I want to jump on the bandwagon and mention the recent MySQL Cluster milestone of passing 1 <a href="http://mikaelronstrom.blogspot.com/2012/02/105bn-qpm-using-mysql-cluster-72.html">billion</a> queries per minute.  Apart from echoing the arbitrarily large ransom demand of Dr Evil, what does this mean? <br /><br />Obviously 1 billion is only of interest to us humans as we generally happen to have 10 fingers, and seem to name multiples in steps of 10^3 for some <a href="http://en.wikipedia.org/wiki/Short_scale">reason</a>. Each processor involved in this benchmark is clocked at several billion cycles per second, so a single billion is not so vast or fast.<br /><br />Measuring over a minute also feels unnatural for a computer performance benchmark - we are used to lots of things happening every second!  A minute is a long time in silicon.<br /><br />What's more, these reads are served from tables stored entirely in memory - and everyone knows that main memory is infinitely fast and scalable and always getting cheaper, right?<br /><br />If we convert to seconds we are left with only 17 million reads per second!  Hardly worth getting out of bed for?<br /><br />On the contrary, I think that achieving 17 million independent random reads per second, each read returning 100 bytes across a network, from a database that also supports arbitrary SQL, row locking, transactions, high availability and all sorts of other stuff, is pretty cool.  I doubt that (m)any other similar databases can match this raw performance, though I look forward to being proved wrong.<br /><br />(Also, don't forget to meet + beat 1.9 million random updates/s, synchronously replicated)<br /><br />Raw performance is good, but not everyone just needs horsepower.  
The parallel, independent work on improving join performance (also known as SPJ/AQL) and query optimisation helps more applications harness this power, by improving the efficiency of joins. <br /><br />I wrote a <a href="http://messagepassing.blogspot.com/2011/01/low-latency-distributed-parallel-joins.html">post</a> about SPJ/AQL at the start of last year, when it was still in the early stages.  Since then much has improved, to the extent that the performance improvement factors have become embarrassingly high on real user queries.  A further post on the technical details of SPJ/AQL is long overdue...  Perhaps the most interesting details are on the integration between the parallel, streaming linked operations and the essentially serialised MySQL <a href="http://en.wikipedia.org/wiki/Nested_loop_join">Nested Loops</a> join executor.  A linked scan and lookup operation can be considered to be a form of parallel <a href="http://en.wikipedia.org/wiki/Hash_join">hash join</a>, which the normal MySQL NLJ executor can invoke as part of executing a query.  Who says Nested Loop joins can't scale?]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/02/21/one-billion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Surprises in store with ndb_restore</title>
		<link>http://www.skysql.com/blogs/kolbe/surprises-store-ndb-restore?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=surprises-in-store-with-ndb_restore</link>
		<comments>http://www.skysql.com/blogs/kolbe/surprises-store-ndb-restore#comments</comments>
		<pubDate>Thu, 16 Feb 2012 18:49:58 +0000</pubDate>
		<dc:creator>SkySQL</dc:creator>
				<category><![CDATA[backup]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[gotcha]]></category>
		<category><![CDATA[ndb]]></category>
		<category><![CDATA[ndb_restore]]></category>
		<category><![CDATA[restore]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=4e0a0e543b5179fa0bbeb90419ee9511</guid>
		<description><![CDATA[While doing some routine fiddling regarding some topic I've now forgotten, I discovered that ndb_restore was doing something quite surprising. It's been common wisdom for some time that one can use ndb_restore -m to restore metadata into a new cluster and automatically have your data re-partitioned across the data nodes in the destination cluster. In fact, this was the recommended procedure for adding nodes to a cluster before online add node came along. Since MySQL Cluster 7.0, though, ndb_restore hasn't behaved that way. The change in behavior doesn't seem to be documented, and most people don't know that it ever took place.
I'll go through some of the methods you can use to find information about the partitioning strategy for an NDB table, talk a bit about why ndb_restore stopped working the way most everyone expected (and still expects) it to, and discuss some possible alternatives and workarounds. 
Let's start out with an example of how ndb_restore worked in the pre-7.0 days. I'm going to create a 2-node cluster, create a table, put some rows in it, look at the partitioning strategy for that table, then take a backup and shut down my cluster.

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ cat ~/cluster_2.ini 
[ndb_mgmd]
Hostname=127.0.0.1
Datadir=/home/ndb/cluster-data
NodeId=1

[ndbd default]
#MaxNoOfExecutionThreads=4
Datadir=/home/ndb/cluster-data
NoOfReplicas=2
Hostname=127.0.0.1

[ndbd]
NodeId=3
[ndbd]
NodeId=4

[mysqld]
NodeId=11

[mysqld]
NodeId=12

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgmd -f ~/cluster_2.ini  
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndbd --initial;./bin/ndbd --initial;
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.56 ndb-6.3.45)
Node 4: started (mysql-5.1.56 ndb-6.3.45)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysqld_safe &#38;
[1] 2489
120215 20:10:49 mysqld_safe Logging to '/home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data/ip-10-0-0-59.err'.
120215 20:10:49 mysqld_safe Starting mysqld daemon with databases from /home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.56-ndb-6.3.45-cluster-gpl MySQL Cluster Server (GPL)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]&#62; create table c1 (id int) engine=ndb;
Query OK, 0 rows affected (0.12 sec)

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]&#62; INSERT INTO c1 (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),
(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),
(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),
(91),(92),(93),(94),(95),(96),(97),(98),(99),(100);
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]&#62; Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 206
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'start backup'
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 1 started from node 1
Node 3: Backup 1 started from node 1 completed
 StartGCP: 88 StopGCP: 91
 #Records: 2156 #LogRecords: 0
 Data: 53208 bytes Log: 0 bytes
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ 
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysqladmin shutdown
120215 20:13:45 mysqld_safe mysqld from pid file /home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data/ip-10-0-0-59.pid ended
[1]+  Done                    ./bin/mysqld_safe
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e shutdown
Connected to Management Server at: localhost:1186
2 NDB Cluster node(s) have shutdown.
Disconnecting to allow management server to shutdown.
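
The "Per partition info" section of the ndb_desc output above is easy to pull apart with a script when you want to compare layouts before and after a restore. Here is a minimal Python sketch. Note that parse_partitions is a hypothetical helper of my own, not part of MySQL Cluster, and it assumes the column layout shown above: partition id first, row count second, node list last.

```python
def parse_partitions(ndb_desc_output):
    """Return (partition, row_count, nodes) tuples from ndb_desc -p text."""
    rows = []
    in_section = False
    for line in ndb_desc_output.splitlines():
        if line.startswith("-- Per partition info --"):
            in_section = True
            continue
        if not in_section:
            continue
        fields = line.split()
        # Data rows start with a numeric partition id; the header row,
        # blank lines and the NDBT_ProgramExit line do not.
        if not fields or not fields[0].isdigit():
            continue
        nodes = [int(n) for n in fields[-1].split(",")]
        rows.append((int(fields[0]), int(fields[1]), nodes))
    return rows


# The two data rows are copied from the 2-node output above.
SAMPLE = """\
-- Per partition info --
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3

NDBT_ProgramExit: 0 - OK
"""

for partition, row_count, nodes in parse_partitions(SAMPLE):
    print(partition, row_count, nodes)
```

For the 2-node table above this gives two partitions holding 56 and 44 rows, 100 in total, each stored on nodes 3 and 4.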

So, there we've created a 2-node cluster, created a table and put a few rows in it, created an NDB native backup, and then shut the cluster down. Now, we'll create a 4-node cluster, restore the backup, and see what our table looks like.

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ rm ./data/test/*
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ cat ~/cluster_4.ini 
[ndb_mgmd]
Hostname=127.0.0.1
Datadir=/home/ndb/cluster-data
NodeId=1

[ndbd default]
#MaxNoOfExecutionThreads=4
Datadir=/home/ndb/cluster-data
NoOfReplicas=2
Hostname=127.0.0.1

[ndbd]
NodeId=3
[ndbd]
NodeId=4

[ndbd]
NodeId=5
[ndbd]
NodeId=6

[mysqld]
NodeId=11

[mysqld]
NodeId=12
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgmd -f ~/cluster_4.ini  
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.56 ndb-6.3.45)
Node 4: started (mysql-5.1.56 ndb-6.3.45)
Node 5: started (mysql-5.1.56 ndb-6.3.45)
Node 6: started (mysql-5.1.56 ndb-6.3.45)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_restore -b 1 -r -n 3 -m ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 3
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.ctl'
Backup version in files: ndb-6.3.11 ndb version: mysql-5.1.56 ndb-6.3.45
Connected to ndb!!
Successfully restored table `mysql/def/ndb_apply_status`
Successfully restored table event REPL$mysql/ndb_apply_status
Successfully restored table `test/def/c1`
Successfully restored table event REPL$test/c1
Successfully restored table `mysql/def/ndb_schema`
Successfully restored table event REPL$mysql/ndb_schema
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.3.Data'
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(1) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(4) fragment 0
_____________________________________________________
Processing data in table: test/def/c1(5) fragment 0
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_2_3(3) fragment 0
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(0) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(2) fragment 0
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.log'
Restored 56 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_restore -b 1 -r -n 4 ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 4
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.ctl'
Backup version in files: ndb-6.3.11 ndb version: mysql-5.1.56 ndb-6.3.45
Connected to ndb!!
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.4.Data'
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(1) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(4) fragment 1
_____________________________________________________
Processing data in table: test/def/c1(5) fragment 1
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_2_3(3) fragment 1
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(0) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(2) fragment 1
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.log'
Restored 44 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 206
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               26              26              32768                   0                       0               0                       3,4
1               24              24              32768                   0                       0               0                       5,6
3               20              20              32768                   0                       0               0                       6,5
2               30              30              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

Alright! We created a new cluster with 4 data nodes, restored the backup into the cluster, and confirmed with ndb_desc that the data was automatically re-partitioned to give the table a number of partitions equal to the number of data nodes in the cluster. Why is that important? This way, each data node can be primary for one partition.
You can see in the Nodes column on the very right-hand side of the Per partition info section which nodes hold each partition. The left-most node listed in that column for a given partition is the primary for that partition; any other nodes listed hold secondary replicas for that partition.
When the cluster is handling a request, data is only retrieved from the primary replica. If we had 4 data nodes but only 2 partitions, that would mean that half of our nodes were not primary for any partition, which means that they would never be responsible for sending any data to API/MySQL nodes. Clearly, that is not the best solution in terms of spreading load across the data nodes.
Unfortunately, that is exactly the behavior you get with this same operation starting with MySQL Cluster 7.0.
Here's a demo identical to the one above, but using MySQL Cluster 7.2.4:

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgmd -f ~/cluster_2.ini --config-dir=/home/ndb/cluster-config/ --initial
MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndbd --initial;./bin/ndbd --initial
2012-02-15 20:29:17 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:29:17 [ndbd] INFO     -- Angel allocated nodeid: 3
2012-02-15 20:29:17 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:29:17 [ndbd] INFO     -- Angel allocated nodeid: 4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.5.19 ndb-7.2.4)
Node 4: started (mysql-5.5.19 ndb-7.2.4)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysqld_safe &#38;
[1] 3079
120215 20:29:35 mysqld_safe Logging to '/home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data/ip-10-0-0-59.err'.
120215 20:29:35 mysqld_safe Starting mysqld daemon with databases from /home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.5.19-ndb-7.2.4-gpl MySQL Cluster Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; create table c1 (id int) engine=ndb;
Query OK, 0 rows affected (0.17 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; INSERT INTO c1 (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),
(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),
(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),
(91),(92),(93),(94),(95),(96),(97),(98),(99),(100);
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 2
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'start backup'
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 1 started from node 1
Node 3: Backup 1 started from node 1 completed
 StartGCP: 25 StopGCP: 28
 #Records: 2157 #LogRecords: 0
 Data: 53592 bytes Log: 0 bytes
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysqladmin shutdown
120215 20:30:15 mysqld_safe mysqld from pid file /home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data/ip-10-0-0-59.pid ended
[1]+  Done                    ./bin/mysqld_safe
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e shutdown
Connected to Management Server at: localhost:1186
3 NDB Cluster node(s) have shutdown.
Disconnecting to allow management server to shutdown.

OK, everything there looks about the same as before. We created the same table, inserted the same rows, and we have the same number of partitions that we did after the first half of the exercise on MySQL Cluster 6.3.45. Now, let's try the restore.


[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ rm ./data/test/*
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgmd -f ~/cluster_4.ini --config-dir=/home/ndb/cluster-config/ --initial
MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;
2012-02-15 20:32:43 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:43 [ndbd] INFO     -- Angel allocated nodeid: 3
2012-02-15 20:32:43 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:43 [ndbd] INFO     -- Angel allocated nodeid: 4
2012-02-15 20:32:44 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:44 [ndbd] INFO     -- Angel allocated nodeid: 5
2012-02-15 20:32:44 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:44 [ndbd] INFO     -- Angel allocated nodeid: 6
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.5.19 ndb-7.2.4)
Node 4: started (mysql-5.5.19 ndb-7.2.4)
Node 5: started (mysql-5.5.19 ndb-7.2.4)
Node 6: started (mysql-5.5.19 ndb-7.2.4)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_restore -b 1 -r -n 3 -m ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 3
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.ctl'
File size 14088 bytes
Backup version in files: ndb-6.3.11 ndb version: mysql-5.5.19 ndb-7.2.4
Stop GCP of Backup: 27
Connected to ndb!!
Created hashmap: DEFAULT-HASHMAP-240-2
Successfully restored table `mysql/def/ndb_apply_status`
Successfully restored table event REPL$mysql/ndb_apply_status
Successfully restored table `test/def/c1`
Successfully restored table event REPL$test/c1
Successfully restored table `mysql/def/ndb_schema`
Successfully restored table event REPL$mysql/ndb_schema
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.3.Data'
File size 27448 bytes
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_7_3(8) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_sample(5) fragment 0
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(3) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(9) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_head(4) fragment 0
_____________________________________________________
Processing data in table: test/def/c1(10) fragment 0
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(2) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(7) fragment 0
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.log'
File size 52 bytes
Restored 56 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_restore -b 1 -r -n 4 ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 4
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.ctl'
File size 14088 bytes
Backup version in files: ndb-6.3.11 ndb version: mysql-5.5.19 ndb-7.2.4
Stop GCP of Backup: 27
Connected to ndb!!
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.4.Data'
File size 26688 bytes
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_7_3(8) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_sample(5) fragment 1
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(3) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(9) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_head(4) fragment 1
_____________________________________________________
Processing data in table: test/def/c1(10) fragment 1
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(2) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(7) fragment 1
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.log'
File size 52 bytes
Restored 44 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 2
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       5,6


NDBT_ProgramExit: 0 - OK


Uh oh, this didn't turn out quite the same as the example from MySQL Cluster 6.3.45. There are still only 2 partitions after the restore, even though there are 4 data nodes. Take a look at the Nodes column on the right of "Per partition info" and you can see, in fact, that the 2 partitions are actually on separate node groups. That's sort of interesting. It means that writes are still going to be scaled across all node groups, which is great, but it means that reads will not be scaled. All reads will have to come from nodes 3 and 5, because those nodes are the primaries for their respective partitions.
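To make the primary/secondary point concrete, here is a minimal Python sketch (not part of the original demo). It takes the Nodes column from the two post-restore ndb_desc outputs above, treats the left-most node as the primary, and lists the data nodes that are primary for no partition. The function names are made up for illustration.

```python
def primary_nodes(nodes_column):
    """Map each partition to its primary: the left-most node listed."""
    return {part: nodes[0] for part, nodes in nodes_column.items()}


def idle_nodes(nodes_column):
    """Data nodes that hold replicas but are primary for no partition."""
    every = {n for nodes in nodes_column.values() for n in nodes}
    return sorted(every - set(primary_nodes(nodes_column).values()))


# Nodes column after the restore into 4 data nodes on 6.3.45
after_63 = {0: [3, 4], 1: [5, 6], 2: [4, 3], 3: [6, 5]}
# Nodes column after the same restore on 7.2.4
after_72 = {0: [3, 4], 1: [5, 6]}

print(idle_nodes(after_63))  # []
print(idle_nodes(after_72))  # [4, 6]
```

On 7.2.4, nodes 4 and 6 only ever hold secondary replicas of this table, which is why reads come from nodes 3 and 5 alone.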
So, why did this change happen? I don't think anyone consciously decided to make it; instead, I think it's a side effect of the new HashMap partitioning algorithm that was introduced and made the default in MySQL Cluster 7.0. Frazer Clement provides an exceptional discussion of the HashMap algorithm at http://messagepassing.blogspot.com/2011/03/mysql-cluster-online-scaling.....
It appears that the HashMap is stored as part of the schema data for the table; when the table metadata is restored with ndb_restore -m, the same HashMap is used. MySQL Cluster distributes the partitions across all the node groups in the destination cluster, but it does not change the number of partitions. (As a result, if you had a 6-node cluster with NoOfReplicas=2, which means 3 node groups, one node group would not hold any partitions for this table, because there are only 2 partitions.)
Now we've seen how ndb_restore works starting with MySQL Cluster 7.0, and the results are not very desirable. What, then, can be done to get your table distributed across all nodes and node groups so that each data node in the cluster is primary for one partition? There are a couple of options.
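As a rough illustration of why a restored HashMap pins the partition count, here is a toy Python model. It is a sketch under assumptions, not NDB's actual code: the map size of 240 is read off the name DEFAULT-HASHMAP-240-2 in the restore output, the round-robin bucket layout is assumed, and MD5 over the key's string form is only a stand-in for the real hash of the distribution key.

```python
import hashlib

MAP_SIZE = 240  # assumed from the name DEFAULT-HASHMAP-240-2


def build_hashmap(partitions, size=MAP_SIZE):
    """Assign buckets to partitions round-robin (assumed layout)."""
    return [bucket % partitions for bucket in range(size)]


def partition_for(key, hashmap):
    """Hash the key to a bucket, then look the bucket up in the map."""
    digest = hashlib.md5(str(key).encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(hashmap)
    return hashmap[bucket]


def count_rows(keys, hashmap):
    """Count how many keys land in each partition."""
    counts = {}
    for key in keys:
        part = partition_for(key, hashmap)
        counts[part] = counts.get(part, 0) + 1
    return counts


keys = range(1, 101)
two = build_hashmap(2)   # the map carried over from the 2-node cluster
four = build_hashmap(4)  # what a map built for 4 partitions would look like

print(count_rows(keys, two))
print(count_rows(keys, four))

# Every (old partition, new partition) pair that occurs for some key
moves = {(partition_for(k, two), partition_for(k, four)) for k in keys}
print(sorted(moves))
```

In this model the number of partitions is a property of the map, not of the cluster, so reusing the 2-partition map yields 2 partitions no matter how many data nodes exist. Switching maps only ever moves a row from partition 0 to 2 or from 1 to 3, never between the original partitions.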
Part of the reason HashMap was put into place was to make it easier to redistribute data in the cluster in order to support online add node functionality. When using online add node, you execute an ALTER TABLE ... REORGANIZE PARTITION statement after creating the new node group(s) and starting the new data nodes. We can do the same, here, to reorganize the partitions of our table across all nodes in the cluster:

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; select partition_name, table_rows from information_schema.partitions where table_schema='test' and table_name='c1';
+----------------+------------+
&#124; partition_name &#124; table_rows &#124;
+----------------+------------+
&#124; p0             &#124;         56 &#124;
&#124; p1             &#124;         44 &#124;
+----------------+------------+
2 rows in set (0.00 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; alter table c1 reorganize partition;
Query OK, 0 rows affected (7.46 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; select partition_name, table_rows from information_schema.partitions where table_schema='test' and table_name='c1';
+----------------+------------+
&#124; partition_name &#124; table_rows &#124;
+----------------+------------+
&#124; p0             &#124;         26 &#124;
&#124; p1             &#124;         24 &#124;
&#124; p2             &#124;         30 &#124;
&#124; p3             &#124;         20 &#124;
+----------------+------------+
4 rows in set (0.02 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&#62; Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 16777217
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 4
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               26              116             32768                   0                       0               0                       3,4
2               30              30              32768                   0                       0               0                       4,3
1               24              84              32768                   0                       0               0                       5,6
3               20              20              32768                   0                       0               0                       6,5


NDBT_ProgramExit: 0 - OK

That's a pretty easy way to re-partition a table across your data nodes. However, keep in mind that you'd need to do this for every table in the cluster. It's fairly easy to do that programmatically by checking the number of partitions for a given table in information_schema.partitions and executing ALTER TABLE ... REORGANIZE PARTITION for any of them that have fewer partitions than the number of rows in ndbinfo.nodes (one row per data node). Still, though, I don't find that to be terribly appealing. There are also a couple of big caveats for ALTER TABLE ... REORGANIZE PARTITION: it doesn't re-partition UNIQUE indexes or BLOBs. The first of those may not be such a big problem, because UNIQUE indexes (implemented in MySQL Cluster as a separate, hidden table) are not likely to be so large that scaling reads or spreading the data across additional node groups becomes important. BLOBs, on the other hand, (also implemented in MySQL Cluster as a separate, hidden table) can take up a lot of space, so having them relegated to only some nodes in the cluster might mean that those nodes would use considerably more DataMemory than other nodes.
Another solution, if ALTER TABLE ... REORGANIZE PARTITION doesn't strike your fancy, is to use mysqldump --no-data to back up and restore your schema instead of relying on ndb_restore -m. You'd still use ndb_restore to restore data, but you'd get the schema from mysqldump. When you execute the CREATE TABLE statements output by mysqldump, MySQL Cluster sees them as brand new tables and thus partitions them across all data nodes in the cluster, as would be the case for any new table created on the cluster.
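The programmatic check described above could be sketched roughly as follows. This is a minimal Python sketch, not a tested tool: the two SQL strings are my guess at suitable queries, the function name is hypothetical, the table test.c2 is invented for the example, and running the queries through a database connection is left out.

```python
# Queries you might run first (assumed shape, not taken from the demo):
PARTITION_COUNT_SQL = (
    "SELECT p.table_schema, p.table_name, COUNT(*) "
    "FROM information_schema.partitions p "
    "JOIN information_schema.tables t USING (table_schema, table_name) "
    "WHERE t.engine = 'ndbcluster' "
    "GROUP BY p.table_schema, p.table_name"
)
DATA_NODE_COUNT_SQL = "SELECT COUNT(*) FROM ndbinfo.nodes"


def reorganize_statements(partition_counts, data_nodes):
    """Build ALTER TABLE statements for tables with too few partitions.

    partition_counts maps (schema, table) to its number of partitions.
    """
    statements = []
    for (schema, table), parts in sorted(partition_counts.items()):
        if parts < data_nodes:
            statements.append(
                "ALTER TABLE `%s`.`%s` REORGANIZE PARTITION" % (schema, table)
            )
    return statements


# c1 is the restored table from the demo; c2 is a made-up table that
# already has one partition per data node.
counts = {("test", "c1"): 2, ("test", "c2"): 4}
for statement in reorganize_statements(counts, 4):
    print(statement)
```

Only test.c1 gets a statement here, because test.c2 already has as many partitions as there are data nodes.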
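The order of operations for that approach could look something like this. It is a sketch only: the helper name and the schema.sql file name are hypothetical, the --routines and --databases options are my additions to the mysqldump --no-data call, and the commands are built as strings rather than executed.

```python
def restore_plan(backup_id, node_ids, backup_path, schema_file="schema.sql"):
    """Return the shell commands for a schema-from-mysqldump restore."""
    # Recreate the tables as brand new tables on the destination cluster.
    plan = ["mysql < %s" % schema_file]
    for node_id in node_ids:
        # No -m here: the tables already exist, so only data is restored.
        plan.append(
            "ndb_restore -b %d -n %d -r %s" % (backup_id, node_id, backup_path)
        )
    return plan


# Taken on the source cluster, alongside the NDB native backup:
dump_command = "mysqldump --no-data --routines --databases test > schema.sql"

print(dump_command)
for command in restore_plan(1, [3, 4], "~/cluster-data/BACKUP/BACKUP-1/"):
    print(command)
```

The node IDs passed in are those of the original cluster's data nodes (3 and 4 in the demo), since each ndb_restore run reads one node's backup files.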
Using mysqldump has the advantage of backing up triggers and stored routines, which you won't get if you use ndb_restore -m. If you are using those features, this is very important, of course; if you're not using them, there isn't a lot of practical value gained by using mysqldump. In fact, it means that you add an extra step for backup, and you add an extra step for restore. On top of that, you get no guarantee of consistency. Some DDL could be executed between the time that you run mysqldump and the time you start your NDB native backup. That means that there is no guarantee that the table structure in one part of your backup matches the structure of the data in the other part. That's a little bit scary, and it can only be worked around safely by essentially taking the cluster offline (single user mode) when executing a backup.
My hope is that the original (and still widely expected) behavior of ndb_restore will be ... restored. I've opened bug #64302 to track the issue. Let me know your thoughts here, and let the MySQL Cluster developers know your thoughts on the bug report.]]></description>
			<content:encoded><![CDATA[<p> While doing some routine fiddling on a topic I've now forgotten, I discovered that <code>ndb_restore</code> was doing something quite surprising. It's been common wisdom for some time that one can use <code>ndb_restore -m</code> to restore metadata into a new cluster and automatically have your data re-partitioned across the data nodes in the destination cluster. In fact, this was the recommended procedure for adding nodes to a cluster before online add node came along. Since MySQL Cluster 7.0, though, <code>ndb_restore</code> hasn't behaved that way; the change doesn't seem to be documented, and most people don't know that it ever took place.</p>
<p>I'll go through some of the methods you can use to find information about the partitioning strategy for an NDB table, talk a bit about why <code>ndb_restore</code> stopped working the way almost everyone expected (and still expects) it to, and discuss some possible alternatives and workarounds. </p>
<p>Let's start out with an example of how <code>ndb_restore</code> worked in the pre-7.0 days. I'm going to create a 2-node cluster, create a table, put some rows in it, look at the partitioning strategy for that table, then take a backup and shut down my cluster.</p>
<pre>
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ cat ~/cluster_2.ini 
[ndb_mgmd]
Hostname=127.0.0.1
Datadir=/home/ndb/cluster-data
NodeId=1

[ndbd default]
#MaxNoOfExecutionThreads=4
Datadir=/home/ndb/cluster-data
NoOfReplicas=2
Hostname=127.0.0.1

[ndbd]
NodeId=3
[ndbd]
NodeId=4

[mysqld]
NodeId=11

[mysqld]
NodeId=12

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgmd -f ~/cluster_2.ini  
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndbd --initial;./bin/ndbd --initial;
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.56 ndb-6.3.45)
Node 4: started (mysql-5.1.56 ndb-6.3.45)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysqld_safe &amp;
[1] 2489
120215 20:10:49 mysqld_safe Logging to '/home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data/ip-10-0-0-59.err'.
120215 20:10:49 mysqld_safe Starting mysqld daemon with databases from /home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.1.56-ndb-6.3.45-cluster-gpl MySQL Cluster Server (GPL)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]&gt; create table c1 (id int) engine=ndb;
Query OK, 0 rows affected (0.12 sec)

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]&gt; INSERT INTO c1 (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),
(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),
(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),
(91),(92),(93),(94),(95),(96),(97),(98),(99),(100);
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

mysql 5.1.56-ndb-6.3.45-cluster-gpl (root) [test]&gt; Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 206
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'start backup'
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 1 started from node 1
Node 3: Backup 1 started from node 1 completed
 StartGCP: 88 StopGCP: 91
 #Records: 2156 #LogRecords: 0
 Data: 53208 bytes Log: 0 bytes
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ 
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/mysqladmin shutdown
120215 20:13:45 mysqld_safe mysqld from pid file /home/ndb/mysql/mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23/data/ip-10-0-0-59.pid ended
[1]+  Done                    ./bin/mysqld_safe
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e shutdown
Connected to Management Server at: localhost:1186
2 NDB Cluster node(s) have shutdown.
Disconnecting to allow management server to shutdown.
</pre><p>
So, there we've created a 2-node cluster, created a table and put a few rows in it, created an NDB native backup, and then shut the cluster down. Now, we'll create a 4-node cluster, restore the backup, and see what our table looks like.</p>
<pre>
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ rm ./data/test/*
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ cat ~/cluster_4.ini 
[ndb_mgmd]
Hostname=127.0.0.1
Datadir=/home/ndb/cluster-data
NodeId=1

[ndbd default]
#MaxNoOfExecutionThreads=4
Datadir=/home/ndb/cluster-data
NoOfReplicas=2
Hostname=127.0.0.1

[ndbd]
NodeId=3
[ndbd]
NodeId=4

[ndbd]
NodeId=5
[ndbd]
NodeId=6

[mysqld]
NodeId=11

[mysqld]
NodeId=12
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgmd -f ~/cluster_4.ini  
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;
[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.1.56 ndb-6.3.45)
Node 4: started (mysql-5.1.56 ndb-6.3.45)
Node 5: started (mysql-5.1.56 ndb-6.3.45)
Node 6: started (mysql-5.1.56 ndb-6.3.45)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_restore -b 1 -r -n 3 -m ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 3
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.ctl'
Backup version in files: ndb-6.3.11 ndb version: mysql-5.1.56 ndb-6.3.45
Connected to ndb!!
Successfully restored table `mysql/def/ndb_apply_status`
Successfully restored table event REPL$mysql/ndb_apply_status
Successfully restored table `test/def/c1`
Successfully restored table event REPL$test/c1
Successfully restored table `mysql/def/ndb_schema`
Successfully restored table event REPL$mysql/ndb_schema
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.3.Data'
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(1) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(4) fragment 0
_____________________________________________________
Processing data in table: test/def/c1(5) fragment 0
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_2_3(3) fragment 0
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(0) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(2) fragment 0
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.log'
Restored 56 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_restore -b 1 -r -n 4 ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 4
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.ctl'
Backup version in files: ndb-6.3.11 ndb version: mysql-5.1.56 ndb-6.3.45
Connected to ndb!!
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.4.Data'
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(1) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(4) fragment 1
_____________________________________________________
Processing data in table: test/def/c1(5) fragment 1
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_2_3(3) fragment 1
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(0) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(2) fragment 1
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.log'
Restored 44 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-6.3.45-linux-x86_64-glibc23]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 5
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 206
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               26              26              32768                   0                       0               0                       3,4
1               24              24              32768                   0                       0               0                       5,6
3               20              20              32768                   0                       0               0                       6,5
2               30              30              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK
</pre><p>
Alright! We created a new cluster with 4 data nodes, restored the backup into the cluster, and confirmed with <code>ndb_desc</code> that the data was automatically re-partitioned to give the table a number of partitions equal to the number of data nodes in the cluster. Why is that important? This way, each data node can be primary for one partition.</p>
<p>The Nodes column, at the far right of the Per partition info section, shows which nodes hold each partition. The left-most node listed in that column for a given partition is the primary for that partition; any other nodes listed hold secondary replicas for that partition.</p>
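<p>As a rough illustration of reading that column, here is a minimal Python sketch that takes the four "Per partition info" rows from the <code>ndb_desc</code> output above and picks out the primary node for each partition. The function name <code>primaries</code> is hypothetical, and the parsing assumes the whitespace-separated row layout shown in this post; it is not part of any MySQL Cluster tool.</p>

```python
# Sketch only: derive each partition's primary node from the
# "Per partition info" rows printed by ndb_desc -pn.
# The rows are copied from the 6.3.45 output above; the helper name
# is hypothetical.

ROWS = """\
0 26 26 32768 0 0 0 3,4
1 24 24 32768 0 0 0 5,6
3 20 20 32768 0 0 0 6,5
2 30 30 32768 0 0 0 4,3
"""


def primaries(rows):
    """Map partition id to its primary node (left-most entry in Nodes)."""
    result = {}
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue
        partition = int(fields[0])
        nodes = [int(n) for n in fields[-1].split(",")]
        result[partition] = nodes[0]
    return result


primary_by_partition = primaries(ROWS)
print(primary_by_partition)                         # {0: 3, 1: 5, 3: 6, 2: 4}
print(sorted(set(primary_by_partition.values())))   # [3, 4, 5, 6]
```

<p>With four partitions on four data nodes, every node shows up as the primary for exactly one partition.</p>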
<p>When the cluster is handling a request, data is only retrieved from the primary replica. If we had 4 data nodes but only 2 partitions, that would mean that half of our nodes were not primary for any partition, which means that they would never be responsible for sending any data to API/MySQL nodes. Clearly, that is not the best solution in terms of spreading load across the data nodes.</p>
<p>Unfortunately, that is exactly the behavior you get with this same operation starting with MySQL Cluster 7.0.</p>
<p>Here's a demo identical to the one above, but using MySQL Cluster 7.2.4:</p>
<pre>
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgmd -f ~/cluster_2.ini --config-dir=/home/ndb/cluster-config/ --initial
MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndbd --initial;./bin/ndbd --initial
2012-02-15 20:29:17 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:29:17 [ndbd] INFO     -- Angel allocated nodeid: 3
2012-02-15 20:29:17 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:29:17 [ndbd] INFO     -- Angel allocated nodeid: 4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.5.19 ndb-7.2.4)
Node 4: started (mysql-5.5.19 ndb-7.2.4)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysqld_safe &amp;
[1] 3079
120215 20:29:35 mysqld_safe Logging to '/home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data/ip-10-0-0-59.err'.
120215 20:29:35 mysqld_safe Starting mysqld daemon with databases from /home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.5.19-ndb-7.2.4-gpl MySQL Cluster Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; create table c1 (id int) engine=ndb;
Query OK, 0 rows affected (0.17 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; INSERT INTO c1 (id) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),
(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),
(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),
(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),
(91),(92),(93),(94),(95),(96),(97),(98),(99),(100);
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 2
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       4,3


NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'start backup'
Connected to Management Server at: localhost:1186
Waiting for completed, this may take several minutes
Node 3: Backup 1 started from node 1
Node 3: Backup 1 started from node 1 completed
 StartGCP: 25 StopGCP: 28
 #Records: 2157 #LogRecords: 0
 Data: 53592 bytes Log: 0 bytes
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/mysqladmin shutdown
120215 20:30:15 mysqld_safe mysqld from pid file /home/ndb/mysql/mysql-cluster-gpl-7.2.4-linux2.6-x86_64/data/ip-10-0-0-59.pid ended
[1]+  Done                    ./bin/mysqld_safe
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e shutdown
Connected to Management Server at: localhost:1186
3 NDB Cluster node(s) have shutdown.
Disconnecting to allow management server to shutdown.
</pre><p>
OK, everything there looks about the same as before. We created the same table, inserted the same rows, and we have the same number of partitions that we did after the first half of the exercise on MySQL Cluster 6.3.45. Now, let's try the restore.</p>
<pre>

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ rm ./data/test/*
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgmd -f ~/cluster_4.ini --config-dir=/home/ndb/cluster-config/ --initial
MySQL Cluster Management Server mysql-5.5.19 ndb-7.2.4
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;./bin/ndbd --initial;
2012-02-15 20:32:43 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:43 [ndbd] INFO     -- Angel allocated nodeid: 3
2012-02-15 20:32:43 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:43 [ndbd] INFO     -- Angel allocated nodeid: 4
2012-02-15 20:32:44 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:44 [ndbd] INFO     -- Angel allocated nodeid: 5
2012-02-15 20:32:44 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2012-02-15 20:32:44 [ndbd] INFO     -- Angel allocated nodeid: 6
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_mgm -e 'all status'
Connected to Management Server at: localhost:1186
Node 3: started (mysql-5.5.19 ndb-7.2.4)
Node 4: started (mysql-5.5.19 ndb-7.2.4)
Node 5: started (mysql-5.5.19 ndb-7.2.4)
Node 6: started (mysql-5.5.19 ndb-7.2.4)

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_restore -b 1 -r -n 3 -m ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 3
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.ctl'
File size 14088 bytes
Backup version in files: ndb-6.3.11 ndb version: mysql-5.5.19 ndb-7.2.4
Stop GCP of Backup: 27
Connected to ndb!!
Created hashmap: DEFAULT-HASHMAP-240-2
Successfully restored table `mysql/def/ndb_apply_status`
Successfully restored table event REPL$mysql/ndb_apply_status
Successfully restored table `test/def/c1`
Successfully restored table event REPL$test/c1
Successfully restored table `mysql/def/ndb_schema`
Successfully restored table event REPL$mysql/ndb_schema
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.3.Data'
File size 27448 bytes
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_7_3(8) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_sample(5) fragment 0
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(3) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(9) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_head(4) fragment 0
_____________________________________________________
Processing data in table: test/def/c1(10) fragment 0
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(2) fragment 0
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(7) fragment 0
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.3.log'
File size 52 bytes
Restored 56 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_restore -b 1 -r -n 4 ~/cluster-data/BACKUP/BACKUP-1/
Backup Id = 1
Nodeid = 4
backup path = /home/ndb/cluster-data/BACKUP/BACKUP-1/
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.ctl'
File size 14088 bytes
Backup version in files: ndb-6.3.11 ndb version: mysql-5.5.19 ndb-7.2.4
Stop GCP of Backup: 27
Connected to ndb!!
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1-0.4.Data'
File size 26688 bytes
_____________________________________________________
Processing data in table: mysql/def/NDB$BLOB_7_3(8) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_sample(5) fragment 1
_____________________________________________________
Processing data in table: sys/def/NDB$EVENTS_0(3) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_apply_status(9) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_index_stat_head(4) fragment 1
_____________________________________________________
Processing data in table: test/def/c1(10) fragment 1
_____________________________________________________
Processing data in table: sys/def/SYSTAB_0(2) fragment 1
_____________________________________________________
Processing data in table: mysql/def/ndb_schema(7) fragment 1
Opening file '/home/ndb/cluster-data/BACKUP/BACKUP-1/BACKUP-1.4.log'
File size 52 bytes
Restored 44 tuples and 0 log entries

NDBT_ProgramExit: 0 - OK

[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 1
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 2
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               56              56              32768                   0                       0               0                       3,4
1               44              44              32768                   0                       0               0                       5,6


NDBT_ProgramExit: 0 - OK

</pre><p>
Uh oh, this didn't turn out quite the same as the example from MySQL Cluster 6.3.45. There are still only 2 partitions after the restore, even though there are 4 data nodes. Take a look at the Nodes column on the right of "Per partition info" and you can see, in fact, that the 2 partitions are actually on separate node groups. That's sort of interesting. It means that writes are still going to be scaled across all node groups, which is great, but it means that reads will not be scaled. All reads will have to come from nodes 3 and 5, because those nodes are the primaries for their respective partitions.</p>
<p>So, why did this change happen? I don't think anyone consciously decided to do it; instead, I think it's a side effect of the implementation of the new HashMap partitioning algorithm that was introduced and made the default in MySQL Cluster 7.0. Frazer Clement provides an exceptional discussion of the HashMap algorithm at <a href="http://messagepassing.blogspot.com/2011/03/mysql-cluster-online-scaling.html" title="http://messagepassing.blogspot.com/2011/03/mysql-cluster-online-scaling.html">http://messagepassing.blogspot.com/2011/03/mysql-cluster-online-scaling.html</a>.</p>
<p>It appears that the HashMap is stored as part of the schema data for the table; when the table metadata is restored with <code>ndb_restore -m</code>, the same HashMap is used. MySQL Cluster distributes the partitions across the node groups in the destination cluster, but it does not change the number of partitions. (As a result, a 6-node cluster would have 3 node groups but still only 2 partitions for this table, so one node group would not hold any of its partitions.)</p>
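<p>The arithmetic behind that parenthetical can be sketched in Python. This assumes two replicas per node group (as in the clusters used in this post) and that the restored partitions are placed on distinct node groups, as the <code>ndb_desc</code> output above suggests; the function name is hypothetical.</p>

```python
# Sketch only: how many node groups end up holding no partition of a
# restored table when the partition count is carried over unchanged.
# Assumes partitions land on distinct node groups, as observed above.

def idle_node_groups(data_nodes, replicas, partitions):
    """Return the number of node groups that hold none of the partitions."""
    node_groups = data_nodes // replicas
    return max(0, node_groups - partitions)


print(idle_node_groups(4, 2, 2))  # 0: both node groups hold a partition
print(idle_node_groups(6, 2, 2))  # 1: one of three node groups is empty
```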
<p>Now we see how <code>ndb_restore</code> works starting in MySQL Cluster 7.0 and we can see that the results are not very desirable. What, then, can be done to get your table distributed across all nodes and node groups so that each data node in the cluster is primary for one partition? There are a couple of options.</p>
<p>Part of the reason HashMap was put into place was to make it easier to redistribute data in the cluster in order to support online add node functionality. When using online add node, you execute an ALTER TABLE ... REORGANIZE PARTITION statement after creating the new node group(s) and starting the new data nodes. We can do the same, here, to reorganize the partitions of our table across all nodes in the cluster:</p>
<pre>
mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; select partition_name, table_rows from information_schema.partitions where table_schema='test' and table_name='c1';
+----------------+------------+
| partition_name | table_rows |
+----------------+------------+
| p0             |         56 |
| p1             |         44 |
+----------------+------------+
2 rows in set (0.00 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; alter table c1 reorganize partition;
Query OK, 0 rows affected (7.46 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; select partition_name, table_rows from information_schema.partitions where table_schema='test' and table_name='c1';
+----------------+------------+
| partition_name | table_rows |
+----------------+------------+
| p0             |         26 |
| p1             |         24 |
| p2             |         30 |
| p3             |         20 |
+----------------+------------+
4 rows in set (0.02 sec)

mysql 5.5.19-ndb-7.2.4-gpl (root) [test]&gt; Bye
[ndb@ip-10-0-0-59 mysql-cluster-gpl-7.2.4-linux2.6-x86_64]$ ./bin/ndb_desc -d test c1 -pn
-- c1 --
Version: 16777217
Fragment type: 9
K Value: 6
Min load factor: 78
Max load factor: 80
Temporary table: no
Number of attributes: 2
Number of primary keys: 1
Length of frm data: 204
Row Checksum: 1
Row GCI: 1
SingleUserMode: 0
ForceVarPart: 1
FragmentCount: 4
ExtraRowGciBits: 0
ExtraRowAuthorBits: 0
TableStatus: Retrieved
-- Attributes -- 
id Int NULL AT=FIXED ST=MEMORY
$PK Bigunsigned PRIMARY KEY DISTRIBUTION KEY AT=FIXED ST=MEMORY AUTO_INCR

-- Indexes -- 
PRIMARY KEY($PK) - UniqueHashIndex

-- Per partition info -- 
Partition       Row count       Commit count    Frag fixed memory       Frag varsized memory    Extent_space    Free extent_space       Nodes   
0               26              116             32768                   0                       0               0                       3,4
2               30              30              32768                   0                       0               0                       4,3
1               24              84              32768                   0                       0               0                       5,6
3               20              20              32768                   0                       0               0                       6,5


NDBT_ProgramExit: 0 - OK
</pre><p>
That's a pretty easy way to re-partition a table across your data nodes. However, keep in mind that you'd need to do this for every table in the cluster. It's fairly easy to do that programmatically by checking the number of partitions for each table in information_schema.partitions and executing ALTER TABLE ... REORGANIZE PARTITION for any table that has fewer partitions than there are rows in ndbinfo.nodes. Still, though, I don't find that to be terribly appealing. There are also a couple of big caveats for ALTER TABLE ... REORGANIZE PARTITION: it doesn't re-partition UNIQUE indexes or BLOBs. The first of those may not be such a big problem, because UNIQUE indexes (implemented in MySQL Cluster as a separate, hidden table) are not likely to be so large that scaling reads or spreading the data across additional node groups becomes important. BLOBs, on the other hand (also implemented in MySQL Cluster as a separate, hidden table), can take up a lot of space, so having them relegated to only some nodes in the cluster might mean that those nodes would use considerably more DataMemory than other nodes.</p>
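<p>The programmatic check described above might look something like the following Python sketch. It assumes the per-table partition counts have already been read from information_schema.partitions and the data node count from ndbinfo.nodes; the function name and the table <code>test.c2</code> are hypothetical, and identifier quoting is left out for brevity.</p>

```python
# Sketch only: build the ALTER TABLE statements for tables that have
# fewer partitions than there are data nodes. The inputs are assumed to
# come from information_schema.partitions (partition count per table)
# and ndbinfo.nodes (one row per data node).

def reorganize_statements(partition_counts, data_nodes):
    """Return one REORGANIZE PARTITION statement per under-partitioned table."""
    statements = []
    for (schema, table), count in sorted(partition_counts.items()):
        if count < data_nodes:
            statements.append(
                "ALTER TABLE %s.%s REORGANIZE PARTITION" % (schema, table)
            )
    return statements


# test.c1 matches the table in this post; test.c2 is a made-up table
# that already has one partition per data node.
counts = {("test", "c1"): 2, ("test", "c2"): 4}
for statement in reorganize_statements(counts, 4):
    print(statement)  # ALTER TABLE test.c1 REORGANIZE PARTITION
```

<p>Keeping the decision in a pure function like this makes it easy to review the generated statements before running any of them against the cluster.</p>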
<p>Another solution, if ALTER TABLE ... REORGANIZE PARTITION doesn't strike your fancy, is to use <code>mysqldump --no-data</code> to back up and restore your schema instead of relying on <code>ndb_restore -m</code>. You'd still use <code>ndb_restore</code> to restore data, but you'd get the schema from <code>mysqldump</code>. When you execute the CREATE TABLE statements output by <code>mysqldump</code>, MySQL Cluster sees them as brand new tables and thus partitions them across all data nodes in the cluster, as would be the case for any new table created on the cluster.</p>
<p>Using <code>mysqldump</code> has the advantage of backing up triggers and stored routines, which you won't get if you use <code>ndb_restore -m</code>. If you are using those features, this is very important, of course; if you're not using them, there isn't a lot of practical value gained by using <code>mysqldump</code>. In fact, it means that you add an extra step for backup, and you add an extra step for restore. On top of that, you get no guarantee of consistency. Some DDL could be executed between the time that you run <code>mysqldump</code> and the time you start your NDB native backup. That means that there is no guarantee that the table structure in one part of your backup matches the structure of the data in the other part. That's a little bit scary, and it can only be worked around safely by essentially taking the cluster offline (single user mode) when executing a backup.</p>
<p>My hope is that the original (and still widely expected) behavior of <code>ndb_restore</code> will be ... restored. I've opened <a href="http://bugs.mysql.com/bug.php?id=64302">bug #64302</a> to track the issue. Let me know your thoughts here, and let the MySQL Cluster developers know your thoughts on the bug report.</p>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/02/16/surprises-in-store-with-ndb_restore/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing SkySQL™ Enterprise HA for the MariaDB® &amp; MySQL® databases</title>
		<link>http://www.skysql.com/blogs/jean-jerome-schmidt/announcing-skysql-enterprise-ha-mariadb-mysql-databases-0?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=announcing-skysql-enterprise-ha-for-the-mariadb-mysql-databases</link>
		<comments>http://www.skysql.com/blogs/jean-jerome-schmidt/announcing-skysql-enterprise-ha-mariadb-mysql-databases-0#comments</comments>
		<pubDate>Mon, 23 Jan 2012 14:57:52 +0000</pubDate>
		<dc:creator>SkySQL</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[High Availability]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[MySQL Cluster]]></category>
		<category><![CDATA[Replication]]></category>
		<category><![CDATA[SkySQL]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=082bf0edcc7da3ce5ecd195ac2b6a995</guid>
		<description><![CDATA[SkySQL&#8482; today announced the immediate availability of SkySQL&#8482; Enterprise HA, its leading 360&#176; High Availability solution for the MySQL&#174; &#38; MariaDB&#174; databases.
High Availability is the #1 requested enhancement to the MySQL &#38; MariaDB servers, even more popular than scalability and performance.&#160; And with SkySQL&#039;s expertise at hand, it is now easier than ever before for customers to achieve the level of High Availability that they want.
SkySQL&#8482; Enterprise HA is SkySQL&#039;s 360&#176; answer to providing a ready-to-go solution for MySQL &#38; MariaDB High Availability &#8211; in no more than 3 days.
Check out the following resources for more information:
Visit the SkySQL Enterprise HA product page
Including:


		SkySQL&#8482; Enterprise HA Options Table

		SkySQL&#8482; Enterprise HA Statement of Work

Download the SkySQL High Availability whitepaper
Contact your local SkySQL representative to discuss your HA needs
Finally, if you are in New York City today, join Ivan Zoratti, SkySQL CTO, at the MySQL Meetup for a discussion about cool new tools &#38; tricks to achieve High Availability of your MySQL servers!
For more information, visit the New York City MySQL Group webpage.
We look forward to helping you achieve your High Availability objectives for your MySQL &#38; MariaDB databases!]]></description>
			<content:encoded><![CDATA[<p><strong>SkySQL&trade;</strong> today announced the immediate availability of <strong><a href="http://www.skysql.com/services/consulting/mysql-high-availability">SkySQL&trade; Enterprise HA</a></strong>, its leading 360&deg; High Availability solution for the MySQL&reg; &amp; MariaDB&reg; databases.</p>
<p>High Availability is the #1 requested enhancement to the MySQL &amp; MariaDB servers, even more popular than scalability and performance.&nbsp; And with <a href="http://www.skysql.com/services/consulting/mysql-high-availability"><u>SkySQL&#39;s expertise at hand</u></a>, it is now easier than ever before for customers to achieve the level of High Availability that they want.</p>
<p><a href="http://www.skysql.com/services/consulting/mysql-high-availability">SkySQL&trade;</a><a href="http://www.skysql.com/services/consulting/mysql-high-availability"><u> Enterprise HA</u></a> is SkySQL&#39;s 360&deg; answer to providing a ready-to-go solution for MySQL &amp; MariaDB High Availability &ndash; <strong>in no more than 3 days</strong>.</p>
<p>Check out the following resources for more information:</p>
<p><a href="http://www.skysql.com/services/consulting/mysql-high-availability"><u>Visit the SkySQL Enterprise HA product page</u></a></p>
<p>Including:</p>
<ul>
<li>
		SkySQL&trade; Enterprise HA Options Table</li>
<li>
		SkySQL&trade; Enterprise HA Statement of Work</li>
</ul>
<p><a href="http://www.skysql.com/news-and-events/white-papers/high-availability-solutions-mysql-database"><u>Download the SkySQL High Availability whitepaper</u></a></p>
<p><a href="http://www.skysql.com/company/contact"><u>Contact your local SkySQL representative to discuss your HA needs</u></a></p>
<p>Finally, if you are in New York City today, join Ivan Zoratti, SkySQL CTO, at the MySQL Meetup for a discussion about cool new tools &amp; tricks to achieve High Availability of your MySQL servers!</p>
<p><a href="http://www.skysql.com/news-and-events/events/database-month-mysql-high-availability-reloaded">For more information, visit the New York City MySQL Group webpage.</a></p>
<p>We look forward to helping you achieve your High Availability objectives for your MySQL &amp; MariaDB databases!</p>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2012/01/23/announcing-skysql-enterprise-ha-for-the-mariadb-mysql-databases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>2011, A great year for MySQL in review&#8230;</title>
		<link>http://feedproxy.google.com/~r/ItsJustAboutCommunication/~3/5Isa-1JnQjc/2011-great-year-for-mysql-in-review.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=2011-a-great-year-for-mysql-in-review</link>
		<comments>http://feedproxy.google.com/~r/ItsJustAboutCommunication/~3/5Isa-1JnQjc/2011-great-year-for-mysql-in-review.html#comments</comments>
		<pubDate>Thu, 29 Dec 2011 12:31:00 +0000</pubDate>
		<dc:creator>Luca Olivari</dc:creator>
				<category><![CDATA[2011]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[community]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[events]]></category>
		<category><![CDATA[marketing]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[opensource]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[windows]]></category>
		<category><![CDATA[workbench]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=75602c8a5ac8a5d4b30226af56776c9b</guid>
		<description><![CDATA[I see so many posts on what happened to company X, product Y and dream Z that I couldn't resist the temptation to summarize this great year for MySQL. At the end of 2010, Oracle made an announcement we were all waiting for:&#160;MySQL 5.5 is GA!&#160;Another year has passed since then and it's time to reflect on what has been done.

I know this is a long post. I tried to rewrite it at least 10 times to make it shorter, but I couldn't condense the list. Hence, I wrote a summary at the beginning for those who don't want to read it all.

I believe that 2011 was an exceptional year for MySQL and I really enjoy being part of this team. I wish all of us a lot of success and fun in the years to come!

Summary:
Oracle released many&#160;MySQL 5.6 and&#160;MySQL Cluster 7.2&#160;DMRs accompanied&#160;by new versions of MySQL Enterprise Monitor, MySQL Enterprise Backup,&#160;MySQL Workbench&#160;(and utilities), MySQL Proxy, MySQL Cluster Manager&#160;and&#160;Connectors.

The MySQL team unveiled new products like the MySQL Installer for Windows and Oracle VM Templates for MySQL. Besides, the&#160;MySQL Enterprise offering has been enriched with new commercial extensions.&#160;MySQL can now be leveraged as one of the Oracle data management solutions with new certifications&#160;and the integration with My Oracle Support&#160;increased the business value of customers' investment in Oracle technologies.

Additionally, MySQL presented at major events across the world and won a few awards.


Long List:
If you're still reading, below you can find a hopefully extensive list of announcements and blogs (in reverse&#160;chronological&#160;order). I've mainly covered product releases, events and awards. Please let me know if I missed something.

Products:&#160;
Dec 26 - MySQL Workbench 5.2.37 Has Been Released
Dec 20 - MySQL 5.6.4 Development Milestone Now Available!
Dec 02 - MySQL Enterprise Monitor 2.3.8 is now GA!
Nov 28 - MySQL 5.5.18 Debian packaging now available
Oct 10 - New MySQL Enterprise Oracle Certifications
Oct 10 - MySQL Utilities 1.0.3
Oct 07 - MySQL Cluster 7.2 (DMR2): NoSQL, Key/Value, Memcached
Oct 03 - More Early Access Features in the MySQL 5.6.3 Development Milestone!
Oct 03 -&#160;New Development Milestone Releases &#38; Certifications!
Sep 15 - New Commercial Extensions for MySQL Enterprise Editions
Sep 09 - MySQL@Oracle OpenWorld
Sep 06 -&#160;Oracle Enhances MySQL Installer and High Availability for Windows
Sep 06 - Oracle Enhances MySQL Manageability on Windows
Aug 19 - MySQL Proxy 0.8.2 Has Been Released
Aug 01 -&#160;More New MySQL 5.6 Early Access Features
Jul 19 -&#160;MySQL Enterprise Backup 3.6 - New backup streaming, integration with Oracle Secure Backup and other common backup media solutions
Jul 18 - Simpler and Safer Clustering: MySQL Cluster Manager Update
Jul 06 - Announced Oracle VM Templates for MySQL
Apr 12 - MySQL Cluster 7.2 Development Milestone Release - NoSQL with Memcached and 20x Higher JOIN Performance
Apr 11 -&#160;Top Features in MySQL 5.6.2 Development Milestone Release
Apr 11 - Introducing the MySQL Installer for Windows
Mar 15 - Oracle Enhances MySQL Enterprise Edition

Events:
Oct 26 - A lot of MySQL Events in Europe
Oct 12 - MySQL Roadshow in Germany
Sep 16 - OTN MySQL Developer Day in London
Aug 08 - OTN Developer Day: MySQL is Coming to Washington, DC
Jul 14 -&#160;New “Meet The MySQL Experts” Podcast Series
May 13 - Upcoming MySQL Events in Europe
Apr 26 -&#160;OTN Developer Day for MySQL - Santa Clara, CA
Mar 25 - MySQL (and Cluster) at Collaborate and O'Reilly MySQL Conference
Mar 14 -&#160;First Ever MySQL on Windows Online Forum - March 16, 2011

Awards:
Dec 15 -&#160;MySQL Wins Best Open Source Product of 2011 Award
Jun 03 - MySQL Wins the php&#124;architect Impact Award for Data Management
Jan 17 - MySQL Makes the Cover of Oracle Magazine

To all MySQL customers, partners, colleagues, developers, users, advocates or aficionados:&#160;Thank you for this terrific year!&#160;Go MySQL!]]></description>
			<content:encoded><![CDATA[I see so many posts on what happened to company X, product Y and dream Z that I couldn't resist the temptation to summarize this great year for MySQL. At the end of 2010, Oracle made an announcement we were all waiting for:&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/mysql_55_is_ga">MySQL 5.5 is GA</a>!&nbsp;Another year has passed since then and it's time to reflect on what has been done.<br />
<br />
I know this is a long post. I tried to rewrite it at least 10 times to make it shorter, but I couldn't condense the list. Hence, I wrote a summary at the beginning for those who don't want to read it all.<br />
<br />
I believe that 2011 was an exceptional year for MySQL and I really enjoy being part of this team. I wish all of us a lot of success and fun in the years to come!<br />
<br />
<b>Summary:</b><br />
<a href="http://www.mysql.com/common/logos/logo-mysql-110x57.png" imageanchor="1"><img border="0" src="http://www.mysql.com/common/logos/logo-mysql-110x57.png" /></a>Oracle released many&nbsp;<a href="http://dev.mysql.com/tech-resources/articles/whats-new-in-mysql-5.6.html">MySQL 5.6 </a>and&nbsp;<a href="http://dev.mysql.com/tech-resources/articles/mysql-cluster-labs-dev-milestone-release.html">MySQL Cluster 7.2</a>&nbsp;DMRs accompanied&nbsp;by new versions of <a href="http://mysql.com/products/enterprise/monitor.html">MySQL Enterprise Monitor</a>, <a href="http://mysql.com/products/enterprise/backup.html">MySQL Enterprise Backup</a>,&nbsp;<a href="http://www.mysql.com/products/workbench/">MySQL Workbench</a>&nbsp;(and <a href="http://drcharlesbell.blogspot.com/2011/10/mysql-utilities-release-103.html">utilities</a>), <a href="http://dev.mysql.com/downloads/mysql-proxy/">MySQL Proxy</a>, <a href="http://www.mysql.com/products/cluster/mcm/">MySQL Cluster Manager</a>&nbsp;and&nbsp;<a href="http://dev.mysql.com/downloads/connector/">Connectors</a>.<br />
<br />
The MySQL team unveiled new products like the <a href="http://dev.mysql.com/tech-resources/articles/mysql-installer-for-windows.html">MySQL Installer</a> for Windows and <a href="http://www.oracle.com/us/corporate/press/421994">Oracle VM Templates for MySQL</a>. In addition, the&nbsp;<a href="http://www.mysql.com/products/enterprise/">MySQL Enterprise</a> offering has been enriched with new <a href="http://blogs.oracle.com/MySQL/entry/new_commercial_extensions_for_mysql">commercial extensions</a>.&nbsp;MySQL can now be leveraged as one of the Oracle data management solutions with new <a href="http://blogs.oracle.com/MySQL/entry/new_mysql_enterprise_oracle_certifications">certifications</a>,&nbsp;and the integration with <a href="http://www.oracle.com/us/support/mos-mysql-297243.html">My Oracle Support</a>&nbsp;increased the business value of customers' investment in Oracle technologies.<br />
<br />
Additionally, MySQL presented at major <a href="http://mysql.com/news-and-events/events/">events</a> across the world and won a few <a href="http://www.mysql.com/why-mysql/awards/">awards</a>.<br />
<br />
<a name='more'></a><br />
<b>Long List:</b><br />
If you're still reading, below you can find a hopefully extensive list of announcements and blog posts (in reverse&nbsp;chronological&nbsp;order). I've mainly covered product releases, events and awards. Please let me know if I missed something.<br />
<br />
<b>Products:&nbsp;</b><br />
Dec 26 - <a href="http://blogs.oracle.com/mysqlworkbench/entry/mysql_workbench_5_2_37">MySQL Workbench 5.2.37 Has Been Released</a><br />
Dec 20 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_5_6_4_development">MySQL 5.6.4 Development Milestone Now Available!</a><br />
Dec 02 - <a href="http://blogs.oracle.com/mysqlenterprise/entry/mysql_enterprise_monitor_2_34">MySQL Enterprise Monitor 2.3.8 is now GA!</a><br />
Nov 28 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_5_5_18_debian">MySQL 5.5.18 Debian packaging now available</a><br />
Oct 10 - <a href="http://blogs.oracle.com/MySQL/entry/new_mysql_enterprise_oracle_certifications">New MySQL Enterprise Oracle Certifications</a><br />
Oct 10 - <a href="http://drcharlesbell.blogspot.com/2011/10/mysql-utilities-release-103.html">MySQL Utilities 1.0.3</a><br />
Oct 07 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_7_2_dmr2">MySQL Cluster 7.2 (DMR2): NoSQL, Key/Value, Memcached</a><br />
Oct 03 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_7_2_dmr2">More Early Access Features in the MySQL 5.6.3 Development Milestone!</a><br />
Oct 03 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/new_development_milestone_releases_certifications">New Development Milestone Releases &amp; Certifications!</a><br />
Sep 15 - <a href="http://blogs.oracle.com/MySQL/entry/new_commercial_extensions_for_mysql">New Commercial Extensions for MySQL Enterprise Editions</a><br />
Sep 09 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_oracle_openworld">MySQL@Oracle OpenWorld</a><br />
Sep 06 -&nbsp;<a href="http://www.oracle.com/us/corporate/press/485067">Oracle Enhances MySQL Installer and High Availability for Windows</a><br />
Sep 06 - <a href="http://blogs.oracle.com/MySQL/entry/oracle_enhances_mysql_manageability_on">Oracle Enhances MySQL Manageability on Windows</a><br />
Aug 19 - <a href="http://blogs.oracle.com/mysqlenterprise/entry/mysql_proxy_0_8_2">MySQL Proxy 0.8.2 Has Been Released</a><br />
Aug 01 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/more_new_mysql_5_6">More New MySQL 5.6 Early Access Features</a><br />
Jul 19 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/mysql_enterprise_backup_3_6">MySQL Enterprise Backup 3.6 - New backup streaming, integration with Oracle Secure Backup and other common backup media solutions</a><br />
Jul 18 - <a href="http://blogs.oracle.com/MySQL/entry/simpler_and_safer_clustering_mysql">Simpler and Safer Clustering: MySQL Cluster Manager Update</a><br />
Jul 06 - <a href="http://blogs.oracle.com/MySQL/entry/virtualizing_mysql_1_click_kick">Announced Oracle VM Templates for MySQL</a><br />
Apr 12 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_72_development_milestone_release_-_nosql_with_memcached_and_20x_higher_join_performanc">MySQL Cluster 7.2 Development Milestone Release - NoSQL with Memcached and 20x Higher JOIN Performance</a><br />
Apr 11 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/top_features_in_mysql_562_development_milestone_release">Top Features in MySQL 5.6.2 Development Milestone Release</a><br />
Apr 11 - <a href="http://dev.mysql.com/tech-resources/articles/mysql-installer-for-windows.html">Introducing the MySQL Installer for Windows</a><br />
Mar 15 - <a href="http://www.oracle.com/us/corporate/press/339030">Oracle Enhances MySQL Enterprise Edition</a><br />
<br />
<b>Events:</b><br />
Oct 26 - <a href="http://blogs.oracle.com/MySQL/entry/and_more_mysql_events_in">A lot of MySQL Events in Europe</a><br />
Oct 12 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_roadshow_in_germany">MySQL Roadshow in Germany</a><br />
Sep 16 - <a href="http://blogs.oracle.com/MySQL/entry/otn_mysql_developer_day_in">OTN MySQL Developer Day in London</a><br />
Aug 08 - <a href="http://blogs.oracle.com/MySQL/entry/otn_developer_day_mysql_is">OTN Developer Day: MySQL is Coming to Washington, DC</a><br />
Jul 14 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/new_meet_the_mysql_experts">New “Meet The MySQL Experts” Podcast Series</a><br />
May 13 - <a href="http://blogs.oracle.com/MySQL/entry/upcoming_mysql_events_in_europe">Upcoming MySQL Events in Europe</a><br />
Apr 26 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/otn_developer_day_for_mysql_-_santa_clara_ca">OTN Developer Day for MySQL - Santa Clara, CA</a><br />
Mar 25 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_cluster_on_the_road_oreilly_mysql_and_collaborate_conferences">MySQL (and Cluster) at Collaborate and O'Reilly MySQL Conference</a><br />
Mar 14 -&nbsp;<a href="http://blogs.oracle.com/MySQL/entry/first_ever_mysql_on_windows_online_forum_-_march_16_2011">First Ever MySQL on Windows Online Forum - March 16, 2011</a><br />
<br />
<b>Awards:</b><br />
Dec 15 -&nbsp;<a href="http://mysql%20wins%20best%20open%20source%20product%20of%202011%20award/">MySQL Wins Best Open Source Product of 2011 Award</a><br />
Jun 03 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_wins_the_php_architect">MySQL Wins the php|architect Impact Award for Data Management</a><br />
Jan 17 - <a href="http://blogs.oracle.com/MySQL/entry/mysql_makes_the_cover_of_oracle_magazine">MySQL Makes the Cover of Oracle Magazine</a><br />
<br />
To all MySQL customers, partners, colleagues, developers, users, advocates or aficionados:&nbsp;<b>Thank you for this terrific year!&nbsp;Go MySQL!</b>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/29/2011-a-great-year-for-mysql-in-review/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual Consistency in MySQL Cluster &#8212; implementation part 3</title>
		<link>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-in-mysql-cluster-implementation-part-3</link>
		<comments>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html#comments</comments>
		<pubDate>Thu, 22 Dec 2011 17:36:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=08982a5a78aac34767dc093723414161</guid>
		<description><![CDATA[As promised, this is the final post in a series looking at eventual consistency with MySQL Cluster asynchronous replication. This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.

Transaction based conflict handling with NDB$EPOCH_TRANS

NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described in an earlier post in this series.

Including user transaction ids in the Binlog

Ndb Binlog epoch transactions contain row events from all the user transactions which committed in an epoch. However, there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'roll back' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The mysqlbinlog --verbose tool can be used to see per-row transaction information in the Binlog.

User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. One interesting possibility is to use the user transaction ids and same-row operation dependencies to sort the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. A project for a rainy day perhaps?

NDB$EPOCH_TRANS multiple slave passes

Initially, NDB$EPOCH_TRANS proceeds in the same way as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However, if a row conflict is detected, the epoch transaction is rolled back and reapplied. This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.

In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled back and a third pass begins.

In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same way as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.

These three passes, and the associated rollbacks, are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.

Dependency tracking implementation

To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.

Transaction dependency detection hash:
{Table, Primary keys} -&#62; {Transaction id}

The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.

Transaction dependency tracking and conflict marking hash:
{Transaction id} -&#62; {in_conflict, List of dependent transaction ids}

As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by traversing the dependency tree of the in-conflict transaction. Due to slave batching, the addition of new dependencies and the marking of conflicting transactions are interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.

After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not.

Transaction in-conflict lookup hash:
{Transaction id} -&#62; {in_conflict}

This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.

The size of these hashes, and the complexity of the dependency graph, are bounded by the size of the epoch transaction. There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.

Event counters

Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.

Additional counters have been added which specifically track the different events associated with transactional conflict detection. These can be seen with the usual SHOW GLOBAL STATUS LIKE syntax, or via the INFORMATION_SCHEMA tables.

ndb_conflict_trans_row_conflict_count: This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.

ndb_conflict_trans_row_reject_count: The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction. A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count.

ndb_conflict_trans_reject_count: The number of discrete user transactions detected as in-conflict.

ndb_conflict_trans_conflict_commit_count: The number of epoch transactions which had transactional conflicts detected during application.

ndb_conflict_trans_detect_iter_count: The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.

Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS

I have tried to avoid getting involved in an explanation of Ndb replication in general which would probably fill a terabyte of posts. Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?

Conflict detection logic is pushed down to data nodes for execution, minimising extra data transfer + locking.
Slave operation batching is preserved. Multiple row events are applied together, saving MySQLD &#60;-&#62; data node round trips, using data node parallelism. For both algorithms, one extra MySQLD &#60;-&#62; data node round-trip is required in the no-conflicts case (best case).
NDB$EPOCH : One extra MySQLD &#60;-&#62; data node round-trip is required per *batch* in the all-conflicts case (worst case).
NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.
NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption.
NDB$EPOCH_TRANS : One extra MySQLD &#60;-&#62; data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case).
NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.
NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.
NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case.

Current and intrinsic limitations

These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.

Intrinsic limitations

Reads from the Secondary are tentative: Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable. This is described in more detail in an earlier post in this series.

Writes to the Secondary may be rolled back: If this occurs, the fact will be recorded on the Primary. Once a committed write is stable it will not be rolled back.

Out-of-band dependencies between transactions are out-of-scope: For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.

Current implementation limitations

Detected transaction dependencies are limited to dependencies between binlogged writes (Insert, Update, Delete): Reads are not currently included.

Delete vs Delete+Insert conflicts risk data divergence: Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert can result in data divergence.

With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency: With pending multiple, time spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.

Want to try it out?

Andrew Morgan has written a great post showing how to set up NDB$EPOCH_TRANS. He's even included non-ascii art. This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.

Edit 23/12/11 : Added index]]></description>
			<content:encoded><![CDATA[<a href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"><img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html#mymap" border="0" /><br /></a><br /><map name="mymap"><area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html" /><area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html" /><area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html" /><area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html" /><area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /><area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /></map><br />As promised, this is the final post in a series looking at eventual consistency with MySQL 
Cluster asynchronous replication.  This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.<br /><br /><span>Transaction based conflict handling with NDB$EPOCH_TRANS</span><br /><br />NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described <a href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html">here</a>.<br /><br /><span>Including user transaction ids in the Binlog</span><br /><br />Ndb Binlog <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html">epoch transactions</a> contain row events from all the user transactions which committed in an epoch. However, there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'roll back' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The <a href="http://dev.mysql.com/doc/refman/5.1/en/mysqlbinlog.html">mysqlbinlog</a> --verbose tool can be used to see per-row transaction information in the Binlog.<br /><br />User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. 
One interesting possibility is to use the user transaction ids and same-row operation dependencies to <a href="http://en.wikipedia.org/wiki/Topological_sorting">sort</a> the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. A project for a rainy day perhaps?<br /><br /><span>NDB$EPOCH_TRANS multiple slave passes</span><br /><br />Initially, NDB$EPOCH_TRANS proceeds in the same <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html">way</a> as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However, if a row conflict is detected, the epoch transaction is rolled back and reapplied.  This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.<br /><br />In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled back and a third pass begins.<br /><br />In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">way</a> as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. 
If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.<br /><br />These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.<br /><br /><span>Dependency tracking implementation</span><br /><br />To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.<br /><br />Transaction dependency detection hash :<br /><div>{Table, Primary keys} -&gt; {Transaction id}<br /></div><br />The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. 
By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.<br /><br />Transaction dependency tracking and conflict marking hash :<br /><div>{Transaction id} -&gt; {in_conflict, List of dependent transaction ids}<br /></div><br />As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by traversing the dependency tree of the in-conflict transaction.  Due to slave batching, the addition of new dependencies and the marking of conflicting transactions are interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.<br /><br />After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :<br /><br />Transaction in-conflict lookup hash :<br /><div>{Transaction id} -&gt; {in_conflict}<br /></div><br />This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.<br /><br />The size of these hashes, and the complexity of the dependency graph, are bounded by the size of the epoch transaction.  There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.<br /><br /><span>Event counters</span><br /><br />Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.<br /><br />Additional counters have been added which specifically track the different events associated with transactional conflict detection.  
These can be seen with the usual SHOW GLOBAL STATUS LIKE <a href="http://dev.mysql.com/doc/refman/5.1/en/show-status.html">syntax</a>, or via the INFORMATION_SCHEMA <a href="http://dev.mysql.com/doc/refman/5.1/en/status-table.html">tables</a>.<br /><br /><ul><li><span>ndb_conflict_trans_row_conflict_count</span><br />This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.</li><li><span>ndb_conflict_trans_row_reject_count</span><br />The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_conflict_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction.<br />A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count</li><li><span>ndb_conflict_trans_reject_count</span><br />The number of discrete user transactions detected as in-conflict.</li><li><span>ndb_conflict_trans_conflict_commit_count</span><br />The number of epoch transactions which had transactional conflicts detected during application.</li><li><span>ndb_conflict_trans_detect_iter_count</span><br />The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. 
So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.<br /></li></ul><br /><br /><span>Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS</span><br /><br />I have tried to avoid getting involved in an explanation of Ndb replication in general, which would probably fill a terabyte of posts. Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we say?<br /><br /><ul><li>Conflict detection logic is pushed down to data nodes for execution<br />Minimising extra data transfer + locking</li><li>Slave operation batching is preserved<br />Multiple row events are applied together, saving MySQLD &lt;-&gt; data node round trips, using data node parallelism<br />For both algorithms, one extra MySQLD &lt;-&gt; data node round-trip is required in the no-conflicts case (best case)</li><li>NDB$EPOCH : One extra MySQLD &lt;-&gt; data node round-trip is required per *batch* in the all-conflicts case (worst case)</li><li>NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.</li><li>NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption</li><li>NDB$EPOCH_TRANS : One extra MySQLD &lt;-&gt; data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)</li><li>NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.</li><li>NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.</li><li>NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case<br /></li></ul><br /><span>Current and intrinsic limitations</span><br /><br />These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. 
Some limitations are due to the current implementation, others are intrinsic to the asynchronous distributed consistency problem itself.<br /><br /><span>Intrinsic limitations</span><br /><ul><li><span>Reads from the Secondary are tentative</span><br />Data committed on the Secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable.  This is described in more detail <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">here</a>.</li><li><span>Writes to the Secondary may be rolled back</span><br />If this occurs, the fact will be recorded on the Primary. Once a committed write is <a href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html">stable</a> it will not be rolled back.</li><li><span>Out-of-band dependencies between transactions are out-of-scope</span><br />For example, direct communication between two clients can create a dependency between their committed transactions that is not observable from their database footprints.<br /></li></ul><br /><span>Current implementation limitations</span><br /><br /><ul><li><span>Detected transaction dependencies are limited to dependencies between binlogged writes</span> (Insert, Update, Delete)<br />Reads are not currently included.</li><li><span>Delete vs Delete+Insert conflicts risk data divergence</span><br />Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert can result in data divergence.</li><li><span>With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency</span><br />With multiple pending, time-spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.<br /></li></ul><br /><span>Want to try it out?</span><br /><br />Andrew Morgan has written a great <a 
href="http://www.clusterdb.com/mysql-cluster/enhanced-conflict-resolution-with-mysql-cluster-active-active-replication/">post</a> showing how to set up NDB$EPOCH_TRANS. He's even included non-ascii art.  This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.<br /><br /><span>Edit 23/12/11 : Added index</span>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/22/eventual-consistency-in-mysql-cluster-implementation-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eventual consistency in MySQL Cluster &#8212; implementation part 2</title>
		<link>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eventual-consistency-in-mysql-cluster-implementation-part-2</link>
		<comments>http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html#comments</comments>
		<pubDate>Mon, 19 Dec 2011 13:30:00 +0000</pubDate>
		<dc:creator>Frazer Clement</dc:creator>
				<category><![CDATA[active-active]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>

		<guid isPermaLink="false">http://planetmysql.ru/?guid=d98d0d71c89256f4c9e1ab4c94fa6c42</guid>
		<description><![CDATA[In previous posts I described how row conflicts are detected using epochs.  In this post I describe how they are handled.]]></description>
			<content:encoded><![CDATA[<a href="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s1600/image2.gif"><img style="display:block; margin:0px auto 10px; text-align:left;cursor:pointer; cursor:hand;width: 250px; height: 203px;" src="http://1.bp.blogspot.com/-uRfNNaOT5vw/TvRYK0hzgGI/AAAAAAAAAAg/rVaczy8-rds/s320/image2.gif" alt="" id="BLOGGER_PHOTO_ID_5689269172198146146" usemap="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html#mymap" border="0" /><br /></a><br /><map name="mymap"><area shape="rect" coords="0,182,249,200" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_22.html" /><area shape="rect" coords="0,166,249,183" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_16.html" /><area shape="rect" coords="0,147,249,166" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,127,249,147" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster_08.html" /><area shape="rect" coords="0,109,249,127" href="http://messagepassing.blogspot.com/2011/12/eventual-consistency-in-mysql-cluster.html" /><area shape="rect" coords="0,92,249,109" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,73,249,92" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-detecting.html" /><area shape="rect" coords="0,59,249,73" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-transactions.html" /><area shape="rect" coords="0,37,249,59" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /><area shape="rect" coords="0,0,249,37" href="http://messagepassing.blogspot.com/2011/10/eventual-consistency-with-mysql.html" /></map><br />In previous posts I described how row conflicts are detected using epochs.  
In this post I describe how they are handled.<br /><span><br />Row based conflict handling with NDB$EPOCH</span><br /><br />Once a row conflict is detected, as well as rejecting the row change, row based conflict handling in the Slave will :<br /><ul><li>Increment conflict counters</li><li>Optionally insert a row into an exceptions table<br /></li></ul>For NDB$EPOCH, conflict detection and handling operates on one Cluster in an Active-Active pair designated as the Primary.  When a Slave MySQLD attached to the Primary Cluster detects a conflict between data stored in the Primary and a replicated event from the Secondary, it needs to realign the Secondary to store the same values for the conflicting data.  Realignment involves injecting an event into the Primary Cluster's Binlog which, when applied idempotently on the Secondary Cluster, will force the row on the Secondary Cluster to take the supplied values.  This requires either a WRITE_ROW event, with all columns, or a DELETE_ROW event with just the primary key columns.  These events can be thought of as <a href="http://en.wikipedia.org/wiki/Compensating_transaction">compensating</a> events used to revert the original effect of the rejected events.<br /><br />Conflicts are detected by a Slave MySQLD attached to the Primary Cluster, and realignment events must appear in Binlogs recorded by the same MySQLD and/or other Binlogging MySQLDs attached to the Primary Cluster.  
This is achieved using a new <a href="http://dev.mysql.com/doc/ndbapi/en/index.html">NdbApi</a> primary key operation type called <span>refreshTuple</span>.<br /><br />When a refreshTuple operation is executed it will :<br /><ol><li>Lock the affected row/primary key until transaction commit time, even if it does not exist (much as an Insert would).</li><li>Set the affected row's author metacolumn to 0<br />The refresh is logically a local change</li><li>On commit<br />- Row exists case : Set the row's last committed epoch to the current epoch<br />- Cause a WRITE_ROW (row exists case) or DELETE_ROW (no row exists) event to be generated by attached Binlogging MySQLDs.<br /></li></ol><br />Locking the row as part of refreshTuple serialises the conflicting epoch transaction with other potentially conflicting local transactions.  Updating the stored epoch and author metacolumns results in the conflicting row conflicting with any further replicated changes occurring while the realignment event is 'in flight'.  The compensating row events are effectively new row changes originating at the Primary cluster which need to be monitored for conflicts in the same way as normal row changes.<br /><br />It is important that the Slave running at the Secondary Cluster, where the realignment events will be applied, is running in idempotent mode, so that it can handle the realignment events correctly.  If this is not the case then WRITE_ROW realignment events may hit 'Row already exists' errors, and DELETE_ROW realignment events may hit 'Row does not exist' errors.<br /><br /><span>Observations on conflict windows and consistency</span><br /><br />When a conflict is detected, the refresh process results in the row's epoch and author metacolumns being modified so that the window of potential conflict is extended, until the epoch in which the refresh operation was recorded has itself been reflected.  
If ongoing updates at both clusters continually conflict then refresh operations will continue to be generated, and the conflict window will remain open until a refresh operation manages to propagate with no further conflicts occurring.  As with any eventually consistent system, consistency is only guaranteed when the system (or at least the data of interest) is quiescent for a period.<br /><br />From the Primary cluster's point of view, the <span>conflict window length</span> is the time between committing a local transaction in epoch <span>n</span>, and the attached Slave committing a replicated epoch transaction indicating that epoch <span>n</span> has been applied at the Secondary.  Any Secondary-sourced overlapping change applied in this time is in-conflict.<br /><br />This <span>Cluster conflict window</span> <span>length</span> is comprised of :<br /><br /><ul><li> Time between commit of transaction, and next Primary Cluster epoch boundary<br />(Worst = 1 * <a href="http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-timebetweenepochs"><span>TimeBetweenEpochs</span></a>, Best = 0, Avg = 0.5 * <span>TimeBetweenEpochs</span>)</li><li>Time required to log event in Primary Cluster's Binlogging MySQLDs Binlog (~negligible)</li><li>Time required for Secondary Slave MySQLD IO thread to<br />- Minimum : Detect new Binlog data - negligible<br />- Maximum : Consume queued Binlog prior to the new data - unbounded<br />- Pull new epoch transaction<br />- Record in Relay log<br /></li><li>Time required for Secondary Slave MySQLD SQL thread to<br />- Minimum : Detect new events in relay log<br />- Maximum : Consume queued Relay log prior to new data - unbounded<br />- Read and apply events<br />- Potentially multiple batches.<br />- Commit epoch transaction at Secondary</li><li>Time between commit of replicated epoch transaction and next Secondary Cluster epoch boundary<br />(Worst = 1 * <span>TimeBetweenEpochs</span>, Best = 0, Avg = 
0.5 * <span>TimeBetweenEpochs</span>)</li><li><span>After this point a Secondary-local commit on the data is possible without conflict</span></li><li>Time required to log event in Secondary Cluster's Binlogging MySQLDs Binlog (~negligible)</li><li>Time required for Primary Slave MySQLD IO thread to<br />- Minimum : Detect new Binlog data<br />- Maximum : Consume queued Binlog data prior to the new data - unbounded<br />- Pull new epoch transaction<br />- Record in Relay log</li><li>Time required for Primary Slave MySQLD SQL thread to<br />- Minimum : Detect new events in relay log<br />- Maximum : Consume queued Relay log prior to new data - unbounded<br />- Read and apply events<br />- Potentially multiple batches.<br />- For NDB$EPOCH_TRANS, potentially multiple passes<br />- Commit epoch transaction<br />- Update max replicated epoch to reflect new maximum.</li><li>Further Secondary-sourced modifications to the rows are now considered not-in-conflict<br /></li></ul><br />From the point of view of an external client with access to both Primary and Secondary clusters, the conflict window only extends from the time the transaction commits at the Primary to the end of the Secondary epoch in which the replicated operations are committed at the Secondary. 
Changes committed at the Secondary after this will clearly appear to the Primary to have occurred after its epoch was applied on the Secondary and therefore are not in-conflict.<br /><br />Assuming that both Clusters have the same <span>TimeBetweenEpochs</span>, we can simplify the Cluster conflict window to :<br /><pre>  Cluster_conflict_window_length = EpochDelay +<br />                                   P_Binlog_lag +<br />                                   S_Relay_lag +<br />                                   S_Binlog_lag +<br />                                   P_Relay_lag<br /><br />  Where<br />    EpochDelay minimum is 0<br />    EpochDelay avg     is TimeBetweenEpochs<br />    EpochDelay maximum is 2 * TimeBetweenEpochs<br /></pre><br /><br />Substituting the default value of <span>TimeBetweenEpochs</span> of 100 millis, we get :<br /><pre>    EpochDelay minimum is 0<br />    EpochDelay avg     is 100 millis<br />    EpochDelay maximum is 200 millis<br /></pre><br /><br />Note that TimeBetweenEpochs is an epoch-increment trigger delay.  The actual experienced time between epochs can be longer depending on system load.  The various Binlog and Relay log delays can vary from close to zero up to infinity.  
Infinity occurs when replication stops in either direction.<br /><br />The <span>Cluster conflict window</span> length can be thought of as both<br /><ul><li>The time taken to detect a conflict with a Primary transaction</li><li>The time taken for a committed Secondary transaction to become stable or be reverted</li></ul><br />We can define a <span>Client conflict window</span> <span>length </span>as either :<br /><pre> Primary-&gt;Secondary<br /><br />  Client_conflict_window_length = EpochDelay +<br />                                  P_Binlog_lag +<br />                                  S_Relay_lag +<br />                                  EpochDelay<br /><br />or<br /><br />Secondary-&gt;Primary<br /><br />  Client_conflict_window_length = EpochDelay +<br />                                  S_Binlog_lag +<br />                                  P_Relay_lag<br /><br />Where EpochDelay is defined as above.<br /></pre><br /><br />These definitions are asymmetric.  They represent the time taken by the system to determine that a particular change at one cluster definitely happened-before another change at the other cluster.  The asymmetry is due to the need for the Secondary part of a Primary-&gt;Secondary conflict to be recorded in a different Secondary epoch.  The first definition considers an initial change at the Primary cluster, and a following change at the Secondary.  The second definition is for the inverse case.<br /><br />An interesting observation is that for a single pair of near-concurrent updates at different clusters, happened-before depends only on latencies in one direction.  
For example, an update to the Primary at time <span>Ta</span>, followed by an update to the Secondary at time <span>Tb</span> will not be considered in conflict if:<br /><br /><pre> Tb - Ta &gt; Client_conflict_window_length(Primary-&gt;Secondary)<br /></pre><br /><br /><span>Client_conflict_window_length(Primary-&gt;Secondary)</span> depends on the <span>EpochDelay</span>, the <span>P_Binlog_lag</span> and <span>S_Relay_lag</span>, but not on the <span>S_Binlog_lag</span> or <span>P_Relay_lag</span>.  This can mean that high replication latency, or a complete outage in one direction does not always result in increased conflict rates.  However, in the case of multiple sequences of near-concurrent updates at different sites, it probably will.<br /><br />A general property of the NDB$EPOCH family is that the conflict rate has some dependency on the replication latency.  Whether two updates to the same row at times <span>Ta</span> and <span>Tb</span> are considered to be in conflict depends on the relationship between those times and the <span>current</span> system replication latencies.  This can remove the need for highly synchronised real-time clocks as recommended for NDB$MAX, but can mean that the observed conflict rate increases when the system is lagging.  This also implies that more work is required to catch up, which could further affect lag.  
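<br /><br />To make the window arithmetic above concrete, here is a minimal Python sketch of it.  The function names and the sample lag values are hypothetical and are not part of MySQL Cluster, all times are assumed to be in milliseconds, and any pair of updates that does not satisfy the inequality above is simply treated as potentially in conflict :<br />
<pre>
```python
def epoch_delay_bounds(time_between_epochs_ms=100.0):
    """Return (minimum, average, maximum) EpochDelay as listed in the post."""
    return (0.0, time_between_epochs_ms, 2 * time_between_epochs_ms)


def cluster_conflict_window(epoch_delay_ms, p_binlog_lag_ms, s_relay_lag_ms,
                            s_binlog_lag_ms, p_relay_lag_ms):
    """Sum the five terms of Cluster_conflict_window_length."""
    return (epoch_delay_ms + p_binlog_lag_ms + s_relay_lag_ms
            + s_binlog_lag_ms + p_relay_lag_ms)


def considered_in_conflict(ta_ms, tb_ms, client_window_ms):
    """An update at Ta followed by one at Tb is not in conflict when
    Tb - Ta > Client_conflict_window_length; anything else is treated
    here (an assumption of this sketch) as potentially in conflict."""
    return not (tb_ms - ta_ms > client_window_ms)


if __name__ == "__main__":
    minimum, average, maximum = epoch_delay_bounds()
    # The four lag values are made-up sample numbers, not measurements.
    window = cluster_conflict_window(average, 5.0, 20.0, 5.0, 20.0)
    print("EpochDelay min/avg/max:", minimum, average, maximum)
    print("Cluster conflict window:", window)
    # The client window of 125.0 is likewise just an example value.
    print("Updates 100 apart in conflict?", considered_in_conflict(0.0, 100.0, 125.0))
    print("Updates 200 apart in conflict?", considered_in_conflict(0.0, 200.0, 125.0))
```
</pre>
With these made-up lag values and the average EpochDelay of 100 millis, the sketch gives a Cluster conflict window of 150 millis; a real system would have to measure the Binlog and Relay log lags rather than assume them.<br /><br />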
NDB$MAX requires manual timestamp maintenance, and will not detect incorrect behaviour, but the basic decision on whether two updates are in-conflict is decided at commit time and is independent of the system replication latency.<br /><br />In summary :<br /><ul><li>The <span>Client_conflict_window_length</span> in either direction will on average not be less than the <span>EpochDelay</span> (100 millis by default)</li><li>Clients racing against replication to update both clusters need only beat the current <span>Client_conflict_window_length</span> to cause a conflict</li><li>Replication latencies in either direction are potentially independent</li><li>Detected conflict rates partly depend on replication latencies</li></ul><br /><span>Stability of reads from the Primary Cluster</span><br /><br />In the case of a conflict, the rows at the Primary Cluster will tentatively have replicated operations applied against them by a Slave MySQLD.   These conflicting operations will fail prior to commit as their interpreted precondition checks will fail, therefore the conflicting rows will not be modified on the Primary.  One effect of this is that a <span>read from the Primary Cluster only ever returns stable data</span>, as conflicting changes are never committed there.  
In contrast, a read from the Secondary Cluster returns data which has been committed, but may be subject to later 'rollback' via refresh operations from the Primary Cluster.<br /><br />The same stability-of-reads observation applies to a row change event stream on the Primary Cluster: events for a single key will be received in the order they were committed, and no later-to-be-rolled-back events will be observed in the stream.<br /><br /><span>Stability of reads from the Secondary Cluster</span><br /><br />If the Secondary Cluster is also receiving reflected applied epoch information back from the Primary, then it will know when its epoch <span>x</span> has been applied successfully at the Primary.  Therefore a read of some row <span>y</span> on the Secondary can be considered tentative while Max_Replicated_Epoch(Secondary) &lt; row_epoch(<span>y</span>), but once Max_Replicated_Epoch(Secondary) &gt;= row_epoch(<span>y</span>) the read can be considered stable.  This is because if the Primary were going to detect a conflict with a Secondary change committed in epoch <span>x</span>, then the refresh events associated with the conflict would be recorded in the same Primary epoch as the notification of the application of epoch <span>x</span>.  So if the Secondary observes the notification of epoch <span>x</span> (and updates Max_Replicated_Epoch accordingly), and row <span>y</span> is not modified in the same epoch transaction, then it is stable.  The time taken to reach stability after a Secondary Cluster commit will be the <span>Cluster conflict window length.</span><br /><br />Perhaps some applications can make better use of the potentially transiently inconsistent Secondary data by categorising their reads from the Secondary as either potentially-inconsistent or stable.  
To do this, they need to maintain Max_Replicated_Epoch(Secondary) (by listening to row change events on the ndb_apply_status table) and read the NDB$GCI_64 metacolumn when reading row data.  A read from the Secondary is stable if the NDB$GCI_64 values for all rows read are &lt;= the Secondary's Max_Replicated_Epoch.<br /><br />In the next post (final post, I promise!) I will describe the implementation of the transaction dependency tracking in NDB$EPOCH_TRANS, and review the implementation of both NDB$EPOCH and NDB$EPOCH_TRANS.<br /><br /><span>Edit 23/12/11 : Added index</span>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/19/eventual-consistency-in-mysql-cluster-implementation-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using MySQL Cluster to Protect &amp; Scale the HDFS Namenode</title>
		<link>http://blogs.oracle.com/MySQL/entry/using_mysql_cluster_to_protect?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-mysql-cluster-to-protect-scale-the-hdfs-namenode</link>
		<comments>http://blogs.oracle.com/MySQL/entry/using_mysql_cluster_to_protect#comments</comments>
		<pubDate>Mon, 19 Dec 2011 09:51:30 +0000</pubDate>
		<dc:creator>MySQL Community</dc:creator>
				<category><![CDATA[Cluster]]></category>
		<category><![CDATA[evaluation]]></category>
		<category><![CDATA[guide]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[MySQL Cluster]]></category>

		<guid isPermaLink="false">http://blogs.oracle.com/MySQL/entry/using_mysql_cluster_to_protect</guid>
		<description><![CDATA[The MySQL Cluster product team is always interested to see new and innovative uses of the database. Last week, a team of students at the KTH Royal Institute of Technology in Sweden blogged about their use of MySQL Cluster in creating a scalable and highly available HDFS Namenode.
  There are many established use cases of MySQL Cluster in the web, cloud/SaaS, telecoms and even flight control systems; you can see those we are allowed to talk about publicly here.
  The KTH team has been working on a project to move all of the metadata from the HDFS / Hadoop namenode to MySQL Cluster. Why did they want to do this, you may ask?
  - The namenode is a single point of failure. If it goes down, so too does the file system.
  - As a single server, the namenode becomes a bottleneck within heavily loaded HDFS / Hadoop deployments. As server resources are consumed and write volumes increase, the system can grind to a halt. (And with data volumes growing around 40% per year, this will only become more common!)
  So KTH decided to move metadata storage to MySQL Cluster. Why, you may ask?
  - MySQL Cluster already offered them a replicated, shared-nothing database, distributed across commodity hardware.
  - MySQL Cluster is widely deployed with proven stability.
  - The metadata can be distributed across nodes to scale out capacity, while retaining complete consistency to the clients and eliminating any Single Point of Failure.
  - Operations per second scale linearly across the cluster as new namenodes are added.
  Access to the cluster is via the MySQL Cluster Connector for Java, providing a NoSQL, Java-based ORM with very low latency. You can learn more about this ClusterJ API here.
  Of course, the work at KTH is ongoing, with future optimizations planned, which we will follow with interest.
  So how can you determine if MySQL Cluster is the right choice for your new project? We have just updated our MySQL Cluster Evaluation Guide (note: this link opens the PDF directly).
  This update is based around the latest MySQL Cluster 7.2 Development Release, which includes a series of enhancements to further broaden the use cases of MySQL Cluster, including:
  - 70x higher JOIN performance with Adaptive Query Localization pushing JOIN operations down to MySQL Cluster's data nodes
  - Native Key-Value Memcached interface to the cluster allowing schema and schemaless storage
  - New cross-data center scalability enhancements
  MySQL Cluster is not a fit for every use case, but by downloading the Evaluation Guide, you'll get a clear picture of where MySQL Cluster can be useful to you, and best practices in planning and executing your evaluation.
  Let us know of other interesting use cases in the comments below.]]></description>
			<content:encoded><![CDATA[
  <p>The <a href="http://mysql.com/products/cluster/">MySQL Cluster</a> product team is always interested to see new and innovative uses of the database. Last week, a team of students at the <a href="http://www.kth.se/en">KTH Royal Institute of Technology</a> in Sweden <a href="http://lalith.in/2011/12/15/towards-a-scalable-and-highly-available-namenode/%20">blogged about</a> their use of MySQL Cluster in creating a scalable and highly available HDFS&nbsp;Namenode.</p>
  <p>There are many established use cases of MySQL Cluster in the web, cloud/SaaS, telecoms and even flight control systems; you can see those we are allowed to talk about publicly <a href="http://mysql.com/customers/cluster/">here</a>.</p>
  <p>The KTH team has been working on a project to move all of the metadata from the HDFS / Hadoop namenode to MySQL Cluster. Why did they want to do this, you may ask?</p>
  <ul>
    <li>The namenode is a single point of failure. If it goes down, so too does the file system.</li>
    <li>As a single server, the namenode becomes a bottleneck within heavily loaded HDFS / Hadoop deployments. As server resources are consumed and write volumes increase, the system can grind to a halt. (And with data volumes growing around <a href="http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation">40% per year</a>, this will only become more common!)</li>
  </ul>
  <p>So KTH decided to move metadata storage to MySQL Cluster. Why, you may ask?</p>
  <ul>
    <li>MySQL Cluster already offered them a replicated, shared-nothing database, distributed across commodity hardware.</li>
    <li>MySQL Cluster is widely deployed with proven stability.</li>
    <li>The metadata can be distributed across nodes to scale out capacity, while retaining complete consistency to the clients and eliminating any Single Point of Failure.</li>
    <li>Operations per second scale linearly across the cluster as new namenodes are added.</li>
  </ul>
  <p>Access to the cluster is via the <a href="http://dev.mysql.com/doc/ndbapi/en/mccj-using-clusterj.html">MySQL Cluster Connector for Java</a>, providing a NoSQL, Java-based ORM with very low latency. You can learn more about this <a href="http://mysql.com/why-mysql/white-papers/mysql_wp_cluster_connector_for_java.php">ClusterJ API here</a>.</p>
  <p>Of course, the work at KTH is ongoing, with future optimizations planned, which we will follow with interest.</p>
  <p>So how can you determine if MySQL Cluster is the right choice for your new project? We have just updated our <a href="http://dev.mysql.com/downloads/MySQL_Cluster_72_DMR_EvaluationGuide.pdf">MySQL Cluster Evaluation Guide</a> (note: this link opens the PDF directly).</p>
  <p>This update is based around the latest <a href="http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2.html">MySQL Cluster 7.2 Development Release</a>, which includes a series of enhancements to further broaden the use cases of MySQL Cluster, including:</p>
  <ul>
    <li>70x higher JOIN performance with Adaptive Query Localization pushing JOIN operations down to MySQL Cluster's data nodes</li>
    <li>Native Key-Value Memcached interface to the cluster allowing schema and schemaless storage</li>
    <li>New cross-data center scalability enhancements</li>
  </ul>
  <p>MySQL Cluster is not a fit for every use case, but by downloading the Evaluation Guide, you'll get a clear picture of where MySQL Cluster can be useful to you, and best practices in planning and executing your evaluation.</p>
  <p>Let us know of other interesting use cases in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/12/19/using-mysql-cluster-to-protect-scale-the-hdfs-namenode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

