Archive for the ‘NoSQL’ Category

Game Over for NoSQL? Discussing Databases in Online Social Gaming

Февраль 3rd, 2012
According to VentureBeat*, games companies raised a record-breaking $1.54 billion in funding last year and social gaming accounted for over half of that. No wonder everyone wants to have a piece of that pie!

With the arrival of social network platforms, the gaming industry has seen an explosion in casual and social gaming. The social gamer represents a massive audience that cuts across all age, gender and demographic boundaries. Online social games are some of the most demanding applications in the world, with millions of users, stringent response times, complex simulation models and billing requirements. Games take years to develop for a reason ...

Online social games are data-driven applications, and databases are central to these applications. However, there is no single database architecture that will fit the different types of data that the application needs to store. A data management architecture needs to account for the diversity of data, and optimize for some of the differences in the datatypes. E.g. it is ok to lose leaderboard data during a game as it can be reconstituted, whereas billing data needs to be 100% ACID.

Therefore, with the generous contribution of Joshua Butcher, we just published a whitepaper that discusses the different types of data stored for various functions in social gaming. We will see that there cannot be a one-size-fits-all approach to database architecture, and suggest a sharding strategy based on schema partitioning.

With our new whitepaper, we’re also starting a discussion on what the database of choice might be for anyone wanting to develop online social games. With so many NoSQL databases now available, one might wonder why MySQL would be a good database choice for the gaming industry. To find out, download our whitepaper today!

If you have any questions or comments, feel free to reply to this blog below or reach out to us on Facebook, LinkedIn, Xing, Twitter or directly via these contact details.

* http://venturebeat.com/2012/01/06/deanbeat-game-companies-raised-a-record-breaking-1-55b-in-2011/

PlanetMySQL Voting: Vote UP / Vote DOWN

Last chance to take part in our MySQL/NoSQL/NewSQL survey

Февраль 2nd, 2012

Thanks to everyone who has already taken part in our survey exploring changing attitudes to MySQL following its acquisition by Oracle and examining the competitive dynamic between MySQL and other database technologies, including NoSQL and NewSQL.

The response has been great and even a quick look at the results makes for interesting reading, particularly in the light of our previous findings which indicated declining MySQL usage.

I am really looking forward to having the opportunity for a deep dive into the results and break out the figures to get a better understanding of the potential impact of alternative MySQL distribution and support providers, as well as NoSQL and NewSQL, on continued usage of MySQL.

The survey results will be made freely available on our blogs, as well as being included in a long format report containing our additional analysis and research related to the MySQL ecosystem and competitive dynamic.

Right now, however, is your last chance to contribute to the survey and get your voice heard. There are just 12 questions to answer, spread over four pages, and the entire survey should take no longer than five minutes to complete. All individual responses are of course confidential.

The survey will close in 24 hours.


PlanetMySQL Voting: Vote UP / Vote DOWN

Big Kettle News

Январь 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Big Kettle News

Январь 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Big Kettle News

Январь 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Is MySQL usage really declining?

Январь 23rd, 2012

If you’re a MySQL user, tell us about your adoption plans by taking our current survey.

Back in late 2009, at the height of the concern about Oracle’s imminent acquisition of Sun Microsystems and MySQL, 451 Research conducted a survey of open source software users to assess their database usage and attitudes towards Oracle.

The results provided an interesting snapshot of the potential implications of the acquisition and the concerns of MySQL users and even, so I am told, became part of the European Commission’s hearing into the proposed acquisition (used by both sides, apparently, which says something about both our independence and the malleability of data).

One of the most interesting aspects concerned the apparently imminent decline in the usage of MySQL. Of the 285 MySQL users in our 2009 survey, only 90.2% still expected to be using it two years later, and only 81.8% in 2014.

Other non-MySQL users expected to adopt the open source database after 2009, but the overall prediction was decline. While 82.1% of our sample of 347 open source users were using MySQL in 2009, only 78.7% expected to be using it in 2011, declining to 72.3% in 2014.

This represented an interesting snapshot of sentiment towards MySQL, but the result also had to be taken with a pinch of salt given the significant level of concern regarding MySQL future at the time the survey was conducted.

The survey also showed that only 17% of MySQL users thought that Oracle should be allowed to keep MySQL, while 14% of MySQL users were less likely to use MySQL if Oracle completed the acquisition.

That is why we are asking similar questions again, in our recently launched MySQL/NoSQL/NewSQL survey.

More than two years later Oracle has demonstrated that it did not have nefarious plans for MySQL. While its stewardship has not been without controversial moments, Oracle has also invested in the MySQL development process and improved the performance of the core product significantly. There are undoubtedly users that have turned away from MySQL because of Oracle but we also hear of others that have adopted the open source database specifically because of Oracle’s backing.

That is why we are now asking MySQL users to again tell us about their database usage, as well as attitudes to MySQL following its acquisition by Oracle. Since the database landscape has changed considerably late 2009, we are now also asking about NoSQL and NewSQL adoption plans.

Is MySQL usage really in decline, or was the dip suggested by our 2009 survey the result of a frenzy of uncertainty and doubt given the imminent acquisition. Will our current survey confirm or contradict that result? If you’re a MySQL user, tell us about your adoption plans by taking our current survey.


PlanetMySQL Voting: Vote UP / Vote DOWN

CAOS Theory Podcast 2012.01.20

Январь 20th, 2012

Topics for this podcast:

*Hadoop v1.0 and year ahead
*Oracle-Cloudera deal for more Hadoop
*Oracle’s ‘Sun spot’ with Solaris
*Open Source M&A outlook for 2012
*Our new MySQL/NoSQL/NewSQL survey

iTunes or direct download (28:49, 4.9MB)


PlanetMySQL Voting: Vote UP / Vote DOWN

451 Research MySQL/NoSQL/NewSQL survey

Январь 18th, 2012

I’ve just launched a new survey that should be of interest if you are currently using or actively considering MySQL or any of the NoSQL or NewSQL offerings

The aim of the survey is threefold:

- identify trends in database usage over time
- explore changing attitudes to MySQL following its acquisition by Oracle
- examine the competitive dynamic between MySQL and other database technologies, including NoSQL and NewSQL

There are just 12 questions to answer, spread over four pages, and the entire survey should take no longer than five minutes to complete.

All individual responses are of course confidential. The results will be published as part of a major research report due at the end of Q1. Thanks in advance for your participation.

The survey can be found at: http://www.surveymonkey.com/s/MySQLNoSQLNewSQL


PlanetMySQL Voting: Vote UP / Vote DOWN

Eventual Consistency in MySQL Cluster — implementation part 3

Декабрь 22nd, 2011



As promised, this is the final post in a series looking at eventual consistency with MySQL Cluster asynchronous replication. This time I'll describe the transaction dependency tracking used with NDB$EPOCH_TRANS and review some of the implementation properties.

Transaction based conflict handling with NDB$EPOCH_TRANS

NDB$EPOCH_TRANS is almost exactly the same as NDB$EPOCH, except that when a conflict is detected on a row, the whole user transaction which made the conflicting row change is marked as conflicting, along with any dependent transactions. All of these rejected row operations are then handled using inserts to an exceptions table and realignment operations. This helps avoid the row-shear problems described here.

Including user transaction ids in the Binlog

Ndb Binlog epoch transactions contain row events from all the user transactions which committed in an epoch. However there is no information in the Binlog indicating which user transaction caused each row event. To allow detected conflicts to 'rollback' the other rows modified in the same user transaction, the Slave applying an epoch transaction needs to know which user transaction was responsible for each of the row events in the epoch transaction. This information can now be recorded in the Binlog by using the --ndb-log-transaction-id MySQLD option. Logging Ndb user transaction ids against rows in-turn requires a v2 format RBR Binlog, enabled with the --log-bin-use-v1-row-events=0 option. The mysqlbinlog --verbose tool can be used to see per-row transaction information in the Binlog.

User transaction ids in the Binlog are useful for NDB$EPOCH_TRANS and more. One interesting possibility is to use the user transaction ids and same-row operation dependencies to sort the row events inside an epoch into a partial order. This could enable recovery to a consistent point other than an epoch boundary. A project for a rainy day perhaps?

NDB$EPOCH_TRANS multiple slave passes

Initially, NDB$EPOCH_TRANS proceeds in the same way as NDB$EPOCH, attempting to apply replicated row changes, with interpreted code attached to detect conflicts. If no row conflicts are detected, the epoch transaction is committed as normal with the same minimal overhead as NDB$EPOCH. However if a row conflict is detected, the epoch transaction is rolled back, and reapplied. This is where NDB$EPOCH_TRANS starts to diverge from NDB$EPOCH.

In this second pass, the user transaction ids of rows with detected conflicts are tracked, along with any inter-transaction dependencies detectable from the Binlog. At the end of the second pass, prior to commit, the set of conflicting user transactions is combined with the user transaction dependency data to get a complete set of conflicting user transactions. The epoch transaction initiated in the second pass is then rolled-back and a third pass begins.

In the third pass, only row events for non-conflicting transactions are applied, though these are still applied with conflict detecting interpreted programs attached in case a further conflict has arisen since the second pass. Conflict handling for row events belonging to conflicting transactions is performed in the same way as NDB$EPOCH. Prior to commit, the applied row events are checked for further conflicts. If further conflicts have occurred then the epoch transaction is rolled back again and we return to the second pass. If no further conflicts have occurred then the epoch transaction is committed.

These three passes, and associated rollbacks are only externally visible via new counters added to the MySQLD server. From an external observer's point of view, only non-conflicting transactions are committed, and all row events associated with conflicting transactions are handled as conflicts. As an optimisation, when transactional conflicts have been detected, further epochs are handled with just two passes (second and third) to improve efficiency. Once an epoch transaction with no conflicts has been applied, further epochs are initially handled with the more optimistic and efficient first pass.

Dependency tracking implementation

To build the set of inter-transaction dependencies and conflicts, two hash tables are used. The first is a unique hashmap mapping row event tables and primary keys to transaction ids. If two events for the same table and primary key are found in a single epoch transaction then there is a dependency between those events, specifically the second event depends on the first. If the events belong to different user transactions then there is a dependency between the transactions.

Transaction dependency detection hash :
{Table, Primary keys} -> {Transaction id}

The second hash table is a hashmap of transaction id to an in-conflict marker and a list of dependent user transactions. When transaction dependencies are discovered using the first dependency detection hash, the second hash is modified to reflect the dependency. By the end of processing the epoch transaction, all dependencies detectable from the Binlog are described.

Transaction dependency tracking and conflict marking hash :
{Transaction id} -> {in_conflict, List}

As epoch operations are applied and row conflicts are detected, the operation's user transaction id is marked in the dependency hash as in-conflict. When marking a transaction as in-conflict, all of its dependent transactions must also be transitively marked as in-conflict. This is done by a traverse through the dependency tree of the in-conflict transaction. Due to slave batching, the addition of new dependencies and the marking of conflicting transactions is interleaved, so adding a dependency can result in a sub-tree being marked as in-conflict.

After the second pass is complete, the transaction dependency hash is used as a simple hash for looking up whether a particular transaction id is in conflict or not :

Transaction in-conflict lookup hash :
{Transaction id} -> {in_conflict}

This is used in the third pass to determine whether to apply each row event, or to proceed straight to conflict handling.

The size of these hashes, and the complexity of the dependency graph is bounded by the size of the epoch transaction. There is no need to track dependencies across the boundary of two epoch transactions, as any dependencies will be discovered via conflicts on the data committed by the first epoch transaction when attempting to apply the second epoch transaction.

Event counters

Like the existing conflict detection functions, NDB$EPOCH_TRANS has a row-conflict detection counter called ndb_conflict_epoch_trans.

Additional counters have been added which specifically track the different events associated with transactional conflict detection. These can be seen with the usual SHOW GLOBAL STATUS LIKE syntax, or via the INFORMATION_SCHEMA tables.

  • ndb_conflict_trans_row_conflict_count
    This is essentially the same as ndb_conflict_epoch_trans - the number of row events with conflict detected.
  • ndb_conflict_trans_row_reject_count
    The number of row events which were handled as in-conflict. It will be at least as large as ndb_conflict_trans_row_count, and will be higher if other rows are implicated by being in a conflicting transaction, or being dependent on a row in a conflicting transaction.
    A separate ndb_conflict_trans_row_implicated_count could be constructed as ndb_conflict_trans_row_reject_count - ndb_conflict_trans_row_conflict_count
  • ndb_conflict_trans_reject_count
    The number of discrete user transactions detected as in-conflict.
  • ndb_conflict_trans_conflict_commit_count
    The number of epoch transactions which had transactional conflicts detected during application.
  • ndb_conflict_trans_detect_iter_count
    The number of iterations of the three-pass algorithm that have occurred. Each set of passes counts as one. Normally this would be the same as ndb_conflict_trans_conflict_commit_count. Where further conflicts are found on the third pass, another iteration may be required, which would increase this count. So if this count is larger than ndb_conflict_trans_conflict_commit_count then there have been some conflicts generated concurrently with conflict detection, perhaps suggesting a high conflict rate.


Performance properties of NDB$EPOCH and NDB$EPOCH_TRANS

I have tried to avoid getting involved in an explanation of Ndb replication in general which would probably fill a terabyte of posts. Comparing replication using NDB$EPOCH and NDB$EPOCH_TRANS relative to Ndb replication with no conflict detection, what can we can say?

  • Conflict detection logic is pushed down to data nodes for execution
    Minimising extra data transfer + locking
  • Slave operation batching is preserved
    Multiple row events are applied together, saving MySQLD <-> data node round trips, using data node parallelism
    For both algorithms, one extra MySQLD <-> data node round-trip is required in the no-conflicts case (best case)
  • NDB$EPOCH : One extra MySQLD <-> data node round-trip is required per *batch* in the all-conflicts case (worst case)
  • NDB$EPOCH : Minimal impact to Binlog sizes - one extra row event per epoch.
  • NDB$EPOCH : Minimal overhead to Slave SQL CPU consumption
  • NDB$EPOCH_TRANS : One extra MySQLD <-> data node round-trip is required per *batch* per *pass* in the all-conflicts case (worst case)
  • NDB$EPOCH_TRANS : One round of two passes is required for each conflict newly created since the previous pass.
  • NDB$EPOCH_TRANS : Small impact to Binlog sizes - one extra row event per epoch plus one user transaction id per row event.
  • NDB$EPOCH_TRANS : Small overhead to Slave SQL CPU consumption in no-conflict case

Current and intrinsic limitations

These functions support automatic conflict detection and handling without schema or application changes, but there are a number of limitations. Some limitations are due to the current implementation, some are just intrinsic in the asynchronous distributed consistency problem itself.

Intrinsic limitations
  • Reads from the Secondary are tentative
    Data committed on the secondary may later be rolled back. The window of potential rollback is limited, after which Secondary data can be considered stable. This is described in more detail here.
  • Writes to the Secondary may be rolled back
    If this occurs, the fact will be recorded on the Primary. Once a committed write is stable it will not be rolled back.
  • Out-of-band dependencies between transactions are out-of-scope
    For example direct communication between two clients creating a dependency between their committed transactions, not observable from their database footprints.

Current implementation limitations

  • Detected transaction dependencies are limited to dependencies between binlogged writes (Insert, Update, Delete)
    Reads are not currently included.
  • Delete vs Delete+Insert conflicts risk data divergence
    Delete vs Delete conflicts are detected, but currently do not result in conflict handling, so that Delete vs Delete + Insert can result in data divergence.
  • With NDB$EPOCH_TRANS, unplanned Primary outages may require manual steps to restore Secondary consistency
    With pending multiple, time spaced, non-overlapping transactional conflicts, an unexpected failure may need some Binlog processing to ensure consistency.

Want to try it out?

Andrew Morgan has written a great post showing how to setup NDB$EPOCH_TRANS. He's even included non-ascii art. This is probably the easiest way to get started. NDB$EPOCH is slightly easier to get started with as the --ndb-log-transaction-id (and Binlog v2) options are not required.

Edit 23/12/11 : Added index

PlanetMySQL Voting: Vote UP / Vote DOWN

Eventual consistency in MySQL Cluster — implementation part 2

Декабрь 19th, 2011



In previous posts I described how row conflicts are detected using epochs. In this post I describe how they are handled.

Row based conflict handling with NDB$EPOCH


Once a row conflict is detected, as well as rejecting the row change, row based conflict handling in the Slave will :
  • Increment conflict counters
  • Optionally insert a row into an exceptions table
For NDB$EPOCH, conflict detection and handling operates on one Cluster in an Active-Active pair designated as the Primary. When a Slave MySQLD attached to the Primary Cluster detects a conflict between data stored in the Primary and a replicated event from the Secondary, it needs to realign the Secondary to store the same values for the conflicting data. Realignment involves injecting an event into the Primary Cluster's Binlog which, when applied idempotently on the Secondary Cluster, will force the row on the Secondary Cluster to take the supplied values. This requires either a WRITE_ROW event, with all columns, or a DELETE_ROW event with just the primary key columns. These events can be thought of as compensating events used to revert the original effect of the rejected events.

Conflicts are detected by a Slave MySQLD attached to the Primary Cluster, and realignment events must appear in Binlogs recorded by the same MySQLD and/or other Binlogging MySQLDs attached to the Primary Cluster. This is achieved using a new NdbApi primary key operation type called refreshTuple.

When a refreshTuple operation is executed it will :
  1. Lock the affected row/primary key until transaction commit time, even if it does not exist (much as an Insert would).
  2. Set the affected row's author metacolum to 0
    The refresh is logically a local change
  3. On commit
    - Row exists case : Set the row's last committed epoch to the current epoch
    - Cause a WRITE_ROW (row exists case) or DELETE_ROW (no row exists) event to be generated by attached Binlogging MySQLDs.

Locking the row as part of refreshTuple serialises the conflicting epoch transaction with other potentially conflicting local transactions. Updating the stored epoch and author metacolumns results in the conflicting row conflicting with any further replicated changes occurring while the realignment event is 'in flight'. The compensating row events are effectively new row changes originating at the Primary cluster which need to be monitored for conflicts in the same way as normal row changes.

It is important that the Slave running at the Secondary Cluster where the realignment events will be applied, is running in idempotent mode, so that it can handle the realignment events correctly. If this is not the case then WRITE_ROW realignment events may hit 'Row already exists' errors, and DELETE_ROW realignment events may hit 'Row does not exist' errors.

Observations on conflict windows and consistency

When a conflict is detected, the refresh process results in the row's epoch and author metacolumns being modified so that the window of potential conflict is extended, until the epoch in which the refresh operation was recorded has itself been reflected. If ongoing updates at both clusters continually conflict then refresh operations will continue to be generated, and the conflict window will remain open until a refresh operation manages to propagate with no further conflicts occurring. As with any eventually consistent system, consistency is only guaranteed when the system (or at least the data of interest) is quiescent for a period.

From the Primary cluster's point of view, the conflict window length is the time between committing a local transaction in epoch n, and the attached Slave committing a replicated epoch transaction indicating that epoch n has been applied at the Secondary. Any Secondary-sourced overlapping change applied in this time is in-conflict.

This Cluster conflict window length is comprised of :

  • Time between commit of transaction, and next Primary Cluster epoch boundary
    (Worst = 1 * TimeBetweenEpochs, Best = 0, Avg = 0.5 * TimeBetweenEpochs)
  • Time required to log event in Primary Cluster's Binlogging MySQLDs Binlog (~negligible)
  • Time required for Secondary Slave MySQLD IO thread to
    - Minimum : Detect new Binlog data - negligible
    - Maximum : Consume queued Binlog prior to the new data - unbounded
    - Pull new epoch transaction
    - Record in Relay log
  • Time required for Secondary Slave MySQLD SQL thread to
    - Minimum : Detect new events in relay log
    - Maximum : Consume queued Relay log prior to new data - unbounded
    - Read and apply events
    - Potentially multiple batches.
    - Commit epoch transaction at Secondary
  • Time between commit of replicated epoch transaction and next Secondary Cluster epoch boundary
    (Worst = 1 * TimeBetweenEpochs, Best = 0, Avg = 0.5 * TimeBetweenEpochs)
  • After this point a Secondary-local commit on the data is possible without conflict
  • Time required to log event in Secondary Cluster's Binlogging MySQLDs Binlog (~negligible)
  • Time required for Primary Slave MySQLD IO thread to
    - Minimum : Detect new Binlog data
    - Maximum : Consume queued Binlog data prior to the new data - unbounded
    - Pull new epoch transaction
    - Record in Relay log
  • Time required for Primary Slave MySQLD SQL thread to
    - Minimum : Detect new events in relay log
    - Maximum : Consume queued Relay log prior to new data - unbounded
    - Read and apply events
    - Potentially multiple batches.
    - For NDB$EPOCH_TRANS, potentially multiple passes
    - Commit epoch transaction
    - Update max replicated epoch to reflect new maximum.
  • Further Secondary sourced modifications to the rows are now considered not-in-conflict

From the point of view of an external client with access to both Primary and Secondary clusters, the conflict window only extends from the time transaction commit occurs at the Primary to the time the replicated operations are applied at the Secondary, and its commit time Secondary epoch ends. Changes committed at the Secondary after this will clearly appear to the Primary to have occurred after its epoch was applied on the Secondary and therefore are not in-conflict.

Assuming that both Clusters have the same TimeBetweenEpochs, we can simplify the Cluster conflict window to :
  Cluster_conflict_window_length = EpochDelay +
P_Binlog_lag +
S_Relay_lag +
S_Binlog_lag +
P_Relay_lag

Where
EpochDelay minimum is 0
EpochDelay avg is TimeBetweenEpochs
EpochDelay maximum is 2 * TimeBetweenEpochs


Substituting the default value of TimeBetweenEpochs of 100 millis, we get :
     EpochDelay minimum is 0
EpochDelay avg is 100 millis
EpochDelay maximum is 200 millis


Note that TimeBetweenEpochs is an epoch-increment trigger delay. The actual experienced time between epochs can be longer depending on system load. The various Binlog and Relay log delays can vary from close to zero up to infinity. Infinity occurs when replication stops in either direction.

The Cluster conflict window length can be thought of as both
  • The time taken to detect a conflict with a Primary transaction
  • The time taken for a committed Secondary transaction to become stable or be reverted

We can define a Client conflict window length as either :
 Primary->Secondary

Client_conflict_window_length = EpochDelay +
P_Binlog_lag +
S_Relay_lag +
EpochDelay

or

Secondary->Primary

Client_conflict_window_length = EpochDelay +
S_Binlog_lag +
P_Relay_lag

Where EpochDelay is defined as above.


These definitions are asymmetric. They represent the time taken by the system to determine that a particular change at one cluster definitely happened-before another change at the other cluster. The asymmetry is due to the need for the Secondary part of a Primary->Secondary conflict to be recorded in a different Secondary epoch. The first definition considers an initial change at the Primary cluster, and a following change at the Secondary. The second definition is for the inverse case.

An interesting observation is that for a single pair of near-concurrent updates at different clusters, happened-before depends only on latencies in one direction. For example, an update to the Primary at time Ta, followed by an update to the Secondary at time Tb will not be considered in conflict if:

 Tb - Ta > Client_conflict_window_length(Primary->Secondary)


Client_conflict_window_length(Primary->Secondary) depends on the EpochDelay, the P_Binlog_lag and S_Relay_lag, but not on the S_Binlog_lag or P_Relay_lag. This can mean that high replication latency, or a complete outage in one direction does not always result in increased conflict rates. However, in the case of multiple sequences of near-concurrent updates at different sites, it probably will.

A general property of the NDB$EPOCH family is that the conflict rate has some dependency on the replication latency. Whether two updates to the same row at times Ta and Tb are considered to be in conflict depends on the relationship between those times and the current system replication latencies. This can remove the need for highly synchronised real-time clocks as recommended for NDB$MAX, but can mean that the observed conflict rate increases when the system is lagging. This also implies that more work is required to catch up, which could further affect lag. NDB$MAX requires manual timestamp maintenance, and will not detect incorrect behaviour, but the basic decision on whether two updates are in-conflict is decided at commit time and is independent of the system replication latency.

In summary :
  • The Client_conflict_window_length in either direction will on average not be less than the EpochDelay (100 millis by default)
  • Clients racing against replication to update both clusters need only beat the current Client_conflict_window_length to cause a conflict
  • Replication latencies in either direction are potentially independent
  • Detected conflict rates partly depend on replication latencies

Stability of reads from the Primary Cluster

In the case of a conflict, the rows at the Primary Cluster will tentatively have replicated operations applied against them by a Slave MySQLD. These conflicting operations will fail prior to commit as their interpreted precondition checks will fail, therefore the conflicting rows will not be modified on the Primary. One effect of this is that a read from the Primary Cluster only ever returns stable data, as conflicting changes are never committed there. In contrast, a read from the Secondary Cluster returns data which has been committed, but may be subject to later 'rollback' via refresh operations from the Primary Cluster.

The same stability of reads observation applies to a row change event stream on the Primary Cluster - events received for a single key will be received in the order they were committed, and no later-to-be-rolled-back events will be observed in the stream.

Stability of reads from the Secondary Cluster

If the Secondary Cluster is also receiving reflected applied epoch information back from the Primary then it will know when it's epoch x has been applied successfully at the Primary. Therefore a read of some row y on the Secondary can be considered tentative while Max_Replicated_Epoch(Secondary) < row_epoch(y), but once Max_Replicated_Epoch(Secondary) >= row_epoch(y) then the read can be considered stable. This is because if the Primary were going to detect a conflict with a Secondary change committed in epoch x, then the refresh events associated with the conflict would be recorded in the same Primary epoch as the notification of the application of epoch x. So if the Secondary observes the notification of epoch x (and updates Max_Replicated_Epoch accordingly), and row y is not modified in the same epoch transaction, then it is stable. The time taken to reach stability after a Secondary Cluster commit will be the Cluster conflict window length.

Perhaps some applications can make better use of the potentially transiently inconsistent Secondary data by categorising their reads from the Secondary as either potentially-inconsistent or stable. To do this, they need to maintain Max_replicated_epoch(Secondary) (By listening to row change events on the ndb_apply_status table) and read the NDB$GCI_64 metacolumn when reading row data. A read from the Secondary is stable if all the NDB$GCI_64 values for all rows read are <= the Secondary's Max_Replicated_Epoch.

In the next post (final post I promise!) I will describe the implementation of the transaction dependency tracking in NDB$EPOCH_TRANS, and review the implementation of both NDB$EPOCH and NDB$EPOCH_TRANS.

Edit 23/12/11 : Added index

PlanetMySQL Voting: Vote UP / Vote DOWN