Archive for the ‘rambling’ Category

The CAP theorem and MySQL Cluster

Март 7th, 2012
tldr; A single MySQL Cluster prioritises Consistency in Network partition events. Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.


I was recently asked about the relationship between MySQL Cluster and the CAP theorem. The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast. You can have any two, but you can't have all three. For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'. CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?

Standard 'my interpretation of CAP' section

Everyone who discusses CAP like to rehash it, and I'm no exception. Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition. This is how I think of CAP. He also discusses related system behaviour in normal operation which I'll return to later.

While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition.

Network replicated database

CAP is only really relevant in the context of a network replicated database (or filesystem or state machine). A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network. Data can be read and updated. Updates are propagated to all nodes with replicas via the network. Database clients connect to database nodes via the network to read data and make updates. Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.

Availability

The network replicated database exists to provide services such as Read and Write on the data it stores. Its availability can be measured as the ability of any client to perform any service on any data item.

This Service Availability can be compromised by :
  • Failure of client nodes
  • Network failures between clients and database nodes
  • Network failures between database nodes
  • Failure of database nodes
Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP. However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP.

Service Availability is not binary, it can partially degrade, perhaps by affecting :
  • A subset of all clients
  • A subset of all stored data
  • A subset of request types

The shades of grey within the definition of availability are responsible for most of the arguments around CAP. If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain. If we take a more flexible approach then some service availabilty can be preserved even with a completely decimated network. In the loosest definition, if any client receives any service on any data, then the system is still available. Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage. Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.

Consistency

For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data. Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result. Where two or more updates to the same data item are submitted simulteneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results. This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.

If one copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously. Over time the data copies may be aligned, but clients accessing the data must beware that reads may not return the results of the most recently accepted writes. This behaviour may be described as eventual consistency. Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications. Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.

Network Partitions

Network partitions isolate subsets of the nodes of a network replicated database. The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :
  1. dead
  2. alive but isolated from clients
  3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency. The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them. If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one copy consistency as we cannot supply those new values in response to a read of our local copy. If those clients are not allowed to make writes then we have degraded service availability for them. Which is it to be? This is the unavoidable choice at the centre of the CAP theorem. Stated this way it seems less of a theorem and more of a fact.

Back to MySQL Cluster - which does it provide?

A single MySQL Cluster prioritises data consistency over availability when network partitions occur.

A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.

So you can have it both ways with MySQL Cluster - Great!

Single MySQL Cluster - CP

Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit. Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes. Where a network partition occurs, live nodes in each partition regroup and decide what to do next :
  • If there are not enough live nodes to serve all of the data stored - shutdown
    Serving a subset of user data (and risking data consistency) is not an option
  • If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
    No other subset of nodes can be isolated from us and serving clients
  • If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
    There could be another subset of nodes regrouped into a viable cluster out there.

Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others. In other words, arbitration occurs to preserve consistency.

Arbitration involves :
  • Database nodes agree on an arbitrator in advance
  • During node or network failure handling, no data writes are committed.
  • When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
  • Each request to the arbitrator will result in either : Yes, No or timeout
  • Anything other than Yes results in node shutdown.
  • The arbitrator only says Yes once per election round (First come first served). Therefore the arbitrator only says yes to one potential cluster in a partitioned network.

Note that arbitration is not the same as achieving a quorum. A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor. The arbitrator role is lightweight as it is not involved in normal traffic. I am surprised that the lightweight arbitrator pattern is not more common.

How does a single MySQL Cluster degrade service availability as a result of network partitions?

Where some subset of data nodes are isolated and shut-down :
  • Those nodes are 100% out of service, until they restart and can rejoin the cluster
    They will attempt to do so automatically
  • Any clients connected only to those nodes are out of service
    By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
  • The remaining live nodes are 100% in-service
  • Clients connected to the remaining live nodes are 100% in service
Where no subset of data nodes is live
  • All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster
    They will attempt to do so automatically.

A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions. It does not sacrifice consistency.

How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?

Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it. Some set of clients continue to receive 100% service. Only double failures in the network can cause a network partition resulting in full service loss.
Furthermore, network partitions are not the only risks to availability, software errors, power failures, upgrades, overloads are other potential sources of downtime which Cluster is designed to overcome.

Asynchronously replicating clusters - AP


Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters. Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not. On the other hand, availability is not compromised by network partitioning of the two clusters. Each cluster can continue to accept read and write requests to all of the data from any connected client.

Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.

How does consistency degrade between replicating MySQL Clusters during a network partition?

This depends on the conflict resolution function chosen, and how detected conflicts are handled. Some details of consistency guarantees provided by NDB$EPOCH et al are described here.

What about normal operation?

Abadi's post introduced his PACELC acronym, standing for something like :

 if (network Partition)
{
trade-off Availability vs Consistency;
}
else
{
trade-off Latency vs Consistency;
}


My first comment has to be that it's bad form to put the common case in an else branch!
However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition. The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency. In theory this normal operation latency-vs-consistency tradeoff could be completely independent to the Network Partitioning availability-vs-consistency tradeoff, e.g. you could have any of :
  1. PA EL (Partition - Availability, Else - Latency minimisation)
  2. PA EC (Partition - Availability, Else - Consistency)
  3. PC EL (Partition - Consistency, Else - Latency minimisation)
  4. PC EC (Partition - Consistency, Else - Consistency)

The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency. Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability. Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.

Examples of systems of each type :
  1. Any eventually consistent system, especially with local-database-node updates + reads
  2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
  3. ???
  4. Always consistent systems (e.g. single database instance, single MySQL Cluster)

I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition. Maybe this category should be called 'repentant systems'?

The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies. It is not enough to say that things will 'eventually' be consistent. It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.

There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database. The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use! A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs). These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.

Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented. Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower. This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable. Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.

At a wider network scale, latencies are naturally higher, implying that bandwidth is lower. The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable. The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable. Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data. For a particular application there may or may not be a point at which an AP system would meet its requirements better.

Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system. Paraphrasing John Mashey, "Bandwidth improves, latency is forever". Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.

Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available. Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links. As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.

Edit 7/3/12 : Fix bad link

PlanetMySQL Voting: Vote UP / Vote DOWN

One billion

Февраль 21st, 2012
As always, I am a little late, but I want to jump on the bandwagon and mention the recent MySQL Cluster milestone of passing 1 billion queries per minute. Apart from echoing the arbitrarily large ransom demand of Dr Evil, what does this mean?

Obviously 1 billion is only of interest to us humans as we generally happen to have 10 fingers, and seem to name multiples in steps of 10^3 for some reason. Each processor involved in this benchmark is clocked at several billion cycles per second, so a single billion is not so vast or fast.

Measuring over a minute also feels unnatural for a computer performance benchmark - we are used to lots of things happening every second! A minute is a long time in silicon.

What's more, these reads are served from tables stored entirely in memory - and everyone knows that main memory is infinitely fast and scalable and always getting cheaper, right?

If we convert to seconds we are left with only 17 million reads per second! Hardly worth getting out of bed for?

On the contrary, I think that achieving 17 million independent random reads per second, each read returning 100 bytes across a network, from a database that also supports arbitrary SQL, row locking, transactions, high availability and all sorts of other stuff, is pretty cool. I doubt that (m)any other similar databases can match this raw performance, though I look forward to being proved wrong.

(Also, don't forget to meet + beat 1.9 million random updates/s, synchronously replicated)

Raw performance is good, but not everyone just needs horsepower. The parallel, independent work on improving join performance (also known as SPJ/AQL) and query optimisation helps more applications harness this power, by improving the efficiency of joins.

I wrote a post about SPJ/AQL at the start of last year, when it was still in the early stages. Since then much has improved, to the extent that the performance improvement factors have become embarrassingly high on real user queries. A further post on the technical details of SPJ/AQL is long overdue... Perhaps the most interesting details are on the integration between the parallel, streaming linked operations and the essentially serialised MySQL Nested Loops join executor. A linked scan and lookup operation can be considered to be a form of parallel hash join, which the normal MySQL NLJ executor can invoke as part of executing a query. Who says Nested Loop joins can't scale?

PlanetMySQL Voting: Vote UP / Vote DOWN

ACID tradeoffs, modularity, plugins, Drizzle

Март 25th, 2010
Most software people are aware of the ACID acronym coined by Jim Gray. With the growth of the web and open source, the scaling and complexity constraints imposed on DBMS implementations supporting ACID are more visible, and new (or at least new terms for known) compromises and tradeoffs are being discussed widely. The better known NoSQL systems are giving insight by example into particular choices of tradeoffs.

Working at MySQL, I have often been surprised at the variety of potential alternatives when implementing a DBMS, and the number of applications which don't need the full set of ACID letters in the strictest form. The original MySQL storage engine, MyISAM is one of the first and most successful examples of an 'ACID remix'. The people drawn to DBMS development work often have a perfectionist streak, which can cause them to tend to prefer 'nothing' over 'imperfect'. MyISAM was and still is a flag-bearer for 'good enough'. Perhaps we should be less modest and call it 'more than good enough'.

One seldom discussed benefit of MySQL's storage engine architecture is that pressure to make 'The One True Storage Engine' is reduced. DBMS products with one fixed database engine need to optimise for all supported use cases. This is a great engineering challenge, but increases design effort, requirements for configuration and auto-tuning, constraints on any design change or reoptimisation etc. With MySQL, there are multiple existing storage engines, each with a (sub)set of target use-cases in mind. A single MySQL server can maintain and access tables in different storage engines, each tuned as closely as possible to the use-case for the data, without adding complexity to unrelated engines. Engines can be wildly optimised for a narrow use case as there are plausible alternative engines available for other use cases.

I understand that one aim of the Drizzle project is to extend the modularity of the MySQL Server on multiple axes, allowing diversity to flourish. As a one-time Java coder, who enjoyed the pleasures of design-by-interface, I can see the attraction. While the effort is guided by an actual need for modularity and real examples of alternative plugins, it can be a great force multiplier. There is always the risk of modularity for its own sake - a branch of Architecture Astronautics. Sure symptoms, which I may have suffered from in the past, include the class names FactoryFactory..., PolicyPolicy, or [Anything]Broker).

Another good vibe from Drizzle is the microkernel concept, although would say that there's some terminological abuse occurring here! Perhaps it could more reasonably be said that MySQL has a TeraKernel and Drizzle has a MegaKernel? In any case the motivations are good. Decoupling the huge chunks of functionality glued together inside MySQLD is great for long term software integrity, understanding dependencies, finding (and introducing) bugs, and might make it easier to start adding functionality again. Replication seems especially ripe ground for alternative plugins. User authentication is another often requested 'chunk'. It will take longer to crystalise interfaces for more deeply embedded areas like the query Optimizer/Executor, but if these interfaces are arising from a real need then that can drive the API design.

One aspect of storage engine modularity that is not often mentioned is that some MySQL storage engines also moonlight with other products. The Berkeley database (BDB) is probably the oldest and most promiscuous, embedded in DNS daemons, LDAP servers and all sorts of other places. Ndb is unusual in that it can be used from separate MySQLD and other NdbApi processes at the same time. InnoDB has also recently added an embedded variant. This trend will accelerate, especially when some of the distributed NoSQL systems start supporting 'pluggable local storage' APIs. I imagine that a NoSQL local storage engine API could be somewhat simpler to implement than the MySQL SE API, at least to start with!

PlanetMySQL Voting: Vote UP / Vote DOWN