Archive for the ‘message-passing’ Category

The CAP theorem and MySQL Cluster

Март 7th, 2012
tldr; A single MySQL Cluster prioritises Consistency in Network partition events. Asynchronously replicating MySQL Clusters prioritise Availability in Network partition events.


I was recently asked about the relationship between MySQL Cluster and the CAP theorem. The CAP theorem is often described as a pick two out of three problem, such as choosing from good, cheap, fast. You can have any two, but you can't have all three. For CAP the three qualities are 'Consistency', 'Availability' and 'Partition tolerance'. CAP states that in a system with data replicated over a network only two of these three qualities can be maintained at once, so which two does MySQL Cluster provide?

Standard 'my interpretation of CAP' section

Everyone who discusses CAP like to rehash it, and I'm no exception. Daniel Abadi has the best CAP write-up that I've read so far, which reframes CAP as a decision about whether to ultimately prioritise availability or data consistency in the event of a network partition. This is how I think of CAP. He also discusses related system behaviour in normal operation which I'll return to later.

While this reframing clarifies CAP, the terms network partition, availability and consistency also need some definition.

Network replicated database

CAP is only really relevant in the context of a network replicated database (or filesystem or state machine). A network replicated database stores copies of data in multiple different systems (database nodes), connected by a network. Data can be read and updated. Updates are propagated to all nodes with replicas via the network. Database clients connect to database nodes via the network to read data and make updates. Replication may occur to improve availability, to improve request latency, or to improve read bandwidth.

Availability

The network replicated database exists to provide services such as Read and Write on the data it stores. Its availability can be measured as the ability of any client to perform any service on any data item.

This Service Availability can be compromised by :
  • Failure of client nodes
  • Network failures between clients and database nodes
  • Network failures between database nodes
  • Failure of database nodes
Client node and networking failures cannot really be considered a property within the control of a database system, so I consider their effects out of the scope of CAP. However, where clients connect to a database node, and that database node is isolated from other database nodes, whether or not those clients are given service is within the scope of CAP.

Service Availability is not binary, it can partially degrade, perhaps by affecting :
  • A subset of all clients
  • A subset of all stored data
  • A subset of request types

The shades of grey within the definition of availability are responsible for most of the arguments around CAP. If we take a strict view - either all services available on all data for all clients, or nothing, then availability is fragile and hard to maintain. If we take a more flexible approach then some service availabilty can be preserved even with a completely decimated network. In the loosest definition, if any client receives any service on any data, then the system is still available. Rather than choose one position, I regard availability as a range from 100% down to 0% for a full outage. Anything in the middle is reduced availability, but it does not mean that the system is not serving its purpose adequately.

Consistency

For consistency to be satisfied, the multiple replicas of data in a network replicated database should behave as though there were only one copy of the data. Simultaneous reads of the same data item from clients connected to different database nodes must always return the same result. Where two or more updates to the same data item are submitted simulteneously, they must be serialised, or one must be rejected, or they must be merged so that a single value results. This one-copy model makes it simple for database clients to use the network replicated database as if it were a single database system with one atomically read/written copy of their data.

If one copy consistency is relaxed, then different database nodes may observably have different values for the same data item simultaneously. Over time the data copies may be aligned, but clients accessing the data must beware that reads may not return the results of the most recently accepted writes. This behaviour may be described as eventual consistency. Providing eventual consistency allows a network replicated database to maximise availability, but pushes the problem of dealing with transient inconsistencies up the stack to user applications. Furthermore there are varying qualities of eventual consistency, with varying guarantees and levels of application support available.

Network Partitions

Network partitions isolate subsets of the nodes of a network replicated database. The interesting property of a network partition is that each node subset cannot tell whether the other node subset(s) are :
  1. dead
  2. alive but isolated from clients
  3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a system to decide between maximising service availability and maximising consistency. The interesting case is 3) where some database nodes (potentially containing all or some of the data) are alive elsewhere and have clients connected to them. If those clients are allowed to make writes on data copies stored on those database nodes, then we must lose one copy consistency as we cannot supply those new values in response to a read of our local copy. If those clients are not allowed to make writes then we have degraded service availability for them. Which is it to be? This is the unavoidable choice at the centre of the CAP theorem. Stated this way it seems less of a theorem and more of a fact.

Back to MySQL Cluster - which does it provide?

A single MySQL Cluster prioritises data consistency over availability when network partitions occur.

A pair of asynchronously replicating MySQL Clusters prioritise service availability over data consistency when network partitions occur.

So you can have it both ways with MySQL Cluster - Great!

Single MySQL Cluster - CP

Within a single MySQL Cluster, data is synchronously replicated between database nodes using two-phase commit. Nodes are monitored using heartbeats, and failed or silent nodes are promptly isolated by live and responsive nodes. Where a network partition occurs, live nodes in each partition regroup and decide what to do next :
  • If there are not enough live nodes to serve all of the data stored - shutdown
    Serving a subset of user data (and risking data consistency) is not an option
  • If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
    No other subset of nodes can be isolated from us and serving clients
  • If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
    There could be another subset of nodes regrouped into a viable cluster out there.

Arbitration occurs to avoid the split brain scenario where a cluster could theoretically split in two (or more), with each half (or third, or quarter) accepting writes and diverging from the others. In other words, arbitration occurs to preserve consistency.

Arbitration involves :
  • Database nodes agree on an arbitrator in advance
  • During node or network failure handling, no data writes are committed.
  • When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
  • Each request to the arbitrator will result in either : Yes, No or timeout
  • Anything other than Yes results in node shutdown.
  • The arbitrator only says Yes once per election round (First come first served). Therefore the arbitrator only says yes to one potential cluster in a partitioned network.

Note that arbitration is not the same as achieving a quorum. A cluster with three replicas and an arbitrator node can survive the loss of two data nodes as long as the arbitrator remains reachable to the last survivor. The arbitrator role is lightweight as it is not involved in normal traffic. I am surprised that the lightweight arbitrator pattern is not more common.

How does a single MySQL Cluster degrade service availability as a result of network partitions?

Where some subset of data nodes are isolated and shut-down :
  • Those nodes are 100% out of service, until they restart and can rejoin the cluster
    They will attempt to do so automatically
  • Any clients connected only to those nodes are out of service
    By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
  • The remaining live nodes are 100% in-service
  • Clients connected to the remaining live nodes are 100% in service
Where no subset of data nodes is live
  • All clients experience 100% service loss, until the data nodes restart and can rejoin the cluster
    They will attempt to do so automatically.

A single MySQL Cluster does not degrade to partial data access, or read only modes as a result of network partitions. It does not sacrifice consistency.

How can MySQL Cluster be described as highly available if it sacrifices availability for consistency in the event of a network partition?

Availability is not binary - many types of network partition can erode availability, for some clients, but do not extinguish it. Some set of clients continue to receive 100% service. Only double failures in the network can cause a network partition resulting in full service loss.
Furthermore, network partitions are not the only risks to availability, software errors, power failures, upgrades, overloads are other potential sources of downtime which Cluster is designed to overcome.

Asynchronously replicating clusters - AP


Where two Clusters are asynchronously replicating via normal MySQL Replication, in a circular configuration, reads and writes can be performed locally at both clusters. Data consistency within each cluster is guaranteed as normal, but data consistency across the two clusters is not. On the other hand, availability is not compromised by network partitioning of the two clusters. Each cluster can continue to accept read and write requests to all of the data from any connected client.

Eventual consistency between the clusters is possible when using conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.

How does consistency degrade between replicating MySQL Clusters during a network partition?

This depends on the conflict resolution function chosen, and how detected conflicts are handled. Some details of consistency guarantees provided by NDB$EPOCH et al are described here.

What about normal operation?

Abadi's post introduced his PACELC acronym, standing for something like :

 if (network Partition)
{
trade-off Availability vs Consistency;
}
else
{
trade-off Latency vs Consistency;
}


My first comment has to be that it's bad form to put the common case in an else branch!
However, it is certainly true that the properties during normal operation are usually more important than what happens during a network partition. The ELC section is stating that while all database nodes are present, a network replicated database can choose between minimising request Latency, or maintaining Consistency. In theory this normal operation latency-vs-consistency tradeoff could be completely independent to the Network Partitioning availability-vs-consistency tradeoff, e.g. you could have any of :
  1. PA EL (Partition - Availability, Else - Latency minimisation)
  2. PA EC (Partition - Availability, Else - Consistency)
  3. PC EL (Partition - Consistency, Else - Latency minimisation)
  4. PC EC (Partition - Consistency, Else - Consistency)

The common cases are 1 + 4, where we choose either consistency at all times, or Maximum Availability and Minimum Latency. Case 2 is a system which aims for consistency, but when a network partition occurs, aims for Availability. Case 3 is a system which aims for minimal request Latency, and when a partition occurs aims for consistency.

Examples of systems of each type :
  1. Any eventually consistent system, especially with local-database-node updates + reads
  2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
  3. ???
  4. Always consistent systems (e.g. single database instance, single MySQL Cluster)

I am not aware of systems meeting case 3 where normally they minimise latency over consistency, but start choosing consistency after a network partition. Maybe this category should be called 'repentant systems'?

The problem for systems in Cases 1 or 2 - anywhere where Latency minimisation or Availability is chosen over consistency - is the need for user applications to deal with potential inconsistencies. It is not enough to say that things will 'eventually' be consistent. It's important to describe how inconsistent they can be, whether the temporary inconsistencies are values which were once valid, how those values relate to other, connected values etc.

There are certainly applications which can operate correctly with practical eventually consistent databases, but it's not well known how to design applications and schemas to cope with the transient states of an eventually consistent database. The first ORM framework to opaquely support an underlying eventually consistent database may actually be worth the effort to use! A reasonable approach is to design schemas with associated read/modification 'protocols' as if they were abstract data types (ADTs). These ADTs can then have strengths and weaknesses, properties and limitations which make sense in some parts of an application schema where the need to support eventual consistency overcomes the inherent effort and limitations.

Stonebraker and others have commented on network partitions being a minor concern for a well designed datacentre-local network, where redundancy can be reliably implemented. Also the latency cost of maintaining consistency is lower as physical distances are smaller and hop counts are lower. This results in 'CP' systems being attractive at the data centre scale as the need to sacrifice availability due to network partition is rarely dominant, and the latency implications during normal operation are bearable. Perhaps this highlights the need in these theoretical discussions to illustrate theoretically problematic latencies and availabilities with real numbers.

At a wider network scale, latencies are naturally higher, implying that bandwidth is lower. The probability of network partitions of some sort may also increase, due to the larger number of components (and organisations) involved. The factors combine to make 'AP' systems more palatable. The everyday latency cost of consistency is higher, and losing availability due to potentially more frequent network partitions may not be acceptable. Again, real numbers are required to illuminate whether the achievable latencies and probable availability impacts are serious enough to warrant changing applications to deal with eventually consistent data. For a particular application there may or may not be a point at which an AP system would meet its requirements better.

Consistent systems can be scaled across many nodes and high latency links, but the observed operation latency, and the necessary impacts to availability implied by link failure set a natural ceiling on the desirable scale of a consistent system. Paraphrasing John Mashey, "Bandwidth improves, latency is forever". Applications that find the latency and availability constraints of a single consistent system unacceptable, must subdivide their datasets into smaller independent consistency zones and manage potential consistency shear between them.

Finally (another excessively long post), I think the technical and actual merits of widely distributed 'CP' systems are not well known as they have not been commonly available. Many different database systems support some form of asynchronous replication, but few offer synchronous replication, fewer still offer to support it over wide areas with higher latency and fluctuating links. As this changes, the true potential and weaknesses of these technologies, backed by real numbers, will start to appear.

Edit 7/3/12 : Fix bad link

PlanetMySQL Voting: Vote UP / Vote DOWN

Data distribution in MySQL Cluster

Март 26th, 2011
MySQL Cluster distributes rows amongst the data nodes in a cluster, and also provides data replication. How does this work? What are the trade offs?

Table fragments

Tables are 'horizontally fragmented' into table fragments each containing a disjoint subset of the rows of the table. The union of rows in all table fragments is the set of rows in the table. Rows are always identified by their primary key. Tables with no primary key are given a hidden primary key by MySQLD.

By default, one table fragment is created for each data node in the cluster at the time the table is created.

Node groups and Fragment replicas

The data nodes in a cluster are logically divided into Node groups. The size of each Node group is controlled by the NoOfReplicas parameter. All data nodes in a Node group store the same data. In other words, where the NoOfReplicas parameter is two or greater, each table fragment has a number of replicas, stored on multiple separate data nodes in the same nodegroup for availability.

One replica of each fragment is considered primary, and the other(s) are considered backup replicas. Normally, each node contains a mix of primary and backup fragments for every table, which encourages system balance.

Which replica to use?

The primary fragment replica is used to serialise locking between transactions concurrently accessing the same row. Write operations update all fragment replicas synchronously, ensuring no committed data loss on node failure. Read operations normally access the primary fragment replica, ensuring consistency. Reads with a special lock mode can access the backup fragment replicas.

Primary key read protocol

When an NdbApi client (for example a MySQLD process) wants to read a row by primary key, it sends a read request to a data node acting as a Transaction Coordinator (TC).
The TC node will determine which fragment the row would be stored in from the primary key, decide which replica to access (usually the primary), and send a read request to the data node containing that fragment replica. The data node containing the fragment replica then sends the row's data (if present) directly back to the requesting NdbApi client, and also sends a read acknowledgement or failure notification back to the TC node, which also propagates it back to the NdbApi client.

Minimising inter data node hops

The 'critical path' for this protocol in terms of potential inter-data-node hops is four hops :

Client -> TC -> Fragment -> TC -> Client

To minimise remote client experienced latency, ideally two inter-node hops can be avoided by having the TC node and the Fragment replica(s) on the same node. This requires controlling the choice of node for TC based on the primary key of the data which will be read. Where a transaction only reads rows stored on the same node as its TC, this can improve latency and system efficiency.

Distribution awareness

From NdbApi, users can specify a table and key when starting a transaction. The transaction will then choose a TC data node based on where the corresponding row's primary fragment replica is located in the system. This mechanism is sometimes referred to as 'transaction hinting'.

The Ndb handler in MySQLD generally waits for the first primary key lookup in a user session before starting an NdbApi transaction, so that it can choose a TC node based on this. This is a best-effort attempt at having the data node acting as TC colocated with the accessed data. This feature is usually referred to as 'Distribution Awareness'.

Write operations also benefit from distribution awareness, but not to the same extent in systems with NoOfReplicas > 1. Write operations must update all fragment replicas, which must be stored on different nodes, in the same nodegroup, so for NoOfReplicas > 1, distribution awareness avoids inter-node-group communication, and some intra-node-group communication, but some inter-data-node communication is always required. In a system with good data partitioning and distribution awareness, most read transactions will access only one data node, and write transactions will result in messaging between the data nodes of a single node group. Messaging between node groups will be minimal.

Distribution keys

By default, the whole of a table's primary key is used to determine which fragment replica will store a row. However, any subset of the columns in the primary key can be used. The key columns used to determine the row distribution are called the 'distribution key'.

Where a table's primary key contains only one column, the distribution key must be the full primary key. Where the primary key has more than one column, the distribution key can be different to (a subset of) the primary key.

From MySQLD, a distribution key can be set using the normal PARTITION BY KEY() syntax. The effect of using a distribution key which is a subset of the primary key is that rows with different primary key values, but the same distribution key values are guaranteed to be stored in the same table fragment.

For example, if we create a table :

CREATE TABLE user_accounts (user_id BIGINT,
account_type VARCHAR(255),
username VARCHAR(60),
state INT,
PRIMARY KEY (user_id, account_type))
engine = ndb partition by key (user_id);


Then insert some rows :


INSERT INTO user_accounts VALUES (22, "Twitter", "Bader", 2),
(22, "Facebook", "Bd77", 2),
(22, "Flickr", "BadB", 3),
(23, "Facebook", "JJ", 2);


Then we know that all rows with the same value(s) for the distribution key (user_id), will be stored on the same fragment. If we know that individual transactions are likely to access rows with the same distribution key value then this will increase the effectiveness of distribution awareness. Many schemas are 'partitionable' like this, though not all.

Note that partitioning is a performance hint in Ndb - correctness is not affected in any way, and transactions can always span table fragments on the same or different data nodes. This allows applications to take advantage of the performance advantages of distribution awareness without requiring that all transactions affect only one node etc as required by simpler 'sharding' mechanisms.

Correlated distribution keys across tables

A further guarantee from Ndb is that two tables with the same number of fragments, and the same number and type of distribution keys will have rows distributed in the same way.

For example, if we add another table :


CREATE TABLE user_prefs (user_id BIGINT,
type VARCHAR(60),
value VARCHAR(255),
PRIMARY KEY (user_id, type))
engine = ndb partition by key (user_id);


Then insert some rows :


INSERT INTO user_prefs VALUES (22, "Coffee", "Milk + 6 sugars"),
(22, "Eggs", "Over easy"),
(23, "Custard", "With skin");


Then we know that the rows with the same user_id in the user_prefs and user_accounts tables will be stored on the same data node. Again, this helps with distribution awareness. In this example, we are ensuring that rows related to a single user, as identified by a common user_id, will be located on one data node, maximising system efficiency, and minimising latency.

Ordered index scan pruning

MySQL Cluster supports arbitrary ordered indexes. Ordered indexes are defined on one or more columns and support range scan operations. Range scans are defined by supplying optional lower and upper bounds. All rows between these bounds are returned.

Each Ndb ordered index is implemented as a number of in memory tree structures (index fragments), distributed with the fragments of the indexed table. Each index fragment contains the index entries for the local table fragment. Having ordered indexes local to the table fragments makes index maintenance more efficient, but means that there may not be much locality to exploit when scanning as rows in a range may be spread across all index fragments of an index.

The only case where an ordered index scan does not require to scan all index fragments is where it is known that all rows in the range will be found in one table fragment.
This is the case where both :
  1. The ordered index has all of the table's distribution keys as a prefix
  2. The range is contained within one value of the table's distribution keys

NdbApi detects this case when a range scan is defined, and 'prunes' the scan to one index fragment (and therefore one data node). For all other cases, all index fragments must be scanned.

Continuing the example above, assuming an ordered index on the primary key, the following ordered index scans can be pruned :


SELECT * FROM user_accounts WHERE user_id = 22;
SELECT * FROM user_accounts WHERE user_id = 22 AND account_type LIKE 'F%';


However, the following ordered index scans cannot be pruned, as matching rows are not guaranteed to be stored in one table fragment :


SELECT * FROM user_accounts WHERE account_type = "Facebook";
SELECT * FROM user_accounts WHERE user_id > 20 AND user_id <>


MySQLD partitioning variants and manually controlling distribution

Since MySQL 5.1, table partitioning has been supported. Tables can be partitioned based on functions of the distribution keys such as :

  • KEY
  • LINEAR KEY
  • HASH
  • RANGE
  • LIST

For engines other than Ndb, partitioning is implemented in the Server, with each partition implemented as a separate table in the Storage engine. Ndb implements these partition functions natively, using them to control data distribution across table fragments in a single table.

From Ndb's point of view, KEY and LINEAR KEY are native partitioning functions. Ndb knows how to determine which table fragment to use for a row from a table's distribution key, based on an MD5 hash of the distribution key.

HASH, RANGE and LIST are not natively supported by Ndb. When accessing tables defined using these functions, MySQLD must supply information to NdbApi to indicate which fragments to access. For example before primary key insert, update, delete and read operations, the table fragment to perform the operation on must be supplied. From MySQLD, the partitioning layer supplies this information.

Any NdbApi application can use the same mechanisms to manually control data distribution across table fragments. At the NdbApi level this is referred to as 'User Defined' partitioning. This feature is rarely used. One downside of using User Defined partitioning is that online data redistribution is not supported. I'll discuss Online data redistribution in a future post here.

PlanetMySQL Voting: Vote UP / Vote DOWN

Ndb software architecture

Сентябрь 29th, 2009
I'm sure that someone else can describe the actual history of Ndb development much better, but here's my limited and vague understanding.

  • Ndb is developed in an environment (Ericsson AXE telecoms switch) where Ericsson's PLEX is the language of choice
    PLEX supports multiple state machines (known as blocks) sending messages (known as signals) between them with some system-level conventions for starting up, restart and message classes. Blocks maintain internal state and define signal handling routines for different signal types. Very little abstraction within a block beyond subroutines is supported. (I'd love to hear some more detail on PLEX and how it has evolved). This architecture maps directly to the AXE processor design (APZ) which is unusual in having signal buffers implemented directly in silicon rather than software. This hard-coding drove Ndb's initial max supported signal size of 25 x 32-bit words.
  • An emulated PLEX environment (VM) is made available on Unix systems, written in C++
    The VM runs as a Unix process. PLEX code for blocks is interpreted. Signals are routed between blocks by the VM. This allows development and deployment of PLEX based systems on standard Unix systems. It also allows Plex based systems to easily interact with Unix software. Each VM instance is a single threaded process routing incoming signals to the signal handling functions in each block class.
  • A PLEX to C++ translation system is designed
    Blocks are mapped to large C++ classes with signal handling methods and per-block global state mapped to member variables. The limited labelling and abstraction encoded in the PLEX source are mapped to C style code within C++ classes.
  • The VM environment is 'branched' from the original PLEX/AXE environment and starts to evolve independently as a base for Ndb.
    It offers access to more OS services such as communication, disk IO etc. Plex interpretation functionality is removed as all relevant Plex code has been mapped to native C++. VM instances can communicate with each other over various channels and form a distributed system.
  • At some point in the timeline around here the Ndb team and product leave Ericsson
  • Over time, common block functionality is abstracted into base and utility classes.
    Hardware and system-convention sourced constraints are eased, the level of abstraction is raised. New blocks are designed and implemented without a Plex heritage making use of C++ abstraction facilities. Existing blocks are refactored.
  • Multi-threaded Ndbd (ndbmtd) is introduced, with groups of block instances running on different threads.
    Rather than being a radical design, it's a move back towards the original PLEX design point of 1 block instance per processor.

Today, Ndb executes a blocks communicating via signals model. Signals are no longer limited to 25 words. In single threaded Ndb (ndbd), all blocks share a single thread, with separate threads used for inter-VM communication setup and disk IO. In multi threaded Ndb (ndbmtd), block instances are grouped, and different functional groups share threads. In all cases, each block instance remains single-threaded, although the thread may be shared with other blocks.

The blocks and signals model is reminiscent of Erlang and Hoare's CSP – where concurrency is modelled as serial (or sequential) processes communicating with explicit messages, as opposed to a shared-memory model where communication occurs via memory with correctness controlled by locks, memory barriers and atomic instructions. It can also be considered similar to MPI and the Active object / Actor model.

Using explicit messaging for synchronisation/communication has costs – at runtime a given algorithm may require more data copying. At design time, potential concurrency must be explicitly designed-in with messaging and state changes. Mapping sequential algorithms to message passing state machines may require a bigger code transformation than mapping to a naive multithread safe shared-memory and locks implementation.

However I believe that these costs are generally paid off by the benefit of improved code clarity. Inter state-machine synchronisation becomes clearly visible, making synchronisation costs easier to visualise and understand. With explicit messaging as the main mechanism for inter-thread and inter-process communication, there is only a small kernel of multithreaded code to be implemented, proved correct and optimised. The bulk of the code can be implemented in single threaded style. There is no need for diverse libraries of multithread-optimised data structures. Processor and system architecture specific code and tradeoffs are minimised.

Internally, Ndb's VM supports only asynchronous messages between blocks. Using an asynchonous message passing style has many benefits. As the sending thread does not block awaiting a response to a message sent, it can work on other jobs, perhaps including the message just sent. This allows it to make the best use of warm instruction and data caches, reduces voluntary context switches and can reduce the likelihood of deadlock. Blocking IO (network, disk) is outsourced to a pool of threads. The signal processing thread(s) never block, except when no signals are available to process. The responsiveness of the system can be ensured by using prioritised job queues to determine the job to execute next and minimising the time spent processing individual jobs. From a formal point of view the number of possible multithreaded interactions is vastly reduced as thread-interleaving is only significant at signal processing boundaries. These limitations can make it easier to reason about the correctness and timing properties of the system.

However, coding in this asynchronous, event-driven style can be demanding. Any blocking operations (disk access, blocking communications, requests to other threads or processes etc.) must be implemented as an asynchronous request and response pair. This style can have an abstraction-dissolving property as many published data structures and algorithms are implemented assuming a synchronous model and making much use of the caller's stack for state storage and managing control flow. It can be difficult to design abstractions for the asynchronous style which don't leak so much messy detail as to be pointless. Additionally, the asynchronous style tends to flatten a system – as the need to return control to the lowest-level call point whenever concurrency is possible acts as a force against deep layers of abstraction. Side effects of this can include a tendency for error handling code to be non-localised to the source of the error. However, that is part of the charm of working on the system. The C++ environment gives a wide set of tools for designing such abstractions, and each improvement made simplifies future work.

Comments, corrections?

PlanetMySQL Voting: Vote UP / Vote DOWN