Archive for the ‘architecture’ Category

Working with ScaleBase and NOSQL

December 15th, 2011

There is a huge amount of buzz around NOSQL, and we at ScaleBase are happy to see companies making the move to NOSQL. Despite what some people might think, we consider it a blessed change. It is time for applications to stop having a single data store – namely a relational database (probably Oracle) – and start using the best tool for the job.

In the last couple of years, since NOSQL technologies broke into our world, a lot of experience has been gathered on how to use them. Mainly, we see NoSQL technologies used for one of the following scenarios:

  • Queries that require a very short response time
  • Storing data without a well-defined schema, or storing data with a frequently modified schema

Now, I’m not in any way saying that NOSQL solutions are not used for other scenarios as well; I’m only saying that, from our experience here at ScaleBase, these are the most common scenarios.

Other needs, like data backup, complex join queries and consistent data storage, are all still being delivered by relational databases.

So the implementation is along the lines of a hybrid model – NOSQL for some tasks, MySQL (or other database, but MySQL is by far the most popular) for others.

ScaleBase is determined to assist in the relational database part of the problem, letting it scale and perform – just as the NOSQL side can scale and perform by itself (and frankly it can scale and perform very well, as this was the original requirement for most NOSQL solutions).

As NOSQL solutions grow in popularity and use, I expect we’ll see “design patterns” pop up – when to use relational databases and when to use NOSQL solutions (and of course – which one). For now, if you’re architecting your new web application/SaaS solution or social game – try to learn from the architectures of existing sites. You can find some at http://highscalability.com, and others at http://nosql.mypopescu.com/.


PlanetMySQL Voting: Vote UP / Vote DOWN

ScaleBase achieves 180K NO-TPM TPCC results on Amazon RDS

December 12th, 2011

ScaleBase Releases Database TPC-C Performance Results

Technology achieves unprecedented transaction speed for a MySQL database at a low cost

 

Boston, Mass., December 12, 2011 – ScaleBase, Inc. today announced the results of its MySQL database benchmark, based on the industry-standard TPC-C test. ScaleBase has achieved an unmatched 180,000 Transactions per Minute – the highest result for a MySQL database – while running on an Amazon RDS environment. Cost per Transaction was reported to be 50 cents, which demonstrates the cost-effectiveness of the ScaleBase solution on the Amazon EC2 cloud. Full details of the benchmark can be found at http://www.scalebase.com/resources/performance/.

TPC, the Transaction Processing Performance Council, defines transaction processing and database benchmarks and delivers reliable, independent results to the industry. The TPC-C benchmark is a popular yardstick for comparing Online Transaction Processing (OLTP) performance on various hardware and software configurations.

The ScaleBase Database Load Balancer is a packaged solution for transparently scaling MySQL databases. ScaleBase utilizes two techniques for scaling: read-write splitting and transparent sharding (a technique for massively scaling-out relational databases). The software enables MySQL to scale transparently, without forcing developers to change a single line of code or perform a long data migration process. The technology is ideally suited for any application in which scalability, performance and speed are critical, including: gaming, e-commerce, SaaS, machine-generated data, Web 2.0 and more.

“Some people feel that by using MySQL they stand the chance of limiting their performance options; however, these TPC-C results prove that this simply is no longer the case,” said Rob Levine, ScaleBase’s VP of Sales. “Without writing specialized code you can still get top performance – perhaps optimal performance – at an affordable rate, accounting for the requisite hardware and infrastructure resources. Especially in today’s economy, getting such great performance and optimizing every dollar spent can save companies substantial amounts of money.”

ScaleBase’s Database Load Balancer solution has been successfully used by numerous customers since its official release in August 2011.

 

About ScaleBase

ScaleBase has developed an innovative database load balancing technology that enables MySQL users to achieve scalability and high availability, without changing a single line of application code. ScaleBase utilizes two techniques for scaling: read-write splitting and transparent sharding, which is a method for massively scaling-out relational databases. The ScaleBase technology is ideally suited for any application in which scalability, performance and speed are critical, including: gaming, e-commerce, SaaS, machine-generated data and more. The company is privately-held and headquartered near Boston, Mass. Follow @SCLBase on Twitter.

 

Media Contact

Candice Perodeau

508-475-0025 x112

cperodeau@rainierco.com



Making the case for Database Sharding using a Proxy

December 6th, 2011

There are several ways to implement sharding in your application. The first, and by far the most popular, is to implement it inside your application. It can be implemented as part of your own Data Access Layer, database driver, or an ORM extension. However, there are many limitations with such an implementation, which drove us, at ScaleBase, to look for an alternative architecture.

As the above diagram shows, ScaleBase is implemented as a standalone proxy. There are several benefits to using such an architecture.

First and foremost, since the sharding logic is not embedded inside the application, third party applications can be used, be it MySQL Workbench, MySQL command line interface or any other third party product. This translates to a huge saving in the day-to-day costs of both developers and system administrators.

Backup can be executed via the proxy, which allows users to consistently back up a sharded environment – not an easy task when sharding is developed internally.

Since the application server machines are usually highly utilized (as they should be, to optimize costs), running additional code on them will just slow them down. Running the code on external proxies allows for a more efficient division of tasks between the servers, and keeps ordinary requests unaffected by data-crunching requests (for instance, cross-shard queries).

So all in all there are many reasons to run sharding code outside the scope of the application and application server. If you’re interested – we’d love to chat.



How do you know when to shard your database?

November 17th, 2011

We at ScaleBase talk about sharding so much, it’s difficult for us to see why someone wouldn’t want to shard. But just because we’re so enthusiastic about our transparent sharding mechanism, it doesn’t mean we can’t understand the very basic question, “When do I shard?”

Well, it’s not the most difficult question to answer. I’ll keep it short: if your database exceeds the memory you have on a single machine, you should shard. Once queries start hitting disk I/O, your performance suffers, and sharding will help.

Why? That’s easy to explain.

Databases in general (and MySQL is no exception) try to cache data. Because accessing memory is so much faster than accessing disk (even with SSDs), database providers have developed rather sophisticated caching algorithms. For instance, running a query caches the query and its results. Indexes are stored in memory so that, when running a query, the database doesn’t have to hit the disk twice.

But if the database is big, it won’t fit into memory. Sometimes even the index won’t fit into memory. This is when you start seeing database performance degradation. So the best time to start sharding is when you can’t add more memory to your database server. This can come sooner rather than later. As we all know, data is booming, and if you’re running in the cloud there is only so much memory your cloud provider will give you. With sharding, every machine has its own data, which fits in RAM. And if you need more – just add an additional shard.

The other parameter is the number of concurrent connections. If you reach the limit of connections your machine can handle, it’s time to shard your database. Each shard gets fewer hits/second, requires fewer connections – and can work faster.

So, if your database does not fit in memory, or if you have too many concurrent users hitting your database – try out ScaleBase and our transparent sharding solution.
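
The two rules of thumb above can be written down as a small check. This is only a sketch in Java: the class and method names are made up for illustration, and comparing raw byte counts ignores everything else a real capacity review would look at (working set size, buffer pool configuration, growth rate).

```java
// Hypothetical helper: encodes the two rules of thumb from this post.
final class ShardAdvisor {
    private ShardAdvisor() {}

    /** True when the data set no longer fits in the RAM available for caching. */
    static boolean exceedsMemory(long dataSetBytes, long availableRamBytes) {
        return dataSetBytes > availableRamBytes;
    }

    /** True when concurrent connections have reached the server's limit. */
    static boolean connectionsSaturated(int activeConnections, int maxConnections) {
        return activeConnections >= maxConnections;
    }

    /** Shard when either rule fires. */
    static boolean shouldShard(long dataSetBytes, long availableRamBytes,
                               int activeConnections, int maxConnections) {
        return exceedsMemory(dataSetBytes, availableRamBytes)
                || connectionsSaturated(activeConnections, maxConnections);
    }

    /**
     * Smallest shard count whose per-shard slice fits in RAM,
     * assuming (optimistically) that data is spread evenly.
     */
    static int shardsNeeded(long dataSetBytes, long ramPerShardBytes) {
        if (ramPerShardBytes <= 0) {
            throw new IllegalArgumentException("ramPerShardBytes must be positive");
        }
        if (dataSetBytes <= 0) {
            return 1;
        }
        long shards = (dataSetBytes + ramPerShardBytes - 1) / ramPerShardBytes; // ceiling division
        return (int) Math.max(1, shards);
    }
}
```

For example, a 200 GB data set on machines with 64 GB of RAM each would call for at least four shards under these assumptions.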



Oracle’s NoSQL

October 7th, 2011

Oracle's turn-about announcement of a NoSQL product wasn't really surprising. When Oracle spends time and effort putting down a technology, you can bet that it's secretly impressed, and trying to re-implement it in its back room. So Oracle's paper "Debunking the NoSQL Hype" should really have been read as a backhanded product announcement. (By the way, don't click that link; the paper appears to have been taken down. Surprise.)

I have to agree with DataStax and other developers in the NoSQL movement: Oracle's announcement is a validation, more than anything else. It's certainly a validation of NoSQL, and it's worth thinking about exactly what that means. It's long been clear that NoSQL isn't about any particular architecture. When databases as fundamentally different as MongoDB, Cassandra, and Neo4J can all be legitimately characterized as "NoSQL," it's clear that NoSQL isn't a "thing." We've become accustomed to talking about the NoSQL "movement," but what does that mean?

As Justin Sheehy, CTO of Basho Technologies, said, the NoSQL movement isn't about any particular architecture, but about architectural choice. For as long as I can remember, application developers have debated software architecture choices with gusto. There were many choices for the front end; many choices for middleware; and careers rose and fell based on those choices. Somewhere along the way, "Software Architect" even became a job title. But for the backend, for the past 20 years there has really been only one choice: a relational database that looks a lot like Oracle (or MySQL, if you'd prefer). And choosing between Oracle, MySQL, PostgreSQL, or some other relational database just isn't that big a choice.

Did we really believe that one size fits all for database problems? If we ever did, the last three years have made it clear that the model was broken. I've got nothing against SQL (well, actually, I do, but that's purely personal), and I'm willing to admit that relational databases solve many, maybe even most, of the database problems out there. But just as it's clear that the universe is a more complicated place than physicists thought it was in 1990, it's also clear that there are data problems that don't fit 20-year-old models. NoSQL doesn't use any particular model for storing data; it represents the ability to think about and choose your data architecture. It's important to see Oracle recognize this. The company's announcement isn't just a validation of key-value stores, but of the entire discussion of database architecture.

Of course, there's more to the announcement than NoSQL. Oracle is selling a big data appliance: an integrated package including Hadoop and R. The software is available standalone, though Oracle clearly hopes that the package will be running on its Exadata Database hardware (or equivalent), which is an impressive monster of a database machine (though I agree with Mike Driscoll that machines like these are on the wrong side of history). There are other bits and pieces to solve ETL and other integration problems. And it's fair to say that Oracle's announcement validates more than just NoSQL; it validates the "startup stack" or "data stack" that we've seen in many of the most exciting new businesses that we watch. Hadoop plus a non-relational database (often MongoDB, HBase, or Cassandra), with R as an analytics platform, is a powerful combination. If nothing else, Oracle has given more conservative (and well-funded) enterprises permission to make the architectural decisions that the startups have been making all along, and to work with data that goes beyond what traditional data warehouses and BI technologies allow. That's a good move, and it grows the pie for everyone.

I don't think many young companies will be tempted to invest millions in Oracle products. Some larger enterprises should, and will, question whether investing in Oracle products is wise when there are much less expensive solutions. And I am sure that Oracle will take its share of the well-funded enterprise business. It's a win all around.






When Clever Goes Wrong & How Etsy Overcame – Arstechnica

October 5th, 2011

In 2007, Etsy made a big bet on homegrown middleware to help with the site’s scalability. A half-year after it was taken live, the company decided to abandon it. As a senior software engineer at Etsy put it, “if you’re doing something ‘clever,’ you’re probably doing it wrong.”

Read the full article at Arstechnica.com

I want to focus on the important lesson from this article: using middleware and stored procedures in this fashion for a public web application creates unscalable design complexity (smart and “proper” according to the old enterprise design teachings…), which causes infrastructure, development and maintenance hassles.

In the process they did replace PostgreSQL with MySQL, but that’s not the critical change that made all the difference. PostgreSQL is also a fine database system.



How to Implement MySQL Sharding – Part 3

October 4th, 2011

In the previous post of this series (which can be found here) I discussed how to migrate your data once you have decided how to shard your schema.

Once your data is sharded, it’s time to modify your application code. I will not dive into the many open source platforms that provide partial sharding support (Hibernate Shards, Gizzard, and the like), and will take Java (sorry, old habits are hard to overcome) as an example – however, the same holds true for any programming language.

Without Using ORM

If you wrote your code without an Object/Relational Mapping tool, kudos to you. Sharding will be easier, as you control the SQL statements.

Upgrading Connection Pool

Your first task is to write a connection pool that is “sharding” aware.  The class should look something like this:

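import java.sql.Connection;

public class ShardingAwareDatasource {
    // Returns a connection to the shard that owns the given sharding key.
    public static Connection getConnection(Object shardingKey) {…}
    // Returns a connection to any of the shards; meant for global tables.
    public static Connection getAnyConnection() {…}
}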

Queries that run on specific shards need to use the getConnection method, which must contain the logic that returns the correct connection based on the sharding key. Queries that run on global tables (as described in the previous posts) can use getAnyConnection, which returns a connection to any of the shards.

Note that these methods must be session aware, to ensure transaction isolation.
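
To make the routing and the session awareness a bit more concrete, here is a minimal sketch of what could sit behind those two methods. Everything in it is an assumption made for illustration: the ShardRouter name, the hash-modulo scheme, the use of a ThreadLocal as the "session", and the choice to reject a session that tries to touch two shards. The type parameter T stands in for whatever represents one shard in your code, for example a javax.sql.DataSource backed by a per-shard connection pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: T is a per-shard handle, e.g. a javax.sql.DataSource.
final class ShardRouter<T> {
    private final List<T> shards;
    // Remembers the shard chosen for the current thread, so that all
    // statements of one session (and transaction) go to the same shard.
    private final ThreadLocal<T> pinned = new ThreadLocal<>();

    ShardRouter(List<T> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shards = new ArrayList<>(shards);
    }

    /** Index of the shard that owns the key (simple hash-modulo scheme). */
    int shardIndexFor(Object shardingKey) {
        return Math.floorMod(shardingKey.hashCode(), shards.size());
    }

    /** Shard for a query on a sharded table. */
    T shardFor(Object shardingKey) {
        T target = shards.get(shardIndexFor(shardingKey));
        T current = pinned.get();
        if (current != null && current != target) {
            // Design choice of this sketch: one session, one shard.
            throw new IllegalStateException("session is already bound to another shard");
        }
        pinned.set(target);
        return target;
    }

    /** Shard for a query on a global table: reuse the session's shard, else pick any. */
    T anyShard() {
        T current = pinned.get();
        if (current == null) {
            current = shards.get(ThreadLocalRandom.current().nextInt(shards.size()));
            pinned.set(current);
        }
        return current;
    }

    /** Call when the session ends (after commit or rollback). */
    void endSession() {
        pinned.remove();
    }
}
```

With this in place, getConnection(shardingKey) would ask shardFor(shardingKey) for a pool and borrow a connection from it, and getAnyConnection() would do the same through anyShard().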

Changing Queries

  1. Go over all of your queries.
  2. For each query:
    1. Identify whether it runs on a global or shard table.
      i. If it runs on a global table, make sure the connection used is from getAnyConnection.
    2. If it is based on a shard table, check if it contains the shard key.
      i. If so, then use that shard key in the getConnection method.
        1. If the query uses other tables, break it down into multiple queries.
      ii. If not, then split the query into multiple queries, so that each contains a shard key.
      iii. Make sure your code merges data that is gathered across multiple queries.

Usually, you’ll find that any query that is not trivial will have to be changed. (A trivial query touches only one table and, if that table is sharded, contains the shard key in its WHERE clause.) It’s a lot of work, but it pays off in performance.
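
The last step, merging data gathered across multiple queries, is the part people tend to underestimate. Here is a minimal scatter-gather sketch, assuming the per-shard query has already been written: the same query runs on every shard, and the ORDER BY and LIMIT are re-applied to the combined rows. The class name and the type parameters S (a shard handle) and R (a row type) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

final class CrossShardQuery {
    private CrossShardQuery() {}

    /**
     * Runs perShardQuery against every shard and merges the partial results.
     * Sorting and limiting must be repeated here, because each shard only
     * knows about its own rows.
     */
    static <S, R> List<R> scatterGather(List<S> shards,
                                        Function<S, List<R>> perShardQuery,
                                        Comparator<? super R> order,
                                        int limit) {
        List<R> merged = new ArrayList<>();
        for (S shard : shards) {
            merged.addAll(perShardQuery.apply(shard)); // one query per shard
        }
        merged.sort(order);                            // global ORDER BY
        int end = Math.min(limit, merged.size());      // global LIMIT
        return new ArrayList<>(merged.subList(0, end));
    }
}
```

This version queries the shards one after another and holds all rows in memory, which is fine for small result sets. Aggregates such as COUNT or AVG need their own merge logic and are not covered by this sketch.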

When Using ORM

Since most ORMs don’t support sharding, you’re out of luck. Most likely you’ll have to rewrite your ORM code, directly use SQL, and handle the object mapping by yourself. Not an easy task.

Summary

Implementing your own sharding is not impossible. It’s been done before, and in this series I tried to focus on what tasks are needed when you implement your own sharding.

Of course, if you’re serious about sharding your database, I strongly urge you to give ScaleBase a try. We’ll make sure all this heavy lifting is done for you, in a transparent way – no code changes, no schema configuration, everything is automatic.

I’ll be happy to chat on this page or through our contact us page. Also, you can get more information on how to write your own sharding code in our upcoming webinar. It will be held on November 2nd, 11AM PST. You can register here.

And if you’d like to find out what’s the best way to shard your schema, try our 100% free Analyzer. You can download it here.



How to Implement MySQL Sharding – Part 2

September 26th, 2011

In the previous post of this series (which can be found here) I discussed how to identify tables that can serve as good candidates for sharding.

Once you have decided which tables should be sharded (all the rest should be global tables), the choice of sharding keys is rather straightforward, as most will use the table primary key as the shard key. Of course, if multiple tables are sharded, and there is a foreign key relationship between these tables, then the foreign key will serve as the shard key for some tables.

Many people attempt to shard based on customer_id or a resource id, but I have seen how this usually fails in production environments. It is very hard to know in advance which customers belong together in the same database, and since customers can suddenly increase their traffic, this might create an unbalanced situation in which some shards are very busy while others are relaxed (see the details of last year’s FourSquare outage for some possible results of unbalanced sharding).

As with database partitioning, there are multiple algorithms available for sharding: hash, list, or range. Usually you’ll use list and range for multi-tenancy – saving customer information across different databases and maybe even different data centers. I’ll touch on that subject in a future post. But hash will probably give you the best results when it comes to sharding, as statistically it spreads data evenly across all shards.
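
To show how the three algorithms differ, here is a small Java sketch of each one as a function from a key to a shard number. The class name, the method names and the map-based configuration are assumptions for the sake of the example, not how any particular product configures sharding.

```java
import java.util.Map;
import java.util.NavigableMap;

final class ShardingStrategies {
    private ShardingStrategies() {}

    /** Hash: spreads keys across shards without any knowledge of the data. */
    static int byHash(long key, int shardCount) {
        return Math.floorMod(Long.hashCode(key), shardCount);
    }

    /** Range: each map entry is "keys from this lower bound upwards go to this shard". */
    static int byRange(long key, NavigableMap<Long, Integer> lowerBoundToShard) {
        Map.Entry<Long, Integer> entry = lowerBoundToShard.floorEntry(key);
        if (entry == null) {
            throw new IllegalArgumentException("no range covers key " + key);
        }
        return entry.getValue();
    }

    /** List: an explicit value-to-shard assignment, e.g. one tenant per shard. */
    static int byList(String value, Map<String, Integer> valueToShard) {
        Integer shard = valueToShard.get(value);
        if (shard == null) {
            throw new IllegalArgumentException("no shard listed for " + value);
        }
        return shard;
    }
}
```

Range and list give you control over where data lives, which is what multi-tenancy needs, but you have to maintain the maps yourself. Hash needs no such bookkeeping, which is part of why it is usually the easier way to keep shards balanced.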

So after sharding configuration, what’s next?

If you have a new application you can skip the next section and just wait for the next post. But if you have an existing database, you’re stuck with huge amounts of data that you need to split.

We at ScaleBase ran a lot of tests and found that the following is the best mechanism for the initial data migration (BTW – if you use ScaleBase, we handle and also optimize the data migration for you).

  1. Have the database cloned in all shards. It can be done by cloning a VM, or copying the physical files, or using mysqldump to export once and import to all shards.
  2. For each shard (on shard tables only):
    1. Drop all indexes.
    2. Delete the irrelevant data from the shard (this should be done by an automatic script of course).
      Note: This action creates a lot of fragmentation. You might consider creating a temporary table, inserting only the relevant rows into it, dropping the original table and renaming the temporary one to the real name.
    3. Create all indexes.
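
The "automatic script" in step 2 mostly boils down to generating a few SQL statements per shard. Below is a sketch of such a generator in Java. It assumes hash sharding by modulo on a numeric key column, which is my assumption and not something the steps above prescribe; the class, method and table names are made up. Table and column names are concatenated into the SQL as-is, so only pass trusted identifiers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator for the per-shard cleanup statements (MySQL syntax).
final class ShardCleanupSql {
    private ShardCleanupSql() {}

    /** Step 2.2 as written: delete the rows that belong to other shards. */
    static String deleteForeignRows(String table, String keyColumn,
                                    int shardCount, int shardIndex) {
        return "DELETE FROM " + table
                + " WHERE MOD(" + keyColumn + ", " + shardCount + ") <> " + shardIndex;
    }

    /**
     * Alternative from the note: copy the rows to keep into a temporary
     * table, drop the original and rename the copy.
     * CREATE TABLE ... LIKE also copies index definitions, so with this
     * route the indexes are maintained during the copy.
     */
    static List<String> rebuildWithOwnRows(String table, String keyColumn,
                                           int shardCount, int shardIndex) {
        String tmp = table + "_tmp";
        return Arrays.asList(
                "CREATE TABLE " + tmp + " LIKE " + table,
                "INSERT INTO " + tmp + " SELECT * FROM " + table
                        + " WHERE MOD(" + keyColumn + ", " + shardCount + ") = " + shardIndex,
                "DROP TABLE " + table,
                "RENAME TABLE " + tmp + " TO " + table);
    }
}
```

For shard 2 of 4 and a table named orders keyed by id, the first method yields "DELETE FROM orders WHERE MOD(id, 4) <> 2". Either way, test the generated statements on a copy before pointing them at real data.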

In the next post we’ll talk about the programming language modifications that sharding requires.



Backing Up MySQL With ScaleBase

September 12th, 2011

Backing up data is critical for production databases – and there are a lot of well-known solutions for backing up databases.

When the database is sharded, backing up data becomes problematic. If the backup is not synchronized across all shards, data inconsistency might occur. In this blog post I’ll try to detail the possible backup scenarios for sharded databases when using ScaleBase.

Backup Types

Let’s start by understanding the different backup types that are out there. You can read all about it here.

A physical backup involves copying all database files to a different location. Copying can take several hours for a decent database if it’s done to a disk or a tape. It might take only seconds if the database files reside on SAN/NAS storage hardware that supports snapshot technology.

A logical backup is a copy of the logical database structure. It backs up meaningful data rather than the physical backup’s bits and bytes.  The logical backup is comprised of all CREATE TABLE statements and INSERT statements for the content.

Physical backup methods are faster than logical because they involve only file copying without conversion.

A full restore is also faster from a physical backup. However, with a physical backup you can’t restore only one table, or specific selected data. If this is what you need, you’ll have to use logical backups.

Physical Backup

A physical backup can be cold, warm or hot.

Cold
  Single database:
    1. Shut down the database server.
    2. Make the copy or snapshot.
    3. Start the database server.
  Sharded database with ScaleBase:
    1. Shut down the ScaleBase instances that are connected to the database (using MySQLAdmin).
    2. Take the copy or snapshot (simultaneously) on all sharded databases.
    3. Start up the ScaleBase instances.

Warm
  Single database:
    1. Flush and lock all tables with the command “FLUSH TABLES WITH READ LOCK”.
    2. Perform the copy or snapshot.
    3. Then “UNLOCK TABLES”.
  Sharded database with ScaleBase:
    1. Run the lock command through ScaleBase.
    2. While locked, take the copy or snapshot simultaneously on all databases.
    3. Run the unlock command through ScaleBase.

Hot
  Single database: Needs tools like “MySQL Enterprise Backup” or “Percona xtrabackup”.
  Sharded database with ScaleBase: Needs tools like “MySQL Enterprise Backup” or “Percona xtrabackup” on all database servers.
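
To see why the warm backup needs coordination when the database is sharded, here is a sketch of what a home-grown coordinator would have to do: lock every shard, take the snapshots while all locks are held, and release the locks even if something fails. This is not how ScaleBase implements it; ShardSession and WarmBackup are hypothetical names. The sketch assumes one open session per shard, since the global read lock only lasts as long as the session that took it.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical abstraction of one open MySQL session on one shard.
 * A JDBC-based implementation would wrap java.sql.SQLException in an
 * unchecked exception.
 */
interface ShardSession {
    void execute(String sql);
}

final class WarmBackup {
    private WarmBackup() {}

    /** Locks all shards, runs the snapshot step, then always unlocks. */
    static void run(List<ShardSession> shards, Runnable takeSnapshots) {
        List<ShardSession> locked = new ArrayList<>();
        try {
            for (ShardSession shard : shards) {
                shard.execute("FLUSH TABLES WITH READ LOCK");
                locked.add(shard);
            }
            takeSnapshots.run(); // every shard is read-only at this point
        } finally {
            for (ShardSession shard : locked) {
                try {
                    shard.execute("UNLOCK TABLES");
                } catch (RuntimeException e) {
                    // Keep going so the remaining shards are still unlocked.
                }
            }
        }
    }
}
```

The longer the snapshot step takes, the longer writes are blocked on every shard, which is why this approach pairs best with storage-level snapshots that complete in seconds.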

Logical Backup

Single database: The most common command for a logical backup is: mysqldump --single-transaction --all-databases
Sharded database with ScaleBase: Run the command through ScaleBase.

Benefits of Backing Up with ScaleBase

The added value of using ScaleBase when backing up data is:

  • Vs. single database:
    • Backup takes only a fraction of the time. Since each database is smaller, copying the data is faster.
  • Vs. home-grown sharded environment:
    • Instead of updating backup scripts, just change the IP address to ScaleBase. Everything will continue working exactly as before.


OpenDBCamp: Information Lifecycle Architecture

May 7th, 2011
The Open DB Camp in Sardinia 2011 has had a number of sessions on varying topics. Topics range from MySQL over MongoDB to replication and High Availability.

I decided to tap into the database expert resources present here at Sardegna Ricerche by discussing a non-database issue, where one can expect database experts to have insights beyond those of end users. And they did.

The topic was the particular case of information overload many of us suffer from on our hard disks: Too many files, too hard to find.
  • How do we find the bank statement from April 2007 from the more-seldom-used account?
  • What are the ten best work-related pictures from last year?
  • Is this the most current version of the presentation of BlackRay?
  • Are these films from Cagliari already backed up? Also offsite?
It turned out that I am not the only one suffering from a slight chaos on my hard disk. We all have some basic discipline we try to follow to keep things in order, but the consensus seemed to be that disorder on the hard disk is a psychological problem to be solved by good habits, more than a technical problem to be solved by an application. This in itself is a revolutionary insight, to come from a bunch of techies.

Before going into the individual points, let me first share how I had framed the discussion:
Many OpenSQLCamp attendees spend lots of time communicating about our SQL projects, internally and externally. We spend lots of time architecting database systems, and managing the lifecycle of products.

We do little to implement a proper architecture for the non-database information we create and manage, in business and privately. We drown in emails, digital pictures, versions of downloaded PDF documents, video snippets, and attachments sent by colleagues, partners and private friends. Chaos ensues.

Disorder and low productivity are inevitable unless we are very disciplined in following some basic rules for keeping order on our hard disks, pods and pads. But what are those basic rules? And what tools can implement them?

I don't have more than a rough first sketch of "an Information Lifecycle Architecture", but I'd like to share ideas, thoughts and attitudes with my fellow OpenSQLCamp attendees. I'll present some slides and guidelines, and will make an attempt at collecting your thoughts into a summary afterwards!
I threw in a couple of basic ideas on how to handle the type of information that we have to manage as individuals, usually on our own hard disks:
  1. Separate /pub from /rep: Store raw information in its original form in one directory tree, the "repository". Store distilled information ready to be consumed in a separate directory tree, the "publications".
  2. Limit the allowed /pub formats: Allow very few formats for publishing (such as .jpg .mov .pdf .mp3 .ogg but not .doc .ppt .xls .cr2 .psd .oo3 or anything even more "exotic").
  3. Delete systematically: Don't save many versions of the same file. Don't save information that isn't needed.
  4. Sync easily: Set up the directories (and configure your software) so that it's very easy to sync the published files with your mobile devices (Androids, iPhones, iPads, iPods, digi frames), regardless of whether they are PDFs, JPGs, MOVs or MP3s.
  5. Order files by type: Above /pub and /rep, separate files by rough category: Pictures, Movies, Documents, Music.
  6. Order files by year: Under /pub and /rep, separate most files into directories by year. Month or quarter would be too frequent for most personal information.
  7. Order files by common sense: Under the year (or in exceptional cases directly under /pub or /rep), separate files by placing them into a smart directory structure, which you yourself decide about according to the topic, as opposed to delegating the file structure to the random preferences of some software (like iPhoto).
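
Rules 1, 2, 5, 6 and 7 together describe a directory layout, which can be sketched as a small path builder. This is my reading of the rules rather than a tool anyone at the session proposed: the class name, the method names and the exact ordering category/pub-or-rep/year/topic are assumptions, and the list of allowed formats is just the examples given in rule 2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper that turns the ordering rules into a relative path.
final class InfoLayout {
    private InfoLayout() {}

    // Example formats from rule 2; the rule says "such as", so this is not exhaustive.
    static final Set<String> PUB_FORMATS =
            new HashSet<>(Arrays.asList("jpg", "mov", "pdf", "mp3", "ogg"));

    /** Rule 2: only a few formats may be published. */
    static boolean allowedInPub(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return PUB_FORMATS.contains(extension);
    }

    /** Rules 1, 5, 6 and 7: category, then pub or rep, then year, then your own topic. */
    static String targetPath(String category, boolean published, int year,
                             String topic, String fileName) {
        if (published && !allowedInPub(fileName)) {
            throw new IllegalArgumentException("format not allowed in pub: " + fileName);
        }
        return String.join("/", category, published ? "pub" : "rep",
                String.valueOf(year), topic, fileName);
    }
}
```

Under these assumptions, the bank statement from April 2007 would end up somewhere like Documents/pub/2007/bank/statement-2007-04.pdf, while a raw camera file stays under rep.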
Beat Vontobel, Liz van Dijk, Markus Popp, Sheeri Kritzer Cabral, Sergei Golubchik, René Cannao and others came up with very good ideas and anecdotes. Let me relate some of them here, while they're fresh in my memory:
  1. Blog your notes! Write your personal notes so that they're reusable for others. Publish them on your blog. Then you can use Google to find your own notes. I think this tip is smarter than it sounds at first, i.e. it's applicable in quite a few situations.
  2. Use version control! For some who are familiar with version control anyway, it may make sense to put presentations and various types of other personal information into a version control system.
  3. Use the cloud! Put some of the information onto the cloud, for easy availability across machines, for easy synching, for backup.
  4. Tags for files should be part of the operating system. You could tag expense reports, notes, contacts, pictures, films, documents and emails alike with #opendbcamp. The tagging should ideally work across operating systems.
  5. Order needs discipline. Any good habit of keeping order on the hard disk needs to be backed up by a commitment in time. If you slip once, and twice, and one more time, the discipline is lacking.
  6. Storage is cheap. Or is it? Here I noted two schools of thought. One would rather just tag anything and keep order by sorting. The other school would rather delete as much as possible, so that the remainder is smaller and hence easier to keep ordered. I belong to the latter one.
  7. Bad banks throw important yet unstructured information at you. You can get a bank account statement with a long filename which doesn't denote the year and month or bank account. You yourself have to parse the file, and name it properly. That's a burden even for a geeky OpenDBCamp visitor. Think of the poor average bank customers!
  8. The analog world forced you to have a physical relationship to your data. In order to use your CDs or spices or books, your mental maps of organising them were backed up by some physical structure. This physical structure is missing from digital data. It becomes easier to forget that you even have the information. We end up with a lot of pictures, music and videos we never use.
  9. Use Yojimbo http://www.barebones.com/products/yojimbo/ as an information organiser, if you're a Mac user.
  10. Does technology solve issues or create them? Earlier, we didn't have as many pics, films, CDs or books. Now, we have more of them, in a variety of forms. Does it really make sense to spend tens of hours sorting and otherwise maintaining your collections (of films, music, pictures)? Or is it better to have smaller collections, even of the seemingly "free" items such as digital pictures and films taken by yourself?
On that philosophic observation, let me end my personal notes from the "Information Lifecycle Architecture" session at the Open DB Camp, which I have now published and will be able to find later on by Googling it.
