Archive for the ‘indexing’ Category

FictionPress Selects TokuDB for Consistent Performance and Fast Disaster Recovery

Январь 3rd, 2012

FictionPress

Issues addressed:

  • Support complex and efficient indexes at 100+ million rows.
  • Predicable and consistent performance regardless of data size growth.
  • Fast recovery.

Ensuring Predictable Performance at Scale

The Company:  FictionPress operates both FictionPress.com and FanFiction.net and is home to over 6 million works of fiction, with millions of writers/readers participating from around the world in over 30 languages

The Challenge: FictionPress offers a number of interactive features to its large user base. These include discussion forums, in-site messaging and user reviews. FictionPress made the decision to build its own discussion forums to meet its strict security and performance requirements. Xing Li, CTO of FictionPress, noted that the site “needs to host hundreds of thousands of forums. Existing forum software doesn’t do this while meeting our performance and security targets.”

To ensure the real-time responsiveness of the forums, FictionPress needs the ability to create and efficiently maintain complex indexes and be able to support millions of small rows. In addition, it needs the ability to index them with minimal impact to resource costs and performance. “The only way to make this all work and provide a good customer experience is to guarantee that we can deliver a flat predictable performance with our database back-end even as the number of rows crosses the 100 million mark,” according to Li.

FictionPress considered InnoDB, the default storage engine for MySQL, but it did not offer predictable performance at scale. Indexes became dramatically slower as the number of rows increased, causing a reduction of both read and write performance. InnoDB also did not offer the performance-enhancing feature of multiple clustering indexes.

The Solution:  FictionPress uses MariaDB and TokuDB to manage its discussion forums, reviews, and in-site messaging systems.

FictionPress installed TokuDB in a Linux environment with dedicated hardware. Each configuration has a single master with multiple read slaves. “TokuDB’s high write concurrency and support for multiple clustering indexes gave us the freedom to design and deploy better performing queries at scale,” according to Li. This was important to FictionPress as its environment is continually expanding.

The Benefits:

Predictable Performance: “While raw performance is important, the predictability of response time as one scales the system was our focal point” according to Li. “InnoDB can only have one clustering index, but TokuDB gives you basically an unlimited number. In addition, both MyISAM and InnoDB slow down with many indexes on databases of our size. MyISAM also causes replication lag due to concurrency. In the end, TokuDB gives us predictability, performance at scale, and more flexible indexing without the limitations found in other MySQL options.”

Cost: “To get additional performance, one can always throw hardware at the problem,” according to Li. “By utilizing TokuDB instead we improved scalability and at the same time saved on costs for additional server hardware that would have been required if TokuDB was not in the picture. In addition, we saw an 8x size reduction in disk space compared to MyISAM due to improved compression. The hardware cost saving made moving to TokuDB an easy decision.”

Crash Recovery: FictionPress had been using MyISAM initially. “We needed a replacement for MyISAM for small BLOB data,” according to Li. “In fact, we wanted to move away from MyISAM whenever possible to shorten its long crash recovery. InnoDB was an option but TokuDB offered better compression and a smaller storage footprint for both core data and index data for our own data sets.”

Hot Schema Changes: “For performance reasons we need a lot of indexes but also need to add and maintain these indexes quickly,” according to Li. “TokuDB is the only MySQL solution I found that offers Hot Schema changes such as Hot Indexing. Hot Schema changes are a powerful capability which we use to minimize downtime during system-wide upgrades and shorten our application/schema development cycle.”

 

 


PlanetMySQL Voting: Vote UP / Vote DOWN

Slides of my talk on B+Tree Indexes and InnoDB

Декабрь 18th, 2011
The slides of my talk on B+Tree Indexes and InnoDB are now available for download. This slide was presented during Percona Live London 2011. You can download the slides from here. There are many other interesting and informative talks that were presented during Percona Live London 2011, and I think you should definitely check them out, if you haven't. They are available here.
PlanetMySQL Voting: Vote UP / Vote DOWN

Book Review – Effective MySQL

Ноябрь 4th, 2011

Effective MySQL: Optimizing SQL Statements

by Ronald Bradford

No Nonsense, Readable, Practical, and Compact

Effective MySQLI like that this book is small; 150 pages means you can carry it easily.  It's also very no nonsense.  It does not dig too deeply into theory unless it directly relates to your day-to-day needs.  And those needs probably cluster heavily around optimizing SQL queries, as those pesky developers are always breaking things ;)

Jokes aside, this new book out on Oracle Press is a very readable volume. Bradford has drawn directly from real-world experience to give you the right bite size morsels you need in your day-to-day MySQL activities.

Highlights

Chapter one, The Five Minute DBA gives you the basic methodology if you don't already know it.  Enable the slow query log, analyze it, and use the explain facility.  Then index as appropriate, or eliminate queries if you can.

Chapter two digs a little deeper into the basics, introducing explain extended, table statistics and storage engines.  You'll also learn how to use show session & global status, as well as session & global variables.  You'll also have your first look at MySQL's data dictionary - INFORMATION_SCHEMA.

Chapter three is where it starts to get meaty.  You probably know that MySQL has b-tree indexes, but did you know that it has b+tree indexes, or hash indexes?

Chapter four digs into indexes further with single & multi-column indexes using them for sorting and joining.  You'll also find out about covering indexes which are multi-column matching the where clause, but also including columns needed in the SELECT predicate.  Do you have duplicate or unused indexes?  You'll learn why they matter to performance and how to eliminate them with tools like mk-duplicate-key-checker.

Chapter five continues along the same lines, with more coverage of indexes.  Learn to identify when you are using a covering index, fulfilling the entire query by only accessing the index.  You'll also learn about partial indexes, how they can reduce the size of index storage and retrieval while still getting your data for you.

Chapter six covers configuring the server itself, hitting on the system variables such as the innodb buffer pool (innodb_buffer_pool_size) and key buffer (key_buffer_size) as well as the query cache.  You'll also learn how to set the four main session memory settings - sort buffer (sort_buffer_size) and join buffer ( join_buffer_size) as well as the lesser known read buffers (read_buffer_size and read_rnd_buffer_size).

Chapter seven is all about the process of tuning and optimizing MySQL.  Rolling all the previous sections into marching orders, and prescriptive advice, he takes you through step by step how to apply the principles.  You'll get an introduction to mk-query-digest (though strangely without attribution to Baron Schwartz), the great maatkit tool for query analysis and aggregation, as well as the microsecond precision patch, which allows your mysql shell client to display more exact timing data.  For the patch he links back to an article on his own site which seems to be not found.  The author of the high precision mysql timer patch is Stewart Smith.

I personally got the most out of Chapter eight, full of self-described hidden performance tips.  From identifying unused or duplicate indexes, to replacing inefficient data types with better ones, why it's important to use NOT NULL where possible or how to store IP addresses efficiently, this chapter has a lot of goodies.  For those still struggling with SQL statement tuning, there are a few patterns that are described, offering advice on how to rewrite a subquery as an inner join,

What you might not know

  • MySQL includes Oracle's index organized tables by a different name
  • Too many indexes can dramatically impact INSERT & UPDATE performance
  • Many DDL operations can be done online - see oak-online-alter-table (Shlomi Noach)
  • Datatypes matter - use enum, int unsigned, timestamp & not null where possible
  • Covering indexes are your friend, duplicate & unused indexes are not!
  • A replication slave can have different storage engines or indexes from the master. These can support different uses - such as data warehousing or non-transactional requirements.
  • While a_string LIKE '%end of my sentence.' won't use an index, you can index reverse_string, then use reverse_string LIKE REVERSE '%end of my sentence.' and MySQL will use this index.  You've simulated an advanced Oracle feature, reverse key indexes!

A few small gripes

If I were to add a few complaints it would be to say that some of the examples were rather simplistic.  In many cases tuning SQL is not as simple as just adding the right index.  For instance there was no good discussion of the dreaded "using temporary, using filesort" that we see a lot in MySQL explains when sorting has to be done, but will not fit in memory.  Or what about tmpdir=/dev/shm, how will that improve things?  What about UNION versus UNION ALL where appropriate.  Why does DISTINCT do a sort?

The book was also missing a discussion of triggers, stored procedures, when or if the query cache can cause problems and so forth.  Also the article link mentioned about chapter seven isn't the only missing link.  I followed links to optimizing sql  statements and it seems to go to a generic holding page.  Also the main link effectivemysql.com/book leads to an outline of an as yet unreleased title on Backup and Recovery.

All in all, well worth your money

However, other than these few gripes the book overall is a very welcome addition to the small family of MySQL books.  Get a copy quick before they're all gone!

 

 


PlanetMySQL Voting: Vote UP / Vote DOWN

Book Review – Effective MySQL

Ноябрь 4th, 2011

Effective MySQL: Optimizing SQL Statements

by Ronald Bradford

No Nonsense, Readable, Practical, and Compact

Effective MySQLI like that this book is small; 150 pages means you can carry it easily.  It's also very no nonsense.  It does not dig too deeply into theory unless it directly relates to your day-to-day needs.  And those needs probably cluster heavily around optimizing SQL queries, as those pesky developers are always breaking things ;)

Jokes aside, this new book out on Oracle Press is a very readable volume. Bradford has drawn directly from real-world experience to give you the right bite size morsels you need in your day-to-day MySQL activities.

Highlights

Chapter one, The Five Minute DBA gives you the basic methodology if you don't already know it.  Enable the slow query log, analyze it, and use the explain facility.  Then index as appropriate, or eliminate queries if you can.

Chapter two digs a little deeper into the basics, introducing explain extended, table statistics and storage engines.  You'll also learn how to use show session & global status, as well as session & global variables.  You'll also have your first look at MySQL's data dictionary - INFORMATION_SCHEMA.

Chapter three is where it starts to get meaty.  You probably know that MySQL has b-tree indexes, but did you know that it has b+tree indexes, or hash indexes?

Chapter four digs into indexes further with single & multi-column indexes using them for sorting and joining.  You'll also find out about covering indexes which are multi-column matching the where clause, but also including columns needed in the SELECT predicate.  Do you have duplicate or unused indexes?  You'll learn why they matter to performance and how to eliminate them with tools like mk-duplicate-key-checker.

Chapter five continues along the same lines, with more coverage of indexes.  Learn to identify when you are using a covering index, fulfilling the entire query by only accessing the index.  You'll also learn about partial indexes, how they can reduce the size of index storage and retrieval while still getting your data for you.

Chapter six covers configuring the server itself, hitting on the system variables such as the innodb buffer pool (innodb_buffer_pool_size) and key buffer (key_buffer_size) as well as the query cache.  You'll also learn how to set the four main session memory settings - sort buffer (sort_buffer_size) and join buffer ( join_buffer_size) as well as the lesser known read buffers (read_buffer_size and read_rnd_buffer_size).

Chapter seven is all about the process of tuning and optimizing MySQL.  Rolling all the previous sections into marching orders, and prescriptive advice, he takes you through step by step how to apply the principles.  You'll get an introduction to mk-query-digest (though strangely without attribution to Baron Schwartz), the great maatkit tool for query analysis and aggregation, as well as the microsecond precision patch, which allows your mysql shell client to display more exact timing data.  For the patch he links back to an article on his own site which seems to be not found.  The author of the high precision mysql timer patch is Stewart Smith.

I personally got the most out of Chapter eight, full of self-described hidden performance tips.  From identifying unused or duplicate indexes, to replacing inefficient data types with better ones, why it's important to use NOT NULL where possible or how to store IP addresses efficiently, this chapter has a lot of goodies.  For those still struggling with SQL statement tuning, there are a few patterns that are described, offering advice on how to rewrite a subquery as an inner join,

What you might not know

  • MySQL includes Oracle's index organized tables by a different name
  • Too many indexes can dramatically impact INSERT & UPDATE performance
  • Many DDL operations can be done online - see oak-online-alter-table (Shlomi Noach)
  • Datatypes matter - use enum, int unsigned, timestamp & not null where possible
  • Covering indexes are your friend, duplicate & unused indexes are not!
  • A replication slave can have different storage engines or indexes from the master. These can support different uses - such as data warehousing or non-transactional requirements.
  • While a_string LIKE '%end of my sentence.' won't use an index, you can index reverse_string, then use reverse_string LIKE REVERSE '%end of my sentence.' and MySQL will use this index.  You've simulated an advanced Oracle feature, reverse key indexes!

A few small gripes

If I were to add a few complaints it would be to say that some of the examples were rather simplistic.  In many cases tuning SQL is not as simple as just adding the right index.  For instance there was no good discussion of the dreaded "using temporary, using filesort" that we see a lot in MySQL explains when sorting has to be done, but will not fit in memory.  Or what about tmpdir=/dev/shm, how will that improve things?  What about UNION versus UNION ALL where appropriate.  Why does DISTINCT do a sort?

The book was also missing a discussion of triggers, stored procedures, when or if the query cache can cause problems and so forth.  Also the article link mentioned about chapter seven isn't the only missing link.  I followed links to optimizing sql  statements and it seems to go to a generic holding page.  Also the main link effectivemysql.com/book leads to an outline of an as yet unreleased title on Backup and Recovery.

All in all, well worth your money

However, other than these few gripes the book overall is a very welcome addition to the small family of MySQL books.  Get a copy quick before they're all gone!

 

 


PlanetMySQL Voting: Vote UP / Vote DOWN

Understanding Indexing – NY Effective MySQL Meetup

Октябрь 7th, 2011

At next week’s NY Effective MySQL Meetup, I will give a talk: “Understanding Indexing: Three rules on making indexes around queries to provide good performance.” The meetup is 7 pm Tuesday, October 11th, and will be held at Hive at 55 (55 Broad Street, New York, NY). Thanks to host Ronald Bradford for the invitation.

Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.

This is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.

The rules are derived from these simple properties:

  • Point queries are slow
  • Range queries are fast

I hope to see you there!


PlanetMySQL Voting: Vote UP / Vote DOWN

Write Optimization: Myths, Comparison, Clarifications, Part 2

Октябрь 4th, 2011

In my last post, we talked about the read/write tradeoff of indexing data structures, and some ways that people augment B-trees in order to get better write performance. We also talked about the significant drawbacks of each method, and I promised to show some more fundamental approaches.

We had two “workload-based” techniques: inserting in sequential order, and using fewer indexes, and two “data structure-based” techniques: a write buffer, and OLAP. Remember, the most common thing people do when faced with an insertion bottleneck is to use fewer indexes, and this kills query performance. So keep in mind that all our work on write-optimization is really work for read-optimization, in that write-optimized indexes are cheap enough that you can keep all the ones you need to get good read performance.

Today, I’ll draw some parallels with the write buffer and OLAP. Recall that the write buffer gets you a small insertion speedup but doesn’t really hurt query time, and OLAP gets you a big insertion speedup but doesn’t let you query your data for a long time. We’ll also use the fact that sequential insertions are orders of magnitude faster than random insertions.

With that in mind, let’s move on to the new generation of write optimization.

Two Great Tastes: Log-Structured Merge trees (LSMs)

We couldn’t manage to get both insertion and query performance out of either a write buffer or OLAP. But they’re similar techniques, just at two extremes.

LSMs are two great tastes that taste great together. To start, make the buffer big, but make it a B-tree, so you can query data as it arrives. In fact, suppose you have log(N) B-trees, B1, B2, …, Blog(N), each twice the size of the one before. If B-trees are slow, using more of them sounds crazy, but I promise we’re getting somewhere.

An LSM Tree with 4 levels

Each B-tree has twice the capacity of the one before it. So Bk can hold 2k elements. When a new row arrives, put it in B-tree B1. When B1 reaches capacity (which in this case is 2 rows), dump those rows into B2. B2‘s capacity is 4 rows. When B2 overflows, you dump the items into B3, which has a capacity of 8 rows, and so on. The trick here is that each time we dump things down to the next B-tree, they’re already sorted, so we get the insertion boost out of doing sequential insertions.

The first log(M) B-trees are in memory (where M is the size of memory). A simple optimization is to just have one B-tree for all these levels, because in-memory B-trees are fast. Once you start flowing out of memory, you are always merging one tree with another which has at most twice as many rows. This way, the smaller B-tree can be treated like the large, OLAP-style buffer, and you get a similarly large speedup, in fact, this merge happens at disk bandwidth speeds.

Not so fast, you say: You don’t get to use all the bandwidth, because each row gets moved from B-tree to B-tree, and it uses up bandwidth each time. This is true, but it turns out that you’re operating at a 1/log(N/M) fraction of bandwidth, which is a lot better than a B-tree, by orders of magnitude.

Alas, the queries are not so great. Even though we made the buffer into B-trees, which are good for queries, you now need to do a query in each one. There are log(N/M) of them on disk, so this ends up being slower than a B-tree by a log(N/M) factor. There’s that pesky tradeoff, which is much better than the B-tree tradeoff, but still not the mathematically optimal tradeoff.

One last point: if instead of growing the B-trees by a factor of 2, you grow them by a larger factor, you slow down your insertions but speed up your queries. Once again, the tradeoff emerges.

Have Your Cake and Eat It Too: COLAs

A COLA (that’s Cache-Oblivious Lookahead Array) is a lot like an LSM, with the queries done in a better way. To begin with, you use a technique called fractional cascading to maintain some information about the relationship from one level to the next. This information can take many forms, but what’s important is that you don’t restart your query at each level and end up doing a full B-tree query log(N) times. Instead, you get to do a small local search. If you do things just right, you can match the query time of a single B-tree. This is true even if you are doubling your structures at each level, so in addition, COLAs are as fast at insertions as LSMs.

Let me repeat that: they match B-trees for queries while simultaneously matching LSMs for insertions. It’s nice to note that COLAs are on the mathematically optimal write/read tradeoff curve, and they’re a proof, by example, that B-trees are not optimal.

COLAs are on the optimal read/write tradeoff curve

This flavor of data structure, which combines the insertion speed of the size-doubling hierarchy of sorted structures (the LSM) with the query speed boost of fractional cascading, goes by many names and can be found dressed up in a bunch of surprising ways, but the underlying math, as well as the performance, is exactly the same.

For bonus points, if you read my colleagues’ paper on COLAs, you’ll see that they are described as being log(B) slower than B-trees on queries. This log(B) is easily recouped in practice—giving you the same query speed as B-trees—if you give up so-called cache obliviousness (a property which is nice mathematically, but not as nice as having faster queries).

Write Optimization is the Best Read Optimization

I’ve been focusing on write optimization, and Fractal Trees do go a couple of orders of magnitude faster than B-trees for indexing that pesky non-sequential data. What that means for the user is typically read optimization: you start adding all the indexes you needed all along, since indexes are so wonderfully cheap to update. My motto is: write optimization is the best read optimization!

You can get COLA-style read-optimal, write-optimized goodness here at Tokutek, where it is marketed as Fractal Trees and available in TokuDB for MySQL and MariaDB.


PlanetMySQL Voting: Vote UP / Vote DOWN

Are You Forcing MySQL to Do Twice as Many JOINs as Necessary?

Сентябрь 29th, 2011
.
Baron Schwartz
This guest post is from our friends at Percona. They’re hosting Percona Live London from October 24-25, 2011. Percona Live is a two day summit with 100% technical sessions led by some of the most established speakers in the MySQL field.

In the London area and interested in attending? We are giving away two free passes in the next few days. Watch our @tokutek twitter feed for a chance to win.

Did you know that the following query actually performs a JOIN? You can’t see it, but it’s there:

SELECT the_day, COUNT(*), SUM(clicks), SUM(cost)
FROM ad_clicks_by_day
WHERE the_day >= '2005-07-01' AND the_day < '2005-07-07'
GROUP BY the_day;

Let me explain.

Suppose you define the table as follows:

CREATE TABLE ad_clicks_by_day (
customer INT  NOT NULL,
the_day  DATE NOT NULL,
clicks   INT  NOT NULL DEFAULT 0,
cost     INT  NOT NULL DEFAULT 0,
PRIMARY KEY(customer, the_day),
INDEX(the_day)
) ENGINE=InnoDB;

What happens when MySQL executes this query? Here’s the EXPLAIN:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ad_clicks_by_day
         type: range
possible_keys: the_day
          key: the_day
      key_len: 3
          ref: NULL
         rows: 368
        Extra: Using where

That looks fine, doesn’t it? It’s using an index range scan to find approximately 368 matching rows, and adding up the clicks and cost for them. Can it be done any more efficiently than this?

In fact, it turns out that it’s possible to execute this query much more efficiently. What happens internally when the server executes this query is that it begins reading from the ‘the_day’ index, and for each row it finds, it performs a lookup in the primary key (which, in InnoDB, stores the whole table) to find the other columns mentioned in the query. That is, it’s joining the index to the primary key! This is not just an ivory-tower abstraction. In real-life workloads, the random I/O caused by such index-to-table joins slows queries dramatically.

The “join” here isn’t really the same as a true table-to-table SQL JOIN in some ways, and it has a little less overhead at the server level than a table-to-table join, but in terms of the way an index-to-row lookup works, it’s quite similar in many ways.

If you’re skilled at logical and physical database design, you already noticed that my example can be improved by indexing differently: just put the primary key on (the_day, customer) instead of the other way around. Then the query will use the primary key instead of the secondary key, and the whole row is available to the query without doing a join to another index. This is called a covering index, which means that the index covers the whole query–there is no need for any data outside the index. You’ll see “Using index” in the Extra column from EXPLAIN when an index covers a query.

But then what if we want to group the results by customer? No problem, we can just add another index. To achieve the same results for ad-hoc querying, we’ll need to index every column:

ALTER TABLE ad_clicks_by_day ADD INDEX(the_day, customer, clicks, cost);

That works, but isn’t that going to be kind of expensive? We’re essentially duplicating the whole table. And what if we have some other columns we want to target for grouping, or filtering, or sorting, or any other operation that can be optimized with an index? Uh-oh, that’s going to be a lot of indexes.

The overhead of maintaining indexes is often mentioned, but rarely measured. It is sometimes exaggerated as a result. (For further reading on this point, see the book Relational Database Index Design and the Optimizers by Tapio Lahdenmaki.) However, if you did measure the overhead, you would notice two important things: increased disk I/O for modifications to the table, and increased disk space consumption. Even if it’s sometimes blown out of proportion, it’s real.

That’s why I like TokuDB’s indexing technology and modifications to the storage engine API. Indexes are much cheaper in TokuDB than they are in InnoDB, for several reasons:

  • TokuDB compresses its data quite well, which reduces the footprint both in space consumption and in I/O operations.
  • TokuDB lets you define multiple clustering indexes–indexes that sort and compare only on the key columns, but which include all columns in the table, transparently. That avoids invisible JOINs. I want this feature in InnoDB.
  • TokuDB uses a different data structure for indexes, which is significantly more efficient. Instead of the classic B-Tree structure, TokuDB uses Fractal Trees, which don’t slow down when they get bigger than memory.

As a result of these three properties of TokuDB, you can maintain a lot of indexes on a lot of data relatively cheaply. And that means that you can index your data to match your query patterns, and get much faster performance by removing a lot of random I/O operations caused by finding data and then looking up the rest of the row.

Disclosure: Tokutek paid Percona to evaluate the technology, but it is our practice to offer only our honest opinion in blog posts such as this one.


About the Author

Baron is Percona’s Chief Performance Architect. He’s the lead author of High Performance MySQL, and creator of Maatkit. He consults with Percona’s customers, and develops tools and practices for Percona’s team.

 


PlanetMySQL Voting: Vote UP / Vote DOWN

Write Optimization: Myths, Comparison, Clarifications

Сентябрь 22nd, 2011

Some indexing structures are write optimized in that they are better than B-trees at ingesting data. Other indexing structures are read optimized in that they are better than B-trees at query time. Even within B-trees, there is a tradeoff between write performance and read performance. For example, non-clustering B-trees (such as MyISAM) are typically faster at indexing than clustering B-trees (such as InnoDB), but are then slower at queries.

This post is the first of two about how to understand write optimization, what it means for overall performance, and what the difference is between different write-optimized indexing schemes. We’ll be talking about how to deal with workloads that don’t fit in memory—in particular, if we had our data in B-trees, only the internal nodes (perhaps not even all of them) would fit in memory.

As I’ve already said, there is a tradeoff between write and read speed in B-trees. There is also a mathematically optimal tradeoff that’s been proven between data ingestion and query speed, no matter what data structure you have.

But these are not the same tradeoff! B-trees are not even remotely optimal in terms of read/write tradeoff. I’d say this is the most common confusion I run into on this topic. The way it comes up is in the assumption people seem to make that if you are faster than a B-tree at indexing, you must pay a read penalty. This is simply not true, and I intend to convince you, but today I’ll just describe how I think people arrive at this confusion.

Extreme solutions

What’s the best we can do if we only worry about writes? Or reads? By considering these extreme cases; we can get a feel for the space of all possible solutions.

What’s the most write-optimized structure? Simply log all the data. You ingest data at the bandwidth of the storage system, and you can’t do better than that. The problem is that each query now requires all the data to be examined, and that’s a pretty lousy way to get query performance (unless, of course, you intend to look at it all anyway, like MapReduce does).

At the other extreme, you get maximum read performance by re-sorting the data and laying it out optimally for queries (for each index!) every time new data is inserted. The layout is technical, but can be done, and query performance doesn’t get any better than this. But then insertion speeds are disastrously slow.

B-trees

For most real workloads, we can’t sacrifice one aspect of performance to benefit another, we need something that works well for both.

A B-tree is one compromise between reads and writes. They’re easy to understand and they’ve been popular for decades, but as the data structure grows, their performance falls to pieces. Heavy use over the years has led to all kinds of refinements, and there are plenty of write optimizations for B-trees we could discuss at length. Here’s a little observation that will help us out: write optimization in B-trees involves maximizing the useful work we accomplish each time we touch a leaf.1

Let’s examine some write optimizations and see how this observation applies. We’ll also see some drawbacks of each technique.

  • Insert in sequential order: B-trees are a couple of orders of magnitude faster at inserting data in sequential order compared to random order. This is because once we insert a row into a leaf, that leaf is in memory, and since the next run of rows are destined for the same leaf, they can all be inserted cheaply. According to the disk, it’s almost like you’re just logging the data.

    This property of B-trees leads many people to bend over backwards trying to keep their insertions sequential. In many applications, this is not a natural or easy thing to do, and causes all sorts of problems elsewhere in the system.

  • Use a write buffer: We store up a bunch of insertions in memory. When we have too many, we pick a leaf and write all the rows we’ve stored that are going to that leaf.

    This optimization works when we get to pick a bunch of stored rows that are going to the same leaf. When this happens, we see a speedup: you get to accomplish a bunch of work just for touching a leaf once.

    You can expect a win of a factor of 2 or so for this optimization, which is nice, but leaves a factor of more than 100 on the table. The query cost doesn’t take much of a hit in this case. Even though you do have to query the write buffer, it’s in memory so it’s way faster than querying the tree on disk.

  • OLAP: OLAP is a bunch of things, but with respect to insertions, the idea is to save up a big batch of rows, pre-sort them, and then insert them into an existing B-tree in sorted order. You can think of it like an on-disk version of the insertion buffer. The big win happens when the amount you batch up is of size comparable to the B-tree you already have, and this gets you the insertion boost that a write buffer can’t achieve.

    The downside is that, unlike the in-memory write buffer, the rows you are buffering don’t get indexed—they just get logged, and we already said how bad query performance is for a log. In practice, OLAP users don’t even look at their data until it gets indexed.

    So by batching more, you get better insertion speed, but wait longer before your data is available for queries. In this case, you get a write/freshness tradeoff, rather than a write/read tradeoff, but the fundamental reason is the same.

  • Use fewer indexes: In general, each row inserted must go to the leaf of every index. This technique is just: maintain fewer indexes, so we have less to do. We wanted to maximize the work accomplished per leaf we touch, but now we’re minimizing the number of leaves we need to touch per amount of work (row insertion), but we get the same effect.

    This is less like a technique to be employed, and more like an artifact of bad B-tree performance, but it’s probably the most common “optimization.” Its downside makes it the clearest case for true write optimization: if you can’t afford to keep the right set of indexes, you kill your query performance, and this is without a doubt a read/write tradeoff.

These are pretty common tricks for speeding up B-tree insertions. You may have even tried some of them yourself. Each one has a downside though, and often they’re not obvious in production. Maybe if you didn’t need to insert sequentially for speed, you could simplify your application in a useful way. Or if you thought you could afford more indexes, you’d spend the time to think up cool new ways to analyze your data.

I think the reason people are convinced that B-trees are on the optimal tradeoff curve is pretty simple. Since all of the popular write optimizations are modifications to a B-tree, most people end up stuck on the B-tree tradeoff curve. But contrary to conventional wisdom, it is possible to do a lot better. Next post, I’ll explain why.


1: That’s because we can amortize the cost of the I/O—which is needed by a B-tree for most leaves, when the database is bigger than memory—against the work we are able to complete.

PlanetMySQL Voting: Vote UP / Vote DOWN

common_schema rev. 68: eval(), processlist_grantees, candidate_keys, easter_day()

Сентябрь 6th, 2011

Revision 68 of common_schema is out, and includes some interesting features:

  • eval(): Evaluates the queries generated by a given query
  • match_grantee(): Match an existing account based on user+host
  • processlist_grantees: Assigning of GRANTEEs for connected processes
  • candidate_keys: Listing of prioritized candidate keys: keys which are UNIQUE, by order of best-use.
  • easter_day(): Returns DATE of easter day in given DATETIME’s year.

Let’s take a slightly closer look at these:

eval()

I’ve dedicated this blog post on MySQL eval() to describe it. In simple summary: eval() takes a query which generates queries (most common use queries on INFORMATION_SCHEMA) and auto-evaluates (executes) those queries. Read more

match_grantee()

As presented in Finding CURRENT_USER for any user, I’ve developed the algorithm to match a connected user+host details (as presented with PROCESSLIST) with the grantee tables (i.e. the mysql.user table), in a manner which simulates the MySQL server account matching algorithm.

This is now available as a stored function: given a user+host, the function returns with the best matched grantee. Read more

processlist_grantees

This view relies on the above, and maps the entire PROCESSLIST onto GRANTEEs. The view maps each process onto the GRANTEE (MySQL account) which is the owner of that process. Surprisingly, MySQL does not provide one with such information.

The view also provides with the following useful metadata:

  • Is said process executes under a SUPER privilege?
  • Is this a replication thread, or serving a replicating client?
  • Is this process the current connection (myself)?

In the spirit of common_schema, it provides with the SQL commands necessary to KILL and KILL QUERY for each process. A sample output:

mysql> SELECT * FROM common_schema.processlist_grantees;
+--------+------------+---------------------+------------------------+--------------+--------------+----------+---------+-------------------+---------------------+
| ID     | USER       | HOST                | GRANTEE                | grantee_user | grantee_host | is_super | is_repl | sql_kill_query    | sql_kill_connection |
+--------+------------+---------------------+------------------------+--------------+--------------+----------+---------+-------------------+---------------------+
| 650472 | replica    | jboss00.myweb:34266 | 'replica'@'%.myweb'    | replica      | %.myweb      |        0 |       1 | KILL QUERY 650472 | KILL 650472         |
| 692346 | openarkkit | jboss02.myweb:43740 | 'openarkkit'@'%.myweb' | openarkkit   | %.myweb      |        0 |       0 | KILL QUERY 692346 | KILL 692346         |
| 842853 | root       | localhost           | 'root'@'localhost'     | root         | localhost    |        1 |       0 | KILL QUERY 842853 | KILL 842853         |
| 843443 | jboss      | jboss03.myweb:40007 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 843443 | KILL 843443         |
| 843444 | jboss      | jboss03.myweb:40012 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 843444 | KILL 843444         |
| 843510 | jboss      | jboss00.myweb:49850 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 843510 | KILL 843510         |
| 844559 | jboss      | jboss01.myweb:37031 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 844559 | KILL 844559         |
+--------+------------+---------------------+------------------------+--------------+--------------+----------+---------+-------------------+---------------------+

Finally, it is now possible to execute the following:  “Kill all slow queries which are not executed by users with the SUPER privilege or are replication threads”. To just generate the commands, execute:

mysql> SELECT sql_kill_connection FROM common_schema.processlist_grantees WHERE is_super = 0 AND is_repl = 0;

Sorry, did you only want to kill the queries? Those which are very slow? Do as follows:

mysql> SELECT sql_kill_connection FROM common_schema.processlist_grantees JOIN INFORMATION_SCHEMA.PROCESSLIST USING(ID) WHERE TIME > 10 AND is_super = 0 AND is_repl = 0;

But, really, we don’t just want commands. We really want to execute this!

Good! Step in eval():

mysql> CALL common_schema.eval('SELECT sql_kill_query FROM common_schema.processlist_grantees JOIN INFORMATION_SCHEMA.PROCESSLIST USING(id) WHERE TIME > 10 AND is_super = 0 AND is_repl = 0');

Read more

candidate_keys

A view which lists the candidate keys for tables and provides ranking for those keys, based on some simple heuristics.

This view uses  the same algorithm as that used by oak-chunk-update and oak-online-alter-table, tools in the openark kit. So it provides with a way to choose the best candidate key to walk through a table. At current, a table’s PRIMARY KEY is always considered to be best, because of InnoDB’s structure of clustered index. But I intend to change that as well and provide general recommendation about candidate keys (so for example, I would be able to recommend that the PRIMARY KEY is not optimal for some table).

Actually, after a discussion initiated by Giuseppe and Roland, starting here and continuing on mail, there are more checks to be made for candidate keys, and I suspect the next version of candidate_keys will be more informational.

Read more

easter_day()

Many thanks to Roland Bouman who suggested his code for calculating easter day for a given year. Weehee! This is the first contribution to common_schema! Read more

Get it

common_schema is an open source project. It is released under the BSD license.

Find it here.


PlanetMySQL Voting: Vote UP / Vote DOWN

common_schema rev. 68: eval(), processlist_grantees, candidate_keys, easter_day()

Сентябрь 6th, 2011

Revision 68 of common_schema is out, and includes some interesting features:

  • eval(): Evaluates the queries generated by a given query
  • match_grantee(): Match an existing account based on user+host
  • processlist_grantees: Assigning of GRANTEEs for connected processes
  • candidate_keys: Listing of prioritized candidate keys: keys which are UNIQUE, by order of best-use.
  • easter_day(): Returns DATE of easter day in given DATETIME’s year.

Let’s take a slightly closer look at these:

eval()

I’ve dedicated this blog post on MySQL eval() to describe it. In simple summary: eval() takes a query which generates queries (most common use queries on INFORMATION_SCHEMA) and auto-evaluates (executes) those queries. Read more

match_grantee()

As presented in Finding CURRENT_USER for any user, I’ve developed the algorithm to match a connected user+host details (as presented with PROCESSLIST) with the grantee tables (i.e. the mysql.user table), in a manner which simulates the MySQL server account matching algorithm.

This is now available as a stored function: given a user+host, the function returns with the best matched grantee. Read more

processlist_grantees

This view relies on the above, and maps the entire PROCESSLIST onto GRANTEEs. The view maps each process onto the GRANTEE (MySQL account) which is the owner of that process. Surprisingly, MySQL does not provide one with such information.

The view also provides with the following useful metadata:

  • Is said process executes under a SUPER privilege?
  • Is this a replication thread, or serving a replicating client?
  • Is this process the current connection (myself)?

In the spirit of common_schema, it provides with the SQL commands necessary to KILL and KILL QUERY for each process. A sample output:

mysql> SELECT * FROM common_schema.processlist_grantees;
+--------+------------+---------------------+------------------------+--------------+--------------+----------+---------+-------------------+---------------------+
| ID     | USER       | HOST                | GRANTEE                | grantee_user | grantee_host | is_super | is_repl | sql_kill_query    | sql_kill_connection |
+--------+------------+---------------------+------------------------+--------------+--------------+----------+---------+-------------------+---------------------+
| 650472 | replica    | jboss00.myweb:34266 | 'replica'@'%.myweb'    | replica      | %.myweb      |        0 |       1 | KILL QUERY 650472 | KILL 650472         |
| 692346 | openarkkit | jboss02.myweb:43740 | 'openarkkit'@'%.myweb' | openarkkit   | %.myweb      |        0 |       0 | KILL QUERY 692346 | KILL 692346         |
| 842853 | root       | localhost           | 'root'@'localhost'     | root         | localhost    |        1 |       0 | KILL QUERY 842853 | KILL 842853         |
| 843443 | jboss      | jboss03.myweb:40007 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 843443 | KILL 843443         |
| 843444 | jboss      | jboss03.myweb:40012 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 843444 | KILL 843444         |
| 843510 | jboss      | jboss00.myweb:49850 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 843510 | KILL 843510         |
| 844559 | jboss      | jboss01.myweb:37031 | 'jboss'@'%.myweb'      | jboss        | %.myweb      |        0 |       0 | KILL QUERY 844559 | KILL 844559         |
+--------+------------+---------------------+------------------------+--------------+--------------+----------+---------+-------------------+---------------------+

Finally, it is now possible to execute the following:  “Kill all slow queries which are not executed by users with the SUPER privilege or are replication threads”. To just generate the commands, execute:

mysql> SELECT sql_kill_connection FROM common_schema.processlist_grantees WHERE is_super = 0 AND is_repl = 0;

Sorry, did you only want to kill the queries? Those which are very slow? Do as follows:

mysql> SELECT sql_kill_connection FROM common_schema.processlist_grantees JOIN INFORMATION_SCHEMA.PROCESSLIST USING(ID) WHERE TIME > 10 AND is_super = 0 AND is_repl = 0;

But, really, we don’t just want commands. We really want to execute this!

Good! Step in eval():

mysql> CALL common_schema.eval('SELECT sql_kill_query FROM common_schema.processlist_grantees JOIN INFORMATION_SCHEMA.PROCESSLIST USING(id) WHERE TIME > 10 AND is_super = 0 AND is_repl = 0');

Read more

candidate_keys

A view which lists the candidate keys for tables and provides ranking for those keys, based on some simple heuristics.

This view uses  the same algorithm as that used by oak-chunk-update and oak-online-alter-table, tools in the openark kit. So it provides with a way to choose the best candidate key to walk through a table. At current, a table’s PRIMARY KEY is always considered to be best, because of InnoDB’s structure of clustered index. But I intend to change that as well and provide general recommendation about candidate keys (so for example, I would be able to recommend that the PRIMARY KEY is not optimal for some table).

Actually, after a discussion initiated by Giuseppe and Roland, starting here and continuing on mail, there are more checks to be made for candidate keys, and I suspect the next version of candidate_keys will be more informational.

Read more

easter_day()

Many thanks to Roland Bouman who suggested his code for calculating easter day for a given year. Weehee! This is the first contribution to common_schema! Read more

Get it

common_schema is an open source project. It is released under the BSD license.

Find it here.


PlanetMySQL Voting: Vote UP / Vote DOWN