Archive for the ‘PBXT’ Category

My Presentation at the DOAG 2010

Ноябрь 18th, 2010
Yesterday I presented PBXT: A Transactional Storage Engine for MySQL at the German Oracle User Group Conference (DOAG) in Nuremberg. A number of people asked for the slides, so here is the link.

The talk was scheduled to be in English, but since I had a German-only audience I presented in German. There was quite a bit of interest, particularly in the Engine Level replication built into PBXT 2.0.

As Ronny observed, this feature can be used effectively for many tasks, including for online backup and maintaining a hot-standby. This all with the addition of a "small" feature:

The Master could initially stream the entire database over to the Slave before actual replication begins. This would also make it extremely easy to setup replication.

A brilliant idea, but a good 3 months work...

PlanetMySQL Voting: Vote UP / Vote DOWN

PBXT in tpcc-like benchmark

Июнь 3rd, 2010

Finally I was able to run PBXT 1.0.11 pre-GA in tpcc-like workload, apparently there was bug with did not allow me to get the result earlier, and I am happy to see that PBXT team managed it.

For initial runs I took tpcc 100 warehouses ( about 10GB of data) which fully fits into memory (32 GB on server),
and compared 1 and 16 users in MySQL-5.1.46/PBXT and Percona Server / XtraDB – 5.1.45-rel10.2. As workload is totally memory based it will show how PBXT scales in CPU-bond cases on 16 cores systems.

As storage system it was Intel SSD X25-M card.

While full results and config are on Wiki:
http://www.percona.com/docs/wiki/benchmark:pbxt:tpcc:start

there are graphs for 1 user:

and 16 users:

Interesting to see that in case with 1 user the maximal throughput in PBXT is about 1.5x better XtraDB, but
there periodical drops which are very similar to periodical drops in InnoDB without adaptive checkpointing, and I guess it is also related to checkpoint activity.
The final results are also better for PBXT: 5785.567 TpmC vs 4905.967 TpmC ( XtraDB)

For 16 threads final result is: 26129.350 TpmC for PBXT and 29485.518 TpmC for XtraDB , and from the graph you can see that the maximal throughput is about identical, while PBXT spends more time in drops area. Again it looks like PBXT are not fully keeping up with checkpoint activity and I am looking PBXT addresses this problem also. Beside this issue PBXT looks pretty good and in next round I am going to run IO intensive workloads.


Entry posted by Vadim | One comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Unqualified COUNT(*) speed PBXT vs InnoDB

Май 27th, 2010

So this is about a SELECT COUNT(*) FROM tblname without a WHERE clause. MyISAM has an optimisation for that since it maintains a rowcount for each table. InnoDB and PBXT can’t do that (at least not easily) because of their multi-versioned nature… different transactions may see a different number of rows for the table table!

So, it’s kinda known but nevertheless often ignored that this operation on InnoDB is costly in terms of time; what InnoDB has to do to figure out the exact number of rows is scan the primary key and just tally. Of course it’s faster if it doesn’t have to read a lot of the blocks from disk (i.e. smaller dataset or a large enough buffer pool).

I was curious about PBXT’s performance on this, and behold it appears to be quite a bit faster! For a table with 50 million rows, PBXT took about 20 minutes whereas the same table in InnoDB took 30 minutes. Interesting!

From those numbers you can tell that doing the query at all is not an efficient thing to do, and definitely not something a frontend web page should be doing. Usually you just need a ballpark figure so running the query in a cron job and putting the value into memcached (or just an include file) will work well in such cases.

If you do use a WHERE clause, all engines (including MyISAM) are in the same boat… they might be able to use an index to filter on the conditions – but the bigger the table, the more work it is for the engine. PBXT being faster than InnoDB for this task makes it potentially interesting for reporting purposes as well, where otherwise you might consider using MyISAM – we generally recommend using a separate reporting slave with particular settings anyway (fewer connections but larger session-specific buffers), but it’s good to have extra choices for the task.

(In case you didn’t know, it’s ok for a slave to use a different engine from a master – so you can really make use of that ability for specialised tasks such as reporting.)


PlanetMySQL Voting: Vote UP / Vote DOWN

PBXT early impressions in production use

Май 27th, 2010

With Paul McCullagh’s PBXT storage engine getting integrated into MariaDB 5.1, it’s never been easier to it out. So we have, on a slave off one of our own production systems which gets lots of inserts from our Zabbix monitoring system.

That’s possibly an ideal usage profile, since PBXT is a log based engine (simplistically stated, it indexes its transaction logs, rather than rewriting data from log into index and indexing that) so it should require less disk I/O than say InnoDB. And that means it should be particularly suited to for instance logging, which have lots of inserts on a sustained basis. Note that for short insert burst you may not see a difference with InnoDB because of caching, but sustain it and then you can notice.

Because PBXT has such different/distinct architecture there’s a lot of learning involved. Together with Paul and help from Roland Bouman we also created a stored procedure that can calculate the optimal average row size for PBXT, and even ALTER TABLE statements you can paste to convert tables. The AVG_ROW_LENGTH option is quite critical with PBXT, if set too big (or if you let PBXT guess and it gets it wrong) it’ll eat heaps more diskspace as well as being much slower, and if too small it’ll be slower also; this, it needs to be in the right ballpark. For existing datasets it can be calculated, so that’s what we’ve worked on. The procs will be published shortly, and Paul will also put them in with the rest of the PBXT files.

Another important aspect for PBXT is having sufficient cache memory allocated, otherwise operations can take much much longer. While the exact “cause” is different, one would notice similar performance aspects when using InnoDB on larger datasets and buffers that are too small for the purpose.

So, while using or converting some tables to PBXT takes a bit of consideration, effort and learning, it appears to be dealing with the real world very well so far – and that’s a testament to Paul’s experience. Paul is also very responsive to questions. As we gain more experience, it is our intent to try PBXT for some of our clients that have operational needs that might be a particularly good fit for PBXT.

I should also mention that it is possible to have a consistent transaction between PBXT, InnoDB and the binary log, because of the 2-phase commit (XA) infrastructure. This means that you should even be able to do a mysqldump with –single-transaction if you have both PBXT and InnoDB tables, and acquire a consistent snapshot!

More experiences and details to come.


PlanetMySQL Voting: Vote UP / Vote DOWN

Slides of the PBXT Presentation

Апрель 15th, 2010
Here are the slides to my talk yesterday: A Practical Guide to the PBXT Storage Engine.

For anyone who missed my talk, I think it is worth going through the slides, because the are fairly self explanatory.

If there are any questions, please post them as a comment to the blog. I will be glad to answer :)

PlanetMySQL Voting: Vote UP / Vote DOWN

PBXT at the MySQL User Conference 2010

Апрель 9th, 2010
At this year's User Conference I have some interesting results to present. But more than anything else, my talk will explain how you can really get the most out of the engine. The design of PBXT makes it flexible, but this provides a lot of options. What tools are available to help you make the right decisions? I will explain.

Every design has trade-offs. How does this work out in practice for PBXT? And how can you take advantage of the strengths of the PBXT storage engine? I will explain in:

A Practical Guide to the PBXT Storage Engine
Paul McCullagh
2:00pm - 3:00pm Tuesday, 04/13/2009
Ballroom E

Don't miss it! :)


PlanetMySQL Voting: Vote UP / Vote DOWN

Fast reads or fast scans?

Март 29th, 2010
MyISAM is frequently described and marketed as providing fast reads when it really provides fast index and table scans. This is a more narrow use case as fast reads implies great performance for most queries while fast scans implies great performance for single-table queries that are index only or do a full table scan.

MyISAM caches index blocks but not data blocks. There can be a lot of overhead from re-reading data blocks from the OS buffer cache assuming mmap is not used. InnoDB and PBXT are 20X faster than MyISAM for some of my tests. However, I suspect that mutex contention on the key cache is also a factor in the performance differences.

While there are many claims about the great performance of MyISAM. There are not as many examples that explain when it is fast. Alas, the same marketing technique is being repeated with NoSQL to the disadvantage of MySQL.
Tests were run on a server that reports 16 CPU cores. The full test configuration is described elsewhere. For this test I modified the sysbench oltp test to do a self-join query. I will publish the code soon. The schema for the test is:
CREATE TABLE sbtest (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  k int(10) unsigned NOT NULL DEFAULT '0',
  c char(120) NOT NULL DEFAULT '',
  pad char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k (k)
) ENGINE=InnoDB;
The self-join query uses a range predicate that selects a fixed number (1, 10, 100, 1000 or 10000) of rows. This is an example that selects 1000 rows.
SELECT t1.c, t2.c FROM sbtest t1, sbtest t2
WHERE t1.id between 245793 and 246792 and t2.id = 2000000 - t1.id
Tests were run using MySQL 5.1.45 for MyISAM, InnoDB plugin 1.0.6 and PBXT 1.1. Results are in queries per second for 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 concurrent clients. I do not report results for 512 and 1024 clients to avoid long lines in this post.

The performance of MyISAM is much worse compared to InnoDB and PBXT as the number of rows selected grows from 1 to 10,000.

Queries per second when the between predicate selects 1 row:
  6843  13157  24552  46822  62588  57023  46568  30582  18745 innodb
  6164  13627  25671  48705  63741  59217  48300  30964  18866 pbxt
  6354  12061  23373  44284  50778  49546  44412  30444  18827 myisam

Queries per second when the between predicate selects 10 rows:
  4240   8466  16387  33221  53902  39599  36214  28026  18084 innodb
  4802   8835  17688  35917  57461  47691  41578  29087  18558 pbxt
  3890   7129  12512  16450  12272  12304  12441  12448  11304 myisam

Queries per second when the between predicate selects 100 rows:
  1842   3455   7249  14842  20206  13875  13471  12942  12344 innodb
  2113   3522   7893  13411  18597  18905  18694  18123  12301 pbxt
  1608   2260   2263   1899   1371   1399   1451   1468   1442 myisam

Queries per second when the between predicate selects 1000 rows:
   380    654   1222   2023   2487   1866   1791   1794   1942 innodb
   303    641   1149   1699   2044   2069   2072   2063   2056 pbxt
   232    248    227    189    141    143    149    148    148 myisam

Queries per second when the between predicate selects 10000 rows:
    43     70    130    213    254    199    194    196    199 innodb
    49     69    123    182    213    216    216    216    216 pbxt
    24     24     23     19     14     14     15     15     15 myisam

MyISAM is at a disadvantage because it does not cache data blocks, so I changed the query to be index only and it is listed below. This did not make MyISAM faster. I think the bottleneck is contention on the key cache mutex.
SELECT t1.id, t2.id FROM sbtest t1, sbtest t2
WHERE t1.id between 245793 and 246792 and t2.id = 2000000 - t1.id
Queries per second for range 1000 using the index only query:
   457    706   1354   2146   2596   2044   1918   1887   1953 innodb
   576    837   1386   1681   2058   2094   2103   2095   2087 pbxt
   353    244    223    190    140    142    147    146    146 myisam

Results for MySQL 5.0.84 are similar to 5.1.45 for the range 1000 query:
   390    642   1241   2045   2547   1891   1825   1813   1930 innodb
   303    239    225    189    140    141    147    146    146 myisam

The query plan for the basic query:
explain  SELECT t1.c, t2.c
from sbtest t1, sbtest t2
where t1.id between 245793 and 246792 and t2.id = 2000000 - t1.id

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 1072
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
        Extra: Using where; Using index
2 rows in set (0.01 sec)

The query plan for the index only join:
explain  SELECT t1.id, t2.id
from sbtest t1, sbtest t2
where t1.id between 1916457 and 1917456 and t2.id = 2000000 - t1.id
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 978
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: func
         rows: 1
        Extra: Using where; Using index
2 rows in set (0.00 sec)

PlanetMySQL Voting: Vote UP / Vote DOWN

PBXT Engine Level replication, works!

Март 17th, 2010
I have been talking about this for a while, now at last I have found the time to get started! Below is a picture from my 2008 MySQL User Conference presentation. It illustrates how engine level replication works, and also shows how this can be ramped up to provide a multi-master HA setup.


What I now have running is the first phase: asynchronous replication, in a master/slave configuration. The way it works is simple. For every slave in the configuration the master PBXT engine starts a thread which reads the transaction log, and transfers modifications to a thread which applies the changes to PBXT tables on the slave.

Where to get it

I have pushed the changes that do this trick to PBXT 2.0 on Launchpad. The branch to try out is lp:pbxt/2.0.

Getting started

Setup of the replication is dead easy. Assuming you already have a PBXT database, what you need to do is the following:

1. Copy the Master data: Shutdown the MySQL server and make a complete copy of the data directory.

2. Setup a Slave server: Setup a second MySQL server using the copy of the data directory.

3. Declare the Slave: Create a text file called slaves, in the data/pbxt directory of the master server, with the following entry:
[slave]
name=slave-thread-name
host=host-name-of-slave
port=37656
slave-process-name is any name you like, and is used to identify the replication thread running on the master. host-name-of-slave is the host name or IP address of the slave MySQL server. 37656 is the default port used by the PBXT slave engine to receive replication changes.

4. Enable replication: On the master server set pbxt_enable_replication=1, and on the slave server set pbxt_enable_replication=2. Also make sure that both servers have different server IDs (system parameter: server_id).

5. Start both servers: Replication will begin immediately if the slave server is started before master server, otherwise replication will begin after a minute (see below).

How it works


PBXT engine level replication, unlike MySQL replication, pushes changes to the slave. For every entry in the data/pbxt/slaves file, PBXT starts a thread (the supplier thread). The thread connects to the slave on the given address, and pushes the changes to an applier thread run by the PBXT engine on the slave side. If any error occurs, the supplier thread on the master will pause, and then try again in a minute.

On connect the supplier thread requests the global transaction ID (GID) of the last transaction committed on the slave. The applier determines the GID of the last transaction by searching backwards through its own transaction logs.

Replication is row-based, and fairly low level. Changes refer to the PBXT internal row and table IDs. The row data is transferred in the same format used to store the information on disk. This makes the replication extremely efficient. The supplier thread does not even have to read the log from disk if it is fairly up-to-date, because PBXT already caches the last changes to the transaction log for use by the writer and the sweeper threads.

Probably the most important thing about this type of replication is that it (theoretically) has almost no affect on the "foreground" activity on the master machine. I am interested to find out if this really is the case.

What's next?

Replication of DDL changes are not implemented yet. So if you do ALTER TABLE or any other such operation, replication will stop, and have to be restarted by copying over the data directory to the slave again.

After DDL changes the next step is to add synchronous replication, as illustrated above. This requires waiting for a commit from the slave before continuing. Latency in this case can be kept to a minimum by sending transactions to the slave before they have been committed on the master.

I believe this would then provide the basis for an extremely simple (and efficient) HA solution based on MySQL.

PlanetMySQL Voting: Vote UP / Vote DOWN

Embedded PBXT is Cool

Февраль 26th, 2010
Martin Scholl (@zeit_geist) has started a new project based on the PBXT storage engine: EPBXT - Embedded PBXT! In his first blog he describes how you can easily build the latest version: Building Embedded PBXT from bzr.

The interesting thing about this project is that it exposes the "raw" power of the engine. Some basic performance tests show this really is the case.

At the lowest level, PBXT does not impose any format on the data stored in tables and indexes. When running as a MySQL storage engine it uses the MySQL native row and index formats. Theoretically it would be possible to expose this in an embedded API. The work Martin is doing goes in at this level. The wrapper around the engine determines the data types, data sizes, row and index format. Comparison operations for the data types are also supplied by the embedded code or user program.

This flexibility will make it possible for an application to store its own data very efficiently. As Martin suggested, it would also be possible to use Google's protobuf's for the row format. This would eliminate the need to use an ALTER TABLE for many types of changes to a table's definition!

Of course, EPBXT is still a way from realizing this vision, and Martin has some very specific problems he wants to solve with the development. However, judging by his command of the code within such a short time, this is going to be a project to watch in the future!

PlanetMySQL Voting: Vote UP / Vote DOWN

Building 5.1.38-maria packages

Сентябрь 30th, 2009

We’ve been able to do MySQL 5.1 binary tarballs for a bit now (great working together with Kristian Nielsen of Monty Program), but packages are bit more tricky. Peter has been working on Debian/Ubuntu while I’ve focused on RH/CentOS. The following is from an OurDelta (trial build run) RPM install on CentOS 5 x64:

$ mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.1.38-maria-beta1-ourdelta (OurDelta - http://ourdelta.org/)

mysql> CREATE TABLE test.t1 (i int) ENGINE=PBXT;
Query OK, 0 rows affected (0.10 sec)

mysql> SHOW CREATE TABLE test.t1\G
*************************** 1. row ***************************
Table: test.t1
Create Table: CREATE TABLE `test.t1` (
`i` int(11) DEFAULT NULL
) ENGINE=PBXT DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> INSERT INTO test.t1 values (1);
Query OK, 1 row affected (0.03 sec)

mysql> SELECT * FROM test.t1;
+------+
| i    |
+------+
|    1 |
+------+
1 row in set (0.00 sec)

mysql> SHOW ENGINE PBXT STATUS\G
*************************** 1. row ***************************
Type: PBXT
Name:
Status:
090929 23:05:41 PBXT 1.0.08d RC STATUS OUTPUT
Record cache usage: 65675
Record cache size:  33554432
Record cache high:  65675
Index cache usage:  0
Index cache size:   33554432
Log cache usage:    295128
Log cache size:     16756712
Data log files:
1 row in set (0.00 sec)

So yes, PBXT is fully in there, as is XtraDB (the enhanced InnoDB plugin).

mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_version';
+----------------+----------+
| Variable_name  | Value    |
+----------------+----------+
| innodb_version | 1.0.3-6a |
+----------------+----------+
1 row in set (0.00 sec)

PlanetMySQL Voting: Vote UP / Vote DOWN