Archive for the ‘storage’ Category

BlitzDB Crash Safety and Auto Recovery

Июль 22nd, 2010

Crash Safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so called NoSQL world compromises crash safety in return for higher QPS. The argument there is that the availability of the overall system should be accomplished by replication since a database server can’t be rescued if the physical disk breaks. I happen to agree with this philosophy but I am also aware that this isn’t a correct answer for everyone. So, what will I do with BlitzDB?

Several relational database hackers have pointed out that BlitzDB isn’t any safer than MyISAM since it doesn’t guarantee crash safety. This is currently true but I plan on making BlitzDB much safer than MyISAM by providing following features.

  1. Auto Recovery Routine (startup option)
  2. Tokyo Cabinet’s Transaction API (table-specific option)

The second feature above would actually guarantee BlitzDB to be crash safe (especially combined with auto recovery) but I won’t get into depth in this post since this topic deserves a blog post of it’s own. Let me just state that this feature will be provided in a form like this:

CREATE TABLE t1 (
  a int PRIMARY KEY,
  b varchar(256)
) ENGINE = BLITZDB, CRASH_SAFE;

From here on, I’ll cover how I plan on hacking auto recovery in BlitzDB.

Auto Recovery Challenges

As I blogged a while back, recovering Tokyo Cabinet is relatively simple. However, this is not a sufficient solution in BlitzDB since the data file (hash database that actually holds the rows) and the index file(s) are independent from each other. That is, the likelihood of the data file and the index file(s) to be inconsistent is very high after a crash. So, how can we hack on this? Pretty simple.

Indexes aren’t Important at Recovery Phase

Because BlitzDB logically separates the data file and it’s indexes, index files aren’t that important. If a server crash had occurred, BlitzDB could delete the index file(s) and recompute them from the data file. Needless to say, this process would involve a lot of random access and computation but it would not dominate the time space of the system since it’s a one-time cost. This approach however has one flaw in it such that the index files can’t be recomputed if the data file is broken or is unrecoverable.

Therefore to guarantee crash safety, BlitzDB must ensure that the data file is unbreakable. This is precisely where Tokyo Cabinet’s Transaction API comes in. I’m planning on using it to protect the data file from breaking. If the data file is protected, the table can be rescued. Simple!

So, that’s what I have in mind for making BlitzDB a safer engine. Unfortunately I can’t start hacking on it immediately since I have several bugs to fix first. Nevertheless I’m looking forward to start hacking on it. This challenge should be quite fun to tackle.


PlanetMySQL Voting: Vote UP / Vote DOWN

LinuxTag presentation now available for download

Июль 8th, 2010

A live recording of my LinuxTag 2010 presentation entitled Storage Done Right: Building a Resilient, Distributed, Highly Available Open Source iSCSI SAN is now available from our web site. If you want to find out how to build a complete SAN from 100% open source, do take a look!

I do apologize for the less-than-optimal sound quality. I did the recording myself with my laptop mike, so unfortunately there’s quite a bit of clipping in the audio track. I hope my ramblings are still somewhat audible.



PlanetMySQL Voting: Vote UP / Vote DOWN

Speaking at LinuxTag 2010 this Friday

Июнь 9th, 2010

Just as a quick reminder, as LinuxTag 2010 in Berlin is going into full swing: I am speaking on highly available, distributed, open source storage this Friday in room Europa I at 4pm CEST.

If you’re in Berlin, drop by! Admission is free.



PlanetMySQL Voting: Vote UP / Vote DOWN

Berkeley DB now supports SQL (again)

Март 25th, 2010

Berkeley DB (BDB) is undoubtedly the workhorse among the opensource embedded database engines. It started as a university project in the mid-eighties and was further developed by Sleepycat Software, until it got acquired by Oracle in February 2006.

I had the impression that BDB had lost a lot of its popularity among opensource developers to SQLite in recent times, which has evolved into becoming the default choice for developers looking for an embedded data store. I'd assume primarily because the code is not released under any particular license, but put in the public domain (which makes it very attractive for embedding it into one's code), and also because it's lightweight, supports SQL and has interfaces to a number of languages.

Of course, SQLite has its limitations and use cases (as every product), so it may not be suited for some particular application. As the SQLite developers put it: "SQLite is not designed to replace Oracle. It is designed to replace fopen().".

Yesterday, Oracle announced a new version of BDB. One of the notable features of this release is the introduction of a new SQL API, based on SQLite. According to Gregory Burd, Product Manager for Berkeley DB at Oracle, they did so by including a version of SQLite which uses Berkeley DB for storage (replacing btree.c). I think this is a very smart move – instead of introducing a new API, developers can now easily switch to a different storage backend in case they are experiencing issues with the default SQLite implementation. So now MySQL isn't the only database with different storage backends anymore :-)

I am curious to learn more about how the BDB implementation compares against the original (both feature- and performance-wise).

Oh, and this is actually not the first time someone put an SQL interface in front of Berkeley DB – BDB was the first transaction-safe storage engine that provided page-level locking for MySQL in version 3.23.15 (released in May 2000). The InnoDB storage engine was added some time afterwards (MySQL 3.23.34a, released in March 2001).


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL University: MySQL Column Databases

Март 2nd, 2010

This Thursday (March 4th, 15:00 UTC - slightly later than usual), Robin Schumacher will present MySQL Column Databases. If you're doing Data Warehouse with your databases this is a must-attend, but it's also interesting to learn what typical other scenarios there are for using column-based storage engines, and how column databases work in the first place.

For MySQL University sessions, point your browser to this page. You need a browser with a working Flash plugin. You may register for a Dimdim account, but you don't have to. (Dimdim is the conferencing system we're using for MySQL University sessions. It provides integrated voice streaming, chat, whiteboard, session recording, and more.)

MySQL University is a free educational online program for engineers/developers. MySQL University sessions are open to anyone. All sessions (slides & audio) are recorded; the links will be on the respective MySQL University session pages which are listed on the MySQL University home page.

Here's the tentative list of upcoming sessions:

  • May 27: Improving MySQL Full-Text Search (Kristofer Pettersson)
  • June 10: Securich - Security Plugin for MySQL (Darren Cassar; we've set up that second session because unfortunately recording of the original session didn't work)

As you can see, there are big gaps in the upcoming months. Care to fill them by giving a MySQL University presentation? If you'd like to be a speaker, have a look at this blog article!


PlanetMySQL Voting: Vote UP / Vote DOWN

Notes on HEAP/MyISAM Index Key Handling on WRITE

Январь 26th, 2010

Disclaimer: This post is based on HEAP/MyISAM’s sourcecode in Drizzle.

Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I’ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible memory for names). I’m only going to cover key generation on write in this post. Otherwise this post is going to be massive.

HEAP Engine

The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines HEAP has a structure for keeping Key definition (parts, type, logic and etc). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned to at table creation or when the table is opened. The assigned function depends on the data structure of the index and it can be either of the following:

BTREE

  • hp_rb_write_key()
  • hp_rb_delete_key()

HASH

  • hp_write_key()
  • hp_delete_key()

As for get_key_length(), either of the following functions are used for both data structures.

  • hp_rb_var_key_length()
  • hp_rb_null_key_length()
  • hp_rb_key_length()

When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside ‘recbuffer’ in HEAP’s handler object (HP_INFO structure).

From my understanding, it loops through the key segments (I suspect it is similar the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By meaning “appropriately” it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the max possible size of it. The final byte that is copied to the buffer is the address of the chunk where the record lives.

MyISAM Engine

The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it’s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine’s ArchiveShare class).

Like HEAP, MyISAM has a structure for individual key definition called MI_KEYDEF (it’s defined in myisam.h). There are more function pointers in this structure than HEAP.

  • bin_search()
  • get_key()
  • pack_key()
  • store_key()
  • ck_insert()
  • ck_delete()

In Drizzle, _mi_ck_write() is assigned to ck_insert() which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it will loop through the key segments and pack the relevant fields accordingly to the characteristic of the data type. The output buffer belongs to MyISAM’s hander (lastkey2).

From Here

I’ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL’s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I’m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I’ll understand how this area of the storage system works soon.


PlanetMySQL Voting: Vote UP / Vote DOWN

TEXT vs. VARCHAR

Январь 20th, 2010

On first glance, it looks like TEXT and VARCHAR can store the same information. However, there are fundamental differences between the way TEXT fields and VARCHAR fields work, which are important to take into consideration.

Standard
VARCHAR is actually part of the ISO SQL:2003 standard; The TEXT data types, including TINYTEXT, are non-standard.

Storage
TEXT data types are stored as separate objects from the tables and result sets that contain them. This storage is transparent — there is no difference in how a query involving a TEXT field is written versus one involving a VARCHAR field. Since TEXT is not stored as part of a row, retrieval of TEXT fields requires extra I/O overhead.


Maximum VARCHAR length
The maximum row length of a VARCHAR is restricted by the maximum row length of a table. This is 65,535 bytes for most storage engines (NDB has a different maximum row value). Theoretically the maximum length of a VARCHAR is 65,536 bytes. Overhead further limits the actual maximum size of a VARCHAR.

Storing the length of a VARCHAR field takes 1 byte if the VARCHAR field has a maximum length of 0-255 bytes; if it is greater than 255 bytes, the overhead to store the length is 2 bytes. If the VARCHAR field allows NULL values, that adds additional overhead — every table uses 1 byte of overhead for each set of 8 fields that allow NULL values. If the VARCHAR is the only row in the table, and does not allow NULL values, the maximum length allowed for VARCHAR is 65,532 bytes.

Keep in mind that that the number in VARCHAR(x) represents number of characters, not number of bytes. Therefore, you may have difficulties trying to define a table with only VARCHAR(65532) if the character set uses multi-byte characters, such as UTF-8.

If you attempt to define a VARCHAR value that is longer than allowed, you will run into an error such as 1118 or 1074:

ERROR 1118 (42000): Row size too large. The maximum row size for the used table type, not counting BLOBs, is 65535. You have to change some columns to TEXT or BLOBs.

ERROR 1074 (42000): Column length too big for column 'col_name' (max=[max number here]); use BLOB or TEXT instead

Maximum TEXT length
The maximum size of a TEXT data type depends on which type of TEXT data type is being used. Because they are stored as objects, the only row overhead in the table object is a pointer (8 or 16 bytes). Here is a list of the maximum TEXT length, and the overhead (in the TEXT object):

  • TINYTEXT – up to 255 bytes, 1 byte overhead

  • TEXT – up to 64 Kb, 2 bytes overhead

  • MEDIUMTEXT – up to 16 Mb, 3 bytes overhead

  • LONGTEXT – up to 4 Gb, 4 bytes overhead

DEFAULT values
MySQL does not allow TEXT data types to have a default value other than NULL. VARCHAR fields are allowed to be created with a DEFAULT value.

Conclusions
Because of the storage implications, it is preferable to use VARCHAR instead of TINYTEXT.

If you need to have a DEFAULT value that is not NULL, you must use VARCHAR (or CHAR).

If you need to store strings longer than approximately 64 Kb, use MEDIUMTEXT or LONGTEXT. VARCHAR cannot support storing values that large.

Make sure you are aware of the effects of a multi-byte character set. VARCHAR(255) stores 255 characters, which may be more than 255 bytes.


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL University: The Spider Storage Engine

Октябрь 12th, 2009

This Thursday (October 15th, 13:00 UTC), Giuseppe Maxia will present the Spider Storage Engine. Here's from the abstract: Everybody needs sharding. Which is not easy to maintain. Being tied to the application layer, sharding is hard to export and to interact with. The Spider storage engine, a plugin for MySQL 5.1 and later, solves the problem in a transparent way. It is an extension of partitioning. Using this engine, the user can deal transparently with multiple backends in the server layer. This means that the data is accessible from any application without code changes. This lecture will briefly introduce MySQL partitioning, and then show how to create and use the Spider engine, with some practical examples. The talk covers the latest version of the Spider engine, which includes a condition pushdown feature that increases performance significantly.

For MySQL University sessions, point your browser to this page. You need a browser with a working Flash plugin. You may register for a Dimdim account, but you don't have to. (Dimdim is the conferencing system we're using for MySQL University sessions. It provides integrated voice streaming, chat, whiteboard, session recording, and more.) All MySQL University sessions are recorded, that is, slides and voice can be viewed as a Flash movie (.flv). You can find those recordings on the respective MySQL University session pages which are listed on the MySQL University home page.

MySQL University is a free educational online program for engineers/developers. MySQL University sessions are open to anyone, not just Sun employees. Sessions are recorded (slides and audio), so if you can't attend the live session you can look at the recording anytime after the session.

Here's the schedule for the upcoming weeks:

  • October 22: Dual Master Setups With MMM (Arjen Lentz)
  • October 29: MySQL scalability on SPARC & INTEL X5500 (Nehalem) (Benoit Chaffanjon)
  • November 12: Gearman for MySQL (Giuseppe Maxia)
  • November 19: memcached Functions for MySQL (UDFs) (Patrick Galbraith)
  • December 3: Practical Full-Text Search in MySQL (Bill Karwin)

The schedule is not engraved in stone at this point. Please visit http://forge.mysql.com/wiki/MySQL_University#Upcoming_Sessions for the up-to-date list. On that page, you can also find the starting time for many time zones.


PlanetMySQL Voting: Vote UP / Vote DOWN

Storage Miniconf Deadline Extended!

Сентябрь 30th, 2009

The linux.conf.au organisers have given all miniconfs an additional few weeks to spruik for more proposal submissions, huzzah!

So if you didn’t submit a proposal because you weren’t sure whether you’d be able to attend LCA2010, you now have until October 23 to convince your boss to send you and get your proposal in.


PlanetMySQL Voting: Vote UP / Vote DOWN

Storage Miniconf Deadline Extended!

Сентябрь 30th, 2009

The linux.conf.au organisers have given all miniconfs an additional few weeks to spruik for more proposal submissions, huzzah!

So if you didn’t submit a proposal because you weren’t sure whether you’d be able to attend LCA2010, you now have until October 23 to convince your boss to send you and get your proposal in.


PlanetMySQL Voting: Vote UP / Vote DOWN