Archive for the ‘indexing’ Category

Databases: Normalization or Denormalization. Which is the better technique?

Август 30th, 2010

This has really been a long debate as to which approach is more performance orientated, normalized databases or denormalized databases. So this article is a step on my part to figure out the right strategy, because neither one of these approaches can be rejected outright. I will start of by discussing the pros and cons of both the approaches.

Pros and Cons of a Normalized database design.

Normalized databases fair very well under conditions where the applications are write-intensive and the write-load is more than the read-load. This is because of the following reasons:

  • Normalized tables are usually smaller and have a smaller foot-print because the data is divided vertically among many tables. This allows them to perform better as they are small enough to get fit into the buffer.
  • The updates are very fast because the data to be updated is located at a single place and there are no duplicates.
  • Similarly the inserts are very fast because the data has to be inserted at a single place and does not have to be duplicated.
  • The selects are fast in cases where data has to be fetched from a single table, because normally normalized tables are small enough to get fit into the buffer.
  • Because the data is not duplicated so there is less need for heavy duty group by or distinct queries.

Although there seems to be much in favor of normalized tables, with all the cons outlined above, but the main cause of concern with fully normalized tables is that normalized data means joins between tables. And this joining means that read operations have to suffer because indexing strategies do not go well with table joins.

Now lets have a look at the pros and cons of a denormalized database design.

Pros and cons of denormalized database design.

Denormalized databases fair well under heavy read-load and when the application is read intensive. This is because of the following reasons:

  • The data is present in the same table so there is no need for any joins, hence the selects are very fast.
  • A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, then results can be filtered and sorted by utilizing the same index. While in the case of a normalized table, since the data would be spread out in different tables, this would not be possible.

Although for reasons mentioned above selects can be very fast on denormalized tables, but because the data is duplicated, the updates and inserts become complex and costly.

Having said that neither one of the approach can be entirely neglected, because a real world application is going to have both read-loads and write-loads. Hence the correct way would be to utilize both the normalized and denormalized approaches depending on situations.

Using normalized and denormalized approaches together.

The most common way of mixing denormalized and normalized approaches is to duplicate related columns from one table into another table. Let me show you by example:

Suppose you have a products table and an orders table.
The normalized approach would be to only have the product_id in the orders table and all the other product related information in the products table.

But that would make the query that filters by product_name and sorts by order_date inefficient because both are stored in different tables.

In a fully normalized schema, such a query would be performed in the following manner:

SELECT product_name, order_date
FROM orders INNER JOIN products USING(product_id)
WHERE product_name like 'A%'
ORDER by order_date DESC

As you can see MySQL here will have to scan the order_date index on the orders table and then compare the corresponding product_name in the products table to see if the name starts with A.

The above query can be drastically improved by denormalizing the schema a little bit, so that the orders table now includes the product_name column as well.

SELECT product_name, order_date
FROM orders
WHERE product_name like 'A%'
ORDER by order_date DESC

See how the query has become much simpler, there is no join now and a single index on columns product_name, order_date can be used to do the filtering as well as the sorting.

So can both the techniques be used together? Yes they can be, because real word applications have a mix of read and write loads.

Final words.

Although, denormalized schema can greatly improve performance under extreme read-loads but the updates and inserts become complex as the data is duplicate and hence has to be updated/inserted in more than one places.

One clean way to go about solving this problem is through the use of triggers. For example in our case where the orders table has the product_name column as well, when the value of product_name has to be updated, then it can simply be done in the following way:

  • Have a trigger setup on the products table that updates the product_name on any update to the products table.
  • Execute the update query on the products table. The data would automatically be updated in the orders table because of the trigger.

However, when denormalizing the schema, do take into consideration, the number of times you would be updating records compared to the number of times you would be executing SELECTs. When mixing normalization and denormalization, focus on denormalizing tables that are read intensive, while tables that are write intensive keep them normalized.


PlanetMySQL Voting: Vote UP / Vote DOWN

Table refactoring & application version upgrades, Part II

Август 12th, 2010

Continuing Table refactoring & application version upgrades, Part I, we now discuss code & database upgrades which require DROP operations. As before, we break apart the upgrade process into sequential steps, each involving either the application or the database, but not both.

As I’ll show, DROP operations are significantly simpler than creation operations. Interestingly, it the same as in life.

DROP COLUMN

A column turns to be redundant, unused. Before it is dropped from the database, we must ensure no one is using it anymore. The steps are:

  1. App: V1 -> V2. Remove all references to column; make sure no queries use said column.
  2. DB: V1 -> V2 (possibly failover from M1 to M2), change is DROP COLUMN.

DROP INDEX

A possibly simpler case here. Why would you drop an index? Is it because you found out you never use it anymore? Then all you have to do is just drop it.

Or perhaps you don’t need the functionality the index supports anymore? Then first drop the functionality:

  1. (optional) App: V1 -> V2. Discard using functionality which relies on index.
  2. DB: V1 -> V2 (possibly failover from M1 to M2), change is DROP INDEX. Check out InnoDB Plugin here.

DROP UNIQUE INDEX

When using Master-Slave failover for table refactoring, we’re now removing a constraint from the slave. Since the master is more constrained than the slave, there is no problem here. It’s mostly the same as with a normal DROP INDEX, with a minor addition:

  1. (optional) App: V1 -> V2. Discard using functionality which relies on index.
  2. DB: V1 -> V2 (possibly failover from M1 to M2), change is DROP INDEX.
  3. (optional) App: V2 -> V3. Enable functionality that inserts duplicates.

DROP FOREIGN KEY

Again, we are removing a constraint.

  1. DB: V1 -> V2 (possibly failover from M1 to M2), change is DROP INDEX.
  2. (optional) App: V2 -> V3. Enable functionality that conflicts with removed constraint. I mean, if you really know what you are doing.

DROP TABLE

The very simple steps are:

  1. App: V1 -> V2. Make sure no reference to table is made.
  2. DB: V1 -> V2. Issue a DROP TABLE.

With ext3 dropping a large table is no less than a nightmare. Not only does the action take long time, it also locks down the table cache, which very quickly leads to having dozens of queries hang. xfs is a good alternative.

Conclusion

We looked at single table operations, coupled with application upgrades. By carefully looking at the process breakdown, multiple changes can be addressed with ease and safety. Not all operations are completely safe when used with replication failover. But they are mostly safe if you have some trust in your code.


PlanetMySQL Voting: Vote UP / Vote DOWN

Table refactoring & application version upgrades, Part I

Август 10th, 2010

A developer’s major concern is: How do I do application & database upgrades with minimal downtime? How do I synchronize between a DB’s version upgrade and an application’s version upgrade?

I will break down the discussion into types of database refactoring operations, and I will limit to single table refactoring. The discussion will try to understand the need for refactoring and will dictate the steps towards a successful upgrade.

Reader prerequisites

I will assume MySQL to be the underlying database. To take a major component out of the equation: we may need to deal with very large tables, for which an ALTER command may take long hours. I will assume familiarity with Master-Master (Active-Passive) replication, with possible use of MMM for MySQL. When I describe “Failover from M1 to M2“, I mean “Make the ALTER changes on M2 (passive), then switch your application from M1 to M2 (change of IPs, VIP, etc.), promoting M2 to active position, then apply same changes on M1 (now passive) or completely rebuild it”.

Phew, a one sentence description of M-M usage…

I also assume the reader’s understanding that a table’s schema can be different on master & slave, which is the basis for the “use replication for refactoring” trick. But it cannot be too different, or, to be precise, the two schemata must both support the ongoing queries for the table.

A full discussion of the above is beyond the scope of this post.

Types of refactoring needs

As I limit this discussion to single table refactoring,we can look at major refactoring operations and their impact on application & upgrades. We will discuss ADD/DROP COLUMN, ADD/DROP INDEX, ADD/DROP UNIQUE INDEX, ADD/DROP FOREIGN KEY, ADD/DROP TABLE.

We will assume the database and application are both in Version #1 (V1), and need to be upgraded to V2 or greater.

ADD INDEX

Starting with the easier actions. Why would you add an index? Either:

  1. There is some existing query which can be optimized by the new query
  2. Or there is some new functionality which issues a query for which the new index is required.

Adding an index is an easy action in that the table’s data does not really change.

In case #1, all you need to do is to add the new index (if the table is large, fail over from M1 to M2). There is no application upgrade, so all that happens is that the database upgrades V1 -> V2.

In case #2, the database must be prepared with new schema before the new functionality/query is introduced (since it depends on the existence of the index). The steps, therefore, are:

  1. DB: V1 -> V2 (possibly failover from M1 to M2)
  2. (Sometime later) App: V1 -> V2. Application will issue queries which utilize the new index.

The application does not have to be upgraded at the same instant the DB gets upgraded. In fact, we’ll see that this is a typical scenario: we can separate upgrades into smaller steps, which allow for time lapse. One could work out steps 1 & 2 together, but that would take an extra effort.

ADD COLUMN

This must be one of the most common table schema upgrades: a new property is needed on the application side. It must be supported by the database. Perhaps a new field in some Java Object, with Hibernate mapping that field onto a new column. Or maybe the new column is there for purpose of de-normalization.

This is also a more complicated task. Let’s look at the required steps:

  1. DB: V1 -> V2 (possibly failover from M1 to M2), change is ADD COLUMN.
  2. App: V1 -> V2. Change is: provide column value for newly INSERTed rows.
  3. If needed, retroactively update column values for all pre-existing rows.
  4. App: V2 -> V3. Application begins to use (read, SELECT) new column.

The above procedure assumes that the new column must have some calculated value. A 10-million rows table must now be updated, to have the correct values filled in. So we ask of the application to start filling in data for new rows, which makes the invalid row set static. We can just take a “from row” and a “to row” and fill in the missing column’s value for those rows. Only when all rows contain valid values can we let the application start using that row. This makes for two application upgrades.

If you’re content with just a static DEFAULT value, then step 3 can be skipped, and step 4 can be merged with step 2.

ADD UNIQUE INDEX

This is an altogether different case than the normal ADD INDEX, even though they may seem similar. And the case is particularly different when using Master-Slave failover for rebuilding the table.

Consider the case where we add a UNIQUE INDEX on a slave. Some INSERT query executes on the master, successfully, and is logged to the binary log. The slave picks it up, tries to execute it, to find that it fails on a DUPLICATE KEY error.

The UNIQUE INDEX is a constraint, and it makes the slave more constrained than the master. This is a delicate situation. Here how to (mostly) work it out:

  1. App: V1 -> V2. Change INSERT queries on relevant table to INSERT IGNORE or REPLACE queries, whichever is more appropriate.
  2. DB: V1 -> V2 (possibly failover from M1 to M2), change is ADD UNIQUE KEY (and while at it, a tip: are you aware of ALTER IGNORE TABLE?)

The change of query ensures that the query will succeed on the slave (either by silently doing nothing or by actually replacing content). It also means that the slave can now have different data than the master. Of course, it you trust your application to never INSERT duplicates, you can sleep better.

We do not handle UPDATE statements here.

ADD CONSTRAINT FOREIGN KEY

As with ADD UNIQUE INDEX, there is a new constraint here. A slave becomes more constrained than the master. But we now have to make sure INSERT, UPDATE and DELETE statements all go peacefully (well, it also depends on the type of ON DELETE and ON UPDATE property of the FK).

The steps would be:

  1. DB: V1 -> V2 (possibly failover from M1 to M2), change is ADD CONSTRAINT FOREIGN KEY.

And then cross your fingers or have trust in your application. If the table is small enough, one does not have to use replication to do the refactoring, and life is simpler. Just execute the ALTER on the active master, and continue with your life.

CREATE TABLE

This is a simple case, since the table is new. The steps are:

  1. DB: V1 -> V2 (no need to use slaves here)
  2. App: V1 -> V2. Application will start using new table.

Conslusion

Having such steps formalized help with development management and database management. It makes clear what is expected of the application, and what is expected of the database. The breaking down of these operations into sequential steps allows us to work more slowly; make preparation work; work within our own working hours; get a chance to see the family.

In this post we took a look at “creation” refactoring changes. New columns, new keys, new constraints. In the next part of this article, we’ll discuss DROP operations.


PlanetMySQL Voting: Vote UP / Vote DOWN

SQL: forcing single row tables integrity

Июнь 22nd, 2010

Single row tables are used in various cases. Such tables can be used for “preferences” or “settings”; for managing counters (e.g. summary tables), for general-purpose administration tasks (e.g. heartbeat table) etc.

The problem with single row tables is that, well, they must have s single row. And the question is: how can you force them to have just one row?

The half-baked solution

The common solution is to create a PRIMARY KEY and always use the same value for that key. In addition, using REPLACE or INSERT INTO ON DUPLICATE KEY UPDATE helps out in updating the row. For example:

CREATE TABLE heartbeat (
 id int NOT NULL PRIMARY KEY,
 ts datetime NOT NULL
 );

The above table definition is taken from mk-heartbeat. It should be noted that mk-heartbeat in itself does not require that the table has a single row, so it is not the target of this post. I’m taking the above table definition as a very simple example.

So, we assume we want this table to have a single row, for whatever reasons we have. We would usually do:

REPLACE INTO heartbeat (id, ts) VALUES (1, NOW());

or

INSERT INTO heartbeat (id, ts) VALUES (1, NOW()) ON DUPLICATE KEY UPDATE ts = NOW();

Why is the above a “half baked solution”? Because it is up to the application to make sure it reuses the same PRIMARY KEY value. There is nothing in the database to prevent the following:

REPLACE INTO heartbeat (id, ts) VALUES (73, NOW()); -- Ooops

One may claim that “my application has good integrity”. That may be the case; but I would then raise the question: why, then, would you need FOREIGN KEYs? Of course, many people don’t use FOREIGN KEYs, but I think the message is clear.

A heavyweight solution

Triggers can help out. But really, this is an overkill.

A solution

I purpose a solution where, much like FOREIGN KEYs, the database will force the integrity of the table; namely, have it contain at most one row.

For this solution to work, we will need a strict sql_mode. I’ll show later what happens when using a relaxed sql_mode:

SET sql_mode='STRICT_ALL_TABLES'; -- Session scope for the purpose of this article

Here’s a new table definition:

CREATE TABLE heartbeat (
 integrity_keeper ENUM('') NOT NULL PRIMARY KEY,
 ts datetime NOT NULL
);

Let’s see what happens now:

mysql> INSERT INTO heartbeat (ts) VALUES (NOW());
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO heartbeat (ts) VALUES (NOW());
ERROR 1062 (23000): Duplicate entry '' for key 'PRIMARY'
mysql> INSERT INTO heartbeat (integrity_keeper, ts) VALUES ('', NOW());
ERROR 1062 (23000): Duplicate entry '' for key 'PRIMARY'
mysql> INSERT INTO heartbeat (integrity_keeper, ts) VALUES (0, NOW());
ERROR 1265 (01000): Data truncated for column 'integrity_keeper' at row 1
mysql> INSERT INTO heartbeat (integrity_keeper, ts) VALUES (1, NOW());
ERROR 1062 (23000): Duplicate entry '' for key 'PRIMARY'

mysql> REPLACE INTO heartbeat (ts) VALUES (NOW());
Query OK, 2 rows affected (0.00 sec)

mysql> INSERT INTO heartbeat (ts) VALUES (NOW()) ON DUPLICATE KEY UPDATE ts = NOW();
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM heartbeat;
+------------------+---------------------+
| integrity_keeper | ts                  |
+------------------+---------------------+
|                  | 2010-06-15 09:12:19 |
+------------------+---------------------+

So the trick is to create a PRIMARY KEY column which is only allowed a single value.

The above shows I cannot force another row into the table: the schema will prevent me from doing so. Mission accomplished.

Further thoughts

The CHECK keyword is the real solution to this problem (and other problems). However, it is ignored by MySQL.

It is interesting to note that with a relaxed sql_mode, the INSERT INTO heartbeat (integrity_keeper, ts) VALUES (0, NOW()); query succeeds. Why? The default ENUM value is 1, and, being in relaxed mode, 0 is allowed in, even though it is not a valid value (Argh!).


PlanetMySQL Voting: Vote UP / Vote DOWN

Reducing locks by narrowing primary key

Май 4th, 2010

In a period of two weeks, I had two cases with the exact same symptoms.

Database users were experiencing low responsiveness. DBAs were seeing locks occurring on seemingly normal tables. In particular, looking at Innotop, it seemed that INSERTs were causing the locks.

In both cases, tables were InnoDB. In both cases, there was a PRIMARY KEY on the combination of all 5 columns. And in both cases, there was no clear explanation as for why the PRIMARY KEY was chosen as such.

Choosing a proper PRIMARY KEY

Especially with InnoDB, which uses clustered index structure, the PRIMARY KEY is of particular importance. Besides the fact that a bloated PRIMARY KEY bloats the entire clustered index and secondary keys (see: The depth of an index: primer), it is also a source for locks. It’s true that any UNIQUE KEY can serve as a PRIMARY KEY. But not all such keys are good candidates.

Reducing the locks

In both described cases, the solution was to add an AUTO_INCREMENT column to serve as the PRIMARY KEY, and have that 5 column combination under a secondary UNIQUE KEY. The impact was immediate: no further locks on that table were detected, and query responsiveness turned very high.


PlanetMySQL Voting: Vote UP / Vote DOWN

Monotonic functions, SQL and MySQL

Февраль 9th, 2010

In mathematics, a monotonic function (or monotone function) is a function which preserves the given order. [Wikipedia]

To be more precise, a function f is monotonic increasing, if for every x ≤ y it holds that f(x) ≤ f(y). f is said to be strictly monotonic increasing is for every x < y it holds that f(x) < f(y).

So, if we follow values in some order, we say that f is monotonic increasing if f’s value never decreases (it either increases or stays the same), and we say that f is strictly increasing if f’s value is always changes “upwards”.

Monotonic functions play an important role in SQL. To discuss monotonic functions in SQL we must first determine what the order is, and then, what the function is.

Well, they both change according to our point of view. Let’s look at some examples. Take a look at the following table:

CREATE TABLE `log` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `error_level` tinyint(4) DEFAULT NULL,
 `subject` varchar(32) DEFAULT NULL,
 `description` varchar(255) DEFAULT NULL,
 PRIMARY KEY (`id`)
)

In the above log table, log entries are added with id and ts getting automatically evaluated. Assuming no dirty hacks occur, we can expect that ts in monotonic by order of id. That is, as id increases, so does ts. Is is possible that we get the same ts for a few rows (it is not unique), but once it increases, it never decreases again.

Why is this interesting?

Because it simplifies common problems.

For example, it simplifies a search for a given ts value, when no index exists on the ts column. If we were to look for a log entry from ‘2009-02-07 11:58:00′ by simple SELECT, we would have to use a full table scan. But, by knowing that ts is monotonic, we can also use binary search on id.

As another example, it simplifies the task of purging all rows up to last midnight. Instead of issuing “DELETE FROM log WHERE ts < DATE(NOW())”, thus using, again, full table scan plus locking all rows (depending on storage engine), we can use other methods:

  • We can detect the id for the first row which holds the condition using binary search, then “DELETE FROM log WHERE id < ###”
  • Or we can slowly work our way in ascending id order, issuing something like “DELETE FROM log WHERE ts < DATE(NOW()) ORDER BY id ASC LIMIT 1000″, and stop once the ROW_COUNT() is less than 1000. We know we need not advance any further, without needing to compute anything. We thus block less, while retaining correctness of our operation.

Monotonic functions in MySQL

When we iterate InnoDB tables (as in full table scan), we know that rows are iterated in ascending PRIMARY KEY order [¹]. So the PRIMARY KEY dictates the order by which monotonic functions are evaluated.

With MyISAM, rows are iterated according to internal storage order. It has nothing to do with PRIMARY KEYs (though depending on concurrent_insert this can be somewhat controlled). It also has nothing to do with chronological order. Newer rows may capture space held by older rows.

But MyISAM allows for ALTER TABLE … ORDER BY … syntax, which allows us to do a one-time sort of the table. Assuming no writes shortly thereafter, a full table scan will iterate the rows according to specified order.

Monotonic functions and indexes

A column which is indexed dictates a monotonic function by index order.

Wait. Isn’t that obvious? Of course: if we index a column, then the index sorts by that column, and the column is ascending by the index order which is,… itself.

I call that trivial, but it does interest us: because, while mathematically there may be nothing significant here, we do care about this order when we have index scans. So, if we can force an index scan on our query, then we can anticipate the order by which rows are processed; we now have some order by which to evaluate monotonic functions.

OK, maybe I made it sound more complicated than it really is. Monotonic functions work well when the order by which they are monotonic is some indexed column(s). The AUTO_INCREMENT PRIMARY KEY we saw in the log example above, it perhaps the most trivial case.

While MySQL does not support function indexes, if the functions we consider is monotonic, we still benefit from adding an index on the raw column.

Other examples of monotonic functions

So, where else can we find them? Timestamp columns are probably the most common (this post holds true until time travel to the past is introduced).

But also summaries: like a reporting table which lists down some ever-ascending value (the number of books ever sold in our store; trip mileage; hop counter; etc.).

I’ve seen many cases (though difficult to illustrate in this scope) when foreign key values are in ascending order. A very brief example is a 1-1 relation between two denormalized tables, where the tables ids do not necessarily have to match, but is always ascending).

And Baron’s wishlist for SQL can also benefit from monotonic functions.

Conclusion

When a monotonic function is present, it brings an added value to our schema and query design. It allows for less indexing; quicker operations. Look for these. I’ve only discussed increasing functions. Indeed, MySQL’s indexes are always increasing (they cannot be defined in decreasing order), but query simplifications work just as well for monotonic decreasing functions.

[¹] I’ve actually seen a different behavior on temporary InnoDB tables and on compressed InnoDB Plugin tables; I’ll write on this on another occasion.


PlanetMySQL Voting: Vote UP / Vote DOWN

Beware of implicit casting

Февраль 2nd, 2010

Ever so often a query provides a “bad” execution plan. Adding a missing index can many times solve the problem. However, not everything can be solved with an index. I wish to highlight the point of having an implicit cast, which negates the use of an index on MySQL.

I see this happening a lot on customers’ databases, and this begs for a short introduction.

MySQL doesn’t support index functions

Let’s assume the following table:

CREATE TABLE `person` (
 `id` bigint(20) NOT NULL AUTO_INCREMENT,
 `first_name` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
 `last_name` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
 `driver_license_registration` bigint(20) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `last_name` (`last_name`),
 KEY `driver_license_registration` (`driver_license_registration`)
)

And suppose we’re looking for persons whose last name begin with “Smith”. The following query will NOT utilize an index:

SELECT * FROM person WHERE LEFT(last_name, 5) = 'Smith';

Why not? Because we use a function (LEFT) over the last_name column, and this makes MySQL unable to deduce the outcome of the expression. Will it be ordered in the same way as the original column? (in this case, the answer is yes, LEFT is a monotonic function) Is it possible to produce the reverse function? (In this case, no). MySQL cannot and does not handle these questions and therefore avoids using the index on last_name altogether:

EXPLAIN SELECT * FROM person WHERE LEFT(last_name, 5) = 'Smith';
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | person | ALL  | NULL          | NULL | NULL    | NULL | 1000 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+

To solve the problem above, a simple change to the query will do the trick:

EXPLAIN SELECT * FROM person WHERE last_name LIKE 'Smith%';
+----+-------------+--------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key       | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+-----------+---------+------+------+-------------+
|  1 | SIMPLE      | person | range | last_name     | last_name | 99      | NULL |    5 | Using where |
+----+-------------+--------+-------+---------------+-----------+---------+------+------+-------------+

Not all functions are explicit

This leads us to the subject of this post. What if we were to look for persons whose driver license registration number begins with ‘123′? Trying to learn from the above example, we write:

SELECT * FROM person WHERE driver_license_registration LIKE '123%';

But the above query does NOT utilize an index! See:

EXPLAIN SELECT * FROM person WHERE driver_license_registration LIKE '123%';
+----+-------------+--------+------+-----------------------------+------+---------+------+------+-------------+
| id | select_type | table  | type | possible_keys               | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------+------+-----------------------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | person | ALL  | driver_license_registration | NULL | NULL    | NULL | 1000 | Using where |
+----+-------------+--------+------+-----------------------------+------+---------+------+------+-------------+

Why not?

Because the driver_license_registration column is an integer. The query asks for a string comparison (LIKE ‘123%’). This kind of comparison cannot be performed on integers. The column value will have to transform to text in order to work out the query. Effectively, it’s as if we wrote:

SELECT * FROM person WHERE CAST(driver_license_registration AS CHAR) LIKE '123%';

There’s no immediate solution to this last query. Perhaps the column should be textual after all, perhaps maintain a “ghost” column. If the number of digits in a number if fixed an known, then we can convert the above to a range query.

Casting is an implicit function

And, alas, it’s a well hidden function. There’s nothing to suggest a problem here. This is SQL, not Java. It’s not a strongly typed language. I see the above example quite a lot. Another common casting mistake is comparing Timestamp values to strings.

So, until MySQL supports index functions, watch out for these nuances.


PlanetMySQL Voting: Vote UP / Vote DOWN

How (not) to find unused indexes

Октябрь 16th, 2009

I've seen a few people link to an INFORMATION_SCHEMA query to be able to find any indexes that have low cardinality, in an effort to find out what indexes should be removed.  This method is flawed - here's the first reason why:

SQL:
  1. CREATE TABLE `sales` (
  2. `id` int(11) NOT NULL AUTO_INCREMENT,
  3. `customer_id` int(11) DEFAULT NULL,
  4. `status` enum('arhicved','active') DEFAULT NULL,
  5. PRIMARY KEY (`id`),
  6. KEY `status` (`status`)
  7. ) ENGINE=MyISAM AUTO_INCREMENT=65691 DEFAULT CHARSET=latin1;
  8.  
  9. mysql&gt; SELECT count(*), STATUS FROM sales GROUP BY STATUS;
  10. +----------+---------+
  11. | count(*) | STATUS  |
  12. +----------+---------+
  13. |    65536 | archived |
  14. |      154 | active  |
  15. +----------+---------+
  16. 2 rows IN SET (0.17 sec)
  17.  
  18. mysql&gt; EXPLAIN SELECT * FROM sales WHERE STATUS='active'; # query 1
  19. +----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
  20. | id | select_type | TABLE | type | possible_keys | KEY    | key_len | ref   | rows | Extra       |
  21. +----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
  22. 1 | SIMPLE      | sales | ref  | STATUS        | STATUS | 2       | const |  196 | USING WHERE |
  23. +----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
  24. 1 row IN SET (0.06 sec)
  25.  
  26. mysql&gt; EXPLAIN SELECT * FROM sales WHERE STATUS='archived'; # query 2
  27. +----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
  28. | id | select_type | TABLE | type | possible_keys | KEY  | key_len | ref  | rows  | Extra       |
  29. +----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
  30. 1 | SIMPLE      | sales | ALL  | STATUS        | NULL | NULL    | NULL | 65690 | USING WHERE |
  31. +----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
  32. 1 row IN SET (0.01 sec)

The cardinality of status index is woeful, but provided that the application is always only sending query 1 to MySQL it's actually a pretty good index!  It's not always like this, but there are a lot of cases where applications have good selectivity with some queries despite what cardinality shows.

Not convinced?  Here's reason number two:

SQL:
  1. CREATE TABLE `Country` (
  2. `Code` char(3) NOT NULL DEFAULT '',
  3. `Name` char(52) NOT NULL DEFAULT '',
  4. `Continent` enum('Asia','Europe','North America','Africa','Oceania','Antarctica','South America') NOT NULL DEFAULT 'Asia',
  5. `Region` char(26) NOT NULL DEFAULT '',
  6. `SurfaceArea` float(10,2) NOT NULL DEFAULT '0.00',
  7. `IndepYear` smallint(6) DEFAULT NULL,
  8. `Population` int(11) NOT NULL DEFAULT '0',
  9. `LifeExpectancy` float(3,1) DEFAULT NULL,
  10. `GNP` float(10,2) DEFAULT NULL,
  11. `GNPOld` float(10,2) DEFAULT NULL,
  12. `LocalName` char(45) NOT NULL DEFAULT '',
  13. `GovernmentForm` char(45) NOT NULL DEFAULT '',
  14. `HeadOfState` char(60) DEFAULT NULL,
  15. `Capital` int(11) DEFAULT NULL,
  16. `Code2` char(2) NOT NULL DEFAULT '',
  17. PRIMARY KEY (`Code`),
  18. KEY `Population` (`Population`)
  19. ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  20.  
  21. mysql&gt; SELECT count(*) FROM Country;
  22. +----------+
  23. | count(*) |
  24. +----------+
  25. |      239 |
  26. +----------+
  27. 1 row IN SET (0.00 sec)
  28.  
  29. mysql&gt; SELECT count(DISTINCT(population)) FROM Country;
  30. +-----------------------------+
  31. | count(DISTINCT(population)) |
  32. +-----------------------------+
  33. |                         226 |
  34. +-----------------------------+
  35. 1 row IN SET (0.05 sec)
  36.  
  37. mysql&gt; EXPLAIN SELECT * FROM country WHERE population&gt; 1000; # query 3
  38. +----+-------------+---------+------+---------------+------+---------+------+------+-------------+
  39. | id | select_type | TABLE   | type | possible_keys | KEY  | key_len | ref  | rows | Extra       |
  40. +----+-------------+---------+------+---------------+------+---------+------+------+-------------+
  41. 1 | SIMPLE      | country | ALL  | Population    | NULL | NULL    | NULL239 | USING WHERE |
  42. +----+-------------+---------+------+---------------+------+---------+------+------+-------------+
  43. 1 row IN SET (0.04 sec)
  44.  
  45. mysql&gt; EXPLAIN SELECT * FROM country WHERE population&gt; 100000000; # query 4
  46. +----+-------------+---------+-------+---------------+------------+---------+------+------+-------------+
  47. | id | select_type | TABLE   | type  | possible_keys | KEY        | key_len | ref  | rows | Extra       |
  48. +----+-------------+---------+-------+---------------+------------+---------+------+------+-------------+
  49. 1 | SIMPLE      | country | range | Population    | Population | 4       | NULL |   23 | USING WHERE |
  50. +----+-------------+---------+-------+---------------+------------+---------+------+------+-------------+
  51. 1 row IN SET (0.00 sec)

The index on query 3 had high cardinality but should not be used since too many countries have a population greater than 1000.  An automated search for low cardinality indexes wouldn't have revealed it's uselessness.  For range scans, it's very easy to lead yourself into a trap where your index can not filter out enough rows to be effective.  I see this a lot in consulting issues where customers have queries that use a BETWEEN on a date, but the window of time it is searching in is too wide.

Side Note: In some texts you'll see people quote the numbers "20-30%" as the minimum amount of rows you have to filter down to for an index to be useful (that is, eliminate 70-80% of rows).  It's not quite correct to quote this as an exact percentage, since this value is not fixed in MySQL and can be a much wider window (15-60%) depending on the circumstances.  In this case, MySQL flipped from tablescan to index at about 34%.

How am I supposed to find unused indexes then?
You really have to run queries against your server - there is no other way.  From there, there's a helpful patch in 5.0-percona called INDEX_STATISTICS that can then show you which indexes were touched and which were not.

If you are not running a patched server, then the alternative is to either use a proxy that checks EXPLAIN information (like QUAN) or set your slow query log to zero microseconds (5.1 feature) and then find someway to parse and EXPLAIN all results, then subtract the indexes that were mentioned from all indexes known.  There's an old tool called mysqlidxchx which should be able to do this.


Entry posted by Morgan Tocker | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Comparison Between Solr And Sphinx Search Servers (Solr Vs Sphinx – Fight!)

Сентябрь 3rd, 2009

In the past few weeks I've been implementing advanced search at Plaxo, working quite closely with Solr enterprise search server. Today, I saw this relatively detailed comparison between Solr and its main competitor Sphinx (full credit goes to StackOverflow user mausch who had been using Solr for the past 2 years). For those still confused, Solr and Sphinx are similar to MySQL FULLTEXT search, or for those even more confused, think Google (yeah, this is a bit of a stretch, I know).

Similarities

  • Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.
  • Both have a long list of high-traffic sites using them (Solr, Sphinx)
  • Both offer commercial support. (Solr, Sphinx)
  • Both offer client API bindings for several platforms/languages (Sphinx, Solr)
  • Both can be distributed to increase speed and capacity (Sphinx, Solr)

Here are some differences

Related questions

Conclusion

In my experience, Solr is very-very fast on the query side. It is also very powerful. The indexing side is very CPU and memory intensive and is an unfortunate side effect of having such a feature-rich, fast application. Nevertheless, I highly recommend Solr.

For disclaimer purposes, I have not had much experience with Sphinx and, again, all credit for this comparison goes to mausch.

 
Similar Posts:Share/Bookmark
PlanetMySQL Voting: Vote UP / Vote DOWN