Archive for the ‘indexing’ Category

Monotonic functions, SQL and MySQL

Февраль 9th, 2010

In mathematics, a monotonic function (or monotone function) is a function which preserves the given order. [Wikipedia]

To be more precise, a function f is monotonic increasing, if for every x ≤ y it holds that f(x) ≤ f(y). f is said to be strictly monotonic increasing is for every x < y it holds that f(x) < f(y).

So, if we follow values in some order, we say that f is monotonic increasing if f’s value never decreases (it either increases or stays the same), and we say that f is strictly increasing if f’s value is always changes “upwards”.

Monotonic functions play an important role in SQL. To discuss monotonic functions in SQL we must first determine what the order is, and then, what the function is.

Well, they both change according to our point of view. Let’s look at some examples. Take a look at the following table:

CREATE TABLE `log` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `error_level` tinyint(4) DEFAULT NULL,
 `subject` varchar(32) DEFAULT NULL,
 `description` varchar(255) DEFAULT NULL,
 PRIMARY KEY (`id`)
)

In the above log table, log entries are added with id and ts getting automatically evaluated. Assuming no dirty hacks occur, we can expect that ts in monotonic by order of id. That is, as id increases, so does ts. Is is possible that we get the same ts for a few rows (it is not unique), but once it increases, it never decreases again.

Why is this interesting?

Because it simplifies common problems.

For example, it simplifies a search for a given ts value, when no index exists on the ts column. If we were to look for a log entry from ‘2009-02-07 11:58:00′ by simple SELECT, we would have to use a full table scan. But, by knowing that ts is monotonic, we can also use binary search on id.

As another example, it simplifies the task of purging all rows up to last midnight. Instead of issuing “DELETE FROM log WHERE ts < DATE(NOW())”, thus using, again, full table scan plus locking all rows (depending on storage engine), we can use other methods:

  • We can detect the id for the first row which holds the condition using binary search, then “DELETE FROM log WHERE id < ###”
  • Or we can slowly work our way in ascending id order, issuing something like “DELETE FROM log WHERE ts < DATE(NOW()) ORDER BY id ASC LIMIT 1000″, and stop once the ROW_COUNT() is less than 1000. We know we need not advance any further, without needing to compute anything. We thus block less, while retaining correctness of our operation.

Monotonic functions in MySQL

When we iterate InnoDB tables (as in full table scan), we know that rows are iterated in ascending PRIMARY KEY order [¹]. So the PRIMARY KEY dictates the order by which monotonic functions are evaluated.

With MyISAM, rows are iterated according to internal storage order. It has nothing to do with PRIMARY KEYs (though depending on concurrent_insert this can be somewhat controlled). It also has nothing to do with chronological order. Newer rows may capture space held by older rows.

But MyISAM allows for ALTER TABLE … ORDER BY … syntax, which allows us to do a one-time sort of the table. Assuming no writes shortly thereafter, a full table scan will iterate the rows according to specified order.

Monotonic functions and indexes

A column which is indexed dictates a monotonic function by index order.

Wait. Isn’t that obvious? Of course: if we index a column, then the index sorts by that column, and the column is ascending by the index order which is,… itself.

I call that trivial, but it does interest us: because, while mathematically there may be nothing significant here, we do care about this order when we have index scans. So, if we can force an index scan on our query, then we can anticipate the order by which rows are processed; we now have some order by which to evaluate monotonic functions.

OK, maybe I made it sound more complicated than it really is. Monotonic functions work well when the order by which they are monotonic is some indexed column(s). The AUTO_INCREMENT PRIMARY KEY we saw in the log example above, it perhaps the most trivial case.

While MySQL does not support function indexes, if the functions we consider is monotonic, we still benefit from adding an index on the raw column.

Other examples of monotonic functions

So, where else can we find them? Timestamp columns are probably the most common (this post holds true until time travel to the past is introduced).

But also summaries: like a reporting table which lists down some ever-ascending value (the number of books ever sold in our store; trip mileage; hop counter; etc.).

I’ve seen many cases (though difficult to illustrate in this scope) when foreign key values are in ascending order. A very brief example is a 1-1 relation between two denormalized tables, where the tables ids do not necessarily have to match, but is always ascending).

And Baron’s wishlist for SQL can also benefit from monotonic functions.

Conclusion

When a monotonic function is present, it brings an added value to our schema and query design. It allows for less indexing; quicker operations. Look for these. I’ve only discussed increasing functions. Indeed, MySQL’s indexes are always increasing (they cannot be defined in decreasing order), but query simplifications work just as well for monotonic decreasing functions.

[¹] I’ve actually seen a different behavior on temporary InnoDB tables and on compressed InnoDB Plugin tables; I’ll write on this on another occasion.


PlanetMySQL Voting: Vote UP / Vote DOWN

Beware of implicit casting

Февраль 2nd, 2010

Ever so often a query provides a “bad” execution plan. Adding a missing index can many times solve the problem. However, not everything can be solved with an index. I wish to highlight the point of having an implicit cast, which negates the use of an index on MySQL.

I see this happening a lot on customers’ databases, and this begs for a short introduction.

MySQL doesn’t support index functions

Let’s assume the following table:

CREATE TABLE `person` (
 `id` bigint(20) NOT NULL AUTO_INCREMENT,
 `first_name` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
 `last_name` varchar(32) CHARACTER SET utf8 DEFAULT NULL,
 `driver_license_registration` bigint(20) DEFAULT NULL,
 PRIMARY KEY (`id`),
 KEY `last_name` (`last_name`),
 KEY `driver_license_registration` (`driver_license_registration`)
)

And suppose we’re looking for persons whose last name begin with “Smith”. The following query will NOT utilize an index:

SELECT * FROM person WHERE LEFT(last_name, 5) = 'Smith';

Why not? Because we use a function (LEFT) over the last_name column, and this makes MySQL unable to deduce the outcome of the expression. Will it be ordered in the same way as the original column? (in this case, the answer is yes, LEFT is a monotonic function) Is it possible to produce the reverse function? (In this case, no). MySQL cannot and does not handle these questions and therefore avoids using the index on last_name altogether:

EXPLAIN SELECT * FROM person WHERE LEFT(last_name, 5) = 'Smith';
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | person | ALL  | NULL          | NULL | NULL    | NULL | 1000 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+

To solve the problem above, a simple change to the query will do the trick:

EXPLAIN SELECT * FROM person WHERE last_name LIKE 'Smith%';
+----+-------------+--------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table  | type  | possible_keys | key       | key_len | ref  | rows | Extra       |
+----+-------------+--------+-------+---------------+-----------+---------+------+------+-------------+
|  1 | SIMPLE      | person | range | last_name     | last_name | 99      | NULL |    5 | Using where |
+----+-------------+--------+-------+---------------+-----------+---------+------+------+-------------+

Not all functions are explicit

This leads us to the subject of this post. What if we were to look for persons whose driver license registration number begins with ‘123′? Trying to learn from the above example, we write:

SELECT * FROM person WHERE driver_license_registration LIKE '123%';

But the above query does NOT utilize an index! See:

EXPLAIN SELECT * FROM person WHERE driver_license_registration LIKE '123%';
+----+-------------+--------+------+-----------------------------+------+---------+------+------+-------------+
| id | select_type | table  | type | possible_keys               | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------+------+-----------------------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | person | ALL  | driver_license_registration | NULL | NULL    | NULL | 1000 | Using where |
+----+-------------+--------+------+-----------------------------+------+---------+------+------+-------------+

Why not?

Because the driver_license_registration column is an integer. The query asks for a string comparison (LIKE ‘123%’). This kind of comparison cannot be performed on integers. The column value will have to transform to text in order to work out the query. Effectively, it’s as if we wrote:

SELECT * FROM person WHERE CAST(driver_license_registration AS CHAR) LIKE '123%';

There’s no immediate solution to this last query. Perhaps the column should be textual after all, perhaps maintain a “ghost” column. If the number of digits in a number if fixed an known, then we can convert the above to a range query.

Casting is an implicit function

And, alas, it’s a well hidden function. There’s nothing to suggest a problem here. This is SQL, not Java. It’s not a strongly typed language. I see the above example quite a lot. Another common casting mistake is comparing Timestamp values to strings.

So, until MySQL supports index functions, watch out for these nuances.


PlanetMySQL Voting: Vote UP / Vote DOWN

How (not) to find unused indexes

Октябрь 16th, 2009

I've seen a few people link to an INFORMATION_SCHEMA query to be able to find any indexes that have low cardinality, in an effort to find out what indexes should be removed.  This method is flawed - here's the first reason why:

SQL:
  1. CREATE TABLE `sales` (
  2. `id` int(11) NOT NULL AUTO_INCREMENT,
  3. `customer_id` int(11) DEFAULT NULL,
  4. `status` enum('arhicved','active') DEFAULT NULL,
  5. PRIMARY KEY (`id`),
  6. KEY `status` (`status`)
  7. ) ENGINE=MyISAM AUTO_INCREMENT=65691 DEFAULT CHARSET=latin1;
  8.  
  9. mysql&gt; SELECT count(*), STATUS FROM sales GROUP BY STATUS;
  10. +----------+---------+
  11. | count(*) | STATUS  |
  12. +----------+---------+
  13. |    65536 | archived |
  14. |      154 | active  |
  15. +----------+---------+
  16. 2 rows IN SET (0.17 sec)
  17.  
  18. mysql&gt; EXPLAIN SELECT * FROM sales WHERE STATUS='active'; # query 1
  19. +----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
  20. | id | select_type | TABLE | type | possible_keys | KEY    | key_len | ref   | rows | Extra       |
  21. +----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
  22. 1 | SIMPLE      | sales | ref  | STATUS        | STATUS | 2       | const |  196 | USING WHERE |
  23. +----+-------------+-------+------+---------------+--------+---------+-------+------+-------------+
  24. 1 row IN SET (0.06 sec)
  25.  
  26. mysql&gt; EXPLAIN SELECT * FROM sales WHERE STATUS='archived'; # query 2
  27. +----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
  28. | id | select_type | TABLE | type | possible_keys | KEY  | key_len | ref  | rows  | Extra       |
  29. +----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
  30. 1 | SIMPLE      | sales | ALL  | STATUS        | NULL | NULL    | NULL | 65690 | USING WHERE |
  31. +----+-------------+-------+------+---------------+------+---------+------+-------+-------------+
  32. 1 row IN SET (0.01 sec)

The cardinality of status index is woeful, but provided that the application is always only sending query 1 to MySQL it's actually a pretty good index!  It's not always like this, but there are a lot of cases where applications have good selectivity with some queries despite what cardinality shows.

Not convinced?  Here's reason number two:

SQL:
  1. CREATE TABLE `Country` (
  2. `Code` char(3) NOT NULL DEFAULT '',
  3. `Name` char(52) NOT NULL DEFAULT '',
  4. `Continent` enum('Asia','Europe','North America','Africa','Oceania','Antarctica','South America') NOT NULL DEFAULT 'Asia',
  5. `Region` char(26) NOT NULL DEFAULT '',
  6. `SurfaceArea` float(10,2) NOT NULL DEFAULT '0.00',
  7. `IndepYear` smallint(6) DEFAULT NULL,
  8. `Population` int(11) NOT NULL DEFAULT '0',
  9. `LifeExpectancy` float(3,1) DEFAULT NULL,
  10. `GNP` float(10,2) DEFAULT NULL,
  11. `GNPOld` float(10,2) DEFAULT NULL,
  12. `LocalName` char(45) NOT NULL DEFAULT '',
  13. `GovernmentForm` char(45) NOT NULL DEFAULT '',
  14. `HeadOfState` char(60) DEFAULT NULL,
  15. `Capital` int(11) DEFAULT NULL,
  16. `Code2` char(2) NOT NULL DEFAULT '',
  17. PRIMARY KEY (`Code`),
  18. KEY `Population` (`Population`)
  19. ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  20.  
  21. mysql&gt; SELECT count(*) FROM Country;
  22. +----------+
  23. | count(*) |
  24. +----------+
  25. |      239 |
  26. +----------+
  27. 1 row IN SET (0.00 sec)
  28.  
  29. mysql&gt; SELECT count(DISTINCT(population)) FROM Country;
  30. +-----------------------------+
  31. | count(DISTINCT(population)) |
  32. +-----------------------------+
  33. |                         226 |
  34. +-----------------------------+
  35. 1 row IN SET (0.05 sec)
  36.  
  37. mysql&gt; EXPLAIN SELECT * FROM country WHERE population&gt; 1000; # query 3
  38. +----+-------------+---------+------+---------------+------+---------+------+------+-------------+
  39. | id | select_type | TABLE   | type | possible_keys | KEY  | key_len | ref  | rows | Extra       |
  40. +----+-------------+---------+------+---------------+------+---------+------+------+-------------+
  41. 1 | SIMPLE      | country | ALL  | Population    | NULL | NULL    | NULL239 | USING WHERE |
  42. +----+-------------+---------+------+---------------+------+---------+------+------+-------------+
  43. 1 row IN SET (0.04 sec)
  44.  
  45. mysql&gt; EXPLAIN SELECT * FROM country WHERE population&gt; 100000000; # query 4
  46. +----+-------------+---------+-------+---------------+------------+---------+------+------+-------------+
  47. | id | select_type | TABLE   | type  | possible_keys | KEY        | key_len | ref  | rows | Extra       |
  48. +----+-------------+---------+-------+---------------+------------+---------+------+------+-------------+
  49. 1 | SIMPLE      | country | range | Population    | Population | 4       | NULL |   23 | USING WHERE |
  50. +----+-------------+---------+-------+---------------+------------+---------+------+------+-------------+
  51. 1 row IN SET (0.00 sec)

The index on query 3 had high cardinality but should not be used since too many countries have a population greater than 1000.  An automated search for low cardinality indexes wouldn't have revealed it's uselessness.  For range scans, it's very easy to lead yourself into a trap where your index can not filter out enough rows to be effective.  I see this a lot in consulting issues where customers have queries that use a BETWEEN on a date, but the window of time it is searching in is too wide.

Side Note: In some texts you'll see people quote the numbers "20-30%" as the minimum amount of rows you have to filter down to for an index to be useful (that is, eliminate 70-80% of rows).  It's not quite correct to quote this as an exact percentage, since this value is not fixed in MySQL and can be a much wider window (15-60%) depending on the circumstances.  In this case, MySQL flipped from tablescan to index at about 34%.

How am I supposed to find unused indexes then?
You really have to run queries against your server - there is no other way.  From there, there's a helpful patch in 5.0-percona called INDEX_STATISTICS that can then show you which indexes were touched and which were not.

If you are not running a patched server, then the alternative is to either use a proxy that checks EXPLAIN information (like QUAN) or set your slow query log to zero microseconds (5.1 feature) and then find someway to parse and EXPLAIN all results, then subtract the indexes that were mentioned from all indexes known.  There's an old tool called mysqlidxchx which should be able to do this.


Entry posted by Morgan Tocker | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Comparison Between Solr And Sphinx Search Servers (Solr Vs Sphinx – Fight!)

Сентябрь 3rd, 2009

In the past few weeks I've been implementing advanced search at Plaxo, working quite closely with Solr enterprise search server. Today, I saw this relatively detailed comparison between Solr and its main competitor Sphinx (full credit goes to StackOverflow user mausch who had been using Solr for the past 2 years). For those still confused, Solr and Sphinx are similar to MySQL FULLTEXT search, or for those even more confused, think Google (yeah, this is a bit of a stretch, I know).

Similarities

  • Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.
  • Both have a long list of high-traffic sites using them (Solr, Sphinx)
  • Both offer commercial support. (Solr, Sphinx)
  • Both offer client API bindings for several platforms/languages (Sphinx, Solr)
  • Both can be distributed to increase speed and capacity (Sphinx, Solr)

Here are some differences

Related questions

Conclusion

In my experience, Solr is very-very fast on the query side. It is also very powerful. The indexing side is very CPU and memory intensive and is an unfortunate side effect of having such a feature-rich, fast application. Nevertheless, I highly recommend Solr.

For disclaimer purposes, I have not had much experience with Sphinx and, again, all credit for this comparison goes to mausch.

 
Similar Posts:Share/Bookmark
PlanetMySQL Voting: Vote UP / Vote DOWN