The InnoDB full-text search capability is an exciting feature. The full-text search itself is generally useful to have in an RDBMS. If an application is using all InnoDB tables except for one that is used for full-text searches, now that last table can be switched to InnoDB. If putting the full-text data in a MyISAM table led to scalability problems, duplication, or a less-than-ideal schema design, now those issues can be addressed.
In this post, I’ll take you through some of the basics of setting up and querying an InnoDB FULLTEXT search index. I’ll leave the scalability and performance aspects to Jimmy’s and Vinay’s blog posts, and just use some toy-sized data for demonstration purposes.
Creating a Table with a Full-Text Search Index
The key component of this feature is an index of type FULLTEXT, applied to one or more columns of an InnoDB table.
In Jimmy’s post, he mentions some scalability considerations where you might create the table (including a special FTS_DOC_ID column), load the data, then create the FULLTEXT index afterward. For simplicity (and since the data volume is so small), I’ll create the table with the index in place, then load the data afterward.
use test;
-- We will do some commits and rollbacks to demonstrate transactional features.
-- So turn off the default setting that commits immediately after each statement.
set autocommit=0;
drop table if exists quotes;
-- In 5.5 and above, by default this table is an InnoDB table.
-- The full-text search feature lets us define the FULLTEXT index.
create table quotes
( id int unsigned auto_increment primary key
, author varchar(64) , quote varchar(4000)
, source varchar(64)
, fulltext(quote)
);
-- Get some words and phrases to search for into the table.
insert into quotes (author, quote, source) values
('Abraham Lincoln', 'Fourscore and seven years ago...',
'Gettysburg Address')
, ('George Harrison', 'All those years ago...',
'Live In Japan')
, ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.',
'2010: The Year We Make Contact')
, ('Benjamin Franklin',
'Early to bed and early to rise, makes a man healthy, wealthy, and wise.',
'Poor Richard''s Almanack')
, ('James Thurber',
'Early to rise and early to bed makes a male healthy and wealthy and dead.',
'The New Yorker')
, ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.',
'Men in Black')
;
-- Since this is an InnoDB table, we are mindful of transactions.
commit;
Word and Phrase Search – Natural Language Mode
Once the data is loaded and committed, you can run queries using the MATCH(columns) AGAINST (search expression) operator to do the actual searches. You can combine this operator with all the usual WHERE and similar clauses in the SELECT statement.
The simplest kind of search is to find a single word, or a phrase with all words in exact order. For this type of search, use the IN NATURAL LANGUAGE clause inside the AGAINST() call. This technique typically involves a user-entered string that you pass verbatim to the query (of course, after escaping any quotation marks or other special characters to prevent SQL injection attacks).
-- Search for a single word.
select author as "Monolith" from quotes
where match(quote) against ('monolith' in natural language mode);
+------------------+
| Monolith |
+------------------+
| Arthur C. Clarke |
+------------------+
1 row in set (0.01 sec)
select author as "Ago" from quotes
where match(quote) against ('ago' in natural language mode);
+------------------+
| Ago |
+------------------+
| Abraham Lincoln |
| George Harrison |
| Arthur C. Clarke |
| K |
+------------------+
4 rows in set (0.00 sec)
-- Search for a phrase.
select author as "Years Ago" from quotes
where match(quote) against ('years ago' in natural language mode);
+------------------+
| Years Ago |
+------------------+
| Abraham Lincoln |
| George Harrison |
| Arthur C. Clarke |
| K |
+------------------+
4 rows in set (0.00 sec)
AND / OR / NOT Operators – Boolean Mode
For more complicated searches, you can have multiple words and phrases and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.
-- Search for a combination of words, not in the same order as the original.
select author as "Ago and Years" from quotes
where match(quote) against ('+ago +years' in boolean mode);
+------------------+
| Ago and Years |
+------------------+
| Abraham Lincoln |
| George Harrison |
| Arthur C. Clarke |
| K |
+------------------+
4 rows in set (0.00 sec)
-- Search for other Boolean combinations of words.
select author as "Fourscore or Monolith" from quotes
where match(quote) against ('fourscore monolith' in boolean mode);
+-----------------------+
| Fourscore or Monolith |
+-----------------------+
| Abraham Lincoln |
| Arthur C. Clarke |
+-----------------------+
2 rows in set (0.00 sec)
select author as "Years and not Monolith" from quotes
where match(quote) against ('+years -monolith' in boolean mode);
+------------------------+
| Years and not Monolith |
+------------------------+
| Abraham Lincoln |
| George Harrison |
| K |
+------------------------+
3 rows in set (0.00 sec)
Proximity Search
Proximity search is a special case of Boolean search using the @ operator within the AGAINST() string. You supply 2 or more words, double-quoted, within the single-quoted AGAINST() string, followed by @distance to specify how far apart these words can be. The distance represents the maximum number of bytes between the starting points of all these words.
-- The starting points for these words are too far apart
-- (not within 20 bytes), so no results.
select quote as "Too Far Apart" from quotes
where match(quote) against ('"early wise" @20' in boolean mode);
Empty set (0.00 sec)
-- But the starting points of all words are within 100 bytes,
-- so this query does give results.
select quote as "Early...Wise" from quotes
where match(quote) against ('"early wise" @100' in boolean mode);
+-------------------------------------------------------------------------+
| Early...Wise |
+-------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)
-- In this case, the smallest distance that produces results is 49.
select quote as "Early...Wise" from quotes
where match(quote) against ('"early wise" @49' in boolean mode);
+-------------------------------------------------------------------------+
| Early...Wise |
+-------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)
-- Here is an example showing 2 results, with the words close to each other.
select quote as "Early...Bed" from quotes
where match(quote) against ('"early bed" @20' in boolean mode);
+---------------------------------------------------------------------------+
| Early...Bed |
+---------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
| Early to rise and early to bed makes a male healthy and wealthy and dead. |
+---------------------------------------------------------------------------+
2 rows in set (0.00 sec)
Relevance Ranking
The relevance ranking is fairly basic, derived from word frequencies within each document and the search data overall. Typically, you would only ORDER BY this value for very simplistic searches of small documents; for any important search you would layer your own ranking logic on top, perhaps with the MySQL relevance value as one factor in the overall rank.
-- Get the relevance of a single word.
select substr(quote,1,20) as "And",
match(quote) against ('and' in natural language mode) as "Relevance"
from quotes order by "Relevance" desc;
+----------------------+--------------------+
| And | Relevance |
+----------------------+--------------------+
| Fourscore and seven | 0.0906190574169159 |
| All those years ago. | 0 |
| Then 10 years ago th | 0 |
| Early to bed and ear | 0.1812381148338318 |
| Early to rise and ea | 0.2718571722507477 |
| 1500 hundred years a | 0 |
+----------------------+--------------------+
6 rows in set (0.00 sec)
Transactions
The key idea behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions, so that you can include full-text columns alongside other columns in tables in ways that make sense in terms of schema design, and multiple sessions can update the full-text column data (and/or other columns in the table) simultaneously. The full-text data doesn’t have to be treated as read-only or read-mostly.
As mentioned in Jimmy’s blog post, the table structures that manipulate the full-text data behind the scenes are only updated at COMMIT time. So make sure to insert or update full-text data in one transaction, commit, and then run any full-text queries in a subsequent transaction. (Actually, in the examples below, it looks like the data is taken out of the full-text results as soon as a DELETE is issued, then comes back if the deletion is rolled back. I think that is explained in Jimmy’s blog post by the discussion about the delete-marking optimization to avoid huge updates to the full-text index for deleted data.)
drop table if exists quotes_uncommitted;
create table quotes_uncommitted
(
author varchar(64)
, quote varchar(4000)
, source varchar(64)
, fulltext(quote)
, primary key (author, quote(128))
);
-- We insert but don't immediately commit.
insert into quotes_uncommitted select author, quote, source from quotes;
-- Within the same transaction, a full-text search does not see the uncommitted data.
select count(author), author as "Uncommitted Results" from quotes_uncommitted
where match(quote) against ('ago' in natural language mode);
+---------------+---------------------+
| count(author) | Uncommitted Results |
+---------------+---------------------+
| 0 | NULL |
+---------------+---------------------+
1 row in set (0.00 sec)
-- If the newly inserted rows are rolled back...
rollback;
-- ...then the full-text search still doesn't see them.
select count(author), author as "Rolled-Back Results" from quotes_uncommitted
where match(quote) against ('ago' in natural language mode);
+---------------+---------------------+
| count(author) | Rolled-Back Results |
+---------------+---------------------+
| 0 | NULL |
+---------------+---------------------+
1 row in set (0.00 sec)
-- OK, let's start with some committed data in the table, then empty the table,
-- and then try some FTS queries
-- both before and after the commit.
insert into quotes_uncommitted select author, quote, source from quotes;
commit;
delete from quotes_uncommitted;
select count(author), author as "Deleted but still not committed" from quotes_uncommitted
where match(quote) against ('ago' in natural language mode);
+---------------+---------------------------------+
| count(author) | Deleted but still not committed |
+---------------+---------------------------------+
| 0 | NULL |
+---------------+---------------------------------+
1 row in set (0.00 sec)
rollback;
select count(author), author as "Deleted and rolled back" from quotes_uncommitted
where match(quote) against ('ago' in natural language mode);
+---------------+-------------------------+
| count(author) | Deleted and rolled back |
+---------------+-------------------------+
| 4 | Abraham Lincoln |
+---------------+-------------------------+
1 row in set (0.00 sec)
delete from quotes_uncommitted;
commit;
select count(author), author as "Deleted and committed" from quotes_uncommitted
where match(quote) against ('ago' in natural language mode);
+---------------+-----------------------+
| count(author) | Deleted and committed |
+---------------+-----------------------+
| 0 | NULL |
+---------------+-----------------------+
1 row in set (0.00 sec)
insert into quotes_uncommitted select author, quote, source from quotes;
commit;
truncate table quotes_uncommitted;
select count(author), author as "Truncated" from quotes_uncommitted
where match(quote) against ('ago' in natural language mode);
+---------------+-----------+
| count(author) | Truncated |
+---------------+-----------+
| 0 | NULL |
+---------------+-----------+
1 row in set (0.00 sec)
Multi-Column Searches
Although you can only have one FULLTEXT index in an InnoDB table, that index can apply to multiple columns, allowing you to search when you aren’t sure which column contains the term. With a multi-column index, we can MATCH() against all the columns to find words that appear in any of those columns. Always reference all the same columns in the MATCH() clause as in the FULLTEXT index definition, because the information about which column the words appear in is not included in the full-text search data.
drop table if exists quotes_multi_col;
create table quotes_multi_col
(
id int unsigned auto_increment primary key
, author varchar(64)
, quote varchar(4000)
, source varchar(64)
, fulltext(author, quote, source)
);
insert into quotes_multi_col select * from quotes;
commit;
select author as "Poor 1 (NL)", substr(quote,1,15) as "Poor 2 (NL)", source as "Poor 3 (NL)" from
quotes_multi_col where match(author, quote, source)
against ('poor' in natural language mode);
+-------------------+-----------------+-------------------------+
| Poor 1 (NL) | Poor 2 (NL) | Poor 3 (NL) |
+-------------------+-----------------+-------------------------+
| Benjamin Franklin | Early to bed an | Poor Richard's Almanack |
+-------------------+-----------------+-------------------------+
1 row in set (0.00 sec)
select author as "Poor 1 (BOOL)", substr(quote,1,15) as "Poor 2 (BOOL)", source as "Poor 3 (BOOL)"
from quotes_multi_col where match(author, quote, source)
against ('poor' in boolean mode);
+-------------------+-----------------+-------------------------+
| Poor 1 (BOOL) | Poor 2 (BOOL) | Poor 3 (BOOL) |
+-------------------+-----------------+-------------------------+
| Benjamin Franklin | Early to bed an | Poor Richard's Almanack |
+-------------------+-----------------+-------------------------+
1 row in set (0.00 sec)
select author as "Clarke 1 (NL)", substr(quote,1,15) as "Clarke 2 (NL)", source as "Clarke 3 (NL)"
from quotes_multi_col where match(author, quote, source)
against ('clarke' in natural language mode);
+------------------+-----------------+--------------------------------+
| Clarke 1 (NL) | Clarke 2 (NL) | Clarke 3 (NL) |
+------------------+-----------------+--------------------------------+
| Arthur C. Clarke | Then 10 years a | 2010: The Year We Make Contact |
+------------------+-----------------+--------------------------------+
1 row in set (0.00 sec)
Interaction with Other Indexes
Remember that the design of your primary key index and secondary indexes is a big factor in query performance for InnoDB tables.
- You can include parts (prefixes) of the full-text column(s) within the primary key.
- However, that might not be a good idea if (a) the associated columns will ever be updated — which causes an expensive reorganization within the InnoDB table, or (b) if the table will have any other secondary indexes — the primary key values for a row are duplicated in the entry for that row in every secondary index, making index operations require more I/O and memory.
- As mentioned in Jimmy’s blog post, adding the
FULLTEXT index to the table is going to create a new column and associated index in the original table, so you could set up the column and index ahead of time, to avoid table reorganization later.
- You can use the unique constraint of the primary key or a
UNIQUE index to prevent duplicate values or combinations of values from being entered.
- You can use the not-null constraint of the primary key to prevent blank values or combinations of values from being entered.
- For the Labs release, the InnoDB
FULLTEXT processing isn’t integrated with the MySQL optimizer and its estimates for which index is best to use, so don’t draw conclusions about performance characteristics from this early preview.
Stopwords
Stopwords are typically short, commonly used words that you designate as not significant for a search. They are left out of the FULLTEXT index and ignored when entered in FULLTEXT queries. For example, a search for ‘the’ is unsuccessful because it’s in the default stopword list. For your own customized search, you might create a bigger list (say, with common words from several languages) or a smaller one (for example, a music or movie site where words such as “The” in names and titles are significant). The details about customizing the stopword list are in Jimmy’s blog post.
select count(*), author as "Stopword 1", quote as "Stopword 2", source as "Stopword 3"
from quotes_multi_col
where match(author, quote, source) against ('the' in natural language mode);
+----------+------------+------------+------------+
| count(*) | Stopword 1 | Stopword 2 | Stopword 3 |
+----------+------------+------------+------------+
| 0 | NULL | NULL | NULL |
+----------+------------+------------+------------+
PlanetMySQL Voting:
Vote UP /
Vote DOWN