Archive for the ‘sphinx’ Category

Sphinx & MySQL: facts and misconceptions

Сентябрь 2nd, 2010

Sphinx search is a full text search engine, commonly used with MySQL.

There are some misconceptions about Sphinx and its usage. Following is a list of some of Sphinx’ properties, hoping to answer some common questions.

  • Sphinx is not part of MySQL/Oracle.
  • It is a standalone server; an external application to MySQL.
  • Actually, it is not MySQL specific. It can work with other RDBMS: PostgreSQL, MS SQL Server.
  • And, although described as “free open-source SQL full-text search engine”, it is not SQL-specific: Sphinx can read documents from XML.
  • It is often described as “full text search for InnoDB”. This description is misleading. Sphinx indexes text; be it from any storage engine or external source. It solves, in a way, the issue of “FULLTEXT is only supported by MyISAM”. Essentially, it provided full-text indexing for InnoDB tables, but in a very different way than the way MyISAM’s FULLTEXT index works.

Sphinx works by reading documents, usually from databases. Considering the case of MySQL, Sphinx issues a SQL query which retrieves relevant data (mostly the text you want to index, but other properties allowed).

  • Being an external module, it does not update its indexes on the fly. So if 10 new rows are INSERTed, it has no knowledge of this. It must be called externally to re-read the data (or just read the new data), and re-index.
    • This is perhaps the greatest difference, functionality-wise, between Sphinx and MyISAM’s FULLTEXT. The latter is always updated, for every row INSERTed, DELETEd or UPDATEd. The latter also suffers by this property, as this makes for serious overhead with large volumes.
    • There’s more than one way to make that less of an issue. I’ll write some more in future posts.
  • Sphinx does not keep the text to itself; just the index. Sphinx cannot be asked “Give me the blog post content for those posts containing ‘open source’”.
    • Sphinx will only tell you the ID (i.e. Primary Key) for the row that matches your search.
    • It is up to you to then get the content from the table.
    • With SphinxSE (Sphinx Storage Engine for MySQL) this becomes easier, all-in-one query.
  • It can keep other numeric data. Such data can be used to filter results.
  • It provides with GROUP BY-like, as well as ORDER BY-like mechanism.
  • It allows for ordering results by relevance.
  • It allows for exact match search, boolean search, and more.
  • It has an API & implementation for popular programming languages: PHP, Python, Perl, Ruby, Java.

The above describes Sphinx as a general fulltext search engine for databases. It does, however, have special treatment for MySQL:

  • First and foremost, it knows how to query MySQL for data (duh!)
  • If you don’t mind compiling from source, you can rebuild MySQL with SphinxSE: a storage engine implementation. This storage engine does not actually hold any data, but rather provides an SQL-like interface to the search daemon.
    • Thus, you can query for search results using SELECT statements, JOINing to document tables, retrieving results, all in one step.
    • If you do mind compiling MySQL, be aware that MariaDB comes with SphinxSE built in in newer versions.
  • It implements the MySQL protocol. You can connect to the sphinx server using a MySQL client, and actually issue SQL statements to retrieve data. Not all SQL is supported. The valid subset is called SphinxQL.
  • Snippets (excerpts) are supported via MySQL UDF.

PlanetMySQL Voting: Vote UP / Vote DOWN

Expert PHP and MySQL published!

Апрель 15th, 2010
Expert PHP and MySQLI'm very pleased to announce the publication of my book Expert PHP and MySQL, published by Wrox. This book was written by myself, Andrew Curioso and Ronald Bradford. The short of it is, upon finishing my previous book, Developing Web Applications with Apache, MySQL, memcached, and Perl, Wiley asked me if I knew of anyone who would like to write a MySQL/PHP book. I had worked with Andrew at Lycos and found him to be a brilliant PHP developer, having been the primary developer of Lycos's Webon product-- which has some excellent usage of PHP, Javascript and MySQL. When I friend of mine Bob Wilkins, who started MyVBO, was looking for a developer I suggested Andrew (he now works at MyVBO), and for this book I also suggested Andrew. Andrew had also written a short book for O'Reilly on AJAX, so Wiley was glad to have had him as a suggestion. They came back and asked if I would lend a hand to which despite being exhausted from writing my first book, decided to contribute and work on some chapters that covered material that thought would be a great benefit to the community-- namely Gearman, Sphinx, the Memcached Functions for MySQL as well as Memcached itself. Finally, we needed someone else with MySQL expertise to add to the mix to cover more advanced MySQL information to which Ronald Bradford, who I know, like, and respect as an expert in the field, signed on also.

For me, I have been more of a Perl guy, however, I like PHP just as well and even prefer how it's web deployment model is much easier (you don't need to modify your httpd config files) was interested in a challenge of writing a book on a different language. Also, some fruits of this being that my project from my Perl book, Narada, now has a PHP port.

I'm proud of this book. It was harder in some ways to write a book with other people than do it all myself, despite far fewer pages. However, the end product took advantage of all our strengths.

Topics covered in this book are (but not limited to!):

* PHP and MySQL techniques every programmer should know
* Advanced PHP concepts such as using iterators, making classes behave like functions, using true lambda functions and closures
* MySQL storage engines
* Using the information schema
* Improving performance through caching - using memcached to add a caching layer to your application
* Writing UDFs
* Writing PHP extensions
* Using the Memcached Functions for MySQL (UDFs)
* Full text search - installing, configuring, and using Sphinx in a PHP application
* Multi-tasking in PHP using Gearman. Narada is used as an example application demonstrating how to put many of the concepts of the book into an application
* Using Apache rewrite rules
* HTTP-based authentication

All-in-all, this is a great book that I hope will benefit PHP/MySQL or any other web developers who need a source of expert information!

I want to thank Andrew, Ronald, Wiley (Bob Elliot, Maureen Spears), Trond Norbye and Eric Day (who tech-edited the book!), and all others who I could write a book in itself who have helped (see my credits in the book!). Also, I want to thank the team at NorthScale, my employer, for being able to work with a team experts who I could refer to while working on this book.

You can buy this book from any major publisher, and from Wiley at http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470563125.html

PlanetMySQL Voting: Vote UP / Vote DOWN

Slides from my Sphinx talk at RIT++ 2010

Апрель 14th, 2010

While the majority of Percona gang travelled to California for the MySQL event of the year, I headed in the opposite direction to Moscow for RIT++ 2010 conference where I presented a talk on Sphinx. You can get the PDF file here – Improving MySQL-based applications performance with Sphinx.

 

 

I have been invited to talk at Open Source Data Center Conference in Nürnberg, Germany in June this year, so I hope I can meet some of you there.


Entry posted by Maciej Dobrzanski | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

SphinxSE 0.9.9-RC2 bug workaround

Сентябрь 7th, 2009

There is a serious bug with the sphinx storage engine, introduced in 0.9.9-RC2 (and which has not been fixed in latest revisions, as yet – last checked with rev 2006).

I would usually just revert to an older version (0.9.9-RC1 does not contain this bug), but for the reason that RC2 introduces an important feature: the sphinx_snippets() function, which allows for creation of snippets from within MySQL, and which makes the sphinx integration with MySQL complete, as far as the application is concerned.

The bug

The bug is described here and here (and see further discussions). Though it’s claimed to have been fixed, it’s been re-reported, and I’ve tried quite a few revisions and verified it has not been fixed (tested on Debian/Ubuntu x64). Essentially, the bug does not allow you to set filters on a query issued from within the SphinxSE. For example, the following queries fail:

SELECT ... FROM ... WHERE query='python;mode=any;sort=relevance;limit=200;range=myUnixTimestamp,1249506000,1252184400;'
SELECT ... FROM ... WHERE query='python;mode=any;sort=relevance;limit=200;filter=my_field,1;'

While the following query succeeds:

SELECT ... FROM ... WHERE query='python;mode=any;sort=relevance;limit=200;'

The error message is this:

ERROR 1430 (HY000): There was a problem processing the query on the foreign data source. Data source error: searchd error: invalid or truncated request

I see this as a serious bug in the SphinxSE: it renders it useless; searching without the ability to filter is not something I can live with.

The motivation

Sphinx does not store the actual text content. To get search results with snippets, you need to:

  • Ask sphinx for the documents ids
  • Get the content for those documents
  • Ask sphinx for snippets based on the provided content and search phrase.

With the introduction of the sphinx_snippets() function, this can all be done with a single query, like this:

SELECT my_docs.my_docs_id,  my_docs.publish_time,  CONVERT(sphinx_snippets(my_docs.id.content, 'my_docs_index', 'python') USING utf8) AS snippet  FROM tets.my_docs INNER JOIN test.my_docs_sphinx USING(my_docs_id) WHERE query='python;mode=any;sort=relevance;limit=200;range=publish_time_unix,1249506000,1252184400;';

This is really a life saver; without this function, you need to get the resutls back to your application, then send the data again to MySQL, in which case you might altogether discard the SphinxSE and talk to sphinx directly. But with a single query you get to ask the results just as if you were asking for any result set from your database (with extra syntax).

The workaround

My setup is Percona’s mysql-5.1.34-xtradb5 source, on Ubuntu server 8.04 amd64. The trick is to first compile MySQL with sphinx 0.9.9-RC2, in order to produce the sphinx.so file (where the sphinx_snippets() function is found), backup the sphinx.so file, then recompile everything with sphinx 0.9.9-RC1. The steps being:

Compile MySQL with sphinx 0.9.9-Rc2 (I choose to install MySQL on /usr/local/mysql51):

tar xzfv mysql-5.1.34-xtradb5.tar.gz
cd mysql-5.1.34-xtradb5
cp -R /tmp/resources/sphinx-0.9.9-rc2/mysqlse storage/sphinx
sh BUILD/autorun.sh
./configure --with-plugins=innobase,sphinx --prefix=/usr/local/mysql51
make

This produces the sphinx.so, sphinx.so.0, sphinx.so.0.0.0 files. Back them up!

Next, recompile with sphinx 0.9.9-RC1. I’ve found that simple copying and recompiling doesn’t work well. So just cleanup everything and start afresh:

cd ..
rm -rf mysql-5.1.34-xtradb5
tar xzfv mysql-5.1.34-xtradb5.tar.gz
cd mysql-5.1.34-xtradb5
cp -R /tmp/resources/sphinx-0.9.9-rc1/mysqlse storage/sphinx
sh BUILD/autorun.sh
./configure --with-plugins=innobase,sphinx --prefix=/usr/local/mysql51
make
sudo make install

Then build sphinx (you must have MySQL includes for sphinx to compile, so this must be the second step):

cd /tmp/resources/sphinx-0.9.9-rc1/
./configure --prefix=/usr/local/sphinx --with-mysql=/usr/local/mysql51
make
sudo make install

Essentially, we’re working now with 0.9.9-RC1, but the sphinx_snippets() function is from the 0.9.9-RC2 version, and happily no one bothers about this mix.

I hope this helps.


PlanetMySQL Voting: Vote UP / Vote DOWN

Comparison Between Solr And Sphinx Search Servers (Solr Vs Sphinx – Fight!)

Сентябрь 3rd, 2009

In the past few weeks I've been implementing advanced search at Plaxo, working quite closely with Solr enterprise search server. Today, I saw this relatively detailed comparison between Solr and its main competitor Sphinx (full credit goes to StackOverflow user mausch who had been using Solr for the past 2 years). For those still confused, Solr and Sphinx are similar to MySQL FULLTEXT search, or for those even more confused, think Google (yeah, this is a bit of a stretch, I know).

Similarities

  • Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.
  • Both have a long list of high-traffic sites using them (Solr, Sphinx)
  • Both offer commercial support. (Solr, Sphinx)
  • Both offer client API bindings for several platforms/languages (Sphinx, Solr)
  • Both can be distributed to increase speed and capacity (Sphinx, Solr)

Here are some differences

Related questions

Conclusion

In my experience, Solr is very-very fast on the query side. It is also very powerful. The indexing side is very CPU and memory intensive and is an unfortunate side effect of having such a feature-rich, fast application. Nevertheless, I highly recommend Solr.

For disclaimer purposes, I have not had much experience with Sphinx and, again, all credit for this comparison goes to mausch.

 
Similar Posts:Share/Bookmark
PlanetMySQL Voting: Vote UP / Vote DOWN

Using the Sphinx Search Engine with MySQL

Сентябрь 1st, 2009

MySQL Full Text Search Limitations

Suppose you have a MyISAM table containing a column with a full text index. This table starts to grow to a significant size (millions of rows) and gets updated fairly frequently. Chances are that you’ll start to see some bottlenecks when accessing this table, since without row level locking, the reading and writing operations will be blocking each other.

A solution that many people would suggest right away is to use the master for writes and a slave for reads, but this only masks the problem, and it won’t take long before enough read traffic on the slave starts causing slave lags.

Why Sphinx?

The main difference between the Sphinx search engine and other alternatives is its close integration with MySQL. For example, it can be used as a storage engine.  In this way, Sphinx’s impact on existing application code can be minimized, opening the door for its more advanced features in future releases.

Initial Considerations

Need to compile MySQL from source

I struggled with this one for a while, until I realized that the storage engine plugins only work if the full version number of the MySQL server matches the full version number of the MySQL libraries used in the compilation; and if compiler options for the server, libraries, and plugin also match.

I couldn’t find any other reliable way of getting everything to work together than grabbing a MySQL source tarball and compiling it with Sphinx SE. The process is straightforward if you follow the procedures in the MySQL’s manual: 2.10.1. Source Installation Overview.

Use MySQL 5.1.xx

Although it is possible to use Sphinx with 5.0.xx, I like the cleaner plug-in storage interface of 5.1.xx.

The table to be indexed needs a unique INT column

The table for which you are going to create a full text index has to have an INT column declared as UNIQUE KEY. This value is the one to which Sphinx has to point the rows containing the text in its indexes.

Putting everything together

Both Sphinx and MySQL have very clear and straightforward instructions on how to compile from the source code tarballs, so just follow the steps. Although I had to run several options until I got all the right pieces in place, compiling either tool was never an issue.

The instructions for installing Sphinx can be found in its online documentation 2.3. Installing Sphinx on Linux; and to compile into MySQL 5.1 in 7.2.2. Compiling MySQL 5.1.x with SphinxSE.

When Sphinx is installed, it has two different sample configuration files. One is very comprehensive and the other is clean, simple, functional, as billed in its header. Use the latter one to get started.

Example run

As always, I use the Amarok player MySQL database as an example and in this case, I used the lyrics table.

In the sphinx.conf file I used the following line to index the table:

	sql_query			= \
		SELECT lid, url, lyrics FROM lyrics

Notice that with this configuration line, I’ll be indexing both the song title (the file name, really) and the lyrics text.

After I added an INT AUTO_INCREMENT PRIMARY KEY column to the lyrics table and converted it into InnoDB, the resulting table was:

show create table lyrics\G
*************************** 1. row ***************************
       Table: lyrics
Create Table: CREATE TABLE `lyrics` (
  `lid` int(11) NOT NULL AUTO_INCREMENT,
  `url` varbinary(255) DEFAULT NULL,
  `deviceid` int(11) DEFAULT NULL,
  `lyrics` text,
  `uniqueid` varbinary(32) DEFAULT NULL,
  PRIMARY KEY (`lid`),
  UNIQUE KEY `lyrics_url` (`url`,`deviceid`),
  KEY `lyrics_uniqueid` (`uniqueid`)
) ENGINE=InnoDB AUTO_INCREMENT=110 DEFAULT CHARSET=latin1

Once the table was in place, I ran Sphinx’s indexer utility to create the full text indexes.

Then I added the SPHINX table as per the documentation:

show create table sphinx_search\G
*************************** 1. row ***************************
       Table: sphinx_search
Create Table: CREATE TABLE `sphinx_search` (
  `id` int(11) NOT NULL,
  `weight` int(11) NOT NULL,
  `query` varchar(3072) NOT NULL,
  KEY `query` (`query`)
) ENGINE=SPHINX DEFAULT CHARSET=latin1 CONNECTION='sphinx://localhost:3312/test1'

A difference from the examples in Sphinx’s documentation is that I left out any additional attributes. And finally, I created a view to hide as many implementation details as possible

CREATE VIEW `vlyrics` AS select `ly`.`lid` AS `vlid`,
            `ly`.`url` AS `vurl`,
            `ly`.`deviceid` AS `vdevicedid`,
            `ly`.`lyrics` AS `vlyrics`,
            `ly`.`uniqueid` AS `vuniqueid`,
            `ss`.`query` AS `vquery`
FROM (`sphinx_search` `ss` join `lyrics` `ly`) WHERE (`ss`.`id` = `ly`.`lid`);

Last but not least, I found out how many songs have the word “death” in their lyrics (I was going to use “love”, but it sounded too lame):

select vlid, vurl from vlyrics where vquery = 'death';
+------+------------------------------------------------------------------------------------------------------------+
| vlid | vurl                                                                                                       |
+------+------------------------------------------------------------------------------------------------------------+
|   60 | ./opt/musica/iTunes Music/Genesis/The Lamb Lies Down on Broadway Disc 2/03 Anyway.wma                      |
|   66 | ./opt/musica/iTunes Music/Compilations/Evita_ Premiere American Recording (Disc/1-04 Buenos Aires.mp3      |
|   88 | ./opt/musica/iTunes Music/Lake_&_Palmer_Emerson/Works__Vol._1/06_-_Closer_To_Believing_(Album_Version).mp3 |
+------+------------------------------------------------------------------------------------------------------------+
3 rows in set (0.04 sec)

Notice that in this case the the character “=” doesn’t mean “equals”, but “contains”. If I had renamed the original table and used its name for the view, this would have been the only change as opposed to using “like”:

select lid, url from lyrics where lyrics like '%death%';
+-----+------------------------------------------------------------------------------------------------------------+
| lid | url                                                                                                        |
+-----+------------------------------------------------------------------------------------------------------------+
|  60 | ./opt/musica/iTunes Music/Genesis/The Lamb Lies Down on Broadway Disc 2/03 Anyway.wma                      |
|  66 | ./opt/musica/iTunes Music/Compilations/Evita_ Premiere American Recording (Disc/1-04 Buenos Aires.mp3      |
|  88 | ./opt/musica/iTunes Music/Lake_&_Palmer_Emerson/Works__Vol._1/06_-_Closer_To_Believing_(Album_Version).mp3 |
+-----+------------------------------------------------------------------------------------------------------------+
3 rows in set (0.02 sec)

Note: disregard the times for each query—the table has only about 100 rows.

Conclusion

Implementing SphinxSE into MySQL proved to be easier than it seemed in the beginning, although it took some time to compile and install everything. With a creative use of views, it could potentially be implemented right away in legacy applications, offering numerous advantages.

Sphinx offers many additional features that I do not consider here. I recommend looking into the documentation to implement the more sophisticated indexing mechanisms and search methods.

This evaluation was done with Sphinx v0.9.8, which is labeled as stable since it was done for a customer project. Version 0.9.9, the current release candidate, has an new feature: it supports the MySQL communications protocol. This makes it possible to query text search engine directly using MySQL’s client libraries, which opens a new set of interesting possibilities.


PlanetMySQL Voting: Vote UP / Vote DOWN