People ask me about MySQL's full-text search from time to time, but I've never actually used it. I understand how it works, so I can generally provide ball-park ideas about performance and suitability for a particular purpose. But until today, I had no first hand experience.
That all changed today. My initial reaction: Wow!
In MySQL 4.0.10 (I haven't bothered to build 4.0.11 yet) it makes my life way easier.
Here's the problem I'm trying to solve, stated generally enough so that it's meaningful and doesn't give away any trade secrets.
I have a Perl script manipulating lots of short multi-word strings. Each string has an associated numeric value. There's anywhere from a few hundred thousand to 5 million of them. For each of those strings, I need to locate all the other strings that contain the first string and then do something interesting with the associated value.
For example, given the string "car rental" I need to find:
- national car rental
- avis car rental
- dollar car rental
- car rental companies
And so on.
I do not want to match "rental car" or "car rent" or "car rentals" or similar variations. Order matters. Word boundaries matter.
The simple solution is to iterate over the list of strings. For each string, scan all the other strings to look for matches. The problem is that this does not scale well at all. It's an O(n**2) solution. With a few million strings, it takes forever.
What I needed was a way to index the strings. In the "car rental" case, if I could somehow find a list of all the strings that contain the word "rental" and then examine those, it'd be way faster. It be even faster if I could find the the intersection of the set of strings that contain "car" and those that contain "rental." Then I could just check for ordering to make sure I don't find "rental car." But I didn't want to build that myself. And memory is at a premium here, so I can't attack it sloppily..
MySQL to the Rescue!
After a bit of thinking, I realized that MySQL's fulltext indexing could probably do the job a lot faster than I could. So I constructed a simple table that can hold these mysterious strings and values.
CREATE TABLE `stuff` ( secret_num INTEGER UNSIGNED NOT NULL, secret_string VARCHAR(250) NOT NULL )
Then I load all the data into the table, either directly in Perl or all at once using mysqlimport. Once it's there, I add a fulltext index to the secret_string column.
ALTER TABLE `stuff` ADD FULLTEXT (secret_string)
Then I can find the data I want much, much faster.
mysql> select * from stuff > where match (secret_string) against ('+"car rental"' > in boolean mode) order by freq asc; +------+-----------------------+ | 48 | discount car rental | | 56 | car rental companies | | 81 | advantage car rental | | 106 | payless car rental | | 204 | avis car rental | | 206 | hertz car rental | | 231 | dollar car rental | | 267 | alamo car rental | | 329 | thrifty car rental | | 495 | budget car rental | | 523 | enterprise car rental | | 960 | national car rental | | 1750 | car rental | +------+-----------------------+ 13 rows in set (0.00 sec)
Not bad.
Of course, it's not perfect. There are three issues.
- MySQL has a slightly different notion of what a "word" is than my code. But I can account for that my doing a sanity check on the records that come back.
- MySQL doesn't index small words (length 3 or less) by default. I haven't addressed that yet. I can either rebuild MySQL to also index smaller words, or handle it in a different way. I'll worry about it on Wednesday.
- The original record ("car rental") appears in the results. So I have to filter it out. No big deal.
All in all, this is a lot easier and faster that having to come up with my own solution.
Oh. I should point out that this data was destined to be stored in MySQL anyway, so it's not like I have an unusual dependency on MySQL just to solve this problem.
Go forth and make good use of MySQL's full-text search engine.
Posted by jzawodn at March 10, 2003 04:31 PM
with 4.0, you don't need to recompile to change ft_min_word_len. just change it in my.cnf, restart, and force the table to get reindexed. ('alter table foo' is enough, i believe.)
i wish changing how it defines words were as easy.
Oh, yeah. I completely forgot about that.
Heh. Rebuilding index now. :-)
As far as rebuilding the index, i don't think ALTER TABLE foo is enough. I *think* you need either ALTER ... TYPE=MyISAM or just REPAIR TABLE foo QUICK.
Anyway, yes, it is pretty cool Jeremy. :-) It was nice to see an entry about it.
I've decided to use full-text search for the forum system I'm designing. I've recently been testing it with 3 million test posts and discovered some things I wouldn't have known -- that I don't want to use IN BOOLEAN MODE in most cases because of its lack of automagic relevance sorting in particular. So I've had to come up with some workarounds for that and other "issues" like you posted.
If it might be of use, I've been having a discussion about it all here, actually: www.sitepointforums.com/showthread.php?threadid=69555
From about post #12 or so is where the recent testing stuff starts. Enjoy. ;-)
3.23's full-text index creation time is its biggest problem, thanks to its "Repair with keycache" method as opposed to 4's "Repair by sorting." 3.23 full-text creation is *painfully* slow once you get to about 100k posts. :-(
BTW, Jeremy, you wouldn't happen to have any info on *when* 4.0 will be delared "stable," so most will upgrade to it, would you? :-) I hope it's soon!
Finally, I look forward to more blog entries about full-text search if you discover more info/tips/etc.
I've got lots of message posts to index (about 300M at the moment) and mysql really performed poorly for me, well, under 3.23 when I last tried. I know there's been a lot of improvements, but I'm still unconvinced (will have to try it again). I ended up with what is definitely a more labor-intensive (cuz I had to code a 40 lines :-) solution, but I like it, and it is really fast (if you've got the RAM). I use mnogosearch (google it) and idnex the DBs directly, (not going through the web FE, and yes I realize jwz is trying to do something completely different), and use their CRC+table hashing mode... which seems really quite scalable. (I suspect moreso than mysql). You do have to preprocess the search terms to generate the CRC of each, and doing boolean is via good ol' algebraic logic in the SQL, not via + and -, but it's working well for me.
I don't know what other good solutions there are; I'll have to give the FT indexing in mysql another look, I guess, though.
I should clarify I mean 300 MEG of message posts... (I just checked and it's really more like 400meg). About 600,000 actual posts. (0.5k each)
By the way, the mnogosearch indexes (which are actually mysql tables) are a little under 500meg.
OK, 300MB Ben. I thought you meant 300 million posts. :-) MySQL 4.0 shouldn't have any problem with that.
I'll also have to see how this "mnogosearch" works. Thanks.
My 3 million post (645 bytes avg length) test table in 4.0 is 3.36GB (1.86GB data/1.5 index file). :-)
About 3.23's poor performance, at least when creating the full-text index, it took a 250,000 post table (336MB total) 7-9 hours to index on my system. Ouch! Compare that to the same 250,000 taking less than 4 mins. in 4.0.
Once the index is created, 3.23 seems to perform "OK" when searching the 250,000.
OK, 300MB Ben. I thought you meant 300 million posts. :-) MySQL 4.0 shouldn't have any problem with that.
I'll also have to see how this "mnogosearch" works. Thanks.
My 3 million post (645 bytes avg length) test table in 4.0 is 3.36GB (1.86GB data/1.5 index file). :-)
About 3.23's poor performance, at least when creating the full-text index, it took a 250,000 post table (336MB total) 7-9 hours to index on my system. Ouch! Compare that to the same 250,000 taking less than 4 mins. in 4.0.
Once the index is created, 3.23 seems to perform "OK" when searching the 250,000.
Great. Once more an interesting article about mysql. The one about logging logfiles to the DB has been helpful, but this may be of even more value for my future developments. Thanks :)
Good article. I like mysql's text search capabilities, but they are different than the alternatives from other db's.
I'd like a "CONTAINS" sql keyword rather than "MATCH .. AGAINST". My reasoning is the logical process of searching if a field contains your word list makes more sense than seeing if an index matches your word list.
I guess I'm obviously missing something, but what's wrong with
SELECT secret_string, secret_num FROM stuff WHERE secret_string like '%car rental%'
?
Heh, this entry is getting a lot of traffic from the MySQL site.
Robo (SPF Robo? :-)), because LIKE '%searches%' are *extremely* slow with tons of text. It would take a few *minutes* for a LIKE query to run on my 3 million posts, compared to 10-12 *seconds* tops with the full-text index.
Also, forgot a couple things in my previous comments:
1) What's with the "order by freq asc" in Jeremy's query? ;-)
2) ft_min_word_len *really* should be lowered to 3 by default.
Yeah, SPF's Robo :-)
And couldn't you just add an index to the secret_string column? That makes sense if the column's gonna be use in the WHERE clause, that would definitely speed up the few minutes runtime.
Since the query is simpler than the fulltext search, it should be faster as well?
Hi then. :-D
Nah, you couldn't add a regular index to a larger TEXT type column. Even if you could, it would take about as much extra space as a full-text index and would be pretty useless. Remember, MySQL can't use the index for LIKE with a % wildcard prefix.
The only way it could be slightly faster is if you weren't selecting any other columns, as MySQL could just scan that column in the index file (instead of the data file). And even that wouldn't be used if you had only indexed a prefix of the column, which you have to with TEXT columns, because not all of the text is in the index as far as MySQL knows. :-)
I love MySQL's full text search, but with tables in the 10-100GB range it gets a bit slow. Looking forward to version 4.1.x that will hopefully have support for two-level indexes, normalizing MySQL's fulltext index structure, and providing higher scalability.
There is a relatively simple method that will work with almost any database that supports joins.
You have to generate a list of words from your messages, say you store the message id and the word in a table. You also store a pointer to the previous word in the message, or NULL if it is the first word in the message.
create table search (
word_id integer not null,
message_id integer not null,
prev_word_id integer,
word varchar(100),
primay key word_id;
);
create index ind_search_word on search(word);
The query to find "car rental" now becomes:
select s1.message_id
from search s1, search s2
where s1.word = "car" and
s2.word = "rental" and s2.prev_word_id = s1.word_id
This query can be expanded as there are more words in your search string. It is obviously limited by the number of joins you can perform, but a search string of more than 5 or 6 words propably doesn't occur very often.
The same table can also be used for OR and AND searches given a search string with one or more words.
I agree with Matt - ft_min_word_len *really* should be lowered to 3 by default. In my search engine, a search for 'fiat' amoung 2M records takes less then 1 second, but a search for 'BMW' is completely impossible.
This would need to be implemented with 'LIKE %' and would take far too long.
To go a step further, an option to index 3+ character word parts, although it would be a massive index, would be immensely useful, and I assume would be a simple modification to FULLTEXT.
If you want I can give you a UDF that will further boundry MySQL fulltext searches. This is how we make use of it for Slashdot.
Hey Brian, that sounds interesting. What does it do exactly? Any page online with information or code or something?
I'm running MySQL 4.0.12 and trying to address the 4 char word limit with a:
set ft_min_word_len=3
and I keep getting:
Unknown system variable 'ft_min_word_len'
I've seen a bunch of references to this on Google but no clear answer. Any thoughts?
Thanks in advance.
Scott, you can't set any ft_* variables with SET that I'm aware of (*wish* you could :-)). Have to set them in the config file -- my.cnf or my.ini.
wow .... it's new for me. And this solve my problem. SQL Server can't work that fast
Hey Jeremy,
I'm probably not Blog savvy enough to find your email address (and I hope what I am doing is not against the blettiquette). I've been looking for a way to use more then 2GB of RAM for mysql under Linux. Do you (or anybody else) have any pointers / solutions for me?
Thanks so much!
(Wouldn't want to let those 16GB go to waste, would we?)
I'm stupid, I admit it ... I got your email address *sigh*. I wish I could delete my shameful previous entry :-/
On the issue of 3 character - just an idea.
I assume that the idea is to not index a lot of common words (like 'a' the' 'I' 'lot' 'not' 'to', 'etc')
Maybe mysql could have a list of words that are not to be indexed, and then index everything else. This would make building an index a slower than the current scheme, but output would run at the same speed.
Just my 2c
steveoc, MySQL already has a built-in stopword list so it doesn't index those common, meaningless words (which also includes longer ones like should, would, there...). :-)
Shortening the ft_min_word_len to 3, however, would let useful things such as DVD, PHP, etc. be indexed.
How about Asia language support which don't have seperate to devide words.
Is there any bi-gram based word segment solution provided?
Che, Dong
http://www.chedong.com/
This article is currently the number 2 at the top 10 post list, and it seems to be a worthy representative. But as seen with the number 1 candidate it would be nice to have a more subtle algorithm in order to get favorable rankings. Wouldn't it be nice to implement an aging-factor or something comparable?
One way to quickly have your new ft_min_word_len implemented in an old fulltext index is by running a query like: REPAIR TABLE yourtablename USE_FRM;
Brian are you still around? Back in March you stated "If you want I can give you a UDF that will further boundry MySQL fulltext searches. This is how we make use of it for Slashdot."
I'd really like to take a peak at that. Would you be willing to make it public?
Thanks much in advance,
Tony
this web page sucks u need one with all the titles of rocks on it ok??
Hi
What is the algorithm of ranking in Match() function
I only Know that it is basis on length of row
I need more details about ranking
Thanks
bye
Does anyone know how MATCH AGAINST on a fulltext index might compare to a REGEXP for a single term match? Frequency doesn't matter, I'm just looking for matches. REGEXP is quick enough, but its something I'd cron more often if it was quicker. I'd test the fulltext myself, but the table is getting a bit flaky...needs to be OPTIMIZEd like twice a day. I don't want to stress it by adding and possibly dropping an index until I've backed it up. ;)
I am kind of new at playing with MySQL settings..
I created a .my.cnf file in my account using vi and added ft_min_word_len = 3;
then used the Repair table command to rebuild the indexes.
and I am still at 4 char settings.
I am running Plesk and I put the .my.cnf in the web directory for the client. is that the correct location and is the one line enough info or do I need other info as well.
Your site is very interesting and good !!!
This is my site interesting too!
http://fastinternet-foryou.atspace.com
http://t1-foryou.atspace.com
Old post I know but this comes up in web searches better than the link below. If you are having problems changing:
ft_min_word_len
http://bugs.mysql.com/bug.php?id=845
You cannot change this information dynamically, it must be done in your my.conf or my.ini file and then restarted.
How would I make a search through FULL TEXT SEARCH that gives the rows that matches the initial characters of a String Example Search word "Tes" The rows matching the string "Testing" "Test" should be given but not the matching words like "hottest"
I am using DBSight, which uses the super fast Lucene as the search engine. It took usually under 0.5 seconds for my 1G data.
There are many ways to configure it. And there is even an option to choose to put all your index in memory, which is the google way to get it fast.
Indexing is slower, about 24 hours for initial full indexing because it's using jdbc to retrieve the content. But you can do incremental indexing, which is fairly good.
Hi there I hope u can help me and solve my problem,
my problem that i am using i a mysql and i made a fulltext index for a field type text and my problem is with arabic when i search in arabic ex: for a work like يضرب and i want to get سيضرب يضربون يضًرب يَضرب there are some characters in arabic which are for PARSER i want to index the table without it or search and ignor them if it were found
FOR KARIM:
There's is no way mysql can solve your problem, this is related to "Arabic Polymorphism" (do a google search)
You need to semantically analyse your query words and stem it before submitting to mysql's fulltext index
Good luck!
Affected rows:: 3701788
Time: 1118.308ms
Took Longer than i thought
When it comes to large tables mysql does not deliver. It is VERY slow. try using http://shodan.ru/projects/sphinx/ you will be amazed!
OK Does this mean that you really dont have to use packages like sphinxsearch.com, dtsearch or openfts ? and if i do can anyone reccomend which to use ? We have a server hosting customer - a daily newspaper with problems due to sphinx.
i just want to know whether it is useful to alter the 50% threshold scheme that mysql uses by default ,is it beneficial in any way regarding serach(either by increasing or decreasing its value)or is it good to use default value of 50% only.
Ive shortened the min_word_length, but never bothered with the threshold. Ive just not experienced a case where it was an issue. *shrug*
its nice but i need some what related to this
for example if i've a string like
"this is my text"
and if i use match()and against( )
in search if i give only "ex" then it should display the complete string. is it possible if it is please reply....
Shouldn't really use ORDER BY on full text searches as it slows down your query - but definately not as mych as using LIKE :)
...to add to the above:
When people are shortening their ft min owrd length they are making their queries much more ineffecient. MySQL has to re-build all the indexes to include smaller words. They set it at 4, as default, for a reason imo :)
I found the full text search to be very useful!
But how do I fetch data when I have several parenthesis in the query, for example: 'book(fiction)part(2)'
and I expect records like: englishbook(fiction)part(2) or englishbook(fiction)part(2)vol3
When I perform search I get lots of unwanted hits like: book, part. So how do I escape them?
My text field contains HTML tags. It is entered from a WYSIWYG editor. Is there a way to escape the HTML tags from the search?
I decided to try Lucene after six long years of putting up with MySQL's fulltext search and I can say with confidence that MySQL FT sucks speed-wise compared to Lucene when searching a dataset of 8M+ rows.
I can index 1.5G/8M rows of data in about 7 minutes by piping standard output to a dedicated indexer written in Java using the Lucene API. MySQL itself takes longer to rebuild the index. Regular searches for smallish datasets are usually returned in 1-10ms. Lucene takes ~0.5 seconds to return the top, say, 10k row hits, sorted, of a search that matches every single row in the document set. MySQL would take over 10 seconds to return sorted rows from a search matching just 1M rows. In normal cases it's just slow, in extreme cases it's unusable in a user interface.
While the Lucene API isn't the sexiest thing around, you do get enough flexibility to do some interesting things, like make specific optimizations that pertain to your data.
Additionally I use Lucene to search for documents by simple attributes rather than written words, it works splendidly for this purpose, still faster than using a table with multiple indexed columns in MySQL.
---
This post is intended for anyone about to come away from this article thinking that MySQL's FT search is pretty cool and useful. Maybe it will be useful for you, but if you find out that it sucks, just remember there are other search engines out there. And that you're not alone. And that you don't have to be afraid anymore. So take it from someone who hates Java yet still uses Lucene, it was worth the effort setting it up.
Awesome. Using this technique here: http://searchengenie.co.uk mixed in with alittle of the technique shown here: http://www.quietearth.us/articles/2006/10/18/Advanced-mysql-search-in-php