MySQL Full-Text Searches

Today I am presenting you with a third and the last in tutorial series about “rarely used MySQL features”.

In first and second tutorials we spoke about transactions and referential integrity – features that are available in InnoDB and BDB tables only. Today we are going to deal with “Full-Text Search”, which is unlike features above available in MyISAM tables only (which is MySQL default engine).

So, let say you’ve just created one database driven site (forum, blog, cms, whatever) and you need to make it searchable. That shouldn’t be too complicated, so withot thinking much, you would probably come to easy solution like:

select * from foo_table where foo_text_field like '%search_text%'

Maybe include more fields in search, let say some dynamically generated from php form, and that should be it, right ? Hmmm not exactly.

Method we used above have one huge disadvantage – it is not able to use indexes! Because the fact that search pattern starts with joker ‘%’, query can’t use indexes even though you’ve defined them in mysql table fields used in search. No indexes means slow lookup (becasue search query have to pass trough all records in your tables). Fortunately, there is solution to this problem and it’s name is Full-Text Search.

Full-Text Search allow you efficient search trough text fields (CHAR, VARCHAR, TEXT), by using so called Full-Text Indexes.

There are three types of full-text searches:
Natural language search
Search pattern is splitted into words, so as a result you got only those records containings those words.

Bolean mode seach
Search pattern is splitted into words, but for each word you can add logic operator which allow you to dig only those records which contain or not contain some word.

Expansion Search
This is search in two phases. First one is identical to natural language search. In second phase words from the most relevant rows returned by the search are added to the search string and the search is done again.

During search, those words that are present in more than 50% of records are ignored. Also ignored are those so called “stopwords” (like “the”, “and”, etc.), as well as words shorter than 4 chars. You can find a list with all stopwords in corresponding section of MySQL Manual, and bear in mind that you can change those by changing path to file with stopwords. You can find a way how to do that here.

Now, lets see few examples. In order to easily illustrate full text usage, we are going to create database with “famous wrong predictions“.

So, lets copy/paste text bellow and save it to your computer in a file wrong_prediction.txt.

Albert Einstein, 1974|There is not the slightest indication that nuclear energy will ever be obtainable. It would mean that the atom would have to be shattered at will
Margaret Thatcher, 1974|It will be years - not in my time - before a woman will become a Prime Minister.
Alexander Graham Bell, c.1880.|One day there will be a telephone in every major city in the USA.
Popular Mechanics, 1949|Computers in the future may weigh no more than 1.5 tons.
Thomas Watson, IBM Computers,1943|I think there is a world market for as many as 5 computers.
Marshal Foch, France, 1912|Aircraft are interesting toys, but of no military value.

Now, lets create proper table, import predictions, and add indexes. Start MySQL command line client (you have to star it from the same dir where you saved your wrong predictions file), and type (or copy/paste) this:

create table wrong_predictions(
who varchar(100) not null,
prediction text not null
) engine = MyISAM;

load data local infile 'wrong_predictions.txt' into table wrong_predictions
fields terminated by '|' lines terminated by 'rn';

ALTER TABLE wrong_predictions
ADD FULLTEXT (who),
ADD FULLTEXT (prediction),
ADD FULLTEXT (who, prediction);

As you can see, first we’ve imported data into table, then we’ve created indexes. The reason why we didn’t create indexes during table creation is perfomance – insert is always slower if table have indexes created. Of course, you are not going to see any difference here, but pay attention to that when you import some million record file :)

Now, let’s see how Natural Search works:

mysql> SELECT * FROM wrong_predictions WHERE MATCH(who) AGAINST('Thatcher');
+-------------------------+---------------------------------------------------------------------------------+
| who                     | prediction                                                                      |
+-------------------------+---------------------------------------------------------------------------------+
| Margaret Thatcher, 1974 | It will be years - not in my time - before a womanill become a Prime Minister.  |
+-------------------------+---------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql>  SELECT * FROM wrong_predictions WHERE MATCH(who,prediction) AGAINST('computers');
+-------------------------+-------------------------------------------------------------+
| who                     | prediction                                                  |
+-------------------------+-------------------------------------------------------------+
| Thomas Watson, IBM,1943 | I think there is a world market for as many as 5 computers. |
| Popular Mechanics, 1949 | Computers in the future may weigh no more than 1.5 tons.    |
+-------------------------+-------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql>

When using MATCH in where clause, order of columns in output is defined with relevance of matched records. We can see that if we issue this query:

mysql> SELECT prediction, MATCH(prediction) AGAINST('computers') AS relevance FROM wrong_predictions;
+---------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
| prediction                                                                                                                                        | relevance        |
+---------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
| There is not the slightest indication that nuclear energy will ever be obtainable. It would mean that the atom would have to be shattered at will | 0                |
| It will be years - not in my time - before a woman will become a Prime Minister.                                                                  | 0                |
| One day there will be a telephone in every major city in the USA.                                                                                 | 0                |
| Computers in the future may weigh no more than 1.5 tons.                                                                                          | 0.66266459031789 |
| I think there is a world market for as many as 5 computers.                                                                                       | 0.67003110026735 |
| Aircraft are interesting toys, but of no military value.                                                                                          | 0                |
+---------------------------------------------------------------------------------------------------------------------------------------------------+------------------+
6 rows in set (0.00 sec)

Boolean Mode allow us to make seach pattern in a simmilar way as in some search engines on Internet, since you can add words that must be presented (+foo) as well as words that must not be there (-foo).

mysql> SELECT * FROM wrong_predictions WHERE MATCH(who,prediction)
-> AGAINST('+computers -popular' IN BOOLEAN MODE);
+-------------------------+-------------------------------------------------------------+
| who                     | prediction                                                  |
+-------------------------+-------------------------------------------------------------+
| Thomas Watson, IBM,1943 | I think there is a world market for as many as 5 computers. |
+-------------------------+-------------------------------------------------------------+
1 row in set (0.00 sec)

You can also use jocker like:
pre>mysql> SELECT * FROM wrong_predictions WHERE MATCH(who,prediction)
-> AGAINST(‘comp*’ IN BOOLEAN MODE);
+————————-+————————————————————-+
| who | prediction |
+————————-+————————————————————-+
| Popular Mechanics, 1949 | Computers in the future may weigh no more than 1.5 tons. |
| Thomas Watson, IBM,1943 | I think there is a world market for as many as 5 computers. |
+————————-+————————————————————-+
2 rows in set (0.00 sec)

Full description of boolean mode is available in coresponding section of MySQL Manual.

And last but not least, let’s see how Expansion Search works:

mysql> SELECT * FROM wrong_predictions WHERE MATCH(who,prediction)
-> AGAINST('market');
+-------------------------+-------------------------------------------------------------+
| who                     | prediction                                                  |
+-------------------------+-------------------------------------------------------------+
| Thomas Watson, IBM,1943 | I think there is a world market for as many as 5 computers. |
+-------------------------+-------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM wrong_predictions WHERE MATCH(who,prediction)
-> AGAINST('market' WITH QUERY EXPANSION);
+-------------------------+-------------------------------------------------------------+
| who                     | prediction                                                  |
+-------------------------+-------------------------------------------------------------+
| Thomas Watson, IBM,1943 | I think there is a world market for as many as 5 computers. |
| Popular Mechanics, 1949 | Computers in the future may weigh no more than 1.5 tons.    |
+-------------------------+-------------------------------------------------------------+
2 rows in set (0.00 sec)

In first query we’ve looked for word ‘market’ with natural seach, not really needed for query expansion, but usefull so we can track how exactly QUERY EXPANSION works. In second query we used QUERY EXPANSION which works in two phases. First it run natural search (like we did in first query), and then in 2nd phase it form new search phrase with original search phrase (market) concatenated with most highly relevant records we got from first phase. As a result we get more records than with natural search that are relevant with original search. More info about expansion search you find in MySQL Manual.

This is all for this introductory tutorial. Examples I used here serve only to “get your feet wet”, for full explanation of MySQL Full Text indexes you might want to consult MySQL Manual, or some book on subject (I recommend MySQL by Paul Dubois).

No comments yet. Be the first.

Leave a reply