Entry
How can I do simple text indexing on a blob column?
How can I search for words in a blob field?
How can I simulate a search engine and look for words in blob fields?
Jan 23rd, 2000 21:42
Nathan Wallace, Jan Dvorak
Even if you could index the whole of your TEXT field, it wouldn't help
in the LIKE '%keyword%' searches.
The solution is to keep the words in a separate table. Say, you keep
information about Things. Then you probably have done something like
create table Thing (
id integer unsigned not null auto_increment primary key,
description text
);
If you want to have a fulltext index of description, you should add two
other tables:
create table Word (
id integer unsigned not null auto_increment primary key,
word varchar(100) not null unique
);
create table Word_Occurrence_in_Thing_Description (
word_id integer unsigned not null,
thing_id integer unsigned not null,
primary key ( word_id, thing_id )
);
Now, if you want to search the Thing descriptions for a word, you'll say
select
WO.thing_id
from
Word W,
Word_Occurrence_in_Thing_Description WO
where
W.word = 'searchword%'
and
WO.word_id = W.id;
This query will return you the list of Thing id's that contain the
'searchword' in their description. This query is as optimized as it can
be.
Admittedly, you'll have to do more work when inserting a new Thing. You
start with the standard
insert into Thing ( description )
values ( 'new_thing_description' );
You store the last_insert_id() in a client variable, $new_thing_id, say.
Now you take the same description you fed to the previous query and
split it into a list of words. Then, for each word (let's call it
$new_word) at a time, you do
insert ignore into Word ( word ) values ( lower( $new_word ) );
insert ignore into
Word_Occurrence_in_Thing_Description ( word_id, thing_id )
select id, $new_thing_id
from Word where word = lower( $new_word );
When the whole list is processed, you're done.
When you delete a Thing, you should also delete all the
Word_Occurrence_in_Thing_Description records that have thing_id equal to
the id of the Thing being deleted.
When updating the Thing description, you'll be best served by deleting
the old word occurrences and inserting new ones.
You don't have to worry about Words that are not referenced by any
Word_Occurrence_in_Thing_Description. They just sit there. However, if
you feel a very strong urge to get rid of them (and have nothing else to
do), you can safely delete them. There was a thread nearby on this list
that says how to do it.
That's the way to handle fulltext in a relational database.