Tsearch with PostgreSQL - MichaelHinds.com

Using tsearch with PostgreSQL

These are the steps I took to install tsearch for PostgreSQL on my White Box Linux server. Unlike grown-up databases like Oracle, PostgreSQL doesn't come with any built-in text searching. Well, not in an efficient indexed way, which is where tsearch comes in. Tsearch provides functions to allow you to index and search text in your database.

These days you should probably use tsearch2 over tsearch. I went with the moldier version because it is the version compatible with the version of PostgreSQL that came bundled in my White Box distro.

Credit: Most of these notes are based on this introduction to tsearch

Installation

Tsearch is part of the contrib package. For me that's rh-postgresql-contrib-7.3.6-1.i386.rpm. Install this as root with the command:

rpm -Uvh rh-postgresql-contrib-7.3.6-1.i386.rpm

This will install various bits and bobs to /usr/share/pgsql/contrib.

As your PostgreSQL user do the following to install tsearch to into your database:

cd /usr/share/pgsql/contrib/ psql DATABASE < tsearch.sql

Work in progress.

Creating indexes

The next stage is to add a text index to an existing table. In this example we already have a table defined as follows:

CREATE TABLE tblMessages ( intIndex int4, strTopic varchar(100), strMessage text, );

The next stage is to create a special text index which we will use for FTI, so we can search our table of messages for words or a phrase. We do this using the SQL command:

ALTER TABLE tblMessages ADD idxFTI txtidx;

Note that unlike traditional indexes, this is actually a new field in the same table, which is then used (through the magic of the tsearch operators and functions) by a special index we will create in a moment.

The next thing to do is to populate the new field with indexed data, which we do as follows:

UPDATE tblTitles SET idxFTI=txt2txtidx(strMessage);

Note that this only indexes the field strMessage, so if you want to also add strTopic to the same index, you should instead do the following, which effectively concates the two fields into one field before being fed into the indexing system:

UPDATE tblTitles SET idxFTI=txt2txtidx(strTopic || ' ' || strMessage);

Two final things you need to do, are create a special index on your newly created field, which you do as follows:

CREATE INDEX idxFTI_idx ON tblMessages USING gist(idxFTI);

And to set up a trigger so every time a row in this table is changed, the text index is automatically updated. This is easily done using:

CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strMessage);

Or if you are indexing both strMessage and strTopic you should instead do:

CREATE TRIGGER txtidxupdate BEFORE UPDATE OR INSERT ON tblMessages FOR EACH ROW EXECUTE PROCEDURE tsearch(idxFTI, strTopic,strMessage);

Before you ask, the tsearch function accepts multiple fields as arguments so there is no need to concate the two into one like we did before.

Querying a table

So you have your indexed data, how do you query it? Well the README.tsearch has some very detailed descriptions of everything you can do, but here are some examples to get you going.

If you wanted to search the above example table for all entries which contain the word "minority", you could use the query:

SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 'minority'

Note that searching this way isn't case sensitive, so using "MiNoRiTy" in place of the above will return the same results. A good thing for those wanting to use these as search engines on websites!

Now suppose you want to search for more than one word, say "minority" and "report", then you would use the following query instead:

SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 'minority&report'

Note that spaces are a bad thing in these sorts of queries, so you better get your regexps out and filter these away. Also note that the & has special meaning here too, so you can't search for anything with "&" in it either.

The last example here relates to searching for a phrase, for example "minority report". This poses a problem with regard to tsearch, as it doesn't index phrases, only words. But there is a way around which doesn't appear to have a significant impact on query time, and that is to use a query such as the following:

SELECT intIndex, strTopic, strMessage FROM tblMessages WHERE idxFTI ## 'minority&report' AND strMessage ILIKE '%minority report%'

Of course if your indexing both strTopic and strMessage, and want to search for this phrase on both, then you will have to get out the brackets and extend this query a little more.

More info

For more in-depth coverage, see the original document. This will tell you how to change which words are indexed (because some common ones aren't) and how to backup the database (which isn't as easy as it is without tsearch).

For the most detailed descriptions see the tsearch documentation itself.