Get search done with PostgreSQL

A few words of wisdom…

A search feature is now a part of almost every website regardless of its size or purpose.

If you’re a Django developer, then you’re lucky to have django-haystack at your disposal to hook up any of its supported search engines, such as Elasticsearch or Solr. There’s also a library called ElasticUtils for working exclusively with the Elasticsearch engine (if you want more control over your search setup).

But what if you only need to plug in simple search functionality to your awesome new blogging website?

Well, if you’re using Django together with PostgreSQL as your database engine (which you should!), you can actually use some of PostgreSQL’s built-in features to get search running quickly.

I’m about to show you how to set it up in a few simple steps. Also, I will add a quick tip on how to add autocomplete super-powers.

Setup

First things first, you need a small 3rd party app called djorm-ext-pgfulltext:

pip install djorm-ext-pgfulltext

Make sure the unaccent extension is installed in your database and marked IMMUTABLE (so it can be used in indexes):

CREATE EXTENSION unaccent;
ALTER FUNCTION unaccent(text) IMMUTABLE;
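
If you’d rather keep this in version control than run it by hand, the statements can live in a data migration. Here’s a minimal sketch, assuming Django 1.7+ migrations (the app label and dependency names below are placeholders, not from this project):

```python
# Hypothetical migration file, e.g. search/migrations/0002_unaccent.py
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('search', '0001_initial'),  # placeholder: your app's previous migration
    ]

    operations = [
        # Install the extension and mark it immutable, with a reverse
        # operation so the migration can be unapplied cleanly.
        migrations.RunSQL(
            "CREATE EXTENSION IF NOT EXISTS unaccent;"
            " ALTER FUNCTION unaccent(text) IMMUTABLE;",
            reverse_sql="DROP EXTENSION IF EXISTS unaccent;",
        ),
    ]
```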

Next, let’s assume you have a simple blog post model and you want to be able to search by its content field:

from django.db import models
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField

class BlogPost(models.Model):
    title = models.CharField(max_length=255, default="")
    content = models.TextField(default="")
    search_index = VectorField()
    objects = SearchManager(
        fields=('content',),
        config='pg_catalog.english',
        search_field='search_index',
        auto_update_search_field=True,
    )

This setup will automatically update the index field when the object is saved.

If you already have existing data, run ./manage.py update_search_field [appname] to rebuild the index field for it.

…and we’re done with the setup part.

Search

Now let’s see how we go about searching through the content.

For testing purposes I’ve added ~50k objects with some random text data.
Let’s say we want to search for either text1 or text2 keywords. Here’s how we do it:

word_list = ['text1', 'text2']
query = " | ".join(word_list)

qs = BlogPost.objects.search(query, rank_field="rank", raw=True)[:15]

The library includes a custom manager method called search, which takes as arguments:

  • the query string
  • a rank field to be included in the results (so the search results will be ordered by highest rank value)
  • a raw option indicating that the query string should be passed to the to_tsquery function

Basically to_tsquery is the more powerful option, since it lets us write our own query expressions instead of having the input parsed for us. You can read more about the to_tsquery function and ranking here.
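
To make the raw syntax concrete, here’s a small sketch (the helper names are mine, not part of the library) of how to_tsquery expressions compose: | means OR, & means AND, and these strings are exactly what gets handed to PostgreSQL when raw=True:

```python
# Sketch: building raw to_tsquery expressions from keyword lists.

def any_of(words):
    # match documents containing at least one of the words
    return " | ".join(words)

def all_of(words):
    # match documents containing every one of the words
    return " & ".join(words)

print(any_of(['text1', 'text2']))    # text1 | text2
print(all_of(['django', 'search']))  # django & search
```

You would then call e.g. BlogPost.objects.search(any_of(word_list), rank_field="rank", raw=True) just like in the example above.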

While testing on 50k rows, I got query times of ~130ms, which is not great by search standards, and certainly too slow for a responsive search widget.

So how can we make it faster?

When looking at the database indexes we notice the search_index field has a btree index type:

"search_blogpost_a71a185f" btree (search_index)

which is not much help here: a btree index can’t answer the @@ containment queries that full-text search runs against a tsvector column.

If you look through the documentation, you’ll find out there are actually better index types that we can use, namely GiST and GIN.
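
For reference, you’d pick one of the two and add it with a plain CREATE INDEX (the table name below matches the example model’s default search_blogpost table; adjust it to your schema):

```sql
-- GiST: lossy, but faster to build and update
CREATE INDEX gist_idx ON search_blogpost USING gist (search_index);

-- GIN: larger and slower to update, but faster to query
CREATE INDEX gin_idx ON search_blogpost USING gin (search_index);
```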

The GIN seemed to have the best results in my case. After adding it:

"gin_idx" gin (search_index)

…I got an almost 3x speedup on the same search (~53ms). Obviously a GIN index makes inserts and updates slower, but that’s a small price to pay. Actually, the real performance killer here is the additional rank field, since the ts_rank function needs to go over the tsvector of every returned document, thereby slowing down the query.

Nevertheless, we’ve got pretty decent speeds for such a large amount of data, and all by using PostgreSQL goodies!

But wait! There’s more…

There’s a chance you’ll need to add autocomplete to your search feature. Standard keyword searching won’t really cut it anymore. Here’s where the raw search queries come to the rescue:

q = 'search phrase'
word_list = [word + ':*' for word in tokenize(q)]
query = " | ".join(word_list)

qs = BlogPost.objects.search(query, rank_field="rank", raw=True)[:15]

The tokenize function is taken from this snippet, which also includes a brief explanation of what it does.
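
If you don’t want to pull in the snippet, a minimal hypothetical stand-in for tokenize (just lowercasing and splitting on non-word characters, with none of the snippet’s extra handling) could look like this:

```python
import re

def tokenize(text):
    # naive stand-in for the snippet's tokenizer:
    # lowercase the phrase and keep runs of word characters
    return re.findall(r"\w+", text.lower())

word_list = [word + ':*' for word in tokenize('Search phrase')]
print(" | ".join(word_list))  # search:* | phrase:*
```

Each token gets the :* prefix-match suffix, so typing "sear" already matches "search".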

So there ya go! Full search functionality just by using PostgreSQL power, all within a couple lines of code together with some database enhancements.