| Version 3 (modified by , 10 hours ago) ( diff ) |
|---|
Full Text Search in VitaDiet
Introduciton
Full Text Search (FTS) is a technique for searching natural language text in a database. Unlike a simple LIKE search which matches exact character patterns, FTS understands language, it normalizes words to their root form, ignores common words like "the" and "a", and ranks results by relevance.
For example a LIKE search for '%chicken%' will:
- Scan every row in the table sequentially
- Only match the exact string "chicken"
- Return results in no particular order
- Become slower as the table grows
A FTS search for 'chicken' will:
- Use a GIN index to find matches instantly
- Match "chicken", "chickens", "chickened"
- Rank results by relevance
- Perform consistently regardless of table size
For VitaDiet, FTS is used to power the food search feature, allowing users to search 350,000+ foods quickly and accurately.
How PostgreSQL FTS Works
PostgreSQL FTS is built around two core data types:
tsvector - a preprocessed representation of a text document. It stores the normalized lexemes (root forms of words) of a text column. For example:
SELECT to_tsvector('english', 'Grilled Chicken Breast');
Result: 'breast':3 'chicken':2 'grill':1
- "Grilled" becomes "grill" (stemmed to root form)
- Numbers indicate the position of each word
- Stop words like "and", "the" are removed
tsquery - a processed search query. It converts user input into a format that can be matched against a tsvector:
SELECT plainto_tsquery('english', 'grilled chicken');
Result: 'grill' & 'chicken'
The @@ operator checks if a tsquery matches a tsvector:
SELECT to_tsvector('english', 'Grilled Chicken Breast') @@ plainto_tsquery('english', 'grilled chicken');
Result: true
Setup
The first step is adding a tsvector column to the Food table and populating it:
ALTER TABLE Food ADD COLUMN IF NOT EXISTS search_vector tsvector;
UPDATE Food
SET search_vector = to_tsvector('english', COALESCE(name, ));
CREATE INDEX idx_food_fts ON Food USING GIN(search_vector);
We use GIN (Generalized Inverted Index) because it is designed for composite values like tsvector where a single column contains multiple lexemes. GIN stores each lexeme separately and maps it to the rows it appears in, making it ideal for FTS. While the standard B-tree indexes work best for equality and range comparison.
Trigger - Keeping Search Data Up To Date
Since search_vector is a derived column based on name, it needs to be updated whenever a food's name changes. Without a trigger, the search index would become stale and return incorrect results.
CREATE FUNCTION update_food_search_vector()
RETURNS TRIGGER AS $$
BEGIN
NEW.search_vector := to_tsvector('english',COALESCE(NEW.name,));
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_food_search_vector
BEFORE INSERT OR UPDATE OF name ON Food
FOR EACH ROW
EXECUTE FUNCTION update_food_search_vector();
- Activates BEFORE INSERT or UPDATE so the search_vector is set before the row is written to disk
- Only fires on UPDATE OF name if other columns like calories_per_100g are updated the trigger is skipped, avoiding unnecessary recomputation
- COALESCE(NEW.name, ) handles null names gracefully by treating them as empty strings
- NEW refers to the row being inserted or updated
The Search Function — search_foods
CREATE OR REPLACE FUNCTION search_foods(
p_query VARCHAR, p_limit INT DEFAULT 20, p_offset INT DEFAULT 0
) RETURNS TABLE (
food_id BIGINT, food_name VARCHAR, highlighted TEXT, category VARCHAR, calories INT, rank_score REAL,
) AS $$ DECLARE
v_tsquery tsquery;
BEGIN
v_tsquery := plainto_tsquery('english',p_query);
RETURN QUERY SELECT
f.id, f.name, ts_headline(
'english', f.name,v_tsquery, 'StartSel=<b>, StopSel=</b>, MaxWords=10, MinWords=1'
) AS highlighted, c.name, f.calories_per_100g, ts_rank(f.search_vector, v_tsquery) AS rank_score,
FROM Food f LEFT JOIN Category c ON f.category_id = c.category_id WHERE f.search_vector @@ v_tsquery ORDER BY rank_score DESC LIMIT p_limit OFFSET p_offset;
END; $$ LANGUAGE plpgsql;
Parameters:
- p_query - the search term entered by the user e.g. 'chicken breast'
- p_limit - how many results to return (default 20) for pagination
- p_offset - how many results to skip (default 0) for pagination. For page 2 with 20 results per page, pass p_offset = 20
Key components:
- plainto_tsquery('english', p_query) - converts the user's raw input into a tsquery. It handles natural language input gracefully so users don't need to know any special syntax. For example 'chicken breast' becomes 'chicken' & 'breast'.
- f.search_vector @@ v_tsquery - the match operator. Filters only foods whose search vector matches the query. This is where the GIN index is used.
- ts_rank(f.search_vector, v_tsquery) - scores each result by relevance. Foods where the search terms appear more frequently get a higher score. Results are ordered by this score so the most relevant foods appear first.
- ts_headline(...) - highlights matching terms in the food name by wrapping them in HTML tags. For example searching 'chicken' on 'Grilled Chicken Breast' returns 'Grilled <b>Chicken</b> Breast'. This is used by the frontend to bold matching words in search results.
Conclusion
The FTS implementation in VitaDiet provides a fast, accurate and user-friendly food search experience. When a user types a search query like "chicken breast", plainto_tsquery first converts it into a normalized tsquery ('chicken' & 'breast'), stripping out stop words and stemming each word to its root form. This tsquery is then matched against the precomputed search_vector column on the Food table using the @@ operator, which uses the GIN index to find matching foods near-instantly without scanning the entire table. Each matching food is then scored by ts_rank based on how frequently and prominently the search terms appear in the food name, ensuring the most relevant results appear at the top. Finally ts_headline wraps the matching terms in HTML tags so the frontend can display them in bold, helping users immediately see why each result matched their search. The results are then paginated using LIMIT and OFFSET so only a manageable number of results are returned at a time.
The trigger trg_food_search_vector ensures this entire pipeline stays accurate by automatically recomputing the search_vector whenever a food's name is inserted or updated, so the GIN index never becomes stale.
Advantages
- GIN index makes search fast regardless of table size
- Stemming handles word variations automatically
- Relevance ranking returns the most useful results first
- Highlighting improves the user experience
- plainto_tsquery handles natural language input without requiring special syntax from users
Disadvantages
- search_vector column adds storage overhead to the Food table
- The trigger adds a small overhead to every INSERT and UPDATE on Food
- FTS is language-specific - the 'english' configuration may not handle non-English food names perfectly.
