wiki:AdvancedTopics

Context Navigation

Version 4 (modified by 231035, 9 days ago) ( diff )
--

Advanced Topics

Vector Database and Property Recommendation System

A vector database is a database that stores data as mathematical vectors. A vector is a list of numbers that represents the meaning of some text, image, product or object. In our project, we use vectors to represent real estate properties.
Example: "Modern apartment in Skopje with parking, Wi-Fi, balcony and 4.7 rating" becomes:

[0.021, -0.112, 0.334, ..., 0.087] (python script that creates the embbedings generate_property_embeddings.py ) We need a vector database to build a recommendation system based on similarity.

We added an embedding column to the properties table:

ALTER TABLE properties
ADD COLUMN embedding vector(384);

This column stores the semantic meaning of each property. Additionaly we created a new table for this phase.

CREATE TABLE user_recommendation_profiles (
    user_id BIGINT PRIMARY KEY
        REFERENCES users(user_id)
        ON DELETE CASCADE
        ON UPDATE CASCADE,

    preference_embedding vector(384),
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

We created a database view:vw_property_embedding_text This view prepares one text description for every property. This text is then converted into an embedding. The view:

CREATE OR REPLACE VIEW vw_property_embedding_text AS
SELECT
    p.property_id,

    CONCAT(
        'Title: ', p.title, '. ',
        'Description: ', COALESCE(p.description, ''), '. ',
        'Listing type: ', lt.type_name, '. ',
        'City: ', a.city, '. ',
        'Country: ', c.country_name, '. ',
        'Base price: ', p.base_price, '. ',
        'Max guests: ', p.max_guests, '. ',
        'Property amenities: ', COALESCE(string_agg(DISTINCT pa_am.amenity_name, ', '), ''), '. ',
        'Room types: ', COALESCE(string_agg(DISTINCT rt.type_name, ', '), ''), '. ',
        'Room amenities: ', COALESCE(string_agg(DISTINCT ra_am.amenity_name, ', '), ''), '. ',
        'Average rating: ', COALESCE(ROUND(AVG(rv.rating)::numeric, 2), 0), '.'
    ) AS embedding_text

FROM properties p
JOIN listing_types lt
    ON lt.listing_type_id = p.listing_type_id
JOIN addresses a
    ON a.address_id = p.address_id
JOIN countries c
    ON c.country_id = a.country_id

LEFT JOIN property_amenities pa
    ON pa.property_id = p.property_id
LEFT JOIN amenities pa_am
    ON pa_am.amenity_id = pa.amenity_id
LEFT JOIN rooms r
    ON r.property_id = p.property_id
LEFT JOIN room_types rt
    ON rt.room_type_id = r.room_type_id
LEFT JOIN room_amenities ra
    ON ra.room_id = r.room_id
LEFT JOIN amenities ra_am
    ON ra_am.amenity_id = ra.amenity_id

LEFT JOIN reviews rv
    ON rv.property_id = p.property_id

GROUP BY
    p.property_id,
    p.title,
    p.description,
    p.base_price,
    p.max_guests,
    lt.type_name,
    a.city,
    c.country_name;

The user recommendation profile represents the user’s preferences. In our system, we do not ask the user to manually write preferences. Instead, we learn preferences from the properties they have already booked.
We create a function that recalculates the user preference embedding.

CREATE OR REPLACE FUNCTION refresh_user_recommendation_profile(
    p_user_id BIGINT
)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO user_recommendation_profiles (
        user_id,
        preference_embedding,
        updated_at
    )
    SELECT
        u.user_id,
        AVG(p.embedding),
        CURRENT_TIMESTAMP
    FROM users u
    JOIN guests g
        ON g.user_id = u.user_id
    JOIN bookings b
        ON b.guest_id = g.guest_id
    JOIN rooms r
        ON r.room_id = b.room_id
    JOIN properties p
        ON p.property_id = r.property_id
    WHERE u.user_id = p_user_id
      AND p.embedding IS NOT NULL
      AND b.booking_status IN ('CONFIRMED', 'COMPLETED')
    GROUP BY u.user_id
    ON CONFLICT (user_id)
    DO UPDATE SET
        preference_embedding = EXCLUDED.preference_embedding,
        updated_at = CURRENT_TIMESTAMP;
END;
$$;

We use:

<=>

This is the cosine distance operator from pgvector.
The recommendation query returns properties that are most similar to the user profile. Use:

SELECT *
FROM recommend_properties_for_user(1, 10);

This returns the top 10 recommended properties for user 1.

Index

To make vector search faster, we add an HNSW index:

CREATE INDEX IF NOT EXISTS idx_properties_embedding_hnsw
ON properties
USING hnsw (embedding vector_cosine_ops);

Attachments (1)

generate_property_embeddings.py (3.4 KB ) - added by 231035 12 days ago.

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text