From Keyword to Context: Building Google-Fast Semantic Search with pgvector and Vertex AI Embeddings
Introduction
How do you build semantic search that handles 100k+ assets in under 50ms? Traditional keyword search falls short when users search for concepts like "vibrant summer colors" or "minimalist tech aesthetic" - queries that require understanding intent, not just matching text.
At CreativeOS, we combined PostgreSQL's pgvector extension with Google Vertex AI embeddings to create a production-ready search system that understands user intent. This article walks through our complete implementation, from architecture decisions to performance optimizations.
The Architecture
Our semantic search system has four key components:
- Firebase Data Connect - GraphQL layer over PostgreSQL
- PostgreSQL with pgvector - Vector similarity search at the database level
- Vertex AI text-embedding-004 - 768-dimensional embedding generation
- Cloud SQL Connection Pooling - Eliminates IAM authentication delays
Here's how a search request flows:
User Query → Firebase Function → Vertex AI Embedding → pgvector Search → Results
Setting Up the Connection Pool
The first critical piece is connection pooling. Firebase Functions are serverless, which means each cold start would normally authenticate with Cloud SQL using IAM - a process that adds 200-500ms latency. We solved this with a singleton connection pool:
import { Pool, PoolClient } from 'pg'
import { Connector, AuthTypes } from '@google-cloud/cloud-sql-connector'

let pool: Pool | null = null
let connector: Connector | null = null

export async function getPostgresPool(): Promise<Pool> {
  if (pool) {
    return pool // Reuse existing pool across requests
  }

  const projectId = process.env.GCLOUD_PROJECT
  const region = process.env.DATA_CONNECT_LOCATION
  const instanceId = process.env.DATA_CONNECT_INSTANCE
  const instanceConnectionName = `${projectId}:${region}:${instanceId}`
  const iamDbUser = process.env.PG_IAM_USER

  connector = new Connector()
  const clientOpts = await connector.getOptions({
    instanceConnectionName,
    authType: AuthTypes.IAM,
  })

  pool = new Pool({
    ...clientOpts,
    database: process.env.PGDATABASE,
    user: iamDbUser,
    max: 20, // Max 20 connections per container
    min: 2, // Keep 2 connections warm
    idleTimeoutMillis: 600000, // 10 minutes before closing idle connections
    connectionTimeoutMillis: 10000, // 10 seconds to establish a connection
  })

  return pool
}

export async function getPostgresClient(): Promise<PoolClient> {
  const pool = await getPostgresPool()
  return pool.connect()
}
Key decisions:
- Singleton pattern: The pool persists across function invocations in warm containers
- Warm connections: min: 2 keeps connections ready, eliminating auth delays
- IAM authentication: Uses Google Cloud IAM instead of passwords for security
Generating Embeddings with Vertex AI
Before we can search, we need to convert user queries into vector embeddings. We use Google's text-embedding-004 model, called through the Vertex AI prediction API, which produces 768-dimensional vectors:
import { helpers, v1 } from '@google-cloud/aiplatform'

async function getQueryEmbedding(queryText: string): Promise<number[]> {
  const project = process.env.GCLOUD_PROJECT
  const location = process.env.DATA_CONNECT_LOCATION

  // text-embedding-004 is served through the Vertex AI prediction endpoint
  const client = new v1.PredictionServiceClient({
    apiEndpoint: `${location}-aiplatform.googleapis.com`,
  })
  const endpoint = `projects/${project}/locations/${location}/publishers/google/models/text-embedding-004`

  // RETRIEVAL_QUERY marks this text as a search query (documents use RETRIEVAL_DOCUMENT)
  const instances = [helpers.toValue({ content: queryText, task_type: 'RETRIEVAL_QUERY' })]
  const [response] = await client.predict({ endpoint, instances })

  // The embedding comes back as a protobuf Value; unwrap it into number[]
  const prediction = response.predictions?.[0]
  const values =
    prediction?.structValue?.fields?.embeddings?.structValue?.fields?.values?.listValue?.values ?? []
  return values.map((v) => v.numberValue as number)
}
The embedding is a dense vector representation that captures semantic meaning. Similar concepts map to nearby points in the 768-dimensional space.
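To make "nearby points" concrete, here's a tiny illustration of the math the database does for us. This helper isn't part of the production path (pgvector computes the distance server-side); it's just a sketch of what cosine similarity measures:

// Cosine similarity between two embeddings: values near 1.0 mean the texts are
// semantically similar. pgvector's <=> operator returns the cosine *distance*
// (1 - similarity), so smaller values mean more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// e.g. compare a query against a stored asset embedding:
// const similarity = cosineSimilarity(await getQueryEmbedding('vibrant summer colors'), assetEmbedding)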
The Vector Search Query
Now for the core search logic. We use pgvector's cosine distance operator (<=>) to find the most similar embeddings:
import { onCall } from 'firebase-functions/v2/https'
// getPostgresClient and getQueryEmbedding are the pool and embedding helpers defined above

export const searchAssets = onCall({
  timeoutSeconds: 120,
  memory: '512MiB',
  minInstances: 1
}, async (request) => {
  const {
    semantic = false,
    queryText = '',
    limit = 50,
    offset = 0,
    assetTypes = [],
    aspectRatios = [],
    // ... other filters
  } = request.data

  if (!semantic || !queryText?.trim()) {
    throw new Error('Semantic search requires non-empty queryText')
  }

  // Get a client from the connection pool (reuses connections across requests)
  const pg = await getPostgresClient()

  try {
    // Generate the query embedding via Vertex AI
    const queryEmbedding = await getQueryEmbedding(queryText)
    const vectorLiteral = `[${queryEmbedding.map((v) => Number(v).toFixed(6)).join(',')}]`

    const whereClauses: string[] = []
    const values: any[] = []
    let paramIndex = 1

    // Only published assets
    whereClauses.push('ca.status = $' + paramIndex++)
    values.push('published')

    // CRITICAL: Filter out NULL embeddings before the vector search.
    // 16,143 rows (19%) have NULL embeddings - exclude them to avoid wasting computation
    whereClauses.push('ca.embedding IS NOT NULL')

    // Optional filters
    if (Array.isArray(assetTypes) && assetTypes.length > 0) {
      whereClauses.push('ca.asset_type = ANY($' + paramIndex++ + ')')
      values.push(assetTypes)
    }
    if (Array.isArray(aspectRatios) && aspectRatios.length > 0) {
      whereClauses.push('ca.aspect_ratio = ANY($' + paramIndex++ + ')')
      values.push(aspectRatios)
    }

    const whereClause = whereClauses.length > 0
      ? 'WHERE ' + whereClauses.join(' AND ')
      : ''

    // Vector similarity search using cosine distance
    const orderClause = `ORDER BY ca.embedding <=> '${vectorLiteral}'::vector ASC`

    const query = `
      SELECT ca.creative_id
      FROM creative_asset ca
      ${whereClause}
      ${orderClause}
      LIMIT $${paramIndex++} OFFSET $${paramIndex++}
    `
    values.push(limit, offset)

    const result = await pg.query(query, values)
    const ids: string[] = result.rows.map(r => r.creative_id)

    // Release the connection back to the pool
    pg.release()

    // Hydrate full asset details via Data Connect
    // (IDs first, then fetch full objects - more efficient)
    return { assetIds: ids }
  } catch (error) {
    pg.release()
    throw error
  }
})
Critical optimizations:
- Pre-filtering: We filter by status, asset types, and other metadata BEFORE the vector search. This dramatically reduces the search space.
- NULL embedding exclusion: We discovered 19% of our assets had NULL embeddings (legacy data). Filtering these out before vector computation saves significant CPU.
- Two-phase retrieval: Return IDs first from the vector search, then hydrate full asset details via GraphQL (sketched below). This is more efficient than joining all tables in the vector query.
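Here's a rough sketch of that hydration phase. In production it runs through the Data Connect GraphQL layer; the plain-SQL version below just shows the shape of the step, and any column beyond creative_id is an illustrative placeholder:

// Second phase: fetch full rows for the IDs returned by the vector search.
// Column names beyond creative_id are illustrative placeholders.
async function hydrateAssets(assetIds: string[]) {
  const pg = await getPostgresClient()
  try {
    const result = await pg.query(
      `SELECT ca.creative_id, ca.asset_type, ca.aspect_ratio, ca.status
         FROM creative_asset ca
        WHERE ca.creative_id = ANY($1)`,
      [assetIds]
    )

    // Preserve the similarity ordering from the vector search
    const byId = new Map(result.rows.map((row) => [row.creative_id, row]))
    return assetIds.map((id) => byId.get(id)).filter(Boolean)
  } finally {
    pg.release()
  }
}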
Database Indexing: HNSW
To achieve sub-50ms queries on 100k+ vectors, we use HNSW (Hierarchical Navigable Small World) indexing:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE INDEX IF NOT EXISTS creative_asset_embedding_hnsw_idx
  ON creative_asset
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
HNSW parameters:
- m = 16: Each node connects to up to 16 neighbors (higher = more accurate, slower writes)
- ef_construction = 64: Search width during index building (higher = better quality, slower builds)
- vector_cosine_ops: Use cosine distance for similarity, matching the <=> operator in our queries
A note on operator classes: pgvector only uses an index whose operator class matches the query's distance operator, so an index built with vector_l2_ops would not accelerate our <=> (cosine distance) queries. Building the index with vector_cosine_ops keeps the index and the query consistent.
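HNSW also has a query-time knob, hnsw.ef_search (pgvector's default is 40), which controls how many candidates are examined per query. Here's a minimal sketch of raising it for a single request using the pooled client from earlier; the value 100 is an illustrative starting point, not a tuned number:

// Raise HNSW search breadth for one query. SET LOCAL scopes the change to this
// transaction, so it does not leak to other requests sharing the pooled connection.
const pg = await getPostgresClient()
try {
  await pg.query('BEGIN')
  await pg.query('SET LOCAL hnsw.ef_search = 100') // higher = better recall, slower queries
  // ... run the vector similarity query from searchAssets here ...
  await pg.query('COMMIT')
} catch (error) {
  await pg.query('ROLLBACK')
  throw error
} finally {
  pg.release()
}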
Performance Results
After implementing these optimizations:
- Query latency: 30-50ms for 100k+ assets
- Cold start impact: Eliminated via connection pooling
- Index size: ~600MB for 100k 768-dimensional vectors
- Query throughput: Handles 100+ concurrent searches
Real-World Challenges & Solutions
Challenge 1: Legacy Data Without Embeddings
Problem: 19% of assets had NULL embeddings from before we implemented semantic search.
Solution:
whereClauses.push('ca.embedding IS NOT NULL')
This simple filter excludes NULL embeddings before vector computation, preventing wasted CPU cycles.
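The longer-term fix is to backfill the missing vectors. Here's a rough sketch of a batch backfill job; the description column used as the embedding input is an assumed, illustrative column name rather than our exact schema:

// Batch-backfill embeddings for assets that are missing them.
// `description` as the embedding source is an assumed column name for illustration.
async function backfillEmbeddings(batchSize = 100): Promise<number> {
  const pg = await getPostgresClient()
  try {
    const { rows } = await pg.query(
      `SELECT creative_id, description
         FROM creative_asset
        WHERE embedding IS NULL AND status = 'published'
        LIMIT $1`,
      [batchSize]
    )

    for (const row of rows) {
      // Reuses the Vertex AI helper from earlier; document embeddings may warrant
      // a different task_type than search queries
      const embedding = await getQueryEmbedding(row.description)
      await pg.query(
        'UPDATE creative_asset SET embedding = $1::vector WHERE creative_id = $2',
        [`[${embedding.join(',')}]`, row.creative_id]
      )
    }

    return rows.length // number of assets backfilled in this batch
  } finally {
    pg.release()
  }
}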
Challenge 2: Combining Semantic Search with Filters
Problem: Users want semantic search AND specific filters (brands, industries, aspect ratios).
Solution: Apply filters in the WHERE clause before vector similarity ordering:
// Filters applied BEFORE the vector search
if (assetTypes.length > 0) {
  whereClauses.push('ca.asset_type = ANY($' + paramIndex++ + ')')
  values.push(assetTypes)
}

// Vector similarity applied in ORDER BY
const orderClause = `ORDER BY ca.embedding <=> '${vectorLiteral}'::vector ASC`
This ensures we only compute similarity for assets matching the filters.
Challenge 3: Pagination
Problem: Users need paginated results for large result sets.
Solution: Use standard SQL LIMIT/OFFSET with the vector search:
LIMIT $${paramIndex++} OFFSET $${paramIndex++}
Since we're ordering by similarity, each page returns the next most similar results.
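For completeness, here's a sketch of how a web client might page through results by calling the function with the Firebase client SDK. The function name matches searchAssets above; the query text and page size are just examples:

import { getFunctions, httpsCallable } from 'firebase/functions'

// Client-side sketch: fetch the second page of the most similar results
const functions = getFunctions()
const searchAssets = httpsCallable(functions, 'searchAssets')

const page2 = await searchAssets({
  semantic: true,
  queryText: 'minimalist tech aesthetic',
  limit: 50,
  offset: 50, // skip the first page
})

const { assetIds } = page2.data as { assetIds: string[] }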
Best Practices
- Always filter before vector search: Pre-filtering reduces computation dramatically
- Use connection pooling: Essential for serverless functions hitting Cloud SQL
- Index your vectors: HNSW indexes are critical for sub-100ms queries
- Monitor NULL embeddings: Track and backfill missing embeddings for complete coverage (a quick coverage query is sketched after this list)
- Two-phase retrieval: Get IDs first, hydrate details second (more efficient)
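For the monitoring point above, a simple coverage query goes a long way. A minimal sketch using the pooled client:

// How many published assets still lack embeddings? Run periodically or expose as a metric.
const pg = await getPostgresClient()
try {
  const { rows } = await pg.query(`
    SELECT COUNT(*) FILTER (WHERE embedding IS NULL) AS missing,
           COUNT(*) AS total
    FROM creative_asset
    WHERE status = 'published'
  `)
  const missing = Number(rows[0].missing)
  const total = Number(rows[0].total)
  console.log(`Embedding coverage: ${total - missing}/${total} published assets`)
} finally {
  pg.release()
}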
Conclusion
Building production-ready semantic search requires more than just adding vectors to a database. Connection pooling, pre-filtering, proper indexing, and efficient query patterns all contribute to the final performance.
By combining pgvector's powerful vector search with Vertex AI embeddings and careful optimization, we achieved sub-50ms semantic search at scale. The system now handles complex queries like "modern minimalist design with blue accents" that keyword search could never match.
The code patterns we've shared here are production-tested and handling thousands of searches daily. You can adapt them to your own use case, whether you're building a design asset platform, e-commerce search, or any application requiring semantic understanding.