From Keyword to Context: Building Google-Fast Semantic Search with pgvector and Vertex AI Embeddings
Introduction
How do you build semantic search that handles 100k+ assets in under 50ms? Traditional keyword search falls short when users search for concepts like "vibrant summer colors" or "minimalist tech aesthetic" - queries that require understanding intent, not just matching text.
At CreativeOS, we combined PostgreSQL's pgvector extension with Google Vertex AI embeddings to create a production-ready search system that understands user intent. This article walks through our complete implementation, from architecture decisions to performance optimizations.
The Architecture
Our semantic search system has four key components:
- Firebase Data Connect - GraphQL layer over PostgreSQL
- PostgreSQL with pgvector - Vector similarity search at the database level
- Vertex AI text-embedding-004 - 768-dimensional embedding generation
- Cloud SQL Connection Pooling - Eliminates IAM authentication delays
Here's how a search request flows:
User Query → Firebase Function → Vertex AI Embedding → pgvector Search → Results
Setting Up the Connection Pool
The first critical piece is connection pooling. Firebase Functions are serverless, which means each cold start would normally authenticate with Cloud SQL using IAM - a process that adds 200-500ms latency. We solved this with a singleton connection pool:
import { Pool, PoolClient } from 'pg'
import { Connector, AuthTypes } from '@google-cloud/cloud-sql-connector'

let pool: Pool | null = null
let connector: Connector | null = null

export async function getPostgresPool(): Promise<Pool> {
  if (pool) {
    return pool // Reuse existing pool across requests
  }

  const projectId = process.env.GCLOUD_PROJECT
  const region = process.env.DATA_CONNECT_LOCATION
  const instanceId = process.env.DATA_CONNECT_INSTANCE
  const instanceConnectionName = `${projectId}:${region}:${instanceId}`
  const iamDbUser = process.env.PG_IAM_USER

  connector = new Connector()
  const clientOpts = await connector.getOptions({
    instanceConnectionName,
    authType: AuthTypes.IAM,
  })

  pool = new Pool({
    ...clientOpts,
    database: process.env.PGDATABASE,
    user: iamDbUser,
    max: 20, // Max 20 connections per container
    min: 2, // Keep 2 connections warm
    idleTimeoutMillis: 600000, // 10 minutes before closing idle connections
    connectionTimeoutMillis: 10000, // 10 seconds to establish a connection
  })

  return pool
}

export async function getPostgresClient(): Promise<PoolClient> {
  const pool = await getPostgresPool()
  return pool.connect()
}
Key decisions:
- Singleton pattern: The pool persists across function invocations in warm containers
- Warm connections: min: 2 keeps connections ready, eliminating auth delays
- IAM authentication: Uses Google Cloud IAM instead of passwords for security
Generating Embeddings with Vertex AI
Before we can search, we need to convert user queries into vector embeddings. We use Google's text-embedding-004 model, called through the Vertex AI prediction API, which produces 768-dimensional vectors:
import { helpers, v1 } from '@google-cloud/aiplatform'

async function getQueryEmbedding(queryText: string): Promise<number[]> {
  const project = process.env.GCLOUD_PROJECT
  const location = process.env.DATA_CONNECT_LOCATION

  // text-embedding-004 is served through the Vertex AI prediction endpoint
  const client = new v1.PredictionServiceClient({
    apiEndpoint: `${location}-aiplatform.googleapis.com`,
  })
  const endpoint = `projects/${project}/locations/${location}/publishers/google/models/text-embedding-004`

  // RETRIEVAL_QUERY marks this text as a search query (documents use RETRIEVAL_DOCUMENT)
  const instances = [helpers.toValue({ content: queryText, task_type: 'RETRIEVAL_QUERY' })]
  const [response] = await client.predict({ endpoint, instances })

  // The embedding comes back as a protobuf Value; unwrap it into number[]
  const prediction = response.predictions?.[0]
  const values =
    prediction?.structValue?.fields?.embeddings?.structValue?.fields?.values?.listValue?.values ?? []
  return values.map((v) => v.numberValue as number)
}
The embedding is a dense vector representation that captures semantic meaning. Similar concepts map to nearby points in the 768-dimensional space.
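To make "nearby points" concrete, here's a tiny illustration of the math the database does for us. This helper isn't part of the production path (pgvector computes the distance server-side); it's just a sketch of what cosine similarity measures:

// Cosine similarity between two embeddings: values near 1.0 mean the texts are
// semantically similar. pgvector's <=> operator returns the cosine *distance*
// (1 - similarity), so smaller values mean more similar.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// e.g. compare a query against a stored asset embedding:
// const similarity = cosineSimilarity(await getQueryEmbedding('vibrant summer colors'), assetEmbedding)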
The Vector Search Query
Now for the core search logic. We use pgvector's cosine distance operator (<=>) to find the most similar embeddings:
import { onCall } from 'firebase-functions/v2/https'
// getPostgresClient and getQueryEmbedding are the pool and embedding helpers defined above

export const searchAssets = onCall({
  timeoutSeconds: 120,
  memory: '512MiB',
  minInstances: 1
}, async (request) => {
  const {
    semantic = false,
    queryText = '',
    limit = 50,
    offset = 0,
    assetTypes = [],
    aspectRatios = [],
    // ... other filters
  } = request.data

  if (!semantic || !queryText?.trim()) {
    throw new Error('Semantic search requires non-empty queryText')
  }

  // Get a client from the connection pool (reuses connections across requests)
  const pg = await getPostgresClient()

  try {
    // Generate the query embedding via Vertex AI
    const queryEmbedding = await getQueryEmbedding(queryText)
    const vectorLiteral = `[${queryEmbedding.map((v) => Number(v).toFixed(6)).join(',')}]`

    const whereClauses: string[] = []
    const values: any[] = []
    let paramIndex = 1

    // Only published assets
    whereClauses.push('ca.status = $' + paramIndex++)
    values.push('published')

    // CRITICAL: Filter out NULL embeddings before the vector search.
    // 16,143 rows (19%) have NULL embeddings - exclude them to avoid wasting computation
    whereClauses.push('ca.embedding IS NOT NULL')

    // Optional filters
    if (Array.isArray(assetTypes) && assetTypes.length > 0) {
      whereClauses.push('ca.asset_type = ANY($' + paramIndex++ + ')')
      values.push(assetTypes)
    }
    if (Array.isArray(aspectRatios) && aspectRatios.length > 0) {
      whereClauses.push('ca.aspect_ratio = ANY($' + paramIndex++ + ')')
      values.push(aspectRatios)
    }

    const whereClause = whereClauses.length > 0
      ? 'WHERE ' + whereClauses.join(' AND ')
      : ''

    // Vector similarity search using cosine distance
    const orderClause = `ORDER BY ca.embedding <=> '${vectorLiteral}'::vector ASC`

    const query = `
      SELECT ca.creative_id
      FROM creative_asset ca
      ${whereClause}
      ${orderClause}
      LIMIT $${paramIndex++} OFFSET $${paramIndex++}
    `
    values.push(limit, offset)

    const result = await pg.query(query, values)
    const ids: string[] = result.rows.map(r => r.creative_id)

    // Release the connection back to the pool
    pg.release()

    // Hydrate full asset details via Data Connect
    // (IDs first, then fetch full objects - more efficient)
    return { assetIds: ids }
  } catch (error) {
    pg.release()
    throw error
  }
})
Critical optimizations:
- Pre-filtering: We filter by status, asset types, and other metadata BEFORE the vector search. This dramatically reduces the search space.
- NULL embedding exclusion: We discovered 19% of our assets had NULL embeddings (legacy data). Filtering these out before vector computation saves significant CPU.
- Two-phase retrieval: Return IDs first from the vector search, then hydrate full asset details via GraphQL (sketched below). This is more efficient than joining all tables in the vector query.
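Here's a rough sketch of that hydration phase. In production it runs through the Data Connect GraphQL layer; the plain-SQL version below just shows the shape of the step, and any column beyond creative_id is an illustrative placeholder:

// Second phase: fetch full rows for the IDs returned by the vector search.
// Column names beyond creative_id are illustrative placeholders.
async function hydrateAssets(assetIds: string[]) {
  const pg = await getPostgresClient()
  try {
    const result = await pg.query(
      `SELECT ca.creative_id, ca.asset_type, ca.aspect_ratio, ca.status
         FROM creative_asset ca
        WHERE ca.creative_id = ANY($1)`,
      [assetIds]
    )

    // Preserve the similarity ordering from the vector search
    const byId = new Map(result.rows.map((row) => [row.creative_id, row]))
    return assetIds.map((id) => byId.get(id)).filter(Boolean)
  } finally {
    pg.release()
  }
}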
Database Indexing: HNSW
To achieve sub-50ms queries on 100k+ vectors, we use HNSW (Hierarchical Navigable Small World) indexing:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE INDEX IF NOT EXISTS creative_asset_embedding_hnsw_idx
  ON creative_asset
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
HNSW parameters:
- m = 16: Each node connects to up to 16 neighbors (higher = more accurate, slower writes)
- ef_construction = 64: Search width during index building (higher = better quality, slower builds)
- vector_cosine_ops: Use cosine distance for similarity, matching the <=> operator in our queries
A note on operator classes: pgvector only uses an index whose operator class matches the query's distance operator, so an index built with vector_l2_ops would not accelerate our <=> (cosine distance) queries. Building the index with vector_cosine_ops keeps the index and the query consistent.
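HNSW also has a query-time knob, hnsw.ef_search (pgvector's default is 40), which controls how many candidates are examined per query. Here's a minimal sketch of raising it for a single request using the pooled client from earlier; the value 100 is an illustrative starting point, not a tuned number:

// Raise HNSW search breadth for one query. SET LOCAL scopes the change to this
// transaction, so it does not leak to other requests sharing the pooled connection.
const pg = await getPostgresClient()
try {
  await pg.query('BEGIN')
  await pg.query('SET LOCAL hnsw.ef_search = 100') // higher = better recall, slower queries
  // ... run the vector similarity query from searchAssets here ...
  await pg.query('COMMIT')
} catch (error) {
  await pg.query('ROLLBACK')
  throw error
} finally {
  pg.release()
}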
Performance Results
After implementing these optimizations:
- Query latency: 30-50ms for 100k+ assets
- Cold start impact: Eliminated via connection pooling
- Index size: ~600MB for 100k 768-dimensional vectors
- Query throughput: Handles 100+ concurrent searches
Real-World Challenges & Solutions
Challenge 1: Legacy Data Without Embeddings
Problem: 19% of assets had NULL embeddings from before we implemented semantic search.
Solution:
whereClauses.push('ca.embedding IS NOT NULL')
This simple filter excludes NULL embeddings before vector computation, preventing wasted CPU cycles.
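The longer-term fix is to backfill the missing vectors. Here's a rough sketch of a batch backfill job; the description column used as the embedding input is an assumed, illustrative column name rather than our exact schema:

// Batch-backfill embeddings for assets that are missing them.
// `description` as the embedding source is an assumed column name for illustration.
async function backfillEmbeddings(batchSize = 100): Promise<number> {
  const pg = await getPostgresClient()
  try {
    const { rows } = await pg.query(
      `SELECT creative_id, description
         FROM creative_asset
        WHERE embedding IS NULL AND status = 'published'
        LIMIT $1`,
      [batchSize]
    )

    for (const row of rows) {
      // Reuses the Vertex AI helper from earlier; document embeddings may warrant
      // a different task_type than search queries
      const embedding = await getQueryEmbedding(row.description)
      await pg.query(
        'UPDATE creative_asset SET embedding = $1::vector WHERE creative_id = $2',
        [`[${embedding.join(',')}]`, row.creative_id]
      )
    }

    return rows.length // number of assets backfilled in this batch
  } finally {
    pg.release()
  }
}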
Challenge 2: Combining Semantic Search with Filters
Problem: Users want semantic search AND specific filters (brands, industries, aspect ratios).
Solution: Apply filters in the WHERE clause before vector similarity ordering:
// Filters applied BEFORE the vector search
if (assetTypes.length > 0) {
  whereClauses.push('ca.asset_type = ANY($' + paramIndex++ + ')')
  values.push(assetTypes)
}

// Vector similarity applied in ORDER BY
const orderClause = `ORDER BY ca.embedding <=> '${vectorLiteral}'::vector ASC`
This ensures we only compute similarity for assets matching the filters.
Challenge 3: Pagination
Problem: Users need paginated results for large result sets.
Solution: Use standard SQL LIMIT/OFFSET with the vector search:
LIMIT $${paramIndex++} OFFSET $${paramIndex++}
Since we're ordering by similarity, each page returns the next most similar results.
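For completeness, here's a sketch of how a web client might page through results by calling the function with the Firebase client SDK. The function name matches searchAssets above; the query text and page size are just examples:

import { getFunctions, httpsCallable } from 'firebase/functions'

// Client-side sketch: fetch the second page of the most similar results
const functions = getFunctions()
const searchAssets = httpsCallable(functions, 'searchAssets')

const page2 = await searchAssets({
  semantic: true,
  queryText: 'minimalist tech aesthetic',
  limit: 50,
  offset: 50, // skip the first page
})

const { assetIds } = page2.data as { assetIds: string[] }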
Best Practices
- Always filter before vector search: Pre-filtering reduces computation dramatically
- Use connection pooling: Essential for serverless functions hitting Cloud SQL
- Index your vectors: HNSW indexes are critical for sub-100ms queries
- Monitor NULL embeddings: Track and backfill missing embeddings for complete coverage (a quick coverage query is sketched after this list)
- Two-phase retrieval: Get IDs first, hydrate details second (more efficient)
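For the monitoring point above, a simple coverage query goes a long way. A minimal sketch using the pooled client:

// How many published assets still lack embeddings? Run periodically or expose as a metric.
const pg = await getPostgresClient()
try {
  const { rows } = await pg.query(`
    SELECT COUNT(*) FILTER (WHERE embedding IS NULL) AS missing,
           COUNT(*) AS total
    FROM creative_asset
    WHERE status = 'published'
  `)
  const missing = Number(rows[0].missing)
  const total = Number(rows[0].total)
  console.log(`Embedding coverage: ${total - missing}/${total} published assets`)
} finally {
  pg.release()
}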
Conclusion
Building production-ready semantic search requires more than just adding vectors to a database. Connection pooling, pre-filtering, proper indexing, and efficient query patterns all contribute to the final performance.
By combining pgvector's powerful vector search with Vertex AI embeddings and careful optimization, we achieved sub-50ms semantic search at scale. The system now handles complex queries like "modern minimalist design with blue accents" that keyword search could never match.
The code patterns we've shared here are production-tested and handling thousands of searches daily. You can adapt them to your own use case, whether you're building a design asset platform, e-commerce search, or any application requiring semantic understanding.