RAG in Practicepgvectorsemantic searchpostgresnextjs

Adding Semantic Search to an Existing SaaS in a Weekend: pgvector Before You Reach for Pinecone

HSMalik Hamza ShabbirJune 25, 2026Updated June 25, 20269 min read

In short

If your SaaS already runs on Postgres, start with pgvector, not Pinecone. You can ship semantic search in a weekend: add a vector column, backfill embeddings, build an HNSW index, and run top-K from one Next.js route. You only need a dedicated vector database around the ~1M-vector mark or when write throughput outgrows a single Postgres node. See MVP development if you want this built for you.

Adding Semantic Search to an Existing SaaS in a Weekend: pgvector Before You Reach for Pinecone

On this page

Why start with pgvector instead of Pinecone?
How do I add a vector column to an existing table?
How do I run a fast top-K search with HNSW?
What does the Next.js search route look like?
How do I keep embeddings fresh without reindexing everything?
What is a semantic cache and why bother?
When do I actually need Pinecone or Milvus?
The weekend plan

If your SaaS already runs on Postgres, the fastest path to real semantic search is pgvector, not Pinecone. You add one extension, one vector column, and one HNSW index, then run top-K similarity from a single Next.js route. No second database, no sync job between two systems, no new vendor invoice. In my experience building MVPs and rescuing apps for SMBs, you can ship this over a weekend, and you genuinely do not need a dedicated vector database until you are somewhere around a million vectors per index. This article walks through the migration, the query, a cache layer that pays for itself, and the exact point where reaching for Pinecone or Milvus stops being premature.

Why start with pgvector instead of Pinecone?

Because the cheapest infrastructure is the database you already operate. pgvector lets your existing Postgres store and search embeddings, so your vectors sit in the same transaction, the same backup, and the same access controls as the rows they describe.

The hidden cost of a separate vector database is not the monthly bill. It is the second source of truth. The moment your embeddings live in Pinecone and your documents live in Postgres, you own a sync problem forever: rows get deleted in one place and orphaned in the other, a failed write leaves the two stores disagreeing, and every feature now has to reason about two systems. For a small team, that operational tax is bigger than any query-latency win.

With pgvector you keep one mental model. A JOIN brings back the matched row and its metadata in the same query. A foreign key cascade deletes the embedding when the parent row dies. Your row-level security still applies. That is the whole pitch, and for most SMB SaaS workloads it holds well past the point people assume they have outgrown it.

This fits the broader 2026 pattern of AI-native SaaS architecture: favor pgvector until you have a measured reason to leave it. If you are still deciding whether you even need retrieval, my 2026 decision tree for RAG, fine-tuning, or prompting ↗ covers that earlier fork.

How do I add a vector column to an existing table?

Enable the extension, add a vector column sized to your embedding model, and you have not changed any existing behavior yet. Old queries keep working untouched because the new column is just nullable space until you populate it.

Pick your embedding model first, because the dimension is fixed on the column. If your model outputs 1536-dimensional vectors, the column is vector(1536) and it cannot hold anything else. Choose the model once and keep it identical for both writes and queries, or your stored vectors and your query vector live in different spaces and similarity becomes noise.

SQL

-- 1. Enable the extension (idempotent)
CREATE EXTENSION IF NOT EXISTS vector;

-- 2. Add an embedding column to the table you want searchable.
--    Dimension must match your embedding model exactly.
ALTER TABLE documents
  ADD COLUMN embedding vector(1536);

-- 3. Track which rows still need embedding, so a backfill
--    can run in batches and resume after a crash.
ALTER TABLE documents
  ADD COLUMN embedded_at timestamptz;

The embedded_at column is not decoration. It is how you backfill a large table without holding one giant transaction open and without losing your place if the job dies halfway. You select a batch where embedded_at IS NULL, embed it, write the vectors, stamp the timestamp, and loop.

How do I run a fast top-K search with HNSW?

Build an HNSW index on the vector column, then order by the distance operator and limit the result. HNSW gives you approximate nearest-neighbor search that stays fast as the table grows, which is what makes interactive semantic search feel instant.

There are two index types in pgvector: IVFFlat and HNSW. For most SaaS search I reach for HNSW. It has higher build time and memory cost, but it gives better recall at a given speed and, importantly, it does not need to be rebuilt or retrained as your data distribution shifts the way IVFFlat's lists do.

SQL

-- Cosine distance is the common default for normalized text embeddings.
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops);

-- The query: nearest neighbors to a query embedding, scoped to a tenant.
-- <=> is cosine distance; smaller is closer.
SELECT
  id,
  title,
  1 - (embedding <=> $1) AS similarity
FROM documents
WHERE workspace_id = $2
  AND embedding IS NOT NULL
ORDER BY embedding <=> $1
LIMIT 10;

Two things matter here. The <=> operator must match the operator class you indexed (vector_cosine_ops pairs with cosine distance), or Postgres ignores your index and does a full scan. And in a multi-tenant SaaS, always scope by workspace_id so you never leak one customer's results into another's search box.

You can tune recall at query time with SET hnsw.ef_search = 100; (higher means better recall and slower queries). Start at the default, raise it only if your evaluation set shows you are missing relevant rows.

What does the Next.js search route look like?

One route handler: embed the incoming query with the same model you used on write, then run the top-K SQL above. The whole thing is a single network hop to your embedding provider plus one Postgres query.

// app/api/search/route.ts
import { NextRequest, NextResponse } from "next/server";
import { sql } from "@/lib/db";
import { embed } from "@/lib/embeddings";
import { cachedEmbedding } from "@/lib/semantic-cache";

export async function POST(req: NextRequest) {
  const { query, workspaceId } = await req.json();

  if (!query?.trim()) {
    return NextResponse.json({ results: [] });
  }

  // Same model as the write path. pgvector wants a bracketed string.
  const vector = await cachedEmbedding(query, () => embed(query));
  const literal = `[${vector.join(",")}]`;

  const results = await sql`
    SELECT id, title, 1 - (embedding <=> ${literal}::vector) AS similarity
    FROM documents
    WHERE workspace_id = ${workspaceId}
      AND embedding IS NOT NULL
    ORDER BY embedding <=> ${literal}::vector
    LIMIT 10
  `;

  return NextResponse.json({ results });
}

Keep the embedding call server-side. The query text and your provider key never touch the browser, and you get to wrap the call in the cache below. If you also need the route to refuse weak matches rather than return loosely related rows, the pattern in my piece on building a docs chatbot that refuses low-confidence answers ↗ applies directly: threshold on similarity and return empty rather than wrong.

How do I keep embeddings fresh without reindexing everything?

Embed on write. Whenever the source text of a row changes, enqueue a job that re-embeds just that row, so you never run a full reindex for an edit to one document.

The trap is embedding inline inside the user's request. That couples your write latency to your embedding provider's latency and turns a slow API call into a slow save button. Instead, write the row immediately, mark it as needing embedding, and let a background worker catch up.

// On create or update of searchable text:
await sql`
  UPDATE documents
  SET body = ${newBody}, embedded_at = NULL
  WHERE id = ${id}
`;
await queue.enqueue("embed-document", { id });

// Worker:
async function embedDocument({ id }: { id: string }) {
  const [row] = await sql`SELECT body FROM documents WHERE id = ${id}`;
  if (!row) return;
  const vector = await embed(row.body);
  await sql`
    UPDATE documents
    SET embedding = ${`[${vector.join(",")}]`}::vector,
        embedded_at = now()
    WHERE id = ${id}
  `;
}

Setting embedded_at = NULL on every text change unifies two paths: the one-time backfill of your existing table and the ongoing freshness of edited rows both become "find rows where embedded_at IS NULL and embed them." One worker handles both. If a row is edited twice before the worker runs, it just embeds the latest version once, which is the behavior you want.

What is a semantic cache and why bother?

A semantic cache stores recent query embeddings and their results so that repeated or near-identical searches skip the embedding API call entirely. In a real SaaS, the same handful of queries get typed constantly, and embedding them every time is wasted latency and money.

The simplest version caches by exact query string, which already removes a surprising amount of load because users retype the same searches. A more capable version caches the embedding itself and reuses results when a new query is close enough to a cached one.

// lib/semantic-cache.ts
const cache = new Map<string, number[]>(); // swap for Redis in prod

export async function cachedEmbedding(
  query: string,
  compute: () => Promise<number[]>,
): Promise<number[]> {
  const key = query.trim().toLowerCase();
  const hit = cache.get(key);
  if (hit) return hit;
  const vector = await compute();
  cache.set(key, vector);
  return vector;
}

In production I back this with Redis and a TTL so the cache survives deploys and shares across instances. The win is concrete: a cached query returns in single-digit milliseconds and costs nothing, while an uncached one pays for a round trip to the embedding provider. For search boxes that autocomplete or fire on every keystroke, this is the difference between a snappy feature and a slow, expensive one.

When do I actually need Pinecone or Milvus?

Around a million vectors per index, or when vector write throughput and recall tuning start fighting your transactional workload on a single Postgres node. Below that line, a dedicated vector database is mostly added operations cost for a benefit you cannot measure. The 2026 guidance I follow for AI-native SaaS is to stay on pgvector until one of these thresholds is real, not anticipated.

Here is the decision table I actually use:


Signal	Stay on pgvector	Move to Pinecone / Milvus
Vectors per index	Up to ~1M	Well past 1M, growing fast
Infra footprint	Already on Postgres, want one DB	Fine adding a managed service
Write pattern	Steady, batched embed-on-write	Very high-volume continuous ingest
Recall tuning needs	Default HNSW is good enough	Need fine-grained recall/latency control at scale
Query latency at scale	Acceptable on a sized Postgres box	p95 degrading despite tuning and bigger instances
Team operational budget	Small, wants fewer moving parts	Has capacity for a second data system
Filtering	Rich SQL filters and JOINs needed	Mostly vector-first with simple metadata filters

The strongest reason to stay is that pgvector keeps your filters in SQL. If your search is "documents similar to this query, in this workspace, created this quarter, by these authors," that is one query with a JOIN and WHERE clauses in Postgres, and it is genuinely awkward to replicate in a vector-first store.

The honest reason to leave is scale plus contention. Once a single index holds many millions of vectors, or once heavy embedding writes start competing with your OLTP traffic on the same node, a purpose-built system earns its keep. When that day comes, the migration is mechanical: you already have an embed-on-write pipeline, so you point the worker at the new store and replay. Designing the pipeline this way from day one is what makes the eventual move boring instead of a rewrite.

The weekend plan

To put it together: enable pgvector, add the column and the embedded_at flag, build the HNSW index, write the backfill worker, wire the Next.js route with a cache, and ship. That is a weekend of work for a SaaS already on Postgres, and it carries you to roughly a million vectors before the question of a dedicated vector database is worth revisiting.

The mistake I see most often is teams adopting Pinecone on day one for a search feature that has a few thousand documents. They take on a sync problem and a second bill to solve a scale problem they do not have yet. Start with the database you already run, measure, and migrate only when the numbers tell you to.

If you want this built into your product rather than built yourself, I do exactly this kind of pragmatic AI feature work as part of MVP development ↗. And if you would rather talk through whether pgvector or a dedicated store fits your roadmap before writing any code, my contact page ↗ is the place to start, no pitch, just a straight answer.

FAQ

Do I need a separate vector database to add semantic search to my SaaS?

No, if you already run Postgres you can add semantic search with the pgvector extension and an HNSW index without standing up a second database.

When should I migrate from pgvector to Pinecone or Milvus?

Plan the move around roughly one million vectors per index, or sooner if write throughput and recall tuning start fighting your transactional Postgres workload on a single node.

What embedding model should I use for pgvector?

Use a small, fast text embedding model with a fixed dimension you set once on the column, and keep the same model for both writes and queries so the vectors live in the same space.

How do I keep embeddings fresh as rows change?

Embed on write by enqueuing a job whenever the source text changes, so new and edited rows get re-embedded instead of running a full reindex.

Will adding a vector column slow down my existing queries?

Not in practice, because the vector column and its HNSW index are separate from your normal B-tree indexes and only get touched by similarity searches.

Working on something like this?

I build web apps, AI features, and mobile products for clients. If this article matches a problem you have, tell me about it.

Start a conversation

Malik Hamza Shabbir · Full-Stack & AI Engineer

I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.

About me