Why Your pgvector Setup Will Melt at 100 QPS (And Why 'Just Add Read Replicas' Won't Save You)
If you're building an enterprise RAG system today, you likely started with pgvector. It's the default choice: it's easy, it lives in Postgres, and it works fine… until it doesn't.
We're seeing a recurring pattern among enterprise teams: the RAG pipeline hits a wall at roughly 50-100 queries per second (QPS). Latency spikes from 200ms to 2s+. The database CPU pegs at 100%. The standard fix? Throw money at read replicas.
Here's the uncomfortable engineering truth: Postgres was never architected for high-dimensional vector math at scale.
The Mathematical Bottleneck: Float Vectors & HNSW
The problem isn't Postgres; it's the data structure. Standard HNSW indexes using float32 vectors are memory hogs.
Memory Bloat
A single 1536-dimensional vector (OpenAI) takes ~6KB. Multiply that by 10 million vectors. You're looking at 60GB+ of RAM just to keep the index hot.
The "Toast" Problem
Postgres stores oversized values (like 6KB vectors) out-of-line in TOAST tables. Every fetch then pays detoasting and extra I/O on top of the index traversal, killing your throughput.
-- What happens under the hood
SELECT embedding <-> query_vector
FROM documents
ORDER BY embedding <-> query_vector
LIMIT 10;
-- Reality: Each vector fetch hits TOAST storage
-- Result: Disk I/O bottleneck at scale
Vacuuming Nightmares
High-velocity updates (common in agentic RAG) create dead tuples. Autovacuum struggles to keep up with massive vector indices, leading to:
- Index bloat: Search performance degrades over time
- Lock contention: Vacuum operations block writes
- CPU saturation: Reindexing becomes prohibitively expensive
The Physics of the Problem
Let's do the math on a typical enterprise setup:
Scenario: 10M vectors, 1536 dimensions, OpenAI embeddings
Memory per vector: 1536 dims × 4 bytes (float32) = 6,144 bytes
Total index size: 10M × 6KB = 60GB
HNSW overhead: ~2x = 120GB total RAM required
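As a sanity check, the sizing above can be reproduced in a few lines of Python. Note the 2x HNSW multiplier is a rough rule of thumb for graph links and padding, not a measured constant:

```python
# Back-of-envelope sizing for the scenario above:
# 10M vectors, 1536 dims, float32, ~2x HNSW overhead (assumed multiplier).

NUM_VECTORS = 10_000_000
DIMS = 1536
BYTES_PER_FLOAT32 = 4
HNSW_OVERHEAD = 2  # rough multiplier, not a measured constant

bytes_per_vector = DIMS * BYTES_PER_FLOAT32        # 6,144 bytes
raw_index_bytes = NUM_VECTORS * bytes_per_vector   # ~60 GB of raw floats
total_ram_bytes = raw_index_bytes * HNSW_OVERHEAD  # ~120 GB to keep hot

print(f"per vector: {bytes_per_vector:,} bytes")
print(f"raw index:  {raw_index_bytes / 1e9:.1f} GB")
print(f"with HNSW:  {total_ram_bytes / 1e9:.1f} GB")
```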
Cost Reality:
- AWS RDS db.r6g.4xlarge (128GB RAM): $1,200/month
- Plus Read Replicas for HA: $2,400/month
- Plus backup storage: $300/month
Total: $3,900/month just to keep vectors in memory.
And you still hit the QPS wall at ~100 queries/second.
The Solution: Binary Quantization on a NoSQL Core
We built Moorcheh not because we hate Postgres, but because we did the math. By moving away from float32 and using Information-Theoretic Scoring (ITS) with binary quantization, we changed the physics of retrieval:
32x Less Memory
We compress the index footprint by 32x without losing semantic precision. 10M vectors fit in RAM on a fraction of the hardware.
Moorcheh binary representation: 1536 dims ÷ 8 = 192 bytes
Memory savings: 6,144 bytes → 192 bytes = 32x reduction
Total for 10M vectors: ~1.9GB (vs 60GB)
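For illustration, here is what sign-based binary quantization looks like in plain Python. This is a generic sketch of the technique (`binarize` is a hypothetical helper), not Moorcheh's proprietary ITS implementation:

```python
import random

DIMS = 1536

def binarize(vec):
    """Sign-based binary quantization: 1 bit per dimension,
    packed 8 dims per byte. Illustrative only; not Moorcheh's
    actual ITS scheme."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

vec = [random.uniform(-1, 1) for _ in range(DIMS)]
code = binarize(vec)

print(len(code))              # 192 bytes, vs 6,144 for float32
print(DIMS * 4 // len(code))  # 32x reduction
```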
Zero "Toast" Overhead
Our proprietary NoSQL foundation handles binary blobs natively. No join penalties. No TOAST lookups. Direct memory access.
Linear Scalability
We sustain high QPS without the steep CPU cost HNSW incurs on Postgres. Our serverless architecture scales to zero when idle, eliminating the "RAM Tax."
The Benchmark
We ran a head-to-head comparison on identical workloads:
| Metric | pgvector (RDS) | Moorcheh |
|--------|----------------|----------|
| Latency (p50) | 450ms | 40ms |
| Latency (p99) | 2.1s | 120ms |
| Max QPS | 85 | 1,200+ |
| RAM Required | 120GB | 3GB |
| Monthly Cost | $3,900 | $0-$200 |
Benchmark: 10M vectors, 1536 dimensions, concurrent queries
The Audit
If you're seeing 500ms+ latency on your RAG queries or your RDS bill is climbing faster than your user base, you have a vector architecture problem.
We're offering a free "Infrastructure Audit" for teams managing >1M vectors. We'll look at your query plans and show you exactly how much RAM you're wasting on floats.
What We'll Analyze:
- Query Performance: Actual p50/p95/p99 latencies under load
- Memory Utilization: TOAST overhead and buffer cache efficiency
- Cost Projection: What happens at 10x scale
- Migration Path: Zero-downtime transition strategy
The Engineering Reality
The vector database market is at an inflection point. The old architecture (float vectors + HNSW + dedicated clusters) was designed for a different era—when RAM was cheap and workloads were predictable.
Modern agentic AI demands:
- Serverless economics: Pay for compute, not idle RAM
- Sub-100ms latency: Real-time conversational AI
- Linear scaling: From 1M to 100M vectors without re-architecture
Postgres is an incredible database. But forcing it to be a vector search engine is like using a Formula 1 car for off-road racing. It'll work, but you're fighting the design.
Use Moorcheh.ai for free or join our Discord community to discuss vector architecture and RAG optimization.
Written by Tara Khani, CEO at Moorcheh.ai
Build this architecture today.
Get your API key and start building agentic memory in under 5 minutes.
Get API Key