Why Your pgvector Setup Will Melt at 100 QPS (And Why 'Just Add Read Replicas' Won't Save You)
If you're building an enterprise RAG system today, you likely started with pgvector. It's the default choice: it's easy, it lives in Postgres, and it works fine… until it doesn't.
We're seeing a recurring pattern among enterprise teams: the RAG pipeline hits a wall at roughly 50-100 queries per second (QPS). Latency spikes from 200ms to 2s+. The database CPU pegs at 100%. The standard fix? Throw money at read replicas.
Here's the uncomfortable engineering truth: Postgres was never architected for high-dimensional vector math at scale.
The Mathematical Bottleneck: Float Vectors & HNSW
The problem isn't Postgres; it's the data structure. Standard HNSW indexes using float32 vectors are memory hogs.
Memory Bloat
A single 1536-dimensional vector (OpenAI) takes ~6KB. Multiply that by 10 million vectors. You're looking at 60GB+ of RAM just to keep the index hot.
The "Toast" Problem
Postgres stores oversized values (like 6KB vectors) out-of-line in TOAST tables. Every fetch then pays detoasting and extra I/O on top of the index traversal, killing your throughput.
-- What happens under the hood
SELECT embedding <-> query_vector
FROM documents
ORDER BY embedding <-> query_vector
LIMIT 10;
-- Reality: Each vector fetch hits TOAST storage
-- Result: Disk I/O bottleneck at scale
Vacuuming Nightmares
High-velocity updates (common in agentic RAG) create dead tuples. Autovacuum struggles to keep up with massive vector indices, leading to:
- Index bloat: Search performance degrades over time
- Lock contention: Vacuum operations block writes
- CPU saturation: Reindexing becomes prohibitively expensive
The Physics of the Problem
Let's do the math on a typical enterprise setup:
Scenario: 10M vectors, 1536 dimensions, OpenAI embeddings
Memory per vector: 1536 dims × 4 bytes (float32) = 6,144 bytes
Total index size: 10M × 6KB = 60GB
HNSW overhead: ~2x = 120GB total RAM required
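As a sanity check, the sizing above can be reproduced in a few lines of Python. Note the 2x HNSW multiplier is a rough rule of thumb for graph links and padding, not a measured constant:

```python
# Back-of-envelope sizing for the scenario above:
# 10M vectors, 1536 dims, float32, ~2x HNSW overhead (assumed multiplier).

NUM_VECTORS = 10_000_000
DIMS = 1536
BYTES_PER_FLOAT32 = 4
HNSW_OVERHEAD = 2  # rough multiplier, not a measured constant

bytes_per_vector = DIMS * BYTES_PER_FLOAT32        # 6,144 bytes
raw_index_bytes = NUM_VECTORS * bytes_per_vector   # ~60 GB of raw floats
total_ram_bytes = raw_index_bytes * HNSW_OVERHEAD  # ~120 GB to keep hot

print(f"per vector: {bytes_per_vector:,} bytes")
print(f"raw index:  {raw_index_bytes / 1e9:.1f} GB")
print(f"with HNSW:  {total_ram_bytes / 1e9:.1f} GB")
```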
Cost Reality:
- AWS RDS db.r6g.4xlarge (128GB RAM): $1,200/month
- Plus Read Replicas for HA: $2,400/month
- Plus backup storage: $300/month
Total: $3,900/month just to keep vectors in memory.
And you still hit the QPS wall at ~100 queries/second.
The Solution: Binary Quantization on a NoSQL Core
We built Moorcheh not because we hate Postgres, but because we did the math. By moving away from float32 and using Information-Theoretic Scoring (ITS) with binary quantization, we changed the physics of retrieval:
32x Less Memory
We compress the index footprint by 32x without losing semantic precision. 10M vectors fit in RAM on a fraction of the hardware.
Moorcheh binary representation: 1536 dims ÷ 8 = 192 bytes
Memory savings: 6,144 bytes → 192 bytes = 32x reduction
Total for 10M vectors: ~1.9GB (vs 60GB)
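For illustration, here is what sign-based binary quantization looks like in plain Python. This is a generic sketch of the technique (`binarize` is a hypothetical helper), not Moorcheh's proprietary ITS implementation:

```python
import random

DIMS = 1536

def binarize(vec):
    """Sign-based binary quantization: 1 bit per dimension,
    packed 8 dims per byte. Illustrative only; not Moorcheh's
    actual ITS scheme."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

vec = [random.uniform(-1, 1) for _ in range(DIMS)]
code = binarize(vec)

print(len(code))              # 192 bytes, vs 6,144 for float32
print(DIMS * 4 // len(code))  # 32x reduction
```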
Zero "Toast" Overhead
Our proprietary NoSQL foundation handles binary blobs natively. No join penalties. No TOAST lookups. Direct memory access.
Linear Scalability
We sustain high QPS without the steep CPU cost HNSW incurs on Postgres. Our serverless architecture scales to zero when idle, eliminating the "RAM Tax."
The Benchmark
We ran a head-to-head comparison on identical workloads:
| Metric | pgvector (RDS) | Moorcheh |
|--------|----------------|----------|
| Latency (p50) | 450ms | 40ms |
| Latency (p99) | 2.1s | 120ms |
| Max QPS | 85 | 1,200+ |
| RAM Required | 120GB | 3GB |
| Monthly Cost | $3,900 | $0-$200 |
Benchmark: 10M vectors, 1536 dimensions, concurrent queries
The Audit
If you're seeing 500ms+ latency on your RAG queries or your RDS bill is climbing faster than your user base, you have a vector architecture problem.
We're offering a free "Infrastructure Audit" for teams managing >1M vectors. We'll look at your query plans and show you exactly how much RAM you're wasting on floats.
What We'll Analyze:
- Query Performance: Actual p50/p95/p99 latencies under load
- Memory Utilization: TOAST overhead and buffer cache efficiency
- Cost Projection: What happens at 10x scale
- Migration Path: Zero-downtime transition strategy
The Engineering Reality
The vector database market is at an inflection point. The old architecture (float vectors + HNSW + dedicated clusters) was designed for a different era—when RAM was cheap and workloads were predictable.
Modern agentic AI demands:
- Serverless economics: Pay for compute, not idle RAM
- Sub-100ms latency: Real-time conversational AI
- Linear scaling: From 1M to 100M vectors without re-architecture
Postgres is an incredible database. But forcing it to be a vector search engine is like using a Formula 1 car for off-road racing. It'll work, but you're fighting the design.
Use Moorcheh.ai for free or join our Discord community to discuss vector architecture and RAG optimization.
Written by Tara Khani, CEO at Moorcheh.ai
Build this architecture today.
Get your API key and start building agentic memory in under 5 minutes.
Get API Key