RAG vs SQL: Why Your AI Agent Should Query, Not Embed
Every week, another team announces they’re “building RAG for their data warehouse.” They serialize their tables into text chunks, embed the chunks into vectors, store them in Pinecone, and ask an LLM to answer business questions. It works for documents. It fails catastrophically for structured data.
The Problem: RAG on Tables
Here’s what happens when you use RAG for structured data:
```python
# The RAG approach to answering "What was revenue in Q3?"
from openai import OpenAI
from pinecone import Pinecone

llm = OpenAI()
index = Pinecone().Index("orders")

# Step 1: Chunk your tables into text
chunks = []
for row in orders_table:  # orders_table: an iterable of order rows
    chunks.append(f"Order {row.id}: customer {row.customer}, "
                  f"amount ${row.amount}, date {row.date}")

# Step 2: Embed chunks
embeddings = llm.embeddings.create(
    model="text-embedding-3-small", input=chunks
).data  # $0.0001 per chunk, 1M rows = $100

# Step 3: Store in a vector DB
index.upsert(vectors=[
    (str(i), e.embedding) for i, e in enumerate(embeddings)
])  # $70/month for 1M vectors

# Step 4: At query time, retrieve "relevant" chunks
query_embedding = llm.embeddings.create(
    model="text-embedding-3-small", input="What was revenue in Q3?"
).data[0].embedding
results = index.query(vector=query_embedding, top_k=20, include_metadata=True)

# Step 5: Ask the LLM to synthesize
answer = llm.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Based on these records: {results}\n"
                   f"Answer: What was revenue in Q3?",
    }],
)
# Returns: "Based on the retrieved records, Q3 revenue appears to be
# approximately $2.3M" (actual answer: $4.7M)
```

Why it fails:
- Retrieval misses rows. Vector similarity finds “relevant” chunks, not all matching rows. If you have 50,000 Q3 orders, you retrieve 20 of them.
- Aggregation is impossible. You cannot SUM, AVG, or COUNT from a 20-row sample (see the sketch after this list).
- Data goes stale. You need to re-embed every time data changes.
- It’s expensive. Embedding millions of rows + vector DB hosting + LLM tokens for context.
- It hallucinates with confidence. The LLM sees 20 rows and extrapolates.
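To make the aggregation failure concrete, here is a minimal, self-contained simulation (synthetic data, no external services, all numbers illustrative) comparing the sum of a 20-row retrieved sample against the true total:

```python
import random

# Synthetic stand-in for 50,000 Q3 order amounts, in dollars
random.seed(42)
q3_orders = [random.uniform(10, 500) for _ in range(50_000)]

# What SQL computes: the exact total over every row
true_revenue = sum(q3_orders)

# What RAG sees: the 20 chunks that scored highest on similarity,
# which for an aggregation question is effectively a random sample
retrieved = random.sample(q3_orders, k=20)
sample_total = sum(retrieved)

print(f"True Q3 revenue:      ${true_revenue:,.2f}")
print(f"Sum of 20 retrieved:  ${sample_total:,.2f}")
# The LLM can only extrapolate from the sample; nothing in the prompt
# tells it how many matching rows were never retrieved.
```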
The Correct Approach: SQL
The same question answered with SQL:
```sql
SELECT SUM(amount) AS revenue
FROM orders
WHERE created_at >= '2024-07-01'
  AND created_at < '2024-10-01'
```

Result: $4,712,843.00 — exact, instant, free.
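If the orders already live in Parquet, you can reproduce this answer with nothing but DuckDB, the engine the comparison below credits for the low query cost. A minimal sketch; the file path is illustrative:

```python
import duckdb

# Query Parquet files in place: no ingestion, no embeddings
con = duckdb.connect()
revenue = con.execute("""
    SELECT SUM(amount) AS revenue
    FROM 'orders/*.parquet'          -- illustrative local path
    WHERE created_at >= '2024-07-01'
      AND created_at < '2024-10-01'
""").fetchone()[0]
print(f"${revenue:,.2f}")
```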
MCP + SQL with DataSpoc Lens
DataSpoc Lens exposes your data lake as an MCP server. AI agents execute real SQL queries instead of guessing from embeddings.
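For context, MCP-capable clients such as Claude Desktop register stdio servers through an `mcpServers` JSON config. A sketch of such an entry, assuming the `dataspoc-lens mcp` command from the setup below serves MCP over stdio:

```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}
```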
Setup
```bash
pip install dataspoc-lens[mcp]
dataspoc-lens add-bucket s3://company-lake
dataspoc-lens mcp
```

How the Agent Works
```python
from dataspoc_lens import LensClient

client = LensClient()

# Agent discovers what data exists
tables = client.list_tables()
# ['raw.postgres.orders', 'raw.postgres.customers', 'raw.postgres.products']

# Agent inspects schema
schema = client.get_schema("raw.postgres.orders")
# [{"name": "id", "type": "INTEGER"},
#  {"name": "customer_id", "type": "INTEGER"},
#  {"name": "amount", "type": "DECIMAL"},
#  {"name": "created_at", "type": "TIMESTAMP"}, ...]

# Agent writes and executes SQL
result = client.query("""
    SELECT SUM(amount) AS revenue
    FROM raw.postgres.orders
    WHERE created_at >= '2024-07-01'
      AND created_at < '2024-10-01'
""")
# [{"revenue": 4712843.00}]
```

No embeddings. No vector DB. No hallucination. Real data, real answers.
Side-by-Side Code Comparison
RAG Pipeline (what teams are building)
```python
# ingest.py — runs daily, takes 45 minutes for 1M rows
import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

df = pd.read_parquet("s3://company-lake/raw/postgres/orders/")

# Convert rows to text chunks
documents = []
for _, row in df.iterrows():
    text = f"Order {row['id']}: {row['customer_name']} purchased "
    text += f"${row['amount']:.2f} on {row['created_at']}"
    documents.append(text)

# Split and embed
splitter = RecursiveCharacterTextSplitter(chunk_size=500)
chunks = splitter.create_documents(documents)
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore.from_documents(chunks, embeddings, index_name="orders")
```

```python
# query.py — at query time
retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
docs = retriever.get_relevant_documents("What was Q3 revenue?")
# Gets 50 random-ish order records out of 250,000 Q3 orders
```

SQL Pipeline (what you should build)
```python
# query.py — that's it, no ingest step needed
from dataspoc_lens import LensClient

client = LensClient()
result = client.query("SELECT SUM(amount) FROM orders WHERE quarter = 'Q3-2024'")
# Scans all 250,000 rows in 0.3 seconds, returns exact answer
```

The Comparison
| Dimension | RAG on Tables | MCP + SQL |
|---|---|---|
| Accuracy | ~60% (misses rows, can’t aggregate) | 99%+ (exact computation) |
| Speed | 3-8 seconds (embed query + retrieve + LLM) | 0.1-2 seconds (SQL execution) |
| Freshness | Stale until re-embedded (hours/days) | Real-time (queries live data) |
| Cost per query | $0.02-0.10 (embedding + LLM tokens) | $0.001 (DuckDB compute only) |
| Monthly infra | $200+ (vector DB + embedding API) | $0 (DuckDB is free) |
| Auditability | “Based on retrieved context…” | Exact SQL query, reproducible |
| Aggregations | Impossible (SUM of 20 rows != SUM of 250K) | Native (that’s what SQL does) |
| Joins | Cannot join across embeddings | Native multi-table joins |
| Updates | Re-embed entire dataset | Automatic (reads latest Parquet) |
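To turn the per-query rows into a monthly bill: the unit costs below come from the table, and the 10,000 queries/month volume is an assumption for illustration.

```python
# Back-of-envelope monthly cost from the table's figures
queries_per_month = 10_000   # assumed volume, for illustration only

rag_per_query = 0.06         # midpoint of the $0.02-0.10 range
rag_infra = 200              # vector DB + embedding API floor
rag_total = queries_per_month * rag_per_query + rag_infra

sql_per_query = 0.001        # DuckDB compute only
sql_total = queries_per_month * sql_per_query

print(f"RAG: ${rag_total:,.0f}/month")   # $800
print(f"SQL: ${sql_total:,.0f}/month")   # $10
```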
When RAG IS the Right Choice
RAG is excellent for unstructured documents:
- PDF reports, legal contracts, support tickets
- Documentation search
- Knowledge bases with prose content
- Any data where keyword/semantic search is the access pattern
The rule is simple:
Tables and numbers = SQL. Documents and prose = RAG.
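For contrast with the anti-pattern above, here is what the legitimate side of that rule looks like: a minimal sketch using the same LangChain + Pinecone stack, pointed at prose instead of rows. The `support-docs` index and the query are illustrative.

```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Prose documents, where semantic search really is the access pattern
docs_store = PineconeVectorStore(
    index_name="support-docs",   # hypothetical index of tickets and manuals
    embedding=OpenAIEmbeddings(),
)

hits = docs_store.similarity_search("How do I rotate API keys?", k=5)
for doc in hits:
    print(doc.page_content[:120])
# The contract is "find a few relevant passages", not "see every
# matching record". That is exactly what vector search delivers.
```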
A Hybrid Architecture
For most companies, the best architecture combines both:
```
┌──────────────────────────────────────────────────┐
│                     AI Agent                     │
├──────────────────────────────────────────────────┤
│                                                  │
│  Structured data?   ──► DataSpoc Lens (MCP + SQL)│
│                                                  │
│  Unstructured docs? ──► RAG (vector search)      │
│                                                  │
└──────────────────────────────────────────────────┘
```

```python
from dataspoc_lens import LensClient
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

lens = LensClient()
vectorstore = PineconeVectorStore(index_name="docs", embedding=OpenAIEmbeddings())

def route_question(question: str) -> str:
    """Route to SQL for data questions, RAG for document questions."""
    data_keywords = ["revenue", "count", "average", "total", "growth",
                     "last month", "quarterly", "how many", "top 10"]

    if any(kw in question.lower() for kw in data_keywords):
        # SQL path — exact answers from structured data
        result = lens.ask(question)
        return result["answer"]
    else:
        # RAG path — semantic search over documents
        docs = vectorstore.similarity_search(question, k=5)
        return synthesize_from_docs(docs, question)  # your LLM summarization helper
```

The Cost of Getting This Wrong
A real example from a Series B startup:
| Metric | Their RAG setup | After switching to SQL |
|---|---|---|
| Monthly cost | $2,400 (Pinecone + OpenAI embeddings) | $0 (DuckDB on Parquet) |
| Query accuracy | 58% (validated against actual numbers) | 99.7% |
| Data freshness | 6-hour lag (daily re-embedding) | Real-time |
| Maintenance | 2 engineers part-time | Zero (self-service) |
| Time to answer | 4.2 seconds average | 0.8 seconds average |
They were spending $2,400/month to get wrong answers slowly.
Getting Started
```bash
# If your data is already in Parquet/S3:
pip install dataspoc-lens[mcp]
dataspoc-lens add-bucket s3://your-bucket
dataspoc-lens mcp   # Start MCP server for AI agents

# If your data is in Postgres/MySQL:
pip install dataspoc-pipe
dataspoc-pipe init && dataspoc-pipe add postgres --tables orders
dataspoc-pipe run   # Extracts to Parquet

# Then query with SQL
dataspoc-lens query "SELECT SUM(amount) FROM raw.postgres.orders WHERE created_at > '2024-01-01'"
```

Stop embedding your tables. Start querying them.