
RAG on Parquet: How to Build Retrieval Over Your Data Lake Without Embeddings

Michael San Martim · 2026-04-17

You have 50 million rows of structured data in Parquet files. Your CEO asks: “Can we let people ask questions about our data in plain English?” The default answer in 2026 is RAG: embed everything into a vector database, retrieve relevant chunks, and feed them to an LLM. For structured data, this is the wrong answer.

The Problem with RAG on Structured Data

RAG was designed for documents: PDFs, wikis, knowledge bases. It works by finding text chunks that are semantically similar to a question. This breaks down for structured data:

  • “What were total sales last quarter?” requires a SUM aggregation. No chunk contains the pre-computed answer.
  • “Which customers churned in March?” requires a filter on a date column. Similarity search returns random customer records.
  • “Compare revenue by region year-over-year” requires a GROUP BY with a window function. Embedding-based retrieval cannot do math.

The fundamental issue: vector similarity finds related text, not correct answers to analytical questions.
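
To make that concrete, here is roughly the SQL each of those questions actually needs. This is a sketch over an assumed schema (a curated_sales table with product, region, amount, and sale_date columns, and a curated_customers table with a churn_date column); the column names are illustrative:

-- Total sales last quarter: an aggregation, not a lookup
SELECT SUM(amount)
FROM curated_sales
WHERE sale_date >= '2026-01-01' AND sale_date < '2026-04-01';

-- Customers who churned in March: a filter on a date column
SELECT customer_id
FROM curated_customers
WHERE churn_date >= '2026-03-01' AND churn_date < '2026-04-01';

-- Revenue by region, year-over-year: GROUP BY plus a window function
SELECT region, year, revenue,
       revenue / LAG(revenue) OVER (PARTITION BY region ORDER BY year) - 1 AS yoy_growth
FROM (
  SELECT region, EXTRACT(year FROM sale_date) AS year, SUM(amount) AS revenue
  FROM curated_sales
  GROUP BY region, year
)
ORDER BY region, year;

None of these are retrievable as "similar chunks"; they are computations over the full table.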

The Traditional RAG Approach

Here is what the typical RAG pipeline looks like for Parquet data:

# Step 1: Read Parquet files
import pyarrow.parquet as pq

df = pq.read_table("s3://my-bucket/curated/sales/").to_pandas()

# Step 2: Convert rows to text chunks
chunks = []
for _, row in df.iterrows():
    chunk = f"Sale: {row['product']} in {row['region']} for ${row['amount']} on {row['date']}"
    chunks.append(chunk)

# Step 3: Embed chunks (this is expensive)
from itertools import batched  # Python 3.12+
from openai import OpenAI

client = OpenAI()
embeddings = []
# 50M rows × $0.0001/1K tokens = ~$500+ just for embeddings
for batch in batched(chunks, 100):
    resp = client.embeddings.create(model="text-embedding-3-small", input=list(batch))
    embeddings.extend([e.embedding for e in resp.data])

# Step 4: Store in vector database
import chromadb

chroma = chromadb.Client()
collection = chroma.create_collection("sales")
collection.add(documents=chunks, embeddings=embeddings, ids=[str(i) for i in range(len(chunks))])

# Step 5: Query
results = collection.query(query_texts=["total sales last quarter"], n_results=10)
# Returns 10 individual sale records, not a SUM

Problems with this approach:

  1. Cost: Embedding 50M rows costs hundreds of dollars and takes hours.
  2. Staleness: Every data update requires re-embedding.
  3. Inaccuracy: Retrieval returns similar rows, not aggregated answers.
  4. No SQL: Cannot do GROUP BY, SUM, JOIN, window functions.
  5. Unauditable: You cannot inspect what data the LLM actually saw.

The SQL Approach with DataSpoc Lens

Instead of embedding your data, mount it as SQL tables and let the LLM write queries:

pip install dataspoc-lens
from dataspoc_lens import LensClient
lens = LensClient()
# No embedding. No vector store. Just SQL over your existing Parquet files.
answer = lens.ask("What were total sales last quarter?")
print(answer)
# → Executed: SELECT SUM(amount) FROM curated_sales WHERE sale_date >= '2026-01-01'
# → Result: $4,230,000

That is it. The Parquet files stay where they are in your cloud bucket. Lens uses DuckDB to read them directly. The LLM generates SQL instead of searching for similar text.
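
Under the hood this pattern is plain DuckDB, and you can reproduce the core of it yourself. A minimal sketch, assuming DuckDB's httpfs extension can see your AWS credentials and reusing the bucket path from the earlier example:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # one-time: lets DuckDB read s3:// paths
con.execute("LOAD httpfs")

# Scan the Parquet files in place -- no copy, no embedding step
total = con.sql("""
    SELECT SUM(amount) AS total_sales
    FROM read_parquet('s3://my-bucket/curated/sales/*.parquet')
    WHERE sale_date >= DATE '2026-01-01' AND sale_date < DATE '2026-04-01'
""").fetchone()[0]
print(total)

What Lens adds on top of this is the natural-language-to-SQL step, schema discovery, and caching; the data access itself is a straight Parquet scan.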

Building Both: A Fair Comparison

Let us build the same question-answering system both ways and compare.

RAG Version (Complete)

"""RAG-based data Q&A over Parquet files."""
import pandas as pd
import chromadb
from openai import OpenAI
client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
def build_rag_index(parquet_path: str, collection_name: str):
"""Embed Parquet data into ChromaDB. Run once (or on every update)."""
df = pd.read_parquet(parquet_path)
collection = chroma_client.get_or_create_collection(collection_name)
# Convert each row to a text document
documents = []
for _, row in df.iterrows():
doc = " | ".join(f"{col}: {val}" for col, val in row.items())
documents.append(doc)
# Batch embed and store
batch_size = 100
for i in range(0, len(documents), batch_size):
batch = documents[i : i + batch_size]
ids = [str(j) for j in range(i, i + len(batch))]
collection.add(documents=batch, ids=ids)
return collection
def rag_query(question: str, collection_name: str) -> str:
"""Retrieve relevant rows and ask LLM to answer."""
collection = chroma_client.get_collection(collection_name)
results = collection.query(query_texts=[question], n_results=20)
context = "\n".join(results["documents"][0])
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Answer based on the data provided. If you cannot determine the exact answer from the context, say so."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
],
)
return response.choices[0].message.content
# Usage
collection = build_rag_index("s3://bucket/curated/sales/data.parquet", "sales")
answer = rag_query("What were total sales last quarter?", "sales")
print(answer)
# → "Based on the retrieved records, I can see several sales entries...
# I cannot provide an exact total as I only have a sample of 20 records."

SQL Version with Lens (Complete)

"""SQL-based data Q&A over Parquet files with DataSpoc Lens."""
from dataspoc_lens import LensClient
lens = LensClient()
def sql_query(question: str) -> str:
"""Ask a question, get a precise answer via SQL."""
return lens.ask(question)
# Direct natural language
answer = sql_query("What were total sales last quarter?")
print(answer)
# → "$4,230,512.00 — Query: SELECT SUM(amount) FROM curated_sales
# WHERE sale_date >= '2026-01-01' AND sale_date < '2026-04-01'"
# Or explicit SQL for full control
df = lens.query("""
SELECT
region,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS total_sales,
COUNT(*) AS num_orders,
AVG(amount) AS avg_order_value
FROM curated_sales
WHERE sale_date >= '2025-01-01'
GROUP BY region, month
ORDER BY region, month
""")
print(df)

Benchmark: 5 Questions on the Same Dataset

We ran both approaches on a 10M-row sales dataset in Parquet:

| Question | RAG Answer | SQL/Lens Answer | Correct? |
| --- | --- | --- | --- |
| Total sales last quarter | "Approximately $4.2M based on sample" | "$4,230,512.00" | RAG: partial, Lens: exact |
| Top 5 products by revenue | Listed 5 products from retrieved rows | Exact top 5 with amounts | RAG: wrong order, Lens: correct |
| MoM growth rate | "Unable to calculate from context" | "+12.3% March vs February" | RAG: failed, Lens: correct |
| Customers with > $10K spend | Retrieved some high-value rows | Complete list of 847 customers | RAG: incomplete, Lens: complete |
| Average order value by region | Estimated from 20 rows | Exact AVG from all 10M rows | RAG: inaccurate, Lens: exact |

Score: RAG 0/5, SQL 5/5 on analytical questions.

Cost Comparison

For a 10M-row dataset, updated daily:

| Cost Factor | RAG | SQL/Lens |
| --- | --- | --- |
| Initial embedding | ~$50 (10M rows) | $0 |
| Daily re-embedding | ~$50/day (full refresh) | $0 |
| Vector DB hosting | ~$100/month (managed) | $0 |
| Per-query (LLM) | ~$0.03 (context + answer) | ~$0.01 (SQL generation only) |
| Per-query (compute) | Similarity search | DuckDB query |
| Monthly total (100 queries/day) | ~$1,690 | ~$30 |
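
The monthly totals follow directly from the per-unit figures: for RAG, 30 days × $50 re-embedding + $100 hosting + 3,000 queries × $0.03 ≈ $1,500 + $100 + $90 = $1,690; for Lens, 3,000 queries × $0.01 = $30.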

Latency Comparison

| Operation | RAG | SQL/Lens |
| --- | --- | --- |
| First query (cold) | 500ms (vector search + LLM) | 2-5s (Parquet scan + LLM) |
| Subsequent queries | 500ms | 200ms (cached) |
| Aggregation query | 500ms (but wrong answer) | 1-3s (full table scan) |

Lens is slower on the first query because it reads Parquet files from cloud storage. After the first query, DuckDB caches the data and subsequent queries are faster. You can also proactively cache:

# Pre-warm the cache for tables you know will be queried
lens.cache_refresh("curated_sales")
lens.cache_refresh("curated_customers")
# Check cache status
status = lens.cache_status()
print(status)
# {'curated_sales': {'cached': True, 'rows': 10000000, 'size_mb': 450}}

When to Use Each Approach

Use RAG When:

  • Your data is unstructured (documents, emails, support tickets)
  • You need semantic search (“find records similar to this one”)
  • The answer is a specific passage in a document
  • Your data is small (< 100K documents)

Use SQL/Lens When:

  • Your data is structured (tables, columns, rows)
  • You need aggregations (SUM, AVG, COUNT, GROUP BY)
  • The answer requires exact numbers
  • Your data is in Parquet files in a cloud bucket
  • You need auditability (every answer traces to a SQL query)

Use Both When:

  • You have mixed data: structured tables + unstructured documents
  • Build two tools and let the LLM decide which to use per question
# Hybrid: Let the agent choose between RAG and SQL
from dataspoc_lens import LensClient

lens = LensClient()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "query_structured_data",
            "description": "Query structured data (sales, orders, metrics) using SQL.",
            "parameters": {
                "type": "object",
                "properties": {"sql": {"type": "string"}},
                "required": ["sql"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": "Search unstructured documents (policies, manuals, tickets).",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]
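
The routing loop itself is ordinary tool calling. Below is a minimal sketch of the dispatch, not Lens's own agent runtime: the search_documents stub stands in for whatever document index you already have (for example, the ChromaDB collection from the RAG section), and lens.query returning a DataFrame follows the earlier example.

import json
from openai import OpenAI

client = OpenAI()


def search_documents(query: str) -> str:
    # Stub: back this with your existing document index (e.g., the ChromaDB collection above)
    return "No document index configured."


def ask(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    if not msg.tool_calls:
        return msg.content  # model answered directly

    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        if call.function.name == "query_structured_data":
            result = lens.query(args["sql"]).to_string()  # SQL path for structured tables
        else:
            result = search_documents(args["query"])  # retrieval path for documents
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
    return final.choices[0].message.content


print(ask("What were total sales last quarter?"))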

The takeaway: for structured data in a lake, SQL beats embedding-based retrieval on the dimensions that matter most: accuracy, cost, auditability, and (once the cache is warm) latency. Save RAG for what it was designed for: unstructured text.
