Tags: langchain · sql · python · data-engineering · comparison

LangChain SQLDatabaseChain vs DataSpoc Lens: Which Is Better for Data Queries?

Michael San Martim · 2026-04-16

LangChain’s SQLDatabaseChain was one of the first tools that let LLMs query databases. DataSpoc Lens takes a different approach: a purpose-built query engine over cloud Parquet. Both let you ask natural language questions about data. They solve the problem very differently.

This post puts them side by side with real code, then gives you a verdict.

The Setup: LangChain SQLDatabaseChain

LangChain needs a database connection, an LLM, and a chain that ties them together:

pip install langchain langchain-openai langchain-community sqlalchemy psycopg2-binary
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
# 1. Connect to your database
db = SQLDatabase.from_uri(
    "postgresql://user:password@host:5432/analytics",
    include_tables=["orders", "customers", "products"],  # limit scope
    sample_rows_in_table_info=3,  # include sample rows in schema prompt
)
# 2. Create LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# 3. Create the chain
query_chain = create_sql_query_chain(llm, db)
execute_tool = QuerySQLDataBaseTool(db=db)
# 4. Ask a question
sql = query_chain.invoke({"question": "What were total sales by region last quarter?"})
print(f"Generated SQL: {sql}")
result = execute_tool.invoke(sql)
print(f"Result: {result}")

This works, but notice what you need: a running PostgreSQL instance, credentials in your code, network access to the database, and the right SQLAlchemy driver installed.

The Setup: DataSpoc Lens

Lens connects directly to Parquet files in your cloud bucket:

pip install dataspoc-lens
from dataspoc_lens import LensClient
lens = LensClient() # reads config from ~/.dataspoc/config.yaml or env vars
# List what's available
print(lens.tables())
# ['raw_orders', 'raw_customers', 'curated_sales', 'gold_revenue']
# Ask a question (natural language → SQL → execute → result)
answer = lens.ask("What were total sales by region last quarter?")
print(answer)
# Or write SQL directly
df = lens.query("""
    SELECT region, SUM(amount) AS total_sales
    FROM curated_sales
    WHERE sale_date >= '2026-01-01'
    GROUP BY region
    ORDER BY total_sales DESC
""")
print(df)

No database server. No credentials in code. No SQLAlchemy. The data lives as Parquet files in S3/GCS/Azure, and Lens reads them with DuckDB.
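
To make the "reads them with DuckDB" point concrete, here is roughly what an in-process engine does when pointed at a bucket. This is plain DuckDB, not Lens internals, and the bucket path is made up:

import duckdb

# Scan Parquet straight from object storage — no server, the engine runs in-process
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables s3:// (and gcs://) paths
df = con.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM read_parquet('s3://example-lake/curated_sales/*.parquet')  -- path is illustrative
    GROUP BY region
""").df()  # assumes cloud credentials are already available in the environment
print(df)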

Side-by-Side: The Same 5 Tasks

Task 1: List Available Tables

# LangChain
print(db.get_usable_table_names())
# Requires active DB connection
# DataSpoc Lens
print(lens.tables())
# Reads from bucket manifest — no database needed

Task 2: Explore a Schema

# LangChain
print(db.get_table_info(table_names=["orders"]))
# Returns CREATE TABLE + sample rows (sent to LLM as context)
# DataSpoc Lens
print(lens.schema("raw_orders"))
# Returns column names and types as a dict

Task 3: Natural Language Query

# LangChain
chain = create_sql_query_chain(llm, db)
sql = chain.invoke({"question": "Top 5 customers by lifetime value"})
result = execute_tool.invoke(sql)
# DataSpoc Lens
answer = lens.ask("Top 5 customers by lifetime value")

Task 4: Direct SQL

# LangChain
result = db.run("SELECT customer_id, SUM(amount) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 5")
# DataSpoc Lens
df = lens.query("SELECT customer_id, SUM(amount) FROM raw_orders GROUP BY 1 ORDER BY 2 DESC LIMIT 5")
# Returns a pandas DataFrame, not a string
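
Because lens.query already hands back a DataFrame, downstream steps stay in pandas. The rename below assumes the two columns the query above produces; adjust names to your query:

# Work with the result like any other DataFrame (column names are illustrative)
df.columns = ["customer_id", "lifetime_value"]
print(df.head())
df.to_csv("top_customers.csv", index=False)  # hand off to notebooks, BI tools, etc.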

Task 5: Use with an AI Agent

# LangChain — define as a tool
from langchain.agents import create_openai_tools_agent, AgentExecutor
from langchain.tools import Tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# The agent prompt needs an agent_scratchpad slot for intermediate tool calls
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a data analyst with read-only SQL tools."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])
tools = [
    Tool(name="query_db", func=execute_tool.invoke, description="Run SQL"),
    Tool(name="list_tables", func=lambda _: db.get_usable_table_names(), description="List tables"),
]
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
executor.invoke({"input": "Analyze sales trends"})
# DataSpoc Lens — native MCP server
# No code needed. Just configure the MCP server:
# dataspoc-lens mcp
# Any MCP-compatible agent (Claude, Cursor, etc.) connects directly.
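
For reference, wiring that server into an MCP client is usually a single config entry. The exact file location depends on the client, and the "dataspoc-lens" key below is just a label, not a required name:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}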

Comparison Matrix

Criteria | LangChain SQLDatabaseChain | DataSpoc Lens
Data source | Any SQL database (Postgres, MySQL, etc.) | Parquet in S3/GCS/Azure
Infrastructure | Running database + network access | Cloud bucket only
Credentials | Connection string with user/password | Cloud IAM (no secrets)
Query engine | Database’s own engine | DuckDB (in-process)
Natural language | LLM generates SQL via chain | lens.ask() built-in
Agent integration | LangChain agents/tools | MCP server (any client)
Return type | String | pandas DataFrame
Caching | None built-in | lens.cache_status(), lens.cache_refresh()
Setup time | 10-30 minutes | 2 minutes
Cost | Database compute + LLM tokens | LLM tokens only
Write safety | Can execute INSERT/UPDATE/DELETE | Read-only by design
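
The caching row deserves a quick illustration. Both calls are named in the matrix, but their return shapes are not documented here, so treat the output handling as a sketch:

# Inspect the local cache, then force a refresh before a heavy run (behavior assumed)
print(lens.cache_status())
lens.cache_refresh()
df = lens.query("SELECT COUNT(*) AS n FROM curated_sales")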

Where LangChain Wins

LangChain’s SQL chain is the right choice when:

  1. Your data lives in a traditional database. If it is in Postgres, MySQL, or Snowflake and you do not want to move it, LangChain connects directly.

  2. You need to work across multiple databases. LangChain can connect to several databases through separate chains, and an agent can route each question to the right one.

  3. You are already in the LangChain ecosystem. If your app uses LangChain for other things (document retrieval, chat memory, tool orchestration), adding SQLDatabaseChain is natural.

# LangChain excels at ad-hoc DB access
from langchain_community.utilities import SQLDatabase
# Connect to your production read replica
db = SQLDatabase.from_uri("postgresql://readonly:pass@replica:5432/prod")

Where DataSpoc Lens Wins

Lens is the right choice when:

  1. Your data is in a lake (Parquet in S3/GCS/Azure). Lens is purpose-built for this. No database to manage.

  2. You want zero infrastructure. No database server, no connection pooling, no read replicas. DuckDB runs in-process.

  3. You need agent-native access. The MCP server means any MCP client (Claude Code, Cursor, custom agents) connects without writing wrapper code.

  4. Security matters. Cloud IAM controls access. No database credentials to manage or leak. Read-only by design.

# Lens excels at data lake access
from dataspoc_lens import LensClient
lens = LensClient()
# Data comes from Parquet in your bucket — no DB needed
df = lens.query("SELECT * FROM curated_sales WHERE region = 'EMEA' LIMIT 100")

Hybrid: Use Both

If you have data in both a database and a lake, you can use both in the same agent:

import json
from openai import OpenAI
from langchain_community.utilities import SQLDatabase
from dataspoc_lens import LensClient
openai_client = OpenAI()
db = SQLDatabase.from_uri("postgresql://readonly:pass@host:5432/app")
lens = LensClient()
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "query_app_db",
            "description": "Query the application database (users, sessions, settings).",
            "parameters": {
                "type": "object",
                "properties": {"sql": {"type": "string"}},
                "required": ["sql"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "query_data_lake",
            "description": "Query the data lake (sales, events, analytics).",
            "parameters": {
                "type": "object",
                "properties": {"sql": {"type": "string"}},
                "required": ["sql"],
            },
        },
    },
]

def dispatch(name, args):
    if name == "query_app_db":
        return db.run(args["sql"])
    elif name == "query_data_lake":
        df = lens.query(args["sql"])
        return df.to_json(orient="records")
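
The snippet above only defines the tools. A minimal dispatch loop looks roughly like this; the model name and the example question are placeholders:

# Let the model choose a tool, run it via dispatch(), and feed the result back
messages = [{"role": "user", "content": "Compare Q1 signups in the app DB to Q1 revenue in the lake."}]
response = openai_client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
msg = response.choices[0].message
while msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        result = dispatch(call.function.name, json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    response = openai_client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
print(msg.content)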

The Verdict

Use LangChain SQLDatabaseChain when your data is in a traditional database and you are building a LangChain application.

Use DataSpoc Lens when your data is in a cloud data lake (Parquet files), you want zero infrastructure, or you need agent-native MCP access.

Use both when your data spans databases and lakes. Let the LLM decide which tool to call based on the question.

The real insight: the tool matters less than the data architecture. If your data is already in Parquet in a cloud bucket (and it should be for analytics), Lens gives you the simplest path from question to answer.
