How to Build an MCP Server for Your Data Lake
MCP (Model Context Protocol) lets AI agents call tools — like querying a database. DataSpoc Lens ships with a built-in MCP server that turns your Parquet data lake into a queryable API for Claude, GPT, LangGraph agents, and any MCP-compatible client.
This tutorial shows you how to go from raw data in S3 to a fully functional MCP server in under 10 minutes.
The End Result
After setup, any AI agent can:
- Discover tables in your data lake
- Inspect schemas and sample data
- Run SQL queries with DuckDB
- Ask natural language questions
- Get pre-built metrics and KPIs
All without any custom code.
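Under the hood, MCP clients and servers exchange JSON-RPC 2.0 messages over the chosen transport (stdio by default). As a rough sketch, assuming the standard MCP framing (real clients also perform an initialize handshake first), a `tools/call` request for the `sample_data` tool looks like this:

```python
import json

def tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# A client asking the server to sample 5 rows from the orders table
msg = tool_call(1, "sample_data", {"table": "raw.postgres.orders", "limit": 5})
print(msg)
```

You never write this framing yourself; Claude Desktop, LangGraph, and other MCP clients handle it for you.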
Step 1: Ingest Data with Pipe
If your data is already in Parquet on S3/GCS/Azure, skip to Step 2. Otherwise, extract from your sources:
```bash
pip install dataspoc-pipe

dataspoc-pipe init sales-pipeline

dataspoc-pipe add postgres \
  --host db.company.com \
  --database sales \
  --tables orders,customers,products,revenue \
  --incremental updated_at \
  --destination s3://company-lake

dataspoc-pipe run
```

Your bucket now has:

```
s3://company-lake/
  .dataspoc/manifest.json
  raw/postgres/orders/*.parquet
  raw/postgres/customers/*.parquet
  raw/postgres/products/*.parquet
  raw/postgres/revenue/*.parquet
```

Step 2: Install DataSpoc Lens with MCP

```bash
pip install dataspoc-lens[mcp]
```

This installs the core Lens CLI plus the MCP server dependencies.
Step 3: Configure and Start the MCP Server
```bash
# Point Lens at your bucket
dataspoc-lens add-bucket s3://company-lake

# Discover tables
dataspoc-lens discover
# Found 4 tables: orders, customers, products, revenue

# Start the MCP server
dataspoc-lens mcp
```

Output:

```
DataSpoc Lens MCP Server running
Transport: stdio
Tables: 4 discovered
Tools: 7 available
  - list_tables
  - get_schema
  - sample_data
  - query
  - ask
  - get_metrics
  - explain_table

Waiting for MCP client connection...
```

Step 4: Configure Claude Desktop
Add to your Claude Desktop config (~/.config/claude/claude_desktop_config.json on Linux, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
```json
{
  "mcpServers": {
    "data-lake": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "AWS_PROFILE": "data-lake",
        "DATASPOC_BUCKET": "s3://company-lake"
      }
    }
  }
}
```

Restart Claude Desktop. You’ll see the tools icon indicating 7 available tools.
The 7 MCP Tools
1. list_tables — Discover Available Data
Call:
```json
{ "tool": "list_tables" }
```

Response:

```json
{
  "tables": [
    {"name": "raw.postgres.orders", "rows": 1247893, "size_mb": 340},
    {"name": "raw.postgres.customers", "rows": 89421, "size_mb": 12},
    {"name": "raw.postgres.products", "rows": 2431, "size_mb": 1.1},
    {"name": "raw.postgres.revenue", "rows": 4201847, "size_mb": 890}
  ]
}
```

2. get_schema — Inspect Table Structure
Call:
```json
{
  "tool": "get_schema",
  "arguments": {"table": "raw.postgres.orders"}
}
```

Response:

```json
{
  "table": "raw.postgres.orders",
  "columns": [
    {"name": "id", "type": "INTEGER", "nullable": false},
    {"name": "customer_id", "type": "INTEGER", "nullable": false},
    {"name": "product_id", "type": "INTEGER", "nullable": false},
    {"name": "amount", "type": "DECIMAL(10,2)", "nullable": false},
    {"name": "status", "type": "VARCHAR", "nullable": false},
    {"name": "created_at", "type": "TIMESTAMP", "nullable": false},
    {"name": "updated_at", "type": "TIMESTAMP", "nullable": false}
  ],
  "row_count": 1247893
}
```

3. sample_data — Preview Rows
Call:
```json
{
  "tool": "sample_data",
  "arguments": {"table": "raw.postgres.orders", "limit": 5}
}
```

Response:

```json
{
  "table": "raw.postgres.orders",
  "sample": [
    {"id": 1, "customer_id": 42, "product_id": 7, "amount": 299.00, "status": "completed", "created_at": "2024-01-15T10:23:00"},
    {"id": 2, "customer_id": 15, "product_id": 3, "amount": 49.99, "status": "completed", "created_at": "2024-01-15T11:05:00"},
    {"id": 3, "customer_id": 42, "product_id": 12, "amount": 799.00, "status": "pending", "created_at": "2024-01-15T14:30:00"}
  ]
}
```

4. query — Execute SQL
Call:
```json
{
  "tool": "query",
  "arguments": {
    "sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(amount) AS revenue FROM raw.postgres.orders WHERE created_at >= '2024-01-01' GROUP BY 1 ORDER BY 1"
  }
}
```

Response:

```json
{
  "columns": ["month", "revenue"],
  "rows": [
    {"month": "2024-01-01", "revenue": 342891.50},
    {"month": "2024-02-01", "revenue": 389012.75},
    {"month": "2024-03-01", "revenue": 421547.00}
  ],
  "row_count": 3,
  "execution_time_ms": 234
}
```

5. ask — Natural Language Questions
Call:
```json
{
  "tool": "ask",
  "arguments": {
    "question": "Which customers spent the most last quarter?"
  }
}
```

Response:

```json
{
  "answer": "The top 5 customers by spending last quarter were: Acme Corp ($89,234), GlobalTech ($67,891), Initech ($54,320), Umbrella Inc ($48,900), and Wayne Enterprises ($45,670).",
  "sql": "SELECT c.name, SUM(o.amount) AS total_spent FROM raw.postgres.orders o JOIN raw.postgres.customers c ON o.customer_id = c.id WHERE o.created_at >= '2024-07-01' AND o.created_at < '2024-10-01' GROUP BY c.name ORDER BY total_spent DESC LIMIT 5",
  "data": [
    {"name": "Acme Corp", "total_spent": 89234.00},
    {"name": "GlobalTech", "total_spent": 67891.00},
    {"name": "Initech", "total_spent": 54320.00},
    {"name": "Umbrella Inc", "total_spent": 48900.00},
    {"name": "Wayne Enterprises", "total_spent": 45670.00}
  ]
}
```

6. get_metrics — Pre-Built KPIs
Call:
```json
{
  "tool": "get_metrics",
  "arguments": {"period": "last_30_days"}
}
```

Response:

```json
{
  "period": "2024-09-15 to 2024-10-15",
  "metrics": {
    "total_revenue": 487293.00,
    "order_count": 12847,
    "avg_order_value": 37.93,
    "unique_customers": 3421,
    "new_customers": 289
  }
}
```

7. explain_table — Data Documentation
Call:
```json
{
  "tool": "explain_table",
  "arguments": {"table": "raw.postgres.orders"}
}
```

Response:

```json
{
  "table": "raw.postgres.orders",
  "description": "Transaction records from the e-commerce platform. Each row represents one purchase.",
  "columns": {
    "id": "Unique order identifier",
    "customer_id": "FK to customers table",
    "amount": "Total order value in USD",
    "status": "One of: pending, completed, refunded, cancelled",
    "created_at": "When the order was placed"
  },
  "relationships": ["customers (customer_id)", "products (product_id)"],
  "common_queries": [
    "Revenue by month: SELECT DATE_TRUNC('month', created_at), SUM(amount) FROM orders GROUP BY 1",
    "Orders by status: SELECT status, COUNT(*) FROM orders GROUP BY 1"
  ]
}
```

Realistic Multi-Turn Agent Conversation
Here’s what it looks like when Claude uses the MCP server:
User: “How is our retention trending? Are we losing customers?”
Claude’s internal tool calls:
- Calls `list_tables` to see what data is available
- Calls `get_schema("raw.postgres.orders")` to understand the orders structure
- Calls `get_schema("raw.postgres.customers")` to check for signup dates
- Calls `query` with:

```sql
WITH monthly_cohorts AS (
  SELECT
    DATE_TRUNC('month', c.created_at) AS cohort_month,
    DATE_TRUNC('month', o.created_at) AS order_month,
    COUNT(DISTINCT o.customer_id) AS active_customers
  FROM raw.postgres.customers c
  JOIN raw.postgres.orders o ON c.id = o.customer_id
  WHERE c.created_at >= '2024-01-01'
  GROUP BY 1, 2
),
cohort_sizes AS (
  SELECT
    DATE_TRUNC('month', created_at) AS cohort_month,
    COUNT(DISTINCT id) AS cohort_size
  FROM raw.postgres.customers
  WHERE created_at >= '2024-01-01'
  GROUP BY 1
)
SELECT
  mc.cohort_month,
  mc.order_month,
  mc.active_customers,
  cs.cohort_size,
  ROUND(mc.active_customers::FLOAT / cs.cohort_size * 100, 1) AS retention_pct
FROM monthly_cohorts mc
JOIN cohort_sizes cs ON mc.cohort_month = cs.cohort_month
ORDER BY mc.cohort_month, mc.order_month
```

Claude’s response:
Looking at cohort retention data, here’s the trend:
| Cohort   | Month 1 | Month 2 | Month 3 | Month 6 |
|----------|---------|---------|---------|---------|
| Jan 2024 | 100%    | 68%     | 52%     | 34%     |
| Apr 2024 | 100%    | 71%     | 55%     | 38%     |
| Jul 2024 | 100%    | 74%     | 59%     | —       |

Good news: retention is actually improving. Month-2 retention went from 68% to 74% over the year. The product changes in Q2 (self-serve onboarding) appear to be working.
However, the Month-6 drop to 34% suggests we lose customers around the 6-month mark — possibly when annual contracts come up for renewal.
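The retention percentages above are just active customers divided by cohort size, exactly what the `retention_pct` column in the SQL computes. A tiny pure-Python sketch of the same arithmetic, using illustrative numbers for a hypothetical January cohort of 1,200 customers:

```python
def retention_pct(active_customers: int, cohort_size: int) -> float:
    """Share of a signup cohort still placing orders in a given month."""
    return round(active_customers / cohort_size * 100, 1)

# Illustrative cohort: 1,200 signups in January, decaying over time
jan_cohort_size = 1200
active_by_month = {1: 1200, 2: 816, 3: 624, 6: 408}

retention = {m: retention_pct(n, jan_cohort_size) for m, n in active_by_month.items()}
print(retention)  # {1: 100.0, 2: 68.0, 3: 52.0, 6: 34.0}
```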
Security Model
The MCP server is read-only by design:
- No INSERT, UPDATE, DELETE, DROP, or CREATE statements are allowed
- Queries are validated before execution
- The server runs with the IAM permissions of the process (your AWS profile)
- Bucket-level access control means agents only see data they’re authorized to see
```bash
# Agent can only access buckets your IAM role permits
export AWS_PROFILE=analyst   # Has access to s3://company-product only
dataspoc-lens mcp            # Only exposes tables in permitted buckets
```

No API keys to manage. No auth tokens. Just IAM.
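Lens's actual query validator isn't shown here, but the read-only guarantee can be sketched as a keyword guard. This is an illustrative assumption, not DataSpoc's implementation; a production validator would parse statements with a real SQL parser rather than pattern-match:

```python
import re

# Illustrative guard only (not DataSpoc's actual validator): reject any
# statement containing a write or DDL keyword before execution.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Return True only if the statement has no write/DDL keywords."""
    return FORBIDDEN.search(sql) is None

print(is_read_only("SELECT COUNT(*) FROM raw.postgres.orders"))  # True
print(is_read_only("DROP TABLE raw.postgres.orders"))            # False
```

Note the word boundaries: column names like `created_at` or `updated_at` do not trip the guard, because `CREATE` and `UPDATE` must appear as standalone words.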
Python SDK for CrewAI / LangGraph
If you’re building agents in Python rather than using Claude Desktop, use the SDK directly:
```python
from dataspoc_lens import LensClient

# Initialize (reads from same config as CLI)
client = LensClient(bucket="s3://company-lake")

# Same capabilities as MCP tools
tables = client.list_tables()
schema = client.get_schema("raw.postgres.orders")
result = client.query("SELECT COUNT(*) FROM raw.postgres.orders")
answer = client.ask("What's our churn rate?")
```

CrewAI Integration
```python
from crewai import Agent, Task, Crew
from crewai_tools import tool
from dataspoc_lens import LensClient

client = LensClient()

@tool
def query_data_lake(sql: str) -> str:
    """Execute a SQL query against the company data lake."""
    result = client.query(sql)
    return str(result)

@tool
def discover_tables() -> str:
    """List all available tables in the data lake."""
    tables = client.list_tables()
    return "\n".join([f"{t['name']} ({t['rows']} rows)" for t in tables])

analyst = Agent(
    role="Data Analyst",
    goal="Answer business questions using SQL on the data lake",
    tools=[query_data_lake, discover_tables],
    llm="gpt-4o",
)

task = Task(
    description="Analyze customer lifetime value by acquisition channel",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[task])
result = crew.kickoff()
```

LangGraph Integration
See the full tutorial: Building a Data Analyst Agent with LangGraph.
The AGENT.md Pattern
For agents that work autonomously with your data lake, create an AGENT.md file that describes the available data:
```markdown
# Data Lake Agent Instructions

## Available Data

You have access to a data lake via the DataSpoc Lens MCP server.

## Tables

- `raw.postgres.orders` — All orders (1.2M rows). Key columns: customer_id, amount, status, created_at
- `raw.postgres.customers` — Customer profiles (89K rows). Key columns: name, email, plan, created_at
- `raw.postgres.products` — Product catalog (2.4K rows). Key columns: name, category, price
- `curated.finance.revenue` — Monthly revenue rollups (4.2M rows). Key columns: month, mrr, arr, churn_rate

## Common Queries

- Revenue by month: `SELECT DATE_TRUNC('month', created_at), SUM(amount) FROM raw.postgres.orders GROUP BY 1`
- Active customers: `SELECT COUNT(DISTINCT customer_id) FROM raw.postgres.orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'`
- Churn: `SELECT month, churn_rate FROM curated.finance.revenue ORDER BY month DESC LIMIT 12`

## Rules

- Always use the `query` tool for exact numbers
- Use `ask` for exploratory questions when you're not sure about the schema
- Never modify data — all queries are read-only
- When presenting numbers, always show the SQL you used (for auditability)
```

Place this in your project root. Agents that read project context (Claude Code, Cursor, Cline) will use it to understand how to interact with your data.
Running in Production
For production deployments, run the MCP server as a persistent process:
```ini
# systemd service
[Unit]
Description=DataSpoc Lens MCP Server
After=network.target

[Service]
Type=simple
User=dataspoc
ExecStart=/usr/local/bin/dataspoc-lens mcp --transport sse --port 8080
Environment=AWS_PROFILE=data-lake
Environment=DATASPOC_BUCKET=s3://company-lake
Restart=always

[Install]
WantedBy=multi-user.target
```

Or with Docker:

```dockerfile
FROM python:3.11-slim
RUN pip install dataspoc-lens[mcp]
ENV DATASPOC_BUCKET=s3://company-lake
CMD ["dataspoc-lens", "mcp", "--transport", "sse", "--port", "8080"]
```

```bash
docker run -d \
  -p 8080:8080 \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e DATASPOC_BUCKET=s3://company-lake \
  dataspoc-lens-mcp
```

Summary
In 4 steps, you’ve turned a pile of Parquet files into an intelligent API for AI agents:
1. Ingest data with Pipe (or bring your own Parquet)
2. Install `dataspoc-lens[mcp]`
3. Start `dataspoc-lens mcp`
4. Connect Claude Desktop, LangGraph, CrewAI, or any MCP client
Your data lake is now a first-class tool for any AI agent — discoverable, queryable, and secure.