Tags: mcp · claude · ai-agents · data-lake · tutorial

How to Build an MCP Server for Your Data Lake

Michael San Martim · 2026-04-24

MCP (Model Context Protocol) lets AI agents call tools — like querying a database. DataSpoc Lens ships with a built-in MCP server that turns your Parquet data lake into a queryable API for Claude, GPT, LangGraph agents, and any MCP-compatible client.

This tutorial shows you how to go from raw data in S3 to a fully functional MCP server in under 10 minutes.

The End Result

After setup, any AI agent can:

  • Discover tables in your data lake
  • Inspect schemas and sample data
  • Run SQL queries with DuckDB
  • Ask natural language questions
  • Get pre-built metrics and KPIs

All without any custom code.

Step 1: Ingest Data with Pipe

If your data is already in Parquet on S3/GCS/Azure, skip to Step 2. Otherwise, extract from your sources:

Terminal window
pip install dataspoc-pipe

dataspoc-pipe init sales-pipeline

dataspoc-pipe add postgres \
  --host db.company.com \
  --database sales \
  --tables orders,customers,products,revenue \
  --incremental updated_at \
  --destination s3://company-lake

dataspoc-pipe run

Your bucket now has:

s3://company-lake/
├── .dataspoc/manifest.json
├── raw/postgres/orders/*.parquet
├── raw/postgres/customers/*.parquet
├── raw/postgres/products/*.parquet
└── raw/postgres/revenue/*.parquet

Step 2: Install DataSpoc Lens with MCP

Terminal window
pip install 'dataspoc-lens[mcp]'   # quotes keep zsh from expanding the brackets

This installs the core Lens CLI plus the MCP server dependencies.

Step 3: Configure and Start the MCP Server

Terminal window
# Point Lens at your bucket
dataspoc-lens add-bucket s3://company-lake
# Discover tables
dataspoc-lens discover
# Found 4 tables: orders, customers, products, revenue
# Start the MCP server
dataspoc-lens mcp

Output:

DataSpoc Lens MCP Server running
Transport: stdio
Tables: 4 discovered
Tools: 7 available
- list_tables
- get_schema
- sample_data
- query
- ask
- get_metrics
- explain_table
Waiting for MCP client connection...
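
Before connecting a full client, you can smoke-test the server from a short script. Here's a minimal sketch using the official MCP Python SDK (assumed installed with pip install mcp; everything else mirrors the CLI setup above):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the Lens MCP server as a stdio subprocess
server = StdioServerParameters(command="dataspoc-lens", args=["mcp"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Should print the 7 tool names listed above
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            # Call a tool directly
            result = await session.call_tool("list_tables", {})
            print(result.content)

asyncio.run(main())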

Step 4: Configure Claude Desktop

Add to your Claude Desktop config (~/.config/claude/claude_desktop_config.json on Linux, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "data-lake": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "AWS_PROFILE": "data-lake",
        "DATASPOC_BUCKET": "s3://company-lake"
      }
    }
  }
}

Restart Claude Desktop. You’ll see the tools icon indicating 7 available tools.

The 7 MCP Tools

1. list_tables — Discover Available Data

Call:

{
  "tool": "list_tables"
}

Response:

{
  "tables": [
    {"name": "raw.postgres.orders", "rows": 1247893, "size_mb": 340},
    {"name": "raw.postgres.customers", "rows": 89421, "size_mb": 12},
    {"name": "raw.postgres.products", "rows": 2431, "size_mb": 1.1},
    {"name": "raw.postgres.revenue", "rows": 4201847, "size_mb": 890}
  ]
}

2. get_schema — Inspect Table Structure

Call:

{
  "tool": "get_schema",
  "arguments": {"table": "raw.postgres.orders"}
}

Response:

{
  "table": "raw.postgres.orders",
  "columns": [
    {"name": "id", "type": "INTEGER", "nullable": false},
    {"name": "customer_id", "type": "INTEGER", "nullable": false},
    {"name": "product_id", "type": "INTEGER", "nullable": false},
    {"name": "amount", "type": "DECIMAL(10,2)", "nullable": false},
    {"name": "status", "type": "VARCHAR", "nullable": false},
    {"name": "created_at", "type": "TIMESTAMP", "nullable": false},
    {"name": "updated_at", "type": "TIMESTAMP", "nullable": false}
  ],
  "row_count": 1247893
}

3. sample_data — Preview Rows

Call:

{
  "tool": "sample_data",
  "arguments": {"table": "raw.postgres.orders", "limit": 5}
}

Response:

{
  "table": "raw.postgres.orders",
  "sample": [
    {"id": 1, "customer_id": 42, "product_id": 7, "amount": 299.00, "status": "completed", "created_at": "2024-01-15T10:23:00"},
    {"id": 2, "customer_id": 15, "product_id": 3, "amount": 49.99, "status": "completed", "created_at": "2024-01-15T11:05:00"},
    {"id": 3, "customer_id": 42, "product_id": 12, "amount": 799.00, "status": "pending", "created_at": "2024-01-15T14:30:00"}
  ]
}

4. query — Execute SQL

Call:

{
  "tool": "query",
  "arguments": {
    "sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(amount) AS revenue FROM raw.postgres.orders WHERE created_at >= '2024-01-01' GROUP BY 1 ORDER BY 1"
  }
}

Response:

{
  "columns": ["month", "revenue"],
  "rows": [
    {"month": "2024-01-01", "revenue": 342891.50},
    {"month": "2024-02-01", "revenue": 389012.75},
    {"month": "2024-03-01", "revenue": 421547.00}
  ],
  "row_count": 3,
  "execution_time_ms": 234
}

5. ask — Natural Language Questions

Call:

{
  "tool": "ask",
  "arguments": {
    "question": "Which customers spent the most last quarter?"
  }
}

Response:

{
  "answer": "The top 5 customers by spending last quarter were: Acme Corp ($89,234), GlobalTech ($67,891), Initech ($54,320), Umbrella Inc ($48,900), and Wayne Enterprises ($45,670).",
  "sql": "SELECT c.name, SUM(o.amount) AS total_spent FROM raw.postgres.orders o JOIN raw.postgres.customers c ON o.customer_id = c.id WHERE o.created_at >= '2024-07-01' AND o.created_at < '2024-10-01' GROUP BY c.name ORDER BY total_spent DESC LIMIT 5",
  "data": [
    {"name": "Acme Corp", "total_spent": 89234.00},
    {"name": "GlobalTech", "total_spent": 67891.00},
    {"name": "Initech", "total_spent": 54320.00},
    {"name": "Umbrella Inc", "total_spent": 48900.00},
    {"name": "Wayne Enterprises", "total_spent": 45670.00}
  ]
}

6. get_metrics — Pre-Built KPIs

Call:

{
  "tool": "get_metrics",
  "arguments": {"period": "last_30_days"}
}

Response:

{
  "period": "2024-09-15 to 2024-10-15",
  "metrics": {
    "total_revenue": 487293.00,
    "order_count": 12847,
    "avg_order_value": 37.93,
    "unique_customers": 3421,
    "new_customers": 289
  }
}

7. explain_table — Data Documentation

Call:

{
  "tool": "explain_table",
  "arguments": {"table": "raw.postgres.orders"}
}

Response:

{
  "table": "raw.postgres.orders",
  "description": "Transaction records from the e-commerce platform. Each row represents one purchase.",
  "columns": {
    "id": "Unique order identifier",
    "customer_id": "FK to customers table",
    "amount": "Total order value in USD",
    "status": "One of: pending, completed, refunded, cancelled",
    "created_at": "When the order was placed"
  },
  "relationships": ["customers (customer_id)", "products (product_id)"],
  "common_queries": [
    "Revenue by month: SELECT DATE_TRUNC('month', created_at), SUM(amount) FROM orders GROUP BY 1",
    "Orders by status: SELECT status, COUNT(*) FROM orders GROUP BY 1"
  ]
}

Realistic Multi-Turn Agent Conversation

Here’s what it looks like when Claude uses the MCP server:

User: “How is our retention trending? Are we losing customers?”

Claude’s internal tool calls:

  1. Calls list_tables to see what data is available
  2. Calls get_schema("raw.postgres.orders") to understand the orders structure
  3. Calls get_schema("raw.postgres.customers") to check for signup dates
  4. Calls query with:
WITH monthly_cohorts AS (
  SELECT
    DATE_TRUNC('month', c.created_at) AS cohort_month,
    DATE_TRUNC('month', o.created_at) AS order_month,
    COUNT(DISTINCT o.customer_id) AS active_customers
  FROM raw.postgres.customers c
  JOIN raw.postgres.orders o ON c.id = o.customer_id
  WHERE c.created_at >= '2024-01-01'
  GROUP BY 1, 2
),
cohort_sizes AS (
  SELECT
    DATE_TRUNC('month', created_at) AS cohort_month,
    COUNT(DISTINCT id) AS cohort_size
  FROM raw.postgres.customers
  WHERE created_at >= '2024-01-01'
  GROUP BY 1
)
SELECT
  mc.cohort_month,
  mc.order_month,
  mc.active_customers,
  cs.cohort_size,
  ROUND(mc.active_customers::FLOAT / cs.cohort_size * 100, 1) AS retention_pct
FROM monthly_cohorts mc
JOIN cohort_sizes cs ON mc.cohort_month = cs.cohort_month
ORDER BY mc.cohort_month, mc.order_month

Claude’s response:

Looking at cohort retention data, here’s the trend:

Cohort      Month 1   Month 2   Month 3   Month 6
Jan 2024    100%      68%       52%       34%
Apr 2024    100%      71%       55%       38%
Jul 2024    100%      74%       59%      n/a

Good news: retention is actually improving. Month-2 retention went from 68% to 74% over the year. The product changes in Q2 (self-serve onboarding) appear to be working.

However, the Month-6 drop to 34% suggests we lose customers around the 6-month mark — possibly when annual contracts come up for renewal.

Security Model

The MCP server is read-only by design:

  • No INSERT, UPDATE, DELETE, DROP, or CREATE statements are allowed
  • Queries are validated before execution
  • The server runs with the IAM permissions of the process (your AWS profile)
  • Bucket-level access control means agents only see data they’re authorized to see

Terminal window
# Agent can only access buckets your IAM role permits
export AWS_PROFILE=analyst # Has access to s3://company-product only
dataspoc-lens mcp # Only exposes tables in permitted buckets

No API keys to manage. No auth tokens. Just IAM.
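
To make the read-only guarantee concrete, here's a hypothetical sketch of the kind of statement check a validator can perform. This is illustrative only, not DataSpoc Lens's actual implementation:

import re

# Illustrative only; not Lens's actual validator.
WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Reject statements containing write/DDL keywords.
    A production validator would parse the SQL rather than pattern-match,
    to avoid false positives on string literals containing these words."""
    return WRITE_KEYWORDS.search(sql) is None

assert is_read_only("SELECT COUNT(*) FROM raw.postgres.orders")
assert not is_read_only("DROP TABLE raw.postgres.orders")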

Python SDK for CrewAI / LangGraph

If you’re building agents in Python rather than using Claude Desktop, use the SDK directly:

from dataspoc_lens import LensClient

# Initialize (reads from same config as CLI)
client = LensClient(bucket="s3://company-lake")

# Same capabilities as MCP tools
tables = client.list_tables()
schema = client.get_schema("raw.postgres.orders")
result = client.query("SELECT COUNT(*) FROM raw.postgres.orders")
answer = client.ask("What's our churn rate?")

CrewAI Integration

from crewai import Agent, Task, Crew
from crewai_tools import tool
from dataspoc_lens import LensClient

client = LensClient()

@tool
def query_data_lake(sql: str) -> str:
    """Execute a SQL query against the company data lake."""
    result = client.query(sql)
    return str(result)

@tool
def discover_tables() -> str:
    """List all available tables in the data lake."""
    tables = client.list_tables()
    return "\n".join([f"{t['name']} ({t['rows']} rows)" for t in tables])

analyst = Agent(
    role="Data Analyst",
    goal="Answer business questions using SQL on the data lake",
    tools=[query_data_lake, discover_tables],
    llm="gpt-4o",
)

task = Task(
    description="Analyze customer lifetime value by acquisition channel",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[task])
result = crew.kickoff()

LangGraph Integration

See the full tutorial: Building a Data Analyst Agent with LangGraph.
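
In the meantime, here's a minimal sketch of the same pattern using LangGraph's prebuilt ReAct agent (assumes langgraph and langchain-openai are installed; the tool simply wraps the SDK client from above):

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from dataspoc_lens import LensClient

client = LensClient(bucket="s3://company-lake")

@tool
def query_data_lake(sql: str) -> str:
    """Run a read-only SQL query against the company data lake."""
    return str(client.query(sql))

# Prebuilt ReAct loop: the model decides when to call the tool
agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools=[query_data_lake])
result = agent.invoke({"messages": [("user", "What was total revenue last month?")]})
print(result["messages"][-1].content)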

The AGENT.md Pattern

For agents that work autonomously with your data lake, create an AGENT.md file that describes the available data:

# Data Lake Agent Instructions

## Available Data
You have access to a data lake via the DataSpoc Lens MCP server.

## Tables
- `raw.postgres.orders` — All orders (1.2M rows). Key columns: customer_id, amount, status, created_at
- `raw.postgres.customers` — Customer profiles (89K rows). Key columns: name, email, plan, created_at
- `raw.postgres.products` — Product catalog (2.4K rows). Key columns: name, category, price
- `curated.finance.revenue` — Monthly revenue rollups (4.2M rows). Key columns: month, mrr, arr, churn_rate

## Common Queries
- Revenue by month: `SELECT DATE_TRUNC('month', created_at), SUM(amount) FROM raw.postgres.orders GROUP BY 1`
- Active customers: `SELECT COUNT(DISTINCT customer_id) FROM raw.postgres.orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'`
- Churn: `SELECT month, churn_rate FROM curated.finance.revenue ORDER BY month DESC LIMIT 12`

## Rules
- Always use the `query` tool for exact numbers
- Use `ask` for exploratory questions when you're not sure about the schema
- Never modify data — all queries are read-only
- When presenting numbers, always show the SQL you used (for auditability)

Place this in your project root. Agents that read project context (Claude Code, Cursor, Cline) will use it to understand how to interact with your data.

Running in Production

For production deployments, run the MCP server as a persistent process:

Terminal window
# systemd service
[Unit]
Description=DataSpoc Lens MCP Server
After=network.target

[Service]
Type=simple
User=dataspoc
ExecStart=/usr/local/bin/dataspoc-lens mcp --transport sse --port 8080
Environment=AWS_PROFILE=data-lake
Environment=DATASPOC_BUCKET=s3://company-lake
Restart=always

[Install]
WantedBy=multi-user.target

Or with Docker:

FROM python:3.11-slim
RUN pip install 'dataspoc-lens[mcp]'
ENV DATASPOC_BUCKET=s3://company-lake
CMD ["dataspoc-lens", "mcp", "--transport", "sse", "--port", "8080"]
Terminal window
docker build -t dataspoc-lens-mcp .
docker run -d \
  -p 8080:8080 \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e DATASPOC_BUCKET=s3://company-lake \
  dataspoc-lens-mcp

Summary

In 4 steps, you’ve turned a pile of Parquet files into an intelligent API for AI agents:

  1. Ingest data with Pipe (or bring your own Parquet)
  2. Install dataspoc-lens[mcp]
  3. Start dataspoc-lens mcp
  4. Connect Claude Desktop, LangGraph, CrewAI, or any MCP client

Your data lake is now a first-class tool for any AI agent — discoverable, queryable, and secure.
