Tags: mcp · claude · ai-agents · data-lake · tutorial

How to Build an MCP Server for Your Data Lake

Michael San Martim · 2026-04-24

MCP (Model Context Protocol) lets AI agents call tools — like querying a database. DataSpoc Lens ships with a built-in MCP server that turns your Parquet data lake into a queryable API for Claude, GPT, LangGraph agents, and any MCP-compatible client.

This tutorial shows you how to go from raw data in S3 to a fully functional MCP server in under 10 minutes.

The End Result

After setup, any AI agent can:

  • Discover tables in your data lake
  • Inspect schemas and sample data
  • Run SQL queries with DuckDB
  • Ask natural language questions
  • Get pre-built metrics and KPIs

All without any custom code.

Step 1: Ingest Data with Pipe

If your data is already in Parquet on S3/GCS/Azure, skip to Step 2. Otherwise, extract from your sources:

Terminal window
pip install dataspoc-pipe

dataspoc-pipe init sales-pipeline

dataspoc-pipe add postgres \
  --host db.company.com \
  --database sales \
  --tables orders,customers,products,revenue \
  --incremental updated_at \
  --destination s3://company-lake

dataspoc-pipe run

Your bucket now has:

s3://company-lake/
├── .dataspoc/manifest.json
├── raw/postgres/orders/*.parquet
├── raw/postgres/customers/*.parquet
├── raw/postgres/products/*.parquet
└── raw/postgres/revenue/*.parquet

Step 2: Install DataSpoc Lens with MCP

Terminal window
pip install 'dataspoc-lens[mcp]'   # quotes keep zsh from expanding the brackets

This installs the core Lens CLI plus the MCP server dependencies.

Step 3: Configure and Start the MCP Server

Terminal window
# Point Lens at your bucket
dataspoc-lens add-bucket s3://company-lake
# Discover tables
dataspoc-lens discover
# Found 4 tables: orders, customers, products, revenue
# Start the MCP server
dataspoc-lens mcp

Output:

DataSpoc Lens MCP Server running
Transport: stdio
Tables: 4 discovered
Tools: 7 available
- list_tables
- get_schema
- sample_data
- query
- ask
- get_metrics
- explain_table
Waiting for MCP client connection...
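
Before connecting a full client, you can smoke-test the server from a short script. Here's a minimal sketch using the official MCP Python SDK (assumed installed with pip install mcp; everything else mirrors the CLI setup above):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the Lens MCP server as a stdio subprocess
server = StdioServerParameters(command="dataspoc-lens", args=["mcp"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Should print the 7 tool names listed above
            tools = await session.list_tools()
            print([t.name for t in tools.tools])
            # Call a tool directly
            result = await session.call_tool("list_tables", {})
            print(result.content)

asyncio.run(main())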

Step 4: Configure Claude Desktop

Add to your Claude Desktop config (~/.config/claude/claude_desktop_config.json on Linux, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows):

{
  "mcpServers": {
    "data-lake": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "AWS_PROFILE": "data-lake",
        "DATASPOC_BUCKET": "s3://company-lake"
      }
    }
  }
}

Restart Claude Desktop. You’ll see the tools icon indicating 7 available tools.

The 7 MCP Tools

1. list_tables — Discover Available Data

Call:

{
  "tool": "list_tables"
}

Response:

{
  "tables": [
    {"name": "raw.postgres.orders", "rows": 1247893, "size_mb": 340},
    {"name": "raw.postgres.customers", "rows": 89421, "size_mb": 12},
    {"name": "raw.postgres.products", "rows": 2431, "size_mb": 1.1},
    {"name": "raw.postgres.revenue", "rows": 4201847, "size_mb": 890}
  ]
}

2. get_schema — Inspect Table Structure

Call:

{
  "tool": "get_schema",
  "arguments": {"table": "raw.postgres.orders"}
}

Response:

{
  "table": "raw.postgres.orders",
  "columns": [
    {"name": "id", "type": "INTEGER", "nullable": false},
    {"name": "customer_id", "type": "INTEGER", "nullable": false},
    {"name": "product_id", "type": "INTEGER", "nullable": false},
    {"name": "amount", "type": "DECIMAL(10,2)", "nullable": false},
    {"name": "status", "type": "VARCHAR", "nullable": false},
    {"name": "created_at", "type": "TIMESTAMP", "nullable": false},
    {"name": "updated_at", "type": "TIMESTAMP", "nullable": false}
  ],
  "row_count": 1247893
}

3. sample_data — Preview Rows

Call:

{
  "tool": "sample_data",
  "arguments": {"table": "raw.postgres.orders", "limit": 5}
}

Response:

{
  "table": "raw.postgres.orders",
  "sample": [
    {"id": 1, "customer_id": 42, "product_id": 7, "amount": 299.00, "status": "completed", "created_at": "2024-01-15T10:23:00"},
    {"id": 2, "customer_id": 15, "product_id": 3, "amount": 49.99, "status": "completed", "created_at": "2024-01-15T11:05:00"},
    {"id": 3, "customer_id": 42, "product_id": 12, "amount": 799.00, "status": "pending", "created_at": "2024-01-15T14:30:00"}
  ]
}

4. query — Execute SQL

Call:

{
  "tool": "query",
  "arguments": {
    "sql": "SELECT DATE_TRUNC('month', created_at) AS month, SUM(amount) AS revenue FROM raw.postgres.orders WHERE created_at >= '2024-01-01' GROUP BY 1 ORDER BY 1"
  }
}

Response:

{
  "columns": ["month", "revenue"],
  "rows": [
    {"month": "2024-01-01", "revenue": 342891.50},
    {"month": "2024-02-01", "revenue": 389012.75},
    {"month": "2024-03-01", "revenue": 421547.00}
  ],
  "row_count": 3,
  "execution_time_ms": 234
}

5. ask — Natural Language Questions

Call:

{
  "tool": "ask",
  "arguments": {
    "question": "Which customers spent the most last quarter?"
  }
}

Response:

{
  "answer": "The top 5 customers by spending last quarter were: Acme Corp ($89,234), GlobalTech ($67,891), Initech ($54,320), Umbrella Inc ($48,900), and Wayne Enterprises ($45,670).",
  "sql": "SELECT c.name, SUM(o.amount) AS total_spent FROM raw.postgres.orders o JOIN raw.postgres.customers c ON o.customer_id = c.id WHERE o.created_at >= '2024-07-01' AND o.created_at < '2024-10-01' GROUP BY c.name ORDER BY total_spent DESC LIMIT 5",
  "data": [
    {"name": "Acme Corp", "total_spent": 89234.00},
    {"name": "GlobalTech", "total_spent": 67891.00},
    {"name": "Initech", "total_spent": 54320.00},
    {"name": "Umbrella Inc", "total_spent": 48900.00},
    {"name": "Wayne Enterprises", "total_spent": 45670.00}
  ]
}

6. get_metrics — Pre-Built KPIs

Call:

{
  "tool": "get_metrics",
  "arguments": {"period": "last_30_days"}
}

Response:

{
  "period": "2024-09-15 to 2024-10-15",
  "metrics": {
    "total_revenue": 487293.00,
    "order_count": 12847,
    "avg_order_value": 37.93,
    "unique_customers": 3421,
    "new_customers": 289
  }
}

7. explain_table — Data Documentation

Call:

{
  "tool": "explain_table",
  "arguments": {"table": "raw.postgres.orders"}
}

Response:

{
  "table": "raw.postgres.orders",
  "description": "Transaction records from the e-commerce platform. Each row represents one purchase.",
  "columns": {
    "id": "Unique order identifier",
    "customer_id": "FK to customers table",
    "amount": "Total order value in USD",
    "status": "One of: pending, completed, refunded, cancelled",
    "created_at": "When the order was placed"
  },
  "relationships": ["customers (customer_id)", "products (product_id)"],
  "common_queries": [
    "Revenue by month: SELECT DATE_TRUNC('month', created_at), SUM(amount) FROM orders GROUP BY 1",
    "Orders by status: SELECT status, COUNT(*) FROM orders GROUP BY 1"
  ]
}

Realistic Multi-Turn Agent Conversation

Here’s what it looks like when Claude uses the MCP server:

User: “How is our retention trending? Are we losing customers?”

Claude’s internal tool calls:

  1. Calls list_tables to see what data is available
  2. Calls get_schema("raw.postgres.orders") to understand the orders structure
  3. Calls get_schema("raw.postgres.customers") to check for signup dates
  4. Calls query with:
WITH monthly_cohorts AS (
  SELECT
    DATE_TRUNC('month', c.created_at) AS cohort_month,
    DATE_TRUNC('month', o.created_at) AS order_month,
    COUNT(DISTINCT o.customer_id) AS active_customers
  FROM raw.postgres.customers c
  JOIN raw.postgres.orders o ON c.id = o.customer_id
  WHERE c.created_at >= '2024-01-01'
  GROUP BY 1, 2
),
cohort_sizes AS (
  SELECT
    DATE_TRUNC('month', created_at) AS cohort_month,
    COUNT(DISTINCT id) AS cohort_size
  FROM raw.postgres.customers
  WHERE created_at >= '2024-01-01'
  GROUP BY 1
)
SELECT
  mc.cohort_month,
  mc.order_month,
  mc.active_customers,
  cs.cohort_size,
  ROUND(mc.active_customers::FLOAT / cs.cohort_size * 100, 1) AS retention_pct
FROM monthly_cohorts mc
JOIN cohort_sizes cs ON mc.cohort_month = cs.cohort_month
ORDER BY mc.cohort_month, mc.order_month

Claude’s response:

Looking at cohort retention data, here’s the trend:

Cohort      Month 1   Month 2   Month 3   Month 6
Jan 2024    100%      68%       52%       34%
Apr 2024    100%      71%       55%       38%
Jul 2024    100%      74%       59%      n/a

Good news: retention is actually improving. Month-2 retention went from 68% to 74% over the year. The product changes in Q2 (self-serve onboarding) appear to be working.

However, the Month-6 drop to 34% suggests we lose customers around the 6-month mark — possibly when annual contracts come up for renewal.

Security Model

The MCP server is read-only by design:

  • No INSERT, UPDATE, DELETE, DROP, or CREATE statements are allowed
  • Queries are validated before execution
  • The server runs with the IAM permissions of the process (your AWS profile)
  • Bucket-level access control means agents only see data they’re authorized to see

Terminal window
# Agent can only access buckets your IAM role permits
export AWS_PROFILE=analyst # Has access to s3://company-product only
dataspoc-lens mcp # Only exposes tables in permitted buckets

No API keys to manage. No auth tokens. Just IAM.
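
To make the read-only guarantee concrete, here's a hypothetical sketch of the kind of statement check a validator can perform. This is illustrative only, not DataSpoc Lens's actual implementation:

import re

# Illustrative only; not Lens's actual validator.
WRITE_KEYWORDS = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Reject statements containing write/DDL keywords.
    A production validator would parse the SQL rather than pattern-match,
    to avoid false positives on string literals containing these words."""
    return WRITE_KEYWORDS.search(sql) is None

assert is_read_only("SELECT COUNT(*) FROM raw.postgres.orders")
assert not is_read_only("DROP TABLE raw.postgres.orders")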

Python SDK for CrewAI / LangGraph

If you’re building agents in Python rather than using Claude Desktop, use the SDK directly:

from dataspoc_lens import LensClient

# Initialize (reads from same config as CLI)
client = LensClient(bucket="s3://company-lake")

# Same capabilities as MCP tools
tables = client.list_tables()
schema = client.get_schema("raw.postgres.orders")
result = client.query("SELECT COUNT(*) FROM raw.postgres.orders")
answer = client.ask("What's our churn rate?")

CrewAI Integration

from crewai import Agent, Task, Crew
from crewai_tools import tool
from dataspoc_lens import LensClient

client = LensClient()

@tool
def query_data_lake(sql: str) -> str:
    """Execute a SQL query against the company data lake."""
    result = client.query(sql)
    return str(result)

@tool
def discover_tables() -> str:
    """List all available tables in the data lake."""
    tables = client.list_tables()
    return "\n".join([f"{t['name']} ({t['rows']} rows)" for t in tables])

analyst = Agent(
    role="Data Analyst",
    goal="Answer business questions using SQL on the data lake",
    tools=[query_data_lake, discover_tables],
    llm="gpt-4o",
)

task = Task(
    description="Analyze customer lifetime value by acquisition channel",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[task])
result = crew.kickoff()

LangGraph Integration

See the full tutorial: Building a Data Analyst Agent with LangGraph.
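
In the meantime, here's a minimal sketch of the same pattern using LangGraph's prebuilt ReAct agent (assumes langgraph and langchain-openai are installed; the tool simply wraps the SDK client from above):

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from dataspoc_lens import LensClient

client = LensClient(bucket="s3://company-lake")

@tool
def query_data_lake(sql: str) -> str:
    """Run a read-only SQL query against the company data lake."""
    return str(client.query(sql))

# Prebuilt ReAct loop: the model decides when to call the tool
agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools=[query_data_lake])
result = agent.invoke({"messages": [("user", "What was total revenue last month?")]})
print(result["messages"][-1].content)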

The AGENT.md Pattern

For agents that work autonomously with your data lake, create an AGENT.md file that describes the available data:

# Data Lake Agent Instructions

## Available Data
You have access to a data lake via the DataSpoc Lens MCP server.

## Tables
- `raw.postgres.orders` — All orders (1.2M rows). Key columns: customer_id, amount, status, created_at
- `raw.postgres.customers` — Customer profiles (89K rows). Key columns: name, email, plan, created_at
- `raw.postgres.products` — Product catalog (2.4K rows). Key columns: name, category, price
- `curated.finance.revenue` — Monthly revenue rollups (4.2M rows). Key columns: month, mrr, arr, churn_rate

## Common Queries
- Revenue by month: `SELECT DATE_TRUNC('month', created_at), SUM(amount) FROM raw.postgres.orders GROUP BY 1`
- Active customers: `SELECT COUNT(DISTINCT customer_id) FROM raw.postgres.orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'`
- Churn: `SELECT month, churn_rate FROM curated.finance.revenue ORDER BY month DESC LIMIT 12`

## Rules
- Always use the `query` tool for exact numbers
- Use `ask` for exploratory questions when you're not sure about the schema
- Never modify data — all queries are read-only
- When presenting numbers, always show the SQL you used (for auditability)

Place this in your project root. Agents that read project context (Claude Code, Cursor, Cline) will use it to understand how to interact with your data.

Running in Production

For production deployments, run the MCP server as a persistent process:

Terminal window
# systemd service
[Unit]
Description=DataSpoc Lens MCP Server
After=network.target

[Service]
Type=simple
User=dataspoc
ExecStart=/usr/local/bin/dataspoc-lens mcp --transport sse --port 8080
Environment=AWS_PROFILE=data-lake
Environment=DATASPOC_BUCKET=s3://company-lake
Restart=always

[Install]
WantedBy=multi-user.target

Or with Docker:

FROM python:3.11-slim
RUN pip install 'dataspoc-lens[mcp]'
ENV DATASPOC_BUCKET=s3://company-lake
CMD ["dataspoc-lens", "mcp", "--transport", "sse", "--port", "8080"]
Terminal window
docker build -t dataspoc-lens-mcp .
docker run -d \
  -p 8080:8080 \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  -e DATASPOC_BUCKET=s3://company-lake \
  dataspoc-lens-mcp

Summary

In 4 steps, you’ve turned a pile of Parquet files into an intelligent API for AI agents:

  1. Ingest data with Pipe (or bring your own Parquet)
  2. Install dataspoc-lens[mcp]
  3. Start dataspoc-lens mcp
  4. Connect Claude Desktop, LangGraph, CrewAI, or any MCP client

Your data lake is now a first-class tool for any AI agent — discoverable, queryable, and secure.
