claude-code · mcp · data-engineering · ai-agents · tutorial

Using Claude Code as Your Data Engineer: MCP + DataSpoc

Michael San Martim · 2026-04-18

Claude Code is a terminal-based AI coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (query) as MCP servers. Connect them and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors — all from your terminal.

Install the Tools

pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]

This installs both CLIs with MCP server support. Verify:

dataspoc-pipe mcp --help
dataspoc-lens mcp --help

Configure Claude Code

Add the MCP servers to your Claude Code configuration. Create or edit .mcp.json in your project root:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://my-company-data"
      }
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://my-company-data"
      }
    }
  }
}

For Azure or GCS, only the bucket URI scheme changes. A GCS example:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "gs://my-company-data"
      }
    }
  }
}
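
If you manage configuration in code, a few lines of Python can generate the same file. This is a convenience sketch built from the examples above; swap the bucket URI for your cloud:

import json

# Generate .mcp.json for both DataSpoc MCP servers.
BUCKET = "s3://my-company-data"  # or gs://..., depending on your cloud

config = {
    "mcpServers": {
        name: {
            "command": name,
            "args": ["mcp"],
            "env": {"DATASPOC_BUCKET": BUCKET},
        }
        for name in ("dataspoc-lens", "dataspoc-pipe")
    }
}

with open(".mcp.json", "w") as f:
    json.dump(config, f, indent=2)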

Restart Claude Code. It will discover the MCP tools automatically.

What Claude Code Can Do Now

Discover Tables

Ask Claude Code:

> What tables are in my data lake?

Claude Code calls lens.tables() via MCP and responds:

Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
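
Under the hood, an MCP server is just a process that advertises callable tools over stdio. The sketch below is not DataSpoc's source; it shows how a lens.tables-style tool could be exposed with the official MCP Python SDK (the mcp package), with a hard-coded list standing in for a real catalog scan:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dataspoc-lens-demo")  # illustrative server, not the real one

@mcp.tool()
def tables() -> list[str]:
    """List the tables in the data lake."""
    # Hypothetical: a real server would scan the lake's catalog here.
    return ["raw_postgres_orders", "curated_sales", "gold_revenue_by_region"]

if __name__ == "__main__":
    mcp.run()  # stdio transport, which is how Claude Code connects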

Explore Schemas

> What columns does curated_sales have?

Claude Code calls lens.schema("curated_sales"):

curated_sales schema:
- order_id: VARCHAR
- customer_id: VARCHAR
- product: VARCHAR
- amount: DOUBLE
- region: VARCHAR
- sale_date: DATE
- payment_method: VARCHAR

Run Queries

> What were the top 5 regions by revenue last month?

Claude Code generates and executes SQL:

SELECT region, SUM(amount) as revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5

Results:

| region        | revenue    |
|---------------|------------|
| North America | $1,240,000 |
| Europe        | $890,000   |
| Asia Pacific  | $670,000   |
| Latin America | $340,000   |
| Middle East   | $120,000   |
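
As the AGENT.md section below notes, Lens is DuckDB over Parquet. If you ever want to sanity-check a result outside the agent, the direct equivalent looks roughly like this (the lake's path layout here is an assumption):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # enables s3:// reads
con.sql("LOAD httpfs")     # credentials come from your environment / secrets

# Assumed layout: curated tables live under curated/<name>/ as Parquet files.
top_regions = con.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM read_parquet('s3://my-company-data/curated/curated_sales/*.parquet')
    WHERE sale_date >= DATE '2026-03-01' AND sale_date < DATE '2026-04-01'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 5
""")
print(top_regions)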

Trigger Data Ingestion

> Pull the latest data from Postgres into the lake

Claude Code calls Pipe’s MCP tools:

Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)
✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)
Pipeline completed in 23 seconds.
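
The bookmark in that output is Singer's incremental-sync state: a high-water mark per stream that the tap resumes from on the next run. Conceptually the bookkeeping looks like this (a simplified sketch; the state file path and stream name are assumptions):

import json

STATE_PATH = ".dataspoc/state/postgres-to-lake.json"  # assumed location

def load_bookmark(stream: str) -> str | None:
    """Return the last replication-key value for a stream, Singer-style."""
    try:
        with open(STATE_PATH) as f:
            return json.load(f)["bookmarks"][stream]["replication_key_value"]
    except (FileNotFoundError, KeyError):
        return None  # no bookmark yet -> fall back to a full extraction

def save_bookmark(stream: str, value: str) -> None:
    """Persist the new high-water mark after a successful load."""
    try:
        with open(STATE_PATH) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {"bookmarks": {}}
    state["bookmarks"].setdefault(stream, {})["replication_key_value"] = value
    with open(STATE_PATH, "w") as f:
        json.dump(state, f, indent=2)

# e.g. load_bookmark("public-orders") -> "2026-04-17T23:00:00Z"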

Read and Debug Logs

> Show me the last pipeline run. Did anything fail?

Claude Code reads the pipeline logs via Pipe’s MCP:

Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE
✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)
Error details for customers:
psycopg2.OperationalError: could not connect to server:
Connection timed out. Is the server running on host "db.company.com"
and accepting TCP/IP connections on port 5432?
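
The AGENT.md pattern below points the agent at .dataspoc/logs/<pipeline>/ for exactly this kind of triage. A manual equivalent might look like this, assuming one JSON-lines file per run with a record per table (the log format here is an assumption, not documented DataSpoc behavior):

import json
from pathlib import Path

LOG_DIR = Path(".dataspoc/logs/postgres-to-lake")

# Assumption: each run writes a .jsonl file with one record per table.
latest_run = max(LOG_DIR.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)
for line in latest_run.read_text().splitlines():
    record = json.loads(line)
    if record.get("status") == "FAILED":
        print(record["table"], "->", record.get("error"))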

Fix Errors

> The customers extraction failed due to a timeout. Can you increase the timeout and retry?

Claude Code updates the pipeline config and reruns:

Updated postgres-to-lake pipeline config:
connection_timeout: 30s → 60s
Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
Pipeline completed successfully.

The AGENT.md Pattern

Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:

# Data Engineering Agent Context
## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)
## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |
## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/
## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)
## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution

Claude Code reads this file automatically and uses it as context for every interaction.
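
The "check data freshness" task above maps to one query per table. A sketch with DuckDB, assuming you know each table's freshness column and the lake's path layout:

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Hypothetical table -> freshness column mapping; extend for your lake.
FRESHNESS = {"curated_sales": "sale_date"}

for table, col in FRESHNESS.items():
    path = f"s3://my-company-data/curated/{table}/*.parquet"  # assumed layout
    latest = con.sql(f"SELECT max({col}) FROM read_parquet('{path}')").fetchone()[0]
    print(f"{table}: latest {col} = {latest}")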

Real Session Example

Here is a complete terminal session showing Claude Code working as a data engineer:

$ claude
> I need to understand why our revenue dashboard shows a dip last Tuesday.
I'll investigate. Let me check the data.
[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue
FROM curated_sales
WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
GROUP BY sale_date ORDER BY sale_date]
Here's daily revenue for the past week:
| date       | revenue  |
|------------|----------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000  | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |
April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.
[Calling pipe.logs: postgres-to-lake, date=2026-04-09]
Found the issue. The pipeline on April 9th failed:
Status: FAILED
Error: "disk full on /tmp during Parquet conversion"
Rows extracted: 12,450 (expected ~45,000)
Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.
Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion
Want me to re-run the pipeline?
> Yes, re-run it.
[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]
✓ Pipeline completed: 44,892 rows extracted and loaded.
Let me verify the revenue now:
[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']
April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.

Tips for Effective Prompts

Be specific about tables:

# Good
> What's the average order value in curated_sales for Q1?
# Bad
> What's our AOV?

Ask for verification:

> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.

Chain operations:

> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.

Request explanations:

> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.

Security Model

Claude Code with DataSpoc MCP is safe by design:

  • Lens MCP is read-only. It rejects any SQL that attempts to write, modify, or delete data (a minimal version of such a guard is sketched after this list).
  • Pipe MCP requires confirmation. Pipeline runs are gated by user approval.
  • Cloud IAM controls access. The MCP server inherits your cloud credentials. If you cannot access a bucket, neither can Claude Code.
  • Every query is SQL. You can review exactly what was executed. No opaque retrieval or hidden data access.
  • No credentials in config. The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).
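
A read-only guard like the one Lens enforces can be surprisingly small. This sketch is not DataSpoc's implementation; a production check should validate the parsed statement rather than pattern-match strings:

import re

WRITE_KEYWORDS = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|COPY|ATTACH)\b",
    re.IGNORECASE,
)

def assert_read_only(sql: str) -> None:
    """Reject any statement that could write, modify, or delete data."""
    for statement in sql.split(";"):
        if WRITE_KEYWORDS.match(statement):
            raise PermissionError(f"write statement rejected: {statement.strip()!r}")

assert_read_only("SELECT * FROM curated_sales")   # passes
# assert_read_only("DROP TABLE curated_sales")    # raises PermissionError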

Claude Code becomes a powerful data engineering assistant without expanding your security surface. It can only do what you could already do — just faster.
