claude-code · mcp · data-engineering · ai-agents · tutorial

Using Claude Code as Your Data Engineer: MCP + DataSpoc

Michael San Martim · 2026-04-18

Claude Code is a terminal-based AI coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (query) as MCP servers. Connect them and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors — all from your terminal.

Install the Tools

pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]

This installs both CLIs with MCP server support. Verify:

dataspoc-pipe mcp --help
dataspoc-lens mcp --help

Configure Claude Code

Add the MCP servers to your Claude Code configuration. Create or edit .mcp.json in your project root:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://my-company-data"
      }
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://my-company-data"
      }
    }
  }
}

For Azure or GCS, only the bucket URI scheme changes. A GCS example:

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "gs://my-company-data"
      }
    }
  }
}
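
If you manage configuration in code, a few lines of Python can generate the same file. This is a convenience sketch built from the examples above; swap the bucket URI for your cloud:

import json

# Generate .mcp.json for both DataSpoc MCP servers.
BUCKET = "s3://my-company-data"  # or gs://..., depending on your cloud

config = {
    "mcpServers": {
        name: {
            "command": name,
            "args": ["mcp"],
            "env": {"DATASPOC_BUCKET": BUCKET},
        }
        for name in ("dataspoc-lens", "dataspoc-pipe")
    }
}

with open(".mcp.json", "w") as f:
    json.dump(config, f, indent=2)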

Restart Claude Code. It will discover the MCP tools automatically.

What Claude Code Can Do Now

Discover Tables

Ask Claude Code:

> What tables are in my data lake?

Claude Code calls lens.tables() via MCP and responds:

Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
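
Under the hood, an MCP server is just a process that advertises callable tools over stdio. The sketch below is not DataSpoc's source; it shows how a lens.tables-style tool could be exposed with the official MCP Python SDK (the mcp package), with a hard-coded list standing in for a real catalog scan:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dataspoc-lens-demo")  # illustrative server, not the real one

@mcp.tool()
def tables() -> list[str]:
    """List the tables in the data lake."""
    # Hypothetical: a real server would scan the lake's catalog here.
    return ["raw_postgres_orders", "curated_sales", "gold_revenue_by_region"]

if __name__ == "__main__":
    mcp.run()  # stdio transport, which is how Claude Code connects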

Explore Schemas

> What columns does curated_sales have?

Claude Code calls lens.schema("curated_sales"):

curated_sales schema:
- order_id: VARCHAR
- customer_id: VARCHAR
- product: VARCHAR
- amount: DOUBLE
- region: VARCHAR
- sale_date: DATE
- payment_method: VARCHAR

Run Queries

> What were the top 5 regions by revenue last month?

Claude Code generates and executes SQL:

SELECT region, SUM(amount) as revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5

Results:

| region        | revenue    |
|---------------|------------|
| North America | $1,240,000 |
| Europe        | $890,000   |
| Asia Pacific  | $670,000   |
| Latin America | $340,000   |
| Middle East   | $120,000   |
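
As the AGENT.md section below notes, Lens is DuckDB over Parquet. If you ever want to sanity-check a result outside the agent, the direct equivalent looks roughly like this (the lake's path layout here is an assumption):

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # enables s3:// reads
con.sql("LOAD httpfs")     # credentials come from your environment / secrets

# Assumed layout: curated tables live under curated/<name>/ as Parquet files.
top_regions = con.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM read_parquet('s3://my-company-data/curated/curated_sales/*.parquet')
    WHERE sale_date >= DATE '2026-03-01' AND sale_date < DATE '2026-04-01'
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 5
""")
print(top_regions)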

Trigger Data Ingestion

> Pull the latest data from Postgres into the lake

Claude Code calls Pipe’s MCP tools:

Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)
✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)
Pipeline completed in 23 seconds.
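
The bookmark in that output is Singer's incremental-sync state: a high-water mark per stream that the tap resumes from on the next run. Conceptually the bookkeeping looks like this (a simplified sketch; the state file path and stream name are assumptions):

import json

STATE_PATH = ".dataspoc/state/postgres-to-lake.json"  # assumed location

def load_bookmark(stream: str) -> str | None:
    """Return the last replication-key value for a stream, Singer-style."""
    try:
        with open(STATE_PATH) as f:
            return json.load(f)["bookmarks"][stream]["replication_key_value"]
    except (FileNotFoundError, KeyError):
        return None  # no bookmark yet -> fall back to a full extraction

def save_bookmark(stream: str, value: str) -> None:
    """Persist the new high-water mark after a successful load."""
    try:
        with open(STATE_PATH) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {"bookmarks": {}}
    state["bookmarks"].setdefault(stream, {})["replication_key_value"] = value
    with open(STATE_PATH, "w") as f:
        json.dump(state, f, indent=2)

# e.g. load_bookmark("public-orders") -> "2026-04-17T23:00:00Z"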

Read and Debug Logs

> Show me the last pipeline run. Did anything fail?

Claude Code reads the pipeline logs via Pipe’s MCP:

Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE
✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)
Error details for customers:
psycopg2.OperationalError: could not connect to server:
Connection timed out. Is the server running on host "db.company.com"
and accepting TCP/IP connections on port 5432?
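
The AGENT.md pattern below points the agent at .dataspoc/logs/<pipeline>/ for exactly this kind of triage. A manual equivalent might look like this, assuming one JSON-lines file per run with a record per table (the log format here is an assumption, not documented DataSpoc behavior):

import json
from pathlib import Path

LOG_DIR = Path(".dataspoc/logs/postgres-to-lake")

# Assumption: each run writes a .jsonl file with one record per table.
latest_run = max(LOG_DIR.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)
for line in latest_run.read_text().splitlines():
    record = json.loads(line)
    if record.get("status") == "FAILED":
        print(record["table"], "->", record.get("error"))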

Fix Errors

> The customers extraction failed due to a timeout. Can you increase the timeout and retry?

Claude Code updates the pipeline config and reruns:

Updated postgres-to-lake pipeline config:
connection_timeout: 30s → 60s
Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
Pipeline completed successfully.

The AGENT.md Pattern

Create an AGENT.md file in your project to give Claude Code persistent context about your data infrastructure:

# Data Engineering Agent Context
## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)
## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |
## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/
## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)
## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution

Claude Code reads this file automatically and uses it as context for every interaction.
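
The "check data freshness" task above maps to one query per table. A sketch with DuckDB, assuming you know each table's freshness column and the lake's path layout:

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Hypothetical table -> freshness column mapping; extend for your lake.
FRESHNESS = {"curated_sales": "sale_date"}

for table, col in FRESHNESS.items():
    path = f"s3://my-company-data/curated/{table}/*.parquet"  # assumed layout
    latest = con.sql(f"SELECT max({col}) FROM read_parquet('{path}')").fetchone()[0]
    print(f"{table}: latest {col} = {latest}")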

Real Session Example

Here is a complete terminal session showing Claude Code working as a data engineer:

$ claude
> I need to understand why our revenue dashboard shows a dip last Tuesday.
I'll investigate. Let me check the data.
[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue
FROM curated_sales
WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
GROUP BY sale_date ORDER BY sale_date]
Here's daily revenue for the past week:
| date       | revenue  |
|------------|----------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000  | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |
April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.
[Calling pipe.logs: postgres-to-lake, date=2026-04-09]
Found the issue. The pipeline on April 9th failed:
Status: FAILED
Error: "disk full on /tmp during Parquet conversion"
Rows extracted: 12,450 (expected ~45,000)
Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.
Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion
Want me to re-run the pipeline?
> Yes, re-run it.
[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]
✓ Pipeline completed: 44,892 rows extracted and loaded.
Let me verify the revenue now:
[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']
April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.

Tips for Effective Prompts

Be specific about tables:

# Good
> What's the average order value in curated_sales for Q1?
# Bad
> What's our AOV?

Ask for verification:

> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.

Chain operations:

> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.

Request explanations:

> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.

Security Model

Claude Code with DataSpoc MCP is safe by design:

  • Lens MCP is read-only. It rejects any SQL that attempts to write, modify, or delete data (a minimal version of such a guard is sketched after this list).
  • Pipe MCP requires confirmation. Pipeline runs are gated by user approval.
  • Cloud IAM controls access. The MCP server inherits your cloud credentials. If you cannot access a bucket, neither can Claude Code.
  • Every query is SQL. You can review exactly what was executed. No opaque retrieval or hidden data access.
  • No credentials in config. The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).
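
A read-only guard like the one Lens enforces can be surprisingly small. This sketch is not DataSpoc's implementation; a production check should validate the parsed statement rather than pattern-match strings:

import re

WRITE_KEYWORDS = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|COPY|ATTACH)\b",
    re.IGNORECASE,
)

def assert_read_only(sql: str) -> None:
    """Reject any statement that could write, modify, or delete data."""
    for statement in sql.split(";"):
        if WRITE_KEYWORDS.match(statement):
            raise PermissionError(f"write statement rejected: {statement.strip()!r}")

assert_read_only("SELECT * FROM curated_sales")   # passes
# assert_read_only("DROP TABLE curated_sales")    # raises PermissionError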

Claude Code becomes a powerful data engineering assistant without expanding your security surface. It can only do what you could already do — just faster.
