# Using Claude Code as Your Data Engineer: MCP + DataSpoc
Claude Code is a terminal-based AI coding agent. DataSpoc exposes both Pipe (ingestion) and Lens (query) as MCP servers. Connect them and Claude Code becomes a data engineer that can ingest data, query your lake, debug pipelines, and fix errors — all from your terminal.
## Install the Tools
```bash
pip install dataspoc-pipe[mcp] dataspoc-lens[mcp]
```

This installs both CLIs with MCP server support. Verify:
```bash
dataspoc-pipe mcp --help
dataspoc-lens mcp --help
```

## Configure Claude Code
Add the MCP servers to your Claude Code configuration. Create or edit `.mcp.json` in your project root (Claude Code reads project-scoped MCP servers from this file):
{ "mcpServers": { "dataspoc-lens": { "command": "dataspoc-lens", "args": ["mcp"], "env": { "DATASPOC_BUCKET": "s3://my-company-data" } }, "dataspoc-pipe": { "command": "dataspoc-pipe", "args": ["mcp"], "env": { "DATASPOC_BUCKET": "s3://my-company-data" } } }}For Azure or GCS:
{ "mcpServers": { "dataspoc-lens": { "command": "dataspoc-lens", "args": ["mcp"], "env": { "DATASPOC_BUCKET": "gs://my-company-data" } } }}Restart Claude Code. It will discover the MCP tools automatically.
## What Claude Code Can Do Now
### Discover Tables
Ask Claude Code:
```
> What tables are in my data lake?
```

Claude Code calls `lens.tables()` via MCP and responds:
```
Your data lake has 12 tables:
- raw_postgres_orders (245K rows)
- raw_postgres_customers (18K rows)
- raw_stripe_payments (890K rows)
- curated_sales (245K rows)
- curated_customer_360 (18K rows)
- gold_revenue_by_region (48 rows)
...
```

### Explore Schemas
```
> What columns does curated_sales have?
```

Claude Code calls `lens.schema("curated_sales")`:
```
curated_sales schema:
  - order_id: VARCHAR
  - customer_id: VARCHAR
  - product: VARCHAR
  - amount: DOUBLE
  - region: VARCHAR
  - sale_date: DATE
  - payment_method: VARCHAR
```
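Under the hood, a schema lookup like this maps to an ordinary DuckDB metadata query. A minimal sketch of the equivalent SQL, assuming Lens exposes the table to DuckDB under the same name (an assumption about Lens internals):

```sql
-- Sketch: two equivalent ways to inspect a table's columns in DuckDB.
DESCRIBE curated_sales;

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'curated_sales';
```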
### Run Queries

```
> What were the top 5 regions by revenue last month?
```

Claude Code generates and executes SQL:
```sql
SELECT region, SUM(amount) AS revenue
FROM curated_sales
WHERE sale_date >= '2026-03-01' AND sale_date < '2026-04-01'
GROUP BY region
ORDER BY revenue DESC
LIMIT 5
```

Results:

| region        | revenue    |
|---------------|------------|
| North America | $1,240,000 |
| Europe        | $890,000   |
| Asia Pacific  | $670,000   |
| Latin America | $340,000   |
| Middle East   | $120,000   |

### Trigger Data Ingestion
```
> Pull the latest data from Postgres into the lake
```

Claude Code calls Pipe’s MCP tools:
```
Running pipeline: postgres-to-lake
Source: tap-postgres (host: db.company.com, schema: public)
Tables: orders, customers, products
Mode: incremental (bookmark: 2026-04-17T23:00:00Z)

✓ orders: 1,247 new rows → s3://my-company-data/raw/postgres/orders/
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/
✓ products: 0 new rows (no changes)

Pipeline completed in 23 seconds.
```

### Read and Debug Logs
```
> Show me the last pipeline run. Did anything fail?
```

Claude Code reads the pipeline logs via Pipe’s MCP:
```
Last run: 2026-04-15T14:30:00Z — postgres-to-lake
Status: PARTIAL FAILURE

✓ orders: 1,247 rows extracted, 1,247 rows loaded
✗ customers: FAILED — connection timeout after 30s
✓ products: 0 rows (no changes)

Error details for customers:
psycopg2.OperationalError: could not connect to server: Connection timed out.
Is the server running on host "db.company.com" and accepting
TCP/IP connections on port 5432?
```

### Fix Errors
```
> The customers extraction failed due to a timeout. Can you increase the timeout and retry?
```

Claude Code updates the pipeline config and reruns:
```
Updated postgres-to-lake pipeline config:
  connection_timeout: 30s → 60s

Retrying customers extraction...
✓ customers: 34 new rows → s3://my-company-data/raw/postgres/customers/

Pipeline completed successfully.
```

## The AGENT.md Pattern
Create an `AGENT.md` file in your project to give Claude Code persistent context about your data infrastructure:
```markdown
# Data Engineering Agent Context

## Infrastructure
- Data lake: s3://my-company-data
- Source database: PostgreSQL at db.company.com:5432/analytics
- Pipelines: dataspoc-pipe (Singer-based)
- Query engine: dataspoc-lens (DuckDB over Parquet)

## Pipelines
| Pipeline | Source | Schedule | Mode |
|----------|--------|----------|------|
| postgres-to-lake | tap-postgres | Every 6h | Incremental |
| stripe-to-lake | tap-stripe | Every 1h | Incremental |
| sheets-to-lake | tap-google-sheets | Daily | Full |

## Common Tasks
- "Refresh the lake" → run all pipelines
- "Check data freshness" → compare latest timestamps in each table
- "Debug pipeline X" → read logs from .dataspoc/logs/X/

## Data Quality Rules
- orders.amount must be > 0
- customers.email must not be null
- All dates must be <= today (no future dates)

## Permissions
- This agent has READ-ONLY access to the data lake
- Pipeline runs require confirmation before execution
```

Claude Code reads this file automatically and uses it as context for every interaction.
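As a sketch of what the "check data freshness" task and the quality rules translate to, assuming the tables shown earlier (the `created` column on raw_stripe_payments and the `email` column on raw_postgres_customers are hypothetical, and the amount rule is checked against curated_sales):

```sql
-- Freshness: latest timestamp per table (extend the UNION for each table).
SELECT 'curated_sales' AS table_name, MAX(sale_date)::TIMESTAMP AS latest
FROM curated_sales
UNION ALL
SELECT 'raw_stripe_payments', MAX(created)::TIMESTAMP  -- 'created' column is assumed
FROM raw_stripe_payments;

-- Quality rules from AGENT.md: every count should be zero.
SELECT
  (SELECT COUNT(*) FROM curated_sales WHERE amount <= 0)              AS non_positive_amounts,
  (SELECT COUNT(*) FROM raw_postgres_customers WHERE email IS NULL)   AS null_emails,  -- 'email' is assumed
  (SELECT COUNT(*) FROM curated_sales WHERE sale_date > CURRENT_DATE) AS future_dates;
```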
## Real Session Example
Here is a complete terminal session showing Claude Code working as a data engineer:
```
$ claude

> I need to understand why our revenue dashboard shows a dip last Tuesday.

I'll investigate. Let me check the data.

[Calling lens.query: SELECT sale_date, SUM(amount) as daily_revenue
 FROM curated_sales
 WHERE sale_date BETWEEN '2026-04-07' AND '2026-04-14'
 GROUP BY sale_date ORDER BY sale_date]

Here's daily revenue for the past week:

| date       | revenue  |
|------------|----------|
| 2026-04-07 | $142,000 |
| 2026-04-08 | $138,000 |
| 2026-04-09 | $23,000  | ← significant drop
| 2026-04-10 | $145,000 |
| 2026-04-11 | $151,000 |

April 9th shows $23K vs a ~$140K average. Let me check if data was
actually ingested that day.

[Calling pipe.logs: postgres-to-lake, date=2026-04-09]

Found the issue. The pipeline on April 9th failed:
  Status: FAILED
  Error: "disk full on /tmp during Parquet conversion"
  Rows extracted: 12,450 (expected ~45,000)

Only partial data was loaded. The remaining ~32,550 orders from April 9th
are missing from the lake.

Recommendation:
1. Clear disk space on the pipeline runner
2. Re-run the pipeline for April 9th with a full extraction
3. Verify the revenue number after re-ingestion

Want me to re-run the pipeline?

> Yes, re-run it.

[Calling pipe.run: postgres-to-lake, start_date=2026-04-09, end_date=2026-04-09, mode=full]

✓ Pipeline completed: 44,892 rows extracted and loaded.

Let me verify the revenue now:

[Calling lens.cache_refresh: curated_sales]
[Calling lens.query: SELECT SUM(amount) FROM curated_sales WHERE sale_date = '2026-04-09']

April 9th revenue is now $141,200 — consistent with other days.
The dashboard dip was caused by incomplete data ingestion, not an
actual revenue drop.
```

## Tips for Effective Prompts
Be specific about tables:
```
# Good
> What's the average order value in curated_sales for Q1?

# Bad
> What's our AOV?
```

Ask for verification:

```
> Query the top 10 customers by spend, and cross-check the count
> against raw_orders to make sure no orders are missing.
```

Chain operations:

```
> Pull fresh data from Stripe, refresh the cache, then tell me
> how many new subscriptions we got this week.
```

Request explanations:

```
> Run a query to find customers who haven't ordered in 90 days.
> Explain the SQL you used.
```
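For that last prompt, the SQL Claude Code generates might look like this sketch (`customer_id` on the customers table is assumed; the LEFT JOIN also catches customers who never ordered):

```sql
-- Sketch: customers with no orders in the last 90 days (or none at all).
SELECT c.customer_id, MAX(s.sale_date) AS last_order
FROM raw_postgres_customers AS c
LEFT JOIN curated_sales AS s ON s.customer_id = c.customer_id
GROUP BY c.customer_id
HAVING MAX(s.sale_date) < CURRENT_DATE - INTERVAL 90 DAY
    OR MAX(s.sale_date) IS NULL;
```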
## Security Model

Claude Code with DataSpoc MCP is safe by design:
- Lens MCP is read-only. It rejects any SQL that attempts to write, modify, or delete data (see the example after this list).
- Pipe MCP requires confirmation. Pipeline runs are gated by user approval.
- Cloud IAM controls access. The MCP server inherits your cloud credentials. If you cannot access a bucket, neither can Claude Code.
- Every query is SQL. You can review exactly what was executed. No opaque retrieval or hidden data access.
- No credentials in config. The MCP servers use your existing cloud authentication (AWS SSO, gcloud auth, Azure CLI).
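For example, Lens would execute the first statement below and refuse the second (an illustration; the exact rejection message is up to Lens):

```sql
-- Accepted: read-only.
SELECT COUNT(*) FROM curated_sales;

-- Rejected by Lens MCP: attempts to modify data are blocked.
DELETE FROM curated_sales WHERE region = 'Europe';
```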
Claude Code becomes a powerful data engineering assistant without widening your security surface: it can only do what your credentials already allow, just faster.