
Quickstart

This guide takes you from zero to querying your data lake in five minutes.

Install both tools:
pip install dataspoc-pipe dataspoc-lens
Then initialize Pipe:
dataspoc-pipe init

This creates a .dataspoc/ directory in your bucket with the manifest and state tracking.

Add a source:
dataspoc-pipe add my-source

The interactive wizard walks you through:

  • Source type (database, API, file, etc.)
  • Connection details (host, credentials via env vars)
  • Tables or endpoints to extract
  • Destination bucket path
  • Sync mode (full or incremental)
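The wizard's answers are recorded in the manifest. The exact on-disk format isn't documented here, but a hypothetical source entry might look something like this (every key name below is illustrative, not dataspoc's actual schema):

```yaml
# Hypothetical sketch only — the real manifest format may differ.
sources:
  my-source:
    type: database             # source type chosen in the wizard
    host: db.example.internal  # connection details; secrets stay in env vars
    password_env: MY_SOURCE_PASSWORD
    tables: [orders, customers]
    destination: s3://my-data/raw/my-source/
    sync_mode: incremental     # or: full
```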
Then run your first sync:
dataspoc-pipe run my-source

Pipe extracts data from your source, converts it to Parquet, and writes it to your bucket under raw/my-source/<table>/.

Point Lens at your bucket:
dataspoc-lens add-bucket s3://my-data

Lens reads the manifest and discovers all tables Pipe has written.
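Discovery leans on the raw/&lt;source&gt;/&lt;table&gt;/ layout shown above. A stdlib-only sketch of the idea — deriving (source, table) pairs from object-key prefixes — with illustrative keys, not Lens's actual implementation:

```python
from pathlib import PurePosixPath

# Object keys as Pipe lays them out: raw/<source>/<table>/<file>.parquet
keys = [
    "raw/my-source/orders/part-0000.parquet",
    "raw/my-source/orders/part-0001.parquet",
    "raw/my-source/customers/part-0000.parquet",
]

# Take path components 1 and 2 (source and table) from each key,
# deduplicate, and sort for a stable listing
tables = sorted({tuple(PurePosixPath(k).parts[1:3]) for k in keys})
print(tables)  # [('my-source', 'customers'), ('my-source', 'orders')]
```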

Open the interactive SQL shell and query:
dataspoc-lens shell
SELECT customer_name, SUM(revenue) AS total
FROM raw.my_source.orders
GROUP BY customer_name
ORDER BY total DESC
LIMIT 10;
Or ask in plain English:
dataspoc-lens ask "top customers by revenue"

Lens translates your question into SQL, runs it, and returns the result.

You do not need a cloud bucket to get started. Use local files with file:// URIs:

# Create a local lake directory and a sample CSV
mkdir -p /tmp/my-lake
cat > /tmp/sales.csv << 'EOF'
date,customer,product,revenue
2025-01-15,Acme Corp,Widget Pro,15000
2025-01-15,Globex Inc,Widget Basic,8500
2025-01-16,Acme Corp,Widget Basic,4200
2025-01-16,Initech,Widget Pro,12000
2025-01-17,Globex Inc,Widget Pro,19500
EOF
# Initialize and ingest
dataspoc-pipe init --bucket file:///tmp/my-lake
dataspoc-pipe add local-csv --source-type file --path /tmp/sales.csv
dataspoc-pipe run local-csv
# Query
dataspoc-lens add-bucket file:///tmp/my-lake
dataspoc-lens shell
SELECT customer, SUM(revenue) AS total_revenue
FROM raw.local_csv.sales
GROUP BY customer
ORDER BY total_revenue DESC;
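To sanity-check what that query should return, here is the same aggregation over the sample CSV in plain Python (stdlib only):

```python
import csv
import io
from collections import defaultdict

# The sample data written to /tmp/sales.csv above
SAMPLE = """date,customer,product,revenue
2025-01-15,Acme Corp,Widget Pro,15000
2025-01-15,Globex Inc,Widget Basic,8500
2025-01-16,Acme Corp,Widget Basic,4200
2025-01-16,Initech,Widget Pro,12000
2025-01-17,Globex Inc,Widget Pro,19500
"""

# GROUP BY customer, SUM(revenue)
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    totals[row["customer"]] += int(row["revenue"])

# ORDER BY total_revenue DESC
result = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('Globex Inc', 28000), ('Acme Corp', 19200), ('Initech', 12000)]
```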