
Quickstart

This guide takes you from zero to querying your data lake in five minutes.

Install both tools:
pip install dataspoc-pipe dataspoc-lens
Then initialize Pipe:
dataspoc-pipe init

This creates a .dataspoc/ directory in your bucket with the manifest and state tracking.

Add a source:
dataspoc-pipe add my-source

The interactive wizard walks you through:

  • Source type (database, API, file, etc.)
  • Connection details (host, credentials via env vars)
  • Tables or endpoints to extract
  • Destination bucket path
  • Sync mode (full or incremental)
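The wizard's answers are recorded in the manifest. The exact on-disk format isn't documented here, but a hypothetical source entry might look something like this (every key name below is illustrative, not dataspoc's actual schema):

```yaml
# Hypothetical sketch only — the real manifest format may differ.
sources:
  my-source:
    type: database             # source type chosen in the wizard
    host: db.example.internal  # connection details; secrets stay in env vars
    password_env: MY_SOURCE_PASSWORD
    tables: [orders, customers]
    destination: s3://my-data/raw/my-source/
    sync_mode: incremental     # or: full
```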
Then run your first sync:
dataspoc-pipe run my-source

Pipe extracts data from your source, converts it to Parquet, and writes it to your bucket under raw/my-source/<table>/.

Point Lens at your bucket:
dataspoc-lens add-bucket s3://my-data

Lens reads the manifest and discovers all tables Pipe has written.
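Discovery leans on the raw/&lt;source&gt;/&lt;table&gt;/ layout shown above. A stdlib-only sketch of the idea — deriving (source, table) pairs from object-key prefixes — with illustrative keys, not Lens's actual implementation:

```python
from pathlib import PurePosixPath

# Object keys as Pipe lays them out: raw/<source>/<table>/<file>.parquet
keys = [
    "raw/my-source/orders/part-0000.parquet",
    "raw/my-source/orders/part-0001.parquet",
    "raw/my-source/customers/part-0000.parquet",
]

# Take path components 1 and 2 (source and table) from each key,
# deduplicate, and sort for a stable listing
tables = sorted({tuple(PurePosixPath(k).parts[1:3]) for k in keys})
print(tables)  # [('my-source', 'customers'), ('my-source', 'orders')]
```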

Open the interactive SQL shell and query:
dataspoc-lens shell
SELECT customer_name, SUM(revenue) AS total
FROM raw.my_source.orders
GROUP BY customer_name
ORDER BY total DESC
LIMIT 10;
Or ask in plain English:
dataspoc-lens ask "top customers by revenue"

Lens translates your question into SQL, runs it, and returns the result.

You do not need a cloud bucket to get started. Use local files with file:// URIs:

# Create a local lake directory and a sample CSV
mkdir -p /tmp/my-lake
cat > /tmp/sales.csv << 'EOF'
date,customer,product,revenue
2025-01-15,Acme Corp,Widget Pro,15000
2025-01-15,Globex Inc,Widget Basic,8500
2025-01-16,Acme Corp,Widget Basic,4200
2025-01-16,Initech,Widget Pro,12000
2025-01-17,Globex Inc,Widget Pro,19500
EOF
# Initialize and ingest
dataspoc-pipe init --bucket file:///tmp/my-lake
dataspoc-pipe add local-csv --source-type file --path /tmp/sales.csv
dataspoc-pipe run local-csv
# Query
dataspoc-lens add-bucket file:///tmp/my-lake
dataspoc-lens shell
SELECT customer, SUM(revenue) AS total_revenue
FROM raw.local_csv.sales
GROUP BY customer
ORDER BY total_revenue DESC;
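To sanity-check what that query should return, here is the same aggregation over the sample CSV in plain Python (stdlib only):

```python
import csv
import io
from collections import defaultdict

# The sample data written to /tmp/sales.csv above
SAMPLE = """date,customer,product,revenue
2025-01-15,Acme Corp,Widget Pro,15000
2025-01-15,Globex Inc,Widget Basic,8500
2025-01-16,Acme Corp,Widget Basic,4200
2025-01-16,Initech,Widget Pro,12000
2025-01-17,Globex Inc,Widget Pro,19500
"""

# GROUP BY customer, SUM(revenue)
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    totals[row["customer"]] += int(row["revenue"])

# ORDER BY total_revenue DESC
result = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('Globex Inc', 28000), ('Acme Corp', 19200), ('Initech', 12000)]
```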