Quickstart

This guide walks you through creating and running a pipeline that ingests a CSV file into Parquet on your local filesystem.

Install Pipe and the Singer tap for CSV files:

pip install dataspoc-pipe
pip install tap-csv

Create a sample CSV file to ingest:
mkdir -p /tmp/sample-data
cat > /tmp/sample-data/orders.csv << 'EOF'
id,customer,product,amount,created_at
1,Alice,Widget,29.99,2025-01-15
2,Bob,Gadget,49.99,2025-01-16
3,Charlie,Widget,29.99,2025-01-17
4,Alice,Gizmo,19.99,2025-01-18
5,Diana,Gadget,49.99,2025-01-19
EOF

Initialize Pipe's working directory:
dataspoc-pipe init

Expected output:

Structure created at /home/you/.dataspoc-pipe
Next step: dataspoc-pipe add <name>

This creates the directory structure:

~/.dataspoc-pipe/
  config.yaml   # Global defaults
  sources/      # Tap configuration files
  pipelines/    # Pipeline YAML definitions
  transforms/   # Optional Python transforms
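
Shared defaults across pipelines live in config.yaml. As a purely illustrative sketch (these key names are assumptions, not documented settings), it might hold the values the wizard offers as defaults:

default_bucket: file:///tmp/lake   # hypothetical key: fallback destination
default_compression: zstd          # hypothetical key: fallback codec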

Now create your first pipeline:
dataspoc-pipe add my-first-pipeline

The interactive wizard will ask:

Creating pipeline: my-first-pipeline
Taps with templates: tap-csv, tap-postgres, tap-mysql, ...
Singer tap: tap-csv
Template found for tap-csv
Source config: /home/you/.dataspoc-pipe/sources/my-first-pipeline.json
Destination bucket (e.g.: s3://my-bucket, file:///tmp/lake): file:///tmp/lake
Base path in bucket [raw]: raw
Compression (zstd, snappy, gzip, none) [zstd]: zstd
Enable incremental extraction? [y/N]: N
Cron expression for scheduling (empty to skip):
Pipeline saved at /home/you/.dataspoc-pipe/pipelines/my-first-pipeline.yaml
Next steps:
1. Edit source config: /home/you/.dataspoc-pipe/sources/my-first-pipeline.json
2. Validate: dataspoc-pipe validate my-first-pipeline
3. Run: dataspoc-pipe run my-first-pipeline

Open ~/.dataspoc-pipe/sources/my-first-pipeline.json and point it to the CSV:

{
  "csv_files_definition": [
    {
      "entity": "orders",
      "path": "/tmp/sample-data/orders.csv",
      "keys": ["id"]
    }
  ]
}
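
To ingest more than one file, add an entry per CSV to the same list. For example, with a hypothetical second file:

{
  "csv_files_definition": [
    { "entity": "orders", "path": "/tmp/sample-data/orders.csv", "keys": ["id"] },
    { "entity": "customers", "path": "/tmp/sample-data/customers.csv", "keys": ["id"] }
  ]
}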

The generated file at ~/.dataspoc-pipe/pipelines/my-first-pipeline.yaml looks like this:

source:
  tap: tap-csv
  config: /home/you/.dataspoc-pipe/sources/my-first-pipeline.json
destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd
  partition_by: _extraction_date
incremental:
  enabled: false
schedule:
  cron: null
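
You can edit this file directly instead of re-running the wizard. For instance, to schedule the pipeline, fill in the cron field shown above (the expression itself is just an example):

schedule:
  cron: "0 2 * * *"  # every day at 02:00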

Validate the pipeline before running it:
dataspoc-pipe validate my-first-pipeline

Expected output:

Validating: my-first-pipeline
Bucket OK: file:///tmp/lake
Tap OK: tap-csv found in PATH

Now run it:
dataspoc-pipe run my-first-pipeline

Expected output:

Running: my-first-pipeline
orders: 5 records...
Done! 5 records in 1 stream(s)
orders: 5

Check pipeline status:

dataspoc-pipe status

Pipelines
┌───────────────────┬─────────────────────┬─────────┬──────────┬─────────┐
│ Pipeline          │ Last Run            │ Status  │ Duration │ Records │
├───────────────────┼─────────────────────┼─────────┼──────────┼─────────┤
│ my-first-pipeline │ 2025-01-20T10:30:00 │ success │ 1.2s     │ 5       │
└───────────────────┴─────────────────────┴─────────┴──────────┴─────────┘

Inspect the structured run log:

dataspoc-pipe logs my-first-pipeline

{
  "pipeline": "my-first-pipeline",
  "status": "success",
  "started_at": "2025-01-20T10:30:00Z",
  "finished_at": "2025-01-20T10:30:01Z",
  "duration_seconds": 1.2,
  "total_records": 5,
  "streams": {
    "orders": 5
  }
}
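
Since the log is JSON, it pipes cleanly into other tools. A quick example with jq (assuming the command writes the log to stdout, as shown above):

dataspoc-pipe logs my-first-pipeline | jq .total_records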

Finally, list what Pipe has written to the bucket:
dataspoc-pipe manifest file:///tmp/lake

The manifest shows all tables Pipe has written to this bucket, including schemas, record counts, and timestamps. This is the catalog that downstream tools like DataSpoc Lens use to discover available data.
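
As a rough illustration (the field names here are assumptions, not a documented schema), a manifest entry for this run could look like:

{
  "orders": {
    "records": 5,
    "written_at": "2025-01-20T10:30:01Z"
  }
}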

On disk, the bucket now looks like this:

/tmp/lake/
  .dataspoc/
    manifest.json
    state/my-first-pipeline/state.json
    logs/my-first-pipeline/2025-01-20T103000Z.json
  raw/
    csv/orders/
      dt=2025-01-20/
        orders_0000.parquet

The Parquet file is ready to query with DuckDB, Pandas, Polars, or DataSpoc Lens.
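
For example, a quick aggregation with DuckDB (the path glob assumes the partition layout shown above):

import duckdb

# Sum order amounts per customer straight from the Parquet files
df = duckdb.sql(
    "SELECT customer, SUM(amount) AS total "
    "FROM '/tmp/lake/raw/csv/orders/*/*.parquet' "
    "GROUP BY customer ORDER BY total DESC"
).df()
print(df)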