# Quickstart
This guide walks you through creating and running a pipeline that ingests a CSV file into Parquet on your local filesystem.
## 1. Install Pipe and the CSV tap

```sh
pip install dataspoc-pipe
pip install tap-csv
```

## 2. Create a sample CSV

```sh
mkdir -p /tmp/sample-data
cat > /tmp/sample-data/orders.csv << 'EOF'
id,customer,product,amount,created_at
1,Alice,Widget,29.99,2025-01-15
2,Bob,Gadget,49.99,2025-01-16
3,Charlie,Widget,29.99,2025-01-17
4,Alice,Gizmo,19.99,2025-01-18
5,Diana,Gadget,49.99,2025-01-19
EOF
```
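If you are working without a POSIX shell (for example on Windows), the same sample file can be written with a few lines of Python instead; this is just a minimal equivalent of the heredoc above and nothing here is specific to Pipe:

```python
from pathlib import Path

rows = """\
id,customer,product,amount,created_at
1,Alice,Widget,29.99,2025-01-15
2,Bob,Gadget,49.99,2025-01-16
3,Charlie,Widget,29.99,2025-01-17
4,Alice,Gizmo,19.99,2025-01-18
5,Diana,Gadget,49.99,2025-01-19
"""

# Write the sample CSV to the path the source config will point at.
target = Path("/tmp/sample-data/orders.csv")
target.parent.mkdir(parents=True, exist_ok=True)
target.write_text(rows)
```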
## 3. Initialize the config directory

```sh
dataspoc-pipe init
```

Expected output:

```
Structure created at /home/you/.dataspoc-pipe
Next step: dataspoc-pipe add <name>
```

This creates the directory structure:

```
~/.dataspoc-pipe/
  config.yaml     # Global defaults
  sources/        # Tap configuration files
  pipelines/      # Pipeline YAML definitions
  transforms/     # Optional Python transforms
```
## 4. Create a pipeline

```sh
dataspoc-pipe add my-first-pipeline
```

The interactive wizard will ask:

```
Creating pipeline: my-first-pipeline
Taps with template: tap-csv, tap-postgres, tap-mysql, ...
Tap Singer: tap-csv
Template found for tap-csv
Source config: /home/you/.dataspoc-pipe/sources/my-first-pipeline.json
Destination bucket (e.g.: s3://my-bucket, file:///tmp/lake): file:///tmp/lake
Base path in bucket [raw]: raw
Compression (zstd, snappy, gzip, none) [zstd]: zstd
Enable incremental extraction? [y/N]: N
Cron expression for scheduling (empty to skip):

Pipeline saved at /home/you/.dataspoc-pipe/pipelines/my-first-pipeline.yaml

Next steps:
  1. Edit source config: /home/you/.dataspoc-pipe/sources/my-first-pipeline.json
  2. Validate: dataspoc-pipe validate my-first-pipeline
  3. Run: dataspoc-pipe run my-first-pipeline
```
## 5. Edit the source config

Open `~/.dataspoc-pipe/sources/my-first-pipeline.json` and point it to the CSV:

```json
{
  "csv_files_definition": [
    {
      "entity": "orders",
      "path": "/tmp/sample-data/orders.csv",
      "keys": ["id"]
    }
  ]
}
```
## 6. Check the pipeline YAML

The generated file at `~/.dataspoc-pipe/pipelines/my-first-pipeline.yaml` looks like this:

```yaml
source:
  tap: tap-csv
  config: /home/you/.dataspoc-pipe/sources/my-first-pipeline.json

destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd
  partition_by: _extraction_date

incremental:
  enabled: false

schedule:
  cron: null
```
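Because this is plain YAML, you can also read it from scripts. For example, to print where the pipeline will write (a sketch that assumes PyYAML is installed and uses only the keys shown above):

```python
from pathlib import Path

import yaml  # PyYAML

# Load the generated pipeline definition from its default location.
pipeline_path = Path.home() / ".dataspoc-pipe/pipelines/my-first-pipeline.yaml"
pipeline = yaml.safe_load(pipeline_path.read_text())

# Destination fields correspond to the bucket/path/compression chosen in the wizard.
dest = pipeline["destination"]
print(dest["bucket"], dest["path"], dest["compression"])
```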
## 7. Validate

```sh
dataspoc-pipe validate my-first-pipeline
```

Expected output:

```
Validating: my-first-pipeline
  Bucket OK: file:///tmp/lake
  Tap OK: tap-csv found in PATH
```
## 8. Run the pipeline

```sh
dataspoc-pipe run my-first-pipeline
```

Expected output:

```
Running: my-first-pipeline
  orders: 5 records
...
Done! 5 records in 1 stream(s)
  orders: 5
```
## 9. Check status

```sh
dataspoc-pipe status
```

```
 Pipelines
┌───────────────────┬─────────────────────┬─────────┬──────────┬─────────┐
│ Pipeline          │ Last Run            │ Status  │ Duration │ Records │
├───────────────────┼─────────────────────┼─────────┼──────────┼─────────┤
│ my-first-pipeline │ 2025-01-20T10:30:00 │ success │ 1.2s     │ 5       │
└───────────────────┴─────────────────────┴─────────┴──────────┴─────────┘
```
## 10. View the execution log

```sh
dataspoc-pipe logs my-first-pipeline
```

```json
{
  "pipeline": "my-first-pipeline",
  "status": "success",
  "started_at": "2025-01-20T10:30:00Z",
  "finished_at": "2025-01-20T10:30:01Z",
  "duration_seconds": 1.2,
  "total_records": 5,
  "streams": { "orders": 5 }
}
```
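Because the log is emitted as JSON, it is also easy to consume from scripts. A small sketch that captures the CLI output, assuming the command prints exactly the JSON document shown above:

```python
import json
import subprocess

# Capture the structured log for the last run and pull out a few fields.
result = subprocess.run(
    ["dataspoc-pipe", "logs", "my-first-pipeline"],
    capture_output=True, text=True, check=True,
)
log = json.loads(result.stdout)  # assumes stdout contains only the JSON log
print(log["status"], log["total_records"], log["streams"])
```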
## 11. Inspect the manifest

```sh
dataspoc-pipe manifest file:///tmp/lake
```

The manifest shows all tables Pipe has written to this bucket, including schemas, record counts, and timestamps. This is the catalog that downstream tools like DataSpoc Lens use to discover available data.
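The manifest itself lives inside the bucket at `.dataspoc/manifest.json` (see the layout in the next step), so you can also read it directly. A minimal sketch that makes no assumptions about its internal structure:

```python
import json
from pathlib import Path

# Pretty-print the catalog Pipe maintains for this bucket.
manifest = json.loads(Path("/tmp/lake/.dataspoc/manifest.json").read_text())
print(json.dumps(manifest, indent=2))
```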
## 12. Inspect the bucket

```
/tmp/lake/
  .dataspoc/
    manifest.json
    state/my-first-pipeline/state.json
    logs/my-first-pipeline/2025-01-20T103000Z.json
  raw/
    csv/orders/
      dt=2025-01-20/
        orders_0000.parquet
```

The Parquet file is ready to query with DuckDB, Pandas, Polars, or DataSpoc Lens.
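For example, with DuckDB's Python API you can query the ingested table straight from the lake. This is a sketch: the glob follows the layout shown above, and the cast is there in case the tap emitted `amount` as text.

```python
import duckdb

# Aggregate revenue per product directly from the Parquet files in the bucket.
query = """
    SELECT product,
           COUNT(*)                    AS orders,
           SUM(CAST(amount AS DOUBLE)) AS revenue
    FROM read_parquet('/tmp/lake/raw/csv/orders/*/*.parquet')
    GROUP BY product
    ORDER BY revenue DESC
"""
print(duckdb.sql(query).df())
```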
## Next steps

- Configuration — full pipeline YAML reference
- Incremental Extraction — only fetch new data
- Transforms — clean data during ingestion
- Multi-Cloud Storage — write to S3, GCS, or Azure
- MCP Server — connect AI agents to Pipe