
# Configuration


All Pipe configuration lives under `~/.dataspoc-pipe/`. This page covers the directory structure and the full pipeline YAML reference.

```
~/.dataspoc-pipe/
├── config.yaml      # Global defaults
├── sources/         # One JSON file per data source
│   ├── orders.json
│   └── customers.json
├── pipelines/       # One YAML file per pipeline
│   ├── orders.yaml
│   └── customers.yaml
└── transforms/      # Optional Python transform scripts
    └── orders.py
```

`config.yaml` holds global defaults applied to new pipelines:

```yaml
defaults:
  compression: zstd
  partition_by: _extraction_date
```

Each source has a JSON file in `sources/` containing tap-specific configuration. The format depends on the Singer tap. Examples:

`tap-csv` (`sources/orders.json`):

```json
{
  "csv_files_definition": [
    {
      "entity": "orders",
      "path": "/data/orders.csv",
      "keys": ["id"]
    }
  ]
}
```

`tap-postgres` (`sources/customers.json`):

```json
{
  "host": "db.example.com",
  "port": 5432,
  "user": "readonly",
  "password": "${POSTGRES_PASSWORD}",
  "dbname": "production",
  "filter_schemas": "public"
}
```
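The `${POSTGRES_PASSWORD}` placeholder suggests environment-variable substitution so credentials stay out of config files. Whether Pipe expands these itself or delegates to the tap is not stated on this page; a minimal sketch of how such expansion could work (`expand_env_vars` is a hypothetical helper, not Pipe's API):

```python
import json
import os
import re

def expand_env_vars(text: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

# Example: resolve the password placeholder before handing config to the tap
os.environ["POSTGRES_PASSWORD"] = "s3cret"
raw = '{"user": "readonly", "password": "${POSTGRES_PASSWORD}"}'
config = json.loads(expand_env_vars(raw))
```

Expanding before JSON parsing keeps the substitution format-agnostic: the same helper would work on YAML or plain text.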

Each pipeline is defined in a YAML file at `~/.dataspoc-pipe/pipelines/<name>.yaml`.

```yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/customers.json
  streams:
    - public-customers
    - public-orders
destination:
  bucket: s3://my-datalake
  path: raw
  partition_by: _extraction_date
  compression: zstd
incremental:
  enabled: true
schedule:
  cron: "0 */2 * * *"
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `source.tap` | string | yes | Singer tap command (e.g., `tap-postgres`, `tap-csv`) |
| `source.config` | string or dict | yes | Path to source JSON config file, or inline config dict |
| `source.streams` | list of strings | no | Filter to specific streams. `null` or omitted means all streams |

The `tap` value is the exact command Pipe will execute as a subprocess. It must be available in your `PATH`.
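Because the tap is resolved from `PATH` at run time, a quick preflight check with the standard library can catch a missing command before a pipeline fails mid-run (`tap_available` is an illustrative helper, not part of Pipe):

```python
import shutil

def tap_available(tap: str) -> bool:
    """Return True if the tap command resolves to an executable on PATH."""
    return shutil.which(tap) is not None
```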

When `config` is a file path, Pipe passes it to the tap with `--config`. When it is an inline dict, Pipe writes it to a temporary file before execution.

```yaml
# Path to file
source:
  tap: tap-csv
  config: /home/you/.dataspoc-pipe/sources/orders.json
```

```yaml
# Inline config
source:
  tap: tap-csv
  config:
    csv_files_definition:
      - entity: orders
        path: /data/orders.csv
        keys: ["id"]
```
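The path-vs-inline behavior described above can be sketched as follows. This assumes a `--config` flag, which matches the usual Singer tap convention; `run_tap` is hypothetical, and `echo` stands in for a real tap binary in the demo:

```python
import json
import subprocess
import tempfile

def run_tap(tap, config):
    """Launch a tap, writing an inline dict config to a temp JSON file first."""
    if isinstance(config, dict):
        tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
        json.dump(config, tmp)
        tmp.close()
        config_path = tmp.name
    else:
        config_path = config  # already a path to a JSON file on disk
    return subprocess.Popen([tap, "--config", config_path],
                            stdout=subprocess.PIPE)

# Demo: `echo` simply prints its arguments back, showing the generated path
proc = run_tap("echo", {"csv_files_definition": []})
output = proc.stdout.read().decode()
```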
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `destination.bucket` | string | — (required) | Bucket URI: `s3://`, `gs://`, `az://`, or `file://` |
| `destination.path` | string | `raw` | Base path within the bucket |
| `destination.partition_by` | string | `_extraction_date` | Hive-style partition field |
| `destination.compression` | string | `zstd` | Parquet compression: `zstd`, `snappy`, `gzip`, or `none` |

The final path for a table is:

```
<bucket>/<path>/<source>/<table>/dt=<partition_value>/<table>_0000.parquet
```

For example, with `bucket: s3://my-lake` and `path: raw`, a table called `orders` from `tap-csv` writes to:

```
s3://my-lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```
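The layout above is mechanical, so it can be expressed as a small path builder. `table_path` is a hypothetical helper that mirrors the documented template, including the example's convention of deriving the `<source>` segment (`csv`) from the tap name:

```python
def table_path(bucket: str, path: str, source: str, table: str,
               partition_value: str, part: int = 0) -> str:
    """Build the Hive-style destination path for one table batch."""
    return (f"{bucket}/{path}/{source}/{table}/"
            f"dt={partition_value}/{table}_{part:04d}.parquet")

print(table_path("s3://my-lake", "raw", "csv", "orders", "2025-01-20"))
# s3://my-lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```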
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `incremental.enabled` | boolean | `false` | Enable Singer bookmark-based incremental extraction |

When enabled, Pipe loads the previous state from the bucket before running the tap, and saves updated state after a successful run. See Incremental Extraction for details.
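The load-before, save-after-success flow can be sketched like this. The state file layout and storage details are Pipe internals not documented here; the sketch uses a local temp file standing in for the bucket object, and the bookmark structure shown follows the general Singer state convention:

```python
import json
import tempfile
from pathlib import Path

def load_state(state_file: Path) -> dict:
    """Return the previous Singer state, or {} on the first run."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}

def save_state(state_file: Path, state: dict) -> None:
    """Persist the tap's updated state -- only call after a successful run."""
    state_file.write_text(json.dumps(state))

# Demo against a local temp file standing in for the bucket object
state_file = Path(tempfile.mkdtemp()) / "state.json"
first_run = load_state(state_file)   # {} -> tap performs a full extraction
save_state(state_file, {"bookmarks": {"public-orders": {"version": 1}}})
resumed = load_state(state_file)     # next run picks up where it left off
```

Saving only after success is what makes a failed run safe to retry: the old bookmark is still in place, so no records are skipped.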

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `schedule.cron` | string or null | `null` | Cron expression for automated scheduling |

The cron expression follows the standard five-field cron format: `minute hour day month weekday`.

```yaml
schedule:
  cron: "0 */2 * * *" # Every 2 hours
```

Install the schedule with `dataspoc-pipe schedule install`. See Scheduling for details.

Transforms are not configured in the YAML. Instead, if a file exists at:

```
~/.dataspoc-pipe/transforms/<pipeline_name>.py
```

and it defines a `transform(df)` function, Pipe automatically applies it to each batch during ingestion. See Transforms for details.
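A minimal example of what such a file could contain. This page does not specify which DataFrame type `df` receives, so the sketch assumes pandas; the column names (`status`, `total`) are illustrative, not part of Pipe:

```python
# ~/.dataspoc-pipe/transforms/orders.py
# Assumes Pipe passes each batch as a pandas DataFrame (an assumption --
# the actual type is not documented on this page).
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Drop cancelled orders and add a derived column before writing."""
    df = df[df["status"] != "cancelled"].copy()
    df["total_cents"] = (df["total"] * 100).round().astype(int)
    return df
```

Returning a new DataFrame (rather than mutating in place) keeps the function safe to re-run on the same batch.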