# Configuration
All Pipe configuration lives under `~/.dataspoc-pipe/`. This page covers the directory structure and the full pipeline YAML reference.
## Directory structure

```
~/.dataspoc-pipe/
├── config.yaml        # Global defaults
├── sources/           # One JSON file per data source
│   ├── orders.json
│   └── customers.json
├── pipelines/         # One YAML file per pipeline
│   ├── orders.yaml
│   └── customers.yaml
└── transforms/        # Optional Python transform scripts
    └── orders.py
```
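To bootstrap this layout by hand, the directories can be created in one step (a convenience sketch; Pipe may also create them on first run):

```shell
# Create the expected config layout under the home directory
mkdir -p ~/.dataspoc-pipe/sources ~/.dataspoc-pipe/pipelines ~/.dataspoc-pipe/transforms
```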
## config.yaml

Global defaults applied to new pipelines:

```yaml
defaults:
  compression: zstd
  partition_by: _extraction_date
```

## Source config files

Each source has a JSON file in `sources/` containing tap-specific configuration. The format depends on the Singer tap. Examples:
`tap-csv` (`sources/orders.json`):

```json
{
  "csv_files_definition": [
    {
      "entity": "orders",
      "path": "/data/orders.csv",
      "keys": ["id"]
    }
  ]
}
```

`tap-postgres` (`sources/customers.json`):

```json
{
  "host": "db.example.com",
  "port": 5432,
  "user": "readonly",
  "password": "${POSTGRES_PASSWORD}",
  "dbname": "production",
  "filter_schemas": "public"
}
```

## Pipeline YAML reference

Each pipeline is defined in a YAML file at `~/.dataspoc-pipe/pipelines/<name>.yaml`.
### Full example

```yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/customers.json
  streams:
    - public-customers
    - public-orders

destination:
  bucket: s3://my-datalake
  path: raw
  partition_by: _extraction_date
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "0 */2 * * *"
```

### Field reference
#### source

| Field | Type | Required | Description |
|---|---|---|---|
| `source.tap` | string | yes | Singer tap command (e.g., `tap-postgres`, `tap-csv`) |
| `source.config` | string or dict | yes | Path to source JSON config file, or inline config dict |
| `source.streams` | list of strings | no | Filter to specific streams. `null` or omitted means all streams |
The `tap` value is the exact command Pipe will execute as a subprocess. It must be available in your `PATH`.

When `config` is a file path, Pipe passes it to the tap with `--config`. When it is an inline dict, Pipe writes it to a temporary file before execution.
```yaml
# Path to file
source:
  tap: tap-csv
  config: /home/you/.dataspoc-pipe/sources/orders.json
```

```yaml
# Inline config
source:
  tap: tap-csv
  config:
    csv_files_definition:
      - entity: orders
        path: /data/orders.csv
        keys: ["id"]
```

#### destination

| Field | Type | Default | Description |
|---|---|---|---|
| `destination.bucket` | string | — (required) | Bucket URI: `s3://`, `gs://`, `az://`, or `file://` |
| `destination.path` | string | `raw` | Base path within the bucket |
| `destination.partition_by` | string | `_extraction_date` | Hive-style partition field |
| `destination.compression` | string | `zstd` | Parquet compression: `zstd`, `snappy`, `gzip`, or `none` |
The final path for a table is:

```
<bucket>/<path>/<source>/<table>/dt=<partition_value>/<table>_0000.parquet
```

For example, with `bucket: s3://my-lake` and `path: raw`, a table called `orders` from `tap-csv` writes to:

```
s3://my-lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```
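The layout above can be sketched as a small path-building function. This is a hypothetical helper, not Pipe's actual code; note that the example implies the `tap-` prefix is dropped to form the source segment (`tap-csv` → `csv`):

```python
def destination_path(bucket, path, tap, table, partition_value, file_index=0):
    """Assemble the Hive-partitioned output path for one Parquet file."""
    source = tap.removeprefix("tap-")  # e.g. "tap-csv" -> "csv"
    return (f"{bucket}/{path}/{source}/{table}/"
            f"dt={partition_value}/{table}_{file_index:04d}.parquet")

print(destination_path("s3://my-lake", "raw", "tap-csv", "orders", "2025-01-20"))
# → s3://my-lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```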
#### incremental

| Field | Type | Default | Description |
|---|---|---|---|
| `incremental.enabled` | boolean | `false` | Enable Singer bookmark-based incremental extraction |
When enabled, Pipe loads the previous state from the bucket before running the tap, and saves updated state after a successful run. See Incremental Extraction for details.
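Conceptually, Singer incremental extraction works by replaying the last emitted STATE message (the "bookmarks") back to the tap on the next run. A minimal sketch of that round-trip, using a local file and hypothetical helper names rather than Pipe's actual bucket-backed storage:

```python
import json
import os
import tempfile

def save_state(path, state):
    """Persist the last STATE message emitted by the tap."""
    with open(path, "w") as f:
        json.dump(state, f)

def load_state(path):
    """Return the previous state, or an empty one on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

state_file = os.path.join(tempfile.mkdtemp(), "state.json")
assert load_state(state_file) == {}  # first run: full extraction
save_state(state_file, {"bookmarks": {"public-orders": {"replication_key_value": "2025-01-20"}}})
print(load_state(state_file)["bookmarks"]["public-orders"]["replication_key_value"])
# → 2025-01-20
```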
#### schedule

| Field | Type | Default | Description |
|---|---|---|---|
| `schedule.cron` | string or null | `null` | Cron expression for automated scheduling |
The cron expression follows the standard 5-field cron format: `minute hour day month weekday`.
```yaml
schedule:
  cron: "0 */2 * * *"  # Every 2 hours
```

Install the schedule with `dataspoc-pipe schedule install`. See Scheduling for details.
## Transforms (convention-based)

Transforms are not configured in the YAML. Instead, if a file exists at:

```
~/.dataspoc-pipe/transforms/<pipeline_name>.py
```

and it defines a `def transform(df)` function, Pipe automatically applies it to each batch during ingestion. See Transforms for details.