
# Configuration


All Pipe configuration lives under `~/.dataspoc-pipe/`. This page covers the directory structure and the full pipeline YAML reference.

```
~/.dataspoc-pipe/
├── config.yaml      # Global defaults
├── sources/         # One JSON file per data source
│   ├── orders.json
│   └── customers.json
├── pipelines/       # One YAML file per pipeline
│   ├── orders.yaml
│   └── customers.yaml
└── transforms/      # Optional Python transform scripts
    └── orders.py
```

`config.yaml` holds global defaults applied to new pipelines:

```yaml
defaults:
  compression: zstd
  partition_by: _extraction_date
```

Each source has a JSON file in `sources/` containing tap-specific configuration. The format depends on the Singer tap. Examples:

`tap-csv` (`sources/orders.json`):

```json
{
  "csv_files_definition": [
    {
      "entity": "orders",
      "path": "/data/orders.csv",
      "keys": ["id"]
    }
  ]
}
```

`tap-postgres` (`sources/customers.json`):

```json
{
  "host": "db.example.com",
  "port": 5432,
  "user": "readonly",
  "password": "${POSTGRES_PASSWORD}",
  "dbname": "production",
  "filter_schemas": "public"
}
```
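The `${POSTGRES_PASSWORD}` placeholder suggests environment-variable substitution so credentials stay out of config files. Whether Pipe expands these itself or delegates to the tap is not stated on this page; a minimal sketch of how such expansion could work (`expand_env_vars` is a hypothetical helper, not Pipe's API):

```python
import json
import os
import re

def expand_env_vars(text: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

# Example: resolve the password placeholder before handing config to the tap
os.environ["POSTGRES_PASSWORD"] = "s3cret"
raw = '{"user": "readonly", "password": "${POSTGRES_PASSWORD}"}'
config = json.loads(expand_env_vars(raw))
```

Expanding before JSON parsing keeps the substitution format-agnostic: the same helper would work on YAML or plain text.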

Each pipeline is defined in a YAML file at `~/.dataspoc-pipe/pipelines/<name>.yaml`.

```yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/customers.json
  streams:
    - public-customers
    - public-orders
destination:
  bucket: s3://my-datalake
  path: raw
  partition_by: _extraction_date
  compression: zstd
incremental:
  enabled: true
schedule:
  cron: "0 */2 * * *"
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `source.tap` | string | yes | Singer tap command (e.g., `tap-postgres`, `tap-csv`) |
| `source.config` | string or dict | yes | Path to source JSON config file, or inline config dict |
| `source.streams` | list of strings | no | Filter to specific streams. `null` or omitted means all streams |

The `tap` value is the exact command Pipe will execute as a subprocess. It must be available in your `PATH`.
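Because the tap is resolved from `PATH` at run time, a quick preflight check with the standard library can catch a missing command before a pipeline fails mid-run (`tap_available` is an illustrative helper, not part of Pipe):

```python
import shutil

def tap_available(tap: str) -> bool:
    """Return True if the tap command resolves to an executable on PATH."""
    return shutil.which(tap) is not None
```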

When `config` is a file path, Pipe passes it to the tap with `--config`. When it is an inline dict, Pipe writes it to a temporary file before execution.

```yaml
# Path to file
source:
  tap: tap-csv
  config: /home/you/.dataspoc-pipe/sources/orders.json
```

```yaml
# Inline config
source:
  tap: tap-csv
  config:
    csv_files_definition:
      - entity: orders
        path: /data/orders.csv
        keys: ["id"]
```
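The path-vs-inline behavior described above can be sketched as follows. This assumes a `--config` flag, which matches the usual Singer tap convention; `run_tap` is hypothetical, and `echo` stands in for a real tap binary in the demo:

```python
import json
import subprocess
import tempfile

def run_tap(tap, config):
    """Launch a tap, writing an inline dict config to a temp JSON file first."""
    if isinstance(config, dict):
        tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
        json.dump(config, tmp)
        tmp.close()
        config_path = tmp.name
    else:
        config_path = config  # already a path to a JSON file on disk
    return subprocess.Popen([tap, "--config", config_path],
                            stdout=subprocess.PIPE)

# Demo: `echo` simply prints its arguments back, showing the generated path
proc = run_tap("echo", {"csv_files_definition": []})
output = proc.stdout.read().decode()
```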
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `destination.bucket` | string | — (required) | Bucket URI: `s3://`, `gs://`, `az://`, or `file://` |
| `destination.path` | string | `raw` | Base path within the bucket |
| `destination.partition_by` | string | `_extraction_date` | Hive-style partition field |
| `destination.compression` | string | `zstd` | Parquet compression: `zstd`, `snappy`, `gzip`, or `none` |

The final path for a table is:

```
<bucket>/<path>/<source>/<table>/dt=<partition_value>/<table>_0000.parquet
```

For example, with `bucket: s3://my-lake` and `path: raw`, a table called `orders` from `tap-csv` writes to:

```
s3://my-lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```
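The layout above is mechanical, so it can be expressed as a small path builder. `table_path` is a hypothetical helper that mirrors the documented template, including the example's convention of deriving the `<source>` segment (`csv`) from the tap name:

```python
def table_path(bucket: str, path: str, source: str, table: str,
               partition_value: str, part: int = 0) -> str:
    """Build the Hive-style destination path for one table batch."""
    return (f"{bucket}/{path}/{source}/{table}/"
            f"dt={partition_value}/{table}_{part:04d}.parquet")

print(table_path("s3://my-lake", "raw", "csv", "orders", "2025-01-20"))
# s3://my-lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```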
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `incremental.enabled` | boolean | `false` | Enable Singer bookmark-based incremental extraction |

When enabled, Pipe loads the previous state from the bucket before running the tap, and saves updated state after a successful run. See Incremental Extraction for details.
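The load-before, save-after-success flow can be sketched like this. The state file layout and storage details are Pipe internals not documented here; the sketch uses a local temp file standing in for the bucket object, and the bookmark structure shown follows the general Singer state convention:

```python
import json
import tempfile
from pathlib import Path

def load_state(state_file: Path) -> dict:
    """Return the previous Singer state, or {} on the first run."""
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}

def save_state(state_file: Path, state: dict) -> None:
    """Persist the tap's updated state -- only call after a successful run."""
    state_file.write_text(json.dumps(state))

# Demo against a local temp file standing in for the bucket object
state_file = Path(tempfile.mkdtemp()) / "state.json"
first_run = load_state(state_file)   # {} -> tap performs a full extraction
save_state(state_file, {"bookmarks": {"public-orders": {"version": 1}}})
resumed = load_state(state_file)     # next run picks up where it left off
```

Saving only after success is what makes a failed run safe to retry: the old bookmark is still in place, so no records are skipped.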

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `schedule.cron` | string or null | `null` | Cron expression for automated scheduling |

The cron expression follows the standard five-field cron format: `minute hour day month weekday`.

```yaml
schedule:
  cron: "0 */2 * * *" # Every 2 hours
```

Install the schedule with `dataspoc-pipe schedule install`. See Scheduling for details.

Transforms are not configured in the YAML. Instead, if a file exists at:

```
~/.dataspoc-pipe/transforms/<pipeline_name>.py
```

and it defines a `transform(df)` function, Pipe automatically applies it to each batch during ingestion. See Transforms for details.
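A minimal example of what such a file could contain. This page does not specify which DataFrame type `df` receives, so the sketch assumes pandas; the column names (`status`, `total`) are illustrative, not part of Pipe:

```python
# ~/.dataspoc-pipe/transforms/orders.py
# Assumes Pipe passes each batch as a pandas DataFrame (an assumption --
# the actual type is not documented on this page).
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Drop cancelled orders and add a derived column before writing."""
    df = df[df["status"] != "cancelled"].copy()
    df["total_cents"] = (df["total"] * 100).round().astype(int)
    return df
```

Returning a new DataFrame (rather than mutating in place) keeps the function safe to re-run on the same batch.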