Scheduling

Pipe uses the system crontab to schedule automated pipeline runs. No external orchestrator needed.

Set the schedule.cron field in your pipeline YAML:

source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://my-datalake
  path: raw
incremental:
  enabled: true
schedule:
  cron: "0 */2 * * *"

The cron expression uses the standard 5-field format:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6, Sunday = 0)
│ │ │ │ │
* * * * *
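As a rough illustration of the 5-field format, the sketch below validates a cron expression by checking the field count and numeric ranges. This is an illustrative check only, not Pipe's actual parser, and it deliberately ignores month/day names and other crontab extensions.

```python
# Illustrative 5-field cron validator (not Pipe's actual parser).
# Allowed numeric ranges per field: minute, hour, day of month, month, day of week.
CRON_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]

def is_valid_cron(expr: str) -> bool:
    fields = expr.split()
    if len(fields) != 5:
        return False
    for field, (lo, hi) in zip(fields, CRON_RANGES):
        for part in field.split(","):
            # Strip any step suffix ("*/2" -> "*", "0-30/15" -> "0-30").
            base = part.split("/")[0]
            if base == "*":
                continue
            bounds = base.split("-")  # single value or "a-b" range
            if not all(b.isdigit() and lo <= int(b) <= hi for b in bounds):
                return False
    return True
```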

After configuring cron expressions in your pipeline YAMLs, install them into the system crontab:

dataspoc-pipe schedule install
Installed: orders (0 */2 * * *)
Installed: customers (30 1 * * *)
2 schedule(s) installed.

This scans all pipelines for schedule.cron values and creates or updates crontab entries. Pipelines without a cron expression are skipped.
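The scan-and-install step can be sketched as follows. Pipeline configs are shown here as already-parsed dicts (Pipe reads them from YAML), and the binary path and lock-file location follow the crontab entry format documented later on this page; the helper names are hypothetical.

```python
# Hypothetical sketch of `schedule install`: collect schedule.cron values
# and render one crontab entry per scheduled pipeline.

def crontab_entry(name: str, cron: str,
                  binary: str = "/usr/local/bin/dataspoc-pipe") -> str:
    lock = f"/tmp/dataspoc-pipe-{name}.lock"
    return f"# dataspoc-pipe:{name}\n{cron} flock -n {lock} {binary} run {name}"

def install_entries(pipelines: dict) -> list:
    entries = []
    for name, cfg in pipelines.items():
        cron = cfg.get("schedule", {}).get("cron")
        if not cron:  # pipelines without a schedule.cron are skipped
            continue
        entries.append(crontab_entry(name, cron))
    return entries
```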

To remove all dataspoc-pipe entries from your crontab:

dataspoc-pipe schedule remove
Removed: dataspoc-pipe:orders
Removed: dataspoc-pipe:customers
2 schedule(s) removed.
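Removal is the mirror image: entries are identified by their `# dataspoc-pipe:<name>` comment marker. A minimal sketch (hypothetical helper, assuming each marker comment is immediately followed by its command line):

```python
# Hypothetical sketch of `schedule remove`: drop each "# dataspoc-pipe:<name>"
# comment and the scheduled command line that follows it; keep everything else.

def remove_entries(crontab_text: str) -> str:
    kept, skip_next = [], False
    for line in crontab_text.splitlines():
        if skip_next:          # this is the command line under a marker comment
            skip_next = False
            continue
        if line.startswith("# dataspoc-pipe:"):
            skip_next = True   # also drop the command line below the marker
            continue
        kept.append(line)
    return "\n".join(kept)
```

Matching on the comment marker means unrelated crontab entries are left untouched.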
Some common cron expressions:

Expression      Description
0 * * * *       Every hour, on the hour
0 */2 * * *     Every 2 hours
0 */6 * * *     Every 6 hours
30 1 * * *      Daily at 01:30
0 0 * * *       Daily at midnight
0 8 * * 1-5     Weekdays at 08:00
0 0 * * 0       Weekly on Sunday at midnight
0 0 1 * *       First day of each month at midnight
*/15 * * * *    Every 15 minutes

Each scheduled run uses flock to prevent overlapping executions. If a pipeline is still running when the next scheduled trigger fires, the new run is skipped silently.

The generated crontab entry looks like:

# dataspoc-pipe:orders
0 */2 * * * flock -n /tmp/dataspoc-pipe-orders.lock /usr/local/bin/dataspoc-pipe run orders

The lock file is stored at /tmp/dataspoc-pipe-<pipeline>.lock.
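The skip-if-locked behavior that `flock -n` provides can be demonstrated from Python with the stdlib `fcntl` module, which wraps the same underlying lock: a second non-blocking attempt on the same file fails immediately instead of waiting.

```python
import fcntl
import tempfile

# Demonstrates the non-blocking locking that `flock -n` relies on.
def try_lock(f) -> bool:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False

lock_path = tempfile.NamedTemporaryFile(delete=False).name
first, second = open(lock_path, "w"), open(lock_path, "w")

got_first = try_lock(first)    # first scheduled run takes the lock
got_second = try_lock(second)  # overlapping run: lock attempt fails
first.close()                  # lock released when the run finishes
got_retry = try_lock(second)   # the next trigger can proceed
```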

Scheduling uses the python-crontab package, which is included in Pipe’s default dependencies. The flock command ships with most Linux distributions (as part of util-linux); on macOS it is not installed by default and must be added separately (for example via Homebrew).

  • Combine scheduling with incremental extraction for efficient recurring pipelines. Each run only fetches new data.
  • Use dataspoc-pipe status to monitor scheduled runs:
dataspoc-pipe status
Pipelines
┌───────────┬─────────────────────┬─────────┬──────────┬─────────┐
│ Pipeline  │ Last Run            │ Status  │ Duration │ Records │
├───────────┼─────────────────────┼─────────┼──────────┼─────────┤
│ orders    │ 2025-01-20T14:00:00 │ success │ 3.2s     │ 450     │
│ customers │ 2025-01-20T01:30:00 │ success │ 12.5s    │ 8,200   │
└───────────┴─────────────────────┴─────────┴──────────┴─────────┘
  • Check system logs if a scheduled run does not appear in status:
grep dataspoc-pipe /var/log/syslog