Scheduling

Pipe uses the system crontab to schedule automated pipeline runs. No external orchestrator needed.

Set the schedule.cron field in your pipeline YAML:

source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://my-datalake
  path: raw
incremental:
  enabled: true
schedule:
  cron: "0 */2 * * *"

The cron expression uses the standard 5-field format:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6, Sunday = 0)
│ │ │ │ │
* * * * *
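As a rough illustration of the 5-field format, the sketch below validates a cron expression by checking the field count and numeric ranges. This is an illustrative check only, not Pipe's actual parser, and it deliberately ignores month/day names and other crontab extensions.

```python
# Illustrative 5-field cron validator (not Pipe's actual parser).
# Allowed numeric ranges per field: minute, hour, day of month, month, day of week.
CRON_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]

def is_valid_cron(expr: str) -> bool:
    fields = expr.split()
    if len(fields) != 5:
        return False
    for field, (lo, hi) in zip(fields, CRON_RANGES):
        for part in field.split(","):
            # Strip any step suffix ("*/2" -> "*", "0-30/15" -> "0-30").
            base = part.split("/")[0]
            if base == "*":
                continue
            bounds = base.split("-")  # single value or "a-b" range
            if not all(b.isdigit() and lo <= int(b) <= hi for b in bounds):
                return False
    return True
```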

After configuring cron expressions in your pipeline YAMLs, install them into the system crontab:

dataspoc-pipe schedule install
Installed: orders (0 */2 * * *)
Installed: customers (30 1 * * *)
2 schedule(s) installed.

This scans all pipelines for schedule.cron values and creates or updates crontab entries. Pipelines without a cron expression are skipped.
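The scan-and-install step can be sketched as follows. Pipeline configs are shown here as already-parsed dicts (Pipe reads them from YAML), and the binary path and lock-file location follow the crontab entry format documented later on this page; the helper names are hypothetical.

```python
# Hypothetical sketch of `schedule install`: collect schedule.cron values
# and render one crontab entry per scheduled pipeline.

def crontab_entry(name: str, cron: str,
                  binary: str = "/usr/local/bin/dataspoc-pipe") -> str:
    lock = f"/tmp/dataspoc-pipe-{name}.lock"
    return f"# dataspoc-pipe:{name}\n{cron} flock -n {lock} {binary} run {name}"

def install_entries(pipelines: dict) -> list:
    entries = []
    for name, cfg in pipelines.items():
        cron = cfg.get("schedule", {}).get("cron")
        if not cron:  # pipelines without a schedule.cron are skipped
            continue
        entries.append(crontab_entry(name, cron))
    return entries
```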

To remove all dataspoc-pipe entries from your crontab:

dataspoc-pipe schedule remove
Removed: dataspoc-pipe:orders
Removed: dataspoc-pipe:customers
2 schedule(s) removed.
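Removal is the mirror image: entries are identified by their `# dataspoc-pipe:<name>` comment marker. A minimal sketch (hypothetical helper, assuming each marker comment is immediately followed by its command line):

```python
# Hypothetical sketch of `schedule remove`: drop each "# dataspoc-pipe:<name>"
# comment and the scheduled command line that follows it; keep everything else.

def remove_entries(crontab_text: str) -> str:
    kept, skip_next = [], False
    for line in crontab_text.splitlines():
        if skip_next:          # this is the command line under a marker comment
            skip_next = False
            continue
        if line.startswith("# dataspoc-pipe:"):
            skip_next = True   # also drop the command line below the marker
            continue
        kept.append(line)
    return "\n".join(kept)
```

Matching on the comment marker means unrelated crontab entries are left untouched.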
Some common cron expressions:

Expression      Description
0 * * * *       Every hour, on the hour
0 */2 * * *     Every 2 hours
0 */6 * * *     Every 6 hours
30 1 * * *      Daily at 01:30
0 0 * * *       Daily at midnight
0 8 * * 1-5     Weekdays at 08:00
0 0 * * 0       Weekly on Sunday at midnight
0 0 1 * *       First day of each month at midnight
*/15 * * * *    Every 15 minutes

Each scheduled run uses flock to prevent overlapping executions. If a pipeline is still running when the next scheduled trigger fires, the new run is skipped silently.

The generated crontab entry looks like:

# dataspoc-pipe:orders
0 */2 * * * flock -n /tmp/dataspoc-pipe-orders.lock /usr/local/bin/dataspoc-pipe run orders

The lock file is stored at /tmp/dataspoc-pipe-<pipeline>.lock.
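The skip-if-locked behavior that `flock -n` provides can be demonstrated from Python with the stdlib `fcntl` module, which wraps the same underlying lock: a second non-blocking attempt on the same file fails immediately instead of waiting.

```python
import fcntl
import tempfile

# Demonstrates the non-blocking locking that `flock -n` relies on.
def try_lock(f) -> bool:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False

lock_path = tempfile.NamedTemporaryFile(delete=False).name
first, second = open(lock_path, "w"), open(lock_path, "w")

got_first = try_lock(first)    # first scheduled run takes the lock
got_second = try_lock(second)  # overlapping run: lock attempt fails
first.close()                  # lock released when the run finishes
got_retry = try_lock(second)   # the next trigger can proceed
```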

Scheduling uses the python-crontab package, which is included in Pipe’s default dependencies. The flock command ships with most Linux distributions (as part of util-linux); on macOS it is not installed by default and must be added separately (for example via Homebrew).

  • Combine scheduling with incremental extraction for efficient recurring pipelines. Each run only fetches new data.
  • Use dataspoc-pipe status to monitor scheduled runs:
dataspoc-pipe status
Pipelines
┌───────────┬─────────────────────┬─────────┬──────────┬─────────┐
│ Pipeline  │ Last Run            │ Status  │ Duration │ Records │
├───────────┼─────────────────────┼─────────┼──────────┼─────────┤
│ orders    │ 2025-01-20T14:00:00 │ success │ 3.2s     │ 450     │
│ customers │ 2025-01-20T01:30:00 │ success │ 12.5s    │ 8,200   │
└───────────┴─────────────────────┴─────────┴──────────┴─────────┘
  • Check system logs if a scheduled run does not appear in status:
grep dataspoc-pipe /var/log/syslog