Scheduling
Pipe uses the system crontab to schedule automated pipeline runs, so no external orchestrator is needed.
Adding a schedule
Set the `schedule.cron` field in your pipeline YAML:

    source:
      tap: tap-postgres
      config: /home/you/.dataspoc-pipe/sources/orders.json

    destination:
      bucket: s3://my-datalake
      path: raw

    incremental:
      enabled: true

    schedule:
      cron: "0 */2 * * *"

The cron expression uses the standard 5-field format:

    ┌───────────── minute (0 - 59)
    │ ┌───────────── hour (0 - 23)
    │ │ ┌───────────── day of month (1 - 31)
    │ │ │ ┌───────────── month (1 - 12)
    │ │ │ │ ┌───────────── day of week (0 - 6, Sunday = 0)
    │ │ │ │ │
    * * * * *

Installing schedules
After configuring cron expressions in your pipeline YAMLs, install them into the system crontab:

    dataspoc-pipe schedule install
    Installed: orders (0 */2 * * *)
    Installed: customers (30 1 * * *)

    2 schedule(s) installed.

This scans all pipelines for `schedule.cron` values and creates or updates crontab entries. Pipelines without a cron expression are skipped.
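The scan-and-skip behaviour described above can be pictured with a short Python sketch. This is not Pipe's actual implementation: `collect_schedules` is a hypothetical helper, the pipeline configs are assumed to be already parsed into dictionaries, and the `events` pipeline is invented to show a pipeline without a schedule being skipped.

```python
def collect_schedules(pipelines):
    """Return {pipeline_name: cron_expression} for pipelines that set schedule.cron."""
    schedules = {}
    for name, config in pipelines.items():
        # A missing or empty "schedule" block means the pipeline is not scheduled.
        cron = (config.get("schedule") or {}).get("cron")
        if cron:
            schedules[name] = cron
    return schedules


# Hypothetical, already-parsed pipeline configs keyed by pipeline name.
pipelines = {
    "orders": {"schedule": {"cron": "0 */2 * * *"}},
    "customers": {"schedule": {"cron": "30 1 * * *"}},
    "events": {"incremental": {"enabled": True}},  # no schedule.cron: skipped
}

schedules = collect_schedules(pipelines)
for name, cron in schedules.items():
    print(f"Installed: {name} ({cron})")
print(f"{len(schedules)} schedule(s) installed.")
```

Only `orders` and `customers` end up in the result, mirroring the report printed by `dataspoc-pipe schedule install`.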
Removing schedules
To remove all dataspoc-pipe entries from your crontab:

    dataspoc-pipe schedule remove
    Removed: dataspoc-pipe:orders
    Removed: dataspoc-pipe:customers

    2 schedule(s) removed.

Common cron expressions

| Expression | Description |
|---|---|
| `0 * * * *` | Every hour, on the hour |
| `0 */2 * * *` | Every 2 hours |
| `0 */6 * * *` | Every 6 hours |
| `30 1 * * *` | Daily at 01:30 |
| `0 0 * * *` | Daily at midnight |
| `0 8 * * 1-5` | Weekdays at 08:00 |
| `0 0 * * 0` | Weekly on Sunday at midnight |
| `0 0 1 * *` | First day of each month at midnight |
| `*/15 * * * *` | Every 15 minutes |
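To check what an expression like the ones in the table actually matches, it can help to expand each field into its concrete values. The following is a minimal sketch using only the Python standard library; `expand_field` and `parse_cron` are hypothetical helper names, and the parser covers only `*`, single numbers, ranges, lists and `/step`, not month or weekday names or other cron extensions.

```python
# (field name, lowest value, highest value) for the standard 5-field format.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),
]


def expand_field(spec, low, high):
    """Expand one cron field (e.g. "*/15" or "1-5") into a sorted list of ints."""
    values = set()
    for part in spec.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"step must be positive in {part!r}")
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        if not (low <= start <= end <= high):
            raise ValueError(f"{part!r} is outside {low}-{high}")
        values.update(range(start, end + 1, step))
    return sorted(values)


def parse_cron(expression):
    """Split a 5-field expression and expand every field."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {
        name: expand_field(spec, low, high)
        for spec, (name, low, high) in zip(parts, FIELDS)
    }


for expression in ("0 */2 * * *", "0 8 * * 1-5", "*/15 * * * *"):
    fields = parse_cron(expression)
    print(expression, "->", fields["minute"], fields["hour"], fields["day_of_week"])
```

An expression that raises `ValueError` here is worth double-checking before it goes into a pipeline YAML, although cron implementations accept some syntax this sketch rejects.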
Overlap prevention with flock
Each scheduled run is wrapped in `flock` so that executions of the same pipeline cannot overlap. If a pipeline is still running when the next scheduled trigger fires, `flock -n` cannot acquire the lock and exits immediately instead of waiting, so the new run is skipped silently.
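The non-blocking lock semantics can be reproduced in a few lines of Python with `fcntl.flock`, which uses the same kind of lock as the `flock` command. This is only an illustration of the mechanism on a Unix-like system, not code from Pipe; `try_lock` is a hypothetical helper and the lock file lives in a throwaway temporary directory.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive lock without waiting.

    Returns the open file (which holds the lock) or None if the lock is busy.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "dataspoc-pipe-example.lock")

    first = try_lock(lock_path)   # lock is free, so this "run" starts
    second = try_lock(lock_path)  # lock is held, so this "run" is skipped
    first.close()                 # closing the file releases the lock
    third = try_lock(lock_path)   # lock is free again

    results = (first is not None, second is None, third is not None)
    if third is not None:
        third.close()

print(results)
```

The second attempt returns at once rather than queueing behind the first, which is the behaviour that keeps a slow pipeline from piling up runs.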
The generated crontab entry looks like:

    # dataspoc-pipe:orders
    0 */2 * * * flock -n /tmp/dataspoc-pipe-orders.lock /usr/local/bin/dataspoc-pipe run orders

The lock file is stored at `/tmp/dataspoc-pipe-<pipeline>.lock`.
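As a rough sketch of how such an entry is put together, and of how the marker comment makes entries easy to find again, the Python below renders and strips entries as plain text. It assumes the two-line layout shown above (marker comment, then job line); `render_entry` and `strip_entries` are hypothetical helpers, the backup job is invented for the demonstration, and Pipe itself manages the crontab through python-crontab rather than string handling like this.

```python
MARKER = "# dataspoc-pipe:"
LOCK_TEMPLATE = "/tmp/dataspoc-pipe-{name}.lock"


def render_entry(name, cron, executable="/usr/local/bin/dataspoc-pipe"):
    """Render the marker comment and the flock-wrapped job line for one pipeline."""
    lock = LOCK_TEMPLATE.format(name=name)
    return f"{MARKER}{name}\n{cron} flock -n {lock} {executable} run {name}"


def strip_entries(crontab_text):
    """Drop every marker comment together with the job line that follows it."""
    kept = []
    skip_next = False
    for line in crontab_text.splitlines():
        if line.startswith(MARKER):
            skip_next = True  # the next line is the job belonging to this marker
            continue
        if skip_next:
            skip_next = False
            continue
        kept.append(line)
    return "\n".join(kept)


entry = render_entry("orders", "0 */2 * * *")
print(entry)

# A crontab that also contains an unrelated, made-up job.
crontab_text = "\n".join([entry, "0 3 * * * /usr/bin/backup"])
print(strip_entries(crontab_text))
```

Because each pipeline gets its own lock path, a long-running `orders` run does not block `customers` from starting.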
Requirements
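Scheduling uses the python-crontab package, which is included in Pipe’s default dependencies. The `flock` command is part of util-linux and is present on most Linux distributions. macOS does not ship it by default, so it has to be installed separately there (for example through a package manager such as Homebrew).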
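A quick way to confirm both prerequisites on a given machine is sketched below. `check_requirements` is a hypothetical helper, not part of Pipe. It relies on python-crontab being importable as `crontab`; since another PyPI package uses the same module name, a positive result is a hint rather than proof.

```python
import importlib.util
import shutil


def check_requirements():
    """Report whether the scheduling prerequisites appear to be available."""
    return {
        # python-crontab installs a top-level module named "crontab".
        "python-crontab": importlib.util.find_spec("crontab") is not None,
        # flock must be on PATH for the generated crontab entries to work.
        "flock": shutil.which("flock") is not None,
    }


status = check_requirements()
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```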
- Combine scheduling with incremental extraction for efficient recurring pipelines. Each run only fetches new data.
- Use `dataspoc-pipe status` to monitor scheduled runs:

      dataspoc-pipe status
                                  Pipelines
      ┌───────────┬─────────────────────┬─────────┬──────────┬─────────┐
      │ Pipeline  │ Last Run            │ Status  │ Duration │ Records │
      ├───────────┼─────────────────────┼─────────┼──────────┼─────────┤
      │ orders    │ 2025-01-20T14:00:00 │ success │ 3.2s     │ 450     │
      │ customers │ 2025-01-20T01:30:00 │ success │ 12.5s    │ 8,200   │
      └───────────┴─────────────────────┴─────────┴──────────┴─────────┘

- Check system logs if a scheduled run does not appear in `status`:

      grep dataspoc-pipe /var/log/syslog

  The `/var/log/syslog` path applies to Debian and Ubuntu. Other distributions may record cron activity elsewhere, such as `/var/log/cron` or the systemd journal.