Scheduling

Pipe uses the system crontab to schedule automated pipeline runs. No external orchestrator needed.

Set the schedule.cron field in your pipeline YAML:

source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://my-datalake
  path: raw
incremental:
  enabled: true
schedule:
  cron: "0 */2 * * *"

The cron expression uses the standard 5-field format:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6, Sunday = 0)
│ │ │ │ │
* * * * *

After configuring cron expressions in your pipeline YAMLs, install them into the system crontab:

dataspoc-pipe schedule install
Installed: orders (0 */2 * * *)
Installed: customers (30 1 * * *)
2 schedule(s) installed.

This scans all pipelines for schedule.cron values and creates or updates crontab entries. Pipelines without a cron expression are skipped.
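To double-check what was written, the installed entries can be inspected with the standard crontab tool. The grep pattern matches the `dataspoc-pipe:` comment marker in the generated entries described below:

```shell
# Show only the crontab lines managed by dataspoc-pipe: each
# installed pipeline is a "# dataspoc-pipe:<name>" comment
# followed by its cron line.
crontab -l | grep -A1 'dataspoc-pipe:'
```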

To remove all dataspoc-pipe entries from your crontab:

dataspoc-pipe schedule remove
Removed: dataspoc-pipe:orders
Removed: dataspoc-pipe:customers
2 schedule(s) removed.
Some common expressions:

Expression      Description
0 * * * *       Every hour, on the hour
0 */2 * * *     Every 2 hours
0 */6 * * *     Every 6 hours
30 1 * * *      Daily at 01:30
0 0 * * *       Daily at midnight
0 8 * * 1-5     Weekdays at 08:00
0 0 * * 0       Weekly on Sunday at midnight
0 0 1 * *       First day of each month at midnight
*/15 * * * *    Every 15 minutes

Each scheduled run uses flock to prevent overlapping executions. If a pipeline is still running when the next scheduled trigger fires, the new run is skipped silently.

The generated crontab entry looks like:

# dataspoc-pipe:orders
0 */2 * * * flock -n /tmp/dataspoc-pipe-orders.lock /usr/local/bin/dataspoc-pipe run orders

The lock file is stored at /tmp/dataspoc-pipe-<pipeline>.lock.
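The skip behavior can be reproduced with flock alone; the lock path below is a throwaway example, not one of Pipe's lock files:

```shell
# Hold a lock in the background for a couple of seconds...
flock -n /tmp/flock-demo.lock sleep 2 &
sleep 0.2   # give the background job time to acquire the lock

# ...a second non-blocking (-n) attempt then exits immediately
# instead of waiting, which is exactly how an overlapping
# scheduled run gets skipped.
flock -n /tmp/flock-demo.lock echo "ran" || echo "skipped: lock held"
wait
```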

Scheduling uses the python-crontab package, which is included in Pipe’s default dependencies. The flock command ships with util-linux on Linux; macOS does not include it by default, so it may need to be installed separately (for example via Homebrew).

  • Combine scheduling with incremental extraction for efficient recurring pipelines. Each run only fetches new data.
  • Use dataspoc-pipe status to monitor scheduled runs:
dataspoc-pipe status
Pipelines
┌───────────┬─────────────────────┬─────────┬──────────┬─────────┐
│ Pipeline  │ Last Run            │ Status  │ Duration │ Records │
├───────────┼─────────────────────┼─────────┼──────────┼─────────┤
│ orders    │ 2025-01-20T14:00:00 │ success │     3.2s │     450 │
│ customers │ 2025-01-20T01:30:00 │ success │    12.5s │   8,200 │
└───────────┴─────────────────────┴─────────┴──────────┴─────────┘
  • Check system logs if a scheduled run does not appear in status:
grep dataspoc-pipe /var/log/syslog
On systemd-based distributions without /var/log/syslog, search the journal instead (for example with journalctl --grep, where supported).