Incremental Extraction Patterns: Never Re-Extract Old Data Again
Your pipeline extracts 10 million rows from Postgres every night, but only 5,000 rows changed since yesterday. That means 99.95% of what you read is unchanged, and you are paying for the compute, the bandwidth, and the time anyway.
Incremental extraction fixes this. DataSpoc Pipe uses Singer bookmarks to track where it left off and only fetch what is new. This post covers the three patterns for incremental extraction, how to configure each one, and how much money you save.
How Singer Bookmarks Work
Every Singer tap emits STATE messages that record its position in the source data. DataSpoc Pipe persists these as state.json files in the bucket:
```
bucket/
  .dataspoc/
    state/
      postgres-production/
        state.json
```

A typical state file looks like this:

```json
{
  "bookmarks": {
    "orders": {
      "replication_key": "updated_at",
      "replication_key_value": "2026-04-26T15:30:00Z"
    },
    "users": {
      "replication_key": "id",
      "replication_key_value": 45231
    },
    "audit_log": {
      "log_position": "mysql-bin.000042:12345"
    }
  }
}
```

On the next run, Pipe passes this state to the tap. The tap only queries rows after the bookmark. No full table scans. No wasted reads.
Pattern 1: Timestamp-Based (modified_at)
This is the most common pattern. Your table has an updated_at or modified_at column that is set whenever a row changes.
Best for: Tables with an updated_at timestamp that is reliably set on every INSERT and UPDATE.
```yaml
source:
  tap: tap-postgres
  config:
    host: "${POSTGRES_HOST}"
    port: 5432
    database: production
    user: "${POSTGRES_USER}"
    password: "${POSTGRES_PASSWORD}"

tables:
  - name: orders
    replication_method: incremental
    replication_key: updated_at

  - name: customers
    replication_method: incremental
    replication_key: modified_at

  - name: products
    replication_method: incremental
    replication_key: updated_at

target:
  bucket: "s3://my-company-data"
  prefix: "raw/postgres"
  format: parquet
```

How it works:
```sql
-- First run: full extraction
SELECT * FROM orders ORDER BY updated_at;
-- State saved: updated_at = '2026-04-26T15:30:00Z'

-- Second run: only new/changed rows
SELECT * FROM orders
WHERE updated_at > '2026-04-26T15:30:00Z'
ORDER BY updated_at;
-- State updated: updated_at = '2026-04-27T08:15:00Z'
```

Gotcha: If your application does not set updated_at on every update, you will miss changes. Always verify with:

```sql
-- Check: are there recent rows without updated_at?
SELECT COUNT(*)
FROM orders
WHERE updated_at IS NULL OR updated_at < created_at;
```
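If that check turns up problems, one option is to maintain the column at the database level rather than in application code. A minimal sketch of a Postgres trigger that stamps updated_at on every write; the function and trigger names are illustrative:

```sql
-- Illustrative: keep updated_at current on every INSERT and UPDATE
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_set_updated_at
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW
  EXECUTE FUNCTION set_updated_at();
```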
Pattern 2: ID-Based (Auto-Increment)

Use the primary key as the bookmark. This only works for append-only tables where rows are never updated.
Best for: Event logs, audit trails, analytics events — any table where rows are inserted but never modified.
```yaml
tables:
  - name: events
    replication_method: incremental
    replication_key: id

  - name: audit_log
    replication_method: incremental
    replication_key: id

  - name: email_sends
    replication_method: incremental
    replication_key: id
```

How it works:
```sql
-- First run
SELECT * FROM events ORDER BY id;
-- State saved: id = 1245600

-- Second run
SELECT * FROM events WHERE id > 1245600 ORDER BY id;
-- State updated: id = 1248900
```

Gotcha: This misses updates to existing rows. If you UPDATE events SET status = 'processed' WHERE id = 1000, the second run will not re-extract row 1000. Use timestamp-based for tables with updates.
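Before committing a table to this pattern, it is worth confirming it really is append-only. One way to check in Postgres, using the standard statistics views, is to look at the cumulative update and delete counters:

```sql
-- Non-zero n_tup_upd or n_tup_del means the table is not append-only
SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname IN ('events', 'audit_log', 'email_sends');
```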
Pattern 3: Log-Based (CDC)
Change Data Capture reads the database’s transaction log (WAL in Postgres, binlog in MySQL). It captures every INSERT, UPDATE, and DELETE — even rows without an updated_at column.
Best for: Tables where you need to capture every change including deletes. Tables without a reliable replication key.
```yaml
tables:
  - name: subscriptions
    replication_method: log_based

  - name: payments
    replication_method: log_based

  - name: inventory
    replication_method: log_based
```

How it works:
```
# First run: initial full snapshot + start reading WAL
Snapshot: subscriptions (full table)
WAL position: 0/1A3B4C0

# Second run: only WAL changes since last position
Read WAL from 0/1A3B4C0:
  INSERT INTO subscriptions (id=5001, plan='pro', ...)
  UPDATE subscriptions SET plan='business' WHERE id=4200
  DELETE FROM subscriptions WHERE id=3100
```

Setup for Postgres: Enable logical replication:
```sql
-- On your Postgres server
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
-- Restart Postgres

-- Create a replication slot for Pipe
SELECT pg_create_logical_replication_slot('dataspoc_pipe', 'pgoutput');
```

```yaml
# Pipeline config for CDC
source:
  tap: tap-postgres
  config:
    host: "${POSTGRES_HOST}"
    database: production
    user: "${POSTGRES_REPLICATION_USER}"
    password: "${POSTGRES_REPLICATION_PASSWORD}"
    replication_slot: dataspoc_pipe
    default_replication_method: log_based
```

Gotcha: CDC requires a replication user with REPLICATION privilege. It also adds some load to the source database for maintaining the replication slot.
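Creating that user and keeping an eye on the slot looks roughly like this; the role name and grants are illustrative, so adjust them to your schema:

```sql
-- Illustrative: a dedicated replication user for Pipe
CREATE ROLE dataspoc_repl WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO dataspoc_repl;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dataspoc_repl;

-- Watch how much WAL the slot retains; an abandoned slot can fill the disk
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```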
When to Use Each Pattern
| Pattern | Catches Updates? | Catches Deletes? | Source Requirements | Performance |
|---|---|---|---|---|
| Timestamp | Yes | No | Needs updated_at column | Fast |
| ID-based | No | No | Needs auto-increment PK | Fastest |
| Log-based (CDC) | Yes | Yes | WAL/binlog access | Medium |
Rule of thumb:
- Append-only tables (events, logs) → ID-based
- Mutable tables with updated_at (orders, users) → Timestamp-based
- Mutable tables without updated_at → Log-based (CDC)
- Small lookup tables (< 10K rows) → Full extraction (it is fast enough)
Mixing Patterns in One Pipeline
Most real pipelines use multiple patterns:
```yaml
source:
  tap: tap-postgres
  config:
    host: "${POSTGRES_HOST}"
    database: production

tables:
  # Mutable tables: use timestamp
  - name: orders
    replication_method: incremental
    replication_key: updated_at

  - name: customers
    replication_method: incremental
    replication_key: modified_at

  # Append-only tables: use ID
  - name: events
    replication_method: incremental
    replication_key: id

  - name: audit_log
    replication_method: incremental
    replication_key: id

  # Small lookup tables: full extraction
  - name: countries
    replication_method: full

  - name: product_categories
    replication_method: full

target:
  bucket: "s3://my-company-data"
  prefix: "raw/postgres"
  format: parquet
```

Monitoring Incremental State
Check the status of any pipeline:
```
dataspoc-pipe status postgres-production

Pipeline: postgres-production
Last run: 2026-04-27 06:00:12 (success, 45s)
Tables:
  orders              incremental  updated_at=2026-04-27T05:59:48Z    1,230 new rows
  customers           incremental  modified_at=2026-04-27T04:12:00Z      42 new rows
  events              incremental  id=2,456,789                      18,400 new rows
  audit_log           incremental  id=892,345                         3,200 new rows
  countries           full         -                                    195 rows
  product_categories  full         -                                     34 rows
```

View detailed logs:

```
dataspoc-pipe logs postgres-production --last 3

[2026-04-27 06:00:12]  SUCCESS  45s  22,901 rows  4.2 MB
[2026-04-26 06:00:08]  SUCCESS  38s  19,450 rows  3.8 MB
[2026-04-25 06:00:15]  SUCCESS  52s  31,200 rows  5.1 MB
```

Resetting State (Full Re-Extraction)
Sometimes you need to start over: schema changed, data was backfilled, or you suspect missed rows.
```bash
# Re-extract everything
dataspoc-pipe run postgres-production --full

# Re-extract one table
dataspoc-pipe run postgres-production --full --table orders
```

The --full flag ignores the state file and does a complete extraction. A new state is saved at the end.
Cost Savings Calculation
Consider a pipeline with these characteristics:
- Source: Postgres with 50M rows across 10 tables
- Daily changes: ~100K rows (0.2%)
- Running daily
| Metric | Full Extraction | Incremental |
|---|---|---|
| Rows read per run | 50,000,000 | 100,000 |
| Source DB load | High (full scan) | Minimal (index lookup) |
| Network transfer | ~5 GB | ~10 MB |
| Pipe runtime | ~30 minutes | ~15 seconds |
| S3 PUT costs | ~$0.25/run | ~$0.005/run |
| Monthly cost | ~$7.50 | ~$0.15 |
| Monthly DB load | 15 hours | 7.5 minutes |
That is a 50x reduction in cost and a 120x reduction in database load. For larger datasets, the savings are even more dramatic.
Checklist Before Going Incremental
- Verify your replication key is indexed in the source database (see the check after this list)
- Confirm updated_at is set on every INSERT and UPDATE (for timestamp pattern)
- Test with a small table first: run full, then incremental, compare row counts
- Set up monitoring: dataspoc-pipe status in your alerting system
- Schedule --full runs weekly or monthly as a safety net
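For the first item, a quick way to confirm the index exists in Postgres (the table and column here are from the examples above):

```sql
-- List indexes on orders; the replication key (updated_at) should appear in one of them
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders';

-- If it is missing, incremental runs still scan the whole table. One fix:
-- CREATE INDEX CONCURRENTLY idx_orders_updated_at ON orders (updated_at);
```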
Incremental extraction is the single biggest optimization you can make in a data pipeline. Stop re-reading data that has not changed.