
Incremental Extraction Patterns: Never Re-Extract Old Data Again

Michael San Martim · 2026-04-27

Your pipeline extracts 10 million rows from Postgres every night. But only 5,000 rows changed since yesterday. You are re-reading the 99.95% that did not change, and paying for the compute, the bandwidth, and the time.

Incremental extraction fixes this. DataSpoc Pipe uses Singer bookmarks to track where it left off and only fetch what is new. This post covers the three patterns for incremental extraction, how to configure each one, and how much money you save.

How Singer Bookmarks Work

Every Singer tap emits STATE messages that record its position in the source data. DataSpoc Pipe persists these as state.json files in the bucket:

bucket/
  .dataspoc/
    state/
      postgres-production/
        state.json

A typical state file looks like this:

{
  "bookmarks": {
    "orders": {
      "replication_key": "updated_at",
      "replication_key_value": "2026-04-26T15:30:00Z"
    },
    "users": {
      "replication_key": "id",
      "replication_key_value": 45231
    },
    "audit_log": {
      "log_position": "mysql-bin.000042:12345"
    }
  }
}

On the next run, Pipe passes this state to the tap. The tap only queries rows after the bookmark. No full table scans. No wasted reads.
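On the wire, each bookmark update arrives as a Singer STATE message, one JSON object per line on the tap's stdout, whose value Pipe persists to state.json. A minimal example for the orders stream:

{"type": "STATE", "value": {"bookmarks": {"orders": {"replication_key": "updated_at", "replication_key_value": "2026-04-26T15:30:00Z"}}}}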

Pattern 1: Timestamp-Based (modified_at)

This is the most common pattern: your table has an updated_at or modified_at column that is set whenever a row changes.

Best for: Tables with an updated_at timestamp that is reliably set on every INSERT and UPDATE.

# pipelines/postgres-production.yaml
source:
  tap: tap-postgres
  config:
    host: "${POSTGRES_HOST}"
    port: 5432
    database: production
    user: "${POSTGRES_USER}"
    password: "${POSTGRES_PASSWORD}"
  tables:
    - name: orders
      replication_method: incremental
      replication_key: updated_at
    - name: customers
      replication_method: incremental
      replication_key: modified_at
    - name: products
      replication_method: incremental
      replication_key: updated_at

target:
  bucket: "s3://my-company-data"
  prefix: "raw/postgres"
  format: parquet

How it works:

-- First run: full extraction
SELECT * FROM orders ORDER BY updated_at;
-- State saved: updated_at = '2026-04-26T15:30:00Z'
-- Second run: only new/changed rows
SELECT * FROM orders
WHERE updated_at > '2026-04-26T15:30:00Z'
ORDER BY updated_at;
-- State updated: updated_at = '2026-04-27T08:15:00Z'

Gotcha: If your application does not set updated_at on every update, you will miss changes. Always verify with:

-- Check: are there recent rows without updated_at?
SELECT COUNT(*)
FROM orders
WHERE updated_at IS NULL
OR updated_at < created_at;
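If that check turns up gaps, one common fix, assuming Postgres as in this pipeline, is a trigger that stamps updated_at on every write so application code cannot forget it:

-- Stamp updated_at automatically on every INSERT and UPDATE
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_set_updated_at
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();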

Pattern 2: ID-Based (Auto-Increment)

This pattern uses the primary key as the bookmark. It only works for append-only tables where rows are never updated.

Best for: Event logs, audit trails, analytics events — any table where rows are inserted but never modified.

tables:
  - name: events
    replication_method: incremental
    replication_key: id
  - name: audit_log
    replication_method: incremental
    replication_key: id
  - name: email_sends
    replication_method: incremental
    replication_key: id

How it works:

-- First run
SELECT * FROM events ORDER BY id;
-- State saved: id = 1245600
-- Second run
SELECT * FROM events WHERE id > 1245600 ORDER BY id;
-- State updated: id = 1248900

Gotcha: This misses updates to existing rows. If you UPDATE events SET status = 'processed' WHERE id = 1000, the second run will not re-extract row 1000. Use timestamp-based for tables with updates.
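Before trusting ID-based replication, confirm the table really is append-only. A quick sanity check, assuming the table carries created_at and updated_at columns:

-- Rows modified after insertion break the append-only assumption
SELECT COUNT(*) AS mutated_rows
FROM events
WHERE updated_at IS DISTINCT FROM created_at;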

Pattern 3: Log-Based (CDC)

Change Data Capture reads the database’s transaction log (WAL in Postgres, binlog in MySQL). It captures every INSERT, UPDATE, and DELETE — even rows without an updated_at column.

Best for: Tables where you need to capture every change, including deletes, or tables without a reliable replication key.

tables:
  - name: subscriptions
    replication_method: log_based
  - name: payments
    replication_method: log_based
  - name: inventory
    replication_method: log_based

How it works:

# First run: initial full snapshot + start reading WAL
Snapshot: subscriptions (full table)
WAL position: 0/1A3B4C0

# Second run: only WAL changes since last position
Read WAL from 0/1A3B4C0:
  INSERT INTO subscriptions (id=5001, plan='pro', ...)
  UPDATE subscriptions SET plan='business' WHERE id=4200
  DELETE FROM subscriptions WHERE id=3100

Setup for Postgres: Enable logical replication:

-- On your Postgres server
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
-- Restart Postgres, then create a replication slot for Pipe
SELECT pg_create_logical_replication_slot('dataspoc_pipe', 'pgoutput');

# Pipeline config for CDC
source:
  tap: tap-postgres
  config:
    host: "${POSTGRES_HOST}"
    database: production
    user: "${POSTGRES_REPLICATION_USER}"
    password: "${POSTGRES_REPLICATION_PASSWORD}"
    replication_slot: dataspoc_pipe
    default_replication_method: log_based

Gotcha: CDC requires a replication user with the REPLICATION privilege. It also puts load on the source database: the server must retain WAL for the replication slot until Pipe consumes it.
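A minimal sketch covering both concerns; the role name dataspoc_repl and the grants are illustrative, so adapt them to your schema:

-- Grant only what CDC needs: login, replication, and read access for snapshots
CREATE ROLE dataspoc_repl WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT USAGE ON SCHEMA public TO dataspoc_repl;
GRANT SELECT ON subscriptions, payments, inventory TO dataspoc_repl;

-- Watch how much WAL the slot is retaining; a stalled consumer can fill the disk
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;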

When to Use Each Pattern

Pattern          Catches Updates?  Catches Deletes?  Source Requirements       Performance
Timestamp        Yes               No                Needs updated_at column   Fast
ID-based         No                No                Needs auto-increment PK   Fastest
Log-based (CDC)  Yes               Yes               WAL/binlog access         Medium

Rule of thumb:

  • Append-only tables (events, logs) → ID-based
  • Mutable tables with updated_at (orders, users) → Timestamp-based
  • Mutable tables without updated_at → Log-based (CDC)
  • Small lookup tables (< 10K rows) → Full extraction (it is fast enough)

Mixing Patterns in One Pipeline

Most real pipelines use multiple patterns:

source:
  tap: tap-postgres
  config:
    host: "${POSTGRES_HOST}"
    database: production
  tables:
    # Mutable tables: use timestamp
    - name: orders
      replication_method: incremental
      replication_key: updated_at
    - name: customers
      replication_method: incremental
      replication_key: modified_at
    # Append-only tables: use ID
    - name: events
      replication_method: incremental
      replication_key: id
    - name: audit_log
      replication_method: incremental
      replication_key: id
    # Small lookup tables: full extraction
    - name: countries
      replication_method: full
    - name: product_categories
      replication_method: full

target:
  bucket: "s3://my-company-data"
  prefix: "raw/postgres"
  format: parquet

Monitoring Incremental State

Check the status of any pipeline:

dataspoc-pipe status postgres-production

Pipeline: postgres-production
Last run: 2026-04-27 06:00:12 (success, 45s)

Tables:
  orders              incremental  updated_at=2026-04-27T05:59:48Z    1,230 new rows
  customers           incremental  modified_at=2026-04-27T04:12:00Z      42 new rows
  events              incremental  id=2,456,789                      18,400 new rows
  audit_log           incremental  id=892,345                         3,200 new rows
  countries           full         -                                    195 rows
  product_categories  full         -                                     34 rows

View detailed logs:

dataspoc-pipe logs postgres-production --last 3
[2026-04-27 06:00:12] SUCCESS 45s 22,901 rows 4.2 MB
[2026-04-26 06:00:08] SUCCESS 38s 19,450 rows 3.8 MB
[2026-04-25 06:00:15] SUCCESS 52s 31,200 rows 5.1 MB
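A simple way to feed this into alerting is a cron job wrapped around the status command. This is a sketch only: it assumes the CLI exits non-zero when the last run failed, which is an assumption about DataSpoc Pipe's behavior, not something documented above.

# Assumption: dataspoc-pipe status exits non-zero if the last run failed
0 7 * * * dataspoc-pipe status postgres-production || mail -s "pipe: postgres-production failed" oncall@example.com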

Resetting State (Full Re-Extraction)

Sometimes you need to start over: the schema changed, data was backfilled, or you suspect missed rows.

# Re-extract everything
dataspoc-pipe run postgres-production --full
# Re-extract one table
dataspoc-pipe run postgres-production --full --table orders

The --full flag ignores the state file and does a complete extraction. A new state is saved at the end.

Cost Savings Calculation

Consider a pipeline with these characteristics:

  • Source: Postgres with 50M rows across 10 tables
  • Daily changes: ~100K rows (0.2%)
  • Running daily
Metric             Full Extraction    Incremental
Rows read per run  50,000,000         100,000
Source DB load     High (full scan)   Minimal (index lookup)
Network transfer   ~5 GB              ~10 MB
Pipe runtime       ~30 minutes        ~15 seconds
S3 PUT costs       ~$0.25/run         ~$0.005/run
Monthly cost       ~$7.50             ~$0.15
Monthly DB load    15 hours           7.5 minutes

That is a 50x reduction in cost and a 120x reduction in database load. For larger datasets, the savings are even more dramatic.

Checklist Before Going Incremental

  1. Verify your replication key is indexed in the source database (a quick check follows this list)
  2. Confirm updated_at is set on every INSERT and UPDATE (for the timestamp pattern)
  3. Test with a small table first: run full, then incremental, and compare row counts
  4. Wire dataspoc-pipe status into your alerting system
  5. Schedule --full runs weekly or monthly as a safety net
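For the first item, here is a quick way to look for an index on the replication key, assuming Postgres; the table and column names are placeholders:

-- Any index whose definition covers the replication key will do
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders'
  AND indexdef LIKE '%updated_at%';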

Incremental extraction is the single biggest optimization you can make in a data pipeline. Stop re-reading data that has not changed.
