mongodb · migration · parquet · data-lake · tutorial

MongoDB to Parquet: Build a Data Lake from Your MongoDB in 15 Minutes

Michael San Martim · 2026-04-26

Your app runs on MongoDB. Your analysts want SQL. Your data team wants a data lake. You do not want to build an ETL pipeline from scratch.

This post takes you from a MongoDB database to a queryable Parquet-based data lake in 15 minutes using DataSpoc Pipe. You will extract collections, flatten nested documents, set up incremental sync, and query the result with SQL.

The Problem

MongoDB is great for applications but painful for analytics:

  • No SQL (yes, there is MQL, but your analysts do not know it)
  • No real joins ($lookup exists, but aggregation pipelines are verbose and slow for analytics)
  • Production load: analytical queries compete with your app for resources
  • No enforced schema: every document can be different

The solution: extract to Parquet once, query forever with DuckDB.
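If you have never pointed DuckDB at Parquet, here is the punchline up front: once the files exist, querying them takes a few lines of Python. This sketch assumes the bucket layout the rest of this post sets up, plus S3 credentials configured for DuckDB.

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")   # one-time: adds support for s3:// paths
con.sql("LOAD httpfs")
# Depending on your DuckDB version, configure S3 credentials here,
# e.g. via CREATE SECRET or the s3_* settings.

# Query every Parquet file written for the users collection
con.sql("""
    SELECT plan, COUNT(*) AS users
    FROM read_parquet('s3://my-company-data/raw/mongodb/users/*.parquet')
    GROUP BY plan
""").show()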

Step 1: Install the MongoDB Tap

Terminal window
pip install dataspoc-pipe
dataspoc-pipe add tap-mongodb

This installs the Singer-compatible MongoDB extractor. DataSpoc Pipe manages it as a plugin.

Step 2: Configure the Connection

Create a pipeline YAML that defines the source and target:

pipelines/mongodb-production.yaml
source:
  tap: tap-mongodb
  config:
    connection_string: "${MONGODB_URI}"
    database: "production"
    collections:
      - users
      - orders
      - products
      - events

target:
  bucket: "s3://my-company-data"
  prefix: "raw/mongodb"
  format: parquet

settings:
  batch_size: 10000
  flatten_nested: true
  flatten_max_depth: 3

Set your MongoDB connection string as an environment variable:

Terminal window
export MONGODB_URI="mongodb+srv://readonly:password@cluster0.abc123.mongodb.net"

Use a read-only user. Never give your ETL pipeline write access to production.
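If you do not have a read-only user yet, you can create one from an admin session. A quick sketch using pymongo's generic command wrapper (the user name, password, and URI here are placeholders; mongosh or the Atlas UI work just as well):

from pymongo import MongoClient

# Connect as an admin (placeholder URI)
admin = MongoClient("mongodb+srv://admin:password@cluster0.abc123.mongodb.net")
db = admin["production"]

db.command(
    "createUser",
    "readonly",
    pwd="use-a-real-secret",
    roles=[{"role": "read", "db": "production"}],  # read-only on this database
)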

Step 3: Preview What Will Be Extracted

Before running a full extraction, preview the schema that Pipe discovers:

Terminal window
dataspoc-pipe discover mongodb-production

Output:

Discovered 4 collections:

users (12,450 documents)
  _id         ObjectId
  email       string
  name        string
  plan        string
  address     object → address_street, address_city, address_state, address_zip
  created_at  datetime
  metadata    object → metadata_source, metadata_campaign

orders (89,230 documents)
  _id         ObjectId
  user_id     ObjectId
  items       array → flattened to separate rows
  total       number
  status      string
  created_at  datetime

products (340 documents)
  _id       ObjectId
  name      string
  sku       string
  price     number
  category  string
  tags      array → tags_0, tags_1, tags_2

events (1,245,600 documents)
  _id         ObjectId
  user_id     ObjectId
  event_type  string
  properties  object → properties_page, properties_referrer, properties_duration
  timestamp   datetime

Notice how flatten_nested: true converts nested objects into flat columns. The address object becomes address_street, address_city, etc. Parquet can technically store nested structures, but flat columns are far easier to query with plain SQL, so flattening up front saves your analysts pain later.

Step 4: Run the Extraction

Terminal window
dataspoc-pipe run mongodb-production
[14:23:01] Starting pipeline: mongodb-production
[14:23:02] Extracting: users (12,450 docs)
[14:23:05] → raw/mongodb/users/users_20260426_142305.parquet (1.2 MB)
[14:23:05] Extracting: orders (89,230 docs)
[14:23:18] → raw/mongodb/orders/orders_20260426_142305.parquet (8.4 MB)
[14:23:18] Extracting: products (340 docs)
[14:23:19] → raw/mongodb/products/products_20260426_142319.parquet (42 KB)
[14:23:19] Extracting: events (1,245,600 docs)
[14:24:02] → raw/mongodb/events/events_20260426_142319.parquet (95 MB)
[14:24:02] Pipeline complete: 4 tables, 1,347,620 rows, 105 MB Parquet
[14:24:02] State saved to .dataspoc/state/mongodb-production/state.json

Four collections extracted. 1.3 million documents converted to Parquet. The state file records the last _id seen in each collection for incremental extraction.
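The exact layout of state.json is an internal detail, but conceptually it is just one bookmark per collection. Something along these lines (field names are illustrative, not Pipe's actual schema):

{
  "pipeline": "mongodb-production",
  "bookmarks": {
    "users":  {"last_id": "6627a1f2e4b0c8a1d2e3f401"},
    "orders": {"last_id": "6627a203e4b0c8a1d2e3f4aa"}
  }
}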

Step 5: Query with DataSpoc Lens

Terminal window
dataspoc-lens add-bucket s3://my-company-data --name production
dataspoc-lens tables
raw.mongodb.users
raw.mongodb.orders
raw.mongodb.products
raw.mongodb.events

Now run SQL:

Terminal window
dataspoc-lens query "
SELECT
u.plan,
COUNT(DISTINCT u._id) as customers,
SUM(o.total) as revenue,
ROUND(SUM(o.total) / COUNT(DISTINCT u._id), 2) as arpu
FROM raw.mongodb.users u
JOIN raw.mongodb.orders o ON u._id = o.user_id
WHERE o.status = 'completed'
GROUP BY u.plan
ORDER BY revenue DESC
"
┌──────────┬───────────┬────────────┬────────┐
│ plan │ customers │ revenue │ arpu │
├──────────┼───────────┼────────────┼────────┤
│ business │ 2,340 │ 1,245,600 │ 532.31 │
│ pro │ 4,120 │ 824,000 │ 200.00 │
│ starter │ 5,990 │ 299,500 │ 50.00 │
└──────────┴───────────┴────────────┴────────┘

Your analysts now have SQL access to MongoDB data without touching the production database.

Handling Nested Documents

MongoDB’s flexible schema is the biggest challenge. Here is how Pipe handles each case:

Nested objects are flattened with underscore separators:

// MongoDB document
{
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA"
  }
}

// Parquet columns
address_street  VARCHAR  "123 Main St"
address_city    VARCHAR  "San Francisco"
address_state   VARCHAR  "CA"

Arrays of primitives become numbered columns:

{ "tags": ["electronics", "sale", "featured"] }
tags_0 VARCHAR "electronics"
tags_1 VARCHAR "sale"
tags_2 VARCHAR "featured"
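Both transformations above boil down to a small recursive walk over each document. A minimal Python sketch of the idea (illustrative only, not Pipe's actual code):

def flatten(doc, prefix=""):
    """Flatten nested objects and primitive arrays into flat columns."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse with an underscore-joined prefix
            flat.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list) and not any(isinstance(v, dict) for v in value):
            # Array of primitives: numbered columns tags_0, tags_1, ...
            for i, item in enumerate(value):
                flat[f"{name}_{i}"] = item
        else:
            # Scalars (arrays of objects are handled by explode, below)
            flat[name] = value
    return flat

print(flatten({"tags": ["electronics", "sale"],
               "address": {"city": "San Francisco", "state": "CA"}}))
# {'tags_0': 'electronics', 'tags_1': 'sale',
#  'address_city': 'San Francisco', 'address_state': 'CA'}

Note that numbered columns widen the schema to the longest array seen, so this approach works best for short, bounded arrays like tags.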

Arrays of objects (like order items) can be flattened to separate rows:

# In pipeline config
settings:
  flatten_nested: true
  flatten_arrays: explode  # Creates one row per array element

{
  "order_id": "abc",
  "items": [
    {"sku": "A1", "qty": 2, "price": 10.00},
    {"sku": "B2", "qty": 1, "price": 25.00}
  ]
}

Becomes two Parquet rows:

order_id  items_sku  items_qty  items_price
abc       A1         2          10.00
abc       B2         1          25.00
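Exploding is just as simple in concept: one output row per array element, with the parent fields repeated on every row. Again a hedged sketch, not Pipe's implementation:

def explode(doc, array_field):
    """Yield one flat row per element of an array-of-objects field."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    for element in doc[array_field]:
        row = dict(parent)                 # repeat parent fields
        for k, v in element.items():
            row[f"{array_field}_{k}"] = v  # items.sku -> items_sku
        yield row

order = {"order_id": "abc",
         "items": [{"sku": "A1", "qty": 2, "price": 10.00},
                   {"sku": "B2", "qty": 1, "price": 25.00}]}
for row in explode(order, "items"):
    print(row)
# {'order_id': 'abc', 'items_sku': 'A1', 'items_qty': 2, 'items_price': 10.0}
# {'order_id': 'abc', 'items_sku': 'B2', 'items_qty': 1, 'items_price': 25.0}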

Step 6: Incremental Sync

After the first full extraction, subsequent runs only fetch new documents:

Terminal window
dataspoc-pipe run mongodb-production
[15:00:01] Starting pipeline: mongodb-production (incremental)
[15:00:01] Resuming from state: users._id > ObjectId('6627a...')
[15:00:02] Extracting: users (45 new docs since last run)
[15:00:02] → raw/mongodb/users/users_20260426_150002.parquet (4 KB)
[15:00:02] Extracting: orders (312 new docs)
[15:00:03] → raw/mongodb/orders/orders_20260426_150002.parquet (28 KB)
...
[15:00:05] Pipeline complete: 4 tables, 1,892 new rows

Pipe bookmarks progress with MongoDB's _id field: default ObjectIds embed a creation timestamp, so they sort roughly by insertion order. Each run extracts only documents whose _id is greater than the saved bookmark.
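The underlying MongoDB query is worth seeing once: incremental extraction is just a filter on the saved _id plus an ascending sort, both served cheaply by the built-in _id index. A pymongo sketch (the URI and bookmark value are hypothetical):

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
users = client["production"]["users"]

last_seen = ObjectId("6627a1f2e4b0c8a1d2e3f401")    # from the state file

cursor = (
    users.find({"_id": {"$gt": last_seen}})
         .sort("_id", 1)        # ascending, served by the _id index
         .batch_size(10_000)    # mirrors batch_size in the pipeline config
)
for doc in cursor:
    # hand off to the flattener / Parquet writer, then advance the bookmark
    last_seen = doc["_id"]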

To force a full re-extraction:

Terminal window
dataspoc-pipe run mongodb-production --full

Step 7: Schedule with Cron

Add a cron job to extract every hour:

Terminal window
crontab -e
0 * * * * cd /opt/dataspoc && dataspoc-pipe run mongodb-production >> /var/log/dataspoc/mongodb.log 2>&1

Check the last run status:

Terminal window
dataspoc-pipe status mongodb-production
Pipeline: mongodb-production
Last run: 2026-04-26 15:00:05 (success)
Next scheduled: 2026-04-26 16:00:00
Total rows extracted: 1,349,512
State: incremental (bookmark: users._id=6627a...)

View execution logs:

Terminal window
dataspoc-pipe logs mongodb-production --last 5

Cost Comparison

Approach                         Monthly Cost        Setup Time
MongoDB Atlas Analytics Node     $500+               Minutes
Custom ETL (Airflow + Python)    $200+ (infra)       Weeks
Fivetran MongoDB connector       $1/MAR              Minutes
DataSpoc Pipe + S3               ~$3 (S3 storage)    15 minutes

For a startup with 1M documents, DataSpoc Pipe plus S3 storage costs under $5/month: at S3 Standard's roughly $0.023 per GB-month, even 100 GB of Parquet runs about $2.30. No managed ETL platform fees, no analytics nodes, no Airflow cluster.

Complete Pipeline Config Reference

pipelines/mongodb-production.yaml
source:
  tap: tap-mongodb
  config:
    connection_string: "${MONGODB_URI}"
    database: "production"
    collections:
      - users
      - orders
      - products
      - events
    # Optional: filter collections by pattern
    # collection_pattern: "^(users|orders).*"

target:
  bucket: "s3://my-company-data"
  prefix: "raw/mongodb"
  format: parquet
  # Parquet settings
  compression: snappy
  row_group_size: 100000

settings:
  batch_size: 10000
  flatten_nested: true
  flatten_max_depth: 3
  flatten_arrays: explode  # or "stringify" to keep arrays as JSON strings
  # Incremental settings
  replication_method: incremental
  bookmark_field: _id

From MongoDB to SQL-queryable data lake in 15 minutes. Your analysts get SQL, your production database stays untouched, and your monthly bill stays under $5.
