mongodb · migration · parquet · data-lake · tutorial

MongoDB to Parquet: Build a Data Lake from Your MongoDB in 15 Minutes

Michael San Martim · 2026-04-26

Your app runs on MongoDB. Your analysts want SQL. Your data team wants a data lake. You do not want to build an ETL pipeline from scratch.

This post takes you from a MongoDB database to a queryable Parquet-based data lake in 15 minutes using DataSpoc Pipe. You will extract collections, flatten nested documents, set up incremental sync, and query the result with SQL.

The Problem

MongoDB is great for applications but painful for analytics:

  • No SQL (yes, there is MQL, but your analysts do not know it)
  • No real joins ($lookup exists, but aggregation pipelines are verbose and slow for analytics)
  • Production load: analytical queries compete with your app for resources
  • No enforced schema: every document can be different

The solution: extract to Parquet once, query forever with DuckDB.
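If you have never pointed DuckDB at Parquet, here is the punchline up front: once the files exist, querying them takes a few lines of Python. This sketch assumes the bucket layout the rest of this post sets up, plus S3 credentials configured for DuckDB.

import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")   # one-time: adds support for s3:// paths
con.sql("LOAD httpfs")
# Depending on your DuckDB version, configure S3 credentials here,
# e.g. via CREATE SECRET or the s3_* settings.

# Query every Parquet file written for the users collection
con.sql("""
    SELECT plan, COUNT(*) AS users
    FROM read_parquet('s3://my-company-data/raw/mongodb/users/*.parquet')
    GROUP BY plan
""").show()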

Step 1: Install the MongoDB Tap

Terminal window
pip install dataspoc-pipe
dataspoc-pipe add tap-mongodb

This installs the Singer-compatible MongoDB extractor. DataSpoc Pipe manages it as a plugin.

Step 2: Configure the Connection

Create a pipeline YAML that defines the source and target:

pipelines/mongodb-production.yaml
source:
  tap: tap-mongodb
  config:
    connection_string: "${MONGODB_URI}"
    database: "production"
    collections:
      - users
      - orders
      - products
      - events

target:
  bucket: "s3://my-company-data"
  prefix: "raw/mongodb"
  format: parquet

settings:
  batch_size: 10000
  flatten_nested: true
  flatten_max_depth: 3

Set your MongoDB connection string as an environment variable:

Terminal window
export MONGODB_URI="mongodb+srv://readonly:password@cluster0.abc123.mongodb.net"

Use a read-only user. Never give your ETL pipeline write access to production.
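If you do not have a read-only user yet, you can create one from an admin session. A quick sketch using pymongo's generic command wrapper (the user name, password, and URI here are placeholders; mongosh or the Atlas UI work just as well):

from pymongo import MongoClient

# Connect as an admin (placeholder URI)
admin = MongoClient("mongodb+srv://admin:password@cluster0.abc123.mongodb.net")
db = admin["production"]

db.command(
    "createUser",
    "readonly",
    pwd="use-a-real-secret",
    roles=[{"role": "read", "db": "production"}],  # read-only on this database
)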

Step 3: Preview What Will Be Extracted

Before running a full extraction, preview the schema that Pipe discovers:

Terminal window
dataspoc-pipe discover mongodb-production

Output:

Discovered 4 collections:

users (12,450 documents)
  _id         ObjectId
  email       string
  name        string
  plan        string
  address     object → address_street, address_city, address_state, address_zip
  created_at  datetime
  metadata    object → metadata_source, metadata_campaign

orders (89,230 documents)
  _id         ObjectId
  user_id     ObjectId
  items       array → flattened to separate rows
  total       number
  status      string
  created_at  datetime

products (340 documents)
  _id       ObjectId
  name      string
  sku       string
  price     number
  category  string
  tags      array → tags_0, tags_1, tags_2

events (1,245,600 documents)
  _id         ObjectId
  user_id     ObjectId
  event_type  string
  properties  object → properties_page, properties_referrer, properties_duration
  timestamp   datetime

Notice how flatten_nested: true converts nested objects into flat columns. The address object becomes address_street, address_city, etc. Parquet can technically store nested structures, but flat columns are far easier to query with plain SQL, so flattening up front saves your analysts pain later.

Step 4: Run the Extraction

Terminal window
dataspoc-pipe run mongodb-production
[14:23:01] Starting pipeline: mongodb-production
[14:23:02] Extracting: users (12,450 docs)
[14:23:05] → raw/mongodb/users/users_20260426_142305.parquet (1.2 MB)
[14:23:05] Extracting: orders (89,230 docs)
[14:23:18] → raw/mongodb/orders/orders_20260426_142305.parquet (8.4 MB)
[14:23:18] Extracting: products (340 docs)
[14:23:19] → raw/mongodb/products/products_20260426_142319.parquet (42 KB)
[14:23:19] Extracting: events (1,245,600 docs)
[14:24:02] → raw/mongodb/events/events_20260426_142319.parquet (95 MB)
[14:24:02] Pipeline complete: 4 tables, 1,347,620 rows, 105 MB Parquet
[14:24:02] State saved to .dataspoc/state/mongodb-production/state.json

Four collections extracted. 1.3 million documents converted to Parquet. The state file records the last _id seen in each collection for incremental extraction.
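The exact layout of state.json is an internal detail, but conceptually it is just one bookmark per collection. Something along these lines (field names are illustrative, not Pipe's actual schema):

{
  "pipeline": "mongodb-production",
  "bookmarks": {
    "users":  {"last_id": "6627a1f2e4b0c8a1d2e3f401"},
    "orders": {"last_id": "6627a203e4b0c8a1d2e3f4aa"}
  }
}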

Step 5: Query with DataSpoc Lens

Terminal window
dataspoc-lens add-bucket s3://my-company-data --name production
dataspoc-lens tables
raw.mongodb.users
raw.mongodb.orders
raw.mongodb.products
raw.mongodb.events

Now run SQL:

Terminal window
dataspoc-lens query "
SELECT
u.plan,
COUNT(DISTINCT u._id) as customers,
SUM(o.total) as revenue,
ROUND(SUM(o.total) / COUNT(DISTINCT u._id), 2) as arpu
FROM raw.mongodb.users u
JOIN raw.mongodb.orders o ON u._id = o.user_id
WHERE o.status = 'completed'
GROUP BY u.plan
ORDER BY revenue DESC
"
┌──────────┬───────────┬────────────┬────────┐
│ plan │ customers │ revenue │ arpu │
├──────────┼───────────┼────────────┼────────┤
│ business │ 2,340 │ 1,245,600 │ 532.31 │
│ pro │ 4,120 │ 824,000 │ 200.00 │
│ starter │ 5,990 │ 299,500 │ 50.00 │
└──────────┴───────────┴────────────┴────────┘

Your analysts now have SQL access to MongoDB data without touching the production database.

Handling Nested Documents

MongoDB’s flexible schema is the biggest challenge. Here is how Pipe handles each case:

Nested objects are flattened with underscore separators:

// MongoDB document
{
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA"
  }
}

// Parquet columns
address_street  VARCHAR  "123 Main St"
address_city    VARCHAR  "San Francisco"
address_state   VARCHAR  "CA"

Arrays of primitives become numbered columns:

{ "tags": ["electronics", "sale", "featured"] }
tags_0 VARCHAR "electronics"
tags_1 VARCHAR "sale"
tags_2 VARCHAR "featured"
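Both transformations above boil down to a small recursive walk over each document. A minimal Python sketch of the idea (illustrative only, not Pipe's actual code):

def flatten(doc, prefix=""):
    """Flatten nested objects and primitive arrays into flat columns."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse with an underscore-joined prefix
            flat.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list) and not any(isinstance(v, dict) for v in value):
            # Array of primitives: numbered columns tags_0, tags_1, ...
            for i, item in enumerate(value):
                flat[f"{name}_{i}"] = item
        else:
            # Scalars (arrays of objects are handled by explode, below)
            flat[name] = value
    return flat

print(flatten({"tags": ["electronics", "sale"],
               "address": {"city": "San Francisco", "state": "CA"}}))
# {'tags_0': 'electronics', 'tags_1': 'sale',
#  'address_city': 'San Francisco', 'address_state': 'CA'}

Note that numbered columns widen the schema to the longest array seen, so this approach works best for short, bounded arrays like tags.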

Arrays of objects (like order items) can be flattened to separate rows:

# In pipeline config
settings:
  flatten_nested: true
  flatten_arrays: explode  # Creates one row per array element

{
  "order_id": "abc",
  "items": [
    {"sku": "A1", "qty": 2, "price": 10.00},
    {"sku": "B2", "qty": 1, "price": 25.00}
  ]
}

Becomes two Parquet rows:

order_id  items_sku  items_qty  items_price
abc       A1         2          10.00
abc       B2         1          25.00
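Exploding is just as simple in concept: one output row per array element, with the parent fields repeated on every row. Again a hedged sketch, not Pipe's implementation:

def explode(doc, array_field):
    """Yield one flat row per element of an array-of-objects field."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    for element in doc[array_field]:
        row = dict(parent)                 # repeat parent fields
        for k, v in element.items():
            row[f"{array_field}_{k}"] = v  # items.sku -> items_sku
        yield row

order = {"order_id": "abc",
         "items": [{"sku": "A1", "qty": 2, "price": 10.00},
                   {"sku": "B2", "qty": 1, "price": 25.00}]}
for row in explode(order, "items"):
    print(row)
# {'order_id': 'abc', 'items_sku': 'A1', 'items_qty': 2, 'items_price': 10.0}
# {'order_id': 'abc', 'items_sku': 'B2', 'items_qty': 1, 'items_price': 25.0}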

Step 6: Incremental Sync

After the first full extraction, subsequent runs only fetch new documents:

Terminal window
dataspoc-pipe run mongodb-production
[15:00:01] Starting pipeline: mongodb-production (incremental)
[15:00:01] Resuming from state: users._id > ObjectId('6627a...')
[15:00:02] Extracting: users (45 new docs since last run)
[15:00:02] → raw/mongodb/users/users_20260426_150002.parquet (4 KB)
[15:00:02] Extracting: orders (312 new docs)
[15:00:03] → raw/mongodb/orders/orders_20260426_150002.parquet (28 KB)
...
[15:00:05] Pipeline complete: 4 tables, 1,892 new rows

Pipe bookmarks progress with MongoDB's _id field: default ObjectIds embed a creation timestamp, so they sort roughly by insertion order. Each run extracts only documents whose _id is greater than the saved bookmark.
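The underlying MongoDB query is worth seeing once: incremental extraction is just a filter on the saved _id plus an ascending sort, both served cheaply by the built-in _id index. A pymongo sketch (the URI and bookmark value are hypothetical):

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
users = client["production"]["users"]

last_seen = ObjectId("6627a1f2e4b0c8a1d2e3f401")    # from the state file

cursor = (
    users.find({"_id": {"$gt": last_seen}})
         .sort("_id", 1)        # ascending, served by the _id index
         .batch_size(10_000)    # mirrors batch_size in the pipeline config
)
for doc in cursor:
    # hand off to the flattener / Parquet writer, then advance the bookmark
    last_seen = doc["_id"]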

To force a full re-extraction:

Terminal window
dataspoc-pipe run mongodb-production --full

Step 7: Schedule with Cron

Add a cron job to extract every hour:

Terminal window
crontab -e
0 * * * * cd /opt/dataspoc && dataspoc-pipe run mongodb-production >> /var/log/dataspoc/mongodb.log 2>&1

Check the last run status:

Terminal window
dataspoc-pipe status mongodb-production
Pipeline: mongodb-production
Last run: 2026-04-26 15:00:05 (success)
Next scheduled: 2026-04-26 16:00:00
Total rows extracted: 1,349,512
State: incremental (bookmark: users._id=6627a...)

View execution logs:

Terminal window
dataspoc-pipe logs mongodb-production --last 5

Cost Comparison

Approach                         Monthly Cost        Setup Time
MongoDB Atlas Analytics Node     $500+               Minutes
Custom ETL (Airflow + Python)    $200+ (infra)       Weeks
Fivetran MongoDB connector       $1/MAR              Minutes
DataSpoc Pipe + S3               ~$3 (S3 storage)    15 minutes

For a startup with 1M documents, DataSpoc Pipe plus S3 storage costs under $5/month: at S3 Standard's roughly $0.023 per GB-month, even 100 GB of Parquet runs about $2.30. No managed ETL platform fees, no analytics nodes, no Airflow cluster.

Complete Pipeline Config Reference

pipelines/mongodb-production.yaml
source:
  tap: tap-mongodb
  config:
    connection_string: "${MONGODB_URI}"
    database: "production"
    collections:
      - users
      - orders
      - products
      - events
    # Optional: filter collections by pattern
    # collection_pattern: "^(users|orders).*"

target:
  bucket: "s3://my-company-data"
  prefix: "raw/mongodb"
  format: parquet
  # Parquet settings
  compression: snappy
  row_group_size: 100000

settings:
  batch_size: 10000
  flatten_nested: true
  flatten_max_depth: 3
  flatten_arrays: explode  # or "stringify" to keep arrays as JSON strings
  # Incremental settings
  replication_method: incremental
  bookmark_field: _id

From MongoDB to SQL-queryable data lake in 15 minutes. Your analysts get SQL, your production database stays untouched, and your monthly bill stays under $5.
