MongoDB to Parquet: Build a Data Lake from Your MongoDB in 15 Minutes
Your app runs on MongoDB. Your analysts want SQL. Your data team wants a data lake. You do not want to build an ETL pipeline from scratch.
This post takes you from a MongoDB database to a queryable Parquet-based data lake in 15 minutes using DataSpoc Pipe. You will extract collections, flatten nested documents, set up incremental sync, and query the result with SQL.
The Problem
MongoDB is great for applications but painful for analytics:
- No SQL (yes, there is MQL, but your analysts do not know it)
- Awkward joins ($lookup exists, but aggregation pipelines are verbose and slow for analytics)
- Production load: analytical queries compete with your app
- No enforced schema: every document in a collection can have a different shape
The solution: extract to Parquet once, query forever with DuckDB.
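To make that last claim concrete: once the Parquet files exist (Step 4 below writes them), querying takes nothing more than DuckDB. A minimal sketch, assuming the files have been synced to a local raw/mongodb/ directory:

```python
import duckdb  # pip install duckdb

# Plain SQL over Parquet files; no warehouse, no load on MongoDB.
duckdb.sql("""
    SELECT plan, COUNT(*) AS users
    FROM 'raw/mongodb/users/*.parquet'
    GROUP BY plan
    ORDER BY users DESC
""").show()
```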
Step 1: Install the MongoDB Tap
```bash
pip install dataspoc-pipe
dataspoc-pipe add tap-mongodb
```

This installs the Singer-compatible MongoDB extractor. DataSpoc Pipe manages it as a plugin.
Step 2: Configure the Connection
Create a pipeline YAML that defines the source and target:
```yaml
source:
  tap: tap-mongodb
  config:
    connection_string: "${MONGODB_URI}"
    database: "production"
    collections:
      - users
      - orders
      - products
      - events

target:
  bucket: "s3://my-company-data"
  prefix: "raw/mongodb"
  format: parquet

settings:
  batch_size: 10000
  flatten_nested: true
  flatten_max_depth: 3
```

Set your MongoDB connection string as an environment variable:
```bash
export MONGODB_URI="mongodb+srv://readonly:password@cluster0.abc123.mongodb.net"
```

Use a read-only user. Never give your ETL pipeline write access to production.
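If no read-only user exists yet: on Atlas, create one in the UI under Database Access; on a self-managed deployment, a one-off pymongo sketch like the following works (the URI, username, and password here are placeholders):

```python
from pymongo import MongoClient

# One-off admin task; the pipeline itself only ever sees the read-only user.
admin = MongoClient("mongodb://admin:<admin-password>@db.internal:27017")
admin["production"].command(
    "createUser",
    "readonly",
    pwd="choose-a-strong-password",
    roles=[{"role": "read", "db": "production"}],  # read-only, single database
)
```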
Step 3: Preview What Will Be Extracted
Before running a full extraction, preview the schema that Pipe discovers:
```bash
dataspoc-pipe discover mongodb-production
```

Output:
```
Discovered 4 collections:

users (12,450 documents)
  _id         ObjectId
  email       string
  name        string
  plan        string
  address     object → address_street, address_city, address_state, address_zip
  created_at  datetime
  metadata    object → metadata_source, metadata_campaign

orders (89,230 documents)
  _id         ObjectId
  user_id     ObjectId
  items       array → flattened to separate rows
  total       number
  status      string
  created_at  datetime

products (340 documents)
  _id       ObjectId
  name      string
  sku       string
  price     number
  category  string
  tags      array → tags_0, tags_1, tags_2

events (1,245,600 documents)
  _id         ObjectId
  user_id     ObjectId
  event_type  string
  properties  object → properties_page, properties_referrer, properties_duration
  timestamp   datetime
```

Notice how flatten_nested: true converts nested objects into flat columns: the address object becomes address_street, address_city, and so on. This matters because analysts query flat columns with plain SQL; Parquet can store nested structures, but flattening up front keeps every downstream query simple.
Step 4: Run the Extraction
```bash
dataspoc-pipe run mongodb-production
```

```
[14:23:01] Starting pipeline: mongodb-production
[14:23:02] Extracting: users (12,450 docs)
[14:23:05] → raw/mongodb/users/users_20260426_142305.parquet (1.2 MB)
[14:23:05] Extracting: orders (89,230 docs)
[14:23:18] → raw/mongodb/orders/orders_20260426_142305.parquet (8.4 MB)
[14:23:18] Extracting: products (340 docs)
[14:23:19] → raw/mongodb/products/products_20260426_142319.parquet (42 KB)
[14:23:19] Extracting: events (1,245,600 docs)
[14:24:02] → raw/mongodb/events/events_20260426_142319.parquet (95 MB)
[14:24:02] Pipeline complete: 4 tables, 1,347,620 rows, 105 MB Parquet
[14:24:02] State saved to .dataspoc/state/mongodb-production/state.json
```

Four collections extracted, 1.3 million documents converted to Parquet. The state file records the last _id seen in each collection for incremental extraction.
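Before pointing other tools at the bucket, it is worth sanity-checking the flattened schema Pipe wrote. A sketch using DuckDB's httpfs extension (paths match the run log above; S3 credentials must already be configured for DuckDB, e.g. via a secret or AWS environment variables):

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")  # adds s3:// support
con.sql("LOAD httpfs")

# List the column names and types Pipe produced for the users collection.
con.sql("""
    DESCRIBE SELECT *
    FROM 's3://my-company-data/raw/mongodb/users/*.parquet'
""").show()
```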
Step 5: Query with DataSpoc Lens
```bash
dataspoc-lens add-bucket s3://my-company-data --name production
dataspoc-lens tables
```

```
raw.mongodb.users
raw.mongodb.orders
raw.mongodb.products
raw.mongodb.events
```

Now run SQL:
```bash
dataspoc-lens query "
  SELECT
    u.plan,
    COUNT(DISTINCT u._id) AS customers,
    SUM(o.total) AS revenue,
    ROUND(SUM(o.total) / COUNT(DISTINCT u._id), 2) AS arpu
  FROM raw.mongodb.users u
  JOIN raw.mongodb.orders o ON u._id = o.user_id
  WHERE o.status = 'completed'
  GROUP BY u.plan
  ORDER BY revenue DESC
"
```

```
┌──────────┬───────────┬────────────┬────────┐
│ plan     │ customers │ revenue    │ arpu   │
├──────────┼───────────┼────────────┼────────┤
│ business │     2,340 │  1,245,600 │ 532.31 │
│ pro      │     4,120 │    824,000 │ 200.00 │
│ starter  │     5,990 │    299,500 │  50.00 │
└──────────┴───────────┴────────────┴────────┘
```

Your analysts now have SQL access to MongoDB data without touching the production database.
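Lens is the convenient front door, but the files are open Parquet: the same join runs in stock DuckDB straight against the bucket. A sketch, with httpfs set up as in the earlier snippet:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Same revenue-by-plan join, directly over the Parquet files in S3.
con.sql("""
    SELECT u.plan,
           COUNT(DISTINCT u._id) AS customers,
           SUM(o.total)          AS revenue
    FROM 's3://my-company-data/raw/mongodb/users/*.parquet'  AS u
    JOIN 's3://my-company-data/raw/mongodb/orders/*.parquet' AS o
      ON u._id = o.user_id
    WHERE o.status = 'completed'
    GROUP BY u.plan
    ORDER BY revenue DESC
""").show()
```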
Handling Nested Documents
MongoDB’s flexible schema is the biggest challenge. Here is how Pipe handles each case:
Nested objects are flattened with underscore separators:
```
// MongoDB document
{
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA"
  }
}

// Parquet columns
address_street  VARCHAR  "123 Main St"
address_city    VARCHAR  "San Francisco"
address_state   VARCHAR  "CA"
```

Arrays of primitives become numbered columns:
{ "tags": ["electronics", "sale", "featured"] }tags_0 VARCHAR "electronics"tags_1 VARCHAR "sale"tags_2 VARCHAR "featured"Arrays of objects (like order items) can be flattened to separate rows:
```yaml
# In pipeline config
settings:
  flatten_nested: true
  flatten_arrays: explode  # Creates one row per array element
```

```json
{
  "order_id": "abc",
  "items": [
    {"sku": "A1", "qty": 2, "price": 10.00},
    {"sku": "B2", "qty": 1, "price": 25.00}
  ]
}
```

Becomes two Parquet rows:
```
order_id  items_sku  items_qty  items_price
abc       A1         2          10.00
abc       B2         1          25.00
```
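One consequence of explode mode: an order now spans one row per item, so order-level numbers come from a GROUP BY rather than a single row. A quick DuckDB sketch (local path illustrative):

```python
import duckdb

# After exploding, per-order totals are recomputed with an aggregate.
duckdb.sql("""
    SELECT order_id,
           SUM(items_qty * items_price) AS order_total
    FROM 'raw/mongodb/orders/*.parquet'
    GROUP BY order_id
""").show()
```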
Step 6: Incremental Sync

After the first full extraction, subsequent runs only fetch new documents:
```bash
dataspoc-pipe run mongodb-production
```

```
[15:00:01] Starting pipeline: mongodb-production (incremental)
[15:00:01] Resuming from state: users._id > ObjectId('6627a...')
[15:00:02] Extracting: users (45 new docs since last run)
[15:00:02] → raw/mongodb/users/users_20260426_150002.parquet (4 KB)
[15:00:02] Extracting: orders (312 new docs)
[15:00:03] → raw/mongodb/orders/orders_20260426_150002.parquet (28 KB)
...
[15:00:05] Pipeline complete: 4 tables, 1,892 new rows
```

Pipe uses MongoDB’s _id field to bookmark progress: ObjectIds embed a creation timestamp, so they are naturally ordered by insertion time. Only documents newer than the bookmark are extracted.
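Pipe's internals are not shown here, but the query shape that _id bookmarking implies is the classic high-watermark pattern. A pymongo sketch (the bookmark value is illustrative):

```python
import os

from bson import ObjectId
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])

# Bookmark persisted in state.json after the previous run (value illustrative).
last_seen = ObjectId("6627a00000000000000000aa")

# Fetch only documents inserted after the bookmark, in insertion order,
# so the last _id processed becomes the next run's bookmark.
for doc in (
    client["production"]["users"]
    .find({"_id": {"$gt": last_seen}})
    .sort("_id", 1)
):
    ...  # convert, flatten, and append to the current Parquet batch
```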
To force a full re-extraction:
```bash
dataspoc-pipe run mongodb-production --full
```

Step 7: Schedule with Cron
Add a cron job to extract every hour:
```bash
crontab -e
```

```
0 * * * * cd /opt/dataspoc && dataspoc-pipe run mongodb-production >> /var/log/dataspoc/mongodb.log 2>&1
```

Check the last run status:
```bash
dataspoc-pipe status mongodb-production
```

```
Pipeline: mongodb-production
Last run: 2026-04-26 15:00:05 (success)
Next scheduled: 2026-04-26 16:00:00
Total rows extracted: 1,349,512
State: incremental (bookmark: users._id=6627a...)
```

View execution logs:
```bash
dataspoc-pipe logs mongodb-production --last 5
```

Cost Comparison
| Approach | Monthly Cost | Setup Time |
|---|---|---|
| MongoDB Atlas Analytics Node | $500+ | Minutes |
| Custom ETL (Airflow + Python) | $200+ (infra) | Weeks |
| Fivetran MongoDB connector | $1 per MAR (monthly active row) | Minutes |
| DataSpoc Pipe + S3 | ~$3 (S3 storage) | 15 minutes |
For a startup with 1M documents, DataSpoc Pipe plus S3 storage costs under $5/month. No managed ETL platform fees, no analytics nodes, no Airflow cluster.
Complete Pipeline Config Reference
```yaml
source:
  tap: tap-mongodb
  config:
    connection_string: "${MONGODB_URI}"
    database: "production"
    collections:
      - users
      - orders
      - products
      - events
    # Optional: filter collections by pattern
    # collection_pattern: "^(users|orders).*"

target:
  bucket: "s3://my-company-data"
  prefix: "raw/mongodb"
  format: parquet
  # Parquet settings
  compression: snappy
  row_group_size: 100000

settings:
  batch_size: 10000
  flatten_nested: true
  flatten_max_depth: 3
  flatten_arrays: explode  # or "stringify" to keep as JSON string

  # Incremental settings
  replication_method: incremental
  bookmark_field: _id
```

From MongoDB to a SQL-queryable data lake in 15 minutes. Your analysts get SQL, your production database stays untouched, and your monthly bill stays under $5.