From Google Sheets to a Queryable Data Lake in 3 Commands
Every company has that one spreadsheet. The marketing budget tracker. The sales forecast. The product roadmap with 47 tabs. It lives in Google Sheets because that is where non-technical teams work.
You need that data in your data lake. Not as a CSV export you remember to download on Fridays. As a proper Parquet table you can join with your production data and query with SQL.
DataSpoc Pipe has a built-in Google Sheets tap. No extra packages, no OAuth dance for public sheets, no Singer configuration files. Three commands.
The Scenario
Your marketing team maintains a Google Sheet with campaign spend data:
https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms/edit

It has two sheets (tabs):
- campaigns — campaign name, channel, budget, spend, start_date, end_date
- daily_metrics — date, campaign, impressions, clicks, conversions, cost
You want this in your data lake as Parquet, queryable with SQL, joinable with your revenue data from Postgres.
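Before wiring anything up, it helps to picture how those two tabs will land. Here is a rough sketch of the schemas you would expect; the column types are my assumption (presumably the tap infers them from cell values, so yours may differ):

-- Expected shape of the two tabs once landed (types are assumed, not guaranteed)
CREATE TABLE campaigns (
  campaign_name VARCHAR,
  channel       VARCHAR,
  budget        DOUBLE,
  spend         DOUBLE,
  start_date    DATE,
  end_date      DATE
);

CREATE TABLE daily_metrics (
  date        DATE,
  campaign    VARCHAR,   -- references campaigns.campaign_name
  impressions BIGINT,
  clicks      BIGINT,
  conversions BIGINT,
  cost        DOUBLE
);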
Command 1: Initialize the Project
dataspoc-pipe init marketing-lake --bucket s3://company-marketing

Output:
Created project: marketing-lake/
  dataspoc-pipe.yaml
  pipelines/

Command 2: Add the Google Sheets Pipeline
dataspoc-pipe add sheets-marketing \
  --source google_sheets \
  --url "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms" \
  --target s3://company-marketing

This generates the pipeline config:
pipeline: sheets-marketing

source:
  type: google_sheets
  url: "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
  sheets:
    - campaigns
    - daily_metrics
target:
  type: s3
  bucket: s3://company-marketing
  prefix: raw/google_sheets

That is it. No API keys needed for public sheets. For private sheets, set GOOGLE_SERVICE_ACCOUNT_JSON as an environment variable.
Command 3: Run
dataspoc-pipe run sheets-marketing

Output:
[sheets-marketing] Starting extraction...
[sheets-marketing] Extracting sheet: campaigns (156 rows)
[sheets-marketing] Extracting sheet: daily_metrics (4,380 rows)
[sheets-marketing] Writing: s3://company-marketing/raw/google_sheets/campaigns/20260415_070000.parquet
[sheets-marketing] Writing: s3://company-marketing/raw/google_sheets/daily_metrics/20260415_070000.parquet
[sheets-marketing] Done. 2 tables, 4,536 rows, 3.2s

Your data lake now has:
s3://company-marketing/
  .dataspoc/
    manifest.json
    state/sheets-marketing/state.json
  raw/google_sheets/
    campaigns/
      20260415_070000.parquet
    daily_metrics/
      20260415_070000.parquet
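These are plain Parquet files, so nothing locks you into DataSpoc tooling to read them. As a quick sanity check (a minimal sketch, assuming you have DuckDB with the httpfs extension and S3 credentials available in your environment), you can count rows straight off S3:

-- Spot-check a landed file with vanilla DuckDB
INSTALL httpfs;
LOAD httpfs;
SELECT COUNT(*) AS row_count
FROM read_parquet('s3://company-marketing/raw/google_sheets/campaigns/20260415_070000.parquet');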
Querying with DataSpoc Lens

Now connect Lens and start querying:
dataspoc-lens add-bucket marketing s3://company-marketing
dataspoc-lens shell

You are now in a DuckDB SQL shell with your Sheets data mounted:
-- See available tables
SHOW TABLES;

┌────────────────────────────────────┐
│ name                               │
├────────────────────────────────────┤
│ raw__google_sheets__campaigns      │
│ raw__google_sheets__daily_metrics  │
└────────────────────────────────────┘

-- Total spend by channel
SELECT
  channel,
  COUNT(*) as campaigns,
  SUM(spend) as total_spend,
  SUM(budget) as total_budget,
  ROUND(SUM(spend) / SUM(budget) * 100, 1) as pct_used
FROM raw__google_sheets__campaigns
GROUP BY channel
ORDER BY total_spend DESC;

┌──────────┬───────────┬─────────────┬──────────────┬──────────┐
│ channel  │ campaigns │ total_spend │ total_budget │ pct_used │
├──────────┼───────────┼─────────────┼──────────────┼──────────┤
│ google   │        45 │     125,400 │      150,000 │     83.6 │
│ meta     │        38 │      98,200 │      120,000 │     81.8 │
│ linkedin │        22 │      67,800 │       80,000 │     84.8 │
│ twitter  │        51 │      34,500 │       50,000 │     69.0 │
└──────────┴───────────┴─────────────┴──────────────┴──────────┘

-- Cost per conversion by campaign (last 30 days)
SELECT
  c.campaign_name,
  c.channel,
  SUM(d.cost) as total_cost,
  SUM(d.conversions) as total_conversions,
  ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) as cpa
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.campaign_name, c.channel
HAVING SUM(d.conversions) > 0
ORDER BY cpa ASC
LIMIT 10;

┌────────────────────┬──────────┬────────────┬───────────────────┬───────┐
│ campaign_name      │ channel  │ total_cost │ total_conversions │ cpa   │
├────────────────────┼──────────┼────────────┼───────────────────┼───────┤
│ spring_retargeting │ meta     │      2,340 │               156 │ 15.00 │
│ brand_search       │ google   │      4,200 │               245 │ 17.14 │
│ lookalike_q2       │ meta     │      3,800 │               198 │ 19.19 │
│ dev_tools_launch   │ linkedin │      5,600 │               112 │ 50.00 │
└────────────────────┴──────────┴────────────┴───────────────────┴───────┘
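If a query like the channel summary is worth keeping around, the same shell can write it back to the lake as its own Parquet file. A minimal sketch, assuming the Lens session is allowed to write to the bucket; the curated/ prefix is an illustrative choice of mine, not something Pipe creates for you:

-- Persist the channel summary as a curated Parquet file (prefix is illustrative)
COPY (
  SELECT
    channel,
    SUM(spend)  AS total_spend,
    SUM(budget) AS total_budget
  FROM raw__google_sheets__campaigns
  GROUP BY channel
) TO 's3://company-marketing/curated/channel_spend.parquet' (FORMAT PARQUET);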
AI Ask: Natural Language Queries

For quick exploration, use AI Ask:
dataspoc-lens ask "Which campaign had the best conversion rate last week?"

Output:
SQL:
SELECT c.campaign_name,
       c.channel,
       SUM(d.clicks) as clicks,
       SUM(d.conversions) as conversions,
       ROUND(SUM(d.conversions)::FLOAT / NULLIF(SUM(d.clicks), 0) * 100, 2) as conversion_rate
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY c.campaign_name, c.channel
ORDER BY conversion_rate DESC
LIMIT 1
Result:
  campaign_name: spring_retargeting
  channel: meta
  conversion_rate: 12.4%

dataspoc-lens ask "Compare Google vs Meta spend efficiency this month"

SQL:
SELECT c.channel,
       SUM(d.cost) as spend,
       SUM(d.conversions) as conversions,
       ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) as cost_per_conversion
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= DATE_TRUNC('month', CURRENT_DATE)
  AND c.channel IN ('google', 'meta')
GROUP BY c.channel
Result:
┌─────────┬────────┬─────────────┬─────────────────────┐
│ channel │ spend  │ conversions │ cost_per_conversion │
├─────────┼────────┼─────────────┼─────────────────────┤
│ meta    │ 12,400 │         523 │               23.71 │
│ google  │ 18,900 │         687 │               27.51 │
└─────────┴────────┴─────────────┴─────────────────────┘

Keeping It Fresh
Set up a cron to refresh daily:
# Refresh sheets data every morning at 6am
0 6 * * * cd /opt/marketing-lake && dataspoc-pipe run sheets-marketing

Or use Pipe’s built-in schedule:
pipeline: sheets-marketing
schedule: "0 6 * * *"

source:
  type: google_sheets
  url: "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
  sheets:
    - campaigns
    - daily_metrics

target:
  type: s3
  bucket: s3://company-marketing
  prefix: raw/google_sheets

Private Sheets
For sheets that are not publicly shared, use a Google Service Account:
export GOOGLE_SERVICE_ACCOUNT_JSON=/path/to/service-account.json
dataspoc-pipe run sheets-marketing

Share the sheet with the service account email address, and it works exactly the same way.
Joining Sheets Data with Other Sources
The real power is combining spreadsheet data with your production data. If you also have a Postgres pipeline:
-- Join marketing spend with actual revenue
SELECT
  c.channel,
  SUM(d.cost) as marketing_spend,
  SUM(o.amount) as revenue,
  ROUND(SUM(o.amount) / NULLIF(SUM(d.cost), 0), 2) as roas
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
JOIN raw__postgres__orders o
  ON o.utm_source = c.channel
 AND o.created_at::DATE = d.date
WHERE d.date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.channel
ORDER BY roas DESC;
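The same join rolls up over time just as easily, which is useful for checking whether efficiency is trending the right way. A sketch using the same assumed table names; nothing here is DataSpoc-specific:

-- Weekly ROAS trend per channel (same tables as above)
SELECT
  DATE_TRUNC('week', d.date) AS week,
  c.channel,
  ROUND(SUM(o.amount) / NULLIF(SUM(d.cost), 0), 2) AS roas
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
JOIN raw__postgres__orders o
  ON o.utm_source = c.channel
 AND o.created_at::DATE = d.date
GROUP BY week, c.channel
ORDER BY week, c.channel;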
Why This Works for Small Teams

- No infrastructure — no Airflow, no Fivetran, no dbt Cloud
- No coding — the business team keeps editing their spreadsheet, you get Parquet
- SQL on everything — once it is Parquet, it is just another table
- AI-ready — AI Ask works on spreadsheet data the same as production data
- Free — DataSpoc Pipe is open source, and the Google Sheets API is free within its usage quotas
Three commands. Public spreadsheet to queryable data lake. That is the promise of modern data tooling.