
From Google Sheets to a Queryable Data Lake in 3 Commands

Michael San Martim · 2026-04-26

Every company has that one spreadsheet. The marketing budget tracker. The sales forecast. The product roadmap with 47 tabs. It lives in Google Sheets because that is where non-technical teams work.

You need that data in your data lake. Not as a CSV export you remember to download on Fridays. As a proper Parquet table you can join with your production data and query with SQL.

DataSpoc Pipe has a built-in Google Sheets tap. No extra packages, no OAuth dance for public sheets, no Singer configuration files. Three commands.

The Scenario

Your marketing team maintains a Google Sheet with campaign spend data:

https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms/edit

It has two sheets (tabs):

  • campaigns — campaign name, channel, budget, spend, start_date, end_date
  • daily_metrics — date, campaign, impressions, clicks, conversions, cost

You want this in your data lake as Parquet, queryable with SQL, joinable with your revenue data from Postgres.

Command 1: Initialize the Project

Terminal window
dataspoc-pipe init marketing-lake --bucket s3://company-marketing

Output:

Created project: marketing-lake/
  dataspoc-pipe.yaml
  pipelines/

Command 2: Add the Google Sheets Pipeline

Terminal window
dataspoc-pipe add sheets-marketing \
  --source google_sheets \
  --url "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms" \
  --target s3://company-marketing

This generates the pipeline config:

pipelines/sheets-marketing.yaml
pipeline: sheets-marketing
source:
  type: google_sheets
  url: "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
  sheets:
    - campaigns
    - daily_metrics
target:
  type: s3
  bucket: s3://company-marketing
  prefix: raw/google_sheets

That is it. No API keys are needed for public sheets. For private sheets, point the GOOGLE_SERVICE_ACCOUNT_JSON environment variable at your service account credentials file (see Private Sheets below).
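
Before pointing Pipe at a sheet, you can confirm it really is link-readable, independently of any DataSpoc tooling. Google serves public sheet tabs as CSV through its gviz endpoint, so a few lines of Python with pandas make a quick sanity check (the filename and everything in this script are illustrative, not part of Pipe):

check_public_access.py
import pandas as pd

# Link-shared Google Sheets can be fetched as CSV via the gviz endpoint.
# An HTTP error here means the sheet is not publicly readable, and you
# will need the service-account setup described later in this post.
SHEET_ID = "1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
url = (
    f"https://docs.google.com/spreadsheets/d/{SHEET_ID}"
    "/gviz/tq?tqx=out:csv&sheet=campaigns"
)

df = pd.read_csv(url)
print(df.head())    # first few campaign rows
print(df.dtypes)    # sanity-check the inferred column types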

Command 3: Run

Terminal window
dataspoc-pipe run sheets-marketing

Output:

[sheets-marketing] Starting extraction...
[sheets-marketing] Extracting sheet: campaigns (156 rows)
[sheets-marketing] Extracting sheet: daily_metrics (4,380 rows)
[sheets-marketing] Writing: s3://company-marketing/raw/google_sheets/campaigns/20260415_070000.parquet
[sheets-marketing] Writing: s3://company-marketing/raw/google_sheets/daily_metrics/20260415_070000.parquet
[sheets-marketing] Done. 2 tables, 4,536 rows, 3.2s

Your data lake now has:

s3://company-marketing/
  .dataspoc/
    manifest.json
    state/sheets-marketing/state.json
  raw/google_sheets/
    campaigns/
      20260415_070000.parquet
    daily_metrics/
      20260415_070000.parquet
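
Nothing about these files is proprietary: any DuckDB client can read the Parquet straight off S3 to double-check the load. A minimal sketch with the duckdb Python package, assuming AWS credentials are available in your environment (the script and its names are illustrative):

verify_lake.py
import duckdb

con = duckdb.connect()
# httpfs reads s3:// paths; the aws extension supplies the credential chain.
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("INSTALL aws; LOAD aws;")
con.sql("CREATE SECRET (TYPE s3, PROVIDER credential_chain);")

# Row count should match the pipeline output above (156 rows).
con.sql("""
    SELECT count(*) AS rows
    FROM read_parquet('s3://company-marketing/raw/google_sheets/campaigns/*.parquet')
""").show()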

Querying with DataSpoc Lens

Now connect Lens and start querying:

Terminal window
dataspoc-lens add-bucket marketing s3://company-marketing
dataspoc-lens shell

You are now in a DuckDB SQL shell with your Sheets data mounted:

-- See available tables
SHOW TABLES;
┌───────────────────────────────────┐
│ name                              │
├───────────────────────────────────┤
│ raw__google_sheets__campaigns     │
│ raw__google_sheets__daily_metrics │
└───────────────────────────────────┘
-- Total spend by channel
SELECT
  channel,
  COUNT(*) as campaigns,
  SUM(spend) as total_spend,
  SUM(budget) as total_budget,
  ROUND(SUM(spend) / SUM(budget) * 100, 1) as pct_used
FROM raw__google_sheets__campaigns
GROUP BY channel
ORDER BY total_spend DESC;
┌──────────┬───────────┬─────────────┬──────────────┬──────────┐
│ channel  │ campaigns │ total_spend │ total_budget │ pct_used │
├──────────┼───────────┼─────────────┼──────────────┼──────────┤
│ google   │        45 │     125,400 │      150,000 │     83.6 │
│ meta     │        38 │      98,200 │      120,000 │     81.8 │
│ linkedin │        22 │      67,800 │       80,000 │     84.8 │
│ twitter  │        51 │      34,500 │       50,000 │     69.0 │
└──────────┴───────────┴─────────────┴──────────────┴──────────┘
-- Cost per conversion by campaign (last 30 days)
SELECT
  c.campaign_name,
  c.channel,
  SUM(d.cost) as total_cost,
  SUM(d.conversions) as total_conversions,
  ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) as cpa
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.campaign_name, c.channel
HAVING SUM(d.conversions) > 0
ORDER BY cpa ASC
LIMIT 10;
┌──────────────────────┬──────────┬────────────┬───────────────────┬───────┐
│ campaign_name        │ channel  │ total_cost │ total_conversions │ cpa   │
├──────────────────────┼──────────┼────────────┼───────────────────┼───────┤
│ spring_retargeting   │ meta     │      2,340 │               156 │ 15.00 │
│ brand_search         │ google   │      4,200 │               245 │ 17.14 │
│ lookalike_q2         │ meta     │      3,800 │               198 │ 19.19 │
│ dev_tools_launch     │ linkedin │      5,600 │               112 │ 50.00 │
└──────────────────────┴──────────┴────────────┴───────────────────┴───────┘
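
Because the Lens shell is DuckDB underneath, the same queries also script cleanly outside it, for example to materialize the CPA numbers as a report file. A sketch under the same assumptions as the earlier verification script (duckdb package, AWS credentials in the environment; the output filename is arbitrary):

cpa_report.py
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("INSTALL aws; LOAD aws;")
con.sql("CREATE SECRET (TYPE s3, PROVIDER credential_chain);")

base = "s3://company-marketing/raw/google_sheets"
# COPY (...) TO writes the query result out as a local Parquet file.
con.sql(f"""
    COPY (
        SELECT c.channel,
               SUM(d.cost) AS total_cost,
               SUM(d.conversions) AS conversions,
               ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) AS cpa
        FROM read_parquet('{base}/daily_metrics/*.parquet') d
        JOIN read_parquet('{base}/campaigns/*.parquet') c
          ON d.campaign = c.campaign_name
        GROUP BY c.channel
    ) TO 'cpa_by_channel.parquet' (FORMAT PARQUET)
""")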

AI Ask: Natural Language Queries

For quick exploration, use AI Ask:

Terminal window
dataspoc-lens ask "Which campaign had the best conversion rate last week?"

Output:

SQL:

SELECT
  c.campaign_name,
  c.channel,
  SUM(d.clicks) as clicks,
  SUM(d.conversions) as conversions,
  ROUND(SUM(d.conversions)::FLOAT / NULLIF(SUM(d.clicks), 0) * 100, 2) as conversion_rate
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY c.campaign_name, c.channel
ORDER BY conversion_rate DESC
LIMIT 1

Result:

campaign_name: spring_retargeting
channel: meta
conversion_rate: 12.4%

Terminal window
dataspoc-lens ask "Compare Google vs Meta spend efficiency this month"

Output:

SQL:

SELECT
  c.channel,
  SUM(d.cost) as spend,
  SUM(d.conversions) as conversions,
  ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) as cost_per_conversion
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= DATE_TRUNC('month', CURRENT_DATE)
  AND c.channel IN ('google', 'meta')
GROUP BY c.channel

Result:

┌─────────┬────────┬─────────────┬─────────────────────┐
│ channel │ spend  │ conversions │ cost_per_conversion │
├─────────┼────────┼─────────────┼─────────────────────┤
│ meta    │ 12,400 │         523 │               23.71 │
│ google  │ 18,900 │         687 │               27.51 │
└─────────┴────────┴─────────────┴─────────────────────┘

Keeping It Fresh

Set up a cron to refresh daily:

Terminal window
# Refresh sheets data every morning at 6am
0 6 * * * cd /opt/marketing-lake && dataspoc-pipe run sheets-marketing

Or use Pipe’s built-in schedule:

pipelines/sheets-marketing.yaml
pipeline: sheets-marketing
schedule: "0 6 * * *"
source:
  type: google_sheets
  url: "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
  sheets:
    - campaigns
    - daily_metrics
target:
  type: s3
  bucket: s3://company-marketing
  prefix: raw/google_sheets

Private Sheets

For sheets that are not publicly shared, use a Google Service Account:

Terminal window
export GOOGLE_SERVICE_ACCOUNT_JSON=/path/to/service-account.json
dataspoc-pipe run sheets-marketing

Share the sheet with the service account's email address, and everything else works exactly the same way.
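
To confirm the share took effect before running Pipe, the gspread library (a third-party Google Sheets client, not part of DataSpoc) can list the tabs the service account is able to see, using the same credentials file:

check_access.py
import gspread

# Authenticate with the same service-account JSON that Pipe uses.
gc = gspread.service_account(filename="/path/to/service-account.json")
sh = gc.open_by_key("1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms")

# An APIError here usually means the sheet has not been shared with
# the service account's email address yet.
print([ws.title for ws in sh.worksheets()])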

Joining Sheets Data with Other Sources

The real power is combining spreadsheet data with your production data. If you also have a Postgres pipeline:

-- Join marketing spend with actual revenue
SELECT
  c.channel,
  SUM(d.cost) as marketing_spend,
  SUM(o.amount) as revenue,
  ROUND(SUM(o.amount) / NULLIF(SUM(d.cost), 0), 2) as roas
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
JOIN raw__postgres__orders o
  ON o.utm_source = c.channel
  AND o.created_at::DATE = d.date
WHERE d.date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.channel
ORDER BY roas DESC;

Why This Works for Small Teams

  • No infrastructure — no Airflow, no Fivetran, no dbt Cloud
  • No coding — the business team keeps editing their spreadsheet, you get Parquet
  • SQL on everything — once it is Parquet, it is just another table
  • AI-ready — AI Ask works on spreadsheet data the same as production data
  • Free — DataSpoc Pipe is open source, and the Google Sheets API is free within Google's standard usage quotas

Three commands. Public spreadsheet to queryable data lake. That is the promise of modern data tooling.
