From Google Sheets to a Queryable Data Lake in 3 Commands
Every company has that one spreadsheet. The marketing budget tracker. The sales forecast. The product roadmap with 47 tabs. It lives in Google Sheets because that is where non-technical teams work.
You need that data in your data lake. Not as a CSV export you remember to download on Fridays. As a proper Parquet table you can join with your production data and query with SQL.
DataSpoc Pipe has a built-in Google Sheets tap. No extra packages, no OAuth dance for public sheets, no Singer configuration files. Three commands.
The Scenario
Your marketing team maintains a Google Sheet with campaign spend data:
https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms/edit

It has two sheets (tabs):
- campaigns — campaign name, channel, budget, spend, start_date, end_date
- daily_metrics — date, campaign, impressions, clicks, conversions, cost
You want this in your data lake as Parquet, queryable with SQL, joinable with your revenue data from Postgres.
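Before wiring anything up, it helps to picture how those two tabs will land. Here is a rough sketch of the schemas you would expect; the column types are my assumption (presumably the tap infers them from cell values, so yours may differ):

-- Expected shape of the two tabs once landed (types are assumed, not guaranteed)
CREATE TABLE campaigns (
  campaign_name VARCHAR,
  channel       VARCHAR,
  budget        DOUBLE,
  spend         DOUBLE,
  start_date    DATE,
  end_date      DATE
);

CREATE TABLE daily_metrics (
  date        DATE,
  campaign    VARCHAR,   -- references campaigns.campaign_name
  impressions BIGINT,
  clicks      BIGINT,
  conversions BIGINT,
  cost        DOUBLE
);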
Command 1: Initialize the Project
dataspoc-pipe init marketing-lake --bucket s3://company-marketing

Output:
Created project: marketing-lake/
  dataspoc-pipe.yaml
  pipelines/

Command 2: Add the Google Sheets Pipeline
dataspoc-pipe add sheets-marketing \
  --source google_sheets \
  --url "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms" \
  --target s3://company-marketing

This generates the pipeline config:
pipeline: sheets-marketing

source:
  type: google_sheets
  url: "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
  sheets:
    - campaigns
    - daily_metrics
target:
  type: s3
  bucket: s3://company-marketing
  prefix: raw/google_sheets

That is it. No API keys needed for public sheets. For private sheets, set GOOGLE_SERVICE_ACCOUNT_JSON as an environment variable.
Command 3: Run
dataspoc-pipe run sheets-marketing

Output:
[sheets-marketing] Starting extraction...
[sheets-marketing] Extracting sheet: campaigns (156 rows)
[sheets-marketing] Extracting sheet: daily_metrics (4,380 rows)
[sheets-marketing] Writing: s3://company-marketing/raw/google_sheets/campaigns/20260415_070000.parquet
[sheets-marketing] Writing: s3://company-marketing/raw/google_sheets/daily_metrics/20260415_070000.parquet
[sheets-marketing] Done. 2 tables, 4,536 rows, 3.2s

Your data lake now has:
s3://company-marketing/
  .dataspoc/
    manifest.json
    state/sheets-marketing/state.json
  raw/google_sheets/
    campaigns/
      20260415_070000.parquet
    daily_metrics/
      20260415_070000.parquet
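These are plain Parquet files, so nothing locks you into DataSpoc tooling to read them. As a quick sanity check (a minimal sketch, assuming you have DuckDB with the httpfs extension and S3 credentials available in your environment), you can count rows straight off S3:

-- Spot-check a landed file with vanilla DuckDB
INSTALL httpfs;
LOAD httpfs;
SELECT COUNT(*) AS row_count
FROM read_parquet('s3://company-marketing/raw/google_sheets/campaigns/20260415_070000.parquet');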
Querying with DataSpoc Lens

Now connect Lens and start querying:
dataspoc-lens add-bucket marketing s3://company-marketing
dataspoc-lens shell

You are now in a DuckDB SQL shell with your Sheets data mounted:
-- See available tables
SHOW TABLES;

┌────────────────────────────────────┐
│ name                               │
├────────────────────────────────────┤
│ raw__google_sheets__campaigns      │
│ raw__google_sheets__daily_metrics  │
└────────────────────────────────────┘

-- Total spend by channel
SELECT
  channel,
  COUNT(*) as campaigns,
  SUM(spend) as total_spend,
  SUM(budget) as total_budget,
  ROUND(SUM(spend) / SUM(budget) * 100, 1) as pct_used
FROM raw__google_sheets__campaigns
GROUP BY channel
ORDER BY total_spend DESC;

┌──────────┬───────────┬─────────────┬──────────────┬──────────┐
│ channel  │ campaigns │ total_spend │ total_budget │ pct_used │
├──────────┼───────────┼─────────────┼──────────────┼──────────┤
│ google   │        45 │     125,400 │      150,000 │     83.6 │
│ meta     │        38 │      98,200 │      120,000 │     81.8 │
│ linkedin │        22 │      67,800 │       80,000 │     84.8 │
│ twitter  │        51 │      34,500 │       50,000 │     69.0 │
└──────────┴───────────┴─────────────┴──────────────┴──────────┘

-- Cost per conversion by campaign (last 30 days)
SELECT
  c.campaign_name,
  c.channel,
  SUM(d.cost) as total_cost,
  SUM(d.conversions) as total_conversions,
  ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) as cpa
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.campaign_name, c.channel
HAVING SUM(d.conversions) > 0
ORDER BY cpa ASC
LIMIT 10;

┌────────────────────┬──────────┬────────────┬───────────────────┬───────┐
│ campaign_name      │ channel  │ total_cost │ total_conversions │ cpa   │
├────────────────────┼──────────┼────────────┼───────────────────┼───────┤
│ spring_retargeting │ meta     │      2,340 │               156 │ 15.00 │
│ brand_search       │ google   │      4,200 │               245 │ 17.14 │
│ lookalike_q2       │ meta     │      3,800 │               198 │ 19.19 │
│ dev_tools_launch   │ linkedin │      5,600 │               112 │ 50.00 │
└────────────────────┴──────────┴────────────┴───────────────────┴───────┘
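If a query like the channel summary is worth keeping around, the same shell can write it back to the lake as its own Parquet file. A minimal sketch, assuming the Lens session is allowed to write to the bucket; the curated/ prefix is an illustrative choice of mine, not something Pipe creates for you:

-- Persist the channel summary as a curated Parquet file (prefix is illustrative)
COPY (
  SELECT
    channel,
    SUM(spend)  AS total_spend,
    SUM(budget) AS total_budget
  FROM raw__google_sheets__campaigns
  GROUP BY channel
) TO 's3://company-marketing/curated/channel_spend.parquet' (FORMAT PARQUET);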
AI Ask: Natural Language Queries

For quick exploration, use AI Ask:
dataspoc-lens ask "Which campaign had the best conversion rate last week?"

Output:
SQL:
SELECT c.campaign_name,
       c.channel,
       SUM(d.clicks) as clicks,
       SUM(d.conversions) as conversions,
       ROUND(SUM(d.conversions)::FLOAT / NULLIF(SUM(d.clicks), 0) * 100, 2) as conversion_rate
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY c.campaign_name, c.channel
ORDER BY conversion_rate DESC
LIMIT 1
Result:
  campaign_name: spring_retargeting
  channel: meta
  conversion_rate: 12.4%

dataspoc-lens ask "Compare Google vs Meta spend efficiency this month"

SQL:
SELECT c.channel,
       SUM(d.cost) as spend,
       SUM(d.conversions) as conversions,
       ROUND(SUM(d.cost) / NULLIF(SUM(d.conversions), 0), 2) as cost_per_conversion
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
WHERE d.date >= DATE_TRUNC('month', CURRENT_DATE)
  AND c.channel IN ('google', 'meta')
GROUP BY c.channel
Result:
┌─────────┬────────┬─────────────┬─────────────────────┐
│ channel │ spend  │ conversions │ cost_per_conversion │
├─────────┼────────┼─────────────┼─────────────────────┤
│ meta    │ 12,400 │         523 │               23.71 │
│ google  │ 18,900 │         687 │               27.51 │
└─────────┴────────┴─────────────┴─────────────────────┘

Keeping It Fresh
Set up a cron to refresh daily:
# Refresh sheets data every morning at 6am
0 6 * * * cd /opt/marketing-lake && dataspoc-pipe run sheets-marketing

Or use Pipe’s built-in schedule:
pipeline: sheets-marketing
schedule: "0 6 * * *"

source:
  type: google_sheets
  url: "https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgVE2upms"
  sheets:
    - campaigns
    - daily_metrics

target:
  type: s3
  bucket: s3://company-marketing
  prefix: raw/google_sheets

Private Sheets
For sheets that are not publicly shared, use a Google Service Account:
export GOOGLE_SERVICE_ACCOUNT_JSON=/path/to/service-account.json
dataspoc-pipe run sheets-marketing

Share the sheet with the service account email address, and it works exactly the same way.
Joining Sheets Data with Other Sources
The real power is combining spreadsheet data with your production data. If you also have a Postgres pipeline:
-- Join marketing spend with actual revenue
SELECT
  c.channel,
  SUM(d.cost) as marketing_spend,
  SUM(o.amount) as revenue,
  ROUND(SUM(o.amount) / NULLIF(SUM(d.cost), 0), 2) as roas
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
JOIN raw__postgres__orders o
  ON o.utm_source = c.channel
 AND o.created_at::DATE = d.date
WHERE d.date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.channel
ORDER BY roas DESC;
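The same join rolls up over time just as easily, which is useful for checking whether efficiency is trending the right way. A sketch using the same assumed table names; nothing here is DataSpoc-specific:

-- Weekly ROAS trend per channel (same tables as above)
SELECT
  DATE_TRUNC('week', d.date) AS week,
  c.channel,
  ROUND(SUM(o.amount) / NULLIF(SUM(d.cost), 0), 2) AS roas
FROM raw__google_sheets__daily_metrics d
JOIN raw__google_sheets__campaigns c ON d.campaign = c.campaign_name
JOIN raw__postgres__orders o
  ON o.utm_source = c.channel
 AND o.created_at::DATE = d.date
GROUP BY week, c.channel
ORDER BY week, c.channel;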
Why This Works for Small Teams

- No infrastructure — no Airflow, no Fivetran, no dbt Cloud
- No coding — the business team keeps editing their spreadsheet, you get Parquet
- SQL on everything — once it is Parquet, it is just another table
- AI-ready — AI Ask works on spreadsheet data the same as production data
- Free — DataSpoc Pipe is open source, and the Google Sheets API is free within its usage quotas
Three commands. Public spreadsheet to queryable data lake. That is the promise of modern data tooling.