multi-cloud · s3 · gcs · azure · data-lake

One Data Lake, Three Clouds: Multi-Cloud Analytics with DataSpoc

Michael San Martim · 2026-04-28

Your company acquired a startup that runs on GCP. Your main platform is on AWS. The finance team stores exports in Azure Blob Storage. Now you need to join customer data across all three clouds for a single report.

DataSpoc Lens registers buckets from S3, GCS, and Azure Blob Storage in a single unified catalog. DuckDB queries across all of them with standard SQL. One query, three clouds, no data movement.

The Scenario

A mid-size company with data spread across cloud providers:

Cloud    Bucket                   Data                            Team
AWS S3   s3://acme-production     Core app data (users, orders)   Engineering
GCS      gs://acme-acquired-app   Acquired product data           Former startup team
Azure    az://acme-finance        Financial exports, invoices     Finance

Each team chose their cloud. Moving everything to one provider is a 6-month project. You need analytics now.

Step 1: Register All Buckets

Terminal window
# AWS S3 bucket
dataspoc-lens add-bucket s3://acme-production --name production
# Google Cloud Storage bucket
dataspoc-lens add-bucket gs://acme-acquired-app --name acquired
# Azure Blob Storage bucket
dataspoc-lens add-bucket az://acme-finance/data --name finance

Verify all three are connected:

Terminal window
dataspoc-lens tables
production.raw.users
production.raw.orders
production.raw.products
production.curated.daily_revenue
acquired.raw.app_users
acquired.raw.app_events
acquired.raw.subscriptions
finance.raw.invoices
finance.raw.payments
finance.raw.budget_forecast

Three clouds, one catalog.
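
The same catalog is visible from the Python SDK (covered in more detail below) via lens.tables(). A quick sketch that groups tables by source bucket; the grouping logic itself is just illustration:

from collections import defaultdict

from dataspoc_lens import LensClient

lens = LensClient()

# Group the unified catalog by source bucket: the first dotted
# component of each table name is the bucket alias we registered.
by_bucket = defaultdict(list)
for table in lens.tables():  # e.g. 'production.raw.users'
    by_bucket[table.split(".", 1)[0]].append(table)

for bucket, tables in sorted(by_bucket.items()):
    print(f"{bucket}: {len(tables)} tables")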

Step 2: Configure Credentials for Each Provider

AWS S3

Use an AWS profile or environment variables:

Terminal window
# Option 1: AWS profile (recommended)
export AWS_PROFILE=production
# Option 2: Explicit credentials
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"

Or use IAM roles if running on EC2/ECS — no credentials needed.

Google Cloud Storage

Use a service account key or application default credentials:

Terminal window
# Option 1: Service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# Option 2: Application default credentials (if on GCP or gcloud configured)
gcloud auth application-default login

Azure Blob Storage

Use a storage account key or managed identity:

Terminal window
# Option 1: Storage account key
export AZURE_STORAGE_ACCOUNT="acmefinance"
export AZURE_STORAGE_KEY="..."
# Option 2: Connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=..."
# Option 3: Managed identity (if running on Azure)
# No env vars needed -- uses Azure Instance Metadata

DataSpoc Lens detects the provider from the bucket URI prefix (s3://, gs://, az://) and uses the corresponding credentials.
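
As an illustration of that dispatch (not the actual Lens source), the URI scheme is enough to pick a credential chain:

from urllib.parse import urlparse

# Illustrative only: maps a bucket URI's scheme to the credential
# chain that should handle it. The real Lens logic may differ.
CREDENTIAL_CHAINS = {"s3": "aws", "gs": "gcp", "az": "azure"}

def provider_for(bucket_uri: str) -> str:
    scheme = urlparse(bucket_uri).scheme
    if scheme not in CREDENTIAL_CHAINS:
        raise ValueError(f"unsupported bucket URI: {bucket_uri}")
    return CREDENTIAL_CHAINS[scheme]

assert provider_for("gs://acme-acquired-app") == "gcp"
assert provider_for("az://acme-finance/data") == "azure"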

Step 3: Cross-Cloud Queries

Now the interesting part. Query data across all three clouds with a single SQL statement:

-- Match acquired app users with production users by email
SELECT
    p.name AS production_name,
    p.email,
    p.plan AS production_plan,
    a.subscription_tier AS acquired_plan,
    a.last_active AS acquired_last_active,
    f.total_invoiced
FROM production.raw.users p
JOIN acquired.raw.app_users a
    ON LOWER(p.email) = LOWER(a.email)
LEFT JOIN (
    SELECT customer_email, SUM(amount) AS total_invoiced
    FROM finance.raw.invoices
    GROUP BY customer_email
) f ON LOWER(p.email) = LOWER(f.customer_email)
ORDER BY f.total_invoiced DESC NULLS LAST
LIMIT 20;

This query joins AWS data with GCS data and Azure data in one statement. DuckDB handles the cross-cloud reads transparently.
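
You can reproduce the mechanism in plain DuckDB, which is the engine Lens drives underneath. A minimal sketch, assuming the tables are Parquet files under the registered prefixes and that GCS HMAC keys are available (DuckDB's httpfs reaches GCS through its S3-compatible API); the paths here are illustrative:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// and gs:// readers
con.execute("INSTALL aws; LOAD aws;")        # AWS credential-chain support

# AWS: pick up credentials from the usual chain (env vars, profile, IMDS)
con.execute("CREATE SECRET aws_creds (TYPE S3, PROVIDER credential_chain);")
# GCS: httpfs speaks to GCS's S3-compatible API, so it needs HMAC keys
con.execute(
    "CREATE SECRET gcs_keys (TYPE GCS, KEY_ID 'hmac-key-id', SECRET 'hmac-secret');"
)

# Lens resolves table names to prefixes for you; done by hand it looks like:
overlap = con.execute("""
    SELECT COUNT(*) AS overlap
    FROM read_parquet('s3://acme-production/raw/users/*.parquet') p
    JOIN read_parquet('gs://acme-acquired-app/raw/app_users/*.parquet') a
      ON LOWER(p.email) = LOWER(a.email)
""").fetchone()[0]
print(f"Users on both platforms: {overlap}")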

Run it:

Terminal window
dataspoc-lens query "
    SELECT
        p.name, p.email, p.plan,
        a.subscription_tier,
        f.total_invoiced
    FROM production.raw.users p
    JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
    LEFT JOIN (
        SELECT customer_email, SUM(amount) AS total_invoiced
        FROM finance.raw.invoices
        GROUP BY customer_email
    ) f ON LOWER(p.email) = LOWER(f.customer_email)
    ORDER BY f.total_invoiced DESC NULLS LAST
    LIMIT 10
"
┌────────────────┬────────────────────┬────────────┬───────────────────┬────────────────┐
│ name           │ email              │ plan       │ subscription_tier │ total_invoiced │
├────────────────┼────────────────────┼────────────┼───────────────────┼────────────────┤
│ Acme Corp      │ admin@acme.com     │ enterprise │ premium           │     125,400.00 │
│ TechFlow Inc   │ ops@techflow.io    │ business   │ pro               │      89,200.00 │
│ Global Systems │ data@globalsys.com │ business   │ premium           │      67,800.00 │
└────────────────┴────────────────────┴────────────┴───────────────────┴────────────────┘

Step 4: Cache for Performance

Cross-cloud queries can be slow because data is read over the network from three different providers. Caching solves this:

Terminal window
# Cache the most-queried tables locally
dataspoc-lens cache refresh production.raw.users
dataspoc-lens cache refresh production.raw.orders
dataspoc-lens cache refresh acquired.raw.app_users
dataspoc-lens cache refresh finance.raw.invoices

Check cache status:

Terminal window
dataspoc-lens cache status
Table                    Status      Size    Last Refreshed
production.raw.users     cached      1.2 MB  2 min ago
production.raw.orders    cached      8.4 MB  2 min ago
acquired.raw.app_users   cached      3.1 MB  1 min ago
finance.raw.invoices     cached      5.6 MB  1 min ago
production.raw.products  stale       42 KB   3 days ago
acquired.raw.app_events  not cached  --      --

Cached queries run against local Parquet files — no network calls, no cross-cloud latency. The same cross-cloud JOIN that took 12 seconds uncached runs in 200ms cached.
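
Conceptually, the cache is just a local materialization. A sketch of the idea in plain DuckDB; the paths and layout are illustrative, not Lens's actual cache format:

import os

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

os.makedirs("cache", exist_ok=True)

# First (and only) remote read: materialize the table locally
con.execute("""
    COPY (SELECT * FROM read_parquet('s3://acme-production/raw/users/*.parquet'))
    TO 'cache/production_raw_users.parquet' (FORMAT PARQUET)
""")

# Every later query hits local disk: no egress, no cross-cloud latency
n = con.execute(
    "SELECT COUNT(*) FROM 'cache/production_raw_users.parquet'"
).fetchone()[0]
print(f"{n} users cached locally")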

Refresh stale caches in one command:

Terminal window
dataspoc-lens cache refresh-stale

Step 5: Access Control Per Bucket via IAM

DataSpoc never implements auth. Each cloud provider’s IAM controls who can access what:

AWS S3 (production):
├── IAM Policy: DataEngineers group → full access
└── IAM Policy: Analysts group → read-only
GCS (acquired):
├── IAM Binding: IntegrationTeam → read-only
└── IAM Binding: AcquiredTeam → full access
Azure (finance):
├── RBAC Role: FinanceTeam → full access
└── RBAC Role: Executives → read-only

When an analyst runs a query that touches the finance bucket, they need Azure credentials with read access. If they do not have them, the query fails with a permission error. DataSpoc does not try to work around cloud IAM — it respects it.
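
From the SDK, that permission failure surfaces as an exception. A sketch, with the caveat that the concrete exception class is an assumption; the real one is Lens- or Azure-specific:

from dataspoc_lens import LensClient

lens = LensClient()
try:
    lens.query("SELECT COUNT(*) FROM finance.raw.invoices")
except Exception as err:  # the exact error class is an assumption
    print(f"finance bucket not readable with current credentials: {err}")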

Practical Example: Unified Customer 360

Build a complete customer view across all three clouds:

-- transforms/01_customer_360.sql
CREATE OR REPLACE TABLE gold.customer_360 AS
WITH production_data AS (
    SELECT
        LOWER(email) AS email,
        name,
        plan AS production_plan,
        created_at AS production_signup,
        DATE_DIFF('day', created_at, CURRENT_DATE) AS days_as_customer
    FROM production.raw.users
),
acquired_data AS (
    SELECT
        LOWER(email) AS email,
        subscription_tier AS acquired_plan,
        last_active AS acquired_last_active,
        feature_usage_score
    FROM acquired.raw.app_users
),
finance_data AS (
    SELECT
        LOWER(customer_email) AS email,
        SUM(amount) AS total_invoiced,
        COUNT(*) AS invoice_count,
        MAX(invoice_date) AS last_invoice_date
    FROM finance.raw.invoices
    WHERE status = 'paid'
    GROUP BY 1
),
order_data AS (
    SELECT
        LOWER(u.email) AS email,
        COUNT(*) AS total_orders,
        SUM(o.amount) AS total_revenue,
        MAX(o.order_date) AS last_order_date
    FROM production.raw.orders o
    JOIN production.raw.users u ON o.user_id = u.user_id
    WHERE o.status = 'completed'
    GROUP BY 1
)
SELECT
    p.email,
    p.name,
    p.production_plan,
    p.production_signup,
    p.days_as_customer,
    a.acquired_plan,
    a.acquired_last_active,
    a.feature_usage_score,
    f.total_invoiced,
    f.invoice_count,
    o.total_orders,
    o.total_revenue,
    o.last_order_date,
    CASE
        WHEN a.email IS NOT NULL THEN 'both_products'
        ELSE 'production_only'
    END AS product_usage,
    COALESCE(o.total_revenue, 0) + COALESCE(f.total_invoiced, 0) AS total_ltv
FROM production_data p
LEFT JOIN acquired_data a ON p.email = a.email
LEFT JOIN finance_data f ON p.email = f.email
LEFT JOIN order_data o ON p.email = o.email;

Run this transform:

Terminal window
dataspoc-lens transform run --file 01_customer_360.sql

Now query the unified view:

Terminal window
dataspoc-lens query "
    SELECT product_usage, COUNT(*) AS customers, ROUND(AVG(total_ltv), 2) AS avg_ltv
    FROM gold.customer_360
    GROUP BY product_usage
"
┌─────────────────┬───────────┬──────────┐
│ product_usage   │ customers │ avg_ltv  │
├─────────────────┼───────────┼──────────┤
│ both_products   │     2,340 │ 1,245.80 │
│ production_only │    10,110 │   342.50 │
└─────────────────┴───────────┴──────────┘

Customers using both products have 3.6x higher LTV. That is a cross-sell opportunity you could only find by querying across clouds.

SDK Usage

The same multi-cloud setup works from Python:

from dataspoc_lens import LensClient
lens = LensClient()
# Tables from all clouds appear in one catalog
tables = lens.tables()
print(tables)
# ['production.raw.users', 'acquired.raw.app_users', 'finance.raw.invoices', ...]
# Cross-cloud query
result = lens.query("""
SELECT COUNT(DISTINCT p.email) AS overlap
FROM production.raw.users p
JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
""")
print(f"Users on both platforms: {result}")

Cost Considerations

Cross-cloud data transfer costs money. Here is what to expect:

Transfer                        Cost per GB
S3 to internet                  $0.09
GCS to internet                 $0.12
Azure to internet               $0.087
Local cache (after first read)  $0.00

Strategy to minimize costs:

  1. Cache aggressively: the first query reads from the cloud, every subsequent query reads from the local cache
  2. Refresh on a schedule: run dataspoc-lens cache refresh-stale once per hour (see the sketch after this list)
  3. Pre-aggregate: build gold tables that summarize cross-cloud data, and cache only the gold tables
  4. Filter early: WHERE clauses push down into Parquet scans and reduce bytes read
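
For item 2, any scheduler works. A minimal sketch that shells out to the CLI hourly; in practice a cron entry or your orchestrator is the natural home for this:

import subprocess
import time

# Refresh stale caches once per hour; assumes dataspoc-lens is on PATH
while True:
    subprocess.run(["dataspoc-lens", "cache", "refresh-stale"], check=True)
    time.sleep(3600)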

For a company with 10 GB across three clouds, refreshing the full cache daily costs roughly $30/month in egress (10 GB × 30 days × ~$0.10/GB); refreshing only the hot tables brings that down to a few dollars a month, about the price of a single Snowflake credit.

One catalog, three clouds, zero data movement. Register your buckets, write SQL, and let DuckDB handle the rest.
