multi-cloud · s3 · gcs · azure · data-lake

One Data Lake, Three Clouds: Multi-Cloud Analytics with DataSpoc

Michael San Martim · 2026-04-28

Your company acquired a startup that runs on GCP. Your main platform is on AWS. The finance team stores exports in Azure Blob Storage. Now you need to join customer data across all three clouds for a single report.

DataSpoc Lens registers buckets from S3, GCS, and Azure Blob Storage in a single unified catalog. DuckDB queries across all of them with standard SQL. One query, three clouds, no data movement.

The Scenario

A mid-size company with data spread across cloud providers:

Cloud    Bucket                   Data                            Team
AWS S3   s3://acme-production     Core app data (users, orders)   Engineering
GCS      gs://acme-acquired-app   Acquired product data           Former startup team
Azure    az://acme-finance        Financial exports, invoices     Finance

Each team chose their cloud. Moving everything to one provider is a 6-month project. You need analytics now.

Step 1: Register All Buckets

Terminal window
# AWS S3 bucket
dataspoc-lens add-bucket s3://acme-production --name production
# Google Cloud Storage bucket
dataspoc-lens add-bucket gs://acme-acquired-app --name acquired
# Azure Blob Storage bucket
dataspoc-lens add-bucket az://acme-finance/data --name finance

Verify all three are connected:

Terminal window
dataspoc-lens tables
production.raw.users
production.raw.orders
production.raw.products
production.curated.daily_revenue
acquired.raw.app_users
acquired.raw.app_events
acquired.raw.subscriptions
finance.raw.invoices
finance.raw.payments
finance.raw.budget_forecast

Three clouds, one catalog.
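
The same catalog is visible from the Python SDK (covered in more detail below) via lens.tables(). A quick sketch that groups tables by source bucket; the grouping logic itself is just illustration:

from collections import defaultdict

from dataspoc_lens import LensClient

lens = LensClient()

# Group the unified catalog by source bucket: the first dotted
# component of each table name is the bucket alias we registered.
by_bucket = defaultdict(list)
for table in lens.tables():  # e.g. 'production.raw.users'
    by_bucket[table.split(".", 1)[0]].append(table)

for bucket, tables in sorted(by_bucket.items()):
    print(f"{bucket}: {len(tables)} tables")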

Step 2: Configure Credentials for Each Provider

AWS S3

Use an AWS profile or environment variables:

Terminal window
# Option 1: AWS profile (recommended)
export AWS_PROFILE=production
# Option 2: Explicit credentials
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"

Or use IAM roles if running on EC2/ECS — no credentials needed.

Google Cloud Storage

Use a service account key or application default credentials:

Terminal window
# Option 1: Service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# Option 2: Application default credentials (if on GCP or gcloud configured)
gcloud auth application-default login

Azure Blob Storage

Use a storage account key or managed identity:

Terminal window
# Option 1: Storage account key
export AZURE_STORAGE_ACCOUNT="acmefinance"
export AZURE_STORAGE_KEY="..."
# Option 2: Connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=..."
# Option 3: Managed identity (if running on Azure)
# No env vars needed -- uses Azure Instance Metadata

DataSpoc Lens detects the provider from the bucket URI prefix (s3://, gs://, az://) and uses the corresponding credentials.
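
As an illustration of that dispatch (not the actual Lens source), the URI scheme is enough to pick a credential chain:

from urllib.parse import urlparse

# Illustrative only: maps a bucket URI's scheme to the credential
# chain that should handle it. The real Lens logic may differ.
CREDENTIAL_CHAINS = {"s3": "aws", "gs": "gcp", "az": "azure"}

def provider_for(bucket_uri: str) -> str:
    scheme = urlparse(bucket_uri).scheme
    if scheme not in CREDENTIAL_CHAINS:
        raise ValueError(f"unsupported bucket URI: {bucket_uri}")
    return CREDENTIAL_CHAINS[scheme]

assert provider_for("gs://acme-acquired-app") == "gcp"
assert provider_for("az://acme-finance/data") == "azure"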

Step 3: Cross-Cloud Queries

Now the interesting part. Query data across all three clouds with a single SQL statement:

-- Match acquired app users with production users by email
SELECT
    p.name AS production_name,
    p.email,
    p.plan AS production_plan,
    a.subscription_tier AS acquired_plan,
    a.last_active AS acquired_last_active,
    f.total_invoiced
FROM production.raw.users p
JOIN acquired.raw.app_users a
    ON LOWER(p.email) = LOWER(a.email)
LEFT JOIN (
    SELECT customer_email, SUM(amount) AS total_invoiced
    FROM finance.raw.invoices
    GROUP BY customer_email
) f ON LOWER(p.email) = LOWER(f.customer_email)
ORDER BY f.total_invoiced DESC NULLS LAST
LIMIT 20;

This query joins AWS data with GCS data and Azure data in one statement. DuckDB handles the cross-cloud reads transparently.
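
You can reproduce the mechanism in plain DuckDB, which is the engine Lens drives underneath. A minimal sketch, assuming the tables are Parquet files under the registered prefixes and that GCS HMAC keys are available (DuckDB's httpfs reaches GCS through its S3-compatible API); the paths here are illustrative:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// and gs:// readers
con.execute("INSTALL aws; LOAD aws;")        # AWS credential-chain support

# AWS: pick up credentials from the usual chain (env vars, profile, IMDS)
con.execute("CREATE SECRET aws_creds (TYPE S3, PROVIDER credential_chain);")
# GCS: httpfs speaks to GCS's S3-compatible API, so it needs HMAC keys
con.execute(
    "CREATE SECRET gcs_keys (TYPE GCS, KEY_ID 'hmac-key-id', SECRET 'hmac-secret');"
)

# Lens resolves table names to prefixes for you; done by hand it looks like:
overlap = con.execute("""
    SELECT COUNT(*) AS overlap
    FROM read_parquet('s3://acme-production/raw/users/*.parquet') p
    JOIN read_parquet('gs://acme-acquired-app/raw/app_users/*.parquet') a
      ON LOWER(p.email) = LOWER(a.email)
""").fetchone()[0]
print(f"Users on both platforms: {overlap}")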

Run it:

Terminal window
dataspoc-lens query "
    SELECT
        p.name, p.email, p.plan,
        a.subscription_tier,
        f.total_invoiced
    FROM production.raw.users p
    JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
    LEFT JOIN (
        SELECT customer_email, SUM(amount) AS total_invoiced
        FROM finance.raw.invoices
        GROUP BY customer_email
    ) f ON LOWER(p.email) = LOWER(f.customer_email)
    ORDER BY f.total_invoiced DESC NULLS LAST
    LIMIT 10
"
┌────────────────┬────────────────────┬────────────┬───────────────────┬────────────────┐
│ name           │ email              │ plan       │ subscription_tier │ total_invoiced │
├────────────────┼────────────────────┼────────────┼───────────────────┼────────────────┤
│ Acme Corp      │ admin@acme.com     │ enterprise │ premium           │     125,400.00 │
│ TechFlow Inc   │ ops@techflow.io    │ business   │ pro               │      89,200.00 │
│ Global Systems │ data@globalsys.com │ business   │ premium           │      67,800.00 │
└────────────────┴────────────────────┴────────────┴───────────────────┴────────────────┘

Step 4: Cache for Performance

Cross-cloud queries can be slow because data is read over the network from three different providers. Caching solves this:

Terminal window
# Cache the most-queried tables locally
dataspoc-lens cache refresh production.raw.users
dataspoc-lens cache refresh production.raw.orders
dataspoc-lens cache refresh acquired.raw.app_users
dataspoc-lens cache refresh finance.raw.invoices

Check cache status:

Terminal window
dataspoc-lens cache status
Table                    Status      Size    Last Refreshed
production.raw.users     cached      1.2 MB  2 min ago
production.raw.orders    cached      8.4 MB  2 min ago
acquired.raw.app_users   cached      3.1 MB  1 min ago
finance.raw.invoices     cached      5.6 MB  1 min ago
production.raw.products  stale       42 KB   3 days ago
acquired.raw.app_events  not cached  --      --

Cached queries run against local Parquet files — no network calls, no cross-cloud latency. The same cross-cloud JOIN that took 12 seconds uncached runs in 200ms cached.
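
Conceptually, the cache is just a local materialization. A sketch of the idea in plain DuckDB; the paths and layout are illustrative, not Lens's actual cache format:

import os

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

os.makedirs("cache", exist_ok=True)

# First (and only) remote read: materialize the table locally
con.execute("""
    COPY (SELECT * FROM read_parquet('s3://acme-production/raw/users/*.parquet'))
    TO 'cache/production_raw_users.parquet' (FORMAT PARQUET)
""")

# Every later query hits local disk: no egress, no cross-cloud latency
n = con.execute(
    "SELECT COUNT(*) FROM 'cache/production_raw_users.parquet'"
).fetchone()[0]
print(f"{n} users cached locally")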

Refresh stale caches in one command:

Terminal window
dataspoc-lens cache refresh-stale

Step 5: Access Control Per Bucket via IAM

DataSpoc never implements auth. Each cloud provider’s IAM controls who can access what:

AWS S3 (production):
├── IAM Policy: DataEngineers group → full access
└── IAM Policy: Analysts group → read-only
GCS (acquired):
├── IAM Binding: IntegrationTeam → read-only
└── IAM Binding: AcquiredTeam → full access
Azure (finance):
├── RBAC Role: FinanceTeam → full access
└── RBAC Role: Executives → read-only

When an analyst runs a query that touches the finance bucket, they need Azure credentials with read access. If they do not have them, the query fails with a permission error. DataSpoc does not try to work around cloud IAM — it respects it.
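
From the SDK, that permission failure surfaces as an exception. A sketch, with the caveat that the concrete exception class is an assumption; the real one is Lens- or Azure-specific:

from dataspoc_lens import LensClient

lens = LensClient()
try:
    lens.query("SELECT COUNT(*) FROM finance.raw.invoices")
except Exception as err:  # the exact error class is an assumption
    print(f"finance bucket not readable with current credentials: {err}")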

Practical Example: Unified Customer 360

Build a complete customer view across all three clouds:

-- transforms/01_customer_360.sql
CREATE OR REPLACE TABLE gold.customer_360 AS
WITH production_data AS (
    SELECT
        LOWER(email) AS email,
        name,
        plan AS production_plan,
        created_at AS production_signup,
        DATE_DIFF('day', created_at, CURRENT_DATE) AS days_as_customer
    FROM production.raw.users
),
acquired_data AS (
    SELECT
        LOWER(email) AS email,
        subscription_tier AS acquired_plan,
        last_active AS acquired_last_active,
        feature_usage_score
    FROM acquired.raw.app_users
),
finance_data AS (
    SELECT
        LOWER(customer_email) AS email,
        SUM(amount) AS total_invoiced,
        COUNT(*) AS invoice_count,
        MAX(invoice_date) AS last_invoice_date
    FROM finance.raw.invoices
    WHERE status = 'paid'
    GROUP BY 1
),
order_data AS (
    SELECT
        LOWER(u.email) AS email,
        COUNT(*) AS total_orders,
        SUM(o.amount) AS total_revenue,
        MAX(o.order_date) AS last_order_date
    FROM production.raw.orders o
    JOIN production.raw.users u ON o.user_id = u.user_id
    WHERE o.status = 'completed'
    GROUP BY 1
)
SELECT
    p.email,
    p.name,
    p.production_plan,
    p.production_signup,
    p.days_as_customer,
    a.acquired_plan,
    a.acquired_last_active,
    a.feature_usage_score,
    f.total_invoiced,
    f.invoice_count,
    o.total_orders,
    o.total_revenue,
    o.last_order_date,
    CASE
        WHEN a.email IS NOT NULL THEN 'both_products'
        ELSE 'production_only'
    END AS product_usage,
    COALESCE(o.total_revenue, 0) + COALESCE(f.total_invoiced, 0) AS total_ltv
FROM production_data p
LEFT JOIN acquired_data a ON p.email = a.email
LEFT JOIN finance_data f ON p.email = f.email
LEFT JOIN order_data o ON p.email = o.email;

Run this transform:

Terminal window
dataspoc-lens transform run --file 01_customer_360.sql

Now query the unified view:

Terminal window
dataspoc-lens query "
    SELECT product_usage, COUNT(*) AS customers, ROUND(AVG(total_ltv), 2) AS avg_ltv
    FROM gold.customer_360
    GROUP BY product_usage
"
┌─────────────────┬───────────┬──────────┐
│ product_usage   │ customers │ avg_ltv  │
├─────────────────┼───────────┼──────────┤
│ both_products   │     2,340 │ 1,245.80 │
│ production_only │    10,110 │   342.50 │
└─────────────────┴───────────┴──────────┘

Customers using both products have 3.6x higher LTV. That is a cross-sell opportunity you could only find by querying across clouds.

SDK Usage

The same multi-cloud setup works from Python:

from dataspoc_lens import LensClient
lens = LensClient()
# Tables from all clouds appear in one catalog
tables = lens.tables()
print(tables)
# ['production.raw.users', 'acquired.raw.app_users', 'finance.raw.invoices', ...]
# Cross-cloud query
result = lens.query("""
SELECT COUNT(DISTINCT p.email) AS overlap
FROM production.raw.users p
JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
""")
print(f"Users on both platforms: {result}")

Cost Considerations

Cross-cloud data transfer costs money. Here is what to expect:

Transfer                        Cost per GB
S3 to internet                  $0.09
GCS to internet                 $0.12
Azure to internet               $0.087
Local cache (after first read)  $0.00

Strategy to minimize costs:

  1. Cache aggressively: the first query reads from the cloud, every subsequent query reads from the local cache
  2. Refresh on a schedule: run dataspoc-lens cache refresh-stale once per hour (see the sketch after this list)
  3. Pre-aggregate: build gold tables that summarize cross-cloud data, and cache only the gold tables
  4. Filter early: WHERE clauses push down into Parquet scans and reduce bytes read
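
For item 2, any scheduler works. A minimal sketch that shells out to the CLI hourly; in practice a cron entry or your orchestrator is the natural home for this:

import subprocess
import time

# Refresh stale caches once per hour; assumes dataspoc-lens is on PATH
while True:
    subprocess.run(["dataspoc-lens", "cache", "refresh-stale"], check=True)
    time.sleep(3600)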

For a company with 10 GB across three clouds, refreshing the full cache daily costs roughly $30/month in egress (10 GB × 30 days × ~$0.10/GB); refreshing only the hot tables brings that down to a few dollars a month, about the price of a single Snowflake credit.

One catalog, three clouds, zero data movement. Register your buckets, write SQL, and let DuckDB handle the rest.
