One Data Lake, Three Clouds: Multi-Cloud Analytics with DataSpoc
Your company acquired a startup that runs on GCP. Your main platform is on AWS. The finance team stores exports in Azure Blob Storage. Now you need to join customer data across all three clouds for a single report.
DataSpoc Lens registers buckets from S3, GCS, and Azure as a unified catalog. DuckDB queries across all of them with standard SQL. One query, three clouds, no data movement.
The Scenario
A mid-size company with data spread across cloud providers:
| Cloud | Bucket | Data | Team |
|---|---|---|---|
| AWS S3 | s3://acme-production | Core app data (users, orders) | Engineering |
| GCS | gs://acme-acquired-app | Acquired product data | Former startup team |
| Azure | az://acme-finance | Financial exports, invoices | Finance |
Each team chose their cloud. Moving everything to one provider is a 6-month project. You need analytics now.
Step 1: Register All Buckets
```bash
# AWS S3 bucket
dataspoc-lens add-bucket s3://acme-production --name production

# Google Cloud Storage bucket
dataspoc-lens add-bucket gs://acme-acquired-app --name acquired

# Azure Blob Storage bucket
dataspoc-lens add-bucket az://acme-finance/data --name finance
```

Verify all three are connected:
```bash
dataspoc-lens tables
```
```
production.raw.users
production.raw.orders
production.raw.products
production.curated.daily_revenue
acquired.raw.app_users
acquired.raw.app_events
acquired.raw.subscriptions
finance.raw.invoices
finance.raw.payments
finance.raw.budget_forecast
```

Three clouds, one catalog.
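If you would rather script this check, the same catalog listing is available from the Python SDK covered later in this post. A minimal sketch, assuming the SDK is installed and credentials (Step 2) are already in place; it uses only `LensClient()` and `lens.tables()` as shown in the SDK Usage section:

```python
# Minimal sketch: list the unified catalog and confirm each registered bucket
# contributes at least one table. Uses only LensClient() and lens.tables()
# from the SDK Usage section below.
from dataspoc_lens import LensClient

lens = LensClient()
tables = lens.tables()

for prefix in ("production", "acquired", "finance"):
    matching = [t for t in tables if t.startswith(f"{prefix}.")]
    print(f"{prefix}: {len(matching)} tables")
    assert matching, f"bucket '{prefix}' registered but exposes no tables"
```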
Step 2: Configure Credentials for Each Provider
AWS S3
Use an AWS profile or environment variables:
```bash
# Option 1: AWS profile (recommended)
export AWS_PROFILE=production

# Option 2: Explicit credentials
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
```

Or use IAM roles if running on EC2/ECS — no credentials needed.
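If you want to sanity-check those S3 credentials before pointing Lens at the bucket, you can run the same kind of read DuckDB performs. This is an optional sketch, not part of DataSpoc Lens: it assumes a recent DuckDB with the httpfs extension, and the Parquet path is made up for illustration.

```python
# Optional sanity check: confirm the S3 credentials work in plain DuckDB,
# the engine Lens queries through. The Parquet path is hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("""
    CREATE SECRET prod_s3 (
        TYPE S3,
        KEY_ID 'AKIA...',
        SECRET '...',
        REGION 'us-east-1'
    )
""")
# If this prints a row count, the credentials and region are good.
print(con.execute(
    "SELECT count(*) FROM 's3://acme-production/users/*.parquet'"
).fetchone())
```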
Google Cloud Storage
Use a service account key or application default credentials:
```bash
# Option 1: Service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Option 2: Application default credentials (if on GCP or gcloud configured)
gcloud auth application-default login
```

Azure Blob Storage
Use a storage account key or managed identity:
```bash
# Option 1: Storage account key
export AZURE_STORAGE_ACCOUNT="acmefinance"
export AZURE_STORAGE_KEY="..."

# Option 2: Connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=..."

# Option 3: Managed identity (if running on Azure)
# No env vars needed -- uses Azure Instance Metadata
```

DataSpoc Lens detects the provider from the bucket URI prefix (s3://, gs://, az://) and uses the corresponding credentials.
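That dispatch-by-prefix is simple enough to illustrate in a few lines. The sketch below is not DataSpoc's actual code, just the idea: map the URI scheme to the credential source described above.

```python
# Illustrative only -- not DataSpoc Lens internals. Route a bucket URI to the
# credential source for its provider based on the URI scheme.
from urllib.parse import urlparse

CREDENTIAL_SOURCES = {
    "s3": "AWS profile, env vars, or IAM role",
    "gs": "GOOGLE_APPLICATION_CREDENTIALS or application default credentials",
    "az": "AZURE_STORAGE_* env vars or managed identity",
}

def credential_source(bucket_uri: str) -> str:
    scheme = urlparse(bucket_uri).scheme
    if scheme not in CREDENTIAL_SOURCES:
        raise ValueError(f"Unsupported provider prefix: {scheme}://")
    return CREDENTIAL_SOURCES[scheme]

print(credential_source("gs://acme-acquired-app"))
# GOOGLE_APPLICATION_CREDENTIALS or application default credentials
```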
Step 3: Cross-Cloud Queries
Now the interesting part. Query data across all three clouds with a single SQL statement:
```sql
-- Match acquired app users with production users by email
SELECT
    p.name AS production_name,
    p.email,
    p.plan AS production_plan,
    a.subscription_tier AS acquired_plan,
    a.last_active AS acquired_last_active,
    f.total_invoiced
FROM production.raw.users p
JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
LEFT JOIN (
    SELECT customer_email, SUM(amount) AS total_invoiced
    FROM finance.raw.invoices
    GROUP BY customer_email
) f ON LOWER(p.email) = LOWER(f.customer_email)
ORDER BY f.total_invoiced DESC NULLS LAST
LIMIT 20;
```

This query joins AWS data with GCS data and Azure data in one statement. DuckDB handles the cross-cloud reads transparently.
Run it:
```bash
dataspoc-lens query "
    SELECT p.name, p.email, p.plan, a.subscription_tier, f.total_invoiced
    FROM production.raw.users p
    JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
    LEFT JOIN (
        SELECT customer_email, SUM(amount) AS total_invoiced
        FROM finance.raw.invoices
        GROUP BY customer_email
    ) f ON LOWER(p.email) = LOWER(f.customer_email)
    ORDER BY f.total_invoiced DESC NULLS LAST
    LIMIT 10"
```
```
┌────────────────┬────────────────────┬────────────┬───────────────────┬────────────────┐
│ name           │ email              │ plan       │ subscription_tier │ total_invoiced │
├────────────────┼────────────────────┼────────────┼───────────────────┼────────────────┤
│ Acme Corp      │ admin@acme.com     │ enterprise │ premium           │ 125,400.00     │
│ TechFlow Inc   │ ops@techflow.io    │ business   │ pro               │ 89,200.00      │
│ Global Systems │ data@globalsys.com │ business   │ premium           │ 67,800.00      │
└────────────────┴────────────────────┴────────────┴───────────────────┴────────────────┘
```

Step 4: Cache for Performance
Cross-cloud queries can be slow because data is read over the network from three different providers. Caching solves this:
```bash
# Cache the most-queried tables locally
dataspoc-lens cache refresh production.raw.users
dataspoc-lens cache refresh production.raw.orders
dataspoc-lens cache refresh acquired.raw.app_users
dataspoc-lens cache refresh finance.raw.invoices
```

Check cache status:
```bash
dataspoc-lens cache status
```
```
Table                      Status       Size     Last Refreshed
production.raw.users       cached       1.2 MB   2 min ago
production.raw.orders      cached       8.4 MB   2 min ago
acquired.raw.app_users     cached       3.1 MB   1 min ago
finance.raw.invoices       cached       5.6 MB   1 min ago
production.raw.products    stale        42 KB    3 days ago
acquired.raw.app_events    not cached   --       --
```

Cached queries run against local Parquet files — no network calls, no cross-cloud latency. The same cross-cloud JOIN that took 12 seconds uncached runs in 200ms cached.
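If you want to measure the effect yourself, a rough timing harness looks like this. It is a sketch, not a benchmark: it assumes the CLI is on your PATH, that the SDK reads the same cache the CLI refreshes, and it uses only `lens.query()` from the SDK Usage section. Your numbers will depend on table sizes and network.

```python
# Rough before/after timing of the same cross-cloud join. Assumes the
# dataspoc-lens CLI is on PATH; lens.query() is from the SDK section below.
import subprocess
import time
from dataspoc_lens import LensClient

lens = LensClient()
sql = """
    SELECT COUNT(*)
    FROM production.raw.users p
    JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
"""

start = time.perf_counter()
lens.query(sql)                    # first run: reads S3 + GCS over the network
print(f"uncached: {time.perf_counter() - start:.1f}s")

for table in ("production.raw.users", "acquired.raw.app_users"):
    subprocess.run(["dataspoc-lens", "cache", "refresh", table], check=True)

start = time.perf_counter()
lens.query(sql)                    # second run: local Parquet cache
print(f"cached:   {time.perf_counter() - start:.2f}s")
```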
Refresh stale caches in one command:
```bash
dataspoc-lens cache refresh-stale
```

Step 5: Access Control Per Bucket via IAM
DataSpoc never implements its own authentication or authorization. Each cloud provider's IAM controls who can access what:
```
AWS S3 (production):
  └── IAM Policy: DataEngineers group → full access
  └── IAM Policy: Analysts group → read-only

GCS (acquired):
  └── IAM Policy: IntegrationTeam → read-only
  └── IAM Policy: AcquiredTeam → full access

Azure (finance):
  └── IAM Policy: FinanceTeam → full access
  └── IAM Policy: Executives → read-only
```

When an analyst runs a query that touches the finance bucket, they need Azure credentials with read access. If they do not have them, the query fails with a permission error. DataSpoc does not try to work around cloud IAM — it respects it.
Practical Example: Unified Customer 360
Build a complete customer view across all three clouds:
```sql
-- transforms/01_customer_360.sql
CREATE OR REPLACE TABLE gold.customer_360 AS
WITH production_data AS (
    SELECT
        LOWER(email) AS email,
        name,
        plan AS production_plan,
        created_at AS production_signup,
        DATE_DIFF('day', created_at, CURRENT_DATE) AS days_as_customer
    FROM production.raw.users
),
acquired_data AS (
    SELECT
        LOWER(email) AS email,
        subscription_tier AS acquired_plan,
        last_active AS acquired_last_active,
        feature_usage_score
    FROM acquired.raw.app_users
),
finance_data AS (
    SELECT
        LOWER(customer_email) AS email,
        SUM(amount) AS total_invoiced,
        COUNT(*) AS invoice_count,
        MAX(invoice_date) AS last_invoice_date
    FROM finance.raw.invoices
    WHERE status = 'paid'
    GROUP BY 1
),
order_data AS (
    SELECT
        LOWER(u.email) AS email,
        COUNT(*) AS total_orders,
        SUM(o.amount) AS total_revenue,
        MAX(o.order_date) AS last_order_date
    FROM production.raw.orders o
    JOIN production.raw.users u ON o.user_id = u.user_id
    WHERE o.status = 'completed'
    GROUP BY 1
)
SELECT
    p.email,
    p.name,
    p.production_plan,
    p.production_signup,
    p.days_as_customer,
    a.acquired_plan,
    a.acquired_last_active,
    a.feature_usage_score,
    f.total_invoiced,
    f.invoice_count,
    o.total_orders,
    o.total_revenue,
    o.last_order_date,
    CASE WHEN a.email IS NOT NULL THEN 'both_products' ELSE 'production_only' END AS product_usage,
    COALESCE(o.total_revenue, 0) + COALESCE(f.total_invoiced, 0) AS total_ltv
FROM production_data p
LEFT JOIN acquired_data a ON p.email = a.email
LEFT JOIN finance_data f ON p.email = f.email
LEFT JOIN order_data o ON p.email = o.email;
```

Run this transform:
```bash
dataspoc-lens transform run --file 01_customer_360.sql
```

Now query the unified view:
```bash
dataspoc-lens query "
    SELECT
        product_usage,
        COUNT(*) as customers,
        ROUND(AVG(total_ltv), 2) as avg_ltv
    FROM gold.customer_360
    GROUP BY product_usage"
```
```
┌─────────────────┬───────────┬──────────┐
│ product_usage   │ customers │ avg_ltv  │
├─────────────────┼───────────┼──────────┤
│ both_products   │ 2,340     │ 1,245.80 │
│ production_only │ 10,110    │ 342.50   │
└─────────────────┴───────────┴──────────┘
```

Customers using both products have 3.6x higher LTV. That is a cross-sell opportunity you could only find by querying across clouds.
SDK Usage
The same multi-cloud setup works from Python:
```python
from dataspoc_lens import LensClient

lens = LensClient()

# Tables from all clouds appear in one catalog
tables = lens.tables()
print(tables)
# ['production.raw.users', 'acquired.raw.app_users', 'finance.raw.invoices', ...]

# Cross-cloud query
result = lens.query("""
    SELECT COUNT(DISTINCT p.email) AS overlap
    FROM production.raw.users p
    JOIN acquired.raw.app_users a ON LOWER(p.email) = LOWER(a.email)
""")
print(f"Users on both platforms: {result}")
```

Cost Considerations
Cross-cloud data transfer costs money. Here is what to expect:
| Transfer | Cost per GB |
|---|---|
| S3 to internet | $0.09 |
| GCS to internet | $0.12 |
| Azure to internet | $0.087 |
| Local cache (after first read) | $0.00 |
Strategies to minimize cost:
- Cache aggressively — the first query reads from the cloud, subsequent queries read from the local cache
- Refresh the cache on a schedule — run `dataspoc-lens cache refresh-stale` once per hour
- Pre-aggregate — build gold tables that summarize cross-cloud data, and cache only the gold tables
- Filter early — `WHERE` clauses are pushed down to Parquet and reduce the bytes read
For a company with 10 GB spread across three clouds, a full cache refresh costs roughly $1 in egress at the rates above, or about $30/month if you refresh daily. Each refresh costs less than a single Snowflake credit.
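The arithmetic behind that estimate, using the egress rates from the table above (the per-bucket split is made up for illustration):

```python
# Back-of-the-envelope egress cost for a daily full cache refresh.
# Rates come from the table above; bucket sizes are assumed for illustration.
RATE_PER_GB = {"s3": 0.09, "gcs": 0.12, "azure": 0.087}
SIZE_GB = {"s3": 5, "gcs": 3, "azure": 2}    # ~10 GB total across three clouds

per_refresh = sum(SIZE_GB[cloud] * RATE_PER_GB[cloud] for cloud in RATE_PER_GB)
print(f"per full refresh:    ${per_refresh:.2f}")        # ~$0.98
print(f"daily, over 30 days: ${per_refresh * 30:.2f}")   # ~$29.52
```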
One catalog, three clouds, zero data movement. Register your buckets, write SQL, and let DuckDB handle the rest.