Data Mesh Without the Complexity: One Bucket Per Team
Data mesh sounds great in theory: decentralized ownership, domain-driven data, self-serve infrastructure. In practice, it usually means 18 months of “platform team” work building a self-serve portal that nobody uses.
Here’s the dirty secret: you don’t need a platform to do data mesh. You need a bucket per team and a pip install.
The data mesh promise (and the usual failure)
The promise:
- Each team owns their data
- No central bottleneck
- Self-serve infrastructure
- Domain-driven design
The usual implementation:
- 6 months building a “data platform” with Terraform, Kubernetes, and Airflow
- A “platform team” of 5 people maintaining it
- Teams still can’t onboard without filing a ticket
- $200k/year in infrastructure before anyone queries anything
The DataSpoc implementation:
- Each team runs pip install dataspoc-pipe dataspoc-lens
- Each team creates their own S3 bucket
- Each team manages their own pipelines
- Total cost: $0 + S3 storage (~$5/month per team)
- Setup time: 30 minutes per team
The architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Company Cloud (AWS/GCS/Azure)                                       │
│                                                                     │
│ ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐ │
│ │ s3://finance      │  │ s3://product      │  │ s3://marketing    │ │
│ │                   │  │                   │  │                   │ │
│ │ Pipe → Parquet    │  │ Pipe → Parquet    │  │ Pipe → Parquet    │ │
│ │ Lens → SQL/AI     │  │ Lens → SQL/AI     │  │ Lens → SQL/AI     │ │
│ │ ML → predictions  │  │                   │  │                   │ │
│ │                   │  │                   │  │                   │ │
│ │ IAM: finance-team │  │ IAM: product-team │  │ IAM: mkt-team     │ │
│ └───────────────────┘  └───────────────────┘  └───────────────────┘ │
│                                                                     │
│ Cross-domain analyst: registers all 3 buckets in Lens               │
│ AI agent: connects via MCP, scoped to team's bucket                 │
└─────────────────────────────────────────────────────────────────────┘
```
Each team is a self-contained data platform:
- Their own bucket (isolated data)
- Their own Pipe config (their sources, their schedule)
- Their own Lens config (their queries, their transforms)
- Their own AI agent (MCP scoped to their bucket)
No shared infrastructure. No central team. No tickets.
Setting up the Finance team (example)
Step 1: Create the bucket and IAM
```bash
# AWS (or equivalent for GCS/Azure)
aws s3 mb s3://company-finance
aws iam create-policy --policy-name finance-data-access --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
  }]
}'
```
Step 2: Install and configure Pipe
```bash
pip install dataspoc-pipe[s3]
dataspoc-pipe init
```
Pipeline: Postgres (ERP) → Finance bucket
```yaml
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/erp.json
  streams:
    - invoices
    - payments
    - accounts_receivable

destination:
  bucket: s3://company-finance
  path: raw
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "0 6 * * *"  # daily at 6am
```
Pipeline: Stripe → Finance bucket
```yaml
source:
  tap: tap-stripe
  config: ~/.dataspoc-pipe/sources/stripe.json
  streams:
    - charges
    - refunds
    - subscriptions

destination:
  bucket: s3://company-finance
  path: raw
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "0 7 * * *"
```
```bash
dataspoc-pipe run --all
dataspoc-pipe schedule install
```
Step 3: Install and configure Lens
```bash
pip install dataspoc-lens[s3,ai,mcp]
dataspoc-lens init
dataspoc-lens add-bucket s3://company-finance
```
Step 4: Create domain transforms
```sql
-- ~/.dataspoc-lens/transforms/001_monthly_revenue.sql
CREATE OR REPLACE TABLE monthly_revenue AS
SELECT
  DATE_TRUNC('month', created_at) AS month,
  SUM(amount) AS revenue,
  COUNT(*) AS transactions,
  SUM(CASE WHEN type = 'refund' THEN amount ELSE 0 END) AS refunds
FROM charges
GROUP BY 1
ORDER BY 1;

-- ~/.dataspoc-lens/transforms/002_accounts_aging.sql
-- Buckets are exclusive (61-90, 31-60, <=30) so no invoice is counted twice
CREATE OR REPLACE TABLE accounts_aging AS
SELECT
  customer_id,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) > 90 THEN amount ELSE 0 END) AS over_90,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) BETWEEN 61 AND 90 THEN amount ELSE 0 END) AS days_60_90,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) BETWEEN 31 AND 60 THEN amount ELSE 0 END) AS days_30_60,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) <= 30 THEN amount ELSE 0 END) AS current_amount
FROM accounts_receivable
WHERE status = 'open'
GROUP BY 1;
```
```bash
dataspoc-lens transform run
```
Step 5: Connect the team’s AI agent
```bash
dataspoc-lens mcp
```
Claude Desktop config for the finance team:
```json
{
  "mcpServers": {
    "finance-data": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}
```
Now the CFO’s AI agent can ask:
"What's our monthly revenue trend?""Which customers have overdue payments over $10k?""What's our net retention rate this quarter?"Every answer comes from real SQL on the finance team’s data. No access to other teams’ buckets.
Setting up the Product team (same pattern, different data)
```bash
pip install dataspoc-pipe[s3] dataspoc-lens[s3,ai]
dataspoc-pipe init
```
```yaml
# Product team sources: Postgres (app DB) + Mixpanel events
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/app-db.json
  streams:
    - users
    - subscriptions
    - feature_flags

destination:
  bucket: s3://company-product
  path: raw
  compression: zstd

incremental:
  enabled: true
```
```bash
dataspoc-lens add-bucket s3://company-product
dataspoc-lens ask "how many users signed up last week?"
```
The product team has zero visibility into finance data. Finance has zero visibility into product data. IAM enforces the boundaries, not application code.
Cross-domain analytics
The head of analytics needs to see across teams. Simple — register multiple buckets:
```bash
dataspoc-lens add-bucket s3://company-finance
dataspoc-lens add-bucket s3://company-product
dataspoc-lens add-bucket s3://company-marketing
```
Now they can JOIN across domains:
```sql
SELECT
  p.user_id,
  p.plan,
  f.lifetime_value,
  m.acquisition_channel
FROM product_users p
JOIN finance_customer_360 f ON p.user_id = f.customer_id
JOIN marketing_attribution m ON p.user_id = m.user_id
WHERE p.plan = 'enterprise';
```
This only works if the analyst’s IAM role has read access to all three buckets. The access model is:
| Role | Buckets | Sees |
|---|---|---|
| Finance analyst | s3://company-finance | Finance data only |
| Product PM | s3://company-product | Product data only |
| Head of Analytics | All 3 | Cross-domain JOINs |
| CEO AI agent | All 3 | Everything via MCP |
| Team-specific agent | Team’s bucket only | Scoped data |
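As a sketch, the Head of Analytics role might carry a read-only policy like the following — bucket names match the examples above; if the buckets use customer-managed KMS keys, decrypt permissions would also be needed. Note the absence of s3:PutObject: cross-domain readers can query but not write.

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*",
      "arn:aws:s3:::company-product", "arn:aws:s3:::company-product/*",
      "arn:aws:s3:::company-marketing", "arn:aws:s3:::company-marketing/*"
    ]
  }]
}
```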
Data mesh principles mapped to DataSpoc
| Principle | How DataSpoc implements it |
|---|---|
| Domain ownership | Each team owns their bucket, their pipelines, their transforms |
| Data as a product | manifest.json catalogs what’s available. Curated/gold layers are the “product” |
| Self-serve platform | pip install — no tickets, no platform team |
| Federated governance | Cloud IAM at bucket level. No application-level auth |
| Interoperability | Same format (Parquet), same convention (bucket structure), same tools |
| Discoverability | dataspoc-lens catalog shows all tables in your registered buckets |
The data contract: bucket convention
Teams agree on one thing — the bucket structure:
```
s3://team-bucket/
  .dataspoc/manifest.json              # What tables exist (auto-generated)
  raw/<source>/<table>/*.parquet       # Raw ingested data
  curated/<domain>/<table>/*.parquet   # Cleaned data
  gold/<domain>/<table>/*.parquet      # Business-ready aggregations
```
This is the only contract between teams. If team A wants to share data with team B, they grant read IAM on their bucket. Team B registers it with dataspoc-lens add-bucket. Done.
No API to build. No data catalog to maintain. No governance meetings. The manifest IS the catalog.
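The manifest is auto-generated, so its exact schema belongs to the tools — but conceptually it is a small JSON file along these lines (field names here are illustrative, not the actual format):

```json
{
  "bucket": "s3://company-finance",
  "generated_at": "2025-01-15T06:10:00Z",
  "tables": [
    {"name": "charges", "layer": "raw", "path": "raw/stripe/charges/", "format": "parquet"},
    {"name": "monthly_revenue", "layer": "gold", "path": "gold/finance/monthly_revenue/", "format": "parquet"}
  ]
}
```

Anyone with read access to the bucket can discover what's in it by reading one file — no catalog service to run.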
Scaling: from 1 team to 20
```
Team 1:  pip install → bucket → pipe → lens → agent (30 min)
Team 2:  pip install → bucket → pipe → lens → agent (30 min)
Team 3:  pip install → bucket → pipe → lens → agent (30 min)
...
Team 20: pip install → bucket → pipe → lens → agent (30 min)
```
Each team is independent. Adding team 20 doesn’t affect teams 1-19. No shared Airflow. No shared warehouse. No shared compute.
What a “platform team” does in this model:
- Creates buckets and IAM policies for new teams (5 min per team)
- Maintains the cross-domain analyst access
- Helps teams with their first pipeline setup (Services offering)
- That’s it. No Kubernetes. No Terraform. No 5-person team.
Cost per team
| Item | Cost |
|---|---|
| DataSpoc Pipe | $0 (open source) |
| DataSpoc Lens | $0 (open source) |
| S3 storage (50GB) | ~$1.15/month |
| S3 requests | ~$2/month |
| Total per team | ~$3-5/month |
For a 10-team company: $30-50/month total. Compare with Databricks ($30k-100k/year) or Snowflake ($24k-120k/year).
When this doesn’t work
Be honest:
- Petabyte scale per team — DuckDB can’t handle it. Need Spark/Trino.
- Real-time requirements — DataSpoc is batch. Need Kafka + Flink.
- Heavy governance/compliance — Need a real data catalog (DataHub, Atlan). Manifest.json is minimal.
- 100+ data sources per team — Managing many Singer taps can get complex. Consider Meltano.
- Team has zero technical skills — Need CLI comfort. If they can’t pip install, use Fivetran + Looker.
Try it
Set up your first domain in 10 minutes:
```bash
pip install dataspoc-pipe[s3] dataspoc-lens[s3,mcp]

# Create your domain
dataspoc-pipe init
dataspoc-pipe add my-source
dataspoc-pipe run my-source

# Query your domain
dataspoc-lens init
dataspoc-lens add-bucket s3://my-team-data
dataspoc-lens shell

# Connect your team's AI agent
dataspoc-lens mcp
```
Data mesh isn’t a product you buy. It’s a pattern you follow. DataSpoc makes the pattern trivial: one bucket per team, pip install, done.