
Data Mesh Without the Complexity: One Bucket Per Team

Michael San Martim · 2026-04-29

Data mesh sounds great in theory: decentralized ownership, domain-driven data, self-serve infrastructure. In practice, it usually means 18 months of “platform team” work building a self-serve portal that nobody uses.

Here’s the dirty secret: you don’t need a platform to do data mesh. You need a bucket per team and a pip install.

The data mesh promise (and the usual failure)

The promise:

  • Each team owns their data
  • No central bottleneck
  • Self-serve infrastructure
  • Domain-driven design

The usual implementation:

  • 6 months building a “data platform” with Terraform, Kubernetes, and Airflow
  • A “platform team” of 5 people maintaining it
  • Teams still can’t onboard without filing a ticket
  • $200k/year in infrastructure before anyone queries anything

The DataSpoc implementation:

  • Each team runs pip install dataspoc-pipe dataspoc-lens
  • Each team creates their own S3 bucket
  • Each team manages their own pipelines
  • Total cost: $0 + S3 storage (~$5/month per team)
  • Setup time: 30 minutes per team

The architecture

┌─────────────────────────────────────────────────────────────────────┐
│ Company Cloud (AWS/GCS/Azure)                                       │
│                                                                     │
│ ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐ │
│ │ s3://finance      │  │ s3://product      │  │ s3://marketing    │ │
│ │                   │  │                   │  │                   │ │
│ │ Pipe → Parquet    │  │ Pipe → Parquet    │  │ Pipe → Parquet    │ │
│ │ Lens → SQL/AI     │  │ Lens → SQL/AI     │  │ Lens → SQL/AI     │ │
│ │ ML → predictions  │  │                   │  │                   │ │
│ │                   │  │                   │  │                   │ │
│ │ IAM: finance-team │  │ IAM: product-team │  │ IAM: mkt-team     │ │
│ └───────────────────┘  └───────────────────┘  └───────────────────┘ │
│                                                                     │
│ Cross-domain analyst: registers all 3 buckets in Lens               │
│ AI agent: connects via MCP, scoped to team's bucket                 │
└─────────────────────────────────────────────────────────────────────┘

Each team is a self-contained data platform:

  • Their own bucket (isolated data)
  • Their own Pipe config (their sources, their schedule)
  • Their own Lens config (their queries, their transforms)
  • Their own AI agent (MCP scoped to their bucket)

No shared infrastructure. No central team. No tickets.

Setting up the Finance team (example)

Step 1: Create the bucket and IAM

# AWS (or equivalent for GCS/Azure)
aws s3 mb s3://company-finance
aws iam create-policy --policy-name finance-data-access --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
  }]
}'

Step 2: Install and configure Pipe

pip install dataspoc-pipe[s3]
dataspoc-pipe init

Pipeline: Postgres (ERP) → Finance bucket

~/.dataspoc-pipe/pipelines/erp-finance.yaml
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/erp.json
  streams:
    - invoices
    - payments
    - accounts_receivable
destination:
  bucket: s3://company-finance
  path: raw
  compression: zstd
incremental:
  enabled: true
schedule:
  cron: "0 6 * * *"  # daily at 6am

Pipeline: Stripe → Finance bucket

~/.dataspoc-pipe/pipelines/stripe-finance.yaml
source:
  tap: tap-stripe
  config: ~/.dataspoc-pipe/sources/stripe.json
  streams:
    - charges
    - refunds
    - subscriptions
destination:
  bucket: s3://company-finance
  path: raw
  compression: zstd
incremental:
  enabled: true
schedule:
  cron: "0 7 * * *"
dataspoc-pipe run --all
dataspoc-pipe schedule install

Step 3: Install and configure Lens

pip install dataspoc-lens[s3,ai,mcp]
dataspoc-lens init
dataspoc-lens add-bucket s3://company-finance

Step 4: Create domain transforms

-- ~/.dataspoc-lens/transforms/001_monthly_revenue.sql
CREATE OR REPLACE TABLE monthly_revenue AS
SELECT
  DATE_TRUNC('month', created_at) AS month,
  SUM(amount) AS revenue,
  COUNT(*) AS transactions,
  SUM(CASE WHEN type = 'refund' THEN amount ELSE 0 END) AS refunds
FROM charges
GROUP BY 1
ORDER BY 1;

-- ~/.dataspoc-lens/transforms/002_accounts_aging.sql
-- Aging buckets are non-overlapping: 0-30, 31-60, 61-90, 90+ days past due.
CREATE OR REPLACE TABLE accounts_aging AS
SELECT
  customer_id,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) > 90 THEN amount ELSE 0 END) AS over_90,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) BETWEEN 61 AND 90 THEN amount ELSE 0 END) AS days_60_90,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) BETWEEN 31 AND 60 THEN amount ELSE 0 END) AS days_30_60,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) <= 30 THEN amount ELSE 0 END) AS current_amount
FROM accounts_receivable
WHERE status = 'open'
GROUP BY 1;
dataspoc-lens transform run

Step 5: Connect the team’s AI agent

dataspoc-lens mcp

Claude Desktop config for the finance team:

{
  "mcpServers": {
    "finance-data": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}

Now the CFO’s AI agent can ask:

"What's our monthly revenue trend?"
"Which customers have overdue payments over $10k?"
"What's our net retention rate this quarter?"

Every answer comes from real SQL on the finance team’s data. No access to other teams’ buckets.
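
Under the hood, a question like the second one compiles to ordinary SQL against the transforms above. A plausible query the agent might generate (hypothetical; the agent's actual SQL will vary) using the accounts_aging table from step 4:

```sql
-- Hypothetical agent-generated query for
-- "Which customers have overdue payments over $10k?"
SELECT
  customer_id,
  over_90 + days_60_90 + days_30_60 AS overdue_total
FROM accounts_aging
WHERE over_90 + days_60_90 + days_30_60 > 10000
ORDER BY overdue_total DESC;
```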

Setting up the Product team (same pattern, different data)

pip install dataspoc-pipe[s3] dataspoc-lens[s3,ai]
dataspoc-pipe init
# Product team sources: Postgres (app DB) + Mixpanel events
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/app-db.json
  streams:
    - users
    - subscriptions
    - feature_flags
destination:
  bucket: s3://company-product
  path: raw
  compression: zstd
incremental:
  enabled: true
dataspoc-lens add-bucket s3://company-product
dataspoc-lens ask "how many users signed up last week?"

The product team has zero visibility into finance data. Finance has zero visibility into product data. IAM enforces the boundaries, not application code.

Cross-domain analytics

The head of analytics needs to see across teams. Simple — register multiple buckets:

dataspoc-lens add-bucket s3://company-finance
dataspoc-lens add-bucket s3://company-product
dataspoc-lens add-bucket s3://company-marketing

Now they can JOIN across domains:

SELECT
  p.user_id,
  p.plan,
  f.lifetime_value,
  m.acquisition_channel
FROM product_users p
JOIN finance_customer_360 f ON p.user_id = f.customer_id
JOIN marketing_attribution m ON p.user_id = m.user_id
WHERE p.plan = 'enterprise';

This only works if the analyst’s IAM role has read access to all three buckets. The access model is:

Role                 Buckets                Sees
Finance analyst      s3://company-finance   Finance data only
Product PM           s3://company-product   Product data only
Head of Analytics    all 3                  Cross-domain JOINs
CEO AI agent         all 3                  Everything via MCP
Team-specific agent  team's bucket only     Scoped data
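
The Head of Analytics role is just a read-only IAM policy spanning the three buckets. A sketch of the policy document (same shape as the finance policy from step 1; bucket names as used above, everything else is illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*",
      "arn:aws:s3:::company-product", "arn:aws:s3:::company-product/*",
      "arn:aws:s3:::company-marketing", "arn:aws:s3:::company-marketing/*"
    ]
  }]
}
```

Note the read-only actions: the analyst can query everything but can't write into any team's bucket.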

Data mesh principles mapped to DataSpoc

Principle             How DataSpoc implements it
Domain ownership      Each team owns their bucket, their pipelines, their transforms
Data as a product     manifest.json catalogs what's available; curated/gold layers are the "product"
Self-serve platform   pip install: no tickets, no platform team
Federated governance  Cloud IAM at the bucket level; no application-level auth
Interoperability      Same format (Parquet), same convention (bucket structure), same tools
Discoverability       dataspoc-lens catalog shows all tables in your registered buckets

The data contract: bucket convention

Teams agree on one thing — the bucket structure:

s3://team-bucket/
  .dataspoc/manifest.json              # What tables exist (auto-generated)
  raw/<source>/<table>/*.parquet       # Raw ingested data
  curated/<domain>/<table>/*.parquet   # Cleaned data
  gold/<domain>/<table>/*.parquet      # Business-ready aggregations

This is the only contract between teams. If team A wants to share data with team B, they grant read IAM on their bucket. Team B registers it with dataspoc-lens add-bucket. Done.

No API to build. No data catalog to maintain. No governance meetings. The manifest IS the catalog.
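
For illustration, a manifest might look something like this (the exact schema is up to the tool; treat the field names here as hypothetical):

```json
{
  "bucket": "s3://company-finance",
  "generated_at": "2026-04-29T06:12:00Z",
  "tables": [
    {"name": "charges", "layer": "raw", "path": "raw/stripe/charges/", "format": "parquet"},
    {"name": "monthly_revenue", "layer": "gold", "path": "gold/finance/monthly_revenue/", "format": "parquet"}
  ]
}
```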

Scaling: from 1 team to 20

Team 1: pip install → bucket → pipe → lens → agent (30 min)
Team 2: pip install → bucket → pipe → lens → agent (30 min)
Team 3: pip install → bucket → pipe → lens → agent (30 min)
...
Team 20: pip install → bucket → pipe → lens → agent (30 min)

Each team is independent. Adding team 20 doesn’t affect teams 1-19. No shared Airflow. No shared warehouse. No shared compute.

What a “platform team” does in this model:

  • Creates buckets and IAM policies for new teams (5 min per team)
  • Maintains the cross-domain analyst access
  • Helps teams with their first pipeline setup (Services offering)
  • That’s it. No Kubernetes. No Terraform. No 5-person team.
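
The per-team provisioning is scriptable. A minimal dry-run sketch (it prints the commands rather than running them; the function and policy file names are hypothetical, not part of DataSpoc):

```shell
# Dry-run sketch: print the provisioning commands the platform
# team would run to onboard a new team's bucket and IAM policy.
onboard_team() {
  team="$1"
  echo "aws s3 mb s3://company-${team}"
  echo "aws iam create-policy --policy-name ${team}-data-access --policy-document file://${team}-policy.json"
}

onboard_team marketing
```

Drop the `echo`s (or pipe the output to `sh`) once the commands look right for your cloud setup.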

Cost per team

Item                Cost
DataSpoc Pipe       $0 (open source)
DataSpoc Lens       $0 (open source)
S3 storage (50 GB)  ~$1.15/month
S3 requests         ~$2/month
Total per team      ~$3-5/month

For a 10-team company: $30-50/month total. Compare with Databricks ($30k-100k/year) or Snowflake ($24k-120k/year).

When this doesn’t work

Be honest:

  • Petabyte scale per team — DuckDB can’t handle it. Need Spark/Trino.
  • Real-time requirements — DataSpoc is batch. Need Kafka + Flink.
  • Heavy governance/compliance — Need a real data catalog (DataHub, Atlan). Manifest.json is minimal.
  • 100+ data sources per team — Managing many Singer taps can get complex. Consider Meltano.
  • Team has zero technical skills — Need CLI comfort. If they can’t pip install, use Fivetran + Looker.

Try it

Set up your first domain in 10 minutes:

pip install dataspoc-pipe[s3] dataspoc-lens[s3,mcp]
# Create your domain
dataspoc-pipe init
dataspoc-pipe add my-source
dataspoc-pipe run my-source
# Query your domain
dataspoc-lens init
dataspoc-lens add-bucket s3://my-team-data
dataspoc-lens shell
# Connect your team's AI agent
dataspoc-lens mcp

Data mesh isn’t a product you buy. It’s a pattern you follow. DataSpoc makes the pattern trivial: one bucket per team, pip install, done.
