Data Mesh Without the Complexity: One Bucket Per Team
Data mesh sounds great in theory: decentralized ownership, domain-driven data, self-serve infrastructure. In practice, it usually means 18 months of “platform team” work building a self-serve portal that nobody uses.
Here’s the dirty secret: you don’t need a platform to do data mesh. You need a bucket per team and a pip install.
The data mesh promise (and the usual failure)
The promise:
- Each team owns their data
- No central bottleneck
- Self-serve infrastructure
- Domain-driven design
The usual implementation:
- 6 months building a “data platform” with Terraform, Kubernetes, and Airflow
- A “platform team” of 5 people maintaining it
- Teams still can’t onboard without filing a ticket
- $200k/year in infrastructure before anyone queries anything
The DataSpoc implementation:
- Each team runs pip install dataspoc-pipe dataspoc-lens
- Each team creates their own S3 bucket
- Each team manages their own pipelines
- Total cost: $0 + S3 storage (~$5/month per team)
- Setup time: 30 minutes per team
The architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ Company Cloud (AWS/GCS/Azure)                                       │
│                                                                     │
│ ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐ │
│ │ s3://finance      │  │ s3://product      │  │ s3://marketing    │ │
│ │                   │  │                   │  │                   │ │
│ │ Pipe → Parquet    │  │ Pipe → Parquet    │  │ Pipe → Parquet    │ │
│ │ Lens → SQL/AI     │  │ Lens → SQL/AI     │  │ Lens → SQL/AI     │ │
│ │ ML → predictions  │  │                   │  │                   │ │
│ │                   │  │                   │  │                   │ │
│ │ IAM: finance-team │  │ IAM: product-team │  │ IAM: mkt-team     │ │
│ └───────────────────┘  └───────────────────┘  └───────────────────┘ │
│                                                                     │
│ Cross-domain analyst: registers all 3 buckets in Lens               │
│ AI agent: connects via MCP, scoped to team's bucket                 │
└─────────────────────────────────────────────────────────────────────┘
```
Each team is a self-contained data platform:
- Their own bucket (isolated data)
- Their own Pipe config (their sources, their schedule)
- Their own Lens config (their queries, their transforms)
- Their own AI agent (MCP scoped to their bucket)
No shared infrastructure. No central team. No tickets.
Setting up the Finance team (example)
Step 1: Create the bucket and IAM
```bash
# AWS (or equivalent for GCS/Azure)
aws s3 mb s3://company-finance
aws iam create-policy --policy-name finance-data-access --policy-document '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
  }]
}'
```
Step 2: Install and configure Pipe
```bash
pip install dataspoc-pipe[s3]
dataspoc-pipe init
```
Pipeline: Postgres (ERP) → Finance bucket
```yaml
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/erp.json
  streams:
    - invoices
    - payments
    - accounts_receivable

destination:
  bucket: s3://company-finance
  path: raw
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "0 6 * * *"  # daily at 6am
```
Pipeline: Stripe → Finance bucket
```yaml
source:
  tap: tap-stripe
  config: ~/.dataspoc-pipe/sources/stripe.json
  streams:
    - charges
    - refunds
    - subscriptions

destination:
  bucket: s3://company-finance
  path: raw
  compression: zstd

incremental:
  enabled: true

schedule:
  cron: "0 7 * * *"
```
```bash
dataspoc-pipe run --all
dataspoc-pipe schedule install
```
Step 3: Install and configure Lens
```bash
pip install dataspoc-lens[s3,ai,mcp]
dataspoc-lens init
dataspoc-lens add-bucket s3://company-finance
```
Step 4: Create domain transforms
```sql
-- ~/.dataspoc-lens/transforms/001_monthly_revenue.sql
CREATE OR REPLACE TABLE monthly_revenue AS
SELECT
  DATE_TRUNC('month', created_at) AS month,
  SUM(amount) AS revenue,
  COUNT(*) AS transactions,
  SUM(CASE WHEN type = 'refund' THEN amount ELSE 0 END) AS refunds
FROM charges
GROUP BY 1
ORDER BY 1;

-- ~/.dataspoc-lens/transforms/002_accounts_aging.sql
-- Buckets are exclusive (61-90, 31-60, <=30) so no invoice is counted twice
CREATE OR REPLACE TABLE accounts_aging AS
SELECT
  customer_id,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) > 90 THEN amount ELSE 0 END) AS over_90,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) BETWEEN 61 AND 90 THEN amount ELSE 0 END) AS days_60_90,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) BETWEEN 31 AND 60 THEN amount ELSE 0 END) AS days_30_60,
  SUM(CASE WHEN DATEDIFF('day', due_date, CURRENT_DATE) <= 30 THEN amount ELSE 0 END) AS current_amount
FROM accounts_receivable
WHERE status = 'open'
GROUP BY 1;
```
```bash
dataspoc-lens transform run
```
Step 5: Connect the team’s AI agent
```bash
dataspoc-lens mcp
```
Claude Desktop config for the finance team:
```json
{
  "mcpServers": {
    "finance-data": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}
```
Now the CFO’s AI agent can ask:
"What's our monthly revenue trend?""Which customers have overdue payments over $10k?""What's our net retention rate this quarter?"Every answer comes from real SQL on the finance team’s data. No access to other teams’ buckets.
Setting up the Product team (same pattern, different data)
```bash
pip install dataspoc-pipe[s3] dataspoc-lens[s3,ai]
dataspoc-pipe init
```
```yaml
# Product team sources: Postgres (app DB) + Mixpanel events
source:
  tap: tap-postgres
  config: ~/.dataspoc-pipe/sources/app-db.json
  streams:
    - users
    - subscriptions
    - feature_flags

destination:
  bucket: s3://company-product
  path: raw
  compression: zstd

incremental:
  enabled: true
```
```bash
dataspoc-lens add-bucket s3://company-product
dataspoc-lens ask "how many users signed up last week?"
```
The product team has zero visibility into finance data. Finance has zero visibility into product data. IAM enforces the boundaries, not application code.
Cross-domain analytics
The head of analytics needs to see across teams. Simple — register multiple buckets:
```bash
dataspoc-lens add-bucket s3://company-finance
dataspoc-lens add-bucket s3://company-product
dataspoc-lens add-bucket s3://company-marketing
```
Now they can JOIN across domains:
```sql
SELECT
  p.user_id,
  p.plan,
  f.lifetime_value,
  m.acquisition_channel
FROM product_users p
JOIN finance_customer_360 f ON p.user_id = f.customer_id
JOIN marketing_attribution m ON p.user_id = m.user_id
WHERE p.plan = 'enterprise';
```
This only works if the analyst’s IAM role has read access to all three buckets. The access model is:
| Role | Buckets | Sees |
|---|---|---|
| Finance analyst | s3://company-finance | Finance data only |
| Product PM | s3://company-product | Product data only |
| Head of Analytics | All 3 | Cross-domain JOINs |
| CEO AI agent | All 3 | Everything via MCP |
| Team-specific agent | Team’s bucket only | Scoped data |
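As a sketch, the Head of Analytics role might carry a read-only policy like the following — bucket names match the examples above; if the buckets use customer-managed KMS keys, decrypt permissions would also be needed. Note the absence of s3:PutObject: cross-domain readers can query but not write.

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*",
      "arn:aws:s3:::company-product", "arn:aws:s3:::company-product/*",
      "arn:aws:s3:::company-marketing", "arn:aws:s3:::company-marketing/*"
    ]
  }]
}
```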
Data mesh principles mapped to DataSpoc
| Principle | How DataSpoc implements it |
|---|---|
| Domain ownership | Each team owns their bucket, their pipelines, their transforms |
| Data as a product | manifest.json catalogs what’s available. Curated/gold layers are the “product” |
| Self-serve platform | pip install — no tickets, no platform team |
| Federated governance | Cloud IAM at bucket level. No application-level auth |
| Interoperability | Same format (Parquet), same convention (bucket structure), same tools |
| Discoverability | dataspoc-lens catalog shows all tables in your registered buckets |
The data contract: bucket convention
Teams agree on one thing — the bucket structure:
```
s3://team-bucket/
  .dataspoc/manifest.json              # What tables exist (auto-generated)
  raw/<source>/<table>/*.parquet       # Raw ingested data
  curated/<domain>/<table>/*.parquet   # Cleaned data
  gold/<domain>/<table>/*.parquet      # Business-ready aggregations
```
This is the only contract between teams. If team A wants to share data with team B, they grant read IAM on their bucket. Team B registers it with dataspoc-lens add-bucket. Done.
No API to build. No data catalog to maintain. No governance meetings. The manifest IS the catalog.
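The manifest is auto-generated, so its exact schema belongs to the tools — but conceptually it is a small JSON file along these lines (field names here are illustrative, not the actual format):

```json
{
  "bucket": "s3://company-finance",
  "generated_at": "2025-01-15T06:10:00Z",
  "tables": [
    {"name": "charges", "layer": "raw", "path": "raw/stripe/charges/", "format": "parquet"},
    {"name": "monthly_revenue", "layer": "gold", "path": "gold/finance/monthly_revenue/", "format": "parquet"}
  ]
}
```

Anyone with read access to the bucket can discover what's in it by reading one file — no catalog service to run.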
Scaling: from 1 team to 20
```
Team 1:  pip install → bucket → pipe → lens → agent (30 min)
Team 2:  pip install → bucket → pipe → lens → agent (30 min)
Team 3:  pip install → bucket → pipe → lens → agent (30 min)
...
Team 20: pip install → bucket → pipe → lens → agent (30 min)
```
Each team is independent. Adding team 20 doesn’t affect teams 1-19. No shared Airflow. No shared warehouse. No shared compute.
What a “platform team” does in this model:
- Creates buckets and IAM policies for new teams (5 min per team)
- Maintains the cross-domain analyst access
- Helps teams with their first pipeline setup (Services offering)
- That’s it. No Kubernetes. No Terraform. No 5-person team.
Cost per team
| Item | Cost |
|---|---|
| DataSpoc Pipe | $0 (open source) |
| DataSpoc Lens | $0 (open source) |
| S3 storage (50GB) | ~$1.15/month |
| S3 requests | ~$2/month |
| Total per team | ~$3-5/month |
For a 10-team company: $30-50/month total. Compare with Databricks ($30k-100k/year) or Snowflake ($24k-120k/year).
When this doesn’t work
Be honest:
- Petabyte scale per team — DuckDB can’t handle it. Need Spark/Trino.
- Real-time requirements — DataSpoc is batch. Need Kafka + Flink.
- Heavy governance/compliance — Need a real data catalog (DataHub, Atlan). Manifest.json is minimal.
- 100+ data sources per team — Managing many Singer taps can get complex. Consider Meltano.
- Team has zero technical skills — Need CLI comfort. If they can’t pip install, use Fivetran + Looker.
Try it
Set up your first domain in 10 minutes:
```bash
pip install dataspoc-pipe[s3] dataspoc-lens[s3,mcp]

# Create your domain
dataspoc-pipe init
dataspoc-pipe add my-source
dataspoc-pipe run my-source

# Query your domain
dataspoc-lens init
dataspoc-lens add-bucket s3://my-team-data
dataspoc-lens shell

# Connect your team's AI agent
dataspoc-lens mcp
```
Data mesh isn’t a product you buy. It’s a pattern you follow. DataSpoc makes the pattern trivial: one bucket per team, pip install, done.