Data Governance for AI Agents: How DataSpoc Keeps Your Lake Secure
The moment you give an AI agent access to your data, someone in security will ask: “What stops it from deleting everything?” Fair question. Most AI-to-data integrations have no good answer. DataSpoc does.
This post covers the security model that lets AI agents query your data lake without introducing new risks.
The Fear
Teams hesitate to connect AI agents to data for valid reasons:
- Write access: What if the agent runs `DROP TABLE` or `DELETE FROM`?
- Credential sprawl: Another set of database passwords to manage and rotate.
- Data exfiltration: Can the agent send data to unauthorized destinations?
- No audit trail: How do you know what data the agent accessed?
- Scope creep: The agent can see everything, including data it should not.
These fears are justified when you give agents direct database access. DataSpoc eliminates each one.
Security Layer 1: Read-Only by Design
The Lens MCP server is read-only. It does not expose write operations. Period.
```python
from dataspoc_lens import LensClient

lens = LensClient()

# This works — read query
df = lens.query("SELECT * FROM curated_sales LIMIT 10")

# This is rejected — write query
try:
    lens.query("DROP TABLE curated_sales")
except Exception as e:
    print(e)  # "Write operations are not permitted. Lens is read-only."

# These are also rejected
lens.query("INSERT INTO curated_sales VALUES (...)")  # rejected
lens.query("UPDATE curated_sales SET amount = 0")     # rejected
lens.query("DELETE FROM curated_sales")               # rejected
lens.query("CREATE TABLE test (id INT)")              # rejected
```

This is enforced at the engine level, not just the prompt level. Even if an LLM generates a write query, Lens will not execute it. The SQL parser checks every statement before execution.
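To make the idea of an engine-level check concrete, here is a toy read-only guard. This is illustrative only, not DataSpoc's actual parser: a real engine inspects the full parse tree, while this sketch only checks the leading keyword.

```python
import re

# Toy read-only guard (illustrative only, not DataSpoc's parser):
# accept a statement only if it starts with a read keyword.
READ_ONLY = {"SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN"}

def guard(sql: str) -> str:
    """Return sql if it is a read statement; raise otherwise."""
    match = re.match(r"\s*([A-Za-z]+)", sql)
    if not match or match.group(1).upper() not in READ_ONLY:
        raise PermissionError(
            "Write operations are not permitted. Lens is read-only."
        )
    return sql

guard("SELECT * FROM curated_sales LIMIT 10")  # accepted
try:
    guard("DROP TABLE curated_sales")          # rejected before execution
except PermissionError as e:
    print(e)
```

The key property is that the guard runs before execution: a write statement never reaches the query engine, no matter what the LLM generated.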
When using the MCP server, the same protection applies:
```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}
```

The MCP server exposes tools like `query`, `tables`, `schema`, and `ask`. None of them accept write operations. An AI agent connected via MCP physically cannot modify your data.
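Why can the agent not write, even in principle? Because the tool surface never contains a write tool to call. A toy MCP-style dispatcher (a sketch, not the real dataspoc-lens server) shows the shape of that guarantee:

```python
# Toy MCP-style tool dispatcher (a sketch, not the real dataspoc-lens
# server): only read tools are registered, so a write request has
# nothing to invoke in the first place.
def run_query(sql: str) -> str:
    return f"rows for: {sql}"  # stand-in for real query execution

def list_tables() -> list:
    return ["curated_sales", "curated_customers"]

TOOLS = {"query": run_query, "tables": list_tables}

def dispatch(tool: str, *args):
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    return TOOLS[tool](*args)

print(dispatch("tables"))       # read tools work as usual
# dispatch("drop_table", "t")   # raises ValueError: no such tool exists
```

An agent can only call tools the server registers, so "read-only" here is a property of the protocol surface, not of a prompt instruction.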
Security Layer 2: Cloud IAM (No New Credentials)
DataSpoc never manages credentials. It uses your existing cloud IAM:
AWS
```bash
# Lens uses your existing AWS credentials

# Option 1: AWS SSO (recommended)
aws sso login --profile data-team

# Option 2: IAM role (for EC2/ECS/Lambda)
# Automatically uses the instance/task role

# Option 3: Environment variables
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
```

The IAM policy controls what the agent can see:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-analytics",
        "arn:aws:s3:::company-analytics/*"
      ]
    },
    {
      "Effect": "Deny",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "*"
    }
  ]
}
```

Notice: the IAM policy explicitly denies write access to S3. Even if Lens had a bug that allowed write SQL (it does not), the cloud layer would block the write.
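The Deny statement is not redundant: in IAM's evaluation model, an explicit deny beats any allow. A toy evaluator (illustrative only; real IAM also matches Resource ARNs, wildcards, and conditions) captures the rule:

```python
# Toy model of IAM policy evaluation (illustrative; real IAM also
# matches Resource ARNs, wildcards, and conditions): explicit Deny wins.
def is_allowed(statements: list, action: str) -> bool:
    matching = [s for s in statements if action in s["Action"]]
    if any(s["Effect"] == "Deny" for s in matching):
        return False  # an explicit deny overrides every allow
    return any(s["Effect"] == "Allow" for s in matching)

policy = [
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"]},
    {"Effect": "Deny", "Action": ["s3:PutObject", "s3:DeleteObject"]},
]

print(is_allowed(policy, "s3:GetObject"))  # True: reads pass
print(is_allowed(policy, "s3:PutObject"))  # False: writes blocked
```

Because deny always wins, later attaching a broader allow policy to the same identity still cannot re-enable writes.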
GCP
```bash
# Use application default credentials
gcloud auth application-default login

# Or service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/sa-key.json"
```

```yaml
# IAM binding — read-only access to the bucket
- members:
    - serviceAccount:dataspoc-reader@project.iam.gserviceaccount.com
  role: roles/storage.objectViewer
```

Azure
```bash
# Use Azure CLI credentials
az login

# Or managed identity (recommended for production)
export AZURE_STORAGE_ACCOUNT="companylake"
```

```json
{
  "roleDefinitionName": "Storage Blob Data Reader",
  "scope": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Storage/storageAccounts/companylake"
}
```

The key insight: DataSpoc adds zero new credentials. The AI agent has exactly the same access as the human who configured it. If your cloud IAM says “this identity can only read from the analytics bucket,” that is all the agent can do.
Security Layer 3: Bucket-Level Access Control
Different teams see different data because they have access to different buckets:
```
s3://company-finance   → Finance team only
s3://company-hr        → HR team only
s3://company-product   → Product team only
s3://company-analytics → Everyone (aggregated, non-sensitive)
```

Configure Lens to point to the appropriate bucket:

```bash
# Finance team's agent
export DATASPOC_BUCKET="s3://company-finance"
dataspoc-lens mcp  # This agent sees finance data only

# Product team's agent
export DATASPOC_BUCKET="s3://company-product"
dataspoc-lens mcp  # This agent sees product data only
```

An agent configured with `s3://company-product` literally cannot access `s3://company-finance`. It does not know that bucket exists. The isolation is at the cloud infrastructure level, not application logic.
Security Layer 4: Audit Trail
Every query executed through Lens is SQL. SQL is text. Text is loggable.
```python
from dataspoc_lens import LensClient

lens = LensClient()

# Every call to query() or ask() produces a SQL statement
# that can be logged, reviewed, and audited

# ask() returns both the answer and the SQL it generated
answer = lens.ask("How many customers do we have?")
# Internally executes: SELECT COUNT(*) FROM curated_customers
# This SQL is logged to .dataspoc/logs/
```

Lens logs every query to the bucket:

```
bucket/
  .dataspoc/
    logs/
      lens/
        2026-04-15T14:30:00Z.json
        2026-04-15T14:31:15Z.json
```

Each log entry contains:

```json
{
  "timestamp": "2026-04-15T14:30:00Z",
  "query": "SELECT COUNT(*) FROM curated_customers",
  "source": "mcp",
  "tables_accessed": ["curated_customers"],
  "rows_returned": 1,
  "duration_ms": 45,
  "status": "success"
}
```

You can review exactly what data the agent accessed, when, and how much. Compare this with RAG, where the retrieval step is opaque — you cannot easily see which chunks were sent to the LLM.
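Because each entry is plain JSON, review can be automated with a short script. A sketch, assuming entries shaped like the example above have been synced to a local directory (the `lens_logs/` path and the sync step are hypothetical):

```python
import json
from collections import Counter
from pathlib import Path

# Sketch: count how often each table was touched, assuming log entries
# shaped like the example above, synced to a local "lens_logs/" dir
# (hypothetical path; in practice you would pull them from the bucket).
def summarize_access(log_dir: str) -> Counter:
    tables = Counter()
    for path in sorted(Path(log_dir).glob("*.json")):
        entry = json.loads(path.read_text())
        for table in entry.get("tables_accessed", []):
            tables[table] += 1
    return tables

# Demo with one locally written entry:
Path("lens_logs").mkdir(exist_ok=True)
Path("lens_logs/entry-0001.json").write_text(json.dumps({
    "timestamp": "2026-04-15T14:30:00Z",
    "query": "SELECT COUNT(*) FROM curated_customers",
    "tables_accessed": ["curated_customers"],
}))
print(summarize_access("lens_logs"))
```

The same loop extends naturally to flagging queries against sensitive tables or unusually large `rows_returned` values.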
Comparison: Three Approaches to AI Data Access
Approach 1: Direct Database Access (Dangerous)
```python
# The agent gets a database connection string
import psycopg2

conn = psycopg2.connect("postgresql://admin:password@prod-db:5432/main")
cursor = conn.cursor()

# Nothing stops the agent from running:
cursor.execute("DROP TABLE customers")                  # disaster
cursor.execute("SELECT * FROM hr.salaries")             # data leak
cursor.execute("UPDATE orders SET status = 'shipped'")  # data corruption
```

Problems:
- Credentials in code
- Full read/write access
- No scope limitation
- One mistake destroys production data
Approach 2: RAG with Vector Store (Unauditable)
```python
# The agent retrieves chunks from a vector store
results = vector_store.similarity_search("customer salary data", k=20)
# Which 20 chunks were returned? Hard to audit.
# Did they include sensitive HR data? Maybe.
# Can you prove what the LLM saw? Not easily.
```

Problems:
- Opaque retrieval (what chunks were actually returned?)
- Embeddings can encode sensitive data
- No row-level access control
- Cannot prove compliance
Approach 3: DataSpoc Lens (Governed)
```python
from dataspoc_lens import LensClient

lens = LensClient()  # uses cloud IAM, read-only, scoped to one bucket

# Every action is SQL — auditable, reviewable, explainable
df = lens.query("SELECT region, COUNT(*) FROM curated_sales GROUP BY region")

# Write operations are rejected at engine level
# Access scope is determined by cloud IAM
# Every query is logged with timestamp and tables accessed
```

Advantages:
- No credentials to manage
- Read-only by design
- Scoped by cloud IAM
- Full audit trail
- Every answer traces to a SQL query
Configuration Checklist for Production
Here is a step-by-step checklist for deploying DataSpoc with AI agents in a governed environment:
1. Create a Dedicated IAM Identity
```bash
# AWS: Create a role for the agent
aws iam create-role --role-name dataspoc-agent-reader \
  --assume-role-policy-document file://trust-policy.json

aws iam attach-role-policy --role-name dataspoc-agent-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```

2. Restrict to Specific Buckets

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-analytics",
        "arn:aws:s3:::company-analytics/*"
      ]
    }
  ]
}
```

3. Configure the MCP Server

```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"],
      "env": {
        "DATASPOC_BUCKET": "s3://company-analytics",
        "AWS_PROFILE": "dataspoc-agent-reader"
      }
    }
  }
}
```

4. Enable Query Logging

```yaml
# dataspoc config
logging:
  enabled: true
  destination: "s3://company-analytics/.dataspoc/logs/lens/"
  level: "all"  # logs every query
```

5. Review Logs Regularly
```python
from dataspoc_lens import LensClient

lens = LensClient()

# Query the agent's own audit logs
df = lens.query("""
    SELECT timestamp, query, tables_accessed, rows_returned
    FROM lens_audit_log
    WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'
    ORDER BY timestamp DESC
""")
print(df)
```

The Bottom Line
Giving AI agents data access does not have to be scary. DataSpoc’s security model is simple:
- Read-only engine — writes are impossible at the SQL parser level
- Cloud IAM — no new credentials, same permissions as humans
- Bucket isolation — each team/agent sees only their data
- SQL audit trail — every query is logged and reviewable
The result: your security team gets the governance they need, and your data team gets AI agents that actually work. No compromise required.