Building a Data Query Agent with Anthropic Claude SDK and DataSpoc
Claude’s tool_use feature turns the model into an agent that can call your functions, inspect results, and decide what to do next. Pair that with DataSpoc Lens and you get an AI that queries your data lake with real SQL — no guessing, no hallucinating numbers.
This post builds a complete data query agent using the Anthropic Python SDK. You will define tools, implement the message loop, and handle multi-step conversations where Claude discovers schemas, writes queries, and answers follow-up questions.
Prerequisites
```bash
pip install anthropic dataspoc-lens
export ANTHROPIC_API_KEY="sk-ant-..."
```

Configure your data lake:
```bash
dataspoc-lens add-bucket s3://my-company-data --name production
```

Step 1: Define Tools as JSON Schema
Claude’s tool_use requires tools defined as JSON schemas. Each tool maps to a LensClient method:
```python
TOOLS = [
    {
        "name": "list_tables",
        "description": (
            "List all available tables in the data lake. "
            "Call this first to discover what data exists."
        ),
        "input_schema": {
            "type": "object",
            "properties": {},
            "required": [],
        },
    },
    {
        "name": "get_schema",
        "description": (
            "Get column names and types for a table. "
            "Always call this before writing SQL to know exact column names."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "table_name": {
                    "type": "string",
                    "description": "Full table name, e.g. 'raw.orders'",
                }
            },
            "required": ["table_name"],
        },
    },
    {
        "name": "run_query",
        "description": (
            "Execute a SQL query against the data lake using DuckDB syntax. "
            "Returns up to 50 rows. Always include LIMIT unless aggregating."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "sql": {
                    "type": "string",
                    "description": "DuckDB-compatible SQL query",
                }
            },
            "required": ["sql"],
        },
    },
    {
        "name": "ask_ai",
        "description": (
            "Ask a natural language question about the data. "
            "Uses the built-in AI query engine. Good for quick exploration."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "Natural language question about the data",
                }
            },
            "required": ["question"],
        },
    },
]
```

Step 2: Implement Tool Execution
Each tool name maps to a LensClient call:
```python
from dataspoc_lens import LensClient

lens = LensClient()


def execute_tool(name: str, tool_input: dict) -> str:
    """Route tool calls to LensClient methods."""
    if name == "list_tables":
        tables = lens.tables()
        return "\n".join(tables)

    elif name == "get_schema":
        schema = lens.schema(tool_input["table_name"])
        lines = [f"  {col['name']} ({col['type']})" for col in schema]
        return f"Table: {tool_input['table_name']}\n" + "\n".join(lines)

    elif name == "run_query":
        result = lens.query(tool_input["sql"])
        return result.to_string(max_rows=50)

    elif name == "ask_ai":
        return lens.ask(tool_input["question"])

    else:
        return f"Unknown tool: {name}"
```

Step 3: The Agent Loop
The core pattern for Claude tool_use is a loop: send messages, check for tool calls, execute them, send results back, repeat until Claude responds with text.
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = (
    "You are a data analyst with access to a company data lake stored as "
    "Parquet files in cloud storage. You query it using DuckDB SQL via tools.\n\n"
    "Workflow:\n"
    "1. List tables to see what's available\n"
    "2. Check schemas before writing SQL\n"
    "3. Write precise SQL queries\n"
    "4. Present results with business context\n\n"
    "Use DuckDB syntax. For dates, use DATE '2026-01-01' format."
)


def chat(messages: list[dict]) -> tuple[str, list[dict]]:
    """Send messages to Claude and handle tool calls in a loop."""
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )

        # Collect the assistant's response
        messages.append({"role": "assistant", "content": response.content})

        # If the model is not requesting a tool, it is done -- return the text.
        # Checking != "tool_use" (rather than == "end_turn") also exits cleanly
        # when the response is cut off by max_tokens.
        if response.stop_reason != "tool_use":
            text = "".join(
                block.text for block in response.content if block.type == "text"
            )
            return text, messages

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"  [tool] {block.name}({block.input})")
                result = execute_tool(block.name, block.input)
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    }
                )

        # Send tool results back to Claude
        messages.append({"role": "user", "content": tool_results})
```

This loop handles the full lifecycle: Claude asks to use a tool, we execute it, send the result, and Claude either asks for another tool or responds with its final answer.
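In a real session, `run_query` will sometimes fail (a typo'd column, a missing table). Rather than letting the exception kill the loop, you can report failures back to Claude through the `tool_result` block's `is_error` flag so the model can self-correct on its next turn. A minimal sketch; `safe_tool_result` and `flaky_execute` are illustrative names, not part of either SDK:

```python
def safe_tool_result(execute, name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Package a tool call's outcome as a tool_result block.

    On failure, set is_error so Claude sees the error message and can
    retry with a corrected query instead of the whole loop crashing.
    """
    try:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": execute(name, tool_input),
        }
    except Exception as exc:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Tool error: {exc}",
            "is_error": True,
        }


# Stand-in executor for demonstration; real code would pass execute_tool.
def flaky_execute(name: str, tool_input: dict) -> str:
    if name == "run_query":
        raise ValueError("Catalog Error: table 'raw.orderz' does not exist")
    return "ok"
```

In practice Claude responds to an `is_error` result by re-reading the schema or fixing the SQL, so one bad query rarely ends the conversation.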
Step 4: Interactive Session
```python
def main():
    messages = []
    print("Data Lake Agent (Claude) -- type 'quit' to exit")
    print("-" * 50)

    while True:
        question = input("\nYou: ")
        if question.lower() in ("quit", "exit", "q"):
            break

        messages.append({"role": "user", "content": question})
        answer, messages = chat(messages)
        print(f"\nAgent: {answer}")


if __name__ == "__main__":
    main()
```

Example Session
You: What data do we have?
[tool] list_tables({})
Agent: Here are the tables in your data lake:
- **raw.customers** -- Customer records
- **raw.orders** -- Order transactions
- **raw.products** -- Product catalog
- **curated.daily_revenue** -- Aggregated daily revenue
- **curated.customer_segments** -- Customer segmentation
You: Who are our top 5 customers by total spend?
[tool] get_schema({"table_name": "raw.orders"})
[tool] get_schema({"table_name": "raw.customers"})
[tool] run_query({"sql": "SELECT c.name, SUM(o.amount) as total_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id GROUP BY c.name ORDER BY total_spend DESC LIMIT 5"})
Agent: Here are your top 5 customers by total spend:
| Customer          | Total Spend |
|-------------------|-------------|
| Acme Corporation  | $1,245,800  |
| TechFlow Inc      | $892,340    |
| Global Systems    | $756,120    |
| DataVentures      | $623,450    |
| CloudFirst Ltd    | $589,200    |
Acme Corporation leads significantly, spending 40% more than the second-highest customer.
You: Show me Acme's orders over time, monthly
[tool] run_query({"sql": "SELECT DATE_TRUNC('month', o.order_date) as month, COUNT(*) as order_count, SUM(o.amount) as monthly_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id WHERE c.name = 'Acme Corporation' GROUP BY 1 ORDER BY 1"})
Agent: Acme Corporation's monthly orders show a clear growth trend...

Comparison: Anthropic SDK vs. OpenAI Function Calling
The pattern is nearly identical. Here are the key differences:
```python
# === Anthropic (Claude) ===
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    tools=TOOLS,  # tools at top level
    messages=messages,
)
# Tool calls are in response.content as tool_use blocks
# Results go back as tool_result in a user message
```
```python
# === OpenAI ===
response = openai.chat.completions.create(
    model="gpt-4o",
    tools=[
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t["description"],
                # OpenAI expects "parameters" where Anthropic uses "input_schema"
                "parameters": t["input_schema"],
            },
        }
        for t in TOOLS
    ],
    messages=messages,
)
# Tool calls are in response.choices[0].message.tool_calls
# Results go back as role="tool" messages
```

| Aspect | Anthropic SDK | OpenAI SDK |
|---|---|---|
| Tool definition | input_schema (JSON Schema) | parameters (JSON Schema) |
| Tool call location | content[] blocks with type: "tool_use" | tool_calls[] on the message |
| Result format | tool_result in user message | Separate role: "tool" message |
| Parallel tools | Multiple tool_use blocks in one response | Multiple entries in tool_calls |
| Stop reason | stop_reason: "tool_use" | finish_reason: "tool_calls" |
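If you target both SDKs, the JSON Schemas themselves can be shared; only the wrapper differs. A small adapter sketch, assuming Anthropic-style definitions like the `TOOLS` list above (a one-tool copy is inlined here so the snippet is self-contained):

```python
def to_openai_tools(anthropic_tools: list[dict]) -> list[dict]:
    """Convert Anthropic tool definitions to OpenAI's function-calling format.

    The JSON Schema body is identical in both SDKs; only the wrapper differs:
    input_schema becomes parameters, nested under {"type": "function"}.
    """
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t["description"],
                "parameters": t["input_schema"],
            },
        }
        for t in anthropic_tools
    ]


# One tool from this post, for demonstration:
TOOLS = [
    {
        "name": "list_tables",
        "description": "List all tables in the data lake.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    }
]
```

Maintaining one canonical tool list and converting at the call site keeps descriptions and schemas from drifting between the two backends.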
Full Working Script
"""claude_data_agent.py - Data lake agent with Anthropic Claude SDK."""
import anthropicfrom dataspoc_lens import LensClient
lens = LensClient()client = anthropic.Anthropic()
TOOLS = [ { "name": "list_tables", "description": "List all tables in the data lake.", "input_schema": { "type": "object", "properties": {}, "required": [], }, }, { "name": "get_schema", "description": "Get column names and types for a table.", "input_schema": { "type": "object", "properties": { "table_name": {"type": "string", "description": "Table name"} }, "required": ["table_name"], }, }, { "name": "run_query", "description": "Execute a DuckDB SQL query. Include LIMIT unless aggregating.", "input_schema": { "type": "object", "properties": { "sql": {"type": "string", "description": "SQL query"} }, "required": ["sql"], }, },]
SYSTEM = ( "You are a data analyst. List tables first, check schemas, then query. " "Use DuckDB SQL. Present results with business context.")
def execute_tool(name: str, inp: dict) -> str: if name == "list_tables": return "\n".join(lens.tables()) elif name == "get_schema": schema = lens.schema(inp["table_name"]) return "\n".join(f"{c['name']} ({c['type']})" for c in schema) elif name == "run_query": return lens.query(inp["sql"]).to_string(max_rows=50) return f"Unknown tool: {name}"
def chat(messages: list[dict]) -> tuple[str, list[dict]]: while True: resp = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, system=SYSTEM, tools=TOOLS, messages=messages, ) messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason == "end_turn": return "".join(b.text for b in resp.content if b.type == "text"), messages
results = [] for block in resp.content: if block.type == "tool_use": print(f" [tool] {block.name}({block.input})") results.append({ "type": "tool_result", "tool_use_id": block.id, "content": execute_tool(block.name, block.input), }) messages.append({"role": "user", "content": results})
def main(): messages = [] print("Data Lake Agent (Claude) -- type 'quit' to exit\n") while True: q = input("You: ") if q.lower() in ("quit", "exit", "q"): break messages.append({"role": "user", "content": q}) answer, messages = chat(messages) print(f"\nAgent: {answer}\n")
if __name__ == "__main__": main()Run it:
```bash
python claude_data_agent.py
```

Claude discovers your tables, understands schemas, writes SQL, and explains results — all grounded in real data from your cloud Parquet files. No hallucinated numbers, no approximate retrieval, just precise queries on your actual data lake.
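One last note: the loop's message bookkeeping can be checked without API credentials or a data lake by driving the same shape with scripted responses. A sketch; `FakeBlock` and `FakeResponse` are minimal stand-ins for the SDK's response objects (not part of the Anthropic SDK), and `run_loop` mirrors `chat()` without the network call:

```python
from dataclasses import dataclass, field


@dataclass
class FakeBlock:
    """Stand-in for a content block (text or tool_use)."""
    type: str
    text: str = ""
    name: str = ""
    input: dict = field(default_factory=dict)
    id: str = ""


@dataclass
class FakeResponse:
    """Stand-in for a Messages API response."""
    stop_reason: str
    content: list


def run_loop(responses, execute_tool, messages):
    """Same shape as chat(): consume scripted responses until the model stops calling tools."""
    responses = iter(responses)
    while True:
        resp = next(responses)
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": execute_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})


# Script: one tool call, then a final answer.
scripted = [
    FakeResponse("tool_use", [FakeBlock(type="tool_use", name="list_tables", id="t1")]),
    FakeResponse("end_turn", [FakeBlock(type="text", text="Two tables found.")]),
]
messages = [{"role": "user", "content": "What data do we have?"}]
answer = run_loop(scripted, lambda n, i: "raw.orders\nraw.customers", messages)
```

After the run, `messages` alternates user, assistant (tool_use), user (tool_result), assistant — the exact transcript structure the real API expects.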