Building a Data Query Agent with Anthropic Claude SDK and DataSpoc
Claude’s tool_use feature turns the model into an agent that can call your functions, inspect results, and decide what to do next. Pair that with DataSpoc Lens and you get an AI that queries your data lake with real SQL — no guessing, no hallucinating numbers.
This post builds a complete data query agent using the Anthropic Python SDK. You will define tools, implement the message loop, and handle multi-step conversations where Claude discovers schemas, writes queries, and answers follow-up questions.
Prerequisites
```bash
pip install anthropic dataspoc-lens
export ANTHROPIC_API_KEY="sk-ant-..."
```

Configure your data lake:
```bash
dataspoc-lens add-bucket s3://my-company-data --name production
```

Step 1: Define Tools as JSON Schema
Claude’s tool_use requires tools defined as JSON schemas. Each tool maps to a LensClient method:
```python
TOOLS = [
    {
        "name": "list_tables",
        "description": (
            "List all available tables in the data lake. "
            "Call this first to discover what data exists."
        ),
        "input_schema": {
            "type": "object",
            "properties": {},
            "required": [],
        },
    },
    {
        "name": "get_schema",
        "description": (
            "Get column names and types for a table. "
            "Always call this before writing SQL to know exact column names."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "table_name": {
                    "type": "string",
                    "description": "Full table name, e.g. 'raw.orders'",
                }
            },
            "required": ["table_name"],
        },
    },
    {
        "name": "run_query",
        "description": (
            "Execute a SQL query against the data lake using DuckDB syntax. "
            "Returns up to 50 rows. Always include LIMIT unless aggregating."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "sql": {
                    "type": "string",
                    "description": "DuckDB-compatible SQL query",
                }
            },
            "required": ["sql"],
        },
    },
    {
        "name": "ask_ai",
        "description": (
            "Ask a natural language question about the data. "
            "Uses the built-in AI query engine. Good for quick exploration."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "Natural language question about the data",
                }
            },
            "required": ["question"],
        },
    },
]
```

Step 2: Implement Tool Execution
Each tool name maps to a LensClient call:
```python
from dataspoc_lens import LensClient

lens = LensClient()


def execute_tool(name: str, tool_input: dict) -> str:
    """Route tool calls to LensClient methods."""
    if name == "list_tables":
        tables = lens.tables()
        return "\n".join(tables)

    elif name == "get_schema":
        schema = lens.schema(tool_input["table_name"])
        lines = [f"  {col['name']} ({col['type']})" for col in schema]
        return f"Table: {tool_input['table_name']}\n" + "\n".join(lines)

    elif name == "run_query":
        result = lens.query(tool_input["sql"])
        return result.to_string(max_rows=50)

    elif name == "ask_ai":
        return lens.ask(tool_input["question"])

    else:
        return f"Unknown tool: {name}"
```

Step 3: The Agent Loop
The core pattern for Claude tool_use is a loop: send messages, check for tool calls, execute them, send results back, repeat until Claude responds with text.
```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = (
    "You are a data analyst with access to a company data lake stored as "
    "Parquet files in cloud storage. You query it using DuckDB SQL via tools.\n\n"
    "Workflow:\n"
    "1. List tables to see what's available\n"
    "2. Check schemas before writing SQL\n"
    "3. Write precise SQL queries\n"
    "4. Present results with business context\n\n"
    "Use DuckDB syntax. For dates, use DATE '2026-01-01' format."
)


def chat(messages: list[dict]) -> tuple[str, list[dict]]:
    """Send messages to Claude and handle tool calls in a loop."""
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages,
        )

        # Collect the assistant's response
        messages.append({"role": "assistant", "content": response.content})

        # If the model is not requesting a tool, it is done -- return the text.
        # Checking != "tool_use" (rather than == "end_turn") also exits cleanly
        # when the response is cut off by max_tokens.
        if response.stop_reason != "tool_use":
            text = "".join(
                block.text for block in response.content if block.type == "text"
            )
            return text, messages

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                print(f"  [tool] {block.name}({block.input})")
                result = execute_tool(block.name, block.input)
                tool_results.append(
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    }
                )

        # Send tool results back to Claude
        messages.append({"role": "user", "content": tool_results})
```

This loop handles the full lifecycle: Claude asks to use a tool, we execute it, send the result, and Claude either asks for another tool or responds with its final answer.
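In a real session, `run_query` will sometimes fail (a typo'd column, a missing table). Rather than letting the exception kill the loop, you can report failures back to Claude through the `tool_result` block's `is_error` flag so the model can self-correct on its next turn. A minimal sketch; `safe_tool_result` and `flaky_execute` are illustrative names, not part of either SDK:

```python
def safe_tool_result(execute, name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Package a tool call's outcome as a tool_result block.

    On failure, set is_error so Claude sees the error message and can
    retry with a corrected query instead of the whole loop crashing.
    """
    try:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": execute(name, tool_input),
        }
    except Exception as exc:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"Tool error: {exc}",
            "is_error": True,
        }


# Stand-in executor for demonstration; real code would pass execute_tool.
def flaky_execute(name: str, tool_input: dict) -> str:
    if name == "run_query":
        raise ValueError("Catalog Error: table 'raw.orderz' does not exist")
    return "ok"
```

In practice Claude responds to an `is_error` result by re-reading the schema or fixing the SQL, so one bad query rarely ends the conversation.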
Step 4: Interactive Session
```python
def main():
    messages = []
    print("Data Lake Agent (Claude) -- type 'quit' to exit")
    print("-" * 50)

    while True:
        question = input("\nYou: ")
        if question.lower() in ("quit", "exit", "q"):
            break

        messages.append({"role": "user", "content": question})
        answer, messages = chat(messages)
        print(f"\nAgent: {answer}")


if __name__ == "__main__":
    main()
```

Example Session
You: What data do we have?
[tool] list_tables({})
Agent: Here are the tables in your data lake:
- **raw.customers** -- Customer records
- **raw.orders** -- Order transactions
- **raw.products** -- Product catalog
- **curated.daily_revenue** -- Aggregated daily revenue
- **curated.customer_segments** -- Customer segmentation
You: Who are our top 5 customers by total spend?
[tool] get_schema({"table_name": "raw.orders"})
[tool] get_schema({"table_name": "raw.customers"})
[tool] run_query({"sql": "SELECT c.name, SUM(o.amount) as total_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id GROUP BY c.name ORDER BY total_spend DESC LIMIT 5"})
Agent: Here are your top 5 customers by total spend:
| Customer          | Total Spend |
|-------------------|-------------|
| Acme Corporation  | $1,245,800  |
| TechFlow Inc      | $892,340    |
| Global Systems    | $756,120    |
| DataVentures      | $623,450    |
| CloudFirst Ltd    | $589,200    |
Acme Corporation leads significantly, spending 40% more than the second-highest customer.
You: Show me Acme's orders over time, monthly
[tool] run_query({"sql": "SELECT DATE_TRUNC('month', o.order_date) as month, COUNT(*) as order_count, SUM(o.amount) as monthly_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id WHERE c.name = 'Acme Corporation' GROUP BY 1 ORDER BY 1"})
Agent: Acme Corporation's monthly orders show a clear growth trend...

Comparison: Anthropic SDK vs. OpenAI Function Calling
The pattern is nearly identical. Here are the key differences:
```python
# === Anthropic (Claude) ===
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    tools=TOOLS,  # tools at top level
    messages=messages,
)
# Tool calls are in response.content as tool_use blocks
# Results go back as tool_result in a user message
```
```python
# === OpenAI ===
response = openai.chat.completions.create(
    model="gpt-4o",
    tools=[
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t["description"],
                # OpenAI expects "parameters" where Anthropic uses "input_schema"
                "parameters": t["input_schema"],
            },
        }
        for t in TOOLS
    ],
    messages=messages,
)
# Tool calls are in response.choices[0].message.tool_calls
# Results go back as role="tool" messages
```

| Aspect | Anthropic SDK | OpenAI SDK |
|---|---|---|
| Tool definition | input_schema (JSON Schema) | parameters (JSON Schema) |
| Tool call location | content[] blocks with type: "tool_use" | tool_calls[] on the message |
| Result format | tool_result in user message | Separate role: "tool" message |
| Parallel tools | Multiple tool_use blocks in one response | Multiple entries in tool_calls |
| Stop reason | stop_reason: "tool_use" | finish_reason: "tool_calls" |
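If you target both SDKs, the JSON Schemas themselves can be shared; only the wrapper differs. A small adapter sketch, assuming Anthropic-style definitions like the `TOOLS` list above (a one-tool copy is inlined here so the snippet is self-contained):

```python
def to_openai_tools(anthropic_tools: list[dict]) -> list[dict]:
    """Convert Anthropic tool definitions to OpenAI's function-calling format.

    The JSON Schema body is identical in both SDKs; only the wrapper differs:
    input_schema becomes parameters, nested under {"type": "function"}.
    """
    return [
        {
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t["description"],
                "parameters": t["input_schema"],
            },
        }
        for t in anthropic_tools
    ]


# One tool from this post, for demonstration:
TOOLS = [
    {
        "name": "list_tables",
        "description": "List all tables in the data lake.",
        "input_schema": {"type": "object", "properties": {}, "required": []},
    }
]
```

Maintaining one canonical tool list and converting at the call site keeps descriptions and schemas from drifting between the two backends.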
Full Working Script
"""claude_data_agent.py - Data lake agent with Anthropic Claude SDK."""
import anthropicfrom dataspoc_lens import LensClient
lens = LensClient()client = anthropic.Anthropic()
TOOLS = [ { "name": "list_tables", "description": "List all tables in the data lake.", "input_schema": { "type": "object", "properties": {}, "required": [], }, }, { "name": "get_schema", "description": "Get column names and types for a table.", "input_schema": { "type": "object", "properties": { "table_name": {"type": "string", "description": "Table name"} }, "required": ["table_name"], }, }, { "name": "run_query", "description": "Execute a DuckDB SQL query. Include LIMIT unless aggregating.", "input_schema": { "type": "object", "properties": { "sql": {"type": "string", "description": "SQL query"} }, "required": ["sql"], }, },]
SYSTEM = ( "You are a data analyst. List tables first, check schemas, then query. " "Use DuckDB SQL. Present results with business context.")
def execute_tool(name: str, inp: dict) -> str: if name == "list_tables": return "\n".join(lens.tables()) elif name == "get_schema": schema = lens.schema(inp["table_name"]) return "\n".join(f"{c['name']} ({c['type']})" for c in schema) elif name == "run_query": return lens.query(inp["sql"]).to_string(max_rows=50) return f"Unknown tool: {name}"
def chat(messages: list[dict]) -> tuple[str, list[dict]]: while True: resp = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, system=SYSTEM, tools=TOOLS, messages=messages, ) messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason == "end_turn": return "".join(b.text for b in resp.content if b.type == "text"), messages
results = [] for block in resp.content: if block.type == "tool_use": print(f" [tool] {block.name}({block.input})") results.append({ "type": "tool_result", "tool_use_id": block.id, "content": execute_tool(block.name, block.input), }) messages.append({"role": "user", "content": results})
def main(): messages = [] print("Data Lake Agent (Claude) -- type 'quit' to exit\n") while True: q = input("You: ") if q.lower() in ("quit", "exit", "q"): break messages.append({"role": "user", "content": q}) answer, messages = chat(messages) print(f"\nAgent: {answer}\n")
if __name__ == "__main__": main()Run it:
```bash
python claude_data_agent.py
```

Claude discovers your tables, understands schemas, writes SQL, and explains results — all grounded in real data from your cloud Parquet files. No hallucinated numbers, no approximate retrieval, just precise queries on your actual data lake.
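One last note: the loop's message bookkeeping can be checked without API credentials or a data lake by driving the same shape with scripted responses. A sketch; `FakeBlock` and `FakeResponse` are minimal stand-ins for the SDK's response objects (not part of the Anthropic SDK), and `run_loop` mirrors `chat()` without the network call:

```python
from dataclasses import dataclass, field


@dataclass
class FakeBlock:
    """Stand-in for a content block (text or tool_use)."""
    type: str
    text: str = ""
    name: str = ""
    input: dict = field(default_factory=dict)
    id: str = ""


@dataclass
class FakeResponse:
    """Stand-in for a Messages API response."""
    stop_reason: str
    content: list


def run_loop(responses, execute_tool, messages):
    """Same shape as chat(): consume scripted responses until the model stops calling tools."""
    responses = iter(responses)
    while True:
        resp = next(responses)
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": execute_tool(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})


# Script: one tool call, then a final answer.
scripted = [
    FakeResponse("tool_use", [FakeBlock(type="tool_use", name="list_tables", id="t1")]),
    FakeResponse("end_turn", [FakeBlock(type="text", text="Two tables found.")]),
]
messages = [{"role": "user", "content": "What data do we have?"}]
answer = run_loop(scripted, lambda n, i: "raw.orders\nraw.customers", messages)
```

After the run, `messages` alternates user, assistant (tool_use), user (tool_result), assistant — the exact transcript structure the real API expects.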