Construindo um Agente de Consulta de Dados com Anthropic Claude SDK e DataSpoc
O recurso tool_use do Claude transforma o modelo em um agente que pode chamar suas funções, inspecionar resultados e decidir o que fazer em seguida. Combine isso com o DataSpoc Lens e você obtém uma IA que consulta seu data lake com SQL real — sem adivinhação, sem números alucinados.
Este post constrói um agente de consulta de dados completo usando o SDK Python da Anthropic. Você vai definir ferramentas, implementar o loop de mensagens e lidar com conversas multi-etapa onde o Claude descobre schemas, escreve queries e responde perguntas de acompanhamento.
Pré-requisitos
pip install anthropic dataspoc-lensexport ANTHROPIC_API_KEY="sk-ant-..."Configure seu data lake:
dataspoc-lens add-bucket s3://my-company-data --name productionPasso 1: Definir Ferramentas como JSON Schema
O tool_use do Claude requer ferramentas definidas como JSON schemas. Cada ferramenta mapeia para um método do LensClient:
TOOLS = [ { "name": "list_tables", "description": ( "List all available tables in the data lake. " "Call this first to discover what data exists." ), "input_schema": { "type": "object", "properties": {}, "required": [], }, }, { "name": "get_schema", "description": ( "Get column names and types for a table. " "Always call this before writing SQL to know exact column names." ), "input_schema": { "type": "object", "properties": { "table_name": { "type": "string", "description": "Full table name, e.g. 'raw.orders'", } }, "required": ["table_name"], }, }, { "name": "run_query", "description": ( "Execute a SQL query against the data lake using DuckDB syntax. " "Returns up to 50 rows. Always include LIMIT unless aggregating." ), "input_schema": { "type": "object", "properties": { "sql": { "type": "string", "description": "DuckDB-compatible SQL query", } }, "required": ["sql"], }, }, { "name": "ask_ai", "description": ( "Ask a natural language question about the data. " "Uses the built-in AI query engine. Good for quick exploration." ), "input_schema": { "type": "object", "properties": { "question": { "type": "string", "description": "Natural language question about the data", } }, "required": ["question"], }, },]Passo 2: Implementar a Execução das Ferramentas
Cada nome de ferramenta mapeia para uma chamada do LensClient:
from dataspoc_lens import LensClient
lens = LensClient()
def execute_tool(name: str, input: dict) -> str: """Route tool calls to LensClient methods.""" if name == "list_tables": tables = lens.tables() return "\n".join(tables)
elif name == "get_schema": schema = lens.schema(input["table_name"]) lines = [f" {col['name']} ({col['type']})" for col in schema] return f"Table: {input['table_name']}\n" + "\n".join(lines)
elif name == "run_query": result = lens.query(input["sql"]) return result.to_string(max_rows=50)
elif name == "ask_ai": result = lens.ask(input["question"]) return result
else: return f"Unknown tool: {name}"Passo 3: O Loop do Agente
O padrão central para o tool_use do Claude é um loop: enviar mensagens, verificar chamadas de ferramenta, executá-las, enviar resultados de volta, repetir até que o Claude responda com texto.
import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = ( "You are a data analyst with access to a company data lake stored as " "Parquet files in cloud storage. You query it using DuckDB SQL via tools.\n\n" "Workflow:\n" "1. List tables to see what's available\n" "2. Check schemas before writing SQL\n" "3. Write precise SQL queries\n" "4. Present results with business context\n\n" "Use DuckDB syntax. For dates, use DATE '2026-01-01' format.")
def chat(messages: list[dict]) -> tuple[str, list[dict]]: """Send messages to Claude and handle tool calls in a loop.""" while True: response = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, system=SYSTEM_PROMPT, tools=TOOLS, messages=messages, )
# Collect the assistant's response messages.append({"role": "assistant", "content": response.content})
# If the model is done (no more tool calls), return the text if response.stop_reason == "end_turn": text = "".join( block.text for block in response.content if block.type == "text" ) return text, messages
# Process tool calls tool_results = [] for block in response.content: if block.type == "tool_use": print(f" [tool] {block.name}({block.input})") result = execute_tool(block.name, block.input) tool_results.append( { "type": "tool_result", "tool_use_id": block.id, "content": result, } )
# Send tool results back to Claude messages.append({"role": "user", "content": tool_results})Este loop lida com o ciclo de vida completo: o Claude pede para usar uma ferramenta, nós a executamos, enviamos o resultado, e o Claude pede outra ferramenta ou responde com sua resposta final.
Passo 4: Sessão Interativa
def main(): messages = [] print("Data Lake Agent (Claude) -- type 'quit' to exit") print("-" * 50)
while True: question = input("\nYou: ") if question.lower() in ("quit", "exit", "q"): break
messages.append({"role": "user", "content": question}) answer, messages = chat(messages) print(f"\nAgent: {answer}")
if __name__ == "__main__": main()Exemplo de Sessão
You: What data do we have?
[tool] list_tables({})
Agent: Here are the tables in your data lake:
- **raw.customers** -- Customer records- **raw.orders** -- Order transactions- **raw.products** -- Product catalog- **curated.daily_revenue** -- Aggregated daily revenue- **curated.customer_segments** -- Customer segmentation
You: Who are our top 5 customers by total spend?
[tool] get_schema({"table_name": "raw.orders"}) [tool] get_schema({"table_name": "raw.customers"}) [tool] run_query({"sql": "SELECT c.name, SUM(o.amount) as total_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id GROUP BY c.name ORDER BY total_spend DESC LIMIT 5"})
Agent: Here are your top 5 customers by total spend:
| Customer | Total Spend ||-------------------|-------------|| Acme Corporation | $1,245,800 || TechFlow Inc | $892,340 || Global Systems | $756,120 || DataVentures | $623,450 || CloudFirst Ltd | $589,200 |
Acme Corporation leads significantly, spending 40% more than the secondhighest customer.
You: Show me Acme's orders over time, monthly
[tool] run_query({"sql": "SELECT DATE_TRUNC('month', o.order_date) as month, COUNT(*) as order_count, SUM(o.amount) as monthly_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id WHERE c.name = 'Acme Corporation' GROUP BY 1 ORDER BY 1"})
Agent: Acme Corporation's monthly orders show a clear growth trend...Comparação: Anthropic SDK vs. OpenAI Function Calling
O padrão é quase idêntico. Aqui estão as diferenças principais:
# === Anthropic (Claude) ===response = client.messages.create( model="claude-sonnet-4-20250514", tools=TOOLS, # tools at top level messages=messages,)# Tool calls are in response.content as tool_use blocks# Results go back as tool_result in a user message
# === OpenAI ===response = openai.chat.completions.create( model="gpt-4o", tools=[{"type": "function", "function": t} for t in TOOLS], messages=messages,)# Tool calls are in response.choices[0].message.tool_calls# Results go back as role="tool" messages| Aspecto | Anthropic SDK | OpenAI SDK |
|---|---|---|
| Definição de ferramenta | input_schema (JSON Schema) | parameters (JSON Schema) |
| Localização da chamada | content[] blocks com type: "tool_use" | tool_calls[] na mensagem |
| Formato do resultado | tool_result em mensagem do usuário | Mensagem separada role: "tool" |
| Ferramentas paralelas | Múltiplos blocos tool_use em uma resposta | Múltiplas entradas em tool_calls |
| Razão de parada | stop_reason: "tool_use" | finish_reason: "tool_calls" |
Script Completo Funcional
"""claude_data_agent.py - Data lake agent with Anthropic Claude SDK."""
import anthropicfrom dataspoc_lens import LensClient
lens = LensClient()client = anthropic.Anthropic()
TOOLS = [ { "name": "list_tables", "description": "List all tables in the data lake.", "input_schema": { "type": "object", "properties": {}, "required": [], }, }, { "name": "get_schema", "description": "Get column names and types for a table.", "input_schema": { "type": "object", "properties": { "table_name": {"type": "string", "description": "Table name"} }, "required": ["table_name"], }, }, { "name": "run_query", "description": "Execute a DuckDB SQL query. Include LIMIT unless aggregating.", "input_schema": { "type": "object", "properties": { "sql": {"type": "string", "description": "SQL query"} }, "required": ["sql"], }, },]
SYSTEM = ( "You are a data analyst. List tables first, check schemas, then query. " "Use DuckDB SQL. Present results with business context.")
def execute_tool(name: str, inp: dict) -> str: if name == "list_tables": return "\n".join(lens.tables()) elif name == "get_schema": schema = lens.schema(inp["table_name"]) return "\n".join(f"{c['name']} ({c['type']})" for c in schema) elif name == "run_query": return lens.query(inp["sql"]).to_string(max_rows=50) return f"Unknown tool: {name}"
def chat(messages: list[dict]) -> tuple[str, list[dict]]: while True: resp = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=4096, system=SYSTEM, tools=TOOLS, messages=messages, ) messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason == "end_turn": return "".join(b.text for b in resp.content if b.type == "text"), messages
results = [] for block in resp.content: if block.type == "tool_use": print(f" [tool] {block.name}({block.input})") results.append({ "type": "tool_result", "tool_use_id": block.id, "content": execute_tool(block.name, block.input), }) messages.append({"role": "user", "content": results})
def main(): messages = [] print("Data Lake Agent (Claude) -- type 'quit' to exit\n") while True: q = input("You: ") if q.lower() in ("quit", "exit", "q"): break messages.append({"role": "user", "content": q}) answer, messages = chat(messages) print(f"\nAgent: {answer}\n")
if __name__ == "__main__": main()Execute:
python claude_data_agent.pyO Claude descobre suas tabelas, entende schemas, escreve SQL e explica resultados — tudo fundamentado em dados reais dos seus arquivos Parquet na nuvem. Sem números alucinados, sem recuperação aproximada, apenas consultas precisas no seu data lake real.