anthropicclaudepythonsdkdata-analysis

Construyendo un Agente de Consulta de Datos con Anthropic Claude SDK y DataSpoc

Michael San Martim · 2026-04-25

La función tool_use de Claude convierte al modelo en un agente que puede llamar tus funciónes, inspeccionar resultados y decidir qué hacer después. Combínalo con DataSpoc Lens y obtienes una IA que consulta tu data lake con SQL real — sin adivinar, sin alucinar números.

Este artículo construye un agente de consulta de datos completo usando el SDK de Python de Anthropic. Definirás herramientas, implementarás el bucle de mensajes y manejarás conversaciones de múltiples pasos donde Claude descubre schemas, escribe consultas y responde preguntas de seguimiento.

Requisitos Previos

Terminal window
pip install anthropic dataspoc-lens
export ANTHROPIC_API_KEY="sk-ant-..."

Configura tu data lake:

Terminal window
dataspoc-lens add-bucket s3://my-company-data --name production

Paso 1: Definir Herramientas cómo JSON Schema

El tool_use de Claude requiere herramientas definidas cómo JSON schemas. Cada herramienta se mapea a un método de LensClient:

TOOLS = [
{
"name": "list_tables",
"description": (
"List all available tables in the data lake. "
"Call this first to discover what data exists."
),
"input_schema": {
"type": "object",
"properties": {},
"required": [],
},
},
{
"name": "get_schema",
"description": (
"Get column names and types for a table. "
"Always call this before writing SQL to know exact column names."
),
"input_schema": {
"type": "object",
"properties": {
"table_name": {
"type": "string",
"description": "Full table name, e.g. 'raw.orders'",
}
},
"required": ["table_name"],
},
},
{
"name": "run_query",
"description": (
"Execute a SQL query against the data lake using DuckDB syntax. "
"Returns up to 50 rows. Always include LIMIT unless aggregating."
),
"input_schema": {
"type": "object",
"properties": {
"sql": {
"type": "string",
"description": "DuckDB-compatible SQL query",
}
},
"required": ["sql"],
},
},
{
"name": "ask_ai",
"description": (
"Ask a natural language question about the data. "
"Uses the built-in AI query engine. Good for quick exploration."
),
"input_schema": {
"type": "object",
"properties": {
"question": {
"type": "string",
"description": "Natural language question about the data",
}
},
"required": ["question"],
},
},
]

Paso 2: Implementar la Ejecución de Herramientas

Cada nombre de herramienta se mapea a una llamada de LensClient:

from dataspoc_lens import LensClient
lens = LensClient()
def execute_tool(name: str, input: dict) -> str:
"""Route tool calls to LensClient methods."""
if name == "list_tables":
tables = lens.tables()
return "\n".join(tables)
elif name == "get_schema":
schema = lens.schema(input["table_name"])
lines = [f" {col['name']} ({col['type']})" for col in schema]
return f"Table: {input['table_name']}\n" + "\n".join(lines)
elif name == "run_query":
result = lens.query(input["sql"])
return result.to_string(max_rows=50)
elif name == "ask_ai":
result = lens.ask(input["question"])
return result
else:
return f"Unknown tool: {name}"

Paso 3: El Bucle del Agente

El patrón central para tool_use de Claude es un bucle: enviar mensajes, verificar llamadas a herramientas, ejecutarlas, devolver resultados y repetir hasta que Claude responda con texto.

import anthropic
client = anthropic.Anthropic()
SYSTEM_PROMPT = (
"You are a data analyst with access to a company data lake stored as "
"Parquet files in cloud storage. You query it using DuckDB SQL via tools.\n\n"
"Workflow:\n"
"1. List tables to see what's available\n"
"2. Check schemas before writing SQL\n"
"3. Write precise SQL queries\n"
"4. Present results with business context\n\n"
"Use DuckDB syntax. For dates, use DATE '2026-01-01' format."
)
def chat(messages: list[dict]) -> tuple[str, list[dict]]:
"""Send messages to Claude and handle tool calls in a loop."""
while True:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
system=SYSTEM_PROMPT,
tools=TOOLS,
messages=messages,
)
# Collect the assistant's response
messages.append({"role": "assistant", "content": response.content})
# If the model is done (no more tool calls), return the text
if response.stop_reason == "end_turn":
text = "".join(
block.text for block in response.content
if block.type == "text"
)
return text, messages
# Process tool calls
tool_results = []
for block in response.content:
if block.type == "tool_use":
print(f" [tool] {block.name}({block.input})")
result = execute_tool(block.name, block.input)
tool_results.append(
{
"type": "tool_result",
"tool_use_id": block.id,
"content": result,
}
)
# Send tool results back to Claude
messages.append({"role": "user", "content": tool_results})

Este bucle maneja el ciclo de vida completo: Claude pide usar una herramienta, la ejecutamos, enviamos el resultado, y Claude pide otra herramienta o responde con su respuesta final.

Paso 4: Sesión Interactiva

def main():
messages = []
print("Data Lake Agent (Claude) -- type 'quit' to exit")
print("-" * 50)
while True:
question = input("\nYou: ")
if question.lower() in ("quit", "exit", "q"):
break
messages.append({"role": "user", "content": question})
answer, messages = chat(messages)
print(f"\nAgent: {answer}")
if __name__ == "__main__":
main()

Ejemplo de Sesión

You: What data do we have?
[tool] list_tables({})
Agent: Here are the tables in your data lake:
- **raw.customers** -- Customer records
- **raw.orders** -- Order transactions
- **raw.products** -- Product catalog
- **curated.daily_revenue** -- Aggregated daily revenue
- **curated.customer_segments** -- Customer segmentation
You: Who are our top 5 customers by total spend?
[tool] get_schema({"table_name": "raw.orders"})
[tool] get_schema({"table_name": "raw.customers"})
[tool] run_query({"sql": "SELECT c.name, SUM(o.amount) as total_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id GROUP BY c.name ORDER BY total_spend DESC LIMIT 5"})
Agent: Here are your top 5 customers by total spend:
| Customer | Total Spend |
|-------------------|-------------|
| Acme Corporation | $1,245,800 |
| TechFlow Inc | $892,340 |
| Global Systems | $756,120 |
| DataVentures | $623,450 |
| CloudFirst Ltd | $589,200 |
Acme Corporation leads significantly, spending 40% more than the second
highest customer.
You: Show me Acme's orders over time, monthly
[tool] run_query({"sql": "SELECT DATE_TRUNC('month', o.order_date) as month, COUNT(*) as order_count, SUM(o.amount) as monthly_spend FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.customer_id WHERE c.name = 'Acme Corporation' GROUP BY 1 ORDER BY 1"})
Agent: Acme Corporation's monthly orders show a clear growth trend...

Comparación: SDK de Anthropic vs. Function Calling de OpenAI

El patrón es casi idéntico. Aquí están las diferencias clave:

# === Anthropic (Claude) ===
response = client.messages.create(
model="claude-sonnet-4-20250514",
tools=TOOLS, # tools at top level
messages=messages,
)
# Tool calls are in response.content as tool_use blocks
# Results go back as tool_result in a user message
# === OpenAI ===
response = openai.chat.completions.create(
model="gpt-4o",
tools=[{"type": "function", "function": t} for t in TOOLS],
messages=messages,
)
# Tool calls are in response.choices[0].message.tool_calls
# Results go back as role="tool" messages
AspectoSDK de AnthropicSDK de OpenAI
Definición de herramientainput_schema (JSON Schema)parameters (JSON Schema)
Ubicación de la llamadacontent[] blocks con type: "tool_use"tool_calls[] en el mensaje
Formato de resultadotool_result en mensaje de usuarioMensaje separado role: "tool"
Herramientas paralelasMúltiples bloques tool_use en una respuestaMúltiples entradas en tool_calls
Razón de paradastop_reason: "tool_use"finish_reason: "tool_calls"

Script Completo Funcional

"""claude_data_agent.py - Data lake agent with Anthropic Claude SDK."""
import anthropic
from dataspoc_lens import LensClient
lens = LensClient()
client = anthropic.Anthropic()
TOOLS = [
{
"name": "list_tables",
"description": "List all tables in the data lake.",
"input_schema": {
"type": "object",
"properties": {},
"required": [],
},
},
{
"name": "get_schema",
"description": "Get column names and types for a table.",
"input_schema": {
"type": "object",
"properties": {
"table_name": {"type": "string", "description": "Table name"}
},
"required": ["table_name"],
},
},
{
"name": "run_query",
"description": "Execute a DuckDB SQL query. Include LIMIT unless aggregating.",
"input_schema": {
"type": "object",
"properties": {
"sql": {"type": "string", "description": "SQL query"}
},
"required": ["sql"],
},
},
]
SYSTEM = (
"You are a data analyst. List tables first, check schemas, then query. "
"Use DuckDB SQL. Present results with business context."
)
def execute_tool(name: str, inp: dict) -> str:
if name == "list_tables":
return "\n".join(lens.tables())
elif name == "get_schema":
schema = lens.schema(inp["table_name"])
return "\n".join(f"{c['name']} ({c['type']})" for c in schema)
elif name == "run_query":
return lens.query(inp["sql"]).to_string(max_rows=50)
return f"Unknown tool: {name}"
def chat(messages: list[dict]) -> tuple[str, list[dict]]:
while True:
resp = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
system=SYSTEM,
tools=TOOLS,
messages=messages,
)
messages.append({"role": "assistant", "content": resp.content})
if resp.stop_reason == "end_turn":
return "".join(b.text for b in resp.content if b.type == "text"), messages
results = []
for block in resp.content:
if block.type == "tool_use":
print(f" [tool] {block.name}({block.input})")
results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": execute_tool(block.name, block.input),
})
messages.append({"role": "user", "content": results})
def main():
messages = []
print("Data Lake Agent (Claude) -- type 'quit' to exit\n")
while True:
q = input("You: ")
if q.lower() in ("quit", "exit", "q"):
break
messages.append({"role": "user", "content": q})
answer, messages = chat(messages)
print(f"\nAgent: {answer}\n")
if __name__ == "__main__":
main()

Ejecútalo:

Terminal window
python claude_data_agent.py

Claude descubre tus tablas, entiende los schemas, escribe SQL y explica los resultados — todo fundamentado en datos reales de tus archivos Parquet en la nube. Sin números alucinados, sin recuperación apróximada, solo consultas precisas sobre tu data lake real.

Recomendados