
Examples

Three complete examples showing DataSpoc Lens in real-world scenarios.

Example 1: Exploring an e-commerce data lake


Discover and analyze data from an e-commerce company's data lake in S3.

Terminal window
pip install dataspoc-lens[s3]
dataspoc-lens init
dataspoc-lens add-bucket s3://acme-ecommerce-data
Bucket added: s3://acme-ecommerce-data
Discovering tables...
┌──────────────┬─────────┬────────┬──────────┐
│ Table        │ Columns │ Rows   │ Source   │
├──────────────┼─────────┼────────┼──────────┤
│ customers    │ 8       │ 54200  │ postgres │
│ orders       │ 12      │ 486000 │ postgres │
│ products     │ 9       │ 3120   │ postgres │
│ events       │ 6       │ 2.1M   │ segment  │
└──────────────┴─────────┴────────┴──────────┘
4 table(s) found.
Terminal window
dataspoc-lens catalog --detail orders
┌─────────────────┬───────────┐
│ Column          │ Type      │
├─────────────────┼───────────┤
│ order_id        │ INTEGER   │
│ customer_id     │ INTEGER   │
│ order_date      │ DATE      │
│ ship_date       │ DATE      │
│ total           │ DOUBLE    │
│ discount        │ DOUBLE    │
│ status          │ VARCHAR   │
│ payment_method  │ VARCHAR   │
│ channel         │ VARCHAR   │
│ region          │ VARCHAR   │
│ items_count     │ INTEGER   │
│ created_at      │ TIMESTAMP │
└─────────────────┴───────────┘
Terminal window
dataspoc-lens query "
SELECT region, COUNT(*) as orders, SUM(total) as revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY region
ORDER BY revenue DESC
"
┌──────────────┬────────┬────────────┐
│ region       │ orders │ revenue    │
├──────────────┼────────┼────────────┤
│ North America│ 45200  │ 3,842,000  │
│ Europe       │ 28300  │ 2,156,000  │
│ Asia Pacific │ 18900  │ 1,420,000  │
│ Latin America│ 7600   │ 580,000    │
└──────────────┴────────┴────────────┘
(4 row(s), 0.342s)
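The query above is standard SQL, so the same rollup can be reproduced against any SQL engine. A minimal sketch using Python's built-in sqlite3 with a few made-up sample rows (the table and column names mirror the orders schema shown earlier; the data is illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, order_date TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("North America", "2026-01-15", 120.0),
        ("North America", "2026-02-02", 80.0),
        ("Europe", "2026-01-20", 95.0),
        ("Europe", "2025-12-30", 40.0),  # excluded by the date predicate below
    ],
)

# Same aggregation as the dataspoc-lens query: count and sum per region,
# restricted to orders placed on or after 2026-01-01.
rows = conn.execute("""
    SELECT region, COUNT(*) AS orders, SUM(total) AS revenue
    FROM orders
    WHERE order_date >= '2026-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""").fetchall()
# → [('North America', 2, 200.0), ('Europe', 1, 95.0)]
```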
Terminal window
dataspoc-lens shell
lens> .tables
customers
orders
products
events
lens> SELECT payment_method, COUNT(*) as cnt
...> FROM orders GROUP BY payment_method ORDER BY cnt DESC;
┌────────────────┬────────┐
│ payment_method │ cnt    │
├────────────────┼────────┤
│ credit_card    │ 245000 │
│ debit_card     │ 128000 │
│ paypal         │ 82000  │
│ bank_transfer  │ 31000  │
└────────────────┴────────┘
(4 row(s), 0.215s)
lens> .export csv /tmp/payment_methods.csv
Exported 4 row(s) to /tmp/payment_methods.csv
Terminal window
dataspoc-lens query "
SELECT DATE_TRUNC('month', order_date) as month,
       COUNT(*) as orders,
       SUM(total) as revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY month ORDER BY month
" --export monthly_report.csv
(16 row(s), 0.456s)
Exported 16 row(s) to monthly_report.csv
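DATE_TRUNC('month', order_date) buckets every order into the first day of its month before grouping. Engines without DATE_TRUNC can get the same monthly bucketing by formatting the date as a year-month string; a sketch with the standard library's sqlite3 and strftime (sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2025-01-05", 100.0), ("2025-01-20", 50.0), ("2025-02-11", 75.0)],
)

# strftime('%Y-%m', ...) plays the role of DATE_TRUNC('month', ...):
# both collapse all dates in a month into a single group key.
rows = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS month,
           COUNT(*) AS orders,
           SUM(total) AS revenue
    FROM orders
    WHERE order_date >= '2025-01-01'
    GROUP BY month ORDER BY month
""").fetchall()
# → [('2025-01', 2, 150.0), ('2025-02', 1, 75.0)]
```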

Example 2: Building a curated layer with transformations


Create clean, aggregated datasets from raw data using SQL transformations.

Terminal window
dataspoc-lens init

Create the following files in ~/.dataspoc-lens/transforms/:

001_clean_customers.sql:

CREATE OR REPLACE TABLE clean_customers AS
SELECT
    id,
    TRIM(LOWER(email)) AS email,
    COALESCE(name, 'Unknown') AS name,
    created_at,
    CASE
        WHEN status IN ('active', 'Active', 'ACTIVE') THEN 'active'
        WHEN status IN ('inactive', 'Inactive') THEN 'inactive'
        ELSE 'unknown'
    END AS status
FROM customers
WHERE email IS NOT NULL
  AND email NOT LIKE '%test%';

002_order_metrics.sql:

CREATE OR REPLACE TABLE order_metrics AS
SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(total) AS lifetime_value,
    AVG(total) AS avg_order_value,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order,
    DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order,
    COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled_orders
FROM orders
GROUP BY customer_id;
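The per-customer rollup in 002_order_metrics.sql can be sanity-checked with a small pure-Python equivalent. Names here are illustrative, and the as_of parameter stands in for the SQL CURRENT_DATE:

```python
from datetime import date

def order_metrics(orders: list[dict], as_of: date) -> dict:
    """Mirror of 002_order_metrics.sql for one customer's orders."""
    totals = [o["total"] for o in orders]
    dates = [o["order_date"] for o in orders]
    return {
        "total_orders": len(orders),
        "lifetime_value": sum(totals),
        "avg_order_value": sum(totals) / len(orders),
        "first_order": min(dates),
        "last_order": max(dates),
        # DATEDIFF('day', MAX(order_date), CURRENT_DATE)
        "days_since_last_order": (as_of - max(dates)).days,
        # COUNT(CASE WHEN status = 'cancelled' THEN 1 END)
        "cancelled_orders": sum(1 for o in orders if o["status"] == "cancelled"),
    }
```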

003_customer_segments.sql:

CREATE OR REPLACE TABLE customer_segments AS
SELECT
    c.id,
    c.name,
    c.email,
    m.total_orders,
    m.lifetime_value,
    m.avg_order_value,
    m.days_since_last_order,
    CASE
        WHEN m.lifetime_value > 5000 AND m.days_since_last_order < 30 THEN 'VIP Active'
        WHEN m.lifetime_value > 5000 AND m.days_since_last_order >= 30 THEN 'VIP At Risk'
        WHEN m.lifetime_value > 1000 THEN 'Regular'
        WHEN m.total_orders = 1 THEN 'One-time'
        ELSE 'Low Value'
    END AS segment
FROM clean_customers c
LEFT JOIN order_metrics m ON c.id = m.customer_id;
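The CASE ladder in 003_customer_segments.sql evaluates top to bottom, and because of the LEFT JOIN a customer with no orders has NULL metrics: every comparison against NULL is false, so such rows fall through to the ELSE branch. A hypothetical Python mirror of the ladder (None modeling SQL NULL):

```python
def segment(lifetime_value, total_orders, days_since_last_order) -> str:
    """Mirror of the CASE ladder in 003_customer_segments.sql.

    None models SQL NULL: comparisons with NULL are false, so
    customers without order metrics land in 'Low Value'.
    """
    lv, d = lifetime_value, days_since_last_order
    if lv is not None and lv > 5000 and d is not None and d < 30:
        return "VIP Active"
    if lv is not None and lv > 5000 and d is not None and d >= 30:
        return "VIP At Risk"
    if lv is not None and lv > 1000:
        return "Regular"
    if total_orders == 1:
        return "One-time"
    return "Low Value"
```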
Terminal window
dataspoc-lens transform list
┌───┬───────────────────────────┐
│ # │ File                      │
├───┼───────────────────────────┤
│ 1 │ 001_clean_customers.sql   │
│ 2 │ 002_order_metrics.sql     │
│ 3 │ 003_customer_segments.sql │
└───┴───────────────────────────┘
Terminal window
dataspoc-lens transform run
Running 001_clean_customers.sql... OK (0.8s)
Running 002_order_metrics.sql... OK (2.1s)
Running 003_customer_segments.sql... OK (0.6s)
3 transform(s) completed successfully.
Terminal window
dataspoc-lens query "
SELECT segment, COUNT(*) as customers, AVG(lifetime_value) as avg_ltv
FROM customer_segments
GROUP BY segment
ORDER BY avg_ltv DESC
"
┌──────────────┬───────────┬────────────┐
│ segment      │ customers │ avg_ltv    │
├──────────────┼───────────┼────────────┤
│ VIP Active   │ 420       │ 12,450.00  │
│ VIP At Risk  │ 185       │ 8,200.00   │
│ Regular      │ 3,200     │ 2,150.00   │
│ Low Value    │ 2,100     │ 320.00     │
│ One-time     │ 8,400     │ 85.00      │
└──────────────┴───────────┴────────────┘
(5 row(s), 0.052s)

Example 3: Analyzing multi-cloud data in a Jupyter notebook

Connect to buckets in multiple clouds and analyze them together in a Jupyter notebook.

Terminal window
pip install dataspoc-lens[s3,gcs,jupyter]
dataspoc-lens init
# Add S3 bucket (product data)
dataspoc-lens add-bucket s3://acme-product-data
# Add GCS bucket (analytics data)
dataspoc-lens add-bucket gs://acme-analytics
Bucket added: s3://acme-product-data
Discovering tables...
2 table(s) found.
Bucket added: gs://acme-analytics
Discovering tables...
3 table(s) found.
Terminal window
dataspoc-lens cache products
dataspoc-lens cache page_views
dataspoc-lens cache conversions
Terminal window
dataspoc-lens notebook
# Cell 1: Check available tables from both clouds
%%sql
SHOW TABLES
┌──────────────────┐
│ name             │
├──────────────────┤
│ products         │
│ inventory        │
│ page_views       │
│ conversions      │
│ sessions         │
└──────────────────┘
# Cell 2: Cross-cloud join — product data (S3) with analytics (GCS)
%%sql
SELECT
    p.name AS product,
    p.category,
    COUNT(DISTINCT pv.session_id) AS views,
    COUNT(DISTINCT c.conversion_id) AS purchases,
    ROUND(COUNT(DISTINCT c.conversion_id)::FLOAT / NULLIF(COUNT(DISTINCT pv.session_id), 0) * 100, 2) AS conversion_rate
FROM products p
LEFT JOIN page_views pv ON p.product_id = pv.product_id
LEFT JOIN conversions c ON p.product_id = c.product_id
GROUP BY p.name, p.category
ORDER BY conversion_rate DESC
LIMIT 10
# Cell 3: Visualize conversion funnel
import pandas as pd
df = conn.execute("""
    SELECT
        category,
        SUM(views) as total_views,
        SUM(purchases) as total_purchases
    FROM (
        SELECT p.category,
               COUNT(DISTINCT pv.session_id) AS views,
               COUNT(DISTINCT c.conversion_id) AS purchases
        FROM products p
        LEFT JOIN page_views pv ON p.product_id = pv.product_id
        LEFT JOIN conversions c ON p.product_id = c.product_id
        GROUP BY p.category, p.product_id
    )
    GROUP BY category
    ORDER BY total_views DESC
""").fetchdf()
df.plot(x='category', y=['total_views', 'total_purchases'],
        kind='bar', title='Views vs Purchases by Category')
# Cell 4: Check cache status
from dataspoc_lens import LensClient

with LensClient() as client:
    for entry in client.cache_status():
        size_mb = entry["size_bytes"] / (1024 * 1024)
        print(f"{entry['table']}: {entry['status']} ({size_mb:.1f} MB)")
products: fresh (0.8 MB)
page_views: fresh (45.2 MB)
conversions: fresh (12.1 MB)