
Examples

Three complete examples showing DataSpoc Lens in real-world scenarios.

Example 1: Exploring an e-commerce data lake


Discover and analyze data from an e-commerce company's data lake in S3.

Terminal window
pip install dataspoc-lens[s3]
dataspoc-lens init
dataspoc-lens add-bucket s3://acme-ecommerce-data
Bucket added: s3://acme-ecommerce-data
Discovering tables...
┌──────────────┬─────────┬────────┬──────────┐
│ Table        │ Columns │ Rows   │ Source   │
├──────────────┼─────────┼────────┼──────────┤
│ customers    │ 8       │ 54200  │ postgres │
│ orders       │ 12      │ 486000 │ postgres │
│ products     │ 9       │ 3120   │ postgres │
│ events       │ 6       │ 2.1M   │ segment  │
└──────────────┴─────────┴────────┴──────────┘
4 table(s) found.
Terminal window
dataspoc-lens catalog --detail orders
┌─────────────────┬───────────┐
│ Column          │ Type      │
├─────────────────┼───────────┤
│ order_id        │ INTEGER   │
│ customer_id     │ INTEGER   │
│ order_date      │ DATE      │
│ ship_date       │ DATE      │
│ total           │ DOUBLE    │
│ discount        │ DOUBLE    │
│ status          │ VARCHAR   │
│ payment_method  │ VARCHAR   │
│ channel         │ VARCHAR   │
│ region          │ VARCHAR   │
│ items_count     │ INTEGER   │
│ created_at      │ TIMESTAMP │
└─────────────────┴───────────┘
Terminal window
dataspoc-lens query "
SELECT region, COUNT(*) as orders, SUM(total) as revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY region
ORDER BY revenue DESC
"
┌──────────────┬────────┬────────────┐
│ region       │ orders │ revenue    │
├──────────────┼────────┼────────────┤
│ North America│ 45200  │ 3,842,000  │
│ Europe       │ 28300  │ 2,156,000  │
│ Asia Pacific │ 18900  │ 1,420,000  │
│ Latin America│ 7600   │ 580,000    │
└──────────────┴────────┴────────────┘
(4 row(s), 0.342s)
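The query above is standard SQL, so the same rollup can be reproduced against any SQL engine. A minimal sketch using Python's built-in sqlite3 with a few made-up sample rows (the table and column names mirror the orders schema shown earlier; the data is illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, order_date TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("North America", "2026-01-15", 120.0),
        ("North America", "2026-02-02", 80.0),
        ("Europe", "2026-01-20", 95.0),
        ("Europe", "2025-12-30", 40.0),  # excluded by the date predicate below
    ],
)

# Same aggregation as the dataspoc-lens query: count and sum per region,
# restricted to orders placed on or after 2026-01-01.
rows = conn.execute("""
    SELECT region, COUNT(*) AS orders, SUM(total) AS revenue
    FROM orders
    WHERE order_date >= '2026-01-01'
    GROUP BY region
    ORDER BY revenue DESC
""").fetchall()
# → [('North America', 2, 200.0), ('Europe', 1, 95.0)]
```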
Terminal window
dataspoc-lens shell
lens> .tables
customers
orders
products
events
lens> SELECT payment_method, COUNT(*) as cnt
...> FROM orders GROUP BY payment_method ORDER BY cnt DESC;
┌────────────────┬────────┐
│ payment_method │ cnt    │
├────────────────┼────────┤
│ credit_card    │ 245000 │
│ debit_card     │ 128000 │
│ paypal         │ 82000  │
│ bank_transfer  │ 31000  │
└────────────────┴────────┘
(4 row(s), 0.215s)
lens> .export csv /tmp/payment_methods.csv
Exported 4 row(s) to /tmp/payment_methods.csv
Terminal window
dataspoc-lens query "
SELECT DATE_TRUNC('month', order_date) as month,
       COUNT(*) as orders,
       SUM(total) as revenue
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY month ORDER BY month
" --export monthly_report.csv
(16 row(s), 0.456s)
Exported 16 row(s) to monthly_report.csv
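DATE_TRUNC('month', order_date) buckets every order into the first day of its month before grouping. Engines without DATE_TRUNC can get the same monthly bucketing by formatting the date as a year-month string; a sketch with the standard library's sqlite3 and strftime (sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2025-01-05", 100.0), ("2025-01-20", 50.0), ("2025-02-11", 75.0)],
)

# strftime('%Y-%m', ...) plays the role of DATE_TRUNC('month', ...):
# both collapse all dates in a month into a single group key.
rows = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS month,
           COUNT(*) AS orders,
           SUM(total) AS revenue
    FROM orders
    WHERE order_date >= '2025-01-01'
    GROUP BY month ORDER BY month
""").fetchall()
# → [('2025-01', 2, 150.0), ('2025-02', 1, 75.0)]
```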

Example 2: Building a curated layer with transformations


Create clean, aggregated datasets from raw data using SQL transformations.

Terminal window
dataspoc-lens init

Create the following files in ~/.dataspoc-lens/transforms/:

001_clean_customers.sql:

CREATE OR REPLACE TABLE clean_customers AS
SELECT
    id,
    TRIM(LOWER(email)) AS email,
    COALESCE(name, 'Unknown') AS name,
    created_at,
    CASE
        WHEN status IN ('active', 'Active', 'ACTIVE') THEN 'active'
        WHEN status IN ('inactive', 'Inactive') THEN 'inactive'
        ELSE 'unknown'
    END AS status
FROM customers
WHERE email IS NOT NULL
  AND email NOT LIKE '%test%';

002_order_metrics.sql:

CREATE OR REPLACE TABLE order_metrics AS
SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(total) AS lifetime_value,
    AVG(total) AS avg_order_value,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order,
    DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order,
    COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled_orders
FROM orders
GROUP BY customer_id;
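The per-customer rollup in 002_order_metrics.sql can be sanity-checked with a small pure-Python equivalent. Names here are illustrative, and the as_of parameter stands in for the SQL CURRENT_DATE:

```python
from datetime import date

def order_metrics(orders: list[dict], as_of: date) -> dict:
    """Mirror of 002_order_metrics.sql for one customer's orders."""
    totals = [o["total"] for o in orders]
    dates = [o["order_date"] for o in orders]
    return {
        "total_orders": len(orders),
        "lifetime_value": sum(totals),
        "avg_order_value": sum(totals) / len(orders),
        "first_order": min(dates),
        "last_order": max(dates),
        # DATEDIFF('day', MAX(order_date), CURRENT_DATE)
        "days_since_last_order": (as_of - max(dates)).days,
        # COUNT(CASE WHEN status = 'cancelled' THEN 1 END)
        "cancelled_orders": sum(1 for o in orders if o["status"] == "cancelled"),
    }
```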

003_customer_segments.sql:

CREATE OR REPLACE TABLE customer_segments AS
SELECT
    c.id,
    c.name,
    c.email,
    m.total_orders,
    m.lifetime_value,
    m.avg_order_value,
    m.days_since_last_order,
    CASE
        WHEN m.lifetime_value > 5000 AND m.days_since_last_order < 30 THEN 'VIP Active'
        WHEN m.lifetime_value > 5000 AND m.days_since_last_order >= 30 THEN 'VIP At Risk'
        WHEN m.lifetime_value > 1000 THEN 'Regular'
        WHEN m.total_orders = 1 THEN 'One-time'
        ELSE 'Low Value'
    END AS segment
FROM clean_customers c
LEFT JOIN order_metrics m ON c.id = m.customer_id;
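The CASE ladder in 003_customer_segments.sql evaluates top to bottom, and because of the LEFT JOIN a customer with no orders has NULL metrics: every comparison against NULL is false, so such rows fall through to the ELSE branch. A hypothetical Python mirror of the ladder (None modeling SQL NULL):

```python
def segment(lifetime_value, total_orders, days_since_last_order) -> str:
    """Mirror of the CASE ladder in 003_customer_segments.sql.

    None models SQL NULL: comparisons with NULL are false, so
    customers without order metrics land in 'Low Value'.
    """
    lv, d = lifetime_value, days_since_last_order
    if lv is not None and lv > 5000 and d is not None and d < 30:
        return "VIP Active"
    if lv is not None and lv > 5000 and d is not None and d >= 30:
        return "VIP At Risk"
    if lv is not None and lv > 1000:
        return "Regular"
    if total_orders == 1:
        return "One-time"
    return "Low Value"
```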
Terminal window
dataspoc-lens transform list
┌───┬───────────────────────────┐
│ # │ File                      │
├───┼───────────────────────────┤
│ 1 │ 001_clean_customers.sql   │
│ 2 │ 002_order_metrics.sql     │
│ 3 │ 003_customer_segments.sql │
└───┴───────────────────────────┘
Terminal window
dataspoc-lens transform run
Running 001_clean_customers.sql... OK (0.8s)
Running 002_order_metrics.sql... OK (2.1s)
Running 003_customer_segments.sql... OK (0.6s)
3 transform(s) completed successfully.
Terminal window
dataspoc-lens query "
SELECT segment, COUNT(*) as customers, AVG(lifetime_value) as avg_ltv
FROM customer_segments
GROUP BY segment
ORDER BY avg_ltv DESC
"
┌──────────────┬───────────┬────────────┐
│ segment      │ customers │ avg_ltv    │
├──────────────┼───────────┼────────────┤
│ VIP Active   │ 420       │ 12,450.00  │
│ VIP At Risk  │ 185       │ 8,200.00   │
│ Regular      │ 3,200     │ 2,150.00   │
│ Low Value    │ 2,100     │ 320.00     │
│ One-time     │ 8,400     │ 85.00      │
└──────────────┴───────────┴────────────┘
(5 row(s), 0.052s)

Example 3: Analyzing multi-cloud data in a Jupyter notebook

Connect to buckets in multiple clouds and analyze them together in a Jupyter notebook.

Terminal window
pip install dataspoc-lens[s3,gcs,jupyter]
dataspoc-lens init
# Add S3 bucket (product data)
dataspoc-lens add-bucket s3://acme-product-data
# Add GCS bucket (analytics data)
dataspoc-lens add-bucket gs://acme-analytics
Bucket added: s3://acme-product-data
Discovering tables...
2 table(s) found.
Bucket added: gs://acme-analytics
Discovering tables...
3 table(s) found.
Terminal window
dataspoc-lens cache products
dataspoc-lens cache page_views
dataspoc-lens cache conversions
Terminal window
dataspoc-lens notebook
# Cell 1: Check available tables from both clouds
%%sql
SHOW TABLES
┌──────────────────┐
│ name             │
├──────────────────┤
│ products         │
│ inventory        │
│ page_views       │
│ conversions      │
│ sessions         │
└──────────────────┘
# Cell 2: Cross-cloud join — product data (S3) with analytics (GCS)
%%sql
SELECT
    p.name AS product,
    p.category,
    COUNT(DISTINCT pv.session_id) AS views,
    COUNT(DISTINCT c.conversion_id) AS purchases,
    ROUND(COUNT(DISTINCT c.conversion_id)::FLOAT / NULLIF(COUNT(DISTINCT pv.session_id), 0) * 100, 2) AS conversion_rate
FROM products p
LEFT JOIN page_views pv ON p.product_id = pv.product_id
LEFT JOIN conversions c ON p.product_id = c.product_id
GROUP BY p.name, p.category
ORDER BY conversion_rate DESC
LIMIT 10
# Cell 3: Visualize conversion funnel
import pandas as pd
df = conn.execute("""
    SELECT
        category,
        SUM(views) as total_views,
        SUM(purchases) as total_purchases
    FROM (
        SELECT p.category,
               COUNT(DISTINCT pv.session_id) AS views,
               COUNT(DISTINCT c.conversion_id) AS purchases
        FROM products p
        LEFT JOIN page_views pv ON p.product_id = pv.product_id
        LEFT JOIN conversions c ON p.product_id = c.product_id
        GROUP BY p.category, p.product_id
    )
    GROUP BY category
    ORDER BY total_views DESC
""").fetchdf()
df.plot(x='category', y=['total_views', 'total_purchases'],
        kind='bar', title='Views vs Purchases by Category')
# Cell 4: Check cache status
from dataspoc_lens import LensClient

with LensClient() as client:
    for entry in client.cache_status():
        size_mb = entry["size_bytes"] / (1024 * 1024)
        print(f"{entry['table']}: {entry['status']} ({size_mb:.1f} MB)")
products: fresh (0.8 MB)
page_views: fresh (45.2 MB)
conversions: fresh (12.1 MB)