Ejemplos
Tres ejemplos completos mostrando DataSpoc Lens en escenarios del mundo real.
Ejemplo 1: Explorar un data lake de e-commerce
Sección titulada «Ejemplo 1: Explorar un data lake de e-commerce»Descubrir y analizar datos del data lake de una empresa de e-commerce en S3.
Configuracion
Sección titulada «Configuracion»pip install dataspoc-lens[s3]dataspoc-lens initdataspoc-lens add-bucket s3://acme-ecommerce-dataBucket added: s3://acme-ecommerce-dataDiscovering tables...┌──────────────┬─────────┬────────┬──────────┐│ Table │ Columns │ Rows │ Source │├──────────────┼─────────┼────────┼──────────┤│ customers │ 8 │ 54200 │ postgres ││ orders │ 12 │ 486000 │ postgres ││ products │ 9 │ 3120 │ postgres ││ events │ 6 │ 2.1M │ segment │└──────────────┴─────────┴────────┴──────────┘
4 table(s) found.Explorar el catalogo
Sección titulada «Explorar el catalogo»dataspoc-lens catalog --detail orders┌─────────────────┬───────────┐│ Column │ Type │├─────────────────┼───────────┤│ order_id │ INTEGER ││ customer_id │ INTEGER ││ order_date │ DATE ││ ship_date │ DATE ││ total │ DOUBLE ││ discount │ DOUBLE ││ status │ VARCHAR ││ payment_method │ VARCHAR ││ channel │ VARCHAR ││ region │ VARCHAR ││ items_count │ INTEGER ││ created_at │ TIMESTAMP │└─────────────────┴───────────┘Ejecutar consultas
Sección titulada «Ejecutar consultas»dataspoc-lens query " SELECT region, COUNT(*) as orders, SUM(total) as revenue FROM orders WHERE order_date >= '2026-01-01' GROUP BY region ORDER BY revenue DESC"┌──────────────┬────────┬────────────┐│ region │ orders │ revenue │├──────────────┼────────┼────────────┤│ North America│ 45200 │ 3,842,000 ││ Europe │ 28300 │ 2,156,000 ││ Asia Pacific │ 18900 │ 1,420,000 ││ Latin America│ 7600 │ 580,000 │└──────────────┴────────┴────────────┘
(4 row(s), 0.342s)Usar el shell interactivo
Sección titulada «Usar el shell interactivo»dataspoc-lens shelllens> .tables customers orders products events
lens> SELECT payment_method, COUNT(*) as cnt ...> FROM orders GROUP BY payment_method ORDER BY cnt DESC;┌────────────────┬────────┐│ payment_method │ cnt │├────────────────┼────────┤│ credit_card │ 245000 ││ debit_card │ 128000 ││ paypal │ 82000 ││ bank_transfer │ 31000 │└────────────────┴────────┘
(4 row(s), 0.215s)
lens> .export csv /tmp/payment_methods.csvExported 4 row(s) to /tmp/payment_methods.csvExportar para reportes
Sección titulada «Exportar para reportes»dataspoc-lens query " SELECT DATE_TRUNC('month', order_date) as month, COUNT(*) as orders, SUM(total) as revenue FROM orders WHERE order_date >= '2025-01-01' GROUP BY month ORDER BY month" --export monthly_report.csv(16 row(s), 0.456s)Exported 16 row(s) to monthly_report.csvEjemplo 2: Construir una capa curada con transformaciones
Sección titulada «Ejemplo 2: Construir una capa curada con transformaciones»Crear datasets limpios y agregados a partir de datos crudos usando transformaciones SQL.
Crear archivos de transformacion
Sección titulada «Crear archivos de transformacion»dataspoc-lens initCrea los siguientes archivos en ~/.dataspoc-lens/transforms/:
001_clean_customers.sql:
CREATE OR REPLACE TABLE clean_customers ASSELECT id, TRIM(LOWER(email)) AS email, COALESCE(name, 'Unknown') AS name, created_at, CASE WHEN status IN ('active', 'Active', 'ACTIVE') THEN 'active' WHEN status IN ('inactive', 'Inactive') THEN 'inactive' ELSE 'unknown' END AS statusFROM customersWHERE email IS NOT NULL AND email NOT LIKE '%test%';002_order_metrics.sql:
CREATE OR REPLACE TABLE order_metrics ASSELECT customer_id, COUNT(*) AS total_orders, SUM(total) AS lifetime_value, AVG(total) AS avg_order_value, MIN(order_date) AS first_order, MAX(order_date) AS last_order, DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order, COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled_ordersFROM ordersGROUP BY customer_id;003_customer_segments.sql:
CREATE OR REPLACE TABLE customer_segments ASSELECT c.id, c.name, c.email, m.total_orders, m.lifetime_value, m.avg_order_value, m.days_since_last_order, CASE WHEN m.lifetime_value > 5000 AND m.days_since_last_order < 30 THEN 'VIP Active' WHEN m.lifetime_value > 5000 AND m.days_since_last_order >= 30 THEN 'VIP At Risk' WHEN m.lifetime_value > 1000 THEN 'Regular' WHEN m.total_orders = 1 THEN 'One-time' ELSE 'Low Value' END AS segmentFROM clean_customers cLEFT JOIN order_metrics m ON c.id = m.customer_id;Ejecutar transformaciones
Sección titulada «Ejecutar transformaciones»dataspoc-lens transform list┌───┬─────────────────────────────┐│ # │ File │├───┼─────────────────────────────┤│ 1 │ 001_clean_customers.sql ││ 2 │ 002_order_metrics.sql ││ 3 │ 003_customer_segments.sql │└───┴─────────────────────────────┘dataspoc-lens transform runRunning 001_clean_customers.sql... OK (0.8s)Running 002_order_metrics.sql... OK (2.1s)Running 003_customer_segments.sql... OK (0.6s)
3 transform(s) completed successfully.Consultar los resultados
Sección titulada «Consultar los resultados»dataspoc-lens query " SELECT segment, COUNT(*) as customers, AVG(lifetime_value) as avg_ltv FROM customer_segments GROUP BY segment ORDER BY avg_ltv DESC"┌──────────────┬───────────┬────────────┐│ segment │ customers │ avg_ltv │├──────────────┼───────────┼────────────┤│ VIP Active │ 420 │ 12,450.00 ││ VIP At Risk │ 185 │ 8,200.00 ││ Regular │ 3,200 │ 2,150.00 ││ One-time │ 8,400 │ 85.00 ││ Low Value │ 2,100 │ 320.00 │└──────────────┴───────────┴────────────┘
(5 row(s), 0.052s)Ejemplo 3: Analisis multi-cloud con Jupyter
Sección titulada «Ejemplo 3: Analisis multi-cloud con Jupyter»Conectarse a buckets en multiples nubes y analizarlos juntos en un notebook Jupyter.
Configuracion
Sección titulada «Configuracion»pip install dataspoc-lens[s3,gcs,jupyter]dataspoc-lens init
# Add S3 bucket (product data)dataspoc-lens add-bucket s3://acme-product-data
# Add GCS bucket (analytics data)dataspoc-lens add-bucket gs://acme-analyticsBucket added: s3://acme-product-dataDiscovering tables...2 table(s) found.
Bucket added: gs://acme-analyticsDiscovering tables...3 table(s) found.Cachear para rendimiento
Sección titulada «Cachear para rendimiento»dataspoc-lens cache productsdataspoc-lens cache page_viewsdataspoc-lens cache conversionsIniciar Jupyter
Sección titulada «Iniciar Jupyter»dataspoc-lens notebookCeldas del notebook
Sección titulada «Celdas del notebook»# Cell 1: Check available tables from both clouds%%sqlSHOW TABLES┌──────────────────┐│ name │├──────────────────┤│ products ││ inventory ││ page_views ││ conversions ││ sessions │└──────────────────┘# Cell 2: Cross-cloud join — product data (S3) with analytics (GCS)%%sqlSELECT p.name AS product, p.category, COUNT(DISTINCT pv.session_id) AS views, COUNT(DISTINCT c.conversion_id) AS purchases, ROUND(COUNT(DISTINCT c.conversion_id)::FLOAT / NULLIF(COUNT(DISTINCT pv.session_id), 0) * 100, 2) AS conversion_rateFROM products pLEFT JOIN page_views pv ON p.product_id = pv.product_idLEFT JOIN conversions c ON p.product_id = c.product_idGROUP BY p.name, p.categoryORDER BY conversion_rate DESCLIMIT 10# Cell 3: Visualize conversion funnelimport pandas as pd
df = conn.execute(""" SELECT category, SUM(views) as total_views, SUM(purchases) as total_purchases FROM ( SELECT p.category, COUNT(DISTINCT pv.session_id) AS views, COUNT(DISTINCT c.conversion_id) AS purchases FROM products p LEFT JOIN page_views pv ON p.product_id = pv.product_id LEFT JOIN conversions c ON p.product_id = c.product_id GROUP BY p.category, p.product_id ) GROUP BY category ORDER BY total_views DESC""").fetchdf()
df.plot(x='category', y=['total_views', 'total_purchases'], kind='bar', title='Views vs Purchases by Category')# Cell 4: Check cache statusfrom dataspoc_lens import LensClient
with LensClient() as client: for entry in client.cache_status(): size_mb = entry["size_bytes"] / (1024 * 1024) print(f"{entry['table']}: {entry['status']} ({size_mb:.1f} MB)")products: fresh (0.8 MB)page_views: fresh (45.2 MB)conversions: fresh (12.1 MB)