Building a curated data layer with SQL Transforms (no dbt required)
dbt is the standard for SQL-based data transformation. It is also a 50-file project with profiles, sources, models, tests, macros, packages, and a YAML file for every model. For a team with 5 tables that needs a star schema, that is overkill.
DataSpoc Lens takes a simpler approach: numbered SQL files that run in order. Each file creates a table from a query. No YAML, no Jinja, no dependency graph to debug. Just SQL.
The problem with dbt for small teams
To build a star schema from 3 source tables in dbt, you need:
```
my_dbt_project/
  dbt_project.yml
  profiles.yml
  models/
    staging/
      stg_users.sql
      stg_users.yml
      stg_orders.sql
      stg_orders.yml
      stg_products.sql
      stg_products.yml
    marts/
      dim_customers.sql
      dim_customers.yml
      fct_orders.sql
      fct_orders.yml
      mart_revenue.sql
      mart_revenue.yml
  tests/
    ...
  macros/
    ...
```

That is 12+ files before you write a single business query. dbt's power comes from dependency management, testing, and documentation, all of which is valuable at scale. But for a startup with one analyst, it is a week of setup before delivering any value.
Lens Transforms: numbered SQL files
In DataSpoc Lens, transforms are .sql files in a directory. They are numbered to define execution order:
```
transforms/
  01_clean_users.sql
  02_clean_orders.sql
  03_join_order_details.sql
  04_daily_revenue.sql
  05_customer_dashboard.sql
```

Each file uses the CTAS pattern (CREATE TABLE AS SELECT):

```sql
-- Creates a table in the curated layer
CREATE OR REPLACE TABLE curated.clean_users AS
SELECT ...
```

Run all the transforms:

```shell
dataspoc-lens transform run
```

Lens runs them in numeric order: 01, 02, 03, 04, 05. Each transform can reference the tables created by earlier transforms.
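The ordering rule is worth a closer look, because "numeric order" here is just lexicographic filename sorting. A minimal Python sketch (hypothetical filenames, not the Lens implementation) shows why zero-padded prefixes matter: without the leading zero, `10_` would sort before `2_`.

```python
# Illustrative sketch: numeric prefixes define execution order via
# plain lexicographic sorting of filenames. Hypothetical file names;
# this is not the actual Lens implementation.
transforms = [
    "03_join_order_details.sql",
    "10_customer_dashboard.sql",
    "01_clean_users.sql",
    "02_clean_orders.sql",
]
for name in sorted(transforms):
    print(name)
# 01_clean_users.sql
# 02_clean_orders.sql
# 03_join_order_details.sql
# 10_customer_dashboard.sql
```

Because `01` through `09` are zero-padded, lexicographic and numeric order stay aligned even past transform number 10.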
Building a star schema: 5 transforms
Let's build a complete analytics layer from raw data. We start with three source tables from Pipe:
- `raw.users`: user accounts
- `raw.orders`: order transactions
- `raw.products`: product catalog
Transform 1: Clean users
```sql
-- transforms/01_clean_users.sql
-- Clean and standardize user data

CREATE OR REPLACE TABLE curated.dim_customers AS
SELECT
    user_id,
    LOWER(TRIM(email)) AS email,
    COALESCE(name, 'Unknown') AS name,
    CASE
        WHEN plan IN ('free', 'starter') THEN 'free'
        WHEN plan IN ('pro', 'professional') THEN 'pro'
        WHEN plan IN ('business', 'enterprise') THEN 'enterprise'
        ELSE 'unknown'
    END AS plan_tier,
    created_at::DATE AS signup_date,
    DATE_DIFF('day', created_at, CURRENT_DATE) AS days_since_signup,
    CASE
        WHEN DATE_DIFF('day', created_at, CURRENT_DATE) <= 30 THEN 'new'
        WHEN DATE_DIFF('day', created_at, CURRENT_DATE) <= 365 THEN 'active'
        ELSE 'veteran'
    END AS customer_cohort
FROM raw.users
WHERE email IS NOT NULL
  AND email LIKE '%@%.%';
```

This handles the messy reality of raw data: inconsistent plan names, missing names, invalid emails.
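Because Lens transforms are plain SQL, the cleaning rules are easy to try in miniature. Here is a sketch using Python's stdlib SQLite as a stand-in engine, with made-up sample rows; it covers only the email and plan rules, since `DATE_DIFF` is engine-specific.

```python
# Miniature of the email/plan cleaning logic, run on SQLite with
# made-up sample data (illustration only, not the real curated layer).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INT, email TEXT, plan TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "  Ana@Example.COM ", "professional"),
    (2, None, "pro"),               # dropped: NULL email
    (3, "bob@example", "starter"),  # dropped: no '.' after the '@'
])
rows = con.execute("""
    SELECT user_id,
           LOWER(TRIM(email)) AS email,
           CASE WHEN plan IN ('free', 'starter') THEN 'free'
                WHEN plan IN ('pro', 'professional') THEN 'pro'
                ELSE 'unknown' END AS plan_tier
    FROM users
    WHERE email IS NOT NULL AND email LIKE '%@%.%'
""").fetchall()
print(rows)  # [(1, 'ana@example.com', 'pro')]
```

Only the first row survives: trimmed, lowercased, and with its plan name collapsed to a canonical tier.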
Transform 2: Clean orders
```sql
-- transforms/02_clean_orders.sql
-- Standardize orders, remove test data, add computed fields

CREATE OR REPLACE TABLE curated.fct_orders AS
SELECT
    order_id,
    user_id,
    product_id,
    amount,
    quantity,
    amount * quantity AS line_total,
    discount_pct,
    amount * quantity * (1 - COALESCE(discount_pct, 0) / 100.0) AS net_revenue,
    status,
    order_date,
    DATE_TRUNC('month', order_date) AS order_month,
    DATE_TRUNC('quarter', order_date) AS order_quarter,
    DAYOFWEEK(order_date) AS day_of_week,
    CASE
        WHEN HOUR(created_at) BETWEEN 6 AND 11 THEN 'morning'
        WHEN HOUR(created_at) BETWEEN 12 AND 17 THEN 'afternoon'
        WHEN HOUR(created_at) BETWEEN 18 AND 22 THEN 'evening'
        ELSE 'night'
    END AS time_of_day
FROM raw.orders
WHERE status != 'test'
  AND amount > 0
  AND order_date >= DATE '2024-01-01';
```

Transform 3: Join order details
```sql
-- transforms/03_join_order_details.sql
-- Denormalized order fact with customer and product dimensions

CREATE OR REPLACE TABLE curated.order_details AS
SELECT
    o.order_id,
    o.order_date,
    o.order_month,
    o.net_revenue,
    o.status,
    o.time_of_day,

    -- Customer dimensions
    c.name AS customer_name,
    c.email AS customer_email,
    c.plan_tier,
    c.customer_cohort,
    c.signup_date,

    -- Product dimensions
    p.name AS product_name,
    p.category AS product_category,
    p.sku,

    -- Calculated
    DATE_DIFF('day', c.signup_date, o.order_date) AS days_to_purchase
FROM curated.fct_orders o
JOIN curated.dim_customers c ON o.user_id = c.user_id
JOIN raw.products p ON o.product_id = p.product_id
WHERE o.status = 'completed';
```

Transform 4: Daily revenue aggregation
```sql
-- transforms/04_daily_revenue.sql
-- Daily revenue metrics for dashboards

CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT
    order_date,
    COUNT(DISTINCT order_id) AS total_orders,
    COUNT(DISTINCT customer_email) AS unique_customers,
    SUM(net_revenue) AS revenue,
    ROUND(AVG(net_revenue), 2) AS avg_order_value,
    ROUND(MEDIAN(net_revenue), 2) AS median_order_value,

    -- By plan tier
    SUM(CASE WHEN plan_tier = 'free' THEN net_revenue ELSE 0 END) AS free_revenue,
    SUM(CASE WHEN plan_tier = 'pro' THEN net_revenue ELSE 0 END) AS pro_revenue,
    SUM(CASE WHEN plan_tier = 'enterprise' THEN net_revenue ELSE 0 END) AS enterprise_revenue,

    -- By category
    SUM(CASE WHEN product_category = 'software' THEN net_revenue ELSE 0 END) AS software_revenue,
    SUM(CASE WHEN product_category = 'services' THEN net_revenue ELSE 0 END) AS services_revenue,

    -- 7-day rolling revenue (window over the daily totals)
    SUM(SUM(net_revenue)) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_rolling
FROM curated.order_details
GROUP BY order_date
ORDER BY order_date;
```

Transform 5: Customer dashboard view
```sql
-- transforms/05_customer_dashboard.sql
-- Per-customer summary for the analytics dashboard

CREATE OR REPLACE TABLE gold.customer_dashboard AS
SELECT
    customer_email,
    customer_name,
    plan_tier,
    customer_cohort,
    signup_date,

    COUNT(*) AS total_orders,
    SUM(net_revenue) AS lifetime_value,
    ROUND(AVG(net_revenue), 2) AS avg_order_value,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order,
    DATE_DIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order,

    CASE
        WHEN DATE_DIFF('day', MAX(order_date), CURRENT_DATE) <= 30 THEN 'active'
        WHEN DATE_DIFF('day', MAX(order_date), CURRENT_DATE) <= 90 THEN 'at_risk'
        ELSE 'churned'
    END AS activity_status,

    -- Most purchased category
    MODE(product_category) AS favorite_category
FROM curated.order_details
GROUP BY 1, 2, 3, 4, 5;
```

Running the transforms
```shell
# Run all transforms in order
dataspoc-lens transform run

# Run a specific transform
dataspoc-lens transform run --file 04_daily_revenue.sql

# Dry run: show the SQL without executing
dataspoc-lens transform run --dry-run
```

Output:

```
[10:15:01] Running 5 transforms...
[10:15:01] 01_clean_users.sql → curated.dim_customers (12,340 rows)
[10:15:02] 02_clean_orders.sql → curated.fct_orders (87,650 rows)
[10:15:03] 03_join_order_details.sql → curated.order_details (85,200 rows)
[10:15:04] 04_daily_revenue.sql → gold.daily_revenue (480 rows)
[10:15:04] 05_customer_dashboard.sql → gold.customer_dashboard (11,890 rows)
[10:15:04] All transforms complete (3.2s)
```

Querying the results
```shell
# Revenue trends
dataspoc-lens query "
  SELECT order_date, revenue, revenue_7d_rolling
  FROM gold.daily_revenue
  WHERE order_date >= DATE '2026-04-01'
  ORDER BY order_date
"

# At-risk customers
dataspoc-lens query "
  SELECT customer_name, lifetime_value, days_since_last_order
  FROM gold.customer_dashboard
  WHERE activity_status = 'at_risk'
  ORDER BY lifetime_value DESC
  LIMIT 20
"

# Or use AI
dataspoc-lens ask "Which product categories are growing fastest this quarter?"
```

Lens Transforms vs. dbt: when to use each
| Aspect | Lens Transforms | dbt |
|---|---|---|
| Setup time | 0 (create .sql files) | Hours (project, profiles, packages) |
| Dependency management | Numeric ordering | DAG with ref() |
| Testing | Manual SQL checks | Built-in tests |
| Documentation | SQL comments | YAML + auto-generated docs |
| Incremental models | Full refresh | Incremental materialization |
| CI/CD | Run from any script | dbt Cloud, GitHub Actions |
| Learning curve | Know SQL = ready | SQL + Jinja + YAML + dbt concepts |
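"Manual SQL checks" in the comparison above simply means running assertion-style queries after the transforms and failing loudly on a nonzero count. A sketch of that pattern, using Python's stdlib SQLite as a stand-in engine and made-up data (not a Lens feature):

```python
# Manual data-quality checks: each check query must return 0 to pass.
# SQLite stands in for the real engine; table and rows are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_orders (order_id INT, net_revenue REAL)")
con.executemany("INSERT INTO fct_orders VALUES (?, ?)",
                [(1, 49.0), (2, 120.0), (3, 15.5)])

checks = {
    "negative_revenue":
        "SELECT COUNT(*) FROM fct_orders WHERE net_revenue < 0",
    "duplicate_order_id": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM fct_orders
            GROUP BY order_id HAVING COUNT(*) > 1
        )""",
}
failures = {name: con.execute(sql).fetchone()[0]
            for name, sql in checks.items()}
assert all(count == 0 for count in failures.values()), failures
print("all checks passed")
```

Dropped into the ETL script, a failed assertion stops the pipeline before bad data reaches a dashboard, which covers much of what small teams use dbt tests for.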
When Lens Transforms wins
- You have fewer than 20 transforms
- One or two people maintain the pipeline
- Dependencies are simple and linear
- You want to start delivering value today, not next week
- You don't need incremental materialization (your data fits in memory)
When dbt wins
- You have 100+ models with complex dependencies
- Multiple teams contribute transforms
- You need incremental models (billions of rows)
- You need built-in testing and documentation
- You have a dedicated analytics engineering team
Scheduling transforms
Combine transforms with Pipe extraction in a cron job:

```shell
# Run every hour: extract then transform
0 * * * * dataspoc-pipe run postgres-production && dataspoc-lens transform run
```

Or in a shell script with error handling:
```shell
#!/bin/bash
# etl.sh - Extract and transform

set -e

echo "[$(date)] Starting ETL..."

# Extract
dataspoc-pipe run postgres-production
echo "[$(date)] Extraction complete"

# Transform
dataspoc-lens transform run
echo "[$(date)] Transforms complete"

# Refresh cache for fast queries
dataspoc-lens cache refresh-stale
echo "[$(date)] Cache refreshed"

echo "[$(date)] ETL complete"
```

Five SQL files. No Jinja. No YAML. No dependency graph. Just numbered transforms that build a clean analytics layer from raw data. When your needs outgrow this, graduate to dbt. Until then, ship value today.