
Building a curated data layer with SQL Transforms (no dbt required)

Michael San Martim · 2026-04-27

dbt is the standard for SQL-based data transformation. It is also a 50-file project with profiles, sources, models, tests, macros, packages, and a YAML file for every model. For a team with 5 tables that needs a star schema, that is overkill.

DataSpoc Lens takes a simpler approach: numbered SQL files that run in order. Each file creates a table from a query. No YAML, no Jinja, no dependency graph to debug. Just SQL.

The problem with dbt for small teams

To build a star schema from 3 source tables in dbt, you need:

my_dbt_project/
  dbt_project.yml
  profiles.yml
  models/
    staging/
      stg_users.sql
      stg_users.yml
      stg_orders.sql
      stg_orders.yml
      stg_products.sql
      stg_products.yml
    marts/
      dim_customers.sql
      dim_customers.yml
      fct_orders.sql
      fct_orders.yml
      mart_revenue.sql
      mart_revenue.yml
  tests/
    ...
  macros/
    ...

That's 12+ files before you write a single business query. dbt's power comes from dependency management, testing, and documentation, all valuable at scale. But for a startup with one analyst, it's a week of setup before any value gets delivered.

Lens Transforms: Numbered SQL files

In DataSpoc Lens, transforms are .sql files in a directory. They are numbered to define execution order:

transforms/
  01_clean_users.sql
  02_clean_orders.sql
  03_join_order_details.sql
  04_daily_revenue.sql
  05_customer_dashboard.sql

Each file uses the CTAS (CREATE TABLE AS SELECT) pattern:

-- Creates a table in the curated layer
CREATE OR REPLACE TABLE curated.clean_users AS
SELECT ...

Run all the transforms:

dataspoc-lens transform run

Lens runs them in numeric order: 01, 02, 03, 04, 05. Each transform can reference tables created by earlier transforms.
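Lens's internal runner isn't shown in this post, but the pattern itself is small enough to sketch. Here is a minimal, hypothetical version in Python, using the stdlib sqlite3 module as a stand-in engine (what Lens actually runs on is an assumption here): sort the .sql files lexicographically, execute each one, and later files can read tables that earlier files created.

```python
import sqlite3
import tempfile
from pathlib import Path

def run_transforms(transform_dir: Path, conn: sqlite3.Connection) -> list[str]:
    """Execute every .sql file in lexicographic (numeric-prefix) order."""
    executed = []
    for sql_file in sorted(transform_dir.glob("*.sql")):
        conn.executescript(sql_file.read_text())
        executed.append(sql_file.name)
    return executed

# Demo: two tiny transforms where 02 depends on the table created by 01.
with tempfile.TemporaryDirectory() as d:
    tdir = Path(d)
    (tdir / "01_base.sql").write_text(
        "CREATE TABLE base AS SELECT 1 AS id UNION ALL SELECT 2;"
    )
    (tdir / "02_derived.sql").write_text(
        "CREATE TABLE derived AS SELECT COUNT(*) AS n FROM base;"
    )
    conn = sqlite3.connect(":memory:")
    order = run_transforms(tdir, conn)
    n = conn.execute("SELECT n FROM derived").fetchone()[0]
```

The zero-padded prefix is what makes this work: plain string sorting of `01_`, `02_`, ... matches execution order, so no dependency graph is needed as long as the chain stays linear.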

Building a star schema: 5 transforms

Let's build a complete analytics layer from raw data. We start with three source tables from Pipe:

  • raw.users — user accounts
  • raw.orders — order transactions
  • raw.products — product catalog

Transform 1: Clean users

-- transforms/01_clean_users.sql
-- Clean and standardize user data
CREATE OR REPLACE TABLE curated.dim_customers AS
SELECT
    user_id,
    LOWER(TRIM(email)) AS email,
    COALESCE(name, 'Unknown') AS name,
    CASE
        WHEN plan IN ('free', 'starter') THEN 'free'
        WHEN plan IN ('pro', 'professional') THEN 'pro'
        WHEN plan IN ('business', 'enterprise') THEN 'enterprise'
        ELSE 'unknown'
    END AS plan_tier,
    created_at::DATE AS signup_date,
    DATE_DIFF('day', created_at, CURRENT_DATE) AS days_since_signup,
    CASE
        WHEN DATE_DIFF('day', created_at, CURRENT_DATE) <= 30 THEN 'new'
        WHEN DATE_DIFF('day', created_at, CURRENT_DATE) <= 365 THEN 'active'
        ELSE 'veteran'
    END AS customer_cohort
FROM raw.users
WHERE email IS NOT NULL
  AND email LIKE '%@%.%';

This handles the messy reality of raw data: inconsistent plan names, missing names, invalid emails.
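Cleanup rules like these are easy to get subtly wrong, so it can help to mirror them in plain code and unit-test the mapping before baking it into SQL. A hypothetical Python equivalent of the rules in 01_clean_users.sql (the function and regex names are illustrative, not part of Lens):

```python
import re

# Mirrors the CASE expression that collapses raw plan names into tiers.
PLAN_TIERS = {
    "free": "free", "starter": "free",
    "pro": "pro", "professional": "pro",
    "business": "enterprise", "enterprise": "enterprise",
}

# Rough mirror of the SQL filter LIKE '%@%.%'
EMAIL_RE = re.compile(r".+@.+\..+")

def clean_user(email, name, plan):
    """Return a cleaned record, or None if the email fails validation."""
    if email is None or not EMAIL_RE.fullmatch(email.strip()):
        return None  # WHERE email IS NOT NULL AND email LIKE '%@%.%'
    return {
        "email": email.strip().lower(),          # LOWER(TRIM(email))
        "name": name or "Unknown",               # COALESCE(name, 'Unknown')
        "plan_tier": PLAN_TIERS.get(plan, "unknown"),
    }
```

For example, `clean_user("  Ada@Example.COM ", None, "starter")` normalizes the email, fills the missing name, and maps "starter" into the "free" tier, while a string without an @ is rejected outright.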

Transform 2: Clean orders

-- transforms/02_clean_orders.sql
-- Standardize orders, remove test data, add computed fields
CREATE OR REPLACE TABLE curated.fct_orders AS
SELECT
    order_id,
    user_id,
    product_id,
    amount,
    quantity,
    amount * quantity AS line_total,
    discount_pct,
    amount * quantity * (1 - COALESCE(discount_pct, 0) / 100.0) AS net_revenue,
    status,
    order_date,
    DATE_TRUNC('month', order_date) AS order_month,
    DATE_TRUNC('quarter', order_date) AS order_quarter,
    DAYOFWEEK(order_date) AS day_of_week,
    CASE
        WHEN HOUR(created_at) BETWEEN 6 AND 11 THEN 'morning'
        WHEN HOUR(created_at) BETWEEN 12 AND 17 THEN 'afternoon'
        WHEN HOUR(created_at) BETWEEN 18 AND 22 THEN 'evening'
        ELSE 'night'
    END AS time_of_day
FROM raw.orders
WHERE status != 'test'
  AND amount > 0
  AND order_date >= DATE '2024-01-01';
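The net_revenue expression is the one computed field worth double-checking, since COALESCE and the percentage division interact. A quick Python sanity check of the same formula (the function name is ours, not part of any tool):

```python
def net_revenue(amount, quantity, discount_pct=None):
    """amount * quantity * (1 - COALESCE(discount_pct, 0) / 100.0)"""
    discount = discount_pct if discount_pct is not None else 0  # COALESCE
    return amount * quantity * (1 - discount / 100.0)
```

Two units at 50.00 with a 10% discount yield 90.00 of net revenue; with a NULL discount the line total passes through unchanged, which is exactly why the COALESCE is there.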

Transform 3: Join order details

-- transforms/03_join_order_details.sql
-- Denormalized order fact with customer and product dimensions
CREATE OR REPLACE TABLE curated.order_details AS
SELECT
    o.order_id,
    o.order_date,
    o.order_month,
    o.net_revenue,
    o.status,
    o.time_of_day,
    -- Customer dimensions
    c.name AS customer_name,
    c.email AS customer_email,
    c.plan_tier,
    c.customer_cohort,
    c.signup_date,
    -- Product dimensions
    p.name AS product_name,
    p.category AS product_category,
    p.sku,
    -- Calculated
    DATE_DIFF('day', c.signup_date, o.order_date) AS days_to_purchase
FROM curated.fct_orders o
JOIN curated.dim_customers c ON o.user_id = c.user_id
JOIN raw.products p ON o.product_id = p.product_id
WHERE o.status = 'completed';

Transform 4: Daily revenue aggregation

-- transforms/04_daily_revenue.sql
-- Daily revenue metrics for dashboards
CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT
    order_date,
    COUNT(DISTINCT order_id) AS total_orders,
    COUNT(DISTINCT customer_email) AS unique_customers,
    SUM(net_revenue) AS revenue,
    ROUND(AVG(net_revenue), 2) AS avg_order_value,
    ROUND(MEDIAN(net_revenue), 2) AS median_order_value,
    -- By plan tier
    SUM(CASE WHEN plan_tier = 'free' THEN net_revenue ELSE 0 END) AS free_revenue,
    SUM(CASE WHEN plan_tier = 'pro' THEN net_revenue ELSE 0 END) AS pro_revenue,
    SUM(CASE WHEN plan_tier = 'enterprise' THEN net_revenue ELSE 0 END) AS enterprise_revenue,
    -- By category
    SUM(CASE WHEN product_category = 'software' THEN net_revenue ELSE 0 END) AS software_revenue,
    SUM(CASE WHEN product_category = 'services' THEN net_revenue ELSE 0 END) AS services_revenue,
    -- 7-day rolling revenue (window over the aggregated daily sums)
    SUM(SUM(net_revenue)) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_rolling
FROM curated.order_details
GROUP BY order_date
ORDER BY order_date;
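The nested `SUM(SUM(...)) OVER (...)` is the one non-obvious line: the inner SUM is the per-day aggregate, and the window then sums the current day plus the 6 preceding days. The frame semantics are easy to illustrate in a few lines of Python (our own helper, just for intuition):

```python
def rolling_7d(daily):
    """ROWS BETWEEN 6 PRECEDING AND CURRENT ROW over an ordered list
    of daily totals: each output is the sum of up to 7 trailing values."""
    return [sum(daily[max(0, i - 6):i + 1]) for i in range(len(daily))]
```

Note the ramp-up: for the first six days the frame is shorter than 7 rows, so early values are partial sums rather than true 7-day windows, which matches how the SQL window behaves at the start of the series.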

Transform 5: Customer dashboard view

-- transforms/05_customer_dashboard.sql
-- Per-customer summary for the analytics dashboard
CREATE OR REPLACE TABLE gold.customer_dashboard AS
SELECT
    customer_email,
    customer_name,
    plan_tier,
    customer_cohort,
    signup_date,
    COUNT(*) AS total_orders,
    SUM(net_revenue) AS lifetime_value,
    ROUND(AVG(net_revenue), 2) AS avg_order_value,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order,
    DATE_DIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order,
    CASE
        WHEN DATE_DIFF('day', MAX(order_date), CURRENT_DATE) <= 30 THEN 'active'
        WHEN DATE_DIFF('day', MAX(order_date), CURRENT_DATE) <= 90 THEN 'at_risk'
        ELSE 'churned'
    END AS activity_status,
    -- Most purchased category
    MODE(product_category) AS favorite_category
FROM curated.order_details
GROUP BY 1, 2, 3, 4, 5;
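Two pieces of this transform carry business logic worth pinning down: the activity_status buckets and MODE(), which returns the most frequent value in the group. A hypothetical Python mirror of both (function names are ours):

```python
from collections import Counter

def activity_status(days_since_last_order):
    """Same buckets as the CASE expression in 05_customer_dashboard.sql."""
    if days_since_last_order <= 30:
        return "active"
    if days_since_last_order <= 90:
        return "at_risk"
    return "churned"

def favorite_category(categories):
    """Counter.most_common(1) plays the role of SQL MODE():
    the single most frequent value in the group."""
    return Counter(categories).most_common(1)[0][0]
```

The boundaries matter for the at-risk report later in the post: a customer at exactly 30 days is still "active", and one at exactly 90 days is still "at_risk", not churned.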

Running the transforms

# Run all transforms in order
dataspoc-lens transform run
# Run a specific transform
dataspoc-lens transform run --file 04_daily_revenue.sql
# Dry run: show the SQL without executing
dataspoc-lens transform run --dry-run

Output:

[10:15:01] Running 5 transforms...
[10:15:01] 01_clean_users.sql → curated.dim_customers (12,340 rows)
[10:15:02] 02_clean_orders.sql → curated.fct_orders (87,650 rows)
[10:15:03] 03_join_order_details.sql → curated.order_details (85,200 rows)
[10:15:04] 04_daily_revenue.sql → gold.daily_revenue (480 rows)
[10:15:04] 05_customer_dashboard.sql → gold.customer_dashboard (11,890 rows)
[10:15:04] All transforms complete (3.2s)

Querying the results

# Revenue trends
dataspoc-lens query "
SELECT order_date, revenue, revenue_7d_rolling
FROM gold.daily_revenue
WHERE order_date >= DATE '2026-04-01'
ORDER BY order_date
"
# At-risk customers
dataspoc-lens query "
SELECT customer_name, lifetime_value, days_since_last_order
FROM gold.customer_dashboard
WHERE activity_status = 'at_risk'
ORDER BY lifetime_value DESC
LIMIT 20
"
# Or use AI
dataspoc-lens ask "Which product categories are growing fastest this quarter?"

Lens Transforms vs. dbt: When to use each

Aspect                | Lens Transforms              | dbt
----------------------|------------------------------|-------------------------------------
Setup time            | 0 (create .sql files)        | Hours (project, profiles, packages)
Dependency management | Numeric ordering             | DAG with ref()
Testing               | Manual SQL checks            | Built-in tests
Documentation         | SQL comments                 | YAML + auto-generated docs
Incremental models    | Full refresh only            | Incremental materialization
CI/CD                 | Run from any script          | dbt Cloud, GitHub Actions
Learning curve        | Know SQL and you're ready    | SQL + Jinja + YAML + dbt concepts
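"Manual SQL checks" need not mean ad hoc: a tiny assertion script run after the transforms covers the most common failure modes. A sketch of that idea in Python, using stdlib sqlite3 as a stand-in engine (the `check` helper and table contents are illustrative, not part of Lens):

```python
import sqlite3

def check(conn, name, sql):
    """A check query returns violating rows; any rows means failure."""
    bad = conn.execute(sql).fetchall()
    if bad:
        raise AssertionError(f"check '{name}' failed, e.g. {bad[:5]}")

# Demo table standing in for curated.dim_customers
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customers (user_id INT, email TEXT)")
conn.execute("INSERT INTO dim_customers VALUES (1, 'a@x.com'), (2, 'b@x.com')")

# Two typical checks: no NULL emails, no duplicate keys
check(conn, "no_null_emails",
      "SELECT user_id FROM dim_customers WHERE email IS NULL")
check(conn, "unique_user_ids",
      "SELECT user_id FROM dim_customers GROUP BY user_id HAVING COUNT(*) > 1")
```

Writing checks as "queries that must return zero rows" keeps them composable: each one is just another SQL file you can run after `transform run` in the same cron job.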

When Lens Transforms is the better fit

  • You have fewer than 20 transforms
  • One or two people maintain the pipeline
  • Dependencies are simple and linear
  • You want to start delivering value today, not next week
  • You don't need incremental materialization (your data fits in memory)

When dbt is the better fit

  • You have 100+ models with complex dependencies
  • Multiple teams contribute transforms
  • You need incremental models (billions of rows)
  • You need built-in testing and documentation
  • You have a dedicated analytics engineering team

Scheduling transforms

Combine transforms with Pipe extraction in a cron job:

# Run every hour: extract then transform
0 * * * * dataspoc-pipe run postgres-production && dataspoc-lens transform run

Or in a shell script with error handling:

#!/bin/bash
# etl.sh - Extract and transform
set -e
echo "[$(date)] Starting ETL..."
# Extract
dataspoc-pipe run postgres-production
echo "[$(date)] Extraction complete"
# Transform
dataspoc-lens transform run
echo "[$(date)] Transforms complete"
# Refresh cache for fast queries
dataspoc-lens cache refresh-stale
echo "[$(date)] Cache refreshed"
echo "[$(date)] ETL complete"

Five SQL files. No Jinja. No YAML. No dependency graph. Just numbered transforms that build a clean analytics layer from raw data. When your needs outgrow this, graduate to dbt. Until then, ship value today.
