
Examples

Three complete examples showing DataSpoc Lens in real-world scenarios.

Example 1: Explore an e-commerce data lake on S3

Discover and analyze data from an e-commerce company's data lake on S3.

pip install "dataspoc-lens[s3]"
dataspoc-lens init
dataspoc-lens add-bucket s3://acme-ecommerce-data
Bucket added: s3://acme-ecommerce-data
Discovering tables...
┌──────────────┬─────────┬────────┬──────────┐
│ Table        │ Columns │ Rows   │ Source   │
├──────────────┼─────────┼────────┼──────────┤
│ customers    │ 8       │ 54200  │ postgres │
│ orders       │ 12      │ 486000 │ postgres │
│ products     │ 9       │ 3120   │ postgres │
│ events       │ 6       │ 2.1M   │ segment  │
└──────────────┴─────────┴────────┴──────────┘
4 table(s) found.
dataspoc-lens catalog --detail orders
┌─────────────────┬───────────┐
│ Column          │ Type      │
├─────────────────┼───────────┤
│ order_id        │ INTEGER   │
│ customer_id     │ INTEGER   │
│ order_date      │ DATE      │
│ ship_date       │ DATE      │
│ total           │ DOUBLE    │
│ discount        │ DOUBLE    │
│ status          │ VARCHAR   │
│ payment_method  │ VARCHAR   │
│ channel         │ VARCHAR   │
│ region          │ VARCHAR   │
│ items_count     │ INTEGER   │
│ created_at      │ TIMESTAMP │
└─────────────────┴───────────┘
dataspoc-lens query "
  SELECT region, COUNT(*) AS orders, SUM(total) AS revenue
  FROM orders
  WHERE order_date >= '2026-01-01'
  GROUP BY region
  ORDER BY revenue DESC
"
┌───────────────┬────────┬────────────┐
│ region        │ orders │ revenue    │
├───────────────┼────────┼────────────┤
│ North America │ 45200  │ 3,842,000  │
│ Europe        │ 28300  │ 2,156,000  │
│ Asia Pacific  │ 18900  │ 1,420,000  │
│ Latin America │ 7600   │ 580,000    │
└───────────────┴────────┴────────────┘
(4 row(s), 0.342s)
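As a sanity check, the GROUP BY above can be reproduced in plain Python over a handful of rows; a minimal sketch with invented sample rows (not values from the dataset):

```python
from collections import defaultdict

def revenue_by_region(orders):
    """Mirror of SELECT region, COUNT(*), SUM(total) ... GROUP BY region,
    sorted by revenue descending like the ORDER BY clause."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in orders:
        bucket = totals[row["region"]]
        bucket["orders"] += 1
        bucket["revenue"] += row["total"]
    return sorted(totals.items(), key=lambda kv: kv[1]["revenue"], reverse=True)

# Invented sample rows, for illustration only
sample = [
    {"region": "Europe", "total": 120.0},
    {"region": "North America", "total": 80.0},
    {"region": "North America", "total": 60.0},
]
for region, agg in revenue_by_region(sample):
    print(region, agg["orders"], agg["revenue"])
```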
dataspoc-lens shell
lens> .tables
customers
orders
products
events
lens> SELECT payment_method, COUNT(*) as cnt
...> FROM orders GROUP BY payment_method ORDER BY cnt DESC;
┌────────────────┬────────┐
│ payment_method │ cnt    │
├────────────────┼────────┤
│ credit_card    │ 245000 │
│ debit_card     │ 128000 │
│ paypal         │ 82000  │
│ bank_transfer  │ 31000  │
└────────────────┴────────┘
(4 row(s), 0.215s)
lens> .export csv /tmp/payment_methods.csv
Exported 4 row(s) to /tmp/payment_methods.csv
dataspoc-lens query "
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS orders,
    SUM(total) AS revenue
  FROM orders
  WHERE order_date >= '2025-01-01'
  GROUP BY month
  ORDER BY month
" --export monthly_report.csv
(16 row(s), 0.456s)
Exported 16 row(s) to monthly_report.csv
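The export is plain CSV, so any downstream tool can consume it; a minimal sketch using the stdlib csv module, with a small invented stand-in for the contents of monthly_report.csv:

```python
import csv
import io

# Invented stand-in for monthly_report.csv (same column layout as the export)
report = io.StringIO(
    "month,orders,revenue\n"
    "2025-01-01,4100,310000\n"
    "2025-02-01,3900,295000\n"
)

# DictReader yields one dict per row, keyed by the header line
rows = list(csv.DictReader(report))
total_revenue = sum(int(r["revenue"]) for r in rows)
print(f"{len(rows)} month(s), total revenue {total_revenue}")
```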

Example 2: Build a curated layer with transformations


Create clean, aggregated datasets from raw data using SQL transformations.

dataspoc-lens init

Create the following files in ~/.dataspoc-lens/transforms/:

001_clean_customers.sql:

CREATE OR REPLACE TABLE clean_customers AS
SELECT
  id,
  TRIM(LOWER(email)) AS email,
  COALESCE(name, 'Unknown') AS name,
  created_at,
  CASE
    WHEN status IN ('active', 'Active', 'ACTIVE') THEN 'active'
    WHEN status IN ('inactive', 'Inactive') THEN 'inactive'
    ELSE 'unknown'
  END AS status
FROM customers
WHERE email IS NOT NULL
  AND email NOT LIKE '%test%';
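The cleaning rules above are simple enough to mirror row-by-row in Python, which is a convenient way to unit-test them before wiring them into a transform; a sketch of the same logic (clean_customer is our helper, not part of DataSpoc Lens):

```python
def clean_customer(row):
    """Python mirror of 001_clean_customers.sql: drop NULL/test emails,
    normalize the email, default the name, collapse status variants."""
    email = row.get("email")
    # WHERE email IS NOT NULL AND email NOT LIKE '%test%'
    if email is None or "test" in email:
        return None
    status = row.get("status")
    if status in ("active", "Active", "ACTIVE"):
        status = "active"
    elif status in ("inactive", "Inactive"):
        status = "inactive"
    else:
        status = "unknown"
    return {
        "email": email.lower().strip(),   # TRIM(LOWER(email))
        "name": row.get("name") or "Unknown",  # COALESCE(name, 'Unknown')
        "status": status,
    }
```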

002_order_metrics.sql:

CREATE OR REPLACE TABLE order_metrics AS
SELECT
  customer_id,
  COUNT(*) AS total_orders,
  SUM(total) AS lifetime_value,
  AVG(total) AS avg_order_value,
  MIN(order_date) AS first_order,
  MAX(order_date) AS last_order,
  DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order,
  COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled_orders
FROM orders
GROUP BY customer_id;
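The per-customer rollup can likewise be sketched in Python to check edge cases such as cancelled orders; a minimal, illustrative equivalent of the aggregation above (order_metrics here is our helper, not a tool API):

```python
from datetime import date

def order_metrics(orders, today):
    """Rollup for one customer's orders, mirroring 002_order_metrics.sql.
    Each order is a dict with 'total', 'order_date' and 'status'."""
    totals = [o["total"] for o in orders]
    dates = [o["order_date"] for o in orders]
    return {
        "total_orders": len(orders),
        "lifetime_value": sum(totals),
        "avg_order_value": sum(totals) / len(orders),
        "first_order": min(dates),
        "last_order": max(dates),
        # DATEDIFF('day', last_order, CURRENT_DATE)
        "days_since_last_order": (today - max(dates)).days,
        "cancelled_orders": sum(1 for o in orders if o["status"] == "cancelled"),
    }

# Invented sample orders for one customer
sample_orders = [
    {"total": 100.0, "order_date": date(2026, 1, 1), "status": "delivered"},
    {"total": 50.0, "order_date": date(2026, 2, 1), "status": "cancelled"},
]
metrics = order_metrics(sample_orders, today=date(2026, 2, 11))
```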

003_customer_segments.sql:

CREATE OR REPLACE TABLE customer_segments AS
SELECT
  c.id,
  c.name,
  c.email,
  m.total_orders,
  m.lifetime_value,
  m.avg_order_value,
  m.days_since_last_order,
  CASE
    WHEN m.lifetime_value > 5000 AND m.days_since_last_order < 30 THEN 'VIP Active'
    WHEN m.lifetime_value > 5000 AND m.days_since_last_order >= 30 THEN 'VIP At Risk'
    WHEN m.lifetime_value > 1000 THEN 'Regular'
    WHEN m.total_orders = 1 THEN 'One-time'
    ELSE 'Low Value'
  END AS segment
FROM clean_customers c
LEFT JOIN order_metrics m ON c.id = m.customer_id;
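The CASE expression is the heart of the segmentation, so it is worth pinning down as a standalone function; an illustrative Python version of the same rules, including the NULL metrics a LEFT JOIN produces for customers with no orders:

```python
def segment(lifetime_value, total_orders, days_since_last_order):
    """The CASE expression from 003_customer_segments.sql as a function.
    Customers with no order_metrics row have NULL metrics; every WHEN
    comparison against NULL is false, so they land in the ELSE branch."""
    if lifetime_value is None:
        return "Low Value"
    if lifetime_value > 5000:
        return "VIP Active" if days_since_last_order < 30 else "VIP At Risk"
    if lifetime_value > 1000:
        return "Regular"
    if total_orders == 1:
        return "One-time"
    return "Low Value"
```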
dataspoc-lens transform list
┌───┬─────────────────────────────┐
│ # │ File                        │
├───┼─────────────────────────────┤
│ 1 │ 001_clean_customers.sql     │
│ 2 │ 002_order_metrics.sql       │
│ 3 │ 003_customer_segments.sql   │
└───┴─────────────────────────────┘
dataspoc-lens transform run
Running 001_clean_customers.sql... OK (0.8s)
Running 002_order_metrics.sql... OK (2.1s)
Running 003_customer_segments.sql... OK (0.6s)
3 transform(s) completed successfully.
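Transforms run in ascending filename order, which is why the numeric prefixes matter. A minimal sketch of that ordering logic, with sqlite3 standing in for the tool's engine and two invented one-line transforms:

```python
import sqlite3
import tempfile
from pathlib import Path

def run_transforms(transforms_dir, conn):
    """Execute every *.sql file in ascending filename order,
    mirroring how numbered transform files are applied."""
    executed = []
    for path in sorted(Path(transforms_dir).glob("*.sql")):
        conn.executescript(path.read_text())
        executed.append(path.name)
    return executed

# Demo: create the files out of order, then run them
with tempfile.TemporaryDirectory() as d:
    Path(d, "002_b.sql").write_text("CREATE TABLE b (x INTEGER);")
    Path(d, "001_a.sql").write_text("CREATE TABLE a (x INTEGER);")
    conn = sqlite3.connect(":memory:")
    order = run_transforms(d, conn)
    print(order)  # 001 runs before 002 regardless of creation order
```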
dataspoc-lens query "
  SELECT segment, COUNT(*) AS customers, AVG(lifetime_value) AS avg_ltv
  FROM customer_segments
  GROUP BY segment
  ORDER BY avg_ltv DESC
"
┌──────────────┬───────────┬────────────┐
│ segment      │ customers │ avg_ltv    │
├──────────────┼───────────┼────────────┤
│ VIP Active   │ 420       │ 12,450.00  │
│ VIP At Risk  │ 185       │ 8,200.00   │
│ Regular      │ 3,200     │ 2,150.00   │
│ Low Value    │ 2,100     │ 320.00     │
│ One-time     │ 8,400     │ 85.00      │
└──────────────┴───────────┴────────────┘
(5 row(s), 0.052s)

Example 3: Cross-cloud analysis in Jupyter

Connect to buckets in multiple clouds and analyze everything together in a Jupyter notebook.

pip install "dataspoc-lens[s3,gcs,jupyter]"
dataspoc-lens init
# Add S3 bucket (product data)
dataspoc-lens add-bucket s3://acme-product-data
# Add GCS bucket (analytics data)
dataspoc-lens add-bucket gs://acme-analytics
Bucket added: s3://acme-product-data
Discovering tables...
2 table(s) found.
Bucket added: gs://acme-analytics
Discovering tables...
3 table(s) found.
dataspoc-lens cache products
dataspoc-lens cache page_views
dataspoc-lens cache conversions
dataspoc-lens notebook
# Cell 1: Check available tables from both clouds
%%sql
SHOW TABLES
┌──────────────────┐
│ name             │
├──────────────────┤
│ products         │
│ inventory        │
│ page_views       │
│ conversions      │
│ sessions         │
└──────────────────┘
# Cell 2: Cross-cloud join — product data (S3) with analytics (GCS)
%%sql
SELECT
  p.name AS product,
  p.category,
  COUNT(DISTINCT pv.session_id) AS views,
  COUNT(DISTINCT c.conversion_id) AS purchases,
  ROUND(COUNT(DISTINCT c.conversion_id)::FLOAT
        / NULLIF(COUNT(DISTINCT pv.session_id), 0) * 100, 2) AS conversion_rate
FROM products p
LEFT JOIN page_views pv ON p.product_id = pv.product_id
LEFT JOIN conversions c ON p.product_id = c.product_id
GROUP BY p.name, p.category
ORDER BY conversion_rate DESC
LIMIT 10
# Cell 3: Visualize conversion funnel
import pandas as pd

# conn is the connection object provided in the notebook session
df = conn.execute("""
    SELECT
        category,
        SUM(views) AS total_views,
        SUM(purchases) AS total_purchases
    FROM (
        SELECT p.category,
               COUNT(DISTINCT pv.session_id) AS views,
               COUNT(DISTINCT c.conversion_id) AS purchases
        FROM products p
        LEFT JOIN page_views pv ON p.product_id = pv.product_id
        LEFT JOIN conversions c ON p.product_id = c.product_id
        GROUP BY p.category, p.product_id
    )
    GROUP BY category
    ORDER BY total_views DESC
""").fetchdf()
df.plot(x='category', y=['total_views', 'total_purchases'],
        kind='bar', title='Views vs Purchases by Category')
# Cell 4: Check cache status
from dataspoc_lens import LensClient

with LensClient() as client:
    for entry in client.cache_status():
        size_mb = entry["size_bytes"] / (1024 * 1024)
        print(f"{entry['table']}: {entry['status']} ({size_mb:.1f} MB)")
products: fresh (0.8 MB)
page_views: fresh (45.2 MB)
conversions: fresh (12.1 MB)
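One detail worth calling out from Cell 2: NULLIF(COUNT(DISTINCT pv.session_id), 0) keeps products with zero views from causing a division-by-zero error. The same guard, expressed as a small illustrative Python function:

```python
def conversion_rate(views, purchases):
    """Python equivalent of
    ROUND(purchases::FLOAT / NULLIF(views, 0) * 100, 2).
    NULLIF turns a zero denominator into NULL in SQL; here we
    return None instead of raising ZeroDivisionError."""
    if views == 0:
        return None
    return round(purchases / views * 100, 2)

print(conversion_rate(200, 13))  # 6.5 (percent)
print(conversion_rate(0, 0))     # None: product with no views
```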