
Examples

Three complete examples showing DataSpoc Lens in real-world scenarios.

Example 1: Explore an e-commerce data lake on S3

Discover and analyze data from an e-commerce company's data lake on S3.

pip install "dataspoc-lens[s3]"
dataspoc-lens init
dataspoc-lens add-bucket s3://acme-ecommerce-data
Bucket added: s3://acme-ecommerce-data
Discovering tables...
┌──────────────┬─────────┬────────┬──────────┐
│ Table        │ Columns │ Rows   │ Source   │
├──────────────┼─────────┼────────┼──────────┤
│ customers    │ 8       │ 54200  │ postgres │
│ orders       │ 12      │ 486000 │ postgres │
│ products     │ 9       │ 3120   │ postgres │
│ events       │ 6       │ 2.1M   │ segment  │
└──────────────┴─────────┴────────┴──────────┘
4 table(s) found.
dataspoc-lens catalog --detail orders
┌─────────────────┬───────────┐
│ Column          │ Type      │
├─────────────────┼───────────┤
│ order_id        │ INTEGER   │
│ customer_id     │ INTEGER   │
│ order_date      │ DATE      │
│ ship_date       │ DATE      │
│ total           │ DOUBLE    │
│ discount        │ DOUBLE    │
│ status          │ VARCHAR   │
│ payment_method  │ VARCHAR   │
│ channel         │ VARCHAR   │
│ region          │ VARCHAR   │
│ items_count     │ INTEGER   │
│ created_at      │ TIMESTAMP │
└─────────────────┴───────────┘
dataspoc-lens query "
  SELECT region, COUNT(*) AS orders, SUM(total) AS revenue
  FROM orders
  WHERE order_date >= '2026-01-01'
  GROUP BY region
  ORDER BY revenue DESC
"
┌───────────────┬────────┬────────────┐
│ region        │ orders │ revenue    │
├───────────────┼────────┼────────────┤
│ North America │ 45200  │ 3,842,000  │
│ Europe        │ 28300  │ 2,156,000  │
│ Asia Pacific  │ 18900  │ 1,420,000  │
│ Latin America │ 7600   │ 580,000    │
└───────────────┴────────┴────────────┘
(4 row(s), 0.342s)
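As a sanity check, the GROUP BY above can be reproduced in plain Python over a handful of rows; a minimal sketch with invented sample rows (not values from the dataset):

```python
from collections import defaultdict

def revenue_by_region(orders):
    """Mirror of SELECT region, COUNT(*), SUM(total) ... GROUP BY region,
    sorted by revenue descending like the ORDER BY clause."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for row in orders:
        bucket = totals[row["region"]]
        bucket["orders"] += 1
        bucket["revenue"] += row["total"]
    return sorted(totals.items(), key=lambda kv: kv[1]["revenue"], reverse=True)

# Invented sample rows, for illustration only
sample = [
    {"region": "Europe", "total": 120.0},
    {"region": "North America", "total": 80.0},
    {"region": "North America", "total": 60.0},
]
for region, agg in revenue_by_region(sample):
    print(region, agg["orders"], agg["revenue"])
```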
dataspoc-lens shell
lens> .tables
customers
orders
products
events
lens> SELECT payment_method, COUNT(*) as cnt
...> FROM orders GROUP BY payment_method ORDER BY cnt DESC;
┌────────────────┬────────┐
│ payment_method │ cnt    │
├────────────────┼────────┤
│ credit_card    │ 245000 │
│ debit_card     │ 128000 │
│ paypal         │ 82000  │
│ bank_transfer  │ 31000  │
└────────────────┴────────┘
(4 row(s), 0.215s)
lens> .export csv /tmp/payment_methods.csv
Exported 4 row(s) to /tmp/payment_methods.csv
dataspoc-lens query "
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS orders,
    SUM(total) AS revenue
  FROM orders
  WHERE order_date >= '2025-01-01'
  GROUP BY month
  ORDER BY month
" --export monthly_report.csv
(16 row(s), 0.456s)
Exported 16 row(s) to monthly_report.csv
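The export is plain CSV, so any downstream tool can consume it; a minimal sketch using the stdlib csv module, with a small invented stand-in for the contents of monthly_report.csv:

```python
import csv
import io

# Invented stand-in for monthly_report.csv (same column layout as the export)
report = io.StringIO(
    "month,orders,revenue\n"
    "2025-01-01,4100,310000\n"
    "2025-02-01,3900,295000\n"
)

# DictReader yields one dict per row, keyed by the header line
rows = list(csv.DictReader(report))
total_revenue = sum(int(r["revenue"]) for r in rows)
print(f"{len(rows)} month(s), total revenue {total_revenue}")
```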

Example 2: Build a curated layer with transformations


Create clean, aggregated datasets from raw data using SQL transformations.

dataspoc-lens init

Create the following files in ~/.dataspoc-lens/transforms/:

001_clean_customers.sql:

CREATE OR REPLACE TABLE clean_customers AS
SELECT
  id,
  TRIM(LOWER(email)) AS email,
  COALESCE(name, 'Unknown') AS name,
  created_at,
  CASE
    WHEN status IN ('active', 'Active', 'ACTIVE') THEN 'active'
    WHEN status IN ('inactive', 'Inactive') THEN 'inactive'
    ELSE 'unknown'
  END AS status
FROM customers
WHERE email IS NOT NULL
  AND email NOT LIKE '%test%';
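The cleaning rules above are simple enough to mirror row-by-row in Python, which is a convenient way to unit-test them before wiring them into a transform; a sketch of the same logic (clean_customer is our helper, not part of DataSpoc Lens):

```python
def clean_customer(row):
    """Python mirror of 001_clean_customers.sql: drop NULL/test emails,
    normalize the email, default the name, collapse status variants."""
    email = row.get("email")
    # WHERE email IS NOT NULL AND email NOT LIKE '%test%'
    if email is None or "test" in email:
        return None
    status = row.get("status")
    if status in ("active", "Active", "ACTIVE"):
        status = "active"
    elif status in ("inactive", "Inactive"):
        status = "inactive"
    else:
        status = "unknown"
    return {
        "email": email.lower().strip(),   # TRIM(LOWER(email))
        "name": row.get("name") or "Unknown",  # COALESCE(name, 'Unknown')
        "status": status,
    }
```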

002_order_metrics.sql:

CREATE OR REPLACE TABLE order_metrics AS
SELECT
  customer_id,
  COUNT(*) AS total_orders,
  SUM(total) AS lifetime_value,
  AVG(total) AS avg_order_value,
  MIN(order_date) AS first_order,
  MAX(order_date) AS last_order,
  DATEDIFF('day', MAX(order_date), CURRENT_DATE) AS days_since_last_order,
  COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled_orders
FROM orders
GROUP BY customer_id;
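The per-customer rollup can likewise be sketched in Python to check edge cases such as cancelled orders; a minimal, illustrative equivalent of the aggregation above (order_metrics here is our helper, not a tool API):

```python
from datetime import date

def order_metrics(orders, today):
    """Rollup for one customer's orders, mirroring 002_order_metrics.sql.
    Each order is a dict with 'total', 'order_date' and 'status'."""
    totals = [o["total"] for o in orders]
    dates = [o["order_date"] for o in orders]
    return {
        "total_orders": len(orders),
        "lifetime_value": sum(totals),
        "avg_order_value": sum(totals) / len(orders),
        "first_order": min(dates),
        "last_order": max(dates),
        # DATEDIFF('day', last_order, CURRENT_DATE)
        "days_since_last_order": (today - max(dates)).days,
        "cancelled_orders": sum(1 for o in orders if o["status"] == "cancelled"),
    }

# Invented sample orders for one customer
sample_orders = [
    {"total": 100.0, "order_date": date(2026, 1, 1), "status": "delivered"},
    {"total": 50.0, "order_date": date(2026, 2, 1), "status": "cancelled"},
]
metrics = order_metrics(sample_orders, today=date(2026, 2, 11))
```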

003_customer_segments.sql:

CREATE OR REPLACE TABLE customer_segments AS
SELECT
  c.id,
  c.name,
  c.email,
  m.total_orders,
  m.lifetime_value,
  m.avg_order_value,
  m.days_since_last_order,
  CASE
    WHEN m.lifetime_value > 5000 AND m.days_since_last_order < 30 THEN 'VIP Active'
    WHEN m.lifetime_value > 5000 AND m.days_since_last_order >= 30 THEN 'VIP At Risk'
    WHEN m.lifetime_value > 1000 THEN 'Regular'
    WHEN m.total_orders = 1 THEN 'One-time'
    ELSE 'Low Value'
  END AS segment
FROM clean_customers c
LEFT JOIN order_metrics m ON c.id = m.customer_id;
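The CASE expression is the heart of the segmentation, so it is worth pinning down as a standalone function; an illustrative Python version of the same rules, including the NULL metrics a LEFT JOIN produces for customers with no orders:

```python
def segment(lifetime_value, total_orders, days_since_last_order):
    """The CASE expression from 003_customer_segments.sql as a function.
    Customers with no order_metrics row have NULL metrics; every WHEN
    comparison against NULL is false, so they land in the ELSE branch."""
    if lifetime_value is None:
        return "Low Value"
    if lifetime_value > 5000:
        return "VIP Active" if days_since_last_order < 30 else "VIP At Risk"
    if lifetime_value > 1000:
        return "Regular"
    if total_orders == 1:
        return "One-time"
    return "Low Value"
```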
dataspoc-lens transform list
┌───┬─────────────────────────────┐
│ # │ File                        │
├───┼─────────────────────────────┤
│ 1 │ 001_clean_customers.sql     │
│ 2 │ 002_order_metrics.sql       │
│ 3 │ 003_customer_segments.sql   │
└───┴─────────────────────────────┘
dataspoc-lens transform run
Running 001_clean_customers.sql... OK (0.8s)
Running 002_order_metrics.sql... OK (2.1s)
Running 003_customer_segments.sql... OK (0.6s)
3 transform(s) completed successfully.
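Transforms run in ascending filename order, which is why the numeric prefixes matter. A minimal sketch of that ordering logic, with sqlite3 standing in for the tool's engine and two invented one-line transforms:

```python
import sqlite3
import tempfile
from pathlib import Path

def run_transforms(transforms_dir, conn):
    """Execute every *.sql file in ascending filename order,
    mirroring how numbered transform files are applied."""
    executed = []
    for path in sorted(Path(transforms_dir).glob("*.sql")):
        conn.executescript(path.read_text())
        executed.append(path.name)
    return executed

# Demo: create the files out of order, then run them
with tempfile.TemporaryDirectory() as d:
    Path(d, "002_b.sql").write_text("CREATE TABLE b (x INTEGER);")
    Path(d, "001_a.sql").write_text("CREATE TABLE a (x INTEGER);")
    conn = sqlite3.connect(":memory:")
    order = run_transforms(d, conn)
    print(order)  # 001 runs before 002 regardless of creation order
```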
dataspoc-lens query "
  SELECT segment, COUNT(*) AS customers, AVG(lifetime_value) AS avg_ltv
  FROM customer_segments
  GROUP BY segment
  ORDER BY avg_ltv DESC
"
┌──────────────┬───────────┬────────────┐
│ segment      │ customers │ avg_ltv    │
├──────────────┼───────────┼────────────┤
│ VIP Active   │ 420       │ 12,450.00  │
│ VIP At Risk  │ 185       │ 8,200.00   │
│ Regular      │ 3,200     │ 2,150.00   │
│ Low Value    │ 2,100     │ 320.00     │
│ One-time     │ 8,400     │ 85.00      │
└──────────────┴───────────┴────────────┘
(5 row(s), 0.052s)

Example 3: Cross-cloud analysis in Jupyter

Connect to buckets in multiple clouds and analyze everything together in a Jupyter notebook.

pip install "dataspoc-lens[s3,gcs,jupyter]"
dataspoc-lens init
# Add S3 bucket (product data)
dataspoc-lens add-bucket s3://acme-product-data
# Add GCS bucket (analytics data)
dataspoc-lens add-bucket gs://acme-analytics
Bucket added: s3://acme-product-data
Discovering tables...
2 table(s) found.
Bucket added: gs://acme-analytics
Discovering tables...
3 table(s) found.
dataspoc-lens cache products
dataspoc-lens cache page_views
dataspoc-lens cache conversions
dataspoc-lens notebook
# Cell 1: Check available tables from both clouds
%%sql
SHOW TABLES
┌──────────────────┐
│ name             │
├──────────────────┤
│ products         │
│ inventory        │
│ page_views       │
│ conversions      │
│ sessions         │
└──────────────────┘
# Cell 2: Cross-cloud join — product data (S3) with analytics (GCS)
%%sql
SELECT
  p.name AS product,
  p.category,
  COUNT(DISTINCT pv.session_id) AS views,
  COUNT(DISTINCT c.conversion_id) AS purchases,
  ROUND(COUNT(DISTINCT c.conversion_id)::FLOAT
        / NULLIF(COUNT(DISTINCT pv.session_id), 0) * 100, 2) AS conversion_rate
FROM products p
LEFT JOIN page_views pv ON p.product_id = pv.product_id
LEFT JOIN conversions c ON p.product_id = c.product_id
GROUP BY p.name, p.category
ORDER BY conversion_rate DESC
LIMIT 10
# Cell 3: Visualize conversion funnel
import pandas as pd

# conn is the connection object provided in the notebook session
df = conn.execute("""
    SELECT
        category,
        SUM(views) AS total_views,
        SUM(purchases) AS total_purchases
    FROM (
        SELECT p.category,
               COUNT(DISTINCT pv.session_id) AS views,
               COUNT(DISTINCT c.conversion_id) AS purchases
        FROM products p
        LEFT JOIN page_views pv ON p.product_id = pv.product_id
        LEFT JOIN conversions c ON p.product_id = c.product_id
        GROUP BY p.category, p.product_id
    )
    GROUP BY category
    ORDER BY total_views DESC
""").fetchdf()
df.plot(x='category', y=['total_views', 'total_purchases'],
        kind='bar', title='Views vs Purchases by Category')
# Cell 4: Check cache status
from dataspoc_lens import LensClient

with LensClient() as client:
    for entry in client.cache_status():
        size_mb = entry["size_bytes"] / (1024 * 1024)
        print(f"{entry['table']}: {entry['status']} ({size_mb:.1f} MB)")
products: fresh (0.8 MB)
page_views: fresh (45.2 MB)
conversions: fresh (12.1 MB)
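One detail worth calling out from Cell 2: NULLIF(COUNT(DISTINCT pv.session_id), 0) keeps products with zero views from causing a division-by-zero error. The same guard, expressed as a small illustrative Python function:

```python
def conversion_rate(views, purchases):
    """Python equivalent of
    ROUND(purchases::FLOAT / NULLIF(views, 0) * 100, 2).
    NULLIF turns a zero denominator into NULL in SQL; here we
    return None instead of raising ZeroDivisionError."""
    if views == 0:
        return None
    return round(purchases / views * 100, 2)

print(conversion_rate(200, 13))  # 6.5 (percent)
print(conversion_rate(0, 0))     # None: product with no views
```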