# End-to-End Demo
This walkthrough runs the complete DataSpoc workflow from scratch: download a real dataset, ingest it into a local data lake with Pipe, then query and export with Lens. The entire process takes under a minute.
## Prerequisites

- Python 3.10+
- DataSpoc Pipe and Lens installed (or available via `python -m`)
- `curl` available on your system
If you have the dataspoc workspace cloned:

```sh
cd dataspoc-pipe
source .venv/bin/activate
```

Or install from PyPI:

```sh
pip install dataspoc-pipe dataspoc-lens
```

## Running the demo

```sh
bash examples/e2e-demo.sh
```

The script creates a temporary directory, runs all steps automatically, and prints the results. Below is a detailed explanation of each step.
## Step 1: Download the Iris dataset

The script downloads the classic Iris dataset (150 rows, 5 columns) from the seaborn-data GitHub repository.
```sh
curl -sL "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv" \
  -o "$DEMO_DIR/iris.csv"
```

The CSV contains the columns sepal_length, sepal_width, petal_length, petal_width, and species.
Expected output:
```
--- Step 1: Downloading Iris dataset ---
  Downloaded 150 rows to /tmp/dataspoc-e2e-XXXXXX/iris.csv
```

## Step 2: Configure the mock Singer tap

A config file is created pointing the mock tap at the downloaded CSV:
```json
{"csv_path": "/tmp/dataspoc-e2e-XXXXXX/iris.csv", "stream_name": "iris"}
```

The mock_tap_csv.py script reads any CSV and emits Singer protocol messages (SCHEMA, RECORD, STATE), so Pipe can ingest it without installing a real Singer tap.
Expected output:
```
--- Step 2: Setting up mock Singer tap ---
  Tap config: /tmp/dataspoc-e2e-XXXXXX/tap-config.json
  Verifying tap output (first 2 messages):
    {"type": "SCHEMA", "stream": "iris", "schema": {...}, "key_properties": []}
    {"type": "RECORD", "stream": "iris", "record": {...}}
```

## Step 3: Initialize Pipe and create the pipeline

Pipe is initialized and a pipeline YAML is written directly (bypassing the interactive wizard):
```yaml
source:
  tap: "python examples/mock_tap_csv.py"
  config: "/tmp/dataspoc-e2e-XXXXXX/tap-config.json"
destination:
  bucket: "file:///tmp/dataspoc-e2e-XXXXXX/lake"
  path: raw
  compression: zstd
incremental:
  enabled: false
schedule:
  cron: null
```

Expected output:

```
--- Step 3: Initializing DataSpoc Pipe ---
  Pipeline config saved to ~/.dataspoc-pipe/pipelines/iris-demo.yaml
```

## Step 4: Run the pipeline
Section titled “Step 4: Run the pipeline”Pipe executes the tap, reads the Singer messages from stdout, converts records to a PyArrow table, and writes a zstd-compressed Parquet file to the lake.
```sh
dataspoc-pipe run iris-demo
```

Expected output:
```
--- Step 4: Running DataSpoc Pipe (ingest CSV -> Parquet) ---
Running: iris-demo
  iris: 150 records...
Done! 150 records in 1 stream(s)
  iris: 150
```

## Step 5: Inspect the lake
The lake now contains a Parquet file and a manifest:
```sh
find "$LAKE_DIR" -name '*.parquet'
dataspoc-pipe manifest "file://$LAKE_DIR"
```

Expected output:
```
--- Step 5: Inspecting lake contents ---
  Parquet files in lake:
    /tmp/dataspoc-e2e-XXXXXX/lake/raw/iris/iris_0000.parquet (4096 bytes)

  Manifest:
Tables in file:///tmp/dataspoc-e2e-XXXXXX/lake:
  iris: 150 rows (raw/iris)
```

### Lake structure

```
lake/
  .dataspoc/
    manifest.json
    logs/iris-demo/<timestamp>.json
  raw/
    iris/
      iris_0000.parquet
```
## Step 6: Pipeline status

```sh
dataspoc-pipe status
```

Expected output:
```
--- Step 6: Pipeline status ---
  iris-demo: OK (150 records, last run just now)
```

## Step 7: Set up Lens
Lens is initialized and the lake is registered as a bucket. The script also converts the manifest format (Pipe writes a dict-keyed manifest, Lens expects a list-keyed manifest).
```sh
dataspoc-lens init
dataspoc-lens add-bucket "file:///tmp/dataspoc-e2e-XXXXXX/lake"
```

Expected output:
```
--- Step 7: Setting up DataSpoc Lens ---
  Manifest converted for Lens compatibility.
  Bucket added: file:///tmp/dataspoc-e2e-XXXXXX/lake
```

## Step 8: View the catalog
```sh
dataspoc-lens catalog
```

Expected output:
```
--- Step 8: Viewing catalog ---
  iris: 150 rows, 5 columns
```

## Step 9: Run queries
Five queries demonstrate different analytical patterns:
Query 1 -- First 10 rows:

```sql
SELECT * FROM iris LIMIT 10
```

Query 2 -- Row count:

```sql
SELECT COUNT(*) AS total_rows FROM iris
```

Returns 150.

Query 3 -- Average measurements per species:

```sql
SELECT
  species,
  ROUND(AVG(sepal_length), 2) AS avg_sepal_len,
  ROUND(AVG(sepal_width), 2) AS avg_sepal_wid,
  ROUND(AVG(petal_length), 2) AS avg_petal_len,
  ROUND(AVG(petal_width), 2) AS avg_petal_wid
FROM iris
GROUP BY species
ORDER BY species
```

Expected results:
| species | avg_sepal_len | avg_sepal_wid | avg_petal_len | avg_petal_wid |
|---|---|---|---|---|
| setosa | 5.01 | 3.42 | 1.46 | 0.24 |
| versicolor | 5.94 | 2.77 | 4.26 | 1.33 |
| virginica | 6.59 | 2.97 | 5.55 | 2.03 |
Query 4 -- Species distribution:

```sql
SELECT species, COUNT(*) AS n FROM iris GROUP BY species ORDER BY n DESC
```

Each species has exactly 50 rows.

Query 5 -- Top 5 largest petals:

```sql
SELECT species, petal_length, petal_width
FROM iris
ORDER BY petal_length DESC
LIMIT 5
```

All top results are from the virginica species.
## Step 10: Export results

The demo exports the full dataset to CSV and a summary to JSON:
```sh
dataspoc-lens export "SELECT * FROM iris" --format csv --output export.csv
dataspoc-lens export "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(petal_length),2) AS avg_petal_len FROM iris GROUP BY species ORDER BY species" --format json --output summary.json
```

Expected output:
```
--- Step 10: Exporting results ---
  Exported files:
  -rw-r--r-- 1 user user 5.2K export.csv
  -rw-r--r-- 1 user user  210 summary.json
```

## The Docker alternative
If you prefer not to install anything locally, the Docker demo image includes Pipe, Lens, Jupyter, and three pre-ingested datasets (Iris, Titanic, Tips).
```sh
cd dataspoc-pipe

# Build
docker build -f examples/Dockerfile.demo -t dataspoc-demo .

# Run Jupyter
docker run -p 8888:8888 dataspoc-demo

# Or run queries directly
docker run -it dataspoc-demo dataspoc-lens shell
```

See the Pipe examples page for full details on the Docker image.
## Cleanup

The demo creates files in two locations:

```sh
# Remove the temporary demo directory
rm -rf /tmp/dataspoc-e2e-XXXXXX

# Remove the pipeline config
rm -f ~/.dataspoc-pipe/pipelines/iris-demo.yaml
```

The script prints the exact paths at the end of the run. The Lens config (~/.dataspoc-lens/) can be removed with:

```sh
rm -rf ~/.dataspoc-lens
```

## Full script
The complete e2e-demo.sh is available in the dataspoc-pipe repository.
```bash
#!/bin/bash
# ============================================================================
# DataSpoc E2E Demo -- From raw data to analysis
#
# Downloads a real dataset from the web, ingests it with DataSpoc Pipe,
# then analyzes it with DataSpoc Lens.
#
# Usage:
#   cd dataspoc-pipe
#   source .venv/bin/activate
#   bash examples/e2e-demo.sh
# ============================================================================

set -euo pipefail

# -- Resolve paths ----------------------------------------------------------
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
DEMO_DIR=$(mktemp -d -t dataspoc-e2e-XXXXXX)
LAKE_DIR="$DEMO_DIR/lake"
MOCK_TAP="$SCRIPT_DIR/mock_tap_csv.py"

# Ensure dataspoc-lens is importable even if not pip-installed
export PYTHONPATH="${PROJECT_DIR}/lens/src:${PROJECT_DIR}/src:${PYTHONPATH:-}"

# Resolve CLI commands -- prefer installed entry-points, fall back to module
if command -v dataspoc-pipe &>/dev/null; then
  PIPE_CMD="dataspoc-pipe"
else
  PIPE_CMD="python -m dataspoc_pipe.cli"
fi

if command -v dataspoc-lens &>/dev/null; then
  LENS_CMD="dataspoc-lens"
else
  LENS_CMD="python -m dataspoc_lens"
fi

# Dataset URL -- Iris from the UCI Machine Learning Repository (GitHub mirror)
DATASET_URL="https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

echo "============================================================"
echo "  DataSpoc E2E Demo"
echo "============================================================"
echo ""
echo "  Working directory : $DEMO_DIR"
echo "  Lake directory    : $LAKE_DIR"
echo "  Pipe CLI          : $PIPE_CMD"
echo "  Lens CLI          : $LENS_CMD"
echo ""

# -- Step 1: Download dataset -----------------------------------------------
echo "--- Step 1: Downloading Iris dataset ---"
curl -sL "$DATASET_URL" -o "$DEMO_DIR/iris.csv"
ROW_COUNT=$(tail -n +2 "$DEMO_DIR/iris.csv" | wc -l)
echo "  Downloaded $ROW_COUNT rows to $DEMO_DIR/iris.csv"
echo ""

# -- Step 2: Create tap config ----------------------------------------------
echo "--- Step 2: Setting up mock Singer tap ---"
cat > "$DEMO_DIR/tap-config.json" <<EOF
{"csv_path": "$DEMO_DIR/iris.csv", "stream_name": "iris"}
EOF
echo "  Tap config: $DEMO_DIR/tap-config.json"

# Quick sanity check -- first 2 lines from mock tap
echo "  Verifying tap output (first 2 messages):"
set +o pipefail
python "$MOCK_TAP" --config "$DEMO_DIR/tap-config.json" 2>/dev/null | head -2 | while read -r line; do
  echo "    $line"
done
set -o pipefail
echo ""

# -- Step 3: Initialize Pipe and create pipeline ----------------------------
echo "--- Step 3: Initializing DataSpoc Pipe ---"
$PIPE_CMD init

# Write pipeline YAML directly (avoids interactive wizard)
mkdir -p ~/.dataspoc-pipe/pipelines
cat > ~/.dataspoc-pipe/pipelines/iris-demo.yaml <<EOF
source:
  tap: "python $MOCK_TAP"
  config: "$DEMO_DIR/tap-config.json"
destination:
  bucket: "file://$LAKE_DIR"
  path: raw
  compression: zstd
incremental:
  enabled: false
schedule:
  cron: null
EOF
echo "  Pipeline config saved to ~/.dataspoc-pipe/pipelines/iris-demo.yaml"
echo ""

# -- Step 4: Run Pipe -------------------------------------------------------
echo "--- Step 4: Running DataSpoc Pipe (ingest CSV -> Parquet) ---"
$PIPE_CMD run iris-demo
echo ""

# -- Step 5: Inspect the lake -----------------------------------------------
echo "--- Step 5: Inspecting lake contents ---"
echo "  Parquet files in lake:"
find "$LAKE_DIR" -name '*.parquet' -printf "    %p (%s bytes)\n" 2>/dev/null || true
echo ""
echo "  Manifest:"
$PIPE_CMD manifest "file://$LAKE_DIR"
echo ""

# -- Step 6: Pipeline status -------------------------------------------------
echo "--- Step 6: Pipeline status ---"
$PIPE_CMD status
echo ""

# -- Step 7: Initialize Lens ------------------------------------------------
echo "--- Step 7: Setting up DataSpoc Lens ---"

# Note: Pipe writes a dict-keyed manifest; Lens expects a list-keyed manifest.
# Convert the manifest so Lens can read it via manifest-first discovery.
python - "$LAKE_DIR" <<'PYEOF'
import json, sys

mpath = f"{sys.argv[1]}/.dataspoc/manifest.json"
try:
    with open(mpath) as f:
        m = json.load(f)
    tables_dict = m.get("tables", {})
    if isinstance(tables_dict, dict):
        tables_list = []
        for key, val in tables_dict.items():
            entry = dict(val)
            if "location" not in entry:
                entry["location"] = f"raw/{key}"
            if "row_count" not in entry:
                stats = entry.pop("stats", {})
                entry["row_count"] = stats.get("total_rows", 0)
            tables_list.append(entry)
        m["tables"] = tables_list
        with open(mpath, "w") as f:
            json.dump(m, f, indent=2)
    print("  Manifest converted for Lens compatibility.")
except FileNotFoundError:
    print("  No manifest found (Lens will use scan fallback).")
PYEOF

$LENS_CMD init
$LENS_CMD add-bucket "file://$LAKE_DIR"
echo ""

# -- Step 8: Catalog ---------------------------------------------------------
echo "--- Step 8: Viewing catalog ---"
$LENS_CMD catalog
echo ""

# -- Step 9: Run queries -----------------------------------------------------
echo "--- Step 9: Querying with DataSpoc Lens ---"

echo ""
echo "[Query 1] First 10 rows:"
$LENS_CMD query "SELECT * FROM iris LIMIT 10"

echo ""
echo "[Query 2] Row count:"
$LENS_CMD query "SELECT count(*) AS total_rows FROM iris"

echo ""
echo "[Query 3] Average measurements per species:"
$LENS_CMD query "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(sepal_width),2) AS avg_sepal_wid, ROUND(AVG(petal_length),2) AS avg_petal_len, ROUND(AVG(petal_width),2) AS avg_petal_wid FROM iris GROUP BY species ORDER BY species"

echo ""
echo "[Query 4] Species distribution:"
$LENS_CMD query "SELECT species, count(*) AS n FROM iris GROUP BY species ORDER BY n DESC"

echo ""
echo "[Query 5] Top 5 largest petals:"
$LENS_CMD query "SELECT species, petal_length, petal_width FROM iris ORDER BY petal_length DESC LIMIT 5"
echo ""

# -- Step 10: Export results -------------------------------------------------
echo "--- Step 10: Exporting results ---"
$LENS_CMD export "SELECT * FROM iris" --format csv --output "$DEMO_DIR/export.csv"
$LENS_CMD export "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(petal_length),2) AS avg_petal_len FROM iris GROUP BY species ORDER BY species" --format json --output "$DEMO_DIR/summary.json"
echo ""
echo "  Exported files:"
ls -lh "$DEMO_DIR/export.csv" "$DEMO_DIR/summary.json"
echo ""

# -- Done --------------------------------------------------------------------
echo "============================================================"
echo "  Demo complete!"
echo "============================================================"
echo ""
echo "  Lake location : $LAKE_DIR"
echo "  CSV export    : $DEMO_DIR/export.csv"
echo "  JSON export   : $DEMO_DIR/summary.json"
echo ""
echo "  To explore interactively:"
echo "    $LENS_CMD shell"
echo ""
echo "  To clean up:"
echo "    rm -rf $DEMO_DIR"
echo "    rm -f ~/.dataspoc-pipe/pipelines/iris-demo.yaml"
echo "============================================================"
```