# End-to-End Demo
This walkthrough runs the complete DataSpoc workflow from scratch: download a real dataset, ingest it into a local data lake with Pipe, then query and export with Lens. The entire process takes under a minute.
## Prerequisites

- Python 3.10+
- DataSpoc Pipe and Lens installed (or available via `python -m`)
- `curl` available on your system
If you have the dataspoc workspace cloned:

```sh
cd dataspoc-pipe
source .venv/bin/activate
```

Or install from PyPI:

```sh
pip install dataspoc-pipe dataspoc-lens
```

## Running the demo

```sh
bash examples/e2e-demo.sh
```

The script creates a temporary directory, runs all steps automatically, and prints the results. Below is a detailed explanation of each step.
## Step 1: Download the Iris dataset

The script downloads the classic Iris dataset (150 rows, 5 columns) from the seaborn-data GitHub repository.
```sh
curl -sL "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv" \
  -o "$DEMO_DIR/iris.csv"
```

The CSV contains the columns sepal_length, sepal_width, petal_length, petal_width, and species.
Expected output:
```
--- Step 1: Downloading Iris dataset ---
  Downloaded 150 rows to /tmp/dataspoc-e2e-XXXXXX/iris.csv
```

## Step 2: Configure the mock Singer tap

A config file is created pointing the mock tap at the downloaded CSV:
```json
{"csv_path": "/tmp/dataspoc-e2e-XXXXXX/iris.csv", "stream_name": "iris"}
```

The mock_tap_csv.py script reads any CSV and emits Singer protocol messages (SCHEMA, RECORD, STATE), so Pipe can ingest it without installing a real Singer tap.
Expected output:
```
--- Step 2: Setting up mock Singer tap ---
  Tap config: /tmp/dataspoc-e2e-XXXXXX/tap-config.json
  Verifying tap output (first 2 messages):
    {"type": "SCHEMA", "stream": "iris", "schema": {...}, "key_properties": []}
    {"type": "RECORD", "stream": "iris", "record": {...}}
```

## Step 3: Initialize Pipe and create the pipeline

Pipe is initialized and a pipeline YAML is written directly (bypassing the interactive wizard):
```yaml
source:
  tap: "python examples/mock_tap_csv.py"
  config: "/tmp/dataspoc-e2e-XXXXXX/tap-config.json"
destination:
  bucket: "file:///tmp/dataspoc-e2e-XXXXXX/lake"
  path: raw
  compression: zstd
incremental:
  enabled: false
schedule:
  cron: null
```

Expected output:

```
--- Step 3: Initializing DataSpoc Pipe ---
  Pipeline config saved to ~/.dataspoc-pipe/pipelines/iris-demo.yaml
```

## Step 4: Run the pipeline
Section titled “Step 4: Run the pipeline”Pipe executes the tap, reads the Singer messages from stdout, converts records to a PyArrow table, and writes a zstd-compressed Parquet file to the lake.
```sh
dataspoc-pipe run iris-demo
```

Expected output:
```
--- Step 4: Running DataSpoc Pipe (ingest CSV -> Parquet) ---
Running: iris-demo
  iris: 150 records...
Done! 150 records in 1 stream(s)
  iris: 150
```

## Step 5: Inspect the lake
The lake now contains a Parquet file and a manifest:
```sh
find "$LAKE_DIR" -name '*.parquet'
dataspoc-pipe manifest "file://$LAKE_DIR"
```

Expected output:
```
--- Step 5: Inspecting lake contents ---
  Parquet files in lake:
    /tmp/dataspoc-e2e-XXXXXX/lake/raw/iris/iris_0000.parquet (4096 bytes)

  Manifest:
Tables in file:///tmp/dataspoc-e2e-XXXXXX/lake:
  iris: 150 rows (raw/iris)
```

### Lake structure

```
lake/
  .dataspoc/
    manifest.json
    logs/iris-demo/<timestamp>.json
  raw/
    iris/
      iris_0000.parquet
```
## Step 6: Pipeline status

```sh
dataspoc-pipe status
```

Expected output:
```
--- Step 6: Pipeline status ---
  iris-demo: OK (150 records, last run just now)
```

## Step 7: Set up Lens
Lens is initialized and the lake is registered as a bucket. The script also converts the manifest format (Pipe writes a dict-keyed manifest, Lens expects a list-keyed manifest).
```sh
dataspoc-lens init
dataspoc-lens add-bucket "file:///tmp/dataspoc-e2e-XXXXXX/lake"
```

Expected output:
```
--- Step 7: Setting up DataSpoc Lens ---
  Manifest converted for Lens compatibility.
  Bucket added: file:///tmp/dataspoc-e2e-XXXXXX/lake
```

## Step 8: View the catalog
```sh
dataspoc-lens catalog
```

Expected output:
```
--- Step 8: Viewing catalog ---
  iris: 150 rows, 5 columns
```

## Step 9: Run queries
Five queries demonstrate different analytical patterns:
Query 1 -- First 10 rows:

```sql
SELECT * FROM iris LIMIT 10
```

Query 2 -- Row count:

```sql
SELECT COUNT(*) AS total_rows FROM iris
```

Returns 150.

Query 3 -- Average measurements per species:

```sql
SELECT
  species,
  ROUND(AVG(sepal_length), 2) AS avg_sepal_len,
  ROUND(AVG(sepal_width), 2) AS avg_sepal_wid,
  ROUND(AVG(petal_length), 2) AS avg_petal_len,
  ROUND(AVG(petal_width), 2) AS avg_petal_wid
FROM iris
GROUP BY species
ORDER BY species
```

Expected results:
| species | avg_sepal_len | avg_sepal_wid | avg_petal_len | avg_petal_wid |
|---|---|---|---|---|
| setosa | 5.01 | 3.42 | 1.46 | 0.24 |
| versicolor | 5.94 | 2.77 | 4.26 | 1.33 |
| virginica | 6.59 | 2.97 | 5.55 | 2.03 |
Query 4 -- Species distribution:

```sql
SELECT species, COUNT(*) AS n FROM iris GROUP BY species ORDER BY n DESC
```

Each species has exactly 50 rows.

Query 5 -- Top 5 largest petals:

```sql
SELECT species, petal_length, petal_width
FROM iris
ORDER BY petal_length DESC
LIMIT 5
```

All top results are from the virginica species.
## Step 10: Export results

The demo exports the full dataset to CSV and a summary to JSON:
```sh
dataspoc-lens export "SELECT * FROM iris" --format csv --output export.csv
dataspoc-lens export "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(petal_length),2) AS avg_petal_len FROM iris GROUP BY species ORDER BY species" --format json --output summary.json
```

Expected output:
```
--- Step 10: Exporting results ---
  Exported files:
  -rw-r--r-- 1 user user 5.2K export.csv
  -rw-r--r-- 1 user user  210 summary.json
```

## The Docker alternative
If you prefer not to install anything locally, the Docker demo image includes Pipe, Lens, Jupyter, and three pre-ingested datasets (Iris, Titanic, Tips).
```sh
cd dataspoc-pipe

# Build
docker build -f examples/Dockerfile.demo -t dataspoc-demo .

# Run Jupyter
docker run -p 8888:8888 dataspoc-demo

# Or run queries directly
docker run -it dataspoc-demo dataspoc-lens shell
```

See the Pipe examples page for full details on the Docker image.
## Cleanup

The demo creates files in two locations:

```sh
# Remove the temporary demo directory
rm -rf /tmp/dataspoc-e2e-XXXXXX

# Remove the pipeline config
rm -f ~/.dataspoc-pipe/pipelines/iris-demo.yaml
```

The script prints the exact paths at the end of the run. The Lens config (~/.dataspoc-lens/) can be removed with:

```sh
rm -rf ~/.dataspoc-lens
```

## Full script
The complete e2e-demo.sh is available in the dataspoc-pipe repository.
```bash
#!/bin/bash
# ============================================================================
# DataSpoc E2E Demo -- From raw data to analysis
#
# Downloads a real dataset from the web, ingests it with DataSpoc Pipe,
# then analyzes it with DataSpoc Lens.
#
# Usage:
#   cd dataspoc-pipe
#   source .venv/bin/activate
#   bash examples/e2e-demo.sh
# ============================================================================

set -euo pipefail

# -- Resolve paths ----------------------------------------------------------
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
DEMO_DIR=$(mktemp -d -t dataspoc-e2e-XXXXXX)
LAKE_DIR="$DEMO_DIR/lake"
MOCK_TAP="$SCRIPT_DIR/mock_tap_csv.py"

# Ensure dataspoc-lens is importable even if not pip-installed
export PYTHONPATH="${PROJECT_DIR}/lens/src:${PROJECT_DIR}/src:${PYTHONPATH:-}"

# Resolve CLI commands -- prefer installed entry-points, fall back to module
if command -v dataspoc-pipe &>/dev/null; then
  PIPE_CMD="dataspoc-pipe"
else
  PIPE_CMD="python -m dataspoc_pipe.cli"
fi

if command -v dataspoc-lens &>/dev/null; then
  LENS_CMD="dataspoc-lens"
else
  LENS_CMD="python -m dataspoc_lens"
fi

# Dataset URL -- Iris from the UCI Machine Learning Repository (GitHub mirror)
DATASET_URL="https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

echo "============================================================"
echo "  DataSpoc E2E Demo"
echo "============================================================"
echo ""
echo "  Working directory : $DEMO_DIR"
echo "  Lake directory    : $LAKE_DIR"
echo "  Pipe CLI          : $PIPE_CMD"
echo "  Lens CLI          : $LENS_CMD"
echo ""

# -- Step 1: Download dataset -----------------------------------------------
echo "--- Step 1: Downloading Iris dataset ---"
curl -sL "$DATASET_URL" -o "$DEMO_DIR/iris.csv"
ROW_COUNT=$(tail -n +2 "$DEMO_DIR/iris.csv" | wc -l)
echo "  Downloaded $ROW_COUNT rows to $DEMO_DIR/iris.csv"
echo ""

# -- Step 2: Create tap config ----------------------------------------------
echo "--- Step 2: Setting up mock Singer tap ---"
cat > "$DEMO_DIR/tap-config.json" <<EOF
{"csv_path": "$DEMO_DIR/iris.csv", "stream_name": "iris"}
EOF
echo "  Tap config: $DEMO_DIR/tap-config.json"

# Quick sanity check -- first 2 lines from mock tap
echo "  Verifying tap output (first 2 messages):"
set +o pipefail
python "$MOCK_TAP" --config "$DEMO_DIR/tap-config.json" 2>/dev/null | head -2 | while read -r line; do
  echo "    $line"
done
set -o pipefail
echo ""

# -- Step 3: Initialize Pipe and create pipeline ----------------------------
echo "--- Step 3: Initializing DataSpoc Pipe ---"
$PIPE_CMD init

# Write pipeline YAML directly (avoids interactive wizard)
mkdir -p ~/.dataspoc-pipe/pipelines
cat > ~/.dataspoc-pipe/pipelines/iris-demo.yaml <<EOF
source:
  tap: "python $MOCK_TAP"
  config: "$DEMO_DIR/tap-config.json"
destination:
  bucket: "file://$LAKE_DIR"
  path: raw
  compression: zstd
incremental:
  enabled: false
schedule:
  cron: null
EOF
echo "  Pipeline config saved to ~/.dataspoc-pipe/pipelines/iris-demo.yaml"
echo ""

# -- Step 4: Run Pipe -------------------------------------------------------
echo "--- Step 4: Running DataSpoc Pipe (ingest CSV -> Parquet) ---"
$PIPE_CMD run iris-demo
echo ""

# -- Step 5: Inspect the lake -----------------------------------------------
echo "--- Step 5: Inspecting lake contents ---"
echo "  Parquet files in lake:"
find "$LAKE_DIR" -name '*.parquet' -printf "    %p (%s bytes)\n" 2>/dev/null || true
echo ""
echo "  Manifest:"
$PIPE_CMD manifest "file://$LAKE_DIR"
echo ""

# -- Step 6: Pipeline status -------------------------------------------------
echo "--- Step 6: Pipeline status ---"
$PIPE_CMD status
echo ""

# -- Step 7: Initialize Lens ------------------------------------------------
echo "--- Step 7: Setting up DataSpoc Lens ---"

# Note: Pipe writes a dict-keyed manifest; Lens expects a list-keyed manifest.
# Convert the manifest so Lens can read it via manifest-first discovery.
python - "$LAKE_DIR" <<'PYEOF'
import json, sys

mpath = f"{sys.argv[1]}/.dataspoc/manifest.json"
try:
    with open(mpath) as f:
        m = json.load(f)
    tables_dict = m.get("tables", {})
    if isinstance(tables_dict, dict):
        tables_list = []
        for key, val in tables_dict.items():
            entry = dict(val)
            if "location" not in entry:
                entry["location"] = f"raw/{key}"
            if "row_count" not in entry:
                stats = entry.pop("stats", {})
                entry["row_count"] = stats.get("total_rows", 0)
            tables_list.append(entry)
        m["tables"] = tables_list
        with open(mpath, "w") as f:
            json.dump(m, f, indent=2)
    print("  Manifest converted for Lens compatibility.")
except FileNotFoundError:
    print("  No manifest found (Lens will use scan fallback).")
PYEOF

$LENS_CMD init
$LENS_CMD add-bucket "file://$LAKE_DIR"
echo ""

# -- Step 8: Catalog ---------------------------------------------------------
echo "--- Step 8: Viewing catalog ---"
$LENS_CMD catalog
echo ""

# -- Step 9: Run queries -----------------------------------------------------
echo "--- Step 9: Querying with DataSpoc Lens ---"

echo ""
echo "[Query 1] First 10 rows:"
$LENS_CMD query "SELECT * FROM iris LIMIT 10"

echo ""
echo "[Query 2] Row count:"
$LENS_CMD query "SELECT count(*) AS total_rows FROM iris"

echo ""
echo "[Query 3] Average measurements per species:"
$LENS_CMD query "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(sepal_width),2) AS avg_sepal_wid, ROUND(AVG(petal_length),2) AS avg_petal_len, ROUND(AVG(petal_width),2) AS avg_petal_wid FROM iris GROUP BY species ORDER BY species"

echo ""
echo "[Query 4] Species distribution:"
$LENS_CMD query "SELECT species, count(*) AS n FROM iris GROUP BY species ORDER BY n DESC"

echo ""
echo "[Query 5] Top 5 largest petals:"
$LENS_CMD query "SELECT species, petal_length, petal_width FROM iris ORDER BY petal_length DESC LIMIT 5"
echo ""

# -- Step 10: Export results -------------------------------------------------
echo "--- Step 10: Exporting results ---"
$LENS_CMD export "SELECT * FROM iris" --format csv --output "$DEMO_DIR/export.csv"
$LENS_CMD export "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(petal_length),2) AS avg_petal_len FROM iris GROUP BY species ORDER BY species" --format json --output "$DEMO_DIR/summary.json"
echo ""
echo "  Exported files:"
ls -lh "$DEMO_DIR/export.csv" "$DEMO_DIR/summary.json"
echo ""

# -- Done --------------------------------------------------------------------
echo "============================================================"
echo "  Demo complete!"
echo "============================================================"
echo ""
echo "  Lake location : $LAKE_DIR"
echo "  CSV export    : $DEMO_DIR/export.csv"
echo "  JSON export   : $DEMO_DIR/summary.json"
echo ""
echo "  To explore interactively:"
echo "    $LENS_CMD shell"
echo ""
echo "  To clean up:"
echo "    rm -rf $DEMO_DIR"
echo "    rm -f ~/.dataspoc-pipe/pipelines/iris-demo.yaml"
echo "============================================================"
```