
End-to-End Demo

This walkthrough runs the complete DataSpoc workflow from scratch: download a real dataset, ingest it into a local data lake with Pipe, then query and export with Lens. The entire process takes under a minute.

Prerequisites:

  • Python 3.10+
  • DataSpoc Pipe and Lens installed (or available via python -m)
  • curl available on your system

If you have the dataspoc workspace cloned:

Terminal window
cd dataspoc-pipe
source .venv/bin/activate

Or install from PyPI:

Terminal window
pip install dataspoc-pipe dataspoc-lens
Then run the demo script:

Terminal window
bash examples/e2e-demo.sh

The script creates a temporary directory, runs all steps automatically, and prints the results. Below is a detailed explanation of each step.


Step 1: Download the Iris dataset

The script downloads the classic Iris dataset (150 rows, 5 columns) from the seaborn-data GitHub repository.

Terminal window
curl -sL "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv" \
-o "$DEMO_DIR/iris.csv"

The CSV contains columns: sepal_length, sepal_width, petal_length, petal_width, species.

Expected output:

--- Step 1: Downloading Iris dataset ---
Downloaded 150 rows to /tmp/dataspoc-e2e-XXXXXX/iris.csv

Step 2: Set up the mock Singer tap

A config file is created pointing the mock tap at the downloaded CSV:

{"csv_path": "/tmp/dataspoc-e2e-XXXXXX/iris.csv", "stream_name": "iris"}

The mock_tap_csv.py script reads any CSV and emits Singer protocol messages (SCHEMA, RECORD, STATE), so Pipe can ingest it without installing a real Singer tap.
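The idea behind the mock tap can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation, not the real mock_tap_csv.py: the real script also emits STATE messages, and it may infer column types, whereas this sketch types every column as a string.

```python
# Illustrative stdlib-only sketch of a mock Singer tap: read a CSV and print
# Singer SCHEMA and RECORD messages as JSON lines. (STATE messages and type
# inference, which the real mock_tap_csv.py handles, are omitted.)
import csv
import json

def emit_csv_as_singer(csv_path, stream_name):
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # Every column is typed as a string for simplicity.
        schema = {
            "type": "object",
            "properties": {col: {"type": "string"} for col in reader.fieldnames},
        }
        # One SCHEMA message first, then one RECORD message per row.
        print(json.dumps({"type": "SCHEMA", "stream": stream_name,
                          "schema": schema, "key_properties": []}))
        for row in reader:
            print(json.dumps({"type": "RECORD", "stream": stream_name,
                              "record": row}))
```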

Expected output:

--- Step 2: Setting up mock Singer tap ---
Tap config: /tmp/dataspoc-e2e-XXXXXX/tap-config.json
Verifying tap output (first 2 messages):
{"type": "SCHEMA", "stream": "iris", "schema": {...}, "key_properties": []}
{"type": "RECORD", "stream": "iris", "record": {...}}

Step 3: Initialize Pipe and create the pipeline


Pipe is initialized and a pipeline YAML is written directly (bypassing the interactive wizard):

source:
  tap: "python examples/mock_tap_csv.py"
  config: "/tmp/dataspoc-e2e-XXXXXX/tap-config.json"
destination:
  bucket: "file:///tmp/dataspoc-e2e-XXXXXX/lake"
  path: raw
  compression: zstd
incremental:
  enabled: false
schedule:
  cron: null

Expected output:

--- Step 3: Initializing DataSpoc Pipe ---
Pipeline config saved to ~/.dataspoc-pipe/pipelines/iris-demo.yaml

Step 4: Run the pipeline

Pipe executes the tap, reads the Singer messages from stdout, converts the records to a PyArrow table, and writes a zstd-compressed Parquet file to the lake.
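The consuming side of this step can be sketched with the standard library: parse the tap's output line by line and group RECORD payloads by stream. The PyArrow table build and Parquet write that Pipe performs afterwards are omitted here.

```python
# Stdlib sketch of the consumer side of the ingest: group Singer RECORD
# messages by stream. SCHEMA and STATE messages are simply skipped.
import json

def collect_records(lines):
    records = {}
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.setdefault(msg["stream"], []).append(msg["record"])
    return records
```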

Terminal window
dataspoc-pipe run iris-demo

Expected output:

--- Step 4: Running DataSpoc Pipe (ingest CSV -> Parquet) ---
Running: iris-demo
iris: 150 records...
Done! 150 records in 1 stream(s)
iris: 150

Step 5: Inspect the lake

The lake now contains a Parquet file and a manifest:

Terminal window
find "$LAKE_DIR" -name '*.parquet'
dataspoc-pipe manifest "file://$LAKE_DIR"

Expected output:

--- Step 5: Inspecting lake contents ---
Parquet files in lake:
/tmp/dataspoc-e2e-XXXXXX/lake/raw/iris/iris_0000.parquet (4096 bytes)
Manifest:
Tables in file:///tmp/dataspoc-e2e-XXXXXX/lake:
iris: 150 rows (raw/iris)
Lake directory layout:

lake/
  .dataspoc/
    manifest.json
    logs/iris-demo/<timestamp>.json
  raw/
    iris/
      iris_0000.parquet
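If you want to poke at the manifest without the CLI, a minimal reader is enough. The .dataspoc/manifest.json path follows the layout above; the manifest schema (a "tables" mapping keyed by table name) is assumed from this demo's output.

```python
# Minimal manifest reader for inspecting the lake directly. Assumes the
# .dataspoc/manifest.json layout and the dict-keyed "tables" field that
# Pipe writes in this demo.
import json
import pathlib

def list_tables(lake_dir):
    manifest_path = pathlib.Path(lake_dir) / ".dataspoc" / "manifest.json"
    manifest = json.loads(manifest_path.read_text())
    return manifest.get("tables", {})
```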
Step 6: Check pipeline status

Terminal window
dataspoc-pipe status

Expected output:

--- Step 6: Pipeline status ---
iris-demo: OK (150 records, last run just now)

Step 7: Set up DataSpoc Lens

Lens is initialized and the lake is registered as a bucket. The script also converts the manifest format (Pipe writes a dict-keyed manifest, Lens expects a list-keyed manifest).
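The conversion itself is small. A sketch mirroring the demo script's logic, with the field names ("location", "row_count", "stats") taken from it:

```python
# Sketch of the dict-to-list manifest conversion performed in Step 7,
# mirroring the demo script: each table entry gains a "location" and a
# "row_count" (pulled out of the "stats" block if present).
def convert_manifest(manifest):
    tables = manifest.get("tables", {})
    if not isinstance(tables, dict):
        return manifest  # already in the list form Lens expects
    converted = []
    for name, entry in tables.items():
        entry = dict(entry)
        entry.setdefault("location", f"raw/{name}")
        if "row_count" not in entry:
            stats = entry.pop("stats", {})
            entry["row_count"] = stats.get("total_rows", 0)
        converted.append(entry)
    manifest["tables"] = converted
    return manifest
```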

Terminal window
dataspoc-lens init
dataspoc-lens add-bucket "file:///tmp/dataspoc-e2e-XXXXXX/lake"

Expected output:

--- Step 7: Setting up DataSpoc Lens ---
Manifest converted for Lens compatibility.
Bucket added: file:///tmp/dataspoc-e2e-XXXXXX/lake
Step 8: View the catalog

Terminal window
dataspoc-lens catalog

Expected output:

--- Step 8: Viewing catalog ---
iris: 150 rows, 5 columns

Step 9: Query with DataSpoc Lens

Five queries demonstrate different analytical patterns:

Query 1 — First 10 rows:

SELECT * FROM iris LIMIT 10

Query 2 — Row count:

SELECT COUNT(*) AS total_rows FROM iris

Returns 150.

Query 3 — Average measurements per species:

SELECT species,
       ROUND(AVG(sepal_length), 2) AS avg_sepal_len,
       ROUND(AVG(sepal_width), 2) AS avg_sepal_wid,
       ROUND(AVG(petal_length), 2) AS avg_petal_len,
       ROUND(AVG(petal_width), 2) AS avg_petal_wid
FROM iris
GROUP BY species
ORDER BY species

Expected results:

species     avg_sepal_len  avg_sepal_wid  avg_petal_len  avg_petal_wid
setosa      5.01           3.42           1.46           0.24
versicolor  5.94           2.77           4.26           1.33
virginica   6.59           2.97           5.55           2.03

Query 4 — Species distribution:

SELECT species, COUNT(*) AS n FROM iris GROUP BY species ORDER BY n DESC

Each species has exactly 50 rows.

Query 5 — Top 5 largest petals:

SELECT species, petal_length, petal_width
FROM iris
ORDER BY petal_length DESC
LIMIT 5

All top results are from the virginica species.
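For comparison, the same "top N by column" shape as Query 5 in plain Python. The rows below are made-up samples for illustration, not lake data.

```python
# Query 5's ORDER BY ... DESC LIMIT N pattern over plain row dicts,
# using heapq.nlargest. Sample rows are invented for illustration.
import heapq

rows = [
    {"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "virginica", "petal_length": 6.9, "petal_width": 2.3},
    {"species": "versicolor", "petal_length": 4.7, "petal_width": 1.4},
    {"species": "virginica", "petal_length": 6.7, "petal_width": 2.2},
]
# ORDER BY petal_length DESC LIMIT 2
top2 = heapq.nlargest(2, rows, key=lambda r: r["petal_length"])
```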

Step 10: Export results

The demo exports the full dataset to CSV and a summary to JSON:

Terminal window
dataspoc-lens export "SELECT * FROM iris" --format csv --output export.csv
dataspoc-lens export "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(petal_length),2) AS avg_petal_len FROM iris GROUP BY species ORDER BY species" --format json --output summary.json

Expected output:

--- Step 10: Exporting results ---
Exported files:
-rw-r--r-- 1 user user 5.2K export.csv
-rw-r--r-- 1 user user 210 summary.json
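A quick way to sanity-check the CSV export: count its data rows and compare against the ingested row count. count_rows is a hypothetical helper, not part of Lens; it assumes the export has a header row, as the demo's export.csv does.

```python
# Hypothetical sanity check: count data rows in an exported CSV
# (csv.DictReader consumes the header line, so only data rows are counted).
import csv

def count_rows(path):
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))
```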

Docker alternative

If you prefer not to install anything locally, the Docker demo image includes Pipe, Lens, Jupyter, and three pre-ingested datasets (Iris, Titanic, Tips).

Terminal window
cd dataspoc-pipe
# Build
docker build -f examples/Dockerfile.demo -t dataspoc-demo .
# Run Jupyter
docker run -p 8888:8888 dataspoc-demo
# Or run queries directly
docker run -it dataspoc-demo dataspoc-lens shell

See the Pipe examples page for full details on the Docker image.


Cleanup

The demo creates files in two locations:

Terminal window
# Remove the temporary demo directory
rm -rf /tmp/dataspoc-e2e-XXXXXX
# Remove the pipeline config
rm -f ~/.dataspoc-pipe/pipelines/iris-demo.yaml

The script prints the exact paths at the end of the run. The Lens config (~/.dataspoc-lens/) can be removed with:

Terminal window
rm -rf ~/.dataspoc-lens

Full script

The complete e2e-demo.sh is available in the dataspoc-pipe repository.

#!/bin/bash
# ============================================================================
# DataSpoc E2E Demo -- From raw data to analysis
#
# Downloads a real dataset from the web, ingests it with DataSpoc Pipe,
# then analyzes it with DataSpoc Lens.
#
# Usage:
#   cd dataspoc-pipe
#   source .venv/bin/activate
#   bash examples/e2e-demo.sh
# ============================================================================
set -euo pipefail
# -- Resolve paths ----------------------------------------------------------
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
DEMO_DIR=$(mktemp -d -t dataspoc-e2e-XXXXXX)
LAKE_DIR="$DEMO_DIR/lake"
MOCK_TAP="$SCRIPT_DIR/mock_tap_csv.py"
# Ensure dataspoc-lens is importable even if not pip-installed
export PYTHONPATH="${PROJECT_DIR}/lens/src:${PROJECT_DIR}/src:${PYTHONPATH:-}"
# Resolve CLI commands -- prefer installed entry-points, fall back to module
if command -v dataspoc-pipe &>/dev/null; then
  PIPE_CMD="dataspoc-pipe"
else
  PIPE_CMD="python -m dataspoc_pipe.cli"
fi
if command -v dataspoc-lens &>/dev/null; then
  LENS_CMD="dataspoc-lens"
else
  LENS_CMD="python -m dataspoc_lens"
fi
# Dataset URL -- the Iris dataset, served from the seaborn-data GitHub repository
DATASET_URL="https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
echo "============================================================"
echo " DataSpoc E2E Demo"
echo "============================================================"
echo ""
echo " Working directory : $DEMO_DIR"
echo " Lake directory : $LAKE_DIR"
echo " Pipe CLI : $PIPE_CMD"
echo " Lens CLI : $LENS_CMD"
echo ""
# -- Step 1: Download dataset -----------------------------------------------
echo "--- Step 1: Downloading Iris dataset ---"
curl -sL "$DATASET_URL" -o "$DEMO_DIR/iris.csv"
ROW_COUNT=$(tail -n +2 "$DEMO_DIR/iris.csv" | wc -l)
echo " Downloaded $ROW_COUNT rows to $DEMO_DIR/iris.csv"
echo ""
# -- Step 2: Create tap config ----------------------------------------------
echo "--- Step 2: Setting up mock Singer tap ---"
cat > "$DEMO_DIR/tap-config.json" <<EOF
{"csv_path": "$DEMO_DIR/iris.csv", "stream_name": "iris"}
EOF
echo " Tap config: $DEMO_DIR/tap-config.json"
# Quick sanity check -- first 2 lines from mock tap
echo " Verifying tap output (first 2 messages):"
set +o pipefail
python "$MOCK_TAP" --config "$DEMO_DIR/tap-config.json" 2>/dev/null | head -2 | while read -r line; do
  echo " $line"
done
set -o pipefail
echo ""
# -- Step 3: Initialize Pipe and create pipeline ----------------------------
echo "--- Step 3: Initializing DataSpoc Pipe ---"
$PIPE_CMD init
# Write pipeline YAML directly (avoids interactive wizard)
mkdir -p ~/.dataspoc-pipe/pipelines
cat > ~/.dataspoc-pipe/pipelines/iris-demo.yaml <<EOF
source:
  tap: "python $MOCK_TAP"
  config: "$DEMO_DIR/tap-config.json"
destination:
  bucket: "file://$LAKE_DIR"
  path: raw
  compression: zstd
incremental:
  enabled: false
schedule:
  cron: null
EOF
echo " Pipeline config saved to ~/.dataspoc-pipe/pipelines/iris-demo.yaml"
echo ""
# -- Step 4: Run Pipe -------------------------------------------------------
echo "--- Step 4: Running DataSpoc Pipe (ingest CSV -> Parquet) ---"
$PIPE_CMD run iris-demo
echo ""
# -- Step 5: Inspect the lake -----------------------------------------------
echo "--- Step 5: Inspecting lake contents ---"
echo " Parquet files in lake:"
find "$LAKE_DIR" -name '*.parquet' -printf " %p (%s bytes)\n" 2>/dev/null || true
echo ""
echo " Manifest:"
$PIPE_CMD manifest "file://$LAKE_DIR"
echo ""
# -- Step 6: Pipeline status -------------------------------------------------
echo "--- Step 6: Pipeline status ---"
$PIPE_CMD status
echo ""
# -- Step 7: Initialize Lens ------------------------------------------------
echo "--- Step 7: Setting up DataSpoc Lens ---"
# Note: Pipe writes a dict-keyed manifest; Lens expects a list-keyed manifest.
# Convert the manifest so Lens can read it via manifest-first discovery.
python - "$LAKE_DIR" <<'PYEOF'
import json, sys
mpath = f"{sys.argv[1]}/.dataspoc/manifest.json"
try:
    with open(mpath) as f:
        m = json.load(f)
    tables_dict = m.get("tables", {})
    if isinstance(tables_dict, dict):
        tables_list = []
        for key, val in tables_dict.items():
            entry = dict(val)
            if "location" not in entry:
                entry["location"] = f"raw/{key}"
            if "row_count" not in entry:
                stats = entry.pop("stats", {})
                entry["row_count"] = stats.get("total_rows", 0)
            tables_list.append(entry)
        m["tables"] = tables_list
        with open(mpath, "w") as f:
            json.dump(m, f, indent=2)
        print(" Manifest converted for Lens compatibility.")
except FileNotFoundError:
    print(" No manifest found (Lens will use scan fallback).")
PYEOF
$LENS_CMD init
$LENS_CMD add-bucket "file://$LAKE_DIR"
echo ""
# -- Step 8: Catalog ---------------------------------------------------------
echo "--- Step 8: Viewing catalog ---"
$LENS_CMD catalog
echo ""
# -- Step 9: Run queries -----------------------------------------------------
echo "--- Step 9: Querying with DataSpoc Lens ---"
echo ""
echo "[Query 1] First 10 rows:"
$LENS_CMD query "SELECT * FROM iris LIMIT 10"
echo ""
echo "[Query 2] Row count:"
$LENS_CMD query "SELECT count(*) AS total_rows FROM iris"
echo ""
echo "[Query 3] Average measurements per species:"
$LENS_CMD query "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(sepal_width),2) AS avg_sepal_wid, ROUND(AVG(petal_length),2) AS avg_petal_len, ROUND(AVG(petal_width),2) AS avg_petal_wid FROM iris GROUP BY species ORDER BY species"
echo ""
echo "[Query 4] Species distribution:"
$LENS_CMD query "SELECT species, count(*) AS n FROM iris GROUP BY species ORDER BY n DESC"
echo ""
echo "[Query 5] Top 5 largest petals:"
$LENS_CMD query "SELECT species, petal_length, petal_width FROM iris ORDER BY petal_length DESC LIMIT 5"
echo ""
# -- Step 10: Export results -------------------------------------------------
echo "--- Step 10: Exporting results ---"
$LENS_CMD export "SELECT * FROM iris" --format csv --output "$DEMO_DIR/export.csv"
$LENS_CMD export "SELECT species, ROUND(AVG(sepal_length),2) AS avg_sepal_len, ROUND(AVG(petal_length),2) AS avg_petal_len FROM iris GROUP BY species ORDER BY species" --format json --output "$DEMO_DIR/summary.json"
echo ""
echo " Exported files:"
ls -lh "$DEMO_DIR/export.csv" "$DEMO_DIR/summary.json"
echo ""
# -- Done --------------------------------------------------------------------
echo "============================================================"
echo " Demo complete!"
echo "============================================================"
echo ""
echo " Lake location : $LAKE_DIR"
echo " CSV export : $DEMO_DIR/export.csv"
echo " JSON export : $DEMO_DIR/summary.json"
echo ""
echo " To explore interactively:"
echo " $LENS_CMD shell"
echo ""
echo " To clean up:"
echo " rm -rf $DEMO_DIR"
echo " rm -f ~/.dataspoc-pipe/pipelines/iris-demo.yaml"
echo "============================================================"