DataSpoc Pipe
DataSpoc Pipe is a data ingestion engine that connects 400+ Singer taps to Parquet files in cloud storage. It handles the first mile of your data platform: getting data from sources into an organized, queryable data lake.
What it does
- Reads from any Singer-compatible tap (databases, APIs, SaaS tools, files)
- Converts records to Apache Parquet with automatic schema detection
- Writes to S3, GCS, Azure Blob, or local filesystem
- Maintains an auto-catalog (manifest.json) so downstream tools can discover tables
- Supports incremental extraction via Singer bookmarks
- Streams data in batches for low memory usage on large datasets
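The auto-catalog mentioned above is a manifest.json file maintained alongside the data. Its exact schema isn't documented here, so the sketch below assumes an illustrative layout (table names mapped to Parquet paths and row counts) purely to show how a downstream tool might discover tables without scanning the bucket:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical manifest layout -- field names are illustrative,
# not the documented DataSpoc Pipe schema.
manifest = {
    "tables": {
        "orders": {"path": "orders/part-0001.parquet", "rows": 1200},
        "customers": {"path": "customers/part-0001.parquet", "rows": 340},
    }
}

with TemporaryDirectory() as d:
    manifest_path = Path(d) / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))

    # A downstream tool lists tables from the manifest alone.
    discovered = sorted(json.loads(manifest_path.read_text())["tables"])
    print(discovered)  # ['customers', 'orders']
```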
How you can use it
| Interface | Use case |
|---|---|
| CLI | Interactive use, cron jobs, CI/CD pipelines |
| Python SDK | Embed in scripts, notebooks, or applications |
| MCP Server | Let AI agents (Claude, etc.) manage pipelines |
Install

```shell
pip install dataspoc-pipe
```

Quick example
```shell
# Initialize config directory
dataspoc-pipe init

# Create a pipeline with interactive wizard
dataspoc-pipe add my-pipeline

# Run it
dataspoc-pipe run my-pipeline

# Check results
dataspoc-pipe status
```

Architecture
```
[Data Source] --> [Singer Tap] --> stdout --> [Pipe Engine] --> Parquet --> [Cloud Bucket]
                                                  |
                                   manifest.json  state.json  logs/
```

Pipe runs Singer taps as subprocesses, reads their JSON output via stdout, buffers records into batches, optionally applies a Python transform, converts to PyArrow tables, and writes Parquet files to the destination bucket.
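The loop described above (read Singer JSON lines from the tap's stdout, buffer RECORD messages per stream, flush full batches to the writer) can be sketched roughly as follows. The `flush` callback and the batch size are illustrative stand-ins for the engine's actual Parquet writer and configuration, and the tap output is simulated with an in-memory stream instead of a subprocess:

```python
import io
import json

def run_pipe(tap_stdout, flush, batch_size=2):
    """Read Singer messages line by line, batching RECORDs per stream."""
    batches = {}  # stream name -> buffered records
    for line in tap_stdout:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            buf = batches.setdefault(msg["stream"], [])
            buf.append(msg["record"])
            if len(buf) >= batch_size:
                flush(msg["stream"], buf)  # e.g. convert to Arrow, write Parquet
                batches[msg["stream"]] = []
        elif msg["type"] == "STATE":
            pass  # a real engine persists the bookmark to state.json here
    for stream, buf in batches.items():  # flush any remainder at end of stream
        if buf:
            flush(stream, buf)

# Simulated tap output (normally read from a subprocess's stdout)
tap = io.StringIO("\n".join(json.dumps(m) for m in [
    {"type": "SCHEMA", "stream": "orders", "schema": {}},
    {"type": "RECORD", "stream": "orders", "record": {"id": 1}},
    {"type": "RECORD", "stream": "orders", "record": {"id": 2}},
    {"type": "RECORD", "stream": "orders", "record": {"id": 3}},
    {"type": "STATE", "value": {"bookmarks": {"orders": {"id": 3}}}},
]))

written = []
run_pipe(tap, flush=lambda stream, recs: written.append((stream, len(recs))))
print(written)  # [('orders', 2), ('orders', 1)]
```

Streaming in fixed-size batches like this is what keeps memory usage flat even on large source tables, since only one batch per stream is ever held in memory.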
Open source
DataSpoc Pipe is licensed under Apache 2.0. Free to use, modify, and distribute.