DataSpoc Pipe
DataSpoc Pipe is a data ingestion engine that connects 400+ Singer taps to Parquet files in cloud storage. It handles the first mile of your data platform: getting data from sources into an organized, queryable data lake.
What it does
- Reads from any Singer-compatible tap (databases, APIs, SaaS tools, files)
- Converts records to Apache Parquet with automatic schema detection
- Writes to S3, GCS, Azure Blob, or local filesystem
- Maintains an auto-catalog (manifest.json) so downstream tools can discover tables
- Supports incremental extraction via Singer bookmarks
- Streams data in batches for low memory usage on large datasets
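The auto-catalog mentioned above is a manifest.json file maintained alongside the data. Its exact schema isn't documented here, so the sketch below assumes an illustrative layout (table names mapped to Parquet paths and row counts) purely to show how a downstream tool might discover tables without scanning the bucket:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical manifest layout -- field names are illustrative,
# not the documented DataSpoc Pipe schema.
manifest = {
    "tables": {
        "orders": {"path": "orders/part-0001.parquet", "rows": 1200},
        "customers": {"path": "customers/part-0001.parquet", "rows": 340},
    }
}

with TemporaryDirectory() as d:
    manifest_path = Path(d) / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))

    # A downstream tool lists tables from the manifest alone.
    discovered = sorted(json.loads(manifest_path.read_text())["tables"])
    print(discovered)  # ['customers', 'orders']
```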
How you can use it
| Interface | Use case |
|---|---|
| CLI | Interactive use, cron jobs, CI/CD pipelines |
| Python SDK | Embed in scripts, notebooks, or applications |
| MCP Server | Let AI agents (Claude, etc.) manage pipelines |
Install

```shell
pip install dataspoc-pipe
```

Quick example
```shell
# Initialize config directory
dataspoc-pipe init

# Create a pipeline with interactive wizard
dataspoc-pipe add my-pipeline

# Run it
dataspoc-pipe run my-pipeline

# Check results
dataspoc-pipe status
```

Architecture
```
[Data Source] --> [Singer Tap] --> stdout --> [Pipe Engine] --> Parquet --> [Cloud Bucket]
                                                  |
                                   manifest.json  state.json  logs/
```

Pipe runs Singer taps as subprocesses, reads their JSON output via stdout, buffers records into batches, optionally applies a Python transform, converts to PyArrow tables, and writes Parquet files to the destination bucket.
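The loop described above (read Singer JSON lines from the tap's stdout, buffer RECORD messages per stream, flush full batches to the writer) can be sketched roughly as follows. The `flush` callback and the batch size are illustrative stand-ins for the engine's actual Parquet writer and configuration, and the tap output is simulated with an in-memory stream instead of a subprocess:

```python
import io
import json

def run_pipe(tap_stdout, flush, batch_size=2):
    """Read Singer messages line by line, batching RECORDs per stream."""
    batches = {}  # stream name -> buffered records
    for line in tap_stdout:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            buf = batches.setdefault(msg["stream"], [])
            buf.append(msg["record"])
            if len(buf) >= batch_size:
                flush(msg["stream"], buf)  # e.g. convert to Arrow, write Parquet
                batches[msg["stream"]] = []
        elif msg["type"] == "STATE":
            pass  # a real engine persists the bookmark to state.json here
    for stream, buf in batches.items():  # flush any remainder at end of stream
        if buf:
            flush(stream, buf)

# Simulated tap output (normally read from a subprocess's stdout)
tap = io.StringIO("\n".join(json.dumps(m) for m in [
    {"type": "SCHEMA", "stream": "orders", "schema": {}},
    {"type": "RECORD", "stream": "orders", "record": {"id": 1}},
    {"type": "RECORD", "stream": "orders", "record": {"id": 2}},
    {"type": "RECORD", "stream": "orders", "record": {"id": 3}},
    {"type": "STATE", "value": {"bookmarks": {"orders": {"id": 3}}}},
]))

written = []
run_pipe(tap, flush=lambda stream, recs: written.append((stream, len(recs))))
print(written)  # [('orders', 2), ('orders', 1)]
```

Streaming in fixed-size batches like this is what keeps memory usage flat even on large source tables, since only one batch per stream is ever held in memory.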
Open source
DataSpoc Pipe is licensed under Apache 2.0. Free to use, modify, and distribute.