DataSpoc Pipe

DataSpoc Pipe is a data ingestion engine that connects 400+ Singer taps to Parquet files in cloud storage. It handles the first mile of your data platform: getting data from sources into an organized, queryable data lake.

  • Reads from any Singer-compatible tap (databases, APIs, SaaS tools, files)
  • Converts records to Apache Parquet with automatic schema detection
  • Writes to S3, GCS, Azure Blob, or local filesystem
  • Maintains an auto-catalog (manifest.json) so downstream tools can discover tables
  • Supports incremental extraction via Singer bookmarks
  • Streams data in batches for low memory usage on large datasets
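The features above map onto the Singer message stream a tap writes to stdout: `SCHEMA` messages describe a table, `RECORD` messages carry rows, and `STATE` messages carry the bookmark used for incremental extraction. A minimal sketch of consuming that stream (the sample messages and the `users` stream are illustrative, not from a real tap):

```python
import json

# Sample Singer messages, one JSON object per line, as a tap would emit them.
lines = [
    '{"type": "SCHEMA", "stream": "users", '
    '"schema": {"properties": {"id": {"type": "integer"}}}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 1}}',
    '{"type": "STATE", "value": {"bookmarks": {"users": {"id": 1}}}}',
]

records, state = [], None
for line in lines:
    msg = json.loads(line)
    if msg["type"] == "RECORD":
        records.append(msg["record"])       # rows to be batched into Parquet
    elif msg["type"] == "STATE":
        state = msg["value"]                # bookmark persisted for the next run

print(records)  # [{'id': 1}]
```

Persisting the last `STATE` value (Pipe's `state.json`) is what lets the next run resume from the bookmark instead of re-extracting everything.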
| Interface  | Use case                                       |
| ---------- | ---------------------------------------------- |
| CLI        | Interactive use, cron jobs, CI/CD pipelines    |
| Python SDK | Embed in scripts, notebooks, or applications   |
| MCP Server | Let AI agents (Claude, etc.) manage pipelines  |
```sh
pip install dataspoc-pipe
```
```sh
# Initialize the config directory
dataspoc-pipe init

# Create a pipeline with the interactive wizard
dataspoc-pipe add my-pipeline

# Run it
dataspoc-pipe run my-pipeline

# Check results
dataspoc-pipe status
```
```
[Data Source] --> [Singer Tap] --> stdout --> [Pipe Engine] --> Parquet --> [Cloud Bucket]
                                                   |
                                                   +--> manifest.json
                                                   +--> state.json
                                                   +--> logs/
```

Pipe runs Singer taps as subprocesses, reads their JSON output via stdout, buffers records into batches, optionally applies a Python transform, converts to PyArrow tables, and writes Parquet files to the destination bucket.

DataSpoc Pipe is licensed under Apache 2.0. Free to use, modify, and distribute.