Multi-Cloud Storage
Pipe writes Parquet files to any supported storage provider. The destination is configured via the bucket URI in your pipeline YAML.
Supported providers
| Provider | URI scheme | Extra required | Backend |
|---|---|---|---|
| Local filesystem | file:// | none | built-in |
| Amazon S3 | s3:// | pip install dataspoc-pipe[s3] | s3fs |
| Google Cloud Storage | gs:// | pip install dataspoc-pipe[gcs] | gcsfs |
| Azure Blob Storage | az:// | pip install dataspoc-pipe[azure] | adlfs |
All cloud storage access is handled by fsspec, which provides a unified interface across providers.
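For example, the same fsspec call resolves any of these URIs to a filesystem object plus a root path. The following is a minimal sketch (not Pipe's internal code), assuming the matching extra is installed for cloud schemes:

```python
# Sketch only: fsspec maps a bucket URI to the right backend
# (local, s3fs, gcsfs, or adlfs), so the write path is provider-agnostic.
import fsspec

fs, root = fsspec.core.url_to_fs("file:///tmp/lake")
print(type(fs).__name__, root)   # e.g. LocalFileSystem /tmp/lake

fs, root = fsspec.core.url_to_fs("s3://my-company-datalake")
print(type(fs).__name__, root)   # e.g. S3FileSystem my-company-datalake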
Local filesystem
No extra install or credentials needed. Useful for development and testing.
Pipeline config
```yaml
destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd
```
Result
```
/tmp/lake/
  .dataspoc/manifest.json
  raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```
Amazon S3
Install
```bash
pip install dataspoc-pipe[s3]
```
Credentials
Pipe uses the standard AWS credential chain. Configure one of the following:
Environment variables:
```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1
```
AWS CLI profile:
```bash
aws configure
```
IAM instance role (EC2/ECS/Lambda):
No configuration needed. Credentials are provided automatically by the instance metadata service.
IAM role with SSO:
```bash
aws sso login --profile my-profile
export AWS_PROFILE=my-profile
```
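Whichever option you use, a quick way to confirm that the credential chain resolves before running a pipeline is to list the bucket with s3fs, the backend Pipe uses for s3:// URIs. This is a sketch, not a Pipe command:

```python
# Pre-flight check (sketch): s3fs picks up the same AWS credential chain
# described above and raises here if it cannot list the bucket.
import s3fs

fs = s3fs.S3FileSystem()
print(fs.ls("my-company-datalake"))
```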
Pipeline config
```yaml
destination:
  bucket: s3://my-company-datalake
  path: raw
  compression: zstd
```
IAM permissions required
The IAM principal needs these S3 permissions on the bucket:
{ "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::my-company-datalake", "arn:aws:s3:::my-company-datalake/*" ]}Google Cloud Storage
Install
```bash
pip install dataspoc-pipe[gcs]
```
Credentials
Service account key:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
Application default credentials (local development):
```bash
gcloud auth application-default login
```
Attached service account (GCE/Cloud Run/GKE):
No configuration needed. Credentials come from the metadata server.
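As with S3, you can sanity-check the resolved credentials against the gcsfs backend before running a pipeline; again a sketch, not a Pipe command:

```python
# Pre-flight check (sketch): gcsfs uses the service account key or
# application default credentials configured above.
import gcsfs

fs = gcsfs.GCSFileSystem()
print(fs.ls("my-company-datalake"))
```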
Pipeline config
```yaml
destination:
  bucket: gs://my-company-datalake
  path: raw
  compression: zstd
```
IAM roles required
Grant the service account the Storage Object Admin role (roles/storage.objectAdmin) on the bucket, or a custom role with:
- storage.objects.create
- storage.objects.get
- storage.objects.delete
- storage.objects.list
Azure Blob Storage
Install
```bash
pip install dataspoc-pipe[azure]
```
Credentials
Connection string:
```bash
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
```
Account name and key:
```bash
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
export AZURE_STORAGE_ACCOUNT_KEY=base64encodedkey==
```
Managed identity (Azure VM/App Service/Functions):
No configuration needed. adlfs uses the managed identity automatically.
Azure CLI:
```bash
az login
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
```
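To confirm the credentials reach the container, you can run the equivalent check against the adlfs backend. This sketch (not a Pipe command) uses the connection-string option; with managed identity or account name and key, pass those instead:

```python
# Pre-flight check (sketch): adlfs with an explicit connection string.
import os
import adlfs

fs = adlfs.AzureBlobFileSystem(
    connection_string=os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
print(fs.ls("my-container"))
```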
Pipeline config
```yaml
destination:
  bucket: az://my-container
  path: raw
  compression: zstd
```
Multiple destinations
You can create multiple pipelines that read the same source and write to different buckets:
```yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://primary-lake
  path: raw
```
```yaml
# pipelines/orders-gcs.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: gs://backup-lake
  path: raw
```
Bucket structure
Regardless of the provider, Pipe always writes the same directory layout:
```
<bucket>/
  .dataspoc/
    manifest.json                      # Table catalog
    state/<pipeline>/state.json        # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json   # Execution logs
  <path>/
    <source>/<table>/
      dt=<partition_value>/
        <table>_0000.parquet           # Data files
```
This consistent structure is the contract between Pipe and downstream tools like DataSpoc Lens.
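Because the partition directories use Hive-style dt=<value> naming, a downstream consumer can read a table straight from this layout. The following is a minimal sketch (not how DataSpoc Lens itself works); it assumes pyarrow is installed, AWS credentials are configured as above, and uses the example orders table from the local-filesystem section:

```python
# Sketch: read one partitioned table directly from the bucket layout above.
import pyarrow.dataset as ds

orders = ds.dataset(
    "s3://my-company-datalake/raw/csv/orders/",
    format="parquet",
    partitioning="hive",   # picks up the dt=<partition_value> directories
)
print(orders.to_table().num_rows)
```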