
# Multi-Cloud Storage

Pipe writes Parquet files to any supported storage provider. The destination is configured via the bucket URI in your pipeline YAML.

| Provider | URI scheme | Extra required | Backend |
| --- | --- | --- | --- |
| Local filesystem | `file://` | none | built-in |
| Amazon S3 | `s3://` | `pip install dataspoc-pipe[s3]` | s3fs |
| Google Cloud Storage | `gs://` | `pip install dataspoc-pipe[gcs]` | gcsfs |
| Azure Blob Storage | `az://` | `pip install dataspoc-pipe[azure]` | adlfs |

All cloud storage access is handled by fsspec, which provides a unified interface across providers.
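Because every provider sits behind the same fsspec interface, the same code path handles any URI scheme. A minimal sketch of what that unification looks like (the paths below are placeholders, not part of Pipe's API):

```python
import fsspec

# The same call works for any provider; fsspec selects the backend
# (built-in local, s3fs, gcsfs, or adlfs) from the URI scheme.
for uri in [
    "file:///tmp/lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet",
    "s3://my-company-datalake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet",
]:
    with fsspec.open(uri, "rb") as f:
        print(uri, f.read(4))  # Parquet files begin with the magic bytes b"PAR1"
```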

## Local filesystem

No extra install or credentials needed. Useful for development and testing.

```yaml
destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd
```

This produces a layout like:

```
/tmp/lake/
  .dataspoc/manifest.json
  raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```
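To sanity-check the output, you can read a written file straight back; a minimal sketch with pandas, using the example partition path from the layout above:

```python
import pandas as pd

# Read one partition of the orders table directly from the local lake.
df = pd.read_parquet("/tmp/lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet")
print(df.head())
```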
## Amazon S3

```bash
pip install dataspoc-pipe[s3]
```

Pipe uses the standard AWS credential chain. Configure one of the following:

Environment variables:

```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1
```

AWS CLI profile:

```bash
aws configure
```

IAM instance role (EC2/ECS/Lambda):

No configuration needed. Credentials are provided automatically by the instance metadata service.

IAM role with SSO:

```bash
aws sso login --profile my-profile
export AWS_PROFILE=my-profile
```
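Whichever method you use, you can verify that credentials resolve before running a pipeline. A minimal sketch using s3fs directly (the bucket name is a placeholder):

```python
import s3fs

# s3fs follows the same AWS credential chain described above;
# listing the bucket fails if credentials are missing or lack access.
fs = s3fs.S3FileSystem()
print(fs.ls("my-company-datalake"))
```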
Then point the pipeline's destination at the bucket:

```yaml
destination:
  bucket: s3://my-company-datalake
  path: raw
  compression: zstd
```

The IAM principal needs these S3 permissions on the bucket:

```json
{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::my-company-datalake",
    "arn:aws:s3:::my-company-datalake/*"
  ]
}
```
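If you provision roles in code, this statement can be attached as an inline policy. A sketch with boto3, assuming hypothetical role and policy names and wrapping the statement in a standard policy document:

```python
import json
import boto3

statement = {
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
    "Resource": [
        "arn:aws:s3:::my-company-datalake",
        "arn:aws:s3:::my-company-datalake/*",
    ],
}

# Attach the statement as an inline policy on the role Pipe runs as.
boto3.client("iam").put_role_policy(
    RoleName="pipe-writer",       # hypothetical role name
    PolicyName="pipe-s3-access",  # hypothetical policy name
    PolicyDocument=json.dumps({"Version": "2012-10-17", "Statement": [statement]}),
)
```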
## Google Cloud Storage

```bash
pip install dataspoc-pipe[gcs]
```

Service account key:

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

Application default credentials (local development):

```bash
gcloud auth application-default login
```

Attached service account (GCE/Cloud Run/GKE):

No configuration needed. Credentials come from the metadata server.
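As with S3, you can confirm that credentials resolve by listing the bucket with gcsfs directly (the bucket name is a placeholder):

```python
import gcsfs

# gcsfs resolves credentials as described above: a service account key,
# application default credentials, or the metadata server.
fs = gcsfs.GCSFileSystem()
print(fs.ls("my-company-datalake"))
```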

Then point the pipeline's destination at the bucket:

```yaml
destination:
  bucket: gs://my-company-datalake
  path: raw
  compression: zstd
```

Grant the service account the Storage Object Admin role (`roles/storage.objectAdmin`) on the bucket, or a custom role with:

- `storage.objects.create`
- `storage.objects.get`
- `storage.objects.delete`
- `storage.objects.list`
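If you manage IAM in code, the binding can also be added with the google-cloud-storage client; a sketch assuming hypothetical project and service-account names:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-company-datalake")

# Fetch the current IAM policy, append the binding, and write it back.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectAdmin",
    "members": {"serviceAccount:pipe@my-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)
```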
## Azure Blob Storage

```bash
pip install dataspoc-pipe[azure]
```

Connection string:

```bash
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
```

Account name and key:

```bash
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
export AZURE_STORAGE_ACCOUNT_KEY=base64encodedkey==
```

Managed identity (Azure VM/App Service/Functions):

No configuration needed. adlfs uses the managed identity automatically.

Azure CLI:

```bash
az login
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
```
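Here too, a quick access check is to list the container with adlfs directly (account and container names are placeholders):

```python
import adlfs

# adlfs authenticates via the mechanisms described above: an account
# key or connection string, a managed identity, or Azure CLI credentials.
fs = adlfs.AzureBlobFileSystem(account_name="mystorageaccount")
print(fs.ls("my-container"))
```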
Then point the pipeline's destination at the container:

```yaml
destination:
  bucket: az://my-container
  path: raw
  compression: zstd
```

## Multiple destinations

You can create multiple pipelines for the same source, each writing to a different bucket:

```yaml
# pipelines/orders-s3.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://primary-lake
  path: raw
```

```yaml
# pipelines/orders-gcs.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: gs://backup-lake
  path: raw
```

## Directory layout

Regardless of the provider, Pipe always writes the same directory layout:

```
<bucket>/
  .dataspoc/
    manifest.json                      # Table catalog
    state/<pipeline>/state.json        # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json   # Execution logs
  <path>/
    <source>/<table>/
      dt=<partition_value>/
        <table>_0000.parquet           # Data files
```

This consistent structure is the contract between Pipe and downstream tools like DataSpoc Lens.
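Because the layout is provider-independent, a downstream consumer only needs the bucket URI. A hedged sketch of reading the table catalog with fsspec (only the manifest's path is guaranteed by the layout above; its schema is whatever Pipe writes):

```python
import json
import fsspec

bucket = "s3://my-company-datalake"  # works equally with file://, gs://, or az://

# The manifest location is fixed by the layout contract above.
with fsspec.open(f"{bucket}/.dataspoc/manifest.json", "r") as f:
    manifest = json.load(f)
print(manifest)
```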