Multi-Cloud Storage

Pipe writes Parquet files to any supported storage provider. The destination is configured via the bucket URI in your pipeline YAML.

Provider              URI scheme  Extra required                     Backend
Local filesystem      file://     none                               built-in
Amazon S3             s3://       pip install dataspoc-pipe[s3]      s3fs
Google Cloud Storage  gs://       pip install dataspoc-pipe[gcs]     gcsfs
Azure Blob Storage    az://       pip install dataspoc-pipe[azure]   adlfs

All cloud storage access is handled by fsspec, which provides a unified interface across providers.
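
As an illustration (this is not part of Pipe's API), the scheme-to-backend mapping can be seen directly in fsspec; the cloud schemes only resolve once the matching extra is installed:

import fsspec

# fsspec dispatches on the URI scheme: file:// resolves to the local
# filesystem, s3:// to s3fs, gs:// to gcsfs, and az:// to adlfs.
fs, path = fsspec.core.url_to_fs("file:///tmp/lake")
print(type(fs).__name__, path)   # LocalFileSystem /tmp/lake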

Local filesystem

No extra install or credentials are needed. Useful for development and testing.

destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd

This produces:

/tmp/lake/
  .dataspoc/manifest.json
  raw/csv/orders/dt=2025-01-20/orders_0000.parquet
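
As a quick sanity check (not something Pipe requires), the output can be read back with pandas; the path below is just the example layout above:

import pandas as pd

# Read one partition file back; requires pyarrow (or fastparquet) installed.
df = pd.read_parquet("/tmp/lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet")
print(df.head())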

Amazon S3

Install the S3 extra:

pip install dataspoc-pipe[s3]

Pipe uses the standard AWS credential chain. Configure one of the following:

Environment variables:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1

AWS CLI profile:

aws configure

IAM instance role (EC2/ECS/Lambda):

No configuration needed. Credentials are provided automatically by the instance metadata service.

IAM role with SSO:

aws sso login --profile my-profile
export AWS_PROFILE=my-profile

Then point the pipeline at the bucket:

destination:
  bucket: s3://my-company-datalake
  path: raw
  compression: zstd

The IAM principal needs these S3 permissions on the bucket:

{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::my-company-datalake",
    "arn:aws:s3:::my-company-datalake/*"
  ]
}
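
Before running a pipeline, it can help to confirm that the credential chain resolves. A minimal sketch using s3fs, the backend Pipe uses for s3:// URIs (the bucket name is the example from above):

import s3fs

# Picks up credentials the same way as described above: environment variables,
# a CLI profile, an instance role, or an SSO session.
fs = s3fs.S3FileSystem()
print(fs.ls("my-company-datalake"))   # fails if credentials or permissions are missing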

Google Cloud Storage

Install the GCS extra:

pip install dataspoc-pipe[gcs]

Service account key:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Application default credentials (local development):

gcloud auth application-default login

Attached service account (GCE/Cloud Run/GKE):

No configuration needed. Credentials come from the metadata server.

Pipeline configuration:

destination:
  bucket: gs://my-company-datalake
  path: raw
  compression: zstd

Grant the service account the Storage Object Admin role (roles/storage.objectAdmin) on the bucket, or a custom role with:

  • storage.objects.create
  • storage.objects.get
  • storage.objects.delete
  • storage.objects.list
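
As with S3, a minimal sketch to confirm access before running a pipeline, using gcsfs, the backend behind gs:// URIs (bucket name as in the example above):

import gcsfs

# Resolves credentials via GOOGLE_APPLICATION_CREDENTIALS, application default
# credentials, or the attached service account.
fs = gcsfs.GCSFileSystem()
print(fs.ls("my-company-datalake"))   # requires storage.objects.list on the bucket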

Azure Blob Storage

Install the Azure extra:

pip install dataspoc-pipe[azure]

Connection string:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

Account name and key:

export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
export AZURE_STORAGE_ACCOUNT_KEY=base64encodedkey==

Managed identity (Azure VM/App Service/Functions):

No configuration needed. adlfs uses the managed identity automatically.

Azure CLI:

az login
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount

With credentials in place, configure the destination:

destination:
  bucket: az://my-container
  path: raw
  compression: zstd
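
And the Azure equivalent, a sketch using adlfs, the backend behind az:// URIs; the account name matches the environment variable examples above:

import adlfs

# Credential discovery follows the options described above (connection string,
# account key, managed identity, or Azure CLI); account_name can also come
# from AZURE_STORAGE_ACCOUNT_NAME.
fs = adlfs.AzureBlobFileSystem(account_name="mystorageaccount")
print(fs.ls("my-container"))   # fails if the identity cannot list blobs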

You can create multiple pipelines for the same source, each writing to a different bucket:

# pipelines/orders-s3.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://primary-lake
  path: raw

# pipelines/orders-gcs.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: gs://backup-lake
  path: raw

Regardless of the provider, Pipe always writes the same directory layout:

<bucket>/
  .dataspoc/
    manifest.json                       # Table catalog
    state/<pipeline>/state.json         # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json    # Execution logs
  <path>/
    <source>/<table>/
      dt=<partition_value>/
        <table>_0000.parquet            # Data files

This consistent structure is the contract between Pipe and downstream tools like DataSpoc Lens.
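
Because the layout is fixed, a downstream reader can discover data files the same way on every provider. A minimal sketch, using the example bucket and path from above:

import fsspec

# Resolve the bucket URI to a filesystem, then glob the fixed layout:
# <path>/<source>/<table>/dt=<partition_value>/<table>_0000.parquet
fs, root = fsspec.core.url_to_fs("s3://my-company-datalake/raw")
for data_file in fs.glob(f"{root}/*/*/dt=*/*.parquet"):
    print(data_file)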