# Multi-Cloud Storage
Pipe writes Parquet files to any supported storage provider. The destination is configured via the bucket URI in your pipeline YAML.
## Supported providers

| Provider | URI scheme | Extra required | Backend |
|---|---|---|---|
| Local filesystem | `file://` | none | built-in |
| Amazon S3 | `s3://` | `pip install dataspoc-pipe[s3]` | s3fs |
| Google Cloud Storage | `gs://` | `pip install dataspoc-pipe[gcs]` | gcsfs |
| Azure Blob Storage | `az://` | `pip install dataspoc-pipe[azure]` | adlfs |
All cloud storage access is handled by fsspec, which provides a unified interface across providers.
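For example, the same fsspec call resolves any of the schemes above. A minimal sketch, using this page's placeholder bucket names (each remote scheme needs its extra installed):

```python
import fsspec

for uri in ("file:///tmp/lake", "s3://my-company-datalake", "gs://my-company-datalake"):
    # url_to_fs picks the backend from the URI scheme and returns
    # a filesystem object plus the path within it
    fs, path = fsspec.core.url_to_fs(uri)
    print(type(fs).__name__, fs.ls(path))
```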
## Local filesystem

No extra install or credentials needed. Useful for development and testing.
### Pipeline config

```yaml
destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd
```

### Result
```
/tmp/lake/
  .dataspoc/manifest.json
  raw/csv/orders/dt=2025-01-20/orders_0000.parquet
```
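To inspect the output, the Parquet files can be opened with any reader. A quick sketch, assuming pyarrow is available in your environment:

```python
import pyarrow.parquet as pq

# Path matches the example layout above
table = pq.read_table("/tmp/lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet")
print(table.schema)
print(f"{table.num_rows} rows")
```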
## Amazon S3

### Install
```bash
pip install dataspoc-pipe[s3]
```

### Credentials
Pipe uses the standard AWS credential chain. Configure one of the following:
Environment variables:
```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1
```

AWS CLI profile:
```bash
aws configure
```

IAM instance role (EC2/ECS/Lambda):
No configuration needed. Credentials are provided automatically by the instance metadata service.
IAM role with SSO:
```bash
aws sso login --profile my-profile
export AWS_PROFILE=my-profile
```
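Whichever option you choose, you can confirm the credential chain resolves before running a pipeline. A minimal sketch using s3fs directly, with this page's example bucket name:

```python
import s3fs

# Resolves env vars, profiles, or instance roles, same as Pipe does
fs = s3fs.S3FileSystem()
print(fs.ls("my-company-datalake"))  # fails loudly if credentials are missing
```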
### Pipeline config

```yaml
destination:
  bucket: s3://my-company-datalake
  path: raw
  compression: zstd
```

### IAM permissions required
Section titled “IAM permissions required”The IAM principal needs these S3 permissions on the bucket:
{ "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::my-company-datalake", "arn:aws:s3:::my-company-datalake/*" ]}Google Cloud Storage
### Install
```bash
pip install dataspoc-pipe[gcs]
```

### Credentials
Service account key:
```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```

Application default credentials (local development):
```bash
gcloud auth application-default login
```

Attached service account (GCE/Cloud Run/GKE):
No configuration needed. Credentials come from the metadata server.
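As with S3, you can confirm credentials resolve with the backend directly. A minimal sketch using gcsfs, with this page's example bucket name:

```python
import gcsfs

# Resolves key files, application default credentials, or the metadata server
fs = gcsfs.GCSFileSystem()
print(fs.ls("my-company-datalake"))
```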
### Pipeline config

```yaml
destination:
  bucket: gs://my-company-datalake
  path: raw
  compression: zstd
```

### IAM roles required
Grant the service account the Storage Object Admin role (`roles/storage.objectAdmin`) on the bucket, or a custom role with:
- `storage.objects.create`
- `storage.objects.get`
- `storage.objects.delete`
- `storage.objects.list`
## Azure Blob Storage

### Install
```bash
pip install dataspoc-pipe[azure]
```

### Credentials
Connection string:
```bash
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
```

Account name and key:
```bash
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
export AZURE_STORAGE_ACCOUNT_KEY=base64encodedkey==
```

Managed identity (Azure VM/App Service/Functions):
No configuration needed. adlfs uses the managed identity automatically.
Azure CLI:
```bash
az login
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
```
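And again, a minimal credential sanity check with the backend directly. This sketch assumes adlfs picks up whichever credential option you configured above; account and container names are this page's examples:

```python
import adlfs

# account_name mirrors the example above
fs = adlfs.AzureBlobFileSystem(account_name="mystorageaccount")
print(fs.ls("my-container"))
```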
### Pipeline config

```yaml
destination:
  bucket: az://my-container
  path: raw
  compression: zstd
```

## Multiple destinations
You can create multiple pipelines for the same source, each writing to a different bucket:
```yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://primary-lake
  path: raw
```
```yaml
# pipelines/orders-gcs.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: gs://backup-lake
  path: raw
```

## Bucket structure
Regardless of the provider, Pipe always writes the same directory layout:
```
<bucket>/
  .dataspoc/
    manifest.json                      # Table catalog
    state/<pipeline>/state.json        # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json   # Execution logs
  <path>/
    <source>/<table>/
      dt=<partition_value>/
        <table>_0000.parquet           # Data files
```

This consistent structure is the contract between Pipe and downstream tools like DataSpoc Lens.
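Because the layout is identical everywhere, downstream code can stay provider-agnostic. A minimal sketch that reads the manifest through fsspec, using this page's S3 example bucket (the same line works unchanged for `file://`, `gs://`, or `az://`):

```python
import json
import fsspec

# The manifest path is fixed by the layout above
with fsspec.open("s3://my-company-datalake/.dataspoc/manifest.json", "r") as f:
    manifest = json.load(f)
print(manifest)
```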