Multi-Cloud Storage

Pipe writes Parquet files to any supported storage provider. The destination is configured via the bucket URI in your pipeline YAML.

Provider              URI scheme  Extra required                     Backend
Local filesystem      file://     none                               built-in
Amazon S3             s3://       pip install dataspoc-pipe[s3]      s3fs
Google Cloud Storage  gs://       pip install dataspoc-pipe[gcs]     gcsfs
Azure Blob Storage    az://       pip install dataspoc-pipe[azure]   adlfs

All cloud storage access is handled by fsspec, which provides a unified interface across providers.
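
As an illustration (this is not part of Pipe's API), the scheme-to-backend mapping can be seen directly in fsspec; the cloud schemes only resolve once the matching extra is installed:

import fsspec

# fsspec dispatches on the URI scheme: file:// resolves to the local
# filesystem, s3:// to s3fs, gs:// to gcsfs, and az:// to adlfs.
fs, path = fsspec.core.url_to_fs("file:///tmp/lake")
print(type(fs).__name__, path)   # LocalFileSystem /tmp/lake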

Local filesystem

No extra install or credentials are needed. Useful for development and testing.

destination:
  bucket: file:///tmp/lake
  path: raw
  compression: zstd

This produces:

/tmp/lake/
  .dataspoc/manifest.json
  raw/csv/orders/dt=2025-01-20/orders_0000.parquet
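
As a quick sanity check (not something Pipe requires), the output can be read back with pandas; the path below is just the example layout above:

import pandas as pd

# Read one partition file back; requires pyarrow (or fastparquet) installed.
df = pd.read_parquet("/tmp/lake/raw/csv/orders/dt=2025-01-20/orders_0000.parquet")
print(df.head())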

Amazon S3

Install the S3 extra:

pip install dataspoc-pipe[s3]

Pipe uses the standard AWS credential chain. Configure one of the following:

Environment variables:

export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=us-east-1

AWS CLI profile:

aws configure

IAM instance role (EC2/ECS/Lambda):

No configuration needed. Credentials are provided automatically by the instance metadata service.

IAM role with SSO:

aws sso login --profile my-profile
export AWS_PROFILE=my-profile

Then point the pipeline at the bucket:

destination:
  bucket: s3://my-company-datalake
  path: raw
  compression: zstd

The IAM principal needs these S3 permissions on the bucket:

{
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::my-company-datalake",
    "arn:aws:s3:::my-company-datalake/*"
  ]
}
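
Before running a pipeline, it can help to confirm that the credential chain resolves. A minimal sketch using s3fs, the backend Pipe uses for s3:// URIs (the bucket name is the example from above):

import s3fs

# Picks up credentials the same way as described above: environment variables,
# a CLI profile, an instance role, or an SSO session.
fs = s3fs.S3FileSystem()
print(fs.ls("my-company-datalake"))   # fails if credentials or permissions are missing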

Google Cloud Storage

Install the GCS extra:

pip install dataspoc-pipe[gcs]

Service account key:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

Application default credentials (local development):

gcloud auth application-default login

Attached service account (GCE/Cloud Run/GKE):

No configuration needed. Credentials come from the metadata server.

Pipeline configuration:

destination:
  bucket: gs://my-company-datalake
  path: raw
  compression: zstd

Grant the service account the Storage Object Admin role (roles/storage.objectAdmin) on the bucket, or a custom role with:

  • storage.objects.create
  • storage.objects.get
  • storage.objects.delete
  • storage.objects.list
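
As with S3, a minimal sketch to confirm access before running a pipeline, using gcsfs, the backend behind gs:// URIs (bucket name as in the example above):

import gcsfs

# Resolves credentials via GOOGLE_APPLICATION_CREDENTIALS, application default
# credentials, or the attached service account.
fs = gcsfs.GCSFileSystem()
print(fs.ls("my-company-datalake"))   # requires storage.objects.list on the bucket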

Azure Blob Storage

Install the Azure extra:

pip install dataspoc-pipe[azure]

Connection string:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

Account name and key:

export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount
export AZURE_STORAGE_ACCOUNT_KEY=base64encodedkey==

Managed identity (Azure VM/App Service/Functions):

No configuration needed. adlfs uses the managed identity automatically.

Azure CLI:

az login
export AZURE_STORAGE_ACCOUNT_NAME=mystorageaccount

With credentials in place, configure the destination:

destination:
  bucket: az://my-container
  path: raw
  compression: zstd
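
And the Azure equivalent, a sketch using adlfs, the backend behind az:// URIs; the account name matches the environment variable examples above:

import adlfs

# Credential discovery follows the options described above (connection string,
# account key, managed identity, or Azure CLI); account_name can also come
# from AZURE_STORAGE_ACCOUNT_NAME.
fs = adlfs.AzureBlobFileSystem(account_name="mystorageaccount")
print(fs.ls("my-container"))   # fails if the identity cannot list blobs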

You can create multiple pipelines for the same source, each writing to a different bucket:

# pipelines/orders-s3.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: s3://primary-lake
  path: raw

# pipelines/orders-gcs.yaml
source:
  tap: tap-postgres
  config: /home/you/.dataspoc-pipe/sources/orders.json
destination:
  bucket: gs://backup-lake
  path: raw

Regardless of the provider, Pipe always writes the same directory layout:

<bucket>/
  .dataspoc/
    manifest.json                       # Table catalog
    state/<pipeline>/state.json         # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json    # Execution logs
  <path>/
    <source>/<table>/
      dt=<partition_value>/
        <table>_0000.parquet            # Data files

This consistent structure is the contract between Pipe and downstream tools like DataSpoc Lens.
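
Because the layout is fixed, a downstream reader can discover data files the same way on every provider. A minimal sketch, using the example bucket and path from above:

import fsspec

# Resolve the bucket URI to a filesystem, then glob the fixed layout:
# <path>/<source>/<table>/dt=<partition_value>/<table>_0000.parquet
fs, root = fsspec.core.url_to_fs("s3://my-company-datalake/raw")
for data_file in fs.glob(f"{root}/*/*/dt=*/*.parquet"):
    print(data_file)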