DataSpoc ML

DataSpoc ML lets you train, predict, and explain machine learning models directly on Parquet data stored in cloud buckets. No data movement, no separate ML infrastructure — your models run where your data already lives.

  • Automated feature engineering — detects column types, generates meaningful features, handles missing values.
  • Model selection — evaluates multiple algorithms and picks the best one for your data.
  • Hyperparameter tuning — optimizes model parameters automatically.
  • Drift monitoring — tracks model performance over time and alerts when predictions degrade.

DataSpoc ML is a commercial product integrated with Lens. It is accessed entirely through dataspoc-lens ml commands — there is no separate CLI to install.

To get started, contact ml@dataspoc.com.

ML reads from the same bucket that Pipe writes to and Lens queries. The data flow is:

Pipe (ingest) → bucket/raw/ and bucket/curated/
ML (train) → bucket/ml/models/<model>/
ML (predict) → bucket/ml/predictions/<model>/
Lens (query) → predictions appear as SQL tables
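
The flow above can be read as a mapping from bucket prefix to the stage that writes there. The following is a minimal Python sketch of that mapping, not part of DataSpoc ML itself. It assumes keys are given relative to the bucket root; the function name `writer_stage` and the model name `churn` are hypothetical.

```python
from typing import Optional

# Prefix -> stage that writes under it, copied from the data flow above.
# Assumption: keys are relative to the bucket root (no scheme or bucket name).
STAGE_BY_PREFIX = {
    "raw/": "Pipe (ingest)",
    "curated/": "Pipe (ingest)",
    "ml/models/": "ML (train)",
    "ml/predictions/": "ML (predict)",
}


def writer_stage(key: str) -> Optional[str]:
    """Return the stage that writes `key`, or None if the prefix is not listed."""
    for prefix, stage in STAGE_BY_PREFIX.items():
        if key.startswith(prefix):
            return stage
    return None


if __name__ == "__main__":
    # "churn" is a made-up model name used only for illustration.
    for key in ("curated/orders.parquet", "ml/models/churn/model.pkl"):
        print(key, "->", writer_stage(key))
```

A lookup like this can help when auditing a bucket listing, since every object under ml/ is attributable to either training or prediction.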

ML writes to two directories in the bucket:

Path                               Contents
ml/models/<model>/model.pkl        Serialized trained model
ml/models/<model>/features.json    Feature definitions and transformations
ml/models/<model>/metrics.json     Training metrics (accuracy, AUC, RMSE, etc.)
ml/predictions/<model>/*.parquet   Prediction output as Parquet files
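
To make the layout concrete, here is a small Python sketch that expands the paths listed above for one model. It is illustrative only: the helper name `model_artifacts`, its dictionary keys, and the model name `churn` are hypothetical, and the paths are relative to the bucket root.

```python
from pathlib import PurePosixPath


def model_artifacts(model: str) -> dict:
    """Build the bucket-relative paths listed above for a given model name."""
    models_dir = PurePosixPath("ml/models") / model
    predictions_dir = PurePosixPath("ml/predictions") / model
    return {
        "model": str(models_dir / "model.pkl"),
        "features": str(models_dir / "features.json"),
        "metrics": str(models_dir / "metrics.json"),
        # Glob pattern rather than a single file: predictions may span many files.
        "predictions_glob": str(predictions_dir / "*.parquet"),
    }


if __name__ == "__main__":
    # "churn" is a made-up model name used only for illustration.
    for name, path in model_artifacts("churn").items():
        print(f"{name}: {path}")
```

PurePosixPath is used so the separators stay as forward slashes on any operating system, which matches how object-store keys are written.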

Predictions saved as Parquet are automatically discoverable by Lens and appear as queryable SQL tables.