DataSpoc ML

DataSpoc ML lets you train, predict, and explain machine learning models directly on Parquet data stored in cloud buckets. No data movement, no separate ML infrastructure — your models run where your data already lives.

What it does

Automated feature engineering — detects column types, generates meaningful features, handles missing values.
Model selection — evaluates multiple algorithms and picks the best one for your data.
Hyperparameter tuning — optimizes model parameters automatically.
Drift monitoring — tracks model performance over time and alerts when predictions degrade.
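
How DataSpoc ML detects drift is not described here. As a rough illustration of the idea, the sketch below compares recent scores of a metric against a training-time baseline and flags a relative drop beyond a tolerance. The function name `has_degraded`, the 5% default tolerance, and the assumption that higher scores are better are all hypothetical, not part of the product.

```python
from statistics import mean
from typing import Sequence


def has_degraded(baseline: float, recent: Sequence[float], tolerance: float = 0.05) -> bool:
    """Return True when the mean of `recent` falls more than `tolerance`
    (as a fraction of `baseline`) below the baseline score.

    Assumes a higher-is-better metric such as accuracy or AUC.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if not recent:
        # Nothing observed yet, so there is nothing to flag.
        return False
    relative_drop = (baseline - mean(recent)) / baseline
    return relative_drop > tolerance
```

For an error metric such as RMSE, where lower is better, the comparison would need to be inverted.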

Commercial product

DataSpoc ML is a commercial product integrated with Lens. You access it entirely through `dataspoc-lens ml` commands, so there is no separate CLI to install.

To get started, contact ml@dataspoc.com.

How it connects

ML reads from the same bucket that Pipe writes to and Lens queries. The data flow is:

Pipe (ingest) → bucket/raw/ and bucket/curated/
                        ↓
                   ML (train) → bucket/ml/models/<model>/
                   ML (predict) → bucket/ml/predictions/<model>/
                        ↓
                   Lens (query) → predictions appear as SQL tables
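
The diagram above can be read as a mapping from bucket prefix to the component that writes there. A minimal sketch of that mapping follows; the prefixes come from the diagram, while `WRITER_BY_PREFIX`, `writer_for`, and the example key are hypothetical names used only for illustration.

```python
from typing import Optional

# Prefixes taken from the data-flow diagram; keys are relative to the bucket root.
WRITER_BY_PREFIX = {
    "raw/": "Pipe",
    "curated/": "Pipe",
    "ml/models/": "ML (train)",
    "ml/predictions/": "ML (predict)",
}


def writer_for(key: str) -> Optional[str]:
    """Return the component that writes `key`, or None if the prefix is unknown."""
    for prefix, writer in WRITER_BY_PREFIX.items():
        if key.startswith(prefix):
            return writer
    return None


# "churn" is a made-up model name.
print(writer_for("ml/models/churn/model.pkl"))
```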

Bucket artifacts

ML writes to two directories in the bucket, `ml/models/` and `ml/predictions/`:

Path	Contents
`ml/models/<model>/model.pkl`	Serialized trained model
`ml/models/<model>/features.json`	Feature definitions and transformations
`ml/models/<model>/metrics.json`	Training metrics (accuracy, AUC, RMSE, etc.)
`ml/predictions/<model>/*.parquet`	Prediction output as Parquet files

Predictions saved as Parquet are automatically discoverable by Lens and appear as queryable SQL tables.
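
The artifact layout in the table above is regular enough to compute from a model name. The sketch below builds the bucket-relative keys and reads `metrics.json` from a local copy of the bucket. The helper names `artifact_keys` and `load_metrics` and the local-mirror setup are assumptions for illustration, and the fields inside `metrics.json` are not specified here, so the code treats the file as an opaque JSON object.

```python
import json
from pathlib import Path, PurePosixPath
from typing import Any, Dict


def artifact_keys(model: str) -> Dict[str, str]:
    """Return the bucket-relative keys for one model, following the table above."""
    base = PurePosixPath("ml/models") / model
    return {
        "model": str(base / "model.pkl"),
        "features": str(base / "features.json"),
        "metrics": str(base / "metrics.json"),
        "predictions_prefix": f"ml/predictions/{model}/",
    }


def load_metrics(local_root: Path, model: str) -> Dict[str, Any]:
    """Read metrics.json from a local mirror of the bucket rooted at `local_root`."""
    path = local_root / artifact_keys(model)["metrics"]
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

With a bucket synced to a local directory, `load_metrics(Path("/path/to/mirror"), "churn")` would return the parsed metrics for a hypothetical model named `churn`.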