DataSpoc ML
DataSpoc ML lets you train, predict, and explain machine learning models directly on Parquet data stored in cloud buckets. No data movement, no separate ML infrastructure — your models run where your data already lives.
What it does
- Automated feature engineering — detects column types, generates meaningful features, handles missing values.
- Model selection — evaluates multiple algorithms and picks the best one for your data.
- Hyperparameter tuning — optimizes model parameters automatically.
- Drift monitoring — tracks model performance over time and alerts when predictions degrade.
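As a rough illustration of the drift-monitoring idea in the last bullet, the sketch below compares accuracy on a recent window of labelled predictions against the accuracy recorded at training time. It is not DataSpoc ML's implementation: the function names `accuracy` and `has_drifted`, the `tolerance` parameter, the choice of accuracy as the metric, and the sample values are all assumptions made for this example.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one labelled prediction")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def has_drifted(
    baseline_accuracy: float,
    y_true: Sequence[int],
    y_pred: Sequence[int],
    tolerance: float = 0.05,
) -> bool:
    """Hypothetical check: flag drift when recent accuracy falls more than
    `tolerance` below the accuracy measured at training time."""
    recent = accuracy(y_true, y_pred)
    return (baseline_accuracy - recent) > tolerance


# Assumed scenario: the model scored 0.90 at training time,
# and these are ten recent predictions with their true labels.
recent_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
recent_preds = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
drift_detected = has_drifted(0.90, recent_labels, recent_preds)
```

A fixed tolerance on a single metric is the simplest possible rule. A production monitor would more likely track several metrics over rolling windows and also watch the input feature distributions.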
Commercial product
DataSpoc ML is a commercial product integrated with Lens. It is accessed entirely through `dataspoc-lens ml` commands; there is no separate CLI to install.
To get started, contact ml@dataspoc.com.
How it connects
ML reads from the same bucket that Pipe writes to and Lens queries. The data flow is:

    Pipe (ingest)  → bucket/raw/ and bucket/curated/
            ↓
    ML (train)     → bucket/ml/models/<model>/
    ML (predict)   → bucket/ml/predictions/<model>/
            ↓
    Lens (query)   → predictions appear as SQL tables

Bucket artifacts
ML writes to two directories in the bucket:
| Path | Contents |
|---|---|
| `ml/models/<model>/model.pkl` | Serialized trained model |
| `ml/models/<model>/features.json` | Feature definitions and transformations |
| `ml/models/<model>/metrics.json` | Training metrics (accuracy, AUC, RMSE, etc.) |
| `ml/predictions/<model>/*.parquet` | Prediction output as Parquet files |
Predictions saved as Parquet are automatically discoverable by Lens and appear as queryable SQL tables.
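The layout in the table above can be captured in a small helper, which may be handy when scripting against the bucket. This is a minimal sketch using only the paths documented here. The function names `model_artifact_paths` and `predictions_prefix` and the model name `churn` are hypothetical and are not part of any DataSpoc API.

```python
from posixpath import join
from typing import Dict


def model_artifact_paths(model: str) -> Dict[str, str]:
    """Return the bucket-relative paths of a model's training artifacts."""
    base = join("ml", "models", model)
    return {
        "model": join(base, "model.pkl"),          # serialized trained model
        "features": join(base, "features.json"),   # feature definitions
        "metrics": join(base, "metrics.json"),     # training metrics
    }


def predictions_prefix(model: str) -> str:
    """Return the bucket-relative directory holding prediction Parquet files."""
    return join("ml", "predictions", model) + "/"


# "churn" is a made-up model name used only for illustration.
paths = model_artifact_paths("churn")
prefix = predictions_prefix("churn")
```

`posixpath.join` is used rather than `os.path.join` so the separators stay as forward slashes on every platform, which matches how object-store keys are written.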