Training Models
Train a machine learning model from any table in your data lake with a single command.
    dataspoc-lens ml train --target <column> --from <table>

| Flag | Description |
|---|---|
| `--target` | The column you want to predict |
| `--from` | The source table (raw, curated, or gold layer) |
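If you launch training from a script or scheduler rather than a terminal, the command can be assembled programmatically. The sketch below is one way to do that in Python; the helper names `build_train_command` and `run_training` are hypothetical, it uses only the two flags documented above, and it assumes `dataspoc-lens` is installed and on your `PATH`.

```python
import shlex
import subprocess


def build_train_command(target: str, source_table: str) -> list[str]:
    """Return the argv list for `dataspoc-lens ml train` (hypothetical helper)."""
    return [
        "dataspoc-lens", "ml", "train",
        "--target", target,
        "--from", source_table,
    ]


def run_training(target: str, source_table: str) -> int:
    """Run the training command and return its exit code.

    Assumes the `dataspoc-lens` executable is available on PATH.
    """
    completed = subprocess.run(build_train_command(target, source_table), check=False)
    return completed.returncode


# Print the command line that would be executed, without running it.
print(shlex.join(build_train_command("churned", "curated/customers/activity")))
```

Passing the arguments as a list avoids shell quoting problems when a table path or column name contains spaces.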
What happens
- Reads Parquet: loads the source table from your bucket.
- Feature engineering: automatically detects column types, encodes categoricals, generates interaction features, and handles missing values.
- Model selection: evaluates multiple algorithms (gradient boosting, random forest, logistic regression, etc.) and selects the best performer.
- Training: trains the selected model with optimized hyperparameters.
- Saves to bucket: writes artifacts to `bucket/ml/models/<model>/`.
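To make the feature engineering and model selection steps more concrete, here is a minimal, standard-library sketch of the general idea. It is not the tool's actual implementation: the function names, the median imputation, the one-hot encoding scheme, and the toy rows and scores are all assumptions chosen for illustration, and interaction features are omitted.

```python
from statistics import median


def is_number(value) -> bool:
    """True for ints and floats, but not booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def engineer_features(rows: list[dict], target: str) -> list[dict]:
    """Impute numeric gaps with the median and one-hot encode other columns."""
    columns = sorted({name for row in rows for name in row if name != target})
    encoded = [{} for _ in rows]
    for col in columns:
        present = [row[col] for row in rows if row.get(col) is not None]
        if present and all(is_number(v) for v in present):
            # Numeric column: replace missing values with the column median.
            fill = median(present)
            for out, row in zip(encoded, rows):
                value = row.get(col)
                out[col] = float(fill if value is None else value)
        else:
            # Categorical column: one 0/1 indicator per observed category.
            # A missing value leaves every indicator at 0.
            categories = sorted({str(v) for v in present})
            for out, row in zip(encoded, rows):
                value = row.get(col)
                for cat in categories:
                    match = value is not None and str(value) == cat
                    out[f"{col}={cat}"] = 1.0 if match else 0.0
    return encoded


def select_best(scores: dict[str, float]) -> str:
    """Return the name of the candidate with the highest validation score."""
    return max(scores, key=scores.get)


# Toy rows, invented for illustration only.
rows = [
    {"plan": "pro", "logins": 12, "churned": 0},
    {"plan": "free", "logins": None, "churned": 1},
    {"plan": None, "logins": 4, "churned": 1},
]
features = engineer_features(rows, target="churned")

# Made-up validation scores; a real run would compute these per algorithm.
best = select_best({"model_a": 0.80, "model_b": 0.85, "model_c": 0.70})
```

In this sketch the missing `logins` value is filled with the median of the observed values, and `plan` becomes two indicator columns, `plan=free` and `plan=pro`.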
Output artifacts
After training completes, three files are saved to the bucket:

| File | Description |
|---|---|
| `model.pkl` | The serialized trained model |
| `features.json` | Feature definitions, transformations, and column mappings |
| `metrics.json` | Evaluation metrics (accuracy, precision, recall, AUC, RMSE, etc.) |
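If you want to read these artifacts from your own code, a sketch along the following lines may help. It assumes the bucket has been synced or mounted to a local directory (`bucket_root`), that the files live under `ml/models/<model>/` as described above, and that `model.pkl` is a standard Python pickle, which the extension suggests but this page does not confirm. The helper names are hypothetical, and the keys inside `metrics.json` are not specified here, so the code returns the parsed JSON as-is.

```python
import json
import pickle
from pathlib import Path

ARTIFACT_NAMES = ("model.pkl", "features.json", "metrics.json")


def artifact_paths(bucket_root: Path, model_name: str) -> dict[str, Path]:
    """Map each artifact file name to its expected location."""
    model_dir = bucket_root / "ml" / "models" / model_name
    return {name: model_dir / name for name in ARTIFACT_NAMES}


def load_metrics(bucket_root: Path, model_name: str) -> dict:
    """Parse metrics.json and return its contents unchanged."""
    path = artifact_paths(bucket_root, model_name)["metrics.json"]
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def load_model(bucket_root: Path, model_name: str):
    """Unpickle model.pkl.

    Only load files you trust: unpickling can execute arbitrary code.
    """
    path = artifact_paths(bucket_root, model_name)["model.pkl"]
    with path.open("rb") as handle:
        return pickle.load(handle)
```

Unpickling a model typically requires the same libraries that were used to train it to be installed in the loading environment.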
These files are stored at:

    bucket/
      ml/
        models/
          <model>/
            model.pkl
            features.json
            metrics.json

Example: training a churn model
Suppose you have a `curated/customers/activity` table with a `churned` column (1 = churned, 0 = active):
    dataspoc-lens ml train --target churned --from curated/customers/activity

Output:

    [ML] Loading table curated/customers/activity...
    [ML] 45,231 rows, 18 columns
    [ML] Feature engineering: 42 features generated
    [ML] Evaluating models...
    [ML] Best model: GradientBoosting (AUC=0.91)
    [ML] Training final model...
    [ML] Saved to ml/models/churned_activity/
    [ML] Done.

You can then inspect the model with `dataspoc-lens ml explain --model churned_activity` or generate predictions with `dataspoc-lens ml predict`.