
Training Models

Train a machine learning model from any table in your data lake with a single command.

dataspoc-lens ml train --target <column> --from <table>

| Flag | Description |
| --- | --- |
| --target | The column you want to predict |
| --from | The source table (raw, curated, or gold layer) |

The command runs the following five steps:

  1. Reads Parquet — loads the source table from your bucket.
  2. Feature engineering — automatically detects column types, encodes categoricals, generates interaction features, and handles missing values.
  3. Model selection — evaluates multiple algorithms (gradient boosting, random forest, logistic regression, etc.) and selects the best performer.
  4. Training — trains the selected model with optimized hyperparameters.
  5. Saves to bucket — writes artifacts to bucket/ml/models/<model>/.
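
To make step 2 concrete, here is a minimal sketch of type detection, missing-value handling, and categorical encoding in plain Python. It is not dataspoc-lens's implementation, which this page does not show. The function names `detect_column_types` and `engineer_features`, the median fill, the `<missing>` category, and the sample rows are all assumptions chosen for illustration, and interaction features are left out.

```python
from statistics import median


def detect_column_types(rows):
    """Label each column 'numeric' or 'categorical' from its non-missing values."""
    types = {}
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] is not None]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        types[col] = "numeric" if values and is_numeric else "categorical"
    return types


def engineer_features(rows, target):
    """Impute missing values and one-hot encode categoricals.

    Returns (X, y, feature_names). Assumed strategy, not the tool's own:
    numeric gaps get the column median, categorical gaps get '<missing>'.
    """
    types = detect_column_types(rows)
    feature_cols = [c for c in rows[0] if c != target]

    medians, categories = {}, {}
    for c in feature_cols:
        present = [r[c] for r in rows if r[c] is not None]
        if types[c] == "numeric":
            medians[c] = median(present)
        else:
            categories[c] = sorted({str(v) for v in present}) + ["<missing>"]

    # One output name per numeric column, one per category of each categorical.
    names = []
    for c in feature_cols:
        if types[c] == "numeric":
            names.append(c)
        else:
            names.extend(f"{c}={cat}" for cat in categories[c])

    X = []
    for r in rows:
        vec = []
        for c in feature_cols:
            if types[c] == "numeric":
                vec.append(float(medians[c] if r[c] is None else r[c]))
            else:
                label = "<missing>" if r[c] is None else str(r[c])
                vec.extend(1.0 if label == cat else 0.0 for cat in categories[c])
        X.append(vec)

    y = [r[target] for r in rows]
    return X, y, names


# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"plan": "pro", "logins": 12, "churned": 0},
    {"plan": "free", "logins": None, "churned": 1},
    {"plan": None, "logins": 3, "churned": 1},
]
X, y, names = engineer_features(rows, target="churned")
print(names)
print(X)
```

Each categorical column expands into one 0/1 feature per category, which is one way a table with 18 columns can yield a larger feature count such as the 42 reported in the example run.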

After training completes, three files are saved to the bucket:

| File | Description |
| --- | --- |
| model.pkl | The serialized trained model |
| features.json | Feature definitions, transformations, and column mappings |
| metrics.json | Evaluation metrics (accuracy, precision, recall, AUC, RMSE, etc.) |

These files are stored at:

bucket/
  ml/
    models/
      <model>/
        model.pkl
        features.json
        metrics.json
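
Once the artifacts are copied to a local directory, the two JSON files can be read with the Python standard library. The sketch below assumes such a local copy already exists. The helper name `load_model_metadata` is hypothetical, and because this page does not document the JSON schemas, the code returns the parsed content as-is without assuming any keys.

```python
import json
from pathlib import Path

# The three artifact names listed in the table above.
EXPECTED_FILES = ("model.pkl", "features.json", "metrics.json")


def load_model_metadata(model_dir):
    """Check that all three artifacts exist, then return (features, metrics).

    model_dir is a local copy of bucket/ml/models/<model>/ (an assumption:
    downloading from the bucket is outside this sketch).
    """
    model_dir = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing artifacts in {model_dir}: {missing}")

    features = json.loads((model_dir / "features.json").read_text(encoding="utf-8"))
    metrics = json.loads((model_dir / "metrics.json").read_text(encoding="utf-8"))
    return features, metrics
```

A call such as `features, metrics = load_model_metadata("./churned_activity")` would then return the parsed metadata, where the local path is a placeholder. The sketch deliberately does not open model.pkl: unpickling can execute arbitrary code, so load that file only when it comes from a bucket you trust.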

Suppose you have a curated/customers/activity table with a churned column (1 = churned, 0 = active):

dataspoc-lens ml train --target churned --from curated/customers/activity

Output:

[ML] Loading table curated/customers/activity...
[ML] 45,231 rows, 18 columns
[ML] Feature engineering: 42 features generated
[ML] Evaluating models...
[ML] Best model: GradientBoosting (AUC=0.91)
[ML] Training final model...
[ML] Saved to ml/models/churned_activity/
[ML] Done.

You can then inspect the model with dataspoc-lens ml explain --model churned_activity or generate predictions with dataspoc-lens ml predict.
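
To launch the same training run from a script, the command can be assembled as an argument list and passed to `subprocess`. This is a minimal sketch that uses only the two flags documented on this page. The helper names `build_train_command` and `run_training` are hypothetical, and the sketch assumes that `dataspoc-lens` is on your PATH and exits with a non-zero status on failure.

```python
import shlex
import subprocess


def build_train_command(target, table):
    """Return the argv list for `dataspoc-lens ml train` with the documented flags."""
    return ["dataspoc-lens", "ml", "train", "--target", target, "--from", table]


def run_training(target, table):
    """Run the CLI and return its stdout.

    Assumes the CLI signals failure with a non-zero exit code, in which case
    subprocess raises CalledProcessError.
    """
    result = subprocess.run(
        build_train_command(target, table),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Build (but do not run) the command from the churn example.
cmd = build_train_command("churned", "curated/customers/activity")
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a table path or column name contains spaces or shell metacharacters.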