Training Models

Train a machine learning model from any table in your data lake with a single command.

Usage

dataspoc-lens ml train --target <column> --from <table>

Flag	Description
`--target`	The column you want to predict
`--from`	The source table (raw, curated, or gold layer)
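If you launch training from a script rather than a shell, a small wrapper can assemble the command. The sketch below is illustrative only: `build_train_command` and `run_training` are hypothetical helper names, and it assumes `dataspoc-lens` is installed and on your `PATH`. It uses only the two flags documented above.

```python
import shlex
import subprocess
from typing import List


def build_train_command(target: str, table: str) -> List[str]:
    """Return the argument list for `dataspoc-lens ml train`."""
    if not target or not table:
        raise ValueError("both target and table are required")
    return ["dataspoc-lens", "ml", "train", "--target", target, "--from", table]


def run_training(target: str, table: str) -> int:
    """Run the CLI and return its exit code (needs dataspoc-lens on PATH)."""
    cmd = build_train_command(target, table)
    print("Running:", shlex.join(cmd))
    # check=False: let the caller decide how to react to a non-zero exit code.
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems when a table path contains unusual characters.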

What happens

1. Reads Parquet: loads the source table from your bucket.
2. Feature engineering: automatically detects column types, encodes categoricals, generates interaction features, and handles missing values.
3. Model selection: evaluates multiple algorithms (gradient boosting, random forest, logistic regression, etc.) and selects the best performer.
4. Training: trains the selected model with optimized hyperparameters.
5. Saves to bucket: writes artifacts to bucket/ml/models/<model>/.
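To make the feature-engineering and model-selection steps more concrete, here is a minimal, standard-library sketch of the general technique. It is not the tool's actual implementation: the function names are hypothetical, mean imputation and one-hot encoding are assumed choices, and interaction features are left out for brevity.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def is_numeric(values: List[object]) -> bool:
    """A column counts as numeric if every non-missing value is an int or float."""
    present = [v for v in values if v is not None]
    return bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )


def engineer_features(rows: List[Row], target: str) -> List[Dict[str, float]]:
    """Impute missing numerics with the column mean and one-hot encode the rest."""
    columns = [c for c in rows[0] if c != target]
    out: List[Dict[str, float]] = [{} for _ in rows]
    for col in columns:
        values = [r.get(col) for r in rows]
        if is_numeric(values):
            present = [float(v) for v in values if v is not None]
            mean = sum(present) / len(present)
            for features, v in zip(out, values):
                features[col] = mean if v is None else float(v)
        else:
            categories = sorted({str(v) for v in values if v is not None})
            for features, v in zip(out, values):
                for cat in categories:
                    hit = v is not None and str(v) == cat
                    # A missing categorical value leaves every indicator at 0.0.
                    features[f"{col}={cat}"] = 1.0 if hit else 0.0
    return out


def select_best(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest validation score."""
    return max(scores, key=scores.get)


rows = [
    {"plan": "pro", "logins": 10, "churned": 0},
    {"plan": None, "logins": None, "churned": 1},
    {"plan": "free", "logins": 4, "churned": 1},
]
features = engineer_features(rows, target="churned")
best = select_best({"gradient_boosting": 0.91, "random_forest": 0.88})
```

In this toy input, the missing `logins` value is filled with the mean of the other two rows, and `plan` becomes two indicator columns, `plan=free` and `plan=pro`.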

Output artifacts

After training completes, three files are saved to the bucket:

File	Description
`model.pkl`	The serialized trained model
`features.json`	Feature definitions, transformations, and column mappings
`metrics.json`	Evaluation metrics (such as accuracy, precision, recall, and AUC for classification targets, or RMSE for regression targets)

These files are stored at:

bucket/
  ml/
    models/
      <model>/
        model.pkl
        features.json
        metrics.json
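Once you have a local copy of a model directory, the three files can be read with the Python standard library. This is a sketch under assumptions: `load_artifacts` is a hypothetical helper, the directory is assumed to have been downloaded from the bucket already, and `model.pkl` is assumed to be a standard Python pickle (the extension suggests this, but the format is not documented here).

```python
import json
import pickle
from pathlib import Path
from typing import Any, Dict


def load_artifacts(model_dir: str) -> Dict[str, Any]:
    """Load model.pkl, features.json, and metrics.json from a local directory."""
    root = Path(model_dir)
    with open(root / "features.json", encoding="utf-8") as f:
        features = json.load(f)
    with open(root / "metrics.json", encoding="utf-8") as f:
        metrics = json.load(f)
    # Assumption: model.pkl is a standard Python pickle. Unpickling can run
    # arbitrary code, so only load files that come from a bucket you trust.
    with open(root / "model.pkl", "rb") as f:
        model = pickle.load(f)
    return {"model": model, "features": features, "metrics": metrics}
```

Unpickling a trained model generally also requires the library that produced it to be installed in the same environment.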

Example: training a churn model

Suppose you have a `curated/customers/activity` table with a `churned` column (1 = churned, 0 = active):

dataspoc-lens ml train --target churned --from curated/customers/activity

Output:

[ML] Loading table curated/customers/activity...
[ML] 45,231 rows, 18 columns
[ML] Feature engineering: 42 features generated
[ML] Evaluating models...
[ML] Best model: GradientBoosting (AUC=0.91)
[ML] Training final model...
[ML] Saved to ml/models/churned_activity/
[ML] Done.
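If you capture this output in a script, the selected model and its score can be pulled out with a regular expression. The sketch below assumes the log line looks exactly like the `Best model` line in the example above; the format is not a documented contract and may change, and `parse_best_model` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Matches lines such as: [ML] Best model: GradientBoosting (AUC=0.91)
BEST_MODEL_RE = re.compile(
    r"^\[ML\] Best model: (?P<name>\S+) "
    r"\((?P<metric>\w+)=(?P<value>[0-9]+(?:\.[0-9]+)?)\)$"
)


def parse_best_model(log_text: str) -> Optional[Dict[str, object]]:
    """Return the model name, metric name, and metric value, or None if absent."""
    for line in log_text.splitlines():
        match = BEST_MODEL_RE.match(line.strip())
        if match:
            return {
                "name": match["name"],
                "metric": match["metric"],
                "value": float(match["value"]),
            }
    return None
```

For anything beyond a quick check, reading `metrics.json` from the model directory is likely to be more robust than parsing log text.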

You can then inspect the model with `dataspoc-lens ml explain --model churned_activity` or generate predictions with `dataspoc-lens ml predict`.