Employee Development Cost Prediction with ElasticNet Algorithm in ML

HR leaders need to know what a new training program will cost for a given employee profile before approving budgets or signing vendor contracts. Historical HR logs show that development cost per employee depends on grade, department, prior training hours, performance band, and the chosen delivery mode (virtual, classroom, or blended). Many of those fields are tightly collinear (senior grades ↔ longer programs ↔ classroom learning), so:

Ordinary least squares produces unstable coefficients whose signs can flip from one sample to the next.
Pure Lasso (ℓ₁) may over-shrink and drop genuinely useful predictors.

An ElasticNet model (Ridge ℓ₂ + Lasso ℓ₁) combines both penalties, giving a stable yet sparse estimator of Training Cost (USD) that finance and L&D teams can rely on.
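The instability described above can be reproduced on a small synthetic example. The sketch below is an illustration only: the two features, the noise levels, and the ElasticNet settings (alpha=0.1, l1_ratio=0.5) are assumptions made for the demo, not values taken from the HR dataset. It bootstraps the data and compares how much the fitted coefficients move around under ordinary least squares and under ElasticNet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet

rng = np.random.default_rng(0)
n = 200

# Two almost identical predictors (synthetic stand-ins for collinear HR fields)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(scale=1.0, size=n)

def coef_spread(model, n_boot=50):
    """Standard deviation of each coefficient across bootstrap resamples."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression())
enet_spread = coef_spread(ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient spread:       ", ols_spread)
print("ElasticNet coefficient spread:", enet_spread)
```

On data like this the OLS coefficients should vary far more between resamples than the ElasticNet ones, which is the behaviour the penalty is meant to tame.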

Libraries Required

Purpose	Library
Data wrangling	pandas, numpy
Visuals	matplotlib, seaborn
ML workflow	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	scikit‑learn → mean_squared_error, r2_score

Dataset

Employee/HR Dataset

Step-by-Step Code Implementation

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

Load & Inspect Data

df = pd.read_csv("data/employee_data.csv")     # adjust filename if different
print(df[['Department','Grade','Training Hours','Training Cost']].head())
sns.histplot(df['Training Cost'], kde=True)
plt.title('Training‑cost distribution'); plt.show()

Define Target & Features

y = df['Training Cost']            # USD per employee

X = df[['Department','Grade','Gender','Education Level',
        'Performance Band','Training Mode','Training Hours']]

cat_cols = ['Department','Grade','Gender','Education Level',
            'Performance Band','Training Mode']
num_cols = ['Training Hours']

ElasticNet Pipeline

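Pre-processing: ColumnTransformer one-hot-encodes the categorical HR attributes and z-scales the numeric hours. Because both steps sit inside the Pipeline, they are refit on the training portion of every CV fold, so validation rows never leak into the encoder or the scaler.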

preprocess = ColumnTransformer([
        # sparse_output replaces the old `sparse` argument (scikit-learn >= 1.2)
        ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
        ('num', StandardScaler(), num_cols)
    ])

pipe = Pipeline([
        ('prep', preprocess),
        ('enet', ElasticNet(max_iter=15000, random_state=42))
    ])

Train/Test Split & Hyper‑Parameter Grid

ElasticNet logic:

α (alpha) controls overall shrinkage;
l1_ratio balances Ridge (handles collinearity) and Lasso (sparsity).

The grid holds 162 candidates (18 α × 9 ratios). Each is scored with 5-fold cross-validation, and the lowest RMSE wins.
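For reference, the objective that scikit-learn documents for ElasticNet can be written as follows, with α standing for alpha, ρ for l1_ratio, n for the number of training rows, and w for the coefficient vector (the symbols ρ, n, and w are notation introduced here, not in the original text):

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ₁ term (Lasso), while ρ close to 0 makes the ℓ₂ term dominate (Ridge-like), which is why the grid sweeps ρ between 0.1 and 0.9.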

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=df['Department'])

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print("Best α:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
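As a quick sanity check on the search size, the grid can be enumerated without fitting anything. This sketch rebuilds the same param_grid and counts the candidates with ParameterGrid; the 5 comes from cv=5 in the search above.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    'enet__alpha': np.logspace(-3, 1, 18),
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9),
}

n_candidates = len(ParameterGrid(param_grid))   # every alpha paired with every ratio
n_fits = n_candidates * 5                       # one fit per candidate per CV fold
print(n_candidates, n_fits)                     # 162 810
```

Because the scorer is neg_root_mean_squared_error, gs.best_score_ is a negative number; -gs.best_score_ gives the cross-validated RMSE in USD.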

Evaluate Model

y_pred = gs.predict(X_test)
# np.sqrt works on every scikit-learn version; squared=False was deprecated in 1.4
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.2f} | R²: {r2:.3f}")
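An RMSE in dollars is easier to judge next to a naive reference. One common choice, offered here as a suggestion rather than part of the original workflow, is to always predict the training-set mean cost. The helper name baseline_rmse and the toy numbers below are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def baseline_rmse(y_train, y_test):
    """RMSE of always predicting the mean of the training costs."""
    dummy = DummyRegressor(strategy='mean')
    dummy.fit(np.zeros((len(y_train), 1)), y_train)     # features are ignored
    pred = dummy.predict(np.zeros((len(y_test), 1)))
    return float(np.sqrt(mean_squared_error(y_test, pred)))

# Toy costs for illustration only (not from the HR dataset)
toy_train = np.array([1000.0, 1200.0, 1400.0])
toy_test = np.array([1100.0, 1300.0])
print(baseline_rmse(toy_train, toy_test))               # 100.0
```

In the tutorial's setting, baseline_rmse(y_train, y_test) could be printed alongside the hold-out RMSE; the ElasticNet model is only useful if it beats that figure by a clear margin.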

Interpret Top Coefficients

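Interpretability: the coefficient chart may show, for example, that Classroom training adds $280 versus virtual, each extra 10 training hours costs ≈ $45, and Grade E (Executive) programs carry a $900 premium. These are insights procurement can act on. Each one-hot coefficient is measured against the category that drop='first' removed (the first level in sorted order), so check which level is the baseline before quoting a premium.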

# Retrieve feature names post‑encoding
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feat_names = np.hstack([ohe_names, num_cols])

# Reverse‑scale numeric coefficient(s)
scale = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
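# Copy first: coef_ is the fitted model's own array, and editing it in place would change predictions
coef  = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scale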

impact = (pd.Series(coef, index=feat_names)
            .sort_values(key=abs, ascending=False)
            .head(15))

plt.figure(figsize=(9,5))
impact.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net – Drivers of Training Cost'); plt.xlabel('Δ Cost (USD)')
plt.tight_layout(); plt.show()
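The reverse-scaling step above divides the numeric coefficient by the scaler's standard deviation. A minimal check of why that works, using synthetic hours and an exact cost line of $4.50 per hour (matching the "≈ $45 per 10 hours" example, but otherwise made up), with plain LinearRegression swapped in for ElasticNet so that no shrinkage blurs the result:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
hours = rng.uniform(5, 60, size=100).reshape(-1, 1)   # synthetic training hours
cost = 4.5 * hours.ravel() + 200.0                     # exact line: $4.50 per hour

scaler = StandardScaler().fit(hours)
model = LinearRegression().fit(scaler.transform(hours), cost)

per_std = model.coef_[0]                 # dollars per one standard deviation of hours
per_hour = per_std / scaler.scale_[0]    # dollars per raw hour
print(round(per_hour, 2))                # 4.5
```

With ElasticNet the recovered slope would be somewhat smaller than the true one because of the penalty, but the unit conversion is the same.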

Summary

This end‑to‑end Elastic Net workflow:

Forecasts per-employee development cost early, with accuracy checked by RMSE and R² on a hold-out set.
Balances multicollinearity & sparsity, keeping correlated HR signals while trimming noise.
Provides clear dollar levers—grade, mode, hours—empowering HR to budget accurately and negotiate vendor rates.
Refreshing the model with new payroll-training extracts means rebuilding the split and calling gs.fit(X_train, y_train) again, which keeps cost predictions current as L&D strategy evolves.