Employee Development Cost Prediction with ElasticNet Algorithm in ML
HR leaders need to know the cost of a new training program for a given employee profile before approving budgets or signing vendor contracts. Historical HR logs show that development cost per employee depends on grade, department, prior training hours, performance band, and the chosen delivery mode (virtual, classroom, or blended). Many of those fields are tightly collinear (senior grades ↔ longer programs ↔ classroom learning), so:
- Ordinary least‑squares produces unstable, flip‑sign coefficients.
- Pure Lasso (ℓ¹) may over‑shrink and drop genuinely useful predictors.
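An ElasticNet model combines the Ridge (ℓ²) and Lasso (ℓ¹) penalties: the Ridge part keeps the coefficients of correlated predictors stable, while the Lasso part zeroes out weak ones. The result is a robust yet sparse estimator of Training Cost (USD) that finance and L&D teams can rely on.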
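To make the instability claim concrete, here is a minimal sketch on synthetic data (not the HR dataset). Two nearly identical predictors are redrawn 30 times, and the spread of the fitted coefficients is compared between ordinary least squares and ElasticNet. The sample size, noise levels, and the choice of alpha=0.1 and l1_ratio=0.5 are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)

def fit_once(model):
    """Draw a fresh sample with two almost identical predictors; return fitted coefficients."""
    x1 = rng.normal(size=200)
    x2 = x1 + 0.01 * rng.normal(size=200)        # nearly a copy of x1
    X = np.column_stack([x1, x2])
    y = 3 * x1 + 3 * x2 + rng.normal(size=200)   # true coefficients are 3 and 3
    return model.fit(X, y).coef_.copy()

ols_coefs = np.array([fit_once(LinearRegression()) for _ in range(30)])
enet_coefs = np.array(
    [fit_once(ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)) for _ in range(30)]
)

# Standard deviation of each coefficient across the 30 refits
ols_sd = ols_coefs.std(axis=0)
enet_sd = enet_coefs.std(axis=0)
print("OLS coefficient std:       ", ols_sd.round(2))
print("ElasticNet coefficient std:", enet_sd.round(2))
```

In this setup the least-squares coefficients swing widely from sample to sample, while the ElasticNet coefficients stay close together because the Ridge term damps the poorly determined difference between the two predictors.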
Libraries Required
| Purpose | Library |
|---|---|
| Data wrangling | pandas, numpy |
| Visuals | matplotlib, seaborn |
| ML workflow | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset
Step-by-Step Code Implementation
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
Load & Inspect Data
df = pd.read_csv("data/employee_data.csv") # adjust filename if different
print(df[['Department','Grade','Training Hours','Training Cost']].head())
sns.histplot(df['Training Cost'], kde=True)
plt.title('Training‑cost distribution'); plt.show()
Define Target & Features
y = df['Training Cost'] # USD per employee
X = df[['Department','Grade','Gender','Education Level',
'Performance Band','Training Mode','Training Hours']]
cat_cols = ['Department','Grade','Gender','Education Level',
'Performance Band','Training Mode']
num_cols = ['Training Hours']
ElasticNet Pipeline
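Pre-processing: ColumnTransformer one-hot-encodes the categorical HR attributes and z-scales the numeric hours. Because both transformers sit inside the pipeline, they are re-fit on the training portion of every CV fold, so no statistics leak in from validation data.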
preprocess = ColumnTransformer([
('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),  # scikit-learn >= 1.2; older versions use sparse=False
('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
('prep', preprocess),
('enet', ElasticNet(max_iter=15000, random_state=42))
])
Train/Test Split & Hyper‑Parameter Grid
ElasticNet logic:
- α controls overall shrinkage;
- l1_ratio sets the mix: values near 0 are Ridge-heavy (handles collinearity), values near 1 are Lasso-heavy (sparsity).
The grid's 162 combinations (18 α values × 9 ratios) are each 5-fold cross-validated; the combination with the lowest mean RMSE wins.
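For reference, the objective minimized by scikit-learn's ElasticNet can be written as follows, where n is the number of training rows, w the coefficient vector, α the `alpha` parameter, and ρ the `l1_ratio` parameter (the intercept is omitted here for brevity):

```latex
\min_{w}\;
\frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ¹ term (Lasso), and ρ = 0 leaves only the ℓ² term (Ridge); the grid in this article stays between 0.1 and 0.9.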
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=df['Department'])
param_grid = {
'enet__alpha' : np.logspace(-3, 1, 18), # 0.001 → 10
'enet__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
gs = GridSearchCV(pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print("Best α:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
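Beyond the single best pair, it can help to look at how close the runner-up settings are. The sketch below uses synthetic data and a deliberately small 3 × 2 grid (both are assumptions for illustration) to show one way of turning `cv_results_` into a ranked table. Because the scorer is the negated RMSE, the sign is flipped back before reading the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 500 + 40 * X_demo[:, 0] + 20 * X_demo[:, 1] + rng.normal(scale=10, size=120)

demo_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.8]}
demo_gs = GridSearchCV(
    ElasticNet(max_iter=10000),
    demo_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_gs.fit(X_demo, y_demo)

results = (
    pd.DataFrame(demo_gs.cv_results_)
    .assign(mean_rmse=lambda d: -d["mean_test_score"])  # flip sign back to RMSE
    .sort_values("rank_test_score")
    [["param_alpha", "param_l1_ratio", "mean_rmse", "std_test_score"]]
)
print(results.to_string(index=False))
print("Best CV RMSE:", round(-demo_gs.best_score_, 2))
```

If several rows have nearly the same mean RMSE, the choice between them is largely a matter of preference, for example favoring the sparser model.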
Evaluate Model
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE via sqrt; squared=False is deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.2f} | R²: {r2:.3f}")
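An RMSE in dollars is easier to judge next to a naive baseline. The following sketch, on synthetic data with assumed coefficients and noise, compares an ElasticNet against a `DummyRegressor` that always predicts the training mean; the same comparison could be run on the real split.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(300, 3))
y_syn = 500 + 40 * X_syn[:, 0] + 20 * X_syn[:, 1] + rng.normal(scale=10, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

def holdout_rmse(model):
    """Fit on the training split and return RMSE on the hold-out split."""
    pred = model.fit(Xtr, ytr).predict(Xte)
    return float(np.sqrt(mean_squared_error(yte, pred)))

baseline_rmse = holdout_rmse(DummyRegressor(strategy="mean"))
enet_rmse = holdout_rmse(ElasticNet(alpha=0.1, l1_ratio=0.5))

print(f"Mean-only baseline RMSE: {baseline_rmse:,.2f}")
print(f"ElasticNet RMSE:         {enet_rmse:,.2f}")
```

A model whose hold-out RMSE is not clearly below the baseline is adding little beyond the average cost.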
Interpret Top Coefficients
Interpretability: the coefficient chart may show that Classroom training adds $280 versus virtual, each extra 10 training hours costs ≈ $45, while Grade E (Executive) programs carry a $900 premium—insights procurement can act on.
# Retrieve feature names post‑encoding
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feat_names = np.hstack([ohe_names, num_cols])
# Reverse‑scale numeric coefficient(s)
scale = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef = gs.best_estimator_.named_steps['enet'].coef_.copy()  # copy so the fitted model's coefficients are not modified in place
coef[-len(num_cols):] = coef[-len(num_cols):] / scale
impact = (pd.Series(coef, index=feat_names)
.sort_values(key=abs, ascending=False)
.head(15))
plt.figure(figsize=(9,5))
impact.plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net – Drivers of Training Cost'); plt.xlabel('Δ Cost (USD)')
plt.tight_layout(); plt.show()
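The reverse-scaling step divides the coefficient of a z-scored feature by that feature's standard deviation to get dollars per raw unit. A small numeric check of that arithmetic, using made-up hours and an assumed slope of $4.50 per hour, might look like this:

```python
import numpy as np

# Illustrative numbers only: cost rises by exactly $4.50 per training hour
hours = np.array([8.0, 16.0, 24.0, 32.0, 40.0])
cost = 100.0 + 4.5 * hours

mean, std = hours.mean(), hours.std()   # population std (ddof=0), as StandardScaler uses
z = (hours - mean) / std                # z-scored hours

slope_per_std = np.polyfit(z, cost, 1)[0]   # USD per one standard deviation of hours
slope_per_hour = slope_per_std / std        # USD per raw hour

print(f"Per standard deviation: ${slope_per_std:.2f}")
print(f"Per hour:               ${slope_per_hour:.2f}")
print(f"Per 10 hours:           ${10 * slope_per_hour:.2f}")
```

The one-hot columns need no such adjustment, since they are not scaled and their coefficients are already in dollars relative to the dropped reference category.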
Summary
This end‑to‑end Elastic Net workflow:
- Forecasts per-employee development cost early and reports hold-out RMSE and R², so accuracy can be checked before the numbers are used.
- Balances multicollinearity & sparsity, keeping correlated HR signals while trimming noise.
- Provides clear dollar levers—grade, mode, hours—empowering HR to budget accurately and negotiate vendor rates.
- Refreshes in a single call: after loading a new payroll-training extract and re-splitting, gs.fit(X_train, y_train) re-tunes the model, keeping cost prediction sharp as L&D strategy evolves.
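Once a pipeline is fitted, pricing a prospective profile is a `predict` call on a DataFrame that uses the training column names. The sketch below is a stand-in, not the article's fitted model: it trains a small pipeline on synthetic data (the $280 classroom premium and $4.50 per hour are invented inputs) and scores two hypothetical profiles that differ only in delivery mode.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the HR extract; all numbers are invented
rng = np.random.default_rng(7)
n = 200
toy_df = pd.DataFrame({
    "Training Mode": rng.choice(["Virtual", "Classroom"], size=n),
    "Training Hours": rng.integers(4, 41, size=n).astype(float),
})
toy_df["Training Cost"] = (
    200
    + 280 * (toy_df["Training Mode"] == "Classroom")
    + 4.5 * toy_df["Training Hours"]
    + rng.normal(scale=20, size=n)
)

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(drop="first"), ["Training Mode"]),
        ("num", StandardScaler(), ["Training Hours"]),
    ])),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)),
])
toy_pipe.fit(toy_df[["Training Mode", "Training Hours"]], toy_df["Training Cost"])

# Two hypothetical profiles that differ only in delivery mode
profiles = pd.DataFrame({
    "Training Mode": ["Virtual", "Classroom"],
    "Training Hours": [16.0, 16.0],
})
virtual_cost, classroom_cost = toy_pipe.predict(profiles)
print(f"Virtual, 16 h:   ${virtual_cost:,.2f}")
print(f"Classroom, 16 h: ${classroom_cost:,.2f}")
```

With the tuned search object from this article, the equivalent call would be `gs.predict(new_profiles)`, where `new_profiles` is a hypothetical DataFrame containing the same seven feature columns as X.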