Urban Infrastructure Cost Prediction with ElasticNet Algorithm in ML
City engineers must quote the capital cost (USD million per km) of new urban-infrastructure corridors (light-rail extensions, BRT lanes, utility tunnels) years before tenders close. Cap-ex is driven by route length, tunnel percentage, elevated share, number of stations, terrain class, country, and year of approval. These factors are strongly collinear (tunnel share with terrain, approval year with the unit-cost trend), which makes ordinary least-squares coefficients unstable, while a pure Lasso (ℓ₁) penalty tends to keep one variable from a correlated group and drop the rest. ElasticNet combines the Ridge (ℓ₂) and Lasso (ℓ₁) penalties to give a model that is stable yet sparse. It forecasts project costs from early schematics, guiding feasibility studies and funding requests.
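The instability claim can be illustrated with a small synthetic experiment. This is only a sketch: the two predictors `tunnel` and `terrain` are invented stand-ins (not columns from the dataset), and the penalty settings are arbitrary. It refits ordinary least squares and ElasticNet on bootstrap resamples and compares how much the coefficients move.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic, invented data: two almost identical predictors
tunnel = rng.normal(size=n)
terrain = tunnel + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([tunnel, terrain])
y_demo = 3.0 * tunnel + rng.normal(scale=0.5, size=n)

def coef_spread(make_model, n_boot=50):
    """Standard deviation of each coefficient across bootstrap refits."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        coefs.append(make_model().fit(X_demo[idx], y_demo[idx]).coef_)
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression)
enet_spread = coef_spread(lambda: ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient std       :", ols_spread)
print("ElasticNet coefficient std:", enet_spread)
```

On data like this the OLS coefficients typically swing widely between resamples, while the penalised coefficients stay close together. That is the behaviour the article relies on when it chooses ElasticNet.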
Libraries Required
| Purpose | Library |
|---|---|
| Data manipulation | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| ML pipeline | scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | scikit-learn: mean_squared_error, r2_score |
Dataset
Rail Transport Infrastructure Costs
Step-by-Step Code Implementation
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
Load Dataset
df = pd.read_csv("rail_transport_projects.csv") # file inside Kaggle zip
Feature Engineering
# Remove rows with missing essentials
df = df.dropna(subset=['Cost_USD_millions','Length_km'])
# Target: cost normalised per km
df['Cost_per_km'] = df['Cost_USD_millions'] / df['Length_km']
y = df['Cost_per_km']
# Predictors
X = df[['Country','Year','Mode','Length_km',
        'Tunnel_%','Elevated_%','Stations','Terrain']]
cat_cols = ['Country','Mode','Terrain']
num_cols = [c for c in X.columns if c not in cat_cols]
Build an ElasticNet Pipeline
Wrapping one-hot encoding and z-scaling in a Pipeline means both are re-fitted on the training portion of every CV fold, so no statistics leak in from the validation rows.
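preprocess = ColumnTransformer([
    # sparse_output replaces the old `sparse` argument (renamed in scikit-learn 1.2);
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(drop='first', sparse_output=False,
                          handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])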
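The no-leakage property can be checked on a toy frame. The sketch below uses invented values and only two of the article's column names (`Terrain`, `Length_km`). It fits a pipeline of the same shape on a subset of rows and confirms that the scaler's mean comes from those rows alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names mirror the article, values are invented
toy = pd.DataFrame({
    "Terrain":   ["flat", "hilly", "flat", "hilly", "flat", "hilly"],
    "Length_km": [5.0, 10.0, 15.0, 20.0, 25.0, 90.0],
})
toy_y = np.array([50.0, 120.0, 60.0, 150.0, 70.0, 400.0])

toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Terrain"]),
        ("num", StandardScaler(), ["Length_km"]),
    ])),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5)),
])

# Pretend the last two rows form a validation fold and fit on the first four
train_rows = toy.index[:4]
toy_pipe.fit(toy.loc[train_rows], toy_y[:4])

fitted_mean = toy_pipe.named_steps["prep"].named_transformers_["num"].mean_[0]
train_mean = toy.loc[train_rows, "Length_km"].mean()
full_mean = toy["Length_km"].mean()

print("scaler mean learned in pipeline:", fitted_mean)   # 12.5
print("mean of training rows only     :", train_mean)    # 12.5
print("mean of all rows (would leak)  :", full_mean)     # 27.5
```

Scaling the whole table before splitting would have used 27.5, letting the long 90 km validation project influence the training features. Inside GridSearchCV the same refit happens once per fold.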
Train/Test Split & Hyper‑Parameter Tuning
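ElasticNet has two hyper-parameters to tune:
- α (alpha) sets the overall penalty strength and shrinks all coefficients together.
- l1_ratio slides between Ridge (for handling collinearity) and Lasso (for feature selection).
The grid covers 18 α values × 9 l1_ratio values = 162 combinations, each scored by 5-fold cross-validation to minimise RMSE.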
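For reference, the objective that scikit-learn documents for `ElasticNet` shows how the two hyper-parameters interact. The symbols here are notation chosen for this note: \(w\) is the coefficient vector, \(n\) the number of training rows, and \(\rho\) stands for `l1_ratio`.

```latex
\min_{w}\;\frac{1}{2n}\,\lVert y - Xw\rVert_2^2
\;+\;\alpha\,\rho\,\lVert w\rVert_1
\;+\;\frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
```

With \(\rho = 1\) only the ℓ₁ term remains (Lasso). As \(\rho\) approaches 0 the penalty becomes purely ℓ₂ (Ridge). The grid used here stays between 0.1 and 0.9, so every candidate mixes both penalties.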
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),    # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)   # Ridge-heavy → Lasso-heavy
}
gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)
print("Best α:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
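The grid size quoted above can be verified without fitting anything. This short sketch rebuilds the same `param_grid` and counts the candidates with scikit-learn's `ParameterGrid`. The fit count assumes `cv=5` and excludes the final refit on the full training set.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "enet__alpha": np.logspace(-3, 1, 18),        # 18 penalty strengths
    "enet__l1_ratio": np.linspace(0.1, 0.9, 9),   # 9 mixing ratios
}

n_candidates = len(ParameterGrid(param_grid))     # every alpha paired with every l1_ratio
n_fits = n_candidates * 5                         # one fit per candidate per CV fold

print(n_candidates, n_fits)                       # 162 810
```

Because `scoring='neg_root_mean_squared_error'` is negated, `-gs.best_score_` gives the mean cross-validated RMSE of the winning combination in the target's units.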
Evaluate Model
y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # works on any version; squared=False is deprecated since scikit-learn 1.4
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.1f} million/km | R² = {r2:.3f}")
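An RMSE figure is easier to judge against a naive baseline. The sketch below is an addition rather than part of the original workflow, and the helper names `rmse_of` and `baseline_rmse` are hypothetical. It compares a model's RMSE with that of always predicting the training-set mean, using invented numbers.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def rmse_of(y_true, y_hat):
    """Root mean squared error, computed in a version-independent way."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))

def baseline_rmse(y_train, y_test):
    """RMSE of a model that always predicts the training-set mean."""
    dummy = DummyRegressor(strategy="mean")
    dummy.fit(np.zeros((len(y_train), 1)), y_train)   # features are ignored by the dummy
    return rmse_of(y_test, dummy.predict(np.zeros((len(y_test), 1))))

# Invented numbers, in USD million per km
y_train_demo = np.array([100.0, 200.0, 300.0])
y_test_demo = np.array([150.0, 250.0])
y_pred_demo = np.array([160.0, 240.0])

model_rmse = rmse_of(y_test_demo, y_pred_demo)
naive_rmse = baseline_rmse(y_train_demo, y_test_demo)
print(f"model RMSE = {model_rmse:.1f}, mean-baseline RMSE = {naive_rmse:.1f}")
```

In the real workflow the same two helpers could be called as `rmse_of(y_test, y_pred)` and `baseline_rmse(y_train, y_test)`. The fitted pipeline only earns its complexity if its RMSE is clearly below the baseline's.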
Interpret Coefficients
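Interpretation: after back-transformation, each numeric coefficient is the change in cost per km for a one-unit change in that feature, and each dummy coefficient is the premium relative to the dropped reference category. The coefficients quantify the cost levers: every +10 percentage points of tunnel share adds ≈ $44 M/km; elevated sections add $18 M/km; specific steep-terrain dummies command premiums.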
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feat_names = np.hstack([ohe_names, num_cols])
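# Back-transform numeric coefficients from per-standard-deviation to original units
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
# Copy first: coef_ is the fitted model's own array, and editing it in place would alter later predictions
coef = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scales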
(pd.Series(coef, index=feat_names)
   .sort_values(key=abs, ascending=False)
   .head(15)
   .plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.title('Elastic Net – Top Cost Drivers')
plt.xlabel('Δ Cost (USD million/km)')
plt.tight_layout()
plt.show()
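The back-transformation step can be sanity-checked on data with a known slope. In this sketch the numbers are invented so that cost rises by exactly 4.4 per tunnel percentage point, echoing the ≈ $44 M/km per +10 points quoted above. It swaps in an unpenalised `LinearRegression` so the slope is recovered exactly, since ElasticNet would shrink it slightly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Invented data: cost (USD million/km) rises 4.4 per tunnel percentage point
tunnel_pct = np.array([[0.0], [10.0], [20.0], [40.0], [80.0]])
cost = 30.0 + 4.4 * tunnel_pct.ravel()

scaler = StandardScaler().fit(tunnel_pct)
model = LinearRegression().fit(scaler.transform(tunnel_pct), cost)

# Coefficient on the z-scaled feature: cost change per one standard deviation
coef_scaled = model.coef_.copy()          # copy so the fitted model is left untouched

# Divide by the scaler's standard deviation to get cost change per percentage point
coef_per_point = coef_scaled / scaler.scale_

print("per standard deviation  :", coef_scaled[0])
print("per percentage point    :", coef_per_point[0])
print("per +10 percentage points:", 10 * coef_per_point[0])
```

Dividing by `scale_` is what turns a per-standard-deviation effect into a per-unit effect that planners can read directly. One-hot dummy coefficients need no such adjustment because those columns are not scaled.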
Summary
This ElasticNet workflow produces a transparent cost‑per‑km predictor for urban infrastructure corridors that:
- Forecasts capital cost per km from early-stage inputs, with accuracy reported as hold-out RMSE and R².
- Balances multicollinearity & sparsity, keeping correlated scope metrics while pruning noise.
- Surfaces actionable levers (tunnel%, station count, terrain) so planners can tweak alternatives and stay on budget.