Urban Infrastructure Cost Prediction with ElasticNet Algorithm in ML

City engineers must quote the capital cost (USD million per km) of new urban-infrastructure corridors, such as light-rail extensions, BRT lanes, and utility tunnels, years before tenders close. Cap-ex is driven by route length, tunnel percentage, elevated share, number of stations, terrain class, country, and year of approval. These factors are highly collinear (tunnel % ↔ terrain; year ↔ unit-cost trend), which makes ordinary least-squares coefficients unstable, while a pure Lasso (ℓ1) penalty tends to keep one variable from each correlated group and drop the rest. ElasticNet combines the Ridge (ℓ2) and Lasso (ℓ1) penalties to deliver a stable yet sparse model that forecasts project costs from early schematics, guiding feasibility studies and funding requests.
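The collinearity argument can be checked on synthetic data. The sketch below is illustrative only: the two features, the coefficients, and the noise level are made up and do not come from the rail dataset. It refits ordinary least squares and ElasticNet on repeated noise draws and compares how much the coefficients move between refits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                    # stand-in for a standardised tunnel share
x2 = x1 + rng.normal(scale=0.01, size=n)   # near-duplicate, e.g. a terrain score
X_toy = np.column_stack([x1, x2])

def coef_spread(make_model, n_repeats=50):
    """Refit on fresh noise draws and return the std of each coefficient."""
    coefs = []
    for _ in range(n_repeats):
        y_toy = 3 * x1 + 3 * x2 + rng.normal(scale=0.5, size=n)
        coefs.append(make_model().fit(X_toy, y_toy).coef_)
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression)
enet_spread = coef_spread(lambda: ElasticNet(alpha=0.1, l1_ratio=0.5))

print("OLS coefficient std:       ", ols_spread.round(3))
print("ElasticNet coefficient std:", enet_spread.round(3))
```

On data like this the OLS coefficients swing widely from one refit to the next, while the ElasticNet coefficients stay close together, which is the behaviour the pipeline in this article relies on.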

Libraries Required

Purpose	Library
Data manipulation	pandas, numpy
Visualisation	matplotlib, seaborn
ML pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics	mean_squared_error, r2_score

Dataset

Rail Transport Infrastructure Costs

Step-by-Step Code Implementation

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

Load Dataset

df = pd.read_csv("rail_transport_projects.csv")   # file inside Kaggle zip

Feature Engineering

# Remove rows with missing essentials, and zero-length rows that would divide by zero
df = df.dropna(subset=['Cost_USD_millions','Length_km'])
df = df[df['Length_km'] > 0]

# Target: cost normalised per km
df['Cost_per_km'] = df['Cost_USD_millions'] / df['Length_km']
y = df['Cost_per_km']

# Predictors
X = df[['Country','Year','Mode','Length_km',
        'Tunnel_%','Elevated_%','Stations','Terrain']]

cat_cols = ['Country','Mode','Terrain']
num_cols = [c for c in X.columns if c not in cat_cols]

Build an ElasticNet Pipeline

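Wrapping the preprocessing in a Pipeline means one-hot encoding and z-scaling are re-fitted on the training portion of every CV fold, so no statistics leak from the validation data.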

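preprocess = ColumnTransformer([
    # sparse_output replaces the removed `sparse` argument (scikit-learn >= 1.2);
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(drop='first', sparse_output=False,
                          handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])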

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])
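The feature-engineering step only drops rows that lack cost or length, and ElasticNet does not accept NaN inputs. If other columns turn out to have gaps (an assumption; the article does not say whether they do), one option is to impute inside the pipeline so the fill values are also learned per CV fold. The sketch below uses a tiny made-up frame and hypothetical names (`toy`, `pipe_with_impute`), and median / most-frequent imputation is one choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame (values are made up) with gaps in both column types
toy = pd.DataFrame({
    'Terrain':  ['flat', 'hilly', np.nan, 'flat', 'hilly', 'flat'],
    'Tunnel_%': [0.0, 40.0, 25.0, np.nan, 60.0, 10.0],
    'Stations': [8, 12, np.nan, 5, 15, 7],
})
toy_y = pd.Series([30.0, 120.0, 90.0, 45.0, 160.0, 50.0])

# Numeric branch: fill gaps with the median, then z-scale
num_branch = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# Categorical branch: fill gaps with the most frequent label, then one-hot encode
cat_branch = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

prep_with_impute = ColumnTransformer([
    ('cat', cat_branch, ['Terrain']),
    ('num', num_branch, ['Tunnel_%', 'Stations']),
])
pipe_with_impute = Pipeline([
    ('prep', prep_with_impute),
    ('enet', ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000)),
])

pipe_with_impute.fit(toy, toy_y)
toy_pred = pipe_with_impute.predict(toy)
print(toy_pred.round(1))
```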

Train/Test Split & Hyper‑Parameter Tuning

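ElasticNet has two hyper-parameters to tune:

α sets the overall penalty strength; larger values shrink all coefficients harder.
l1_ratio slides between Ridge (values near 0, for handling collinearity) and Lasso (values near 1, for feature selection).

The grid has 18 × 9 = 162 combinations. Each one is 5-fold cross-validated (810 fits in total), and the combination with the lowest RMSE is kept.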
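For reference, the objective that scikit-learn's ElasticNet minimises can be written as follows, with n the number of training rows, w the coefficient vector, and ρ standing for l1_ratio (the symbol ρ is introduced here only for brevity):

```latex
\min_{w}\;
\frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 leaves only the ℓ1 term (Lasso), and ρ = 0 leaves only the ℓ2 term (Ridge); the grid in this section stays between 0.1 and 0.9, so both penalties are always active.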

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)

print("Best α:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
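Beyond the single best pair, it can help to see how RMSE varies across the whole grid, for example to check whether the optimum sits on the grid's edge. The sketch below is a self-contained illustration on synthetic data from `make_regression` with a smaller grid; `demo_gs` and `rmse_table` are hypothetical names, and the same pivot should apply to the `gs` object above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rail data
X_demo, y_demo = make_regression(n_samples=120, n_features=6,
                                 noise=10.0, random_state=0)

demo_pipe = Pipeline([('scale', StandardScaler()),
                      ('enet', ElasticNet(max_iter=20000))])
demo_grid = {'enet__alpha': np.logspace(-3, 1, 5),
             'enet__l1_ratio': [0.1, 0.5, 0.9]}
demo_gs = GridSearchCV(demo_pipe, demo_grid, cv=5,
                       scoring='neg_root_mean_squared_error').fit(X_demo, y_demo)

# cv_results_ stores the negated RMSE; flip the sign and lay it out as alpha x l1_ratio
results = pd.DataFrame(demo_gs.cv_results_)
rmse_table = (results
              .assign(rmse=-results['mean_test_score'])
              .pivot(index='param_enet__alpha',
                     columns='param_enet__l1_ratio',
                     values='rmse'))
print(rmse_table.round(2))
```

If the lowest value lies in the first or last row or column of the table, widening the corresponding range is usually worth a try.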

Evaluate Model

y_pred = gs.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # version-safe: the squared=False argument was removed in scikit-learn 1.6
r2   = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.1f} million/km | R² = {r2:.3f}")
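An RMSE figure is easier to judge next to a naive baseline. The sketch below is an illustration on synthetic data, not the rail dataset: it compares ElasticNet with a `DummyRegressor` that always predicts the training mean. The names `holdout_rmse` and `skill` are hypothetical helpers introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rail data
X_b, y_b = make_regression(n_samples=200, n_features=5,
                           noise=15.0, random_state=1)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_b, y_b, test_size=0.2, random_state=42)

def holdout_rmse(model):
    """Fit on the training split and return RMSE on the hold-out split."""
    model.fit(Xb_train, yb_train)
    return float(np.sqrt(mean_squared_error(yb_test, model.predict(Xb_test))))

baseline_rmse = holdout_rmse(DummyRegressor(strategy='mean'))
model_rmse = holdout_rmse(ElasticNet(alpha=0.1, l1_ratio=0.5))

# Fraction of the baseline error removed by the model (0 = no better than the mean)
skill = 1 - model_rmse / baseline_rmse
print(f"Baseline RMSE: {baseline_rmse:.1f} | ElasticNet RMSE: {model_rmse:.1f} | skill: {skill:.2f}")
```

With the real pipeline, the same comparison would fit the dummy model on `X_train, y_train` and score it on `X_test, y_test`.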

Interpret Coefficients

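Interpretation: the coefficients quantify cost levers. In this fit, every +10 % tunnel share adds ≈ $44 M/km, elevated sections add $18 M/km, and specific steep-terrain dummies command premiums. After the back-transform in the code below, each numeric coefficient is the change in cost per km for a one-unit change in the original feature, and each one-hot coefficient is a premium relative to the dropped baseline category.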

ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feat_names = np.hstack([ohe_names, num_cols])

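# Back-transform numeric coefficients to original units (per 1 unit of each feature).
# Copy first: coef_ is the fitted model's own array, and editing it in place
# would silently change every later call to gs.predict().
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef   = gs.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scales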

(pd.Series(coef, index=feat_names)
   .sort_values(key=abs, ascending=False)
   .head(15)
   .plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.title('Elastic Net – Top Cost Drivers')
plt.xlabel('Δ Cost (USD million/km)')
plt.tight_layout()
plt.show()
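The division by the scaler's `scale_` works because a model fitted on z-scores, (x − mean) / scale, has a per-original-unit slope of coefficient / scale. The sketch below checks this on made-up data. It uses an unpenalised `LinearRegression` so that the comparison with a fit on the raw features is exact; with ElasticNet the conversion is the same, but a penalised fit on raw features would not match.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = np.column_stack([rng.uniform(0, 100, 50),    # stand-in for a percentage feature
                       rng.integers(2, 30, 50)])   # stand-in for a station count
target = 1.5 * raw[:, 0] + 4.0 * raw[:, 1] + rng.normal(scale=2.0, size=50)

scaler = StandardScaler().fit(raw)
fit_scaled = LinearRegression().fit(scaler.transform(raw), target)
fit_raw = LinearRegression().fit(raw, target)

# Slope per original unit, and the matching intercept
coef_back = fit_scaled.coef_ / scaler.scale_
intercept_back = fit_scaled.intercept_ - np.sum(coef_back * scaler.mean_)

print("Back-transformed:", coef_back.round(3), round(float(intercept_back), 3))
print("Raw-feature fit: ", fit_raw.coef_.round(3), round(float(fit_raw.intercept_), 3))
```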

Summary

This ElasticNet workflow produces a transparent cost‑per‑km predictor for urban infrastructure corridors that:

Forecasts capital cost per km from early-design inputs, with accuracy reported as hold-out RMSE and R².
Balances multicollinearity & sparsity, keeping correlated scope metrics while pruning noise.
Surfaces actionable levers (tunnel%, station count, terrain) so planners can tweak alternatives and stay on budget.
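A what-if comparison of two design alternatives might look like the sketch below. Everything in it is an assumption for illustration: the training frame is synthetic, the cost rule is invented, and only three of the article's columns are used. With the real workflow, `gs.predict` would take the place of `model.predict`, and the alternatives frame would need all eight predictor columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(7)
n = 150
train = pd.DataFrame({
    'Terrain':  rng.choice(['flat', 'hilly'], size=n),
    'Tunnel_%': rng.uniform(0, 100, size=n),
    'Stations': rng.integers(3, 25, size=n),
})
# Invented cost rule (USD million/km), used only to generate training targets
cost = (40 + 1.2 * train['Tunnel_%'] + 2.0 * train['Stations']
        + 25 * (train['Terrain'] == 'hilly') + rng.normal(scale=10, size=n))

model = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(drop='first'), ['Terrain']),
        ('num', StandardScaler(), ['Tunnel_%', 'Stations']),
    ])),
    ('enet', ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=20000)),
]).fit(train, cost)

# Two alternatives for the same corridor that differ only in tunnel share
alternatives = pd.DataFrame([
    {'Terrain': 'hilly', 'Tunnel_%': 60.0, 'Stations': 12},
    {'Terrain': 'hilly', 'Tunnel_%': 20.0, 'Stations': 12},
], index=['mostly tunnelled', 'mostly at grade'])

estimates = pd.Series(model.predict(alternatives), index=alternatives.index)
saving = estimates['mostly tunnelled'] - estimates['mostly at grade']
print(estimates.round(1))
print(f"Estimated saving from less tunnelling: {saving:.1f} USD million/km")
```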