Regularization in Python

harish reddy
Nov 7, 2019

Regularization helps to solve the overfitting problem in machine learning. A model that is too simple will generalize the data poorly, while a model that is too complex may perform badly on test data because it overfits. We need to choose the right model complexity, somewhere between simple and complex. Regularization helps us reach the preferred model complexity, so that the model predicts better. Regularization is nothing but adding a penalty term to the objective function and controlling the model complexity using that penalty term. It can be used with many machine learning algorithms.

Why Regularization

  1. Overfitting
  2. Over-fitting with linear models
  3. Regularization of linear models
  4. Regularized regression in scikit-learn
  5. Comparing regularized linear models with unregularized linear models

Part 1: Overfitting

What is overfitting?

  • Building a model that matches the training data “too closely”.
  • Learning from the error/disturbance/noise in the data, rather than just the true values/signal.

How does over-fitting occur?

  • Evaluating a model by testing it on the same data that was used to train it.
  • Creating a model that is “too complex”.

What is the impact of over-fitting?

  • Model will do well on the training data, but won’t generalize to out-of-sample (test) data.
  • Model will have low bias, but high variance.

Part 2: Over-fitting with linear models

What are the general characteristics of linear models?

  • Low model complexity
  • High bias, low variance
  • Generally does not tend to over-fit

However, there is always a chance of over-fitting, and it can still occur with linear models if you allow them to have high variance.
Some common causes are:

Cause 1: Irrelevant features

Linear models can over-fit if we include “irrelevant features”, meaning features that are unrelated to the response. Why?

Because the model will learn a coefficient for every feature you include, regardless of whether that feature carries signal or just noise.

This is especially a problem when p (number of features) is close to n (number of observations), because that model will naturally have high variance.
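As a quick illustration, here is a minimal sketch (using synthetic pure-noise data of my own, not from this article) showing how an unregularized linear model with p close to n scores well on its own training data but fails on held-out data:

# minimal sketch with synthetic noise data: p close to n leads to high variance
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_noise = rng.randn(80, 50)   # 80 observations, 50 irrelevant (pure noise) features
y_noise = rng.randn(80)       # response unrelated to the features

Xn_train, Xn_test, yn_train, yn_test = train_test_split(X_noise, y_noise, test_size=0.3, random_state=0)
lr = LinearRegression().fit(Xn_train, yn_train)
print("train R^2:", lr.score(Xn_train, yn_train))   # very high: the model memorizes the noise
print("test R^2 :", lr.score(Xn_test, yn_test))     # typically negative: no generalization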

Cause 2: Correlated features(Multicollinearity)

Linear models can over-fit if the included features are highly correlated with one another. Why?

Linear regression uses the OLS (Ordinary Least Squares) method, which relies on certain assumptions. From the scikit-learn documentation:

“Coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance.”
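This instability is easy to reproduce. Below is a minimal sketch (toy data of my own, not from the article) in which two nearly identical columns lead to coefficient estimates that swing wildly while their sum stays stable:

# minimal sketch with toy data: near-duplicate columns make OLS coefficients unstable
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 0.001 * rng.randn(100)          # almost an exact copy of x1
y_toy = 3 * x1 + 0.1 * rng.randn(100)

for seed in range(3):
    noise = 0.1 * np.random.RandomState(seed).randn(100)
    coefs = LinearRegression().fit(np.column_stack([x1, x2]), y_toy + noise).coef_
    print(coefs, "sum:", coefs.sum())     # the individual coefficients vary wildly, the sum stays near 3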

Cause 3: Large coefficients

Linear models can over-fit if the coefficients (after feature standardization) are too large. Why?

Because the larger the absolute value of the coefficient, the more power it has to change the predicted response, resulting in a higher variance.

Part 3: Regularization of linear models

  • Regularization is a method for “constraining” or “regularizing” the size of the coefficients, thus “shrinking” them towards zero.
  • It reduces model variance and thus minimizes overfitting.
  • If the original model is too complex, regularization tends to reduce variance more than it increases bias, resulting in a model that is more likely to generalize.

Our aim is to locate the optimum model complexity, and thus regularization is useful when we believe our model is too complex.

How does regularization work?

For a normal linear regression model, we estimate the coefficients using the least squares criterion, which minimizes the residual sum of squares (RSS):
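$$\mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

(Here n is the number of observations, p the number of features, and β₀, …, βₚ the model coefficients.)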

For a regularized linear regression model, we minimize the sum of RSS and a “penalty term” that penalizes coefficient size.
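Concretely, ridge regression (an L2 penalty) and lasso regression (an L1 penalty) minimize:

$$\text{Ridge:}\;\; \mathrm{RSS} + \alpha\sum_{j=1}^{p}\beta_j^{2} \qquad\qquad \text{Lasso:}\;\; \mathrm{RSS} + \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert$$

where α ≥ 0 is a tuning parameter: a larger α places more weight on the penalty and shrinks the coefficients more.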

Lasso and Ridge path diagrams

A larger alpha (towards the left of each diagram) results in more regularization:

  • Lasso regression shrinks coefficients all the way to zero, thus removing them from the model
  • Ridge regression shrinks coefficients toward zero, but they rarely reach zero

Source code for the diagrams: Lasso regression and Ridge regression
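The linked diagrams and their source code are not reproduced here; the sketch below (using scikit-learn's built-in diabetes dataset rather than the data from the linked code) traces the same kind of coefficient paths for ridge and lasso:

# minimal sketch of ridge and lasso coefficient paths on the diabetes dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, lasso_path
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)
X_d = StandardScaler().fit_transform(X_d)
alphas = np.logspace(-2, 3, 50)

# ridge: coefficients shrink toward zero as alpha grows, but rarely reach zero
ridge_coefs = np.array([Ridge(alpha=a).fit(X_d, y_d).coef_ for a in alphas])
# lasso: coefficients are driven exactly to zero as alpha grows
lasso_alphas, lasso_coefs, _ = lasso_path(X_d, y_d, alphas=alphas)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(alphas, ridge_coefs)
ax1.set_xscale('log'); ax1.set_title('Ridge path'); ax1.set_xlabel('alpha'); ax1.set_ylabel('coefficient')
ax2.plot(lasso_alphas, lasso_coefs.T)
ax2.set_xscale('log'); ax2.set_title('Lasso path'); ax2.set_xlabel('alpha'); ax2.set_ylabel('coefficient')
plt.show()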

How should we choose between Lasso regression and Ridge regression?

  • Lasso regression is preferred if we believe many features are irrelevant or if we prefer a sparse model.
  • If model performance is your primary concern, it is best to try both.
  • Elastic-Net regression is a combination of lasso regression and Ridge regression (see the sketch below).
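Here is a minimal Elastic-Net sketch (the data is synthetic, and the alpha and l1_ratio values are illustrative choices of mine, not tuned):

# minimal sketch: Elastic-Net combines the L1 (lasso) and L2 (ridge) penalties
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(1)
X_demo = rng.randn(50, 10)
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.randn(50)

# l1_ratio=1.0 gives a pure lasso penalty, l1_ratio=0.0 a pure ridge penalty
enet = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X_demo, y_demo)
print(enet.coef_)   # some coefficients are shrunk, others are set exactly to zero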

Should features be standardized?

  • Yes, because otherwise features would be penalized simply because of their scale (see the pipeline sketch below).
  • Also, standardizing avoids penalizing the intercept, which wouldn’t make intuitive sense.
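In recent scikit-learn versions the normalize option used in the code below is no longer available; a common alternative is to standardize inside a pipeline. A minimal sketch:

# minimal sketch: standardize the features before the penalty is applied
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# model.fit(X_train, y_train) then behaves like any other scikit-learn estimator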

Visualizing regularization

Below is a visualization of what happens when you apply regularization. The general idea is that you are restricting the allowed values of your coefficients to a certain “region”. Within that region, you want to find the coefficients that result in the best model.

In this diagram:

  • We are fitting a linear regression model with two features, 𝑥1 and 𝑥2.
  • 𝛽̂ represents the set of two coefficients, 𝛽1 and 𝛽2, which minimize the RSS for the unregularized model.
  • Regularization restricts the allowed positions of 𝛽̂ to the blue constraint region:
  • For lasso, this region is a diamond because it constrains the absolute value of the coefficients.
  • For ridge, this region is a circle because it constrains the square of the coefficients.
  • The size of the blue region is determined by 𝛼, with a smaller 𝛼 resulting in a larger region:
  • When 𝛼 is zero, the blue region is infinitely large, and thus the coefficient sizes are not constrained.
  • When 𝛼 increases, the blue region gets smaller and smaller.

In this case, 𝛽̂ is not within the blue constraint region. Thus, we need to move 𝛽̂ until it intersects the blue region, while increasing the RSS as little as possible.

Part 4: Regularized regression in scikit-learn

  • Communities and Crime dataset from the UCI Machine Learning Repository: data, data dictionary
  • Goal: Predict the violent crime rate for a community given socioeconomic and law enforcement data

Load and prepare the crime dataset

# Importing necessary packages and functions required
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Reading the dataset
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data'
crime = pd.read_csv(url, header=None, na_values=['?'])
crime.head()
Output: the first five rows of the data.
crime[127].describe()
Output:
count 1994.000000
mean 0.237979
std 0.232985
min 0.000000
25% 0.070000
50% 0.150000
75% 0.330000
max 1.000000
Name: 127, dtype: float64
# plotting the data for missing values
plt.figure(figsize=(20, 6))
sns.heatmap(crime.isnull(),yticklabels=False,cbar=False,cmap='viridis')
We can see there are a lot of missing values in several columns.
# dropping the first five non-predictive columns
crime.drop([0, 1, 2, 3, 4], axis=1, inplace=True)
# removing rows with missing values
crime.dropna(inplace=True)
plt.figure(figsize=(20, 6))
sns.heatmap(crime.isnull(),yticklabels=False,cbar=False,cmap='viridis')
We see there are no missing values now.
crime.shape
Output:
(319, 123)
# defining X and y
X = crime.drop(127, axis=1)
y = crime[127]

Splitting the data for training and testing

# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size = 0.3, random_state=1)

Linear regression

# build a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
Output:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
# examine the intercept and coefficients
print("INTERCEPT : ", linreg.intercept_)
print("COEFFICIENT : ", linreg.coef_)
Output:
INTERCEPT : 0.9922125287583894
COEFFICIENT : [-3.93018330e+00 7.31324716e-01 -3.01181421e-01 -2.96634778e-01
-1.83170801e-01 2.81575284e-01 -1.48744636e+00 -4.84695533e-01
1.24104896e+00 -5.32282788e-01 4.64331123e+00 -1.17079618e-01
1.04229108e+00 1.36950901e-01 -3.12434116e-01 -1.16489196e+00....
y_pred = linreg.predict(X_test)
# calculate R^2 value, MAE, MSE, RMSE

from sklearn.metrics import r2_score
from sklearn import metrics
print("R-Square Value",r2_score(y_test,y_pred))
print("\n")
print ("mean_absolute_error :",metrics.mean_absolute_error(y_test, y_pred))
print("\n")
print ("mean_squared_error : ",metrics.mean_squared_error(y_test, y_pred))
print("\n")
print ("root_mean_squared_error : ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
R-Square Value 0.19704456295409056


mean_absolute_error : 0.16725155824888352


mean_squared_error : 0.04986345422693408


root_mean_squared_error : 0.22330126337961922

MSE is more popular than MAE because MSE penalizes larger errors more heavily. RMSE is even better than MSE because RMSE is interpretable in the "y" units.
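A tiny toy calculation with made-up residuals illustrates the point: one large error inflates MSE far more than MAE, and taking the square root puts RMSE back on the scale of y.

# toy illustration with made-up residuals
import numpy as np
errors = np.array([0.1, 0.1, 0.1, 1.0])        # three small errors and one large one
print("MAE :", np.mean(np.abs(errors)))        # 0.325
print("MSE :", np.mean(errors ** 2))           # 0.2575 (dominated by the large error)
print("RMSE:", np.sqrt(np.mean(errors ** 2)))  # ~0.507, in the same units as y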

Ridge regression

  • Ridge documentation
  • alpha: must be positive, increase for more regularization
  • normalize: scales the features (without using StandardScaler)
# alpha=0 is equivalent to linear regression
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=0, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
# calculate R^2 value, MAE, MSE, RMSE

from sklearn import metrics
print("R-Square Value",r2_score(y_test,y_pred))
print("\n")
print ("mean_absolute_error :",metrics.mean_absolute_error(y_test, y_pred))
print("\n")
print ("mean_squared_error : ",metrics.mean_squared_error(y_test, y_pred))
print("\n")
print ("root_mean_squared_error : ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
R-Square Value 0.19704456295863382


mean_absolute_error : 0.16725155824816149


mean_squared_error : 0.04986345422665194


root_mean_squared_error : 0.2233012633789875
# try alpha=0.1
ridgereg = Ridge(alpha=0.1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)

# calculate R^2 value, MAE, MSE, RMSE

from sklearn import metrics
print("R-Square Value",r2_score(y_test,y_pred))
print("\n")
print ("mean_absolute_error :",metrics.mean_absolute_error(y_test, y_pred))
print("\n")
print ("mean_squared_error : ",metrics.mean_squared_error(y_test, y_pred))
print("\n")
print ("root_mean_squared_error : ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
R-Square Value 0.5347697501566349


mean_absolute_error : 0.12769772972161883


mean_squared_error : 0.028890753082631383


root_mean_squared_error : 0.16997280100837128
# examining the coefficients
print(ridgereg.coef_)
Output:
[-5.77226675e-03 2.26721774e-02 4.98857382e-02 -6.70174168e-02
-1.83566112e-02 5.26888536e-02 1.17689929e-02 -5.72468914e-02
1.52761058e-03 4.29131248e-02 1.04586550e-04 -1.85621890e-02....
  • RidgeCV: ridge regression with built-in cross-validation of the alpha parameter
  • alphas: array of alpha values to try
# create an array of alpha values
alpha_range = 10.**np.arange(-2, 3)
alpha_range
Output:
array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])
# select the best alpha with RidgeCV
from sklearn.linear_model import RidgeCV
ridgeregcv = RidgeCV(alphas=alpha_range, normalize=True, scoring='neg_mean_squared_error')
ridgeregcv.fit(X_train, y_train)
ridgeregcv.alpha_
Output:
1.0
# predict method uses the best alpha value
y_pred = ridgeregcv.predict(X_test)
# calculate R^2 value, MAE, MSE, RMSE

from sklearn import metrics
print("R-Square Value",r2_score(y_test,y_pred))
print("\n")
print ("mean_absolute_error :",metrics.mean_absolute_error(y_test, y_pred))
print("\n")
print ("mean_squared_error : ",metrics.mean_squared_error(y_test, y_pred))
print("\n")
print ("root_mean_squared_error : ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
R-Square Value 0.5318623469518291


mean_absolute_error : 0.13256644483823624


mean_squared_error : 0.02907130253772373


root_mean_squared_error : 0.17050308659295213

Lasso regression

  • Lasso documentation
  • alpha: must be positive, increase for more regularization
  • normalize: scales the features (without using StandardScaler)
# try alpha=0.001 and examine coefficients
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.001, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
Output:
[ 0. 0. 0. -0.25339884 0. 0.
0. -0. -0. 0. 0. 0.
-0. -0. -0. -0.17865705 0. 0.
-0. -0. -0. -0. -0. -0.02230294
-0. 0. 0. 0. 0.0998841 -0.
0. -0. 0.01893786 -0. -0.03169217 0.
0. -0. 0.11479343 0. 0. 0.
0. -0.16845012 -0.27294066 -0. -0. -0.
-0. 0. 0. 0. 0. -0.
0. 0. 0. 0. 0. 0.
-0. 0. 0. 0. 0. 0.
0. -0. 0. 0.02709397 -0. 0.
-0. -0. 0. 0. 0. 0.
0. -0. -0. -0. -0. -0.
-0. -0. 0. -0. -0. 0.00233805
0.15404259 0. -0. -0. 0. -0.
0. 0. -0. 0. 0. 0.
0.03385823 0. -0.0136048 -0. 0. 0.
0.01441679 0. 0. 0. -0. 0.
-0. -0. 0.04851355 0. -0. 0.0220025
-0. 0. ]
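Most coefficients are exactly zero. A quick way to count how many features the lasso keeps (a small addition, not in the original notebook):

# count the non-zero coefficients retained by the lasso (alpha=0.001)
print("non-zero coefficients:", np.sum(lassoreg.coef_ != 0))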
# try alpha=0.01 and examine coefficients
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
Output:
[ 0. 0. 0. -0.04214088 0. 0.
0. 0. 0. -0. 0. 0.
-0. -0. -0. -0. -0. 0.
-0. -0. -0. -0. -0. -0.
-0. -0. -0. 0. 0. 0.
0. -0. 0. -0. -0. 0.
0. -0. 0. 0. 0. 0.
0. -0. -0.29715868 -0. -0. -0.
-0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
-0. 0. 0. 0. 0. 0.
0. -0. 0. 0. -0. 0.
-0. -0. 0. 0. -0. 0.
0. -0. -0. -0. -0. -0.
-0. -0. 0. -0. -0. 0.
0. 0. 0. -0. 0. 0.
0. 0. -0. 0. 0. 0.
0. 0. -0. -0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. -0. 0.
-0. 0. ]
# predict with alpha=0.01
y_pred = lassoreg.predict(X_test)
# calculate R^2 value, MAE, MSE, RMSE

from sklearn import metrics
print("R-Square Value",r2_score(y_test,y_pred))
print("\n")
print ("mean_absolute_error :",metrics.mean_absolute_error(y_test, y_pred))
print("\n")
print ("mean_squared_error : ",metrics.mean_squared_error(y_test, y_pred))
print("\n")
print ("root_mean_squared_error : ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
R-Square Value 0.3241322659149898


mean_absolute_error : 0.16889755277533727


mean_squared_error : 0.04197132028397072


root_mean_squared_error : 0.20486903202770965
  • LassoCV: lasso regression with built-in cross-validation of the alpha parameter
  • n_alphas: number of alpha values (automatically chosen) to try
# select the best alpha with LassoCV
from sklearn.linear_model import LassoCV
lassoregcv = LassoCV(n_alphas=100, normalize=True, random_state=1)
lassoregcv.fit(X_train, y_train)
print('alpha : ',lassoregcv.alpha_)
Output:
alpha : 0.002080882923737423
# examine the coefficients
print(lassoregcv.coef_)
Output:
[ 0. 0. 0. -0.25126823 0. 0.
0. 0. 0. -0. 0. 0.
-0. -0. -0. -0.11419567 0. 0.
-0. -0. -0. -0. -0. -0.
-0. 0. -0. 0. 0.10231515 0.
0. -0. 0. -0. -0. 0.
0. -0. 0. 0. 0. 0.12654959
0. -0.04766931 -0.38728958 -0. -0. -0.
-0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
-0. 0. 0. 0. 0. 0.
0. -0. 0. 0. -0. 0.
-0. -0. 0. 0. 0. 0.
0. -0. -0. -0. -0. -0.
-0. -0. 0. -0. -0. 0.
0.13736646 0. -0. -0. -0. -0.
0. 0. -0. 0. 0. 0.
0. 0. -0. -0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0.01538139 0. -0. 0.
-0. 0. ]
# predict method uses the best alpha value
y_pred = lassoregcv.predict(X_test)
# calculate R^2 value, MAE, MSE, RMSE

from sklearn import metrics
print("R-Square Value",r2_score(y_test,y_pred))
print("\n")
print ("mean_absolute_error :",metrics.mean_absolute_error(y_test, y_pred))
print("\n")
print ("mean_squared_error : ",metrics.mean_squared_error(y_test, y_pred))
print("\n")
print ("root_mean_squared_error : ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Output:
R-Square Value 0.551986438475168


mean_absolute_error : 0.13457411347964207


mean_squared_error : 0.027821598419367693


root_mean_squared_error : 0.16679807678557834

We can see an increase in the R-squared value after applying regularization (L1 and L2).

We can also see a decrease in the other metrics (MAE, MSE and RMSE) for the different L1 and L2 penalties.

Note: Thresholds for RMSE and related error measures are difficult to establish; normally an RMSE > 0.5 indicates a poor predictive model (here the target is scaled to [0, 1]).

My Github Profile

