# Diagnosing Overfitting with Regularization

Regularization is a way to overcome overfitting and high variance. In this tutorial, you will learn how regularization works, how it reduces overfitting, and how to use three important regularization algorithms: Ridge, Lasso, and Elastic Net.

If you are not familiar with underfitting and overfitting, I suggest going through a tutorial on those topics first, because this tutorial assumes you know about underfitting, overfitting, and the bias-variance tradeoff.

## What is Regularization?

Regularization is a technique used to reduce overfitting and high variance. Overfitting occurs when a model is so complex that it tries to fit every training point. In doing so, it also picks up the noise in the data, so the model performs very well on the training set but poorly on the test set. The goal of regularization is to solve this problem by making the model simpler.

## Why Regularization?

Sometimes a model has so many features that it becomes complex and starts overfitting. One way to handle this is to remove some features, but doing so throws away information from the dataset. With regularization, we keep all the features while ensuring that the model does not overfit the dataset.

## How Does Regularization Work?

Regularization works by making models simpler, and it achieves this by penalizing the model parameters. If you know linear regression, you know that each feature column is associated with a weight, and these weights are called model parameters. The model parameters determine the slope of the regression line: parameters that are large in magnitude give a steep slope, and small parameters give a gentle slope.

Regularization adds some bias to the model in order to reduce its variance. With the added bias, a high-variance model no longer fits the training data as closely, so its training accuracy drops. In exchange, the lower variance reduces overfitting, and the model performs better in the testing phase.

## Mathematics behind Regularization

The equation for a regression model is y = w0 + w1X1 + w2X2 + … + wnXn.

The cost function of linear regression is given by:
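Written out for $m$ training examples, and assuming the standard residual-sum-of-squares definition (the symbols $m$, $y_i$, $\hat{y}_i$ and $X_{ij}$ are notation chosen here for illustration):

$$
J(w) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2,
\qquad
\hat{y}_i = w_0 + w_1 X_{i1} + w_2 X_{i2} + \dots + w_n X_{in}
$$

Here $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for the $i$-th training example.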

The cost function is the sum of the squared differences between the actual and predicted values, also called the residual sum of squares (RSS). The lower this value, the better the model fits the training data. w0, w1, …, wn are the model parameters, which define the intercept and slope of the model. Regularization penalizes these parameters whenever they grow very large, which keeps the slope under control so that the model does not become too complicated and overfit.

The regularized cost function is given by:
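In general form, it can be sketched as the ordinary cost plus a weighted penalty. The penalty function $P(w)$ is a placeholder symbol introduced here, and $m$, $y_i$ and $\hat{y}_i$ denote the number of training examples, the actual values and the predicted values:

$$
J_{\text{reg}}(w) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \, P(w)
$$

$P(w)$ grows with the size of the weights. Ridge uses the sum of squared weights for $P(w)$, and Lasso uses the sum of absolute weights.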

Here, we add a penalty to the cost function. If the model parameters grow large, the penalty and therefore the total cost also increase, which discourages the complex fits that tend to overfit. If the model parameters stay small, the penalty stays small as well. If the parameters are shrunk all the way to 0, the penalty term vanishes and only the ordinary cost function remains.

λ is called the regularization parameter. It controls the tradeoff between fitting the training data well and keeping the model parameters small. λ ranges from 0 to infinity and sets how strongly the model is penalized when it tries to become complex.

Let's look at the role of λ with a graph where we have one feature (area of the house) and one output variable (price).

As you can see from the image above, if λ is 0, the regularization term is 0 and the cost function is the ordinary cost function. The model then fits the training data very closely, and overfitting can occur. As we increase λ, the slope of the model starts to decrease, and when λ becomes very large, the model becomes almost flat. This means the model's prediction is barely sensitive to the area of the house.
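The effect of λ on the slope can be checked numerically. The following is a minimal sketch, assuming made-up house data (the `area` and `price` arrays are hypothetical, not the data behind the original graph) and using scikit-learn's `Ridge`, where λ is passed as `alpha`:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data: house area (in 1000 sq ft) vs. price, plus noise
rng = np.random.default_rng(0)
area = np.linspace(0.5, 3.0, 20)
price = 100 * area + 50 + rng.normal(scale=10, size=area.size)

X_area = area.reshape(-1, 1)
slopes = {}
# Start from an almost-zero penalty and increase it step by step
for lam in [1e-8, 1.0, 10.0, 100.0, 1000.0]:
    model = Ridge(alpha=lam)
    model.fit(X_area, price)
    slopes[lam] = float(model.coef_[0])
    print(f"lambda={lam:>8}: slope={slopes[lam]:.3f}")
```

With a single feature, the printed slope shrinks toward 0 as λ grows, which is the flattening described above.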

So we need to find a good value for λ, one that keeps the error low while reducing overfitting. We can find it using cross-validation.

There are three important regularization algorithms we can use: Ridge, Lasso, and Elastic Net.
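As a rough illustration of choosing λ by cross-validation, the sketch below scores a few candidate values with 5-fold cross-validation and keeps the one with the lowest mean squared error. The synthetic data (`X_demo`, `y_demo`) and the candidate list are assumptions made for this example only:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data: 60 samples, 5 features, 2 of them irrelevant
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=60)

candidate_lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_mse = {}
for lam in candidate_lambdas:
    # scikit-learn reports negated MSE, so flip the sign back
    scores = cross_val_score(
        Ridge(alpha=lam), X_demo, y_demo,
        scoring="neg_mean_squared_error", cv=5,
    )
    cv_mse[lam] = -scores.mean()

best_lambda = min(cv_mse, key=cv_mse.get)
print("best lambda:", best_lambda, "CV MSE:", cv_mse[best_lambda])
```

GridSearchCV, used later in this tutorial, automates the same loop.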

## Ridge Regression (L2 Regularization)

Ridge Regression is a regularization algorithm that we can use to reduce overfitting. The cost function of Ridge Regression is:
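A reconstruction consistent with the description that follows, assuming $m$ training examples, $n$ features, and the common convention that the intercept $w_0$ is not penalized (that convention is an assumption here):

$$
J_{\text{ridge}}(w) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} w_j^2
$$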

In Ridge Regression, we sum the squares of the model parameters, multiply the sum by λ, and add the result to the cost function. If the model parameters are small, their squares are also small, and the overall cost stays low. If the model parameters try to grow large, the overall cost increases. In this way, Ridge Regression penalizes the model parameters when they try to grow large.
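To make the arithmetic concrete, here is a small sketch that computes this cost by hand with NumPy. The helper `ridge_cost` and the toy arrays are hypothetical names introduced for illustration, and the intercept `w0` is assumed to be left out of the penalty:

```python
import numpy as np

def ridge_cost(w0, w, X, y, lam):
    """Residual sum of squares plus lam times the sum of squared weights."""
    residuals = y - (w0 + X @ w)
    rss = np.sum(residuals ** 2)
    penalty = lam * np.sum(w ** 2)  # the intercept w0 is not penalized
    return rss + penalty

# Toy data that the line y = 2x fits perfectly
X_toy = np.array([[1.0], [2.0], [3.0]])
y_toy = np.array([2.0, 4.0, 6.0])
w_fit = np.array([2.0])

print(ridge_cost(0.0, w_fit, X_toy, y_toy, lam=0.0))  # 0.0: perfect fit, no penalty
print(ridge_cost(0.0, w_fit, X_toy, y_toy, lam=1.0))  # 4.0: same fit, penalty 1 * 2**2
```

Even a perfect fit pays a price when λ is greater than 0, and that price grows with the square of the weight.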

### Ridge Regression in Python

Let's implement Ridge Regression in Python.

```python
# Importing libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
```

Now let's generate some random data on which we can try Ridge Regression, Lasso Regression, and Elastic Net. No random seed is set, so the exact numbers you get will differ from the ones shown in this tutorial.

```python
n_samples = 10
X = np.linspace(0, 10, n_samples)
# Cubic trend plus Gaussian noise
y = X ** 3 + np.random.randn(n_samples) * 100 + 100
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
```

Now, let's first use simple linear regression and see how our model performs.

```python
lin_reg = LinearRegression()
lin_reg.fit(X.reshape(-1, 1), y)
model_pred = lin_reg.predict(X.reshape(-1, 1))
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, model_pred)
print(r2_score(y, model_pred))  # 0.8563 in the original run
```

The R-squared score is 0.8563, which means our model explains only about 86% of the variance in the target. Let's look at the plot.

The model fits the data reasonably well, but suppose we want to fit a higher-order polynomial to this data so that the model covers all the points.

```python
from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature into polynomial terms up to degree 8
poly_reg1 = PolynomialFeatures(degree=8)
X_poly1 = poly_reg1.fit_transform(X.reshape(-1, 1))
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly1, y.reshape(-1, 1))
y_pred2 = lin_reg_2.predict(X_poly1)
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, y_pred2)
print(r2_score(y, y_pred2))  # 0.9975 in the original run
```

The R-squared score is 0.9975, which means our model explains 99.75% of the variance in the training data, which looks awesome. Let's look at the plot.

From the plot above, you can see that our model is overfitting. This is because we have fitted an 8th-degree polynomial to only 10 data points.

### Finding best parameters with GridSearchCV

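```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Note: the `normalize` argument was removed in scikit-learn 1.2,
# so this code assumes an older scikit-learn version.
ridge = Ridge(normalize=True, random_state=42)

# Candidate values for alpha (the regularization parameter λ)
parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 0.001, 0.01, 0.1, 1, 2, 3, 4, 5]}

ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_squared_error', cv=5)

ridge_regressor.fit(X_poly1, y.reshape(-1, 1))

print(ridge_regressor.best_params_)
```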

GridSearchCV is used to find the hyperparameters of a model using cross-validation. Here, we are using 5-fold cross-validation. λ is the hyperparameter we want to find, and in sklearn it is called alpha. After fitting the search to our dataset, we can get the best parameter value from the best_params_ attribute.
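On scikit-learn 1.2 and later, estimators such as `Ridge` no longer accept `normalize=True`. One possible workaround, sketched below, is to scale the features inside a pipeline. This is an assumption-laden variant rather than an exact replacement: `StandardScaler` scales features differently from the old `normalize` option, so the selected alpha and the scores will generally not match the ones in this tutorial. The data is regenerated with a fixed seed, and the names `X_demo`, `y_demo`, `pipeline` and `search` are introduced only for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same kind of data as in the tutorial, but with a fixed seed (an assumption)
rng = np.random.default_rng(42)
X_demo = np.linspace(0, 10, 10)
y_demo = X_demo ** 3 + rng.normal(size=10) * 100 + 100

# Polynomial expansion -> feature scaling -> Ridge, all tuned together
pipeline = make_pipeline(
    PolynomialFeatures(degree=8, include_bias=False),
    StandardScaler(),
    Ridge(),
)

# make_pipeline names each step after its lowercased class name
param_grid = {"ridge__alpha": [1e-4, 0.001, 0.01, 0.1, 1, 2, 3, 4, 5]}
search = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo.reshape(-1, 1), y_demo)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold does not leak into the scaling.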

Output:

{'alpha': 1}

GridSearchCV returned alpha=1 as the best parameter. Let's create another Ridge model with alpha=1 and fit it to the data.

```python
# Ridge model with the alpha selected by the grid search
# (`normalize` requires scikit-learn older than 1.2)
ridge1 = Ridge(alpha=1, normalize=True)

ridge1.fit(X_poly1, y.reshape(-1, 1))
y_pred4 = ridge1.predict(X_poly1)
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, y_pred4)
print(r2_score(y, y_pred4))  # 0.8663 in the original run
```

The R-squared score is 0.8663, which means our model explains 86.63% of the variance in the data. You might think that this is only a 1 percentage point increase over simple linear regression without polynomial features. But let's look at the plot.

Now you can see that, with the same 8th-degree polynomial, we get a very good fit to the data without overfitting, and the R-squared score is 0.8663. This is the kind of model we want: a good fit without overfitting. Although the R-squared score is only about 1 percentage point higher than that of linear regression without polynomial features (the first model), this model should perform better in the real world. The data is non-linear, and simple linear regression without polynomial features can only fit a straight line, no matter what the data looks like. From the plot above, you can see that both the data and our model are non-linear, so this model should generalize better to such data.

## Lasso Regression (L1 Regularization)

Lasso Regression is almost the same as Ridge Regression. The only difference is the way the model parameters are penalized. Let's look at the cost function of Lasso Regression:
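A reconstruction under the same notation as the linear regression cost, assuming $m$ training examples, $n$ features, and an unpenalized intercept $w_0$ (the last point is an assumption):

$$
J_{\text{lasso}}(w) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{n} \lvert w_j \rvert
$$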

In Lasso regularization, we penalize the model parameters by the sum of their absolute values instead of their squares.

### Lasso Regression in Python

Lasso Regression can be implemented in the same way as Ridge Regression. We just need to import the Lasso class and use it to build our model. So, let's do it.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

# Note: the `normalize` argument was removed in scikit-learn 1.2,
# so this code assumes an older scikit-learn version.
lasso = Lasso(normalize=True, random_state=42)

parameters = {'alpha': [1e-15, 1e-10, 1e-8, 1e-4, 0.001, 0.01, 0.1, 1, 2, 3, 4, 5]}

lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_squared_error', cv=5)

lasso_regressor.fit(X_poly1, y.reshape(-1, 1))
print(lasso_regressor.best_params_)  # {'alpha': 3} in the original run
```

GridSearchCV returned alpha=3 as the best value for alpha (λ). Let's create another Lasso model with alpha (λ) = 3.

```python
# Lasso model with the alpha selected by the grid search
lasso1 = Lasso(alpha=3, normalize=True)

lasso1.fit(X_poly1, y.reshape(-1, 1))
y_pred4 = lasso1.predict(X_poly1)
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, y_pred4)
print(r2_score(y, y_pred4))  # 0.9056321662491851 in the original run
```

The R-squared score is 0.9056, which means our model explains 90.56% of the variance in the data. On this dataset, that is higher than the score from Ridge Regression. Let's look at the plot of Lasso Regression.

You can see that Lasso Regression also fits the data very well and gives a higher R-squared score than Ridge Regression here.

## Elastic Net Regression

Elastic Net regularization is a combination of Ridge and Lasso regularization. With Elastic Net, we can use the power of Lasso to remove irrelevant features, or the power of Ridge to shrink their coefficients close to 0 without making them exactly 0, so that some information is preserved.

The mathematical equation for Elastic Net is given by:
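A reconstruction that combines the two penalties, assuming $m$ training examples, $n$ features, an unpenalized intercept $w_0$, and $\lambda_1$ as the weight of the L1 term and $\lambda_2$ as the weight of the L2 term:

$$
J_{\text{enet}}(w) = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \lambda_1 \sum_{j=1}^{n} \lvert w_j \rvert + \lambda_2 \sum_{j=1}^{n} w_j^2
$$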

As you can see, the equation has both L1 and L2 penalty terms. If λ2 is 0, only the L1 term remains and we have Lasso Regression; if λ1 is 0, only the L2 term remains and we have Ridge Regression. Because both λ's are hyperparameters, we can find them using cross-validation.

### Elastic Net in Python

```python
# Using the same data as in the previous methods
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Model creation (`normalize` requires scikit-learn older than 1.2)
elastic = ElasticNet(normalize=True)
Best_params = GridSearchCV(
    estimator=elastic,
    param_grid={'alpha': np.logspace(-5, 2, 8), 'l1_ratio': [.2, .4, .6, .8]},
    scoring='neg_mean_squared_error',
    n_jobs=1,
    cv=10,
)

Best_params.fit(X_poly1, y.reshape(-1, 1))
print(Best_params.best_params_)
```

Elastic Net works best on normalized data because the penalty depends on the scale of each feature, and normalize=True takes care of that. GridSearchCV is used to find the best hyperparameters using cross-validation; here we are using 10-fold cross-validation. n_jobs=1 runs the search on a single CPU core; set n_jobs=-1 to use all the cores available on the machine. We pass different values for λ (called alpha in sklearn), the regularization parameter that controls how strongly the model is penalized when it tries to become complex. l1_ratio is used to choose between Ridge and Lasso: with l1_ratio=0 the penalty is a pure L2 (Ridge) penalty, and with l1_ratio=1 it is a pure L1 (Lasso) penalty. So we need to keep l1_ratio strictly between 0 and 1 to use the model as an Elastic Net.
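Note that sklearn's ElasticNet does not take two separate λ values. As described in the scikit-learn documentation, it minimizes an objective of the following form, written here with $m$ for the number of samples and $\rho$ for l1_ratio (both symbols are notation chosen for this note):

$$
\frac{1}{2m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 + \alpha \, \rho \sum_{j=1}^{n} \lvert w_j \rvert + \frac{\alpha \, (1 - \rho)}{2} \sum_{j=1}^{n} w_j^2
$$

So, apart from the $\frac{1}{2m}$ scaling of the error term, the L1 weight corresponds to $\alpha \rho$ and the L2 weight to $\alpha (1 - \rho) / 2$.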

The output of the above code will be:

{'alpha': 1.0, 'l1_ratio': 0.6}

Let's make another model with these values and fit it to the data.

```python
# Elastic Net model with the hyperparameters selected by the grid search
elastic1 = ElasticNet(normalize=True, alpha=1.0, l1_ratio=0.6)
elastic1.fit(X_poly1, y.reshape(-1, 1))

y_pred = elastic1.predict(X_poly1)
# Uncomment to plot the fit:
# plt.figure(figsize=(10, 5))
# plt.scatter(X, y)
# plt.plot(X, y_pred)
print(r2_score(y, y_pred))
```

In this case, the R-squared score is 0.74939, which is lower than the scores from Ridge and Lasso. Which regularization to use depends on the business case. There can be many reasons for this lower R-squared score; one may be the small dataset.

## Ridge vs Lasso Regularization

Ridge regularization shrinks the coefficients of features close to 0, but it does not make them exactly 0. Lasso regularization, on the other hand, sets the coefficients of irrelevant features to exactly 0 and so removes those features from the model.

So, if you want your model to be simple but don't want to lose any information by removing features, use Ridge regularization.

If you want your model to be simple and don't mind losing information by removing irrelevant features, use Lasso regularization. Because Lasso sets the coefficients of irrelevant features to 0, it can also be used for feature selection.
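The difference is easy to see on a small synthetic example. The sketch below assumes made-up data in which only the first two of six features affect the target. The names `X_demo`, `y_demo`, `ridge_demo` and `lasso_demo`, and the alpha values, are chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 6 features, only the first two influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
# Count how many coefficients each model set to exactly zero
print("Exact zeros (Ridge):", int(np.sum(ridge_demo.coef_ == 0)))
print("Exact zeros (Lasso):", int(np.sum(lasso_demo.coef_ == 0)))
```

Ridge keeps small non-zero coefficients for the irrelevant features, while Lasso drops them entirely, which is why Lasso doubles as a feature selection method.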

That is all for this tutorial. If you have any doubts or suggestions, feel free to comment below.

Thank you.