Evaluating Regression Models with Python scikit-learn

In this guide, you will learn how to evaluate regression models with various metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared Score, and Adjusted R-Squared Score.

Let’s first implement our regression model; then we will evaluate it using the RMSE, MSE, MAE, R-squared, and Adjusted R-squared metrics. If you want a detailed explanation of linear regression, you can check this post: https://expertsteaching.com/simple-and-multipe-linear-regression-detailed-explanation-with-scikit-learn-implementation/.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics

First, we import all the useful classes and functions from sklearn. The most important module here is metrics, which contains the functions we will use to evaluate our models. StandardScaler puts the feature variables on the same scale, and train_test_split splits the dataset into a train set and a test set.

df = pd.read_csv("autoMPG.csv")
df.head()
[Output: first five rows of the autoMPG dataset]

Here, the columns from Cylinders to Origin are our independent features, and MPG is the dependent variable, or target variable.

df1 = df.iloc[:, :7]   # independent features
# df1.head()

y = df.iloc[:, -1]     # target variable (MPG)
# y.head()

We store the independent columns in a new DataFrame variable df1 and separate the target variable MPG into the variable y.

X_train, X_test, y_train, y_test = train_test_split(df1, y, test_size=0.2)

We use the train_test_split function to split the dataset into a train set and a test set, so that we can train the model on the train set and measure its accuracy on the unseen test set. The first argument is the feature data (df1), the second argument is the target variable (y), and test_size is the fraction of the dataset to hold out as the test set; here 0.2 means 20%.
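
As a quick sanity check (a minimal sketch, assuming the split above has already run), you can confirm the 80/20 proportions:

print(X_train.shape, X_test.shape)                  # feature matrices for train and test
print(len(X_test) / (len(X_train) + len(X_test)))   # ~0.2, i.e. 20% held out for testing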

Next, we need to put our data on the same scale. Many models, especially distance-based and gradient-based ones, behave poorly when one feature has much larger values than the others, because that feature dominates; scaling also makes the learned coefficients comparable across features. You can see in the dataset that Weight and Displacement have much higher values than Cylinders, Horsepower, and Acceleration.
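
To see the mismatch in scale for yourself, a quick sketch (it simply summarises the raw training features before any scaling):

# Compare raw feature spreads -- the larger columns dwarf the smaller ones.
print(X_train.describe().loc[['mean', 'std']])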

scaler = StandardScaler().fit(X_train)

We create a StandardScaler() and fit our training data X_train to it. At this point the scaler has only learned the parameters and stored them in the scaler variable; it has not transformed the dataset yet. Note that we only standardise the feature columns (X_train, X_test), not the target columns (y_train, y_test).

print(scaler.scale_)
print("\n")
print(scaler.mean_)
Output:
scale:
[1.70980553e+00 1.06654023e+02 3.99600248e+01 8.64602372e+02
 2.87910546e+00 3.70504833e+00 8.10582861e-01]

mean:
[5.43769968e+00 1.94314696e+02 1.04105431e+02 2.96772204e+03
 1.56405751e+01 7.59137380e+01 1.59105431e+00]

scale_ is the per-feature scaling of the data, and mean_ is the mean value of each feature in the training set. This is everything the scaler learned from the fit method.
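
Under the hood, transform simply applies these learned statistics column by column; a minimal sketch verifying this equivalence:

# StandardScaler.transform is just (x - mean_) / scale_ applied to each column.
manual = (X_train.values - scaler.mean_) / scaler.scale_
print(np.allclose(manual, scaler.transform(X_train)))   # True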

lr = LinearRegression()
lr.fit(scaler.transform(X_train), y_train)

predictions = lr.predict(scaler.transform(X_test))

We instantiate the LinearRegression class and store it in the lr variable, then fit the scaled X_train and y_train with lr.fit(). Note how we reuse scaler.transform on X_test so that the test data is standardised with the statistics learned from the training set. Finally, we call the predict method on the scaled X_test to predict the output (MPG) based on the training.
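
A worthwhile alternative, not used in the rest of this guide: scikit-learn's Pipeline chains the scaler and the model together, so you never have to call transform by hand (pipe and predictions_pipe are illustrative names):

from sklearn.pipeline import make_pipeline

# Equivalent workflow: the pipeline fits the scaler on X_train and reuses it on X_test.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
predictions_pipe = pipe.predict(X_test)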

Evaluating the Regression Models

Now that we have predicted the output from X_test, we can compare it with y_test, which holds the actual output. But comparing them element by element is not feasible for a large test dataset, so we will use some metrics to evaluate the accuracy of regression models.
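
Before turning to summary metrics, it can still help to eyeball a few rows; a quick sketch:

# Put a few predictions next to the actual MPG values for a visual check.
comparison = pd.DataFrame({'Actual': y_test.values, 'Predicted': predictions})
print(comparison.head())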

Mean Absolute Error (MAE)

MAE = (1/n) * Σ |Yi − Ŷi|

Mean Absolute Error is the average of the sum of absolute errors, where Yi is the actual output, Ŷi is the predicted output, and n is the number of samples in the test data. We take the absolute value of each difference. MAE gives us an estimate of how much, on average, our model deviates from the actual output. It is also called the “mean absolute deviation”.
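
A minimal NumPy sketch of the same formula, which should agree with the sklearn helper below:

# MAE by hand: average the absolute differences between actual and predicted values.
mae_manual = np.mean(np.abs(y_test - predictions))
print(mae_manual)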

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, predictions))

###OUTPUT
###Mean Absolute Error: 9.903275011580973

This indicates that, on average, our model's predictions deviate from the actual output by about 9.90.

Mean Squared Error (MSE)

MSE = (1/n) * Σ (Yi − Ŷi)²

Mean Squared Error is another metric for evaluating regression models. The differences between the actual and predicted outputs are first squared, then summed up, and then averaged. Mean Squared Error matters when outliers (large errors) are important: for example, if an error is -5, MAE counts it as 5, but MSE squares it to 25. So it is useful when we want large errors to weigh more heavily.
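
The same computation by hand (a quick sketch, matching the sklearn call below):

# MSE by hand: square each error before averaging, so large errors dominate.
mse_manual = np.mean((y_test - predictions) ** 2)
print(mse_manual)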

print('Mean Squared Error:', metrics.mean_squared_error(y_test, predictions))  

###OUTPUT
###Mean Squared Error: 9.903275011580973

Root Mean Squared Error (RMSE)

RMSE = √MSE = √( (1/n) * Σ (Yi − Ŷi)² )

Root Mean Squared Error is just the square root of MSE. RMSE is more interpretable because it is on the same scale as the data: in MSE the differences are squared, so we take the square root to bring the error back to the scale of the original data.

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

###OUTPUT
###Root Mean Squared Error: 3.146946934980152
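
Depending on your scikit-learn version, there is also a direct helper, so the explicit square root is optional (root_mean_squared_error was added in scikit-learn 1.4; older releases accept squared=False instead):

# Same result without the explicit square root (scikit-learn >= 1.4).
# On older versions: metrics.mean_squared_error(y_test, predictions, squared=False)
print('Root Mean Squared Error:', metrics.root_mean_squared_error(y_test, predictions))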

R-Squared Score

R-squared is a metric that tells us how much of the variance in the target is explained by our regression model. A model that explains more of the variance is a better model; hence, a model with a higher R-squared score is better. It is also called the “coefficient of determination”.

R² = 1 − (SSResidual / SSTotal)

SSResidual is the sum of the squared values of y − ŷ (actual value minus predicted value). SSTotal is the sum of the squared values of y − ȳ (actual values minus the mean of the target variable). SSResidual/SSTotal is the fraction of the variance not explained by our model, so 1 − SSResidual/SSTotal is the fraction of the variance that is explained.
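
The same computation by hand (a minimal sketch, which should match r2_score below):

# R-squared by hand: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_test - predictions) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - ss_res / ss_tot)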

metrics.r2_score(y_test,predictions)

###OUTPUT
###0.7878645597601919

Here you can see that our model explains about 78.79% of the variance in the data.

Adjusted R-Squared Score

The problem with R-squared is that it only measures how much of the variance is explained by the model. If the number of features increases, the R-squared score will also increase, because the model has more information to fit the data with; on the training data, adding a feature can never decrease R-squared. So we get a higher R-squared score and may think the model is doing better.

But that is not necessarily true. Features that are not useful for prediction also contribute to increasing the R-squared score, while reducing the model's ability to generalise. Adjusted R-squared penalises features that are not useful, which lowers the Adjusted R-squared score. So if your R-squared score is increasing but your Adjusted R-squared score is decreasing, you know that some features are not useful and you can remove them.

Adjusted R² = 1 − ( (1 − R²)(N − 1) / (N − p − 1) )

N is the number of samples and p is the number of predictors (features).
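
You can compute it directly from the formula above (a minimal sketch, using the test-set R-squared; note that the statsmodels summary further below reports the training-set value, so the two numbers will differ):

# Adjusted R-squared from the formula, using n samples and p predictors.
r2 = metrics.r2_score(y_test, predictions)
n, p = X_test.shape
print('Adjusted R-squared:', 1 - (1 - r2) * (n - 1) / (n - p - 1))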

Alternatively, statsmodels computes Adjusted R-squared for you as part of its regression summary:

import statsmodels.api as sm

X = sm.add_constant(X_train)   # statsmodels does not add an intercept by default
model = sm.OLS(y_train, X).fit()
model.summary()
[Output: model.summary() regression table, which reports both R-squared and Adj. R-squared]

That’s all for this guide. I hope I explained all the metrics clearly. If you have any doubt or suggestion, please drop a comment below.

Thank You.

Amarjeet

