You may have come across the term Bias-Variance Tradeoff while learning or reading about supervised machine learning and wondered what it means. In this post you will learn about Bias, Variance and the Bias-Variance Tradeoff. You will also learn about Underfitting, Overfitting and Polynomial Regression.
What is Bias?
Bias is the set of assumptions a machine learning model makes about the data. The error due to these assumptions is called bias error. If a model has high bias, it oversimplifies the problem and pays very little attention to the training data, which leads to both high training error and high testing error. Let's understand this with an example. Suppose you visit a new country and take a taxi, and the taxi driver behaves rudely with you. You might then conclude that all taxi drivers are rude. But this would be unfair to all the other drivers you have not met or interacted with yet.
In this case you hold a strong belief that all taxi drivers are rude, and this is a perfect example of high bias: you have not seen all the drivers, yet you have made an assumption about all of them. High bias causes high bias error and leads to the Underfitting problem.
What is Variance?
Variance is a measure that tells us how scattered the predicted values are from the actual values. If a model has high variance, it will try to cover every data point. As the model covers all the points, the training error will be close to 0 and training accuracy will be very high. But don't conclude that the model is very good just because its training accuracy is high, because it will not perform well in the real world. Why not?
Underfitting and Overfitting
As you know, in supervised machine learning we train our models on training data and then evaluate them on testing data. If the model tries to cover all the data points, it will cover the noise points too. It will then be unable to differentiate between real structure and noise on new data, because during the training phase it learned the noise as well. So it will treat noisy data as useful, and the testing error will be high. This is due to high variance: when a model tries to cover all the data points, it leads to Overfitting.
Underfitting occurs when a model is too simple, due to high bias or too little data, and is not able to learn much about the underlying patterns in the data.
Overfitting occurs when a model tries to cover all the data points, including the noisy ones. This makes training accuracy very high, but it increases testing error and decreases testing accuracy.
Look at the image above. See how the model is trying to capture every point, so training accuracy will be very high and training error will be close to 0. But it will not perform well in the real world.
Which fit is good?
I would like to answer this question with the following diagram. Let's have a look at it.
The image above shows how a model performs for low and high values of Bias and Variance.
In the first image, with low variance and high bias, we have underfitting: the model is very simple due to high bias.
In the second image, with high bias and high variance, you can see that our predictions are off the target due to both sources of error.
In the third image, with low variance and low bias, we get correct predictions: all the predictions cluster at the centre, which is the truth point.
And in the fourth image, with high variance and low bias, overfitting occurs.
Now you may have your answer: for a good model, both Bias and Variance should be low. But in the real world there is always a tradeoff between bias and variance. So what is the Bias-Variance Tradeoff?
If our model is too simple, it may have high bias and low variance, and Underfitting will occur. If our model is too flexible and has a large number of features, it will have high variance, and Overfitting will occur. So we need to find the right balance between Bias and Variance, without Overfitting or Underfitting. This is called the "Bias-Variance Tradeoff".
In the graph above, you can see that as the complexity of the model increases (the number of features, or the degree of the polynomial), the bias error reduces and approaches zero. But as the bias error reduces, the variance error increases, because the model becomes very complex and Overfitting occurs. You can see that the total error also increases.
And if we reduce model complexity, the variance error reduces because the model becomes very simple, but the bias error increases due to the overly simple model. This is the case of Underfitting, and here too you can see that the total error is very high.
We want to find the optimal complexity for our model, the one that balances bias error and variance error. The optimal complexity lies between the high-bias and high-variance regions, where all the curves are at their lowest: the "Valley Point".
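As a side illustration (my own sketch, not part of the original post's code), bias and variance can be estimated empirically: fit the same model class on many freshly sampled training sets and look at how the predictions at a fixed query point behave. The cubic data-generating function, noise level and polynomial degrees below are arbitrary choices for the demonstration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)

def true_f(x):
    return x ** 3  # hypothetical "true" data-generating function

def predictions_at_x0(degree, x0=5.0, n_datasets=200, n_samples=15):
    """Fit degree-`degree` polynomial models on many fresh noisy datasets
    and collect each model's prediction at the fixed point x0."""
    preds = []
    for _ in range(n_datasets):
        X = np.linspace(0, 10, n_samples).reshape(-1, 1)
        y = true_f(X.ravel()) + rng.randn(n_samples) * 100
        poly = PolynomialFeatures(degree=degree)
        model = LinearRegression().fit(poly.fit_transform(X), y)
        preds.append(model.predict(poly.transform([[x0]]))[0])
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias at x0
    variance = preds.var()                      # spread of predictions at x0
    return bias_sq, variance

results = {d: predictions_at_x0(d) for d in (1, 10)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2 ~ {b:.0f}, variance ~ {v:.0f}")
```

The simple degree-1 model gives almost the same (wrong) answer on every dataset, so its squared bias dominates; the flexible degree-10 model is right on average but its predictions swing wildly from dataset to dataset, so its variance dominates.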
How to know if our model has high Bias or high Variance?
To know whether your model has high bias or high variance, you can look at the training and testing errors. If the training error is low and the testing error is very high, the model has high variance and is overfitting. If the training error is high and the testing error is also high, the model has high bias and is underfitting.
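A minimal sketch of this diagnostic (the cubic dataset, degrees and split below are my own illustrative choices, using scikit-learn's train_test_split):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error

# Hypothetical noisy cubic dataset (not the post's exact data)
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = X.ravel() ** 3 + rng.randn(30) * 100

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

errors = {}
for degree in (1, 14):
    # Scaling the polynomial features keeps the high-degree fit numerically sane
    model = make_pipeline(PolynomialFeatures(degree=degree),
                          StandardScaler(),
                          LinearRegression()).fit(X_train, y_train)
    errors[degree] = (mean_squared_error(y_train, model.predict(X_train)),
                      mean_squared_error(y_test, model.predict(X_test)))
    print(f"degree={degree}: train MSE={errors[degree][0]:.0f}, "
          f"test MSE={errors[degree][1]:.0f}")
```

Degree 1 shows the high-bias signature (both errors high); degree 14 shows the high-variance signature (training error far below testing error).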
How to fix High Bias and High Variance Problems?
High Bias can be fixed using:
- Adding more features to the dataset
- Reducing regularization
- Adding some complexity to the features (covered below in this post, under Polynomial Regression)
High Variance can be fixed using:
- Increasing the size of the dataset
- Making the model a little simpler
- Reducing the number of features
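One more variance fix worth sketching is regularization, the flip side of the "Reducing regularization" fix for bias above. This is my own illustrative example (dataset, degree and alpha are arbitrary choices): a Ridge penalty shrinks the coefficients of an over-flexible degree-14 polynomial model, trading a little training fit for a smoother curve:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import r2_score

# Hypothetical noisy cubic dataset
rng = np.random.RandomState(1)
X = np.linspace(0, 10, 15).reshape(-1, 1)
y = X.ravel() ** 3 + rng.randn(15) * 100

# Plain least squares on degree-14 features: chases every point (high variance)
ols = make_pipeline(PolynomialFeatures(degree=14, include_bias=False),
                    StandardScaler(), LinearRegression()).fit(X, y)

# Ridge penalizes large coefficients, pulling the curve back towards smoothness
ridge = make_pipeline(PolynomialFeatures(degree=14, include_bias=False),
                      StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

ols_r2 = r2_score(y, ols.predict(X))
ridge_r2 = r2_score(y, ridge.predict(X))
print(f"OLS train R^2:   {ols_r2:.4f}")
print(f"Ridge train R^2: {ridge_r2:.4f}")
```

The regularized model deliberately scores a little lower on the training set; that lost training fit is mostly noise it is no longer memorizing.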
I will write another post on diagnosing bias-variance problems in detail, covering Regularization, Validation Curves and Learning Curves.
Now I will implement Polynomial Regression in Python to show you Overfitting and Underfitting.
Polynomial Regression, Underfitting and Overfitting
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
```
Here we first import all the required libraries.
```python
n_samples = 15
X = np.linspace(0, 10, n_samples)
y = X ** 3 + np.random.randn(n_samples) * 100 + 100
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
```
We first generate 15 evenly spaced data points between 0 and 10 with np.linspace() and store them in X, which is our independent feature variable.
In the next line we generate the output variable y by cubing X and adding some random noise, which makes the relationship non-linear, and then we plot the data with plt.scatter().
This is how our dataset looks. First, we will use simple Linear Regression and see how well it fits our data.
```python
lin_reg = LinearRegression()
lin_reg.fit(X.reshape(-1, 1), y)
model_pred = lin_reg.predict(X.reshape(-1, 1))
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, model_pred)
print(r2_score(y, model_pred))
```
You can see that our data is non-linear, but we are trying to fit a linear model, so we do not get a very good fit. Notice that the R-squared score is about 0.72, which means our model explains only about 72% of the variance in the data. This is a case of Underfitting, where the model is too simple to capture most of the structure in the data.
Now let's make our model non-linear by increasing the degree of our features.
```python
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X.reshape(-1, 1))
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y.reshape(-1, 1))
y_pred = lin_reg_2.predict(X_poly)
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, y_pred)
print(r2_score(y, y_pred))
```
We transform our feature variable into second-order polynomial features. Next, we create a new instance of LinearRegression and fit this new model on the polynomial features.
In the image above, you can see that our model now fits the data much better, and the R-squared score has increased to about 0.93.
Let's increase the degree of the polynomial to some large number, say 14, and see what happens.
```python
poly_reg1 = PolynomialFeatures(degree=14)
X_poly1 = poly_reg1.fit_transform(X.reshape(-1, 1))
lin_reg_3 = LinearRegression()
lin_reg_3.fit(X_poly1, y.reshape(-1, 1))
y_pred2 = lin_reg_3.predict(X_poly1)
plt.figure(figsize=(10, 5))
plt.scatter(X, y)
plt.plot(X, y_pred2)
print(r2_score(y, y_pred2))
```
In the image above, you will notice that our model fits all the data points very well and the R-squared score is 0.9995, almost 100%. You might now think this is the best model, but it is the worst model you could build, because it is overfitting: it is trying to fit every data point, noise included. In real-world Data Science and Machine Learning problems, you will never get 100% accuracy or a perfect R-squared. So whenever your model gives a very high accuracy or R-squared score, it might be Overfitting.
So you need to figure out which degree of polynomial gives a high R-squared score without Underfitting or Overfitting.
You can plot the degree of the polynomial against the R-squared score for a range of degrees, and from the graph choose the degree that gives a high R-squared score without Overfitting or Underfitting.
```python
r_squared = []
for i in range(1, 21):
    poly_reg1 = PolynomialFeatures(degree=i)
    X_poly1 = poly_reg1.fit_transform(X.reshape(-1, 1))
    lin_reg_3 = LinearRegression()
    lin_reg_3.fit(X_poly1, y.reshape(-1, 1))
    y_pred2 = lin_reg_3.predict(X_poly1)
    r_squared.append(r2_score(y, y_pred2))

plt.figure(figsize=(10, 5))
plt.xlabel("Degree of Polynomial")
plt.ylabel("R-squared")
plt.plot(range(1, 21), r_squared)
```
But these scores are computed only on the training set. To know whether our model is underfitting or overfitting, we need to see how it performs on unseen data (the test set), and we need to look at both the training and test score curves. I will cover this topic in another post.
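As a small preview of that idea (a sketch with my own illustrative dataset and split; the full treatment with validation and learning curves is a larger topic), we can compute both the training and test R-squared across degrees:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Hypothetical noisy cubic dataset (not the post's exact data)
rng = np.random.RandomState(42)
X = np.linspace(0, 10, 40).reshape(-1, 1)
y = X.ravel() ** 3 + rng.randn(40) * 100

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

train_scores, test_scores = [], []
for degree in range(1, 15):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    train_scores.append(r2_score(y_tr, model.predict(poly.transform(X_tr))))
    test_scores.append(r2_score(y_te, model.predict(poly.transform(X_te))))

# Training R^2 keeps climbing with degree, but test R^2 typically peaks
# near the true complexity and then drops off as overfitting sets in
best_degree = int(np.argmax(test_scores)) + 1
print("best degree by test R^2:", best_degree)
```

Plotting both score lists against degree gives the two curves mentioned above: the training curve keeps improving, while the gap between the curves widens once the model starts overfitting.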
I hope you now understand Bias, Variance, the Bias-Variance Tradeoff, Underfitting and Overfitting. If you have any doubt or suggestion, feel free to comment below.