Simple and Multiple Linear Regression: Detailed Explanation with Scikit-learn Implementation

This tutorial explains the concepts behind Simple and Multiple Linear Regression in detail. I will also show you how to implement both using scikit-learn.

What is Linear Regression?

Linear Regression is a statistical method that models the relationship between one or more independent variables and one dependent variable. It is used to predict the value of the dependent variable from the independent variables. Linear Regression does this by fitting a line to the given data, called the “Best Fit Line”. The best fit line is the one that explains as much of the data as possible and minimizes the error between the true and predicted values.

linear regression image

As you can see in the image above, the blue line is called the Regression Line, and it should minimize the overall residual (meaning the sum of squared differences between the actual and predicted outputs, taken over all points, should be as small as possible). This quantity is called the cost function. The cost function is given by

linear regression cost function

where yi is the actual value and ŷi is the predicted value. This is called the residual sum of squares (RSS). The line that minimizes this cost is the “Best Fit Line”.
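The formula image is not reproduced here. Going by the description (actual value yi, predicted value ŷi, residual sum of squares), the cost function presumably has the standard form below. The symbol n, the number of data points, is introduced here for illustration and does not come from the original figure.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```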

In mathematical form, the line can be written as the equation of a line, y = mx + b, but in Linear Regression you will usually see it written as y = w0 + w1*X1 + w2*X2 + … + wn*Xn, where y is the dependent variable, w0, w1, …, wn are the model parameters, and X1, X2, …, Xn are the variables (also called columns or features). w0 is called the intercept: it is what the model predicts when all the columns are zero. In that case the equation becomes y = w0.
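As a minimal sketch of this equation in plain Python: the function name predict and the parameter values are made up for illustration, not taken from any fitted model.

```python
def predict(w0, weights, features):
    """Return w0 + w1*x1 + ... + wn*xn for one row of feature values."""
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return w0 + sum(w * x for w, x in zip(weights, features))


# Made-up parameters: intercept w0 = 2.0 and two weights w1, w2.
w0 = 2.0
weights = [0.5, 1.5]

print(predict(w0, weights, [4.0, 2.0]))  # 2.0 + 0.5*4.0 + 1.5*2.0
print(predict(w0, weights, [0.0, 0.0]))  # all features zero, so only w0 remains
```

The second call shows the role of the intercept: with every feature set to zero, the prediction is just w0.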

With respect to our image, we can write the equation as Salary = w0 + w1*Age. Our aim is to fit a line to the given data and find the values of w0 and w1, so that we can estimate a person's salary from their age by plugging the age into the regression equation. Note: Linear Regression is used to predict continuous variables such as price or distance, not categorical or discrete variables such as color or month name.
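To make "finding w0 and w1" concrete, here is a small sketch that uses the standard closed-form least-squares formulas for a single feature. The text does not derive these formulas, and the ages and salaries are invented so that the points lie exactly on a known line; fit_simple_line is a hypothetical helper name.

```python
def fit_simple_line(xs, ys):
    """Least-squares estimates (w0, w1) for the line y = w0 + w1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Made-up data that lies exactly on Salary = 10 + 2*Age (salary in thousands).
ages = [25, 30, 35, 40]
salaries = [60, 70, 80, 90]

w0, w1 = fit_simple_line(ages, salaries)
print(w0, w1)
print(w0 + w1 * 32)  # estimated salary for a 32-year-old
```

Because the invented points sit exactly on a line, the fit should recover that line; with real, noisy data the residuals would not all be zero.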

If the data has only one independent (feature) column and one dependent (output) column, this type of linear regression is called “Simple Linear Regression”. If the data has more than one independent (feature) column, it is called “Multiple Linear Regression”.

Simple Linear Regression with Scikit-Learn

I am going to use the advertising dataset to implement linear regression in scikit-learn. The dataset is about the sales of a product in 200 different markets, together with the advertising budgets in each of these markets for three media channels: TV, radio and newspaper. The sales are in thousands of units and the budgets are in thousands of dollars.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the CSV, naming the columns explicitly, then drop the "Index" column.
df = pd.read_csv("advertsing.csv", names=["Index", "Tv", "Radio", "Newspaper", "Sales"])
df = df[["Tv", "Radio", "Newspaper", "Sales"]]
df.head()

Let's see what the data looks like:

Advertising Dataset

In the first two lines of code we import pandas and the LinearRegression class. Pandas is used for data cleaning and wrangling, and LinearRegression is the class we will use to create our model. Next, we read the data and store it in the variable df. df.head() shows only the first few rows, because the full data may be too large to fit on screen.

First, we will do simple linear regression, so I am going to drop two columns, leaving only one independent column and one dependent column.

df1 = df[["Tv", "Sales"]]
X = df1["Tv"]       # independent variable (feature)
y = df1["Sales"]    # dependent variable (target)

Now X is our dataset with only one column (feature), Tv, and this column is our independent variable. y is our dependent variable, because y (Sales) will be predicted from the values of X, i.e. the budget spent on TV advertising.

lr = LinearRegression()
# reshape(-1, 1) turns each 1-D array of values into a single column.
lr.fit(X.values.reshape(-1, 1), y.values.reshape(-1, 1))
print(lr.intercept_)
print(lr.coef_)

# intercept == [7.03259355]
# coef == [[0.04753664]]

In the code above, we first create an instance of the LinearRegression class. Then we fit the model to our X and y; the fit method learns all the model's parameters. reshape(-1,1) tells NumPy to arrange all the values in a single column: the 1 says we want exactly one column, and the -1 says the number of rows should be worked out from the data, so we do not need to know it in advance. Next we print the intercept and coef of the model. The intercept is w0, and since there is only one independent variable, there is only one coefficient (coef), w1.
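The effect of reshape(-1, 1) is easiest to see on a tiny array. This is only a sketch with invented budget values, not rows from the advertising dataset.

```python
import numpy as np

# Made-up TV budgets; any 1-D array behaves the same way.
tv = np.array([100.0, 50.0, 25.0, 10.0])
print(tv.shape)          # one dimension: (4,)

# -1 lets NumPy infer the row count; 1 fixes the column count.
tv_column = tv.reshape(-1, 1)
print(tv_column.shape)   # four rows, one column: (4, 1)
print(tv_column)
```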

The Linear Regression equation is now Sales = 7.0326 + 0.0475*Tv

Now suppose someone asks you to predict the sales if they are going to spend some money on TV advertising. You can put that value in place of Tv and solve to get Sales.
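Plugging a value into the equation can be sketched as follows. The intercept and coefficient are the values printed by the model above; the function name predict_sales and the budget of 100 are hypothetical choices for illustration.

```python
# Parameters printed by the fitted simple regression model.
INTERCEPT = 7.03259355   # w0
TV_COEF = 0.04753664     # w1


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget (thousands of dollars)."""
    return INTERCEPT + TV_COEF * tv_budget


# Hypothetical TV budget of 100, i.e. $100,000.
print(round(predict_sales(100), 4))  # 7.03259355 + 0.04753664 * 100
```

If the fitted lr object is still in memory, lr.predict([[100]]) should give the same number up to rounding.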

Multiple Linear Regression with Scikit-Learn

In multiple linear regression we have more than one independent column and one dependent column. We will use the Tv, Newspaper and Radio columns as independent columns and Sales as the dependent column.

df = pd.read_csv("advertsing.csv", names=["Index", "Tv", "Radio", "Newspaper", "Sales"])
df = df[["Tv", "Radio", "Newspaper", "Sales"]]

# Independent columns (features) and dependent column (target).
X = df[["Tv", "Radio", "Newspaper"]]
y = df["Sales"]
X.head()

Multiple Regression dataset

lm = LinearRegression()
lm.fit(X, y)
print(lm.intercept_)
print(lm.coef_)

# intercept == 2.9388893694594085
# coef == [ 0.04576465  0.18853002 -0.00103749]

The code is the same as for simple linear regression, but this time we did not call reshape(-1,1). Here X is a pandas DataFrame (a table-like structure), which is already two-dimensional. When we used only the Tv column, X was a pandas Series (a single column), which is one-dimensional, and the fit method raises an error if the features are passed as a Series.
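The difference between a Series and a one-column DataFrame can be checked directly. The sketch below uses a tiny invented dataset rather than the advertising file, and selects the column with double brackets, which is one way to get a 2-D input without reshape.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that follows Sales = 5 + 0.1*Tv exactly.
df = pd.DataFrame({"Tv": [10.0, 20.0, 30.0, 40.0],
                   "Sales": [6.0, 7.0, 8.0, 9.0]})

X_series = df["Tv"]    # single brackets: a 1-D Series, shape (4,)
X_frame = df[["Tv"]]   # double brackets: a 2-D DataFrame, shape (4, 1)
y = df["Sales"]        # a 1-D target is fine

model = LinearRegression()
try:
    model.fit(X_series, y)
    series_accepted = True
except ValueError:
    series_accepted = False   # scikit-learn asks for a 2-D array of features

model.fit(X_frame, y)          # works without reshape(-1, 1)
print(series_accepted, model.intercept_, model.coef_)
```

Only the features need to be two-dimensional; the target y can stay a plain Series.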

Now you can see we have 3 coefficients, one for each column. The Linear Regression equation becomes Sales = 2.9389 + 0.0458*Tv + 0.1885*Radio - 0.0010*Newspaper
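The same equation can be evaluated by hand for one market. The intercept and coefficients are the values printed by the model above; the budgets and the helper name predict_sales are hypothetical.

```python
# Intercept and coefficients printed by the fitted multiple regression model.
INTERCEPT = 2.9388893694594085
COEFS = {"Tv": 0.04576465, "Radio": 0.18853002, "Newspaper": -0.00103749}


def predict_sales(budgets):
    """Predicted sales (thousands of units) from a dict of budgets (thousands of dollars)."""
    return INTERCEPT + sum(COEFS[channel] * budgets[channel] for channel in COEFS)


# Hypothetical budgets for one market.
market = {"Tv": 100.0, "Radio": 20.0, "Newspaper": 10.0}
print(round(predict_sales(market), 4))
```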

Interpreting the coefficients and intercept

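Now you know how to implement linear regression with sklearn, but you should also know what the intercept and coefficients mean. w0 is the intercept: it is what the model predicts when all the columns are zero, since in that case the equation becomes y = w0. The coefficients describe the relationship between each independent variable and the dependent variable. For example, in the simple regression model, a one-unit increase in TV ad spending is associated with a 0.047537-unit increase in Sales. Since budgets are in thousands of dollars and sales are in thousands of units, an extra $1,000 of TV advertising is associated with roughly 47.5 more units sold.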
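Both interpretations can be verified numerically with the simple regression parameters reported earlier. This is a sketch; the starting budgets of 50 and 200 are arbitrary choices.

```python
# Parameters of the simple regression model fitted earlier (Sales on Tv).
INTERCEPT = 7.03259355
TV_COEF = 0.04753664


def predict_sales(tv_budget):
    """Predicted sales for a given TV budget, using the fitted line."""
    return INTERCEPT + TV_COEF * tv_budget


# With no TV spending the prediction is just the intercept.
print(predict_sales(0))

# Raising the budget by one unit changes the prediction by the coefficient,
# whatever the starting budget.
for budget in (50, 200):
    print(round(predict_sales(budget + 1) - predict_sales(budget), 8))
```

The printed differences match the coefficient regardless of the starting budget, which is exactly what "a constant effect per unit" means for a straight line.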

I hope I explained all the details clearly. If you have any doubts or suggestions, leave a comment below.

Thank you.

Amarjeet

About Amarjeet

Amarjeet, BE in CS, loves to code in Python and is passionate about Machine Learning and Data Science. Expertsteaching.com is just a medium to share what I have learned so far with the world.