This tutorial gives you a brief but thorough introduction to simple and multiple linear regression. I will also show you how to implement both using scikit-learn.

**What is Linear Regression?**

Linear regression is a statistical method that models the relationship between one or more independent variables and one dependent variable. It is used to predict the value of the dependent variable from the independent variables. It does this by fitting a line to the data, called the “best fit line”. The best fit line is the one that explains as much of the data as possible and minimizes the error between the true and predicted values.

As you can see in the image above, the blue line is called the regression line. This line should minimize the overall residual, meaning the sum of the squared differences between the actual and predicted outputs over all points should be as small as possible. This quantity is called the cost function, and it is given by

$$\text{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points. This is called the residual sum of squares. The line that minimizes this cost is the “best fit line”.
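To see the cost function at work, here is a minimal sketch that compares two candidate lines. The ages and salaries are made-up toy numbers, not the data from the image, and the helper names `residual_sum_of_squares` and `line` are only for illustration.

```python
def residual_sum_of_squares(y_true, y_pred):
    """Sum of squared differences between actual and predicted values."""
    return sum((actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred))


def line(w0, w1, xs):
    """Predictions of the line y = w0 + w1 * x for every x in xs."""
    return [w0 + w1 * x for x in xs]


# Hypothetical toy data (assumption): salary happens to be exactly age + 5.
ages = [20, 30, 40, 50]
salaries = [25, 35, 45, 55]

# Candidate 1 passes through every point, so every residual is zero.
rss_good = residual_sum_of_squares(salaries, line(5, 1, ages))

# Candidate 2 is off by 5 at every point, so the cost is 4 * 5**2.
rss_poor = residual_sum_of_squares(salaries, line(0, 1, ages))

print(rss_good)  # 0
print(rss_poor)  # 100
```

The line with the smaller cost is the better fit. Fitting a linear regression amounts to searching for the line with the smallest possible cost.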

In mathematical form, the line can be written as the familiar equation **y=mx+b**, but in linear regression you will usually see it written as **y=w0+w1*X1+w2*X2+…+wn*Xn**, where **y** is the dependent variable, **w0,w1,…,wn** are the model parameters, and **X1,X2,…,Xn** are the variables (also called columns or features). **w0** is called the intercept: it is what the model predicts when all the columns are zero, because in that case the equation becomes **y=w0**.

With respect to our image, we can write the equation as **Salary=w0+w1*Age**. Our aim is to fit a line to the given data and find the values of **w0** and **w1**, so that we can predict a person's salary from their age by plugging the age into the regression line equation.

**Note: Linear regression is used to predict continuous variables such as price or distance, not categorical or discrete variables such as color or month name.**
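Going back to finding **w0** and **w1**: with a single feature, the least-squares values can be computed directly from the standard closed-form formulas (slope = covariance of x and y divided by variance of x, intercept = mean of y minus slope times mean of x). The sketch below assumes made-up toy data, and the function name `fit_simple_linear_regression` is hypothetical. It is not how scikit-learn is implemented internally.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (w0, w1) of the least-squares line y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical toy data (assumption): salary is exactly age + 5.
ages = [20, 30, 40, 50]
salaries = [25, 35, 45, 55]

w0, w1 = fit_simple_linear_regression(ages, salaries)
print(w0, w1)  # 5.0 1.0
```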

If the data has only one independent (feature) column and one dependent (output) column, this type of linear regression is called “**Simple Linear Regression**”. If the data has more than one independent (feature) column, it is called “**Multiple Linear Regression**”.

**Simple Linear Regression with Scikit-Learn**

I am going to use the advertising dataset to implement linear regression in scikit-learn. The dataset is about the sales of a product in 200 different markets, together with the advertising budgets in each of these markets for different media channels: TV, radio and newspaper. The sales are in thousands of units and the budgets are in thousands of dollars.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df=pd.read_csv("advertsing.csv",names=["Index","Tv","Radio","Newspaper","Sales"])
    df=df[["Tv","Radio","Newspaper","Sales"]]
    df.head()

Let's go through the code. The first two lines import pandas and the LinearRegression class. pandas is used for data cleaning and wrangling, and LinearRegression is the class we will use to create our model. Next, we read the data and store it in the variable df. Finally, df.head() shows the first five rows, so we can see what the data looks like without printing the whole dataset, which may be too large to fit on screen.

First, we will do simple linear regression, so I am going to drop two columns, leaving only one independent column and one dependent column.

    df1=df[["Tv","Sales"]]
    X=df1["Tv"]
    y=df1["Sales"]

Now **X** holds only one column (feature), Tv, and this column is our independent variable. **y** is our dependent variable, because **y (Sales)** will be predicted from the values of X, i.e. the budget spent on TV advertising.

    lr=LinearRegression()
    lr.fit(X.values.reshape(-1,1),y.values.reshape(-1,1))
    print(lr.intercept_)
    print(lr.coef_)
    # intercept == [7.03259355]
    # coef == [[0.04753664]]

In the code above, we first create an instance of the LinearRegression class. Then we fit the model to our X and y; the **fit method** learns all the model's parameters. **reshape(-1,1)** is needed because sklearn expects the features as a two-dimensional array (rows and columns), while a single pandas column is one-dimensional. It tells NumPy that we want exactly 1 column and that it should work out the number of rows itself (that is what -1 means), so the output is a single column. Next we print the intercept and coefficient of the model. The intercept is **w0**, and since there is only one independent variable, there is only one coefficient, **w1**.
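To see what **reshape(-1,1)** does on its own, here is a small NumPy sketch. The three budget values are made up for illustration.

```python
import numpy as np

# Hypothetical TV budgets (assumption), stored as a flat 1-D array.
values = np.array([100.0, 200.0, 300.0])

# -1 asks NumPy to infer the number of rows; 1 fixes the number of columns.
column = values.reshape(-1, 1)

print(values.shape)  # (3,)
print(column.shape)  # (3, 1)
```

The data is unchanged; only the shape goes from a flat list of 3 values to a table with 3 rows and 1 column.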

The linear regression equation is now **Sales=7.0326+0.0475*Tv**

Now suppose someone asks you to predict the sales if they spend some amount of money on TV advertising. You can put that value in place of Tv and solve to get Sales.
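As a minimal sketch of that calculation, the snippet below plugs a TV budget into the equation using the intercept and coefficient printed above. The function name `predict_sales` and the budget of 100 are assumptions for illustration.

```python
# Values printed by the simple linear regression model above.
INTERCEPT = 7.03259355
TV_COEF = 0.04753664


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget (thousands of dollars)."""
    return INTERCEPT + TV_COEF * tv_budget


# Hypothetical budget: 100 thousand dollars spent on TV advertising.
predicted = predict_sales(100)
print(round(predicted, 2))  # 11.79
```

With the fitted model at hand, `lr.predict([[100]])` should give the same number, since `predict` evaluates the same equation.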

**Multiple Linear Regression with Scikit-Learn**

In multiple linear regression we have more than one independent column and one dependent column. We will use the Tv, Newspaper and Radio columns as independent columns and Sales as the dependent column.

    df=pd.read_csv("advertsing.csv",names=["Index","Tv","Radio","Newspaper","Sales"])
    df=df[["Tv","Radio","Newspaper","Sales"]]
    X=df[["Tv","Radio","Newspaper"]]
    y=df["Sales"]
    X.head()

    lm=LinearRegression()
    lm.fit(X,y)
    print(lm.intercept_)
    print(lm.coef_)
    # intercept == 2.9388893694594085
    # coef == [ 0.04576465  0.18853002 -0.00103749]

The code is the same as for simple linear regression, but this time we have not used **reshape(-1,1)**, because X is already a **pandas DataFrame** (a table-like, two-dimensional structure). When we used only the Tv column, X was a **pandas Series** (a single, one-dimensional column), and the fit method throws an error if we pass a Series as X.
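The difference between a Series and a DataFrame comes down to single versus double square brackets when selecting columns. The sketch below uses a made-up three-row frame in place of the advertising data.

```python
import pandas as pd

# Hypothetical three-row frame (assumption) standing in for the advertising data.
df = pd.DataFrame({"Tv": [100.0, 200.0, 300.0], "Sales": [10.0, 15.0, 20.0]})

series = df["Tv"]    # single brackets: a Series, one-dimensional
frame = df[["Tv"]]   # double brackets: a DataFrame, two-dimensional

print(type(series).__name__, series.shape)  # Series (3,)
print(type(frame).__name__, frame.shape)    # DataFrame (3, 1)
```

So selecting a single column with double brackets, as in `df[["Tv"]]`, is another way to get a two-dimensional X without calling reshape.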

As you can see, we now have 3 coefficients, one for each column. The linear regression equation becomes **Sales=2.9389+0.0458*Tv+0.1885*Radio+(-0.0010)*Newspaper**

**Interpreting the coefficients and intercept**

Now you know how to implement linear regression with sklearn, but you should also know what the intercept and coefficients mean. **w0** is called the **intercept**: it is what the model predicts when all the columns are zero, because in that case the equation becomes **y=w0**. **Coefficients** give the relation between the dependent and independent variables. In the simple model, **a “unit” increase in TV ad spending is associated with a 0.047537 “unit” increase in Sales**. Since the budgets are in thousands of dollars and the sales are in thousands of units, an extra $1,000 spent on TV advertising is associated with about 47.5 more units sold.
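The same reading applies to the multiple regression model, with one addition: each coefficient describes the change in predicted Sales when that column goes up by one unit while the other columns stay fixed. The sketch below checks this using the intercept and coefficients printed earlier. The budgets and the function name `predict_sales` are assumptions for illustration.

```python
# Values printed by the multiple linear regression model above.
INTERCEPT = 2.9388893694594085
COEFS = {"Tv": 0.04576465, "Radio": 0.18853002, "Newspaper": -0.00103749}


def predict_sales(budgets):
    """Evaluate Sales = w0 + w1*Tv + w2*Radio + w3*Newspaper."""
    return INTERCEPT + sum(COEFS[name] * budgets[name] for name in COEFS)


# With every budget at zero, the prediction is just the intercept.
no_ads = {"Tv": 0, "Radio": 0, "Newspaper": 0}
print(predict_sales(no_ads) == INTERCEPT)  # True

# Hypothetical budgets (assumption); raise Tv by one unit, keep the rest fixed.
base = {"Tv": 100, "Radio": 20, "Newspaper": 10}
more_tv = dict(base, Tv=101)

effect = predict_sales(more_tv) - predict_sales(base)
print(round(effect, 6))  # 0.045765
```

The difference between the two predictions matches the Tv coefficient (up to floating-point rounding), which is exactly what "a unit increase in Tv" means here.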

I hope I have explained all the details clearly. If you have any questions or suggestions, leave a comment below.

*Thank you.*
