Logistic Regression: A Detailed Explanation with scikit-learn Implementation

This guide gives you a concise explanation of logistic regression and multinomial logistic regression. You will also learn how to implement both logistic regression and multinomial logistic regression with scikit-learn.

What is Logistic Regression?

Logistic regression is a statistical method for predicting a dependent variable given a set of independent variables. Note that in logistic regression the dependent variable is a categorical variable like "Yes/No", "0/1" or "Absent/Present", so it is used for classification problems. When the dependent variable has two classes, it is called binary logistic regression, or just logistic regression. When the dependent variable has more than two classes, it is called multinomial logistic regression.

Logistic regression uses the same equation as linear regression. If you want a detailed explanation of how linear regression works, you can read this post: Linear Regression. It passes the output of the linear regression equation to a special function called the sigmoid (logistic) function, which maps that value to a number between 0 and 1, i.e., it gives us the probability of being in a particular class.

Sigmoid curve (logistic function)

In the image above, you can see that the Y-axis (our outcome or dependent variable) is between 0 and 1. Notice the horizontal line at 0.5: if the sigmoid value is less than 0.5, the model predicts class 0 as the output, and if the sigmoid value is greater than or equal to 0.5, the model predicts class 1.

Φ(z) = 1 / (1 + e^(−z)) is the sigmoid function, and z is our linear regression equation, i.e., z = w0 + w1*x1 + w2*x2 + … + wn*xn. We pass the linear regression equation to the sigmoid function and it returns a value between 0 and 1. The graph of the sigmoid function is called the sigmoid curve and it is an "S"-shaped curve.
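To make this mapping concrete, here is a minimal sketch of the sigmoid in plain NumPy. The intercept, weights and inputs below are made-up numbers, purely for illustration:

import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# hypothetical intercept, weights and a single row of inputs
w0 = -0.3
w = np.array([0.25, -0.5, 0.1])   # w1..wn
x = np.array([2.0, 1.0, 3.0])     # x1..xn

z = w0 + np.dot(w, x)             # the linear regression part
p = sigmoid(z)                    # probability of class 1
print(p, "-> predicted class:", int(p >= 0.5))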

Logistic Regression with Scikit-Learn

Now we will implement logistic regression with scikit-learn. Let's look at the dataset.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load the Pima Indians diabetes dataset
df = pd.read_csv("pima_diabetes.csv")
df.head()
Diabetes dataset for logistic regression

Here you can see that the Outcome column, which is our dependent (output) variable, has only two values, 0 or 1 (absent or present). So this is binomial logistic regression, or just logistic regression. All other columns are independent variables.
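As a quick sanity check that the target really has only two classes, you can count its values (assuming the target column is named Outcome, as above):

print(df["Outcome"].value_counts())
# 0    500
# 1    268   (the usual class counts for this dataset)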

Let's separate our target variable (Outcome) from our input variables (all other columns).

# grab the first column through the second-to-last column in variable X
X = df.iloc[:, 0:-1]
print(X.head())

# grab the last column (Outcome) in variable y
y = df.iloc[:, -1]
print(y.head())
Input features (X.head())
Target variable (y.head())
# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

#### OUTPUT
# (614, 8) (614,)
# (154, 8) (154,)

In the code above, I am using scikit-learn's train_test_split function to split the data into training and testing parts. test_size represents the fraction of the data reserved for testing, and 0.2 means 20% of the data will be used as test data. So when we print the shapes, you can see that X_train has 614 rows and 8 columns and y_train has 614 rows in one column. Similarly, X_test has 154 rows and 8 columns and y_test has 154 rows in one column.

We split the data into train and test sets because we need to measure the model's accuracy by comparing the actual outputs (y_test) with the model's predictions on X_test. To do this we need data the model has not seen before. That's why we first split the data into train and test sets: the train set is used to train the model, and the test set is used to evaluate it.

Now, let's use the LogisticRegression class to create our model.

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Predictions")
print(predictions)

Model Prediction for X_test data

In the image above, the model predicts 0 or 1 for the X_test data based on its training on X_train. We can find out whether the model is predicting right or wrong by comparing its output with the y_test values, because y_test contains the actual outcomes for X_test. Let's look at the first 10 predictions and actual outputs:

print("Model Prediction")
print(predictions[0:9])

print("Actual Output")
print(y_test[0:9].ravel())
Predictions vs Actual Output

You can see that for the first 10 predictions, the model is correct 9 times. So, do we need to check this manually when we have a large dataset? Of course not. Scikit-learn has a function that handles all of this in its metrics module.

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(accuracy)

### accuracy=0.8051948051948052

accuracy_score measures the accuracy of the model. Its first parameter is the actual values and its second parameter is the values our model predicted. In this case, the model has an accuracy of about 80%, which means about 80% of the predictions made by our model are correct.
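Under the hood, accuracy is just the fraction of predictions that match the actual labels. Here is a minimal sketch of the same computation done by hand (assuming y_test is the pandas Series from the split above):

import numpy as np

# fraction of predictions that equal the true labels
manual_accuracy = np.mean(predictions == y_test.values)
print(manual_accuracy)  # should match accuracy_score's output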

Multinomial Logistic Regression

In multinomial logistic regression, the target (output) variable contains more than two classes. The sigmoid function we saw earlier is also replaced by the softmax function.

Softmax function: P(y = k | x) = e^(zk) / (e^(z1) + e^(z2) + … + e^(zK)), where zk is the linear regression equation for class k

Here, since we have more than two classes in the output variable, we predict classes based on probability: the class with the highest probability is predicted as the output. In the equation, x is an instance (a row of independent variables) and we want the probability of y being class k given that instance. The equation is computed K times, because we have K classes, and it gives K probabilities.
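As a quick illustration, here is a minimal NumPy sketch of the softmax; the scores in z are made-up numbers standing in for the K linear outputs:

import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the probabilities are unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.5, 0.3])   # hypothetical scores, one per class
probs = softmax(z)
print(probs, probs.sum())       # K probabilities that sum to 1
print("predicted class:", np.argmax(probs))

Now let's implement multinomial logistic regression in scikit-learn.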

# import libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target

Here, iris is a dataset provided by sklearn for practice purposes. It has 4 features (independent columns): sepal length, sepal width, petal length and petal width. It has a target (dependent) column with 3 classes, Iris Setosa, Iris Versicolour and Iris Virginica, which are names of flower species. We load the data using the load_iris() method, then put the independent columns in X and the dependent (target) column in y.
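You can verify the feature and class names directly from the loaded dataset object:

print(iris.feature_names)   # ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']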

# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

Here we are standardising the data because the features can be on different scales. If one column has much larger values than another, the algorithm will give more importance to that column, which introduces a bias and leads to errors in prediction. So we standardise the data so that all features are on the same scale.
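StandardScaler simply subtracts each column's mean and divides by its standard deviation. Here is a minimal sketch of the same computation done by hand, using the X loaded above:

# equivalent to StandardScaler: zero mean, unit variance per column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# each standardised column now has mean ~0 and std ~1
print(X_manual.mean(axis=0).round(6))
print(X_manual.std(axis=0).round(6))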

Dataset without standardisation
Dataset with standardisation
# Create one-vs-rest logistic regression object
clf = LogisticRegression(random_state=0, multi_class='multinomial', solver='newton-cg')

#create model
model = clf.fit(X_std, y)

We create the model using the LogisticRegression class and pass multi_class='multinomial' so the model knows we want a multinomial logistic regression model. Next, we fit our training data to the model. Now it's time for some prediction. Let's create a new observation.

new_observation = [[.5, .5, .5, .5]]
model.predict(new_observation)

####output--> array([1])

We create a new observation by assigning the value 0.5 to each independent column: sepal length, sepal width, petal length and petal width. Then we predict its class using the predict method. It returns 1, and this 1 corresponds to the class Iris Versicolour. Why is the model predicting class 1 for this example? Let's find out.

model.predict_proba(new_observation)
###OUTPUT
# array([[0.01982536, 0.74491994, 0.2352547 ]])

As you can see, predict_proba gives us the probability of the new observation belonging to each class. The probability for class 1 is about 0.745, and that is why the model predicts class 1 for the new observation.
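If you want the flower name rather than the class index, you can look it up in iris.target_names, for example:

predicted_class = model.predict(new_observation)[0]
print(iris.target_names[predicted_class])   # 'versicolor'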

That's it for this tutorial. I hope you understood everything well. If you have any doubts or want to give any suggestions, drop a comment below.

Thank You.

Amarjeet
