Standardization and Normalization in Machine Learning

In this post you will learn about normalization and standardization in machine learning: why feature scaling is important, and how to apply normalization and standardization in Python.

Why Normalize or Standardize?

Sometimes the features of a dataset are on different scales. For example, one feature may be measured in centimetres and another in metres, or one in pounds and another in kilograms. When features are on different scales, the features with larger numeric values dominate distance and gradient computations, which can bias the model and hurt accuracy. To overcome this, we make the features scale-invariant by putting them on the same scale (feature scaling). In this post you will learn how to perform feature scaling using normalization and standardization.

Standardization

Standardization is a feature scaling technique in which we rescale each feature so that it has a mean of 0 and a standard deviation of 1 (μ = 0 and σ = 1), the same first two moments as a standard normal (Gaussian) distribution.

We can standardize our features using the following formula, where z is called the standard score or Z-score:

z = (x - μ) / σ
Standardization Formula
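
As a quick illustration of this formula, here is a minimal sketch (using a small made-up array of values, not the wine data) that computes the Z-score by hand with NumPy and compares it with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

# A made-up feature column, used only to illustrate the formula
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0]).reshape(-1, 1)

# Standardize by hand: z = (x - mean) / std
z_manual = (x - x.mean()) / x.std()

# Standardize with scikit-learn (same formula under the hood)
z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.ravel())                  # [-1.41 -0.71  0.    0.71  1.41]
print(z_sklearn.ravel())                 # identical values
print(z_manual.mean(), z_manual.std())   # ~0.0 and 1.0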

Standardization in Python

Now I am going to show you how to implement standardization in Python. I will run logistic regression on a dataset with and without standardization and show how it affects accuracy and results.

import pandas as pd

# Load only the class label and the first two features of the Wine dataset
df = pd.read_csv('wine.csv',
                 header=None,
                 usecols=[0, 1, 2],
                 names=['Class label', 'Alcohol', 'Malic acid'])
df.head()
Wine Dataset (output of df.head())

As we can see, the two features, Alcohol and Malic acid, are on different scales: Alcohol is measured in percent by volume and Malic acid in g/L.
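
If you want to confirm the scale difference numerically, the summary statistics of the two columns make it obvious (the exact numbers depend on the wine.csv file loaded above):

# Compare the ranges of the two features side by side
print(df[['Alcohol', 'Malic acid']].describe().loc[['min', 'mean', 'max']])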

Let's first try to classify our examples without standardizing them.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Features (Alcohol, Malic acid) and target (class label)
X_wine = df.values[:, 1:]
y_wine = df.values[:, 0]

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine,
                                                    test_size=0.30,
                                                    random_state=12345)


lr = LogisticRegression()
lr.fit(X_train, y_train)

pred_train = lr.predict(X_train)

print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))

pred_test = lr.predict(X_test)

print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))

Classification accuracy without standardization

Now, let's standardize our features and see if it improves the accuracy.

from sklearn import preprocessing

# Fit the scaler on the training data only, then apply it to both sets
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)

lr_std = LogisticRegression()
lr_std.fit(X_train_std, y_train)

pred_train_std = lr_std.predict(X_train_std)

print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_std)))

pred_test_std = lr_std.predict(X_test_std)

print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))
Classification accuracy after standardization

As you can see, both training and test accuracy increase significantly for the standardized model compared to the non-standardized one.

Normalization (Min-Max Scaling)

Normalization, also called min-max scaling, rescales the values of a feature to the range [0, 1]. It can be used when we don't know the underlying distribution of the data or when the data is not Gaussian.

x_norm = (x - x_min) / (x_max - x_min)
Normalization (Min-Max Scaling) Formula
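
Again as a quick illustration (on the same kind of made-up array as before), min-max scaling can be computed by hand and matches scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature column, for illustration only
x = np.array([50.0, 60.0, 70.0, 80.0, 90.0]).reshape(-1, 1)

# Normalize by hand: x_norm = (x - x_min) / (x_max - x_min)
x_manual = (x - x.min()) / (x.max() - x.min())

# Normalize with scikit-learn (same formula, default range [0, 1])
x_sklearn = MinMaxScaler().fit_transform(x)

print(x_manual.ravel())    # [0.   0.25 0.5  0.75 1.  ]
print(x_sklearn.ravel())   # identical values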

Drawbacks of Normalization

Because normalization rescales values into the fixed range [0, 1], the minimum and maximum are determined by the most extreme points; if the dataset contains outliers, the remaining values get squeezed into a very narrow part of that range. Standardization suffers less from this problem because standardized values are not bound to any fixed limits.
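
A small sketch (with a made-up feature containing one extreme outlier) makes the difference visible: min-max scaling squeezes the ordinary values into a tiny corner of the fixed [0, 1] range, while standardization has no such fixed bound:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature: typical values around 10-14, plus one extreme outlier
x = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 100.0]).reshape(-1, 1)

# The five ordinary points end up crammed into roughly [0, 0.045]
print(MinMaxScaler().fit_transform(x).ravel())

# Standardized values are not bound to [0, 1], so the outlier simply
# sits far from the mean instead of forcing everything else towards 0
print(StandardScaler().fit_transform(x).ravel())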

Normalization in Python

Let's first classify our examples without normalizing them, using the same dataset and the same classification algorithm (logistic regression).

lr1 = LogisticRegression()
lr1.fit(X_train, y_train)

pred_train = lr1.predict(X_train)

print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))

pred_test = lr1.predict(X_test)

print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
Training accuracy without normalization

Now, let's look at the accuracy after normalizing our features.

# Fit the min-max scaler on the training data only, then apply it to both sets
norm_scale = preprocessing.MinMaxScaler().fit(X_train)
X_train_norm = norm_scale.transform(X_train)
X_test_norm = norm_scale.transform(X_test)

lr_norm = LogisticRegression()
lr_norm.fit(X_train_norm, y_train)

pred_train_norm = lr_norm.predict(X_train_norm)

print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_norm)))

pred_test_norm = lr_norm.predict(X_test_norm)

print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_norm)))
Classification accuracy with normalization

As you can see, the normalized features work noticeably better than the non-normalized features for this classification task.

When Should We Scale Our Features?

One question you may have is: when should we scale our features, and which algorithms need scaled features?

The answer is that most algorithms benefit from scaled features. Many machine learning algorithms rely on a distance metric such as Euclidean or Manhattan distance, or on gradient-based optimization. Because of the way distances are computed, a feature with a much larger numeric range dominates the distance and effectively drowns out the other features. So whenever you use an algorithm that relies on distances (or on gradient descent), you should scale your features.
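
To see why, here is a tiny sketch (with two made-up samples) showing how a feature on a much larger numeric scale dominates the Euclidean distance:

import numpy as np

# Two made-up samples: feature 1 is weight in grams, feature 2 is height in metres
a = np.array([70000.0, 1.80])
b = np.array([65000.0, 1.60])

# The raw distance is driven almost entirely by the gram-scale feature;
# the height difference contributes essentially nothing
print(np.linalg.norm(a - b))   # ~5000.0

After standardizing both features, the height difference would contribute to the distance as well.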

Algorithms such as KNN, SVM, PCA and the perceptron are sensitive to feature scale, and gradient-based models such as linear regression and logistic regression usually train better on scaled features. Tree-based algorithms (CART, decision trees, random forests, etc.) split on one feature at a time and do not depend on distance metrics, so feature scaling can generally be skipped for them.
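
In practice, a convenient way to make sure the scaler is fitted only on the training part of the data is to put it in a scikit-learn Pipeline together with the model. The sketch below reuses the X_wine and y_wine arrays from earlier:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is re-fitted on the training portion of every split,
# so no information from the held-out fold leaks into the scaling step
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_wine, y_wine, cv=5)
print('Cross-validated accuracy: {:.2%}'.format(scores.mean()))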

That's all for this post. I hope I explained it well. If you have any doubts or suggestions, feel free to comment below.

Thank You.

Amarjeet
