Cross Validation Techniques

In this post you will learn about cross validation and several cross validation techniques: K-fold, repeated K-fold, stratified K-fold, leave-one-out (LOOCV) and leave-P-out (LOOPV). You will also see how to implement these techniques in Python.

What is Cross Validation?

Cross validation is a model validation technique. It is used to overcome the reduction in training data that occurs when we split a data set into separate train, validation and test sets, which I have covered in this post.

In cross validation, we partition our data set into 2 sets, a Train set and a Test set. The Train set is used for both training and validation, and the Test set is used after cross validation to test model performance. The training data is partitioned into subsets; the model is trained on some of these subsets and validated on the remaining ones. To reduce variability, we perform this operation many times, each time with a different training and validation subset, and average the accuracy to get the model's performance.
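The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: it assumes scikit-learn's bundled wine data (load_wine) as a stand-in dataset, and the 80/20 split, the 5 folds and the linear SVC are choices made only for this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Assumption: the bundled wine data stands in for a real dataset.
X, y = load_wine(return_X_y=True)
X = X[:, :2]  # keep only the first two features to keep the example small

# Step 1: hold out a test set that cross validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Step 2: cross-validate on the training portion only.
# With an integer cv and a classifier, scikit-learn uses stratified folds.
model = SVC(C=1, kernel='linear')
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Step 3: refit on all the training data and check the held-out test set once.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Test accuracy:", test_score)
```

The test set is touched only in the last step, so its score is not influenced by any choices made during cross validation.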

Below I am going to show you some of the cross validation techniques and their implementation in Python.

K-Fold Cross Validation

In the K-fold cross validation technique, the data set is partitioned into K partitions called folds. The technique takes a parameter K, which specifies how many folds we want to partition our data into. Of these folds, one fold is used as the Test set and the remaining folds are used as the Training set. This process is repeated K times, each time with a different Test fold, so that each fold gets a chance to be in both the Training set and the Test set.

[Figure: K-fold cross validation (K=6)]

In the above image you can see that the data is partitioned into 6 folds. In each iteration, K-1 (5) folds form the train set and 1 fold forms the test set. This process is repeated 6 times, and each iteration uses a different train and test set.
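To make the mechanics concrete, here is a small sketch that builds the folds by hand with NumPy instead of a library splitter. The sizes (12 samples, 6 folds) are hypothetical and chosen only so that the output stays short.

```python
import numpy as np

n_samples, k = 12, 6                  # hypothetical sizes, for illustration only
indices = np.arange(n_samples)
folds = np.array_split(indices, k)    # K non-overlapping folds of near-equal size

for i, test_idx in enumerate(folds):
    # every fold except the i-th one goes into the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("iteration", i + 1, "train:", train_idx, "test:", test_idx)
```

Each iteration prints a different test fold, and the other five folds together make up the training set.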

This technique is very useful when we have very little data. If we use the holdout method (a single train-test split), a sizeable share of the data (for example 50% with a 50-50 split) is set aside for testing and is never used for training. With little data, this can lead to underfitting and high bias.

K-fold cross validation overcomes this problem: every iteration trains on K-1 of the K folds and validates on the remaining fold. Each data point is used in the training set exactly K-1 times and in the validation set exactly once. This reduces the problem of underfitting and high bias. After all the iterations, we take the average of the scores returned in each iteration.
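A quick way to check how often each sample is used is to count its appearances across the splits. The sketch below uses hypothetical sizes (20 samples, K=5); with K-fold, each sample should land in the test fold once and in the training folds K-1 times.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 20, 5                  # hypothetical sizes, for illustration only
X = np.zeros((n_samples, 1))          # the values do not matter, only the row count

train_counts = np.zeros(n_samples, dtype=int)
test_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in KFold(n_splits=k).split(X):
    train_counts[train_idx] += 1      # sample used for training in this split
    test_counts[test_idx] += 1        # sample used for testing in this split

print("times in training set:", train_counts)
print("times in test set:    ", test_counts)
```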

Making K-Folds for CV in Python

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Load the class label and the first two features of the wine data set
df = pd.read_csv(
    'wine.csv',
    header=None,
    usecols=[0, 1, 2],
    names=['Class label', 'Alcohol', 'Malic acid'])

# Split the row indices into 2 folds and print each train/test index array
kf = KFold(n_splits=2)
for train, test in kf.split(df):
    print(train)
    print(test)
    print("\n")

The above code splits the data into 2 folds, as K=2, and prints the indices of the train and test sets.


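To get the actual values from the dataset corresponding to these indices, we can do something like below. Here df1 holds the two feature columns, y holds the class labels, and train and test are the index arrays from the last iteration of the loop above.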

df1 = df.iloc[:, 1:3]  # features: Alcohol, Malic acid
y = df.iloc[:, 0]      # target: Class label

X_train, X_test, y_train, y_test = df1.iloc[train], df1.iloc[test], y.iloc[train], y.iloc[test]
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

#####OUTPUT####
(89, 2)
(89,)
(89, 2)
(89,)

K-Fold CV in Python

NOTE: You can use the code below to implement all the cross validation techniques listed in this post by changing the value of K and replacing KFold with another splitter.

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import KFold

# Linear support vector classifier
svc = svm.SVC(C=1, kernel='linear')

df = pd.read_csv(
    'wine.csv',
    header=None,
    usecols=[0, 1, 2],
    names=['Class label', 'Alcohol', 'Malic acid'])

kf = KFold(n_splits=10)

df1 = df.iloc[:, 1:3]  # features: Alcohol, Malic acid
y = df.iloc[:, 0]      # target: Class label

# Fit on the training folds and score on the test fold, once per split
scores = [
    svc.fit(df1.iloc[train], y.iloc[train]).score(df1.iloc[test], y.iloc[test])
    for train, test in kf.split(df)
]
print(scores)

[Figure: Output scores]
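The same loop can also be written with scikit-learn's cross_val_score helper, which fits and scores the model once per split. The sketch below is only one possible variant and makes two assumptions that are not in the original code: it loads the bundled wine data with load_wine instead of reading wine.csv, and it sets shuffle=True, because the rows of that bundled dataset are grouped by class and unshuffled folds would each be dominated by a single class.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Assumption: bundled wine data, first two features (alcohol, malic acid).
X, y = load_wine(return_X_y=True)
X = X[:, :2]

svc = SVC(C=1, kernel='linear')

# shuffle=True mixes the class-ordered rows before the folds are cut
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(svc, X, y, cv=kf)   # one accuracy score per fold
print(scores)
print("mean accuracy: %.3f" % scores.mean())
```

Passing a different splitter as cv is all that is needed to switch to another cross validation technique.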

Repeated K fold cross validation

The problem with a single run of K-fold cross validation is that the result depends on how the data happened to be partitioned into folds. By default the folds are cut in the order of the rows, so there is no randomness in the splits and the score can be biased by that one particular partition. To overcome this problem, we can use repeated K-fold cross validation.

In repeated K-fold cross validation, the K-fold cross validation technique is repeated n times. So K folds repeated n times give us K*n train/test splits on which to train and test the model. The data is shuffled before each repetition, so the splits are random and each repetition contains different splits than the one before.

Repeated K fold cross validation in Python

from sklearn.model_selection import RepeatedKFold

random_state = 12883823  # fixed seed so the shuffled splits are reproducible
# 2 folds repeated 2 times gives 4 train/test splits
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
for train, test in rkf.split(df):
    print(train)
    print(test)
    print("\n")

Run the above code and you will notice that the data in each split is different from the previous split. Averaging the scores over these random splits makes the result less dependent on any single partition of the data.
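As a small check of the K*n count, the sketch below uses a hypothetical 10-sample array with K=5 and n=3, and also verifies that the test folds within each repetition cover every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

n_splits, n_repeats = 5, 3               # hypothetical K and n
X = np.arange(10).reshape(-1, 1)         # toy data, 10 samples

rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
splits = list(rkf.split(X))
print(len(splits))                       # K * n = 15 train/test splits
print(rkf.get_n_splits(X))               # the splitter reports the same number

# Splits are produced repetition by repetition, K at a time.
full_coverage = []
for r in range(n_repeats):
    repetition = splits[r * n_splits:(r + 1) * n_splits]
    tested = np.sort(np.concatenate([test for _, test in repetition]))
    full_coverage.append(np.array_equal(tested, np.arange(10)))
print(full_coverage)
```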

Stratified K fold cross validation

Sometimes we have a large imbalance in the classes of the target variable. Splitting an imbalanced dataset in the normal way does not guarantee that the ratio of classes will be preserved in each split.

[Figure: Imbalance in target variable classes]

From the above image, you can see that the dataset has 700 rows, but there is a huge difference in the number of rows per target class: there are 500 rows with class 1 and only 200 rows with class 0. If we create folds from this type of dataset, the classes can be imbalanced across the folds, like below.

[Figure: Imbalance of classes in folds]

To overcome this problem, we use the stratified K-fold cross validation technique. This method uses stratified sampling, which splits the dataset in such a way that each split contains all the classes in roughly the same ratio as in the full dataset. If we have class imbalance in our dataset, we can use this CV technique.

Stratified K fold cross validation in python

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)
# Class labels are 1, 2 and 3, so each printed array has 4 entries and index 0 is always zero
for train, test in skf.split(df1, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y.iloc[train], minlength=3),
        np.bincount(y.iloc[test], minlength=3)))

[Figure: Target variable class counts]

From the above image, you can see that there are 178 examples in total, of which 71 are of class 2, 59 are of class 1 and 48 are of class 3. Let's see the output of our stratified cross validation code.

[Figure: Stratified cross validation output]

From the output you can see that the ratio of classes is almost the same in each split. Of course, there are more class 2 samples than class 1 samples, and more class 1 samples than class 3 samples. But the proportions of these classes are almost equal across all the train and test splits.
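To see the difference between plain and stratified folds side by side, here is a sketch on synthetic labels that mirror the 700-row example (200 rows of class 0 and 500 rows of class 1). The assumption that the rows are sorted by class is mine, chosen because it makes the problem with plain K-fold easy to see.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: synthetic labels, sorted by class (200 x class 0, then 500 x class 1).
y = np.array([0] * 200 + [1] * 500)
X = np.zeros((len(y), 1))                # feature values are irrelevant here

print("Plain K-fold, test fold counts [class 0, class 1]:")
plain_counts = []
for _, test_idx in KFold(n_splits=5).split(X):
    counts = np.bincount(y[test_idx], minlength=2)
    plain_counts.append(counts)
    print(counts)

print("Stratified K-fold, test fold counts [class 0, class 1]:")
strat_counts = []
for _, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    counts = np.bincount(y[test_idx], minlength=2)
    strat_counts.append(counts)
    print(counts)
```

With plain K-fold the first test fold contains only class 0, while every stratified test fold keeps the 2:5 ratio of the full dataset.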

LOOCV(Leave One out Cross Validation)

LOOCV is the simplest cross validation technique. In this technique, we partition our data in such a way that, for a dataset of N examples, 1 example is used for testing and the other N-1 examples are used for training. This is repeated so that each example gets a chance to be the test example. By doing this, we end up with N models. We take the scores of all the models and average them to get the performance of the model.

from sklearn.model_selection import LeaveOneOut

X = range(10)  # toy dataset with 10 examples
loo = LeaveOneOut()
# Each split holds out exactly one example as the test set
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[Figure: Leave one out cross validation output]

You can see that, as our dataset X has 10 examples, there are 10 splits. In each split, the test set is different.
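The averaging step can be sketched with cross_val_score. This example assumes the bundled wine data and a k-nearest-neighbours classifier, both picked only to keep it quick to run. Because each test set has a single example, every individual score is either 0 or 1, and their mean is the LOOCV accuracy.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled wine data (178 examples), first two features only.
X, y = load_wine(return_X_y=True)
X = X[:, :2]

model = KNeighborsClassifier(n_neighbors=3)   # hypothetical model choice

# One model is fitted per example, so there are N = 178 scores.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(len(scores))
print("LOOCV accuracy: %.3f" % scores.mean())
```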

Problems with LOOCV

  • As we give every example a chance to be the test set, we end up with N models. Training N models is computationally intensive if N is very large.
  • The N training sets overlap almost completely, so the N models are nearly identical and their errors are highly correlated. As a result, the averaged score can have high variance.

LOOPV(Leave P Out Cross Validation)

LOOPV is very similar to LOOCV. In LOOPV, we use a value of P > 1, because for P=1 LOOPV is the same as LOOCV. LOOPV creates all possible combinations of train/test sets by taking P examples for the test set.

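import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)  # toy dataset with 4 examples
lpo = LeavePOut(p=2)
# Each split holds out a different pair of examples as the test set
for train, test in lpo.split(X):
    print("%s %s" % (train, test))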

[Figure: Leave P out cross validation output]

As our P was 2, it created all the possible combinations of sets with 2 examples in the test set.
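The number of splits equals the number of ways to choose P examples out of N. The sketch below checks this for the toy case above (N=4, P=2) by comparing the test sets against itertools.combinations, and then prints the count for N=100 to show how quickly it grows.

```python
from itertools import combinations
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
p = 2
lpo = LeavePOut(p=p)

# Collect the test indices of every split as plain tuples
test_sets = [tuple(int(i) for i in test) for _, test in lpo.split(X)]
expected = list(combinations(range(len(X)), p))

n_splits = lpo.get_n_splits(X)
print(n_splits, comb(len(X), p))          # both are 6 for N=4, P=2
print(set(test_sets) == set(expected))    # every pair is held out exactly once

# The count grows quickly with N: 100 examples already give 4950 pairs
print(comb(100, 2))
```

This growth is why leave-P-out is usually practical only for small datasets.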

That's all for this tutorial. If you have any doubts or suggestions, please comment below.

Thank You.

Amarjeet

About Amarjeet

Amarjeet, BE in CS, loves to code in Python and is passionate about Machine Learning and Data Science. Expertsteaching.com is just a medium to share what I have learned so far with the world.