Train, Test, and Validation Sets

In this tutorial, you will learn why we need train, test, and validation sets, how they are used in model training and evaluation, and how to split a dataset into these sets using Python.

Training Set

This is the part of the dataset used to train the model. Typically, the training set contains about 60-70% of the total dataset. The model is first trained on the training set so that it learns the parameters and underlying concepts of the data.

Validation Set

This is the part of the dataset used to validate the performance of the trained model. Typically, the validation set contains about 15-30% of the total dataset. After training, the model is evaluated on the validation set. We provide only the input (independent) feature columns to the model, and based on its training, the model predicts the output. This prediction is matched against the original output (dependent) column of the validation set, and the model is scored on how well the two agree. If the model's accuracy is low on the validation set, we adjust the model's parameters, train it again, and repeat these steps until we get a satisfying accuracy score.
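
To make this train-adjust-retrain loop concrete, here is a minimal sketch. The toy data, candidate values of C, and variable names are placeholders of mine, not from this walkthrough; the article builds the real version on the wine dataset below.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_model, best_score = None, 0.0
for c in [0.5, 1, 2, 5]:   # hypothetical candidate values of the hyperparameter C
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))   # evaluate on the validation set only
    if score > best_score:
        best_model, best_score = model, score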

Test Set

This is the part of the dataset used to test the model's performance. The test set shows how well the model will generalize to unseen data in the real world. The size of the test set is about 15-30% of the total dataset.

The input (independent) feature columns from the test set are given to the model, and based on its training, the model predicts the output. This prediction is matched against the original output (dependent) column of the test set, and the model is evaluated on how well the two agree.

Now, one thing you may have noticed is that we are doing almost the same thing with both sets (test set and validation set). So you may be wondering: why do we need two different sets to evaluate the model? Can't we merge the test and validation sets and perform model evaluation on that one set? I will answer this question in the next section.

Why do we need a validation set?

We could perform both testing and model validation using the test set alone: train the model on the training set and test it on the test set. If accuracy is low, change the model's parameters, train again, and test on the test set again, repeating until we get a satisfactory test score.

But the problem here is that we are changing our model's parameters after seeing low scores on the test set. Although we are not using this set for training, changing model parameters after seeing low test scores is a kind of cheating; it is like showing the test set to the model while training it. If you keep changing your model's parameters after each low test score, at some point you will get good accuracy, but this model will not perform well in the real world. That is because you have now tuned your model in such a way that it gives high accuracy specifically on the test set, since you adjusted its parameters according to the test set.

So we need a set that is isolated from both the training and testing phases, on which we can perform model validation before using the model in the real world. That is where the validation set comes into the picture.

We partition our dataset into three sets: one part is the training set, another is the test set, and the last is the validation set. Training and validation are done on the train and validation sets respectively. When we are very sure that the model is ready to be used in the real world, a final test is made using the test set. This gives us a clear understanding of how well the model will generalize to new examples it has never seen.

How to decide the size of these 3 sets?

There is no fixed rule about choosing the sizes of these sets, but 70-15-15 or 60-20-20 are commonly recommended splits for the train, test, and validation sets respectively.
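
As a rough sketch of how such a split can be made, here is a 60-20-20 partition using two calls to scikit-learn's train_test_split on toy data. Note that the second call uses 0.25, because 20% of the whole is 25% of the remaining 80%.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)   # toy data for illustration

# First split off 20% as the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# ... then split the remaining 80% into 60% train / 20% validation.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)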

What are the drawbacks of splitting the dataset into different sets?

By splitting the dataset into different sets, we reduce the amount of data available for training the model. We need a large training set to train the model; otherwise, with too little data, the model will underfit and suffer from high bias errors.

To mitigate this, we can use various cross-validation methods like K-Fold Cross Validation, Repeated K-Fold Cross Validation, LOOCV (Leave One Out Cross Validation), Stratified Cross Validation, etc.
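
As a small preview of what that looks like, here is a sketch of 5-fold cross validation using scikit-learn's built-in copy of the same wine dataset; each fold takes a turn as the validation set, and max_iter is raised only so the solver converges.

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores.mean())   # average accuracy across the 5 folds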

Train, Test, and Validation Set Partitioning with Python and scikit-learn

import pandas as pd

# Load the wine dataset, keeping only the first three columns.
df = pd.read_csv('wine.csv',
                 header=None,
                 usecols=[0, 1, 2],
                 names=['Class label', 'Alcohol', 'Malic acid'])

print(df.head())
print(df.shape)
(The output shows the first five rows of the dataset, all with class label 1, and its shape, (178, 3).)

As you can see from the output above, our dataset has 178 rows (training examples) and 3 columns: Class label, Alcohol, and Malic acid. Alcohol and Malic acid are the independent (input) columns and Class label is the dependent (output) column.

Now we need to split this dataset into train, test, and validation sets. First, we will separate out the validation set from our dataset. We need to choose the fraction of the dataset randomly to avoid any bias: as you can see from the output above, the top 5 rows of the dataset all have class label 1.

df_val["Class label"].unique()

####OUTPUT###
array([3, 2, 1], dtype=int64)

The class label actually contains the values 1, 2, and 3, as you can verify from the output above. We want our train, test, and validation sets to contain roughly equal proportions of all the class instances, to avoid high bias and class-imbalance problems. This is done by randomly choosing instances from the dataset and assigning them to the sets.
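
If you want to enforce this explicitly, scikit-learn's train_test_split has a stratify argument that keeps the class proportions roughly equal in both splits. This is an alternative to the plain random .sample() used below, shown here only as a sketch:

from sklearn.model_selection import train_test_split

# Stratify on the class label so both splits keep the same 1/2/3 proportions.
df_rest, df_val = train_test_split(df, test_size=0.20,
                                   stratify=df['Class label'], random_state=42)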

# Randomly select 20% of the rows as the validation set.
# (Add random_state=... to .sample() if you want a reproducible split.)
df_val = df.sample(frac=0.20)

# Drop the sampled rows from df so they cannot leak into the train/test sets.
df.drop(df_val.index, inplace=True)
print(df_val.shape)
print(df.shape)




df_val is our validation set; we are separating it from what will become the train and test splits. .sample() is a pandas function that randomly selects rows from our original DataFrame df, and frac=.20 means we want to randomly select 20% of the original DataFrame and assign it to df_val. After assigning those rows to df_val, we remove them from the original df using df.drop(df_val.index, inplace=True), because we don't want our validation examples in the training or test sets. Let's check the shapes of df and df_val after these operations.

####OUTPUT###
(36, 3)
(142, 3)

Now we can split df into train and test sets. First we will separate the dependent (output) and independent (input) features into y and X respectively, and then split them into train and test sets.

from sklearn.model_selection import train_test_split

X_wine = df.values[:, 1:]   # input features: Alcohol, Malic acid
y_wine = df.values[:, 0]    # output: Class label

X_train, X_test, y_train, y_test = train_test_split(
    X_wine, y_wine, test_size=0.30, random_state=12345)

We are using the train_test_split function from sklearn to split the dataset into train and test sets. train_test_split returns 4 values: the train and test splits of the input features and of the output column. test_size=0.30 specifies that we want the test set to be 30% of the data passed in, and random_state makes the split reproducible.
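
As a quick sanity check, the split sizes should add up to the 142 rows left after removing the validation set, which works out to 99 training and 43 test rows:

print(X_train.shape, X_test.shape)   # (99, 2) and (43, 2)
print(y_train.shape, y_test.shape)   # (99,) and (43,)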

Now we can pass X_train and y_train to a model for training, and after training we can pass X_test to the model to predict the output. The predictions are evaluated against the original test outcomes (y_test) to measure the model's performance.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

lr = LogisticRegression()
fit = lr.fit(X_train, y_train)
pred_train = lr.predict(X_train)

print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))


pred_val = lr.predict(df_val.iloc[:, 1:])   # predict on the validation inputs

print('\nPrediction accuracy for the validation dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(df_val.iloc[:, 0], pred_val)))


Prediction accuracy for the training dataset
72.73%
Prediction accuracy for the validation dataset
66.67%

Here I am not passing any parameters or hyperparameters to LogisticRegression. Notice that training accuracy was about 72%, but on the validation set accuracy is only 66.67%.

Let's create a new logistic regression model with the parameter C, the inverse of the regularization strength, set to 2 (by default its value is 1) and see what happens to our train and validation accuracy.

lr1 = LogisticRegression(C=2)   # weaker regularization than the default C=1
fit = lr1.fit(X_train, y_train)
pred_train = lr1.predict(X_train)

print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))

pred_val = lr1.predict(df_val.iloc[:, 1:])

print('\nPrediction accuracy for the validation dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(df_val.iloc[:, 0], pred_val)))

Prediction accuracy for the training dataset
77.78%
Prediction accuracy for the validation dataset
75.00%

Now, as our training and validation scores are close, we can expect this model to perform better in the real world, and we can check its accuracy on the test set.

pred_test = lr1.predict(X_test)

print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))

Prediction accuracy for the test dataset
74.42%

You can see that our test accuracy is also close to the validation accuracy, so this model can be called a good model: it should generalize well to unseen data.

One thing you should know is that training, testing, and validation accuracy can also be low simply because there is too little data in the training set, and in that case changing the model's parameters will not help much. We had 178 instances, of which 36 were assigned to the validation set, and 30% of the remaining dataset was assigned to the test set. So for training we had only 99 instances, which is very little for a model to learn the concepts in a dataset. So if the dataset is small, we need to add more training examples for good training.

These issues can be addressed using cross-validation techniques, which I will cover in a different post.

I hope I have clearly explained all the concepts you need to know about train, test, and validation sets. If you have any doubts or suggestions, you can comment down below.

Thank You.

Amarjeet
