Evaluation Metrics for Classification Models

Evaluating the performance of a machine learning model is a crucial part of the model-building process. Once you have finished building a model, you need to analyse its performance to check whether it will do well in the real world.

In this post, you will learn how to evaluate the performance of classification algorithms such as logistic regression, SVM, KNN, and the Naive Bayes classifier. You will learn about metrics such as accuracy, precision, recall (sensitivity), specificity, and F1-score, all of which are built on the confusion matrix.

Confusion Matrix

A confusion matrix is not itself a metric for evaluating a model, but all the classification metrics can be understood and calculated from it. It is a square matrix in which the rows are the actual values and the columns are the predicted values.

[Figure: Confusion Matrix]

This is called a confusion matrix. Every value in it has a meaning, and it is very important to know what each one means. But before that, let's set up the context in which we will discuss them. We will use a cancer data set as an example, where we need to predict whether a person has cancer or not.

Let's assign 1 -> person has cancer (Positive) and 0 -> person does not have cancer (Negative).

TP (True Positive)

True Positives are the cases where the model predicted an example as Positive (1) when the actual value was also Positive (1).

For example, if our model predicts Positive (1) for a person who has cancer (1), it is assigning that person to the correct class. Cases where Positive values are classified as Positive are called True Positives.

TN (True Negative)

True Negatives are the cases where the model predicted an example as Negative (0) when the actual value was also Negative (0).

For example, if our model predicts Negative (0) for a person who does not have cancer (0), it is assigning that person to the correct class. Cases where Negative values are classified as Negative are called True Negatives.

FP (False Positive)

False Positives are the cases where the model predicted an instance as Positive (1) when it was actually Negative (0). The model is falsely classifying a Negative value as Positive, hence the name False Positive.

For example, if our model predicts that a person has cancer (1) but in reality the person does not have cancer (0), that case is a False Positive.

FN (False Negative)

False Negatives are the cases where the model predicted an instance as Negative (0) when it was actually Positive (1). The model is falsely classifying a Positive value as Negative, hence the name False Negative.

For example, if our model predicts that a person does not have cancer (0) but in reality the person has cancer (1), that case is a False Negative.
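To make the four definitions concrete, here is a small sketch in plain Python that counts them from two lists of labels. The function name `confusion_counts` and the eight labels are made up for illustration; they are not from a real data set.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is Positive."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # has cancer, predicted cancer
        elif actual == 0 and predicted == 0:
            tn += 1  # no cancer, predicted no cancer
        elif actual == 0 and predicted == 1:
            fp += 1  # no cancer, but predicted cancer
        else:
            fn += 1  # has cancer, but predicted no cancer
    return tp, tn, fp, fn


# Hypothetical labels for eight patients (1 = cancer, 0 = no cancer).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 2 4 1 1
```

Here two patients with cancer are caught (TP), one is missed (FN), one healthy patient is flagged by mistake (FP), and four are correctly cleared (TN).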

In an ideal situation, we want our model to have 0 False Positives and 0 False Negatives, as this gives 100% accuracy, but in practice we will rarely have a case like this. If a model does give 100% accuracy, treat it as a warning sign: most often it means the model is overfitting.

Why do these values matter and how do we use them?

In practice, we will not get a model with 100% accuracy because there will always be some error, but we can work on these values to reach a high accuracy.

To increase the accuracy of a model, we need to minimise False Positives and False Negatives. Which of the two to prioritise depends entirely on the business case and the problem we are trying to solve.

1. Minimising False Negatives

Suppose, in our cancer classification problem, a person who has cancer (1) is falsely classified as not having cancer (0). This is a critical case, because it may lead to the death of the person. If a person has cancer, they should be classified as Positive so that doctors can start treatment before it is too late.

In this case, we need to minimise the number of False Negatives.

2. Minimising False Positives

Now suppose a person who does not have cancer (0) is falsely classified as having cancer (1). This is less critical, because when the person is sent for treatment, doctors can run further checks and find that they do not have cancer. Still, we want to minimise these cases so that patients do not panic or waste money on check-ups.

In this case, we need to minimise False Positives.


Accuracy

Accuracy is the simplest metric we can use to check the performance of a model. It is defined as the number of correct predictions divided by the total number of predictions.

From the confusion matrix, we can calculate accuracy as

Accuracy = (TP + TN) / (TP + TN + FP + FN)


Notice that in the numerator we sum TP and TN, because these are the cases where the model's prediction matches the actual output. To see whether accuracy is high, look at the diagonal of the confusion matrix: the diagonal values should be high and the off-diagonal values should be low.
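As a quick sketch, accuracy can be computed from the four counts like this. The function name `accuracy` and the counts passed to it are hypothetical, chosen only to illustrate the ratio of correct predictions to all predictions.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions (TP + TN) divided by all predictions."""
    total = tp + tn + fp + fn
    # Guard against an empty confusion matrix.
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 6 of 8 predictions lie on the diagonal.
print(accuracy(tp=2, tn=4, fp=1, fn=1))  # 0.75
```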

When to use and when not to use accuracy?

Accuracy is the simplest metric to calculate, but we should rely on it only when our data set is approximately balanced. A balanced data set is one in which every class has approximately the same number of examples. Let me explain with an example.

In our cancer data set, there are two classes: has cancer (1) and does not have cancer (0). Suppose there are 1000 instances in total, of which only 10 belong to class 1 (has cancer) and all the others belong to class 0 (does not have cancer). That is, class 0 is the majority and makes up 99% of the data set.

This is called an imbalanced data set. In this case, we cannot use accuracy to check the performance of a model, because the majority class will dominate the other class. Most of the time, the model will predict 0 because it has learned very little about class 1 from so few instances.

Now suppose I have built a very bad model: it predicts class 0 (does not have cancer) for every instance, including every class 1 (has cancer) instance. Since there are only 10 instances of class 1, only 10 predictions will be wrong. Accuracy will be 99%, even though the model gets every class 1 instance wrong.

Can you really say the model is 99% accurate? Obviously not, because it classifies every class 1 example wrong.

So, if the data set is not balanced, we cannot rely on accuracy to analyse the performance of a classification model.
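The 99% figure is easy to reproduce. The sketch below builds the 1000-patient example as two lists (the variable names are mine) and scores a model that always predicts class 0.

```python
# 10 patients with cancer (1) and 990 without (0), as in the example.
y_true = [1] * 10 + [0] * 990

# A "model" that predicts class 0 for everyone.
y_pred = [0] * 1000

# Count predictions that match the actual label.
correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
acc = correct / len(y_true)

# How many of the cancer patients did the model actually find?
found = sum(1 for actual, predicted in zip(y_true, y_pred)
            if actual == 1 and predicted == 1)

print(acc)    # 0.99
print(found)  # 0
```

The accuracy looks excellent, yet the model does not find a single cancer patient.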


Precision

Precision is the proportion of examples classified as Positive that were really Positive. From the confusion matrix, we can calculate precision as

Precision = TP / (TP + FP)


In our cancer data set example, only 10 of the 1000 people have cancer (class 1). Now suppose the model is bad and predicts that all 990 people who do not have cancer have cancer. In this case, TP = 10 (because the model predicts everyone who has cancer as having cancer) and FP = 990 (because the model falsely predicts everyone without cancer as having cancer). If you plug in the values, you get

Precision = TP / (TP + FP) = 10 / (10 + 990) = 0.01

Precision = 0.01, or 1%. Precision conveys the same thing we saw in the data set: of all the people the model classified as having cancer, only 1% actually have it. 1% of 1000 people is 10, and we know that only 10 people had cancer while the rest were falsely classified as having cancer.
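The same calculation as a short Python sketch. The function name `precision` is my own; the counts are the ones from the example above, where the model flags all 1000 people as having cancer.

```python
def precision(tp, fp):
    """Share of Positive predictions that were really Positive."""
    predicted_positive = tp + fp
    # If the model never predicts Positive, precision is undefined;
    # returning 0.0 here is a convention assumed for this sketch.
    return tp / predicted_positive if predicted_positive else 0.0


print(precision(tp=10, fp=990))  # 0.01
```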


Recall

Recall is the proportion of patients who actually had cancer that the model correctly classified as having cancer. The people who actually have cancer are TP + FN. (Note: we include FN because we want all the patients who had cancer, and FN counts the patients who were falsely predicted as not having cancer. We need to include them to get the total number of patients who actually had cancer.)

Recall = TP / (TP + FN)


In our cancer data set, 10 of the 1000 people actually had cancer. Let's say the model predicts every case as cancer. Every Positive (has cancer) case is then correctly classified as Positive, so TP will be 10 and TP + FN = 10, because there is no case where the model falsely classifies a Positive case as Negative (FN = 0).

Recall = TP / (TP + FN) = 10 / (10 + 0) = 1.00

The calculation above gives 1.00, or 100%, because every positive case, where the patient actually had cancer, is correctly classified as Positive.
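Here is a minimal Python sketch of recall, assuming 10 patients with cancer in the data set. The function name `recall` is hypothetical. It also shows the opposite extreme, a model that predicts nobody has cancer.

```python
def recall(tp, fn):
    """Share of actual Positives that the model found."""
    actual_positive = tp + fn
    # With no actual Positives, recall is undefined; 0.0 is a
    # convention assumed for this sketch.
    return tp / actual_positive if actual_positive else 0.0


# Model that predicts cancer for everyone: no cancer patient is missed.
print(recall(tp=10, fn=0))  # 1.0

# Model that predicts no cancer for everyone: all 10 are missed.
print(recall(tp=0, fn=10))  # 0.0
```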

What is the difference between precision and recall, and when should you use which?

You can see from the definitions that precision looks at True Positives out of all examples classified as Positive, while recall looks at True Positives out of all examples that are actually Positive.

So, if you want to minimise False Negatives, you should aim for recall close to 100%, and if you want to minimise False Positives, you should aim for precision close to 100%.


Specificity

Specificity is the proportion of patients who did not have cancer that the model correctly classified as not having cancer (Negative). The people who do not have cancer are TN + FP. (Note: we include FP because we want all the patients who do not have cancer, and FP counts the patients who were falsely predicted as having cancer. We need to include them to get the total number of patients who do not have cancer.) From the definition, you may have noticed that specificity is the counterpart of recall: recall focuses on the Positive class, while specificity focuses on the Negative class.

Specificity = TN / (TN + FP)


Again consider the cancer data set, where only 10 of the 1000 patients have cancer. Let's say our model predicts every case as not having cancer (0). Then TN will be 990, because 990 patients do not have cancer and are classified correctly by the model. FP will be 0.

Specificity = TN / (TN + FP) = 990 / (990 + 0) = 1.00

Computing the equation above gives 1.00, or 100%: every patient who did not have cancer was correctly classified by the model as Negative.
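A matching sketch for specificity, using the 990 patients without cancer from the example. The function name `specificity` is my own choice.

```python
def specificity(tn, fp):
    """Share of actual Negatives that the model classified as Negative."""
    actual_negative = tn + fp
    # With no actual Negatives, specificity is undefined; 0.0 is a
    # convention assumed for this sketch.
    return tn / actual_negative if actual_negative else 0.0


# Model that predicts "no cancer" for everyone: all 990 healthy
# patients are classified correctly.
print(specificity(tn=990, fp=0))  # 1.0

# Model that predicts "cancer" for everyone: every healthy patient
# becomes a False Positive.
print(specificity(tn=0, fp=990))  # 0.0
```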


F1-score

From the metrics above, you can see that both precision and recall focus on the Positive class. So, is there a single metric that can represent both?

One option is to take the arithmetic mean, (Precision + Recall) / 2. But this is misleading when the two values are far apart, which is common when the data set is not balanced: a model with very low precision and very high recall still gets a middling score.

So, instead of the arithmetic mean, we compute the harmonic mean. The harmonic mean is close to the arithmetic mean when precision and recall are similar. When they differ, it gives more weight to the smaller value, so one large value cannot hide a small one.

We can do this using the F1-score, which is given by

F1 = 2 * (Precision * Recall) / (Precision + Recall)


So, if either precision or recall is very small, the F1-score will be very low. The F1-score ranges from 0 to 1, where 0 is the worst score and 1 is the best. We want the F1-score to be close to 1, and this happens only when both precision and recall are high.
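To see the difference between the two means, take the model from the earlier examples that predicts cancer for everyone: its precision is 0.01 and its recall is 1.0. The sketch below compares the arithmetic mean with the F1-score; the function names are hypothetical.

```python
def arithmetic_mean(precision, recall):
    """Plain average of precision and recall."""
    return (precision + recall) / 2


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Both are zero; 0.0 is a convention assumed for this sketch.
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 0.01, 1.0
print(round(arithmetic_mean(p, r), 4))  # about 0.505
print(round(f1_score(p, r), 4))         # about 0.0198
```

The arithmetic mean makes this model look average, while the F1-score stays close to the much smaller precision value.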

Accuracy vs F1-score

Accuracy should be used if the data set is balanced, that is, if it has approximately the same number of examples of each class.

Accuracy can be used where TP and TN are of equal importance and we want to maximise both.

F1-score should be used when the data set is not balanced, and when both FP and FN are crucial.

These were some of the metrics we use to evaluate classification models. In the next post, I will write about some more advanced methods for evaluating classification models, such as the ROC-AUC curve, log-loss, and the F-beta score.

Thank You.


About Amarjeet

Amarjeet, BE in CS, loves to code in Python and is passionate about Machine Learning and Data Science. Expertsteaching.com is just a medium to share what I have learned so far with the world.