
In this tutorial, we will learn about the performance metrics of a classification model: accuracy, precision, recall, and F-beta score.

Explain Accuracy, Precision, Recall, and F-beta Score

A confusion matrix provides a wealth of information. It helps us understand how effectively a classification model is working through metrics calculated from it, such as accuracy, precision, recall, and F-beta score. However, one of the most common questions among aspiring data scientists is when each of these measures should be used. This tutorial answers that question. Let's take a look at each of these metrics and see how they're used.


1) Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

Accuracy is one of the most widely used performance metrics. It is the ratio of correctly predicted observations to all observations. However, deeming a model the best based solely on accuracy is incorrect. Accuracy is a relevant measure when the dataset is symmetric and the number of FPs is almost the same as the number of FNs. In the case of asymmetric datasets, however, we need to resort to other performance metrics because we are also concerned about the number of wrongly classified positive and negative predictions. For example, in the case of Covid-19 classification, what if we wrongly classify a person as negative, but the person goes on to fall ill and their condition becomes severe? They might even end up spreading the virus. This is precisely why we need to break the accuracy formula down further.
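The accuracy formula above can be sketched directly from the four confusion-matrix counts. The counts below are made up purely for illustration:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 50, 40, 5, 5

# Accuracy = correctly predicted observations / all observations.
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.9
```

Note that the same 0.9 accuracy could hide a very different error profile on an imbalanced dataset, which is why the metrics below matter.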

Let us go through Type I and Type II errors before understanding Precision, Recall, and F-beta score.

Type I error – False Positive, i.e., the case where we reject a null hypothesis that is actually true
Type II error – False Negative, i.e., the case where we fail to reject a null hypothesis that is actually false

With this in mind let us move on to Precision.

2) Precision

Formula: TP/ (TP+FP) i.e. TP/Total predicted positive

Precision is defined as the proportion of correctly identified positive cases among all predicted positive cases. It refers to how precisely your model is able to predict the actual positives. It focuses on the Type I error. We should use precision as a performance metric when having False Positives is more concerning. For example, consider email spam detection: if an email that is not spam is incorrectly classified as spam, the user might end up missing critical emails. In this case, it is more important for the model to be precise.
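A minimal sketch of the precision formula, using made-up counts for a spam filter (50 true spam emails caught, 5 legitimate emails wrongly flagged):

```python
# Hypothetical counts: TP = spam correctly flagged, FP = legitimate email flagged.
TP, FP = 50, 5

# Precision = TP / total predicted positive.
precision = TP / (TP + FP)
print(round(precision, 3))  # 0.909
```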

3) Recall

Formula: TP/ (TP+FN) i.e. TP / Total Actual positive

Recall is the proportion of correctly identified positive cases among all actual positive instances. By the same reasoning, when a False Negative has a higher cost, recall is the performance metric we use to choose our best model. For example, consider fraud detection: a bank may face severe consequences if an actual positive (a fraudulent transaction) is predicted as negative (non-fraudulent). Likewise, predicting an actually positive (Covid-19) person as negative is very dangerous. In these cases, we must focus on achieving a higher recall.
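The recall formula mirrors precision but divides by the actual positives instead of the predicted ones. Counts below are made up for a fraud-detection setting (50 frauds caught, 5 frauds missed):

```python
# Hypothetical counts: TP = frauds caught, FN = frauds missed.
TP, FN = 50, 5

# Recall = TP / total actual positive.
recall = TP / (TP + FN)
print(round(recall, 3))  # 0.909
```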

Precision-Recall Trade-off

The values of both precision and recall lie between 0 and 1. In our scenario, we wish to avoid overlooking true positive cases when classifying passengers as COVID positive or negative. It would be particularly problematic if a person is genuinely positive but our model fails to detect it, because there is a substantial risk of the virus spreading if such individuals are allowed to board the flight. So, even if there is a minuscule chance that a person has COVID, we cannot risk identifying them as negative. As a result, we set the threshold so that if the output probability is larger than 0.25, we designate them COVID positive. Recall is therefore higher, but precision is reduced.

Let us now consider the opposite scenario, where we must designate a person positive only when we are certain that the person is positive. We can achieve this by setting the probability threshold higher (e.g., 0.85). This means that a person is labeled positive only when their probability is greater than 0.85 and negative otherwise. For most classifiers, we notice a trade-off between recall and precision as we change the probability threshold. When comparing multiple models with varied precision-recall values, it is often more convenient to combine precision and recall into a single statistic. To measure performance, we need a statistic that takes both recall and precision into account.
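The trade-off can be sketched by re-thresholding the same predicted probabilities. The probabilities and labels below are invented for illustration; the helper `precision_recall` is a hypothetical name, not a library function:

```python
# Made-up predicted probabilities of being COVID positive, with true labels.
probs  = [0.10, 0.30, 0.60, 0.80, 0.95, 0.20, 0.55, 0.90]
labels = [0,    0,    1,    1,    1,    0,    1,    1]

def precision_recall(threshold):
    """Compute (precision, recall) when predicting positive above `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Low threshold: we catch every positive (high recall) but flag a negative too.
p, r = precision_recall(0.25)
print(round(p, 2), round(r, 2))  # 0.83 1.0

# High threshold: every flagged case is truly positive, but we miss some.
p, r = precision_recall(0.85)
print(round(p, 2), round(r, 2))  # 1.0 0.4
```

Lowering the threshold trades precision for recall; raising it trades recall for precision.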

4) F-beta Score

Formula: ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)

As previously stated, we require a statistic that considers both recall and precision, and the F-beta score fulfills this requirement. The F-beta score is the weighted harmonic mean of precision and recall. Its value lies between 0 and 1, where 1 is the best and 0 is the worst. The weight "beta" is chosen depending on the scenario. If precision is more important, beta is set to less than one. When beta is greater than one, recall is prioritized. If beta is set to 1, we get the F1 score, which is the plain harmonic mean of precision and recall and gives equal weight to both.

Beta = 1 is the default value. The formula becomes –
F1 score = (2 * Precision * Recall) / (Precision + Recall)

To prioritize precision, you can set a smaller beta value such as 0.5. The formula becomes –
F0.5 score = (1.25 * Precision * Recall) / (0.25 * Precision + Recall)

To prioritize recall, you can set a larger beta value such as 2. The formula becomes –
F2 score = (5 * Precision * Recall) / (4 * Precision + Recall)
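The three special cases above fall out of a single function. The precision and recall values below are made up; note how a precision-heavy model scores higher under F0.5 than under F2:

```python
def fbeta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)

p, r = 0.8, 0.5  # hypothetical precision and recall

print(round(fbeta(p, r, 1.0), 3))  # 0.615  (F1: equal weight)
print(round(fbeta(p, r, 0.5), 3))  # 0.714  (F0.5: favors precision)
print(round(fbeta(p, r, 2.0), 3))  # 0.541  (F2: favors recall)
```

Since precision (0.8) exceeds recall (0.5) here, F0.5 rewards this model more than F2 does, exactly as the weighting intends.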

