Selecting Metrics for Machine Learning

A Fayrix Machine Learning expert shares the performance metrics commonly used in Data Science to assess Machine Learning models

Key steps to selecting evaluation metrics

First of all, the metrics we optimise while tuning a model and the metrics used to evaluate its performance are typically not the same. Below, we discuss metrics used to optimise Machine Learning models. For performance evaluation, the initial business metrics can be used.
Understanding the task
Based on prerequisites, we need to understand what kind of problems we are trying to solve. Here is a list of some common problems in machine learning:
  • Classification. The algorithm predicts which of a set of predefined classes an item belongs to. For example, it may respond with yes/no/not sure.
  • Regression. The algorithm predicts a numeric value. For example, a weather forecast for tomorrow.
  • Ranking. The model predicts an order of items. For example, we have a student group and need to rank all the students by height, from tallest to shortest.
In our case, the task is to find mathematical metrics whose optimisation also improves the initial business metric. Below we list basic metrics to start with.
CLASSIFICATION performance metrics

Confusion Matrix

This matrix is used to evaluate the accuracy of a classifier and is presented in the table below. Each row corresponds to the actual class and each column to the predicted class.

  Actual \ Predicted | Positive            | Negative
  Positive           | True Positive (TP)  | False Negative (FN)
  Negative           | False Positive (FP) | True Negative (TN)
Some examples
False Positive (FP): an anti-spam engine moves a trusted email to junk.
False Negative (FN): a medical screening incorrectly shows disease absence when the disease is actually present.
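The four cells of the matrix can be counted directly from the true and predicted labels. A minimal sketch in plain Python, using made-up illustration data:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
# tp=2, fp=1, fn=1, tn=2
```

In practice a library helper such as scikit-learn's `confusion_matrix` does the same counting for any number of classes.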

Accuracy Metric

This is the most basic metric. It indicates the share of correctly classified items among the total number of items.
Keep in mind that accuracy has limitations: it works poorly with imbalanced classes, where one class has many items and the other classes have few.
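The imbalance problem is easy to demonstrate with a toy dataset (the numbers below are made up for illustration): a model that always predicts the majority class still gets a high accuracy score while missing every positive case.

```python
def accuracy(y_true, y_pred):
    """Share of items where the predicted label matches the actual one."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 5 positives, 95 negatives; the "model" always predicts negative.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(accuracy(y_true, y_pred))  # 0.95 — looks good, yet every positive is missed
```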

Recall/Sensitivity Metric

The Recall metric shows the share of actual positive values the model has correctly classified as positive: Recall = TP / (TP + FN).

Precision Metric

This metric represents the share of positively predicted values that are really positive: Precision = TP / (TP + FP).

F1 score

This metric combines precision and recall and serves as a compromise between them: F1 = 2 · Precision · Recall / (Precision + Recall). The best F1 score equals 1, while the worst one is 0.
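Precision, recall and F1 all derive from the same confusion-matrix counts, so they are natural to compute together. A minimal sketch with made-up labels:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
# tp=2, fp=1, fn=1  →  precision = 2/3, recall = 2/3, f1 = 2/3
```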

REGRESSION performance metrics

Mean Absolute Error (MAE)

This regression metric indicates the average absolute difference between the actual and predicted values.
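MAE is straightforward to compute; the values below are made-up illustration data:

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted| over all points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(mae([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # (0.5 + 0 + 2) / 3 ≈ 0.833
```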

Mean Square Error (MSE)

Mean Squared Error (MSE) calculates the average squared difference between the actual and predicted values over all data points. Because every difference is raised to the second power, negative errors are not cancelled out by positive ones. Moreover, squaring amplifies the impact of large errors: if the errors in our initial calculations are 1/2/3, their squared contributions are 1/4/9 respectively. The lower the MSE, the more accurate our predictions are. MSE = 0 is the optimal point, at which our forecast is perfectly accurate.

MSE has some advantages over MAE:
1. MSE penalises large errors more heavily than small ones.
2. MSE is differentiable, which helps find its minimum using mathematical methods (e.g. gradient-based optimisation) more effectively.
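The squaring effect is visible in a toy example (made-up numbers): an error of 3 contributes 9 to the sum, while an error of 1 contributes only 1, so large errors dominate the total.

```python
def mse(actual, predicted):
    """Mean Squared Error: average of (actual - predicted)^2 over all points."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Errors of 1, 2 and 3 contribute 1, 4 and 9 respectively:
print(mse([0, 0, 0], [1, 2, 3]))  # (1 + 4 + 9) / 3 ≈ 4.667
```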

Root Mean Square Error (RMSE)

RMSE is the square root of MSE. It is easier to interpret than MSE because it is expressed in the same units as the target value, and it uses smaller absolute values, which is helpful for computer calculations.
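Taking the square root brings the error back to the scale of the original values (same made-up data as above):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

print(rmse([0, 0, 0], [1, 2, 3]))  # sqrt(14/3) ≈ 2.16, vs MSE ≈ 4.67
```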
RANKING performance metrics

Basic Metric

Best Predicted vs Human, BPH:
The most relevant items are taken from an algorithm-generated ranking and compared to a human-generated ranking. The result is a binary vector that shows where the algorithm's estimates agree with the human's.
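The source does not give a formal definition of BPH, but one plausible reading, position-by-position agreement between the two rankings, can be sketched as follows (item names are made up):

```python
def agreement_vector(algo_ranking, human_ranking):
    """Binary vector: 1 where the algorithm and the human placed the same
    item at the same position, 0 otherwise. This is one possible reading
    of the BPH comparison, not a standard definition."""
    return [1 if a == h else 0 for a, h in zip(algo_ranking, human_ranking)]

print(agreement_vector(["b", "a", "c"], ["b", "c", "a"]))  # [1, 0, 0]
```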

Kendall's tau coefficient

Kendall's tau coefficient shows the correlation between two lists of ranked items, based on the number of concordant and discordant pairs. For each item we have two ranks (the machine's and the human's prediction). First, the ranked items are turned into pairwise comparisons. A pair is concordant if the algorithm's ordering of the two items agrees with the human's; otherwise it is discordant. The coefficient is therefore defined as follows:

τ = (C − D) / (n(n − 1) / 2),

where C and D are the numbers of concordant and discordant pairs, and n is the number of ranked items.

The value of τ varies from −1 to 1. The closer |τ| is to 1, the better the ranking. For instance, when the τ-value is close to −1, the ranking is just as accurate, but the order of its items should be reversed. This is quite consistent with estimate indicators that assign the highest rank to the best values, whereas during manual human ranking the best items receive the lowest ranks. A τ-value of 0 indicates the lack of any correlation between the ranks.
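The pairwise counting above can be sketched directly from the definition; the rank lists below are made-up illustration data (in practice a library routine such as `scipy.stats.kendalltau` would typically be used):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau: (C - D) / (n(n-1)/2), where C and D count concordant
    and discordant pairs. rank_a[i] and rank_b[i] are the ranks the two
    judges (algorithm and human) assign to item i."""
    n = len(rank_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            concordant += 1   # both judges order the pair the same way
        elif s < 0:
            discordant += 1   # the judges disagree on this pair

    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  — identical rankings
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 — fully reversed
```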