Accuracy, Precision, Recall, and F1
Is being accurate good enough?
A good reference with the formulas and how each one works: https://en.wikipedia.org/wiki/Precision_and_recall
Oftentimes when we are working on a data science project, the client will want to know how accurate the model we are building actually is. But is accuracy really the best way to judge your particular model? There are many different ways to judge the "goodness" of a model; accuracy is just the most well known. In this post we are going to take a look at accuracy, see where it excels, where it falls short, and how it can be adjusted to fit particular scenarios.
Accuracy
A model's accuracy is calculated by

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$, and $FN$ count the true positives, true negatives, false positives, and false negatives,
or, more simply stated, the percentage of correct classifications. For a practical example, let's look at medical testing. Let's say you want to judge how well a new cancer detection method performs. To judge the accuracy of our new method, we would add the number of times it detected cancer when there was cancer (true positives) and the number of times it didn't detect cancer when no cancer existed (true negatives). Then we divide that sum by the total number of trials.
If we have 100 people in our experiment, where 50 people have cancer and 50 don't, we could get results similar to the following.
| | Have cancer | No cancer | Total |
|---|---|---|---|
| Found cancer | 38 | 6 | 44 |
| Didn't find cancer | 12 | 44 | 56 |
| Total | 50 | 50 | 100 |
Using our formula, the accuracy is equal to

$$\text{accuracy} = \frac{38 + 44}{100} = 82\%$$
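To make this concrete, here is a minimal Python sketch of the same calculation; the function name `accuracy` and its keyword arguments are just illustrative, not from any particular library.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts from the first experiment's table.
print(accuracy(tp=38, tn=44, fp=6, fn=12))  # 0.82
```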
In this instance, accuracy is a pretty decent judge of how good your detection system is. In certain cases, though, accuracy can be really misleading, specifically when some of the possible outcomes are sampled much more heavily than others. So let's change our example a little bit. Let's take a new detection method and use it on 100 new people. In this instance, however, only 5 people actually have cancer, while 95 don't.
| | Have cancer | No cancer | Total |
|---|---|---|---|
| Found cancer | 0 | 0 | 0 |
| Didn't find cancer | 5 | 95 | 100 |
| Total | 5 | 95 | 100 |
Using our formula, the accuracy for our new experiment with our new sample is equal to

$$\text{accuracy} = \frac{0 + 95}{100} = 95\%$$
Now we are getting a much better accuracy score, but the test could be completely fake, because all it ever said was that nobody had cancer. So the next question is: how can we fix it? One approach is called balanced accuracy. In this approach we average the true positive rate and the true negative rate, which gives us the formula

$$\text{balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

For our second experiment this comes out to $\frac{1}{2}(0/5 + 95/95) = 50\%$, no better than a coin flip, which is exactly the warning we wanted.
Now if we compare this to the balanced accuracy of the first example,

$$\frac{1}{2}\left(\frac{38}{50} + \frac{44}{50}\right) = \frac{0.76 + 0.88}{2} = 82\%$$
we get the same value as the previously calculated accuracy. We should have expected this, because we already had a balanced experiment (50 with cancer, 50 without). There are other approaches that can be used for similar experiments, such as oversampling the smaller class, but I personally find the balanced accuracy measure the most intuitive, and it still gives excellent results.
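Here is a small sketch of balanced accuracy in code, run against both experiments; again, the function name is just for illustration.

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Average of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)  # how many sick people the test caught
    tnr = tn / (tn + fp)  # how many healthy people the test cleared
    return (tpr + tnr) / 2

print(balanced_accuracy(tp=38, tn=44, fp=6, fn=12))  # 0.82 -- first experiment
print(balanced_accuracy(tp=0, tn=95, fp=0, fn=5))    # 0.50 -- the "fake" test
```

If you have raw label arrays rather than counts, scikit-learn provides an equivalent `sklearn.metrics.balanced_accuracy_score`.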
There are other issues that can still affect an experiment, where accuracy, and even balanced accuracy, aren't good enough. This is often the case when you value the results on one side more than the other. For example, we may have a preliminary medical test where we want to make sure we identify every person that has some condition, even if that means we get some false positives (where the test says they have a condition that they don't have). In this case, we would want to lean on the sensitivity of the test, which is calculated as the true positives divided by the total number of people with the condition (true positives plus false negatives):

$$\text{sensitivity} = \frac{TP}{TP + FN}$$
On the other hand, we may be more worried about correctly identifying the healthy cases than the sick ones, such as when screening for a dangerous position like astronaut. Then we would want to focus on the specificity of the test, which is calculated as the true negatives divided by the total number of healthy people (true negatives plus false positives):

$$\text{specificity} = \frac{TN}{TN + FP}$$
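The two metrics are mirror images of each other, as a quick sketch shows (illustrative function names, using the counts from our first experiment):

```python
def sensitivity(tp, fn):
    """True positive rate: the fraction of sick people the test catches."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: the fraction of healthy people the test clears."""
    return tn / (tn + fp)

# First experiment: the test misses 12 of the 50 sick people ...
print(sensitivity(tp=38, fn=12))  # 0.76
# ... but wrongly flags only 6 of the 50 healthy people.
print(specificity(tn=44, fp=6))   # 0.88
```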
The last metric that we will cover is the $F_1$-score, which you use when you care more about how the test handles the positive cases than about the correctly rejected negatives (true negatives don't appear in it at all). The $F_1$-score is calculated as the harmonic mean of the precision (true positives divided by everything the test flagged as positive, $TP / (TP + FP)$) and the recall (another name for the sensitivity above). We get this value through the following equation:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
This metric is often used in applications such as search (think Google). It has also been generalized to the $F_\beta$-score, which lets you weight the score toward recall or precision depending on whether you care more about false negatives or false positives:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

A $\beta > 1$ weights recall (and therefore false negatives) more heavily, while a $\beta < 1$ weights precision (false positives) more heavily.
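As a final sketch, here is the generalized score in code; the function name `f_beta` is illustrative, and scikit-learn's `sklearn.metrics.fbeta_score` computes the same quantity from label arrays.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score computed from confusion-matrix counts.

    beta > 1 weights recall (false negatives) more heavily;
    beta < 1 weights precision (false positives) more heavily.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# First experiment: precision = 38/44, recall = 38/50.
print(f_beta(tp=38, fp=6, fn=12))            # F1 ~= 0.809
print(f_beta(tp=38, fp=6, fn=12, beta=2.0))  # F2 ~= 0.779, punishes misses more
```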